DOOMCore v1¶
[TODO finish the core and detail it ALL here]
This core is based on my previous RV32E CPU: tiny-riscv. However, by the end of this project I anticipate it will have undergone many changes.
That core is pretty useless on its own. It cannot interact with the rest of the system, as the instruction and data ports do not support AXI. Performance would also be terrible, as there are no caches: stalling every instruction to wait on DRAM would slow things down by a few orders of magnitude. Ideally, DOOMCore will maintain ~0.8 IPC while running DOOM.
Notes Spam¶
(I plan to clean this up once the core is validated)
tiny-riscv is kinda useless for a real design: it assumes it can hit in both imem and dmem every cycle, every time. DOOMCore, with its real DRAM and non-perfect caches, cannot operate under this assumption.
Furthermore, the two-stage design of tiny-riscv simply will not work on the Tang Nano if I want to operate at ~330MHz (2x the memory bus speed).
Do I need to run the core at 330MHz? Probably not, but it is more fun this way.
Instruction fetch / icache¶
[TODO: insert shitty diagram]
The instruction fetch unit keeps track of the current* PC and feeds it into the icache, which will either return an instruction on a hit, or stall until the cacheline containing the requested instruction is returned from DRAM.
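That hit-or-stall behavior can be sketched in a few lines. This is a behavioral sketch, not the RTL: the `icache_lookup` interface and all names here are made up for illustration.

```python
# Toy model of one fetch attempt: on a hit the instruction comes back
# and the PC advances; on a miss the PC is held while the fill completes.
# icache_lookup is an assumed interface returning (hit, instruction).

def fetch_cycle(pc, icache_lookup):
    hit, instr = icache_lookup(pc)
    if hit:
        return instr, (pc + 4) & 0xFFFFFFFF  # advance to the next instruction
    return None, pc                          # stall: hold the current PC
```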
The branch unit calculates branch targets and passes them to the fetch unit whenever they are encountered. In these cases, the currently fetching instruction is discarded (FE flush) and the fetch unit starts over with the new PC (the branch target).
If the icache is in the process of filling a cacheline for a badpath PC when a branch comes in, it will stall until that cache fill is complete before starting fetch on the good path PC (branch target).
The fetch unit is decoupled from the rest of the core with a skid buffer (why? with a registered ready signal, a stall reaches fetch one cycle late, and the skid buffer catches the instruction already in flight instead of dropping it). The fetch unit continues to produce valid instructions as fast as it can until the downstream skid buffer deasserts ready (stall). At this point the fetch unit holds the current PC, and any cache fetches that complete in the meantime are held until the downstream stages resume.
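A one-entry skid buffer can be modeled like this. This is a behavioral Python sketch under my own simplifications, not the RTL; note that a real design often uses a two-entry buffer to keep full throughput in the cycle after a stall drains.

```python
class SkidBuffer:
    """One-entry skid buffer: up_ready depends only on internal state,
    never on down_ready, which breaks the combinational backpressure path."""

    def __init__(self):
        self.slot = None  # holds one item while downstream is stalled

    def cycle(self, up_valid, up_data, down_ready):
        """One clock cycle: returns (up_ready, down_valid, down_data)."""
        # combinational outputs for this cycle
        up_ready = self.slot is None
        down_valid = (self.slot is not None) or up_valid
        down_data = self.slot if self.slot is not None else up_data
        # state update (what the clock edge would do)
        if self.slot is not None:
            if down_ready:
                self.slot = None          # buffered item drained downstream
        elif up_valid and not down_ready:
            self.slot = up_data           # capture the in-flight item on a stall
        return up_ready, down_valid, down_data
```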
* this is a pipelined design where 3+ PCs can be active at a time. Defining one of them as "current" is not useful.
TODO: full overview
Decode¶
Decode is dead simple. It doesn't get its own pipeline stage because that would be a huge waste and would complicate the control logic even more.
Execute / Branch¶
In the same stage as the decode unit live the Execute (ALU) and Branch units.
Since I'm not worried about minimizing logic area, the branch unit performs its own branch comparisons as well as computing the branch target.
If the current instruction seen by the BRU is a jump or a conditional branch that evaluates to true, it asserts branch, which causes the fetch unit to flush and start fetching from branch_target instead.
The ALU implements all of the other integer instructions (not loads or stores). Since DOOMCore targets an FPGA with lots of hard logic adders and LUTs, no exciting optimizations occur here. This is probably the most boring part of the design.
LSU¶
I haven't finished this yet
Caches¶
Since the i486 has an 8KB cache, I want to duplicate that for DOOMCore.
Both caches are direct mapped, with the dcache being write-through, write-around.
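For example, assuming 16-byte cachelines (the line size isn't pinned down above, so that's an illustrative choice), an 8KB direct-mapped cache splits an address into tag, index, and offset like this:

```python
# 8 KB / 16 B per line = 512 lines
# -> 4 offset bits, 9 index bits, and the rest of the address is the tag.
LINE_BYTES = 16
NUM_LINES = 8 * 1024 // LINE_BYTES  # 512

def split_addr(addr):
    """Split a 32-bit address into (tag, index, offset) for the cache."""
    offset = addr & (LINE_BYTES - 1)        # byte within the line
    index = (addr >> 4) & (NUM_LINES - 1)   # which line slot
    tag = addr >> 13                        # everything above index+offset
    return tag, index, offset
```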
icache¶
Direct-mapped is much easier to implement and should still compensate for the terrible performance I would get from fetching directly from the DRAM. 8KB should be more than enough to store any of the hot loops in DOOM. If I encounter a lot of icache thrashing, I can probably force the compiler to reorganize the emitted code to reduce/eliminate it.
dcache¶
If the target of a store instruction is not currently cached, the value is written out to the memory bus without fetching the line it belongs to.
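A toy model of that write-through, write-around store path, using a word-granularity cache (dict of index to (tag, word)) and a flat memory dict; the structure and names are illustrative, not the actual design.

```python
# 8 KB of cached words at 4 bytes each -> 2048 slots,
# so bits [12:2] index the cache and bits above 13 are the tag.
NUM_WORDS = 8 * 1024 // 4  # 2048

def handle_store(cache, memory, addr, value):
    """Write-through, write-around store: update the cache only on a hit,
    always forward the store to memory, never allocate on a miss."""
    index = (addr >> 2) & (NUM_WORDS - 1)
    tag = addr >> 13
    line = cache.get(index)
    if line is not None and line[0] == tag:
        cache[index] = (tag, value)  # hit: keep the cached copy coherent
    # write-around: the store always goes out to memory,
    # and a miss does NOT fetch or allocate a line
    memory[addr] = value
```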
Reworking the LSU¶
The old LSU assumed reads completed in one cycle. This is true for the common case (a cache hit), but not on a cache miss.
Nonblocking Dcache¶
I have no intention of implementing a full out-of-order core, but I was made aware that building a nonblocking data cache is trivial.
When a load is outstanding, the core is free to continue executing other ALU (non-LS) instructions. Instead of stalling the core completely, you just keep track of the destination register of the outstanding load; the stall logic is exactly the same as standard data-hazard checking logic.
The only change is if a write instruction is encountered while
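The hazard check described above might look like this. It's a sketch: `pending_load_rd` would be a register in the real core (and x0 would never be tracked), and the function name is made up.

```python
def must_stall(pending_load_rd, rs1, rs2, rd):
    """Stall only if this instruction reads or writes the destination
    register of the outstanding load -- otherwise execution continues.
    pending_load_rd is None when no load is in flight."""
    if pending_load_rd is None:
        return False
    # rs1/rs2 hit -> RAW hazard; rd hit -> would clobber the load's result
    return pending_load_rd in (rs1, rs2, rd)
```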