SuperScalar Processor
Luke Mitchell
lm0466@my.bristol.ac.uk
Implemented in modular, object-oriented C++.
https://github.com/lukem512/risky-business
Processor Architecture I
3-stage SuperScalar pipeline: Fetch, Decode, Execute
Issues data-independent instructions by checking Bernstein's conditions (sketched below)
Pipeline stalls at dependencies
The number of each type of functional unit can be varied using run-time switches:
./proc -f bubble.bin -eus 5 -dus 4 -fus 2
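As a minimal sketch, a Bernstein-style independence check might look like the following; the types and names here are illustrative, not the simulator's actual API:

    #include <algorithm>
    #include <set>

    // Hypothetical descriptor of an instruction's register usage.
    struct Deps {
        std::set<int> reads;   // registers read
        std::set<int> writes;  // registers written
    };

    static bool disjoint(const std::set<int>& a, const std::set<int>& b) {
        return std::none_of(a.begin(), a.end(),
                            [&](int r) { return b.count(r) > 0; });
    }

    // Bernstein's conditions: i and j are independent iff neither reads
    // what the other writes (RAW/WAR) and their write sets do not overlap (WAW).
    bool independent(const Deps& i, const Deps& j) {
        return disjoint(i.reads, j.writes) &&
               disjoint(i.writes, j.reads) &&
               disjoint(i.writes, j.writes);
    }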
Processor Architecture II
5-stage Out-of-Order pipeline: Fetch, Decode/Issue, Read, Execute, Write
Uses the scoreboarding algorithm (sketched below)
Incomplete: dependence is not correctly implemented for memory operations
Fewer Write units are needed to alleviate pipeline bubbles
Branch prediction schemes: Dynamic, Static, Stalled
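The scoreboarding check mentioned above might, as a minimal sketch, track a pending writer per register; this conservative variant stalls issue on any register hazard (an assumed structure, not the simulator's actual one):

    #include <array>

    constexpr int NUM_REGS = 16;

    struct Scoreboard {
        // For each register: the functional unit with a pending write, or -1.
        std::array<int, NUM_REGS> writer;
        Scoreboard() { writer.fill(-1); }

        // Allow issue only when the destination has no pending write (WAW)
        // and neither source is waiting on an earlier result (RAW).
        bool canIssue(int dest, int src1, int src2) const {
            return writer[dest] == -1 &&
                   writer[src1] == -1 &&
                   writer[src2] == -1;
        }
        void issue(int dest, int unit) { writer[dest] = unit; }
        void writeback(int dest)       { writer[dest] = -1; }
    };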
Processor Architecture III
16 general-purpose registers and up to 65535 bytes of memory. These
are initially randomised to mimic a real machine at boot (sketched below).
Registers are 32-bit. Memory words are also 32-bit.
Instructions can contain up to 3 register operands, or up to 2 register operands
and a 16-bit immediate operand.
A compact RISC-like instruction set.
Much of this is customisable using run-time switches*!
* see ./proc -h for full usage instructions
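A toy sketch of that start-up randomisation, assuming memory is stored as 32-bit words (illustrative only, not the simulator's actual initialisation code):

    #include <array>
    #include <cstdint>
    #include <random>
    #include <vector>

    int main() {
        std::mt19937 rng(std::random_device{}());
        std::uniform_int_distribution<uint32_t> dist;

        std::array<uint32_t, 16> regs;                        // 16 GPRs, 32-bit
        std::vector<uint32_t> mem(65535 / sizeof(uint32_t));  // 32-bit words

        for (auto& r : regs) r = dist(rng);  // random contents at "boot"
        for (auto& w : mem)  w = dist(rng);
        return 0;
    }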
Processor Architecture IV
Each instruction is encoded in 32 bits and is one of 7 types. These types specify
the operands, which are encoded as follows:
Opcodes are encoded in 5 bits
Register operands are encoded in 4 bits
Immediate operands are encoded in 16 bits
A typical encoding looks like this:
    Bits      Field
    27 - 31   Opcode
    23 - 26   Register 1
    19 - 22   Register 2
    3 - 18    Immediate
    0 - 2     Reserved
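A hedged bit-twiddling sketch of this layout; the function names are mine for illustration and may not match the simulator's encoder:

    #include <cstdint>

    uint32_t encode(uint32_t op, uint32_t r1, uint32_t r2, uint32_t imm) {
        return ((op  & 0x1Fu)   << 27) |  // 5-bit opcode,     bits 27-31
               ((r1  & 0xFu)    << 23) |  // 4-bit register 1, bits 23-26
               ((r2  & 0xFu)    << 19) |  // 4-bit register 2, bits 19-22
               ((imm & 0xFFFFu) << 3);    // 16-bit immediate, bits 3-18
    }                                     // bits 0-2 remain reserved (zero)

    uint32_t opcodeOf(uint32_t w) { return (w >> 27) & 0x1Fu; }
    uint32_t reg1Of(uint32_t w)   { return (w >> 23) & 0xFu; }
    uint32_t reg2Of(uint32_t w)   { return (w >> 19) & 0xFu; }
    uint32_t immOf(uint32_t w)    { return (w >> 3) & 0xFFFFu; }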
Benchmarks
I have written a suite of benchmarks able to compute the first n items of the
Fibonacci, Triangle and Square number sequences; perform the Bubble Sort
algorithm; compute n factorial; compute the sum of products between two
arrays; and play Conway's Game of Life. I have also implemented two of the
Livermore Loops.
These benchmarks have been chosen to balance memory, compute and
branching instructions.
The images below were generated by the Game of Life benchmark running on my simulator!
Experiments I
Hypothesis: as the pipeline width increases, so will the number of instructions
executed per cycle. However, the gain will diminish with each successive
increase, tending towards a theoretical maximum of 16 useful instructions per
cycle*.
Experiment: I wrote a simple program, speed-unrolled.s, that executes sets of
independent ADD instructions. This was run with different pipeline widths, where
the numbers of Fetch, Decode and Execute units are equal. The experiment
was also performed on other benchmarks; these programs have many more loops
and dependencies.
* anything that does not require register I/O has limited use; with only 16 registers, at most 16 instructions can be mutually independent, hence the maximum.
Result: the optimal pipeline width for the straight-line speed test is 14, above
which almost no improvement is seen. This is because the test issues sets of 14
independent instructions, after which any additional functional units sit idle.
In the real-world examples the optimal width was much lower, owing to more
frequent dependent instructions, such as the accesses to array elements in the
Livermore Loops. Values between 2 and 4 were most common.
Experiments II
Hypothesis: as the branch prediction scheme becomes more sophisticated, its
accuracy will improve and the number of cycles executed will decrease. The
Static scheme will be accurate for simple loops; however, when inner loops are
introduced, there will be one misprediction per outer-loop iteration. The Dynamic
scheme will perform similarly, but with a substantial improvement in programs with
lots of forward branches, or where branch behaviour is unpredictable. All schemes
will outperform the trivial, Stalled pipeline.
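For concreteness, a 2-bit saturating counter per branch is one common way to realise such a Dynamic scheme; the sketch below assumes that design, which may differ from the simulator's actual predictor:

    #include <cstdint>
    #include <unordered_map>

    // Illustrative 2-bit saturating-counter predictor, indexed by branch PC.
    class TwoBitPredictor {
        std::unordered_map<uint32_t, uint8_t> counters; // 0..3 per branch
    public:
        // Predict taken when the counter is in a "taken" state (2 or 3).
        bool predict(uint32_t pc) const {
            auto it = counters.find(pc);
            return it != counters.end() && it->second >= 2;
        }
        // Nudge the counter towards the actual outcome, saturating at 0 and 3.
        void update(uint32_t pc, bool taken) {
            uint8_t& c = counters[pc];
            if (taken  && c < 3) ++c;
            if (!taken && c > 0) --c;
        }
    };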
Experiment: all three branch prediction schemes were run across a variety of
benchmarks, including an example, dbp.s, with lots of forward branches. This was
designed to illustrate a scenario in which the Dynamic scheme should outperform
the others. The bubble.s benchmark was included for its unpredictability.
Result: in all experiments, the Dynamic scheme performed at least as well as the
Static scheme. In complex benchmarks, featuring nested loops or unpredictable
branching, the Dynamic scheme outperformed the other schemes.
As expected, branch accuracy and performance are correlated. This is due to the
overhead required to re-fill the pipeline after each misprediction.
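As a rough cost model (an illustrative assumption, not a figure measured from the simulator): if a run suffers M mispredictions and the pipeline takes D cycles to re-fill, roughly M x D cycles are wasted, so the cycle count falls in step with the misprediction count.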