Skip to content
This repository has been archived by the owner on Sep 2, 2023. It is now read-only.

Idea for compressed instructions: Use a special 32-bit VLIW format #103

Open
mbitsnbites opened this issue Feb 20, 2020 · 3 comments
Open

Comments

@mbitsnbites
Copy link
Member

mbitsnbites commented Feb 20, 2020

Compressed instruction sets

Compressed instruction sets seem to be all the rage these days. The obvious advantages are increased instruction bandwidth and reduced I$ load.

Traditional compressed instruction sets, especially those with mixed instruction sizes, come with a few caveats though:

  • Full-width instructions (typically 32 bits wide) may cross cache line boundaries, or you need to insert NOP:s to pad to an even cache line boundary. In either case, you need more logic to handle these cases, which complicates instruction fetch.
  • Several parts of the pipeline need to be aware of where an instruction starts and stops (e.g. branch prediction).
  • An instruction can start at an "odd" address (e.g. a two-byte boundary rather than a four-byte boundary), which means that you need one more bit to represent things like branch offsets.

Alternative: VLIW

All of the problems mentioned above go away if we pack two instructions into a single 32-bit instruction word.

To simplify as much as possible, let the two instructions be treated as normal instructions that execute one after another (i.e. don't treat the instructions as independent concurrent instructions as in traditional VLIW ISA:s).

The obvious disadvantage with this approach is that it may not be possible to emit compressed/packed instructions in as many situations as with a traditional compressed ISA, and more work may be needed by the compiler to optimally schedule instructions in a way that maximizes the utilization of this encoding format.

Below is an example, assuming that we implement #104 (so that we effectively only need 3 bits for all current type D instructions), and can use a 4-bit prefix, e.g. 1110 or 1111, for compressed instructions.

Encoding

  • Two instructions are encoded in a single word.
  • We use destructive register operands (the destination register may also be a source register).
  • Different instructions have different encodings.
      3             2               1
     |1| | | | | | |4| | | | | | | |6| | | | | | | |8| | | | | | | |0|
     +-------+---------------------------+---------------------------+
     |1 1 1 0|                           |                           |
     +-------+---------------------------+---------------------------+
              \_ Instr. A (14b) ________/ \_ Instr. B (14b) ________/

Formats

Each instruction slot (14 bits) may have one of the following encodings (preliminary - exact configurations and bit positions may be changed):

     1
    |3| | | | |8| | | | | | | |0|
    +-------+---------+---------+
P1: |O      |R        |R        |
    +-------+---------+---------+
P2: |O      |R        |I        |
    +-------+---------+---------+
P3: |O      |I                  |
    +-------+-------------------+

O: Operation
R: Register (source and/or destination)
I: Immediate value

Suggested instructions

The instructions that we should be able to encode in this more compact form should be common ones. We should compile some large software suite and check which instructions are most common - AND would fit together in pairs.

Based on preliminary findings and common sense, a few common instructions seem to be:

  • Stack push/pop.
  • Move register.
  • Load immediate value.
  • Load/store word:
    • ldw/stw sa, sb, #0
    • ldw sa, sa, #imm
  • Some arithmetic, logic and shift instructions.

Table 1.7 in The RISC-V Compressed Instruction Set Manual v1.9 contains some information about the relative frequency of common instructions.

If instructions are chosen in such a way that there can be a 1:1 mapping between regular instructions and packed instructions, the decision of whether or not to pack instructions can be done at assembly time.

The most obvious thing that would benefit from a VLIW-style compact format is stack handling (function prologues and epilogues), since:

  1. It takes much space due to the lack of instructions that load/store many registers.
  2. The code sequence is quite deterministic and compressible instruction pairs are common.

Some examples are given below:

OP Format Instruction Description
0000 P1 mov rd, rs Move register
0001 P1 or rd, rd, rs Or register
0010 P1 and rd, rd, rs And register
0011 P1 xor rd, rd, rs Xor register
0100 P1 add rd, rd, rs Add register
0101 P1 sub rd, rd, rs Sub register
0110 - - -
0111 - - -
1000 P2 ldli rd, #imm Load immediate
1001 P2 asr rd, rd, #imm Asr immediate
1010 P2 lsl rd, rd, #imm Lsl immediate
1011 P2 lsr rd, rd, #imm Lsr immediate
1100 P2 add rd, rd, #imm Add immediate
1101 P2 ldw rd, sp, #imm*4 Load word from stack
1110 P2 stw rs, sp, #imm*4 Store word to stack
1111 P3 add sp, sp, #imm*4 Add immediate to SP

Branches and PC

Suggested rules for the PC of packed instructions:

  • Both instructions have the same PC.
  • The second instruction can not be a branch target.
    • As a consequence, the first instruction can not be an unconditional jump or ret instruction (it's pointless).
  • In case of interrupts or exceptions, we need to handle the possible case that only the first of the two instructions has finished (e.g. for a page fault during instruction B). There are two apparent alternatives:
    1. Both instructions must be treated as an atomic unit (i.e. either both instructions are executed in whole or both are cancelled).
    2. Include one bit of extra state that indicates that the first instruction has already been executed, and should be NOP:ified when the exception returns. This could be implemented by letting the return PC be the actual PC + 2.

Open questions

  • We could use clever encodings of the register numbers so that useless cases are avoided (e.g. there's seldom a point in storing PC or SP on the stack, and using Z as a target register is seldom useful), and we could save a bit or two in encoding.
  • It may be better to have different possible instructions for the A slot and the B slot, since some instructions do not make sense in slot A (e.g. unconditional branches & ret).
  • Should we support branches at all? The instruction encoding space may be better used for other instructions. Branches are problematic because:
    • Early branch prediction logic (e.g. pre-decode static prediction) gets more complex when there are multiple ways to encode a branch.
    • Branch logic (e.g. branch target address calculation) may be subject to longer combinatorial delay when there are different branch instruction encodings.
@mbitsnbites
Copy link
Member Author

mbitsnbites commented Oct 4, 2020

Inspired by #104 and #110, the encoding could be changed as follows:

  • Reduce the number of type D instructions. We only need 6 type D instructions:
    • LDLI (sign extend w MSB)
    • LDHI (bottom-fill w LSB)
    • ADDPCHI
    • J
    • JL
    • B(ranch) - This actually uses special encoding (3 bits of the 21-bit immediate field are used as the condition code).
  • VLIW instructions have the upper 4 bits set to 1.

@mbitsnbites
Copy link
Member Author

mbitsnbites commented Jan 27, 2021

Here are the instruction frequencies for the entire Doom binary (some of these are "false" - e.g. many instances of addpchi would be eliminated by relaxation for instance):

Insn Freq Percent
ldw 13424 16.60%
add 10608 13.11%
stw 9937 12.28%
addpchi 9496 11.74%
ldli 4721 5.84%
jl 3783 4.68%
j 3013 3.72%
bs 2662 3.29%
bz 2412 2.98%
sub 1936 2.39%
bnz 1728 2.14%
bns 1629 2.01%
seq 1548 1.91%
and 1383 1.71%
sne 1148 1.42%
sle 974 1.20%
or 833 1.03%
lsl 780 0.96%
shuf 768 0.95%
lsr 756 0.93%
ldub 735 0.91%
mul 550 0.68%
stb 547 0.68%
sleu 506 0.63%
ldea 490 0.61%
sth 475 0.59%
slt 448 0.55%
lduh 434 0.54%
asr 428 0.53%
ldh 411 0.51%
ldhi 393 0.49%
ble 246 0.30%
pack 238 0.29%
mulhi 223 0.28%
sltu 187 0.23%
blt 170 0.21%
div 136 0.17%
max 121 0.15%
min 107 0.13%
bgt 90 0.11%
rem 77 0.10%
bge 73 0.09%
xor 70 0.09%
sel 54 0.07%
divu 45 0.06%
remu 27 0.03%
ldb 20 0.02%
minu 17 0.02%
clz 15 0.02%
cpuid 9 0.01%
maxu 4 0.00%

...and for Quake:

Insn Freq Percent
ldw 20730 19,45 %
add 15699 14,73 %
stw 14110 13,24 %
addpchi 13416 12,59 %
jl 5634 5,29 %
ldli 4655 4,37 %
bs 3150 2,96 %
bz 2461 2,31 %
bns 2084 1,96 %
bnz 1824 1,71 %
fmul 1712 1,61 %
seq 1491 1,40 %
and 1468 1,38 %
fadd 1348 1,26 %
ldhi 1318 1,24 %
ldub 1298 1,22 %
sne 1295 1,21 %
sub 1239 1,16 %
or 1150 1,08 %
sle 942 0,88 %
stb 830 0,78 %
fsub 755 0,71 %
lsr 568 0,53 %
lsl 546 0,51 %
sleu 543 0,51 %
j 538 0,50 %
itof 467 0,44 %
fslt 447 0,42 %
ldea 440 0,41 %
ftoi 410 0,38 %
slt 392 0,37 %
ble 359 0,34 %
asr 339 0,32 %
mul 304 0,29 %
shuf 293 0,27 %
fseq 205 0,19 %
ldh 193 0,18 %
blt 192 0,18 %
sel 190 0,18 %
sltu 184 0,17 %
sth 173 0,16 %
fdiv 148 0,14 %
fsne 141 0,13 %
lduh 138 0,13 %
fsle 124 0,12 %
bgt 107 0,10 %
bge 97 0,09 %
max 89 0,08 %
rem 61 0,06 %
min 47 0,04 %
div 41 0,04 %
divu 38 0,04 %
xor 33 0,03 %
remu 25 0,02 %
utof 22 0,02 %
pack 18 0,02 %
ldb 15 0,01 %
clz 15 0,01 %
ftou 13 0,01 %
minu 11 0,01 %
cpuid 6 0,01 %
maxu 5 0,00 %
mulhi 2 0,00 %
ftoir 2 0,00 %

@mbitsnbites
Copy link
Member Author

Note: This issue is pretty much made void by #141

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant