Idea for compressed instructions: Use a special 32-bit VLIW format #103

mbitsnbites · 2020-02-20T13:28:22Z

Compressed instruction sets

Compressed instruction sets seem to be all the rage these days. The obvious advantages are increased instruction bandwidth and reduced I$ load.

Traditional compressed instruction sets, especially those with mixed instruction sizes, come with a few caveats though:

Full-width instructions (typically 32 bits wide) may cross cache line boundaries, or you need to insert NOP:s to pad to an even cache line boundary. In either case, you need more logic to handle these cases, which complicates instruction fetch.
Several parts of the pipeline need to be aware of where an instruction starts and stops (e.g. branch prediction).
An instruction can start at an "odd" address (e.g. a two-byte boundary rather than a four-byte boundary), which means that you need one more bit to represent things like branch offsets.

Alternative: VLIW

All of the problems mentioned above go away if we pack two instructions into a single 32-bit instruction word.

To simplify as much as possible, let the two instructions be treated as normal instructions that execute one after another (i.e. don't treat the instructions as independent concurrent instructions as in traditional VLIW ISA:s).

The obvious disadvantage with this approach is that it may not be possible to emit compressed/packed instructions in as many situations as with a traditional compressed ISA, and more work may be needed by the compiler to optimally schedule instructions in a way that maximizes the utilization of this encoding format.

Below is an example, assuming that we implement #104 (so that we effectively only need 3 bits for all current type D instructions), and can use a 4-bit prefix, e.g. 1110 or 1111, for compressed instructions.

Encoding

Two instructions are encoded in a single word.
We use destructive register operands (the destination register may also be a source register).
Different instructions have different encodings.

      3             2               1
     |1| | | | | | |4| | | | | | | |6| | | | | | | |8| | | | | | | |0|
     +-------+---------------------------+---------------------------+
     |1 1 1 0|                           |                           |
     +-------+---------------------------+---------------------------+
              \_ Instr. A (14b) ________/ \_ Instr. B (14b) ________/

Formats

Each instruction slot (14 bits) may have one of the following encodings (preliminary - exact configurations and bit positions may be changed):

     1
    |3| | | | |8| | | | | | | |0|
    +-------+---------+---------+
P1: |O      |R        |R        |
    +-------+---------+---------+
P2: |O      |R        |I        |
    +-------+---------+---------+
P3: |O      |I                  |
    +-------+-------------------+

O: Operation
R: Register (source and/or destination)
I: Immediate value

Suggested instructions

The instructions that we should be able to encode in this more compact form should be common ones. We should compile some large software suite and check which instructions are most common - AND would fit together in pairs.

Based on preliminary findings and common sense, a few common instructions seem to be:

Stack push/pop.
Move register.
Load immediate value.
Load/store word:
- ldw/stw sa, sb, #0
- ldw sa, sa, #imm
Some arithmetic, logic and shift instructions.

Table 1.7 in The RISC-V Compressed Instruction Set Manual v1.9 contains some information about the relative frequency of common instructions.

If instructions are chosen in such a way that there can be a 1:1 mapping between regular instructions and packed instructions, the decision of whether or not to pack instructions can be done at assembly time.

The most obvious thing that would benefit from a VLIW-style compact format is stack handling (function prologues and epilogues), since:

It takes much space due to the lack of instructions that load/store many registers.
The code sequence is quite deterministic and compressible instruction pairs are common.

Some examples are given below:

OP	Format	Instruction	Description
0000	P1	mov rd, rs	Move register
0001	P1	or rd, rd, rs	Or register
0010	P1	and rd, rd, rs	And register
0011	P1	xor rd, rd, rs	Xor register
0100	P1	add rd, rd, rs	Add register
0101	P1	sub rd, rd, rs	Sub register
0110	-	-	-
0111	-	-	-
1000	P2	ldli rd, #imm	Load immediate
1001	P2	asr rd, rd, #imm	Asr immediate
1010	P2	lsl rd, rd, #imm	Lsl immediate
1011	P2	lsr rd, rd, #imm	Lsr immediate
1100	P2	add rd, rd, #imm	Add immediate
1101	P2	ldw rd, sp, #imm*4	Load word from stack
1110	P2	stw rs, sp, #imm*4	Store word to stack
1111	P3	add sp, sp, #imm*4	Add immediate to SP

Branches and PC

Suggested rules for the PC of packed instructions:

Both instructions have the same PC.
The second instruction can not be a branch target.
- As a consequence, the first instruction can not be an unconditional jump or ret instruction (it's pointless).
In case of interrupts or exceptions, we need to handle the possible case that only the first of the two instructions has finished (e.g. for a page fault during instruction B). There are two apparent alternatives:
1. Both instructions must be treated as an atomic unit (i.e. either both instructions are executed in whole or both are cancelled).
2. Include one bit of extra state that indicates that the first instruction has already been executed, and should be NOP:ified when the exception returns. This could be implemented by letting the return PC be the actual PC + 2.

Open questions

We could use clever encodings of the register numbers so that useless cases are avoided (e.g. there's seldom a point in storing PC or SP on the stack, and using Z as a target register is seldom useful), and we could save a bit or two in encoding.
It may be better to have different possible instructions for the A slot and the B slot, since some instructions do not make sense in slot A (e.g. unconditional branches & ret).
Should we support branches at all? The instruction encoding space may be better used for other instructions. Branches are problematic because:
- Early branch prediction logic (e.g. pre-decode static prediction) gets more complex when there are multiple ways to encode a branch.
- Branch logic (e.g. branch target address calculation) may be subject to longer combinatorial delay when there are different branch instruction encodings.

The text was updated successfully, but these errors were encountered:

mbitsnbites · 2020-10-04T16:28:31Z

Inspired by #104 and #110, the encoding could be changed as follows:

Reduce the number of type D instructions. We only need 6 type D instructions:
- LDLI (sign extend w MSB)
- LDHI (bottom-fill w LSB)
- ADDPCHI
- J
- JL
- B(ranch) - This actually uses special encoding (3 bits of the 21-bit immediate field are used as the condition code).
VLIW instructions have the upper 4 bits set to 1.

mbitsnbites · 2021-01-27T09:08:19Z

Here are the instruction frequencies for the entire Doom binary (some of these are "false" - e.g. many instances of addpchi would be eliminated by relaxation for instance):

Insn	Freq	Percent
ldw	13424	16.60%
add	10608	13.11%
stw	9937	12.28%
addpchi	9496	11.74%
ldli	4721	5.84%
jl	3783	4.68%
j	3013	3.72%
bs	2662	3.29%
bz	2412	2.98%
sub	1936	2.39%
bnz	1728	2.14%
bns	1629	2.01%
seq	1548	1.91%
and	1383	1.71%
sne	1148	1.42%
sle	974	1.20%
or	833	1.03%
lsl	780	0.96%
shuf	768	0.95%
lsr	756	0.93%
ldub	735	0.91%
mul	550	0.68%
stb	547	0.68%
sleu	506	0.63%
ldea	490	0.61%
sth	475	0.59%
slt	448	0.55%
lduh	434	0.54%
asr	428	0.53%
ldh	411	0.51%
ldhi	393	0.49%
ble	246	0.30%
pack	238	0.29%
mulhi	223	0.28%
sltu	187	0.23%
blt	170	0.21%
div	136	0.17%
max	121	0.15%
min	107	0.13%
bgt	90	0.11%
rem	77	0.10%
bge	73	0.09%
xor	70	0.09%
sel	54	0.07%
divu	45	0.06%
remu	27	0.03%
ldb	20	0.02%
minu	17	0.02%
clz	15	0.02%
cpuid	9	0.01%
maxu	4	0.00%

...and for Quake:

Insn	Freq	Percent
ldw	20730	19,45 %
add	15699	14,73 %
stw	14110	13,24 %
addpchi	13416	12,59 %
jl	5634	5,29 %
ldli	4655	4,37 %
bs	3150	2,96 %
bz	2461	2,31 %
bns	2084	1,96 %
bnz	1824	1,71 %
fmul	1712	1,61 %
seq	1491	1,40 %
and	1468	1,38 %
fadd	1348	1,26 %
ldhi	1318	1,24 %
ldub	1298	1,22 %
sne	1295	1,21 %
sub	1239	1,16 %
or	1150	1,08 %
sle	942	0,88 %
stb	830	0,78 %
fsub	755	0,71 %
lsr	568	0,53 %
lsl	546	0,51 %
sleu	543	0,51 %
j	538	0,50 %
itof	467	0,44 %
fslt	447	0,42 %
ldea	440	0,41 %
ftoi	410	0,38 %
slt	392	0,37 %
ble	359	0,34 %
asr	339	0,32 %
mul	304	0,29 %
shuf	293	0,27 %
fseq	205	0,19 %
ldh	193	0,18 %
blt	192	0,18 %
sel	190	0,18 %
sltu	184	0,17 %
sth	173	0,16 %
fdiv	148	0,14 %
fsne	141	0,13 %
lduh	138	0,13 %
fsle	124	0,12 %
bgt	107	0,10 %
bge	97	0,09 %
max	89	0,08 %
rem	61	0,06 %
min	47	0,04 %
div	41	0,04 %
divu	38	0,04 %
xor	33	0,03 %
remu	25	0,02 %
utof	22	0,02 %
pack	18	0,02 %
ldb	15	0,01 %
clz	15	0,01 %
ftou	13	0,01 %
minu	11	0,01 %
cpuid	6	0,01 %
maxu	5	0,00 %
mulhi	2	0,00 %
ftoir	2	0,00 %

mbitsnbites · 2022-06-28T07:12:00Z

Note: This issue is pretty much made void by #141

mbitsnbites mentioned this issue Jan 13, 2021

EPIC: Rearrange encoding of Format D instructions #116

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Idea for compressed instructions: Use a special 32-bit VLIW format #103

Idea for compressed instructions: Use a special 32-bit VLIW format #103

mbitsnbites commented Feb 20, 2020 •

edited

mbitsnbites commented Oct 4, 2020 •

edited

mbitsnbites commented Jan 27, 2021 •

edited

mbitsnbites commented Jun 28, 2022

Idea for compressed instructions: Use a special 32-bit VLIW format #103

Idea for compressed instructions: Use a special 32-bit VLIW format #103

Comments

mbitsnbites commented Feb 20, 2020 • edited

Compressed instruction sets

Alternative: VLIW

Encoding

Formats

Suggested instructions

Branches and PC

Open questions

mbitsnbites commented Oct 4, 2020 • edited

mbitsnbites commented Jan 27, 2021 • edited

mbitsnbites commented Jun 28, 2022

mbitsnbites commented Feb 20, 2020 •

edited

mbitsnbites commented Oct 4, 2020 •

edited

mbitsnbites commented Jan 27, 2021 •

edited