Arithmetic Encoder - AV1

Description

The Goal: To accelerate the AV1 arithmetic encoder by designing a pipelined architecture and synthesizing it for ASIC.
Current State: The architecture is complete and, after careful validation, no problems were found.
Versions: There are two versions of the architecture within this repository: High-throughput and Low-power.
Features: With the ST 65 nm PDK, the High-throughput version reaches 588 MHz, an 11.7k gate count and 982 bits/sec. The Low-power version reaches 562 MHz, an 11.2k gate count and 951 bits/sec with the same PDK, and saves around 21% of power according to an analysis made with real-world datasets.

What is still missing?

Improve frequency and reduce area: however high an architecture's frequency already is, there are always ways to push it higher without increasing the area.

Project overview

4-stage pipeline

Stage 1: Pre-calculations for the Low and Range generation.
Stage 2: Produces the final, normalized Range, along with a few other values that are essential for the Low generation.
Stage 3: Defines the low value, normalizes it, and generates the pre-bitstream.
Stage 4: Carry propagation block. Transforms the 9-bit pre-bitstream generated in Stage 3 into an 8-bit final bitstream.

Verification

This project's testbenches were written in SystemVerilog, and each of them is designed to catch problems in the part of the architecture it covers.
All files responsible for the verification can be found in the verification_area folder.

Simulation data

The simulation data files are generated directly from AV1's algorithm, modified to create output files with important data.

How to generate?

Download the modified version of AV1's entropy encoder file, entenc.c;
Download/clone the AV1 reference code;
Copy the modified entenc.c into the folder aom/aom_dsp, overwriting the original file;
Run any encoding process, following the procedure specified on the AV1 website;
After the encoding process is completed, open the newly created arith_analysis folder;
The normal data sequence for the testbench is in a file called main_data.

Other files generated

Input: this file contains only the input values for the architecture. It cannot be used by a testbench because it does not contain the algorithm's outputs.

Testbenches

Entropy encoder testbench: This is the main testbench for the entire architecture. It validates the bitstreams generated after the carry propagation process. The file is called entropy_encoder_tb.sv (1-bool version testbench).
Arithmetic encoder testbench: This testbench validates only stages 1, 2 and 3 of the pipeline together. The file is called tb_arith_encoder.sv.
Pre-bitstream testbench: This testbench also verifies the low and range outputs, the bitstream, and a few other values used by the function od_ec_enc_done in AV1's original code. The file is called tb_bitstream.
Component testbenches: The LZC (Leading Zero Counter) has its own testbench, called tb_lzc.sv.
Carry propagation testbench: This testbench verifies only the 4th stage of the pipeline, using random input data. The file is called tb_carry_propagation.sv.

Architecture in-depth explanation

Stage 1

To increase the frequency, this block executes the pre-calculations required for both the range and the low definitions in stages 2 and 3, respectively.
It also accesses the two look-up tables, which are generated by the script lut-generator.py.

Stage 2

This stage uses the results coming from the 1st stage to find the initial value of the range.
Furthermore, this stage normalizes the range using the LZC.v block.
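
The normalization described above can be sketched as a small behavioral model in Python. This is an illustration only, not the LZC.v implementation: it assumes a 16-bit range register, and the names lzc16 and normalize_range are hypothetical.

```python
RANGE_BITS = 16  # assumed width of the range register


def lzc16(value):
    """Count the leading zeros of a non-zero 16-bit value.

    The loop scans from the MSB down, the way a priority encoder would.
    """
    if not 0 < value < (1 << RANGE_BITS):
        raise ValueError("value must be a non-zero 16-bit number")
    count = 0
    for bit in range(RANGE_BITS - 1, -1, -1):
        if (value >> bit) & 1:
            break
        count += 1
    return count


def normalize_range(rng):
    """Shift rng left until its MSB is set; return (normalized range, shift)."""
    shift = lzc16(rng)
    return rng << shift, shift


print(normalize_range(0x0100))  # (32768, 7)
```

The shift amount is returned together with the normalized range because a later stage needs the same amount to normalize the low value.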

Stage 3

This stage receives values from stages 1 and 2.
Its main goal is to find the low value and normalize it.
Moreover, this stage is responsible for generating up to two 9-bit bitstreams per clock cycle.
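
A minimal Python sketch of how normalizing the low value can yield zero, one or two pre-bitstreams at once. It is modeled on the byte-flushing normalization used by Daala-style range encoders; the role of cnt as a bit counter, the 16-bit range, and the function name normalize_low are assumptions that should be checked against the entenc.c actually used.

```python
def normalize_low(low, rng, cnt):
    """Return (pre_bitstreams, new_low, new_rng, new_cnt).

    low: low register before normalization (may hold an overflow bit)
    rng: un-normalized, non-zero 16-bit range
    cnt: bit counter, negative while no full byte is ready (assumption)
    """
    d = 16 - rng.bit_length()   # leading zeros of the 16-bit range
    s = cnt + d
    out = []
    if s >= 0:                  # at least one byte is ready
        c = cnt + 16
        mask = (1 << c) - 1
        if s >= 8:              # a second byte is ready as well
            out.append(low >> c)
            low &= mask
            c -= 8
            mask >>= 8
        out.append(low >> c)    # may exceed 255 when a carry is pending
        low &= mask
        s = c + d - 24
    return out, low << d, rng << d, s


print(normalize_low(0x12345, 0x4000, -1))  # ([2], 18058, 32768, -8)
```

Each value in the returned list can be wider than 8 bits, which is why a separate carry propagation step is still needed afterwards.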

Stage 4

The output bitstreams are 8-bit arrays.
Because Stage 3 generates 9-bit bitstreams, the 4th stage propagates the ninth bit (the carry) into the previously generated bitstreams.
This block is divided into 2 sub-blocks: carry_propagation.v and final_bits.v.
The following subsections explain the behavior of each block.

Carry Propagation block

Whenever B_in > 255: B_out = B_prev[7:0] + B_in[15:8]
To do so, this block saves the last generated 8-bit bitstream (B_prev) and adds to it the carry of the following bitstream (B_in).
B_out = B_prev[7:0] + B_in[15:8]; B_prev = B_in[7:0]
This block is also able to count the number of 255s received in sequence and release n 255s at the same time using OUT_BIT_2 (255 or 0) and OUT_BIT_3 (the number saved in the counter).
This block is able to release up to five 8-bit arrays when necessary.

255 Exception

Whenever a 255 is received and later followed by a number > 255, the carry must propagate beyond i-1 (to i-2, i-3, ..., i-x).
To solve this problem, the bitstream received just before the first 255 is kept stored in Bprev, while a 255_counter counts the number of 255s received in sequence.
Once a bitstream != 255 arrives, the architecture releases the following: Bout1 = Bprev; Bout2 = 255 or 0; Bout3 = 255_counter; Bprev = Bin1 or Bin2 (Bin1 is released as Bout4 if flagIn == 2).
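
The held byte and the 255 counter can be illustrated with a short behavioral model in Python. This is a sketch, not a description of carry_propagation.v: it accepts one 9-bit value per call instead of two, the class and function names are hypothetical, and it assumes the input never requires a carry to reach a byte that has already been released.

```python
class CarryPropagator:
    """Streaming carry propagation over 9-bit pre-bitstream values."""

    def __init__(self):
        self.prev = None  # byte held back, playing the role of Bprev
        self.count = 0    # pending run of 255s, like 255_counter

    def push(self, value):
        """Accept one value in 0..511; return the bytes that become final."""
        carry, byte = value >> 8, value & 0xFF
        if carry == 0 and byte == 0xFF and self.prev is not None:
            self.count += 1  # outcome still depends on a future carry
            return []
        out = []
        if self.prev is not None:
            out.append((self.prev + carry) & 0xFF)  # Bout1
            fill = 0x00 if carry else 0xFF          # Bout2: 255s wrap to 0
            out.extend([fill] * self.count)         # repeated Bout3 times
        self.prev, self.count = byte, 0
        return out

    def flush(self):
        """Release whatever is still held at the end of the stream."""
        out = [] if self.prev is None else [self.prev] + [0xFF] * self.count
        self.prev, self.count = None, 0
        return out


def propagate(values):
    """Run a whole sequence through the streaming model."""
    cp, out = CarryPropagator(), []
    for v in values:
        out.extend(cp.push(v))
    return out + cp.flush()


def reference(values):
    """Non-streaming check: add carries from the last value back to the first."""
    out, carry = [], 0
    for v in reversed(values):
        total = v + carry
        out.append(total & 0xFF)
        carry = total >> 8
    return out[::-1]


print(propagate([0x12, 0xFF, 0xFF, 0x134, 0x56]))  # [19, 0, 0, 52, 86]
```

In the example, the carry from 0x134 turns the two pending 255s into zeros and increments the held 0x12 to 0x13, which is the i-2, i-3 propagation described above.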

Last Bitstreams generation

When the frame execution is over, the final bitstreams must be released according to the low and cnt values.
This block generates up to two bitstreams once the final_flag (sent by the testbench) reaches Stage 4.
The bitstreams generated here are also 9-bit arrays and must pass through the carry propagation process.

Analysis

Critical Path

As of 2020-12-19, synthesis with a 65 nm library reached a frequency of 559 MHz.
The architecture's current critical path is the range generation and normalization, which passes through a multiplication and the LZC sub-block.

Ways to improve

Find a way to multiply faster (trials with the Vedic multiplication method were unsuccessful) -- (not possible);
Find a way to split the range generation equation and execute some parts in Stage -- (not possible);
Use approximate computing to avoid the multiplication of the range;
Find a multiplier-free solution for arithmetic encoding and analyze how it behaves when added to the AV1 reference software (Window Sliding might be a great option).

Why is it not possible? (Indexes above)

Faster multiplication: when synthesizing the ASIC, the synthesizer automatically finds the best possible combination of cells for the multiplication (usually a custom cell made specifically for multiplication). It is not feasible to beat a cell designed for multiplication with Verilog code and alternative multiplication methods, since the cell itself already uses state-of-the-art techniques.
Splitting the range generation: splitting the Range or Low generation process into two or more stages would create bubbles in the pipeline. It does not make sense to, for example, split Stage 2 and send inputs with a 2-cycle gap. The frequency would improve, but the throughput would stay the same (or perhaps get lower).
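
The throughput argument can be checked with simple arithmetic. The sketch below assumes the bubble forces one input every 2 cycles; the 559 MHz figure is the one reported under Critical Path, while the 800 MHz clock and the function names are hypothetical.

```python
def input_rate(freq_mhz, initiation_interval):
    """Inputs accepted per microsecond when one input enters the pipeline
    every `initiation_interval` clock cycles."""
    return freq_mhz / initiation_interval


def break_even_frequency(freq_mhz, initiation_interval):
    """Clock (MHz) a split design needs just to match the original rate."""
    return freq_mhz * initiation_interval


baseline = input_rate(559.0, 1)  # current design, one input per cycle
split = input_rate(800.0, 2)     # hypothetical faster clock with a 2-cycle gap
print(baseline, split)           # 559.0 400.0
print(break_even_frequency(559.0, 2))  # 1118.0
```

Under these assumptions the split design would have to double the clock frequency just to match the current input rate.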

More information about the architecture can be found in the Project folder.

Versions

Check the README in the rtl directory for more information regarding the versions.

rtl/entropy-encoder-original: the original architecture, comprising stages 1, 2, 3 and 4 as explained above.
rtl/entropy-encoder-lp: a low-power version of the architecture. It uses Operand Isolation and Clock Gating to reduce power consumption, preventing useless values from the Boolean Operation from being generated and stored.
rtl/entropy-encoder-1-bool: version with only one Boolean Operation in Stages 2 and 3 (same as rtl/entropy-encoder-original, with the code divided into modules for CDF and Boolean Operations).
rtl/entropy-encoder-2-bool: version with 2 Boolean Operations in parallel in Stages 2 and 3.
rtl/entropy-encoder-3-bool: version with 3 Boolean Operations in parallel in Stages 2 and 3.
rtl/entropy-encoder-4-bool: version with 4 Boolean Operations in parallel in Stages 2 and 3. It also adds a second Carry Propagation block in Stage 4.
rtl/entropy-encoder-1-bool-lp: low-power version of the 1-bool architecture, with clock gating and operand isolation added.
rtl/entropy-encoder-2-bool-lp: low-power version of the 2-bool architecture, with clock gating and operand isolation added.
rtl/entropy-encoder-3-bool-lp: low-power version of the 3-bool architecture, with clock gating and operand isolation added.

How to run the main testbench?

Generate the simulation data and generate the LUT data (lut-generator.py);
Import the testbench file entropy_encoder_tb.sv and change the simulation file's path;
Import all .v files;
Compile all files in a simulation software (e.g., Modelsim);
Use the scripts in folder verification_area/modelsim_project/scripts/main_entropy_encoder/;
With the waveform scripts, some waveforms will be imported to the project (only tested on Modelsim);
With the re-run.do file, the LUT memories will be filled with generated data and the simulation will start.

tuliopereirab/arithmetic-encoder-av1

Folders and files

Latest commit

History

Repository files navigation

Arithmetic Encoder - AV1

Description

What is still missing?

Project overview

4-stage pipeline

Verification

Simulation data

How to generate?

Other files generated

Testbenches

Architecture in-depth explanation

Stage 1

Stage 2

Stage 3

Stage 4

Carry Propagation block

255 Exception

Last Bitstreams generation

Analysis

Critical Path

Ways to improve

Why is it not possible? (Indexes above)

Versions

How to run the main testbench?

About

Topics

Resources

License

Stars

Watchers

Forks

Languages