
Hardware Implementation of an Image Decompressor

Overview

The "Hardware Implementation of an Image Decompressor" project focuses on designing and implementing a custom digital system to decompress images using the McMaster Image Compression revision 17 (.mic17) specification. This project provides an in-depth understanding of digital system design, specifically in the areas of image processing and hardware implementation, utilizing the Altera DE2-115 board and relevant software tools including Quartus II and ModelSim.

Objective

In this project, I will gain experience in digital system design by implementing the custom McMaster Image Compression revision 17 (.mic17) image compression specification in hardware. Compressed data for a 320 x 240 pixel image will be delivered to the Altera DE2-115 board via the universal asynchronous receiver/transmitter (UART) interface from a personal computer (PC) and stored in the external static random access memory (SRAM). The image decoding circuitry will read the compressed data, recover the image using a custom digital circuit designed by me, and store it back to the SRAM, from where the video graphics array (VGA) controller will read it and display it to the monitor.

Flow of basic image compression (down the left) and decompression (up the right).

Colourspace Conversion and Downsampling

Colourspace conversion

This shows how a 320x240 image can be decomposed into its red, green, and blue components. Each pixel (composed of 1 red, 1 green, and 1 blue value) is independently transformed into the YUV colourspace. Each resulting pixel contains 1 Y (brightness), 1 U (blue chroma), and 1 V (red chroma) value. The chroma components carry only colour information (no brightness). In Figure 2, the bottom of the motorcycle is red, and the corresponding bright area in the V image indicates there are large values at these pixels, representing a strong red colouring. Note this says nothing about whether it is bright red or dark red, as the brightness comes only from Y.

Each RGB pixel is converted to a YUV pixel by performing the following matrix multiplication:

$$\begin{bmatrix} Y \\ U \\ V \end{bmatrix} =
\begin{bmatrix} 0.257 & 0.504 & 0.098 \\ -0.148 & -0.291 & 0.439 \\ 0.439 & -0.368 & -0.071 \end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix} +
\begin{bmatrix} 16 \\ 128 \\ 128 \end{bmatrix}$$

(The coefficients shown here follow the common BT.601 studio-swing convention; the .mic17 specification defines the exact values.)

The terminology YUV originates from the video transmission field, and different notations (depending on the video/colour/transmission standard) can be found in the literature (such as YIQ, YPbPr, YCbCr). The human eye is much more sensitive to brightness than colour (due to the presence of more rods than cones in the eye). For this reason, we choose to downsample U and V (the chroma components) to reduce the amount of data while having little effect on how the image is perceived. We horizontally downsample by 2, so the U and V images become half the original width (no vertical downsampling, so the height is unchanged). Therefore, only 2/3 of the total data remains.
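As an illustration, a software sketch of the horizontal 2:1 downsampling of one chroma plane might look as follows. It assumes simple pairwise averaging; the .mic17 specification defines the exact downsampling method, which may differ.

```c
#include <stdint.h>

/* Horizontal 2:1 downsampling of one chroma plane (sketch).
   Assumes plain averaging of horizontal pairs; the specification
   defines the actual filter. */
void downsample_chroma(const uint8_t *src, uint8_t *dst, int width, int height) {
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width / 2; x++) {
            /* Each output sample covers two adjacent input samples. */
            int a = src[y * width + 2 * x];
            int b = src[y * width + 2 * x + 1];
            dst[y * (width / 2) + x] = (uint8_t)((a + b + 1) / 2); /* rounded average */
        }
    }
}
```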

Image Compression

The size of the arrow roughly indicates the relative amount of data at each stage. Generally, we start with an image in the standard red/green/blue (RGB) format. This is converted to YUV format, where Y represents brightness, and U and V are chrominance components (which contain the colour information). The YUV image is downsampled and then transformed into the frequency domain. This result is quantized, and finally, we apply lossless encoding to obtain the compressed bitstream. Decoding proceeds in the reverse order: lossless decoding and dequantization (restores the frequency-domain image), then the inverse signal transform (which brings the image back to the spatial domain), and finally, upsampling and colourspace conversion (produces an RGB image).

Compression enables smaller file sizes, which reduces storage and transmission costs. We convert the RGB image to YUV because the human eye is more sensitive to brightness than colour, so if we want more lossy compression (by reducing the quality), it is more effective to compress the colour information. In the RGB format, there is no immediate way to compress colour more than brightness. In the .mic17 specification, YUV information is compressed by downsampling U and V. Additionally, the human eye is less sensitive to high-frequency components in an image (compared to low frequencies). Transforming into the frequency domain enables us to compress higher-frequency components more than lower-frequency ones; we compress frequency information by applying quantization. Finally, we use lossless compression as a last effort to reduce the file size.
Downsampling and quantization are not identically reversible, so some information is lost during compression. The decompressed image only approximates the original image. A good compression scheme leverages this loss of information to reduce the compressed file size but will attempt to allow information loss only in a way that does not drastically affect the quality of the decompressed image. The focus of this project is the hardware implementation of a compression scheme. The .mic17 specification is not meant to rival existing compression standards for images. Simplifying choices have been made in developing the specification to facilitate a straightforward hardware implementation. This limits the capabilities in terms of the achievable compression/quality tradeoff. Even so, the concepts described above still apply in a general sense to the existing compression schemes. The following subsections will discuss each stage of image compression in more detail.

Signal Transform and Quantization

The subsequent compression stage involves transforming the image from the spatial domain to the frequency domain. High spatial frequencies (brightness/colour transitions that occur in a small amount of space) are less noticeable to the human eye, so they may be reduced in accuracy (or even removed) without drastically affecting the overall quality of the image.

 

In the .mic17 specification, we use a 2-dimensional discrete cosine transform (DCT), and we work on blocks of 8x8 pixels (each 8x8 block is processed independently; we do not consider images with heights and widths not divisible by 8). The 2D DCT is computed with the following matrix multiplications:

For an 8x8 spatial block $S$, the transformed block is

$$T = C\,S\,C^{T}, \qquad C_{ij} = k_i \cos\!\left(\frac{(2j+1)\,i\,\pi}{16}\right), \quad k_0 = \tfrac{1}{\sqrt{8}},\ k_i = \tfrac{1}{2}\ (i > 0)$$

and the quantized block is obtained element-wise as

$$Z_{ij} = \operatorname{round}\!\left(\frac{T_{ij}}{Q_{ij}}\right)$$

where $Q$ is the chosen quantization matrix (Q0 or Q1; the specification defines the entries of both matrices).
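For reference, here is a small software sketch of these two steps in C. Floating point is used for clarity (the hardware uses fixed point), and Q stands in for whichever of the specification's quantization matrices was selected.

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 8

/* Build the 8x8 DCT basis matrix C described above. */
static void build_dct_matrix(double C[N][N]) {
    for (int i = 0; i < N; i++) {
        double k = (i == 0) ? 1.0 / sqrt(8.0) : 0.5;
        for (int j = 0; j < N; j++)
            C[i][j] = k * cos((2 * j + 1) * i * M_PI / 16.0);
    }
}

/* 2D DCT followed by quantization: Z = round((C * S * C^T) ./ Q). */
void dct_quantize(const double S[N][N], const int Q[N][N], int Z[N][N]) {
    double C[N][N], tmp[N][N], T[N][N];
    build_dct_matrix(C);

    /* tmp = C * S */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            tmp[i][j] = 0.0;
            for (int k = 0; k < N; k++) tmp[i][j] += C[i][k] * S[k][j];
        }

    /* T = tmp * C^T, then quantize element-wise */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            T[i][j] = 0.0;
            for (int k = 0; k < N; k++) T[i][j] += tmp[i][k] * C[j][k];
            Z[i][j] = (int)lround(T[i][j] / Q[i][j]);
        }
}
```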

After dividing each colour plane (Y, U, and V) into blocks of 8x8 pixels, we apply DCT and quantization to each block. Figure 3 shows this on 2 blocks of Y data. The image is 320x240, so there will be (320/8)*(240/8) = 1200 Y blocks and 600 blocks each of downsampled U and downsampled V (since these planes are half as wide). All the Y blocks are processed, then all the U, and finally all the V blocks. Within each colour plane, we go across a row of blocks first, then down to the next row. Note that 1 row of blocks is 8 pixels high.

 

The block in Figure 3(a) is nearly uniform, so after DCT, in Figure 3(c2), there is a large value for frequency 0 and small values for the high frequencies. After quantization, we end up with primarily zeros, which can be stored efficiently. Many images contain nearly uniform blocks, so the DCT followed by quantization provides a way to represent the image with much less data. Although this leads to reduced quality of the decompressed image, the significant features of the original block are still preserved.

 

The block in Figure 3(b) contains more variation due to the details of the motorcycle’s wheel. The DCT and quantization results are shown in Figures 3(d2) and (d3), respectively. In the presence of more details (captured by the higher frequencies), more data is needed to represent the block (compared to the block from Figure 3(a)). Even so, the amount of data in Figure 3(d3) is reduced from that required to represent the intensity of each pixel explicitly (the block in Figure 3(b)).

DCT (c1→c2 and d1→d2) and quantization with Q0 (c2→c3 and d2→d3) of 2 blocks of Y data.

Lossless Coding

In the final encoding stage, we need a mechanism for efficiently storing the data after DCT and quantization. We start with a simple 8-byte header, which indicates the image’s width and height and which quantization matrix was used. After this, the losslessly coded blocks are packed into the bitstream in the same order described at the top of this page. To losslessly code one block, the elements of the quantized matrix Z (defined above) are read out of the block in a specific order, known as the scan pattern.

The scan pattern.
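The exact .mic17 scan order is given by the figure above (Figure 4). As an illustration only, the array below lists the conventional zig-zag order used by JPEG, which follows the same low-to-high-frequency principle discussed next; the .mic17 pattern may differ.

```c
/* One common zig-zag scan order (the one used by JPEG), given as the
   row-major index of each successive coefficient in an 8x8 block.
   Figure 4 defines the actual .mic17 pattern. */
static const int scan_order[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};
```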

Many images naturally have small values in the high frequencies, so after computing the DCT and then quantizing, we typically have many zeros in the lower right part of the matrix. The above scan pattern, which is standard for many mainstream video/image compression schemes, deals with the low frequencies first (upper left) and the high frequencies last (lower right). This is beneficial because, in this order, we can get many consecutive zeros (from the high-frequency part), which can be efficiently coded. For each 8x8 block, as we follow the order of the scan pattern, values are encoded as follows, with a variable-length prefix-coded header indicating what data will follow (higher priority is given to the cases higher in the list):

1. Only zeros remain in the block: output the ZEROS_TO_END header (bits 111).
2. A run of zeros exists, but some nonzero coefficients still remain in the block. Output the ZERO_RUN header (bits 110), and then:

  • If the run of zeros is 8 or more long, output the 3 bits 000; now the run of zeros will be 8 elements shorter, so repeat step 2 if necessary.

  • If the run of zeros is less than 8 in length (i.e., 1-7 inclusive), output a 3-bit unsigned number indicating the length of the run.

3. A run of ones exists. Output the POSITIVE_ONE_RUN header (bits 101), and then:

  • If the run of ones is 4 or more in length, output the 2 bits 00; now the run will be 4 elements shorter, so repeat step 3 if necessary.

  • If the run of ones is less than 4, output a 2-bit unsigned number to indicate the length.

4. Apply the same rules as in step 3 to runs of -1, but using the NEGATIVE_ONE_RUN header (bits 100).
5. If there is a value between -8 and 7 inclusive, output the 4_BIT header (bits 01) followed by the 4-bit signed value. Note that -1, 0, and +1 are always encoded as runs (never here).
6. If there is a value between -256 and 255 inclusive, output the 9_BIT header (bits 00) followed by the 9-bit signed value.
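The rules above map almost directly into code. Below is a minimal sketch of a block encoder in C, assuming the 64 quantized values have already been read out in scan order; the BitWriter and all identifiers are illustrative and not taken from any .mic17 reference implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal bit-writer: packs bits MSB-first into a zero-initialized buffer.
   Sized for demonstration only. */
typedef struct {
    uint8_t buf[4096];
    size_t  bitpos;
} BitWriter;

static void put_bits(BitWriter *bw, uint32_t value, int nbits) {
    for (int i = nbits - 1; i >= 0; i--) {
        size_t byte = bw->bitpos >> 3, bit = 7 - (bw->bitpos & 7);
        if (value & (1u << i))
            bw->buf[byte] |= (uint8_t)(1u << bit);
        bw->bitpos++;
    }
}

/* Encode one 8x8 block whose 64 quantized values are already in scan order.
   Header bit patterns follow rules 1-6 above. */
static void encode_block(BitWriter *bw, const int16_t z[64]) {
    int i = 0;
    while (i < 64) {
        int end = i;                            /* extent of a zero run, if any */
        while (end < 64 && z[end] == 0) end++;

        if (z[i] == 0 && end == 64) {           /* rule 1: ZEROS_TO_END (111) */
            put_bits(bw, 0x7, 3);
            return;
        } else if (z[i] == 0) {                 /* rule 2: ZERO_RUN (110) */
            int run = end - i;
            while (run >= 8) { put_bits(bw, 0x6, 3); put_bits(bw, 0, 3); run -= 8; }
            if (run > 0)     { put_bits(bw, 0x6, 3); put_bits(bw, (uint32_t)run, 3); }
            i = end;
        } else if (z[i] == 1 || z[i] == -1) {   /* rules 3-4: runs of +1 / -1 */
            uint32_t hdr = (z[i] == 1) ? 0x5 : 0x4;  /* 101 or 100 */
            int16_t  val = z[i];
            int run = 0;
            while (i + run < 64 && z[i + run] == val) run++;
            i += run;
            while (run >= 4) { put_bits(bw, hdr, 3); put_bits(bw, 0, 2); run -= 4; }
            if (run > 0)     { put_bits(bw, hdr, 3); put_bits(bw, (uint32_t)run, 2); }
        } else if (z[i] >= -8 && z[i] <= 7) {   /* rule 5: 4_BIT (01) */
            put_bits(bw, 0x1, 2);
            put_bits(bw, (uint32_t)z[i] & 0xF, 4);   /* 4-bit two's complement */
            i++;
        } else {                                /* rule 6: 9_BIT (00) */
            put_bits(bw, 0x0, 2);
            put_bits(bw, (uint32_t)z[i] & 0x1FF, 9); /* 9-bit two's complement */
            i++;
        }
    }
}
```

As a consistency check, feeding this sketch the scanned sequence 127, -1, -2, 0, 0, ... reproduces the worked example below: 00 001111111, then 100 01, then 01 1110, then 111.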

The figure below shows the coding of the final (quantized) blocks from Figure 3; in both cases, the blocks are scanned according to Figure 4. For the first block, the first value (127) needs to be coded as 9-bit signed, so we output 00 (the header) followed by 001111111 (the 9-bit signed representation of 127). The next value is -1, which we encode as a run of length 1, so we output 100 (header) followed by 01 (length-1 run). The next value is -2, which is representable in 4 bits signed. Following a similar approach, we output 01 (header) and then 1110 (-2 in 4-bit signed). Now only zeros remain in this block, so we add 111 (the ZEROS_TO_END header) to the bitstream to finish this block.

Lossless coding of each block from Figure 3. Header bits are coloured red for illustrative purposes only.

Coding of the second block proceeds in the same way; however, more data is required due to the presence of more nonzero values and the typically larger magnitude of the values. As before, encoding proceeds until the last nonzero value in the scanned sequence (-1) is reached, after which a ZEROS_TO_END header is placed in the bitstream. When the last block of the downsampled V plane has been coded, zero padding ensures the file ends at the end of a 16-bit word (for compliance with the 16-bit wide memory we use).

Image Decompression

In this section, the full decompression process is performed for an image of size 16 x 16, to illustrate the flow of data from one stage to another and to elaborate the internal details of each step in the decompression. Since the aim of this project is the hardware implementation of the decompression specification, the decompression uses only fixed-point arithmetic, which has a simpler hardware design and generally uses fewer logic resources than floating point.
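As a small illustration of the fixed-point style, here is a sketch assuming a Q16 format (16 fractional bits); this format is chosen purely for the example, and each stage of the real design chooses its own bit widths.

```c
#include <stdint.h>

/* Fixed-point multiply sketch: a fractional constant is scaled by 2^16 at
   design time, multiplied as an integer, and the product is shifted back
   down. Q16 here is an illustrative choice. */
static inline int32_t fixed_mul_q16(int32_t a, int32_t coeff_q16) {
    int64_t product = (int64_t)a * coeff_q16; /* full-precision product */
    return (int32_t)(product >> 16);          /* discard fractional bits */
}

/* Example: 1.596 * 100 using the Q16 constant round(1.596 * 65536) = 104595.
   fixed_mul_q16(100, 104595) == 159 (true value 159.6, truncated). */
```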

Lossless Decoding and Dequantization (Milestone 3 in Hardware)

The reference image example.ppm to be compressed.


The compressed bitstream example.mic17; the order of the bytes is across first and then down.


Lossless decoding of a block based on the prefix code.


IDCT (Inverse Discrete Cosine Transform) (Milestone 2 in Hardware)

The result after IDCT.

Upsampling and Colourspace Conversion (Milestone 1 in Hardware)

The output of IDCT is the downsampled YUV data.

The result after horizontally upsampling U and V.


The final decompressed RGB image after colourspace conversion.

Final decompressed images using quantization matrices Q0 (a) and Q1 (b), and the original image.

Project Guidelines

The design task of implementing the overall decoder has been partitioned into three milestones, each of which can be completed within one week. As part of the validation methodology, implementation proceeds from the VGA display backward to the compressed input. Starting from the display end enables a visual progress check at each milestone, thereby providing the framework for incremental integration of each newly designed module into the overall system.

Overview of the upsampling and colourspace conversion hardware implementation (milestone 1).

The Software Model and its Integration in the Design Process

Starting with a .mic17 file, there are three separate pieces of source code, one for each milestone. The input/output files are organized exactly like the SRAM in hardware, that is, 2^18 = 262144 locations with 16 bits per location. Each program reads data from a software model of the SRAM, performs the necessary computation, and then writes the data back to the SRAM software model. The decoding process starts at milestone 3, where the source code “decode_m3” takes a .mic17 file and produces a .sram_d2 file. Continuing into milestone 2, the source code “decode_m2” takes a .sram_d2 file and produces a .sram_d1 file. Finally, for milestone 1, “decode_m1” takes a .sram_d1 file and produces a .sram_d0 file and a .ppm file.

 

By separating the decoding process, we can do part of the decoding in software and part of it in hardware. The decoding software handles the first part of the decoding task, and the intermediate data can be passed to a testbench or to the board, where the hardware takes over the remainder of the decoding task to produce the image. The testbench can provide detailed feedback, which aids in tracing implementation errors, while the board provides less feedback but a full practical test. For example, to validate that milestone 1 works correctly in hardware, the .sram_d1 file can be used to initialize the SRAM, and after the hardware simulation finishes, the .sram_d0 file can be used to check that the contents of the SRAM are correct.

Milestone 1: Upsampling and Colourspace Conversion

The figure below shows the memory layout used for milestone 1. The last portion of the memory holds the completed RGB image, stored in {R0 G0}, {B0 R1}, {G1 B1} … format (as in lab 5 experiment 1). This is where the decompressed image will reside for all three milestones and, thus, for the completed project as well. The setup differs from the completed project in that downsampled YUV data is stored in the SRAM as decoder stimuli, since the earlier parts of the decoder are not yet implemented. By default (from the backbone in lab 5), the UART SRAM interface writes data into the SRAM starting at address 0, and the VGA SRAM interface starts reading from address 0. Clearly, the VGA SRAM interface must be modified to read the RGB data from the correct segment, as indicated in Figure 17. Also, the UART receiver from lab 5 experiment 4 strips the header from the .ppm file; this header stripping is no longer needed here. A sketch of the pixel packing is shown below.
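As a concrete illustration of the {R0 G0}, {B0 R1}, {G1 B1} layout, this sketch packs two decoded pixels (6 bytes) into three 16-bit SRAM words, assuming the first listed component occupies the upper byte of each word.

```c
#include <stdint.h>

/* Pack two RGB pixels into three 16-bit SRAM words in
   {R0 G0}, {B0 R1}, {G1 B1} order (illustrative sketch of the
   layout described above). */
void pack_two_pixels(uint8_t r0, uint8_t g0, uint8_t b0,
                     uint8_t r1, uint8_t g1, uint8_t b1,
                     uint16_t words[3]) {
    words[0] = (uint16_t)((r0 << 8) | g0); /* {R0 G0} */
    words[1] = (uint16_t)((b0 << 8) | r1); /* {B0 R1} */
    words[2] = (uint16_t)((g1 << 8) | b1); /* {G1 B1} */
}
```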

Memory layout for milestone 1.

Instantiation of a single 32-bit multiplier.


Resource-inefficient FIR filter implementation.

Resource-efficient FIR filter implementation.
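For illustration, the sketch below shows the general shape of the upsampling (interpolation) filter in C: even outputs copy an input sample, and odd outputs are interpolated by a symmetric FIR filter. The tap values are placeholders borrowed from H.264's half-sample filter; the actual .mic17 interpolation coefficients are defined in the project specification.

```c
#include <stdint.h>

#define TAPS 6
/* HYPOTHETICAL placeholder taps (H.264 half-sample filter), divided by 32. */
static const int coeff[TAPS] = { 1, -5, 20, 20, -5, 1 };

static uint8_t clip8(int v) { return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v)); }

/* Horizontal 1:2 upsampling of one row. */
void upsample_row(const uint8_t *in, uint8_t *out, int in_width) {
    for (int x = 0; x < in_width; x++) {
        out[2 * x] = in[x];                      /* even sample: pass through */
        int acc = 0;
        for (int t = 0; t < TAPS; t++) {
            int idx = x + t - 2;                 /* taps centred on the gap   */
            if (idx < 0) idx = 0;                /* replicate at the borders  */
            if (idx >= in_width) idx = in_width - 1;
            acc += coeff[t] * in[idx];
        }
        out[2 * x + 1] = clip8((acc + 16) >> 5); /* round and divide by 32 */
    }
}
```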

Resource-inefficient colourspace conversion implementation.

Resource-efficient colourspace conversion implementation.

Milestone 2: IDCT

The figure below shows the overview for milestone 2, which is the implementation of the IDCT. As before, the shaded portions of the figure indicate what needs to be built or modified. In particular, notice that the FSM and the logic in the top state machine (bottom portion of the figure) do not need to be modified since, in the previous milestone, you have made provision for the decoder circuitry (all of which will eventually reside in the block currently labelled “Partial Decoder”) to access the SRAM.

As discussed, the software model can be used to generate the necessary initialization and verification data for the SRAM. This data is used by the testbench for validation, and it can be sent to the board with a working design to get visual confirmation that the IDCT, upsampling, and colourspace conversion are working together.

Overview of the IDCT implementation (milestone 2).

In the figure below, the memory map for milestone 2 is provided, which is similar to that for milestone 1. The pre-IDCT data is stored as 1 value per memory location, whereas the post-IDCT data (YUV) is stored as 2 values per memory location. Note that after IDCT is complete, the RGB data computed in milestone 1 will overwrite some of the pre-IDCT data (used to initialize the SRAM for milestone 2).

As in milestone 1, the blocks are divided into segments of Y, segments of (downsampled) U, and segments of (downsampled) V. However, the way the data is accessed is different. In milestone 2, we want to fetch an entire 8x8 block of pre-IDCT data, perform IDCT, and write it back to the YUV memory space. To determine the addressing for writing post-IDCT samples back to the memory, observe Figure 12 and note that the block from block row 0, block column 0 of the Y plane (Y 0,0) contains data from rows 0, 1, …, 7 of the image, all in columns 0 to 7 of the image. Extending this to the case of a 320x240 image, and noting that we have two samples per address, our write addressing for block 0 should be: 0 … 3, …, 1120 … 1123 (as shown in Figure 26). For the second block of the first block row (Y 0,1), we have: 4 … 7, …, 1124 … 1127. Once all the Y blocks have been processed, we proceed to U and V, with the important distinction that these two planes are downsampled and thus contain half the number of columns that Y contains. Thus U 0,0 resides at addresses (38400+): 0 … 3, …, 560 … 563. The last block, V 29,19, resides at addresses (57600+): 18636 … 18639, …, 19196 … 19199.
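This addressing can be captured in a small helper function. The sketch below (names hypothetical) returns the first write address for one row of a block, with two post-IDCT samples per address; the offsets and strides follow the memory map just described.

```c
/* First write address for row `row` (0-7) of the block at block row `br`,
   block column `bc`, for a 320x240 image with two samples per address.
   plane: 0 = Y, 1 = U, 2 = V. */
unsigned block_row_address(int plane, int br, int bc, int row) {
    /* Y plane: 320 samples/row -> 160 addresses/row, base offset 0.
       U, V planes: 160 samples/row -> 80 addresses/row, offsets 38400, 57600. */
    unsigned base   = (plane == 0) ? 0u : (plane == 1) ? 38400u : 57600u;
    unsigned stride = (plane == 0) ? 160u : 80u;  /* addresses per image row */
    return base + (unsigned)(br * 8 + row) * stride + (unsigned)bc * 4;
}

/* e.g. block_row_address(2, 29, 19, 7) == 57600 + 19196 = 76796, so words
   76796..76799 hold the last four addresses of the V plane, as above. */
```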


Memory layout for milestone 2.

Reading pre-IDCT values and writing back YUV data to the external SRAM


MAC unit for block IDCT.


Project Report

Introduction

The COE 3DQ5 project focuses on implementing and integrating an image decompressor in hardware, covering lossless decoding, dequantization, the inverse signal transform, interpolation, and colourspace conversion across three defined milestones. The project was completed under specific constraints while maximizing the efficiency of the hardware components. The final product integrates the full pipeline: the compressed image is losslessly decoded, the quantized data is restored, and the IDCT is performed through matrix multiplication to recover the downsampled YUV representation. Then, through the upsampling and colourspace conversion of Milestone 1, the resulting RGB values are written to the SRAM for display. All three milestone implementations respected the multiplier usage requirements discussed in the implementation details. In the bigger picture of the Digital Systems Design course, the Verilog design is driven and monitored by a testbench to check design precision and catch errors, and CAD tools are used to simulate circuit behaviour, perform timing analysis, and map the circuit, all of which is then configured and loaded onto the FPGA.

Implementation Details

Milestone 1

The first milestone consists of reading from the external SRAM and utilizing 4 multipliers, each with a minimum of 75% utilization, to perform the even and odd U and V interpolation calculations and the RGB matrix calculations. The milestone takes YUV data as input and converts it to RGB, written out in the repeating order R0G0, B0R1, G1B1.

FSM with Common Cases Colour Space Conversion and Interpolation

Milestone 1 Image Processing Software Project

Algorithm: The general approach for Milestone 1 was to first request the SRAM addresses for the U, V, and Y values, reading two values at a time from the SRAM (e.g. U0 and U1 in lead_4). These values were stored in registers to be used in the subsequent common-case cycle for calculations with the 4 multipliers. Within the common case, the required calculations for the even and odd U and V values, and the RGB matrix calculations, were completed with the multipliers. While the calculations proceeded during the common-case cycles, a similar pattern of reading from the SRAM was followed, and the acquired YUV data was stored in registers so calculations could begin immediately in the next common-case cycle. A software sketch of the per-pixel conversion is shown below.
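The fixed-point constants in this sketch are the widely used BT.601 values scaled by 2^16 (e.g. round(1.164 × 65536) = 76284); the exact coefficients used in the project are those given in the .mic17 specification.

```c
#include <stdint.h>

static uint8_t clip8(int32_t v) { return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v)); }

/* YUV -> RGB for one pixel, fixed point with a 16-bit fractional part.
   Constants are the common BT.601 ones scaled by 2^16; the specification
   defines the exact values used in the project. */
void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v,
                uint8_t *r, uint8_t *g, uint8_t *b) {
    int32_t yy = 76284 * (y - 16);
    int32_t uu = u - 128, vv = v - 128;
    *r = clip8((yy + 104595 * vv) >> 16);
    *g = clip8((yy -  25624 * uu - 53281 * vv) >> 16);
    *b = clip8((yy + 132251 * uu) >> 16);
}
```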


Efficiency, Timing and Learning: To create an efficient design, YUV values were read in the cycle before they are used and preloaded into registers. This method reduced the total clock cycles relative to previous design iterations while keeping register usage minimal. Multiplier utilization in the common case can be calculated as ((5cc × 4mult) + (1cc × 2mult)) / (7 × 4) = 78.57%. Since the lead-in does not repeat, the overall utilization is represented as (8 lead-in + (number of common-case repeats) × ((5cc × 4mult) + (1cc × 2mult)) / (7 × 4)). As seen in figures 3 and 4, the number of registers increases from 369 to 938. Latency can be represented as: (7 × 160) + 8 = 1128 cycles per row; 1128 × 240 = 270,720 clock cycles.


Milestone 1 provided a deeper understanding of project setup, debugging, and algorithm development. The algorithm development and project setup gave an in-depth view of what goes into creating and starting a project, while debugging built experience with indexing, clipping, sign extension, and more.

Clipping and Multiplier result shifting

Starting Register Count

Ending Register Count

Milestone 2

Milestone 2 focuses on the IDCT process through 4 steps: fetching S’, computing T, computing S, and writing S to the SRAM. The S’ values are fetched from the SRAM and stored in DP0 as a lead-in lasting 64 clock cycles. The computation T = S’ × C uses the C and S’ values, and the result is stored in DP1. The computation of S is similar: S = Cᵀ × T. For T, using 2 S’ values per common-case cycle yields 4 results every 8 cycles; the same holds for S, which uses one T value per common-case cycle. After the S calculations are complete, the YUV values are concatenated and written to the SRAM, to become the input to Milestone 1. This design was accomplished with 64 lead-ins and 16 common cases. Additionally, the IDCT equation requires a shift by 8 and by 16 for T and S, respectively. Latency can be represented as: (64 × 160) + 64 = 10304 cycles; 10304 × 240 = 2,472,960 clock cycles. A sketch of the two matrix passes follows below.
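The two matrix passes can be sketched in C as follows, with the DCT basis coefficients pre-scaled by 2^8 and one right shift per pass (8 after T, 16 cumulative after S, interpreting the shifts described above as cumulative). All names and bit widths are illustrative, not taken from the project's Verilog.

```c
#include <math.h>
#include <stdint.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 8

static int32_t C_fixed[N][N]; /* DCT basis matrix pre-scaled by 2^8 */

static void build_c_fixed(void) {
    for (int i = 0; i < N; i++) {
        double k = (i == 0) ? 1.0 / sqrt(8.0) : 0.5;
        for (int j = 0; j < N; j++)
            C_fixed[i][j] = (int32_t)lround(256.0 * k * cos((2 * j + 1) * i * M_PI / 16.0));
    }
}

/* IDCT of one block: S = C^T * (S' * C), integer-only arithmetic. Each
   pass removes the 2^8 coefficient scale with a right shift. */
void idct_block(const int32_t Sp[N][N], int32_t S[N][N]) {
    int32_t T[N][N];

    /* First pass: T = (S' * C_fixed) >> 8 */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int64_t acc = 0;
            for (int k = 0; k < N; k++) acc += (int64_t)Sp[i][k] * C_fixed[k][j];
            T[i][j] = (int32_t)(acc >> 8);
        }

    /* Second pass: S = (C_fixed^T * T) >> 8 (16 cumulative) */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int64_t acc = 0;
            for (int k = 0; k < N; k++) acc += (int64_t)C_fixed[k][i] * T[k][j];
            S[i][j] = (int32_t)(acc >> 8);
        }
}
```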

Layout of Design for SRAM, DP Memory, and Matrix Multiplication

Approach and Timing Analysis:
Our initial Milestone 2 implementation was replaced by a more efficient and component-effective method after discussion with the professor. The initial implementation used excessive registers and underutilized the provided DP memory. A newer design, with a slightly longer lead-in but a more consistent and memory-effective structure, was implemented as an exercise in improved design and learning. Multiplier utilization is approximately 99.6%, which aligns with expectations: across an extremely large number of clock cycles, all but approximately 64 lead-in cycles perform 4 multiplications.

Layout of Design for SRAM, DP Memory, and Matrix Multiplication 2
Resource Usage and Critical Path

As we progressed through the project, we learned of and found improved methods of implementation. We restarted Milestone 2 after finding register overuse and underutilized DP memory; this created an awareness that we should maximize the use of the provided resources rather than spend additional, potentially unnecessary ones. Although Milestone 2 was redone, certain flags and buffers could be reused for a more efficient implementation throughout the project. Beyond registers, it is likely that designs using the multipliers and delays more effectively could be conceived; this would reduce the total number of clock cycles for each milestone, providing high utilization and quick output. The path in the timing analyzer with the highest data delay, excluding inputs and outputs to the RAM top module, is the critical path. This identifies V prime as the critical path, which matches expectations, as that path involves a large range of logical operators.

Conclusion

To conclude, the project for the course COMPENG 3DQ5: Digital System Design provided a great opportunity to learn through a meaningful engineering project. The hardware implementation of an image decompressor required a great deal of thinking, designing, re-designing, debugging, learning, and working with one another. Not only have we improved our fundamental knowledge of digital systems, but we have also learned about time management, dividing up design tasks, and communicating effectively as a team. Given more time and improved time management, we believe we could have fully completed the project with integration and developed an even greater understanding of digital systems design. Although the time commitment was intense, the learning experience and opportunity made this project a great accomplishment. The completed Milestone 1 submission was submitted 18-11-2023, labelled “Official Done Milestone 1 (FR)”.

For Milestone 2, all Y calculations were correct with no mismatches, while the U and V calculations contained some mismatches due to indexing issues after the first block of U. This can be verified through the testbench and simulation results with the file from the latest push to GitHub.

 
