Tuesday, March 3, 2020

Finally, the presentation poster is shown in Figure 1. The main objectives of this project were to build a gesture recognition system on an FPGA board and to use the FPGA to increase image processing speed, since an FPGA can achieve roughly 10 times the processing speed of a desktop CPU [1]. The actual implementation is shown as a flow chart, and the results on both the FPGA and the CPU are presented in the results section to show the visible difference between the two platforms. Lastly, future enhancements could include a different camera module and more functions for this system.
Figure 1. Poster of presentation
After that, the final demo of this project can be seen in the following video. All of the video processing and recognition results, as well as applications of those results, are shown in this part.




Reference:

[1] S. Asano, T. Maruyama and Y. Yamaguchi, "Performance comparison of FPGA, GPU and CPU in image processing," 2009 International Conference on Field Programmable Logic and Applications, Prague, 2009, pp. 126-131, doi: 10.1109/FPL.2009.5272532.
After the main work was done, a comparison was needed to measure the actual speedup of image processing on the FPGA board. The baseline algorithm had to be the same as the one implemented on the FPGA, which required building a new project on a different platform, MATLAB. The image from the laptop's front camera is first binarized; then median filter, erosion and dilation modules are applied to reduce video noise. The image processing algorithms are shown below.
Figure 1. Algorithm of Erosion

Figure 2. Algorithm of Dilation

Figure 3. Algorithm of Median Filter

The following video shows the final result of MATLAB project.
Moreover, this video shows that the delay of CPU-based image processing is significant: its processing speed cannot even reach 1 frame per second.

Compared with the performance on the CPU, the FPGA shows a significant enhancement, since it can handle real-time video processing at 30 frames per second. The video below shows the actual performance of the FPGA-based video processing system.


In this video, the system recognizes the hand correctly, although the accuracy is influenced by noise from the OV7670 module. Moreover, the system also obtains the position and size of the hand. This consumes more time; nevertheless, it still maintains a processing speed of 30 frames per second. Hence, these two videos demonstrate the high performance of the FPGA in image processing compared with a desktop CPU.

Furthermore, the gap between the desktop CPU and the FPGA can be quantified; the final result is shown in the following table.


Table 1. Quantified result of devices

Sunday, February 23, 2020

Application:

A simplified 'Fruit Ninja' has been built in this project: a falling square represents the falling 'Fruit' and another square represents the 'Knife'. A counter increments by 1 when the 'Knife' reaches the 'Fruit'. In this application, the position of the 'Knife' is determined by the position of the hand, while the initial position and falling speed of the 'Fruit' are fixed, because this game is only a simplified version.
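The game logic above can be sketched in Python (the real design is in hardware logic; names such as `step` and the specific constants are illustrative, not taken from the project):

```python
# Behavioural sketch of the simplified 'Fruit Ninja' described above.
# All constants are assumed values, not from the actual design.

FRUIT_SPEED = 4          # fixed falling speed (pixels per frame)
FRUIT_START = (160, 0)   # fixed initial position of the 'Fruit'
HIT_SIZE = 20            # side length used for the overlap test

def overlap(ax, ay, bx, by, size=HIT_SIZE):
    """Axis-aligned square overlap test."""
    return abs(ax - bx) < size and abs(ay - by) < size

def step(hand_x, hand_y, fruit, score):
    """Advance one frame: the knife follows the hand, the fruit falls,
    and the counter increments on a hit."""
    knife_x, knife_y = hand_x, hand_y      # knife position = hand position
    fx, fy = fruit
    fy += FRUIT_SPEED                      # fruit falls at a fixed speed
    if overlap(knife_x, knife_y, fx, fy):  # the 'Knife' reached the 'Fruit'
        score += 1
        fx, fy = FRUIT_START               # respawn at the fixed start
    return (fx, fy), score
```

In the hardware version, the counter and positions would be registers updated once per frame rather than Python variables.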

Scoreboard:

Due to the lack of a built-in function for displaying numbers, this function had to be completed as a separate part. This part is essentially plotting work, so it is not difficult, but plotting each digit consumes considerable time.
Figure 1. Example of plotting '0'


Moreover, another important part is the converter from binary to decimal, so the number can be plotted on the 'scoreboard' in the lower right corner. However, this differs from simple BCD: it applies a more elaborate conversion in which the value passes through a shift register one bit at a time, and 2'b11 (decimal 3) is added to any 4-bit BCD digit that has reached 5 or more before the next shift. An example of this BCD algorithm is shown below.
Figure 2. BCD with shift register [1]

Then, the implementation can be seen in figure 3.
Figure 3. Implementation of BCD
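The shift-and-add-3 conversion (the "double dabble" algorithm) can be sketched as a behavioural model in Python; this illustrates the logic of the figures, not the actual Verilog:

```python
def binary_to_bcd(value, digits=2):
    """Double-dabble conversion: shift left one bit at a time; before
    each shift, add 3 (2'b11) to any 4-bit BCD digit that is 5 or more."""
    width = value.bit_length() or 1
    bcd = 0
    for i in range(width - 1, -1, -1):
        # Adjust every BCD digit that is >= 5 by adding 3
        for d in range(digits):
            if ((bcd >> (4 * d)) & 0xF) >= 5:
                bcd += 3 << (4 * d)
        # Shift in the next binary bit, MSB first
        bcd = (bcd << 1) | ((value >> i) & 1)
    return bcd
```

For example, `binary_to_bcd(45)` yields `0x45`, i.e. the two BCD digits '4' and '5' that the plotting module can draw.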
Scaling:

A pink box changes its size with the changing size of the hand. The original plan was to display a picture of Lena to demonstrate the scaling function; however, there was not enough storage for the image. This part only uses the position variables and is very simple, so it will not be described in detail.

Reference:

[1] N. McDonald, "Binary to BCD Conversion Algorithm." [Online]. Available: https://my.eng.utah.edu/~nmcdonal/Tutorials/BCDTutorial/BCDConversion.html
Recognition module

Determining the position of the hand:

In this part, the requirement is to locate the hand with a rectangle. This step is also a basic requirement for the subsequent recognition, because the scaling and application functions both require the position and size of the hand.

The algorithm for finding the position of the hand is unusual but effective. Its core is the x and y coordinates and the value of each pixel. To determine a rectangular area around the hand, the minimum and maximum coordinates on the x and y axes must be found, so there are x_min, x_max, y_min and y_max variables. The VGA display module provides lcd_x, lcd_y and image_bit to represent the x and y coordinates and the RGB data of the current pixel. First, at the beginning of each frame, x_min is initialised to the maximum horizontal coordinate, x_max to the minimum horizontal coordinate, and likewise for y_min and y_max. Then, whenever a pixel's value is 8'b1111_1111, its x and y coordinates are compared with these 4 position variables and assigned where appropriate; for example, if the x coordinate is smaller than x_min, it is assigned to x_min. Because lcd_x and lcd_y both count up from the start of the frame, this guarantees that the four variables end up holding the true minimum and maximum coordinates of the frame. The implementation of this algorithm can be seen in the following figure.
Figure 1. Implementation of finding position of hand

Finally, the maximum and minimum values are first stored and only output to the VGA display at the end of each frame, because the final values can only be determined then; otherwise, the rectangle would sweep from the top to the bottom of the screen. The actual data operation is shown in Figure 2.
Figure 2. Operation of final position variables
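A behavioural Python model of this min/max scan (variable names follow the text; the frame is assumed to be already binarised to 0/1 pixels):

```python
def hand_bounding_box(frame):
    """Scan one binary frame (rows of 0/1 pixels) and return the
    (x_min, x_max, y_min, y_max) rectangle around the hand, mimicking
    the per-frame register initialisation described above."""
    height, width = len(frame), len(frame[0])
    # Start of frame: the min registers hold the maximum coordinate
    # and the max registers hold the minimum coordinate.
    x_min, x_max, y_min, y_max = width - 1, 0, height - 1, 0
    for lcd_y, row in enumerate(frame):
        for lcd_x, pixel in enumerate(row):
            if pixel:  # pixel value is 8'b1111_1111 after binarisation
                x_min = min(x_min, lcd_x)
                x_max = max(x_max, lcd_x)
                y_min = min(y_min, lcd_y)
                y_max = max(y_max, lcd_y)
    return x_min, x_max, y_min, y_max
```

In hardware the same comparisons run pixel by pixel as the VGA scan progresses, with the result latched at the end of the frame.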

Saturday, February 22, 2020

Image Processing

YCbCr

Two color spaces are used in this project: RGB565 and YCbCr. RGB is used for image output and YCbCr for image processing. First, the RGB signal is converted to the YCbCr color space according to the ITU-R BT.601 conversion [1]; then the gray image and skin color can be extracted for the following processing steps. For example, the Y value alone represents the gray level of the image, and thresholds on the YCbCr components can be used to extract the skin color.
Figure 1. Conversion from RGB to YCbCr
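A floating-point sketch of the BT.601 conversion follows; the fixed-point coefficients used on the FPGA and the exact skin thresholds in the project may differ, so the threshold values below are assumptions:

```python
def rgb_to_ycbcr(r, g, b):
    """BT.601 conversion (full-range form, floating point for clarity;
    the FPGA implementation would use fixed-point multiplies and shifts)."""
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return round(y), round(cb), round(cr)

def is_skin(y, cb, cr):
    """Hypothetical skin-colour threshold on the chroma channels;
    the exact bounds used in the project may differ."""
    return 77 <= cb <= 127 and 133 <= cr <= 173
```

The Y channel alone gives the gray image, while the Cb/Cr window selects skin-coloured pixels for binarisation.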

After that, a lot of noise still remains, so a median filter is applied to eliminate it. Each pixel is sent into the median filter entry by entry, and the value of each entry is replaced by the median of its neighbouring entries; here the neighbours are the elements of a 3*3 matrix, and this matrix slides entry by entry until the end of the image. Through this method the influence of unexpected noise can be reduced considerably. For instance, if the input signal is (2, 3, 80, 6), the output of the median filter is (y1, y2, y3, y4) with y1=med(2,3,80)=3, y2=med(3,80,6)=6, y3=med(80,6,2)=6, y4=med(6,2,3)=3, so the final output is (3, 6, 6, 3) and the unexpected noise value 80 is eliminated. However, there is a problem related to the edge properties of the image: the median filter produces a smoother image, so edge sharpness is lost.
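The worked example can be reproduced with a short sketch, assuming the window wraps around at the ends of the signal as in the example:

```python
def median3(a, b, c):
    """Median of three values."""
    return sorted((a, b, c))[1]

def median_filter_1d(signal):
    """1-D median filter with window length 3, wrapping at the ends
    (matching the (2, 3, 80, 6) example in the text)."""
    n = len(signal)
    return [median3(signal[i], signal[(i + 1) % n], signal[(i + 2) % n])
            for i in range(n)]
```

`median_filter_1d([2, 3, 80, 6])` returns `[3, 6, 6, 3]`, removing the outlier 80 exactly as in the example.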

In this implementation, the median filter has a 3*3 window that moves from the first entry to the last. However, computing within this 3*3 window is a problem, because there is no register storing the data of each pixel; hence, a 3*3 shift-register structure has to be built.

The construction of the 3*3 matrix was based on two shift registers and nine 8-bit registers. The output terminal of the first shift register was connected to the input terminal of the second one, as demonstrated in Figure 2.

Figure 2. Shift registers design
In the shift register, the value flows to the next byte at the positive edge of the clock signal. Since there are 320 pixel values in one row of a frame, the depth of the shift register was set to 320. Consequently, a value entering the first shift register is input to the second one after 320 positive edges of the clock signal.

Nine 8-bit registers were designed to store the 9 elements of a 3*3 matrix, as indicated in Figure 3. The input data traverses all the registers in the matrix in a specific sequence.

Figure 3. Matrix registers design
With this architecture, every input value moves to the next register in the same row at the positive edge of the clock signal. Similarly, values flow to the next row of the matrix after passing through the corresponding shift register.

As a result, a series of input values moves through all 9 registers in the following sequence: mat_33 -> mat_32 -> mat_31 -> mat_23 -> mat_22 -> mat_21 -> mat_13 -> mat_12 -> mat_11, which is illustrated in Figure 4.

Figure 4. Data moving sequence
The simulation results are shown in Figures 5-7. The input data traversed all 9 registers in the expected sequence, which verified the design.
Figure 5. Data moved through the third row
Figure 6. Data moved through the second row
Figure 7. Data moved through the first row
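The line-buffer arrangement can be modelled with two deques and three 3-stage rows. The buffer depth of width − 3 is an assumption that makes each row of the window lag the previous one by exactly one full line (the text counts the whole per-line delay, including the three matrix registers, as 320 clock edges):

```python
from collections import deque

class Window3x3:
    """Software model of the 3*3 window: two line buffers plus three
    3-stage row registers, following Figures 2-4."""
    def __init__(self, width=320):
        depth = width - 3  # line buffer depth; the 3 row registers make up the rest
        self.buf1 = deque([0] * depth)
        self.buf2 = deque([0] * depth)
        self.rows = [deque([0] * 3) for _ in range(3)]  # rows[2] = newest line

    def push(self, pixel):
        """Clock in one pixel and return the current window as lists."""
        self.rows[2].appendleft(pixel)            # mat_33 -> mat_32 -> mat_31
        self.buf1.appendleft(self.rows[2].pop())  # into the first line buffer
        self.rows[1].appendleft(self.buf1.pop())  # mat_23 -> mat_22 -> mat_21
        self.buf2.appendleft(self.rows[1].pop())  # into the second line buffer
        self.rows[0].appendleft(self.buf2.pop())  # mat_13 -> mat_12 -> mat_11
        self.rows[0].pop()                        # oldest value is discarded
        return [list(r) for r in self.rows]
```

Feeding a raster-ordered stream into `push` makes the three rows hold three vertically adjacent pixel triples, exactly what the median computation needs.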
First, each of the 3 rows is sorted to obtain 3 maximum, 3 median and 3 minimum values. Then the minimum of the 3 maxima, the maximum of the 3 minima, and the median of the 3 medians are selected, and finally these three values are compared to determine the final median, which is assigned to the central entry. The purpose of comparing the maxima, minima and medians again is to ensure the result is the true median, because the maximum value of one row may be less than the minimum value of another [2].
Figure 8. Algorithm of sort
In the implementation, the whole algorithm was divided into 3 parts; however, the content of each module is the same, so only one module is required. The actual circuit of this algorithm is shown below.
Figure 9. Implementation of sort
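This three-stage selection can be checked in Python; the point is that the median of the three selected values equals the true median of all nine pixels:

```python
def median9(window):
    """Median of a 3*3 window using the row-wise network described above:
    sort each row, then take the max of the row minima, the median of the
    row medians and the min of the row maxima; the median of those three
    values is the true median of all nine pixels."""
    rows = [sorted(r) for r in window]
    lo = max(r[0] for r in rows)            # max of the three minima
    mid = sorted(r[1] for r in rows)[1]     # median of the three medians
    hi = min(r[2] for r in rows)            # min of the three maxima
    return sorted((lo, mid, hi))[1]
```

This network needs far fewer comparators than fully sorting nine values, which is why it is popular in hardware median filters.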
Moreover, the eroded data is delivered into the 'dilation' module. The algorithm of this module assigns the central entry of a 3*3 window the value 1 if any entry of the window is 1. Hence, gaps inside an object are filled, making the image more solid and improving the subsequent recognition [3].

Figure 10. Implementation of dilation
After that, the image is delivered into the erosion and dilation modules to complete the image processing. The purpose of erosion is to eliminate extra noise, especially noise occupying only a single pixel: the central entry of a 3*3 window is assigned 1 only if every pixel in the window is 1. Hence, the edge of the image shrinks and the extra noise is removed [3].
Figure 11. Implementation of erosion
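Behavioural models of the two morphological operations on a binary image follow (border pixels are simply left at 0 here, an assumption the hardware may handle differently):

```python
def erode(img):
    """3*3 erosion: the centre is 1 only if every pixel in the window is 1."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(img[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out

def dilate(img):
    """3*3 dilation: the centre becomes 1 if any pixel in the window is 1."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(any(img[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out
```

Applying `dilate(erode(img))` performs the combination described in the text: erosion removes isolated noise pixels, and the following dilation restores the surviving objects to roughly their original extent.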
As the figures above show, dilation and erosion are inverse operations; used together, however, they complement each other: erosion removes the noise, and the following dilation makes the remaining objects solid again. Finally, the entire combination of the 3 modules can be seen in the circuit diagram.
Figure 12. The structure of Video Image Processor module

Reference:

[1] ITU-R, Recommendation ITU-R BT.601-7. Accessed: Feb. 12, 2020. [Online]. Available: https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.601-7-201103-I!!PDF-E.pdf

[2] R. Fisher et al., "Median Filter," 2003. [Online]. Available: https://homepages.inf.ed.ac.uk/rbf/HIPR2/median.htm

[3] "Morphological Digital Image Processing Using FPGA," Shodhganga. [Online]. Available: https://shodhganga.inflibnet.ac.in/bitstream/10603/205519/7/07_chapter%204.pdf

Wednesday, February 19, 2020

VGA Display

Because the output timing of the OV7670 is similar to the VGA display protocol, VGA was selected as the display protocol.

Before implementing the VGA display, the pixel data should first be stored in a dual-port RAM from which the VGA display module reads. The two sides have independent clock signals, so the clock domains of the VGA display and the OV7670 can be bridged without address or clock errors.

To complete this function, an IP core can be applied directly in this part.

Figure 1. IP core of dual RAM
After the dual-port RAM, the VGA display protocol is implemented. First, the basic principle of VGA should be understood. From the timing diagram of the VGA display protocol [1], each frame can be divided into several parts: Horizontal Front Porch, Horizontal Sync Pulse, Horizontal Active and Horizontal Back Porch, plus the corresponding vertical parts; however, only the Horizontal Active and Vertical Active regions can display pixels. The porch parts are mandatory in the VGA protocol because VGA was developed for CRT displays, whose electron beam needs time to retrace to the start of the next line or frame. Hence, the front and back porches for the horizontal and vertical directions must be defined at the beginning of this module.
Figure 2. Basic Timing Diagram of VGA display [1]
According to Figure 2, these horizontal and vertical components can be adjusted to place the active part correctly on the screen; this position management simplifies the implementation. According to the protocol, the Horizontal Front Porch is 16 pixels, the Horizontal Sync Pulse is 96, the Horizontal Back Porch is 48, the Vertical Front Porch is 10 lines, the Vertical Sync Pulse is 2 and the Vertical Back Porch is 33.
Figure 3. VGA Timing Specification [1]
According to the previous information, the basic parameters can be determined. The other display parts are simple to implement, because they only require 'if-else' logic to output pixels in the active region.
Figure 4. Parameters of VGA Timing
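The listed parameters can be cross-checked with a small calculation (the 25.175 MHz pixel clock is the standard value for this mode, not a figure stated in the post):

```python
# VGA 640*480 timing built from the parameters listed above.
H = {"active": 640, "front_porch": 16, "sync": 96, "back_porch": 48}
V = {"active": 480, "front_porch": 10, "sync": 2, "back_porch": 33}

h_total = sum(H.values())   # pixel clocks per line
v_total = sum(V.values())   # lines per frame

PIXEL_CLOCK = 25_175_000    # standard 25.175 MHz pixel clock for this mode
refresh = PIXEL_CLOCK / (h_total * v_total)
print(h_total, v_total, round(refresh, 2))  # 800 525 59.94
```

The totals of 800 clocks per line and 525 lines per frame are what the horizontal and vertical counters in the VGA module count up to.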
Because the monitor uses an HDMI port, this design also requires a converter from VGA to HDMI, which is done with the vga2dvi IP core. The final image can then be displayed on the HDMI monitor.
Figure 5. IP core of convertor from VGA to HDMI

Reference:

[1] S. Larson, "VGA Controller," Mar. 7, 2018. [Online]. Available: https://www.digikey.com/eewiki/pages/viewpage.action?pageId=15925278
Data of a single pixel

From the previous contents, the OV7670 camera module inputs the pixel data to the FPGA board; these data can then be stored and displayed through the VGA port to the monitor. However, there is a problem in the data transmission between the OV7670 module and the FPGA board. According to the OV7670 manual [1], the RGB565 data for a single pixel is divided into two parts, a first byte and a second byte. Hence, the first task is converting the data from 8 bits to 16 bits.

Figure 1. Two bytes combination of single pixel data

In this part, the simplest way is to directly concatenate the two bytes into 16 bits. Hence, a temporary register stores the previous 8 bits and combines them with the next 8 bits. Moreover, because an enable signal is required to write the pixel data to a certain address, the enable signal must also be adjusted.

16_enable is the enable signal for the 16-bit data and 8_enable is the enable signal for the 8-bit data. From the previous information, 8_enable has twice the frequency of 16_enable, and this relationship must be preserved so that the address and pixel data are output synchronously.

To complete the frequency division, the first enable signal must be determined, and this information can be found in the OV7670 manual. According to Figure 1, the first byte appears just after the HREF signal goes high, so the first 8_enable can be taken from HREF. Each subsequent value is the inverse of the previous one; this halves the frequency of the 8_enable signal, yielding 16_enable. Finally, 16_enable should be the AND of this divided signal and HREF, because the data is only valid while HREF is high.

The pixel-data conversion is not difficult; however, only 12 bits of the RGB signal are needed for display, so the highest 4 bits of each RGB channel are combined into a single 12-bit signal.

Figure 2. Implementation of conversion from 8 bits to 16 bits
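The byte combination and the 16-to-12-bit truncation can be sketched as follows (function names are illustrative, not from the design):

```python
def combine_bytes(first_byte, second_byte):
    """Concatenate the two OV7670 bytes into one RGB565 pixel
    (first byte = high byte, following the RGB565 output sequence)."""
    return (first_byte << 8) | second_byte

def rgb565_to_rgb444(pixel):
    """Keep the top 4 bits of each channel for the 12-bit display signal."""
    r = (pixel >> 11) & 0x1F   # 5-bit red
    g = (pixel >> 5) & 0x3F    # 6-bit green
    b = pixel & 0x1F           # 5-bit blue
    return ((r >> 1) << 8) | ((g >> 2) << 4) | (b >> 1)
```

In hardware, both steps are just wiring: a register holds the first byte until the second arrives, and only the top 4 wires of each channel are routed onward.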

Reference:

[1] OmniVision, "OV7670 CMOS VGA (640x480) Camera Chip," Jul. 8, 2002. [Online]. Available: https://www.voti.nl/docs/OV7670.pdf

Sunday, February 16, 2020

This week is the first week of this project. The achievement of this week is completing the configuration of the camera registers and the PMOD input ports.


Actually, the PMOD input port configuration is gathered from GitHub contributions [1] and the official documents [2], and the register configuration is obtained from a website [3].
In this part, the configuration of the OV7670 was completed with the Serial Camera Control Bus (SCCB), an I2C-like serial communication protocol. Moreover, because the FPGA board only has an HDMI display port and an adapter might block the signal, the conversion from VGA to HDMI has to be done on the board. These two parts are explained in the following contents.

1. SCCB/ I2C Data Bus [1]:


Actually, these two protocols are so similar that some programmers treat them as the same thing. The SCCB protocol was applied in this project and is explained in detail below.

SCCB is a simple bidirectional communication protocol for low-speed devices such as the OV7670 camera module. The standard uses a controller called the master device to communicate with slave devices, and a slave may only transmit data after the master has addressed it. Physically, SCCB consists of a serial clock line (SIO_C) and a serial data line (SIO_D). To complete the communication process, several important points should be emphasized.

  • START and STOP signals
Figure 1. The example of START signal, the front straight line
  • Each phase carries 8 bits of data and ends with a Don't-Care bit
Figure 2. Composition of each phase
  • The 3-phase data transmission consists of the ID Address (slave address), the Sub-Address (register location) and the Write Data (content to write)
    Figure 3. Composition of 3-phase transmission cycle
  • The 9th bit, D0, selects the write or read operation; this can be seen in Figure 1.


According to these instructions, the waveform required to configure the OV7670 module can be derived. The 3-phase signal can be laid out as a 32-bit data sequence, and each SIO_C period is divided into 4 parts to create the positive and negative edges. The START condition occurs when SIO_D changes from high to low while SIO_C is high; the STOP condition is the same behaviour except that SIO_D changes from low to high while SIO_C is high.
Figure 4. Total 3-phase signal
Moreover, from the electrical characteristics of the OV7670, the required clock frequency can be obtained. A single-bit transmission cycle has a minimum period of 10 us, so the data-bit frequency must be below 100 kHz. Because the FPGA board provides 100 MHz, the clock must be divided: an 11-bit counter divides it by 2^11 = 2048, giving about 48.8 kHz. The delayed clock signal is obtained from a 20-bit counter (the lower 11 bits for the divider and the higher bits for counting the SIO_D bits).
Figure 5. Electrical characteristics of OV7670
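The 3-phase write transmission can be modelled as a bit sequence; the post packs the phases into a 32-bit shift register, while this sketch models only the 27 bits actually clocked out (the 0x42 slave address is the OV7670's documented write address):

```python
def sccb_write_bits(slave_addr, sub_addr, data):
    """Bit sequence for a 3-phase SCCB write: three 8-bit phases
    (ID address, sub-address, write data), each followed by a
    Don't-Care bit (modelled here as 0). START/STOP are signalled by
    SIO_D edges while SIO_C is high, so they are not part of this list."""
    bits = []
    for byte in (slave_addr, sub_addr, data):
        bits += [(byte >> i) & 1 for i in range(7, -1, -1)]  # MSB first
        bits.append(0)  # 9th Don't-Care bit of the phase
    return bits

# Divider check: a free-running 11-bit counter divides the 100 MHz
# system clock by 2**11 = 2048, i.e. about 48.8 kHz < 100 kHz.
SIO_C_FREQ = 100_000_000 / 2**11
```

A shift register preloaded with these bits and clocked at the divided rate reproduces the SIO_D waveform in the simulation.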
Finally, the required configuration waveform can be achieved. There are two methods to generate it: one uses 'case' statements, and the other uses a shift register to output SIO_D. The shift register was applied because of its simplicity. In this part, SIOD_EN is an enable signal that releases the SIO_D output while the Don't-Care bit is driven by the OV7670.
Figure 6. The implementation of this shift register
Figure 7. Plot the SIO_C and set the SIOD_EN signal

According to the simulation waveform, the actual configuration has been achieved: the first 3-phase transmission carries 0x01, 0x02, 0x03, and the second carries 0x01, 0x55, 0xaa.
Figure 8. The simulation result of SCCB protocol

Reference:

[1] OmniVision, "Serial Camera Control Bus Functional Specification," 2002. [Online]. Available: http://www4.cs.umanitoba.ca/~jacky/Teaching/Courses/74.795-LocalVision/ReadingList/ov-sccb.pdf

Tuesday, February 4, 2020

Figure 1. Xilinx Zynq-7000 [1]
According to Xilinx [1], Field Programmable Gate Arrays (FPGAs) are semiconductor devices that are based around a matrix of configurable logic blocks (CLBs) connected via programmable interconnects. FPGAs can be reprogrammed to desired application or functionality requirements after manufacturing. This feature distinguishes FPGAs from Application Specific Integrated Circuits (ASICs), which are custom manufactured for specific design tasks. Although one-time programmable (OTP) FPGAs are available, the dominant types are SRAM based which can be reprogrammed as the design evolves.
Figure 2. OV7670 Camera [2]
In this project, an FPGA (Figure 1) was applied to complete the gesture recognition work. This requires a good combination of camera, monitor and a strict algorithm. The specific aim of this project is to provide a primary version of the game 'Fruit Ninja'. Although this game would be simple on an ARM processor, considerable work has to be done on an FPGA, because students can only manipulate logic gates to complete the functions.
Moreover, the camera module is the OV7670 (Figure 2), a low-cost image sensor plus DSP that can operate at a maximum of 30 fps at 640 x 480 ('VGA') resolution, equivalent to 0.3 megapixels. The preprocessing can be configured via the Serial Camera Control Bus (SCCB) [2].





Reference:



[1] Xilinx, "What Is an FPGA?" [Online]. Available: https://www.xilinx.com/products/silicon-devices/fpga/what-is-an-fpga.html
[2] Thaoyu Electronics, "OV7670 Camera Module (With the AL433 FIFO)." [Online]. Available: https://www.hotmcu.com/ov7670-camera-module-with-the-al433-fifo-p-304.html