Show HN: YOLO v4 Tiny with Tensil on Ultra96 FPGA Board

2022-04-04 • PROJECTS, TENSIL, FPGA

Introduction

This tutorial uses the Avnet Ultra96 V2 development board and the Tensil open-source inference accelerator to show how to run YOLO v4 Tiny, the state-of-the-art ML model for object detection, on an FPGA. The YOLO model contains some operations that Tensil does not support. These operations are in the final stage of processing and are not compute-intensive, so we will work around this by using TensorFlow Lite (TF-Lite) to run them on the CPU. We will use the PYNQ framework to receive real-time video from a USB webcam and show detected objects on a screen connected to Display Port. This tutorial refers to the previous Ultra96 tutorial for step-by-step instructions on generating Tensil RTL and getting Xilinx Vivado to synthesize the bitstream.

If you get stuck or run into an error, you can ask a question on our Discord or send us an email.



Before we start, let’s get a bird’s-eye view of what we want to accomplish. We’ll follow these steps:

  1. Generate and synthesize Tensil RTL
  2. Compile YOLO v4 Tiny model for Tensil
  3. Prepare PYNQ and TF-Lite
  4. Execute with PYNQ

1. Generate and synthesize Tensil RTL


In the first step, we’ll use the Tensil tools to generate the RTL code and then use Xilinx Vivado to synthesize the bitstream for the Ultra96 board. Since this process is the same as in the other Ultra96 tutorials, we refer you to sections 1 through 4 in the ResNet20 tutorial.

Alternatively, you can skip this step and download the ready-made bitstream. We include instructions for this in the section on preparing PYNQ and TF-Lite.

2. Compile YOLO v4 Tiny model for Tensil


Now, we need to compile the ML model to a Tensil binary consisting of TCU instructions, which are executed directly by the TCU hardware. The YOLO v4 Tiny model is included in two resolutions, 192 and 416, in the Tensil docker image at /demo/models/yolov4_tiny_192.onnx and /demo/models/yolov4_tiny_416.onnx. The higher resolution will detect smaller objects but uses more computation and thus yields fewer frames per second. Note that below we will be using the 192 resolution, but simply replacing it with 416 should work as well.
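As a rough sketch of this trade-off, the snippet below compares the two resolutions. It assumes that convolution work grows with the number of input pixels and that the two detection heads run at strides 32 and 16 (which matches the model_hw / 32 and model_hw / 16 sizes used later in this tutorial). The function names are illustrative and not part of Tensil.

```python
# Illustrative only: compare the two YOLO v4 Tiny input resolutions.
# Assumption: convolution cost grows with the number of input pixels.

def head_grid_sizes(model_hw, strides=(32, 16)):
    """Return the side length of each detection grid for a square input."""
    return [model_hw // s for s in strides]

def relative_compute(model_hw, baseline_hw=192):
    """Approximate compute relative to the baseline resolution."""
    return (model_hw / baseline_hw) ** 2

for hw in (192, 416):
    print(hw, head_grid_sizes(hw), round(relative_compute(hw), 2))
```

Under these assumptions the 416 model does between four and five times the work of the 192 model, so a proportionally lower frame rate would not be surprising.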

As we mentioned in the introduction, we will be using the TF-Lite framework to run the postprocessing of YOLO v4 Tiny. Specifically, this postprocessing includes Sigmoid and Exp operations not supported by the Tensil hardware. (We plan to implement them using table lookup based on Taylor expansion.) This means that for Tensil we need to compile the model ending with the last convolution layers. Everything below these layers needs to be compiled into the TF-Lite model. To identify the output nodes for the Tensil compiler, take a look at the model in Netron.
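To give a feel for what "table lookup based on Taylor expansion" could mean, here is a minimal software sketch for Sigmoid. It is not Tensil's implementation: the table spacing, the input range, and all names are assumptions chosen for illustration. It stores sigmoid values at evenly spaced points and adds a first-order Taylor correction around the nearest point.

```python
import math

STEP = 1.0 / 8.0          # hypothetical table spacing
X_MIN, X_MAX = -8.0, 8.0  # hypothetical input range covered by the table

def _sigmoid(x):
    """Reference sigmoid used to fill the table and to measure the error."""
    return 1.0 / (1.0 + math.exp(-x))

# Precompute sigmoid at evenly spaced grid points.
TABLE = [_sigmoid(X_MIN + i * STEP) for i in range(int((X_MAX - X_MIN) / STEP) + 1)]

def sigmoid_lut(x):
    """Approximate sigmoid with a table lookup plus a first-order Taylor term."""
    x = min(max(x, X_MIN), X_MAX)        # clamp to the table range
    i = int(round((x - X_MIN) / STEP))   # index of the nearest grid point
    x0 = X_MIN + i * STEP
    s0 = TABLE[i]
    # The derivative of sigmoid is sigmoid(x) * (1 - sigmoid(x)).
    return s0 + s0 * (1.0 - s0) * (x - x0)

worst = max(abs(sigmoid_lut(k / 100.0) - _sigmoid(k / 100.0))
            for k in range(-1000, 1001))
print(f"max abs error on [-10, 10]: {worst:.1e}")
```

A finer table or a higher-order term would trade memory or arithmetic for accuracy; which balance suits the hardware is a design decision this sketch does not address.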


The two final convolution operations have outputs named model/conv2d_17/BiasAdd:0 and model/conv2d_20/BiasAdd:0.

From within the Tensil docker container, run the following command.

tensil compile -a /demo/arch/ultra96v2.tarch -m /demo/models/yolov4_tiny_192.onnx -o "model/conv2d_17/BiasAdd:0,model/conv2d_20/BiasAdd:0" -s true
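If you want to script the compilation for both resolutions, a small helper can assemble the same command. This is only a sketch: the helper name is hypothetical, it uses just the flags shown above, and it assumes the output node names are identical for the 192 and 416 models.

```python
import shlex

def tensil_compile_command(model_hw,
                           arch="/demo/arch/ultra96v2.tarch",
                           outputs=("model/conv2d_17/BiasAdd:0",
                                    "model/conv2d_20/BiasAdd:0")):
    """Build the argument list for `tensil compile` at a given resolution."""
    if model_hw not in (192, 416):
        raise ValueError("the Docker image ships only the 192 and 416 models")
    return [
        "tensil", "compile",
        "-a", arch,
        "-m", f"/demo/models/yolov4_tiny_{model_hw}.onnx",
        "-o", ",".join(outputs),   # comma-separated output node names
        "-s", "true",
    ]

# Print a shell-quoted command line; pass the list to subprocess.run to execute it.
print(shlex.join(tensil_compile_command(416)))
```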

The resulting compiled files are listed in the ARTIFACTS table. The manifest (tmodel) is a plain-text JSON description of the compiled model. The Tensil program (tprog) and weights data (tdata) are both binaries to be used by the TCU during execution. The Tensil compiler also prints a COMPILER SUMMARY table with interesting stats for both the TCU architecture and the model.

Model:                                           yolov4_tiny_192_onnx_ultra96v2
Data type:                                       FP16BP8
Array size:                                      16
Consts memory size (vectors/scalars/bits):       2,097,152   33,554,432   21
Vars memory size (vectors/scalars/bits):         2,097,152   33,554,432   21
Local memory size (vectors/scalars/bits):        20,480      327,680      15
Accumulator memory size (vectors/scalars/bits):  4,096       65,536       12
Stride #0 size (bits):                           3
Stride #1 size (bits):                           3
Operand #0 size (bits):                          24
Operand #1 size (bits):                          24
Operand #2 size (bits):                          16
Instruction size (bytes):                        9
Consts memory maximum usage (vectors/scalars):   378,669     6,058,704
Vars memory maximum usage (vectors/scalars):     55,296      884,736
Consts memory aggregate usage (vectors/scalars): 378,669     6,058,704
Vars memory aggregate usage (vectors/scalars):   130,464     2,087,424
Number of layers:                                25
Total number of instructions:                    691,681
Compilation time (seconds):                      92.225
True consts scalar size:                         6,054,190
Consts utilization (%):                          98.706
True MACs (M):                                   670.349
MAC efficiency (%):                              0.000
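The vectors/scalars/bits columns are related by simple arithmetic, which the sketch below reproduces: scalars are vectors times the array size, and the bit count is the address width needed to index the vectors. The byte calculation assumes two bytes per scalar, since FP16BP8 is a 16-bit fixed-point format. The helper name is illustrative.

```python
ARRAY_SIZE = 16        # "Array size" from the summary
BYTES_PER_SCALAR = 2   # assumption: FP16BP8 stores each scalar in 16 bits

def describe_memory(vectors):
    """Return (scalars, address bits) for a memory holding `vectors` vectors."""
    scalars = vectors * ARRAY_SIZE
    bits = (vectors - 1).bit_length()   # bits needed to address every vector
    return scalars, bits

for name, vectors in [("consts", 2_097_152), ("vars", 2_097_152),
                      ("local", 20_480), ("accumulators", 4_096)]:
    scalars, bits = describe_memory(vectors)
    print(f"{name:13s}{scalars:>12,} scalars {bits:3d} bits")

# Combined size of the consts and vars memories, in MiB.
dram_mib = 2 * describe_memory(2_097_152)[0] * BYTES_PER_SCALAR / 2**20
print(f"consts + vars: {dram_mib:.0f} MiB")
```

If the driver reserves contiguous buffers for the full consts and vars memories (an assumption, not something the summary states), this total gives a sense of how much contiguous memory the architecture can ask for.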

3. Prepare PYNQ and TF-Lite


Now it’s time to put everything together on our development board. For this, we first need to set up the PYNQ environment. This process starts with downloading the SD card image for our development board. There are detailed instructions for setting up board connectivity on the PYNQ documentation website. You should be able to open Jupyter notebooks and run some examples. Note that you’ll need wireless internet connectivity for your Ultra96 board in order to run some of the commands in this section.

There is one caveat that needs addressing once PYNQ is installed. On the default PYNQ image, the Linux kernel CMA (Contiguous Memory Allocator) area size is set to 128MB. Given our Tensil architecture, the default CMA size is too small. To address this, you’ll need to download our patched kernel, copy it to /boot, and reboot your board. Note that the patched kernel is built for PYNQ 2.7 and will not work with other versions. To patch the kernel, run these commands on the development board:

sudo cp /boot/image.ub /boot/image.ub.backup
sudo cp image.ub /boot/
rm image.ub
sudo reboot
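To confirm the CMA size before and after the reboot, you can read the CmaTotal field from /proc/meminfo. The helper below is a small sketch with a hypothetical name; it assumes the kernel reports CmaTotal in kB, as kernels built with CMA support normally do. The sample text is made up for illustration.

```python
def cma_total_mib(meminfo_text):
    """Return the CmaTotal value from /proc/meminfo text in MiB, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("CmaTotal:"):
            kib = int(line.split()[1])   # the field is reported in kB
            return kib / 1024
    return None

# Made-up sample for illustration. On the board, read the real file instead:
#   with open("/proc/meminfo") as f:
#       print(cma_total_mib(f.read()))
sample = (
    "MemTotal:        2039684 kB\n"
    "CmaTotal:         131072 kB\n"
    "CmaFree:          120000 kB\n"
)
print(cma_total_mib(sample))
```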

Now that PYNQ is up and running, the next step is to scp the Tensil driver for PYNQ. Start by cloning the Tensil GitHub repository to your workstation and then copy drivers/tcu_pynq to /home/xilinx/tcu_pynq on your board.

git clone
scp -r tensil/drivers/tcu_pynq xilinx@

Next, we’ll download the bitstream created for the Ultra96 architecture definition we used with the compiler. The bitstream contains the FPGA configuration resulting from Vivado synthesis and implementation. PYNQ also needs a hardware handoff file that describes the FPGA components accessible to the host, such as DMA. Download and un-tar both files in /home/xilinx by running these commands on the development board.

tar -xvf tensil_ultra96v2.tar.gz

If you’d like to explore using the Tensil RTL tool and Xilinx Vivado to synthesize the bitstream yourself, we refer you to sections 1 through 4 in the ResNet20 tutorial. Section 6 in the same tutorial includes instructions for copying the bitstream and hardware handoff file from the Vivado project onto your board.

Now, copy the .tmodel, .tprog and .tdata artifacts produced by the compiler on your workstation to /home/xilinx on the board.

scp yolov4_tiny_192_onnx_ultra96v2.t* xilinx@

Next, we need to set up TF-Lite. We prepared a TF-Lite build compatible with the Ultra96 board. Run the following commands on the development board to download and install it.

sudo pip install tflite_runtime-2.8.0-cp38-cp38-linux_aarch64.whl

Finally, we will need the TF-Lite model to run the postprocessing in YOLO v4 Tiny. We prepared this model for you as well. We’ll also need text labels for the COCO dataset used for training the YOLO model. Download these files into /home/xilinx by running these commands on the development board.


4. Execute with PYNQ

Now, we will tie everything together in a PYNQ Jupyter notebook. Let’s take a closer look at our processing pipeline.

  • Capture the frame image from the webcam;
  • Adjust the image size, color scheme, floating-point channel representation, and Tensil vector alignment to match YOLO v4 Tiny input;
  • Run it through Tensil to get the results of the two final convolution layers;
  • Subsequently run these results through the TF-Lite interpreter to get the model output for bounding boxes and classification scores;
  • Filter bounding boxes based on the score threshold and suppress overlapping boxes for the same detected object;
  • Use the frame originally captured from the camera to plot bounding boxes, class names, scores (red), the current value for frames per second (green), and the detection area (blue);
  • Send this annotated frame to Display Port to show on the screen.
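The "Tensil vector alignment" step means that the channel dimension is padded up to a multiple of the array size (16 for this architecture). Here is a minimal pure-Python sketch of that arithmetic. The div_ceil below is a stand-in for the helper imported from tcu_pynq.util later on, the other names are hypothetical, and the 255-channel figure is an assumed size for the convolution outputs of a COCO-trained YOLO head.

```python
def div_ceil(n, d):
    """Integer division rounded up."""
    return (n + d - 1) // d

def aligned_size(channels, array_size=16):
    """Smallest multiple of array_size that can hold `channels` scalars."""
    return div_ceil(channels, array_size) * array_size

def pad_pixel(pixel, array_size=16):
    """Zero-pad one pixel's channel values to the aligned size."""
    padding = aligned_size(len(pixel), array_size) - len(pixel)
    return list(pixel) + [0.0] * padding

print(aligned_size(3), aligned_size(255))   # RGB input, assumed head output
print(pad_pixel([0.5, 0.25, 1.0]))          # one RGB pixel padded with zeros
```

On the way in, the three RGB channels are padded with zeros; on the way out, the extra padded channels are trimmed before the data reaches TF-Lite.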

At the beginning of the notebook, we define global parameters: frame dimensions for both camera and screen, and the YOLO v4 Tiny resolution we will be using.

model_hw = 192
frame_w = 1280
frame_h = 720

Subsequent, we import the Tensil PYNQ driver and other required utilities.

import sys
sys.path.append('/home/xilinx/')  # make the copied tcu_pynq package importable

import time
import math
import numpy as np
import tflite_runtime.interpreter as tflite
import cv2
import matplotlib.pyplot as plt
import pynq

from pynq import Overlay
from pynq.lib.video import *  # DisplayPort, VideoMode, PIXEL_RGB

from tcu_pynq.driver import Driver
from tcu_pynq.util import div_ceil
from tcu_pynq.architecture import ultra96

Now, initialize the PYNQ overlay from the bitstream and instantiate the Tensil driver using the TCU architecture and the overlay’s DMA configuration. Note that we are passing the axi_dma_0 object from the overlay; the name matches the DMA block in the Vivado design.

overlay = Overlay('/home/xilinx/tensil_ultra96v2.bit')
tcu = Driver(ultra96, overlay.axi_dma_0)

Next, we need to initialize the capture from the webcam using the OpenCV library.

cap = cv2.VideoCapture(0)

cap.set(cv2.CAP_PROP_FRAME_WIDTH, frame_w)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, frame_h)

And initialize the Display Port.

displayport = DisplayPort()
displayport.configure(VideoMode(frame_w, frame_h, 24), PIXEL_RGB)

If you are connecting the board to an HDMI screen, make sure to use an active DP-to-HDMI cable.

Next, load the tmodel manifest for the model into the driver. The manifest tells the driver where to find the other two binary files (program and weights data).

tcu.load_model('/home/xilinx/yolov4_tiny_{0}_onnx_ultra96v2.tmodel'.format(model_hw))

Then, instantiate the TF-Lite interpreter based on the YOLO postprocessing model.

interpreter = tflite.Interpreter(model_path='/home/xilinx/yolov4_tiny_{0}_post.tflite'.format(model_hw))
interpreter.allocate_tensors()  # tensors must be allocated before set_tensor is called

Now we load the COCO labels and define several utility functions.

with open('/home/xilinx/coco-labels-2014_2017.txt') as f:
    labels_coco = f.read().split('\n')

def set_tensor(driver, interpreter, hw_size, data):
    # Find the TF-Lite input whose height and width match this Tensil output
    input_details = interpreter.get_input_details()
    input_idxs = [i for i in range(len(input_details))
                  if input_details[i]['shape'][1] == hw_size and input_details[i]['shape'][2] == hw_size]
    inp = input_details[input_idxs[0]]
    data = data.astype(inp['dtype'])
    # Trim the channels that Tensil padded up to a multiple of the array size
    inner_dim = inp['shape'][-1]
    inner_size = div_ceil(inner_dim, driver.arch.array_size) * driver.arch.array_size
    if inner_size != inner_dim:
        data = data.reshape((-1, inner_size))[:, :inner_dim]
    data = data.reshape(inp['shape'])
    interpreter.set_tensor(inp['index'], data)

def filter_and_reshape(boxes, scores, score_threshold=0.4):
    # Keep only boxes whose best class score exceeds the threshold
    scores_max = np.max(scores, axis=-1)
    mask = scores_max > score_threshold
    filtered_boxes = boxes[mask]
    filtered_scores = scores[mask]
    filtered_boxes = np.reshape(filtered_boxes, [scores.shape[0], -1, filtered_boxes.shape[-1]])
    filtered_scores = np.reshape(filtered_scores, [scores.shape[0], -1, filtered_scores.shape[-1]])

    return filtered_boxes, filtered_scores

def non_maximum_suppression(boxes, iou_threshold=0.4):
    if len(boxes) == 0:
        return boxes

    # Box areas and pairwise intersection rectangles
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    ll_x = np.maximum.outer(boxes[:, 0], boxes[:, 0])
    ll_y = np.maximum.outer(boxes[:, 1], boxes[:, 1])
    ur_x = np.minimum.outer(boxes[:, 2], boxes[:, 2])
    ur_y = np.minimum.outer(boxes[:, 3], boxes[:, 3])
    intersection_x = np.maximum(0, ur_x - ll_x)
    intersection_y = np.maximum(0, ur_y - ll_y)
    intersection = intersection_x * intersection_y

    # Overlap ratio; subtracting the identity excludes each box's overlap with itself
    iou = intersection / area - np.identity(area.shape[-1])
    p = iou >= iou_threshold
    p = p & p.T
    n = p.shape[-1]

    # Boxes that overlap no other box are kept as they are
    no_needs_merge = set()
    for i in range(n):
        if not p[i].any():
            no_needs_merge.add(i)

    # Unordered pairs of mutually overlapping boxes
    needs_merge = set()
    for i in range(n):
        for j in range(n):
            if p[i, j]:
                needs_merge.add(tuple(sorted((i, j))))

    def merge(needs_merge):
        # Keep the first box of each pair and discard the second
        result = set()
        discarded = set()
        for indices in needs_merge:
            idx = indices[0]
            if idx not in discarded:
                result.add(idx)
            discarded.add(indices[1])
            if indices[1] in result:
                result.remove(indices[1])
        return result

    return sorted(list(no_needs_merge) + list(merge(needs_merge)))
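For reference, here is a small pure-Python sketch of the overlap measures involved. Note that the suppression function above divides the intersection by the area of one box and requires the threshold to be met in both directions, which differs from the textbook intersection-over-union that also appears below. The function names are illustrative.

```python
def intersection_area(a, b):
    """Area shared by two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return ix * iy

def box_area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def iou(a, b):
    """Textbook intersection over union."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def overlap_fraction(a, b):
    """Intersection divided by the area of box b (the ratio used above)."""
    return intersection_area(a, b) / box_area(b)

a, b = (0, 0, 2, 2), (1, 1, 3, 3)
print(iou(a, b), overlap_fraction(a, b))
```

For these two boxes the shared area is one unit square, so the textbook measure is 1/7 while the per-box ratio is 1/4; both fall below the 0.4 threshold, and the boxes would be treated as separate detections.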

Finally, we tie the pipeline together in a loop to process a fixed number of frames. (You may replace it with while(1): to run the pipeline indefinitely.)

for _ in range(600):
    start = time.time()

    # Capture a webcam frame directly into a Display Port frame buffer
    cap_frame = displayport.newframe()
    cap.read(cap_frame)

    # Margins for a centered square crop and the crop-to-model scale factors
    crop_h = int(max(0, (frame_h - frame_w) / 2))
    crop_w = int(max(0, (frame_w - frame_h) / 2))
    ratio_h = (frame_h - crop_h * 2) / model_hw
    ratio_w = (frame_w - crop_w * 2) / model_hw

    # Crop, resize, convert to RGB floats in [0, 1], and pad channels to the array size
    x_frame = cap_frame
    x_frame = x_frame[crop_h:frame_h - crop_h, crop_w:frame_w - crop_w]
    x_frame = cv2.resize(x_frame, (model_hw, model_hw), interpolation=cv2.INTER_LINEAR)
    x_frame = cv2.cvtColor(x_frame, cv2.COLOR_BGR2RGB)
    x_frame = x_frame.astype('float32') / 255
    x_frame = np.pad(x_frame, [(0, 0), (0, 0), (0, tcu.arch.array_size - 3)], 'constant', constant_values=0)

    # Run the convolutional part on Tensil, then the postprocessing on TF-Lite
    inputs = {'x:0': x_frame}
    outputs = tcu.run(inputs)
    set_tensor(tcu, interpreter, model_hw / 32, np.array(outputs['model/conv2d_17/BiasAdd:0']))
    set_tensor(tcu, interpreter, model_hw / 16, np.array(outputs['model/conv2d_20/BiasAdd:0']))

    interpreter.invoke()

    output_details = interpreter.get_output_details()
    scores, boxes_xywh = [interpreter.get_tensor(output_details[i]['index']) for i in range(len(output_details))]

    # Drop low-scoring boxes, convert to corner coordinates, and suppress overlaps
    boxes_xywh, scores = filter_and_reshape(boxes_xywh, scores)
    boxes_xy, boxes_wh = np.split(boxes_xywh, (2,), axis=-1)
    boxes_x0y0x1y1 = np.concatenate([boxes_xy - boxes_wh/2, boxes_xy + boxes_wh/2], axis=-1)
    box_indices = non_maximum_suppression(boxes_x0y0x1y1[0])

    latency = (time.time() - start)
    fps = 1/latency

    for i in box_indices:
        category_idx = np.argmax(scores, axis=-1)[0, i]
        category_conf = np.max(scores, axis=-1)[0, i]
        text = f'{labels_coco[category_idx]}={category_conf:.2}'
        # Map the box from model coordinates back to the full frame
        box_x0y0x1y1 = boxes_x0y0x1y1[0, i]
        box_x0y0x1y1[0] *= ratio_w
        box_x0y0x1y1[1] *= ratio_h
        box_x0y0x1y1[2] *= ratio_w
        box_x0y0x1y1[3] *= ratio_h
        box_x0y0x1y1[0] += crop_w
        box_x0y0x1y1[1] += crop_h
        box_x0y0x1y1[2] += crop_w
        box_x0y0x1y1[3] += crop_h
        box_x0y0x1y1 = box_x0y0x1y1.astype('int')
        cap_frame = cv2.rectangle(cap_frame, (crop_w, crop_h), (frame_w - crop_w, frame_h - crop_h), (255, 0, 0), 1)
        cap_frame = cv2.rectangle(cap_frame, (box_x0y0x1y1[0], box_x0y0x1y1[1]), (box_x0y0x1y1[2], box_x0y0x1y1[3]), (0, 0, 255), 1)
        cap_frame = cv2.putText(cap_frame, text, (box_x0y0x1y1[0] + 2, box_x0y0x1y1[1] - 2), cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 0, 255))
    cap_frame = cv2.putText(cap_frame, f'{fps:.2}fps', (2, frame_h - 2), cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 255, 0))

    # Send the annotated frame to the screen
    displayport.writeframe(cap_frame)
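The crop and scale arithmetic in the loop is easier to follow with concrete numbers. The sketch below repeats it in plain Python for the 1280x720 frame and the 192 model resolution used in this tutorial; the function names are illustrative and do not appear in the notebook.

```python
def center_crop(frame_w, frame_h):
    """Return (crop_w, crop_h) margins that leave a centered square."""
    crop_h = int(max(0, (frame_h - frame_w) / 2))
    crop_w = int(max(0, (frame_w - frame_h) / 2))
    return crop_w, crop_h

def to_frame_coords(box, frame_w, frame_h, model_hw):
    """Map an (x0, y0, x1, y1) box from model pixels to frame pixels."""
    crop_w, crop_h = center_crop(frame_w, frame_h)
    ratio_w = (frame_w - 2 * crop_w) / model_hw
    ratio_h = (frame_h - 2 * crop_h) / model_hw
    x0, y0, x1, y1 = box
    return (int(x0 * ratio_w + crop_w), int(y0 * ratio_h + crop_h),
            int(x1 * ratio_w + crop_w), int(y1 * ratio_h + crop_h))

print(center_crop(1280, 720))                               # (280, 0)
print(to_frame_coords((48, 48, 144, 144), 1280, 720, 192))  # (460, 180, 820, 540)
```

With a 1280x720 frame, 280 pixels are cropped from each side, leaving a 720x720 detection area that is scaled by 3.75 to and from the 192x192 model input.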

After running the pipeline, we clean up the camera capture and Display Port resources.

cap.release()
displayport.close()

Congratulations! You ran a state-of-the-art object detection ML model on a custom accelerator hooked to a webcam and a screen for real-time object detection! Just imagine the things you could do with it…



In this tutorial we used Tensil to show how to run the YOLO v4 Tiny ML model on an FPGA with a postprocessing step handled by TF-Lite. We showed how to analyze the model to determine the layers at which to split the processing between TF-Lite and Tensil. We also included a step-by-step explanation of how to build a real-time video processing pipeline using PYNQ.

If you made it all the way through, big congrats! You’re ready to take things to the next level by trying out your own model and architecture. Join us on Discord to say hello and ask questions, or send us an email.


Charlie Layers
