## ALL PROGRAMMABLE





## Binary Networks on FPGAs

Michaela Blott, Kees Vissers, Giulio Gambardella (Xilinx Research) Yaman Umuroglu (NTNU), Nick Fraser (Sydney Uni.), Gianluca Durelli (Politecnico Milano)







## Agenda






## Xilinx Research - Ireland

- 8 researchers + students & visiting scholars

- 2 university program

- Est. 10 years ago







#### **Applications & Architectures:**

Through application-driven technology development with customers, partners, and engineering & marketing





















## Convolutional Neural Networks

#### **➤ CNNs** are the predominant machine learning algorithm

- Achieving superhuman accuracy since 2015
- Use cases span image recognition, language processing, speech recognition, time series prediction, recommender systems, medical diagnosis, autonomous vehicles and many more

#### **➤ CNNs** have very high compute and memory requirements

Increasing operational intensity

| CNN for ImageNet datasets       | Memory (SP) [MB]       | Operations [GOPS]    | Operational intensity [OPS:B]       |
|---------------------------------|------------------------|----------------------|-------------------------------------|
| AlexNet – complete              | 244                    | 1.5                  | 5.97                                |
| VGG-16                          | 552                    | 31                   | 55.84                               |
| GoogleNet                       | 27.2                   | 3.1                  | 55.24                               |

## Increasingly Reduced Precision Networks

- ➤ Floating point (FP) CNNs contain a lot of redundancy
  - Even Nvidia is moving from FP and HP (half precision) to 8b fixed-point integer
- ➤ Reducing precision is shown to work down to 6b without loss of accuracy (Dec. 2015)
  - 50x and more reduction in model size (no external memory needed)
- ➤ Bill Dally (Stanford), EMDNN 2016:
  - Showed TTN on par with FP for AlexNet (top-1 and top-5) and ResNet-20, 32, 44, 56
- ➤ Reducing to the extreme: binary and almost binary neural networks (BNNs) (Jan. 2016)
  - Possible with retraining
  - No accuracy loss for small networks
  - Small drop for large networks







## Potential of Binary Networks on FPGAs

➤ Multiply-accumulate becomes XNOR followed by a bit count (popcount)

| Cost of operations | LUTs | DSPs |
|--------------------|------|------|
| 1b                 | 2.5  | 0    |
| 4b                 | 11   | 0    |
| 8b                 | 40   | 0    |
| 32b                | 178  | 2    |
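The XNOR-plus-bit-count equivalence can be checked in software. The following is a minimal sketch in plain C++ (not the HLS code used in this deck), assuming the encoding in which a 0 bit stands for -1 and a 1 bit for +1. The function names `dotBipolar` and `dotXnorPopcount` are illustrative and do not come from the slides.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Reference: dot product of two n-element {-1,+1} vectors. Bit i of each
// word encodes element i (0 -> -1, 1 -> +1). n must be in 1..64.
int dotBipolar(std::uint64_t a, std::uint64_t b, unsigned n) {
    assert(n >= 1 && n <= 64);
    int sum = 0;
    for (unsigned i = 0; i < n; ++i) {
        int ai = ((a >> i) & 1u) ? 1 : -1;
        int bi = ((b >> i) & 1u) ? 1 : -1;
        sum += ai * bi;
    }
    return sum;
}

// Same result via XNOR + popcount: every matching bit pair contributes +1
// and every differing pair -1, so dot = 2 * popcount(xnor) - n.
int dotXnorPopcount(std::uint64_t a, std::uint64_t b, unsigned n) {
    assert(n >= 1 && n <= 64);
    // Mask off the bits above n, which the bitwise NOT would otherwise set.
    std::uint64_t mask =
        (n == 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << n) - 1);
    std::uint64_t xnor = ~(a ^ b) & mask;
    int matches = static_cast<int>(std::bitset<64>(xnor).count());
    return 2 * matches - static_cast<int>(n);
}
```

Both functions agree for any pair of inputs, for example `dotXnorPopcount(0xB, 0xD, 4)` and `dotBipolar(0xB, 0xD, 4)`. In hardware the final `2 * matches - n` step can be dropped when the result is only compared against a threshold, since the threshold can absorb it.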

## Roofline Assumptions

- Application can fill device to 70% (fully parallelizable)
- FPGA cost function:
  - 1b int: 2.5 LUTs (with HLS)
  - 4b int: 11 LUTs
  - 8b int: 40 LUTs
  - 32b float: 178 LUTs, 2 DSPs
- 250 MHz for KU115

#### KU115:

- 663k LUTs
- 5520 DSPs
- 2160 BRAM
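The roofline assumptions can be turned into a small calculation. This is a sketch under stated assumptions, not the authors' model: it assumes each instantiated operator completes one operation per clock cycle and that the scarcer of LUTs and DSPs limits how many operators fit. The type and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Resources and clock of the target device.
struct Device {
    double luts;
    double dsps;
    double clockHz;
};

// Resource cost of one operator at a given precision.
struct OpCost {
    double luts;
    double dsps;
};

// Peak ops/s = (operators that fit into the usable fraction of the device)
// * clock rate. Assumption: one operation per operator per cycle.
double peakOpsPerSecond(const Device& dev, const OpCost& cost,
                        double fill = 0.7) {
    assert(cost.luts > 0.0);
    double operators = fill * dev.luts / cost.luts;
    if (cost.dsps > 0.0) {
        // DSP-based operators may run out of DSPs before LUTs.
        operators = std::min(operators, fill * dev.dsps / cost.dsps);
    }
    return operators * dev.clockHz;
}
```

With the KU115 figures (663'360 LUTs, 5'520 DSPs, 250 MHz), `peakOpsPerSecond` gives about 46.4 TOps/s for the 1b cost of 2.5 LUTs and about 10.6 TOps/s for the 4b cost of 11 LUTs.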

- ➤ Today's FPGAs have a much higher peak performance for binary operations
  - Example: KU115 offers lots of LUTs but limited DSPs for HP & SP: 5'520 DSPs and 663'360 LUTs

| Peak performance | TOps/s |
|------------------|--------|
| 1b               | 46     |
| 4b               | 11     |
| 8b               |        |
| 32b              |        |

- 10x Nvidia today: 4.5 TOps/s measured
- Huge performance potential for low-bit precision today
- Model sizes are small enough: no external memory needed

## Potential of Dataflow Architectures on FPGAs



- ➤ Binary networks can be implemented as feed-forward dataflow architectures
  - If we had enough resources to implement a full network fully parallelized
  - ⇒ classifying 1 image per clock cycle (for example 250 MHz ⇒ 250 Mfps)
- ➤ Large networks need to be folded over the input stream
  - Conceptually we have 5 orders of magnitude to play with
- ➤ Lowest storage requirements and lowest latency
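The effect of folding on frame rate can be sketched as follows. This assumes the folding scheme used by the MVTU code in this deck, where a fully connected layer with an MH x MW weight matrix, PE processing elements and SIMD lanes needs (MH / PE) * (MW / SIMD) cycles per image, and it assumes the pipeline rate is set by its slowest layer. Convolutional layers, which also iterate over output pixels, are not covered. The struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Folding configuration of one fully connected layer.
struct LayerFolding {
    std::uint64_t matrixHeight;  // MH: number of neurons
    std::uint64_t matrixWidth;   // MW: synapses per neuron
    std::uint64_t pe;            // neurons computed in parallel
    std::uint64_t simd;          // synapses per neuron computed in parallel
};

// Cycles one layer needs per image: neuronFold * synapseFold.
std::uint64_t cyclesPerImage(const LayerFolding& l) {
    assert(l.pe > 0 && l.simd > 0);
    assert(l.matrixHeight % l.pe == 0 && l.matrixWidth % l.simd == 0);
    return (l.matrixHeight / l.pe) * (l.matrixWidth / l.simd);
}

// In a streaming pipeline all layers run concurrently, so the slowest
// layer determines the steady-state frame rate.
double framesPerSecond(const std::vector<LayerFolding>& layers,
                       double clockHz) {
    assert(!layers.empty());
    std::uint64_t worst = 0;
    for (const LayerFolding& l : layers) {
        worst = std::max(worst, cyclesPerImage(l));
    }
    return clockHz / static_cast<double>(worst);
}
```

A fully unfolded layer (PE = MH, SIMD = MW) takes one cycle per image, which reproduces the 250 MHz ⇒ 250 Mfps case. Folding a 256 x 784 layer with PE = 16 and SIMD = 16 takes 784 cycles, or roughly 319 kfps at the same clock.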





## Architecture

## Concepts

#### Memory

- Weights and thresholds are contained in on-chip memory

#### Custom heterogeneous streaming architecture

- Not a systolic array with scheduling network on processing engines
- Customized network where all layers coexist in a data flow architecture
- Each layer consumes and produces in same order to minimize buffering and latency
- Layers are different instantiations of a C++ template class (MVTU) with equivalent throughput

#### Custom data types & BNN specific optimizations

- $\{-1,+1\}$ maps to $\{0,1\}$
- Xnor-popcount as cheap binary multiply-accumulates
- Thresholds as cheap batchnorm activations
- "OR" becomes cheap maxpool
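Two of the optimizations above, thresholds replacing batchnorm and OR replacing max pooling, can be sketched in plain C++. The batchnorm form y = scale * (a - mean) / sigma + shift and all names below are assumptions for illustration. The folding shown is valid only for scale > 0 and sigma > 0, and it uses a `>=` comparison, whereas the MVTU code in this deck compares with `>`, which corresponds to storing the threshold minus one.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Batchnorm parameters of one neuron: y = scale * (a - mean) / sigma + shift.
struct BatchNorm {
    double scale;
    double shift;
    double mean;
    double sigma;
};

// Reference activation: sign(batchnorm(a)) mapped to {0,1}, where a is the
// {-1,+1} dot product of the neuron.
bool referenceActivation(int a, const BatchNorm& bn) {
    return bn.scale * (a - bn.mean) / bn.sigma + bn.shift >= 0.0;
}

// Fold batchnorm + sign into one integer threshold on the popcount p of an
// n-input neuron, using a = 2p - n:
//   y >= 0  <=>  a >= tau, tau = mean - shift * sigma / scale
//           <=>  p >= (n + tau) / 2
int popcountThreshold(unsigned n, const BatchNorm& bn) {
    assert(bn.scale > 0.0 && bn.sigma > 0.0);
    double tau = bn.mean - bn.shift * bn.sigma / bn.scale;
    return static_cast<int>(std::ceil((n + tau) / 2.0));
}

// The activation is then a single integer comparison.
bool thresholdActivation(unsigned popcount, int threshold) {
    return static_cast<int>(popcount) >= threshold;
}

// Max over {0,1}-encoded values is a logical OR. Each word holds up to 64
// channels of one pixel, so a 2x2 max pool is three bitwise ORs.
std::uint64_t maxPool2x2(std::uint64_t p00, std::uint64_t p01,
                         std::uint64_t p10, std::uint64_t p11) {
    return p00 | p01 | p10 | p11;
}
```

The threshold is computed once after training, so no batchnorm arithmetic remains at inference time.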







## Architecture of a Matrix-Vector Threshold Unit (MVTU)

- ➤ Fully connected layers & convolutional layers are mapped onto matrix-vector multiply threshold units (MVTUs)
- ➤ MVTUs support folding over output feature maps (OFMs, neuron folding) and folding over weights (synapse folding)
- ➤ Weight and output stationary (weights and popcounts are retained locally)
- ➤ Max pool units are optionally placed behind MVTUs





## Architecture of Infrastructure on Zynq SOC



## Work Flow for Exploration of BNNs

First prototype integration with tiny-dnn and Theano (TensorFlow and Caffe in progress)


- All code in C/C++
- Can execute on CPU and FPGA
  - No RTL needed
- Scheduler is conceptually packed into the synthesizer



## Top Level

```
#include <ap_int.h>
#include <hls_stream.h>
using namespace hls;

void DoCompute(ap_uint<64> *in, ap_uint<64> *out) {
#pragma HLS DATAFLOW
  // Stream definitions (the declarations of inter0..inter2 and outstream
  // are not shown on the slide)
  stream<ap_uint<64> > memInStrm("memInStrm");
  stream<ap_uint<64> > InStrm("InStrm");
  stream<ap_uint<64> > memOutStrm("memOutStrm");

  // Move image in from PS memory (the step that feeds InStrm from
  // memInStrm is not shown on the slide)
  Mem2Stream<64, inBytesPadded>(in, memInStrm);

  // Layer instantiation: the layers are connected by streams
  StreamingMatrixVector<L0_SIMD, L0_PE, 16, L0_MW, L0_MH, L0_WMEM, L0_TMEM>
      (InStrm, inter0, weightMem0, thresMem0);
  StreamingMatrixVector<L1_SIMD, L1_PE, 16, L1_MW, L1_MH, L1_WMEM, L1_TMEM>
      (inter0, inter1, weightMem1, thresMem1);
  StreamingMatrixVector<L2_SIMD, L2_PE, 16, L2_MW, L2_MH, L2_WMEM, L2_TMEM>
      (inter1, inter2, weightMem2, thresMem2);
  StreamingMatrixVector<L3_SIMD, L3_PE, 16, L3_MW, L3_MH, L3_WMEM, L3_TMEM>
      (inter2, outstream, weightMem3, thresMem3);
  StreamingCast<ap_uint<16>, ap_uint<64> >(outstream, memOutStrm);

  // Move results to PS memory
  Stream2Mem<64, outBytesPadded>(memOutStrm, out);
}
```

### **MVTU**

```
// Folding: outer loop over neuron folds, inner loop over synapse folds
for (unsigned int nm = 0; nm < neuronFold; nm++) {
  for (unsigned int sf = 0; sf < synapseFold; sf++) {
#pragma HLS PIPELINE II=1
    ap_uint<SIMDWidth> inElem;
    // Read inputs from the stream on the first neuron fold, or consume
    // the internally buffered copy (when folded)
    if (nm == 0) {
      inElem = in.read();
      inputBuf[sf] = inElem;
    } else {
      inElem = inputBuf[sf];
    }
    for (unsigned int pe = 0; pe < PECount; pe++) {
#pragma HLS UNROLL
      // Index the weight memory
      ap_uint<SIMDWidth> weight = weightMem[pe][nm * synapseFold + sf];
      // Binary MAC: XNOR followed by popcount
      ap_uint<SIMDWidth> masked = ~(weight ^ inElem);
      accPopCount[pe] += NaivePopCount<SIMDWidth, PopCountWidth>(masked);
    }
  }
  ap_uint<PECount> outElem = 0;
  for (unsigned int pe = 0; pe < PECount; pe++) {
#pragma HLS UNROLL
    // Batchnorm activation: compare against the threshold memory
    outElem(pe, pe) = accPopCount[pe] > thresMem[pe][nm] ? 1 : 0;
    accPopCount[pe] = 0;  // clear the accumulator
  }
  // Writing outElem to the output stream is not shown on the slide
}
```
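For experimenting without the HLS tools, the loop nest above can be modelled in standard C++. The following is a hedged software sketch, not the deck's implementation: `std::uint64_t` words stand in for `ap_uint`, `std::bitset` stands in for `NaivePopCount`, and the function name `mvtuModel` and its signature are assumptions. Unlike `ap_uint<SIMDWidth>`, a 64-bit word does not truncate automatically, so the XNOR result is masked to `simdWidth` bits.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Software model of one folded matrix-vector-threshold layer.
//   input     : synapseFold words, each holding simdWidth input bits
//   weightMem : weightMem[pe][nm * synapseFold + sf], packed like the input
//   thresMem  : thresMem[pe][nm]
// Returns neuronFold words; bit pe of word nm is the output of processing
// element pe in neuron fold nm.
std::vector<std::uint64_t> mvtuModel(
        const std::vector<std::uint64_t>& input,
        const std::vector<std::vector<std::uint64_t>>& weightMem,
        const std::vector<std::vector<unsigned>>& thresMem,
        unsigned simdWidth, unsigned neuronFold) {
    const std::size_t peCount = weightMem.size();
    const std::size_t synapseFold = input.size();
    assert(simdWidth >= 1 && simdWidth <= 64);
    assert(peCount >= 1 && peCount <= 64 && thresMem.size() == peCount);
    const std::uint64_t mask = (simdWidth == 64)
        ? ~std::uint64_t{0}
        : ((std::uint64_t{1} << simdWidth) - 1);

    std::vector<unsigned> accPopCount(peCount, 0);
    std::vector<std::uint64_t> out;
    for (unsigned nm = 0; nm < neuronFold; ++nm) {
        for (std::size_t sf = 0; sf < synapseFold; ++sf) {
            for (std::size_t pe = 0; pe < peCount; ++pe) {
                // Binary MAC: XNOR, mask to simdWidth bits, then popcount.
                std::uint64_t weight = weightMem[pe][nm * synapseFold + sf];
                std::uint64_t xnor = ~(weight ^ input[sf]) & mask;
                accPopCount[pe] +=
                    static_cast<unsigned>(std::bitset<64>(xnor).count());
            }
        }
        // Threshold activation, then clear the accumulators.
        std::uint64_t outElem = 0;
        for (std::size_t pe = 0; pe < peCount; ++pe) {
            if (accPopCount[pe] > thresMem[pe][nm]) {
                outElem |= std::uint64_t{1} << pe;
            }
            accPopCount[pe] = 0;
        }
        out.push_back(outElem);
    }
    return out;
}
```

The model keeps the whole input in memory rather than reading a stream, so the `inputBuf` reuse of the hardware version is implicit: every neuron fold simply re-reads `input[sf]`.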



## **Experimental Setup**



Source: Xilinx Dublin labs - BNN setup

#### **ZC706** development platform:

- Z7045
  - 2 A9 processors
  - 350k logic cells (218k LUTs)
  - 900 DSPs
- 2x 1GB DDR3





## **Test Networks**

#### Fully connected networks

- Input images: 28x28 pixels, binarized MNIST
- Number of layers: 3 FC layers with 256, 512 or 1024 neurons each
- Compute requirement: 0.67, 1.86 and 5.8 MOPS/Frame
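The compute requirements of the fully connected networks can be reproduced with a short calculation. This sketch rests on two assumptions that are not stated on the slide: each synapse counts as two operations (one multiply, one accumulate), and the three hidden layers are followed by a 10-way output layer for the MNIST classes. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Operations per frame of a fully connected network, counting 2 ops per
// synapse. layerSizes lists the number of values at every stage, input
// first: {inputs, hidden..., outputs}.
std::uint64_t fcOpsPerFrame(const std::vector<std::uint64_t>& layerSizes) {
    assert(layerSizes.size() >= 2);
    std::uint64_t ops = 0;
    for (std::size_t i = 0; i + 1 < layerSizes.size(); ++i) {
        ops += 2 * layerSizes[i] * layerSizes[i + 1];
    }
    return ops;
}
```

Under these assumptions, `fcOpsPerFrame({784, 256, 256, 256, 10})` gives 668'672 operations (0.67 MOPS/frame), and the 512 and 1024 neuron variants give about 1.86 and 5.82 MOPS/frame, in line with the figures above. Multiplying the smallest network's count by a frame rate of 12.3M FPS gives roughly 8.2 TOps/s.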



#### CNV (VGG-16 derivative)

- Input images: 32x32 pixels, RGB image
- Number of layers: 2 (3x3) Conv + Max Pool + 2 (3x3) Conv + Max Pool + 2 Convolutional + 3 FC
- Compute requirement: 0.113 and 1.2 GOPS/Frame

#### DoReFaNet (AlexNet)

- Reduced precision with 2b activations
- Input Images: 227x227, RGB
- In progress





## Test Networks & Input Data









## Results - Performance, Latency, Power & Resources

#### Max Throughput

| Z7045           | FPS   | GOPS/s | BRAM  | LUT          | Latency [us] | Power [W] |
|-----------------|-------|--------|-------|--------------|--------------|-----------|
| MNIST – small   | 12.3M | 8'200  | 130.5 | 86'110 (39%) | 0.31         | 21.2      |
| MNIST – large   | 1.5M  | 9'085  | 398   |              | 2.44         | 22.6      |
| CIFAR10 – small | 21.9K | 2'465  | 192   | 54'538 (25%) | 283          | 11.7      |

| Z7020 (PYNQ)    | FPS   | GOPS/s | BRAM  | LUT          | Latency [us] | Power [W] |
|-----------------|-------|--------|-------|--------------|--------------|-----------|
| MNIST – small   | 307k  | 203.5  | 64.5  | 23'756 (44%) | 13           | -         |

- Unprecedented classification rates
- Ultra-low latency (P4: ~11 ms) for robotics, AR, UAVs

#### 12K FPS target

| Z7045           | FPS   | GOPS/s | BRAM  | LUT          | Latency [us] | Power [W] |
|-----------------|-------|--------|-------|--------------|--------------|-----------|
| MNIST – small   | 12.2k | 0.66   | 15.5  | 4'810 (2%)   | 240          | 8.1       |
| MNIST – large   | 12.2k |        |       | 6'156 (3%)   | 282          | 7.9       |
| CIFAR10 – small | 11.6k | 1'300  | 130.3 | 40'404 (18%) | 550          | 10        |

- Scalability to extremely small footprints

KU115: FPS, GOPS/s, Latency [us], Power [W] (CIFAR10, comparable to AlexNet)

Over best measured numbers on GPU today

High performance, low latency and low power, with equal accuracy on small networks and promising results for larger networks


## Machine Learning Applications



Applications that require large networks and low accuracy (performance, power)

Recommender systems

Data analysis



Applications that require small networks (low latency & speed)

- Wireless: channel equalization
- High Frequency Trading
- Identifying malaria cells
- Speech recognition for voice control





Applications that require large networks and high accuracy

Autonomous driving





Different use cases require different networks & different levels of accuracy
 Statistics, recommender systems, UAV and medical diagnosis have very different requirements



## Accuracy of Binary Networks Improving

Published Results for FP CNNs, BNNs and Extreme Reduced Precision NNs



BNNs are new and accuracy results are improving rapidly

## Others are considering it too: Facebook, Google, Intel













## **PYNQ Overlay Architecture for BNN**

First release: Rigid networks with high performance, basic tool support



#### **BNN 1b**

- 6 layers conv
- 3 layers pooling
- 3 layers FC
- Up to 64 outputs

- ➤ 1a and 1b support fixed topologies that fit into the given footprint
- ➤ Classify images up to 28x28 pixels (1a) or 32x32 (1b)
- ➤ Very high classification speeds (1a => 70kfps, 1b => 6kfps?), very low latency (<1ms)
- ➤ Example use cases: solitaire, handwriting, small colour images, HFT, speech recognition (voice control for robots)
- ➤ (While smaller networks can be mapped onto this architecture, it is not clear that this helps other than for training time)



## PYNQ Overlay Architecture for BNN

Second release: Flexible networks, lower performance, high power efficiency



- ➤ 2 supports networks that consist of convolutional, max pool,
- ➤ Classify images up to 32x32 pixels (1b or 24b)
- ➤ Value: Energy efficiency, experimental platform to get comfortable with FPGAs, gaining trust
- ➤ Example use cases: ImageNet or vision processing tasks



## Software Flow





## **Provisional Timelines**

- ➤ Early release of 1a and 1b for FPGA 2017 with Theano and fixed networks
- ➤ Release of 2 at the end of March 2017 with Tensorpack and flexible network design




## Summary

- ➤ Binary networks provide some interesting performance/resource (cost) trade-offs within the design space
  - Extremely small footprint at slower classification rates for the smallest devices (4.6k (2%) LUTs for 60fps)
  - Very high classification rates (12Mfps)
  - Low latency for applications with real-time requirements such as AR, automotive and robotics
- ➤ Proposed architecture is flexible enough to support different types of neural network topologies (all in C/C++)
  - Number of layers
  - Size and types of layers (convolutions, max pool, fully connected)
  - Experiment without hardware know-how
- ➤ PYNQ provides a release mechanism that makes the technology available to a wide audience

## Many Open Questions

- ➤ Real use cases
  - Medical images?
- ➤ Accuracy
  - Research needed in large binarized neural networks with high accuracy
  - Design space exploration/navigation: accuracy, resources, frame rate
- ➤ Adaptation & integration & cloud service of standard tool chains (for example Caffe)
- ➤ Performance comparisons with GPUs, CPUs, Phis
  - Microbenchmarks
- ➤ Architecture
  - Improving architecture & adding more precision flexibility & adding inception & skip layers
  - Pruning & sparse representations





## Consuming and Producing in Same Sequence For Minimal Buffering & Latency



## IFM & Weight Arrangements in Input Buffer & Sequencing



## Experimental Results (Server)

| State of the Art - Comparison |          |
|-------------------------------|----------|
| Platform (* = estimated)      | [GOPS/s] |
| Titan X (FP32)                | 5'750    |



FPGAs can provide further performance scaling using custom data types plus power and latency reductions

- Demonstrator shows 14.8TOps/s
- Potential to scale up to 46 TOps/s for data center applications
- Consuming < 41 W

## Experimental Results (Embedded)

Z7045 (ZC706)



| State of the Art - Comparison |          |
|-------------------------------|----------|
| Platform (* = estimated)      | [GOPS/s] |
| Tegra X1 (FP16)               | 335      |

Source: see previous slides

#### Roofline Assumptions:

#### Z7045:

- 218k LUTs
- 900 DSPs
- 545 BRAM

| Network | FPS   | GOPS/s | BRAM  | LUT          | Latency<br>[us] | Board<br>Power [W] |
|---------|-------|--------|-------|--------------|-----------------|--------------------|
| FC      | 12.3M | 8'200  | 130.5 | 86'110 (39%) | 0.31            | 21.2               |
| FC      | 12.2k | 0.66   | 15.5  | 4'810 (2%)   | 240             | 8.1                |
| CNV     | 21.9K | 2'465  | 192   | 54'538 (25%) | 283             | 11.7               |



## Results - Power and Latency

| Network | FPS   | Latency [us] | Power (a) [W] | Power (b) [W] |
|---------|-------|--------------|---------------|---------------|
| SFC     | 12.3M | 0.31         | 7.3           | 21.2          |
| LFC     | 1.5M  | 2.44         | 8.8           | 22.6          |
| CNV     | 21906 | 283          | 3.6           | 11.7          |

| Network | FPS   | Latency [us] | Power (a) [W] | Power (b) [W] |
|---------|-------|--------------|---------------|---------------|
| SFC     | 12.2k | 240          | 0.43          | 8.1           |
| LFC     | 12.2k | 282          | 0.8           | 7.9           |
| CNV     | 11.6k | 550          | 2.3           | 10            |

| Network | FPS  | Latency [us] | Power (a) [W] | Power (b) [W] |
|---------|------|--------------|---------------|---------------|
| SFC     | 996  | 43029        | 0.4           | 8             |
| LFC     | 190  | 14551        | 0.3           | 7.4           |
| CNV     | 6.83 |              |               |               |



Figure 3: Inference execution time on Tesla P4 and P40 using TensorRT 2 to optimize the trained neural network, compared to IntelCaffe on CPU. (Based on VGG-19 from IntelCaffe Github. CPU: IntelCaffe, batch size = 4, Intel E5-2690v4, using Intel MKL 2017. GPU: Caffe, batch size = 4, using TensorRT internal version.)

#### Source:

https://devblogs.nvidia.com/parallelforall/new-pascal-gpus-accelerate-inference-in-the-data-center/

(a) PL power

(b) Board level power

Very low latency

- Small network: 12 Mfps, 8.2 TOPS/s: 310 ns latency
- Large network: 21.9 kfps, 2.5 TOPS/s: 283 us latency

Required for real-time applications