This is a guest post from Quenton Hall, AI System Architect for Industrial, Scientific and Medical applications.
In 2014, Stanford Professor Mark Horowitz published a paper entitled “Computing’s Energy Problem (and what we can do about it)”. This seminal paper discussed the challenges that the semiconductor industry faces related to the breakdown of Dennard Scaling and Moore’s Law.
If I can be so bold, I would like to borrow and adapt the title of Mark’s paper so that I might provide some perspectives as to why you should consider specialized hardware for Machine Learning inference applications.
First, let’s consider the problem. In approximately 2005, processor core clock frequencies stopped scaling. Shrinking process geometries and decreasing core voltages no longer offer the advantages that they once did. The fundamental problem is that computing has hit the power density (W/mm²) wall.
If we put more cores on the same die, we can increase the number of ops within the same power budget, provided we also reduce the clock frequency somewhat to account for the energy used by the additional cores. It is not by coincidence that AMD and Intel released their first dual-core processors in 2005-2006. However, as we continue to try to increase the number of cores, we must consider the energy per op and the silicon area per op. Moreover, we also need to ensure that we can efficiently parallelize our algorithm by N, where N is the number of cores. A universal solution to this problem, a “Panacea of Compute Saturation” that works for all algorithms, remains elusive. Today the problem is best addressed through the application of adaptable hardware.
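To make the multi-core trade-off above a little more concrete, here is a minimal Python sketch. It is not from Mark's paper. It assumes the textbook first-order model in which dynamic power scales with cores × V² × f, and it uses Amdahl's law to estimate how well an algorithm parallelizes across N cores. The function names, the parallel fraction and the scaling factors are all hypothetical values chosen for illustration.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: speedup on `cores` cores when only
    `parallel_fraction` of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def relative_dynamic_power(cores: int, freq_scale: float,
                           volt_scale: float = 1.0) -> float:
    """Dynamic power relative to one core at nominal frequency and voltage.
    Assumes the first-order model P ~ cores * V^2 * f (an assumption)."""
    return cores * volt_scale ** 2 * freq_scale


def relative_throughput(cores: int, freq_scale: float,
                        parallel_fraction: float) -> float:
    """Throughput relative to one core at nominal frequency."""
    return freq_scale * amdahl_speedup(parallel_fraction, cores)


if __name__ == "__main__":
    # Hypothetical operating point: two cores, each at 80% of the
    # nominal frequency and 80% of the nominal voltage.
    power = relative_dynamic_power(cores=2, freq_scale=0.8, volt_scale=0.8)
    perf = relative_throughput(cores=2, freq_scale=0.8, parallel_fraction=0.9)
    print(f"relative power: {power:.3f}, relative throughput: {perf:.3f}")
```

Under these assumed numbers, the dual-core configuration draws about the same power as the single core (roughly 1.02x) while delivering roughly 1.45x the throughput for an algorithm that is 90% parallel. Lower the parallel fraction, or hold the voltage constant, and the advantage shrinks quickly, which is the point of the paragraph above.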
It turns out that whether your processor design is implemented using a multi-core CPU, GPU or SoC, the overall breakdown in power consumption at a processor level will be roughly the same. If we were to guesstimate a breakdown as follows, we might not be that far off:
What we fail to consider in the above analysis is that there exists an additional plane of optimization available, which is to implement specialized hardware accelerators. Specialized hardware can be optimized to execute a specific function at a very high level of efficiency. Such hardware is typically designed to reduce external memory accesses, reducing both latency and power consumption. Specialized hardware can be optimized such that the data motion portion of a given algorithm will use localized memory (BlockRAM, UltraRAM) for the storage of intermediate results.
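As a rough illustration of why keeping intermediate results in localized memory matters, the sketch below totals the energy of a workload as a function of how many memory accesses stay on-chip. The per-operation and per-access energies are order-of-magnitude placeholders that I have assumed for illustration, not measurements of any particular device, and the `EnergyModel` and `workload_energy_pj` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnergyModel:
    """Assumed energy costs in picojoules (illustrative placeholders)."""
    op_pj: float = 1.0              # one arithmetic operation
    local_access_pj: float = 5.0    # one access to on-chip memory
    external_access_pj: float = 640.0  # one access to external DRAM


def workload_energy_pj(model: EnergyModel, ops: int, accesses: int,
                       local_fraction: float) -> float:
    """Total energy when `local_fraction` of the memory accesses are
    served from local memory and the rest go to external memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be in [0, 1]")
    local = accesses * local_fraction
    external = accesses - local
    return (ops * model.op_pj
            + local * model.local_access_pj
            + external * model.external_access_pj)


if __name__ == "__main__":
    model = EnergyModel()
    for fraction in (0.0, 0.5, 0.9, 1.0):
        total = workload_energy_pj(model, ops=1000, accesses=1000,
                                   local_fraction=fraction)
        print(f"local fraction {fraction:.1f}: {total:,.0f} pJ")
```

With these assumed costs, moving every access from external to local memory cuts the total energy by roughly two orders of magnitude, and the arithmetic itself becomes a meaningful share of the budget only once most of the data motion is local.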
Designing an efficient accelerator is a multi-dimensional design problem:
In Part 2 of this post, we will discuss and evaluate how Xilinx’s adaptable hardware and DNNDK address these challenges, specifically as it relates to machine learning inference. Until next time, I would suggest that you review Mark’s excellent talk on this subject, and then ponder how you might use adaptable hardware to your strategic advantage in your next design.
Interested in more on AI Camera Development? Join us October 22 for a webinar on “Accelerating AI Camera Development” with Quenton Hall, AI System Architect. Find out more and register at: https://event.on24.com/wcc/r/2099987/0590AEFDCE940FE23F526E995EF8FA6E?partnerref=ism.