AI, computational demand, complex contextual scenarios. Can we make it happen?
Shrinking AI? By Chris Edwards
Communications of the ACM, January 2022, Vol. 65 No. 1, Pages 12-14, DOI: 10.1145/3495562
The computational demand made by artificial intelligence (AI) has soared since the introduction of deep learning more than 15 years ago. Successive experiments have demonstrated the larger the deep neural network (DNN), the more it can do. In turn, developers have seized on the availability of multiprocessor hardware to build models now incorporating billions of trainable parameters.
The growth in DNN capacity now outpaces Moore's Law, at a time when relying on silicon scaling for cost reductions is less assured than it used to be. According to data from chipmaker AMD, cost per wafer for successive nodes has increased at a faster pace in recent generations, offsetting the savings made from being able to pack transistors together more densely (see Figure 1). "We are not getting a free lunch from Moore's Law anymore," says Yakun Sophia Shao, assistant professor in the Electrical Engineering and Computer Sciences department of the University of California, Berkeley.
Though cloud servers can support huge DNN models, the rapid growth in size causes a problem for edge computers and embedded devices. Smart speakers and similar products have demonstrated inferencing can be offloaded to cloud servers and still seem responsive, but consumers have become increasingly concerned over having the contents of their conversations transferred across the Internet to operators' databases. For self-driving vehicles and other robots, the round-trip delay incurred by moving raw data makes real-time control practically impossible.
Specialized accelerators can improve the ability of low-power processors to support complex models, making it possible to run image-recognition models in smartphones. Yet a major focus of R&D is to try to find ways to make the core models far smaller and more energy efficient than their server-based counterparts. The work began with the development of DNN architectures such as ResNet and MobileNet. The designers of MobileNet recognized that the filters used in the convolutional layers common to many image-recognition DNNs require many redundant applications of the multiply-add operations that form the backbone of these algorithms. They showed that by splitting these filters into smaller two-dimensional convolutions, they could cut the number of calculations required by more than 80%.
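The arithmetic behind that saving can be sketched in a few lines. The example below assumes the split in question is the depthwise separable convolution described in the MobileNet paper: one k x k filter per input channel, followed by a 1 x 1 convolution that mixes channels. The function names and the layer dimensions are illustrative assumptions, not figures from the article.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """Multiply-adds for a standard k x k convolution producing an h x w output map."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """Multiply-adds when the same layer is split into depthwise + pointwise steps."""
    depthwise = h * w * c_in * k * k      # one k x k filter per input channel
    pointwise = h * w * c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def relative_saving(h, w, c_in, c_out, k):
    """Fraction of multiply-adds removed by the split (0.0 means no saving)."""
    std = standard_conv_macs(h, w, c_in, c_out, k)
    sep = depthwise_separable_macs(h, w, c_in, c_out, k)
    return 1.0 - sep / std


# Hypothetical mid-network layer: 14 x 14 output, 256 channels in and out, 3 x 3 filters.
saving = relative_saving(14, 14, 256, 256, 3)
```

For this hypothetical layer the cost ratio works out to 1/c_out + 1/k^2, so the saving is about 88%, which is in line with the "more than 80%" quoted above. Layers with fewer output channels or smaller filters would save less.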
A further optimization is layer-fusing, in which successive operations funnel data through the weight calculations and activation operations of more than one layer. Though this does not reduce the number of calculations, it helps avoid repeatedly loading values from main memory; instead, they can sit temporarily in local registers or caches, which can provide a big boost to energy efficiency.
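A minimal sketch of the idea, assuming the simplest possible case: a fully connected layer followed by a ReLU activation. The function names are hypothetical, and plain Python lists stand in for memory buffers. Both versions perform exactly the same arithmetic; the difference is that the unfused version writes every intermediate value to a buffer and reads it back, while the fused version applies the activation while the accumulator is still in a local variable.

```python
def dense_then_relu_unfused(x, weights, bias):
    """Two passes: store all pre-activations, then reload them to apply ReLU."""
    pre_activations = []
    for row, b in zip(weights, bias):
        acc = b
        for w, xi in zip(row, x):
            acc += w * xi
        pre_activations.append(acc)       # intermediate result written out
    return [max(0.0, v) for v in pre_activations]   # intermediate read back


def dense_relu_fused(x, weights, bias):
    """One pass: apply ReLU while the accumulator is still held locally."""
    out = []
    for row, b in zip(weights, bias):
        acc = b
        for w, xi in zip(row, x):
            acc += w * xi
        out.append(max(0.0, acc))         # no intermediate buffer needed
    return out


# Illustrative values only.
x = [1.0, -2.0, 0.5]
weights = [[0.5, 0.25, -1.0], [-1.0, 0.5, 2.0]]
bias = [0.1, -0.2]
same = dense_then_relu_unfused(x, weights, bias) == dense_relu_fused(x, weights, bias)
```

On real hardware the benefit comes from the intermediate staying in a register or cache line instead of travelling to main memory and back; Python cannot show that directly, so this only illustrates the restructuring.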
More than a decade ago, research presented at the 2010 International Symposium on Computer Architecture by a team from Stanford University showed that the logic circuits that perform computations use far less energy than transfers in and out of main memory. With its reliance on large numbers of parameters and data samples, deep learning has made the effect of memory far more apparent than with many earlier algorithms.
Accesses to caches and local scratchpads are less costly in terms of energy and latency than those made to main memory, but making best use of these local memories is difficult. Gemmini, a benchmarking system developed by Shao and colleagues, shows even the decision to split execution across parallel cores affects hardware design choices. On one test of ResNet-50, Shao notes convolutional layers "benefit massively from a larger scratchpad," but in situations where eight or more cores are working in parallel on the same layer, simulations showed a larger level-two cache to be more effective.
Reducing the precision of the calculations that determine each neuron's contribution to the output cuts both the required memory bandwidth and the energy needed for computation. Most edge-AI processors now use many 8-bit integer units in parallel, rather than focusing on accelerating the 32-bit floating-point operations used during training. More than 10 8-bit multipliers can fit into the space taken up by a single 32-bit floating-point unit.
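One common way to get from 32-bit floating-point weights to 8-bit integers is symmetric per-tensor quantization, sketched below. This is an assumption chosen for illustration; the article does not say which scheme any particular edge processor uses, and the function names are hypothetical. Each weight is divided by a shared scale factor and rounded to an integer in the range -127 to 127, so it needs one byte of storage instead of four.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one shared scale factor."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0   # avoid dividing by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


# Illustrative weights only.
weights = [0.42, -1.27, 0.003, 0.9, -0.58]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)

# Worst-case rounding error is half a quantization step.
max_error = max(abs(w - a) for w, a in zip(weights, approx))
```

The trade-off is visible in `max_error`: every weight is off by at most half of `scale`, in exchange for a fourfold reduction in the bytes that must be moved per weight.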
To try to reduce memory bandwidth even further, core developers such as Cadence Design Systems have put compression engines into their products. "We focus a lot on weight compression, but there is also a lot of data coming in, so we compress the tensor and send that to the execution unit," says Pulin Desai, group director of business development at Cadence. The data is decompressed on the fly before being moved into the execution pipeline.
Compression and precision reduction techniques try to maintain the structure of each layer. More aggressive techniques try to exploit the redundancy found in many large models. Often, the influence of individual neurons on the output of a layer is close to zero; other neurons are far more important to the final result. Many edge-AI processors take advantage of this to cull operations that would involve a zero weight well before they reach the arithmetic units. Some pruning techniques force weights with little influence on the output of a neuron to zero, to provide even more scope for savings. ...
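The two steps described here, forcing small weights to zero and then skipping any operation that involves a zero weight, can be sketched as follows. The magnitude threshold is a simple stand-in assumed for illustration; the article does not specify how "little influence" is measured, and the function names are hypothetical.

```python
def prune_by_magnitude(weights, threshold):
    """Force weights whose magnitude is below the threshold to exactly zero."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def zero_skipping_dot(weights, inputs):
    """Dot product that skips zero weights and counts the multiply-adds performed."""
    acc = 0.0
    macs = 0
    for w, x in zip(weights, inputs):
        if w == 0.0:
            continue                      # culled before reaching the arithmetic
        acc += w * x
        macs += 1
    return acc, macs


# Illustrative values only.
weights = [0.8, 0.01, -0.5, 0.0, -0.02, 0.3]
inputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

_, macs_before = zero_skipping_dot(weights, inputs)
pruned = prune_by_magnitude(weights, threshold=0.05)
_, macs_after = zero_skipping_dot(pruned, inputs)
```

With these values, one multiply-add is skipped before pruning because one weight is already zero, and pruning removes two more. The result of the dot product changes slightly after pruning, which is the accuracy cost that has to be weighed against the saving.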