Computation need for AI soars, as devices expand, and help with Moore's Law
Shrinking Artificial Intelligence By Chris Edwards
Communications of the ACM, January 2022, Vol. 65 No. 1, Pages 12-14 10.1145/3495562
The computational demand made by artificial intelligence (AI) has soared since the introduction of deep learning more than 15 years ago. Successive experiments have demonstrated the larger the deep neural network (DNN), the more it can do. In turn, developers have seized on the availability of multiprocessor hardware to build models now incorporating billions of trainable parameters.
The growth in DNN capacity now outpaces Moore's Law, at a time when relying on silicon scaling for cost reductions is less assured than it used to be. According to data from chipmaker AMD, cost per wafer for successive nodes has increased at a faster pace in recent generations, offsetting the savings made from being able to pack transistors together more densely (see Figure 1). "We are not getting a free lunch from Moore's Law anymore," says Yakun Sophia Shao, assistant professor in the Electrical Engineering and Computer Sciences department of the University of California, Berkeley.
Though cloud servers can support huge DNN models, the rapid growth in size causes a problem for edge computers and embedded devices. Smart speakers and similar products have demonstrated inferencing can be offloaded to cloud servers and still seem responsive, but consumers have become increasingly concerned over having the contents of their conversations transferred across the Internet to operators' databases. For self-driving vehicles and other robots, the round-trip delay incurred by moving raw data makes real-time control practically impossible.
Specialized accelerators can improve the ability of low-power processors to support complex models, making it possible to run image-recognition models in smartphones. Yet a major focus of R&D is to try to find ways to make the core models far smaller and more energy efficient than their server-based counterparts. The work began with the development of DNN architectures such as ResNet and Mobilenet. The designers of Mobilenet recognized the filters used in the convolutional layers common to many image-recognition DNNs require many redundant applications of the multiply-add operations that form the backbone of these algorithms. The Mobilenet creators showed that by splitting these filters into smaller two-dimensional convolutions, they could cut the number of calculations required by more than 80%.
A further optimization is layer-fusing, in which successive operations funnel data through the weight calculations and activation operations of more than one layer. Though this does not reduce the number of calculations, it helps avoid repeatedly loading values from main memory; instead, they can sit temporarily in local registers or caches, which can provide a big boost to energy efficiency.
More than a decade ago, research presented at the 2010 International Symposium on Computer Architecture by a team from Stanford University showed the logic circuits that perform computations use far less energy compared to what is needed for transfers in and out of main memory. With its reliance on large numbers of parameters and data samples, deep learning has made the effect of memory far more apparent than with many earlier algorithms.
Accesses to caches and local scratchpads are less costly in terms of energy and latency than those made to main memory, but making best use of these local memories is difficult. Gemmini, a benchmarking system developed by Shao and colleagues, shows even the decision to split execution across parallel cores affects hardware design choices. On one test of ResNet-50, Shao notes convolutional layers "benefit massively from a larger scratchpad," but in situations where eight or more cores are working in parallel on the same layer, simulations showed larger level-two cache as more effective.
Reducing the precision of the calculations that determine each neuron's contribution to the output both cuts the required memory bandwidth and energy for computation. Most edge-AI processors now use many 8-bit integer units in parallel, rather than focusing on accelerating the 32-bit floating-point operations used during training. More than 10 8-bit multipliers can fit into the space taken up by a single 32-bit floating-point unit.
With its reliance on large numbers of parameters and data samples, deep learning has made the effect of memory far more apparent than with earlier algorithms. ... '
No comments:
Post a Comment