Neural networks are growing!

Neural networks are used throughout cloud and data-centre applications, from spam filters and business data analytics to face recognition FB-DEEPFACE and real-time speech translation GOOGLE-SPEECH.


As they’re applied to more complex tasks, their size grows, and so do their compute requirements. Even the most basic useful networks require hundreds of thousands of multiply-and-add operations.
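
To put a rough number on that, here is a quick back-of-the-envelope count in Python; the 784-300-10 layer sizes are purely illustrative (a classic MNIST-style MLP), not a claim about any particular deployed network:

```python
# Multiply-and-add (MAC) count for a small fully connected network.
# The 784-300-10 layer sizes are illustrative (a classic MNIST-style MLP).
layer_sizes = [784, 300, 10]

macs = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
print(f"MACs per forward pass: {macs:,}")  # 238,200 for this tiny network
```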

Data centres deployed by technology companies are huge energy users. Google used 5,743,793 MWh in 2015 GOOG-ENERGY, roughly the electricity usage of one million homes in Ireland (over half a billion euro worth). It should be noted and applauded, however, that unlike its peers, Google’s energy usage will be 100% renewable from 2017 onwards. Google estimated back in 2013 that if everyone used its voice search for three minutes a day, it would need to double its number of data centres. That prospect was worrisome enough that Google developed its own processor, the TPU, specifically for neural networks (more on this later). Interestingly, DeepMind, a division of Google, designed a neural network to optimize energy usage in Google’s data centres, reducing power consumption by 15% ML-OPTIMIZES-USAGE.

Nvidia GPUs became the go-to off-the-shelf processors for training and executing neural networks. Their design for high-throughput parallel execution of floating-point operations in graphics lends itself to the intensive floating-point computation required by neural network training. It is now generally accepted, however, that quantized networks are sufficient after training (for the bulk of execution): the floating-point numbers are reduced to much narrower 8-bit integers. With this, industry leaders began to develop their own systems to exploit the energy-efficiency and performance potential on offer.
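
As a minimal sketch of what that reduction looks like, assuming a simple symmetric linear quantization scheme (real toolchains use calibrated, often per-channel scales):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization of float weights to 8-bit integers.
    A generic sketch; real toolchains use calibrated, often per-channel scales."""
    scale = np.abs(w).max() / 127.0            # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(300, 784).astype(np.float32) * 0.1   # dummy weight matrix
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale                      # dequantized approximation
print("max abs quantization error:", np.abs(w - w_hat).max())
```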

Efforts to keep up…

Before turning to specialized hardware, quantized neural networks were already recognized as an effective way to achieve energy efficiency. Techniques have been documented CPU-TECHNIQUES for reducing the computational burden of neural networks on x86 CPUs. Fixed-point quantized networks proved to outperform optimized BLAS packages by a factor of 3 when used in conjunction with the Intel SSSE3/SSE4 instruction sets, performing 16 parallel multiply-and-add operations on 8-bit integers without a reduction in accuracy.
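
The cited work uses SSE intrinsics directly; the numpy sketch below only emulates the underlying idea, an 8-bit fixed-point dot product accumulated in wider integers, of which a SIMD unit can perform 16 multiply-and-adds in a single instruction:

```python
import numpy as np

def dot_int8(a_q, b_q):
    """8-bit fixed-point dot product, accumulated in wider integers.
    A SIMD unit performs 16 of these 8-bit multiply-and-adds per
    instruction; numpy simply vectorises the same arithmetic."""
    return int(np.dot(a_q.astype(np.int32), b_q.astype(np.int32)))

a_q = np.random.randint(-127, 128, size=256, dtype=np.int8)   # quantized activations
b_q = np.random.randint(-127, 128, size=256, dtype=np.int8)   # quantized weights
print(dot_int8(a_q, b_q))   # exact integer result, no floating-point hardware needed
```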

Microsoft revealed Project Catapult CATAPULT in 2016, detailing the use of Altera FPGAs in its cloud servers as flexible, low-cost compute accelerators. Tested against the machine learning models behind Bing search page ranking, the FPGAs more than doubled throughput for only an extra 10% in energy. Originally proposed for processing search queries, their responsibilities have expanded across Azure applications. Using FPGAs allows flexibility in design: they reportedly handle neural network workloads as well as data compression, encryption, and networking for Microsoft’s applications. The large-scale deployment of Altera FPGAs led to Intel’s €15B buyout of Altera in 2015. Not long after, IBM announced a partnership with FPGA manufacturer Xilinx in November 2015, building its own FPGA-accelerated systems.

Google went further, producing a processor designed specifically for neural networks, the TPU. This ASIC concentrates on performing 8/16-bit multiply-and-add operations with predictable, responsive latency. The TPU runs as a co-processor on the PCIe bus, performing neural network inference as instructed by the host CPU. Compared with contemporary CPUs/GPUs, the result is 15-30x the speed and 30-80x the energy efficiency, eclipsing Microsoft’s accelerators at the expense of more focused, less flexible hardware.
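
The arithmetic pattern at the heart of such 8-bit inference looks roughly like the sketch below: int8 inputs and weights, wide integer accumulation, then a rescale for the next layer. This is only an illustration of the idea, not the TPU’s actual pipeline, and the scale values are arbitrary:

```python
import numpy as np

def quantized_layer(x_q, w_q, x_scale, w_scale, out_scale):
    """Sketch of 8-bit inference arithmetic: int8 activations times int8
    weights, accumulated in 32-bit integers, then rescaled to int8 for the
    next layer. Scales are assumed to be chosen offline; this illustrates
    the idea, not the TPU's actual pipeline."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)     # wide accumulator
    real = acc * (x_scale * w_scale)                       # back to real values
    return np.clip(np.round(real / out_scale), -127, 127).astype(np.int8)

x_q = np.random.randint(-127, 128, size=(1, 784), dtype=np.int8)
w_q = np.random.randint(-127, 128, size=(784, 300), dtype=np.int8)
y_q = quantized_layer(x_q, w_q, x_scale=0.02, w_scale=0.01, out_scale=0.05)
print(y_q.shape, y_q.dtype)   # (1, 300) int8
```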

Neuromorphic, brain-like processors are another breed of computer design (they try to mimic the brain’s neuro-biological architecture), and they are beginning to show signs of their potential. Several research groups and companies are working on such cognitive processors: Qualcomm NPU, Stanford NEUROGRID, and the Human Brain Project group BRAINSCALES.
IBM, however, has produced the TrueNorth chip TRUENORTH and demonstrated the ability to run neural network architectures as single-bit synaptic networks TRUENORTH-NN. Although it achieved ground-breaking energy efficiency, it came at the expense of accuracy.
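
To make the single-bit-synapse idea concrete, here is a toy sketch in which weights are restricted to +1/-1, so each “multiplication” collapses to an add or a subtract. It illustrates only the weight quantization, not TrueNorth’s actual spiking neuron model, and the sizes are arbitrary:

```python
import numpy as np

# Toy sketch of single-bit synapses: weights restricted to +1/-1, so each
# "multiplication" is just an add or a subtract. Sizes are arbitrary, and this
# is only the weight-quantization idea, not TrueNorth's spiking neuron model.
rng = np.random.default_rng(0)
x = rng.standard_normal(64)                    # neuron inputs
w_binary = np.sign(rng.standard_normal(64))    # 1-bit synaptic weights (+1 / -1)

pre_activation = np.sum(np.where(w_binary > 0, x, -x))   # no multiplies needed
spike = 1 if pre_activation > 0 else 0                   # simple threshold output
print(pre_activation, spike)
```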

Neural networks in IoT?

Embedded mobile processors, such as the one in your smartphone, can use as little as 1-2% of the power of a typical CPU. These recent innovations therefore lend themselves to deploying neural networks across embedded devices, not just data centres.
While large tech companies might like storing all our data on their premises, data that isn’t transmitted over a network is far more secure. It is also cheaper and more efficient to keep the processing local to the device. This is just speculation, but it is interesting that the Apple iPhone 7 now ships with an FPGA chip IPHONE-FPGA, possibly for accelerating neural network computation on-device rather than outsourcing it to the cloud; it’s a bit hush-hush for now.

How far can quantized neural networks go?

Quantized networks offer huge potential: better cache performance, less disk space, faster computation, and greater energy efficiency. The challenge is to design algorithms for quantized networks that keep up with the high accuracy of standard networks. We know neural networks can cope with noise in their input data; quantized weights and activations are just another form of noise, and, as with noisy inputs, the network can cope and still produce accurate results.
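
As a small illustration of that point (the sizes and scales below are arbitrary), quantizing a layer’s weights to 8 bits perturbs its output by only a small relative amount, much as mild input noise would:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal((256, 100)) * 0.1      # full-precision weights
x = rng.standard_normal(256)                   # one input vector

scale = np.abs(w).max() / 127.0
w_q = np.round(w / scale).astype(np.int8)      # 8-bit "noisy" weights

y_float = x @ w
y_quant = (x @ w_q) * scale                    # dequantise the result

# The quantization error shows up as small additive noise on the output.
rel_err = np.linalg.norm(y_float - y_quant) / np.linalg.norm(y_float)
print(f"relative output error: {rel_err:.4%}")
```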

As part of the Nano2017-ESPRIT project at SLIDE, we explored the capabilities of highly quantized neural networks on off-the-shelf hardware such as Xilinx FPGAs. Our work demonstrated that extremely quantized neural networks can balance large energy-efficiency gains with respectable accuracy. Training is still performed with high-precision floating-point weights, while ternary counterparts learn to mimic them. Neuron activations are also ternarized: stochastically during training and with deterministic thresholds afterwards. Following training, the network assumes its extreme quantized form of 2-bit ternary weights and activations (not quite the 1-bit extreme of TrueNorth). As a result, multiplication within a neuron can be performed by a single FPGA LUT, achieving power efficiency equal to TrueNorth with higher throughput and accuracy. For an MLP we achieve similarly impressive power savings, 255x the throughput, and 98.14% accuracy on MNIST compared to TrueNorth’s 95% TNN-PAPER. The work was further proven on the more challenging CIFAR100 & GTSRB datasets.
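
A rough sketch of what the ternarization step might look like; the thresholds and the stochastic rounding rule below are illustrative choices, not the exact procedure from TNN-PAPER:

```python
import numpy as np

def ternarize_deterministic(w, threshold=0.3):
    """Deterministic ternarization for inference: values near zero map to 0,
    the rest to +1 or -1. The threshold (relative to the max magnitude) is
    an illustrative choice."""
    t = threshold * np.abs(w).max()
    return np.where(w > t, 1, np.where(w < -t, -1, 0)).astype(np.int8)

def ternarize_stochastic(w, rng):
    """Stochastic ternarization as might be used during training: each weight
    rounds to +1/-1 with probability proportional to its magnitude, else to 0.
    An illustrative rule, not the exact procedure from TNN-PAPER."""
    w_clipped = np.clip(w, -1.0, 1.0)
    prob = np.abs(w_clipped)                              # chance of +/-1 over 0
    return (np.sign(w_clipped) * (rng.random(w.shape) < prob)).astype(np.int8)

rng = np.random.default_rng(42)
w = rng.standard_normal((4, 4)) * 0.5                     # dummy float weights
print(ternarize_deterministic(w))
print(ternarize_stochastic(w, rng))
```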

Our insatiable demand for more “intelligent” applications has led neural networks to pervade the cloud data centre. Energy efficiency (or at least cost of ownership) without sacrificing performance is driving innovation across the industry. As we’ve seen, initial experiments with hardware accelerators have borne fruit for the big companies, and I get the impression things are only warming up.
