The Only *Analog* Deep Learning Processors (And Why Bother?)

I led the creation of what is still the largest analog integrated circuit in history. Its 440,000 analog transistors did the work of a 30,000,000-transistor digital tensor processing core, providing an additional ~10x improvement in ops per Joule for the first deep learning microchips.

In 2010, this core was a plug-and-play substitute for the processing core in the first deep learning processors that we had created. The final report for DARPA is here.

The main innovation took advantage of the fact that weights and activations in deep learning models can be quantized to 8-bit numbers without a meaningful loss of accuracy.
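
For concreteness, here is a minimal sketch of the kind of symmetric 8-bit quantization this relies on. The function names and the per-tensor scale are illustrative choices of mine, not the actual scheme used in our chips:

```python
import numpy as np

def quantize_int8(x, scale=None):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    if scale is None:
        scale = np.max(np.abs(x)) / 127.0  # one scale per tensor (a simplifying assumption)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Toy check: the quantization error is small relative to the values themselves.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=(256, 256)).astype(np.float32)   # stand-in "weights"
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w)
print(f"max |error| = {err.max():.6f}, max |weight| = {np.abs(w).max():.6f}")
```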

The noise we observed in our circuits was about 1/128th of our 1.8V power supply, so we could essentially replace a 7-wire digital bus with a single wire carrying an analog current. (In practice we used a differential pair – two wires – to represent each analog value more robustly.)

Dropping from 7 wires to 2 doesn’t seem like a big enough win to justify the effort of designing analog circuits. The win gets even slimmer if you have decided you can quantize your weights and activations down to 2 bits: in that case we simply replaced 2 digital wires with 2 analog wires. So why bother?

The real win (about 10x in ops per Joule) came from two things:

  1. You get to use far fewer transistors in the multiply-accumulate. Instead of the few thousand digital transistors needed to multiply two 8-bit numbers, we could use just 6 analog transistors – roughly 500x fewer! Even a 2-bit multiply-accumulate would still require around 100 digital transistors, so the analog version is still a 16x win in transistor count.
  2. Less intense switching provided a significant additional advantage. On average our analog wires were not switching between 0 and 1; they were varying between intermediate current values.

    In the digital version, switching our wires from 1.8V to 0V and back again dissipates the majority of the power in our processor. On any given digital wire, this happens on roughly half of the processor’s clock ticks.

    By contrast, in the analog version, because weights and activations are statistically distributed around 0, the currents in our analog wires changed far less on average. (A rough simulation of this contrast appears just below.)
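
To make the switching contrast concrete, here is a back-of-the-envelope simulation – my illustration, not a measurement from the chip, and not a real power model. It compares how often the wires of an 8-bit digital bus flip against how far a single analog current moves each clock tick, when the values cluster around zero the way weights and activations tend to:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cycles = 100_000

# Zero-mean, roughly Gaussian 8-bit values, mimicking weights/activations
# that cluster around 0 (an assumed toy distribution, not chip data).
vals = np.clip(np.round(rng.normal(0, 20, n_cycles)), -128, 127).astype(np.int64)

# Digital bus: 8 wires carry the two's-complement bits of each value.
bits = ((vals[:, None] & 0xFF) >> np.arange(8)) & 1        # shape (n_cycles, 8)
toggle_rate = np.abs(np.diff(bits, axis=0)).mean()         # fraction of ticks a wire flips

# Analog wire: one current proportional to the value; how far does it move
# each tick, as a fraction of the full-scale range?
analog_step = np.abs(np.diff(vals)).mean() / 255.0

print(f"digital: a wire flips on ~{toggle_rate:.2f} of clock ticks")
print(f"analog:  the current moves ~{analog_step:.2f} of full scale per tick")
```

On this toy distribution the digital wires flip on about half the ticks, while the analog current typically moves by only a small fraction of its full-scale swing.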

Would all of this still work in a modern 1nm semiconductor process? One would have to try it to find out for certain, but it’s fairly likely to. If the noise floor in chips has not changed very significantly, then dropping the power supply from 1.8V to 0.5V would still give us 5 bits of analog resolution for representing weights and activations, and 5 bits should still be adequate for today’s deep learning quantization schemes.
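
The arithmetic behind that estimate is short enough to write down, under the stated assumption that the absolute noise floor stays roughly where it was:

```python
import math

def analog_bits(v_supply, v_noise):
    """Distinguishable analog levels, expressed in bits, for a given noise floor."""
    return math.log2(v_supply / v_noise)

v_noise = 1.8 / 128          # the noise floor we observed: ~1/128 of a 1.8V supply
print(f"1.8V supply: {analog_bits(1.8, v_noise):.1f} bits")   # ~7 bits
print(f"0.5V supply: {analog_bits(0.5, v_noise):.1f} bits")   # ~5 bits, if the noise floor is unchanged
```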

In 2010, a small fraction of the world’s computing workloads involved deep learning. Today deep learning workloads are becoming a significant driver of global energy consumption: there is talk of deep learning compute consuming the equivalent of dozens of Manhattans. Even a win smaller than 10x matters more today than it did then. In addition, we are increasingly interested in deep learning chips inside our portable devices, where again a 5-10x win could be the difference between an incredibly smart phone and a smart phone of only middling intelligence.