Tensor Processing Unit

TPU block diagram. Source: Sato et al. 2017.

A Tensor Processing Unit (TPU) is an ASIC announced by Google for executing Machine Learning (ML) algorithms. CPUs are general-purpose processors. GPUs are better suited to graphics and other tasks that benefit from parallel execution. DSPs work well for signal processing tasks that typically require mathematical precision. TPUs, on the other hand, are optimized for ML. While any of the others could also be used for ML, TPUs are expected to deliver better performance per watt for ML. In fact, TPUs are said to catapult computing power seven years into the future, which is equivalent to three generations of Moore's Law.

Google claims that TPUs are tailored for running TensorFlow, which is an open-source software library for Machine Intelligence.

Discussion

What is Google's interest in making the TPU?
TPU Pod: 64xTPU2. Source: Tung 2017, © Google.
Google has claimed that "great software shines brightest with great hardware underneath." This is particularly true of ML where a TPU would offer software the requisite power to run faster and hence process more data. Google wants to use TPUs to power its ML algorithms. As of May 2016, more than 100 teams are said to be using ML within Google. Google Today, Street View, Inbox Smart Reply, RankBrain and voice search are products that are already benefiting from TPU hardware. AlphaGo used TPUs to defeat Go world champion Lee Sedol.
Beyond Google's internal projects, TPUs can offer an advantage for all ML applications implemented in TensorFlow. ML applications looking to run out of a cloud infrastructure will tend to prefer Google Cloud Platform powered by TPUs. Likewise, TPUs may be a differentiator for Google Cloud Platform when application developers select an ML service API for their applications. For example, Google Cloud Machine Learning is a managed ML service from Google that will directly benefit from TPUs.
There's also the claim that TPU may be Google's answer to Intel's Xeon processors that dominate datacenters.
Can TPUs be used for ML frameworks other than TensorFlow?
TensorFlow is not the only framework for ML. More specifically, there are multiple frameworks for Deep Learning (DL).
Google has not disclosed whether TensorFlow algorithms are hardwired in TPU or whether TPU is a generic accelerator for ML. Instead, Google has said that TPU is "tailored for TensorFlow". However, with the release of Cloud TPU, a Facebook researcher and creator of PyTorch commented that there's a plan to port PyTorch to TPU.
Could you compare the performance of TPU against CPU or GPU?
TPU outperforms CPU and GPU for various neural networks in terms of predictions per second. Source: Sato et al. 2017.
Tests show 83x performance per watt gain over CPUs and 29x gain over GPUs. When tested for various neural networks, in terms of predictions per second, we see a 71x gain over CPUs for the particular case of a convolutional neural network (CNN) of 89 layers and 100 million weights.
Within Google's AI workloads, speed gains are in the range of 15x to 30x. In terms of energy efficiency, gains are in the range of 30x to 80x.
Note that the numbers above are for the initial TPU version. With later versions of TPU, we can expect higher performance gains.
How is TPU able to achieve its superior performance compared to other processor types?
Whereas CPUs are generic processors, the design of TPU is focused on ML workloads. The following design choices give it superior performance:
- Quantization: Integer operations are used instead of floating point operations. Precision is sacrificed for performance.
- CISC & Matrix Multiplier Unit (MXU): TPU prefers a CISC over a RISC instruction architecture. A single instruction can trigger complex operations that are handled by the MXU. A CPU is a scalar processor; a GPU is a vector processor; the MXU is a matrix processor that can perform hundreds of thousands of operations in a single clock cycle.
- Systolic Array: Arithmetic operations, done in Arithmetic Logic Units (ALUs), are chained together so that results pass directly from one ALU to the next, thus reducing register accesses. The generality of CPUs/GPUs is sacrificed for a simpler, energy-efficient design, since the ALUs in the TPU perform operations only in fixed patterns.
- Minimal & Deterministic: CPUs and GPUs have a large control logic to handle caches, branch prediction, out-of-order execution, and so on. TPU's control logic is only 2% of the die. This minimalism also makes it deterministic: we can predict execution latency.
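The first three design choices can be illustrated with a small software sketch. It is not Google's implementation: the symmetric 8-bit quantization scheme, the output-stationary schedule and all function names are assumptions chosen for simplicity. The idea it shows is that floats are mapped to small integers, the integer products are accumulated cell by cell in a fixed, skewed pattern, and the result is scaled back to floats only at the end.

```python
def quantize(values, num_bits=8):
    """Symmetric linear quantization of a 2-D list of floats (assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for row in values for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [[max(-qmax, min(qmax, round(v / scale))) for v in row] for row in values]
    return q, scale


def systolic_matmul(a, b):
    """Integer matrix multiply following a systolic-style schedule.

    Cell (i, j) keeps one accumulator. Operands arrive skewed in time, so at
    step t the cell multiplies a[i][k] by b[k][j] with k = t - i - j.
    This output-stationary layout is an assumption for illustration.
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]
    total_steps = n + m + k_dim - 2           # last step is n + m + k_dim - 3
    for t in range(total_steps):
        for i in range(n):
            for j in range(m):
                k = t - i - j
                if 0 <= k < k_dim:
                    acc[i][j] += a[i][k] * b[k][j]
    return acc


def quantized_matmul(x, w):
    """Quantize both inputs, multiply as integers, then rescale to floats."""
    qx, sx = quantize(x)
    qw, sw = quantize(w)
    acc = systolic_matmul(qx, qw)
    return [[v * sx * sw for v in row] for row in acc]


def float_matmul(x, w):
    """Reference floating point result for comparison."""
    return [[sum(x[i][k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]


x = [[0.5, -1.0], [0.25, 0.75]]
w = [[1.0, 0.5], [-0.5, 0.25]]
print("quantized:", quantized_matmul(x, w))
print("float:    ", float_matmul(x, w))
```

The two printed results should agree closely but not exactly, which is the precision-for-performance trade-off described under Quantization.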
Doesn't low-precision arithmetic reduce accuracy of calculations?
Research has shown that deep learning algorithms are largely unaffected by low-precision arithmetic. In fact, low-precision arithmetic can be used for both training and inference. This is because ML is essentially probabilistic in nature, so high-precision arithmetic is unnecessary. One writer reported that "having more data that is less precise yield better results than having half as much data that was more precise." Moreover, adding noise during training can improve performance.
One report claimed that Google intends to use TPU only for inference. The report added that low-precision arithmetic is not suited for training. Since the release of TPU2, it's clear that TPUs can be used for both training and inference. With both TPU2 and TPU3, perhaps due to ML training needs, Google's own bfloat16 format is used for floating point operations.
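To make bfloat16 concrete: it keeps the sign bit and the 8 exponent bits of a 32-bit float but only 7 mantissa bits, so it covers the same range as float32 with much less precision. The sketch below emulates this in plain Python by keeping the top 16 bits of the float32 pattern. Truncation is an assumption made for brevity (hardware may round instead), and the function names are illustrative.

```python
import struct


def float_to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern for x by truncating its float32 form."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16                      # drop the low 16 mantissa bits


def bfloat16_bits_to_float(bits16):
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))
    return value


def to_bfloat16(x):
    """Round-trip a value through the emulated bfloat16 format."""
    return bfloat16_bits_to_float(float_to_bfloat16_bits(x))


for x in (1.0, 3.14159265, 1e-3, 1e38):
    print(x, hex(float_to_bfloat16_bits(x)), to_bfloat16(x))
```

Very large and very small magnitudes survive the round trip, while only about two to three significant decimal digits are retained.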
What's the competition for Google's TPU?
Nvidia dominates the ML processor market with its GPUs. Nvidia has specialized its Tesla GPUs, based on the Pascal architecture, for ML. These can be used either for training or for inference. Nvidia's Volta, equipped with 640 tensor cores, is Pascal's successor.
Movidius makes Visual Processing Units (VPUs), named Myriad 2, that offer visual intelligence at device level. Intel announced in November 2016 an AI processor named Nervana for both training and inference. Its first commercial version, named Nervana Neural Net L-1000, was announced in May 2018. IBM's own chip named TrueNorth is capable of deep learning inferences. In mid-2018, IBM announced an unnamed prototype chip capable of both training and inference.
Microsoft has been using FPGAs in its datacenters since these can be configured easily unlike ASICs. In May 2018, Microsoft announced Project Brainwave and claims that it makes Azure the fastest cloud for real-time AI.
ARM is promoting its MALI GPUs to offload ML processing from its Cortex CPUs. Others in the AI chip space include Graphcore, Cerebras and Vathys.
Are there any performance results comparing TPUs against Nvidia's GPUs?
Raw throughput (images/sec) for training on auto-generated images without pre-processing. Source: Haußmann 2018.
One engineer at RiseML has compared Cloud TPU (consisting of 4 TPUv2) against four Nvidia V100s. In both cases, each core had 16 GB of memory. The former was on Google Cloud while the latter was on AWS. The performance test was on ResNet-50.
In terms of raw throughput without any pre-processing, both had comparable results. However, when batch sizes were decreased, V100s showed better performance. When looking at cost, Google Cloud offers better value for money.
In the real world, we are more interested in the cost for achieving a desired level of accuracy. Cloud TPU outperforms here, costing $55 for an accuracy of 75.7%. The same on AWS reserved instances costs $88. Cloud TPU also results in faster convergence, perhaps due to pre-processing. V100s achieve a final accuracy of 75.7% after 84 epochs, whereas Cloud TPU does it in only 64 epochs. An epoch in deep learning is a single pass of the dataset through the neural network; multiple epochs are needed for convergence.
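As a quick check on what these figures imply, the following sketch computes the relative savings using only the numbers quoted above. The helper name is illustrative. It works out to Cloud TPU being 37.5% cheaper and needing about 24% fewer epochs to reach the same accuracy.

```python
def relative_saving(baseline, improved):
    """Fraction saved when moving from baseline to improved (same units)."""
    return 1 - improved / baseline


# Figures quoted in the comparison above
AWS_V100_COST_USD = 88
CLOUD_TPU_COST_USD = 55
V100_EPOCHS = 84
CLOUD_TPU_EPOCHS = 64

cost_saving = relative_saving(AWS_V100_COST_USD, CLOUD_TPU_COST_USD)
epoch_saving = relative_saving(V100_EPOCHS, CLOUD_TPU_EPOCHS)

print(f"Cost saving with Cloud TPU: {cost_saving:.1%}")
print(f"Fewer epochs with Cloud TPU: {epoch_saving:.1%}")
```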
Is Google's TPU in any way connected to SGI's product of the same name?
No. Silicon Graphics had something called a TPU in its workstations in the 2000s. It was an advanced DSP that used dynamic shared-memory access. This has nothing to do with Google's TPU.

Milestones

2013

Google realizes that with the growing computational demand of neural networks, it would have to double the number of data centers. This concern triggers the design and development of TPU ASIC.

May
2016

The TPU on a PCB. Source: Schneider 2017, © Google.

Google announces that it's been using TPUs in its data centers for ML for more than a year. In its original form, it's packaged as an external accelerator card that can be easily installed into a SATA hard disk slot. The TPU ASIC is built on a 28nm process, runs at 700MHz and consumes 40W.

Jan
2017

Qualcomm announces that it has optimized TensorFlow for the Hexagon 682 DSP.

Dec
2017

Google reveals details of TPU2, the second-generation TPU. Each TPU2, composed of four TPU2 chips, delivers 180 teraflops and has 64GB of high-bandwidth memory with 2,400GB/s of memory bandwidth. These can be interconnected to make a TPU2 Pod capable of 11.5 petaflops.
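The pod figure follows from the per-device numbers. The sketch below multiplies the 180 teraflops per TPU2 by the 64 TPU2 devices shown in the pod figure caption ("64xTPU2"). The per-chip values assume resources are split evenly across the four chips, which is an assumption and not something the announcement states.

```python
# Figures from the TPU2 announcement and the pod figure caption
TFLOPS_PER_TPU2 = 180
HBM_GB_PER_TPU2 = 64
BANDWIDTH_GBPS_PER_TPU2 = 2400
CHIPS_PER_TPU2 = 4
TPU2_PER_POD = 64

# Pod throughput: 64 x 180 teraflops, expressed in petaflops
pod_pflops = TFLOPS_PER_TPU2 * TPU2_PER_POD / 1000

# Per-chip figures, assuming an even split across the four chips
chip_tflops = TFLOPS_PER_TPU2 / CHIPS_PER_TPU2
chip_hbm_gb = HBM_GB_PER_TPU2 / CHIPS_PER_TPU2
chip_bandwidth_gbps = BANDWIDTH_GBPS_PER_TPU2 / CHIPS_PER_TPU2

print(f"Pod: {pod_pflops:.2f} petaflops")
print(f"Per chip: {chip_tflops:.0f} teraflops, {chip_hbm_gb:.0f} GB, "
      f"{chip_bandwidth_gbps:.0f} GB/s")
```

The computed 11.52 petaflops matches the quoted 11.5 petaflops after rounding.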

Feb
2018

As a beta release, users can use Cloud TPU on Google Cloud Platform to accelerate the training of ML models. This is really TPU2 offered on the cloud. Models can now be trained overnight rather than in days or weeks. No special programming expertise is needed; TensorFlow can be used and reference implementations are available. The cost is $6.50/hour compared to Amazon's $24/hour.

May
2018