• TPU block diagram. Source: Sato et al. 2017.
• Different versions of TPU compared. Source: Teich 2018.
• TPU outperforms CPU and GPU for various neural networks in terms of predictions per second. Source: Sato et al. 2017.

# Tensor Processing Unit

## Summary

Tensor Processing Unit (TPU) is an ASIC announced by Google for executing Machine Learning (ML) algorithms. CPUs are general purpose processors. GPUs are more suited for graphics and tasks that can benefit from parallel execution. DSPs work well for signal processing tasks that typically require mathematical precision. On the other hand, TPUs are optimized for ML. While any of the others could also be used for ML, TPUs are expected to bring better performance per watt for ML. In fact, TPUs are said to catapult computing power seven years into the future, which is equivalent to three generations of Moore's Law.

Google claims that TPUs are tailored for running TensorFlow, which is an open-source software library for Machine Intelligence.

## Milestones

2013

Google realizes that with the growing computational demand of neural networks, it would have to double the number of data centers. This concern triggers the design and development of TPU ASIC.

May 2016

Google announces that it's been using TPUs in its data centers for ML for more than a year. In its original form, the TPU is packaged as an external accelerator card that can be easily installed into a SATA hard disk slot. The TPU ASIC is built on a 28nm process, runs at 700MHz and consumes 40W.

Dec 2017

Google reveals details of TPU2, the second-generation TPU. Each TPU2 device, composed of four TPU2 chips, delivers 180 teraflops and has 64GB of high-bandwidth memory with 2,400GB/s of memory bandwidth. These devices can be interconnected to make a TPU2 Pod capable of 11.5 petaflops.

Feb 2018

In a beta release, Cloud TPU becomes available on Google Cloud Platform to accelerate the training of ML models. This is really TPU2 offered on the cloud. Models can now be trained overnight rather than in days or weeks. No special programming expertise is needed; TensorFlow can be used and reference implementations are available. The cost is $6.50/hour compared to Amazon's $24/hour.

May 2018

Google announces TPU 3.0, which requires liquid cooling. A single TPU3 chip is capable of 90 teraflops.

## Discussion

• What is Google's interest in making the TPU?

Google has claimed that "great software shines brightest with great hardware underneath." This is particularly true of ML where a TPU would offer software the requisite power to run faster and hence process more data. Google wants to use TPUs to power its ML algorithms. As of May 2016, more than 100 teams are said to be using ML within Google. Google Today, Street View, Inbox Smart Reply, RankBrain and voice search are products that are already benefiting from TPU hardware. AlphaGo used TPUs to defeat Go world champion Lee Sedol.

Beyond Google's internal projects, TPUs can offer an advantage for all ML applications implemented in TensorFlow. ML applications looking to run out of a cloud infrastructure will tend to prefer Google Cloud Platform powered by TPUs. Likewise, TPUs may be a differentiator for Google Cloud Platform when application developers select an ML service API for their applications. For example, Google Cloud Machine Learning is a managed ML service from Google that will directly benefit from TPUs.

There's also the claim that TPU may be Google's answer to Intel's Xeon processors that dominate datacenters.

• Can TPUs be used for ML frameworks other than TensorFlow?

TensorFlow is not the only framework for ML. More specifically, there are multiple frameworks for Deep Learning (DL). However, Google has not disclosed if TensorFlow algorithms are hardwired in TPU or if TPU is a generic accelerator for ML.

• Could you compare the performance of TPU against CPU or GPU?

Tests show 83x performance per watt gain over CPUs and 29x gain over GPUs. When tested for various neural networks, in terms of predictions per second, we see a 71x gain over CPUs for the particular case of a convolutional neural network (CNN) of 89 layers and 100 million weights.

Within Google's AI workloads, speed gains are in the range of 15x to 30x. In terms of energy efficiency, gains are in the range of 30x to 80x.

Note that the numbers above are for the initial TPU version. With later versions of TPU, we can expect higher performance gains.

• How is TPU able to achieve its superior performance compared to other processor types?

Whereas CPUs are generic processors, the design of TPU is focused on ML workloads. The following design choices give it superior performance:

• Quantization: Integer operations are used instead of floating-point operations. Precision is sacrificed for performance.
• CISC & Matrix Multiplier Unit (MXU): TPU prefers a CISC over a RISC instruction architecture. A single instruction can trigger complex operations that are handled by the MXU. A CPU is a scalar processor; a GPU is a vector processor; the MXU is a matrix processor that can perform hundreds of thousands of operations in a single clock cycle.
• Systolic Array: Arithmetic operations, done in Arithmetic Logic Units (ALUs), are chained together so that each result passes directly to the next ALU, thus reducing register accesses. The generality of CPUs/GPUs is sacrificed for a simpler, energy-efficient design since the ALUs in a systolic array perform operations in fixed patterns.
• Minimal & Deterministic: CPUs and GPUs have a large control logic to handle caches, branch prediction, out-of-order execution, and so on. TPU's control logic is only 2% of the die. This minimalism also makes it deterministic: we can predict execution latency.
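
The systolic array item above can be illustrated with a toy simulation. This is a minimal sketch, not Google's MXU design: it assumes an output-stationary grid (each cell keeps its own running sum), whereas a real accelerator may hold weights in place instead. The function name `systolic_matmul` and the register layout are hypothetical and chosen only for illustration.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated n x m grid of cells.

    Values of `a` enter from the left edge and move one cell right per
    cycle; values of `b` enter from the top edge and move one cell down
    per cycle. Each cell multiplies whatever passes through it and adds
    the product to a local accumulator, so operands are fetched from
    memory once and then reused by neighbouring cells.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]          # one accumulator per cell
    a_reg = [[None] * m for _ in range(n)]     # operand moving rightwards
    b_reg = [[None] * m for _ in range(n)]     # operand moving downwards
    cycles = n + m + k - 2                     # time for the last operand pair to meet

    for t in range(cycles):
        new_a = [[None] * m for _ in range(n)]
        new_b = [[None] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:
                    s = t - i                  # row i is fed with a delay of i cycles
                    new_a[i][j] = a[i][s] if 0 <= s < k else None
                else:
                    new_a[i][j] = a_reg[i][j - 1]   # take from left neighbour
                if i == 0:
                    s = t - j                  # column j is fed with a delay of j cycles
                    new_b[i][j] = b[s][j] if 0 <= s < k else None
                else:
                    new_b[i][j] = b_reg[i - 1][j]   # take from upper neighbour
        a_reg, b_reg = new_a, new_b

        # Every cell does one multiply-accumulate per cycle, in lockstep.
        for i in range(n):
            for j in range(m):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


if __name__ == "__main__":
    result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(result, "in", cycles, "cycles")
```

Because the data movement follows a fixed pattern, the cycle count depends only on the matrix shapes, which is one way to see why such a design has predictable latency.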

• Doesn't low-precision arithmetic reduce the accuracy of calculations?

Research has shown that deep learning algorithms are largely unaffected by low-precision arithmetic. In fact, low-precision arithmetic can be used for both training and inference. This is because ML is essentially probabilistic in nature, so high-precision arithmetic is unnecessary. One writer reported that "having more data that is less precise yield better results than having half as much data that was more precise." In fact, the addition of noise during training can improve performance.
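
To make the size of the precision loss concrete, here is a minimal sketch of 8-bit quantization. It assumes a simple linear mapping from the observed minimum and maximum of the values onto integers 0 to 255; the helper names `quantize` and `dequantize` are hypothetical, and production toolchains may choose ranges and rounding differently.

```python
def quantize(values, num_bits=8):
    """Map floats onto unsigned integers using a linear scale.

    Returns the integer codes plus the scale and offset needed to
    convert them back to approximate floats.
    """
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1                  # 255 steps for 8 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from integer codes."""
    return [lo + scale * c for c in codes]


if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
    codes, scale, lo = quantize(weights)
    restored = dequantize(codes, scale, lo)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print("codes:", codes)
    print("worst-case error:", worst, "(at most half a step:", scale / 2, ")")
```

Each restored value is off by at most half a quantization step, which is small relative to the range of the values. That bounded error is of the same kind as the noise the paragraph above says neural networks tolerate.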

One report claimed that Google intended to use TPU only for inference. The report added that low-precision arithmetic is not suited for training. Since the release of TPU2, it's clear that TPUs can be used for both training and inference. With both TPU2 and TPU3, perhaps due to ML training needs, Google's own bfloat16 format is being used for floating-point operations.
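
The bfloat16 format keeps the sign bit and all 8 exponent bits of a 32-bit float but only 7 of its mantissa bits, so it covers the same range with less precision. The following is a minimal sketch of that conversion using only Python's standard library. The function names are hypothetical, round-to-nearest-even is an assumption about the rounding mode, and NaN or out-of-range inputs are not handled.

```python
import struct


def to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern for a finite float32-range value."""
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    # Round to nearest, ties to even, on the 16 bits about to be dropped.
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding_bias) >> 16) & 0xFFFF


def from_bfloat16_bits(bits16):
    """Expand a bfloat16 pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]


if __name__ == "__main__":
    for value in (1.0, 3.14159, 1e38):
        bits = to_bfloat16_bits(value)
        print(value, "->", hex(bits), "->", from_bfloat16_bits(bits))
```

Small values lose a few decimal digits, while a very large value such as 1e38 still fits. This wide range with reduced precision is a plausible reason the format suits training, where gradients can vary greatly in magnitude.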

• What's the competition for Google's TPU?

Google's TPU is in fact used only within Google, at least for now. Nvidia dominates the ML processor market with its GPUs. Nvidia has specialized Tesla GPUs, based on its Pascal architecture, that are suited for ML. These can be used either for training or for inference. Movidius makes Visual Processing Units (VPUs), named Myriad 2, that offer visual intelligence at device level. IBM's own chip named TrueNorth is based on a project that built the digital equivalent of a rodent's brain. TrueNorth is meant to bring deep learning to devices for the purpose of inference. In November 2016, Intel announced an AI processor named Nervana that may come to market by the end of 2017. Nervana is designed to be used for both training and inference.

Microsoft, for its part, has been using FPGAs in its datacenters since, unlike ASICs, these can be reconfigured easily. Configurability matters when algorithms change frequently, which makes ASICs less suitable. Qualcomm announced in January 2017 that it has optimized TensorFlow for the Hexagon 682 DSP. ARM is promoting its Mali GPUs to offload ML processing from its Cortex CPUs.

• Is Google's TPU in any way connected to SGI's product of the same name?

No. Silicon Graphics had something called a TPU in its workstations in the 2000s. It was an advanced DSP that used dynamic shared-memory access. This has nothing to do with Google's TPU.

