Tensor Processing Unit

TPU block diagram. Source: Sato et al. 2017.

The Tensor Processing Unit (TPU) is an ASIC announced by Google for executing Machine Learning (ML) algorithms. CPUs are general-purpose processors. GPUs are better suited for graphics and for tasks that benefit from parallel execution. DSPs work well for signal processing tasks that typically require mathematical precision. TPUs, on the other hand, are optimized for ML. While any of these processors could also run ML workloads, TPUs are expected to deliver better performance per watt. In fact, TPUs are said to catapult computing power seven years into the future, equivalent to three generations of Moore's Law.

Google claims that TPUs are tailored for running TensorFlow, which is an open-source software library for Machine Intelligence.

Discussion

  • What is Google's interest in making the TPU?
    TPU Pod: 64xTPU2. Source: Tung 2017, © Google.

    Google has claimed that "great software shines brightest with great hardware underneath." This is particularly true of ML, where a TPU gives software the power to run faster and hence process more data. Google wants to use TPUs to power its own ML algorithms. As of May 2016, more than 100 teams were said to be using ML within Google. Street View, Inbox Smart Reply, RankBrain and voice search are products already benefiting from TPU hardware. AlphaGo used TPUs to defeat Go world champion Lee Sedol.

    Beyond Google's internal projects, TPUs can offer an advantage for all ML applications implemented in TensorFlow. ML applications looking to run on cloud infrastructure may prefer Google Cloud Platform powered by TPUs. Likewise, TPUs may be a differentiator for Google Cloud Platform when application developers select an ML service API for their applications. For example, Google Cloud Machine Learning is a managed ML service from Google that will directly benefit from TPUs.

    There's also the claim that TPU may be Google's answer to Intel's Xeon processors that dominate datacenters.

  • Can TPUs be used for ML frameworks other than TensorFlow?

    TensorFlow is not the only framework for ML. More specifically, there are multiple frameworks for Deep Learning (DL).

    Google has not disclosed if TensorFlow algorithms are hardwired in the TPU or if the TPU is a generic accelerator for ML. Instead, Google has said that the TPU is "tailored for TensorFlow". However, with the release of Cloud TPU, Facebook researcher and PyTorch creator Soumith Chintala commented that there's a plan to port PyTorch to TPU.

  • Could you compare the performance of TPU against CPU or GPU?
    TPU outperforms CPU and GPU for various neural networks in terms of predictions per second. Source: Sato et al. 2017.

    Tests show 83x performance per watt gain over CPUs and 29x gain over GPUs. When tested for various neural networks, in terms of predictions per second, we see a 71x gain over CPUs for the particular case of a convolutional neural network (CNN) of 89 layers and 100 million weights.

    Within Google's AI workloads, speed gains are in the range of 15x to 30x. In terms of energy efficiency, gains are in the range of 30x to 80x.

    Note that the numbers above are for the initial TPU version. With later versions of TPU, we can expect higher performance gains.

  • How is TPU able to achieve its superior performance compared to other processor types?

    Whereas CPUs are generic processors, the design of TPU is focused on ML workloads. The following design choices give it superior performance:

    • Quantization: Integer operations are used instead of floating point operations. Precision is sacrificed for performance (see the sketch after this list).
    • CISC & Matrix Multiplier Unit (MXU): The TPU prefers a CISC over a RISC instruction set architecture. A single instruction can trigger complex operations that are handled by the MXU. A CPU is a scalar processor and a GPU is a vector processor; the MXU is a matrix processor that can perform hundreds of thousands of operations in a single clock cycle.
    • Systolic Array: Arithmetic operations, done in Arithmetic Logic Units (ALUs), are chained together, thus reducing register accesses. The generality of CPUs/GPUs is sacrificed for a simpler, energy-efficient design in which the TPU's ALUs perform operations in fixed patterns.
    • Minimal & Deterministic: CPUs and GPUs have large control logic to handle caches, branch prediction, out-of-order execution, and so on. The TPU's control logic is only 2% of the die. This minimalism also makes it deterministic: we can predict execution latency.
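
    To make the quantization idea concrete, here is a minimal sketch in Python/NumPy of how floating-point weights might be mapped to 8-bit integers and multiplied with integer arithmetic, in the spirit of the first TPU's 8-bit MXU. This is an illustration of the technique only, not Google's actual quantization pipeline, and the helper names are made up for this example.

```python
import numpy as np

def quantize_uint8(x):
    # Linearly map float values onto the 0..255 range of unsigned 8-bit integers.
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 or 1.0   # avoid division by zero for constant inputs
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    # Approximate recovery of the original floats; the small error is the precision cost.
    return q.astype(np.float32) * scale + lo

a = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8, 3).astype(np.float32)

qa, sa, za = quantize_uint8(a)
qb, sb, zb = quantize_uint8(b)

# The multiply itself can run on integer hardware (accumulating in int32),
# which is cheaper and more energy-efficient than float multiply-accumulate.
int_product = qa.astype(np.int32) @ qb.astype(np.int32)

# Compare a float matmul against the dequantized approximation.
approx = dequantize(qa, sa, za) @ dequantize(qb, sb, zb)
print(np.abs(a @ b - approx).max())   # small error: precision traded for performance
```
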
  • Doesn't low-precision arithmetic reduce the accuracy of calculations?

    Research has shown that deep learning algorithms can tolerate low-precision arithmetic. In fact, low-precision arithmetic can be used for both training and inference. This is because ML is essentially probabilistic in nature and high-precision arithmetic is unnecessary. One writer reported that "having more data that is less precise yield better results than having half as much data that was more precise." Indeed, adding noise during training can even improve performance.

    One report claimed that Google intends to use the TPU only for inference, adding that low-precision arithmetic is not suited for training. Since the release of TPU2, it's clear that TPUs can be used for both training and inference. With both TPU2 and TPU3, perhaps due to ML training needs, Google's own bfloat16 format is used for floating-point operations.
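
    As a rough illustration of what bfloat16 trades away, the snippet below (assuming TensorFlow 2.x, which exposes the bfloat16 dtype) casts float32 values to bfloat16: the 8-bit exponent keeps float32's dynamic range, but the 7-bit mantissa drops significand precision.

```python
import tensorflow as tf

x = tf.constant([1.0, 3.14159265, 1e-3, 1e30], dtype=tf.float32)
xb = tf.cast(x, tf.bfloat16)           # 1 sign bit, 8 exponent bits, 7 mantissa bits

print(xb)                              # very large/small magnitudes survive the cast...
print(tf.cast(xb, tf.float32) - x)     # ...but only about 3 decimal digits are kept
```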

  • What's the competition for Google's TPU?

    Nvidia dominates the ML processor market with its GPUs. Nvidia has specialized versions of its Tesla GPUs, based on the Pascal architecture, that are suited for ML. These can be used either for training or for inference. Nvidia's Volta, equipped with 640 tensor cores, is Pascal's successor.

    Movidius makes Vision Processing Units (VPUs), named Myriad 2, that offer visual intelligence at the device level. In November 2016, Intel announced an AI processor named Nervana for both training and inference. Its first commercial version, named Nervana Neural Net L-1000, was announced in May 2018. IBM's own chip named TrueNorth is capable of deep learning inference. In mid-2018, IBM announced an unnamed prototype chip capable of both training and inference.

    Microsoft has been using FPGAs in its datacenters since these can be reconfigured easily, unlike ASICs. In May 2018, Microsoft announced Project Brainwave and claims that it makes Azure the fastest cloud for real-time AI.

    ARM is promoting its MALI GPUs to offload ML processing from its Cortex CPUs. Others in the AI chip space include Graphcore, Cerebras and Vathys.

  • Are there any performance results comparing TPUs against Nvidia's GPUs?
    Raw throughput (images/sec) for training on auto-generated images without pre-processing. Source: Haußmann 2018.

    One engineer at RiseML has compared a Cloud TPU (consisting of four TPUv2 chips) against four Nvidia V100 GPUs. In both cases, each core had 16 GB of memory. The former ran on Google Cloud while the latter ran on AWS. The performance test used ResNet-50.

    In terms of raw throughput without any pre-processing, both had comparable results. However, when batch sizes were decreased, V100s showed better performance. When looking at cost, Google Cloud offers better value for money.

    In the real world, we are more interested in the cost of achieving a desired level of accuracy. Cloud TPU outperforms here, costing $55 for an accuracy of 75.7%. The same on AWS reserved instances costs $88. Cloud TPU also converges faster, perhaps due to pre-processing. V100s reach a final accuracy of 75.7% after 84 epochs, whereas Cloud TPU does it in only 64 epochs. An epoch in deep learning is a single pass of the dataset through the neural network, and multiple epochs are needed for convergence.
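
    For readers new to the term, the toy Keras snippet below (with made-up data, not the ResNet-50 setup above) shows where the epoch count appears in practice: epochs=64 simply means the full dataset is passed through the network 64 times.

```python
import numpy as np
import tensorflow as tf

# Tiny synthetic dataset and model, purely to illustrate the epochs parameter.
x = np.random.rand(256, 32).astype("float32")
y = np.random.randint(0, 2, size=(256,)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# One epoch = one full pass over the 256 samples; here the data is seen 64 times.
model.fit(x, y, batch_size=32, epochs=64, verbose=0)
```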

  • Is Google's TPU in any way connected to SGI's product of the same name?

    No. Silicon Graphics had something called a TPU in its workstations in the 2000s. It was an advanced DSP that used dynamic shared-memory access. This has nothing to do with Google's TPU.

Milestones

2013

Google realizes that with the growing computational demand of neural networks, it would have to double the number of its data centers. This concern triggers the design and development of the TPU ASIC.

May
2016
The TPU on a PCB. Source: Schneider 2017, © Google.

Google announces that it's been using TPUs in its data centers for ML for more than a year. In its original form, it's packaged as an accelerator card that can be easily installed into a SATA hard disk slot. The TPU ASIC is built on a 28nm process, runs at 700MHz and consumes 40W.

Jan
2017

Qualcomm announces that it has optimized TensorFlow for the Hexagon 682 DSP.

Dec
2017

Google reveals details of TPU2, the second-generation TPU. Each TPU2 device, composed of four TPU2 chips, can deliver 180 teraflops, with 64GB of high-bandwidth memory and 2,400GB/s memory bandwidth. These devices can be interconnected to form a TPU2 Pod capable of 11.5 petaflops.
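
The Pod figure follows from the per-device numbers: a Pod interconnects 64 TPU2 devices, so 64 × 180 teraflops = 11,520 teraflops, or roughly 11.5 petaflops.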

Feb
2018

As a beta release, users can use Cloud TPU on Google Cloud Platform to accelerate the training of ML models. This is really TPU2 offered on the cloud. Models can now be trained overnight rather than in days or weeks. No special programming expertise is needed; TensorFlow can be used and reference implementations are available. The cost is $6.50/hour compared to Amazon's $24/hour.
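
To give a flavour of how a TensorFlow user targets a Cloud TPU, here is a minimal sketch using the later TensorFlow 2.x distribution API (at the time of the beta, the TF 1.x TPUEstimator interface was typical). The TPU name is a placeholder; real values come from your Google Cloud project.

```python
import tensorflow as tf

# "my-cloud-tpu" is a hypothetical TPU node name from your GCP project.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-cloud-tpu")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Variables and training steps for this model are placed on the TPU cores.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(...) then runs the training loop on the Cloud TPU.
```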

May
2018
Different versions of TPU compared. Source: Teich 2018.

Google announces TPU 3.0, which requires liquid cooling. A single TPU3 chip is capable of 90 teraflops.

References

  1. Armasu, Lucian. 2016. "Google's Big Chip Unveil For Machine Learning: Tensor Processing Unit With 10x Better Efficiency." Tom's Hardware. May 19. Retrieved 2017-02-20.
  2. Barrus, John and Zak Stone. 2018. "Cloud TPU machine learning accelerators now available in beta." Google Cloud Platform Blog, February 12. Accessed 2018-07-11.
  3. Bright, Peter. 2016. "Programmable chips turning Azure into a supercomputing powerhouse." Ars Technica. September 28. Retrieved 2017-02-20.
  4. Chintala, Soumith. 2018. "Cloud TPUs are out, we'll start sketching out @PyTorch integration." Twitter, February 12. Accessed 2018-08-28.
  5. Courbariaux, Matthieu, Yoshua Bengio, and Jean-Pierre David. 2015. "Training deep neural networks with low precision multiplications." September 23. arXiv. Retrieved 2017-02-20.
  6. Davies, Jem. 2016. "ARM and Machine Learning." ARM. December 12. Retrieved 2017-02-20.
  7. Dillon. 2018. "Hands-on with the Google TPUv2." Blog, Paperspace, March 27. Accessed 2018-08-28.
  8. Freund, Karl. 2016. "Google's TPU Chip Creates More Questions Than Answers." Forbes. May 26. Retrieved 2017-02-20.
  9. Gupta, Suyog, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. "Deep Learning with Limited Numerical Precision." February 9. arXiv. Retrieved 2017-02-20.
  10. Haußmann, Elmar. 2018. "Comparing Google’s TPUv2 against Nvidia’s V100 on ResNet-50." RiseML Blog, April 26. Accessed 2018-08-29.
  11. Jacobowitz, P.J. 2017. "TensorFlow machine learning now optimized for the Snapdragon 835 and Hexagon 682 DSP." Qualcomm. January 10. Retrieved 2017-02-20.
  12. Johnson, Khari. 2018a. "Intel unveils Nervana Neural Net L-1000 for accelerated AI training." VentureBeat, May 23. Accessed 2018-08-28.
  13. Johnson, Khari. 2018b. "Microsoft launches Project Brainwave for deep learning acceleration in preview." VentureBeat, May 07. Accessed 2018-08-28.
  14. Jouppi, Norm. 2016. "Google supercharges machine learning tasks with TPU custom chip." Google Cloud Platform Blog. May 18. Retrieved 2017-02-20.
  15. Jouppi, Norm. 2017. "Quantifying the performance of the TPU, our first machine learning chip." Google Cloud Platform Blog, April 05. Accessed 2018-07-11.
  16. Metz, Cade. 2015. "IBM's 'Rodent Brain' Chip Could Make Our Phones Hyper-Smart." Wired. August 17. Retrieved 2017-02-20.
  17. Metz, Cade. 2016. "Intel Looks to a New Chip to Power the Coming Age of AI." Wired. November 18. Retrieved 2017-02-20.
  18. Moore, Samuel K. 2018. "IBM’s New Do-It-All Deep-Learning Chip." IEEE Spectrum, July 02. Accessed 2018-08-28.
  19. Morgan, Timothy Prickett. 2016. "Nvidia Pushes Deep Learning Inference With New Pascal GPUs." The Next Platform. September 13. Retrieved 2017-02-20.
  20. Nvidia. 2016. "Deep Learning Frameworks." Nvidia. April 5. Updated February 9, 2017. Retrieved 2017-02-20.
  21. Nvidia. 2018. "NVIDIA Volta." Accessed 2018-08-28.
  22. Osborne, Joe. 2016. "Google's Tensor Processing Unit explained: this is what the future of computing looks like." Tech Radar India. August 23. Retrieved 2017-02-20.
  23. Quach, Katyanna. 2018. "Wanna gobble Google's custom chips? Now you can – its Cloud TPUs at $6.50 an hour." The Register, February 12. Accessed 2018-07-11.
  24. Racanelli, Heidi. 2000. SGI Tensor Processing Unit (TPU) XIO Board Introduction. Document Number 007-4222-002. Silicon Graphics, Inc. Retrieved 2017-02-20.
  25. Sato, Kaz, Cliff Young, and David Patterson. 2017. "An in-depth look at Google’s first Tensor Processing Unit (TPU)." Google Cloud Big Data and Machine Learning Blog, May 12. Accessed 2018-07-11.
  26. Schneider, David. 2017. "Google Details Tensor Chip Powers." IEEE Spectrum. April 6. Retrieved 2017-04-11.
  27. Sharma, Sagar. 2017. "Epoch vs Batch Size vs Iterations." Towards Data Science, September 23. Accessed 2018-08-29.
  28. Singh, Akash. 2016. "What is the difference among CPU, GPU, APU, FPGA, DSP, and Intel MIC?" Quora. Updated May 25. Retrieved 2017-02-20.
  29. Teich, Paul. 2018. "Tearing Apart Google’s TPU 3.0 AI Coprocessor." The Next Platform, May 10. Accessed 2018-07-11.
  30. Tung, Liam. 2017. "GPU killer: Google reveals just how powerful its TPU2 chip really is." ZDNet. December 14. Accessed 2018-01-05.
  31. Ung, Gordon Mah. 2016. "Google's Tensor Processing Unit could advance Moore's Law 7 years into the future." PC World. May 18. Retrieved 2017-02-20.
  32. Yegulalp, Serdar. 2016. "13 frameworks for mastering machine learning." InfoWorld. January 28. Retrieved 2017-02-20.

Further Reading

  1. Warden, Pete. 2015. "Why are Eight Bits Enough for Deep Neural Networks?" Pete Warden's blog. May 23. Retrieved 2017-02-20.
  2. Sato, Kaz, Cliff Young and David Patterson. 2017. "An in-depth look at Google’s first Tensor Processing Unit (TPU)." Google Cloud Big Data and Machine Learning Blog, May 12. Accessed 2018-07-11.
  3. Teich, Paul. 2018. "Tearing Apart Google’s TPU 3.0 AI Coprocessor." The Next Platform, May 10. Accessed 2018-07-11.
  4. Google Cloud Platform. 2018. "Effective machine learning using Cloud TPUs (Google I/O '18)." YouTube, May 08. Accessed 2018-07-11.
  5. Dillon. 2018. "Hands-on with the Google TPUv2." Blog, Paperspace, March 27. Accessed 2018-08-28.

Cite As

Devopedia. 2021. "Tensor Processing Unit." Version 9, June 28. Accessed 2023-11-13. https://devopedia.org/tensor-processing-unit