Tensor Processing Unit

Summary

image
TPU block diagram. Source: Sato et al. 2017.

Tensor Processing Unit (TPU) is an application-specific integrated circuit (ASIC) announced by Google for executing Machine Learning (ML) algorithms. CPUs are general-purpose processors. GPUs are better suited to graphics and to tasks that benefit from parallel execution. DSPs work well for signal processing tasks that typically require mathematical precision. TPUs, by contrast, are optimized for ML. While any of the others can also be used for ML, TPUs are expected to deliver better performance per watt on ML workloads. In fact, TPUs are said to move computing power seven years into the future, which is equivalent to three generations of Moore's Law.

Google claims that TPUs are tailored for running TensorFlow, which is an open-source software library for Machine Intelligence.

Milestones

2013

Google realizes that, given the growing computational demand of neural networks, it would have to double the number of its data centers. This concern triggers the design and development of the TPU ASIC.

May
2016
image
The TPU on a PCB. Source: Schneider 2017, © Google.

Google announces that it has been using TPUs in its data centers for ML for more than a year. In its original form, the TPU is packaged as an external accelerator card that fits into a SATA hard disk slot. The TPU ASIC is built on a 28nm process, runs at 700MHz and consumes 40W.

Dec
2017

Google reveals details of TPU2, the second-generation TPU. Each TPU2, composed of four TPU2 chips, delivers 180 teraflops and has 64GB of high-bandwidth memory with 2,400GB/s of memory bandwidth. These can be interconnected to make a TPU2 Pod capable of 11.5 petaflops.

Feb
2018

Cloud TPU becomes available as a beta release on Google Cloud Platform, where users can accelerate the training of ML models. This is really TPU2 offered on the cloud. Models can now be trained overnight rather than in days or weeks. No special programming expertise is needed; TensorFlow can be used and reference implementations are available. The cost is $6.50/hour compared to Amazon's $24/hour.

May
2018
image
Different versions of TPU compared. Source: Teich 2018.

Google announces TPU 3.0, which requires liquid cooling. A single TPU3 chip is capable of 90 teraflops.
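
The TPU2 and TPU3 figures in these milestones can be cross-checked with a little arithmetic. The sketch below is only a consistency check on the numbers quoted in this article; the pod size of 64 TPU2 units is taken from the "64xTPU2" pod figure caption, and the variable names are illustrative.

```python
# Figures quoted in this article (not independently measured).
TPU2_TFLOPS = 180          # one TPU2, made of four TPU2 chips
CHIPS_PER_TPU2 = 4
TPU2_PER_POD = 64          # from the "TPU Pod: 64xTPU2" caption
TPU3_CHIP_TFLOPS = 90      # single TPU3 chip

# Per-chip throughput of TPU2.
tpu2_chip_tflops = TPU2_TFLOPS / CHIPS_PER_TPU2

# Pod throughput, converted from teraflops to petaflops.
pod_petaflops = TPU2_TFLOPS * TPU2_PER_POD / 1000

# How much faster one TPU3 chip is than one TPU2 chip.
chip_speedup = TPU3_CHIP_TFLOPS / tpu2_chip_tflops

print(f"TPU2 per chip: {tpu2_chip_tflops} teraflops")
print(f"TPU2 Pod:      {pod_petaflops} petaflops")
print(f"TPU3 vs TPU2:  {chip_speedup}x per chip")
```

The pod result agrees with the quoted 11.5 petaflops after rounding, and the per-chip numbers imply that a TPU3 chip roughly doubles the throughput of a TPU2 chip.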

Discussion

  • What is Google's interest in making the TPU?
    image
    TPU Pod: 64xTPU2. Source: Tung 2017, © Google.

    Google has claimed that "great software shines brightest with great hardware underneath." This is particularly true of ML where a TPU would offer software the requisite power to run faster and hence process more data. Google wants to use TPUs to power its ML algorithms. As of May 2016, more than 100 teams are said to be using ML within Google. Google Today, Street View, Inbox Smart Reply, RankBrain and voice search are products that are already benefiting from TPU hardware. AlphaGo used TPUs to defeat Go world champion Lee Sedol.

    Beyond Google's internal projects, TPUs can offer an advantage for all ML applications implemented in TensorFlow. ML applications looking to run out of a cloud infrastructure will tend to prefer Google Cloud Platform powered by TPUs. Likewise, TPUs may be a differentiator for Google Cloud Platform when application developers select an ML service API for their applications. For example, Google Cloud Machine Learning is a managed ML service from Google that will directly benefit from TPUs.

    There's also the claim that TPU may be Google's answer to Intel's Xeon processors that dominate datacenters.

  • Can TPUs be used for ML frameworks other than TensorFlow?

    TensorFlow is not the only framework for ML. More specifically, there are multiple frameworks for Deep Learning (DL). However, Google has not disclosed whether TensorFlow algorithms are hardwired into the TPU or whether the TPU is a generic accelerator for ML.

  • Could you compare the performance of TPU against CPU or GPU?
    image
    TPU outperforms CPU and GPU for various neural networks in terms of predictions per second. Source: Sato et al. 2017.

    Tests show an 83x gain in performance per watt over CPUs and a 29x gain over GPUs. When tested on various neural networks, in terms of predictions per second, there is a 71x gain over CPUs for the particular case of a convolutional neural network (CNN) of 89 layers and 100 million weights.

    Within Google's AI workloads, speed gains are in the range of 15x to 30x. In terms of energy efficiency, gains are in the range of 30x to 80x.

    Note that the numbers above are for the initial TPU version. With later versions of TPU, we can expect higher performance gains.

  • How is TPU able to achieve its superior performance compared to other processor types?

    Whereas CPUs are generic processors, the design of TPU is focused on ML workloads. The following design choices give it superior performance:

    • Quantization: Integer operations are used instead of floating point operations. Precision is sacrificed for performance.
    • CISC & Matrix Multiplier Unit (MXU): TPU uses a CISC rather than a RISC instruction set. A single instruction can trigger complex operations that are handled by the MXU. A CPU is a scalar processor and a GPU is a vector processor, whereas the MXU is a matrix processor that can perform hundreds of thousands of operations in a single clock cycle.
    • Systolic Array: Arithmetic operations, done in Arithmetic Logic Units (ALUs), are chained together so that results pass directly from one ALU to the next, thus reducing register accesses. The generality of CPUs/GPUs is sacrificed for a simpler, energy-efficient design, since the ALUs in the array perform operations only in fixed patterns.
    • Minimal & Deterministic: CPUs and GPUs have a large amount of control logic to handle caches, branch prediction, out-of-order execution, and so on. TPU's control logic is only 2% of the die. This minimalism also makes it deterministic: we can predict execution latency.
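
The systolic array idea in the list above can be illustrated with a small simulation. This is a textbook-style sketch, not a model of the actual TPU: it assumes an output-stationary grid in which each cell keeps its own running sum, whereas the real MXU's dataflow and size are not described here. The function name `systolic_matmul` and the register layout are illustrative assumptions.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m grid of cells.

    Rows of A enter from the left edge and move right; columns of B enter
    from the top edge and move down. Each cell multiplies the two values
    passing through it and adds the product to its own accumulator, so
    operands are handed from neighbour to neighbour instead of being
    re-read from a register file.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]     # one accumulator per cell
    a_reg = [[0] * m for _ in range(n)]   # value each cell passes to the right
    b_reg = [[0] * m for _ in range(n)]   # value each cell passes downward

    # The last operands reach the bottom-right cell at cycle n + m + k - 3.
    for t in range(n + m + k - 2):
        # Visit cells from bottom-right to top-left so that each cell still
        # sees its neighbours' values from the previous cycle.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j > 0:
                    a_in = a_reg[i][j - 1]
                else:  # left edge: row i is fed in with a delay of i cycles
                    a_in = A[i][t - i] if 0 <= t - i < k else 0
                if i > 0:
                    b_in = b_reg[i - 1][j]
                else:  # top edge: column j is fed in with a delay of j cycles
                    b_in = B[t - j][j] if 0 <= t - j < k else 0
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                acc[i][j] += a_in * b_in
    return acc


product = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)
```

Every cell performs the same multiply-and-add on every cycle, which is the "fixed pattern" that lets such a design drop most control logic and makes its latency predictable.
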

  • Doesn't low-precision arithmetic reduce accuracy of calculations?

    Research has shown that deep learning algorithms tolerate low-precision arithmetic with little loss of accuracy. In fact, low-precision arithmetic can be used for both training and inference. This is because ML is essentially probabilistic in nature, so high-precision arithmetic is unnecessary. One writer reported that "having more data that is less precise yield better results than having half as much data that was more precise." In fact, the addition of noise during training can improve performance.
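
To give a feel for how little is lost, here is a minimal sketch of a dot product computed with 8-bit integers. It assumes symmetric linear quantization with one scale factor per vector; that scheme, the helper names and the sample values are illustrative assumptions, not Google's published method.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers that share one scale factor."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(v) for v in values) or 1.0   # avoid a zero scale
    scale = max_abs / qmax
    return [round(v / scale) for v in values], scale


def int_dot(q_a, q_b):
    """Multiply-accumulate using integer arithmetic only."""
    acc = 0
    for a, b in zip(q_a, q_b):
        acc += a * b
    return acc


def quantized_dot(a, b):
    """Quantize both vectors, multiply as integers, then rescale to a float."""
    q_a, scale_a = quantize(a)
    q_b, scale_b = quantize(b)
    return int_dot(q_a, q_b) * scale_a * scale_b


weights = [0.12, -0.53, 0.98, 0.07]    # made-up sample values
inputs = [1.5, 0.25, -0.75, 2.0]

exact = sum(w * x for w, x in zip(weights, inputs))
approx = quantized_dot(weights, inputs)
relative_error = abs(approx - exact) / abs(exact)

print(f"floating point result: {exact:.4f}")
print(f"8-bit integer result:  {approx:.4f}")
print(f"relative error:        {relative_error:.2%}")
```

The integer result differs from the floating point one by a small fraction, which is the kind of error a neural network can usually absorb.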

    One early report claimed that Google intended to use TPU only for inference. The report added that low-precision arithmetic is not suited for training. Since the release of TPU2, it's clear that TPUs can be used for both training and inference. With both TPU2 and TPU3, perhaps due to ML training needs, Google's own bfloat16 format is being used for floating point operations.
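
A minimal sketch of what bfloat16 keeps and discards, assuming the usual description of the format as the top 16 bits of a 32-bit float (1 sign bit, 8 exponent bits, 7 mantissa bits). The helper `to_bfloat16` is hypothetical and simply truncates the low bits; real converters commonly round instead, so treat the printed errors as an upper bound rather than what TPU hardware produces.

```python
import struct


def to_bfloat16(x):
    """Return x reduced to bfloat16 precision by truncating its float32 form."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # float32 bit pattern
    truncated = bits & 0xFFFF0000                          # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", truncated))[0]


for value in (1.0, 3.14159, 1e-3, 6.0e4, 1.0e30):
    approx = to_bfloat16(value)
    error = abs(approx - value) / value
    print(f"{value:>12g} -> {approx:>12g}  (relative error {error:.3%})")
```

Because the exponent field is the same width as in float32, very large and very small values such as 1e30 remain representable, which IEEE half precision cannot offer; only the number of significant digits is reduced.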

  • What's the competition for Google's TPU?

    Google's first TPU was used only within Google, though TPU2 has since been offered to outside users as Cloud TPU. Nvidia dominates the ML processor market with its GPUs. Nvidia has specialized Tesla GPUs, based on its Pascal architecture, that are suited for ML. These can be used either for training or for inference. Movidius makes Vision Processing Units (VPUs), named Myriad 2, that offer visual intelligence at device level. IBM's own chip named TrueNorth is based on a project that built the digital equivalent of a rodent's brain. TrueNorth is meant to bring deep learning to devices for the purpose of inference. Intel announced in November 2016 an AI processor named Nervana that was expected to come to market at the end of 2017. Nervana is designed to be used for both training and inference.

    Microsoft, for its part, has been using FPGAs in its datacenters since, unlike ASICs, these can be reconfigured easily. Configurability matters when algorithms change frequently, which makes ASICs less suitable. Qualcomm announced in January 2017 that it has optimized TensorFlow for the Hexagon 682 DSP. ARM is promoting its Mali GPUs to offload ML processing from its Cortex CPUs.

  • Is Google's TPU in any way connected to SGI's product of the same name?

    No. Silicon Graphics had something called a TPU in its workstations in the 2000s. It was an advanced DSP that used dynamic shared-memory access. This has nothing to do with Google's TPU.

References

  1. Armasu, Lucian. 2016. "Google's Big Chip Unveil For Machine Learning: Tensor Processing Unit With 10x Better Efficiency." Tom's Hardware. May 19. Retrieved 2017-02-20.
  2. Barrus, John and Zak Stone. 2018. "Cloud TPU machine learning accelerators now available in beta." Google Cloud Platform Blog. February 12. Retrieved 2018-07-11.
  3. Bright, Peter. 2016. "Programmable chips turning Azure into a supercomputing powerhouse." Ars Technica. September 28. Retrieved 2017-02-20.
  4. Courbariaux, Matthieu, Yoshua Bengio, and Jean-Pierre David. 2015. "Training deep neural networks with low precision multiplications." September 23. arXiv. Retrieved 2017-02-20.
  5. Davies, Jem. 2016. "ARM and Machine Learning." ARM. December 12. Retrieved 2017-02-20.
  6. Freund, Karl. 2016. "Google's TPU Chip Creates More Questions Than Answers." Forbes. May 26. Retrieved 2017-02-20.
  7. Gupta, Suyog, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. "Deep Learning with Limited Numerical Precision." February 9. arXiv. Retrieved 2017-02-20.
  8. Jacobowitz, P.J. 2017. "TensorFlow machine learning now optimized for the Snapdragon 835 and Hexagon 682 DSP." Qualcomm. January 10. Retrieved 2017-02-20.
  9. Jouppi, Norm. 2016. "Google supercharges machine learning tasks with TPU custom chip." Google Cloud Platform Blog. May 18. Retrieved 2017-02-20.
  10. Jouppi, Norm. 2017. "Quantifying the performance of the TPU, our first machine learning chip." Google Cloud Platform Blog. April 5. Retrieved 2018-07-11.
  11. Metz, Cade. 2015. "IBM's 'Rodent Brain' Chip Could Make Our Phones Hyper-Smart." Wired. August 17. Retrieved 2017-02-20.
  12. Metz, Cade. 2016. "Intel Looks to a New Chip to Power the Coming Age of AI." Wired. November 18. Retrieved 2017-02-20.
  13. Morgan, Timothy Prickett. 2016. "Nvidia Pushes Deep Learning Inference With New Pascal GPUs." The Next Platform. September 13. Retrieved 2017-02-20.
  14. Nvidia. 2016. "Deep Learning Frameworks." Nvidia. April 5. Updated February 9, 2017. Retrieved 2017-02-20.
  15. Osborne, Joe. 2016. "Google's Tensor Processing Unit explained: this is what the future of computing looks like." Tech Radar India. August 23. Retrieved 2017-02-20.
  16. Quach, Katyanna. 2018. "Wanna gobble Google's custom chips? Now you can – its Cloud TPUs at $6.50 an hour." The Register. February 12. Retrieved 2018-07-11.
  17. Racanelli, Heidi. 2000. SGI Tensor Processing Unit (TPU) XIO Board Introduction. Document Number 007-4222-002. Silicon Graphics, Inc. Retrieved 2017-02-20.
  18. Sato, Kaz, Cliff Young, and David Patterson. 2017. "An in-depth look at Google’s first Tensor Processing Unit (TPU)." Google Cloud Big Data and Machine Learning Blog. May 12. Retrieved 2018-07-11.
  19. Schneider, David. 2017. "Google Details Tensor Chip Powers." IEEE Spectrum. April 6. Retrieved 2017-04-11.
  20. Singh, Akash. 2016. "What is the difference among CPU, GPU, APU, FPGA, DSP, and Intel MIC?" Quora. Updated May 25. Retrieved 2017-02-20.
  21. Teich, Paul. 2018. "Tearing Apart Google’s TPU 3.0 AI Coprocessor." The Next Platform. May 10. Retrieved 2018-07-11.
  22. Tung, Liam. 2017. "GPU killer: Google reveals just how powerful its TPU2 chip really is." ZDNet. December 14. Retrieved 2018-01-05.
  23. Ung, Gordon Mah. 2016. "Google's Tensor Processing Unit could advance Moore's Law 7 years into the future." PC World. May 18. Retrieved 2017-02-20.
  24. Yegulalp, Serdar. 2016. "13 frameworks for mastering machine learning." InfoWorld. January 28. Retrieved 2017-02-20.

Further Reading

  1. Warden, Pete. 2015. "Why are Eight Bits Enough for Deep Neural Networks?" Pete Warden's blog. May 23. Retrieved 2017-02-20.
  2. Sato, Kaz, Cliff Young, and David Patterson. 2017. "An in-depth look at Google’s first Tensor Processing Unit (TPU)." Google Cloud Big Data and Machine Learning Blog. May 12. Retrieved 2018-07-11.
  3. Teich, Paul. 2018. "Tearing Apart Google’s TPU 3.0 AI Coprocessor." The Next Platform. May 10. Retrieved 2018-07-11.
  4. Google Cloud Platform. 2018. "Effective machine learning using Cloud TPUs (Google I/O '18)." YouTube. May 8. Retrieved 2018-07-11.

Last update: 2018-07-11 17:51:48 by tintin
Creation: 2017-02-20 18:34:31 by tintin

Cite As

Devopedia. 2018. "Tensor Processing Unit." Version 5, July 11. Accessed 2018-08-14. https://devopedia.org/tensor-processing-unit