NumPy
NumPy is an open source Python library that enables efficient manipulation of multi-dimensional numerical data structures. These are called arrays in NumPy. NumPy is an alternative to Interactive Data Language (IDL) and MATLAB.
Since its release in 2005, NumPy has become a fundamental package for numerical and scientific computing in Python. In addition to efficient data structures and operations on them, it provides many high-level mathematical functions that aid scientific computation. Pandas, SciPy, Matplotlib, scikit-learn and scikit-image are just a few popular scientific packages that make use of NumPy.
Discussion
-
What does NumPy do differently from core Python?
Python is slower than compiled languages such as C, but it's easy to learn and well suited to rapid prototyping and iterative development.
While Python's list data type can be used to construct multi-dimensional data structures (lists containing lists), NumPy is faster and provides a better API for developers. Python's lists are general purpose. They can contain data of different types. This means that each element carries its own type information, and that type checking and type-dispatching code run at runtime. Lists are processed using loops or comprehensions and can't be vectorized to support elementwise operations. NumPy sacrifices some of Python's flexibility to improve performance.
Specifically, NumPy is better at these aspects:
- Size: NumPy data structures take up less space. Each Python integer object takes 28 bytes whereas in NumPy an integer is just 8 bytes. A Python list of n items requires 64 + 8n + 28n bytes whereas in NumPy it's 96 + 8n bytes.
- Performance: NumPy code runs faster than Python code, particularly for large input data.
- Functionality: NumPy provides lots of functions and methods to simplify operations. High-level operations such as linear algebra are also included.
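The size and elementwise-operation differences listed above can be checked with a short sketch. The variable names and the choice of n = 1000 are illustrative assumptions, and exact byte counts depend on the Python and NumPy versions and the platform, so the script prints what it measures instead of assuming fixed numbers.

```python
import sys
import numpy as np

n = 1000  # illustrative size, not taken from the article
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# Size of the list object itself: a header plus one pointer per element.
list_container_bytes = sys.getsizeof(py_list)
# Add the size of every integer object the list points to.
list_total_bytes = list_container_bytes + sum(sys.getsizeof(x) for x in py_list)

# nbytes counts only the raw data buffer: n elements of 8 bytes each.
array_data_bytes = np_array.nbytes
# getsizeof adds the array object's own overhead, which varies by version.
array_total_bytes = sys.getsizeof(np_array)

print("list total bytes: ", list_total_bytes)
print("array data bytes: ", array_data_bytes)
print("array total bytes:", array_total_bytes)

# Elementwise work: a list needs a loop or comprehension,
# while an array accepts the arithmetic expression directly.
doubled_list = [2 * x for x in py_list]
doubled_array = 2 * np_array
```

Both approaches yield the same values; only the array version avoids a Python-level loop.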
-
What are some of the main features of NumPy?
NumPy arrays are homogeneous, meaning that array elements are of the same type. Hence, no type checking is required at runtime. All elements of an array take up the same amount of space.
The spacing between elements along an axis is also constant. This is called striding. This is useful when the same data in memory can be used to create a new array without copying. Different arrays are therefore different views into memory. Thus, it's easier to modify data subsets in memory.
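A minimal sketch of striding and views follows, assuming a small 3x4 array of 8-byte integers chosen only for illustration. It shows that a column slice and a transpose reuse the same memory with different strides, so writing through the view changes the original array.

```python
import numpy as np

# 3 rows x 4 columns of 8-byte integers, stored row by row.
a = np.arange(12, dtype=np.int64).reshape(3, 4)

# Strides are the byte steps along each axis:
# 4 elements * 8 bytes to reach the next row, 8 bytes to reach the next column.
print(a.strides)  # (32, 8)

# Slicing a column creates a view: no data is copied, only the stride differs.
column = a[:, 1]
print(column.strides)  # (32,)
print(np.shares_memory(a, column))  # True

# Writing through the view modifies the original array.
column[0] = 100
print(a[0, 1])  # 100

# Transposing also returns a view; the strides are simply swapped.
transposed = a.T
print(transposed.strides)  # (8, 32)
```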
Operations are vectorized, which means that the operation can be executed in parallel on multiple elements of the array. This speeds up computation. Developers need not write for loops.
NumPy provides APIs for easy manipulation of arrays. Some of these are indexing, slicing, reshaping, stacking and splitting. Broadcasting is a feature that allows operations between vectors and scalars, or vectors of different sizes.
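The array manipulation and broadcasting features described above might look like this in practice. The array contents are illustrative assumptions; the functions used (reshape, vstack, hsplit) are standard NumPy calls.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

# Indexing and slicing.
last_column = m[:, -1]  # [2, 5]

# Broadcasting a scalar across every element.
scaled = m * 10  # [[0, 10, 20], [30, 40, 50]]

# Broadcasting a 3-element row across both rows of m.
row = np.array([1, 2, 3])
shifted = m + row  # [[1, 3, 5], [4, 6, 8]]

# Broadcasting a (2, 1) column against a (3,) row yields a (2, 3) grid.
col = np.array([[10], [20]])
grid = col + row  # [[11, 12, 13], [21, 22, 23]]

# Stacking adds the row below m; splitting cuts an array into equal parts.
stacked = np.vstack([m, row])  # shape (3, 3)
left, right = np.hsplit(np.arange(8).reshape(2, 4), 2)  # two (2, 2) blocks
```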
NumPy integrates easily with C/C++ or Fortran code that may provide optimized implementations. Useful functions covering linear algebra, Fourier transform, and random numbers are provided.
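As a rough illustration of the linear algebra, Fourier transform and random number functions mentioned above, the sketch below uses np.linalg.solve, np.fft.fft and np.random.default_rng. The matrices, signal and seed are made-up inputs for demonstration only.

```python
import numpy as np

# Linear algebra: solve the system A @ x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)
print(x)  # close to [2. 3.]

# Fourier transform: a constant signal has all its energy in the zero-frequency bin.
signal = np.ones(4)
spectrum = np.fft.fft(signal)
print(spectrum.real)  # close to [4. 0. 0. 0.]

# Random numbers: a seeded generator gives reproducible samples.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)
print(samples.shape)  # (1000,)
```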
-
Could you share some performance numbers comparing NumPy versus Python implementations?
For a simple computation of mean and standard deviation of a million floating point numbers, NumPy was 30X faster than a pure Python implementation. However, optimized Cython and C implementations were even faster. Another study showed that if input is small (less than 200 numbers), pure Python did better than NumPy. For inputs greater than about 15,000 numbers, NumPy outperformed C++.
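Readers who want to try a mean and standard deviation comparison on their own machine could start from a sketch like the one below. It is not the benchmark from the cited studies: the function names, the input size of 100,000 and the repeat count are assumptions, and the measured ratio will vary with hardware and library versions.

```python
import math
import timeit
import numpy as np

rng = np.random.default_rng(seed=0)
data_array = rng.random(100_000)  # NumPy array of floats
data_list = data_array.tolist()   # the same values as a plain Python list

def mean_std_python(values):
    """Population mean and standard deviation using only core Python."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)

def mean_std_numpy(values):
    """Same statistics using NumPy's vectorized methods (ddof=0 by default)."""
    return values.mean(), values.std()

# Time a few repetitions of each implementation.
python_seconds = timeit.timeit(lambda: mean_std_python(data_list), number=5)
numpy_seconds = timeit.timeit(lambda: mean_std_numpy(data_array), number=5)

print(f"pure Python: {python_seconds:.4f} s")
print(f"NumPy:       {numpy_seconds:.4f} s")
```

For very small inputs the fixed overhead of calling into NumPy can dominate, which is consistent with the observation above that pure Python can win on short sequences.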
One experiment in Machine Learning compared pure Python, NumPy and TensorFlow (on CPU) implementations of gradient descent. Runtimes were 18.65, 0.32 and 1.20 seconds respectively. NumPy was 50X faster than pure Python. For more complex ML problems deployed on multiple GPUs, TensorFlow is likely to outperform NumPy.
When evaluating NumPy performance, the underlying library for vector/matrix computations matters. NumPy comes with default BLAS and LAPACK implementations. Depending on the distribution, alternatives may be included: OpenBLAS, Intel MKL, ATLAS, etc. In general, these alternatives are faster than the default library. For example, SVD is 10X faster on Intel MKL.
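To see which BLAS/LAPACK libraries a given NumPy build is linked against, np.show_config() prints the build information. The sketch below pairs it with a timed SVD on a random matrix; the matrix shape and seed are arbitrary assumptions, and the timing is only meaningful when compared across installations on the same machine.

```python
import time
import numpy as np

# Print the build configuration, including any BLAS/LAPACK libraries found.
np.show_config()

rng = np.random.default_rng(seed=0)
matrix = rng.random((200, 100))

# Time a singular value decomposition, which is delegated to LAPACK.
start = time.perf_counter()
u, s, vt = np.linalg.svd(matrix, full_matrices=False)
elapsed = time.perf_counter() - start
print(f"SVD of a 200x100 matrix took {elapsed:.4f} s")

# Sanity check: multiplying the factors back together recovers the input.
reconstructed = (u * s) @ vt
print(np.allclose(reconstructed, matrix))
```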
Hardware platforms may provide further acceleration. For example, Intel AVX2 provides at least 20% improvement on top of OpenBLAS.
-
Does NumPy automatically make use of GPU hardware?
NumPy doesn't natively support GPUs. However, there are tools and libraries to run NumPy on GPUs.
Numba is a Python compiler that can compile Python code to run on multicore CPUs and CUDA-enabled GPUs. Numba also understands NumPy and generates optimized compiled code. Developers specify type signatures for Python functions. Numba uses them towards just-in-time (JIT) compilation. The Numba team also provides pyculib, which is a Python interface to CUDA libraries such as cuBLAS, cuFFT and cuRAND.
Grumpy has been proposed as a framework to seamlessly target multicore CPUs and GPUs. It does a mix of JIT compilation and offloading to optimized libraries such as cuBLAS or LAPACK.
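The Numba JIT compilation described above can be sketched on the CPU as follows. This assumes Numba may or may not be installed: if the import fails, a do-nothing stand-in decorator is used so the function still runs as ordinary Python. The sketch uses njit without an explicit type signature, in which case Numba infers types on the first call; the function name sum_of_squares is hypothetical.

```python
import numpy as np

try:
    from numba import njit  # compiles the decorated function on first call
except ImportError:
    # Assumption for portability: without Numba, fall back to plain Python.
    def njit(func):
        return func

@njit
def sum_of_squares(values):
    """Explicit loop over a NumPy array; Numba compiles it to machine code."""
    total = 0.0
    for v in values:
        total += v * v
    return total

data = np.arange(4, dtype=np.float64)  # [0, 1, 2, 3]
result = sum_of_squares(data)
print(result)  # 14.0
```

Targeting CUDA GPUs uses a different set of Numba decorators and requires suitable hardware, so it is left out of this sketch.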
CuPy is a Python library that implements NumPy arrays for CUDA-enabled GPUs and leverages CUDA GPU acceleration libraries. The code is mostly a drop-in replacement for NumPy code since the APIs are very similar. PyCUDA is a related library that gives Python code access to NVIDIA's CUDA API.
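Because CuPy's API closely mirrors NumPy's, a common pattern is to pick the array module at import time and write the rest of the code against it. The sketch below assumes CuPy may be absent or that no usable CUDA device exists, and falls back to NumPy in either case; the alias xp is a convention, not something either library requires.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device is found
    xp = cp  # run on the GPU
except Exception:
    xp = np  # fall back to the CPU

# The same calls work with either module.
a = xp.arange(10, dtype=xp.float64)
total = float(xp.sqrt(a).sum())  # float() brings a GPU scalar back to the host

print("backend:", xp.__name__)
print("sum of square roots:", total)
```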
MinPy is similar to CuPy and is meant to be a NumPy interface above MXNet for building artificial neural networks. It includes auto differentiation in addition to transparent CPU/GPU acceleration.
-
What are some essential resources to learn NumPy?
The main NumPy website is the definitive resource to consult. Beginners can start by reading its Quickstart tutorial or the absolute beginner's guide. The latter includes the basics of installing NumPy.
Rougier's book titled From Python to Numpy focuses on Python programmers who wish to learn NumPy and its vectorization. Perhaps a classic is the book titled Guide to NumPy, by Travis E. Oliphant who created NumPy.
MATLAB users might want to read NumPy for Matlab users. It maps MATLAB operations to NumPy equivalents.
The DataCamp blog has shared a handy NumPy cheatsheet.
Those who wish to contribute to the NumPy project or study its source code can head to NumPy's GitHub repository.
Further Reading
- NumPy DevDocs. 2020. "NumPy: the absolute basics for beginners." April 26. Accessed 2020-04-27.
- Harris, Mark. 2013. "Numba: High-Performance Python with CUDA Acceleration." NVIDIA Developer Blog, September 19. Updated 2017-09-19. Accessed 2020-04-27.
- Zelenka, Scott. 2018. "How to shrink NumPy, SciPy, Pandas, and Matplotlib for your data product." Towards Data Science, on Medium, September 25. Accessed 2020-04-27.
See Also
- NumPy Data Types
- NumPy Array Operations
- Python for Scientific Computing
- SciPy
- Pandas
- PyCUDA