Optimizing Pandas
Summary
Pandas performance can be improved in terms of memory usage and speed of computation. Optimizations can be done broadly in two ways: (a) learning best practices and calling Pandas APIs the right way; (b) going under the hood and optimizing the core capabilities of Pandas. This article covers both aspects.
As a general observation, Pandas was designed for data analysis and usability. On the other hand, NumPy was designed for performance. However, there are some techniques to make Pandas more performant without sacrificing its ease of use. There are also third-party libraries that build on or work with Pandas to achieve better performance.
Discussion
What's an essential approach towards performant Pandas code?
Perhaps the most important rule is to avoid loops in Pandas code. Looping over a Series or a DataFrame processes data one item or one row/column at a time. Instead, operations should be vectorized: an operation should be performed on an entire Series or on whole DataFrame rows/columns. Developers should think of all operations as matrix computations that can be parallelized.
The worst and slowest approach is to use `df.iloc[i][j]` within a `for` loop. A slightly better approach is `df.iterrows()`, which returns a tuple containing a row index and a Series for the row. Better still is to remove loops completely and use `df.apply()`, which takes as its first argument a function that's applied to each row or column; internally it uses Cython iterators. By far the best approach is vectorization. For example, if `addr` is a Series containing addresses, `addr.str.upper().str.replace('.', '')` is applied to all items "at once". Register size and the computer's instruction set determine how many items can be processed in parallel.
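The following minimal sketch contrasts these approaches on a small, hypothetical `addr` column; only the last variant is vectorized:

```python
import pandas as pd

df = pd.DataFrame({"addr": ["12 Main St.", "34 Oak Ave.", "56 Pine Rd."]})

# Slowest: positional indexing inside an explicit loop.
cleaned = []
for i in range(len(df)):
    cleaned.append(df.iloc[i, 0].upper().replace(".", ""))

# Better: apply a Python function to each element.
cleaned = df["addr"].apply(lambda s: s.upper().replace(".", ""))

# Best: vectorized string methods act on the whole Series at once.
cleaned = df["addr"].str.upper().str.replace(".", "", regex=False)
```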
For data manipulation, what are some techniques for faster code?
Indices are commonly used in Pandas for lookups or for merging two datasets. It's faster to merge based on the index. For example, `listings_.merge(reviews_, left_index=True, right_index=True)` is faster than `listings.merge(reviews, on='listing_id')`, given that `reviews_ = reviews.set_index('listing_id')` and `listings_ = listings.set_index('listing_id')`.
When chaining multiple operations, the order is important. For example, it's faster to filter first and then merge. Thus, `listings.merge(reviews[reviews['listing_id'].isin(listings['listing_id'])], on='listing_id')` is faster than `listings.merge(reviews, on='listing_id')`.
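A minimal sketch of both techniques, assuming hypothetical `listings` and `reviews` DataFrames that share a `listing_id` column:

```python
import pandas as pd

listings = pd.DataFrame({"listing_id": [1, 2, 3],
                         "city": ["Lisbon", "Porto", "Faro"]})
reviews = pd.DataFrame({"listing_id": [1, 1, 2, 9],
                        "rating": [5, 4, 3, 2]})

# Technique 1: merge on a shared index instead of a column.
listings_ = listings.set_index("listing_id")
reviews_ = reviews.set_index("listing_id")
merged = listings_.merge(reviews_, left_index=True, right_index=True)

# Technique 2: filter first so the merge sees fewer rows.
relevant = reviews[reviews["listing_id"].isin(listings["listing_id"])]
merged = listings.merge(relevant, on="listing_id")
```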
What are some best practices for handling the datetime data type?
In general, adopt vectorization over explicit loops. If this is not possible, prefer `df.apply()`, `df.itertuples()` or `df.iterrows()`, in that order.
If datetime values are stored as the object data type, use the `pd.to_datetime()` method to convert them to the datetime type. Explicitly passing the `format` argument to this method speeds up the conversion.
In electricity billing, suppose different tariffs are applied based on the time of day. We can use `df.isin()` to selectively apply each tariff to a data subset. An even better way is to use `pd.cut()`. We can reduce execution time further by converting the data to NumPy arrays. In this example, it's also convenient to use the datetime column as the index.
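A minimal sketch of the billing example, with hypothetical tariff bands and column names:

```python
import pandas as pd

# Hypothetical meter readings; timestamps start out as strings (object dtype).
df = pd.DataFrame({
    "ts": ["2020-01-01 00:30:00", "2020-01-01 09:15:00", "2020-01-01 19:45:00"],
    "kwh": [0.8, 1.5, 2.1],
})

# An explicit format avoids per-value format inference during conversion.
df["ts"] = pd.to_datetime(df["ts"], format="%Y-%m-%d %H:%M:%S")
df = df.set_index("ts")

# pd.cut() bins the hour of day into tariff bands in one vectorized pass.
bands = pd.cut(df.index.hour, bins=[0, 7, 17, 24],
               labels=["night", "day", "peak"], right=False)
rates = pd.Series(bands, index=df.index).map(
    {"night": 0.10, "day": 0.20, "peak": 0.28}).astype(float)
df["cost"] = df["kwh"] * rates
```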
What are some techniques to improve Pandas performance?
There are a few well-known techniques to speed up Pandas:
- Cython: Cython is a superset of Python: Python code with additional type information. It's translated into optimized C/C++ code and then compiled as a Python extension module. Passing NumPy ndarrays (rather than creating and passing Pandas Series) into Cython functions gives a further improvement.
- Numba: This is a just-in-time (JIT) compiler. With a few annotations to Python code (`@numba.jit`, `@numba.vectorize`, etc.), runtime performance can come close to C/C++ or Fortran. There's support for both CPU and GPU hardware. However, as of Numba v0.20, optimizations work on NumPy arrays and not on Pandas objects.
- `eval()` and `query()`: For datasets larger than 10,000 rows, `pandas.eval()`, `DataFrame.eval()` and `DataFrame.query()` evaluate expressions faster. Only some expressions can be optimized: for example, `is`, `lambda`, `yield`, comprehensions and generator expressions can't be. The numexpr package needs to be installed.
Pandas User Guide documents all the above techniques with useful code examples. A short Numba and `eval()` sketch follows the next paragraph.
Particularly for large datasets, Pandas official documentation recommends numexpr, which uses multiple cores, smart chunking and caching; and bottleneck, which uses Cython routines to speed up some kinds of `nan` evaluations.
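A minimal sketch of the Numba and `eval()`/`query()` techniques, assuming the numba and numexpr packages are installed; the function and column names are illustrative:

```python
import numba
import numpy as np
import pandas as pd

@numba.jit(nopython=True)
def weighted_sum(values, weights):
    # Compiled to machine code on first call. Note that it receives
    # NumPy arrays, not Pandas objects.
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * weights[i]
    return total

df = pd.DataFrame(np.random.rand(100_000, 2), columns=["A", "B"])
result = weighted_sum(df["A"].to_numpy(), df["B"].to_numpy())

# eval()/query() hand the whole expression to numexpr,
# avoiding intermediate temporary arrays.
subset = df.query("A < 0.5 and B < 0.5")
df["C"] = df.eval("A + 2 * B")
```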
How does Pandas compare against NumPy in terms of performance?
NumPy is faster than Pandas because most of its calculations happen in precompiled, optimized C code. Although Pandas makes use of NumPy, it does a lot of housekeeping: tracking data types and indices, and checking for errors. This makes Pandas slower than NumPy. Therefore, one way to speed up Pandas code is to convert critical computations into NumPy, for example by calling the `to_numpy()` method.
One study on selecting a data subset showed NumPy outperforming Pandas by 10x to 1000x, with the gains diminishing on very large datasets. Regardless of DataFrame size, Pandas paid an initial penalty of about 1ms. Similar results were seen when taking the square root of numbers.
Within NumPy, operations within a single array are faster than operations that span multiple arrays. Thus, `a1 + a2` is slower than `np.sum(a_both, axis=1)` if we reorganize the data as `a_both[:, 0] = a1` and `a_both[:, 1] = a2`.
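A small sketch of this single-array trick; the names follow the text:

```python
import numpy as np

n = 1_000_000
a1 = np.random.rand(n)
a2 = np.random.rand(n)

# Two separate arrays: the sum reads from two memory regions.
s_two = a1 + a2

# One 2-column array: the row-wise sum stays within a single array.
a_both = np.empty((n, 2))
a_both[:, 0] = a1
a_both[:, 1] = a2
s_one = np.sum(a_both, axis=1)

assert np.allclose(s_two, s_one)
```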
How do I deal with large datasets in Pandas?
Pandas does in-memory analytics, which makes it difficult to handle datasets that don't fit in system memory. One approach is to load less data, such as loading only the columns needed for analysis. For example, `pandas.read_csv()` has a `usecols` argument for this.
Use efficient data types. Text data of low cardinality has a lot of values but only a few unique ones. Storing each value as a complete string is memory inefficient; such data should instead be stored as categorical data. Data with all unique values will not benefit from this conversion. Numeric values can be downcast to the smallest suitable types using `pandas.to_numeric()`.
Chunking is a technique that splits the dataset and processes it chunk by chunk. It works best when chunks require little coordination. Some methods such as `pandas.read_csv()` have a `chunksize` argument for chunking. Group-by operations are typically harder to do in chunks.
Where complex expressions such as `(df.A < 0.5) & (df.B < 0.5)` are involved, each sub-expression allocates temporary memory. This can be avoided with `pd.eval()`, `df.eval()` and `df.query()`, though this improves performance only on large datasets.
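A minimal sketch combining these techniques on a hypothetical `data.csv`; the column names are illustrative:

```python
import pandas as pd

# Load only the needed columns, with a memory-efficient dtype up front.
df = pd.read_csv("data.csv",
                 usecols=["country", "price", "quantity"],
                 dtype={"country": "category"})

# Downcast numeric columns to the smallest suitable types.
df["price"] = pd.to_numeric(df["price"], downcast="float")
df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")

# Process a file too big for memory in chunks.
total = 0.0
for chunk in pd.read_csv("data.csv", usecols=["price"], chunksize=100_000):
    total += chunk["price"].sum()

# query() evaluates the whole expression via numexpr, avoiding temporaries.
cheap = df.query("price < 0.5 and quantity < 10")
```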
How does Pandas store a DataFrame under the hood?
Pandas groups columns of the same type into what is called a block. A DataFrame is actually stored as one or more blocks. Using metadata, these blocks are composed into a DataFrame by a BlockManager. Thus, only a block is contiguous in memory.
It's possible to inspect the blocks using `df['A'].values.base` for column A. The output will also show the values of other columns in that block. To see the internal storage of all blocks, call `df._data`.
For example, slicing with `df.loc[:'2015-07-03', ['quantity', 'points']]` doesn't involve a data copy when both columns are of the same data type, but any cross-dtype slicing will typically involve copying. Another performance killer is appending rows to a DataFrame one at a time: each append reallocates memory and copies every block. Thus, it's better to concatenate two big datasets than to append one to the other row by row. When a new column is added, the block copy is deferred until an operation requires it.
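A small sketch of inspecting blocks; `df._data` is a private attribute, so the exact output varies across Pandas versions:

```python
import pandas as pd

df = pd.DataFrame({
    "quantity": [1.0, 2.0, 3.0],
    "points": [9.5, 8.0, 7.5],
    "label": ["a", "b", "c"],
})

# The two float columns share one block; .values.base exposes the
# underlying 2-D ndarray (it may be None if the column owns its data).
print(df["quantity"].values.base)

# BlockManager internals: one float block holding both float columns,
# and a separate object block for the string column.
print(df._data)
```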
Which third-party libraries help improve Pandas performance?
Indexing involves lots of lookups. klib is a C implementation that uses less memory and runs faster than Python's dictionary lookup. Since version 0.16.2, Pandas uses klib internally.
To run on multiple cores, use multiprocessing, Modin, Ray, Swifter, Dask or Spark. In one study, Spark did best at reading/writing large datasets and filling missing values, while Pandas did best for group-by operations. An advantage of Swifter is that it decides what to use (vectorization, Pandas apply or Dask) depending on the dataset size.
Swifter can also be used along with Modin. In fact, Modin partitions data in an optimal way and can work with other libraries that perform parallel execution.
Dask can use multiple threads or processes on a single machine or a cluster of machines. Combining Numba with Dask is even better. A short Dask sketch follows this answer.
PySpark is suitable for very large datasets but Pandas offers better APIs. A suitable approach is to use Spark with Apache Arrow. Arrow is an in-memory columnar format that helps to efficiently transfer data between Spark and Pandas. Based on Arrow, PyPolars is a highly performant DataFrame library.
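A minimal Dask sketch, assuming the dask package is installed and a hypothetical `data.csv`:

```python
import dask.dataframe as dd

# Dask reads the CSV lazily as many partitions (each a Pandas DataFrame).
ddf = dd.read_csv("data.csv")

# Familiar Pandas-style API; the work is scheduled across threads or
# processes and only executes when .compute() is called.
result = ddf.groupby("country")["price"].mean().compute()
print(result)
```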
Milestones
2011
Wes McKinney, creator of Pandas, writes about the BlockManager on his blog and explains how it works. It's the internal data structure in Pandas. A block is really a "homogeneously-typed N-dimensional NumPy ndarray object". Gluing together columns of the same type into a single block is called consolidation. In the early days of Pandas (0.1 and 0.2), a DataFrame was stored using a Python `dict`, until the BlockManager took over.
2015
Pandas 0.16.2 is released. As noted in the discussion above, from this version Pandas uses the klib C library for fast, memory-efficient lookups.
2017
McKinney notes on his blog that Pandas typically requires 5-10 times as much RAM as the size of the dataset. This is bad for analytics on large datasets. What's needed is a columnar format optimized for analytics with zero-copy access. This is where Apache Arrow fits in. Arrow 0.1.0 was released in October 2016.
2018
Dask 1.0.0, a library for parallel computing in Python, is released. A Pandas DataFrame becomes a Dask DataFrame, whose processing can be scheduled on multiple threads/processes of a single machine or distributed across machines. Dask takes care of the task scheduling. The release of Dask 0.2.0 can be traced to January 2015.
2020
Pandas 1.0.0 is released. It provides a number of performance improvements. To mention a few: DataFrame arithmetic and comparison operations with scalars; indexing with a non-unique IntervalIndex; initializing a DataFrame from a range; `DataFrame.select_dtypes()` using vectorization instead of a loop; and comparing a Categorical with a scalar when the scalar is not found in the categories.
2020
Swifter 1.0.0 is released. Swifter "efficiently applies any function to a pandas dataframe or series in the fastest available manner". If the function can be vectorized, it vectorizes it. If not, it selects the best of the Pandas, Dask or Swifter apply methods. Swifter can be traced to April 2018, when version 0.1 was released.
2021
The Pandas roadmap as of this date notes a few items related to performance. Apache Arrow may be explored as an in-memory array library; at the least, Pandas would interoperate better with Arrow. The BlockManager and 2-D blocks are complex and could be replaced with a simpler collection of 1-D arrays. The BlockManager uses both labels and positions for indexing; in the future, it might use only the latter. Where Pandas accepts user-defined functions, developers should be able to provide Numba-jitted functions.
Sample Code
References
- Apache Arrow. 2020. "Apache Arrow Releases." Apache Arrow, October 19. Accessed 2021-01-12.
- Apache Spark. 2020. "PySpark Usage Guide for Pandas with Apache Arrow." Spark SQL Guide, Apache Spark 3.0.1, September. Accessed 2021-01-10.
- Cython. 2019. "Cython - an overview." Docs, Cython 3.0a6, February 27. Accessed 2021-01-11.
- Dask Docs. 2020. "Homepage." Documentation, Dask, December 19. Accessed 2021-01-12.
- Dask GitHub. 2021. "Releases." Dask, on GitHub, January 11. Accessed 2021-01-12.
- Droste, Benedikt. 2019. "How To Make Your Pandas Loop 71803 Times Faster." Towards Data Science, on Medium, August 23. Accessed 2021-01-11.
- Grant, Peter. 2020. "How to Speedup your Pandas Code by 10x." Towards Data Science, on Medium, February 22. Accessed 2021-01-10.
- Kanz, Andreas. 2020. "Speed up your Data Cleaning and Preprocessing with klib." Towards Data Science, on Medium, August 6. Accessed 2021-01-10.
- Kim, Ernest. 2018. "Data Pre-Processing in Python: How I learned to love parallelized applies with Dask and Numba." Towards Data Science, on Medium, March 5. Accessed 2021-01-10.
- Konrad, Markus. 2018. "Vectorization and parallelization in Python with NumPy and Pandas." WZB Data Science Blog, February 02. Accessed 2021-01-11.
- Korn, Uwe. 2020. "The one pandas internal I teach all my new colleagues: the BlockManager." Uwe's Blog, May 24. Accessed 2021-01-10.
- McKinney, Wes. 2011. "A Roadmap for Rich Scientific Data Structures in Python." Blog, July 21. Accessed 2021-01-10.
- McKinney, Wes. 2017. "Apache Arrow and the 10 Things I Hate About pandas." Blog, September 21. Accessed 2021-01-10.
- Ng, Ritchie. 2020. "Making DataFrame Smaller and Faster." May 26. Accessed 2021-01-12.
- Peluritis, Tomas. 2019. "Single Node Processing Vol. 2 — Spark, Dask and Pandas." Medium, December 17. Accessed 2021-01-10.
- Polars GitHub. 2021. "ritchie46/polars." Polars, on GitHub, January 10. Accessed 2021-01-12.
- PyData. 2015. "What's New." Docs, Pandas 0.16.2, June 12. Accessed 2021-01-11.
- PyData. 2020a. "Enhancing performance." User Guide, Pandas, November 14. Accessed 2021-01-10.
- PyData. 2020b. "Scaling to large datasets." User Guide, Pandas, October 5. Accessed 2021-01-10.
- PyData. 2020c. "Roadmap." Docs, Pandas, September 30. Accessed 2021-01-10.
- PyData. 2020d. "Installation." Docs, Pandas, December 28. Accessed 2021-01-11.
- PyData. 2020e. "What's new in 1.0.0." Release Notes, Pandas, January 29. Updated 2020-11-14. Accessed 2021-01-11.
- PyData GitHub. 2016. "Internals: Data structure changes." Pandas Design, PyData, on GitHub, August 24. Accessed 2021-01-12.
- Seif, George. 2019a. "How to Speed up Pandas by 4x with one line of code." KDNuggets, November. Accessed 2021-01-10.
- Seif, George. 2019b. "How to use Pandas the RIGHT way to speed up your code." Towards Data Science, on Medium, May 22. Accessed 2021-01-10.
- Solomon, Brad. 2018. "Python Pandas: Tricks & Features You May Not Know." Real Python, August 29. Updated 2020-12-12. Accessed 2021-01-10.
- Swifter GitHub. 2020. "jmcarpenter2/swifter." Swifter, on GitHub, December 20. Accessed 2021-01-11.
- Tratner, Jeff. 2015. "Pandas Under The Hood." PyData Seattle, July 25. Accessed 2021-01-10.
- VanderPlas, Jake. 2016. "High-Performance Pandas: eval() and query()." In: Python Data Science Handbook, O'Reilly Media, Inc. Accessed 2021-01-10.
- Wyndham, Joe. 2018. "Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects." Real Python, July 23. Updated 2020-10-30. Accessed 2021-01-10.
- Zhang, Alina. 2019. "Pandas, Dask or PySpark? What Should You Choose for Your Dataset?" Data Driven Investor, on Medium, August 22. Accessed 2021-01-10.
- van der Gugten, Robbert. 2019. "Advanced Pandas: Optimize speed and memory." Big Data Republic, on Medium, August 30. Accessed 2021-01-10.
- Żero, Oleg. 2020. "Performance of numpy and pandas — comparison." Towards Data Science, on Medium, January 14. Accessed 2021-01-10.
Further Reading
- Heisler, Sofia. 2017. "A Beginner’s Guide to Optimizing Pandas Code for Speed." Upside, August 2. Accessed 2021-01-10.
- Korn, Uwe. 2020. "The one pandas internal I teach all my new colleagues: the BlockManager." Uwe's Blog, May 24. Accessed 2021-01-10.
- Kim, Ernest. 2018. "Data Pre-Processing in Python: How I learned to love parallelized applies with Dask and Numba." Towards Data Science, on Medium, March 5. Accessed 2021-01-10.
- McKinney, Wes. 2017. "Apache Arrow and the 10 Things I Hate About pandas." Blog, September 21. Accessed 2021-01-10.
- Abdullah, Ezz El Din. 2020. "A Rising Library Beating Pandas in Performance." KDNuggets, December. Accessed 2021-01-10.
- Drakos, Georgios. 2019. "Speed Up Pandas apply function using Dask or Swifter (tutorial)." GDCoder, April 30. Accessed 2021-01-10.
See Also
- Pandas
- Pandas Data Structures
- Vectorized Operations
- Spark Performance Tuning
- PyCUDA
- Vaex