Optimizing Pandas

Performance of Pandas can be improved in terms of memory usage and speed of computation. Optimizations can be done in broadly two ways: (a) learning best practices and calling Pandas APIs the right way; (b) going under the hood and optimizing the core capabilities of Pandas. This article covers both these aspects.

As a general observation, Pandas was designed for data analysis and usability. On the other hand, NumPy was designed for performance. However, there are some techniques to make Pandas more performant without sacrificing its ease of use. There are also third-party libraries that build on or work with Pandas to achieve better performance.

Discussion

  • What's an essential approach towards performant Pandas code?
    Vectorization is much faster than standard loops. Source: Droste 2019.
    Vectorization is much faster than standard loops. Source: Droste 2019.

    Perhaps the most important rule is to avoid using loops in Pandas code. Looping over a Series or a DataFrame processes data one item or row/column at a time. Instead, operations should be vectorized. This means an operation should be performed on the entire Series or DataFrame row/column. Developers should think of all operations as matrix computations that can be parallelized.

    The worst or slowest approach is to use df.iloc[i][j] within a for loop. A slightly better approach is to use df.iterrows() that returns a tuple containing a row index and a Series for the row. Even better is to remove loops completely and use df.apply() that takes as first argument a function that's applied to each row or column. This internally uses Cython iterators.

    By far, the best approach is to use vectorization. For example, if addr is a Series containing addresses, addr.str.upper().str.replace('.', '') is applied to all items "at once". Register size and the computer's instruction set determines how many items can be parallelized.

  • For data manipulation, what are some techniques for faster code?

    Indices are commonly used in Pandas for lookup or merging two datasets. It's faster to merge based the index. For example, listings_.merge(reviews_, left_index=True, right_index=True) is faster than listings.merge(reviews, on='listing_id'), given that reviews_ = reviews.set_index('listing_id') and listings_ = listings.set_index('listing_id').

    When chaining multiple operations, the order is important. For example, it's faster to filter first and then merge. Thus, listings.merge(reviews[reviews[listing_id].isin(listings[listing_id])], on='listing_id') is faster than listings.merge(reviews, on='listing_id').

  • What are some best practices for handling datetime data type?

    In general, adopt vectorization over explicit loops. If this is not possible, prefer df.apply(), df.itertuples() or df.iterrows() in that order.

    If datetime values are stored as object data type, use pd.to_datetime() method to convert to datetime type. Explicitly passing format argument to this method speeds up the conversion.

    In electricity billing, suppose different tariffs are applied based on time of the day. We can use df.isin() to selectively apply the tariff to data subsets. An even better way is to use pd.cut(). We can reduce execution time further by converting data to NumPy arrays. In this example, it's also convenient to use the datetime column as the index.

  • What are some techniques to improve Pandas performance?

    There are a few known techniques to speed up Pandas:

    • Cython: Cython is a superset of Python. It's Python code with additional type information. This is translated into optimized C/C++ code and then compiled as Python extension modules. Passing NumPy ndarray (rather than creating and passing Pandas Series) into Cython functions gives further improvement.
    • Numba: This is a just-in-time (JIT) compiler. With a few annotations to Python code (@numba.jit, @numba.vectorize, etc), runtime performance can come close to C/C++ or Fortran. There's support for both CPU and GPU hardware. However, as of Numba v0.20, optimizations work on NumPy arrays and not Pandas objects.
    • eval() and query(): For datasets larger than 10,000 rows, pandas.eval(), DataFrame.eval() and DataFrame.query() evaluate expressions faster. Only some expressions can be optimized. For example, is, lambda, yield, comprehensions and generator expressions can't be optimized. Package numexpr needs to be installed.

    Pandas User Guide documents all the above techniques with useful code examples.

    Particularly for large datasets, Pandas official documentation recommends numexpr that uses multiple cores, smart chunking and caching; and bottleneck that uses Cython routines to speed up some types of nan evaluations.

  • How does Pandas compare against NumPy in terms of performance?
    On subsetting operation, NumPy outperforms Pandas by 10x to 1000x. Source: Żero 2020, fig. 1.
    On subsetting operation, NumPy outperforms Pandas by 10x to 1000x. Source: Żero 2020, fig. 1.

    NumPy is faster than Pandas because most of the calculations happen using precompiled optimized C code. Although Pandas makes use of NumPy, it does a lot of housekeeping: tracking data types and indices, and checking for errors. This makes Pandas slower than NumPy. Therefore, one way to speed up Pandas code is to convert critical computations into NumPy, for example by calling to_numpy() method.

    One study on selecting a data subset showed NumPy outperforming Pandas by 10x to 1000x, with the gains diminishing on very large datasets. Regardless of DataFrame size, Pandas paid an initial penalty 1ms. Similar results were seen when taking square root of numbers.

    Within NumPy, operations within an array are faster than operations that span multiple arrays. Thus, a1 + a2 is slower than np.sum(a_both, axis=1) if we reorganize the data as a_both[0, :] = a1 and a_both[1, :] = a2.

  • How do I deal with large datasets in Pandas?

    Pandas does in-memory analytics, which makes it difficult to handle large datasets that don't fit in system memory. One approach is to load less data, such as loading only columns needed for analysis. For example, pandas.read_csv() has argument usecols to do this.

    Use efficient datatypes. Text data of low cardinality are those with lot of values but only a few unique values. Storing each value as a complete string is memory inefficient. Instead, they should be stored as categorical data. Data with all unique values will not benefit from such a conversion. Numerical values can be downcast to smallest types using pandas.to_numeric().

    Chunking is a technique to split the dataset and process in chunks. This works best when chunks require little coordination. Some methods such as pandas.read_csv() has argument chunksize to do chunking. Group-by operations are typically harder to do in chunks.

    Where complex expressions are involved such as (df.A < 0.5) & (df.B < 0.5)), each sub-expression creates temporary memory. This can be avoided with pd.eval(), df.eval() and df.query(). This improves performance only on large datasets.

  • How does Pandas store a DataFrame under the hood?
    DataFrames columns are grouped into blocks and managed by the BlockManager. Source: Tratner 2015, slide 36.
    DataFrames columns are grouped into blocks and managed by the BlockManager. Source: Tratner 2015, slide 36.

    Pandas groups columns of the same type into what is called a block. A DataFrame is actually stored as one or more blocks. Using metadata, these blocks are composed into a DataFrame by a BlockManager. Thus, only a block is contiguous in memory.

    It's possible to inspect the blocks using df['A'].values.base for column A. The output will also show values of other columns in that block. To see the internal storage of all blocks, call df._data.

    Referring to the figure, slicing with df.loc[:'2015-07-03', ['quantity', 'points']] doesn't involve data copy since both columns are of the same data type. But any cross-dtype slicing will typically involve copying. Another performance killer is appending a row to a DataFrame. This would result in reallocating memory and copying every block. Thus, it's better to concatenate two big datasets rather than append one to the other row by row. When a new column is added, block copy is deferred until an operation requires it.

  • Which third-party libraries help improve Pandas performance?
    Dask or PySpark can be suitable for large datasets. Source: Zhang 2019.
    Dask or PySpark can be suitable for large datasets. Source: Zhang 2019.

    Indexing involves lots of lookups. klib is a C implementation that uses less memory and runs faster than Python's dictionary lookup. Since version 0.16.2, Pandas already uses klib.

    To run on multiple cores, use multiprocessing, Modin, Ray, Swifter, Dask or Spark. In one study, Spark did best on reading/writing large datasets and filling missing values. Pandas did best for group-by operation. An advantage of Swifter is that it decides what to use (vectorization, Pandas apply or Dask) depending on the dataset size.

    Swifter can also be used along with Modin. In fact, Modin partitions data in an optimal way but can work with other libraries that perform parallel execution.

    Dask can use multiple threads or processes on a single machine or a cluster of machines. Combining Numba with Dask is even better.

    PySpark is suitable for very large datasets but Pandas offers better APIs. A suitable approach is to use Spark with Apache Arrow. Arrow is an in-memory columnar format that helps to efficiently transfer data between Spark and Pandas. Based on Arrow, PyPolars is a highly performant DataFrame library.

Milestones

Jul
2011

Wes McKinney, creator of Pandas, writes about BlockManager in a blog and explains how it works. It's the internal data structure in Pandas. A block is really a "homogeneously-typed N-dimensional NumPy ndarray object". Gluing together columns of the same type into a single block is called consolidation. In the early days of Pandas (0.1 and 0.2) a DataFrame was stored using Python dict until the BlockManager took over.

Jun
2015

With the release of Pandas 0.16.2, klib package is used to improve performance of indexing. This takes less memory and runs faster than Python's dictionary. This should not be confused with klib (first released in April 2020), a library that simplifies visualization of DataFrame.

Sep
2017

McKinney notes on his blog that Pandas requires typically 5-10 times RAM as the size of the dataset. This is bad for analytics on large datasets. What's needed is a columnar format optimized for analytics with zero-copy access. This is where Apache Arrow fits in. Arrow 0.1.0 was released in October 2016.

Nov
2018

As a library for parallel computing in Python, Dask 1.0.0 is released. Pandas DataFrame becomes Dask DataFrame whose processing can be scheduled on a multiple threads/processes of a single machine or distributed across machines. Dask takes care of the task scheduling. The release of Dask 0.2.0 can be traced to January 2015.

Jan
2020

Pandas 1.0.0 is released. This provides a number of performance improvements. We mention a few: DataFrame arithmetic and comparison operations with scalars; indexing with a non-unique IntervalIndex; initializing a DataFrame using a range; DataFrame.select_dtypes() uses vectorization instead of a loop; comparing a Categorical with a scalar and the scalar is not found in the categories.

Jul
2020
Speed comparison of vectorized function, Pandas/Dask/Swifter apply. Source: Swifter GitHub 2020.

Swifter 1.0.0 is released. Swifter "efficiently applies any function to a pandas dataframe or series in the fastest available manner". If the function can be vectorized, it vectorizes it. If not, it selects the best of Pandas, Dask or Swifter apply method. Swifter can be traced to April 2018 when version 0.1 was released.

Sep
2020
PyPolars shows faster execution and lower memory usage for GroupBy operation. Source: Polars GitHub 2021.

Package PyPolars 0.1.0 is released. It's implemented in Rust and uses Apache Arrow as its memory model. It supports lazy and eager evaluations. On datasets larger than 10,000 rows, it outperforms both pandas and pydatatable (another DataFrame library).

Jan
2021

Pandas roadmap as on this date notes a few items related to performance. Apache Arrow may be explored as an in-memory array library. At the least, Pandas would interoperate better with Arrow. BlockManager and 2-D blocks are complex. They could be replaced with a simpler collection of 1-D arrays. BlockManager uses both labels and positions for indexing. In future, it might use only the latter. Where Pandas accepts user-defined functions, developers should be able to provide Numba-jitted functions.

Sample Code

  • # Source: https://realpython.com/fast-flexible-pandas/
    # Accessed 2021-01-11
     
    # Slow, plain Python code with explicit loops: 3.152 s
    def apply_tariff(kwh, hour):
        """Calculates cost of electricity for given hour."""
        if 0 <= hour < 7:
            rate = 12
        elif 7 <= hour < 17:
            rate = 20
        elif 17 <= hour < 24:
            rate = 28
        else:
            raise ValueError(f'Invalid hour: {hour}')
        return rate * kwh
     
    def apply_tariff_loop(df):
        """Calculate costs in loop.  Modifies `df` inplace."""
        energy_cost_list = []
        for i in range(len(df)):
            # Get electricity used and hour of day
            energy_used = df.iloc[i]['energy_kwh']
            hour = df.iloc[i]['date_time'].hour
            energy_cost = apply_tariff(energy_used, hour)
            energy_cost_list.append(energy_cost)
        df['cost_cents'] = energy_cost_list
     
    apply_tariff_loop(df)
     
     
    # Code using iterrows(): 0.713 s
    def apply_tariff_iterrows(df):
        energy_cost_list = []
        for index, row in df.iterrows():
            # Get electricity used and hour of day
            energy_used = row['energy_kwh']
            hour = row['date_time'].hour
            # Append cost list
            energy_cost = apply_tariff(energy_used, hour)
            energy_cost_list.append(energy_cost)
        df['cost_cents'] = energy_cost_list
     
    apply_tariff_iterrows(df)
     
     
    # Code using apply(): 0.272 s
    def apply_tariff_withapply(df):
        df['cost_cents'] = df.apply(
            lambda row: apply_tariff(
                kwh=row['energy_kwh'],
                hour=row['date_time'].hour),
            axis=1)
     
    apply_tariff_withapply(df)
     
     
    # Code using isin(): 0.010 s
    def apply_tariff_isin(df):
        # Define hour range Boolean arrays
        peak_hours = df.index.hour.isin(range(17, 24))
        shoulder_hours = df.index.hour.isin(range(7, 17))
        off_peak_hours = df.index.hour.isin(range(0, 7))
     
        # Apply tariffs to hour ranges
        df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
        df.loc[shoulder_hours,'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
        df.loc[off_peak_hours,'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12
     
    apply_tariff_isin(df)
     
     
    # Code using cut(): 0.003 s
    def apply_tariff_cut(df):
        cents_per_kwh = pd.cut(x=df.index.hour,
                               bins=[0, 7, 17, 24],
                               include_lowest=True,
                               labels=[12, 20, 28]).astype(int)
        df['cost_cents'] = cents_per_kwh * df['energy_kwh']
     
    apply_tariff_cut(df)
     
     
    # Code usign cut() with NumPy: 0.002 s
    def apply_tariff_digitize(df):
        prices = np.array([12, 20, 28])
        bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
        df['cost_cents'] = prices[bins] * df['energy_kwh'].values
     
    apply_tariff_digitize(df)

References

  1. Apache Arrow. 2020. "Apache Arrow Releases." Apache Arrow, October 19. Accessed 2021-01-12.
  2. Apache Spark. 2020. "PySpark Usage Guide for Pandas with Apache Arrow." Spark SQL Guide, Apache Spark 3.0.1, September. Accessed 2021-01-10.
  3. Cython. 2019. "Cython - an overview." Docs, Cython 3.0a6, February 27. Accessed 2021-01-11.
  4. Dask Docs. 2020. "Homepage." Documentation, Dask, December 19. Accessed 2021-01-12.
  5. Dask GitHub. 2021. "Releases." Dask, on GitHub, January 11. Accessed 2021-01-12.
  6. Droste, Benedikt. 2019. "How To Make Your Pandas Loop 71803 Times Faster." Towards Data Science, on Medium, August 23. Accessed 2021-01-11.
  7. Grant, Peter. 2020. "How to Speedup your Pandas Code by 10x." Towards Data Science, on Medium, February 22. Accessed 2021-01-10.
  8. Kanz, Andreas. 2020. "Speed up your Data Cleaning and Preprocessing with klib." Towards Data Science, on Medium, August 6. Accessed 2021-01-10.
  9. Kim, Ernest. 2018. "Data Pre-Processing in Python: How I learned to love parallelized applies with Dask and Numba." Towards Data Science, on Medium, March 5. Accessed 2021-01-10.
  10. Konrad, Markus. 2018. "Vectorization and parallelization in Python with NumPy and Pandas." WZB Data Science Blog, February 02. Accessed 2021-01-11.
  11. Korn, Uwe. 2020. "The one pandas internal I teach all my new colleagues: the BlockManager." Uwe's Blog, May 24. Accessed 2021-01-10.
  12. McKinney, Wes. 2011. "A Roadmap for Rich Scientific Data Structures in Python." Blog, July 21. Accessed 2021-01-10.
  13. McKinney, Wes. 2017. "Apache Arrow and the 10 Things I Hate About pandas." Blog, September 21. Accessed 2021-01-10.
  14. Ng, Ritchie. 2020. "Making DataFrame Smaller and Faster." May 26. Accessed 2021-01-12.
  15. Peluritis, Tomas. 2019. "Single Node Processing Vol. 2 — Spark, Dask and Pandas." Medium, December 17. Accessed 2021-01-10.
  16. Polars GitHub. 2021. "ritchie46/polars." Polars, on GitHub, January 10. Accessed 2021-01-12.
  17. PyData. 2015. "What's New." Docs, Pandas 0.16.2, June 12. Accessed 2021-01-11.
  18. PyData. 2020a. "Enhancing performance." User Guide, Pandas, November 14. Accessed 2021-01-10.
  19. PyData. 2020b. "Scaling to large datasets." User Guide, Pandas, October 5. Accessed 2021-01-10.
  20. PyData. 2020c. "Roadmap." Docs, Pandas, September 30. Accessed 2021-01-10.
  21. PyData. 2020d. "Installation." Docs, Pandas, December 28. Accessed 2021-01-11.
  22. PyData. 2020e. "What's new in 1.0.0." Release Notes, Pandas, January 29. Updated 2020-11-14. Accessed 2021-01-11.
  23. PyData GitHub. 2016. "Internals: Data structure changes." Pandas Design, PyData, on GitHub, August 24. Accessed 2021-01-12.
  24. Seif, George. 2019a. "How to Speed up Pandas by 4x with one line of code." KDNuggets, November. Accessed 2021-01-10.
  25. Seif, George. 2019b. "How to use Pandas the RIGHT way to speed up your code." Towards Data Science, on Medium, May 22. Accessed 2021-01-10.
  26. Solomon, Brad. 2018. "Python Pandas: Tricks & Features You May Not Know." Real Python, August 29. Updated 2020-12-12. Accessed 2021-01-10.
  27. Swifter GitHub. 2020. " jmcarpenter2/swifter." Swifter, on GitHub, December 20. Accessed 2021-01-11.
  28. Tratner, Jeff. 2015. "Pandas Under The Hood." PyData Seattle, July 25. Accessed 2021-01-10.
  29. van der Gugten, Robbert. 2019. "Advanced Pandas: Optimize speed and memory." Big Data Republic, on Medium, August 30. Accessed 2021-01-10.
  30. VanderPlas, Jake. 2016. "High-Performance Pandas: eval() and query()." In: Python Data Science Handbook, O'Reilly Media, Inc. Accessed 2021-01-10.
  31. Wyndham, Joe. 2018. "Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects." Real Python, July 23. Updated 2020-10-30. Accessed 2021-01-10.
  32. Zhang, Alina. 2019. "Pandas, Dask or PySpark? What Should You Choose for Your Dataset?" Data Driven Investor, on Medium, August 22. Accessed 2021-01-10.
  33. Żero, Oleg. 2020. "Performance of numpy and pandas — comparison." Towards Data Science, on Medium, January 14. Accessed 2021-01-10.

Further Reading

  1. Heisler, Sofia. 2017. "A Beginner’s Guide to Optimizing Pandas Code for Speed." Upside, August 2. Accessed 2021-01-10.
  2. Korn, Uwe. 2020. "The one pandas internal I teach all my new colleagues: the BlockManager." Uwe's Blog, May 24. Accessed 2021-01-10.
  3. Kim, Ernest. 2018. "Data Pre-Processing in Python: How I learned to love parallelized applies with Dask and Numba." Towards Data Science, on Medium, March 5. Accessed 2021-01-10.
  4. McKinney, Wes. 2017. "Apache Arrow and the 10 Things I Hate About pandas." Blog, September 21. Accessed 2021-01-10.
  5. Abdullah, Ezz El Din. 2020. "A Rising Library Beating Pandas in Performance." KDNuggets, December. Accessed 2021-01-10.
  6. Drakos, Georgios. 2019. "Speed Up Pandas apply function using Dask or Swifter (tutorial)." GDCoder, April 30. Accessed 2021-01-10.

Article Stats

Author-wise Stats for Article Edits

Author
No. of Edits
No. of Chats
DevCoins
3
0
1269
1896
Words
0
Chats
3
Edits
0
Likes
7
Hits

Cite As

Devopedia. 2021. "Optimizing Pandas." Version 3, January 12. Accessed 2021-01-12. https://devopedia.org/optimizing-pandas
Contributed by
1 author


Last updated on
2021-01-12 06:38:30