Pandas
Pandas is a Python package that enables in-memory data manipulation and analysis. It offers many ways to read and store data. It can inspect, clean, filter, merge, combine or transform data to suit the needs of analysis. By the mid-2010s, it had become an essential tool in the data scientist's toolkit.
Pandas is built on another popular Python package called NumPy. NumPy and Pandas data structures have become common formats that many Python packages tend to support.
Pandas is an open-source BSD-licensed project sponsored by NumFOCUS. Pandas is often seen as Python's answer to similar capabilities in the R language.
Discussion
-
Why do I need Pandas when there's NumPy? NumPy has multi-dimensional arrays that are optimized for numerical computation. NumPy supports element-wise mathematical operations on arrays (addition, division, cosine) and dot product of arrays. Arrays can be transposed, combined, sliced and indexed in flexible ways. Array content can be aggregated (sum, min, max, mean).
Pandas is built on top of NumPy. It can therefore do everything that NumPy can do. In fact, it's easy to convert between Pandas and NumPy data structures, both of which can be used within the same codebase. In addition, Pandas offers methods that simplify data analysis. NumPy isn't flexible or easy to use for statistical work. A NumPy array must contain values of the same type, which is not common in real-world data.
Pandas is better suited for tabular data. Consider data stored in a spreadsheet or a database. It's typical to give unique labels to rows and columns. Accessing data by these labels makes it easier to write and maintain code. Pandas also has much better support for reading from or writing to CSV/JSON/Excel/HDF5/HTML files and SQL databases such as MySQL.
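The points above can be sketched with a small example. The table, its labels and its values are made up for illustration. The CSV round trip uses in-memory text rather than a real file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: a text column, a numeric column and a boolean column.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Pune"],
        "population_m": [0.7, 10.1, 3.1],
        "coastal": [True, True, False],
    },
    index=["osl", "lim", "pun"],  # row labels
)

# Each column keeps its own type, unlike a single NumPy array.
column_types = df.dtypes.astype(str).to_dict()

# Label-based access: row "lim", column "population_m".
lima_population = df.loc["lim", "population_m"]

# Converting between Pandas and NumPy data structures.
population_array = df["population_m"].to_numpy()
doubled = pd.Series(population_array * 2, index=df.index)

# Round trip through CSV text held in memory (no file needed).
csv_text = df.to_csv()
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

Here the NumPy array carries only the values, while the Series built from it gets the row labels back through the index argument.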
-
Could you describe some use cases of Pandas? We can use Pandas for data inspection and profiling. We can view a small sample of the dataset. Data is described using basic statistics: mean, median, min, max, etc. Profiling can point out missing or duplicate values, or correlations among variables. Pandas Profiling is a package that does this automatically.
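A minimal sketch of such an inspection might look like the following. It assumes a tiny made-up dataset with one missing value and one duplicated row; the column names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value and one duplicated row.
df = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 170.0, np.nan, 160.0],
        "weight_kg": [50.0, 60.0, 70.0, 80.0, 60.0],
    }
)

sample = df.head(3)                          # view a small sample
summary = df.describe()                      # count, mean, std, min, quartiles, max
median_height = df["height_cm"].median()     # also the "50%" row of describe()
missing_per_column = df.isna().sum()         # missing values per column
duplicate_rows = int(df.duplicated().sum())  # rows repeating an earlier row
correlations = df.corr()                     # pairwise Pearson correlation
```

By default, the statistics skip missing values, so the mean of height_cm is computed over the four known heights only.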
Another task for a data scientist is Exploratory Data Analysis (EDA). This is done to gain insights into the data and identify possible input features before any modelling is done. Pandas fits the interactive and iterative nature of EDA well. DataPrep.eda is an EDA tool built on top of Pandas.
Before training a machine learning model, Pandas can simplify data preparation. For example, categorical labels are converted to numerical values. Missing values are replaced with mean or median values. The predicted variable is separated from the rest of the dataset. Data are merged, sorted, grouped, filtered, reshaped, etc.
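The preparation steps just listed could be sketched as below. The dataset and the column names (colour, size_cm, sold) are hypothetical, and category codes are only one of several ways to encode labels.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; "sold" plays the role of the predicted variable.
df = pd.DataFrame(
    {
        "colour": ["red", "green", "red", "blue"],
        "size_cm": [10.0, np.nan, 14.0, 12.0],
        "sold": [1, 0, 1, 0],
    }
)

# Convert categorical labels to numerical codes
# (categories are sorted, so blue=0, green=1, red=2).
df["colour_code"] = df["colour"].astype("category").cat.codes

# Replace missing values with the column median.
df["size_cm"] = df["size_cm"].fillna(df["size_cm"].median())

# Separate the predicted variable from the input features.
y = df["sold"]
X = df.drop(columns=["sold", "colour"])
```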
Time Series Analysis (TSA) is another use case of Pandas. Pandas has data types to handle date/time values. It can do time-based indexing, time zone handling, resampling, rolling windows, and trend analysis.
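As a rough illustration of these time series features, consider two weeks of made-up daily readings. The dates, values and the target time zone are arbitrary choices for the sketch.

```python
import pandas as pd

# Hypothetical daily readings for two weeks, starting on a Monday.
index = pd.date_range("2021-01-04", periods=14, freq="D")
readings = pd.Series(range(1, 15), index=index, dtype="float64")

# Time-based indexing: slice by date strings (both ends inclusive).
midweek = readings.loc["2021-01-05":"2021-01-07"]

# Resampling: convert daily data to weekly means (weeks end on Sunday).
weekly_mean = readings.resample("W").mean()

# Rolling window: 3-day moving average.
moving_avg = readings.rolling(window=3).mean()

# Shifting: yesterday's value aligned with today, then day-over-day change.
lagged = readings.shift(1)
daily_change = readings - lagged

# Time zone handling: mark as UTC, then convert to another zone.
readings_ist = readings.tz_localize("UTC").tz_convert("Asia/Kolkata")
```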
-
Which are the main features of Pandas? We note the following important features:
- Data Structures: Inspired by R's data.frame, DataFrame is a good fit for storing and manipulating tabular data. This includes indices on rows and labels on columns. There's also Series (1D vectors) and Panel (3D tables, removed in Pandas 0.25).
- Input/Output: Pandas can read/write data between memory and various file formats. For example, reading from a CSV file is just two lines of code whereas the same task in Java would need 30 lines.
- Indexing: Data can be indexed in many ways. Hierarchical indexing allows intuitive access to high-dimensional data.
- Manipulation: Pandas can reshape or pivot data. It's easy to add or remove columns. Different datasets can be merged or automatically aligned based on the indices. Aggregations and even more sophisticated "group by" operations common with SQL database tables are possible.
- Missing Data: These can be either ignored, dropped or replaced depending on the operation. There are methods to detect the presence of missing data.
- Time-Series Data: There's support for date range generation, frequency conversion, moving windows statistics, date shifting and lagging.
- Optimization: Critical code paths are optimized in Cython or C.
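Several of the features in this list can be seen in a short sketch. The sales and managers tables are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sales figures with one missing value.
sales = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "revenue": [100.0, 120.0, 80.0, np.nan],
    }
)

# Hierarchical indexing: two index levels on the rows.
indexed = sales.set_index(["region", "quarter"])
north_q2 = indexed.loc[("north", "Q2"), "revenue"]

# "Group by" aggregation, similar to SQL GROUP BY; sum() skips NaN.
revenue_by_region = sales.groupby("region")["revenue"].sum()

# Reshape: pivot quarters into columns.
wide = sales.pivot(index="region", columns="quarter", values="revenue")

# Merge with another dataset on a shared column.
managers = pd.DataFrame({"region": ["north", "south"], "manager": ["Ann", "Raj"]})
merged = sales.merge(managers, on="region", how="left")

# Automatic alignment: values are matched by index label, not by position.
a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "z"])
aligned_sum = a + b  # only "y" exists in both, the rest become NaN

# Missing data: detect, drop or replace.
has_missing = bool(sales["revenue"].isna().any())
dropped = sales.dropna(subset=["revenue"])
filled = sales.fillna({"revenue": 0.0})
```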
-
How well is Pandas supported by other Python packages and tools? Many well-known Python packages and tools understand or even extend Pandas DataFrame and Series. Some of these include pandas-tfrecords, sklearn-pandas, Featuretools, Compose, Altair, Bokeh, Seaborn, qgrid, Spyder, pandas-datareader, PyDatastream, Geopandas, Blaze, Koalas, etc.
For visualization, matplotlib can directly read the DataFrame type and create plots. For machine learning, scikit-learn can work with a DataFrame. For example, dataset loaders such as fetch_openml return a DataFrame when called with the argument as_frame=True.
IPython understands and displays Pandas datasets in a more user-friendly manner. Via package ipywidgets, we can interactively explore Pandas data frames within a Jupyter Notebook environment.
statsmodels is a package for econometrics and statistical modelling. This package uses Pandas data structures. It has historical links with Pandas development.
Pandas runs on a single CPU core. With Dask, we can parallelize the computation on multiple cores or a cluster of machines. Dask offers dask.dataframe, which is a composite of many Pandas DataFrames. Dask APIs are also similar to Pandas APIs.
With RAPIDS CuDF, Pandas-like DataFrames can run on GPUs. Computations run in parallel on many GPU cores.
-
What are some criticisms of Pandas? Memory management in Pandas could be better. As a rule of thumb, Pandas requires 5-10x as much RAM as the dataset size. There's no native support for multicore execution. There's no support for memory mapping. It's therefore easy to copy an entire dataset by accident when analytics is done only on a small part of it. Support for categorical data could be better. Appending data to a DataFrame is slow. It doesn't have an SQL-like query processing layer.
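Two of these concerns can be examined directly. The sketch below measures the memory of a made-up text column before and after conversion to the category type, and shows one common way to avoid slow row-by-row appends. Both are mitigations offered as assumptions, not fixes prescribed by the critics.

```python
import pandas as pd

# Hypothetical column with few distinct values repeated many times.
labels = pd.Series(["low", "medium", "high"] * 10_000)

# deep=True asks Pandas to count the stored strings too, not just references.
bytes_as_text = labels.memory_usage(deep=True)
bytes_as_category = labels.astype("category").memory_usage(deep=True)

# Growing a DataFrame piece by piece copies data each time. Collecting the
# pieces in a list and concatenating once avoids the repeated copies.
pieces = [pd.DataFrame({"batch": [i], "value": [i * i]}) for i in range(5)]
combined = pd.concat(pieces, ignore_index=True)
```

The category type stores each distinct label once plus a small integer code per row, which is why it tends to be smaller for repetitive columns.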
Pandas is too tightly coupled to NumPy. For example, an entire DataFrame column must be stored in the same NumPy array. This frequently results in doubling memory requirements and additional computation.
Pandas can't read multiple rows in parallel from a CSV file. Likewise, it can't read multiple CSV files in parallel. It doesn't support SQL-like conditional joins.
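A conditional join can still be emulated, at a cost. One workaround, sketched below with invented orders and tiers tables, is to build the full cross product and then filter it. This assumes Pandas 1.2 or later, where merge accepts how="cross". The intermediate table has one row per pair, so this is only practical for small inputs.

```python
import pandas as pd

# Hypothetical tables: orders, and tiers defined by amount ranges.
orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [50, 150, 500]})
tiers = pd.DataFrame(
    {
        "tier": ["small", "medium", "large"],
        "low": [0, 100, 400],
        "high": [100, 400, 1000],
    }
)

# Step 1: pair every order with every tier (a Cartesian product).
pairs = orders.merge(tiers, how="cross")

# Step 2: keep only the pairs that satisfy the join condition
# low <= amount < high, as an SQL "JOIN ... ON" clause would.
in_range = (pairs["amount"] >= pairs["low"]) & (pairs["amount"] < pairs["high"])
result = pairs.loc[in_range, ["order_id", "amount", "tier"]].reset_index(drop=True)
```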
Some of the limitations noted above are solved by other libraries such as PandasSQL, PySpark, Terality, Vaex, DataTable, Dask and CuDF.
-
Which are some online resources to learn Pandas? The official Pandas documentation is the place for the installation guide, user guide, API reference, tutorials, and more. Beginners can start with the 10 minutes to pandas user guide. Getting Started tutorials and tutorials from the community are two useful resources.
DataCamp's Pandas Cheatsheet and Irv Lustig's Pandas Cheatsheet are two useful references.
Wes McKinney's book titled Python for Data Analysis is recommended.
R developers who wish to learn Pandas can start by looking at how R operations map to Pandas.
Developers who wish to contribute to the Pandas project can visit its GitHub page.
Milestones
2010
McKinney notes at the SciPy Conference that other Python packages have appeared recently, offering some of the features of Pandas: Ia, Tab, pydataframe. NumPy and SciPy have made Python more accessible to the scientific community. For statistical modelling, there's StaM, PyMC and SciL. However, there's no cohesive framework for statistical modelling. Statisticians continue to prefer R. Pandas is an attempt to change this.
2013
McKinney, the creator of Pandas, gives a presentation with a sub-title 10 Things I Hate About pandas. To address some of these, he shares his work on a new tool named Badger. It has a consistent type system, compressed columnar binary storage, and an analytical query processor. By late 2015, these ideas result in the formation of Apache Arrow.
2015
2017
In October 2017, StackOverflow records 5 million visits to questions on Pandas from more than 1 million unique visitors. Many companies use Pandas for data analysis including Google, Facebook and JP Morgan. The rising popularity of Python itself from 2012 onwards has been attributed to the adoption of Pandas by data scientists.
2019
2020
Pandas 1.0.0 is released. It requires at least Python 3.6.1. On an experimental basis, data type NA is introduced to represent missing data. Data types SparseSeries and SparseDataFrame are replaced with Series and DataFrame with sparse values option. Some APIs are deprecated and others are backward incompatible. Pandas 1.0.0 Release Notes has full details.
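A brief sketch of these two changes follows, assuming Pandas 1.0 or later. The values are made up, and the nullable "Int64" type is used here only as one place where NA shows up.

```python
import pandas as pd

# NA with the nullable integer type "Int64" (note the capital I).
ages = pd.Series([25, None, 40], dtype="Int64")
missing_mask = ages.isna()
total_age = ages.sum()  # missing values are skipped by default

# A regular Series holding sparse values, in place of SparseSeries.
sparse_series = pd.Series(pd.arrays.SparseArray([0, 0, 0, 5, 0]))
density = sparse_series.sparse.density  # fraction of values actually stored
```

With integer input, the sparse array treats 0 as the fill value, so only the single non-zero entry is stored.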
References
- Agarwal, Rahul. 2020. "Minimal Pandas Subset for Data Scientists on GPU." Towards Data Science, on Medium, January 14. Accessed 2021-01-02.
- Anthony, Femi. 2015. "Benefits of using pandas." In: Mastering pandas, Packt Publishing Limited, June. Accessed 2021-01-02.
- Asirinaidu P. 2017. "Pandas for Data Analysis and their Benefits." Towards Data Science, on Medium, June 29. Accessed 2021-01-02.
- Chawla, Avi. 2022. "5 Things I Wish the Pandas Library Could Do." Towards Data Science, on Medium, August 29. Accessed 2022-11-07.
- Fumo, David. 2017. "Pandas Library in a Nutshell — Intro To Machine Learning #3." Simple AI, on Medium, January 29. Accessed 2021-01-02.
- Gewalt, Rainer. 2021. "NumPy vs Pandas – Which is used When?" Blog, Fly Spaceships With Your Mind, March 13. Accessed 2022-11-07.
- Helfrich, Gina. 2015. "NumFOCUS Announces New Fiscally Sponsored Project: pandas." Blog, NumFOCUS, October 9. Accessed 2021-01-02.
- Hirst, Tony. 2016. "Simple Interactive View Controls for pandas DataFrames Using IPython Widgets in Jupyter Notebooks." Blog, OUseful.Info, December 29. Accessed 2021-01-05.
- Kopf, Dan. 2017. "Meet the man behind the most important tool in data science." Quartz, December 8. Accessed 2021-01-05.
- Lockhart, Brandon. 2020. "Exploratory Data Analysis: DataPrep.eda vs Pandas-Profiling." Towards Data Science, on Medium, May 7. Accessed 2021-01-02.
- Matplotlib GitHub. 2015. "What's new in matplotlib." Matplotlib, v1.5, on GitHub, October 29. Accessed 2021-01-05.
- McKinney, Wes. 2010. "Data Structures for Statistical Computing in Python." Proc. of the 9th Python in Science Conf., pp. 56-61, June 28 - July 3. Accessed 2021-01-02.
- McKinney, Wes. 2013. "Practical Medium Data Analytics with Python." PyData, NYC, November 8-10. Accessed 2021-01-02.
- McKinney, Wes. 2017. "Apache Arrow and the 10 Things I Hate About pandas." Blog, September 21. Accessed 2021-01-02.
- McKinney, Wes. 2019. "Wes McKinney: pandas in 10 minutes." PyData, on YouTube, September 17. Accessed 2021-01-02.
- NumFOCUS. 2021. "Sponsored Projects." NumFOCUS. Accessed 2021-01-02.
- Orac, Roman. 2020. "Are you still using Pandas for big data?" Towards Data Science, on Medium, April 27. Accessed 2021-01-02.
- Pandas. 2021. "Getting Started." Accessed 2021-01-02.
- Pandas Profiling. 2022. "Examples: Census Dataset." Docs, Pandas Profiling, v3.4.0, October 20. Accessed 2022-11-07.
- PyData. 2018a. "Installation." Pandas 0.23.1 documentation, June. Accessed 2021-01-05.
- PyData. 2018b. "What's new in 0.23.4." Release Notes, Pandas, August 3. Updated 2019-08-01. Accessed 2021-01-05.
- PyData. 2019. "Data Wrangling with pandas: Cheat Sheet." Pandas, February 11. Accessed 2021-01-02.
- PyData. 2020a. "Comparison with R / R libraries." Docs, Pandas, October 5. Accessed 2021-01-02.
- PyData. 2020b. "10 minutes to pandas." Docs, Pandas, December 29. Accessed 2021-01-02.
- PyData. 2020c. "pandas ecosystem." Docs, Pandas, December 28. Accessed 2021-01-02.
- PyData. 2020d. "Community tutorials." Docs, Pandas, August 22. Accessed 2021-01-02.
- PyData. 2020e. "Scaling to large datasets." Docs, Pandas, October 5. Accessed 2021-01-02.
- PyData. 2020f. "Input/output." Docs, Pandas, December 19. Accessed 2021-01-05.
- PyData. 2021. "Citing and logo." Pandas. Accessed 2021-01-02.
- Singh, Pavneet. 2019. "Data Wrangling with Pandas." Pluralsight, March 19. Accessed 2021-01-02.
- Walker, Jennifer. 2019. "Tutorial: Time Series Analysis with Pandas." Blog, Dataquest, January 10. Accessed 2021-01-02.
- Wang, Jiahui. 2019. "Python List, NumPy, and Pandas." Accessed 2021-01-02.
- Willems, Karijn. 2016. "Pandas Cheat Sheet for Data Science in Python." DataCamp, November 2. Accessed 2021-01-02.
- Willems, Karijn. 2017. "NumPy Cheat Sheet: Data Analysis in Python." DataCamp, January 17. Accessed 2021-01-02.
- Yegulalp, Serdar. 2020. "Pandas 1.0 brings big breaking changes." InfoWorld, January 10. Accessed 2021-01-02.
- scikit-learn. 2020. "sklearn.datasets.fetch_openml." v0.24.0, scikit-learn, December. Accessed 2021-01-05.
Further Reading
- McKinney, Wes. 2010. "Data Structures for Statistical Computing in Python." Proc. of the 9th Python in Science Conf., pp. 56-61, June 28 - July 3. Accessed 2021-01-02.
- Solomon, Brad. 2018. "Python Pandas: Tricks & Features You May Not Know." Real Python, August 29. Updated 2020-12-12. Accessed 2021-01-02.
- Seif, George. 2018. "23 great Pandas codes for Data Scientists." Towards Data Science, on Medium, August 22. Accessed 2021-01-02.
- Seif, George. 2019. "5 Advanced Features of Pandas and How to Use Them." KDNuggets, October. Accessed 2021-01-02.
- Data School. 2016. "Easier data analysis in Python with pandas (video series)." Data School. Accessed 2021-01-02.
See Also
- Pandas Data Structures
- NumPy
- SciPy
- Python for Scientific Computing
- Data Analysis
- Data Pipeline