Pandas

Article Info

Contributed by
1 author

Last updated on
2022-11-07 04:18:24

Pandas Data Structures
NumPy
SciPy
Python for Scientific Computing
Data Analysis
Data Pipeline

Pandas logo. Source: Pandas 2021.

Pandas is a Python package that enables in-memory data manipulation and analysis. It offers many ways to read and store data. It can inspect, clean, filter, merge, combine or transform data to suit the needs of analysis. By the mid-2010s, it had become an essential tool in the data scientist's toolkit.

Pandas is built on another popular Python package called NumPy. NumPy and Pandas data structures have become common formats that many Python packages tend to support.

Pandas is an open-source BSD-licensed project sponsored by NumFOCUS. Pandas is often seen as Python's answer to similar capabilities in the R language.

Discussion

Why do I need Pandas when there's NumPy?
Comparing Python list, NumPy ndarray and Pandas DataFrame. Source: Wang 2019.
NumPy has multi-dimensional arrays that are optimized for numerical computation. NumPy supports element-wise mathematical operations on arrays (addition, division, cosine) and dot product of arrays. Arrays can be transposed, combined, sliced and indexed in flexible ways. Array content can be aggregated (sum, min, max, mean).
Pandas is built on top of NumPy. It can therefore do everything that NumPy can do. In fact, it's easy to convert between Pandas and NumPy data structures, both of which can be used within the same codebase. In addition, Pandas offers methods that simplify data analysis. NumPy isn't flexible or easy to use for statistical work. A NumPy array must contain values of the same type, which is not common in real-world data.
Pandas is better suited for tabular data. Consider data stored in a spreadsheet or a database. It's typical to give unique labels to rows and columns. Accessing data by these labels makes it easier to write and maintain code. Pandas also has much better support for reading from or writing to CSV/JSON/Excel/HDF5/HTML files and SQL databases such as MySQL.
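As a minimal sketch of the differences described above, the snippet below contrasts a NumPy array with a DataFrame and converts between the two. The column names, row labels and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype. Mixing numbers and text
# forces every element to become a string.
mixed = np.array([1, 2.5, "three"])

# A DataFrame keeps one dtype per column and labels rows and columns.
# The labels and values here are illustrative only.
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "temp_c": [4.0, 19.5, 31.0]},
    index=["a", "b", "c"],
)

# Access a value by row and column label rather than by position.
lima_temp = df.loc["b", "temp_c"]

# Converting between the two libraries is one call each way.
temps = df["temp_c"].to_numpy()          # Pandas -> NumPy
back = pd.Series(temps, index=df.index)  # NumPy -> Pandas
```

Because the conversion is cheap to write, NumPy routines can be applied to a numeric column while the labels stay in Pandas.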
Could you describe some use cases of Pandas?
Time series analysis using Pandas and Matplotlib. Source: Walker 2019.
We can use Pandas for data inspection and profiling. We can view a small sample of the dataset. Data is described using basic statistics: mean, median, min, max, etc. Profiling can point out missing or duplicate values, or correlations among variables. Pandas Profiling is a package that does this automatically.
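The inspection steps above might look like this in practice. This is a small sketch on an invented dataset; the column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value and one duplicated row.
df = pd.DataFrame({
    "age": [34, 29, np.nan, 29, 41],
    "income": [52000, 48000, 61000, 48000, 75000],
})

sample = df.head(3)              # view a small sample
stats = df.describe()            # count, mean, std, min, quartiles, max
missing = df.isna().sum()        # missing values per column
n_dupes = df.duplicated().sum()  # rows that repeat an earlier row
corr = df.corr()                 # pairwise correlation of numeric columns
```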
Another task for a data scientist is Exploratory Data Analysis (EDA). This is done to gain insights into the data and identify possible input features before any modelling is done. Pandas fits the interactive and iterative nature of EDA nicely. DataPrep.eda is an EDA tool built on top of Pandas.
Before training a machine learning model, Pandas can simplify data preparation. For example, categorical labels are converted to numerical values. Missing values are replaced with mean or median values. The predicted variable is separated from the rest of the dataset. Data are merged, sorted, grouped, filtered, reshaped, etc.
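A possible sketch of these preparation steps follows. The dataset and its column names (colour, size_cm, sold) are hypothetical, and integer category codes are only one of several ways to encode labels.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: "sold" is the variable to be predicted.
df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green"],
    "size_cm": [10.0, np.nan, 14.0, 12.0],
    "sold": [1, 0, 1, 0],
})

# Replace missing values with the column median.
df["size_cm"] = df["size_cm"].fillna(df["size_cm"].median())

# Convert categorical labels to integer codes
# (categories are ordered alphabetically: blue, green, red).
df["colour_code"] = df["colour"].astype("category").cat.codes

# Separate the predicted variable from the input features.
y = df["sold"]
X = df.drop(columns=["sold", "colour"])
```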
Time Series Analysis (TSA) is another use case of Pandas. Pandas has data types to handle date/time values. It can do time-based indexing, time zone handling, resampling, rolling windows, and trend analysis.
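The time series capabilities listed above can be sketched on a tiny daily series. The dates, values and the target time zone are arbitrary choices for illustration.

```python
import pandas as pd

# Six daily observations with a time-zone-aware index.
idx = pd.date_range("2021-01-01", periods=6, freq="D", tz="UTC")
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Time-based indexing: slice by date strings (both ends inclusive).
window = ts.loc["2021-01-02":"2021-01-03"]

# Resampling: aggregate daily values into two-day bins.
two_day = ts.resample("2D").sum()

# Rolling window: three-day moving average.
rolling = ts.rolling(window=3).mean()

# Time zone handling: view the same instants in another zone.
local = ts.tz_convert("Asia/Kolkata")
```

The first two values of the rolling mean are missing because a full three-day window is not yet available at those points.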
Which are the main features of Pandas?
An overview of Pandas features. Source: Gewalt 2021.
We note the following important features:
- Data Structures: Inspired by R's data.frame, DataFrame is a good fit for storing and manipulating tabular data. This includes indices on rows and labels on columns. There's also Series (1D vectors). Panel (3D tables) was deprecated and then removed in version 0.25.
- Input/Output: Pandas can read/write data between memory and various file formats. For example, reading from a CSV file is just two lines of code, whereas the equivalent in Java might need about 30 lines.
- Indexing: Data can be indexed in many ways. Hierarchical indexing allows intuitive access to high-dimensional data.
- Manipulation: Pandas can reshape or pivot data. It's easy to add or remove columns. Different datasets can be merged or automatically aligned based on the indices. Aggregations and even more sophisticated "group by" operations common with SQL database tables are possible.
- Missing Data: These can be either ignored, dropped or replaced depending on the operation. There are methods to detect the presence of missing data.
- Time-Series Data: There's support for date range generation, frequency conversion, moving windows statistics, date shifting and lagging.
- Optimization: Critical code paths are optimized in Cython or C.
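Several of the features listed above can be seen together in a short sketch. The sales and managers tables below are invented for illustration, as are all names in them.

```python
import pandas as pd

# Illustrative sales table.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "year": [2020, 2021, 2020, 2021],
    "units": [10, 12, 7, 9],
})

# "Group by" aggregation, similar to SQL's GROUP BY.
per_region = sales.groupby("region")["units"].sum()

# Hierarchical indexing: address a value by (region, year).
indexed = sales.set_index(["region", "year"])
north_2021 = indexed.loc[("north", 2021), "units"]

# Reshaping: pivot years into columns.
wide = sales.pivot(index="region", columns="year", values="units")

# Merging two datasets on a shared column.
managers = pd.DataFrame({"region": ["north", "south"],
                         "manager": ["Asha", "Ben"]})
merged = sales.merge(managers, on="region")

# Automatic alignment: labels present in only one operand
# produce missing values in the result.
a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "z"])
total = a + b
```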
How well is Pandas supported by other Python packages and tools?
Explore Pandas dataframes interactively using IPython widgets. Source: Hirst 2016.
Many well-known Python packages and tools understand or even specialize Pandas DataFrame and Series. Some of these include pandas-tfrecords, sklearn-pandas, Featuretools, Compose, Altair, Bokeh, Seaborn, qgrid, Spyder, pandas-datareader, PyDatastream, Geopandas, Blaze, Koalas, etc.
For visualization, matplotlib can directly read DataFrame type and create plots. For machine learning, scikit-learn estimators accept a DataFrame as input, and its dataset loaders can return one when called with the argument as_frame=True.
IPython understands and displays Pandas datasets in a more user-friendly manner. Via package ipywidgets, we can interactively explore Pandas data frames within a Jupyter Notebook environment.
statsmodels is a package for econometrics and statistical modelling. This package uses Pandas data structures. It has historical links with Pandas development.
Pandas runs on a single CPU core. With Dask, we can parallelize the computation on multiple cores or a cluster of machines. Dask offers dask.dataframe that's a composite of many Pandas DataFrames. Dask APIs are also similar to Pandas APIs.
With RAPIDS CuDF, Pandas DataFrames can run on GPUs. Computations run in parallel on many GPU cores.
What are some criticisms of Pandas?
Memory management in Pandas could be better. As a rule of thumb, Pandas requires 5-10x as much RAM as the dataset size. There's no native support for multicore execution. There's no support for memory mapping. It's therefore easy to copy an entire dataset by accident when analytics is done only on a small part of it. Support for categorical data could be better. Appending data to a DataFrame is slow. It doesn't have an SQL-like query processing layer.
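Two of these points can be worked around in plain Pandas. The sketch below measures the memory used by a repetitive text column before and after converting it to a categorical type, and builds a DataFrame with a single concatenation instead of repeated appends. The data is synthetic, and actual savings depend on the data and the Pandas version.

```python
import pandas as pd

# A text column with only three distinct values, repeated many times.
df = pd.DataFrame({"state": ["KA", "TN", "KL", "KA"] * 25_000})

# deep=True includes the memory held by the strings themselves.
as_text = df["state"].memory_usage(deep=True)
as_cat = df["state"].astype("category").memory_usage(deep=True)

# Appending row by row copies the data each time. Collecting the
# pieces in a list and concatenating once avoids the repeated copies.
pieces = [pd.DataFrame({"n": [i]}) for i in range(100)]
combined = pd.concat(pieces, ignore_index=True)
```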
Pandas is too tightly coupled to NumPy. For example, an entire DataFrame column must be stored in the same NumPy array. This frequently results in doubling memory requirements and additional computation.
Pandas can't read multiple rows in parallel from a CSV file. Likewise, it can't read multiple CSV files in parallel. It doesn't support SQL-like conditional joins.
Some of the limitations noted above are solved by other libraries such as PandasSQL, PySpark, Terality, Vaex, DataTable, Dask and CuDF.
Which are some online resources to learn Pandas?
10-minute introduction to Pandas. Source: McKinney 2019.
The official Pandas documentation is the place for installation guide, user guide, API reference, tutorials, and more. Beginners can start learning from the 10 minutes to pandas user guide. Getting Started tutorials and tutorials from the community are two useful resources.
DataCamp's Pandas Cheatsheet and Irv Lustig's Pandas Cheatsheet are two useful references.
Wes McKinney's book titled Python for Data Analysis is recommended.
R developers who wish to learn Pandas can start by looking at how R operations map to Pandas.
Developers who wish to contribute to the Pandas project can visit its GitHub page.

Milestones

2009

Pandas is open sourced in late 2009. Its creator, Wes McKinney, had started working on the project in April 2008. His intent is to make data analysis easier for Python programmers and even those who're not expert programmers.

2010

McKinney notes at the SciPy Conference that other Python packages have appeared recently, offering some of the features of Pandas: la, tabular, pydataframe. NumPy and SciPy have made Python more accessible to the scientific community. For statistical modelling, there's statsmodels, PyMC and scikits.learn. However, there's no cohesive framework for statistical modelling. Statisticians continue to prefer R. Pandas is an attempt to change this.

Nov 2013

McKinney, the creator of Pandas, gives a presentation with a sub-title 10 Things I Hate About pandas. To address some of these, he shares his work on a new tool named Badger. It has a consistent type system, compressed columnar binary storage, and an analytical query processor. By late 2015, these ideas result in the formation of Apache Arrow.

Oct 2015

Pandas becomes a fiscally sponsored project of NumFOCUS. With this, Pandas joins other Python projects that are also sponsored by NumFOCUS: NumPy, Matplotlib, Project Jupyter, and IPython.

Oct 2017

Interest in Pandas on StackOverflow. Source: Kopf 2017.

In October 2017, StackOverflow records 5 million visits to questions on Pandas from more than 1 million unique visitors. Many companies use Pandas for data analysis including Google, Facebook and JP Morgan. The rising popularity of Python itself from 2012 onwards has been attributed to the adoption of Pandas by data scientists.

Jan 2019

From January 2019, all new feature releases of Pandas support only Python 3. The 0.24.x series, released in early 2019, is the last to support Python 2.

Jan 2020

Pandas 1.0.0 is released. It requires at least Python 3.6.1. On an experimental basis, the value pd.NA is introduced to represent missing data. Data types SparseSeries and SparseDataFrame are removed in favour of Series and DataFrame holding sparse values. Some APIs are deprecated and others are backward incompatible. Pandas 1.0.0 Release Notes has full details.
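A brief sketch of these two changes, assuming Pandas 1.0 or later is installed. The values are arbitrary, and since pd.NA was introduced as experimental its behaviour may differ between versions.

```python
import pandas as pd

# Nullable integer column: the missing entry is stored as pd.NA,
# so the remaining values stay integers instead of becoming floats.
s = pd.Series([1, None, 3], dtype="Int64")
missing_is_na = s.iloc[1] is pd.NA
total = s.sum()  # missing values are skipped by default

# Sparse values live in an ordinary Series with a sparse dtype.
sparse = pd.Series([0, 0, 5, 0],
                   dtype=pd.SparseDtype("int64", fill_value=0))
density = sparse.sparse.density  # fraction of values actually stored
```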


Cite As

Devopedia. 2022. "Pandas." Version 5, November 7. Accessed 2024-06-25. https://devopedia.org/pandas
