Pandas Data Structures

The two main data structures in Pandas are Series for 1-D data and DataFrame for 2-D data. Data in higher dimensions are supported within DataFrame using a concept called hierarchical indexing. For storing axis labels of Series and DataFrame, the data structure used is Index. These data structures can be created from Python or NumPy data structures.

Pandas data structures store data using NumPy or Pandas data types. Pandas defines its own data types where NumPy data types don't cover a use case. These Pandas data types are also called extension types: each derives from the ExtensionDtype base class and stores its values in an ExtensionArray. Developers can subclass these to create custom data types.

Pandas includes methods to convert data structures from Pandas to Python or NumPy. There's also implicit or explicit conversion of data types since a Series object or a DataFrame column can store values of only one data type.

Discussion

  • Could you give example usage of Pandas Series and DataFrame?
    Dataset pnwflights14 stored as a DataFrame. Source: Pathak 2020.

    Consider the pnwflights14 dataset from 2014. It captures 162,049 flights departing from two major airports, Seattle and Portland. A flight is characterized by date and time of departure, arrival/departure delays, carrier, origin, destination, distance, airtime, etc. In a Pandas DataFrame, each flight becomes a row and each attribute becomes a labelled column.

    For analysis, it's possible to extract only one column from the DataFrame. The result is a Series data structure. For example, given DataFrame df, we can write df['air_time'] or df.air_time. Thus, DataFrame is for 2-D data and Series is for 1-D data. We can also extract multiple columns (such as df[['origin','dest']]) resulting in another DataFrame.

    It's possible to extract specific rows for analysis. For example, if we wish to analyze only the flights originating from Seattle, we can write df[df.origin=='SEA']. The result is another DataFrame. If we access a single row (such as df.iloc[0] or, when the index has a label 0, df.loc[0]), we get a Series data structure.

    In terms of data types, this dataset has a mix of types: integer, float, datetime, string, etc. However, each column has only one type.
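
    The selections described above can be sketched on a tiny, made-up stand-in for the flights data. The rows and values below are illustrative assumptions, not taken from pnwflights14; only the column names follow the description.

```python
import pandas as pd

# Hypothetical miniature of the flights table; values are invented.
df = pd.DataFrame({
    "origin": ["SEA", "PDX", "SEA"],
    "dest": ["LAX", "SFO", "DEN"],
    "air_time": [135.0, 90.0, 140.0],
    "distance": [954, 550, 1024],
})

air_time = df["air_time"]              # one column -> Series
routes = df[["origin", "dest"]]        # several columns -> DataFrame
from_seattle = df[df.origin == "SEA"]  # boolean row filter -> DataFrame
first_row = df.iloc[0]                 # one row -> Series

print(type(air_time).__name__)      # Series
print(type(routes).__name__)        # DataFrame
print(len(from_seattle))            # 2
print(type(first_row).__name__)     # Series
print(df.dtypes)                    # one dtype per column
```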

  • What's the anatomy of a Pandas DataFrame?
    Anatomy of a Pandas DataFrame. Source: Birchard 2017.

    A Pandas DataFrame represents tabular data, that is, data in rows and columns. All values in a column must have the same data type. Each row can be thought of as representing an entity or thing, with its attributes represented in the columns.

    By default, rows and columns are indexed with integers, starting from zero. For convenience, it's more common to give labels or descriptive names, particularly to columns. Thus, rather than writing df[0] we can write df['homeTeamName']. Duplicate labels are allowed but can be prevented with method set_flags(allows_duplicate_labels=False).

    We can extract a single column, resulting in a Series. A row can also be extracted into a Series but since columns are likely to have different data types, the type of the resulting Series would be object, a generic type.

    The direction along which we traverse a DataFrame is called an axis. When moving from one row to the next, we're on axis 0. When moving from one column to the next, we're on axis 1.
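
    A minimal sketch of these ideas follows. The frames and their values are hypothetical; only the column name homeTeamName comes from the text above. It shows that a row with mixed column types comes out as an object Series, how axis 0 and axis 1 affect an aggregation, and how duplicate labels can be disallowed.

```python
import pandas as pd

# Hypothetical match data; values are invented for illustration.
df = pd.DataFrame({
    "homeTeamName": ["Leeds", "Ajax"],
    "homeGoals": [2, 1],
})

row = df.iloc[0]        # text and integer columns mixed in one row
col = df["homeGoals"]   # a single column keeps its own dtype
print(row.dtype)        # object
print(col.dtype)

# A purely numeric frame to show the two axes.
scores = pd.DataFrame({"home": [2, 1], "away": [0, 3]})
col_sums = scores.sum(axis=0)  # move down the rows: one total per column
row_sums = scores.sum(axis=1)  # move across the columns: one total per row
print(col_sums.tolist())       # [3, 3]
print(row_sums.tolist())       # [2, 4]

# Returns a new object on which duplicate labels raise an error.
strict = df.set_flags(allows_duplicate_labels=False)
print(strict.flags.allows_duplicate_labels)  # False
```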

  • What are the different data types supported in Pandas?
    Different Pandas data types. Source: PyData 2020j.

    Pandas mostly uses NumPy arrays and dtypes. These are float, int, bool, datetime64[ns], and timedelta64[ns]. Unlike NumPy, Pandas also supports timezone-aware datetimes, through DatetimeTZDtype. For numerical values, defaults are int64 and float64.

    Data types particular to Pandas are BooleanDtype, CategoricalDtype, DatetimeTZDtype, Int64Dtype (plus other size and signed/unsigned variants), IntervalDtype, PeriodDtype, SparseDtype, StringDtype, Float32Dtype, and Float64Dtype.

    When reading a data source, Pandas infers the data type for each column. Sometimes we may want to use a different type. For example, 'Customer ID' may be stored as float64 but we want int64; 'Month' may be object but we want datetime64[ns]. For type conversions, we can use the method astype() or the top-level functions pd.to_numeric() and pd.to_datetime(). There's also the convert_dtypes() method, which converts columns to the best possible types that support pd.NA.
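
    As a rough illustration of these conversions, consider the sketch below. The frame, including the 'Amount' column and all values, is a hypothetical example rather than data from the source.

```python
import pandas as pd

# Hypothetical frame whose inferred types are not the ones we want.
df = pd.DataFrame({
    "Customer ID": [1001.0, 1002.0, 1003.0],   # float64, but really integers
    "Month": ["2020-01", "2020-02", "2020-03"],  # text, but really dates
    "Amount": ["12.5", "7", "n/a"],            # text with one bad entry
})

df["Customer ID"] = df["Customer ID"].astype("int64")
df["Month"] = pd.to_datetime(df["Month"], format="%Y-%m")
# errors="coerce" turns unparseable entries such as "n/a" into NaN.
df["Amount"] = pd.to_numeric(df["Amount"], errors="coerce")
print(df.dtypes)

# convert_dtypes() picks nullable extension types where it can.
inferred = pd.DataFrame({"a": [1, 2, None]}).convert_dtypes()
print(inferred.dtypes)
```

    Passing errors="coerce" is one design choice among several; the default behaviour raises an exception on the first value that can't be parsed.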

  • What use is string type when there's already object type?

    Consider the following examples:

    • pd.Series(['red', 'green', 'blue']): object data type is used by default
    • pd.Series(['red', 'green', 'blue'], dtype='string'): string data type is used

    The object type is more general. A DataFrame column with values of different types is collectively given the object type. In contrast, string type gives developers clarity on the type of values stored. It also prevents accidentally storing strings and non-strings together. Pandas string type, which is an alias for StringDtype, is also more "closely aligned" to Python's own str class. The underlying storage is arrays.StringArray.

    In a DataFrame, columns of string type can be more easily separated from columns of object type. This is possible by calling the method select_dtypes(include='string').

    As of Pandas 1.2, there's no difference between the two types in terms of performance, though future versions of Pandas might implement optimizations for string type.
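
    The following sketch, using made-up column names and values, shows how a string column can be picked out from an object column with select_dtypes().

```python
import pandas as pd

# Hypothetical frame: one true text column, one column of mixed values.
df = pd.DataFrame({
    "colour": pd.Series(["red", "green", "blue"], dtype="string"),
    "mixed": pd.Series(["red", 1, 2.5], dtype="object"),
})
print(df.dtypes)

# Only the column declared with the string dtype is selected.
text_cols = df.select_dtypes(include="string").columns.tolist()
print(text_cols)  # ['colour']

# String methods are available through the .str accessor.
upper = df["colour"].str.upper()
print(upper.tolist())  # ['RED', 'GREEN', 'BLUE']
```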

  • What are sparse data types and why do they matter?

    Both Series and DataFrame can be used to store sparse data. Sparse doesn't necessarily mean "mostly zero" values. Rather, sparse data structures can be seen as a memory optimization technique. Any value can be chosen as the fill value, and occurrences of that value are not stored. Even so, these data structures behave the same way as their dense counterparts.

    The underlying storage for sparse data structures is an ExtensionArray called arrays.SparseArray. The fill value, which is not stored in sparse arrays, can be specified using argument fill_value. Sparse-specific functionality is exposed through the .sparse accessor: property density gives the proportion of stored (non-fill) values, and method to_dense() yields the dense equivalent. Methods from_spmatrix() and to_coo() on the same accessor help in interfacing with SciPy sparse matrices.

    As an example, consider the storage size of these two, using df.memory_usage().sum():

    • df = pd.DataFrame([1, 2, 2, 2]): 160 bytes
    • df = pd.DataFrame([1, 2, 2, 2], dtype=pd.SparseDtype("float", 2)): 140 bytes
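
    The comparison above can be reproduced with a sketch like the one below. It converts with astype() rather than passing dtype to the constructor, which is an assumption made for illustration; exact byte counts may vary by Pandas version and platform, so only the relative sizes are compared.

```python
import pandas as pd

dense = pd.DataFrame([1, 2, 2, 2])
# Treat 2 as the fill value: only the single value 1 is actually stored.
sparse = dense.astype(pd.SparseDtype("float", 2))

dense_bytes = dense.memory_usage().sum()
sparse_bytes = sparse.memory_usage().sum()
print(dense_bytes, sparse_bytes)

# One stored value out of four.
print(sparse.sparse.density)  # 0.25

# Convert back to an ordinary dense frame.
dense_again = sparse.sparse.to_dense()
print(dense_again[0].tolist())  # [1.0, 2.0, 2.0, 2.0]
```
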
  • What data types deal with missing values in Pandas?
    Showing how Pandas 1.1.5 handles missing values for different types. Source: Devopedia 2021.

    For speed and convenience, it's useful to detect missing values. Traditionally, Pandas has used NumPy np.nan (displayed as NaN) to represent missing data. For example, Python None values will result in NaN in Pandas.

    However, NaN is not used for all data types. It's used in a float context. For datetime objects, NaT is used, which behaves much like NaN. When Python None is converted to a plain Python string (for example, with astype(str)), it becomes the string "None", that is, it's not treated as a missing value. A list of integers with some None values can't be stored in a Series of NumPy integer type: by default the values are upcast to float64, and asking for dtype int64 explicitly raises an error.

    To provide more uniform handling of missing values, since Pandas 1.0.0, there's pd.NA (displayed as <NA>) value. This is now used for missing values in string, boolean and Int64 (and variants) types.

    Inserting missing values into a Pandas data structure will trigger the suitable conversion. Infinite values -inf and inf can be treated as missing values by setting pandas.options.mode.use_inf_as_na = True, although this option is deprecated in later Pandas releases.

    DataFrame/Series methods fillna(), dropna(), isna(), notna() and interpolate() are useful.
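
    A short sketch of how the missing-value marker depends on the data type, along with some of the methods just listed. All values are made up for illustration.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.5, None])                      # None becomes NaN
ints = pd.Series([1, None])                          # upcast to float64
nullable_ints = pd.Series([1, None], dtype="Int64")  # None becomes <NA>
strings = pd.Series(["a", None], dtype="string")     # None becomes <NA>
dates = pd.Series(pd.to_datetime(["2021-01-09", None]))  # None becomes NaT

print(ints.dtype)                  # float64
print(nullable_ints[1] is pd.NA)   # True
print(dates.isna().tolist())       # [False, True]

# Common ways to handle the gaps.
filled = nullable_ints.fillna(0)
dropped = floats.dropna()
interpolated = pd.Series([1.0, np.nan, 3.0]).interpolate()
print(filled.tolist())         # [1, 0]
print(len(dropped))            # 1
print(interpolated.tolist())   # [1.0, 2.0, 3.0]
```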

  • How do I convert Pandas data structures to/from NumPy or Python equivalents?
    Converting from Pandas Series to Python or NumPy. Source: Wang 2019.

    Input to the Series constructor can be a Python list or NumPy array. Thus, we can write pandas.Series([1,2]) or pandas.Series(numpy.array([1,2,3])). The constructor also accepts a tuple, a dictionary or a scalar value. For a dictionary input, the keys become the Series index.

    Series methods tolist() and to_numpy() return Python list and NumPy ndarray respectively. Another way to get a NumPy array is to access property Series.values but this is not recommended since for some data types it doesn't exactly return an ndarray.

    Another property Series.array returns the underlying ExtensionArray. This is Pandas' own array. If the data is of NumPy native type, this is a thin wrapper (no copy) over NumPy ndarray.

    Likewise, there are many ways to create a DataFrame: from lists, dictionaries, Series, another DataFrame, 1-D/2-D/structured NumPy arrays, or list of dictionaries. For a dictionary input, the keys become column labels. If keys are tuples, we create a multi-indexed frame. Useful methods for conversion include to_dict() and from_dict() (Python), and to_records() and from_records() (NumPy).
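
    These conversions can be sketched as follows. The variable names and values are hypothetical and chosen only to illustrate the round trips.

```python
import numpy as np
import pandas as pd

# Series from a dictionary: keys become the index.
s = pd.Series({"a": 1, "b": 2})
as_list = s.tolist()      # Python list
as_array = s.to_numpy()   # NumPy ndarray
print(as_list)            # [1, 2]
print(type(as_array).__name__)  # ndarray

# Series from a NumPy array.
from_numpy = pd.Series(np.array([1, 2, 3]))

# DataFrame from a dictionary: keys become column labels.
df = pd.DataFrame({"x": [1, 2], "y": [3.0, 4.0]})

# Round trip through a Python dictionary.
as_dict = df.to_dict()
from_dict = pd.DataFrame.from_dict(as_dict)

# Round trip through a NumPy record array (index left out).
records = df.to_records(index=False)
from_records = pd.DataFrame.from_records(records)
print(from_records.columns.tolist())  # ['x', 'y']

# Tuple keys produce multi-indexed columns.
multi = pd.DataFrame({("a", "x"): [1], ("a", "y"): [2]})
print(isinstance(multi.columns, pd.MultiIndex))  # True
```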

Milestones

May 2017

In Pandas 0.20.0 (joint release with 0.20.1), pandas.Panel is deprecated. The recommended way to represent 3-D data is to use MultiIndex on a DataFrame. Method to_frame() converts from Panel to DataFrame. Alternatively, method to_xarray() converts a Panel to a data structure of the xarray package.

Jan 2020

Pandas 1.0.0 is released. On an experimental basis, Pandas introduces pd.NA to represent missing scalar values. This is used by Int64Dtype (and variants), StringDtype and BooleanDtype.

May 2020

In Scikit-Learn, a popular Python library for machine learning, version 0.23.0 is released. Dataset loaders in this release now support loading data as a Pandas DataFrame using argument as_frame=True.

Dec 2020

Pandas 1.2.0 is released. This introduces Float32Dtype, Float64Dtype and FloatArray. While the default float uses np.nan for missing values, the new float types use pd.NA.

References

  1. Birchard, Todd. 2017. "Another 'Intro to Data Analysis in Python Using Pandas' Post." Hackers and Slackers, November 16. Updated 2020-12-10. Accessed 2021-01-09.
  2. Moffitt, Chris. 2018. "Overview of Pandas Data Types." Practical Business Python, March 26. Accessed 2021-01-09.
  3. Pathak, Manish. 2020. "Handling Categorical Data in Python." Tutorial, DataCamp, January 6. Accessed 2021-01-09.
  4. PyData. 2017. "Version 0.20.1." Release Notes, Pandas, v0.20.1, May 5. Accessed 2021-01-09.
  5. PyData. 2018. "pandas.Panel." API Reference, Pandas, v0.23.4, August. Accessed 2021-01-09.
  6. PyData. 2020a. "Intro to data structures." User Guide, Pandas, v1.2.0, November 26. Accessed 2021-01-09.
  7. PyData. 2020b. "Series." API Reference, Pandas, v1.2.0, December 2. Accessed 2021-01-09.
  8. PyData. 2020c. "Extending pandas." Docs, Pandas, v1.2.0, November 14. Accessed 2021-01-09.
  9. PyData. 2020d. "pandas arrays." API Reference, Pandas, v1.2.0, October 29. Accessed 2021-01-09.
  10. PyData. 2020e. "DataFrame." API Reference, Pandas, v1.2.0, September 3. Accessed 2021-01-09.
  11. PyData. 2020f. "What’s new in 1.0.0." Release Notes, Pandas, January 29. Updated 2020-11-14. Accessed 2021-01-05.
  12. PyData. 2020g. "What’s new in 1.2.0." Release Notes, Pandas, December 26. Updated 2020-11-14. Accessed 2021-01-05.
  13. PyData. 2020h. "Working with missing data." User Guide, Pandas, v1.2.0, November 2. Accessed 2021-01-09.
  14. PyData. 2020i. "Sparse data structures." User Guide, Pandas, v1.2.0, November 14. Accessed 2021-01-09.
  15. PyData. 2020j. "Essential basic functionality." User Guide, Pandas, v1.2.0, November 14. Accessed 2021-01-09.
  16. PyData. 2020k. "Working with text data." User Guide, Pandas, v1.2.0, October 17. Accessed 2021-01-09.
  17. PyData. 2020l. "Duplicate Labels." User Guide, Pandas, v1.2.0, October 5. Accessed 2021-01-10.
  18. Wang, Jiahui. 2019. "Python List, NumPy, and Pandas." Accessed 2021-01-02.
  19. Yıldırım, Soner. 2020. "Why We Need to Use Pandas New String Dtype Instead of Object for Textual Data." Towards Data Science, on Medium, August 21. Accessed 2021-01-09.
  20. scikit-learn. 2020. "Version 0.23.2." Release Notes, scikit-learn, 0.23.2, August. Accessed 2021-01-10.

Further Reading

  1. Data Carpentry. 2020. "Data Types and Formats." In: Data Analysis and Visualization in Python for Ecologists, Data Carpentry, June 16. Accessed 2021-01-09.
  2. PyData. 2020a. "Intro to data structures." User Guide, Pandas, v1.2.0, November 26. Accessed 2021-01-09.
  3. PyData. 2020j. "Essential basic functionality." User Guide, Pandas, v1.2.0, November 14. Accessed 2021-01-09.

Cite As

Devopedia. 2021. "Pandas Data Structures." Version 5, January 10. Accessed 2024-06-25. https://devopedia.org/pandas-data-structures

Last updated on
2021-01-10 11:35:41