Pandas Data Structures
 Summary

The two main data structures in Pandas are Series for 1D data and DataFrame for 2D data. Data in higher dimensions are supported within DataFrame using a concept called hierarchical indexing. For storing axis labels of Series and DataFrame, the data structure used is Index. These data structures can be created from Python or NumPy data structures.
Pandas data structures store data using NumPy or Pandas data types. Pandas has defined new data types where NumPy data types don't satisfy specific use cases. Pandas data types are also called Extension types. They're extended from Pandas array. Developers can extend array to create custom data types.
Pandas includes methods to convert data structures from Pandas to Python or NumPy. There's also implicit or explicit conversion of data types since a Series object or a DataFrame column can store values of only one data type.
Discussion
Could you give example usage of Pandas Series and DataFrame?
Consider the pnwflights14 dataset from 2014. It captures 162,049 flights departing from two major airports, Seattle and Portland. A flight is characterized by date and time of departure, arrival/departure delays, carrier, origin, destination, distance, air time, etc. In a Pandas DataFrame, each flight becomes a row and each attribute becomes a labelled column.
For analysis, it's possible to extract only one column from the DataFrame. The result is a Series data structure. For example, given DataFrame df, we can write df['air_time'] or df.air_time. Thus, DataFrame is for 2D data and Series is for 1D data. We can also extract multiple columns (such as df[['origin','dest']]), resulting in another DataFrame.
It's possible to extract specific rows for analysis. For example, if we wish to analyze only flights originating from Seattle, we can write df[df.origin=='SEA']. The result is another DataFrame. If we access a single row (such as df.iloc[0] or df.loc[0]), we get a Series data structure.
In terms of data types, this dataset has a mix of types: integer, float, datetime, string, etc. However, each column has only one type.
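The selections described above can be sketched on a tiny stand-in table. The four rows below are made up for illustration and are not taken from pnwflights14; only the column names follow the dataset's attributes mentioned in the text.

```python
import pandas as pd

# Hypothetical rows shaped like the flights data (values are invented).
df = pd.DataFrame({
    "carrier": ["AS", "UA", "AS", "DL"],
    "origin": ["SEA", "PDX", "SEA", "PDX"],
    "dest": ["ANC", "SFO", "LAX", "ATL"],
    "air_time": [194.0, 95.0, 130.0, 255.0],
})

air_time = df["air_time"]                  # one column -> Series (same as df.air_time)
route = df[["origin", "dest"]]             # list of columns -> DataFrame
from_seattle = df[df["origin"] == "SEA"]   # boolean mask on rows -> DataFrame
first_row = df.iloc[0]                     # one row by position -> Series

print(type(air_time).__name__, type(route).__name__)
print(from_seattle)
print(first_row)
```

With the default integer index, df.loc[0] and df.iloc[0] select the same row; they differ once the index carries other labels.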
What's the anatomy of a Pandas DataFrame?
A Pandas DataFrame represents tabular data, that is, data in rows and columns. All values in a column must have the same data type. Each row can be thought of as representing an entity or thing, with its attributes represented in the columns.
By default, rows and columns are indexed with integers, starting from zero. For convenience, it's more common to give labels or descriptive names, particularly to columns. Thus, rather than writing df[0] we can write df['homeTeamName']. Duplicate labels are allowed but can be prevented with method set_flags(allows_duplicate_labels=False).
We can extract a single column, resulting in a Series. A row can also be extracted into a Series but since columns are likely to have different data types, the type of the resulting Series would be object, a generic type.
The direction of a Series is called axis. When moving from one row to the next, we're on axis 0. When moving from one column to the next, we're on axis 1.
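A minimal sketch of these parts of a DataFrame follows. The match data is hypothetical, and the column names other than homeTeamName are assumptions made for the example.

```python
import pandas as pd

# Hypothetical match data; only homeTeamName appears in the text above.
df = pd.DataFrame({
    "homeTeamName": ["Lions", "Tigers", "Bears"],
    "homeGoals": [2, 0, 3],
    "attendance": [41250.0, 38900.0, 40100.0],
})

print(df.index)      # default row labels: integers starting at 0
print(df.columns)    # column labels

home = df["homeTeamName"]   # one column -> Series with a single dtype
row = df.iloc[0]            # one row -> Series; mixed column types give dtype object
print(home.dtype, row.dtype)

# axis=0 moves down the rows (one result per column);
# axis=1 moves across the columns (one result per row).
numeric = df[["homeGoals", "attendance"]]
per_column = numeric.sum(axis=0)
per_row = numeric.sum(axis=1)
print(per_column)
print(per_row)

# Disallow duplicate labels on the returned object (needs Pandas 1.2 or later).
strict = df.set_flags(allows_duplicate_labels=False)
print(strict.flags.allows_duplicate_labels)
```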
What are the different data types supported in Pandas?
Pandas mostly uses NumPy arrays and dtypes. These are float, int, bool, datetime64[ns], and timedelta64[ns]. Unlike NumPy, Pandas supports timezone-aware datetimes. For numerical values, defaults are int64 and float64.
Data types particular to Pandas are BooleanDtype, CategoricalDtype, DatetimeTZDtype, Int64Dtype (plus other size and signed/unsigned variants), IntervalDtype, PeriodDtype, SparseDtype, StringDtype, Float32Dtype, and Float64Dtype.
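As a rough illustration, the snippet below builds one Series for several of the types named above and checks which dtype class each ends up with. The sample values are arbitrary, and the exact datetime resolution printed may differ between Pandas versions.

```python
import pandas as pd

# NumPy-backed defaults
ints = pd.Series([1, 2, 3])            # int64
floats = pd.Series([1.0, 2.5])         # float64
when = pd.Series(pd.to_datetime(["2021-01-09", "2021-01-10"]))  # NumPy datetime64
aware = when.dt.tz_localize("UTC")     # timezone-aware -> DatetimeTZDtype

# Pandas extension types
nullable_int = pd.Series([1, None, 3], dtype="Int64")
flags = pd.Series([True, None], dtype="boolean")
colors = pd.Series(["red", "green", "red"], dtype="category")
spans = pd.Series(pd.interval_range(start=0, end=3))
months = pd.Series(pd.period_range("2021-01", periods=3, freq="M"))

for s in (ints, floats, when, aware, nullable_int, flags, colors, spans, months):
    print(s.dtype)
```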
When reading a data source, Pandas infers the data type for each column. Sometimes we may want to use a different type. For example, 'Customer ID' may be stored as float64 but we want int64; 'Month' may be object but we want datetime64[ns]. For type conversions, we can use the method astype() or the functions to_numeric() and to_datetime(). There's also the convert_dtypes() method to infer the best possible conversion.
What use is string type when there's already object type?
Consider the following examples:
 pd.Series(['red', 'green', 'blue']): object data type is used by default
 pd.Series(['red', 'green', 'blue'], dtype='string'): string data type is used
The object type is the more general type. A DataFrame column with different types is collectively given the object type. In contrast, the string type gives developers clarity on the type of values stored. It also prevents accidentally storing strings and non-strings together. Pandas string type, which is an alias for StringDtype, is also more "closely aligned" to Python's own str class. The underlying storage is arrays.StringArray.
In a DataFrame, columns of string type can be more easily separated from columns of object type. This is possible by calling the method select_dtypes(include='string').
While there's no difference between the two types in terms of performance, it's possible that future versions of Pandas might implement optimizations for string type.
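The type conversions and the string type discussed in the last two answers can be sketched together. The frame below is hypothetical: the 'Amount' and 'Colour' columns and all values are assumptions made for the example, and the dtypes reported by convert_dtypes() may vary by Pandas version.

```python
import pandas as pd

# Hypothetical raw data, as it might look right after reading a file.
raw = pd.DataFrame({
    "Customer ID": [1001.0, 1002.0, 1003.0],
    "Month": ["2020-01", "2020-02", "2020-03"],
    "Amount": ["12.5", "7", "n/a"],
    "Colour": ["red", "green", "blue"],
})

clean = raw.copy()
clean["Customer ID"] = raw["Customer ID"].astype("int64")
clean["Month"] = pd.to_datetime(raw["Month"], format="%Y-%m")
# errors="coerce" turns text that can't be parsed into NaN
clean["Amount"] = pd.to_numeric(raw["Amount"], errors="coerce")
clean["Colour"] = raw["Colour"].astype("string")
print(clean.dtypes)

# Only the explicitly string-typed column is selected.
text_only = clean.select_dtypes(include="string")
print(list(text_only.columns))

# Let Pandas pick suitable extension types on its own.
print(raw.convert_dtypes().dtypes)
```

Explicit conversion is the more predictable route when the target type is known; convert_dtypes() is convenient for a first pass over unfamiliar data.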
What are sparse data types and why do they matter?
Both Series and DataFrame can be used to store sparse data. Sparse doesn't necessarily mean "mostly zero" values. Rather, sparse data structures can be seen as a memory optimization technique. Any value can be designated as the fill value, and occurrences of that value are not stored. Otherwise, these data structures behave the same way as their dense counterparts.
The underlying storage for sparse data structures is an ExtensionArray called arrays.SparseArray. The fill value, which is not stored in sparse arrays, can be specified using argument fill_value. We can inspect the proportion of non-sparse (stored) values with property density. Method to_dense() yields the dense array. Methods from_spmatrix() and to_coo() help in interfacing with SciPy matrices.
As an example, consider the storage size of these two, using df.memory_usage().sum():
 df = pd.DataFrame([1, 2, 2, 2]): 160 bytes
 df = pd.DataFrame([1, 2, 2, 2], dtype=pd.SparseDtype("float", 2)): 140 bytes
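The saving is more visible on a longer column. The sketch below uses a made-up Series of 10,000 values, nearly all equal to the fill value, and reaches the sparse properties through the Series .sparse accessor.

```python
import pandas as pd

# Hypothetical data: 9,998 zeros and two other values.
dense = pd.Series([0.0] * 9_998 + [1.5, 2.5])

# Values equal to fill_value are not stored; only the other two are.
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

density = sparse.sparse.density          # fraction of stored values
dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)
print(density, sparse.sparse.fill_value)
print(dense_bytes, sparse_bytes)

# Converting back yields an ordinary dense Series with the same values.
round_trip = sparse.sparse.to_dense()
print(round_trip.equals(dense))
```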
What data types deal with missing values in Pandas?
For speed and convenience, it's useful to detect missing values. Traditionally, Pandas has used NumPy np.nan (displayed as NaN) to represent missing data. For example, Python None values will result in NaN in Pandas.
However, NaN is not used for all data types. It's used in a float context. For datetime objects, NaT is used, which is compatible with NaN. A Python None value, when converted in a string context, becomes the string "None", that is, it's not treated as a missing value. A list of integers with some None values can't be stored in a Series of NumPy integer type; by default, such a list is converted to float64 with NaN.
To provide more uniform handling of missing values, since Pandas 1.0.0, there's pd.NA (displayed as <NA>) value. This is now used for missing values in string, boolean and Int64 (and variants) types.
Inserting missing values into a Pandas data structure will trigger the suitable conversion. Infinite values inf and -inf can be treated as missing values by setting pandas.options.mode.use_inf_as_na = True.
DataFrame/Series methods fillna(), dropna(), isna(), notna() and interpolate() are useful.
How do I convert Pandas data structures to/from NumPy or Python equivalents?
Input to the Series constructor can be a Python list or NumPy array. Thus, we can write pandas.Series([1,2]) or pandas.Series(numpy.array([1,2,3])). The constructor can also accept a tuple, a dictionary or a scalar value. For a dictionary input, the keys become the Series index.
Series methods tolist() and to_numpy() return a Python list and a NumPy ndarray respectively. Another way to get a NumPy array is to access property Series.values but this is not recommended since for some data types it doesn't exactly return an ndarray.
Another property Series.array returns the underlying ExtensionArray. This is Pandas' own array. If the data is of NumPy native type, this is a thin wrapper (no copy) over NumPy ndarray.
Likewise, there are many ways to create a DataFrame: from lists, dictionaries, Series, another DataFrame, 1D/2D/structured NumPy arrays, or list of dictionaries. For a dictionary input, the keys become column labels. If keys are tuples, we create a multi-indexed frame. Useful methods for conversion include to_dict() and from_dict() (Python), and to_records() and from_records() (NumPy).
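The round trips between Pandas, Python and NumPy described above might look like this in practice. All values and labels are arbitrary examples.

```python
import numpy as np
import pandas as pd

# Into Series
from_list = pd.Series([1, 2])
from_array = pd.Series(np.array([1, 2, 3]))
from_dict = pd.Series({"a": 10, "b": 20})       # keys become the index
from_scalar = pd.Series(5, index=["x", "y"])    # scalar repeated per label

# Out of Series
as_list = from_array.tolist()        # Python list
as_ndarray = from_array.to_numpy()   # NumPy ndarray
backing = from_array.array           # Pandas array wrapping the data

# DataFrame to and from Python dictionaries and NumPy records
df = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.75]})   # keys become columns
as_dict = df.to_dict()                       # {column: {index: value}}
rebuilt = pd.DataFrame.from_dict(as_dict)
records = df.to_records(index=False)         # NumPy record array
from_records = pd.DataFrame.from_records(records)

# Tuple keys produce hierarchical (multi-indexed) columns.
multi = pd.DataFrame({("a", "x"): [1, 2], ("a", "y"): [3, 4]})
print(multi.columns)
```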
Milestones
2017
In Pandas 0.20.0 (joint release with 0.20.1), pandas.Panel is deprecated. The recommended way to represent 3D data is to use MultiIndex on a DataFrame. Method to_frame() converts from Panel to DataFrame. To convert to an xarray data structure of package xarray, the method to_xarray() can be used.
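As a sketch of the MultiIndex approach, the example below flattens a small 3D NumPy array into a DataFrame whose rows carry two index levels. The level names (item, date) and column names (open, close) are hypothetical choices for the illustration.

```python
import numpy as np
import pandas as pd

# A hypothetical 3D array: 2 items x 3 dates x 2 measurements.
cube = np.arange(12).reshape(2, 3, 2)

index = pd.MultiIndex.from_product(
    [["item1", "item2"], pd.date_range("2021-01-01", periods=3)],
    names=["item", "date"],
)
# The first two dimensions become the two index levels.
df = pd.DataFrame(cube.reshape(6, 2), index=index, columns=["open", "close"])
print(df)

one_item = df.loc["item2"]                                # 2D slice for one item
one_date = df.xs(pd.Timestamp("2021-01-02"), level="date")  # same date across items
print(one_item)
print(one_date)
```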
2020
Pandas 1.0.0 is released. On an experimental basis, Pandas introduces pd.NA to represent missing scalar values. This is used by Int64Dtype (and variants), StringDtype and BooleanDtype.
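A brief sketch of how the missing-value markers differ by data type, together with the common missing-data methods mentioned in the Discussion. The sample values are arbitrary.

```python
import numpy as np
import pandas as pd

# Traditional markers
floats = pd.Series([1.0, None])                            # None -> NaN
dates = pd.Series(pd.to_datetime(["2021-01-09", None]))    # None -> NaT

# Nullable extension types use pd.NA
ints = pd.Series([1, None, 3], dtype="Int64")
text = pd.Series(["a", None], dtype="string")
flags = pd.Series([True, None], dtype="boolean")
print(ints)

# Detecting, filling, dropping and interpolating
mask = ints.isna().tolist()
filled = ints.fillna(0).tolist()
dropped = ints.dropna().tolist()
smooth = pd.Series([1.0, None, 3.0]).interpolate().tolist()
print(mask, filled, dropped, smooth)

# pd.NA propagates in arithmetic, but logic can still be decided
# when the known operand settles the result.
print(pd.NA + 1, pd.NA | True)
```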
2020
In scikit-learn, a popular Python library for machine learning, version 0.23.0 is released. Dataset loaders in this release now support loading data as a Pandas DataFrame using argument as_frame=True.
2020
Pandas 1.2.0 is released. This introduces Float32Dtype, Float64Dtype and FloatingArray. While the default float uses np.nan for missing values, the new float types use pd.NA.
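The difference between the two float flavours can be seen in a two-element example. This is a minimal sketch assuming Pandas 1.2 or later, where the backing array class is exposed as pd.arrays.FloatingArray.

```python
import numpy as np
import pandas as pd

classic = pd.Series([1.5, None])                     # NumPy float64, missing -> NaN
nullable = pd.Series([1.5, None], dtype="Float64")   # Float64Dtype, missing -> <NA>

print(classic.dtype, classic[1])
print(nullable.dtype, nullable[1])
print(type(nullable.array).__name__)
```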
References
 Birchard, Todd. 2017. "Another 'Intro to Data Analysis in Python Using Pandas' Post." Hackers and Slackers, November 16. Updated 2020-12-10. Accessed 2021-01-09.
 Moffitt, Chris. 2018. "Overview of Pandas Data Types." Practical Business Python, March 26. Accessed 2021-01-09.
 Pathak, Manish. 2020. "Handling Categorical Data in Python." Tutorial, DataCamp, January 6. Accessed 2021-01-09.
 PyData. 2017. "Version 0.20.1." Release Notes, Pandas, v0.20.1, May 5. Accessed 2021-01-09.
 PyData. 2018. "pandas.Panel." API Reference, Pandas, v0.23.4, August. Accessed 2021-01-09.
 PyData. 2020a. "Intro to data structures." User Guide, Pandas, v1.2.0, November 26. Accessed 2021-01-09.
 PyData. 2020b. "Series." API Reference, Pandas, v1.2.0, December 2. Accessed 2021-01-09.
 PyData. 2020c. "Extending pandas." Docs, Pandas, v1.2.0, November 14. Accessed 2021-01-09.
 PyData. 2020d. "pandas arrays." API Reference, Pandas, v1.2.0, October 29. Accessed 2021-01-09.
 PyData. 2020e. "DataFrame." API Reference, Pandas, v1.2.0, September 3. Accessed 2021-01-09.
 PyData. 2020f. "What’s new in 1.0.0." Release Notes, Pandas, January 29. Updated 2020-11-14. Accessed 2021-01-05.
 PyData. 2020g. "What’s new in 1.2.0." Release Notes, Pandas, December 26. Updated 2020-11-14. Accessed 2021-01-05.
 PyData. 2020h. "Working with missing data." User Guide, Pandas, v1.2.0, November 2. Accessed 2021-01-09.
 PyData. 2020i. "Sparse data structures." User Guide, Pandas, v1.2.0, November 14. Accessed 2021-01-09.
 PyData. 2020j. "Essential basic functionality." User Guide, Pandas, v1.2.0, November 14. Accessed 2021-01-09.
 PyData. 2020k. "Working with text data." User Guide, Pandas, v1.2.0, October 17. Accessed 2021-01-09.
 PyData. 2020l. "Duplicate Labels." User Guide, Pandas, v1.2.0, October 5. Accessed 2021-01-10.
 scikit-learn. 2020. "Version 0.23.2." Release Notes, scikit-learn, 0.23.2, August. Accessed 2021-01-10.
 Wang, Jiahui. 2019. "Python List, NumPy, and Pandas." Accessed 2021-01-02.
 Yıldırım, Soner. 2020. "Why We Need to Use Pandas New String Dtype Instead of Object for Textual Data." Towards Data Science, on Medium, August 21. Accessed 2021-01-09.
Further Reading
 Data Carpentry. 2020. "Data Types and Formats." In: Data Analysis and Visualization in Python for Ecologists, Data Carpentry, June 16. Accessed 2021-01-09.
 PyData. 2020a. "Intro to data structures." User Guide, Pandas, v1.2.0, November 26. Accessed 2021-01-09.
 PyData. 2020j. "Essential basic functionality." User Guide, Pandas, v1.2.0, November 14. Accessed 2021-01-09.
See Also
 Pandas
 Pandas DataFrame Operations
 NumPy Data Types
 Optimizing Pandas
 Python for Scientific Computing
 PyCUDA