Pandas Data Structures
Summary
The two main data structures in Pandas are Series for 1-D data and DataFrame for 2-D data. Data in higher dimensions are supported within DataFrame using a concept called hierarchical indexing. For storing axis labels of Series and DataFrame, the data structure used is Index. These data structures can be created from Python or NumPy data structures.

Pandas data structures store data using NumPy or Pandas data types. Pandas has defined new data types where NumPy data types don't satisfy specific use cases. Pandas data types are also called extension types. They're extended from Pandas array. Developers can extend array to create custom data types.

Pandas includes methods to convert data structures from Pandas to Python or NumPy. There's also implicit or explicit conversion of data types since a Series object or a DataFrame column can store values of only one data type.
Discussion
Could you give example usage of Pandas Series and DataFrame?

Consider the pnwflights14 dataset from 2014. It captures 162,049 flights departing from two major airports, Seattle and Portland. A flight is characterized by date and time of departure, arrival/departure delays, carrier, origin, destination, distance, airtime, etc. In a Pandas DataFrame, each flight becomes a row and each attribute becomes a labelled column.

For analysis, it's possible to extract only one column from the DataFrame. The result is a Series data structure. For example, given DataFrame df, we can write df['air_time'] or df.air_time. Thus, DataFrame is for 2-D data and Series is for 1-D data. We can also extract multiple columns (such as df[['origin','dest']]) resulting in another DataFrame.

It's possible to extract specific rows for analysis. For example, if we wish to analyze all flights originating from only Seattle, we can write df[df.origin=='SEA']. The result is another DataFrame. If we access a single row (such as df.iloc[0] or df.loc[0]), we get a Series data structure.

In terms of data types, this dataset has a mix of types: integer, float, datetime, string, etc. However, each column has only one type.
What's the anatomy of a Pandas DataFrame?

A Pandas DataFrame represents tabular data, that is, data in rows and columns. All values in a column must have the same data type. Each row can be thought of as representing an entity or thing, with its attributes represented in the columns.

By default, rows and columns are indexed with integers, starting from zero. For convenience, it's more common to give labels or descriptive names, particularly to columns. Thus, rather than writing df[0] we can write df['homeTeamName']. Duplicate labels are allowed but can be prevented with the method set_flags(allows_duplicate_labels=False).

We can extract a single column, resulting in a Series. A row can also be extracted into a Series but since columns are likely to have different data types, the type of the resulting Series would be object, a generic type.

The direction of a Series is called axis. When moving from one row to the next, we're on axis-0. When moving from one column to the next, we're on axis-1.
What are the different data types supported in Pandas?

Pandas mostly uses NumPy arrays and dtypes. These are float, int, bool, datetime64[ns], and timedelta64[ns]. Unlike NumPy, Pandas supports timezone-aware datetimes. For numerical values, defaults are int64 and float64.

Data types particular to Pandas are BooleanDtype, CategoricalDtype, DatetimeTZDtype, Int64Dtype (plus other size and signed/unsigned variants), IntervalDtype, PeriodDtype, SparseDtype, StringDtype, Float32Dtype, and Float64Dtype.

When reading a data source, Pandas infers the data type for each column. Sometimes we may want to use a different type. For example, 'Customer ID' may be stored as float64 but we want int64; 'Month' may be object but we want datetime64[ns]. For type conversions, we can use methods astype(), to_numeric() or to_datetime(). There's also the convert_dtypes() method to infer the best possible conversion.
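A minimal sketch of these conversions, with hypothetical column names taken from the example above:

```python
import pandas as pd

# Hypothetical columns read with less-than-ideal inferred types
df = pd.DataFrame({
    'Customer ID': [1001.0, 1002.0, 1003.0],     # inferred as float64
    'Month': ['2020-01', '2020-02', '2020-03'],  # inferred as object
})

df['Customer ID'] = df['Customer ID'].astype('int64')   # explicit cast
df['Month'] = pd.to_datetime(df['Month'])               # now datetime64[ns]

# Let Pandas infer the best extension types (e.g. Int64, string)
converted = df.convert_dtypes()
print(df.dtypes, converted.dtypes, sep='\n\n')
```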
What use is string type when there's already object type?

Consider the following examples:

- pd.Series(['red', 'green', 'blue']): object data type is used by default
- pd.Series(['red', 'green', 'blue'], dtype='string'): string data type is used

The object type is more general. A DataFrame column with different types is collectively given the object type. Instead, string type gives developers clarity on the type of values stored. It also prevents accidentally storing strings and non-strings together. Pandas string type, which is an alias for StringDtype, is also more "closely aligned" to Python's own str class. The underlying storage is arrays.StringArray.

In a DataFrame, columns of string type can be more easily separated from columns of object type. This is possible by calling the method select_dtypes(include='string').

While there's no difference between the two types in terms of performance, it's possible that future versions of Pandas might implement optimizations for string type.
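A short sketch contrasting the two dtypes (the column names are made up for illustration):

```python
import pandas as pd

colors_obj = pd.Series(['red', 'green', 'blue'])                  # dtype: object
colors_str = pd.Series(['red', 'green', 'blue'], dtype='string')  # dtype: string

df = pd.DataFrame({'obj_col': colors_obj, 'str_col': colors_str})

# Separate string-typed columns from object-typed ones
print(df.select_dtypes(include='string').columns.tolist())   # ['str_col']

# A string column typically rejects non-string values; an object column accepts anything
try:
    colors_str[0] = 42
except (TypeError, ValueError):
    print('non-string value rejected by string dtype')
colors_obj[0] = 42   # silently accepted: the column is now truly mixed
```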
-
What are sparse data types and why do they matter? Both Series and DataFrame can be used to store sparse data. Sparse doesn't necessarily mean "mostly zero" values. Rather, sparse data structures can be seen as a memory optimization technique. Any value can be specified. This value is not stored. However, these data structures behave the same way as their dense counterparts.
The underlying storage for sparse data structures is an ExtensionArray called
arrays.SparseArray
. Fill value, which is not stored in sparse arrays, can be specified using argumentfill_value
. We can inspect proportion of non-sparse values with propertydensity
. Methodto_dense()
yields the dense array. Methodsfrom_spmatrix()
andto_coo()
help in interfacing with SciPy matrices.As an example, consider the storage size of these two, using
df.memory_usage().sum()
:df = pd.DataFrame([1, 2, 2, 2])
: 160 bytesdf = pd.DataFrame([1, 2, 2, 2], dtype=pd.SparseDtype("float", 2))
: 140 bytes
What data types deal with missing values in Pandas?

For speed and convenience, it's useful to detect missing values. Traditionally, Pandas has used NumPy's np.nan (displayed as NaN) to represent missing data. For example, Python None values will result in NaN in Pandas.

However, NaN is not used for all data types. It's used in a float context. For datetime objects, NaT is used, which is compatible with NaN. A Python None value converted in a string context becomes the string "None", that is, it's not treated as a missing value. A list of integers with some None values can't be stored with an integer NumPy dtype; it's upcast to float64 with NaN.

To provide more uniform handling of missing values, since Pandas 1.0.0, there's the pd.NA value (displayed as <NA>). This is now used for missing values in string, boolean and Int64 (and variants) types.

Inserting missing values into a Pandas data structure will trigger the suitable conversion. Infinite values -inf and inf can be treated as pd.NA by setting pandas.options.mode.use_inf_as_na = True.

DataFrame/Series methods fillna(), dropna(), isna(), notna() and interpolate() are useful.
How do I convert Pandas data structures to/from NumPy or Python equivalents?

Input to the Series constructor can be a Python list or NumPy array. Thus, we can write pandas.Series([1,2]) or pandas.Series(numpy.array([1,2,3])). The constructor can also accept a tuple, a dictionary or a scalar value. For a dictionary input, the keys become the Series index.

Series methods tolist() and to_numpy() return a Python list and a NumPy ndarray respectively. Another way to get a NumPy array is to access the property Series.values, but this is not recommended since for some data types it doesn't exactly return an ndarray.

Another property, Series.array, returns the underlying ExtensionArray. This is Pandas' own array. If the data is of a NumPy native type, this is a thin wrapper (no copy) over a NumPy ndarray.

Likewise, there are many ways to create a DataFrame: from lists, dictionaries, Series, another DataFrame, 1-D/2-D/structured NumPy arrays, or a list of dictionaries. For a dictionary input, the keys become column labels. If keys are tuples, we create a multi-indexed frame. Useful methods for conversion include to_dict() and from_dict() (Python), and to_records() and from_records() (NumPy).
Further Reading
- Data Carpentry. 2020. "Data Types and Formats." In: Data Analysis and Visualization in Python for Ecologists, Data Carpentry, June 16. Accessed 2021-01-09.
- PyData. 2020a. "Intro to data structures." User Guide, Pandas, v1.2.0, November 26. Accessed 2021-01-09.
- PyData. 2020j. "Essential basic functionality." User Guide, Pandas, v1.2.0, November 14. Accessed 2021-01-09.
See Also
- Pandas
- Pandas DataFrame Operations
- NumPy Data Types
- Optimizing Pandas
- Python for Scientific Computing
- PyCUDA