Pandas DataFrame Operations
- Summary
-
Discussion
- What are some ways to access values in a Pandas DataFrame?
- What is multi-indexing on a Pandas DataFrame?
- How can I filter a Pandas DataFrame?
- How do I do GroupBy operations on a Pandas DataFrame?
- How can I reshape a Pandas DataFrame?
- What are the ways to combine two datasets in Pandas?
- Could you share some best practices or tips for using Pandas DataFrame?
- Milestones
- References
- Further Reading
- Article Stats
- Cite As
DataFrame is an essential data structure in Pandas and there are many way to operate on it. Arithmetic, logical and bit-wise operations can be done across one or more frames.
Operations specific to data analysis include:
- Subsetting: Access a specific row/column, range of rows/columns, or a specific item.
- Slicing: A form of subsetting in which Python slicing syntax
[:]
is used. Rows/columns are numbered from zero. Negative integers imply traversal from the last row/column. - Filtering: Obtain rows that fulfil one or more conditions.
- Reshaping: Reorganize data such that the number of rows and columns change.
- Merging: A DataFrame is merged with another. This can be a simple concatenation of frames or database-style joins.
Indexing is a general term for subsetting, slicing and filtering.
Discussion
-
What are some ways to access values in a Pandas DataFrame? Consider a simple DataFrame with index values 10-12 and columns A-C. A row or column can be accessed by its index value or column label respectively. Column label is used directly in
[]
:df['A']
. Index values are used via.loc[]
:df.loc[10]
. We can also access a row by its position using.iloc[]
:df.iloc[0]
.In our example, there are many ways to get the last item, a scalar value:
- Row first:
df.iloc[2]['C']
,df.loc[12]['C']
,df.loc[12, 'C']
- Column first:
df['C'][12]
,df['C'].loc[12]
,df['C'].iloc[2]
To access multiple rows use
df.loc[[10,11]]
ordf.iloc[0:2]
. For multiple columns, usedf[['A','B']]
. To get last two columns of last two rows, we can writedf.loc[[11,12], ['B','C']]
,df.iloc[1:][['B','C']]
,df[['B','C']].iloc[1:]
ordf[['B','C']].loc[[11,12]]
. Integer listsdf.iloc[[1,2],[1,2]]
or slicingdf.iloc[1:,1:]
ordf.iloc[-2:, -2:]
can be used.Positional and label-based indexing can be combined:
df.loc[df.index[[0, 2]], 'A']
.Callable can be used for indexing. Function takes a Series or DataFrame and must return a valid index. For example,
df.loc[:, lambda df: ['A','B']]
ordf[lambda df: df.columns[0]]
. - Row first:
-
What is multi-indexing on a Pandas DataFrame? Hierarchical indexing or MultiIndex allows us to work with higher dimensional datasets using 2-D DataFrame. A MultiIndex could be seen as an array of unique tuples. In fact, it can be created from list of arrays, array of tuples, a DataFrame or from a product of iterables.
Referring to the figure, rows can be accessed using
df.loc['a']
ordf.loc['a',:]
. For multi-level indexing, usedf.loc[('a','one')]
,df.loc[('a','one'),:]
ordf.loc[(slice(None),'one'), :]
. Likewise for columns, we havedf['A']
,df.loc[:,'A']
ordf.loc[:, (slice(None), 'x')]
. Multi-indexing across rows and columns can be done withdf['A']['x']
,df['A','x']
ordf.loc[:,('A','x')]
. To get specific items, usedf.loc[('a','one'),('B','x')]
ordf.loc[('a','one'),'B']
.Partial slicing is done using
df.loc[('a', 'one'):('b', 'two')]
(result has four rows). Reindexing is done withdf.loc[[('a', 'one'), ('b', 'two')]]
(result has two rows).With two levels on index,
df.swaplevel()
will change the order of the levels.Note that
df.loc[('a','one'),:]
is valid butdf.loc[['a','one'],:]
is not. In this context, lists and tuples have different semantics:For indexing, a tuple is a multi-level key whereas a list specifies several keys.
-
How can I filter a Pandas DataFrame? Boolean arrays of same size as index can be used for filtering, that is, select rows indexed with
True
. For example,df.loc[~df.index.isin([11])]
ignores a specific index;df[df['A']>0]
selects only positive values of column A;df[(df['A'] > 2) & (df['B'] < 3)]
shows a complex expression where use of parenthesis is important.The method
query()
can simplify the syntax for complex expressions. Thus, for columns a-d, we can writedf.query('a not in b')
instead ofdf[~df['a'].isin(df['b'])]
;df.query('a in b and c < d')
instead ofdf[df['b'].isin(df['a']) & (df['c'] < df['d'])]
.Whereas
df[df.A>0]
will return rows that match the condition,df.where(df.A>0)
will return a DataFrame of same shape as the original. Where the condition is False, NaN will be returned. Optional second argument towhere()
is a replacement value. The inverse boolean operation ofwhere()
ismask()
, that is, it returns rows that don't match the condition.To select only rows with null values, use
df[df.isnull().any(axis=1)]
. Usedf[~df.isnull().any(axis=1)]
to do the reverse.For MultiIndex DataFrame,
isin()
method can be used. For example,df.loc[df.index.isin([('a','one'), ('b','two')])]
ordf.loc[df.index.isin(['a','c'], level=0)]
. -
How do I do GroupBy operations on a Pandas DataFrame? GroupBy uses a split-apply-combine workflow. Data is split using index or column values. A function is applied to each group independently. The results are combined into a single data structure. In the apply step, data can be aggregated (sum, mean, min, etc), transformed (normalize data, fill missing values, etc) or filtered (discard small groups, filter data based on group mean, etc).
To group by single column 'A' or index 'a', use
df.groupby('A')
ordf.groupby('a')
. Index grouping can also be done by position:df.groupby(level=0)
. To group by multiple columns or indices, usedf.groupby(['a', 'A'])
. To group by all levels except 'b', usedf.groupby(level=df.index.names.difference('b'))
. If a column and index share the same name 'b', usedf.groupby([pd.Grouper(level='b'), 'A'])
(index 'b') ordf.groupby([pd.Grouper('b'), 'A'])
(column 'b').To aggregate, simply call the function after grouping:
df.groupby('A').sum()
; ordf.groupby('A').agg([np.sum, np.mean])
to aggregate in many ways. Each column can be aggregated differently:df.groupby('A').agg({'C': np.sum, 'D': lambda x: np.std(x, ddof=1)})
.Methods
transform()
,filter()
,resample()
,expanding()
,rolling()
andemw()
can be used afterdf.groupby()
. For grouping datetime types,pd.Grouper(key, level, freq, axis, sort)
is useful. -
How can I reshape a Pandas DataFrame? When data is in "record" form,
pivot()
spreads rows into columns. Methodmelt()
does the opposite. Methodsstack()
andunstack()
are designed for MultiIndex objects and they're closely related to pivot. They can be applied on Series or DataFrame. Here are a few examples:df.pivot(index='foo', columns='bar', values='baz')
: Column 'foo' becomes the index, 'bar' values become new columns and values of 'baz' becomes values of the new DataFrame. A more generalized API isdf.pivot_table()
that allows for duplicate values of an index/column pair.df.melt(id_vars=['A','B'])
: Two columns are retained and other columns are spread into rows. Two new columns named 'variable' (with old column names as values) and 'value' are introduced. Original index can be retained but values will be duplicated.df.stack()
: Columns becomes part of a new inner-most index level. If DataFrame has hierarchical column labels, level can be specified as argument.df.unstack('second')
ordf.unstack(1)
: Values of index 'second' are spread into new columns. If no argument is supplied, the inner-most index is spread.
These methods when combined with GroupBy and aggregations can be expressive. Examples:
df.stack().mean(1).unstack()
,df.groupby(level=1, axis=1).mean()
,df.stack().groupby(level=1).mean()
anddf.mean().unstack(0)
. -
What are the ways to combine two datasets in Pandas? Perhaps the simplest is to understand is concatenating two or more frames that share the same column labels. Rows of one are concatenated to the other:
pd.concat([df1, df2])
. To concatenate column-wise, usepd.concat([df1, df2], axis=1)
. If the index values don't match, we re-index before concatenating:pd.concat([df1, df2.reindex(df1.index)], axis=1)
. Argumentjoin
takes valuesouter
(default) andinner
. An outermost level of multi-index can be created withpd.concat([df1, df2], keys=['A','B'])
.Database-style joins are possible with
df.join(df1)
in which the argumenthow
takes four values:left
,right
,outer
,inner
. By default, joins are based on index. Withon
argument, we can choose to join column or index level ofdf
with index ofdf1
. Joins based on MultiIndex are supported.Method
df.merge()
is more flexible than join since index levels or columns can be used. If merging on only columns, indices are ignored. Unlike join, cross merge (a cartesian product of both frames) is possible. Methodspd.merge()
,pd.merge_ordered()
andpd.merge_asof()
are related.Examples of merge, join and concatenate are available in the user guide.
-
Could you share some best practices or tips for using Pandas DataFrame? Here are some tips and pointers relevant to DataFrame operations:
- While
df['A']
is used to access a column,df.A
is a handy short form. This is called attribute access. Column name must be a valid Python identifier and it shouldn't conflict with existing method/attribute. Thus,df.1
anddf.first name
(has a whitespace) are not allowed.df.min
anddf.index
conflict with existing method/attribute. - Syntax
df['A']
results in a Series data structure whereasdf[['A']]
returns another DataFrame. - On a MultiIndex DataFrame, consider
dfmi['one']['second']
versusdfmi.loc[:, ('one','second')]
. Both give the same result but the latter is faster. The former form is actually two operations: firstdfmi['one']
returns a DataFrame, then['second']
is chained to it. - To create an explicit copy, write
df.copy()
. The statementsdf1 = df; df1[0:3] = 0
doesn't create a copy and modifies the original DataFrame. - Remember that
loc[]
is for label-based indexing andiloc[]
is for integer-based indexing. Integers may be used for the former but they're treated as labels. - When filtering by datetime values, their attributes can be used, such as,
df.loc[(df['Date'].dt.month==3) | (df['Date'].dt.month==11)]
.
- While
Milestones
2019

In Pandas 0.25.0, Named Aggregation is introduced. This uses a named tuple pd.NamedAgg()
with two fields column
and aggfunc
. The latter can be set to a callable or a string alias. For a more compact code, plain tuples can be used instead. This is now the recommended replacement for using "dict-of-dicts", which was deprecated in version 0.20.0 (May 2017).
2020
In Pandas 1.1.0, ignore_index
argument is added to melt()
with default value True. If set to False, index is retained although duplicate values will be present in the result. For GroupBy, it's now possible to retain NA
values in group keys. For example, df.groupby(by=["b"], dropna=False).sum()
will retain NA
values in column 'b'.
References
- Babu, Vinay. 2020. "Dataframe groupby date and time." Blog, Kanoki, May 26. Accessed 2021-01-12.
- Bosler, Fabian. 2019. "Pandas’ groupby explained in detail." Towards Data Science, on Medium, November 13. Accessed 2021-01-12.
- Bowne-Anderson, Hugo. 2017. "Hierarchical indices, groupby and pandas." Tutorial, DataCamp, October 3. Accessed 2021-01-12.
- Data Carpentry. 2020. "Indexing, Slicing and Subsetting DataFrames in Python." In: Data Analysis and Visualization in Python for Ecologists, Data Carpentry, December 1. Accessed 2021-01-12.
- Jain, Harsh. 2020. "Three ways to combine DataFrames in Pandas." edpresso, Educative, Inc., April 9. Accessed 2021-01-12.
- Lynn, Shane. 2015. "Pandas Groupby: Summarising, Aggregating, and Grouping data in Python." June 14. Updated 2020-06-30. Accessed 2021-01-12.
- PyData. 2019. "What's new in 0.25.0." Release Notes, Pandas, July 18. Updated 2020-10-07. Accessed 2021-01-13.
- PyData. 2020a. "10 minutes to pandas." User Guide, Pandas, v1.2.0, December 29. Accessed 2021-01-12.
- PyData. 2020b. "Intro to data structures." User Guide, Pandas, v1.2.0, November 26. Accessed 2021-01-12.
- PyData. 2020c. "Essential basic functionality." User Guide, Pandas, v1.2.0, November 14. Accessed 2021-01-12.
- PyData. 2020d. "Reshaping and pivot tables." User Guide, Pandas, v1.2.0, October 7. Accessed 2021-01-12.
- PyData. 2020e. "Group by: split-apply-combine." User Guide, Pandas, v1.2.0, November 25. Accessed 2021-01-12.
- PyData. 2020f. "What's new in 1.2.0." Release Notes, Pandas, December 26. Accessed 2021-01-13.
- PyData. 2020g. "What's new in 1.1.0." Release Notes, Pandas, October 7. Accessed 2021-01-14.
- PyData. 2020h. "Indexing and selecting data." User Guide, Pandas, v1.2.0, December 15. Accessed 2021-01-14.
- PyData. 2020i. "MultiIndex / advanced indexing." User Guide, Pandas, v1.2.0, October 7. Accessed 2021-01-14.
- PyData. 2020j. "Merge, join, concatenate and compare." User Guide, Pandas, v1.2.0, November 19. Accessed 2021-01-14.
- PyData. 2020k. "pandas.DataFrame.join." API Reference, Pandas, v1.2.0, December 18. Accessed 2021-01-15.
- PyData. 2020l. "pandas.DataFrame.merge." API Reference, Pandas, v1.2.0, December 18. Accessed 2021-01-15.
- PyData. 2020m. "pandas.merge." API Reference, Pandas, v1.2.0, December 23. Accessed 2021-01-15.
- PyData. 2021. "How do I select a subset of a DataFrame?" Getting started tutorials, Pandas, v1.2.0, January 6. Accessed 2021-01-12.
- Stringham, Jessica. 2019. "Pandas MultiIndex cheatsheet." Blog, December 10. Accessed 2021-01-14.
- The Coding Bot. 2019. "How do we filter rows of a pandas Dataframe by a column value." The Coding Bot, July 27. Accessed 2021-01-12.
- XsPDF. 2021. "Filter dataframe rows by index name." XsPDF. Accessed 2021-01-12.
Further Reading
- Devlin, Josh. 2020. "Pandas Cheat Sheet — Python for Data Science." Blog, Dataquest, March 4. Accessed 2021-01-12.
- VanderPlas, Jake. 2016. "Hierarchical Indexing." In: Python Data Science Handbook, O'Reilly Media, Inc. Accessed 2021-01-12.
- PyData. 2020e. "Group by: split-apply-combine." User Guide, Pandas, v1.2.0, November 25. Accessed 2021-01-12.
- PyData. 2020h. "Indexing and selecting data." User Guide, Pandas, v1.2.0, December 15. Accessed 2021-01-14.
- PyData. 2020i. "MultiIndex / advanced indexing." User Guide, Pandas, v1.2.0, October 7. Accessed 2021-01-14.
- PyData. 2020j. "Merge, join, concatenate and compare." User Guide, Pandas, v1.2.0, November 19. Accessed 2021-01-14.
Article Stats
Cite As
See Also
- Pandas Time Series Analysis
- Pandas Data Structures
- NumPy Array Operations
- Optimizing Pandas
- Exploratory Data Analysis
- Data Analytics