Pandas DataFrame Operations

DataFrame is an essential data structure in Pandas and there are many way to operate on it. Arithmetic, logical and bit-wise operations can be done across one or more frames.

Operations specific to data analysis include:

  • Subsetting: Access a specific row/column, range of rows/columns, or a specific item.
  • Slicing: A form of subsetting in which Python slicing syntax [:] is used. Rows/columns are numbered from zero. Negative integers imply traversal from the last row/column.
  • Filtering: Obtain rows that fulfil one or more conditions.
  • Reshaping: Reorganize data such that the number of rows and columns change.
  • Merging: A DataFrame is merged with another. This can be a simple concatenation of frames or database-style joins.

Indexing is a general term for subsetting, slicing and filtering.

Discussion

  • What are some ways to access values in a Pandas DataFrame?
    Many ways to access rows and columns of a DataFrame. Source: PyData 2020b.
    Many ways to access rows and columns of a DataFrame. Source: PyData 2020b.

    Consider a simple DataFrame with index values 10-12 and columns A-C. A row or column can be accessed by its index value or column label respectively. Column label is used directly in []: df['A']. Index values are used via .loc[]: df.loc[10]. We can also access a row by its position using .iloc[]: df.iloc[0].

    In our example, there are many ways to get the last item, a scalar value:

    • Row first: df.iloc[2]['C'], df.loc[12]['C'], df.loc[12, 'C']
    • Column first: df['C'][12], df['C'].loc[12], df['C'].iloc[2]

    To access multiple rows use df.loc[[10,11]] or df.iloc[0:2]. For multiple columns, use df[['A','B']]. To get last two columns of last two rows, we can write df.loc[[11,12], ['B','C']], df.iloc[1:][['B','C']], df[['B','C']].iloc[1:] or df[['B','C']].loc[[11,12]]. Integer lists df.iloc[[1,2],[1,2]] or slicing df.iloc[1:,1:] or df.iloc[-2:, -2:] can be used.

    Positional and label-based indexing can be combined: df.loc[df.index[[0, 2]], 'A'].

    Callable can be used for indexing. Function takes a Series or DataFrame and must return a valid index. For example, df.loc[:, lambda df: ['A','B']] or df[lambda df: df.columns[0]].

  • What is multi-indexing on a Pandas DataFrame?
    An exampled of MultiIndex on both rows and columns. Source: Stringham 2019.
    An exampled of MultiIndex on both rows and columns. Source: Stringham 2019.

    Hierarchical indexing or MultiIndex allows us to work with higher dimensional datasets using 2-D DataFrame. A MultiIndex could be seen as an array of unique tuples. In fact, it can be created from list of arrays, array of tuples, a DataFrame or from a product of iterables.

    Referring to the figure, rows can be accessed using df.loc['a'] or df.loc['a',:]. For multi-level indexing, use df.loc[('a','one')], df.loc[('a','one'),:] or df.loc[(slice(None),'one'), :]. Likewise for columns, we have df['A'], df.loc[:,'A'] or df.loc[:, (slice(None), 'x')]. Multi-indexing across rows and columns can be done with df['A']['x'], df['A','x'] or df.loc[:,('A','x')]. To get specific items, use df.loc[('a','one'),('B','x')] or df.loc[('a','one'),'B'].

    Partial slicing is done using df.loc[('a', 'one'):('b', 'two')] (result has four rows). Reindexing is done with df.loc[[('a', 'one'), ('b', 'two')]] (result has two rows).

    With two levels on index, df.swaplevel() will change the order of the levels.

    Note that df.loc[('a','one'),:] is valid but df.loc[['a','one'],:] is not. In this context, lists and tuples have different semantics:

    For indexing, a tuple is a multi-level key whereas a list specifies several keys.
  • How can I filter a Pandas DataFrame?

    Boolean arrays of same size as index can be used for filtering, that is, select rows indexed with True. For example, df.loc[~df.index.isin([11])] ignores a specific index; df[df['A']>0] selects only positive values of column A; df[(df['A'] > 2) & (df['B'] < 3)] shows a complex expression where use of parenthesis is important.

    The method query() can simplify the syntax for complex expressions. Thus, for columns a-d, we can write df.query('a not in b') instead of df[~df['a'].isin(df['b'])]; df.query('a in b and c < d') instead of df[df['b'].isin(df['a']) & (df['c'] < df['d'])].

    Whereas df[df.A>0] will return rows that match the condition, df.where(df.A>0) will return a DataFrame of same shape as the original. Where the condition is False, NaN will be returned. Optional second argument to where() is a replacement value. The inverse boolean operation of where() is mask(), that is, it returns rows that don't match the condition.

    To select only rows with null values, use df[df.isnull().any(axis=1)]. Use df[~df.isnull().any(axis=1)] to do the reverse.

    For MultiIndex DataFrame, isin() method can be used. For example, df.loc[df.index.isin([('a','one'), ('b','two')])] or df.loc[df.index.isin(['a','c'], level=0)].

  • How do I do GroupBy operations on a Pandas DataFrame?
    Visualizing groupby().agg() on a DataFrame. Source: Bosler 2019.
    Visualizing groupby().agg() on a DataFrame. Source: Bosler 2019.

    GroupBy uses a split-apply-combine workflow. Data is split using index or column values. A function is applied to each group independently. The results are combined into a single data structure. In the apply step, data can be aggregated (sum, mean, min, etc), transformed (normalize data, fill missing values, etc) or filtered (discard small groups, filter data based on group mean, etc).

    To group by single column 'A' or index 'a', use df.groupby('A') or df.groupby('a'). Index grouping can also be done by position: df.groupby(level=0). To group by multiple columns or indices, use df.groupby(['a', 'A']). To group by all levels except 'b', use df.groupby(level=df.index.names.difference('b')). If a column and index share the same name 'b', use df.groupby([pd.Grouper(level='b'), 'A']) (index 'b') or df.groupby([pd.Grouper('b'), 'A']) (column 'b').

    To aggregate, simply call the function after grouping: df.groupby('A').sum(); or df.groupby('A').agg([np.sum, np.mean]) to aggregate in many ways. Each column can be aggregated differently: df.groupby('A').agg({'C': np.sum, 'D': lambda x: np.std(x, ddof=1)}).

    Methods transform(), filter(), resample(), expanding(), rolling() and emw() can be used after df.groupby(). For grouping datetime types, pd.Grouper(key, level, freq, axis, sort) is useful.

  • How can I reshape a Pandas DataFrame?
    Some mthods that reshape a Pandas DataFrame. Source: PyData 2020d.
    Some mthods that reshape a Pandas DataFrame. Source: PyData 2020d.

    When data is in "record" form, pivot() spreads rows into columns. Method melt() does the opposite. Methods stack() and unstack() are designed for MultiIndex objects and they're closely related to pivot. They can be applied on Series or DataFrame. Here are a few examples:

    • df.pivot(index='foo', columns='bar', values='baz'): Column 'foo' becomes the index, 'bar' values become new columns and values of 'baz' becomes values of the new DataFrame. A more generalized API is df.pivot_table() that allows for duplicate values of an index/column pair.
    • df.melt(id_vars=['A','B']): Two columns are retained and other columns are spread into rows. Two new columns named 'variable' (with old column names as values) and 'value' are introduced. Original index can be retained but values will be duplicated.
    • df.stack(): Columns becomes part of a new inner-most index level. If DataFrame has hierarchical column labels, level can be specified as argument.
    • df.unstack('second') or df.unstack(1): Values of index 'second' are spread into new columns. If no argument is supplied, the inner-most index is spread.

    These methods when combined with GroupBy and aggregations can be expressive. Examples: df.stack().mean(1).unstack(), df.groupby(level=1, axis=1).mean(), df.stack().groupby(level=1).mean() and df.mean().unstack(0).

  • What are the ways to combine two datasets in Pandas?
    Database-style joins of two Pandas DataFrame structures. Source: Jain 2020.
    Database-style joins of two Pandas DataFrame structures. Source: Jain 2020.

    Perhaps the simplest is to understand is concatenating two or more frames that share the same column labels. Rows of one are concatenated to the other: pd.concat([df1, df2]). To concatenate column-wise, use pd.concat([df1, df2], axis=1). If the index values don't match, we re-index before concatenating: pd.concat([df1, df2.reindex(df1.index)], axis=1). Argument join takes values outer (default) and inner. An outermost level of multi-index can be created with pd.concat([df1, df2], keys=['A','B']).

    Database-style joins are possible with df.join(df1) in which the argument how takes four values: left, right, outer, inner. By default, joins are based on index. With on argument, we can choose to join column or index level of df with index of df1. Joins based on MultiIndex are supported.

    Method df.merge() is more flexible than join since index levels or columns can be used. If merging on only columns, indices are ignored. Unlike join, cross merge (a cartesian product of both frames) is possible. Methods pd.merge(), pd.merge_ordered() and pd.merge_asof() are related.

    Examples of merge, join and concatenate are available in the user guide.

  • Could you share some best practices or tips for using Pandas DataFrame?

    Here are some tips and pointers relevant to DataFrame operations:

    • While df['A'] is used to access a column, df.A is a handy short form. This is called attribute access. Column name must be a valid Python identifier and it shouldn't conflict with existing method/attribute. Thus, df.1 and df.first name (has a whitespace) are not allowed. df.min and df.index conflict with existing method/attribute.
    • Syntax df['A'] results in a Series data structure whereas df[['A']] returns another DataFrame.
    • On a MultiIndex DataFrame, consider dfmi['one']['second'] versus dfmi.loc[:, ('one','second')]. Both give the same result but the latter is faster. The former form is actually two operations: first dfmi['one'] returns a DataFrame, then ['second'] is chained to it.
    • To create an explicit copy, write df.copy(). The statements df1 = df; df1[0:3] = 0 doesn't create a copy and modifies the original DataFrame.
    • Remember that loc[] is for label-based indexing and iloc[] is for integer-based indexing. Integers may be used for the former but they're treated as labels.
    • When filtering by datetime values, their attributes can be used, such as, df.loc[(df['Date'].dt.month==3) | (df['Date'].dt.month==11)].

Milestones

Jul
2019
Anatomy of a Pandas groupby() with named aggregation. Source: Lynn 2015.
Anatomy of a Pandas groupby() with named aggregation. Source: Lynn 2015.

In Pandas 0.25.0, Named Aggregation is introduced. This uses a named tuple pd.NamedAgg() with two fields column and aggfunc. The latter can be set to a callable or a string alias. For a more compact code, plain tuples can be used instead. This is now the recommended replacement for using "dict-of-dicts", which was deprecated in version 0.20.0 (May 2017).

Jul
2020

In Pandas 1.1.0, ignore_index argument is added to melt() with default value True. If set to False, index is retained although duplicate values will be present in the result. For GroupBy, it's now possible to retain NA values in group keys. For example, df.groupby(by=["b"], dropna=False).sum() will retain NA values in column 'b'.

Dec
2020

In Pandas 1.2.0, exponentially weighted window operations are supported. Thus, we can write for example df.groupby('A').ewm(com=1.0).mean().

References

  1. Babu, Vinay. 2020. "Dataframe groupby date and time." Blog, Kanoki, May 26. Accessed 2021-01-12.
  2. Bosler, Fabian. 2019. "Pandas’ groupby explained in detail." Towards Data Science, on Medium, November 13. Accessed 2021-01-12.
  3. Bowne-Anderson, Hugo. 2017. "Hierarchical indices, groupby and pandas." Tutorial, DataCamp, October 3. Accessed 2021-01-12.
  4. Data Carpentry. 2020. "Indexing, Slicing and Subsetting DataFrames in Python." In: Data Analysis and Visualization in Python for Ecologists, Data Carpentry, December 1. Accessed 2021-01-12.
  5. Jain, Harsh. 2020. "Three ways to combine DataFrames in Pandas." edpresso, Educative, Inc., April 9. Accessed 2021-01-12.
  6. Lynn, Shane. 2015. "Pandas Groupby: Summarising, Aggregating, and Grouping data in Python." June 14. Updated 2020-06-30. Accessed 2021-01-12.
  7. PyData. 2019. "What's new in 0.25.0." Release Notes, Pandas, July 18. Updated 2020-10-07. Accessed 2021-01-13.
  8. PyData. 2020a. "10 minutes to pandas." User Guide, Pandas, v1.2.0, December 29. Accessed 2021-01-12.
  9. PyData. 2020b. "Intro to data structures." User Guide, Pandas, v1.2.0, November 26. Accessed 2021-01-12.
  10. PyData. 2020c. "Essential basic functionality." User Guide, Pandas, v1.2.0, November 14. Accessed 2021-01-12.
  11. PyData. 2020d. "Reshaping and pivot tables." User Guide, Pandas, v1.2.0, October 7. Accessed 2021-01-12.
  12. PyData. 2020e. "Group by: split-apply-combine." User Guide, Pandas, v1.2.0, November 25. Accessed 2021-01-12.
  13. PyData. 2020f. "What's new in 1.2.0." Release Notes, Pandas, December 26. Accessed 2021-01-13.
  14. PyData. 2020g. "What's new in 1.1.0." Release Notes, Pandas, October 7. Accessed 2021-01-14.
  15. PyData. 2020h. "Indexing and selecting data." User Guide, Pandas, v1.2.0, December 15. Accessed 2021-01-14.
  16. PyData. 2020i. "MultiIndex / advanced indexing." User Guide, Pandas, v1.2.0, October 7. Accessed 2021-01-14.
  17. PyData. 2020j. "Merge, join, concatenate and compare." User Guide, Pandas, v1.2.0, November 19. Accessed 2021-01-14.
  18. PyData. 2020k. "pandas.DataFrame.join." API Reference, Pandas, v1.2.0, December 18. Accessed 2021-01-15.
  19. PyData. 2020l. "pandas.DataFrame.merge." API Reference, Pandas, v1.2.0, December 18. Accessed 2021-01-15.
  20. PyData. 2020m. "pandas.merge." API Reference, Pandas, v1.2.0, December 23. Accessed 2021-01-15.
  21. PyData. 2021. "How do I select a subset of a DataFrame?" Getting started tutorials, Pandas, v1.2.0, January 6. Accessed 2021-01-12.
  22. Stringham, Jessica. 2019. "Pandas MultiIndex cheatsheet." Blog, December 10. Accessed 2021-01-14.
  23. The Coding Bot. 2019. "How do we filter rows of a pandas Dataframe by a column value." The Coding Bot, July 27. Accessed 2021-01-12.
  24. XsPDF. 2021. "Filter dataframe rows by index name." XsPDF. Accessed 2021-01-12.

Further Reading

  1. Devlin, Josh. 2020. "Pandas Cheat Sheet — Python for Data Science." Blog, Dataquest, March 4. Accessed 2021-01-12.
  2. VanderPlas, Jake. 2016. "Hierarchical Indexing." In: Python Data Science Handbook, O'Reilly Media, Inc. Accessed 2021-01-12.
  3. PyData. 2020e. "Group by: split-apply-combine." User Guide, Pandas, v1.2.0, November 25. Accessed 2021-01-12.
  4. PyData. 2020h. "Indexing and selecting data." User Guide, Pandas, v1.2.0, December 15. Accessed 2021-01-14.
  5. PyData. 2020i. "MultiIndex / advanced indexing." User Guide, Pandas, v1.2.0, October 7. Accessed 2021-01-14.
  6. PyData. 2020j. "Merge, join, concatenate and compare." User Guide, Pandas, v1.2.0, November 19. Accessed 2021-01-14.

Article Stats

Author-wise Stats for Article Edits

Author
No. of Edits
No. of Chats
DevCoins
2
1
1057
1592
Words
1
Likes
462
Hits

Cite As

Devopedia. 2021. "Pandas DataFrame Operations." Version 2, January 15. Accessed 2021-03-29. https://devopedia.org/pandas-dataframe-operations
Contributed by
1 author


Last updated on
2021-01-15 06:40:26