Group by: split-apply-combine
This function is especially useful for us at Ambee as it helps us carry out a lot of essential analytics. It can be used similarly for a myriad of other purposes. Keep in mind that
by “group by” we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
Out of these, the split step is the most straightforward. In fact, in many situations, we may wish to split the data set into groups and do something with those groups. In the ‘apply’ step, we might wish to do one of the following:
Aggregation: compute a summary statistic (or statistics) for each group. Some examples:
- Compute group sums or means.
- Compute group sizes/counts
Transformation: perform some group-specific computations and return a like-indexed object. Some examples:
- Standardize data (zscore) within a group.
- Filling NAs within groups with a value derived from each group.
Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:
- Discard data that belongs to groups with only a few members.
- Filter out data based on the group sum or mean.
Many of these operations are defined on GroupBy objects. These operations are similar to the aggregating API, window API, and resample API.
It is possible that a given operation does not fall into one of these categories or is some combination of them. In such a case, it may be possible to compute the operation using GroupBy’s apply method. This method will examine the results of the apply step and try to return a sensibly combined result if it doesn’t fit into either of the above two categories.
Note: An operation that is split into multiple steps using built-in GroupBy operations will be more efficient than using the apply method with a user-defined Python function.
Since the set of object instance methods on pandas data structures are generally rich and expressive, we often simply want to invoke, say, a DataFrame function on each group. The name GroupBy should be quite familiar to those who have used a SQL-based tool (or itertools) in which you can write code like:
SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUPBY Column1, Column2
We aim to make operations like this natural and easy to express using pandas. We’ll address each area of GroupBy functionality and then provide some non-trivial examples/use cases.
See the cookbook for some advanced strategies.
Splitting an object into groups
In Pandas, objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. To create a GroupBy object (more on what the GroupBy object is later), you may do the following:
In [1]: speeds = pd.DataFrame(
...: [
...: ("bird", "Falconiformes", 389.0),
...: ("bird", "Psittaciformes", 24.0),
...: ("mammal", "Carnivora", 80.2),
...: ("mammal", "Primates", np.nan),
...: ("mammal", "Carnivora", 58),
...: ],
...: index=["falcon", "parrot", "lion", "monkey", "leopard"],
...: columns=("class", "order", "max_speed"),
...: )
...:
In [2]: speeds
Out[2]:
class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
lion mammal Carnivora 80.2
monkey mammal Primates NaN
leopard mammal Carnivora 58.0
# default is axis=0
In [3]: grouped = speeds.groupby("class")
In [4]: grouped = speeds.groupby("order", axis="columns")
In [5]: grouped = speeds.groupby(["class", "order"])
The mapping can be specified in many different ways:
- A Python function is to be called on each of the axis labels.
- A list or NumPy array of the same length as the selected axis.
- A dict or Series, providing a label -> group name mapping.
- For DataFrame objects, a string indicating either a column name or an index level name to be used to group.
- df.groupby('A') is just syntactic sugar for df.groupby(df['A']).
- A list of any of the above things.
Collectively we refer to the grouping objects as the keys. For example, consider the following DataFrame:
Note: A string passed to groupby may refer to either a column or an index level. If a string matches both a column name and an index-level name, a ValueError will be raised.
In [6]: df = pd.DataFrame(
...: {
...: "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
...: "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
...: "C": np.random.randn(8),
...: "D": np.random.randn(8),
...: }
...: )
...:
In [7]: df
Out[7]:
A B C D
0 foo one 0.469112 -0.861849
1 bar one -0.282863 -2.104569
2 foo two -1.509059 -0.494929
3 bar three -1.135632 1.071804
4 foo two 1.212112 0.721555
5 bar two -0.173215 -0.706771
6 foo one 0.119209 -1.039575
7 foo three -1.044236 0.271860
On a DataFrame, we obtain a GroupBy object by calling groupby(). We could naturally group by either the A or B columns or both:
In [8]: grouped = df.groupby("A")
In [9]: grouped = df.groupby(["A", "B"])
If we also have a MultiIndex on columns A and B, we can group by all but the specified columns
In [10]: df2 = df.set_index(["A", "B"])
In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))
In [12]: grouped.sum()
Out[12]:
C D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938
These will split the DataFrame on its index (rows). We could also split by the columns:
In [13]: def get_letter_type(letter):
....: if letter.lower() in 'aeiou':
....: return 'vowel'
....: else:
....: return 'consonant'
....:
In [14]: grouped = df.groupby(get_letter_type, axis=1)
pandas Index objects support duplicate values. If a non-unique index is used as the group key in a groupby operation, all values for the same index value will be considered to be in one group, and thus the output of aggregation functions will only contain unique index values:
In [15]: lst = [1, 2, 3, 1, 2, 3]
In [16]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)
In [17]: grouped = s.groupby(level=0)
In [18]: grouped.first()
Out[18]:
1 1
2 2
3 3
dtype: int64
In [19]: grouped.last()
Out[19]:
1 10
2 20
3 30
dtype: int64
In [20]: grouped.sum()
Out[20]:
1 11
2 22
3 33
dtype: int64
Note that no splitting occurs until it’s needed. Creating the GroupBy object only verifies that you’ve passed a valid mapping.