Pandas Aggregation and Grouping
Aggregation
Analysis of large datasets typically requires some form of summarization.
Aggregations (sum, mean, min, etc.) reduce the data to a single number and may provide insight into the nature of large datasets.
Pandas includes several methods for computing aggregations
Simple Aggregations
A Pandas Series supports aggregations similar to NumPy arrays:
>>> s = pd.Series(range(10))
>>> s.sum()
45
>>> s.mean()
4.5
A DataFrame aggregates column-wise by default, but can also take an axis argument:
>>> df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
>>> df.mean()
A    1.5
B    3.5
dtype: float64
>>> df.mean(axis=1)
0    2.0
1    3.0
dtype: float64
Pandas describe
The describe method computes several common aggregations at once for a Series or DataFrame:
>>> df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
>>> df.describe()
              A         B
count  2.000000  2.000000
mean   1.500000  3.500000
std    0.707107  0.707107
min    1.000000  3.000000
25%    1.250000  3.250000
50%    1.500000  3.500000
75%    1.750000  3.750000
max    2.000000  4.000000
Pandas Built-in Aggregations
Aggregation | Description
count | total number of items
first, last | first and last item
mean, median | mean and median
min, max | minimum and maximum
std, var | standard deviation and variance
mad | mean absolute deviation
prod | product of all items
sum | sum of all items
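Each entry in the table is available as a method on a Series or DataFrame. A minimal sketch, reusing the small two-column DataFrame from above (note that mad has been deprecated in recent pandas releases):
>>> df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
>>> df.sum()      # column-wise sums
A    3
B    7
dtype: int64
>>> df.prod()     # column-wise products
A     2
B    12
dtype: int64
>>> df['A'].var() # variance of a single column
0.5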
Groupby
The groupby operation enables conditional aggregations based on some label or index
The name “group by” comes from the SQL database language
The groupby operation essentially does the following (sketched step by step after the list):
split: break up and group the data based on the specified key
apply: compute some function (aggregation, transformation, or filtering) within the individual groups
combine: merge the results into an output array
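A minimal, hand-rolled sketch of these three steps, using the same example DataFrame constructed in the next subsection; in practice a single groupby call performs all of them:

import pandas as pd

df = pd.DataFrame({'key': list('ABCABC'),
                   'data1': range(6), 'data2': range(7, 13)})

# split: partition the rows by each unique value of 'key'
groups = {k: df[df['key'] == k] for k in df['key'].unique()}

# apply: aggregate each group separately
sums = {k: g[['data1', 'data2']].sum() for k, g in groups.items()}

# combine: assemble the per-group results into a single DataFrame
result = pd.DataFrame(sums).T
print(result)   # same values as df.groupby('key').sum()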
Basic groupby Example
Create a DataFrame:
>>> df = pd.DataFrame({'key': list('ABCABC'),
...                    'data1': range(6), 'data2': range(7, 13)})
>>> df
  key  data1  data2
0   A      0      7
1   B      1      8
2   C      2      9
3   A      3     10
4   B      4     11
5   C      5     12
Group by key (becomes the new index) and apply sum:
>>> df.groupby('key').sum()
     data1  data2
key
A        3     17
B        5     19
C        7     21
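A single column can also be selected from the grouped object before aggregating, which limits the result to that column (a small follow-up sketch on the same df):
>>> df.groupby('key')['data1'].sum()
key
A    3
B    5
C    7
Name: data1, dtype: int64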
Performing Multiple Aggregations
The aggregate method can apply multiple aggregations.
Example: a list of functions
>>> df.groupby('key').aggregate(['min', np.median, max])
    data1            data2
      min median max   min median max
key
A       0    1.5   3     7    8.5  10
B       1    2.5   4     8    9.5  11
C       2    3.5   5     9   10.5  12
Example: a dictionary mapping column names to functions
>>> df.groupby('key').aggregate({'data1': min, 'data2': max})
     data1  data2
key
A        0     10
B        1     11
C        2     12
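The same call can be written with string function names rather than the built-in callables, which recent pandas versions tend to prefer (a sketch of the equivalent spelling):
>>> df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})
     data1  data2
key
A        0     10
B        1     11
C        2     12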
Filtering
A filtering operation drops data based on group properties
The filter method takes a function that receives a group's DataFrame and returns a Boolean; groups for which it returns False are dropped:
>>> df.groupby('key').filter(lambda x: x['data1'].sum() > 3)
  key  data1  data2
1   B      1      8
2   C      2      9
4   B      4     11
5   C      5     12
Group A is dropped because its data1 values sum to 3, which does not exceed 3.
Transformation
The transform method returns a transformed version of the data; the output is the same shape as the input.
Common example: center data by subtracting the group-wise mean:
>>> df.groupby('key').transform(lambda x: x - x.mean())
   data1  data2
0   -1.5   -1.5
1   -1.5   -1.5
2   -1.5   -1.5
3    1.5    1.5
4    1.5    1.5
5    1.5    1.5
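transform also accepts the name of a built-in aggregation as a string; a sketch of the same centering done by subtracting the broadcast group means (assuming the df above):
>>> df[['data1', 'data2']] - df.groupby('key')[['data1', 'data2']].transform('mean')
   data1  data2
0   -1.5   -1.5
1   -1.5   -1.5
2   -1.5   -1.5
3    1.5    1.5
4    1.5    1.5
5    1.5    1.5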
Applying Arbitrary Functions
The apply method can apply an arbitrary function that takes a DataFrame as an argument and returns a DataFrame, a Series, or a scalar value; the combine step is tailored to the type of output returned.
Example: normalize the first data column by the sum of the second
>>> def f(x):
...     x['data1'] /= x['data2'].sum()
...     return x
...
>>> df.groupby('key').apply(f)
  key     data1  data2
0   A  0.000000      7
1   B  0.052632      8
2   C  0.095238      9
3   A  0.176471     10
4   B  0.210526     11
5   C  0.238095     12
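When the function passed to apply returns a scalar, the combine step yields one value per group, indexed by the group keys; a small sketch (assuming the same df):
>>> df.groupby('key').apply(lambda x: x['data1'].sum() + x['data2'].sum())
key
A    20
B    24
C    28
dtype: int64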
Specifying the Split Key
The key can be a sequence with a length matching that of the DataFrame:
>>> L = [0, 1, 0, 1, 2, 0]
>>> df.groupby(L).sum()
   data1  data2
0      7     28
1      4     18
2      4     11
The key can be a dictionary that maps index values to group keys
>>> df2 = df.set_index('key')
>>> mapping = {'A': 'vowel', 'B': 'consonant',
...            'C': 'consonant'}
>>> df2.groupby(mapping).sum()
           data1  data2
consonant     12     40
vowel          3     17
Specifying the Split Key (Continued)
The key can be a Python function that takes an index value and returns a group key
>>> df2.groupby(str.lower).mean()
   data1  data2
a    1.5    8.5
b    2.5    9.5
c    3.5   10.5
The key can be a list of any of the preceding key choices, combined to group on a multi-index
>>> df2.groupby([str.lower, mapping]).mean()
             data1  data2
a vowel        1.5    8.5
b consonant    2.5    9.5
c consonant    3.5   10.5