Data analysis of large data typically requires some form of summarization.
Aggregations (sum, mean, min, etc.) reduce the data to a single number and may provide insight into the nature of large datasets.
Pandas includes several methods for computing aggregations
A Pandas Series
supports aggregations similar to Numpy arrays:
>>> s = pd.Series(range(10))
>>> s.sum()
45
>>> s.mean()
4.5
A DataFrame
aggregates column-wise by default, but can also take an axis argument:
>>> df = pd.DataFrame({'A': [1,2], 'B': [3,4]})
>>> df.mean()
A 1.5
B 3.5
dtype: float64
>>> df.mean(axis=1)
0 2.0
1 3.0
dtype: float64
describe
The describe
method computes all the common aggregations of a Series
or DataFrame
:
>>> df = pd.DataFrame({'A': [1,2], 'B': [3,4]})
>>> df.describe()
A B
count 2.000000 2.000000
mean 1.500000 3.500000
std 0.707107 0.707107
min 1.000000 3.000000
25% 1.250000 3.250000
50% 1.500000 3.500000
75% 1.750000 3.750000
max 2.000000 4.000000
Aggregation | Description |
count |
total number of items |
first, last |
first and last item |
mean, median |
mean and median |
min, max |
minimum and maximum |
std, var |
standard deviation and variance |
mad |
mean absolute deviation |
prod |
product of all items |
sum |
sum of all items |
The groupby operation enables conditional aggregations based on some label of index
The name “group by” comes from SQL database language
The groupby operation essentially does the following:
split: break up and group the data based on the specified key
apply: compute some function (aggregation, transformation, or filtering) within the individual groups
combine: merge the results into an output array
groupby
ExampleCreate a DataFrame
:
>>> df = pd.DataFrame({'key': list('ABCABC'),
... 'data1': range(6), 'data2': range(7,13)})
>>> df
key data1 data2
0 A 0 7
1 B 1 8
2 C 2 9
3 A 3 10
4 B 4 11
5 C 5 12
Group by key (becomes the new index) and apply sum:
>>> df.groupby('key').sum()
data1 data2
key
A 3 17
B 5 19
C 7 21
The aggregate
method can apply multiple aggregations
Example: a list of functions
>>> df.groupby('key').aggregate(
... ['min', np.median, max])
data1 data2
min median max min median max
key
A 0 1.5 3 7 8.5 10
B 1 2.5 4 8 9.5 11
C 2 3.5 5 9 10.5 12
Example: a dictionary mapping column names to functions
>>> df.groupby('key').aggregate({'data1': min,
'data2': max})
data1 data2
key
A 0 10
B 1 11
C 2 12
A filtering operation drops data based on group properties
The filter
method takes a function that takes a DataFrame
as a parameter and returns a Boolean
>>> df.groupby('key').filter(
... lambda x: x['data1'].sum() > 3)
key data1 data2
1 B 1 8
2 C 2 9
4 B 4 11
5 C 5 12
The transform
method returns a transformed version of the data; the output is the same shape as the input.
Common example: center data by subtracting the group-wise mean:
>>> df.groupby('key').transform(
... lambda x: x - x.mean())
data1 data2
0 -1.5 -1.5
1 -1.5 -1.5
2 -1.5 -1.5
3 1.5 1.5
4 1.5 1.5
5 1.5 1.5
The apply
method can apply arbitrary functions that take a DataFrame
as an argument and returns a DataFrame
, Series
, or scalar value; the combine operation will be tailored to the type of output returned.
Example: normalize the first column by the sum of the second
>>> def f(x):
... x['data1'] /= x['data2'].sum()
... return x
...
>>> df.groupby('key').apply(f)
key data1 data2
0 A 0.000000 7
1 B 0.052632 8
2 C 0.095238 9
3 A 0.176471 10
4 B 0.210526 11
5 C 0.238095 12
The key can be a sequence with a length matching that of the DataFrame
>>> L = [0, 1, 0, 1, 2, 0]
>>> df.groupby(L).sum()
data1 data2
0 7 28
1 4 18
2 4 11
The key can be a dictionary that maps index values to group keys
>>> df2 = df.set_index('key')
>>> mapping = {'A': 'vowel', 'B': 'consonant',
... 'C': 'consonant'}
>>> df2.groupby(mapping).sum()
data1 data2
consonant 12 40
vowel 3 17
The key can be a Python function that takes an index value and returns a group
>>> df2.groupby(str.lower).mean()
data1 data2
a 1.5 8.5
b 2.5 9.5
c 3.5 10.5
The key can be a list of key choices combined to group on a multi-index
>>> df2.groupby([str.lower, mapping]).mean()
data1 data2
a vowel 1.5 8.5
b consonant 2.5 9.5
c consonant 3.5 10.5