Pandas Hierarchical Indexing
It is often useful to index data by more than one key.
Hierarchical indexing (a.k.a. multi-indexing) incorporates multiple index levels within a single index.
Pandas provides this capability through the MultiIndex object.
A Pandas DataFrame can have a MultiIndex on both its rows and its columns.
Multiply Indexed Series
A Series can be indexed by more than one key by using tuples as index labels.
Example:
>>> import numpy as np
>>> import pandas as pd
>>> index = [('A',1), ('A',2), ('B',1), ('B',2)]
>>> s = pd.Series([1.0,2.0,3.0,4.0], index=index)
>>> s
(A, 1)    1.0
(A, 2)    2.0
(B, 1)    3.0
(B, 2)    4.0
dtype: float64
Getting a particular subset of the data can be verbose:
>>> s[[i for i in s.index if i[1] == 2]]
(A, 2)    2.0
(B, 2)    4.0
dtype: float64
Pandas MultiIndex
The Pandas MultiIndex type contains multiple levels of indexing, plus multiple labels for each data point that encode these levels (pandas 0.24 and later name these labels codes).
>>> index = pd.MultiIndex.from_tuples(index)
>>> index
MultiIndex(levels=[['A', 'B'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> s = s.reindex(index)
>>> s
A  1    1.0
   2    2.0
B  1    3.0
   2    4.0
dtype: float64
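To make the levels/labels encoding concrete, here is a minimal sketch that rebuilds every row key from the two pieces. It assumes pandas 0.24 or later, where the per-row positions are exposed as `codes`; the variable names are illustrative only.

```python
import pandas as pd

# Same four-entry index as in the slides.
index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)])

# levels: the distinct values available at each level of the hierarchy.
levels = [level.tolist() for level in index.levels]

# codes: for every row, the position of its label within each level.
# (Assumption: pandas >= 0.24; older versions called this attribute `labels`.)
codes = [code.tolist() for code in index.codes]

# Rebuild each row key by looking its codes up in the levels.
rebuilt = [
    tuple(levels[lvl][codes[lvl][row]] for lvl in range(index.nlevels))
    for row in range(len(index))
]

print(levels)
print(codes)
print(rebuilt)
```

Storing each distinct value once per level and referring to it by position is what lets the index print 'A' and 'B' only once in the Series display.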
Pandas MultiIndex
Pandas slicing can be used to conveniently access a subset of the data:
>>> s[:,1]
A    1.0
B    3.0
dtype: float64
>>> s[:,2]
A    2.0
B    4.0
dtype: float64
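The same idea extends to the other level and to single values. The sketch below rebuilds the Series so it runs on its own, and uses `.loc` as the explicit label-based form of the bracket syntax shown above; the variable names are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Fix the second level at 2 and keep every first-level key.
second_is_two = s.loc[:, 2]

# Fix the first level at 'A'; the result is indexed by the second level only.
first_is_a = s['A']

# Give both keys to reach a single value.
single = s.loc['A', 2]

print(second_is_two)
print(first_is_a)
print(single)
```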
MultiIndex as Extra Dimension
The unstack method converts a multiply indexed Series into a conventionally indexed DataFrame:
>>> df = s.unstack()
>>> df
     1    2
A  1.0  2.0
B  3.0  4.0
The stack method performs the opposite operation:
>>> df = df.stack()
>>> df
A  1    1.0
   2    2.0
B  1    3.0
   2    4.0
dtype: float64
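A short sketch of the round trip, checking that stacking the unstacked frame gives back the original Series. It assumes the data has no missing combinations (otherwise unstack would introduce NaN values); the names `wide` and `tall` are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['A', 'B'], [1, 2]])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

wide = s.unstack()   # inner index level becomes the columns
tall = wide.stack()  # columns fold back into an inner index level

# True when values and index both match the original Series.
round_trip_ok = tall.equals(s)
print(wide.shape, round_trip_ok)
```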
MultiIndex as Extra Dimension
Each level in a multi-index represents an extra dimension of data.
This property adds more flexibility to the types of data that can be represented:
>>> df = pd.DataFrame({
...     'data1': s,
...     'data2': [5.0, 6.0, 7.0, 8.0]
... })
>>> df
     data1  data2
A 1    1.0    5.0
  2    2.0    6.0
B 1    3.0    7.0
  2    4.0    8.0
MultiIndex Operations
Standard Pandas data operations work on hierarchical indexes:
>>> x = df['data1'] + df['data2']
>>> x.unstack()
      1     2
A   6.0   8.0
B  10.0  12.0
Hierarchical indexing makes it convenient to explore high-dimensional data.
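As one more illustration of a standard operation on a hierarchical index, the sketch below aggregates over the inner level by grouping on the outer one. The choice of `groupby(level=0)` is an example picked here, not something taken from the slides.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['A', 'B'], [1, 2]])
df = pd.DataFrame(
    {'data1': [1.0, 2.0, 3.0, 4.0], 'data2': [5.0, 6.0, 7.0, 8.0]},
    index=index,
)

# Element-wise arithmetic aligns on the full (outer, inner) key.
total = df['data1'] + df['data2']

# Average over the inner level by grouping on the outer level (level 0).
mean_by_outer = df.groupby(level=0).mean()

print(total)
print(mean_by_outer)
```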
Creating MultiIndex Objects
A list of two or more index arrays:
>>> data = np.arange(1.0,9).reshape(2,4).T
>>> df = pd.DataFrame(data,
...     index=[['A','A','B','B'],[1,2,1,2]],
...     columns=['data1', 'data2'])
>>> df
     data1  data2
A 1    1.0    5.0
  2    2.0    6.0
B 1    3.0    7.0
  2    4.0    8.0
A dictionary with tuples as keys:
>>> data = {('A',1): 1.0, ('A',2): 2.0,
...         ('B',1): 3.0, ('B',2): 4.0}
>>> pd.Series(data)
A  1    1.0
   2    2.0
B  1    3.0
   2    4.0
dtype: float64
MultiIndex Constructors
from_arrays:
>>> pd.MultiIndex.from_arrays(
...     [['A','A','B','B'], [1,2,1,2]])
MultiIndex(levels=[['A', 'B'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
from_tuples:
>>> pd.MultiIndex.from_tuples(
...     [('A',1),('A',2),('B',1),('B',2)])
Cartesian product (from_product):
>>> pd.MultiIndex.from_product([['A','B'],[1,2]])
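The three constructors above describe the same index in different ways. A minimal check, with illustrative variable names:

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['A', 'A', 'B', 'B'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)])
from_product = pd.MultiIndex.from_product([['A', 'B'], [1, 2]])

# All three hold the same (outer, inner) keys in the same order.
all_equal = from_arrays.equals(from_tuples) and from_tuples.equals(from_product)
print(all_equal)
```

from_product is the shortest form when every combination of keys is present; from_arrays and from_tuples also cover indexes with missing combinations.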
MultiIndex Level Names
The levels in a MultiIndex can have names:
>>> s.index.names = ['one', 'two']
>>> s
one  two
A    1      1.0
     2      2.0
B    1      3.0
     2      4.0
dtype: float64
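Once levels are named, the names can stand in for level positions in other methods. The sketch below is one possible illustration using `xs` and `unstack`; it passes the names directly to the constructor rather than assigning them afterwards.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['A', 'B'], [1, 2]], names=['one', 'two'])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Cross-section: rows where level 'two' equals 2 (that level is dropped).
two_is_2 = s.xs(2, level='two')

# Pivot the level named 'one' into the columns.
wide = s.unstack(level='one')

print(two_is_2)
print(wide)
```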
Example: MultiIndex Columns
>>> index = pd.MultiIndex.from_product(
... [[2018, 2019], [1,2]],
... names=['year', 'visit'])
>>> columns = pd.MultiIndex.from_product(
... [['Alice', 'Bob'],['test2','test1']],
... names=['name','type'])
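>>> data = np.random.randn(4, 4)  # assumed: random values; the slide does not show how data was built
>>> health_data = pd.DataFrame(data,
...     index=index, columns=columns)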
>>> health_data
name           Alice                 Bob
type           test2     test1     test2     test1
year visit
2018 1     -0.856664 -1.819912  0.864426  0.127241
     2      0.482796  0.824717  0.605018 -0.613014
2019 1     -1.052893 -0.550636 -0.510358  0.923357
     2      0.427429  0.114259  0.835249  1.095828
Example: MultiIndex Columns
Get Bob’s data:
>>> health_data['Bob']
type           test2     test1
year visit
2018 1      0.864426  0.127241
     2      0.605018 -0.613014
2019 1     -0.510358  0.923357
     2      0.835249  1.095828
Get Alice’s test1 data:
>>> health_data['Alice', 'test1']
year  visit
2018  1       -1.819912
      2        0.824717
2019  1       -0.550636
      2        0.114259
Name: (Alice, test1), dtype: float64
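For a version of this example that can be run end to end, here is a sketch under one assumption: the values are seeded standard-normal draws, so the numbers will differ from those on the slides. The last selection, which combines a column key with a row key, is an addition for illustration.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [[2018, 2019], [1, 2]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product(
    [['Alice', 'Bob'], ['test2', 'test1']], names=['name', 'type'])

# Assumption: seeded random values stand in for the slide's data.
rng = np.random.default_rng(0)
health_data = pd.DataFrame(
    rng.standard_normal((4, 4)), index=index, columns=columns)

bob = health_data['Bob']                     # DataFrame with Bob's two tests
alice_test1 = health_data['Alice', 'test1']  # Series for a single column

# Pick one column, then one year; the result is indexed by visit.
bob_test1_2019 = health_data['Bob', 'test1'].loc[2019]

print(bob)
print(alice_test1)
print(bob_test1_2019)
```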