Pandas Hierarchical Indexing

CSC 223 - Advanced Scientific Programming

Pandas Hierarchical Indexing

  • It is often useful to have data indexed by more than one key

  • Hierarchical indexing (a.k.a. multi-indexing) incorporates multiple index levels within a single index.

  • Pandas has this capability with the MultiIndex object

  • A Pandas DataFrame can have multiply indexed indices and columns

Multiply Indexed Series

  • A Series object can have multiple index scheme by using tuples as keys

  • Example:

    >>> index = [('A',1), ('A',2), ('B',1), ('B',2)]
    >>> s = pd.Series([1.0,2.0,3.0,4.0], index=index)
    >>> s
    (A, 1)    1.0
    (A, 2)    2.0
    (B, 1)    3.0
    (B, 2)    4.0
    dtype: int64
  • Getting a particular subset of the data can be verbose:

    >>> s[[i for i in s.index if i[1] == 2]]
    (A, 2)    2.0
    (B, 2)    4.0
    dtype: int64

Pandas MultiIndex

  • The Pandas MultiIndex type contains multiple levels of indexing and multiple labels for each data point which encode these levels.

    >>> index = pd.MultiIndex.from_tuples(index)
    >>> index
    MultiIndex(levels=[['A', 'B'], [1, 2]],
               labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
    >>> s = s.reindex(index)
    >>> s
    A  1    1.0
       2    2.0
    B  1    3.0
       2    4.0
    dtype: int64

Pandas MultiIndex

  • Pandas slicing can be used to conveniently access a subset of the data

    >>> s[:,1]
    A    1.0
    B    3.0
    dtype: int64
    >>> s[:,2]
    A    2.0
    B    4.0
    dtype: int64

MultiIndex as Extra Dimension

  • The unstack method can convert a multiply indexed Series into a conventionally indexed DataFrame

    >>> df = s.unstack()
    >>> df
         1    2
    A  1.0  2.0
    B  3.0  4.0
  • The stack method performs the opposite operation

    >>> df = df.stack()
    >>> df
    A  1    1.0
       2    2.0
    B  1    3.0
       2    4.0
    dtype: int64

MultiIndex as Extra Dimension

  • Each level in a multi-index represents an extra dimension of data

  • This property adds more flexibility to the types of data that can be represented

    >>> df = pd.DataFrame({
    ...     'data1': s,
    ...     'data2': [5.0, 6.0, 7.0, 8.0]
    ...     })
    >>> df
         data1  data2
    A 1    1.0    5.0
      2    2.0    6.0
    B 1    3.0    7.0
      2    4.0    8.0

MultiIndex Operations

  • Pandas standard data operations work on hierarchical indexes

    >>> x = df['data1'] + df['data2']
    >>> x.unstack()
          1     2
    A   6.0   8.0
    B  10.0  12.0
  • Hierarchical indexing makes it convenient to explore high dimensional data

Creating MultiIndex Objects

  • A list of two or more index arrays:

    >>> data = np.arange(1.0,9).reshape(2,4).T
    >>> df = pd.DataFrame(data,
    ...     index=[['A','A','B','B'],[1,2,1,2]],
    ...     columns=['data1', 'data2'])
    >>> df
         data1  data2
    A 1    1.0    5.0
      2    2.0    6.0
    B 1    3.0    7.0
      2    4.0    8.0
  • A dictionary with tuples as keys:

    >>> data = {('A',1): 1.0, ('A',2): 2.0,
    ...         ('B',1): 3.0, ('B',2): 4.0}
    >>> pd.Series(data)
    A  1    1.0
       2    2.0
    B  1    3.0
       2    4.0
    dtype: float64

MultiIndex Constructors

  • from_arrays

    >>> pd.MultiIndex.from_arrays(
    ...    [['A','A','B','B'], [1,2,1,2]])
    MultiIndex(levels=[['A', 'B'], [1, 2]],
               labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
  • from_tuples

    >>> pd.MultiIndex.from_tuples(
      ...    [('A',1),('A',2),('B',1),('B',2)])
  • Cartesian product

    >>> pd.MultiIndex.from_product([['A','B'],[1,2]])

MultiIndex Level Names

  • The levels in a MultiIndex can have names

    >>> s.index.names = ['one', 'two']
    >>> s
    one  two
    A    1      1.0
         2      2.0
    B    1      3.0
         2      4.0
    dtype: float64

Example: MultiIndex Columns

>>> index = pd.MultiIndex.from_product(
...     [[2018, 2019], [1,2]],
...     names=['year', 'visit'])
>>> columns = pd.MultiIndex.from_product(
...     [['Alice', 'Bob'],['test2','test1']],
...     names=['name','type'])
>>> health_data = pd.DataFrame(data,
...     index=index, columns=columns)
>>> health_data
name           Alice                 Bob          
type           test2     test1     test2     test1
year visit                                        
2018 1     -0.856664 -1.819912  0.864426  0.127241
     2      0.482796  0.824717  0.605018 -0.613014
2019 1     -1.052893 -0.550636 -0.510358  0.923357
     2      0.427429  0.114259  0.835249  1.095828

Example: MultiIndex Columns

  • Get Bob’s data:

    >>> health_data['Bob']
    type           test2     test1
    year visit
    2018 1      0.864426  0.127241
         2      0.605018 -0.613014
    2019 1     -0.510358  0.923357
         2      0.835249  1.095828
  • Get Alice’s test1 data:

    >>> health_data['Alice', 'test1']
    year  visit
    2018  1       -1.819912
          2        0.824717
    2019  1       -0.550636
          2        0.114259
    Name: (Alice, test1), dtype: float64