Basic Pandas Types

CSC 223 - Advanced Scientific Programming

Pandas

Pandas is a library built on Numpy that provides an implementation of a DataFrame
A DataFrame is a multidimensional array with row and column labels and can contain heterogeneous types
Pandas provides three main data types: Series, DataFrame, and Index
Conventional way to import Pandas:
```
import pandas as pd
```

Pandas `Series`

The Series type represents a one-dimensional array of indexed data
Constructing Series objects
- pd.Series(data, index=index)
- data can be a list, numpy array, or dict
- index is an array of index values
Indexing Series Object
- A Series is indexed by its index values
- A Series can also be sliced like a Python list

Pandas `Series` Example

>>> data = pd.Series([0.25, 0.5, 0.75, 1.0])
>>> data
0   0.25
1   0.50
2   0.75
3   1.00
dtype: float64
>>> data[1]
0.5
>>> data[1:3]
1    0.50
2    0.75
dtype: float64

Pandas `Series` with Index Example

>>> data = pd.Series([0.25, 0.5, 0.75, 1.0],
           index = ['a', 'b', 'c', 'd'])
>>> data
a   0.25
b   0.50
c   0.75
d   1.00
dtype: float64
>>> data['b']
0.5
>>> data['b':'d']
b    0.50
c    0.75
d    1.00
dtype: float64

Pandas `Series` Attributes

index: the index object

>>> s1 = pd.Series([2,4,6])
>>> s1.index
RangeIndex(start=0, stop=3, step=1)
>>> s2 = pd.Series({100: 1, 200: 3, 300: 5})
>>> s2.index
Int64Index([100, 200, 300], dtype='int64')

values: the underlying NumPy array

>>> s1.values
array([2, 4, 6])
>>> s2.values
array([1, 3, 5])

Pandas `DataFrame` Object

A DataFrame is two-dimensional array with flexible row and column names
Each column in a DataFrame is a Series
DataFrame objects can be constructed from:
- a single Series
- a list of dicts
- a dict of Series objects
- a two-dimensional Numpy array

Pandas `DataFrame` Construction Example

>>> df = pd.DataFrame([[2,4,6], [1,3,5]])
>>> df
   0  1  2
0  2  4  6
1  1  3  5
>>> df.index
RangeIndex(start=0, stop=2, step=1)
>>> df.columns
RangeIndex(start=0, stop=3, step=1)

Pandas `DataFrame` Construction Examples

>>> pd.DataFrame(np.ones((3,2)),
                 columns=['one', 'two'],
                 index=['a', 'b', 'c'])
   one  two
a  1.0  1.0
b  1.0  1.0
c  1.0  1.0
>>> pd.DataFrame([{'a': i, 'b': 2 * i}
                  for i in range(3)])
   a  b
0  0  0
1  1  2
2  2  4

Adding/Removing Columns from a DataFrame

Add a column (similar to adding an element to a dict):

>>> df
   one  two
0    1    3
1    2    4
>>> df['three'] = [5, 6]
>>> df
   one  two  three
0    1    3      5
1    2    4      6

Remove a column:

>>> df.pop('three') # or del df['three']
>>> df
   one  two
0    1    3
1    2    4

Pandas `Index` Object

An Index enables the reference and modification of elements in Series and Index objects
An Index can be thought of as an immutable array or as an ordered set

Example

>>> indA = pd.Index([1, 3, 5, 7, 9])
>>> indB = pd.Index([2, 3, 5, 7, 11])
>>> indA[::2]
Int64Index([1, 5, 9], dtype='int64')
>>> indA & indB # set intersection
Int64Index([3, 5, 7], dtype='int64')
>>> indA | indB # set union
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

Pandas Indexers

Indexer attributes expose slicing interfaces to the data in a Series object
- loc allows indexing and slicing based on the explicit index
- iloc allows indexing and slicing based on the implicit Python-style index
- ix is a hybrid of the previous approaches
Indexers can provide access to Numpy-style indexing such as masking and fancy indexing
In Pandas, indexing refers to columns, slicing refers to rows

Pandas Indexer Examples

>>> data
        one       two
a  0.495141  0.965454
b  0.673145  0.246473
c  0.716398  0.730835

>>> data.loc[:'b', :'one']
        one
a  0.495141
b  0.673145

# equivalent to the above
>>> data.iloc[:2, :1]
>>> data.ix[:2, :'one']

Summary of Selection on DataFrames

Operation	Syntax	Result Type
select a column	`df[col]`	`Series`
select row by label	`df.loc[label]`	`Series`
select row by integer location	`df.iloc[loc]`	`Series`
slice rows	`df[5:10]`	`DataFrame`
select rows by boolean vector	`df[bool_vec]`	`DataFrame`

Pandas and UFuncs

Indices are preserved when using ufuncs
Indices are aligned when performing binary ufuncs
Index and column alignment is preserved when performing operations between DataFrame and Series objects

Pandas and UFuncs Examples

>>> A = pd.DataFrame(
          np.arange(4).reshape((2,2)),
          columns=['one', 'two'])
>>> B = pd.DataFrame(
          np.arange(3).reshape((3,3)),
          columns=['three', 'two', 'one'])
>>> A + B
   one  three  two
0  2.0    NaN  2.0
1  7.0    NaN  7.0
2  NaN    NaN  NaN

Missing Data

Pandas treats None and NaN as null (missing) values
Functions related to missing values:
- isnull: return boolean mask of null values
- notnull: opposite of isnull
- dropna: filter out missing values
- fillna: replace missing values with a specified value

Basic Pandas Types

CSC 223 - Advanced Scientific Programming

Pandas

Pandas Series

Pandas Series Example

Pandas Series with Index Example

Pandas Series Attributes

Pandas DataFrame Object

Pandas DataFrame Construction Example

Pandas DataFrame Construction Examples

Adding/Removing Columns from a DataFrame

Pandas Index Object

Pandas Indexers

Pandas Indexer Examples

Summary of Selection on DataFrames

Pandas and UFuncs

Pandas and UFuncs Examples

Missing Data

Pandas `Series`

Pandas `Series` Example

Pandas `Series` with Index Example

Pandas `Series` Attributes

Pandas `DataFrame` Object

Pandas `DataFrame` Construction Example

Pandas `DataFrame` Construction Examples

Pandas `Index` Object