Basic Pandas Types
Pandas
Pandas is a library built on Numpy that provides an implementation of a
DataFrame
A
DataFrame
is a multidimensional array with row and column labels and can contain heterogeneous typesPandas provides three main data types:
Series
,DataFrame
, andIndex
Conventional way to import Pandas:
import pandas as pd
Pandas Series
The
Series
type represents a one-dimensional array of indexed dataConstructing
Series
objectspd.Series(data, index=index
)data
can be a list, numpy array, or dictindex
is an array of index values
Indexing
Series
ObjectA
Series
is indexed by its index valuesA
Series
can also be sliced like a Python list
Pandas Series
Example
>>> data = pd.Series([0.25, 0.5, 0.75, 1.0])
>>> data
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
>>> data[1]
0.5
>>> data[1:3]
1 0.50
2 0.75
dtype: float64
Pandas Series
with Index Example
>>> data = pd.Series([0.25, 0.5, 0.75, 1.0],
index = ['a', 'b', 'c', 'd'])
>>> data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
>>> data['b']
0.5
>>> data['b':'d']
b 0.50
c 0.75
d 1.00
dtype: float64
Pandas Series
Attributes
index
: the index object>>> s1 = pd.Series([2,4,6]) >>> s1.index RangeIndex(start=0, stop=3, step=1) >>> s2 = pd.Series({100: 1, 200: 3, 300: 5}) >>> s2.index Int64Index([100, 200, 300], dtype='int64')
values
: the underlying NumPy array>>> s1.values array([2, 4, 6]) >>> s2.values array([1, 3, 5])
Pandas DataFrame
Object
A
DataFrame
is two-dimensional array with flexible row and column namesEach column in a
DataFrame
is aSeries
DataFrame
objects can be constructed from:a single
Series
a list of dicts
a dict of
Series
objectsa two-dimensional Numpy array
Pandas DataFrame
Construction Example
>>> df = pd.DataFrame([[2,4,6], [1,3,5]])
>>> df
0 1 2
0 2 4 6
1 1 3 5
>>> df.index
RangeIndex(start=0, stop=2, step=1)
>>> df.columns
RangeIndex(start=0, stop=3, step=1)
Pandas DataFrame
Construction Examples
>>> pd.DataFrame(np.ones((3,2)),
columns=['one', 'two'],
index=['a', 'b', 'c'])
one two
a 1.0 1.0
b 1.0 1.0
c 1.0 1.0
>>> pd.DataFrame([{'a': i, 'b': 2 * i}
for i in range(3)])
a b
0 0 0
1 1 2
2 2 4
Adding/Removing Columns from a DataFrame
Add a column (similar to adding an element to a
dict
):>>> df one two 0 1 3 1 2 4 >>> df['three'] = [5, 6] >>> df one two three 0 1 3 5 1 2 4 6
Remove a column:
>>> df.pop('three') # or del df['three'] >>> df one two 0 1 3 1 2 4
Pandas Index
Object
An
Index
enables the reference and modification of elements inSeries
andIndex
objectsAn
Index
can be thought of as an immutable array or as an ordered setExample
>>> indA = pd.Index([1, 3, 5, 7, 9]) >>> indB = pd.Index([2, 3, 5, 7, 11]) >>> indA[::2] Int64Index([1, 5, 9], dtype='int64') >>> indA & indB # set intersection Int64Index([3, 5, 7], dtype='int64') >>> indA | indB # set union Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Pandas Indexers
Indexer attributes expose slicing interfaces to the data in a
Series
objectloc
allows indexing and slicing based on the explicit indexiloc
allows indexing and slicing based on the implicit Python-style indexix
is a hybrid of the previous approaches
Indexers can provide access to Numpy-style indexing such as masking and fancy indexing
In Pandas, indexing refers to columns, slicing refers to rows
Pandas Indexer Examples
>>> data
one two
a 0.495141 0.965454
b 0.673145 0.246473
c 0.716398 0.730835
>>> data.loc[:'b', :'one']
one
a 0.495141
b 0.673145
# equivalent to the above
>>> data.iloc[:2, :1]
>>> data.ix[:2, :'one']
Summary of Selection on DataFrames
Operation | Syntax | Result Type |
---|---|---|
select a column | df[col] |
Series |
select row by label | df.loc[label] |
Series |
select row by integer location | df.iloc[loc] |
Series |
slice rows | df[5:10] |
DataFrame |
select rows by boolean vector | df[bool_vec] |
DataFrame |
Pandas and UFuncs
Indices are preserved when using ufuncs
Indices are aligned when performing binary ufuncs
Index and column alignment is preserved when performing operations between
DataFrame
andSeries
objects
Pandas and UFuncs Examples
>>> A = pd.DataFrame(
np.arange(4).reshape((2,2)),
columns=['one', 'two'])
>>> B = pd.DataFrame(
np.arange(3).reshape((3,3)),
columns=['three', 'two', 'one'])
>>> A + B
one three two
0 2.0 NaN 2.0
1 7.0 NaN 7.0
2 NaN NaN NaN
Missing Data
Pandas treats
None
andNaN
as null (missing) valuesFunctions related to missing values:
isnull
: return boolean mask of null valuesnotnull
: opposite ofisnull
dropna
: filter out missing valuesfillna
: replace missing values with a specified value