Pandas is a library built on Numpy that provides an implementation of a DataFrame
A DataFrame
is a multidimensional array with row and column labels and can contain heterogeneous types
Pandas provides three main data types: Series
, DataFrame
, and Index
Conventional way to import Pandas:
import pandas as pd
Series
The Series
type represents a one-dimensional array of indexed data
Constructing Series
objects
pd.Series(data, index=index
)
data
can be a list, numpy array, or dict
index
is an array of index values
Indexing Series
Object
A Series
is indexed by its index values
A Series
can also be sliced like a Python list
Series
Example>>> data = pd.Series([0.25, 0.5, 0.75, 1.0])
>>> data
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
>>> data[1]
0.5
>>> data[1:3]
1 0.50
2 0.75
dtype: float64
Series
with Index Example>>> data = pd.Series([0.25, 0.5, 0.75, 1.0],
index = ['a', 'b', 'c', 'd'])
>>> data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
>>> data['b']
0.5
>>> data['b':'d']
b 0.50
c 0.75
d 1.00
dtype: float64
Series
Attributesindex
: the index object
>>> s1 = pd.Series([2,4,6])
>>> s1.index
RangeIndex(start=0, stop=3, step=1)
>>> s2 = pd.Series({100: 1, 200: 3, 300: 5})
>>> s2.index
Int64Index([100, 200, 300], dtype='int64')
values
: the underlying NumPy array
>>> s1.values
array([2, 4, 6])
>>> s2.values
array([1, 3, 5])
DataFrame
ObjectA DataFrame
is two-dimensional array with flexible row and column names
Each column in a DataFrame
is a Series
DataFrame
objects can be constructed from:
a single Series
a list of dicts
a dict of Series
objects
a two-dimensional Numpy array
DataFrame
Construction Example>>> df = pd.DataFrame([[2,4,6], [1,3,5]])
>>> df
0 1 2
0 2 4 6
1 1 3 5
>>> df.index
RangeIndex(start=0, stop=2, step=1)
>>> df.columns
RangeIndex(start=0, stop=3, step=1)
DataFrame
Construction Examples>>> pd.DataFrame(np.ones((3,2)),
columns=['one', 'two'],
index=['a', 'b', 'c'])
one two
a 1.0 1.0
b 1.0 1.0
c 1.0 1.0
>>> pd.DataFrame([{'a': i, 'b': 2 * i}
for i in range(3)])
a b
0 0 0
1 1 2
2 2 4
Add a column (similar to adding an element to a dict
):
>>> df
one two
0 1 3
1 2 4
>>> df['three'] = [5, 6]
>>> df
one two three
0 1 3 5
1 2 4 6
Remove a column:
>>> df.pop('three') # or del df['three']
>>> df
one two
0 1 3
1 2 4
Index
ObjectAn Index
enables the reference and modification of elements in Series
and Index
objects
An Index
can be thought of as an immutable array or as an ordered set
Example
>>> indA = pd.Index([1, 3, 5, 7, 9])
>>> indB = pd.Index([2, 3, 5, 7, 11])
>>> indA[::2]
Int64Index([1, 5, 9], dtype='int64')
>>> indA & indB # set intersection
Int64Index([3, 5, 7], dtype='int64')
>>> indA | indB # set union
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Indexer attributes expose slicing interfaces to the data in a Series
object
loc
allows indexing and slicing based on the explicit index
iloc
allows indexing and slicing based on the implicit Python-style index
ix
is a hybrid of the previous approaches
Indexers can provide access to Numpy-style indexing such as masking and fancy indexing
In Pandas, indexing refers to columns, slicing refers to rows
>>> data
one two
a 0.495141 0.965454
b 0.673145 0.246473
c 0.716398 0.730835
>>> data.loc[:'b', :'one']
one
a 0.495141
b 0.673145
# equivalent to the above
>>> data.iloc[:2, :1]
>>> data.ix[:2, :'one']
Operation | Syntax | Result Type |
---|---|---|
select a column | df[col] |
Series |
select row by label | df.loc[label] |
Series |
select row by integer location | df.iloc[loc] |
Series |
slice rows | df[5:10] |
DataFrame |
select rows by boolean vector | df[bool_vec] |
DataFrame |
Indices are preserved when using ufuncs
Indices are aligned when performing binary ufuncs
Index and column alignment is preserved when performing operations between DataFrame
and Series
objects
>>> A = pd.DataFrame(
np.arange(4).reshape((2,2)),
columns=['one', 'two'])
>>> B = pd.DataFrame(
np.arange(3).reshape((3,3)),
columns=['three', 'two', 'one'])
>>> A + B
one three two
0 2.0 NaN 2.0
1 7.0 NaN 7.0
2 NaN NaN NaN
Pandas treats None
and NaN
as null (missing) values
Functions related to missing values:
isnull
: return boolean mask of null values
notnull
: opposite of isnull
dropna
: filter out missing values
fillna
: replace missing values with a specified value