Pandas Combining Datasets
Combining Datasets
In some data science tasks we need to combine data from different sources
Pandas includes functions and methods for combining
Series
andDataFrame
objects:concatenation
database-style merges and joins
Concatenation
The Pandas
concat
function can be used to concatenateSeries
orDataFrame
objectsThe argument to the
concat
function is a sequence or mapping ofSeries
orDataFrame
objects>>> s1 = pd.Series(['A', 'B', 'C'], index=[1,2,3]) >>> s2 = pd.Series(['D', 'E', 'F'], index=[4,5,6]) >>> pd.concat([s1, s2]) 1 A 2 B 3 C 4 D 5 E 6 F dtype: object
Concatenation
The Pandas
concat
function has additional keyword arguments to specify additional concatenation options:Here is an (incomplete) list of some
concat
optionsaxis
: the axis to concatenate alongignore_index
: if True, do not use the index values along the concatenation axisverify_integrity
: raise an exception if there are duplicate indiceskeys
: construct a hierarchical index using the keys valuesjoin
: specify an inner or outer joinjoin_axes
: specific indices for joining
Concatenation Running Example
Method for example DataFrame construction
>>> def make_df(cols, ind): ... data = {c: [str(c) + str(i) for i in ind] ... for c in cols} ... return pd.DataFrame(data, ind) ... >>> make_df('ABC', range(3)) A B C 0 A0 B0 C0 1 A1 B1 C1 2 A2 B2 C2
Concatenation Running Example
Concatenate default axis
>>> df1 = make_df('AB', [1, 2]) >>> df2 = make_df('AB', [3, 4]) >>> pd.concat([df1, df2]) A B 1 A1 B1 2 A2 B2 3 A3 B3 4 A4 B4
Concatenate axis=1
>>> df3 = make_df('AB', [0, 1]) >>> df4 = make_df('CD', [0, 1]) >>> pd.concat([df3, df4], axis='col') >>> pd.concat([df3, df4], axis=1) A B C D 0 A0 B0 C0 D0 1 A1 B1 C1 D1
Concatenation Running Example
Duplicate indices
>>> x = make_df('AB', [0,1]) >>> y = make_df('AB', [2, 3]) >>> y.index = x.index >>> pd.concat([x, y]) A B 0 A0 B0 1 A1 B1 0 A2 B2 1 A3 B3
Catching the repeats as an error
>>> pd.concat([x, y], verify_integrity=True) ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')
Concatenation Running Example
Ignoring the index
>>> pd.concat([x, y], ignore_index=True) A B 0 A0 B0 1 A1 B1 2 A2 B2 3 A3 B3
Adding MultiIndex keys
>>> pd.concat([x, y], keys=['x','y']) A B x 0 A0 B0 1 A1 B1 y 0 A2 B2 1 A3 B3
Concatenation Running Example
Ignoring the index
>>> pd.concat([x, y], ignore_index=True) A B 0 A0 B0 1 A1 B1 2 A2 B2 3 A3 B3
Adding MultiIndex keys
>>> pd.concat([x, y], keys=['x','y']) A B x 0 A0 B0 1 A1 B1 y 0 A2 B2 1 A3 B3
Concatenation Running Example
Different sets of column names (default outer join)
>>> df5 = make_df('ABC', [1, 2]) >>> df6 = make_df('BCD', [3, 4]) >>> pd.concat([df5, df6]) A B C D 1 A1 B1 C1 NaN 2 A2 B2 C2 NaN 3 NaN B3 C3 D3 4 NaN B4 C4 D4
Different sets of column names (inner join)
>>> pd.concat([df5, df6], join='inner') B C 1 B1 C1 2 B2 C2 3 B3 C3 4 B4 C4
Concatenation Running Example
Specifying the column indices
>>> pd.concat([df5, df6], join_axes=[df5.columns]) A B C 1 A1 B1 C1 2 A2 B2 C2 3 NaN B3 C3 4 NaN B4 C4 >>> pd.concat([df5, df6], join_axes=[df6.columns]) B C D 1 B1 C1 NaN 2 B2 C2 NaN 3 B3 C3 D3 4 B4 C4 D4
Pandas merge
Function
Relational algebra is a formal set of rules for manipulating relational data
The
merge
function is an interface to perform relational algebra join operationsCategories of joins:
one-to-one
many-to-one
many-to-many
One-to-one Join Example
A one-to-one join is similar to column-wise concatenation
>>> df1 name group 0 Bob A 1 Alice B 2 Eve B >>> df2 name number 0 Eve 1 1 Alice 2 2 Bob 3 >>> pd.merge(df1, df2) name group number 0 Bob A 3 1 Alice B 2 2 Eve B 1
Many-to-one Join Example
A many-to-one join is a join where one of the key columns contains duplicate entries
>>> df1 name group number 0 Bob A 3 1 Alice B 2 2 Eve B 1 >>> df2 group leader 0 A Jack 1 B Jill >>> pd.merge(df1, df2) name group number leader 0 Bob A 3 Jack 1 Alice B 2 Jill 2 Eve B 1 Jill
Many-to-many Join Example
A many-to-many join is a join where both the left and the right key columns contain duplicates
>>> df1 name group number 0 Bob A 3 1 Alice B 2 2 Eve B 1 >>> df2 group skill 0 A math 1 A programming 2 B biology >>> pd.merge(df1, df2) name group number skill 0 Bob A 3 math 1 Bob A 3 programming 2 Alice B 2 biology 3 Eve B 1 biology
Pandas merge
Function (continued)
The
merge
function has additional keyword argumentson
: specify the key columnleft_on
andright_on
: specify a key column that has a different name in each tableleft_index
andright_index
: merge on index instead of columnhow
: specify an inner, outer, left, or right joinsuffixes
: append a suffix to conflicting column names