Merging DataFrames with Pandas
Adding bronze, silver and gold Adding all three series together yields 6 rows of output, but only three have non-null values. That is, France, Germany and Italy are not index labels in all three series, so each of those rows is NaN in the sum
we can also chain multiple calls to the .add( ) method with fill_value= 0 to get rid of those null values in the triple sum: bronze.add(silver, fill_value=0).add(gold, fill_value=0)
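A minimal sketch of this chaining, with invented medal counts (the country names mirror the example, but the numbers are made up):

```python
import pandas as pd

# Toy medal-count series; the values are invented for illustration.
bronze = pd.Series({'United States': 1052.0, 'Germany': 454.0})
silver = pd.Series({'United States': 1195.0, 'Italy': 394.0})
gold = pd.Series({'United States': 2088.0, 'France': 378.0})

# A plain + sum is NaN wherever a label is missing from any series...
triple = bronze + silver + gold

# ...while chaining .add() with fill_value=0 treats missing labels as 0.
triple_filled = bronze.add(silver, fill_value=0).add(gold, fill_value=0)
```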
Absolute Temperature Range Find the percentage variation in temperature in the first week of July, that is, the daily minimum and daily maximum temperatures expressed as percentages of the daily mean temperature
we can compute this by dividing both the Min TemperatureF and Max TemperatureF columns by the Mean TemperatureF column, and multiplying both by 100. To begin, slice the Min TemperatureF and Max TemperatureF columns as a dataframe week1_range: week1_range= weather.loc['2013-07-01' : '2013-07-07', ['Min TemperatureF', 'Max TemperatureF']] Next, slice the Mean TemperatureF column as a series week1_mean: week1_mean= weather.loc['2013-07-01' : '2013-07-07', 'Mean TemperatureF'] Dividing the dataframe week1_range by the series week1_mean won't work, because the column labels don't match, so the result has all null values. We want the dataframe .divide( ) method with the option axis= 'rows'; the .divide( ) method provides more fine-grained control than the slash operator for division: week1_range.divide(week1_mean, axis= 'rows') This broadcasts the series week1_mean across each row to produce the desired ratios
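A self-contained sketch of the same computation on a hypothetical (invented) week of temperatures, showing why plain division fails and .divide( ) succeeds:

```python
import pandas as pd

# Hypothetical week of temperatures; the numbers are invented.
dates = pd.date_range('2013-07-01', periods=3)
weather = pd.DataFrame({'Min TemperatureF': [66.0, 68.0, 71.0],
                        'Max TemperatureF': [79.0, 83.0, 86.0],
                        'Mean TemperatureF': [72.0, 76.0, 78.0]},
                       index=dates)

week1_range = weather[['Min TemperatureF', 'Max TemperatureF']]
week1_mean = weather['Mean TemperatureF']

# Plain division aligns the series index against the column labels -> all NaN.
bad = week1_range / week1_mean

# .divide(..., axis='rows') broadcasts the series down the rows instead.
week1_pct = week1_range.divide(week1_mean, axis='rows') * 100
```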
As column labels differ from one dataframe to another, we have to declare
which columns to merge on. If we have two dataframes, counties and cities, with CITY NAME as a column in counties and City as a column in cities, we merge them as follows: pd.merge(counties, cities, left_on= 'CITY NAME', right_on= 'City') Both columns are retained in the merged dataframe
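A sketch with hypothetical county/city tables (the column names follow the example; the data are invented):

```python
import pandas as pd

# Hypothetical lookup tables; only the column names mirror the example.
counties = pd.DataFrame({'CITY NAME': ['Aurora', 'Boulder'],
                         'County': ['Arapahoe', 'Boulder County']})
cities = pd.DataFrame({'City': ['Aurora', 'Boulder'],
                       'Population': [361710, 104175]})

# left_on/right_on names the joining column in each dataframe separately.
merged = pd.merge(counties, cities, left_on='CITY NAME', right_on='City')
```

Note that both joining columns, CITY NAME and City, survive into the result.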
We can stack dataframes vertically using
.append( )
Using the method _____ removes entire rows in which null values occur
.dropna( ) This is a common first step when merging dataframes:
w_max.reindex(w_mean3.index).dropna()
       Max TemperatureF
Month
Jan                68.0
Apr                89.0
The dataframe indexes are accessed directly with the
.index attribute
The multi-index can be sliced on the outermost level with the
.loc[ ] accessor print(rain1314.loc[2014])
Having unique indexes is important in most situations. We can create a new index with the method
.reset_index( ) the option drop= True discards the old index with repeated entries rather than keeping it as a column in the dataframe
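A minimal sketch with two toy series, showing the repeated index that concatenation produces and how .reset_index(drop=True) replaces it:

```python
import pandas as pd

# Two toy series; concatenating them repeats the index labels 0 and 1.
s1 = pd.Series([1, 2])
s2 = pd.Series([3, 4])
stacked = pd.concat([s1, s2])

# drop=True discards the repeated index in favor of a fresh RangeIndex.
fresh = stacked.reset_index(drop=True)
```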
The function concat( ) accepts
a list or sequence of several series or dataframes to concatenate. While the .append( ) method can only stack vertically, or row-wise, the function concat( ) is more flexible: it can concatenate vertically or horizontally
The .append( ) method stacks rows without
adjusting the index values
Using how= 'left' keeps
all rows of the left dataframe in the merged dataframe. For rows in the left dataframe with matches in the joining columns of the right dataframe, the non-joining columns of the right dataframe are appended to the left dataframe. For rows in the left dataframe with no matches in the joining columns of the right dataframe, the non-joining columns are filled with null values. Conversely, using how= 'right' does a right join, doing the same thing with the roles of left and right interchanged
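A sketch of a left join on toy tables (names and values invented):

```python
import pandas as pd

# Toy tables with a shared 'key' column; values are invented.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b'], 'rval': [10, 20]})

# how='left' keeps every row of `left`; 'c' has no match in `right`,
# so its rval entry is filled with NaN.
merged = pd.merge(left, right, on='key', how='left')
```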
Examining how arithmetic operations work between distinct series or dataframes with non-aligned indexes, which happens often in practice. We'll use Olympic medal data from 1896-2008:
Top 5 bronze medal winning countries: bronze= pd.read_csv('bronze_top5.csv', index_col= 0)
Top 5 silver medal winning countries: silver= pd.read_csv('silver_top5.csv', index_col= 0)
Top 5 gold medal winning countries: gold= pd.read_csv('gold_top5.csv', index_col= 0)
all three dataframes have the same indices for the first three rows: United States, Soviet Union, and United Kingdom. By contrast, the next two rows are France, Germany, or Italy. If we add bronze and silver, two series of 5 rows, we get back a series with 6 rows: the index of the sum is the union of the indices from the original two series. Arithmetic operations between pandas series are carried out for rows with common index values; since Germany does not appear in silver and Italy does not appear in bronze, these rows have NaN in the sum
The input argument to the .reindex( ) method can also be
another dataframe index. For instance, we can use the index from w_max to reindex w_mean in chronological order: w_mean.reindex(w_max.index) When a suitably indexed dataframe is available, the .reindex( ) method spares us having to create a list manually or having to sort the index
We can also do an inner join along
axis= 0. If no column index label occurs in both of the respective dataframes, the joined dataframe will be empty
A different strategy is to concatenate the columns from dataframe rain2014 to the right of the dataframe rain2013 We do this using the option
axis= 1 or axis= 'columns' in the call to concat: rain1314= pd.concat([rain2013, rain2014], axis= 'columns') Unfortunately, since the column label precipitation is common to both dataframes, the result has two precipitation columns, again obscuring which column came from which year's data
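A sketch of this pitfall with invented rainfall numbers:

```python
import pandas as pd

# Toy rainfall tables for two years (values invented).
rain2013 = pd.DataFrame({'precipitation': [0.5, 1.2]}, index=['Jan', 'Feb'])
rain2014 = pd.DataFrame({'precipitation': [0.3, 1.8]}, index=['Jan', 'Feb'])

# Horizontal concatenation aligns rows on the index but keeps both
# identically labeled columns, so the result is ambiguous to read.
rain1314 = pd.concat([rain2013, rain2014], axis='columns')
```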
Using the .add( ) method We can get the same sum bronze + silver with a method invocation using
bronze.add(silver) the null values appear in the same places
Just as pandas series can have repeated index labels, dataframes can have repeated
column labels slicing with a repeated column label yields all matching columns
Using .merge_ordered( ) This behaves like merge when
columns can be ordered. The merged dataframe has rows sorted lexicographically according to the column orderings in the input dataframes. Be aware that the default join is an outer join, contrasting with the default inner join for merge( )
By default, pd.merge( ) uses all
columns common to both dataframes to merge
Using axis= 1 or axis= 'columns' stacks dataframe
columns horizontally to the right rows with the same index value from the concatenated dataframes get aligned, that is, values get propagated from both dataframes to fill the row for other rows that are present in one dataframe but not another, the columns are filled with NaNs
Inner join has only index labels
common to both tables like a set intersection
pd.merge( )
computes a merge on all columns that occur in both dataframes. For any row in which the column entry in df1 matches a row in df2, a new row is made in the merged dataframe; the new row contains the row from df1 with the other columns from the corresponding row in df2 appended. This is by default an inner join, because it glues together only the rows that match in the joining column of both dataframes
If you need more flexible stacking or an inner or outer join on the indexes, the
concat function gets you further: pd.concat([df1, df2]) can be used to stack many dataframes vertically or horizontally
The index is a privileged column in pandas providing
convenient access to series or dataframe rows
We can define a list called ordered to impose a deliberate ordering on the index labels of the dataframe w_mean The _____ _____ creates a new dataframe, w_mean2 with the same data as w_mean, but with a new row ordering according to the input list ordered
dataframe.reindex( ) method ordered= ['Jan', 'Apr', 'Jul', 'Oct'] w_mean2= w_mean.reindex(ordered) print(w_mean2)
The original alphabetically ordered dataframe can be recovered with the
dataframe.sort_index( ) method pandas index labels are typically sortable data such as numbers, strings, or date/time
The function concat can also accept a _____ rather than a list of dataframes as input
dictionary in that case, the dictionary keys are automatically treated as values for the keys= [ ] argument in building the multi-index on the columns
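A sketch of the dictionary form, with invented one-row rainfall tables:

```python
import pandas as pd

rain2013 = pd.DataFrame({'precipitation': [0.5]}, index=['Jan'])
rain2014 = pd.DataFrame({'precipitation': [0.3]}, index=['Jan'])

# Passing a dict: the keys become the outer level of the column
# multi-index, just as if keys=[2013, 2014] had been given.
rain1314 = pd.concat({2013: rain2013, 2014: rain2014}, axis='columns')
```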
A value error exception is raised when the arrays have
different sizes along the concatenation axis for instance, trying to stack arrays A and C horizontally or A and B vertically causes problems
When stacking multiple series, concat is in fact
equivalent to chaining method calls using append: result1= pd.concat([s1, s2, s3]) result2= s1.append(s2).append(s3) result1 and result2 are identical
Using a fill_value The default fill value is NaN when rows in a sum
fail to align. We can modify this behavior using the fill_value= option of the .add( ) method: by specifying fill_value= 0, the values for Germany and Italy are no longer null. bronze.add(silver, fill_value= 0) Just as the .divide( ) method is more flexible than the / operator for division, the .add( ) method is more flexible than the + operator for addition
It is generally more efficient to iterate over a collection of
file names. With that goal, we can create a list of filenames with the two file paths from before. We then initialize an empty list called dataframes and iterate through the list filenames; within each iteration, we invoke read_csv to read a dataframe from a file and append the resulting dataframe to the list dataframes:
filenames= ['sales-jan-2015.csv', 'sales-feb-2015.csv']
dataframes= []
for f in filenames:
    dataframes.append(pd.read_csv(f))
The join method also joins on indexes but gives more
flexibility for left and right joins df1.join(df2)
Specifying fill_method= 'ffill' as an option in calling merge_ordered uses
forward filling to replace the NaN values by the most recent non-null value
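A sketch with two toy tables sampled on different dates (all values invented), showing the forward fill:

```python
import pandas as pd

# Two toy tables sampled on different dates (values invented).
gdp = pd.DataFrame({'date': pd.to_datetime(['2015-01-01', '2015-04-01']),
                    'gdp': [100.0, 110.0]})
rates = pd.DataFrame({'date': pd.to_datetime(['2015-02-01', '2015-05-01']),
                      'rate': [0.5, 0.75]})

# merge_ordered does a sorted outer join on 'date'; fill_method='ffill'
# replaces each NaN with the most recent non-null value above it.
merged = pd.merge_ordered(gdp, rates, on='date', fill_method='ffill')
```

The very first row can still be NaN, since there is nothing above it to fill from.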
When many filenames have a similar pattern, the _____ _____ from the python standard library is very useful
glob module. Here, we start by importing the function glob from the built-in glob module. We use the pattern sales*.csv to match any strings that start with the prefix sales and end with the suffix .csv; the asterisk is a wildcard that matches zero or more standard characters. The function glob uses the wildcard pattern to create an iterable object filenames containing all matching filenames in the current directory. Finally, the iterable filenames is consumed in a list comprehension that makes a list called dataframes containing the relevant data structures:
from glob import glob
filenames= glob('sales*.csv')
dataframes= [pd.read_csv(f) for f in filenames]
The .join( ) method can also do a right join using
how= 'right'
Using the option _____ with concat( ) spares us having to invoke reset_index explicitly
ignore_index= True The resulting series has no repeated indices
The specific index labels provided to the .reindex( ) method are
important. For instance, if we invoke .reindex( ) again using an input list containing a label that is not in the original dataframe index, Dec in this case, an entirely new row is inserted and filled with the null value NaN, or not a number:
w_mean3= w_mean.reindex(['Jan', 'Apr', 'Dec'])
       Mean TemperatureF
Month
Jan            32.133333
Apr            61.956044
Dec                  NaN
"Indexes" vs "Indices"
indices: many index labels within index data structures
indexes: many pandas index data structures
Using .join(how= 'inner') the .join( ) method also supports
inner and outer joins on the indexes
We call concat along axis= 1 and explicitly specify join= 'inner' for an
inner join. This means that only the row labels present in both dataframe indexes are preserved in the joined dataframe. The column values in each row are filled in from the corresponding columns of the respective dataframes
The function merge( ) does an
inner join by default; that is, it extracts the rows that match in the joining columns from both DataFrames and glues them together in the joined DataFrame. We can specify how= 'inner', but this is the default behavior for merge
Like merge, the function merge_ordered accepts
keyword arguments on= and suffixes=
To avoid repeated column indices, we use the
keys= [ ] option and axis= 'columns' with concat. The result has a multilevel column index: rain1314= pd.concat([rain2013, rain2014], keys= [2013, 2014], axis= 'columns') We can slice the year 2013 dictionary-style: rain1314[2013]
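A sketch with invented rainfall numbers, showing the multilevel columns and the dictionary-style slice:

```python
import pandas as pd

rain2013 = pd.DataFrame({'precipitation': [0.5, 1.2]}, index=['Jan', 'Feb'])
rain2014 = pd.DataFrame({'precipitation': [0.3, 1.8]}, index=['Jan', 'Feb'])

# keys= builds a multilevel column index that records each column's year.
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis='columns')

# The outer level slices dictionary-style.
rain2013_again = rain1314[2013]
```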
Using .join(how= 'left') pandas dataframes have a .join( ) method built in. Calling df1.join(df2) computes a
left join using the index of df1 by default
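A sketch of the default index-based left join on two toy dataframes (names and values invented):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])
df2 = pd.DataFrame({'b': [10, 20]}, index=['x', 'y'])

# A left join on the index: every row of df1 survives, and 'z',
# which is missing from df2, gets NaN in column b.
joined = df1.join(df2)
```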
We can also do the preceding computation with a
list comprehension comprehensions are a convenient python construction for exactly this kind of loop where an empty list is appended to within each iteration filenames= ['sales-jan-2015.csv', 'sales-feb-2015.csv'] dataframes= [pd.read_csv(f) for f in filenames]
Joining tables involves
meaningfully gluing indexed rows together
Using merge_asof( ) Similar to merge_ordered( ), the merge_asof( ) function will also
merge values in order using the on column; but for each row in the left dataframe, only rows from the right dataframe whose 'on' column values are less than or equal to the left value will be kept. This function can be used to align disparate date/time frequencies without having to resample first
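A sketch with invented, time-sorted quote and trade tables (a common merge_asof illustration; the column names and values are assumptions here):

```python
import pandas as pd

# Toy quotes and trades, both sorted on 'time' (values invented).
quotes = pd.DataFrame({'time': pd.to_datetime(['2015-06-01 10:00:01',
                                               '2015-06-01 10:00:03',
                                               '2015-06-01 10:00:05']),
                       'quote': [100.0, 101.0, 102.0]})
trades = pd.DataFrame({'time': pd.to_datetime(['2015-06-01 10:00:02',
                                               '2015-06-01 10:00:04']),
                       'size': [50, 75]})

# Each trade picks up the most recent quote at or before its timestamp.
matched = pd.merge_asof(trades, quotes, on='time')
```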
Often we need to combine dataframes either along multiple columns or along columns other than the index. This is the world of
merging pandas dataframes. Merge extends concat with the ability to align rows using multiple columns
As a result of .append( ) stacking rows without adjusting the index values, using the .loc[ ] accessor with an argument may return
multiple rows from the appended series
We can stack a 2 x 4 matrix A and a 2 x 3 matrix B horizontally using
np.hstack( ) np.hstack([A, B]) The input is a list of numpy arrays. Equivalently, we can use np.concatenate( ) with the same sequence and axis= 1 to append the columns horizontally: np.concatenate([A, B], axis=1) In both cases, A and B must have the same number of rows, although the number of columns can differ
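A runnable sketch with toy 2 x 4 and 2 x 3 arrays:

```python
import numpy as np

# Toy arrays with matching row counts but different column counts.
A = np.zeros((2, 4))
B = np.ones((2, 3))

# Both calls require the same number of rows; column counts may differ.
H1 = np.hstack([A, B])
H2 = np.concatenate([A, B], axis=1)
```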
We can stack the 2 x 4 matrix A and the 3 x 4 matrix C vertically using
np.vstack( ) or np.concatenate( ) with axis= 0: np.vstack([A, C]) np.concatenate([A, C], axis= 0) It is important here that both matrices have four columns; the argument axis= 0 is actually the default and optional
When appended dataframes have disjoint column names
null entries are inserted
Merging on multiple columns To eliminate the redundant columns in two dataframes, use
on= ['column_name_1', 'column_name_2'] to merge on both of those columns. This is where merging extends concatenation, in allowing matching on multiple columns. The result has only one column each for column_name_1 and column_name_2; the remaining columns still have suffixes _x and _y to indicate their origin
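A sketch using toy medal tables that share 'NOC' and 'Country' columns (the totals are invented):

```python
import pandas as pd

# Toy medal tables sharing 'NOC' and 'Country' columns (totals invented).
bronze = pd.DataFrame({'NOC': ['USA', 'URS'],
                       'Country': ['United States', 'Soviet Union'],
                       'Total': [1052.0, 584.0]})
gold = pd.DataFrame({'NOC': ['USA', 'URS'],
                     'Country': ['United States', 'Soviet Union'],
                     'Total': [2088.0, 838.0]})

# Matching on both columns keeps one copy of each; the remaining
# 'Total' columns get _x and _y suffixes to mark their origin.
merged = pd.merge(bronze, gold, on=['NOC', 'Country'])
```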
Outer join preserves the indices in the original tables, filling in null values for missing rows. An outer joined table has all the indices of the
original tables without repetition, like in a set union
The union of all rows from the left and right DataFrames can be preserved with an
outer join
When we specify join= 'outer' when concatenating over axis= 1 we get an
outer join. If unspecified, the join parameter defaults to 'outer'. All row indices from the original indexes exist in the joined dataframe index; when a row occurs in one dataframe but not in the other, the missing column entries are filled with null values
We can also use .reindex( ) to see where dataframe rows
overlap. For instance, here we reindex w_max with the index of w_mean3, showing that w_max does not have a row labeled Dec either:
w_max.reindex(w_mean3.index)
       Max TemperatureF
Month
Jan                68.0
Apr                89.0
Dec                 NaN
We can stack dataframes vertically or horizontally using
pd.concat( ) concat is also able to align dataframes cleverly with respect to their indexes
To chose a particular column to merge on
pd.merge(df1, df2, on= 'column_name') This means matching only on the column_name column. Remaining columns are appended to the right; column labels are modified with the suffixes _x and _y to indicate their origin, x for the first argument to merge and y for the second
Tools for pandas data import
pd.read_csv( ) for CSV files: dataframe= pd.read_csv(filepath), with dozens of optional input parameters. Also pd.read_excel( ), pd.read_html( ), pd.read_json( )
The merge function is the
power tool for joining if you need to join on several columns: pd.merge(df1, df2)
When appending dataframes, the dataframes are
readily stacked row-wise just like series
Using concat with multiple series results in an index that contains
repeated values
The basic syntax of the .append( ) method is
s1.append(s2) the rows of s2 then are stacked underneath s1 this method works with both dataframes and series
To read multiple files using pandas, we generally need
separate DataFrames for each file. For example, here we call pd.read_csv twice to read two csv files into two distinct DataFrames:
import pandas as pd
dataframe0= pd.read_csv('sales-jan-2015.csv')
dataframe1= pd.read_csv('sales-feb-2015.csv')
Scalar Multiplication We use the asterisk to multiply a
series element-wise by 2.54. Remember, we can broadcast standard scalar mathematical operations; here, broadcasting means the multiplication is applied to all elements in the dataframe: weather.loc['2013-07-01' : '2013-07-07', 'PrecipitationIn']*2.54
With date/time indexes, we can use convenient strings to
slice, say, the first week of July from the PrecipitationIn column. The precipitation data are in inches.
import pandas as pd
weather= pd.read_csv('pittsburg2013.csv', index_col= 'Date', parse_dates= True)
weather.loc['2013-07-01' : '2013-07-07', 'PrecipitationIn']
Concatenating two dataframes along axis= 0 means
stacking rows vertically at the bottom. Stating axis= 0 or axis= 'rows' is optional; it's the default behavior
Percentage Change A related computation is to compute a percentage change along a time series We do this by
subtracting the previous day's value from the current day's value and dividing by the previous day's value. The .pct_change( ) method does precisely this computation for us. Here, we also multiply the resulting series by 100 to yield a percentage value; notice the value in the first row is NaN because there is no previous entry: week1_mean.pct_change( )*100
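A sketch on a toy daily series (temperatures invented):

```python
import pandas as pd

# A toy daily series (temperature values invented).
week1_mean = pd.Series([72.0, 76.0, 78.0],
                       index=pd.date_range('2013-07-01', periods=3))

# (current - previous) / previous, scaled to a percentage; the first
# entry is NaN because there is no previous value.
pct = week1_mean.pct_change() * 100
```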
Using suffixes We can tailor column labels with the argument
suffixes= This replaces the suffixes _x and _y with whatever custom names we choose. For example, if the last two columns of a merged dataframe are 'Total_x' and 'Total_y', they can be changed as follows: pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes= ['_bronze', '_gold']) Now the last two columns will be 'Total_bronze' and 'Total_gold'
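The same idea as a runnable sketch (one-row toy tables, totals invented):

```python
import pandas as pd

bronze = pd.DataFrame({'NOC': ['USA'], 'Country': ['United States'],
                       'Total': [1052.0]})
gold = pd.DataFrame({'NOC': ['USA'], 'Country': ['United States'],
                     'Total': [2088.0]})

# suffixes= swaps the default _x/_y for self-documenting labels.
merged = pd.merge(bronze, gold, on=['NOC', 'Country'],
                  suffixes=['_bronze', '_gold'])
```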
When two dataframes with the same index label are appended
the appended dataframe has two rows with the same index label
Using a multilevel index for rows
the argument keys= [ ] assigns an outer index label associated with each of the original input dataframes. Note that the order of the list of keys must match the order of the list of input dataframes. When printed, the dataframe is displayed with distinct levels for the multi-index: rain1314= pd.concat([rain2013, rain2014], keys= [2013, 2014], axis= 0)
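A sketch of the row multi-index with invented rainfall numbers, including the outer-level .loc slice described earlier:

```python
import pandas as pd

rain2013 = pd.DataFrame({'precipitation': [0.5, 1.2]}, index=['Jan', 'Feb'])
rain2014 = pd.DataFrame({'precipitation': [0.3, 1.8]}, index=['Jan', 'Feb'])

# keys= labels each input with an outer row-index level, in list order.
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis=0)

# The outermost level can then be sliced with .loc.
rain2014_again = rain1314.loc[2014]
```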
If you only need to stack two series or dataframes vertically,
the df1.append(df2) method is sufficient
Order matters w_max.reindex(w_mean.index) is not the same as w_mean.reindex(w_max.index)
the latter fixes the row order as desired in w_mean; the former replicates the misleading alphabetical row order in w_max, which is likely not desirable