Merging DataFrames with Pandas


Adding bronze, silver and gold: adding all three series together yields 6 rows of output, but only three have non-null values. France, Germany and Italy are not index labels in all three series, so each of those rows is NaN in the sum

We can also chain multiple calls to the .add( ) method with fill_value= 0 to get rid of those null values in the triple sum: bronze.add(silver, fill_value= 0).add(gold, fill_value= 0)
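
The chained call can be tried on a small stand-in for the medal data. This is only a sketch: the three series below hold invented counts, not the real Olympic figures, and the variable names plain_sum and filled_sum are introduced here for illustration.

```python
import pandas as pd

# Hypothetical medal counts, invented for illustration (not the real data).
bronze = pd.Series({"United States": 10, "France": 4, "Germany": 3})
silver = pd.Series({"United States": 12, "France": 5, "Italy": 2})
gold = pd.Series({"United States": 20, "Italy": 6, "Germany": 7})

# Plain addition: a label missing from any one series ends up as NaN.
plain_sum = bronze + silver + gold

# Chained .add() with fill_value=0 treats a missing label as zero.
filled_sum = bronze.add(silver, fill_value=0).add(gold, fill_value=0)

print(plain_sum)
print(filled_sum)
```

With these made-up numbers only United States survives the plain sum, while the chained version keeps a total for every country.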

Absolute Temperature Range Find the percentage variation in temperature in the first week of July, that is, the daily minimum and the daily maximum temperatures expressed as a percentage of the daily mean temperature

We can compute this by dividing both the Min TemperatureF and the Max TemperatureF columns by the Mean TemperatureF column and multiplying both by 100. To begin, slice the Min TemperatureF and the Max TemperatureF columns as a dataframe week1_range: week1_range= weather.loc['2013-07-01' : '2013-07-07', ['Min TemperatureF', 'Max TemperatureF']]. Next, slice the Mean TemperatureF column as a series week1_mean: week1_mean= weather.loc['2013-07-01' : '2013-07-07', 'Mean TemperatureF']. Dividing the dataframe week1_range by the series week1_mean with the slash operator won't work, because the / operator aligns the series index (the dates) with the dataframe's column labels; they don't match, so the result has all null values. Instead, we use the dataframe .divide( ) method with the option axis= 'rows', which provides more fine-grained control than the slash operator: week1_range.divide(week1_mean, axis= 'rows'). This aligns week1_mean with the row index and broadcasts it across both columns to produce the desired ratios; multiplying the result by 100 expresses them as percentages
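
A minimal sketch of the row-wise division, assuming a three-day stand-in for the weather table. The temperatures are invented for illustration, and pct_of_mean is a name introduced here.

```python
import pandas as pd

# Hypothetical temperatures, invented for illustration.
dates = pd.date_range("2013-07-01", periods=3, freq="D")
weather = pd.DataFrame(
    {
        "Min TemperatureF": [60, 63, 66],
        "Mean TemperatureF": [75, 70, 72],
        "Max TemperatureF": [90, 84, 81],
    },
    index=dates,
)

# Two-column dataframe and one-column series over the same dates.
week1_range = weather.loc["2013-07-01":"2013-07-03",
                          ["Min TemperatureF", "Max TemperatureF"]]
week1_mean = weather.loc["2013-07-01":"2013-07-03", "Mean TemperatureF"]

# axis="rows" matches the series to the row index (the dates),
# so each row is divided by that day's mean temperature.
pct_of_mean = week1_range.divide(week1_mean, axis="rows") * 100
print(pct_of_mean)
```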

As column labels differ from one dataframe to another, we have to declare

which columns to merge on. If we have two dataframes, counties and cities, with CITY NAME as a column in counties and City as a column in cities, we merge them as follows: pd.merge(counties, cities, left_on= 'CITY NAME', right_on= 'City'). Both columns are retained in the merged dataframe
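
As a sketch of left_on= and right_on=, the two tables below are small invented stand-ins for counties and cities; the COUNTY NAME and Population columns and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical tables, invented for illustration.
counties = pd.DataFrame({
    "CITY NAME": ["Pittsburgh", "Erie"],
    "COUNTY NAME": ["Allegheny", "Erie"],
})
cities = pd.DataFrame({
    "City": ["Pittsburgh", "Erie", "Scranton"],
    "Population": [300000, 95000, 76000],  # placeholder numbers
})

# Match counties["CITY NAME"] against cities["City"] (inner join by default).
merged = pd.merge(counties, cities, left_on="CITY NAME", right_on="City")

# Both key columns are kept in the result.
print(merged.columns.tolist())
```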

We can stack dataframes vertically using

.append( )

Using the method _____ removes entire rows in which null values occur

.dropna( ) This is a common first step when merging dataframes: w_max.reindex(w_mean3.index).dropna( )
       Max TemperatureF
Month
Jan                68.0
Apr                89.0

The dataframe indexes are accessed directly with the

.index attribute

The multi-index can be sliced on the outermost level with the

.loc[ ] accessor print(rain1314.loc[2014])

Having unique indexes is important in most situations. We can create a new index with the method

.reset_index( ) the option drop= True discards the old index with repeated entries rather than keeping it as a column in the dataframe

The function concat( ) accepts

a list or sequence of several series or dataframes to concatenate. While the append method can only stack vertically (row-wise), the function concat( ) is more flexible: it can concatenate vertically or horizontally

The .append( ) method stacks rows without

adjusting the index values

Using how= 'left' keeps

all rows of the left dataframe in the merged dataframe. For rows in the left dataframe with matches in the joining columns of the right dataframe, the non-joining columns of the right dataframe are appended to the left dataframe. For rows in the left dataframe with no matches in the joining columns of the right dataframe, the non-joining columns are filled with null values. Conversely, using how= 'right' does a right join, doing the same thing with the roles of left and right interchanged
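
A small sketch of the two join directions. The dataframes left and right, their key column, and the values are all invented here for illustration.

```python
import pandas as pd

# Hypothetical tables sharing a "key" column.
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "c", "d"], "right_val": [10, 30, 40]})

# Left join: every row of `left` is kept; "b" has no match, so right_val is NaN.
left_joined = pd.merge(left, right, on="key", how="left")

# Right join: every row of `right` is kept; "d" has no match, so left_val is NaN.
right_joined = pd.merge(left, right, on="key", how="right")

print(left_joined)
print(right_joined)
```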

Examining how arithmetic operations work between distinct series or dataframes with non-aligned indexes, which happens often in practice. We'll use Olympic medal data from 1896-2008. Top 5 bronze medal winning countries: bronze= pd.read_csv('bronze_top5.csv', index_col= 0). Top 5 silver medal winning countries: silver= pd.read_csv('silver_top5.csv', index_col= 0). Top 5 gold medal winning countries: gold= pd.read_csv('gold_top5.csv', index_col= 0)

All three dataframes have the same indices for the first three rows: United States, Soviet Union, and United Kingdom. By contrast, the next two rows are either France, Germany, or Italy. If we add bronze and silver, two series of 5 rows, we get back a series with 6 rows, because the index of the sum is the union of the indices from the original two series. Arithmetic operations between pandas series are carried out for rows with common index values. Since Germany does not appear in silver and Italy does not appear in bronze, these rows have NaN in the sum

The input argument to the .reindex( ) method can also be

another dataframe index. For instance, we can use the index from w_max to reindex w_mean in chronological order: w_mean.reindex(w_max.index). When a suitably indexed dataframe is available, the .reindex( ) method spares us having to create a list manually or having to sort the index

We can also do an inner join along

axis= 0. If no column index label occurs in both of the respective dataframes, the joined dataframe will be empty

A different strategy is to concatenate the columns from dataframe rain2014 to the right of the dataframe rain2013 We do this using the option

axis= 1 or axis= 'columns' in the call to concat: rain1314= pd.concat([rain2013, rain2014], axis= 'columns'). Unfortunately, since the column label precipitation is common to both dataframes, the result has two precipitation columns, again obscuring which column came from which year's data

Using the .add( ) method We can get the same sum bronze + silver with a method invocation using

bronze.add(silver) the null values appear in the same places

Just as pandas series can have repeated index labels, dataframes can have repeated

column labels slicing with a repeated column label yields all matching columns

Using .merge_ordered( ) This behaves like merge when

columns can be ordered. The merged dataframe has rows sorted lexicographically according to the column orderings in the input dataframes. Be aware that the default join is an outer join, in contrast to the default inner join for merge( )

By default, pd.merge( ) uses all

columns common to both dataframes to merge

Using axis= 1 or axis= 'columns' stacks dataframe

columns horizontally to the right. Rows with the same index value from the concatenated dataframes get aligned, that is, values get propagated from both dataframes to fill the row. For other rows that are present in one dataframe but not another, the columns are filled with NaNs

Inner join has only index labels

common to both tables like a set intersection

pd.merge( )

computes a merge on all columns that occur in both dataframes. For any row in which the column entry in df1 matches a row in df2, a new row is made in the merged dataframe. The new row contains the row from df1 with the other columns from the corresponding row in df2 appended. This is by default an inner join because it glues together only the rows that match in the joining column of both dataframes

If you need more flexible stacking or an inner or outer join on the indexes, the

concat function gets you further pd.concat([df1, df2]) can be used to stack many horizontally or vertically

The index is a privileged column in pandas providing

convenient access to series or dataframe rows

We can define a list called ordered to impose a deliberate ordering on the index labels of the dataframe w_mean The _____ _____ creates a new dataframe, w_mean2 with the same data as w_mean, but with a new row ordering according to the input list ordered

dataframe.reindex( ) method ordered= ['Jan', 'Apr', 'Jul', 'Oct'] w_mean2= w_mean.reindex(ordered) print(w_mean2)

The original alphabetically ordered dataframe can be recovered with the

dataframe.sort_index( ) method pandas index labels are typically sortable data such as numbers, strings, or date/time
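
The reindex and sort_index round trip can be sketched as follows, assuming a four-row stand-in for w_mean. The temperatures are invented for illustration, and recovered is a name introduced here.

```python
import pandas as pd

# Hypothetical quarterly means, stored in alphabetical order of month label.
w_mean = pd.DataFrame(
    {"Mean TemperatureF": [61.0, 32.0, 68.0, 43.0]},
    index=pd.Index(["Apr", "Jan", "Jul", "Oct"], name="Month"),
)

# Impose a deliberate (chronological) row order.
ordered = ["Jan", "Apr", "Jul", "Oct"]
w_mean2 = w_mean.reindex(ordered)

# Sorting the index brings back the alphabetical order.
recovered = w_mean2.sort_index()

print(w_mean2)
print(recovered)
```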

The function concat can also accept a _____ rather than a list of dataframes as input

dictionary in that case, the dictionary keys are automatically treated as values for the keys= [ ] argument in building the multi-index on the columns

A ValueError exception is raised when the arrays have

different sizes along the concatenation axis for instance, trying to stack arrays A and C horizontally or A and B vertically causes problems

When stacking multiple series, concat is in fact

equivalent to chaining method calls using append: result1= pd.concat([s1, s2, s3]) and result2= s1.append(s2).append(s3) hold the same values and index, so result1.equals(result2) is True

Using a fill_value The default fill value is NaN when the rows being summed

fail to align. We can modify this behavior using the fill_value= option of the .add( ) method. By specifying fill_value= 0, the values of Germany and Italy are no longer null: bronze.add(silver, fill_value= 0). Just as the .divide( ) method is more flexible than the / operator for division, the .add( ) method is more flexible than the + operator for addition

It is generally more efficient to iterate over a collection of

file names. With that goal, we can create a list filenames with the two file paths from before. We then initialize an empty list called dataframes and iterate through the list filenames. Within each iteration, we invoke read_csv to read a dataframe from a file and we append the resulting dataframe to the list of dataframes:
filenames= ['sales-jan-2015.csv', 'sales-feb-2015.csv']
dataframes= [ ]
for f in filenames:
    dataframes.append(pd.read_csv(f))

The join method also joins on indexes but gives more

flexibility for left and right joins df1.join(df2)

Specifying fill_method= 'ffill' as an option in calling merge_ordered uses

forward filling to replace the NaN values by the most recent non-null value
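
A sketch of merge_ordered with and without forward filling. The two tables, their Date, A and B columns, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical tables with partly different dates.
left = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-01", "2016-02-01", "2016-03-01"]),
    "A": [1.0, 2.0, 3.0],
})
right = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-15", "2016-02-01"]),
    "B": [10.0, 20.0],
})

# Default: outer join, rows sorted by the "on" column, gaps left as NaN.
ordered = pd.merge_ordered(left, right, on="Date")

# fill_method="ffill" carries the most recent non-null value forward.
filled = pd.merge_ordered(left, right, on="Date", fill_method="ffill")

print(ordered)
print(filled)
```

The first row of column B stays NaN even after forward filling, because there is no earlier value to carry forward.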

When many filenames have a similar pattern, the _____ _____ from the python standard library is very useful

glob module. Here, we start by importing the function glob from the built-in glob module. We use the pattern sales*.csv to match any strings that start with the prefix sales and end with the suffix .csv; the asterisk is a wildcard that matches zero or more standard characters. The function glob uses the wildcard pattern to create an iterable object filenames containing all matching filenames in the current directory. Finally, the iterable filenames is consumed in a list comprehension that makes a list called dataframes containing the relevant data structures:
from glob import glob
filenames= glob('sales*.csv')
dataframes= [pd.read_csv(f) for f in filenames]
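
To try the glob pattern without the original sales files, the sketch below first writes two tiny CSV files into a temporary directory. The directory, the Units column, and the file contents are scaffolding invented for illustration; only the last two statements mirror the card.

```python
import os
import tempfile
from glob import glob

import pandas as pd

with tempfile.TemporaryDirectory() as tmpdir:
    # Scaffolding: create two small hypothetical sales files.
    for month in ["jan", "feb"]:
        path = os.path.join(tmpdir, f"sales-{month}-2015.csv")
        pd.DataFrame({"Units": [1, 2]}).to_csv(path, index=False)

    # glob does not guarantee an order, so sort the matches for reproducibility.
    filenames = sorted(glob(os.path.join(tmpdir, "sales*.csv")))

    # Read every matching file into its own dataframe.
    dataframes = [pd.read_csv(f) for f in filenames]

print(len(dataframes))
```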

The .join( ) method can also do a right join using

how= 'right'

Using the option _____ with concat( ) spares us having to invoke reset_index explicitly

ignore_index= True. The resulting series has no repeated indices
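
A short sketch of the repeated-index problem and two ways around it. The series s1 and s2 and their contents are invented for illustration.

```python
import pandas as pd

# Two hypothetical series, each with the default index 0, 1.
s1 = pd.Series(["a", "b"])
s2 = pd.Series(["c", "d"])

# Stacking keeps the original labels, so 0 and 1 each appear twice.
stacked = pd.concat([s1, s2])
rows_labeled_0 = stacked.loc[0]  # returns two rows, not one

# ignore_index=True builds a fresh 0..n-1 index during concatenation.
renumbered = pd.concat([s1, s2], ignore_index=True)

# The same result via an explicit reset_index call afterwards.
same = stacked.reset_index(drop=True)

print(stacked.index.tolist(), renumbered.index.tolist())
```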

The specific index labels provided to the .reindex( ) method are

important. For instance, if we invoke .reindex( ) again using an input list containing a label that is not in the original dataframe index (Dec in this case), an entirely new row is inserted and filled with the null value NaN, or not a number: w_mean3= w_mean.reindex(['Jan', 'Apr', 'Dec'])
       Mean TemperatureF
Month
Jan           32.133333
Apr           61.956044
Dec                 NaN

"Indexes" vs "Indices"

indices: many index labels within index data structures indexes: many pandas index data structures

Using .join(how= 'inner') the .join( ) method also supports

inner and outer joins on the indexes

We call concat along axis= 1 and explicitly specify join= 'inner' for an

inner join. This means that only the row labels present in both dataframe indexes are preserved in the joined dataframe. The column values in each row are filled in from the corresponding columns of the respective dataframes

The function merge( ) does an

inner join by default, that is, it extracts the rows that match in the joining columns from both DataFrames and glues them together in the joined DataFrame. We can specify how= 'inner', but this is already the default behavior for merge

Like merge, the function merge_ordered accepts

keyword arguments on= and suffixes=

To avoid repeated column indices, we use the

keys= [ ] option and axis= 'columns' with concat. The result has a multilevel column index: rain1314= pd.concat([rain2013, rain2014], keys= [2013, 2014], axis= 'columns'). We can slice the year 2013 in a dictionary-style lookup: rain1314[2013]
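
A sketch of the multilevel column index, including the dictionary form of the input. The rainfall values and month labels are invented for illustration, and rain_from_dict and rain_2013_only are names introduced here.

```python
import pandas as pd

# Hypothetical rainfall tables sharing the column label "precipitation".
rain2013 = pd.DataFrame({"precipitation": [0.10, 0.25]}, index=["Jan", "Feb"])
rain2014 = pd.DataFrame({"precipitation": [0.30, 0.05]}, index=["Jan", "Feb"])

# keys= adds an outer column level, so the two columns stay distinguishable.
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis="columns")

# Passing a dictionary uses its keys as the outer level instead.
rain_from_dict = pd.concat({2013: rain2013, 2014: rain2014}, axis="columns")

# Dictionary-style lookup on the outer level returns that year's columns.
rain_2013_only = rain1314[2013]

print(rain1314)
```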

Using .join(how= 'left') pandas dataframes have a .join( ) method built in. Calling df1.join(df2) computes a

left join using the index of df1 by default
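
The four join types of .join( ) can be compared on two small tables. The dataframes df1 and df2, their column names, and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical tables that share only the index label "y".
df1 = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
df2 = pd.DataFrame({"b": [10, 20]}, index=["y", "w"])

left = df1.join(df2)                  # default: keep the index of df1
right = df1.join(df2, how="right")    # keep the index of df2
inner = df1.join(df2, how="inner")    # labels present in both
outer = df1.join(df2, how="outer")    # labels present in either

print(left)
print(outer)
```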

We can also do the preceding computation with a

list comprehension. Comprehensions are a convenient python construction for exactly this kind of loop, where a list that starts empty is appended to within each iteration: filenames= ['sales-jan-2015.csv', 'sales-feb-2015.csv'] dataframes= [pd.read_csv(f) for f in filenames]

Joining tables involves

meaningfully gluing indexed rows together

Using merge_asof( ) Similar to merge_ordered( ), the merge_asof( ) function will also

merge values in order using the on column, but for each row in the left dataframe, only the last row from the right dataframe whose 'on' column value is less than or equal to the left value is kept (this is the default behavior). This function can be used to align disparate date/time frequencies without having to first resample
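
A sketch of merge_asof under its default settings. The tables events and quotes, their columns, and the values are invented for illustration; both tables must already be sorted by the on column.

```python
import pandas as pd

# Hypothetical right-hand table: a rate that changes on a few dates.
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2016-01-01", "2016-01-05", "2016-01-10"]),
    "rate": [1.0, 1.5, 2.0],
})

# Hypothetical left-hand table: events on other dates.
events = pd.DataFrame({
    "time": pd.to_datetime(["2016-01-03", "2016-01-05", "2016-01-12"]),
    "event": ["x", "y", "z"],
})

# For each event, take the most recent quote at or before the event time.
aligned = pd.merge_asof(events, quotes, on="time")

print(aligned)
```

The event on 2016-01-05 picks up the rate from the same day, which shows that an exact match counts as "at or before" by default.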

Often we need to combine dataframes either along multiple columns or along columns other than the index. This is the world of

merging pandas dataframes merge extends concat with the ability to align rows using multiple columns

As a result of .append( ) stacking rows without adjusting the index values, using the .loc[ ] accessor with an argument may return

multiple rows from the appended series

We can stack a 2 x 4 matrix A and a 2 x 3 matrix B horizontally using

np.hstack( ): np.hstack([A, B]). The input is a list of numpy arrays. Equivalently, we can use np.concatenate( ) with the same sequence and axis= 1 to append the columns horizontally: np.concatenate([A, B], axis= 1). In both cases, A and B must have the same number of rows, although the number of columns can differ

We can stack the 2 x 4 matrix A and the 3 x 4 matrix C vertically using

np.vstack( ) or np.concatenate( ) with axis= 0: np.vstack([A, C]) or np.concatenate([A, C], axis= 0). It is important here that both matrices have four columns. The argument axis= 0 is actually the default and is optional
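
The stacking rules can be checked with three small arrays of the shapes used in these cards. The array contents are arbitrary, and the names h, h2, v, v2 and mismatch_error are introduced here for illustration.

```python
import numpy as np

A = np.arange(8).reshape(2, 4)    # 2 x 4
B = np.arange(6).reshape(2, 3)    # 2 x 3
C = np.arange(12).reshape(3, 4)   # 3 x 4

# Horizontal stacking needs matching row counts.
h = np.hstack([A, B])
h2 = np.concatenate([A, B], axis=1)

# Vertical stacking needs matching column counts; axis=0 is the default.
v = np.vstack([A, C])
v2 = np.concatenate([A, C])

# A (2 rows) and C (3 rows) cannot be stacked horizontally.
mismatch_error = None
try:
    np.hstack([A, C])
except ValueError as err:
    mismatch_error = err

print(h.shape, v.shape, type(mismatch_error).__name__)
```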

When appended dataframes have disjoint column names

null entries are inserted

Merging on multiple columns To eliminate the redundant columns in two dataframes, use

on= ['column_name_1', 'column_name_2'] to merge on both of those columns. This is where merging extends concatenation, in allowing matching on multiple columns. The result has only one column for each of column_name_1 and column_name_2. The remaining columns still have suffixes _x and _y to indicate their origin

Outer join preserves the indices in the original tables, filling null values for missing rows an outer joined table has all the indices of the

original tables without repetition, like in a set union

The union of all rows from the left and right DataFrames can be preserved with an

outer join

When we specify join= 'outer' when concatenating over axis= 1 we get an

outer join. If unspecified, the join parameter defaults to outer. All row indices from the original indexes exist in the joined dataframe index. When a row occurs in one dataframe but not in the other, the missing column entries are filled with null values
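
A sketch contrasting the two join options of concat along axis= 1. The dataframes df1 and df2 and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical tables that share only the index label "y".
df1 = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
df2 = pd.DataFrame({"b": [10, 20]}, index=["y", "w"])

# Outer join (the default): union of row labels, gaps filled with NaN.
outer = pd.concat([df1, df2], axis=1, join="outer")

# Inner join: only row labels present in both inputs.
inner = pd.concat([df1, df2], axis=1, join="inner")

print(outer)
print(inner)
```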

We can also use .reindex( ) to see where dataframe rows

overlap. For instance, here we reindex w_max with the index of w_mean3, showing that w_max does not have a row labeled Dec either: w_max.reindex(w_mean3.index)
       Max TemperatureF
Month
Jan                68.0
Apr                89.0
Dec                 NaN

We can stack dataframes vertically or horizontally using

pd.concat( ) concat is also able to align dataframes cleverly with respect to their indexes

To choose a particular column to merge on

pd.merge(df1, df2, on= 'column_name'). This means matching only on the column_name column. The remaining columns are appended to the right, and column labels that occur in both dataframes are modified with the suffixes _x and _y to indicate their origin: x for the first argument to merge and y for the second

Tools for pandas data import

pd.read_csv( ) for CSV files: dataframe= pd.read_csv(filepath), with dozens of optional input parameters. Also pd.read_excel( ), pd.read_html( ), pd.read_json( )

The merge function is the

power tool for joining if you need to join on several columns: pd.merge(df1, df2)

When appending dataframes, the dataframes are

readily stacked row-wise just like series

Using concat with multiple series results in an index that contains

repeated values

The basic syntax of the .append( ) method is

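s1.append(s2). The rows of s2 are then stacked underneath s1. This method works with both dataframes and series. Note that .append( ) was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat([s1, s2]) is the replacement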

To read multiple files using pandas, we generally need

separate DataFrames. For example, here we call pd.read_csv twice to read two csv files into two distinct DataFrames:
import pandas as pd
dataframe0= pd.read_csv('sales-jan-2015.csv')
dataframe1= pd.read_csv('sales-feb-2015.csv')

Scalar Multiplication We use the asterisk to multiply a

series element-wise by 2.54 (here converting inches to centimeters). Remember, we can broadcast standard scalar mathematical operations; here, broadcasting means the multiplication is applied to all elements in the series: weather.loc['2013-07-01' : '2013-07-07', 'PrecipitationIn']*2.54

With date/time indexes, we can use convenient strings to

slice, say, the first week of July from the PrecipitationIn column (the precipitation data are in inches):
import pandas as pd
weather= pd.read_csv('pittsburg2013.csv', index_col= 'Date', parse_dates= True)
weather.loc['2013-07-01' : '2013-07-07', 'PrecipitationIn']

Concatenating two dataframes along axis= 0 means

stacking rows vertically at the bottom. Stating axis= 0 or axis= 'rows' is optional, that is, it's the default behavior

Percentage Change A related computation is to compute a percentage change along a time series We do this by

subtracting the previous day's value from the current day's value and dividing by the previous day's value. The .pct_change( ) method does precisely this computation for us. Here, we also multiply the resulting series by 100 to yield a percentage value: week1_mean.pct_change( )*100. Notice the value in the first row is NaN because there is no previous entry
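
A sketch comparing the manual formula with .pct_change( ), assuming a three-day stand-in for week1_mean. The temperatures are invented for illustration, and manual and builtin are names introduced here.

```python
import pandas as pd

# Hypothetical daily mean temperatures.
week1_mean = pd.Series(
    [70.0, 77.0, 69.3],
    index=pd.date_range("2013-07-01", periods=3, freq="D"),
)

# Manual version: (current - previous) / previous, as a percentage.
previous = week1_mean.shift(1)
manual = (week1_mean - previous) / previous * 100

# Built-in version of the same computation.
builtin = week1_mean.pct_change() * 100

print(builtin)
```

With these numbers the second day is a 10 percent rise and the third a 10 percent drop, and the first entry is NaN in both versions.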

Using suffixes We can tailor column labels with the argument

suffixes=. This replaces the suffixes _x and _y with whatever custom names we choose. For example, if the last two columns of a merged dataframe are 'Total_x' and 'Total_y', they can be changed as follows: pd.merge(bronze, gold, on= ['NOC', 'Country'], suffixes= ['_bronze', '_gold']). Now the last two columns will be 'Total_bronze' and 'Total_gold'
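
A sketch of merging on two columns with default and custom suffixes. The two tables are tiny stand-ins with invented totals, not the real medal data, and default_merge and custom_merge are names introduced here.

```python
import pandas as pd

# Hypothetical medal tables with the same three column labels.
bronze = pd.DataFrame({
    "NOC": ["USA", "URS"],
    "Country": ["United States", "Soviet Union"],
    "Total": [5, 4],
})
gold = pd.DataFrame({
    "NOC": ["USA", "URS"],
    "Country": ["United States", "Soviet Union"],
    "Total": [9, 8],
})

# Merging on both key columns leaves one copy of each key;
# the clashing Total columns get the default _x and _y suffixes.
default_merge = pd.merge(bronze, gold, on=["NOC", "Country"])

# Custom suffixes make the origin of each Total column explicit.
custom_merge = pd.merge(bronze, gold, on=["NOC", "Country"],
                        suffixes=("_bronze", "_gold"))

print(default_merge.columns.tolist())
print(custom_merge.columns.tolist())
```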

When two dataframes with the same index label are appended

the appended dataframe has two rows with the same index label

Using a multilevel index for rows

The argument keys= [ ] assigns an outer index label associated with each of the original input dataframes. Note that the order of the list of keys must match the order of the list of input dataframes. When printed, the dataframe is displayed with distinct levels for the multi-index: rain1314= pd.concat([rain2013, rain2014], keys= [2013, 2014], axis= 0)
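
A sketch of the row multi-index and of slicing its outermost level with .loc[ ]. The rainfall values and month labels are invented for illustration, and rain_2014_only is a name introduced here.

```python
import pandas as pd

# Hypothetical rainfall tables with identical row labels.
rain2013 = pd.DataFrame({"precipitation": [0.10, 0.25]}, index=["Jan", "Feb"])
rain2014 = pd.DataFrame({"precipitation": [0.30, 0.05]}, index=["Jan", "Feb"])

# keys= adds an outer row level: (2013, "Jan"), (2013, "Feb"), (2014, "Jan"), ...
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis=0)

# Slicing on the outermost level returns the rows of a single year.
rain_2014_only = rain1314.loc[2014]

print(rain1314)
print(rain_2014_only)
```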

If you only need to stack two series or dataframes vertically,

the df1.append(df2) method is sufficient

Order matters w_max.reindex(w_mean.index) is not the same as w_mean.reindex(w_max.index)

The latter puts w_mean in the desired row order. The former imposes the misleading alphabetical row order on w_max, which is likely not desirable

