Merging DataFrames
When should we use these four methods - .append(), .concat(), .join(), .merge()
.append() - useful for stacking 2 series or dataframes vertically .concat() - if you need more flexible stacking, or an inner or outer join on indexes .join() - also joins on indexes but gives you flexibility with left and right joins .merge() - useful for joining on several columns.
if only slicing on the inner level of a multi-index, what must be used with .loc[ ]
.loc[pd.IndexSlice[outer_index_slice,inner_index_slice]]
How does .merge_ordered() order its output?
.merge_ordered() sorts its ouput by column orderings with the first column taking precedence.
What are the three ways to read in multiple data files?
1. using a loop 2. using a comprehension
How do I merge on more than two dataframes?
I chain .merge()
How do I use a loop to read in multiple csv files into a list of dataframes?
I create a list (filenames) and an empty list called dataframes and run this loop: for f in filename: _____dataframes.append(pd.read_csv(f))
How do I read in multiple csv files into a list of dataframes?
I create a list (filenames) and an empty list called dataframes and run this: dataframes = [pd.read_csv(f) for f in filenames]
what does append do for dataframes if they have identical columns including the index?
It stacks the dataframes row-wise.
what does df_name.resample('A').last do?
It takes a datetime column and resamples to output one entry per year and that entry is the end of the year 12-31.
If I pass 0 for level into .sort_index, what gets accomplished?
It will sort dataframe by outermost index.
if I pass a new row label to .reindex(),meaning it is not one in the current dataframe, what will the values be for the entries in this new row?
NaNs
when concatenating horizontally (axis = 'columns') if a column name is shared between the dataframes and the rows have identical indices what will the ouput be?
The output will be a dataframe with two columns of the same name and different values in each column.
What happens if I do df_name['col_name'] = n to the values within the column?
The whole column takes on n.
What does passing a list to the keys parameter of .concat do?
This assigns an outer index to the indices of the dataframes being concatenated. The order of the keys list matches up with the order of the list of dataframes being concatenated.
what does pd.merge_asof() do?
This function constructs a dataframe backwards. The first rows to be merged are from the bottom of the two dataframes and are output at the bottom of the resulting dataframe. For each row in the left dataframe, only a row from the right dataframe whose 'on' column values are less than or equal to the left dataframe's values will be merged. This comparison is done starting from the bottom of the dataframes.
If an index is only in one of the series and we want to add them and not got NaN values, what parameter do we add to .add()?
We add the fill_value = 0 parameter.
what happens when we append dataframes with different index names and column names?
We get a union of indices and columns. We get a bunch of NaNs where we have disjoint columns and the indices from the dataframe passed to append are appended to the other dataframe, even if the indices are the same.
what does np.arange(8).reshape(2,4) + 0.1 output? What does does each method do
[ [ 0.1 1.1 2.1 3.1 ] [ 4.1 5.1 6.1 7.1 ] ]. the arange creates an array of integers from 0 to 7 in this case. the reshape creates a 2d array of 2 rows and 4 columns. The addition of 0.1 broadcasts this value to all entries in the array.
what does a left join do? Is it the opposite for a right join?
a left join keeps all rows of the left dataframe in the merged dataframe. For rows in the left df with matches in the right df, the non-joining columns of the right df are appended to the left df. For rows in the left df with no matches in the right df, the non-joining columns are filled with null values. It is the opposite for a right join
What indices show up when you add two series using the plus operator or .add() method? What happens if an index is only in one of the series?
a union of the indices occur; if an index only shows up in one of the series then the resulting sum is a NaN value.
How do I concatenate horizontally if I have dataframes with the same indices and the same column names if I want columns that denote which dataframe the column came from?
add keys argument to pd.concat().The order of the keys list matches up with the order of the list of the dataframes being concatenated.
What happens when we use .concat() and pass 0 to axis.
axis = 0 ('rows' does the same thing)means that stack rows vertically underneath each other.NaNs are filled in where there are disjoint columns between the dataframes passed to concat.
what does df_name .reindex(list of ordered row labels) do?
creates a new index with ordering of labels corresponding to order of list passed to reindex.
what are the arguments of .divide and what does it do?
df_name.divide(series_name,axis ='rows'). This divide entries in a column of df_name by the series entry-by-entry.
what does pd.merge do without any arguments other than the two dataframes we are merging on?
does an inner join on all common columns.
How do I use glob to import a list of csv files with similar filenames?
from glob import glob filenames = glob('shared_characters*.csv') # glob creates an iterable object with all matching filenames in current directory. These filenames include shared_characters and are followed by some arbitrary sequence of characters(or no characters at all) and end in .csv # using list comprehension way dataframes = [pd.read_csv(f) for f in filenames]
What are the arguments of pd.concat()?
have an axis argument where axis = 0 represents the row access and axis = 1 represents the column axis. Also have a join argument for either 'inner' or 'outer' joins. if do a join on axis = 1, you are doing a join on indexes. if do a join on axis = 0, you are doing a join on columns.
What does the parse_dates argument of .read_csv do?
if passed True it will parse the index and try to give us datetime objects.
What is the difference in use between indices and indexes?
indices: many index labels within index data structure indexes: many pandas index data structures
what does df.expanding.mean() calculate?
it calculates the mean down each column for each entry by running the mean for all entries before and including that entry.
What does filename = "%s.csv" %var do if used in a for loop
it creates filenames that start with whatever value var takes on
what does .reset_index(drop = True) do?
it discards the old index and replaces it with the default integer zero based index.
what does the ignore_index parameter do if set to True? Is this parameter in both .append() and .concat()?
it discards the old index and replaces it with the default integer zero based index. yes.
what does .join() do? What kinds of joins does it do?
it does a left join on the index by default with the left being the df in front of the dot. row orders match the left data frame. it can do an inner,outer(where the index is sorted),left,right
what does df_name.pct_change() do? What will the first value in a column always be?
it finds the percent change of each entry. percent change = (current entry -previous entry)/previous entry. NaN
what does .ffill() do?
it forward fills all entries in a column with the previous non-null value.
what does .append() do for series?
it stacks the series passed into .append() underneath the series the series the method is called on.
what does np.concat() do?
it takes a list of dataframes or arrays and concatenates them either vertically (axis = 0) or horizontally(axis = 1)
Does .append() change index values?
no
How do I stack arrays horizontally?What is a constraint on the function require to stack horizontally?
np.hstack(list of arrays) higher index in list of arrays between two consecutive indexes gets appended to the right. a constraint is the arrays must have the same number of rows.
How do I stack arrays vertically?What is a constraint on the function require to stack vertically?
np.vstack(list of arrays) higher index in list of arrays between two consecutive indexes gets appended below. a constraint is that the arrays need to have the same # of columns.
What are some important arguments of pd.merge() and what can they accomplish?
on = column or list of columns. This specifies which columns to merge on. if have columns not specifies that are still common columns, will have columns modified with a suffix to tell the difference. left_on and right on take in columns names and are useful for merging on columns with same variable but slightly different names. suffixes = list of suffixes: ex: [ '_bronze' , '_gold' ] they get add suffixes to common columns not being merged on and correspond to order of dataframes being passed in. sorted_values with a column passed sorts values by column name.
What is the default join for merge_ordered?
outer.
How do I concatenate dataframes if stored as values in a dictionary?
pd.concat(dict_name.values)
How do I combine scalars and dataframes to do operations?
scalar values will get stretched into same shape as other operand when doing scalar and dataframe operations.
What does passing in n to .pct_change() do?
sets an offset of n entries back as previous entry for the percentage change calculation.
What are some parameters of .merge_ordered()?
suffixes=[ '_suffix1', '_suffix2'] added to common column not being merged on. The order of this list corresponds with the order of the dataframes being passed in. filll_method = 'ffill' does forward filling to replace NaNs
what does pd.concat() take in as argument and what does it do?
takes in a list of series of dataframes and has the parameter axis which has arguments 'rows' or 'columns'.
what happens when we pass 1 or 'columns' to axis in the .concat() method.
the Dataframes are calculated horizontally, meaning shared indexes get data appended together. The row indices are sorted in an ascending manner.
for the index_col argument of the read_csv function, if I pass in a tuple what do the values in the tuple correspond to?
the integers correspond to columns in the column index
what happens if I permute the order of the dataframes in a merge call?
the order of the rows and columns change.
How do we add together three series without NaNs being assigned NaN?
we chain .add() methods together, with each add having the param=arg of fill_value = 0
if I have a dictionary with keys as outer level column labels and values as dataframes and I pass multiple dictionaries to pd.concat (with axes ='columns') what do I output?
we get a dataframe with multi-level column labels.
Can I pass in booleans to .loc[ ] to subset rows?
yes
Does order matter when using reindex with an index object from another dataframe?
yes
can I pass an Index object into .reindex()? What is the use if the Index object comes from another dataframe?
yes. The use is that this operation shows where two dataframes overlap in index labels. NaNs will show where there is indices in the dataframe where .reindex() is appended to and not in the dataframe where the index object is from.
What is an inner join?
you join two dataframes where there is a common column between them. The join combines rows from the first dataframe with the same value in the common column as a row from the second dataframe. This will create a row with a union of column data.
What is an outer join?
you will join two dataframes. The join combines rows from the first dataframe with the same value in the common column as a row from the second dataframe. This will create a row with a union of column data. It also adds rows from both even if there no match in the other dataframe for a shared value in a shared column. These rows will have NaNs for columns that are not shared.