Pandas
How to retrieve a group from a groupby object
.get_group('Group_name')
get the max value of each group in a groupby object
.max()
get mean of a groupby object for each group
.mean()
sum each group in a groupby object
.sum()
difference between axis=0 and axis =1?
0 (axis=0) affects rows and 1 (axis=1) affects columns, e.g. df.drop(label, axis=0) drops a row while axis=1 drops a column
What does the Groupby() method do
Basically groups the data by a certain column, e.g. group all stocks in the DataFrame by sector. This returns a DataFrameGroupBy object
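A minimal sketch of the idea (the stock DataFrame and column names here are made up for illustration):
import pandas as pd

df = pd.DataFrame({
    'Ticker': ['AAPL', 'MSFT', 'XOM', 'CVX'],
    'Sector': ['Tech', 'Tech', 'Energy', 'Energy'],
    'Price':  [150.0, 300.0, 110.0, 160.0],
})
sectors = df.groupby('Sector')       # DataFrameGroupBy object
print(sectors.get_group('Tech'))     # just the Tech rows
print(sectors['Price'].mean())       # mean price per sector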
What is the downside of 'query()'
Cannot query columns whose names contain spaces (newer pandas versions let you wrap such names in backticks)
df.index.set_names(names = ['Day', 'Location'], inplace = True)
Changes the index level names to 'Day' and 'Location'
multi index df.index.set_names(names = 'Day', level = 0, inplace = True)
Changes the name of the first level of the MultiIndex to 'Day'
.value_counts()
Counts how many times each value occurs in a Series
What will this do?: df.drop_duplicates(subset = ['First Name'], keep ='first')
Drops all rows with a duplicated 'First Name' but keeps the first occurrence of each
df = pd.merge(df1, df2)
Easy way to do a merge
What will this do: mask = df['Salary'].duplicated() df[mask]
Filters the DataFrame to rows whose 'Salary' value duplicates an earlier row
What will this do: mask = df['Team'].isin(['Legal', 'Sales', 'Product']) df[mask]
Filters the DataFrame to rows whose 'Team' value is Legal, Sales, or Product
.describe()
Gives us summary statistics for the df or series, such as the count, mean, std, min, max, etc...
What will this do? df.sample(frac = .50, axis=1)
Grab a random sample of 50% of the columns
What is a time delta
It shows us the difference between two dates/times. A Timedelta is produced when you subtract one time from another
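A quick sketch (the dates are arbitrary):
import pandas as pd

start = pd.Timestamp('2020-01-01 08:00')
end = pd.Timestamp('2020-01-03 10:30')
delta = end - start            # a Timedelta object
print(delta)                   # 2 days 02:30:00
print(delta.days)              # 2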
What will this do?: data["Charmander":"Weedle":2]
It will grab the rows from the index label 'Charmander' to 'Weedle', skipping every other row
What will this do: df.loc['Goldfinger', 'Actor']
It will grab the row with the index label 'Goldfinger' and return the value under the 'Actor' column
What does the unstack method do?
It will turn a stacked Series back into a DataFrame
What will this do: df = pd.read_csv('bigmac.csv', parse_dates = ['Date'], index_col = ['Date', 'Country'])
Makes the 'Date' column into datetimes (parse_dates) and sets 'Date' and 'Country' as a MultiIndex
mask = df['Actor'] == 'Sean Connery' df.where(mask)
Makes NaN values for all rows that don't have Sean Connery as the actor, keeping the rest of the DataFrame as-is
Sample()
Python function that is part of the random module and returns randomly chosen items from a sequence
What will this do?: df['Team'].isin(['Legal', 'Sales', 'Product'])
Returns booleans for whether each row's 'Team' value is Legal, Sales, or Product
groupby object .groups
Returns a dictionary mapping each group name to the index labels of every row in that group
Method
A function attached to an object, called with parentheses: items like print(), .append(), .index(), etc...
.isin()
Returns a boolean Series indicating whether each value is in the given collection
groupbyObject.agg(['size', 'sum', 'mean'])
Applies each aggregation ('size', 'sum', 'mean') to every column of every group, producing one output column per (column, aggregation) pair
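A small sketch of what that output looks like (the data is made up):
import pandas as pd

df = pd.DataFrame({'Sector': ['Tech', 'Tech', 'Energy'],
                   'Revenue': [100, 200, 50]})
print(df.groupby('Sector').agg(['size', 'sum', 'mean']))
#         Revenue
#            size  sum   mean
# Sector
# Energy        1   50   50.0
# Tech          2  300  150.0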
Series
Series are a one-dimensional labeled array. They are like a DataFrame but a single column, consisting of the data and its index labels: lottery = [1, 2, 2, 61, 6, 15]; pd.Series(lottery) outputs the index/value pairs 0: 1, 1: 2, 2: 2, 3: 61, 4: 6, 5: 15
.reindex
Conforms the data to a new index, showing NaN for labels that are missing. For example, data.reindex(index=['fiurujfi']): this value doesn't exist, so it gives an index label with a value of NaN
What is a Left Join?
This JOIN will return all rows from the left table (this is the table you specify first in your code) and any matching records from the right table (this is the table you specify second in your code)
What is a right join?
This JOIN will return all the rows from the right table (this is the table you specify second in your code) and any matches from the left table (this is the table you specify first in your code).
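A minimal sketch contrasting the two (the customer/order frames are made up):
import pandas as pd

left = pd.DataFrame({'customer_id': [1, 2], 'name': ['Ann', 'Bob']})
right = pd.DataFrame({'customer_id': [2, 3], 'order': ['Book', 'Pen']})
print(left.merge(right, how='left', on='customer_id'))   # keeps ids 1 and 2
print(left.merge(right, how='right', on='customer_id'))  # keeps ids 2 and 3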
What will this do?: sean = df['Actor'] == 'Sean Connery' df.loc[sean]
This will filter the data set down to the rows that have Sean Connery as the actor
What does the stack method do
Turns a multi-column DataFrame into a Series with a MultiIndex (the column labels become an inner index level)
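A small sketch of stack/unstack round-tripping (made-up data):
import pandas as pd

df = pd.DataFrame({'2019': [1, 2], '2020': [3, 4]}, index=['A', 'B'])
stacked = df.stack()         # Series with a (row label, column label) MultiIndex
print(stacked)
print(stacked.unstack())     # back to the original DataFrame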
What will this do: mask = df['Salary'].between(6000,70000) df[mask]
Will filter the data to rows where the salary is between 6000 and 70000
What will this do?: df.sample(5)
Will return 5 random rows
df1.append(df2)
Will append the rows of df2 to df1 (DataFrame.append was removed in newer pandas versions; use pd.concat instead)
pd.concat(objs = [df1, df2], ignore_index = False)
Will concat the two DataFrames, keeping each one's original index labels (so labels can repeat)
directors = df['Actor'].copy()
Will copy a column of the DataFrame into a variable called directors
pd.concat(objs = [df1, df2], keys=['df1Name', 'df2Name'])
Will create a DataFrame with a MultiIndex whose outer level ('df1Name', 'df2Name') records which original DataFrame each row came from
What will this do? df.drop('Dr. No', inplace = True)
Will delete the row with the index label 'Dr. No', in place
df1.merge(df2, how='inner', on = 'column_name')
Will do an inner join of the two DataFrames on the given column
.join() method
Will do a horizontal merge on the index; convenient when the two data sets share the same index
df1.merge(df2, how='outer', on = 'column_name', indicator=True)
Will do an outer merge and add a '_merge' column showing where each row was found: left_only, right_only, or both
.map()
Will essentially map one Series or dict onto another, like a lookup. For example, say we have two Series: series one holds foods, series two maps each food to a restaurant. series1.map(series2) replaces each food with its restaurant
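A sketch of that food/restaurant idea (the values are invented):
import pandas as pd

foods = pd.Series(['burger', 'sushi', 'taco'])
restaurants = pd.Series({'burger': 'Diner', 'sushi': 'Sushi Bar', 'taco': 'Taqueria'})
print(foods.map(restaurants))   # each food is looked up and replaced by its restaurant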
What will this do?: df.query("Actor == 'Sean Connery' or Director == 'John Glen'")
Will extract all rows where the 'Actor' column is 'Sean Connery' or the 'Director' is 'John Glen'
What will this do?: filter1 = df['Gender']=='Male' filter2 = df['Team']=='Marketing' df[filter1 & filter2]
Will filter the df by Males in Marketing
What will this do?: filter1 = df['Gender']=='Male' filter2 = df['Team']=='Marketing' df[filter1 | filter2]
Will filter the df to males or people in Marketing
What does a inner join do
Will make a DataFrame containing only the keys that exist in both DataFrames. E.g. when merging on customer id: if id 1 exists in df1 but not in df2, it will not be in the merged DataFrame, but id 2, which is in both DataFrames, will be
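That customer-id example as a runnable sketch (the names and orders are made up):
import pandas as pd

df1 = pd.DataFrame({'customer_id': [1, 2], 'name': ['Ann', 'Bob']})
df2 = pd.DataFrame({'customer_id': [2, 3], 'order': ['Book', 'Pen']})
print(df1.merge(df2, how='inner', on='customer_id'))
#    customer_id name order
# 0            2  Bob  Book    (only id 2 appears in both frames)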
What will this do?: df.columns = ['s','s','f','g', 'h', 'h']
Will name the columns with the values in the list (the list must match the number of columns)
multi index df.loc[('2010-01-01', 'Argentina'), ["Price in US Dollars"]]
Will pull the row at the given MultiIndex key ('2010-01-01', 'Argentina') and return the listed columns
What will this do?: df.rename(mapper ={'GoldenEye': 'Golden Eye', 'The World Is Not Enough': 'Best bond movie ever'}, inplace=True)
Will rename the index labels matching the dictionary keys to the corresponding values; inplace=True applies the change to the DataFrame itself
What will this do?: df.columns = [column_name.replace(' ', '_') for column_name in df.columns]
Will rename the columns to have underscores instead of spaces
groupby object .first()
Will return the first item in each group
what will this do?: data[data.idxmin()]
Will return the lowest value: idxmin() gives the index label of the lowest value, and data[...] looks that label back up
df.nsmallest(3, columns ='Box Office')
Will return the three smallest box office values
list()
Will turn a series into a list
How to do a linear regression
import statsmodels.api as sm X1 = sm.add_constant(X1) results = sm.OLS(X2, X1).fit()
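A fuller sketch with statsmodels, assuming X1 is the independent variable and X2 the dependent one (the data here is invented):
import numpy as np
import statsmodels.api as sm

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # independent variable
X2 = np.array([3.1, 4.9, 7.2, 9.0, 11.1])    # dependent variable, roughly 2*X1 + 1
X1 = sm.add_constant(X1)                     # adds the intercept column
results = sm.OLS(X2, X1).fit()               # note: OLS(endog, exog) = (y, X)
print(results.params)                        # [intercept, slope]
print(results.summary())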
Module
a file of Python code that provides functions and classes. Examples: pandas, datetime, flask, etc...
Convention
a generally agreed-upon practice in programming. Examples: import pandas as pd, or for i in list
what does the agg() method do to a groupby object
allows us to apply one or more aggregation functions, basically an apply, to each column of the groupby object
What is a multiindex
an index that has multiple levels of keys, like a nested DataFrame inside a DataFrame. For example, for each date there could be a secondary key labeled ticker with each company's ticker, and for each (date, ticker) pair a row of OHLC data
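A sketch of that date/ticker layout (the tickers and prices are made up):
import pandas as pd

df = pd.DataFrame({
    'Date':   ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'Ticker': ['AAPL', 'MSFT', 'AAPL', 'MSFT'],
    'Open':   [74.1, 158.8, 74.3, 158.3],
    'Close':  [75.1, 160.6, 74.4, 158.6],
}).set_index(['Date', 'Ticker'])
print(df.loc[('2020-01-01', 'AAPL')])        # one (date, ticker) row
print(df.index.get_level_values('Ticker'))   # all labels in the Ticker level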
.apply()
apply runs a function on each value of a Series and replaces the value with whatever the function returns. For example, with a function that returns 'ok' for values under 300, data.apply(c) replaces every value below 300 with 'ok': def c(num): if num < 300: return 'ok' data.apply(c)
Make it so it pulls data from a yahoo finance dateframe for every day your birthday occurred
birthday = pd.date_range(start = "1991-04-12", end = "2010-12-31", freq = pd.DateOffset(years = 1)) birthday_stock_info = df.index.isin(birthday) df[birthday_stock_info]
When there is more than one column (a 2D df), what will the first []/index return? What will the second one return?
columns, then rows, e.g. nba['Name'][1]
What is a outer join
combines rows from both data sets, keeping keys that exist in both as well as keys that exist in only one of them
groupbyObject.agg({'revenues':'sum', 'debt':'sum' })
creates a DataFrame with the sum of the 'revenues' and 'debt' columns for each group, with the group labels as the index
turn a datetime series into a series that just has the weekday names
df['Date'].dt.day_name() (the .dt accessor works on a datetime Series; older pandas used .dt.weekday_name)
How to create a multi index
df.set_index(keys=['Date', 'Country'])
left merge on column right merge on index
df1.merge(df2, how='left', left_on = 'column_name', right_index = True)
Do a merge where the column names aren't the same
df1.merge(df2, how='left', left_on = 'column_name', right_on = 'otherColumnName', indicator=True)
do an outer join in pandas that keeps both the matching and the non-matching rows
df1.merge(df2, how='outer', on = 'column_name')
left join in pandas
df1.merge(df2, how='left', on = 'column_name', indicator=True)
How would we change the name of 'DOG' in the dataframe in the column of animal to 'CAT'
df['animal'] = df['animal'].str.replace('DOG', 'CAT')
See if the value 'turtle' is contained within the 'animals' column
df['animals'].str.contains('turtle'). This returns a boolean Series
How to split values by ',' and have them so the new df cell values are in lists
df['animals'].str.split(',')
How to split values by ',' and have them so the new df cell values are in lists and make it so we are grabbing the first value in the list
df['animals'].str.split(',').str.get(0). str.get(0) grabs the first element of each list
Check if a dataframe column of animals starts with 'turtle'
df['animals'].str.startswith('turtle'). This returns a boolean Series
How to strip white spaces?
df['animals'].str.strip()
How would I get rid of the '$' in a column named salary and change it to a float?
df['salary'].str.replace('$', '').astype(float)
How to get a time delta column in a dataframe
df['timeDelta'] = df['startDate'] - df['endDate']
split the data frame by ',' and put the values into two different columns and assign the new columns names
df[['first', 'last']] = df['name'].str.split(',', expand=True). expand=True essentially turns the split pieces into separate columns
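A quick sketch (the names are invented):
import pandas as pd

df = pd.DataFrame({'name': ['Sean,Connery', 'Roger,Moore']})
df[['first', 'last']] = df['name'].str.split(',', expand=True)
print(df)
#            name  first     last
# 0  Sean,Connery   Sean  Connery
# 1   Roger,Moore  Roger    Moore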
Attribute
features of objects, something like .shape or .__dict__; basically anything that doesn't require ()
.fillna()
fills na values with a set value
What will this do: mask = df['Team'].notnull() df[mask]
filters the DataFrame to rows where 'Team' is not null (drops the null rows)
Sum a certain column of each group in a group by object
groupbyobject['column'].sum()
What will this do?: df.loc[['Goldfinger', 'Thunderball'], ['Actor', 'Box Office']]
it will grab the rows with the index labels 'Goldfinger' and 'Thunderball' and then select the 'Actor' and 'Box Office' columns for those rows
if there is a column in a df that is labeled Team what would the outcome be if we did something like df.Team
it would return the values under the 'Team' column as a Series. This only works when the column name doesn't have spaces
Create a datetime object and get the day and year from it
import datetime as dt pointintime = dt.datetime(2010, 1, 10, 8, 13, 57) day = pointintime.day year = pointintime.year
.get()
in a Series, retrieves a value by index label. The benefit of this is that you can pass a 'default', which is returned if the value isn't found
What will this do? df.nlargest(3, columns ='Box Office')
it will extract the 3 largest values under the box office column
What will this do?: df[df['Team'] == 'Finance']
makes the DataFrame show only the people who are on the Finance team. This is a good way to filter data
Merged outer join data frame that has only values from left and right merge not inner
merged = df1.merge(df2, how='outer', on = 'column_name', indicator=True) mask = merged['_merge'].isin(['left_only', 'right_only']) merged[mask]
Difference between attributes and methods
methods are functions attached to an object and require () after them, e.g. print(), while attributes are features of an object accessed without ()
Make a pandas datetime object
pd.Timestamp("2015-03-31")
Move backward in time in 1-hour increments with pandas
pd.date_range(end = "2018-01-01", periods = 24, freq = "H") (with only an end date you must also give periods; 24 here is just an example count, and "H" counts back one hour at a time)
Create a date range index with pandas with increments by 3 hours
pd.date_range(start = "2016-01-01", end = "2018-01-01", freq = "3H")
Create a date range index with pandas with increments by business days
pd.date_range(start = "2016-01-01", end = "2018-01-01", freq ="B")
Create a date range index with pandas with increments of 1 day
pd.date_range(start = "2016-01-01", end = "2018-01-01", freq ="D")
Change the display so values appear rounded without actually being rounded
pd.set_option('display.precision', 2) (older pandas also accepted just 'precision')
how to add 5 days and 5 minutes and 5 hours to a datetime object
pd.to_datetime("2001-01-11") + pd.DateOffset(days = 5, hours =5, minutes=5)
create a pandas datetime index
pd.to_datetime(["2015-03-31", "2015-03-20", "2015-03-11"]) will convert the list into a DatetimeIndex of Timestamp objects
Make a df into dates, and if a date isn't a real date change it to a NaT value
pd.to_datetime(df, errors = 'coerce')
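A small sketch (the strings are arbitrary):
import pandas as pd

s = pd.Series(['2015-03-31', 'not a date', '2015-03-11'])
print(pd.to_datetime(s, errors='coerce'))   # the bad string becomes NaT instead of raising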
Turn a dataframe column into a datetime format
pd.to_datetime(df['Start Date'])
What will this do?: df.loc['Goldfinger':'Thunderball', 'Actor':'Box Office']
pulls rows from 'Goldfinger' to 'Thunderball' and all columns from 'Actor' to 'Box Office' (label slices include both endpoints)
index
referencing the rows
df.pop('Actor')
removes a single column and returns it as well. No need to use 'inplace=True' here
is_unique
returns a boolean: True if all the values in the data are unique
.count()
returns the number of non-null values, so it basically excludes all NaN values
What will this do? df.iloc[1:15, 2:5]
rows at positions 1 through 14 and columns at positions 2 through 4 (the end of each slice is excluded)
.shape
shows the number of rows and the number of columns (as a tuple)
.size
shows us the number of cells in a DataFrame, including null values
ndim
shows us the number of dimensions of the data (1 for a Series, 2 for a DataFrame)
Stationarity
the mean and variance are constant throughout the time series, which means a stationary time series is drawn from the same probability distribution at every point in time
sorted()
the rows are organized in alphabetical or sequential order based on the data in one or more columns
How to iterate through the groups in a groupby object
use a for loop, e.g.: for sector, data in sectors: data.nlargest(1, "revenue")
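A runnable sketch of that loop (the sector data is made up):
import pandas as pd

df = pd.DataFrame({'sector': ['Tech', 'Tech', 'Energy'],
                   'company': ['AAPL', 'MSFT', 'XOM'],
                   'revenue': [260, 143, 265]})
sectors = df.groupby('sector')
for sector, data in sectors:            # each iteration gives (group name, sub-DataFrame)
    print(sector)
    print(data.nlargest(1, 'revenue'))  # biggest company in each sector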
Select multiple index's from a series or dataframe
use double brackets [ [] ], e.g. data[[300, 200, 100]]
del df['Director']
will delete a column. del is a Python keyword, not part of pandas
.dropna()
will delete all rows that have NaN values (use .fillna() to replace them instead of dropping). We can switch it to delete columns instead, or check only certain columns, e.g. nba.dropna(subset=['Salary'])
.insert()
will essentially create a column. It takes the arguments of the column position (loc), the column name, and then the values you want to go in
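A minimal sketch (the column names and values are made up):
import pandas as pd

df = pd.DataFrame({'Revenue': [100, 200], 'Cost': [60, 120]})
df.insert(1, 'Profit', df['Revenue'] - df['Cost'])   # insert(loc, column, value)
print(df.columns.tolist())                           # ['Revenue', 'Profit', 'Cost']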
.info()
will give us a summary of the DataFrame: the index, the columns and their dtypes, non-null counts, and memory usage
What will this do: mask = ~df['Gender'].duplicated() df[mask]
will keep only the first occurrence of each 'Gender' value (~ inverts the duplicated mask)
index_col
will make the index take the values of a specified column. rev = pd.read_csv('revenue.csv', index_col='Date') makes the index the Date column instead of the default 0, 1, 2, 3
.astype()
will make the value a certain data type, e.g. .astype('int') turns a specified column into integers
df.index.get_level_values('Date')
will return all the index labels in the 'Date' level of the MultiIndex
dir()
will show all methods and attributes within the object
.to_dict()
will turn a python df into a dict
dict()
will turn series into dictionary