Pandas

How to retrieve a group from a groupby object

.get_group('Group_name')

get the max value of each group in a groupby object

.max()

get mean of a groupby object for each group

.mean()

sum each group in a groupby object

.sum()

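What is the difference between axis=0 and axis=1?

axis=0 refers to the rows (the index) and axis=1 refers to the columns. For example, df.drop(label, axis=0) drops a row and df.drop(label, axis=1) drops a column; a reduction such as df.sum(axis=0) collapses the rows and returns one value per column

What does the groupby() method do?

Groups the rows by the values in one or more columns, e.g. grouping all stocks in the DataFrame by sector. It returns a DataFrameGroupBy object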
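
The groupby cards above can be tried together in a minimal sketch. The stocks table, its column names, and its values are made up for illustration; only the pandas calls come from the cards.

```python
import pandas as pd

# Hypothetical data: four made-up stocks in two sectors.
stocks = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "sector": ["Tech", "Tech", "Energy", "Energy"],
    "revenue": [10, 30, 5, 15],
})

sectors = stocks.groupby("sector")       # a DataFrameGroupBy object

tech = sectors.get_group("Tech")         # the rows of a single group
max_rev = sectors["revenue"].max()       # largest revenue in each sector
mean_rev = sectors["revenue"].mean()     # average revenue in each sector
sum_rev = sectors["revenue"].sum()       # total revenue in each sector
```

Selecting the "revenue" column before aggregating keeps the text columns out of the calculation, which matters for mean() on recent pandas versions.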

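What is the downside of query()?

Column names that contain spaces cannot be written directly in the query string. Recent pandas versions accept them wrapped in backticks; otherwise rename the columns first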

df.index.set_names(names = ['Day', 'Location'], inplace = True)

Changes the index names to day and location

MultiIndex: df.index.set_names(names = 'Day', level = 0, inplace = True)

Changes the name of the first level of the MultiIndex to 'Day'

.value_counts()

Counts how many times each distinct value occurs in a Series

What will this do?: df.drop_duplicates(subset = ['First Name'], keep ='first')

Drops every row whose 'First Name' repeats an earlier row, keeping the first occurrence of each name

df = pd.merge(df1, df2)

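The simplest way to do a merge: with no other arguments it is an inner join on every column name the two DataFrames share

What will this do: mask = df['Salary'].duplicated(); df[mask]

Returns only the rows whose Salary repeats a value seen in an earlier row (the first occurrence of each value is not included)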
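
A small sketch of how duplicated() and drop_duplicates() behave, assuming a made-up four-row table. The names and salaries are placeholders.

```python
import pandas as pd

# Hypothetical data: "Ann" and the salary 100 each appear twice.
df = pd.DataFrame({
    "First Name": ["Ann", "Bob", "Ann", "Cy"],
    "Salary": [100, 200, 100, 300],
})

dup_mask = df["Salary"].duplicated()   # True only for repeats of an earlier value
repeats = df[dup_mask]                 # rows that repeat an earlier Salary
first_seen = df[~dup_mask]             # first occurrence of each Salary

# Same idea in one call, judged on the 'First Name' column only.
deduped = df.drop_duplicates(subset=["First Name"], keep="first")
```

The ~ operator inverts the Boolean mask, which is why df[~dup_mask] keeps exactly the rows that df[dup_mask] leaves out.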

What will this do: mask = df['Team'].isin(['Legal', 'Sales', 'Product']); df[mask]

Filters the DataFrame to the rows whose 'Team' value is 'Legal', 'Sales', or 'Product'

.describe()

Returns summary statistics for a DataFrame or Series, such as count, mean, std, min, quartiles, and max

What will this do? df.sample(frac = .50, axis=1)

Grab a random sample of 50% of the columns

What is a time delta

It shows us the distance between two dates/times. This is generated when you subtract one time from another

What will this do?: data["Charmander":"Weedle":2]

Slices from the index label 'Charmander' through 'Weedle' (both included), taking every second row

What will this do: df.loc['Goldfinger', 'Actor']

Selects the row labeled 'Goldfinger' and returns its value in the 'Actor' column

What does the unstack method do?

Moves the innermost index level into the columns, turning a stacked Series back into a DataFrame

What will this do: df = pd.read_csv('bigmac.csv', parse_dates = ['Date'], index_col = ['Date', 'Country'])

Reads the CSV, parses the 'Date' column as datetimes, and uses 'Date' and 'Country' together as a MultiIndex

mask = df['Actor'] == 'Sean Connery'; df.where(mask)

Keeps the shape of the DataFrame but replaces every row where the actor is not Sean Connery with NaN values

random.sample()

A function in Python's random module: random.sample(population, k) returns a list of k distinct items chosen at random from a sequence

What will this do?: df['Team'].isin(['Legal', 'Sales', 'Product'])

Returns a Boolean Series that is True for each row whose 'Team' value is 'Legal', 'Sales', or 'Product'

groupby object .groups

Returns a dictionary that maps each group name to the index labels of the rows belonging to that group

Method

A function that belongs to an object and is called with parentheses, such as list.append() or str.index(). (print() is a built-in function rather than a method.)

.isin()

Returns a Boolean for each element showing whether it appears in the given collection of values

groupbyObject.agg(['size', 'sum', 'mean'])

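Applies every listed aggregation to each group and returns a DataFrame with one row per group and, for each original column, one sub-column per aggregation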

Series

A Series is a one-dimensional labeled array. It is like a single column of a DataFrame, made up of the data and its index labels: lottery = [1,2,2,61,6,15]; pd.Series(lottery) Output (index then value): 0 1, 1 2, 2 2, 3 61, 4 6, 5 15

.reindex

Conforms the data to a new set of index labels. A label that does not exist in the original gets a value of NaN, e.g. data.reindex(index=['fiurujfi']) returns a single row labeled 'fiurujfi' containing NaN

What is a Left Join?

This JOIN will return all rows from the left table (this is the table you specify first in your code) and any matching records from the right table (this is the table you specify second in your code)

What is a right join?

This JOIN will return all the rows from the right table (this is the table you specify second in your code) and any matches from the left table (this is the table you specify first in your code).

What will this do?: sean = df['Actor'] == 'Sean Connery'; df.loc[sean]

Returns only the rows that have Sean Connery as the actor

What does the stack method do?

Moves the columns into the innermost index level, turning a multi-column DataFrame into a Series with a MultiIndex
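
To see stack() and unstack() as inverses, here is a minimal sketch on a made-up two-row table. The labels and numbers are placeholders, and both columns are floats so the round trip keeps a single dtype.

```python
import pandas as pd

# Hypothetical data with two rows and two numeric columns.
df = pd.DataFrame(
    {"price": [1.0, 2.0], "volume": [10.0, 20.0]},
    index=["AAA", "BBB"],
)

stacked = df.stack()           # Series indexed by (row label, column name)
restored = stacked.unstack()   # innermost level moves back into the columns
```

Each value in the stacked Series is addressed by a tuple such as ("AAA", "volume"), which is what makes the result a MultiIndex.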

What will this do: mask = df['Salary'].between(6000,70000); df[mask]

Returns the rows where the salary is between 6000 and 70000 (both bounds included by default)

What will this do?: df.sample(5)

Returns 5 random rows

df1.append(df2)

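Returns a new DataFrame with the rows of df2 appended to df1. DataFrame.append() was removed in pandas 2.0; use pd.concat([df1, df2]) instead

pd.concat(objs = [df1, df2], ignore_index = False)

Concatenates the two DataFrames with each keeping its original index labels, so labels can repeat (ignore_index = True would renumber the rows from 0)

directors = df['Actor'].copy()

Copies a column of the DataFrame into a variable called directors

pd.concat(objs = [df1, df2], keys=['df1Name', 'df2Name'])

Concatenates the two DataFrames under a MultiIndex whose outer level ('df1Name' or 'df2Name') records which DataFrame each row came from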
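
The three concat variants can be compared in a short sketch. The two tiny DataFrames are made up; the key names are the ones used on the card.

```python
import pandas as pd

# Two hypothetical DataFrames with the same single column.
df1 = pd.DataFrame({"x": [1, 2]})
df2 = pd.DataFrame({"x": [3, 4]})

kept = pd.concat([df1, df2])                        # index labels 0, 1, 0, 1
renumbered = pd.concat([df1, df2], ignore_index=True)   # index labels 0, 1, 2, 3
keyed = pd.concat([df1, df2], keys=["df1Name", "df2Name"])

second = keyed.loc["df2Name"]    # just the rows that came from df2
```

The keys argument is useful when the origin of each row needs to stay recoverable after concatenation.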

What will this do? df.drop('Dr. No', inplace = True)

Deletes the row whose index label is 'Dr. No'

df1.merge(df2, how='inner', on = 'column_name')

Does an inner join of the two DataFrames on 'column_name'

.join() method

Combines the columns of two DataFrames side by side, matching rows on their index (a left join by default)

df1.merge(df2, how='outer', on = 'column_name', indicator=True)

Does an outer merge and adds a '_merge' column showing whether each row was found in the left DataFrame only ('left_only'), the right only ('right_only'), or both ('both')

.map()

Replaces each value in a Series with the value it maps to in another Series or dict. For example, say series1 holds foods and series2 holds restaurants indexed by food. series1.map(series2) looks up each food in the index of series2 and returns the matching restaurant (NaN where there is no match)
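
The food and restaurant example can be written out as a sketch. The foods and the restaurant names are invented; the point is that the lookup Series must be indexed by the values being mapped.

```python
import pandas as pd

foods = pd.Series(["pizza", "sushi", "tacos"])

# Lookup table: the index is the food, the value is a made-up restaurant.
restaurants = pd.Series(["Luigi's", "Sakura"], index=["pizza", "sushi"])

mapped = foods.map(restaurants)   # "tacos" has no entry, so it becomes NaN
```

A plain dict such as {"pizza": "Luigi's"} could be passed to map() in the same way.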

What will this do?: df.query("Actor == 'Sean Connery' or Director == 'John Glen'")

Returns all rows where the 'Actor' column is 'Sean Connery' or the 'Director' column is 'John Glen'

What will this do?: filter1 = df['Gender']=='Male'; filter2 = df['Team']=='Marketing'; df[filter1 & filter2]

Returns the rows for males who are in Marketing

What will this do?: filter1 = df['Gender']=='Male'; filter2 = df['Team']=='Marketing'; df[filter1 | filter2]

Returns the rows for anyone who is male or in Marketing (or both)

What does an inner join do?

Keeps only the rows whose key appears in both DataFrames. E.g. when merging on customer id, id 1 is dropped if it exists in df1 but not in df2, while id 2 is kept if it is in both
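
The four join types can be compared side by side in a minimal sketch, assuming two made-up tables that share an id column. Ids 2 and 3 appear in both tables, id 1 only on the left, and id 4 only on the right.

```python
import pandas as pd

# Hypothetical tables keyed on "id".
customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"id": [2, 3, 4], "total": [20, 30, 40]})

inner = customers.merge(orders, how="inner", on="id")   # ids in both tables
left = customers.merge(orders, how="left", on="id")     # every customer row
right = customers.merge(orders, how="right", on="id")   # every order row
outer = customers.merge(orders, how="outer", on="id")   # every id from either table
```

Rows without a partner in the other table keep their place and receive NaN in the columns that come from that table.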

What will this do?: df.columns = ['s','s','f','g', 'h', 'h']

Will name the columns to the values in the list

MultiIndex: df.loc[('2010-01-01', 'Argentina'), ["Price in US Dollars"]]

Selects the row at the given (Date, Country) index tuple and returns the listed columns
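
A minimal sketch of building a two-level index and selecting from it. The prices are made up, and the dates are kept as plain strings here (an assumption that keeps the lookup simple; parsed datetimes work as well).

```python
import pandas as pd

# Hypothetical prices; "Date" is left as text for this sketch.
df = pd.DataFrame({
    "Date": ["2010-01-01", "2010-01-01", "2011-01-01"],
    "Country": ["Argentina", "Brazil", "Argentina"],
    "Price in US Dollars": [1.5, 2.5, 3.5],
}).set_index(keys=["Date", "Country"])

# A tuple addresses one row; the second argument picks the column.
price = df.loc[("2010-01-01", "Argentina"), "Price in US Dollars"]

# All values of one level of the index.
countries = df.index.get_level_values("Country")
```

Passing a list of column names instead of a single name returns a labeled result rather than a bare value.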

What will this do?: df.rename(mapper ={'GoldenEye': 'Golden Eye', 'The World Is Not Enough': 'Best bond movie ever'}, inplace=True)

Renames the index labels that match the dictionary keys to the corresponding values. inplace=True changes df itself instead of returning a new DataFrame

What will this do?: df.columns = [column_name.replace(' ', '_') for column_name in df.columns]

Renames the columns so that they have underscores instead of spaces

groupby object .first()

Returns the first (non-null) value of each column within each group

what will this do?: data[data.idxmin()]

idxmin() returns the index label of the lowest value, so indexing data with it returns the lowest value itself

df.nsmallest(3, columns ='Box Office')

Returns the three rows with the smallest 'Box Office' values

list()

Will turn a series into a list

How to do a linear regression

import statsmodels.api as sm; X1 = sm.add_constant(X1); results = sm.OLS(X2, X1).fit() (X2 is the dependent variable and X1 holds the predictors)

Module

A file of Python code that defines functions, classes, and variables you can import. Examples include datetime; larger libraries such as pandas and Flask are packages made up of many modules

Convention

A generally agreed-upon practice in programming. Examples: import pandas as pd, or for i in list

What does the agg() method do to a groupby object?

Lets us apply one or more aggregation functions to each column of a groupby object, with a different function per column if needed

What is a MultiIndex?

An index with more than one level of keys, which works like DataFrames nested inside a DataFrame. For example, the outer level could be the date and the inner level the ticker, so that each date holds one row per company ticker with its OHLC (open, high, low, close) values

.apply()

Calls a function on every value of a Series (or on every column or row of a DataFrame) and returns the results. For example, given a function that returns 'ok' for values below 300, data.apply(c) returns 'ok' for those values and None for the rest: def c(num): if num < 300: return 'ok'; data.apply(c)
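
The apply() card can be run as a small sketch. The numbers are made up, and the function name label and the "high" branch are additions for illustration so that every value gets a label.

```python
import pandas as pd

data = pd.Series([100, 250, 400])   # hypothetical values


def label(num):
    """Return 'ok' for values below 300 and 'high' otherwise."""
    if num < 300:
        return "ok"
    return "high"


labels = data.apply(label)   # calls label() once per value
```

Without the final return statement the function would return None for values of 300 or more, which is what happens in the shorter version on the card.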

Pull the rows of a Yahoo Finance DataFrame for every day your birthday occurred

birthdays = pd.date_range(start = "1991-04-12", end = "2010-12-31", freq = pd.DateOffset(years = 1)); birthday_stock_info = df.index.isin(birthdays); df[birthday_stock_info]
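
A runnable sketch of the birthday filter, assuming a made-up daily table in place of downloaded price data. The column name close and its values are placeholders.

```python
import pandas as pd

# Hypothetical daily data standing in for a downloaded price history.
idx = pd.date_range(start="1991-04-10", end="1993-04-14", freq="D")
df = pd.DataFrame({"close": range(len(idx))}, index=idx)

# One timestamp per year, starting on the first birthday.
birthdays = pd.date_range(
    start="1991-04-12", end="2010-12-31", freq=pd.DateOffset(years=1)
)

mask = df.index.isin(birthdays)   # True where the row's date is a birthday
on_birthday = df[mask]
```

Real market data has no rows for weekends or holidays, so birthdays that fall on those days simply do not appear in the result.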

When a DataFrame has more than one column (is 2D), what does the first [] select, and what does the second one select?

The first [] selects the column and the second selects the row within it, e.g. nba['Name'][1]

What is an outer join?

Combines the rows from both DataFrames: rows that match on the key are merged, and rows that exist in only one DataFrame are kept with NaN in the other's columns

groupbyObject.agg({'revenues':'sum', 'debt':'sum' })

Creates a DataFrame holding the per-group sum of each listed column, with the groups as the index
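
Both forms of agg() can be sketched on a made-up table. The column names revenues and debt come from the card; the sector column and all values are invented.

```python
import pandas as pd

# Hypothetical company data.
companies = pd.DataFrame({
    "sector": ["Tech", "Tech", "Energy"],
    "revenues": [10, 20, 5],
    "debt": [1, 2, 3],
})
sectors = companies.groupby("sector")

# Dict form: choose the aggregation for each column.
totals = sectors.agg({"revenues": "sum", "debt": "sum"})

# List form: every aggregation is applied to every remaining column,
# giving two-level column labels such as ("revenues", "mean").
summary = sectors.agg(["size", "sum", "mean"])
```

In both results the index is the group name, so a single figure can be read with .loc, for example totals.loc["Tech", "revenues"].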

Turn a datetime Series into a Series of weekday names

df['Date'].dt.day_name() (older pandas versions used .dt.weekday_name, which was removed in pandas 1.0)

How to create a multi index

df.set_index(keys=['Date', 'Country'])

Left merge on a column of the left DataFrame and the index of the right DataFrame

df1.merge(df2, how='left', left_on = 'column_name', right_index = True)

Do a merge where the column names aren't the same

df1.merge(df2, how='left', left_on = 'column_name', right_on = 'otherColumnName', indicator=True)

Do an outer join in pandas that keeps both the matched rows and the unmatched rows

df1.merge(df2, how='outer', on = 'column_name')

left join in pandas

df1.merge(df2, how='left', on = 'column_name', indicator=True)

How would we change the value 'DOG' to 'CAT' in the column named animal?

df['animal'].str.replace('DOG', 'CAT')

See if the value 'turtle' is contained within the column animals

df['animals'].str.contains('turtle'). This returns a Boolean for each row

How to split values by ',' so that each cell of the result holds a list

df['animals'].str.split(',')

How to split values by ',' into lists and grab the first value of each list

df['animals'].str.split(',').str.get(0). str.get(0) takes the first item of each list

Check if the values in the column animals start with 'turtle'

df['animals'].str.startswith('turtle'). This returns a Boolean for each row

How to strip white spaces?

df['animals'].str.strip()

How would I get rid of the '$' in a column named salary and change it to a float?

df['salary'].str.replace('$', '', regex=False).astype(float). regex=False makes '$' a literal character instead of the regex end-of-string anchor

How to get a time delta column in a DataFrame

df['timeDelta'] = df['endDate'] - df['startDate']

Split a column by ',' and put the pieces into two different columns with new names

df[['first', 'last']] = df['name'].str.split(',', expand=True). expand=True turns the split pieces into columns
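
Several of the string cards can be combined in one sketch. The names and salaries are made up, and the column first_token is a hypothetical name added for illustration.

```python
import pandas as pd

# Hypothetical data: names stored as "first,last" and salaries as text.
df = pd.DataFrame({
    "name": ["Jane,Doe", "Rick,Roe"],
    "salary": ["$1000.50", "$2000.00"],
})

# expand=True spreads the split pieces across new columns.
df[["first", "last"]] = df["name"].str.split(",", expand=True)

# Without expand, each cell holds a list; .str.get(0) takes its first item.
df["first_token"] = df["name"].str.split(",").str.get(0)

# regex=False treats "$" as a literal character.
df["salary"] = df["salary"].str.replace("$", "", regex=False).astype(float)
```

All of these go through the .str accessor, which applies the string method to every value of the column.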

Attribute

A value or property stored on an object, accessed with dot notation and no parentheses, e.g. df.shape or obj.__dict__

.fillna()

Fills NaN values with a specified value

What will this do: mask = df['Team'].notnull(); df[mask]

Keeps only the rows where 'Team' is not null

Sum a certain column of each group in a group by object

groupbyobject['column'].sum()

What will this do?: df.loc[['Goldfinger', 'Thunderball'], ['Actor', 'Box Office']]

Selects the rows with the index labels 'Goldfinger' and 'Thunderball', and from those rows the columns 'Actor' and 'Box Office'

If a DataFrame has a column labeled Team, what is the outcome of df.Team?

It returns the 'Team' column as a Series. This only works when the column name has no spaces and does not clash with an existing DataFrame attribute or method

Create a datetime object and get the day and minute from it

import datetime as dt; pointintime = dt.datetime(2010,1,10,8,13,57); day = pointintime.day; minute = pointintime.minute

.get()

On a Series, retrieves a value by its index label. The benefit is the 'default' argument, which is returned when the label isn't found instead of raising an error

What will this do? df.nlargest(3, columns ='Box Office')

Returns the 3 rows with the largest values in the 'Box Office' column

What will this do?: df[df['Team'] == 'Finance']

Keeps only the rows for people on the Finance team. Boolean indexing like this is a good way to filter data

Outer merge that keeps only the rows found in just one DataFrame (left only or right only), not the rows found in both

merged = df1.merge(df2, how='outer', on = 'column_name', indicator=True); mask = merged['_merge'].isin(['left_only', 'right_only']); merged[mask]
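
The indicator filter can be checked on a small sketch, assuming two made-up tables in which ids 2 and 3 match and ids 1 and 4 do not.

```python
import pandas as pd

# Hypothetical tables that overlap on ids 2 and 3.
df1 = pd.DataFrame({"id": [1, 2, 3], "a": ["x", "y", "z"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "b": ["p", "q", "r"]})

# indicator=True adds a "_merge" column naming the source of each row.
merged = df1.merge(df2, how="outer", on="id", indicator=True)

# Keep only the rows that found no partner in the other table.
mask = merged["_merge"].isin(["left_only", "right_only"])
unmatched = merged[mask]
```

This pattern is a quick way to find records that are missing from one side, for example customers without orders.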

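Difference between attributes and methods

Methods are functions attached to an object and are called with (), e.g. df.head(); attributes are values stored on an object and are accessed without (), e.g. df.shape

Make a pandas datetime object

pd.Timestamp("2015-03-31")

Move backward in time in 1-hour increments in pandas

pd.date_range(end = "2018-01-01", periods = 24, freq ="H"). With only end and freq given, periods is also required; 24 is an arbitrary count here. Newer pandas versions spell the hourly alias "h"

Create a date range index with pandas with increments of 3 hours

pd.date_range(start = "2016-01-01", end = "2018-01-01", freq ="3H")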

Create a date range index with pandas with increments by business days

pd.date_range(start = "2016-01-01", end = "2018-01-01", freq ="B")

Create a date range index with pandas with increments of 1 day

pd.date_range(start = "2016-01-01", end = "2018-01-01", freq ="D")

Display values rounded to 2 decimal places without changing the underlying data

pd.set_option('display.precision', 2)

how to add 5 days and 5 minutes and 5 hours to a datetime object

pd.to_datetime("2001-01-11") + pd.DateOffset(days = 5, hours =5, minutes=5)

Create a pandas DatetimeIndex

pd.to_datetime(["2015-03-31", "2015-03-20", "2015-03-11"]) converts the list into a DatetimeIndex

Convert a column to dates, changing anything that isn't a real date to NaT

pd.to_datetime(df['Start Date'], errors = 'coerce')

Turn a DataFrame column into a datetime format

pd.to_datetime(df['Start Date'])
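
A short sketch of the datetime conversions above. The raw strings are made up: one is a valid date, one is an impossible date, and one is not a date at all.

```python
import pandas as pd

# Hypothetical input with two entries that cannot be parsed as dates.
raw = pd.Series(["2015-03-31", "2015-02-30", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
dates = pd.to_datetime(raw, errors="coerce")

# DateOffset shifts a timestamp by calendar units.
shifted = pd.to_datetime("2001-01-11") + pd.DateOffset(days=5, hours=5, minutes=5)
```

Without errors="coerce" the same call raises an error on the first bad entry, which can be preferable when bad dates should not pass silently.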

What will this do?: df.loc['Goldfinger':'Thunderball', 'Actor':'Box Office']

Selects the rows from 'Goldfinger' through 'Thunderball' and the columns from 'Actor' through 'Box Office' (label slices include both ends)

index

The row labels of a Series or DataFrame

df.pop('Actor')

Removes a single column from df and returns it as well. No need to use 'inplace=True' here

is_unique

Returns True if every value in the Series or Index is distinct (no duplicates)

.count()

Returns the number of non-null values, so NaN values are excluded

What will this do? df.iloc[1:15, 2:5]

Selects the rows at positions 1 through 14 and the columns at positions 2 through 4 (iloc slices exclude the end position)
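
The difference between position slices and label slices is easy to confirm on a tiny sketch. The index labels reuse film titles from the cards; the column values are placeholders, not real figures.

```python
import pandas as pd

# Hypothetical three-row table with placeholder values.
df = pd.DataFrame(
    {"Actor": ["A", "B", "C"], "Year": [1, 2, 3], "Box Office": [10, 20, 30]},
    index=["Dr. No", "Goldfinger", "Thunderball"],
)

# iloc slices by position and excludes the end: rows 0-1, columns 0-1.
by_position = df.iloc[0:2, 0:2]

# loc slices by label and includes the end labels.
by_label = df.loc["Dr. No":"Goldfinger", "Actor":"Year"]
```

Both selections return the same two rows and two columns, even though the iloc slice is written with an end of 2 and the loc slice names the last label it wants.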

.shape

Returns a tuple with the number of rows and the number of columns

.size

Returns the number of cells in a DataFrame, including null values

ndim

Returns the number of dimensions of the data (1 for a Series, 2 for a DataFrame)

Stationarity

The mean and variance are constant throughout the time series. In other words, every part of a stationary time series is drawn from the same probability distribution

sorted()

Python's built-in sorted() returns a new list with the items in alphabetical or numerical order. To order the rows of a DataFrame by the data in one or more columns, use df.sort_values()

How to iterate through data in groupby objects

Use a for loop, e.g.: for sector, data in sectors: data.nlargest(1, "revenue")

Select multiple index labels from a Series or DataFrame

Use double brackets [[ ]] with a list of labels, e.g. series[[300, 200, 100]]; for the rows of a DataFrame use df.loc[[300, 200, 100]]

del df['Director']

Deletes a column. del is a Python statement, not a pandas method

.dropna()

Deletes all rows that have NaN values. Arguments can switch it to delete columns instead (axis=1) or to check only certain columns, such as nba.dropna(subset=['Salary']). To replace NaN values rather than drop them, use .fillna()
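
The dropna() options can be compared in a minimal sketch. The table name nba and the Salary column follow the card; the rows themselves are made up.

```python
import pandas as pd

# Hypothetical roster with one missing salary and one missing team.
nba = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Salary": [1.0, None, 3.0],
    "Team": ["X", "Y", None],
})

any_missing = nba.dropna()                    # drop rows with any missing value
salary_only = nba.dropna(subset=["Salary"])   # only check the Salary column
complete_cols = nba.dropna(axis=1)            # drop columns with missing values
filled = nba.fillna({"Salary": 0})            # replace instead of dropping
```

None of these calls changes nba itself; each returns a new DataFrame.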

.insert()

Creates a column at a chosen place. It takes the column position (an integer), the column name, and then the values to put in it

.info()

Prints a summary of the DataFrame: the index range, each column's name, non-null count, and dtype, and the memory usage

What will this do: mask = ~df['Gender'].duplicated(); df[mask]

Keeps only the first row for each distinct 'Gender' value

index_col

Makes the index take the values of a specified column. rev = pd.read_csv('revenue.csv', index_col='Date') makes the index the 'Date' column instead of 0, 1, 2, 3, ...

.astype()

Converts values to a given data type, e.g. df['column'].astype('int') turns that column into integers

df.index.get_level_values('Date')

Returns all the index values of the 'Date' level of a MultiIndex

dir()

Lists all methods and attributes of an object

.to_dict()

Turns a pandas DataFrame into a dict

dict()

Turns a Series into a dictionary

