Pandas (How)


define the url you want to scrape

url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'

dataframe.value_counts()

counts how many times each distinct value occurs

Binning: this cuts the years list into decade categories that are outlined in the decade_bins list.

decade_cat = pd.cut(years,decade_bins)
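A minimal runnable sketch of the binning above, using hypothetical years and bin edges:

```python
import pandas as pd

# Hypothetical data: years and decade bin edges (assumed for illustration)
years = pd.Series([1969, 1987, 1990, 1991, 1992, 1999, 2008, 2012, 2013, 2015])
decade_bins = [1960, 1970, 1980, 1990, 2000, 2010, 2020]

# each year is assigned to the interval (left, right] that contains it
decade_cat = pd.cut(years, decade_bins)
print(decade_cat.value_counts())
```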

creating and calling your own functions

def max_to_min(arr):
    return arr.max() - arr.min()

wino.agg(max_to_min)
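wino is a groupby object; a self-contained sketch with made-up wine data (column names assumed):

```python
import pandas as pd

# Hypothetical wine data standing in for dframe_wine (assumed columns)
dframe_wine = pd.DataFrame({'quality': [3, 3, 4, 4, 4],
                            'alcohol': [9.0, 10.5, 9.5, 12.0, 11.0]})

def max_to_min(arr):
    # spread between the largest and smallest value in each group
    return arr.max() - arr.min()

wino = dframe_wine.groupby('quality')['alcohol']
spread = wino.agg(max_to_min)
```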

dframe = dframe[dframe.State != 'State']

drops every row where the State column holds the literal string 'State' (e.g. repeated header rows picked up while scraping)

clean_dframe1=dframe1.dropna()

drops all rows with at least one null value from dframe1

data1.dropna()

drops missing values.
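A quick sketch of the dropna variants above on a small frame with assumed data:

```python
import numpy as np
import pandas as pd

# Small frame with scattered missing values (assumed data)
dframe1 = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                        'B': [4.0, 5.0, np.nan],
                        'C': [7.0, 8.0, 9.0]})

clean_dframe1 = dframe1.dropna()          # keeps only rows with no nulls
partial = dframe1.dropna(how='all')       # drops only rows where every value is null
```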

dframe1.columns.values

lists all the column titles

Cross Tabulation: a pivot table made up of count values (to aggregate a third column instead of counting, pass it as values= along with an aggfunc=)

pd.crosstab(dframe.column1, dframe.column2, margins=True)
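A runnable crosstab sketch with hypothetical categorical columns:

```python
import pandas as pd

# Hypothetical categorical data (assumed columns)
dframe = pd.DataFrame({'sex': ['M', 'F', 'M', 'F', 'M'],
                       'smoker': ['Yes', 'No', 'No', 'No', 'Yes']})

# count how many rows fall into each sex/smoker combination;
# margins=True adds 'All' row and column totals
table = pd.crosstab(dframe.sex, dframe.smoker, margins=True)
```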

Will show you how many years fall into each bin or category

pd.value_counts(decade_cat)

import json

remember to import

Bootstrap method for calculating Value at Risk for a stock. rets is your table of all the daily returns; the quantile specifies your confidence level. In this case, 0.05 corresponds to a 95% confidence level.

rets['ESRX'].quantile(0.05)
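A self-contained sketch of the same idea, substituting simulated returns for the real ESRX series:

```python
import numpy as np
import pandas as pd

# Simulated daily returns standing in for rets['ESRX'] (assumed data)
rng = np.random.default_rng(0)
daily_rets = pd.Series(rng.normal(0.0005, 0.02, 1000))

# 5th percentile of daily returns: with 95% confidence,
# the daily loss should not be worse than this value
var_95 = daily_rets.quantile(0.05)
```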

kernel density estimation plot using seaborn

sns.kdeplot(dataset)

using seaborn for a cumulative distribution function

sns.kdeplot(dataset, cumulative = True)

Shows you the estimate for what the tip percentage is going to be.

sns.lmplot('size','tip_pect',tips,x_estimator=np.mean)

a jitter spreads out your data points and can at times make your plot a lot easier to read

sns.lmplot('size','tip_pect',tips,x_jitter=1)

dframe_st = dframe1.unstack()
dframe_st

the unstack command takes your data back to a matrix form

Using Python and SQL

import sqlite3

Multivariate Density Estimation

Look it up at work

Monte Carlo Value at Risk Simulation: Assumes stock follows a random walk and abides by efficient market hypothesis.

Refer to Ipython notebook on github

dframe.to_csv('filename', sep='?')

separates every field with a question mark

Where the value is null in ser1, fill in with the value from ser2; otherwise use the ser1 value. Keep the index from ser1.

Series(np.where(pd.isnull(ser1), ser2, ser1), index=ser1.index)
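A runnable sketch of the pattern, checked against the combine_first shortcut mentioned later in these notes:

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1.0, np.nan, 3.0, np.nan], index=['a', 'b', 'c', 'd'])
ser2 = pd.Series([10.0, 20.0, 30.0, 40.0], index=['a', 'b', 'c', 'd'])

# take ser2's value where ser1 is null, otherwise keep ser1's value
filled = pd.Series(np.where(pd.isnull(ser1), ser2, ser1), index=ser1.index)
```

ser1.combine_first(ser2) gives the same result in one call.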

sql_query = '''SELECT * FROM tablename'''

Assign a SQL query string to a variable

con = sqlite3.connect('example.db')

Set your database connection equal to con and link to the db file on your desktop

dframe1.sum()

Sums the Columns

dframe1.sum(axis=1)

Sums the rows

how to map location data into your dataframe. This adds a column named State to your dframe by looking up each city in your city column in the state_map dictionary. This could be useful if you pull some of the location data off of the web.

dframe['State'] = dframe['city'].map(state_map)
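A runnable sketch with a hypothetical city-to-state mapping:

```python
import pandas as pd

# Hypothetical city-to-state mapping (assumed data)
state_map = {'Seattle': 'WA', 'Portland': 'OR', 'Austin': 'TX'}
dframe = pd.DataFrame({'city': ['Seattle', 'Austin', 'Seattle']})

# look each city up in the dict; cities missing from the dict become NaN
dframe['State'] = dframe['city'].map(state_map)
```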

Applying a function to a dframe

here is your function:

def ranker(df):
    df['alc_content_rank'] = np.arange(len(df)) + 1
    return df

then to use it:

dframe_wine = dframe_wine.groupby('quality').apply(ranker)
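A self-contained sketch of the groupby/apply ranking pattern with made-up wine data (group_keys=False, an assumption here, keeps the original flat index):

```python
import numpy as np
import pandas as pd

# Hypothetical wine data (assumed columns)
dframe_wine = pd.DataFrame({'quality': [3, 3, 4, 4],
                            'alcohol': [9.0, 10.5, 9.5, 12.0]})

def ranker(df):
    # number the rows 1..n within each quality group
    df = df.copy()
    df['alc_content_rank'] = np.arange(len(df)) + 1
    return df

ranked = dframe_wine.groupby('quality', group_keys=False).apply(ranker)
```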

dframe.duplicated()

identifies the duplicates for you

dframe1.max()

max in each column

dframe1.dropna(how='all')

only drops rows in the dataframe where all data is null

concatenation: stacks alike columns on top of each other

pd.concat([dframe1,dframe2])
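A runnable concat sketch (ignore_index=True, an assumption here, renumbers the rows instead of keeping duplicate index labels):

```python
import pandas as pd

dframe1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
dframe2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# rows of dframe2 are stacked below the rows of dframe1
stacked = pd.concat([dframe1, dframe2], ignore_index=True)
```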

merging data sets (merge on an index)

pd.merge(df_left,df_right,left_on='key',right_index=True)
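A runnable sketch of merging a key column against an index, with assumed data:

```python
import pandas as pd

df_left = pd.DataFrame({'key': ['X', 'Y', 'Z'], 'data': [1, 2, 3]})
df_right = pd.DataFrame({'group_data': [10, 20]}, index=['X', 'Y'])

# match df_left's 'key' column against df_right's index;
# the default inner join keeps only keys present in both
merged = pd.merge(df_left, df_right, left_on='key', right_index=True)
```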

installing packages from the terminal

"sudo pip install package_name" (or, with setuptools, "sudo easy_install package_name")

dframe_list = pd.io.html.read_html(url)

# Grab data from html and put it into a list of DataFrame objects!

dframe1.describe()

Important descriptive statistics

Grouping: this groups the contents of column 1 by the contents of column 2. You can then perform operations on group1 using .mean(), .min(), etc.

group1 = dframe['column1'].groupby(dframe['column2'])

Getting data from the web

import pandas.io.data as pdweb
import datetime

prices = pdweb.get_data_yahoo(['CVX','XOM','BP'],
                              start=datetime.datetime(2010,1,1),
                              end=datetime.datetime(2013,1,1))['Adj Close']
prices

dframe1.min()

min value in each column

making a computed column for a tip percentage

tips['tip_pect']=100*(tips['tip']/tips['total_bill'])

More grouping using the wine example. Groups by quality and shows us the descriptive statistics at each level of quality.

wino = dframe_wine.groupby('quality')
wino.describe()

opening an excel file

xlsfile = pd.ExcelFile('Desktop/Seattle_Police_Department_Police_Report_Incident.xls')
dframe = xlsfile.parse()
dframe

making all words in your index lowercase

dframe2.index.map(str.lower)

Capitalizes the first letter in each record of your index and makes the column titles all lowercase

dframe2.rename(index=str.title,columns=str.lower)

Renames LA to Los Angeles and Z to ZeTa

dframe2.rename(index={'LA': 'Los Angeles'}, columns={'Z': 'ZeTa'})

index hierarchy: the first index is the outer level

dframe = DataFrame(np.arange(16).reshape(4,4),
                   index=[['a','a','b','b'],['1','2','1','2']],
                   columns=[['NY','LA','MI','CA'],['Hot','cold','hot','cold']])
dframe

Grouping using an array This will group matching elements in 'dataset' by the values and contents of array1 and array2, then it will show you the mean.

dframe['dataset'].groupby([array1, array2]).mean()

Show all rows in your dataframe with at least one value greater than 3

dframe[(np.abs(dframe)>3).any(1)]

Sets all values higher than 3 or negative 3 to exactly 3 or negative 3 which eliminates outliers

dframe[np.abs(dframe)>3]=np.sign(dframe)*3
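A runnable sketch of the outlier detection and capping above, with simulated data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dframe = pd.DataFrame(rng.normal(0, 1, (1000, 4)))
dframe.iloc[0, 0] = 5.0   # plant an obvious outlier

# rows containing at least one value beyond +/-3
outlier_rows = dframe[(np.abs(dframe) > 3).any(axis=1)]

# cap every value beyond +/-3 at exactly +/-3, keeping its sign
dframe[np.abs(dframe) > 3] = np.sign(dframe) * 3
```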

how to pivot your data (pivot table)

dframe_pivot=dframe.pivot('rows','columns','values')
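A runnable pivot sketch with hypothetical long-format data; the three positional names above map to the index, columns, and values keywords:

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, city) observation (assumed)
dframe = pd.DataFrame({'date': ['d1', 'd1', 'd2', 'd2'],
                       'city': ['NY', 'LA', 'NY', 'LA'],
                       'temp': [30, 70, 28, 72]})

# unique 'date' values become the rows, 'city' values the columns
dframe_pivot = dframe.pivot(index='date', columns='city', values='temp')
```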

from pandas import read_html (installed beautifulsoup4 and html5lib)

importing html data

JSON (won't use this a lot)

javascript object notation

turns data into json object

json.dumps(data)

Getting Started

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

Web Scraping imports

import pandas as pd
from pandas import Series, DataFrame
from bs4 import BeautifulSoup
import requests

Violin Plot

sns.violinplot([data1,data2])

adjusting violin plot bandwidth

sns.violinplot(data2,bw=0.01)

a rug plot just puts a tick wherever a value occurs

#Create dataset
dataset = randn(25)
#Create rugplot
sns.rugplot(dataset)
#Set y-axis limit
plt.ylim(0,1)

plotting installs

#Plotting
import matplotlib as mpl
import seaborn as sns
import matplotlib.pyplot as plt

Stats install

#stats
from scipy import stats

Use con and sql_query to pull results into a dataframe

df = pd.read_sql(sql_query, con)
df

Join command

df_left.join(df_right)

Opening a CSV in Python

dframe = pd.read_csv('Desktop/Seattle_Police_Department_Police_Report_Incident.csv')
dframe

drops the duplicates

dframe.drop_duplicates(['column name'])

Grouping on a column. This will give you the counts for each distinct value in your designated column

dframe.groupby('columnname').size()
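A runnable sketch of counting group sizes, with assumed data:

```python
import pandas as pd

# Hypothetical data (assumed column)
dframe = pd.DataFrame({'State': ['WA', 'WA', 'OR', 'WA']})

# number of rows for each distinct value in the State column
counts = dframe.groupby('State').size()
```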

naming your indexes

dframe.index.names = ['index1','index2']
dframe.columns.names = ['Cities','Temp']
dframe

Sorts outer index

dframe.sortlevel(0)

sorts inner index

dframe.sortlevel(1)

DataFrame Syntax

dframe1 = DataFrame(arr,index=['A','B'], columns=['One','Two','Three'])

turning part of JSON object into DataFrame

dframe1 = DataFrame(data['diet'])

making a DataFrame using the dictionary method. All you do is name the column and then input the contents in this format (not that important but good to recognize)

dframe1 = DataFrame({'key':['X','Z','Y','Z','X','X'],'data_set_1':np.arange(6)})
dframe1

the stack command organizes your data and gets it out of a matrix

dframe_st = dframe1.stack()
dframe_st
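A runnable round-trip sketch with a tiny assumed matrix:

```python
import pandas as pd

dframe1 = pd.DataFrame([[1, 2], [3, 4]],
                       index=['A', 'B'], columns=['one', 'two'])

dframe_st = dframe1.stack()        # matrix -> long Series with a MultiIndex
dframe_back = dframe_st.unstack()  # and back to the original matrix
```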

you can also unstack by a specific index level to essentially sort the data by that level

dframe_st = dframe1.unstack('Cities')

Matplotlib scatter plot

dframe_wine.plot(kind='scatter',x='quality',y='alcohol')

adding a computed column

dframe_wine['qual/alc ratio'] = dframe_wine['quality'] / dframe_wine['alcohol']

data = json.loads(json_obj_name)
data

displays json object as data

dataframe.unique()

displays only unique values

data1.fillna(1)

fills all null values with a 1

data1.isnull()

finds missing and null values

merging using two columns

pd.merge(dfleft,dfright,on=['key1','key2'],how='outer')

merging data sets (merge on a column)

pd.merge(dframe1,dframe2,on='column_name')

adding 'how' allows you to either merge on the left or right data set. 'Outer' is a union of both data sets.

pd.merge(dframe1,dframe2,on='key',how='left/right/outer')
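A runnable sketch of an outer merge with assumed data:

```python
import pandas as pd

dframe1 = pd.DataFrame({'key': ['X', 'Y'], 'left_data': [1, 2]})
dframe2 = pd.DataFrame({'key': ['Y', 'Z'], 'right_data': [3, 4]})

# 'outer' keeps every key from both frames, filling gaps with NaN
outer = pd.merge(dframe1, dframe2, on='key', how='outer')
```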

%matplotlib inline
prices.plot()

plots your data. 'prices' is the variable I am calling

plotting histogram data. A histogram plots the category counts in a bar chart

plt.hist(dataset1)

Changing the graph color

plt.hist(dataset1,color='indianred')

THIS IS NOT SEABORN, IT IS MATPLOTLIB. Plotting one histogram on top of another.
normed: you need to make this True if your data sets are of different sizes
bins: default is ten along the x axis, but you can change it
alpha: adjusts the transparency
color: does the obvious

plt.hist(dataset1,normed=True,color='indianred',alpha=.5,bins=20)
plt.hist(dataset2,normed=True,alpha=.5,bins=20)

dframe.to_csv('filename')

saving a file as a csv

using pandas you can accomplish the same thing with this shortcut

ser1.combine_first(ser2)

replacing pieces of data: replaces all 1s with 2s, or replaces 1 and 2 with 100 and 200 respectively

ser1.replace(1,2)
ser1.replace([1,2],[100,200])
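A runnable sketch of the list form of replace, with assumed data:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 1, 2])

# replace 1 with 100 and 2 with 200 in a single call
replaced = ser1.replace([1, 2], [100, 200])
```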

box plots

sns.boxplot([data1,data2])

distribution plot

sns.distplot(dataset,bins=25)

You can customize tables by passing dictionaries onto them.

sns.distplot(dataset,bins=25, kde_kws={'color':'indianred','label':'KDE PLOT'}, hist_kws={'color':'blue','label':"HISTOGRAM"})

editing dist plot. Makes it a rug plot without the histogram

sns.distplot(dataset,bins=25,rug=True,hist=False)

Heatmap:

sns.heatmap(flights,annot=True,fmt='d')

Plotting two data sets using Seaborn. Hex makes it look better

sns.jointplot(data1,data2,kind='hex')

Linear regression plot: plotting only specific columns of a dataset. total_bill and tip are the columns and tips is the dataframe

sns.lmplot('total_bill','tip',tips)

making edits to your graph: you pass dictionaries to certain elements of the graph

sns.lmplot('total_bill','tip',tips, scatter_kws={'marker':'o','color':'indianred'}, line_kws={'linewidth':1,'color':'blue'})

This particular plot shows you the relationship between the amount being tipped and the day of the week

sns.lmplot('total_bill','tip_pect',tips,hue='day')

The markers option turns your scatter dots into a symbol of your choosing. The hue setting colors the points by sex.

sns.lmplot('total_bill','tip_pect',tips,hue='sex',markers=['x','o'])

