Pandas (How)
define the url you want to scrape
url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'
dframe['column'].value_counts()
counts the occurrences of each value in a column
Binning: This cuts up my years list into decade categories that are outlined in the decade_bins list.
decade_cat = pd.cut(years,decade_bins)
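A minimal sketch of the binning step, with made-up years and decade edges (the real years list comes from earlier in the course):

```python
import pandas as pd

# Hypothetical data for illustration
years = [1990, 1991, 1992, 2008, 2012, 1987, 1969, 2013, 2008, 1999]
decade_bins = [1960, 1970, 1980, 1990, 2000, 2010, 2020]

# pd.cut assigns each year to the (open, closed] decade interval it falls in
decade_cat = pd.cut(years, decade_bins)

# Count how many years land in each bin
counts = pd.Series(decade_cat).value_counts()
```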
creating and calling your own functions
def max_to_min(arr):
    return arr.max() - arr.min()

wino.agg(max_to_min)
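A self-contained sketch of the custom-aggregation pattern, using a toy frame standing in for the wine data (column names are made up):

```python
import pandas as pd

# Toy stand-in for the wine dataframe
dframe_wine = pd.DataFrame({'quality': [5, 5, 6, 6],
                            'alcohol': [9.0, 10.5, 11.0, 12.5]})

def max_to_min(arr):
    # spread between the largest and smallest value in each group
    return arr.max() - arr.min()

wino = dframe_wine.groupby('quality')
spread = wino['alcohol'].agg(max_to_min)
```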
dframe = dframe[dframe.State != 'State']
dropping a particular row given a value in the specified column
clean_dframe1=dframe1.dropna()
drops all rows with at least one null value from dframe1
data1.dropna()
drops missing values.
dframe1.columns.values
lists all the column titles
Cross Tabulation: a pivot table made up of count values (margins=True adds the row and column totals)
pd.crosstab(dframe.column1, dframe.column2, margins=True)
Will show you how many of each year fall into each bin or category
pd.value_counts(decade_cat)
import json
remember to import
Bootstrap method for calculating Value at Risk for a stock. rets is your table of all the daily returns. The quantile specifies your confidence level: 0.05 corresponds to 95% confidence.
rets['ESRX'].quantile(0.05)
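A minimal sketch of the quantile-based VaR idea, with simulated daily returns standing in for the real ESRX series:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# Simulated daily returns (made up) standing in for rets['ESRX']
rets = pd.Series(np.random.normal(0, 0.01, 1000))

# 0.05 quantile: on 95% of days, the loss should not be worse than this
var_95 = rets.quantile(0.05)
```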
kernel density estimation plot using seaborn
sns.kdeplot(dataset)
using seaborn for a cumulative distribution function
sns.kdeplot(dataset, cumulative = True)
Shows you the estimate for what the tip percentage is going to be.
sns.lmplot(x='size', y='tip_pect', data=tips, x_estimator=np.mean)
a jitter spreads out your data points and can at times make your plot a lot easier to read
sns.lmplot(x='size', y='tip_pect', data=tips, x_jitter=1)
dframe_st = dframe1.unstack()
dframe_st
the unstack command takes your data back to a matrix form
Using Python and SQL
import sqlite3
Multivariate Density Estimation
Look it up at work
Monte Carlo Value at Risk Simulation: Assumes stock follows a random walk and abides by efficient market hypothesis.
Refer to the IPython notebook on GitHub
dframe.to_csv('filename', sep='?')
Separates every field with a question mark
Where the value is null in ser1, fill in with the value from ser2; if not, use the ser1 value. Keep the index from ser1.
Series(np.where(pd.isnull(ser1), ser2, ser1), index=ser1.index)
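A runnable sketch of this fill pattern with two small made-up series:

```python
import numpy as np
import pandas as pd
from pandas import Series

# Made-up series: ser1 has gaps, ser2 supplies fallback values
ser1 = Series([1.0, np.nan, 3.0, np.nan], index=['a', 'b', 'c', 'd'])
ser2 = Series([10.0, 20.0, 30.0, 40.0], index=['a', 'b', 'c', 'd'])

# Take ser2's value where ser1 is null, otherwise keep ser1's value
filled = Series(np.where(pd.isnull(ser1), ser2, ser1), index=ser1.index)
```

combine_first (noted later in these notes) does the same thing in one call: ser1.combine_first(ser2).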
sql_query = '''SELECT * FROM tablename'''
Set a sql query equal to a constant
con = sqlite3.connect('example.db')
Set your database connection equal to con and link to the db file on your desktop
dframe1.sum()
Sums the Columns
dframe1.sum(axis=1)
Sums the rows
how to map location data into your dataframe. This adds a column named state to your dframe and matches the states from the state_map dictionary to the cities in your city column. This could be useful if you pull some of the location data off of the web.
dframe['state'] = dframe['city'].map(state_map)
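A small sketch of the mapping step; the cities and the state_map dictionary here are made up:

```python
import pandas as pd

# Hypothetical data
dframe = pd.DataFrame({'city': ['Seattle', 'Portland', 'Austin']})
state_map = {'Seattle': 'WA', 'Portland': 'OR', 'Austin': 'TX'}

# map() looks each city up in the dictionary and returns the matched value
dframe['state'] = dframe['city'].map(state_map)
```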
Applying a function to a dframe
here is your function:
def ranker(df):
    df['alc_content_rank'] = np.arange(len(df)) + 1
    return df
then to use it you:
dframe_wine = dframe_wine.groupby('quality').apply(ranker)
dframe.duplicated()
identifies the duplicates for you
dframe1.max()
max value in each column
dframe1.dropna(how='all')
only drops rows in the dataframe where all data is null
concatenation: going to stack alike columns on top of each other
pd.concat([dframe1,dframe2])
merging data sets (merge on an index)
pd.merge(df_left,df_right,left_on='key',right_index=True)
pip install in terminal
"sudo pip install program_name"
dframe_list = pd.read_html(url)
# Grab data from html and put it into a list of DataFrame objects!
dframe1.describe()
Important descriptive statistics
Grouping: this groups the contents of column 1 by the contents of column 2. You can then perform operations on group1 by using .mean(), .min(), etc.
group1 = dframe['column1'].groupby(dframe['column2'])
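A runnable sketch of grouping one column by another, with made-up column names and values:

```python
import pandas as pd
from pandas import DataFrame

# Hypothetical data: group 'data' by the key column 'k1'
dframe = DataFrame({'k1': ['X', 'X', 'Y', 'Y'],
                    'data': [0, 10, 20, 30]})

group1 = dframe['data'].groupby(dframe['k1'])
means = group1.mean()   # one mean per group
```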
Getting data from the web (pandas.io.data was later split out into the separate pandas-datareader package)
from pandas_datareader import data as pdweb
import datetime

prices = pdweb.get_data_yahoo(['CVX', 'XOM', 'BP'],
                              start=datetime.datetime(2010, 1, 1),
                              end=datetime.datetime(2013, 1, 1))['Adj Close']
prices
dframe1.min()
min value in each column
making a computed column for a tip percentage
tips['tip_pect']=100*(tips['tip']/tips['total_bill'])
More grouping using the wine example. Groups by quality and shows us the descriptive statistics at each level of quality
wino = dframe_wine.groupby('quality')
wino.describe()
opening an excel file
xlsfile = pd.ExcelFile('Desktop/Seattle_Police_Department_Police_Report_Incident.xls')
dframe = xlsfile.parse()
dframe
making all words in your index lowercase
dframe2.index.map(str.lower)
Capitalizes the first letter in each record of your index and makes the column titles all lowercase
dframe2.rename(index=str.title,columns=str.lower)
Renames LA to Los Angeles and Z to ZeTa
dframe2.rename(index={'LA': 'Los Angeles'}, columns={'Z': 'ZeTa'})
index hierarchy: the first index listed is the outer level
dframe = DataFrame(np.arange(16).reshape(4, 4),
                   index=[['a', 'a', 'b', 'b'], ['1', '2', '1', '2']],
                   columns=[['NY', 'LA', 'MI', 'CA'], ['Hot', 'cold', 'hot', 'cold']])
dframe
Grouping using an array This will group matching elements in 'dataset' by the values and contents of array1 and array2, then it will show you the mean.
dframe['dataset'].groupby([array1, array2]).mean()
Show all rows in your dataframe with at least one value greater than 3
dframe[(np.abs(dframe) > 3).any(axis=1)]
Sets all values higher than 3 or negative 3 to exactly 3 or negative 3 which eliminates outliers
dframe[np.abs(dframe)>3]=np.sign(dframe)*3
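A runnable sketch of capping outliers at +/-3 with np.sign, on a made-up random frame:

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

np.random.seed(12345)
# Hypothetical data: 1000 rows of standard normal noise
dframe = DataFrame(np.random.randn(1000, 4))

# Every value outside [-3, 3] is replaced with +3 or -3 (matching its sign)
dframe[np.abs(dframe) > 3] = np.sign(dframe) * 3
```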
how to pivot your data (pivot table)
dframe_pivot = dframe.pivot(index='rows', columns='columns', values='values')
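A small sketch of the pivot with made-up long-format data ('date', 'variable', 'value' are illustrative names):

```python
import pandas as pd

# Hypothetical long-format data
dframe = pd.DataFrame({'date': ['d1', 'd1', 'd2', 'd2'],
                       'variable': ['A', 'B', 'A', 'B'],
                       'value': [1, 2, 3, 4]})

# Rows become the index, one column per variable, values fill the cells
dframe_pivot = dframe.pivot(index='date', columns='variable', values='value')
```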
from pandas import read_html (installed beautifulsoup4 and html5lib)
importing html data
JSON (won't use this a lot)
JavaScript Object Notation
turns data into json object
json.dumps(data)
Getting Started
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
Web Scraping imports
import pandas as pd
from pandas import Series, DataFrame
from bs4 import BeautifulSoup
import requests
Violin Plot
sns.violinplot([data1,data2])
adjusting violin plot bandwidth
sns.violinplot(data2,bw=0.01)
a rug plot just puts a tick wherever a value occurs
#Create dataset
from numpy.random import randn
dataset = randn(25)
#Create rugplot
sns.rugplot(dataset)
#Set y-axis limit
plt.ylim(0, 1)
plotting installs
#Plotting
import matplotlib as mpl
import seaborn as sns
import matplotlib.pyplot as plt
Stats install
#stats
from scipy import stats
Use con and sql_query to pull results into a dataframe
df = pd.read_sql(sql_query, con)
df
Join command
df_left.join(df_right)
Opening a CSV in Python
dframe = pd.read_csv('Desktop/Seattle_Police_Department_Police_Report_Incident.csv')
dframe
drops the duplicates
dframe.drop_duplicates(['column name'])
Grouping on a column. This will give you the counts for each value in your designated column
dframe.groupby('columnname').size()
naming your indexes
dframe.index.names = ['index1', 'index2']
dframe.columns.names = ['Cities', 'Temp']
dframe
Sorts outer index
dframe.sort_index(level=0)
sorts inner index
dframe.sort_index(level=1)
DataFrame Syntax
dframe1 = DataFrame(arr,index=['A','B'], columns=['One','Two','Three'])
turning part of JSON object into DataFrame
dframe1 = DataFrame(data['diet'])
making a DataFrame using the dictionary method. All you do is name the column and then input the contents in this format (not that important but good to recognize)
dframe1 = DataFrame({'key': ['X', 'Z', 'Y', 'Z', 'X', 'X'], 'data_set_1': np.arange(6)})
dframe1
the stack command organizes your data and gets it out of a matrix
dframe_st = dframe1.stack()
dframe_st
you can also unstack by a specific index level to essentially sort the data by that index
dframe_st.unstack('Cities')
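A runnable sketch of the stack/unstack round trip, on a small made-up frame with a named column level:

```python
import pandas as pd
from pandas import DataFrame

# Hypothetical 2x2 frame; the column level is named 'city'
dframe1 = DataFrame([[0, 1], [2, 3]],
                    index=['A', 'B'],
                    columns=pd.Index(['NY', 'LA'], name='city'))

# stack() pivots the columns down into the index (long format)
stacked = dframe1.stack()

# unstack by the level name to get back to matrix form
unstacked = stacked.unstack('city')
```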
Matplotlib scatter plot
dframe_wine.plot(kind='scatter',x='quality',y='alcohol')
adding a computed columns
dframe_wine['qual/alc ratio'] = dframe_wine['quality'] / dframe_wine['alcohol']
data = json.loads(json_obj_name)
data
displays json object as data
dframe['column'].unique()
displays only the unique values in a column
data1.fillna(1)
fills all null values with a 1
data1.isnull()
finds missing and null values
merging using two columns
pd.merge(dfleft,dfright,on=['key1','key2'],how='outer')
merging data sets (merge on a column)
pd.merge(dframe1,dframe2,on='column_name')
adding 'how' lets you keep all the keys from either the left or right data set. 'outer' is a union of both data sets.
pd.merge(dframe1,dframe2,on='key',how='left/right/outer')
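A runnable sketch comparing the merge types, with two made-up frames sharing a 'key' column:

```python
import pandas as pd

# Hypothetical data: keys Y and Z overlap, X and Q do not
df1 = pd.DataFrame({'key': ['X', 'Y', 'Z'], 'data1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['Y', 'Z', 'Q'], 'data2': [4, 5, 6]})

inner = pd.merge(df1, df2, on='key')               # only matching keys (Y, Z)
outer = pd.merge(df1, df2, on='key', how='outer')  # union of both key sets
left = pd.merge(df1, df2, on='key', how='left')    # all keys from df1
```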
%matplotlib inline
prices.plot()
Plots your data. 'prices' is the variable being plotted
plotting histogram data. A histogram plots the count of values in each bin as a bar chart
plt.hist(dataset1)
Changing the graph color
plt.hist(dataset1,color='indianred')
THIS IS NOT SEABORN, IT IS MATPLOTLIB. Plotting one histogram on top of another. density (formerly normed): make this True if your data sets are of different sizes. bins: default is ten along the x axis but you can change it. alpha: adjusts the transparency. color: does the obvious
plt.hist(dataset1, density=True, color='indianred', alpha=.5, bins=20)
plt.hist(dataset2, density=True, alpha=.5, bins=20)
dframe.to_csv('filename')
saving a file as a csv
using pandas you can accomplish the same thing with this shortcut
ser1.combine_first(ser2)
replacing pieces of data: replaces all the 1s with 2s, or 1 and 2 with 100 and 200 respectively
ser1.replace(1, 2)
ser1.replace([1, 2], [100, 200])
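A runnable sketch of both replace forms on a small made-up series:

```python
import pandas as pd
from pandas import Series

# Hypothetical data
ser1 = Series([1, 2, 3, 1, 2])

single = ser1.replace(1, 2)                # every 1 becomes 2
multi = ser1.replace([1, 2], [100, 200])   # 1 -> 100 and 2 -> 200 in one call
```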
box plots
sns.boxplot([data1,data2])
distribution plot
sns.distplot(dataset,bins=25)
You can customize plots by passing dictionaries of keyword options to them.
sns.distplot(dataset, bins=25,
             kde_kws={'color': 'indianred', 'label': 'KDE PLOT'},
             hist_kws={'color': 'blue', 'label': 'HISTOGRAM'})
editing dist plot. Makes it a rug plot without the histogram
sns.distplot(dataset,bins=25,rug=True,hist=False)
Heatmap:
sns.heatmap(flights,annot=True,fmt='d')
Plotting two data sets using Seaborn. Hex makes it look better
sns.jointplot(data1,data2,kind='hex')
Linear regression plot: plotting only specific columns of a dataset. total_bill and tips are the columns and tips is the dataframe
sns.lmplot(x='total_bill', y='tip', data=tips)
making edits to your graph: you have to pass a library on certain elements of the graph
sns.lmplot(x='total_bill', y='tip', data=tips,
           scatter_kws={'marker': 'o', 'color': 'indianred'},
           line_kws={'linewidth': 1, 'color': 'blue'})
This particular plot shows you the relationship between the amount being tipped and the day of the week
sns.lmplot(x='total_bill', y='tip_pect', data=tips, hue='day')
The markers option turns your scatter dots into a symbol of your choosing. The hue setting plots a zone based on sex
sns.lmplot(x='total_bill', y='tip_pect', data=tips, hue='sex', markers=['x', 'o'])