MGT 153 Python

*Residuals by predicted and actual vs. predicted plots*
import matplotlib.pyplot as plt
dp['residuals'] = model_results.resid
dp['predicted'] = model_results.fittedvalues
plt.scatter(dp.predicted, dp.residuals)
plt.title('Residuals by Predicted')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
*dp_subset = dp[(dp.predicted < 20)]
plt.scatter(dp_subset.predicted, dp_subset.residuals)
plt.title('Residuals by Predicted')
plt.xlabel('Predicted')
plt.ylabel('Residuals')*
plt.scatter(dp_subset.predicted, dp_subset.ScaledSales)
plt.title('Actual by Predicted')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

Check OLS regression assumptions *homogeneity of variance* and how well the model is fitting.
-must run the OLS regression first in order to get the proper scatterplot.
-creates a scatterplot that shows the residuals against the predicted outcomes.
*-in the 2nd part we use a subset to get a close-up of the scatterplot.
-a funnel shape indicates heteroskedasticity.*
-in the last plot we see the actual values against the predicted values; ideally the points fall along a straight line.
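
The funnel shape can also be checked with numbers instead of by eye. This is only a sketch on simulated data: every name here (x, y, low, high) is hypothetical, and the line is fit with np.polyfit rather than the OLS model from the card so that the example needs only NumPy.

```python
import numpy as np

# Simulated data (assumption): the noise grows with x, which produces a funnel.
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=2000)
y = 1.0 + 2.0 * x + rng.normal(size=2000) * x

# Fit a straight line; for degree 1 np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
predicted = intercept + slope * x
residuals = y - predicted

# Compare the residual spread for low and high predicted values.
cutoff = np.median(predicted)
low = residuals[predicted < cutoff]
high = residuals[predicted >= cutoff]
print("spread for low predictions: ", low.std())
print("spread for high predictions:", high.std())
```

If the two spreads are clearly different, the residuals-by-predicted scatterplot will show the funnel described above.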

-df['Year'] = financials.datadate.dt.year
-df = df[(df.Year - 1 == df.Year.shift(1)) & (df.gvkey == df.gvkey.shift(1))]

Drop rows with missing lagged data.
-creates a new Year column to use in the filter.
-compares each row with the previous row (shift(1)) and keeps it only when the previous row belongs to the same company (gvkey) and to the prior year, so changes are computed against the proper year.
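
A small worked example of the filter, assuming a made-up panel (the values and the missing 2019 row for firm 1 are hypothetical):

```python
import pandas as pd

# Hypothetical toy panel: firm 1 has a gap (no 2019 row); firm 2 is complete.
df = pd.DataFrame({
    "gvkey": [1, 1, 1, 2, 2, 2],
    "Year":  [2017, 2018, 2020, 2018, 2019, 2020],
    "SALE":  [10.0, 12.0, 15.0, 5.0, 6.0, 7.0],
})

# Sort first so that shift(1) looks at the previous year of the same firm.
df = df.sort_values(by=["gvkey", "Year"])
df["prevSALE"] = df.SALE.shift(1)

# Keep a row only if the row above is the same firm and exactly one year earlier.
valid = (df.Year - 1 == df.Year.shift(1)) & (df.gvkey == df.gvkey.shift(1))
lagged = df[valid]
print(lagged)
```

The first row of each firm and the row after the gap are dropped, because their prevSALE would come from the wrong firm or the wrong year.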

new_name = df.groupby(['2_digit_sic','Year'])['SALE'].mean()
new_name*.name* = 'even_newer_name'

Stores the grouped (aggregated) data in a new object.
-.name can rename the result if desired; it becomes the column name when the result is joined to another table.

import seaborn as sns
from scipy import stats
+import matplotlib.pyplot as plt
+plt.figure()
*sns.violinplot(x=df.SALE, color="0.25")*

Creates a violin plot.
+plt.figure() separates the graphs instead of overlapping them.
+needs to come before the graphs.
-color gives the graph a different color.
-running both graphs at a time without it will cause them to overlap.

*Statsmodels*
import statsmodels.formula.api as sm
model_results = sm.ols(formula='ScaledSales ~Scaled_prevSales + Scaled_Emp...+BookToMarket', data=df).fit()
print(model_results.summary())

Fits an ordinary least squares (OLS) regression.
-adding .fit() at the end fits the model with a line through the data.
-Durbin-Watson: checks the residuals for autocorrelation. Should be close to 2.
-Prob(JB) close to zero means the residuals are not normally distributed (not good).
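
To see why a Durbin-Watson value near 2 is the good case, the statistic can be computed by hand. This is a sketch of the standard formula (sum of squared differences of consecutive residuals divided by the sum of squared residuals), not the statsmodels code; the helper name durbin_watson and the simulated residuals are assumptions for illustration.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
independent = rng.normal(size=5000)           # no autocorrelation
trending = np.cumsum(rng.normal(size=5000))   # strong positive autocorrelation

print(durbin_watson(independent))   # close to 2
print(durbin_watson(trending))      # close to 0
print(durbin_watson([1, -1, 1, -1]))  # alternating signs push it above 2
```

Values well below 2 point to positive autocorrelation in the residuals, values well above 2 to negative autocorrelation.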

from patsy import dmatrices
import statsmodels.api as sm_non_formula
y, X = dmatrices('ScaledSales ~Scaled_prevSales + Scaled_Emp...+BookToMarket', data=financials, return_type='dataframe')
robust_results = sm_non_formula.RLM(y, X, M=sm_non_formula.robust.norms.HuberT()).fit()
print(robust_results.summary())

Robust Linear Model (RLM)
-fit with the HuberT norm, which gives less weight to observations with large residuals.

df.dropna(inplace=True)
robust_results = *model_results.get_robustcov_results(cov_type='cluster', use_t=None, groups=df['2_digit_sic'])*
print(robust_results.summary())

Robust Regression (cluster-robust standard errors)
-based on the previous results from the OLS regression.
-drops the null values in order to get the proper results.
-changes the covariance type to cluster, grouped by 2_digit_sic.
-look at the p values.

financials['CooksD'] = model_results.get_influence().summary_frame().filter(['cooks_d'])
print(financials.count())
financials = financials[(financials.CooksD < 4/financials.residuals.count())]

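Outliers
-computes Cook's distance for every observation.
-keeps only the rows with a Cook's distance below 4/n, where n is the number of residuals.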
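
The 4/n rule can be tried out on a small example. The sketch below computes Cook's distance by hand with NumPy from the textbook formula instead of calling get_influence(), so it runs without a fitted statsmodels model; the data and the planted outlier are simulated assumptions.

```python
import numpy as np

# Simulated data (assumption) with one planted outlier at the last point.
rng = np.random.default_rng(1)
n = 50
x = np.linspace(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=n)
y[-1] += 15.0

X = np.column_stack([np.ones(n), x])            # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS coefficients
resid = y - X @ beta
p = X.shape[1]
s2 = resid @ resid / (n - p)                    # residual variance estimate

# Leverage h_i: diagonal of the hat matrix X (X'X)^-1 X'.
h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

# Cook's distance: e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
cooks_d = resid**2 / (p * s2) * h / (1 - h) ** 2

keep = cooks_d < 4 / n                          # same cutoff as in the card
print("rows dropped:", np.where(~keep)[0])
```

The planted outlier gets the largest Cook's distance and is removed by the cutoff.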

*Join*
df = df.join(industry_average_sales, how='inner', on=['2_digit_sic','Year'])
df.sort_values(by=['gvkey','Year'], ascending=[True, True], inplace=True)

Used when combining two tables based on indices, where one table has an index and the other table has columns that we are trying to join on.
-sort_values sorts the data for easier viewing.
-inplace=True makes the changes on the actual data.
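
A minimal sketch of the join, assuming a made-up table (the firms and sales values are hypothetical). The grouped means carry (2_digit_sic, Year) as their index, and df has the same keys as columns:

```python
import pandas as pd

# Hypothetical firm-level data.
df = pd.DataFrame({
    "gvkey": [1, 2, 3, 4],
    "2_digit_sic": [10, 10, 20, 20],
    "Year": [2020, 2020, 2020, 2020],
    "SALE": [100.0, 300.0, 50.0, 150.0],
})

# Grouped result: a Series indexed by (2_digit_sic, Year).
industry_average_sales = df.groupby(["2_digit_sic", "Year"])["SALE"].mean()
industry_average_sales.name = "SALE_Industry_Mean"  # becomes the new column name

# Columns of df are matched against the index of the grouped Series.
joined = df.join(industry_average_sales, how="inner", on=["2_digit_sic", "Year"])
joined.sort_values(by=["gvkey", "Year"], ascending=[True, True], inplace=True)
print(joined)
```

Each firm row now has its industry mean next to it.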

df[['gvkey','2_digit_sic','Year']]

Used when selecting many columns.
-needs a list inside the brackets (double square brackets).

import pandas as pd
import statsmodels.formula.api as sm
financials = pd.read_csv("C:/Users/johanperols/Desktop/Financials.csv")

Imports the libraries and reads the CSV file from its file path into a DataFrame.

*Merge*
pd.merge(df, industry_average, how='inner', on=['2_digit_sic','Year'])

Used when trying to combine two tables based on column values.
-how: inner, outer, right, left.
-in this case it did not work, because the on keys of the grouped table are indices rather than regular columns.
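
One way to make the merge work on columns is to call reset_index() on the grouped result first. The sketch below assumes a made-up table, and the partial lookup table is hypothetical; it is there only to show the difference between how='inner' and how='left'.

```python
import pandas as pd

# Hypothetical firm-level data.
df = pd.DataFrame({
    "gvkey": [1, 2, 3],
    "2_digit_sic": [10, 10, 20],
    "Year": [2020, 2020, 2020],
    "SALE": [100.0, 300.0, 50.0],
})

# reset_index() turns the group keys from index levels into regular columns.
industry_average = (
    df.groupby(["2_digit_sic", "Year"])["SALE"]
    .mean()
    .rename("SALE_Industry_Mean")
    .reset_index()
)

# Hypothetical lookup table that is missing industry 20.
partial = industry_average[industry_average["2_digit_sic"] == 10]

inner = pd.merge(df, partial, how="inner", on=["2_digit_sic", "Year"])
left = pd.merge(df, partial, how="left", on=["2_digit_sic", "Year"])
print(inner)   # only rows with a match in both tables
print(left)    # all rows of df; unmatched rows get NaN
```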

import seaborn as sns
from scipy import stats
*sns.boxplot(x=df.SALE)*

creates a boxplot

import statsmodels.stats.outliers_influence as sm_influence
myX = dp[['Scaled_prevSales', 'Scaled_Emp', ..., 'BookToMarket']]
myX = myX.dropna()
vif = pd.DataFrame()
vif['VIF Factor'] = [sm_influence.variance_inflation_factor(myX.values, i) for i in range(myX.shape[1])]
vif['Variable'] = myX.columns
print(vif.round(2))

Multicollinearity
-remove variables that are highly correlated with the other predictors.
-5 is a fairly common threshold; remove the variables with the highest VIF values.
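
What the VIF measures can be sketched by hand: regress each predictor on the other predictors and take 1 / (1 - R^2). The helper name vif and the simulated predictors are assumptions. This sketch adds an intercept to every auxiliary regression, so its numbers may differ from the statsmodels call in the card depending on whether myX contains a constant column.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for i in range(k):
        target = X[:, i]
        # Regress predictor i on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Simulated predictors (assumption): c is nearly a copy of a, b is unrelated.
rng = np.random.default_rng(2)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(2))
```

Here a and c end up far above the threshold of 5 while b stays near 1, so one of a or c would be removed.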

*np.log()*
np.log(dp.SALE + 1)

Log-transforms the variable; adding 1 avoids taking the log of 0.
-running this on the scaled sales would give us a different transformation.
-we could also comment out the scaling steps, which are the standardizers for the scaled sales, etc.
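
A short example of the transformation on made-up sales values (the numbers are hypothetical). np.log1p is NumPy's built-in for log(1 + x) and gives the same result:

```python
import numpy as np
import pandas as pd

# Hypothetical sales values, including a zero.
sale = pd.Series([0.0, 9.0, 99.0, 999.0], name="SALE")

log_sale = np.log(sale + 1)    # same expression as in the card
log1p_sale = np.log1p(sale)    # built-in equivalent of log(1 + x)

print(log_sale)
```

Without the + 1, the zero would turn into -inf.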

*Transform*
df['SALE_Industry_Mean'] = df.groupby(['2_digit_sic','Year'])['SALE'].transform('mean')

Combines both the grouping and joining functions.
-this gives us the same outcome as the two separate steps (group, then join).
-.transform('how we want to transform') is usually 'mean' or 'sum'.
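
That transform gives the same column as grouping and then joining can be checked on a small table. The data and the column name two_step_mean are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical firm-year data.
df = pd.DataFrame({
    "gvkey": [1, 1, 2, 2, 3, 3],
    "2_digit_sic": [10, 10, 10, 10, 20, 20],
    "Year": [2019, 2020, 2019, 2020, 2019, 2020],
    "SALE": [100.0, 120.0, 300.0, 280.0, 50.0, 70.0],
})
keys = ["2_digit_sic", "Year"]

# Two steps: group, then join the means back on the key columns.
means = df.groupby(keys)["SALE"].mean().rename("two_step_mean")
two_step = df.join(means, on=keys)

# One step: transform returns a value for every original row.
df["SALE_Industry_Mean"] = df.groupby(keys)["SALE"].transform("mean")

print(np.allclose(df["SALE_Industry_Mean"], two_step["two_step_mean"]))
```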

robust_results = model_results.get_robustcov_results(cov_type='HC3', use_t=None)
print(robust_results.summary())
new_results = sm.ols(formula='ScaledSales ~Scaled_prevSales + Scaled_Emp...+BookToMarket', data=df).fit(cov_type='HC3', use_t=None)

*robust regression* Use robust standard errors with an HC3 estimator when there is a heteroskedasticity issue and you still want to be able to interpret the beta coefficients.
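
The reason the betas stay interpretable is that HC3 changes only the standard errors, not the coefficients. The sketch below computes the classical and the HC3 standard errors by hand with NumPy from the standard sandwich formula; it is not the statsmodels implementation, and the simulated data and variable names are assumptions.

```python
import numpy as np

# Simulated data (assumption): the error variance grows strongly with x.
rng = np.random.default_rng(4)
n = 2000
x = rng.uniform(1, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n) * x**2

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                       # OLS coefficients, used by both
resid = y - X @ beta
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)    # leverage of each observation

# Classical covariance: s^2 (X'X)^-1
s2 = resid @ resid / (n - X.shape[1])
se_classic = np.sqrt(np.diag(s2 * XtX_inv))

# HC3 sandwich: (X'X)^-1 X' diag(e_i^2 / (1 - h_i)^2) X (X'X)^-1
weights = resid**2 / (1.0 - h) ** 2
meat = X.T @ (X * weights[:, None])
se_hc3 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("coefficients:", beta)
print("classical SE:", se_classic)
print("HC3 SE:      ", se_hc3)
```

Only the standard errors (and therefore the t and p values) differ between the two versions.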

*df.shift(1)*
df['prevSALE'] = df.SALE.shift(1)

Create new columns with lagged data.
-shows the previous year's value in a column next to the current one.
-df.shift(1): brings the data down one row.
-can use higher numbers or negative numbers.
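
A small example of what different shift values do, on a made-up table (the column names prev2SALE and nextSALE are hypothetical):

```python
import pandas as pd

# Hypothetical yearly sales of one firm, already sorted by year.
df = pd.DataFrame({"Year": [2017, 2018, 2019, 2020],
                   "SALE": [10, 20, 30, 40]})

df["prevSALE"] = df.SALE.shift(1)     # value from one row above
df["prev2SALE"] = df.SALE.shift(2)    # value from two rows above
df["nextSALE"] = df.SALE.shift(-1)    # negative number: value from the row below
print(df)
```

Rows without a value to shift in are filled with NaN, which is why rows with missing lagged data are dropped afterwards.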

df.groupby(['2_digit_sic', 'Year'])['SALE'].mean()

Create additional variables based on industry means.
-groups by the columns in () and computes the mean of the column in [] for each group.

df['ScaledSales'] = df.SALE/df.AT
df['Scaled_prevSales'] = df.prevSALE/df.prevAT
df['Scaled_Emp'] = df.EMP/df.AT
df['ScaledEmpChange'] = (df.EMP - df.prevEMP)/df.AT
df['BookToMarket'] = df.BV/df.MV

Create change variables.
-created the sales variables, which are scaled a little differently: current sales by AT, previous sales by prevAT.
-scaled the emp and then scaled the emp change.
-book-to-market value is BV/MV.

import numpy as np
import seaborn as sns
from scipy import stats
-df['SALE'] = np.log(df.SALE)
*sns.distplot(df.SALE, kde=False, fit=stats.norm)*

Create histogram with a fitted normal distribution and variable transformation.
-kde=True gives you a line to represent the distribution.
-transform the variable to see a better view. *variable transformation*

sns.distplot(dp.residuals, kde=False, fit=stats.norm)
plt.show()

Normally distributed errors

*df.sort_values*
df.sort_values(by=['gvkey','datadate'], ascending=[True, True], inplace=True)

Sort data in ascending order per GVKEY and DATADATE.

