Linear Regression
Multiple Linear Regression
Statistical method that can be used to project future demand; several predictor variables are utilized: y = a + m1*x1 + m2*x2 + m3*x3
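As a minimal sketch, a multiple linear regression of this form can be fit with statsmodels (used later in these notes); the data below is made up for illustration:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x1", "x2", "x3"])
y = 2 + 1.5 * X["x1"] - 0.5 * X["x2"] + 0.8 * X["x3"] + rng.normal(scale=0.3, size=50)

X_const = sm.add_constant(X)      # adds the intercept term a
model = sm.OLS(y, X_const).fit()  # estimates a, m1, m2, m3
print(model.params)               # recovered coefficients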
key difference between statistician and machine learner
A statistician starts with an assumption like y = a + bx and either negates it or accepts it within a margin of error, attributing the rest to noise. A machine learner says: if not this, then maybe a nonlinear model, maybe an exponential one, maybe a decision tree, etc. The statistician works from a DGM (data generation model) and is as interested in interpretation as in prediction; the machine learner cares only about prediction.
When a linear regression was trained, it was found that its R-squared value was 0.85. Which of the following statements is correct?
The R-squared value tells us the proportion of variance of the dependent variable explained by the model: an R-squared value of k means 100k% of the variance is explained. So, for an R-squared value of 0.85, 85% of the variance is explained by the model.
A model is giving a very low error on the training set but a very high error on the test set. Which of the following is correct?
The model is suffering from overfitting
One-hot encoding
The process by which categorical variables are converted into binary columns (0 or 1) so that a machine can read them. It is one of the most common methods for handling categorical features. Example: a department column with categories E, A, and M becomes three binary columns.
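A minimal sketch of one-hot encoding with pandas; the department column and its E/A/M values follow the example above:

import pandas as pd

df = pd.DataFrame({"department": ["E", "A", "M", "E"]})
encoded = pd.get_dummies(df, columns=["department"], drop_first=True)
print(encoded)  # drop_first=True keeps k-1 columns, avoiding the dummy variable trap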
Overfitting
Fitting a model so closely to the training data that it is no longer effective on other data.
A model is built on the _______ data and its performance is evaluated on the _______ data.
Training data, Testing data
An overfit model does well on _______ data but performs poorly on the ______ data.
Training, Testing As mentioned, we expect the past information to represent the future information very well. However, when the model captures the noise too, which is not something that is expected to reflect itself in the future as it was in the past, we reach an overfit model. Such a model does very well on the training data (data that the model has seen) but fails with the testing data (data unseen to the model).
Linear regression is a supervised (guided) learning process, i.e., the data needs to come with labels/targets, and then the regression model is trained to minimize the error.
True. Residuals are a way to figure out the deviation of the actual values from the predicted values.
measures of association
Variance of x and covariance of (x, y); in simple linear regression, the slope of the best-fit line is b = cov(x, y) / var(x).
Unsupervised Learning
A type of model creation, derived from the field of machine learning, that does not have a defined target variable; there is no label to predict.
adjusted r2 formula
Adj R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1), where n is the number of observations and k is the number of predictors. Adjusted R2 will increase if an added variable adds value but will decrease if it does not.
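A minimal sketch of the formula above; the numbers passed in are hypothetical:

def adjusted_r2(r2, n, k):
    """r2: R-squared, n: number of observations, k: number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.85, 100, 5))  # ~0.842, slightly below the raw R-squared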
But how good is our fit? It is measured by the standard error, or the root mean squared error (RMSE).
If the standard deviation of the residuals is small, it is a good fit; otherwise, a bad fit. Compare the RMSE to the standard deviation of y. Root mean squared error (RMSE) is a widely used measure of the difference between the predicted and actual values for a set of data.
RMSE (root mean squared error)
RMSE tells you how far off the model is, in the units of the variable being measured. RMSE = SQRT(MSE). MSE: the average of the squared differences between the forecasted and observed values. RMSE can be compared across models, data sets, degrees of freedom, etc.
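A minimal sketch of computing RMSE with sklearn and comparing it to the standard deviation of y; the arrays are hypothetical:

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.7])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt of the mean squared error
print(rmse, np.std(y_true, ddof=1))  # a good fit has RMSE well below the std of y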
We will use the str.isdigit() function to find the values in the horsepower column that are not being recognized as numbers
hpIsDigit = pd.DataFrame(
    cData.horsepower.str.isdigit()
)  # if the string is made of digits store True, else False

# print the entries where isdigit = False
cData[hpIsDigit["horsepower"] == False]
We perform the Goldfeld-Quandt test to check for homoscedasticity. For which of the following p-values of the test will we conclude that the residuals are homoscedastic?
If the p-value is greater than 0.05, we conclude that the residuals are homoscedastic.
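A sketch of running the test with statsmodels, assuming `model` is a fitted sm.OLS results object from earlier in the workflow:

from statsmodels.stats.diagnostic import het_goldfeldquandt

f_stat, p_value, _ = het_goldfeldquandt(model.resid, model.model.exog)
print(p_value)  # p > 0.05: fail to reject homoscedasticity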
Homoscedasticity
the variance of the residuals is constant across all levels of the predicted values (it does not fan out or shrink)
Multicollinearity
when two or more predictors are highly correlated
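One common way to quantify this is the variance inflation factor (VIF); a sketch with statsmodels, assuming `X` is a DataFrame of numeric predictors:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # VIF above ~5 (or 10) is a common rule of thumb for high collinearity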
The coefficient of determination, also known as R-squared, R^2 = 1 - (SSE/SST)
where SSE (sum of squared errors) is the sum of the squared differences between the predicted and actual response values, and SST (total sum of squares) is the sum of the squared differences between the actual response values and their mean. The value of R-squared ranges from 0 to 1, where a value of 1 indicates that the regression model perfectly fits the data (i.e., all the variability in the response variable is explained by the predictors), while a value of 0 indicates that the model explains none of the variability.
Linear Regression Assumptions
A residual is the difference between the predicted value and the actual value. You have to check that the assumptions hold at an acceptable level: linearity, independence of residuals, normality of residuals, and homoscedasticity (constant variance of residuals).
In simple linear regression, the R-squared value is equal to which of the following?
The square of the correlation coefficient between the independent and dependent variables.
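A minimal numeric check of this identity; x and y below are made up:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]              # Pearson correlation
b, a = np.polyfit(x, y, 1)               # least-squares slope and intercept
ss_res = np.sum((y - (a + b * x)) ** 2)  # SSE
ss_tot = np.sum((y - y.mean()) ** 2)     # SST
print(r ** 2, 1 - ss_res / ss_tot)       # the two values match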
Needed libraries for Linear Regression
# %load_ext nb_black
!pip install nb_black

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

# split the data into train and test
from sklearn.model_selection import train_test_split

# to build linear regression_model
import statsmodels.api as sm

# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
Characteristics of a good machine learning model
A good Machine Learning model captures information in the data and filters out the noise. Overfit models capture the noise too, which does not allow them to generalize well.
regression line (line of best fit)
A line, segment, or ray drawn on a scatter plot to estimate the relationship between two sets of data. There are real data points; a hypothesis equation gives a predicted value. The idea is to keep the squared differences between predicted and real values (RESIDUALS) to a minimum. The least squares line is the one that minimizes the sum of the squared residuals.
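A minimal sketch (with made-up x and y) showing the closed-form least-squares line and that perturbing its slope only increases the sum of squared residuals:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope = cov(x, y) / var(x)
a = y.mean() - b * x.mean()                         # intercept

def ssr(slope, intercept):
    return np.sum((y - (intercept + slope * x)) ** 2)

print(ssr(b, a))        # minimum sum of squared residuals
print(ssr(b + 0.1, a))  # any perturbed line has a larger SSR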
Categorical variables can't be used directly in a linear regression because
Categorical variables represent categories while the best-fit line needs to be fit on numerical values. So, they are encoded before being used in linear regression.
Supervised Learning
Category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest: does the dependent variable change with the independent variable? Prediction error can be measured in supervised learning because the true outcomes are known.
What is Correlation
• Correlation is always between -1 and +1.
• The correlation between a variable and itself is 1.
• The correlation between X and Y is the same as the correlation between Y and X.
• Correlation is scale invariant.
Assumption of Machine learning
Machine Learning assumes that the past is a good representation of the future. The historic (past) data is used to train models so that they can make accurate predictions on future data.
Machine Learning
Machine learning refers to computer systems trying to learn about a process by examining data from the process.
A good fit model will have a smaller standard deviation of residuals.
Observed Value - Fitted Value = Residual
Underfitting
Occurs when a machine learning model has poor predictive ability because it did not learn the complexity in the training data; an underfit model performs poorly on both the training and test sets.
A Model that captures Noise too
Overfit models capture the pattern as well as the noise in the data, which does not allow them to generalize well.
Data analytics
Prediction AND interpretation
Statistical summary of the dataset
data.describe(include="all").T  # statistical summary of all columns
data.duplicated().sum()         # tells if there are duplicates
data.isnull().sum()             # counts missing values per column
df = data.copy()                # so the original data set is unchanged
Multiple subplots in same plot
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15, 10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # for the histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
    plt.show()
To fill dataframe with median
df1["duration"] = df1["duration"].fillna( value=df1.groupby(["genre", "mediaType"])["duration"].transform("median") ) df1["votes"] = df1["votes"].fillna( value=df1.groupby(["genre", "mediaType"])["votes"].transform("median") ) df1.isnull().sum()
Preparing the data
Drop the target variable to create the feature set; create a new dataframe with the target variable; create dummies for the categorical variables; create testing and training data sets by splitting, passing random_state so the split is reproducible. A sketch of these steps follows.
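A hedged sketch of these steps; `data` and the target column name "price" are hypothetical stand-ins:

import pandas as pd
from sklearn.model_selection import train_test_split

X = data.drop(columns=["price"])        # drop the target variable from the features
y = data["price"]                       # new series holding the target variable
X = pd.get_dummies(X, drop_first=True)  # dummies for the categorical variables
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1  # random_state makes the split reproducible
)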
coefficient of determination
R2 can be measured in-sample (on training data) or out-of-sample (on test data); 1 is a perfect fit, 0 is a bad fit (the model explains nothing).