Data Science Lessons 20 to 23: Linear Regression


import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = boston[['RM']]
Y = boston['MEDV']

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, Y_train)
print(Y_test.shape)

(152,)

import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = boston[['RM']]
Y = boston['MEDV']

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, Y_train)
y_test_predicted = model.predict(X_test)
print(y_test_predicted.shape)
print(type(y_test_predicted))

(152,)
<class 'numpy.ndarray'>

import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = boston[['RM']]
Y = boston['MEDV']

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

(354, 1)
(354,)
(152, 1)
(152,)

import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
Y = boston['MEDV']
print(Y.shape)

(506,)

import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = boston[['RM']]
print(X.shape)

(506, 1)

import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
print(boston.shape)

(506, 14)

import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = boston[['RM']]
Y = boston['MEDV']

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, Y_train)
print(model.intercept_.round(2))

-30.57

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = boston[['RM']]
Y = boston['MEDV']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
model = LinearRegression()
model.fit(X_train, Y_train)
print(model.intercept_.round(2))
print(model.coef_.round(2))

-30.57
[8.46]

Class of Model

A class of model is not the same as an instance of a model.
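
A minimal sketch of the distinction using scikit-learn's LinearRegression: the class defines the model family, while an instance is the concrete object we configure and fit to data.

from sklearn.linear_model import LinearRegression

# LinearRegression is the class: it defines the model family.
# model_a and model_b are two separate instances of that class.
model_a = LinearRegression()
model_b = LinearRegression(fit_intercept=False)

print(isinstance(model_a, LinearRegression))  # True
print(model_a is model_b)                     # False: distinct instances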

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
boston.plot(kind='scatter', x='LSTAT', y='MEDV', figsize=(8, 6))
plt.savefig("plot1.png")
plt.show()

A scatter plot of MEDV against LSTAT: roughly a dozen points sit at the maximum MEDV of 50 on the y-axis, and most points cluster below a MEDV of 30 at LSTAT values around 7.

dataset

A structured collection of data generally associated with a unique body of work

Data information on CHAS & RAD

After scanning the values, CHAS and RAD appear to be integers, not floats. According to the description of the data, CHAS identifies if the property's tract bounds a river (=1) or not (=0); and RAD is an accessibility index to radial highways.
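
A quick way to confirm this, sketched here assuming the boston DataFrame built in the cards above:

# CHAS and RAD are stored as floats but only take whole-number values
print(boston['CHAS'].unique())         # array([0., 1.])
print(boston['RAD'].unique())          # a small set of whole-numbered index values
print(boston['CHAS'].value_counts())   # tracts bounding the river (1.0) vs. not (0.0)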

import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
print(boston[['CHAS', 'RM', 'AGE', 'RAD', 'MEDV']].head())

   CHAS     RM   AGE  RAD  MEDV
0   0.0  6.575  65.2  1.0  24.0
1   0.0  6.421  78.9  2.0  21.6
2   0.0  7.185  61.1  2.0  34.7
3   0.0  6.998  45.8  3.0  33.4
4   0.0  7.147  54.2  3.0  36.2

Correlation matrix

Correlation measures linear relationships between variables. We can construct a correlation matrix to show correlation coefficients between variables.
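
A small illustration, assuming the boston DataFrame from the cards above: a single coefficient can be computed directly, and the full matrix comes from corr().

# correlation between one feature and the target
print(boston['RM'].corr(boston['MEDV']).round(2))   # about 0.70

# correlation matrix for all columns; MEDV's column sorted for a quick scan
corr_matrix = boston.corr().round(2)
print(corr_matrix['MEDV'].sort_values())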

import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
print(boston.describe().round(2))

The output shows the descriptive statistics for every column; the first two columns look like this (the remaining columns are omitted here):

        CRIM      ZN
count  506.00  506.00
mean     3.61   11.36
std      8.60   23.32
min      0.01    0.00
25%      0.08    0.00
50%      0.26    0.00
75%      3.68   12.50
max     88.98  100.00

Which of the following are examples of supervised learning problems?
Determining discussion topics for a blog
Determining the sales price of cars
Determining if an image is of a bike, car, or bus

Determining the sales price of cars
Determining if an image is of a bike, car, or bus

Lesson 21

Exploratory Data Analysis

Notes

Feature selection is used for several reasons, including simplification of models to make them easier to interpret, shorter training time, reducing overfitting, etc.
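
As one hypothetical illustration (not the lesson's prescribed method), features could be screened by their absolute correlation with the target, assuming the boston DataFrame from above:

# one possible screen: keep features whose absolute correlation with MEDV exceeds 0.6
corr_with_target = boston.corr()['MEDV'].drop('MEDV')
selected = corr_with_target[corr_with_target.abs() > 0.6].index.tolist()
print(selected)   # with this threshold only RM and LSTAT remain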

Lesson 23

Fitting a Univariate Linear Regression

Note: Describe() output

If the DataFrame contains more than just numeric values, by default describe() outputs the descriptive statistics for the numeric columns only. To show the summary statistics of all columns, specify include='all' in the method.
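
For example, a minimal sketch on the boston DataFrame built above (which contains only numeric columns, so the two calls give the same result here):

print(boston.describe().round(2))                # numeric columns only (the default)
print(boston.describe(include='all').round(2))   # summary statistics for all columns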

Note: scikit-learn

In addition to a set of algorithms, scikit-learn also provides a few small real-world datasets used by the machine learning community to benchmark algorithms, such as the Boston house-prices dataset that we will be using throughout this module and the iris dataset used for the classification task in the next.

import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
print(boston.columns)

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'], dtype='object')

summary statistics

Instead, we want to summarize and characterize sample data using only a few values. To check the summary statistics of the dataset, use describe().

LSTAT

LSTAT (percentage of lower status of the population) is the feature most negatively correlated with the target (-0.74), which means that as the percentage of lower status drops, the median house value increases;

Machine Learning

Lesson 20

Correlation

Lesson 22

linear regression

Linear regression fits a straight line to data, mathematically y = b + m*x

Note:

Linear regression models are popular because they can perform a fit quickly, and are easily interpreted. Predicting a continuous value with linear regression is a good starting point.

from sklearn.linear_model import LinearRegression
model = LinearRegression()
print(model)

LinearRegression()

Note on Machine learning

Machine learning is a set of tools used to build models on data. Building models to understand data and make predictions is an important part of a data scientist's job.

Machine learning

Machine learning, a subset of data science, is the scientific study of computational algorithms and statistical models to perform specific tasks through patterns and inference instead of explicit instructions.

Fitting the model

Now let us apply the model to data. Remember, we save the testing data to report the model performance and only use the training set to build the model.

Note: head & tail

Datasets are often loaded from other file formats (e.g., csv, text), so it is good practice to check the first and last few rows of the DataFrame and make sure the data is in a consistent format, using head() and tail() respectively.
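
A minimal sketch on the boston DataFrame built in the cards above:

print(boston.head())    # first 5 rows
print(boston.tail(3))   # last 3 rows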

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
boston.hist(column='CHAS')
plt.savefig("plot1.png")
plt.show()

A histogram of CHAS with two bars: the bar at x = 0.0 reaches about 460 on the y-axis, and the bar at x = 1.0 reaches about 30.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
boston.hist(column='RM', bins=20)
plt.savefig("plot1.png")
plt.show()

A histogram of RM with 20 bins: the tallest bar reaches about 150 on the y-axis, while the bars at the low end of the x-axis (around RM = 4) have counts of about 1.

predict()

Once the model is trained, predict() is used to generate predictions for unseen data, such as the test set.

Note:

Recall that a single bracket outputs a pandas Series, while a double bracket outputs a pandas DataFrame, and the model expects the feature matrix X to be a 2D array.
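
A short sketch of the difference, assuming the boston DataFrame from the earlier cards:

print(type(boston['RM']))     # <class 'pandas.core.series.Series'>   (single bracket)
print(type(boston[['RM']]))   # <class 'pandas.core.frame.DataFrame'> (double bracket)

X = boston[['RM']]            # 2D feature matrix of shape (506, 1), what the model expects
print(X.shape)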

Notes:

Scikit-learn makes the distinction between choice of model and application of model to data very clear.

visualization

Summary statistics provide a general idea of each feature and the target, but visualization reveals the information more clearly.

Supervised

Supervised learning is when we have a known target (also called a label) based on past data (for example, predicting what price a house will sell for).

outliers

Data points that appear to be outside of the overall pattern.

Note: predict()

The predict() method estimates the median home value by computing model.intercept_ + model.coef_*RM.
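
A small check, assuming the fitted model from the earlier cards (intercept -30.57, slope 8.46): the manual computation and predict() agree.

import numpy as np

new_RM = np.array([[6.5]])                     # 2D: one sample, one feature
print(model.intercept_ + model.coef_ * 6.5)    # manual computation: b + m * RM
print(model.predict(new_RM))                   # same result, about [24.43]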

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
boston.plot(kind='scatter', x='LSTAT', y='MEDV', figsize=(8, 6))
plt.savefig("plot1.png")
plt.show()

The scatter plot shows the points trending downward from left to right: MEDV decreases as LSTAT increases. Most points still cluster below a MEDV of 30 and an LSTAT of about 20.

import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
corr_matrix = boston.corr().round(2)
print(corr_matrix)

This prints the full 14-by-14 correlation matrix (the middle columns are truncated in the display):

         CRIM    ZN  INDUS  CHAS   NOX  ...   TAX  PTRATIO     B  LSTAT  MEDV
CRIM     1.00 -0.20   0.41 -0.06  0.42  ...  0.58     0.29 -0.39   0.46 -0.39
ZN      -0.20  1.00  -0.53 -0.04 -0.52  ... -0.31    -0.39  0.18  -0.41  0.36
INDUS    0.41 -0.53   1.00  0.06  0.76  ...  0.72     0.38 -0.36   0.60 -0.48
CHAS    -0.06 -0.04   0.06  1.00  0.09  ... -0.04    -0.12  0.05  -0.05  0.18
NOX      0.42 -0.52   0.76  0.09  1.00  ...  0.67     0.19 -0.38   0.59 -0.43
RM      -0.22  0.31  -0.39  0.09 -0.30  ... -0.29    -0.36  0.13  -0.61  0.70
AGE      0.35 -0.57   0.64  0.09  0.73  ...  0.51     0.26 -0.27   0.60 -0.38
DIS     -0.38  0.66  -0.71 -0.10 -0.77  ... -0.53    -0.23  0.29  -0.50  0.25
RAD      0.63 -0.31   0.60 -0.01  0.61  ...  0.91     0.46 -0.44   0.49 -0.38
TAX      0.58 -0.31   0.72 -0.04  0.67  ...  1.00     0.46 -0.44   0.54 -0.47
PTRATIO  0.29 -0.39   0.38 -0.12  0.19  ...  0.46     1.00 -0.18   0.37 -0.51
B       -0.39  0.18  -0.36  0.05 -0.38  ... -0.44    -0.18  1.00  -0.37  0.33
LSTAT    0.46 -0.41   0.60 -0.05  0.59  ...  0.54     0.37 -0.37   1.00 -0.74
MEDV    -0.39  0.36  -0.48  0.18 -0.43  ... -0.47    -0.51  0.33  -0.74  1.00

Critical Note:

To get an objective assessment of the model's predictive power, it is important to keep the testing data unseen while building the model.

.head(n)

To see the first few rows of a DataFrame, use .head(n), where you can specify n for the number of rows to be selected.

Note:

Understanding data using exploratory data analysis is an essential step before building a model. From sample size and distribution to the correlations between features and target, we gather more understanding at each step aiding in feature and algorithm selection.

y = b + mx

Where b is the intercept and m is the slope; x is a feature or input, whereas y is the label or output. Our job is to find m and b such that the errors are minimized.
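
A tiny worked example using the fitted values reported above (intercept b = -30.57, slope m = 8.46) for a home with RM = 6.5 rooms:

b, m = -30.57, 8.46   # intercept and slope reported by the fitted model above
rm = 6.5
print(b + m * rm)     # 24.42, matching model.predict for RM = 6.5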

Which of the following builds the feature matrix X from column RM in boston that models in scikit-learn expect:
X = boston.RM
X = boston[['RM']]
X = boston['RM']

X = boston[['RM']]

Which of the following is the correct range of correlation between variables: [-1, 1] [0, 1] [-1, 0]

[-1, 1]

import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = boston[['RM']]
Y = boston['MEDV']

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, Y_train)

import numpy as np
new_RM = np.array([6.5]).reshape(-1, 1)  # make sure it's 2d
print(model.intercept_ + model.coef_*6.5)

[24.42606323]

import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = boston[['RM']]
Y = boston['MEDV']

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, Y_train)

import numpy as np
new_RM = np.array([6.5]).reshape(-1, 1)  # make sure it's 2d
print(model.predict(new_RM))

[24.42606323]

import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = boston[['RM']]
Y = boston['MEDV']

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, Y_train)
print(model.coef_.round(2))

[8.46]

Which of the following is part of machine learning?
a set of tools to build models on data
process of collecting data
design surveys

a set of tools to build models on data

Linear Regression

A supervised machine learning algorithm. In the modules to come we will also explore another supervised machine learning algorithm, classification, as well as an unsupervised machine learning algorithm, clustering.

Fill in the blanks to generate the histogram of feature LSTAT (% lower status of the population) in boston: boston._______(________='LSTAT', bins=20); plt.show();

boston.hist(column='LSTAT', bins=20); plt.show()

Fill in the blanks to find the number of rows in DataFrame 'boston': boston._____[__]

boston.shape[0]

Complete the code to inspect the first five rows of the DataFrame boston: boston.______(___=___)

boston.head(n=5)

Fill in the blanks to output the summary statistics of feature AGE in the dataframe 'boston': boston['AGE'].________()

boston['AGE'].describe()

boston_dataset.feature_names contains the names of all features. We then add the target to the DataFrame:

boston['MEDV'] = boston_dataset.target

What command do we use to apply the model to data? train() fit() predict()

fit()

Fill in the blanks to import the linear regression class and instantiate the model: _________ sklearn.linear_model import LinearRegression model =______________ ______________ ()

from sklearn.linear_model import LinearRegression
model = LinearRegression()

The data is built in scikit-learn and we will use load_boston to load the object that contains all the information.

from sklearn.datasets import load_boston
boston_dataset = load_boston()

Drag and drop to split the data into training and testing sets: from sklearn.model_selection import ____________ X_train, X_test, ________ , Y_test = train_test_split(X, Y, ________= 0.3, random_state=1)

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

Code for the train_test_split function

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

Rearrange the tasks to the correct machine learning workflow: import fit predict instantiate

import > instantiate > fit > predict

Learning DataFrame

import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
## build a DataFrame
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
X = boston[['RM']]
print(X.shape)

For easier manipulations later, we create a pandas DataFrame from the numpy ndarrays stored in boston_dataset.data as follows:

import pandas as pd
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)

Train_test_split function

A function inside scikit-learn's model_selection module used to split the data into two random subsets.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_boston

boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
boston.plot(kind='scatter', x='RM', y='MEDV', figsize=(8, 6))
plt.savefig("plot1.png")
plt.show()

The scatter plot shows a cluster of points centered toward the lower left, roughly below a MEDV of 30 with RM between 6 and 7 on the x-axis.

unsupervised

Unsupervised learning is when there isn't a known past answer (for example, determining the topics discussed in restaurant reviews).

Fill in the blanks to access the slope of the fitted line in object model: model.__________

model.coef_

The fit() command triggers the computations and the results are stored in the model object. Code:

model.fit(X_train, Y_train)

Scikit-Learn

One of the best-known machine learning libraries in Python; it implements a large number of commonly used algorithms.

Syntax

The syntax follows the same workflow: import > instantiate > fit > predict.
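
A compact sketch of the four steps, assuming X and Y have already been built from the boston DataFrame as in the cards above:

# 1. import
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 2. instantiate
model = LinearRegression()

# 3. fit, using only the training portion of the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
model.fit(X_train, Y_train)

# 4. predict on unseen data
y_test_predicted = model.predict(X_test)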

RM

while RM (the average number of rooms per dwelling) is most positively correlated with MEDV (0.70) which means that the house value increases as the number of rooms increases.

Linear regression is represented mathematically as y = b + m * x. Which of the following correctly describes the variables?
x is the intercept and y is the slope
x is the input and y is the output
b is the input and m is the output
b is the intercept and m is the slope

x is the input and y is the output
b is the intercept and m is the slope

Fill in the blanks to make prediction on the training set y_train_predicted = model._______(X_train)

y_train_predicted = model.predict(X_train)

