Machine Learning Final


What is the purpose of the variable type category? When should it be used?

By default, Pandas stores non-numerical variables as type object, and such a variable is necessarily categorical. But categorical variables can also be coded as integers. The category type is a newer data type meant specifically for categorical variables. Non-numerical variables can be declared as type category, and so can, and should, integer-coded categorical variables that would otherwise be stored as an integer type.
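As a minimal sketch of declaring the category type, assuming a small made-up data frame (the names d, Gender, and Rating are illustrative only):

```python
import pandas as pd

# Hypothetical data: Gender is stored as text, Rating as integer codes
d = pd.DataFrame({"Gender": ["F", "M", "F"], "Rating": [1, 3, 2]})

# Declare both variables as type category
d["Gender"] = d["Gender"].astype("category")
d["Rating"] = d["Rating"].astype("category")

# Both columns now report the category dtype
print(d.dtypes)
```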

What is a cluster centroid and how is it computed?

A cluster centroid is the center of the cluster that is calculated as the mean of the corresponding feature values of each sample in the cluster for each feature
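The computation can be sketched with NumPy, assuming a made-up feature matrix X and made-up cluster labels (both hypothetical):

```python
import numpy as np

# Hypothetical data: 4 samples, 2 features, with assigned cluster labels
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [10.0, 0.0]])
labels = np.array([0, 0, 0, 1])

# Centroid of cluster 0: the mean of each feature over the samples in the cluster
centroid_0 = X[labels == 0].mean(axis=0)
print(centroid_0)
```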

What are the two types of cells in a Jupyter notebook? What is the purpose of each?

A code cell is for entering and running Python code. A markdown cell is for documentation.

Distinguish between training data and testing data

A core concept of machine learning is that forecasting efficiency cannot be evaluated on the data on which the model trained, i.e., the data from which the model coefficients are estimated. That evaluation can only occur by observing the errors of applying the model to new data, which is the actual forecasting situation.

What is a csv file? What are its properties, its primary advantage and its primary disadvantage (compared to an equivalent Excel file)?

A csv file is a comma separated values file, pure text, so readable by virtually every application that can read text. Its primary advantage is near-universal readability. Its primary disadvantage is that, unlike a worksheet, its columns are not aligned when viewing the file as text, which is remedied by opening the file in a worksheet app such as Excel.

Explain the following Pandas code: data[rows, columns]

A data frame is a 2-D object, rows and columns. Any one data value in a data frame is identified by its row and column coordinates. This notation specifies the name of the data frame and then references data values in one or more rows and columns.

What is a data table and how are the data values organized?

A data table is a rectangular table of the data values subject to analysis. The first row contains the variable names, each other row contains the data for a single unit of analysis, such as a person or company. Each column contains the data values for a single variable.

What is a hold-out sample, and what is its purpose?

A hold-out sample is the testing data, data on which the model was not trained (fit). Its purpose is to evaluate the forecasting efficacy of the model in a true forecasting situation in which the model is "unaware" of the value of y, but the analyst is aware, and so can evaluate the error directly.

What is meant by the term homogeneous group in the context of classification?

A homogeneous group consists of samples that are in the same group, such as Male or Female body type.

What is a leaf from a decision tree? What does it represent?

A leaf is a node at the bottom of the decision tree, a final classification in which no more splits occur.

What is the distinction between a model parameter and a hyper-parameter? Give an example of each

A model parameter is a characteristic of a specific model estimated from the training data, such as the slope coefficients of a regression model. A hyper-parameter is a characteristic of any one model that is set by the analyst, but could vary across models, such as the depth of a decision tree.

Why is the logit transformation of best fit more appropriate for binary classification than a straight line of best fit?

A straight line cannot effectively summarize a scatter plot of a target variable that only has two values. Instead of a cloud of points, there are two lines of points across the values of the x-variable. Instead of a straight line, an S-shaped curve of the logit provides a more suitable curve for summarizing the relation between continuous x and binary y

What is the shape of the visualization of a linear model?

A straight surface. In two dimensions, that is a line. In three dimensions, a plane. And beyond, a hyperplane.

What is the only way to know for sure which rescaling is best for the data values of the predictor variables (features) - MinMax, Standardization, or Robust Scaling?

A theme of machine learning is to see what works. Keep testing data separate from training data, and do whatever you want with the data, choosing what ultimately works best. Most machine learning algorithms perform better, that is, yield more accurate forecasts, when the data have about the same scales. Experience is that often it makes no difference which method is chosen, but one does not know that in advance when working with new data. Try them out and see which, if any, increases forecast accuracy.

What is a variable transformation and how do you perform one with Python Pandas?

A variable transformation defines a new variable, or creates new values for an existing variable, by performing an arithmetic operation on existing variables. To specify in Python, simply specify the arithmetic operation, realizing that variables are identified as part of the data frame in which they exist. For example: d['Salary000'] = d['Salary'] / 1000

What is a linear model? What are its parameters?

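A weighted sum of variables plus a constant term. Its parameters are the weights, one slope coefficient for each variable, and the constant term (the y-intercept).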

What is standardization to z-scores?

A z-score indicates how many standard deviations the original value is from the mean of the distribution. The distribution of z-scores has the same shape as the original distribution, but with a different scaling.

What is the accuracy of a binary prediction? When is it not of the most interest?

Accuracy is the percent of correct classifications, the number of true positives plus true negatives divided by the total number of samples, including the false classifications.

What is the purpose of an indicator (or dummy) variable?

An indicator variable is a numerical representation of a categorical variable. The number of indicator variables formed is equal to the number of levels or categories. A dummy variable is an indicator variable that assumes the value 0 if the category level is not present, and a 1 when present.

How is the least-squares regression model obtained with a gradient descent solution?

An initial, even arbitrary solution for the model parameters is given. Then, to minimize the squared errors across all the rows of data, the parameter values are changed. Then again. Then again, each time getting closer to the smallest possible sum of squared errors. The process stops when changing the parameter estimates results in virtually no change in the sum of squared errors.
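The process described can be sketched for a one-feature model, assuming made-up noise-free data, an assumed step size (learning_rate), and an assumed stopping tolerance; none of these values come from the source:

```python
import numpy as np

# Hypothetical data generated from y = 2 + 3x, with no noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

b0, b1 = 0.0, 0.0        # arbitrary starting values for the parameters
learning_rate = 0.05     # assumed step size
prev_sse = np.inf

for step in range(10000):
    residuals = y - (b0 + b1 * x)
    sse = np.sum(residuals ** 2)
    # Stop when the sum of squared errors virtually stops changing
    if prev_sse - sse < 1e-12:
        break
    prev_sse = sse
    # Gradient of the mean squared error with respect to b0 and b1
    grad_b0 = -2.0 * residuals.mean()
    grad_b1 = -2.0 * (residuals * x).mean()
    # Move each parameter a small step against its gradient
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

print(b0, b1)
```

With these assumed settings the estimates should settle close to the generating values of 2 and 3; a step size that is too large can make the search diverge instead.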

What is the problem overfitting presents in evaluating model forecasting performance?

An overfit model fits the training data well, but has poor generalization to actual forecasting, that is, to new data (e.g., the testing data). The good fit to the training data is irrelevant.

What is a package manager? Why do we use a package manager for our Python work, and which package manager do we use?

Base Python by itself makes an excellent programming language. One does not want to program everything, however, but rather use other, already developed software, such as for machine learning. This additional software is organized into packages of related functions. A package manager is used to download, install, and update these different packages without much work on the part of the user.

What is supervised machine learning?

Based on a prediction equation, or in more complex cases, a network of inter-related prediction equations, supervised machine learning forecasts unknown values of a variable of interest.

Why is binary prediction the process of classification?

Binary prediction is classification into only one of two categories. Distinguish classification into a category from measurement of a quantitative variable. For example, classify someone as Male or Female body type, but measure their height.

A cluster should demonstrate cohesion and separation. What are these concepts? What fit index simultaneously assesses these two concepts? How is it interpreted?

Cohesion describes how tight a cluster is, i.e., how close the members of a cluster are to each other. Separation describes how distinct the clusters are from each other. The silhouette metric assesses both concepts. For each point, it subtracts the average distance of the point to the other members of its assigned cluster from the average distance of the point to the members of the nearest other cluster. That difference should be positive, and larger is better. It is then divided by the larger of the two distances to standardize the value between -1 and 1. Silhouette therefore varies between -1 and 1, and we want to see high positive values.
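As a sketch of the silhouette for a single point, assuming the common definition s = (b - a) / max(a, b), where a is the average distance to the other members of the point's own cluster and b is the average distance to the members of the nearest other cluster. The function name and data are hypothetical, and the sketch assumes every cluster has at least two members:

```python
import numpy as np

def silhouette_of_point(i, X, labels):
    """Silhouette of sample i; assumes each cluster has two or more members."""
    distances = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    own[i] = False                      # exclude the point itself
    a = distances[own].mean()           # cohesion: own-cluster distance
    b = min(distances[labels == k].mean()
            for k in np.unique(labels) if k != labels[i])  # separation
    return (b - a) / max(a, b)

# Hypothetical data: two tight, well-separated clusters
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])

s0 = silhouette_of_point(0, X, labels)
print(round(s0, 3))
```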

One potential issue with multiple regression is collinearity. Describe the problem and how it can be addressed

Collinearity means that predictor variables (features) correlate substantially with each other. Collinearity increases the standard errors of the estimated collinear slope coefficients as the estimation algorithm cannot readily separate their effects (holding the others constant). Informal detection is to inspect the feature correlation matrix for high correlations. More formal is to regress each feature on all the others, to detect which ones can be explained in terms of the other features and perhaps, then, not needed in the model, a feature (de)selection technique.

How can the analyst determine if a model is overfit?

Compare the fit of the model from the training data to the testing data. If there is a big decrease, the model is overfit to the training data.

What makes machine learning a 2nd-decade 21st century technology, as opposed to, say, the 1990's?

Computer power. Having massively more computer power allows more intensive analyses of algorithms that existed for decades, such as applying hyper-parameter tuning to multiple regression. Further, new estimation algorithms have been developed, such as random forest, that are only possible with much computer power. (Cheap laptops today are more powerful in terms of numerical crunch power than the fastest super-computers just 15 years ago.)

What is generally the best statistic to use to identify outliers in regression analysis? What is its meaning?

Cook's Distance is an influence statistic. That is, it indicates the influence of a single observation, the values of the predictor variables in a single row of data, on the value of the estimated regression coefficients, the y-intercept and slope coefficients. There is one Cook's Distance for every row of data. The observations with values of Cook's Distance much larger than the remaining values may be outliers. If these observations are removed from the analysis, there may be a noticeable change of the resulting model. This change may be especially noticeable as the estimated coefficients minimize the sum of squared errors, and outliers generate very large squared errors.

What is data wrangling? Why is it important?

Data almost never arrives ready for analysis. Many issues need to be addressed, such as inconsistent coding of responses, missing data, and superfluous variables. Even with those issues settled, the data usually needs pre-processing, including standardization (or similar) and conversion of categorical variables to indicator variables.

Meaning and interpretation of the hypothesis test of the slope coefficient.

Each predictor variable in the model is associated with a slope coefficient. The purpose of each t-test, as specified by its null hypothesis, is to evaluate the hypothesis that the relation between the corresponding predictor variable (feature) and the response variable, as specified by the population slope coefficient β1, is zero. That is, the null hypothesis is β1 = 0. The alternative hypothesis is that the relation exists, either in a positive or negative sense, that is, β1 ≠ 0. If p-value < α, where α generally equals 0.05, then reject the null hypothesis and conclude either that β1 > 0 or β1 < 0, depending on the sign of the corresponding sample slope coefficient, b1.

Meaning and interpretation of the confidence interval of the slope coefficient.

Each predictor variable in the model is associated with a slope coefficient. The slope coefficient specifies the average change in y for a unit increase in the predictor variable. For each estimated slope coefficient, bj, that is, the sample value, there is a corresponding population value, βj. What is the value of βj? That is not known, but the 95% confidence interval provides the range of values that likely contains βj. If zero is in the confidence interval, then the analysis is unable to demonstrate a relationship between the predictor variable and the response variable, with all other predictor variables held constant.

Explain the concept of a row name in a data frame. Describe the default row names and the advantage of replacing them with a suitable column from the read data frame.

Each row has a unique identifier. By default, the identifiers are the consecutive integers, starting from zero. However, the data file may contain a unique identifier as one of the already existing columns, such as Name for a data file of employees. In that situation, better to replace the default integers with the more meaningful names.
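A minimal sketch of replacing the default row names, assuming a made-up employee data frame with a Name column (all values hypothetical):

```python
import pandas as pd

# Hypothetical employee data; the default row names are 0, 1, 2
d = pd.DataFrame({"Name": ["Ana", "Ben", "Cy"],
                  "Salary": [50000, 62000, 58000]})

# Replace the default integers with the more meaningful names
d = d.set_index("Name")

# Rows can now be referenced by name
ben_salary = d.loc["Ben", "Salary"]
print(ben_salary)
```

When reading a csv file, the same result can usually be obtained in one step with the index_col parameter of pd.read_csv.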

For a given set of customers, almost all weigh between 110 and 300 lbs. One customer, an outlier, reports a weight of 460 lbs. What is the basis for dropping the customer from the analysis? How does such an action change the reported results?

Either the data value is mis-entered, or, if correct, any generalizations would not properly apply to these people, and perhaps bias the model for the vast majority of customers. In practice, experiment with different deletion thresholds as perhaps a better model can be constructed by focusing, for example, on only people with weights between 100 and 300 lbs. With otherwise so much variability, the model may perform more poorly for everyone, instead of better performance for the vast majority of customers who weigh between 100 and 300 lbs.

What is model validation and what is the problem with using training data to validate a model?

Every data set sampled from a population differs from any other data set sampled from the same population. Every sample reflects the underlying population values, but every corresponding sample value, such as the mean, does not equal the corresponding population value. So fitting sample data from which the model trained fits random sampling error as well as true, stable population characteristics. A model can fit training data perfectly, but have no useful ability to forecast on new (testing) data.

Why can a model not be properly evaluated on its training data?

Every data set sampled from a population differs from any other data set sampled from the same population. Every sample reflects the underlying population values, but every corresponding sample value, such as the mean, does not equal the corresponding population value. So fitting sample data from which the model trained fits random sampling error as well as true, stable population characteristics. A model can fit training data perfectly, but have no useful ability to forecast on new (testing) data.

When predicting a binary outcome, what are the two ways to be wrong?

False Negative, when the model predicts the sample in a group is not in the group. False Positive, when the model predicts the sample is in the group, and it is not.

What does it mean to filter rows by data values?

Filtering subsets a data frame by rows, selecting only those samples that satisfy some logical criterion, such as Gender == 'F', which reduces a data frame down to only those rows of data marked with F as the gender
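The Gender == 'F' filter can be sketched as follows, assuming a small made-up data frame d (the Height values are hypothetical):

```python
import pandas as pd

# Hypothetical data
d = pd.DataFrame({"Gender": ["F", "M", "F", "M"],
                  "Height": [64, 70, 66, 72]})

# The logical criterion yields True/False for each row;
# indexing by it keeps only the rows marked True
females = d[d["Gender"] == "F"]
print(females)
```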

What are quartiles of a distribution and how are they computed?

For a given variable, the first quartile (Q1) is the middle number between the smallest number and the median of the data set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the middle value between the median and the highest value of variable.

What is their range if the distribution is normal.

For a normal distribution, a little more than 95% of the values fall within two standard deviations of the mean.

What is y^? What are the two primary situations in which it is applied?

Given a regression model, y^ is the value calculated from the values of the predictor variables. If applied to the data from which the model was estimated, y^ is the value of y fit by the model for the associated values of the predictor variables, called the fitted value. If predicting a future event for which the value of y was not available or used to estimate the model, then y^ is called the predicted value or the forecasted value.

What is local optimization (such as regarding the decision tree solution)? What is its primary disadvantage?

Given the initial model configuration, such as the first split in a decision tree analysis, a different final model likely emerges than if another first split had been taken on a different variable. The problem is that the splitting process to move down one more layer of depth is done locally, without any "long view" of where the process is going. So an optimal first split may have been barely more optimal than if another variable and split had been chosen, but perhaps the second alternative would ultimately have led to a series of splits that achieved a more successful level of classification.

How is hyper-parameter tuning related to fishing? When is it OK to do so?

Hyper-parameter tuning is searching for the best parameter setting, such as the number of features in a model, without any real theoretical reason for choosing the value. Instead, use modern computers to grind away at a large range of possibilities, choosing the best. This procedure is OK as long as the searching is done on training data and the testing on completely different data.

What does it mean to say that the median and inter-quartile range are robust to outliers?

If all the values of a distribution remain the same except that the largest value of the distribution changes from 10 to 10,000,000,000, the median and IQR remain unchanged. On the contrary, the mean and standard deviation will be drastically affected.
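This can be checked directly. The sketch below assumes the made-up values 1 through 10 and uses NumPy's default (linear interpolation) percentiles:

```python
import numpy as np

# Hypothetical distribution, then the same values with the largest changed
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
changed = values.copy()
changed[-1] = 10_000_000_000

def iqr(v):
    """Inter-quartile range: third quartile minus first quartile."""
    q1, q3 = np.percentile(v, [25, 75])
    return q3 - q1

# Median and IQR are identical for both sets of values
print(np.median(values), np.median(changed))
print(iqr(values), iqr(changed))

# Mean and standard deviation change drastically
print(values.mean(), changed.mean())
print(values.std(), changed.std())
```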

Model Fit: How is the standard deviation of the residuals used to interpret model fit?

If the residuals are normally distributed as the result of a random process, as they usually are, then the range from -2 to +2 standard deviations on either side of zero contains about 95% of the forecasting errors. The smaller the standard deviation of the residuals, the narrower this range and the better the fit.

Meaning of the slope coefficient in y^ = b0 + b1X1

In this regression model with a single predictor (feature), b1 is the slope coefficient, estimated from the data. The slope coefficient indicates how much y changes, on average, with an increase of 1 unit in X. With multiple regression, each slope coefficient is interpreted with the values of the remaining predictor variables (features) held constant.

How is it that given a decision tree with enough levels of depth, the model can recover the correct class value (e.g., Gender) with perfect accuracy?

Increasing the complexity of the model ensures better fit on the training data. With a decision tree analysis, the analyst can add enough depth (splits) to the model to correctly classify everything. Of course, such classification will not generalize beyond the training data, so such a model is useless to deploy.

Why is k-fold cross-validation preferred to just splitting the original data into training and testing data, a train/test split?

Instead of just one arbitrary, usually randomly selected hold-out sample, there are k holdout samples. Any one train/test split may result, by chance, in a weird test sample or training sample. With k different such training/test splits the average performance of the model across all the k-folds yields a more stable estimate of the forecasting efficacy, such as with MSE or se.

Why is it good practice to investigate multiple initial configurations of centroids when pursuing a K-means cluster analysis?

It is good practice because different initial configurations can lead to a different cluster solution. By starting with different initial configurations, we can ultimately select that configuration which best balances cohesion with separation.

In machine learning the variable to be forecasted or predicted is called the target or the label. When is the term label most appropriate?

Label is most appropriate when forecasting the level or category of a categorical variable. The level is described by a label in the usual English definition of the word.

Discuss with an example: The core of machine learning is pattern recognition

Life and the world are not random collections of atoms. There are patterns everywhere, which form, for example, the basis of science, and also the basis of human decision making. Many of these patterns are not easily detected by humans. The usefulness of machine learning is to detect these patterns and generalize them to future events, the basis of forecasting.

What does it mean to state that "A model should be made as simple as possible, but not simpler."?

Make the model as complex as it can be to capture the relevant information in the training data to avoid underfitting without so much complexity the model overfits.

What is the purpose and benefit of feature selection (i.e., select predictor variables for a model)

Not all potential features are relevant (correlate with y) and unique (do not correlate with other X's). As such, they contribute little, or even detract, from forecasting efficiency and model interpretability. Particularly for large data sets, they can also add potential machine time for the computations. As such, best to rid the model of irrelevant features.

Define overfitting.

Overfitting is when a model is too complex, where the extra complexity takes advantage of random sampling fluctuations in the training data to increase fit.

What is the relationship of Python and Pandas?

Pandas is a package of functions that add pre-built data analysis capabilities to Python. The primary Pandas data structure is the data frame.

Criteria that a potential feature (predictor variable) should satisfy before added to a model.

Predictor variables (features) should be relevant (correlate with y), and provide unique information (do not correlate with the other X's).

Briefly explain how multiple regression enhances the two primary purposes of regression analysis?

Predictor variables (features) that are relevant (correlate with y), and provide unique information (don't correlate with the other X's), lead to a) better, more accurate prediction, and b) a better understanding of how the variables are related to each other.

Why is Python a good language to use for machine learning?

Python offers a framework for machine learning, so if you run one machine learning analysis with one algorithm, it is easy to re-run the same code with another algorithm just by changing the name of the algorithm

Why is R-squared called a relative index of fit?

R-sq literally compares the residuals from two models: the specified model, and the null model where the X's are unrelated to y so that the forecast is just the mean of y.
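The comparison can be sketched numerically, assuming the usual formula of 1 minus the ratio of the two sums of squared residuals, with made-up values of y and made-up fitted values from some model:

```python
import numpy as np

# Hypothetical actual values and fitted values from a specified model
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.3, 6.9, 9.0])

# Squared residuals of the specified model
sse_model = np.sum((y - y_hat) ** 2)

# Squared residuals of the null model, which forecasts the mean of y for every row
sse_null = np.sum((y - y.mean()) ** 2)

# R-squared: proportional reduction in squared error relative to the null model
r_squared = 1 - sse_model / sse_null
print(round(r_squared, 3))
```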

Meaning of the residual variable e

The residual variable e is the difference between the actual value of y and the estimated value of y, y^. The residual or error represents the influences on the value of y not explained or accounted for by the model.

What is the purpose of the sensitivity (recall) metric?

Sensitivity assesses how many samples in the positive group are correctly classified as positive. It is applicable to situations where the concern is of missing something that exists, such as cancer in a medical diagnosis, a terrorist as a passenger on an airplane, or a poor-quality part in a manufacturing scenario.

Identify and briefly explain the two types of supervised machine learning regarding the nature of the target variable

Supervised machine learning trains a model to predict a target. The two types of supervised machine learning either forecast a continuous variable, such as linear regression, or forecast a classification into a category, such as logistic regression.

What is one statistic for assessing homogeneity? What is its range and how is it interpreted?

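The Gini index is a primary statistic with which to evaluate the homogeneity of a classification. For a decision tree node it is computed as 1 minus the sum of the squared proportions of each group in the node. The value is 0 for a perfectly homogeneous node, in which all samples belong to one group, and increases as the groups become more evenly mixed, reaching a maximum of 0.5 for two groups, at which point no benefit is obtained from the classification.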

Briefly explain the concept and purpose of a Jupyter notebook

The Jupyter notebook allows the user to interactively program and run analyses with full documentation.

Explain the concept of a current working directory

The Jupyter notebook needs a starting point for referencing files to read and write. That reference point is the current working directory. All file references are relative to this directory

Consider the following scatterplot as related to forecasting body type according to Gender. To forecast Gender, would a decision tree algorithm choose the Waist or Shoe feature to make the first split? Why? About where would the split occur (the decision boundary)? Why?

The algorithm would choose the Shoe size feature because a split at about 8 ¼ does a good, though not perfect, job of separating Male and Female body types. There is no split on the Waist feature that attains any decent accuracy of differentiation

When classifying customers to identify those most likely to churn (exit as a customer), which of the three classification metrics is the most useful: accuracy, recall, or precision? Why?

The analysis of customer churn is primarily concerned with not losing existing customers. Unless the amount of resources dedicated to those customers predicted as likely to leave is excessive, it is better to avoid the false negatives. That is, it is better to have some customers predicted to churn who do not than to miss those who do churn. As such, the most relevant fit index is sensitivity (recall).

When conducting a machine learning analysis, how can the analyst detect overfitting?

The fit indices will look great on the training data, and much worse on the testing data, the indicator of real-world performance.

Graph of X with y vs the graph of X with y^.

The graph of X with Y^ is a single line (for a linear function to predict y). The graph of X and y is a scatter plot.

How does a heat map facilitate feature selection?

The heat map is a visualization of a correlation matrix. An informal, but useful, feature selection technique is to delete some collinear features. The heat map not only provides the correlations, but also color codes each according to its magnitude. Such color-coding assists in identifying large correlations.

Criterion of ordinary least squares regression to obtain the estimated model.

The least squares criterion is the choice of the regression model that minimizes the sum of squared residuals across all the rows of data in the analysis. That is, this estimation process yields the values of each bj that, as a set, define the linear function that results in the smallest possible sum of squared residuals, ∑e², where e ≡ y − y^.

What are the mean and standard deviation of a distribution of z-scores?

The mean of a distribution of z-scores is 0, with a standard deviation of 1
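A quick check with Pandas, assuming a made-up Height variable. Note that the Pandas std method uses the sample standard deviation (n - 1 in the denominator) by default:

```python
import pandas as pd

# Hypothetical data
d = pd.DataFrame({"Height": [60.0, 64.0, 68.0, 72.0, 76.0]})

# Standardize: subtract the mean, then divide by the standard deviation
z = (d["Height"] - d["Height"].mean()) / d["Height"].std()

# Mean of the z-scores is 0 and their standard deviation is 1 (within rounding error)
print(z.mean(), z.std())
```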

How is the inter-quartile range analogous to the standard deviation in terms of both being summary statistics of a distribution of a continuous variable?

The more variable the values of a distribution, the more extreme are the first and third quartiles of the distribution. The IQR is the positive difference between the first and third quartiles. So the larger the variability of the values of a variable, the larger are both the standard deviation as well as the IQR.

What is the purpose of the precision metric?

Precision assesses how many of the samples classified into the positive group actually belong there, so it reveals how many samples in the negative group are incorrectly classified into the positive group. How many airline passengers were incorrectly identified initially as terrorists? How many patients were told they have cancer when they did not? How many good parts were incorrectly classified as bad?

What is the relation of a random forest estimator/model to a decision tree?

The random forest is an evaluation of many different decision trees. The algorithm constructs a series of decision trees where each tree is based on a different random sample of (a) the data, with replacement, and (b) the available features, with the final model an average of the different trees.

What is the root node of a decision tree? What does it represent?

The root node is the beginning node, before any classification takes place. Its membership is the number of samples in each group as the analysis begins, such as the number of Men and Women in the analysis.

Define outliers of a distribution in terms of its inter-quartile range.

The traditional definition is that an outlier is more than 1.5 IQRs below the first quartile or above the third quartile of the distribution.
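The rule can be sketched with Pandas, assuming a made-up series of customer weights and the default (linear interpolation) quantile method:

```python
import pandas as pd

# Hypothetical customer weights in lbs
weights = pd.Series([110, 150, 160, 175, 180, 200, 300, 460])

q1 = weights.quantile(0.25)
q3 = weights.quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 IQRs below Q1 or above Q3
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = weights[(weights < lower) | (weights > upper)]
print(outliers)
```

For these assumed values only the weight of 460 falls outside the two cutoffs.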

What metric balances sensitivity and precision? How does it accomplish the balance?

The value of the F1 metric lies between the sensitivity and precision values, as their harmonic mean.
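The classification metrics can be computed from the four cells of a confusion table. The counts below are made up for illustration:

```python
# Hypothetical confusion table counts
tp, fn = 40, 10   # actual positives: correctly and incorrectly classified
fp, tn = 20, 30   # actual negatives: incorrectly and correctly classified

accuracy = (tp + tn) / (tp + tn + fp + fn)   # percent of correct classifications
recall = tp / (tp + fn)                      # sensitivity: positives that were found
precision = tp / (tp + fp)                   # predicted positives that are real

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, recall, round(precision, 3), round(f1, 3))
```

The harmonic mean is pulled toward the smaller of the two values, so F1 is high only when both sensitivity and precision are high.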

What is the distinction between categorical and continuous variables? Provide an example variable of each along with some sample values.

The values of categorical variables are non-numeric categories, even if with integer values, and there are relatively few unique values. Continuous variables are always numeric and have many possible values.

When predicting a binary outcome, what are the two ways to be correct?

There are two groups. The two ways to be correct are to correctly classify a sample into its correct group, one called the positive group and the other the negative group, so a true positive or a true negative.

What is the distinction between the Pandas .loc and .iloc methods? What is their purpose?

These two methods subset a data frame, by rows and/or columns. .loc subsets by row name or column name. .iloc subsets by index, i.e., the ordinal position of the row or column, starting with (unfortunately) 0.
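A short sketch of the distinction, assuming a made-up data frame with names as the row labels:

```python
import pandas as pd

# Hypothetical data with row names
d = pd.DataFrame({"Salary": [50000, 62000, 58000],
                  "Years": [3, 7, 5]},
                 index=["Ana", "Ben", "Cy"])

# .loc subsets by row name and column name
by_name = d.loc["Ben", "Salary"]

# .iloc subsets by ordinal position, starting from 0:
# second row is position 1, first column is position 0
by_position = d.iloc[1, 0]

print(by_name, by_position)
```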

When in the sklearn Python machine learning environment, how similar is the code for doing k-fold validation for least-squares regression vs. logistic regression? What is the distinction?

This is a huge strength of the sklearn machine learning analysis environment. Simple code changes, such as instantiating another estimation module, can invoke an entirely different estimation algorithm. The analyst can easily test multiple algorithms and choose the best for the data set.
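As a sketch of how little the code changes, the example below runs a 3-fold cross-validation for both estimators on made-up data. The data, the number of folds, and the scoring choices are assumptions; the visible differences are the estimator that is instantiated, the type of target, and the fit metric:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data: two features, one continuous target, one binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y_cont = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=60)
y_bin = (X[:, 0] + X[:, 1] > 0).astype(int)

kf = KFold(n_splits=3, shuffle=True, random_state=0)

# Least-squares regression: continuous target, squared-error metric
reg_scores = cross_val_score(LinearRegression(), X, y_cont,
                             cv=kf, scoring="neg_mean_squared_error")

# Logistic regression: same call, different estimator, target, and metric
clf_scores = cross_val_score(LogisticRegression(), X, y_bin,
                             cv=kf, scoring="accuracy")

print(reg_scores.mean(), clf_scores.mean())
```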

A machine learning analyst investigates fit for a decision tree model with 2, 3 and 4 features at depths of 2 and 6, with a 3-fold cross-validation. a. How many distinct models are analyzed? b. How many analyses are performed? c. How many hyper-parameters are investigated?

Three choices for the number of features and two depths lead to the analysis of 3x2=6 models. The 3-fold cross-validation subjects each model to three analyses, so 6x3=18 analyses. Two hyper-parameters are investigated, tree depth and number of features.
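The counting can be verified by enumerating the grid of hyper-parameter combinations. The variable names are illustrative only:

```python
from itertools import product

# The two hyper-parameters and their candidate values
n_features = [2, 3, 4]
depths = [2, 6]
k_folds = 3

# Every combination of the hyper-parameter values is one distinct model
models = list(product(n_features, depths))
n_models = len(models)

# Each model is fit and evaluated once per fold
n_analyses = n_models * k_folds

print(n_models, n_analyses)
```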

Consider an item on a survey with three possible responses D (disagree) N (neutral) A (agree). What indicator variables would be defined and how are the values of those variables determined?

Three indicators would be defined, one for each potential response. The values of these variables would be 0's and 1's, with the indicator variables getting a '0' where that response was not present in the initial categorical variable, and a '1' where that response was present in the initial categorical variable.
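A minimal sketch with the Pandas get_dummies function, assuming a made-up column named Item that holds four survey responses:

```python
import pandas as pd

# Hypothetical survey responses
d = pd.DataFrame({"Item": ["D", "N", "A", "A"]})

# One indicator variable per response, coded 0/1
indicators = pd.get_dummies(d["Item"], prefix="Item", dtype=int)
print(indicators)
```

Each row has exactly one 1 across the three indicator columns, marking the response that was given.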

When breaking data into training/test subsets, when forecasting a categorical variable why do we want the same proportion of people in each group in each subset as in the full data set?

To evaluate how good a model is, we need to compare to forecasting without the model. That forecast is from what is called the null model, which, for logistic regression, is the forecast to the group with the most members. If the proportion of members in the groups changes with each random assignment of samples to training and test data, so does the accuracy of the null model, and with it the apparent performance of the model.
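A minimal sketch of preserving group proportions with the stratify argument of sklearn's train_test_split. The 80/20 group split and the data values are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical response: 80 samples in group 0, 20 in group 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the group proportions the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Proportion of group 1 in the full data, training data, and test data
print(y.mean(), y_train.mean(), y_test.mean())
```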

What are the two goals of supervised machine learning?

Two important goals are accomplished with regression analysis: (1) understand the relationship between a predictor variable (feature) and the response variable, and (2) forecast the unknown future values of a response variable. Adding relevant predictor variables with new information contributes to both goals. Additional relevant predictor variables contribute to our understanding of the relation between a predictor and the response variable with the values of all other predictors held constant, via the imposition of statistical control, and increase the forecasting accuracy of the model.

Define underfitting, and discuss the problem it presents for model development.

Underfitting means the model is too simple to capture all the information in the training data that is not random variation, but reflects stable aspects of the underlying population. The problem is that an underfit model forecasts poorly on both the training data and new data.

What is an iterative solution for model coefficients?

When a direct algebraic solution is not possible, then the method to compute the estimated parameter values relies upon iteration, the method of gradient descent. Start with a somewhat if not completely random guess as to the parameter values. Then, using calculus, the algorithm moves the parameter values in a direction that further minimizes the error. Keep going until no further error minimization is obtained, or until the maximum number of iterations is exceeded because some estimates never converge. The problem is that a local minimum may have been obtained. With another set of starting values, a better minimum may result.
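The steps above can be sketched for a one-predictor linear model. This is an illustration only: least-squares regression does have a direct solution and its error surface has a single minimum, so it is used here just because the result is easy to check. The data, learning rate, tolerance, and iteration limit are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generated from y = 2 + 3x plus noise
x = rng.uniform(0, 1, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 0.1, 100)

b0, b1 = rng.normal(size=2)  # random starting guess for intercept and slope
learning_rate = 0.5          # assumed step size
max_iter = 10_000            # stop here if the estimates never converge
tol = 1e-12                  # stop when the error no longer improves

prev_mse = np.inf
for iteration in range(max_iter):
    residual = y - (b0 + b1 * x)
    mse = np.mean(residual ** 2)
    if prev_mse - mse < tol:  # no further error minimization obtained
        break
    prev_mse = mse
    # Gradient of the mean squared error with respect to each parameter
    grad_b0 = -2 * np.mean(residual)
    grad_b1 = -2 * np.mean(residual * x)
    # Move each parameter in the direction that reduces the error
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

print(iteration, b0, b1)
```

The estimates can be compared against the direct solution, for example from np.polyfit(x, y, 1).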

How does k-fold cross-validation extend the concept of a hold-out sample?

With k-fold validation there are k hold-out samples and so k cross-validations. Each fold serves once as the hold-out sample while the model is trained on the remaining k-1 folds.
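A minimal sketch with sklearn's KFold, assuming 12 hypothetical samples and k=4, that lists the hold-out sample for each fold:

```python
import numpy as np
from sklearn.model_selection import KFold

# 12 hypothetical samples, identified by their row positions 0 through 11
X = np.arange(12).reshape(-1, 1)

kf = KFold(n_splits=4, shuffle=True, random_state=1)

held_out = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Each fold is held out once; the model would train on the other rows
    held_out.extend(test_idx.tolist())
    print(f"fold {fold}: train on {len(train_idx)} rows, hold out rows {test_idx.tolist()}")
```

Across the four folds, every row is held out exactly once.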

Write the Python expression for referring to variables x1, x2 and x3 in the df data frame

Most general: df.loc[:, ['x1','x2','x3']]. The shorter form df[['x1','x2','x3']] also works for selecting columns. In both cases the variable names are passed as a list.

What is the distinction between a parameter of a model, and a hyper-parameter? Give an example of each.

A parameter of a model is a value estimated from the data when the model is fit, such as the slope coefficient. A hyper-parameter is a characteristic of the model set by the analyst before fitting, such as the number of clusters.

K-means cluster analysis minimizes the cluster inertia for each cluster. What is cluster inertia?

Cluster inertia is the sum of the squared distances of each point from its assigned centroid.
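A minimal sketch that computes inertia by hand as the sum of squared distances from each point to its assigned centroid, and compares it with the inertia_ value reported by sklearn's KMeans. The data are synthetic, generated with make_blobs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups, for illustration only
X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Centroid assigned to each point, then the sum of squared distances to it
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X - assigned_centroids) ** 2)

print(manual_inertia, km.inertia_)
```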

a. What is the distinction between supervised learning and unsupervised learning? b. How can unsupervised learning be a preliminary step to supervised learning?

a. In supervised machine learning we use a set of x variables to predict a y variable. In unsupervised machine learning, we use only x variables to identify patterns within the data. b. Through unsupervised learning, we can identify relationships among the variables, which can help us determine which variables we may want to use to predict another.

Why is the Pythagorean theorem so important to cluster analysis?

The Pythagorean theorem is the basis of Euclidean distance, which allows us to calculate the distance between two points. Cluster analysis groups points according to those distances.

Express in words the calculation of Euclidean distance between two points calculated over p features. [Write the formula if you wish, but do describe verbally.]

To calculate the Euclidean distance, for each feature take the difference between the two points and square it. Sum the squared differences from the first feature through the pth feature, then take the square root of that sum.
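The verbal description translates directly into a few lines of code. This is a minimal sketch; the function name euclidean_distance and the two example points are made up for illustration.

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the summed squared differences across all p features."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

# Two hypothetical points measured on p = 3 features
# Differences are 3, 4, 0, so the squared differences sum to 9 + 16 + 0 = 25
d = euclidean_distance([1, 2, 3], [4, 6, 3])
print(d)  # prints 5.0
```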

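Describe the statistical procedure to assess the best number of clusters for a given data set.

To find the best number of clusters, fit the model over a range of cluster counts and select a value that provides both a low inertia and a high silhouette score. Because inertia tends to keep decreasing as clusters are added, look for the point where its decline levels off.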
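A minimal sketch of that procedure, assuming synthetic data from make_blobs and a candidate range of 2 to 7 clusters (both are illustrative choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data, for illustration only
X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

results = []
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    # Record inertia (lower is better) and silhouette (higher is better)
    results.append((k, km.inertia_, silhouette_score(X, km.labels_)))

for k, inertia, silhouette in results:
    print(f"k={k}  inertia={inertia:.1f}  silhouette={silhouette:.3f}")
```

The analyst then compares the rows and picks a k where the silhouette is high and the inertia has stopped dropping sharply.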

Why is it important to standardize (or otherwise normalize) the data before pursuing a K-Means cluster analysis?

We need to standardize the data before doing a K-Means cluster analysis because the distance between points is calculated across all features, so the features need to be on the same scale. Otherwise, a feature measured in large units dominates the distance.
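A minimal sketch with sklearn's StandardScaler. The two features (income in dollars, age in years) and their values are hypothetical, chosen so that one feature is on a much larger scale than the other.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: income in dollars, age in years
X = np.array([[30000.0, 25.0],
              [32000.0, 60.0],
              [90000.0, 27.0]])

# Rescale each feature to mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

def distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Raw distances are driven almost entirely by income
print(distance(X[0], X[1]), distance(X[0], X[2]))

# After standardizing, the age difference contributes on the same footing
print(distance(X_std[0], X_std[1]), distance(X_std[0], X_std[2]))
```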

