GSCM 510

Ace your homework & exams now with Quizwiz!

What is the purpose of the variable type category? When should it be used?

A Pandas object refers to a non-numerical object, which is necessarily a categorical variable. But categorical variables can also be integer variables. A category is a newer data type meant specifically to refer to any categorical variables. By default, non-numerical objects are objects, but they can be declared as type category. Same for integer scaled variables that are an integer type, but can, and should, be declared as type category.

What is a cluster centroid and how is it computed?

A cluster centroid is the center of the cluster. Each coordinate is the mean of the corresponding coordinate for each of the samples assigned to the cluster.

What are the two types of cells in a Jupyter notebook? What is the purpose of each?

A code cell is for entering and running Python code. A markdown cell is for documentation.

Distinguish between training data and testing data

A core concept of machine learning is that forecasting efficiency cannot be evaluated on the data on which the model trained, i.e., the data from which the model coefficients are estimated. That evaluation can only occur by observing the errors of applying the model to new data, which is the actual forecasting situation.

Explain the following reference to a data frame named data: data[rows, columns]

A data frame is a 2-D object, rows and columns. Any one data value in a data frame is identified by its row and column coordinates. This notation specifies the name of the data frame and then references data values in one or more rows and columns.

What is a data table and how are the data values organized?

A data table is a rectangular table of the data values subject to analysis. The first row contains the variable names, each other row contains the data for a single unit of analysis, such as a person or company. Each column contains the data values for a single variable.

What is a hold-out sample, and what is its purpose?

A hold-out sample is the testing data, data on which the model was not trained (fit). Its purpose is to evaluate the forecasting efficacy of the model in a true forecasting situation of which the model is "unaware" of the value of y, but the analyst is aware, so can evaluate the error directly.

What is meant by the term homogeneous group in the context of classification?

A homogeneous group consists of samples that are in the same group, such as Male or Female body type.

K-means cluster analysis minimizes the cluster inertia for each cluster. What is cluster inertia?

A key assessment of cluster analysis is how far a point lies from the center of the cluster to which it is assigned. The sum of the squared distances on each data value from the cluster center (square of Euclidean distance) for a point is an index of how far the point is from the center. Cluster inertia is total within-cluster variation, the sum of the individual within-cluster variations for all samples (objects) assigned to a cluster.

What is a leaf from a decision tree? What does it represent?

A leaf is a node at the bottom of the decision tree, a final classification in which no more splits are taken.

What is the distinction between a model parameter and a hyper-parameter? Give an example of each.

A model parameter is a characteristic of a specific model estimated from the training data, such as the slope coefficients of a regression model. A hyper-parameter is a characteristic of any one model that is set by the analyst, but could vary across models, such as the depth of a decision tree.

Why is the logit transformation of best fit more appropriate for binary classification than a straight line of best fit?

A straight line cannot effectively summarize a scatter plot of a target variable that only has two values. Instead of a cloud of points, there are two lines of points across the values of the x-variable. Instead of a straight line, an S-shaped curve of the logit provides a more suitable curve for summarizing the relation between continuous x and binary y.

What is the shape of the visualization of a linear model?

A straight surface. In two dimensions, that is a line. In three dimensions, a cube. And beyond.

What is the only way to know for sure which rescaling is best for the data values of the predictor variables (features) - MinMax, Standardization, or Robust Scaling?

A theme of machine learning is to see what works. Keep testing data separate from training data, and do whatever you want with the data, choosing what ultimately works best. Most machine learning algorithms perform better, that is, more accurate forecasts, when the data have about the same scales. Experience is that often it makes no difference which method is chosen, but one does not know in advance if working with new data. Try them out and see if looking for ways to increase the forecast.

What is a variable transformation for a continuous (numeric, not categorical) variable? In general terms, not specific code, how is a transformation implemented?

A variable transformation defines a new variable, or creates new values for an existing variable, by performing an arithmetic operation on existing variables. To specify in Python, simply specify the arithmetic operation, realizing that variables are identified as part of the data frame in which they exist. For example: d['Salary000'] = d['Salary'] / 1000

What is a linear model? What are its parameters?

A weighted sum of variables plus a constant term.

What is standardization to z-scores?

A z-score indicates how many standard deviations the original value is from the mean of the distribution. The distribution of z-scores has the same shape as the original distribution, but with a different scaling.

What is the accuracy of a binary prediction? When it is of the most interest?

Accuracy is the percent of correct classifications, the number of true positives plus true negatives divided by the total number of samples, including the false classifications. It is of the most interest when the cost of the two misclassifications are the same.

What is the purpose of an indicator (or dummy) variable?

An indicator variable is a numerical representation of a categorical variable. The number of indicator variables formed is equal to the number of levels or categories. A dummy variable is an indicator variable that assumes the value 0 if the category level is not present, and a 1 when present.

How is the least-squares regression model obtained with a gradient descent solution?

An initial, even arbitrary solution for the model parameters is given. Then, to minimize the squared errors across all the rows of data, the parameter values are changed. Then again. Then again, each time getting closer to the smallest possible sum of squared errors. The process stops when changing the parameter estimates results in virtually no change in the sum of squared errors.

What is the problem overfitting presents in evaluating model forecasting performance?

An overfit model fits the training data well, but has poor generalization to actual forecasting, that is, to new data (e.g., the testing data). The good fit to the training data is irrelevant.

Describe two different data visualizations for different groups of data.

Bar chart: Visualize the numerical value associated with each category of a numerical variable, such as the count of each category in a data set. Histogram: Visualize the count or proportion of data values in each bin of similar values for a numerical variable. Scatterplot: Visualize the relationship between two numerical variables by plotting each point defined with coordinates of the two values for the two variables.

What is a package manager? Why do we use a package manager for our Python work, and which package manager do we use?

Base Python by itself makes an excellent programming language. But one does not want to program everything, but use other, developed software, such as for machine learning. This additional software is organized by packages of related functions. A package manager is used to download, install, and update these different packages without much work on the part of the user.

What is supervised machine learning?

Based on a prediction equation, or in more complex cases, a network of inter-related prediction equations, the supervised machine learning forecasts unknown values of a variable of interest based on the values of known variables related to the unknown variable of interest.

Why is binary prediction the process of classification?

Binary prediction is classification into only one of two categories. Distinguish classification into a category from measurement of a quantitative variable. For example, classify someone as Male or Female body type, but measure their height.

A cluster should demonstrate cohesion and separation. a. What are these concepts? b. What fit index simultaneously assesses these two concepts? How is it interpreted?

Cohesion is a measure of how close a point is to its own cluster. Separation is a measure of how far a point is to the nearest cluster. The silhouette index assesses both cohesion and separation. Values close to 1 indicate cohesion and separation. Values of 0 indicate the point is on a cluster boundary. A negative value indicates the point likely does not belong in the cluster.

One potential issue with multiple regression is collinearity. Describe the problem and how it can be addressed.

Collinearity means that predictor variables (features) correlate substantially with each other. Collinearity increases the standard errors of the estimated collinear slope coefficients as the estimation algorithm cannot readily separate their effects (holding the others constant). Informal detection is to inspect the feature correlation matrix for high correlations. More formal is to regress each feature on all the others, to detect which ones can be explained in terms of the other features and perhaps, then, not needed in the model, a feature (de)selection technique.

How can ¥the analyst determine if a model is overfit?

Compare the fit of the model from the training data to the testing data. If there is a big decrease, the model is overfit to the training data.

What makes machine learning a 2nd-decade 21st century technology, as opposed to, say, the 1990's?

Computer power. Having massively more computer power allows more intensive analyses of algorithms that existed for decades, such as applying hyper-parameter tuning to multiple regression. Further, new estimation algorithms have been developed, such as random forest, that are only possible with much computer power. (Cheap laptops today are more powerful in terms of numerical crunch power than the fastest super-computers just 15 years ago.)

How is the concept of data aggregation related to a pivot table?

Data aggregation summarizes the values of a numerical variable with descriptive statistics across different groups defined by levels of one or more categorical variables. The pivot table displays these summary statistics for the different groups.

What is data wrangling? Why is it important?

Data almost never arrives ready for analysis. Many issues need to be addressed, such as inconsistent coding of responses, missing data, and superfluous variables. Even with those issues settled, the data usually needs pre-processing, including standardization (or similar) and conversion of categorical variables to indicator variables.

Meaning and interpretation of the hypothesis test of the slope coefficient.

Each predictor variable in the model is associated with a slope-coefficient. Each test evaluates the hypothesis of no relationship between predictor and target variables. The null hypothesis is that the population slope coefficient that relates x to y is 0, beta=0. The meaning of the slope coefficient is that as x increases by one unit, y, on average, changes by the value of beta.

Meaning and interpretation of the confidence interval of the slope coefficient.

Each predictor variable in the model is associated with a slope-coefficient. The slope-coefficient specifies the average change in y for a unit increase in the change in the predictor variable. For each estimated slope coefficient, the sample slope coefficient, b, there is a corresponding population value beta. If zero is in the confidence interval, then the analysis is unable to demonstrate a relationship between the predictor variable and the response variable, with all other predictor variables held constant.

Explain the concept of a row name in a data frame. Describe the default row names and the advantage of replacing them with a suitable column from the read data frame.

Each row has a unique identifier. By default, the identifiers are the consecutive integers, starting from zero. However, the data file may contain a unique identifier as one of the already existing columns, such as Name for a data file of employees. In that situation, better to replace the default integers with the more meaningful names.

For a given set of customers, almost all weigh between 110 and 300 lbs. One customer, an outlier, reports a weight of 460 lbs. What is the basis for dropping the customer from the analysis? How does such an action change the reported results?

Either the data value is mis-entered, or, if correct, any generalizations would not properly apply to these people, and perhaps bias the model for the vast majority of customers. In practice, experiment with different deletion thresholds as perhaps a better model can be constructed by focusing, for example, on only people with weights between 100 and 300 lbs. With otherwise so much variability, the model may perform more poorly for everyone, instead of better performance for the vast majority of customers who weigh between 100 and 300 lbs.

When classifying customers to identify those most likely to churn (exist as a customer), which of the three-classification metrics is the most useful: accuracy, recall, or precision? Why?

Errors cost money. There are two kinds of errors in the classification problem: false positives and false negatives. We cannot make a final assessment As to the appropriate fit index until we know the cost of each of these two errors to the business. The analysis of customer churn is primarily concerned with not losing existing customers. Unless the amount ofresources dedicated to those customers predicted as likely to leave is not excessive, better to avoid the false positives. That is, OK to have some customers predicted to churn who do not, than miss those who do churn. As such, the most relevant fit index may be sensitivity (recall), but then again cannot say for sure until the costs of the errors are known.

What is model validation and what is the problem training data to validate a model?

Every data set sampled from a population differs from any other data set sampled from the same population. Every sample reflects the underlying population values, but every corresponding sample value, such as the mean, does not equal the corresponding population value. So fitting sample data from which the model trained fits random sampling error as well as true, stable population characteristics. A model can fit training data perfectly, but have no useful ability to forecast on new (testing) data.

Why can a model not be properly evaluated on its training data?

Every data set sampled from a population differs from any other data set sampled from the same population. Every sample reflects the underlying population values, but every corresponding sample value, such as the mean, does not equal the corresponding population value. So fitting sample data from which the model trained fits random sampling error as well as true, stable population characteristics. A model can fit training data perfectly, but have no useful ability to forecast on new (testing) data.

When predicting a binary outcome, what are the two ways to be wrong? Define your terms.

False Negative, when the model predicts the sample in a group is not in the group. False Positive, when the model predicts the sample is in the group, and it is not.

What does it mean to filter rows by data values?

Filtering subsets a data frame by rows, selecting only those samples that satisfy some logical criterion, such as Gender == 'F', which reduces a data frame down to only those rows of data marked with F as the gender.

What are quartiles of a distribution and how are they computed?

For a given variable, the first quartile (Q1) is the middle number between the smallest number and the median of the data set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the middle value between the median and the highest value of variable.

Z Scores: What is their range if the distribution is normal.

For a normal distribution, little more than 95% of the values fall within two standard deviations of the mean.

Why is it important to standardize (or otherwise normalize) the data before pursuing a KMeans cluster analysis? Explain specifically of the importance in terms of K-Means.

For example, length measured in millimeters yields much larger numbers than length measured in meters. Corresponding distances between points across a variety of variables would be much different depending on the unit, which different cluster analysis results obtained. Standardized ensures that the metrics of the variables are about the same. In particular, the same distribution of standard scores are obtained regardless if the length of something is measured in millimeters, meters, inches, feet, or whatever.

What is y^? What are the two primary situations in which it is applied?

Given a regression model, 𝑦" is the value calculated from the values of the predictor variables. If applied to the data from which the model was estimated, 𝑦" is the of y, that is, fit by the model for the associated values of the predictor variables, called the fitted value. If predicting a future event for which the value of y was not available or used to estimate the model, then 𝑦" is called the predicted value or the forecasted value.

What is local optimization (such as regarding the decision tree solution)? What is its primary disadvantage?

Given the initial model configuration, such as the first split in a decision tree analysis, a different final model likely emerges than if another first split had been taken on a different variable. The problem is that the splitting process to move down one more layer of depth is done locally, without any "long-view" of where the process is going. So an optimal first split may have been barely more optimal than if another variable and split had been chosen, but perhaps the second alternative would ultimately had led to a series of splits that achieved a more successful level of classification.

How is hyper-parameter tuning related to fishing? When is OK to do so?

Hyper-parameter tuning is searching for the best parameter setting, such as the number of features in a model, without any real theoretical reason for choosing the value. Instead, use modern computers to grind away at a large range of possibilities, choosing the best. This procedure is OK as long as the searching is done on training data and the testing on completely different data.

What does it mean to say that the median and inter-quartile range are robust to outliers?

If all the values of a distribution remain the same except that the largest value of the distribution changes from 10 to 10,000,000,000, the median and IQR remain unchanged. On the contrary, the mean and standard deviation will be drastically affected.

Model Fit: The standard deviation of the residuals to interpret model fit.

If the residuals are normally distributed as the result of a random process, as they usually are, then +2 and -2 standard deviations on either side of zero contains about 95% of the forecasting errors. How is the standard deviation of the residuals used to interpret model fit?

Meaning of the slope coefficient in y^ = b0 + b1X1

In this regression model with a single predictor (feature), 𝑏! is the slope coefficient, estimated from the data. The slope coefficient determines, on average, how much y changes with a increase of 1 unit in X. When applied to a regression model, this change in y is the average (or expected) change. (If there is more than one predictor variable in the model, then the values of all the other predictor variables are held constant.) With multiple regression, each slope coefficient is interpreted with the values of the remaining predictor variables (features) held constant.

How is it that given a decision tree with enough levels of depth, the model can recover the correct class value (e.g., Gender) with perfect accuracy?

Increasing the complexity of the model ensures better fit on the training data. With a decision tree analysis, the analyst can add enough depth (splits) to the model to correctly classify everything. Of course, such classification will not generalize beyond the training data, so such a model is useless to deploy.

Why is k-fold cross-validation preferred to just splitting the original data into training and testing data, a train/test split?

Instead of just one arbitrary, usually randomly selected hold-out sample, there are k holdout samples. Any one train/test split may result, by chance, in a weird test sample or training sample. With k different such training/test splits the average performance of the model across all the k-folds yields a more stable estimate of the forecasting efficacy, such as with MSE or se.

In machine learning the variable to forecasted or predicted is called the target or the label. When is the term label most appropriate?

Label is most appropriate when forecasting the level or category of a categorical variable. The level is described by a label in the usual English definition of the word.

Discuss with an example: The core of machine learning is pattern recognition.

Life and the world are not random collections of atoms. There are patterns everywhere, which form, for example, the basis of science, and also the basis of human decision making. Many of these patterns are not easily detected by humans. The usefulness of machine learning is to detect these patterns and generalize them to future events, the basis of forecasting.

What does it mean to state that "A model should be made as simple as possible, but not simpler."?

Make the model as complex as it can be to capture the relevant information in the training data to avoid underfitting without so much complexity the model overfits.

What is the purpose and benefit of feature selection (i.e., select predictor variables for a model).

Not all potential features are relevant (correlate with y) and unique (do not correlate with other X's). As such, they contribute little, or even detract, from forecasting efficiency and model interpretability. Particularly for large data sets, they can also add potential machine time for the computations. As such, best to rid the model of irrelevant features.

Define overfitting.

Overfitting is when a model is too complex, where the extra complexity takes advantage of random sampling fluctuations in the training data to increase fit.

What is the relationship of Python and Pandas?

Pandas is a package of functions that add pre-built data analysis capabilities to Python. The primary Pandas data structure is the data frame.

What criteria that a potential feature (predictor variable) should satisfy before added to a model?

Predictor variables (features) should be relevant (correlate with y), and provide unique information (do not correlate with the other X's).

Briefly explain how multiple regression enhances the two primary purposes of regression analysis?

Predictor variables (features) that are relevant (correlate with y), and provide unique information (don't correlate with the other X's), lead to a) better, more accurate prediction, and b) a better understanding of how the variables are related to each other.

Why is Python a good language to use for machine learning?

Python offers a framework for machine learning, so if you run one machine learning analysis with one algorithm, it is easy to re-run the same code with another algorithm just by changing the name of the algorithm.

Why is R-squared called a relative index of fit?

R-sq literally compares the residuals from two models: the specified model, and the null model where the X's are unrelated to y so that the forecast is just the mean of y, which plots as a straiht line.

Meaning of the residual variable e.

Residual variable 𝑒 is the difference between the actual value of y and the estimated value of y, 𝑦". The residual or error represents the influences on the value y not explained or accounted for by the model.

What are some advantages and disadvantages between running Python with the Anaconda distribution on your own computer versus Google Colab? Include an assessment of costs for both alternatives.

Running Python on your computer provides the best security as no one else has access to your files. Plus, no Internet connection is needed to do the work. A negative is that for larger data files a more powerful computer is needed, so more expense. For Colab, access is free with a good amount of available computer power, though ultimately some limits for free access. Moreover, the environment is already set up, ready to go as soon as logged in. Two major disadvantages of Colab is that Google has access to your files, and an Internet connection is needed.

What is the purpose of the sensitivity (recall) metric?

Sensitivity assesses how many samples in the positive group are correctly classified as positive. It is applicable to situations where the concern is of missing something that exists, such as cancer in a medical diagnosis, a terrorist as a passenger on an airplane, or a poor-quality part in a manufacturing scenario.

What is the distinction between a stacked and unstacked bar chart? What do stacked and unstacked bar charts display?

Stacked and unstacked bar charts display the relation between two categorical variables. The values of one categorical variable are listed on one axis with the associated bars, and the associated counts or proportions listed on the other axis. For a stacked bar chart, each bar is divided into the corresponding values of the second categorical variable. For an unstacked bar chart, there is a separate bar for each level of the second categorical variable.

a. What is the distinction between supervised learning and unsupervised learning?

Supervised learning uses features X to forecast target y. There is a correct answer from which to evaluate the success of the analysis. Unsupervised learning uncovers structure without a target by which to aim for.

Identify and briefly explain the two types of supervised machine learning regarding the nature of the target variable.

Supervised machine learning trains a model to predict a target. The two types of supervised machine learning either forecast a continuous variable, such as linear regression, or forecast a classification into a category, such as logistic regression.

What is one statistic for assessing homogeneity? What is its range and how is it interpreted?

The Gini index is a primary statistic for which to evaluate homogeneity of classification. The value of the coefficient ranges from 0 for maximum equality to 1 for maximum inequality with no benefit obtained from the classification system.

Briefly explain the concept and purpose of a Jupyter notebook.

The Jupyter notebook allows the user to interactively program and run analyses with full documentation.

Explain the concept of a current working directory.

The Jupyter notebook needs a starting point for referencing files to read and write. That reference point is the current working directory. All file references are relative to this directory.

Consider the following scatterplot as related to forecasting body type according to Gender. To forecast Gender, would a decision tree algorithm choose the Waist or Shoe feature to make the first split? Why? About where would the split occur (the decision boundary)? Why?

The algorithm would choose the Shoe size feature because a split at about 8 ¼ does a good, though not perfect, job of separating Male and Female body types. There is no split on the Waist feature that attains any decent accuracy of differentiation.

Express in words the calculation of Euclidean distance between two points calculated over p features. [Write the formula if you wish, but do describe verbally.]

The calculation of Euclidean distance is the application of the Pythagorean theorem to the coordinates (data values) of the points. Take the difference of the data values for each feature across the two points, square the difference, sum the squared differences, and then take the square root.

When conducting a machine learning analysis, how can the analyst detect overfitting?

The fit indices will look great on the training data, and much worse on the testing data, the indicator of real-world performance

Graph of X with y vs the graph of X with y^.

The graph of X with 𝑦" is a single line (for a linear function to predict y). The graph of X and y is a scatter plot.

How does a correlation matrix, perhaps in the form of a heat map, facilitate feature selection?

The heat map is a visualization of a correlation matrix. An informal, but useful, feature selection technique is to delete some collinear features. The heat map can not only provide the correlations, but also color codes each according to the magnitude of each correlation. Such color-coding assists in identifying large correlations.

How does a heat map differ from a standard correlation matrix?

The heat map uses colors to indicate the value of the correlation between each pair of a set of two numerical variables. The correlation matrix lists the numerical correlations directly. A modified heat map can show both the colors and the correlations.

Criterion of ordinary least squares regression to obtain the estimated model.

The least squares criterion is the choice of the regression model that minimizes the sum of squared residuals across all the rows of data in the analysis. That is, this estimation process yields values of each 𝑏j such that, as a set, yield the linear function that results in the smallest possible sum of squared residuals,𝑦 − 𝑦".

What are the mean and standard deviation of a distribution of z-scores?

The mean of a distribution of z-scores is 0, with a standard deviation of 1.

How is the inter-quartile range analogous to the standard deviation in terms of both being summary statistics of a distribution of a continuous variable?

The more variable the values of a distribution, the more extreme are the first and third quartiles of the distribution. The IQR is the positive difference between the first and third quartiles. So the larger the variability of the values of a variable, the larger are both the standard deviation as well as the IQR.

What is the purpose of the precision metric?

The purpose of precision is to determine how many samples in the negative group are incorrectly classified into the positive group. How many airline passengers were incorrectly identified initially as terrorists? How many patients were told they have cancer when they did not? How many good parts were incorrectly classified as bad?

What is the relation of a random forest estimator/model to a decision tree?

The random forest is an evaluation of many different decision trees. The algorithm constructs a series of decision trees where each tree is based on a different random sample of (a) the data, with replacement, and (b) the available features, with the final model an average of the different trees.

What is the root node of a decision tree? What does it represent?

The root node is the beginning node, before any classification takes place. Its membership is the number of samples in each group as the analysis begins, such as the number of Men and Women in the analysis.

What is the Pythagorean theorem so important to cluster analysis?

The standard measurement of distance between points, the basis for a cluster analysis where the coordinates of each point are the data values for a single sample, is Euclidean distance. That distance is computed with the application of the Pythagorean theorem to all p dimensions (number of features).

How can unsupervised learning be a preliminary step to supervised learning?

The structure uncovered buy a cluster analysis, for example, can then be used, perhaps with some modification, to define a target for future supervised learning.

What is the target variable in a logit regression analysis? Why?

The target variable is the logarithm of the odds ratio. The reason is to have a target that reflects probability, but varies from negative infinity to positive infinity while effectively summarizing the relationship between a binary target and the features.

Define outliers of a distribution in terms of its inter-quartile range.

The traditional definition is that an outlier is beyond 1.5 IQR's from the first or third quartile of the distribution.

What metric balances sensitivity and precision? How does it accomplish the balance?

The value of the F1 metric lies between the sensitivity and precision values, as their harmonic mean.

When predicting a binary outcome, what are the two ways to be correct? Define your terms.

There are two groups. The two ways to be correct are to correctly classify a sample into its correct group, one called the positive group and the other the negative group, so a true positive or a true negative.

Why is it good practice to investigate multiple initial configurations of centroids when pursuing a K-means cluster analysis?

There is no analytic solution for the optimum cluster solution even given the number of clusters. Instead iteration is used. And iteration begins with an initial, trial solution, usually by more or less randomly seeding the centroids of the initial clusters. Unfortunately, the final solution can depend on the initial configuration because the iterative process cannot guarantee the best optimal solution, only an optimal solution relative to the starting points. Exploring solutions from different starting points can avoid accepting a solution less desirable than another solution obtained from a different starting configuration.

What is the distinction between the Pandas .loc and .iloc methods? What is their purpose?

These two methods subset a data frame, by rows and/or columns. .loc subsets by row name or column name. .iloc subsets by index, i.e., the ordinal position of the row or column, starting with (unfortunately) 0. Write the Python expression for referring to variables x1, x2 and x3 in the df data frame. most general, df.loc[:, 'x1','x2','x3'], or, sometime works, df['x1','x2','x3']

When in the sklearn Python machine learning environment, how similar is the code for doing k-fold validation for leastsquares regression vs. logistic regression? What is the distinction?

This is a huge strength of the sklearn machine learning analysis environment. Simple code changes, such as instantiating another estimation module, can invoke an entirely different estimation algorithm. The analyst can easily test multiple algorithms and choose the best for the data set.

Consider an item on a survey with three possible responses D (disagree) N (neutral) A (agree). What indicator variables would be defined and how are the values of those variables determined?

Three indicators would be defined, one for each potential response: D, N, and A. The values of these variables would be 0's and 1's, with the indicator variables getting a '0' where that response was not present in the initial categorical variable, and a '1' where that response was present in the initial categorical variable.

When breaking data into training/test subsets, when forecasting a categorical variable why do we want the same proportion of people in each group in each subset as in the full data set?

To evaluate how good a model is, we need to compare to forecasting without the model. That forecast is from what is called the null model, which, for logistic regression, is the forecast to the group with the most members. If the proportion of members in the groups change for each random assignment of samples to training and test data, so does the performance of the model.

What are the two goals of supervised machine learning?

Two important goals are accomplished with regression analysis: • Understand the relationship between a predictor variable (feature) and the response variable • Forecast the unknown future values of a response variable Adding relevant predictor variables with new information contributes to both goals. Additional relevant predictor variables contribute to our understanding of the relation between a predictor and response variables with the values of all other predictors held constant, and increase the forecasting accuracy of the model, and via the imposition of statistical control.

Define underfitting and discuss the problem it presents for model development.

Underfitting means the model is too simple to capture all the information in the training data that is not random variation, but reflects stable aspects of the underlying population.

What is a csv file? What are its properties, its primary advantage and its primary disadvantage (compared to an equivalent Excel file)?

What is the distinction between categorical and continuous variables? Provide an example variable of each along with some sample values.

What is the process of an iterative solution for model coefficients?

When a direct algebraic solution is not possible, then the method to compute the estimated parameter values relies upon iteration, the method of gradient descent. Start with a somewhat if not completely random guess as to the parameter values. Then, using calculus, the algorithm moves the parameter values in a direction that further minimizes the error. Keep going until no further error minimization is obtained, or until the maximum number of iterations is exceeded because some estimates never converge. The problem is that a local minimum may have been obtained. With another set of starting values, a better minimum may result.

How does k-fold cross-validation extend the concept of a hold-out sample?

With k-fold validation there are k hold-out samples and so k cross-validations.

A machine learning analyst investigates fit for a decision tree model with 2, 3 and 4 features at depths of 2 and 6, with a 3-fold cross-validation. a. How many distinct models are analyzed? b. How many analyses are performed? c. How many hyper-parameters are investigated?

a. Three features and two depths lead to the analysis of 3x2=6 models._ b. The 3-fold cross-validation subjects each model to three analyses, so 6x3=18 analyses. Each analysis estimates the parameter values of a model c. Hyper parameter is a general characteristic of a model. This grid search investigated two hyper-parameters: tree depth and number of features.

Compare the two primary data visualization Python packages: seaborn and matplotlib.

matplotlib is the Python standard for visualizations. seaborn is a newer more modern alternative that is somewhat easier to use and tends to produce more elegant visualizations.


Related study sets

3. Corporate Social Responsibility and Citizenship

View Set

CEA Chapter 12, CEA - UNIT 10, CEA, CEA Chapter 1, CEA Chapter 2, CEA Chapter 3, CEA Chapter 4, CEA Chapter 5, CEA Chapter 6, CEA Chapter 7, CEA Chapter 8, CEA Chapter 9, CEA Chapter 13, CEA Chapter 11, CEA Chapter 14, CEA Chapter 15, CEA Chapter 16

View Set

Weathering, Erosion and Deposition

View Set

fire science chapter 5 fire behavior

View Set

Biology - Chapter 12: DNA Technology - Quiz

View Set