GSCM 410 Final Study Guide (CH#1-3)
What are the two goals of supervised machine learning?
1) Understand the relationship between a predictor variable (feature) and the response variable 2) Forecast the unknown future values of a response variable
. What are the two types of cells in a Jupyter notebook? What is the purpose of each?
A code cell is for entering and running Python code. A markdown cell is for documentation.
What is a csv file? What are its properties, its primary advantage and its primary disadvantage (compared to an equivalent Excel file)?
A csv file is a comma separated values file, pure text, so readable by virtually every application that can read text. Its primary advantage is near-universal readability. Its primary disadvantage is unlike a worksheet, its columns are not aligned when viewing the file as text, remedied by opening the file in a worksheet app such as Excel
Explain the following reference to a data frame named data: data[rows, columns]
A data frame is a 2-D object, rows and columns. Any one data value in a data frame is identified by its row and column coordinates. This notation specifies the name of the data frame and then references data values in one or more rows and columns.
What is a data table and how are the data values organized?
A data table is a rectangular table of the data values subject to analysis. The first row contains the variable names, each other row contains the data for a single unit of analysis, such as a person or company. Each column contains the data values for a single variable
What is the shape of the visualization of a linear model?
A straight surface. In two dimensions, that is a line. In three dimensions, a cube. And beyond.
What is the only way to know for sure which rescaling is best for the data values of the predictor variables (features) - MinMax, Standardization, or Robust Scaling?
A theme of machine learning is to see what works. Keep testing data separate from training data, and do whatever you want with the data, choosing what ultimately works best. Most machine learning algorithms perform better, that is, more accurate forecasts, when the data have about the same scales. Experience is that often it makes no difference which method is chosen, but one does not know in advance if working with new data. Try them out and see if looking for ways to increase the forecast.
What is a variable transformation for a continuous (numeric, not categorical) variable? In general terms, not specific code, how is a transformation implemented?
A variable transformation defines a new variable, or creates new values for an exsiting variable, by performing an arithmetic operation on existing variables. To specify in Python, simply specify the arithmetic operation, realizing that variables are identified as part of the data frame in which they exist. For example: d['Salary000'] = d['Salary'] / 1000
What is a linear model? What are its parameters?
A weighted sum of variables plus a constant term.
What is standardization to z-scores?
A z-score indicates how many standard deviations the original value is from the mean of the distribution. The distribution of z-scores has the same shape as the original distribution, but with a different scaling.
What is the purpose of an indicator (or dummy) variable?
An indicator variable is a numerical representation of a categorical variable. The number of indicator variables formed is equal to the number of levels or categories. A dummy variable is an indicator variable that assumes the value 0 if the category level is not present, and a 1 when present.
How is the least-squares regression model obtained with a gradient descent solution?
An initial, even arbitrary solution for the model parameters is given. Then, to minimize the squared errors across all the rows of data, the parameter values are changed. Then again. Then again, each time getting closer to the smallest possible sum of squared errors. The process stops when changing the parameter estimates results in virtually no change in the sum of squared errors.
Describe two different data visualizations for different groups of data.
Bar chart: Visualize the numerical value associated with each category of a numerical variable, such as the count of each category in a data set. Histogram: Visualize the count or proportion of data values in each bin of similar values for a numerical variable. Scatterplot: Visualize the relationship between two numerical variables by plotting each point defined with coordinates of the two values for the two variables.
What is a package manager? Why do we use a package manager for our Python work, and which package manager do we use?
Base Python by itself makes an excellent programming language. But one does not want to program everything, but use other, developed software, such as for machine learning. This additional software is organized by packages of related functions. A package manager is used to download, install, and update these different packages without much work on the part of the user
How is the concept of data aggregation related to a pivot table?
Data aggregation summarizes the values of a numerical variable with descriptive statistics across different groups defined by levels of one or more categorical variables. The pivot table displays these summary statistics for the different groups.
What is data wrangling? Why is it important?
Data almost never arrives ready for analysis. Many issues need to be addressed, such as inconsistent coding of responses, missing data, and superfluous variables. Even with those issues settled, the data usually needs pre-processing, including standardization (or similar) and conversion of categorical variables to indicator variables.
Meaning and interpretation of the confidence interval of the slope coefficient.
Each predictor variable in the model is associated with a slope-coefficient. The slope-coefficient specifies the average change in y for a unit increase in the change in the predictor variable.
Explain the concept of a row name in a data frame. Describe the default row names and the advantage of replacing them with a suitable column from the read data frame.
Each row has a unique identifier. By default, the identifiers are the consecutive integers, starting from zero. However, the data file may contain a unique identifier as one of the already existing columns, such as Name for a data file of employees. In that situation, better to replace the default integers with the more meaningful names.
What is model validation and what is the problem training data to validate a model?
Every data set sampled from a population differs from any other data set sampled from the same population. Every sample reflects the underlying population values, but every corresponding sample value, such as the mean, does not equal the corresponding population value. So fitting sample data from which the model trained fits random sampling error as well as true, stable population characteristics. A model can fit training data perfectly, but have no useful ability to forecast on new (testing) data.
What does it mean to filter rows by data values?
Filtering subsets a data frame by rows, selecting only those samples that satisfy some logical criterion, such as Gender == 'F', which reduces a data frame down to only those rows of data marked with F as the gender.
What are quartiles of a distribution and how are they computed?
For a given variable, the first quartile (Q1) is the middle number between the smallest number and the median of the data set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the middle value between the median and the highest value of variable.
What is their range if the distribution is normal
For a normal distribution, little more than 95% of the values fall within two standard deviations of the mean
What does it mean to say that the median and inter-quartile range are robust to outliers?
If all the values of a distribution remain the same except that the largest value of the distribution changes from 10 to 10,000,000,000, the median and IQR remain unchanged. On the contrary, the mean and standard deviation will be drastically affected.
Model Fit: The standard deviation of the residuals to interpret model fit.
If the residuals are normally distributed as the result of a random process, as they usually are, then +2 and -2 standard deviations on either side of zero contains about 95% of the forecasting errors. How is the standard deviation of the residuals used to interpret model fit?
What is the purpose of the variable type category? When should it be used?
In statistics, a categorical variable (also called qualitative variable) is a variable that can take on one of a limited, and usually fixed, number of possible values. (E.g. Blood : AB, A, B, O)
Meaning of the slope coefficient in y^ = b0 + b1X1.
In this regression model with a single predictor (feature), �! is the slope coefficient, estimated from the data. The slope coefficient determines, on average, how much y changes with a increase of 1 unit in X. When applied to a regression model, this change in y is the average (or expected) change. (If there is more than one predictor variable in the model, then the values of all the other predictor variables are held constant.) With multiple regression, each slope coefficient is interpreted with the values of the remaining predictor variables (features) held constant.
Discuss with an example: The core of machine learning is pattern recognition.
Life and the world are not random collections of atoms. There are patterns everywhere, which form, for example, the basis of science, and also the basis of human decision making. Many of these patterns are not easily detected by humans. The usefulness of machine learning is to detect these patterns and generalize them to future events, the basis of forecasting.
What is the relationship of Python and Pandas?
Pandas is a package of functions that add pre-built data analysis capabilities to Python. The primary Pandas data structure is the data frame.
Why is Python a good language to use for machine learning?
Python offers a framework for machine learning, so if you run one machine learning analysis with one algorithm, it is easy to re-run the same code with another algorithm just by changing the name of the algorithm.
. Why is R-squared called a relative index of fit?
R-sq literally compares the residuals from two models: the specified model, and the null model where the X's are unrelated to y so that the forecast is just the mean of y, which plots as a straiht line.
Meaning of the residual variable e.
Residual variable � is the difference between the actual value of y and the estimated value of y, �". The residual or error represents the influences on the value y not explained or accounted for by the model.
What are some advantages and disadvantages between running Python with the Anaconda distribution on your own computer versus Google Colab? Include an assessment of costs for both alternatives.
Running Python on your computer provides the best security as no one else has access to your files. Plus, no Internet connection is needed to do the work. A negative is that for larger data files a more powerful computer is needed, so more expense. For Colab, access is free with a good amount of available computer power, though ultimately some limits for free access. Moreover, the environment is already set up, ready to go as soon as logged in. Two major disadvantages of Colab is that Google has access to your files, and an Internet connection is needed.
What is the distinction between a stacked and unstacked bar chart? What do stacked and unstacked bar charts display?
Stacked and unstacked bar charts display the relation between two categorical variables. The values of one categorical variable are listed on one axis with the associated bars, and the associated counts or proportions listed on the other axis. For a stacked bar chart, each bar is divided into the corresponding values of the second categorical variable. For an unstacked bar chart, there is a separate bar for each level of the second categorical variable.
Briefly explain the concept and purpose of a Jupyter notebook
The Jupyter notebook allows the user to interactively program and run analyses with full documentation.
Explain the concept of a current working directory
The Jupyter notebook needs a starting point for referencing files to read and write. That reference point is the current working directory. All file references are relative to this directory.
Graph of X with y vs the graph of X with y^.
The graph of X with �" is a single line (for a linear function to predict y). The graph of X and y is a scatter plot
How does a heat map differ from a standard correlation matrix?
The heat map uses colors to indicate the value of the correlation between each pair of a set of two numerical variables. The correlation matrix lists the numerical correlations directly. A modified heat map can show both the colors and the correlations.
Criterion of ordinary least squares regression to obtain the estimated model.
The least squares criterion is the choice of the regression model that minimizes the sum of squared residuals across all the rows of data in the analysis. That is, this estimation process yields values of each
What are the mean and standard deviation of a distribution of z-scores?
The mean of a distribution of z-scores is 0, with a standard deviation of 1
How is the inter-quartile range analogous to the standard deviation in terms of both being summary statistics of a distribution of a continuous variable?
The more variable the values of a distribution, the more extreme are the first and third quartiles of the distribution. The IQR is the positive difference between the first and third quartiles. So the larger the variability of the values of a variable, the larger are both the standard deviation as well as the IQR.
Meaning and interpretation of the hypothesis test of the slope coefficient.
The null hypothesis is that the population slope coefficient that relates x to y is 0, beta=0. The meaning of the slope coefficient is that as x increases by one unit, y, on average, changes by the value of beta.
Define outliers of a distribution in terms of its inter-quartile range.
The traditional definition is that an outlier is beyond 1.5 IQR's from the first or third quartile of the distribution.
What is the distinction between categorical and continuous variables? Provide an example variable of each along with some sample values.
The values of categorical variables are non-numeric categories, even if with integer values, and there are relatively few unique values. Continuous variables are always numeric and have many possible values.
What is the distinction between the Pandas .loc and .iloc methods? What is their purpose?
These two methods subset a data frame, by rows and/or columns. .loc subsets by row name or column name. .iloc subsets by index, i.e., the ordinal position of the row or column, starting with (unfortunately) 0.
Consider an item on a survey with three possible responses D (disagree) N (neutral) A (agree). What indicator variables would be defined and how are the values of those variables determined?
Three indicators would be defined, one for each potential response: D, N, and A. The values of these variables would be 0's and 1's, with the indicator variables getting a '0' where that response was not present in the initial categorical variable, and a '1' where that response was present in the initial categorical variable.
What is a variable transformation for a categorical variable? In general terms, not specific code, how is a transformation implemented?
Variable transformation for the continuous (categorical) variables is completed through changing the value from categorical to numerical. This is completed through the 'rename' function which takes one variable "M" and converts it to "Male" using the custom logic function.
Compare the two primary data visualization Python packages: seaborn and matplotlib.
matplotlib is the Python standard for visualizations. seaborn is a newer more modern alternative that is somewhat easier to use and tends to produce more elegant visualizations.
What is supervised machine learning?
supervised machine learning forecasts unknown values of a variable of interest based on the values of known variables related to the unknown variable of interest
What is y^? What are the two primary situations in which it is applied?
y^ is the value calculated from the values of the predictor variables (called the fitted value). y^ is If Used when predicting a future event for which the value of y was not available.