DATA 310 - Final
ECLAT steps
- 1) Set a minimum joint support. - 2) Take all subsets with a higher support than minimum. - 3) Sort by decreasing support.
L1/Lasso regularization
- A regularization method that facilitates variable selection (estimating some coefficients as zero) - This optimization is equivalent to minimizing the sum of the squared residuals with a constraint on the sum of the absolute values of the weights
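A minimal sketch of this idea using scikit-learn's Lasso; the synthetic data and the alpha value are illustrative choices, not from the course:

```python
# Lasso (L1) regression shrinking coefficients of irrelevant features toward zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features actually matter in this toy setup.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1)   # alpha controls the strength of the L1 penalty
model.fit(X, y)
print(model.coef_)         # coefficients for irrelevant features shrink to (near) zero
```

With a large enough alpha the irrelevant coefficients are driven exactly to zero, which is the variable-selection effect described above.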
In Bayesian inference the "likelihood" represents
- How probable is the data (evidence) given that our hypothesis is true
forward propagation
- In neural networks, the process of calculating the subsequent layers of the network. Each layer depends on the calculations done on the layer before it. - The cost function (often referred to as the objective function) uses the output of the neural network obtained via forward propagation.
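A minimal sketch of a forward pass through one hidden layer using NumPy; the layer sizes, random weights, and sigmoid activation are illustrative choices, not a prescribed architecture:

```python
# Forward propagation: each layer's output feeds the next layer's computation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(3,))                        # one observation with 3 features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer: 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # output layer: 1 neuron

a1 = sigmoid(W1 @ x + b1)    # hidden layer uses the input layer's values
y_hat = W2 @ a1 + b2         # network output, which the cost function then uses
print(y_hat)
```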
A system is linear in its parameters when...
- Its coefficients do not appear as exponents - The system can be written as the product of a feature matrix and a coefficient vector
regression trees
- Main idea: things are different for subsets in the data. Nonlinear relationships may exist and we should be able to accommodate them - STOPPING criteria: stop growing the tree when further splits give less than some minimal amount of extra information, or when a node would contain less than 5% of the data - If all the points in the node have the same value for all the independent variables, then stop. Otherwise, search over all binary splits of all variables for the one which will reduce the sum of squared errors S as much as possible.
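A minimal sketch with scikit-learn's DecisionTreeRegressor; the synthetic data is illustrative, and the 5% rule from the notes is mapped onto the min_samples_leaf parameter as one possible translation:

```python
# A regression tree with explicit stopping criteria.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)   # non-linear relationship

tree = DecisionTreeRegressor(
    min_samples_leaf=0.05,      # stop when a node would hold < 5% of the data
    min_impurity_decrease=0.01  # stop when a split barely reduces the error
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```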
Random forests
- Rationale: more resilient to outliers and better for external validity. You can also provide information on how certain or uncertain you are about a result - stop adding trees when additional trees no longer change your results
backward propagation
- The adjustment of the weights is done via backward propagation (backpropagation). - refers to the method of calculating the gradient of the cost function with respect to the neural network parameters
linear regression
- The main point of linear regression is to assume that predictions can be made by using a linear combination of the features - we only need to learn the slope (weights) and the intercept
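A minimal sketch of fitting a line (slope and intercept) with scikit-learn's LinearRegression; the toy data is made up:

```python
# Ordinary least squares fit of a single-feature linear model.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=50)

ols = LinearRegression()
ols.fit(X, y)
print(ols.coef_, ols.intercept_)   # learned slope and intercept
```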
regularization
- The minimization of the sum of squared residuals subject to a constraint on the weights (aka coefficients) - Regularization is achieved by minimizing the average of the squared residuals plus a penalty function whose input is the vector of coefficients.
L2/Ridge/Euclidean regularization
- The regularization with the square of an L2 distance may improve the results compared to OLS when the number of features is higher than the number of observations - The hyperparameter alpha of the ridge model becomes a tuning parameter - This optimization is equivalent to minimizing the sum of the squared residuals with a constraint on the sum of the squared weights
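A minimal sketch of ridge regression in a wide-data setting (more features than observations); the synthetic data and alpha value are illustrative:

```python
# Ridge (L2) regression where OLS would be ill-posed (20 observations, 50 features).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
y = X[:, 0] + rng.normal(scale=0.1, size=20)

ridge = Ridge(alpha=1.0)    # alpha is the tuning hyperparameter
ridge.fit(X, y)
print(ridge.coef_[:5])
```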
What is true about feature scaling?
- There are many different approaches to feature scaling. - Feature scaling is frequently necessary, as some optimization algorithms are based on Euclidean distance calculations, which can be less effective in unscaled cases.
dealing with missing values
- Use .dropna() to drop rows containing at least one NaN entry - Data imputations: imputations can be implemented based on different concepts, such as mean/median values, most frequent values, interpolation, k-nearest neighbors, multivariate imputation by chained equation (MICE)
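A minimal sketch of two of the options above, using pandas and scikit-learn's SimpleImputer on a made-up DataFrame:

```python
# Handling missing values: drop rows with NaN, or impute them.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

dropped = df.dropna()                      # drop rows containing at least one NaN
imputer = SimpleImputer(strategy="mean")   # other strategies: "median", "most_frequent"
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(dropped, imputed, sep="\n")
```

scikit-learn also offers KNNImputer and IterativeImputer for the k-nearest-neighbors and MICE-style imputations mentioned above.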
support
- a measure of how frequently the items appear in the dataset of all transactions - ex. the chance you pick someone that took DATA 310 at random out of a pool of students
support vector regression
- support vector machines are a method in machine learning for both regression and classification problems; SVR is the regression variant - Main idea: map the input features into a higher-dimensional space and then solve the problem there
lift
- a metric that compares the confidence to the expected confidence; you are comparing two probability values. The purpose of the lift is to help you claim an 'if-then' relationship - the improvement in chance when you contrast confidence with support, expressed as a ratio - ex. the probability that someone has taken DATA 310 given they've taken DATA 146, compared to the overall probability of having taken DATA 310
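A minimal sketch computing support, confidence, and lift by hand; the "transactions" (sets of courses each student has taken) are made up for illustration:

```python
# Support, confidence, and lift from a toy list of transactions.
transactions = [
    {"DATA 146", "DATA 310"},
    {"DATA 146", "DATA 310"},
    {"DATA 146"},
    {"DATA 310"},
    {"MATH 211"},
]
n = len(transactions)

support_310 = sum("DATA 310" in t for t in transactions) / n
support_146 = sum("DATA 146" in t for t in transactions) / n
support_both = sum({"DATA 146", "DATA 310"} <= t for t in transactions) / n

confidence = support_both / support_146   # P(310 | 146)
lift = confidence / support_310           # confidence vs. expected confidence
print(support_both, confidence, lift)
```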
monte carlo simulation
- a model used to predict the probability of different outcomes when the intervention of random variables is present - Help explain the impact of risk and uncertainty in prediction and forecasting models
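A minimal sketch of a Monte Carlo simulation: estimating the probability that two dice sum to 7. The experiment and the number of repetitions are illustrative choices:

```python
# Monte Carlo estimate of P(sum of two dice == 7).
import random

def estimate(n_reps=100_000):
    hits = 0
    for _ in range(n_reps):
        if random.randint(1, 6) + random.randint(1, 6) == 7:
            hits += 1
    return hits / n_reps

print(estimate())   # approaches 6/36 ≈ 0.167 as the number of repetitions grows
```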
Support Vector Regression with Slack Variables (SVR with slack variables)
- consists of an algorithm that solves a quadratic optimization problem with constraints - Main idea: the slack variables will accommodate points that are close to the epsilon margins and that may influence the value of the weights - This means that we have at least 2 different hyperparameters in this case: epsilon and C
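A minimal sketch with scikit-learn's SVR showing the two hyperparameters: epsilon sets the no-penalty margin and C scales the penalty on the slack variables. The data is synthetic:

```python
# Support vector regression with epsilon and C.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
svr.fit(X, y)
print(svr.predict([[0.5]]))
```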
ridge model
- consists of learning the weights by minimizing the sum of the squared residuals plus alpha times the sum of the squared weights - This optimization is equivalent to minimizing the sum of the squared residuals with a constraint on the sum of the squared weights
Hard classification
- directly targets the classification decision boundary without producing a probability estimate
List each of the processes you undertake before entering a text value into your corpus
- ensure all words are lower or upper case in the same way (a consistent case) - remove any extra characters or words that are not informative - convert each word to its root (stemming or lemmatization)
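A minimal sketch of these pre-processing steps; it uses NLTK's PorterStemmer and English stopword list, which is one common choice among many (the corpora may need to be downloaded first):

```python
# Text pre-processing: consistent case, strip extra characters, stem to roots.
import re
from nltk.corpus import stopwords          # may require: nltk.download("stopwords")
from nltk.stem import PorterStemmer

text = "The students LOVED studying regression trees!!!"

text = text.lower()                         # 1) make all words the same case
text = re.sub(r"[^a-z\s]", "", text)        # 2) remove uninformative characters
stemmer = PorterStemmer()
words = [stemmer.stem(w) for w in text.split()
         if w not in stopwords.words("english")]   # 3) roots, with stopwords removed
print(words)
```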
Soft classification
- explicitly estimate the class conditional probabilities and then perform classification based on estimated probabilities.
What is an advantage of stochastic gradient descent, as contrasted to traditional gradient descent?
- for computing the vector used to update the weights, it uses a random subset of observations at a time - For determining the optimal weights for the neurons (which means training the network) we have to use an efficient optimization algorithm that is not going to get "stuck" in a sub-optimal local minimum
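A minimal sketch using scikit-learn's SGDRegressor, which updates the weights from shuffled observations rather than the full dataset at once; the data and settings are illustrative:

```python
# Stochastic gradient descent for a linear model.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

sgd = SGDRegressor(max_iter=1000, random_state=0)
sgd.fit(X, y)        # weight updates are computed from individual (shuffled) observations
print(sgd.coef_)
```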
posterior of bayesian inference
- how probable is our hypothesis given the data observations (the specific evidence we collected)
when multiple linear regression (linear models with more features - ols) fails
- if the number of observations is smaller than the number of features
natural language processing
- main idea is to quantify the occurrence of relevant words and, based on the context, to map them into vectors. - That is to say that we want to create mathematically representable quantities from words and text; they will serve as features for data analysis. - One approach is to separate the text data into sentences, and then the sentences can be used to extract (key) words and expressions.
ECLAT algorithm
- main idea: use Transaction Id Sets (tidsets) intersections to compute the support value of a candidate. - only deals with support - Example: Case study (movie recommendation engine): Support(M1,M2) = (# of individuals that liked movie M1 and M2) / (# of individuals)
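A minimal sketch of the ECLAT idea: intersecting transaction-id sets (tidsets) to compute the support of an itemset. The toy tidsets are made up:

```python
# ECLAT-style support computation via tidset intersection.
tidsets = {
    "M1": {1, 2, 3, 5},   # transaction ids in which movie M1 was liked
    "M2": {2, 3, 4},
}
n_transactions = 5

both = tidsets["M1"] & tidsets["M2"]           # tidset intersection
support_m1_m2 = len(both) / n_transactions     # Support(M1, M2)
print(support_m1_m2)
```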
training the weights
- means running an optimization algorithm and determining the values of the weights that minimize an objective function
ordinary least squares (OLS)
- model seeks to minimize the differences between your true and estimated dependent variable - main idea of ols regression: minimize the sum of the squared (errors) residuals, which means that on average we want to make the 'least amount of mistakes' - To determine the line of best fit the goal is to minimize the sum of squared residuals
interval level of measurement
- numerical values where differences make sense in the context of using them but ratios or proportions could be meaningless - ex. temperatures in Celsius or Fahrenheit, dates on the calendar
ratio level of measurement
- numerical values, that are at the interval level of measurement - ratio and proportions make sense in the context of using them - ex. salaries, weights, distances, forces
The biological function of the "Axon" is represented by what element of a neural network?
- output
Quantiles
- represent a set of values for a random variable that divide its frequency distribution into groups, each containing the same fraction of the whole data
confidence
- represents how often the if-then statement is found to be true: the proportion of transactions containing the antecedent that also contain the consequent - ex. the chance you pick someone that took DATA 310, given they're in another course. This is a conditional probability
standardization
- scaling the data so that features can be combined in a functional way and operations can be performed on them - Ex. we cannot directly add meters and feet, thus we need to scale these variables to get them on the same metric - ways to scale: z-scores, quantiles, 0 to 1 scaling
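A minimal sketch of two of the scaling options mentioned above, z-scores and quantiles, using scikit-learn; the column of toy values is made up:

```python
# Standardization via z-scores and quantile scaling.
import numpy as np
from sklearn.preprocessing import StandardScaler, QuantileTransformer

x = np.array([[150.0], [160.0], [170.0], [180.0], [190.0]])   # e.g. heights in cm

z = StandardScaler().fit_transform(x)                      # z-scores: mean 0, std 1
q = QuantileTransformer(n_quantiles=5).fit_transform(x)    # quantiles: roughly uniform on [0, 1]
print(z.ravel(), q.ravel(), sep="\n")
```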
APRIORI property:
- subsets of a frequent itemset must be frequent. If an item is infrequent, all its supersets will be infrequent. - uses prior knowledge of frequent itemset properties - The idea is to search for frequent if-then patterns; the algorithm uses the concepts of support and confidence in order to identify the most frequent (and thus relevant) relationships.
Machine learning
- the science of using data for automating predictions and analyses
main goal of monte carlo simulations
- to solve problems by approximating a probability value via carefully designed simulations
nominal data
- unique identifiers but order does not matter - ex. peoples' names, names of flowers, colors
ordinal data
- unique identifiers, and order matters - ex. letters grades, military or administration ranks
In the context of a Loss function, what is true when contrasting a L1 norm to a L2 norm?
A L1 norm accounts for error in a linear fashion, in which an estimate that is off by 1 is counted as an error of "1", and an estimate that is off by 2 is counted as an error of "2". In contrast, a L2 norm penalizes errors of larger magnitudes at a faster (quadratic) rate than the L1 norm.
histograms of z-scores vs histogram of quantiles
A histogram of z-scores follows the same shape as the distribution of the original data; however, the histogram of the data scaled by quantiles is always (approximately) uniform
Assume that data is standardized by z-scores and a linear regression model is fit with one feature (input) variable. If the residuals are not normally distributed and they seem to be increasing when the predicting variable is decreasing, choose among the following other regression models the ones that may help improve the predictions.
A nonlinear regression algorithm such as Decision Trees or Random Forests.
Gradient Descent
A technique to minimize loss by computing the gradients of loss with respect to the model's parameters, conditioned on training data. Informally, gradient descent iteratively adjusts parameters, gradually finding the best combination of weights and bias to minimize loss.
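A minimal sketch of gradient descent on the mean squared error of a one-feature line; the learning rate and iteration count are illustrative:

```python
# Gradient descent: iteratively adjust the slope and intercept to minimize MSE.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=100)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    y_hat = w * x + b
    grad_w = 2 * np.mean((y_hat - y) * x)   # d(MSE)/dw
    grad_b = 2 * np.mean(y_hat - y)         # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b
print(w, b)   # should approach the true slope 3 and intercept 2
```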
stopword
A word that is removed from the analysis
What are key benefits of classification forests, as contrasted to classification trees?
By running multiple instances of the same model, the risk of overfitting to outliers is reduced
ordinal data
Data in which a larger number indicates a larger amount, but not the absolute amount (e.g., a ranking of the top ten colleges in the U.S.A.)
continuous data
Data in which larger values indicate a larger absolute amount (e.g., the cost of ten different coffee options at the daily grind).
discrete data
Data which represents specific classes without order (e.g., USA and France).
Selecting an appropriate model requires:
Determining if your model is linear or non-linear and determining if your problem is discrete or continuous
concept of principal component analysis refers to
Determining the directions along which we maximize the variance of the input features
Do you agree or disagree with the following statement: In a linear regression model, all features must correlate with the noise in order to obtain a good fit.
disagree
In a simple linear regression the goal is to
Estimate the expected value of the predicted variable
projective methods
Main Common Idea: Preserve important qualitative properties from the original features and summarize them by using fewer dimensions. - primarily used for classification
principal components
Main Idea: we can summarize the contribution of a subset of features based on finding the direction of the biggest variability or variance. Such directions, that summarize most of the variability in the data, are called principal components.
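A minimal sketch with scikit-learn's PCA on made-up 2-D data with correlated features:

```python
# PCA: find the directions of largest variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 0.5 * t + rng.normal(scale=0.1, size=200)])  # correlated features

pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)   # the first component captures most of the variance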
polynomial regression
Main idea: linear combination of different powers of the feature values
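A minimal sketch: polynomial regression fit as a linear model over powers of the feature, using PolynomialFeatures with a made-up quadratic relationship:

```python
# Polynomial regression is still linear in the weights.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 1.0 - 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.2, size=100)

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print(model.coef_, model.intercept_)
```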
elastic net regularization
Main idea: to combine the L2 and L1 regularizations in a convex way
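A minimal sketch with scikit-learn's ElasticNet, where l1_ratio sets the convex mix of the L1 and L2 penalties; the data and settings are illustrative:

```python
# Elastic net: convex combination of L1 and L2 regularization.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5)   # 0.5 = equal mix of L1 and L2
enet.fit(X, y)
print(enet.coef_)
```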
What is one reason simple OLS is not well suited to calculating the probability of discrete cases?
OLS can result in predicted probabilities greater than 1 or less than 0.
In a classification project the following method is most likely to be overfit
Random forest with 100% pure nodes
T/F: If we have one input variable and one output, the process of determining the line of best fit may not require the calculation of the intercept inside the gradient descent algorithm
true
T/F: Linear regression, multiple linear regression, and polynomial regression can be all fit using LinearRegression() from the sklearn.linear_model module in Python.
true
T/F: The slope of the regression line always remains the same if we scale the data by z-scores
false
T/F: polynomial regression is best suited for functional relationships that are non-linear in weights
false
Bayesian inference
The likelihood is how probable the data is given that our hypothesis is true
In the RandomForestClassifier, what does the n_estimators define?
The number of trees in the forest
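A minimal sketch using the built-in iris dataset; the choice of dataset and n_estimators value are illustrative:

```python
# RandomForestClassifier: n_estimators sets the number of trees in the forest.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(len(forest.estimators_))   # 100 individual trees in the ensemble
```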
In Bayesian classification, our goal is to estimate
The probability of a hypothesis given the data.
What is the Posterior Probability in a Bayesian approach to classification?
The probability you are solving for, for each class.
In the context of a loss function, what does a L1 Norm calculate?
The summation of the absolute difference between each true and predicted value.
In SVR, what does the epsilon parameter define?
The width of the zone within which you will not account for errors.
cost function of neural network
We are going to determine (learn) the weights by optimizing a cost function, such as the sum of the squared residuals, or by minimizing the misclassification rate
A good reason for implementing a feature selection technique is:
We want to create a parsimonious model
A good way to summarize the predictions for a classification problem is to look at
confusion matrices
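A minimal sketch with scikit-learn's confusion_matrix on made-up label vectors:

```python
# A confusion matrix summarizing classification predictions.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1]
print(confusion_matrix(y_true, y_pred))   # rows: true classes, columns: predicted classes
```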
T/F: A monte carlo simulation should never include more than 1000 repetitions of the experiment
f
T/F: For scaling data by quantiles we need to compute the z-scores first.
f
The root node of a classification tree is...
the node at the top of the tree that holds all of the observations (and hence all classes) before any splits are made
What is not a step unique to pre-processing text data?
imputation of any missing values
slack variables in support vector regression
taken in absolute value and added (scaled by the hyperparameter C) as a penalty term to the minimization objective
The biological function of the "dendrites" is represented by what element of a neural network?
input info
grid search
involves subdividing the range of the parameters using small increments; the grid search algorithm then chooses the value of the parameters from the resulting finite set of choices within the range.
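A minimal sketch with scikit-learn's GridSearchCV, which combines a finite parameter grid with k-fold cross-validation; the dataset, model, and grid values are illustrative:

```python
# Grid search over a finite set of hyperparameter choices with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```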
Upper Confidence Bound (UCB) reinforcement learning relies on the estimation of what two parameters each step?
the average reward of each option and its confidence interval; at each step the option with the maximum upper confidence bound (average reward plus confidence interval) is selected
The biological function of the "neuron" is represented by what element of a neural network?
node/ neuron/ activation layer
In Python one of the libraries that we can use for generating repeated experiments in Monte Carlo simulations is
random
In Python, for creating a random permutation of an array whose entries are nominal variables we used
random.shuffle
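A minimal sketch; the list of nominal labels is made up:

```python
# random.shuffle permutes a list of nominal values in place.
import random

names = ["rose", "tulip", "daisy", "orchid"]
random.shuffle(names)   # in-place random permutation
print(names)
```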
T/F: A finite sum of squared quantities that depends on some parameters (weights), always has a minimum value
t
t/f: A hard margin SVM is appropriate for data which is not linearly separable.
false, it can ONLY work for data that is completely linearly separable
t/f: A k-fold cross-validation can be used to determine the best choice of hyper-parameters from a finite set of choices.
true
t/f: A tree with no depth restriction (an infinite maximum depth) will always attempt to create only pure nodes.
true
t/f: Classification Tree models can be helpful if your data illustrates non-linearity.
true
t/f: Eclat and Apriori association rule methods both involve the concepts of support, confidence and lift.
false
t/f: Grouping data into quantiles is accomplished by subdividing data into intervals of equal width
f
t/f: In practice we determine the weights for linear regression with the "X_test" data.
false, the weights are determined (trained) with the "X_train" data
t/f: In the case of kernel support vector machines for classification, such as the radial basis function kernel, one or more landmark points are considered by the algorithm
true
t/f: Kernel SVM is only applicable if you have at least 3 independent variables (3 dimensions).
false
t/f: NLP can leverage the same classification algorithms used in other machine learning contexts.
true
t/f: Naive Bayes classifiers are probabilistic.
true
t/f: Regression trees are able to model non-linear systems.
t
t/f: The L1 norm always yields shorter distances compared to the Euclidean norm
false
t/f: The concept of support vector machines is based on logistic regression
f
t/f: The gradient descent method does not need any hyperparameters
false
t/f: You can create an ensemble model by re-fitting the same model many times with randomly selected data.
true
t/f: natural language processing, nlp, can be used for both probabilistic (soft) and discrete (hard) classification problems
true
linear vs non-linear models
the difference is in how the weights enter the model: in a non-linear model the weights appear raised to a power (or otherwise non-linearly), rather than simply multiplying the features
T/F: One can decide that the number of iterations in a Monte Carlo simulation was sufficient by visualizing a Probability-Iteration plot and determining where the probability graph approaches a horizontal line.
true
bag of words
use co-occurrences within context and counts of keywords to make predictions
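A minimal sketch with scikit-learn's CountVectorizer (get_feature_names_out assumes a recent scikit-learn version); the two sentences are made up:

```python
# Bag of words: keyword counts usable as features for prediction.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the class covered regression trees", "random forests extend regression trees"]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
print(vec.get_feature_names_out(), X.toarray(), sep="\n")
```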
What is ensemble learning (such as in Random Forests)?
using multiple models to identify a solution to a problem
main idea of grid search algorithms
we can use algorithms for discrete optimization in order to estimate the value for the hyperparameters and thus complete the model selection process.
if you increase value of min_support for your apriori algorithm what happens?
you get fewer results (frequent itemsets) because you raised the minimum support threshold
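A sketch of this effect, assuming the third-party mlxtend package is installed; the one-hot transaction table is made up:

```python
# Raising min_support leaves fewer frequent itemsets.
import pandas as pd
from mlxtend.frequent_patterns import apriori

df = pd.DataFrame(
    {"DATA 146": [1, 1, 1, 0, 0], "DATA 310": [1, 1, 0, 1, 0], "MATH 211": [0, 0, 0, 0, 1]},
    dtype=bool,
)
print(len(apriori(df, min_support=0.2)))   # more itemsets at a low threshold
print(len(apriori(df, min_support=0.6)))   # fewer itemsets after raising the minimum
```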