AI/ML Final Exam


Support vector machines

(SVM/SVC/SVR) Solves the problem by finding the orientation between the data sets and drawing a line between them, using the nearest points to define a boundary and trying to set the line equidistant between the clusters. The "street" is the area within the dashed lines on either side of the center dividing line produced by the SVM. Hyperparameters control how much we avoid outliers.
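
A minimal sketch (not from the course materials), assuming scikit-learn and toy data, showing how the C hyperparameter controls how strictly margin violations are avoided:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, random_state=42)  # two toy clusters
clf = SVC(kernel="linear", C=1.0)  # smaller C -> wider "street", more margin violations tolerated
clf.fit(X, y)
print(clf.predict(X[:5]))
```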

What are some problems that arise in training machine learning algorithms?

- Drawing conclusions that are too specific to the training set
- Drawing conclusions based on outliers in the data
- Selecting an algorithm that is too simple for the domain

What are 4 problems that arise in machine learning datasets?

1. Training set that is too small
2. Training set that is different than real-life data
3. Training set that is full of errors or empty fields
4. Training set that includes lots of unrelated columns

Multilabel

A classification system that outputs multiple binary tags. Example classes = odd or even, high or low, prime or not prime; output = [1, 0, 1] or, equivalently, [True, False, True]

Feature

A column. "Feature" has several meanings depending on the context, but generally means an attribute plus its value (e.g., "Mileage = 15,000").

Backpropagation (ANN)

A common method of training a neural net in which the initial system output is compared to the desired output, and the system is adjusted until the difference (the error) between the two is minimized. The adjustable coefficients are called weights.

Training set

A data set in which the input and the desired output are both provided to the computer.

Attribute

A data type (e.g., "Mileage")

Multi-layer perceptron (MLP)

A feedforward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of nodes connected as a directed graph between the input and output layers. An MLP uses backpropagation to train the network and is a deep learning method. The layers between the input and output layers are called the "hidden layers".
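
A minimal sketch (not from the course materials), assuming scikit-learn and toy data; hidden_layer_sizes here is an arbitrary illustrative choice:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=42)
# two hidden layers between the input and output layers; trained with backpropagation
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=42)
mlp.fit(X, y)
print(mlp.score(X, y))
```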

Stochastic gradient descent (SGD) classifier

A linear classifier that selects a random (stochastic = random) point from the training set, calculates the slope of the cost function there, and keeps edging down the "mountain" (the gradient) gradually until finding the minimum (slope = 0), which determines the shape of the classifying boundary between the groups. Capable of handling very large datasets efficiently. Deals with training instances independently, one at a time (which also makes it well suited for online learning).
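
A minimal sketch (not from the course materials), assuming scikit-learn and toy data; note partial_fit, which supports the online-learning use case mentioned above:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=42)
clf = SGDClassifier(random_state=42)   # linear classifier trained one instance at a time
clf.fit(X, y)
clf.partial_fit(X[:100], y[:100])      # keep learning from new mini-batches as they arrive
print(clf.score(X, y))
```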

soft margin SVM classification

A more flexible model. The objective is to find a good balance between keeping the street as large as possible and limiting the margin violations (i.e., instances that end up in the middle of the street or even on the wrong side)

hyperparameter

A parameter of a learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training.

Training instance

A row

What problem does this graph illustrate?

Adding additional powers to your inputs can improve modeling, but adding too many can cause overfitting

polynomial model

Adds powers of x to the linear model to make it a nonlinear model, e.g., y = a + bx + cx² (and so on for higher powers of x)

Gradient Descent algorithm

Adjust the coefficients a little bit; if that results in a better model, keep the new coefficients. Bit by bit, tweak the parameters until you hit a gradient of zero, which is the minimum. Includes three different ways to determine the "best direction to move": batch, stochastic, and mini-batch
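
A minimal sketch of batch gradient descent for a linear model (assumed toy data, not the course's example), using numpy:

```python
import numpy as np

m = 100
X = np.random.rand(m, 1)
y = 4 + 3 * X + 0.1 * np.random.randn(m, 1)   # noisy line with intercept 4, slope 3
X_b = np.c_[np.ones((m, 1)), X]               # add a bias (intercept) column

eta = 0.1                                     # learning rate
theta = np.random.randn(2, 1)                 # random initial coefficients
for _ in range(1000):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)  # slope of the MSE cost
    theta -= eta * gradients                       # small step "down the mountain"
print(theta)                                       # approaches [[4], [3]]
```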

random forest

An algorithm used for regression or classification that uses a collection of tree data structures; the trees "vote" on the best prediction. Decision trees segment the state space, and the bagging method creates the training subsets. "Random" because nodes can be split on a random subset of the training features or at a random threshold on a feature
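
A minimal sketch (not from the course materials), assuming scikit-learn and toy data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=42)
# an ensemble of decision trees trained on bootstrapped subsets; the trees "vote"
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)
print(forest.predict(X[:3]))
```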

You are building a system to identify which photos contain a surface-to-air missile launcher. In the context of machine learning, which category of algorithms is most likely of interest to you?

Anomaly detection

If the following (massive steps across the curve, fast) illustrated your gradient descent algorithm, what might you try doing to improve it?

Decrease the initial "learning rate" hyperparameter (eta0 in SGDRegressor) or possibly decrease the learning rate as the model trains

orthogonality

Discussed in terms of circumplex models, orthogonality specifies that traits that are perpendicular to each other on the model (at 90 degrees of separation, or at right angles to each other) are unrelated to each other. In general, the term "orthogonal" is used to describe a zero correlation between traits or features. When the data's structure does not line up with the feature axes, it may be much more natural to use a linear regressor instead of a decision tree

Confusion Matrix

Makes it easy to see whether the system is commonly mislabelling one class as another by tabulating the true positives, true negatives, false positives, and false negatives.
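
A minimal sketch (not from the course materials), assuming scikit-learn and made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
# rows = actual class, columns = predicted class
print(confusion_matrix(y_true, y_pred))
```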

Closed form vs stochastic

Finding coefficients: closed-form approaches have a known solution equation in which the coefficients are calculated by plugging in the inputs and outputs, whereas in a stochastic approach, coefficients are tried starting from random values and adjusted until the ones with the lowest error are found

Clustering

Grouping similar instances together into clusters. This is a great tool for data analysis, customer segmentation, recommender systems, search engines, image segmentation, semi-supervised learning, dimensionality reduction, and more

Recall

How many of the actual positive instances are correctly identified: dividing the true positives by all actual positives. Recall = TP / (TP + FN). Ignores false positives and true negatives. Also called sensitivity or true positive rate

Precision

How many times a positive prediction is correct out of all positive predictions: dividing the true positive predictions by the total positive predictions (true and false positives). Ignores true negatives and false negatives. Precision = TP / (TP + FP)
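
A minimal sketch (not from the course materials), assuming scikit-learn and made-up labels, computing both of the metrics defined above:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
```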

When do you want a higher precision than recall?

If you want higher precision, that means you'd prefer that your model give you more false negatives, i.e., one that mistakenly filters out some positives. Example: if you trained a classifier to detect videos that are safe for kids, you would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision)

When do you want a higher recall than precision?

If you want higher recall, that means you'd rather your model give you more false positives, i.e., one that identifies more instances as positive than there really are. Example: you train a classifier to detect shoplifters in surveillance images; it is probably fine if your classifier has only 30% precision as long as it has 99% recall (sure, the security guards will get a few false alerts (FP), but almost all shoplifters will get caught (TP))

If the following (tiny steps down the curve, taking a long time) illustrated your gradient descent algorithm, what might you try doing to improve it?

Increase the "learning rate" hyperparameter (eta0 in SGDRegressor)

Label

Input data assigned or "tagged" with a label; the expected output of a classifier. Ex: Positive or Negative, Daisy or Tulip, Win or Loss

Mini-Batch Gradient Descent

Instead of using all m examples as in batch gradient descent, and instead of using only 1 example as in stochastic gradient descent, we use some in-between number of examples b: take a random set of points and calculate the slope from them

What relationship exists between precision and recall?

Inverse

Gini impurity

A way to measure error at a decision-tree node: the higher the number, the higher the error (impurity)

Linear model

Model that describes a straight line without a change in slope (x to no power)

Decision trees

Models that draw decision boundaries based on different parameters, narrowing down in specificity in subsequent layers. Can perform both classification and regression

Multioutput

Multioutput-multiclass classification is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values). Example: a classifier that outputs an array of pixel intensities produces one label per pixel, and each label can take any pixel-intensity value from 0 to 255

What are 4 advantages of decision trees?

No scaling or normalization of the input features is required

Accuracy

Number of correct predictions out of the total number of predictions made (A = (TP + TN) / Total). Generally not the preferred performance measure for classifiers: if only 10% of the images are 5s, you can always guess that an image is not a 5 and you will be right about 90% of the time (accuracy = 90%)

Batch learning

Offline learning. The system must be trained using all the available data. This generally takes a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. It is incapable of learning incrementally.

Incremental learning

Online learning. You train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives. The model improves over time. Example data: stock prices

StratifiedKFold classification

Performs stratified sampling to produce folds that contain a representative ratio of each class. At each iteration the code creates a clone of the classifier, trains that clone on the training folds, and makes predictions on the test fold. Then it counts the number of correct predictions and outputs the ratio of correct predictions.
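
A minimal sketch of that loop (not the course's exact code), assuming scikit-learn, an SGDClassifier, and toy data:

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, random_state=42)
clf = SGDClassifier(random_state=42)

skf = StratifiedKFold(n_splits=3)
for train_idx, test_idx in skf.split(X, y):
    clone_clf = clone(clf)                     # fresh clone for each fold
    clone_clf.fit(X[train_idx], y[train_idx])  # train on the training folds
    preds = clone_clf.predict(X[test_idx])     # predict on the held-out fold
    print((preds == y[test_idx]).mean())       # ratio of correct predictions
```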

Classification

Predicting classes: the process of grouping samples into categories based on their similarities

Regression

Predicting values: the process of assigning samples a value along a curve of possible values

Stochastic solution

Randomly plugging in coefficients and searching down the mountain for the least error (the lowest minimum). The size of the steps is controlled by the learning rate parameter. Stochastic methods involve some form of randomness to identify a solution because the data set is so large: picking points at random and stepping until you get closer and closer (less and less error)

You are building a system to estimate real estate values. In the context of machine learning, which category of algorithms is most likely of interest to you?

Regression

Stochastic Gradient Descent

Selects a random point from the training set and calculates its slope

K-fold cross validation

Splitting the training set into K-folds (K = number of folds), then making predictions and evaluating them on each fold using a model trained on the remaining folds
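
A minimal sketch (not from the course materials), assuming scikit-learn and toy data; cross_val_score does the fold splitting, training, and scoring in one call:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=42)
scores = cross_val_score(SGDClassifier(random_state=42), X, y, cv=3, scoring="accuracy")
print(scores)   # one accuracy score per fold
```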

How can we avoid these two pitfalls (local minimum not global minimum; plateau) when using gradient descent techniques?

Start with a large learning rate that decreases over time

hard margin SVM classification

Strictly impose that all instances be off the street and on the right side (no samples between the dashed lines on either side of the delimiter)

Cost function

Term for a measure of how bad your model is: the lower the number, the better. Ex: a cost function that measures the distance between the linear model's predictions and the training examples; the smaller the distance, the better

Utility function

Term for a measure of how good your model is: the higher the number the better

Difference between a training set and a test set

The "training" data set are the samples used to create the model and fine tune the hyper parameters, while the "test" set is used to qualify and unbiasedly assess the model's performance

Normal equation

The closed-form solution to a linear regression model: θ = (XᵀX)⁻¹ Xᵀ y, where θ is the parameter vector that minimizes the cost function and y is the vector of target values containing y⁽¹⁾ to y⁽ᵐ⁾
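
A minimal numpy sketch of the normal equation on assumed toy data (not the course's example):

```python
import numpy as np

m = 100
X = np.random.rand(m, 1)
y = 4 + 3 * X + 0.1 * np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]                  # add a bias (intercept) column
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y   # θ = (XᵀX)⁻¹ Xᵀ y
print(theta)                                     # approaches [[4], [3]]
```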

Confusion matrix: actual class vs. predicted class
                Predicted NOT 5   Predicted 5
Actual NOT 5    TN: 53,057        FP: 1,522
Actual 5        FN: 1,325         TP: 4,096

The first row of this matrix considers images of non-5s (the negative class): 53,057 of them were correctly classified as non-5s (true negatives), while the remaining 1,522 were wrongly classified as 5s (false positives). The second row considers the images of 5s (the positive class): 1,325 were wrongly classified as non-5s (false negatives), while the remaining 4,096 were correctly classified as 5s (true positives).

What issue is illustrated by this diagram? (cost bowl overlaid on x-y axis)

The inputs to the model are on different scales. The solution is to scale the inputs to a range like 0 to 1, or to standardize them based on their mean and standard deviation.
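
A minimal sketch (not from the course materials), assuming scikit-learn and made-up numbers, of standardizing inputs by mean and standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])                   # columns on very different scales
X_scaled = StandardScaler().fit_transform(X)    # each column now has mean 0, std 1
print(X_scaled)
```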

Reinforcement learning

The learning system can observe the environment, select and perform actions, and get rewards in return (or penalties). It must then learn by itself what is the best strategy to get the most reward over time. Strategies then determine actions in each given situation. Ex: "Winning" in a game of chess is a reward.

One-versus-all strategy (OvA)

Training a binary classifier for each class, each of which produces a decision score; the highest decision score "wins". Ex: classify the digit images into 10 classes (from 0 to 9) and train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector, and so on). A sample run through all of them gets its highest score (0.91) from the "1" classifier. Output = 1

Semi-supervised learning

Training data includes partially labeled data (usually a lot of unlabeled data and a little bit of labeled data). Ex: Google Photos identifying people's faces: it clusters the faces into categories (unsupervised) and then labels them as Brenda or Stacey, etc. (supervised)

What are two reasons that closed-form solutions may not be usable in a particular application?

The training set may be too large, in which case a stochastic approach would be faster and lighter. Some models may not even have a closed-form solution (a known equation)

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. Besides the example of spam tagged by humans, what could be the "experience E" in your example of a machine learning system?

Tweet labeled as positive sentiment

Cross-validation

Verifying the results obtained from a validation run by evaluating the model on a different sample that is still drawn from the same population

A perfect classifier's matrix

Would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal (top left to bottom right) and zeros in the off-diagonal FP and FN cells

ensemble

A group of predictors; if you combine multiple models, you'll generally get better results than from just one

activation functions

A sigmoid replaced the step function in determining whether the neuron is on or off: it's a slope instead of a hard line. Training determines where along the slope between 0 and 1 the neuron turns "on" by actually calculating what the weights should be, because we know the inputs and the outputs

Artificial Neural Networks (ANNs)

Computer systems that are intended to mimic human cognitive functioning. Introduced in 1943; surges of interest in the 60s and early 80s; resurgence in the early 2010s with GPUs

Stacking

A parallel ensemble method using one training subset and a blending meta-learner: all the outputs of all the models are blended together to produce one answer. The blending algorithm is learned from the output that we know/expect/want in training, i.e., from the labeled data

diverse ensemble model

Different kinds of models (Bayes, trees, SVM, k-nearest neighbors, etc.) but the same training data. Works best when the models are independent of each other (run in parallel). Can take a majority vote for classification or an average for regression

ANN math

Each node has inputs. The inputs are 0 or 1, but each is multiplied by a weight; the sum of input times weight over all inputs is the sigma, and then the "step" rounds the sigma to either a 1 or a 0

In gradient descent algorithms, what are the Greek letters eta and epsilon used to designate?

eta: the learning rate. epsilon: the tolerance for ignoring the deviation of the correct label (a point) from the current prediction (our line) when it is below a certain threshold

dimensionality

How many inputs (columns, axes) you have, e.g., x1 and x2

Machine Learning is great for (select all that apply):
- Brittle domains in which it would be difficult to write a fixed set of rules
- Complex domains in which a fixed solution is not known
- Dynamic situations in which inputs might change over time
- Large datasets in which relationships are not easily recognizable

map

max leaf nodes

maximum number of leaf nodes you want by the end

Min samples leaf

The minimum number of samples required to make a new leaf

Min samples split

The minimum number of samples a node must have before it can be split (below that, the sample size is too small); larger values give better generalization (avoiding overfitting)

Boosting (AdaBoost algorithm)

Multiple models, where each step fine-tunes the model and training happens sequentially: run the data through the first model, which comes up with boundaries and weighs some samples as more important than others, then keep running through more models to fine-tune

Deep Neural Networks

Networks whose number of hidden layers is greater than two. We can do that now because we have greater computing power than in the 80s

ways to aggregate in ensemble learning

Parallel or sequential (models built on each other); diversity of models (decision trees only, or other kinds as well) or of data (all models working with the same data, or different subsets); hard voting (like the electoral college) or blending (like the Maine/Nebraska electoral split)

bagging

The same learning algorithm with different subsets of the original data, selected at random with replacement. "Bagging" = sampling with replacement (repeats allowed); "pasting" = sampling without replacement (no repeats)
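
A minimal sketch (not from the course materials), assuming scikit-learn and toy data; the bootstrap flag toggles between bagging and pasting:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)
# bootstrap=True -> bagging (sampling with replacement); bootstrap=False -> pasting
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=42)
bag.fit(X, y)
print(bag.score(X, y))
```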

principal component analysis

Say you have x1, x2, x3 (a cube); PCA can reduce them to new axes z1 and z2 (a plane). PCA transforms the data by identifying which directions are most helpful: it looks in multidimensional space and reduces your data down to being suspended between those new axes
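
A minimal sketch (not from the course materials), assuming scikit-learn and random toy data:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 3)        # x1, x2, x3 (a "cube")
pca = PCA(n_components=2)         # reduce to new axes z1, z2 (a plane)
Z = pca.fit_transform(X)
print(Z.shape)                    # (100, 2)
```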

TensorFlow

Scales up scikit-learn-style algorithms to stretch across multiple computers. CPU (central processing unit): about 8 operations at a time, which can all be different operations. GPU (graphical processing unit): 10k+ operations at one time, but it needs to be the same operation on all 10k+ pieces of data. Quantum computing does the same thing, except in the millions

Gradient boosting

A sequential ensemble machine learning technique for regression and classification. Produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees, built stage-wise: the first model's inputs are x1 and x2; the second model's inputs are the errors of the first, so the error becomes the input
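
A minimal sketch (not from the course materials), assuming scikit-learn and toy data; each new tree is fit to the residual errors of the ensemble so far:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, random_state=42)
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)
gbr.fit(X, y)
print(gbr.predict(X[:3]))
```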

manifold projection

Takes in x1-x3 and gives new values z1-z2, based on a known structure of the data (like a spiral function)

Tol

Tolerance: if there's no meaningful decrease in error between iterations, we'll be content to stop searching for the minimum

convolutional neural network

A type of ANN in which layer-1 nodes (inputs) are only connected to "nearby" layer-2 nodes, unlike a conventional network that has ALL inputs connected to ALL nodes on the subsequent layer. Good for image processing/computer vision: pixels near each other are relevant to each other, but not ones farther away in the image (nose-to-nose pixels vs. ear-to-nose pixels)

max features

The maximum number of features considered when looking for the best split (e.g., we could split on only two of the available features)
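
A minimal sketch (not from the course materials), assuming scikit-learn and toy data, showing the tree hyperparameters defined above (max leaf nodes, min samples leaf, min samples split, max features) with arbitrary illustrative values:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)
tree = DecisionTreeClassifier(max_leaf_nodes=16,     # cap on the number of leaves
                              min_samples_leaf=5,    # samples needed to form a leaf
                              min_samples_split=10,  # samples needed before a node may split
                              max_features=2,        # features considered per split
                              random_state=42)
tree.fit(X, y)
print(tree.get_n_leaves())
```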

Dimensionality Reduction

Which/how many axes can I drop and still have a clear classification? Projecting your points onto an additional calculated axis (z)

GridSearchCV

CV = cross-validation. Helps you tune the hyperparameters to the optimal values without plugging them in individually
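
A minimal sketch (not from the course materials), assuming scikit-learn, toy data, and an arbitrary illustrative parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=42)
param_grid = {"n_estimators": [10, 50], "max_leaf_nodes": [8, 16]}
# try every combination in the grid, each evaluated with 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```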

Batch gradient descent

Calculates the slope at the current point (the derivative) based on all the data

Multiclass classifiers

Can distinguish between more than two classes, unlike binary classifiers. Example classes = 1, 2, 3, or 4 with output = [1] Ex: Random Forest classifiers or naive Bayes classifiers

Binary Classifier

Capable of distinguishing between just *two* classes/categories (ex: 5s and not-5s) EX: Support Vector Machine classifiers or Linear classifiers

You are building a system to detect fraudulent financial transactions. In the context of machine learning, which category of algorithms is most likely of interest to you?

Classification

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. Besides classifying email as either spam or ham, what could be an example of a "task T" in a machine learning system?

Classifying a tweet as positive or negative sentiment

You are building a system to recommend movies that you might like. In the context of machine learning, which category of algorithms is most likely of interest to you?

Clustering

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. Besides the example of the ratio of correctly identified spam, what could be the "measure P" in your example of a machine learning system?

Correctly identified positive sentiment tweets to all positive sentiment tweets

What problem does this code solve and how does it solve it?

Creates a new input matrix that includes squared values. Still using a linear regressor (SGDRegressor), we can input both x AND x², so that the resulting prediction is a polynomial
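
A minimal sketch of what such code might look like (assumed, since the original snippet is not reproduced here), using scikit-learn's PolynomialFeatures with an SGDRegressor on toy data:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(100, 1) * 4 - 2
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 0.1 * np.random.randn(100)

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)  # columns [x, x^2]
reg = SGDRegressor(max_iter=1000)
reg.fit(X_poly, y)   # a linear model in x and x^2 is a polynomial in x
print(reg.coef_)
```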

Test set

Data set used to estimate accuracy of final model on unseen data

Anomaly Detection

The objective is to learn what "normal" data looks like, and use this to detect abnormal instances, such as defective items on a production line or a new trend in a time series.

Supervised learning

The training data you feed to the algorithm includes the desired output, called labels

Closed-form solution

There is a mathematical equation that best fits/describes the data, so we're able to take the inputs and outputs and calculate the coefficients directly. Some algorithms have closed-form solutions and some don't. Ex: we have a mathematical solution for the intercept of a line, b = y - mx

Unsupervised learning

There's no labeled data or known output given to the algorithm. Clustering and anomaly detection are good examples. Ex: Netflix has a bunch of people that like the Great British Baking Show, Arthur, and Westworld. These people are clustered together and offered show recommendations based on each other. No label like "white college student Mormon female" is ever input or produced.

One vs One strategy (OvO)

Train a binary classifier for every pair of classes; the class that wins the most duels is the output. Can be faster to train many classifiers on small training sets than to train a few classifiers on large training sets. Ex: classify the digit images into 10 classes (from 0 to 9) by assigning one classifier to distinguish 0s from 1s, another to distinguish 0s from 2s, another for 1s and 2s, and so on. The sample run through each gives 1, 2, 1, etc.; 1 gets the most votes. Output = 1. For N classes, the number of classifiers = N(N - 1)/2
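
A minimal sketch (not from the course materials), assuming scikit-learn's OneVsOneClassifier wrapper and the built-in digits dataset:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier

X, y = load_digits(return_X_y=True)                       # 10 digit classes
ovo = OneVsOneClassifier(SGDClassifier(random_state=42))  # trains one classifier per pair
ovo.fit(X, y)
print(len(ovo.estimators_))                               # 10 * 9 / 2 = 45 classifiers
```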

