Hands-On Machine Learning with Scikit-Learn, Keras, & TensorFlow (Terminologies)
_________ _________ _________ is an unsupervised clustering algorithm which involves creating clusters that have predominant ordering from top to bottom (eg: all files and folders on our hard disk are organized in a hierarchy)
(Unsupervised) Clustering Algorithm: Hierarchical Cluster Analysis (HCA)
_________ _________ _________ partitions data into k distinct clusters based on distance to the centroid of a cluster. _________ _________ is a method from signal processing, with the objective of putting the observations into k clusters in which each observation belongs to a cluster with the nearest mean.
(Unsupervised) Clustering Algorithm: K-Means
_________ _________ _________
(Unsupervised) Dimensionality Reduction: Principal Component Analysis (PCA)
_________ _________ _________
(Unsupervised) Non-Linear Dimensionality Reduction: Kernel Principal Component Analysis (Kernel PCA)
_________ _________ _________ [Chapter 9: pg 266]
Anomaly Detection
________ is an operation that finds the argument that gives the maximum value from a target function. ________ is most commonly used in machine learning for finding the class with the largest predicted probability. ________ can be implemented manually, although the ________() NumPy function is preferred in practice. [Chapter 4: pg 150]
Argmax (operator)
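A minimal sketch of the argmax card above, comparing a manual implementation with NumPy's `argmax`. The probability values and the helper name `manual_argmax` are illustrative assumptions, not taken from the book.

```python
import numpy as np

# Hypothetical predicted probabilities for three classes (illustrative values).
probabilities = np.array([0.1, 0.7, 0.2])

def manual_argmax(values):
    """Return the index of the largest value (the first one wins on ties)."""
    best_index = 0
    for index, value in enumerate(values):
        if value > values[best_index]:
            best_index = index
    return best_index

# np.argmax returns the index of the maximum, here the predicted class.
predicted_class = int(np.argmax(probabilities))
print(predicted_class, manual_argmax(probabilities))
```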
_________ _________ _________ is a machine learning method that uses a set of rules to discover interesting relations between variables in large databases i.e. the transaction database of a store. It identifies frequent associations among variables called association rules that consist of an antecedent (if) and a consequent (then). [Chapter 1: pg 11]
Association Rule Learning
_________ _________ _________ is a type of unsupervised learning technique that checks for the dependency of one data item on another data item and maps accordingly so that it can be more profitable. It tries to find some interesting relations or associations among the variables of dataset. It is based on different rules to discover the interesting relations between variables in the database. [Chapter 1: pg 11]
Association Rule Learning
_________ _________ _________ is a classic algorithm in data mining. It is used for analyzing frequent item sets and relevant association rules. It can operate on databases containing a lot of transactions. Take for example transactions made by customers in a grocery shop. [Chapter 1: pg 11]
Association Rule Learning: Apriori Intuition
_________ _________ _________ [Chapter 1: pg 11]
Association Rule Learning: Eclat
________ ________ ________ is a data type (eg: mileage) [Chapter 1: pg 9]
Attribute
________ is a way to decrease the variance of the prediction model by generating additional data in the training stage. This is produced by random sampling with replacement from the original set. These multi-sets of data are used to train multiple models. Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor. [Chapter 7: pg 191]
Bagging (Ensemble Method)
In ________ ________ all the training data is taken into consideration to take a single step. You take the average of the gradients of all the training examples and then use that mean gradient to update the parameters. So that's just one step of gradient descent in one epoch. ________ ________ is great for convex or relatively smooth error manifolds. [Chapter 4: pg 124]
Batch Gradient Descent
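The card above can be sketched as follows: each step averages the gradient of the MSE over the whole training set and then updates the parameters once. The toy data, learning rate, and epoch count are assumptions chosen so the example converges; they are not values from the book.

```python
# Toy data generated from y = 2x + 1 (illustrative values).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def batch_gradient_descent(xs, ys, learning_rate=0.1, n_epochs=2000):
    """Fit y = w*x + b by minimizing MSE, using every example at each step."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(n_epochs):
        # Prediction errors over the full training set.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Mean gradient of the MSE with respect to w and b.
        grad_w = (2 / m) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / m) * sum(errors)
        # One parameter update per pass over the data.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = batch_gradient_descent(xs, ys)
print(w, b)
```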
________ ________ [Chapter 4: pg 114]
Batch Gradient Descent
________ ________ [Chapter 4: pg 125]
Batch Gradient Descent: Convergence Rate
________ ________ ________ is a type of learning where the system is incapable of learning incrementally; it must be trained using all the available data. First, the system is trained, and then it is launched into production and runs without learning anymore. To adapt to change, a new version of the system must be trained from scratch on the full dataset and then swapped in for the old one. [Chapter 1: pg 15]
Batch Learning
________ ________ [Chapter 4: pg 136]
Bias/Variance Tradeoff
________ ________ is the part of the generalization error due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to under-fit the training data. Reducing a model's complexity increases its ________ and reduces its variance. [Chapter 4: pg 136]
Bias/Variance Tradeoff: Bias
________ ________ is the part that is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (eg: fix the data sources, such as broken sensors, or detect and remove outliers). [Chapter 4: pg 136]
Bias/Variance Tradeoff: Irreducible Error
________ ________ is the part that is due to the model's excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance, and thus to overfit the training data. ________ will increase as a result of increasing a model's complexity. [Chapter 4: pg 136]
Bias/Variance Tradeoff: Variance
________ ________ is a type of supervised learning, a method of machine learning where the categories are predefined, and is used to categorize new probabilistic observations into said categories. When there are only two categories the problem is known as statistical binary classification
Binary Classifier
________ ________ [Chapter 6: pg 181]
Black Box Models
________ is a general ensemble method that creates a strong classifier from a number of weak classifiers. This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. [Chapter 7: pg 191]
Boosting (Ensemble Method)
A ________ ________ (or sub-node) is any node that falls under another node; the node that precedes those ________ ________ is called the parent node. [Chapter 6: pg 177]
Child Node
________ ________ ________ is sometimes called a target, label, or category [Chapter 1: pg 8]
Class
_________ _________ _________ is a task that requires the use of machine learning algorithms that learn how to assign a class label to examples from the problem domain. An easy to understand example is classifying emails as "spam" or "not spam."
Classification
_________ is a common supervised learning task for predicting classes
Classification
________ ________ first splits the training set in two subsets using a single feature and a threshold. The algorithm searches for the pair that produces the purest subsets. Once the algorithm has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets and so on, recursively. The algorithm stops recursing once it reaches the maximum depth, or if it can't find a split that will reduce impurity. [Chapter 6: pg 182]
Classification and Regression Tree Algorithm (CART)
________ ________ just means that if you pick any two points on a curve, the line segment joining them never crosses the curve. This implies that there are no local minima, just one global minimum. It is also a continuous function with a slope that never changes abruptly [Chapter 4: pg 122]
Convex Function
________ ________ is a mechanism utilized in supervised machine learning that returns the error between the predicted outcomes and the actual outcomes. The aim of supervised machine learning is to minimize the overall cost, thus optimizing the correlation of the model to the system that it is attempting to represent.
Cost Function
________ ________ measures how bad your model is
Cost Function
________ ________ is frequently used to measure how well a set of estimated class probabilities match the target classes. [Chapter 4: pg 150]
Cross Entropy
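A small sketch of the cross entropy card above, assuming the common definition as the average negative log of the probability assigned to the true class. The function name and the example probabilities are hypothetical.

```python
import math

def cross_entropy(target_classes, predicted_probabilities):
    """Average negative log-probability that the model assigns to the true class."""
    total = 0.0
    for target, probabilities in zip(target_classes, predicted_probabilities):
        total += -math.log(probabilities[target])
    return total / len(target_classes)

# Two instances whose true classes are 0 and 2 (illustrative values).
targets = [0, 2]
confident = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
uncertain = [[0.4, 0.3, 0.3], [0.3, 0.3, 0.4]]

# Estimates that match the targets more closely give a lower cross entropy.
print(cross_entropy(targets, confident), cross_entropy(targets, uncertain))
```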
________ ________ is a process using many small validation sets. Each model is evaluated once per validation set, after it is trained on the rest of the data. ________ ________ is a very useful technique for assessing the effectiveness of your model, particularly in cases where you need to mitigate overfitting. It is also useful for choosing the hyperparameters of your model, that is, for finding which values result in the lowest validation error.
Cross-Validation
A ________ ________ is a sequence of data processing components. The components typically run asynchronously. Each component pulls in a large amount of data, processes it, and spits out the result in another data store. Then some time later, the next component in the ________ ________ pulls this data and spits out its own output.
Data Pipeline
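________ ________ ________ occurs when you look at the test set before committing to a model and end up selecting a model that fits patterns in the test data. Your estimate of the generalization error from the test set will then be too optimistic, and you will launch a system that does not perform as well as expected.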
Data Snooping Bias
________ ________ is a line (in the case of two features), where all (or most) samples of one class are on one side of that line, and all samples of the other class are on the opposite side of the line. The line separates one class from the other. [Chapter 4: pg 146]
Decision Boundary
________ ________ is the region of a problem space in which the output label of a classifier is ambiguous. If the decision surface is a hyperplane, then the classification problem is linear, and the classes are linearly separable. Decision boundaries are not always clear cut. [Chapter 5: pg 167]
Decision Boundary
________ ________ is a method of scikit-learn classifiers such as SVC and LogisticRegression. It returns a NumPy array in which each element indicates on which side of the hyperplane the corresponding sample in x_test lies, and how far from the hyperplane it is. [Chapter 5: pg 166]
Decision Function
________ ________ is a method of scikit-learn classifiers such as SVC and LogisticRegression. It returns a NumPy array in which each element indicates on which side of the hyperplane the corresponding sample in x_test lies, and how far from the hyperplane it is. It also tells us how confidently the classifier predicts each sample in x_test to be positive (large-magnitude positive value) or negative (large-magnitude negative value).
Decision Function
________ ________ is a value that dichotomizes the result of a quantitative test to a simple binary decision. [Chapter 3: pg 79]
Decision Threshold
________ ________ ________ observes features of an object and trains a model in the structure of a tree to predict data in the future to produce meaningful continuous output. Continuous output means that the output/result is not discrete, i.e., it is not represented just by a discrete, known set of numbers or values.
Decision Tree Regression
________ ________ [Chapter 6: pg 184]
Decision Tree Regressor
________ ________ is a structure that includes a root node, branches, and leaf nodes. ________ ________ is a series of nodes, a directional graph that starts at the base with a single node and extends to the many leaf nodes that represent the categories that the tree can classify. Another way to think of a decision tree is as a flow chart, where the flow starts at the root node and ends with a decision made at the leaves. It is a decision-support tool. It uses a tree-like graph to show the predictions that result from a series of feature-based splits. [Chapter 6: pg 177]
Decision Trees
________ ________ ________
Deep Belief Networks (DBNs)
________ ________ ________ describes the number of independent ways a dynamic system can move without impeding any of the constraints placed upon it. Put simply, it is the number of values in a function that are free to vary.
Degrees of Freedom
_________ _________ _________
Dimensionality Reduction
________ ________ ________ are the new attributes created when a categorical attribute is encoded: if the attribute has n categories, n new attributes are created. These ________ ________ are created with one-hot encoding, and each takes the value 0 or 1, representing the absence or presence of the corresponding category.
Dummy Variables
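A minimal sketch of the dummy variables card above: one new 0/1 attribute is created per category. The helper name `one_hot_encode` and the category values are hypothetical; in practice a library encoder would normally be used instead.

```python
def one_hot_encode(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# A categorical attribute with two categories yields two dummy variables.
categories, rows = one_hot_encode(["INLAND", "NEAR BAY", "INLAND"])
print(categories)
print(rows)
```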
________ ________ is a remedy for overfitting, which happens when the model chases the loss function too aggressively on the training data by tuning its parameters. We keep another set of data as the validation set, record the loss on the validation data as training goes on, and stop when there is no further improvement on the validation set rather than running through all the epochs. [Chapter 4: pg 142]
Early Stopping
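The stopping rule described above can be sketched on a recorded list of validation losses. The `patience` parameter (how many epochs without improvement to tolerate) and the loss values are assumptions added for illustration.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return (best_epoch, best_loss), stopping once the validation loss
    has not improved for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            # Validation loss improved: remember this epoch.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training early.
            break
    return best_epoch, best_loss

# Illustrative losses: the later improvement at epoch 5 is never reached.
best_epoch, best_loss = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.5])
print(best_epoch, best_loss)
```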
________ ________ is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data.
Feature Learning (or Representation Learning)
________ ________ is a technique to standardize the independent features present in the data to a fixed range. It is performed during data pre-processing to handle highly varying magnitudes, values, or units. If ________ ________ is not done, a machine learning algorithm tends to give more weight to features with larger values and less weight to features with smaller values, regardless of the units of those values.
Feature Scaling
________ ________ ________ is the selection of the most useful features to train on among existing features
Feature Selection
________ ________ ________ has several meanings depending on its context. [Chapter 1: pg 9]
Features
________ ________ [Chapter 5: pg 163]
Liblinear
________ ________ [Chapter 5: pg 163]
Libsvm
For any given phenomenon, the ________ ________ we include in our equations is meant to represent the tendency of the data to have a distribution centered about a given value that is offset from an origin; in a way, the data is biased towards that offset. [Chapter 4: pg 114]
Linear Regression: Bias Term
________ ________ ________ [Chapter 5: pg 155]
Linear SVM Classification
________ ________ [Chapter 4: pg 121]
Local Minimum
________ ________ [Chapter 4: pg 114]
Logistic Regression
________ ________ ________ is commonly used for classification, as it can output a value that corresponds to the probability of belonging to a given class (eg: 20 percent chance of being spam). [Chapter 1: pg 9]
Logistic Regression
________ ________ is commonly used to estimate the probability that an instance belongs to a particular class. ________ ________ is a supervised learning classification algorithm used to predict the probability of a target variable. It is one of the simplest ML algorithms that can be used for various classification problems such as spam detection, Diabetes prediction, cancer detection etc. [Chapter 4: pg 144]
Logistic Regression
________ ________ ________ [Chapter 5: pg 157]
Margin Violations
________ ________ ________ is preferable to RMSE as a performance measure when there are many outliers (eg: many outlier districts)
Mean Absolute Error
________ ________ ________ is the measure of how much alike two data objects are. A ________ ________ in a data mining or machine learning context is a distance with dimensions representing features of the objects. If the distance is small, the objects have a high degree of ________. [Chapter 1: pg 18]
Measure of Similarity
________ ________ states that if a kernel function K is symmetric, continuous and leads to a positive semi-definite matrix P then there exists a function ϕ that maps x_i and x_j into another space (possibly with much higher dimensions) such that K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j). [Chapter 5: pg 173]
Mercer's Theorem
________ ________ is a technique that rescales the values of a feature so that they range from 0 to 1.
Min-Max Scaling
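A minimal sketch of the rescaling above, using the usual formula (value - min) / (max - min). The function name and the handling of a constant feature are assumptions.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # Assumption: a constant feature is mapped to all zeros.
        return [0.0 for _ in values]
    return [(value - lowest) / span for value in values]

scaled = min_max_scale([10, 20, 30])
print(scaled)
```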
________ ________ [Chapter 4: pg 114]
Mini-Batch Gradient Descent
________ ________ computes the gradients on small random sets of instances called 'mini-batches'. The main advantage of Mini-Batch Gradient Descent over Stochastic Gradient Descent is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs. [Chapter 4: pg 129]
Mini-Batch Gradient Descent
________ ________ ________ is a type of learning where the system generalizes from a set of examples to build a model of these examples, then uses that model to make predictions. [Chapter 1: pg 19]
Model-Based Learning
________ ________ ________ is an approach to machine learning where all the assumptions about the problem domain are made explicit in the form of a model. [Chapter 1: pg 19]
Model-Based Learning
A ________ ________ is a configuration variable that is internal to the model and whose value can be estimated from the given data. ________ ________ are required by the model when making predictions, and their values define the skill of the model on your problem. They customize the model's function and are exactly what the machine learns from the training data: given some training data, the ________ ________ are fitted automatically. [Chapter 1: pg 20]
Model-Parameters
________ ________ ________ is the process of selecting one final machine learning model from among a collection of candidate machine learning models for a training dataset. [Chapter 1: pg 20]
Model-Selection
________ ________ ________ (eg: trying to predict multiple values per district)
Multi-variate Regression Problem
________ ________ are classifiers that can distinguish between more than two classes. In machine learning, ________ or multinomial classification is the problem of classifying instances into one of three or more classes. [Chapter 3: pg 102]
Multiclass Classification
________ ________ ________ is a strategy that splits a multi-class classification into one binary classification problem per class. Also can be called 'one-versus-the-rest'. [Chapter 3: pg 102]
Multiclass Classification: One Versus All (OvA) Strategy
________ ________ ________ is a strategy that splits a multi-class classification into one binary classification problem for each pair of classes. [Chapter 3: pg 102]
Multiclass Classification: One Versus One (OvO) Strategy
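To make the difference between the two strategies concrete, the sketch below enumerates the binary problems each one creates. For N classes, one-versus-all needs N binary classifiers and one-versus-one needs N × (N − 1) / 2. The function names are hypothetical.

```python
from itertools import combinations

def one_versus_all_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(label, "rest") for label in classes]

def one_versus_one_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

digits = list(range(10))
ova_tasks = one_versus_all_tasks(digits)
ovo_tasks = one_versus_one_tasks(digits)
print(len(ova_tasks), len(ovo_tasks))
```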
In ________ ________, the training set is composed of instances each associated with a set of labels, and the task is to predict the label sets of unseen instances through analyzing training instances with known label sets. Difference between multi-class classification & multi-label classification is that in multi-class problems the classes are mutually exclusive, whereas for multi-label problems each label represents a different classification task, but the tasks are somehow related. For example, multi-class classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time. Whereas, an instance of multi-label classification can be that a text might be about any of religion, politics, finance or education at the same time or none of these. [Chapter 3: pg 108]
Multilabel Classification
________ ________ is simply a generalization of multi-label classification where each label can be multi-class. ________ ________ (also known as multitask classification) is a classification task which labels each sample with a set of non-binary properties. Both the number of properties and the number of classes per property are greater than 2. [Chapter 3: pg 109]
Multioutput Classification
________ ________ ________ is a type of problem where the system will use multiple features to make a prediction (eg: population, median income, housing prices per district in a regression problem)
Multiple Regression Problem
________ ________ is a prediction that was correct and the data point belongs to the positive class.
True Positive
________ ________ ________ occurs when your model is too simple to learn the underlying structure of the data. [Chapter 1: pg 30]
Underfitting
________ ________ ________ is a type of problem where you are trying to predict a single value per instance (eg: predicting a single value for each housing district in a regression problem)
Univariate Regression Problem
________ ________ ________ is a form of learning where the training data is unlabeled. The system tries to learn without a teacher. ________ ________ ________ algorithms include Clustering, Anomaly/Novelty Detection, Visualization, Dimensionality Reduction, and Association Rule Learning. [Chapter 1: pg 10]
Unsupervised Learning
_________ _________ _________
Unsupervised, Non-Linear Dimensionality Reduction Method: t-distributed Stochastic Neighbor Embedding (t-SNE)
________ ________ measures how good your model is.
Utility Function (Fitness Function)
________ ________ is the sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
Validation Set
_________ _________ _________
Visualization
________ ________ [Chapter 6: pg 181]
White Box Model
_________ is the Matplotlib parameter used to select a colormap; Matplotlib ships with many built-in colormaps.
cmap
_________ _________ _________ works on the principle of the decision tree algorithm. It isolates the outliers by randomly selecting a feature from the given set of features and then randomly selecting a split value between the maximum and minimum values of the selected feature. This random partitioning of features will produce smaller paths in trees for the anomalous data values and distinguish them from the normal set of the data.
(Unsupervised) Anomaly Detection: Isolation Forest
_________ _________ _________ are a special case of support vector machine. First, data is modeled and the algorithm is trained. Then when new data are encountered their position relative to the "normal" data (or inliers) from training can be used to determine whether it is "out of class" or not — in other words, whether it is unusual or not. Because they can be trained with unlabelled data they are an example of unsupervised machine learning. _________ _________ _________ is a variation of the SVM that can be used in an unsupervised setting for anomaly detection.
(Unsupervised) Anomaly Detection: One-Class SVM (Support Vector Machines)
_________ _________ _________ is a density-based clustering algorithm; the name stands for density-based spatial clustering of applications with noise. It is able to find arbitrarily shaped clusters and clusters with noise (i.e. outliers). The main idea behind DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster. There are two key parameters of DBSCAN: eps and minPts. Based on these two parameters, each point is classified as a core point, a border point, or an outlier. [Chapter 9: pg ]
(Unsupervised) Clustering Algorithm: DBSCAN
_________ _________ _________ is a point that is reachable from a core point but has fewer than minPts points within its surrounding area of radius eps. [Chapter 9: pg ]
(Unsupervised) Clustering Algorithm: DBSCAN: Border Point
_________ _________ _________ is a point that has at least minPts points (including the point itself) in its surrounding area of radius eps. [Chapter 9: pg ]
(Unsupervised) Clustering Algorithm: DBSCAN: Core Point
_________ _________ _________ is a point that is not a core point and is not reachable from any core point.
(Unsupervised) Clustering Algorithm: DBSCAN: Outlier
_________ _________ _________ is the distance that specifies the neighborhoods. Two points are considered to be neighbors if the distance between them is less than or equal to eps. [Chapter 9: pg ]
(Unsupervised) Clustering Algorithm: DBSCAN: eps
_________ _________ _________ is the minimum number of data points to define a cluster. [Chapter 9: pg ]
(Unsupervised) Clustering Algorithm: DBSCAN: minPts
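The DBSCAN cards above can be tied together with a short sketch that labels each point as core, border, or outlier from eps and minPts. This covers only the point-labeling step, not the full clustering algorithm; "reachable" is simplified here to "within eps of a core point", and the function name and sample points are hypothetical.

```python
import math

def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'outlier'."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, other in enumerate(points)
                if math.dist(points[i], other) <= eps]

    # A core point has at least min_pts points in its eps-neighborhood.
    core = {i for i in range(len(points)) if len(neighbors(i)) >= min_pts}

    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors(i)):
            # Not dense enough itself, but within eps of a core point.
            labels.append("border")
        else:
            labels.append("outlier")
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
```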
________ ________ is a mathematical equation that gives the results directly. [Chapter 4: pg 116]
Closed-Form Solution
_________ _________ _________ is a type of analysis and unsupervised machine learning task. It involves automatically discovering natural grouping in data. Unlike supervised learning (like predictive modeling), _________ algorithms only interpret the input data and find natural groups or _________ in feature space. [Chapter 9: pg 238]
Clustering (Cluster Analysis)
________ ________ are 2D arrays with a single column. [Chapter 4: pg 115]
Column Vectors
________ ________ [Chapter 4: pg 119]
Computational Complexity
________ ________ is a measure to quantify the uncertainty in an estimated statistic (like mean of a certain quantity) when the true population parameter is unknown.
Confidence Interval
________ ________ is a way to evaluate the performance of a classifier (the general idea is to count the number of times instances of class A are classified as class B).
Confusion Matrix
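A minimal sketch of the counting idea above, with rows as actual classes and columns as predicted classes. The function name and the label values are illustrative assumptions.

```python
def confusion_matrix(actual, predicted, n_classes):
    """matrix[a][p] counts instances of actual class a predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

actual = [0, 0, 1, 1, 1]
predicted = [0, 1, 1, 1, 0]
matrix = confusion_matrix(actual, predicted, n_classes=2)
print(matrix)
```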
________ ________ is a set of methods designed to identify efficiently and systematically the best solution (the optimal solution) to a problem characterized by a number of potential solutions in the presence of identified constraints. [Chapter 5: pg 158]
Constrained Optimization
________ ________ is an extension of linear regression that adds both L1 (Lasso) and L2 (Ridge) regularization penalties to the loss function during training. [Chapter 4: pg 142]
Elastic Net
________ is a low-dimensional, learned continuous vector representation of discrete variables into which you can translate high-dimensional vectors. Generally, ________ make ML models more efficient and easier to work with, and can be used with other models as well.
Embedding
________ is a group of predictors [Chapter 7: pg 191]
Ensemble
________ is a machine learning model that combines the predictions from two or more models. The models that contribute to the ________, referred to as ________ members, may be the same type or different types and may or may not be trained on the same training data.
Ensemble
________ ________ ________ is a result of building a model on top of many other models.
Ensemble Learning
________ ________ is a technique that uses a group of predictors [Chapter 7: pg 191]
Ensemble Learning
________ ________ is another name for an ensemble learning algorithm [Chapter 7: pg 191]
Ensemble Method
________ ________ [Chapter 6: pg 184]
Entropy Impurity
________ ________ [Chapter 4: pg 127]
Epoch
________ ________ is a prediction that was incorrect because the prediction was negative but the actual value was positive.
False Negatives
________ ________ is a prediction that was incorrect because the prediction was positive but the actual value was negative.
False Positives
_________ _________ _________ is an input variable—the x variable in simple linear regression. A simple machine learning project might use a single feature, while a more sophisticated machine learning project could use millions of features, specified as: x1,x2,...xN
Feature
________ ________ ________ is the process of combining existing features to produce a more useful one (such as with dimensionality reduction algorithms)
Feature Extraction
________ are made when code randomly splits the training set into 10 distinct subsets called ________, then it trains and evaluates the Decision Tree model 10 times, picking a different ________ for evaluation every time and training on the other 9 ________. The result is an array containing the 10 evaluation scores.
Folds
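A sketch of how the folds described above can be formed. The card describes a random split; shuffling is omitted here so the result is deterministic, and the function name and sample count are assumptions.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, held_out_indices) once per fold."""
    indices = list(range(n_samples))
    # Deal the indices into n_folds distinct subsets.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        # Train on every index that is not in the held-out fold.
        train = [i for i in indices if i not in held_out]
        yield train, held_out

# 10 folds: evaluate on a different fold each time, train on the other 9.
splits = list(k_fold_indices(n_samples=20, n_folds=10))
print(len(splits))
```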
________ ________ is a popular kernel function used in various kernelized learning algorithms. In particular, it is commonly used in support vector machine classification. [Chapter 5: pg 162]
Gaussian RBF Kernel
________ is a term used to describe a model's ability to react to new data.
Generalization
________ describes how well a trained model classifies or forecasts unseen data. Training a ________ machine learning model means that, in general, it works for all subsets of unseen data. For example, suppose we train a model to distinguish dogs from cats. If the dog images in the training set cover only two breeds, the model may perform well on those breeds but score poorly when tested on other breeds, possibly classifying an actual dog image from the unseen dataset as a cat. Data diversity is therefore a very important factor in making good predictions.
Generalization
________ ________ is the error rate on new cases and by evaluating your model on the test set, you get an estimate of this error. This value tells you how well your model will perform on instances it has never seen before.
Generalization Error (Out-of-Sample Error)
________ ________ [Chapter 6: pg 177]
Gini Impurity
________ ________ [Chapter 4: pg 121]
Global Minimum
The general idea of ________ ________ is to tweak parameters iteratively in order to minimize the cost function. ________ ________ measures the local gradient of the error function with respect to the parameter vector θ, and it goes in the direction of descending gradient. Once the gradient is zero, you've reached a minimum. Concretely, you start by filling θ with random values (this is called random initialization) and then you improve it gradually, each step attempting to decrease the cost function (eg: MSE) until the algorithm converges to a minimum. [Chapter 4: pg 119]
Gradient Descent
________ ________ is an iterative optimization approach that gradually tweaks the model parameters to minimize the cost function over the training set, eventually converging to the same set of parameters as the closed-form solution. [Chapter 4: pg 114, 119]
Gradient Descent
________ ________ [Chapter 4: pg 125]
Gradient Descent: Tolerance
________ ________ ________ [Chapter 5: pg 156]
Hard Margin Classification
________ is a type of scheme where every classifier votes for a class, and the majority wins. In statistical terms, the predicted target label of the ensemble is the mode of the distribution of individually predicted labels. [Chapter 7: pg 192]
Hard Voting (Majority Voting)
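A minimal sketch of hard voting as described above: the ensemble's prediction is the mode of the individual predictions. The function name and labels are hypothetical, and tie-breaking is left to whichever label was counted first.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the most frequently predicted class label."""
    return Counter(predictions).most_common(1)[0][0]

# Three classifiers vote on one instance; the majority wins.
winner = hard_vote(["cat", "dog", "cat"])
print(winner)
```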
________ ________ is a type of average generally used for numbers that represent a rate or ratio such as the precision and the recall in information retrieval. The harmonic mean can be described as the reciprocal of the arithmetic mean of the reciprocals of the data.
Harmonic Mean
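The definition above (the reciprocal of the arithmetic mean of the reciprocals) can be sketched directly. The precision and recall values are illustrative assumptions.

```python
def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / value for value in values)

precision, recall = 0.5, 1.0
harmonic = harmonic_mean([precision, recall])
arithmetic = (precision + recall) / 2

# The harmonic mean is pulled toward the smaller of the two values.
print(harmonic, arithmetic)
```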
________ ________ [Chapter 5: pg 173]
Hinge Loss
________ ________ is a process where one simply removes part of the training set to evaluate several candidate models and select the best one. After this ________ ________ process, you train the best model on the full training set (including the validation set), and this gives you the final model. Lastly, you evaluate the final model on the test set to get an estimate of the generalization error.
Holdout Validation
________ is a parameter of a learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training.
Hyperparameter
________ ________ ________ is a dictionary where the names are the hyper-parameter arguments to the model and the values are discrete values or a distribution of values to sample in the case of a random search.
Hyperparameter Search Space
________ [Chapter 5: pg 155]
Hyperplanes
________ ________ [Chapter 6: pg 177]
Impurity Measure
________ is the process of running live data points into a machine learning algorithm (or "ML model") to calculate an output such as a single numerical score.
Inference
________ ________ ________ is a type of learning where the system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples (or a subset of them), using a similarity score. [Chapter 1: pg 18]
Instance-Based Learning
________ ________ ________ can decrease the accuracy of the models and make your model learn based on incorrect information
Irrelevant Features
_________ _________ is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Jupyter Notebook
________ ________ SVM algorithms use a set of mathematical functions known as kernels. The function of a kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions, for example linear, polynomial, radial basis function (RBF), and sigmoid. Kernel functions have also been introduced for sequence data, graphs, text, and images, as well as vectors. The most widely used kernel is the RBF, because it has a localized and finite response along the entire x-axis. Kernel functions return the inner product between two points in a suitable feature space, thereby defining a notion of similarity at little computational cost even in very high-dimensional spaces. [Chapter 5: pg 161]
Kernelized SVM
________ ________ makes it possible to get the same result as if you added many polynomial features, even with very high degree polynomials, without actually having to add them. [Chapter 5: pg 160]
Kernel Trick
_________ _________ _________ is the thing we're predicting—the 'y' variable in simple linear regression. The label could be the future price of wheat, the kind of animal shown in a picture, the meaning of an audio clip, or just about anything.
Label
________ ________ is a case where each instance comes with an expected output (eg: a district's median housing price) in a regression problem.
Labeled Training
________ ________ ________ [Chapter 5: pg 156]
Large Margin Classification
________ ________ is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). This particular type of regression is well-suited for models showing high levels of multi-collinearity or when you want to automate certain parts of model selection, like variable selection/parameter elimination. ________ ________ tends to completely eliminate the weights of the least important features. [Chapter 4: pg 136]
Lasso Regression (Least Absolute Shrinkage and Selection Operator Regression)
________ ________ is a part of a decision tree structure that represents and holds a class label. [Chapter 6: pg 177]
Leaf Node
________ ________ are plots of the model's performance on the training set and the validation set as a function of the training set size (or the training iteration). [Chapter 4: pg 133]
Learning Curves
________ ________ ________ is an important parameter of online learning systems: it determines how fast they should adapt to changing data. If you set a higher ________ ________, then your system will rapidly adapt to new data, but it will also tend to forget the old data. Conversely, if you set a lower ________ ________, the system will have more inertia; that is, it will learn more slowly, but will also be less sensitive to noise in the new data or to sequences of non-representative data points. [Chapter 4: pg 120]
Learning Rate
________ ________ is the function that determines the learning rate at each iteration. [Chapter 4: pg 127]
Learning Schedule
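A learning schedule can be as simple as a function of the iteration number. The sketch below shows one common shape, a rate that starts larger and shrinks over time; the constants `t0` and `t1` are illustrative hyperparameters chosen for the example, not values prescribed by this card.

```python
def learning_schedule(t, t0=5.0, t1=50.0):
    """Learning rate at iteration t: starts at t0 / t1 and decays toward 0."""
    return t0 / (t + t1)


initial_rate = learning_schedule(0)    # 0.1
later_rate = learning_schedule(50)     # 0.05
```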
________ ________ [Chapter 5: pg 163]
Levenshtein Distance
_________ _________ _________
Noise
_________ _________ _________
Non-Linear Dimensionality Reduction
_________ _________ _________
Non-Linear Dimensionality Reduction Method: Locally-Linear Embedding (LLE)
________ ________ ________ [Chapter 5: pg 158]
Non-Linear SVM Classification
________ ________ is a type of model where the number of parameters is not determined prior to training. [Chapter 6: pg 184]
Non-Parametric Model
________ ________ ________ is a problem that arises when the training data does not reflect the cases you want to generalize to. If the sample is too small, you will have sampling noise, which is ________ ________ as a result of chance, but even large samples can be ________ if the sampling method is flawed. This is called sampling bias.
Non-Representative Training Data
________ ________ ________ is a type of bias that occurs when people are unwilling or unable to respond to a survey due to a factor that makes them differ greatly from people who respond.
Non-Response Bias
________ ________ is an analytical (closed-form) approach to Linear Regression with a least-squares cost function. It lets us find the value of θ directly, without using Gradient Descent. This approach is effective and time-saving when you are working with a dataset that has a small number of features. [Chapter 4: pg 115]
Normal Equation
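For reference, the closed-form solution is usually written as below. The symbols are the conventional ones and are assumed here rather than taken from the card: X is the matrix of feature values (including a bias column), y is the vector of target values, and the hat marks the value of θ that minimizes the cost function.

```latex
\hat{\theta} = (X^\top X)^{-1} X^\top y
```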
________ is a technique often applied as part of data preparation for machine learning. The goal of ________ is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information. ________ is also required for some algorithms to model the data correctly.
Normalization
________ are various distance measures.
Norms
_________ _________ _________ [Chapter 9: pg 274]
Novelty Detection
________ ________ ________ is a type of learning that is typically associated with batch learning that takes a lot of time and computing resources. [Chapter 1: pg 15]
Offline Learning
________ ________ is a ________ that is comprised of mostly zero values. ________ ________ are distinct from ________ with mostly non-zero values, which are referred to as ________ ________. A ________ is ________ if many of its coefficients are zero.
Sparse Matrix
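One way to picture why sparse matrices get special treatment is to store only the non-zero entries. The sketch below uses a plain dictionary keyed by (row, column); this is an illustrative representation with hypothetical helper names, not the storage format of any particular library.

```python
def to_sparse(dense):
    """Keep only the non-zero entries, keyed by (row, column)."""
    return {
        (i, j): value
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if value != 0
    }


def sparsity(dense):
    """Fraction of entries that are zero."""
    total = sum(len(row) for row in dense)
    zeros = sum(1 for row in dense for value in row if value == 0)
    return zeros / total


dense = [
    [0, 0, 3],
    [0, 0, 0],
    [1, 0, 0],
]
sparse = to_sparse(dense)   # only 2 of the 9 entries are stored
ratio = sparsity(dense)     # 7 zeros out of 9 entries
```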
________ ________ ________ is a process of converting categorical data variables so they can be provided to machine learning algorithms to improve predictions. ________ ________ ________ is a crucial part of feature engineering for machine learning. ________ ________ ________ is one method of converting data to prepare it for an algorithm and get a better prediction. With ________ ________, we convert each categorical value into a new categorical column and assign a binary value of 1 or 0 to those columns. Each integer value is represented as a binary vector. All the values are zero, and the index is marked with a 1.
One-Hot Encoding
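The conversion described in this card can be sketched without any library. The function name `one_hot_encode` and the colour values are assumptions for illustration; each distinct category becomes one column, and every row has a single 1 in the column of its category.

```python
def one_hot_encode(values):
    """Return the sorted category list and one binary vector per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)   # all zeros...
        row[index[value]] = 1         # ...except the category's own column
        encoded.append(row)
    return categories, encoded


categories, encoded = one_hot_encode(["red", "green", "red", "blue"])
# categories is ['blue', 'green', 'red'], so "red" maps to [0, 0, 1]
```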
________ ________ ________ is a type of learning where you train the system incrementally by feeding it data instances sequentially; either individually, or by small groups called 'mini-batches'. [Chapter 1: pg 15]
Online Learning
________ ________ [Chapter 5: pg 173]
Online SVMs
________ ________ ________ is a type of learning where online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory. The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data. Think of this type of learning as 'incremental learning'. [Chapter 1: pg 17]
Out-of-Core Learning
________ are sequences of non-representative data points
Outliers
________ means that the model has performed well on the training data, but does not generalize well. ________ happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. ________ happens when the model is too complex relative to the amount and noisiness of the training data. [Chapter 1: pg 28]
Overfitting
_________ is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
Pandas
_________ _________ is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
Pandas: DataFrame
________ ________ ________
Pandas: DataFrame drop()
________ ________ ________
Pandas: DataFrame dropna()
________ ________ ________
Pandas: DataFrame fillna()
________ ________ is a way to check for correlation between attributes, which plots every numerical attribute against every other numerical attribute.
Pandas: scatter_matrix
________ ________ [Chapter 4: pg 149]
Parameter Matrix
________ ________ is the space that is searched for the model's parameters: the more parameters a model has, the more dimensions this space has and the harder the search is. [Chapter 4: pg 123]
Parameter Space
________ ________ is a type of model (such as a linear model) which has a predetermined number of parameters, so its degree of freedom is limited, reducing the risk of overfitting (but increasing the risk of underfitting). [Chapter 6: pg 184]
Parametric Model
________ ________ measures how much a function changes when you change one of its variables just a little bit while holding the others constant. In Gradient Descent, it is computed for the cost function with respect to each model parameter. [Chapter 4: pg 123]
Partial Derivative
________ is what happens when sampling is performed without replacement. It allows training instances to be sampled several times across multiple predictors, but never more than once for the same predictor. [Chapter 7: pg 192]
Pasting
________ ________ [Chapter 5: pg 160]
Polynomial Kernel
________ ________ is a more complex model than Linear Regression (LR) in that it can fit non-linear datasets. Since this model has more parameters than LR, it is more prone to overfitting the training data. [Chapter 4: pg 114]
Polynomial Regression
________ ________ is a technique where you add powers of each feature as new features, then train a linear model on this extended set of features. [Chapter 4: pg 130]
Polynomial Regression
________ is the accuracy of the positive predictions: of all the samples the model assigned to a particular class, the fraction that truly belong to that class. ________ is calculated as the ratio of the number of Positive samples correctly classified to the total number of samples classified as Positive (either correctly or incorrectly). The ________ measures the model's accuracy in classifying a sample as positive.
Precision
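The ratio in this card can be computed directly from true and predicted labels. This is a minimal sketch; the function name `compute_precision` and the label lists are hypothetical, and returning 0.0 when nothing is predicted positive is a convention chosen for the example.

```python
def compute_precision(y_true, y_pred, positive=1):
    """True positives divided by everything that was predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # convention for this sketch: no positive predictions at all
    return tp / (tp + fp)


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
precision = compute_precision(y_true, y_pred)  # 2 true positives, 1 false positive
```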
________ ________ is the observation that increasing ________ reduces ________, and vice versa.
Precision/Recall Tradeoff
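The tradeoff shows up as soon as the decision threshold of a classifier is moved. The sketch below uses made-up scores and labels and a hypothetical helper `precision_recall_at`; it is only meant to show the direction of the effect.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    preds = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y == 1)
    # Conventions for this sketch when a denominator is zero
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]   # classifier decision scores
labels = [0, 0, 1, 1, 0, 1]                  # true classes

low = precision_recall_at(scores, labels, threshold=0.3)   # lenient threshold
high = precision_recall_at(scores, labels, threshold=0.7)  # strict threshold
```

With the lenient threshold every positive is found but two negatives slip in; with the strict threshold every positive prediction is correct but one positive is missed.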
________ ________ ________ are the features (attributes) used to predict a target numeric value in a regression task. [Chapter 1: pg 9]
Predictors
________ ________ [Chapter 4: pg 140]
Sparse Model
________ ________ is the process of solving certain mathematical optimization problems involving quadratic functions. Specifically, one seeks to optimize (minimize or maximize) a multivariate quadratic function subject to linear constraints on the variables. Quadratic programming is a type of nonlinear programming. [Chapter 5: pg 158]
Quadratic Programming
________ ________ stands for area under the ROC curve. That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1). [Chapter 3: pg 100]
ROC AUC (Area Under The Curve)
________ ________ is a common tool used with binary classifiers. It plots the true positive rate (another name for recall) against the false positive rate. Equivalently, ________ ________ plots sensitivity (recall) versus 1 - specificity. [Chapter 3: pg 99]
ROC Curve (Receiver Operating Characteristic)
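A rough sketch of how the points of a ROC curve, and the area under it, can be obtained by sweeping a threshold over the classifier's scores. The helper names `roc_points` and `auc` and the four-sample dataset are assumptions for illustration; the area is approximated with the trapezoid rule.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) for each distinct threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under a curve given as (x, y) points ordered by x."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
points = roc_points(scores, labels)
area = auc(points)
```

The curve always runs from (0, 0) to (1, 1); a perfect classifier would give an area of 1.0 and a purely random one about 0.5.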
________ ________ is a group (an ensemble) of Decision Tree classifiers. [Chapter 7: pg 191]
Random Forest
________ ________ ________ work by training many Decision Trees on random subsets of the features, then averaging out their predictions.
Random Forests
________ ________ [Chapter 4: pg 120]
Random Initialization
________ ________ ________ is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
RandomForestRegressor
________ is the percentage of the data belonging to a particular class which the model properly predicts as belonging to that class. ________ is the number of correct results divided by the number of results that should have been returned. In binary classification, recall is called sensitivity. It can be viewed as the probability that a relevant document is retrieved by the query.
Recall
_________ _________ _________ is a task used to predict a continuous value. A typical task is to predict a target numerical value, given a set of 'features' called 'predictors'.
Regression
_________ is a common supervised learning task for predicting values
Regression
________ is the process of constraining a model to make it simpler and reduce the risk of overfitting.
Regularization
________ ________ are linear models whose weights are constrained in order to reduce overfitting. [Chapter 4: pg 136]
Regularized Linear Models
________ ________ ________ is a type of training of machine learning models to make a sequence of decisions. The agent learns to achieve a goal in an uncertain, potentially complex environment. In ________ ________ ________, an artificial intelligence program faces a game-like situation. The computer employs trial and error to come up with a solution to the problem. To get the machine to do what the programmer wants, the artificial intelligence gets either rewards or penalties for the actions it performs. Its goal is to maximize the total reward. [Chapter 1: pg 14]
Reinforcement Learning
________ ________ ________ is the training of machine learning models to make a sequence of decisions. The agent learns to achieve a goal in an uncertain, potentially complex environment. In ________ ________ ________, an artificial intelligence faces a game-like situation. The computer employs trial and error to come up with a solution to the problem. To get the machine to do what the programmer wants, the artificial intelligence gets either rewards or penalties for the actions it performs. Its goal is to maximize the total reward. [Chapter 1: pg 13]
Reinforcement Learning
________ ________ ________ is the learning system that can observe the environment, select and perform actions, and get rewards or penalties in return. [Chapter 1: pg 14]
Reinforcement Learning: Agent
_________ _________ _________ is a reinforcement learning term for a negative reward. [Chapter 1: pg 14]
Reinforcement Learning: Penalty
_________ _________ _________ is the strategy that the learning system must learn by itself in order to get the most reward over time; it defines what action the agent should choose in a given situation. [Chapter 1: pg 14]
Reinforcement Learning: Policy
_________ _________ _________ is a reinforcement learning metric that culminates to a conclusion as opposed to a prediction. Judgement as reward. [Chapter 1: pg 14]
Reinforcement Learning: Reward Functions
________ ________ ________
Restricted Boltzmann Machines (RBMs)
________ ________ is a regularized version of Linear Regression in which a regularization term is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. Note that the regularization term should only be added to the cost function during training. Once the model is trained, you want to evaluate the model's performance using the un-regularized performance measure. [Chapter 4: pg 136]
Ridge Regression
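One common way to write the regularized cost function is shown below. The notation is an assumption made for illustration: α is the hyperparameter that controls how much regularization is applied, n is the number of features, and the bias term θ_0 is left out of the sum. The constant in front of the sum varies between sources.

```latex
J(\theta) = \mathrm{MSE}(\theta) + \alpha \, \frac{1}{2} \sum_{i=1}^{n} \theta_i^2
```

With α = 0 this reduces to plain Linear Regression; a very large α pushes all the weights toward zero.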
________ ________ ________ is a typical performance measure for regression problems. It gives an idea of how much error the system typically makes in its predictions, with a higher weight for larger errors.
Root Mean Square Error (RMSE)
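A short sketch of the measure, with the mean absolute error next to it to show what "a higher weight for larger errors" means in practice. The function names and the sample values are assumptions for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean of the squared prediction errors."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean of the absolute prediction errors, for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 10.0]     # errors of 1, 0 and 3
error_rmse = rmse(y_true, y_pred)
error_mae = mae(y_true, y_pred)
```

The single error of 3 dominates the RMSE, which is why it comes out larger than the MAE on the same predictions.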
________ ________ is at the beginning of a tree. It represents the entire population being analyzed. From the ________ ________, the population is divided according to various features, and those sub-groups are split in turn at each decision node under the ________ ________. [Chapter 6: pg 177]
Root Node
________ ________ [Chapter 5: pg 164]
SVM Regression
________ ________ ________ occurs when the distribution of one's training data doesn't reflect the actual environment that the machine learning model will be running in.
Sampling Bias
________ ________ ________ can be described as non-representative data as a result of chance.
Sampling Noise
________ ________ ________ is an estimator that applies transformers to columns of an array or pandas DataFrame. This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
SciKit-Learn: ColumnTransformer
________ ________ ________ observes features of an object and trains a model in the structure of a tree to predict data in the future to produce meaningful continuous output. Continuous output means that the output/result is not discrete, i.e., it is not represented just by a discrete, known set of numbers or values. A powerful model, capable of finding complex nonlinear relationships in data.
SciKit-Learn: DecisionTreeRegressor
________ ________ ________ can be used when you don't want to manually tune hyper-parameters to find the right combination of hyper-parameter values. Use ________ ________ ________ to search for you: all you need to do is tell it which hyper-parameters you want it to experiment with and what values to try out, and it will evaluate all the possible combinations of hyper-parameter values, using cross-validation.
SciKit-Learn: GridSearchCV
________ ________ is a subset to test the trained model.
Test Set
________ ________ ________ is a procedure used to estimate the skill of the model on new data. ________ ________ ________ takes as arguments the number of splits, whether or not to shuffle the sample, and the seed for the pseudorandom number generator used prior to the shuffle. For example, we can create an instance that splits a dataset into 3 folds, shuffles prior to the split, and uses a value of 1 for the pseudorandom number generator. The general procedure is as follows: shuffle the dataset randomly; split the dataset into ________ groups; then, for each unique group: 1) take the group as a hold-out or test data set, 2) take the remaining groups as a training data set, 3) fit a model on the training set and evaluate it on the test set, 4) retain the evaluation score and discard the model. Finally, summarize the skill of the model using the sample of model evaluation scores.
SciKit-Learn: K-Fold Cross-Validation
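The shuffle-and-split part of the procedure can be sketched in plain Python. This is an illustration of the idea only, not Scikit-Learn's implementation; the function name `k_fold_indices` and its parameters are hypothetical, chosen to mirror the three arguments mentioned in the card.

```python
import random


def k_fold_indices(n_samples, n_splits=3, shuffle=True, seed=1):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]                 # held-out group
        train_idx = indices[:start] + indices[start + size:]   # remaining groups
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(6, n_splits=3))
```

Each sample lands in exactly one test fold; a model would be fitted on `train_idx` and scored on `test_idx` for every pair, and the scores summarized at the end.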
________ ________ ________ is a constructor that takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers.
SciKit-Learn: Pipeline
_________ _________ _________
SciKit-Learn: StandardScaler
________ ________ ________ is a class that performs stratified sampling to produce folds that contain a representative ratio of each class. At each iteration, the code creates a clone of the classifier, trains that clone on the training folds, and makes predictions on the test fold. Then it counts the number of correct predictions and outputs the ratio of correct predictions.
SciKit-Learn: StratifiedKFold
________ ________ ________ is any object that can estimate some parameters based on a dataset (eg: an 'imputer' is an ________)
Scikit-Learn: Estimators
________ ________ ________ are a type of estimators that are capable of making ________ given a dataset. A ________ has a ________() method that takes a dataset of new instances and returns a dataset of corresponding ________. It also has a score() method that measures the quality of the ________ given a test set (and the corresponding labels in a case of supervised learning algorithm).
Scikit-Learn: Predictors
________ ________ ________
Scikit-Learn: SimpleImputer
________ ________ ________ are a type of estimator (such as an imputer) that can also ________. The ________ is performed by the ________() method with the dataset to ________ as a parameter. It returns the ________ dataset. This ________ generally relies on the learned parameters, as is the case for an 'imputer'. All ________ also have a convenience method called fit_________() that is equivalent to calling fit() and then ________() (but sometimes fit_________() is optimized and runs much faster).
Scikit-Learn: Transformers
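To make the three methods concrete, here is a toy class that follows the same pattern. It is a hypothetical stand-in written for this example (the class name `MedianImputer` and the use of `None` for missing values are assumptions), not a Scikit-Learn class.

```python
import statistics


class MedianImputer:
    """Toy transformer: learns a median, then fills missing values with it."""

    def fit(self, values):
        # Learned parameter; the trailing underscore follows the usual naming habit
        self.median_ = statistics.median(v for v in values if v is not None)
        return self

    def transform(self, values):
        # Relies on the parameter learned in fit()
        return [self.median_ if v is None else v for v in values]

    def fit_transform(self, values):
        # Convenience method: fit, then transform the same data
        return self.fit(values).transform(values)


imputer = MedianImputer()
filled = imputer.fit_transform([1.0, None, 3.0, 5.0])
```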
________ ________ ________ is a learning problem that involves a small number of labeled examples and a large number of unlabeled examples. [Chapter 1: pg 13]
Semi-Supervised Learning
________ ________
Signal/Noise Ratio
________ are pieces of information fed to an ML system (you want a high ________/noise ratio).
Signals
________ ________ is a function that measures how much each instance resembles a particular landmark. One technique for tackling nonlinear problems is to add features computed using a ________ ________. [Chapter 5: pg 161]
Similarity Function
________ ________ is an algorithm inspired by the process of annealing in metallurgy, where molten metal is slowly cooled down. [Chapter 4: pg 127]
Simulated Annealing
________ ________ is a standard matrix factorization technique that can decompose the training set matrix into the matrix multiplication of three matrices. [Chapter 4: pg 119]
Singular Value Decomposition (SVD)
In practice, real data is messy and cannot be separated perfectly with a hyperplane. The constraint of maximizing the margin of the line that separates the classes must be relaxed. This is often called the ________ ________ ________. This change allows some points in the training data to violate the separating line. [Chapter 5: pg 156]
Soft Margin Classification
________ is a type of scheme where every individual classifier provides a probability value that a specific data point belongs to a particular target class. The predictions are weighted by the classifier's importance and summed up. Then the target label with the greatest sum of weighted probabilities wins the vote. [Chapter 7: pg 192]
Soft Voting
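A minimal sketch of the weighted sum described in this card. The helper name `soft_vote`, the equal default weights, and the probability values are assumptions for illustration.

```python
def soft_vote(probabilities, weights=None):
    """Return the winning class index and the weighted probability sums."""
    weights = weights or [1.0] * len(probabilities)  # equal importance by default
    n_classes = len(probabilities[0])
    totals = [
        sum(w * p[c] for w, p in zip(weights, probabilities))
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=totals.__getitem__)
    return winner, totals


# One row per classifier: [P(class 0), P(class 1)]
probs = [
    [0.90, 0.10],   # very confident in class 0
    [0.40, 0.60],   # mildly prefers class 1
    [0.45, 0.55],   # mildly prefers class 1
]
winner, totals = soft_vote(probs)
```

Two of the three classifiers lean toward class 1, so a majority vote would pick class 1, but the summed probabilities favor class 0 because the confident vote carries more weight.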
________ ________ [Chapter 4: pg 114]
Softmax Regression
________ ________ is a generalization of logistic regression that we can use for multi-class classification (under the assumption that the classes are mutually exclusive). In contrast, we use the (standard) Logistic Regression model in binary classification tasks. [Chapter 4: pg 149]
Softmax Regression (Multinomial Logistic Regression)
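The step that turns one score per class into mutually exclusive class probabilities is the softmax function. The sketch below shows that step only, with made-up scores; the predicted class is then the one with the largest probability.

```python
import math


def softmax(scores):
    """Convert per-class scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum avoids overflow and does not change the result
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]    # one score per class
probs = softmax(scores)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```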
________ is an ensemble learning technique that uses the predictions from multiple models (for example kNN, decision trees, or SVM) to build a new model. This final model is used for making predictions on the test dataset. [Chapter 7: pg 191]
Stacking (Ensemble Method)
________ ________ ________
Standard Correlation Coefficient
_________ _________ is a number that describes how spread out the values are. A low _________ _________ means that most of the numbers are close to the mean (average) value. A high _________ _________ means that the values are spread out over a wider range.
Standard Deviation
________ is a scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation. First, ________ subtracts the mean value (so ________ values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Unlike min-max scaling, ________ does not bound values to a specific range, which may be a problem for some algorithms (eg: neural networks often expect an input value ranging from 0 to 1).
Standardization
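The two steps in this card, subtract the mean and then divide by the standard deviation, look like this in plain Python. The function name and the sample values are assumptions for illustration; the population standard deviation is used here.

```python
import statistics


def standardize(values):
    """Center the values on zero and scale them to unit standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population standard deviation
    return [(v - mean) / std for v in values]


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, standard deviation 2
scaled = standardize(values)
```

The results are not confined to a fixed range such as 0 to 1: the largest value here ends up at 2.0.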
________ ________ is the most frequent prediction, which is the aggregation function typically used for classification (just like a hard voting classifier); for regression the average is used instead. Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set. [Chapter 7: pg 195]
Statistical Mode
________ ________ [Chapter 4: pg 114]
Stochastic Gradient Descent
________ ________ picks a random instance in the training set at every step and computes the gradients based only on that single instance. [Chapter 4: pg 125]
Stochastic Gradient Descent
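________ ________ is a method of sampling from a population that can be divided into homogeneous subgroups (strata): the right number of instances is sampled from each subgroup so that the sample is representative of the overall population.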
Stratified Sampling
________ ________ [Chapter 5: pg 163]
String Kernels
________ ________ [Chapter 5: pg 163]
String Subsequence Kernel
________ ________ [Chapter 4: pg 141]
Subgradient Vector
________ ________ ________ is a form of learning where the training data you feed to the algorithm includes the desired solutions, called 'labels'. Typical ________ ________ ________ tasks are classification and regression. Algorithms that relate to this type of learning include: k-Nearest Neighbors, Linear Regression, Logistic Regression, Support Vector Machines (SVMs), Decision Trees, Random Forests, and Neural Networks. [Chapter 1: pg 8]
Supervised Learning
________ ________ ________ is a linear model for classification and regression problems. It can solve linear and non-linear problems and work well for many practical problems. The idea of SVM is simple: The algorithm creates a line or a hyperplane which separates the data into classes. [Chapter 5: pg 155]
Support Vector Machine
________ ________ ________ is a powerful and versatile ML model capable of performing linear or nonlinear classification, regression, and even outlier detection. [Chapter 5: pg 155]
Support Vector Machine
________ ________ ________ [Chapter 5: pg 156]
Support Vectors
________ ________ is a subset to train a model. As implied, you train your model using the ________ ________
Training Set
_________ _________ _________
plt.legend()