Introduction to Machine Learning

What is a perceptron?

A basic computational unit in an artificial neural network: a weighted sum of inputs passed through an activation function that restricts the output range. The designer decides on the activation function, how the weights are initialised, etc. The perceptron algorithm: - Generate initial random weights for each input - Provide the perceptron with inputs and multiply each input by its weight - Sum all of the weighted inputs - Compute the output of the perceptron by passing that sum through the activation function.
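
The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the step activation, the `bias` term and the weight range of -1 to 1 are assumptions that the card itself does not specify.

```python
import random

def step(z):
    """Step activation (an assumed choice): restricts the output to 0 or 1."""
    return 1 if z >= 0 else 0

def random_weights(n_inputs, seed=0):
    """Generate one initial random weight per input (range is an assumption)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]

def perceptron_output(inputs, weights, bias=0.0, activation=step):
    # Multiply each input by its weight and sum the weighted inputs.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Pass the sum through the activation function to get the output.
    return activation(weighted_sum)
```

Swapping `step` for another function changes the output range without touching the rest of the computation.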

Why is accuracy important when evaluating how well a learned model generalises?

Accuracy can be used as a check for overfitting by comparing it across datasets (the same comparison can be made with other metrics). When a classifier scores 90% accuracy on test data and 50% on new data, when it could score 75% on both, the model is overfitting. So, in the scope of generalization: if the accuracy on the test set is good and close to the accuracy on the training set, the model generalizes well.

Explain what bagging is. Why is it used?

Bagging (bootstrap aggregation) is an ensemble technique. It creates different versions of the same dataset by sampling with replacement, trains one model on each version, and aggregates their predictions (for example by voting or averaging). It is used mainly to reduce variance, and with it overfitting, without increasing bias much.
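
A minimal sketch of bagging with majority voting, using only the standard library. The 1-nearest-neighbour base learner, the toy one-dimensional `data` and the default of 25 models are all illustrative assumptions; any base learner could be substituted.

```python
import random
from collections import Counter

def nearest_neighbour_model(sample):
    """Hypothetical base learner: 1-nearest-neighbour on (x, label) pairs."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def bagging_predict(data, x, n_models=25, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap: draw len(data) points with replacement.
        sample = [rng.choice(data) for _ in data]
        votes.append(nearest_neighbour_model(sample)(x))
    # Aggregate: majority vote over the individual predictions.
    return Counter(votes).most_common(1)[0][0]

# Toy data (assumed): class "a" near 0..4, class "b" near 10..14.
data = [(float(i), "a") for i in range(5)] + [(float(i), "b") for i in range(10, 15)]
```

Each model sees a slightly different dataset, so individual mistakes tend to be outvoted when the predictions are aggregated.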

State the generalization problem and the theoretical variables it entails

Generalization error = bias² + variance + noise. The generalization problem: once we have built a classifier, how accurate will it be on future (unseen) test data? The theoretical variables this generalization problem entails are bias and variance. Bias: how much does the average model, trained over all available training sets, differ from the true model? → error due to inaccurate assumptions/simplifications made by the model. Variance: how much do models estimated from different training sets differ from each other?

Entropy formula

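H(D) = - ∑ p(c) * log2 p(c), summed over all classes c, where p(c) is the proportion of instances in dataset D that belong to class c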
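
A small sketch of the formula in Python, assuming the dataset D is given simply as a list of class labels (the function name `entropy` is illustrative).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(D) = - sum over classes c of p(c) * log2 p(c)."""
    n = len(labels)
    # p(c) is the fraction of labels that belong to class c.
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())
```

A dataset with a single class gives an entropy of 0, and a 50/50 split between two classes gives 1 bit.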

Enunciate the perceptron convergence theorem

If the training data are linearly separable, then the perceptron learning rule will converge to a set of weights that separates the classes. This means that the perceptron will stop updating its weights after a finite number of steps. The solution found is one of possibly many separating solutions; if the data are not linearly separable, convergence is not guaranteed.
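
The theorem can be illustrated with a small sketch of the perceptron learning rule on a linearly separable toy problem (logical AND). The zero initial weights, the `bias` term, the learning rate of 1 and the cap of 100 epochs are assumptions made for the example.

```python
def predict(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0

def train_perceptron(samples, learning_rate=1.0, max_epochs=100):
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        updates = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)
            if error != 0:
                # Perceptron rule: move the weights towards the misclassified point.
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                bias += learning_rate * error
                updates += 1
        if updates == 0:
            # A full pass without updates: the weights have stopped changing.
            return weights, bias, epoch
    return weights, bias, None  # did not converge within max_epochs

# Logical AND is linearly separable.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
```

On data that is not linearly separable (for example XOR), the same loop would keep updating until `max_epochs` is reached and return `None` as the epoch.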

What is reduced error pruning? What does it consist of?

Reduced error pruning is a post-pruning technique to overcome the problem of overfitting. Starting at the leaves of the decision tree, each node is replaced with its most popular class; if the accuracy on the validation set is not affected, the change is kept. Until further pruning is harmful, the following steps are repeated: - Evaluate the impact on the validation set of pruning each candidate node - Greedily remove the node whose removal most improves validation set accuracy.

What is sampling?

Sampling is a way to define parts of your population as training data (data that you use to train your ML models) and test data (data that you use to test your ML model on unseen data). There are several sampling strategies, e.g. percentage split, k-fold cross-validation, leave-one-out cross-validation.

What is the process of pre-pruning?

Stop the tree construction a bit earlier: do not split a node if the goodness of the split is below a certain threshold.

What are Precision and Recall?

They are both basic evaluation metrics. Precision is the fraction of retrieved documents that are relevant → TP / (TP + FP). Recall is the fraction of relevant documents that are retrieved → TP / (TP + FN).
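
The two formulas can be computed from true and predicted labels roughly as follows. Returning 0.0 when a denominator is zero is an assumed convention, and the function name is illustrative.

```python
def precision_recall(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumed convention: report 0.0 when nothing was retrieved or nothing is relevant.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```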

State the difference between overfitting & underfitting

With underfitting, the model is too simple to represent all the relevant class characteristics. It has high bias and low variance: high training error, high test error. With overfitting, the model is too complex and fits irrelevant characteristics (noise) in the data. It has low bias and high variance: low training error, high test error.

What is the theoretical framework around ML?

A function can be imitated (approximated) by looking at enough instances of that function, that is, examples of its inputs and outputs, and the learned approximation can be regulated using a function.

What is LOOCV?

Leave-one-out cross-validation. For a dataset of R records, it temporarily removes the kth record from the dataset and trains the model on the remaining R-1 data points. The error is then checked on the kth record. This is repeated for every record, and when all the points are done the mean error is reported.
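
A sketch of the procedure with a deliberately trivial model. The "predict the mean" learner and the squared error are assumptions chosen only to keep the example short; `fit` and `error` are hypothetical callbacks.

```python
def loocv_mean_error(records, fit, error):
    errors = []
    for k in range(len(records)):
        held_out = records[k]                      # temporarily remove the kth record
        training = records[:k] + records[k + 1:]   # the remaining R-1 data points
        model = fit(training)
        errors.append(error(model, held_out))
    return sum(errors) / len(errors)               # mean error over all R rounds

def fit_mean(values):
    """Toy 'model': always predict the mean of the training values."""
    return sum(values) / len(values)

def squared_error(prediction, actual):
    return (prediction - actual) ** 2
```

For the records 1, 2, 3 the held-out errors are 2.25, 0 and 2.25, so the reported mean error is 1.5.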

What is Machine Learning?

Machine Learning is the study of algorithms and statistical models that computers use to perform tasks without explicit instructions, relying on patterns learned from data instead.

What's MCC?

Matthews correlation coefficient. It is in essence a correlation coefficient between the observed and predicted binary classifications. It returns a value between -1 and +1, where +1 represents a perfect prediction, 0 a prediction no better than random, and -1 total disagreement between prediction and observation.
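
For reference, the coefficient can be computed from the four confusion matrix counts with the standard formula (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). Returning 0.0 when the denominator is zero is an assumed convention.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention when a row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```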

What is the minimum description length?

Minimum description length (MDL) is a model selection principle that can be used for pruning in decision tree learning. It relates to the size of the decision tree. Following Occam's razor, prefer the shortest hypothesis. Minimize → Size(tree) + Size(misclassifications(tree))

What is the basic principle behind decision-tree learning?

The basic principles of decision tree learning are information gain and entropy. At each node, the attribute to split on is picked so that it gives the largest information gain at that point in time. Entropy in decision tree learning can be thought of as the impurity of a dataset, or the uncertainty in the classification.
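
A sketch of attribute selection by information gain. The rows-as-dictionaries layout and the toy weather data (`ROWS`, with attributes `windy` and `humid`) are assumptions made for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy after it."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attributes, target):
    # Greedy choice: the attribute with the largest gain at this node.
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Toy data (assumed): "windy" determines the label, "humid" is irrelevant.
ROWS = [
    {"windy": "yes", "humid": "yes", "play": "no"},
    {"windy": "yes", "humid": "no", "play": "no"},
    {"windy": "no", "humid": "yes", "play": "yes"},
    {"windy": "no", "humid": "no", "play": "yes"},
]
```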

What can we do when we have imbalanced data?

Use oversampling and undersampling methods, e.g. SMOTE. In total there are four groups of methods to rebalance the data: - Baseline methods, e.g. random oversampling, random undersampling - Undersampling methods, e.g. Tomek links - Oversampling methods, e.g. SMOTE - Combinations of oversampling and undersampling methods, e.g. SMOTE + Tomek links

What is the conceptual difference between validation and generalisability evaluation?

Validation evaluation is the process in which a trained model is evaluated with an (unseen) testing dataset that is a separate portion of the same dataset used to train the model. With generalisability evaluation, the trained model is evaluated with a completely unseen and new dataset that has no relation to the dataset used to train the model.

What is overfitting in the context of decision tree models?

You have an overfitted model if the learned tree is too specific to the training data and cannot be generalized to other data. This means that the hypothesis only holds on your training data, and the model is over-complex.

What is CRISP-DM?

Cross-Industry Standard Process for Data Mining; it was adopted by IBM. It was designed to be cross-industry so that data mining projects, which can use ML models, follow a consistent process across industries. It connects a business problem to data mining objectives, and it consists of 6 steps: 1. Business understanding 2. Data understanding 3. Data preparation 4. Modeling 5. Evaluation 6. Deployment

What are the advantages of the decision tree model?

- Simplicity - Non-parametric method (no assumptions about the data generation are needed) - Versatility (can be used for both classification and regression tasks) - Interpretability

What are the disadvantages of the decision tree model?

- Very prone to overfitting - It is greedy (chooses the best attribute at that moment in time, which may not give the best overall tree) - Not well suited to complex problems

What is the so-called kernel trick?

The kernel trick implicitly maps the dataset into a space of higher dimension (a plane of higher order) by replacing inner products between data points with a kernel function, so the mapping never has to be computed explicitly. If the data is not linearly separable in its own space, it could be separable in a space of higher dimension.
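
As an illustration, the degree-2 polynomial kernel on 2-D points gives the same value as an inner product in a 3-D feature space, without ever building that space. The choice of this particular kernel and the names `phi` and `poly_kernel` are assumptions for the example.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit map from 2-D to 3-D that matches the degree-2 polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """k(x, z) = (x . z)^2, computed entirely in the original 2-D space."""
    return dot(x, z) ** 2
```

For x = (1, 2) and z = (3, 4), both `poly_kernel(x, z)` and `dot(phi(x), phi(z))` come to 121 (up to floating-point rounding), which is the sense in which the kernel "is" an inner product in the higher-dimensional space.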

What is connectionist AI?

Connectionist AI represents information in a distributed, less explicit form within a network. It imitates the biological processes underlying learning, task performance, and problem solving. This is in contrast to symbolic AI, which represents information through symbols and their relationships; specific algorithms are used to process these symbols to solve problems and/or deduce new knowledge.

What is entropy?

Entropy is a measure of randomness in the information being processed. The higher the entropy the harder it is to draw any conclusions on that information.

What is k-fold cross validation?

A cross-validation method that randomly breaks the dataset into k partitions. For each partition, it trains on all the points that are not in that partition and tests on the left-out partition to find the test-set error on that partition's points. This is done for all partitions, and the mean error is then reported.
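
A sketch of how the partitions could be generated and the mean error reported. The index-based interface and the `fold_error` callback are hypothetical; a real experiment would train and test an actual model inside that callback.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k partitions."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random partitioning
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Train on every point that is not in the left-out partition.
        train_idx = [j for f_no, fold in enumerate(folds) if f_no != i for j in fold]
        yield train_idx, test_idx

def cross_val_mean_error(n_samples, k, fold_error, seed=0):
    """fold_error(train_idx, test_idx) is a caller-supplied (hypothetical) function."""
    errors = [fold_error(train, test) for train, test in k_fold_splits(n_samples, k, seed)]
    return sum(errors) / len(errors)
```

Setting k equal to the number of samples turns this into leave-one-out cross-validation.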

What is F-measure? What other evaluation metrics exist? How are they defined?

A single evaluation metric that evaluates precision and recall at the same time. It is the weighted harmonic mean of precision and recall. The harmonic mean is a conservative average: it is never larger than the arithmetic mean and is pulled towards the smaller of the two values. The F1 measure is calculated as 2 * precision * recall / (precision + recall). Other evaluation metrics that exist are AUC-ROC, accuracy, and Matthews correlation coefficient.
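
A short sketch of the weighted form. The `beta` parameter (which weights recall relative to precision) and the zero-division convention are assumptions beyond what the card states; `beta=1` gives the F1 formula above.

```python
def f_beta(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0  # assumed convention to avoid dividing by zero
    b2 = beta * beta
    # Weighted harmonic mean of precision and recall.
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With precision 0.5 and recall 1.0, F1 is about 0.67, below the arithmetic mean of 0.75, which shows how the harmonic mean penalises the weaker of the two values.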

What is an activation function?

It is a design choice that you select according to its fitness to the problem at hand. It is the function that dictates the output range of the perceptron.

Explain what bootstrapping is. Why is it used?

It is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement. It is used to estimate the skill of the models when making predictions on data not included in the training data.
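
A minimal sketch of drawing one bootstrap sample. Returning the records that were never drawn ("out-of-bag" records) is an assumption about how the unseen data for skill estimation would be obtained.

```python
import random

def bootstrap_sample(data, seed=0):
    rng = random.Random(seed)
    # Sample len(data) indices with replacement: some repeat, some never appear.
    indices = [rng.randrange(len(data)) for _ in data]
    chosen = set(indices)
    sample = [data[i] for i in indices]
    # Records never drawn can serve as unseen data for evaluation.
    out_of_bag = [data[i] for i in range(len(data)) if i not in chosen]
    return sample, out_of_bag
```

Repeating this with different seeds and recomputing a statistic on each sample gives an estimate of how much that statistic varies.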

What is the definition of gradient descent?

It is an optimization algorithm used to minimize some function by iteratively moving in the direction of the steepest descent as defined by the negative of the gradient.
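
A minimal one-dimensional sketch. The example function f(x) = (x - 3)^2, the learning rate of 0.1 and the 100 steps are assumptions chosen so that the iteration visibly approaches the minimum at x = 3.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    x = start
    for _ in range(steps):
        # Move against the gradient, i.e. in the direction of steepest descent.
        x = x - learning_rate * gradient(x)
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small makes convergence slow.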

What is the accuracy score? And when is it used?

It is the percentage of correctly classified instances in a dataset: (TP + TN) / (TP + TN + FP + FN). It summarises TP, TN, FP and FN in a single, non-weighted score, which makes it a quick way to spot overfitting by comparing training and test accuracy. It is used when checking for overfitting and controlling for generalizability.

What is AUC-ROC?

The ROC curve plots the TPR (TP over the total number of samples of class y = +1 in the dataset) against the FPR (FP over the total number of samples of the other class). Each point on the curve corresponds to a different setting of the classifier's decision threshold. The AUC is the area under this curve.
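
A sketch of how the curve and its area could be computed from classifier scores, assuming both classes are present and that a sample is predicted positive when its score is at or above the threshold. The trapezoidal rule is used for the area.

```python
def roc_points(y_true, scores):
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # One point per threshold, from the strictest threshold to the most lenient.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y != 1)
        points.append((fp / negatives, tp / positives))  # (FPR, TPR)
    return points

def auc(points):
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between neighbouring points
    return area
```

A classifier that ranks every positive above every negative reaches an AUC of 1.0, while random ranking gives about 0.5.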

What is a support vector?

Support vectors are the training points closest to the decision boundary; they define the boundaries of the margin between one class and any other class (or other classes). An SVM works by maximizing this margin with a geometrical approach; because the optimization problem is convex, it converges towards a global optimum.

What is the lift score and under what conditions is its use appropriate?

The lift score is how much better your model is than random guessing. One condition: the problem that we are facing is trivial enough (not mission critical, life critical, or business critical). Only use it when there is a respectable margin for error. This margin for error is your own parameter.
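
One common way to quantify "better than random guessing" is to divide the precision of the model by the base rate of the positive class. That definition is an assumption here, since the card does not give a formula, and the zero-division convention is likewise assumed.

```python
def lift_score(y_true, y_pred, positive=1):
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    base_rate = sum(1 for t in y_true if t == positive) / len(y_true)
    if not predicted_pos or base_rate == 0:
        return 0.0  # assumed convention when lift is undefined
    precision = sum(1 for t in predicted_pos if t == positive) / len(predicted_pos)
    # Lift > 1: the model finds positives more often than random selection would.
    return precision / base_rate
```

A lift of 1.0 means the model does no better than picking instances at random.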

What types of ML problems do you know? How do they differ?

There are three types of machine learning problems: supervised learning, unsupervised learning, and reinforcement learning. With supervised learning, the task is to learn a mapping from inputs to specific outputs; the data has already been labeled. With unsupervised learning, the algorithm describes the structure of unlabeled data. And with reinforcement learning, a software agent makes decisions in an environment so as to maximize some notion of cumulative reward.

