ML Interview Questions


● Can you make the distinction between an algorithm and a model?

A model is a function that represents a data set; an algorithm is the procedure used to obtain (fit) that function from the data.
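To make the distinction concrete, here is a minimal sketch (my own illustration, not from the original) using NumPy: ordinary least squares is the algorithm, and the fitted function it produces is the model.

```python
import numpy as np

# Toy data: y is roughly 2*x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# The *algorithm*: ordinary least squares, solved here with a least-squares routine.
X = np.column_stack([x, np.ones_like(x)])       # design matrix [x, 1]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)    # fits slope and intercept

# The *model*: the learned function that maps a new input to a prediction.
def model(x_new):
    return coef[0] * x_new + coef[1]

print(coef)        # approximately [2.0, 1.0]
print(model(3.0))  # approximately 7.0
```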

● Why do we use a train/test split?

To estimate how well the model generalizes to unseen data and to detect overfitting: a model evaluated only on the data it was trained on can look far better than it really is.
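A minimal sketch of a train/test split, assuming scikit-learn (the original names no library); the gap between training and test accuracy is exactly what the split is meant to expose.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real data set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 25% of the rows as a test set the model never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Training accuracy is usually optimistic; test accuracy estimates generalization.
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
```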

● What is function approximation?

The problem of learning a mapping function from inputs to outputs (this is what a predictive model does).
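A small illustrative sketch (my addition, using NumPy): the true mapping is observed only through noisy samples, and a polynomial is fit to approximate it.

```python
import numpy as np

# Unknown target mapping f, observed only through noisy samples.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 2 * np.pi, size=100))
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)  # noisy observations of f(x) = sin(x)

# Approximate the mapping with a cubic polynomial fit to the samples.
coeffs = np.polyfit(x, y, deg=3)
f_hat = np.poly1d(coeffs)

# The approximator can now predict outputs for new inputs.
print(f_hat(1.0), np.sin(1.0))  # prediction vs. true value
```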

● What is cross-validation used for? What types of cross-validation do you know?

Cross-validation is a model evaluation method that improves on simply inspecting residuals. The problem with residual evaluation is that it gives no indication of how well the learner will do when asked to make predictions on data it has not already seen. One way to overcome this is to not use the entire data set when training: some of the data is removed before training begins, and once training is done, the held-out data is used to test the performance of the learned model on "new" data. This is the basic idea behind a whole class of model evaluation methods called cross-validation. Cross-validation is a very useful technique for assessing the effectiveness of a model, particularly when you need to mitigate overfitting, and it is also used when tuning the hyperparameters of your model.

Types of cross-validation:

1. *Hold-out method:* The simplest kind of cross-validation. The data set is separated into two sets, called the training set and the testing set. The function approximator fits a function using the training set only, and is then asked to predict the output values for the data in the testing set. *Advantages:* fully independent test data; only needs to be run once, so it has a lower computational cost. *Disadvantages:* the evaluation can have high variance with smaller data sets, because it may depend heavily on which data points end up in the training set and which end up in the test set, so the result can differ significantly depending on how the division is made.

2. *k-fold cross-validation:* The data set is split into k folds; the model is trained k times, each time holding out a different fold as the test set and training on the remaining k-1 folds. The k scores are then averaged, which uses every data point for both training and testing and gives a lower-variance estimate than a single hold-out split, at the cost of k training runs.
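A minimal sketch contrasting the hold-out method with k-fold cross-validation, assuming scikit-learn (my choice of library, not from the original):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Hold-out method: one split, one evaluation (cheap, but higher variance).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
holdout_score = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)
print("hold-out accuracy:", holdout_score)

# k-fold cross-validation: k evaluations, each fold used once as the test set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold accuracies:", scores)
print("mean accuracy:    ", scores.mean())
```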

● What's the difference between supervised and unsupervised learning?

Labeled vs. unlabeled data.

*Supervised learning:* Supervised learning is learning a model from an input variable (say, X) and an output variable (say, Y), using an algorithm to map the input to the output; that is, Y = f(X). *Why supervised learning?* The basic aim is to approximate the mapping function so well that when new input data (x) arrives, the corresponding output variable can be predicted. It is called supervised learning because the learning process (from the training dataset) can be thought of as a teacher supervising the entire procedure: the learning algorithm iteratively makes predictions on the training data and is corrected by the "teacher", and learning stops when the algorithm reaches an acceptable level of performance (or the desired accuracy).

Example of supervised learning: suppose a basket is filled with fresh fruit (apples, bananas, cherries, grapes) and the task is to group the same type of fruit together. Suppose one already knows, from previous work (or experience), the shape of each fruit in the basket, so it is easy to place the same type of fruit in one place. Here the previous work plays the role of training data: there is a response variable y saying that if a fruit has such-and-such features, then it is a grape, and similarly for every other fruit. This type of information is learned from the data used to train the model. This type of learning is called supervised learning, and such problems fall under classical classification tasks.

*Unsupervised learning:* Unsupervised learning is where only the input data (say, X) is present and there is no corresponding output variable. *Why unsupervised learning?* The main aim of unsupervised learning is to model the distribution of the data in order to learn more about it. It is called unsupervised because there is no correct answer and no teacher (unlike supervised learning); the algorithms are left to their own devices to discover and present the interesting structure in the data.

Example of unsupervised learning: again, suppose a basket filled with fresh fruit and the task of grouping the same type of fruit together, but this time there is no prior information about the fruit; they are being seen for the first time. To group similar fruits without any prior knowledge, pick a physical characteristic, say color, and arrange the fruit by it: RED COLOR GROUP: apples and cherries; GREEN COLOR GROUP: bananas and grapes. Then take another characteristic, say size: RED AND BIG: apples; RED AND SMALL: cherries; GREEN AND BIG: bananas; GREEN AND SMALL: grapes. The job is done without knowing or learning anything beforehand, that is, with no training data and no response variable. This type of learning is known as unsupervised learning.
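A toy sketch of the fruit example above (my own illustration; the feature values are made up), contrasting a supervised classifier that is given labels with an unsupervised clustering that is not, using scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Toy "fruit" features: [color (0=red, 1=green), size in cm].
X = np.array([[0, 8.0], [0, 2.0], [1, 18.0], [1, 2.5],
              [0, 7.5], [0, 1.8], [1, 17.0], [1, 2.2]])
labels = ["apple", "cherry", "banana", "grape",
          "apple", "cherry", "banana", "grape"]

# Supervised: labels are given, so we learn the mapping features -> fruit name.
clf = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
print(clf.predict([[0, 7.8]]))   # expected: ['apple']

# Unsupervised: no labels; the algorithm only groups similar rows together.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.labels_)                # cluster ids, with no fruit names attached
```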

● What's the difference between a regression and a classification problem? How about clustering?

https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/

Prediction is divided into (1) regression and (2) classification. Regression and classification are supervised learning (an answer is provided for every training point), while clustering is unsupervised learning (no answers are given for the points). The key difference is that in classification the outputs are discrete, whereas in regression they are continuous: classification is about predicting a label and regression is about predicting a quantity.

*Classification predictive modeling* is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y). The output variables are often called labels or categories, and the mapping function predicts the class or category for a given observation. For example, an email can be classified as belonging to one of two classes, "spam" and "not spam". A problem with two classes is often called a two-class or binary classification problem; a problem with more than two classes is often called a multi-class classification problem; and a problem where an example is assigned multiple classes is called a multi-label classification problem. It is common for classification models to predict a continuous value as the probability of a given example belonging to each output class. These probabilities can be interpreted as the likelihood or confidence that the example belongs to each class, and a predicted probability can be converted into a class value by selecting the class label with the highest probability. For example, an email may be assigned probabilities of 0.1 for "spam" and 0.9 for "not spam"; we convert these to a class label by selecting "not spam", as it has the highest predicted likelihood. There are many ways to estimate the skill of a classification predictive model, but perhaps the most common is classification accuracy: the percentage of correctly classified examples out of all predictions made.

*Regression predictive modeling* is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y). A continuous output variable is a real value, such as an integer or floating-point value; these are often quantities, such as amounts and sizes. For example, a house may be predicted to sell for a specific dollar value, perhaps in the range of $100,000 to $200,000. A regression problem requires the prediction of a quantity and can have real-valued or discrete input variables. A problem with multiple input variables is often called a multivariate regression problem, and a regression problem where the input variables are ordered by time is called a time series forecasting problem. Because a regression predictive model predicts a quantity, the skill of the model must be reported as an error in those predictions. There are many ways to estimate the skill of a regression predictive model, but perhaps the most common is the root mean squared error, abbreviated RMSE.
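A minimal sketch contrasting the two evaluation styles described above, assuming scikit-learn (not named in the original): accuracy for a classifier, including converting predicted probabilities into labels, and RMSE for a regressor.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Classification: discrete labels, evaluated with accuracy.
Xc, yc = make_classification(n_samples=400, n_features=8, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc_tr, yc_tr)
proba = clf.predict_proba(Xc_te)        # per-class probabilities
pred_labels = proba.argmax(axis=1)      # pick the most likely class
print("accuracy:", accuracy_score(yc_te, pred_labels))

# Regression: continuous targets, evaluated with an error such as RMSE.
Xr, yr = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = LinearRegression().fit(Xr_tr, yr_tr)
rmse = mean_squared_error(yr_te, reg.predict(Xr_te)) ** 0.5
print("RMSE:", rmse)
```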

