Plz Data Job
Bias-Variance Trade-off
The goal of any supervised machine learning algorithm is to achieve low bias and low variance, and in turn good prediction performance. The k-nearest neighbors algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k, which increases the number of neighbors that contribute to the prediction and in turn increases the bias of the model. The support vector machine algorithm has low bias and high variance, but the trade-off can be changed through the C parameter, which influences the number of violations of the margin allowed in the training data. Allowing more violations increases the bias but decreases the variance. Where C is defined as a budget for violations, this means increasing C; in libraries such as scikit-learn, where C is a penalty on violations, it means decreasing C. There is no escaping the relationship between bias and variance in machine learning. Increasing the bias will decrease the variance. Increasing the variance will decrease the bias.
Error Rate
(False Positives + False Negatives)/(Total evaluations)
Specificity
(True Negatives)/(Total Actual Negatives)
Accuracy
(True Positives + True Negatives)/(Total Evaluations)
Precision
(True Positives)/(Total Predicted Positives), i.e. (True Positives)/(True Positives + False Positives)
Sensitivity/Recall
(True Positives)/(Total Actual Positives), i.e. (True Positives)/(True Positives + False Negatives)
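The ratios above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch: the function name classification_metrics and the counts passed to it are hypothetical, chosen only for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return the ratios defined above from raw confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "error_rate": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "specificity": tn / (tn + fp),  # true negatives / all actual negatives
        "precision": tp / (tp + fp),    # true positives / all predicted positives
        "recall": tp / (tp + fn),       # true positives / all actual positives
    }

# Hypothetical counts, not taken from any real classifier.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that error rate and accuracy always sum to 1, while precision and recall differ only in their denominator.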
What are Recommender Systems?
Recommender systems are a subclass of information filtering systems that predict the preference or rating a user would give to an item. They are widely used to recommend movies, news, research articles, products, social tags, music, etc.
What is bagging (ensemble model technique)?
Bagging trains similar learners on small samples of the population and then takes the mean of all their predictions. In generalized bagging, you can use different learners on different populations. As you would expect, this helps reduce the variance error. The steps are: (1) a sample of observations is randomly selected; (2) a subset of features is selected, and a model is created from the sample of observations and the subset of features; (3) the feature from the subset that gives the best split on the training data is selected; (4) this is repeated to create many models, each trained in parallel; (5) the final prediction is the aggregation of the predictions from all the models.
What is bias?
Bias is the error introduced into your model by oversimplifying the machine learning algorithm. It can lead to underfitting. During training, the model makes simplifying assumptions so that the target function is easier to learn.
What is boosting (ensemble model technique)?
Boosting is an iterative technique that adjusts the weight of an observation based on the last classification. If an observation was classified incorrectly, boosting increases the weight of that observation, and vice versa. Boosting in general decreases the bias error and builds strong predictive models. However, it may overfit the training data.
What is the difference between Regression and Classification ML techniques?
Classification trees have dependent variables that are categorical and unordered. Regression trees have dependent variables that are continuous values or ordered whole values. Regression means predicting a numeric output value from the training data. Classification means assigning the output to a class.
What do you understand by the term Normal Distribution?
Data can be distributed in many ways: skewed to the left, skewed to the right, or jumbled up with no clear pattern. In a normal distribution, the data is distributed around a central value with no skew to the left or right, and the random variable follows a symmetrical bell-shaped curve.
Low bias machine learning algorithms
Decision Trees, k-NN and SVM
Explain Decision Tree Algorithm in detail
A decision tree is a supervised machine learning algorithm used mainly for regression and classification. It breaks a dataset down into smaller and smaller subsets while incrementally developing an associated decision tree. The final result is a tree with decision nodes and leaf nodes. Decision trees can handle both categorical and numerical data.
What is deep learning?
Deep learning is a subfield of machine learning built on artificial neural networks, which are inspired by the structure and function of the brain. Machine learning includes many algorithms, such as linear regression, SVMs and neural networks, and deep learning is an extension of neural networks. A conventional neural net has a small number of hidden layers, whereas a deep learning model uses a large number of hidden layers to better capture the input-output relationship.
What is Ensemble Learning?
Ensemble learning is the art of combining a diverse set of learners (individual models) to improve the stability and predictive power of the model. There are many types of ensemble learning, but the two most popular techniques are bagging and boosting.
What is XGBoost?
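XGBoost is an ensemble learning method (boosting) that uses a gradient boosting framework with decision trees as the base learners. In prediction problems involving unstructured data (images, text, etc.), artificial neural networks tend to outperform other algorithms or frameworks; gradient-boosted trees such as XGBoost are instead typically applied to structured (tabular) data.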
What are exploding gradients?
Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training. At an extreme, the values of the weights can become so large that they overflow and result in NaN values. This makes your model unstable and unable to learn from your training data.
What is unsupervised machine learning?
Unsupervised machine learning finds previously unknown patterns in a data set without pre-existing labels.
What is variance?
Variance is the error introduced into your model by its sensitivity to small fluctuations in the training set. The model learns to fit the noise in the training set. It can lead to high sensitivity and overfitting.
What is pruning in a Decision Tree?
Pruning is the process of removing the sub-nodes of a decision node. It is the opposite of splitting.
What is p-value?
When you perform a hypothesis test in statistics, a p-value helps you determine the strength of your results. The claim on trial is called the null hypothesis, and the p-value is a number between 0 and 1: the probability of seeing data at least as extreme as yours if the null hypothesis were true. By convention, a low p-value (≤ 0.05) indicates strong evidence against the null hypothesis, which means we can reject the null hypothesis. A high p-value (> 0.05) indicates weak evidence against the null hypothesis, which means we fail to reject it (this is not the same as proving it true). A p-value right at 0.05 is marginal and could go either way. To put it another way: with a high p-value, your data are likely under a true null; with a low p-value, your data are unlikely under a true null.
What is Data Science?
A blend of tools, algorithms, and machine learning principles used to discover hidden patterns in raw data.
What is TF/IDF vectorization?
tf-idf, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which adjusts for the fact that some words appear more frequently in general.
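As a rough illustration of the weighting described above, here is a minimal sketch. It assumes one common variant (tf = count / document length, idf = log(N / document frequency)); other variants add smoothing. The function name tf_idf and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

In this corpus, "the" appears in every document, so its weight is zero, while "mat" appears in only one document and gets the highest weight in the first document.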
Gradient
The gradient is the vector of partial derivatives of the error with respect to the network weights, calculated during training of a neural network. Its direction and magnitude are used to update the network weights in the right direction and by the right amount.
F1 Score
The harmonic mean of precision and recall, used as a measure of a test's accuracy: 2 * (Precision * Recall)/(Precision + Recall)
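A harmonic mean is pulled toward the smaller of its two inputs, which is why it is preferred here over a plain average. A minimal sketch follows: the function name f1_score and the input values are hypothetical, and returning 0 when both inputs are 0 is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values: a high precision cannot hide a poor recall.
unbalanced = f1_score(0.9, 0.1)  # far below the arithmetic mean of 0.5
balanced = f1_score(0.8, 0.8)
print(unbalanced, balanced)
```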
What cross-validation technique would you use on a time series dataset?
Standard k-fold cross-validation is not suitable here, because a time series is not randomly distributed data: it is inherently ordered chronologically. For time series data, use a technique like forward chaining, where you train the model on past data and then evaluate it on the data that follows. For example: fold 1: train [1], test [2]; fold 2: train [1 2], test [3]; fold 3: train [1 2 3], test [4].
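Forward chaining can be sketched as a generator of index splits. The function name forward_chaining_splits and the equal-sized blocks are assumptions made for illustration, not a prescribed scheme.

```python
def forward_chaining_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs in chronological order.

    Each fold trains on everything before a cut-off and tests on the
    block immediately after it, so the model never sees the future.
    """
    fold_size = n_samples // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train_end = i * fold_size
        # The last fold absorbs any leftover samples.
        test_end = n_samples if i == n_folds else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(forward_chaining_splits(n_samples=10, n_folds=4))
for train, test in splits:
    print("train:", train, "test:", test)
```

If scikit-learn is available, its TimeSeriesSplit class in sklearn.model_selection provides the same kind of expanding-window split.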
What is logistic regression?
Logistic regression, often referred to as the logit model, is a technique for predicting a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular political leader will win an election. The outcome of the prediction is binary, i.e. 1 or 0 (win/lose). The predictor variables here would be the amount of money spent on the candidate's election campaign, the amount of time spent campaigning, etc.
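The election example can be sketched as follows. The coefficients are hand-picked and hypothetical (not fitted to any data), as are the function name predict_win_probability, the feature units, and the 0.5 decision threshold.

```python
import math

def predict_win_probability(features, weights, intercept):
    """Logistic regression prediction: sigmoid of a linear combination."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for: campaign spend (millions), campaign time (months).
weights = [0.8, 0.3]
intercept = -4.0

p = predict_win_probability([3.0, 6.0], weights, intercept)
outcome = 1 if p >= 0.5 else 0  # 1 = win, 0 = lose
print(f"P(win) = {p:.3f}, predicted outcome = {outcome}")
```

The linear combination can take any real value; the sigmoid squashes it into a probability between 0 and 1, which is then thresholded to give the binary outcome.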
What is the difference between machine learning and deep learning?
Machine learning: a field of computer science that gives computers the ability to learn without being explicitly programmed. It can be divided into three categories: supervised machine learning, unsupervised machine learning, and reinforcement learning. Deep learning: a subfield of machine learning concerned with algorithms called artificial neural networks, which are inspired by the structure and function of the brain.
Optimum model complexity
Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens till a particular point. As you continue to make your model more complex, you end up over-fitting your model and hence your model will start suffering from high variance.
How will you define the number of clusters in a clustering algorithm?
Plot the within-groups sum of squares (WSS) against the number of clusters and find the point after which WSS stops decreasing noticeably. This is known as the bending point, or elbow.
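To see what the WSS curve looks like, here is a minimal sketch using a basic one-dimensional k-means. The function name kmeans_wss, the deterministic initialization, and the toy data (three obvious groups) are all assumptions for illustration; a real analysis would use a library implementation and plot the values.

```python
def kmeans_wss(points, k, n_iter=20):
    """Run a basic 1-D k-means and return the within-cluster sum of squares."""
    data = sorted(points)
    n = len(data)
    # Deterministic start: centres spread evenly through the sorted data.
    centers = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Move each centre to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sum(min((x - c) ** 2 for c in centers) for x in data)

# Hypothetical data with three well-separated groups.
points = [1, 2, 3, 11, 12, 13, 21, 22, 23]
wss = {k: kmeans_wss(points, k) for k in range(1, 6)}
for k, value in wss.items():
    print(k, round(value, 2))
```

WSS falls steeply up to k = 3 and only marginally afterwards, so the bending point suggests three clusters for this data.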
What is Random Forest?
Random forest is a versatile machine learning method capable of performing both regression and classification tasks. It is also used for dimensionality reduction and can handle missing values and outlier values. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model. In a random forest, we grow multiple trees as opposed to a single tree. To classify a new object based on its attributes, each tree gives a classification. The forest chooses the classification with the most votes (over all the trees in the forest); in the case of regression, it takes the average of the outputs of the different trees.
What are Recurrent Neural Networks (RNNs)?
Recurrent nets are a type of artificial neural network designed to recognize patterns in sequences of data, such as time series, stock market data, and government agency records. To understand recurrent nets, you first have to understand the basics of feedforward nets. Both kinds of network are named after the way they channel information through a series of mathematical operations performed at the nodes of the network. A feedforward net feeds information straight through (never touching the same node twice), while a recurrent net cycles it through a loop.
What is bootstrap?
Bootstrap refers to random sampling with replacement. It allows us to better understand the bias and the variance of an estimate made from the dataset. Bootstrap involves randomly sampling small subsets of data from the dataset: (1) create many random sub-samples of the dataset with replacement (meaning the same value can be selected multiple times); (2) calculate the mean of each sub-sample; (3) calculate the average of all the collected means and use that as the estimated mean for the data.
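The procedure above can be sketched with the standard library alone. The function name bootstrap_mean, the fixed seed, the number of resamples, and the data values are hypothetical; the sketch also assumes each resample is the same size as the original data, which is the usual convention.

```python
import random

def bootstrap_mean(data, n_resamples=1000, seed=0):
    """Estimate the mean by averaging the means of bootstrap resamples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    resample_means = []
    for _ in range(n_resamples):
        # Sampling WITH replacement: the same value can be drawn repeatedly.
        resample = rng.choices(data, k=len(data))
        resample_means.append(sum(resample) / len(resample))
    return sum(resample_means) / n_resamples, resample_means

data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9]  # hypothetical measurements
estimate, means = bootstrap_mean(data)
print(f"sample mean: {sum(data) / len(data):.3f}")
print(f"bootstrap estimate: {estimate:.3f}")
```

The spread of the collected means is what gives a sense of the variance of the estimate: a wide spread means the mean would change a lot from sample to sample.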
Explain what regularization is and why it is useful.
Regularization is the process of adding a tuning parameter to a model to induce smoothness and so prevent overfitting. This is most often done by adding a penalty to the loss function: a constant multiple of a norm of the existing weight vector. The norm is often the L1 norm (lasso) or the L2 norm (ridge). The model is then fitted to minimize the regularized loss function calculated on the training set.
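One common way to write this is as a penalty on the weights added to the data-fit loss. A minimal sketch follows: the function name regularized_loss, the constant lam, and the numbers are hypothetical, and the squared L2 norm is used for ridge, as is conventional.

```python
def regularized_loss(weights, squared_error, lam, penalty="l2"):
    """Add an L1 (lasso) or L2 (ridge) penalty to a data-fit loss."""
    if penalty == "l1":
        penalty_term = sum(abs(w) for w in weights)  # sum of absolute values
    elif penalty == "l2":
        penalty_term = sum(w * w for w in weights)   # sum of squares
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return squared_error + lam * penalty_term

weights = [3.0, -4.0]  # hypothetical model weights
l1 = regularized_loss(weights, squared_error=10.0, lam=0.5, penalty="l1")
l2 = regularized_loss(weights, squared_error=10.0, lam=0.5, penalty="l2")
print(l1, l2)
```

Because large weights raise the loss, the fitted model is pushed toward smaller weights; a larger lam pushes harder.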
What is reinforcement learning?
Reinforcement learning is learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which action to take, but must instead discover which actions yield the maximum reward. Reinforcement learning is inspired by how human beings learn and is based on a reward/penalty mechanism.
Explain SVM machine learning algorithm in detail.
SVM stands for support vector machine. It is a supervised machine learning algorithm that can be used for both regression and classification. SVMs are based on the idea of finding the hyperplane that best divides a dataset into two (or more) classes.
What is Selection bias?
Selection bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, with the result that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. The phrase "selection bias" most often refers to the distortion of a statistical analysis resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate.
What is Supervised Learning?
Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).
What are support vectors in SVM?
Support vectors are the data points nearest to the hyperplane, the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a data set.
High bias machine learning algorithms
Linear Regression, Logistic Regression
What is naive bayes?
The Naive Bayes algorithm is based on Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. The algorithm is 'naive' because it assumes that the features are independent of one another given the class, an assumption that may or may not turn out to be correct.
Explain how a ROC curve works
The ROC curve is a graphical representation of the true positive rate plotted against the false positive rate at various classification thresholds. It is often used to show the trade-off between sensitivity (true positive rate) and the false positive rate.
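The points of a ROC curve can be computed by sweeping a threshold over the classifier's scores. This is a minimal sketch: the function name roc_points, the labels, and the scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) for each threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Threshold above every score: nothing is predicted positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical labels (1 = positive) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so both rates rise together from (0, 0) to (1, 1); a better classifier climbs in true positive rate before it gains false positives.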
What is a confusion matrix?
The confusion matrix is a 2x2 table that contains the four outcomes produced by a binary classifier. Various measures, such as error rate, accuracy, specificity, sensitivity (recall) and precision, are derived from it. A binary classifier predicts every data instance of a test dataset as either positive or negative. This produces four outcomes: true positive (TP), a correct positive prediction; false positive (FP), an incorrect positive prediction; true negative (TN), a correct negative prediction; false negative (FN), an incorrect negative prediction.
What is Entropy and Information gain in Decision tree algorithm?
A core algorithm for building decision trees is ID3. ID3 uses entropy and information gain to construct a decision tree. Entropy: a decision tree is built top-down from a root node and involves partitioning the data into homogeneous subsets. ID3 uses entropy to check the homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided between two classes, the entropy is one. Information gain: the information gain is the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attributes that return the highest information gain.
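Both quantities are short to compute. The following is a minimal sketch using base-2 entropy; the function names and the toy data are hypothetical, and this shows only the split-scoring step, not a full ID3 implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Decrease in entropy after splitting `labels` by an attribute."""
    total = len(labels)
    groups = {}
    for label, value in zip(labels, attribute_values):
        groups.setdefault(value, []).append(label)
    # Entropy of each subset, weighted by the subset's share of the data.
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Hypothetical toy data: does a person play outside?
play = ["yes", "yes", "no", "no", "yes", "no"]
outlook = ["sunny", "sunny", "rainy", "rainy", "sunny", "rainy"]
windy = ["true", "false", "true", "false", "true", "false"]

gain_outlook = information_gain(play, outlook)
gain_windy = information_gain(play, windy)
print(entropy(play), gain_outlook, gain_windy)
```

Here the labels are equally divided, so the starting entropy is one. Splitting on outlook yields perfectly homogeneous subsets, so it has the higher gain and would be chosen as the split attribute.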
What are kernels in SVM?
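A kernel function takes data as input and transforms it into the required form. In an SVM, it computes the similarity of two data points as if they had been mapped into a higher-dimensional space, where a separating hyperplane may be easier to find, without computing that mapping explicitly.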
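One common reading of "transforming data into the required form" is the kernel trick: the kernel returns the same value as an inner product in a transformed feature space, without building that space. A minimal sketch follows, using a degree-2 polynomial kernel chosen for illustration; the function names and input points are hypothetical.

```python
import math

def polynomial_kernel(x, y):
    """Degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2

def explicit_feature_map(x):
    """The 3-D feature map this kernel corresponds to for 2-D input."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

x, y = [1.0, 2.0], [3.0, 4.0]
via_kernel = polynomial_kernel(x, y)
via_mapping = sum(a * b for a, b in zip(explicit_feature_map(x),
                                        explicit_feature_map(y)))
print(via_kernel, via_mapping)
```

The two values agree up to floating-point rounding, but the kernel never constructs the three-dimensional vectors. For kernels whose feature space is very large, that saving is what makes them practical.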
