DS_interview_final
What are the key concepts introduced by the Spark ML APIs?
1. ML Dataset 2. Transformer 3. Estimator 4. Pipeline 5. Param
What are some feature engineering techniques?
1. TF x IDF 2. ChiSquare 3. Kernel Trick 4. Hashing 5. Binning
Describe supervised learning in more detail
1. Training phase: a sample extracted from the true labels is used to learn a family of models. 2. Validation phase 3. Test phase 4. Application phase
What's a Fourier transform?
A Fourier transform is a general method to decompose a function into a superposition of symmetric functions. Or, as a more intuitive tutorial puts it: given a smoothie, it's how we find the recipe. The Fourier transform finds the set of cycle speeds, amplitudes and phases to match any time signal. A Fourier transform converts a signal from the time domain to the frequency domain — it's a very common way to extract features from audio signals or other time series such as sensor data.
What is a Transformer?
A Transformer is an algorithm which can transform one DataFrame into another.
What's the difference between a generative and discriminative model?
A generative model learns the distribution of each category of data, while a discriminative model simply learns the boundary between different categories of data. Discriminative models will generally outperform generative models on classification tasks.
Cross Validation
A model validation technique that splits training data into two parts: one is a training set and the other is a validation set. Checks how well a model will generalize to new data.
How would you evaluate a logistic regression model?
A subsection of the question above. You have to demonstrate an understanding of what the typical goals of a logistic regression are (classification, prediction etc.) and bring up a few examples and use cases.
How would we reduce bias?
Add more features / more complex model
Why is it important to have a robust set of metrics for machine learning?
Any ML technique should be evaluated using metrics that assess the quality of its results.
What is the mean?
Arithmetic mean is the sum of values / number of values. Central value of a discrete set of numbers.
What is the Central Limit Theorem?
The average of a large number of random variables independently drawn from the same distribution is approximately normally distributed, regardless of the shape of the original distribution.
What is Bayes' Theorem? How is it useful in a machine learning context?
Bayes' Theorem gives you the posterior probability of an event given what is known as prior knowledge. Mathematically, it's expressed as the true positive rate of a condition sample divided by the sum of the false positive rate of the population and the true positive rate of the condition sample. Say a flu test comes back positive, but out of people who actually have the flu the test is only positive 60% of the time, out of people without the flu it is (falsely) positive 50% of the time, and the overall population only has a 5% chance of having the flu. Would you actually have a 60% chance of having the flu after a positive test? Bayes' Theorem says no. It says your chance is (0.6 * 0.05) (true positive rate x prior) / ((0.6 * 0.05) + (0.5 * 0.95)) (true positives plus false positives) = 0.0594, or a 5.94% chance of having the flu. Bayes' Theorem is the basis behind a branch of machine learning that most notably includes the Naive Bayes classifier. That's something important to consider when you're faced with machine learning interview questions.
What are Bayesian Networks (BN) ?
A Bayesian Network is a graphical model used to represent the probabilistic relationships among a set of variables.
Explain the two components of Bayesian logic program?
A Bayesian logic program consists of two components. The first component is a logical one; it consists of a set of Bayesian clauses, which capture the qualitative structure of the domain. The second component is a quantitative one; it encodes the quantitative information about the domain.
What is bias?
Bias is the error representing missing relations between features and outputs
How would you simulate the approach AlphaGo took to beat Lee Sedol at Go?
AlphaGo beating Lee Sedol, the best human player at Go, in a best-of-five series was a truly seminal event in the history of machine learning and deep learning. The Nature paper above describes how this was accomplished with "Monte-Carlo tree search with deep neural networks that have been trained by supervised learning, from human expert games, and by reinforcement learning from games of self-play."
What are some differences between a linked list and an array?
An array is an ordered collection of objects. A linked list is a series of objects with pointers that direct how to process them sequentially. An array assumes that every element has the same size, unlike the linked list. A linked list can more easily grow organically: an array has to be pre-defined or re-defined for organic growth. Shuffling a linked list involves changing which pointers direct where — meanwhile, shuffling an array is more complex and takes more memory.
How would you handle an imbalanced dataset?
An imbalanced dataset is when you have, for example, a classification test and 90% of the data is in one class. That leads to problems: an accuracy of 90% can be skewed if you have no predictive power on the other category of data! Here are a few tactics to get over the hump: 1- Collect more data to even the imbalances in the dataset. 2- Resample the dataset to correct for imbalances. 3- Try a different algorithm altogether on your dataset. What's important here is that you have a keen sense for what damage an unbalanced dataset can cause, and how to balance that.
Which technique is used to predict categorical responses?
Classification technique
What is associative rule learning?
The computer is given a large set of observations made up of multiple variables. The task is to learn relationships between variables, e.g., if A and B then C.
What is covariance?
Covariance is a measure of how much two random variables change together.
What is cross-validation?
Cross-validation is a technique for evaluating ML models by training several models on subsets of the available input data and evaluating them on the complementary subset of the data. In k-fold cross-validation, you split the input data into k subsets of data (also known as folds).
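A minimal k-fold sketch, assuming scikit-learn is available; the dataset and model choice are just for illustration:
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Train on k-1 folds and evaluate on the held-out fold, k=5 times.
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```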
Why ensemble learning is used?
Ensemble learning is used to improve the classification, prediction, function approximation, etc., of a model.
When to use ensemble learning?
Ensemble learning is used when you can build component classifiers that are accurate and independent of each other.
Name an example where ensemble techniques might be useful.
Ensemble techniques use a combination of learning algorithms to optimize better predictive performance. They typically reduce overfitting in models and make the model more robust (unlikely to be influenced by small changes in the training data). You could list some examples of ensemble methods, from bagging to boosting to a "bucket of models" method and demonstrate how they could increase predictive power.
Why is feature engineering so important ?
Features are what you use to make predictions. Your choice of features can dramatically affect your model regardless of the algorithm you use. A simple algorithm on a good set of features can perform better than a sophisticated algorithm on a bad set of features.
What is Genetic Programming?
Genetic programming is one of the two techniques used in machine learning (along with inductive learning). It is based on testing candidate solutions and selecting the best choice among a set of results.
What is the variance?
How spread out a set of numbers is: with a small variance, values are close to the mean; with a large variance, they are spread far from it.
What is dimension reduction in Machine Learning?
In machine learning and statistics, dimension reduction is the process of reducing the number of random variables under consideration; it can be divided into feature selection and feature extraction.
What is 'Overfitting' in Machine learning?
In machine learning, 'overfitting' occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting is normally observed when a model is excessively complex, because it has too many parameters relative to the number of training data points. A model that has been overfit exhibits poor predictive performance.
What is 'Training set' and 'Test set'?
In various areas of information science, such as machine learning, a set of data used to discover a potentially predictive relationship is known as a 'training set'. The training set is the set of examples given to the learner, while the test set is used to test the accuracy of the hypotheses generated by the learner; it is the set of examples held back from the learner. The training set is distinct from the test set.
What is an Incremental Learning algorithm in ensemble?
Incremental learning is the ability of an algorithm to learn from new data that may become available after a classifier has already been generated from an existing dataset.
What is latent semantic indexing?
An indexing and retrieval method that uses singular value decomposition to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. It is based on the principle that words used in the same contexts tend to have similar meanings. "Latent" means the semantic associations between words are present not explicitly but only latently. For example, two synonyms may never occur in the same passage but should nonetheless have highly associated representations.
What is Inductive Logic Programming in Machine Learning?
Inductive Logic Programming (ILP) is a subfield of machine learning which uses logic programming to represent background knowledge and examples.
Why are instance-based learning algorithms sometimes referred to as lazy learning algorithms?
Instance-based learning algorithms are also referred to as lazy learning algorithms because they delay the induction or generalization process until classification is performed.
What cross-validation technique would you use on a time series dataset?
Instead of using standard k-folds cross-validation, you have to pay attention to the fact that a time series is not randomly distributed data — it is inherently ordered by chronological order. If a pattern emerges in later time periods for example, your model may still pick up on it even if that effect doesn't hold in earlier years! You'll want to do something like forward chaining where you'll be able to model on past data then look at forward-facing data. fold 1 : training [1], test [2] fold 2 : training [1 2], test [3] fold 3 : training [1 2 3], test [4] fold 4 : training [1 2 3 4], test [5] fold 5 : training [1 2 3 4 5], test [6]
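The forward-chaining scheme above can be sketched with scikit-learn's TimeSeriesSplit; the toy data below is made up for illustration:
```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(-1, 1)   # 24 ordered time steps (toy data)
y = np.arange(24)

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    # Each fold trains on all earlier observations and tests on the next block.
    print(f"fold {fold}: train={train_idx.min()}-{train_idx.max()}, "
          f"test={test_idx.min()}-{test_idx.max()}")
```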
How is KNN different from k-means clustering?
K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. While the mechanisms may seem similar at first, what this really means is that in order for K-Nearest Neighbors to work, you need labeled data you want to classify an unlabeled point into (thus the nearest neighbor part). K-means clustering requires only a set of unlabeled points and a pre-specified number of clusters k: the algorithm will take unlabeled points and gradually learn how to cluster them into groups by computing the mean of the distance between different points. The critical difference here is that KNN needs labeled points and is thus supervised learning, while k-means doesn't — and is thus unsupervised learning.
What is algorithm independent machine learning?
Machine learning whose mathematical foundations are independent of any particular classifier or learning algorithm is referred to as algorithm-independent machine learning.
Where do you usually source datasets?
Machine learning interview questions like these try to get at the heart of your machine learning interest. Somebody who is truly passionate about machine learning will have gone off and done side projects on their own, and have a good idea of what great datasets are out there. If you're missing any, check out Quandl for economic and financial data, and Kaggle's Datasets collection for another great list.
How do you think Google is training data for self-driving cars?
Machine learning interview questions like this one really test your knowledge of different machine learning methods, and your inventiveness if you don't know the answer. Google is currently using recaptcha to source labelled data on storefronts and traffic signs. They are also building on training data collected by Sebastian Thrun at GoogleX — some of which was obtained by his grad students driving buggies on desert dunes!
What is Machine learning?
Machine learning is a branch of computer science which deals with programming systems so that they automatically learn and improve with experience. For example, robots are programmed so that they can perform tasks based on the data they gather from sensors; they automatically learn programs from data.
Mention the difference between Data Mining and Machine learning?
Machine learning relates to the study, design, and development of algorithms that give computers the capability to learn without being explicitly programmed. Data mining, on the other hand, can be defined as the process of extracting knowledge or unknown interesting patterns from unstructured data; machine learning algorithms are often used during this process.
What are some classification methods?
Naive Bayes, SVM, Decision Trees, and Neural Networks
What is PAC Learning?
PAC (Probably Approximately Correct) learning is a learning framework that has been introduced to analyze learning algorithms and their statistical efficiency.
What is PCA, KPCA and ICA used for?
PCA (Principal Components Analysis), KPCA (Kernel-based Principal Component Analysis) and ICA (Independent Component Analysis) are important feature extraction techniques used for dimensionality reduction.
What is Parquet?
Parquet is a columnar file format for saving and retrieving tabular data efficiently.
In what areas Pattern Recognition is used?
Pattern Recognition can be used in a) Computer Vision b) Speech Recognition c) Data Mining d) Statistics e) Information Retrieval f) Bio-Informatics
What is statistical power?
Probability that the test correctly rejects the null hypothesis when the alternate hypothesis is true
How is a decision tree pruned?
Pruning is what happens in decision trees when branches that have weak predictive power are removed in order to reduce the complexity of the model and increase the predictive accuracy of a decision tree model. Pruning can happen bottom-up and top-down, with approaches such as reduced error pruning and cost complexity pruning. Reduced error pruning is perhaps the simplest version: starting at the leaves, replace each node with its most popular class; if predictive accuracy doesn't decrease, keep the change. While simple, this heuristic actually comes pretty close to an approach that would optimize for maximum accuracy.
Example of feature engineering
Text files: bag of words. 1. Each word is associated with a unique integer 2. For each document, the number of occurrences of each word is computed and stored in a matrix
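A bag-of-words sketch of the two steps above, assuming scikit-learn's CountVectorizer; the documents are invented for illustration:
```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # document-term count matrix (sparse)

print(vectorizer.vocabulary_)            # each word mapped to a unique integer index
print(counts.toarray())                  # occurrences of each word per document
```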
What's the F1 score? How would you use it?
The F1 score is a measure of a model's performance. It is a weighted average of the precision and recall of a model, with results tending to 1 being the best, and those tending to 0 being the worst. You would use it in classification tests where true negatives don't matter much.
What is classification?
The computer is given pairs of (inputs, target classes) and the computer learns to attribute classes to unseen data.
What is the difference between heuristic for rule learning and heuristics for decision trees?
The difference is that the heuristics for decision trees evaluate the average quality of a number of disjoint sets, while rule learners only evaluate the quality of the set of instances that is covered by the candidate rule.
What are the different methods for Sequential Supervised Learning?
The different methods to solve Sequential Supervised Learning problems are a) Sliding-window methods b) Recurrent sliding windows c) Hidden Markov models d) Maximum entropy Markov models e) Conditional random fields f) Graph transformer networks
What are the different Algorithm techniques in Machine Learning?
The different types of techniques in Machine Learning are a) Supervised Learning b) Unsupervised Learning c) Semi-supervised Learning d) Reinforcement Learning e) Transduction f) Learning to Learn
What is bias-variance decomposition of classification error in ensemble method?
The expected error of a learning algorithm can be decomposed into bias and variance. A bias term measures how closely the average classifier produced by the learning algorithm matches the target function. The variance term measures how much the learning algorithm's prediction fluctuates for different training sets.
What is the general principle of an ensemble method and what is bagging and boosting in ensemble method?
The general principle of an ensemble method is to combine the predictions of several models built with a given learning algorithm in order to improve robustness over a single model. Bagging is an ensemble method for improving unstable estimation or classification schemes, while boosting methods are applied sequentially to reduce the bias of the combined model. Both boosting and bagging can reduce errors by reducing the variance term.
Why are vectors used in machine learning?
They give a synthetic summary of the characteristics of real-world objects.
How do you ensure you're not overfitting with a model?
This is a simple restatement of a fundamental problem in machine learning: the possibility of overfitting training data and carrying the noise of that data through to the test set, thereby providing inaccurate generalizations. There are three main methods to avoid overfitting: 1- Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data. 2- Use cross-validation techniques such as k-folds cross-validation. 3- Use regularization techniques such as LASSO that penalize certain model parameters if they're likely to cause overfitting.
Pick an algorithm. Write the pseudo-code for a parallel implementation.
This kind of question demonstrates your ability to think in parallelism and how you could handle concurrency in programming implementations dealing with big data. Take a look at pseudocode frameworks such as Peril-L and visualization tools such as Web Sequence Diagrams to help you demonstrate your ability to write code that reflects parallelism.
What do you think of our current data process?
This kind of question requires you to listen carefully and impart feedback in a manner that is constructive and insightful. Your interviewer is trying to gauge if you'd be a valuable member of their team and whether you grasp the nuances of why certain things are set the way they are in the company's data process based on company- or industry-specific conditions. They're trying to see if you can be an intellectual peer. Act accordingly.
Which is more important to you- model accuracy, or model performance?
This question tests your grasp of the nuances of machine learning model performance! Machine learning interview questions often look towards the details. There are models with higher accuracy that can perform worse in predictive power — how does that make sense? Well, it has everything to do with how model accuracy is only a subset of model performance, and at that, a sometimes misleading one. For example, if you wanted to detect fraud in a massive dataset with a sample of millions, a more accurate model would most likely predict no fraud at all if only a vast minority of cases were fraud. However, this would be useless for a predictive model — a model designed to find fraud that asserted there was no fraud at all! Questions like this help you demonstrate that you understand model accuracy isn't the be-all and end-all of model performance.
What's your favorite algorithm, and can you explain it to me in less than a minute?
This type of question tests your understanding of how to communicate complex and technical nuances with poise and the ability to summarize quickly and efficiently. Make sure you have a choice and make sure you can explain different algorithms so simply and effectively that a five-year-old could grasp the basics!
Explain what regularization is and why it is useful.
Used to prevent overfitting and improve the generalization of a model. Decreases the complexity of a model by introducing a regularization term into the general loss function (adding a term to the minimization problem). Imposes Occam's Razor on the solution.
What's a null hypothesis?
The null hypothesis is the default "status quo" assumption (e.g., no effect); a statistical test asks whether we can reject it as highly improbable given the data.
Which data visualization libraries do you use? What are your thoughts on the best data visualization tools?
What's important here is to define your views on how to properly visualize data and your personal preferences when it comes to tools. Popular tools include R's ggplot, Python's seaborn and matplotlib, and tools such as Plot.ly and Tableau.
What's a false negative?
When we wrongly accept the null hypothesis even though it is false (a Type II error).
What's a false positive?
When we wrongly reject the null hypothesis even though it is true (a Type I error).
Do you contribute to any open source projects?
Yes
Robustness
Not sensitive to small changes in the data.
What is recall?
tp / (tp + fn)
What is precision?
tp / (tp + fp)
What evaluation approaches would you work to gauge the effectiveness of a machine learning model?
You would first split the dataset into training and test sets, or perhaps use cross-validation techniques to further segment the dataset into composite sets of training and test sets within the data. You should then implement a choice selection of performance metrics: here is a fairly comprehensive list. You could use measures such as the F1 score, the accuracy, and the confusion matrix. What's important here is to demonstrate that you understand the nuances of how a model is measured and how to choose the right performance measures for the right situations.
Do you have experience with Spark or big data tools for machine learning?
You'll want to get familiar with the meaning of big data for different companies and the different tools they'll want. Spark is the big data tool most in demand now, able to handle immense datasets with speed. Be honest if you don't have experience with the tools demanded, but also take a look at job descriptions and see what tools pop up: you'll want to invest in familiarizing yourself with them.
Common metrics in classification:
Recall / Sensitivity / True Positive Rate: high when FN is low. Sensitive to unbalanced classes. Sensitivity = TP / (TP + FN)
Precision / Positive Predictive Value: high when FP is low. Sensitive to unbalanced classes. Precision = TP / (TP + FP)
Specificity / True Negative Rate: high when FP is low. Sensitive to unbalanced classes. Specificity = TN / (TN + FP)
Accuracy: high when FP and FN are low. Sensitive to unbalanced classes (see "Accuracy paradox"). Accuracy = (TP + TN) / (TN + TP + FP + FN)
ROC / AUC: ROC is a graphical plot that illustrates the performance of a binary classifier (Sensitivity vs 1 − Specificity, or Sensitivity vs Specificity). It is not sensitive to unbalanced classes. AUC is the area under the ROC curve. A perfect classifier has AUC = 1 and falls on the point (0, 1): 100% sensitivity (no FN) and 100% specificity (no FP).
Logarithmic loss: punishes deviation from the true value, infinitely so for a fully confident wrong prediction; it's better to be somewhat wrong than emphatically wrong. logloss = −(1/N) ∑_{i=1}^N [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ]
Misclassification rate: Misclassification = (1/n) ∑_i I(y_i ≠ ŷ_i)
F1-Score: used when the target variable is unbalanced. F1 = 2 × (Precision × Recall) / (Precision + Recall)
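As a hedged illustration, most of these metrics can be computed with scikit-learn; the labels and scores below are toy values, not real data:
```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss, confusion_matrix)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard class predictions
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]    # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision", precision_score(y_true, y_pred))   # tp / (tp + fp)
print("recall   ", recall_score(y_true, y_pred))       # tp / (tp + fn)
print("accuracy ", accuracy_score(y_true, y_pred))
print("F1       ", f1_score(y_true, y_pred))
print("AUC      ", roc_auc_score(y_true, y_score))
print("log loss ", log_loss(y_true, y_score))
```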
Define precision and recall
Recall is also known as the true positive rate: the amount of positives your model claims compared to the actual number of positives there are throughout the data. Precision is also known as the positive predictive value, and it is a measure of the amount of accurate positives your model claims compared to the number of positives it actually claims. It can be easier to think of recall and precision in the context of a case where you've predicted that there were 10 apples and 5 oranges in a case of 10 apples. You'd have perfect recall (there are actually 10 apples, and you predicted there would be 10) but 66.7% precision because out of the 15 events you predicted, only 10 (the apples) are correct.
What is not Machine Learning?
a) Artificial Intelligence b) Rule based inference
Explain what is the function of 'Supervised Learning'?
a) Classifications b) Speech recognition c) Regression d) Predict time series e) Annotate strings
What are the two classification methods that SVM ( Support Vector Machine) can handle?
a) Combining binary classifiers b) Modifying binary to incorporate multiclass learning
What are the five popular algorithms of Machine Learning?
a) Decision Trees b) Neural Networks (back propagation) c) Probabilistic networks d) Nearest Neighbor e) Support vector machines
Explain what is the function of 'Unsupervised Learning'?
a) Find clusters of the data b) Find low-dimensional representations of the data c) Find interesting directions in data d) Interesting coordinates and correlations e) Find novel observations/ database cleaning
What are the different categories you can categorized the sequence learning process?
a) Sequence prediction b) Sequence generation c) Sequence recognition d) Sequential decision
Regularization
Smoothing a model to prevent overfitting
What is batch statistical learning?
Statistical learning techniques allow learning a function or predictor from a set of observed data that can make predictions about unseen or future data. These techniques provide guarantees on the performance of the learned predictor on the future unseen data based on a statistical assumption on the data generating process.
How is kNN different from k-means clustering?
Supervised classification algorithm, unsupervised clustering algorithm
How would you validate a regression model?
1. Eyeball it: if predicted values fall outside the range of the response variable, that could indicate poor accuracy.
Give an example of a Transformer
An ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions
Pipeline
A Pipeline chains multiple Transformers and Estimators together to specify a workflow
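A sketch of such a Pipeline, loosely following the standard Spark ML text-classification example (assumes a running PySpark environment; the tiny training DataFrame is made up):
```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
training = spark.createDataFrame(
    [(0, "spark is great", 1.0), (1, "hadoop mapreduce", 0.0)],
    ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")     # Transformer
hashing_tf = HashingTF(inputCol="words", outputCol="features")  # Transformer
lr = LogisticRegression(maxIter=10, regParam=0.01)            # Estimator; Params set via shared API

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])       # chain Transformers + Estimator
model = pipeline.fit(training)                                # Estimator.fit -> Transformer (a model)
predictions = model.transform(training)                       # DataFrame in, DataFrame out
```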
What are the new Spark DataFrame and the Spark Pipeline?
A Spark DataFrame is a table whose columns are explicitly associated with names; a Spark Pipeline chains Transformers and Estimators together to specify an ML workflow.
What is classifier in machine learning?
A classifier in machine learning is a system that inputs a vector of discrete or continuous feature values and outputs a single discrete value, the class.
What is a Gaussian?
A family of functions that show a "bell curve" shape.
Give an example of an Estimator
A learning algorithm is an Estimator which trains on a training set and produces a model
What is a sigmoid function and what is a logistic function?
A sigmoid is an S-shaped function; the logistic function, 1 / (1 + e^-x), is the particular sigmoid used in logistic regression.
How would you implement a recommendation system for our company's users?
A lot of machine learning interview questions of this type will involve implementation of machine learning models to a company's problems. You'll have to research the company and its industry in-depth, especially the revenue drivers the company has, and the types of users the company takes on in the context of the industry it's in.
How can you prove an improvement to an algorithm is an improvement over doing nothing?
Good experimental design 1. No selection bias in test data 2. Test data is a good model of the real world 3. Ensure results are repeatable
Explain what significance means
If a statistical test returns significant, then the effect is unlikely to be from random chance alone
Explain what a confidence interval means
If you reject something with 95% confidence, then in the case where there is no true effect, a result like ours would happen in fewer than 5% of all possible samples.
What is Perceptron in Machine Learning?
In machine learning, the Perceptron is an algorithm for the supervised classification of an input into one of two possible categories; it is a simple linear binary classifier.
What are the advantages of Naive Bayes?
A Naive Bayes classifier will converge more quickly than discriminative models like logistic regression, so you need less training data. Its main disadvantage is that it can't learn interactions between features.
What are percentiles?
A percentile is a metric indicating a value, below which a percentage of values falls.
Param
All Transformers and Estimators now share a common API for specifying parameters
What's a Spark RDD?
An abstraction (Resilient Distributed Dataset) that distributes data across a cluster and marshals it behind the scenes.
Estimator
An estimator is an algorithm which is fit on a DataFrame to produce a Transformer
Why is randomization important in experimental design?
Because it balances out confounding variables: randomization helps ensure that possible confounders are evenly distributed across groups.
What's the trade-off between bias and variance?
Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm you're using. This can lead to the model underfitting your data, making it hard for it to have high predictive accuracy and for you to generalize your knowledge from the training set to the test set. Variance is error due to too much complexity in the learning algorithm you're using. This leads to the algorithm being highly sensitive to high degrees of variation in your training data, which can lead your model to overfit the data. You'll be carrying too much noise from your training data for your model to be very useful for your test data. The bias-variance decomposition essentially decomposes the learning error from any algorithm by adding the bias, the variance and a bit of irreducible error due to noise in the underlying dataset. Essentially, if you make the model more complex and add more variables, you'll lose bias but gain some variance — in order to get the optimally reduced amount of error, you'll have to tradeoff bias and variance. You don't want either high bias or high variance in your model.
How can you avoid overfitting ?
Overfitting can be avoided by using a lot of data; it happens mostly when you have a small dataset and try to learn from it. If you have a small dataset and are forced to build a model based on it, you can use a technique known as cross-validation. In this method the dataset is split into two sections, a testing and a training dataset: the training dataset is used to build the model, while the testing dataset only tests it. The model is trained on a dataset of known data (the training set) and tested against a dataset of unknown data. The idea of cross-validation is to define a dataset to "test" the model during the training phase.
What is Chi-Square Selection?
Chi-Square is a statistical test used to understand if two categorical features are correlated.
When should you use classification over regression?
Classification produces discrete values and sorts the dataset into strict categories, while regression gives you continuous results that allow you to better distinguish differences between individual points. You would use classification over regression if you wanted your results to reflect the belongingness of data points in your dataset to certain explicit categories (e.g., if you wanted to know whether a name was male or female rather than just how correlated it was with male and female names).
What are examples of supervised learning?
Classification, Neural Networks, Regression
What is an example of unsupervised learning?
Clustering and Density Estimations
What is F1?
Combines precision and recall into a single value
K-fold cross validation
Data is divided into training and validation sets k times (folds); the model is fit on each training split, evaluated on the corresponding validation split, and the combination that minimizes validation error is selected.
What is the best way to use Hadoop and R together for analysis?
Don't know
What is the biggest data set that you processed, and how did you process it, what were the results?
Don't know
What is the command used to store R objects in a file?
Don't know
What is the difference between a tuple and a list in Python?
Don't know
What's the difference between Type I and Type II error?
Don't think that this is a trick question! Many machine learning interview questions will be an attempt to lob basic questions at you just to make sure you're on top of your game and you've prepared all of your bases. Type I error is a false positive, while Type II error is a false negative. Briefly stated, Type I error means claiming something has happened when it hasn't, while Type II error means that you claim nothing is happening when in fact something is. A clever way to think about this is to think of Type I error as telling a man he is pregnant, while Type II error means you tell a pregnant woman she isn't carrying a baby.
How would we reduce variance?
Get more data / decrease complexity of the model
What is bias / variance trade off?
More powerful methods have less bias but more variance
How would you deal with categorical features?
One-hot encoding
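A minimal one-hot encoding sketch with pandas; the column and categories are hypothetical:
```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
one_hot = pd.get_dummies(df, columns=["color"])   # one binary column per category
print(one_hot)
```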
What is precision?
Precision: how many selected items are relevant? TP / (TP + FP). Recall: how many relevant elements were selected? TP / (TP + FN).
Explain how precision and recall they relate to the ROC curve?
Recall describes what percentage of true positives are described as positive by the model. Precision describes what percent of positive predictions were correct. The ROC curve shows the relationship between model recall and specificity - specificity being a measure of the percent of true negatives being described as negative by the model. Recall, precision, and the ROC are measures used to identify how useful a given classification model is.
What is regression?
Regression gives the computer pairs of (inputs, continuous targets) and the computer learns to predict continuous values on unseen data
Do you have research experience in machine learning?
Related to the last point, most organizations hiring for machine learning positions will look for your formal experience in the field. Research papers, co-authored or supervised by leaders in the field, can make the difference between you being hired and not. Make sure you have a summary of your research experience and papers ready — and an explanation for your background and lack of formal research experience if you don't.
What is root cause analysis?
Root cause analysis (RCA) is a method of problem solving used for identifying the root causes of faults or problems. A factor is considered a root cause if removal thereof from the problem-fault-sequence prevents the final undesirable event from recurring; whereas a causal factor is one that affects an event's outcome, but is not a root cause. Root cause analysis was initially developed to analyze industrial accidents, but is now widely used in other areas, such as healthcare, project management, or software testing. Here is a useful Root Cause Analysis Toolkit from the state of Minnesota. Essentially, you can find the root cause of a problem and show the relationship of causes by repeatedly asking the question, "Why?", until you find the root of the problem. This technique is commonly called "5 Whys", although it can involve more or fewer than 5 questions.
What is selection bias, why is it important and how can you avoid it?
Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample. For example, if a given sample of 100 test cases was made up of a 60/20/15/5 split of 4 classes which actually occurred in relatively equal numbers in the population, then a given model may make the false assumption that probability could be the determining predictive factor. Avoiding non-random samples is the best way to deal with bias; however, when this is impractical, techniques such as resampling, boosting, and weighting are strategies which can be introduced to help deal with the situation.
What is sequence learning?
Sequence learning is learning from data in which order matters: the task is to predict, generate, or recognize sequences rather than independent examples.
How do you avoid false positive?
Set a proper sample size
What is the difference between supervised and unsupervised machine learning?
Supervised learning requires labeled training data. For example, in order to do classification (a supervised learning task), you'll need to first label the data you'll use to train the model to classify data into your labeled groups. Unsupervised learning, in contrast, does not require labeling data explicitly.
What are support vector machines?
Support vector machines are supervised learning algorithms used for classification and regression analysis.
What is TFIDF?
Term frequency-inverse document frequency. It is a weighting technique for text classification: how important is a word to a document contained in a corpus?
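A short TF-IDF sketch, assuming scikit-learn's TfidfVectorizer (corpus invented for illustration):
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # rows = documents, columns = terms

# Words frequent in one document but rare across the corpus get the highest weights.
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```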
What's the "kernel trick" and how is it useful?
The kernel trick involves kernel functions that can operate in higher-dimensional spaces without explicitly calculating the coordinates of points within that dimension: instead, kernel functions compute the inner products between the images of all pairs of data in a feature space. This gives them the very useful attribute of calculating the coordinates of higher dimensions while being computationally cheaper than the explicit calculation of said coordinates. Many algorithms can be expressed in terms of inner products. Using the kernel trick enables us to effectively run algorithms in a high-dimensional space with lower-dimensional data.
What are your favorite use cases of machine learning models?
The Quora thread above contains some examples, such as decision trees that categorize people into different tiers of intelligence based on IQ scores. Make sure that you have a few examples in mind and describe what resonated with you. It's important that you demonstrate an interest in how machine learning is implemented.
What is supervised learning?
In supervised learning, tuples of examples (input, desired output) are available and the computer uses this to build a model where a given input produces an output (with minimal error)
What is unsupervised learning?
In unsupervised learning, the computer searches for patterns in the data without any examples.
What are the last machine learning papers you've read?
Keeping up with the latest scientific literature on machine learning is a must if you want to demonstrate interest in a machine learning position. This overview of deep learning in Nature by the scions of deep learning themselves (from Hinton to Bengio to LeCun) can be a good reference paper and an overview of what's happening in deep learning — and the kind of paper you might want to cite.
Explain the difference between L1 and L2 regularization.
L2 regularization tends to spread error among all the terms, while L1 is more binary/sparse, with many variables either being assigned a 1 or 0 in weighting. L1 corresponds to setting a Laplace prior on the terms, while L2 corresponds to a Gaussian prior.
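A sketch contrasting the two penalties with scikit-learn's Lasso and Ridge; the synthetic data and alpha values are arbitrary illustrations, not tuned choices:
```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = X[:, 0] * 3.0 + rng.randn(100) * 0.1   # only the first feature matters

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("L1 (sparse):", lasso.coef_.round(2))  # most coefficients driven to exactly 0
print("L2 (spread):", ridge.coef_.round(2))  # small but nonzero weights everywhere
```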
What is linear least squares regression?
Linear least squares regression fits a linear model by choosing the coefficients that minimize the sum of squared residuals between observed and predicted values.
What is one-hot encoding?
Maps a column of categories to a column of sparse binary vectors. Use it if you don't want to impose an order on categorical variables.
What is Variance?
Variance is the error representing sensitivities to small training data fluctuations. (overfitting)
Overfitting
When a model makes good predictions on training data but has poor performance on the test data
Which method is frequently used to prevent overfitting?
When there is sufficient data, 'isotonic regression' is used to prevent overfitting.
How do you handle missing or corrupted data in a dataset?
You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value. In Pandas, there are two very useful methods: isnull() and dropna() that will help you find columns of data with missing or corrupted data and drop those values. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.
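A small sketch of those pandas calls on a hypothetical DataFrame:
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31], "income": [50000, 62000, np.nan]})

print(df.isnull().sum())        # count missing values per column
cleaned = df.dropna()           # option 1: drop rows with any missing value
filled = df.fillna(0)           # option 2: replace missing values with a placeholder (0 here)
imputed = df.fillna(df.mean())  # or with the column mean
```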
What should we worry about if we have an experiment with 20 different metrics?
The more metrics you are measuring, the more likely it is you'll get a false positive
Why overfitting happens?
The possibility of overfitting exists as the criteria used for training the model is not the same as the criteria used to judge the efficacy of a model.
What is Model Selection in Machine Learning?
The process of selecting models among different mathematical models, which are used to describe the same data set is known as Model Selection. Model selection is applied to the fields of statistics, machine learning and data mining.
Give a popular application of machine learning that you see on day to day basis?
The recommendation engine implemented by major ecommerce websites uses Machine Learning
What is variance?
The tendency to learn random things irrespective of the true signal
What are the two methods used for the calibration in Supervised Learning?
The two methods used for predicting good probabilities in supervised learning are a) Platt Calibration b) Isotonic Regression. These methods are designed for binary classification, and extending them to multi-class problems is not trivial.
How can we use your machine learning skills to generate revenue?
This is a tricky question. The ideal answer would demonstrate knowledge of what drives the business and how your skills could relate. For example, if you were interviewing for music-streaming startup Spotify, you could remark that your skills at developing a better recommendation model would increase user retention, which would then increase revenue in the long run. The startup metrics Slideshare linked above will help you understand exactly what performance indicators are important for startups and tech companies as they think about revenue and growth.
What is clustering?
The computer learns how to partition observations into various subsets, so that each partition is made of similar observations.
List down various approaches for machine learning?
The different approaches in Machine Learning are a) Concept Vs Classification Learning b) Symbolic Vs Statistical Learning c) Inductive Vs Analytical Learning
What are the components of relational evaluation techniques?
The important components of relational evaluation techniques are a) Data Acquisition b) Ground Truth Acquisition c) Cross Validation Technique d) Query Type e) Scoring Metric f) Significance Test
What is inductive machine learning?
The inductive machine learning involves the process of learning by examples, where a system, from a set of observed instances tries to induce a general rule.
What is bias?
The learner's tendency to consistently learn the same wrong thing
Describe a hash table.
A hash table is a data structure that produces an associative array: a key is mapped to certain values through the use of a hash function. They are often used for tasks such as database indexing.
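A toy chained hash table for illustration only (Python's built-in dict is the production equivalent; the class and keys here are made up):
```python
class HashTable:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)    # hash function maps key -> bucket

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)        # overwrite existing key
                return
        bucket.append((key, value))             # colliding keys share a bucket (chaining)

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = HashTable()
table.put("user_42", "Alice")
print(table.get("user_42"))
```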
What can be done to avoid local optima?
Avoid local optima in a K-means context: repeat K-means and take the solution that has the lowest cost
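A sketch of this repeat-and-keep-the-best idea using scikit-learn's KMeans, which already reruns the algorithm n_init times with different random starts and keeps the lowest-cost solution (toy data):
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.inertia_)   # cost (sum of squared distances) of the best of the 10 runs
```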
How would you clean a data-set in (insert language here)?
Don't know
What is Interpolation and Extrapolation?
Interpolation is estimating a value within the range of observed data; extrapolation is approximating a value outside that range.
KPI
Key Performance Indicator: a metric used to evaluate how effectively objectives are being met.
Python or R - Which one would you prefer for text analytics?
Python, because of Pandas: it provides data structures and high-performance data analysis tools.
How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?
Proposed methods for model validation: If the values predicted by the model are far outside of the response variable range, this would immediately indicate poor estimation or model inaccuracy. If the values seem to be reasonable, examine the parameters; any of the following would indicate poor estimation or multi-collinearity: opposite signs of expectations, unusually large or small values, or observed inconsistency when the model is fed new data. Use the model for prediction by feeding it new data, and use the coefficient of determination (R squared) as a model validity measure. Use data splitting to form a separate dataset for estimating model parameters, and another for validating predictions. Use jackknife resampling if the dataset contains a small number of instances, and measure validity with R squared and mean squared error (MSE).
With which programming languages and environments are you most comfortable working?
Python, Anaconda environment
What are some pros and cons about your favorite statistical software?
R
Explain what regularization is and why it is useful
Regularization is the process of adding a tuning parameter to a model to induce smoothness in order to prevent overfitting. This is most often done by adding a constant multiple to an existing weight vector. This constant is often either the L1 (Lasso) or L2 (ridge) norm, but can in actuality be any norm. The model predictions should then minimize the mean of the loss function calculated on the regularized training set.
What are the benefits and drawbacks of specific methods, such as lasso regression?
We use an L1 penalty when fitting the model using least squares. It can force regression coefficients to be exactly zero, so it acts as a feature selection method by itself. β̂_lasso = argmin_β { ∑_{i=1}^n (y_i − β_0 − ∑_{j=1}^p x_ij β_j)² + λ ∑_{j=1}^p |β_j| }
What are the benefits and drawbacks of specific methods, such as ridge regression?
We use an L2 penalty when fitting the model using least squares. We add to the minimization problem a shrinkage penalty of the form λ × ∑_j β_j². λ is a tuning parameter that controls the bias-variance tradeoff and is chosen with cross-validation. Ridge is a bit faster than the lasso. β̂_ridge = argmin_β { ∑_{i=1}^n (y_i − β_0 − ∑_{j=1}^p x_ij β_j)² + λ ∑_{j=1}^p β_j² }
Collaborative filtering
A technique used by recommender systems: a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).
Why is Naive Bayes so bad?
Because it assumes features are independent (not correlated), which rarely holds in real data.
What is probabilistic merging (aka fuzzy merging)? Is it easier to handle with SQL or other languages?
Probabilistic (fuzzy) merging joins two tables A and B whose keys are not compatible: in A the key is first name/last name in some character set, in B another character set is used, and data is sometimes missing in A or B. It is not easier to handle with SQL; scripting languages are better suited.
Mapreduce
A programming model for processing large data sets in parallel across a cluster: a map step followed by a reduce step.
What is Linear Regression?
Linear regression models the score of a variable Y as a linear function of one or more predictor variables.
Which languages would you choose for semi-structured text data reconciliation?
scripting languages (Python and Perl)
Why data cleaning plays a vital role in analysis?
Because of the time it takes: data cleaning can consume up to 80% of an analysis project.
Write a function in R language to replace the missing value in a vector with the mean of that vector.
Don't know
What are the areas in robotics and information processing where sequential prediction problem arises?
The areas in robotics and information processing where sequential prediction problem arises are a) Imitation Learning b) Structured prediction c) Model based reinforcement learning
How do you split a continuous variable into different groups/ranks in R?
Don't know
What is deep learning, and how does it contrast with other machine learning algorithms?
Deep learning is a subset of machine learning that is concerned with neural networks: how to use backpropagation and certain principles from neuroscience to more accurately model large sets of unlabelled or semi-structured data. In that sense, deep learning represents an unsupervised learning algorithm that learns representations of data through the use of neural nets.
What is the difference between artificial learning and machine learning?
Designing and developing algorithms that learn behaviours from empirical data is known as machine learning. Artificial intelligence, in addition to machine learning, also covers other aspects like knowledge representation, natural language processing, planning, robotics, etc.
Why is "Naive" Bayes naive?
Despite its practical applications, especially in text mining, Naive Bayes is considered "Naive" because it makes an assumption that is virtually impossible to see in real-life data: the conditional probability is calculated as the pure product of the individual probabilities of components. This implies the absolute independence of features — a condition probably never met in real life. As a Quora commenter put it whimsically, a Naive Bayes classifier that figured out that you liked pickles and ice cream would probably naively recommend you a pickle ice cream. Bayes' Theorem is the basis behind a branch of machine learning that most notably includes the Naive Bayes classifier. That's something important to consider when you're faced with machine learning interview questions.
How would you approach the "Netflix Prize" competition?
The Netflix Prize was a famed competition where Netflix offered $1,000,000 for a better collaborative filtering algorithm. The team that won, called BellKor, achieved a 10% improvement and used an ensemble of different methods to win. Some familiarity with the case and its solution will help demonstrate you've paid attention to machine learning for a while.
Explain how a ROC curve works.
The ROC curve is a graphical representation of the contrast between true positive rates and the false positive rate at various thresholds. It's often used as a proxy for the trade-off between the sensitivity of the model (true positives) vs the fall-out or the probability it will trigger a false alarm (false positives).
Discuss the meaning of the ROC curve, and write pseudo-code to generate the data for such a curve.
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution. In a Receiver Operating Characteristic (ROC) curve the true positive rate (Sensitivity) is plotted in function of the false positive rate (100-Specificity) for different cut-off points.
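One possible plain-Python sketch of generating the (FPR, TPR) points, roughly equivalent to sklearn.metrics.roc_curve; the function name and toy inputs are made up:
```python
import numpy as np

def roc_points(y_true, y_score):
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    points = []
    # Sweep the decision threshold over every observed score, highest first.
    for threshold in sorted(set(y_score), reverse=True):
        y_pred = (y_score >= threshold).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))   # (FPR, TPR)
    return points

print(roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```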
What's the difference between probability and likelihood?
The answer depends on whether you are dealing with discrete or continuous random variables, so I will split my answer accordingly (assuming you want some technical details and not necessarily an explanation in plain English).
Discrete random variables: Suppose you have a stochastic process that takes discrete values (e.g., outcomes of tossing a coin 10 times, number of customers who arrive at a store in 10 minutes, etc.). In such cases, we can calculate the probability of observing a particular set of outcomes by making suitable assumptions about the underlying stochastic process (e.g., the probability that the coin lands heads is p and the coin tosses are independent). Denote the observed outcomes by O and the set of parameters that describe the stochastic process by θ. When we speak of probability, we want to calculate P(O|θ): given specific values for θ, P(O|θ) is the probability that we would observe the outcomes represented by O. However, when we model a real-life stochastic process, we often do not know θ. We simply observe O, and the goal is to arrive at an estimate of θ that would be a plausible choice given the observed outcomes O. We know that given a value of θ, the probability of observing O is P(O|θ). Thus, a 'natural' estimation process is to choose the value of θ that maximizes the probability of actually observing O, i.e., the parameter values that maximize the likelihood function L(θ|O) = P(O|θ). Notice that, by definition, the likelihood function is conditioned on the observed O and is a function of the unknown parameters θ.
Continuous random variables: In the continuous case the situation is similar, with one important difference: we can no longer talk about the probability of observing O given θ, because in the continuous case P(O|θ) = 0. Without getting into technicalities, the basic idea is as follows. Denote the probability density function (pdf) associated with the outcomes O by f(O|θ). In the continuous case we estimate θ given observed outcomes O by maximizing L(θ|O) = f(O|θ). In this situation, we cannot technically assert that we are finding the parameter value that maximizes the probability of observing O; rather, we maximize the pdf associated with the observed outcomes O.
What are the three stages to build the hypotheses or model in machine learning?
The standard approach to supervised learning is to split the set of examples into a training set and a test set.
What are the two paradigms of ensemble methods?
The two paradigms of ensemble methods are a) Sequential ensemble methods b) Parallel ensemble methods
What are two techniques of Machine Learning ?
The two techniques of Machine Learning are a) Genetic Programming b) Inductive Learning
What is ensemble learning?
To solve a particular computational program, multiple models such as classifiers or experts are strategically generated and combined. This process is known as ensemble learning.
What are the phases of supervised machine learning?
Training phase, validation phase, test phase, application
What is binning?
Transforms a continuous feature into a discrete one by grouping values into bins.
Tell me about how you designed the model you created for a past employer or client.
Don't know
What are hash table collisions? How are they avoided? How frequently do they happen?
Don't know
What are the different types of sorting algorithms available in R language?
Don't know
What are the supported data types in Python?
Don't know
What are the two main components of the Hadoop Framework?
Don't know
What are your favorite data visualization techniques?
Don't know
What is a statistical interaction?
Don't know
What is an example of a dataset with a non-Gaussian distribution?
Don't know
What is sampling? How many sampling methods do you know?
Don't know
What is the Binomial Probability Formula?
Don't know
What is the Central Limit Theorem and why is it important?
Don't know
What is the significance of each of these components?
Don't know
Explain what a local optimum is?
A solution that is optimal within a neighboring set of candidate solutions, in contrast with the global optimum: the optimal solution among all candidate solutions.
What packages are you most familiar with? What do you like or dislike about them?
Don't know
How would you effectively represent data with 5 dimensions?
Don't know
Tell me about an original algorithm you've created.
Don't know
Explain what resampling methods are and why they are useful. Also explain their limitations.
Classical statistical parametric tests compare observed statistics to theoretical sampling distributions. Resampling is a data-driven, not theory-driven, methodology based upon repeated sampling within the same sample. Resampling refers to methods for doing one of the following: estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping); exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests); validating models by using random subsets (bootstrapping, cross-validation).
Design of experiments
Design of experiments or experimental design is the initial process used (before data is collected) to split your data, sample and set up a data set for statistical analysis, for instance in A/B testing frameworks or clinical trials.
Explain how MapReduce works as simply as possible.
Don't know
Is it better to have too many false positives, or too many false negatives? Explain.
It depends on the question as well as on the domain for which we are trying to solve the question. In medical testing, false negatives may provide a falsely reassuring message to patients and physicians that disease is absent, when it is actually present. This sometimes leads to inappropriate or inadequate treatment of both the patient and their disease. So, it is desirable to have too many false positives. For spam filtering, a false positive occurs when spam filtering or spam blocking techniques wrongly classify a legitimate email message as spam and, as a result, interfere with its delivery. While most anti-spam tactics can block or filter a high percentage of unwanted emails, doing so without creating significant false-positive results is a much more demanding task. So, we prefer too many false negatives over too many false positives.
Why is the local optimum important in a specific context, such as K-means clustering?
K-means clustering context: It's proven that the objective cost function will always decrease until a local optimum is reached. Results will depend on the initial random cluster assignment
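A small sketch (scikit-learn, toy blob data) showing that single random initializations can converge to different local optima with different final costs; the dataset and seeds are arbitrary:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# n_init=1 keeps a single random initialization so local optima become visible
for seed in range(5):
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  final cost (inertia) = {km.inertia_:.1f}")
# In practice, use n_init > 1 (or init='k-means++') and keep the best run.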
What is latent semantic indexing used for?
Learning correct word meanings, subject matter comprehension, information retrieval, and sentiment analysis (social network analysis).
Model Fitting
Model fitting is a procedure with three steps. First, you need a function that takes in a set of parameters and returns a predicted data set. Second, you need an 'error function' that provides a number representing the difference between your data and the model's prediction for any given set of model parameters; this is usually either the sum of squared errors (SSE) or maximum likelihood. Third, you need to find the parameters that minimize this difference.
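A minimal sketch of those three steps using scipy.optimize.minimize to fit a straight line by least squares; the toy data and the linear model form are assumptions made only for illustration:

import numpy as np
from scipy.optimize import minimize

x = np.array([0., 1., 2., 3., 4.])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Step 1: model function (parameters -> predicted data)
def predict(params, x):
    slope, intercept = params
    return slope * x + intercept

# Step 2: error function (here, the sum of squared errors)
def sse(params):
    return np.sum((y - predict(params, x)) ** 2)

# Step 3: find the parameters that minimize the error
result = minimize(sse, x0=[0.0, 0.0])
print("fitted slope, intercept:", result.x)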
Differentiate between univariate, bivariate and multivariate analysis.
Univariate: descriptive statistical analysis of a single variable (e.g., a pie chart of sales by territory). Bivariate: analysis of the relationship between two variables (e.g., a scatter plot of sales volume vs. spending). Multivariate: the study of more than two variables at once.
What do you think about in-database analytics?
Don't know
Probabilistic merging (fuzzy merging)
You do a join on two tables A and B, but the keys are not exactly compatible, so records are matched approximately (e.g., on the similarity of names or addresses) rather than on exact key equality.
What do you understand by the term Normal Distribution?
A continuous probability distribution in which random variables cluster symmetrically around a central value, with no bias to the left or right, producing the familiar bell-shaped curve.
What is logistic regression? Or State an example when you have used logistic regression recently.
Logistic regression predicts a binary outcome from a set of predictor variables; for example, predicting whether a particular political leader will win an election (a binary outcome) based on a set of predictors.
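A minimal scikit-learn sketch on synthetic data; the dataset and settings are placeholders, and a real election example would swap in the relevant predictors:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit a logistic regression classifier and inspect predicted probabilities
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("predicted class probabilities:", clf.predict_proba(X_te[:3]))
print("test accuracy:", clf.score(X_te, y_te))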
How would you create a taxonomy to identify key customer trends in unstructured data?
Consult the business owner first to define the categories and objectives; the usefulness of the taxonomy to the business matters as much as the statistical accuracy of the results.
Lift
A measure used in data mining and association rule learning: the ratio of the observed support of two items occurring together to the support expected if they were independent.
How would you improve a spam detection algorithm that uses Naïve Bayes?
Use hidden decision trees, or decorrelate your features (Naive Bayes assumes the features are conditionally independent, so correlated features degrade it).
What are Recommender Systems?
Information filtering systems that predict a user's preferences or ratings for items; commonly used, for example, to recommend movies.
Cosine distance
Measures how close two vectors are via the cosine of the angle between them; commonly used to compare two sentences or documents represented as term vectors.
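A minimal sketch: represent two sentences as term-count vectors and compute cosine similarity (1 minus the cosine distance). The example sentences are arbitrary:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the cat sat on the mat", "the cat lay on the rug"]
X = CountVectorizer().fit_transform(sentences).toarray()

a, b = X[0], X[1]
cosine_similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_distance = 1 - cosine_similarity
print(cosine_similarity, cosine_distance)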
Explain what resampling methods are and why they are useful
Resampling means repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. Example: repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ. The most common methods are cross-validation and the bootstrap. Cross-validation: random sampling with no replacement; used for evaluating model performance and for model selection (choosing the appropriate level of flexibility). Bootstrap: random sampling with replacement; mostly used to quantify the uncertainty associated with a given estimator or statistical learning method.
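A minimal scikit-learn sketch of the two schemes mentioned above; the synthetic data, the model, and the fold/resample counts are arbitrary choices:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Cross-validation: sampling WITHOUT replacement, used to estimate test error
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("5-fold CV R^2:", cv_scores.mean())

# Bootstrap: sampling WITH replacement, used to quantify estimator uncertainty
coefs = []
for i in range(500):
    Xb, yb = resample(X, y, replace=True, random_state=i)
    coefs.append(LinearRegression().fit(Xb, yb).coef_)
print("bootstrap std. error of coefficients:", np.std(coefs, axis=0))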
n-gram
A contiguous sequence of n tokens (e.g., words or characters) from a given text; commonly used as features in text mining and language modeling.
Explain what precision and recall are. How do they relate to the ROC curve?
Calculating precision and recall is actually quite easy. Imagine there are 100 positive cases among 10,000 cases. You want to predict which ones are positive, and you pick 200 to have a better chance of catching many of the 100 positive cases. You record the IDs of your predictions, and when you get the actual results you sum up how many times you were right or wrong. There are four ways of being right or wrong: TN / True Negative: case was negative and predicted negative; TP / True Positive: case was positive and predicted positive; FN / False Negative: case was positive but predicted negative; FP / False Positive: case was negative but predicted positive. Now your boss asks you three questions. What percent of your predictions were correct? You answer: the "accuracy" was (9,760 + 60) out of 10,000 = 98.2%. What percent of the positive cases did you catch? You answer: the "recall" was 60 out of 100 = 60%. What percent of positive predictions were correct? You answer: the "precision" was 60 out of 200 = 30%. The ROC curve represents the relation between sensitivity (recall) and specificity (not precision) and is commonly used to measure the performance of binary classifiers. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more representative picture of performance.
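The arithmetic from the example above written out as a short check; the counts are exactly the ones given in the answer:

TP, FP = 60, 140            # 200 predicted positive, 60 of them correct
FN = 100 - TP               # 40 positives missed
TN = 10_000 - TP - FP - FN  # 9,760

accuracy  = (TP + TN) / 10_000   # 0.982
recall    = TP / (TP + FN)       # 0.60 (a.k.a. sensitivity)
precision = TP / (TP + FP)       # 0.30
print(accuracy, recall, precision)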
What are specific ways of determining if you have a local optimum problem?
Determining if you have a local optimum problem: (1) a tendency toward premature convergence; (2) different initializations induce different optima.
Describe a data science project in which you worked with a substantial programming component. What did you learn from that experience?
Don't know
Explain the difference between L1 and L2 regularization methods.
Don't know
Have you used a time series model? Do you understand cross-correlations with time lags?
Don't know
Here is a big dataset. What is your plan for dealing with outliers? How about missing values? How about transformations?
Don't know
How do you access the element in the 2nd column and 4th row of a matrix named M?
Don't know
How would you create a logistic regression model?
Don't know
How would you sort a large list of numbers?
Don't know
Is it better to have 100 small hash tables or one big hash table in memory, in terms of access speed (assuming both fit within RAM)?
Don't know
What are the assumptions required for linear regression?
Don't know
What are the different data objects in R?
Don't know
What is selection bias?
Don't know
What modules/libraries are you most familiar with? What do you like or dislike about them?
Don't know
Common metrics in regression:
Mean Squared Error vs. Mean Absolute Error: RMSE gives a relatively high weight to large errors, so it is most useful when large errors are particularly undesirable. The MAE is a linear score: all the individual differences are weighted equally in the average, so MAE is more robust to outliers than MSE. RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|. Root Mean Squared Logarithmic Error: RMSLE penalizes an under-predicted estimate more than an over-predicted one (the opposite of RMSE). RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\log(p_i + 1) - \log(a_i + 1))^2}, where p_i is the i-th prediction, a_i the i-th actual response, and \log(b) the natural logarithm of b. Weighted Mean Absolute Error: the weighted average of absolute errors. MAE and RMSE assume that each prediction provides equally precise information about the error variation, i.e. that the standard deviation of the error term is constant over all predictions. Example: recommender systems (weighting differences between past and recent products). WMAE = \frac{1}{\sum_i w_i}\sum_{i=1}^{n} w_i|y_i - \hat{y}_i|.
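Minimal NumPy implementations of these metrics, matching the formulas above; the toy arrays and weights are placeholders:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
w      = np.array([1.0, 2.0, 1.0, 0.5])   # example weights for WMAE

rmse  = np.sqrt(np.mean((y_true - y_pred) ** 2))
mae   = np.mean(np.abs(y_true - y_pred))
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
wmae  = np.sum(w * np.abs(y_true - y_pred)) / np.sum(w)
print(rmse, mae, rmsle, wmae)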
How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything?
Often it is observed that in the pursuit of rapid innovation (aka "quick fame"), the principles of scientific methodology are violated, leading to misleading innovations, i.e. appealing insights that are confirmed without rigorous validation. One such scenario: given the task of improving an algorithm to yield better results, you might come up with several ideas with potential for improvement. An obvious human urge is to announce these ideas ASAP and ask for their implementation. When asked for supporting data, often only limited results are shared, which are very likely to be impacted by selection bias (known or unknown) or a misleading global minimum (due to lack of appropriate variety in test data). Data scientists do not let their human emotions overrun their logical reasoning. While the exact approach to prove that an improvement really is an improvement over not doing anything depends on the case at hand, a few common guidelines apply: ensure that there is no selection bias in the test data used for performance comparison; ensure that the test data has sufficient variety to be representative of real-life data (this helps avoid overfitting); ensure that "controlled experiment" principles are followed, i.e. the test environment (hardware, etc.) must be exactly the same while running the original and the new algorithm; ensure that the results are repeatable; and examine whether the results reflect a local or a global maximum/minimum. One common way to satisfy these guidelines is A/B testing, where both versions of the algorithm run in a similar environment for a considerably long time and real-life input data is randomly split between the two. This approach is particularly common in web analytics.
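A minimal sketch of the A/B comparison step, assuming the metric being compared is a success rate; the counts below are made-up, and statsmodels' two-proportion z-test is just one common choice of significance test:

from statsmodels.stats.proportion import proportions_ztest

# Hypothetical outcomes after running both versions side by side
successes = [520, 570]        # original algorithm, new algorithm
trials    = [10_000, 10_000]

stat, p_value = proportions_ztest(count=successes, nobs=trials)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the improvement is unlikely to be random variation.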
Assume you need to generate a predictive model using multiple regression. Explain how you intend to validate this model
Validation using R^2: the percentage of variance retained by the model. Issue: R^2 always increases when variables are added. R^2 = \frac{RSS_{tot} - RSS_{res}}{RSS_{tot}} = \frac{RSS_{reg}}{RSS_{tot}} = 1 - \frac{RSS_{res}}{RSS_{tot}}. Analysis of residuals: check for heteroskedasticity (a relation between the variance of the model errors and the size of an independent variable's observations), scatter plots of residuals vs. predictors, normality of errors, and other diagnostic plots. Out-of-sample evaluation: cross-validation.
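A short sketch of the out-of-sample and residual-analysis parts with scikit-learn; the synthetic data is a placeholder and only a subset of the diagnostics listed above is shown:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, noise=15, random_state=1)

# Out-of-sample R^2 via cross-validation (guards against the
# "R^2 always increases when adding variables" problem on the training set)
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("CV R^2:", cv_r2.mean())

# Residual analysis: residuals vs. fitted values should show no structure
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)
print("residual mean ~ 0:", residuals.mean())
# e.g. plt.scatter(model.predict(X), residuals) to eyeball heteroskedasticity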
Is it better to design robust or accurate algorithms?
The ultimate goal is to design systems with good generalization capacity, that is, systems that correctly identify patterns in data instances not seen before. The generalization performance of a learning system strongly depends on the complexity of the model assumed. If the model is too simple, the system can only capture the actual data regularities in a rough manner; it has poor generalization properties and is said to suffer from underfitting. By contrast, when the model is too complex, the system can identify accidental patterns in the training data that need not be present in the test set. These spurious patterns can be the result of random fluctuations or of measurement errors during data collection; in this case the generalization capacity is also poor, and the learning system is said to be affected by overfitting. Spurious patterns, which are only present by accident in the data, tend to have complex forms. This is the idea behind the principle of Occam's razor for avoiding overfitting: simpler models are preferred if more complex models do not significantly improve the quality of the description of the observations. Quick response: Occam's razor. It depends on the learning task; choose the right balance. Ensemble learning can help balance bias and variance (several weak learners together form a strong learner).
Are you familiar with price optimization, price elasticity, inventory management, competitive intelligence? Give examples.
Those are economics terms that are not frequently asked of data scientists, but they are useful to know. Price optimization is the use of mathematical tools to determine how customers will respond to different prices for a company's products and services through different channels. Big Data and data mining enable the use of personalization for price optimization; companies like Amazon can take optimization further and show different prices to different visitors based on their history, although there is a strong debate about whether this is fair. Price elasticity in common usage typically refers to price elasticity of demand, a measure of price sensitivity, computed as: Price Elasticity of Demand = % Change in Quantity Demanded / % Change in Price. Similarly, price elasticity of supply is an economics measure that shows how the quantity supplied of a good or service responds to a change in its price. Inventory management is the overseeing and controlling of the ordering, storage and use of components that a company will use in the production of the items it will sell, as well as the overseeing and controlling of quantities of finished products for sale. Wikipedia defines competitive intelligence as the action of defining, gathering, analyzing, and distributing intelligence about products, customers, competitors, and any aspect of the environment needed to support executives and managers in making strategic decisions for an organization. Tools like Google Trends, Alexa, and Compete can be used to determine general trends and analyze your competitors on the web.
How to define/select metrics?
Type of task: regression or classification? What is the business goal? What is the distribution of the target variable? What metric do we optimize for? Regression: RMSE (root mean squared error), MAE (mean absolute error), WMAE (weighted mean absolute error), RMSLE (root mean squared logarithmic error), etc. Classification: recall, AUC, accuracy, misclassification error, Cohen's kappa, etc.
What is statistical power?
Wikipedia defines the statistical power (or sensitivity) of a binary hypothesis test as the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true. To put it another way, statistical power is the likelihood that a study will detect an effect when the effect is present. The higher the statistical power, the less likely you are to make a Type II error (concluding there is no effect when, in fact, there is one).
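A small sketch with statsmodels showing the usual power calculation: solving for the sample size needed to reach 80% power. The effect size, alpha, and power values are assumed inputs, not part of the original answer:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Assumed inputs: Cohen's d = 0.5 (medium effect), alpha = 0.05, power = 0.8
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print("required sample size per group:", round(n_per_group))   # ~64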
How to do cross-validation right?
The training and validation data sets have to be drawn from the same population. Predicting stock prices: if a model is trained on a certain 5-year period, it is unrealistic to treat the subsequent 5 years as a draw from the same population. A common mistake: steps such as choosing the kernel parameters of an SVM should be cross-validated as well (see the nested cross-validation sketch below). Bias-variance trade-off for k-fold cross-validation: leave-one-out cross-validation (LOOCV) gives approximately unbiased estimates of the test error, since each training set contains almost the entire data set (n − 1 observations). But we average the outputs of n fitted models, each of which is trained on an almost identical set of observations, so the outputs are highly correlated; since the variance of a mean of quantities increases when the correlation between those quantities increases, the test error estimate from LOOCV has higher variance than the one obtained with k-fold cross-validation. Typically we choose k = 5 or k = 10, as these values have been shown empirically to yield test error estimates that suffer neither from excessively high bias nor from high variance.
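A sketch of "cross-validating the kernel parameters as well": tune the SVM hyperparameters with an inner grid search and estimate generalization error with an outer CV loop (nested cross-validation). The dataset and grid values are arbitrary examples:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: choose kernel parameters (C, gamma) by cross-validation
inner = GridSearchCV(SVC(kernel="rbf"),
                     param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                     cv=5)

# Outer loop: estimate the test error of the whole tuning procedure
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy:", outer_scores.mean())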