Data Science Interview Questions and Answers
Say you're given a large data set. What would be your plan for dealing with outliers? How about missing values? How about transformations?
First explore and profile the data. For outliers: decide whether each is a data-entry error (fix or drop it) or a genuine extreme value (keep it, cap/winsorize it, or use robust statistics such as the median that tolerate it). For missing values: understand why they are missing, then either drop rows or columns if the loss is small, or impute (mean/median/mode, or model-based imputation), possibly adding an indicator flag for missingness. For transformations: scale or standardize features, and apply a log or similar transform to reduce skew when the model benefits from roughly symmetric inputs.
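A minimal pandas sketch of all three steps, assuming a toy DataFrame with a hypothetical "income" column:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value and one extreme outlier
df = pd.DataFrame({"income": [42_000, 51_000, 48_000, np.nan, 1_200_000]})

# Missing values: impute with the median (robust to the outlier)
df["income"] = df["income"].fillna(df["income"].median())

# Outliers: cap (winsorize) at the 1st and 99th percentiles
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(low, high)

# Transformation: log1p to reduce the right skew
df["log_income"] = np.log1p(df["income"])
```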
What is a statistical interaction?
Basically, an interaction is when the effect of one factor (input variable) on the dependent variable (output variable) differs among levels of another factor. (http://icbseverywhere.com/blog/mini-lessons-tutorials-and-support-pages/statistical-interactions/)
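A small illustration with the statsmodels formula API on synthetic data (the names x1, x2, and y are made up for the example):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data where the effect of x1 on y depends on the level of x2
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200),
                   "x2": rng.integers(0, 2, size=200)})
df["y"] = 1.0 * df["x1"] + 2.0 * df["x2"] + 3.0 * df["x1"] * df["x2"] \
          + rng.normal(size=200)

# 'x1 * x2' expands to x1 + x2 + x1:x2; x1:x2 is the interaction term
model = smf.ols("y ~ x1 * x2", data=df).fit()
print(model.params)  # a large x1:x2 coefficient indicates an interaction
```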
What is an exact test?
In statistics, an exact (significance) test is a test where all assumptions upon which the derivation of the distribution of the test statistic is based are met, as opposed to an approximate test (in which the approximation may be made as close as desired by making the sample size big enough). This results in a significance test whose false rejection rate always equals the significance level of the test. For example, an exact test at significance level 5% will in the long run reject true null hypotheses exactly 5% of the time. (https://en.wikipedia.org/wiki/Exact_test)
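Fisher's exact test on a 2x2 contingency table is the classic example; a quick SciPy sketch (the table counts are made up):

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table: rows = treatment/control, cols = success/failure
table = [[8, 2],
         [1, 5]]

# Fisher's exact test computes the p-value from the hypergeometric
# distribution directly, with no large-sample approximation
odds_ratio, p_value = fisher_exact(table)
print(odds_ratio, p_value)
```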
Is it better to have too many false positives or too many false negatives?
It depends on the use case and what you are trying to optimize for. Let's say you are trying to detect cancer patients. Statement: the patient has cancer. False positive: predicting that a patient has cancer when he does not. False negative: predicting that a patient does not have cancer when in fact he does. In this case, false positives are okay: the patient might get an initial shock, but doctors will later find out that he doesn't have cancer, which is far better than not detecting the cancer in the first place. Now take the use case of spam filtering, which mail providers like Gmail and Outlook do automatically. Statement: this email is spam. False positive: predicting that an email is spam when it's not. False negative: predicting that an email is not spam (or is harmless) when it is. In this case, false negatives are okay: a few spam messages slipping through is a nuisance, whereas too many false positives means important personal and professional emails get buried in the spam folder and missed.
In Python, how is memory managed?
In Python, memory is managed in a private heap space. This means that all the objects and data structures will be located in a private heap. However, the programmer won't be allowed to access this heap. Instead, the Python interpreter will handle it. At the same time, the core API will enable access to some Python tools for the programmer to start coding. The memory manager will allocate the heap space for the Python objects while the inbuilt garbage collector will recycle all the memory that's not being used to boost available heap space. (https://www.springboard.com/blog/python-interview-questions/)
Tell me the difference between an inner join, left join/right join, and union.
Picture a Venn diagram of the two tables. An inner join returns only the rows where both tables match on the join key. A left join returns all rows from the left table, with matching rows from the right table (right-side columns are NULL where there is no match); a right join is the mirror image. A full join returns all rows from both tables combined. A UNION, by contrast, stacks the results of two queries vertically rather than matching rows side by side. (https://www.springboard.com/blog/joining-data-tables/)
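If it helps to see it concretely, pandas mirrors these operations; a sketch with two toy tables:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [10, 20, 30]})

inner = left.merge(right, on="id", how="inner")     # ids 2, 3 only
left_join = left.merge(right, on="id", how="left")  # all left rows; NaN where no match
outer = left.merge(right, on="id", how="outer")     # full join: all ids

union_all = pd.concat([left["id"], right["id"]])                    # like UNION ALL
union = pd.concat([left["id"], right["id"]]).drop_duplicates()      # like UNION
```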
What are the supported data types in Python?
Python's built-in (or standard) data types can be grouped into several classes. Sticking to the hierarchy scheme used in the official Python documentation, these are numeric types, sequences, sets, and mappings. (https://www.quora.com/What-are-the-supported-data-types-in-Python)
What are the different data objects in R?
R objects can store values as different core data types (referred to as modes in R jargon); these include numeric (both integer and double), character and logical. (https://mgimond.github.io/ES218/Week02a.html)
Explain the 80/20 rule, and tell me about its importance in model validation.
People usually start with an 80/20 split (80% training set, 20% test set) and then split the training set once more 80/20 to create a validation set. Holding out data the model never sees during training is what makes validation meaningful: the held-out 20% estimates how the model will perform on genuinely new data. (https://www.beyondthelines.net/machine-learning/how-to-split-a-dataset/)
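A sketch of that double split with scikit-learn (placeholder data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(1000).reshape(-1, 1), np.arange(1000)  # placeholder data

# 80/20 train/test split, then 80/20 again to carve out a validation set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42)
```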
What is the difference between SQL and MySQL or SQL Server?
SQL stands for Structured Query Language. It's a standard language for accessing and manipulating databases. MySQL is a database management system, like SQL Server, Oracle, Informix, Postgres, etc. (https://www.quora.com/What-is-the-difference-between-SQL-and-MySQL-or-SQL-Server)
What is the Central Limit Theorem and why is it important?
Suppose that we are interested in estimating the average height among all people. Collecting data for every person in the world is impossible. While we can't obtain a height measurement from everyone in the population, we can still sample some people. The question now becomes: what can we say about the average height of the entire population given a single sample? The Central Limit Theorem addresses this question exactly: the distribution of sample means approaches a normal distribution as the sample size grows, regardless of the population's own distribution. (https://spin.atomicobject.com/2015/02/12/central-limit-theorem-intro/)
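A quick simulation that shows the theorem in action, using a deliberately skewed synthetic "population":

```python
import numpy as np

rng = np.random.default_rng(0)

# A strongly skewed (non-normal) population
population = rng.exponential(scale=170, size=1_000_000)

# Means of many samples of size 50 are approximately normally distributed
sample_means = [rng.choice(population, size=50).mean() for _ in range(5000)]
print(np.mean(sample_means))  # close to the population mean
print(np.std(sample_means))   # close to sigma / sqrt(50)
```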
What is selection bias?
Selection (or 'sampling') bias occurs, in an 'active' sense, when the sample data that is gathered and prepared for modeling has characteristics that are not representative of the true, future population of cases the model will see. That is, active selection bias occurs when a subset of the data is systematically (i.e., non-randomly) excluded from analysis. (https://www.elderresearch.com/blog/selection-bias-in-analytics)
What are the assumptions required for linear regression?
There are four major assumptions: 1. There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data; 2. The errors or residuals of the data are normally distributed and independent from each other; 3. There is minimal multicollinearity between explanatory variables; and 4. Homoscedasticity, meaning the variance around the regression line is the same for all values of the predictor variable.
What are the different types of sorting algorithms available in R language?
R's built-in sort() uses shell sort or radix sort (selectable via the method argument), and classic algorithms such as insertion, bubble, and selection sort can also be implemented by hand. (https://www.quora.com/What-are-the-different-type-of-sorting-algorithms-available-in-R-language)
How would you detect bogus reviews, or bogus Facebook accounts used for bad purposes?
This is an opportunity to showcase your knowledge of machine learning algorithms; specifically, sentiment analysis and text analysis algorithms. Showcase your knowledge of fraudulent behavior—what are the abnormal behaviors that can typically be seen from fraudulent accounts?
What does UNION do? What is the difference between UNION and UNION ALL?
UNION removes duplicate records (where all columns in the results are the same), UNION ALL does not. (https://stackoverflow.com/questions/49925/what-is-the-difference-between-union-and-union-all)
How do you split a continuous variable into different groups/ranks in R?
Use cut() with explicit break points, or quantile-based breaks for equal-sized groups, e.g. cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)), include.lowest = TRUE) for quartiles; dplyr's ntile(x, n) is a convenient shortcut. (https://stackoverflow.com/questions/6104836/splitting-a-continuous-variable-into-equal-sized-groups)
What is the command used to store R objects in a file?
save(x, file = "x.Rdata")
What is root cause analysis?
All of us dread that meeting where the boss asks 'why is revenue down?' The only thing worse than that question is not having any answers! There are many changes happening in your business every day, and often you will want to understand exactly what is driving a given change — especially if it is unexpected. Understanding the underlying causes of change is known as root cause analysis. (https://towardsdatascience.com/how-to-conduct-a-proper-root-cause-analysis-789b9847f84b)
What is an example of a data set with a non-Gaussian distribution?
Plenty of real data sets are non-Gaussian. For example, household incomes are strongly right-skewed (closer to log-normal), counts of events such as website hits per minute follow a Poisson distribution, and binary outcomes such as coin flips follow a Bernoulli distribution. The Gaussian is just one member of the exponential family of distributions; the others are often just as easy to work with, and someone with a solid grounding in statistics can use them where appropriate. (https://www.quora.com/Most-machine-learning-datasets-are-in-Gaussian-distribution-Where-can-we-find-the-dataset-which-follows-Bernoulli-Poisson-gamma-beta-etc-distribution)
What are two main components of the Hadoop framework?
The two core components are the Hadoop Distributed File System (HDFS) for storage and MapReduce for processing; since Hadoop 2, YARN handles resource management between them. (https://www.quora.com/What-are-the-main-components-of-a-Hadoop-Application)
What is linear regression? What do the terms p-value, coefficient, and r-squared value mean? What is the significance of each of these components?
A linear regression is a good tool for quick predictive analysis: for example, the price of a house depends on a myriad of factors, such as its size or its location. To see the relationship between these variables, we build a linear regression, which fits a line of best fit and shows whether the factors have a positive or negative relationship. The coefficient on a predictor is the estimated change in the outcome for a one-unit change in that predictor, holding the others constant. The p-value for a coefficient is the probability of seeing an effect at least that large if the true coefficient were zero, so a small p-value suggests the predictor genuinely matters. The r-squared value is the proportion of variance in the outcome explained by the model, a rough measure of overall fit. (https://www.springboard.com/blog/linear-regression-in-python-a-tutorial/) (http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients)
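A sketch with statsmodels on synthetic house-price data (the numbers are made up) showing where each quantity lives:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
size = rng.normal(1500, 300, 100)                 # hypothetical size (sq ft)
price = 100 * size + rng.normal(0, 20_000, 100)   # hypothetical price

X = sm.add_constant(size)        # adds the intercept term
results = sm.OLS(price, X).fit()

print(results.params)    # coefficients: intercept and slope
print(results.pvalues)   # p-value for each coefficient
print(results.rsquared)  # proportion of variance explained
```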
How is k-NN different from k-means clustering?
k-NN, or k-nearest neighbors, is a classification algorithm, where k is an integer describing the number of neighboring data points that influence the classification of a given observation. k-means is a clustering algorithm, where k is an integer describing the number of clusters to be created from the given data.
When modifying an algorithm, how do you know that your changes are an improvement over not doing anything?
Establish a baseline and a fixed evaluation metric first. Compare the modified algorithm against the unmodified one on the same held-out data (or via cross-validation), and check that the improvement is larger than run-to-run noise, for example with a significance test across folds. In production, an A/B test against the current version is the most direct evidence.
What is one way that you would handle an imbalanced data set that's being used for prediction (i.e., vastly more negative classes than positive classes)?
Common options: resample the data (oversample the minority class, e.g., with SMOTE, or undersample the majority class), weight the classes in the loss function (many libraries expose a class_weight parameter), and evaluate with metrics that stay informative under imbalance, such as precision, recall, and AUC, rather than raw accuracy. (https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18) (https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28)
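A sketch of simple oversampling with scikit-learn's resample on a made-up 95/5 split; setting class_weight='balanced' in a scikit-learn estimator is a one-line alternative:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: 95 negatives, 5 positives
df = pd.DataFrame({"x": range(100), "label": [0] * 95 + [1] * 5})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Oversample the minority class with replacement until the classes match
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced["label"].value_counts())
```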
Write a function in R language to replace the missing value in a vector with the mean of that vector.
One straightforward version: impute_mean <- function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x }
How would you sort a large list of numbers?
For a list that fits in memory, use an O(n log n) comparison sort such as merge sort or quicksort; in practice, the language's built-in sort (Timsort in Python) is hard to beat. If the list is too large for memory, use an external merge sort: sort chunks that fit in memory, write them out, and merge the sorted runs.
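For illustration, a plain merge sort in Python (in practice you would just call sorted()):

```python
def merge_sort(nums):
    """O(n log n) divide-and-conquer sort: split, sort halves, merge."""
    if len(nums) <= 1:
        return nums
    mid = len(nums) // 2
    left, right = merge_sort(nums[:mid]), merge_sort(nums[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```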
How would you create a logistic regression model?
Logistic regression models the probability of a binary outcome by passing a linear combination of the features through the logistic (sigmoid) function; the coefficients are fit by maximum likelihood. A typical workflow: clean and encode the features, split into training and test sets, fit the model, then evaluate with metrics such as accuracy, precision/recall, or AUC.
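A minimal scikit-learn sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # synthetic binary data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # predicted P(y = 1)
print(roc_auc_score(y_test, probs))
```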
How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?
Hold out a test set (or use k-fold cross-validation) and measure error on data the model never saw, using metrics such as RMSE, MAE, and out-of-sample R². Also inspect the residuals for structure (non-linearity, heteroscedasticity), and prefer adjusted R² when comparing models with different numbers of predictors.
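A sketch of the cross-validation part with scikit-learn (synthetic data):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)

# 5-fold cross-validated RMSE: every observation is scored out-of-sample once
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
print(-scores.mean())
```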
How do you deal with sparsity?
Store the data in a sparse format that keeps only the non-zero entries (e.g., compressed sparse row), use algorithms and libraries that operate on sparse input directly, apply regularization to cope with the many rarely-observed features, and consider dimensionality reduction such as truncated SVD to densify the representation.
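A small SciPy sketch of the storage idea:

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.zeros((1000, 1000))
dense[0, 1] = 3.0
dense[42, 7] = 5.0

# CSR stores only the nonzero entries plus index arrays
sparse = csr_matrix(dense)
print(sparse.nnz)                      # 2 stored values instead of 1,000,000
print(sparse.dot(np.ones(1000))[:2])   # sparse ops skip the zeros entirely
```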
Have you used a time series model? Do you understand cross-correlations with time lags?
Yes; classical choices include ARIMA and exponential smoothing, which involve checking for stationarity and examining autocorrelation. The cross-correlation at lag k is the correlation between one series and another shifted by k steps; the lag with the highest correlation indicates how far one series leads or trails the other.
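A sketch of finding the lag with pandas, on synthetic series where y trails x by three steps:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = x.shift(3) + rng.normal(scale=0.1, size=200)  # y lags x by 3 steps

# Correlate y with x shifted by each candidate lag; the peak reveals the lag
for lag in range(6):
    print(lag, round(y.corr(x.shift(lag)), 2))
```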
Have you ever thought about creating your own startup? Around which idea / concept?
(elaborate)
How would you come up with a solution to identify plagiarism?
Represent each document as a set of overlapping word or character n-grams (shingles), then compare documents with a similarity measure such as cosine similarity over TF-IDF vectors or Jaccard similarity over shingle sets; MinHash makes the pairwise comparison scalable. Pairs whose similarity exceeds a threshold are flagged for review.
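A minimal sketch of the TF-IDF approach with scikit-learn (toy documents; character n-grams make it robust to light rewording):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The quick brown fox jumps over the lazy dog",
    "A quick brown fox jumped over a lazy dog",   # near-duplicate
    "Completely unrelated text about databases",
]

# Character n-grams tolerate small rewordings better than whole words
tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(docs)
print(cosine_similarity(tfidf))  # a high off-diagonal score flags the near-duplicate
```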
I have two models of comparable accuracy and computational performance. Which one should I choose for production and why?
With accuracy and computational cost comparable, prefer the model that is simpler, more interpretable, and cheaper to maintain and retrain (Occam's razor). Also weigh robustness to data drift and how easy the model is to debug and explain in production.
If you won a million dollars in the lottery, what would you do with the money?
(elaborate)
Is it better to spend five days developing a 90-percent accurate solution or 10 days for 100-percent accuracy?
Usually the five-day, 90-percent solution: the marginal value of the last few points of accuracy rarely justifies doubling the effort, and a claimed 100-percent accuracy is itself a red flag for overfitting or data leakage. The honest answer is that it depends on the cost of errors; in a safety-critical setting the calculus changes.
Python. What modules/libraries are you most familiar with? What do you like or dislike about them?
(elaborate)
Tell me about how you designed a model for a past employer or client.
(elaborate)
What (outside of data science) are you passionate about?
(elaborate)
What are your top 5 predictions for the next 20 years?
(elaborate)
What can your hobbies tell me that your resume can't?
(elaborate)
What data would you love to acquire if there were no limitations?
(elaborate)
What did you do today? Or what did you do this week / last week?
(elaborate)
What is one thing you believe that most people do not?
(elaborate)
What packages are you most familiar with? What do you like or dislike about them?
(elaborate)
What personality traits do you butt heads with?
(elaborate)
What unique skills do you think you'd bring to the team?
(elaborate)
What are some situations where a general linear model fails?
(https://www.quora.com/What-are-the-limitations-of-linear-regression-modeling-in-data-analysis) Linear regressions are sensitive to outliers. E.g., if most of your data lives in the range (20, 50) on the x-axis but you have one or two points out at x = 200, this could significantly swing your regression results. Similarly, if you build your regression on the range x in (20, 50) and then try to use that model to predict a y-value for x = 200, that is significant extrapolation and is not necessarily going to be accurate. Overfitting: it is easy to overfit your model such that the regression begins to model the random error (noise) in the data rather than just the relationship between the variables; this most commonly arises when you have too many parameters compared to the number of samples. Finally, linear regressions are meant to describe linear relationships between variables, so a nonlinear relationship yields a bad model; however, you can sometimes compensate by transforming the variables (e.g., taking logs) or by adding polynomial terms.
Describe a data science project in which you worked with a substantial programming component. What did you learn from that experience?
(open question)
Tell me about an original algorithm you've created.
(open question)
What are some pros and cons about your favorite statistical software?
(open question)
With which programming languages and environments are you most comfortable working?
(open question)
What's a project you would want to work on at our company?
(research your company of interest)
What is the latest data science book / article you read? What is the latest data mining conference / webinar / class / workshop / training you attended?
11 of the Best Data Science Books: https://www.springboard.com/blog/eleven-best-data-science-books/
Explain the difference between L1 and L2 regularization methods.
A regression model that uses the L1 regularization technique is called lasso regression, and a model that uses L2 is called ridge regression. The key difference is the penalty term: L1 adds the sum of the absolute values of the coefficients (λ Σ|βᵢ|), which can drive some coefficients exactly to zero and so performs feature selection, while L2 adds the sum of their squares (λ Σβᵢ²), which shrinks all coefficients smoothly toward zero but rarely to exactly zero. (https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c)
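A sketch contrasting the two on synthetic data where only a few features matter:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, but only 3 actually influence the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       noise=5, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(np.sum(lasso.coef_ == 0))  # L1 drives many coefficients exactly to zero
print(np.sum(ridge.coef_ == 0))  # L2 shrinks them but rarely to exactly zero
```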
What is the difference between type I vs type II error?
A type I error occurs when the null hypothesis is true, but is rejected. A type II error occurs when the null hypothesis is false, but erroneously fails to be rejected. (https://www.datasciencecentral.com/profiles/blogs/understanding-type-i-and-type-ii-errors)
What is the difference between a tuple and a list in Python?
Lists are mutable and are typically used for homogeneous sequences that may grow or shrink; tuples are immutable and are often used as fixed-structure records of heterogeneous values. Because tuples are immutable they are hashable, so they can serve as dictionary keys, which lists cannot. (https://stackoverflow.com/questions/626759/whats-the-difference-between-lists-and-tuples)
What is sampling? How many sampling methods do you know?
Data sampling is a statistical analysis technique used to select, manipulate, and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined. Common methods include simple random sampling, systematic sampling, stratified sampling, and cluster sampling. (https://searchbusinessanalytics.techtarget.com/definition/data-sampling)
What is the purpose of the group functions in SQL? Give some examples of group functions.
Group (aggregate) functions compute summary statistics over sets of rows. COUNT, MAX, MIN, AVG, and SUM are all group functions; DISTINCT is not itself an aggregate but is often combined with one, as in COUNT(DISTINCT customer_id).
What is the best way to use Hadoop and R together for analysis?
Hadoop and R complement each other quite well in terms of visualization and analytics of big data. There are four different ways of using Hadoop and R together. (https://www.edureka.co/blog/4-ways-to-use-r-and-hadoop-together/)
What are hash table collisions?
If the range of key values is larger than the size of our hash table, which is almost always the case, then we must account for the possibility that two different records with two different keys can hash to the same table index. There are a few different ways to resolve this issue; in hash table vernacular, the solution implemented is referred to as collision resolution. (https://medium.com/@bartobri/hash-crash-the-basics-of-hash-tables-bef82a8ea550)
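A toy illustration of chaining, one common resolution strategy (open addressing is the other major family):

```python
class ChainedHashTable:
    """Toy hash table resolving collisions by chaining (a list per bucket)."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))     # new key (or collision): extend the chain

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("apple", 1)
table.put("melon", 2)   # may land in the same bucket; the chain keeps both
print(table.get("melon"))
```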
Explain how MapReduce works as simply as possible.
MapReduce is a programming model that enables distributed processing of large data sets on compute clusters of commodity hardware. The input is split into pieces and a map function turns each piece into intermediate key-value pairs; the framework then shuffles the pairs so that all values for a given key end up together, and a reduce function aggregates each key's values into the final output. (https://bigdata-madesimple.com/basic-components-of-hadoop-architecture-frameworks-used-for-data-science/)
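The canonical word count, sketched as plain Python to show the three phases:

```python
from collections import defaultdict

docs = ["the cat sat", "the dog sat"]

# Map: emit (key, value) pairs from each input split
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group intermediate values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each key's values into the final result
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```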
Do you think 50 small decision trees are better than a large one? Why?
Note: I will assume that you are talking about ensembles. It depends: is there enough variability in the data to produce 50 significantly different trees? If there isn't variability between the trees, then maybe you do not need a huge number of them, or you don't need them at all. Depending on the ensemble type, an ensemble of weak learners typically yields better results than a single "large" model, because you are less prone to overfitting and you average out model bias. Also, if you have large amounts of data, it may not be feasible to process all of the observations in a single model. However, you will probably lose interpretability when using an ensemble, though whether that matters depends on your objective. Bottom line: always test your solutions with an appropriate fitness measure for your problem, using, in the case of predictive modeling, some out-of-sample dataset. Typically an ensemble is better, but one must be careful about preconceived ideas.
Explain what precision and recall are. How do they relate to the ROC curve?
Recall is the percentage of actual positives that the model labels as positive, i.e. TP / (TP + FN); precision is the percentage of positive predictions that are correct, i.e. TP / (TP + FP). The ROC curve plots the true positive rate (recall) against the false positive rate (1 − specificity, where specificity is the percentage of true negatives correctly labeled as negative) as the decision threshold varies. Recall, precision, and the ROC are all measures of how useful a given classification model is. (http://www.kdnuggets.com/faq/precision-recall.html)
In your opinion, which is more important when designing a machine learning model: model performance or model accuracy?
The accuracy paradox is the finding that accuracy alone is a poor metric for classification models, because a simple model may have a high level of accuracy but be too crude to be useful. For example, if category A accounts for 99% of cases, then predicting that every case is category A achieves 99% accuracy while learning nothing. The underlying issue is class imbalance between the positive class and the negative class, so prior probabilities for these classes need to be accounted for in error analysis. Precision and recall are better measures in such cases, though precision too can be biased by very unbalanced class priors in the test sets. (https://towardsdatascience.com/accuracy-paradox-897a69e2dd9b)
What is the Binomial Probability Formula?
The binomial distribution gives the probability of each possible number of successes x in N independent trials, where each trial succeeds with probability π (the Greek letter pi): P(x) = N! / (x!(N − x)!) · π^x (1 − π)^(N − x). (http://onlinestatbook.com/2/probability/binomial.html)
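A quick check of the formula in Python against SciPy's binomial PMF:

```python
from math import comb

from scipy.stats import binom

n, p = 10, 0.5  # 10 fair coin flips
k = 4           # probability of exactly 4 heads

# Binomial formula: P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)
manual = comb(n, k) * p**k * (1 - p) ** (n - k)
print(manual, binom.pmf(k, n, p))  # both ≈ 0.2051
```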
Which data scientists do you admire most? Which startups?
Top Data Scientists to Follow on Twitter: https://www.springboard.com/blog/top-data-scientists-on-twitter/
How do you access the element in the 2nd column and 4th row of a matrix named M?
We can access elements of a matrix using the square-bracket indexing method, as var[row, column]; so the element in the 4th row and 2nd column of M is M[4, 2]. (https://www.datamentor.io/r-programming/matrix/)
If a table contains duplicate rows, does a query result display the duplicate values by default? How can you eliminate duplicate rows from a query result?
Yes, duplicates are displayed by default. One way to eliminate them is with the DISTINCT keyword, e.g., SELECT DISTINCT col1, col2 FROM table. (https://www.mssqltips.com/sqlservertip/4486/find-and-remove-duplicate-rows-from-a-sql-server-table/)