Data Science Interview Questions
How do you deal with pressure or stressful situations?
"Choose an answer that shows that you can meet a stressful situation head-on in a productive, positive manner and let nothing stop you from accomplishing your goals," says McKee. A great approach is to talk through your go-to stress-reduction tactics (making the world's greatest to-do list, stopping to take 10 deep breaths), and then share an example of a stressful situation you navigated with ease.
Write a function that takes in two sorted lists and outputs a sorted list that is their union.
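The question asks for an actual function, so here is a minimal Python sketch (the function name merge_sorted and the two-pointer approach are my own choices, not from the source); if "union" is meant as a set union, wrap the result in sorted(set(...)) instead.

```python
def merge_sorted(a, b):
    """Merge two already-sorted lists into one sorted list in O(len(a) + len(b))."""
    merged = []
    i, j = 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # One list is exhausted; the remainder of the other is already sorted.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged

print(merge_sorted([1, 3, 5], [2, 3, 6]))  # [1, 2, 3, 3, 5, 6]
```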
Python or R - Which one would you prefer for text analytics?
The best possible answer here is Python, because it has the Pandas library, which provides easy-to-use data structures and high-performance data analysis tools.
What's your dream job?
Along similar lines, the interviewer wants to uncover whether this position is really in line with your ultimate career goals. While "an NBA star" might get you a few laughs, a better bet is to talk about your goals and ambitions—and why this job will get you closer to them.
What is power analysis?
An experimental design technique for determining the sample size required to detect an effect of a given size with a given level of confidence, or, conversely, the power a study has to detect an effect at a given sample size.
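As an illustration, here is a hedged sketch of a typical power analysis (computing the sample size needed for a two-sample t-test) using the statsmodels library; the effect size, alpha, and power values are arbitrary examples, not part of the original answer.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Solve for the sample size per group needed to detect a medium effect (d = 0.5)
# at a 5% significance level with 80% power.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))  # roughly 64 per group
```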
What are you looking for in a new position?
Hint: Ideally the same things that this position has to offer. Be specific.
Do gradient descent methods always converge to same point?
No, they do not, because in some cases they reach a local minimum or a local optimum point rather than the global optimum. It depends on the data and the starting conditions.
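A small illustration of this point (my own toy example, not from the source): on a non-convex function, plain gradient descent ends up in different minima depending on where it starts.

```python
# f(x) = x^4 - 3x^2 + x has two local minima; the starting point decides which one we reach.
def grad(x):
    return 4 * x**3 - 6 * x + 1  # derivative of x^4 - 3x^2 + x

def gradient_descent(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(gradient_descent(x=2.0))   # converges near x ≈ 1.14
print(gradient_descent(x=-2.0))  # converges near x ≈ -1.30, a different minimum
```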
Describe Why Data Cleansing Is So Critical and the Methods You Use to Maintain Clean Data
Bad (a.k.a., "dirty") data leads to incorrect insights, which can hurt an organization. For example, if you're putting together a targeted marketing campaign and your data incorrectly tells you that a certain product will be in-demand with your target audience, the campaign will fail, and you'll hurt your brand's reputation. A good data scientist conducts continuous data health checks and creates standardization procedures to maintain high-quality data.
Explain what is overfitting and how would you control for it
Overfitting occurs when a model learns the noise and idiosyncrasies of the training data rather than the underlying pattern, so it performs very well on the training data but poorly on unseen data. It can be controlled by keeping the model simple (fewer parameters), using cross-validation, applying regularization (L1/L2 penalties), collecting more training data, and using techniques such as early stopping or pruning.
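A quick hedged sketch of how overfitting shows up in practice (illustrative synthetic data and scikit-learn, my own example, not from the source): an unconstrained model scores much better on the training set than on held-out data, while a constrained model keeps the two scores close.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# The unconstrained tree fits the training set perfectly but generalizes worse.
print(deep_tree.score(X_train, y_train), deep_tree.score(X_test, y_test))
print(shallow_tree.score(X_train, y_train), shallow_tree.score(X_test, y_test))
```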
What is your greatest professional achievement?
A great way to do so is by using the S-T-A-R method: Set up the situation and the task that you were required to complete to provide the interviewer with background context (e.g., "In my last job as a junior analyst, it was my role to manage the invoicing process"), but spend the bulk of your time describing what you actually did (the action) and what you achieved (the result). For example, "In one month, I streamlined the process, which saved my group 10 man-hours each month and reduced errors on invoices by 25%."
Tell me about a challenge or conflict you've faced at work, and how you dealt with it.
Again, you'll want to use the S-T-A-R method, being sure to focus on how you handled the situation professionally and productively, and ideally closing with a happy ending, like how you came to a resolution or compromise.
Can you cite some examples where a false negative is more important than a false positive?
Assume there is an airport 'A' which has received high security threats, and based on certain characteristics it identifies whether a particular passenger is a threat or not. Due to a shortage of staff, it decides to scan only the passengers predicted as threats by its predictive model. What will happen if a true threat is flagged as a non-threat by the airport's model? Another example is the judicial system: what if a jury or judge lets a criminal go free? Or what if you declined to marry a very good person based on your predictive model, met him or her a few years later, and realized you had a false negative?
Which technique is used to predict categorical responses?
Classification techniques are used to predict categorical responses; they are widely used in data mining for classifying data sets.
What is the difference between Cluster and Systematic Sampling?
Cluster sampling is a technique used when it becomes difficult to study a target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements. Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is progressed in a circular manner, so once you reach the end of the list, it is progressed from the top again. The best example of systematic sampling is the equal-probability method.
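A minimal sketch of the two schemes in plain Python (the population of 100 units and the cluster size of 10 are arbitrary illustrations):

```python
import random

population = list(range(1, 101))  # an ordered sampling frame of 100 units

# Systematic sampling: pick a random start, then take every k-th unit.
k = 10
start = random.randrange(k)
systematic_sample = population[start::k]

# Cluster sampling: split the frame into clusters and sample whole clusters.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
chosen_clusters = random.sample(clusters, 3)
cluster_sample = [unit for cluster in chosen_clusters for unit in cluster]

print(systematic_sample)
print(cluster_sample)
```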
What other companies are you interviewing with?
Companies ask this for a number of reasons, from wanting to see what the competition is for you to sniffing out whether you're serious about the industry. "Often the best approach is to mention that you are exploring a number of other similar options in the company's industry," says job search expert Alison Doyle. "It can be helpful to mention that a common characteristic of all the jobs you are applying to is the opportunity to apply some critical abilities and skills that you possess. For example, you might say 'I am applying for several positions with IT consulting firms where I can analyze client needs and translate them to development teams in order to find solutions to technology problems.'"
What do you understand by the term Normal Distribution?
Data is usually distributed in different ways, with a bias to the left or to the right, or it can all be jumbled up. However, there is a chance that data is distributed around a central value without any bias to the left or right and reaches a normal distribution in the form of a bell-shaped curve: the random variable is distributed as a symmetrical bell-shaped curve.
What's a time you exercised leadership?
Depending on what's more important for the role, you'll want to choose an example that showcases your project management skills (spearheading a project from end to end, juggling multiple moving parts) or one that shows your ability to confidently and effectively rally a team. And remember: "The best stories include enough detail to be believable and memorable," says Skillings. "Show how you were a leader in this situation and how it represents your overall leadership experience and potential."
Can you explain why you changed career paths?
Don't be thrown off by this question—just take a deep breath and explain to the hiring manager why you've made the career decisions you have. More importantly, give a few examples of how your past experience is transferrable to the new role. This doesn't have to be a direct connection; in fact, it's often more impressive when a candidate can make seemingly irrelevant experience seem very relevant to the role.
What is an Eigenvalue and Eigenvector?
Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching. The eigenvalue can be thought of as the strength of the transformation in the direction of its eigenvector, or the factor by which the stretching or compression occurs.
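A short NumPy sketch of the idea (the 2x2 covariance matrix is an arbitrary example):

```python
import numpy as np

cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])
eigenvalues, eigenvectors = np.linalg.eig(cov)

# Each column of `eigenvectors` is a direction the transformation only stretches;
# the matching eigenvalue is the stretch factor along that direction.
for value, vector in zip(eigenvalues, eigenvectors.T):
    print(value, vector)
```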
What is Interpolation and Extrapolation?
Estimating a value that lies between two known values in a list of values is interpolation. Extrapolation is approximating a value by extending a known set of values or facts beyond the range of what is known.
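A minimal NumPy sketch of both (the numbers are made up for illustration):

```python
import numpy as np

x_known = np.array([1.0, 2.0, 3.0, 4.0])
y_known = np.array([2.0, 4.0, 6.0, 8.0])

# Interpolation: estimate y at x = 2.5, which lies between known points.
print(np.interp(2.5, x_known, y_known))  # 5.0

# Extrapolation: fit a line to the known points and extend it beyond them.
slope, intercept = np.polyfit(x_known, y_known, deg=1)
print(slope * 6.0 + intercept)  # estimate at x = 6.0, outside the known range (≈ 12.0)
```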
What's a time you disagreed with a decision that was made at work?
Everyone disagrees with the boss from time to time, but in asking this interview question, hiring managers want to know that you can do so in a productive, professional way. "You don't want to tell the story about the time when you disagreed but your boss was being a jerk and you just gave in to keep the peace. And you don't want to tell the one where you realized you were wrong," says Peggy McKee of Career Confidential. "Tell the one where your actions made a positive difference on the outcome of the situation, whether it was a work-related outcome or a more effective and productive working relationship."
How would your boss and co-workers describe you?
First of all, be honest (remember, if you get this job, the hiring manager will be calling your former bosses and co-workers!). Then, try to pull out strengths and traits you haven't discussed in other aspects of the interview, such as your strong work ethic or your willingness to pitch in on other projects when needed.
Why do you want this job?
First, identify a couple of key factors that make the role a great fit for you (e.g., "I love customer support because I love the constant human interaction and the satisfaction that comes from helping someone solve a problem"), then share why you love the company (e.g., "I've always been passionate about education, and I think you guys are doing great things, so I want to be a part of it").
Explain about the box cox transformation in regression models.
For one reason or another, the response variable in a regression analysis might not satisfy one or more assumptions of ordinary least squares regression: the residuals could curve as the prediction increases or follow a skewed distribution. In such scenarios, it is necessary to transform the response variable so that the data meets the required assumptions. A Box-Cox transformation is a statistical technique for transforming a non-normal dependent variable into a normal shape. Since most statistical techniques assume normality, applying a Box-Cox transformation when the data is not normal means you can run a broader range of tests.
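A hedged sketch using SciPy (assumes scipy is installed; the log-normal toy data is my own): boxcox both transforms the data and estimates the lambda parameter.

```python
import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed, strictly positive data

y_transformed, fitted_lambda = boxcox(y)
print(fitted_lambda)  # a lambda near 0 means the transform is roughly a log transform
```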
What type of work environment do you prefer?
Hint: Ideally one that's similar to the environment of the company you're applying to. Be specific.
What is the difference between Supervised Learning an Unsupervised Learning?
If an algorithm learns something from the training data so that the knowledge can be applied to the test data, then it is referred to as Supervised Learning. Classification is an example for Supervised Learning. If the algorithm does not learn anything beforehand because there is no response variable or any training data, then it is referred to as unsupervised learning. Clustering is an example for unsupervised learning.
Where do you see yourself in five years?
If asked this question, be honest and specific about your future goals, but consider this: A hiring manager wants to know a) if you've set realistic expectations for your career, b) if you have ambition (a.k.a., this interview isn't the first time you're considering the question), and c) if the position aligns with your goals and growth. Your best bet is to think realistically about where this position could take you and answer along those lines. And if the position isn't necessarily a one-way ticket to your aspirations? It's OK to say that you're not quite sure what the future holds, but that you see this experience playing an important role in helping you make that decision.
What is the difference between Bayesian Estimate and Maximum Likelihood Estimation (MLE)?
In a Bayesian estimate, we have some prior knowledge about the data/problem. There may be several values of the parameters that explain the data, so we can look for multiple parameters, like 5 gammas and 5 lambdas, that do this. As a result of Bayesian estimation, we get multiple models for making multiple predictions, i.e. one for each pair of parameters but with the same prior. So, if a new example needs to be predicted, computing the weighted sum of these predictions serves the purpose. Maximum likelihood does not take the prior into consideration (it ignores the prior), so it is like being a Bayesian while using some kind of flat prior.
Can you cite some examples where a false positive is more important than a false negative?
In the medical field, assume patients are to be given chemotherapy. The lab tests patients for certain vital information, and based on those results the hospital decides whether to give the treatment. Assume a patient comes to the hospital and is predicted positive for cancer by the lab, but he does not actually have cancer. What will happen to him? (Assuming sensitivity is 1.) Another example comes from marketing: say an e-commerce company decides to give a $1000 gift voucher to customers it expects to purchase at least $5000 worth of items, assuming it will make at least a 20% profit on items sold above $5K. It sends the voucher directly to 100 customers, without any minimum purchase condition; every false positive is then a customer who receives the voucher but never makes the qualifying purchases.
Can you cite some examples where both false positive and false negatives are equally important?
In the banking industry, giving loans is the primary way of making money, but at the same time, if your repayment rate is not good you will not make any profit; rather, you will risk huge losses. Banks don't want to lose good customers, and at the same time they don't want to acquire bad customers. In this scenario, both false positives and false negatives become very important to measure. Similarly, we hear of many cases of players using steroids during sports competitions. Every player has to go through a steroid test before the game starts. A false positive can ruin the career of a great sportsman, while a false negative can make the game unfair.
What do you like to do outside of work?
Interviewers ask personal questions in an interview to "see if candidates will fit in with the culture [and] give them the opportunity to open up and display their personality, too," says longtime hiring manager Mitch Fortner. "In other words, if someone asks about your hobbies outside of work, it's totally OK to open up and share what really makes you tick. (Do keep it semi-professional, though: Saying you like to have a few beers at the local hot spot on Saturday night is fine. Telling them that Monday is usually a rough day for you because you're always hungover is not.)"
Is it better to have too many false positives, or too many false negatives? Explain
It depends on the question, as well as on the domain for which we are trying to solve it. In medical testing, false negatives may provide a falsely reassuring message to patients and physicians that disease is absent when it is actually present, which sometimes leads to inappropriate or inadequate treatment of both the patient and their disease; so it is preferable to have too many false positives. For spam filtering, a false positive occurs when spam filtering or blocking techniques wrongly classify a legitimate email message as spam and, as a result, interfere with its delivery. While most anti-spam tactics can block or filter a high percentage of unwanted emails, doing so without creating significant false positives is a much more demanding task; so here we prefer too many false negatives over too many false positives.
What is the goal of A/B Testing?
It is a statistical hypothesis test for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest. An example could be identifying the click-through rate for a banner ad.
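For the banner-ad example, a hedged sketch of how the test could be analyzed with a two-proportion z-test in statsmodels (the click and impression counts are invented for illustration):

```python
from statsmodels.stats.proportion import proportions_ztest

clicks = [310, 355]           # conversions for variant A and variant B
impressions = [10000, 10000]  # visitors shown each variant

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(z_stat, p_value)  # reject the null of equal click-through rates if p_value < 0.05
```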
What's your management style?
The best managers are strong but flexible, and that's exactly what you want to show off in your answer. (Think something like, "While every situation and every team member requires a bit of a different strategy, I tend to approach my employee relationships as a coach...") Then, share a couple of your best managerial moments, like when you grew your team from five to 15 or coached an underperforming employee to become the company's top salesperson.
A test has a true positive rate of 100% and false positive rate of 5%. There is a population with a 1/1000 rate of having the condition the test identifies. Considering a positive test, what is the probability of having that condition?
Let's suppose you are being tested for a disease. If you have the illness, the test will always say you have the illness. However, if you don't have the illness, 5% of the time the test will wrongly say you have it and 95% of the time it will correctly say you don't; so there is a 5% error when you do not have the illness. Out of 1000 people, the 1 person who has the disease will get a true positive result. Of the remaining 999 people, 5% will get a false positive result, i.e. close to 50 people. This means that out of 1000 people, roughly 51 will test positive even though only one person actually has the illness. So there is only about a 2% probability of having the disease even if the test says you do.
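The same answer follows directly from Bayes' theorem; a quick numerical check in Python:

```python
p_condition = 1 / 1000          # prior prevalence of the condition
true_positive_rate = 1.0        # P(test positive | condition)
false_positive_rate = 0.05      # P(test positive | no condition)

p_positive = (true_positive_rate * p_condition
              + false_positive_rate * (1 - p_condition))
p_condition_given_positive = true_positive_rate * p_condition / p_positive
print(p_condition_given_positive)  # about 0.0196, i.e. roughly 2%
```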
What is Linear Regression?
Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.
What is logistic regression? Or State an example when you have used logistic regression recently.
Logistic regression, often referred to as the logit model, is a technique to predict a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular political leader will win an election. The outcome of the prediction is binary, i.e. 0 or 1 (lose/win). The predictor variables would be the amount of money spent on the candidate's election campaign, the amount of time spent campaigning, etc.
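A minimal scikit-learn sketch of the election example (the spend/time numbers and labels are made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [campaign spend in millions, days spent campaigning]; label: 1 = win, 0 = lose.
X = np.array([[1.0, 10], [2.5, 30], [0.5, 5], [3.0, 40], [1.5, 20], [0.2, 2]])
y = np.array([0, 1, 0, 1, 1, 0])

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[2.0, 25]]))  # [P(lose), P(win)] for a new candidate
```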
How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything?
Often it is observed that in the pursuit of rapid innovation (aka "quick fame"), the principles of scientific methodology are violated, leading to misleading innovations, i.e. appealing insights that are confirmed without rigorous validation. One such scenario: given the task of improving an algorithm to yield better results, you might come up with several ideas with potential for improvement. An obvious human urge is to announce these ideas ASAP and ask for their implementation. When asked for supporting data, often limited results are shared, which are very likely to be impacted by selection bias (known or unknown) or a misleading global minimum (due to lack of appropriate variety in the test data). Data scientists do not let their human emotions overrun their logical reasoning. While the exact approach to proving that an improvement is really an improvement over not doing anything depends on the actual case at hand, there are a few common guidelines: • Ensure that there is no selection bias in the test data used for performance comparison. • Ensure that the test data has sufficient variety so that it is representative of real-life data (this helps avoid overfitting). • Ensure that "controlled experiment" principles are followed, i.e. while comparing performance, the test environment (hardware, etc.) must be exactly the same when running the original algorithm and the new algorithm. • Ensure that the results are repeatable with near-similar results. • Examine whether the results reflect local maxima/minima or global maxima/minima. One common way to achieve these guidelines is through A/B testing, where both versions of the algorithm are kept running in a similar environment for a considerably long time, with real-life input data randomly split between the two. This approach is particularly common in web analytics.
What does P-value signify about the statistical data?
P-value is used to determine the significance of results after a hypothesis test in statistics. The p-value helps the reader draw conclusions and is always between 0 and 1. • P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected. • P-value <= 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected. • P-value = 0.05 is the marginal value, indicating it is possible to go either way.
How Would You Describe Data Science to a Business Executive?
Part of data science is being able to translate your findings into insights that can be understood by cross-functional partners with no technical expertise. In an interview setting, this type of question tests your ability to break free from role-specific lingo and make what you do relatable to someone who may have no understanding of data science or its value to the company. You may see alternatives of this question that ask you to describe a more specific aspect or concept of data science, but the same general principle applies: Dumb it down and make it universally applicable.
What do you understand by Fuzzy merging? Which language will you use to handle it?
Probabilistic record linkage, sometimes called fuzzy matching (also probabilistic merging or fuzzy merging in the context of merging of databases), takes a different approach to record linkage problems by taking into account a wider range of potential identifiers, computing weights for each identifier based on its estimated ability to correctly identify a match or a non-match, and using these weights to calculate the probability that two given records refer to the same entity. Recorded pairs with probabilities above a certain threshold are considered to be matches, while pairs with probabilities below another threshold are considered to be non-matches. Pairs that fall between the two thresholds are considered to be "possible matches" and can be dealt with accordingly (e.g. human reviewed, linked, or not linked, depending on the requirements). It's easier to handle with SQL.
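The answer above suggests SQL; as a language-agnostic illustration of the scoring idea, here is a minimal Python sketch using the standard library's difflib (the records and the 0.8 threshold are arbitrary examples):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

record_a = "Jonathan Smith, 42 Elm Street"
record_b = "Jon Smith, 42 Elm St."

score = similarity(record_a, record_b)
# Pairs above an upper threshold are matches, below a lower threshold non-matches,
# and anything in between is a "possible match" routed to human review.
print(score, score > 0.8)
```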
How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression.
Proposed methods for model validation: • If the values predicted by the model are far outside the range of the response variable, this immediately indicates poor estimation or model inaccuracy. • If the values seem reasonable, examine the parameters; any of the following would indicate poor estimation or multicollinearity: signs opposite to expectations, unusually large or small values, or inconsistency when the model is fed new data. • Use the model for prediction by feeding it new data, and use the coefficient of determination (R squared) as a measure of model validity. • Use data splitting to form one dataset for estimating model parameters and another for validating predictions. • Use jackknife resampling if the dataset contains a small number of instances, and measure validity with R squared and mean squared error (MSE).
Give an example of how you would use experimental design to answer a question about user behavior.
What is the difference between "long" ("tall") and "wide" format data?
Explain Edward Tufte's concept of "chart junk."
How would you screen for outliers and what should you do if you find one?
How would you use either the extreme value theory, Monte Carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
What is a recommendation engine? How does it work?
Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?
Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimension in a chart (or in a video)?
Explain what precision and recall are. How do they relate to the ROC curve
What method do you use to determine whether the statistics published in an article (or appeared in a newspaper or other media) are either wrong or presented to support the author's point of view, rather than correct, comprehensive factual information on a specific subject?
Are you planning on having children?
Questions about your family status, gender ("How would you handle managing a team of all men?"), nationality ("Where were you born?"), religion, or age, are illegal—but they still get asked (and frequently). Of course, not always with ill intent—the interviewer might just be trying to make conversation—but you should definitely tie any questions about your personal life (or anything else you think might be inappropriate) back to the job at hand. For this question, think: "You know, I'm not quite there yet. But I am very interested in the career paths at your company. Can you tell me more about that?"
What do you understand by Recall and Precision?
Recall measures: "Of all the actual true samples, how many did we classify as true?" Precision measures: "Of all the samples we classified as true, how many are actually true?" In other words, Recall = TP / (TP + FN) and Precision = TP / (TP + FP). A simple example for better understanding: imagine that your wife has given you a surprise every anniversary for the last 12 years. One day, all of a sudden, she asks, "Darling, do you remember all the anniversary surprises from me?" This simple question puts your life in danger; to save it, you need to recall all 12 anniversary surprises from memory. Recall is the ratio of the number of events you correctly recall to the number of all correct events. If you recall all 12 surprises correctly, your recall is 1 (100%); if you recall only 10 of the 12 correctly, your recall is 0.83 (83.3%). However, you might also guess wrongly in some cases. For instance, suppose you give 15 answers, of which 10 are correct surprises and 5 are wrong: your recall is still 10/12 = 83.3%, but your precision is only 10/15 = 66.7%, because precision is the ratio of the number of events you correctly recall to the total number of events you recall (correct and wrong recalls combined).
Explain what regularization is and why it is useful.
Regularization is the process of adding a tuning parameter to a model to induce smoothness and prevent overfitting. This is most often done by adding to the loss a constant multiple of a norm of the weight vector; the norm used is often the L1 norm (Lasso) or the L2 norm (ridge), but it can in fact be any norm. The model predictions should then minimize the mean of the loss function calculated on the regularized training set.
Why L1 regularizations causes parameter sparsity whereas L2 regularization does not?
Regularization in statistics or machine learning is used to include extra information in order to solve a problem in a better way, and L1 and L2 regularization are generally used to add constraints to optimization problems. L1 regularization penalizes the absolute value of each coefficient, so the penalty's gradient stays constant no matter how small a coefficient gets, and the optimization can push small coefficients exactly to zero; geometrically, the L1 constraint region is a diamond whose corners lie on the coordinate axes, so the optimum tends to land where some parameters are zero. L2 regularization penalizes the squared value, so the shrinkage force fades as a coefficient approaches zero; coefficients are shrunk proportionally but rarely become exactly zero. That is why L1 produces sparse parameter vectors while L2 does not.
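A hedged scikit-learn sketch of the effect (synthetic data and an arbitrary alpha, my own example): Lasso (L1) drives many coefficients to exactly zero, while Ridge (L2) only shrinks them.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(sum(coef == 0 for coef in lasso.coef_))  # several coefficients are exactly zero
print(sum(coef == 0 for coef in ridge.coef_))  # typically none are exactly zero
```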
What is root cause analysis?
Root cause analysis (RCA) is a method of problem solving used for identifying the root causes of faults or problems. A factor is considered a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring, whereas a causal factor is one that affects an event's outcome but is not a root cause. Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas, such as healthcare, project management, and software testing. Essentially, you can find the root cause of a problem and show the relationship of causes by repeatedly asking the question "Why?" until you find the root of the problem. This technique is commonly called the "5 Whys", although it can involve more or fewer than five questions.
How can you deal with different types of seasonality in time series modelling?
Seasonality in a time series occurs when the series shows a repeated pattern over time, e.g. stationery sales decreasing during the holiday season or air-conditioner sales increasing during the summer. Seasonality makes a time series non-stationary because the average value of the variable differs across time periods. Differencing a time series is generally regarded as the best method of removing seasonality: seasonal differencing takes the numerical difference between a particular value and the value at a periodic lag (e.g. a lag of 12 if monthly seasonality is present).
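A minimal pandas sketch of seasonal differencing on monthly data (the synthetic series is my own illustration):

```python
import numpy as np
import pandas as pd

index = pd.date_range("2015-01-01", periods=48, freq="MS")
trend = np.arange(48) * 0.5
seasonality = 10 * np.sin(2 * np.pi * np.arange(48) / 12)
sales = pd.Series(100 + trend + seasonality, index=index)

seasonally_differenced = sales.diff(12)  # each value minus the value 12 months earlier
print(seasonally_differenced.dropna().head())
```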
How will you define the number of clusters in a clustering algorithm?
A common approach is the elbow method: run the clustering algorithm (for example, k-means) for a range of values of k, compute the within-cluster sum of squares (inertia) for each k, and pick the value where the curve bends, after which adding more clusters yields only marginal improvement. Silhouette analysis, which measures how well each point fits its own cluster compared with the nearest other cluster, is another common criterion.
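A minimal sketch of the elbow method with scikit-learn (synthetic blob data; in practice you would plot inertia against k and look for the bend):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, model.inertia_)  # inertia drops sharply up to k = 4, then flattens
```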
If you were an animal, which one would you want to be?
Seemingly random personality-test type questions like these come up in interviews generally because hiring managers want to see how you can think on your feet. There's no wrong answer here, but you'll immediately gain bonus points if your answer helps you share your strengths or personality or connect with the hiring manager. Pro tip: Come up with a stalling tactic to buy yourself some thinking time, such as saying, "Now, that is a great question. I think I would have to say... "
What is selection bias, why is it important and how can you avoid it?
Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample. For example, if a given sample of 100 test cases was made up of a 60/20/15/5 split of 4 classes which actually occurred in relatively equal numbers in the population, then a given model may make the false assumption that probability could be the determining predictive factor. Avoiding non-random samples is the best way to deal with bias; however, when this is impractical, techniques such as resampling, boosting, and weighting are strategies which can be introduced to help deal with the situation.
What do you understand by statistical power of sensitivity and how do you calculate it?
Sensitivity is commonly used to validate the accuracy of a classifier (logistic regression, SVM, random forest, etc.). Sensitivity is simply "predicted TRUE events / total actual TRUE events", where true events are the events that were actually true and that the model also predicted as true. The calculation of sensitivity is pretty straightforward: Sensitivity = True Positives / Positives in the actual dependent variable, where true positives are positive events that are correctly classified as positive.
You created a predictive model of a quantitative outcome variable using multiple regressions. What are the steps you would follow to validate the model?
Since the question is about a post-model-building exercise, we will assume that you have already tested for the null hypothesis, multicollinearity, and the standard error of the coefficients. Once you have built the model, you should check for the following: · a global F-test to see the significance of the group of independent variables on the dependent variable · R^2 · adjusted R^2 · RMSE, MAPE. In addition to the quantitative metrics mentioned above, you should also check the residual plot and the assumptions of linear regression.
What would your first 30, 60, or 90 days look like in this role?
Start by explaining what you'd need to do to get ramped up. What information would you need? What parts of the company would you need to familiarize yourself with? What other employees would you want to sit down with? Next, choose a couple of areas where you think you can make meaningful contributions right away. (e.g., "I think a great starter project would be diving into your email marketing campaigns and setting up a tracking system for them.") Sure, if you get the job, you (or your new employer) might decide there's a better starting place, but having an answer prepared will show the interviewer where you can add immediate impact—and that you're excited to get started.
What are your salary requirements?
The #1 rule of answering this question is doing your research on what you should be paid by using sites like Payscale and Glassdoor. You'll likely come up with a range, and we recommend stating the highest number in that range that applies, based on your experience, education, and skills. Then, make sure the hiring manager knows that you're flexible. You're communicating that you know your skills are valuable, but that you want the job and are willing to negotiate.
What is the difference between a compiled computer language and an interpreted computer language?
The difference between an interpreted and a compiled language lies in the result of the process of interpreting or compiling. An interpreter produces a result directly from a program's source code, while a compiler produces a program written in assembly language. An assembler for the target architecture then turns the resulting program into binary code. Assembly language varies for each individual computer, depending upon its architecture; consequently, compiled programs can only run on computers that have the same architecture as the computer on which they were compiled.
During analysis, how do you treat missing values?
The extent of the missing values is identified after identifying the variables with missing values. If any patterns are identified, the analyst has to concentrate on them, as they could lead to interesting and meaningful business insights. If no patterns are identified, the missing values can be substituted with mean or median values (imputation) or they can simply be ignored. There are various factors to consider when answering this question: understand the problem statement and understand the data before giving the answer. A default value can be assigned, which may be the mean, minimum, or maximum value; getting into the data is important. If it is a categorical variable, a default category can be assigned. If the data follows a normal distribution, the mean value can be imputed. Whether missing values should be treated at all is another important point to consider: if 80% of the values for a variable are missing, you can answer that you would drop the variable instead of treating the missing values.
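A minimal pandas sketch of the options mentioned above (the tiny DataFrame and the "unknown" default are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35, np.nan],
                   "segment": ["A", "B", None, "B", "A"]})

print(df.isna().mean())                           # share of missing values per column
df["age"] = df["age"].fillna(df["age"].median())  # impute a numeric column
df["segment"] = df["segment"].fillna("unknown")   # assign a default category
# If a column were, say, 80% missing, dropping it may be preferable:
# df = df.drop(columns=["mostly_missing_column"])
print(df)
```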
What is Collaborative filtering?
Collaborative filtering is the process of filtering, used by most recommender systems, that finds patterns or information by combining viewpoints, multiple data sources, and multiple agents.
Can you explain the difference between a Test Set and a Validation Set?
The validation set can be considered part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built. The test set, on the other hand, is used for testing or evaluating the performance of a trained machine learning model. In simple terms, the differences can be summarized as: the training set is used to fit the parameters, i.e. the weights; the test set is used to assess the performance of the model, i.e. to evaluate its predictive power and generalization; the validation set is used to tune the (hyper)parameters.
How can you assess a good logistic model?
There are various methods to assess the results of a logistic regression analysis: • Using a classification (confusion) matrix to look at the true negatives and false positives. • Concordance, which helps identify the ability of the logistic model to differentiate between the event happening and not happening. • Lift, which helps assess the logistic model by comparing it with random selection.
Which Data Scientists Do You Admire the Most and Why?
There's certainly no right answer to this question, but that doesn't make your response any less important. First of all, the question reveals how in-tune you are with the goings-on within your profession. Staying up on industry trends—particularly in a new, faster-growing field like data science—is an important trait most companies want to see in their candidates. Second, your answer may reveal a bit about your approach to your craft. That's why it's important to be sure and answer the latter half of this question—the "why." Doing so reveals to your panel what you value in a good data scientist, which is likely a reflection of how you operate within your field.
Differentiate between univariate, bivariate and multivariate analysis.
These are descriptive statistical analysis techniques that can be differentiated based on the number of variables involved at a given point in time. For example, a pie chart of sales by territory involves only one variable, so the analysis can be referred to as univariate analysis. If the analysis attempts to understand the relationship between two variables at a time, as in a scatterplot, it is referred to as bivariate analysis; for example, analysing the volume of sales against spending is an example of bivariate analysis. Analysis that deals with the study of more than two variables, to understand the effect of the variables on the responses, is referred to as multivariate analysis.
Are expected value and mean value different?
They are not different, but the terms are used in different contexts. Mean is generally used when talking about a probability distribution or a sample population, whereas expected value is generally used in the context of a random variable. For sampled data, the mean is the only value that comes from the sampling data, while the expected value is the mean of all the sample means, i.e. the value built from multiple samples; the expected value is the population mean. For distributions, the mean value and the expected value are the same irrespective of the distribution, under the condition that the distribution comes from the same population.
Is it possible to perform logistic regression with Microsoft Excel?
Yes. Excel has no built-in logistic regression function, but you can set up the logistic (sigmoid) function and the log-likelihood with worksheet formulas and then use the Solver add-in to find the coefficients that maximize the log-likelihood; several third-party statistics add-ins also provide logistic regression directly.
How can you iterate over a list and also retrieve element indices at the same time?
This can be done using the built-in enumerate function, which takes every element in a sequence (such as a list) and pairs it with its index, yielding (index, element) pairs as you iterate.
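A quick example (the fruit list is arbitrary):

```python
fruits = ["apple", "banana", "cherry"]
for index, fruit in enumerate(fruits):  # enumerate(fruits, start=1) to begin counting at 1
    print(index, fruit)
# 0 apple
# 1 banana
# 2 cherry
```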
What do you think we could do better or differently?
This is a common one at startups (and one of our personal favorites here at The Muse). Hiring managers want to know that you not only have some background on the company, but that you're able to think critically about it and come to the table with new ideas. So, come with new ideas! What new features would you love to see? How could the company increase conversions? How could customer service be improved? You don't need to have the company's four-year strategy figured out, but do share your thoughts, and more importantly, show how your interests and expertise would lend themselves to the job.
Are you familiar with price optimization, price elasticity, inventory management, competitive intelligence? Give examples.
Those are economics terms that are not frequently asked of data scientists, but they are useful to know. Price optimization is the use of mathematical tools to determine how customers will respond to different prices for products and services offered through different channels. Big Data and data mining enable the use of personalization for price optimization; companies like Amazon can take optimization further and show different prices to different visitors based on their history, although there is a strong debate about whether this is fair. Price elasticity in common usage typically refers to price elasticity of demand, a measure of price sensitivity, computed as: Price Elasticity of Demand = % Change in Quantity Demanded / % Change in Price. Similarly, price elasticity of supply is an economics measure that shows how the quantity supplied of a good or service responds to a change in its price. Inventory management is the overseeing and controlling of the ordering, storage, and use of components that a company will use in the production of the items it will sell, as well as the overseeing and controlling of quantities of finished products for sale. Wikipedia defines competitive intelligence as the action of defining, gathering, analyzing, and distributing intelligence about products, customers, competitors, and any aspect of the environment needed to support executives and managers in making strategic decisions for an organization. Tools like Google Trends, Alexa, and Compete can be used to determine general trends and analyze your competitors on the web. Wikipedia defines the statistical power or sensitivity of a binary hypothesis test as the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true. Put another way, statistical power is the likelihood that a study will detect an effect when the effect is present; the higher the statistical power, the less likely you are to make a Type II error (concluding there is no effect when, in fact, there is).
In Your Opinion, How Does Data Science Differ From Data Analytics and Machine Learning?
Though data scientists often get lumped in with these two fields, your role is—in theory—considerably different. Separating data scientists from data analysts all comes down to application. Scientists are responsible for slicing data to extract valuable insights that a data analyst can then apply to real-world business scenarios. Both roles are behind the scenes, but data scientists generally have more technical coding knowledge and don't need to have the intricate understanding of the business required of data analysts. As far as machine learning goes: Machine learning is just one small application of the work done by data scientists. Machine learning deals with algorithms specifically built for machines—data science is multidisciplinary.
What's Your Favorite Part of Your Current Job?
Treat your response here like you're running for president: It should be diplomatic, and whether truthful or not, it should at least feign sincerity. You obviously don't want to pump up your current job too much, or your panel might start wondering whether you really want to leave. But, you also don't want to make the mistake of bad-mouthing your current job either.
What are various steps involved in an analytics project?
• Understand the business problem. • Explore the data and become familiar with it. • Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc. • After data preparation, start running the model, analyse the result, and tweak the approach; this is an iterative step until the best possible outcome is achieved. • Validate the model using a new data set. • Start implementing the model and track the results to analyse the performance of the model over time.
Why is vectorization considered a powerful method for optimizing numerical code?
Vectorization is a programming technique that uses vector operations instead of element-by-element loop-based operations. Besides frequently producing more succinct Octave code, vectorization also allows for better optimization in the subsequent implementation. The optimizations may occur either in Octave's own Fortran, C, or C++ internal implementation, or even at a lower level depending on the compiler and external numerical libraries used to build Octave. The ultimate goal is to make use of your hardware's vector instructions if possible or to perform other optimizations in software. Vectorization is not a concept unique to Octave, but it is particularly important because Octave is a matrix-oriented language. Vectorized Octave code will see a dramatic speed up (10X-100X) in most cases.
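The answer above is phrased in terms of Octave; the same idea in Python/NumPy, as a hedged sketch of my own (a dot product written as a loop versus a single vectorized call):

```python
import numpy as np

x = np.random.rand(100_000)
y = np.random.rand(100_000)

# Loop version: one Python-level multiply-add per element.
dot_loop = 0.0
for a, b in zip(x, y):
    dot_loop += a * b

# Vectorized version: a single call that runs in optimized native code.
dot_vectorized = x @ y

print(np.isclose(dot_loop, dot_vectorized))  # True, and the vectorized form is far faster
```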
What do you consider to be your weaknesses?
What your interviewer is really trying to do with this question—beyond identifying any major red flags—is to gauge your self-awareness and honesty. So, "I can't meet a deadline to save my life" is not an option—but neither is "Nothing! I'm perfect!" Strike a balance by thinking of something that you struggle with but that you're working to improve. For example, maybe you've never been strong at public speaking, but you've recently volunteered to run meetings to help you be more comfortable when addressing a crowd.
What is the difference between skewed and uniform distribution?
When the observations in a dataset are spread equally across the range of the distribution, it is referred to as a uniform distribution; there are no clear peaks in a uniform distribution. Distributions that have more observations on one side of the graph than the other are referred to as skewed distributions. Distributions with fewer observations on the left (towards lower values) are said to be skewed left, and distributions with fewer observations on the right (towards higher values) are said to be skewed right.
Do you have any questions for us?
You probably already know that an interview isn't just a chance for a hiring manager to grill you—it's your opportunity to sniff out whether a job is the right fit for you. What do you want to know about the position? The company? The department? The team? You'll cover a lot of this in the actual interview, so have a few less-common questions ready to go. We especially like questions targeted to the interviewer ("What's your favorite part about working here?") or the company's growth ("What can you tell me about your new products or plans for growth?")
Give an Example of a Time You've Encountered Selection Bias and Explain How You Avoided It
You probably know that selection bias comes in when an error is introduced due to a non-random population sample. It happens with a fair amount of frequency in data science, and generally speaking, the easiest way to avoid it is to stay away from non-random samples. However, at times, those samples are unavoidable. In which case, leveraging techniques like resampling, boosting, and weighting are good workarounds to the problem.
Why should we hire you?
Your job here is to craft an answer that covers three things: that you can not only do the work, you can deliver great results; that you'll really fit in with the team and culture; and that you'd be a better hire than any of the other candidates.
What are your greatest professional strengths?
Keep your answer accurate (share your true strengths, not those you think the interviewer wants to hear), relevant (choose the strengths that are most targeted to this particular position), and specific (for example, instead of "people skills," choose "persuasive communication" or "relationship building"). Then, follow up with an example of how you've demonstrated these traits in a professional setting.
What is cross-validation?
https://machinelearningmastery.com/k-fold-cross-validation/
Explain the use of Combinatorics in data science.
https://mapr.com/blog/data-science-do-numbers-part-1/
Predictive power or interpretability of a model: Which is more important, and why?
https://medium.com/@chris_bour/interpretable-vs-powerful-predictive-models-why-we-need-them-both-990340074979
What is the difference between cluster and systematic sampling?
https://www.investopedia.com/ask/answers/051815/what-difference-between-systematic-sampling-and-cluster-sampling.asp
What are the benefits of R language in data science?
https://www.newgenapps.com/blog/6-reasons-why-choose-r-programming-for-data-science-projects
What is linear regression? What are some better alternatives?
https://www.quality-control-plan.com/StatGuide/mulreg_alts.htm
What is a recommender system?
https://www.quora.com/What-is-a-recommendation-system
What are confounding variables?
https://www.statisticshowto.datasciencecentral.com/experimental-design/confounding-variable/
How many tennis balls can you fit into a limousine?
https://www.themuse.com/advice/4-insanely-tough-interview-questions-and-how-to-nail-them
How many windows are there in NYC?
https://www.themuse.com/advice/9-steps-to-solving-an-impossible-brain-teaser-in-a-tech-interview-without-breaking-a-sweat
Explain what resampling methods are and why they are useful. Also explain their limitations.
Classical statistical parametric tests compare observed statistics to theoretical sampling distributions. Resampling is a data-driven, rather than theory-driven, methodology based upon repeated sampling within the same sample. Resampling refers to methods for doing one of the following: estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of the available data (jackknifing) or by drawing randomly with replacement from a set of data points (bootstrapping); exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests); and validating models by using random subsets (bootstrapping, cross-validation). The main limitations are computational cost, since many resamples may be required, and the fact that resampling cannot fix a bad sample: if the original sample is biased or too small to represent the population, the resampled estimates inherit that problem.
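A minimal bootstrap sketch (my own illustration with synthetic data): estimate the standard error of the median by repeatedly resampling the observed data with replacement.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=200)  # the observed data

boot_medians = [np.median(rng.choice(sample, size=sample.size, replace=True))
                for _ in range(2000)]
print(np.std(boot_medians))  # bootstrap estimate of the median's standard error
```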