ML - Case Study
Case Study -> Features: when should you prefer fewer features vs. many?
* Irrelevant features increase the model's tendency to overfit because they introduce noise. * Correlated variables make regression coefficients harder to interpret. * Curse of dimensionality. * Adding random noise makes the model more complicated but no more useful. * Higher computational cost.
Case Study -> You need to reduce the dimensionality of a large data set on a machine with limited memory. What would you do?
1. Close all other applications to free memory. 2. Randomly sample the data set to create a smaller data set, say 1000 variables and 300,000 rows. 3. Separate the numerical and categorical variables and remove correlated variables (e.g., correlation for numerical, chi-squared tests for categorical). 4. Run PCA and pick the components that explain the maximum variance in the data set. 5. Build a linear model trained with Stochastic Gradient Descent. 6. Apply business understanding to reduce the feature set further. 7. Use feature-selection techniques and keep only the highest-ranked features.
Case Study -> Steps in analytics project?
"1. Understand the Business problem 2. Explore the data and become familiar with it. 3. Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc. 4. After data preparation, start running the model, analyze the result and tweak the approach. This is an iterative step until the best possible outcome is achieved. 5. Validate the model using a new data set. 6. Start implementing the model and track the result to analyze the performance of the model over the period of time."
Case Study -> The data set has missing values spread within 1 standard deviation of the median. What percentage of the data would remain unaffected? Why?
Use the 68-95-99.7 rule. Since the data is spread around the median, assume a normal distribution. In a normal distribution, ~68% of the data lies within 1 standard deviation of the mean (which equals the median and mode), so the missing values affect about 68% of the data. Therefore, ~32% of the data would remain unaffected.
Case Study -> You built a cancer detection model on an imbalanced data set and achieved 96% accuracy. Why shouldn't you be satisfied with that, and what can you do about it?
Cancer detection results in imbalanced data. Accuracy should not be used as the performance measure, because the 96% (as given) might come only from predicting the majority class correctly. The class of interest is the minority class (4%), the people actually diagnosed with cancer. To evaluate the model, use Sensitivity (True Positive Rate), Specificity (True Negative Rate), and the F measure to determine class-wise performance. If the minority-class performance is poor, we can take the following steps: use undersampling, oversampling, or SMOTE to balance the data; alter the prediction threshold by calibrating probabilities and finding an optimal threshold from the ROC curve; assign class weights so that the minority class gets a larger weight; or use anomaly detection. Know more: Imbalanced Classification
Case Study - In a search engine, given partial data on what the user has typed, how would you predict the user's eventual search query?
"Based on the past frequencies of words shown up given a sequence of words, we can construct conditional probabilities of the set of next sequences of words that can show up (n-gram). The sequences with highest conditional probabilities can show up as top candidates. To further improve this algorithm,we can put more weight on past sequences which showed up more recently and near your location to account for trends show your recent searches given partial data"
Case Study - You're Uber and you want to design a heatmap to recommend to drivers where to wait for a passenger. How would you approach this?
"Based on the past pickup location of passengers around the same time of the day, day of the week (month, year), construct Based on the number of past pickups account for periodicity (seasonal, monthly, weekly, daily, hourly) special events (concerts, festivals, etc.) from tweets"
Case Study - Given training data on tweets and their retweets, how would you predict the number of retweets of a given tweet after 7 days after only observing 2 days worth of data?
"Build a time series model with the training data with a seven day cycle and then use that for a new data with only 2 days data. Build a regression function to estimate the number of retweets as a function of time t to determine if one regression function can be built, see if there are clusters in terms of the trends in the number of retweets if not, we have to add features to the regression function features + # of retweets on the first and the second day -> predict the seventh day https://en.wikipedia.org/wiki/Dynamic_time_warping"
Case Study - How would you design the people you may know feature on LinkedIn or Facebook?
"Find strong unconnected people in weighted connection graph Define similarity as how strong the two people are connected Given a certain feature, we can calculate the similarity based on friend connections (neighbors) Check-in's people being at the same location all the time. same college, workplace Have randomly dropped graphs test the performance of the algorithm ref. News Feed Optimization Affinity score: how close the content creator and the users are Weight: weight for the edge type (comment, like, tag, etc.). Emphasis on features the company wants to promote Time decay: the older the less important"
Case Study - What are the steps? How do you approach a case study?
How to approach a case study: Objective (variable of interest, hypothesis to test); Ask clarifying questions; Metrics; Variables; Model; Scenarios; Recommendations; Summarize.
Case Study -> You've got a data set with p (number of variables) > n (number of observations). Why is OLS a bad option? Which techniques would be best to use, and why?
In such high-dimensional data sets, we can't use classical regression techniques, since their assumptions tend to fail. When p > n, we can no longer calculate a unique least squares coefficient estimate and the variances become infinite, so OLS cannot be used at all. To combat this situation, we can use penalized regression methods like lasso, LARS, or ridge, which shrink the coefficients to reduce variance. Precisely, ridge regression works best in situations where the least squares estimates have high variance. Other options include best-subset regression and forward stepwise regression.
Case Study -> You run into memory issues while training. What do you do?
For neural networks: train with small batches from a memory-mapped NumPy array. Steps: 1. Load the data as a memory-mapped NumPy array; it maps the complete data set on disk without loading it all into memory. 2. Index into the array to pull only the rows you need. 3. Feed those rows to the neural network. 4. Keep the batch size small. For SVM: use partial fitting. Steps: 1. Divide the big data set into small subsets. 2. Call the model's partial-fit method, which only requires a subset of the complete data set. 3. Repeat step 2 for the remaining subsets.
Case Study - General process
"Objective: Define - growth, - define: Stakeholders like Marketing/Product teams, time, resource considerations, common goals Metric: - appropriate metric - how measure Variables: - Customer level (demographics age, income ...) - Behavioral level ( computer, phone, etc) - business (price, quantity, ...) Model: (linear or class) Recommendations: A/B Test, Cust Seg Concerns: Bias, problems unknows"
Case Study - How would you build a model to predict a March Madness bracket?
"One vector each for team A and B. Take the difference of the two vectors and use that as an input to predict the probability that team A would win by training the model. Train the models using past tournament data and make a prediction for the new tournament by running the trained model for each round of the tournament Some extensions: Experiment with different ways of consolidating the 2 team vectors into one (e.g concantenating, averaging, etc) Consider using a RNN type model that looks at time series data."
Case Study -> Naive Bayes: explain prior probability, likelihood, and marginal likelihood in the context of the Naive Bayes algorithm.
The prior probability is simply the proportion of the dependent (binary) variable in the data set. It is the closest guess you can make about a class without any further information. For example: in a data set where the dependent variable is binary (1 = spam, 0 = not spam), if the proportion of 1s is 70% and 0s is 30%, we can estimate a 70% chance that any new email is spam. The likelihood is the probability of observing a feature given the class, e.g. the probability that the word 'FREE' appears in a spam message. The marginal likelihood is the probability that the word 'FREE' appears in any message.
Case Study -> How would you validate a predictive model of a quantitative outcome variable built using multiple regression?
Proposed methods for model validation: Check whether the predicted values fall inside the expected range; values outside it suggest a bad model. If the values seem reasonable, examine the parameters; any of the following would indicate poor estimation or multicollinearity: signs opposite to expectations, unusually large or small values, or inconsistency when the model is fed new data. Use the model for prediction by feeding it new data, and use the coefficient of determination (R²) as a validity measure. Use data splitting to form one data set for estimating model parameters and another for validating predictions. Use jackknife resampling if the data set contains a small number of instances, and measure validity with R² and mean squared error (MSE).
Case Study -> You are given a data set and a response variable. How do you analyze it?
Start by: 1) fitting a simple model (multivariate regression, logistic regression); 2) doing some feature engineering accordingly; 3) trying more complicated models. Always split the data into train, validation, and test sets and use cross-validation to check performance. Process: * Determine whether the problem is classification or regression. * Favor simple models that run quickly and are easy to explain. * Mention cross-validation as a means to evaluate the model. * Plot and visualize the data.
Case Study -> Linear regression. You have built a multiple regression model, but its R² isn't as good as you wanted. To improve it, you remove the intercept term, and R² jumps from 0.3 to 0.8. Is this possible? How?
Yes, it is possible. The intercept term represents the model's prediction without any independent variables, i.e. the mean prediction. The formula is R² = 1 − ∑(y − y´)² / ∑(y − ymean)², where y´ is the predicted value. When the intercept is present, R² evaluates your model against the mean model. When the intercept is removed, the baseline becomes zero rather than ymean, so the denominator becomes ∑y², which is usually much larger; the ratio therefore shrinks and R² comes out higher, even though the fit may actually be worse.
Case Study -> You are assigned a new project which involves helping a food delivery company save money. The problem is that the company's delivery team isn't able to deliver food on time, so customers get unhappy, and to keep them happy the company ends up delivering food for free. Which machine learning algorithm can save them?
You might have started hopping through the list of ML algorithms in your mind. But wait! Such questions test your machine learning fundamentals. This is not a machine learning problem; it is a route optimization problem. A machine learning problem consists of three things: a pattern exists; you cannot solve it analytically (even by writing exponential equations); and you have data on it. Always look for these three factors to decide whether machine learning is the right tool for a particular problem.
Case Study - How would you suggest to a franchise where to open a new store?
"build a master dataset with local demographic information available for each location. local income levels, proximity to traffic, weather, population density, proximity to other businesses a reference dataset on local, regional, and national macroeconomic conditions (e.g. unemployment, inflation, prime interest rate, etc.) any data on the local franchise owner-operators, to the degree the manager identify a set of KPIs acceptable to the management that had requested the analysis concerning the most desirable factors surrounding a franchise quarterly operating profit, ROI, EVA, pay-down rate, etc. run econometric models to understand the relative significance of each variable run machine learning algorithms to predict the performance of each location candidate"
Case Study - How would you predict who someone may want to send a Snapchat or Gmail to?
"for each user, assign a score of how likely someone would send an email to the rest is feature engineering: number of past emails, how many responses, the last time they exchanged an email, whether the last email ends with a question mark, features about the other users, etc. Ask someone for more details. People who someone sent emails the most in the past, conditioning on time decay."
Case Study -> How do you handle missing values?
Either remove the affected rows/columns, or impute the missing values, i.e. infer them from the known part of the data: substitute the most frequent value, the mean/average, or the value of the closest neighbor.
11. How would you measure the impact that sponsored stories on Facebook News Feed have on user engagement? How would you determine the optimum balance between sponsored stories and organic content on a user's News Feed?
Run A/B tests with different ratios of sponsored to organic content and compare user-engagement metrics across the variants.
Case Study -> Ensemble. You built five Gradient Boosting models, and none beat the benchmark score. After combining the five into an ensemble, there is still no performance increase. What is the problem?
Ensemble learners are based on the idea of combining weak learners to create a strong learner. But these learners provide superior results only when the combined models are uncorrelated. Since we used 5 GBM models and got no accuracy improvement, the models are likely correlated. The problem with correlated models is that they all provide the same information.
5. What would be good metrics of success for a product that offered in-app purchases? (Zynga, Angry Birds, other gaming apps)
Average Revenue Per Paying User (ARPPU); Average Revenue Per User (ARPU).
Case Study - Given a database of all previous alumni donations to your university, how would you predict which recent alumni are most likely to donate?
Based on frequency and amount of donations, graduation year, major, etc, construct a supervised regression (or binary classification) algorithm.
14. What kind of services would find churn (a metric that tracks how many customers leave the service) helpful? How would you calculate churn?
Subscription-based services (e.g. streaming, SaaS, telecom). Churn rate = customers lost during a period / customers at the start of that period.
Case Study -> You are working on a classification problem. For validation purposes, you split the training data into train and validation sets by random sampling. You are confident your model will work incredibly well on unseen data since your validation accuracy is high. However, you are shocked by poor test accuracy. What went wrong?
For classification problems, we should use stratified sampling instead of random sampling. Random sampling doesn't take the proportion of target classes into account. Stratified sampling, on the other hand, maintains the distribution of the target variable in the resulting samples.
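A minimal sketch of the point, assuming scikit-learn: a stratified split preserves the class ratio of an imbalanced target in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.05).astype(int)      # ~5% positive class

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            stratify=y, random_state=0)
print("train positives:", y_tr.mean(), "validation positives:", y_val.mean())
```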
13. Say that you are Netflix. How would you determine what original series you should invest in and create?
Netflix uses data to estimate the potential market size for an original series before giving it the go-ahead.
Case Study -> PCA. The data set contains many variables, some of which are highly correlated. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?
Yes, remove correlated variables first. Keeping them has a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated. For example: you have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component would exhibit roughly twice the variance it would exhibit with uncorrelated variables. Also, correlated variables make PCA put more importance on those variables, which is misleading.
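A small check of this claim, assuming scikit-learn: making the second column nearly a copy of the first roughly doubles the variance explained by the first principal component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 1_000))

uncorrelated = np.column_stack([a, b, c])
# second column is nearly a duplicate of the first
correlated = np.column_stack([a, a + 0.01 * rng.normal(size=1_000), c])

for name, X in [("uncorrelated", uncorrelated), ("correlated", correlated)]:
    pca = PCA().fit(StandardScaler().fit_transform(X))
    print(name, "PC1 explains", round(pca.explained_variance_ratio_[0], 2))
# uncorrelated ≈ 0.33, correlated ≈ 0.67: twice the apparent variance on PC1
```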
Case Study -> Algorithm choice: given a data set, how do you decide which algorithm to use?
The choice of machine learning algorithm depends largely on the type of data. If the data exhibits linearity, linear regression is a good choice. If you are working with images or audio, a neural network will help you build a robust model. If the data comprises non-linear interactions, a boosting or bagging algorithm is appropriate. If the business requirement is a model that is easy to deploy and explain, use regression or a decision tree model (easy to interpret and explain) instead of black-box algorithms like SVM or GBM.
"Case Study -> We want to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate algorithm for this case? Choose the correct option: Logistic Regression Linear Regression K-means clustering Apriori algorithm"
The most appropriate algorithm for this case is A, logistic regression.
Case Study - You want to run a regression to predict the probability of a flight delay, but there are flights with delays of up to 12 hours that are really messing up your model. How can you address this?
This is equivalent to making the model more robust to outliers: cap (winsorize) the extreme delays, log-transform the delay, use a robust loss such as Huber, or model delay as a binary/categorical outcome instead.
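A minimal sketch of two of these options (capping and a robust Huber loss), assuming scikit-learn; the delay data is synthetic.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))
delay = 10 + 5 * X[:, 0] + rng.normal(0, 5, size=1_000)
delay[:20] += 720                       # a few 12-hour (720-minute) outliers

capped = np.clip(delay, None, np.percentile(delay, 99))   # option 1: winsorize
ols = LinearRegression().fit(X, capped)

huber = HuberRegressor(max_iter=1000).fit(X, delay)        # option 2: robust loss
print(ols.coef_, huber.coef_)
```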
Case Study -> Time series. Your manager has asked you to build a high-accuracy model. You start with a decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you try a time series regression model and get higher accuracy than the decision tree. Can this happen? Why?
Time series data often exhibits linearity, whereas a decision tree algorithm works best at detecting non-linear interactions. The decision tree failed to provide robust predictions because it couldn't capture the linear relationship as well as the regression model did. Therefore, a linear regression model can provide robust predictions if the data set satisfies its linearity assumptions.
Case Study - How could you collect and analyze data to use social media to predict the weather?
We can collect social media data using the Twitter, Facebook, and Instagram APIs. Then, for Twitter for example, we can construct features from each tweet: the tweet date, number of favorites and retweets, and of course features created from the tweet content itself. Then use a multivariate time series model to predict the weather.
Case Study - How would you construct a feed to show relevant content for a site that involves user interactions with items?
We can do this by building a recommendation engine. The simplest approach is to show content that is popular with other users, which is still a valid strategy if, for example, the content is news articles. To be more accurate, we can build content-based filtering or collaborative filtering. If there is enough usage data, we can try collaborative filtering and recommend content that similar users have consumed. If there isn't, we can recommend similar items based on a vectorization of the items (content-based filtering).
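A minimal item-based collaborative-filtering sketch in NumPy; the tiny user-item interaction matrix is a hypothetical stand-in.

```python
import numpy as np

# rows = users, columns = items; 1 means the user interacted with the item
R = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]])

# Cosine similarity between item columns.
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)

def recommend(user, k=2):
    """Score unseen items by their similarity to the user's consumed items."""
    seen = R[user] > 0
    scores = sim[:, seen].sum(axis=1)
    scores[seen] = -np.inf                 # don't re-recommend seen items
    return np.argsort(scores)[::-1][:k]

print(recommend(0))   # item 2 ranks above item 3 for user 0
```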
1. What would be good metrics of success for an advertising-driven consumer product? (Buzzfeed, YouTube, Google Search, etc.) A service-driven consumer product? (Uber, Flickr, Venmo, etc.)
Advertising-driven: pageviews and daily active users, CTR, CPC (cost per click); click ads, display ads. Service-driven: number of purchases, conversion rate.
6. A certain metric is violating your expectations by going down or up more than you expect. How would you try to identify the cause of the change?
Break down the KPI into its components and find where the change occurred; then break that component down further by channel, user cluster, etc., and relate it to any campaigns or changes in user behavior in that segment.
9. You are tasked with improving the efficiency of a subway system. Where would you start?
Start by defining efficiency: which metric are we optimizing (wait times, throughput, cost per passenger, on-time rate)?
3. What would be good metrics of success for an e-commerce product? (Etsy, Groupon, Birchbox, etc.) A subscription product? (Netflix, Birchbox, Hulu, etc.) Premium subscriptions? (OKCupid, LinkedIn, Spotify, etc.)
E-commerce: number of purchases, conversion rate; hourly, daily, weekly, monthly, quarterly, and annual sales; cost of goods sold; inventory levels; site traffic; unique versus returning visitors; customer service phone call count; average resolution time. Subscription: churn, CoCA, ARPU, MRR, LTV. Premium subscriptions:
Case Study -> Correlation and causation. Explain the difference and give an example.
Correlation means two variables move together; causation means a change in one directly produces a change in the other. Correlation does not imply causation: e.g. ice cream sales and drowning incidents are correlated because both rise in hot weather (a confounder), not because one causes the other.
15. Let's say that you are scheduling content for a content provider on television. How would you determine the best times to schedule content?
Analyze historical viewership by time slot, day of week, and audience demographics for similar content, and schedule each program in the slots where its target audience watches most.
8. You're a restaurant and are approached by Groupon to run a deal. What data would you ask from them in order to determine whether or not to do the deal?
For similar restaurants (they should define similarity): average increase in revenue per coupon, average increase in customers per coupon, number of meals sold.
4. What would be good metrics of success for a consumer product that relies heavily on engagement and interaction? (Snapchat, Pinterest, Facebook, etc.) A messaging product? (GroupMe, Hangouts, Snapchat, etc.)
Engagement and interaction: active-user (AU) ratios, email summary by type, push notification summary by type, resurrection ratio. Messaging product:
7. Growth for total number of tweets sent has been slow this month. What data would you look at to determine the cause of the problem?
Look at competitors' tweet growth; look at your social media engagement on other platforms; look at your sales data.
2. What would be good metrics of success for a productivity tool? (Evernote, Asana, Google Docs, etc.) A MOOC? (edX, Coursera, Udacity, etc.)
Productivity tool: same as premium subscriptions. MOOC: same as premium subscriptions, plus completion rate.
10. Say you are working on Facebook News Feed. What would be some metrics that you think are important? How would you make the news each person gets more relevant?
Engagement rate for each action, time users stay, CTR for sponsored feed posts. Ref. News Feed Optimization: Affinity score (how close the content creator and the user are); Weight (per edge type: comment, like, tag, etc., with emphasis on features the company wants to promote); Time decay (the older, the less important).
12. You are on the data science team at Uber and you are asked to start thinking about surge pricing. What would be the objectives of such a product and how would you start looking into this?
The objective is to balance rider demand with driver supply in real time: when requests outstrip available drivers, prices rise through a gradual step-function scaling mechanism until the imbalance of requests-to-drivers is alleviated, and then fall again as more drivers come online, enticed by the surge pricing. The algorithm is likely custom-tailored and calibrated to each location, since price elasticities almost certainly vary across cities depending on a huge multitude of variables: income, distance/sprawl, traffic patterns, car ownership, etc. With the massive troves of user data Uber has collected, they most likely have tuned the algorithm for each city to adjust for these varying sensitivities to surge pricing, and with machine learning and such rich data the algorithm can keep evolving.