101 Data Science Interview Questions
What do you think are the important factors in the algorithm Uber uses to assign rides to drivers?
The following list of factors can be used to assign rides to drivers: 1. Drivers who are online at the time of the request. 2. Drivers who have a good reputation (never been rated lower than 3/4 by the passenger making the request). 3. Drivers who are closest to the requesting passenger. 4. Drivers who don't have a destination filter set that excludes the passenger's destination.
How do you handle missing data?
The following techniques can be used to handle missing data: 1. Imputation of missing values depending on whether the data is numerical or categorical. 2. Replacing values with mean, median, mode. 3. Using the average value of K nearest neighbours as an imputation estimate. 4. Using linear regression to predict values.
Data Science workflow...
https://matlabacademy-content.mathworks.com/4.12.2/R2019b/content/Machine%20Learning/Classification/Onramp/Improving%20the%20Model/Images/workflow1.png
Write a function to check whether a particular word is a palindrome or not.
(Using R)
Palindrome <- function(word) {
  rawword <- charToRaw(tolower(word))
  if (identical(rawword, rev(rawword))) {
    print("Palindrome")
  } else {
    print("Not Palindrome")
  }
}

(Using Python)
def Palindrome(word):
    word = word.lower()  # ignore case, matching the R version
    reverse = word[::-1]
    if word == reverse:
        print("Palindrome")
    else:
        print("Not Palindrome")
Why is overfitting a problem in machine learning models? What steps can you take to avoid it?
Overfitting occurs when a model fits the training data too closely. The model is said to have "memorized" the data and performs poorly on unseen data because it has not generalized. An overfitted model learns the training data in such detail that noise and outliers are picked up and learned as if they were real patterns. These patterns do not carry over to unseen data, which reduces the model's predictive ability. Overfitting can be reduced with the following: - Resampling techniques such as k-fold cross-validation, which create multiple train-test splits and give a more reliable estimate of performance on unseen data, - Ensembling techniques, which combine predictions from separate models and reduce variance, - Regularization techniques, which add a penalty to the cost function and make the model less flexible, so that it generalizes better.
Explain the difference between generative and discriminative algorithms.
Suppose we have a dataset with training input x and labels y. A generative model explicitly models the actual distribution of each class. It learns the joint probability distribution p(x, y) and makes predictions by using Bayes' rule to calculate p(y|x). It then picks the most likely label y. Examples of generative classifiers include Naïve Bayes, Bayesian networks and Markov random fields. A discriminative model, by contrast, directly learns the conditional probability distribution p(y|x) or learns a direct map from inputs x to the class labels. This way, it models the decision boundary between the classes. Some popular discriminative classifiers include logistic regression, traditional neural networks and nearest neighbor.
If you are working at Facebook and you want to detect bogus/fake accounts. How will you go about that?
The company can use the stored data to identify inauthentic profiles by looking for patterns, such as repeatedly posting the same thing over and over or a sudden spike in messaging activity. Moreover, if there is an increased number of requests from a particular account then this might be suspicious as well.
What is the law of large numbers?
The law of large numbers states that as the sample size increases, the sample mean approaches the true population mean. When an experiment is repeated many times, each additional trial increases the precision of the average result. Take the example of rolling a die. It has six outcomes labelled 1, 2, 3, 4, 5, 6 with equal probabilities, so the expected value of a roll is 3.5. Now, let's roll the die five times and take the average. If the outcomes are 5, 6, 6, 3, 4 then the average will be 4.8. This result is far from the expected value. According to the law of large numbers, if we roll the die a large number of times, the average result will be close to the expected value of 3.5.
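The die example can be checked with a small simulation. This is only a sketch: the function name `mean_of_rolls` and the fixed seed are assumptions made here for illustration, and the exact averages depend on the random number generator.

```python
import random

def mean_of_rolls(n_rolls, seed=0):
    """Roll a fair six-sided die n_rolls times and return the sample mean."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are reproducible
    total = 0
    for _ in range(n_rolls):
        total += rng.randint(1, 6)  # randint includes both endpoints
    return total / n_rolls

# The sample mean should drift towards 3.5 as the number of rolls grows.
for n in (5, 100, 10_000, 100_000):
    print(n, mean_of_rolls(n))
```

With only a handful of rolls the average can land far from 3.5; with tens of thousands it should sit very close to it.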
What are different metrics to classify a dataset?
The performance metrics for classification problems are as follows: 1. Confusion Matrix 2. Accuracy 3. Precision and Recall 4. F1 Score 5. AUC-ROC Curve. The choice of performance metric depends on the type of question and the dataset. For instance, if the dataset is balanced then accuracy would be a good measure of model performance. A confusion matrix would be a good alternative if you want to know the cost of False Positives and False Negatives.
Discuss how to randomly select a sample from a product user population.
The sampling techniques to select a sample from a product user population can be divided into two categories:
Probability sampling methods
- Simple Random Sampling
- Stratified Sampling
- Clustered Sampling
- Systematic Sampling
Non-Probability sampling methods
- Convenience Sampling
- Snowball Sampling
- Quota Sampling
- Judgement Sampling
Facebook wants to analyze why the "likes per user and minutes spent on a platform are increasing, but total number of users are decreasing". How can they do that?
There can be multiple approaches to answer this question. One way is to gather the context information for this problem. The following factors can be analyzed from the data in order to reach a sound conclusion: Timeline: Is the drop in users a one time event or has it happened progressively? Region: Is the decline in the number of users happening from a specific region? If this is the case, the problem might be related to a country's regulations or a competitive product in that region. Platforms: Is the decline happening on specific platforms, like iOS, Android, or others? If so, then compare the number of users who are leaving on each platform.
Is random weight assignment better than assigning same weights to the units in the hidden layer?
To answer this question, consider what happens when all the weights are assigned the same value. Neural networks use gradient descent to optimize their parameters, so they need an initialization point from which to move towards a local minimum of the cost function. If every unit in a hidden layer starts with the same weights, every unit computes the same output and receives the same gradient, so the units stay identical after each update and the layer behaves like a single unit. Starting from the same fixed point every time also leads to the same result on every run. If the starting weights are random, the units begin in different places and can learn different features, which gives the network a better chance of finding a good local minimum. This technique is known as breaking the symmetry: because the initialization is asymmetric, the network can find various solutions to the same problem.
If the model isn't perfect, how would you like to select the threshold so that the model outputs 1 or 0 for label?
To make a decision, we need to understand the consequences of selecting a particular decision boundary. You need to find out the relative cost of a false positive vs. a false negative. A precision-recall curve of your model can be plotted on your validation data to see how the two trade off at each threshold. For instance, if you accidentally label a true potential customer as negative, you will lose that customer. This analysis helps in deciding the right threshold for the model.
Can you explain the concept of false positive and false negative?
When a model incorrectly predicts an outcome to be positive when it should have been negative, we call it a false positive. Similarly, when a model incorrectly predicts an outcome to be negative when it should have been positive, we call it a false negative. Both counts are used in evaluation metrics for classification algorithms.
What is AUC - ROC Curve?
The AUC (Area Under the Curve) ROC (Receiver Operating Characteristic) curve is used to evaluate or visualize the performance of a classification model, most commonly a binary one. ROC is a probability curve and AUC represents the degree of separability. It tells how capable the model is of distinguishing between classes such as spam/not-spam. The higher the AUC, the better the model is at predicting spam email as spam and non-spam email as non-spam. A highly accurate model has an AUC close to 1, which reflects a good measure of separability. A poor model has an AUC near 0.5, which means it separates the classes no better than random guessing; an AUC near 0 means the model ranks the classes in reverse.
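One way to make the idea of separability concrete is to compute AUC as the fraction of (positive, negative) pairs in which the positive example gets the higher score. The sketch below assumes binary labels coded as 1 and 0; the function name `roc_auc` and the toy scores are made up for illustration, and a library routine would normally be used instead.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
print(roc_auc(labels, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]))  # 1.0: perfect separation
print(roc_auc(labels, [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]))  # 0.5: no separation
print(roc_auc(labels, [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]))  # 0.0: ranking is inverted
```

The pairwise loop is quadratic in the number of examples, which is fine for a demonstration but not for large datasets.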
Write a Python code to return the count of words in a string Q
def count_words(my_string):
    count = len(my_string.split())
    print("The number of words in string are : " + str(count))

count_words(Q)
Write a sorting algorithm for a numerical dataset in Python.
def sort(mylist):
    n = len(mylist)
    for i in range(n):
        for j in range(0, n-i-1):
            if mylist[j] > mylist[j+1]:
                mylist[j], mylist[j+1] = mylist[j+1], mylist[j]
    print(mylist)

sort([80, 55, 70])
How does a neural network with one layer and one input and output compare to a logistic regression?
Neural networks and logistic regression are both used for classification problems. A neural network with a single layer, a single output and a sigmoid activation computes the same function as logistic regression, so logistic regression can be seen as the simplest form of neural network, one that produces straightforward decision boundaries. Neural networks are a superset that includes additional, more complex decision boundaries to cater to more complex and larger data. Logistic regression models cannot capture complex non-linear relationships between the features. Meanwhile, a neural network with hidden layers and non-linear activation functions can capture highly complex features.
Why is regularization used in machine learning models? What are the differences between L1 and L2 regularization?
Regularization is a technique used to reduce the error on unseen data by discouraging overly complex fits to the training set, thereby avoiding overfitting. The key difference between L1 and L2 regularization is the penalty term. Lasso Regression (Least Absolute Shrinkage and Selection Operator), also known as L1, adds the absolute value of each coefficient's magnitude as the penalty term to the loss function. Ridge regression (L2) adds the squared magnitude of each coefficient as the penalty term to the loss function. Another difference between these techniques is that Ridge shrinks the weights of some features to small values, whereas Lasso shrinks the coefficients of the less important features to zero, thus removing some features altogether. Lasso therefore works well for feature selection/dimensionality reduction when we have a huge number of features.
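The two penalty terms can be written side by side. The notation here is an assumption for illustration: Loss(w) is the unregularized loss, w_1, ..., w_p are the model coefficients, and lambda >= 0 controls the strength of the penalty.

```latex
J_{\mathrm{L1}}(w) = \mathrm{Loss}(w) + \lambda \sum_{j=1}^{p} \lvert w_j \rvert
\qquad
J_{\mathrm{L2}}(w) = \mathrm{Loss}(w) + \lambda \sum_{j=1}^{p} w_j^{2}
```

Setting lambda to 0 recovers the unregularized model in both cases; larger values shrink the coefficients more strongly.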
Select all customers who purchased at least two items on two separate days from Amazon.
SELECT Customer_ID,
       COUNT(DISTINCT Item_ID) AS item_count,
       COUNT(DISTINCT Purchase_Date) AS date_count
FROM Purchase_List
GROUP BY Customer_ID
HAVING COUNT(DISTINCT Purchase_Date) >= 2
   AND COUNT(DISTINCT Item_ID) >= 2;
How will you explain JOIN function in SQL in the simplest possible way?
SQL handles queries across more than one table through the use of JOINs. JOINs are clauses in SQL statements that link two tables together, usually based on the common keys that define the relationship between those two tables. The key is the common column between the two tables. There are several types of JOINs:
INNER: It selects all rows from both the tables that meet the required condition.
LEFT: This returns all the rows of the table on the left side of the join and matching rows for the table on the right side of join. In case of no match on right side, the result will contain null.
RIGHT: This returns all the rows of the table on the right side of the join and matching rows for the table on the left side of join. In case of no match on left side, the result will contain null.
FULL: It combines the result of both LEFT and RIGHT JOIN. The result will contain all the rows from both the tables. In case of no matching, the result will contain null.
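The difference between INNER and LEFT joins can be seen on two tiny tables. The following sketch uses Python's built-in sqlite3 module with an in-memory database; the `customers` and `orders` tables and their contents are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 'book'), (11, 1, 'pen'), (12, 2, 'lamp');
""")

# INNER JOIN keeps only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.item
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id, o.order_id
""").fetchall()

# LEFT JOIN keeps every customer; a customer without orders gets NULL (None in Python).
left_rows = conn.execute("""
    SELECT c.name, o.item
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id, o.order_id
""").fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

Cy has no orders, so Cy is absent from the INNER JOIN result and appears with a null item in the LEFT JOIN result.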
How will you cut a circular cake into 8 equal pieces?
Considering cake in 3 dimensions, make 2 symmetrical cuts at top to make 4 quarters and one cut by side to divide each into 2 to make 8 equal pieces.
What's the role of a cost function?
A cost function is used to learn the parameters of a machine learning model such that the total error is as small as possible. It is a measure of how wrong the model is in terms of its ability to estimate the relationship between the dependent and independent variables. This function is typically expressed as a difference between the predicted value and the actual value. Every algorithm can have its own cost function depending on the problem.
What is Map Reduce?
MapReduce enables the processing of large datasets. The dataset is split and mapped into (key, value) pairs. The pairs are then shuffled (sorted) so that like keys are grouped together, and finally reduced (combined) into one result per key.
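A word count is the usual toy illustration of the three phases. The sketch below runs in a single process, so it shows only the data flow, not the distribution across machines; the function names and the input splits are assumptions made for this example.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

splits = ["the cat sat", "the dog sat", "the end"]  # hypothetical input splits
mapped = [pair for split in splits for pair in map_phase(split)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

In a real cluster each split would be mapped on a different node and the shuffle would move pairs across the network before the reducers run.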
How does the Gradient Boosting algorithm work?
Gradient Boosting works by sequentially adding predictors to an ensemble, where each predictor corrects the errors of its predecessor. Here, we re-imagine the boosting problem as an optimization problem, where we take a loss function and try to minimize it. It has 3 core elements: a weak learner to make predictions, a loss function to be optimized, and an additive model that adds weak learners one at a time to minimize the loss function.
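The three elements can be sketched in a few lines for a one-dimensional regression problem. Everything here is an assumption chosen for brevity: squared-error loss (whose negative gradient is simply the residual), single-split "stumps" as the weak learner, and made-up toy data. It is not how production libraries implement boosting.

```python
def fit_stump(xs, residuals):
    """Weak learner: the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Additive model: start from the mean, then add stumps fitted to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
model = gradient_boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(model(1), model(8), mse)
```

The learning rate scales each stump's contribution, so a smaller value needs more rounds but makes each correction more cautious.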
What is Hadoop HDFS?
HDFS is a distributed storage system that spans the Hadoop framework. Large files placed in HDFS will be split into smaller chunks for easier processing. HDFS employs master-slave architecture, that is, namenode and datanode.
What does low accuracy on both the training and test sets mean?
Low accuracy in both your training and testing sets is an indication that your features do not provide enough information to distinguish the different classes. In particular, you might want to look at the data for classes that are frequently confused, to see if there are characteristics that you can capture as new features.
Explain Linear Regression and its assumptions
Linear regression is useful for finding the relationship between two continuous variables. One is the predictor or independent variable and the other is the response or dependent variable.
Assumptions
There are 5 basic assumptions of linear regression:
1. Linear relationship: The relationship between the dependent and independent variables is linear.
2. Multivariate normality: Multiple regression assumes that the residuals are normally distributed.
3. No or little multicollinearity between the independent variables.
4. No autocorrelation: Autocorrelation occurs when the residuals are correlated with one another, for example when each value in a time series depends on the previous one. It violates the assumption that observations are independent, which underlies most conventional models.
5. Homoscedasticity: The variance around the regression line is the same for all values of the predictor variable.
What is the difference between Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP)?
Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) are both methods for estimating some variable in the setting of probability distributions or graphical models. MAP usually comes up in a Bayesian setting because, as the name suggests, it works on a posterior distribution and not only on the likelihood like MLE. If you have any useful prior information, then the posterior distribution will be more informative than the likelihood function. Comparing the MLE and MAP equations, the only difference is the inclusion of the prior P(θ) in MAP; otherwise they are identical. This means that the likelihood is now weighted by the prior information.
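For reference, the two estimators can be written as follows. The symbol D for the observed data is notation assumed here; P(θ) is the prior mentioned in the answer.

```latex
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \; P(D \mid \theta)
\qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \; P(D \mid \theta)\, P(\theta)
```

With a uniform prior, P(θ) is constant and the MAP estimate coincides with the MLE.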
How will you inspect missing data and when are they important for your analysis?
Missing data can be inspected using various techniques depending on the language being used. In Python, the isnull function can be used to find the missing data that is marked with NaN, whereas in R, missing values can be identified using the is.na function. In several cases, summary statistics can point out missing data in quantitative variables where they have been recorded as zero and zero would be abnormal, for example a 0 in the 'age of employee' variable. Missing values can reduce the fit of our model and hurt its performance because they can make the data biased, which can lead to wrong predictions. However, missing data can also tell us why certain variables were difficult to collect. Missingness can be correlated with other variables or with the missing value itself, in which case the data is considered missing not at random. Missing values can also indicate whether a variable was appropriately gathered, for example a personal question in a survey that participants tend to skip.
If you're faced with Selection Bias, how will you avoid it?
Selection bias occurs during sampling of the population. It's when a selected sample does not represent the characteristics of the population. The following are three types of selection bias: 1. Undercoverage: Happens when some members of the population are inadequately represented in the sample. This problem usually occurs while doing convenience sampling. 2. Voluntary Response Bias: Happens when members are self-selected volunteers who are strongly opinionated and the resulting sample tends to overrepresent these individuals. 3. Nonresponse Bias: Non response happens when there's a significant difference between those who responded to the survey and those who did not. This may happen for a variety of reasons such as some people refused to participate or some people simply forgot to return the surveys. To avoid selection bias, use random sampling. The following are some of the choices for sampling: - Simple Random Sampling - Stratified Random Sampling
How would you select a representative sample of search queries from 5 million queries?
Some key features need to be kept in mind while selecting a representative sample. Diversity: A sample must be as diverse as the 5 million search queries. It should be sensitive to all the local differences between the search query and should keep those features in mind. Consistency: We need to make sure that any change we see in our sample data is also reflected in the true population which is the 5 million queries. Transparency: It is extremely important to decide the appropriate sample size and structure so that it is a true representative. These properties of a sample should be discussed to ensure that the results are accurate.
When using the Gaussian mixture model, how do you know it's applicable?
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. In this approach we describe each cluster by its centroid (mean), covariance, and the size of the cluster (weight). Therefore, based on this definition, a GMM is applicable when we have reason to believe that the data points come from a mixture of Gaussian distributions and form clusters with different means and standard deviations.
Explain a probability distribution that is not normal and how to apply that?
A Poisson distribution is a discrete probability distribution that gives the probability of a given number of events occurring in a fixed interval of time when you know how often the event occurs on average. Examples of Poisson-distributed quantities include the number of phone calls received by a call center per hour and the number of decay events per second from a radioactive source. The Poisson distribution is applied in the following scenarios: - If the event is possible to count and can be counted in whole numbers - If the average frequency of occurrence for the time period in question is known - When the occurrences are independent
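The call-center example can be worked through with the Poisson probability mass function, P(X = k) = e^(-λ) λ^k / k!. In the sketch below the average rate of 4 calls per hour is a made-up figure, and `poisson_pmf` is a name chosen here for illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with average rate lam per interval."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 4  # hypothetical average of 4 calls per hour
for k in range(9):
    print(k, round(poisson_pmf(k, lam), 4))

# Probability of receiving at most 2 calls in an hour.
at_most_two = sum(poisson_pmf(k, lam) for k in range(3))
print(at_most_two)
```

Summing the probabilities over all possible counts gives 1, which is a quick sanity check on the formula.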
What is the ROC curve and the meaning of sensitivity, specificity, confusion matrix?
A Receiver Operating Characteristic (ROC) curve plots the True Positive Rate against the False Positive Rate at different classification thresholds. It tells us how good a model is at classification. Therefore, curves of different models can be compared directly, either in general or at specific thresholds. Sensitivity is a measure of correctly identified positives over all positives. It's also called the true positive rate and can be expressed through the following equation: Sensitivity = true positives/(true positives + false negatives). Specificity is a measure of correctly identified negatives over all negatives. It's also called the true negative rate and can be expressed through the following equation: Specificity = true negatives/(true negatives + false positives). A confusion matrix is a table that summarizes the performance of a classification algorithm. It gives the 4 different combinations of predicted and actual values, with True Positive and False Positive in the first row and False Negative and True Negative in the second row.
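The two equations can be applied directly to the four cells of a confusion matrix. This is a minimal sketch assuming labels coded as 1 (positive) and 0 (negative); the toy labels and function names are invented for the example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def sensitivity(tp, fn):
    """True positive rate: correctly identified positives over all positives."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: correctly identified negatives over all negatives."""
    return tn / (tn + fp)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)        # 3 1 1 5
print(sensitivity(tp, fn))   # 0.75
print(specificity(tn, fp))
```

Each point on an ROC curve corresponds to one such pair, plotted as sensitivity against 1 minus specificity.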
In what aspects is a box plot different from a histogram?
A boxplot is a standardized way of displaying the distribution of data based on the following: Minimum, First Quartile, Median, Third Quartile, and Maximum. Each box on the plot shows the range of values from the first quartile at the bottom of the box to the third quartile at the top of the box. A line in the middle of the box marks the median of all the values. Whiskers on the plot display the maximum and minimum. It gives information about the variability and dispersion of the data. It also displays outliers and tells us about the symmetry and skewness of the data. A histogram shows the frequency distribution of continuous data using rectangles. It is similar in appearance to a bar graph, but the bars are adjacent. Data is split into intervals and the frequency of instances in each interval is plotted. It can tell us about the distribution of the data (i.e. normal or not) and its skewness, and lets us know about the presence of outliers.
What are the core steps of the data analysis process?
Data analysis is a process in which we inspect, cleanse, transform and model data to discover useful information, draw conclusions and support decision-making towards a certain goal. With the right data analysis process and tools, what was once an overwhelming volume of disparate information becomes a simple, clear decision point. A data analysis process consists of the following phases, which are iterative in nature:
- Setting of goals: This is the first step of the process. The business unit must decide on objectives for the data science team(s) and set clear goals to steer the team(s). The objectives defined in this step will be vital for the next step.
- Data Gathering: Data gathering is the process of gathering information on required variables. The emphasis is on ensuring accurate and honest collection of data. Data is collected from various sources ranging from organizational databases to the information in web pages.
- Data Processing: This step requires organizing and structuring data in a proper format to simplify the approach in the upcoming steps. It also involves encoding and standardizing variables for better interpretation.
- Data Cleaning: This is the process where you find, change or remove any incorrect or redundant data. Data scientists correct spelling mistakes, handle missing data and weed out nonsense information. This is the most critical step in the data value chain.
- Data Analysis: In this step, various data analysis techniques can be used to understand, interpret, and derive conclusions based on the requirements. Here we can explore the data, find correlations amongst features and identify the relevance of each feature to the problem.
- Result interpretation: Once the data has been sorted and analyzed, it can be interpreted. It is important to know whether the data answers your original question and helps in defending against any objections. For these steps, we can use machine learning algorithms as well as descriptive and inferential statistics.
- Communication of Results: This is the last step of the process and can be called storytelling. Having understood the problem and reached a conclusion, we communicate it to other teams and management using visualization tools.
How does caching work and how do you use it in Data science?
It is often necessary to save various data files when the process of loading and/or manipulating data takes a considerable amount of time. That's where caching comes in handy. There will be caching on the server where already computed elements may not need to be recomputed. When you want to access some data that is expensive to look up (in terms of time/resources), you cache it so that the next time you want to look up that same data, it's much less expensive and time efficient. Caching also enables content to be retrieved faster because an entire network round trip is not necessary. Caches like the browser cache can make information retrieval nearly instantaneous.
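In Python, one simple in-memory form of caching is the standard library's `functools.lru_cache` decorator. The sketch below is an illustration only: `expensive_lookup` is a hypothetical stand-in for a slow query or computation, and the counter exists just to show when the slow path actually runs.

```python
from functools import lru_cache

call_count = 0  # counts how often the slow path actually runs

@lru_cache(maxsize=128)
def expensive_lookup(key):
    """Stand-in for a slow query or computation (hypothetical)."""
    global call_count
    call_count += 1
    return sum(i * i for i in range(key))

first = expensive_lookup(10_000)   # computed on the first call
second = expensive_lookup(10_000)  # served from the cache, no recomputation
print(first == second, call_count)
print(expensive_lookup.cache_info())
```

The `maxsize` argument bounds memory use by evicting the least recently used entries; caching intermediate datasets to disk follows the same idea at a larger scale.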
What are time series forecasting techniques?
The following are some of the most common time series methods: 1. Simple moving average: A simple moving average (SMA) is the simplest type of forecasting technique. It is calculated by adding up the values of the last 'n' periods and then dividing that number by 'n'. The moving average value is then used as the forecast for the next period. 2. Exponential Smoothing: Exponential Smoothing assigns exponentially decreasing weights as the observations get older. 3. Autoregressive Integrated Moving Average (ARIMA): This is a statistical technique that uses time series data to predict the future. The parameters used in ARIMA are (p, d, q), which refer to the autoregressive, integrated and moving average parts of the model, respectively. ARIMA modeling handles the trends, seasonality, cycles, errors and non-stationary aspects of a data set when making forecasts. 4. Neural networks: They are also used for time series forecasting. There is an increasing interest in using neural networks to model and forecast time series.
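The first two methods are short enough to sketch directly. The function names, the toy series and the smoothing factor `alpha` below are assumptions for illustration; the exponential smoothing shown is the simple (single) variant with the first observation as the starting level.

```python
def simple_moving_average_forecast(series, n):
    """Forecast the next period as the mean of the last n observations."""
    if n <= 0 or n > len(series):
        raise ValueError("n must be between 1 and len(series)")
    return sum(series[-n:]) / n

def exponential_smoothing_forecast(series, alpha):
    """Simple exponential smoothing; alpha in (0, 1], higher values favor recent data."""
    level = series[0]  # start from the first observation
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

sales = [10, 20, 30, 10]  # hypothetical observations
print(simple_moving_average_forecast(sales, 3))    # (20 + 30 + 10) / 3 = 20.0
print(exponential_smoothing_forecast(sales, 0.5))  # 16.25
```

A larger window `n` or a smaller `alpha` gives a smoother forecast that reacts more slowly to recent changes.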
How will you decide whether a customer will buy a product today or not given the income of the customer, location where the customer lives, profession, and gender? Define a machine learning algorithm for this.
This is a classification problem. To solve it, the dataset will first be collected and stored. It will be cleaned and pre-processed for any abnormalities or missing values. After this it will be subject to feature engineering, which may include encoding categorical values and normalizing numerical values if required. The dataset would then be divided into train and test sets, using K-fold validation with k set to an appropriate number of folds. The train set would be used to fit a classification model such as a Logistic Regression, Decision Tree Classifier or Support Vector Classifier with appropriate parameters. The fitted model will then be evaluated against the test set using an appropriate evaluation metric such as accuracy or F1 score.
How will you design the heatmap for Uber drivers to provide recommendation on where to wait for passengers? How would you approach this?
To design the heatmap, some pointers are listed as follows: - You can use k-means clustering to group customers' previous journeys by area. This will give a fair idea of where potential rides are likely to start. - Perform exploratory data analysis on how long it took a driver to find the client once they arrived at the pick-up location. Shortlist the locations with the shortest pickup times. - Additionally, the model can use maps to identify whether it is practical to pick up people at those points. For instance, it would be inconvenient to pick up people in congested areas, so a nearby pickup point should be suggested to ensure efficiency and quick service.
How do you detect if a new observation is an outlier?
To detect outliers, the following approaches can be used: - Boxplot (box-and-whisker plot): Any value above the upper limit or below the lower limit of the plot is flagged as an outlier. The limits are conventionally placed 1.5 times the interquartile range beyond the first and third quartiles. Data that lies within the lower and upper limits is treated as normal and can be used for further analysis. - Standard deviation: Find the points that lie more than 3 standard deviations away from the mean. The so-called "three-sigma rule of thumb" expresses a conventional heuristic that nearly all values lie within three standard deviations of the mean. - Clustering: Use K-means or Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to detect outliers.
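The boxplot and standard deviation rules can be sketched with Python's statistics module. The 1.5 x IQR whisker limits used here are the common convention and an assumption of this sketch, as are the function names and the toy data. Note that `statistics.quantiles` uses one of several possible quartile definitions, so limits may differ slightly from other tools.

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Points outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whisker limits."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

def sigma_outliers(data, n_sigma=3):
    """Points more than n_sigma standard deviations away from the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population standard deviation
    return [x for x in data if abs(x - mean) > n_sigma * sd]

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
print(iqr_outliers(data))    # [95]
print(sigma_outliers(data))  # [95]
```

On small samples the three-sigma rule is weak because a single extreme value inflates the standard deviation itself, so the IQR rule is often the more robust of the two.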
Explain advantages and drawbacks of Support Vector Machines (SVM).
Advantages - It has a regularization parameter, which can be tweaked to avoid over-fitting. - SVM uses the kernel trick, so you can build a modified version of a model depending on the problem complexity. - It works well with less data. Disadvantages - Choosing the right kernel for the problem type is tricky. Kernel models are usually quite sensitive to over-fitting, so a lot of knowledge is required to make sound decisions. - It is difficult to tune the hyperparameters so that the error is at its minimum.
What is K-means?
K-means is an unsupervised clustering algorithm. It first picks k centroids at random, then assigns every data point to its nearest centroid and moves each centroid to the mean of the points assigned to it, repeating these two steps until the assignments stop changing. The aim is to keep the distances between the points and their cluster's centroid as small as possible. The result is that the unlabelled input data is grouped into distinct clusters.
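A bare-bones version of the algorithm for two-dimensional points might look like the sketch below. It assumes the common choice of initial centroids drawn at random from the data and squared Euclidean distance; the function name, the seed and the toy points are made up for illustration.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Basic k-means on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids drawn from the data
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append((sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:  # converged: centroids no longer move
            labels = new_labels
            break
        centroids, labels = new_centroids, new_labels
    return centroids, labels

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Because the starting centroids are random, different seeds can give different cluster numbering, and on harder data different clusterings, which is why the algorithm is often run several times.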
Why Rectified Linear Unit/ReLU is a good activation function?
ReLUs are better for training deep neural networks than the traditional sigmoid or hyperbolic tangent activation functions because they help address the problem of vanishing gradients. The vanishing gradient problem occurs during back-propagation, when the gradients tend to get smaller as they are propagated backwards through the layers of the network. With sigmoid or tanh, learning is very slow for inputs of large magnitude because the activation saturates and its gradient in those regions is close to 0. During back-propagation, this local gradient is multiplied by the gradient flowing back from the layers after it. Hence, if the local gradient is very small, the gradients slowly vanish, and almost no signal flows through the neuron to its weights. The gradient of a ReLU is 1 for every positive input, so it does not saturate in this way, and ReLUs are faster to train. They are typically used in the hidden layers of deep neural networks.
Define variance.
Variance of a distribution is a measure of the variability of data. It measures how far a set of (random) numbers are spread out from their average value. It can be formulated as the average of the squared differences from the mean.
Are boosting algorithms better than decision trees? If yes, why?
Often, yes: boosting algorithms usually perform better than a single decision tree. Boosting algorithms combine several weak learners into one strong one. After building a model from the training data, they create a sequence of models that attempt to fix the errors of the models before them. Models are added until the training set is predicted perfectly or a maximum number of models has been added. During this process, when an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. This process converts weak learners into better performing models. The results are combined to create a final output prediction.
Write a program to generate the Fibonacci sequence.
(Using R)
Fibonacci <- function(n) {
  if (n < 1) {
    print("Invalid Input")
  } else if (n == 1) {
    print(0)
  } else {
    a <- 0
    b <- 1
    print(a)
    print(b)
    # seq_len(n - 2) is empty when n == 2, so no extra terms are printed
    for (i in seq_len(n - 2)) {
      total <- a + b
      print(total)
      a <- b
      b <- total
    }
  }
}
Fibonacci(8)

(Using Python)
def Fibonacci(n):
    if n < 1:
        print('invalid input')
    elif n == 1:
        print(0)
    else:
        a = 0
        b = 1
        print(a)
        print(b)
        for i in range(n - 2):
            total = a + b  # named total to avoid shadowing the built-in sum()
            print(total)
            a = b
            b = total

Fibonacci(8)
Coding test: moving average Input 10, 20, 30, 10, ... Output: 10, 15, 20, 17.5, ...
(Using R)
moving_avg <- function(mylist) {
  mysum <- 0
  for (i in 1:length(mylist)) {
    mysum <- mylist[i] + mysum
    avg <- mysum / i
    print(avg)
  }
}
moving_avg(c(10, 20, 30, 10))

(Using Python)
def moving_avg(mylist):
    mysum = 0
    for i in range(len(mylist)):
        mysum += mylist[i]
        avg = mysum / (i + 1)
        print(avg)

moving_avg([10, 20, 30, 10])
How do you find percentile? Write the code for it.
(Using R)
percentile <- function(data, score) {
  size <- length(data)
  sorted <- sort(data)
  score_index <- match(score, sorted) - 1
  perc <- (score_index / size) * 100
  print(perc)
}
percentile(c(80, 55, 70, 44, 33, 21, 65, 90, 12, 18), 55)

(Using Python)
def percentile(data, score):
    size = len(data)
    score_index = sorted(data).index(score)
    perc = (score_index / size) * 100
    print(perc)

percentile([80, 55, 70, 44, 33, 21, 65, 90, 12, 18], 55)
Explain the working of decision trees.
A decision tree classification algorithm uses a training dataset to stratify or segment the predictor space into multiple regions. Each region contains only a subset of the training dataset. To predict the outcome for a given (test) observation, we first determine which of these regions it belongs to. Once its region is identified, its outcome class is predicted as the mode (most common class) of the training observations in that region. The rules used to stratify the predictor space can be drawn as a tree-like flow chart, hence the name of the algorithm. The only difference is that these trees are drawn upside down, with the root at the top. Decision tree classification models can easily handle qualitative predictors without the need to create dummy variables, and many implementations can also cope with missing values. Decision tree algorithms are used for regression as well: the same library that you would use to build a classification model can usually build a regression model after changing some of the parameters. However, one major problem with decision trees is their high variance.
Is it necessary to use activation functions in neural networks?
Activation functions are essential for learning and modelling complex data and its relationships. These functions add non-linearity to the network. Without an activation function, the network maps the input to the output with a linear function, which is just a polynomial of degree one, no matter how many layers it has. Why is that a problem? Linear functions cannot capture complex functional mappings of the data. Non-linear activation functions make this possible, which lets neural networks model a wide range of real-world data.
What is the difference between bagging and boosting?
Bagging and Boosting are similar in that they are both ensemble techniques, where a set of weak learners is combined to create a strong learner that performs better than any single one. In Bagging, the models are trained independently (and can be trained in parallel), each on a bootstrap sample of the data. The outputs are then aggregated at the end without preference for any model. Meanwhile, boosting is all about "teamwork": the models are trained sequentially, and each model's errors determine what the next model focuses on, typically by giving more weight to the observations it got wrong. Which technique to use depends on the data.
Explain Euclidean distance.
Euclidean distance is the straight-line distance between two points P and Q. It stems from the Pythagorean theorem: in 2-dimensional space, the distance from P to Q is the length of the hypotenuse of a right triangle whose other two sides are the differences along each axis. In n-dimensional space, the Euclidean distance generalizes to the following formula: d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}. In machine learning, Euclidean distance can be used to measure the "similarity" between two vectors. It is used by several classification and clustering algorithms.
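The formula can be sketched directly in Python for points given as equal-length tuples. The function name is illustrative.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of the same dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# The classic 3-4-5 right triangle: the hypotenuse is 5.0
print(euclidean_distance((0, 0), (3, 4)))
```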
What is HBase?
HBase is a column-oriented database, categorized as a NoSQL database, that sits on top of HDFS. In terms of the CAP theorem it is a CP system: it favors consistency and partition tolerance. Column data is stored and accessed as key-value pairs.
Why is Database Normalization Important?
Database normalization is a process used to organize a database into tables and columns. This procedure helps to achieve the following:
- Each piece of data is stored in only one place, which ensures consistency
- Duplicate records are removed
- Data modification issues are minimized
- Querying the database is simplified
What's the problem of exploding gradients in machine learning?
Exploding gradients arise while training neural networks, when the gradients are propagated back through the layers. Because of the chain rule, these gradients are repeatedly multiplied together through matrix multiplication. If the factors are larger than 1, the product grows exponentially with depth and eventually blows up, resulting in an unstable network. This hinders the learning process. The values of the weights can become so large that they overflow, which results in NaN/undefined values.
You have 2 dice. What is the probability of getting at least one 4? Also find out the probability of getting at least one 4 if you have n dice.
For 2 dice, the probability of getting at least one four is: P(at least one four) = 1 - P(no four) = 1 - (5/6)(5/6) = 1 - (5/6)^2 = 11/36. Following the same pattern, the probability with n dice is: P(at least one four) = 1 - P(no four) = 1 - (5/6)^n.
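The complement rule is easy to check with exact fractions. A minimal sketch (the function name is made up for illustration):

```python
from fractions import Fraction

def p_at_least_one_four(n_dice):
    """1 minus the probability that none of the n dice shows a four."""
    return 1 - Fraction(5, 6) ** n_dice

print(p_at_least_one_four(2))  # 11/36
```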
What techniques can be used to evaluate a Machine Learning model?
Machine Learning algorithms can be evaluated using various metrics depending on the nature of the problem and the type of model used. Following are some of the techniques used to evaluate regression and classification models respectively:
Regression:
- Mean Absolute Error
- Mean Squared Error
- R square
- Adjusted R square
- Root Mean Squared Logarithmic Error
Classification:
- Classification Accuracy
- Logarithmic Loss
- Precision
- Recall
- F1 Score
- Confusion Matrix
- Receiver Operating Characteristics (ROC) curve
- Area under Curve (AUC)
- Gini coefficient
What is the importance of Markov Chains in Data Science?
Markov Chains can be used in marketing analytics, among other areas. A Markov Chain is a stochastic model describing a sequence of possible events that are probabilistically related to each other. The probability of the upcoming event depends only on the present state and not on the previous states. This is called the memoryless property: the model disregards past events and uses only the present information to predict what happens in the next state. For instance, imagine you have an online selling platform and you would like to know whether a customer is at the stage where they are considering buying a product. Stages like this are the states a customer can be in at any point in their purchase journey. A Markov Chain provides information about the current state and the transition probabilities of moving from one state to another. As a result, we can predict the next stage. In this case, we can predict how likely a customer is to buy the specified product.
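The purchase-journey idea can be sketched as a small transition table. The three states and every probability below are hypothetical numbers chosen for illustration, not real data.

```python
# Hypothetical states and transition probabilities (each row sums to 1).
TRANSITIONS = {
    "browsing":    {"browsing": 0.6, "considering": 0.3, "purchased": 0.1},
    "considering": {"browsing": 0.2, "considering": 0.5, "purchased": 0.3},
    "purchased":   {"browsing": 0.0, "considering": 0.0, "purchased": 1.0},
}

def step(distribution):
    """Advance a probability distribution over states by one transition.

    Only the current distribution is needed: the memoryless property.
    """
    nxt = {state: 0.0 for state in TRANSITIONS}
    for state, prob in distribution.items():
        for target, p in TRANSITIONS[state].items():
            nxt[target] += prob * p
    return nxt

current = {"browsing": 1.0, "considering": 0.0, "purchased": 0.0}
after_one = step(current)
after_two = step(after_one)
print(after_one)
print(after_two)
```

Under these made-up numbers, a customer who starts out browsing has about a 25% chance of having purchased after two steps.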
How do you solve for multicollinearity?
Multicollinearity occurs when independent variables in a regression model are correlated with each other. This correlation is a problem because it makes the coefficient estimates unstable and hard to interpret. To solve this issue, remove highly correlated predictors from the model. If you have two or more correlated variables, keep only one of them, since they supply redundant information. Regularization can also mitigate the problem because it stabilizes the regression coefficients. You can also use Principal Component Analysis (PCA) to replace the correlated predictors with a smaller set of uncorrelated components.
How would you describe Data Science to a Business Executive?
Typically, a business executive approaches data scientists with an abstract business problem, the end goals, and the general guidelines from the client. It is the responsibility of a data scientist to help them define and formulate the problem and come up with a game plan. A data scientist will assist and interact with a business executive in the following ways: - Help them translate the business requirements into a machine learning product. - Communicate the strategy and project plan to the business executives to reach a goal defined by the client. Data scientists are trained to identify data patterns and trends. They use statistical and big data methodologies for model training and development. - Effectively present the analytic findings and insights to the business executives to help them make decisions. A business executive must define new requirements or changes in a project from a functional perspective, whereas a data scientist handles the data and is expected to know the latest tools for drawing sound conclusions through statistics in order to deliver the project goals.
Describe Binary Classification.
Binary classification is the process of assigning each data point to one of two classes. These classes are also known as targets or labels. This method of predictive modeling approximates a mapping function (f) from input variables (X) to discrete output variables (y). For example, spam detection by email service providers is a classification problem. It is binary classification since there are only 2 classes: spam and not spam.
How do you find the F1 score after a model is trained?
F1 score is an evaluation metric for classification algorithms that is derived from precision (ratio of true positives to all positives labeled by the algorithm) and recall (ratio of true positives to all positives in reality). F1 score is the harmonic mean of precision and recall. It seeks a balance between the two, which is especially useful with uneven class distributions. It can be formulated as: F1 = 2 * (precision * recall) / (precision + recall). F1 score reaches its best value, 1, when precision and recall are both perfect, and its worst value, 0, when either of them is 0. It tells you how precise your classifier is (how many of the instances it flags are correct), as well as how robust it is (how few positive instances it misses).
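A minimal sketch of the calculation from confusion-matrix counts follows. The function name and the example counts are assumptions for illustration; libraries such as scikit-learn provide their own implementations.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # precision and recall are both 0 (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# precision = 8/10, recall = 8/12, so F1 = 8/11 (about 0.727)
print(f1_score(tp=8, fp=2, fn=4))
```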
How do you optimize marketing spend between various marketing channels?
Finding the ideal marketing strategy is a skill. The most important thing is to choose the set of metrics you will use to determine which channels get more investment. This will help you make sound decisions about which marketing campaigns work and which ones aren't successful.
Explain how SVM works.
An SVM (Support Vector Machine) is a classification and regression algorithm. It works by identifying a hyperplane that separates the classes in the data. A hyperplane is a geometric entity with one dimension fewer than its surrounding (ambient) space. If an SVM is asked to classify a two-dimensional dataset, it does so with a one-dimensional hyperplane (a line); classes in 3D data are separated by a 2D plane, and N-dimensional data is separated by an (N-1)-dimensional hyperplane. An SVM is also called a maximum-margin classifier because it chooses the hyperplane that leaves the widest margin between the classes.
Explain string parsing in the R language.
A collection of combined letters and words is called a string. Whenever you work with text, you need to be able to concatenate words (string them together) and split them apart. In R, you use the paste() function to concatenate and the strsplit() function to split.
Define the Central Limit Theorem (CLT) and its application.
The CLT states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution is. To make statistical inferences about data, it is important to understand the Central Limit Theorem. The theorem gives us the ability to quantify the probability that a random sample will deviate from the population without having to take a new sample to compare it with. Because of this theorem, we don't need the characteristics of the whole population to understand how likely our sample is to be representative of it. Confidence intervals, hypothesis testing, and p-value analysis are based on the CLT. In a nutshell, the CLT lets us make inferences about a population from a sample.
How do you weigh 9 marbles three times on a balance scale to select the heaviest one?
Divide the 9 marbles into 3 groups of 4, 4 and 1 marbles respectively. Weigh the 2 groups of 4 marbles against each other. If the scale is balanced, the single marble in the last group is the heaviest. If one of the groups is heavier, take that group and divide it into 2 groups of 2 marbles each. Weigh them and pick the heavier group. The scale has now been used twice and we're left with 2 marbles. Weigh those 2 marbles against each other to find the heaviest marble. (Splitting the marbles into three groups of 3 instead would find the heaviest one in only two weighings.)
What is the difference between DDL, DML, and DCL?
DDL: stands for Data Definition Language. It consists of commands such as CREATE, DROP, ALTER, and TRUNCATE, which define and change the structure of database objects. DML: stands for Data Manipulation Language. It consists of commands such as SELECT, INSERT, UPDATE, and DELETE, which work on the data itself. DCL: stands for Data Control Language. It consists of commands such as GRANT and REVOKE, which control users' access rights over the database.
How would you handle NULLs when querying a data set?
In a relational database, null means that no entry has been made for that cell: either the value exists but is unknown, or there is no information about whether a value exists at all. A null is not the same as 0 or blank. SQL reserves the NULL keyword to denote an unknown or missing value. It is extremely important to handle null values in arithmetic operations, because if a null is used in any of these operations the result is also null, which can be hard to trace. In queries, test for nulls with IS NULL / IS NOT NULL rather than =, and use a function such as COALESCE to substitute a default value. Understanding how nulls behave helps you create meaningful databases and design efficient queries in your database applications.
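The behaviour of NULL in arithmetic and comparisons can be tried out with Python's built-in `sqlite3` module. This is a sketch using an in-memory SQLite database; the `orders` table and its columns are hypothetical, and other database engines may differ in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, price REAL, discount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100.0, 10.0), (2, 50.0, None)],  # None is stored as NULL
)

# Arithmetic with NULL yields NULL; COALESCE substitutes a default first.
rows = conn.execute(
    "SELECT id, price - discount, price - COALESCE(discount, 0) "
    "FROM orders ORDER BY id"
).fetchall()
print(rows)

# NULL never compares equal to anything, so test for it with IS NULL.
missing = conn.execute(
    "SELECT id FROM orders WHERE discount IS NULL"
).fetchall()
print(missing)
conn.close()
```

For order 2 the plain subtraction comes back as `None`, while the COALESCE version returns the undiscounted price.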
What's the difference between convex and non-convex cost function?
In terms of cost functions, a convex function is one in which any local minimum is also the global minimum. With a convex cost function, an optimization algorithm won't get stuck in a local minimum that isn't the global minimum, so it can reliably converge to the global minimum. An example of such a cost function is x^2. A non-convex function can have multiple local minima, or locally optimal points. Its shape can be visualized as wavy, with multiple 'valleys' that depict local minima. Algorithms can get stuck in a local minimum, and it can take a lot of time to determine whether the solution found is the global one. An example of such a cost function is x^6 + x^3 - x^2.
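The two example functions can be compared with plain gradient descent. This is a sketch: the learning rate, step count and starting points are arbitrary choices for illustration.

```python
def gradient_descent(grad, x, learning_rate=0.01, steps=2000):
    """Repeatedly step against the gradient, starting from x."""
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

def convex_grad(x):
    """Derivative of x^2."""
    return 2 * x

def nonconvex_grad(x):
    """Derivative of x^6 + x^3 - x^2."""
    return 6 * x**5 + 3 * x**2 - 2 * x

# Convex: both starting points reach the same (global) minimum at 0.
print(gradient_descent(convex_grad, -1.5), gradient_descent(convex_grad, 1.0))

# Non-convex: the result depends on where the search starts.
left = gradient_descent(nonconvex_grad, -1.5)
right = gradient_descent(nonconvex_grad, 1.0)
print(left, right)
```

On x^6 + x^3 - x^2 the two runs settle in different valleys, near x = -0.95 and x = 0.52; only the left one is the global minimum.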
What do you know about LSTM?
Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture. LSTM tackles the problem of long-term dependencies in plain RNNs, which give accurate predictions from recent information but struggle to use information from much earlier in the sequence. LSTM explicitly introduces a memory unit, called the cell, into the network, which lets it retain information for long periods of time. A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. Each unit makes decisions by considering the current input, the previous output and the previous memory, and it generates a new output and alters its memory. LSTM is used for processing, predicting and classifying time series data. Unlike standard feedforward neural networks, LSTM has feedback connections. It can process not only single data points (such as images), but also entire sequences of data (such as speech or video).
Can you explain what MapReduce is and how it works?
MapReduce is a programming model for distributed computation over huge amounts of data. It is used to split and process terabytes of data in parallel, achieving quicker results, which makes it easy to scale data processing over multiple computing nodes. The processing happens in a map function and a reduce function. Map takes a set of data and converts it into another set of data, where individual elements are broken down into key/value pairs (tuples). Reduce takes the output of the map step as its input and combines those tuples into a smaller set of tuples.
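The classic illustration is a word count. The sketch below runs in a single Python process and only mimics the three stages (map, group by key, reduce); a real framework would distribute them across nodes. All function names here are made up for the example.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single pair."""
    return key, sum(values)

documents = ["the cat sat", "the dog sat", "the end"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```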
Before building any model, why do we need the feature selection/engineering step?
The features used to train your machine learning models have a huge influence on the performance of the model. Some features will be more influential than others on model accuracy. Irrelevant features can increase the complexity of the model and add noise to the data, which can negatively impact model performance. Features may be redundant if they're highly correlated with another feature; such features can be removed from the data set with little or no loss of information. Feature selection methods can be used to identify and remove attributes that don't contribute to the accuracy of a predictive model. Moreover, variable selection helps counter the curse of dimensionality. Reducing the number of features means training the model requires less memory and computational power, leading to shorter training times and also reducing the common problem of overfitting.
There are 6 marbles in a bag, 1 is white. You reach in the bag 100 times. After drawing a marble, it is placed back in the bag. What is the probability of drawing the white marble at least once?
The probability of drawing the white marble at least once is the complement of the probability of never drawing it. Therefore, we calculate the probability of drawing a non-white marble 100 times in a row and subtract it from 1: P(White at least once) = 1 - [P(Non-white marble) ^ 100] = 1 - [(5/6) ^ 100]. Since (5/6) ^ 100 is about 1.2 x 10^-8, the answer is approximately 0.99999999.
How do you split your data between training and validation?
The data can be split into training and validation sets on the following 2 principles. First, ensure the validation set is large enough to yield statistically meaningful results. Second, the validation set should be representative of the data set as a whole. In other words, don't pick a validation set with different characteristics than the training set. A good way to achieve this is k-fold cross-validation. This method makes multiple splits of the dataset into train and validation sets. This offers various samples of the data and reduces the chance of overfitting to one particular split.
How do you show that males are on average taller than females knowing just gender and height?
We can use the concept of null and alternate hypotheses, which is the basis of statistical significance testing. First, compare the sample mean of the male heights with the sample mean of the female heights. The null hypothesis states that the mean female height and the mean male height are the same. The alternate hypothesis states that the mean male height is greater than the mean female height. A one-tailed hypothesis test can be used to reject, or fail to reject, the null hypothesis. P-value analysis tells us whether the observed difference is statistically significant.
Why is gradient checking important?
Gradient checking is a method for verifying the derivatives computed by a back-propagation implementation. Implementations of back-propagation are prone to bugs and errors. Therefore, before running the neural network on training data, it's necessary to check that our implementation of back-propagation is correct. Gradient checking does this by comparing the back-propagation gradients, which are obtained analytically from the loss function, with a numerically approximated gradient for each parameter. If the two agree closely, our confidence in the correctness of the code increases significantly. This matters because back-propagation can have many subtle bugs: it could look like it's working, and the cost function may even decrease on every iteration of gradient descent, yet the resulting network has a higher level of error that goes unnoticed and gives worse performance.
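A minimal sketch of the comparison follows, using a toy two-parameter loss in place of a real network. The loss function, its hand-derived gradient, and the helper names are all assumptions made for illustration.

```python
def loss(w):
    """Toy loss standing in for a network's cost function."""
    return w[0] ** 2 + 3 * w[0] * w[1]

def analytic_grad(w):
    """Hand-derived gradient of `loss` (what back-propagation would produce)."""
    return [2 * w[0] + 3 * w[1], 3 * w[0]]

def numerical_grad(f, w, eps=1e-5):
    """Central-difference approximation, one parameter at a time."""
    grads = []
    for i in range(len(w)):
        plus = list(w)
        plus[i] += eps
        minus = list(w)
        minus[i] -= eps
        grads.append((f(plus) - f(minus)) / (2 * eps))
    return grads

def relative_error(a, b):
    """Norm of the difference, scaled by the sum of the norms."""
    num = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    den = sum(x ** 2 for x in a) ** 0.5 + sum(y ** 2 for y in b) ** 0.5
    return num / den

w = [1.5, -0.5]
err = relative_error(analytic_grad(w), numerical_grad(loss, w))
print(err)  # a tiny value suggests the analytic gradient is right
```

A common rule of thumb is to treat a very small relative error as a pass; a gradient with a missing term would produce an error many orders of magnitude larger.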
While working at Facebook, you're asked to implement some new features. What type of experiment would you run to implement these features?
A/B testing can be used to check how the general audience responds to new features. A/B testing is valuable because different audiences behave, well, differently: something that works for one company may not necessarily work for another. A/B testing is an experiment in which you "split" your audience to test a number of variations of a campaign or new feature and determine which performs better. For example, in marketing or web design, you might compare two different landing pages or two different newsletters. Say Version A shows the current layout of a page, and in Version B you move the content body to the right instead of the left. For A/B testing to work, you must define your criteria for success before you begin. What do you think will happen if you change Version A to Version B? Maybe you're hoping to increase newsletter sign-ups or decrease the bounce rate. This way you can compare the success rate of both versions.
What metrics would you use to track whether Uber's strategy of using paid advertising to acquire customers works?
Customer acquisition cost (CAC) is a metric that can be used to track customers as they progress from interested leads to paying customers. CAC is the cost of convincing a potential customer to buy a product or service. It is calculated by dividing all the costs spent on acquiring customers (marketing expenses) by the number of customers acquired in the period the money was spent. For example, if Uber spent $200 on marketing in a year and acquired 200 customers in the same year, their CAC is $1.00.
According to your judgement, does Data Science differ from Machine Learning?
Data Science: The processing and analysis of data to draw useful insights from it. For instance, when you log on to Netflix and browse shows and genres, you're generating data. All of these activities are tracked for each user, and the data is consumed by a data scientist at the backend to understand customer behaviour. This is one of the reasons you see customized ads everywhere for a product you are currently searching for. This is one of the simplest applications of data science. Machine Learning: Machine learning is just one part of the work done by data scientists. Data gets generated in massive volumes, which makes it extremely cumbersome for a data scientist to work on manually. A machine learning algorithm can learn from and process large data sets autonomously, without human intervention. For example, Facebook's News Feed uses machine learning. The algorithm gathers behavioural information about the user by consistently tracking their activity, and then uses this past behaviour to train itself to predict the user's interests and recommend content on the News Feed.
Why is dimensionality reduction important?
Datasets with a large number of features (specifically images, sound, and/or textual content) take up more space, encourage overfitting and slow down model training. Dimensionality reduction is the process of reducing the dimensionality of the feature space by obtaining a set of principal features. This can improve the performance of the learning algorithm, lowering computational cost and simplifying the model. It also eliminates redundant features and features that are strongly correlated with each other, thereby reducing overfitting. Moreover, projection into two or three dimensions is often used to visualize high-dimensional data sets, which makes them easier for people to interpret.
How does a logistic regression model know what the coefficients are?
The coefficients are estimated from the training data, typically by maximum likelihood: the model picks the values that make the observed outcomes most probable. To see what they mean, first consider the case when the input variable is continuous. The first coefficient is the y-axis intercept. It means that when the input variable/feature is 0, the log(odds of the output variable) is equal to the intercept value. The second coefficient is the slope. It represents the change in the log(odds of the output variable) for every one-unit increase along the x-axis. Now, consider the case when the input variable is discrete. Take the example where a mouse is "obese" or "not obese", and the independent variable is whether the mouse has the normal gene or the mutated gene. In this case the first coefficient (the intercept) is the log(odds of obesity) for mice with the normal gene, and the second coefficient is the log(odds ratio), which tells us, on a log scale, how much having the mutated gene increases or decreases the odds of being obese.
What is the difference between clustered and non-clustered index?
The purpose of indexes is to speed up query processing in SQL Server. Clustered Index: A clustered index defines the order in which data is physically stored in a table. Since the data can be sorted in only one order, there can be only one clustered index per table. It is faster to read than a non-clustered index because the data is physically stored in index order. Non-Clustered Index: A non-clustered index doesn't sort the data inside the table. The index is stored in one place and the table data in another, which allows more than one non-clustered index per table. Insert and update operations are quicker with a non-clustered index than with a clustered index.
How do you deal with unbalanced binary classification?
Unbalanced binary classification can be dealt with in multiple ways, as listed below (assuming more data cannot be collected):
- Under-sampling: Eliminates majority class examples until the data is balanced.
- Over-sampling: Increases the number of instances in the minority class by adding copies of those instances. This can be done randomly, or synthetically using the Synthetic Minority Over-sampling Technique (SMOTE).
- Use suitable performance metrics: Accuracy can be a misleading metric for unbalanced data. Suitable metrics include Precision, Recall, F1-score, AUC, etc.
- Use suitable algorithms: Algorithms such as Decision Tree and Random Forest often cope well with unbalanced data.
- Penalizing: Imposes an additional cost on the model during training for misclassifying the rare class, greater than the cost of misclassifying the abundant class. These penalties bias the model towards paying more attention to the rare class.
What are anomaly detection methods?
Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers. The simplest approach to identifying irregularities in data is to flag the data points that deviate from common statistical properties of a distribution, including mean, median, mode, and quantiles. Density-based anomaly detection is a machine learning approach that works on the assumption that normal data points occur in dense neighborhoods and abnormalities are far away. The nearest set of data points is evaluated using a score, which could be Euclidean distance or another measure. Another technique is the Z-score, which is a parametric outlier detection method. This technique assumes a Gaussian distribution of the data. The outliers are the data points in the tails of the distribution, and therefore far from the mean.
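The Z-score method can be sketched with the standard library's `statistics` module. The function name, the readings, and the threshold of 2.5 are assumptions for illustration (3 is a common default, but in a sample this small no point can reach a z-score of 3).

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mean) / std > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(zscore_outliers(readings, threshold=2.5))  # flags the reading of 50
```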
What is cross validation? Why is it used?
Cross-validation is a technique for evaluating predictive models. It partitions the original sample into a training set to train the model and a test set to evaluate it. In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k-1 subsamples are used as training data. The process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimate. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once. This helps reduce bias in the performance estimate, because every observation from the original dataset appears in both a training set and a validation set.
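The fold bookkeeping can be sketched as an index generator. This version does not shuffle the data first, which a real pipeline usually would; the function name is illustrative, and libraries such as scikit-learn offer ready-made splitters.

```python
def k_fold_indices(n_samples, k):
    """Yield (training, validation) index lists for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

for train_idx, val_idx in k_fold_indices(6, 3):
    print(train_idx, val_idx)
```

Each index appears in exactly one validation fold and in the training set of every other fold.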
What was the most challenging project you have worked on so far? Can you explain your learning outcomes?
It is crucial that you prepare an answer in advance, since interviews are intimidating for most people and it is hard to formulate a well thought-out example on the spot. Keep the following points in mind while finalizing the answer: Choose an appropriate example: Pick a project that's most relevant to the responsibilities of the job you're applying for. Be specific: Take the hiring manager through the process of the project. Break the project down into goals and milestones, explain how you were able to achieve them, and describe your responsibilities as well. If you were managing a group project, make sure to mention your communication and group management skills. Look at the keywords in the job description so that you know what they're looking for. For instance, if they are looking for a *leader*, explain your role as a leader in the project. Explain your position clearly: Make sure to highlight the outcomes of the project and your role in achieving them. Align your learning outcomes with the aims of the company you're applying to. In addition to the learning outcomes, the hiring manager should hear about the challenges you faced during the project and how you overcame them. To sum up, the project journey should communicate your willingness to learn and overcome hurdles.
Formulate Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) techniques.
Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are both topic modelling techniques. LSA (also known as Latent Semantic Indexing, LSI) learns latent topics by performing a matrix decomposition on the term-document matrix. The objective of LSA is to reduce dimensions for classification in Natural Language Processing. Latent Dirichlet Allocation (LDA) is a "generative probabilistic model" which uses unsupervised learning for topic modeling/classification of topics.
How many topic modeling techniques do you know of? Explain them briefly.
Latent Semantic Analysis (LSA): Latent Semantic Analysis tries to use the context around words to find hidden concepts. It does that by generating a document-term matrix, where each cell holds a TF-IDF score that assigns a weight to every term in the document. Using a technique known as Singular Value Decomposition (SVD), the dimensions of the matrix are reduced to the number of desired topics. The resulting matrices give us vectors for every document and term in our data, which can then be used to find similar words and similar documents using cosine similarity. Probabilistic Latent Semantic Analysis (PLSA): Probabilistic Latent Semantic Analysis models the same information under a probabilistic framework instead of SVD. It creates a model P(D,W) such that for any document d and word w, P(d,w) corresponds to that entry in the document-term matrix. Latent Dirichlet Allocation (LDA): Latent Dirichlet Allocation is a technique that automatically discovers the topics that documents contain. LDA represents documents as mixtures of topics that emit words with certain probabilities. It assumes that each document is a mixture of various topics and every topic is a mixture of various words. Under this assumption, LDA tries to backtrack from the documents to find a set of topics that are likely to have generated the collection. It maps all the documents to the topics in such a way that the words in each document are mostly captured by those imaginary topics.
If you have a chance to add something to Facebook then how would you measure its success?
The choice of feature to add is yours. To test any feature, let's walk through an example: To check the popularity of a Facebook feature, it would be a good idea to measure how frequently people are using the feature. One metric to use is the average number of times a user shares a story per day/per week/per month. To test whether users want to first view stories from close friends or whether they want to see stories from all their friends, we can measure how many times a user clicks on stories from friends they don't engage with very often before they click on stories from their close friends. If people click on stories randomly without prioritizing stories from close friends, then a more appropriate ordering of stories should be considered. These are examples of a few metrics to monitor. These measures, in addition to others, can be used to evaluate whether these components are achieving the overall goals set by Facebook.
What is the difference between - (i) Stack and Queue and (ii) Linked list and Array?
The main difference between a stack and a queue is how new elements enter the list and old elements leave the list. A stack is a linear data structure in which elements can be inserted and deleted only from one side of the list, called the top. A stack is a LIFO (last in first out) data structure. In a stack we always keep track of the last element present in the list with a pointer called top. In a queue, by contrast, elements can be inserted only from one side of the list, called the rear, and deleted only from the other side, called the front. A queue is a FIFO (first in first out) data structure. The difference between a linked list and an array is the way they allocate memory. A linked list is an ordered collection of elements of the same type which are connected to each other using pointers. In a linked list, the address of the memory location allocated to the new element is stored in the previous node of the linked list, hence forming a link between the two nodes/elements. Linked lists have a dynamic size, but random access isn't allowed. An array, by contrast, is a random-access data structure that occupies contiguous memory locations, in many languages with a size fixed when the array is created. An array has a fixed size, but random access is permissible as elements can be accessed directly using their index.
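The LIFO and FIFO behaviour described above can be illustrated with a minimal Python sketch. It assumes a plain list as the stack and collections.deque as the queue; the element values are made up for illustration.

```python
from collections import deque

# Stack: a Python list used as LIFO; push and pop both happen at the top (the end).
stack = []
stack.append("a")
stack.append("b")
stack.append("c")
last_in = stack.pop()        # removes "c", the most recently added element

# Queue: a deque used as FIFO; enqueue at the rear, dequeue from the front.
queue = deque()
queue.append("a")
queue.append("b")
queue.append("c")
first_in = queue.popleft()   # removes "a", the earliest added element

print(last_in, first_in)
```

A deque is used for the queue because removing from the front of a plain list requires shifting every remaining element.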
You call 3 random friends who live in Seattle and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of lying. All three say "yes". What's the probability it's actually raining?
We have to find the probability of rain in Seattle given that all three friends said 'yes'. Therefore, we are trying to find P(rain | yes, yes, yes). Using Bayes' theorem, our equation is: P(rain | yes, yes, yes) = P(yes, yes, yes | rain) * P(rain) / [P(yes, yes, yes | rain) * P(rain) + P(yes, yes, yes | not rain) * P(not rain)]. We have the following values: P(yes, yes, yes | rain) = (2/3)^3 = 8/27; P(yes, yes, yes | not rain) = (1/3)^3 = 1/27; P(rain) = R (it is not given in the question, so we'll assume R); P(not rain) = 1 - R. Substituting these values into the equation we get: P(rain | yes, yes, yes) = (8/27)R / [(8/27)R + (1/27)(1 - R)] = 8R / (7R + 1).
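The Bayes' theorem calculation above can be checked numerically. The sketch below is one way to do it, using exact fractions; the function name posterior_rain and the prior of 1/2 are assumptions chosen for illustration, since the question does not give P(rain).

```python
from fractions import Fraction

def posterior_rain(prior, n_yes=3, p_truth=Fraction(2, 3)):
    """P(rain | all n_yes friends say 'yes') for a given prior P(rain)."""
    like_rain = p_truth ** n_yes         # every friend tells the truth
    like_dry = (1 - p_truth) ** n_yes    # every friend lies
    evidence = like_rain * prior + like_dry * (1 - prior)
    return like_rain * prior / evidence

# The prior R = 1/2 is an assumed value, not part of the question.
print(posterior_rain(Fraction(1, 2)))
```

Trying several priors shows how strongly the answer depends on R, which is why the result is left in terms of R.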
What are hyperparameters, how do you tune them, and how do you test whether they worked for a particular problem?
While a machine learning model tries to learn parameters from the training data, hyperparameters are those values that are set before the training process begins. They are properties used to describe how a model is supposed to function, for example: the number of trees in a random forest or the learning rate used when training a model with gradient descent. Hyperparameters directly control the behavior of the training algorithm and have a significant impact on the performance of the model being trained. To choose the set of hyperparameters that gives the best results, they can be tuned by running multiple trials. By setting different values for those hyperparameters, training different models, and deciding which ones work best by testing them, we can figure out those values. Each trial is a complete execution of our training application with one combination of values drawn from the ranges we specify. Some common techniques are: Grid Search, Random Search, Bayesian Optimization. To test and find out if the specified hyperparameters work for our model, we need to evaluate them on held-out data against a specific evaluation metric based on the nature of our problem. For example, we can choose the set of hyperparameters that gives us the best accuracy or the best F1 score.
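As a rough illustration of grid search, the sketch below tunes two hyperparameters (learning rate and number of epochs) of a tiny gradient-descent linear model and picks the combination with the lowest validation error. The data, the grid values, and the helper names train_linear and mse are all made up for this example; in practice a library routine would usually be used instead.

```python
from itertools import product

def train_linear(xs, ys, learning_rate, epochs):
    """Fit y = w*x + b by batch gradient descent; returns the learned (w, b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def mse(xs, ys, w, b):
    """Mean squared error of the line y = w*x + b on the given points."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data generated from y = 2x + 1 (illustrative only).
train_x, train_y = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
val_x, val_y = [0.5, 1.5, 2.5], [2.0, 4.0, 6.0]

# Hyperparameter grid: every combination is one trial.
grid = {"learning_rate": [0.001, 0.01, 0.1], "epochs": [10, 100, 1000]}

results = []
for lr, epochs in product(grid["learning_rate"], grid["epochs"]):
    w, b = train_linear(train_x, train_y, lr, epochs)
    results.append((mse(val_x, val_y, w, b), lr, epochs))

# The best trial is the one with the lowest validation error.
best_score, best_lr, best_epochs = min(results)
print(best_lr, best_epochs, best_score)
```

The weights w and b are parameters learned from the training data, while learning_rate and epochs are hyperparameters fixed before each trial starts.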
If a Product Manager says that they want to double the number of ads in Facebook's Newsfeed, how would you figure out if this is a good idea or not?
You can use A/B testing to draw a conclusion about the effect of the additional ads. A/B testing means experimenting with and comparing two variations of an online or offline campaign element, such as ad text, a headline, or the ads themselves. For example, one set of the audience can be shown double the number of ads they usually see on their newsfeed, while the second set continues to see the existing number of ads. The reactions of both sets can be recorded using an appropriate feedback system. This approach can help the company decide what percentage of the audience is comfortable looking at the ads and responding to them. Provided the sample is large enough to detect the effect of interest, an A/B test can provide significant, actionable insights as to which changes are most engaging for users.
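One common way to compare the two groups in such a test is a two-proportion z-test on a chosen metric. The sketch below assumes the metric is the share of users who return the next day; the counts are hypothetical and the function name two_proportion_z_test is made up for illustration.

```python
from math import sqrt, erf

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, written with erf.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts: users who returned the next day in each group.
control_returned, control_users = 8000, 10000       # existing number of ads
treatment_returned, treatment_users = 7800, 10000   # doubled number of ads

z, p_value = two_proportion_z_test(
    control_returned, control_users, treatment_returned, treatment_users
)
print(z, p_value)
```

A negative z with a small p-value would suggest that doubling the ads reduced the chosen metric by more than sampling noise alone would explain.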
What are the core steps for data preprocessing before applying machine learning algorithms?
Data preprocessing is the process of giving structure to the data for better understanding and decision making related to the data. The following steps summarize the data pre-processing pipeline: 1. Discovering/Data Acquisition: Gather the data from all the sources and try to understand and make sense of your data. 2. Structuring/Data Transformation: Since the data may come in different formats and sizes, it needs to be brought to a consistent size and shape when merged together. 3. Cleaning: This step consists of imputing null values and treating outliers/anomalies in the data to make the data ready for further analysis. 4. Exploratory Data Analysis: Try to find patterns in the dataset and extract new features from the given data in order to optimize the performance of the applied machine learning model. 5. Validating: This stage verifies data consistency and quality. 6. Publishing/Modeling: The wrangled data is ready to be processed further by an algorithm or machine learning model.
Explain Logistic Regression and its assumptions.
Logistic Regression is a go-to method for classification. Logistic regression models the probability of the default class (e.g. the first class). It uses the sigmoid function, which can take any real-valued number and map it into a probability value between 0 and 1, to predict the output class. There are two types of logistic regression: Binary and Multinomial. Binary Logistic Regression deals with two categories whereas multinomial deals with three or more categories. Assumptions: Binary logistic regression requires the dependent variable to be binary. The independent variables should be independent of each other, that is, the model should have little or no multicollinearity. The independent variables should be linearly related to the log odds.
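To make the role of the sigmoid concrete, here is a minimal sketch of how a binary logistic regression turns features into a class prediction. The weights, bias, and feature values are hypothetical and stand in for values that would normally be learned from data.

```python
from math import exp

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class: sigmoid of the linear log odds."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(log_odds)

# Hypothetical coefficients, not fitted to any real data.
weights, bias = [0.8, -0.5], 0.1
p = predict_proba([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0   # 0.5 is a conventional, adjustable threshold
print(p, label)
```

The linear combination inside predict_proba is the log odds, which is why the independent variables are assumed to be linearly related to the log odds.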
Describe a way to detect anomalies in a given dataset.
Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers. These can be rare items, events or observations which raise suspicions by differing significantly from the majority of the data. Typically, anomalous data can be connected to some kind of problem or rare event such as bank fraud, medical problems, structural defects, or malfunctioning equipment. The simplest approach to identifying anomalies in data is to use simple statistical techniques and flag the data points that deviate from common statistical properties of a distribution, including mean, median, mode, and quantiles. For example, a data point can be marked as an anomaly when it deviates from the mean by more than a certain number of standard deviations. However, in high dimensions the statistical approach can be difficult, so machine learning techniques can be used instead. The following are popular methods used to detect anomalies: - Isolation Forest - One Class SVM - PCA-based Anomaly detection - FAST-MCD - Local Outlier Factor (Explaining one of the above-mentioned methods) Isolation Forests build a Random Forest in which each Decision Tree is grown randomly. At each node, it picks a feature randomly, then it picks a random threshold value (between the min and max value) to split the dataset in two. The dataset gradually gets chopped into pieces this way, until all instances end up isolated from the other instances. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.
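The simple statistical approach mentioned above can be sketched in a few lines of Python. The function name zscore_anomalies, the sample data, and the threshold of 2 standard deviations are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold=3.0):
    """Return the values lying more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Made-up readings with one obvious outlier.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(zscore_anomalies(data, threshold=2.0))
```

Note that the outlier itself inflates the mean and standard deviation, so with a small sample a stricter threshold such as 3 can fail to flag it; robust statistics such as the median are a common alternative.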
Why is it important to know bias-variance trade off while modeling?
Bias and variance are components of a model's prediction error. A model with high bias pays very little attention to the training data and oversimplifies the model, leading to underfitting. A model with high variance pays a lot of attention to the training data and does not generalize well to unseen data, which leads to overfitting. Understanding these errors helps us not only to build accurate models, but also to avoid the mistakes of overfitting and underfitting. Underfitting/Bias: Bias error is the difference between the expected/average prediction of the model and the true value. The model building/prediction process is repeated more than once with new variations of the data. Hence, due to the randomness in the underlying data sets, we will have a set of predictions for each point. Bias measures how much the predictions deviate from the true value we are trying to predict. Overfitting/Variance: Variance error is defined as the variability of model prediction for a given data point. The model prediction is repeated for various data sets. It's an indicator of a model's sensitivity to the small variations that can exist when it is fed a new subset of the training data. For instance, if a model has high variance then small changes in the training data can result in large prediction changes. There is no analytical way to find the point at which the best bias-variance tradeoff is achieved. To figure it out, it's essential to explore the complexity of the model and measure the prediction error in order to minimize the overall error.
What are the Naive Bayes fundamentals?
Naive Bayes is a probabilistic machine learning model that's widely used for text classification. It learns the probability of an object with certain features belonging to a particular group or class. The Naive Bayes algorithm is called "Naive" because it makes the assumption that the occurrence of a certain feature is independent of the occurrence of other features, given the class. The classifier is based on Bayes' theorem, which gives us a method to calculate the conditional probability, that is, the probability of an event A based on prior knowledge of related events. There are essentially three types of Naive Bayes: 1. Multinomial Naive Bayes: Used when we have discrete data. With respect to text classification, if the words can be represented in terms of their occurrences/frequency count, then use this method. 2. Bernoulli Naive Bayes: It assumes that the input features are binary with only two categories (e.g. 0 can represent that the word is not present in the document while 1 represents that the word is present). If you just care about the presence or absence of a particular word in the document, then use Bernoulli classification. 3. Gaussian Naive Bayes: It is used when the features are continuous. For example, the Iris dataset has the features sepal width, petal width, sepal length, and petal length, all of which are continuous measurements.
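The multinomial variant can be sketched from scratch for a toy text classification task. This is an illustrative implementation with add-one (Laplace) smoothing, not a reference one; the function names train_nb and predict_nb and the four training documents are made up.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(docs):
    """docs: list of (words, label). Returns class counts, per-class word counts, vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(words, class_counts, word_counts, vocab):
    """Pick the class with the highest log prior plus summed log word likelihoods."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = log(n_docs / total_docs)               # log prior P(class)
        total_words = sum(word_counts[label].values())
        for w in words:
            # "Naive" step: each word contributes independently, with add-one smoothing.
            score += log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Tiny made-up training set.
docs = [
    ("win money now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]
model = train_nb(docs)
print(predict_nb("free money".split(), *model))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a word unseen in one class from forcing that class's probability to zero.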
What are neural networks used for?
Neural networks are learning algorithms which take one or more inputs and process them into an output. The neural network itself consists of many small units called neurons. These neurons are grouped into several layers. Neurons of the previous layer are connected with the neurons of the next layer through weighted connections. They can be used for both predictive analytics/regression and classification involving image, audio, video and text analytics. Neural networks are used both in supervised and unsupervised learning.
What does P-Value mean?
The p-value is used to assess the statistical significance of a result when testing a null hypothesis. It stands for probability value and is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. If the p-value is small (below the chosen significance level, commonly 0.05), the result would be unlikely to occur by chance alone under the null hypothesis, so we reject the null hypothesis. These results are known as being statistically significant. A large p-value, beyond the chosen significance level, indicates that the result is within chance or normal sampling error; it is weak evidence against the null hypothesis, so you fail to reject the null hypothesis.
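As a small worked example, the sketch below computes an exact two-sided p-value for a coin-flip experiment under the null hypothesis that the coin is fair. The function name and the observed counts are assumptions for illustration; the two-sided rule used here sums the probabilities of all outcomes that are no more likely than the observed one.

```python
from math import comb

def binomial_two_sided_p_value(heads, flips, p_null=0.5):
    """Exact two-sided binomial p-value: total probability of outcomes no more likely than the observed one."""
    def pmf(k):
        return comb(flips, k) * p_null ** k * (1 - p_null) ** (flips - k)

    observed = pmf(heads)
    # The small tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(flips + 1) if pmf(k) <= observed * (1 + 1e-9))

# Hypothetical experiments: 9 heads in 10 flips versus 6 heads in 10 flips.
print(binomial_two_sided_p_value(9, 10))
print(binomial_two_sided_p_value(6, 10))
```

At a 0.05 significance level the first result would lead us to reject the hypothesis of a fair coin, while the second would not.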
Explain Principal Component Analysis (PCA) and its assumptions.
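Principal component analysis is a dimensionality reduction technique for large data sets that transforms a large set of variables into a smaller one that still contains most of the information in the large set. It's often used to make data easy to explore and visualize. PCA does not make any explicit distributional assumptions, but it implicitly assumes that the variables are linearly related and that the directions of largest variance carry the most information. Because it is sensitive to the scale of the variables, the data is usually standardized first.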
How does speech synthesis work?
Speech synthesis is the process of producing human-like speech artificially using a computing device. It's also referred to as a text-to-speech (TTS) system, which converts natural language text into audible speech. Simply explained, it's a form of output where a computer or other machine reads words to you out loud in a real or simulated voice played through a loudspeaker. This synthesized speech is usually generated by concatenating pieces of recorded speech stored in a database. The entire process can be described in the following three stages. Pre-processing: Since there's a lot of ambiguity involved in reading text, as words can be read in several ways, the pre-processing step attempts to eliminate the ambiguity and handles homographs. To do so, statistical techniques and machine learning algorithms are used to find the most appropriate way to read the text considering the context. Words to phonemes: Following pre-processing, the computer uses phonemes to convert the text into a sequence of sounds. Phonemes are the sound components used to make spoken words. The speech synthesizer generates the speech sounds that make up the words. Phonemes to sound: Last, techniques are used to mimic the human voice mechanism and read out the entire text. This can take 3 forms: 1. Using recordings of humans saying the phonemes 2. Using a computer to generate the phonemes itself by generating basic sound frequencies 3. Mimicking the mechanism of the human voice
