# Data Science Interview Questions

What is a type I error?

A Type I error occurs when the null hypothesis is rejected when it is true. (False Positive)

What is a type II error?

A Type II error occurs when the null hypothesis is not rejected even though it is false. (False Negative)

What is a Probability Density Function?

A function for continuous data where the value at any given sample can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.

What is a Probability Mass Function?

A function that gives the probability that a discrete random variable is exactly equal to some value.

What is a Cumulative Distribution Function?

A function that gives the probability that a random variable is less than or equal to a certain value.

What is a null hypothesis?

A general statement that there is no relationship between two measured phenomena or no association among groups.

What is a gradient?

A gradient measures how much the output of a function changes if you change the inputs a little bit. In model training, it measures the change in error with respect to a change in each weight. You can also think of a gradient as the slope of a function.
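
As a rough illustration, a gradient can be approximated numerically by nudging each input slightly and measuring how the output changes. The sketch below assumes a made-up error surface (`squared_error`) and a helper name (`numerical_gradient`) chosen only for this example.

```python
def numerical_gradient(f, point, eps=1e-6):
    """Approximate the gradient of f at `point` with central differences."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += eps      # nudge one input up
        down[i] -= eps    # and the same input down
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def squared_error(weights):
    # Hypothetical error surface: E(w) = w0^2 + 3 * w1^2
    return weights[0] ** 2 + 3 * weights[1] ** 2


gradient = numerical_gradient(squared_error, [1.0, 2.0])
print(gradient)  # close to the analytic gradient [2 * w0, 6 * w1]
```

Each entry of the result says how quickly the error changes when the corresponding weight changes.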

What is the Interquartile Range(IQR)?

A measure of statistical dispersion and variability based on dividing a data set into quartiles. IQR = Q3−Q1
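
A minimal sketch of the calculation, using the standard library's `statistics.quantiles` (Python 3.8+). The data values are made up, and other quartile conventions can give slightly different cut points.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # hypothetical sample

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
print(q1, q3, iqr)
```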

What is skewness?

A measure of the asymmetry of a distribution about its mean.

What is kurtosis?

A measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.

What are Percentiles?

A measure that indicates the value below which a given percentage of observations in a group of observations falls.

What is the Exponential distribution?

A probability distribution of the time between the events in a Poisson point process.

What is Covariance?

A quantitative measure of the joint variability of two random variables.

What Do You Mean by Tensor in Tensorflow?

A tensor is a mathematical object represented as a multidimensional array. These arrays of data, with different dimensions and ranks, that are fed as input to the neural network are called "Tensors."

Describe the structure of Artificial Neural Networks.

Artificial Neural Networks work on the same principle as biological neural networks. A network consists of inputs that are combined into weighted sums with a bias and then passed through activation functions.

What are Artificial Neural Networks?

Artificial Neural Networks are a specific set of algorithms that have revolutionized machine learning. They are inspired by biological neural networks. Neural networks can adapt to changing input, so the network generates the best possible result without needing to redesign the output criteria.

What Is a Multi-layer Perceptron(MLP)?

As in other neural networks, an MLP has an input layer, one or more hidden layers, and an output layer. It has the same structure as a single-layer perceptron, with one or more hidden layers added. A single-layer perceptron can classify only linearly separable classes with binary output (0, 1), but an MLP can classify nonlinear classes. Except for the input layer, each node in the other layers uses a nonlinear activation function: each node takes the weighted sum of the outputs of the previous layer and passes it through the activation function to produce its output. An MLP is trained with a supervised learning method called "backpropagation." In backpropagation, the network calculates the error with the help of a cost function and propagates this error backward from the output, adjusting the weights to train the model more accurately.

What is Back Propagation and how does it work?

Backpropagation is a training algorithm used for multilayer neural networks. In this method, we move the error from the end of the network back to all the weights inside the network, which allows efficient computation of the gradient. Its steps are as follows:

1. Forward propagation of the training data.
2. Derivatives are computed using the output and the target.
3. Backpropagate to compute the derivative of the error with respect to the output activation.
4. Use the previously calculated derivatives for the output.
5. Update the weights.

Describe in brief any type of Ensemble Learning?

Bagging: Bagging tries to implement similar learners on small sample populations and then takes a mean of all the predictions. In generalised bagging, you can use different learners on different populations. As you would expect, this helps reduce the variance error.

Boosting: Boosting is an iterative technique that adjusts the weight of an observation based on the last classification. If an observation was classified incorrectly, it tries to increase the weight of this observation, and vice versa. Boosting in general decreases the bias error and builds strong predictive models. However, boosted models may overfit the training data.

What is Bagging?

Bagging is a type of Ensemble Learning that tries to implement similar learners on small sample populations and then takes a mean of all the predictions. In generalised bagging, you can use different learners on different populations. As you would expect, this helps reduce the variance error.

What is a Batch in Deep Learning?

Batch - Refers to when we cannot pass the entire dataset into the neural network at once, so we divide the dataset into several batches.

What Is the Difference Between Batch Gradient Descent and Stochastic Gradient Descent?

Batch Gradient Descent computes the gradient using the entire dataset. It takes time to converge because the volume of data is large and the weights update slowly. Stochastic Gradient Descent computes the gradient using a single sample. It often converges much faster than batch gradient descent because it updates the weights more frequently.

Explain Bayes Theorem

Bayes' Theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event.
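
In symbols, P(A|B) = P(B|A)P(A)/P(B). A small sketch with made-up numbers for a diagnostic test; the function name, parameter names and figures are illustrative assumptions.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) = P(B|A) * P(A) / P(B), with P(B) expanded by total probability."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false positive rate.
p_condition_given_positive = posterior(
    prior=0.01, likelihood=0.95, false_positive_rate=0.05
)
print(round(p_condition_given_positive, 3))
```

With these assumed numbers a positive result still leaves the probability of the condition well below one half, because the prior is so low.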

What is Boosting?

Boosting, a type of Ensemble Learning, is an iterative technique that adjusts the weight of an observation based on the last classification. If an observation was classified incorrectly, it tries to increase the weight of this observation, and vice versa. Boosting in general decreases the bias error and builds strong predictive models. However, boosted models may overfit the training data.

What is Uniform Distribution?

Also called a rectangular distribution, it is a probability distribution where all outcomes are equally likely.

What Is the Cost Function?

Also referred to as "loss" or "error," cost function is a measure to evaluate how good your model's performance is. It's used to compute the error of the output layer during backpropagation. We push that error backwards through the neural network and use that during the different training functions.

What is the Standard Error?

An estimate of the standard deviation of the sampling distribution.

What is power analysis?

An experimental design technique for determining the sample size required to detect an effect of a given size with a given degree of confidence.

What is Ensemble Learning?

Ensemble Learning is basically combining a diverse set of learners (individual models) to improve the stability and predictive power of the model.

What is an Epoch in Deep Learning?

Epoch - Represents one iteration over the entire dataset (everything put into the training model).

What is the Computational Graph?

Everything in TensorFlow is based on creating a computational graph. It is a network of nodes in which nodes represent mathematical operations and edges represent tensors. Since data flows in the form of a graph, it is also called a "DataFlow Graph."

What is the Central Limit Theorem and why is it important?

Formally, it states that if we repeatedly sample from a population using a sufficiently large sample size, the distribution of the sample means (the sampling distribution) will be approximately normal (assuming true random sampling and a population with finite variance). What's especially important is that this holds regardless of the distribution of the original population.
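
A quick simulation can make this concrete. The sketch below draws samples from a skewed (exponential) population; the sample size of 50, the 2,000 repetitions and the fixed seed are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable


def sample_mean(n):
    # Draw n values from a skewed exponential population with mean 1.
    return statistics.fmean(rng.expovariate(1.0) for _ in range(n))


sample_means = [sample_mean(50) for _ in range(2000)]
mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)

# The means cluster around the population mean (1.0), with a spread close to
# population_std / sqrt(n) = 1 / sqrt(50), roughly 0.141.
print(mean_of_means, spread_of_means)
```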

Explain Gradient Descent.

Gradient Descent can be thought of as climbing down to the bottom of a valley, instead of climbing up a hill. This is because it is a minimization algorithm that minimizes a given function (the cost function).
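
A minimal sketch of the idea on a one-dimensional cost, f(x) = (x - 3)^2, chosen only for illustration. The function name and default settings are assumptions, not a reference implementation.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to walk downhill."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


# Cost f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```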

How Do You Work Towards a Random Forest?

In Random Forest, we grow multiple trees as opposed to a single tree. To classify a new object based on its attributes, each tree gives a classification. The forest chooses the classification with the most votes (over all the trees in the forest), and in the case of regression, it takes the average of the outputs of the different trees.

What are the support vectors in SVM?

The support vectors are the data points closest to the classifier's decision boundary. The distance between the two lines that pass through the support vectors on either side of the boundary is called the margin.

What cross-validation technique would you use on a time series data set?

Instead of using standard k-fold cross-validation, keep in mind that a time series is not randomly distributed data: it is inherently ordered chronologically. For time series data, you should use techniques like forward chaining, where you fit the model on past data and then evaluate it on forward-facing data.
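
One way to sketch forward chaining is as a generator of index splits in which the training window always ends before the test window starts. The function name and the equal-sized folds are assumptions made for this example.

```python
def forward_chaining_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs that never train on the future."""
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        test_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(forward_chaining_splits(n_samples=10, n_folds=4))
for train, test in splits:
    print(train, test)
```

Each successive fold trains on a longer history and tests on the block that immediately follows it.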

What is an Iteration in Deep Learning?

Iteration - if we have 10,000 images as data and a batch size of 200, then an epoch should run 50 iterations (10,000 divided by 200).
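
The arithmetic can be sketched as follows. The helper name is made up, and rounding up assumes the final partial batch is still processed.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    # Round up so a final, smaller batch still counts as one iteration.
    return math.ceil(n_samples / batch_size)


print(iterations_per_epoch(10_000, 200))  # 10,000 images, batch size 200
```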

List a few clustering algorithms you are familiar with.

K-means, K-means++, Hierarchical clustering

What is Linear Regression?

Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.

How Does an LSTM Network Work?

Long-Short-Term Memory (LSTM) is a special kind of recurrent neural network capable of learning long-term dependencies, remembering information for long periods as its default behaviour. There are three steps in an LSTM network: Step 1: The network decides what to forget and what to remember. Step 2: It selectively updates cell state values. Step 3: The network decides what part of the current state makes it to the output.

What is correlation?

A measure of the relationship between two variables that ranges from -1 to 1; it is the normalized version of covariance.

What is Overfitting?

Overfitting, or *high variance*, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data.

What is Conditional Probability?

P(A|B) is the probability of event A occurring given that event B has occurred. P(A|B)=P(A∩B)/P(B), when P(B)>0.
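
The formula can be checked by enumerating a small sample space. The two-dice events below are made up for illustration.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # two fair dice, 36 outcomes


def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def sum_is_8(o):          # event A
    return o[0] + o[1] == 8


def first_is_even(o):     # event B
    return o[0] % 2 == 0


p_b = prob(first_is_even)
p_a_and_b = prob(lambda o: sum_is_8(o) and first_is_even(o))
p_a_given_b = p_a_and_b / p_b  # P(A|B) = P(A∩B) / P(B)
print(p_a_given_b)
```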

What is the formula for Intersection?

P(A∩B)=P(A|B)P(B) in general; when A and B are independent, this reduces to P(A∩B)=P(A)P(B).

What is Pooling on CNN and how does it work?

Pooling is used to reduce the spatial dimensions of a CNN. It performs down-sampling operations to reduce the dimensionality and creates a pooled feature map by sliding a filter matrix over the input matrix.
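
A small sketch of max pooling, one common pooling variant, on a made-up 4×4 feature map with a 2×2 window and stride 2. The function name and values are illustrative assumptions.

```python
def max_pool_2d(matrix, size=2, stride=2):
    """Slide a size x size window over the matrix and keep each window's maximum."""
    rows = (len(matrix) - size) // stride + 1
    cols = (len(matrix[0]) - size) // stride + 1
    pooled = []
    for r in range(rows):
        pooled_row = []
        for c in range(cols):
            window = [
                matrix[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            ]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
pooled = max_pool_2d(feature_map)
print(pooled)  # [[6, 4], [8, 9]]
```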

What are Predictive Analytics?

Predictive Analytics predicts what is most likely to happen in the future and provides companies with actionable insights based on the information.

What are Prescriptive Analytics?

Prescriptive Analytics provides recommendations regarding actions that will take advantage of the predictions and guide the possible actions toward a solution.

What are Recurrent Neural Networks?

RNNs are a type of artificial neural network designed to recognise patterns in sequences of data, such as time series or data from stock markets and government agencies. To understand recurrent nets, you first have to understand the basics of feedforward nets. Both kinds of network are named after the way they channel information through a series of mathematical operations performed at the nodes of the network. One feeds information straight through (never touching the same node twice), while the other cycles it through a loop; the latter are called recurrent. Recurrent networks take as their input not just the current input example they see, but also what they have perceived previously in time. The decision a recurrent neural network reached at time t-1 affects the decision that it will reach one moment later at time t. So recurrent networks have two sources of input, the present and the recent past, which combine to determine how they respond to new data, much as we do in life. The error they generate is returned via backpropagation and used to adjust their weights until the error can't go any lower. Remember, the purpose of recurrent nets is to accurately classify sequential input. We rely on the backpropagation of error and gradient descent to do so.

What is a Random Forest? How does it work?

Random forest is a versatile machine learning method capable of performing both regression and classification tasks. It can also be used for dimensionality reduction and for handling missing values and outliers. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.

What is the mean?

The average of the dataset.

What is the Variance?

The average squared difference of the values from the mean; it measures how spread out a set of data is relative to the mean.

What is the Binomial Probability Formula?

The binomial distribution consists of the probabilities of each of the possible numbers of successes on N trials for independent events that each have a probability of π (the Greek letter pi) of occurring.
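
The formula itself is P(x) = N! / (x! (N − x)!) · π^x · (1 − π)^(N − x), the probability of exactly x successes in N trials. A short sketch using `math.comb` (Python 3.8+); the coin-flip numbers are an arbitrary example.

```python
from math import comb


def binomial_pmf(x, n, pi):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * pi ** x * (1 - pi) ** (n - x)


# Probability of exactly 2 heads in 4 fair coin flips: 6 * 0.5**4 = 0.375
p_two_heads = binomial_pmf(2, 4, 0.5)
print(p_two_heads)
```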

What is a confusion matrix?

The confusion matrix is a 2×2 table that contains the 4 outcomes produced by a binary classifier (true positives, false positives, true negatives and false negatives). Various measures, such as error rate, accuracy, specificity, sensitivity, precision and recall, are derived from it.
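
As a sketch, the derived measures can be computed directly from the four cell counts. The counts below are invented for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive common measures from the four cells of a confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts for a binary classifier evaluated on 100 examples.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(name, round(value, 3))
```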

What is the alternative hypothesis?

The contrary to the null hypothesis.

What is the Normal Distribution?

The curve of the distribution is bell-shaped and symmetrical and is related to the Central Limit Theorem that the sampling distribution of the sample means approaches a normal distribution as the sample size gets larger.

What is a Box-Cox Transformation?

The dependent variable for a regression analysis might not satisfy one or more assumptions of an ordinary least squares regression. The residuals could either curve as the prediction increases or follow a skewed distribution. In such scenarios, it is necessary to transform the response variable so that the data meets the required assumptions. A Box-Cox transformation is a statistical technique to transform non-normal dependent variables into a normal shape. Many statistical techniques assume normality, so if the given data is not normal, applying a Box-Cox transformation means that you can run a broader number of tests.

What is the range?

The difference between the highest and lowest value in the dataset.

What is the Bernoulli distribution?

The distribution of a random variable for a single trial with only 2 possible outcomes, namely 1 (success) with probability p, and 0 (failure) with probability (1-p).

What is the Binomial Distribution?

The distribution of the number of successes in a sequence of n independent experiments, and each with only 2 possible outcomes, namely 1(success) with probability p, and 0(failure) with probability (1-p).

What is the Chi-Square distribution?

The distribution of the sum of squared standard normal deviates.

What is the Poisson Distribution?

The distribution that expresses the probability of a given number of events k occurring in a fixed interval of time if these events occur with a known constant average rate λ and independently of the time since the last event.

How Are Weights Initialized in a Network?

There are two methods here: we can either initialize the weights to zero or assign them randomly.

Initializing all weights to 0: This makes your model similar to a linear model. All the neurons in every layer perform the same operation, giving the same output and making the deep net useless.

Initializing all weights randomly: Here, the weights are assigned randomly by initializing them very close to 0. It gives better accuracy to the model since every neuron performs different computations. This is the most commonly used method.

What is the difference between the Type I and Type II error?

Type I error is the false positive, while Type II is the false negative. A Type I error is claiming that something has happened when it hasn't, for instance telling a man he is pregnant. A Type II error, on the other hand, means claiming that nothing has happened when in fact something has, for example telling a pregnant woman she isn't carrying a baby.

What does UNION do? What is the difference between UNION and UNION ALL?

UNION removes duplicate records (where all columns in the results are the same); UNION ALL does not.
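
The difference is easy to see with an in-memory SQLite database driven from Python's built-in `sqlite3` module. The table and column names are made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (city TEXT);
    CREATE TABLE b (city TEXT);
    INSERT INTO a VALUES ('Paris'), ('Rome');
    INSERT INTO b VALUES ('Rome'), ('Oslo');
    """
)

# 'Rome' appears in both tables: UNION keeps one copy, UNION ALL keeps both.
union_rows = conn.execute(
    "SELECT city FROM a UNION SELECT city FROM b"
).fetchall()
union_all_rows = conn.execute(
    "SELECT city FROM a UNION ALL SELECT city FROM b"
).fetchall()

print(len(union_rows), len(union_all_rows))
```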

What is Underfitting?

Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data.

What is root cause analysis?

Understanding the underlying causes of change is known as root cause analysis.

What is Unsupervised learning?

Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labelled responses. Algorithms: Clustering, Anomaly Detection, Neural Networks and Latent Variable Models

How will you find the right K for K-means?

Use the Elbow Method

What are Quartiles?

Values that divide the number of data points into four more or less equal parts, or quarters.

What is Batch Gradient Descent?

We calculate the gradient for the whole dataset and perform the update at each iteration

What is Stochastic Gradient Descent?

We use only a single training example for calculation of gradient and update parameters.

What Will Happen If the Learning Rate Is Set Inaccurately?

When your learning rate is too low, training of the model will progress very slowly as we are making minimal updates to the weights. It will take many updates before reaching the minimum point. If the learning rate is set too high, drastic updates to the weights cause undesirable divergent behaviour in the loss function. Training may fail to converge (the model never settles on a good output) or may even diverge (the loss grows instead of shrinking).
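
A toy sketch of both failure modes on the cost f(x) = x^2, whose gradient is 2x. The three learning rates are arbitrary picks meant to show slow progress, healthy convergence and divergence.

```python
def run_gradient_descent(learning_rate, start=1.0, steps=50):
    """Minimize f(x) = x^2 (gradient 2x) and return the final x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x


too_low = run_gradient_descent(0.001)   # barely moves away from the start
reasonable = run_gradient_descent(0.1)  # ends very close to the minimum at 0
too_high = run_gradient_descent(1.1)    # overshoots further on every step

print(too_low, reasonable, too_high)
```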

What are exploding gradients?

While training an RNN, if you see exponentially growing (very large) error gradients that accumulate and result in very large updates to the neural network model weights during training, they're known as exploding gradients. At an extreme, the values of the weights can become so large that they overflow and result in NaN values. The effect is that your model becomes unstable and unable to learn from your training data.

What are vanishing gradients?

While training an RNN, your gradient (slope) can become too small; this makes the training difficult. When the gradient is too small, the problem is known as a Vanishing Gradient. It leads to long training times, poor performance, and low accuracy.

What Are Hyperparameters?

With neural networks, you're usually working with hyperparameters once the data is formatted correctly. A hyperparameter is a parameter whose value is set before the learning process begins. It determines how a network is trained and the structure of the network (such as the number of hidden units, the learning rate, epochs, etc.).

How will you define the number of clusters in a clustering algorithm?

Within Sum of Squares (WSS) is generally used to explain the homogeneity within a cluster. If you plot WSS against the number of clusters, the resulting graph is generally known as the Elbow Curve. The point after which you no longer see a meaningful decrease in WSS (in the original example plot, number of clusters = 6) is known as the bending point and is taken as K in K-means.

How Regularly Must an Algorithm be Updated?

You will want to update an algorithm when:

- You want the model to evolve as data streams through the infrastructure.
- The underlying data source is changing.
- There is a case of non-stationarity.
- The algorithm underperforms or its results lack accuracy.

Explain what precision and recall are. How do they relate to the ROC curve?

Recall describes what percentage of actual positives are identified as positive by the model. Precision describes what percentage of positive predictions were correct. The ROC curve shows the relationship between model recall and specificity, with specificity being a measure of the percentage of actual negatives identified as negative by the model. Recall, precision, and the ROC curve are measures used to identify how useful a given classification model is.

What are Recommender Systems?

Recommender systems are a subclass of information filtering systems that predict the preference or rating a user would give to a product. Recommendations are widely used in movies, news, research articles, products, social tags, music, etc.

What is regularization and why is it useful?

Regularization is the process of adding a tuning parameter to a model in order to prevent overfitting. This is most often done by adding a penalty term to the loss: a constant multiple of a norm of the weight vector. The model predictions should then minimize the loss function calculated on the regularized training set.

What is reinforcement learning?

Reinforcement Learning is learning what to do and how to map situations to actions. The end result is to maximise the numerical reward signal. The learner is not told which action to take but instead must discover which action will yield the maximum reward. Reinforcement learning is inspired by the way human beings learn; it is based on a reward/penalty mechanism.

What is a Generative Adversarial Network?

A Generative Adversarial Network (GAN) has two primary components: a generator and a discriminator. The generator is a network (often a CNN) that keeps producing images that are closer and closer in appearance to the real images, while the discriminator tries to determine the difference between real and fake images. The two are trained against each other: the discriminator learns to identify real and fake images, and the generator learns to fool it.

What are the variants of Back Propagation?

Stochastic Gradient Descent, Batch Gradient Descent, Mini-batch Gradient Descent

What is the difference between supervised and unsupervised machine learning?

Supervised learning requires labeled training data. For example, in order to do classification (a supervised learning task), you'll first need to label the data you'll use to train the model to classify data into your labeled groups. Unsupervised learning, in contrast, does not require labeling data explicitly.

What is Supervised Learning?

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. Algorithms: Support Vector Machines, Regression, Naive Bayes, Decision Trees, K-nearest Neighbor Algorithm and Neural Networks

What is the advantage of performing dimensionality reduction before fitting an SVM?

Support Vector Machine Learning Algorithm performs better in the reduced space. It is beneficial to perform dimensionality reduction before fitting an SVM if the number of features is large when compared to the number of observations.

What is Systematic Sampling?

Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is progressed in a circular manner so once you reach the end of the list, it is progressed from the top again. The best example of systematic sampling is equal probability method.

What is Mini-batch Gradient Descent?

It's one of the most popular optimization algorithms. It's a variant of Stochastic Gradient Descent in which, instead of a single training example, a mini-batch of samples is used.
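
The three gradient descent variants differ only in how many examples feed each update, which the sketch below exposes through a `batch_size` argument. The data, learning rate and function name are assumptions picked for a small, noiseless example.

```python
import random


def fit_slope(xs, ys, batch_size, learning_rate=0.01, epochs=200, seed=0):
    """Fit y ≈ w * x by minimizing mean squared error with gradient descent."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the current batch only.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # hypothetical noiseless data with true slope 2

w_batch = fit_slope(xs, ys, batch_size=len(xs))  # batch: whole dataset per update
w_sgd = fit_slope(xs, ys, batch_size=1)          # stochastic: one example per update
w_mini = fit_slope(xs, ys, batch_size=2)         # mini-batch: a few examples per update
print(w_batch, w_sgd, w_mini)
```

On this tiny noiseless dataset all three settings recover the same slope; on real data they trade off update cost against gradient noise.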

What is the formula for Union?

P(A∪B)=P(A)+P(B)−P(A∩B)

What is causality?

Relationship between two events where one event is affected by the other.

What is the median?

The middle value of an ordered dataset.

What is the mode?

The most frequent value in the dataset. If multiple values occur most frequently, we have a multimodal distribution.

How can you iterate over a list and also retrieve element indices at the same time?

This can be done using the enumerate function, which takes every element in a sequence such as a list and pairs it with its index, placed just before it.
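
A short example with a made-up list:

```python
fruits = ["apple", "banana", "cherry"]

# enumerate yields (index, element) pairs, starting at 0 by default.
for index, fruit in enumerate(fruits):
    print(index, fruit)

pairs = list(enumerate(fruits))
numbered_from_one = list(enumerate(fruits, start=1))  # optional starting index
```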

What are Independent Events?

Two events are independent if the occurrence of one does not affect the probability of occurrence of the other. P(A∩B)=P(A)P(B) where P(A) != 0 and P(B) != 0 , P(A|B)=P(A), P(B|A)=P(B)

What are Mutually Exclusive Events?

Two events are mutually exclusive if they cannot both occur at the same time. P(A∩B)=0 and P(A∪B)=P(A)+P(B).
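
Independence, mutual exclusivity and the union formula can all be checked by enumerating a small sample space. The two-dice events below are invented for illustration.

```python
from fractions import Fraction
from itertools import product

outcomes = set(product(range(1, 7), repeat=2))  # two fair dice, 36 outcomes


def prob(event):
    """Probability of an event given as a set of outcomes."""
    return Fraction(len(event), len(outcomes))


first_even = {o for o in outcomes if o[0] % 2 == 0}
second_even = {o for o in outcomes if o[1] % 2 == 0}
sum_is_2 = {o for o in outcomes if sum(o) == 2}
sum_is_12 = {o for o in outcomes if sum(o) == 12}

# Independent: P(A∩B) equals P(A) * P(B).
independent = prob(first_even & second_even) == prob(first_even) * prob(second_even)

# Mutually exclusive: P(A∩B) is 0, so P(A∪B) is simply P(A) + P(B).
exclusive = prob(sum_is_2 & sum_is_12) == 0
union_exclusive = prob(sum_is_2 | sum_is_12)

# General union: P(A∪B) = P(A) + P(B) - P(A∩B).
union_general = prob(first_even | second_even)
print(independent, exclusive, union_exclusive, union_general)
```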

What is Deep Learning?

Deep Learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks.

What are descriptive analytics?

Descriptive Analytics tells you what happened in the past and helps a business understand how it is performing by providing context to help stakeholders interpret information.

What are Diagnostic Analytics?

Diagnostic Analytics takes descriptive data a step further and helps you understand why something happened in the past.

Explain how a ROC curve works.

The ROC curve is a graphical representation of the contrast between true positive rates and false positive rates at various thresholds. It is often used as a proxy for the trade-off between sensitivity (the true positive rate) and the false positive rate.
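
A sketch of how the points on the curve arise: for each threshold, everything scoring at or above it is predicted positive, and the resulting false positive rate and true positive rate give one point. The labels, scores and thresholds are invented.

```python
def roc_points(labels, scores, thresholds):
    """Return one (false_positive_rate, true_positive_rate) pair per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in thresholds:
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


labels = [1, 1, 0, 1, 0, 0]               # hypothetical true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # hypothetical model scores
points = roc_points(labels, scores, thresholds=[1.0, 0.75, 0.5, 0.0])
print(points)
```

Lowering the threshold moves along the curve from (0, 0) towards (1, 1).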

What is standard deviation?

A measure of the typical distance between each data point and the mean; it is the square root of the variance.
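
A small sketch using the standard library, treating a made-up dataset as the whole population:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical values

mean = statistics.fmean(data)
variance = statistics.pvariance(data)  # average squared distance from the mean
std_dev = statistics.pstdev(data)      # square root of the variance
print(mean, variance, std_dev)
```

For a sample rather than a full population, `statistics.variance` and `statistics.stdev` divide by n - 1 instead of n.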

What are the different layers on CNN?

There are four layers:

- Convolutional Layer: the layer that performs a convolutional operation, creating several smaller picture windows to go over the data.
- ReLU Layer: it brings non-linearity to the network and converts all the negative pixels to zero. The output is a rectified feature map.
- Pooling Layer: pooling is a down-sampling operation that reduces the dimensionality of the feature map.
- Fully Connected Layer: this layer recognizes and classifies the objects in the image.

Explain the difference between L1 and L2 regularization methods.

A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression. The key difference between these techniques is that Lasso shrinks the less important features' coefficients to zero, thus removing some features altogether. So, it works well for feature selection when we have a huge number of features.
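
The zeroing effect can be seen in a simplified one-coefficient setting. This is an illustrative assumption, not a full regression fit: each function returns the exact minimizer of the penalized problem stated in its docstring.

```python
def lasso_update(b, lam):
    """Minimizer of 0.5 * (x - b)**2 + lam * abs(x): soft-thresholding."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are set exactly to zero


def ridge_update(b, lam):
    """Minimizer of 0.5 * (x - b)**2 + 0.5 * lam * x**2: proportional shrinkage."""
    return b / (1.0 + lam)


coefficients = [2.0, -0.3, 0.1]  # hypothetical unpenalized estimates
lasso = [lasso_update(b, 0.5) for b in coefficients]
ridge = [ridge_update(b, 0.5) for b in coefficients]
print(lasso)  # the two small coefficients become exactly 0.0
print(ridge)  # every coefficient shrinks, but none reaches zero
```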

What are Recommender Systems?

A subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product. Recommender systems are widely used in movies, news, research articles, products, social tags, music, etc.

Why does data cleaning play a vital role in analysis?

Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process because, as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data they generate. Cleaning alone might take up to 80% of the time, making it a critical part of the analysis task.

What is Cluster Sampling?

Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.

Explain cross-validation.

Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a data set to test the model during the training phase (i.e. a validation data set) in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.

What is data science?

Data science uses automated methods to analyze large quantities of data and to extract knowledge from them. By combining aspects of statistics, computer science, applied mathematics and visualization, data science can turn the wide range of data generated in the digital age into new insights and new knowledge.

What do you understand by the term Normal Distribution?

Data is usually distributed in different ways, with a bias to the left or to the right, or it can all be jumbled up. However, data can also be distributed around a central value without any bias to the left or right, following a normal distribution. In that case the random variable is distributed in the form of a symmetrical bell-shaped curve.

What is EDA to you?

EDA, which refers to Exploratory Data Analysis, is a process for understanding the data before feeding it into a machine learning pipeline.

What is an Eigenvalue and Eigenvector?

Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.

What is the purpose of the group functions in SQL? Give some examples of group functions.

Group functions are necessary to get summary statistics of a data set. COUNT, MAX, MIN, AVG and SUM are all group functions; DISTINCT is a keyword that can be combined with them, as in COUNT(DISTINCT column).
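
A quick demonstration with Python's built-in `sqlite3` module; the `sales` table and its contents are made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('north', 10), ('north', 30), ('south', 5);
    """
)

# One summary row per region, computed by the group functions.
rows = conn.execute(
    """
    SELECT region, COUNT(*), MIN(amount), MAX(amount), AVG(amount), SUM(amount)
    FROM sales
    GROUP BY region
    ORDER BY region
    """
).fetchall()

for row in rows:
    print(row)
```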

What is an example of a data set with a non-Gaussian distribution?

I would not say that most machine learning datasets are from the Gaussian distribution. Data are generated by the real world (if you are talking about real-world data). They only relate to distributions in that we often pick what is hopefully a fairly close approximation to what (we hope/think) the data really look like in general. (I'm not going to go into the difference between frequentist and Bayesian theory here... I'm just going to keep things relatively simple.) What you probably meant is that we often use the Gaussian distribution to underlie our statistical/machine learning algorithms, usually implicitly assuming that the data are generated by this distribution.

What kind of error is addressed by regularization?

In machine learning, regularization is the process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting. It is basically a form of regression that constrains or shrinks the coefficient estimates towards zero. The regularization technique discourages learning an overly complex or flexible model, to avoid the risk of overfitting.

What are the basic assumptions to be made for linear regression?

Normality of error distribution, statistical independence of errors, linearity and additivity.

What is pruning in Decision Tree?

Pruning is a technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. So, when we remove sub-nodes of a decision node, this process is called pruning or opposite process of splitting.

Explain SVM algorithm in detail.

SVM stands for support vector machine. It is a supervised machine learning algorithm that can be used for both regression and classification. If you have n features in your training data set, SVM plots each data point in n-dimensional space, with the value of each feature being the value of a particular coordinate. SVM uses hyperplanes to separate the different classes, based on the provided kernel function.

How many sampling methods do you know?

- Simple random sampling: Software is used to randomly select subjects from the whole population.
- Stratified sampling: Subsets of the data set or population are created based on a common factor, and samples are randomly collected from each subgroup.
- Cluster sampling: The larger data set is divided into subsets (clusters) based on a defined factor, then a random sampling of clusters is analyzed.
- Multistage sampling: A more complicated form of cluster sampling, this method also involves dividing the larger population into a number of clusters. Second-stage clusters are then broken out based on a secondary factor, and those clusters are then sampled and analyzed. This staging could continue as multiple subsets are identified, clustered and analyzed.
- Systematic sampling: A sample is created by setting an interval at which to extract data from the larger population, for example selecting every 10th row in a spreadsheet of 200 items to create a sample of 20 rows to analyze.
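
Three of these methods can be sketched with the standard library alone. This is a minimal illustration, not a production sampler; the function names, the fixed seed and the 200-row data set are assumptions made for the example.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Pick n items uniformly at random, without replacement."""
    return random.Random(seed).sample(population, n)

def stratified_sample(population, key, n_per_stratum, seed=0):
    """Group items by key(item), then sample randomly within each group."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(n_per_stratum, len(group))))
    return sample

def systematic_sample(population, step, start=0):
    """Take every step-th item, beginning at index start."""
    return population[start::step]

rows = list(range(1, 201))  # 200 hypothetical spreadsheet rows, numbered 1..200
every_tenth = systematic_sample(rows, step=10, start=9)  # rows 10, 20, ..., 200
by_parity = stratified_sample(rows, key=lambda r: r % 2, n_per_stratum=5)
random_rows = simple_random_sample(rows, 20)
```

The systematic sample reproduces the "every 10th row of 200" case from the list, giving 20 rows. The stratified sample uses odd/even row numbers as a stand-in for a real grouping factor.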

What is Information Gain in Decision Tree Algorithm?

The Information Gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding attributes that return the highest information gain.
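
As a rough sketch of that calculation, the snippet below computes entropy and the information gain of a split. The toy rows and the attribute names (`weather`, `windy`, `buys`) are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus the weighted entropy after it."""
    before = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after

# Hypothetical toy data: does the customer buy?
data = [
    {"weather": "sunny", "windy": True, "buys": "yes"},
    {"weather": "sunny", "windy": False, "buys": "yes"},
    {"weather": "rainy", "windy": True, "buys": "no"},
    {"weather": "rainy", "windy": False, "buys": "no"},
]
gain_weather = information_gain(data, "weather", "buys")  # splits the classes perfectly
gain_windy = information_gain(data, "windy", "buys")      # tells us nothing about "buys"
```

A tree builder would pick `weather` here, since splitting on it removes all of the entropy while splitting on `windy` removes none.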

What is Collaborative filtering?

The process of filtering used by most of the recommender systems to find patterns or information by collaborating viewpoints, various data sources and multiple agents.

What is Machine Learning?

The simplest way to answer this question is: we give the data and an equation to the machine, and ask the machine to look at the data and identify the coefficient values in the equation. For example, for the linear regression y = mx + c, we give the data for the variables x and y, and the machine learns the values of m and c from the data.
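
To make the y = mx + c example concrete, here is a small sketch that learns m and c by ordinary least squares. The function name `fit_line` and the data points are assumptions chosen for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates of slope m and intercept c for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]     # made-up data generated from y = 2x + 1
m, c = fit_line(xs, ys)  # recovers m = 2 and c = 1
```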

Explain the two components of a Bayesian logic program.

A Bayesian logic program consists of two components. The first component is a logical one; it consists of a set of Bayesian clauses, which capture the qualitative structure of the domain. The second component is a quantitative one, which encodes the quantitative information about the domain.

Differentiate between univariate, bivariate and multivariate analysis.

These are descriptive statistical analysis techniques that are differentiated by the number of variables involved at a given point in time. For example, a pie chart of sales by territory involves only one variable and is an instance of univariate analysis. If the analysis attempts to understand the relationship between two variables at a time, as in a scatterplot, it is referred to as bivariate analysis. For example, analysing the volume of sales against spending is bivariate analysis. Analysis that deals with more than two variables, to understand the effect of the variables on the responses, is referred to as multivariate analysis.

Are expected value and mean value different?

They are not different, but the terms are used in different contexts. The mean is generally referred to when talking about a probability distribution or a sample population, whereas the expected value is generally referred to in the context of a random variable.

What is the difference between a tuple and a list in Python?

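Tuples have structure, lists have order. A tuple is an immutable, fixed-length sequence whose positions typically carry distinct meanings (like the fields of a record); a list is a mutable, variable-length sequence of typically similar items.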
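
A short sketch of how that difference shows up in practice; the variable names and values are made up for the example.

```python
# A tuple as a fixed-structure record: each position has its own meaning.
point = (3.5, -1.0)   # (x, y), hypothetical coordinates
x, y = point          # unpack by position

# A list as an ordered, growable collection of like items.
readings = [0.2, 0.4]
readings.append(0.6)  # lists can be modified in place

# Item assignment on a tuple raises TypeError.
try:
    point[0] = 0.0
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

# A tuple of hashable items is itself hashable, so it can serve as a dict key.
grid = {point: "sensor A"}
```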

What is meant by p-value?

When you perform a hypothesis test in statistics, a p-value helps you determine the strength of your results. A p-value is a number between 0 and 1, and its value indicates the strength of the evidence. The claim that is on trial is called the null hypothesis.

How is k-NN different from k-means clustering?

k-NN, or k-nearest neighbors is a classification algorithm, where the k is an integer describing the number of neighboring data points that influence the classification of a given observation. K-means is a clustering algorithm, where the k is an integer describing the number of clusters to be created from the given data.

Explain Decision Tree algorithm in detail.

A decision tree is a supervised machine learning algorithm mainly used for Regression and Classification. It breaks down a data set into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision tree can handle both categorical and numerical data.

What is Entropy in the Decision Tree Algorithm?

A decision tree is built top-down from a root node and involves partitioning the data into homogeneous subsets. ID3 uses entropy to check the homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided between two classes, the entropy is one.

What is bias-variance trade-off?

Bias is error introduced in your model due to oversimplification of the machine learning algorithm. It can lead to underfitting. Variance is error introduced in your model due to an overly complex machine learning algorithm: the model also learns noise from the training data set and performs badly on the test data set. It can lead to high sensitivity and overfitting. The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance; in practice, reducing one tends to increase the other, which is the trade-off.

Which technique is used to predict categorical responses?

Classification techniques are widely used in data mining for classifying data sets.

What is clustering?

Clustering is a segmentation process. It is used when we do not have a target variable but still want groups to be created.

What is correlation and covariance in statistics?

Correlation is considered the best technique for measuring and estimating the quantitative relationship between two variables; it measures how strongly two variables are related. Covariance indicates the extent to which two random variables change in tandem. It is a statistical term that describes the systematic relation between a pair of random variables, wherein a change in one variable is reciprocated by a corresponding change in the other.
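
The two measures can be compared with a small sketch. It assumes the sample (n - 1) form of covariance and Pearson's correlation; the height and weight figures are invented.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to lie between -1 and 1."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

heights_cm = [150, 160, 170, 180]
heights_m = [h / 100 for h in heights_cm]  # same heights, different unit
weights_kg = [52, 58, 69, 77]

cov_cm = covariance(heights_cm, weights_kg)
cov_m = covariance(heights_m, weights_kg)     # 100 times smaller than cov_cm
corr_cm = correlation(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)   # same as corr_cm
```

Changing the unit of measurement rescales the covariance but leaves the correlation untouched, which is why correlation is the easier of the two to interpret as strength of relationship.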

What is data normalization?

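Data normalization is a common preprocessing practice that rescales features to a comparable range so that they are weighted equally, rather than letting features with large numeric ranges dominate.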
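
Two common ways to do this are min-max scaling and z-score standardization. The sketch below shows both; which one counts as "normalization" varies by author, and the income figures are made up.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Centre values on 0 and scale them to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes = [30_000, 45_000, 60_000, 90_000]  # hypothetical feature on a large scale
scaled = min_max_scale(incomes)             # [0.0, 0.25, 0.5, 1.0]
standardized = z_score(incomes)
```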

What is sampling?

Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined.

What is meant by heteroscedasticity?

Heteroscedasticity is the opposite of homoscedasticity: it indicates that the variance of the error terms is not constant across observations. To reduce it, a log transformation is normally used.

What is the difference between Supervised Learning and Unsupervised Learning?

If an algorithm learns from labelled training data so that the knowledge can be applied to test data, it is referred to as supervised learning. Classification is an example of supervised learning. If the algorithm has no response variable (no labels) to learn from, it is referred to as unsupervised learning. Clustering is an example of unsupervised learning.

What are hash table collisions?

If the range of key values is larger than the size of our hash table, which is almost always the case, then we must account for the possibility that two different records with two different keys can hash to the same table index. There are a few different ways to resolve this issue. In hash table vernacular, the solution implemented is referred to as collision resolution.
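
One such collision resolution strategy is separate chaining, where each table slot holds a list of entries. The class below is a toy sketch of that one strategy (open addressing is another), and its name and size are assumptions.

```python
class ChainedHashTable:
    """Tiny hash table that resolves collisions by chaining (one list per bucket)."""

    def __init__(self, size=4):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))       # new key: append to the chain

    def get(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)

table = ChainedHashTable(size=4)
table.put(1, "one")
table.put(5, "five")  # 1 % 4 == 5 % 4, so both keys land in bucket 1
```

Both keys remain retrievable even though they share a bucket, at the cost of a short linear scan within that bucket.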

What is dimensionality reduction in machine learning?

In machine learning and statistics, dimensionality reduction is the process of reducing the number of random variables under consideration. It can be divided into feature selection and feature extraction.

What is an exact test?

In statistics, an exact (significance) test is a test where all assumptions, upon which the derivation of the distribution of the test statistic is based, are met as opposed to an approximate test (in which the approximation may be made as close as desired by making the sample size big enough). This will result in a significance test that will have a false rejection rate always equal to the significance level of the test. For example an exact test at significance level 5% will in the long run reject true null hypotheses exactly 5% of the time.

What is the difference between "long" and "wide" format data?

In the wide format, a subject's repeated responses are in a single row, and each response is in a separate column. In the long format, each row is one time point per subject. You can recognize data in wide format by the fact that columns generally represent groups.

What is the goal of A/B Testing?

It is hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest. A/B testing is a fantastic method for figuring out the best online promotional and marketing strategies for your business. It can be used to test everything from website copy to sales emails to search ads.

What is the goal of A/B Testing?

It is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest. An example could be comparing the click-through rate of two versions of a banner ad.
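
One common way to run such a comparison is a two-proportion z-test. The sketch below assumes that test with a pooled standard error and a normal approximation; the click and view counts are invented.

```python
from math import erf, sqrt

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """z statistic and two-sided p-value for the difference in click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / standard_error
    # Standard normal CDF via the error function, then the two-sided tail area.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical banner test: variant A gets 200 clicks, variant B gets 260,
# each from 10,000 views.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
```

With these made-up numbers the p-value falls well below 0.05, so the difference between the banners would usually be treated as statistically significant.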

Explain Star Schema

It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to recover information faster.

What is K-means?

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:

- The centroids of the K clusters, which can be used to label new data
- Labels for the training data (each data point is assigned to a single cluster)
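
The iterative procedure can be sketched in plain Python as Lloyd's algorithm. To keep the result reproducible, this sketch takes the initial centroids as an argument, which is an assumption of the example; real implementations usually pick them randomly or with k-means++. The points are made up.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, initial_centroids, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            updated.append(
                tuple(sum(dim) / len(members) for dim in zip(*members)) if members else old
            )
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, initial_centroids=[points[0], points[3]])
new_label = nearest((7.5, 7.0), centroids)  # label a new, unseen point
```

The function returns both outputs listed above: the centroids, reused here to label a new point, and one cluster label per training point.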

What is kernel SVM?

Kernel SVM is short for kernel support vector machine. Kernel methods are a class of algorithms for pattern analysis, and the most common one is the kernel SVM.

What is logistic regression? Or State an example when you have used logistic regression recently.

Logistic regression, often referred to as the logit model, is a technique to predict a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular political leader will win an election or not. In this case, the outcome of the prediction is binary, i.e. 0 or 1 (win/lose). The predictor variables here would be the amount of money spent on the candidate's election campaign, the amount of time spent campaigning, etc.

Do gradient descent methods always converge to same point?

No, they do not, because in some cases gradient descent reaches a local minimum or local optimum rather than the global optimum. Where it ends up depends on the data and the starting conditions.
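
A small sketch shows the dependence on the starting point. The function f(x) = x^4 - 3x^2 + x is chosen only for illustration because it has two minima; the learning rate and step count are likewise assumptions.

```python
def f(x):
    return x**4 - 3 * x**2 + x

def grad_f(x):
    # Derivative of f.
    return 4 * x**3 - 6 * x + 1

def gradient_descent(start, learning_rate=0.01, steps=2000):
    """Plain gradient descent on f with a fixed learning rate."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x

x_from_left = gradient_descent(start=-2.0)   # settles near x = -1.30, the global minimum
x_from_right = gradient_descent(start=2.0)   # settles near x = 1.13, a local minimum
```

Both runs use the same function and the same settings; only the starting point differs, and the run started on the right never finds the lower minimum.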

How can outlier values be treated?

Outlier values can be identified by using univariate or other graphical analysis methods. If there are only a few outliers, they can be assessed individually; for a large number of outliers, the values can be substituted with either the 99th or the 1st percentile values. Not all extreme values are outliers. The most common ways to treat outlier values are 1) to change the value and bring it within a range, and 2) to simply remove the value.
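
The percentile substitution can be sketched with `statistics.quantiles`. The "inclusive" interpolation method is one of several percentile conventions and is an assumption here, as are the function name and the data.

```python
from statistics import quantiles

def cap_outliers(values, lower_pct=1, upper_pct=99):
    """Replace values outside the given percentiles with the percentile values."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points: 1st..99th percentile
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

values = list(range(1, 100)) + [10_000]  # 99 ordinary values and one extreme value
capped = cap_outliers(values)
```

The extreme value is pulled down to the 99th percentile while values in the middle of the range are left unchanged.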

Can you write the formula to calculate R-square?

R-square can be calculated using the formula below: R-square = 1 - (Residual Sum of Squares / Total Sum of Squares)
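
As a quick worked sketch of the formula, with made-up actual and predicted values:

```python
def r_squared(actual, predicted):
    """1 - RSS / TSS, where TSS is measured around the mean of the actual values."""
    mean_actual = sum(actual) / len(actual)
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    tss = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    return 1 - rss / tss

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]
score = r_squared(actual, predicted)  # RSS = 0.5, TSS = 20, so R-square = 0.975
```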

What is selection bias?

Selection (or 'sampling') bias occurs in an 'active' sense when the sample data that is gathered and prepared for modeling has characteristics that are not representative of the true, future population of cases the model will see. That is, active selection bias occurs when a subset of the data is systematically (i.e., non-randomly) excluded from analysis.

What do you understand by the statistical power of sensitivity, and how do you calculate it?

Sensitivity is commonly used to validate the accuracy of a classifier (logistic regression, SVM, random forest, etc.). Sensitivity is simply "predicted true events / total events", where true events are events that were actually true and that the model also predicted as true. The calculation is straightforward: Sensitivity = True Positives / Positives in the Actual Dependent Variable, where true positives are positive events that are correctly classified as positive.
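
A minimal sketch of that calculation, assuming labels are encoded as 1 for positive and 0 for negative; the label vectors are invented.

```python
def sensitivity(actual, predicted):
    """True positives divided by all actual positives (also called recall)."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    actual_positives = sum(1 for a in actual if a == 1)
    return true_positives / actual_positives

actual = [1, 1, 1, 1, 0, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 1, 0, 0]
recall = sensitivity(actual, predicted)  # 3 of the 5 actual positives are caught: 0.6
```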

What are the assumptions required for linear regression?

There are four major assumptions: 1. There is a linear relationship between the dependent variables and the regressors, meaning the model you are creating actually fits the data, 2. The errors or residuals of the data are normally distributed and independent from each other, 3. There is minimal multicollinearity between explanatory variables, and 4. Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable.

What are the different kernels in SVM?

There are four types of kernels in SVM:

- Linear kernel
- Polynomial kernel
- Radial basis kernel
- Sigmoid kernel

Can you explain the difference between a Test Set and a Validation Set?

The validation set can be considered part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built. The test set, on the other hand, is used for testing or evaluating the performance of a trained machine learning model. In simple terms, the differences can be summarized as follows: the training set is used to fit the parameters, i.e. the weights; the validation set is used to tune the parameters; the test set is used to assess the performance of the model, i.e. to evaluate its predictive power and generalization.

What Are Confounding Variables?

A confounder is a variable that influences both the dependent variable and the independent variable.

What are categorical variables?

A variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.

What is the difference between Regression and classification ML techniques?

Both regression and classification techniques come under supervised machine learning. In supervised machine learning, we have to train the model using a labelled data set: while training, we explicitly provide the correct labels, and the algorithm tries to learn the pattern from input to output. If our labels are discrete values, it is a classification problem (e.g. A, B); if our labels are continuous values, it is a regression problem (e.g. 1.23, 1.333).

What are various steps involved in an analytics project?

- Understand the business problem.
- Explore the data and become familiar with it.
- Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc.
- After data preparation, start running the model, analyse the result and tweak the approach. This is an iterative step until the best possible outcome is achieved.
- Validate the model using a new data set.
- Start implementing the model and track the result to analyse the performance of the model over time.

What is the difference between Cluster and Systematic Sampling?

Cluster sampling is a technique used when it is difficult to study a target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample where each sampling unit is a collection, or cluster, of elements. Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is progressed in a circular manner, so once you reach the end of the list, it is progressed from the top again. The best example of systematic sampling is the equal probability method.

What is Interpolation and Extrapolation?

Interpolation is estimating a value that lies between two known values in a list of values. Extrapolation is approximating a value by extending a known set of values or facts beyond the known range.
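
With a straight line through two known points, the same formula covers both cases; only the position of x relative to the known range differs. Linear estimation is just one possible method, and the yearly figures below are invented.

```python
def linear_estimate(x, x0, y0, x1, y1):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Known (hypothetical) values: 100.0 in 2010 and 140.0 in 2020.
interpolated = linear_estimate(2015, 2010, 100.0, 2020, 140.0)  # inside the range: 120.0
extrapolated = linear_estimate(2025, 2010, 100.0, 2020, 140.0)  # beyond the range: 160.0
```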

Tell me the difference between an inner join, left join/right join, and union.

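In Venn diagram terms, an inner join returns only the rows that have a match in both tables. A left join returns every row of the left table together with the matching rows of the right table, with nulls where the right table has no match; a right join is the opposite of a left join. A full join returns all of the data from both tables combined. A union, by contrast, stacks the rows of two result sets that have the same columns, removing duplicates.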
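
The differences are easy to see on two tiny tables. This sketch uses Python's built-in `sqlite3` module with an in-memory database; the table names and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 'book'), (1, 'pen'), (4, 'lamp');
""")

# Inner join: only customers that have at least one matching order.
inner = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id ORDER BY o.item"
).fetchall()

# Left join: every customer; item is NULL (None) where no order matches.
left = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.id, o.item"
).fetchall()

# Union: stack two result sets with the same shape, dropping duplicate rows.
union = conn.execute(
    "SELECT id FROM customers UNION SELECT customer_id FROM orders ORDER BY 1"
).fetchall()
```

Here `inner` holds only Ana's two orders, `left` additionally lists Ben and Cy with `None` as the item, and `union` contains the distinct ids 1 to 4.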

What does P-value signify about the statistical data?

P-value is used to determine the significance of results after a hypothesis test in statistics. The p-value helps readers draw conclusions and is always between 0 and 1.

- P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.
- P-value < 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.
- P-value = 0.05 is the marginal value, indicating it is possible to go either way.

Explain the 80/20 rule, and tell me about its importance in model validation.

People usually start with an 80-20% split (80% training set, 20% test set) and then split the training set once more in an 80-20% ratio to create the validation set.
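
A minimal sketch of that two-stage split, using only the standard library. The function name, the fixed seed and the 1,000-row data set are assumptions for the example.

```python
import random

def split(rows, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of rows, then cut it into (kept, held-out) parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(1000))               # hypothetical row ids
train_full, test = split(rows)         # 800 training rows, 200 test rows
train, validation = split(train_full)  # 640 training rows, 160 validation rows
```

The model is fitted on `train`, tuned against `validation`, and evaluated once at the end on `test`, which takes no part in either earlier step.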

What is the standard deviation, how is it calculated?

Standard deviation (SD) is a statistical measure that captures how far the data points are spread around the mean. Step 1: Find the mean. Step 2: For each data point, find the square of its distance to the mean. Step 3: Sum the values from step 2. Step 4: Divide by the number of data points. Step 5: Take the square root.
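
The five steps translate directly into code. Dividing by the number of data points gives the population standard deviation; a sample standard deviation would divide by n - 1 instead. The data values are made up.

```python
from math import sqrt

def population_std(values):
    mean = sum(values) / len(values)             # step 1: find the mean
    squared = [(v - mean) ** 2 for v in values]  # step 2: squared distance to the mean
    total = sum(squared)                         # step 3: sum the squared distances
    variance = total / len(values)               # step 4: divide by the number of points
    return sqrt(variance)                        # step 5: take the square root

sd = population_std([2, 4, 4, 4, 5, 5, 7, 9])  # mean 5, variance 4, so sd = 2.0
```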

How would you create a taxonomy to identify key customer trends in unstructured data?

The best way to approach this question is to mention that it is good to check with the business owner and understand their objectives before categorizing the data. Having done this, it is always good to follow an iterative approach: pull new data samples and improve the model accordingly, validating it for accuracy by soliciting feedback from the stakeholders of the business. This helps ensure that your model is producing actionable results and improving over time.

What is the difference between Point Estimates and Confidence Interval?

Point estimation gives us a particular value as an estimate of a population parameter. The method of moments and maximum likelihood estimation are used to derive point estimators for population parameters. A confidence interval gives us a range of values which is likely to contain the population parameter. The confidence interval is generally preferred, as it tells us how likely this interval is to contain the population parameter. This likelihood or probability is called the confidence level or confidence coefficient and is represented by 1 - alpha, where alpha is the level of significance.
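
As a sketch, the sample mean serves as the point estimate and a normal-approximation interval is built around it. The z-based interval is an assumption of this example (for small samples a t-based interval is more usual), and the sample values are invented.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Point estimate of the mean plus a normal-approximation confidence interval."""
    alpha = 1 - confidence                   # level of significance
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for a 95% interval
    point_estimate = mean(sample)
    margin = z * stdev(sample) / sqrt(len(sample))
    return point_estimate, (point_estimate - margin, point_estimate + margin)

sample = [48, 52, 50, 49, 51, 53, 47, 50]  # hypothetical measurements
estimate, (low, high) = mean_confidence_interval(sample)
_, (low_99, high_99) = mean_confidence_interval(sample, confidence=0.99)
```

Raising the confidence level from 95% to 99% widens the interval around the same point estimate.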

Why does L1 regularization cause parameter sparsity whereas L2 regularization does not?

Regularization in statistics and machine learning is used to include some extra information in order to solve a problem in a better way. L1 and L2 regularization are generally used to add constraints to optimization problems. [Figure: constraint regions for L1 and L2 regularization] In the figure, H0 is a hypothesis. If you observe, in L1 there is a high likelihood of hitting the corners as solutions, while in L2 there is not. At a corner of the L1 region some parameters are exactly zero, which results in sparsity. In other words, the L1 penalty grows with the absolute value of each parameter and can push small parameters all the way to zero, whereas the L2 penalty squares each parameter, so it shrinks large parameters strongly but barely penalizes parameters that are already near zero.
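
The effect can be seen in the simplest one-parameter case, where both penalized problems have closed-form solutions. The sketch below assumes the objectives stated in the docstrings; the function names, the penalty strength 0.5 and the coefficient values are illustrative.

```python
def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero

def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return z / (1 + lam)

unregularized = [3.0, 0.4, -0.2, -2.5]  # hypothetical unpenalized coefficients
lasso_like = [l1_shrink(z, 0.5) for z in unregularized]  # [2.5, 0.0, 0.0, -2.0]
ridge_like = [l2_shrink(z, 0.5) for z in unregularized]  # all shrunk, none exactly zero
```

L1 removes the two small coefficients entirely, while L2 scales every coefficient by the same factor and leaves all of them non-zero.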

Give some situations where you will use an SVM over a RandomForest Machine Learning algorithm and vice-versa.

SVM and Random Forest are both used in classification problems.

- If you are sure that your data is clean and free of outliers, go for SVM. Conversely, if your data might contain outliers, Random Forest would be the best choice.
- Generally, SVM consumes more computational power than Random Forest, so if you are constrained by memory, go for Random Forest.
- Random Forest gives you a very good idea of variable importance in your data, so if you want variable importance, choose Random Forest.
- Random Forest is preferred for multiclass problems.
- SVM is preferred for multi-dimensional problem sets, such as text classification.

As a good data scientist, you should experiment with both and test for accuracy, or use an ensemble of many machine learning techniques.

Types of Selection Bias

- Sampling bias: A systematic error due to a non-random sample of a population, causing some members of the population to be less likely to be included than others and resulting in a biased sample.
- Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
- Data: When specific subsets of data are chosen to support a conclusion, or bad data are rejected on arbitrary grounds instead of according to previously stated or generally agreed criteria.
- Attrition: A kind of selection bias caused by attrition (loss of participants), discounting trial subjects or tests that did not run to completion.

During analysis, how do you treat missing values?

The extent of the missing values is determined after identifying the variables that have missing values. If any patterns are identified, the analyst has to concentrate on them, as they could lead to interesting and meaningful business insights. If no patterns are identified, the missing values can be substituted with mean or median values (imputation), or they can simply be ignored. There are various factors to consider when answering this question: understand the problem statement, understand the data, and then give the answer. One option is to assign a default value, which can be the mean, minimum or maximum value; getting into the data is important here. If the variable is categorical, a default category is assigned. If the data follow a normal distribution, the mean value is a reasonable choice. Whether to treat missing values at all is another important point to consider: if 80% of the values for a variable are missing, you can answer that you would drop the variable instead of treating the missing values.

What is PCA, KPCA and ICA?

PCA (Principal Component Analysis), KPCA (Kernel-based Principal Component Analysis) and ICA (Independent Component Analysis) are important feature extraction techniques used for dimensionality reduction.

How can you assess a good logistic model?

There are various methods to assess the results of a logistic regression analysis:

- Using a classification matrix to look at the true negatives and false positives.
- Concordance, which helps identify the ability of the logistic model to differentiate between the event happening and not happening.
- Lift, which helps assess the logistic model by comparing it with random selection.

What are the reasonable ways of increasing the accuracy of a linear regression model?

There could be many ways of improving the accuracy of a linear regression model; the most common are as follows. Outlier treatment: Regression is sensitive to outliers, so it becomes essential to treat the outliers with proper values. Replacing the values with the mean, median, mode or a percentile, depending on the distribution, can prove useful.

How will you define the number of clusters in a clustering algorithm?

Though the clustering algorithm is not specified, this question will mostly be asked in reference to K-means clustering, where "K" defines the number of clusters. The objective of clustering is to group similar entities in such a way that the entities within a group are similar to each other but the groups are different from each other. For example, the following image shows three different groups. [Figure: K-means clustering with three groups] The within-cluster sum of squares (WSS) is generally used to explain the homogeneity within a cluster. If you plot WSS for a range of numbers of clusters, you get the plot shown below, which is generally known as the elbow curve. [Figure: elbow curve for K-means clustering] The red circled point in the graph, i.e. number of clusters = 6, is the point after which you don't see any decrement in WSS. This point is known as the bending point and is taken as K in K-means. This is the widely used approach, but a few data scientists also use hierarchical clustering first to create dendrograms and identify the distinct groups from there.