Data Science Interview: Complete Questions


Collaborative Filtering

"People who bought X also purchased Y" is an example of which recommendation algorithm?

What is the equation for R squared

1 - (sum of the squared residuals / total sum of squares)

What is the difference between R squared and Adjusted R squared?

R squared measures the proportion of variance explained by the model, but it does not account for the fact that adding extra features can only make it look more favorable, never worse. Adjusted R squared penalizes extra features unless they provide sufficient improvement to the model.

In statistics, what is Central Tendency?

Central tendency is a measure of the center of a distribution, typically described by the mean, median, and mode.

Variance and overfitting of the data

Decision Tree Estimators is an example of which type of error(s)?

When independent variables in a regression model are correlated

Define Multicollinearity

P(A) + P(B) - P(A AND B)

Define P(A OR B) in another way

Collaborative Filtering is a method that uses preexisting information about other users who have similar or near-similar patterns in decision/selection behavior to predict the probability that the user of interest will make a similar decision, and then uses that prediction to recommend an option to the user. We need to mathematically derive a way to compute the similarity; we typically use cosine similarity. With User 1 = Ui and User 2 = Uj: Sim(Ui, Uj) = Ui dot Uj / [ abs(Ui) * abs(Uj) ] = cos(theta). The closer the cosine is to 1, the more similar the users [cos(0) = 1]. Finally, we calculate the estimated value as a weighted sum of the scores of all users in the dataset, where similar users receive much higher weights than dissimilar users. This is then normalized to be within the same bounds as the user scores.
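A minimal sketch of the cosine-similarity step with NumPy; the rating vectors below are made-up for illustration:

```python
import numpy as np

def cosine_similarity(u_i, u_j):
    """cos(theta) = (Ui . Uj) / (|Ui| * |Uj|); the closer to 1, the more similar."""
    return np.dot(u_i, u_j) / (np.linalg.norm(u_i) * np.linalg.norm(u_j))

# Hypothetical item-rating vectors for two users
user_1 = np.array([5, 3, 0, 4])
user_2 = np.array([4, 2, 1, 4])
print(cosine_similarity(user_1, user_2))  # ~0.98 -> very similar behavior
```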

Describe Collaborative Filtering. How does it work?

- The first step is to set a significance level, usually 0.05 - Create a simple linear regression model for each feature - Find the feature with the lowest p-value - Add this feature to the prior model and recalculate, keeping the model with the lowest p-values - Continue this process until no remaining feature is significant at the chosen level
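A rough sketch of these steps using statsmodels OLS; the helper name forward_select and the exact stopping rule are illustrative assumptions, not a standard API:

```python
import statsmodels.api as sm

def forward_select(X, y, alpha=0.05):
    """Greedy forward selection: repeatedly add the feature with the lowest p-value."""
    selected, remaining = [], list(X.columns)
    while remaining:
        # Fit a model for each candidate feature added to the already-selected set
        pvals = {col: sm.OLS(y, sm.add_constant(X[selected + [col]])).fit().pvalues[col]
                 for col in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:   # stop once no candidate clears the significance level
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```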

Describe the Forward Selection method for feature selection

A Markov Chain describes a process in which a state's future probability depends only on its current state. An example: word recommendations. In this system the model predicts the next word based only on the previous word, and not on anything that came before it.
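A toy sketch of the word-recommendation example in plain Python; the corpus is invented, and the model only looks one word back, as the Markov property requires:

```python
from collections import Counter, defaultdict

# Count next-word transitions: the next word depends only on the current word
corpus = "the cat sat on the mat the cat ate the fish".split()
transitions = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word][next_word] += 1

def recommend_next(word):
    """Return the most frequent next word given only the current word."""
    counts = transitions.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(recommend_next("the"))  # 'cat', the word seen most often after 'the'
```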

Describe a Markov Chain?

SVM is a supervised machine learning model that uses classification algorithms for two-group classification problems. 1.) We take our observations of data for our classification groups. 2.) We find the data points near the edges of our decision boundary for the groups. 3.) We create a boundary line, halfway between the edges of group A and group B. 4.) If a new data point is closer to Group A, it gets mapped to Group A, otherwise Group B.
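A brief sketch of those steps with scikit-learn's SVC on toy data; the dataset and parameters are illustrative assumptions:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two roughly separable groups of points
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# A linear SVM places the boundary halfway between the edge points of each group
model = SVC(kernel="linear", C=1.0)
model.fit(X, y)

print(model.support_vectors_[:3])   # the edge observations that define the boundary
print(model.predict([[0.0, 0.0]]))  # a new point is assigned to the nearer group
```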

Describe what a SVM model is

- Univariate refers to a single variable. This analysis is for describing the data and finding patterns that exist within it. - Bivariate refers to two variables. This analysis is for describing the interaction between two features. This deals with causes and relationships between two variables. Think of an independent / dependent variable. - Multivariate refers to analysis involving three or more variables. Similar to bivariate, however there may be more than one dependent variable.

Differentiate between univariate, bivariate and multivariate analysis

Dimensionality reduction is the process of reducing the number of features (dimensions) in a dataset while retaining as much information as possible. Its benefits: - It helps in data compression and reduces storage space - It reduces computation time - It removes redundant features

Explain Dimensionality Reduction and its benefits

- Input Layer: Receives the input - Hidden Layer: These are the layers between the input and the output. The initial hidden layers generally help detect low-level patterns, while the further layers combine output from the previous layers to find more patterns. - Output Layer: The output layer is the final layer that outputs the prediction. A weight is assigned to each input, and each node's weighted sum is passed through an activation function.

Explain Neural Network Fundamentals?

- First the points are transformed (centered) so that the center of the data sits at the origin of the axes - Then PCA measures the distances from the data to the line, and tries to either: - Find the line that minimizes those distances OR - Find the line that maximizes the distances from the projected points to the origin. - We measure the squared distance of each projected point to the line passing through the origin. - When we find the line that has the maximum sum of squares for the projected points to the origin, we have Principal Component 1. - Assume a 2-dimensional example: If the slope of the line is 0.25, we know that we went 4 units on one axis (A) for every 1 unit we went on axis B. That tells us that Feature A is more important for determining the spread of the data.
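A short sketch of PCA on synthetic two-feature data where feature B moves about 1 unit for every 4 units of feature A; the data and numbers are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
A = rng.normal(size=100)
B = 0.25 * A + 0.05 * rng.normal(size=100)   # roughly 1 part B for 4 parts A
X = np.column_stack([A, B])

pca = PCA(n_components=2)
pca.fit(X)   # centers the data, then finds the max-variance directions

print(pca.components_[0])             # loadings of PC1: the A/B linear combination
print(pca.explained_variance_ratio_)  # share of the spread captured by each component
```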

Explain PCA.

- Randomly select "k" features from a total of "m" features - Among the k features, calculate the node "d" using the best split point - Split the node into daughter nodes using the best splits - Repeat steps 2 and 3 until leaf nodes are finalized - Build the forest by repeating steps 1 to 4 "n" times to create "n" trees - A vote is made, and the class that receives the most votes wins
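A compact sketch of these steps with scikit-learn's RandomForestClassifier on the iris dataset; the dataset choice and hyperparameters are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = the "n" trees; max_features = the "k" features sampled at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

# Each tree votes; the class with the most votes wins
print(forest.score(X_test, y_test))
```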

Explain the steps for a Random Forest

- Take the entire dataset of input - calculate the entropy of the target variable as well as predictor attributes - Calculate the information gain of all attributes - Choose the attribute with the highest information gain as the root node - Repeat the same process on every branch till the decision node of each branch is finalized
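A minimal sketch of the entropy and information-gain calculations behind these steps; the labels and the split are made up:

```python
import numpy as np

def entropy(labels):
    """Entropy = -sum(p * log2(p)) over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, child_groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(parent_labels)
    child_entropy = sum(len(g) / n * entropy(g) for g in child_groups)
    return entropy(parent_labels) - child_entropy

y = np.array([1, 1, 1, 0, 0, 0])
# A hypothetical attribute that splits the classes perfectly
print(information_gain(y, [y[:3], y[3:]]))  # 1.0, the largest possible gain here
```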

Explain the steps of a Decision Tree

- Create a simpler model - For regression, use LASSO penalty to remove features whose beta coefficients hit 0. - Use cross validation such as k-folds, and train/test splits for model training

Explain ways you can prevent overfitting

Standardization: - PCA, LDA, because these algorithms attempt to capture maximum variance, so features need to be on the same scale. Normalization: - Image processing, neural networks, where data should be between 0 and 1.

Give an example of where Standard Scaling (Standardization) is better? Where Min-Maxing(Normalization) may be better?

- You can transform the data, using percent ranks - You can smooth / soft cap the numeric values of the feature

How can outliers be treated?

When the mean and variance do not change over time.

How can time-series data be declared as stationary?

[True Positive + True Negative] / [True Positive + True Negative + False Positive + False Negative]
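A quick check of that formula against scikit-learn, using made-up labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels 0/1
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(accuracy)                        # 0.75
print(accuracy_score(y_true, y_pred))  # same value from scikit-learn
```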

How can you calculate accuracy using a confusion matrix?

We use cross-validation to determine how many misclassifications to allow within the margin, i.e., to select the soft margin classifier.

How do we determine which soft margin to use for SVM boundaries?

SQRT( [A1-B1]^2 + [A2 - B2]^2)

How do you calculate Euclidean Distance?

A = [

How do you calculate eigenvalues and eigenvectors of a 3 by 3 matrix?
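The answer above is truncated in the source; as a hedged numerical sketch (the matrix A below is invented), NumPy can compute the eigenvalues and eigenvectors of a 3 by 3 matrix directly:

```python
import numpy as np

A = np.array([[2.0, 0.0, 0.0],
              [0.0, 3.0, 4.0],
              [0.0, 4.0, 9.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)         # roots of det(A - lambda*I) = 0
print(eigenvectors[:, 0])  # column i is the eigenvector paired with eigenvalues[i]

# Verify the defining relation A @ v = lambda * v for the first pair
v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(A @ v, lam * v))  # True
```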

- Impute the missing values (e.g., with the mean) - If the dataset is very large, the records with missing values can be removed for training the model

How do you deal with a dataset that has 30% missing values

Use an elbow plot. By evaluating the within-cluster variance as you add clusters, you will notice that the reduction in variance per additional cluster shrinks. When this drop-off (the "elbow") is visible in the plot, you know which value of K to pick.

How do you determine the optimal value K for kmeans

1. Select the number of clusters 2. Randomly select K distinct data points as centroids 3. Measure the distance of each point from the selected centroids and assign it to the nearest one 4. Calculate the mean of each cluster and use that as our new center 5. Repeat steps 3-4 6. When there are no more changes to the center of each cluster we are done. 7. We evaluate the amount of variance in each cluster and attempt to improve it by selecting another set of random centroids
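A short sketch of K-means plus the elbow check on synthetic blobs; the data and K range are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# inertia_ is the within-cluster variance; it drops sharply until K reaches the
# true number of clusters, then flattens out (the "elbow")
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```

Note that n_init=10 has scikit-learn repeat the random centroid initialization and keep the best run, which corresponds to step 7 above.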

How does K-means work?

Long-Short Term Memory is a special kind of RNN capable of learning long-term dependencies, remembering information for long periods as its default behavior. There are several steps in an LSTM network: 1. The network decides what to forget and what to remember 2. It selectively updates cell state values. 3. The network decides what part of the current state makes it to the output.

How does LSTM Network work?

Recurrent networks take as input not just the current input example, but also what they perceived previously in time. They share the same weight matrix across time steps... Errors that are generated are propagated back to prior hidden layers via backpropagation (through time) and are used to adjust the weights, via gradient descent, until the error can't go any lower.

How does RNN work?

We need to determine the sample size, which depends on: - Type II error (power) - Significance level - Minimum detectable effect. Rule of thumb: sample size ~= 16 * (variance) / [difference between treatment and control]^2. We need more samples if the variance is larger or if the detectable difference (delta) is smaller.
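The arithmetic of the rule of thumb, with hypothetical numbers:

```python
# Rule of thumb: samples per group ~= 16 * variance / delta^2
variance = 0.04   # variance of the metric (assumed)
delta = 0.02      # minimum detectable difference between treatment and control (assumed)

n_per_group = 16 * variance / delta ** 2
print(n_per_group)  # 1600 samples per group
```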

How long should an A/B test be run?

- Monitor the model - Continue to evaluate the metrics of the model

How should you maintain a deployed model?

A linear combination reflects the ratio of spread contributed by each feature to the principal component. If you have Features A and B, and determine the maximum sum-of-squares line to have a slope of 0.25, where the line moves 4 units in the A direction for every 1 unit in B, then the ratios are: - 4 parts A - 1 part B

In PCA terms, what does a linear combination mean?

In order to create a soft boundary, which can help reduce the variance of an SVM, we create a margin which will allow some points that are potentially misclassified to intentionally fall within an acceptable decision boundary. It is like a Fuzzy Boundary.

In SVM, what is a Soft Margin Classifier?

Margins are the shortest distance between an observation and the decision boundary

In SVM, what is the term for the shortest distance between an observation point and a decision boundary?

- Bootstrapping and normal resampling [sampling from a normal distribution] - Permutation resampling [AKA rearrangements or rerandomization] - Cross validation

List the three methods for resampling

In statistics, what is VIF?

Variance Inflation Factor is a method to assess the level of multicollinearity between one variable and the other variables in a multivariable regression. A value of 1 reflects no collinearity. A VIF of about 4 or more (equivalently, a tolerance of 0.25 or less) suggests multicollinearity.
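A small sketch of computing VIFs with statsmodels on deliberately correlated synthetic features:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=200)   # strongly correlated with x1
x3 = rng.normal(size=200)                    # independent

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X.values, i), 2))
# x1 and x2 show inflated VIFs; x3 stays near 1
```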

- SVMs allow for misclassification. This is done to adjust the decision boundaries and to correctly label data elements into their respective groups. - Low Bias / High Variance trade-off.

What are potential issues with SVM that have to be addressed?

Sparsity: You can have a large matrix with little information entered; perhaps not many people are giving us feedback to use. Grey Sheep: Users who don't quite fit certain clusters and are somewhere in the middle. Black Sheep: Users who have entirely different interests.

What are some issues that can occur with Collaborative Filtering?

- Creates wild swings in the beta coefficients, as changes in a correlated variable will affect the variable being calculated in the cost function. - Reduces the precision of the estimated coefficients

What are some issues with multicollinearity in a regression model?

Convolutional Layer - the layer that performs a convolutional operation, creating several smaller picture windows to go over the data. ReLU Layer - it brings non-linearity to the network and converts all the negative pixels to zero, creating a rectified feature map as output. Pooling Layer - pooling is a down-sampling operation that reduces the dimensionality of the feature map. Fully Connected Layer - this layer recognizes and classifies the objects in the image.

What are the different layers on CNN?

Filter Methods - Linear Discriminant Analysis - ANOVA / T-Tests - Chi-Square Wrapper Methods - Forward Selection - Backward Selection - Recursive Feature Elimination

What are the feature selection methods to select the right variables?

Helps analyze the difference in means across groups, but it will not tell you which groups were statistically different from each other.

What are the limitations of ANOVA?

- Selection Bias - Undercoverage Bias - Survivorship Bias

What are the types of biases that can occur in sampling?

High Variance and Low Bias. - To reduce the variance, we can increase the number of neighbors that influence the prediction by increasing the K value.

What bias/variance exists for kNN classification?

Train on Yr 1 / test on Yr 2.... Train on Yrs 1-2 / test on Yr 3.... Train on Yrs 1-3 / test on Yr 4..... (forward chaining)
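scikit-learn's TimeSeriesSplit implements this forward-chaining scheme; a small sketch with invented data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(48).reshape(-1, 1)   # e.g. 4 years of monthly observations

# Each split trains only on earlier periods and tests on the period that follows
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx.min(), "-", train_idx.max(),
          "| test:", test_idx.min(), "-", test_idx.max())
```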

What cross-validation technique would you use on a time series data set?

A/B tests, a.k.a controlled experiments - is used in the industry to make decisions - In simple form: Control A, Treatment B - Where A = control group for existing feature - B = treatment group for new features - Evaluate features with a subset of users

What is A/B testing?

- Bagging is used to reduce the variance of a decision tree. - It creates several subsets of data from the training samples, chosen randomly with replacement. - We create N' training records for each of M bags. Generally we want N' to be about 60% of the original data size N.
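A brief sketch of bagging decision trees with scikit-learn, using the 60% sample size mentioned above; the dataset choice is illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# M = 50 bags, each trained on a random 60% sample drawn with replacement
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.6,
    bootstrap=True,
    random_state=0,
)
print(bagging.fit(X, y).score(X, y))
```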

What is Bagging in Decision Trees?

- Bias is the oversimplification of a model. This leads to underfitting, where the model's estimates miss the actual values in the dataset. - Variance is the amount that an estimate of the target response variable will change given different training data.

What is Bias? What is Variance?

Bootstrapping is the process of taking a large number of small samples, with replacement, from a population.

What is Bootstrapping and Normal Resampling Methods?

It is a technique used when it is difficult to study a target population spread across a wide area. Instead of performing a simple random sample of individuals, we sample a collection of groups, or clusters.

What is Cluster Sampling

Deep learning is a subfield of Machine Learning concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks.

What is Deep Learning?

- Entropy is the measure of the randomness in the information being processed. -( p[0] * log(p[0]) + p[1] * log(p[1]))

What is Entropy? What is the equation for Entropy

- Each tree is dependent on prior trees. - Uses weak learners, where each new tree corrects the errors from the last tree. - You start with a tree with one split, called a decision stump. We then take a loss function, usually cross-entropy: Loss(p,q) = - Sum[ p(x) log q(x) ], where p = label and q = prediction. - We want to find the next best improvement of the loss by choosing the tree that decreases the loss function the most.

What is Gradient Boosting?

Mean Absolute Error is the average of the absolute values of the errors. It is easier to understand than the square root of the average of the squared errors (RMSE).

What is MAE and what benefits does it have over MSE?

Uses a single categorical factor to compare the effect of the independent variable on the dependent variable.

What is ONE-Way ANOVA?

A method to create permutations of sample data which are used to validate the assumptions of a null hypothesis. Example: If we are comparing a feature "Feedings" to a response of "Weight Gain", in a permutation test we start by collecting the test statistic for each group of the Feedings factor. Then we ignore the grouping and assume that, IF the data truly has no relationship, shuffling (permuting) the sample labels should STILL fail to reject the Null Hypothesis, based on a p-value derived from the distribution of the permuted test statistics. This form of sampling does not need a population, nor is it done with replacement. It also works for nonparametric datasets with low sample sizes.
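A bare-bones permutation test in NumPy for the Feedings / Weight Gain example; the group measurements are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
group_a = np.array([5.1, 4.8, 6.0, 5.5, 5.9])   # weight gain under feeding A (assumed)
group_b = np.array([6.2, 6.8, 6.5, 7.0, 6.1])   # weight gain under feeding B (assumed)

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

# Shuffle the labels many times and count how often a difference at least as
# extreme as the observed one appears by chance alone
n_permutations = 10_000
count = 0
for _ in range(n_permutations):
    rng.shuffle(pooled)
    diff = pooled[len(group_a):].mean() - pooled[:len(group_a)].mean()
    count += abs(diff) >= abs(observed)

print(count / n_permutations)  # permutation p-value
```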

What is Permutation Resampling?

Pooling is used to reduce the spatial dimensions of a CNN. It performs down-sampling operations to reduce the dimensionality and creates a pooled feature map by sliding a filter matrix over the input matrix.

What is Pooling on CNN, and how does it work?

Root Mean Square Error is a measure of the difference between observed and predicted values from a model, and is the square root of the MSE [Mean Square Error].

What is RMSE. How do you calculate it?

ROC is a plot, where we have the False Positive Rate on the X-axis, and the True Positive Rate on the Y-axis

What is ROC?

Survivorship bias is the logical error of focusing on aspects that support a process and casually overlooking those that did not because of their lack of prominence.

What is Survivorship Bias?

Term Frequency - Inverse Document Frequency is a statistic that reflects how important a word is to a document, weighting the frequency of the word in the document and offsetting it against the frequency of the word in the corpus.
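A quick sketch with scikit-learn's TfidfVectorizer on a toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # rows = documents, columns = terms

# Words that appear in many documents (like "the") are down-weighted by the IDF term
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```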

What is TF/IDF vectorization?

Uses two categorical factors to compare their effects (and interaction) on a dependent variable.

What is TWO-Way ANOVA?

A Box Cox transformation is a statistical technique to transform non-normal dependent variables into a normal shape.

What is a Box-Cox Transformation?

The ratio of two variances, technically of two mean squares: F = between-groups variance / within-group variance. The between-groups variance is based on (overall mean across all groups - mean of each individual group)^2.

What is a F Statistic? How is it derived?

-Logistic Regression seeks to model the probability of an event occurring depending on the values of the independent variables, which can be categorical or numerical. - Estimates the probability that an event occurs for a randomly selected observation versus the probability of it not occurring. Logistic Regression measures the relationship between dependent variable and one or more independent variables by estimating probabilities using its underlying logistic function (sigmoid) - Linear Model - Pass through a Sigmoid function to create a transformation where the data points are between 0 and 1 - The probabilities are mapped to an odds ratio, and then a logistic function is applied.

What is a Logistic Regression and how is it done?

Residuals are the difference between the observed data and the estimated value from a sample.

What is a Residual Error?

Support vectors are the observations that lie on the edge of the margin (or within a soft margin) and therefore define the decision boundary.

What is a Support Vector?

A t-test is a type of inferential statistic used to determine if there is a significant difference between the means of two groups, which may be related in certain features.

What is a T-test?

You predict something is True when it should have been False

What is a Type I Error

You predict something is False when it should have been True

What is a Type II Error

Similar to a Neural Network, an MLP has an input, hidden, and output layer. It has the same structure as a single-layer perceptron but with one or more hidden layers. - A single-layer perceptron can classify only linearly separable classes with binary output (0, 1) - An MLP, however, can classify nonlinear classes.

What is a multi-layer perceptron?

Similar in nature to a word embedding, vectors are created for a given user or product to create a meaningful relationship in which similar user or product behavior patterns are near one another in the dimensional space.

What is a recommender system?

Errors are the difference between the observed data and the predicted data

What is an Error?

Let A = a matrix and x = a vector. Then A x = lambda x. The eigenvalue (lambda) is the scalar factor by which the vector x is scaled when transformed by A.

What is an eigenvalue?

It is a vector whose direction remains unchanged (it is only scaled) after transforming it by a matrix multiplication.

What is an eigenvector

In wide format, we are creating columns for each category of interest. In long format, we may have a category melted into 1 column to represent a single column with varying values.

What is the difference between long and wide formatted data?

Multiple regression is simple linear regression with more independent variables.

What is multiple regression?

Precision is how well the model refrains from capturing False Positives. Precision = TP / [TP + FP]

What is precision? Define the equation

The process of using data to determine an optimal course of action by considering all relevant factors.

What is prescriptive analytics?

It is the possibility of an event, among all other possible outcomes. Probabilities lie in a range of 0 to 1.

What is probability?

Recall is a measure for how well a model is correctly capturing the true positives of your target variable. Recall = TP / [TP + FN]

What is recall? Define the equation

The introduction of an error due to a non-random population sample.

What is selection bias?

We start with a random starting point, and then we take samples at a fixed periodic interval which is determined beforehand. If we are trying to sample or observe something where we are not sure of the time of occurrence (estimating the number of deer in a certain location at an unknown time), we may consider using this technique, as we will be doing assessments at fixed intervals from a random starting point.

What is systematic sampling?

Mean Square Error is a measure of how close our model's line estimator's predictions matched the actual data value. It is the Total Sum of Squared differences between the observed value and the predicted value for each data point, divided by the number of observations, N.

What is the Mean Square Error?

Machine Learning is the concept of using algorithms to allow a system to learn without being explicitly programmed. It can be categorized into: - Supervised Machine Learning - Unsupervised Machine Learning - Reinforcement Learning. Deep Learning is a subset of Machine Learning that uses multi-layered artificial neural networks to learn representations from the data.

What is the difference between Deep Learning and Machine Learning?

AI solves tasks that require human intelligence, while ML is a subset of artificial intelligence that solves specific tasks by learning from data and making predictions. All machine learning is AI, but not all AI is machine learning.

What is the difference between ML and AI?

Standardization performs the following: - Makes the mean = 0 - Makes the standard deviation = 1 - Rescales the data without changing the shape of its distribution Normalization performs the following: - Rescales the data points into a range from 0 to 1
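A side-by-side sketch with scikit-learn's scalers on a made-up feature:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # toy feature with an outlier

standardized = StandardScaler().fit_transform(X)   # mean 0, standard deviation 1
normalized = MinMaxScaler().fit_transform(X)       # rescaled into the range [0, 1]

print(standardized.round(2).ravel())
print(normalized.round(2).ravel())
```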

What is the difference between Normalization and Standardization?

Box plots are used to describe the dataset in terms of quartiles and outliers, with the box containing the 25th to 75th percentile of the data. Histograms are used to understand the probability distribution by creating bars representing defined buckets (bins) of the data.

What is the difference between a box plot and a histogram?

A point estimate is a single value estimate of a parameter: - Example: A sample mean is an estimate of a population mean. - A point estimate lies WITHIN a confidence interval. It is exactly in the middle of the CI. A confidence Interval represents the probability that the data in the population will lie within the interval range. - A CI is preferred as it tells more information about the population than a point estimate.

What is the difference between a point estimate and a Confidence Interval?

While both are used to establish a relationship and measure the dependency between two random variables: - Correlation measures the quantitative relationship; it is a normalized value of the covariance - Covariance defines the type of relationship, but the actual numerical metric is not normalized - correlation = covariance / sqrt[ variance(x) * variance(y) ]

What is the difference between covariance and correlation

Supervised Learning: - Has a label. - Has a feedback mechanism - Random Forest, Logistic Regression, Linear Regression, kNN Unsupervised Learning: - No label. - No feedback mechanism - Gaussian Mixture Models, k-means clustering

What is the difference between supervised and unsupervised learning

T = (mean - Mu) / ( s / sqrt(n) )
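A small check of that formula against SciPy's one-sample t-test; the sample values and hypothesized mean are invented:

```python
import numpy as np
from scipy import stats

sample = np.array([2.3, 2.9, 3.1, 2.7, 3.4, 2.8, 3.0, 2.6])
mu = 2.5   # hypothesized population mean (assumed)

# T = (sample mean - mu) / (s / sqrt(n)), with s the sample standard deviation
t_manual = (sample.mean() - mu) / (sample.std(ddof=1) / np.sqrt(len(sample)))
t_scipy, p_value = stats.ttest_1samp(sample, mu)

print(round(t_manual, 3), round(t_scipy, 3))   # the two values match
print(round(p_value, 4))
```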

What is the equation for a Student's T-test?

Z = (X - μ) / σ

What is the equation for a Z-Score?

The log odds (logit) transform maps probabilities from the sigmoid's (0, 1) range onto the whole real line, making the model linear in the predictors and easier to interpret. It changes the view from probabilities to odds.

What is the purpose of the Log Odds Ratio in Logistic Regression?

The p-value reflects the probability of observing results at least as extreme as those measured, assuming the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis.

What is the significance of p-value

The kernel trick: in order to handle calculations in a higher-dimensional space, the vectors are not explicitly transformed; instead, kernel functions compute the dot products between the vectors as if they were in that space, which keeps the computation tractable.

What techniques are done for SVMs to optimize performance?

- Decision Tree Estimators - kNN - Support Vector Machine

What types of models have issues with variance?

No. With 2 groups there are 2 tests, so P(no false positives) = 0.95 * 0.95 = 0.9025, and P(at least 1 false positive) = 0.0975. Use the Bonferroni Correction: significance level / number of tests, i.e. 0.05 / 2 = 0.025.

When analyzing an A/B test that has 2 groups, is a p-value of 0.05 a good benchmark to reject the Null Hypothesis?

Logistic Regression and Linear Regression

Which algorithms are high in bias?

- kNN - Gradient Descent - K-means clustering - Logistic Regression - SVMs - Perceptrons & Neural Networks - PCA / Linear Discriminant Analysis since you want the Eigenvalue calculations to be on the same scale (calculating the max variance detected)

Which algorithms require that you Standardize your data scales for optimal performance?

- To estimate the accuracy of sample statistics by using subsets - To validate models by using random subsets / cross-validation

Why is resampling done?

1. The recall and precision have not been validated 2. The model may have overfitted 3. Was the data cross-validated? Was the train/test split done properly? 4. Is there an imbalance in the target variable class sizes? Was the data stratified during validation?

You are given a dataset on cancer detection. You have built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model performance?

