Data Science Interview Questions

t-value

Number of standard errors an estimate lies away from the hypothesized value

Detecting lack of normality / Accounting for lack of normality

Detecting: - histogram of residuals - normal probability plot of residuals (QQ plot) - formal test. Accounting for: - outliers -> robust regression - relationship between variables -> transform - can try a Box-Cox transformation

DFBETAS

- measures the change in the jth parameter estimate when the ith observation is deleted - shows the change in the betas - one DFBETA per parameter per observation - influence of an observation on each parameter coefficient

BoxCox

- helps identify a transformation that would make your residuals normally distributed

Detecting heteroskedasticity / Accounting for heteroskedasticity

Detecting: - residual plots checked for patterns. Accounting for: - WLS (Weighted Least Squares) - transform the data

Adaptable Systems

...

CRF (conditional random fields)

...

Categorical data

...

Convolutional net

...

Distributed Systems.

...

From Mud to Structure.

...

Interactive Systems

...

Ordinal features

...

Random-restart hill climbing

...

Stochastic gradient descent

...

The RM4Es (Research Methods Four Elements) is a good framework to summarize Machine Learning components and processes. The RM4Es include: Equation: Equations are used to represent the models for our research. Estimation: Estimation is the link between equations (models) and the data used for our research. Evaluation: Evaluation needs to be performed to assess the fit between models and the data. Explanation: Explanation is the link between equations (models) and our research purposes. How we explain our research results often depends on our research purposes and also on the subject we are studying.

...

Vector

...

Vector space

...

What make big datasets impractical

...

structure prediction

...

structured SVMs

...

structured perceptron

...

term frequency-inverse document frequency

...

value means more similar."

...

empirical rule

1 stdev - 68% 2 stdev - 95% 3 stdev - 99.7%
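
The percentages in the empirical rule can be checked numerically. A minimal sketch, assuming a normal distribution and using only the standard library; the helper name `share_within` is made up for this illustration:

```python
import math

def share_within(k: float) -> float:
    """Share of a normal distribution lying within k standard deviations of the mean."""
    # For a normal variable, P(|Z| < k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"{k} stdev: {share_within(k):.1%}")
```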

confidence interval

A 95% confidence interval is a range of values built by a procedure that covers the true population mean in 95% of repeated samples. It is *not* the probability that the mean lies inside one particular computed interval; rather, *I'm going to create an interval with a 95% chance that it covers the mean*

S curve

A pattern in which something is adopted slowly at first and then gains popularity quickly

Serial correlation

A pattern where values in a series are correlated with earlier values of the same series. You can shift the time series by an interval called a lag and then compute the correlation of the shifted and original series
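
The shift-and-correlate idea can be sketched in a few lines. This is an illustrative helper (the name `lag_correlation` and the toy series are assumptions, not from the card), using the plain Pearson formula on the overlapping parts of the series:

```python
from statistics import mean

def lag_correlation(series, lag=1):
    """Pearson correlation between a series and itself shifted by `lag` (lag >= 1)."""
    x, y = series[:-lag], series[lag:]  # original values and their lagged partners
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

trend = [1, 2, 3, 4, 5, 6, 7, 8]              # each value follows the previous one
alternating = [1, -1, 1, -1, 1, -1, 1, -1]    # each value flips the previous one
print(lag_correlation(trend), lag_correlation(alternating))
```

A steadily rising series gives a lag-1 correlation near +1, while a series that flips sign every step gives a value near -1.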

Pandas

A Python library for data manipulation

Logarithm

A quantity representing the power to which a fixed number (the base) must be raised to produce a given number

Ruby

A scripting language that can be used for data science; not as popular as Python

Data wrangling

AKA data munging; the conversion of data using scripting languages to make it easy to work with

ANOVA vs Regression

ANOVA - looks at differences in group means; if there were no difference, the categorical variable would have no predictive power. Regression - asks whether a variable has any impact on, or predictive power for, the variable of interest

DataFrame['A'].sum()

Adds up all values in column A

Adjusted R^2

Adjusts R^2 for the loss of degrees of freedom when additional independent variables are added to a regression. Unlike R^2, it has no direct interpretation as a proportion of variance explained

Centroid

Center of a cluster

Unsupervised learning

Class of machine learning algorithms designed to identify groupings of data without knowing in advance what the groups will be

Regularization

Common ways to reduce overfitting: collect more training data; introduce a penalty for complexity via regularization; choose a simpler model with fewer parameters; reduce the dimensionality of the data. L1 regularization can be understood as a technique for feature selection.

ANOVA

Analysis of variance. Compares mean values of a continuous variable across multiple categories/groups. The null hypothesis assumes the group means are equal. Can be fit with a GLM

D3

Data-Driven Documents; a JavaScript library that eases the creation of interactive visualizations embedded in web pages

Type II error (beta)

False negative: failing to reject the null hypothesis when you should reject it. Example: telling a pregnant woman she is not pregnant

Regression

Fitting a model to data

Unstructured Information Management Architecture (UIMA)

Framework developed at IBM to analyze unstructured information especially natural language

GATE

General Architecture for Text Engineering; open source java-based framework for natural language processing tasks

Variance

How much a list of numbers varies from the average; the average of the squared differences of each number from the mean
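
The definition translates directly into code. A small sketch with made-up numbers, compared against `statistics.pvariance`, which computes the same population variance:

```python
from statistics import mean, pvariance

values = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative data, not from the card
mu = mean(values)

# Average of the squared differences of each number from the mean.
manual_variance = sum((v - mu) ** 2 for v in values) / len(values)

print(manual_variance, pvariance(values))
```

Dividing by `len(values) - 1` instead gives the sample variance, which is what `statistics.variance` returns.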

Bias

In machine learning when a learner consistently learns the same thing wrong

Dunnett's Test

Makes k-1 comparisons (k being the total number of groups), each testing one group against the control (group 1). A group differs significantly when its mean falls outside the adjusted interval around the control mean

Instance-based learning

KNN belongs to a subcategory of nonparametric models that is described as instance-based learning. Models based on instance-based learning are characterized by memorizing the training dataset, and lazy learning is a special case of instance-based learning that is associated with no (zero) cost during the learning process.

L2 regularization

L2 regularization (sometimes also called L2 shrinkage or weight decay)

Probability distribution

Listing of all possible distinct outcomes and their probabilities of occurring; the probabilities sum to 1

common ways for evaluation of model

MAPE (mean absolute percentage error) - ex: off by 3% on avg. MAE (mean absolute error) - ex: off by 100 on avg

K-nearest neighbors

Machine learning algorithm that classifies things based on their similarity to nearby neighbors. Pick the number of neighbors K

Correlation coefficient

Measure of how closely two variables correlate. Ranges from -1 to 1

Prior distribution

Models the many plausible values of the unknown quantity to be estimated in Bayesian inference

Nonparametric

Nonparametric models can't be characterized by a fixed set of parameters, and the number of parameters grows with the training data. Two examples of nonparametric models that we have seen so far are the decision tree classifier/random forest and the kernel SVM.

Gradient descent

Optimization algorithm for finding the input to a function that produces the optimal value; iterative
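
A minimal sketch of the iteration, assuming a one-dimensional function whose gradient is known in closed form. The function f(x) = (x - 3)^2, the learning rate, and the step count are all illustrative choices:

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)  # move downhill by a fraction of the slope
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```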

PCA

PCA attempts to find the orthogonal component axes of maximum variance in a dataset. Kernel principal component analysis

Layers

Pipes and Filters, Blackboard, Broker, Model-View-Controller, Presentation-Abstraction-Control, Microkernel, and Reflection , ...

What does regularization do?

Regularization adds a penalty that grows as model complexity increases. The regularization parameter (lambda) penalizes all the parameters except the intercept, so that the model generalizes from the data and does not overfit.

what comparison to use tests

Response \ Predictors | Categ   | Contin  | Cont + Categ
Continuous            | ANOVA   | OLS     | OLS
Categorical           | Log Reg | Log Reg | Log Reg
Response = target variable; Predictors (explanatory) = input variables

DataFrame['A'].mean()

Returns average of values in column A

DataFrame.head(n = 5)

Returns first n rows of a dataframe

Selection bias

Selection bias is a kind of error that occurs when the researcher decides who is going to be studied. The sample is not representative of the whole population.

Cross-validation

Set of techniques that divide data into training sets and test sets, often 80-20. The algorithm is fit on the training set, where the correct categorization is given, and then evaluated on the held-out test set
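
One common variant is k-fold cross-validation, where every observation is used for testing exactly once. The sketch below only produces index splits (no model is fit), and the helper name `k_fold_splits` is an assumption for illustration; with 5 folds each split is 80-20:

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds take one extra sample so all data is used.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop

splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print(len(train), len(test), test)
```

In practice the rows are usually shuffled before splitting; that step is left out here to keep the output predictable.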

SQL

The ISO standard query language for relational databases

Data science

The ability to extract knowledge and insights from large and complex data sets

Artificial intelligence

The ability to have machines act with apparent intelligence. Can be through symbolic logic or statistical analysis

Predictive analytics

The analysis of data to predict future events usually to aid in business planning

N-gram

The analysis of sequences of N items; usually words in natural language
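
Extracting the N-grams themselves is a sliding window over the tokens. A short sketch, assuming whitespace tokenization (real pipelines usually tokenize more carefully):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("the cat sat".split(), 2)
print(bigrams)
```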

Correlation

The degree of relative correspondence between two variables

Predictive modeling

The development of statistical models to predict future events

Central Limit Theorem

The distribution of sample means is approximately normally distributed, even if the population they are drawn from is not normally distributed, given that the sample size is large enough.
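
A small simulation can make this concrete. The sketch below assumes an exponential population (strongly skewed, mean 1, standard deviation 1); the sample size of 50, the 2000 repetitions, and the seed are arbitrary choices:

```python
import random
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential population with rate 1."""
    return mean(rng.expovariate(1.0) for _ in range(n))

means = [sample_mean(50) for _ in range(2000)]

# Theory: the sample means centre on 1 with spread 1 / sqrt(50), about 0.141.
print(mean(means), stdev(means))
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the underlying population is far from normal.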

Classification

The identification of two or more discrete categories for items; a classic machine learning task. Spam or ham. Movie genres. Supervised learning.

Monte Carlo method

The use of randomly generated numbers as part of an algorithm
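
A classic illustration is estimating pi from random points in the unit square. This is a sketch, with the function name and seed chosen for the example:

```python
import random

def estimate_pi(n_points, seed=0):
    """Estimate pi from the share of random points that land inside a quarter circle."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n_points)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    # Quarter-circle area / unit-square area = pi / 4.
    return 4 * inside / n_points

print(estimate_pi(100_000))
```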

Classification error

This is a useful criterion for pruning but not recommended for growing a decision tree, since it is less sensitive to changes in the class probabilities of the nodes.

Objective function

Used to find the optimal result of an objective; used to solve an optimization problem

Feature engineering

Iteratively choosing and constructing features from the data to come up with a good model

Latent variable

Variables that are not directly observed but inferred from other variables that are observed

T-distribution

Variation of normal distribution that accounts for the fact that you're only using a sample of values not all of them

Three Classes Of Metrics: Centrality

Volatility, Bumpiness. , ...

What is p-value?

When you perform a hypothesis test in statistics, a p-value helps you determine the strength of the evidence against the null hypothesis, the claim that is on trial. A p-value is a number between 0 and 1. A low p-value (≤ 0.05) indicates strong evidence against the null hypothesis, so we reject the null hypothesis. A high p-value (> 0.05) indicates weak evidence against the null hypothesis, so we fail to reject it; this is not the same as accepting it as true. A p-value very close to the 0.05 cutoff is marginal and could go either way. To put it another way: high p-values mean your data are likely under a true null; low p-values mean your data are unlikely under a true null.

Big data

Working with large datasets that usually require distributed storage

Variance Inflation Factor (VIF)

a method of detecting the severity of multicollinearity by looking at the extent to which a given explanatory variable can be explained by all the other explanatory variables in the equation

Six Sigma

approximate solutions, the 80/20 rule, cross-validation, design of experiments, modern pattern recognition, lift metrics, third-party data, Monte Carlo simulations, or the life cycle of data science projects , ...

Type 1 Sums of squares

Builds the model sequentially, variable by variable, and tells the importance of each variable as it is added

ordinal variables

categorical ordered variables

Cook's D

Measures the change in parameter estimates when the ith observation is deleted

Type 3 sums of squares

The default; asks how important this variable is if it is the last one added to the model

Mallow's Cp

Criterion for variable selection that aims to choose the model with the smallest number of variables that still fits well

association

expected value(mean) of one variable changes across levels of another variable

Type I error (alpha)

False positive: rejecting the null hypothesis when you should not reject it. Example: telling a man he is pregnant

cyclical

happening again and again in the same order; happening in cycles

leptokurtic

Heavy tails: a larger amount of data in the tails

Examining distributions

Histogram - check whether it is symmetric. QQ plot - if the points fall along the line, the data are normal. Box plots

excess kurtosis

kurtosis - 3

Standard Error of Mean

Measure of the estimated variability of sample means

Disadvantages of multiple linear regression

Many options and choices; difficult to pick the best model; difficult to interpret models

outlier vs influential

Outlier - does not follow the path of the data. Influential - follows the pattern but lies on the outskirts, and has high influence on the fitted line

ANOVA parts

Predicted value - the group mean. Residuals - differences between observations and predicted values

Fvalue

Ratio of the variance explained by the model (between groups) to the unexplained variance (within groups)

left skewed

tail on left side(skew is below 0)

AI or machine learning

The main conferences are NIPS and ICML, along with conferences like AI Stats, UAI, and KDD, which is more data science-oriented , ...

R^2

the proportion (percent) of the variation in the values of y that can be accounted for by the least squares regression line ie how much variance explained
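
For a simple one-predictor regression, R^2 can be computed as 1 - SS_residual / SS_total. The sketch below fits the least squares line by hand; the function name and the data are illustrative:

```python
from statistics import mean

def r_squared(x, y):
    """R^2 of the least squares line fitted to (x, y)."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))  # unexplained variation
    ss_tot = sum((b - my) ** 2 for b in y)                                   # total variation in y
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))  # points exactly on a line
print(r_squared([1, 2, 3, 4], [1, 3, 2, 4]))  # noisier relationship
```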

platykurtic

Thin tails: a smaller amount of data in the tails

Correlograms

trends, change point, normalization and periodicity , ...

Nominal

unordered categorical variables

Example ways to explore data (Exploratory Data Analysis (EDA))

variables, distributions, associations, anomalies

linear association

when the points in a scatter plot seem to form a line

ICLR

Stands for the International Conference on Learning Representations , ...

Cubic Polynomial Fix

y = B0+B1X1+B2X1^2+B3X1^3+e

Clustering

Unsupervised learning technique for dividing data into groups based on an algorithm

correlation

A measure of the extent to which two factors vary together, and thus of how well either factor predicts the other. The strength of linear relationship

Covariance

A measure of the relationship between two variables whose values are observed at the same time

Overfitting

A model that is too tied to a training set and will not perform well on test data

Deep learning

A multi-level algorithm that gradually identifies things at higher levels of abstraction

DFFITS

- measures the impact the ith observation has on the predicted value - shows how much the predicted value changes when the observation is taken out

Decision tree problems

- Overfitting: caused by the presence of noise and a lack of representative instances - Bias error: happens when you place too many restrictions on target functions, e.g. true or false splits - Variance error: how much a result will change based on changes to the training set

Spearman Rank

Ranges from -1 to 1. Used here on the rank-based relationship between residuals and predicted values; close to 0 = homoskedasticity = good

Reinforcement learning

A class of machine learning algorithms in which the process is not given specific goals but, as it makes decisions, receives feedback on whether it is doing well or not

SAS

A commercial statistical software suite that includes a programming language

Poisson distribution

A distribution of independent events used to predict the probability of an event occurring in a set time or place

Binomial distribution

A distribution of independent events with two mutually exclusive possible outcomes, a fixed number of trials, and a constant probability of success. Discrete probability distribution. Graphed using histograms.

Naive Bayes classifier

A family of algorithms that consider every feature as independent of any other feature

Histogram

A graphical representation of the distribution of a set of numeric data usually a vertical bar graph

What are kernels?

A kernel is a similarity function. It is a function that you, as the domain expert, provide to a machine learning algorithm. It takes two inputs and spits out how similar they are. Kernels offer an alternative. Instead of defining a slew of features, you define a single kernel function to compute similarity between images. You provide this kernel, together with the images and labels to the learning algorithm, and out comes a classifier.

Linear/General Linear Model

A linear function representing real-world phenomena. The model also represents patterns found in graphs and/or data.

Feature

A machine learning expression for a piece of measurable information

Standard normal distribution

A normal distribution with a mean of 0 and a standard deviation of 1

Coefficient

A number or algebraic symbol prefixed as a multiplier to a variable or unknown quantity (slope in line equation)

Data structure

A particular arrangement of units of data such as an array or a tree

Gaussian distribution

A probability distribution that when graphed is a symmetrical bell curve with the mean at the center

Confidence interval

A range specified around an estimate to indicate margin of error combined with a probability that a value will fall in that range

Neural network

A robust function that takes an arbitrary set of inputs and fits it to an arbitrary set of outputs that are binary; unique because of hidden layer of weighted functions

Time series data

A sequence of measurements of some quantity taken at different times often at equally spaced intervals

Algorithm

A series of repeatable steps for carrying out a certain type of task with data

Data engineer

A specialist in data wrangling who builds the infrastructure for real, tangible analysis. Runs ETL

Tukey Test

A statistical test to measure the difference between several means and tell the user which ones are statistically different from the rest. Makes several pairwise comparisons; creates a diffogram

Durbin-Watson test

A test to determine whether first-order autocorrelation is present. If the DW stat differs sufficiently from 2.00, there is serial correlation: DW < 2 indicates positive correlation, and DW > 2 indicates negative correlation.
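
The statistic itself is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A sketch with made-up residuals (the function name and data are illustrative, and no significance bounds are computed):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

sticky = [1, 1, 1, -1, -1, -1]      # neighbours tend to agree: positive autocorrelation
flipping = [1, -1, 1, -1, 1, -1]    # neighbours tend to disagree: negative autocorrelation

print(durbin_watson(sticky), durbin_watson(flipping))
```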

Supervised learning

A type of machine learning algorithm in which a system is taught to classify input into specific known classes

Discrete variable

A variable whose potential value must be one of a specific number of values

Continuous variable

A variable whose value can be any of infinite values

Computational linguistics

Also called natural language processing (NLP); converting the text of spoken languages into structured data to extract valuable information

Backpropagation

An algorithm for iteratively adjusting the weights used in a neural network system. Often used to implement gradient descent.

PageRank

An algorithm that determines the importance of something typically to rank it in a list of search results

Random forest

An algorithm used for regression or classification that uses a collection of tree data structures; the trees "vote" on the best model

Bayes' Theorem

An equation for calculating the probability that something is true if something potentially related is true. P(A|B) = P(B|A) * P(A) / P(B)
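
A worked example helps show why the formula matters. The numbers below are hypothetical: a condition with 1% prevalence, a test that detects it 95% of the time, and a 5% false positive rate. P(B) is expanded with the law of total probability:

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) from P(A), P(B|A), and P(B|not A)."""
    # P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))
```

Even with a positive result from a fairly accurate test, the probability of having the condition stays low because the condition is rare to begin with.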

Perl

An older scripting language with roots in pre-Linux UNIX systems. Popular for text processing like data cleanup and enhancement

Angular JS

An open-source JavaScript library maintained by Google and the community. Lets you create single-page web applications to display results

R

An open-source programming language and environment for statistical computing and graph generation

AUC

Area under the ROC curve. AUC ROC indicates how well the predicted probabilities for the positive class are separated from those for the negative class. When to use? AUC is scale-invariant: it measures how well predictions are ranked, rather than their absolute values. For example, if you as a marketer want to find a list of users who will respond to a marketing campaign, AUC is a good metric, since the predictions ranked by probability give the order in which you will build the list of users to send the campaign to. Another benefit of AUC is that it is classification-threshold-invariant, like log loss: it measures the quality of the model's predictions irrespective of what classification threshold is chosen, unlike F1 score or accuracy, which depend on the choice of threshold.
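
The ranking interpretation gives a direct way to compute AUC: it is the share of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below uses that pairwise definition on toy data; it is fine for small inputs but quadratic in size:

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(auc(labels, scores))
print(auc(labels, [10 * s for s in scores]))  # rescaling the scores leaves AUC unchanged
```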

Correlation and Covariance

Both correlation and covariance describe the relationship and measure the dependency between two random variables. Correlation: a standardized measure of how strongly two variables are linearly related; it is unitless and ranges from -1 to 1. Covariance: a measure that indicates the extent to which two random variables change in tandem. It is a statistical term that describes the systematic relation between a pair of random variables, wherein a change in one variable is reciprocated by a corresponding change in the other. Its magnitude depends on the units of the variables.
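
The practical difference shows up when units change. In this sketch (sample formulas with an n - 1 denominator; the height and weight numbers are invented), converting heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation untouched:

```python
from statistics import mean

def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    return covariance(x, y) / (covariance(x, x) * covariance(y, y)) ** 0.5

heights_m = [1.5, 1.6, 1.7, 1.8]
weights_kg = [50, 58, 69, 77]
heights_cm = [h * 100 for h in heights_m]

print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```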

Bias-variance tradeoff

Both are sources of model prediction error. Bias: Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model. It leads to high error on both training and test data. Variance: Variance is the variability of model prediction for a given data point, a value which tells us the spread of our predictions. A model with high variance pays a lot of attention to training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data. If our model is too simple and has very few parameters, then it may have high bias and low variance. On the other hand, if our model has a large number of parameters, then it is going to have high variance and low bias. So we need to find the right balance without overfitting or underfitting the data.

CSV

Comma-separated values; a common data file type

Lift

Compares the frequency of an observed pattern with how often you'd expect to see that pattern by chance; a lift near 1 means chance

Cost function

Cost function is the key to solving any problem using optimization

DataFrame['A'].count()

Counts the number of non-null values in column A; equals the number of rows when the column has no missing values

K means clustering

Data mining algorithm that groups (clusters) data points into K clusters

Quartile

Data set divided into 4 groups, with 25% of the data in each

Difference between decision, random forest, and xgboost (simple explanation)

Decision tree - a single, simple tree-shaped diagram of decisions. Random forest - a large number of trees built independently and combined (using averages or majority rule) at the end of the process. XGBoost - also combines decision trees, but starts combining them at the beginning, building each new tree to correct the errors of the trees before it

Accuracy

Execution time, memory usage, throughput, tuning, and adaptability

ETL

Extract, Transform, Load

Stepwise Selection Models

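Forward - start with the most important variable and add more. Backward - start with all variables and remove. Stepwise - start with the most important; variables can then be added or removed. Drawbacks: p-values are not true (significance is overestimated); incorrect DFs; biases in parameter estimates, predictions, and std errors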

What is PCA and give example of how it's used

Goal is to reduce the number of dimensions in a dataset that has too many of them, by looking for uncorrelated factors. In short, PCA reduces the dimensionality of the data set by finding directions through the data where the variation is greatest; each direction is referred to as a principal component. PCA provides a new set of dimensions ordered by highest variance; the directions are described by loadings, and the data projected onto them are called scores. This captures the highest amount of variance from a dataset with the fewest variables through linear combinations. An example is medical gene analysis, where the columns greatly outnumber the rows. It can also be used to address multicollinearity

How would you compare different logistic regression models?

Goodness of fit - likelihood ratio test: compares nested models to see whether the larger, more complex model is a statistically significant improvement - AIC, BIC, concordance. Validation techniques - ROC - AUROC: area under the ROC curve. That metric typically ranges from 0.50 (no better than chance) to 1.00, and values above 0.80 indicate that the model does a good job in discriminating between the two categories - classification accuracy

Bayesian network

Graphs that compactly represent the relationship between random variables for a given problem

Stratified sampling

Population is divided into homogeneous groups called strata

K-Nearest Neighbor vs K-means Clustering

K-Nearest Neighbor (KNN) - supervised - classification or regression. K-Means - unsupervised - clustering. K-NN is a classification or regression machine learning algorithm, while K-means is a clustering machine learning algorithm.

How would you go about making sure that all of the assumptions for linear regression have been met?

Linearity - observed vs predicted plot: points should be symmetrically distributed around the diagonal line; residual vs predicted plot: points should sit around a horizontal line with constant variance. Independence - residual time series plot. Normality - normal probability plot or normal quantile plot of residuals. Equal/constant variance (homoscedasticity) - residuals vs predicted values. No perfect multicollinearity - VIF

Parametric versus nonparametric models

Machine learning algorithms can be grouped into parametric and nonparametric models. Using parametric models, we estimate parameters from the training dataset to learn a function that can classify new data points without requiring the original training dataset anymore. Typical examples of parametric models are the perceptron, logistic regression, and the linear SVM. In contrast, nonparametric models can't be characterized by a fixed set of parameters, and the number of parameters grows with the training data. Two examples of nonparametric models that we have seen so far are the decision tree classifier/random forest and the kernel SVM. KNN belongs to a subcategory of nonparametric models that is described as instance-based learning. Models based on instance-based learning are characterized by memorizing the training dataset, and lazy learning is a special case of instance-based learning that is associated with no (zero) cost during the learning process.

Gradient boosting

Machine learning technique for regression and classification. Produces a prediction model in the form of an ensemble of weak prediction models typically decision trees; stage-wise

DataFrame.boxplot()

Make a boxplot using matplotlib

DataFrame.hist()

Make a histogram using matplotlib

Linear algebra

Math that deals with vector spaces and operations on them such as addition and subtraction

Kurtosis

Measure of the fatness of the tails of a probability distribution relative to that of a normal distribution. Indicates likelihood of extreme outcomes. (A normal distribution has kurtosis = 3.)

Other dimension reduction techniques

Missing value ratio - drop variables with too many missing values. Low variance filter - drop constant variables. High correlation filter - drop one of each pair of highly correlated variables. Random forest - find the most important variables. Backward feature elimination and forward feature elimination - both take large computational time. Factor analysis - for highly correlated variables. PCA

Logistic regression

Model where the dependent variable is categorical. Estimates the probability of each category of the dependent variable given one or more independent variables

What do you understand about precision and recall and F1score.

Precision - proportion of predicted positives that are truly positive: TP/(TP+FP). - When to use? Precision is a valid choice of evaluation metric when we want to be very sure of our prediction. For example: if we are building a system to predict whether we should decrease the credit limit on a particular account, we want to be very sure about our prediction or it may result in customer dissatisfaction. - Caveats: being very precise means our model will leave a lot of credit defaulters untouched and hence lose money. Recall - proportion of actual positives that are correctly classified: TP/(TP+FN). - When to use? Recall is a valid choice of evaluation metric when we want to capture as many positives as possible. For example: if we are building a system to predict whether a person has cancer, we want to capture the disease even if we are not very sure. - Caveats: recall is 1 if we predict 1 for all examples. This is what motivates trading off precision against recall with the F1 score. F1 score - a number between 0 and 1 that is the harmonic mean of precision and recall: F1 = 2*((precision*recall)/(precision+recall)). - Take a binary prediction problem: we are predicting whether an asteroid will hit the earth. If we say "No" for the whole training set, we make no positive predictions, so precision is undefined (0/0) and is conventionally reported as 0. The recall of the positive class is zero. The accuracy is more than 99%. The F1 score is therefore also 0, which tells us that the classifier with 99% accuracy is basically worthless for our case. - When to use? When we want a model with both good precision and good recall. - Precision-recall tradeoff: simply stated, the F1 score maintains a balance between the precision and recall of your classifier. If your precision is low, the F1 is low, and if the recall is low, again your F1 score is low.
If you are a police inspector and you want to catch criminals, you want to be sure that the person you catch is a criminal (precision) and you also want to capture as many criminals as possible (recall). The F1 score manages this tradeoff. - Caveats: the main problem with the F1 score is that it gives equal weight to precision and recall. We might sometimes need to include domain knowledge in our evaluation where we want more recall or more precision. To solve this, we can create a weighted F1 metric where β manages the tradeoff between precision and recall: Fβ = (1+β^2)*((precision*recall)/(β^2*precision+recall)). Here we give β times as much importance to recall as precision.
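
The three formulas can be sketched together. In this illustration the function name and labels are made up, and returning 0 when a denominator is zero is a convention chosen here, not something the formulas dictate:

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Return (precision, recall, F-beta) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # convention: 0 when nothing is predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    fbeta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

print(precision_recall_fbeta(y_true, y_pred))            # F1
print(precision_recall_fbeta(y_true, y_pred, beta=2.0))  # F2 weights recall more heavily
print(precision_recall_fbeta(y_true, [0] * 8))           # always predicting "No"
```

Because recall (2/3) is higher than precision (1/2) in this toy data, F2 comes out above F1.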

Python

Programming language that is used in data science. Easy to use and powerful for advanced users by using specialized libraries

R

Python (or Perl), Excel, SQL, graphics (visualization), FTP, basic UNIX commands (sort, grep, head, tail, the pipe and redirect operators, cat, cron jobs, and so on) , ...

import matplotlib.pyplot as plt

Python module useful for graphing data

Scalar

Quantity that has magnitude but no direction in space, such as volume or temperature

Pivot table

Quickly summarizes long lists of data without requiring the writing of formulas or copying cells. Can be arranged dynamically or pivoted

Autocorrelation

Refers to the degree of correlation between observations of the same variable. A common method of testing for autocorrelation is the Durbin-Watson test, which produces a test statistic that ranges from 0 to 4. Values close to 2 (the middle of the range) suggest less autocorrelation, and values closer to 0 or 4 indicate greater positive or negative autocorrelation respectively.

DataFrame['A'].max()

Returns largest value in column A

Total sums of squares (SST)

SSM + SSE: what the model explains + the error you can't explain

Least squares

Smallest sum of the squared distances to the data from the line

DataFrame.groupby()

Splits data into different groups depending on the variable you choose

Root Mean Square Error (RMSE)

Square root of mean squared error. More popular because it gives a number that is easier to understand in the units of the original observations

PCA Principal Component Analysis

Statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

Chi-square test

Statistical test of whether two categorical variables are independent

Decision tree classifiers - Strengths and Weaknesses

Strengths: 1) feature scaling is not a requirement for decision tree algorithms 2) the decision tree can be visualized (using GraphViz). Weakness: 1) we have to be careful, since the deeper the decision tree, the more complex the decision boundary becomes, which can easily result in overfitting. Note: a random forest combines many weak learners (individual trees) into a stronger learner

Hyperplane

Subspace of one dimension less than its ambient space; for 3-D space, a hyperplane is a 2-D plane

Support vector machine

Supervised learning tool that seeks a dividing hyperplane in any number of dimensions; can be used for regression or classification

Linear regression

Technique that looks for a linear relationship between two variables using the line with the least squares

Explain how ROC works

The ROC curve is a graphical representation of the contrast between true positive rates and false-positive rates at various thresholds. It is often used as a proxy for the trade-off between the sensitivity(true positive rate) and false-positive rate.

Mean Absolute Error

The average error of all predicted values when compared with observed values

Mean Squared Error

The average of the squares of all the errors when comparing predicted values with observed values
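
The error metrics in this set (MAE, MSE, and RMSE) differ only in how the individual errors are aggregated. A short sketch with invented observations and predictions:

```python
def error_metrics(observed, predicted):
    """Return (MAE, MSE, RMSE) for paired observed and predicted values."""
    errors = [p - o for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e ** 2 for e in errors) / len(errors)
    rmse = mse ** 0.5  # back in the units of the original observations
    return mae, mse, rmse

mae, mse, rmse = error_metrics(observed=[3, 5, 7, 9], predicted=[2, 5, 9, 9])
print(mae, mse, rmse)
```

Squaring makes the single error of 2 count four times as much as the error of 1, which is why MSE and RMSE are more sensitive to large misses than MAE.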

Difference between ridge and lasso regression?

The key difference between the two is the penalty term. Lasso (L1): Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function. Ridge (L2): Ridge regression adds the "squared magnitude" of the coefficients as a penalty term to the loss function. As a result, Lasso can shrink the coefficients of less important features to exactly zero, removing some features altogether, so it works well for feature selection when we have a huge number of features.
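To see why the L1 penalty can zero out coefficients while the L2 penalty only shrinks them, consider a simplified one-coefficient case. The sketch below assumes each coefficient b minimizes 0.5*(b - b_ols)^2 plus either lam*|b| (lasso) or 0.5*lam*b^2 (ridge), where b_ols is the unpenalized estimate. That setup, the function names, and the numbers are assumptions for illustration, not a general solver.

```python
def lasso_shrink(b_ols, lam):
    """Soft-thresholding: minimizer of 0.5*(b - b_ols)**2 + lam*abs(b)."""
    magnitude = max(abs(b_ols) - lam, 0.0)
    return magnitude if b_ols >= 0 else -magnitude


def ridge_shrink(b_ols, lam):
    """Minimizer of 0.5*(b - b_ols)**2 + 0.5*lam*b**2."""
    return b_ols / (1.0 + lam)


# Hypothetical unpenalized coefficients: one strong feature, two weak ones.
ols_coefficients = [3.0, -0.4, 0.2]
lam = 0.5
lasso = [lasso_shrink(b, lam) for b in ols_coefficients]
ridge = [ridge_shrink(b, lam) for b in ols_coefficients]
print("lasso:", lasso)
print("ridge:", ridge)
```

In this toy setting the two weak coefficients become exactly zero under lasso, while ridge keeps all three nonzero but smaller.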

Log-loss / Binary crossentropy

The loss function used in binary logistic regression. The binary log loss for one example is -(y log(p) + (1-y) log(1-p)), where y is the true label and p is the predicted probability of class 1. The log loss decreases as we become more certain in our prediction of 1 when the true label is 1. When to use: when the output of a classifier is prediction probabilities. Log loss takes into account the uncertainty of your prediction based on how much it varies from the actual label, which gives a more nuanced view of the performance of the model. In general, minimizing log loss gives greater accuracy for the classifier. Caveats: it is sensitive to imbalanced datasets. You might have to introduce class weights to penalize minority errors more, or use it after balancing your dataset.
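The formula can be checked numerically with a short sketch. The function name `log_loss`, the clipping constant `eps` (added here only to avoid log(0)), and the sample predictions are assumptions for the example.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Average of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Two hypothetical classifiers that are both right, with different certainty.
labels = [1, 0]
confident = log_loss(labels, [0.9, 0.1])
unsure = log_loss(labels, [0.6, 0.4])
print(confident, unsure)
```

The more confident correct predictions give the lower loss, which matches the card's point that log loss rewards certainty when the prediction agrees with the label.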

Moving average

The mean of time series data from several consecutive periods; continually updated
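A simple (unweighted) moving average can be sketched as follows. The function name, window size, and data are made up for the example.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values, updated one step at a time."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical series with a 3-period window.
series = [10, 20, 30, 40, 50]
smoothed = moving_average(series, 3)
print(smoothed)
```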

p-value

The probability, assuming the null hypothesis is true, of getting a result at least as extreme as the one observed in your data. A small p-value means the result would be unlikely if chance alone were at work. Can be very sensitive to outliers.

P-value

The probability of the observed result (or a more extreme one) under the assumption of no difference (null hypothesis)

Perceptron

The simplest neural network; it approximates a single neuron with N binary inputs. It computes a weighted sum of the inputs and "fires" if that weighted sum is zero or greater
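The firing rule can be written out directly. This sketch adds a bias term, which the card does not mention, so that the threshold can be moved away from zero; the weights and bias below are hand-picked to make the neuron behave like a logical AND and are not learned.

```python
def perceptron_fires(inputs, weights, bias):
    """Return 1 if the weighted sum (plus bias) is zero or greater, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum >= 0 else 0


# Hypothetical AND gate: fires only when both binary inputs are 1.
weights = [1.0, 1.0]
bias = -2.0
truth_table = {
    (a, b): perceptron_fires([a, b], weights, bias)
    for a in (0, 1)
    for b in (0, 1)
}
print(truth_table)
```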

Standard deviation

The square root of the variance; a common way to indicate how different a particular measurement is from the mean

Collaborative filtering

The term collaborative filtering was first used by David Goldberg at Xerox PARC in 1992 in a paper called 'Using collaborative filtering to weave an information tapestry.' He designed a system called Tapestry that allowed people to annotate documents as either interesting or uninteresting and used this information to filter documents for other people. There are now hundreds of web sites that employ some sort of collaborative filtering algorithm for movies, music, books, dating, shopping, other web sites, podcasts, articles, and even jokes.

Data Mining

The use of computers to analyze large data sets to look for patterns that let people make business decisions

Machine learning

The use of data-driven algorithms that perform better as they have more data to work with; generally uses cross-validation

Econometrics

The use of mathematical and statistical methods in the field of economics to verify and develop economic theories

Dependent variable

A variable whose value depends on the value of the independent variable

Spatiotemporal data

Time series data that also includes geographic identifiers such as latitude-longitude pairs

Standardized score

Raw score transformed into units of standard deviation above or below the mean

Matrix

Two dimensional array of values arranged in rows and columns

How to interpret XGBoost results?

Use SHAP values to measure the importance of features. Since we now have individualized explanations for every person, we can do more than just make a bar chart: we can plot the feature importance for every customer in our data set, which will also show outlier effects. We can also plot the SHAP values for an individual variable such as age while coloring the dots by something else, for example years of education.

Decision trees

Uses a tree structure to represent a number of possible decision paths and an outcome for each path

Dimension reduction

Using PCA or a similar method to find the smallest subset of dimensions that captures the most variation

Parametric

Using parametric models, we estimate parameters from the training dataset to learn a function that can classify new data points without requiring the original training dataset anymore. Typical examples of parametric models are the perceptron, logistic regression, and the linear SVM.

Kernel trick

To solve a nonlinear problem using an SVM, we transform the training data onto a higher dimensional feature space via a mapping function phi and train a linear SVM model to classify the data in this new feature space. Then we can use the same mapping function phi to transform new, unseen data to classify it using the linear SVM model. However, one problem with this mapping approach is that the construction of the new features is computationally very expensive, especially if we are dealing with high-dimensional data. This is where the so-called kernel trick comes into play. Although we didn't go into much detail about how to solve the quadratic programming task to train an SVM, in practice all we need is to replace the dot product x_i^T x_j by phi(x_i)^T phi(x_j). In order to save the expensive step of calculating this dot product between two points explicitly, we define a so-called kernel function: k(x_i, x_j) = phi(x_i)^T phi(x_j). One of the most widely used kernels is the Radial Basis Function kernel (RBF kernel) or Gaussian kernel: k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)). The trick is to choose a transformation so that the kernel can be computed without actually computing the transformation: the dot-product function is replaced with a new function that returns what the dot product would have been if the data had first been transformed to a higher dimensional space. This is usually done using the radial basis function.
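A small numeric check can make the idea concrete. The sketch below uses a degree-2 polynomial kernel on 2-D points, chosen here because its mapping function can be written out explicitly; it is an illustrative assumption, not the kernel the card focuses on. It also includes a Gaussian (RBF) kernel in a common parameterization with width `sigma`. All function names and points are made up for the example.

```python
import math


def dot(a, b):
    """Ordinary dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit mapping of a 2-D point into a 3-D feature space."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """Degree-2 polynomial kernel: equals dot(phi(x), phi(z)) without building phi."""
    return dot(x, z) ** 2


def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Hypothetical points.
x = (1.0, 2.0)
z = (3.0, 4.0)
explicit = dot(phi(x), phi(z))  # transform first, then take the dot product
via_kernel = poly_kernel(x, z)  # same value, computed in the original space
print(explicit, via_kernel, rbf_kernel(x, z))
```

The two printed values for the polynomial kernel agree up to floating-point rounding, which is the point of the trick: the kernel gives the higher dimensional dot product while only ever touching the original 2-D inputs.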

Three V's

Volume, Velocity, Variety

Diagnosing bias and variance problems with learning curves

High bias (underfitting): the model has both low training and cross-validation accuracy, which indicates that it underfits the training data. Common ways to address this issue are to 1) increase the number of parameters of the model, for example, by collecting or constructing additional features, or 2) decrease the degree of regularization, for example, in SVM or logistic regression classifiers. High variance (overfitting): indicated by a large gap between the training and cross-validation accuracy. To address this problem of overfitting, 1) we can collect more training data or reduce the complexity of the model, for example, by increasing the regularization parameter; 2) for unregularized models, it can also help to decrease the number of features via feature selection or feature extraction. Collecting more training data decreases the chance of overfitting, but it may not always help, for example, when the training data is extremely noisy or the model is already very close to optimal.

quadratic

a polynomial with a degree of 2

describing data

center/location: mean, median, mode. spread (variation): range, interquartile range (25th-75th percentile), variance, std deviation (dispersion expressed in the same units as the data; sq root of variance). shape. anomalous observations.

Cook's D

change in parameter estimates when the ith observation is deleted from the analysis

Collinearity

high correlation among features. Regularization is a useful way to handle collinearity, filter out noise from data, and eventually prevent overfitting

comparing two t-tests assumptions

independent observations; normally distributed data for each group (check with a QQ plot); equal variances for each group (not an assumption, but a check to see which test to use)

Interpretation for logistic regression coefficient for Intercept: -1.47085 Female: 0.59278 math= 0.15634

intercept = -1.47085, which corresponds to the log odds for males with a math score of zero being in an honors class (since male is the reference group, female=0). The coefficient for female = 0.59278, which corresponds to the log of the odds ratio between the female group and the male group. The odds ratio equals 1.81, which means the odds for females are about 81% higher than the odds for males. The coefficient for math = 0.15634, which is interpreted as the expected change in log odds for a one-unit increase in the math score. The odds ratio can be calculated by exponentiating this value to get 1.16922, which means we expect to see about a 17% increase in the odds of being in an honors class for a one-unit increase in math score
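The odds ratios quoted in this card come from exponentiating the coefficients. A quick way to reproduce them, using only the three coefficients given on the card (the dictionary layout is just for the example):

```python
import math

# Coefficients as given on the card (log-odds scale).
coefficients = {"intercept": -1.47085, "female": 0.59278, "math": 0.15634}

# Exponentiating a log-odds coefficient gives an odds ratio
# (or, for the intercept, the baseline odds for the reference group).
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: exp({coefficients[name]}) = {ratio:.5f}")
```

An odds ratio of about 1.81 for female reads as odds roughly 81% higher than for males, and about 1.17 for math reads as roughly a 17% increase in odds per one-unit increase in score.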

General Linear Model

is a generalization of multiple linear regression to the case of more than one dependent variable.

global f test multiple linear regression

is something in this model useful? Tests whether at least one predictor has a nonzero coefficient

Why would accuracy be a poor metric for measuring the strength of a model?

with imbalanced classes, a model may misclassify all rare events (say 1% of cases) as non-events and still have a 99% accuracy rate

ANOVA assumptions

normality of errors, homogeneity of variance(all groups have equal variance), independence of errors

Accounting for lack of normality

outliers -> robust regression. relationship between variables -> transform; can try box cox transformation

Random variables

probability, mean, variance, percentiles, experimental design, cross-validation, goodness of fit, and robust statistics , ...

Studentized residuals

residual scaled by its std error; good for detecting outliers; shows how many std deviations away from the mean a residual is

Sensitivity vs. Specificity

sensitivity (true positive rate): how well a test identifies truly ill people (positive). specificity (true negative rate): how well a test identifies truly well people (negative)
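Both rates follow directly from the four confusion-matrix counts. The sketch below uses made-up counts for a screening test that labels every one of 1,000 people as well when only 10 are truly ill, which also shows how accuracy alone can look good for rare events. Function names are chosen for the example.

```python
def sensitivity(tp, fn):
    """True positive rate: share of truly ill people the test flags."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of truly well people the test clears."""
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    """Share of all people classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical test that predicts "well" for everyone: 10 ill, 990 well.
tp, fn, tn, fp = 0, 10, 990, 0
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
acc = accuracy(tp, tn, fp, fn)
print(sens, spec, acc)
```

Here accuracy is 99% and specificity is perfect, yet sensitivity is zero because no ill person is ever identified.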

right skewed

tail on right side(skew is above 0)

Load data

with packages like RODBC or RMySQL. Manipulate data, with packages like stringr or lubridate. Visualize data, with packages like ggplot2 or leaflet. Model data, with packages like Random Forest or survival. Report results, with packages like shiny or markdown , ...

Simple Linear Regression Equation & Assumptions

y = B0 + B1X + e. Assumptions: 1) residuals normal 2) homoskedasticity: errors have equal variance/no pattern 3) independence 4) no perfect multicollinearity 5) linearity of mean

Polynomial Model with Cross Product Term

y = B0 + B1X1 + B2X2 + B3X1X2 + e

Quadratic Polynomial Model Fix

y = B0+B1X1+B2X1^2+e

