Data Science Interview Questions
t-value
Number of standard errors the estimate lies away from the hypothesized value
Detecting lack of normality / Accounting for lack of normality
Detecting: histogram of residuals; normal probability plot of residuals (QQ plot); formal test. Accounting for: outliers -> robust regression; relationship between variables -> transform; can try a Box-Cox transformation
DFBETAS
Measures the change in the jth parameter estimate when the ith observation is deleted; shows how the betas change; one DFBETA per parameter per observation; indicates an observation's influence on each coefficient
BoxCox
Helps identify a transformation that would make your residuals approximately normally distributed
Detecting heteroskedasticity / Accounting for heteroskedasticity
Detecting: residual plots for patterns. Accounting for: WLS (Weighted Least Squares); transform the data
Adaptable Systems
...
CRF (conditional random fields)
...
Categorical data
...
Convolutional net
...
Distributed Systems.
...
From Mud to Structure.
...
Interactive Systems
...
Ordinal features
...
Random-restart hill climbing
...
Stochastic gradient descent
...
The RM4Es (Research Methods Four Elements) is a good framework for summarizing machine learning components and processes. The RM4Es include: Equation: equations are used to represent the models for our research. Estimation: estimation is the link between equations (models) and the data used for our research. Evaluation: evaluation needs to be performed to assess the fit between models and the data. Explanation: explanation is the link between equations (models) and our research purposes. How we explain our research results often depends on our research purposes and also on the subject we are studying.
...
Vector
...
Vector space
...
What make big datasets impractical
...
structure prediction
...
structured SVMs
...
structured perceptron
...
term frequency-inverse document frequency
...
value means more similar."
...
empirical rule
Within 1 stdev - 68%; within 2 stdev - 95%; within 3 stdev - 99.7% (for a normal distribution)
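The coverage figures of the empirical rule can be checked numerically. The sketch below assumes a normal distribution and uses the error function from Python's standard library; the helper name `normal_coverage` is made up for illustration.

```python
import math

def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    # Prints roughly 0.68, 0.95 and 0.997 for k = 1, 2, 3
    print(f"within {k} stdev: {normal_coverage(k):.4f}")
```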
confidence interval
A 95% confidence interval is a range of values produced by a procedure that captures the true population mean in 95% of repeated samples. It is *not* the probability that the mean is in one particular interval; rather, *the method has a 95% chance of creating an interval that covers the mean*
S curve
A pattern in which something is adopted slowly at first, then gains popularity quickly, and finally levels off
Serial correlation
A pattern where values in a series are correlated with earlier values of the same series. You can shift a time series by an interval called a lag and then compute the correlation of the shifted and original series
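A minimal sketch of the shift-and-correlate idea, in plain Python. The function name `lag_correlation` and the two toy series are assumptions for illustration; the lag must be at least 1.

```python
def lag_correlation(series, lag=1):
    """Pearson correlation between a series and itself shifted by `lag` (lag >= 1)."""
    x = series[:-lag]   # original values
    y = series[lag:]    # values `lag` steps later
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

trend = [1, 2, 3, 4, 5, 6, 7, 8]             # steadily rising: strong positive serial correlation
alternating = [1, -1, 1, -1, 1, -1, 1, -1]   # flips sign each step: strong negative serial correlation
print(lag_correlation(trend), lag_correlation(alternating))
```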
Pandas
A python library for data manipulation
Logarithm
A quantity representing the power to which a fixed number (the base) must be raised to produce a given number
Ruby
A scripting language that can be used for data science not as popular as Python
Data wrangling
AKA data munging; the conversion of data, typically using scripting languages, into a form that is easy to work with
ANOVA vs Regression
ANOVA - looks at differences of means; if there were no difference between the group means, the categorical variable would have no predictive power. Regression - asks whether a variable has any impact on, or predictive power for, the variable of interest
DataFrame['A'].sum()
Adds up all values in column A
Adjusted R^2
Adjusts for the loss of degrees of freedom when additional independent variables are added to a regression. no interpretation
Centroid
Center of a cluster
Unsupervised learning
Class of machine learning algorithms designed to identify groupings of data without knowing in advance what the groups will be
Regularization
Common ways to reduce overfitting: collect more training data; introduce a penalty for complexity via regularization; choose a simpler model with fewer parameters; reduce the dimensionality of the data. L1 regularization can be understood as a technique for feature selection.
ANOVA
Compares mean values of a continuous variable across multiple categories/groups. Analysis of variance; the null hypothesis assumes the group means are equal. Can be fit with a GLM
D3
Data Driven Documents a JavaScript library that eases the creation of interactive visualizations embedded in web pages
Type II error (beta)
False negative result. Ex: failing to reject the null hypothesis when you should reject it (telling a pregnant woman she's not pregnant)
Regression
Fitting a model to data
Unstructured Information Management Architecture (UIMA)
Framework developed at IBM to analyze unstructured information especially natural language
GATE
General Architecture for Text Engineering; open source java-based framework for natural language processing tasks
Variance
How much a list of numbers varies from the average; computed as the average of the squared differences of each number from the mean
Bias
In machine learning when a learner consistently learns the same thing wrong
Dunnett's Test
Makes k-1 comparisons (k being the total number of groups), each group tested against the control (group 1); checks whether each group mean falls outside the adjusted interval around the control
Instance-based learning
KNN belongs to a subcategory of nonparametric models that is described as instance-based learning. Models based on instance-based learning are characterized by memorizing the training dataset, and lazy learning is a special case of instance-based learning that is associated with no (zero) cost during the learning process.
L2 regularization
L2 regularization (sometimes also called L2 shrinkage or weight decay) adds the sum of the squared weights to the cost function as a penalty
Probability distribution
Listing of all possible distinct outcomes and their probabilities of occurring sum is equal to 1
common ways for evaluation of model
MAPE (mean absolute percentage error) - ex: off by 3% on average. MAE (mean absolute error) - ex: off by 100 units on average
K-nearest neighbors
Machine learning algorithm that classifies things based on their similarity to nearby neighbors. Pick the number of neighbors K
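A small sketch of the idea, assuming Euclidean distance and a simple majority vote. The function `knn_predict` and the toy points are hypothetical, not a library API.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (2, 2)))  # query sits next to the "a" cluster
print(knn_predict(points, labels, (7, 7)))  # query sits next to the "b" cluster
```

K is a tuning choice: a small K follows local noise, a large K smooths over class boundaries.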
Correlation coefficient
Measure of how closely two variables correlate. Ranges from -1 to 1
Prior distribution
Models the many plausible values of the unknown quantity to be estimated in Bayesian inference
Nonparametric
Nonparametric models can't be characterized by a fixed set of parameters, and the number of parameters grows with the training data. Two examples of nonparametric models that we have seen so far are the decision tree classifier/random forest and the kernel SVM.
Gradient descent
Optimization algorithm for finding the input to a function that produces the optimal value; iterative
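The iterative update can be sketched for a one-dimensional function. The example function f(x) = (x - 3)^2, the learning rate, and the step count are assumptions chosen for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```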
PCA
PCA attempts to find the orthogonal component axes of maximum variance in a dataset. Kernel principal component analysis
Layers
Pipes and Filters, Blackboard, Broker, Model-View-Controller, Presentation-Abstraction-Control, Microkernel, and Reflection , ...
What does regular regularization do?
Regularization basically adds the penalty as model complexity increases. Regularization parameter (lambda) penalizes all the parameters except intercept so that model generalizes the data and won't overfit.
what comparison to use tests
Response \ Predictors | Categorical | Continuous | Continuous + Categorical
Continuous | ANOVA | OLS | OLS
Categorical | Log Reg | Log Reg | Log Reg
Response = target variable; Predictors (explanatory) = input variables
DataFrame['A'].mean()
Returns average of values in column A
DataFrame.head(n = 5)
Returns first n rows of a dataframe
Selection bias
Selection bias is a kind of error that occurs when the researcher decides who is going to be studied. The sample is not representative of the whole population.
Cross-validation
Set of techniques that divide up data into training sets and test sets usually 80-20. Training sets are given the correct categorization and an algorithm is created
SQL
The ISO standard query language for relational databases
Data science
The ability to extract knowledge and insights from large and complex data sets
Artificial intelligence
The ability to have machines act with apparent intelligence. Can be through symbolic logic or statistical analysis
Predictive analytics
The analysis of data to predict future events usually to aid in business planning
N-gram
The analysis of sequences of N items; usually words in natural language
Correlation
The degree of relative correspondence between two variables
Predictive modeling
The development of statistical models to predict future events
Central Limit Theorem
The distribution of sample means is approximately normally distributed, even if the population they are drawn from is not normally distributed, given that the sample size is large enough.
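A quick simulation can illustrate the theorem. This sketch assumes an exponential population (mean 1, standard deviation 1), which is clearly not normal; the sample size of 50 and the 2000 repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(draw, sample_size, n_samples):
    """Draw n_samples samples of the given size and return each sample's mean."""
    return [
        statistics.mean([draw() for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

means = sample_means(lambda: random.expovariate(1.0), sample_size=50, n_samples=2000)

# The means should centre near the population mean (1.0) with a spread
# near sigma / sqrt(n) = 1 / sqrt(50), about 0.14
print(statistics.mean(means), statistics.stdev(means))
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the underlying draws are strongly skewed.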
Classification
The identification of two or more discrete categories for items classic machine learning task. Spam or ham. Movie genres. Supervised learning.
Monte Carlo method
The use of randomly generated numbers as part of an algorithm
Classification error
This is a useful criterion for pruning but not recommended for growing a decision tree, since it is less sensitive to changes in the class probabilities of the nodes.
Objective function
Used to find the optimal result of an objective; used to solve an optimization problem
Feature engineering
Using feature to come up with a good model through iteration
Latent variable
Variables that are not directly observed but inferred from other variables that are observed
T-distribution
Variation of normal distribution that accounts for the fact that you're only using a sample of values not all of them
Three Classes Of Metrics: Centrality
Volatility, Bumpiness. , ...
What is p-value?
When you perform a hypothesis test in statistics, a p-value helps you determine the strength of your results. A p-value is a number between 0 and 1. The claim that is on trial is called the null hypothesis. A low p-value (≤ 0.05) indicates strong evidence against the null hypothesis, which means we can reject the null hypothesis. A high p-value (> 0.05) indicates weak evidence against the null hypothesis, which means we fail to reject it (this is not the same as proving it true). A p-value very close to 0.05 is marginal and could go either way. To put it another way, high p-values: your data are likely with a true null. Low p-values: your data are unlikely with a true null.
Big data
Working with large datasets that usually require distributed storage
Variance Inflation Factor (VIF)
a method of detecting the severity of multicollinearity by looking at the extent to which a given explanatory variable can be explained by all the other explanatory variables in the equation
Six Sigma
approximate solutions, the 80/20 rule, cross-validation, design of experiments, modern pattern recognition, lift metrics, third-party data, Monte Carlo simulations, or the life cycle of data science projects , ...
Type 1 Sums of squares
builds the model sequentially, variable by variable, and tells the importance of each variable as it is added
ordinal variables
categorical ordered variables
Cook's D
measures the change in the parameter estimates when the ith observation is deleted
Type 3 sums of squares
default how important is this var if it's the last one added to the model
Mallow's Cp
effective variable selection in model that aims to choose the smallest amount of variables
association
expected value(mean) of one variable changes across levels of another variable
Type I error (alpha)
false positive. Ex: rejecting the null hypothesis when it is actually true (telling a man he's pregnant)
cyclical
happening again and again in the same order; happening in cycles
leptokurtic
heavy tail larger amount of data in tails
Examining distributions
histogram - check for symmetry; QQ plots - if the points fall along the line, the data are approximately normal; box plots
excess kurtosis
kurtosis-3
Standard Error of Mean
measures of the estimated variability of sample means
Disadvantages of multiple linear regression
many options and choices; difficult to pick the best model and to interpret models
outlier vs influential
outlier - does not follow the pattern of the data. influential - follows a pattern but lies on the outskirts and has high influence on the fitted line
ANOVA parts
predicted value - group mean residuals - difference between observations and predicted
Fvalue
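ratio of the variance explained by the model (between groups) to the unexplained variance (within groups), i.e. mean square model / mean square error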
left skewed
tail on left side(skew is below 0)
AI or machine learning
the main conferences are NIPS and ICML, and also conferences like AI Stats, UAI, and KDD, which is more data science-oriented , ...
R^2
the proportion (percent) of the variation in the values of y that can be accounted for by the least squares regression line ie how much variance explained
platykurtic
thin tail smaller amount of data in tails
Correlograms
trends, change point, normalization and periodicity , ...
Nominal
unordered categorical variables
Example ways to explore data (Exploratory Data Analysis (EDA))
variables, distributions, associations, anomalies
linear association
when the points in a scatter plot seem to form a line
ICLR
which stands for the International Conference on Learning Representation , ...
Cubic Polynomial Fix
y = B0+B1X1+B2X1^2+B3X1^3+e
Clustering
Unsupervised learning technique for dividing data into groups based on an algorithm
correlation
A measure of the extent to which two factors vary together, and thus of how well either factor predicts the other. The strength of linear relationship
Covariance
A measure of the relationship between two variables whose values are observed at the same time
Overfitting
A model that is too tied to a training set and will not perform well on test data
Deep learning
A multi-level algorithm that gradually identifies things at higher levels of abstraction
DFFITS
- measures impact ith observation has on predicted value - see how much predicted value changes when observation are taken out
Decision tree problems
- Overfitting: caused by the presence of noise and a lack of representative instances. - Bias error: happens when you place too many restrictions on target functions (ex: only true/false splits). - Variance error: how much a result will change based on changes to the training set
Spearman Rank
Ranges from -1 to 1; rank correlation between residuals and predicted values; close to 0 = homoskedasticity = good
Reinforcement learning
A class of machine learning algorithms that are not given the correct answers in advance but continuously receive feedback on whether they are doing well or not
SAS
A commercial statistical software suite that includes a programming language
Poisson distribution
A distribution of independent events used to predict the probability of an event occurring in a set time or place
Binomial distribution
A distribution of independent events with two mutually exclusive possible outcomes a fixed number of trials and a constant probability of success. Discrete probability distribution. Graphed using histograms.
Naive Bayes classifier
A family of algorithms that consider every feature as independent of any other feature
Histogram
A graphical representation of the distribution of a set of numeric data usually a vertical bar graph
What are kernels?
A kernel is a similarity function. It is a function that you, as the domain expert, provide to a machine learning algorithm. It takes two inputs and spits out how similar they are. Kernels offer an alternative. Instead of defining a slew of features, you define a single kernel function to compute similarity between images. You provide this kernel, together with the images and labels to the learning algorithm, and out comes a classifier.
Linear/General Linear Model
A linear function representing real-world phenomena. The model also represents patterns found in graphs and/or data.
Feature
A machine learning expression for a piece of measurable information
Standard normal distribution
A normal distribution with a mean of 0 and a standard deviation of 1
Coefficient
A number or algebraic symbol prefixed as a multiplier to a variable or unknown quantity (slope in line equation)
Data structure
A particular arrangement of units of data such as an array or a tree
Gaussian distribution
A probability distribution that when graphed is a symmetrical bell curve with the mean at the center
Confidence interval
A range specified around an estimate to indicate margin of error combined with a probability that a value will fall in that range
Neural network
A robust function that takes an arbitrary set of inputs and fits it to an arbitrary set of outputs that are binary; unique because of hidden layer of weighted functions
Time series data
A sequence of measurements of some quantity taken at different times often at equally spaced intervals
Algorithm
A series of repeatable steps for carrying out a certain type of task with data
Data engineer
A specialist in data wrangling they build infrastructure for real tangible analysis. Run ETL
Tukey Test
A statistical test to measure the difference between several means and tell the user which ones are statistically different from the rest. Makes several pairwise comparisons; creates a diffogram
Durbin-Watson test
A test to determine whether first-order Autocorrelation is present If the DW Stat differs sufficiently from 2.00 then have serial correlation. If DW Stat < 2 then have positive correlation. If DW Stat > 2 then have negative correlation.
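The statistic itself is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal sketch, with made-up residual series:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences / sum of squares."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

persistent = [1, 1, 1, -1, -1, -1]   # neighbours tend to agree: positive autocorrelation
flipping = [1, -1, 1, -1, 1, -1]     # neighbours alternate: negative autocorrelation
print(durbin_watson(persistent))  # below 2
print(durbin_watson(flipping))    # above 2
```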
Supervised learning
A type of machine learning algorithm in which a system is taught to classify input into specific known classes
Discrete variable
A variable whose potential value must be one of a specific number of values
Continuous variable
A variable whose value can be any of infinite values
Computational linguistics
Also called natural language processing (NLP) converting text of spoken languages into structured data to extract valuable information
Backpropagation
An algorithm for iteratively adjusting the weights used in a neural network system. Often used to implement gradient descent.
PageRank
An algorithm that determines the importance of something typically to rank it in a list of search results
Random forest
An algorithm used for regression or classification that uses a collection of tree data structures trees "vote" on the best model
Bayes' Theorem
An equation for calculating the probability that something is true if something potentially related is true. P(A|B) = P(B|A) * P(A) / P(B)
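A worked example may help. The numbers below (1% prevalence, 95% sensitivity, 5% false positive rate) are invented for illustration, and P(B) is expanded with the law of total probability.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(A|B) = P(B|A) * P(A) / P(B), with P(B) = P(B|A)P(A) + P(B|not A)P(not A)."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# A = has the condition, B = test is positive
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # about 0.161: most positives are false when the condition is rare
```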
Perl
An older scripting language with roots in pre-unix systems. Popular for text processing like data cleanup and enhancement
Angular JS
An open-source javascript library maintained by google and the community. Lets you create single web page applications to display results
R
An open-source programming language and environment for statistical computing and graph generation
AUC
Area under the ROC curve AUC ROC indicates how well the probabilities from the positive classes are separated from the negative classes When to Use? AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values. So, for example, if you as a marketer want to find a list of users who will respond to a marketing campaign. AUC is a good metric to use since the predictions ranked by probability is the order in which you will create a list of users to send the marketing campaign. Another benefit of using AUC is that it is classification-threshold-invariant like log loss. It measures the quality of the model's predictions irrespective of what classification threshold is chosen, unlike F1 score or accuracy which depend on the choice of threshold.
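One way to see why AUC depends only on ranking: it equals the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below uses that pairwise definition on made-up labels and scores.

```python
def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))                     # 3 of 4 pairs ranked correctly
print(auc(labels, [s * 10 for s in scores]))   # rescaling scores leaves AUC unchanged
```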
Correlation and Covariance
Both Correlation and Covariance establish the relationship and also measure the dependency between two random variables. Correlation: Correlation is considered or described as the best technique for measuring and also for estimating the quantitative relationship between two variables. Correlation measures how strongly two variables are related. Covariance: In covariance two items vary together and it's a measure that indicates the extent to which two random variables change in cycle. It is a statistical term; it explains the systematic relation between a pair of random variables, wherein changes in one variable reciprocal by a corresponding change in another variable.
Bias-variance tradeoff
Both model prediction errors Bias Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. Model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data. Variance Variance is the variability of model prediction for a given data point or a value which tells us spread of our data. Model with high variance pays a lot of attention to training data and does not generalize on the data which it hasn't seen before. As a result, such models perform very well on training data but has high error rates on test data. If our model is too simple and has very few parameters then it may have high bias and low variance. On the other hand if our model has large number of parameters then it's going to have high variance and low bias. So we need to find the right/good balance without overfitting and underfitting the data.
CSV
Comma separated values common data file type
Lift
Compares the frequency of an observed pattern with how often you'd expect to see that pattern by chance near 1 is chance
Cost function
Cost function is the key to solving any problem using optimization
DataFrame['A'].count()
Counts the number of non-null values in column A (equals the number of rows when no values are missing)
K means clustering
Data mining algorithm that clusters data into K groups
Quartile
Data set divided into 4 groups 25% of data in each
Difference between decision, random forest, and xgboost (simple explanation)
Decision tree - a single, simple decision-making diagram. Random Forest - a large number of trees, combined (using averages or majority rule) at the end of the process. XGBoost - also combines decision trees, but starts combining them at the beginning, building trees sequentially so that each new tree corrects the errors of the previous ones
Accuracy
Execution time, memory usage, throughput, tuning, and adaptability
ETL
Extract, Transform, Load
Stepwise Selection Models
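Forward - start with the most important variable and add. Backward - start with all variables and remove. Stepwise - start with the most important variable; variables can be added or removed at each step. Caveats: p-values are not true (significance is overestimated); incorrect DFs; biases in parameter estimates, predictions, and std errors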
What is PCA and give example of how it's used
Goal is to reduce dimensions in a dataset due to excessive amount of them - look for uncorrelated factors In short, PCA reduces the data set in dimensionality by finding directions through the data set where the variation is greatest, where each direction is referred to as a principal component. PCA provides new set of dimensions (they are ordered by highest variance) these are called loadings and scores This helps gather the highest amount of variance from a dataset with the least amount of variables through linear combinations Example could be in medical gene analysis when the amount of columns largely out number rows Could also be used to solve multicollinearity
How would you compare different logistic regression models?
Goodness of fit - likelihood ratio test (compares nested models to see whether the larger, more complex model is a statistically significant improvement); AIC, BIC, concordance. Validation techniques - ROC curve; AUROC = area under the ROC curve, a metric that typically ranges from 0.50 to 1.00, where values above 0.80 indicate that the model does a good job of discriminating between the two categories; classification accuracy
Bayesian network
Graphs that compactly represent the relationship between random variables for a given problem
Stratified sampling
Population is divided into homogeneous groups called strata
K-Nearest Neighbor vs K-means Clustering
K-Nearest Neighbor(KNN) - supervised - classification or regression model K-NN is a classification or regression machine learning algorithm while K-means is a clustering machine learning algorithm. K-Means - unsupervised - clustering
How would you go about making sure that all of the assumptions for linear regression have been met?
Linearity - observed vs predicted plot: points should be symmetrically distributed around the diagonal line; residuals vs predicted plot: points should be scattered around a horizontal line with constant variance. Independence - residual time series plot. Normality - normal probability plot; normal quantile plot of residuals. Equal/constant variance (homoscedasticity) - residuals vs predicted values. No perfect multicollinearity - VIF
Parametric versus nonparametric models
Machine learning algorithms can be grouped into parametric and nonparametric models. Using parametric models, we estimate parameters from the training dataset to learn a function that can classify new data points without requiring the original training dataset anymore. Typical examples of parametric models are the perceptron, logistic regression, and the linear SVM. In contrast, nonparametric models can't be characterized by a fixed set of parameters, and the number of parameters grows with the training data. Two examples of nonparametric models that we have seen so far are the decision tree classifier/random forest and the kernel SVM. KNN belongs to a subcategory of nonparametric models that is described as instance-based learning. Models based on instance-based learning are characterized by memorizing the training dataset, and lazy learning is a special case of instance-based learning that is associated with no (zero) cost during the learning process.
Gradient boosting
Machine learning technique for regression and classification. Produces a prediction model in the form of an ensemble of weak prediction models typically decision trees; stage-wise
DataFrame.boxplot()
Make a boxplot using matplotlib
DataFrame.hist()
Make a histogram using matplotlib
Linear algebra
Math that deals with vector spaces and operations on them such as addition and subtraction
Kurtosis
Measure of the fatness of the tails of a probability distribution relative to that of a normal distribution. Indicates likelihood of extreme outcomes. (A normal distribution has kurtosis = 3)
Other dimension reduction techniques
Missing value ratio - drop variables with too many missing values. Low variance filter - drop constant variables. High correlation filter - drop variables that are highly correlated with others. Random forest - find the most important variables. Backward feature elimination and forward feature elimination - both take a large amount of computation time. Factor analysis - for highly correlated variables. PCA -
Logistic regression
Model where the dependent variable is categorical. Estimates the probability of a relationship between a categorical variable and one or more independent variables
What do you understand about precision and recall and F1score.
Precision - proportion of predicted positives that are truly positive TP/(TP+FP) - When to use? Precision is a valid choice of evaluation metric when we want to be very sure of our prediction. For example: If we are building a system to predict if we should decrease the credit limit on a particular account, we want to be very sure about our prediction or it may result in customer dissatisfaction. - Caveats Being very precise means our model will leave a lot of credit defaulters untouched and hence lose money. Recall - what proportion of actual positives are correctly classified TP/(TP+FN) - When to use? Recall is a valid choice of evaluation metric when we want to capture as many positives as possible. For example: If we are building a system to predict if a person has cancer or not, we want to capture the disease even if we are not very sure. - Caveats Recall is 1 if we predict 1 for all examples. And thus comes the idea of utilizing tradeoff of precision vs. recall — F1 Score. F1score - The F1 score is a number between 0 and 1 and is the harmonic mean of precision and recall. F1=2*((precision*recall)/(precision+recall)) - Let us start with a binary prediction problem. We are predicting if an asteroid will hit the earth or not. So if we say "No" for the whole training set. Our precision here is 0. What is the recall of our positive class? It is zero. What is the accuracy? It is more than 99%. And hence the F1 score is also 0. And thus we get to know that the classifier that has an accuracy of 99% is basically worthless for our case. And hence it solves our problem. - When to use? We want to have a model with both good precision and recall. - Precision-Recall Tradeoff Simply stated the F1 score sort of maintains a balance between the precision and recall for your classifier. If your precision is low, the F1 is low and if the recall is low again your F1 score is low. 
If you are a police inspector and you want to catch criminals, you want to be sure that the person you catch is a criminal (precision) and you also want to capture as many criminals as possible (recall). The F1 score manages this tradeoff. - Caveats: The main problem with the F1 score is that it gives equal weight to precision and recall. We might sometimes need to include domain knowledge in our evaluation where we want more recall or more precision. To handle this, we can create a weighted F-beta metric, where beta manages the tradeoff between precision and recall: F_beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall). Here we give beta times as much importance to recall as precision.
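The precision, recall, and F-beta formulas can be sketched directly from confusion-matrix counts. The function name and the counts below are made up for illustration; returning 0 when a denominator is zero is a convention assumed here, not something the formulas dictate.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

print(precision_recall_fbeta(tp=8, fp=2, fn=8))          # precision 0.8, recall 0.5, F1 about 0.615
print(precision_recall_fbeta(tp=8, fp=2, fn=8, beta=2))  # beta = 2 pulls the score toward recall
print(precision_recall_fbeta(tp=0, fp=0, fn=1))          # "always say No" classifier scores 0
```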
Python
Programming language that is used in data science. Easy to use and powerful for advanced users by using specialized libraries
R
Python (or Perl), Excel, SQL, graphics (visualization), FTP, basic UNIX commands (sort, grep, head, tail, the pipe and redirect operators, cat, cron jobs, and so on) , ...
import matplotlib.pyplot as plt
Python module useful for graphing data
Scalar
Quantity that has magnitude but No direction in space such as volume or temperature
Pivot table
Quickly summarizes long lists of data without requiring the writing of formulas or copying cells. Can be arranged dynamically or pivoted
Autocorrelation
Refers to the degree of correlation between observations of the same variable at different times. A common method of testing for autocorrelation is the Durbin-Watson test. The Durbin-Watson test produces a test statistic that ranges from 0 to 4. Values close to 2 (the middle of the range) suggest less autocorrelation, and values closer to 0 or 4 indicate greater positive or negative autocorrelation respectively.
DataFrame['A'].max()
Returns largest value in column A
Total sums of squares (SST)
SSM+SSE what model explains + error you can't explain
Least squares
Smallest sum of the squared distances to the data from the line
DataFrame.groupby()
Splits data into different groups depending on the variable you choose
Root Mean Square Error (RMSE)
Square root of mean squared error. More popular because it gives a number that is easier to understand in the units of the original observations
PCA Principal Component Analysis
Statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
Chi-square test
Statistical test of whether two categorical variables are independent
Decision tree classifiers - Strengths and Weaknesses
Strengths: 1) feature scaling is not a requirement for decision tree algorithms 2) the decision tree can be visualized (using GraphViz). Weakness: 1) we have to be careful since the deeper the decision tree, the more complex the decision boundary becomes, which can easily result in overfitting. Note: a random forest combines many weak learners into a stronger learner
Hyperplane
A subspace of one dimension less than its ambient space; for 3-D space, a 2-D plane
Support vector machine
Supervised learning tool that seeks a dividing hyperplane in any number of dimensions; can be used for regression or classification
Linear regression
Technique that looks for a linear relationship between two variables using the line with the least squares
Explain how ROC works
The ROC curve is a graphical representation of the contrast between true positive rates and false-positive rates at various thresholds. It is often used as a proxy for the trade-off between the sensitivity(true positive rate) and false-positive rate.
Mean Absolute Error
The average absolute error of all predicted values when compared with observed values
Mean Squared Error
The average of the squares of all the errors when comparing predicted values with observed values
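The error metrics in this set (MAE, MSE, RMSE) can be compared side by side. A minimal sketch with made-up observed and predicted values:

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, in the units of the original observations."""
    return math.sqrt(mse(actual, predicted))

actual = [3, 5, 8]
predicted = [2, 5, 11]
print(mae(actual, predicted), mse(actual, predicted), rmse(actual, predicted))
```

Because the errors are squared before averaging, the single miss of 3 dominates MSE and RMSE more than it dominates MAE.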
Difference between ridge and lasso regression?
The key difference between the two is the penalty term. The key difference between these techniques is that Lasso shrinks the less important feature's coefficient to zero thus, removing some feature altogether. So, this works well for feature selection in case we have a huge number of features. Lasso(L1) Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds "absolute value of magnitude" of coefficient as penalty term to the loss function. Ridge(L2) Ridge regression adds "squared magnitude" of coefficient as penalty term to the loss function.
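The two penalty terms can be written out directly. In this sketch `lam` stands for the regularization strength and the coefficient list is invented; by the usual convention the intercept is left out of the penalty.

```python
def l1_penalty(coefficients, lam):
    """Lasso penalty: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(c) for c in coefficients)

def l2_penalty(coefficients, lam):
    """Ridge penalty: lam times the sum of squared coefficient values."""
    return lam * sum(c ** 2 for c in coefficients)

coefficients = [0.5, -2.0, 0.0]
print(l1_penalty(coefficients, lam=0.1))  # 0.1 * (0.5 + 2.0 + 0.0)
print(l2_penalty(coefficients, lam=0.1))  # 0.1 * (0.25 + 4.0 + 0.0)
```

Either penalty is added to the loss function before minimizing; the squared term punishes the large coefficient (-2.0) far more heavily than the absolute-value term does.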
Log-loss / Binary crossentropy
The loss function used in binary logistic regression. Binary log loss for a single example is given by -(y log(p) + (1 - y) log(1 - p)), where y is the true label (0 or 1) and p is the predicted probability of class 1. The log loss decreases as the predicted probability of 1 increases when the true label is 1. When to use: when the output of a classifier is a prediction probability. Log loss takes into account the uncertainty of the prediction based on how much it differs from the actual label, which gives a more nuanced view of model performance. In general, minimizing log loss tends to give better accuracy for the classifier. Caveats: it can be misleading on imbalanced datasets. You might have to introduce class weights to penalize minority-class errors more, or use it after balancing the dataset.
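The formula can be checked with a few lines of code. This sketch averages the per-example loss over a list of labels and predicted probabilities; the clipping constant `eps` is an assumption added here to avoid taking log(0).

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean of -(y*log(p) + (1 - y)*log(1 - p)) over all examples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() never receives 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# True label is 1 in all three cases; only the predicted probability changes.
confident_right = binary_log_loss([1], [0.9])
unsure = binary_log_loss([1], [0.5])
confident_wrong = binary_log_loss([1], [0.1])
print(confident_right, unsure, confident_wrong)
```

The loss grows slowly for confident correct predictions and sharply for confident wrong ones.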
Moving average
The mean of time series data from several consecutive periods; continually updated
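A minimal sketch of a simple moving average, assuming a fixed window that slides one period at a time. The series and the function name are made up for illustration.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


# Hypothetical monthly sales figures.
sales = [10, 20, 30, 40, 50]
smoothed = moving_average(sales, 3)
print(smoothed)
```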
p-value
The probability of obtaining a result at least as extreme as the observed data, assuming the null hypothesis is true. It is often loosely described as the probability that the results are due to chance, but it is computed under the assumption that only chance is at work. It can be very sensitive to outliers
P-value
The probability of the observed (or a more extreme) result under the assumption of no difference (null hypothesis)
Perceptron
The simplest neural network; it approximates a single neuron with N binary inputs. It computes a weighted sum of the inputs and "fires" if that weighted sum is zero or greater
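A sketch of the firing rule described above. The bias term and the specific weights (chosen so the perceptron behaves like a logical AND gate) are assumptions added for the example.

```python
def perceptron_output(weights, bias, inputs):
    """Return 1 if the weighted sum (plus bias) is zero or greater, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum >= 0 else 0


# Hypothetical weights that make the perceptron act as an AND gate.
and_weights = [2, 2]
and_bias = -3

for inputs in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(inputs, perceptron_output(and_weights, and_bias, inputs))
```

Only the input [1, 1] gives a weighted sum of at least zero (2 + 2 - 3 = 1), so only that input fires.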
Standard deviation
The square root of the variance; a common way to indicate how different a particular measurement is from the mean
Collaborative filtering
The term collaborative filtering was first used by David Goldberg at Xerox PARC in 1992 in a paper called 'Using collaborative filtering to weave an information tapestry.' He designed a system called Tapestry that allowed people to annotate documents as either interesting or uninteresting and used this information to filter documents for other people. There are now hundreds of web sites that employ some sort of collaborative filtering algorithm for movies, music, books, dating, shopping, other web sites, podcasts, articles, and even jokes.
Data Mining
The use of computers to analyze large data sets to look for patterns that let people make business decisions
Machine learning
The use of data-driven algorithms that perform better as they have more data to work with; generally uses cross-validation
Econometrics
The use of mathematical and statistical methods in the field of economics to verify and develop economic theories
Dependent variable
A variable whose value depends on the value of the independent variable
Spatiotemporal data
Time series data that also includes geographic identifiers such as latitude-longitude pairs
Standardized score
A raw score transformed into units of standard deviation above or below the mean (also called a z-score)
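A short sketch of the transformation, assuming the population standard deviation is the right scale for the data at hand. The raw scores are made up so that the mean is 5 and the standard deviation is 2.

```python
import statistics


def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]


# Hypothetical raw scores with mean 5 and population standard deviation 2.
raw = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = z_scores(raw)
print(standardized)
```

A raw score of 9 becomes 2.0, meaning it lies two standard deviations above the mean.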
Matrix
Two dimensional array of values arranged in rows and columns
How to interpret XGBoost results?
Use SHAP values to measure the importance of features. Because SHAP gives an individualized explanation for every row (for example, every customer), we can do more than just make a bar chart: we can plot the feature contributions for every customer in the data set, which also shows outlier effects. We can also plot the SHAP values for an individual variable such as age while coloring the dots by another variable, for example years of education.
Decision trees
Uses a tree structure to represent a number of possible decision paths and an outcome for each path
Dimension reduction
Using PCA or a similar method to find the smallest subset of dimensions that captures the most variation
Parametric
Using parametric models, we estimate parameters from the training dataset to learn a function that can classify new data points without requiring the original training dataset anymore. Typical examples of parametric models are the perceptron, logistic regression, and the linear SVM.
Kernel trick
Using the kernel trick to find separating hyperplanes in higher dimensional space. To solve a nonlinear problem using an SVM, we transform the training data onto a higher dimensional feature space via a mapping function φ and train a linear SVM model to classify the data in this new feature space. Then we can use the same mapping function φ to transform new, unseen data and classify it using the linear SVM model. However, one problem with this mapping approach is that the construction of the new features is computationally very expensive, especially if we are dealing with high-dimensional data. This is where the so-called kernel trick comes into play. In practice, when training an SVM, all we need is to replace the dot product x_i^T x_j by φ(x_i)^T φ(x_j). In order to save the expensive step of calculating this dot product between two points explicitly, we define a so-called kernel function: k(x_i, x_j) = φ(x_i)^T φ(x_j). One of the most widely used kernels is the Radial Basis Function kernel (RBF kernel) or Gaussian kernel: k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)). The trick is to choose a transformation so that the kernel can be computed without actually computing the transformation: the dot-product function is replaced with a new function that returns what the dot product would have been if the data had first been transformed to a higher dimensional space. This is usually done using the radial basis function
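The idea can be verified numerically with a kernel whose mapping is small enough to write out. The sketch below uses a degree-2 polynomial kernel on 2-D points (swapped in here because its feature map has only three components; the RBF kernel corresponds to an infinite-dimensional map and cannot be checked this way). The names `phi`, `poly2_kernel` and `rbf_kernel`, and the `gamma` parameter standing in for 1/(2σ^2), are choices made for this example.

```python
import math


def dot(a, b):
    """Ordinary dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit degree-2 feature map for a 2-D point."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly2_kernel(x, z):
    """Same value as dot(phi(x), phi(z)), computed without building phi."""
    return dot(x, z) ** 2


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel; gamma plays the role of 1 / (2 * sigma**2)."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * squared_distance)


x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))   # map to 3-D first, then take the dot product
implicit = poly2_kernel(x, z)    # stay in 2-D and apply the kernel
print(explicit, implicit)
print(rbf_kernel(x, z))
```

Both routes give the same number, but the kernel version never constructs the higher-dimensional features.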
Three V's
Volume, Velocity, Variety
Diagnosing bias and variance problems with learning curves
High bias: the model has both low training and low cross-validation accuracy, which indicates that it underfits the training data. Common ways to address this issue are to 1) increase the number of parameters of the model, for example, by collecting or constructing additional features, or 2) decrease the degree of regularization, for example, in SVM or logistic regression classifiers. High variance: there is a large gap between the training and cross-validation accuracy, which indicates overfitting. To address it, we can 1) collect more training data, 2) reduce the complexity of the model, for example, by increasing the regularization parameter, or 3) for unregularized models, decrease the number of features via feature selection or feature extraction (dimensionality reduction). Collecting more training data usually decreases the chance of overfitting. However, it may not always help, for example, when the training data is extremely noisy or the model is already very close to optimal.
quadratic
a polynomial with a degree of 2
describing data
center/location: mean, median, mode; spread (variation): range, interquartile range (25th-75th percentile), variance, standard deviation (dispersion expressed in the same units as the data; square root of the variance); shape; anomalous observations
Cook's D
measure of the change in the parameter estimates when the ith observation is deleted from the analysis
Collinearity
high correlation among features. Regularization is one way to handle collinearity, filter out noise from data, and eventually prevent overfitting
two-sample t-test assumptions
independent observations; normally distributed data for each group (check with a QQ plot); equal variances for each group (not an assumption but a check to see which test to use)
Interpretation of logistic regression coefficients: Intercept = -1.47085, female = 0.59278, math = 0.15634
The intercept of -1.47085 corresponds to the log odds for males being in an honors class (since male is the reference group, female=0). The coefficient for female, 0.59278, corresponds to the log of the odds ratio between the female group and the male group. The odds ratio equals 1.81, which means the odds for females are about 81% higher than the odds for males. The coefficient for math, 0.15634, is interpreted as the expected change in log odds for a one-unit increase in the math score. The odds ratio can be calculated by exponentiating this value to get 1.16922, which means we expect to see about a 17% increase in the odds of being in an honors class for a one-unit increase in math score
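The odds ratios quoted above come from exponentiating the coefficients, which can be reproduced directly. The variable names are chosen for this sketch; the coefficient values are the ones given in the card.

```python
import math

# Coefficients as given above (log-odds scale).
coef_female = 0.59278
coef_math = 0.15634

odds_ratio_female = math.exp(coef_female)
odds_ratio_math = math.exp(coef_math)

# Percent change in the odds implied by each odds ratio.
pct_female = (odds_ratio_female - 1) * 100
pct_math = (odds_ratio_math - 1) * 100

print(round(odds_ratio_female, 2), round(pct_female))
print(round(odds_ratio_math, 5), round(pct_math))
```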
General Linear Model
is a generalization of multiple linear regression to the case of more than one dependent variable.
global f test multiple linear regression
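is something in this model useful? Tests whether at least one predictor has a nonzero coefficient (null hypothesis: all slope coefficients are zero)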
Why would accuracy be a poor metric for measuring the strength of a model?
With imbalanced classes, a model can misclassify every rare event (for example, 1% of cases) as a non-event and still have a 99% accuracy rate
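A quick sketch of the problem, using a made-up data set of 1,000 cases with 10 rare events and a "model" that always predicts the majority class.

```python
# Hypothetical labels: 10 rare events (1) out of 1,000 cases.
y_true = [1] * 10 + [0] * 990

# A useless model that predicts "non-event" for every case.
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)  # share of rare events that were caught

print(accuracy)  # high, even though the model never finds a rare event
print(recall)
```

Metrics such as recall, precision, or the ROC curve expose what accuracy hides here.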
ANOVA assumptions
normality of errors, homogeneity of variance(all groups have equal variance), independence of errors
Accounting for lack of normality
outliers -> robust regression; relationship between variables -> transform; can try Box-Cox transformation
Random variables
probability, mean, variance, percentiles, experimental design, cross-validation, goodness of fit, and robust statistics , ...
Studentized residuals
residual scaled by its standard error; good for detecting outliers because it expresses each residual as a number of standard deviations away from zero
Sensitivity vs. Specificity
sensitivity (true positive rate): how well a test identifies truly ill people (positives); specificity (true negative rate): how well a test identifies truly well people (negatives)
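Both quantities come straight from confusion-matrix counts. This is a minimal sketch with made-up test results, where 1 means ill (positive) and 0 means well (negative).

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of ill people the test flags
    specificity = tn / (tn + fp)  # share of well people the test clears
    return sensitivity, specificity


# Hypothetical results: 4 ill people and 6 well people.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```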
right skewed
tail on right side(skew is above 0)
Load data
with packages like RODBC or RMySQL. Manipulate data, with packages like stringr or lubridate. Visualize data, with packages like ggplot2 or leaflet. Model data, with packages like randomForest or survival. Report results, with packages like shiny or markdown , ...
Simple Linear Regression Equation & Assumptions
y = B0 + B1X + e assumptions: 1) residuals normal 2) homoskedasticity-errors have equal variance/no pattern 3) independence 4) no perfect multicollinearity 5) linearity of mean
Polynomial Model with Cross Product Term
y = B0 + B1X1 + B2X2 + B3X1X2 + e
Quadratic Polynomial Model Fix
y = B0+B1X1+B2X1^2+e