ML - Definitions

dimension reduction, svd

Singular value decomposition (SVD) factorizes a matrix as U Σ Vᵀ; keeping only the top-k singular values gives a low-rank approximation, which is why it is used for dimensionality reduction. It is a comparatively lightweight technique and is common in NLP, e.g., latent semantic analysis on term-document matrices.

non parametric

A category of statistical tests used when certain assumptions about the data are violated (e.g., normality) or when using ordinal (ranked) data. Examples of non-parametric tests include the sign test, the Wilcoxon signed-rank test, and the Mann-Whitney U test.

Underfitting

High bias: the model does not fit the training data well. Causes: the hypothesis function is too simple, or it uses very few features. Solutions: adding more features to the hypothesis function may solve the high-bias problem; if new features are not available, we can generate new features by combining two or more existing features or by taking the square, cube, etc. of an existing feature. If your model is underfitting (high bias), getting more training data will NOT help. A minimal sketch of the feature-generation fix follows below.
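
A minimal sketch, assuming scikit-learn and a synthetic quadratic data set (both assumptions, not from the card): a plain linear model underfits, while adding a squared feature removes most of the bias.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)   # quadratic target

plain = LinearRegression().fit(X, y)                  # too simple -> underfits
enriched = make_pipeline(PolynomialFeatures(degree=2),
                         LinearRegression()).fit(X, y)

print("R^2, plain features:  ", round(plain.score(X, y), 3))
print("R^2, squared features:", round(enriched.score(X, y), 3))
```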

Variables vs attributes

In simple terms, attribute control works at the limits, while variable control works within the limits. Concerning the data generated by each concept, attribute data is discrete whereas variable data is continuous.

Statistical Modeling -> Recommendation Systems

Recommender systems or recommendation systems (sometimes replacing "system" with a synonym such as platform or engine) are a subclass of information filtering systems that seek to predict the 'rating' or 'preference' that a user would give to an item.

recall, How is True Positive Rate and Recall related?

True Positive Rate = Recall. Yes, they are equal, sharing the formula TP / (TP + FN).

Data Analysis -> EDA, univariate, bivariate, multivariate

Univariate data contains only one variable; the analysis describes the data and finds the patterns that exist within it. The patterns can be studied using the mean, median, mode, dispersion or range, quartile analysis, minimum, maximum, etc. Bivariate analysis examines two variables for causes and relationships between them. Example: temperature and ice cream sales. Multivariate analysis involves three or more variables and is otherwise analogous to bivariate analysis. Example: data for house price prediction.

production, maintain model

The steps to maintain a deployed model are: Monitor -> track performance/accuracy over time. Evaluate -> compute evaluation metrics on current data. Compare -> compare candidate models to each other to determine which performs best. Rebuild -> periodically re-build the production model using current data.

What are the confounding variables?

These are extraneous variables in a statistical model that correlate, directly or inversely, with both the dependent and the independent variable. If they are not controlled for, the estimate fails to account for the confounding factor.

gradient descent methods always converge to similar points?

They do not, because in some cases they reach a local minimum or local optimum rather than the global optimum. This is governed by the data, the shape of the loss surface, and the starting conditions.

statistics, z-score

A z-score (standard score) is a measure of how many standard deviations a given data point lies above or below the mean.
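
A small sketch, assuming a toy sample and SciPy for cross-checking (both assumptions): z-scores computed by hand and with scipy.stats.zscore.

```python
import numpy as np
from scipy import stats

x = np.array([4.0, 7.0, 9.0, 12.0, 18.0])
z_manual = (x - x.mean()) / x.std(ddof=0)   # population SD, matching scipy's default
z_scipy = stats.zscore(x)

print(np.allclose(z_manual, z_scipy))  # True
print(z_manual.round(2))
```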

Data Analysis -> EDA, bivariate, correlation, Possible to do btw con and cat?

Yes, we can use ANCOVA (analysis of covariance) technique to capture association between continuous and categorical variables.

canonical

Following or in agreement with accepted, traditional standards. In computer science, canonical refers to the standard state or behavior of an attribute.

dimension reduction, umap

UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction technique: it builds a nearest-neighbour graph in the high-dimensional space and then optimizes a low-dimensional embedding that preserves that local structure. Like t-SNE, it is often used for visualization as well as general-purpose dimension reduction.

Descriptive stats: std dev

Know the equation; the population and sample forms are given below.
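
For reference, the standard population and sample standard deviation formulas (standard definitions, not taken from the card itself):

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
```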

Recommender Systems, what are?

"Recommender Systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product. Recommender systems are widely used in movies, news, research articles, products, social tags, music, etc. Examples include movie recommenders in IMDB, Netflix & BookMyShow, product recommenders in e-commerce sites like Amazon, eBay & Flipkart, YouTube video recommendations and game recommendations in Xbox."

Clustering -> How choose number?

"1) K defines the number of clusters. The objective of clustering is to group similar entities in a way that the entities within a group are similar to each other but the groups are different from each other. Within Sum of squares is generally used to explain the homogeneity within a cluster. If you plot WSS for a range of number of clusters, you will get the plot shown below. • The Graph is generally known as Elbow Curve. • Red circled a point in above graph i.e. Number of Cluster =6 is the point after which you don't see any decrement in WSS. • This point is known as the bending point and taken as K in K - Means. 2) Hierarchical clustering This is the widely used approach but few data scientists also use Hierarchical clustering first to create dendrograms and identify the distinct groups from there.

parametric

"A category of statistical techniques commonly used to analyze interval (continuous) data such as height, weight, or temperature. Parametric techniques are classified by what we know about the population we are studying. The basic idea is that there is a set of fixed parameters that determine a probability model. Parametric methods are often those for which we know that the population is approximately normal, or we can approximate using a normal distribution after we invoke the central limit theorem."

Outliers, How identify? How treat?

"All extreme values are not outlier values. Outlier values can be identified by using univariate or any other graphical analysis method. The most common ways to treat outlier values 1. If the number of outlier values is few then they can be assessed individually. Can change the value and bring it within a range. 2. Cap outlier values at a certain level. For a large number of outliers, the values can be substituted with either the 99th or the 1st percentile values. 3. To just remove the value. Drop outliers only if it is a garbage or unrealistic and extreme value. 4. Try normalizing or standardizing the data. This way, the extreme data points are pulled to a similar range. 5. You can use algorithms that are less affected by outliers; an example would be random forests. Data detected as outliers by linear models can be fit by nonlinear models. Therefore, be sure you are choosing the correct model."

AI vs Machine Learning vs Data Science

"Artificial intelligence refers to the simulation of a human brain function by machines. This is achieved by creating an artificial neural network that can show human intelligence. The primary human functions that an AI machine performs include logical reasoning, learning and self-correction. Machine learning is the ability of a computer system to learn from the environment and improve itself from experience without the need for any explicit programming. Machine learning focuses on enabling algorithms to learn from the data provided, gather insights and make predictions on previously unanalyzed data using the information gathered. Data science is the extraction of relevant insights from data. It uses various techniques from many fields like mathematics, machine learning, computer programming, statistical modeling, data engineering and visualization, pattern recognition and learning, uncertainty modeling, data warehousing, and cloud computing. "

Statistics -> Transformation, Box-Cox, what is?

"Box-Cox transformation is a way to transform non-normal dependent variables into a normal shape. Normality is an important assumption for many statistical techniques, if your data isn't normal, applying a Box-Cox means that you are able to run a broader number of tests. The dependent variable for a regression analysis might not satisfy one or more assumptions of an ordinary least squares regression. The residuals could either curve as the prediction increases or follow the skewed distribution. In such scenarios, it is necessary to transform the response variable so that the data meets the required assumptions. "

statistic, resampling methods

"Classical statistical parametric tests compare observed statistics to theoretical sampling distributions. Resampling a data-driven, not theory-driven methodology which is based upon repeated sampling within the same sample. Resampling refers to methods for doing one of these Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping) Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests) Validating models by using random subsets (bootstrapping, cross validation)Classical statistical parametric tests compare observed statistics to theoretical sampling distributions. Resampling a data-driven, not theory-driven methodology which is based upon repeated sampling within the same sample. Resampling refers to methods for doing one of these Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping) Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests) Validating models by using random subsets (bootstrapping, cross validation)"

We know that one-hot encoding increases the dimensionality of a data set, but label encoding doesn't. How?

"Don't get baffled at this question. It's a simple question asking the difference between the two. Using one hot encoding, the dimensionality (a.k.a features) in a data set get increased because it creates a new variable for each level present in categorical variables. For example: let's say we have a variable 'color'. The variable has 3 levels namely Red, Blue and Green. One hot encoding 'color' variable will generate three new variables as Color.Red, Color.Blue and Color.Green containing 0 and 1 value. In label encoding, the levels of a categorical variables gets encoded as 0 and 1, so no new variable is created. Label encoding is majorly used for binary variables."

kNN vs K-means

"Don't get mislead by 'k' in their names. You should know that the fundamental difference between both these algorithms is, kmeans is unsupervised in nature and kNN is supervised in nature. kmeans is a clustering algorithm. kNN is a classification (or regression) algorithm. kmeans algorithm partitions a data set into clusters such that a cluster formed is homogeneous and the points in each cluster are close to each other. The algorithm tries to maintain enough separability between these clusters. Due to unsupervised nature, the clusters have no labels. kNN algorithm tries to classify an unlabeled observation based on its k (can be any number ) surrounding neighbors. It is also known as lazy learner because it involves minimal training of model. Hence, it doesn't use training data to make generalization on unseen data set."

Transformation -> eigenvalue and eigenvector?

"Eigenvalues are the directions along which a particular linear transformation acts by flipping, compressing, or stretching. Eigenvectors are for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix."

Entropy, what is

"Entropy higher if uncertain or less confident, lower if more certain or more confident. Decreases when more confident. This equation corresponds to the weighted sum of log base two of the probabilities of all outcomes. The important thing to take away here is that this is a measure of uncertainty in the outcome. As we limit the possible number of outcomes and become more confident in the outcome, the entropy decreases."

feature selection, While working on a data set, how do you select important variables? Explain your methods.

"Following are the methods of variable selection you can use: Remove the correlated variables prior to selecting important variables Use linear regression and select variables based on p values Use Forward Selection, Backward Selection, Stepwise Selection Use Random Forest, Xgboost and plot variable importance chart Use Lasso Regression Measure information gain for the available set of features and select top n features accordingly."

neural networks, exploding gradients

"Gradient: Gradient is the direction and magnitude calculated during training of a neural network that is used to update the network weights in the right direction and by the right amount. ""Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training."" At an extreme, the values of weights can become so large as to overflow and result in NaN values. This has the effect of your model being unstable and unable to learn from your training data. Now let's understand what is the gradient."

given data set, tall vs wide

"In most data mining / data science applications there are many more records (rows) than features (columns) - such data is sometimes called ""tall"" (or ""long"") data. In some applications like genomics or bioinformatics you may have only a small number of records (patients), eg 100, but perhaps 20,000 observations for each patient. The standard methods that work for ""tall"" data will lead to overfitting the data, so special approaches are needed. Fig 13. Different approaches for tall data and wide data, from presentation Sparse Screening for Exact Data Reduction, by Jieping Ye. The problem is not just reshaping the data (here there are useful R packages), but avoiding false positives by reducing the number of features to find most relevant ones. Approaches for feature reduction like Lasso are well covered in Statistical Learning with Sparsity: The Lasso and Generalizations, by Hastie, Tibshirani, and Wainwright. (you can download free PDF of the book) "

overfitting, what is how avoid?

"Overfitting is finding spurious results that are due to chance and cannot be reproduced by subsequent studies. There are three main methods to avoid over-fitting: Keep the model simple 1. take fewer variables into account, thereby removing some of the noise in the training data 2. Use cross-validation techniques, such as k folds cross-validation 3. Use regularization techniques, such as LASSO, that penalize certain model parameters if they're likely to cause over-fitting 4. Adjusting the False Discovery Rate. Using the reusable holdout method - a breakthrough approach proposed in 2015

Performance measures -> linear, RMSE, MSE

"RMSE and MSE are two of the most common measures of accuracy for a linear regression model. RMSE indicates the Root Mean Square Error. sqrt((sum of n to i (predicted - actual)squared)/N) MSE indicates the Mean Square Error. (sum of n to i (predicted - actual)squared)/N"

Regularization, what is, when helpful?

"Regularization is useful for reducing variance in the model, meaning avoiding overfitting . For example, we can use L1 regularization in Lasso regression to penalize large coefficients. Regularization is the process of adding a tuning parameter to a model to induce smoothness in order to prevent overfitting. This is most often done by adding a constant multiple to an existing weight vector. This constant is often either the L1 (Lasso) or L2 (ridge), but can in actuality can be any norm. The model predictions should then minimize the mean of the loss function calculated on the regularized training set."

statistics, when resampling

"Resampling is done in any of these cases: Estimating the accuracy of sample statistics by using subsets of accessible data, or drawing randomly with replacement from a set of data points Substituting labels on data points when performing significance tests Validating models by using random subsets (bootstrapping, cross-validation)"

Outliers how screen, what do if yes?

"Some methods to screen outliers are z-scores, modified z-score, box plots, Grubb's test, Tietjen-Moore test exponential smoothing, Kimber test for exponential distribution and moving window filter algorithm. However two of the robust methods in detail are: Inter Quartile Range An outlier is a point of data that lies over 1.5 IQRs below the first quartile (Q1) or above third quartile (Q3) in a given data set. High = (Q3) + 1.5 IQR Low = (Q1) - 1.5 IQR Tukey Method It uses interquartile range to filter very large or very small numbers. It is practically the same method as above except that it uses the concept of ""fences"". The two values of fences are: Low outliers = Q1 - 1.5(Q3 - Q1) = Q1 - 1.5(IQR) High outliers = Q3 + 1.5(Q3 - Q1) = Q3 + 1.5(IQR) Anything outside of the fences is an outlier. When you find outliers, you should not remove it without a qualitative assessment because that way you are altering the data and making it no longer pure. It is important to understand the context of analysis or importantly ""The Why question - Why an outlier is different from other data points?"" This reason is critical. If outliers are attributed to error, you may throw it out but if they signify a new trend, pattern or reveal a valuable insight into the data you should retain it."

frequentist vs bayes; given data set, experiment design, to answer a question about user behavior

"Step 1: Formulate the Research Question: What are the effects of page load times on user satisfaction ratings? Step 2: Identify variables: We identify the cause & effect. Independent variable -page load time, Dependent variable- user satisfaction rating Step 3: Generate Hypothesis: Lower page download time will have more effect on the user satisfaction rating for a web page. Here the factor we analyze is page load time. Step 4: Determine Experimental Design. We consider experimental complexity i.e vary one factor at a time or multiple factors at one time in which case we use factorial design (2^k design). A design is also selected based on the type of objective (Comparative, Screening, Response surface) & number of factors. Here we also identify within-participants, between-participants, and mixed model.For e.g.: There are two versions of a page, one with Buy button (call to action) on left and the other version has this button on the right. Within-participants design - both user groups see both versions. Between-participants design - one group of users see version A & the other user group version B. Step 5: Develop experimental task & procedure: Detailed description of steps involved in the experiment, tools used to measure user behavior, goals and success metrics should be defined. Collect qualitative data about user engagement to allow statistical analysis. Step 6: Determine Manipulation & Measurements Manipulation: One level of factor will be controlled and the other will be manipulated. We also identify the behavioral measures: Latency- time between a prompt and occurrence of behavior (how long it takes for a user to click buy after being presented with products). Frequency- number of times a behavior occurs (number of times the user clicks on a given page within a time) Duration-length of time a specific behavior lasts(time taken to add all products) Intensity-force with which a behavior occurs ( how quickly the user purchased a product) Step 7: Analyze results Identify user behavior data and support the hypothesis or contradict according to the observations made for e.g. how majority of users satisfaction ratings compared with page load times."

Structured learning, what is?

"Structured learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. Input data is labelled. Uses a training data set. Used for prediction. Regression and classification. Algorithms: Support Vector Machines, Regression, Naive Bayes, Decision Trees, K-nearest Neighbor Algorithm and Neural Networks E.g. If you built a fruit classifier, the labels will be "this is an orange, this is an apple and this is a banana", based on showing the classifier examples of apples, oranges and bananas. "

Statistical Modeling -> Survival Analysis

"Survival analysis is a branch of statistics for analyzing the expected duration of time until one or more events happen, such as death in biological organisms and failure in mechanical systems. This topic is called reliability theory or reliability analysis in engineering, duration analysis or duration modelling in economics, and event history analysis in sociology. Survival analysis attempts to answer questions such as: what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival? Survival models are used by actuaries and statisticians, but also by marketers designing churn and user retention models. Survival models are also used to predict time-to-event (time from becoming radicalized to turning into a terrorist, or time between when a gun is purchased and when it is used in a murder), or to model and predict decay (see section 4 in this article)."

Missing values, how deal with?

"The extent of the missing values is identified after identifying the variables with missing values. If any patterns are identified the analyst has to concentrate on them as it could lead to interesting and meaningful business insights. If there are no patterns identified, then the missing values can be substituted with mean or median values (imputation) or they can simply be ignored. Assigning a default value which can be mean, minimum or maximum value. Getting into the data is important. If it is a categorical variable, the default value is assigned. The missing value is assigned a default value. If you have a distribution of data coming, for normal distribution give the mean value. If 80% of the values for a variable are missing then you can answer that you would be dropping the variable instead of treating the missing values."

Descriptive Statistics -> skewness

"The extent to which cases are clustered more at one or the other end of the distribution of a quantitative variable rather than in a symmetric pattern around its center Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point."

entropy, calculate: target variable = 1 target set: [0, 0, 0, 1, 1, 1, 1, 1]

"The formula for calculating the entropy is: Putting p=5 and n=8, we get Entropy = A = -(5/8 log(5/8) + 3/8 log(3/8))"

precision recall curve vs ROC curve

"There is a very important difference between what a ROC curve represents vs that of a PRECISION vs RECALL curve. A ROC curve represents a relation between sensitivity (RECALL) and False Positive Rate (NOT PRECISION). Sensitivity is the other name for recall but the False Positive Rate is not PRECISION. Recall/Sensitivity is the measure of the probability that your estimate is 1 given all the samples whose true class label is 1. It is a measure of how many of the positive samples have been identified as being positive. Specificity is the measure of the probability that your estimate is 0 given all the samples whose true class label is 0. It is a measure of how many of the negative samples have been identified as being negative. PRECISION on the other hand is different. It is a measure of the probability that a sample is a true positive class given that your classifier said it is positive. It is a measure of how many of the samples predicted by the classifier as positive is indeed positive. Note here that this changes when the base probability or prior probability of the positive class changes. Which means PRECISION depends on how rare is the positive class. In other words, it is used when positive class is more interesting than the negative class. So, if your problem involves kind of searching a needle in the haystack when for ex: the positive class samples are very rare compared to the negative classes, use a precision recall curve. Othwerwise use a ROC curve because a ROC curve remains the same regardless of the baseline prior probability of your positive class (the important rare class)."

economic terms, Are you familiar with price optimization, price elasticity, inventory management, competitive intelligence? Give examples.

"Those are economics terms that are not frequently asked of Data Scientists but they are useful to know. Price optimization is the use of mathematical tools to determine how customers will respond to different prices for its products and services through different channels. Big Data and data mining enables use of personalization for price optimization. Now companies like Amazon can even take optimization further and show different prices to different visitors, based on their history, although there is a strong debate about whether this is fair. Price elasticity in common usage typically refers to Price elasticity of demand, a measure of price sensitivity. It is computed as: Price Elasticity of Demand = % Change in Quantity Demanded / % Change in Price. Similarly, Price elasticity of supply is an economics measure that shows how the quantity supplied of a good or service responds to a change in its price. Inventory management is the overseeing and controlling of the ordering, storage and use of components that a company will use in the production of the items it will sell as well as the overseeing and controlling of quantities of finished products for sale. Wikipedia defines Competitive intelligence: the action of defining, gathering, analyzing, and distributing intelligence about products, customers, competitors, and any aspect of the environment needed to support executives and managers making strategic decisions for an organization. Tools like Google Trends, Alexa, Compete, can be used to determine general trends and analyze your competitors on the web. Here are useful resources: Competitive Intelligence Metrics, Reports by Avinash Kaushik 37 Best Marketing Tools to Spy on Your Competitors from Kissmetrics 10 best competitive intelligence tools from 10 experts "

Statistical Modeling -> Time-series

"Time Series: Methods for time series analyses may be divided into two classes: frequency-domain methods and time-domain methods. The former include spectral analysis and recently wavelet analysis; the latter include auto-correlation and cross-correlation analysis. In time domain, correlation analyses can be made in a filter-like manner using scaled correlation, thereby mitigating the need to operate in frequency domain. Additionally, time series analysis techniques may be divided into parametric and non-parametric methods. The parametric approaches assume that the underlying stationary stochastic process has a certain structure which can be described using a small number of parameters (for example, using an autoregressive or moving average model). In these approaches, the task is to estimate the parameters of the model that describes the stochastic process. By contrast, non-parametric approaches explicitly estimate the covariance or the spectrum of the process without assuming that the process has any particular structure. Methods of time series analysis may also be divided into linear and non-linear, and univariate and multivariate."

Unstructured learning, what is?

"Unstructured learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labelled responses. Algorithms: Clustering, Anomaly Detection, Neural Networks and Latent Variable Models E.g. In the same example, a fruit clustering will categorize as "fruits with soft skin and lots of dimples", "fruits with shiny hard skin" and "elongated yellow fruits". Input data is unlabelled. Uses the input data set. Used for analysis. Enables classification, density estimation, and dimension reduction. common k-means clustering, hierarchical clustering, and apriori algorithm"

Outliers, how make model robust to?

"We can have regularization such as L1 or L2 to reduce variance (increase bias). Changes to the algorithm: * Use tree-based methods instead of regression methods as they are more resistant to outliers. * For statistical tests, use non parametric tests instead of parametric ones. Use robust error metrics such as MAE or Huber Loss instead of MSE. Changes to the data: *Winsorizing the data *Transforming the data (e.g. log) *Remove them only if you're certain they're anomalies not worth predicting"

How would you evaluate a logistic regression model?

"We can use the following methods: Since logistic regression is used to predict probabilities, we can use AUC-ROC curve along with confusion matrix to determine its performance. Also, the analogous metric of adjusted R² in logistic regression is AIC. AIC is the measure of fit which penalizes model for the number of model coefficients. Therefore, we always prefer model with minimum AIC value. Null Deviance indicates the response predicted by a model with nothing but an intercept. Lower the value, better the model. Residual deviance indicates the response predicted by a model on adding independent variables. Lower the value, better the model."

ridge vs lasso, when ridge better

"You can quote ISLR's authors Hastie, Tibshirani who asserted that, in presence of few variables with medium / large sized effect, use lasso regression. In presence of many variables with small / medium sized effect, use ridge regression. Conceptually, we can say, lasso regression (L1) does both variable selection and parameter shrinkage, whereas Ridge regression only does parameter shrinkage and end up including all the coefficients in the model. In presence of correlated variables, ridge regression might be the preferred choice. Also, ridge regression works best in situations where the least square estimates have higher variance. Therefore, it depends on our model objective. Know more: Ridge and Lasso Regression"

Algorithm, when update?

"You will want to update an algorithm when: • You want the model to evolve as data streams through infrastructure • The underlying data source is changing • There is a case of non-stationarity • The algorithm underperforms/ results lack accuracy"

statistics, t-test

"a statistical test that compares two means to see whether they could come from the same population follows a Student's t-distribution under the null hypothesis. A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. Ultimately used to accept or reject the null hypothesis with confidence. takes into account size of sample population. Look up value in table and get the p-value."

recommendation engine, how work?

"recommendations in one of two ways: using collaborative or content-based filtering. Collaborative filtering -> users past behavior (items previously purchased, movies viewed and rated, etc) and use decisions made by current and other users. This model is then used to predict items (or ratings for items) that the user may be interested in. Spotify, Amazon Content-based filtering methods use features of an item to recommend additional items with similar properties. These approaches are often combined in Hybrid Recommender Systems. Here is a comparison of these 2 approaches: * Content -> Pandora uses the properties of a song or artist (a subset of the 400 attributes provided by the Music Genome Project) in order to seed a ""station"" that plays music with similar properties. User feedback is used to refine the station's results.

statistics, p-value, significance, what is

"where p-value typically ≤ 0.05 This indicates strong evidence against the null hypothesis; so you reject the null hypothesis. p-value typically > 0.05 This indicates weak evidence against the null hypothesis, so you accept the null hypothesis. p-value at cutoff 0.05 This is considered to be marginal, meaning it could go either way."

bias variance trade off, which algos are low/high bias? Which are low/high variance?

*Low bias (typically high variance): Decision Trees, k-NN, SVM. *High bias (typically low variance): Linear Regression, Logistic Regression. Bias-variance trade-off: the goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance. The k-nearest neighbors algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k, which increases the number of neighbors that contribute to the prediction and in turn increases the bias of the model. The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by increasing the C parameter that influences the number of violations of the margin allowed in the training data, which increases the bias but decreases the variance. There is no escaping the relationship between bias and variance in machine learning: increasing the bias will decrease the variance, and increasing the variance will decrease the bias.

Data Analysis -> EDA feature vector

A feature vector is an n-dimensional vector of numerical features that represent an object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics (called features) of an object in a mathematical way that's easy to analyze.

Statistical Modeling -> Attribution Modeling

An attribution model is the rule, or set of rules, that determines how credit for sales and conversions is assigned to touchpoints in conversion paths. For example, the Last Interaction model in Google Analytics assigns 100% credit to the final touchpoints (i.e., clicks) that immediately precede sales or conversions. Macro-economic models use long-term, aggregated historical data to assign, for each sale or conversion, an attribution weight to a number of channels. These models are also used for advertising mix optimization

Statistical Modeling -> Association rule learning

Association rule learning is a method for discovering interesting relations between variables in large databases. For example, the rule { onions, potatoes } ==> { burger } found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. In fraud detection, association rules are used to detect patterns associated with fraud. Linkage analysis is performed to identify additional fraud cases: if credit card transaction from user A was used to make a fraudulent purchase at store B, by analyzing all transactions from store B, we might find another user C with fraudulent activity.

Data Analysis -> EDA, "chart junk", what is?

Chart junk refers to unnecessary chart elements that distract the viewer from the information.

Statistical Modeling -> Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Recommender Systems, collaborative filtering, what is?

Collaborative filtering can be used to predict the rating of a particular item for a user based on his/her ratings for other movies and other users' ratings for all movies. This concept is widely used in recommending movies in IMDB, Netflix & BookMyShow, product recommenders in e-commerce sites like Amazon, eBay & Flipkart, YouTube video recommendations, and game recommendations in Xbox.

Data Analysis -> EDA, bivariate, covariance and correlation, what diff?

Correlation is the standardized form of covariance. Covariances are difficult to compare. For example: if we calculate the covariance of salary ($) and age (years), we get a number whose magnitude depends on their unequal scales and therefore can't be compared across variable pairs. To combat such a situation, we calculate the correlation, which is always a value between -1 and 1 irrespective of the variables' scales.
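
A small sketch with toy age/salary data (an assumption), using NumPy: covariance is scale-dependent, correlation is not.

```python
import numpy as np

age_years = np.array([25, 32, 41, 55, 63], dtype=float)
salary_usd = np.array([40_000, 52_000, 61_000, 80_000, 95_000], dtype=float)

print(np.cov(age_years, salary_usd)[0, 1])       # large, unit-dependent number
print(np.corrcoef(age_years, salary_usd)[0, 1])  # ~0.99, always within [-1, 1]
```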

Statistical Modeling -> Churn Analysis

Customer churn analysis helps you identify and focus on higher value customers, determine what actions typically precede a lost customer or sale, and better understand what factors influence customer retention. Statistical techniques involved include survival analysis as well as Markov chains with four states: brand new customer, returning customer, inactive (lost) customer, and re-acquired customer, along with path analysis (including root cause analysis) to understand how customers move from one state to another, to maximize profit. Related topics: customer lifetime value, cost of user acquisition, user retention.

Data science, what is?

Data Science is a blend of various tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data. It both describes data and makes predictions from it.

Random Forest vs Decision Trees

Decision trees are built on the entire data set, making use of all the predictor variables. Even though decision trees are convenient and easily implemented, they can lack accuracy: they work very effectively on the training data used to build them, but they're not flexible when it comes to classifying new samples, which means accuracy during the testing phase can be low. This happens due to a process called overfitting. Random forest addresses this by introducing randomness: it is an ensemble of decision trees in which each tree is built on a random sample of the data and, at each split, only a random subset of the features (say, two) is considered. Each tree votes for a class (in this example, whether or not a house is bought), and the class receiving the most votes by simple majority is the predicted class.

Dimension vs Attribute vs Feature

Dimension usually refers to the number of attributes, although it can also be used in the form "the second dimension of the data vector is the person's age"; that usage is rather rare - in most cases dimension means "number of attributes". Attribute is one particular "type of data" in your points, so each observation/datapoint (like a personal record) contains many different attributes (like a person's weight, height, age, etc.). Feature may have multiple meanings depending on context: it sometimes refers to an attribute; it sometimes refers to the internal representation of the data generated by a particular learning model, for example neural networks extract features which are combinations of the attributes or other features; and it sometimes refers to the hypothetical representation of the data induced by a kernel method (in Kernel PCA, Kernel k-means, SVM).

dimension reduction diff kinds, benefits

Dimensionality reduction refers to the process of converting a data set with vast dimensions into data with fewer dimensions (fields) that convey similar information concisely. Benefits: it mitigates the curse of dimensionality, reduces computation time, and saves system resources (memory and storage space).

Overfitting - How prevent

High variance: good performance on training data, bad on test data. Overfitting is especially likely when there are a large number of features relative to the number of samples (observations). Remedies: use cross-validation; reduce the number of (irrelevant) features; (for neural networks) make the network smaller; add weight regularization (L1, L2, i.e., Lasso, Ridge); increase the size of the training data; (for neural networks) add dropout.

Statistical Modeling -> Inventory management

Inventory management is the overseeing and controlling of the ordering, storage and use of components that a company will use in the production of the items it will sell, as well as the overseeing and controlling of quantities of finished products for sale. Inventory management is an operations research technique leveraging analytics (time series, seasonality, regression), especially for sales forecasting and optimum pricing - broken down per product category, market segment, and geography. It is strongly related to pricing optimization. This is not just for brick and mortar operations: inventory could mean the amount of available banner ad slots on a publisher website in the next 60 days, with estimates of how much traffic (and conversions) each banner ad slot is expected to deliver to the potential advertiser. You don't want to over-sell or under-sell this virtual inventory, and thus you need good statistical models to predict the web traffic and conversions (to pre-sell the inventory), for each advertiser category.

What is the law of large numbers?

It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It states that the sample mean, sample variance, and sample standard deviation converge to the quantities they are trying to estimate as the sample size grows.

Distance, k-means or kNN, we use euclidean distance to calculate the distance between nearest neighbors. Why not manhattan distance?

We don't use Manhattan distance because it calculates distance along horizontal or vertical moves only; it has dimension restrictions. The Euclidean metric, on the other hand, can be used in any space to calculate distance, and since data points can be present in any dimension, Euclidean distance is the more viable option. Example: think of a chess board; the movement made by a rook is measured by Manhattan distance because of its vertical and horizontal moves, whereas Euclidean distance measures the straight-line separation.

Distance, euclidean, how calc given points plot1 = [1,3] plot2 = [2,5]

The Euclidean distance can be calculated as follows: euclidean_distance = sqrt( (plot1[0]-plot2[0])**2 + (plot1[1]-plot2[1])**2 )
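
A quick check of that calculation (points taken from the card):

```python
import math

plot1, plot2 = [1, 3], [2, 5]
d = math.sqrt((plot1[0] - plot2[0]) ** 2 + (plot1[1] - plot2[1]) ** 2)
print(round(d, 4))   # 2.2361 == sqrt(5)
```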

Lasso, what is?

L1 regularization (Lasso): adds the sum of the absolute values of the coefficients as a penalty term. It shrinks coefficients and can set some of them exactly to zero, so it also performs variable selection.

L1 vs L2, what diff

L1 (Lasso) penalizes the sum of absolute coefficient values and can drive some coefficients exactly to zero, giving sparse models with built-in feature selection. L2 (Ridge) penalizes the sum of squared coefficients and shrinks them toward zero without eliminating any, which tends to work well with correlated predictors.

Ridge, what is?

L2 regularization (Ridge): adds the sum of the squared coefficients as a penalty term. It shrinks coefficients toward zero but does not set them exactly to zero, so all variables remain in the model.

bias variance trade off, describe low bias issues, how adjust?

Low bias occurs when the model's predicted values are close to the actual values; in other words, the model is flexible enough to mimic the training data distribution. But such a flexible model has poor generalization: when it is tested on unseen data, it gives disappointing results (high variance). In such situations, we can use a bagging algorithm (like random forest) to tackle the high-variance problem. Bagging algorithms divide a data set into subsets made with repeated randomized sampling; these samples are used to generate a set of models using a single learning algorithm, and the models' predictions are then combined using voting (classification) or averaging (regression). Also, to combat high variance, we can: use regularization techniques, where higher model coefficients get penalized, lowering model complexity; and use the top n features from a variable importance chart - maybe, with all the variables in the data set, the algorithm is having difficulty finding the meaningful signal.

Machine learning, what is?

Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. It is closely related to computational statistics and is used to devise complex models and algorithms that lend themselves to prediction, which in commercial use is known as predictive analytics.

Statistical Modeling -> Market Segmentation

Market segmentation, also called customer profiling, is a marketing strategy which involves dividing a broad target market into subsets of consumers, businesses, or countries that have, or are perceived to have, common needs, interests, and priorities, and then designing and implementing strategies to target them. Market segmentation strategies are generally used to identify and further define the target customers, and provide supporting data for marketing plan elements such as positioning to achieve certain marketing plan objectives. Businesses may develop product differentiation strategies, or an undifferentiated approach, involving specific products or product lines depending on the specific demand and attributes of the target segment.

Descriptive Statistics -> kurtosis

A measure of the fatness of the tails of a probability distribution relative to that of a normal distribution; it indicates the likelihood of extreme outcomes. Kurtosis measures whether the data are heavy-tailed or light-tailed relative to a normal distribution: data sets with high kurtosis tend to have heavy tails (outliers), while data sets with low kurtosis tend to have light tails (a lack of outliers). A uniform distribution is an extreme case of light tails.

Statistical Modeling -> Simulations

Monte-Carlo simulations are used in many contexts: to produce high quality pseudo-random numbers, in complex settings such as multi-layer spatio-temporal hierarchical Bayesian models, to estimate parameters, to compute statistics associated with very rare events, or even to generate large amounts of data (for instance cross- and auto-correlated time series) to test and compare various algorithms, especially for stock trading or in engineering.

Descriptive Statistics -> normal dist vs standard normal dist; normalization vs standardization

A normal distribution can be fully described by its mean and standard deviation (SD). A normal distribution is called the standard normal distribution when its mean is zero and its SD (and variance) is equal to 1. In a normal distribution, about 68 percent of all values lie within one standard deviation of the mean, 95.45 percent within two standard deviations, and 99.73 percent within three standard deviations. Normalization rescales the values into a range of [0, 1]: X_changed = (X - X_min) / (X_max - X_min). This can be useful when all parameters need to have the same positive scale; however, information about outliers in the data set is compressed. Standardization rescales data to have a mean (μ) of 0 and standard deviation (σ) of 1 (unit variance): X_changed = (X - μ) / σ. For most applications standardization is recommended. A brief sketch follows below.
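
A brief sketch with a toy column (an assumption), using the scikit-learn scalers that implement the two formulas above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [4.0], [10.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # rescaled into [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # mean 0, unit variance
```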

statistics, t-test null hypothesis

Null Hypothesis - the assumption that there is no difference between the two means.

"experimentation, a/b Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to website visitors has any impact on their purchase decisions. Which analysis method should you use? One-way ANOVA K-means clustering Association rules Student's t-test"

One-way ANOVA (there are three groups to compare - coupon A, coupon B, and no coupon - so a two-sample t-test is not enough).

Performance Measure -> Classification, Accuracy

Percentage of correct predictions. The formula for accuracy is: Accuracy = (True Positive + True Negative) / Total Observations

Performance Measure -> Classification precision and recall

Precision: what proportion of positive identifications was actually correct? Precision = TP / (TP + FP); e.g., a precision of 0.5 means that when the model predicts a tumor is malignant, it is correct 50% of the time. Recall: what proportion of actual positives was identified correctly? Recall = TP / (TP + FN); e.g., a recall of 0.11 means the model correctly identifies 11% of all malignant tumors. A small sketch follows below.
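
A small sketch with toy labels (an assumption), using scikit-learn's metric helpers:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]

print(accuracy_score(y_true, y_pred))    # (TP + TN) / total observations
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
```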

Statistical Modeling -> Predictive Modeling

Predictive modeling leverages statistics to predict outcomes. Most often the event one wants to predict is in the future, but predictive modelling can be applied to any type of unknown event, regardless of when it occurred. For example, predictive models are often used to detect crimes and identify suspects after the crime has taken place. They may also be used for weather forecasting, to predict stock market prices, or to predict sales, incorporating time series or spatial models. Neural networks, linear regression, decision trees, and naive Bayes are some of the techniques used for predictive modeling. They are associated with creating a training set, cross-validation, and model fitting and selection.

Regularization, what is, when helpful?

Regularization becomes necessary when the model begins to overfit or underfit. This technique introduces a cost term for bringing in more features via the objective function. Hence, it tries to push the coefficients of many variables toward zero and thereby reduce the cost term. This helps to reduce model complexity so that the model can become better at predicting (generalizing).

Statistical Modeling -> Scoring

A scoring model is a special kind of predictive model. Predictive models can predict defaulting on loan payments, risk of accident, client churn or attrition, or the chance of buying a good. Scoring models typically use a logarithmic scale (for instance, each additional 50 points in your score reducing the risk of defaulting by 50%) and are based on logistic regression and decision trees, or a combination of multiple algorithms. Scoring technology is typically applied to transactional data, sometimes in real time (credit card fraud detection, click fraud).

Bayes bias -> sampling bias, three types?

Selection bias: the results are skewed a certain way because you've only captured feedback from a certain segment of your audience. Response bias: something about how the survey questionnaire is constructed encourages a certain type of answer, leading to measurement error. Survivorship bias: occurs when your survey is limited to customers, clients, and employees who have remained with you over time. Acquiescence bias: when it's all about "yes". Question order bias: striving for consistency. Answer option order / primacy bias: answer order matters too. Social desirability / conformity bias: the coolness factor.

bias -> selection bias, What is? How Avoid?

Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample. For example, if a given sample of 100 test cases was made up of a 60/20/15/5 split of 4 classes which actually occurred in relatively equal numbers in the population, then a given model may make the false assumption that probability could be the determining predictive factor. Avoiding non-random samples is the best way to deal with bias; however, when this is impractical, techniques such as resampling, boosting, and weighting are strategies which can be introduced to help deal with the situation.

statistics, standardization vs normalization

Standardization (or Z-score normalization) is the process of rescaling the features so that they have the properties of a Gaussian distribution with μ = 0 and σ = 1, where μ is the mean and σ is the standard deviation from the mean. Normalization, often also simply called min-max scaling, shrinks the range of the data so that it is fixed between 0 and 1 (or -1 to 1 if there are negative values).

statistic, statistical power

Statistical power, or the sensitivity of a binary hypothesis test, is the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true. To put it another way, statistical power is the likelihood that a study will detect an effect when the effect is present. The higher the statistical power, the less likely you are to make a Type II error (concluding there is no effect when, in fact, there is).

Statistical Modeling -> Supply Chain Optimization

Supply chain optimization is the application of processes and tools to ensure the optimal operation of a manufacturing and distribution supply chain. This includes the optimal placement of inventory within the supply chain, minimizing operating costs (including manufacturing costs, transportation costs, and distribution costs). This often involves the application of mathematical modelling techniques such as graph theory to find optimum delivery routes (and optimum locations of warehouses), the simplex algorithm, and Monte Carlo simulations. Read 21 data science systems used by Amazon to operate its business for typical applications. Again, despite being heavily statistical in nature, this is considered to be an operations research problem.

Descriptive stats: mean, median, mode

The "mean" is the "average" you're used to, where you add up all the numbers and then divide by the number of numbers. The "median" is the "middle" value in the list of numbers. To find the median, your numbers have to be listed in numerical order from smallest to largest, so you may have to rewrite your list before you can find the median. The "mode" is the value that occurs most often.

Data Analysis -> EDA Feature selection, two primary methods

The best analogy for selecting features is "bad data in, bad answer out". Filter feature selection methods use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis to choose (filter) the input variables that will be used in the model; examples include linear discriminant analysis, ANOVA, and Chi-Square. Wrapper methods search over feature subsets using a model: forward selection - test one feature at a time and keep adding features until we get a good fit; backward selection - test all the features and start removing them to see what works better; recursive feature elimination - recursively looks through all the different features and how they pair together. Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with them. A short sketch follows below.
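
A short sketch, assuming scikit-learn and synthetic data (both assumptions): one filter method (SelectKBest with the ANOVA F-test) and one wrapper method (recursive feature elimination).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

filter_sel = SelectKBest(score_func=f_classif, k=3).fit(X, y)   # filter method
wrapper_sel = RFE(LogisticRegression(max_iter=1000),
                  n_features_to_select=3).fit(X, y)             # wrapper method

print(filter_sel.get_support(indices=True))
print(wrapper_sel.get_support(indices=True))
```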

Statistics vs Probability; Classical (also known as Frequentist) vs Bayesian statisticians

The main difference between probability and statistics has to do with knowledge. Inherent in both probability and statistics is a population, consisting of every individual we are interested in studying, and a sample, consisting of the individuals that are selected from the population. A problem in probability would start with us knowing everything about the composition of a population, and then would ask, "What is the likelihood that a selection, or sample, from the population has certain characteristics?" For example, if we know exactly which socks are in a drawer, probability tells us the chance of drawing a matching pair. If instead we have no knowledge about the types of socks in the drawer, then we enter the realm of statistics: statistics helps us infer properties of the population on the basis of a random sample.

Data set shift, what is, what do?

A model that has high training accuracy might have low test accuracy. Without further knowledge, it is hard to know which dataset represents the population, and thus the generalizability of the algorithm is hard to measure; this should be mitigated by repeated splitting into train vs test datasets (as in cross-validation). When there is a change in the data distribution, this is called dataset shift. If the train and test data have different distributions, the classifier will likely overfit to the train data. This issue can be overcome by using a more general learning method. Dataset shift can occur when: P(y|x) is the same but P(x) differs (covariate shift); or P(y|x) differs (concept shift). The causes can be: training samples obtained in a biased way (sample selection bias), or train differing from test because of temporal or spatial changes (non-stationary environments). A solution to covariate shift is importance-weighted cross-validation.

Experimentation, a/b, what is goal

This is statistical hypothesis testing for randomized experiments with two variants, A and B. The objective of A/B testing is to detect the effect of changes to a web page so as to maximize or increase the outcome of a strategy (e.g., the conversion rate).

p-value

Thus a p-value is simply a measure of the strength of evidence against H0: low means more strength (< .05), high means less strength. The only question the p-value addresses is whether the experiment provides enough evidence to reasonably reject H0. 1) The p-value is calculated based on an assumption that H0 is true, so it cannot provide information about whether H0 is in fact true; this also shows that p cannot be the probability that the alternative hypothesis is true. 2) The p-value is very dependent on the sample size. 3) It is not true that the p-value is the probability that any observed difference is simply attributable to the chance selection of subjects from the target population; the p-value is calculated based on the assumption that chance is the only reason for observing any difference, so it cannot provide evidence for the truth of that statement. A study with p = 0.531 has much less evidence against H0 than a study with p = 0.058; however, a study with p = 0.058 provides similar evidence to a study with p = 0.049, and a study with p = 0.049 has much less evidence than one with p = 0.015. Although a very small p-value does provide strong evidence that H0 is not true, a very large p-value, even as large as 0.88, does not provide real evidence that H0 is true: the alternative hypothesis might in fact be true but, owing to a small sample size, the study did not have enough power to detect that H0 was likely to be false. If we state one hypothesis only and the aim of the statistical test is to verify whether this hypothesis is not false, but not, at the same time, to investigate other hypotheses, then such a test is called a significance test.

information gain, what is

We can then use entropy to measure the information gain, defined as the change in entropy from the original state to the weighted potential outcomes of the following state.
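
A compact sketch with a hypothetical split (the parent labels come from the entropy card above; the split itself is an assumption): information gain as the drop from the parent's entropy to the weighted entropy of its children.

```python
import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

parent = [0, 0, 0, 1, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [1, 1, 1, 1]      # a hypothetical split

weighted = (len(left) / len(parent)) * entropy(left) + \
           (len(right) / len(parent)) * entropy(right)
print(round(entropy(parent) - weighted, 3))   # information gain ~= 0.549
```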

bias variance trade off?

When a model has high bias it means that it is very simple (underfitting) and that adding more features should improve it. For high-variance models an alternative is feature reduction, but including more training data is also a viable option. As a general rule, the more flexible a model is, the higher its variance and the lower its bias; the less flexible a model is, the lower its variance and the higher its bias. Bias is the error that occurs when trying to approximate the behavior of a problem's data: it quantifies how much, on average, the predicted values differ from the actual values (think of a bullseye: low bias hits the bullseye and does well on unseen data, high bias does not). A high bias error means we have an under-performing, over-simplified model that keeps missing important trends. Variance is the amount by which the model would change with a different training set: it quantifies how predictions made on the same observation differ from each other. Variance is error introduced by an overly complex model that learns noise from the training data set and performs badly on the test data set; it can lead to high sensitivity and overfitting, so a high-variance model will overfit the training population and perform badly on any observation beyond training. Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias. However, this only happens up to a particular point: as you continue to make your model more complex, you end up overfitting, and your model starts suffering from high variance.

dimension reduction pca

Is rotation necessary in PCA? Yes, rotation (orthogonal) is necessary because it maximizes the difference between the variance captured by the components. In PCA we aim to select fewer components (than features) that can explain the maximum variance in the data set. Rotation does not change the relative location of the components; it only changes the actual coordinates of the points. If we don't rotate the components, the effect of PCA diminishes and we have to select a larger number of components to explain the variance in the data set. A brief sketch follows below.
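
A brief sketch, assuming scikit-learn and synthetic data with redundant columns (both assumptions): PCA keeps the few components that explain most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)   # redundant with column 0
X[:, 4] = X[:, 1] + 0.1 * rng.normal(size=200)   # redundant with column 1

pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)         # share of variance per component
print(pca.explained_variance_ratio_.sum())   # three components cover most of it
```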

Transformation -> Eigenvalues: eigenvalues, eigenvectors on 3x3 matrix

https://www.simplilearn.com/data-science-interview-questions-article
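
A short sketch of the computation with NumPy (the 3x3 matrix below is an arbitrary example, not from the linked article):

```python
import numpy as np

A = np.array([[2.0, 0.0, 0.0],
              [0.0, 3.0, 4.0],
              [0.0, 4.0, 9.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)     # one eigenvalue per column of `eigenvectors`
print(eigenvectors)    # each column v satisfies A @ v == eigenvalue * v
```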

Descriptive Statistics -> Standard Deviation, Confidence Interval

The 68-95-99.7 rule: for a normal distribution, about 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three. A wider interval gives a higher confidence level (you are more likely to capture the true value) but less precision; with a smaller 'net' you get more precision but a lower confidence level.

lambda or alpha in ridge and lasso

Lambda (called alpha in scikit-learn) is the regularization strength: the tuning parameter that multiplies the penalty term (L1 for Lasso, L2 for Ridge). Larger values shrink the coefficients more (higher bias, lower variance), while lambda = 0 recovers ordinary least squares. It is usually chosen by cross-validation.

