Statistical questions

AUC - ROC Curve

* Used to evaluate the performance of binary classifiers such as logistic regression models
* Tells how well the model is able to distinguish between classes
* Captures the threshold tradeoff between the true positive rate and the false positive rate

The ROC curve plots the true positive rate (y-axis) against the false positive rate (x-axis); the AUC is the area under that curve.
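The AUC can also be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal pure-Python sketch of that pairwise interpretation (the labels and scores below are made-up illustrations):

```python
def roc_auc(labels, scores):
    """AUC = probability a random positive outranks a random negative
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # one pair mis-ranked
```

An AUC of 1.0 means perfect ranking; 0.5 means the scores are no better than chance.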

A/B testing

- Variations are chosen and shown to different users at random
- Statistical analysis is used to determine which variation performs better

Steps:
- Get baseline data: conversions, traffic, clickthrough rate, etc.
- Form a hypothesis
- Create the experiment and run it
- Calculate the sample mean and standard deviation and check for statistical significance
- Analyze the results and draw conclusions
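For conversion-rate experiments, the significance check in the steps above is often a two-proportion z-test. A hedged sketch (the visitor and conversion counts are invented for illustration):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for H0: the two conversion rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# 20% conversion for variant A vs 26% for variant B, 1000 users each
z = two_proportion_z(200, 1000, 260, 1000)
# |z| > 1.96 rejects H0 at the 5% significance level (two-sided)
```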

Confidence intervals

A confidence interval (CI) is a range of values that's likely to include a population value with a certain degree of confidence. It is often expressed as a percentage (e.g. a 95% CI) whereby the population mean is estimated to lie between a lower and an upper bound.

K-folds

A cross-validation technique. Steps:
1) Split the data randomly into k folds (non-overlapping groups)
2) Iterate through the folds, using each fold as the test set and its complement (everything not in that fold) as the training set
3) Take the average of the recorded scores; that is your performance metric
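The steps above can be sketched in pure Python; `score_fn` here is a placeholder for whatever fit-and-evaluate routine you plug in:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k non-overlapping folds."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(data, k, score_fn):
    """score_fn(train, test) -> score; returns the average over all folds."""
    scores = []
    for fold in k_fold_indices(len(data), k):
        test_idx = set(fold)
        test = [data[i] for i in fold]
        train = [data[i] for i in range(len(data)) if i not in test_idx]
        scores.append(score_fn(train, test))
    return sum(scores) / k
```

Every data point lands in exactly one test fold, so each example is used for testing once and for training k-1 times.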

F-statistic

- ANOVA: find out whether the means of two or more groups are significantly different
- Regression: tests the probability that the regression coefficients are all 0
- In both cases, compare the F value to an F critical value

Random Forest

An algorithm used for regression or classification that builds a collection of decision trees; the trees "vote" on the prediction.
- A large number of individual trees act as an ensemble
- Each tree makes a prediction, and the class with the most votes becomes the model's prediction
- A randomly selected subset of features is used for each split

Bootstrapping

Any test or metric that relies on random sampling with replacement
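A common application is the percentile bootstrap confidence interval: resample the data with replacement many times, recompute the statistic each time, and take quantiles of the results. A minimal sketch (the sample data and resample count are illustrative choices):

```python
import random

def bootstrap_ci(sample, stat=lambda xs: sum(xs) / len(xs),
                 n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a statistic (default: the mean)."""
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(sample) for _ in sample])  # one resample, same size
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci([2, 4, 4, 4, 5, 5, 7, 9])  # 95% CI for the mean
```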

Standard deviation

Roughly, the average distance of data points from the mean; formally, the square root of the average squared deviation from the mean.
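Spelled out for a population (the sample version divides by n - 1 instead of n):

```python
import math

def pstdev(xs):
    """Population standard deviation: sqrt of the mean squared deviation."""
    mu = sum(xs) / len(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
```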

Bagging

Bagging is used when the goal is to reduce the variance of a decision tree classifier. The objective is to create several subsets of the training data by sampling randomly with replacement. Each subset is used to train its own decision tree, giving an ensemble of different models. The predictions from the different trees are averaged, which is more robust than a single decision tree classifier.
- Ensemble: train multiple models using the same algorithm
- Randomly sample with replacement, fit a learner on each sample, and average them
- All features are considered at each split (unlike random forest)
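A minimal sketch of the idea, substituting one-dimensional "decision stumps" (single-threshold classifiers) for full trees; the data set and the stump learner are illustrative assumptions, not a production implementation:

```python
import random

def fit_stump(points):
    """Tiny learner on (x, label) pairs: pick the threshold and direction
    that minimize training error."""
    best = (None, None, len(points) + 1)
    for t, _ in points:                      # candidate thresholds = x values
        for sign in (1, -1):
            errs = sum(1 for x, y in points
                       if (1 if sign * (x - t) > 0 else 0) != y)
            if errs < best[2]:
                best = (t, sign, errs)
    return best[:2]

def stump_predict(model, x):
    t, sign = model
    return 1 if sign * (x - t) > 0 else 0

def bag(points, n_models=25, seed=0):
    """Bagging: fit each stump on a bootstrap sample, then majority-vote."""
    rng = random.Random(seed)
    models = [fit_stump([rng.choice(points) for _ in points])
              for _ in range(n_models)]
    def predict(x):
        votes = sum(stump_predict(m, x) for m in models)
        return 1 if votes * 2 > len(models) else 0
    return predict
```

Each stump sees a slightly different bootstrap sample, so their errors partially cancel when averaged, which is exactly the variance reduction bagging is after.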

Binomial Distribution

The binomial distribution can be thought of as the probability of a success or failure outcome in an experiment or survey that is repeated multiple times; it is a type of distribution with two possible outcomes per trial. Its probability mass function is P(X = k) = C(n, k) * p^k * (1 - p)^(n - k).

Confusion Matrix

The confusion matrix is a 2x2 table that contains the 4 outputs of a binary classifier: true positives, false positives, false negatives, and true negatives. Various measures, such as error rate, accuracy, specificity, sensitivity, precision, and recall, are derived from it. It gives you an overview of the classifications that your model made.
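Building the 2x2 counts and the derived measures is straightforward; a sketch, assuming 0/1-encoded labels:

```python
def confusion_metrics(y_true, y_pred):
    """Count the four confusion-matrix cells and derive common measures."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

m = confusion_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                      [1, 0, 1, 0, 0, 1, 1, 0])
```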

Machine Learning

A technique for making a computer produce better results by learning from past experience. It is seen as a subset of artificial intelligence.
- Features: independent variables
- Label: response variable
- Supervised: input and output data are used to build the classifier
- Unsupervised: just input data

Boosting

Boosting is used to create a collection of predictors. In this technique, learners are trained sequentially, with early learners fitting simple models to the data, which is then analysed for errors. Consecutive trees (fit on random samples) are added, and at every step the goal is to improve on the accuracy of the prior tree. When an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. This process converts weak learners into a better-performing model.
- Misclassified data gets increased weight so that subsequent learners focus on it
- Sequential rather than parallel
- Weighted average of learners; better performance = more weight

Cross validation

Cross validation (also called rotation estimation or out-of-sample testing) is one way to ensure your model is robust. A portion of your data (called a holdout sample) is held back; the model is trained on the bulk of the data, and the holdout sample is used to test it.

Decision tree

A decision tree builds classification or regression models in the form of a tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The topmost node in the tree is the root node.

Central limit theorem (CLT)

If you draw repeated large samples (n > 30) from a population and calculate their means, the means will follow a normal distribution. The CLT states that the distribution of sample means is approximately normal, assuming all samples are identical in size, regardless of the shape of the population distribution. The CLT is important because in real life we cannot collect data on an entire population, so it lets us use a sample to make inferences about the population.

P-value

In statistics, the p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is correct. The p-value is used as an alternative to rejection points: it is the smallest level of significance at which the null hypothesis would be rejected. A smaller p-value means there is stronger evidence in favor of the alternative hypothesis.

Log transformation

A logarithmic transformation is a convenient means of transforming a highly skewed variable into a more nearly normal one. It converts each data point to its logarithm.

Precision, Recall, Accuracy

Precision: % of returned results that are relevant, TP / (TP + FP)
Recall/Sensitivity: % of relevant results that are correctly retrieved, TP / (TP + FN)
Accuracy: % of all predictions that are correct, (TP + TN) / total

Chi-square test

There are two types of chi-square tests. Both use the chi-square statistic and distribution for different purposes:
1/ Chi-square goodness of fit test: determines whether sample data matches a population.
2/ Chi-square test for independence: compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether distributions of categorical variables differ from each other.
* A very small chi-square test statistic means that your observed data fits your expected data extremely well; you fail to reject the null hypothesis (for the independence test, no evidence of a relationship).
* A very large chi-square test statistic means that the data does not fit well; you reject the null hypothesis (for the independence test, evidence of a relationship).

Z-test T-test Anova test Chi-Square Test

Z-test: tests whether the sample mean = population mean
T-test: used to compare the means of two given samples
Anova: also known as analysis of variance, is used to compare multiple (three or more) samples with a single test
Chi-square test: compares categorical variables (goodness of fit or independence, as above)

K nearest Neighbor

A supervised machine learning model that uses the data points closest to the one being classified in order to come up with a prediction. The KNN algorithm assumes that similar things exist in close proximity; in other words, similar things are near each other. Steps:
- Choose a value for k, typically sqrt(n) where n is the total number of data points
- For a query point, calculate its distance to every example and sort from smallest to largest
- Take the k nearest examples and return the majority vote (mode) of their class labels
- If regression, return the mean of the k nearest neighbors' values
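The classification steps above fit in a few lines of pure Python; the training points below are invented for illustration:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs, point = tuple of coordinates.
    Classify the query as the majority label of its k nearest points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
```

For regression, the last line would instead return the mean of the k neighbors' values.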

Permutations

An arrangement or listing in which order or placement is important. The number of permutations of r items chosen from n is nPr = n! / (n - r)!

K-Means

An unsupervised machine learning model that separates data into clusters. "k" indicates the number of clusters and "means" refers to the clusters' centroids. Steps:
- Specify the number of clusters k
- Randomly select k data points to be the initial cluster centers
- Assign the other data points to cluster centers based on Euclidean distance
- Recalculate each cluster center as the mean of all data points in that cluster
- Repeat, iteratively minimizing the within-cluster sum of squares, until the cluster centers no longer change
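The assign/update loop above (Lloyd's algorithm) can be sketched in pure Python; for simplicity the initial centers are passed in rather than chosen at random, and the data is 2-D and made up:

```python
import math

def k_means(points, centers, max_iter=100):
    """Lloyd's algorithm: alternate assignment and update steps."""
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster
            else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:   # converged: centers stopped moving
            break
        centers = new_centers
    return centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
```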

Law of Large Numbers

As the sample size increases, the sample mean gets closer to the population mean.
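A quick simulation makes this concrete: average many rolls of a fair die (expected value 3.5) and watch the sample mean settle near the population mean. The seed and roll count are arbitrary choices:

```python
import random

def mean_of_die_rolls(n, seed=0):
    """Average of n fair six-sided die rolls; E[X] = 3.5."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n)) / n

big_sample_mean = mean_of_die_rolls(100_000)  # should sit very close to 3.5
```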

Correlation (r)

r = cov(x, y) / [sd(x) * sd(y)]
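The formula translated directly into code, using population (divide-by-n) covariance and standard deviations:

```python
import math

def correlation(xs, ys):
    """Pearson r = cov(x, y) / (sd(x) * sd(y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sdx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sdy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sdx * sdy)
```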

R squared

A statistical measure that represents the proportion of the variance of the dependent variable (Y) that is explained by the independent variables (Xs) in a regression model. For example, if R2 = 0.5, the independent variables explain approximately 50% of the variation in the dependent variable.

Logistic Regression

An appropriate regression analysis to conduct when the dependent variable is binary (categorical). Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables.

Z score

A measure of how many standard deviations a point is away from the mean. z = (x - mean(X)) / sd(X)

Combinations

A selection of items in which order doesn't matter. The number of combinations of r items chosen from n is nCr = n! / [r! * (n - r)!]
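Python's standard library has both counts built in, which makes the contrast with permutations easy to check (the "10 runners" scenario is an invented example):

```python
import math

# Permutations: order matters.  nPr = n! / (n - r)!
# Combinations: order doesn't. nCr = n! / (r! * (n - r)!)
ordered_podiums = math.perm(10, 3)   # gold/silver/bronze from 10 runners
unordered_trios = math.comb(10, 3)   # any 3 of the 10, order ignored
```

Dividing the permutation count by r! collapses the orderings of each group, which is why nPr = nCr * r!.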

T-test

t = (mean(x) - null hypothesis mean) / (s / sqrt(n))
s = sample standard deviation
n = sample size
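The one-sample formula above, spelled out (the sample data is an invented example):

```python
import math

def t_statistic(sample, mu0):
    """How many standard errors the sample mean lies from mu0."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))  # sample sd
    return (mean - mu0) / (s / math.sqrt(n))

t = t_statistic([5, 6, 7, 8, 9], mu0=5)
```

The resulting t is compared against a t distribution with n - 1 degrees of freedom.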

Confidence interval formula

CI = x bar ± z * (s / sqrt(n))
x bar = sample mean
z = z score for the chosen confidence level (e.g. 1.96 for 95%)
s = sample standard deviation
n = sample size
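The formula in code, using the normal-approximation critical value z = 1.96 for a 95% interval (the sample data is an invented example):

```python
import math

def confidence_interval(sample, z=1.96):
    """x_bar ± z * s / sqrt(n); z = 1.96 gives an approximate 95% CI."""
    n = len(sample)
    x_bar = sum(sample) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
    margin = z * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

lo, hi = confidence_interval([5, 6, 7, 8, 9])
```

For small samples, a t critical value with n - 1 degrees of freedom is the more accurate choice than z.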
