Statistical tools

1. You want to test a hypothesis about one or more categorical variables. If one or more of your variables is quantitative, you should use a different statistical test. Alternatively, you could convert the quantitative variable into a categorical variable by separating the observations into intervals.
2. The sample was randomly selected from the population.
3. There are a minimum of five observations expected in each group or combination of groups.

A Pearson's chi-square test may be an appropriate option for your data if all of the following are true:

alternative hypothesis

Assumes that the means of the two groups are significantly different

• Assumes that the dependent variable is normally distributed.
• Assumes that the variances of the two groups are the same for the dependent variable (homogeneity of variance).
• Assumes that the two samples are independent of each other.
• Samples are drawn from the population at random.
• In an independent samples t-test, all observations must be independent of each other.
• In an independent samples t-test, the dependent variable must be measured on an interval or ratio scale.

Assumptions in independent samples t-test:

• -1: Perfect negative correlation. The variables tend to move in opposite directions (i.e., when one variable increases, the other variable decreases).
• 0: No correlation. The variables do not have a relationship with each other.
• 1: Perfect positive correlation. The variables tend to move in the same direction (i.e., when one variable increases, the other variable also increases).

The interpretations of the values are:

1. How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
2. The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

You can use multiple linear regression when you want to know:

p-value

ranges from 0% to 100% and is usually written as a decimal (for example, a p value of 5% is 0.05)

post-hoc tests

are used to identify the population means that are different

Independent Samples t-test

compares the means for two groups.

p-value

from a t test is the probability that the results from your sample data occurred by chance

low p-value

indicates your data did not occur by chance.

Mann-Whitney U Test

is a common statistical test that is used in many fields including economics, biological sciences and epidemiology. It is particularly useful when you are assessing the difference between two independent groups with low numbers of individuals in each group (usually less than 30), which are not normally distributed, and where the data are continuous.

Kendall Rank Correlation Coefficient (Kendall's Tau)

is a nonparametric measure of relationships between columns of ranked data. The Tau correlation coefficient returns a value between -1 and 1, where:
• 0 is no relationship,
• 1 is a perfect positive relationship,
• -1 is a perfect negative relationship.

t score

is the ratio of the difference between two groups to the difference within the groups.
• Larger t scores = more difference between groups.
• Smaller t scores = more similarity between groups.

correlation

is a statistical measure of the relationship between two variables. The measure is best used in variables that demonstrate a linear relationship between each other.

independent t-test

is a statistical technique used to compare the means of two independent groups.

ANOVA

is also called the Fisher analysis of variance, and it is the extension of the t- and z-tests. The term became well-known in 1925, after appearing in Fisher's book, "Statistical Methods for Research Workers." It was employed in experimental psychology and later expanded to subjects that were more complex.

A two-way ANOVA

is an extension of the one-way ANOVA. With a one-way, you have one independent variable affecting a dependent variable

one-way ANOVA

is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups
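The F statistic behind a one-way ANOVA can be sketched in pure Python; the function and the small data set below are hypothetical illustrations, not part of the original card:

```python
def one_way_anova_f(*groups):
    # F = between-group mean square / within-group mean square
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total observations
    grand = sum(sum(g) for g in groups) / n  # grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# three hypothetical groups with means 2, 3 and 4
print(one_way_anova_f([1, 2, 3], [2, 3, 4], [3, 4, 5]))  # → 3.0
```

The resulting F value is then compared with the F-distribution's critical value at (k - 1, n - k) degrees of freedom.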

Multiple linear regression

is used to estimate the relationship between two or more independent variables and one dependent variable

simple linear regression

is used to estimate the relationship between two quantitative variables.

linear regression analysis

is used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable. The variable you are using to predict the other variable's value is called the independent variable.
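A minimal least-squares fit illustrates the idea; `fit_line` and its data are hypothetical, assuming one independent and one dependent variable:

```python
def fit_line(x, y):
    # least-squares estimates: slope = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    return slope, intercept

# hypothetical data lying exactly on y = 2x + 1
print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))  # → (2.0, 1.0)
```

Predictions for new values of the independent variable are then `slope * x + intercept`.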

null hypothesis

assumes that the means of the two groups are not significantly different.

1. easier to implement machine learning methods
2. suitable for linearly separable datasets
3. provides valuable insights

key advantages of logistic regression

1. Identify dependent variables to ensure the model's consistency
2. Discover the technical requirements of the model
3. Estimate the model and evaluate the goodness of the fit
4. Appropriately interpret the results
5. Validate observed results
6. Analysis of Variance (ANOVA)

Let's understand the logistic regression best practices in detail.

frequency distribution table

shows the number of observations in each group.
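Building such a table is a one-liner with the standard library; the `colors` data here is a hypothetical example:

```python
from collections import Counter

# hypothetical categorical observations
colors = ["red", "blue", "red", "green", "blue", "red"]
freq = Counter(colors)  # number of observations in each group
for group, count in sorted(freq.items()):
    print(group, count)
# → blue 2
#   green 1
#   red 3
```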

linearity

the line of best fit through the data points is a straight line (rather than a curve or some sort of grouping factor).

Independence of observations

the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among observations.

regression models

describe the relationship between variables by fitting a line to the observed data. Linear regression models use a straight line, while logistic and nonlinear regression models use a curved line. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change

The t- and z-test methods

developed in the 20th century were used for statistical analysis until 1918, when Ronald Fisher created the analysis of variance method

t-value

has a p-value to go with it

Phi coefficient (Mean Square Contingency Coefficient)

is a measure of the association between two binary variables

logistic regression

is classified into binary, multinomial, and ordinal. Each type differs from the other in execution and theory.

one sample t-test

tests the mean of a single group against a known mean.

• -1 indicates a perfectly negative relationship between the two variables.
• 0 indicates no association between the two variables.
• 1 indicates a perfectly positive relationship between the two variables.
In general, the further away a Phi Coefficient is from zero, the stronger the relationship between the two variables. In other words, the further away a Phi Coefficient is from zero, the more evidence there is for some type of systematic pattern between the two variables.

Phi Coefficient takes on values between -1 and 1 where:

1. Tau-A and Tau-B are usually used for square tables (with equal columns and rows). Tau-B will adjust for tied ranks.
2. Tau-C is usually used for rectangular tables. For square tables, Tau-B and Tau-C are essentially the same.
Formula: Kendall's Tau = (C - D) / (C + D)
Where: C is the number of concordant pairs and D is the number of discordant pairs.

Several versions of Tau exist
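The Tau-A formula above can be computed directly by counting concordant and discordant pairs; this sketch assumes no tied ranks and uses hypothetical data:

```python
def kendall_tau(x, y):
    # Tau-A: (C - D) / (C + D); assumes no tied ranks
    c = d = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1   # concordant pair
            elif s < 0:
                d += 1   # discordant pair
    return (c - d) / (c + d)

# hypothetical ranks moving together
print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # → 1.0
```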

homogeneity of variance (homoscedasticity)
independence of observations
normality
linearity

Simple linear regression is a parametric test, meaning that it makes certain assumptions about the data. These assumptions are:

determining the probability of heart attacks
possibility of enrolling into a university
identifying spam emails

Some examples of such classifications and instances where the binary response is expected or implied are:

• You don't need to provide a reference or formula since the Pearson correlation coefficient is a commonly used statistic.
• You should italicize r when reporting its value.
• You shouldn't include a leading zero (a zero before the decimal point) since the Pearson correlation coefficient can't be greater than one or less than negative one.
• You should provide two significant digits after the decimal point.

You can follow these rules if you want to report statistics in APA Style:

• You don't need to provide a reference or formula since the chi-square test is a commonly used statistic.
• Refer to chi-square using its Greek symbol, Χ2. Although the symbol looks very similar to an "X" from the Latin alphabet, it's actually a different symbol. Greek symbols should not be italicized.
• Include a space on either side of the equal sign.
• Include a leading zero (a zero before the decimal point) since the chi-square value can be greater than one.
• Provide two significant digits after the decimal point.
• Report the chi-square alongside its degrees of freedom, sample size, and p value, following this format: Χ2 (degrees of freedom, N = sample size) = chi-square value, p = p value.

You can follow these rules if you want to report statistics in APA Style:

Pearson correlation coefficient

also tells you whether the slope of the line of best fit is negative or positive. When the slope is negative, r is negative. When the slope is positive, r is positive.

logistic regression

analyzes the relationship between one or more independent variables and classifies data into discrete classes. It is extensively used in predictive modeling, where the model estimates the mathematical probability of whether an instance belongs to a specific category or not.

Ordinal logistic regression

applies when the dependent variable is in an ordered state (i.e., ordinal). The dependent variable (y) specifies an order with two or more categories or levels.

a priori comparisons

are performed before the data are collected, and post-hoc (or a posteriori) comparisons are done after the data have been collected. When the null hypothesis of an analysis of variance (ANOVA) model is rejected, these comparisons identify which group means differ.

non-parametric tests

are used for data that don't follow the assumptions of parametric tests, especially the assumption of a normal distribution.

one-way (or unidirectional) ANOVA and two-way ANOVA

2 main types of ANOVA

Independent Samples t-test
Paired Samples t-test
one sample t-test

3 main types of t-test

Pearson correlation coefficient

can also be used to test whether the relationship between two variables is significant.

parametric tests

can't test hypotheses about the distribution of a categorical variable, but they can involve a categorical variable as an independent variable (e.g., ANOVAs).

Analysis of Covariance (ANCOVA)

combines ANOVA and regression. It can be useful for understanding within-group variance that ANOVA tests do not explain.

Paired Samples t-test

compares means from the same group at different times (say, one year apart).

MANOVA (multivariate ANOVA)

differs from ANOVA as the former tests for multiple dependent variables simultaneously while the latter assesses only one dependent variable at a time.

one-way ANOVA

evaluates the impact of a sole factor on a sole response variable. It determines whether all the samples are the same.

The correlation coefficient that indicates the strength of the relationship between two variables can be found using the following formula:
rxy = Σ(xi - x̅)(yi - ȳ) / √[Σ(xi - x̅)² × Σ(yi - ȳ)²]
Where:
• rxy - the correlation coefficient of the linear relationship between the variables x and y
• xi - the values of the x-variable in a sample
• x̅ - the mean of the values of the x-variable
• yi - the values of the y-variable in a sample
• ȳ - the mean of the values of the y-variable

how to find the correlation?
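The formula above translates directly into code; `pearson_r` and its sample data are hypothetical illustrations:

```python
import math

def pearson_r(x, y):
    # r = Σ(xi - x̄)(yi - ȳ) / √[Σ(xi - x̄)² · Σ(yi - ȳ)²]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)
                    * sum((yi - my) ** 2 for yi in y))
    return num / den

# hypothetical perfectly linear data (y = 2x)
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))  # → 1.0
```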

1. create a table of the observed and expected frequencies
2. calculate the chi-square value
3. find the critical chi-square value
4. compare the chi-square value to the critical value
5. decide whether to reject the null hypothesis

how to perform a chi-square test
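Step 2 above (the chi-square statistic, Χ² = Σ (O - E)² / E) can be sketched like this, with hypothetical die-roll counts as the observed data:

```python
def chi_square(observed, expected):
    # Χ² = Σ (O - E)² / E
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# hypothetical 60 die rolls; under the null, each face is expected 10 times
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6
stat = chi_square(observed, expected)
print(round(stat, 2))  # → 1.0
```

The resulting value is then compared with the critical chi-square value at df = 5 (number of categories minus one) to decide whether to reject the null hypothesis.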

Post hoc

in Latin means 'after this'. Simply put, a post-hoc analysis refers to a statistical analysis specified after a study has been concluded and the data collected.

pearson correlation coefficient

is a descriptive statistic, meaning that it summarizes the characteristics of a dataset. Specifically, it describes the strength and direction of the linear relationship between two quantitative variables.

Wilcoxon signed-rank test

is a non-parametric statistical hypothesis test used either to test the location of a population based on a sample of data, or to compare the locations of two populations using two matched samples. The one-sample version serves a purpose similar to that of the one-sample Student's t-test.

Mann-Whitney U Test (Wilcoxon Rank Sum Test)

is a non-parametric statistical test used to compare two samples or groups.

chi-square test

is a statistical test for categorical data. It is used to determine whether your data are significantly different from what you expected

t test

is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine whether a process or treatment actually has an effect on the population of interest, or whether two groups are different from one another.

logistic regression

is a supervised machine learning algorithm that accomplishes binary classification tasks by predicting the probability of an outcome, event, or observation. The model delivers a binary or dichotomous outcome limited to two possible outcomes: yes/no, 0/1, or true/false
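Prediction in a fitted logistic regression model is just a sigmoid over a weighted sum; the weights below are hypothetical (an actual model would learn them from data):

```python
import math

def predict_prob(x, w, b):
    # sigmoid of the linear combination: estimated P(class = 1 | x)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))

def classify(x, w, b, threshold=0.5):
    # dichotomous outcome: 1 (yes/true) or 0 (no/false)
    return 1 if predict_prob(x, w, b) >= threshold else 0

# hypothetical, already-fitted weights
print(classify([2.0, 1.0], w=[1.5, -0.5], b=-1.0))  # → 1
```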

McNemar's Test

is a test that uses the chi-square test statistic. It isn't a variety of Pearson's chi-square test, but it's closely related. You can conduct this test when you have a related pair of categorical variables that each have two groups. It allows you to determine whether the proportions of the variables are equal.

correlation coefficient

is a value that indicates the strength of the relationship between variables. The coefficient can take any values from -1 to 1.

pearson correlation coefficient

is also an inferential statistic, meaning that it can be used to test statistical hypotheses. Specifically, we can test whether there is a significant relationship between two variables.

Analysis of variance (ANOVA)

is an analysis tool used in statistics that splits an observed aggregate variability found inside a data set into two parts: systematic factors and random factors. The systematic factors have a statistical influence on the given data set, while the random factors do not

independent t-test (two-sample t-test, independent-samples t-test or student's t-test)

is an inferential statistical test that determines whether there is a statistically significant difference between the means in two unrelated groups.
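The pooled-variance t statistic for two unrelated groups can be sketched as follows; the function name and the two tiny samples are hypothetical:

```python
import math

def t_statistic(a, b):
    # Student's t for two independent samples (pooled variance)
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)  # sample variances
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# hypothetical unrelated groups
print(round(t_statistic([5, 6, 7], [1, 2, 3]), 3))  # → 4.899
```

The statistic is then compared with the t-distribution at n1 + n2 - 2 degrees of freedom to obtain a p value.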

Pearson correlation coefficient

is a measure of how close the observations are to a line of best fit

logistic regression (logit model)

is often used for classification and predictive analytics. It estimates the probability of an event occurring, such as voted or didn't vote, based on a given dataset of independent variables.

chi-square

is often written as Χ2 and is pronounced "kai-square" (rhymes with "eye-square"). It is also called chi-squared.

Pearson correlation coefficient

is one of several correlation coefficients that you need to choose between when you want to measure a correlation.

Pearson correlation coefficient (r)

is the most common way of measuring a linear correlation. It is a number between -1 and 1 that measures the strength and direction of the relationship between two variables.

pearson correlation coefficient (r)

is the most widely used correlation coefficient and is known by many names: • Pearson's r • Bivariate correlation • Pearson product-moment correlation coefficient (PPMCC) • The correlation coefficient

chi-square goodness of fit test

is used to test whether the frequency distribution of a categorical variable is different from your expectations.

chi-square test of independence

is used to test whether two categorical variables are related to each other.

paired t-test (correlated pairs t-test, paired samples t-test, dependent samples t-test)

is where you run a t test on dependent samples. Dependent samples are essentially connected — they are tests on the same person or thing. For example:
• Knee MRI costs at two different hospitals,
• Two tests on the same person before and after training,
• Two blood pressure measurements on the same person using different equipment.
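For such dependent samples, the paired t statistic is the mean of the per-subject differences over its standard error; this sketch uses hypothetical before/after scores:

```python
import math

def paired_t(before, after):
    # t = mean(differences) / (sd(differences) / √n)
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    sd = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
    return mean_d / (sd / math.sqrt(n))

# hypothetical before/after scores for the same three people
print(round(paired_t([10, 12, 14], [13, 14, 17]), 3))  # → 8.0
```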

Spearman rank order correlation coefficient (Spearman's rank correlation)

measures the strength and direction of association between two ranked variables. It basically gives the measure of monotonicity of the relation between two variables i.e. how well the relationship between two variables could be represented using a monotonic function.
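With no tied ranks, Spearman's coefficient reduces to rho = 1 - 6·Σd² / (n(n² - 1)), where d is the rank difference per observation; this hypothetical sketch assumes no ties:

```python
def spearman_rho(x, y):
    # rho = 1 - 6·Σd² / (n(n² - 1)); assumes no tied ranks
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# hypothetical monotonically related data
print(spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]))  # → 1.0
```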

binary logistic regression

predicts the relationship between the independent and binary dependent variables.

Mann-Whitney U test

assesses whether two sampled groups are likely to derive from the same population, and essentially asks: do these two populations have the same shape with regards to their data? In other words, we want evidence as to whether the groups are drawn from populations with different levels of a variable of interest.

t test

tells you how significant the differences between group means are. It lets you know if those differences in means could have happened by chance

homogeneity of variance (homoscedasticity)

the size of the error in our prediction doesn't change significantly across the values of the independent variable.

two-way ANOVA

there are two independent variables affecting a dependent variable.

Multinomial logistic regression

applies when the dependent variable has more than two possible outcomes.

chi-square test of homogeneity

is another variety of Pearson's chi-square test. It tests whether two populations come from the same distribution by determining whether the two populations have the same proportions as each other. You can consider it simply a different way of thinking about the chi-square test of independence.

ANOVA test

is used to determine the influence that independent variables have on the dependent variable in a regression study.

contingency table

shows the number of observations in each combination of groups.

t test

uses a t-statistic and compares this to t-distribution values to determine if the results are statistically significant

scatterplot

lets us generally assess the relationship between the variables and determine whether they are correlated or not.

independent t-test

when we take two samples from the same population, then the mean of the two samples may be identical. But when samples are taken from two different populations, then the mean of the sample may differ. In this case, it is used to draw conclusions about the means of two populations, and used to tell whether or not they are similar

chi-square goodness of fit test

when you have one categorical variable. It allows you to test whether the frequency distribution of the categorical variable is significantly different from your expectations. Often, but not always, the expectation is that the categories will have equal proportions.

chi-square test of independence

when you have two categorical variables. It allows you to test whether the two variables are related to each other. If two variables are independent (unrelated), the probability of belonging to a certain group of one variable isn't affected by the other variable.

wilcoxon test

which can refer to either the rank sum test or the signed rank test version, is a nonparametric statistical test that compares two paired groups. The tests essentially calculate the difference between sets of pairs and analyze these differences to establish if they are statistically significantly different from one another

• The null hypothesis (H0) is that the two populations are equal.
• The alternative hypothesis (H1) is that the two populations are not equal.

It follows that the hypotheses in a Mann-Whitney U Test are:

ANOVA (analysis of variance)

It is a statistical method used to analyze the differences between the means of two or more groups or treatments. It is often used to determine whether there are any statistically significant differences between the means of different groups.

• The Wilcoxon test compares two paired groups and comes in two versions, the rank sum test and the signed rank test.
• The goal of the test is to determine if two or more sets of pairs are different from one another in a statistically significant manner.
• Both versions of the model assume that the pairs in the data come from dependent populations, i.e., following the same person or share price through time or place.

KEY TAKEAWAY of wilcoxon

1. The dependent/response variable is binary or dichotomous
2. Little or no multicollinearity between the predictor/explanatory variables
3. Linear relationship of independent variables to log odds
4. Prefers large sample size
5. Problem with extreme outliers
6. Consider independent observations

Logistic Regression Best Practices

• The variable being compared between the two groups must be continuous (able to take any number in a range - for example age, weight, height or heart rate). This is because the test is based on ranking the observations in each group.
• The data are assumed to take a non-Normal, or skewed, distribution. If your data are normally distributed, the unpaired Student's t-test should be used to compare the two groups instead.
• While the data in both groups are not assumed to be Normal, the data are assumed to be similar in shape across the two groups.
• The data should be two randomly selected independent samples, meaning the groups have no relationship to each other. If samples are paired (for example, two measurements from the same group of participants), then a paired samples t-test should be used instead.
• Sufficient sample size is needed for a valid test, usually more than 5 observations in each group.

Some key assumptions for Mann-Whitney U Test are detailed below:
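Under those assumptions, the U statistic itself comes from ranking the pooled observations; this hypothetical sketch further assumes no tied values:

```python
def mann_whitney_u(a, b):
    # rank all observations together (1 = smallest); assumes no ties
    combined = sorted((v, g) for g, vals in enumerate((a, b)) for v in vals)
    r1 = sum(rank for rank, (v, g) in enumerate(combined, start=1) if g == 0)
    n1, n2 = len(a), len(b)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 - u1
    return min(u1, u2)   # the smaller U is the test statistic

# hypothetical small, non-overlapping groups
print(mann_whitney_u([1, 2, 3], [7, 8, 9]))  # → 0.0
```

The smaller U is then compared with a critical value from Mann-Whitney tables for the given group sizes.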

both variables are quantitative
the variables are normally distributed
the data have no outliers
the relationship is linear

The Pearson correlation coefficient is a good choice when all of the following are true:

• A value of +1 means a perfect association of rank
• A value of 0 means that there is no association between ranks
• A value of -1 means a perfect negative association of rank

The Spearman Rank Correlation can take a value from +1 to -1 where

Normality

The data follows a normal distribution.

chi-square goodness of fit test
chi-square test of independence

There are two types of Pearson's chi-square tests:

