CJ Stats 3347: Final Review
*Definition of statistics
Mathematical procedures for organizing, summarizing, and interpreting data. Everything in stats is a number: either a constant or a variable (variables are the primary focus in later analytical techniques). A statistic is a measurable characteristic of a sample, such as a mean or standard deviation.
*Multivariate
Inferential; e.g., regression. Multivariate analysis (MVA) is based on the statistical principle of multivariate statistics, which involves observing and analyzing more than one statistical outcome variable at a time. In design and analysis, the technique is used to perform trade studies across multiple dimensions while taking into account the effects of all variables on the responses of interest.
*Fully Exhaustive and Mutually Exclusive
"Mutually exclusive" means of any two possible outcomes A and B, the logical expression A and B cannot occur. Put another way, two propositions are mutually exclusive if the following are true: "If A, then not B" and "If B, then not A." It cannot be both, ("heads" and "tails" are mutually exclusive) but neither can it be something different from "heads" or "tails" ("heads" and "tails" are collectively exhaustive). Mutually exclusive - any two events that can not occur at the same time (only can get heads or tails you can not get heads and tails when flipping a coin) Fully exhaustive - All possible outcomes are those outcomes (has to be heads or tails)
*∑
"sum all values"
*Standard error (look up equation)
The average distance between the sample mean (M) and the population mean (µ): SE = σ_M = σ / √n
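A minimal sketch of the formula in Python, using invented values (a population standard deviation of 15 and a sample size of 100 are assumptions for illustration):

```python
# Standard error: the average distance between a sample mean M and
# the population mean mu. Values below are hypothetical.
import math

sigma = 15   # assumed population standard deviation
n = 100      # assumed sample size

se = sigma / math.sqrt(n)
print(se)    # 1.5
```

Note that quadrupling n only halves the standard error, since n enters through a square root.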
*What are the criteria for causality?
1. Time order: the independent variable must precede the dependent variable in time. Changes in an independent variable (A) must occur before changes in a dependent variable (B); cause must precede effect. Remember that we don't eat raw chicken because we are going to be sick tomorrow; we are sick tomorrow because we ate raw chicken today.
2. Association: there must be an identified relationship between the independent and dependent variable, identified through a bivariate (correlation) analysis. Interpretation of the relationship between X and Y: what does the second variable do when the first variable increases? Two implications: the first variable is increasing all the time, and the second variable can only increase, decrease, or show no change. Three relationships: positive, negative, and no change (zero relationship). |1.0| is a perfect correlation; 0 means no relationship.
3. Non-spuriousness: the dependent variable must result from the independent variable. A nonspurious relationship between two variables is an association, or co-variation, that you cannot explain with a third variable; a spurious relationship results when the impact of a third variable explains the effect on both the independent and dependent variables under analysis. For example, ice cream consumption does not cause crime rates to increase; warmer seasonal temperatures cause both to increase.
Causality cannot be proven by math; this is done through theory. The math can only support the theory.
*How to find Critical Values (look at equation)
A critical value is a line on a graph that splits the graph into sections. One or two of the sections are the "rejection region"; if your test value falls into that region, then you reject the null hypothesis. https://www.youtube.com/watch?v=zTABmVSAtT0
*Positively skewed
A right-skewed distribution has a long right tail. Right-skewed distributions are also called positive-skew distributions. That's because there is a long tail in the positive direction on the number line. The mean is also to the right of the peak.
*Sample
A sample is a set of observations drawn from a population. A sample is a subset of people, items, or events from a larger population that you collect and analyze to make inferences. To represent the population well, a sample should be randomly collected and adequately large. If the sample is random and large enough, you can use the information collected from the sample to make inferences about the population. For example, you could count the number of apples with bruises in a random sample and then use a hypothesis test to estimate the percentage of all the apples that have bruises.
*A colleague of yours is concerned that your data will not approximate a normal distribution, which is important when conducting a regression analysis. What do you say to refute this claim?
A theoretical probability was used to select each case; therefore, we have best approximated the population distribution.
*Outlier
A value that is substantially different from the values obtained for the other cases in the data set
*Multi-stage Multi-cluster sampling
AKA area probability sampling: aims to be "representative of a population." Can be done with or without replacement. Used when there is no sampling frame but the population can be mapped. For example, a survey may use a sampling frame of addresses in the United States; however, this results in sampling bias, because only individuals with an official, registered address have the ability to be selected. Multi-stage sampling represents a more complicated form of cluster sampling in which larger clusters are further subdivided into smaller, more targeted groupings for the purposes of surveying. EX: Iyoke et al. (2006) used a multi-stage sampling design to survey teachers in Enugu, Nigeria, in order to examine whether socio-demographic characteristics determine teachers' attitudes towards adolescent sexuality education. First-stage sampling used a simple random sample to select 20 secondary schools in the region; the second stage selected 13 teachers from each of these schools, who were then administered questionnaires.
*Post HOC Test (look at videos)
An ANOVA does not say which group is different; in order to identify which group differs, you need to conduct post-hoc tests. Most common: Tukey's HSD and Tamhane. The choice is determined by whether variances are equal (Levene's test). All done in SPSS.
*Difference between an ANOVA and T-Test?
ANOVA is a test of mean differences between three or more groups. T-tests are a test of difference of means for one group (pretest/posttest) or between two groups.
*Independent variable
An independent variable, sometimes called an experimental or predictor variable, is a variable that is being manipulated in an experiment in order to observe the effect on a dependent variable, sometimes called an outcome variable.
*What is the first step in the standard series of steps to solve a mathematical problem?
Any calculations contained within parentheses.
*What is the difference between correlation and Chi-Square?
Chi-square resembles correlation but is not correlation: it tests whether a relationship exists between nominal/ordinal variables, i.e., whether groups are the same or different. Correlation does not look at group differences; it measures the relation between two variables and the degree to which they move together.
*Explain why correlation does not equal causation.
Correlation does not allow us to determine if all the criteria for causality have been met. It simply measures the direction and magnitude of the relationship, and the degree to which variables change together. Causality is generally determined using predictive models.
*Law of large numbers
Larger samples better approximate population parameters, decreasing the standard error (how far you are from the correct value). EX (empirical probabilities converging on 0.5): Pr(A) = 15 / 20 = 0.75; Pr(A) = 494 / 1000 = 0.494; Pr(A) = 5039 / 10000 = 0.5039. (A simulation sketch follows.)
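A hypothetical coin-flip simulation (not from the course materials) showing the empirical probability drifting toward the theoretical 0.5 as the number of flips grows:

```python
# Law of large numbers: empirical Pr(heads) approaches the
# theoretical 0.5 as the number of trials increases.
import random

random.seed(1)  # arbitrary seed so the run is repeatable
for n in (20, 1_000, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)
```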
*Variability
Examining the degree to which a variable varies: range, variance, standard deviation.
*As a researcher, you believe that a respondent's race will help explain the relationship between peer influence and crime. When creating your survey, you should ensure that your race measurement is...
Fully Exhaustive and Mutually Exclusive
*To increase statistical power
Power is a function of effect size (the observed difference): a larger effect size means more power. Increasing the α-level can make the critical region larger, but there are limitations on significance, so this is not practical. Increasing sample size: larger sample sizes give more power, and this is the factor we have the most control over.
*Mean
Generally used with ratio and interval data. Summation notation: X̄ = ΣX / n (the sum of all values, divided by the number of cases in a sample). Highly sensitive to outliers. Example: find the mean of the following test scores: 79 84 83 81 55 84 85 98 82 80. X̄ = 81.1. Example (cont.): find the mean without the 55 included. X̄ = 84. (A quick check in code follows.)
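A quick check of the worked example, using the scores from the card:

```python
# The test-score example: the low outlier (55) pulls the mean down.
scores = [79, 84, 83, 81, 55, 84, 85, 98, 82, 80]
print(sum(scores) / len(scores))          # 81.1

no_outlier = [x for x in scores if x != 55]
print(sum(no_outlier) / len(no_outlier))  # 84.0
```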
*Z-Scores standardize values in order to ________________?
Identify the exact location of a score in a distribution; create a standard deviation that can be compared to other distributions; compare samples with different means and standard deviations.
*Kurtosis
Kurtosis is a statistical measure used to describe the distribution of observed data around the mean, sometimes referred to as the volatility of volatility. Kurtosis is used generally in the statistical field to describe trends in charts. Put simply, kurtosis is a measure of the combined weight of a distribution's tails relative to the rest of the distribution.
*What is the difference between a Parametric and Non-Parametric Test?
Parametric tests assume the data follow a known distribution (typically the normal distribution). Non-parametric tests do not: either the test statistic or the data itself does not adhere to a normal distribution.
*What is the purpose of descriptive statistics?
Organize, simplify, and summarize data
*Steps to solve mathematical operations
PEMDAS: Parentheses, Exponents, Multiplication, Division, Addition, Subtraction
*You are interested in the average GPA of undergraduate college students in the United States. If you, as the researcher, collect GPA data from universities nationwide and calculate the average for the entire population of college students, this value would be an example of what?
Parameter
*Measures of dispersion
Range, standard deviation, variance. In statistics, dispersion (also called variability, scatter, or spread) denotes how stretched or squeezed a distribution (theoretical or that underlying a statistical sample) is. Common examples of measures of statistical dispersion are the variance, standard deviation, and interquartile range.
*Why do we use regression to make causal inferences?
Regression models allow us to determine expected values for a set of independent variable values. The least squares method allows us to decrease the amount of variation around the regression by finding the smallest squared variation.
*English vs. Greek letters
Sample statistics use English/Roman letters: x̅, R². Population parameters use Greek letters: σ, μ, ∑, χ².
*If you have a population of N = 1500 and would like to collect a sample of n = 500 adolescents, what sampling method should you use?
Simple Random Sampling with Replacement
*What are the four steps in correct order that must be followed in order to complete a hypothesis test?
Step 1: State the Hypothesis Step 2: Identify the critical value Step 3: Compute the test statistic Step 4: Draw your conclusion
*Variance
Summation notation: s² = Σ(X − X̄)² / (n − 1), the average of the squared differences from the mean. n − 1 is the degrees of freedom: you have one less than the sample size of cases to randomly assign, which makes the calculation less biased. To calculate the variance: 1. Find the mean (the simple average of the numbers). 2. Calculate the deviation from the mean for each value (X − X̄). 3. Square each of these values (squaring keeps positive and negative deviations from canceling each other out). 4. Sum the squared deviations (sum of squares = SS = Σ(X − X̄)²). 5. Divide the sum of squares by n − 1. (A worked sketch follows.)
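The steps above, applied to a small invented sample:

```python
# Sample variance computed step by step on made-up data.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)             # step 1: the mean (5.0)
deviations = [x - mean for x in data]    # step 2: X - Xbar
squared = [d ** 2 for d in deviations]   # step 3: square each deviation
ss = sum(squared)                        # step 4: sum of squares (32.0)
variance = ss / (len(data) - 1)          # step 5: divide by n - 1
print(variance)                          # ~4.57
```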
*What do the ANOVA and R2 values tell us in a Regression analysis?
The ANOVA compares the full model to the intercept only model to determine if the full model is a better fit for the data (decreases the amount of residual error). If the ANOVA is significant in a regression output, we conclude that the full model is a better fit for the data than the intercept only model. The R2 value is a measure of the total amount of variation in the dependent variable that is explained by all the independent variables in the full model.
*What are the components of a regression equation? Why do we include error in the model?
The components of a regression equation are the slope (b1) and the y-intercept (b0). Error is included because there is almost always going to be a difference between the observed and expected value in a regression model. This is not to say that observed values will never equal the predicted value; however, on average an observed point will deviate from the line of best fit. The equation for a regression line includes all three of these elements: Y = b0 + b1(Xi) + ei.
*Difference between an ANOVA and a Chi-Square Test?
The difference is that ANOVA measures relationships/differences between continuous scores, so the data must be interval- or ratio-level. Generally, ANOVAs examine group differences. Chi-square tests examine relationships between variables; chi-square can also be applied to continuous/interval/ratio data once it is categorized, but it is primarily used with nominal/ordinal-level variables, examining how much categorical groups differ according to the independent variable (for the purposes of this class, recognize that the observed values are discrete counts of cases in a particular group, e.g., the number of Republicans that agree with the use of capital punishment vs. the number of Democrats that agree). So, for example, agreement with capital punishment depends on political party affiliation. If a relationship exists (if differences between each group's responses are found to be significantly different than the expected values), the chi-square test statistic will be significant. Similarly, if there are mean differences between groups detected in an ANOVA, the F-test statistic will be significant.
*Mean squared error
Used in ANOVA: Sums of Squares are total values; they can be expressed as averages, which are called Mean Squares (MS).
*Type 2 errors
Beta error; a false negative: you're told you don't have cancer but you actually do (here the false negative has much more drastic consequences than a false positive). Occurs when we fail to reject a false null hypothesis. We guard against beta error with statistical power. Type II errors are when one fails to reject the null when the null is false (you do not identify the existing effect/relationship when one actually exists).
*A hypothesis test can be conducted using a number of distributions and tests, which are ___________
Z-scores, t-tests, ANOVA, and chi-square tests; hypothesis testing can also be applied to regression via the t-distribution/t-tests.
*Confidence Interval
a range of values so defined that there is a specified probability that the value of a parameter lies within it.
*Type 1 errors
Alpha error; a false positive: you're told you have cancer but you really don't. Occurs when we reject a true null hypothesis. The α-level provides the rate at which a Type I error occurs. In more formal language, Type I errors are when one rejects the null when the null is true (you identify an effect/relationship when one does not actually exist).
*Variable
Variables are either discrete or continuous, and their categories must be mutually exclusive and fully exhaustive: essentially, every case falls into one and only one category.
*b0
b0 is the: intercept (value of Y when X = 0); the point at which the regression line crosses the Y-axis (ordinate).
*Method of least squares
The method of least squares fits the regression line by minimizing the sum of the squared deviations between observed and predicted values. https://www.youtube.com/watch?v=0T0z8d0_aY4
*Z-Scores (look for equation)
Z-scores indicate the number of standard deviations a score lies from the mean. Z-scores are a type of standardization, and standardization is a type of transformation. In the population: z = (X − μ) / σ. In a sample: z = (X − X̄) / s_x. A z-score (aka a standard score) indicates how many standard deviations an element is from the mean, where z is the z-score, X is the value of the element, μ is the population mean, and σ is the population standard deviation. Simply put, a z-score is the number of standard deviations a data point is from the mean; more technically, it is a measure of how many standard deviations below or above the population mean a raw score is. A z-score can be placed on a normal distribution curve. Z-scores range from −3 standard deviations (which would fall to the far left of the normal distribution curve) up to +3 standard deviations (which would fall to the far right). In order to use a z-score, you need to know the mean μ and the population standard deviation σ. Z-scores are a way to compare results from a test to a "normal" population. Results from tests or surveys have thousands of possible results and units, and those results can often seem meaningless. For example, knowing that someone's weight is 150 pounds might be good information, but if you want to compare it to the "average" person's weight, looking at a vast table of data can be overwhelming (especially if some weights are recorded in kilograms). A z-score can tell you where that person's weight sits compared to the average population's mean weight. (A sketch follows.)
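The weight example in code, with an invented population mean and standard deviation (μ = 170 and σ = 25 are assumptions, not course data):

```python
# z-score of a 150-pound person against a hypothetical population.
X, mu, sigma = 150, 170, 25
z = (X - mu) / sigma
print(z)  # -0.8: the weight sits 0.8 standard deviations below the mean
```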
*Delinquent acts(i) = 10.13 + 1.47(Peer Influence(i)) + error. If you find that the average peer influence (time spent with peers) is 3.2, how many delinquent acts will a respondent who spends an average amount of time with their peers commit?
10.13 + 1.47(3.2) + e = 10.13 + 4.70 + e = 14.83 + e
*If the correlation coefficient is r = .23 and n = 476 (people in study) How much of the variation in delinquent acts is accounted for by peer influence?
The proportion of variance explained is r² = .23² = .0529, or about 5% of the variation in delinquent acts. (The sample size of n = 476 is not needed to compute the proportion of variance explained.)
*Continuous variables
A continuous variable is a variable that has an infinite number of possible values. In other words, any value is possible for the variable. A continuous variable is the opposite of a discrete variable, which can only take on a certain number of values. A continuous variable doesn't have to cover every possible number (like -infinity to +infinity); it can also be continuous between two numbers, like 1 and 2. For example, discrete variables could be 1, 2, while the continuous variables could be 1, 2, and everything in between: 1.00, 1.01, 1.001, 1.0001... Examples: The time it takes a computer to complete a task. You might think you can count it, but time is often rounded up to convenient intervals, like seconds or milliseconds. Time is actually a continuum: it could take 1.3 seconds or it could take 1.333333333333333... seconds. A person's weight. Someone could weigh 180 pounds, they could weigh 180.10 pounds, or they could weigh 180.1110 pounds. The number of possibilities for weight is limitless. Income. You might think that income is countable (because it's in dollars), but who is to say someone can't have an income of a billion dollars a year? Two billion? Fifty nine trillion? And so on... Age. So, you're 25 years old. Are you sure? How about 25 years, 19 days, and a millisecond or two? Like time, age can take on an infinite number of possibilities, and so it's a continuous variable. The price of gas. Sure, it might be $4 a gallon. But one time in recent history it was 99 cents, and given inflation a few years it will be $99; not to mention that gas stations always like to use fractions (e.g., gas is rarely $4.47 a gallon; you'll see in the small print it's actually $4.47 and 9/10ths). Continuous variables would (literally) take forever to count. In fact, you would get to "forever" and never finish counting them.
*Correlation (look for equation)
A correlation is a single number that describes the degree of relationship between two variables. Correlation is the degree to which two variables vary together; it is a test of the magnitude and the direction of the relationship between two variables. Pearson's r = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² · Σ(Y − Ȳ)²]. The three types of relationships are positive, negative, and zero-relationship. Positive relationships indicate that both variables are varying in the same direction together: they can both decrease or increase together, but they must move in the same direction. Negative, or inverse, relationships indicate that the variables are moving in opposite directions of each other: one will increase as the other decreases, or vice versa. A zero-relationship indicates that as the X variable varies, the Y variable does nothing; this results in a straight horizontal line. A perfect positive relationship is identified by a correlation coefficient of +1.0, while a perfect negative relationship is identified by a correlation coefficient of -1.0.
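A minimal sketch computing r with scipy on invented paired data (the variable names and values are hypothetical, not course data):

```python
# Pearson's r on made-up data: both variables rise together,
# so r should come out strongly positive.
from scipy.stats import pearsonr

hours_with_peers = [1, 2, 3, 4, 5, 6]   # hypothetical X
delinquent_acts = [2, 3, 3, 5, 6, 8]    # hypothetical Y

r, p = pearsonr(hours_with_peers, delinquent_acts)
print(r, p)
```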
*Discrete variables
A discrete variable is a variable whose value is obtained by counting. A continuous variable is a variable whose value is obtained by measuring. A random variable is a variable whose value is a numerical outcome of a random phenomenon. A discrete random variable X has a countable number of possible values. In a nutshell, discrete variables are like points plotted on a chart and a continuous variable can be plotted as a line. Discrete variables are countable in a finite amount of time. For example, you can count the change in your pocket. You can count the money in your bank account. You could also count the amount of money in everyone's bank account. It might take you a long time to count that last item, but the point is — it's still countable. Number of quarters in a purse, jar, or bank. Discrete because there can only be a certain number of coins (1,2,3,4,5...). Coins don't come in amounts of 2.3 coins or 10 1/2 coins, so it isn't possible for there to be an infinite number of possibilities. In addition, a purse or even a bank is restricted by size so there can only be so many coins. The number of cars in a parking lot. A parking lot can only hold a certain number of cars. Points on a 10-point rating scale. If you're graded on a 10-point scale, the only possible values are 1,2,3,4,5,6,7,8,9, and 10. Ages on birthday cards. Birthday cards only come in years...they don't come in fractions. So there are a finite amount of possibilities (presumably, about one hundred).
*Negatively skewed
A left-skewed distribution has a long left tail. Left-skewed distributions are also called negatively-skewed distributions. That's because there is a long tail in the negative direction on the number line. The mean is also to the left of the peak.
*Non-spurious
A nonspurious relationship between two variables is an association, or co-variation, that you cannot explain with a third variable. A spurious relationship results when the impact of a third variable explains the effect on both the independent and dependent variables under analysis.
*Anova (look for equation)
ANOVA: three groups or more. Used a lot in experimental-design psychology. Evaluates all components at once. Advantage: 2 or more means can collapse into a single, interpretable value. Disadvantage: does not allow for retrospective analysis of individual components (that value cannot be broken down into its original values). As between-group variance increases, support for H_A increases; as within-group variance increases, we are less likely to reject the null. Analysis of variance (ANOVA) is a collection of statistical models used to analyze the differences among group means and their associated procedures (such as "variation" among and between groups). ANOVAs are useful for comparing (testing) three or more means (groups or variables) for statistical significance. ANOVA is conceptually similar to multiple two-sample t-tests, but is more conservative (results in less Type I error) and is therefore suited to a wide range of practical problems. When we have only two samples we can use the t-test to compare the means, but it may become unreliable with more than two samples. If we only compare two means, then the t-test (independent samples) will give the same results as the ANOVA. EXAMPLE: Suppose we want to test the effect of five different exercises. For this, we recruit 20 men and assign one type of exercise to 4 men (5 groups). Their weights are recorded after a few weeks. We may find out whether the effect of these exercises is significantly different or not by comparing the weights of the 5 groups of 4 men each. As mentioned above, the t-test can only be used to test differences between two means. When there are more than two means, it is possible to compare each mean with each other mean using many t-tests, but conducting such multiple t-tests can lead to severe complications; in such circumstances we use ANOVA. Thus, this technique is used whenever an alternative procedure is needed for testing hypotheses concerning means when there are several populations. There are four basic ASSUMPTIONS used in ANOVA: the expected values of the errors are zero; the variances of all errors are equal to each other; the errors are independent; and they are normally distributed. (A sketch of the exercise example follows.)
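The exercise example run as a one-way ANOVA in scipy; the weights below are invented for illustration:

```python
# One-way ANOVA across five exercise groups of four men each.
# With g = 5 groups and N = 20, df_between = 4 and df_within = 15.
from scipy.stats import f_oneway

g1 = [180, 175, 182, 178]
g2 = [168, 171, 165, 170]
g3 = [177, 180, 176, 181]
g4 = [160, 158, 163, 161]
g5 = [172, 174, 169, 175]

F, p = f_oneway(g1, g2, g3, g4, g5)
print(F, p)  # a significant F means at least one group mean differs
```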
*Central limit theorem
As sample size increases, empirical probability approaches theoretical probability. For any population with mean µ and standard deviation σ, the distribution of sample means for sample size n will have a mean of µ and a standard deviation of σ / √n, and will approach a normal distribution as n approaches infinity. What this means is that as the sample size increases, or approaches infinity (the population N), the mean and standard deviation of the resulting sampling distribution will better approximate the population mean and standard deviation. The standard deviation in this case is approximated by the standard error (the average distance between the sample mean and the population mean). To explain further, recall that each sampling distribution is unique to the sample size; larger samples are going to have distributions that better approximate the population distribution.
*Hypotheses
From the root of "hypothetical": not a real situation. Two hypotheses to consider. Null (H0): science's "resting position" is the null; nothing has happened until evidence indicates otherwise (the presumption of innocence). It is the hypothesis that there is no significant difference between specified populations, any observed difference being due to sampling or experimental error. The null hypothesis (H0) is a hypothesis which the researcher tries to disprove, reject, or nullify; the 'null' often refers to the common view of something, while the alternative hypothesis is what the researcher really thinks is the cause of a phenomenon. Alternative (H1): the null and alternative are two statements that could be true, but only one is correct. Three features of a hypothesis: it must make reference to the population (inferential statistics); the hypotheses must be mutually exclusive; and they must be fully exhaustive. For example, suppose we wanted to determine whether a coin was fair and balanced. A null hypothesis might be that half the flips would result in heads and half in tails; the alternative hypothesis might be that the number of heads and tails would be very different. Symbolically, these hypotheses would be expressed as H0: p = 0.5; Ha: p ≠ 0.5. Suppose we flipped the coin 50 times, resulting in 40 heads and 10 tails. Given this result, we would be inclined to reject the null hypothesis; that is, we would conclude that the coin was probably not fair and balanced.
*nominal variable
Categorical data. Examples: sex, race. When measuring using a nominal scale, one simply names or categorizes responses; gender, handedness, favorite color, and religion are examples of variables measured on a nominal scale. It is the least precise level of measurement and can be reduced to the presence/absence of traits: dichotomous, designated 1 or 0 (this is important later for determining which regression analysis should be used!). A nominal variable is another name for a categorical variable. Nominal variables have two or more categories without having any kind of natural order; they are variables with no numeric value, such as occupation or political party affiliation. Examples: gender (male, female, transgender); eye color (blue, green, brown, hazel); type of house (bungalow, duplex, ranch); type of pet (dog, cat, rodent, fish, bird); genotype (AA, Aa, or aa).
*Dichotomous variables (Dichotomous nominal is the same thing)
Dichotomous variables are variables that have two levels. A very common example of a dichotomous variable is gender, which has two outcomes and is reported as male or female. Categorical variables are not measured by numbers, but they can instead be categorized. Dichotomous variables are any categorical variable that has two distinct outcomes.
*Ratio-level variables
Differs from interval only in that the zero is non-arbitrary Money is measured on a ratio scale because, in addition to having the properties of an interval scale, it has a true zero point: if you have zero money, this implies the absence of money. Since money has a true zero point, it makes sense to say that someone with 50 cents has twice as much money as someone with 25 cents (or that Bill Gates has a million times more money than you do). Age, income, etc.
*What are the components of contingency tables? How do we "name" contingency tables?
Dimensions are identified by rows × columns. Marginals are the column and row totals. Make comparisons between groups by moving from left to right; percentages are calculated according to where the IV is on the table. Values in contingency tables are raw counts. We impose conditions: if you are X, what is the probability of Y? If the cells are equal, there is no relationship; when variability is introduced, there is a probable relationship; hence the chi-square test. The components of a contingency table include the rows and columns of the table. The left axis is your dependent variable (e.g., death penalty opinion) and the top axis is your independent variable (e.g., political party affiliation). Contingency tables are named by the number of rows in your table by the number of columns in your table.
*Ordinal-level variables
Data that can be ranked or ordered and can be assigned numerical values. Examples: grades, Likert scales, military rank (Lieutenant, Captain, Major, etc.).
*Measures of Central Tendency
Mean, median, mode
*What is a correlation/association?
How much a variable varies with another variable: what happens to Y (positive, negative, or no change) when X changes. Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. For example, height and weight are related; taller people tend to be heavier than shorter people. The relationship isn't perfect: people of the same height vary in weight, and you can easily think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of people 5'5'' is less than the average weight of people 5'6'', and their average weight is less than that of people 5'7'', etc. Correlation can tell you just how much of the variation in people's weights is related to their heights. Although this correlation is fairly obvious, your data may contain unsuspected correlations. You may also suspect there are correlations, but not know which are the strongest. An intelligent correlation analysis can lead to a greater understanding of your data. Correlation is the degree to which two variables vary together; it is a test of the magnitude and the direction of the relationship between two variables.
*Mean Squared error (look at equation)
In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors or deviations, that is, the difference between the estimator and what is estimated. It is a measure of the quality of an estimator: it is always non-negative, and values closer to zero are better.
*In behavioral research, the researcher manipulates the _________________ variable?
Independent variable
*T-Tests (need to look at PP T-tests for more detail and equation)
Independent vs. repeated-measures; directional vs. non-directional hypotheses. Step 1: State the hypothesis and determine the alpha level. Step 2: Locate the critical region (need df and alpha level). Step 3: Obtain data and compute the test statistic (need to find the standard error). Step 4: Make a decision: reject or fail to reject the null hypothesis. (A scipy sketch follows.)
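A minimal independent-samples t-test in scipy on invented scores (the group names and values are hypothetical):

```python
# Independent-samples t-test; ttest_ind assumes equal variances by default.
from scipy.stats import ttest_ind

control = [12, 14, 11, 13, 15, 12]
treatment = [16, 18, 15, 17, 19, 16]

t, p = ttest_ind(control, treatment)
print(t, p)  # if p < alpha (e.g., .05), reject the null of equal means
```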
*Hypothesis testing and interval estimation are two ways to do what?
Inferential statistics
*Chi-Square (need to go to PP Chi-square for more detail)
Interested in two things: the strength of the relationship and the generalizability of the relationship; therefore still inferential statistics. A statistical method assessing the goodness of fit between observed values and those expected theoretically. Also written as the χ² test: any statistical hypothesis test wherein the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true. Chi-squared tests are often constructed from a sum of squared errors, or through the sample variance. Test statistics that follow a chi-squared distribution arise from an assumption of independent normally distributed data, which is valid in many cases due to the central limit theorem. A chi-squared test can be used to attempt rejection of the null hypothesis that the data are independent. (A sketch follows.)
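The capital-punishment example from the ANOVA/chi-square card, sketched as a test of independence; the counts are invented for illustration:

```python
# Chi-square test of independence on a 2 x 2 contingency table.
from scipy.stats import chi2_contingency

#                Agree  Disagree
observed = [[45, 15],   # Republicans (hypothetical counts)
            [20, 40]]   # Democrats (hypothetical counts)

chi2, p, df, expected = chi2_contingency(observed)
print(chi2, p, df)  # significant chi2: opinion depends on party
```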
*Interval-level variables
Intervals are the same size: an equal difference reflects an equal change in magnitude, up or down, and the intervals between the values of the variable are equally spaced. KEY: arbitrary zero point. Examples: temperature in Celsius or Fahrenheit (Kelvin has a true zero, which makes it ratio-level), IQ.
*Relative percent in a frequency table
It describes what percentage of the sample has taken on a specific value; it includes missing data in its calculation; it is found by dividing the frequency of a value by the total sample size (including missing data).
*Explain what a normal distribution looks like.
It has a mean of x̄ = 0 and a standard deviation of s = 1, with its properties expressed in standardized (z) scores. Symmetric, unimodal, theoretical distribution: 50% of values lie left of center (less than the mean) and 50% right of center (greater than the mean). SO: X̄ = 0, s = 1, and ~68% of the distribution falls between +1.00 and -1.00. This is particular to the normal shape and does not apply to skewed distributions. Mean = Median = Mode (all in the center); symmetry about the center. BELL CURVE. (A quick check follows.)
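A quick check of the ~68% property using scipy's standard normal:

```python
# Area under the standard normal within +/- 1 and +/- 2 SDs.
from scipy.stats import norm

print(norm.cdf(1) - norm.cdf(-1))   # ~0.683
print(norm.cdf(2) - norm.cdf(-2))   # ~0.954
```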
*You are concerned about your standard error being too large for your data. What concept states that if you increase your sample size, your standard error will decrease?
Law of Large Numbers
*Leptokurtic
Leptokurtic is a statistical distribution where the points along the X-axis are clustered, resulting in a higher peak, or higher kurtosis, than the curvature found in a normal distribution. This high peak and corresponding fat tails mean the distribution is more clustered around the mean than in a mesokurtic or platykurtic distribution and has a relatively smaller standard deviation.
*F-Test ( look up more and look at http://blog.minitab.com/blog/adventures-in-statistics/understanding-analysis-of-variance-anova-and-the-f-test )
Multivariate analysis: compresses all data into a single, interpretable value. Inherently one-tailed; you cannot get negative values. A new distribution, similar to the t-distribution in that it is a family of distributions distinguished by the degrees of freedom. Two degrees of freedom, between and within: df_between = g − 1, df_within = N − g. ANOVA uses F-tests to statistically test the equality of means. The F-statistic is simply a ratio of two variances. Variances are a measure of dispersion, or how far the data are scattered from the mean; larger values represent greater dispersion. You can use F-statistics and F-tests to test the overall significance of a regression model, to compare the fits of different models, to test specific regression terms, and to test the equality of means. To use the F-test to determine whether group means are equal, it's just a matter of including the correct variances in the ratio. In one-way ANOVA, the F-statistic is this ratio: F = variation between sample means / variation within the samples.
*Levels of Measurement
Nominal, ordinal, interval, and ratio
*Platykurtic
Platykurtic describes a statistical distribution whose points are widely dispersed along the X-axis, resulting in thinner tails than a normal distribution. Because this distribution has thin tails, it is less clustered around the mean than are mesokurtic and leptokurtic distributions. The prefix of the term, 'platy', means broad; distributions are deemed platykurtic when the excess kurtosis value is negative, reflecting less weight in the tails and a flatter peak. Platykurtic distributions produce less extreme outliers compared to outliers found in a normal distribution.
*A distribution that has most values clustered over low values with a tail that extends over the higher values is...
Positively skewed
*You observe the distribution of your data and find that most of the data are clustered over the lower values, with a tail that tapers off toward the higher values (the right). What kind of distribution is this?
Positively skewed
*Theoretical probability
Pr(A) = # of outcomes A / total # of possible outcomes. Ex: a coin has 2 possible outcomes; Pr(tails) = 1/2 = 0.5. Interpretation: if flipped repeatedly, tails 50%, heads 50%. No coin has actually been flipped: it is theoretical. It is the likelihood of an event happening based on all the possible outcomes. The ratio for the probability of an event P occurring is P(event) = number of favorable outcomes divided by number of possible outcomes, i.e., the number of ways an event can occur / the total number of equally likely outcomes. The theoretical approach is based on theory. This sounds ridiculous, but it's true. Theoretically, rolling a 4 on a six-sided die should be the decimal equivalent of 1/6, so for every six rolls we will get only one 4. However, even on a fair die, this is not always the case: it may be that we observe more than one 4 for every six rolls. At this point, we may be interested in determining how much more often a 4 is likely to be rolled.
*Empirical probability
Real, actual, observable, "directly documented." Pr(A) = # of times A occurred / total # of occurrences. Pr(tails) = 62 / 100 = 0.62 (62 tails actually observed; the coin was actually flipped 100 times). It is the ratio of the number of outcomes in which a specified event occurs to the total number of trials, not in a theoretical sample space but in an actual experiment. Therefore, we asked the dealer at the craps table, in a very classic Sherlock Holmes fashion (the Conan Doyle original, not the Cumberbatch version; he's generally too rude), to allow us to roll the dice twenty times as an experiment. You know that the theoretical probability of obtaining a 4 is 0.1667, and you observed seven fours in twenty rolls. The empirical probability at this point is 0.35, approximately double the theoretical probability. We may, at this point, want to call the mob boss with the casino owner in his pocket on the cheater at the craps table, OR we might want to keep rolling to get a larger number of trials and allow the law of large numbers to show that we just hadn't rolled the dice enough times for the observed probability to approach the theoretical.
*Regression
Regression is a way of predicting the value of one variable from another. It is a hypothetical model of the relationship between two variables. The model used is a linear one; therefore, we describe the relationship using the equation of a straight line. The regression line is only a model based on the data, and this model might not reflect reality. SST: total variability (variability between scores and the mean). SSR: residual/error variability (variability between the regression model and the actual data). SSM: model variability (difference in variability between the model and the mean). If the model results in better prediction than using the mean, then we expect SSM to be much greater than SSR. Regression is a measure of the relation between the mean value of one variable (e.g., output) and corresponding values of other variables (e.g., time and cost). Yi = b0 + b1Xi + Ei. (A sketch follows.) https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data
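A minimal least-squares fit of Yi = b0 + b1Xi with scipy on invented data (the X/Y values are hypothetical), including a prediction at an arbitrary X = 3.2, echoing the delinquent-acts card earlier:

```python
# Least-squares fit of a straight line; linregress returns slope,
# intercept, and the correlation r (so r**2 is R-squared).
from scipy.stats import linregress

X = [1, 2, 3, 4, 5, 6]        # hypothetical peer influence scores
Y = [12, 13, 15, 16, 18, 19]  # hypothetical delinquent acts

fit = linregress(X, Y)
print(fit.intercept, fit.slope)           # b0 and b1
print(fit.rvalue ** 2)                    # R-squared
print(fit.intercept + fit.slope * 3.2)    # predicted Y at X = 3.2
```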
*You choose to add race, social class, and gender to your model. Therefore, you will have to determine which is a better fit for your data: the full model with four independent variables, or the intercept only model. What test determines this?
The F-Test
*Your data require the use of a bivariate regression analysis to measure the relationship between peer influence and delinquent behavior. What two pieces of information will your analysis provide you with?
The Slope and the Intercept
*After having a conversation with your colleague about your distribution, they still don't understand how it is considered a normal distribution. Explain further by saying...
The central limit theorem states that as sample size increases, the distribution of sample means approaches a normal distribution. The central limit theorem also states that as your sample size approaches the population, your standard error decreases and your sample mean will get closer to the population mean.
*Range
The difference between the largest value and the smallest: range = Xmax − Xmin
*Inferential statistics
The mathematical procedures whereby we convert information about the sample into intelligent guesses about the population fall under the rubric of inferential statistics. EX: blood samples, sampling pizza. Two methods: interval estimation and hypothesis testing. Both use sample statistics to make estimations about population parameters; hypothesis testing is more common (in CJ). For instance, we use inferential statistics to try to infer from the sample data what the population might think, or to make judgments of the probability that an observed difference between groups is a dependable one rather than one that might have happened by chance in this study. Inferential statistics are techniques that allow us to use samples to make generalizations about the populations from which the samples were drawn. It is therefore important that the sample accurately represents the population; the process of achieving this is called sampling. Inferential statistics arise out of the fact that sampling naturally incurs sampling error, and thus a sample is not expected to perfectly represent the population. The population is too large, so we must take a sample instead.
*Besides a visual data analysis, how can we estimate that a distribution is positively skewed?
The mean will be larger than the median, and the median will be larger than the mode.
*Median
The middle value: 50% below, 50% above. Less sensitive to outliers. How to find the median: list all values from largest to smallest and find the middle value; its position is (N + 1)/2. If there are two middle values (an even sample size), add the two middle values and divide by 2.
*Mode
The most commonly occurring value Can be used with all levels of measurement Least sensitive to outliers
*Univariate
Univariate analysis is perhaps the simplest form of statistical analysis. Like other forms of statistics, it can be inferential or descriptive; the key fact is that only one variable is involved. "Uni" means "one," so in other words your data has only one variable. It doesn't deal with causes or relationships (unlike regression), and its major purpose is to describe: it takes data, summarizes that data, and finds patterns in the data. You have several options for describing data with univariate analysis: frequency distribution tables, bar charts, histograms, frequency polygons, and pie charts.
*R² ( look more at equation)
The proportion of variance accounted for by the regression model: the Pearson correlation coefficient squared. R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. R-squared = explained variation / total variation. R-squared is always between 0 and 100%: 0% indicates that the model explains none of the variability of the response data around its mean, while 100% indicates that the model explains all of that variability. In general, the higher the R-squared, the better the model fits your data. EX: one regression model accounts for 38.0% of the variance while another accounts for 87.4%; the more variance that is accounted for by the regression model, the closer the data points will fall to the fitted regression line. Theoretically, if a model could explain 100% of the variance, the fitted values would always equal the observed values and, therefore, all the data points would fall on the fitted regression line. R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots. R-squared alone does not indicate whether a regression model is adequate: you can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!
*Standard deviation
The square root of the variance; puts deviations back into the original units. It is a measure of how spread out numbers are ("deviation" just means how far from the norm). Its symbol is σ.
*What are the three types of relationships that can be found in a correlational analysis?
The three types of relationships are positive, negative, and zero-relationship. Positive relationships indicate that both variables are varying in the same direction together: they can both decrease or increase together, but they must move in the same direction. Negative, or inverse, relationships indicate that the variables are moving in opposite directions of each other: one will increase as the other decreases, or vice versa. A zero-relationship indicates that as the X variable varies, the Y variable does nothing; this results in a straight horizontal line. A perfect positive relationship is identified by a correlation coefficient of +1.0, while a perfect negative relationship is identified by a correlation coefficient of -1.0.
*Population
The total set of observations that can be made is called the population. A population is a collection of people, items, or events about which you want to make inferences. It is not always convenient or possible to examine every member of an entire population.
*Descriptive statistics
They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data. Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can be either a representation of the entire population or a sample of it. Descriptive statistics are broken down into measures of central tendency and measures of variability, or spread. With descriptive statistics you are simply describing what is or what the data shows.
*Parameter
A numerical characteristic of a population, as distinct from a statistic of a sample. A parameter is a measurable characteristic of a population, such as a mean or standard deviation. Parameters are numbers that summarize data for an entire population; statistics are numbers that summarize data from a sample. A parameter never changes, because everyone (or everything) was surveyed to find it. For example, you might be interested in the average age of everyone in your class. Maybe you asked everyone and found the average age was 25. That's a parameter, because you asked everyone in the class. Now let's say you wanted to know the average age of everyone in your grade or year. If you use that information from your class to take a guess at the average age, then that information becomes a statistic, because you can't be sure your guess is correct (although it will probably be close!). Statistic: from a fraction of a population. Parameter: from the whole population.
*Histogram
A diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval; a graphical display of data using bars of different heights. It is similar to a bar chart, but a histogram groups numbers into ranges. A frequency distribution shows how often each different value in a set of data occurs, and a histogram is the most commonly used graph to show frequency distributions. It looks very much like a bar chart, but there are important differences between them. A histogram is a vertical bar chart that depicts the distribution of a set of data. A histogram does not reflect process performance over time; it's helpful to think of a histogram as being like a snapshot.
*frequency distribution
A mathematical function showing the number of instances in which a variable takes each of its possible values. Frequency: how often values occur. A frequency distribution is a count of those frequencies.
*bi
bi is the: regression coefficient for the predictor; gradient (slope) of the regression line; direction/strength of the relationship.
*Bivariate
Correlational. Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the analysis of two variables (often denoted as X, Y) for the purpose of determining the empirical relationship between them. Bivariate analysis can be helpful in testing simple hypotheses of association, and can help determine to what extent it becomes easier to know and predict a value for one variable (possibly a dependent variable) if we know the value of the other variable (possibly the independent variable).
* Simple Random Sampling
Every element in the population has the same probability of being selected. Representative: it can fail, but it is objective. With replacement: in smaller samples, probability is greatly affected. Without replacement: in larger samples, probability is barely affected. Suppose we use the lottery method to select a simple random sample: after we pick a number from the bowl, we can put the number aside or we can put it back into the bowl. If we put the number back in the bowl, it may be selected more than once; if we put it aside, it can be selected only one time. When a population element can be selected more than one time, we are sampling with replacement; when a population element can be selected only one time, we are sampling without replacement. SRS does not work without a sampling frame (a list of every element in the population). It is the gold standard; if the sample is not randomly selected, it will be biased and less reliable. Simple random sampling is the basic sampling technique where we select a group of subjects (a sample) for study from a larger group (a population). Each individual is chosen entirely by chance and each member of the population has an equal chance of being included in the sample. The population consists of N objects; the sample consists of n objects; all possible samples of n objects are equally likely to occur. An important benefit of simple random sampling is that it allows researchers to use statistical methods to analyze sample results. For example, given a simple random sample, researchers can use statistical methods to define a confidence interval around a sample mean. This maximizes the likelihood that the resulting samples are representative; for this reason, it is the gold standard. Remember that even though every element has the same probability of being selected, by chance alone it is possible to obtain a sample that is not representative of the population (e.g., obtaining a sample of only female firefighters by chance, when an overwhelming majority of firefighters are male); in such cases, sampling has failed. SRS can be done with and without replacement. With replacement indicates that once an element is selected, it is replaced back into the sampling frame to ensure that the same probability is maintained for every element in the sample; this is more important in smaller samples, where removing an element will significantly change the probability for the next element selected (for example, a population of 100 vs. a population of 1,000,000). Without replacement simply means that the element is not replaced back into the sampling frame, therefore altering the probability of selection for the following elements. Ease of use represents the biggest advantage of simple random sampling: unlike more complicated sampling methods such as stratified random sampling and probability sampling, there is no need to divide the population into subpopulations or take any other additional steps before selecting members of the population at random. (A sketch follows.)
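A minimal sketch of SRS with and without replacement, reusing the N = 1500 / n = 500 figures from the earlier sampling question:

```python
# Simple random sampling from a hypothetical frame of N = 1500 elements.
import random

frame = list(range(1, 1501))

without_repl = random.sample(frame, 500)   # without replacement: no repeats
with_repl = random.choices(frame, k=500)   # with replacement: repeats possible
print(len(set(without_repl)), len(set(with_repl)))
```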
*Convenience sampling
AKA non-probability sampling; often refers to convenience samples (using what you've got). Convenience sampling is a non-probability sampling technique where subjects are selected because of their convenient accessibility and proximity to the researcher.
*Dependent variable
Sometimes called an outcome variable. The dependent variable is simply that: a variable that is dependent on an independent variable (or variables). For example, the test mark that a student achieves is dependent on revision time and intelligence (the independent variables).