Statistics Final Exam - J

Extrapolation

- Don't make predictions beyond the range of the data, because we cannot be sure that the linear trend continues outside the range we examined. Example: the data covered women 61 to 68 inches tall. It is not appropriate to use the regression equation to predict the weight of a 36-inch-tall woman, since 36 inches is far outside the range of the data (extrapolation).

Basic Probability Rules

1.) Probability Rule One: For any event A, 0 ≤ P(A) ≤ 1. The probability of an event tells us how likely it is to occur, and ranges from 0 (the event will never occur) to 1 (the event is certain). NOTE: One practical use of this rule is to flag any probability calculation that comes out greater than 1 (or less than 0) as incorrect.
2.) Probability Rule Two: The sum of the probabilities of all possible outcomes is 1. P(S) = 1
3.) The Addition Rule: P(A or B) = P(A) + P(B) - P(A and B). If A and B are mutually exclusive events (events that cannot occur together), then the third term is 0, and the rule reduces to P(A or B) = P(A) + P(B). For example, you can't flip a coin and have it come up both heads and tails on one toss.
4.) The Multiplication Rule: P(A and B) = P(A) * P(B|A) = P(B) * P(A|B). If A and B are independent events, the formula reduces to P(A and B) = P(A) * P(B). Events are independent when the outcome of one does not affect the outcome of the other. For instance, the second of two coin flips still has a 0.50 (50%) probability of landing heads, regardless of what came up on the first flip. What is the probability of getting tails on the first flip and heads on the second? P = P(tails) * P(heads) = (0.5) * (0.5) = 0.25
5.) The Complement Rule: P(not A) = 1 - P(A). Do you see why the complement rule can also be thought of as the subtraction rule? This rule builds on the fact that A and not A are mutually exclusive: they can never occur together, but one of them always has to occur. Therefore P(A) + P(not A) = 1. For example, if the weatherman says there is a 0.3 chance of rain tomorrow, what are the chances of no rain? P(no rain) = 1 - P(rain) = 1 - 0.3 = 0.7
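
The addition, multiplication, and complement rules can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the multiplication helper assumes the two events are independent.

```python
def prob_or(p_a, p_b, p_a_and_b=0.0):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B).

    Leave p_a_and_b at 0 when A and B are mutually exclusive.
    """
    return p_a + p_b - p_a_and_b


def prob_and_independent(p_a, p_b):
    """Multiplication rule for independent events: P(A and B) = P(A) * P(B)."""
    return p_a * p_b


def prob_not(p_a):
    """Complement rule: P(not A) = 1 - P(A)."""
    return 1 - p_a


# Tails on the first flip and heads on the second flip.
print(prob_and_independent(0.5, 0.5))

# Chance of no rain when P(rain) = 0.3.
print(prob_not(0.3))

# Heads or tails on one toss (mutually exclusive events).
print(prob_or(0.5, 0.5))
```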

Empirical Rule

- 68% of the data is within 1 standard deviation of the mean.
- 95% of the data is within 2 standard deviations.
- Almost all (99.7%) of the data is within 3 standard deviations.
Moving outward from the mean on each side, the bands hold about 34%, 13.5%, 2.35%, and 0.15% of the data.
We can only compare standard deviations when the means are similar.
• If we want to compare distributions in terms of spread, we can convert the observations to standard units.
• These standard units are called z-scores.
Z-score: measures how many standard deviations an observed data value is above or below the mean. A z-score is also known as a standard score, and it can be placed on a normal distribution curve. Almost all z-scores fall between -3 (far left of the normal curve) and +3 (far right of the normal curve).
To compute a z-score, you need to know the mean μ and the population standard deviation σ. The basic z-score formula for a single data value is: z = (x - μ) / σ
For example, say you have a test score of 190. The test has a mean (μ) of 150 and a standard deviation (σ) of 25. Assuming a normal distribution, your z-score would be: z = (x - μ) / σ = (190 - 150) / 25 = 1.6. In this example, your score is 1.6 standard deviations above the mean.
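
The z-score calculation from this card, written as a small Python sketch. The function name z_score is chosen here for illustration only.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


# Test score of 190 on a test with mean 150 and standard deviation 25.
z = z_score(190, 150, 25)
print(z)  # 1.6 standard deviations above the mean
```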

Density Curve

A density curve is a mathematical model of a distribution. It is a smoothed version of a histogram: the histogram is built from the sample, but the smooth curve describes the population. The area under the curve (and above the x-axis) over any range of values is the probability of an outcome in that range.
For a curve to be called a density curve:
- The total area under the curve equals 1 (100%).
- The curve must always be on or above the x-axis.
- Probabilities are determined by finding an area under the curve.
A density curve may be uniform or skewed, among other shapes.
To find an area, use ordinary geometry formulas: A = b × h / 2 for a triangle and A = b × h for a rectangle. If asked for the area that exceeds a value, subtract the area not included from 1.

Definition of a Random Variable

A random variable (rv) is a variable whose value is a numerical outcome of a probability experiment.
- Often denoted with capital letters (X, Y, etc.)
- Random variables can be discrete or continuous.
Discrete Random Variables
• A random variable that can take on a countable number of values
- X has a finite or countably infinite number of values
- An example is flipping a coin 10 times where X is the number of heads.
- There are 11 possible values: 0, 1, ... , 10
Continuous Random Variables
• A random variable whose values are uncountable
- X takes all values in an interval of numbers (often measurements)
- An example is the amount of time it takes to complete a task.
- Our exam is 70 minutes. You can finish it any time between 0 and 70 minutes.

Numerical variables

Dotplot - Record data values on a number line with a dot above the number line for each data value observed. We can get a sense of frequency by seeing how high the dots stack up.
Histogram - Group data into intervals, called bins (width of the interval = bin width). Count how many data values fall into each bin. The rectangles have the following properties:
- Consecutive bins touch
- The first value in each bin is recorded on the horizontal axis
- The height of each rectangle corresponds to the count

t-distribution

Hypothesis tests and confidence intervals for estimating and testing the mean are based on a statistic called the t-statistic: t = (x bar - μ) / (s / sqrt(n))
Since the population standard deviation is almost always unknown, we divide by an estimate of the standard error, using the sample standard deviation s instead of sigma. The t-statistic does not follow the Normal distribution, because the denominator changes with every sample. The t-statistic is more variable than the z-statistic, whose denominator is always the same. If the three conditions for using the Central Limit Theorem hold, the t-statistic follows a distribution called the t-distribution.
• Symmetric and "bell-shaped"
• Has thicker tails than the Normal distribution
• Shape depends on the degrees of freedom (df)
• If df is small, the tails are thick; as df increases, tails get thinner
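
A minimal sketch of the t-statistic formula in Python. The sample values in the example call are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math


def t_statistic(sample_mean, mu, s, n):
    """t = (x bar - mu) / (s / sqrt(n)), using the sample standard deviation s."""
    standard_error = s / math.sqrt(n)  # estimated standard error of the mean
    return (sample_mean - mu) / standard_error


# Hypothetical sample: mean 52, s = 4, n = 16, tested against mu = 50.
# Standard error = 4 / 4 = 1, so t = (52 - 50) / 1.
print(t_statistic(52, 50, 4, 16))
```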

Law of Large Numbers (consider coin flips)

If an experiment with a random outcome is repeated a large number of times, the empirical probability of an event is likely to be close to the true probability. The larger the number of repetitions, the closer together these probabilities are likely to be.
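
A short simulation sketch of the Law of Large Numbers for coin flips. It assumes a fair coin and uses a fixed seed (an arbitrary choice) so the run can be repeated; the function name is illustrative.

```python
import random


def empirical_heads(n_flips, seed=0):
    """Flip a simulated fair coin n_flips times; return the proportion of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


# The empirical probability tends to settle near the true value 0.5
# as the number of flips grows.
for n in (10, 1_000, 100_000):
    print(n, empirical_heads(n))
```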

"Gold Standard" for Experiments

- Large sample size (to account for variability)
- Assignment to treatment or control group is random and controlled by the researcher
- Placebo is used, if appropriate
- Study is double-blind

Relationship between outcomes, events, and sample spaces

Outcome - the result of a single trial of a probability experiment.
Event - an outcome or set of outcomes of a probability experiment; any collection of outcomes in the sample space.
Sample space - the set of all possible outcomes of an experiment. Note: Although we assume all outcomes of an experiment to be equally likely throughout this text, this assumption is not true in general.
Using a sample space to find a theoretical probability, for equally likely outcomes: P(A) = (number of outcomes in A) / (number of all possible outcomes)
Example: Roll a die. Let A represent the event "get a number less than 3." Find P(A).
Sample space: 1, 2, 3, 4, 5, 6 (6 outcomes)
A: 1, 2 (2 outcomes)
P(A) = 2/6 = 1/3
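
The die example can be reproduced by listing the sample space and counting outcomes. This sketch assumes equally likely outcomes, as the card does, and uses Python's Fraction to keep the answer exact.

```python
from fractions import Fraction

# Sample space for one roll of a six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

# Event A: "get a number less than 3".
event_a = {outcome for outcome in sample_space if outcome < 3}

# P(A) = (number of outcomes in A) / (number of all possible outcomes)
p_a = Fraction(len(event_a), len(sample_space))
print(p_a)  # 2/6 reduced to lowest terms
```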

Idea of Simulation

Simulation of Random Events. Simulation is a way to model random events, such that simulated outcomes closely match real-world outcomes. By observing simulated outcomes, researchers gain insight on the real world.

The Normal Distribution N(µ,σ)

The curve is symmetric about the mean (i.e. the area under the curve to the left of the mean is equal to the area under the curve to the right of the mean). The mean = median = mode, so the highest point of the curve is at x = μ. The curve has inflection points at (μ - σ) and (μ + σ). This means that the curve changes from curved upward to curved downward (or vice versa) at these points. The Empirical Rule applies: 68% of the data within 1 SD, 95% within 2 SD, and 99.7% within 3 SD.

data coding

The organization of data based on key concepts and categories. Example: Gender is recoded as the variable Female, with 0 = No and 1 = Yes.

Definition of a Probability Distribution

The probability distribution of a random variable X tells us what values X can take and how to assign probabilities to those values. So, a probability distribution is a table or graph that tells us: 1. All the possible outcomes of a random experiment 2. The probability of each outcome

Empirical vs. Theoretical probabilities

• Theoretical probabilities are always the same value; they do not change from one experiment to the next. Example: The theoretical probability of getting heads when tossing a coin is always 0.50 or 50%.
• Empirical probabilities change with every experiment. Example: If I toss a coin 10 times and get 7 heads, the empirical probability of heads = 0.70 or 70%. If I toss a coin 10 times and get 3 heads, the empirical probability of heads = 0.30 or 30%.

Statistical Inference

We want to estimate population parameters using sample statistics with:
- Accuracy (correctness, center, bias)
- Precision (consistency, spread, standard error)
An estimation method should be both accurate and precise.
• Accurate - The method measures what it is intended to measure; it correctly estimates the population parameter.
• Precise - If the method is repeated, the estimates are very consistent.
Golf-ball pictures:
- Both accuracy and precision: balls clustered closely all around the hole.
- Precision, but not accuracy: balls clustered together on one side of the hole.
- Accuracy, but not precision: balls scattered near and far all around the hole.

Be aware of misleading Graphs

Well-designed graphs help us see patterns, but misleading graphs play tricks with our eyes and lead to wrong conclusions! Watch for:
- Inappropriate scaling (starting at a value other than 0)
- Using icons of different sizes rather than bars

five number summary

minimum, Q1, median, Q3, maximum
Boxplots
- Help us visualize the five number summary.
- Show where the bulk of the data lie.
- The box is drawn from Q1 to Q3 with a line for the median inside the box.
- Whiskers are drawn to the most extreme values within the fences (extreme values that are not outliers).
- Potential outliers are marked with an asterisk.
CAUTION! Boxplots work best for unimodal distributions. (They hide multi-modal information!)

hypothesis

statements about population parameters

Regression equation: y = a + bx

• A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
• We often use a regression line to predict the value of the response variable y for a given value of x.
• The distinction between the explanatory and response variables is necessary.
• It is used for making predictions about future observed values.
- a is the y-intercept
- b is the slope
Formula for the slope: b = r (sy / sx). Here sx is the standard deviation of the x variable and sy is the standard deviation of the y variable.
Formula for the intercept: a = y bar - b (x bar). Here x bar is the mean of the x variable and y bar is the mean of the y variable.
• Order matters. If x and y are switched, the regression equation will change.
• We use the x-variable to make predictions about the y-variable, so the x-variable is called the explanatory or predictor variable. It is also called the independent variable.
• The y-variable is the response or predicted variable. It is also called the dependent variable.
• Slope tells us how much the y-variable changes, on average, when the x-variable is increased by 1 unit.
• A slope close to 0 means there is no linear relationship between x and y.
Example interpretation (how to write results): For every additional inch in height, weight tends to increase by 9.03 pounds. Equivalently: every increase of 1 inch in height is associated with an increase in weight of 9.03 pounds.
• The y-intercept is the predicted value when x is 0.
• The y-intercept is meaningful only if it makes sense for x to equal 0.
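
The slope and intercept formulas can be sketched in Python as below. All summary statistics in the example call are hypothetical and are not taken from the height and weight data mentioned in the card.

```python
def regression_line(r, s_x, s_y, x_bar, y_bar):
    """Return (a, b) for y = a + b*x using b = r*(sy/sx) and a = y bar - b*(x bar)."""
    b = r * (s_y / s_x)    # slope
    a = y_bar - b * x_bar  # y-intercept
    return a, b


def predict(a, b, x):
    """Predicted y for a given x; only sensible for x inside the range of the data."""
    return a + b * x


# Hypothetical summary statistics, chosen so the arithmetic is easy to check.
a, b = regression_line(r=0.5, s_x=2.0, s_y=8.0, x_bar=65.0, y_bar=140.0)
print(a, b)               # intercept and slope
print(predict(a, b, 66))  # predicted y one unit above x bar
```

Note that the line always passes through the point (x bar, y bar), which is a quick way to check a computed intercept.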

Types of Errors

• Rejecting a true null hypothesis is called a Type I Error (or Error of the First Kind).
• Failing to reject a false null hypothesis is called a Type II Error (or Error of the Second Kind).
• These errors are always possible, but we try to limit the probability of them occurring.
Example: Suppose a political analyst is interested in predicting whether a school bond measure on the ballot will pass. Her hypotheses are: H0 : p = 0.50, Ha : p > 0.50. Describe two types of errors she might make in conducting this test. Explain what a 5% significance level means in the context of this problem.
1. Rejecting the null hypothesis when it is true. This means concluding that more than 50% favor the bond when the actual proportion who favor the bond is 50% or less.
2. Failing to reject the null hypothesis when it is false. This means being unable to conclude that more than 50% favor the bond when the actual proportion who favor the bond is more than 50%.
3. The 5% significance level means there is only a 5% chance that the analyst mistakenly concludes that more than 50% favor the bond when, in fact, they don't.

Central Limit theorem lets us make assumptions about sampling distributions and holds if:

◦ We have a random and independent sample
◦ Large sample: n ≥ 25, or the original distribution is Normal
◦ Big population: N > 10n

One-way and Two-way Tables for organizing categorical data

A one-way table is simply the data from a bar graph put into table form. In a one-way table, you are only working with one categorical variable.
Two-way frequency tables show how many data points fit in each category. The columns of the table tell us whether the student is a male or a female. The rows of the table tell us whether the student prefers dogs, cats, or doesn't have a preference. Each cell tells us the number (or frequency) of students. For example, the 36 is in the male column and the prefers dogs row. This tells us that there are 36 males who preferred dogs in this dataset. Notice that there are two variables (gender and preference); this is where the two in two-way frequency table comes from.
Two-way relative frequency tables show what percent of data points fit in each category. We can use row relative frequencies or column relative frequencies; it just depends on the context of the problem. For column relative frequencies: find the totals for each column, then divide each cell count by its column total and convert to a percentage.

Hypothesis testing process:

A procedure that enables us to choose between two claims when we have variability in our measurements.
- Claims are made about a population parameter.
- We make a choice based on evidence provided by a sample.
Purpose - to aid investigators in reaching a decision concerning a population by examining the data obtained from a sample
Goal - to assess the evidence provided by the data in favor of some claim about the population
Steps:
• Define the population
• State hypotheses
• Define the significance level
• Select a sample size and collect the data
• Perform the hypothesis test
• Draw a conclusion and interpret results

Response vs. explanatory variables

A response variable measures or records an outcome of a study. -(Also: y, dependent variable, predicted variable) An explanatory variable explains changes in the response variable. -(Also: x, independent variable, predictor variable)

Scatterplots

A scatterplot is a graph displaying the relationship between two quantitative variables measured on the same set of individuals.
- Each individual in the data appears as a point on the scatterplot.
- Generally, the response variable corresponds to the y-axis, and the explanatory variable corresponds to the x-axis.
Scatterplots are the primary tools for examining relationships between two numerical variables. Each point in the scatterplot represents one observation, though some may overlap. Note three features:
1. Trend (like center)
2. Strength (like spread)
3. Shape
Trend: the general tendency of the scatterplot as you read from left to right. Typical trends:
1. Increasing (uphill), called a positive association - As x increases, y also increases.
2. Decreasing (downhill), called a negative association - As x increases, y decreases.
3. No trend, if there is neither an uphill nor downhill tendency - As x increases, y stays the same.
Strength: Scatterplots with large amounts of scatter or vertical variation indicate a weak association. Scatterplots with small amounts of scatter or little vertical variation indicate a strong association.
Shape: This scatterplot shows there is a linear association between volume of searches for the word "vampire" and the word "zombie." The red line is superimposed on the plot to emphasize the linear trend. Sometimes there are trends in data that are non-linear: trends that are better modeled by a curve than by a line. This scatterplot shows there is a non-linear trend between temperature and pollutant ozone levels.

Categorical variables

Bar Chart - similar to a histogram; record data categories along the horizontal axis with the height corresponding to the frequency of the data. (Bars do not touch.)
Pie Chart - a circle divided into sections, one for each category.
• The area (angle) of each sector is proportional to the frequency/relative frequency of that category.
• Pie charts are useful for showing the relative proportions of each category, compared to the whole.
Two Main Components of a Categorical Distribution:
1. Mode (typical, or most frequent, outcome)
2. Variability (or diversity in outcomes)

Continuous Random Variables and Probability Distributions

Because we cannot list all outcomes for a continuous random variable, we cannot represent the probability distribution using a table. Continuous random variables are usually measurements. Since P(X = x) = 0 for any single value x, we are interested in intervals and their area under the curve. The probability distribution is displayed as a graph or formula: a probability density curve.

Treatment and Outcome variables

Causality (what causes something to happen) can be understood in terms of two variables. - Treatment variable: Whether or not a specific treatment is used - Outcome (or response) variable: Whether or not a certain outcome is seen The goal: To determine if the treatment variable causes a change in the outcome variable!

Binomial Distribution B(n,p)

Check for a binomial setting:
- There is a fixed number of trials, n.
- Each observation falls into one of just two categories (called success and failure).
- The probability of a success, labeled p, is the same for each trial.
- The n trials are all independent.
The shape depends on n and p: symmetric if p = 0.5, and approximately symmetric when n is large even if p is close to 0 or 1.
Notation: n = fixed number of trials; x = number of times you succeed (between 0 and n); p = probability of success; q = 1 - p (p and q are complements of each other).
P(X = x): use the formula P(X = x) = (n choose x) (p^x)(q^(n-x)).
In the calculator: type in the value of n, press the PRB button and choose combination, then type the value of x and press Enter. Multiply by (p ^ x), then multiply by (q ^ (n - x)). This gives the probability of exactly x successes.
P(X ≤ k): "at most k", e.g. P(at most 8) = P(0) + ... + P(8).
P(X < k): "fewer than k", e.g. fewer than 8 (as above, but don't include 8).
P(X ≥ k): "at least k", e.g. at least 8 (8 included, up to the number of fixed trials n).
P(X > k): "greater than k", e.g. greater than 8 (same as above, but not including 8).
For these, use a given table, or compute P(X = x) for each value as above and add the probabilities together.
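
The calculator steps above can be mirrored in Python with math.comb. This is a sketch; the helper names are chosen for illustration, and the example uses n = 10 and p = 0.5 as hypothetical values.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) = (n choose x) * p^x * q^(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)


def prob_at_most(k, n, p):
    """P(X <= k): add P(X = 0) through P(X = k)."""
    return sum(binom_pmf(x, n, p) for x in range(0, k + 1))


def prob_at_least(k, n, p):
    """P(X >= k): add P(X = k) through P(X = n)."""
    return sum(binom_pmf(x, n, p) for x in range(k, n + 1))


n, p = 10, 0.5
print(binom_pmf(8, n, p))          # exactly 8 successes
print(prob_at_most(8, n, p))       # at most 8
print(prob_at_most(7, n, p))       # fewer than 8
print(prob_at_least(8, n, p))      # at least 8
print(prob_at_least(9, n, p))      # greater than 8
```

"Fewer than k" and "greater than k" reuse the same helpers with k - 1 and k + 1, which matches the "don't include 8" notes on the card.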

Hypothesis Tests in Detail

Confidence intervals and hypothesis tests are closely related but ask slightly different questions.
Confidence interval - "What is the value of this parameter?"
Hypothesis test - "Are the data consistent with the parameter being one particular value, or might the parameter be something else?"
Even though they are designed to answer different questions, they are similar enough to lead us to the same types of conclusions. A confidence interval can lead us to the same type of conclusion as a two-sided hypothesis test.
• The only way to reduce the probability of both types of errors is to increase the sample size. Increasing the sample size improves the precision of the test.
• We cannot make the significance level (Type I Error rate) arbitrarily small because this increases the probability of making a Type II Error.
A result is "statistically significant" when the null hypothesis is rejected: the difference between the data-estimated value for a parameter and the null hypothesis value for that parameter is so large it cannot be convincingly explained by chance. However, statistically significant findings do not necessarily mean the results are useful.

Confidence intervals for a single proportion

Confidence intervals provide us with:
1. A range of plausible values for a population parameter.
2. A confidence level, which expresses our level of confidence that the interval contains the population parameter.
Confidence Level
• Tells us how often the estimation is successful
• Measures the success rate of the method, not of any one particular interval
- Will need to keep this in mind when we interpret confidence intervals
Margin of Error
• Tells us how far from the population value our estimate can fall
Margin of error = z* × SE, where z* is a number that tells how many standard errors to include in the margin of error. From the Empirical Rule we know that z* = 1 corresponds to a confidence level of 68% and z* = 2 corresponds to a confidence level of 95%.
Confidence intervals have the form p hat ± m, where m is the margin of error. The margin of error is m = z* × SE, so the confidence interval is p hat ± z* × SE.
Example: A Huffpost/YouGov poll of 1000 Americans found that 38% believed that there is intelligent life on other planets. a. Construct a 95% confidence interval for the percentage of all Americans who believe there is intelligent life on other planets. b. Would it be plausible to conclude that 40% of Americans believe in intelligent life on other planets?
Here p hat = 0.38 and n = 1000. Check that the conditions for the Central Limit Theorem apply:
1. Random independent sample
2. Large sample: n(p hat) = 1000(0.38) = 380 and n(1 - p hat) = 1000(0.62) = 620. Both 380 and 620 are greater than or equal to 10.
3. The population is at least 10 times larger than the sample.
SE = sqrt(0.38 × 0.62 / 1000) = 0.0153
m = 1.96(0.0153) = 0.03
0.38 - 0.03 = 0.35 and 0.38 + 0.03 = 0.41
The 95% confidence interval is 35% to 41%. It is plausible that 40% of Americans believe in intelligent life on other planets, since 40% is contained within the confidence interval.
The 1.96 in the margin of error is what makes this a 95% interval. Common z* values: 99%: 2.58; 95%: 1.96; 90%: 1.645; 80%: 1.28.
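
The poll example can be reproduced with a short Python sketch. The function name and the default z* = 1.96 are choices made here for illustration; the sketch does not check the Central Limit Theorem conditions.

```python
import math


def proportion_ci(p_hat, n, z_star=1.96):
    """Return (lower, upper) for the interval p hat ± z* × SE."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    margin = z_star * se                     # margin of error
    return p_hat - margin, p_hat + margin


# Poll of 1000 people with p hat = 0.38, at the 95% confidence level.
low, high = proportion_ci(0.38, 1000)
print(round(low, 2), round(high, 2))

# Is 40% a plausible value for the population proportion?
print(low <= 0.40 <= high)
```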

Correlation: r

Correlation Coefficient
• A number that measures the strength of a linear relationship
- Sometimes called Pearson's correlation coefficient
• Symbol: r
• Always between -1 and +1
- If r is exactly -1 or +1, the relationship is perfectly linear: the points lie exactly on a line.
• Magnitude indicates strength of relationship
- |r| close to 1: strong association
- |r| close to 0: weak association
• Sign indicates direction of trend
- Negative r: negative association
- Positive r: positive association
As |r| increases, there is less vertical variation in the data (the trend is stronger).
Correlation doesn't imply causation - An association between two variables is not sufficient evidence to conclude that a cause-and-effect relationship exists between the variables.

Fence Rule

Fences:
Lower fence = Q1 - 1.5(IQR)
Upper fence = Q3 + 1.5(IQR)
• Outliers: Any value below the lower fence or above the upper fence is considered a potential outlier (for example, any temperature in a set of temperature data).
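
A small Python sketch of the fence rule. The temperatures and the quartiles Q1 = 50 and Q3 = 60 are hypothetical; the quartiles are passed in rather than computed, because different textbooks and software use slightly different quartile conventions.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) using 1.5 × IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def potential_outliers(data, q1, q3):
    """Values below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [x for x in data if x < lower or x > upper]


# Hypothetical temperatures with assumed quartiles Q1 = 50 and Q3 = 60.
temps = [31, 48, 52, 55, 57, 60, 63, 95]
print(fences(50, 60))
print(potential_outliers(temps, 50, 60))
```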

Difference in independent population means

General structure: Estimate ± margin of error
Specifically: (estimate of difference) ± t* × (SE of the estimate of difference)
Estimate of difference = mean of first sample - mean of second sample
SE of the estimate of difference = sqrt( s1^2/n1 + s2^2/n2 )
Two-Sample t-Interval: (x bar1 - x bar2) ± t* × sqrt( s1^2/n1 + s2^2/n2 )
t* is based on an approximate t-distribution with df as the smaller of n1 - 1 and n2 - 1. For more accuracy, use technology.
Interpreting confidence intervals for the difference of population means given independent samples is the same as interpreting confidence intervals for the difference of population proportions.
1. If 0 is in the interval, there is no significant difference between μ1 and μ2.
2. If both values in the confidence interval are positive, we conclude μ1 > μ2.
3. If both values in the confidence interval are negative, we conclude μ1 < μ2.
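
A sketch of the two-sample t-interval in Python. The critical value t* is passed in as an argument (looked up from a t-table or technology for the chosen confidence level); the sample summaries and the t* = 2.0 in the example are hypothetical.

```python
import math


def conservative_df(n1, n2):
    """Degrees of freedom: the smaller of n1 - 1 and n2 - 1."""
    return min(n1 - 1, n2 - 1)


def two_sample_t_interval(x_bar1, s1, n1, x_bar2, s2, n2, t_star):
    """Return (lower, upper) for (x bar1 - x bar2) ± t* × sqrt(s1^2/n1 + s2^2/n2)."""
    estimate = x_bar1 - x_bar2
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    margin = t_star * se
    return estimate - margin, estimate + margin


# Hypothetical summaries; t_star = 2.0 is a placeholder, not a table value.
print(conservative_df(9, 16))
low, high = two_sample_t_interval(10, 3, 9, 8, 4, 16, t_star=2.0)
print(low, high)

# 0 inside the interval means no significant difference between the means.
print(low <= 0 <= high)
```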

State the hypotheses

H0, the Null Hypothesis
- The neutral, status quo, skeptical statement about a population parameter
- Always contains =
Ha, the Alternative Hypothesis
- The research hypothesis; the statement about a population parameter we intend to demonstrate is true
- Always contains <, >, or ≠
NOTE: Hypotheses are always about population PARAMETERS; they are never about sample statistics.
The null hypothesis always gets the benefit of the doubt throughout the hypothesis-testing procedure. We only reject the null hypothesis if the observed outcome would be extremely unusual if the null hypothesis were true. This is analogous to assuming that a defendant in a jury trial is innocent unless proven guilty "beyond a reasonable doubt."
Example: A 2014 Pew Poll found that 61% of Americans believed in global warming. A researcher believes this rate has declined. State the null and alternative hypotheses in words and in symbols.
H0: p = 0.61 The same proportion of Americans believe in global warming as in the past.
Ha: p < 0.61 The proportion of Americans who believe in global warming has decreased.
The sign in the alternative hypothesis determines whether a hypothesis is one-sided or two-sided.
- Two-sided: H0: p = p0, Ha: p ≠ p0
- One-sided (left): H0: p = p0, Ha: p < p0
- One-sided (right): H0: p = p0, Ha: p > p0
Example: During the previous fiscal year, 30% of a retailer's sales were due to online sales. In an effort to increase this percentage the retailer has purchased ads on social media sites. The retailer gathers data on types of sales on a sample of 20 days during the current fiscal year. Has there been a change in online sales? 1. Write the H0 and Ha for the retailer. 2. Is this a one-sided or two-sided hypothesis? 3. What other hypotheses could the retailer have posed?
H0 : p = 0.30, Ha : p ≠ 0.30. This is a two-sided hypothesis.
The retailer could also have asked, "Have sales increased?" Then H0 : p = 0.30, Ha : p > 0.30. This would be a one-sided (right) test.
If the retailer asked, "Have sales decreased?" then H0 : p = 0.30, Ha : p < 0.30, and this would be a one-sided (left) test.

Sample statistics

Sample statistics have sampling distributions.
• The sampling distribution of a statistic is the distribution of all possible values taken by the statistic when all possible samples of a fixed size n are taken from the population.
• It is a theoretical idea; in reality, we often do not actually build it (however we simulated creating some last class).
• The sampling distribution of a statistic is the probability distribution of that statistic.
• Suppose we have a population with proportion p. From this population select all possible samples of size n and determine the proportion p hat in each sample. The distribution of the sample proportions is called the sampling distribution of the proportion.
The sample proportion changes from sample to sample, but the population proportion (p) remains the same. The probability distribution of p hat is called the sampling distribution. We can represent it using a table or a graph. Our estimator, p hat, is not always the same as our parameter, p.
In the class example:
• The mean of the sampling distribution is 25%, the same as the value of p. This indicates that the estimator p hat is unbiased.
• Even though p hat is not always equal to p, the estimate is never more than 25 percentage points away from p.
The standard error (SE) is the standard deviation of the sampling distribution. It measures how much an estimator typically varies from sample to sample. When the standard error is small, we say the estimator is precise.
The bias of p hat is 0 and the standard error of p hat is SE = sqrt( p(1 - p) / n ) if the following conditions are met:
1. The sample is randomly selected from the population of interest.
2. If the sampling is without replacement, the population must be at least 10 times larger than the sample size.
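
A simulation sketch of a sampling distribution for p hat. It assumes p = 0.25, samples of size n = 100, and 2000 repeated samples; the sample size, number of repetitions, and seed are arbitrary choices, and each draw is treated as independent (as if sampling from a very large population).

```python
import math
import random
import statistics


def simulate_sample_proportions(p, n, reps, seed=1):
    """Draw reps random samples of size n and return the p hat from each."""
    rng = random.Random(seed)
    p_hats = []
    for _ in range(reps):
        successes = sum(rng.random() < p for _ in range(n))
        p_hats.append(successes / n)
    return p_hats


p, n = 0.25, 100
p_hats = simulate_sample_proportions(p, n, reps=2000)

# The mean of the simulated p hats should sit close to p (unbiased estimator).
print("mean of p hat:", statistics.mean(p_hats))

# The spread of the simulated p hats should be close to sqrt(p(1 - p)/n).
print("simulated SE:", statistics.pstdev(p_hats))
print("formula SE:  ", math.sqrt(p * (1 - p) / n))
```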

Typical Value (center)

From a histogram: find the highest bar, average the values at its two edges, and round down to estimate the typical value.
The mean is the most common measure of center. It is what most people think of when they hear the word "average". However, the mean is affected by extreme values, so it may not be the best measure of center to use in a skewed distribution.
The median is the value in the center of the data. Half of the values are less than the median and half of the values are more than the median. It is probably the best measure of center to use in a skewed distribution.

Discrete and Continuous

• Histograms tend to be favored in practice.
• Histograms work well for numerical variables that are continuous or discrete. Dotplots only work well for continuous variables if the sample sizes are small.
• Discrete: Possible values only occur at specific points. - Example: number of pets you own
• Continuous: Possible values can be any number in an interval. - Example: weight

Types of studies ◦ Observational vs. Experiment (Advantages and disadvantages)

In an experiment, investigators apply treatments to experimental units (people, animals, plots of land, etc.) and then observe the effect of the treatments on the experimental units. In a randomized experiment, investigators control the assignment of treatments to experimental units using a chance mechanism (like the flip of a coin or a computer's random number generator).
Advantages of experiments:
- gain insight into methods of instruction
- intuitive practice shaped by research
- can be combined with other research methods for rigor
Disadvantages of experiments:
- subject to human error
- groups may not be comparable
- human response can be difficult to measure
- results may only apply to one situation and may be difficult to replicate
In an observational study, investigators observe subjects and measure variables of interest without assigning treatments to the subjects. The treatment that each subject receives is determined beyond the control of the investigator.
Advantages of observational studies: A key advantage of conducting observations is that you can observe what people actually do or say, rather than what they say they do. People are not always willing to write their true views on a questionnaire or tell a stranger what they really think at interview. Observations can be made in real life situations, allowing the researcher access to the context and meaning surrounding what people say and do. There are numerous situations in the area of criminology, and related disciplines, where approaching people for interview or questionnaire completion is unlikely to yield a positive response, but where observations could yield valuable insights on an issue.
Disadvantages of observational studies: There are a number of very important problems associated with observational research. An important one relates to the role of the observer and what effect he or she has on the people and situations observed. This is difficult to gauge. There is also the additional problem of being able to write an account, as a researcher, when one is immersed in a situation or culture. This latter situation can mean that the research is dismissed as too subjective. Observation can be very time consuming. Some well known observational pieces of research took some years of observation and immersion in a situation or culture. However, it is more common in modern research to reduce the observation time substantially. Observation time may be further reduced in experimental conditions (laboratory or simulation), in other words, controlled settings. An important potential disadvantage in conducting observational research is the ethical dilemmas inherent in observing real life situations for research purposes.

Titling and Labeling Graphs

Labeling the X-Axis: The x-axis of a graph is the horizontal line running side to side. Where this line intersects the y-axis, the x coordinate is zero. When using a graph to represent data, the x-axis should represent the independent variable. The independent variable is the one that affects the other. For example, if you were plotting time worked against dollars made, time would be the independent variable because time would pass regardless of income.
Labeling the Y-Axis: The y-axis of the graph is the vertical line running top to bottom. Where this line intersects the x-axis, the y coordinate is zero. When using a graph to represent data, the y-axis should represent the dependent variable. The dependent variable is the one that is affected by the independent variable. For example, if you were plotting time worked against dollars made, dollars made would be the dependent variable because the amount made depends on how many hours were worked.
Titling the Graph: Your graph isn't complete without a title that summarizes what the graph depicts. The title is usually centered, either above or below the graph. The proper form for a graph title is "y-axis variable vs. x-axis variable." For example, if you were comparing the amount of fertilizer to how much a plant grew, the amount of fertilizer would be the independent (x-axis) variable and the growth would be the dependent (y-axis) variable. Therefore, your title would be "Plant Growth vs. Amount of Fertilizer."

Definition of Statistics (as a field of study)

Statistics is the science of collecting, organizing, summarizing, and analyzing data to answer questions and/or draw conclusions. Two key concepts: • Data: observations you or someone else records • Variation: different types of outcomes

Variability (spread)

Look at the horizontal spread in the histogram or dotplot: • If all data values are similar: Narrow graph • If data values are different: Wide graph

Typical Value (mode)

Mode: the value or category that occurs most frequently. Key difference between numerical and categorical data when deciding whether a distribution has more than one mode: - Numerical data: the mounds do not need to be the same height. - Categorical data: the modal categories must be roughly the same height.

Definition of Data

Observations gathered to draw conclusions; observations you or someone else records

Populations vs. Samples

Population - Collection of all data values that ever will occur for a group - Often difficult to obtain all of this information Sample - A subset of the population - Represents the population at large - Easier to obtain this information

Meaning of Probability

Probability is the study of randomness. Generally, probability means the chance of an event occurring, and it is used to measure how often random events occur. We want to give CHANCE a definite, clear interpretation. • When tossing a coin, the probability of a head is ½ or 50%. This means that the coin will land on heads about 50% of the time. • Two types: 1. Theoretical 2. Empirical
Theoretical Probabilities • Long-run relative frequencies • The relative frequency at which an event occurs after infinitely many repetitions. Example: If we were to flip a fair coin infinitely many times, the proportion of heads would settle at 50%.
Empirical Probabilities • Relative frequencies based on an experiment or on observation of a real-life process. Example: I toss a coin 10 times and get 4 heads. The empirical probability of getting heads is 4/10 = 0.4, or 40%.
• Theoretical probabilities are always the same value. Example: The theoretical probability of getting heads when tossing a coin is always 0.50 or 50%. • Empirical probabilities change with every experiment. Example: If I toss a coin 10 times and get 7 heads, the empirical probability of heads = 0.70 or 70%. If I toss a coin 10 times and get 3 heads, the empirical probability of heads = 0.30 or 30%.
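The contrast between empirical and theoretical probability can be sketched with a short simulation. This is only an illustration: the function name `empirical_probability` and the seed value are assumptions made for the example, and the simulated coin is assumed to be fair.

```python
import random


def empirical_probability(num_tosses, seed=None):
    """Toss a simulated fair coin num_tosses times; return the share of heads."""
    rng = random.Random(seed)  # seeded so the run can be repeated
    heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return heads / num_tosses


# Small experiments vary a lot; large ones drift toward the theoretical 0.5.
for n in (10, 1_000, 100_000):
    print(n, empirical_probability(n, seed=1))
```

With only 10 tosses the result can easily be 0.3 or 0.7; with many tosses it tends to sit close to 0.50.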

Stacked vs. Unstacked Data.

Stacked Data - Data values stored in a spreadsheet format - Each row contains data for a single individual - Can store many variables! Example: - Each row corresponds to one infant - Variables: (top of columns) • Birth weight • Gender • Smoking status of the mother Unstacked Data - Data values are stored in two columns - Each column is a different group from a variable - Can only store data for two variables - Info in a row does NOT correspond to the same individual

Confidence intervals for a single population mean

A confidence interval provides a range of plausible values for the population mean along with a measure of the uncertainty in our estimate. A higher confidence level means the method captures the population mean more often, but the interval is wider. Use confidence intervals whenever you are estimating the value of a population parameter on the basis of a random sample. Do NOT use a confidence interval if there is no uncertainty in your estimate. If you have data for the entire population you don't need a confidence interval, since the population parameter is known and there is no need to estimate it.
Before constructing a confidence interval for a population mean, check these three conditions: 1. Random, independent sample 2. Large sample - Either the population is Normally distributed or the sample size is at least 25. 3. Big population - If the sample is collected without replacement, the population must be at least 10 times larger than the sample size.
Example: A used car website wanted to estimate the mean price of a 2012 Nissan Altima. The site gathered data on a random sample of 30 such cars and found a sample mean of $16,610 and a sample standard deviation of $2736. The 95% confidence interval for the mean cost of this model car based on this data is (15588, 17632). a. Describe the population. Is the number $16,610 an example of a parameter or a statistic? The population is all 2012 Nissan Altimas that are for sale. The number $16,610 is a statistic because it is a measure of a sample, not a population. b. Verify that the conditions for a valid confidence interval are met. The sample is random and independent. The sample size (30) is large (at least 25) and the population is big (we can assume there are at least 10 x 30 = 300 2012 Altimas for sale).
The confidence level is a measure of how well the method used to produce the confidence interval performs.
For example, a 95% confidence level means that if we were to take many random samples of the same size from the same population, we expect 95% of the resulting intervals would "work" (contain the population parameter) and 5% of them would be "wrong" (not contain the population parameter).
General structure: estimator ± margin of error, where margin of error = (multiplier) × SE and SE = σ/√n. Because we usually do not know the population standard deviation σ, we replace SE with its estimate that uses the sample standard deviation s.
The one-sample t-interval is a confidence interval for a population mean: x̄ ± m, where m = t* × SE_est and SE_est = s/√n. The multiplier t* is found using the t-distribution with n − 1 degrees of freedom. To construct a one-sample t-interval you need four pieces of information: 1. The sample mean, x̄ (x bar), that you calculate from the data. 2. The sample standard deviation, s, which you calculate from the data. 3. The sample size, n. 4. The multiplier, t*, which you can look up in a table or find using technology. t* is determined by the confidence level and the sample size. The value of t* tells us how wide the margin of error is, in terms of standard errors. For example, if t* = 2 then our margin of error is 2 standard errors wide. t* can be found using a t-distribution table, but because most tables stop at 35 or 40 degrees of freedom, it is best to use technology to construct confidence intervals.
Example (hypothesis test for a mean): In 2010, the mean years of experience among a nursing staff was 14.3 years. A nurse manager took a survey of a random sample of 35 nurses at the hospital and found a sample mean of 18.37 years with a standard deviation of 11.12 years. Do we have evidence that the mean years of experience among the nursing staff at the hospital has increased? Use a significance level of 0.05. 1. Hypotheses: H0 : μ = 14.3, Ha : μ > 14.3 2. Prepare: Random, independent sample; sample size is large (35 > 25) 3. Compute to Compare: SE_est = 11.12/√35 = 1.88 and t = (18.37 − 14.3)/1.88 = 2.16. Use technology to find the p-value that corresponds with t = 2.16 and df = 34.
Just as with hypothesis tests for proportions, hypothesis tests for means can be one-sided or two-sided depending on the research question. The choice of the alternative hypothesis determines how the p-value is calculated. The procedure is the same as for proportions, but with μ in place of p.
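The two calculations on this card (the Altima interval and the nursing t-statistic) can be reproduced with a minimal sketch. The function names are made up for illustration, and the multiplier 2.045 is an assumption: it is the usual table value of t* for 95% confidence with 29 degrees of freedom, entered by hand rather than computed.

```python
import math


def t_interval(sample_mean, sample_sd, n, t_star):
    """One-sample t-interval: x_bar +/- t* * s / sqrt(n)."""
    se_est = sample_sd / math.sqrt(n)   # estimated standard error
    margin = t_star * se_est            # margin of error
    return sample_mean - margin, sample_mean + margin


def t_statistic(sample_mean, null_mean, sample_sd, n):
    """One-sample t-statistic: (x_bar - mu_0) / (s / sqrt(n))."""
    return (sample_mean - null_mean) / (sample_sd / math.sqrt(n))


T_STAR_95_DF29 = 2.045  # assumed table value for 95% confidence, df = 29

low, high = t_interval(16610, 2736, 30, T_STAR_95_DF29)
print(round(low), round(high))

t = t_statistic(18.37, 14.3, 11.12, 35)
print(t)
```

The interval rounds to (15588, 17632), matching the card. The t-statistic comes out near 2.165 when the standard error is not rounded first, which is why the card's 2.16 (computed from SE = 1.88) may differ in the last digit.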

Categorical vs. Numerical Data

Quantitative data is a numerical measurement expressed not by means of a natural language description, but rather in terms of numbers. These data have meaning as a measurement, such as a person's height, weight, IQ, or blood pressure; or they're a count, such as the number of stock shares a person owns, how many teeth a dog has, or how many pages you can read of your favorite book before you fall asleep. Numerical data can be further broken into two types: discrete and continuous.
• Discrete data represent items that can be counted; they take on possible values that can be listed out. The list of possible values may be fixed (also called finite); or it may go from 0, 1, 2, on to infinity (making it countably infinite). For example, the number of heads in 100 coin flips takes on values from 0 through 100 (finite case), but the number of flips needed to get 100 heads takes on values from 100 (the fastest scenario) on up to infinity (if you never get to that 100th heads). Its possible values are listed as 100, 101, 102, 103, . . . (representing the countably infinite case).
• Continuous data represent measurements; their possible values cannot be counted and can only be described using intervals on the real number line. For example, the exact amount of gas purchased at the pump for cars with 20-gallon tanks would be continuous data from 0 gallons to 20 gallons, represented by the interval [0, 20], inclusive. You might pump 8.40 gallons, or 8.41, or 8.414863 gallons, or any possible number from 0 to 20. In this way, continuous data can be thought of as being uncountably infinite. For ease of recordkeeping, statisticians usually pick some point in the number to round off. Another example would be that the lifetime of a C battery can be anywhere from 0 hours to an infinite number of hours (if it lasts forever), technically, with all possible values in between.
Granted, you don't expect a battery to last more than a few hundred hours, but no one can put a cap on how long it can go (remember the Energizer Bunny?). Qualitative data is a categorical measurement expressed not in terms of numbers, but rather by means of a natural language description. In statistics, it is often used interchangeably with "categorical" data. Categorical data represent characteristics such as a person's gender, marital status, hometown, or the types of movies they like. Categorical data can take on numerical values (such as "1" indicating male and "2" indicating female), but those numbers don't have mathematical meaning. You couldn't add them together, for example.

Sampling

Simple Random Sampling (SRS) • Draw subjects at random from the population without replacement • A random sample is one in which every member of the population is equally likely to be chosen for the sample. • A true random sample is difficult to achieve. • Statisticians have developed methods for producing random samples that can be used to estimate characteristics of populations. Measurement Bias • Results from asking questions that do not produce a true answer; occurs when measurements tend to record values larger (or smaller) than the true value Example: Asking people, "How much do you weigh?" It is likely that people will report a number less than their actual weight, resulting in an estimate that tends to be too small. Sampling Bias • Occurs when a sample is used that is not representative of the population Example: Internet polls - people who answer these polls tend to be those who have strong feelings about the results and are not necessarily representative of the population

Shape (including deviations of the overall pattern)

Symmetric: left and right sides are roughly the same.
Symmetric (bell shaped) - when graphed, a vertical line drawn at the center will form mirror images, with the left half of the graph being the mirror image of the right half of the graph. In the histogram and dot plot, this shape is referred to as being a "bell shape" or a "mound". The most typical symmetric histogram or dot plot has the highest vertical column in the center. This shape is often referred to as being a "normal curve" (or normal distribution). Not all symmetric graphs, however, have this shape (see Symmetric U-shaped below).
Symmetric (U-shaped) - as mentioned above, a symmetric graph forms a mirror image of itself when reflected in its vertical center line. Unlike the previous graphs, these histograms and dot plots have more of a U shape.
Uniform - the data is spread equally across the range. There are no clear peaks in these graphs, since each data entry appears the same number of times in the set. In a boxplot of uniform data, each section is of equal length: min to Q1, Q1 to median, median to Q3, and Q3 to max. These graphs are also symmetric.
Asymmetric:
Skewed Right (positively skewed) - fewer data points are found to the right of the graph (toward the larger numeric values). The "tail" of the graph is pulled toward higher positive numbers, or to the right. The mean typically gets pulled toward the tail, and is greater than the median.
Skewed Left (negatively skewed) - fewer data points are found to the left of the graph (toward the smaller numeric values). The "tail" of the graph is pulled toward the lower or negative numbers, or to the left. The mean typically gets pulled toward the tail, and is less than the median.

Symmetric Distributions

The mean (balancing point) represents a typical value in a set of data when the data is roughly symmetric. For skewed distributions, the mean is NOT a good estimate of a typical value. Populations - We have access to all of the individuals in a group of interest. - Descriptive measures of populations are called parameters. - Parameters are often (but not always) written using Greek letters (e.g. μ). • Samples - We can only access a portion of the individuals in a group of interest. - Descriptive measures of samples are called statistics. - Statistics are often written using Roman letters (e.g. x bar ). Measure Spread: Standard Deviation Standard deviation • A number that measures how far away the typical observation is from the mean (center) • For most distributions, a majority of the data is within one standard deviation of the mean. Note: • Think of the standard deviation as the typical distance of the observations from their mean. Variance: Formula The variance is the standard deviation squared

Skewed Distributions

The median is the middle data point when the data has been sorted from smallest to largest. - If the sample size, n, is odd, the median is the middle ordered data value. - If n is even, the median is the mean of the two middle ordered data values. The median is a good measure of a typical value for skewed distributions. The mean is affected by outliers. The median is not affected by outliers and its value does not respond strongly to changes in a few observations (no matter how large).
Measuring the Spread: The standard deviation measures spread using the distance from the mean. Since we don't use the mean in skewed distributions, we need a measure of spread related to the median, so we use the Interquartile Range (IQR).
Quartiles: Three numbers that divide an ordered data set into four equal-sized groups, each containing 25% of the data. • The first quartile Q1 has 25% of the data below it. • Q2 has 50% of the data below it. • The third quartile Q3 has 75% of the data below it. • Q2 is also known as the median.
Finding quartiles: Order the data and find the median. • For Q1, look at the lower half of the data values, those to the left of the median location; find the median of this lower half. • For Q3, look at the upper half of the data values, those to the right of the median location; find the median of this upper half. • When n is even, split the data set in half and find the median of the lower and upper halves to find the quartiles. • When n is odd, the median itself splits the data. Do not include it when finding the medians of the lower and upper halves.
IQR = Q3 − Q1
Right skewed: the mean is greater than the median (pulled toward the right tail). Left skewed: the mean is less than the median (pulled toward the left tail).
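The quartile rule described above (median of each half, leaving the median out of both halves when n is odd) can be sketched as follows. The function names and the sample data are invented for illustration; note that software packages often use slightly different quartile conventions, so their answers may not match this rule exactly.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)


def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


odd_data = [1, 3, 5, 7, 9, 11, 13]   # hypothetical data, n = 7
even_data = [2, 4, 6, 8]             # hypothetical data, n = 4
print(quartiles(odd_data), iqr(odd_data))
print(quartiles(even_data), iqr(even_data))
```

For the odd-sized example the median 7 is excluded from both halves, giving Q1 = 3 and Q3 = 11, so IQR = 8.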

p- value

The null hypothesis tells us what to expect when we look at our data. If we see something unexpected - that is, when we are surprised - then we should doubt the null hypothesis. If we are really surprised, we should reject it altogether. The p-value gives us a way to numerically measure our surprise. It reports the probability that, if the null hypothesis is true, our test statistic will have a value as extreme as or more extreme than the value we actually observe. Small p-values (close to 0) mean we are really surprised. Large p-values (close to 1) mean we are not surprised at all.
Decision rule: If the p-value is less than or equal to our significance level, it is considered small (we are really surprised!) and we reject the null hypothesis. If the p-value is greater than our significance level, it is not considered small (we are not surprised) and we do not reject the null hypothesis. • If p-value ≤ α: reject H0 • If p-value > α: do not reject H0
Calculation of the p-value depends on the sign in the alternative hypothesis. If Ha contains ≠, we say the alternative hypothesis is two-sided and the p-value is the total of the shaded regions in both tails (a "two-tailed" p-value). If Ha contains a < or a >, we have a one-tailed p-value. If Ha contains a <, we have a left-tailed test and the p-value is the area to the left of the test statistic. If Ha contains a >, we have a right-tailed test and the p-value is the area to the right of the test statistic.
A small p-value means our test statistic is extreme. An extreme test statistic means something unusual, and therefore unexpected, has happened. Small p-values lead us to reject the null hypothesis.
If the conditions concerning the sampling distribution of the z-statistic fail to be met, then we cannot find a p-value using the Normal curve. Conditions can fail for the following reasons: 1. The sample size is too small. 2. Samples are not randomly selected - in this case conclusions may not generalize to the population.
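The three tail rules above can be sketched for a z-statistic using Python's standard library. This is an illustrative sketch only: the function names and the strings used to label the alternative ("<", ">", "!=") are assumptions for the example, and it applies only when the Normal-curve conditions hold.

```python
from statistics import NormalDist


def p_value(z, alternative):
    """p-value for a z-statistic; alternative is '<', '>' or '!='."""
    std_normal = NormalDist()  # standard Normal, N(0, 1)
    if alternative == "<":     # left-tailed: area to the left of z
        return std_normal.cdf(z)
    if alternative == ">":     # right-tailed: area to the right of z
        return 1 - std_normal.cdf(z)
    if alternative == "!=":    # two-tailed: both tail areas combined
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("alternative must be '<', '>' or '!='")


def decide(p, alpha=0.05):
    """Apply the decision rule: reject H0 when p-value <= alpha."""
    return "reject H0" if p <= alpha else "do not reject H0"


p = p_value(2.0, ">")
print(p, decide(p))
```

The same z-statistic gives a two-tailed p-value twice as large as the matching one-tailed value, which is why the choice of Ha matters.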

Discrete Random Variables and Probability Distributions

The probability distribution of a discrete random variable X gives each possible value and its corresponding probability. • Discrete random variables are usually counts. • The probability distribution is displayed as a table. • P(X = x) is read as the probability that the random variable X equals the value x.
Experiment: Rolling a die
x    P(X = x)
1    1/6
2    1/6
3    1/6
4    1/6
5    1/6
6    1/6
What is the probability of rolling a 5 or a 6? P(X=5 OR X=6) = P(X=5) + P(X=6) = 1/6 + 1/6 = 1/3
All P(X) values must be between 0 and 1. The sum of all P(X) values must equal 1.
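The die table and its two validity rules can be written out as a small sketch. The names `die`, `is_valid_distribution`, and `prob_of` are made up for this example; exact fractions are used so that the probabilities sum to exactly 1.

```python
from fractions import Fraction

# Probability distribution for one roll of a fair die: value -> P(X = value)
die = {face: Fraction(1, 6) for face in range(1, 7)}


def is_valid_distribution(dist):
    """Every probability is between 0 and 1, and they sum to 1."""
    in_range = all(0 <= p <= 1 for p in dist.values())
    return in_range and sum(dist.values()) == 1


def prob_of(dist, outcomes):
    """P(X is one of the outcomes); distinct values are mutually exclusive."""
    return sum(dist[x] for x in outcomes)


print(is_valid_distribution(die))
print(prob_of(die, {5, 6}))
```

Adding the probabilities is allowed here because a single roll cannot be both a 5 and a 6.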

Data Analysis

The process of compiling, analyzing, and interpreting the results of primary and secondary data collection. Identify the research objective Collect the information needed Organize and summarize the information Draw conclusions from the information

Significance level

The significance level is the probability of making the mistake of rejecting the null hypothesis when, in fact, the null hypothesis is true. The symbol for the significance level is α. For most applications a significance level of 0.05 is used, but 0.01 and 0.10 are also sometimes used. The significance level is given in the problem.

How to Find the Standard Deviation

The symbol for the population standard deviation is σ (the Greek letter sigma); the sample standard deviation is written s. To calculate the standard deviation of a set of numbers: 1. Work out the mean (the simple average of the numbers). 2. Then for each number: subtract the mean and square the result. 3. Then work out the mean of those squared differences. (For the sample standard deviation s, divide the sum of squared differences by n − 1 instead of n.) 4. Take the square root of that and we are done!
Some properties of the sample standard deviation: • s measures spread about the mean and should be used only when the mean is the measure of center. • s = 0 only when all observations have the same value and there is no spread. Otherwise, s > 0. • s is affected by outliers and skewness. • s has the same units of measurement as the original observations.
Note: We will say an observation is unusual if it is more than 2 standard deviations away from the mean.
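The four steps can be followed line by line in a short sketch. The function name, the `sample` switch, and the data set are assumptions chosen for illustration; dividing by n gives the population version of the steps as written, and dividing by n − 1 gives the sample standard deviation s.

```python
import math


def standard_deviation(values, sample=True):
    """Standard deviation following the four steps on the card."""
    n = len(values)
    mean = sum(values) / n                          # step 1: the mean
    squared = [(x - mean) ** 2 for x in values]     # step 2: squared differences
    divisor = n - 1 if sample else n                # n - 1 for a sample, n for a population
    return math.sqrt(sum(squared) / divisor)        # steps 3 and 4


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data with mean 5
print(standard_deviation(data, sample=False))
print(standard_deviation(data, sample=True))
```

For this data the squared differences sum to 32, so the population version is √(32/8) = 2, while the sample version √(32/7) is a little larger.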

Check CLT conditions and calculate the test statistic

The test statistic compares our observed outcome with the outcome we would get if the null hypothesis is true. When the test statistic is far away from the value we would expect if the null hypothesis is true, we reject the null hypothesis and conclude the evidence supports the alternative hypothesis.
The test statistic has the structure: z = (observed value − null value) / SE. For a proportion this gives: z = (p̂ − p0) / SE, with SE = √(p0(1 − p0)/n), where p̂ (p hat) is the sample proportion (from your data) and p0 is the proportion found in your null hypothesis (H0).
Example: In 2001, the Gallup Poll reported that 12% of Americans were "cool skeptics" who reported worrying "a little or not at all" about global warming. In 2014, 25% of those responding to a Gallup poll classified themselves as "cool skeptics." Assume that the sample size for the 2014 poll was 500. Can we conclude that the percentage of Americans who identify themselves as cool skeptics has increased since 2001? Calculate the test statistic and explain the value in context. H0 : p = 0.12, Ha : p > 0.12, z = 8.95. The observed value of the test statistic is 8.95. This means that while the null hypothesis claims a proportion of 0.12, the observed proportion was much higher - 8.95 standard errors higher - than what the null hypothesis claims.
Notes: • n = sample size. • When a confidence level c is given (as a decimal), α = 1 − c. • The null hypothesis is the claim being tested. • The alternative hypothesis is the mathematical opposite of the null, written with a strict inequality (<, > or ≠) rather than ≤ or ≥.
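The Gallup calculation can be checked with a minimal sketch of the one-proportion z-statistic. The function name is an assumption made for the example; the inputs are the values stated on the card.

```python
import math


def one_proportion_z(p_hat, p0, n):
    """z = (p_hat - p0) / SE, with SE = sqrt(p0 * (1 - p0) / n)."""
    se = math.sqrt(p0 * (1 - p0) / n)   # standard error under H0
    return (p_hat - p0) / se


z = one_proportion_z(p_hat=0.25, p0=0.12, n=500)
print(round(z, 2))
```

Note that the standard error uses the null value p0, not the sample proportion, because the test statistic is computed assuming the null hypothesis is true.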

Dependent and Independent samples

When comparing two populations, it is important to note whether the data are two independent samples or are paired (dependent) samples. In paired samples, each observation in one group is coupled or paired with one particular observation in the other group. Examples: • "Before and after" comparisons • Related objects/people (twins, siblings, spouses)
Example 1: People chosen in a random sample are asked how many minutes they spent reading and how many minutes they spent exercising during a certain day. Researchers wanted to know how different the mean amounts of time were for each activity. Is this a dependent or an independent sample? Dependent. The study was based on one sample of people who were measured twice (once for reading and once for exercise).
Example 2: A sample of men and women each had their hearing tested. Researchers wanted to know whether, typically, men and women differed in their hearing ability. Is this a dependent or an independent sample? Independent. Two different populations were used: one of men and the other of women. The people were not related.

Convert to Standard Normal Distribution: N(0,1)

Standardizing (converting a non-standard Normal value to a z-score): z = (x − μ)/σ. Convert using this formula, then use the table: look down the first column for the z-value through its first decimal place, go across the top row for the second decimal place, and read the probability where they meet. Convert the decimal to a percentage if needed. To find the probability between two values, standardize each value, look up both areas in the table, and subtract the smaller area from the larger.
Find P(Z < 1.25) • The table gives us the area under the standard normal curve to the left of z (which in this case is 1.25). • Look down the first column and find 1.2, then go across the top row until you find .05. This corresponds to 0.8944, which is P(Z < 1.25).
Find P(Z > 2.09) • The table does not give us this value directly. However, we can use complements. • What's not greater than 2.09? - Use P(Z > 2.09) = 1 - P(Z ≤ 2.09). We can find P(Z ≤ 2.09) in the table.
Reverse standardizing (finding a z-value from a given probability): The probabilities are found inside the table. Remember, they are probabilities to the left of a specific z-value. Example: find the z-value with an area of 0.10 to its right. • If 0.10 is the area to the right, then 0.90 is the area to the left. • Find the probability closest to 0.9000 inside the table. The closest value is 0.8997. • This probability is associated with the z-value 1.28.
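The table lookups above can be cross-checked with Python's standard library, which can stand in for the printed z-table. This is a sketch for checking answers; the helper name `standardize` and the numbers passed to it in the last line are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)  # standard Normal distribution N(0, 1)


def standardize(x, mu, sigma):
    """Convert x to a z-score: z = (x - mu) / sigma."""
    return (x - mu) / sigma


left_area = Z.cdf(1.25)        # P(Z < 1.25), area to the left
right_area = 1 - Z.cdf(2.09)   # P(Z > 2.09), by the complement rule
z_cutoff = Z.inv_cdf(0.90)     # z-value with 0.90 of the area to its left

print(round(left_area, 4), round(right_area, 4), round(z_cutoff, 2))
print(standardize(70, 64, 3))  # hypothetical x = 70 with mu = 64, sigma = 3
```

The first three results agree with the table values on the card (0.8944, about 0.0183, and 1.28); small differences in the last digit can appear because the table only lists z to two decimals.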

Coefficient of determination: r²

• The square of r, the correlation coefficient • Usually converted to a percentage, so always between 0% and 100% • Measures how much variation in the response variable is explained by the explanatory variable. • The larger r², the smaller the amount of variation or scatter about the regression line. Example: For the data on car age and predicted value, r = -0.778. Compute and interpret r². r² = (-0.778)² = 0.605, so r² = 60.5%. Car age explains about 60.5% of the variation in car value.

Paired differences in means

• Transform the original data from two variables into a single variable that contains the difference between the scores in Group 1 and Group 2. • After the differences have been computed, we can apply either a confidence interval approach or a hypothesis test approach to the differences. • The hypothesis test for two means is very similar to the test for one mean. • The hypothesis test for paired data is really a special case of the one-sample t-test

Central Limit theorem lets us make assumptions about sampling distributions and holds if:

• Used to estimate proportions in a population • Tells us that, if some basic conditions are met, the sampling distribution of the sample proportion is close to the Normal distribution - We can use the Normal distribution to calculate probabilities associated with sample proportions.
Conditions: ◦ Random and independent sample ◦ Large sample: The sample size, n, is large enough that the sample expects at least 10 successes (yes) and 10 failures (no): np ≥ 10 and n(1 − p) ≥ 10 ◦ Big population: If sampling is done without replacement, the population must be at least 10 times larger than the sample size (N ≥ 10n).
◦ If the central limit theorem can be applied, we can assume the sampling distribution of the sample proportion is Normal with a mean of p and a standard error of SE = √(p(1 − p)/n).
Example (sample size n = 500, given proportion p = 0.59): First, check conditions: 1. The sample was random and independent. 2. Large sample size: np = 500(0.59) = 295, and 295 ≥ 10; n(1 − p) = 500(1 − 0.59) = 205, and 205 ≥ 10. 3. We can assume that there are at least 10(500) = 5000 vegetarians in the population. Since these conditions are met, the CLT tells us the sampling distribution will be approximately Normal, with SE = √(0.59 × 0.41/500).
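The numeric conditions and the standard error from the example can be sketched as below. The function names are invented for illustration, and the code cannot check the first condition: whether the sample is random and independent has to be judged from how the data were collected.

```python
import math


def clt_conditions_met(n, p, population_size=None):
    """Check the large-sample and big-population conditions for a proportion."""
    large_sample = n * p >= 10 and n * (1 - p) >= 10
    # Population size is often unknown; skip that check when it is not given.
    big_population = population_size is None or population_size >= 10 * n
    return large_sample and big_population


def se_proportion(p, n):
    """Standard error of the sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


print(clt_conditions_met(500, 0.59, population_size=5000))
print(se_proportion(0.59, 500))
```

For n = 500 and p = 0.59 the standard error works out to about 0.022.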

Sampling distribution of the sample mean:

◦ If the central limit theorem can be applied, we can assume the sampling distribution of the sample mean is Normal with a mean of µ and a standard error of SE = σ/√n. As the sample size increases, the sample mean becomes more precise.
Example: A student has a large digital music library. The mean length of the songs is 258 seconds with a standard deviation of 87 seconds. The student creates a playlist that consists of 30 randomly selected songs. a. Is the mean value of 258 seconds a parameter or a statistic? Explain. It is a parameter, because it describes the entire library (the population). b. What should the student expect the average song length to be for his playlist? 258 seconds, since the sampling distribution of the sample mean is centered at µ. c. What is the standard error? SE = 87/√30 = 15.88 seconds. d. If the playlist had 50 songs, the standard error would be smaller because the standard error decreases as the sample size increases.
If certain conditions are met, the Central Limit Theorem assures us that the distribution of sample means follows an approximately Normal distribution no matter what the shape of the population distribution.
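Parts c and d of the playlist example can be checked with a short calculation. The function name is an assumption for the sketch; the inputs are the σ and n values given on the card.

```python
import math


def se_mean(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


print(round(se_mean(87, 30), 2))   # playlist of 30 songs
print(round(se_mean(87, 50), 2))   # playlist of 50 songs
```

Going from 30 to 50 songs shrinks the standard error from about 15.88 to about 12.30 seconds, which illustrates why larger samples give more precise sample means.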

