Statistics exam 1
Qualitative Variables
- Classify individuals into categories.
Nominal Variables
- Do not have a natural ordering.
Ordinal Variables
- Have a natural ordering.
The difference between a population and a sample:
-A population is the entire collection of individuals about which information is sought. -A sample is a subset of a population containing the individuals that are actually observed.
Quantitative Variables
-Tell how much or how many of something there is.
Various sources of bias including:
Voluntary Response Bias, Self-Interest Bias, Social Acceptability Bias, Leading Question Bias, Non-Response Bias, Sampling Bias. A big sample doesn't make up for bias.
How to use the least-squares regression line to make predictions:
We can use the least-squares regression line to predict the value of the outcome variable by substituting a value for the explanatory variable in the equation of the least-squares regression line.
Incorrect sizing of graphical images by not following the Area Principle:
We often use images to compare amounts. Larger images correspond to greater amounts. To use images properly in this way, we must follow a rule known as The Area Principle.
How to use The Empirical Rule to describe a bell-shaped data set:
When a data set has a bell-shaped histogram, it is often possible to use the standard deviation to provide an approximate description of the data using a rule known as the Empirical Rule:
How to use Chebyshev's Inequality to describe any data set:
When a distribution is bell-shaped, we use the Empirical Rule to approximate the proportion of data within one or two standard deviations of the mean. Another rule called Chebyshev's Inequality holds for any data set. -In any data set, the proportion of the data that is within K standard deviations of the mean is at least 1 − (1/K^2). -Specifically, by setting K = 2 or K = 3, we obtain the following results. -At least 3/4, or 75%, of the data are within two standard deviations of the mean. -At least 8/9, or 89%, of the data are within three standard deviations of the mean.
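As a minimal sketch of the bound 1 − (1/K^2), the function below returns the guaranteed minimum proportion for a given K. The function name is made up for illustration, and exact fractions are used so that K = 2 and K = 3 reproduce the 3/4 and 8/9 stated above.

```python
from fractions import Fraction


def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the bound is zero or negative, so it says nothing useful.
        raise ValueError("Chebyshev's Inequality is informative only for k > 1")
    return 1 - 1 / Fraction(k) ** 2


for k in (2, 3):
    bound = chebyshev_lower_bound(k)
    print(f"K = {k}: at least {bound} ({float(bound):.0%}) of the data")
```

The result is a lower bound only. A particular data set may have a much larger proportion within K standard deviations.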
Z-score and the empirical rule:
When a population has a histogram that is approximately bell-shaped, then: Approximately 68% of the data will have z-scores between −1 and 1. Approximately 95% of the data will have z-scores between −2 and 2. All, or almost all of the data will have z-scores between −3 and 3.
Incorrect sizing of graphical images by not following the Area Principle: The Area Principle:
When amounts are compared by constructing an image for each amount, the areas of the images must be proportional to the amounts. For example, if one amount is twice as much as another, its image should have twice as much area as the other image.
Empirical Rule:
Approximately 68% of the data will be within one standard deviation of the mean. Approximately 95% of the data will be within two standard deviations of the mean. All, or almost all, of the data will be within three standard deviations of the mean. The Empirical Rule can be used for samples as well as populations. When we work with a sample, we use x̄ in place of μ and s in place of σ.
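A short sketch of how the Empirical Rule turns a mean and standard deviation into intervals. The function name and the example values (mean 100, standard deviation 15) are hypothetical, and the rule applies only when the data are approximately bell-shaped.

```python
def empirical_rule_intervals(mean, sd):
    """Return the intervals within 1, 2, and 3 standard deviations of the mean."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


# Hypothetical bell-shaped data set with mean 100 and standard deviation 15.
intervals = empirical_rule_intervals(100, 15)
labels = {1: "about 68%", 2: "about 95%", 3: "almost all"}
for k, (low, high) in intervals.items():
    print(f"{labels[k]} of the data lie between {low} and {high}")
```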
Chebyshev's Inequality vs. Empirical Rule:
Both Chebyshev's Inequality and the Empirical Rule provide information about the proportion of a data set within a given number of standard deviations of the mean.
How to construct a boxplot and use it to determine skewness:
Boxplot: graph that presents the five-number summary along with some additional information about a data set. There are several different kinds of boxplots.
The coefficient of variation is found by dividing the standard deviation by the mean
CV = σ/μ
Regression lines with TI-84:
Change mode: Step 1: Press MODE to access the mode menu. Step 2: Scroll down to the STAT DIAGNOSTICS option, and select On. Compute the regression line: Step 1: Enter the x-values into L1 and the y-values into L2. Step 2: Press STAT and the right arrow key to access the CALC menu. Step 3: Select LinReg(a+bx) and run the command.
How to compute and interpret the coefficient of variation:
Coefficient of variation (CV for short)- tells how large the standard deviation is relative to the mean. It can be used to compare the spreads of data sets whose values have different units.
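A minimal sketch of comparing spreads across different units with the CV, computed as the population standard deviation divided by the mean. The two data sets (heights in cm, weights in kg) are invented for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """CV = population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)


# Hypothetical data in different units.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
print(f"CV of heights: {cv_heights:.3f}")
print(f"CV of weights: {cv_weights:.3f}")
```

Both lists have the same standard deviation, but the weights have the larger CV because their mean is smaller, so their spread is larger relative to the mean.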
Case-control studies
Cohort study: Prospective cohort study Cross-sectional cohort study: Retrospective cohort study: Case-control study:
Median: Disadvantages:
Depends only on middle value or two middle values
How to determine the complement of an event: Two hundred students were enrolled in a Statistics class. Find the complements of the following events: -Exactly 50 of them are sophomores. The complement is that the number of sophomores is not 50. -More than 50 of them are sophomores. The complement is that 50 or fewer are sophomores. -At least 50 of them are sophomores. The complement is that fewer than 50 are sophomores.
EX: According to the Wall Street Journal, 40% of cars sold in recent years were small cars. What is the probability that a randomly chosen car sold in those years is not a small car? Solution: P(not a small car) = 1 − P(small car) = 1 − 0.40 = 0.60
How to find the five-number summary for a data set:
Five-number summary- consists of the median, the first quartile, the third quartile, the smallest value, and the largest value. These values are generally arranged in order.
How to construct a frequency and relative frequency distribution for quantitative data:
Frequency distribution: Relative frequency distribution:
Relative frequency distribution:
Given a frequency distribution, a relative frequency distribution can be constructed by computing the relative frequency for each class. Relative frequency= Frequency/ Sum of all frequencies
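The formula above can be sketched in a few lines. The class labels and counts below are hypothetical, and the function name is chosen only for this example.

```python
def relative_frequencies(frequencies):
    """Divide each class frequency by the sum of all frequencies."""
    total = sum(frequencies.values())
    return {cls: count / total for cls, count in frequencies.items()}


# Hypothetical frequency distribution with three classes.
freq = {"0-9": 4, "10-19": 10, "20-29": 6}
rel = relative_frequencies(freq)
for cls, value in rel.items():
    print(f"{cls}: {value:.2f}")
```

The relative frequencies always add up to 1, which is a quick check on the arithmetic.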
Equation for the least square regression line:
Given ordered pairs (x, y), with sample means x̄ and ȳ, sample standard deviations sx and sy, and correlation coefficient r, the equation of the least-squares regression line for predicting y from x is ŷ = b0 + b1x, where the slope is b1 = r(sy/sx) and the y-intercept is b0 = ȳ − b1x̄.
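A minimal sketch of the slope and intercept formulas, b1 = r(sy/sx) and b0 = (mean of y) − b1 × (mean of x), followed by a prediction made by substituting an x-value into the line. The function names and the summary statistics are hypothetical.

```python
def regression_coefficients(x_bar, y_bar, s_x, s_y, r):
    """Return (b0, b1) for the least-squares regression line."""
    b1 = r * s_y / s_x          # slope
    b0 = y_bar - b1 * x_bar     # y-intercept
    return b0, b1


def predict(b0, b1, x):
    """Predicted outcome for a given value of the explanatory variable."""
    return b0 + b1 * x


# Hypothetical summary statistics.
b0, b1 = regression_coefficients(x_bar=2.0, y_bar=5.0, s_x=1.0, s_y=2.0, r=0.5)
print(f"predicted y = {b0} + {b1}x")
print("prediction at x = 4:", predict(b0, b1, 4))
```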
mean: Disadvantages:
Highly influenced by extreme values (not resistant)
How to compute probabilities with equally likely outcomes:
If a sample space has n equally likely outcomes and an event A has k outcomes, then P(A) = k/n.
How sample means are used as estimates for population means:
It is often impossible to compute a population mean because we don't know all the values in the population. Instead, we draw a sample and use the sample mean as an estimate. That is, x̄ is often used to estimate μ. Sample means usually either overestimate or underestimate the population mean.
How to identify the mode of a data set:
The mode is the value that appears most frequently. If two or more values are tied for the most frequent, they are all considered to be modes. If the values all have the same frequency, the data set has no mode.
Self-Interest Bias:
Many advertisements contain data about their product being superior. The advertiser, however, may not report any data that tends to show that the product is inferior. Even more serious is a trend for companies to pay scientists or physicians to publish results involving their products or drugs. People who have a self-interest in the outcome of an experiment have an incentive to use biased methods.
How to compute and interpret the mean of a data set:
Mean Computing the mean:
How to compute the median:
Median is another measure of center. The median is a number that splits the ordered data set in half, so that half the data values are less than the median and half of the data values are greater than the median.
The five-number summary of a data set consists of:
Minimum First Quartile Median Third Quartile Maximum
How to determine whether events are mutually exclusive:
Mutually exclusive: if it is impossible for both events to occur. EX: -A die is rolled. Event A is that the die comes up 3, and event B is that the die comes up an even number. -These events are mutually exclusive since the die cannot both come up 3 and come up an even number. -A fair coin is tossed twice. Event A is that one of the tosses is heads, and Event B is that one of the tosses is tails. -These events are not mutually exclusive since, if the two tosses are HT or TH, then both events occur.
Weighted mean-
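Weighted mean = (w1x1 + w2x2 + ... + wnxn) / (w1 + w2 + ... + wn), where w1, ..., wn are the weights assigned to the values x1, ..., xn.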
How three-dimensional graphs can distort the perspective:
Newspapers and magazines often present three-dimensional bar graphs because they are visually impressive. Unfortunately, in order to make the tops of the bars visible, these graphs are often drawn as though the reader is looking down on them. This makes the bars look shorter than they really are.
Median: Advantages:
Not much influenced by extreme values (resistant)
Non-Response Bias:
People cannot be forced to answer questions or to participate in a study. A certain proportion of people who are asked to participate refuse to do so. These people are called non-responders. The opinions of non-responders may differ from the opinions of those who do respond. As a result, surveys with many non-responders are often biased.
How to compute a percentile of a data set:
Percentiles divide a data set into hundredths. For a number p between 1 and 99, the pth percentile separates the lowest p% of the data from the highest (100 − p)%. Step 1: Arrange the data in increasing order. Step 2: Let n be the number of values in the data set. For the pth percentile, compute L = (p/100)n. Step 3: If L is a whole number, the pth percentile is the average of the number in position L and the number in position L + 1. If L is not a whole number, round it up to the next higher whole number. The pth percentile is the number in the position corresponding to the rounded-up value.
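The three steps above can be sketched as follows, assuming p is a whole number between 1 and 99. The function name is made up, and integer arithmetic is used so that the "is L a whole number" check is exact. Other software may use a different percentile convention and give slightly different answers.

```python
def percentile(data, p):
    """pth percentile using the position L = (p/100) * n described above."""
    values = sorted(data)                 # Step 1: increasing order
    n = len(values)
    if (p * n) % 100 == 0:                # Step 3a: L is a whole number
        L = (p * n) // 100
        # Positions are 1-based, so position L is index L - 1.
        return (values[L - 1] + values[L]) / 2
    position = (p * n + 99) // 100        # Step 3b: round L up
    return values[position - 1]


data = [7, 1, 4, 10, 2, 9, 3, 8, 5, 6]   # hypothetical data, n = 10
print(percentile(data, 25))               # L = 2.5, rounded up to position 3
print(percentile(data, 50))               # L = 5, average of positions 5 and 6
```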
Cohort studies
Prospective, cross-sectional, and retrospective studies
The difference between a randomized experiment and an observational study:
Randomized experiment Observational study
The notation for a population mean and sample mean:
Recall that a population consists of an entire collection of individuals about which information is sought, and a sample consists of a smaller group drawn from the population. The method for calculating the mean is the same for both samples and populations. However the notation is different.
The standard deviation is not resistant:
Recall that a statistic is resistant if its value is not affected much by extreme values (large or small) in the data set. The standard deviation is not resistant. That is, the standard deviation is affected by extreme values.
When samples of convenience are acceptable:
A sample of convenience is a sample that is not drawn by a well-defined random method. It is used when it is difficult or impossible to draw a sample in a truly random way. The problem with samples of convenience is that they may differ systematically in some way from the population. If it is reasonable to believe that no important systematic difference exists, then it is acceptable to treat the sample of convenience as if it were a simple random sample.
Sampling Bias:
Sampling bias occurs when some members of the population are more likely to be included in the sample than others. For example, consumer surveys are often conducted by randomly calling people from a list of phone numbers. Because consumers without a phone number will be omitted from the sample, sampling bias is likely to occur. It is almost impossible to avoid sampling bias completely.
Histograms: Difference between frequency histograms and relative frequency histograms:
Scale on the vertical axis.
How the mean and median are related to the shape of a data set:
Skewed to the left: Skewed to the right: Approximately symmetric:
Social Acceptability Bias:
Social acceptability bias occurs when people are reluctant to admit to behavior that reflects negatively on them. This characteristic of human nature affects many surveys.
Leading Question Bias:
Sometimes questions are worded in a way that suggests a particular response. Consider this question: "Do you favor decreasing the heavy tax burden on middle-class families?" The words "heavy" and "burden" suggest that taxes are too high and encourage a "Yes" response. This is an example of leading question bias.
The difference between a statistic and a parameter:
Statistic- a number that describes a sample. Parameter- a number that describes a population.
How to compute the percentile corresponding to a given data value:
Step 1: Arrange the data in increasing order. Step 2: Let x be the data value whose percentile is to be computed. Use the following formula to compute the percentile: Percentile = 100 × (Number of values less than x + 0.5) / (Number of values in the data set). Round the result to the nearest whole number. This is the percentile corresponding to the value x.
General steps for constructing a frequency distribution:
Step 1: Choose a class width. Step 2: Choose a lower class limit for the first class. Step 3: Compute the lower limit for the second class by adding the class width to the lower limit for the first class. Step 4: Compute the lower limits for each of the remaining classes. Stop when the largest data value is included in a class. Step 5: Count the number of observations in each class and construct the frequency distribution.
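The five steps above can be sketched as follows. The function name and the data are hypothetical, and the sketch assumes the chosen first lower limit is no larger than the smallest data value.

```python
def frequency_distribution(data, class_width, first_lower_limit):
    """Count observations per class, keyed by each class's lower limit."""
    if first_lower_limit > min(data):
        raise ValueError("first lower limit must not exceed the smallest value")
    # Steps 3-4: keep adding the class width until the largest value is covered.
    lower_limits = [first_lower_limit]
    while lower_limits[-1] + class_width <= max(data):
        lower_limits.append(lower_limits[-1] + class_width)
    # Step 5: count the observations that fall in each class.
    counts = {lower: 0 for lower in lower_limits}
    for value in data:
        index = int((value - first_lower_limit) // class_width)
        counts[lower_limits[index]] += 1
    return counts


data = [3, 7, 12, 15, 18, 21, 29]         # hypothetical observations
dist = frequency_distribution(data, class_width=10, first_lower_limit=0)
for lower, count in dist.items():
    print(f"{lower}-{lower + 9}: {count}")
```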
How to approximate the standard deviation for grouped data:
Step 1: Compute the midpoint of each class and approximate the mean of the frequency distribution. Step 2: For each class, subtract the mean from the class midpoint to obtain (Midpoint − Mean). Step 3: For each class, square the difference obtained in Step 2 to obtain (Midpoint − Mean)^2, and multiply by the frequency to obtain (Midpoint − Mean)^2 × (Frequency). Step 4: Add the products (Midpoint − Mean)^2 × (Frequency) over all classes. Step 5: To compute the population variance, divide the sum obtained in Step 4 by N. To compute the sample variance, divide the sum obtained in Step 4 by n − 1. Step 6: Take the square root of the variance obtained in Step 5. The result is the standard deviation. Grouped data on the TI-84: Enter the midpoint for each class into L1 and the corresponding frequencies in L2. Next, select 1-Var Stats followed by L1, comma, L2.
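A minimal sketch of the six steps, taking class midpoints and frequencies as inputs. The function name and the two-class example are hypothetical. The grouped mean is approximated as the sum of (Midpoint) × (Frequency) divided by the sum of the frequencies.

```python
import math


def grouped_mean_and_sd(midpoints, frequencies, sample=True):
    """Approximate the mean and standard deviation from grouped data."""
    n = sum(frequencies)
    # Step 1: approximate mean from midpoints weighted by frequency.
    mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
    # Steps 2-4: sum of (midpoint - mean)^2 * frequency over all classes.
    total = sum((m - mean) ** 2 * f for m, f in zip(midpoints, frequencies))
    # Step 5: divide by n - 1 for a sample, by N for a population.
    variance = total / (n - 1) if sample else total / n
    # Step 6: the standard deviation is the square root of the variance.
    return mean, math.sqrt(variance)


# Hypothetical grouped data: two classes with midpoints 0 and 10.
mean, sd_population = grouped_mean_and_sd([0, 10], [2, 2], sample=False)
_, sd_sample = grouped_mean_and_sd([0, 10], [2, 2], sample=True)
print(mean, sd_population, round(sd_sample, 4))
```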
How to approximate the mean for grouped data:
Step 1: Compute the midpoint of each class. The midpoint of a class is found by taking the average of the lower class limit and the lower limit of the next larger class. Step 2: For each class, multiply the class midpoint by the class frequency. Step 3: Add the products (Midpoint)×(Frequency) over all classes. Step 4: Divide the sum obtained in Step 3 by the sum of the frequencies.
Population Variance Steps:
Step 1: Compute the population mean μ. Step 2: For each population value Xi, compute the deviation Xi − μ. Step 3: Square the deviations to obtain the quantities (Xi − μ)². Step 4: Sum the squared deviations to obtain Σ(Xi − μ)². Step 5: Divide the sum obtained in Step 4 by the population size N to obtain the population variance σ².
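The five steps above can be sketched directly. The function name and the eight-value population are invented for illustration.

```python
def population_variance(values):
    """Population variance: mean of the squared deviations from the mean."""
    n = len(values)
    mu = sum(values) / n                               # Step 1: population mean
    deviations = [x - mu for x in values]              # Step 2: deviations
    squared = [d ** 2 for d in deviations]             # Step 3: squared deviations
    return sum(squared) / n                            # Steps 4-5: sum, divide by N


population = [2, 4, 4, 4, 5, 5, 7, 9]                  # hypothetical population
variance = population_variance(population)
print("variance:", variance)
print("standard deviation:", variance ** 0.5)
```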
Histogram on TI-84:
Step 1: Enter the data in list L1. Step 2: Press 2nd, Y=, then 1 to access the Plot1 menu. Select On and the histogram plot type. Step 3: Press Zoom, 9 to view the plot.
Scatter Plot on TI-84:
Step 1: Enter the x-values in L1 and the y-values in L2. Step 2: Press 2nd, Y=, then 1 to access the Plot1 menu. Select On and the scatterplot type. Step 3: Press Zoom, 9 to view the plot.
IQR for detecting outliers:
Step 1: Find the first quartile and the third quartile. Step 2: Compute the interquartile range: IQR = Q3 − Q1. Step 3: Compute the outlier boundaries: Lower outlier boundary = Q1 − 1.5 IQR, Upper outlier boundary = Q3 + 1.5 IQR. These boundaries are the cutoff points for determining outliers. Step 4: Any data value that is less than the lower outlier boundary or greater than the upper outlier boundary is considered to be an outlier.
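A short sketch of the IQR method, assuming the usual convention that the boundaries sit 1.5 × IQR below Q1 and above Q3. The quartiles are passed in rather than computed, and the function names and data list are hypothetical.

```python
def outlier_boundaries(q1, q3):
    """Return (lower, upper) outlier boundaries from the two quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data, q1, q3):
    """Values below the lower boundary or above the upper boundary."""
    lower, upper = outlier_boundaries(q1, q3)
    return [x for x in data if x < lower or x > upper]


# Hypothetical values with first quartile 45 and third quartile 59.
data = [41, 45, 51, 59, 77, 100]
print(outlier_boundaries(45, 59))
print(find_outliers(data, 45, 59))
```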
How to construct a boxplot: Steps
Step 1: We may use the TI-84 Plus or other technology to compute the quartiles. Step 2: We draw vertical lines at 45, 51, and 59 (the first quartile, median, and third quartile), then horizontal lines to complete the box. Step 3: We compute the outlier boundaries: the IQR is 59 − 45 = 14, so the lower boundary is 45 − 1.5(14) = 24 and the upper boundary is 59 + 1.5(14) = 80. Step 4: The largest data value that is less than the upper boundary is 77. We draw a horizontal line from 59 up to 77. Step 5: The smallest data value that is greater than the lower boundary is 41. We draw a horizontal line from 45 down to 41. Step 6: The data value 100 lies outside of the outlier boundaries. Therefore, 100 is an outlier. We plot this point separately.
Some possible shapes of a data set including:
Symmetric (bell-shaped or uniform): Skewed to the right (positively skewed): Skewed to the left (negatively skewed): Unimodal: Bimodal: Uniformly distributed:
mean: Advantages:
Takes every value into account
Mean and Median on TI-84:
The 1-Var Stats command in the TI-84 Plus calculator displays a list of the most common parameters and statistics for a given data set. This command is accessed by pressing STAT and then highlighting the CALC menu.
Standard deviation on TI-84:
The 1-Var Stats command in the TI-84 Plus calculator returns both the sample standard deviation and the population standard deviation.
Incorrect positioning of the vertical axis:
The baseline of a graph is the value at which the horizontal axis intersects with the vertical axis. When a graph or plot represents how much or how many of something, it may be misleading if the baseline is not at zero. The same misleading information can be created with time-series plots.
What is a simple random sample:
The best sampling methods all involve random selection. The most basic, and in many cases the best, sampling method is simple random sampling, which helps represent a population as closely as possible. A simple random sample of size n is a sample chosen by a method in which each collection of n population items is equally likely to make up the sample.
Least-squares regression line:
The figures present scatterplots of the previous data, each with a different line superimposed. It is clear that the line in the figure on the left fits better than the line in the figure on the right. The reason is that the vertical distances are, on the whole, smaller. The line that fits best is the line for which the sum of squared vertical distances is as small as possible. This line is called the least-squares regression line.
Interquartile range (IQR):
The interquartile range is found by subtracting the first quartile from the third quartile.
Interpreting the predicted value ŷ:
The predicted value ŷ can be used to estimate the average outcome for a given value of x. For any given x, the value ŷ is an estimate of the average y-value for all points with that x-value.
Probability Rules:
The probability of an event is always between 0 and 1. In other words, for any event A, 0 ≤ P(A) ≤ 1. -If A cannot occur, then P(A) = 0. -If A is certain to occur, then P(A) = 1.
Interpreting the y- intercept b0:
The y-intercept b0 is the point where the line crosses the y-axis. This has a practical interpretation only when the data contain both positive and negative values of x.
Addition rule for mutually exclusive events: If events A and B are mutually exclusive, then P(A and B)=0
This leads to a simplification of the General Addition Rule: P(A or B) = P(A) + P(B).
Frequency distribution:
To summarize quantitative data, we use a frequency distribution just like those for qualitative data. However, since these data have no natural categories, we divide the data into classes. Classes are intervals of equal width that cover all values that are observed in the data set.
Various types of observational studies:
Two main types of observational studies: Cohort studies Case-control Studies
• Voluntary response sampling:
Used by the media to try to engage the audience. For example, a radio announcer will invite people to call the station to say what they think.
Ratio Level of Measurement-
Zero means the absence of the quantity, and ratios are meaningful.
Event:
a collection of outcomes of a sample space
Cohort study:
a group of subjects is studied to determine whether various factors of interest are associated with an outcome.
Computing the mean:
a list of n numbers is denoted by X1, X2, ..., Xn. Ex: ΣXi represents the sum X1 + X2 + ... + Xn. The mean is this sum divided by n.
How to compute a weighted mean:
a mean in which some numbers count more than others. To compute a weighted mean, we assign a positive number, called a weight, to each number, with the numbers that count more getting the larger weights.
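A minimal sketch of a weighted mean, computed as the sum of (weight × value) divided by the sum of the weights. The function name, scores, and weights are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical course: a midterm score of 80 with weight 1
# and a final score of 90 with weight 3.
print(weighted_mean([80, 90], [1, 3]))
```

When all the weights are equal, the weighted mean reduces to the ordinary mean.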
Range:
a simple way to measure spread, the difference between the largest value and the smallest value.
Outlier:
a value that is considerably larger or considerably smaller than most of the values in a data set. Some outliers result from errors; for example a misplaced decimal point may cause a number to be much larger or smaller than the other values in a data set. Some outliers are correct values, and simply reflect the fact that the population contains some extreme values.
Histograms: Number of classes:
affects shape.
Uniformly distributed:
all classes have approximately equal frequencies
How to use The General Addition Rule to compute probabilities of events in the form A or B: Compound event:
an event that is formed by combining two or more events. One type of compound event is of the form A or B. The event A or B occurs whenever A occurs, B occurs, or A and B both occur. Probabilities of events in the form A or B are computed using the General Addition Rule.
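As an illustration, the General Addition Rule, P(A or B) = P(A) + P(B) − P(A and B), can be checked on a fair die where every outcome is equally likely. The events and the helper function are made up for this sketch.

```python
from fractions import Fraction


def probability(event, sample_space):
    """P(event) when all outcomes in the sample space are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {2, 4, 6}                       # the die comes up even
B = {4, 5, 6}                       # the die comes up greater than 3

p_a_or_b = (probability(A, sample_space)
            + probability(B, sample_space)
            - probability(A & B, sample_space))
print(p_a_or_b)                     # same as counting outcomes in A or B directly
```

Subtracting P(A and B) avoids counting the outcomes 4 and 6 twice, since they belong to both events.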
Bar graphs - Horizontal bars
are sometimes more convenient when the categories have long names.
Computing the median:
Arrange the data values in increasing order. If the number of values is odd, the median is the middle number. If the number of values is even, the median is the average of the two middle numbers.
The Law of Large Numbers:
as a probability experiment is repeated again and again, the proportion of times that a given event occurs will approach its probability.
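A small simulation can make this concrete. The sketch below tosses a simulated fair coin and reports the proportion of heads for increasing numbers of tosses. The function name and seed are arbitrary choices, and exact proportions depend on the random number generator.

```python
import random


def proportion_of_heads(num_tosses, seed=0):
    """Simulate fair coin tosses and return the proportion that are heads."""
    rng = random.Random(seed)       # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return heads / num_tosses


for n in (10, 100, 10_000, 100_000):
    print(n, proportion_of_heads(n))
```

With more tosses, the proportion tends to settle closer to the probability 1/2.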
Bar graphs - side by side bars
compare two bar graphs that have the same categories.
• If there are large differences in outcomes among the treatment groups, we can conclude that the
differences are due to the treatments.
Quantitative variables can be further divided into
discrete and continuous variables:
Nonlinear:
does not form a line
Scatterplot:
each individual in the data set contributes an ordered pair of numbers, and each ordered pair is plotted on a set of axes.
A disadvantage of Chebyshev's Inequality is that
for most data sets, it provides only a very rough approximation.
Correlation coefficient example:
given ordered pairs (x, y) with sample means x̄ and ȳ, sample standard deviations sx and sy, and sample size n, the correlation coefficient r is given by r = (1/(n − 1)) Σ ((x − x̄)/sx)((y − ȳ)/sy)
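A minimal sketch of the correlation coefficient, computed as the sum of the products of the standardized x- and y-values divided by n − 1. The function name and data are hypothetical.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Sum of products of z-scores, divided by n - 1.
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


xs = [1, 2, 3, 4]                       # hypothetical x-values
print(correlation(xs, [2, 4, 6, 8]))    # points on a line with positive slope
print(correlation(xs, [8, 6, 4, 2]))    # points on a line with negative slope
```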
• Cluster sampling:
items are drawn from the population in groups, or clusters at random. Useful when the population is too large and spread out for simple random sampling to be feasible.
• Systematic sampling:
items are ordered and every kth item is chosen to be included in the sample. Used to sample products as they come off an assembly line, in order to check that they meet quality standards.
Positive association:
large values of one variable are associated with large values of the other.
Negative association:
large values of one variable are associated with small values of the other.
Frequency distribution: - Upper class limit-
largest value that can appear in that class.
Variance:
measure of how far the values in a data set are from the mean, on the average. The variance is computed slightly differently for populations and samples. The population variance is presented first.
Cross-sectional cohort study:
measurements are taken at one point in time.
Finding a value with a given z-score:
multiply the standard deviation by z and add that quantity to the mean. In a population with mean μ and standard deviation σ, the value x with a given z-score is x = μ + zσ.
A statistic is resistant if its value is
not affected much by extreme values (large or small) in the data set. The correlation coefficient is not resistant.
The rule-of-thumb for when an event A is unusual (if P(A) < 0.05): Unusual event:
one that is not likely to happen. In other words, an event whose probability is small. There are no hard-and-fast rules as to just how small a probability needs to be before an event is considered unusual, but we will use the following rule of thumb: an event A is unusual if P(A) < 0.05.
Unimodal:
only one mode
Qualitative variables can be further divided into
ordinal and nominal variables
Quantitative variables can be categorized as having a
ratio or interval level of measurement:
Standard deviation:
taking the square root of the variance.
Z- score:
tells how many standard deviations that value is from its population mean. EX: A value one standard deviation above the mean has a z-score of z = 1, and a value two standard deviations below the mean has a z-score of z = −2. Let x be a value from a population with mean μ and standard deviation σ. The z-score for x is z = (x − μ)/σ.
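A short sketch of the z-score, z = (x − μ)/σ, together with the reverse calculation x = μ + zσ. The function names and the population (mean 100, standard deviation 15) are hypothetical.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


def value_from_z(z, mu, sigma):
    """Value that has the given z-score: x = mu + z * sigma."""
    return mu + z * sigma


# Hypothetical population with mean 100 and standard deviation 15.
print(z_score(130, 100, 15))        # two standard deviations above the mean
print(value_from_z(-2, 100, 15))    # value two standard deviations below the mean
```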
Sample space:
the collection of all the possible outcomes of a probability experiment
Probability: is
the proportion of times that the event occurs in the long run. So, for a "fair" coin, that is, one that is equally likely to come up heads as tails, the probability of heads is 1/2 and the probability of tails is 1/2.
Interpreting the correlation coefficient: - If r is negative,
the two variables have a negative linear association.
Interpreting the correlation coefficient: -If r is positive,
the two variables have a positive linear association.
Outcome or response variable:
the variable we want to predict
Interpreting the correlation coefficient: If r = −1,
then the points lie exactly on a straight line with negative slope; in other words, the variables have a perfect negative linear association. When two variables are not linearly related, the correlation coefficient does not provide a reliable description of the relationship between the variables.
Interpreting the correlation coefficient: If r = 1,
then the points lie exactly on a straight line with positive slope; in other words, the variables have a perfect positive linear association.
Interpreting the y- intercept b0: - If the x-values are all positive or all negative,
then the y-intercept b0 does not have a useful interpretation.
Interpreting the y- intercept b0: - If the data contain both positive and negative x-values,
then the y-intercept is the estimated outcome when the value of the explanatory variable x is 0.
mean and median: Skewed to the right:
there are large values in the right tail. The median is resistant while the mean is not, so the mean is generally more affected. For a data set that is skewed to the right, the mean is often greater than the median.
• In a randomized experiment, small differences among treatment groups are likely
to be due only to chance.
Bimodal:
two clearly distinct modes
Case-control study:
two samples are drawn where one consists of people who have the disease of interest (the cases), and the other consists of people who do not (the controls). The investigators look back in time to determine whether a factor of interest differs between the groups.
Confounder-
variable that is related to both the treatment and the outcome. When a confounder is present, it is difficult to determine whether differences in the outcome are due to the treatment or to the confounder. EX: In the smoking example, gender was a confounder. Gender is related to smoking (men are more likely to smoke) and to heart attacks (men are more likely to have heart attacks). For this reason, it was impossible to determine whether the difference in heart attack rates was due to differences in smoking (the treatment) or in gender (the confounder).
Explanatory or predictor variable:
variable we are given
Sample variance:
the variance computed when the data values come from a sample rather than a population: s² = Σ ( xi − x̄ )² / ( n − 1 )
population standard deviation
σ=√(σ^2 )
Population Variance:
σ² = Σ ( Xi - μ )² / N
The investigative process of statistics: Statistics is an investigative process that involves these steps:
• Formulate questions. • Collect data needed to answer the questions. • Describe the data. • Draw conclusions, using appropriate methods.
Voluntary response samples are never reliable for the following reasons:
• People who volunteer an opinion tend to have stronger opinions than is typical of the population. • People with negative opinions are often more likely to volunteer their response.
The differences among:
stratified sampling cluster sampling systematic sampling voluntary response sampling
Prospective cohort study:
subjects are followed over time.
Retrospective cohort study:
subjects are sampled after the outcome has occurred.
Boxplots can be used to determine skewness in a data set:
- If the median is closer to the first quartile than to the third quartile, or the upper whisker is longer than the lower whisker, the data are skewed to the right. - If the median is closer to the third quartile than to the first quartile, or the lower whisker is longer than the upper whisker, the data are skewed to the left. - If the median is approximately halfway between the first and third quartiles, and the two whiskers are approximately equal in length, the data are approximately symmetric.
Continuous Variables
- Possible values can be any value in some interval.
Discrete Variables
- Possible values can be listed.
Pareto chart
- Sometimes it is desirable to construct a bar graph in which the categories are presented in order of frequency or relative frequency. These charts are useful when it is important to see clearly which are the most frequently occurring categories.
Properties of correlation coefficient:
- The correlation coefficient is always between −1 and 1. That is, −1 ≤ r ≤ 1. - The correlation coefficient does not depend on the units of the variables. - It does not matter which variable is x and which is y. - The correlation coefficient only measures the strength of the linear relationship. It can be misleading when the relationship is nonlinear. - The correlation coefficient is sensitive to outliers and can be misleading when outliers are present.
Relative frequency
- The proportion of observations in a category Relative frequency= Frequency/ Sum of all frequencies
Back to back stem and leaf plots
- When two data sets have values similar enough so that the same stems can be used, their shapes can be compared with a back-to-back stem-and-leaf plot.
Interval Level of Measurement
- Zero does not mean the absence of the quantity, and ratios are not meaningful. Differences are meaningful, however. EX: The number of siblings a person has: The level of measurement is ratio. Zero siblings means the absence of siblings and 4 siblings is twice as many as 2. The outdoor temperature in °C: The level of measurement is interval. The temperature 0°C does not represent the absence of heat. The year of the next presidential election: The level of measurement is interval. The year 0 does not denote the beginning of time. The price of a pair of shoes: The level of measurement is ratio. A price of $0 indicates that the shoes cost no money. Also, a $100 price is twice as expensive as $50.
Mean
- a measure of center. If we imagine each data value to be a weight, then the mean is the point at which the data set balances.
Frequency distribution
- a table that presents the frequency for each category/ how many observations are in each category.
Pie chart
- an alternative to the bar graph for displaying relative frequency information. The relative sizes of the sectors match the relative frequencies of the categories. For example, if a category has a relative frequency of 0.25, then its sector takes up 25% of the circle.
Frequency
- the frequency of a category is the number of times it occurs in the data set.
Bias
- degree to which a procedure systematically overestimates or underestimates a population value. • A study conducted by a procedure that tends to overestimate or underestimate a population value is said to be biased. • A study conducted by a procedure that produces the correct result on the average is said to be unbiased.
Frequency distribution: Width
- difference between consecutive lower class limits.
Dot Plot
- graph that can be used to give a rough impression of the shape of a data set. It is useful when the data set is not too large, and when there are some repeated values. The dotplot gives a good indication of where the values are concentrated and where the gaps are.
Bar graph
- graphical representation of a frequency distribution.
Histograms: Number of classes: Too many classes
- produce a histogram with too much detail so that the main features of the data are obscured.
Frequency distribution: Lower class limit
- smallest value that can appear in that class.
Randomized experiment
- study in which the investigator assigns treatments to the experimental units at random.
Observational study
- the assignment to treatment groups is not made by the investigator.
Histograms
- used for quantitative data
Time-series plot
- used when the data consist of values measured at different points in time. In a time-series plot, the horizontal axis represents time, and the vertical axis represents the value of the variable we are measuring.
How to compute the quartiles of a data set:
-The first quartile, denoted Q1, separates the lowest 25% of the data from the highest 75%. -The second quartile, denoted Q2, separates the lowest 50% of the data from the highest 50%. Q2 is the same as the median. -The third quartile, denoted Q3, separates the lowest 75% of the data from the highest 25%. Step 1: Arrange the data in increasing order. Step 2: Let n be the number of values in the data set. To compute the second quartile, simply compute the median. For the first or third quartile, proceed as follows: for the first quartile, compute L = 0.25n; for the third quartile, compute L = 0.75n. Step 3: If L is a whole number, the quartile is the average of the number in position L and the number in position L + 1. If L is not a whole number, round it up to the next higher whole number. The quartile is the number in the position corresponding to the rounded-up value. Quartiles on TI-84: The 1-Var Stats command in the TI-84 Plus calculator returns the quartiles.
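The three steps above can be sketched in Python as follows. This is only an illustration of the position rule described in the steps (L = 0.25n or L = 0.75n); the function name `quartile` and the data are hypothetical, and other software may use a different quartile convention and give slightly different answers.

```python
import math
import statistics


def quartile(data, fraction):
    """First or third quartile by the position rule: L = fraction * n."""
    values = sorted(data)            # Step 1: arrange in increasing order
    n = len(values)                  # Step 2: n is the number of values
    L = fraction * n
    if L == int(L):
        # Step 3a: L is whole, so average positions L and L + 1 (1-based).
        pos = int(L)
        return (values[pos - 1] + values[pos]) / 2
    # Step 3b: L is not whole, so round up and take that position.
    pos = math.ceil(L)
    return values[pos - 1]


# Illustrative data with n = 10, so L = 2.5 and L = 7.5 (not whole).
data10 = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12]
q1_10 = quartile(data10, 0.25)       # position 3
q2_10 = statistics.median(data10)    # the second quartile is the median
q3_10 = quartile(data10, 0.75)       # position 8

# Illustrative data with n = 8, so L = 2 and L = 6 (whole numbers).
data8 = [1, 3, 5, 7, 9, 11, 13, 15]
q1_8 = quartile(data8, 0.25)         # average of positions 2 and 3
q3_8 = quartile(data8, 0.75)         # average of positions 6 and 7
```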
The difference between correlation and causation:
A group of elementary school children took a vocabulary test. It turned out that children with larger shoe sizes tended to get higher scores on the test, and those with smaller shoe sizes tended to get lower scores. As a result, there was a large positive correlation between vocabulary and shoe size. Does this mean that learning new words causes one's feet to grow, or that growing feet cause one's vocabulary to increase? - The fact that shoe size and vocabulary are correlated does not mean that changing one variable will cause the other to change. - Correlation is not the same as causation. In general, when two variables are correlated, we cannot conclude that changing the value of one variable will cause a change in the value of the other.
Skewed to the left (negatively skewed):
A histogram with a long left-hand tail
Skewed to the right (positively skewed):
A histogram with a long right-hand tail
Correlation coefficient:
A numerical measure of the strength of the linear relationship between two variables
A big sample doesn't make up for bias:
A sample is useful only if it is drawn by a method that is likely to represent the population well. If you use a biased method to draw a sample, then drawing a big sample doesn't help. For example, voluntary response surveys often draw several hundred thousand people to participate. Although the sample is large, it is unlikely to represent the population well, so the results are meaningless.
The procedure for computing the median differs, depending on whether the number of observations in the data set is even or odd:
If n is odd, the median is the middle number. If n is even, the median is the average of the two middle numbers.
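A short sketch of the odd/even rule, assuming the data are first sorted in increasing order. The function name `median` and the sample values are illustrative only.

```python
def median(data):
    """Middle value if n is odd, average of the two middle values if n is even."""
    values = sorted(data)
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]                        # n odd: the middle number
    return (values[mid - 1] + values[mid]) / 2    # n even: average of two middle numbers


odd_result = median([3, 1, 7])        # sorted: 1, 3, 7
even_result = median([3, 1, 7, 5])    # sorted: 1, 3, 5, 7
```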
Interpreting the slope b1:
If the x-values of two points on a line differ by 1, their y-values will differ by an amount equal to the slope of the line. This enables us to interpret the slope: if the values of the explanatory variable for two individuals differ by 1, their predicted values will differ by b1. If the values of the explanatory variable differ by an amount d, then their predicted values will differ by b1 × d.
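To see the slope interpretation numerically, here is a small sketch. The intercept `b0`, the slope `b1`, and the x-values are hypothetical numbers chosen for illustration; they do not come from any data set in these notes.

```python
# Hypothetical least-squares line: predicted y = b0 + b1 * x
b0 = 10.0   # assumed intercept (illustrative)
b1 = 2.5    # assumed slope (illustrative)


def predict(x):
    """Predicted value of the outcome variable for explanatory value x."""
    return b0 + b1 * x


x = 20
d = 4

diff_one = predict(x + 1) - predict(x)   # x-values differ by 1
diff_d = predict(x + d) - predict(x)     # x-values differ by d

print(diff_one, diff_d)
```

The first difference equals b1 and the second equals b1 × d, regardless of which starting value x is used.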
The advantages of randomized experiments:
In a perfect study, treatment groups would not differ from each other in any important way except that they receive different treatments. In practice, it is impossible to construct treatment groups that are exactly alike, but randomization does the next best thing.
The common ways that graphs can be misleading including:
Incorrect positioning of the vertical axis: Incorrect sizing of graphical images by not following the Area Principle: How three-dimensional graphs can distort the perspective:
Symmetric (bell-shaped or uniform):
if its right half is a mirror image of its left half.
Resistant:
if its value is not affected much by extreme values (large or small) in the data set. The median is resistant, but the mean is not.
What it means for an experiment to be double-blind:
if neither the investigators nor the subjects know who has been assigned to which treatment. When investigators or subjects know which treatment is being given, they may tend to report the results differently. Therefore, experiments should be double-blinded whenever possible.
Statistics
is the study of procedures for collecting, describing, and drawing conclusions from information.
An advantage of Chebyshev's Inequality is that
it applies to any data set, whereas the Empirical Rule applies only to data sets that are approximately bell-shaped.
How to determine outliers using the IQR method:
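The interquartile range is IQR = Q3 − Q1. Compute the lower boundary Q1 − 1.5 × IQR and the upper boundary Q3 + 1.5 × IQR. Any value less than the lower boundary or greater than the upper boundary is an outlier.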
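A minimal sketch of the IQR method, assuming the common 1.5 × IQR rule: IQR = Q3 − Q1, and any value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is flagged as an outlier. The quartiles here use the position rule L = 0.25n and L = 0.75n; the function names and the data are hypothetical.

```python
import math


def quartile(values, fraction):
    """Quartile by the position rule L = fraction * n (1-based positions)."""
    ordered = sorted(values)
    position = fraction * len(ordered)
    if position == int(position):
        k = int(position)
        return (ordered[k - 1] + ordered[k]) / 2   # average positions L and L + 1
    return ordered[math.ceil(position) - 1]        # round L up


def iqr_outliers(values):
    """Return the values that fall outside the 1.5 * IQR boundaries."""
    q1 = quartile(values, 0.25)
    q3 = quartile(values, 0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr    # lower outlier boundary
    upper = q3 + 1.5 * iqr    # upper outlier boundary
    return [x for x in values if x < lower or x > upper]


# Illustrative data: Q1 = 4, Q3 = 12, IQR = 8, boundaries -8 and 24.
data = [1, 3, 5, 7, 9, 11, 13, 40]
outliers = iqr_outliers(data)
print(outliers)
```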
Voluntary Response Bias:
people are invited to log on to a website, send a text message, or call a phone number, in order to express their opinions. In many cases, the opinions of the people who choose to participate in such surveys do not reflect those of the population as a whole. In particular, people with strong opinions are more likely to participate. In general, voluntary response surveys are highly biased.
Histograms: Number of classes: Too few classes
produce a histogram lacking in detail.
population mean
μ, pronounced "mu"
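The definition of resistant and which measure of center is resistant:
A statistic is resistant if its value is not affected much by extreme values (large or small) in the data set. The median is resistant, but the mean is not.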
sample standard deviation
s = √(s^2), the square root of the sample variance s^2
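A short sketch of computing s as the square root of the sample variance. It assumes the usual definition of the sample variance, with divisor n − 1, which is not stated on this card; the function names and data are illustrative.

```python
import math


def sample_variance(data):
    """Sample variance s^2, assuming the usual n - 1 divisor."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def sample_std(data):
    """Sample standard deviation: s = sqrt(s^2)."""
    return math.sqrt(sample_variance(data))


# Illustrative data with mean 5 and squared deviations summing to 32.
values = [2, 4, 4, 4, 5, 5, 7, 9]
s_squared = sample_variance(values)
s = sample_std(values)
print(s_squared, s)
```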
How to approximate probabilities using the Empirical Method: The Law of Large Numbers
says that if we repeat a probability experiment a large number of times, then the proportion of times that a particular outcome occurs is likely to be close to the true probability of the outcome. The Empirical Method consists of repeating an experiment a large number of times, and using the proportion of times an outcome occurs to approximate the probability of the outcome.
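The Empirical Method can be illustrated with a simulation. The sketch below assumes a fair six-sided die (true probability of a six is 1/6) and uses Python's pseudorandom generator in place of a physical experiment; the function name and the seed are arbitrary choices for illustration.

```python
import random


def empirical_probability_of_six(trials, seed=0):
    """Roll a simulated fair die `trials` times; return the proportion of sixes."""
    rng = random.Random(seed)    # fixed seed so the run is repeatable
    sixes = sum(1 for _ in range(trials) if rng.randint(1, 6) == 6)
    return sixes / trials


# The proportion tends to settle near 1/6 as the number of trials grows.
for trials in (10, 1_000, 100_000):
    print(trials, empirical_probability_of_six(trials))
```

With only a few trials the proportion can be far from 1/6; the Law of Large Numbers says it is likely to be close once the number of trials is large.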
Stem-and-leaf-plots:
simple way to display small data sets. Rightmost digit is the leaf, and the remaining digits form the stem.
Linear association:
the data tend to cluster around a straight line when plotted on a scatter plot.
Interpreting the correlation coefficient: - If r is close to 0,
the linear association is weak.
mean and median: Approximately symmetric:
the mean and median are approximately equal.
mean and median: Skewed to the left:
the mean is often less than the median.
Interpreting the correlation coefficient: - The closer r is to −1,
the more strongly negative the linear association is.
Interpreting the correlation coefficient: - The closer r is to 1,
the more strongly positive the linear association is.
Stratified sampling:
the population is divided up into groups, called strata, then a simple random sample is drawn from each stratum. Useful when the strata differ from one another, but the individuals within a stratum tend to be alike.
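A minimal sketch of stratified sampling: group the population into strata, then draw a simple random sample from each stratum. The population of students, the strata (class year), and the sample size per stratum are all hypothetical.

```python
import random


def stratified_sample(population, stratum_of, size_per_stratum, seed=0):
    """Draw a simple random sample of the given size from each stratum."""
    rng = random.Random(seed)    # fixed seed so the run is repeatable
    strata = {}
    for individual in population:
        # Divide the population into groups (strata).
        strata.setdefault(stratum_of(individual), []).append(individual)
    sample = []
    for members in strata.values():
        # Simple random sample, without replacement, within each stratum.
        sample.extend(rng.sample(members, size_per_stratum))
    return sample


# Hypothetical population: (name, class year) pairs.
years = ["freshman"] * 30 + ["sophomore"] * 20 + ["junior"] * 10
population = [(f"student{i}", year) for i, year in enumerate(years)]

sample = stratified_sample(population, lambda person: person[1], 3)
print(sample)
```

Every stratum is guaranteed to be represented, which a single simple random sample of the whole population does not guarantee.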