STATS 250 EXAM 1
how to create a stem and leaf display
**Sort the data.** Separate each observation into a leaf, which is the final digit, and the stem, which consists of the remaining digits. Write the stems in a column in consecutive order, with the smallest at the top, and draw a vertical line to the right of this column. Write each leaf in the row to the right of its stem, with the leaves in consecutive order, smallest to largest.
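The steps above can be sketched in a few lines of Python. This is a minimal illustration that assumes non-negative whole-number data; the function name `stem_and_leaf` and the scores are made up for the example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} where the leaf is the final digit.

    Assumes non-negative integers. Empty stems are kept so that the
    stems appear in consecutive order.
    """
    rows = defaultdict(list)
    for v in sorted(values):            # sort first so leaves come out ordered
        stem, leaf = divmod(v, 10)      # leaf = final digit, stem = the rest
        rows[stem].append(leaf)
    lo, hi = min(rows), max(rows)
    return {s: rows.get(s, []) for s in range(lo, hi + 1)}

scores = [34, 12, 21, 47, 15, 38, 21]   # made-up data
for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```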
properties of a binomial experiment
1. There are n "trials" where n is determined in advance and is not a random value. 2. Two possible outcomes on each trial, called "success" and "failure" and denoted S and F. 3. Outcomes are independent from one trial to the next. 4. Probability of a "success", denoted by p, remains the same from one trial to the next. Probability of "failure" is 1 - p.
what possible explanations happen when we see an association between an explanatory and response variable
1. There is causation: the explanatory variable is causing a change in the response variable. The best way to establish causation is with a randomized experiment. 2. There may be causation, but confounding factors make this causation difficult to prove. A confounding variable is a variable that is not the main concern of the study but may be partially responsible for the observed results. 3. There is no causation: the observed association can be explained by how one or more other variables affect both the explanatory and response variables. "Lurking variable" is a term sometimes used for variables that affect both the x and y variables and cause us to see an association. 4. The response variable is causing a change in the explanatory variable.
form of a scatterplot
Two variables have a linear relationship when the pattern of their relationship resembles a straight line. Two variables have a nonlinear or curved relationship when a curve describes the pattern of the scatterplot better than a line does.
direction of a scatterplot
2 variables have a positive relationship when the values of one variable tend to increase as the values of the other variable increase. 2 variables have a negative relationship when the values of one variable tend to decrease as the values of the other variable increase.
using regression lines to make predictions
In the regression equation ŷ = a + bx, a is the y-intercept: the spot where the line crosses the y-axis, which is the predicted y value when x is 0. b is the slope: the amount the y variable changes on average when the x variable increases by one unit. When b is positive, y increases as x increases (the direction of the relationship is positive). When b is negative, y decreases as x increases (the direction of the relationship is negative).
boxplots and how to construct
A box plot is a visual display of the five-number summary (min, Q1, median, Q3, max). ((Constructing a box plot)) Label an axis with numbers from the minimum to the maximum of the data. Draw a box with the lower end of the box at Q1 and the upper end at Q3. Draw a line through the box at the median. Calculate IQR = Q3 - Q1. Draw a line that extends from Q1 to the smallest data value that is not smaller than the lower fence, Q1 - (1.5 x IQR). Also draw a line that extends from Q3 to the largest data value that is not greater than the upper fence, Q3 + (1.5 x IQR). Mark the location of any data points smaller than the lower fence or larger than the upper fence with an asterisk. Values more than 1.5 IQRs beyond the quartiles are considered to be outliers.
placebo
A placebo is something that is identical (in appearance, taste, feel, etc.) to the treatment received by the treatment group, except that it contains no active ingredients. People respond to the power of suggestion ("you think you will, so you will"), so to measure a true treatment effect it is better to compare a treatment to a placebo than to a control group that receives nothing.
scatter plots
A point on a scatterplot represents the combination of two measurements for an individual observation If there is an explanatory/response relationship, the explanatory variable is plotted on the x (horizontal) and the response or dependent variable is plotted on the y (vertical).
regression line
A regression line is a straight line that describes how values of a quantitative response variable (y) are related, on average, to values of a quantitative explanatory variable (x). Used for: estimating the average value of y at a specific value of x, and predicting the unknown value of y for an individual, given that individual's x value. Recall the equation for a straight line, y = a + bx. Since not all data points fall on the line, we write the regression equation as ŷ = a + bx, where ŷ ("y-hat") is the predicted value of y.
what to look for in dot plots, stem and leaf displays, and histograms
A representative or typical value (center) in the data set. The extent to which the data values spread out. The nature of the distribution (shape) along the number line. The presence of unusual values (gaps and outliers).
using a residual plot to evaluate whether a linear regression model is appropriate for the relationship between two variables
A residual plot is a scatterplot of the (x, residual) pairs. Residuals can also be graphed against the predicted y-values. Isolated points or a pattern of points in the residual plot indicate potential problems. - Curves indicate the data do not follow a linear form. - Fanning (residual spread that grows or shrinks across the plot) indicates that the variability of the residuals depends on x, which violates an assumption of the linear regression model. A transformation of the variables or a more sophisticated model is called for. - Random scatter is a "good" thing to see in a residual plot.
random selection
A way of ensuring that a sample of people is representative of a population by giving everyone in the population an equal chance of being selected for the sample
randomized block design
An experiment that incorporates blocking: the experimental units are divided into blocks of similar units, and the units within each block are then randomly assigned to treatments. Example: if you were worried that gender might also be related to performance in the math experiment, you could block on gender and randomize within each gender (a matched pairs design is a special case of blocking). The alternative is direct control of gender (use only boys or only girls), but then conclusions can ONLY be generalized to the group that was used.
outliers
An outlier is a data point with an unusual combination of x and y values. Outliers with extreme x values have the most influence on correlation and regression and are called influential observations. The equation of the regression line changes substantially when an influential observation is removed.
identify when poor practices produce misleading plots
Areas should be proportional to the frequency, relative frequency, or magnitude of the number being represented. Don't use unusual symbols in bar charts. Watch out for unequal time spacing in time series plots. Be sure you have the right type of plot for your data type.
binomial probabilities and cumulative probabilities
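binompdf(n, p, k) gives P(X = k), the probability of exactly k successes. binomcdf(n, p, k) gives P(X <= k), the probability of at most k successes.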
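For checking calculator answers, the two functions can be mirrored in Python using the binomial formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). This is a sketch; the Python function names are chosen here to match the calculator commands, and the coin-flip numbers are only an example.

```python
from math import comb

def binompdf(n, p, k):
    """P(X = k): probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomcdf(n, p, k):
    """P(X <= k): add up the probabilities of 0, 1, ..., k successes."""
    return sum(binompdf(n, p, i) for i in range(k + 1))

# Example: 10 fair coin flips, "success" = heads
exactly_three = binompdf(10, 0.5, 3)   # 120/1024
at_most_three = binomcdf(10, 0.5, 3)   # 176/1024
print(exactly_three, at_most_three)
```

To get P(X >= k), use the complement: 1 - binomcdf(n, p, k - 1).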
how to construct a time plot
Bivariate data with time and a numeric variable. ((How to construct)) Draw horizontal and vertical axes. Label the horizontal axis and include an appropriate scale for the x-variable (time). Label the vertical axis and include an appropriate scale for the y-variable. For each (x, y) pair in the data set, add a dot in the appropriate location. Connect the dots in time order.
correlation coefficient (r)
Correlation gives us an objective numeric measure to quantify the strength of a linear relationship. The statistical correlation between two quantitative variables is a number that indicates the strength and the direction of the linear relationship.
unimodal
having one peak
how to create a histogram
Decide how many equally spaced intervals to use for the horizontal axis. Between 6 and 15 is usually appropriate. Break the range of your data up into these intervals. Decide whether to use frequencies or relative frequencies on the vertical axis. Determine the frequency or relative frequency of data values in each interval and draw a bar with the corresponding height. If a value is on a boundary, count it in the interval that begins with that value.
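The counting step, including the boundary rule, can be sketched as follows. The helper name `histogram_counts` and the data are hypothetical, and whole-number interval widths are assumed so that boundary values are handled exactly.

```python
def histogram_counts(values, start, width, bins):
    """Count how many values fall in each of `bins` equal-width intervals.

    Interval i covers [start + i*width, start + (i+1)*width), so a value on
    a boundary is counted in the interval that begins with that value.
    """
    counts = [0] * bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < bins:
            counts[i] += 1
    return counts

data = [1, 5, 10, 10, 14, 19]                      # made-up data
freq = histogram_counts(data, start=0, width=5, bins=4)
rel_freq = [c / len(data) for c in freq]           # relative frequencies
print(freq, rel_freq)
```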
how to create a dot plot
Draw a horizontal number line that covers the range from the smallest to the largest data value. Place a dot above the number line located at each observation's data value. When there are multiple observations with the same value, the dots are stacked vertically.
how to construct a scatterplot
Draw horizontal and vertical axes. Label the horizontal axis and include an appropriate scale for the x-variable. Label the vertical axis and include an appropriate scale for the y-variable. For each (x, y) pair in the data set, add a dot in the appropriate location in the display. What to look for: the relationship between x and y.
mean
numerical average
numerical variables (quantitative)
Observations or measurements take on numerical values; something you can average. Examples: units enrolled in, hours of work per week.
why is random assignment important when collecting data in an experiment?
Random assignment ensures that the treatment groups differ only by chance-like variability. When there are other variables that we cannot control directly or haven't even thought of, random assignment should spread them evenly across the treatment groups. We expect these variables to affect all the experimental groups in the same way; therefore, their effects are not confounding.
percentile
For a number r between 0 and 100, the r^th percentile is a value such that r percent of the observations fall AT or BELOW that value. -The median is the 50th percentile -Q1 is the 25th percentile
split stem and leaf display
If there are too many observations in a stem, you can split all of the stems into equal sized pieces.
why is the regression line called the least squares regression
In choosing which line to draw through our data points, we want a line that comes as close as possible to the points in some sense. The b0 and b1 values we use in simple linear regression are known as the least squares regression coefficients because they are chosen to minimize the sum of the squared residuals.
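The least squares coefficients have a closed form: slope b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², intercept a = ȳ - b·x̄ (here a and b play the roles of b0 and b1). A minimal sketch with made-up data; the function name `least_squares` is an assumption for illustration.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x that minimizes
    the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # intercept: the line passes through (x-bar, y-bar)
    return a, b

xs = [1, 2, 3, 4, 5]           # made-up data
ys = [2, 4, 5, 4, 5]
a, b = least_squares(xs, ys)
print(f"y-hat = {a:.1f} + {b:.1f}x")   # y-hat = 2.2 + 0.6x
```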
finding areas under a given normal distribution
Most commonly encountered type of continuous random variable is the normal random variable, which has a specific form of a bell-shaped probability density curve called a normal curve. A normal random variable is also said to have a normal distribution The normal distribution is symmetric, unimodal, bell-shaped and characterized by its mean and standard deviation Empirical rule
census
observe the entire population
how do outliers affect the mean and median
Outliers have more influence on the mean than on the median, because all values are used in computing the mean. If the data are skewed to the left, the mean is typically less than the median; if the data are skewed to the right, the mean is typically greater than the median.
interquartile range
Q3-Q1 Q3 is the 3rd quartile (75th percentile). Find it by taking the median of the upper half of the ordered data values. Q1 is the first quartile (25th percentile). Find it by taking the median of the lower half of the ordered data values. IQR tells us how spread out the middle half of the data is
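A sketch of the "median of each half" method. One assumption is made explicit here: when the number of observations is odd, the overall median is left out of both halves (textbooks and software differ on this detail, so results may not match every calculator).

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumes at least two observations; for odd n the middle value is
    excluded from both halves.
    """
    xs = sorted(values)
    half = len(xs) // 2
    lower = xs[:half]
    upper = xs[len(xs) - half:]
    return median(lower), median(xs), median(upper)

data = [2, 4, 5, 7, 8, 10, 12, 15, 40]   # made-up data
q1, med, q3 = quartiles(data)
iqr = q3 - q1
print(q1, med, q3, iqr)                  # 4.5 8 13.5 9.0
```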
sampling without replacement
Sampling in which an individual or object, once selected, is not put back into the population before the next selection, so it cannot be selected again. (This is the more typical approach.)
sampling with replacement
Sampling in which an individual or object, once selected, is put back into the population before the next selection. This allows an object or individual to be selected more than once for a sample.
systematic sampling
Select from an ordered arrangement of the population by first choosing a starting point at random from the first k individuals, then taking every kth individual after that. Suppose you wish to select a sample of faculty members from the faculty directory. You would first randomly select a faculty member from the first 20 (k = 20) listed in the directory, then select every 20th faculty member after that on the list.
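The faculty-directory example can be sketched like this. The directory contents and the helper name `systematic_sample` are hypothetical.

```python
import random

def systematic_sample(ordered_population, k, rng=random):
    """Pick a random start among the first k items, then every k-th item."""
    start = rng.randrange(k)          # index 0 .. k-1
    return ordered_population[start::k]

# Hypothetical directory of 100 faculty, listed in order
directory = [f"Faculty {i}" for i in range(1, 101)]
sample = systematic_sample(directory, k=20)
print(sample)                         # 5 names, spaced 20 apart
```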
cluster sample
Sometimes it's easier to select groups of individuals from a population than it is to select individuals themselves. Cluster sampling divides the population of interest into non overlapping subgroups, called clusters. Clusters are then selected at random, and ALL individuals in the selected clusters are included in the sample. The ideal situation for cluster sampling is when each cluster mirrors the characteristics of the population.
strength of a scatterplot
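Strength describes how closely the points follow the form of the relationship; the strongest possible case is a perfect correlation, where every point falls exactly on the line. Watch for outliers, points that have an unusual combination of data values. Strength cannot be measured objectively from the plot alone; that requires computing the correlation.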
steps in a linear regression analysis
1. Summarize the data graphically by constructing a scatterplot. 2. Based on the scatterplot, decide if it looks like the relationship between x and y is approximately linear. If so, proceed. 3. Find the equation of the least squares regression line. 4. Construct a residual plot and look for any patterns or unusual features that may indicate that a line is not the best way to summarize the relationship between x and y. If none are found, proceed to the next step. 5. Compute the values of se and r^2 and interpret them in context. 6. Based on what you have learned from the residual plot and the values of se and r^2, decide whether the least squares regression line is useful for making predictions. If so, proceed. 7. Use the least squares regression line to make predictions.
prediction error/residual
The difference between the observed y value in a data set and the predicted value of y from the regression equation: residual = observed - predicted = y - ŷ.
compute the mean (expected value) of a discrete random variable using a probability distribution table
The mean value of a random variable x, denoted by 𝜇, describes where the probability distribution of x is centered. This is also called the expected value E(x) of the random variable. The standard deviation of a random variable x, denoted by SD(x) or 𝝈, describes the variability in the probability distribution. If X is a random variable with possible values x1, x2, x3, . . . , occurring with probabilities p1, p2, p3, . . . , then the expected value of X is calculated as E(X) = 𝜇 = x1·p1 + x2·p2 + x3·p3 + . . . (multiply each value by its probability, then add).
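A worked sketch of the table calculation: the expected value is the probability-weighted sum of the possible values, and the standard deviation is the square root of the probability-weighted squared distances from 𝜇. The table values below are made up.

```python
from math import sqrt

def expected_value(values, probs):
    """E(X) = sum of (value * probability)."""
    return sum(x * p for x, p in zip(values, probs))

def sd_random_variable(values, probs):
    """SD(X) = sqrt( sum of (x - mu)^2 * p )."""
    mu = expected_value(values, probs)
    return sqrt(sum((x - mu) ** 2 * p for x, p in zip(values, probs)))

# Made-up probability distribution table
values = [0, 1, 2, 3]
probs = [0.1, 0.3, 0.4, 0.2]

mu = expected_value(values, probs)         # about 1.7
sigma = sd_random_variable(values, probs)  # about 0.9
print(round(mu, 4), round(sigma, 4))
```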
facts about correlation
In the formula r = [1/(n - 1)] Σ [(x - x̄)/s_x]·[(y - ȳ)/s_y], the pieces in parentheses are standardized z-scores, so correlation is an average of the products of the standardized values of the two variables. Correlation makes no distinction between explanatory and response variables: the correlation between x and y is the same as the correlation between y and x. Correlation requires that both variables be quantitative. We cannot compute a correlation between a categorical variable and a quantitative variable or between two categorical variables. r does not change when we change the units of measurement: the correlation between height and weight is the same whether height is measured in feet or centimeters and whether weight is measured in kilograms or pounds. This happens because all the observations are standardized in the calculation of correlation. The correlation r itself has no unit of measurement; it is just a number. Positive r indicates a positive association between the variables and negative r indicates a negative association. r is always between -1 and 1. Values of r near 0 indicate little or no linear association. Values of r close to -1 or 1 indicate that the points in a scatterplot fall close to a straight line. If r = 1 or r = -1, the points fall exactly on a straight line. Correlation measures the strength of only a linear relationship between two variables; a relationship may be very strong but curved and have a correlation of zero. Correlation is a nonresistant measure: r is affected by outliers.
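Two of these facts (r is an average of products of z-scores, and r does not depend on units) can be checked numerically. This sketch uses the sample standard deviation and divides by n - 1; the height and weight values are made up.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum of (z_x * z_y) divided by (n - 1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)      # sample standard deviations
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

heights_ft = [5.0, 5.5, 5.8, 6.0, 6.2]             # made-up data
weights_lb = [120, 150, 155, 180, 190]
heights_cm = [h * 30.48 for h in heights_ft]       # same heights, new units

r_ft = correlation(heights_ft, weights_lb)
r_cm = correlation(heights_cm, weights_lb)
r_swapped = correlation(weights_lb, heights_ft)
print(r_ft, r_cm, r_swapped)   # all three agree up to rounding error
```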
stratified random sample
The population is first divided into non- overlapping subgroups (called strata). Then separate simple random samples are selected from each subgroup (stratum).
express the probability distribution of a discrete random variable in a table
The probability distribution function (pdf) of X is a table or rule (formula) that assigns probabilities to the possible values of X. X = the random variable. k = a number the discrete random variable could assume. P(X = k) is the probability that X equals k.
empirical rule
The probability of falling within any particular number of standard deviations of μ is the same for all normal distributions: approximately 68% within 1 SD, 95% within 2 SDs, and 99.7% within 3 SDs.
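The rule can be verified with Python's standard-library `statistics.NormalDist`. The mean of 100 and SD of 15 used for the second check are arbitrary example values.

```python
from statistics import NormalDist

def prob_within(k, mean=0.0, sd=1.0):
    """P(mean - k*sd < X < mean + k*sd) for a normal random variable X."""
    dist = NormalDist(mean, sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

for k in (1, 2, 3):
    standard = prob_within(k)                    # standard normal
    other = prob_within(k, mean=100, sd=15)      # a different normal curve
    print(k, round(standard, 4), round(other, 4))
```

Both columns print the same probabilities, which is the point of the rule: only the number of standard deviations matters.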
strength of an association based on the correlation
The sign of r tells the direction and the strength is on a scale of 0 to 1. So r can range from -1 to 1 The closer r is to +/-1, the stronger the correlation.
coefficient of determination (r^2)
The square of the correlation can also be used to describe the strength of a linear relationship. r^2 takes on values between 0 and 1. It can be interpreted as the proportion of variation in the y variable explained by the x variable. When we look at an ANOVA table, r^2 is computed as 1 - SSResid/SSTO. Earlier we measured the variability in y by variance and standard deviation. These measures account for the distance between each y observation and the mean of the y observations. Residuals measure the distance between each y observation and the regression line. A high r^2 value means the residuals are much smaller than the distances to the mean. A small value of se indicates that residuals tend to be small. This value tells you how much accuracy you can expect when using the least squares regression line to make predictions. A large value of r^2 indicates that a large proportion of the variability in y can be explained by the approximate linear relationship between x and y. This tells you that knowing the value of x is helpful for predicting y. A useful regression line will have a reasonably small value of se and a reasonably large value of r^2.
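A sketch of how r^2 and se come out of the residuals, using r^2 = 1 - SSResid/SSTO and the usual definition se = sqrt(SSResid/(n - 2)). The data are made up, and the line ŷ = 2.2 + 0.6x is the least squares line for those data.

```python
from math import sqrt

def fit_summary(xs, ys, a, b):
    """Return (r_squared, s_e) for the line y-hat = a + b*x."""
    n = len(xs)
    y_bar = sum(ys) / n
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # observed - predicted
    ss_resid = sum(e ** 2 for e in residuals)              # variation left over
    ss_to = sum((y - y_bar) ** 2 for y in ys)              # total variation in y
    r_squared = 1 - ss_resid / ss_to
    s_e = sqrt(ss_resid / (n - 2))
    return r_squared, s_e

xs = [1, 2, 3, 4, 5]          # made-up data
ys = [2, 4, 5, 4, 5]
r_squared, s_e = fit_summary(xs, ys, a=2.2, b=0.6)
print(round(r_squared, 3), round(s_e, 3))   # 0.6 0.894
```

Here 60% of the variation in y is explained by the linear relationship with x, and a typical prediction misses by a bit under 0.9 units.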
turning values from any normal distribution into z-scores from the standard normal
The z-score for an observation is the number of standard deviations that it falls from the mean. Formula: z = (x - mean)/SD. When x has a normal distribution N(mean, SD), z has an N(0, 1) distribution, with mean 0 and standard deviation 1. This is called the standard normal distribution.
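A short sketch showing that standardizing first and then using the standard normal gives the same area as working with the original normal distribution directly. The mean of 100 and SD of 15 are example values only.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x falls from the mean."""
    return (x - mean) / sd

z = z_score(130, mean=100, sd=15)            # 2.0
area_below = NormalDist().cdf(z)             # area to the left under N(0, 1)
same_area = NormalDist(100, 15).cdf(130)     # area to the left under N(100, 15)
print(z, round(area_below, 4), round(same_area, 4))
```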
comparative stem and leaf
To compare two distributions, use the same stems, but write the leaves for the second distribution on the left side. For instance, we could have put opponents' scores on the other side of the stem-and-leaf plot in the example. Be sure to label each side.
fences method to determine outliers
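Values more than 1.5 IQRs beyond the quartiles are considered to be outliers. How to find: IQR = Q3 - Q1; lower fence = Q1 - (1.5 x IQR); upper fence = Q3 + (1.5 x IQR). Any value below the lower fence or above the upper fence is an outlier.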
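A sketch of the fences calculation. The quartiles are passed in rather than computed, since quartile conventions vary; the data and the quartile values Q1 = 4.5 and Q3 = 13.5 are a made-up example.

```python
def fences(q1, q3):
    """Return (lower_fence, upper_fence) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Values below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

data = [2, 4, 5, 7, 8, 10, 12, 15, 40]     # made-up data
lower, upper = fences(4.5, 13.5)           # (-9.0, 27.0)
outliers = find_outliers(data, 4.5, 13.5)  # [40]

# Box plot whiskers stop at the most extreme values inside the fences
whisker_low = min(v for v in data if v >= lower)
whisker_high = max(v for v in data if v <= upper)
print(lower, upper, outliers, whisker_low, whisker_high)
```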
locate regression values on excel output
When we look at an ANOVA table, r^2 is computed as 1 - SSResid/SSTO.
procedure for drawing a simple random sample
You need : a list of units in the population and a source of random numbers Give each unit in your list an ID number and generate n random numbers from your source. The units with ID numbers that match the random numbers generated become the sample. Software - randint(lower, upper, n) and randbetween(lower, upper)
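The same procedure in Python, as a sketch. It uses `random.sample`, which draws without replacement, so no ID number can come up twice (with `randint`-style generators, repeated numbers would have to be skipped). The roster is hypothetical.

```python
import random

def simple_random_sample(units, n, seed=None):
    """Number the units 1..N, draw n distinct IDs at random,
    and return the units whose IDs were drawn."""
    rng = random.Random(seed)              # seed only for reproducibility
    ids = range(1, len(units) + 1)         # ID numbers 1, 2, ..., N
    chosen_ids = rng.sample(ids, n)        # n distinct random IDs
    return [units[i - 1] for i in chosen_ids]

roster = [f"Student {i}" for i in range(1, 51)]   # hypothetical population
sample = simple_random_sample(roster, n=5, seed=250)
print(sample)
```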
frequency distribution
a listing of all categories along with their frequencies Counts - how many
relative frequency distribution
a listing of all categories along with their relative frequencies, expressed as a proportion (number between 0-1) or a percent (number between 0-100)
statistic
a number that describes a sample
population characteristic
a number that describes the entire population
discrete variable
A numerical variable whose possible values are isolated points along a number line, usually counts of items. You can list the possible values. Examples: units enrolled in, stop lights on the way to campus.
continuous variable
A numerical variable that can take any value in a given interval, usually a measurement of something. Examples: time, age.
sample
a part of the population that is selected for the study
sample
a sample of the population
random sample
a sample that fairly represents a population because each member has an equal chance of inclusion
observational study
A study in which the person conducting the study observes characteristics of a sample (researchers do not determine the groups). The purpose is to collect data that will allow you to learn about a single population or about how two or more populations differ. Cannot establish causation.
experiment
A study in which the person conducting the study seeks to determine how a response behaves under different experimental treatments and determines which subjects or units are in each group. Researchers carrying out the study impose experimental conditions on the subjects or units.
voluntary response
a type of convenience sampling which relies solely on individuals volunteering to be part of the study
lurking variable
a variable that is not among the explanatory or response variables in a study but that may influence the response variable
discrete distribution satisfies what properties
all probabilities are between 0 and 1 probability of all possible outcomes sums to 1
completely randomized
an experiment in which experimental units are randomly assigned to treatments
double blind
neither the participant nor the evaluator knows which treatment the participant is receiving
comparative box plot
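box plots for two or more groups drawn side by side on the same scale, used to compare the distributions of a numerical variable across the groups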
univariate
consist of observations on a single variable made on individuals in a sample or population
categorical variable (qualitative)
consists of categorical responses eye color, major, neighborhood
multivariate
data that consist of observations on two or more variables
bivariate
data that consists of pairs of numbers from two variables for each individual in a sample or population
measures of spread
Describe how much variability there is in a data distribution. A measure of spread provides information about how much individual values tend to differ from one another. Use the standard deviation when the distribution is approximately symmetric; use the interquartile range when it is skewed or has outliers.
measures of center
Describe where the data distribution is located along the number line. A measure of center provides information about what is "typical". Use the mean (average) when the distribution is approximately symmetric; use the median when it is skewed or has outliers.
distributions
describes how often the possible responses occur
simple comparative experiment
determine the effects of the treatment on the response variable. In a simple comparative experiment, the value of some response variable is measured under different experimental conditions (treatments). Experimental units are the smallest unit to which a treatment is applied.
displaying numerical data for small data sets
dot plot or stem and leaf display
single blind
either the participant or the evaluator (but not both) does not know which treatment the participant is receiving
population
entire collection of individuals or objects you want to learn about
sampling error
error that comes from drawing a sample instead of taking a census
matched pairs design
everybody gets both treatments, randomize which one is first and second
how are conclusions drawn?
From a statistical study: the conclusions that can be drawn depend on the way the data are collected. Causal relationship: need a well-designed experiment.
discrete variables
has possible values that are isolated points on the number line. It can take one of a countable list of distinct values. We can find probabilities for exact outcomes
displaying numerical data for larger data sets
histogram or box plot
negative residual
indicates that the data point falls below the regression line and the prediction was an overestimate of the observed value.
positive residual
indicates the data point falls above the regression line and the prediction was an underestimate of the observed value
quantiles of any normal distribution
invNorm (left tail area, mean, SD)
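Outside the calculator, the same lookup is available through `statistics.NormalDist.inv_cdf` in Python's standard library. The wrapper name `inv_norm` is chosen here to mirror the calculator command, and the mean of 100 and SD of 15 are example values.

```python
from statistics import NormalDist

def inv_norm(left_tail_area, mean=0.0, sd=1.0):
    """Value x such that P(X <= x) equals left_tail_area."""
    return NormalDist(mean, sd).inv_cdf(left_tail_area)

z_975 = inv_norm(0.975)              # about 1.96 on the standard normal
p90 = inv_norm(0.90, 100, 15)        # 90th percentile of N(100, 15)
print(round(z_975, 2), round(p90, 1))
```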
random variable
is continuous if its possible values are all points in some interval. We are limited to finding probabilities for intervals of values. Probabilities for exact outcomes are zero
range
maximum - minimum
control group
no/standard treatment group allows the experimenter to assess how the response variable behaves when the treatment is not used. provides a baseline against which the treatment groups can be compared to determine whether the treatment had an effect.
under which conditions is the normal distribution used to approximate the binomial distribution
Both np >= 10 and n(1 - p) >= 10.
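A quick check of the two conditions, as a sketch. When they hold, the approximating normal curve uses the binomial mean np and standard deviation sqrt(np(1 - p)); the function names and the example values of n and p are made up.

```python
from math import sqrt

def normal_approx_ok(n, p):
    """True when both np >= 10 and n(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10

def approx_parameters(n, p):
    """Mean and SD of the normal curve used for the approximation."""
    return n * p, sqrt(n * p * (1 - p))

print(normal_approx_ok(100, 0.3))    # True:  np = 30, n(1-p) = 70
print(normal_approx_ok(20, 0.1))     # False: np = 2
print(approx_parameters(100, 0.5))   # (50.0, 5.0)
```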
nonresponse bias
Occurs when responses are not obtained from all individuals selected for inclusion in the sample (those you selected did not respond). To minimize bias, it is critical that a serious effort be made to follow up with individuals who did not respond to the initial request for information.
Measurement or response bias
Occurs when the method of observation tends to produce values that systematically differ from the true value in some way. Examples: an improperly calibrated scale is used to weigh items; people tend not to be completely honest; the appearance or behavior of the person asking the questions; questions on a survey are worded in a way that tends to influence the response.
selection bias
Occurs when the way the sample is selected systematically excludes some part of the population of interest. May also occur if volunteers or self-selected individuals are used in a study.
comparative bar charts
particularly useful for comparing the distribution of a categorical variable across groups. Use relative frequencies as group size may differ
why are voluntary response samples and convenience samples unlikely to produce reliable information about a population
people who are motivated to volunteer responses often hold strong opinions. it is extremely unlikely that they are representative of the population
which measures are resistant or not to outliers
resistant measures : median and IQR nonresistant measures : mean, SD, range
convenience sampling
selecting individuals or objects that are easy or convenient to sample
standard deviation
tells us roughly the average distance of the observations from the mean; denoted by 𝜎 for a population (s for a sample)
symmetric
the distribution looks similar on both sides. One symmetric shape we will encounter frequently is a bell-shape
median
the middle data value for an odd number of observations when the data are ordered by size. When n is even, the median is the average of the two middle values. Suppose the x values are ordered by smallest to largest, then find the middle
mode
the most frequent value
standard deviation about the regression line, Se
the typical amount an observation deviates from the least squares regression line.
skewed to the right
there is more data on the left side of the distribution, the right tail is longer (positive)
skewed to the left
there is more data on the right side of the distribution, the left tail is longer (negative)
why is random selection an important component of a sampling plan?
to prevent selection bias; everyone has the same chance of getting chosen
bimodal
two prominent peaks
pie chart
Use for categorical data. ((How to construct)) A circle is used to represent the whole data set. "Slices" of the pie represent the categories. The size of a particular category's slice is proportional to its frequency or relative frequency. Most effective for summarizing data sets when there are not too many categories.
segmented or stacked bar charts
Use for categorical data. ((How to construct)) Use a rectangular bar rather than a circle to represent the entire data set. The bar is divided into segments, with different segments representing different categories. The area of each segment is proportional to the relative frequency for that category.
bar charts
useful for summarizing one categorical variable
extrapolation
Using a regression line to predict y-values for x-values outside the observed range of the data. ((should be avoided)) It gets riskier the farther we move from the range of the given x-values. There is no guarantee that the relationship will have the same trend outside the range of x-values observed.
what are the limitations of using volunteers as subjects in an experiment
Using volunteers in observational studies is never a good idea; however, it is common practice to use volunteers as subjects in experiments. Random assignment of the volunteers to treatments allows for cause-and-effect conclusions, but using volunteers limits the ability to generalize to the population.
how is a probability distribution for a continuous random variable expressed?
With a density curve. The curve always falls at or above zero, and the total area under the curve is 1. The probability that a random variable falls in a given interval is equal to the area under the curve over that interval.
normality in a data set
On a normal probability plot, you are looking for the dots to fall close to a straight line (often drawn as a 45-degree line). That happens when the data set matches what you would expect a data set of that size to look like coming from a normal distribution.