C459 Intro to Probability and Statistics
Variables - Classification
- Quantitative variables take numerical values, and represent some kind of measurement. Example: Age, Weight, and Height are quantitative variables. - Categorical variables take category or label values, and place an individual into one of several groups. Example: Race, Gender, and Smoking are categorical variables.
Spread (3 most commonly used measures)
- Range - Inter-quartile range (IQR) - Standard deviation
Histogram: When describing the shape of a distribution, we should consider:
- Symmetry/skewness of the distribution. - Peakedness (modality)—the number of peaks (modes) the distribution has.
Histogram Modes
- Unimodal: has one peak in the distribution - Bimodal: has two peaks in the distribution - Flat, or uniform: has no modes, that is, no value around which the observations are concentrated; the values are uniformly distributed. Used for categorical; not quantitative.
Probability Scale
0% = Impossible; 25% = Unlikely; 50% = Even Chance; 75% = Likely; 100% = Certain
Distribution of a variable
A description of the relative numbers of times each possible outcome will occur in a number of trials.
Contingency table
A display format used to analyze and record the relationship between two or more categorical variables.
Lurking Variable
A lurking variable is a variable that was not included in your analysis, but that could substantially change your interpretation of the data if it were included. Association does not imply causation!
Sample survey
A particular type of observational study in which individuals report variables' values themselves, frequently by giving their opinions.
Bias in sampling
A sample that produces data that is not representative because of the systematic under- or over-estimation of the values of the variable of interest is called biased. Bias may result from either a poor sampling plan or from a poor design for evaluating the variable of interest.
Linear regression
A technique of finding the line that best fits the pattern of the linear relationship.
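A minimal sketch of fitting such a line, assuming "best fits" means the least-squares criterion (the card does not name the criterion). The function name fit_line and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that lie exactly on the line y = 1 + 2x.
hours = [1, 2, 3, 4, 5]
scores = [3, 5, 7, 9, 11]
slope, intercept = fit_line(hours, scores)
print(slope, intercept)  # 2.0 1.0
```

With real data the points will not lie exactly on the line, and the fitted slope and intercept describe only the linear part of the pattern.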
Placebo group
A way of testing a medical therapy in which, in addition to a group of subjects that receives the treatment to be evaluated, a separate control group receives a "placebo" treatment which is specifically designed to have no real effect.
Randomized response technique
Allows respondents to respond to sensitive issues (such as criminal behavior or sexuality) while maintaining confidentiality.
Inter-Quartile Range (IQR) to Find Outliers
An observation is considered a suspected outlier if it is: - below Q1 - 1.5 x IQR or - above Q3 + 1.5 x IQR
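The rule above can be sketched in a few lines. Q1 and Q3 are passed in as already-computed values; the function names and the data are hypothetical.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs of the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def suspected_outliers(data, q1, q3):
    """Return the observations strictly below the lower or above the upper cutoff."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]


# Hypothetical data, already sorted. Q1 = 11 and Q3 = 17 are the medians
# of the lower half [2, 10, 12, 13] and the upper half [15, 16, 18, 40].
data = [2, 10, 12, 13, 15, 16, 18, 40]
print(outlier_fences(11, 17))            # (2.0, 26.0)
print(suspected_outliers(data, 11, 17))  # [40]
```

The value 2 sits exactly on the lower cutoff, so it is not flagged; only values strictly beyond a cutoff count as suspected outliers.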
Box Plot
A display appropriate for comparing the distribution of a quantitative response variable across the groups of a categorical explanatory variable. C --> Q
Standard Deviation Rule (Empirical rule) for a Normal Shape
Approximately 68% of the observations fall within 1 standard deviation of the mean. Approximately 95% of the observations fall within 2 standard deviations of the mean. Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.
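The three percentages can be checked against an exact normal curve. This sketch uses NormalDist from Python's standard statistics module; the helper name share_within is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The rule applies only to distributions with a normal shape; for skewed data these proportions can be quite different.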
Classical Probability
Classical methods are used for games of chance, such as flipping coins, rolling dice, spinning spinners, roulette wheels, or lotteries. They are "classical" because their values are determined by the game itself. Use the nature of the situation to determine probabilities.
Scatterplot - relationships
Direction: Positive relationship - an increase in one variable is associated with an increase in the other. Negative relationship - an increase in one variable is associated with a decrease in the other. Form: Linear - points scattered about a line. Curvilinear - points scattered about the same curved line. Strength: Strong - points closely scattered about the line or curve. Weak - points widely scattered about the line or curve. Outliers: points that deviate greatly from the pattern.
Systematic sampling
Example: Obtain a student directory with email addresses of all the university's students, and send the music poll to every 50th name on the list. It may not be subject to any clear bias, but it would not be as safe as taking a random sample.
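A small sketch of the "every 50th name" idea. The directory contents, the address format, and the function name every_kth are made up for illustration.

```python
def every_kth(names, k):
    """Return the entries in positions k, 2k, 3k, ... of the list (counting from 1)."""
    return names[k - 1::k]


# Hypothetical directory of 200 student email addresses.
directory = [f"student{i}@example.edu" for i in range(1, 201)]
sample = every_kth(directory, 50)
print(sample)
```

Here the sample is fully determined by the order of the list, which is why the method is not as safe as random selection: any pattern in the ordering carries over into the sample.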
Extrapolation
Extrapolation is prediction of the response for values of the explanatory variable that fall outside the range of the data. Since there is no way of knowing whether a relationship holds beyond the range of the explanatory variable in the data, extrapolation is not reliable, and should be avoided.
General Multiplication Rule
For any two events A and B, P(A and B) = P(A) * P(B | A). It is useful when two events, A and B, occur in stages, first A and then B (like the selection of the two cards)
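A worked instance of the rule, assuming the two-card example means drawing two cards without replacement from a standard 52-card deck, with A = "first card is an ace" and B = "second card is an ace" (the card does not spell out which events it has in mind).

```python
from fractions import Fraction

# A: the first card drawn is an ace (4 aces among 52 cards).
p_a = Fraction(4, 52)
# B given A: the second card is an ace, with 3 aces left among 51 cards.
p_b_given_a = Fraction(3, 51)

# General Multiplication Rule: P(A and B) = P(A) * P(B | A).
p_a_and_b = p_a * p_b_given_a
print(p_a_and_b)  # 1/221
```

Exact fractions are used so that the arithmetic matches a hand calculation: (4/52) * (3/51) = 1/221.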
Variable - role type classification
Four possibilities for "role-type classification": 1. Categorical explanatory and quantitative response 2. Categorical explanatory and categorical response 3. Quantitative explanatory and quantitative response 4. Quantitative explanatory and categorical response (this last classification is not discussed in the C459 course) Note: The explanatory variable is always listed first.
Treatment groups
Groups receiving different treatments in an experiment.
Simple random sample (SRS)
If individuals are sampled completely at random, and without replacement, then each group of a given size is just as likely to be selected as all the other groups of that size.
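One way to draw such a sample in code is random.Random.sample from Python's standard library, which selects without replacement. The population, the seed, and the function name below are hypothetical.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals completely at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, n)


# Hypothetical population of 1,000 student ID numbers.
population = list(range(1, 1001))
chosen = simple_random_sample(population, 10, seed=42)
print(chosen)
```

Fixing the seed only makes the illustration repeatable; in a real survey the draw would not be fixed in advance.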
Double-blind experiment
If neither the subjects nor the researchers know who was assigned what treatment.
Variables - x and y
In most studies involving two variables, each of the variables has a role. We distinguish between: the explanatory variable (also commonly referred to as the independent variable), the variable that claims to explain, predict or affect the response, typically denoted by X; and the response variable (also commonly referred to as the dependent variable), the outcome of the study, typically denoted by Y.
Linear relationships (LR)
Interpretation guide: The r value is between -1 and 1, with zero in the middle. Values less than zero are negative. Values greater than zero are positive. Values closer to zero are weaker. Values closer to -1 or 1 are stronger. A scatterplot graphically represents the relationship: if the points on the scatterplot appear to be scattered randomly, there is no linear relationship between the variables and r is close to 0.
Slope
Reading left to right: if the line goes up, it is a positive slope; if the line goes down, it is a negative slope.
Sampling frame
The list of potential individuals to be sampled. Bias can result when the sampling frame does not match the population of interest.
BoxPlot - Components using FNS
Max = top vertical line; largest observation that is not an outlier. Min = bottom vertical line; smallest observation that is not an outlier. Q3 = top of the box. Q1 = bottom of the box. Median (M) = horizontal line inside the box. Note: the width of the box has no relevance.
Mean
Mean is the average of a set of observations
Inter-Quartile Range (IQR)
Measures the variability of a distribution by giving us the range covered by the MIDDLE 50% of the data. - Step 1: Arrange the data in increasing order, and find the median - Step 2: Find the median of the lower 50% of the data = 1st quartile (Q1) - Step 3: Repeat this again for the top 50% of the data = 3rd quartile (Q3) - Step 4: The middle 50% of the data falls between Q1 and Q3. Therefore: IQR = Q3 - Q1 Note: When the number of data items is odd, the median is not included in either the bottom or top half of the data; when the number of data items is even, the data are naturally divided into two halves.
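The four steps can be sketched as follows, using the median-of-halves convention described on this card. The function names and the data are hypothetical.

```python
def median_of_sorted(values):
    """Median of a list that is already in increasing order."""
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2


def quartiles(data):
    """Return (Q1, M, Q3) using the median-of-halves method."""
    values = sorted(data)                 # Step 1: order the data
    n = len(values)
    half = n // 2
    lower = values[:half]                 # lower 50% (median excluded when n is odd)
    upper = values[half + 1:] if n % 2 == 1 else values[half:]
    return median_of_sorted(lower), median_of_sorted(values), median_of_sorted(upper)


# Hypothetical data with an odd number of observations.
data = [7, 1, 3, 9, 5, 11, 13]
q1, m, q3 = quartiles(data)
iqr = q3 - q1                             # Step 4: IQR = Q3 - Q1
print(q1, m, q3, iqr)                     # 3 7 11 8
```

Library routines may use other quartile conventions and can give slightly different values for the same data.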
Mode
Mode is the most commonly occurring value in a distribution. For simple datasets where the frequency of each value is available or easily determined, the value that occurs with the highest frequency is the mode.
Probability with Equally Likely Outcomes
Number of outcomes in event divided by total number of outcomes in sample space
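A short illustration with one roll of a fair six-sided die, where the event is "the roll is even". The die example and the names below are chosen for illustration only.

```python
from fractions import Fraction

# Sample space for one roll of a fair die: six equally likely outcomes.
sample_space = {1, 2, 3, 4, 5, 6}
# Event: the roll is even.
event_even = {x for x in sample_space if x % 2 == 0}


def probability(event, space):
    """Number of outcomes in the event divided by the number of outcomes in the sample space."""
    return Fraction(len(event & space), len(space))


print(probability(event_even, sample_space))  # 1/2
```

The formula applies only when every outcome in the sample space is equally likely.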
Empirical Probability
An observational method of determining probability that allows us to check whether a coin, die, roulette wheel, etc. is fair. For a fair coin (two equally likely outcomes), the observed relative frequency of each outcome would be close to or exactly 0.5 (50%). Empirical methods use a series of trials that produce outcomes that cannot be predicted in advance (hence the uncertainty).
Outliers
Observations that fall outside the overall pattern.
Open vs. closed questions
An open question allows for almost unlimited responses, which makes categorization difficult. A closed question is designed to narrow the responses, which results in defined categories.
Probability tree
Particularly useful for showing probabilities when the order of outcomes is important.
Hawthorne Effect
People in an experiment behave differently from how they would normally behave.
Empirical Probability - Relative Frequency
Relative frequency = number of times the event occurs divided by the total number of repetitions
Multi-stage sampling
Represents a more complicated form of cluster sampling in which larger clusters are further subdivided into smaller, more targeted groupings for the purposes of surveying.
Experiment
Researchers assign the values of the explanatory variable to the individuals. The researchers "take control" of the values of the explanatory variable because they want to see how changes in the value of the explanatory variable affect the response variable. (Note: By nature, any experiment involves at least two variables.)
Sample Space and Events
Sample space: The set of all possible outcomes. Event: A subset of the sample space, which may include one or more outcomes.
Scatterplot - labeled scatterplot
Sometimes it makes sense to identify subgroups or categories within the data on the scatterplot, by labeling each subgroup differently.
Skewed Left
The tail is on the left: the left tail (smaller values) is much longer than the right tail (larger values). An example of a real-life variable that has a skewed left distribution is age of death from natural causes (heart disease, cancer, etc.). Most such deaths happen at older ages, with fewer cases happening at younger ages. Note: skewed distributions can also be bimodal.
Skewed Right
The tail is on the right: the right tail (larger values) is much longer than the left tail (smaller values). An example of a real-life variable that has a skewed right distribution is salary. Most people earn in the low/medium range of salaries, with a few exceptions (CEOs, professional athletes, etc.) that are distributed along a large range (long "tail") of higher values. Note: skewed distributions can also be bimodal.
Cluster Sampling
Technique is used when a population is naturally divided into groups (clusters). Example: all students in a university are divided into majors; all nurses in a city are divided into hospitals; all registered voters are divided into precincts.
Law of Large Numbers
The Law of Large Numbers states that as the number of trials increases, the relative frequency gets closer and closer to the actual probability.
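A simulation sketch of this idea using simulated flips of a fair coin. The seed and the function name are arbitrary choices made for illustration.

```python
import random


def relative_frequency_of_heads(n_flips, seed=0):
    """Simulate n_flips of a fair coin and return (number of heads) / n_flips."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

For small numbers of flips the relative frequency can be far from 0.5; for large numbers it tends to settle near 0.5, although it need not equal it exactly.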
Scatterplot
The appropriate graphical display for examining the relationship between two quantitative variables is the scatterplot.
Midpoint (distribution)
The center of the distribution is its midpoint—the value that divides the distribution so that approximately half the observations take smaller values, and approximately half the observations take larger values.
BoxPlot - Five Number Summary (FNS) Defined
The combination of all five numbers (min, Q1, M, Q3, max) provides a quick numerical description of both the center and spread of a distribution.
Linear relationships - Correlation coefficient (r)
The correlation coefficient (r) is a numerical measure of the strength and direction of a linear relationship between two quantitative variables.
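A sketch of how r can be computed, using the standard Pearson formula (the card itself gives no formula, so the formula choice is an assumption). The function name and all data are hypothetical.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))         # perfect positive line: r = 1
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))         # perfect negative line: r = -1
print(correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])) # points on a curve y = x^2: r = 0
```

The last case shows a perfect curved pattern with r = 0, a reminder that r measures only linear relationships.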
Non-linear relationships
The correlation coefficient r is useless for assessing the strength of any type of relationship that is not linear (including relationships that are curvilinear). Beware, then, of interpreting the fact that "r is close to 0" as an indicator of a "weak relationship" rather than a "weak linear relationship."
Variable - explanatory variable
The explanatory variable (aka independent variable) is the variable that claims to explain, predict or affect the response. Typically the explanatory (or independent) variable is denoted by X.
Scatterplot - layout
The explanatory variable is always on the horizontal X-axis. The response variable is always plotted on the vertical Y-axis. If there is no clear distinction between explanatory and response variables, each of the variables can be plotted on either axis.
Control group
The group in an experiment or study that does not receive treatment by the researchers and is then used as a benchmark to measure how the other tested subjects do.
Standard deviation
The idea behind the standard deviation is to quantify the spread of a distribution by measuring how far the observations are from their mean. The standard deviation gives the average (or typical distance) between a data point and the mean.
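A sketch of the calculation, assuming the sample version of the formula that divides by n - 1 (the card does not say which divisor is intended). The function name and the data are hypothetical.

```python
from math import sqrt


def standard_deviation(data):
    """Sample standard deviation: square root of the sum of squared deviations over (n - 1)."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = sum((x - mean) ** 2 for x in data)
    return sqrt(squared_deviations / (n - 1))  # assumption: n - 1 divisor


# Hypothetical data with mean 5; the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(standard_deviation(data), 3))  # 2.138
```

Dividing by n instead of n - 1 would give exactly 2 for these data; the two versions get closer as the number of observations grows.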
Median
The median M is the midpoint of the distribution. It is the number such that half of the observations fall above, and half fall below. To find the median: Order the data from smallest to largest. Consider whether n, the number of observations, is even or odd. - If n is odd, the median M is the center observation in the ordered list. This observation is the one "sitting" in the (n + 1) / 2 spot in the ordered list. - If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones "sitting" in the n / 2 and n / 2 + 1 spots in the ordered list.
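The position rules above translate directly into code. The spots on the card are counted from 1, while Python lists are indexed from 0, hence the "- 1" adjustments. The function name is hypothetical.

```python
def median(data):
    """Median found from the ordered list, following the odd/even position rules."""
    values = sorted(data)
    n = len(values)
    if n % 2 == 1:
        # Odd n: the observation in spot (n + 1) / 2, counting from 1.
        return values[(n + 1) // 2 - 1]
    # Even n: the mean of the observations in spots n / 2 and n / 2 + 1.
    return (values[n // 2 - 1] + values[n // 2]) / 2


print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```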
Randomized controlled double-blind experiment
The most reliable way to determine whether the explanatory variable is actually causing changes in the response variable. Depending on the variables of interest, such a design may not be entirely feasible, but the closer researchers get to achieving this ideal design, the more convincing their claims of causation (or lack thereof) are.
Sample size
The number of observations or replicates to include in a statistical sample.
Range
The range is exactly the distance between the smallest data point (Min) and the largest one (Max). Range = Max - Min
Variable - response variable
The response variable (aka dependent variable) is the outcome of the study. Typically the response (or dependent) variable is denoted by Y.
Matched pairs design
The same person gets both treatments, with the order of administration randomized within each matched pair.
Spread (also called variability)
The spread or variability of the distribution can be described by the approximate range covered by the data, from the minimum to the maximum.
Stemplot
The stemplot (also called stem and leaf plot) is another graphical display of the distribution of quantitative data. Separate each data point into a stem and leaf, as follows: - The leaf is the right-most digit. - The stem is everything except the right-most digit. So, if the data point is 34, then 3 is the stem and 4 is the leaf. If the data point is 3.41, then 3.4 is the stem and 1 is the leaf.
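A minimal sketch of the stem/leaf split for non-negative whole numbers, where the leaf is the ones digit. The function name and the data are hypothetical, and decimals such as 3.41 are not handled here.

```python
from collections import defaultdict


def stemplot(values):
    """Map each stem to its leaves in increasing order (non-negative integers only)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # leaf = right-most digit, stem = everything else
        stems[stem].append(leaf)
    return dict(stems)


# Hypothetical data.
data = [34, 31, 47, 52, 38, 45, 50]
for stem, leaves in stemplot(data).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

This sketch skips stems that have no leaves; a hand-drawn stemplot would normally list them as empty rows so that gaps in the data stay visible.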
Blind experiment
The subjects are unaware of which treatment is being administered to them.
Observational study
The values of the variable or variables of interest are recorded as they naturally occur. There is no interference by the researchers who conduct the study. In an observational study, individuals are divided into groups according to what they themselves choose, not by researcher assignment. It is because of the existence of a virtually unlimited number of potential lurking variables that we can never be 100% certain of a claim of causation based on an observational study.
Retrospective observational study
The values of the variables of interest are recorded backward in time. Example: A study that asks people to recall from a previous time period their actions during a specified event or activity.
Prospective observational study
The values of the variables of interest are recorded forward in time. Example: A study that asks people ahead of time to document their actions during a specified event or activity.
Independent vs. dependent
Independent: Two events A and B are said to be independent if the fact that one event has occurred does not affect the probability that the other event will occur. They cannot be disjoint. Dependent: If whether or not one event occurs does affect the probability that the other event will occur, then the two events are said to be dependent.
Disjoint vs. not disjoint
Disjoint: Two events that cannot occur at the same time. Events that are disjoint cannot be independent. Not disjoint: Two events that can occur at the same time.
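A sketch that checks independence by comparing P(B | A) with P(B), using one roll of a fair six-sided die. The events and function names are chosen for illustration, and P(B | A) is computed as P(A and B) / P(A).

```python
from fractions import Fraction

SAMPLE_SPACE = frozenset(range(1, 7))  # one roll of a fair six-sided die


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))


def conditional_prob(b, a):
    """P(B | A) = P(A and B) / P(A); requires P(A) > 0."""
    return prob(a & b) / prob(a)


def are_independent(a, b):
    """A and B are independent if knowing A occurred leaves P(B) unchanged."""
    return conditional_prob(b, a) == prob(b)


even = frozenset({2, 4, 6})
at_most_four = frozenset({1, 2, 3, 4})
one = frozenset({1})

print(are_independent(even, at_most_four))  # True: P(B | A) = 2/3 = P(B)
print(are_independent(even, one))           # False: disjoint, so P(B | A) = 0 but P(B) = 1/6
```

The second pair is disjoint: once the roll is known to be even, a 1 is impossible, so the events are dependent.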
Randomized controlled experiments
Under random assignment, the groups should not differ significantly with respect to any potential lurking variable. When subjects are randomly assigned to the different treatments we can draw causal conclusions.
Stratified Sampling
Used when our population is naturally divided into sub-populations. For example, all the students in a certain college are divided by gender or by year in college; all the registered voters in a certain city are divided by race.
Two-way probability table
When there are two categorical variables in the background, each with two possible values, a two-way probability table is a quick and easy way to display the probabilities associated with the 4 possible combinations.
Simpson's Paradox
Whenever including a lurking variable causes us to rethink the direction of an association.
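A numerical illustration with made-up counts: treatment A has the higher recovery rate within each severity group, yet treatment B has the higher rate overall, because severity (the lurking variable) is unevenly split between the treatments. All numbers and names are hypothetical.

```python
# Hypothetical counts: (recovered, treated) for each treatment within each severity group.
counts = {
    "mild":   {"A": (90, 100),   "B": (800, 1000)},
    "severe": {"A": (300, 1000), "B": (20, 100)},
}


def rate(recovered, treated):
    """Recovery rate as a proportion."""
    return recovered / treated


def overall_rate(treatment):
    """Recovery rate for a treatment when the severity groups are pooled."""
    recovered = sum(counts[group][treatment][0] for group in counts)
    treated = sum(counts[group][treatment][1] for group in counts)
    return rate(recovered, treated)


for group in counts:
    print(group, rate(*counts[group]["A"]), rate(*counts[group]["B"]))
print("overall", round(overall_rate("A"), 3), round(overall_rate("B"), 3))
```

Within each group A looks better (0.9 vs 0.8 and 0.3 vs 0.2), but pooled together B looks better (about 0.745 vs 0.355), since most of A's patients were severe cases.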
Convenience sample
Where individuals happen to be at the right time and place to suit the schedule of the researcher. A convenience sample may also be susceptible to bias because certain types of individuals are more likely to be selected than others.
Volunteer sample
Where individuals have selected themselves to be included. Such a sample is almost guaranteed to be biased.
Intercept
The x-intercept is a point on the graph where y is zero; the y-intercept is a point on the graph where x is zero.