C459 Intro to Probability and Statistics

Variables - Classification

- Quantitative variables take numerical values and represent some kind of measurement. Example: age, weight, and height are quantitative variables.
- Categorical variables take category or label values and place an individual into one of several groups. Example: race, gender, and smoking are categorical variables.

Spread (3 most commonly used measures)

- Range
- Inter-quartile range (IQR)
- Standard deviation

Histogram: When describing the shape of a distribution, we should consider:

- Symmetry/skewness of the distribution.
- Peakedness (modality): the number of peaks (modes) the distribution has.

Histogram Modes

- Unimodal: has one peak in the distribution
- Bimodal: has two peaks in the distribution
- Flat, or uniform: has no modes, or no value around which the observations are concentrated; the values are uniformly distributed. Used for categorical, not quantitative.

Probability Scale

0% = Impossible
25% = Unlikely
50% = Even chance
75% = Likely
100% = Certain

Distribution of a variable

A description of the relative numbers of times each possible outcome will occur in a number of trials.

Contingency table

A display format used to analyze and record the relationship between two or more categorical variables.

Lurking Variable

A lurking variable is a variable that was not included in your analysis, but that could substantially change your interpretation of the data if it were included. Association does not imply causation!

Sample survey

A particular type of observational study in which individuals report variables' values themselves, frequently by giving their opinions.

Bias in sampling

A sample that produces data that is not representative because of the systematic under- or over-estimation of the values of the variable of interest is called biased. Bias may result from either a poor sampling plan or from a poor design for evaluating the variable of interest.

Linear regression

A technique for finding the line that best fits the pattern of a linear relationship.
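As an illustration, the "best fit" line is usually taken to be the least-squares line. That criterion is an assumption here, since the card does not name it, and the function name and data below are hypothetical. A minimal sketch:

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lie exactly on the line y = 1 + 2x.
hours = [1, 2, 3, 4]
scores = [3, 5, 7, 9]
slope, intercept = least_squares_line(hours, scores)
print(slope, intercept)
```

Because the example points fall exactly on a line, the fitted slope and intercept recover that line; with real data the points scatter around it.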

Placebo group

A way of testing a medical therapy in which, in addition to a group of subjects that receives the treatment to be evaluated, a separate control group receives a "placebo" treatment which is specifically designed to have no real effect.

Randomized response technique

Allows respondents to respond to sensitive issues (such as criminal behavior or sexuality) while maintaining confidentiality.

Inter-Quartile Range (IQR) to Find Outliers

An observation is considered a suspected outlier if it is: - below Q1 - 1.5 x IQR or - above Q3 + 1.5 x IQR
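The 1.5 x IQR rule above can be sketched in a few lines. The function names and the data set are made up for illustration; Q1 and Q3 are passed in as already-known values.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs of the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def suspected_outliers(data, q1, q3):
    """Return the observations that fall below the lower or above the upper cutoff."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]


# Hypothetical data with Q1 = 22 and Q3 = 27, so IQR = 5.
data = [2, 21, 23, 24, 25, 26, 28, 60]
print(outlier_fences(22, 27))            # (14.5, 34.5)
print(suspected_outliers(data, 22, 27))  # [2, 60]
```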

Box Plot

An appropriate display for examining the relationship between a categorical explanatory variable and a quantitative response variable (C --> Q). It compares the distribution of the quantitative variable across the categories.

Standard Deviation Rule (Empirical rule) for a Normal Shape

Approximately 68% of the observations fall within 1 standard deviation of the mean. Approximately 95% of the observations fall within 2 standard deviations of the mean. Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.
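To make the rule concrete, the three intervals can be computed from a mean and a standard deviation. The function name and the example values (mean 100, standard deviation 15) are assumptions chosen only for illustration.

```python
def empirical_rule_intervals(mean, sd):
    """Return the intervals that hold roughly 68%, 95%, and 99.7% of a normal distribution."""
    return {
        "68%": (mean - sd, mean + sd),
        "95%": (mean - 2 * sd, mean + 2 * sd),
        "99.7%": (mean - 3 * sd, mean + 3 * sd),
    }


intervals = empirical_rule_intervals(100, 15)
for share, (low, high) in intervals.items():
    print(f"About {share} of observations fall between {low} and {high}")
```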

Classical Probability

Classical methods are used for games of chance, such as flipping coins, rolling dice, spinning spinners, roulette wheels, or lotteries. They are "classical" because their values are determined by the game itself. Use the nature of the situation to determine probabilities.

Scatterplot - relationships

Direction:
- Positive relationship: an increase in one variable is associated with an increase in the other.
- Negative relationship: an increase in one variable is associated with a decrease in the other.
Form:
- Linear: points scattered about a line.
- Curvilinear: points scattered about the same curved line.
Strength:
- Strong: points closely scattered about the line or curve.
- Weak: points widely scattered about the line or curve.
Outliers: points that deviate greatly from the pattern.

Systematic sampling

Example: Obtain a student directory with email addresses of all the university's students, and send the music poll to every 50th name on the list. It may not be subject to any clear bias, but it would not be as safe as taking a random sample.

Extrapolation

Extrapolation is prediction of the response for values of the explanatory variable that fall outside the range of the data. Since there is no way of knowing whether a relationship holds beyond the range of the explanatory variable in the data, extrapolation is not reliable and should be avoided.

General Multiplication Rule

For any two events A and B, P(A and B) = P(A) * P(B | A). It is useful when two events, A and B, occur in stages, first A and then B (like the selection of the two cards)
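A worked example of the rule, assuming (for illustration only) that two cards are drawn without replacement from a standard 52-card deck and that A = "first card is an ace" and B = "second card is an ace":

```python
from fractions import Fraction

# P(A): 4 aces among 52 cards on the first draw.
p_first_ace = Fraction(4, 52)

# P(B | A): once an ace is gone, 3 aces remain among 51 cards.
p_second_ace_given_first = Fraction(3, 51)

# General Multiplication Rule: P(A and B) = P(A) * P(B | A).
p_both_aces = p_first_ace * p_second_ace_given_first
print(p_both_aces)  # 1/221
```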

Variable - role type classification

Four possibilities for "role-type classification":
1. Categorical explanatory and quantitative response
2. Categorical explanatory and categorical response
3. Quantitative explanatory and quantitative response
4. Quantitative explanatory and categorical response (this last classification is not discussed in the C459 course)
Note: The explanatory variable is always listed first.

Treatment groups

Groups receiving different treatments in an experiment.

Simple random sample (SRS).

If individuals are sampled completely at random, and without replacement, then each group of a given size is just as likely to be selected as all the other groups of that size.

Double-blind experiment

An experiment in which neither the subjects nor the researchers know who was assigned what treatment.

Variables - x and y

In most studies involving two variables, each of the variables has a role. We distinguish between:
- The explanatory variable (also commonly referred to as the independent variable): the variable that claims to explain, predict, or affect the response. Typically the explanatory (or independent) variable is denoted by X.
- The response variable (also commonly referred to as the dependent variable): the outcome of the study. Typically the response (or dependent) variable is denoted by Y.

Linear relationships (LR)

Interpretation guide:
- The value of r is between -1 and 1, with zero in the middle.
- Values less than zero are negative (negative relationship).
- Values greater than zero are positive (positive relationship).
- Values closer to zero are weaker.
- Values closer to -1 or 1 are stronger.
A scatterplot graphically represents the relationship. If the points on the scatterplot appear to be scattered randomly, there is no linear relationship between the variables and r is close to 0.

Slope

Reading the line from left to right:
- If it goes up, it is a positive slope.
- If it goes down, it is a negative slope.

Sampling frame

The list of potential individuals to be sampled. It does not always match the population of interest, which can introduce bias.

BoxPlot - Components using FNS

- Max = top vertical line; largest observation that is not an outlier
- Min = bottom vertical line; smallest observation that is not an outlier
- Q3 = top of the box
- Q1 = bottom of the box
- Median (M) = horizontal line inside the box
Note: the width of the box has no relevance.

Mean

Mean is the average of a set of observations

Inter-Quartile Range (IQR)

Measures the variability of a distribution by giving the range covered by the MIDDLE 50% of the data.
- Step 1: Arrange the data in increasing order and find the median.
- Step 2: Find the median of the lower 50% of the data. This is the 1st quartile (Q1).
- Step 3: Find the median of the top 50% of the data. This is the 3rd quartile (Q3).
- Step 4: The middle 50% of the data falls between Q1 and Q3. Therefore: IQR = Q3 - Q1
Note: When the number of data items is odd, the median is not included in either the bottom or the top half of the data; when the number of data items is even, the data are naturally divided into two halves.
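The steps above can be sketched as follows. The function names and data are hypothetical, and the quartile convention is the one described on this card (other textbooks and software packages may use slightly different quartile definitions).

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the data."""
    ordered = sorted(values)          # Step 1: arrange in increasing order
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # lower 50% of the data
    # For an odd count, skip the median itself; for an even count, split evenly.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(upper)   # Steps 2 and 3


def iqr(values):
    """Step 4: IQR = Q3 - Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1


print(quartiles([1, 3, 5, 7, 9, 11, 13]), iqr([1, 3, 5, 7, 9, 11, 13]))
print(quartiles([2, 21, 23, 24, 25, 26, 28, 60]), iqr([2, 21, 23, 24, 25, 26, 28, 60]))
```

In the first (odd-sized) data set the median 7 is left out of both halves, giving Q1 = 3 and Q3 = 11; in the second (even-sized) set the halves split cleanly, giving Q1 = 22 and Q3 = 27.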

Mode

Mode is the most commonly occurring value in a distribution. For simple datasets where the frequency of each value is available or easily determined, the value that occurs with the highest frequency is the mode.

Probability with Equally Likely Outcomes

Number of outcomes in event divided by total number of outcomes in sample space
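For example, assuming a fair six-sided die (an illustration that is not part of the card), the probability of rolling an even number can be counted directly:

```python
from fractions import Fraction


def probability(event, sample_space):
    """P(event) = outcomes in the event / outcomes in the sample space (equally likely outcomes)."""
    return Fraction(len(event & sample_space), len(sample_space))


sample_space = {1, 2, 3, 4, 5, 6}                        # one roll of a die
even_roll = {x for x in sample_space if x % 2 == 0}      # the event {2, 4, 6}
print(probability(even_roll, sample_space))              # 1/2
```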

Empirical Probability

An observational method of determining probability. It lets us check, for example, whether a coin, die, or roulette wheel is fair. For a fair device with two possible outcomes, the relative frequency of each outcome would be close to or exactly 0.5 (50%). Empirical methods use a series of trials that produce outcomes that cannot be predicted in advance (hence the uncertainty).

Outliers

Observations that fall outside the overall pattern.

Open vs. closed questions

An open question allows for almost unlimited responses, which makes categorization difficult. A closed question is designed to narrow the response, which results in defined categories.

Probability tree

Particularly useful for showing probabilities when the order of outcomes is important.

Hawthorne Effect

People in an experiment behave differently from how they would normally behave.

Empirical Probability - Relative Frequency

Relative frequency = number of times the event occurs divided by the total number of repetitions
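A small sketch of the calculation, using a made-up record of ten coin tosses (the data and the function name are illustrative assumptions):

```python
def relative_frequency(outcomes, event):
    """Number of repetitions in which the event occurred, divided by all repetitions."""
    hits = sum(1 for outcome in outcomes if outcome in event)
    return hits / len(outcomes)


tosses = ["H", "T", "T", "H", "H", "T", "H", "H", "T", "H"]
print(relative_frequency(tosses, {"H"}))  # 0.6, since heads came up 6 times in 10 tosses
```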

Multi-stage sampling

Represents a more complicated form of cluster sampling in which larger clusters are further subdivided into smaller, more targeted groupings for the purposes of surveying.

Experiment

The researchers intervene and assign the values of the explanatory variable to the individuals. The researchers "take control" of the values of the explanatory variable because they want to see how changes in the value of the explanatory variable affect the response variable. (Note: By nature, any experiment involves at least two variables.)

Sample Space and Events

Sample space: The set of all possible outcomes.
Event: A subset of the sample space; it may include one or more outcomes.

Scatterplot - labeled scatterplot

Sometimes it makes sense to identify subgroups or categories within the data on the scatterplot, by labeling each subgroup differently.

Skewed Left

The tail is on the left: the left tail (smaller values) is much longer than the right tail (larger values). An example of a real-life variable that has a skewed left distribution is age of death from natural causes (heart disease, cancer, etc.). Most such deaths happen at older ages, with fewer cases happening at younger ages. Note: skewed distributions can also be bimodal.

Skewed Right

The tail is on the right: the right tail (larger values) is much longer than the left tail (smaller values). An example of a real-life variable that has a skewed right distribution is salary. Most people earn in the low/medium range of salaries, with a few exceptions (CEOs, professional athletes, etc.) that are distributed along a large range (long "tail") of higher values. Note: skewed distributions can also be bimodal.

Cluster Sampling

Technique is used when a population is naturally divided into groups (clusters). Example: all students in a university are divided into majors; all nurses in a city are divided into hospitals; all registered voters are divided into precincts.

Law of Large Numbers

The Law of Large Numbers states that as the number of trials increases, the relative frequency gets closer and closer to the actual probability.
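A simulation can illustrate the idea. The sketch below assumes a fair coin simulated with Python's pseudo-random generator; the function name, the seed, and the toss counts are arbitrary choices for illustration.

```python
import random


def heads_relative_frequency(n_tosses, seed=0):
    """Simulate n_tosses of a fair coin and return the relative frequency of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


# The relative frequency tends to settle near the true probability 0.5 as n grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, heads_relative_frequency(n))
```

With few tosses the relative frequency can be far from 0.5; with many tosses it is typically very close, although it need not equal 0.5 exactly.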

Scatterplot

The appropriate graphical display for examining the relationship between two quantitative variables is the scatterplot.

Midpoint (distribution)

The center of the distribution is its midpoint—the value that divides the distribution so that approximately half the observations take smaller values, and approximately half the observations take larger values.

BoxPlot - Five Number Summary (FNS) Defined

The combination of all five numbers (min, Q1, M, Q3, max) provides a quick numerical description of both the center and spread of a distribution.

Linear relationships - Correlation coefficient (r)

The correlation coefficient (r) is a numerical measure that measures the strength and direction of a linear relationship between two quantitative variables.
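For reference, r can be computed directly from the data. The card gives no formula, so the sketch below uses the standard Pearson formula; the function name and data sets are made up for illustration.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


print(correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))   # perfect positive linear relationship
print(correlation([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))   # perfect negative linear relationship
print(correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # y = x^2: strong curve, but no linear trend
```

The last data set follows y = x^2 exactly, yet r comes out as 0, because r only measures the linear part of a relationship.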

Non-linear relationships

The correlation coefficient r is useless for assessing the strength of any type of relationship that is not linear (including relationships that are curvilinear). Beware, then, of interpreting the fact that "r is close to 0" as an indicator of a "weak relationship" rather than a "weak linear relationship."

Variable - explanatory variable

The explanatory variable (aka independent variable) is the variable that claims to explain, predict, or affect the response. Typically the explanatory (or independent) variable is denoted by X.

Scatterplot - layout

The explanatory variable is always on the horizontal X-axis. The response variable is always plotted on the vertical Y-axis. If there is no clear distinction between explanatory and response variables, each of the variables can be plotted on either axis.

Control group

The group in an experiment or study that does not receive treatment by the researchers and is then used as a benchmark to measure how the other tested subjects do.

Standard deviation

The idea behind the standard deviation is to quantify the spread of a distribution by measuring how far the observations are from their mean. The standard deviation gives the average (or typical) distance between a data point and the mean.
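The card does not give a formula. A minimal sketch, assuming the usual sample standard deviation (squared deviations from the mean, divided by n - 1, then square-rooted) and a made-up data set:

```python
from math import sqrt


def sample_standard_deviation(values):
    """Sample standard deviation: sqrt of the sum of squared deviations divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sqrt(sum(squared_deviations) / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5, squared deviations sum to 32
print(sample_standard_deviation(data))
```

Python's standard library offers the same calculation as `statistics.stdev`.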

Median

The median M is the midpoint of the distribution. It is the number such that half of the observations fall above, and half fall below. To find the median: Order the data from smallest to largest. Consider whether n, the number of observations, is even or odd.
- If n is odd, the median M is the center observation in the ordered list. This observation is the one "sitting" in the (n + 1) / 2 spot in the ordered list.
- If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones "sitting" in the n / 2 and n / 2 + 1 spots in the ordered list.
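The two cases can be sketched as below. The function name is hypothetical; note that the "spots" in the card count from 1, while Python list indexes count from 0.

```python
def find_median(values):
    """Median via the (n + 1) / 2 rule for odd n and the two-center rule for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Spot (n + 1) / 2, counting from 1, is index (n + 1) // 2 - 1.
        return ordered[(n + 1) // 2 - 1]
    lower_center = ordered[n // 2 - 1]   # spot n / 2
    upper_center = ordered[n // 2]       # spot n / 2 + 1
    return (lower_center + upper_center) / 2


print(find_median([7, 1, 5]))      # 5   (odd n: the center observation)
print(find_median([4, 1, 3, 2]))   # 2.5 (even n: mean of 2 and 3)
```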

Randomized controlled double-blind experiment

The most reliable way to determine whether the explanatory variable is actually causing changes in the response variable. Depending on the variables of interest, such a design may not be entirely feasible, but the closer researchers get to achieving this ideal design, the more convincing their claims of causation (or lack thereof) are.

Sample size

The number of observations or replicates to include in a statistical sample.

Range

The range is exactly the distance between the smallest data point (min) and the largest one (Max). Range = Max - Min

Variable - response variable

The response variable (aka dependent variable) is the outcome of the study. Typically the response (or dependent) variable is denoted by Y.

Matched pairs design

Each matched pair (often the same person measured twice) receives both treatments, and the order or assignment of treatments is randomized within each pair.

Spread (also called variability)

The spread or variability of the distribution can be described by the approximate range covered by the data, such as from the minimum to the maximum.

Stemplot

The stemplot (also called stem and leaf plot) is another graphical display of the distribution of quantitative data. Separate each data point into a stem and leaf, as follows: - The leaf is the right-most digit. - The stem is everything except the right-most digit. So, if the data point is 34, then 3 is the stem and 4 is the leaf. If the data point is 3.41, then 3.4 is the stem and 1 is the leaf.

Blind experiment

The subjects are unaware of which treatment is being administered to them.

Observational study

The values of the variable or variables of interest are recorded as they naturally occur. There is no interference by the researchers who conduct the study. In an observational study, individuals fall into groups based on what they themselves choose, not on an assignment by the researchers. It is because of the existence of a virtually unlimited number of potential lurking variables that we can never be 100% certain of a claim of causation based on an observational study.

Retrospective observational study

The values of the variables of interest are recorded backward in time. Example: A study that asks people to recall from a previous time period their actions during a specified event or activity.

Prospective observational study

The values of the variables of interest are recorded forward in time. Example: A study that asks people ahead of time to document their actions during a specified event or activity.

Independent vs. dependent events

Independent: Two events A and B are said to be independent if the fact that one event has occurred does not affect the probability that the other event will occur. Independent events cannot be disjoint.
Dependent: If whether or not one event occurs does affect the probability that the other event will occur, then the two events are said to be dependent.

Disjoint vs. not disjoint events

Disjoint: Two events that cannot occur at the same time. Events that are disjoint cannot be independent.
Not disjoint: Two events that can occur at the same time.

Randomized controlled experiments

Under random assignment, the groups should not differ significantly with respect to any potential lurking variable. When subjects are randomly assigned to the different treatments we can draw causal conclusions.

Stratified Sampling

Used when our population is naturally divided into sub-populations. For example, all the students in a certain college are divided by gender or by year in college; all the registered voters in a certain city are divided by race.

Two-way probability table

When there are two categorical variables in the background, each with two possible values, a two-way probability table is a quick and easy way to display the probabilities associated with the 4 possible combinations.

Simpson's Paradox

Occurs whenever including a lurking variable causes us to rethink the direction of an association.

Convenience sample

Where individuals happen to be at the right time and place to suit the schedule of the researcher. A convenience sample may also be susceptible to bias because certain types of individuals are more likely to be selected than others.

Volunteer sample

Where individuals have selected themselves to be included. Such a sample is almost guaranteed to be biased.

Intercept

x-intercept is a point on the graph where y is zero, y-intercept is a point on the graph where x is zero. Said another way: an x-intercept is a point in the equation where the y-value is zero, and a y-intercept is a point in the equation where the x-value is zero.
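For a straight line written as y = slope * x + b, both intercepts follow from setting one coordinate to zero. The sketch below assumes that line form; the function name and example line are illustrative.

```python
def line_intercepts(slope, b):
    """Return (x_intercept, y_intercept) of the line y = slope * x + b.

    The y-intercept is the y-value at x = 0, which is b.
    The x-intercept is the x-value at y = 0, which is -b / slope.
    A horizontal line (slope 0) has no single x-intercept, so None is returned for it.
    """
    x_intercept = None if slope == 0 else -b / slope
    return x_intercept, b


print(line_intercepts(2, -4))  # (2.0, -4): y = 2x - 4 crosses the x-axis at x = 2 and the y-axis at y = -4
```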

