Statistics at UC Davis (STA 13, SOC 106, NUT 258)

Deviation

(xi - x̅) A measure of risk -- how far each value of x deviates from the overall mean; the difference between what happened (x) and what we expected to happen (x̅, the mean). This is a "deviation" away from what you "expect" and is a measure of risk. Soc 106: A deviation is positive when the observation falls above the mean and negative when it falls below the mean.

Histogram

*Can only be used with quantitative data;* a graph of a relative frequency distribution for a quantitative variable, consisting of bars of equal width drawn adjacent to each other (always touching, unless there are gaps in the data). The horizontal scale represents the classes or quantitative data values and the vertical scale represents frequencies. The heights of the bars correspond to frequency values. Choosing intervals for frequency distributions and histograms is a matter of common sense. If too few intervals are used, too much information is lost. If too many intervals are used, they are so narrow that the information present is difficult to digest.

Stratified sampling

*Divide the entire population into strata or distinct subgroups* (based on a common characteristic like age, income, education etc. --> e.g. males vs. females; freshman, sophomore, junior and senior) --> then a random sample of a certain size is drawn from each stratum; *need to have at least two groups*

Interquartile Range (IQR)

*Interquartile range (IQR) = Q3 - Q1 --> gives you the spread of the middle half of the data*. The IQR increases as the variability increases and is useful for comparing the variability of different groups. Less affected by outliers. For bell-shaped distributions, the distance from the mean to either quartile is about 2/3 of a standard deviation, so the IQR is roughly (4/3)s.
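
A minimal Python sketch (not from the course notes; the data values are made up). The standard library's statistics.quantiles with its default method uses the (P/100)*(n+1) locator described in these cards:

from statistics import quantiles

scores = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15]   # hypothetical sample
q1, q2, q3 = quantiles(scores, n=4)          # quartiles
iqr = q3 - q1
print(q1, q3, iqr)                           # the middle half of the data spans q1..q3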

Simple random sample

*Most desired sampling method:* every sample of size "n" from the population has the exact same, equal chance of being selected for a study (not only does every sample of the specified size have an equal chance of being selected, but every individual of the population also has an equal chance of being selected); *we can assume independence among objects/subjects if the study uses a simple random sample* SOC 106: To select a random sample, we need the *sampling frame:* a list of all subjects in the population. The most common method for selecting a random sample is to 1) number the subjects in the sampling frame, 2) generate a set of these numbers randomly, and 3) sample the subjects whose numbers were generated. Using random numbers to select the sample ensures that each subject has an equal chance of selection.
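
A minimal sketch of the three-step procedure using Python's standard library (the frame of 500 numbered subjects and the sample size of 25 are made-up values):

import random

sampling_frame = [f"subject_{i}" for i in range(1, 501)]   # 1) numbered list of all subjects
chosen = random.sample(sampling_frame, k=25)               # 2) generate 25 random picks
print(chosen)                                              # 3) sample the subjects selected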

Ordinal level of measurement

*qualitative*: data can be arranged in some natural order, but numerical differences/distances between data values either cannot be determined or are meaningless. (e.g. course grades: A, B, C, D, F or evaluation rating: poor, acceptable, good) - NOT numerical data, but qualitative (cannot add, subtract, etc.) --> *categories with some order but differences are not meaningful* SOC 106: The scale is a set of categories, so ordinal variables are often analyzed using the same methods as nominal scales. But they are also similar to interval scales in that each level has a greater or smaller magnitude than another level. It is helpful to analyze ordinal scales by assigning numerical scores to categories --> this allows us to use more powerful statistical analysis methods (e.g. course grades A, B, C... are ordinal, but we treat them as interval when we assign numbers to the grades such as 4, 3, 2, 1, 0 to compute GPA).

Standard Deviation (of sample) = s

-Measure of variability, risk, or spread around the center --> the typical distance of an observation from the mean
-Square root of the sample variance: s = √( Σ(xi - x̅)² / (n - 1) )
-Displayed in the original units of x (taking the square root of the sample variance removes the squared units) -- depends on the units of measurement (difficult to compare measurements from different populations)
-Larger s means greater variability in the data (bigger spread of data around x̅)
-Based on the differences between each data value and the mean
-We divide by n - 1 to make s a little larger, because random samples typically don't include the extreme values
Soc 106: Properties of the standard deviation:
-s ≥ 0
-s = 0 only when all observations have the same value
-The greater the variability about the mean, the larger the value of s
-If data are re-scaled, the standard deviation is also rescaled
-Can be greatly affected by an outlier, especially for small data sets
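
A minimal Python sketch computing s exactly as in the formula above (the six data values are made up); it matches statistics.stdev, which also divides by n - 1:

from math import sqrt

x = [4, 8, 6, 5, 3, 7]                       # hypothetical sample
xbar = sum(x) / len(x)
squared_deviations = [(xi - xbar) ** 2 for xi in x]
s2 = sum(squared_deviations) / (len(x) - 1)  # sample variance
s = sqrt(s2)                                 # sample standard deviation, in the units of x
print(round(s, 3))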

How to create a box plot

1. Create a number line that includes the lowest and highest values
2. Capture the min and max by putting a dot above the number line (if they are outliers, plot them with an asterisk *)**
3. Draw a straight, vertical line above each quartile (Q1, Q2, and Q3)
4. Connect the tops and bottoms of the lines to make a box
5. Draw whiskers (horizontal lines) from the minimum value to Q1 and from Q3 to the maximum value
**You can identify outliers by first calculating the lower and upper limits: lower limit = Q1 - (1.5 * IQR), upper limit = Q3 + (1.5 * IQR), where IQR = Q3 - Q1
Tells us: where the middle half of the data lies (between Q1 and Q3) and the IQR (Q3 - Q1)
Used to judge the shape of the data from percentiles, without finding the mean and variance:
1. Look at the distance from the minimum value to the median (Q2)
2. Look at the distance from the median (Q2) to the maximum value
3. If one distance is longer, then you know the data are skewed in that direction
4. Look at the lengths of the upper and lower whiskers; whichever is longer emphasizes the skewness
Soc 106: side-by-side box plots are useful for comparing two distributions.
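
A minimal Python sketch of the outlier fences used when drawing the box plot (the data values are made up; statistics.quantiles' default method follows the (n+1) locator, which can differ slightly from quartiles found by hand as medians of halves):

from statistics import quantiles

data = [1, 3, 4, 5, 6, 7, 8, 9, 10, 25]
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print((min(data), q1, q2, q3, max(data)))   # 5-number summary
print(lower, upper, outliers)               # values outside the fences get an asterisk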

Steps to solve probability questions -- key words/phrases

1. Decide what kind of problem it is: "And" or "Or"? --> "And" means the events happen together and requires the multiplication rule; "Or" means either or both can happen and requires the addition rule
2. If it's an "and" problem, check for independence or dependence // If it's an "or" problem, check whether the events are disjoint or joint
3. Calculate the probability of each event using the proper equation:
-"And" requires the multiplication rule --> if independent, then simply P(A∩B) = P(A)P(B)
-"And" requires the multiplication rule --> if dependent, then you have to use the conditional probability rule: P(A∩B) = P(A)P(B|A) = P(B)P(A|B); P(B|A) = P(A∩B) / P(A) (notice that the denominator is always the event that has already happened)
-"Or" requires the addition rule
-If joint (can happen at the same time), P(A or B) = P(A∪B) = P(A) + P(B) - P(A∩B)
-If disjoint (cannot happen at the same time), P(A or B) = P(A∪B) = P(A) + P(B)
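
A tiny worked Python example of the four cases (all probability values here are made-up illustrations, not from the course):

p_a, p_b = 0.30, 0.50
p_b_given_a = 0.40

# "and" with independence: P(A and B) = P(A) * P(B)
p_and_independent = p_a * p_b                     # 0.15

# "and" with dependence: P(A and B) = P(A) * P(B|A)
p_and_dependent = p_a * p_b_given_a               # 0.12

# "or" with joint events: P(A or B) = P(A) + P(B) - P(A and B)
p_or_joint = p_a + p_b - p_and_independent        # 0.65

# "or" with disjoint events: P(A or B) = P(A) + P(B)
p_or_disjoint = p_a + p_b                         # 0.80
print(p_and_independent, p_and_dependent, p_or_joint, p_or_disjoint)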

Typical bimodal histogram

2 mounds or peaks Soc 106: more common with attitudinal variables when populations are polarized, with responses tending to be strongly in one direction or another.

Variable

A characteristic of an individual subject in a sample or population that can be measured or observed and can take different values (e.g. height, weight, age, etc.) Qualitative variables - categorical; Quantitative variables - anything measurable in numbers

Ogive

A line graph that depicts cumulative frequencies (class boundaries on the x-axis and cumulative frequencies on the y-axis; plot the cumulative frequency versus the upper class boundary); in contrast, a histogram plots class boundaries vs. frequency or relative frequency

Reliability (consistency)

A measure should also have reliability, being consistent in the sense that a subject will give the same response when asked again. Invalid or unreliable data-gathering instruments render statistical manipulations of the data meaningless.

Validity (accuracy)

A measure should have validity, describing what it is intended to measure and accurately reflecting the concept

Typical uniform or rectangular histogram

All bars are the same height (looks like a rectangle); mean = median = mode

Probability vs. Statistics

Although statistics and probability are closely related fields of mathematics, they are nevertheless separate fields. Probability is the medium through which statistical work is done. In fact, if it were not for probability theory, inferential statistics would not be possible. Probability allows us to make inferences about populations from a sample. Probability describes the likelihood that an event will occur (a number between 0 and 1; can be expressed as a fraction, decimal or percentage). It is the field of study that makes statements about what will occur when samples are drawn from a known population. Statistics is the field of study that describes how samples are to be obtained and how inferences are to be made about unknown populations.

Outlier (Soc 106)

An observation is an outlier if it falls more than 1.5(IQR) above the upper quartile or more than 1.5(IQR) below the lower quartile. This is a somewhat arbitrary criterion for an outlier, and it is better to consider an observation as a "potential outlier", rather than a definite outlier. By the empirical rule, for a bell-shaped distribution, it is very unusual for an observation to fall more than three SD from the mean. Thus an observation may be considered an outlier if it's more than 3 SD/z-scores from the mean.

Statistical experiment or statistical observation

Any random activity that results in a definite outcome (e.g. flipping a coin or rolling a die)

Stem-and-leaf display/plot

Can only be used with quantitative data; an example of a graph for exploratory data analysis (detects patterns and extreme values, requires very few assumptions). Used to rank-order and arrange data into groups; represents quantitative data by separating each value into two parts: the stem (such as the leftmost digit, e.g. the tens place, though you can choose the number of digits in the stem, such as the hundreds place) and the leaf (rightmost digit, e.g. the ones place); is good for showing trends in frequency (the lengths of the lines containing the leaves give the visual impression that a sideways histogram would present). You don't lose actual specific data values (unlike a histogram, where you can't list the actual data because it is already lost in intervals/boundaries); use a label to indicate the magnitude of the numbers displayed (e.g. 3 | 2 represents 32 lbs). Useful for quick portrayals of small data sets.

Disjoint events or mutually exclusive events

Cannot happen at the same time; they do not share any elements in common (no intersection). E.g. scoring a 90 and a 91 on the same exam; turning left or turning right. P(A∩B) = 0. If disjoint, use the addition rule for mutually exclusive events: P(A∪B) = P(A) + P(B). Two events CANNOT be both disjoint and independent. If they are disjoint, then they must be dependent.
-Why? Because if A and B are disjoint, then P(A∩B) = 0, thus P(A|B) = 0/P(B) = 0
-For A and B to be independent, we would need P(A|B) = P(A), but P(A|B) = 0 when they are disjoint, so independence would require P(A) = 0
-Also, if A and B are independent, having information about A does not tell us anything about B
-If A and B are disjoint, then knowing that A occurs tells us that B cannot occur (they cannot happen simultaneously)
-A and B must be dependent, since if one occurs, we know the other cannot

Sensitivity Analysis

Checking whether conclusions would differ in any significant way for other choices of the scores/values

Convenience sampling

Create a sample by using data from population members that are conveniently and readily available, can result in biased samples

Classical probability

Each event must be equally likely to occur P(A) = # of ways A can occur / # of total outcomes/possibilities E.g. what is the probability of getting heads with a fair coin? --> P(H) = 1/2 = 0.50 = 50% there is a difference between relative frequency probability and classical probability in this coin flipping example --> Law of large numbers

Cluster sampling

First, divide the demographic area/entire population into sections (often geographic), then randomly select sections (or clusters). Every member of the cluster is included in the sample. a method used extensively by government agencies and certain private research orgs; usually geographic (e.g. districts, counties or states)

Chebyshev's Theorem

For any set of data (population or sample) and for any constant k > 1, the proportion of data that must lie within k standard deviations on either side of the mean is at least 1 - (1/k²) -- applies to all distributions (not just normal distributions).
k = 2 --> 2 standard deviations from the mean: 1 - (1/2²) = 1 - 1/4 = 0.75 --> at least 75% of the data fall within two standard deviations of the mean (μ - 2σ, μ + 2σ, where μ = population mean and σ = population standard deviation)
k = 3: 1 - (1/3²) = 1 - 1/9 = 0.889 --> at least 88.9% of the data fall within three standard deviations of the mean (μ - 3σ, μ + 3σ)
k = 4: 1 - (1/4²) = 1 - 1/16 = 0.938 --> at least 93.8% of the data fall within four standard deviations of the mean (μ - 4σ, μ + 4σ)
This theorem provides the minimum percentage of data between specific numbers of standard deviations. If the distribution is mound-shaped, an even greater percentage of data will fall into the specified intervals.
Example: compute a 75% Chebyshev interval around the sample mean
-We know to use 2 standard deviations (because the problem says 75% and the theorem gives 75% for k = 2)
-We know x̅ = 36.3, s = 22.3 (from a previous example)
-x̅ - 2s = 36.3 - 2(22.3) = -8.3 and x̅ + 2s = 36.3 + 2(22.3) = 80.9
-At least 75% of the data fall in the interval (-8.3, 80.9) --> (0, 80.9), because you can't have negative quiz scores
CV = (s/x̅)⋅100 = (22.3/36.3)⋅100 = 61.4%
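
A minimal Python sketch reproducing the 75% interval from the example above (x̅ = 36.3 and s = 22.3 are taken from that example):

def chebyshev_bound(k):
    return 1 - 1 / k ** 2                          # minimum proportion within k standard deviations

xbar, s, k = 36.3, 22.3, 2
print(chebyshev_bound(k))                          # 0.75
print(round(xbar - k * s, 1), round(xbar + k * s, 1))  # (-8.3, 80.9); truncate at 0 if scores can't be negative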

Addition rule

Given two events A and B, the probability that A occurs, B occurs, or both occur is given by the general addition rule: P(A or B) = P(A∪B) = P(A) + P(B) - P(A∩B), where ∪ = "union" = "or". Why? Because if the events are joint (can happen simultaneously), you count the overlapping/intersecting area twice, so you have to subtract it once so you capture that area only once. If you have two disjoint or mutually exclusive events, then the addition rule is: P(A∪B) = P(A) + P(B) [you don't need to subtract the overlapping area because there is no intersection]. "Or" implies using the addition rule; then you must decide whether the events are disjoint.

Data file

Has a separate row of data for each subject (e.g. person #1, #2, #3, etc.) and a separate column for each characteristic (e.g. sex, race, age, income, etc.)

Collecting data with an observational study

In social research, it is rarely possible to conduct experiments. *Observational studies* are those in which the researcher measures subjects' responses on the variables of interest but has no experimental control over the subjects. It is sometimes difficult to compare groups within observational studies because some key variables that affect outcomes may not be measured. E.g. what affects standardized test scores? If we are just looking at race, we may not capture variables like parents' education or income. There is a strong chance the sample is not representative of the population. Some unmeasured variable could be responsible for patterns observed in the data. It is difficult to study cause and effect with observational studies or sample surveys.

Law of Large Numbers

In the long run, as the number of times an experiment is repeated (n increases), the relative frequency/probability of outcomes gets closer and closer to the actual theoretical value (classical probability value). The law of large numbers is the reason businesses such as health insurance, automobile insurance, and gambling casinos can exist and make a profit.

Range

Maximum - minimum -Soc 106: the simplest way to describe variability in data -heavily influenced by outliers -Range is greater with more variability

Coefficient of variation

Measure of spread of the data relative to the average (mean). Expresses the standard deviation as a percentage of the sample or population mean (does not have units). Allows you to compare the variability between two different samples or populations. CV = (s/x̅) * 100 (s = standard deviation of sample, x̅ = mean of the sample) CV = (σ/μ) * 100 (σ = standard deviation of population; μ = mean of the population) the numerator and denominator in the definition of CV have the same units, so CV itself has no units of measurement. A CV of 61.4% tells us that the standard deviation is 0.614 times the mean (61.4% of the mean). This is not very insightful on its own. Most of the time, we use the CV as a relative/scaled variability to make comparisons between datasets that are on different levels of magnitude (i.e. values around 100 vs. values around 100,000). NUT 258: the proportion or % of the variance in outcome that is explained by the explanatory variables included in the prediction model. Also known as the coefficient of determination or R-squared. It is a measure of relative variability explained by the independent variables. Generally, the higher the R2 value is, the better, which means your model fits the data better.
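
A minimal Python sketch of the CV formula (the quiz-score values are made up; statistics.stdev uses the sample standard deviation, dividing by n - 1):

from statistics import mean, stdev

quiz_scores = [36, 10, 60, 45, 25]          # hypothetical sample
cv = stdev(quiz_scores) / mean(quiz_scores) * 100
print(round(cv, 1), "%")                    # standard deviation as a % of the mean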

Descriptive statistics

Methods of organizing, picturing, and summarizing information from samples or populations (take your data/numbers and organize or summarize) SOC 106: summaries of data & relationships -- Graphs, tables, and numbers (e.g. averages, percentages); to boil down raw data to a single, useful value that can be interpreted

Inferential statistics

Methods of using information from a sample to draw conclusions or make predictions about the population as a whole (to make an inference is to reach a conclusion based on evidence and reasoning) A statistical inference is a conclusion about the value of a population parameter. We will do both estimation and testing.

Dependent events / dependence

Must take into account changes in the probability of one event caused by the occurrence of the other event (e.g. hours studying for an exam and an exam score)

Systematic Sampling

Number all members of the population sequentially (assumes elements of the population are arranged in some natural sequential order - e.g. a line of people getting tickets to a concert) - then *select a random starting point and then select every "k-th" element of the population in the sample* (e.g. every 5th person in line): easy to get data, but need to be aware of a repetitive or cyclic population
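
A minimal Python sketch (the population of 100 members and k = 5 are made-up values): pick a random starting point, then take every k-th member:

import random

population = list(range(1, 101))        # 100 members, already in sequential order
k = 5
start = random.randint(0, k - 1)        # random starting point within the first k members
sample = population[start::k]           # every 5th member from the start
print(sample)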

Independent events / independence

Occurrence or nonoccurrence of one event does not change the probability that the other event will occur. The two events are completely unrelated and don't affect each other. E.g. it's raining and it's Thursday -- neither of these outcomes affects the probability of the other. Multiplication rule for independent events --> P(A∩B) = P(A)P(B)

Multiplication rules

P(A and B) = P(A∩B), where ∩ = "and" or "intersection". Multiplication rule for independent events --> P(A∩B) = P(A)P(B). General multiplication rule --> P(A∩B) = P(A)P(B|A) = P(B)P(A|B). Note:
-The intersection itself is symmetric for any events: P(A∩B) = P(B∩A)
-Conditional probabilities are not symmetric: in general, P(A|B) ≠ P(B|A)
-For independent events, P(B|A) = P(B) and P(A|B) = P(A)
These rules help us compute the probability of events happening together when the sample space is too large for convenient reference or when it is not completely known. **When using the multiplication rule, ask yourself whether the events are independent or dependent.**

Experimental (relative frequency) probability

P(A) = # of times A occurred / # of times experiment repeated E.g. a person flipping a fair coin 20 times; heads appeared 14 times. According to this experiment, the probability of getting heads --> P(H) = 14/20 = 0.70 or 70%

Conditional probability rule

P(A, given B) = P(A|B), where | = "given": the probability that event A will occur, given that event B has already happened. From the multiplication rule for dependent events, we know that P(A∩B) = P(A)P(B|A) --> P(B|A) = P(A∩B)/P(A) (notice that the denominator is always the event that has already happened) --> P(A|B) = P(A∩B)/P(B). "Given" implies the conditional probability rule.
Example 1: Suppose the probability that a person likes SNL is 3/4, and half of people are male. What is the probability that a randomly selected person is male and likes SNL? These are independent events; "and" is the key word that means intersection, so use the multiplication rule: P(M) = 1/2, P(SNL) = 3/4, P(M∩SNL) = P(M) × P(SNL) = 1/2 × 3/4 = 3/8.
Example 2: Suppose you roll two fair dice. What is the probability that you roll snake eyes (1 and 1)? These are independent events: P(A) = 1/6 and P(B) = 1/6, so P(A∩B) = P(A)P(B) = 1/6 × 1/6 = 1/36.
Example 3: Suppose you pick two cards at random (without replacement*) from a well-shuffled deck. What is the probability that both cards are black? *Without replacement implies dependent events, so P(A∩B) = P(A)P(B|A). Total cards = 52, total black cards = 26, total red = 26. P(B1∩B2) = P(B1)P(B2|B1) = 26/52 × 25/51 = 650/2652 = 25/102 ≈ 24.5%.
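
A minimal Python sketch of Example 3, the multiplication rule for dependent events, using exact fractions:

from fractions import Fraction

p_b1 = Fraction(26, 52)            # first card black
p_b2_given_b1 = Fraction(25, 51)   # second card black, given the first was black (no replacement)
p_both_black = p_b1 * p_b2_given_b1
print(p_both_black, float(p_both_black))   # 25/102 ≈ 0.245, i.e. about 24.5%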

Laws of probability

Probability of an impossible event is 0
Probability of a certain event is 1
0 ≤ P(A) ≤ 1
P(A) + P(A^c) = 1
1 - P(A) = P(A^c)
1 - P(A^c) = P(A)

Practices for any graph

Provide a title, label the axes, and identify the units of measure; don't let artwork or skewed perspective cloud the clarity of the information displayed

Bar graph

Quantitative or qualitative data. Bars can be vertical or horizontal; they are uniform width and spacing; lengths of bars are used to show frequencies or percentage of occurrence of categories of categorical/qualitative data (the vertical scale represents frequencies or relative frequencies. the horizontal scale identifies the different categories of qualitative data). For quantitative data, the measurement itself can be displayed. The same measurement scale is used for the length of each bar; graph has title, labels for each bar, and vertical scale or actual value for the length of each bar. May or may not be separated with small gaps; ensure that the measurement scale is consistent or that a jump scale squiggle is used.

Boxplot Quartiles

Quartiles are special percentiles that divide the data into fourths. How to find quartiles (3 medians):
1. Order the data from smallest to largest
2. Find the median (this is Q2)
3. Q1 is the median of the lower half of the data
4. Q3 is the median of the upper half of the data
Q1 = first quartile, 25th percentile, P25 (25% of the data fall at or below, 75% at or above)
Q2 = second quartile, 50th percentile, P50 (median, 50% of the data fall above and below)
Q3 = third quartile, 75th percentile, P75 (75% of the data fall at or below, 25% at or above)
To find the location of a quartile, use (P/100)⋅(n+1): Q1 represents the 25th percentile, thus (25/100)⋅(n+1); Q3 represents the 75th percentile, thus (75/100)⋅(n+1) gives its location.
*Interquartile range (IQR) = Q3 - Q1 --> gives you the spread of the middle half of the data*. The IQR increases as the variability increases and is useful for comparing the variability of different groups. Not sensitive to outliers.
Outliers --> use the upper and lower limits (values inside the limits are not outliers; values above or below the limits are considered outliers): lower limit = Q1 - (1.5⋅IQR), upper limit = Q3 + (1.5⋅IQR)
5-number summary = min, Q1, Q2/median, Q3, max
Semi-interquartile range: (Q3 - Q1)/2
Midquartile: (Q3 + Q1)/2

Relative versus cumulative frequency

Relative frequency is the frequency expressed as percentage or proportion (number of observations in a category divided by the total number of observations; percentage is proportion * 100) Cumulative frequency is the sum of the frequencies for that category and all previous categories

Non sampling error

Result of poor sample design, sloppy data collection, faulty measuring instruments, biased questionnaires, etc.

Sampling bias

SOC 106: *Probability sampling methods* include simple random sampling, meaning that the probability any particular sample will be selected is known. Inferential statistical methods assume probability sampling. *Nonprobability sampling methods* are ones for which it is not possible to determine the probabilities of the possible samples. Inferences using such samples have unknown reliability and result in *sampling bias*. The most common nonprobability sampling method is *volunteer sampling*, for which subjects volunteer to be in the sample - the sample may poorly represent the population and yield misleading conclusions. Unfortunately, volunteer sampling is sometimes necessary. *Undercoverage* is when the sampling frame doesn't match the population (omitting population members from the sampling frame) - it lacks representation from some groups in the population.

Response bias

SOC 106: In a survey, the way a question is worded or asked can have a large impact on the results. Poorly worded or confusing questions result in *response bias.* Even the order of the questions can influence results. In interviews, responders can give answers they think the interviewer wants to hear, perhaps lying.

Collecting data with sample surveys

SOC 106: Many studies select a sample of people from a population and interview them to collect data. This method is called a *sample survey.* The interview could be personal, telephone, or self-administered questionnaire. A variety of problems can cause responses from a sample survey to tend to favor some parts of the population over others. Then results from the sample are not representative of the population. Still important to incorporate randomization (e.g. randomly select a sample for the survey).

Nonresponse bias

SOC 106: Some subjects who are supposed to be in the sample may refuse to participate, or it may not be possible to reach them. This results in the problem of nonresponse bias. If only half the intended sample were actually observed, we should worry about whether the half not observed differ from those observed in a way that causes biased results.

Probability (Soc 106)

SOC 106: With a random sample or randomized experiment, the probability an observation is a particular outcome is the proportion of times that outcome would occur in a very long sequence of observations. Refers to the long run because you need a large number of observations to accurately assess probability. -Expressed as a proportion or percentage -A number between 0 and 1 or 0% and 100%

Nominal level of measurement

STA 13: *qualitative:* characterized by data that consists of names, labels, or categories only. the data cannot be arranged in an ordering scheme (high to low) [eye color] --> *categories ONLY*

Population

STA 13: complete collection of all measurements or data that is being considered in a study SOC 106: total set of subjects of interest in a study. e.g. people, families, schools, cities, companies, etc. -- can be hypothetical or conceptual also

Qualitative (categorical)

STA 13: consists of names or labels that are not numbers representing counts or measurements (can sometimes be ordered) SOC 106: A variable is categorical when the measurement scale is a set of categories (e.g. single, married, divorced, widowed; or yes/no). Distinct categories vary in quality, not in numerical magnitude. You cannot find averages for qualitative data. *Categorical variables are discrete*

Quantitative

STA 13: consists of numbers representing counts or measurements SOC 106: When the measurement scale has numerical values and values represent different magnitudes of the variable (e.g. income, number of siblings, age, number of years of education completed, etc.). *Quantitative variables can be either discrete or continuous, and in practice, if they can take lots of values, they are treated as continuous*

Interval level of measurement

STA 13: quantitative: data can be arranged in order, and differences between data values can be found and are meaningful, but there is no natural zero starting point (a point where none of the quantity is present): 0 degrees of temperature still exists, so there is no natural zero; likewise time, years, IQ, etc. Quantitative data (numbers, so you can add and subtract, but no ratios; can be negative or decimals) --> *differences are meaningful, but no natural zero, so no ratios* SOC 106: Interval scales are used interchangeably with ratio scales and have a specific numerical distance or interval between each pair of levels (e.g. income). We can compare differences, and there are meaningful zeroes.

Continuous Data

STA 13: result from infinitely many possible quantitative values, where the collection of values is not countable; includes the entire real number line (e.g. height, weight, temperature, amount of milk a cow produces, etc.). Can be negative. SOC 106: Quantitative variables can be discrete or continuous (e.g. age is continuous but # of siblings is discrete). Stat methods for continuous variables are mainly used for quantitative variables that can take lots of values (regardless of whether they are theoretically continuous or discrete) - e.g. age, income and IQ are often treated as continuous in statistical analysis. *Smooth curves are used to represent population distributions for continuous variables*

Discrete Data

STA 13: result when the data values are quantitative and the number of values is finite or "countable" (e.g. money, integers, number of eggs a hen lays). Can be negative and can also be decimals. A variable can take these values but it cannot take every value in between. SOC 106: All categorical variables (nominal or ordinal) are discrete, having a finite set of categories. Stat methods for discrete variables are mainly used for qualitative variables that take relatively few values (e.g. # of times a person has been married)

Statistics

STA 13: the science of planning studies and experiments; obtaining data; and then organizing, summarizing, presenting, analyzing and interpreting those data and then drawing conclusions based on them. SOC 106: Consists of a body of methods for obtaining and analyzing data: 1. Design: planning how to gather data for research studies 2. Description: summarizing the data (e.g. using graphs and tables --> descriptive statistics) 3. Inference: Making predictions based on the data Statistical analysis: Description and Inference - ways of analyzing data

Sample variance = S^2

s² = Σ(xi - x̅)² / (n - 1)
-Roughly the average of the squared deviations (x - x̅)² (the sum of squares)
-We divide by n - 1 rather than n to reduce the error from estimating the mean
-Because you are left with the squared units of x (variable units), the units of the sample variance may not make sense (e.g. GPA²)
-Based on the differences between each data value and the mean

When to use the mean vs. median

Soc 106: The median is usually preferred if a distribution is highly skewed, because the median better represents what is typical. The mean is usually preferred if the distribution is close to symmetric or only mildly skewed, or if the data are discrete with few distinct values, because the mean uses the numerical values of all the observations.

Complement of Event A

The complement of event A, denoted by A^c, consists of all outcomes not in A. P(A) + P(A^c) = 1. P(event A does not occur) --> P(A^c) = *1 - P(A)*: the probability an event *does not* occur

Mode

The easiest average to compute; the mode of a data set is the value that occurs most frequently (useful to order or sort the data before scanning for the mode); not every data set has a mode: -no mode, unimodal, bimodal, or multi-modal (2+ modes) - "the most likely outcome" Categorical or qualitative data can ONLY have a mode (not mean or median, because no numbers) Soc 106: mode is most commonly used with highly discrete variables, such as categorical data. Though the mode is appropriate for all types of data.

Sampling error

The extent to which sampling measurements don't correspond to the population measurements (caused by the fact that the sample does not perfectly represent the population) Soc 106: The error that results from estimating μ by x̅, because we only sampled part of the population. *Sampling error tends to decrease as the sample size, n, increases.*

Randomization

The mechanism for achieving good sample representation

Fundamental Counting Rule or the Multiplication Rule of Counting

The multiplication rule of counting indicates that the product of the number of outcomes of each event gives the total number of possible outcomes for the series of all events. Number of ways that two events can occur, given that the first event can occur m ways and the second event can occur n ways: the events occur together in a total of m*n ways. For multiple events, the product is n1 x n2 x ... x nm. Ex: for a two-character code consisting of a letter followed by a digit, the number of different possible codes is 26 x 10 = 260. Ex 2: Let's suppose you want to order a pizza. You have the choice between 2 different types of crusts, 4 different toppings, and 3 sizes. How many different combinations of pizza can you create? 2x4x3 = 24 different pizza combinations
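
A minimal Python sketch of the pizza example; the specific crust/topping/size names are made up, only the counts (2, 4, 3) come from the example:

from itertools import product

crusts = ["thin", "thick"]
toppings = ["pepperoni", "mushroom", "onion", "pineapple"]
sizes = ["small", "medium", "large"]

combos = list(product(crusts, toppings, sizes))   # every crust-topping-size combination
print(len(combos))                                # 24, matching 2 * 4 * 3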

Collecting data with an experiment

The purpose of most experiments is to compare responses of subjects on some outcome measure, under different conditions. Those conditions are level of a variable that can influence the outcome. The scientist has the experimental control of being able to assign subjects to the conditions. For instance, the conditions might be different drugs for treating some illness. The conditions are called *treatments*. To conduct the experiment, the researcher needs a plan for assigning subjects to the treatments. These plans are called *experimental designs*. Good experimental designs use randomization to determine which treatment a subject receives.

Sample space

The set of all simple events of an experiment - e.g. all rolls 1-6 combined. The sum of the probabilities of all simple events in a sample space must equal 1.

Measurement scale

The values the variable can take form the *measurement scale* The measurement scale for gender consists of two labels: female and male (or transgender) The measurement scale for number of siblings is 0, 1, 2, 3... Valid statistical methods for a variable depend on its measurement scale (i.e., we treat income different than responses to a yes/no question)

Mound shape

A unimodal, symmetric distribution (looks like the normal distribution)

Trimmed mean

Used if you have extreme values in your data; a measure of center more resistant than the mean, but still sensitive to specific data values; the mean of the data values left after "trimming" a specified percentage (usually 5%) of the smallest and largest data values from the data set.
1. Order the data from smallest to largest
2. Take your sample size (count the number of data points = n), calculate 5% of n, and remove that many values from both the bottom and top ends of the data (if the 5% calculation doesn't produce a whole number, round to the nearest integer)
3. Then calculate the average
E.g. if n = 11, then 5% of 11 is 0.55, so round up to 1; take one data value off the bottom and one off the top, then calculate the mean.
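
A minimal Python sketch following the steps above (the eleven data values are made up; note Python's round uses banker's rounding at an exact .5, which may differ from rounding by hand):

def trimmed_mean(data, trim=0.05):
    xs = sorted(data)
    k = round(trim * len(xs))                # number of values to drop from each end
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

quiz = [2, 55, 58, 60, 61, 63, 65, 66, 70, 72, 99]   # n = 11, so k = 1
print(trimmed_mean(quiz))                            # drops 2 and 99, then averages the rest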

Weighted average/mean

Used when there are specific weights for certain data points. More resistant to outliers than traditional mean, but still sensitive to specific data values Weighted average x̅ = Σ xw / Σ w, where x is the data value (e.g. point values) and w is the weight assigned to that data value (e.g. credits per class). The sum is taken over all data values (e.g. all credits taken for GPA). Multiply each score by its weight and add the results together, then divide by the sum of all the weights. For example, if you earn an 83 on a midterm (worth 40% of your grade) and a 95 on your final (worth 60% of your grade) then 83⋅0.4 =33.2 and 95⋅0.6 = 57; 33.2 + 57 = 90.2 or an A
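
A minimal Python sketch of the weighted average formula, using the midterm/final example above:

scores = [83, 95]       # midterm, final
weights = [0.4, 0.6]    # 40% and 60% of the grade

weighted_mean = sum(x * w for x, w in zip(scores, weights)) / sum(weights)
print(round(weighted_mean, 1))   # 90.2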

Multi-stage sampling

Uses all different sampling techniques (e.g. stratified or other several stages) to create successively smaller groups at each stage. The final sample consists of clusters (e.g. government Current Population Survey)

Midrange

[(max value) + (min value)] /2 1. very sensitive to extreme values 2. easy to compute 3. not the median

Time-series graph

a line graph that shows how data change over time. Presents time-series data, which are quantitative data (measurements of the same variable for the same subject taken at regular intervals over a period of time; e.g. monthly or yearly); shows how data changes over time by showing data measurements in chronological order. Time is on the horizontal scale and the variable being measured is on the vertical scale. Data points plotted in order of occurrence at regular/equal intervals over a period of time and are connected by line segments. Best if the units of time are consistent in a graph (e.g. measurements taken every day should not be mixed with weeks on the same graph)

Scatterplot

a plot of paired (x,y) quantitative data with a horizontal x-axis and a vertical y-axis Soc 106: the values of the two variables for any particular observation form a point relative to these axes. The response variable is plotted on the y-axis and the explanatory variable is plotted on the x-axis.

Sample

a subcollection of members selected from a population

Events

any collection of one or more results or outcomes of a statistical experiment or observation (e.g. rolling an even number, or a number less than 5, on a die)

Pareto chart

bar graph for *qualitative/categorical data* that identify the frequency of events or categories in *decreasing order of frequency* of occurrence (the bar height represents frequency of an event and the bars are arranged in descending order according to frequencies. the bars decrease in height from left to right)

Census

collection of data from every member of the population SOC 106: When data exist for an entire population, such as in a census, it's possible to find actual values of the parameters of interest. Then there is no need to use inferential statistical methods

Data

collection of observations, such as measurements, genders, survey responses on opinion, political party affiliation, education level, income, etc. Looking at data the right way helps us learn how such characteristics are related. We can then answer questions such as, "Do people who attend church more often tend to be more politically conservative?"

Frequency distribution (frequency table)

divides data into intervals or classes (classes are constructed so that each data value falls into exactly one class - *values are mutually exclusive*); shows how data are partitioned among several categories (or classes) by listing the categories along with the number (frequency) of data values in each of them Includes: -Class limits (intervals/classes), usually of equal width -Class boundaries (upper and lower limits, help to create the histogram) -Tally and frequency (frequency is just the number given by the sum of the tally marks) -Class midpoints (directly in the middle of a class) Class width = (max value - min value) / # of classes Can be used for a population and a sample (sample distribution vs. sample data distribution)

Dotplot

example of a graph for exploratory data analysis (detects patterns and extreme values, requires very few assumptions): a graph in which each data value is plotted as a point (or dot) along a horizontal scale of values

Skewed to the left

longer left tail; mean < median < mode; more data to the right

Skewed to the right

longer right tail; mode < median < mean; more data to the left

Percentiles

measure of location/relative position, denoted P1, P2, ..., which divide a set of data into 100 groups with about 1% of the values in each group. Used for distributions that are heavily skewed or bimodal. The pth percentile of a distribution is a value such that p% of the data fall at or below it and (100 - p)% of the data fall at or above it -- for example, if p = 60 (the 60th percentile), then about 60% of the data values fall below it and 40% above. To calculate: order the data and find the position; position/location of a data value = (p/100)⋅(n+1). Use traditional rounding rules.

Median

measure of the center of the ordered data (the value in the middle of the data; the central value) --> uses the position rather than the specific value of each data entry (an example of a percentile)
-The median does not change by large amounts when we include just a few extreme values (the median is resistant to outliers or extreme values; a "resistant measure") --> the median is often used as the average for house prices
-The median does not use every data value and *is not sensitive to outliers* or to the specific size of a data value (it is insensitive to the distances of the observations from the middle)
-Can only be used for quantitative variables
*-The median is usually more appropriate than the mean when the distribution is highly skewed*
-There are an equal number of data values in the ordered distribution above and below it (50% above and 50% below, aka the 50th percentile)
1. Order the data from smallest to largest; calculate 0.5(n+1) to find the location (in general, the location of a percentile is (P/100)⋅(n+1); e.g. Q3, the 75th percentile, is located at (75/100)⋅(n+1))
2. For an odd number of data values in the distribution, the median is the middle data value (if the location is a whole number, the median is the value in that position)
3. For an even number of data values in the distribution, the median is the sum of the middle two values divided by two (the location will be a decimal; the median is the average of the two middle values)
-Soc 106: For discrete data that take relatively few values, quite different data patterns can have the same median = a weakness. Generally, for highly discrete data (especially *binary data*, which can take only two values, e.g. 0 and 1), the mean is more informative than the median.
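
A minimal Python sketch of the odd/even rule (the data values are made up):

def median(data):
    xs = sorted(data)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 == 1 else (xs[mid - 1] + xs[mid]) / 2

print(median([7, 1, 5, 3, 9]))       # odd n: the middle value, 5
print(median([7, 1, 5, 3, 9, 11]))   # even n: average of the two middle values, (5 + 7) / 2 = 6.0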

Variance

measure of variation --> measure of the distance/spread around the mean or expected value (either the sample mean or population mean) square of the standard deviation (unbiased estimator) Smaller variance is better because it means that the data is more clustered around an expected value (mean)

Parameter

numerical measurement or summary describing some characteristic of a population Soc 106: usually greek letters (e.g. μ = population mean, σ = population standard deviation) μ and σ are constants because they refer to just one particular group of observations (the entire population)

Statistic

numerical measurement or summary describing some characteristic of a sample Soc 106: usually roman letters (e.g. x̅ = sample mean, s = sample standard deviation)

simple event

one particular outcome of a statistical experiment (cannot be further broken down into simpler components) -- e.g. rolling a 1 or 2 on a die

Pie Chart

or circle graph; a graph that depicts qualitative/categorical data as slices of a circle, in which the size of each slice is proportional to the frequency count for the category. Works best for less than 10 slices and the total quantity or 100% is represented by the entire circle.

Individual, cases or observations

people or objects included in the study

Ratio level of measurement

quantitative: data can be arranged in order; differences, averages, and ratios can be found and are meaningful; and there is a natural zero starting point (where zero indicates that none of the quantity is present) [heights, weight, length, cost] --> *differences and natural zero, can take ratios* Can take decimals like 5.5 feet for height, but usually only take positive values because it has a meaningful zero. Most of the time it does not make sense for things to be below zero like -5.5 feet.

Range rule of thumb

s ≈ range/4, equivalently Xmax - Xmin ≈ 4s (or 4σ for a population)

Mean

the average (uses the exact value of each entry rather than the position; sometimes called the arithmetic mean) - can only be used for quantitative variables. The mean is the sum of the observations divided by the total number of observations. "The center of gravity." 1. Sample means drawn from the same population tend to vary less than other measures of center 2. Uses every data value 3. One extreme value (outlier) can drastically change the mean (not a resistant measure of center; we can make the mean as large as we want by changing the size of only one data value). The mean is very sensitive to outliers. Σx is read "the sum of all given x values." The mean of a sample distribution of x values is denoted by x̅ (read "x bar"). If your data comprise the entire population, we use the symbol μ (lowercase Greek letter mu, pronounced "mew") to represent the mean. n = number of data values in the sample; N = number of data values in the population. *Soc 106: For skewed distributions, the mean lies toward the direction of the skew (longer tail) relative to the median*

Cumulative frequency distribution

the frequency of each class is the sum of the frequencies for that class and all previous classes

Sampling frame

the list of individuals from which a sample is actually selected (ideally the entire population) *Undercoverage* is when the sample frame doesn't match the population (omitting population members from the sample frame) - lacks representation from some groups in the population. A sample is never a perfect representation of a population

Frequency polygon

uses line segments connected to points located directly above midpoint values. a frequency polygon is very similar to a histogram

Relative frequency polygon

uses relative frequencies (proportions or percentages) for vertical scale.

Sum of squares

Σ(xi - x̅)² -Used to calculate the standard deviation -Deviations (xi - x̅) are squared to make the differences non-negative so that positive and negative deviations don't cancel each other out (e.g. 2 - 3 = -1 and (-1)² = 1)

