Statistics
Sampling Bias
Created when a sample is collected from a population and some members of the population are not as likely to be chosen as others. Can cause incorrect conclusions drawn about the population that is being studied.
Simple Random Sample (SRS)
Every member of the population has an equal chance of being chosen. 1. Give each member of the population a number. 2. Use a random number generator to select a set of labels. 3. These randomly selected labels identify the members of your sample - use them to find out which members were sampled
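The three steps above can be sketched in Python. This is a minimal illustration, not a prescribed method: the function name `simple_random_sample`, the `seed` argument, and the 20-member population are all hypothetical.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw an SRS: label the members, pick labels at random, look the members up."""
    rng = random.Random(seed)                  # seed is only for reproducibility
    labels = range(len(population))            # step 1: give each member a number
    chosen = rng.sample(labels, sample_size)   # step 2: random labels, no repeats
    return [population[i] for i in chosen]     # step 3: labels identify the sample

# Hypothetical population of 20 members.
members = [f"member_{i}" for i in range(1, 21)]
sample = simple_random_sample(members, 5, seed=1)
print(sample)
```

Because `random.sample` draws without replacement, no member can appear twice, and every member is equally likely to be chosen.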
An Example
Find the IQR of the following data points: 10, 12, 15, 21, 26, 29, 31. • i1 = (1/4)(7 + 1) = 2, so Q1 = 12. • i3 = (3/4)(7 + 1) = 6, so Q3 = 29. • IQR = Q3 − Q1 = 29 − 12 = 17.
Calculating Standard Deviation
For a sample set of data x1, x2, ..., xn: 1. Calculate the sample mean x̄. 2. For each data point, calculate the deviation (xi − x̄). 3. For each data point, square the deviation: (xi − x̄)^2. 4. Add them up: (x1 − x̄)^2 + ... + (xn − x̄)^2. 5. Divide the number from step 4 by (n − 1). 6. Take the square root of the number from step 5. This calculation is done by computer for large data sets, but you need to know how it works.
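The six steps can be traced one at a time in code. The sketch below is only an illustration: the function name is hypothetical, and the data values are borrowed from the IQR example card purely as sample input.

```python
import math
import statistics

def sample_standard_deviation(data):
    """Follow the six steps for the sample standard deviation."""
    n = len(data)
    mean = sum(data) / n                       # step 1: sample mean
    deviations = [x - mean for x in data]      # step 2: deviation of each point
    squared = [d ** 2 for d in deviations]     # step 3: squared deviations
    total = sum(squared)                       # step 4: add them up
    variance = total / (n - 1)                 # step 5: divide by n - 1
    return math.sqrt(variance)                 # step 6: square root

values = [10, 12, 15, 21, 26, 29, 31]          # illustrative input only
s = sample_standard_deviation(values)
print(s, statistics.stdev(values))             # the two results should agree
```

The standard library's `statistics.stdev` also divides by n − 1, so it serves as a check on the hand-written version.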
Which Graph to Use?
It is a good idea to look at a variety of graphs to see which is the most helpful in displaying the data. We might make different choices of what we think is the best graph depending on the data and the context. Our choice also depends on what we are using the data for.
Frequency Polygons
Looks like a line graph but uses intervals to display ranges of large amounts of data
Lurking Variables
Lurking Variable: A variable or factor that has an important effect on the relationship among the variables in a study but is not one of the explanatory variables studied (not included). Confounding: Two variables are confounded when their effects on a response variable cannot be distinguished from each other. The confounded variables may be either explanatory variables or lurking variables (included).
Measures of the Spread of the Data
Standard Deviation: The standard deviation is a number that measures how far data values are from their mean. Uses for the standard deviation • The standard deviation provides a measure of the overall variation in a data set • The standard deviation can be used to determine whether a data value is close to or far from the mean.
What is statistics?
Statistics is the science of collecting, classifying, presenting, and interpreting data in order to make decisions
Example: Effectiveness of Vitamin E
Suppose we wish to investigate the effectiveness of vitamin E in preventing disease. We recruit a group of subjects and ask them if they regularly take vitamin E. We notice that the subjects who take vitamin E exhibit better health on average than those who do not. Does this prove that vitamin E is effective in disease prevention? It does not. There are many differences between the two groups compared in addition to vitamin E consumption. People who take vitamin E regularly often take other steps to improve their health: exercise, diet, other vitamin supplements, choosing not to smoke. Any one of these factors could be influencing health. This study does not prove that vitamin E is the key to disease prevention.
Skewness and the Mean, Median, and Mode
Symmetrical Distribution A distribution is symmetrical if a vertical line can be drawn at some point in the histogram such that the shape to the left and the right of the vertical line are mirror images of each other. Consider this dataset: 4; 5; 6; 6; 6; 7; 7; 7; 7; 7; 7; 8; 8; 8; 9; 10 The mean, median, and mode are each 7 for these data. In a perfectly symmetrical distribution, the mean and the median are the same
Measures of the Center of the Data
The "center" of a data set is also a way of describing location. The two most widely used measures of the "center" of the data are the mean (average) and the median
Example of variables
The number of shoes you own: quantitative, discrete. The type of car you drive: categorical. The distance from your home to the nearest grocery store: quantitative, continuous. The number of classes you take per school year: quantitative, discrete. Type of calculator you use: categorical. Weights of sumo wrestlers: continuous. Number of correct answers on a quiz: discrete. IQ score: discrete.
statistic
A number that represents the property of interest in the sample A statistic is often used to guess or estimate the actual value of the unknown parameter of interest
Example: Sleep Deprivation & Driving Ability
A recent study was conducted to know how sleep deprivation affects the ability to drive. Towards that, 19 professional drivers were selected and each driver participated in two experimental sessions: one after normal sleep and one after 27 hours of total sleep deprivation. The treatments were assigned in random order. In each session, performance was measured on a variety of tasks including a driving simulation. • Response Variable: Driving performances • Explanatory variable: Amount of sleep • Treatments: Normal sleep and 27 hours of sleep deprivation • Experimental Units: 19 Drivers participating in this study
descriptive statistics
numerical and graphical ways to describe your data A graph is a tool that helps you learn about the shape or distribution of a sample or population statisticians often graph data first to get a picture of the data, then more formal tools are applied
descriptive statistics
organizing and summarizing data
Which one to use?
The median is generally a better measure of the center when there are outliers because it is not affected by the precise numerical values of the outliers.
Median
The median is the 50th percentile. Using the percentile index i = (k/100)(n + 1) with k = 50, the median M sits at position i = (50/100)(n + 1) = (n + 1)/2 in the ordered data.
Mode
The mode is the most frequent value. There can be more than one mode in a data set as long as those values have the same frequency and that frequency is the highest. A data set with two modes is called bimodal.
Comparing Values from Different Data Sets: Z-scores
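Use the standard deviation to compare values from different data sets. Write x = x̄ + z · s_x for a sample, or x = µ + z · σ for a population, where z is the number of standard deviations that x lies from the mean.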
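Solving the relationship above for z gives z = (x − mean) / standard deviation. A minimal sketch follows; the function names and the numbers (a score of 85 with mean 70 and standard deviation 10, and so on) are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def value_from_z(z, mean, sd):
    """Invert the z-score: x = mean + z * sd."""
    return mean + z * sd

# Hypothetical comparison of two scores from different data sets.
z_a = z_score(85, mean=70, sd=10)   # 1.5 standard deviations above its mean
z_b = z_score(60, mean=50, sd=8)    # 1.25 standard deviations above its mean
print(z_a, z_b)
```

Here the first score is further above its own mean in standard-deviation units, even though the two raw scores are on different scales.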
Placebo Effect & Control Group
When participation in a study prompts a physical response from a participant (also referred to as the power of suggestion), the experimenter must take further steps to isolate the effects of the explanatory variable from the power of suggestion. To counter the power of suggestion, researchers set aside one treatment group as a control group. This group is given a placebo treatment, a treatment that cannot influence the response variable. The control group helps researchers balance the effects of being in an experiment with the effects of the active treatments.
Scatterplot
a graphical depiction of the relationship between two variables - High values of one variable occurring with high values of the other variable or low values of one variable occurring with low values of the other variable -High values of one variable occurring with
distribution
a listing or function showing all the possible values of the data and how often they occur
parameter
a number that is used to represent a specific characteristic or feature of the population
correlation coefficient
a statistical index of the relationship between two things (from -1 to +1). Use the coefficient as an indicator of the strength of the relationship between the x and y variables; it also provides the direction.
Decimal Example
k = 73, n = 100, so we need to look at i = (73/100)(100 + 1) = 73.73. Since i is a decimal, we need to average the 73rd and 74th data points, both of which are 70 (x73 = 70, x74 = 70), so the 73rd percentile is 70.
Formula for Finding the Quartiles
k = the kth quartile, i = the index (ranking or position of a data value), n = the total number of data. 1. Order the data from smallest to largest. 2. Calculate i = (k/4)(n + 1). 3. Use i to find the correct data point. 4. If i is a whole number, find the ith data value in the ordered list; that's the quartile. 5. If i is a decimal, we will take the two whole numbers closest to i and average the data values at those positions.
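The quartile rule can be sketched as follows and checked against the IQR example card (10, 12, 15, 21, 26, 29, 31). The function name `quartile` and the choice to raise an error when the index falls outside the data are assumptions of this sketch; the notes do not say what to do in that case.

```python
def quartile(data, k):
    """kth quartile (k = 1, 2, 3) using the index i = (k/4)(n + 1)."""
    ordered = sorted(data)                        # step 1: order the data
    n = len(ordered)
    whole, remainder = divmod(k * (n + 1), 4)     # step 2: i = whole + remainder/4
    upper = whole if remainder == 0 else whole + 1
    if whole < 1 or upper > n:                    # assumption: reject out-of-range i
        raise ValueError("index falls outside the data")
    if remainder == 0:                            # step 4: i is a whole number
        return ordered[whole - 1]
    # step 5: i is a decimal, so average the two neighbouring data values
    return (ordered[whole - 1] + ordered[upper - 1]) / 2

data = [10, 12, 15, 21, 26, 29, 31]
q1, q3 = quartile(data, 1), quartile(data, 3)
print(q1, q3, q3 - q1)
```

Integer arithmetic with `divmod` avoids floating-point surprises when deciding whether i is a whole number.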
A Formula for Finding the kth Percentile
k = the kth percentile, i = the index (ranking or position of a data value), n = the total number of data. 1. Order the data from smallest to largest. 2. Calculate i = (k/100)(n + 1). 3. Use i to find the correct data point.
probability
mathematical study of uncertainty, the foundation of statistics. The goal of statistics is not to perform numerous calculations using formulae, but to gain insight of your data
left-skewed distribution
mean < median, and both are less than the mode. If the histogram has a longer left tail than right tail, it is left skewed.
Right-Skewed Distribution
mean > median, and both are greater than the mode. If the histogram has a longer right tail than left tail, it is right skewed.
Steps for constructing Box Plot
1. Compute quartiles: Q1, M = Q2, Q3 2. Compute IQR 3. Draw the line from the minimum data point to the maximum data point 4. Draw a box from Q1 to Q3 5. Draw a line in the box at Median M 6. Compute the outlier cutoffs: L = Q1 − 1.5 × IQR and U = Q3 + 1.5 × IQR 7. Whisker 1 is the first data point in the data larger than L 8. Whisker 2 is the last data point in the data smaller than U 9. All data in the set not between L and U is marked with an X to indicate it is a potential outlier
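The numerical parts of these steps (everything except the drawing) can be sketched as below. The helper names and the data set are hypothetical. The quartiles use the index rule i = (k/4)(n + 1) with averaging when i is not whole, which is an assumption about which quartile convention is intended.

```python
def position_value(ordered, k, parts):
    """Value at index i = (k/parts)(n + 1); average the neighbours if i is not whole."""
    n = len(ordered)
    whole, remainder = divmod(k * (n + 1), parts)
    upper = whole if remainder == 0 else whole + 1
    if whole < 1 or upper > n:
        raise ValueError("index falls outside the data")
    return (ordered[whole - 1] + ordered[upper - 1]) / 2

def box_plot_summary(data):
    ordered = sorted(data)
    q1, median, q3 = (position_value(ordered, k, 4) for k in (1, 2, 3))  # step 1
    iqr = q3 - q1                                                        # step 2
    lower_cutoff = q1 - 1.5 * iqr                                        # step 6: L
    upper_cutoff = q3 + 1.5 * iqr                                        # step 6: U
    inside = [x for x in ordered if lower_cutoff < x < upper_cutoff]
    outliers = [x for x in ordered if not (lower_cutoff < x < upper_cutoff)]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "cutoffs": (lower_cutoff, upper_cutoff),
        "whiskers": (inside[0], inside[-1]),   # steps 7 and 8
        "outliers": outliers,                  # step 9: mark these with an X
    }

# Hypothetical data with one unusually large value.
summary = box_plot_summary([10, 12, 15, 21, 26, 29, 31, 60])
print(summary)
```

With this data the upper cutoff is 54.75, so 60 is flagged as a potential outlier and the upper whisker stops at 31.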
The role of a statistician
1. Design studies 2. Analyze data 3. Translate data into knowledge and understanding of the world around us. These processes always take into account the uncertainties (errors occurring by chance or at random, the "random errors" present in any real-life measurement mechanism) as well as systematic errors that may be introduced by the design of the underlying experiment (experimental design).
Constructing a Histogram
1. Decide how many bins to use. 2. Calculate the width of each bin: bin width = (ending point − starting point) / number of bins. 3. Calculate the bin boundaries (intervals): boundary = starting point + K × bin width, for K = 0, 1, ..., number of bins. 4. Calculate the frequency of each bin. 5. Draw the histogram.
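A rough sketch of steps 1 to 4, with a text rendering standing in for step 5. The data are the hours-worked responses from the frequency table card. The starting point (1.5), ending point (7.5), and three bins are hypothetical choices, and the sketch assumes a value on a boundary belongs to the bin on its right (except the ending point, which goes in the last bin).

```python
def histogram_bins(data, num_bins, start, end):
    """Return the bin boundaries and the frequency of each bin."""
    width = (end - start) / num_bins                                 # step 2
    boundaries = [start + k * width for k in range(num_bins + 1)]    # step 3
    counts = [0] * num_bins
    for x in data:                                                   # step 4
        if not start <= x <= end:
            raise ValueError(f"{x} lies outside [{start}, {end}]")
        # Assumption: left-inclusive bins; the ending point joins the last bin.
        index = min(int((x - start) / width), num_bins - 1)
        counts[index] += 1
    return boundaries, counts

hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]
boundaries, counts = histogram_bins(hours, num_bins=3, start=1.5, end=7.5)

# Step 5, as a text sketch: one '#' per observation in each bin.
for left, right, count in zip(boundaries, boundaries[1:], counts):
    print(f"{left:>4} to {right:<4} {'#' * count}")
```

Choosing boundaries at half-values, such as 1.5, keeps whole-number data off the bin edges, so the boundary convention never comes into play.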
Example: Hardness of Aluminium Alloy
A metallurgical engineer is interested in studying the effect of two different hardening processes, oil quenching and saltwater quenching, on an aluminum alloy. The experimenter's objective is to determine which quenching solution produces the maximum hardness for this particular alloy. He decides to subject a number of alloy specimens or test coupons to each quenching medium and measures the hardness of the specimens after quenching. The average hardness of the specimens treated in each quenching solution will be used to determine which solution is best. • Response Variable: Hardness of aluminium alloy • Explanatory Variable: Quenching solution • Treatments: Oil quenching and saltwater quenching • Experimental Units: Alloy specimens or test coupons
Example: Sleep Deprivation & Driving Ability
A recent study was conducted to know how sleep deprivation affects the ability to drive. Towards that, 19 professional drivers were selected and each driver participated in two experimental sessions: one after normal sleep and one after 27 hours of total sleep deprivation. The treatments were assigned in random order. In each session, performance was measured on a variety of tasks including a driving simulation. Use key terms to describe the design of this experiment. • Response Variable: Driving performances • Explanatory variable: Amount of sleep • Treatments: Normal sleep and 27 hours of sleep deprivation • Experimental Units: 19 professional drivers • Lurking variables: None - all drivers participated in both treatments • Random assignment: Treatments were assigned in random order; this eliminated the effect of any "learning" that may take place during the first experimental session • Control/Placebo: Completing the experimental session under normal sleep conditions • Blinding: Researchers evaluating subjects' performance must not know which treatment is being applied at the time
Example: SAT Score
A researcher analyzes the results of the SAT (Scholastic Aptitude Test) over a five-year period and finds that male students on average score higher on the math section, and female students on average score higher on the verbal section. She concludes that these observed differences in test performance are due to genetic factors. Explain how lurking variables could offer an alternative explanation for the observed differences in test scores. There are many potential lurking variables that could influence the observed differences in test scores. • Perhaps the boys, on average, have taken more math courses than the girls, and the girls have taken more English classes than the boys. • Perhaps the boys have been encouraged by their families and teachers to prepare for a career in math and science, and thus have put more effort into studying math, while the girls have been encouraged to prepare for fields like communication and psychology that are more focused on language use. A study design would have to control for these and other potential lurking variables (anything that could explain the observed difference in test scores, other than the genetic explanation) in order to draw a scientifically sound conclusion
Example: Birth Order
A researcher wants to study the effects of birth order on personality. Explain why this study could not be conducted as a randomized experiment. What is the main problem in a study that cannot be designed as a randomized experiment? The explanatory variable is birth order. You cannot randomly assign a person's birth order. Random assignment eliminates the impact of lurking variables. When you cannot assign subjects to treatment groups at random, there will be differences between the groups other than the explanatory variable.
Types of Sampling
A sample should have the same characteristics as the population it is representing. How do we get representative samples from the populations? Most statisticians use various methods of random sampling in an attempt to achieve this goal, including: 1. Simple Random Sample (SRS) 2. Systematic Sample 3. Stratified Sample 4. Cluster Sample
Sample
A subset of the population that has the same characteristics of the population of interest Each individual member of a sample is called a sampling unit An individual measurement on a sampling unit is called a sample observation The totality of all the sample observations is referred to as the data
Convenience Sample
A type of sampling that is non-random is convenience sampling. Convenience sampling involves using results that are readily available. A computer software store conducts a marketing study by interviewing potential customers who happen to be in the store browsing through the available software. The results of convenience sampling may be very good in some cases and highly biased (favor certain outcomes) in others.
Example
AIDS data indicating the number of months a patient with AIDS lives after taking a new antibody drug are as follows (smallest to largest): 3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29; 29; 31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47. Calculate the mean and the median. The sample mean is x̄ = (1/40)(3 + 4 + 8 + 8 + 10 + 11 + 12 + 13 + 14 + 15 + ... + 44 + 47) = 23.6. For the sample median, the index is i = (50/100)(40 + 1) = 41/2 = 20.5. The data values we need to average are x20 = 24 and x21 = 24, so the sample median is M = (24 + 24)/2 = 24.
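The arithmetic in this example can be checked with Python's standard `statistics` module. This is only a verification sketch; the variable names are arbitrary.

```python
import statistics

months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 21, 22,
          22, 24, 24, 25, 26, 26, 27, 27, 29, 29, 31, 32, 33, 33, 34, 34, 35,
          37, 40, 44, 44, 47]

mean = statistics.mean(months)       # sum of the 40 values divided by 40
median = statistics.median(months)   # average of the 20th and 21st ordered values

print(round(mean, 1), median)
```

For an even number of observations, `statistics.median` averages the two middle values, which matches the index i = 20.5 used above.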
Ethics
All planned studies that involve human participants must be approved in advance by the Institutional Review Board (IRB). Key protections that are mandated by law include the following: • Risks to participants must be minimized and reasonable with respect to projected benefits. • Participants must give informed consent. This means that the risks of participation must be clearly explained to the subjects of the study. Subjects must consent in writing, and researchers are required to keep documentation of their consent. • Data collected from individuals must be guarded carefully to protect their privacy. Researchers have a responsibility to verify that proper methods are being followed and to guard against statistical fraud. Describe the unethical behavior in each example and describe how it could impact the reliability of the resulting data. Explain how the problem should be corrected. A researcher is collecting data in a community. • She selects a block where she is comfortable walking because she knows many of the people living on the street. By selecting a convenient sample, the researcher is intentionally selecting a sample that could be biased. Claiming that this sample represents the community is misleading. The researcher needs to select areas in the community at random. • She skips four houses on her route because she is running late for an appointment. When she gets home, she fills in the forms by selecting random answers from other residents in the neighborhood. It is never acceptable to fake data. Even though the responses she uses are real responses provided by other participants, the duplication is fraudulent and can create bias in the data. She needs to work diligently to interview everyone on her route.
Blinding
Blinding: Blinding in a randomized experiment preserves the power of suggestion. When a subject in a research study is blinded, that subject does not know who is receiving the active treatment(s) and who is receiving the placebo treatment. Double-Blinding: Both the subjects and the researchers involved with the subjects are blinded.
Box Plots
Box plots (also called box-and-whisker plots or box-whisker plots) give a good graphical image of the concentration of the data. Show how far the extreme values are from most of the data. A box plot is constructed from five values: • minimum • first quartile • median • third quartile • maximum Called the 5 number summary of the data
Data
Data are the actual values of the variables. They may be numbers or they may be actual words. Datum is a single value
Bivariate Data
Data with two variables We also want to study how strong the relationship between the two variables is which is correlation
Experimental Design
Does aspirin reduce the risk of heart attacks? Is one brand of fertilizer more effective at growing roses than another? Is fatigue as dangerous to a driver as the influence of alcohol? Questions like these are answered using randomized experiments. Proper study design ensures the production of reliable, accurate data. The purpose of an experiment is to investigate the causal relationship between two variables. When one variable causes change in another, we call the first variable the explanatory variable. The affected variable is called the response variable. In an experiment, the researcher manipulates values of the explanatory variable and measures the resulting changes in the response variable. The different values of the explanatory variable are called treatments. An experimental unit is a single object or individual to be measured. There can be more than one explanatory variables in an experiment.
Example: Oral Medication & Heart Attack
Researchers want to investigate whether taking aspirin regularly reduces the risk of heart attack. Four hundred men between the ages of 50 and 84 are recruited as participants. The men are divided randomly into two groups: one group will take aspirin, and the other group will take a placebo. Each man takes one pill each day for three years, but he does not know whether he is taking aspirin or the placebo. At the end of the study, researchers count the number of men in each group who have had heart attacks. • Experimental units: 400 individuals aged between 50 and 84. • Explanatory variable: Oral medication. • Treatments: Aspirin and a placebo. • Response variable: Whether a subject had a heart attack. • Lurking variables: Sedentary lifestyle, consumption of
Time Series Graph
Given a paired data set, we start with a standard Cartesian coordinate system: the horizontal axis is used to plot the date/time increments, and the vertical axis is used to plot the values of the variable that we are measuring. By doing this, we make each point on the graph correspond to a date and a measured quantity. The points on the graph are connected by straight lines in the order in which they occur.
What to do with i?
We use i to find the percentile. • If i is a whole number, find the ith data value in the ordered list; that's the percentile. • If i is a decimal, we will take the two whole numbers closest to i and average the data values at those positions.
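A minimal sketch of the percentile rule, combining the index i = (k/100)(n + 1) with the whole-number and decimal cases. The function name `percentile` is hypothetical, and k is assumed to be a whole number. Raising an error when i falls outside the data is also an assumption, since the notes do not cover that case. The data come from the IQR example card.

```python
def percentile(data, k):
    """kth percentile (whole-number k) using the index i = (k/100)(n + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    whole, remainder = divmod(k * (n + 1), 100)   # i = whole + remainder/100
    upper = whole if remainder == 0 else whole + 1
    if whole < 1 or upper > n:                    # assumption: reject out-of-range i
        raise ValueError("index falls outside the data")
    if remainder == 0:
        return ordered[whole - 1]                 # i is whole: take the ith value
    # i is a decimal: average the values at the two nearest whole positions
    return (ordered[whole - 1] + ordered[upper - 1]) / 2

data = [10, 12, 15, 21, 26, 29, 31]
p25 = percentile(data, 25)   # i = 2, a whole number
p40 = percentile(data, 40)   # i = 3.2, so average the 3rd and 4th values
p50 = percentile(data, 50)   # i = 4, the median
print(p25, p40, p50)
```

The same index arithmetic gives `divmod(73 * 101, 100) == (73, 73)` for k = 73 and n = 100, that is i = 73.73, so the 73rd and 74th values are averaged.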
Example
Identify the key terms in the following study: we want to know the average amount of money spent on uniforms each year by families with children at Knoll Academy. We randomly surveyed 100 families with children in the school. Three of the families spent 65, 75, and 95, respectively. Population: all families with children at Knoll Academy. Sample: the 100 families surveyed. Parameter: the mean amount spent on uniforms by all families in the population. Statistic: the mean amount spent on uniforms by the 100 families in the sample. Variable: the amount of money spent on uniforms by one family. Data: the amounts spent, such as 65, 75, and 95.
Percentages add up to be less than 100%
If this is due to omitting categories/missing data, we can add an "Other" category to complete it. Sorting the bars from largest to smallest makes the bar graph easier to read and interpret. Since the "Other" category makes the categories' counts sum to 100% now, pie chart becomes an option. The chart in Figure 1.11b is organized by the size of each wedge, which makes it a more visually informative graph than the unsorted, alphabetical graph in Figure 1.11a.
Example: Sedentary Lifestyle
In a study published in the July 7, 2014, edition of the American Journal of Medicine, it was suggested that lack of exercise contributed more to weight gain than eating too much. The study examined the current exercise habits and caloric intake of a sample of both males and females. It was reported that women younger than 40 are quite vulnerable to the risks of a sedentary lifestyle. • Response variable - weight gain. Explanatory variables - exercise habits and caloric intake. • Motherhood of women below 40 could be a potential lurking variable which leads to less exercise, eating more, and hence a more sedentary lifestyle. • This is an observational study since no assignment of treatments was made by the researchers; the researchers simply observed current habits of the subjects
Lurking Variables
In any experiment, the results and conclusions that can be drawn depend to a large extent on the manner in which the data were collected. Suppose that the metallurgical engineer in the Aluminium alloy experiment used specimens from one heat in the oil quench and specimens from a second heat in the saltwater quench. Now, when the mean hardness is compared, the engineer is unable to say how much of the observed difference is the result of the quenching media and how much is the result of inherent differences between the heats. Such extraneous factors that can potentially influence the effect of the explanatory variable(s) on the outcome of the response variable are called lurking variables
Randomized Experiment
In order to prove that the explanatory variable is causing a change in the response variable, it is necessary to isolate the explanatory variable. The researcher must design her experiment in such a way that there is only one difference between groups being compared: the planned treatments. This is accomplished by the random assignment of experimental units to treatment groups. When subjects are assigned treatments randomly, all of the potential lurking variables are spread equally among the groups. At this point the only difference between groups is the one imposed by the researcher. Different outcomes measured in the response variable, therefore, must be a direct result of the different treatments. In this way, an experiment can prove a cause-and-effect connection between the explanatory and response variables
Rationale
Information obtained at a smaller scale that would be projected/used to obtain an idea of the bigger level
Interquartile Range
The interquartile range (IQR) is the range of the middle 50 percent of the data values; it is found by subtracting the first quartile from the third quartile: IQR = Q3 − Q1. • The IQR can help to determine potential outliers. • A value is suspected to be a potential outlier if it lies more than (1.5)(IQR) below the first quartile or more than (1.5)(IQR) above the third quartile.
Observational Study vs. Experimental Study
Observational Study An observational study observes individuals and measures variables of interest but does not attempt to influence the responses. Experiment An experiment deliberately imposes some treatment on individuals to measure their responses. The purpose of an experiment is to study whether the treatment causes a change in the response.
Frequency Tables
Once you have a set of data, you will need to organize it so that you can analyze how frequently each datum occurs in the set. Frequency: Count number of times each data value occurs. Frequency Table: Table with data value and frequency Twenty students were asked how many hours they worked per day. Their responses, in hours, are as follows: 5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3 Make a frequency table. Relative Frequency The ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes. Cumulative Relative Frequency The sum of the relative frequencies for all the values that are less than or equal to the given value.
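One way to build the table for the twenty responses above is sketched here. The variable names and the printed layout are arbitrary choices.

```python
from collections import Counter

hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]
n = len(hours)

frequency = Counter(hours)        # how many times each data value occurs
table = []
running_count = 0
for value in sorted(frequency):
    count = frequency[value]
    running_count += count
    # (value, frequency, relative frequency, cumulative relative frequency)
    table.append((value, count, count / n, running_count / n))

print("value  freq  rel.freq  cum.rel.freq")
for value, count, rel, cum in table:
    print(f"{value:>5}  {count:>4}  {rel:>8.2f}  {cum:>12.2f}")
```

The cumulative column is computed from a running count rather than by adding up rounded relative frequencies, so the last entry is exactly 1.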
Measures of Location
Percentile a number that divides ordered data into hundredths; percentiles may or may not be part of the data. The median of the data is the second quartile and the 50th percentile. The first and third quartiles are the 25th and the 75th percentiles, respectively. Quartile the numbers that separate the data into quarters; quartiles may or may not be part of the data. The second quartile is the median of the data.
Displaying Qualitative Data
Pie Chart: Categories of data are represented by wedges in a circle and are proportional in size to the percent of individuals in each category Bar Graph: The length of the bar for each category is proportional to the number or percent of individuals in each category. Bars may be vertical or horizontal.
Percentages add up to be more than 100%
Pie charts are only good for displaying parts of a whole. If the percentages add up to be more than 100%, a pie chart cannot be used. In this case, it occurs because the students can be in more than one of these categories.
Common Problems
Problems with samples: A sample must be representative of the population. A sample that is not representative of the population is biased. Biased samples that are not representative of the population give results that are inaccurate and not valid Self-selected samples: Responses only by people who choose to respond, such as call-in surveys, are often unreliable. Sample size issues: Samples that are too small may be unreliable. Larger samples are better, if possible. In some situations, having small samples is unavoidable and can still be used to draw conclusions. Example: crash testing cars or medical testing for rare conditions Undue influence: collecting data or asking questions in a way that influences the response Non-response or refusal of subject to participate: The collected responses may no longer be representative of the population. Often, people with strong positive or negative opinions may answer surveys, which can affect the results Causality: A relationship between two variables does not mean that one causes the other to occur. They may be related (correlated) because of their relationship through a different variable Self-funded or self-interest studies: A study performed by a person or organization in order to support their claim. Is the study impartial? Read the study carefully to evaluate the work. Do not automatically assume that the study is good, but do not automatically assume the study is bad either. Evaluate it on its merits and the work done. Misleading use of data: improperly displayed graphs, incomplete data, or lack of context Confounding: When the effects of multiple factors on a response cannot be separated. Confounding makes it difficult or impossible to draw valid conclusions about the effect of each factor.
Systematic Sample
Randomly select a starting point and then take every nth data piece after that. Suppose you have to do a phone survey. Your phone book contains 20,000 residence listings. You must choose 400 names for the sample. Number the population 1–20,000 and then use a simple random sample to pick a number that represents the first name in the sample. Then choose every fiftieth name thereafter until you have a total of 400 names (you might have to go back to the beginning of your phone list). Systematic sampling is frequently chosen because it is a simple method.
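The phone-book example can be sketched as follows. The function name and the `seed` argument are hypothetical. The step of 50 comes from 20,000 / 400, and the modulo handles going back to the beginning of the list.

```python
import random

def systematic_sample(population_size, sample_size, seed=None):
    """Pick a random start, then every step-th listing, wrapping around the list."""
    step = population_size // sample_size          # 20,000 // 400 = 50
    rng = random.Random(seed)                      # seed only for reproducibility
    start = rng.randint(1, population_size)        # listings are numbered 1..N
    return [(start - 1 + j * step) % population_size + 1
            for j in range(sample_size)]

listing_numbers = systematic_sample(20_000, 400, seed=7)
print(listing_numbers[:5])
```

Since 400 steps of 50 cover the 20,000 listings exactly once, no listing number is selected twice.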
Outlier Example One
Referring back to the Books example. JMP said the minimum was 1, the maximum was 6, Q1 was 2, and Q3 was 4. Are there any outliers? IQR: 4 − 2 = 2. Outlier cutoffs: smaller than 2 − 1.5(2) = −1, or larger than 4 + 1.5(2) = 7. Therefore there are no outliers. Sometimes the outlier definition covers numbers that are impossible or highly unlikely.
Key Idea 1
Roughly, the parameter is to the population as the statistic is to the sample. We use a statistic, calculated from the data obtained from a sample, to estimate or infer about the parameter: 1. Identify your parameter of interest for some population. 2. Obtain a sample from the population. 3. Gather data or observations from the sample. 4. Compute a statistic based on the observations. 5. Use the statistic as an estimate of the parameter. A numerical summary calculated from the whole population is a parameter; one calculated from a sample is a statistic, which serves as an estimate of the parameter.
A study is done to determine the average tuition that San Jose State undergraduate students pay per semester. Each student in the following samples is asked how much tuition he or she paid for the Fall semester. What is the type of sampling in each case? • A completely random method is used to select 75 students. Each undergraduate student in the fall semester has the same probability of being chosen at any stage of the sampling process.
SRS
Sample Mean
Sample Mean The mean value in a set of data is called the sample mean of the data.
Sampling Errors & Non-Sampling Errors
Sampling Errors: The actual process of sampling causes sampling errors A sample will never be exactly representative of the population so there will always be some sampling error. As a rule, the larger the sample, the smaller the sampling error. Non-Sampling Errors: Caused by factors not related to the sampling process
Example: Smell & Taste Experiment
The Smell & Taste Treatment and Research Foundation conducted a study to investigate whether smell can affect learning. Subjects completed mazes multiple times while wearing masks. They completed the pencil and paper mazes three times wearing floral-scented masks, and three times with unscented masks. Participants were assigned at random to wear the floral mask during the first three trials or during the last three trials. For each trial, researchers recorded the time it took to complete the maze and the subject's impression of the mask's scent: positive, negative, or neutral. Answer the following: • Describe the explanatory and response variables in this study. • What are the treatments? • Identify the plausible lurking variables, if any. • Is it possible to use blinding in this study? • The explanatory variable is scent, and the response variable is the time it takes to complete the maze. • There are two treatments: a floral-scented mask and an unscented mask. • All subjects experienced both treatments. The order of treatments was randomly assigned so there were no differences between the treatment groups. • Subjects clearly know whether they can smell flowers or not, so subjects cannot be blinded in this study. Researchers timing the mazes can be blinded, though. The researcher who is observing a subject will not know which mask is being worn.
Variation in Data
Variation: Data values will not always be the same for each element of the population; this difference is called variation. For example, 16-ounce cans of beverage may contain more or less than 16 ounces of liquid. In one study, eight 16-ounce cans were measured and produced the following amounts (in ounces) of beverage: 15.8; 16.1; 15.2; 14.8; 15.8; 15.9; 16.0; 15.5. Measurements of the amount of beverage in a 16-ounce can may vary because different people make the measurements or because the exact amount, 16 ounces of liquid, was not put into the cans.
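The variation in the eight can measurements can be summarized with the sample mean and the sample standard deviation (dividing by n - 1). The following is a minimal Python sketch; the function names are illustrative only.

```python
import math

# Amounts (in ounces) measured in the eight 16-ounce cans from the card above.
volumes = [15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5]

def sample_mean(data):
    """Add the values and divide by how many there are."""
    return sum(data) / len(data)

def sample_std(data):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    mean = sample_mean(data)
    squared_deviations = [(x - mean) ** 2 for x in data]
    return math.sqrt(sum(squared_deviations) / (len(data) - 1))

mean_volume = sample_mean(volumes)
std_volume = sample_std(volumes)
print(f"mean = {mean_volume:.4f} oz, sample std dev = {std_volume:.4f} oz")
```

A standard deviation of zero would mean every can held exactly the same amount; the larger the value, the more the cans vary.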
Cluster Sample
We divide the population into clusters (groups) and then randomly select some of the clusters. All the members from these clusters are in the cluster sample. Example: Divide your college faculty by department; the departments are the clusters. Number each department, and then choose four different numbers using simple random sampling. All members of the four departments with those numbers make up the cluster sample.
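The department example can be sketched in Python as follows. The department names and member labels are made up for illustration, and the fixed seed is only there to make the run repeatable.

```python
import random

# Hypothetical faculty roster: department (cluster) -> its members.
departments = {
    "Biology": ["B1", "B2", "B3"],
    "Chemistry": ["C1", "C2"],
    "English": ["E1", "E2", "E3", "E4"],
    "History": ["H1", "H2"],
    "Math": ["M1", "M2", "M3"],
    "Physics": ["P1", "P2"],
}

def cluster_sample(clusters, k, rng):
    """Randomly pick k clusters, then keep every member of each chosen cluster."""
    chosen = rng.sample(sorted(clusters), k)
    members = [m for name in chosen for m in clusters[name]]
    return chosen, members

rng = random.Random(0)  # seeded so the sketch is repeatable
chosen, sample = cluster_sample(departments, 4, rng)
print(chosen)
print(sample)
```

Randomness enters only when choosing the clusters; within a chosen cluster nobody is left out.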
Stratified Sample
We divide the population into groups called strata and then take an SRS from each stratum. You could stratify (group) your college population by department and then choose a proportionate simple random sample from each stratum (each department) to get a stratified random sample. Example: Suppose there are 5 departments of equal size and you want a sample of 50. You choose 10 from each department using simple random sampling.
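A minimal Python sketch of a proportionate stratified sample is shown below. It assumes five hypothetical departments of 40 members each, so a sample of 50 takes 10 from each; the names are made up for illustration.

```python
import random

def stratified_sample(strata, total, rng):
    """Take an SRS from every stratum, sized in proportion to the stratum."""
    population_size = sum(len(members) for members in strata.values())
    picked = {}
    for name, members in strata.items():
        # Proportionate allocation. With uneven strata, rounding can make
        # the overall sample size differ slightly from `total`.
        k = round(total * len(members) / population_size)
        picked[name] = rng.sample(members, k)
    return picked

# Hypothetical population: 5 departments with 40 members each.
strata = {f"Dept{d}": [f"D{d}-{i}" for i in range(40)] for d in range(1, 6)}
picked = stratified_sample(strata, 50, random.Random(1))
for name, members in picked.items():
    print(name, len(members))
```

Unlike a cluster sample, every stratum contributes to the sample, but only some of its members are chosen.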
Interpreting Percentiles, Quartiles, and Median
When writing the interpretation of a percentile in the context of the given data, the sentence should contain the following information. • information about the context of the situation being considered • the data value (value of the variable) that represents the percentile • the percent of individuals or items with data values below the percentile • the percent of individuals or items with data values above the percentile Example 1. On a timed math test, the first quartile for the time it took to finish the exam was 35 minutes. Interpret the first quartile in the context of this situation. ANSWER Twenty-five percent of students finished the exam in 35 minutes or less. Seventy-five percent of students finished the exam in 35 minutes or more. A low percentile could be considered good, as finishing more quickly on a timed exam is desirable. (If you take too long, you might not be able to finish.) 2. On a 20-question math test, the 70th percentile for the number of correct answers was 16. Interpret the 70th percentile in the context of this situation. ANSWER Seventy percent of students answered 16 or fewer questions correctly. Thirty percent of students answered 16 or more questions correctly. A higher percentile could be considered good, as answering more questions correctly is desirable.
Sampling with/without Replacement
With Replacement: Once a member is picked, that member goes back into the population and thus may be chosen more than once. Without Replacement: Once a member is picked, that member goes out of the pool and thus cannot be chosen more than once. Sampling without replacement instead of sampling with replacement becomes a mathematical issue only when the population is small.
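The difference can be seen with Python's standard `random` module: `choices` draws with replacement and `sample` draws without replacement. The population of ten numbered members is a made-up example.

```python
import random

rng = random.Random(42)          # seeded so the sketch is repeatable
population = list(range(1, 11))  # hypothetical members numbered 1 to 10

# With replacement: a member goes back after being picked, so repeats can occur.
with_replacement = rng.choices(population, k=5)

# Without replacement: a member leaves the pool, so no repeats are possible.
without_replacement = rng.sample(population, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

With only ten members, repeats under replacement are fairly likely; with a very large population the two methods give nearly the same results.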
Variable
A characteristic or measurement that can be determined for each member of the population. 1. Numerical/quantitative: a number (height). 2. Categorical/qualitative: an attribute of the population (brown hair, ethnic group).
population
A collection or set of individuals or objects (living or nonliving) whose properties are being studied.
outlier
An observation of data that does not fit the rest of the data. Some outliers are due to mistakes, while others may indicate that something unusual is happening. It takes some background information to explain outliers.
sampling
Goal: find the true parameter of interest in the population. Ideally, we would measure the parameter on each member of the population. This is feasible for small populations but generally impossible for large ones because of time, monetary, or other physical constraints. For most of the populations we encounter in practice, the actual value of a parameter cannot be determined easily. To study the population, we draw a subset of it, referred to as a sample.
Categorical/Qualitative Variables
Can be expressed through labels or tags. 1. Nominal: there is no natural order to the categories (religion, ethnicity, dog breeds). 2. Ordinal: there is a natural order to the categories (reviews, grades).
Numerical/Quantitative Variables
Can be expressed through numerical figures. 1. Discrete: measured in whole units (count data, such as the number of fish caught). 2. Continuous: can be measured in decimals (weight, height, price, time to finish a race).
correlation coefficient strength
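The closer the correlation coefficient r is to 1 or to -1, the stronger the linear relationship; the closer r is to 0, the weaker the linear relationship.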
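To see what values near 1, near -1, and near 0 look like, here is a minimal Python sketch that computes the Pearson correlation coefficient directly from its definition. All three data sets are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
strong_positive = [2.1, 3.9, 6.2, 7.8, 10.1]  # nearly a rising straight line
perfect_negative = [10, 8, 6, 4, 2]           # exactly a falling straight line
weak = [5, 1, 4, 2, 5]                        # no clear linear pattern

r_strong = pearson_r(xs, strong_positive)
r_negative = pearson_r(xs, perfect_negative)
r_weak = pearson_r(xs, weak)
print(round(r_strong, 3), round(r_negative, 3), round(r_weak, 3))
```

The sign of r gives the direction of the relationship, and the size of r (ignoring the sign) gives its strength.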
A study is done to determine the average tuition that San Jose State undergraduate students pay per semester. Each student in the following samples is asked how much tuition he or she paid for the Fall semester. What is the type of sampling in each case? • The freshman, sophomore, junior, and senior years are numbers one, two, three, and four, respectively. A random number generator is used to pick two of those years. All students in those two years are in the sample.
cluster
Histogram
Consists of contiguous (adjoining) bars. The horizontal axis is labeled with what the data represents. The vertical axis is labeled with either the frequency or the relative frequency of each bin. The graph will have the same shape with either label.
• An administrative assistant is asked to stand in front of the library one Wednesday and to ask the first 100 undergraduate students he encounters what they paid for tuition the Fall semester. Those 100 students are the sample.
convenience
stem and leaf plot
Divide each observation of data into a stem and a leaf. The leaf consists of the final significant digit. A good choice for showing the distribution of a numerical variable when the data set is small.
inferential statistics
Formal methods for drawing conclusions from data.
Bin
Represents a range of data values and is used when displaying large data sets. Bins are also called classes or intervals, and all have the same width. The histogram, like the stemplot, can give you the shape of the data, and it is easier to analyze when the data set is large.
Calculating the Mean of Grouped Frequency Tables
Step 1: Find the midpoint of each interval.
Calculating the Mean of Grouped Frequency Tables
Step 2: Calculate the mean using the midpoints as the data values, counting each midpoint as many times as the frequency of its interval.
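The two steps can be sketched in Python as follows. The intervals and frequencies are a made-up grouped frequency table, used only to show the arithmetic.

```python
# Hypothetical grouped frequency table: (lower bound, upper bound) and frequency.
intervals = [(0, 10), (10, 20), (20, 30)]
frequencies = [2, 5, 3]

# Step 1: find the midpoint of each interval.
midpoints = [(low + high) / 2 for low, high in intervals]

# Step 2: treat each midpoint as a data value that occurs `frequency` times.
weighted_sum = sum(f * m for f, m in zip(frequencies, midpoints))
grouped_mean = weighted_sum / sum(frequencies)

print(midpoints)     # [5.0, 15.0, 25.0]
print(grouped_mean)  # 16.0
```

The result is an estimate of the mean, because the individual data values inside each interval are not known.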
A sample of 100 undergraduate San Jose State students is taken by organizing the students' names by classification (freshman, sophomore, junior, senior), and then selecting 25 from each.
stratified
A study is done to determine the average tuition that San Jose State undergraduate students pay per semester. Each student in the following samples is asked how much tuition he or she paid for the Fall semester. What is the type of sampling in each case? • A random number generator is used to select from the alphabetical listing of all undergraduate students in the Fall semester. Starting with that student, every 50th student is chosen until 75 students are included in the sample.
systematic
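A minimal Python sketch of this systematic sample is shown below. The roster of 5,000 student IDs is hypothetical, and wrapping around to the top of the list when the end is reached is an assumption of this sketch, not something stated in the example.

```python
import random

def systematic_sample(roster, step, size, rng):
    """Pick a random starting position, then take every `step`-th member."""
    start = rng.randrange(len(roster))
    # Assumption: wrap around to the top of the list if the end is reached.
    return [roster[(start + i * step) % len(roster)] for i in range(size)]

# Hypothetical alphabetical listing, represented by ID numbers 0 to 4999.
roster = list(range(5000))
selected = systematic_sample(roster, step=50, size=75, rng=random.Random(7))

print(selected[:5])
print(len(selected))
```

Only the starting point is random; once it is chosen, the rest of the sample is determined by the step size.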
Bar Graph
The length of the bar for each category is proportional to the frequency or relative frequency of that category. Bars may be vertical or horizontal. In a histogram the bars touch; in a bar graph there are gaps between the bars.
data rarely fit a straight line exactly
Usually you must be satisfied with rough predictions. Typically, you have a set of data whose scatter plot appears to fit a straight line; this line is called the line of best fit.
equation of a line
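y = mx + c, where m is the slope (m = tan θ, with θ the angle the line makes with the positive x-axis) and c is the y-intercept.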
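One common way to find a line of best fit in the form y = mx + c is the least-squares method. The sketch below uses the standard least-squares formulas for the slope and intercept; the data points are hypothetical, and least squares is offered here as one usual choice rather than the only way to fit a line.

```python
def least_squares_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

# Hypothetical data that lie close to, but not exactly on, a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

m, c = least_squares_line(xs, ys)
print(f"y = {m:.2f}x + {c:.2f}")

# A rough prediction from the fitted line for a new x value.
predicted = m * 6 + c
```

The fitted line does not pass through every point, which is why predictions made from it are only approximate.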
Example question of variables
You go to the supermarket and buy 3 cans of soup (19 oz of tomato bisque, 14.1 oz of lentil, and 19 oz of Italian wedding), 2 packages of nuts (walnuts and peanuts), 4 different kinds of vegetables (broccoli, cauliflower, spinach, and carrots), and 2 desserts (16 oz of pistachio ice cream and 32 oz of cookies). Package counts: quantitative, discrete. Soup can weights: quantitative, continuous. Types of items: categorical, nominal.
Summary: Skewness
• Symmetric: mean = median • Left Skewed: mean < median • Right Skewed: mean > median
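The summary can be checked on small data sets with Python's standard `statistics` module. The three data sets below are made up so that each one clearly shows one of the three shapes.

```python
import statistics

# Hypothetical data sets, one for each shape.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]       # one large value pulls the mean up
left_skewed = [1, 17, 18, 18, 18, 19, 19, 20]  # one small value pulls the mean down

summary = {}
for label, data in [("symmetric", symmetric),
                    ("right skewed", right_skewed),
                    ("left skewed", left_skewed)]:
    summary[label] = (statistics.mean(data), statistics.median(data))
    print(label, "mean =", summary[label][0], "median =", summary[label][1])
```

The mean is pulled toward the long tail, while the median stays near the bulk of the data.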