STA2014 - Chapter 1 : Data Collection
closed question
A closed question requires the respondent to choose from a list of predetermined responses.
continuous variable
A continuous variable is a quantitative variable that has an infinite number of possible values that are not countable.
parameter
A parameter is a numerical summary of a population.
presurvey
A presurvey could give the researcher an idea of what the most common responses are from a population. The researcher could then use these responses as the answers to closed questions in the actual survey.
prospective study
A prospective study collects the data over time.
qualitative variable
A qualitative variable allows for classification of individuals based on some attribute or characteristic.
sample
A sample is a subset of the population that is being studied.
inferential statistics
Inferential statistics uses methods that generalize results obtained from a sample to the population and measure the reliability of the results.
statistics
Statistics is the science of collecting, organizing, summarizing, and analyzing information to draw a conclusion and answer questions. In addition, statistics is about providing a measure of confidence in any conclusions.
countable
The term "countable" means that the values result from counting, such as 0, 1, 2, 3, and so on.
undercoverage error
Undercoverage bias occurs when the proportion of one segment of the population is lower in a sample than it is in the population.
types of nonsampling errors
Undercoverage, nonresponse bias, response bias, or data-entry errors are all types of nonsampling errors.
sampling without replacement
When sampling without replacement, once an individual is selected, the individual is removed from the possible choices for that sample and cannot be chosen again.
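A minimal Python sketch of the idea above, contrasted with sampling with replacement. The function names and the list of students are hypothetical, made up only for illustration.

```python
import random

def draw_without_replacement(individuals, n, seed=None):
    """Select n individuals; each one can be chosen at most once."""
    rng = random.Random(seed)
    remaining = list(individuals)
    chosen = []
    for _ in range(n):
        # Pick a random position, then remove that individual from the pool
        # so it cannot be selected again for this sample.
        index = rng.randrange(len(remaining))
        chosen.append(remaining.pop(index))
    return chosen

def draw_with_replacement(individuals, n, seed=None):
    """Select n individuals; the same one may be chosen more than once."""
    rng = random.Random(seed)
    pool = list(individuals)
    return [rng.choice(pool) for _ in range(n)]

# Hypothetical population of five students.
students = ["Ana", "Ben", "Cruz", "Dee", "Eli"]
without = draw_without_replacement(students, 3, seed=1)
with_rep = draw_with_replacement(students, 3, seed=1)
print(without)   # three distinct names
print(with_rep)  # three names, repeats possible
```

Because the pool shrinks after every draw, a sample drawn without replacement can never be larger than the population.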
sampling bias
When the technique used to obtain the individuals to be in the sample tends to favor one part of the population over another, this is known as sampling bias.
cohort study
A cohort study first identifies a group of individuals to participate in the study (the cohort). The cohort is then observed over a long period of time. During this time period, characteristics about the individuals are recorded and some individuals will be exposed to certain factors (not intentionally) and others will not. At the end of the study the value of the response variable is recorded for the individuals.
confounding variable
A confounding variable is an explanatory variable that was considered in a study whose effect cannot be distinguished from a second explanatory variable in the study. The big difference between lurking variables and confounding variables is that lurking variables are not considered in the study whereas confounding variables are measured in the study.
designed experiment
In a designed experiment, a researcher assigns individuals to a certain group, intentionally changes the value of an explanatory variable, and then records the value of the response variable for each group. A designed experiment allows the researcher to claim causation between an explanatory variable and a response variable.
discrete variable
A discrete variable is a quantitative variable that has either a finite number of possible values or a countable number of possible values.
lurking variable
A lurking variable is an explanatory variable that was not considered in a study, but that affects the value of the response variable in the study. In addition, lurking variables are typically related to explanatory variables in the study. A relation that appears to exist between a certain explanatory variable and the response variable may be due to some other variable or variables not accounted for in the study. These variables are called lurking variables.
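A small simulation can make this concrete. The sketch below is an illustration under made-up assumptions, not real data: a lurking variable drives both the explanatory and the response variable, and the explanatory variable has no direct effect on the response. The variable names and the noise levels are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(7)

# Simulated lurking variable (for example, a genetic factor).
lurking = [rng.gauss(0, 1) for _ in range(2000)]

# Both variables depend on the lurking variable plus independent noise.
# The explanatory variable never enters the formula for the response.
explanatory = [z + rng.gauss(0, 0.5) for z in lurking]
response = [z + rng.gauss(0, 0.5) for z in lurking]

# The two variables appear strongly related.
apparent = pearson_r(explanatory, response)

# Subtracting the lurking variable's contribution removes the relation.
# This works only because the simulation tells us exactly how it enters.
adjusted = pearson_r(
    [x - z for x, z in zip(explanatory, lurking)],
    [y - z for y, z in zip(response, lurking)],
)
print(round(apparent, 2), round(adjusted, 2))
```

In a real study the lurking variable is, by definition, not recorded, so the second calculation is not available to the researcher. That is why an apparent relation in an observational study does not establish causation.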
nonresponse bias error
A nonresponse means that an individual selected for the sample does not respond to the survey. Nonresponse bias exists when individuals selected to be in the sample who do not respond to the survey have different opinions than those who do. Nonresponse bias can be controlled using callbacks. For example, if nonresponse occurs because a mailed questionnaire was not returned, a callback might mean phoning the individual to conduct the survey. If nonresponse occurs because an individual was not at home, a callback might mean returning to the home at other times in the day or on other days of the week. Using rewards and incentives is another method to reduce nonresponse. Rewards may include cash payments. Incentives may include a cover letter that states that the responses to the questionnaire will determine future policy.
response rates
A possible advantage of offering rewards or incentives to increase response rates is that respondents put more effort into completely and accurately answering the survey questions because they feel obligated. A possible disadvantage of offering rewards or incentives to increase response rates is that the people interested in the rewards or incentives differ from the population in some way that is important to the study, causing biased results.
response bias error
Response bias exists when the answers on a survey do not reflect the true feelings of the respondent. One cause is a question that is not balanced, that is, a question worded in such a way as to influence the response of those being surveyed.
retrospective study
A retrospective study requires that individuals look back in time or requires the researcher to look at existing records.
statistic
A statistic is a numerical summary of a sample.
individual
An individual is a person or object that is a member of the population being studied.
observational study
An observational study measures the value of the response variable without attempting to influence the value of either the response or explanatory variables.
open question
An open question allows the respondent to choose his or her response.
confounding
Confounding in a study occurs when the effects of two or more explanatory variables are not separated. Therefore, any relation that may exist between an explanatory variable and the response variable may be due to some other variable or variables not accounted for in the study. Confounding is potentially a major problem with observational studies. Often, the cause of confounding is a lurking variable.
descriptive statistics
Descriptive statistics consists of organizing and summarizing information collected.
cross-sectional vs case-control
Neither study is always superior to the other. Both have advantages and disadvantages that depend on the situation. Both studies are inexpensive and can be done relatively quickly. A case-control study is limited in that it requires individuals to recall information correctly and to answer questions truthfully. A cross-sectional study is limited in that it only gives information at a specific point in time or over a very short period of time, and might miss valuable information that occurs outside of that point in time.
nonsampling error
Nonsampling error is the error that results from the process of obtaining the data.
PRACTICE: Researchers wanted to determine if there was an association between the level of happiness of an individual and their risk of high blood pressure. The researchers studied 1546 people over the course of 88 years. During this 88-year period, they interviewed the individuals and asked questions about their daily lives and the hassles they face. In addition, hypothetical scenarios were presented to determine how each individual would handle the situation. These interviews were videotaped and studied to assess the emotions of the individuals. The researchers also determined which individuals in the study experienced any type of high blood pressure over the 88-year period. After their analysis, the researchers concluded that the happy individuals were less likely to experience high blood pressure.
Q1: What type of observational study was this? A1: This was a cohort study, because information was collected about a group of individuals by observing them over a long period of time. Q2: What is the response variable? A2: The response variable is whether or not high blood pressure was contracted, because it is the variable of interest. Q3: What is the explanatory variable? A3: The explanatory variable is level of happiness, because it affects the other variable. Q4: In the report, the researchers stated that "the research team also hasn't ruled out that a common factor like genetics could be causing both the emotions and the high blood pressure." Explain what this sentence means. A4: The researchers may be concerned with confounding that occurs when the effects of two or more explanatory variables are not separated or when there are some explanatory variables that were not considered in a study, but that affect the value of the response variable.
PRACTICE: Researchers wanted to determine if there was an association between daily pomegranate consumption and the occurrence of high blood pressure. The researchers looked at 93,166 women and asked them to report their pomegranate-eating habits. The researchers also determined which of the women had high blood pressure. After their analysis, the researchers concluded that consumption of two or more servings of pomegranate per day was associated with a reduction in high blood pressure.
Q1: What type of observational study was this? A1: This was a cross-sectional study because all information about the individuals was collected at a specific point in time. Q2: What is the response variable in the study? Is the response variable qualitative or quantitative? A2: The response variable is whether the woman has high blood pressure or not. The response variable is qualitative. Q3: What is the explanatory variable? A3: The explanatory variable is consumption of pomegranate. Q4: In their report, the researchers stated that "After adjusting for various demographic and lifestyle variables, daily consumption of two or more servings was associated with a 30% reduced prevalence of high blood pressure." Why was it important to adjust for these variables? A4: The researchers may be concerned with confounding that occurs when the effects of two or more explanatory variables are not separated or when there are some explanatory variables that were not considered in a study, but that affect the value of the response variable.
sampling error
Sampling error is the error that results because a sample is being used to estimate information about a population. This type of error occurs because a sample gives incomplete information about a population.
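As a rough illustration, the sketch below compares a parameter (the population mean) with a statistic (a sample mean) and reports the difference as the sampling error. The population of ages is simulated, made-up data, and the sample size of 50 is an arbitrary choice.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: ages of 1,000 individuals (simulated data).
population = [rng.randint(18, 90) for _ in range(1000)]

# Parameter: a numerical summary of the whole population.
parameter = statistics.mean(population)

# Statistic: the same summary computed from a sample of 50 individuals.
sample = rng.sample(population, 50)
statistic = statistics.mean(sample)

# The sample gives incomplete information, so the two usually differ.
sampling_error = statistic - parameter
print(parameter, statistic, sampling_error)
```

If every individual in the population were measured, the statistic would equal the parameter and the sampling error would be zero. Nonsampling errors, such as data-entry mistakes, could still be present.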
PRACTICE: Suppose that a magazine predicted that Candidate A would defeat Candidate B in a certain election. The magazine conducted a poll of individuals listed in telephone directories, with a response rate of 23%. On the basis of the results, the magazine predicted that Candidate A would win with 57% of the popular vote. However, Candidate B won the election with about 62% of the popular vote. At the time of this poll, most households with telephones belonged to the party of Candidate A. Name two biases that led to this incorrect prediction.
Sampling bias: Using an incorrect frame led to undercoverage. Nonresponse bias: The low response rate caused bias.
yields
Similar individuals will not necessarily yield the same data. People can be different heights, different weights, different ages, and so on. In addition to this, a person's age will change over time, and their height and weight can change over time as well. Measuring two people's heights at the same time, or measuring one person's height at different times can yield different results.
closed vs open question
Closed questions limit the possible responses, which makes them easier to analyze. Open questions allow respondents to state exactly how they feel, but are harder to analyze due to the variety of answers and the chance of misinterpreting an answer.
population
The entire group of individuals to be studied is called the population.
frame
The frame is a list of the individuals in the population being studied. If the population of interest is all the students at a school, the frame would be a list of all the students currently attending that school. It is rare for frames to be accurate because frames are obtained periodically, whereas populations are constantly changing. For example, a frame that consists of all of the students in a school would be inaccurate as soon as any student leaves the school, or any new student joins the school.
simple random sampling
The most basic sample survey design is simple random sampling. A sample of size n from a population of size N is obtained through simple random sampling if every possible sample of size n is equally likely to occur. The sample is then called a simple random sample. To obtain a simple random sample, typically, each individual in the population is assigned a unique number between 1 and N, where N is the size of the population. Then n distinct random numbers from this list are selected, where n represents the size of the sample. To number the individuals in the population, one needs a frame, which is a list of all the individuals within the population.
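The numbering procedure described above can be sketched in Python as follows. The function name and the six-person frame are hypothetical; `random.sample` is the standard-library call that selects distinct items.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Return a simple random sample of n individuals from the frame."""
    rng = random.Random(seed)
    N = len(frame)
    # Assign each individual in the frame a unique number from 1 to N.
    numbered = {number: person for number, person in enumerate(frame, start=1)}
    # Select n distinct random numbers between 1 and N.
    selected_numbers = rng.sample(range(1, N + 1), n)
    # The individuals with those numbers form the sample.
    return [numbered[number] for number in selected_numbers]

# Hypothetical frame listing every individual in a small population.
frame = ["Ana", "Ben", "Cruz", "Dee", "Eli", "Fay"]
srs = simple_random_sample(frame, 2, seed=42)
print(srs)
```

Because the n numbers are chosen uniformly from all sets of n distinct numbers, every possible sample of size n has the same chance of being selected.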
response variable
The response variable is the variable of interest to be measured in the study. The value of the response variable is affected by the explanatory variable.
categories of observational studies
There are three major categories of observational studies: cross-sectional studies, case-control studies, and cohort studies.
variables
Variables are the characteristics of the individuals within the population. If variables did not vary, they would be constants, and statistical inference would not be necessary.
under-represented
When a part of the population is proportionally smaller in a sample than in its population, this part of the population has been under-represented. This could be caused by many different types of bias, or even by random chance.
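One way to check for under-representation is to compare each group's share of the sample with its share of the population. The sketch below assumes each individual carries a single group label; the function names and the renter/owner counts are made up for illustration.

```python
from collections import Counter

def proportions(groups):
    """Map each group label to its share of the list."""
    counts = Counter(groups)
    total = len(groups)
    return {group: count / total for group, count in counts.items()}

def under_represented(population_groups, sample_groups):
    """Return groups whose share of the sample is below their population share."""
    pop = proportions(population_groups)
    samp = proportions(sample_groups)
    return sorted(g for g in pop if samp.get(g, 0.0) < pop[g])

# Hypothetical population: 40% renters and 60% owners.
population_groups = ["renter"] * 40 + ["owner"] * 60
# Hypothetical sample: 20% renters and 80% owners.
sample_groups = ["renter"] * 2 + ["owner"] * 8

result = under_represented(population_groups, sample_groups)
print(result)  # ['renter']
```

A comparison like this shows only that a group is under-represented. It does not show whether the cause is bias in the sampling method or random chance.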
case-control studies
Case-control studies are observational studies that are retrospective, meaning that they require individuals to look back in time or require the researcher to look at existing records. In case-control studies, individuals who have a certain characteristic are matched with those who do not. A disadvantage of this type of study is that it requires individuals to recall information from the past. It also requires the individuals to be truthful in their responses. An advantage of case-control studies is that they are relatively inexpensive to conduct and can be done relatively quickly.
cross-sectional studies
Cross-sectional studies are observational studies that collect information about individuals at a specific point in time or over a very short period of time. For example, a researcher might want to assess the risk associated with smoking by looking at a group of people, determining how many are smokers and comparing the incidence rate of lung cancer of the smokers to the nonsmokers. A clear advantage of cross-sectional studies is that they are cheap and quick to do. However, cross-sectional studies have limitations. For the lung cancer study, it could be that individuals develop cancer after the data are collected, so the study will not give the full picture.