BANA 2010- Exam II


Sampling with replacement

"With replacement" means that after an item or individual has been selected for the sample, you place that item or individual back into the sampling frame. It therefore may be selected again for the sample.

Sampling without replacement

"Without replacement" means that after an item or individual has been selected for the sample, you do NOT place that item or individual back into the sampling frame so that they can not be selected again for the sample.

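To see the difference between the two approaches, here is a minimal Python sketch. The 10-member sampling frame and the seed are made up for illustration; `random.choices` draws with replacement and `random.sample` draws without replacement.

```python
import random

# Hypothetical sampling frame of 10 individuals, coded 1 through 10.
frame = list(range(1, 11))
rng = random.Random(2010)  # arbitrary fixed seed so the run is repeatable

# With replacement: each pick goes back into the frame, so repeats are possible.
with_replacement = rng.choices(frame, k=5)

# Without replacement: a picked individual cannot be picked again.
without_replacement = rng.sample(frame, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

The without-replacement draw can never contain the same individual twice, while the with-replacement draw may.
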
Sampling

Less expensive and less time-consuming than studying the entire population.

What are the differences between Primary research and Secondary research?

1. Purpose: Primary: for the problem at hand. Secondary: for other purposes.
2. Process: Primary: very involved. Secondary: rapid and easy.
3. Cost: Primary: high. Secondary: relatively low.
4. Time: Primary: long. Secondary: short.

5-Number Summary

A 5-number summary presents 5 key figures from a data set: the lowest value, Q1 (25th percentile), the median (50th percentile), Q3 (75th percentile) and the highest value. Here again is a data set we saw in a recent lecture (n = 16): 56 65 67 71 77 79 80 82 82 83 84 86 89 92 93 95. The 5-number summary is 56 71 82 89 95. The Q1 and Q3 values (71 and 89, respectively) were determined in Measures of Position, Pt. I. See if you agree with Mr. Eric's value of 82 for the median. Here is the 5-number summary annotated to help you interpret it: 56 = lowest value; 71 = Q1 (25th %ile); 82 = median; 89 = Q3 (75th %ile); 95 = highest value. Here's something else that a 5-number summary can tell you: It can give you an idea of the shape of the data distribution -- in other words, whether the data are distributed symmetrically or not.
PROPERTIES OF A SYMMETRICAL DISTRIBUTION: If the data are distributed symmetrically (mirroring each other on each side of the mean), then all 3 of the following are true:
1) The range from the distribution's lowest value to the median is the same as the range from the median to the highest value. In other words, median - smallest value = largest value - median.
2) The range from the smallest value to the value at the 25th percentile (Q1) equals the range from the value at the 75th percentile (Q3) to the largest value. Q1 - smallest value = largest value - Q3.
3) The range from Q1 to the median equals the range from the median to Q3. median - Q1 = Q3 - median.
These statements are NOT true for the data set we're working with in this lecture and therefore the data are NOT distributed symmetrically around the mean. Thus, the distribution must be skewed. But which way, left or right?
PROPERTIES OF SKEWED DISTRIBUTIONS: Well, the range from the smallest value (bottom of the distribution) to Q1 is 15 points (71 - 56), whereas the range from Q3 to the largest value is only 6 points (95 - 89). So there is a longer tail to the left of the median, which indicates a left- or negatively-skewed distribution.
HOW TO BEGIN: To construct a 5-number summary, start by placing the data in an ordered array. The ordered array will quickly show you the lowest value in the data set (which is the 1st figure in a 5-number summary) as well as the maximum value in the data set (which is the 5th figure in a 5-number summary). You also need an ordered array to determine the median, which is the 3rd figure in a 5-number summary.
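
The three symmetry checks and the tail comparison can be sketched in Python. The function name `describe_shape` is made up for this example, and the final fallback case (equal tails but some other pair unequal) is not covered in the notes, so its label is an assumption.

```python
def describe_shape(lowest, q1, median, q3, highest):
    """Classify a distribution's shape from its 5-number summary."""
    left_tail = q1 - lowest       # smallest value up to Q1
    right_tail = highest - q3     # Q3 up to largest value
    symmetric = (
        median - lowest == highest - median
        and left_tail == right_tail
        and median - q1 == q3 - median
    )
    if symmetric:
        return "symmetrical"
    if left_tail > right_tail:
        return "left-skewed"      # longer tail below the median
    if right_tail > left_tail:
        return "right-skewed"     # longer tail above the median
    # ASSUMPTION: the notes do not say what to conclude in this case.
    return "not symmetrical, direction unclear"

shape = describe_shape(56, 71, 82, 89, 95)
print(shape)  # left-skewed (left tail of 15 points vs. right tail of 6)
```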

Sample

A portion of a population selected for analysis. The goal is to generalize from the sample to the larger population. ex: Instead of surveying all full-time students, you would survey a portion of them, such as 250 students. The 250 students would be a sample, or a subset, of the population.

Systematic Error

A second category of error is systematic error, or bias, and it is more problematic. Bias results from the research process itself. The way you go about collecting data, or who you collect it from, can influence the results. The problem is that the researcher often doesn't realize that his or her data-collection methods are introducing bias. Bias slants the data in a particular direction. Thinking back to Data Collection Overview Pt. I, remember we saw that in early 2017 President Trump's lowest approval ratings came from live telephone interview polls, while his highest ratings resulted from online or automated telephone polls? That's an example of bias in the data. The method slants (biases) the results. Bias in the sample data means the researcher doesn't have an accurate estimate of the population parameter, and the population parameter is ultimately what we're interested in. Another example of bias is a scale that is not calibrated correctly. It might underestimate the weight of everyone in the home by 3 pounds, for instance. The point is that all the data collected (the weights of everyone in the home) are slanted. As yet another example of bias, questions on a written survey may be worded ambiguously such that members of the sample misunderstand them and provide inaccurate answers. Therefore, the sample members' answers may not really be representative of how the target population would respond.

Outliers

A value that is much different than the others. There can be more than one outlier.
Affected: range; mean; standard deviation and variance (they become distorted and less meaningful or useless).
Not affected: median.
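
A quick way to check which statistics an outlier distorts is to compute them with and without it. This sketch uses the exam scores from the course notes plus a made-up data-entry error (95 typed as 950), and relies only on Python's standard `statistics` module.

```python
import statistics

scores = [56, 65, 67, 71, 77, 79, 80, 82, 82, 83, 84, 86, 89, 92, 93, 95]
# Hypothetical outlier: the top score of 95 is mistyped as 950.
with_outlier = scores[:-1] + [950]

for label, data in (("original", scores), ("with outlier", with_outlier)):
    print(
        label,
        "| range:", max(data) - min(data),
        "| mean:", round(statistics.mean(data), 2),
        "| st. dev.:", round(statistics.stdev(data), 2),
        "| median:", statistics.median(data),
    )
```

The range, mean and standard deviation all jump, while the median stays where it was.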

Experiments

Allow the researcher to control the environment and thereby limit the variables that are involved. This is accomplished by using groups that are identical in every way except one: the variable you want to test.

Q1 Ranking

Also important to know is the value at the 25th percentile of the data distribution. It is the value which is ≥ 25% of the data when the data are placed in an ordered array from smallest to largest, such as these exam scores (n = 16): 56 65 67 71 77 79 80 82 82 83 84 86 89 92 93 95. To find the position (ranking) of the 25th percentile in a data set, start with this equation: Q1 ranking (25th percentile) = (n + 1) / 4. [Note: The formula above will be provided on the Official Cheat Sheet.] Q1 represents the end of the first quartile, which is up to and including the 25th percentile -- i.e., the bottom 25% of the values. ("Quartile" is related to the word "quarter," or one-fourth.) As pertains to our data set, Q1 ranking = (n + 1) / 4 = (16 + 1) / 4 = 4.25th value in the ordered array. This means that the 4.25th value from the lowest value in an ordered array is at the 25th percentile. The 4.25th value? Egads! When you get a ranking that isn't a whole number, such as 4.25, round it to the nearest whole number. (Values of < .50 are rounded down.) In this example, 4.25 would be rounded to 4.0, meaning the 4th value in the data when they are placed in an ordered array. The 4th value is 71, which is interpreted to mean that 25% of the class scored a 71 or lower on the exam, which is another way of saying 71 ≥ 25% of the data.
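
The ranking rule can be sketched in Python as follows. The function name is made up, and the notes only say that values below .50 round down, so rounding a ranking that ends in exactly .50 upward is an assumption here.

```python
import math

def q1_rank_and_value(data):
    """Return the Q1 ranking, (n + 1) / 4, and the value at that ranking."""
    ordered = sorted(data)               # ordered array, smallest to largest
    rank = (len(ordered) + 1) / 4
    # Round to the nearest whole number.
    # ASSUMPTION: a ranking ending in exactly .50 is rounded up.
    position = math.floor(rank + 0.5)
    return rank, ordered[position - 1]   # rankings start at 1, list indexes at 0

scores = [56, 65, 67, 71, 77, 79, 80, 82, 82, 83, 84, 86, 89, 92, 93, 95]
rank, q1 = q1_rank_and_value(scores)
print(rank, q1)  # 4.25 71
```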

Stratified Random Sampling

An important probability sampling method. You separate the sampling frame into different sub-populations, or strata, based on some characteristic (variable) of interest to you, such as income or gender. Then you take a random sample from each stratum, usually in proportion to the frequency of each stratum in the population. This ensures that all sub-groups of the population are represented in the sample.
EX: Shoppers with different income levels have different attitudes toward your canned carrot slices. You would stratify into different income levels: up to $24,999, $25k-$49,999, $50k-$74,999. The strata must not overlap.
When to use: 1. When sub-groups in a population differ from one another in a way that is important to you, such as income level affecting consumers' attitudes toward the product. 2. When determining whether one sub-group differs from others. 3. When you want to focus on the relationship between 2 or more sub-populations.
Advantages: 1. Greater accuracy. 2. Can use a smaller sample, which means less research time, effort and cost.
Disadvantage: Identifying the variable that should be used to create the strata (sub-populations).
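
Here is a minimal Python sketch of proportional stratified sampling. The function name, the 100-shopper frame and the bracket labels are all made up for illustration; `random.sample` stands in for whatever random selection method you would actually use within each stratum.

```python
import random

def stratified_sample(frame, stratum_of, sample_size, seed=0):
    """Take a random sample from each stratum, in proportion to its size."""
    rng = random.Random(seed)
    strata = {}
    for member in frame:
        strata.setdefault(stratum_of(member), []).append(member)
    sample = []
    for members in strata.values():
        # Proportional allocation; rounding may shift the total by a unit or so.
        k = round(sample_size * len(members) / len(frame))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical frame: 100 shoppers, each tagged with an income bracket.
frame = (
    [("low", i) for i in range(50)]
    + [("middle", i) for i in range(30)]
    + [("high", i) for i in range(20)]
)
sample = stratified_sample(frame, stratum_of=lambda m: m[0], sample_size=10)
counts = {b: sum(1 for m in sample if m[0] == b) for b in ("low", "middle", "high")}
print(counts)  # {'low': 5, 'middle': 3, 'high': 2}
```

Each bracket shows up in the sample in the same 50/30/20 proportion it has in the frame.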

Measure of Central Tendency

Are designed to give you an idea of a typical data value. Mean (X bar) = the average. Median = the value in the middle of the data. With 21 values, for example, the median is the 11th value, with 10 values on each side. When there is an even number of values, add the two middle values and divide by 2. Mode = the value that occurs most often.
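
All three measures are available in Python's standard `statistics` module. This short sketch applies them to the exam scores used elsewhere in these notes.

```python
import statistics

scores = [56, 65, 67, 71, 77, 79, 80, 82, 82, 83, 84, 86, 89, 92, 93, 95]

mean = statistics.mean(scores)      # sum of the values divided by n
median = statistics.median(scores)  # n is even, so the 2 middle values are averaged
mode = statistics.mode(scores)      # the value that occurs most often

print(mean, median, mode)  # 80.0625 82.0 82
```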

Parameters

Calculations on data for the entire population. Usually unknown because populations are too large to study. ex: You don't have time to ask every full-time student how many tweets they make per week.

Statistics

Calculations performed on data collected from samples. When research has been conducted properly (without influencing people's responses to questions), statistics calculated on sample data can be used to make inferences about parameters of the larger population. ex: If we know 3.6 is the average number of tweets per week among a sample of 250 students, inferential methods allow us to estimate how many weekly tweets are made by the population of full-time undergrads at CU.

Surveys

Common in the business world. Involve asking people about their views, behaviors and desires, either by an interview in person, over the phone, or by providing a questionnaire to fill out.

Population

Complete set of objects or people that we wish to reach conclusions about. ex: You want to know the average number of tweets made per week by full-time undergrad students at CU. The population is all the full-time undergrad students at CU.

Measure of Association

Correlation Coefficient: In the word correlation, you see "co-relation," i.e., the relationship between 2 variables. The correlation coefficient can range in value from -1.0 to +1.0. The correlation coefficient of a sample is symbolized by the letter r. For a population, it's symbolized by the Greek letter rho. When r > 0.0, the 2 variables are positively related. This means that as the value of one variable changes in a particular direction, the value of the other variable changes in the same direction. At the extreme, where r = 1.0 (Panel C, at the bottom of p. 137 in the illustration at the right), a change in the first variable's value lets you predict exactly how much the other variable's value will change. It'll fall on the line that's shown. When r = 1.0, it is said that the 2 variables are perfectly positively correlated. To summarize, when r > 0.0: positive relationship; direct relationship; variables' values change in the same direction (they both increase, or they both decrease). A negative sign in front of a correlation coefficient indicates an inverse relationship, meaning that when the value of one variable changes in a particular direction, the value of the other variable changes in the opposite direction. This is also referred to as a negative or indirect relationship. When r < 0.0: negative relationship; indirect relationship; inverse relationship; variables' values change in opposite directions (as one increases, the other decreases, and vice versa). If r = -1.0 (perfectly negatively correlated; see Panel A on p. 137), a change in one variable's value lets you predict exactly how much the other variable will change in the opposite direction. It'll fall on the line shown in Panel A. If r is negative but not -1.0, it means that there is a tendency for the variables' values to change in opposing directions, but you can't predict exactly how much. Flip to p. 139 and look at the top 2 illustrations. 
On the left (Panel A), r = -0.90 (which is pretty close to -1.00) and you can clearly see a strong tendency for the value of the variable on the Y-axis to decrease as the value of the variable on the X-axis increases. But in Panel B, where r = -0.60, the tendency is not as strong. In other words, you can't predict the change in Y that results from a change in X as well as you could when r = -0.90. By the time you get to Panel C in the 2nd row, where r = -0.30, it's not very clear at all that there is an inverse relationship between X and Y, even though the correlation coefficient tells us there is because it has a negative sign. What about a correlation coefficient of 0.0? When r = 0.0, it means you can't predict what'll happen to one variable based on how the other is changing. There is no relationship between the 2 variables -- and no correlation. Be sure to check out Panels D, E and F on p. 139 to see what positive correlations look like when plotted out. Note that when r > 0.0, as the value of the variable on the X-axis increases, so does the value of the variable on the Y-axis. And the relationship grows stronger as r approaches 1.0. To sum up, the correlation coefficient's absolute value tells you the degree to which the variables are related -- in other words, how strong the relationship between the 2 variables is/how well you can predict a change in one variable's value from a change in a second variable's value. The sign (+ or -) of the correlation coefficient indicates whether the relationship is direct or indirect. NO CAUSATION While a correlation coefficient tells you how strong the association between 2 variables is, it does NOT permit you to say that one variable causes the change in the other variable. Even if r = 1.0, you can't say that the change in one variable caused the change in the other variable. It's possible that some third, unknown variable is the cause. The same is true when r = -1.0. 
You can NOT say that the change in one variable caused the other variable's value to move in the opposite direction. Again, it is possible that some third, unknown variable caused the change.
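
The notes above do not give a formula for r, so the following Python sketch uses the standard Pearson formula as an assumption about what the course has in mind. The function name and all three data sets are made up to show a perfect positive, a perfect negative and an in-between case.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r (standard Pearson formula)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
perfect_positive = [2 * v + 1 for v in x]   # every point falls on a rising line
perfect_negative = [10 - 3 * v for v in x]  # every point falls on a falling line
noisy_positive = [2, 1, 4, 3, 5]            # tends to rise, but not exactly

print(round(pearson_r(x, perfect_positive), 4))  # 1.0
print(round(pearson_r(x, perfect_negative), 4))  # -1.0
print(round(pearson_r(x, noisy_positive), 4))    # 0.8
```

Even the r = 1.0 case says nothing about causation; it only says the points fall exactly on a line.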

Secondary Research

Data collected for some purpose other than the problem at hand. Using data that is already available.

Secondary Data

Data that has been collected by someone else

Descriptive Statistics

Describe the data that have been collected. Ex: What shape would the graph have? Are the data values very close to one another?

Advantages and Disadvantages of Observational Studies

Disadvantages: 1. Don't have control over all of the variables at play. You may not even be aware of all of them. 2. Can produce inaccurate data

Disadvantages and Advantages of Surveys

Disadvantages: 1. Questions must be worded carefully so they do not influence the individual's response in any way. 2. You don't know if the people filling out the surveys are telling the truth. Advantages: 1. You can ask exactly what you want to find out.

Advantages and Disadvantages of Experiments

Disadvantages: 1. Take place in artificial settings that may not reflect the real world, especially when the experiments involve people. Therefore, the ability to apply the results of an experiment is often limited. 2. Cost and time: experiments require both. 3. Experiments are used more in science and social science than in business.

Systematic sampling

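Every kth member of the sampling frame is chosen for the sample. The value of k is determined by dividing the size of the population (N) by the size of the sample (n). You randomly select a starting unit from the first k units of the sampling frame and then take every kth unit thereafter. ex: You have a sampling frame of 100 and want a sample of 10, so k = 100 / 10 = 10. You randomly choose a starting point among the first 10 members and then take every 10th member after it.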
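
Systematic sampling as it is usually described (a random start among the first k members, then every kth member after that) can be sketched in Python. The function name and the frame of 100 coded members are made up, and using integer division when N / n is not a whole number is an assumption.

```python
import random

def systematic_sample(frame, n, seed=0):
    """Pick a random start among the first k members, then every kth member."""
    rng = random.Random(seed)
    k = len(frame) // n        # k = N / n (ASSUMPTION: rounded down if not whole)
    start = rng.randrange(k)   # index of the random starting member, 0 to k - 1
    return [frame[start + i * k] for i in range(n)]

frame = list(range(1, 101))    # hypothetical frame: members coded 1 to 100
sample = systematic_sample(frame, n=10)
print(sample)
```

Only the starting point is random; once it is chosen, the rest of the sample is fixed by k.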

Biased Sample

Has characteristics different from those of the population it is intended to represent; overemphasizes or underemphasizes 1 or more characteristics compared to the population it is intended to represent. Leads to inferences about the population represented by the sampling frame, not the target population.

Observational Studies

Take place in a natural setting, which allows the researcher to collect data on behavior in the real world. ex: watching shoppers in a store. Do they look at special displays or pass by with a yawn?

Interquartile Range

In everyday language, "inter" means "between," right? The interquartile range is "between" the first and third quarters. It is the middle 50% of the data, from the 25th percentile to the 75th percentile. The interquartile range involves subtraction: interquartile range = Q3 - Q1. The larger the interquartile range, the more spread out the data IN THE MIDDLE are, but not necessarily in the extremes. The interquartile range is not distorted by outliers because they are not involved in the calculation of either Q1 or Q3, which is what the interquartile range is based on.

Measures of Dispersion

Indicate how spread out the data are.
Range = highest value - lowest value.
Variance (S2) and standard deviation (S) indicate how spread out the data are, from the perspective of each value's difference from the mean.
Sum of squares = the sum of the squared differences between each value and the mean. Used to compute the sample variance (S2) and the sample standard deviation (S).
Variance (S2) = sum of squares / (n - 1).
Standard deviation (S) = square root of the variance. CAN NEVER BE NEGATIVE.
*THE LARGER THE STANDARD DEVIATION & VARIANCE, THE MORE SPREAD OUT THE DATA ARE*
**Prefer to use the standard deviation because it is in the same units as the original data (and the mean), whereas the variance is in squared units. For looking at a distribution, you can use either, so long as you are clear about which one you are using.
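
The chain from sum of squares to variance to standard deviation can be worked through in Python. The five data values are made up for the example; the `statistics` module is used only to cross-check the hand computation.

```python
import math
import statistics

data = [4, 8, 6, 2, 10]  # hypothetical values; their mean is 6
mean = sum(data) / len(data)

# Sum of squares: squared difference between each value and the mean, added up.
sum_of_squares = sum((x - mean) ** 2 for x in data)

variance = sum_of_squares / (len(data) - 1)  # sample variance, in squared units
std_dev = math.sqrt(variance)                # same units as the original data

print(sum_of_squares, variance, round(std_dev, 4))  # 40.0 10.0 3.1623
```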

Non-Response Bias

It results from failing to include in the sample people who declined to participate in your research. Non-participants may differ from those who participated in the research, which makes the sample unrepresentative of the target population. Therefore, the statistics calculated on the data differ from the true population parameter you wish to estimate. So it would be inappropriate to generalize the results to the target population. Here's a diagram of this (the arrows mean "leads to"): members of the target population who declined to participate in the research differ from those who did participate in the research ----> sample not representative of the target population ----> sample statistics not a good estimate of target population parameter ----> inappropriate to generalize results from the sample to the target population

Primary Data

Newly collected data designed to answer a specific question you may have.

Random Sampling Error

Now that you know about sampling methods, let's suppose you use a nice random sampling technique and successfully select a sample that is representative of the target population. If you collect data on this awesome sample - height, for example -- do you think the average height in the sample will be exactly the same as the average height in the population? No, it probably won't be. Even if the sample is representative of the target population in terms of age, ethnicity and every other relevant variable, the average height of the individuals in a sample would be expected to differ from the average height of the entire population. It might be pretty close, but it probably won't be exactly the same. The difference between the value for the sample and the value in the target population is called error. There are some sub-categories of error; if you add them together, you get total error, although statisticians don't really use the term total error. They just say "error" and appreciate that it includes a few things. Random sampling error is due to the sample being an inexact representation of the population. Using inferential statistical methods we can actually quantify the amount of random sampling error that's in our statistics. Random sampling error is sometimes called random error or non-systematic error. Very importantly, the impact of random error can be reduced by using a larger sample size.
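
A small simulation shows random sampling error shrinking as the sample grows. Everything here is made up for illustration: the population of 10,000 heights, the seed, the sample sizes and the helper name `typical_error`.

```python
import random
import statistics

rng = random.Random(42)  # arbitrary fixed seed so the run is repeatable

# Hypothetical population: 10,000 heights in inches.
population = [rng.gauss(67, 4) for _ in range(10_000)]
parameter = statistics.mean(population)  # the true population mean

def typical_error(sample_size, trials=200):
    """Average gap between a sample mean and the population mean."""
    gaps = []
    for _ in range(trials):
        sample = rng.sample(population, sample_size)  # simple random sample
        gaps.append(abs(statistics.mean(sample) - parameter))
    return statistics.mean(gaps)

small = typical_error(10)
large = typical_error(1000)
print("typical error, n = 10:  ", round(small, 3))
print("typical error, n = 1000:", round(large, 3))
```

The sample mean almost never equals the population mean exactly, but the typical gap is much smaller for the larger samples.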

Q1 and Q3 Outliers

One final point: Q1 and Q3 don't get distorted by outliers like many other descriptive statistics do. The reason they don't get distorted by outliers is that their computation does not involve the highest or lowest values (i.e., the outliers). The rankings of these descriptive statistics are only affected by n. (Look back at the formulas for Q1 and Q3.) Regarding Q1, it doesn't really matter how low the lowest value is; Q1 will be the same whether the lowest value is very very very very low, or just kind of low. Similarly, Q3 will be the same whether the highest value is quite a bit above the 2nd-highest value, or just a little bit above it. Q3 occupies the same position in the data regardless of how high the top value is.
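
To check this claim, the sketch below computes Q1 and Q3 with the course's ranking rule on the exam scores, then again after replacing the lowest and highest scores with made-up extreme values. The function name is hypothetical, and rounding a ranking of exactly .50 upward is an assumption.

```python
import math

def quartiles(data):
    """Return (Q1, Q3) using rankings (n + 1) / 4 and 3 x (n + 1) / 4."""
    ordered = sorted(data)
    n = len(ordered)

    def value_at(rank):
        # Nearest whole-number ranking (ASSUMPTION: exactly .50 rounds up).
        return ordered[math.floor(rank + 0.5) - 1]

    return value_at((n + 1) / 4), value_at(3 * (n + 1) / 4)

scores = [56, 65, 67, 71, 77, 79, 80, 82, 82, 83, 84, 86, 89, 92, 93, 95]
# Hypothetical extremes: lowest score becomes 5, highest becomes 500.
extreme = [5] + scores[1:-1] + [500]

print(quartiles(scores))   # (71, 89)
print(quartiles(extreme))  # (71, 89)
```

Because Q1 and Q3 are unchanged, the interquartile range (89 - 71 = 18) is unchanged as well.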

Voluntary response sampling

Produces a biased sample. This relies on members of the sampling frame to do something in order to be part of the sample, such as fill out a survey you sent them. If you've ever received an e-mail asking you to complete some silly survey, did you rush to fill it out and reply back, or did you hit delete just as fast as your fingers can fly? If you rushed to fill it out, it's probably because you felt strongly about the survey's topic, either positively or negatively. If you hit delete, it's probably because you weren't too interested in the topic at hand. So voluntary response sampling often attracts only the portion of the population that has a strong interest in the topic, producing a sample that is biased if you are interested in how the total population feels about the topic. **You can NOT generalize results obtained from these kinds of sampling methods to a larger population.

Measures of Position PT IV

Read text 128. Be sure to read the text for this topic. Be sure to go through this topic's SELFIEQ. Know: what a boxplot is used to graphically present; other names for a boxplot; how to read/interpret a boxplot; how to construct a 5-number summary from a boxplot; how to determine the shape of the data distribution from a boxplot.

Understand which category of statistical methods is used to generalize the results of research

A sampling frame that is representative of the target population leads to an unbiased sample, which permits appropriate generalizations about the target population using inferential statistical methods.

Selection Bias

Selection bias, as we saw, results from choosing a sample that is not representative of the target population. That would certainly contribute to the statistics calculated on the sample differing from the population parameters being estimated. Here's a little diagram of this (the arrows mean "leads to"): sample not representative of the target population ----> sample statistics not a good estimate of target population parameter ----> inappropriate to generalize results from the sample to the target population

Big Data

Tends to include primary and secondary sources

Q3 Ranking

That's the bottom 25%. Nobody wants to be in the bottom 25%. What about the top 25%? You can find it with the following formula: Q3 ranking (75th percentile) = 3 x (n + 1) / 4. For the exam scores (n = 16), Q3 ranking = 3 x (16 + 1) / 4 = 12.75th ranked value in the data. Q3 stands for the third quartile or 75th percentile, meaning Q3 ≥ 75% of the scores. Again, when a non-whole number is obtained, round to the nearest whole number. Thus, rounding 12.75 to the nearest whole number gives 13.0. This means the 13th value in the ordered array is at the 75th percentile. The 13th value is 89. This means 75% of the class scored an 89 or lower on the exam. It also implies that 25% scored an 89 or higher. In other words, Q3 is the cut-off between the 50th through 75th percentiles (the third quartile) and the top 25% (75th through 99th percentiles, which is the top quartile).

Sampling Frame

The list or set of population members from which the sample is drawn. ex: individuals 18 and over on the 16th Street Mall on a weekday during lunch hour. A sampling frame that is representative of the target population leads to an unbiased sample, which permits appropriate generalizations (inferences) about the target population using inferential statistical methods.

Shape

The pattern to the distribution of data values throughout the entire range of all the values.
Skewness: Measures the extent to which the data values are not symmetrical around the mean.
Symmetrical (Normal): The two halves of the data distribution are identical. The mean and median are equal, and the mode is also equal to them. Mean = Median, with zero skewness.
Negative skew: The extremely low values (outliers) reduce the average compared to a symmetrical distribution, which causes the average (mean) to sink below the middle value (median). Mean < Median.
Positive skew: The extremely high values (outliers) increase the average compared to a symmetrical distribution, which pulls the mean above the median. Mean > Median.
**When the data are skewed, the median is often a better indicator of central tendency. ex: Cost of homes being sold. The data are positively skewed because you can't sell a home for less than $0 and there is no upper limit on how high the price of a home can go.

Target Population

The entire population that is of interest to you in a study.

Simple random sampling

This is a commonly used method. With this method, every member of the sampling frame has an equal probability of being chosen for the sample. ex: Begin by assigning a code to each member of the sampling frame -- in other words, give a code to each individual on the list. If you are going to use a table of random numbers, then the code's length has to match the number of digits in the size of the sampling frame. In other words, if the sampling frame consists of 168 individuals, then there are 3 digits to the size of the sampling frame (units, tens and hundreds), so the codes run from 001 to 168. If the sampling frame consists of 1,538 individuals, then its size is 4 digits long (units, tens, hundreds and thousands). If the sampling frame consists of 94 individuals, then its size is 2 digits long (units and tens).
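
The coding step can be sketched in Python. The frame of 168 made-up names is hypothetical, and `random.sample` is used here as a stand-in for reading codes from a table of random numbers; it gives every member an equal chance of being chosen.

```python
import random

# Hypothetical sampling frame of 168 individuals.
frame = [f"Person {i}" for i in range(1, 169)]

# Code length = number of digits in the size of the frame (168 -> 3 digits).
width = len(str(len(frame)))

# Assign zero-padded codes 001, 002, ..., 168.
codes = {f"{i:0{width}d}": person for i, person in enumerate(frame, start=1)}

rng = random.Random(7)  # arbitrary fixed seed so the run is repeatable
chosen_codes = rng.sample(sorted(codes), k=5)
sample = [codes[c] for c in chosen_codes]

print(chosen_codes)
print(sample)
```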

Probability Sampling

To be able to generalize what you've found out about a sample to the target population, you need to use a probability sample. There are 4 types of probability samples that we'll learn about and which are discussed on pages 25 - 26 of your text. All involve selecting members of the sample based on known probabilities, as your text indicates on p. 24.

Goal of inferential statistics

To calculate statistics on data collected from a sample, or a sub-set of the population, and then use the sample statistics to estimate the true population parameter.

Primary Research

To collect data yourself or collect data for your company. "data originated by the researcher for the specific purpose of addressing the research problem"

Measures of position

Describe important rankings within the data. We have already learned about one measure of position, actually. The median is the middle value (a position) in the data. Thus, the median actually does double-duty as a measure of position as well as being a measure of central tendency. Also important to know is the value at the 25th percentile of the data distribution. It is the value which is ≥ 25% of the data when the data are placed in an ordered array from smallest to largest, such as these exam scores (n = 16): 56 65 67 71 77 79 80 82 82 83 84 86 89 92 93 95.

Convenience Sampling

The first of the 2 sampling methods we shall look at that lead to a biased sample and hence biased results. It involves choosing for your sample only easily accessible (convenient) members of the target population. ex: Standing on the 16th Street Mall and asking your survey questions to whoever walks by. This leads to a biased sample because certain sub-groups of the population may be under-represented, and the data you collect will suffer from selection bias, a tilt to the data caused by selecting a sample that is not representative of the target population. If your target population is all residents of the Denver metro area, it is possible that certain sub-groups may not stroll down the 16th Street Mall on a regular basis and therefore are not represented in your sample. For example, people who live in the suburbs could be under-represented in a sample collected this way. Similarly, people who work full-time may be under-represented if you conduct your research at 10 AM on a work day. **You can NOT generalize results obtained from these kinds of sampling methods to a larger population.

