STATISTICS ECON-E 370 EXAM 1

Inferential Statistics (infer about population from sample)

making claims or conclusions about the data based on a sample (makes statement about population)

mean and SD of a binomial distribution

mean of a binomial distribution: μ= np standard deviation of a binomial distribution: σ = sqrt (npq) where μ = the mean of the binomial distribution σ = the standard deviation of the binomial distribution n = # of trials p = probability of a success q = probability of a failure
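A quick way to check these two formulas is a minimal Python sketch. The values n = 9 and p = 0.35 are borrowed from the apartment renter exercise in this set, and the function name is made up for illustration.

```python
import math

def binomial_mean_sd(n, p):
    """Return (mean, standard deviation) of a binomial distribution B(n, p)."""
    q = 1 - p                   # probability of a failure
    mean = n * p                # mu = np
    sd = math.sqrt(n * p * q)   # sigma = sqrt(npq)
    return mean, sd

mean, sd = binomial_mean_sd(9, 0.35)
print(round(mean, 2), round(sd, 5))
```

The printed values should agree with the mean and standard deviation worked out by hand in the Excel exercise for chapter 5-4.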

horizontal bar chart

y-axis = groups x-axis = frequency

descriptive and inferential statistics

descriptive statistics (describe data) -collecting, summarizing, and displaying data (reported based on observations) inferential statistics (infer about population from data) -making claims or conclusions about the data based on a sample (makes statement about population) ex: (from bottom of slide 16) observed sample statistic (known)--> estimated population parameter (unknown, but can be estimated from sample evidence) descriptive statistics- compute avg. income of 50,000 (you describe data) inferential statistics- estimate avg. (statistic) income of US population (parameter) based on avg. of 50,000 (you infer parameter using statistic)

The median

"hey diddle diddle the median's the middle" median- the value in the data set for which half the observations are higher and half are lower (think of the value in the middle when the data items are arranged in ascending order) **the median is not sensitive to outliers (values that are much higher or lower than most of the data) steps: 1. sort data 2. value in the middle rule of thumb (to find the median): 1. when there are an odd number of data values, the median is always the middle value (in the SORTED data set) 2. when there are an even number of data values, the median is an average between the two middle values (in the SORTED data set) ex: odd data set 27, 21, 27, 34, 45, 50, 28 1. sort 21, 27, 27, 28, 34, 45, 50 2. median is 28 ex: even data set 145, 157, 170, 182, 204, 209 1. sort 145, 157, 170, 182, 204, 209 2. two middle (170+182)/2 = median is 176
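The sort-then-take-the-middle rule can be sketched in Python as below. The helper name `median_by_hand` is hypothetical; the two data sets are the ones from this card, and `statistics.median` from the standard library is used only as a cross-check.

```python
import statistics

odd_data = [27, 21, 27, 34, 45, 50, 28]
even_data = [145, 157, 170, 182, 204, 209]

def median_by_hand(values):
    """Sort the data, then take the middle value (or average the two middle values)."""
    ordered = sorted(values)             # step 1: sort
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                       # odd count: the single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the two middle values

print(median_by_hand(odd_data))    # 28
print(median_by_hand(even_data))   # 176.0
print(statistics.median(odd_data), statistics.median(even_data))
```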

the mode

"the mode is the one that you see the most" mode- the value that appears the most often in a data set (value that occurs with the greatest frequency) -if no data value or category repeats more than once, then we say that the mode does not exist -more than one mode can exist if two or more values tie for most frequent -if the data have exactly two modes, the data are bimodal -if the data have more than two modes, the data are multimodal

range

"the range is the difference between" the simplest measure of variation range = highest value (max) - lowest value (min) advantage: -easy to calculate and understand disadvantages: -based on two numbers in the data set and ignores the way in which data are distributed -sensitive to outliers

continuous probability distribution

-a continuous random variable can assume any value in an interval on the real line or in a collection of intervals (ex: any value between 0 and 0.1. There is an infinite number of values in the given interval, so it's impossible to assign a probability to each individual value) -it is not possible to talk about the probability of the random variable assuming a particular value: because there are an infinite number of possible values, the probability of one specific value occurring is theoretically equal to zero - P(x = x0) = 0 -instead we talk about the probability of the random variable assuming a value within a given interval -P(x>x1), P(x>=x1), P(x<x1), P(x<=x1) -P(x1 (<,<=) x (<,<=) x2)

Normal Probability Distributions

-a distribution's mean and standard deviation describe its shape -changing μ shifts the distribution left or right (horizontal shift) -changing σ increases or decreases the spread (the curve gets wider and flatter or narrower and taller; it does not shift) -sometimes called skinny/fat curve -it is determined by the thickness of the tails. A skinnier curve is tighter and taller around the mean

probability density function

-denoted by f(x) (continuous random variable) is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking on any one value -f(x) is used to calculate probabilities but the value of f(x) is not a probability per se -the probability that x takes on a value between some lower value x1 and some higher value x2 can be found by computing the integral of the probability density function f(x) over the interval from x1 to x2 -graphically, it is equivalent to computing the area under the graph of f(x) over the interval from x1 to x2

elements, variables and observations

-elements are the entities on which data are collected -a variable is a characteristic of interest for the elements -the set of measurements obtained for a particular element is called an observation (a data set with n elements has n observations) EXAMPLE ON PAGE 2 NOTEBOOK

examples of how businesses use statistics

-marketing research (customer surveys) -advertising (study of TV viewing habits) -operations (quality control, reliability) -finance and economics (data on income, credit risk, unemployment)

some useful notes for chapter 6-1

-the probability of the random variable falling within a particular range of values can always be expressed in terms of cumulative probabilities (ex: P(x>x1) can be expressed as 1 - P(x<=x1). This is useful when using Excel) -for a continuous random variable, P(x<=x1) = P(x<x1), because P(x<=x1) = P(x=x1) + P(x<x1) = 0 + P(x<x1) -the same holds for intervals with inclusive boundaries

frequency distributions

1) define groups 2) count observations for each group -indicates the number of occurrences of various categories -techniques are similar to frequency distributions with quantitative data --> grouping and counting -we can construct a relative frequency distribution (same idea as for the quantitative data) -cumulative (order) relative frequency does not really make sense (specifically for nominal (no order) data)

Primary Data Collection Methods

1. Direct observation or focus group -observing subjects in their natural environment -ex: watching to see if drivers stop at a stop sign 2. Experiments -treatments are applied in controlled conditions -ex: crop growth from different plots using different fertilizers 3. Surveys or questionnaires -subjects are asked to respond to questions or discuss attitudes -ex: email surveys to customers to assess service quality

properties of mean, median, and mode

1. Linear Transformation -suppose that for random variable X, mean=10, median=15, mode=20. Then, for random variable Y=2X+10, mean=30, median=40, mode=50. 2. Qualitative Data -for qualitative data, mode is widely used 3. Outliers -except for the small sample case, median and mode are insensitive to outliers. Mean may be sensitive to outliers, but note that 'Range' is the measurement most influenced by outliers. In the presence of outliers, median is recommended as the measure of central tendency 4. Existence -for any quantitative data, mean and median exist. Mode is uncertain: it may not exist in your data set, or you may find more than one mode in your data set
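The linear transformation property can be checked on a small data set. The five values below are made up so that mean=10, median=15 and mode=20, matching the numbers on this card; they are not from the course material.

```python
import statistics

# Hypothetical data chosen so that mean = 10, median = 15, mode = 20
x = [-10, 5, 15, 20, 20]
y = [2 * value + 10 for value in x]   # Y = 2X + 10 applied to every observation

for name, data in (("X", x), ("Y", y)):
    print(name,
          statistics.mean(data),
          statistics.median(data),
          statistics.mode(data))
```

Each measure of X is doubled and then increased by 10, which is where the 30, 40 and 50 for Y come from.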

measures of central tendencies review

1. Mean -discrete/continuous -always exists 2. Median (use when there are outliers) -discrete/continuous -always exists 3. Mode -discrete/continuous/qualitative -may not exist

normal probability distribution

1. bell shaped 2. symmetric (mean=median) 3. the entire family of normal probability distribution is defined by its mean and standard deviation 4. the highest point on the normal curve is at the mean, which is also the median and mode 5. the mean can be any numerical value 6. the standard deviation determines the width of the curve: larger values result in wider, flatter curves 7. (property of continuous random variable) probabilities for the normal random variable are given by areas under the curve. The total area under the curve is 1 (.5 to the left of the mean and .5 to the right)

probability distributions

1. discrete probability distributions -listing of all the possible outcomes of an experiment for a discrete random variable, along with the relative frequency (probability) of each outcome -describes how probabilities are distributed over the values of the random variable -can be represented by a table (relative frequency distribution), graph (histogram), or formula -is defined by a probability function, denoted by P(x) for a discrete variable (f(x) is sometimes used, mostly for a continuous variable), which provides the probability for each value of random variable x -a probability distribution may be shown graphically: the values of x are placed on the horizontal axis and the probabilities P(x) are on the vertical axis. A bar is drawn so that its height equals P(x) -rules for discrete probability distributions: 1) the probability of each value of x, P(x), must be between 0 and 1 (inclusive): 0<=P(x)<=1, for all values of x. 2) the sum of the probabilities for all possible values of x in the distribution must be 1 (ex: with three possible values, P(1)+P(2)+P(3) = 1) 2. continuous probability distributions **slide 68

descriptive statistics for characterizing two variables

1. scatter plots -graphical tool used to determine if two variables are related. Each point represents a pair of known values of the two variables for one observation -in the relationship, we usually distinguish between dependent and independent variables -the dependent variable: influenced by changes in the independent variable, denoted by y, placed on the vertical axis -the independent variable: used to explain changes in the dependent variable, denoted by x, placed on the horizontal axis -positive relationships: points are clustered together along a trend line with a positive slope -negative relationships: points are clustered together along a trend line with a negative slope -no relationship: data are randomly scattered with no discernible pattern 2. numerical measures of association between two variables -two kinds: sample covariance and sample correlation coefficient (sample correlation) -sample covariance: measures the direction of the linear relationship between two variables, denoted by Sxy (notebook page 12). A positive value implies a positive linear relationship. A negative value implies a negative linear relationship. The covariance is zero if y and x have no linear relationship -sample correlation coefficient (sample correlation): denoted by Rxy, or r (notebook page 12), measures both the strength (size) and direction (sign) of the linear relationship between two variables. The values of r range from -1.0, the strongest possible negative relationship, to +1.0, the strongest possible positive relationship. When r=0, there is no linear relationship between variables x and y. If the absolute value of r is close to 1, it is a strong (high) relationship. If the absolute value of r is close to 0, it is a weak (low) relationship.
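The formulas themselves are on notebook page 12 and are not reproduced here, so the sketch below assumes the usual textbook definitions: sample covariance divides the sum of cross-deviations by n-1, and the sample correlation divides the covariance by the product of the two sample standard deviations. The data and function names are hypothetical.

```python
import math

def sample_covariance(x, y):
    """Sum of (x - x_bar)(y - y_bar), divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Covariance scaled by both sample standard deviations, so -1 <= r <= 1."""
    s_x = math.sqrt(sample_covariance(x, x))   # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# Made-up paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = sample_covariance(x, y)
r_xy = sample_correlation(x, y)
print(s_xy, round(r_xy, 4))
```

A positive covariance only says the relationship slopes upward; the correlation adds how strong it is, because it is unit-free and bounded by -1 and +1.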

Excel Exercise Chapter 6-4

According to a recent survey by Smith Travel Research, the average daily rate for a luxury hotel in the US is $237.22. Assume the daily rate follows a normal probability distribution with a standard deviation of $21.45. What's the probability that a randomly selected luxury hotel's daily rate will be...
a. less than $250 per night?
1. x~N(237.22, 21.45)
2. mean = 237.22, SD = 21.45
3. P(x<250)
4. "=norm.dist(250, 237.22, 21.45, 1)" = 0.72435
b. more than $260?
1. P(x>260)
2. same as 1-P(x<260)
3. "=norm.dist(260, 237.22, 21.45, 1)" = 0.85588
4. take the difference: 1-0.85588 = 0.14412
c. between $210 and $240?
1. P(210 < x < 240)
2. P(x<big) - P(x<small)
3. big = 240, small = 210
4. for 240, do "=norm.dist(240, 237.22, 21.45, 1)" = 0.55156
5. for 210, do "=norm.dist(210, 237.22, 21.45, 1)" = 0.10222
6. take the difference: 0.55156-0.10222 = 0.44934
d. the managers of a local luxury hotel would like to set the hotel's average daily rate at the 80th percentile, which is the rate below which 80% of hotels' rates are set. What rate should they choose for their hotel?
1. P(x<c)=0.8
2. "=norm.inv(0.8, 237.22, 21.45)" = 255.273
3. to verify the value, do "=norm.dist(255.273, 237.22, 21.45, 1)" = 0.8
e. the managers of a local luxury hotel consider price a signal of the hotel's prestige and would like to set the hotel's average daily rate no lower than the rates of the most expensive 10% of hotels. What rate should they choose for their hotel?
1. P(x>c)=0.1
2. same as P(x<=c) = 1-P(x>c) = 0.9
3. do "=norm.inv(0.9, 237.22, 21.45)" = 264.709
4. to verify, do "=norm.dist(264.709, 237.22, 21.45, 1)" = 0.9
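The same five answers can be reproduced outside Excel. This is a sketch using Python's standard-library `statistics.NormalDist`, where `cdf` plays the role of norm.dist with the last argument set to 1 and `inv_cdf` plays the role of norm.inv; the variable names are made up.

```python
from statistics import NormalDist

rate = NormalDist(mu=237.22, sigma=21.45)   # x ~ N(237.22, 21.45)

p_less_250 = rate.cdf(250)                    # a. P(x < 250)
p_more_260 = 1 - rate.cdf(260)                # b. P(x > 260) = 1 - P(x < 260)
p_210_to_240 = rate.cdf(240) - rate.cdf(210)  # c. P(x < big) - P(x < small)
pct_80 = rate.inv_cdf(0.80)                   # d. c such that P(x < c) = 0.8
top_10_cutoff = rate.inv_cdf(0.90)            # e. c such that P(x > c) = 0.1

for label, value in [("a", p_less_250), ("b", p_more_260), ("c", p_210_to_240),
                     ("d", pct_80), ("e", top_10_cutoff)]:
    print(label, round(value, 5))
```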

Excel exercise chapter 5-4

According to a survey by Apartment.com, 35% of apartment renters in 2013 had owned a home at one point. Consider a random sample of 9 apartment renters.
x~B(9, 0.35), range of x: x = 0,1,2,...,9
a. What is the probability that exactly 3 renters from this sample previously owned a home?
1. make two columns: "x" and "number of renters who previously owned a home"
2. set up your two columns like this: x / # of renters... n / 9 p / 0.35
3. then for P(x=3), do "=binom.dist(3 (number of successes), 9 (value of n), 0.35 (value of p), 0)"
4. Answer = 0.27162
b. What is the probability that less than 4 renters from this sample previously owned a home? (use BINOM.DIST.RANGE() function and use BINOM.DIST() function and compare solutions)
1. P(x<4), which is the same as P(x<=3)
2. P(x=0) + P(x=1) + P(x=2) + P(x=3) / min. 0 and max. 3
3. "=BINOM.DIST.RANGE(9 (value of n), 0.35 (value of p), 0 (min. value of x), 3 (max. value of x))"
4. Answer = 0.60889
5. OR do "=BINOM.DIST(3 (bc P(x<=3)), 9 (value of n), 0.35 (value of p), 1 (cumulative; use 1 every time you want P(x<=value)))" and get the same answer
c. What is the probability that 6 or 7 renters from this sample previously owned a home? (use both BINOM.DIST.RANGE() and BINOM.DIST() functions)
1. P(x=6) + P(x=7)
2. min. value is 6 / max. value is 7
3. do "=BINOM.DIST.RANGE(9 (value of n), 0.35 (value of p), 6 (min. value of x), 7 (max. value of x))" and the answer is 0.05219
4. OR do "=BINOM.DIST(6 (min. value of x), 9 (value of n), 0.35 (value of p), 0)" = 0.04241, then do "=BINOM.DIST(7 (max. value of x), 9 (value of n), 0.35 (value of p), 0)" = 0.00979. Then add them: the unrounded sum is 0.05219, which is the same
NOW DO THE SAME BUT CALCULATE CORRESPONDING PROBABILITIES IN A DIFFERENT WAY
d. generate an entire binomial distribution using BINOM.DIST() and verify that it's a legitimate probability distribution
1. do 2 columns (1st column / 2nd column): x / blank, n / 9, p / 0.35, x / P(x). The range of x is 0-9, so for x do 0,1,2,...,9. For P(x), do "=binom.dist(cell with the x value you need, 9 (number of trials), 0.35 (p value), 0)"; the probability of x=0 is P(0) = 0.020711913. Do this for the rest of the values by pulling down, locking all cells except the x value
2. all probabilities should be between 0 and 1, inclusive (a number like 7.88E-05 is close to 0 but not less than 0)
3. the sum of all probabilities must equal 1 (use the sum function to check this)
e. build a histogram for this distribution that you can interpret
1. highlight all x and P(x) values including titles and go to "insert", "recommended charts", "clustered column"
f. find the probabilities from parts b and c without using the binom.dist() function
1. part b: take the sum of the probabilities for x = 0 to 3 and get 0.60889
2. part c: take the sum of the probabilities for x = 6 and x = 7 and get 0.05219
g. compute the mean and standard deviation of the distribution
1. for the mean do np. For the variance, do npq. Mean is 9*0.35 = 3.15. Variance is 9*0.35*(1-0.35) = 2.0475
2. for the standard deviation, take the sqrt of the variance and get 1.43091
3. OR for the mean do "=sumproduct(all x values, all P(x) values)" = 3.15
4. OR for the variance do (x-mean) for all values by locking the mean and pulling down for all x values. Then do (x-mean)^2. Then to get the variance, do "=sumproduct(all (x-mean)^2 values, all P(x) values)" = 2.0475, and take the sqrt for the standard deviation
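The whole exercise can be cross-checked with a short Python sketch that builds the distribution from the binomial formula C(n, x) p^x q^(n-x). The function and variable names are made up; `math.comb` is the standard-library binomial coefficient.

```python
from math import comb, sqrt

def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 9, 0.35
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]   # part d: the entire distribution

p_exactly_3 = pmf[3]                # part a
p_less_than_4 = sum(pmf[0:4])       # parts b and f: P(0) + P(1) + P(2) + P(3)
p_6_or_7 = pmf[6] + pmf[7]          # parts c and f

# part g: mean and variance as probability-weighted sums (the sumproduct approach)
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

print(round(sum(pmf), 10))          # legitimacy check: probabilities sum to 1
print(round(p_exactly_3, 5), round(p_less_than_4, 5), round(p_6_or_7, 5))
print(round(mean, 4), round(variance, 4), round(sqrt(variance), 5))
```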

Excel Exercise Chapter 6-3

Compute (will use norm.dist function for probability and norm.inv to compute value of x):
norm.dist function "=norm.dist(x value, mean, standard deviation, 1)"
norm.inv function "=norm.inv(cumulative probability, mean, standard deviation)"
1. P(z ≤ 1)
-z~N(0,1)
-mean=0, SD=1
-"=norm.dist(1 (value of x), 0 (mean), 1 (SD), 1 (keep this #))" = 0.84134
2. P(z ≤ 0)
-"=norm.dist(0, 0, 1, 1)" = 0.5
3. P(-2 ≤ z ≤ 1)
-=P(z ≤ big) - P(z ≤ small). Big=1, small= -2
-for 1, do "=norm.dist(1, 0, 1, 1)" = 0.84134
-for -2, do "=norm.dist(-2, 0, 1, 1)" = 0.02275
-subtract 0.84134-0.02275 = 0.81859
4. P(-2 < z < 1)
-the probability of one point is always equal to 0, so this is the same as number 3: 0.81859
5. P(-1 ≤ z)
-same as 1-P(z < -1)
-"=norm.dist(-1, 0, 1, 1)" = 0.15866
-subtract 1-0.15866 = 0.84134
6. Find ? such that P(z ≤ ?)=0.6
-"=norm.inv(0.6, 0, 1)" = 0.25335
-to check do "=norm.dist(0.25335, 0, 1, 1)" = 0.6
7. Find ? such that P(-? < z < ?)=0.8
-by symmetry, P(-? < z < 0)=0.4 and P(0 < z < ?)=0.4
-P(z<0)=0.5 (from property of the standard normal distribution)
-combine P(0 < z < ?)=0.4 and P(z<0)=0.5 and get P(z < ?) = 0.9
-"=norm.inv(0.9, 0, 1)" = 1.28155
-verify by using "=norm.dist(1.28155, 0, 1, 1)" = 0.9
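These seven answers can be double-checked with Python's standard-library `statistics.NormalDist`, which defaults to mean 0 and SD 1. The variable names are made up for this sketch.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, SD 1

p1 = z.cdf(1)                 # 1. P(z <= 1)
p2 = z.cdf(0)                 # 2. P(z <= 0)
p3 = z.cdf(1) - z.cdf(-2)     # 3. P(-2 <= z <= 1); also the answer to 4
p5 = 1 - z.cdf(-1)            # 5. P(-1 <= z) = 1 - P(z < -1)
c6 = z.inv_cdf(0.6)           # 6. ? such that P(z <= ?) = 0.6
c7 = z.inv_cdf(0.9)           # 7. ? such that P(-? < z < ?) = 0.8

for label, value in [(1, p1), (2, p2), (3, p3), (5, p5), (6, c6), (7, c7)]:
    print(label, round(value, 5))

# check for 7: the symmetric interval around 0 should hold 80% of the area
print(round(z.cdf(c7) - z.cdf(-c7), 5))
```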

Frequency Distribution Using Grouped Quantitative Data

Ideally, the number of classes in a frequency distribution should be between 4 and 20 - some data sets particularly those with continuous data require several values to be grouped together in a single class -this grouping prevents having too many classes in the frequency distribution, which can make it difficult to detect patterns

True zero?

Interval --> no true zero Ratio--> true zero True zero= non-existence Calendar Year (interval) -0 year --> no time or no year? NO TRUE ZERO! Temperature (interval) -0 degree--> no temperature? NO TRUE ZERO! Income (ratio) - 0 dollar--> no money! YES, TRUE ZERO! Number of students (ratio) -0 student--> no student! YES, TRUE ZERO!

Ranking?

Nominal--> No ranking Ordinal--> Ranking Zip code (nominal) -47401<47405? 47405<47401? CANNOT RANK Gender (nominal) -male<female? female<male? CANNOT RANK Education level (ordinal) -masters degree<doctorate degree? YES! CAN RANK Satisfaction rating (ordinal) -like<extremely like? CAN RANK No measurable meaning to the number differences? YES! Ex: education level. Bachelor=1, Master=2, Doctorate=3. Numerically Master-Bachelor = 1 = Doctorate-Master, but the two gaps are not truly equal, so the differences have no measurable meaning

Primary Data vs Secondary Data

Primary Data -data you've collected for your own use advantages: - collected by person/organization who uses data disadvantages: - can be expensive and time consuming to gather Secondary Data -data collected by someone else advantages: - readily available -less expensive to collect disadvantages: -no control over how data's collected -less reliable unless recorded and collected accurately

four levels of data measurement summary

Qualitative: Nominal- arbitrary labels for data, no ranking allowed (ex: zip codes) Ordinal- ranking allowed, no measurable meaning to the differences between numbers (ex: education level) Quantitative: Interval- meaningful differences, no true zero point (ex: calendar year) Ratio- meaningful differences, true zero point (ex: income)

discrete probability distribution example

Rolling a 6 sided die
Probability distribution:
x value / P(x)
1 / 1/6
2 / 1/6
3 / 1/6
4 / 1/6
5 / 1/6
6 / 1/6
this table: 1. lists all possible outcomes of the experiment 2. each outcome has an associated probability of 1/6 --> the pairs of values and their probabilities form the probability distribution for the random variable x

The sample mean, the population mean, and the weighted mean

The sample mean and the population mean- notebook page 8 the weighted mean- notebook page 9

Histogram to graph a frequency distribution

a histogram is a "graph" showing the number of observations in each class of a frequency distribution: 1- it is a graphical representation of a frequency distribution or the relative frequency distribution 2- the variable of interest is placed on the horizontal axis (x-axis) 3- a rectangle is drawn above each class interval with its height corresponding to the frequency or relative frequency 4- a histogram plots Quantitative data

Excel exercise chapter 5-3

a sample of 2,500 people were asked how many cups of coffee they drink in the morning. Let's think of the number of coffee cups as a random variable x. The distribution of x is:
cups of coffee (x) / Probability
0 / 0.24
1 / 0.36
2 / 0.28
3 / 0.08
4 / 0.04
**copy and paste table into Excel
a. verify that this is a legitimate probability distribution
1. rule 1- all probabilities should be between 0 and 1, inclusive
2. rule 2- the summation of these probabilities should equal 1
Answer- this is a legitimate probability distribution!
b. calculate the mean and variance of x
1. to calculate the mean, make a new column titled "xP(x)" and do "=(x value)*(P(x) value)" for each row. Then sum the xP(x) values for the mean OR do "=sumproduct(highlight all x values, highlight all P(x) values)" to get the mean (probably easier). The answer is 1.32
2. to calculate the variance of x, make a new column called "x-mean" and do "=(x value)-mean" for each row, locking the mean cell
3. make another column called "(x-mean)^2" and just square the "x-mean" column
4. weight each squared difference (from step 3) by its probability: make another column called "((x-mean)^2)*P(x)" and multiply the "(x-mean)^2" column by the P(x) column
5. take the summation of "((x-mean)^2)*P(x)" to get the variance of x, which is 1.0976
6. OR for the variance of x, do "=sumproduct(highlight all (x-mean)^2 values, highlight all P(x) values)" and you will still get 1.0976
7. to get the standard deviation just do "=sqrt()" of the variance, which should be 1.04766
c. find the probability that a randomly selected person drinks less than 2 cups of coffee in the morning
1. P(x<2) OR P(x=0) + P(x=1); plug these in in Excel (x=0 --> P(0)=0.24, x=1 --> P(1)=0.36 --> 0.24 + 0.36 = 0.6)
2. answer is 0.6
d. find the probability that a randomly selected person drinks 2 or 3 cups of coffee in the morning
1. P(x=2) + P(x=3)
2. same steps as c1
3. answer is 0.36
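The column-by-column Excel steps map directly onto a few lines of Python. This is only a sketch with made-up variable names; each sum below is the equivalent of a sumproduct over the table.

```python
from math import sqrt

cups = [0, 1, 2, 3, 4]
prob = [0.24, 0.36, 0.28, 0.08, 0.04]

# a. legitimacy: every probability in [0, 1] and the total equal to 1
is_legitimate = all(0 <= p <= 1 for p in prob) and abs(sum(prob) - 1) < 1e-9

# b. mean = sum of x*P(x); variance = sum of (x - mean)^2 * P(x)
mean = sum(x * p for x, p in zip(cups, prob))
variance = sum((x - mean) ** 2 * p for x, p in zip(cups, prob))
sd = sqrt(variance)

# c. and d. add the probabilities of the individual outcomes
p_less_than_2 = prob[0] + prob[1]
p_2_or_3 = prob[2] + prob[3]

print(is_legitimate, round(mean, 4), round(variance, 4), round(sd, 5))
print(round(p_less_than_2, 2), round(p_2_or_3, 2))
```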

population, sample, parameter, statistic example

a statistician is planning on studying the age of all people in Bloomington. He/She randomly selected 100 people and investigated their age. The average age of the 100 people is 25. Population- all people in the city of Bloomington Sample- 100 people who were selected for this study Parameter- average age of all people in Bloomington (usually unknown) Statistic- average age of 100 selected people (25 years, can be known) there is NO certain relationship between sample statistics and parameters (a sample statistic is only an estimator of the parameter)

binomial excel example

according to a survey by apartment.com, 35% of apartment renters in 2013 had owned a home at one point. Consider a random sample of 9 apartment renters a. what's the prob. that exactly 3 renters from the sample previously owned a home? b. what's the prob. that less than 4 renters from this sample previously owned a home? -use BINOM.DIST.RANGE() function -use BINOM.DIST() function and compare solutions c. what's the prob. that 6 or 7 renters from this sample previously owned a home? -use both BINOM.DIST.RANGE() and BINOM.DIST() functions steps: 1. check whether it's a binomial setup or not (2 outcomes) 2. define the "success" ("success" = previously owned a home) 3. note that x~B(n,p). Find out n and p. n=9. p=0.35 4. write the mathematical expression (ex: less than 4 renters is P(x<4)) 5. consider possible x (x=0,1,...,9) 6. compute the probability using Excel

advantages and disadvantages of using the mean to summarize data

advantages: -simple to calculate -summarizes the data with a single value disadvantages: -with only a summary value you lose Information about the original data -ex: sample 1 with n=3: 999, 1000, 1001 --> x bar = 1000 -just knowing the mean doesn't help you know what the underlying data looks like -the value of the mean is sensitive to outliers (values that are much higher or lower than most of the data)

the probability of the continuous random variable

assuming a value within some given interval from x1 to x2 is defined to be the area under the graph of the probability density function f(x) between x1 and x2 -the uniform graph is a rectangle/square shape -the normal graph is shaped like the letter n (a bell) -the exponential graph is highest at x=0 and then goes down **all area under the graph is equal to 1

bar charts

bar charts--> frequency distribution bar charts are very similar to histogram -can be arranged in a vertical or horizontal orientation -on one axis (x or y axis) (usually, horizontal) we specify the labels that are used for each of the classes -a frequency or relative frequency scale can be used for the other axis (usually, vertical) -using a bar of fixed width drawn above each class label, we extend the height appropriately 2 kinds: 1- vertical bar chart 2- horizontal bar chart

Measures of central tendency

central tendency- single value used to describe the center point of a data set measures of central tendency 1. Mean --> weighted mean 2. Median 3. Mode

Standard Normal Probability Distribution

characteristics: -a random variable having a normal distribution with a mean of 0 and a standard deviation of 1 is said to have a standard normal probability distribution -the letter z is designated for the standard normal random variable (z~N(0,1)) Relationship with normal distribution: -any normal distribution (with any mean and standard deviation) can be transformed into the standard normal distribution -need to transform x units into z-scores (standardization method) Converting to standard normal distribution -x~N(mean, standard deviation) -z = (x-mean)/standard deviation

Descriptive Statistics (describe data)

collecting, summarizing, and displaying data (reported based on observations)

Measures of Relative Position

compares the position of one value in relation to other values in the data set measurements of relative position 1. percentiles -the p^th percentile divides a data set into two parts -approximately p percent of the observations have values less than the p^th percentile -approximately (100-p) percent of the observations have values greater than the p^th percentile Ex: suppose we have miles per gallon (MPG) recorded for a sample of 12 cars. Based on this data we found that 60th percentile = 31.1 MPG --> about 60% of cars in the sample have MPG <31.1 2. quartiles -split the ranked data into 4 equal groups -the first quartile (Q1) is the value that constitutes the 25th percentile -the second quartile (Q2) is the value that constitutes the 50th percentile -second quartile (50th percentile) = median -the third quartile (Q3) is the value that constitutes the 75th percentile

Discrete vs Continuous Data

discrete data are typically represented by integer numbers (ex: 0 ∼ 5 --> 0, 1, 2, 3, 4, 5, which makes 6 groups) -based on observations that can be counted (how many) -take on whole numbers such as 0, 1, 2, 3 ex: -number of children per family -number of cars listed per insurance policy -vacation days per month continuous data are values that can take on any real numbers, including numbers that contain decimal points -based on observations that can be measured (how much) -take on any numbers such as 1, 3.1, 5.07, 4.941, etc. ex: -time required to read chapter 2 -thickness of paint applied to a car body -person's height

discrete vs continuous probability distributions

discrete random variable: p(x) = (x^2)/30 and x=1,2,3,4
continuous random variable: f(x) = 1/4 for any value of x between 0 and 4
question / discrete / continuous
1. range of x / 1,2,3,4 / any value between 0 and 4
2. P(x=1) or f(x=1)? / P(x=1) = 1/30 / f(x=1) = 1/4, which is not a probability; P(x=1) = 0
3. P(x=1)=P(x=2)? / No, 1/30 does not equal 4/30 / Yes, 0=0
4. P(x<=2)? / P(x=1)+P(x=2) = 5/30 / the area under the curve of f(x) for x<=2 = 1/2
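The comparison can be made concrete with exact fractions. In this sketch the function names are hypothetical; the discrete probabilities come from p(x) = x^2/30, and the continuous probabilities are rectangle areas under f(x) = 1/4 (width times height).

```python
from fractions import Fraction

def p_discrete(x):
    """P(X = x) for the discrete variable: x^2 / 30 on x = 1, 2, 3, 4, else 0."""
    return Fraction(x * x, 30) if x in (1, 2, 3, 4) else Fraction(0)

def area_uniform(a, b, low=0, high=4):
    """P(a <= X <= b) for the continuous variable: area under f(x) = 1/(high - low)."""
    a, b = max(a, low), min(b, high)          # clip the interval to where f(x) > 0
    return Fraction(max(b - a, 0), high - low)

print(sum(p_discrete(x) for x in (1, 2, 3, 4)))   # total probability, discrete
print(area_uniform(0, 4))                         # total area, continuous
print(p_discrete(1), area_uniform(1, 1))          # question 2: a single point
print(p_discrete(1) + p_discrete(2))              # question 4, discrete (5/30 reduces to 1/6)
print(area_uniform(0, 2))                         # question 4, continuous
```

A single point has zero width, so its area (and probability) is 0 in the continuous case even though f(1) = 1/4.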

Relative frequency distributions (grouping and counting)

displays the proportion of observations of each class relative to the total number of observations--> computing the proportion of each group -shows the fraction of observations in each class -found by dividing each frequency by the total number of observations -the fractions in a relative frequency distribution add up to 1.00 -sum of frequencies equals total

expressing x in terms of z-scores

formulas for expressing x in terms of the z-score *s is always positive! for a population: x = μ + zσ for a sample: x = x̄ + zs ex: for a symmetrical bell shaped population with a mean of 20 and a standard deviation of 3, what interval will contain about 95% of all the values? By the empirical rule, about 95% of the values lie within 2 standard deviations of the mean, so use z = +2 and z = -2: x = μ + zσ --> 20 + (2)(3) = 26 x = μ + zσ --> 20 + (-2)(3) = 14 answer: about 95% of the values will fall between 14 and 26
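Going from x to z and back again is a one-line calculation each way. The two helper names below are made up for this sketch; the numbers are the ones from the example on this card.

```python
def z_score(x, mean, sd):
    """Standardize: how many standard deviations x lies from the mean."""
    return (x - mean) / sd

def x_from_z(z, mean, sd):
    """Reverse the standardization: x = mean + z * sd."""
    return mean + z * sd

lower = x_from_z(-2, 20, 3)   # 2 standard deviations below the mean
upper = x_from_z(2, 20, 3)    # 2 standard deviations above the mean
print(lower, upper)           # 14 26
print(z_score(26, 20, 3))     # 2.0
```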

Class frequencies

found by counting and recording the number of observations in each class -each class is represented by a range of values

the empirical rule

if a distribution follows a bell shaped, symmetrical curve centered around the mean, we would expect: -approximately 68% of the values to fall within +/- 1 standard deviations from the mean -approximately 95% of the values to fall within +/- 2 standard deviations from the mean -approximately 99.7% of the values to fall within +/- 3 standard deviations from the mean
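The three percentages are rounded areas under the normal curve, which can be confirmed with Python's standard-library `statistics.NormalDist`. This sketch assumes an exactly normal distribution; real bell-shaped data will only match approximately.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal, so k below counts standard deviations

# area between -k and +k standard deviations from the mean
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within +/- {k} SD: {area:.4f}")
```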

finding the z or x value

in our example, the time on the phone follows the normal distribution with mean=12 and SD=3. What's the wait time so that 95% of calls have a shorter wait time? Find x0 value so that P(x<=x0)=0.95? Use Excel or find z-score to solve
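One way to work this out, sketched in Python instead of Excel: `inv_cdf` from the standard-library `statistics.NormalDist` plays the role of norm.inv, and the z-score route uses x0 = mean + z*SD. The variable names are made up.

```python
from statistics import NormalDist

wait = NormalDist(mu=12, sigma=3)

# direct route: x0 such that P(x <= x0) = 0.95
x0_direct = wait.inv_cdf(0.95)

# z-score route: find z with P(z <= z0) = 0.95, then convert back to x
z0 = NormalDist().inv_cdf(0.95)
x0_from_z = 12 + z0 * 3

print(round(z0, 4), round(x0_direct, 4), round(x0_from_z, 4))
print(round(wait.cdf(x0_direct), 4))   # check: should give back 0.95
```

Both routes should agree at roughly 16.93, since the second is just the first written out in two steps.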

parameter vs. statistic

parameter- a described characteristic about a population (in a population, values calculated using population data are called parameters) statistic- a described characteristic about a sample (in a sample, values computed from sample data are called statistics)

pie charts

pie charts are a tool for comparing proportions for qualitative data each segment of the pie represents the relative frequency of one category -all categories in the data set must be included in the pie -use a pie chart to compare the relative sizes of all possible categories -bar charts are more useful when you want to highlight the actual data values

population vs. sample

population- represents all possible subjects that are of interest in a particular study sample- refers to portion of the population that is representative of the population from which it was selected

types of data

qualitative data (categorize your sample): -classified by descriptive terms (labels or names used to identify an attribute of elements) -ex: marital status, political party, eye color quantitative data: -described by numerical values (how many or how much) 1. Counted -ex: number of children, defects per hour 2. Measured -ex: weight, voltage

Displaying qualitative data

qualitative data are values that are categorical (ex: gender - male and female) -can be nominal or ordinal measurement level -describe a characteristic, such as gender or level of education summarizing qualitative data: -frequency distribution (tabular) -relative frequency distribution (tabular) -bar and pie charts (graphs)

Random Variables

random variable- numerical description of the outcome of an experiment (ex: roll a die--> random outcome = random variable with options 1, 2, 3, 4, 5, 6) discrete random variable P(x)- may assume either a finite number of values or an infinite sequence of values. Values are whole numbers (integers), usually counted. Ex: number of complaints per day; number of TVs in a household continuous random variable- may assume any numerical value in an interval or collection of intervals. Often measured; fractional values are possible. Ex: time required to complete a task; height, in inches.

the need for sampling

reasons for sampling from the population: -too expensive to gather information on the entire population -too time consuming to gather information on the entire population -often impossible to gather Information on the entire population ex: income survey POPULATION- US population: about 300 million. Parameter- avg. income of US population SAMPLE- Census (SIPP): about 50,000 households. Statistic- avg. income of selected 50,000

Class boundaries

represent the minimum and maximum values for each class choose class boundaries that are easy to read (ex: rather than 3.21 to less than 6.41, do 3 to less than 6).

sample vs population measures

sample statistics- measures computed for data from a sample (most times denoted by Latin letters x, y, A, B,...) population parameters- measures computed for data from a population (most times denoted by Greek letters, think Greek life letters) point estimator- a sample statistic is referred to as the point estimator of the corresponding population parameter ex: sample mean--> population mean

measures of variability

shows how much spread is present in the data set 1. range 2. variance -for a sample -for a population 3. standard deviation -for a sample -for a population

Frequency distribution

shows the number of data observations that fall into specific intervals (classes)--> grouping and counting if there are 5 observations (how many times the thing has appeared), the frequency is 5

distribution shape

symmetric (mean=median) left skewed (mean<median) right skewed (mean>median) drawings of graph on notebook page 9 do not judge the shape by comparing the mean with the mode, because the mode may not exist in your data set. For that reason, comparing the mean with the median is the safe choice.

The Coefficient of Variation

the coefficient of variation, CV, measures the SD in terms of its percentage of the mean and indicates how large the SD is in relation to the mean -a high CV indicates high variability relative to the mean -a low CV indicates low variability relative to the mean the sample CV = (s / x bar) x 100 where s= the sample SD and x bar = the sample mean the population CV = (σ / μ) x 100 where σ = the population SD and μ = the population mean notebook page 11
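
A minimal sketch of the sample CV formula above, using Python's standard `statistics` module. The two data sets are made up for illustration; they have the same SD but very different means, which is the situation where the CV is more informative than the SD alone.

```python
import statistics

def sample_cv(values):
    """Sample coefficient of variation as a percentage: (s / x bar) x 100."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)  # sample SD (divides by n - 1)
    return s / x_bar * 100

# Hypothetical data: same spread (s = 2), very different means
small_scale = [8, 10, 12]
large_scale = [98, 100, 102]

cv_small = sample_cv(small_scale)
cv_large = sample_cv(large_scale)
print(round(cv_small, 1), round(cv_large, 1))  # 20.0 2.0
```

The same SD of 2 is large relative to a mean of 10 but small relative to a mean of 100.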

statistics/stats

the mathematical science that deals with the collection, analysis, and presentation of data which can then be used as a basis for inference (data) and induction (big data)

which measure of central tendency should you use?

the mean is generally used, unless extreme values (outliers) exist if outliers are present, the median is often used, since the median is not sensitive to outliers for qualitative data, the mode is a good choice

mean of a discrete probability distribution

the mean of a discrete probability distribution is the weighted average of ALL values of the random variable -the weights are the probabilities -the mean does not have to be a value the random variable can assume mean is also known as expected value, E(x) --> NOTEBOOK PAGE 16
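
A short sketch of the weighted-average idea, assuming the distribution is stored as a dictionary that maps each value of x to its probability (that layout is a choice made here, not something from the notes). The fair die is used because its mean is not a value the die can show.

```python
def expected_value(distribution):
    """E(x): sum of each value times its probability."""
    return sum(x * p for x, p in distribution.items())

# Fair six-sided die: each face has probability 1/6
die = {face: 1 / 6 for face in range(1, 7)}

mean_die = expected_value(die)
print(round(mean_die, 2))  # 3.5
```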

Mean

the mean, or average, is the most common measure of central tendency "you add and divide for the mean" ex: A= 4 A= 4 B= 3 Mean = (4+4+3)/3 ≈ 3.67

binomial distribution

the restrictions (or assumptions) of a binomial experiment
1. the experiment consists of a fixed number of (Bernoulli) trials, denoted by n (a single Bernoulli trial is the case n=1)
2. each trial has only 2 possible outcomes, a success or a failure (ex: flipping a coin)
3. the probability of a success p and probability of a failure q are constant throughout the experiment
4. each trial is independent of the other trials in the experiment
(one may consider these restrictions as the characteristics of a binomial experiment)
our interest is in the number of successes (random variable) occurring in the n trials, which we will denote by x
examples: -toss a coin (H or T) -roll a die (6 or not) -a survey response to a question is "yes I will buy" or "no I will not buy" -new job applicants either accept an offer or reject it
binomial distribution denoted by x~B(n, p) where x: random variable (uncertain number), the number of successes ~: follows B: binomial distribution n: total # of trials p: probability of success
this mathematical notation implies that when x follows the binomial distribution, we can get everything about x if n and p are available

the sample variance, and the sample standard deviation (SD), and population variance & standard deviation

the sample variance- denoted by s^2; it is the sum of the squared differences between each data value and the sample mean, divided by n - 1 (roughly the average squared difference) the sample standard deviation (SD)- square root of the sample variance, has the same units of measurement as the original data population variance and standard deviation- used when the data set represents an entire population rather than a sample from a population (the sum of squared differences is divided by N, the population size) **the standard deviation is affected by the scale of the data. When sample means are different, comparing SDs can be misleading notebook page 10
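
A small sketch contrasting the sample and population versions, using Python's standard `statistics` module (the notes use notebook formulas; this is only a cross-check). The data values are hypothetical.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5

# Treating the data as a sample: divide by n - 1
sample_var = statistics.variance(data)
sample_sd = statistics.stdev(data)

# Treating the data as the entire population: divide by N
pop_var = statistics.pvariance(data)
pop_sd = statistics.pstdev(data)

print(round(sample_var, 3), round(sample_sd, 3))
print(pop_var, pop_sd)
```

For the same data the sample variance is always a little larger than the population variance, because it divides the same sum of squares by a smaller number.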

the z-score

the z-score identifies the number of standard deviations a particular value is from the mean of its distribution a z-score has no units the z-score is: -zero for values equal to the mean -positive for values above the mean -negative for values below the mean a data value that has a z-score above +3 or below -3 is categorized as an outlier (has a value far from the mean) the sample z-score: Z = (x - x bar) / s where s = the sample SD, x bar = the sample mean, x = the data value of interest the population z-score: Z = (x - μ) / σ where σ = the population SD, μ = the population mean and x = the data value of interest notebook page 11
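
A minimal sketch of the z-score formula and the outlier rule above. The function names and the example numbers (mean 70, SD 10) are made up for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def is_outlier(x, mean, sd, cutoff=3):
    """Rule from the notes: z above +3 or below -3 counts as an outlier."""
    return abs(z_score(x, mean, sd)) > cutoff

# Hypothetical distribution with mean 70 and SD 10
z_typical = z_score(85, 70, 10)
z_extreme = z_score(35, 70, 10)

print(z_typical, is_outlier(85, 70, 10))  # 1.5 False
print(z_extreme, is_outlier(35, 70, 10))  # -3.5 True
```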

binomial distribution example

x~B(4, 0.4), so n=4 and p=0.4
1. what is the # of trials? n = 4
2. what is the probability of failure? q = 1-p = 1-0.4 = 0.6
3. what is x? a discrete random variable: the number of successes
4. what is the range of x? 0<=x<=4 because n=4
5. does P(x=1) + P(x=2) + P(x=3) + P(x=4) = 1? NO! P(x=0) is missing from the sum
6. P(x<=3) = 1-P(x>?) since x<=3, x can be 3, 2, 1, or 0. The total probability 1 covers P(x=0) all the way to P(x=4), so the only value left out is x=4. Answer is 1-P(x>3), which is 1-P(x=4)
7. P(x=0) = 0? NO! (P(x=0) = 0.6^4 = 0.1296)
8. what is the mean of x? n*p --> 4*0.4 = 1.6
9. what is the variance of x? n*p*q, where q = 1-p --> 4*0.4*0.6 = 0.96
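
The answers above can be checked numerically. This is a sketch that uses the standard binomial probability formula, P(x=k) = C(n,k) * p^k * q^(n-k), which the notes do not write out; `binom_pmf` is a helper name chosen here.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(x = k) for x ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 4, 0.4
q = 1 - p

# Probabilities for x = 0, 1, 2, 3, 4
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)              # all five probabilities together
without_zero = sum(pmf[1:])   # question 5: P(x=1) + ... + P(x=4)
p_at_most_3 = 1 - pmf[4]      # question 6: 1 - P(x > 3)
mean = n * p                  # question 8
variance = n * p * q          # question 9

print(round(total, 4), round(without_zero, 4))
print(round(pmf[0], 4), round(p_at_most_3, 4))
print(round(mean, 2), round(variance, 2))
```

The sum without P(x=0) comes out to 0.8704 rather than 1, which is why the answer to question 5 is no.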

Class Width

there are methods to determine the number of classes k in a frequency distribution. But they are just a recommendation. You can always adjust! once k is known, the width of each class can be found as Estimated class width = (maximum data value - minimum data value)/k -the width is the range of numbers to put into each class -round this estimate to a useful whole number that makes the frequency distribution more readable -there is no one correct answer for the class width -the goal is to create a histogram to clearly and usefully show the pattern in the data -often there is more than one acceptable way to accomplish this ex: k = 4 groups. For continuous data, each group is an interval such as 0∼10, where 0 = minimum and 10 = maximum; the minimum and maximum are the class boundaries, and the class width is maximum - minimum = 10. 1) all data belong to exactly one of the 4 groups 2) all 4 groups have the same class width
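
A sketch of the class-width estimate followed by the grouping-and-counting step. The data, the starting boundary of 0, and the rounded width of 10 are all hypothetical choices; the helper assumes every value falls inside the k classes.

```python
def estimated_class_width(values, k):
    """(maximum data value - minimum data value) / k."""
    return (max(values) - min(values)) / k

def count_by_class(values, start, width, k):
    """Count values in k equal-width classes of the form 'lower to less than upper'.

    Assumes every value lies in [start, start + k * width).
    """
    counts = [0] * k
    for v in values:
        counts[int((v - start) // width)] += 1
    return counts

data = [3, 7, 12, 15, 18, 21, 24, 29, 33, 38]  # hypothetical observations
k = 4

estimate = estimated_class_width(data, k)  # (38 - 3) / 4 = 8.75
width = 10                                 # rounded to a readable whole number
counts = count_by_class(data, start=0, width=width, k=k)

print(estimate)  # 8.75
print(counts)    # [2, 3, 3, 2] for 0-<10, 10-<20, 20-<30, 30-<40
```

Rounding 8.75 up to 10 and starting at 0 gives boundaries that are easy to read, and every value still lands in exactly one class.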

time series vs cross-sectional data

time series data (multiple time points): -values that correspond to specific measurements taken over a range of time periods -data can include hourly, daily, weekly, monthly, quarterly, or annual observations -graph of data from multiple years cross section (one time point): -values collected from a number of subjects during a single time period -subjects might include individuals, households, firms, industries, regions, countries, etc. -graph of data from one year EXAMPLE ON PAGE 3 NOTEBOOK

discrete probability distributions

to obtain the value of "x" and the probability of "x", we impose restrictions (or assumptions) on the random variable "x" specific discrete probability distributions 1. binomial

Cumulative Relative Frequency Distributions

totals the proportion of observations that are less than or equal to the class at which you are looking--> summing up the proportions of equal or lower groups -shows the accumulated proportion as class values vary from low to high -cumulative relative frequency for the highest class (group) is equal to 1.00 two ways to compute it: 1) sum up the relative frequencies of the current class and all lower classes 2) add the relative frequency of the current class to the cumulative relative frequency of the previous class
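
A short sketch of the running-total calculation, using `itertools.accumulate` from the standard library. The class frequencies are hypothetical.

```python
from itertools import accumulate

frequencies = [2, 3, 3, 2]  # hypothetical counts per class, lowest class first
n_obs = sum(frequencies)

# Relative frequency: each count as a proportion of all observations
relative = [f / n_obs for f in frequencies]

# Cumulative relative frequency: previous cumulative value + current relative frequency
cumulative = list(accumulate(relative))

print([round(r, 2) for r in relative])    # [0.2, 0.3, 0.3, 0.2]
print([round(c, 2) for c in cumulative])  # [0.2, 0.5, 0.8, 1.0]
```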

types of data and their corresponding levels

types of data 1. qualitative and 2. quantitative 1. qualitative - nominal -ordinal 2. quantitative -interval -ratio

cumulative probability function F(x)

used to calculate the cumulative probability at any point x0, that is, the probability that the random variable x takes a value less than or equal to x0; denoted P(x<=x0) = F(x0) graphically, the cumulative probability is the area under the graph of f(x) to the left of x0

data and information

data- values assigned to observations or measurements. All the data collected in a particular study are referred to as the data set for the study data set/raw data--> transform --> information ex: income data--> average--> average income information- data that are transformed into useful facts that can be used for a specific purpose, such as making a decision EXAMPLE ON PAGE 1 NOTEBOOK

Variance and SD of a discrete probability distribution

variance summarizes the variability in the values of the random variable the standard deviation is the square root of the variance notebook page 17 for equation
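
The notebook equation is not reproduced here, so this sketch assumes the standard textbook definition: variance = sum of (x - mean)^2 * P(x) over all values of x. The distribution (number of heads in two fair coin flips) is an illustration, stored as a dictionary of value to probability.

```python
from math import sqrt

def discrete_mean(distribution):
    """Weighted average of the values; the weights are the probabilities."""
    return sum(x * p for x, p in distribution.items())

def discrete_variance(distribution):
    """Assumed standard definition: sum of (x - mean)^2 * P(x)."""
    mu = discrete_mean(distribution)
    return sum((x - mu) ** 2 * p for x, p in distribution.items())

# Number of heads in two fair coin flips
heads = {0: 0.25, 1: 0.50, 2: 0.25}

mu = discrete_mean(heads)
var = discrete_variance(heads)
sd = sqrt(var)

print(mu, var, round(sd, 4))  # 1.0 0.5 0.7071
```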

normal distribution

we usually denote x~N(μ,σ) (compare the binomial: x~B(n,p)) x: continuous random variable (uncertain number) ~: follows N: normal distribution μ: expected value (mean) σ: standard deviation this mathematical notation implies that when x follows the normal distribution, we can get everything about x if μ and σ are available

standard normal distribution

when the original random variable, x, follows the normal distribution--> its z-scores also follow a normal distribution, with mean=0 and standard deviation=1; that is, they follow the standard normal distribution
ex: the time customers spend on the phone for service (x) follows the normal distribution with a mean of 12 minutes and a standard deviation of 3 minutes. What's the probability that the next customer who calls will spend 14 minutes or more on the phone? P(x>=14)
1. Find the z-score (mean=12 and SD=3). The z-score for x=14 is (14-12)/3 = 0.67. So, x=14 is 0.67 standard deviations above the mean of 12
2. Upper tail probability: P(x>=14) = P(z>=0.67). The total area under the curve equals 1.0, so use Excel to get the cumulative probability P(z<=0.67) and subtract it from 1: 1-P(z<=0.67) = 0.2514
3. Lower tail probability (cumulative probability): for 10 minutes or less on the phone, P(x<=10) = P(z<=(10-12)/3) = P(z<=-0.67), which Excel gives as 0.2514
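
The notes do this lookup in Excel. As a cross-check, the same tail probabilities can be sketched with Python's standard `statistics.NormalDist`, which is a substitute used here and not the tool the course uses. The z-score is rounded to two decimals to match the worked example.

```python
from statistics import NormalDist

mean, sd = 12, 3  # phone-call example from the notes

# Step 1: z-score for x = 14, rounded to two decimals as in the notes
z = round((14 - mean) / sd, 2)

standard_normal = NormalDist(mu=0, sigma=1)

# Step 2: upper tail, P(z >= 0.67) = 1 - P(z <= 0.67)
upper_tail = 1 - standard_normal.cdf(z)

# Step 3: lower tail, P(z <= -0.67)
lower_tail = standard_normal.cdf(-z)

# Without rounding z, working directly with x ~ N(12, 3)
exact_upper = 1 - NormalDist(mu=mean, sigma=sd).cdf(14)

print(z, round(upper_tail, 4), round(lower_tail, 4))
print(round(exact_upper, 4))
```

The two tails match because the normal curve is symmetric about its mean. The unrounded answer differs slightly in the third decimal place, which is only an effect of rounding z to 0.67.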

the consequences of too few or too many classes

too few classes (classes that are too wide) have consequences -can obscure important patterns -gives a "blocky" distribution graph -tells little about the distribution shape too many narrow classes in a histogram also have consequences -results in a "jagged" histogram -some classes may be empty

vertical bar chart

y-axis = frequency x-axis = groups -looks like a histogram -for qualitative data--> we call it a vertical bar chart -for quantitative data--> we call it a histogram

Normal Probability Density Function

f(x) = (1 / (σ * sqrt(2π))) * e^(-(x - μ)^2 / (2σ^2)) where μ: mean σ: standard deviation π: 3.14159 e: 2.71828
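
A sketch of the density written out with the symbols listed above, in its standard textbook form, and compared against `statistics.NormalDist.pdf` from the Python standard library. The function name `normal_pdf` is chosen here; the mean of 12 and SD of 3 are borrowed from the phone-call example.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """f(x) = (1 / (sigma * sqrt(2*pi))) * e^(-(x - mu)^2 / (2 * sigma^2))."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma**2)
    return coefficient * math.exp(exponent)

# Height of the curve at the mean, for mu = 12 and sigma = 3
peak = normal_pdf(12, mu=12, sigma=3)
library_peak = NormalDist(mu=12, sigma=3).pdf(12)

print(round(peak, 4), round(library_peak, 4))
```

The value of f(x) is the height of the curve, not a probability; probabilities for a continuous variable come from areas under the curve.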

