Statistical Inference

Sampling Distribution

Say you have the population of used cars in a car shop. We want to analyze the car prices and be able to make some predictions on them. Population parameters which may be of interest are mean car price, standard deviation of prices, covariance and so on. Normally, in statistics we would not have data on the whole population, but rather just a sample. Let's draw a sample out of that data. The mean is $2,617.23. Now, a problem arises from the fact that if I take another sample, I may get a completely different mean ‐ $3,201.34. Then a third with a mean of $2,844.33.

Comparing the two formulas for calculating confidence interval for population mean

There are two key differences. First, instead of a z-statistic, we have a t-statistic. Second, instead of the population standard deviation, we have the sample standard deviation. Otherwise, everything is the same. When the population variance is known, the population standard deviation goes with the z-statistic. When the population variance is unknown, the sample standard deviation goes with the t-statistic. Note that the CL row in the t-table gives the confidence level.

Note the standard error decreases as the sample size increases. What gives a better approximation of the population?

Bigger samples give a better approximation of the population.

What is a margin of error and why is it important in Statistics?

The formulas for confidence intervals have the form x bar +- (statistic times standard deviation/sqrt(n)). As we noted in our previous lecture, the part after the +- sign determines the span of the confidence interval. There is a special name for this expression: the margin of error.

Reasons for Concentrating on the Normal and Student's T Distributions

They approximate a wide variety of random variables. Distributions of sample means with large sample sizes can be approximated by the normal distribution. All computable statistics are elegant. Decisions based on normal distribution insights have a good track record.

Meaning of standard error:

Variability of the sample means

Example of a biased estimator

What if somebody told you that you will find the average height of Americans by taking a sample, finding its mean and then adding 1 foot to that result? So, the formula is x bar plus 1 foot. Well, I hope you won't trust them. They gave you an estimator, but a biased one.
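
The idea of bias can be checked with a small simulation. The sketch below is only an illustration: the population of heights is made up (the mean of 5.6 feet and spread of 0.3 feet are assumptions, not real data), and the function names are hypothetical. It averages each estimator over many samples and compares the result with the true population mean.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population of heights in feet (made-up numbers, for illustration only).
population = [random.gauss(5.6, 0.3) for _ in range(20_000)]
true_mean = statistics.fmean(population)


def unbiased_estimator(sample):
    """Plain sample mean (x bar)."""
    return statistics.fmean(sample)


def biased_estimator(sample):
    """Sample mean plus 1 foot, the estimator from the card above."""
    return statistics.fmean(sample) + 1.0


unbiased_estimates = []
biased_estimates = []
for _ in range(2_000):
    sample = random.sample(population, 50)
    unbiased_estimates.append(unbiased_estimator(sample))
    biased_estimates.append(biased_estimator(sample))

# Bias = average of the estimates minus the true parameter.
bias_unbiased = statistics.fmean(unbiased_estimates) - true_mean
bias_biased = statistics.fmean(biased_estimates) - true_mean

print(f"bias of x bar:          {bias_unbiased:+.3f} ft")
print(f"bias of x bar + 1 foot: {bias_biased:+.3f} ft")
```

Under these assumptions the first bias should be close to 0 and the second close to 1 foot, which is what "expected value equal to the population parameter" means in practice.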

Calculating confidence intervals within a population with an unknown variance

You are an aspiring data scientist and are wondering how much the mean data scientist salary is. This time, though, you do not have the population variance. In fact, you have a sample of only 9 compensations you found on Glassdoor and have summarized the information in a table. We've already calculated the sample mean and standard error, which are $92,533 and $4,644 respectively. Good, but we don't have one key piece of information: the population variance. Therefore, we will use the Student's t distribution.

If the CI 95% unknown = (81806, 103261) and the CI 95% known = (94833, 105568) what can we say?

You can clearly note that when we know the population variance, we get a narrower confidence interval. When we do not know the population variance, there is a higher uncertainty that is reflected by wider boundaries for our interval.
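
As a rough check, the unknown-variance interval can be rebuilt from the summary numbers quoted in these notes: sample mean $92,533, standard error $4,644, and t(8, 0.025) = 2.31. The variable names below are illustrative, and the known-variance interval is simply copied from the card rather than recomputed.

```python
# Summary statistics quoted in the notes (n = 9 salaries, unknown variance).
sample_mean = 92_533
standard_error = 4_644  # already s / sqrt(n)
t_critical = 2.31       # t(8, 0.025), read from a t-table

margin = t_critical * standard_error
ci_unknown = (sample_mean - margin, sample_mean + margin)

# Known-variance interval as reported in the notes (not recomputed here).
ci_known = (94_833, 105_568)

width_unknown = ci_unknown[1] - ci_unknown[0]
width_known = ci_known[1] - ci_known[0]

print(f"CI (unknown variance): ({ci_unknown[0]:,.0f}, {ci_unknown[1]:,.0f})")
print(f"width unknown = {width_unknown:,.0f}, width known = {width_known:,.0f}")
```

The result matches the reported interval up to rounding. Note that the two intervals also come from samples of different sizes (9 versus 30 salaries), so part of the difference in width is due to the sample size and not only to the t-statistic.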

Standard Normal Distribution

A bell-shaped, symmetrical curve with total area = 1, mean μ = 0 and standard deviation σ = 1.

Calculating confidence intervals within a population with a known variance Example

You want to become a data scientist and you are interested in the salary you are going to get. Imagine you have certain information that the population standard deviation of data science salaries is equal to $15,000. Furthermore, you know the salaries are normally distributed and your sample consists of 30 salaries. The confidence interval with a known variance is x bar +- Z alpha/2 times sigma/sqrt(n), where the standard error is sigma/sqrt(n), the sample mean x bar is a point estimate, and Z alpha/2 is the reliability factor. Common confidence levels are 90%, 95% and 99%, which correspond to alpha = 10%, 5% and 1%.

An unbiased estimator has an expected value

equal to the population parameter.

Continuous normal distribution

f(x) -> pdf (probability density function)

Formula to calculate the probability density function for a uniform distribution

f(x) = 1/(b-a) for a <= x <= b; mean μ = (a+b)/2; std dev σ = (b-a)/sqrt(12)

Continuous random variable

it can take on an infinite number of possible values, corresponding to every value in an interval.

Central Limit Theorem: what can we say about the mean and the variance of the sampling distribution?

Its mean is the same as the population mean. Its variance is the population variance divided by the sample size (n). Since the sample size is in the denominator, the bigger the sample size, the lower the variance.

Formula to estimate the population mean μ with a confidence interval (population variance unknown)

μ = x bar +- (t * s)/sqrt(n)

Compute the 99% confidence interval of the true mean. The test scores of 9 random selected students are: 83,73,62,63,71,77,77,59,72

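For the nine listed scores, x bar = 637/9 = 70.78 and s = 7.98. With t(8, 0.005) = 3.355: μ = x bar +- (t * s)/sqrt(n) = 70.78 +- (3.355)(7.98)/sqrt(9) = 70.78 +- 8.93 = (61.85, 79.70)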
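
The calculation can be reproduced directly from the nine scores listed in the question. This is a minimal sketch using only Python's standard library; the critical value t(8, 0.005) = 3.355 is hard-coded from a t-table because the standard library has no t distribution, and the variable names are illustrative.

```python
import math
import statistics

scores = [83, 73, 62, 63, 71, 77, 77, 59, 72]

n = len(scores)
x_bar = statistics.mean(scores)   # sample mean
s = statistics.stdev(scores)      # sample standard deviation (divides by n - 1)

t_critical = 3.355                # t(8, 0.005) for a 99% interval, from a t-table
margin = t_critical * s / math.sqrt(n)
ci = (x_bar - margin, x_bar + margin)

print(f"x bar = {x_bar:.2f}, s = {s:.2f}, margin = {margin:.2f}")
print(f"99% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Recomputing from the raw data like this is a useful habit, since x bar and s are the two values most easily mistyped when a problem is solved by hand.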

Formula standard deviation of the sampling distribution

std dev = sigma/sqrt(n)

Formula to calculate the Student's t-statistic

The t-statistic with n-1 degrees of freedom and a significance level of α equals (the sample mean minus the population mean) divided by the standard error of the sample: t = (x bar - μ)/(s/sqrt(n))

Example t(n-1,alpha/2) S/sqrt(n) ; t(8,0.025) =

t(8,0.025) = 2.31

For CLT, the more samples you extract and the bigger they are

the closer to a normal distribution the sample means will be. Moreover, their distribution will have the same mean as the original data set and an n times smaller variance, where n is the size of the samples we took.

The most efficient estimators are

the ones with the least variability of outcomes. Most efficient means: the unbiased estimator with the smallest variance.

When to use the t-test

When sigma is unknown and the sample size is less than 30.

Formula to find confidence interval for the population mean where population variance is unknown

x bar +- t(n-1,alpha/2) times S/sqrt(n)

Confidence Interval for The Population Mean μ Assuming Knowing the Population Standard Deviation σ

x bar plus minus Z alpha/2 times sigma/sqrt(n)

T Table explanation

Rows represent degrees of freedom. Columns are our α's. After the 30th row, the numbers do not vary that much.

Calculating confidence intervals within a population with a known variance procedure and interpretation

A commonly used term for the z is critical value. So, we have found the critical value for this confidence interval. Now, we can easily substitute in the formula. The final confidence interval becomes 94,833 to 105,568. The interpretation is the following: We are 95% confident that the average data scientist salary will be in the interval 94,833 and 105,568 dollars.
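
The substitution step can be sketched as follows. The notes do not state the sample mean for this example, so the sketch only recomputes the margin of error from the stated σ = $15,000 and n = 30 and compares it with half the width of the reported interval. `NormalDist` comes from Python's standard `statistics` module; the variable names are illustrative.

```python
import math
from statistics import NormalDist

sigma = 15_000        # known population standard deviation
n = 30                # sample size
confidence = 0.95
alpha = 1 - confidence

# Critical value z(alpha/2): the point with area 1 - alpha/2 to its left.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

standard_error = sigma / math.sqrt(n)
margin = z_critical * standard_error

# Interval reported in the notes; its half-width should equal the margin of error.
reported_ci = (94_833, 105_568)
half_width = (reported_ci[1] - reported_ci[0]) / 2

print(f"z = {z_critical:.2f}, standard error = {standard_error:,.1f}")
print(f"margin of error = {margin:,.1f}, reported half-width = {half_width:,.1f}")
```

The two numbers agree to within rounding, which confirms that the reported interval is x bar plus or minus 1.96 standard errors.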

Confidence Intervals

A confidence interval is the range within which you expect the population parameter to be. Its estimation is based on the data we have in our sample. There are two main situations when we calculate confidence intervals for a population: when the population variance is known, and when the population variance is unknown. Depending on which situation we are in, we use a different calculation method. Now, the whole field of statistics exists because we almost never have population data.

What is a margin of error and why is it important in Statistics? How to get a wider confidence interval?

A higher level of confidence increases the statistic. A higher statistic means a higher margin of error. This leads to a wider confidence interval.

What is a Hypothesis?

A hypothesis is an idea that can be tested. Suppose I tell you that apples in New York are expensive. This is an idea or a statement, but it is not testable until I have something to compare it with. For instance, if I define expensive as any price higher than $1.75 per pound, then it immediately becomes a hypothesis.

What is a margin of error and why is it important in Statistics Example. What about the standard deviation?

A lower standard deviation means that the data set is more concentrated around the mean, so we have a better chance to get it right.

Statistic is the broader term. What is an example of a statistic?

A point estimate is a statistic.

Estimators and Estimates

An estimator of a population parameter is an approximation depending solely on sample information. A specific value is called an estimate.

What's something that cannot be a hypothesis?

An example may be: "Would the USA do better or worse under a Clinton administration compared to a Trump administration, statistically speaking?" This is an idea, but there is no data to test it. Therefore, it cannot be a hypothesis of a statistical test. The following might be a hypothesis and can be tested: we may compare different U.S. presidencies that have already been completed, such as the Obama administration and the Bush administration, as we have data on both.

Calculating confidence intervals within a population with a known variance

An important assumption in this calculation is that the population is normally distributed. Even if it is not, you should use a large sample and let the central limit theorem do the normalization magic for you. Remember? If you work with a sample that is large enough, you can assume normality of the sample means.

Student's T distribution the bigger the sample

As the degrees of freedom depend on the sample, in essence, the bigger the sample, the closer we get to the actual numbers. A common rule of thumb is that for a sample containing more than 50 observations, we use the z‐table instead of the t‐table.

Why the Central Limit Theorem is Important?

Because it allows us to perform tests, solve problems and make inferences using the normal distribution even when the population is not normally distributed.

Estimators, what are their properties?

Bias and Efficiency.

Uniform Distribution

It has a constant density f(x) for all x values in the interval. U(a,b), a <= x <= b

Distribution

It is a function that shows the possible values for a variable and how often they occur.

Confidence Interval

Compared with a point estimate, it is a much more accurate representation of reality. However, there is still some uncertainty left, which we measure in levels of confidence.

Confidence interval

It is a range of values that is likely to include the population parameter at some confidence level.

A point estimate

It is a single number, while a confidence interval... naturally... is an interval.

The level of confidence

It is denoted by: 1 minus alpha, and is called the confidence level of the interval. Alpha is a value between 0 and 1.

Standardization

It is the process of transforming a variable to one with a mean of 0 and a standard deviation of 1. Every distribution can be standardized.

The standard error

It is the standard deviation of the distribution formed by the sample means. The standard deviation of the sampling distribution.

Why is the standard error important?

It is used for almost all statistical tests, because it shows how well you approximated the true mean.

Inferential Statistics

It refers to methods that rely on probability theory and distributions in particular to predict population values based on sample data.

A lower standard deviation

It results in a lower dispersion, so more data in the middle and thinner tails.

The distribution of a dataset

It shows us the frequency at which possible values occur.

Central Limit Theorem

It states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normal.
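
A short simulation can make the theorem concrete. The sketch below assumes a uniform population U(0, 1), which is clearly not normal, and checks the two facts stated in these notes: the sample means centre on the population mean, and their variance is close to the population variance divided by n. The sample size, number of samples and seed are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

a, b = 0.0, 1.0          # uniform population U(a, b), assumed for illustration
n = 30                   # size of each sample
num_samples = 5_000      # how many samples we draw

pop_mean = (a + b) / 2           # mean of U(a, b)
pop_var = (b - a) ** 2 / 12      # variance of U(a, b)

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean([random.uniform(a, b) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.variance(sample_means)

print(f"population mean = {pop_mean:.4f}, mean of sample means = {mean_of_means:.4f}")
print(f"sigma^2 / n     = {pop_var / n:.5f}, variance of sample means = {var_of_means:.5f}")
```

Plotting a histogram of `sample_means` would show the familiar bell shape, even though each individual observation comes from a flat distribution.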

A higher standard deviation

It will cause the graph to flatten out, with fewer points in the middle and more towards the ends, or in statistics jargon: fatter tails.

Normal Distribution. Controlling for Standard Deviation. What can we say about a lower mean?

It would result in the same shape of the distribution, but on the left side of the plane.

Normal Distribution. Controlling for Standard Deviation. What can we say about a larger mean?

It would result in the same shape of the distribution, but on the right side of the plane.

Calculating confidence intervals within a population with a known variance Example

Let's say that we want to find the values for the 95% confidence interval. α is 0.05, therefore we are looking for z of alpha divided by two, or z of 0.025. In the table, this will match the value of 1 minus 0.025, or 0.975. The corresponding z comes from the sum of the row and column table headers associated with this cell. In our case, the value is 1.9 + 0.06, or 1.96.
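
The table lookup can be mimicked in code. This is a small sketch using `NormalDist` from Python's standard `statistics` module: `cdf` plays the role of the body of the z-table (area to the left of z), and `inv_cdf` does the reverse lookup from an area back to z.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

alpha = 0.05
target_area = 1 - alpha / 2          # the cell we look for in the z-table

# Reverse lookup: which z has this much area to its left?
z = std_normal.inv_cdf(target_area)

# Forward lookup: area to the left of z = 1.9 + 0.06.
area_at_1_96 = std_normal.cdf(1.96)

print(f"target area = {target_area:.3f}, z = {z:.4f}")
print(f"area to the left of 1.96 = {area_at_1_96:.4f}")
```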

Normal Distribution (Gaussian Distribution)

Many people call it the bell curve. It is symmetrical. Mean = median = mode. It has no skew. It is perfectly centered around its mean. The highest point is located at the mean, because it coincides with the mode. The spread of the graph is determined by the standard deviation.

Confidence Interval for The Population Mean μ Assuming Knowing the Population Standard Deviation σ

x bar plus minus E, where the margin of error E = z alpha/2 times sigma/sqrt(n); significance level alpha = 1 - CL (confidence level), so CL = 1 - alpha

Student's T distribution table and degree of freedom notes

Much like the standard normal distribution table, we also have a Student's t table. The rows indicate different degrees of freedom, abbreviated as d.f., while the columns - common alphas. Please note that after the 30th row, the numbers don't vary that much. Actually, after 30 degrees of freedom, the t‐statistic table becomes almost the same as the z‐statistic.

For CLT, does it matter if the distribution of the entire data set is binomial, uniform, or another one?

No, it doesn't matter. Also, the means of the samples we took from the entire data set will approximate a normal distribution.

Significance Level alpha definition

Normally we aim to reject the null if it is false. However, as with any test, there is a small chance that we could get it wrong and reject a null hypothesis that is true. The significance level is denoted by α and is defined as the probability of rejecting the null hypothesis if it is true, i.e. the probability of making this error. Typical values for α are 0.01, 0.05 and 0.1; it is a value you select based on the certainty you need. The choice of α is determined by the context you are operating in, but 0.05 is the most commonly used value.

Null and Alternative Hypothesis

Null hypothesis (H0): the one to be tested. Alternative hypothesis (H1 or HA): everything else. In our previous example of the data scientist mean salary.

There are two types of estimates

Point estimates and confidence interval estimates.

Example Point Estimators not Very Reliable

Point estimators are not very reliable. Imagine visiting 5% of the restaurants in London and saying that the average meal is worth 22.50 pounds. You may be close, but chances are that the true value isn't really 22.50 but somewhere around it. It's much safer to say that the average meal in London is somewhere between 20 and 25 pounds, isn't it? In this way, you have created a confidence interval around your point estimate of 22.50.

Total area under the curve for a continuous probability distribution

1

Example Confidence Interval for the population mean miu assuming knowing the population Std Dev sigma. Scores on an exam: sigma = 5.6, n=40, x bar = 32

CL = 95%, alpha = 0.05; Z alpha/2 = Z 0.025 = 1.96; E = Z alpha/2 times sigma/sqrt(n) = 1.96 times 5.6/sqrt(40) = 1.74; CI = 32 +- 1.74 = (30.26, 33.74)

z-test vs t-test: The average test score for an entire school is 75. The std dev of a random sample of 40 students is 10. What is the probability the average test score for the sample is above 80?

Conditions for using the t-test: sigma is unknown and n < 30. In this case: x bar = 80, n = 40, μ = 75, s = 10. Sigma is unknown but n > 30. Therefore, we do not use the t-test. We will use the z-test.

z-test vs t-test: The average test score for an entire school is 75 with a std dev of 10. What is the probability that the average score of a random sample of 5 students is above 80?

Conditions for using the t-test: sigma is unknown and n < 30. In this case: x bar = 80, n = 5, μ = 75, sigma = 10. Sigma is known. Therefore, we do not use the t-test. We will use the z-test.
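
Once the z-test has been chosen, the probability itself follows from the standard normal distribution. The sketch below works through the case with known sigma (μ = 75, sigma = 10, n = 5, threshold 80), reading the question as asking about the sample mean. The helper name `prob_sample_mean_above` is hypothetical, and `NormalDist` is from Python's standard `statistics` module.

```python
import math
from statistics import NormalDist


def prob_sample_mean_above(threshold, mu, sigma, n):
    """Return (z, P(sample mean > threshold)) when sigma is known."""
    standard_error = sigma / math.sqrt(n)
    z = (threshold - mu) / standard_error
    # Area to the right of z under the standard normal curve.
    return z, 1 - NormalDist().cdf(z)


z, p = prob_sample_mean_above(threshold=80, mu=75, sigma=10, n=5)
print(f"z = {z:.3f}, P(x bar > 80) = {p:.4f}")
```

The same helper can be reused for the n = 40 case by passing the sample standard deviation in place of sigma, which is the large-sample approximation these notes rely on.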

z-test vs t-test: The average test score for an entire school is 75. The std dev of a random sample of 9 students is 10. What is the probability the average test score for the sample is above 80?

Conditions for using the t-test: sigma is unknown and n < 30. In this case: x bar = 80, n = 9, μ = 75, s = 10. Sigma is unknown and n < 30. Therefore, we can use the t-test.

Hypothesis Testing idea

Confidence intervals provide us with an estimation of where the parameters are located. However when you are making a decision you need a yes or no answer. The correct approach in this case is to use a test.

Confidence Interval Clarification Example: Age Intervals

I don't know your age, dear student, but I am 95% confident that you are between 18 and 55 years old, because you are taking a data science course. I am 99% confident that you are between 10 and 70 years old. I am 100% confident that you are between 0 and 118 years old. There is a trade‐off between the level of confidence and the range of the interval.

Standardization of the Normal Distribution

If we shift any normal distribution by its mean μ and scale it by its standard deviation σ, we arrive at the standard normal distribution. We use the letter Z to denote it. Its mean is 0 and its standard deviation is 1. The standardized variable is called a z-score and is equal to the original variable, minus its mean, divided by its standard deviation: z = (x - μ)/σ.
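
A tiny numeric example may help here. The data set below is made up for illustration; the sketch subtracts the mean from each value, divides by the standard deviation, and then checks that the resulting z-scores have mean 0 and standard deviation 1.

```python
import statistics

# Hypothetical data set, treated as a whole population for this illustration.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = statistics.fmean(values)        # population mean
sigma = statistics.pstdev(values)    # population standard deviation

# z-score: original value, minus the mean, divided by the standard deviation.
z_scores = [(x - mu) / sigma for x in values]

print(f"mu = {mu}, sigma = {sigma}")
print("z-scores:", z_scores)
print(f"mean of z = {statistics.fmean(z_scores):.3f}, "
      f"std dev of z = {statistics.pstdev(z_scores):.3f}")
```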

Importance of Central Limit Theorem

Imagine the power if your data set was made up of millions of values and you could afford to sample just a tiny bit of them. We can assume normally distributed data almost all the time, and that is extremely helpful, as you will see later on.

Values for CI for 80%, 90% and 98% CL if sigma=5.6, n=40 and x bar = 32

For 80% CI = (30.87, 33.13); for 90% CI = (30.54,33.46); for 98% CI = (29.94, 34.06)

Values for margin of error E for 80%, 90% and 98% CL if sigma=5.6, n=40 and x bar = 32

For 80% E = 1.13; for 90% E = 1.46; for 98% E = 2.06

Values of Z alpha/2 for 80% confidence, 90% confidence and 98% confidence

For 80% Z alpha/2 = 1.28; for 90% Z alpha/2 = 1.645; for 98% Z alpha/2 = 2.33
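
The critical values, margins of error and intervals for this exam-score example (sigma = 5.6, n = 40, x bar = 32) can be generated in one loop. This is a sketch using `NormalDist` from Python's standard `statistics` module; the function name `z_interval` is illustrative.

```python
import math
from statistics import NormalDist

sigma, n, x_bar = 5.6, 40, 32


def z_interval(confidence):
    """Return (z critical value, margin of error, confidence interval)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / math.sqrt(n)
    return z, margin, (x_bar - margin, x_bar + margin)


results = {cl: z_interval(cl) for cl in (0.80, 0.90, 0.98)}

for cl, (z, margin, (low, high)) in results.items():
    print(f"{cl:.0%}: z = {z:.3f}, E = {margin:.2f}, CI = ({low:.2f}, {high:.2f})")
```

The output illustrates the trade-off described in these notes: a higher confidence level gives a larger critical value, a larger margin of error, and therefore a wider interval.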

Steps in Data‐Driven Decision Making

Formulate a hypothesis. Find the right test for the hypothesis. Execute the test. Make a decision based on the result.

What is a margin of error and why is it important in Statistics can we control it and what can we say about it?

Getting a smaller margin of error means that the confidence interval will be narrower. As we want a better prediction, it is in our interest to have the narrowest possible confidence interval. The best part is that we can control the margin of error. The margin of error has three parts: a statistic, a standard deviation and the sample size.

One-Sided Test Example

Someone thinks that data scientists make more than $125,000. The null hypothesis H0 of this test would be that the mean data scientist annual salary is greater than or equal to $125,000. The alternative H1 covers everything else, i.e. the mean data scientist salary is less than $125,000. Note: outcomes of tests refer to the population parameter μ rather than the sample statistic.

Student's T distribution

The Student's t distribution is one of the biggest breakthroughs in statistics. It allows inference through small samples and with an unknown population variance, and it has huge real-life application. It looks much like the normal distribution but has fatter tails. Fatter tails, as you may remember, allow for a higher dispersion of variables, as there is more uncertainty.

Null and Alternative Hypothesis. The concept is similar to

The concept of the null hypothesis is similar to the concept of innocent until proven guilty. H0 is assumed true until rejected.

What is a margin of error and why is it important in Statistics Example. What is the conclusion?

The conclusion is that the more observations there are in the sample, the higher the chances of getting a good idea about the true mean of the entire population.

What is a margin of error and why is it important in Statistics? How to reduce the margin of error?

The statistic and the standard deviation are in the numerator, so smaller statistics and smaller standard deviations will reduce the margin of error.

The t-statistic is related to

The Student's t-distribution

A distribution is defined by the underlying probabilities and not the graph. What can we say about the graph?

The graph is just a visual representation

Student's T distribution degree of freedom

The last characteristic of the Student's t-statistic is that there are degrees of freedom. Usually, for a sample of n, we have n-1 degrees of freedom. So, for a sample of 20 observations, the degrees of freedom are 19.

How are Point estimates and confidence interval estimates related?

The two are closely related. In fact, the point estimate is located exactly in the middle of the confidence interval. However, confidence intervals provide much more information and are preferred when making inferences.

