Sampling and Estimation
what is survivorship bias?
the most common form of sample selection bias: a performance study includes only the entities that survived to the end of the period (for example, funds or portfolios still in existence) and leaves out those that failed, which tends to overstate performance
what is student's t-distribution?
a bell-shaped probability distribution that is symmetrical about its mean (used for small samples drawn from a normally distributed population whose variance is unknown)
what quantity do we typically refer to as sufficiently large when sampling a population?
n > 30
what is longitudinal data?
observations over time of multiple characteristics of the same entity (like unemployment, inflation, and GDP growth)
what is look-ahead bias?
occurs when a study tests a relationship using sample data that was not available on the test date
what is a confidence interval?
a range of values in which the population parameter is expected to lie with a specified probability (the degree of confidence)
what is cross sectional data?
a sample of observations taken at a single point in time
what is data mining?
occurs when analysts repeatedly use the same database to search for patterns or trading rules until one that "works" is discovered
what is sample selection bias?
occurs when some data is systematically excluded from the analysis, usually because of the lack of availability (results in a nonrandom sample)
what is level of significance?
alpha, which equals 1 minus the degree of confidence (for example, a 95% degree of confidence means a 5% level of significance)
what is the equation for the standard error of the sample mean when you are given the population standard deviation?
population standard deviation / square root of the size of the sample
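a quick arithmetic check of this formula, as a minimal sketch; the function name and the inputs (population standard deviation 20, sample size 100) are made up for illustration:

```python
import math


def standard_error_known_sigma(population_sd: float, n: int) -> float:
    """Standard error of the sample mean when the population sd is known."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return population_sd / math.sqrt(n)


# Hypothetical inputs: population standard deviation 20, sample size 100.
se = standard_error_known_sigma(20.0, 100)
print(se)  # 2.0
```

quadrupling the sample size only halves the standard error, because n enters through its square root.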
what is the data-mining bias?
refers to the results where the statistical significance of the pattern is overestimated because the results were found through data mining
what is systematic sampling?
selecting every nth member from a population as a sample
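one way to sketch "every nth member" is with a list slice; the helper name, the step of 5, and the starting position are assumptions chosen for illustration:

```python
def systematic_sample(population, step, start=0):
    """Return every `step`-th member of `population`, beginning at index `start`."""
    if step <= 0:
        raise ValueError("step must be positive")
    if not 0 <= start < step:
        raise ValueError("start must satisfy 0 <= start < step")
    return list(population[start::step])


# Hypothetical population: members numbered 1 through 20.
members = list(range(1, 21))
sample = systematic_sample(members, step=5, start=2)
print(sample)  # [3, 8, 13, 18]
```

in practice the starting position is usually chosen at random so that each member has a chance of being selected.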
what is a point estimate of a population parameter?
single (sample) values used to estimate population parameters
since we typically don't know population standard deviation in practice how do we estimate the standard error of the sample mean?
sample standard deviation / square root of the size of the sample
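a minimal sketch of the estimated standard error, using the sample standard deviation (computed with n - 1 in the denominator) in place of the unknown population value; the function name and the data are hypothetical:

```python
import math
import statistics


def standard_error_from_sample(sample):
    """Estimate the standard error of the mean as s / sqrt(n)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    return statistics.stdev(sample) / math.sqrt(n)


# Hypothetical sample of eight observations.
observations = [2, 4, 4, 4, 5, 5, 7, 9]
se_hat = standard_error_from_sample(observations)
print(round(se_hat, 4))  # 0.7559
```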
what is central limit theorem?
states that for simple random samples of size n from a population with mean μ and finite variance σ², the sampling distribution of the sample mean approaches a normal probability distribution with mean μ and variance σ²/n as the sample size becomes larger
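the theorem can be illustrated with a small simulation. This is a sketch under assumptions not in the card: the population is exponential with mean 1 and standard deviation 1 (clearly nonnormal), and the sample sizes, trial count, and seed are arbitrary:

```python
import math
import random
import statistics


def simulate_sample_means(n, trials, rng):
    """Means of `trials` samples of size n from an exponential(1) population."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(trials)
    ]


def share_within(means, mu, se, z=1.96):
    """Fraction of sample means lying within z standard errors of mu."""
    return sum(abs(m - mu) <= z * se for m in means) / len(means)


rng = random.Random(42)  # fixed seed so the run is repeatable
means_small = simulate_sample_means(2, 5000, rng)
means_large = simulate_sample_means(50, 5000, rng)

for n, means in ((2, means_small), (50, means_large)):
    se = 1.0 / math.sqrt(n)  # sigma / sqrt(n) with sigma = 1
    print(n, round(statistics.stdev(means), 3), round(se, 3),
          round(share_within(means, 1.0, se), 3))
```

for n = 50 the spread of the sample means should be close to σ/√n and roughly 95% of them should fall within 1.96 standard errors of the population mean, as a normal distribution would predict.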
what is time-period bias?
can result if the time period over which the data is gathered is either too short or too long (if it is too short, the results may reflect phenomena specific to that interval; if it is too long, the underlying relationships may have changed during the period)
what is simple random sampling?
method of selecting a sample in such a way that each item or person in the population being studied has the same likelihood of being included in the sample
what is a sampling distribution?
a probability distribution of all possible values of a sample statistic computed from a set of equal-size samples that were randomly drawn from the same population
how do we construct a confidence interval when the population distribution is normal but the variance is unknown?
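use the t-distribution with n - 1 degrees of freedom: sample mean ± t critical value × (sample standard deviation / square root of the size of the sample)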
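a minimal sketch of a t-based interval. The sample figures (n = 25, mean 10, sample standard deviation 5) are hypothetical, and the critical value 2.064 is the two-tailed 95% value for 24 degrees of freedom read from a standard t-table rather than computed:

```python
import math


def t_confidence_interval(sample_mean, sample_sd, n, t_critical):
    """Interval: sample mean +/- t critical value * (s / sqrt(n))."""
    if n < 2:
        raise ValueError("need at least two observations")
    margin = t_critical * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical sample: n = 25, mean 10, sample sd 5, so df = 24.
# 2.064 is the table value for a 95% two-tailed interval with 24 df.
low, high = t_confidence_interval(10.0, 5.0, 25, 2.064)
print(round(low, 3), round(high, 3))  # 7.936 12.064
```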
describe "consistency" when talking about the different desirable properties of an estimator.
the accuracy of the parameter estimate increases as the sample size increases
what is sampling error?
the difference between a sample statistic (the mean, variance, or standard deviation of the sample) and its corresponding population parameter (the mean, variance, or standard deviation of the population)
what are degrees of freedom?
the single parameter that defines the shape of the t-distribution; when estimating a population mean, degrees of freedom equal the number of sample observations minus 1 (n - 1)
what is the standard error of the sample mean?
the standard deviation of the distribution of the sample means
how do we construct a confidence interval when the population distribution is nonnormal and the variance is unknown?
the t statistic can be used as long as the sample is larger than 30
describe "efficiency" when talking about the different desirable properties of an estimator.
the variance of its sampling distribution is smaller than all the other unbiased estimators
how do we construct a confidence interval when the population distribution is nonnormal but the variance is known?
the z statistic can be used as long as the sample size is large (greater than 30)
what is stratified random sampling?
uses a classification system to separate the population into smaller groups (strata) based on one or more distinguishing characteristics, then draws a random sample from each group in proportion to its size (often used in bond indexing because of the difficulty and cost of completely replicating the entire population of bonds)
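a sketch of proportional stratified sampling. The bond universe, the maturity buckets, the 10% sampling fraction, and the helper name are all hypothetical:

```python
import random
from collections import Counter, defaultdict


def stratified_sample(items, stratum_of, fraction, rng):
    """Randomly sample about `fraction` of the items within each stratum."""
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        # Proportional allocation, with at least one pick per stratum.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical universe: 60 short, 30 medium, and 10 long maturity bonds.
bonds = ([("short", i) for i in range(60)]
         + [("medium", i) for i in range(30)]
         + [("long", i) for i in range(10)])
picked = stratified_sample(bonds, lambda b: b[0], 0.1, random.Random(7))
counts = Counter(b[0] for b in picked)
print(dict(counts))  # {'short': 6, 'medium': 3, 'long': 1}
```

each stratum is represented in the sample in the same 6:3:1 proportion as in the population, which a simple random sample of ten bonds would not guarantee.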
what is practical interpretation when talking about confidence intervals?
we can be 99% confident that the population mean lies between the lower and upper bounds of the confidence interval
how do we construct a confidence interval when the population distribution is normal and the variance is known?
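use the z-statistic: sample mean ± z critical value × (population standard deviation / square root of the size of the sample)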
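a minimal sketch of a z-based interval using the standard library's `statistics.NormalDist` to look up the critical value; the sample figures (mean 50, population standard deviation 10, n = 100) are hypothetical:

```python
import math
from statistics import NormalDist


def z_confidence_interval(sample_mean, population_sd, n, confidence=0.95):
    """Interval: sample mean +/- z critical value * (sigma / sqrt(n))."""
    alpha = 1 - confidence  # level of significance
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    margin = z * population_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical inputs: mean 50, known population sd 10, sample size 100.
low, high = z_confidence_interval(50.0, 10.0, 100)
print(round(low, 2), round(high, 2))  # 48.04 51.96
```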
what are the warning signs of data mining?
1. evidence that many different variables were tested, most of which are unreported, until significant ones were found 2. the lack of any economic theory that is consistent with the empirical results
what are some characteristics of t-distribution?
1. it is symmetrical 2. it is defined by a single parameter (degrees of freedom) 3. it has more probability in the tails than the normal distribution (fatter tails) 4. as the degrees of freedom (sample size) get larger, the shape of the t-distribution more closely approaches a standard normal distribution
what are the two limitations on the idea that larger is better when it comes to selecting an appropriate sample size?
1. larger samples may contain observations from a different population 2. cost: additional observations are more expensive to collect, and the expense may outweigh the gain in precision
what are the desirable properties of an estimator?
1. unbiased - expected value of the estimator is equal to the parameter you are trying to estimate 2. efficient - the variance of its sampling distribution is smaller than all the other unbiased estimators 3. consistent - the accuracy of the parameter estimate increases as the sample size increases
what is probabilistic interpretation when talking about confidence intervals?
if many samples are drawn and a 99% confidence interval is constructed from each, 99% of those intervals will, in the long run, include the population mean
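the long-run claim can be checked by simulation. This is a sketch under assumed conditions that are not in the card: a normal population with mean 100 and known standard deviation 15, samples of size 40, 2,000 repetitions, and a fixed seed:

```python
import math
import random
from statistics import NormalDist, fmean


def coverage_rate(mu, sigma, n, confidence, trials, rng):
    """Share of z-based intervals (known sigma) that contain the true mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / trials


# Hypothetical population and simulation settings.
rate = coverage_rate(mu=100.0, sigma=15.0, n=40, confidence=0.99,
                     trials=2000, rng=random.Random(1))
print(rate)
```

the printed share should be close to 0.99. Any single interval either contains the population mean or does not; the 99% describes the procedure over repeated samples.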
what is time series data?
consists of observations taken over a period of time at specific and equally spaced time intervals
what is panel data?
contains observations over time of the same characteristic for multiple entities, such as debt/equity ratios for 20 companies over the most recent 24 quarters
describe "unbiased" when talking about the different desirable properties of an estimator.
expected value of the estimator is equal to the parameter you are trying to estimate
