Descriptive Statistics
A ________ can be displayed using a table or a diagram
distribution
The ____ variable is labeled X and is on the _____ axis; Y is the _______ variable and is mapped on the _____ axis: Y = f(X)
independent, horizontal, dependent, vertical
Negative kurtosis
indicates a relatively flat distribution
Positive kurtosis
indicates a relatively peaked distribution
We _______ something about the _________ using the _________ through statistical ________.
infer, population, sample, inference
Statistical inference
is the process of making an estimate, prediction, or decision about a population based on a sample.
Volatility clustering
"large changes tend to be followed by large changes, of either sign, and small changes tend to be followed by small changes" (Mandelbrot 1963)
for small changes in p
$\ln(p_t) - \ln(p_{t-1}) \approx \dfrac{p_t - p_{t-1}}{p_{t-1}}$
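A quick numerical check of this approximation may help. This is only a sketch: the prices are made-up illustrative values and the function names are not from the notes.

```python
import math

def simple_return(p_prev, p_now):
    """Simple return (p_t - p_{t-1}) / p_{t-1}."""
    return (p_now - p_prev) / p_prev

def log_return(p_prev, p_now):
    """Log return ln(p_t) - ln(p_{t-1})."""
    return math.log(p_now) - math.log(p_prev)

# Hypothetical prices: a small price change and a large one
small_gap = abs(log_return(100.0, 100.5) - simple_return(100.0, 100.5))
large_gap = abs(log_return(100.0, 150.0) - simple_return(100.0, 150.0))

# The two return measures nearly coincide only when the change is small
print(f"gap for a 0.5% move: {small_gap:.6f}")
print(f"gap for a 50% move:  {large_gap:.6f}")
```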
Stock prices are assumed to follow a _______; as a result, the log of the price has a normal distribution and ________
lognormal distribution, log returns are normally distributed
A ________ is some characteristic of a population or sample, e.g. inflation, gender, stock price
variable
Statistical analysis plays an important role in _____
virtually all aspects of business and economics.
How to calculate b0 and b1?
$\hat{y} = b_0 + b_1 x$, with $b_1 = \dfrac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$ and $b_0 = \bar{y} - b_1 \bar{x}$
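The two formulas can be sketched in a few lines of plain Python. The data points and function names below are hypothetical, chosen only so the arithmetic is easy to follow by hand; the sample (n - 1) versions of covariance and variance are assumed.

```python
def mean(values):
    return sum(values) / len(values)

def sample_covariance(x, y):
    """Sample covariance with the (n - 1) denominator."""
    mean_x, mean_y = mean(x), mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def least_squares_line(x, y):
    """Return (b0, b1) using b1 = Cov(x, y) / Var(x) and b0 = y_bar - b1 * x_bar."""
    b1 = sample_covariance(x, y) / sample_covariance(x, x)  # Var(x) = Cov(x, x)
    b0 = mean(y) - b1 * mean(x)
    return b0, b1

# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = least_squares_line(x, y)
print(f"y_hat = {b0:.2f} + {b1:.2f} x")
```

For these numbers Cov(x, y) = 1.5 and Var(x) = 2.5, so the slope is 0.6 and the intercept is 4 - 0.6 · 3 = 2.2.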
Interval-ratio variables
Values are real numbers. All calculations are valid. Data may be treated as ordinal or nominal.
Nominal variables
Values are arbitrary numbers that represent categories. Only calculations based on the frequencies of occurrence are valid. Data may not be treated as ordinal or interval.
Ordinal variables
Values must represent the ranked order of the data. Calculations based on an ordering process are valid. Data may be treated as nominal but not as interval.
Mean
μ x̄
Coefficient of correlation
ρ_xy r_xy
Standard deviation
σ s
Variance
σ^2 s^2
Covariance
σ_xy s_xy
Sample
A sample is a set of data drawn from the population. It is potentially very large, but smaller than the population, e.g. a sample of 765 voters exit-polled on election day.
Population
A population is the group of all items of interest to a statistics practitioner. It is frequently very large, sometimes infinite, e.g. all 5 million Florida voters.
Nature of time series data II
• A sequence of random variables indexed by time is called a stochastic process or time series process (data generating process).
• Stationary time series: a time series whose statistical properties are independent of time:
  - The process generating the data has a constant mean.
  - The variability of the time series is constant over time.
• What is the population in the case of time series data?
  - When we collect time series data, we obtain one possible outcome (realisation) of the stochastic process.
  - We can only see a single realisation of all possible realisations that might have occurred if certain conditions in history had been different.
  - The set of all possible realisations of a time series process plays the role of the population in cross-sectional analysis.
Systematic and firm-specific risk
• Beta (the slope coefficient) is a measure of the stock's market-related (or systematic) risk: it measures the volatility of the stock price that is related to overall market volatility.
• The coefficient of determination (R²) measures the proportion of the total risk (= market-related risk + firm-specific risk) that is market related.
  - R² = 0.65 → 65% of GE's total risk is market related and 35% is firm-specific (also called nonsystematic or idiosyncratic) risk.
  - The firm-specific risk is attributable to variables that are not included in the market model (e.g. GE's managerial competencies).
  - Firm-specific risk can be diversified away by creating a diversified portfolio of stocks; market-related risk cannot.
Coefficient of Correlation I
• The coefficient of correlation ($r_{xy}$) measures the strength of linear association between two numerical variables.
• $r_{xy}$ lies in $[-1, +1]$, and $r_{xy} = r_{yx}$.
• The correlation coefficient is defined as the covariance divided by the product of the standard deviations of the two variables:
  - Population coefficient of correlation: $\rho_{xy} = \sigma_{xy} / (\sigma_x \sigma_y)$
  - Sample coefficient of correlation: $r_{xy} = s_{xy} / (s_x s_y)$
• Correlation between assets is highly relevant for diversification:
  - Most correlations in finance are positive, as there is a mutual dependence on the economy (business cycle).
  - The correlation between most pairs of companies is between 0.2 and 0.3.
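The sample formula can be illustrated as follows. This is a sketch only: the two return series are hypothetical, and the function names are assumptions made for the example.

```python
import math

def sample_covariance(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """r_xy = s_xy / (s_x * s_y), using sample (n - 1) statistics."""
    s_x = math.sqrt(sample_covariance(x, x))
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# Hypothetical periodic returns of two assets
asset_a = [0.01, 0.03, -0.02, 0.04]
asset_b = [0.02, 0.01, -0.01, 0.03]

r_ab = sample_correlation(asset_a, asset_b)
r_ba = sample_correlation(asset_b, asset_a)
print(f"r_ab = {r_ab:.4f}, r_ba = {r_ba:.4f}")  # the two values agree
```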
Inference statistics: two techniques
Estimation and hypothesis testing are the two techniques of inference statistics.
Correlation vs causation
• If two variables are linearly related, it does not mean that X is causing Y: correlation is not causation.
• Establishing causality in the social sciences is very challenging.
• The most convincing way to search for causal effects of X on Y would be to run an experiment with a treatment group and a control group.
• A well-designed experiment controls for confounding variables.
• The opposite of experimental studies are observational studies, which use nonexperimental (observational) data.
Log-returns
In quantitative finance, returns are usually calculated using the natural log (continuously compounded return): $r_t = \ln(p_t) - \ln(p_{t-1}) = \Delta \ln(p_t)$
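As a small sketch of the definition, log returns can be computed from a price series by differencing the log prices. The prices are hypothetical. The final check uses a standard property that the card itself does not state: log returns add up across periods.

```python
import math

prices = [100.0, 102.0, 101.0, 105.0]  # hypothetical closing prices

# r_t = ln(p_t) - ln(p_{t-1}) for each consecutive pair of prices
log_returns = [
    math.log(p_now) - math.log(p_prev)
    for p_prev, p_now in zip(prices, prices[1:])
]

# Summing the per-period log returns gives the log return over the whole span
total_log_return = sum(log_returns)
print(log_returns)
print(total_log_return, math.log(prices[-1] / prices[0]))
```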
Coefficient of Determination (R²)
• In the case of simple linear regression, R² is calculated by squaring the coefficient of correlation, i.e. r².
• The coefficient of determination measures the amount of variation in the dependent variable that is explained by the variation in the independent variable.
• Example: R² = 0.65, i.e. 65% of the variation of Y is explained by the variation of X.
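The equality R² = r² can be checked numerically. The sketch below uses hypothetical data and computes R² as one minus the ratio of the residual sum of squares to the total sum of squares, which is a standard definition assumed here rather than taken from the notes.

```python
import math

# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
r = s_xy / (s_x * s_y)  # coefficient of correlation

# Least squares line, then explained share of the variation in y
b1 = s_xy / s_x ** 2
b0 = mean_y - b1 * mean_x
residual_ss = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
total_ss = sum((b - mean_y) ** 2 for b in y)
r_squared = 1 - residual_ss / total_ss

print(f"r^2 = {r ** 2:.4f}, R^2 = {r_squared:.4f}")
```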
Autocorrelation
• In time series data, the value of $Y_t$ in one period is typically correlated with its past value $Y_{t-1}$ and its future value $Y_{t+1}$.
• The correlation of a series with its own lagged values is called autocorrelation or serial correlation.
• The first lag of a time series $Y_t$ is $Y_{t-1}$; the $j$-th lag is $Y_{t-j}$.
• The first autocorrelation coefficient $r_1$ is the correlation between $Y_t$ and $Y_{t-1}$; the second, $r_2$, is the correlation between $Y_t$ and $Y_{t-2}$.
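A minimal sketch of the j-th autocorrelation coefficient follows. It uses one common estimator (lagged cross-products of deviations from the overall mean, divided by the total sum of squared deviations); the notes do not say which estimator is intended, so treat this choice and the toy series as assumptions.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag (one common estimator)."""
    n = len(series)
    mean = sum(series) / n
    deviations = [value - mean for value in series]
    # Pair each observation with its own value `lag` periods earlier
    numerator = sum(deviations[t] * deviations[t - lag] for t in range(lag, n))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator

trending = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical persistent series
alternating = [1, -1, 1, -1, 1, -1]   # hypothetical series that flips sign each period

r1_trending = autocorrelation(trending, 1)
r1_alternating = autocorrelation(alternating, 1)
print(r1_trending, r1_alternating)
```

The persistent series has a positive first autocorrelation, while the sign-flipping series has a negative one.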
What is Statistics?
"Statistics is a way to get information from data"
Measures of reliability to make inferential statistics more correct
• Because conclusions drawn from a sample are not always correct, statistical inference builds in "measures of reliability", namely the confidence level and the significance level.
• Confidence level: the proportion of times an estimating procedure will be correct (e.g. 95%).
• Significance level: how frequently a conclusion (the result of a hypothesis test) about the population will be wrong (e.g. 5%).
Why use statistical inference?
Large populations make investigating each member impractical and expensive. It is easier and cheaper to take a sample and make estimates about the population from the sample.
Application in finance
Sharpe ratio, market model
Cons of statistical inference
Conclusions and estimates drawn from a sample are not always going to be correct.
Symmetrical distribution
Skewness = 0
normal distribution
0
empirical rule
68%, 95%, 99.7%
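For a normal distribution, these three percentages are the probabilities of falling within 1, 2 and 3 standard deviations of the mean. A short sketch that reproduces them from the standard normal distribution using only the standard library (the function name is an assumption for this example):

```python
import math

def prob_within(k):
    """P(|Z| < k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))

coverage = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} standard deviation(s): {p:.4f}")
```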
Covariance
A measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship
Anscombe's quartet
A straight line is not always the best way to represent a bivariate distribution
Market model for General Electric
Beta = 1.6166: a 1 percentage point increase in the S&P index return is associated with an increase in the GE return of 1.6166 percentage points. Beta > 1 → GE is more volatile and hence riskier than the S&P index (volatile security).
Measures of Central Location
Arithmetic Mean, Geometric Mean, Median, Mode
________ can take on any value and are not confined to specific numbers. Their values are limited only by precision, e.g. the rental yield on a property could be 6.2%, 6.24%, or 6.238%.
Continuous data
Measures of Linear Relationship
Covariance, Correlation, Coefficient of Determination, Least Squares Line
________ are the observed values of a variable, e.g. student marks: {67, 74, 71, 83, 93, 55, 48}
Data
Two modes of statistical analysis
Descriptive statistics, Inference statistics
________ can only take on certain values, which are usually integers, e.g. the number of people in a particular underground carriage or the number of shares traded during a day.
Discrete data
Distribution of a categorical variable:
Groups data into categories and records the number of observations that fall into each category
Distribution of a quantitative variable:
Groups data into intervals (bins, classes) and records the number of observations that fall into each interval
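Both kinds of frequency distribution can be sketched with a counter. The student marks are the ones used as an example elsewhere in these notes; the letter grades and the bin width of 10 are assumptions made only for illustration.

```python
from collections import Counter

# Categorical variable: count the observations in each category (hypothetical data)
grades = ["A", "B", "A", "C", "A", "B"]
category_counts = Counter(grades)

# Quantitative variable: group the marks into intervals (bins) of equal width
marks = [67, 74, 71, 83, 93, 55, 48]
bin_width = 10  # assumed bin width
bin_counts = Counter((mark // bin_width) * bin_width for mark in marks)

print(dict(category_counts))
for lower in sorted(bin_counts):
    print(f"{lower}-{lower + bin_width - 1}: {bin_counts[lower]}")
```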
________ of a distribution measures how much mass is in its tails; the greater the ______, the more likely are outliers
Kurtosis
Numerical descriptive techniques
Measures of Central Location Measures of Variability Measures of Relative Standing Measures of Linear Relationship
size
N n
Descriptive statistics
One form of descriptive statistics uses graphical techniques. Another form uses numerical techniques to summarize data (e.g. mean, variance).
Population
Parameters
Measures of Relative Standing
Percentiles, Quartiles
Skewed right distribution
Positive Skewness. mean > median
Measures of linear relationship
Provide information as to the strength & direction of a linear relationship between two variables (1) Covariance (2) Coefficient of correlation (3) Coefficient of determination
Simple return of an asset
$R_t = \dfrac{p_t - p_{t-1}}{p_{t-1}}$
Measures of Variability
Range, Standard Deviation, Variance, Coefficient of Variation
i.i.d.
Simple random sampling results in independently and identically distributed (i.i.d.) random variables, i.e. the draws are independent of one another and all have the same distribution.
_________ measures the lack of symmetry of a distribution
Skewness
_________ measures the average distance of the values from the mean.
Standard deviation
Sample
Statistic
The __________ are the range of possible values for a variable, e.g. student marks (0..100)
values of the variable
Geometric mean
Applied to annual returns, the geometric mean is also known as the compound annual growth rate (CAGR).
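A minimal sketch of the geometric mean return, assuming three hypothetical annual returns. It also prints the arithmetic mean for comparison; the variable names are not from the notes.

```python
import math

annual_returns = [0.10, -0.05, 0.20]  # hypothetical annual returns

# Compound the growth factors (1 + r), then take the n-th root
growth_factors = [1 + r for r in annual_returns]
total_growth = math.prod(growth_factors)
geometric_mean_return = total_growth ** (1 / len(annual_returns)) - 1

arithmetic_mean_return = sum(annual_returns) / len(annual_returns)

print(f"geometric mean (CAGR): {geometric_mean_return:.4f}")
print(f"arithmetic mean:       {arithmetic_mean_return:.4f}")
```

Growing at the geometric mean rate every year reproduces the total growth over the period, which is why it is read as a compound growth rate.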
Scatter diagram
a graph that shows the degree and direction of relationship between two variables
A distribution (or frequency distribution) of a variable shows
all the possible values of the variable and how often they occur
Financial econometrics
application of statistical techniques to problems in finance
Ordinal and nominal variables are also called
categorical or qualitative variables.
The _______ measures the amount of variation in the dependent variable that is explained by the variation in the independent variable
coefficient of determination
panel data
data collected over time for the same statistical unit(s) (at least two) on one or more variables, i.e. repeated cross sections over time
cross sectional data
data collected at the same or approximately the same point in time for one or more statistical unit(s) on one or more variable(s)
time-series data
data collected over several time periods
Descriptive statistics
deals with methods of organizing, summarizing, and presenting data in a convenient and informative way.
Skewed left distribution
negative skewness. mean < median
Interval variables are also known as
numerical or quantitative variables.
Coefficient of Correlation II
Positive linear relationship: +; negative linear relationship: -; independent: 0; curvilinear: 0
Finance
quantitatively oriented field of research
Statistical inference is only valid if the sample data are collected via _________
random sampling.
A ________ _________ is a variable, whose outcome is uncertain
random variable
The Gaussian distribution gives a poor estimate for the occurrence of
rare events
A random sample is said to be a _________ __________
representative sample
risk on the bell
Risk is low when the bell curve is tall and narrow. The wider the bell, and the further outcomes lie from the average, the higher the risk.
The _____ is a subset of a ________.
sample, population
In the case of __________, R² is calculated by squaring the coefficient of correlation, i.e. r²
simple linear regression
A _____________ ________ _________ is a sample selected in such a way that every possible sample of the same size is equally likely to be chosen
simple random sample
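In Python, a simple random sample without replacement can be sketched with the standard library. The population of 100 labelled items and the fixed seed are assumptions made only so the example is repeatable.

```python
import random

population = list(range(1, 101))  # hypothetical population of 100 labelled items
rng = random.Random(42)           # fixed seed only so the sketch is repeatable

# Draw 10 distinct items; every subset of size 10 is equally likely to be chosen
sample = rng.sample(population, 10)
print(sample)
```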
Quantitative Finance
the application of probability and statistics to finance
very large losses/gains occur much more frequently than predicted by
the normal distribution ("fat tails")
With random sampling, the value of
the variable for the next random draw is uncertain
Random sampling does not work in the same way with _____________
time series data
With ______ _____ _____, we usually do not know future values, hence they are uncertain
time series data
Statistics is a ______ for creating ____ ________ from a set of numbers.
tool, new understanding
The least squares method
• The LS method gives the intercept and slope of a straight line such that the sum of squared deviations between the data points and the line is minimized: $\min \sum (y - \hat{y})^2 = \min \sum \text{residuals}^2$
• The estimated line is $\hat{y} = b_0 + b_1 x$, with $b_0$ and $b_1$ estimated by the LS method:
  - $b_0$ is the intercept, i.e. the value of $\hat{y}$ when $x = 0$.
  - $b_1$ is the slope and shows the change in $\hat{y}$ if $x$ increases by one unit: $b_1 = d\hat{y}/dx$
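The minimization property can be checked numerically. The sketch below uses hypothetical data, computes the least squares coefficients from the covariance and variance formulas, and compares the sum of squared residuals with that of a few nearby lines; the size of the perturbation (0.1) is an arbitrary choice.

```python
# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

def sum_squared_residuals(b0, b1):
    """Sum of squared vertical distances between the data and the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Least squares coefficients: slope = Cov(x, y) / Var(x), intercept from the means
b1_ls = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
b0_ls = mean_y - b1_ls * mean_x
ssr_ls = sum_squared_residuals(b0_ls, b1_ls)

# Any nearby line should fit worse (larger sum of squared residuals)
shifts = (-0.1, 0.0, 0.1)
perturbed = [
    sum_squared_residuals(b0_ls + d0, b1_ls + d1)
    for d0 in shifts for d1 in shifts
    if (d0, d1) != (0.0, 0.0)
]
print(ssr_ls, min(perturbed))
```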
The Market Model
• The market model is the empirical counterpart of the theoretical CAPM: $E(R_i) - r_f = \beta_i \, (E(R_m) - r_f)$
• The market model assumes that the return on stock $i$ is linearly related to the rate of return on the stock market index: $R_i = \alpha + \beta_i R_m + \varepsilon$
• If the CAPM holds, $\alpha$ should be zero.
• The market model says that the return on a stock depends on:
  (1) the return on the market portfolio (stock market index) and the extent of the stock's responsiveness to changes in the overall market, as measured by beta;
  (2) conditions that are unique to the firm.
Nature of time series data I
• The sample size of time series data is the number of periods over which we observe the variable of interest.
• Data frequency denotes the interval at which time series data are collected (yearly, quarterly, monthly, daily, real time).
• In contrast to cross-sectional data, where the assumption of independently and identically distributed data is key, time series data are serially correlated (autocorrelated): there are dependencies between past and future values.
• If the past behavior of the time series is expected to continue in the future, it can be used as a guide in selecting an appropriate forecasting method.
Covariance matrix (variance-covariance matrix)
• Symmetric array of numbers.
• The covariance matrix generalizes the notion of variance to multiple dimensions.
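A small sketch of both points, assuming three hypothetical return series: the variances sit on the diagonal, the pairwise covariances sit off the diagonal, and the matrix is symmetric. Function and variable names are illustrative only.

```python
def sample_covariance(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def covariance_matrix(columns):
    """Entry [i][j] is the sample covariance between variable i and variable j."""
    return [[sample_covariance(a, b) for b in columns] for a in columns]

# Hypothetical returns of three assets over four periods
returns = [
    [0.01, 0.03, -0.02, 0.04],
    [0.02, 0.01, -0.01, 0.03],
    [-0.01, 0.00, 0.02, -0.02],
]
cov = covariance_matrix(returns)
for row in cov:
    print([round(value, 6) for value in row])
```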
Sharpe Ratio
• The Sharpe ratio (named after Nobel laureate William Sharpe) is used to characterize how well the return of an asset compensates for the risk that the investor takes.
• Investors are often advised to pick investments that have high Sharpe ratios: the higher the Sharpe ratio, the better the investment compensates its investors for risk.
• The Sharpe ratio ($S_r$) measures the extra reward per unit of risk: $S_r = (\bar{x}_I - \bar{R}_f) / s_I$
  - $\bar{x}_I$: mean return for the investment
  - $\bar{R}_f$: mean return for a risk-free asset (short-term government bonds, T-bills)
  - $\bar{x}_I - \bar{R}_f$: excess return, which measures the extra reward investors receive for the added risk taken
  - $s_I$: standard deviation for the investment, which measures the amount of risk
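The formula can be sketched with the standard library. The return series below are hypothetical, and the sample standard deviation (n - 1 denominator) is assumed for s_I.

```python
import statistics

# Hypothetical periodic returns
investment_returns = [0.02, 0.01, 0.03, -0.01, 0.05]
risk_free_returns = [0.005, 0.005, 0.005, 0.005, 0.005]

mean_investment = statistics.mean(investment_returns)
mean_risk_free = statistics.mean(risk_free_returns)
excess_return = mean_investment - mean_risk_free    # extra reward for the added risk
risk = statistics.stdev(investment_returns)         # sample standard deviation s_I

sharpe_ratio = excess_return / risk
print(f"Sharpe ratio: {sharpe_ratio:.4f}")
```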
Beta of a portfolio
• To estimate beta for a portfolio, we take the weighted average of the betas of the portfolio's stocks, using the portfolio weights.
• If an investor believes that she is in a bull (bear) market, a portfolio with a beta greater (smaller) than 1 is desirable.
• Risk-averse investors may prefer portfolios with a beta below 1, i.e. a portfolio made up of defensive securities.
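A minimal sketch of a portfolio beta, computed as the average of the stock betas weighted by each stock's share of the portfolio (with equal weights this reduces to the simple average). The GE beta is the value quoted in these notes; the other two betas and all weights are hypothetical.

```python
# Beta of GE from the notes; "Stock B" and "Stock C" are hypothetical
betas = {"GE": 1.6166, "Stock B": 0.8, "Stock C": 1.1}
weights = {"GE": 0.5, "Stock B": 0.3, "Stock C": 0.2}  # assumed portfolio weights

# Weights are shares of the portfolio, so they should sum to one
assert abs(sum(weights.values()) - 1.0) < 1e-12

portfolio_beta = sum(weights[name] * betas[name] for name in betas)
print(f"portfolio beta: {portfolio_beta:.4f}")
```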
Inference statistics
A body of methods used to draw conclusions or inferences about characteristics of populations based on sample data.