Topic 5: Correlation, covariance and causation
Confidence intervals and limits
A point estimate is a single number that is used to estimate a population parameter. For example, the sample mean is a point estimate of the population mean.
Key concept: Confidence interval
A confidence interval is an estimated range within which the population parameter is likely to be contained. For instance, a researcher might want to derive the 95% confidence interval for the population mean. This implies that in repeated sampling, 95% of such confidence intervals will include the true (unknown) population mean.
Key concept: Confidence limits
Confidence limits are the lower and upper bounds of the confidence interval. For example, the 95% confidence interval for industry sales might equal '$100 million up to and including $120 million'. Both $100 million and $120 million are included in the interval (i.e. the interval includes the two limits). The lower confidence limit is $100 million and the upper confidence limit is $120 million.
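As a rough illustration of how confidence limits for a population mean might be computed, the sketch below uses the normal approximation (mean plus or minus a critical value times the standard error). The sales figures and the function name `mean_confidence_interval` are hypothetical, and for a sample this small a t-distribution critical value would normally be preferred.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Return (lower limit, upper limit) for the population mean.

    Assumption: normal approximation; small samples would usually
    use a t-distribution critical value instead.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_error, mean + z * std_error

# Hypothetical annual industry sales, in $ million.
sales = [104, 112, 108, 115, 109, 111, 106, 113, 110, 112]
lower, upper = mean_confidence_interval(sales)
print(f"95% confidence limits: {lower:.1f} to {upper:.1f}")
```

The point estimate (the sample mean) sits at the centre of the interval, and the two returned values are the lower and upper confidence limits.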
Scatter plot
A scatter plot or scatter diagram is a graph of paired observations of two variables (e.g. Y and X) and is used to illustrate the relationship between the variables.
Spurious correlation
A spurious relationship refers to an invalid inference drawn from an observed correlation. It is a special case of coincidence. In particular, a spurious relationship often arises from highly correlated (either positive or negative) variables that have no logical connection.
Example: Sharks and sunglasses
A researcher might observe a simultaneous increase in shark attacks and sales of sunglasses and assume that one is causing the other. The researcher in this example ignores the omitted variables (sun and high temperatures) that were actually driving both.
Example: Population increase and earth rotation slowing
As the population of the earth has steadily increased by approximately seven billion over the past five million years, the length of the day has steadily increased by a minute and a half. With so many data points, the relationship between these two increasing series would be highly statistically significant. The reality, however, is that humanity is just going about its business and, unrelatedly, the earth's rotation is slowing down.
Statistically significant correlation estimate
A statistically significant correlation estimate is one where its estimated confidence interval does not include the value of zero. Consider the confidence interval [-0.20, 0.30], which includes zero. In this case, we cannot reject the hypothesis that the (population) correlation parameter equals zero. In other words, we would conclude that the correlation estimate is not statistically significant (the estimated value is not significantly different from zero). If the correlation estimate is not statistically significant, then we cannot reject the hypothesis that there is no (linear) relationship between the two variables. Conversely, if the estimated confidence interval does not include zero (e.g. [0.34, 0.63]), we can reliably reject the hypothesis that the population correlation equals zero. We would conclude that the correlation estimate is statistically significant (i.e. the estimated value is significantly different from zero). If the correlation estimate is statistically significant, then we can reject the hypothesis that there is no (linear) relationship between the two variables.
Importance of the sign of covariance
As explained earlier, the covariance is unbounded, ranging from negative to positive infinity. The magnitude depends on the unit of measurement for the two variables. Therefore, the actual magnitude of the covariance provides little insight into the strength of the relationship. The most important information provided by the covariance is its sign (positive versus negative), indicating whether the variables exhibit a positive or negative relationship on average.
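The dependence of covariance on units can be seen in a small sketch. The GDP growth and sales figures below are hypothetical; the same sales series is expressed first in $ million and then in dollars.

```python
def covariance(xs, ys):
    """Sample covariance, using the n - 1 divisor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: GDP growth (%) and industry sales.
gdp_growth_pct = [1.0, 2.0, 3.0, 4.0, 5.0]
sales_millions = [100.0, 104.0, 109.0, 111.0, 120.0]    # $ million
sales_dollars = [s * 1_000_000 for s in sales_millions]  # same data in $

cov_millions = covariance(gdp_growth_pct, sales_millions)
cov_dollars = covariance(gdp_growth_pct, sales_dollars)

print(f"covariance with sales in $ million: {cov_millions:.2f}")
print(f"covariance with sales in $:         {cov_dollars:.2f}")
```

The magnitude changes by a factor of one million purely because of the change in units, while the sign stays positive in both cases.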
Fisher transformation
As r approaches either -1 or +1, its variance gets smaller and its sampling distribution becomes increasingly skewed. This departs from the properties of a normally distributed variable, so the usual formula for the lower and upper limits of a confidence interval cannot be applied directly to r. However, this issue can be resolved by converting r using the Fisher transformation.
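A minimal sketch of how the transformation is typically used to build a confidence interval for a correlation follows. It assumes the common textbook form z = atanh(r) with standard error 1/sqrt(n - 3), which holds approximately for bivariate normal data; neither formula is stated in this topic, and the function name and the example values of r and n are hypothetical.

```python
import math
from statistics import NormalDist

def correlation_confidence_interval(r, n, confidence=0.95):
    """Approximate confidence limits for a correlation via Fisher's z."""
    z = math.atanh(r)                    # Fisher transformation of r
    std_error = 1.0 / math.sqrt(n - 3)   # assumed approximation (bivariate normal data)
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Build the interval on the z scale, then convert back to the r scale.
    return math.tanh(z - crit * std_error), math.tanh(z + crit * std_error)

strong = correlation_confidence_interval(0.50, 50)  # hypothetical r and n
weak = correlation_confidence_interval(0.05, 30)    # hypothetical r and n

for label, (lower, upper) in (("r = 0.50, n = 50", strong), ("r = 0.05, n = 30", weak)):
    verdict = "excludes zero" if lower > 0 or upper < 0 else "includes zero"
    print(f"{label}: [{lower:.2f}, {upper:.2f}] {verdict}")
```

Because the interval is built on the transformed scale and converted back, the resulting limits always stay within -1 and +1 and are generally not symmetric around r.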
Ceteris Paribus
Ceteris paribus is a Latin phrase meaning 'all other things being equal'. In analysis, it refers to attempting to isolate particular (causal) factors from a range of other potential environmental variables. The phrase is used more often in social science than in physical science because controlled experiments are rarely possible in the former. In order to predict the most likely outcome of a certain action in social science, one assumes that the actions and reactions of other factors can be held constant, as if one were conducting a controlled experiment. The ceteris paribus assumption complicates questions of correlation and causation. In financial markets, and the real world generally, there are so many things happening all the time that it can be hard to separate the key drivers from everything else, some of which may be peripheral noise.
Stability of correlation coefficients
The correlation coefficient (r) is a poor guide when determining whether an association between x and y is statistically significant, or whether one correlation is significantly stronger than another. This is because the behaviour of r depends on the underlying distributions of x and y, yet these distributions are not taken into account in its calculation: the null hypothesis states only that the correlation is zero and makes no further assumption about the distributions of x and y. In fact, the interpretation of the correlation coefficient can be completely meaningless if the underlying joint probability distribution of x and y is too different from a bivariate normal distribution.
Table 6 shows, for different numbers of monthly observations, the probabilities of various non-zero correlations occurring when x and y are independent and identically normally distributed: x, y ~ NID(1, 0.5), where NID stands for 'normally and independently distributed' and where the correlation between x and y should be 0%. This result is important when deciding the amount of data to use in calculating accurate correlations and how much importance to place on any published correlations. For instance, with one year (12 months) of monthly data (see column 1Y), a squared correlation of 30% will occur by chance 7% of the time. Thus, it is difficult to infer anything reliable from this type of correlation. However, with six years (72 months) of monthly data (see column 6Y), the probability falls from 7% to 0%, and correlations based on this data set would be much more reliable.
Remember that the values in Table 6 only apply if the underlying joint probability distribution is bivariate normal. Although it is often assumed that financial data is normally distributed, this assumption rarely stands up to empirical testing. Thus, the normality assumption underlying the table would not hold with most real financial data. These simulated results do provide some insight into the reliability of calculated correlation values with different amounts of data, and the size of some of the probabilities is surprisingly high.
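The kind of simulation behind such a table can be sketched as follows. This is not the procedure used to produce Table 6, only an illustration under stated assumptions: the 0.5 in NID(1, 0.5) is treated as a variance (the choice does not affect correlations), 'a squared correlation of 30%' is read as r squared of at least 0.30, and the trial count, seed and function names are arbitrary.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def chance_of_r_squared(n_months, threshold=0.30, trials=4000, seed=1):
    """Share of trials in which two independent normal series show
    a squared correlation of at least `threshold` purely by chance."""
    rng = random.Random(seed)
    sd = math.sqrt(0.5)  # assumption: the 0.5 in NID(1, 0.5) is a variance
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(1.0, sd) for _ in range(n_months)]
        ys = [rng.gauss(1.0, sd) for _ in range(n_months)]
        if pearson(xs, ys) ** 2 >= threshold:
            hits += 1
    return hits / trials

p_one_year = chance_of_r_squared(12)
p_six_years = chance_of_r_squared(72)
print(f"12 months: {p_one_year:.1%} of trials reach r squared >= 0.30")
print(f"72 months: {p_six_years:.1%} of trials reach r squared >= 0.30")
```

With 12 observations a noticeable share of purely random pairs clears the threshold, while with 72 observations almost none do, which is the pattern the text describes.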
Coincidence or chance
Even a relationship that is 'statistically significant at the 95% level' may be the outcome of coincidence: when no true relationship exists, a test at this level will still report a significant result about 5% of the time. Of particular relevance in this regard is the ex-post (after events) 'trap', or 'seeing is believing'. That is, just because an event occurs which had a very small chance of occurring ex ante (before events), it does not mean that it must have been originally set in motion by design. A link between the events should not be immediately established without further proof.
Correlation
In statistics, correlation is a measure of the linear relationship between two random variables in terms of both the strength and direction of the relationship. The strength of the correlation between two variables determines how well the value of one of the variables can be predicted from the value of the other. The relationship is linear when a given change in the independent variable always results in a proportional change in the dependent variable. Such a relationship, plotted on a graph, results in a straight line. However, in general terms, correlation can refer to any kind of relationship, i.e. non-linear as well as linear.
By combining asset classes that are not correlated with each other, a portfolio can be produced with better risk-return characteristics than either asset class alone. This effect is extremely powerful, and occurs whenever the asset classes are not perfectly correlated with one another.
It is sometimes difficult to interpret a covariance because the scale of measurement or dispersion of each of the variables affects the covariance estimate. For this reason, the correlation coefficient is often used instead.
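The diversification effect can be illustrated with the standard two-asset portfolio variance formula, which is not stated in this section and is assumed here. The weights, volatilities and correlations are hypothetical.

```python
import math

def two_asset_volatility(w1, sigma1, sigma2, rho):
    """Volatility of a portfolio holding weight w1 in asset 1 and 1 - w1 in asset 2."""
    w2 = 1.0 - w1
    variance = ((w1 * sigma1) ** 2 + (w2 * sigma2) ** 2
                + 2 * w1 * w2 * rho * sigma1 * sigma2)
    return math.sqrt(variance)

# Hypothetical asset classes: volatilities of 20% and 10%, equal weights.
for rho in (1.0, 0.5, 0.0, -0.5):
    vol = two_asset_volatility(0.5, 0.20, 0.10, rho)
    print(f"correlation {rho:+.1f}: portfolio volatility {vol:.4f}")
```

Only at a correlation of +1 does the portfolio volatility equal the weighted average of the two volatilities; for any lower correlation it is smaller.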
Benefits of non-parametric correlation
Non-parametric correlation is particularly suited to financial data because it is much more robust than the standard Pearson correlation. Similarly, the median is more robust than the mean, and the mean absolute deviation (MAD) is more robust than the standard deviation. This robustness matters because of the large number of outliers in financial data and, statistically speaking, the small number of data points normally available. When outliers or deviations from normality are not present in a data set, rank correlation produces essentially the same values as a standard Pearson correlation. The rank correlation equivalent of Table 5, where everything is normally distributed, would look the same and have very similar values. The benefit of rank correlation when the underlying distributions have outliers can be demonstrated by simulating this effect and producing a table of spurious correlations, analogous to Table 5, for Pearson and Spearman correlations.
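The robustness of rank correlation to a single outlier can be seen in a small sketch. Spearman rank correlation is computed here as the Pearson correlation of the ranks; the data and function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank from 1 upwards, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical data: y rises with x, but the last observation is an extreme outlier.
x = [float(v) for v in range(1, 11)]
y = x[:-1] + [1000.0]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

The single extreme value pulls the Pearson correlation well below 1 even though y never falls as x rises, whereas the rank correlation depends only on the ordering and is unaffected.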
Topic learning outcomes
On completing this topic, students should be able to:
• analyse the factors affecting financial instruments, identifying which:
  − are significant
  − are positively or negatively interrelated
  − contribute to diversification
  − contribute to volatility
• apply the statistical analyses of covariance, correlation and matrix algebra
• explain the strengths and weaknesses of the Pearson and Spearman rank correlations
• calculate the variance of a portfolio of many asset returns using the covariance matrix of asset returns
• explain the difference between causation, correlation and spurious correlation.
Partial correlation
Partial correlation is the correlation between two variables after controlling for, or removing, the effects of other related variables.
Example: Partial correlation
The partial correlation between industry sales and GDP equals the correlation between industry sales and GDP, holding all other variables constant (e.g. assuming no changes in interest rates, inflation, consumer sentiment or any other variable that is related to industry sales). Therefore, the partial correlation isolates the relationship between sales and GDP.
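For the simplest case of a single control variable, the partial correlation can be computed from the three pairwise correlations. The first-order formula used below is the standard one but is not given in this topic, and all correlation values are hypothetical.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y after removing the linear effect of z.

    First-order formula (one control variable), assumed here.
    """
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical: sales vs GDP, sales vs interest rates, GDP vs interest rates.
sales_gdp = partial_correlation(0.60, 0.50, 0.70)

# Hypothetical: shark attacks vs sunglasses sales, each against temperature.
sharks_sunglasses = partial_correlation(0.72, 0.90, 0.80)

print(f"sales and GDP, controlling for interest rates: {sales_gdp:.3f}")
print(f"sharks and sunglasses, controlling for temperature: {sharks_sunglasses:.3f}")
```

In the second case the raw correlation of 0.72 is fully accounted for by the shared link to temperature, so the partial correlation is essentially zero.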
Sampling error
Sampling error refers to the difference between a sample statistic and the corresponding population parameter that it is trying to estimate. Sampling error is caused by sampling from the population, rather than using the entire population to derive conclusions. Fortunately, sampling error declines as the sample size grows.
Statistical significance
Statistical significance concerns how unlikely it is that a relationship observed in a sample arose by chance alone. Therefore, if we conclude that an estimate is statistically significant, we are saying that the number we observed is unlikely to be attributable to chance. In other words, we can place more reliance on the estimate.
Common cause
Strength in the economy over a period may cause the sharemarket to rise and bond prices to fall. In these circumstances, a negative correlation could be measured between share prices and bond prices. It would be erroneous to conclude, however, that the strength in the sharemarket caused the weakness in the bond market because both are the results of a growing economy.
Granger causality test
The Granger causality test is a hypothesis test, based on regression, to determine whether one time series can provide statistically significant information about later values of another time series, which is taken as evidence that one event causes another. To be reliable, the time series data used in the regression must be made 'stationary' by differencing (i.e. taking the period-to-period change of a time series). If the lagged values of a stationary time series (X) are a significant explanatory factor in the regression model of the current values of another stationary time series (Y), then X is said to Granger-cause Y. Conversely, if the lagged values of Y are a significant explanatory factor in the regression model of the current values of X, then Y is also said to Granger-cause X. The story would be more convincing if one time series Granger-causes the other time series, but not the other way around.
This approach relies on the hypotheses that:
• time only travels forward
• an event cannot be caused by another event that has not happened yet.
While these hypotheses appear very reasonable, it is possible that there is no causal relationship between the two series and that a third factor is causing both. It is also very possible that, although one time series always precedes another, the relationship is not causal. Consider the fact that from 1 to 24 December each year, we have a large and very reliable set of data showing that people send each other Christmas cards, either electronically or on paper. Subsequently, each year, Christmas occurs on 25 December. A Granger causality analysis could lead to the false conclusion that Christmas cards cause Christmas. Therefore, this purely quantitative technique has its limitations and is not a substitute for a complete understanding of the true causes and effects of different outcomes.
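The regression idea can be sketched with a single lag. This is a simplified illustration rather than a full Granger test: it compares a regression of the effect series on its own lag with one that also includes the lagged candidate cause, and reports the F-statistic for the added term. The simulated series (already stationary, so no differencing is applied), the one-lag choice and all function names are assumptions.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def residual_sum_of_squares(rows, target):
    """Fit ordinary least squares of target on rows; return the residual sum of squares."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, target))

def granger_f(cause, effect):
    """F-statistic for adding one lag of `cause` to a one-lag model of `effect`."""
    target = effect[1:]
    restricted = [[1.0, effect[t - 1]] for t in range(1, len(effect))]
    unrestricted = [[1.0, effect[t - 1], cause[t - 1]] for t in range(1, len(effect))]
    rss_r = residual_sum_of_squares(restricted, target)
    rss_u = residual_sum_of_squares(unrestricted, target)
    dof = len(target) - 3  # observations minus unrestricted parameters
    return (rss_r - rss_u) / (rss_u / dof)

# Simulated, hypothetical data in which x leads y by one period.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + rng.gauss(0, 0.5) for t in range(1, 200)]

f_x_to_y = granger_f(x, y)
f_y_to_x = granger_f(y, x)
print(f"F-statistic, x helping to predict y: {f_x_to_y:.1f}")
print(f"F-statistic, y helping to predict x: {f_y_to_x:.1f}")
```

A large F-statistic in one direction only corresponds to the more convincing one-way case described above. It still says nothing about whether a third factor drives both series.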
The Pearson product-moment correlation formula
The major weakness of covariance is that it is useless as a measure of the strength of the relationship between two variables. To correct the problem, it is common to scale (divide) the covariance by the product of the standard deviations of the two variables. The Pearson product-moment correlation coefficient or Pearson correlation is the measure of the correlation between two variables giving a value between +1 and -1. Correlation coefficients are particularly useful when more than one random variable is involved. This is the case with a portfolio of assets because the volatility of the portfolio returns consists of the volatilities of the underlying asset returns and the correlations of these returns. The Pearson correlation is shown in Formula 1 below.
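For reference, the scaling described above is conventionally written as follows. The notation is an assumption for illustration: r_{xy} is the sample correlation, s_x and s_y are the sample standard deviations, and \bar{x} and \bar{y} are the sample means.

```latex
r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
       = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Because the units of x and y appear in both the numerator and the denominator, they cancel, which is why the result is unit-free and bounded between -1 and +1.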
Co-occurrence
The shark attack/sunglass data suffers from co-occurrence (i.e. both variables shift at the same time). The correlation is high because sunglass sales rise during the summer, coinciding with larger numbers of people swimming in the ocean, making shark attacks a greater possibility.
Three correlation relationships
There are three correlation relationships: positive, negative and none. Correlations that are close to +1 indicate that a strong positive linear relationship exists between the two variables, i.e. the data lie close to a straight line with a positive slope (and exactly on the line only when the correlation equals +1). Correlations that are close to -1 indicate that a strong negative linear relationship exists between the two variables, i.e. the data lie close to a straight line with a negative slope. Correlations that are close to zero indicate that no linear relationship exists between the two variables, in which case the variables are said to be uncorrelated. Uncorrelated variables are not necessarily independent, because a non-linear relationship may still exist. Illustrations of strong positive, negative and no linear relationships are provided in the scatter plots in Figures 2 through 4 below.
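One caveat is worth illustrating: a correlation of zero rules out only a linear relationship. In the hypothetical data below, y is completely determined by x, yet the Pearson correlation is zero.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y depends entirely on x, but not linearly.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v ** 2 for v in x]

r = pearson(x, y)
print(f"correlation between x and x squared: {r:.3f}")
```

The positive and negative halves of the curve cancel out in the calculation, so a scatter plot remains a useful check alongside the coefficient.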
Determining causality
There is a possibility that the variation in series A causes the variation in series B, that B causes A, or that they may be co-determined. However, causation or 'theoretic linkage' is only one of a large number of possible explanations as to why any two particular series are correlated. A quantitative analyst can do the following things to ensure that reliable causal relationships are being uncovered:
• collect as much data as possible and make every effort to minimise the risk of Type I and Type II errors
• ensure that the theoretical foundations are sound
• watch out for errors
• trace out intermediate steps in a chain of reasoning and rethink what looks like a spurious correlation if there is a lot of data that can fill in intermediate steps in a theory.
Overview
This topic examines the concepts of correlation and covariance. It also introduces the Pearson correlation and Spearman rank correlation coefficients. The aim of correlation analysis is to identify the degree to which a linear relationship exists between two variables, and to measure the extent of the relationship. As an example, if there is a strong correlation between the share price and earnings per share of a company, the two variables would be expected to move closely together in the future. Any divergence would tend to be corrected over time by a change in the value of one or both of the variables. This being the case, the relationship between them (in this case the price-to-earnings (PE) ratio) would tend to revert (regress) to its historic mean level (regression is covered in Topic 6).
There are, however, some important issues to bear in mind when using an established relationship to predict future behaviour:
• the ability to discriminate between those correlations that can be relied on to persist and those that cannot is a key part of the analysis.
• correlation is contemporaneous and is explicitly not a forecast. Consider the following statement: 'If it is windy my boat will sail well; if it is calm it will not'. This is a very reliable correlation, but it does not help forecast how I will sail tomorrow because there is no means of knowing the weather beyond today with certainty. However, whether or not the boat will sail well can be predicted on the condition that tomorrow's weather is windy. In order to make an 'unconditional' forecast of one variable, the cause and effect of related variables (rather than just historic correlation) must be known.
• correlation does not imply causation, i.e. a reliable correlation between two variables does not mean there is a causal relationship between the variables. Nor does it mean that there is no causal relationship. Any causal relationship between the variables may be indirect, or unknown, or hostage to the existence of another factor. Even where a causal relationship exists between two correlated variables, in many cases it is impossible to tell the direction of the causation, i.e. whether a change in A causes a change in B or vice versa.
Two viewpoints of data
When looking at data, there are two viewpoints to consider:
• a priori or ex-ante (before events) forecasting view
• a posteriori or ex-post (after events) historical view.
The distinction between these two viewpoints helps illustrate the difference between correlation and causation.
Example: A priori approach
The return series on physical gold and the return series on the company Newcrest Mining Ltd (NCM) may be correlated because rises and falls in the gold price cause NCM shares to become more or less valuable. This is the a priori approach, which emphasises the causal aspect of relationships. Even if the calculation of correlation coefficients showed only a slight or intermittent correlation between the variables, it is quite likely that the belief that a rising gold price 'causes' NCM shares to be a more valuable security would be retained.
A posteriori approach
A quantitative analyst could collect return series for many stocks and for many different commodities and calculate the pair-wise correlations for all the stocks and commodities. Even with no opinions about which changes might be causing which results, information about correlations might lead the analyst to make discoveries about causal relationships which might not have been otherwise thought of.
