Reading 2 - Organizing, Visualizing, and Describing Data
Reading 2 Summary:
In this reading, we have presented tools and techniques for organizing, visualizing, and describing data that permit us to convert raw data into useful information for investment analysis. ■ Data can be defined as a collection of numbers, characters, words, and text—as well as images, audio, and video—in a raw or organized format to represent facts or information. ■ From a statistical perspective, data can be classified as numerical data and categorical data. Numerical data (also called quantitative data) are values that represent measured or counted quantities as a number. Categorical data (also called qualitative data) are values that describe a quality or characteristic of a group of observations and usually take only a limited number of values that are mutually exclusive. ■ Numerical data can be further split into two types: continuous data and discrete data. Continuous data can be measured and can take on any numerical value in a specified range of values. Discrete data are numerical values that result from a counting process and therefore are limited to a finite number of values. ■ Categorical data can be further classified into two types: nominal data and ordinal data. Nominal data are categorical values that are not amenable to being organized in a logical order, while ordinal data are categorical values that can be logically ordered or ranked. ■ Based on how they are collected, data can be categorized into three types: cross-sectional, time series, and panel. Time-series data are a sequence of observations for a single observational unit on a specific variable collected over time and at discrete and typically equally spaced intervals of time. Cross-sectional data are a list of the observations of a specific variable from multiple observational units at a given point in time. Panel data are a mix of time-series and cross-sectional data that consists of observations through time on one or more variables for multiple observational units. ■ Based on whether or not data are in a highly organized form, they can be classified into structured and unstructured types. Structured data are highly organized in a pre-defined manner, usually with repeating patterns. Unstructured data do not follow any conventionally organized forms; they are typically alternative data as they are usually collected from unconventional sources. ■ Raw data are typically organized into either a one-dimensional array or a two-dimensional rectangular array (also called a data table) for quantitative analysis. ■ A frequency distribution is a tabular display of data constructed either by counting the observations of a variable by distinct values or groups or by tallying the values of a numerical variable into a set of numerically ordered bins. Frequency distributions permit us to evaluate how data are distributed. ■ The relative frequency of observations in a bin (interval or bucket) is the number of observations in the bin divided by the total number of observations. The cumulative relative frequency cumulates (adds up) the relative frequencies as we move from the first bin to the last, thus giving the fraction of the observations that are less than the upper limit of each bin. ■ A contingency table is a tabular format that displays the frequency distributions of two or more categorical variables simultaneously. One application of contingency tables is for evaluating the performance of a classification model (using a confusion matrix). 
Another application of contingency tables is to investigate a potential association between two categorical variables by performing a chi-square test of independence. ■ Visualization is the presentation of data in a pictorial or graphical format for the purpose of increasing understanding and for gaining insights into the data. ■ A histogram is a bar chart of data that have been grouped into a frequency distribution. A frequency polygon is a graph of frequency distributions obtained by drawing straight lines joining successive midpoints of bars representing the class frequencies. ■ A bar chart is used to plot the frequency distribution of categorical data, with each bar representing a distinct category and the bar's height (or length) proportional to the frequency of the corresponding category. Grouped bar charts or stacked bar charts can present the frequency distribution of multiple categorical variables simultaneously. ■ A tree-map is a graphical tool to display categorical data. It consists of a set of colored rectangles to represent distinct groups, and the area of each rectangle is proportional to the value of the corresponding group. Additional dimensions of categorical data can be displayed by nested rectangles. ■ A word cloud is a visual device for representing textual data, with the size of each distinct word being proportional to the frequency with which it appears in the given text. ■ A line chart is a type of graph used to visualize ordered observations and often to display the change of data series over time. A bubble line chart is a special type of line chart that uses varying-sized bubbles as data points to represent an additional dimension of data. ■ A scatter plot is a type of graph for visualizing the joint variation in two numerical variables. It is constructed by drawing dots to indicate the values of the two variables plotted against the corresponding axes. A scatter plot matrix organizes scatter plots between pairs of variables into a matrix format to inspect all pairwise relationships between more than two variables in one combined visual. ■ A heat map is a type of graphic that organizes and summarizes data in a tabular format and represents it using a color spectrum. It is often used in displaying frequency distributions or visualizing the degree of correlation among different variables. ■ The key consideration when selecting among chart types is the intended purpose of visualizing data (i.e., whether it is for exploring/presenting distributions or relationships or for making comparisons). ■ A population is defined as all members of a specified group. A sample is a subset of a population. ■ A parameter is any descriptive measure of a population. A sample statistic (statistic, for short) is a quantity computed from or used to describe a sample. ■ Sample statistics—such as measures of central tendency, measures of dispersion, skewness, and kurtosis—help with investment analysis, particularly in making probabilistic statements about returns. ■ Measures of central tendency specify where data are centered and include the mean, median, and mode (i.e., the most frequently occurring value). ■ The arithmetic mean is the sum of the observations divided by the number of observations. It is the most frequently used measure of central tendency. ■ The median is the value of the middle item (or the mean of the values of the two middle items) when the items in a set are sorted into ascending or descending order. 
The median is not influenced by extreme values and is most useful in the case of skewed distributions. ■ The mode is the most frequently observed value and is the only measure of central tendency that can be used with nominal data. A distribution may be unimodal (one mode), bimodal (two modes), trimodal (three modes), or have even more modes. ■ A portfolio's return is a weighted mean return computed from the returns on the individual assets, where the weight applied to each asset's return is the fraction of the portfolio invested in that asset. ■ The geometric mean is especially important in reporting compound growth rates for time-series data. The geometric mean will always be less than the arithmetic mean whenever there is variance in the observations. ■ The harmonic mean, $\bar{X}_H$, is a type of weighted mean in which an observation's weight is inversely proportional to its magnitude. ■ Quantiles—such as the median, quartiles, quintiles, deciles, and percentiles—are location parameters that divide a distribution into halves, quarters, fifths, tenths, and hundredths, respectively. ■ A box and whiskers plot illustrates the interquartile range (the "box") as well as a range outside of the box that is based on the interquartile range, indicated by the "whiskers." ■ Dispersion measures—such as the range, mean absolute deviation (MAD), variance, standard deviation, target downside deviation, and coefficient of variation—describe the variability of outcomes around the arithmetic mean. ■ The range is the difference between the maximum value and the minimum value of the dataset. The range has only limited usefulness because it uses information from only two observations. ■ The MAD for a sample is the average of the absolute deviations of observations from the mean, $\text{MAD} = \sum_{i=1}^{n} |X_i - \bar{X}| / n$, where $\bar{X}$ is the sample mean and $n$ is the number of observations in the sample. ■ The variance is the average of the squared deviations around the mean, and the standard deviation is the positive square root of variance. In computing sample variance ($s^2$) and sample standard deviation ($s$), the average squared deviation is computed using a divisor equal to the sample size minus 1. ■ The target downside deviation, or target semideviation, is a measure of the risk of being below a given target. It is calculated as the square root of the average squared deviations from the target, but it includes only those observations below the target (B): $s_{\text{Target}} = \sqrt{\sum_{\text{for all } X_i \le B} \dfrac{(X_i - B)^2}{n-1}}$. ■ The coefficient of variation, CV, is the ratio of the standard deviation of a set of observations to their mean value. By expressing the magnitude of variation among observations relative to their average size, the CV permits direct comparisons of dispersion across different datasets. Reflecting the correction for scale, the CV is a scale-free measure (i.e., it has no units of measurement). ■ Skew or skewness describes the degree to which a distribution is asymmetric about its mean. A return distribution with positive skewness has frequent small losses and a few extreme gains compared to a normal distribution. A return distribution with negative skewness has frequent small gains and a few extreme losses compared to a normal distribution. Zero skewness indicates a symmetric distribution of returns. ■ Kurtosis measures the combined weight of the tails of a distribution relative to the rest of the distribution. 
A distribution with fatter tails than the normal distribution is referred to as fat-tailed (leptokurtic); a distribution with thinner tails than the normal distribution is referred to as thin-tailed (platykurtic). Excess kurtosis is kurtosis minus 3, since 3 is the value of kurtosis for all normal distributions. ■ The correlation coefficient is a statistic that measures the association between two variables. It is the ratio of covariance to the product of the two variables' standard deviations. A positive correlation coefficient indicates that the two variables tend to move together, whereas a negative coefficient indicates that the two variables tend to move in opposite directions. Correlation does not imply causation, simply association. Issues that arise in evaluating correlation include the presence of outliers and spurious correlation.
Section 7: Measures of Central Tendency
Learning Objective: - Calculate and interpret measures of central tendency - Evaluate alternative definitions of mean to address an investment problem Notes: Formulas: Definitions: Measures of Central Tendency - Measures of Location - Statistics - Population - Sample Statistic - Arithmetic mean - Trimmed Mean - Winsorized Mean - Median - Mode - Unimodal - Bimodal - Trimodal - Modal Interval - Weighted Mean - Geometric Mean & Formula - Harmonic Mean & Formula - Cost Averaging -
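The definitions above are listed without formulas, so here is a minimal sketch (Python standard library only) of how the main means could be computed. The return, weight, and price values are made up purely for illustration and are not from the reading.

```python
import statistics

# Hypothetical annual returns for one asset (illustrative values only)
returns = [0.08, -0.03, 0.12, 0.05, 0.01]

arithmetic = statistics.mean(returns)    # sum of observations / number of observations
med = statistics.median(returns)         # middle value of the sorted observations

# Weighted mean (portfolio return): weights are the fractions invested in each asset
asset_returns = [0.06, 0.02, 0.10]
weights = [0.50, 0.30, 0.20]
portfolio_return = sum(w * r for w, r in zip(weights, asset_returns))

# Geometric mean return: compound (1 + r) across periods, then subtract 1
geometric = statistics.geometric_mean([1 + r for r in returns]) - 1

# Harmonic mean: used for cost averaging (average price paid per share when
# investing a fixed currency amount each period)
prices_paid = [8.0, 9.0, 10.0]
average_cost_per_share = statistics.harmonic_mean(prices_paid)

print(arithmetic, med, portfolio_return, geometric, average_cost_per_share)
```

Note how the geometric mean of the sample returns comes out below the arithmetic mean, consistent with the summary point that it is always lower whenever the observations vary.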
Section 9: Measures of Dispersion
Learning Objective: - Calculate and interpret measures of dispersion Notes: Formulas: Definitions: Dispersion - Absolute Dispersion - Range - Mean Absolute Deviation - Variance - Standard Deviation - Sample Standard Deviation -
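Since the formulas are left blank above, here is a small sketch of the basic dispersion measures on a made-up return sample (not from the reading); the standard library's statistics.variance and statistics.stdev already use the n - 1 divisor described in the summary.

```python
import statistics

returns = [0.08, -0.03, 0.12, 0.05, 0.01]   # hypothetical sample
n = len(returns)
mean = statistics.mean(returns)

value_range = max(returns) - min(returns)         # range = max - min
mad = sum(abs(x - mean) for x in returns) / n     # mean absolute deviation
sample_variance = statistics.variance(returns)    # divisor is n - 1
sample_stdev = statistics.stdev(returns)          # positive square root of the sample variance

print(value_range, mad, sample_variance, sample_stdev)
```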
Section 10: Downside Deviation and Coefficient of Variation
Learning Objective: - Calculate and interpret target downside deviation Notes: Formulas: Definitions: Downside Risk - Target Semideviation - Relative Dispersion - Coefficient of Variation -
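A minimal sketch of the two measures named above, assuming a made-up return sample and an arbitrary target return B of 2%; the n - 1 divisor and the "observations at or below the target only" rule follow the summary at the top of this reading.

```python
import math
import statistics

returns = [0.08, -0.03, 0.12, 0.05, 0.01]   # hypothetical sample
target = 0.02                                # assumed target return, B

# Target downside deviation (target semideviation): only observations <= B enter the sum
n = len(returns)
downside_sq = sum((x - target) ** 2 for x in returns if x <= target)
target_semideviation = math.sqrt(downside_sq / (n - 1))

# Coefficient of variation: standard deviation relative to the mean (scale-free)
cv = statistics.stdev(returns) / statistics.mean(returns)

print(target_semideviation, cv)
```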
Section 12: Correlation Between Two Variables
Learning Objective: - Interpret correlation between two variables Notes: Formulas: Definitions: Correlation - Sample Covariance - Sample Correlation Coefficient - Spurious Correlation -
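A short sketch, assuming two made-up return series, of the sample covariance and sample correlation coefficient named above; numpy's cov and corrcoef use the same n - 1 convention as the sample formulas.

```python
import numpy as np

x = np.array([0.05, 0.02, -0.01, 0.04, 0.03])   # hypothetical returns, asset X
y = np.array([0.04, 0.01,  0.00, 0.05, 0.02])   # hypothetical returns, asset Y

sample_cov = np.cov(x, y)[0, 1]       # sample covariance (divisor n - 1)
corr = np.corrcoef(x, y)[0, 1]        # sample correlation coefficient

# Same number by the definition: covariance divided by the product of the standard deviations
corr_check = sample_cov / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(sample_cov, corr, corr_check)
```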
Section 1: Introduction
While this data-rich environment offers potentially tremendous opportunities for investors, turning data into useful information is not so straightforward. Organizing, cleaning, and analyzing data are crucial to the development of successful investment strategies; otherwise, we end up with "garbage in and garbage out" and failed investments. It is often said that 80% of an analyst's time is spent on finding, organizing, cleaning, and analyzing data, while just 20% of her/his time is taken up by model development. So, the importance of having a properly organized, cleansed, and well-analyzed dataset cannot be over-emphasized.
Section 2: Data Types
Learning Objective: - Identify and compare data types - Describe how data are organized for quantitative analysis Notes: Data can be any raw or organized information that represents facts. There are appropriate statistical methods to summarize data and specific charts to best visualize them, each suited to a different type of data. Statistically, data are separated into numerical and categorical data. Numerical (quantitative) data are values that represent measured or counted quantities as a number and are split into continuous data (like a stock price) and discrete data (finite and countable, like the frequency of compounding for interest rates - daily, weekly, monthly, yearly, etc.). Categorical (qualitative) data are values that represent information not defined by a number range - for example, solvent vs. insolvent - and are separated into nominal and ordinal data. Nominal data are categorical but cannot be organized in a logical order (for example, the Global Industry Classification Standard (GICS). GICS, developed by Morgan Stanley Capital International (MSCI) and Standard & Poor's (S&P), is a four-tiered, hierarchical industry classification system consisting of 11 sectors, 24 industry groups, 69 industries, and 158 sub-industries.) Nominal data can be in text or in "code". A code is a numerical identifier with no statistical meaning - kind of like a student ID number. This is practical to know for modeling (like a regression) with software that requires numerical inputs, even for a nominal label (see the coding sketch at the end of this section). Ordinal data are categorical values that can be logically ordered or ranked. Example - investment researchers assigning grades to companies based on relative performance within a group (F is bad, A is good...). Ordinal data can also involve numbers, but again the numbers must reflect a logical ranking based on relative performance. Note also that ordinal data generally lack detail on why one data point ranks better or worse than another - the differences could be slight or significant, and the better the data representation, the better the picture. Those are the face-value types of data. However, we also classify data by how they are collected: cross-sectional, time-series, and panel data. A variable (field, attribute, or feature) is what a data point generally measures; observing a variable and recording it for use is data collection. Variables come in a wide range of possibilities; common ones include stock multiples like P/E, dividend yield, EPS, and many more.
1. Cross-sectional data - a list of observations of a specific variable among a group of "observational units" (who is being observed) - individuals, groups, companies, countries... Example - the inflation rate of each individual EU nation in January (variable = inflation rate, observational unit = EU countries, point in time = January).
2. Time-series data - a single observational unit, more than one observational period, one variable, with equal intervals between observations. Example - the daily closing price of a stock. Observations are taken at discrete, typically equally spaced intervals.
3. Panel data - a mixture of cross-sectional and time-series data. Includes one or more variables, multiple observational units, and generally more than one observational period. Displayed in a data table.
Example: a data table showing 3 companies (observational units), over 4 quarters (observational periods), and their EPS per quarter (variable). Similar to how categorical data are split between nominal and ordinal, data can also be categorized by how structured or unstructured they are. Structured Data = highly organized (generally time-series or panel data): ■ Market data: data issued by stock exchanges, such as intra-day and daily closing stock prices and trading volumes. ■ Fundamental data: data contained in financial statements, such as earnings per share, price to earnings ratio, dividend yield, and return on equity. ■ Analytical data: data derived from analytics, such as cash flow projections or forecasted earnings growth. Unstructured Data: data that do not follow any conventionally organized forms. A relatively new classification driven by the rise of alternative data (i.e., data generated from unconventional sources, like electronic devices, social media, sensor networks, and satellites, but also by companies in the normal course of business). Generally collected from unconventional sources: ■ Produced by individuals (i.e., via social media posts, web searches, etc.); ■ Generated by business processes (i.e., via credit card transactions, corporate regulatory filings, etc.); ■ Generated by sensors (i.e., via satellite imagery, foot traffic by mobile devices, etc.). Typically, financial models are able to take only structured data as inputs; therefore, unstructured data must first be transformed into structured data that models can process. Fun fact: the EDGAR (Electronic Data Gathering, Analysis, and Retrieval) database is the place to find SEC public filings. The SEC has utilized eXtensible Business Reporting Language (XBRL) to structure such data. Raw data are the original, unorganized form in which data are collected. Organizing raw data into a one-dimensional array or a two-dimensional array is typically the first step in data analytics and modeling. Formulas: Definitions: Data - a collection of numbers, characters, words, and text—as well as images, audio, and video—in a raw or organized format to represent facts or information. Numerical data / Quantitative data - values that represent measured or counted quantities as a number. Categorical data / Qualitative data - values that describe a quality or characteristic of a group of observations and therefore can be used as labels to divide a dataset into groups to summarize and visualize. Usually they can take only a limited number of values that are mutually exclusive. Continuous data - data that can be measured and can take on any numerical value in a specified range of values. Discrete data - numerical values that result from a counting process. So, practically speaking, the data are limited to a finite number of values. Nominal data - categorical values that are not amenable to being organized in a logical order. Ordinal data - categorical values that can be logically ordered or ranked. Variable - a characteristic or quantity that can be measured, counted, or categorized and is subject to change. Observation - the value of a specific variable collected at a point in time or over a specified period of time. Cross-sectional data - a list of the observations of a specific variable from multiple observational units at a given point in time. 
Time-series data - a sequence of observations for a single observational unit of a specific variable collected over time and at discrete and typically equally spaced intervals of time, such as daily, weekly, monthly, quarterly, or annually. Panel data - a mix of time-series and cross-sectional data that are frequently used in financial analysis and modeling. Panel data consist of observations through time on one or more variables for multiple observational units. The observations in panel data are usually organized in a matrix format called a data table. Structured data - highly organized in a pre-defined manner, usually with repeating patterns. The typical forms of structured data are one-dimensional arrays, such as a time series of a single variable, or two-dimensional data tables, where each column represents a variable or an observation unit and each row contains a set of values for the same columns. Unstructured data - data that do not follow any conventionally organized forms. Some common types of unstructured data are text—such as financial news, posts in social media, and company filings with regulators—and also audio/video, such as managements' earnings calls and presentations to analysts. Raw data - data available in their original form as collected; such data typically cannot be used by humans or computers to directly extract information and insights.
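As a small illustration of the nominal-code idea discussed in the notes above (a sketch with made-up labels, not part of the reading), pandas can attach integer codes to nominal labels and an explicit order to ordinal ones:

```python
import pandas as pd

# Nominal data: sector labels have no logical order; the integer codes are just identifiers
sectors = pd.Series(["Energy", "Health Care", "Financials", "Energy"], dtype="category")
sector_codes = sectors.cat.codes          # numeric codes for software that needs numbers

# Ordinal data: ratings have a logical ranking, so the order is declared explicitly
ratings = pd.Categorical(["B", "A", "C", "A"], categories=["C", "B", "A"], ordered=True)

print(sector_codes.tolist(), ratings.codes.tolist())
```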
Section 4: Summarizing Data Using Frequency Distribution
Learning Objective: - Interpret frequency and related distributions Notes: Frequency distribution - also referred to as a "one-way table" - helps visualize data observations and is an important tool for initially summarizing data by groups or bins for easier interpretation.
Two steps to constructing a frequency distribution of categorical data:
1. Count the number of observations for each unique value of a specified variable. (Imagine counting all the cars on your way home and sorting by color - color is the variable.)
2. Construct a table listing each unique value and the corresponding counts, and then sort the records by number of counts in descending order to facilitate the display. (Which car colors were observed the most to least, and with what percentage frequency did red show up?)
When looking at data for a specific variable we need to know how many observations were made for each unique value (within the variable), known as absolute frequency (also referred to as raw frequency). It is also generally expected that we provide the relative frequency - calculated as the absolute frequency of each unique value of the variable divided by the total number of observations and presented as a %. Relative frequency is important because it provides a normalized measure of the distribution of the data, allowing comparisons between datasets with different numbers of total observations (allowing us to see how things like market share and other variables may change over time). *example in attached photo* Frequency distribution tables = a snapshot with the ability to identify patterns. Relative frequencies should always sum to 100%. Frequency distribution can be used with numerical data, not just categorical. However, it becomes a bit more involved, as the observations cannot overlap into more than one bin/bucket or interval (the number of bins is denoted by "k").
Seven steps to make a frequency distribution table with numerical data (a coding sketch follows at the end of this section):
1. Sort the data in ascending order (smallest to largest).
2. Calculate and define the data range (max minus min value).
3. Determine the number of bins/buckets/intervals (k) for the dataset.
4. Determine the width of the bins as range/k. *Always round up.*
5. Determine the first bin by adding the bin width to the minimum value. Then determine the remaining bins by successively adding the bin width to the prior bin's end point, stopping after reaching a bin that includes the maximum value.
6. Sort the observations into the correct bins - using only observations between or equal to the minimum and maximum range values.
7. Create a table, smallest to largest, that displays the number of observations in each respective bin.
As a practical refinement to step 5 that promotes interpretation, start the first bin at the nearest whole number below the minimum value (e.g., 3 instead of 3.21), and make sure bins do not overlap. The number of bins (k) is subjective, yet important. If we use too few bins, we will summarize too much and may lose pertinent characteristics. Conversely, if we use too many bins, we may not summarize enough and may introduce unnecessary noise. The best option is a balance: no empty bins, but also a frequency distribution that effectively summarizes the distribution. When charting a numerical-data frequency distribution you might include a column for cumulative absolute frequency (CAF) and cumulative relative frequency (CRF). CAF is a running sum of all absolute frequencies in ascending bin order. 
This allows you to see the total number of observations accumulated up to a specific point in the table and helps define a range if specific areas become of interest; the final CAF should equal the total number of observations. CRF is a running sum of the relative frequencies. It tells you what % of observations have been accounted for so far in the table - helpful for spotting big jumps and focusing on specific bins; the final CRF should equal 100% at the table's end. *see picture for chart example* The frequency distribution gives us a sense of not only where most of the observations lie but also whether the distribution is evenly spread. Frequency distributions can also be effectively represented in visuals. Remember, a frequency distribution of categorical data doesn't require the creation of ranged bins, CAF, or CRF - only numerical data does. Always round up on bin width and round the first and last bin to the nearest whole number for ease of interpretation. Also, frequency distribution tables generally organize only one variable. Formulas: Definitions: Frequency Distribution - tabular display of data constructed either by counting the observations of a variable by distinct values or groups or by tallying the values of a numerical variable into a set of numerically ordered bins. Absolute Frequency - raw frequency; the actual number of observations counted for each unique value of the variable. Relative Frequency - calculated as the absolute frequency of each unique value of the variable divided by the total number of observations. The relative frequency provides a normalized measure of the distribution of the data, allowing comparisons between datasets with different numbers of total observations. Intervals - also called bins or buckets; the numerically ordered groups into which observations of a numerical variable are tallied. Cumulative absolute frequency - cumulates (meaning, adds up) the absolute frequencies as we move from the first bin to the last bin. Cumulative relative frequency - the CAF divided by the total number of observations (equivalently, a running sum of relative frequencies).
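Here is the coding sketch referenced in the seven steps above - a rough pandas version of building a numerical frequency distribution, using twelve made-up fund returns and an arbitrary choice of k = 4 bins (none of the numbers come from the reading).

```python
import numpy as np
import pandas as pd

# Hypothetical annual returns (%) for 12 funds
returns = pd.Series([3.2, 4.1, 5.8, 6.4, 6.9, 7.3, 8.0, 8.8, 9.5, 10.1, 11.7, 12.4])

k = 4                                                   # chosen number of bins (subjective)
width = np.ceil((returns.max() - returns.min()) / k)    # bin width, rounded up (step 4)
start = np.floor(returns.min())                         # refinement: whole number below the min
edges = start + width * np.arange(k + 1)                # bin end points (step 5)

bins = pd.cut(returns, bins=edges, include_lowest=True) # assign each observation to a bin (step 6)
abs_freq = bins.value_counts().sort_index()             # absolute frequency per bin (step 7)
rel_freq = abs_freq / abs_freq.sum()

table = pd.DataFrame({
    "absolute": abs_freq,
    "relative": rel_freq,
    "cumulative_absolute": abs_freq.cumsum(),   # CAF
    "cumulative_relative": rel_freq.cumsum(),   # CRF, ends at 1.0 (100%)
})
print(table)
```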
Section 8: Quantiles
Learning Objective: - Calculate quantiles and interpret related visualizations Notes: Formulas: Definitions: Quartiles - Quintiles - Deciles - Percentiles - Interquartile Range (IQR) - Linear Interpolation - Box and Whisker Plot -
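The quantile terms above can be illustrated with a short numpy sketch on made-up data. Note that numpy's default linear-interpolation method may differ slightly from the (n + 1)-based location formula used in the curriculum, so treat this as an approximation of the idea rather than the exam formula.

```python
import numpy as np

returns = np.array([2.1, 3.4, 4.0, 4.8, 5.5, 6.1, 7.2, 8.3, 9.0, 10.6])  # hypothetical, sorted

q1, q2, q3 = np.percentile(returns, [25, 50, 75])   # quartiles (q2 is the median)
iqr = q3 - q1                                        # interquartile range: the "box"
p90 = np.percentile(returns, 90)                     # 90th percentile

# One common convention: whiskers extend to the most extreme observations
# within 1.5 * IQR of the box edges
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(q1, q2, q3, iqr, p90, lower_fence, upper_fence)
```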
Section 6: Data Visualization
Learning Objective: - Describe how to select among visualization types - Describe ways that data may be visualized and evaluate uses of specific visualizations Notes: Visualization of data in the financial world is important to help summarize data and deliver insight in an organized way. Most data, structured or unstructured, numerical or categorical, can be visualized.
1a. Histogram (numerical data): a common way to visualize the distribution of numerical data from a frequency distribution table - each bar represents the absolute frequency of a bin in the distribution. X-axis = a bin of the variable. Y-axis = absolute frequency. Histograms generally have no space or only very minor gaps between bins. An advantage of the histogram is that it can effectively present a large amount of numerical data that has been grouped into a frequency distribution and allows a quick inspection of the shape, center, and spread of the distribution. An absolute frequency histogram best answers the question of how many items are in each bin, while a relative frequency histogram gives the proportion or percentage of the total observations in each bin.
1b. Frequency polygon: similar to a histogram, a frequency polygon can help visualize a frequency distribution. X-axis = the midpoint of each bin. Y-axis = absolute (or relative) frequency. Each data point is connected with a straight line. Its main difference from a histogram, from an analytical standpoint, is that it visualizes frequency as the area under the curve.
1c. Cumulative frequency distribution chart: as the name implies, a helpful tool for frequency distributions. It takes the cumulative absolute or relative frequencies and charts the running sum total across all bins, as you would see in a numerical frequency distribution table. A steep change in slope means there was a large shift in observations. X-axis = bins. Y-axis = cumulative absolute/relative frequency.
2a. Bar chart (categorical data): similar to a histogram but generally used for categorical data; bars can be plotted horizontally or vertically. X-axis = mutually exclusive groups (not bins representing a single variable like in a histogram). Y-axis = absolute or relative frequency. When visualizing nominal data with no logical ordering, the bars may be arranged in any order. However, in the particular case where the categories in a bar chart are ordered by frequency in descending order (largest to smallest) and the chart includes a line displaying cumulative relative frequency, it is called a Pareto chart. Bar charts provide a snapshot to show the comparison between categories of data. To compare categories more accurately, in some cases we may add the frequency count to the right end of each bar. Watch out for a truncated y-axis, as it may lead you to misinterpret the information.
2b. Grouped/clustered bar chart: a simple bar chart can handle only one variable. When you need two variables, use a grouped/clustered bar chart. This type of graph is perfect for joint frequencies, using a contingency table instead of a frequency distribution. Can be horizontal or vertical: 1. categorical label axis, 2. frequency axis (relative or absolute). To show the multiple variables, each categorical label has a bar for each level of the second variable. All categories should show the same levels - unless there are 0 observations. A good example is sector vs. market cap (small, mid, large). 
Each category shows the absolute or relative frequency in each sector, and each bar follows a consistent legend.
2c. Stacked bar chart: essentially a clustered bar chart, but instead of individual bars inside each category, the bars get stacked on top of each other, showing each component via a legend but in a single bar. Really helpful if you want to visualize the marginal frequency of individual categories. It is worth noting that applications of bar charts may be extended to more general cases when categorical data are associated with numerical data. For example, suppose we want to show a company's quarterly profits over the past year. In this case, we can plot a vertical bar chart where each bar represents one of the four quarters in time order and its height indicates the value of profits for that quarter.
3. Tree-map (categorical): a displayed visual, not a graph. It consists of a set of colored rectangles representing distinct groups, and the area of each rectangle is proportional to the value of the corresponding group. Largest area = most observed category; smallest area = least observed category. Very helpful for visualizing marginal frequencies, and in some cases joint frequencies if the visual is well thought out. The color scheme is less important here. AREA SHOULD ALWAYS BE PROPORTIONAL. Generally, three levels of nesting within one category is the max.
4. Word/tag cloud (unstructured data): represents textual data. A word cloud consists of words extracted from a source of textual data, with the size of each distinct word being proportional to the frequency with which it appears in the given text. This format allows us to quickly perceive the most frequent terms in the given text and provides information about the nature of the text, including topic and whether or not the text conveys positive or negative news. Color can add an additional dimension to the information conveyed in the word cloud. For example, red can be used for "losses" and other words conveying negative sentiment, and green can be used for "profit" and other words indicative of positive sentiment.
5a. Line chart (structured): we first plot all the data points against horizontal and vertical axes and then connect the points with straight line segments. The x and y axes can be whatever you need them to be, but generally some sort of numerical data, like time, date, or price. An important benefit of a line chart is that it facilitates showing changes in the data and underlying trends in a clear and concise way. It also helps us forecast and model what the future may look like. You can use more than one line, but generally two is most useful, as you can use the left and right axes to present separate ranges. Useful for comparisons.
5b. Bubble line chart: a line chart that uses more than one dimension to visualize multiple variables. Bubbles are placed on top of the data points on the x,y axes, and each bubble's area is relative to the other data points. For example: we plot 8 quarters of revenue (quarter, revenue) and at each point we add a bubble sized by EPS (the bubble size depends on its distance from $0 EPS - the bigger the loss/gain, the bigger the bubble; the smaller the loss/gain, the smaller the bubble - color may also be added to emphasize loss or gain). A great way to see a more holistic picture while comparing several data variables.
6a. Scatter plot: visualizes the joint variation in two numerical variables. A useful tool for displaying and understanding potential relationships between the variables. 
X-axis = Variable 1. Y-axis = Variable 2. It uses dots to indicate the values of the two variables for a particular point in time, plotted against the corresponding axes. It is important to inspect for any potential association between the two variables. The pattern of the scatter plot may indicate no apparent relationship, a linear association, or a non-linear relationship. A scatter plot with randomly distributed data points would indicate no clear association between the two variables. However, if the data points seem to align along a straight line, then there may exist a significant relationship among the variables. A positive (negative) slope for the line of data points indicates a positive (negative) association, meaning the variables move in the same (opposite) direction. Tight (loose) clustering signals a potentially stronger (weaker) relationship. Inspecting the scatter plot can help to spot extreme values (i.e., outliers). Finding these extreme values and handling them with appropriate measures is an important part of the financial modeling process. Scatter plots are a powerful tool for finding patterns between two variables, for assessing data range, and for spotting extreme values.
6b. Scatter plot matrix: a useful tool for organizing scatter plots between pairs of variables, making it easy to inspect all pairwise relationships in one combined visual. Really just a group of charts showing all the individual two-variable (bivariate) pairings: when two different variables are paired, the cell is a scatter plot; when a variable is paired with itself (the diagonal), a histogram of that variable is typically shown instead. It is important to note that despite their usefulness, scatter plots and scatter plot matrices should not be considered a substitute for robust statistical tests; rather, they should be used alongside such tests for best results.
7. Heat map (categorical, sometimes mixed with numerical): a graphic that organizes and summarizes data in a tabular format and represents them using a color spectrum. Cells in the chart are color-coded to differentiate high values from low values by using the color scheme defined in the color spectrum on the right side of the chart. Heat maps are commonly used for visualizing the degree of correlation among different variables.
Conclusion of this long section... remember:
1. Make sure to pick the correct chart type.
2. Presenting data for too short a time window may mistakenly point to a non-existent trend.
3. Watch out for truncated y-axis graphs.
4. Don't use a range that is too small or too large and misrepresents the data.
Ethics matter in data visualization - don't try to trick people. (A small plotting sketch follows the definitions below.) Formulas: Definitions: Visualization - the presentation of data in a pictorial or graphical format for the purpose of increasing understanding and for gaining insights into the data. Histogram - a chart that presents the distribution of numerical data by using the height of a bar or column to represent the absolute frequency of each bin or interval in the distribution. Frequency Polygon - similar to a histogram; can visualize a frequency distribution, but instead of bars it uses the midpoint of each bin on the x-axis and the absolute frequency on the y-axis, with points connected by straight lines. It can quickly convey a visual understanding of the distribution since it displays frequency as an area under the curve. Cumulative Frequency Distribution Chart - a visualized frequency distribution where the cumulative absolute or relative frequency is plotted on a line. 
Bar Chart - similar to a histogram, but used for the visualization of categorical data in a frequency distribution. Grouped/Clustered Bar Chart - a special bar chart for categorical data with more than one variable, creating more than one bar in each individual category. Helps to visualize joint frequencies. Stacked Bar Chart - similar to a clustered bar chart, but instead of each category having separate bars to represent each variable within a category, they all get stacked on top of each other, while maintaining a visual legend separating each variable. Helps to visualize the marginal frequency of a specific category. Tree-map - consists of a set of colored rectangles to represent distinct groups, and the area of each rectangle is proportional to the value of the corresponding group. Word/Tag Cloud - consists of words extracted from a source of textual data, with the size of each distinct word being proportional to the frequency with which it appears in the given text. Line Chart - x,y data plotted and connected with straight lines. An important benefit of a line chart is that it facilitates showing changes in the data and underlying trends in a clear and concise way. Bubble Line Chart - a more holistic take on a line chart, adding another dimension represented by bubbles at each data point; bubble size (and sometimes color) conveys the additional variable. Scatter Plot - compares two variables against each other to look for an association. Scatter Plot Matrix - compares more than two variables by making a group of charts in which bivariate, pairwise pairings are organized to help visualize which variables are and are not correlated. Heat Map - a graphic that organizes and summarizes data in a tabular format and represents them using a color spectrum.
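The plotting sketch referenced above - a minimal matplotlib example (simulated returns and made-up sector counts, nothing from the reading) showing the histogram-for-numerical vs. bar-chart-for-categorical distinction.

```python
import matplotlib.pyplot as plt
import numpy as np

returns = np.random.default_rng(0).normal(0.05, 0.10, 250)   # simulated numerical data
sectors = ["Energy", "Tech", "Health Care"]                   # hypothetical categories
counts = [12, 30, 18]                                         # hypothetical frequencies

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.hist(returns, bins=10)             # histogram: numerical data grouped into bins
ax1.set_title("Histogram of returns")
ax2.bar(sectors, counts)               # bar chart: one bar per mutually exclusive category
ax2.set_title("Bar chart by sector")
plt.tight_layout()
plt.show()
```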
Section 3: Organizing Data for Quantitative Analysis
Learning Objective: - Describe how data are organized for quantitative analysis Notes: In quantitative analysis, raw data can be organized into two types of formats: one-dimensional arrays and two-dimensional rectangular arrays. A one-dimensional array can represent a collection of observations of a single variable, most often and best seen as a time series. In contrast to compiling the data randomly in an unorganized manner, organizing such data by their time-series nature preserves valuable information beyond the basic descriptive statistics that summarize central tendency and spread of the data's distribution. When time is used to order the data, trends and patterns become more apparent. Note that descriptive statistics are limited in that they only allow you to make summaries about the people or objects that you have actually measured. A two-dimensional rectangular array (also called a data table) is one of the most popular forms for organizing data for processing by computers or for presenting data visually for consumption by humans. Realistically, this section is just saying that building these arrays and tables is important, and understanding how to build them logically is key (see the sketch below). ***When there is no data for a cell, do not put 0; rather, use n/a, as 0 can be interpreted incorrectly (unless the formatting of the program requires it - then be sure to note it in supplemental material).*** Formulas: Definitions: One-dimensional array - the simplest format for representing a collection of data of the same data type, so it is suitable for representing a single variable. Two-dimensional rectangular array / Data Table - when a data table is used to organize the data of one single observational unit (i.e., a single company), each column represents a different variable (feature or attribute) of that observational unit, and each row holds an observation for the different variables; successive rows represent the observations for successive time periods. Descriptive statistics - brief informational coefficients that summarize a given dataset, which can be either a representation of the entire population or a sample of a population.
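The sketch mentioned above - a rough pandas illustration of a one-dimensional array (a time series of one variable) versus a two-dimensional data table, with hypothetical dates and values; note the missing value is left as None/NaN rather than 0, as the notes recommend.

```python
import pandas as pd

# One-dimensional array: observations of a single variable for a single observational unit
closes = pd.Series(
    [101.2, 102.5, 101.9],
    index=pd.to_datetime(["2023-01-03", "2023-01-04", "2023-01-05"]),
    name="close",
)

# Two-dimensional rectangular array (data table): columns = variables, rows = periods
table = pd.DataFrame(
    {"eps": [1.10, 1.25, None],        # missing observation stays NaN, not 0
     "dps": [0.40, 0.40, 0.45]},
    index=pd.PeriodIndex(["2023Q1", "2023Q2", "2023Q3"], freq="Q"),
)

print(closes)
print(table)
```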
Section 11: The Shape of Distribution
Learning Objective: - Interpret Skewness - Interpret Kurtosis Notes: Formulas: Definitions: Skewed - Skewness - Sample Skewness - Kurtosis - Leptokurtic ("Fat-Tailed") - Platykurtic ("Thin-Tailed") - Excess Kurtosis - Sample Excess Kurtosis -
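The skewness and kurtosis terms above, sketched with scipy on made-up returns; scipy's bias=False option applies a sample-size adjustment broadly comparable to the sample skewness/kurtosis formulas, and fisher=True returns excess kurtosis (kurtosis minus 3).

```python
import numpy as np
from scipy import stats

returns = np.array([0.02, 0.01, -0.03, 0.04, 0.05, -0.08, 0.03, 0.02])  # hypothetical sample

sample_skew = stats.skew(returns, bias=False)                        # > 0: long right tail; < 0: long left tail
excess_kurtosis = stats.kurtosis(returns, fisher=True, bias=False)   # > 0: fat-tailed; < 0: thin-tailed

print(sample_skew, excess_kurtosis)
```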
Section 5: Summarizing Data Using a Contingency Table
Learning Objective: - Interpret a contingency table Notes: Frequency distribution tables show one variable; contingency tables show more than one variable, but only for categorical data. It's also known as the two-way table. Contingency tables are constructed by listing all the levels (i.e., categories) of one variable as rows (R) and all the levels of the other variable as columns (C) in the table. A contingency table having R levels of one variable in rows and C levels of the other variable in columns is referred to as an R × C table. Levels must be finite and either ordinal or nominal, as this table structure only works for categorical data. A contingency table can show either frequency (count) or relative frequency (percentage) based on either the overall total, row totals, or column totals. Contingency tables show joint frequencies: two separate variables come together to represent a combination of data for a more specific observation. This is seen in the cross of a row and a column (e.g., small cap - health care). Marginal frequencies are also represented in contingency tables as the sum of an entire row or column (e.g., the sum of small caps across sectors, or the sum of health care companies in a portfolio). We can express frequency in percentage terms as relative frequency by using one of three options (a sketch using a contingency table and chi-square test follows this section):
a) joint frequency / overall total (all joint frequencies sum to 100%);
b) joint frequency / marginal frequency of a row (joint frequencies in the row sum to 100%) - row composition;
c) joint frequency / marginal frequency of a column (joint frequencies in a column sum to 100%) - column composition.
*see picture* Contingency tables are also used to evaluate the performance of investment models; in that application the table is generally called a confusion matrix - the objective being to take the predictions formed by a model and use a true/false type of contingency table to determine the model's performance. Another application of contingency tables is to investigate a potential association between two categorical variables (looking for obvious skews in the data). One way to test for a potential association between categorical variables is to perform a chi-square test of independence. Essentially, the procedure involves using the marginal frequencies in the contingency table to construct a table with expected values of the observations. The actual values and expected values are used to derive the chi-square test statistic. This test statistic is then compared to a value from the chi-square distribution for a given level of significance. If the test statistic is greater than the chi-square distribution value, then there is evidence to reject the claim of independence, implying a significant association exists between the categorical variables. *later reading for better chi-square notes* Formulas: Definitions: Contingency table - tabular format that displays the frequency distributions of two or more categorical variables simultaneously and is used for finding patterns between the variables. Joint frequencies - joining one variable from the row (e.g., sector) and the other variable from the column (e.g., market cap) to count observations. Marginal frequencies - sums of joint frequencies added across rows and/or across columns. Confusion matrix - analytical tool using a contingency table to evaluate the performance of a classification model. 
Chi-square test of independence - a test for a potential association between categorical variables; it involves computing expected values from the marginal frequencies, deriving a test statistic from the actual and expected values, and comparing that statistic to a chi-square distribution value at a given level of significance to reject or fail to reject the claim of independence.
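The sketch referenced in the section above - a small pandas/scipy example of a contingency table (joint and marginal frequencies) and a chi-square test of independence. The sector and market-cap labels are made up for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical data for 8 stocks
df = pd.DataFrame({
    "sector":  ["Health", "Health", "Energy", "Tech", "Tech", "Energy", "Health", "Tech"],
    "mkt_cap": ["Small", "Large", "Small", "Large", "Small", "Large", "Small", "Small"],
})

joint = pd.crosstab(df["sector"], df["mkt_cap"])                            # joint frequencies (R x C)
with_margins = pd.crosstab(df["sector"], df["mkt_cap"], margins=True)       # adds marginal frequencies
row_relative = pd.crosstab(df["sector"], df["mkt_cap"], normalize="index")  # each row sums to 100%

# Chi-square test of independence on the joint frequencies (margins excluded)
chi2_stat, p_value, dof, expected = chi2_contingency(joint)

print(with_margins)
print(row_relative)
print(chi2_stat, p_value, dof)
```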