ch 3
Frequency distribution
- For a categorical variable, a frequency distribution groups the data into categories and records the number of observations that fall into each category.
- For a numerical variable, a frequency distribution groups the data into intervals and records the number of observations that fall into each interval.
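Both cases can be sketched in a few lines of Python. The function names, the sample data, and the choice of half-open intervals [lower, upper) are assumptions made for illustration; some texts close each interval on the right instead.

```python
from collections import Counter


def categorical_frequency(values):
    """Return {category: number of observations in that category}."""
    return dict(Counter(values))


def numerical_frequency(values, start, width, n_intervals):
    """Return {(lower, upper): count} for equal-width intervals.

    Assumption: each interval is half-open, [lower, upper).
    Observations outside the covered span are ignored.
    """
    counts = [0] * n_intervals
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_intervals:
            counts[idx] += 1
    return {
        (start + i * width, start + (i + 1) * width): c
        for i, c in enumerate(counts)
    }


# Hypothetical sample data.
colors = ["red", "blue", "red", "green", "red"]
ages = [1, 5, 12, 19, 20]

print(categorical_frequency(colors))
print(numerical_frequency(ages, start=0, width=10, n_intervals=3))
```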
Bubble plot
A bubble plot shows the relationship between three numerical variables in a two-dimensional graph. Two variables are plotted on the axes, and the third is represented by the size of the bubble.
Using a contingency table to display two categorical variables
A contingency table shows the frequencies for two categorical variables, x and y, where each cell represents a mutually exclusive combination of x and y values.
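As a rough sketch of how the cells of a contingency table are filled, the Python below counts each (x, y) pair. The helper name and the gender/plan data are hypothetical.

```python
from collections import Counter


def contingency_table(xs, ys):
    """Return {(x, y): count} for paired categorical observations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    return Counter(zip(xs, ys))


# Hypothetical paired observations: one (gender, plan) pair per customer.
gender = ["F", "M", "F", "F", "M"]
plan = ["basic", "basic", "premium", "basic", "premium"]

table = contingency_table(gender, plan)

# Print one row per x category, one column per y category.
for x in sorted(set(gender)):
    row = {y: table[(x, y)] for y in sorted(set(plan))}
    print(x, row)
```

Because every observation lands in exactly one cell, the cell counts sum to the number of observations.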
Line chart
A line chart shows a numerical variable as a series of data points connected by a line. It is especially useful for tracking changes or trends over time.
Using a scatter plot to display the relationship between two numerical variables
A scatter plot is a graphical tool that helps determine whether two numerical variables are related in some systematic way. Each point in a scatter plot represents a paired observation for the two variables.
Scatter plot with a categorical variable
A scatter plot with a categorical variable shows the relationship between two numerical variables and a categorical variable in a two-dimensional graph. The categorical variable is typically represented by a different color for each category.
Using a stacked column chart to display two categorical variables
A stacked column chart is designed to visualize more than one categorical variable. It allows the composition within each category to be compared.
Weakness of the range
The main weakness of the range is that it ignores all observations except the extremes.
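A small illustration of this weakness, using two hypothetical data sets that share the same minimum and maximum but differ in between:

```python
import statistics


def value_range(values):
    """Range = maximum minus minimum."""
    return max(values) - min(values)


# Hypothetical data: identical extremes, different interior values.
clustered = [1, 5, 5, 5, 9]
spread_out = [1, 1, 5, 9, 9]

print(value_range(clustered), value_range(spread_out))

# The population standard deviation uses every observation,
# so it distinguishes the two data sets where the range cannot.
print(statistics.pstdev(clustered), statistics.pstdev(spread_out))
```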
Mean
The mean is the most commonly used measure of central location. One weakness of the mean is that it is unduly influenced by outliers.
Median
The median is the middle observation of a variable; that is, it divides the variable in half. The median is especially useful when outliers are present.
Mode
The mode is the most frequently occurring observation of a variable. A variable may have no mode or more than one mode. The mode is the only meaningful measure of central location for a categorical variable.
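The three measures of central location can be compared with Python's standard `statistics` module (Python 3.8 or later for `multimode`). The salary and color data are hypothetical. Note one convention difference: when every value occurs once, `multimode` returns all of them, whereas many texts would say such a variable has no mode.

```python
import statistics

# Hypothetical salaries (in thousands), then the same data plus one outlier.
salaries = [40, 45, 50, 55, 60]
with_outlier = salaries + [500]

# The mean is pulled strongly toward the outlier; the median barely moves.
print(statistics.mean(salaries), statistics.mean(with_outlier))
print(statistics.median(salaries), statistics.median(with_outlier))

# The mode also works for categorical data; multimode returns every
# value tied for the highest frequency.
colors = ["red", "blue", "red", "green"]
print(statistics.multimode(colors))
print(statistics.multimode([1, 1, 2, 2, 3]))
```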
Histogram
a series of rectangles where the width and height of each rectangle represent the interval width and frequency (or relative frequency) of the respective interval.
The variance
an average of the squared differences between the observations and the mean. The standard deviation is the positive square root of the variance.
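A minimal sketch of the calculation follows. The function names and data are illustrative. The code assumes the usual distinction between the sample variance, which divides the sum of squared differences by n - 1, and the population variance, which divides by n.

```python
import math


def variance(values, sample=True):
    """Average squared difference from the mean.

    Assumption: sample=True divides by n - 1 (sample variance);
    sample=False divides by n (population variance).
    """
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((v - mean) ** 2 for v in values)
    return sum_sq / (n - 1) if sample else sum_sq / n


def standard_deviation(values, sample=True):
    """Positive square root of the variance."""
    return math.sqrt(variance(values, sample))


# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(variance(data, sample=False), standard_deviation(data, sample=False))
print(variance(data), standard_deviation(data))
```

Python's `statistics.variance` and `statistics.pvariance` compute the same two quantities.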
The correlation coefficient
between two variables x and y indicates the direction and strength of the linear relationship between them.
The covariance
between two variables x and y indicates whether they have a negative linear relationship, a positive linear relationship, or no linear relationship.
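The covariance and the correlation coefficient are closely linked: the correlation coefficient is the covariance divided by the product of the two standard deviations, which confines it to the interval from -1 to 1. The sketch below assumes the sample versions (dividing by n - 1); the function names and data are illustrative.

```python
import math


def sample_covariance(xs, ys):
    """Sum of products of paired deviations from the means, over n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return products / (n - 1)


def correlation(xs, ys):
    """Covariance scaled by both sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical data: y rises with x in one case and falls in the other.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]
y_down = [10, 8, 6, 4, 2]

print(sample_covariance(x, y_up), correlation(x, y_up))
print(sample_covariance(x, y_down), correlation(x, y_down))
```

The sign of the covariance gives the direction of the linear relationship, but its size depends on the units of x and y. The correlation coefficient is unit-free, so it also conveys strength.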
A heat map
is an important visualization tool that uses color and color intensity to display relationships between variables.
The skewness coefficient
measures the degree to which a distribution is not symmetric about its mean. A symmetric distribution has a skewness coefficient of 0. A positively (negatively) skewed distribution has a positive (negative) skewness coefficient.
The kurtosis coefficient
measures whether the tails of a distribution are more or less extreme than those of the normal distribution. Because the normal distribution has a kurtosis coefficient of 3, it is common to calculate the excess kurtosis of a distribution as the kurtosis coefficient minus 3.
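Skewness and kurtosis can both be sketched from central moments. The code below assumes the simple moment-based (population) formulas, skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, where mk is the average of the k-th powers of the deviations from the mean. Spreadsheet and statistics packages often apply sample-size adjustments, so their values may differ slightly. Function names and data are illustrative.

```python
def central_moment(values, k):
    """Average of the k-th powers of deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n


def skewness(values):
    """Moment-based skewness coefficient (assumed population formula)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Moment-based kurtosis coefficient; equals 3 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def excess_kurtosis(values):
    """Kurtosis coefficient minus 3."""
    return kurtosis(values) - 3


# Hypothetical data: one symmetric set, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 1, 10]

print(skewness(symmetric), skewness(right_skewed))
print(kurtosis(symmetric), excess_kurtosis(symmetric))
```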
Mean absolute deviation (MAD)
the average of the absolute differences between the observations and the mean.
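The MAD calculation is short enough to write out directly. The function name and data below are illustrative.

```python
def mean_absolute_deviation(values):
    """Average absolute difference between each observation and the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(v - mean) for v in values) / n


# Hypothetical data with mean 6: absolute differences are 4, 2, 0, 2, 4.
data = [2, 4, 6, 8, 10]
print(mean_absolute_deviation(data))
```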
The interquartile range (IQR)
the difference between the third quartile and the first quartile, or IQR = Q3 - Q1. This measure does not rely on extreme observations; however, it does not incorporate all observations.
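As a sketch, the IQR can be computed with `statistics.quantiles` from the Python standard library (Python 3.8 or later). Quartile conventions vary between textbooks and software; this example assumes the function's default "exclusive" method, and the data are hypothetical.

```python
import statistics


def interquartile_range(values):
    """IQR = Q3 - Q1, using the default 'exclusive' quartile method."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical data, then the same data with the maximum replaced by an outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]

# The IQR is unchanged by the outlier, while the range changes drastically.
print(interquartile_range(data), interquartile_range(with_outlier))
print(max(data) - min(data), max(with_outlier) - min(with_outlier))
```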