S2.1 & S2.4 Outliers & Boxplots
How can you compare 2 sets of data?
Always comment on a measure of location and a measure of spread. If there are extreme values in the data set, use the median and IQR. Otherwise, use either the median and IQR or the mean and standard deviation
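As a rough illustration of this comparison, the sketch below summarises two made-up sets of marks with the median and IQR, since one set contains an extreme value. The data, the names `class_a`, `class_b` and `median_and_iqr`, and the choice of the "inclusive" quartile method are all assumptions for the example; your course may calculate quartiles slightly differently.

```python
import statistics

def median_and_iqr(data):
    """Return (median, IQR) for a list of numbers.

    Assumption: quartiles use Python's "inclusive" interpolation method,
    which may differ slightly from the rule taught on a given course.
    """
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return statistics.median(data), q3 - q1

# Hypothetical marks; class_b contains an extreme value (45).
class_a = [12, 14, 15, 15, 16, 17, 18]
class_b = [10, 13, 15, 16, 16, 18, 45]

for name, marks in (("A", class_a), ("B", class_b)):
    median, iqr = median_and_iqr(marks)
    print(f"Class {name}: median = {median}, IQR = {iqr}")
```

With these numbers, class B has the higher median (16 against 15) and the larger IQR (3 against 2), so a comparison would say its marks are typically higher but less consistent.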
What is an outlier?
An extreme value that lies outside the overall pattern of the data
How do you plot a box plot if you don't know the largest and smallest value?
Find the largest or smallest value that is not an outlier / use the outlier boundary (e.g. Q₁ - (1.5 x IQR))
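A minimal sketch of the first option, finding the largest and smallest values that are not outliers so they can be used as the ends of the whiskers. The function name `whisker_ends`, the sample data and the "inclusive" quartile method are assumptions made for the example.

```python
import statistics

def whisker_ends(data):
    """Return the smallest and largest values that are not outliers.

    Uses the Q1 - 1.5 x IQR and Q3 + 1.5 x IQR boundaries.
    Assumption: quartiles come from Python's "inclusive" method.
    """
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_boundary = q1 - 1.5 * iqr
    upper_boundary = q3 + 1.5 * iqr
    # Keep only the values that lie on or inside the boundaries.
    inside = [x for x in data if lower_boundary <= x <= upper_boundary]
    return min(inside), max(inside)

marks = [2, 4, 5, 5, 6, 7, 8, 9, 30]  # hypothetical data
print(whisker_ends(marks))  # (2, 9): 30 lies above the upper boundary of 12.5
```

Here the whiskers would be drawn to 2 and 9, and 30 would be marked separately as an outlier.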
What is the best measure of variation if a data set contains outliers?
The interquartile range, because it is not affected by extreme values. The range, variance and standard deviation are not suitable because outliers make them much larger than they would be otherwise
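To see this effect on actual numbers, the sketch below works out the range, standard deviation and IQR for a made-up data set, first without and then with one extreme value added. The data and the name `spread_measures` are assumptions for the example, as is the use of the population standard deviation and the "inclusive" quartile method.

```python
import statistics

def spread_measures(data):
    """Return the range, standard deviation and IQR of a list of numbers.

    Assumptions: population standard deviation (pstdev) and Python's
    "inclusive" quartile method.
    """
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "range": max(data) - min(data),
        "sd": statistics.pstdev(data),
        "iqr": q3 - q1,
    }

without_outlier = [2, 4, 5, 5, 6, 7, 8, 9]   # hypothetical data
with_outlier = without_outlier + [30]        # same data plus one extreme value

print(spread_measures(without_outlier))
print(spread_measures(with_outlier))
```

Adding the single value 30 takes the range from 7 to 28 and more than triples the standard deviation, while the IQR only moves from 2.5 to 3.0.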
What is the outlier boundary?
The cut-off value beyond which a data point counts as an outlier: anything below the lower boundary or above the upper boundary is an outlier
What is a boxplot?
A diagram that visually represents a data set. It shows the quartiles, the maximum and minimum values and any outliers
What are some common definitions of outliers?
1) More than 2 standard deviations above or below the mean. 2) Less than Q₁ - (1.5 x IQR) or more than Q₃ + (1.5 x IQR)
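Both definitions can be tried out on a small data set. This is only a sketch: the function names, the sample data, the population standard deviation and the "inclusive" quartile method are assumptions, and an exam question will state which definition to use.

```python
import statistics

def iqr_boundaries(data):
    """Return (Q1 - 1.5 x IQR, Q3 + 1.5 x IQR).

    Assumption: quartiles come from Python's "inclusive" method.
    """
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outliers_by_iqr(data):
    """Values below the lower boundary or above the upper boundary."""
    lower, upper = iqr_boundaries(data)
    return [x for x in data if x < lower or x > upper]

def outliers_by_sd(data, k=2):
    """Values more than k standard deviations away from the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population standard deviation (assumption)
    return [x for x in data if abs(x - mean) > k * sd]

sample = [2, 4, 5, 5, 6, 7, 8, 9, 30]  # hypothetical data
print(iqr_boundaries(sample))   # (0.5, 12.5)
print(outliers_by_iqr(sample))  # [30]
print(outliers_by_sd(sample))   # [30]
```

For this sample both definitions pick out 30, but they do not always agree, which is why the definition in use should be stated.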
What makes an outlier an anomaly?
Sometimes outliers are legitimate values which could still be correct. Anomalies are outliers that are clearly an error and would be misleading to keep in. This might be due to an experimental or recording error, or because the data is not relevant (e.g. an adult at a kids' party)
What is cleaning?
Removing anomalies from a data set