Exploratory Data Analysis
Exploratory data analysis checklist
1. Formulate your question
2. Read in your data
3. Check the packaging
4. Run str()
5. Look at the top and the bottom of your data
6. Check your "n"s
7. Validate with at least one external data source
8. Try the easy solution first
9. Challenge your solution
10. Follow up
Principles of analytic graphics
1. Show comparisons
2. Show causality, mechanism, explanation
3. Show multivariate data
4. Integrate multiple modes of evidence
5. Describe and document the evidence
6. Content is King
Common dplyr Function Properties
1. The first argument is a data frame.
2. The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names).
3. The return result of a function is a new data frame.
4. Data frames must be properly formatted and annotated for this to all be useful. In particular, the data must be tidy. In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.
check the packaging
Assuming you don't get any warnings or errors when reading in the dataset, you should now have an object in your workspace named ozone. It's usually a good idea to poke at that object a little bit before we break open the wrapping paper. For example, you can check the number of rows and columns.
nrow(ozone)
7147884
ncol(ozone)
23
Remember when I said there were 7,147,884 rows in the file? How does that match up with what we've read in? This dataset also has relatively few columns, so you might be able to check the original text file to see if the number of columns printed out here (23) matches the number of columns you see in the original file.
Follow up questions
At this point it's useful to consider a few follow-up questions.
1. Do you have the right data? Sometimes at the conclusion of an exploratory data analysis, the conclusion is that the dataset is not really appropriate for this question. In this case, the dataset seemed perfectly fine for answering the question of which counties had the highest levels of ozone.
2. Do you need other data? One sub-question we tried to address was whether the county rankings were stable across years. We addressed this by resampling the data once to see if the rankings changed, but the better way to do this would be to simply get the data for previous years and re-do the rankings.
3. Do you have the right question? In this case, it's not clear that the question we tried to answer has immediate relevance, and the data didn't really indicate anything to increase the question's relevance. For example, it might have been more interesting to assess which counties were in violation of the national ambient air quality standard, because determining this could have regulatory implications. However, this is a much more complicated calculation to do, requiring data from at least 3 previous years.
The goal of exploratory data analysis is to get you thinking about your data and reasoning about your question. At this point, we can refine our question or collect new data, all in an iterative process to get at the truth.
run str()
Compactly display the internal structure of an R object, a diagnostic function and an alternative to summary (and to some extent, dput). Ideally, only one line for each 'basic' structure is displayed. It is especially well suited to compactly display the (abbreviated) contents of (possibly nested) lists. The idea is to give reasonable output for any R object. It calls args for (non-primitive) function objects.
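As a small illustration, str() can be run on a toy data frame. The data frame readings and its columns are made up for this sketch; they are not the ozone data used elsewhere in these notes.

```r
# Hypothetical toy data frame standing in for a freshly read dataset
readings <- data.frame(
  site  = c("A", "B", "C"),
  ozone = c(0.031, 0.047, 0.052),
  valid = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Prints one line per column: its name, its type, and the first few values
str(readings)

# The column types can also be collected programmatically
col_types <- sapply(readings, class)
```

Scanning the str() output is a quick way to spot a column that was read in with an unexpected type.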
with()
Evaluates an expression in an environment constructed from the data, so columns can be referred to by name.
Usage: with(data, expr, ...)
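A minimal sketch of with(), assuming a made-up data frame called readings: inside with(), column names can be used directly, which gives the same result as the $ form.

```r
# Hypothetical toy data (not from the original notes)
readings <- data.frame(
  ozone = c(0.031, 0.047, 0.052),
  temp  = c(61, 74, 79)
)

# Columns are referred to by name inside with()
mean_with <- with(readings, mean(ozone))

# Equivalent form using the $ operator
mean_dollar <- mean(readings$ozone)

# with() is handy when an expression uses several columns at once
cor_with <- with(readings, cor(ozone, temp))
```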
simple summaries: one dimension
• Five-number summary: This gives the minimum, 25th percentile, median, 75th percentile, and maximum of the data and is a quick check on the distribution of the data (see the fivenum() function).
• Boxplots: Boxplots are a visual representation of the five-number summary plus a bit more information. In particular, boxplots commonly plot outliers that go beyond the bulk of the data. This is implemented via the boxplot() function.
• Barplot: Barplots are useful for visualizing categorical data, with the number of entries for each category being proportional to the height of the bar. Think "pie chart" but actually useful. The barplot can be made with the barplot() function.
• Histograms: Histograms show the complete empirical distribution of the data, beyond the five data points shown by the boxplots. Here, you can easily check skewness of the data, symmetry, multi-modality, and other features. The hist() function makes a histogram, and a handy function to go with it sometimes is the rug() function.
• Density plot: The density() function computes a non-parametric estimate of the distribution of a variable.
Formulate your question
Formulating a question can be a useful way to guide the exploratory data analysis process and to limit the exponential number of paths that can be taken with any sizeable dataset. In particular, a sharp question or hypothesis can serve as a dimension reduction tool that can eliminate variables that are not immediately relevant to the question.
histogram
Histograms show the complete empirical distribution of the data, beyond the five data points shown by the boxplots. Here, you can easily check skewness of the data, symmetry, multi-modality, and other features. The hist() function makes a histogram, and a handy function to go with it sometimes is the rug() function.
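The following sketch draws a histogram with a rug underneath it. The values are simulated (an assumption made for illustration), and pdf(NULL) is used only so the code runs without a display; in an interactive session that line and the dev.off() call can be dropped.

```r
pdf(NULL)      # null device: nothing is written to disk
set.seed(1)

# Simulated, right-skewed values standing in for a pollutant measurement
pm25 <- rexp(200, rate = 1 / 10)

# hist() draws the plot and also returns the bin counts invisibly;
# breaks = 20 is only a suggestion for the number of bins
h <- hist(pm25, breaks = 20, col = "green", main = "Simulated PM2.5")

# rug() adds one tick per observation along the x-axis
rug(pm25)

invisible(dev.off())
```

The rug shows exactly where the individual points fall, which the bars alone hide.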
Look at the top and the bottom of your data
I find it useful to look at the "beginning" and "end" of a dataset right after I check the packaging. This lets me know if the data were read in properly, things are properly formatted, and that everything is there. If your data are time series data, then make sure the dates at the beginning and end of the dataset match what you expect the beginning and ending time period to be. You can peek at the top and bottom of the data with the head() and tail() functions.
head(ozone[, c(6:7, 10)])
  Latitude Longitude Date.Local
1   30.498 -87.88141 2014-03-01
2   30.498 -87.88141 2014-03-01
3   30.498 -87.88141 2014-03-01
4   30.498 -87.88141 2014-03-01
5   30.498 -87.88141 2014-03-01
6   30.498 -87.88141 2014-03-01
tail(ozone[, c(6:7, 10)])
        Latitude Longitude Date.Local
7147879 18.17794 -65.91548 2014-09-30
7147880 18.17794 -65.91548 2014-09-30
7147881 18.17794 -65.91548 2014-09-30
7147882 18.17794 -65.91548 2014-09-30
7147883 18.17794 -65.91548 2014-09-30
7147884 18.17794 -65.91548 2014-09-30
Validate with at least one external data source
Knowing that the national standard for ozone is something like 0.075, we can see from the data that
- The data are at least of the right order of magnitude (i.e. the units are correct)
- The range of the distribution is roughly what we'd expect, given the regulation around ambient pollution levels
- Some hourly levels (less than 10%) are above 0.075 but this may be reasonable given the wording of the standard and the averaging involved.
R graphing resources
R graph gallery
R bloggers
Challenge your solution
The easy solution is nice because it is, well, easy, but you should never allow those results to hold the day. You should always be thinking of ways to challenge the results, especially if those results comport with your prior expectation.
group_by
The group_by() function is used to generate summary statistics from the data frame within strata defined by a variable. It is often used with summarize().
First, we can create a year variable using as.POSIXlt().
chicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)
Now we can create a separate data frame that splits the original data frame by year.
years <- group_by(chicago, year)
Read in your data
The next task in any exploratory data analysis is to read in some data. Sometimes the data will come in a very messy format and you'll need to do some cleaning. Other times, someone else will have cleaned up the data for you, so you'll be spared the pain of having to do the cleaning.
select
The select() function can be used to select columns of a data frame that you want to focus on.
names(chicago)[1:3]
"city" "tmpd" "dptp"
subset <- select(chicago, city:dptp)
head(subset)
  city tmpd   dptp
1 chic 31.5 31.500
2 chic 33.0 29.875
3 chic 33.0 27.375
4 chic 29.0 28.625
5 chic 32.0 28.875
6 chic 40.0 35.125
You can also omit variables using the select() function by using the negative sign.
select(chicago, -(city:dptp))
abline()
This function adds one or more straight lines through the current plot.
Try the easy solution first
What's the simplest answer we could provide to this question? For the moment, don't worry about whether the answer is correct; the point is to see how you could provide prima facie evidence for your hypothesis or question. You may refute that evidence later with deeper analysis, but this is the first pass.
full name of a function
You may get some warnings when the package is loaded because there are functions in the dplyr package that have the same name as functions in other packages. For now you can ignore the warnings.
NOTE: If you ever run into a problem where R is getting confused over which function you mean to call, you can specify the full name of a function using the :: operator. The full name is simply the package name from which the function is defined followed by :: and then the function name. For example, the filter function from the dplyr package has the full name dplyr::filter. Calling functions with their full name will resolve any confusion over which function was meant to be called.
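A base-R sketch of the same idea, under one loud assumption: the user-defined median below is a hypothetical stand-in for a masking situation such as dplyr::filter versus stats::filter. It shows that the :: form always reaches the package's version.

```r
# Hypothetical: a user-defined function that shares its name with stats::median
median <- function(x) "my own median"

values <- c(3, 1, 2)

# The unqualified name finds the user-defined version first
masked <- median(values)

# The full name always refers to the function defined in the stats package
explicit <- stats::median(values)

# Clean up so the rest of the session sees the usual median() again
rm(median)
```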
bmp
a native Windows bitmapped format
add a line
abline(h = 12)
abline(v = median(pollution$pm25), col = "magenta", lwd = 4)
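abline() needs an existing plot to draw on, so a fuller sketch looks like the following. The pollution data frame here is simulated for illustration only, and pdf(NULL) is there just so the code runs without a display.

```r
pdf(NULL)      # null device; omit in an interactive session
set.seed(2)

# Simulated stand-in for the pollution data frame (assumption)
pollution <- data.frame(pm25 = rnorm(100, mean = 10, sd = 2))
med <- median(pollution$pm25)

# Vertical reference lines on a histogram
hist(pollution$pm25, col = "green", main = "Simulated PM2.5")
abline(v = 12, lwd = 2)                    # fixed value, thicker line
abline(v = med, col = "magenta", lwd = 4)  # median of the data

# Horizontal reference line on a boxplot
boxplot(pollution$pm25, col = "blue")
abline(h = 12)

invisible(dev.off())
```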
line width
abline(v=12,lwd=2)
mutate
add new variables/columns or transform existing variables
chicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
head(chicago)
There is also the related transmute() function, which does the same thing as mutate() but then drops all non-transformed variables.
head(transmute(chicago,
               pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE),
               o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE)))
bg
background color
plotting systems in R
base plotting: "artist's palette" model; start with a blank canvas and work your way up
lattice: entire plot specified by one function; conditioning
ggplot2: mixes elements of base and lattice
base plotting system
The basic plotting system. You start with a blank canvas and work your way up: begin with the plot function, then add annotations. You can't go backwards once something is drawn. It gives you the flexibility to specify details with painstaking accuracy, though sometimes you would rather have the system make those choices for you. A plot is difficult to describe or translate to others because there is no clear graphical language or grammar to communicate what you did; it is just a list of commands.
png
bitmap format, good for line drawings or images with solid colors, uses lossless compression (like the old GIF format), most web browsers can read this format natively, good for plotting many many points, does not resize well
jpeg
bitmap format: good for photographs or natural scenes, uses lossy compression, good for plotting many many points, does not resize well, can be read by almost any computer and any web browser, not great for line drawings
color
boxplot(pollutant$mean, col = "blue")
xlab
character string for x-axis label
ylab
character string for y-axis label
check your n's
Check that the counts match what you expect. Use landmarks, such as the number of people in a study or the number of states represented, to confirm that the numbers are correct.
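One way to check a landmark count is sketched below. The monitors data frame and the expected count of 3 states are invented for this example.

```r
# Hypothetical toy data: hourly readings from monitors in three states
monitors <- data.frame(
  state = c("CA", "CA", "NY", "NY", "TX", "TX"),
  hour  = c("00:00", "01:00", "00:00", "01:00", "00:00", "01:00"),
  stringsAsFactors = FALSE
)

# How many observations are there for each hour?
hour_counts <- table(monitors$hour)

# How many distinct states are represented?
n_states <- length(unique(monitors$state))

# Compare against the landmark you know from outside the data
expected_states <- 3
states_match <- n_states == expected_states
```

If states_match came back FALSE, that would be a signal to go looking for missing or mislabelled rows before doing any analysis.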
dev.copy
copy a plot from one device to another
tiff
creates bitmap files in the TIFF format; supports lossless compression
filter
extract a subset of rows from a data frame based on logical conditions
chic.f <- filter(chicago, pm25tmean2 > 30)
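A self-contained sketch of filter(), assuming the dplyr package is installed. The chicago_toy data frame is made up for illustration; only the column names pm25tmean2 and tmpd are taken from the notes. It also shows how conditions can be combined with &.

```r
library(dplyr)

# Hypothetical toy data reusing two column names from the chicago dataset
chicago_toy <- data.frame(
  pm25tmean2 = c(12, 35, 41, NA, 33),
  tmpd       = c(60, 85, 70, 90, 82)
)

# Keep rows where PM2.5 is above 30; rows where the condition is NA are dropped
chic.f <- filter(chicago_toy, pm25tmean2 > 30)

# Conditions can be combined: PM2.5 above 30 AND temperature above 80
chic.f2 <- filter(chicago_toy, pm25tmean2 > 30 & tmpd > 80)
```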
dev.cur()
returns the currently active graphics device; useful when more than one device is open at a time
summarize
generate summary statistics of different variables in the data frame, possibly within strata
Example 1. First, we can create a year variable using as.POSIXlt().
chicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)
Now we can create a separate data frame that splits the original data frame by year.
years <- group_by(chicago, year)
Finally, we compute summary statistics for each year in the data frame with the summarize() function.
summarize(years, pm25 = mean(pm25, na.rm = TRUE),
          o3 = max(o3tmean2, na.rm = TRUE),
          no2 = median(no2tmean2, na.rm = TRUE))
Example 2. First, we can create a categorical variable of pm25 divided into quintiles.
qq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)
chicago <- mutate(chicago, pm25.quint = cut(pm25, qq))
Now we can group the data frame by the pm25.quint variable.
quint <- group_by(chicago, pm25.quint)
Finally, we can compute the mean of o3 and no2 within quintiles of pm25.
summarize(quint, o3 = mean(o3tmean2, na.rm = TRUE),
          no2 = mean(no2tmean2, na.rm = TRUE))
In short:
1. create a new variable pm25.quint
2. split the data frame by that new variable
3. compute the mean of o3 and no2 in the sub-groups defined by pm25.quint
boxplots
graph based on the five-number summary, plus outliers; made with boxplot()
x11()
launches screen device on linux/unix
quartz()
launches screen device on mac
windows()
launches screen device on windows
lm()
lm is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although aov may provide a more convenient interface for these).
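A minimal regression sketch with lm(); the x and y values are invented for illustration.

```r
# Hypothetical toy data with a roughly linear relationship
x <- c(1, 2, 3, 4, 5)
y <- c(3.1, 4.9, 7.2, 8.8, 11.0)

# Fit the simple linear regression y = intercept + slope * x
fit <- lm(y ~ x)

# Estimated intercept and slope
estimates <- coef(fit)
slope <- estimates[["x"]]
```

On an existing scatterplot of y against x, abline(fit) would add the fitted line.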
mar
margin size
five number summary
minimum, Q1, median, Q3, maximum; computed with fivenum()
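A small worked example, using made-up values, of the five-number summary and of the outlier rule that boxplots build on it (points more than 1.5 times the interquartile range beyond the quartiles, by default):

```r
# Hypothetical values with one unusually large observation
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 30)

# minimum, lower hinge (Q1), median, upper hinge (Q3), maximum
five <- fivenum(x)                # 2 4 7 10 30

# boxplot.stats() is the computation behind boxplot(); $out lists the points
# drawn individually as outliers
outliers <- boxplot.stats(x)$out  # 30
```

Here the interquartile range is 10 - 4 = 6, so anything above 10 + 1.5 * 6 = 19 is flagged, which is why 30 appears as an outlier.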
mfcol
a vector c(nr, nc) giving the number of rows and columns of plots on the device (plots are filled column-wise)
mfrow
a vector c(nr, nc) giving the number of rows and columns of plots on the device (plots are filled row-wise)
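A sketch of a one-row, two-column panel set up with par(). The data are simulated, and the mar and oma values are arbitrary choices for illustration; pdf(NULL) is there only so the code runs without a display.

```r
pdf(NULL)      # null device; omit in an interactive session
set.seed(3)
x <- rnorm(50)  # simulated data (assumption)

# 1 row, 2 columns, filled row-wise; par() returns the previous settings
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))

hist(x, main = "Histogram")
boxplot(x, main = "Boxplot")

# Overall title written into the outer margin
mtext("Two views of x", outer = TRUE)

layout_now <- par("mfrow")
par(old)       # restore the previous settings
invisible(dev.off())
```

Using mfcol = c(1, 2) instead would give the same layout here; the two only differ in fill order when there is more than one row and more than one column.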
dplyr
package in R that simplifies manipulation of data in easy-to-understand terms (using a "grammar" of verbs); d for data, plyr for pliers
?par
parameters for graphics
vector file device formats
pdf svg win.metafile postscript
lattice system
- plots created with a single function call (xyplot, bwplot, etc.)
- most useful for conditioning types of plots: looking at how y changes with x across levels of z
- things like margins/spacing are set automatically because the entire plot is specified at once
- good for putting many many plots on a screen
bitmap formats
png jpeg tiff bmp
rename
rename variables in a data frame
chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
head(chicago[, 1:5], 3)
The syntax inside the rename() function is to have the new name on the left-hand side of the = sign and the old name on the right-hand side.
arrange
reorder rows of a data frame
chicago <- arrange(chicago, date)
head(select(chicago, date, pm25tmean2), 3)
Rows can be arranged in descending order too by using the special desc() operator.
chicago <- arrange(chicago, desc(date))
dplyr "grammar"/ key verbs
select filter arrange rename mutate summarize %>%
sides of a plot
side 1 = bottom
side 2 = left
side 3 = top
side 4 = right
Density plot
similar in purpose to a histogram, except that instead of bars of counts it draws a smooth curve estimating the distribution of the data. It can be useful for looking at the shape of a distribution. The density() function computes a non-parametric estimate of the distribution of a variable.
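A sketch of a density plot on simulated, two-humped data (an assumption chosen so that the multi-modality is visible); pdf(NULL) is used only so the code runs without a display.

```r
pdf(NULL)      # null device; omit in an interactive session
set.seed(4)

# Simulated data drawn from two groups centred at 0 and 5
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 5))

# Kernel density estimate with the default bandwidth and kernel
d <- density(x)

# Plot the smooth curve and mark the raw observations underneath
plot(d, main = "Density estimate of simulated data")
rug(x)

invisible(dev.off())
```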
what is a graphics device?
something where you can make a plot appear
dev.copy2pdf
specifically copy a plot to a PDF file
ggplot system
- splits the difference between the base and lattice systems
- automatically deals with spacing, text, and titles, but also allows you to annotate by "adding" to a plot
- superficial similarity to lattice, but generally easier/more intuitive to use
- default mode makes many choices for you (but you can still customize to your heart's desire)
%>%
the "pipe" operator is used to connect multiple verb actions together
mutate(chicago, pm25.quint = cut(pm25, qq)) %>%
  group_by(pm25.quint) %>%
  summarize(o3 = mean(o3tmean2, na.rm = TRUE),
            no2 = mean(no2tmean2, na.rm = TRUE))
Once you travel down the pipeline with %>%, the first argument is taken to be the output of the previous element in the pipeline.
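To make the "first argument" rule concrete, the sketch below writes the same computation twice, once as nested calls and once as a pipeline. It assumes dplyr is installed; the air data frame is invented for illustration.

```r
library(dplyr)

# Hypothetical toy data: a few PM2.5 readings across two years
air <- data.frame(
  year = c(2013, 2013, 2014, 2014, 2014),
  pm25 = c(10, 14, NA, 8, 12)
)

# Nested form: read from the inside out
nested <- summarize(group_by(air, year), pm25 = mean(pm25, na.rm = TRUE))

# Piped form: each step's output becomes the next step's first argument
piped <- air %>%
  group_by(year) %>%
  summarize(pm25 = mean(pm25, na.rm = TRUE))
```

Both forms give one row per year; the piped version simply reads in the order the steps happen.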
lty
the line type (default is solid line), can be dashed, dotted, etc
lwd
the line width, specified as an integer multiple
las
the orientation of the axis labels on the plot
oma
the outer margin size (default is 0 for all sides)
col
the plotting color, specified as a number, string, or hex code; the colors() function gives you a vector of colors by name
pch
the plotting symbol (default is open circle)
mtext
used to create an overall title for the panel of plots
barplot
useful for visualizing categorical data barplot()
win.metafile
vector format: Windows metafile format (only on Windows)
postscript
vector format: older format, also resizes well, usually portable, can be used to create encapsulated postscript files; windows systems often don't have a postscript viewer
pdf
vector format: useful for line-type graphics, resizes well, usually portable, not efficient if a plot has many objects/points
svg
vector format: XML-based scalable vector graphics, supports animation and interactivity, potentially useful for web-based plots
two basic types of file devices
vector: good for line drawings and plots with solid colors using a modest number of points
bitmap: good for plots with a large number of points, natural scenes, or web-based plots
dev.set(<integer>)
you can change the active graphics device with dev.set(<integer>), where <integer> is the number associated with the graphics device you want to switch to
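A sketch of switching between two open devices. Null PDF devices (pdf(NULL)) are used here only so the example runs anywhere without creating files; with screen or file devices the calls are the same.

```r
# Open two devices and remember the number each one was given
pdf(NULL)
first <- dev.cur()
pdf(NULL)
second <- dev.cur()   # the most recently opened device is the active one

# All open devices and their numbers
open_devices <- dev.list()

# Make the first device active again
dev.set(first)
active <- dev.cur()

# Close both devices
invisible(dev.off(first))
invisible(dev.off(second))
```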
simple summaries: 2 dimensions and beyond
• Multiple or overlaid 1-D plots (Lattice/ggplot2): Using multiple boxplots or multiple histograms can be useful for seeing the relationship between two variables, especially when one is naturally categorical.
• Scatterplots: Scatterplots are the natural tool for visualizing two continuous variables. Transformations of the variables (e.g. log or square-root transformation) may be necessary for effective visualization.
• Smooth scatterplots: Similar in concept to scatterplots, but these plot a 2-D histogram of the data instead. Can be useful for scatterplots that may contain many many data points.
For visualizing data in more than 2 dimensions, without resorting to 3-D animations (or glasses!), we can often combine the tools that we've already learned:
• Overlaid or multiple 2-D plots; conditioning plots (coplots): A conditioning plot, or coplot, shows the relationship between two variables as a third (or more) variable changes. For example, you might want to see how the relationship between air pollution levels and mortality changes with the season of the year. Here, air pollution and mortality are the two primary variables and season is the third variable varying in the background.
• Use color, size, shape to add dimensions: Plotting points with different colors or shapes is useful for indicating a third dimension, where different colors can indicate different categories or ranges of something. Plotting symbols with different sizes can also achieve the same effect when the third dimension is continuous.
• Spinning/interactive plots: Spinning plots can be used to simulate 3-D plots by allowing the user to essentially quickly cycle through many different 2-D projections so that the plot feels 3-D. These are sometimes helpful to capture unusual structure in the data, but I rarely use them.
• Actual 3-D plots (not that useful): Actual 3-D plots (for example, requiring 3-D glasses) are relatively few and far between and are generally impractical for communicating to a large audience. Of course, this may change in the future with improvements in technology...