Exploratory Data Analysis part 1


Give an example of output from the following R code: summary(pollution$pm25)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    3.38    8.55   10.00    9.84   11.40   18.40

Name 2 possible modeling strategies to debug an analysis that could be suggested by graphs in data analysis.

1) A linear model. 2) A nonlinear model.

Describe the basic cycle of producing/editing/loading code in the R programming language.

1) Edit your R code in the R editor and save the file every time you edit it. 2) If you want that code to be available in R, you have to use the source( ) function to source that file back into R. 3) You do not have to use a single file, you can add a new file, by selecting New Script and it will open in another window. You can save it to be a different file if you want. So that way you can separate code for different projects or different assignments.
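
The cycle above can be sketched as follows. This is a minimal illustration, not the course's own code: the file created with tempfile( ) stands in for a script saved from the R editor, and the function name greet is hypothetical. It shows that an edit to the file only takes effect in R after the file is sourced again.

```r
script_path <- tempfile(fileext = ".R")   # hypothetical script file

# First version of the code, saved to disk.
writeLines("greet <- function() 'version 1'", script_path)
source(script_path)
first <- greet()

# Edit and save again: the session still holds the old definition...
writeLines("greet <- function() 'version 2'", script_path)
stale <- greet()

# ...until the file is sourced again.
source(script_path)
fresh <- greet()

print(c(first, stale, fresh))
```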

In exploratory data analysis, list 5 types of one-dimensional summaries of data that you can use in the R programming language.

1) Five-number summary 2) Box plots (box and whisker plot) 3) Histograms 4) Density plots 5) Bar plots

Describe 2 different ways to resolve the problem of an error where it cannot find the file in the working directory.

1) If you know where this file is, you can move it to your working directory. 2) You can change your working directory to be wherever the file happens to be.
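
Both fixes can be tried out in a small sketch. The directory and file names below are assumptions made up for the example (temporary folders and a tiny mydata.csv), so nothing outside R's temporary directory is touched.

```r
# Hypothetical setup: a data file that lives outside the working directory.
data_dir <- tempfile("eda_data_")
dir.create(data_dir)
write.csv(data.frame(x = 1:3), file.path(data_dir, "mydata.csv"),
          row.names = FALSE)

work_dir <- tempfile("eda_work_")
dir.create(work_dir)
old_wd <- setwd(work_dir)        # setwd() returns the previous directory

found_before <- file.exists("mydata.csv")   # not in the working directory yet

# Option 1: copy (or move) the file into the working directory.
file.copy(file.path(data_dir, "mydata.csv"), "mydata.csv")
via_copy <- read.csv("mydata.csv")

# Option 2: change the working directory to where the file is.
setwd(data_dir)
via_setwd <- read.csv("mydata.csv")

setwd(old_wd)                    # restore the original working directory
```

Calling getwd( ) and dir( ) before read.csv( ) is a quick way to check which of the two situations you are in.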

What 3 broad steps typically follow exploratory data analysis?

1) Modelling 2) Prediction 3) Inference

List 5 characteristics of Exploratory Graphs.

1) They are made quickly. 2) A large number are made. 3) The goal is for personal understanding. 4) Axes/legends are generally cleaned up (later). 5) Color/size are primarily used for information.

Which 4 of the 5 reasons graphs are used in data analysis apply to exploratory data analysis?

1) To understand data properties. 2) To find basic patterns in data. 3) To suggest modeling strategies. 4) To "debug" analyses.

List 5 reasons graphs are used in data analysis.

1) To understand data properties. 2) To find basic patterns in data. 3) To suggest modeling strategies. 4) To "debug" analyses. 5) To communicate results.

List 3 steps in viewing and running specific functions from a saved R file containing multiple functions.

1) Use the source( ) function to load the saved R file. 2) Type ls( ) to view available functions 3) Call the function you are interested in by name.
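
The three steps can be sketched with a file that holds two functions. The temporary file below is an assumption standing in for a saved file such as Mycode.R; the two function bodies are the ones quoted elsewhere in this set.

```r
script_path <- tempfile(fileext = ".R")   # stands in for a saved .R file
writeLines(c(
  "myfunction <- function() {",
  "  x <- rnorm(100)",
  "  mean(x)",
  "}",
  "second <- function(x) {",
  "  x + rnorm(length(x))",
  "}"
), script_path)

source(script_path)        # 1) load the saved file
print(ls())                # 2) list the objects now in the workspace
value <- myfunction()      # 3) call a function by name
print(value)
```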

List 6 questions you want to ask yourself about whether your content is king in building analytic graphics from data.

1) What is the content that you are trying to present? 2) What is the story you are trying to tell? 3) What is the data that you have? 4) What is the best way to present that? 5) How are you going to present it? 6) What is it going to look like?

You have typed the following simple R function into the R text editor. Describe 2 ways you can load that into R or the R console. myfunction <- function( ) {x <- rnorm(100); mean(x)}

1) You can cut and paste the code from the text editor window into the console window and then type in the function name to call it and execute the function. 2) Save the code in the text editor and give it a file name with a .R extension. Then type the following command in the R console window to load the saved function: source("<file name with .R extension>"). After that, type the function name to call it.

In regard to Principle 4 of building analytic graphics from data, list 4 items of evidence that must be completely integrated.

1) words 2) numbers 3) images 4) diagrams

A data graphic should tell what?

A data graphic should tell a complete story that is credible.

The principle of showing comparisons should answer what question?

Always ask "Compared to What?"

Name a good book one can consult to learn about the general rules to follow when building analytic graphics from data.

Beautiful Evidence by Edward Tufte.

Quickly summarize the 6th basic principle of building analytic graphics from data.

Content is king - the story that you tell and the data that you use are the most important elements of any graphic.

In regard to working directories, describe a recommended practice.

Create a single directory or single folder where you can put all of your materials for the course and not have to worry about them being scattered all over the place. Any time you download something from the website or create a new file, it is probably best to store it all in one folder so that you don't have to be searching all over for it. That way, you can always set your working directory to be that directory and not have to worry about changing it.

Quickly summarize the 5th basic principle of building analytic graphics from data.

Describe and document the evidence - Always have sources and source code to lend credibility to your plots.

Why is exploratory data analysis (EDA) a key element of data science?

Exploratory data analysis (EDA) is a key element of data science because it allows you to develop a rough idea of what your data look like and what kinds of questions might be answered by them.

What is exploratory data analysis all about in broad terms?

Exploratory data analysis involves exploratory techniques for summarizing data.

Define exploratory graphs.

Exploratory graphs are graphs that you make more or less for yourself, so that you can look at the data and explore what is occurring in the data sets that you are looking at.

Exploratory techniques for summarizing data can help inform the development of what?

Exploratory techniques for summarizing data can help inform the development of more complex statistical models.

True or False. The goal with exploratory graphs is to give an idea of what the data looks like and the best possible presentation and appearance for the graph.

False. Exploratory graphs should give an idea of what the data look like, but typically there is NOT a need to worry about the appearance of the graph or how it is presented. Those are addressed later on.

True or False. Data science is just a collection of tools.

False. It is important to remember that data science is not just a collection of tools. It requires a person to apply those tools in a smart way to produce results that are useful to people.

True or False. You can only have 1 function per saved file name in the R text editor.

False. You can have multiple functions saved in one R file.

True or False. It is okay to only use the tools you have to present data.

False. You should not let the tool drive the analysis. R gives you the flexibility to make a plot that you want to make, and not just let the tools do the thinking.

Describe a problem if you do not properly set your working directory.

For example if you do something like read.csv("mydata.csv"), if the file is not in the working directory, you will get an error because it cannot find the file in the working directory.

Examine the following code and explain what is happening: pollution <- read.csv("data/avgpm25.csv", colClasses = c("numeric", "character", "factor", "numeric", "numeric")); head(pollution)

Here an R object called pollution is created using the read.csv( ) function. The argument of "data/avgpm25.csv" specifies the filepath to the file with data to be read in. The colClasses = argument is a vector that specifies the classes of the respective columns whether numeric, character or factor. The head( ) function instructs to print the first 6 lines of the pollution object or the file that has been read in.
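
A small self-contained sketch of the same call follows. The rows are made up, and every column name other than pm25 is an assumption chosen only to line up with the five colClasses entries; the real data/avgpm25.csv is not reproduced here.

```r
# Made-up rows standing in for data/avgpm25.csv (values are illustrative only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "pm25,fips,region,longitude,latitude",
  "9.8,01003,east,-87.7,30.6",
  "12.4,06037,west,-118.2,34.1",
  "10.1,17031,east,-87.6,41.8"
), csv_path)

pollution <- read.csv(csv_path,
                      colClasses = c("numeric", "character", "factor",
                                     "numeric", "numeric"))

print(head(pollution))            # first rows (up to 6)
print(sapply(pollution, class))   # confirm the class of each column
```

Reading an identifier column as "character" keeps leading zeros (here "01003") that would be lost if the column were read as a number.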

Explain the meaning of the following R code: summary(pollution$pm25)

Here the summary( ) function in R is applied to the pm25 column of the data frame named pollution. The $ or dollar sign is used to extract a single column from the data frame.

Explain the meaning of the following R code: boxplot(pollution$pm25, col = "blue")

Here, a box plot (also called a box-and-whisker plot) of the pm25 variable of the data frame named pollution is produced using the boxplot( ) function. The box, which spans the 1st to the 3rd quartile, is colored blue. The line inside the box marks the median.

Explain the meaning of the following R code: boxplot(pollution$pm25, col = "blue"); abline(h = 12)

Here, a box plot (also called a box-and-whisker plot) of the pm25 variable of the data frame named pollution is produced using the boxplot( ) function. The box, which spans the 1st to the 3rd quartile, is colored blue. The line inside the box marks the median. There is an overlaid feature on the plot, which is a horizontal line at the level of 12.
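
A minimal sketch of the box plot with its overlay, assuming simulated values in place of pollution$pm25 (the vector pm25 below is synthetic, not the course data). boxplot( ) also returns the numbers it draws, which helps when checking what the box and the line inside it represent.

```r
set.seed(1)                                # reproducible synthetic data
pm25 <- rnorm(500, mean = 10, sd = 2.5)    # stand-in for pollution$pm25

bp <- boxplot(pm25, col = "blue")   # draws the plot, returns its statistics
abline(h = 12)                      # horizontal reference line at 12

# Rows of bp$stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$stats)
```

The hinges are close to, but not always identical with, the quartiles reported by summary( ).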

Explain the meaning of the following R code: hist(pollution$pm25, col = "green")

Here, a histogram is produced that provides detail about the shape of the distribution of the variable pm25 of the data file named pollution. The histogram is colored green as specified in the argument of the hist( ) function that is used to create the histogram.

Explain the meaning of the following R code: hist(pollution$pm25, col = "green"); abline(v = 12, lwd = 2); abline(v = median(pollution$pm25), col = "magenta", lwd = 4)

Here, a histogram is produced that provides detail about the shape of the distribution of the variable pm25 of the data frame named pollution. The histogram is colored green as specified in the argument of the hist( ) function that is used to create the histogram. A black (default) vertical line has been overlaid at 12. A 2nd vertical line, colored magenta, has been overlaid at the median. The abline( ) function adds one or more straight lines through the current plot. The v = argument specifies a vertical line and its position on the plot. The lwd = argument specifies the line width. An lwd of 2 is twice as wide and an lwd of 4 is four times as wide as the default.

Explain the meaning of the following R code: hist(pollution$pm25, col = "green"); rug(pollution$pm25)

Here, a histogram is produced using the hist( ) function that provides detail about the shape of the distribution of the variable pm25 of the data file named pollution. The histogram is colored green as specified in the argument of the function. The 2nd line of code puts a rug underneath the histogram.

Explain the meaning of the following R code: hist(pollution$pm25, col = "green", breaks = 100); rug(pollution$pm25)

Here, a histogram is produced using the hist( ) function that provides detail about the shape of the distribution of the variable pm25 of the data frame named pollution. The histogram is colored green as specified in the argument of the function. The breaks = argument is given here as a single number, the suggested number of cells (100) for the histogram; R treats it as a suggestion and may pick a nearby number that gives round break points. The 2nd line of code puts a rug underneath the histogram.

What is important to include when describing and documenting the evidence for building analytic graphics from data?

If you are going to be making a plot with a system like R, it is important to preserve the computer code that made the plot. The idea is that you want to lend some credibility to the evidence that you present. Sources of where the data came from is very important and how you made the plot is also important.

Explain the trade-offs when it comes to selecting the number of breaks for a histogram.

If you select fewer breaks, then the bars are fewer and wider. The result is a smoother histogram, but it is harder to see the shape of the distribution. With more breaks, there are more bars and they are narrower. The result is a rougher, noisier and spiky-looking histogram. You have to balance the noisiness of a large number of breaks against the loss of shape with fewer breaks. It is useful to play around with the breaks argument to get the histogram that you like best.
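
The trade-off can be seen by building the same histogram with few and with many breaks. This is a sketch on simulated values (the vector pm25 is synthetic and the choice of 5 and 100 breaks is arbitrary); plot = FALSE is used first so the bins can be counted without drawing.

```r
set.seed(1)
pm25 <- rnorm(500, mean = 10, sd = 2.5)   # stand-in for pollution$pm25

coarse <- hist(pm25, breaks = 5,   plot = FALSE)
fine   <- hist(pm25, breaks = 100, plot = FALSE)

print(length(coarse$counts))   # few wide bars
print(length(fine$counts))     # many narrow bars

# Draw both side by side to compare smoothness against detail.
op <- par(mfrow = c(1, 2))
hist(pm25, col = "green", breaks = 5,   main = "breaks = 5")
hist(pm25, col = "green", breaks = 100, main = "breaks = 100")
par(op)
```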

Describe how to go about loading a new text editor in R.

In the standard R program, go to File and then select New Script. This will give you a blank window, that you can use to write your code.

Quickly summarize the 4th basic principle of building analytic graphics from data.

Integrate multiple modes of evidence - use things like tables and plots and texts all together. You do not have to keep them separate, if you do not want to.

What does multivariate mean?

Multivariate means more than 2 variables.

Describe a key advantage of R when it comes to not letting the tools drive the kinds of plots that you make and rather make the plot you want to make.

One of the nice advantages of a system like R is that the tools are very flexible in R. You can make all kinds of customized plots to show the data and to integrate different modes of evidence.

What is a key feature of the histogram plot in R?

One thing you can do with a histogram is change the breaks. This is essentially the number of bars that are used to construct the histogram.

Why are plots useful in exploratory data analysis?

Plots are one of the ways that you can explore data without having a precise or specific model about what is supposed to be going on in there.

In simple words explain the 2nd principle to follow when building analytic graphics from data.

Principle 2: Show causality, mechanism, explanation, systematic structure.

Quickly summarize the 2nd basic principle of building analytic graphics from data.

Show causality, mechanism, explanation - how the system or world is working at least according to your ideas.

Quickly summarize the 1st basic principle of building analytic graphics from data.

Show comparisons - Always show something relative to something else.

Quickly summarize the 3rd basic principle of building analytic graphics from data.

Show multivariate data - always try to show more than 2 variables, because the world is complex and involves many variables.

What is Simpson's Paradox?

Simpson's paradox is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined. This result is often encountered in social-science and medical-science statistics. The variable that defines the groups confounds the relationship seen in the combined data.
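
A small constructed example makes the reversal concrete. The data below are invented purely for illustration: two groups that each rise with slope +1, where the group with larger x values has a much lower baseline.

```r
# Synthetic data: each group has a rising trend, but different baselines.
x_a <- 1:10
y_a <- 20 + x_a        # group A: slope +1, high baseline
x_b <- 11:20
y_b <- x_b             # group B: slope +1, low baseline

d <- data.frame(
  x = c(x_a, x_b),
  y = c(y_a, y_b),
  group = rep(c("A", "B"), each = 10)
)

slope_a   <- coef(lm(y ~ x, data = d[d$group == "A", ]))[["x"]]
slope_b   <- coef(lm(y ~ x, data = d[d$group == "B", ]))[["x"]]
slope_all <- coef(lm(y ~ x, data = d))[["x"]]   # groups combined

print(c(A = slope_a, B = slope_b, combined = slope_all))
```

Within each group the fitted slope is positive, while the slope fitted to the combined data is negative, because group membership is tied to both x and y.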

Describe the 2nd principle to follow when building analytic graphics from data.

The 2nd basic principle, is to show causality or mechanism. Make an explanation to show what is going on or show some systematic structure. The use of the word causality here is not supposed to be formal but rather to show how you believe the world works. You need to be able to show how you believe the system is operating. What is the causal framework for thinking about the question that you are interested in? You can show a plot, that might corroborate the evidence of how you believe things work.

Describe the 4th principle to follow when building analytic graphics from data.

The 4th principle is to integrate the evidence that you have. You want to use as many different modes of displaying evidence as you can. Avoid showing only a plot because your tool makes plots, or only a table because you only have the ability to make a table. You should be able to combine different modes of evidence into a single presentation and make your graphic, or whatever display you are making, as information rich as possible.

Describe the 5th principle to follow when building analytic graphics from data.

The 5th principle is to describe and document the evidence with appropriate labels, scales, sources, etc.

Describe the 1st principle to follow when building analytic graphics from data.

The first principle is to show comparisons. This is a very basic idea in all of science. The basic idea is that the evidence for a hypothesis or an idea about the world is always going to be relative to another hypothesis. Evidence is always relative. If you are considering Hypothesis A, there has to be some alternative hypothesis that you are going to compare it to. Whenever you hear a statement or a summary of evidence based on data, you should always be asking the question "Compared to what?"

Describe the five-number, one-dimensional summary of data in R.

The five number summary is not a plot. It is a summary of some particular aspects of a given variable. The summary function in R is actually a six number summary that includes the mean plus the traditional five number summary values of the minimum, the first quartile, the median, the third quartile, and the maximum.
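
The difference between the two summaries is easy to check on a small vector. The values below are made up for illustration.

```r
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15)   # made-up values

five <- fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum
six  <- summary(x)   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.

print(five)
print(six)
```

fivenum( ) reports hinges, which can differ slightly from the quartiles that summary( ) reports, depending on the sample size.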

Explain why it is desirable to pair a rug with a histogram.

The histogram is a summary. The rug underneath provides some fine detail. It lets you see where the outliers are, and where the bulk of the data is located.

What is the idea behind the 4th principle to follow when building analytic graphics from data?

The idea is not to let the tools that you use drive the kinds of plots that you make. You should make the plot that you want to make, and not just let the tools do the thinking.

What is the purpose of exploratory data analysis?

The purpose of exploratory data analysis is to look at your data, get a sense of what is happening and what are the kinds of plots that you want to make.

Why do you need to show multivariate data?

The reason is that the world of data is inherently multivariate. There are lots of things going on all the time. If you just plot 2 variables, or maybe even 3 variables, it is not going to show the real picture of what is happening in the world. If you can integrate and put a lot of data on a plot, then you will be able to tell a much richer story.

Why is it important to know how to set your working directory?

It is important to know how to set your working directory because when you read data or write things out using functions like read.csv( ) or write.csv( ), files are read from or written to your working directory by default.

Explain what the rug( ) function does in combination with the hist( ) function.

The rug plots all of the points in a dataset along underneath the histogram. The rug lets you see exactly where the points are that make up the histogram.
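
A short sketch of the pairing, assuming simulated values plus one deliberately high point (the vector pm25 is synthetic, not the course data). The rug tick far to the right shows an observation that the histogram bars alone make easy to miss.

```r
set.seed(1)
pm25 <- c(rnorm(200, mean = 10, sd = 2), 18.4)   # synthetic values plus one high point

hist(pm25, col = "green")
rug(pm25)   # one tick per observation along the x-axis
```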

You have saved the following simple R function into the R text editor under the name Mycode.R: second <- function(x) {x + rnorm(length(x))} Explain what is happening in the following R code when entered into the console window of R: source("Mycode.R"); ls( ); second(4)

The source( ) command loads the Mycode.R file of R functions. The ls( ) command lists the objects in your workspace, including the functions loaded from Mycode.R. The second( ) function is called with an argument of 4, and it autoprints one value: 4 plus one random draw from a standard normal distribution.

You have saved the following simple R function into the R text editor under the name Mycode.R: second <- function(x) {x + rnorm(length(x))} Explain what is happening in the following R code when entered into the console window of R: source("Mycode.R"); ls( ); second(4:10)

The source( ) command loads the Mycode.R file of R functions. The ls( ) command lists the objects in your workspace, including the functions loaded from Mycode.R. The second( ) function is called with an argument of 4 through 10, and it autoprints 7 values, one for each element of 4:10, each with its own random normal draw added.
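
The vectorized behavior can be checked directly. In this sketch the function is defined inline instead of being sourced from Mycode.R, which is an assumption made only to keep the example self-contained.

```r
second <- function(x) {
  x + rnorm(length(x))   # add one random normal draw to each element of x
}

one  <- second(4)      # a single value near 4
many <- second(4:10)   # one value per element of 4:10

print(length(4:10))    # 7 elements: 4, 5, 6, 7, 8, 9, 10
print(length(many))    # the result has the same length as the input
```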

Describe the 3rd principle to follow when building analytic graphics from data.

The third principle is to show multivariate data. The basic rule is to show as much data on a single plot as you can. It is important to show as many variables as is reasonable at a given time, so that you can get a clear picture of the relationships in your data.

What is the underlying question essentially of all exploratory data analysis?

The underlying question essentially of all exploratory data analysis: What does the data look like?

Describe the 6th principle to follow when building analytic graphics from data.

The very last principle of course is that, content is king. If you do not have an interesting story to tell, then there is no amount of presentation that will make it interesting. When you are making plots, when you are making figures, and when you are making graphs, the first thing you want to ask is what is the content that you are trying to present? If you do not have very good content, then there is really not much you are going to be able to do beyond that.

What is a working directory?

The working directory is where R finds all of its files for reading and writing on your computer.

What can you learn by showing multivariate data?

There are other variables that may be of interest and may be part of the system, and may confound this relationship.

What is an approach to learning about building analytic graphics from data?

There are some general rules that one can follow when they are building analytic graphics from data. These rules are quite useful when thinking through building data graphics and they apply to many different situations

Why is there a need to produce a large number of exploratory graphs?

There is a tendency to produce a large number of exploratory graphs because of the need to look at different aspects of the data. You have to look at lots of variables and you often have to go through them one at a time.

What is meant by the need to "escape flatland"?

There are lots of things going on all the time. If you just plot 2 variables, or maybe even 3 variables, it is not going to show the real picture of what is happening in the world.

When are exploratory techniques for summarizing data typically applied?

These techniques are typically applied before formal modeling commences.

Describe the tendency of exploratory graphs to be made quickly.

They tend to be made very quickly. They are made on the fly as you are looking through the data.

Why would you want to have an overlay feature on a plot?

This is useful for pointing out specific features on the plot.

True or False. Data graphics should make use of many modes of data presentation

True

True or False. Exploratory techniques for summarizing data are important for eliminating or sharpening potential hypotheses about the world that can be addressed by the data.

True

True or False. For exploratory graphs, color or size are primarily used to separate information. In a different setting, like a presentation or a talk, you might think a little bit more carefully about color and size.

True

True or False. It is important to show as many variables as is reasonable at a given time, so that you can get a clear picture of the relationships in your data.

True

True or False. There are many other text editors that you might see on the web that you can download, and those are fine to use, but they are not really necessary. The text editor that comes with R should be sufficient for this course.

True

True or False. It is always important to show a comparison in a plot so that you have a basis for comparing what you are showing and can compare the evidence between 2 different hypotheses.

True

The 2nd principle of showing causality should answer what question?

What is your causal framework for thinking about a question?

Explain how graphs in data analysis can be used to debug analyses.

When you start doing an analysis, graphs in data analysis can help you figure out what is going wrong.

Explain the goal of personal understanding for exploratory graphs.

With personal understanding of the dataset, you get a sense of what the data look like. What are the properties? What are the problems? What are the things that need to be followed up? This is the goal for making exploratory graphs. Many of the things that you typically worry about later on, such as the appearance of a graph or how it is presented, you do not worry about now.

What is it that you are trying to accomplish when you are building analytic graphics from data?

You are trying to tell a story about what is happening in the data.

What are possible next steps to confirm a hypothesis after you have produced a plot that corroborates a causality or mechanism to show what is going on or show some systematic structure?

You can confirm the hypothesis by doing a little bit more investigation and perhaps more experimentation.

Using the R Editor, why may it be a good idea to have multiple saved files of R code?

You do not have to use a single file. You can start a New Script in the R Editor, and it will open in another window; you can save it as a different file if you want. That way you can separate code for different projects or different assignments. If you close a file, you can always open it again with the Open button, and all of your code will be right there.

Describe what precedes exploratory data analysis.

You get data from the internet or through various APIs and the data goes through a variety of steps to process the data to get to the data set.

Give an example of R code that gives you a box plot in R.

boxplot(pollution$pm25, col = "blue")

Give an example of R code that overlays a feature on a box plot (box and whiskers plot).

boxplot(pollution$pm25, col = "blue"); abline(h = 12)

You have set your working directory in the R console window. Give the command to view directory contents.

dir( )

Give the R command to view the current working directory.

getwd( )

Give an example of R code that gives you a histogram in R.

hist(pollution$pm25, col = "green")

Give an example of R code that overlays a feature on a histogram.

hist(pollution$pm25, col = "green"); abline(v = 12, lwd = 2); abline(v = median(pollution$pm25), col = "magenta", lwd = 4)

Give an example of R code that gives you a histogram in R with a feature that helps you get a better handle on where the data is along different points on the histogram.

hist(pollution$pm25, col = "green"); rug(pollution$pm25)

Give an example of R code that gives you a histogram in R with a feature that helps you get a better handle on where the data is along different points on the histogram. Include an adjustment on the number of breaks or the number of bars used to construct the histogram.

hist(pollution$pm25, col = "green", breaks = 100); rug(pollution$pm25)

Give the command in R for viewing a list of R objects or loaded functions you are currently working with.

ls( )

The real world is ___________.

multivariate

Analytical presentations ultimately stand or fall depending on the __________, __________, and ___________ of their content.

quality, relevance, integrity

Evidence for a hypothesis is always ___________ to another competing hypothesis.

relative

Give an example of R code that gives you the five-number summary or one-dimensional, summary values in R.

summary(<data frame name>$<variable name>)

Provide the address for a highly recommended web site that provides useful information on the principles of building analytic graphics from data.

www.edwardtufte.com

