stat quiz 2
Six ways to hone the message of data
Mapping versus setting; labels for clarity; the importance of scale; overplotting; choosing a theme; annotations
exploratory data analysis
asking questions of it (the data), probing it for structure, and seeing how it responds. the aim of this is to uncover the shape and structure of your data and to uncover unexpected features
combining geometries in R
put a + after one geom_xxx() command and then add the next one. each geometry gets drawn as its own layer on the same plot
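A quick sketch of layering two geometries. The msleep data set (it ships with ggplot2) and the choice of columns are just assumptions for illustration:

```r
library(ggplot2)

# msleep comes with ggplot2; vore = diet type, sleep_total = hours of sleep
p <- ggplot(msleep, aes(x = vore, y = sleep_total)) +
  geom_boxplot() +  # first geometry: one box per diet group
  geom_point()      # second geometry: the raw observations drawn on top
p
```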
%>%
pipe operator. it funnels the output of one function directly into the next function as its first argument. e.g: you can filter for bodyweight less than 12 and then pipe that filtered data frame into arrange() to sort it by bodyweight in descending order. this is really good because you don't have to save intermediate results and the code reads top to bottom in the order the steps happen
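The filter-then-arrange example might look like this. It assumes the msleep data set from ggplot2, where body weight is stored in the bodywt column; the name light_animals is made up for the example:

```r
library(dplyr)
library(ggplot2)  # only loaded here for the msleep data set

light_animals <- msleep %>%
  filter(bodywt < 12) %>%  # keep rows with body weight under 12
  arrange(desc(bodywt))    # sort what is left, heaviest first
light_animals
```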
size R
the bigness of the mark representing the observation
alpha R
the level of transparency of the color
sample variance
used to measure the spread of the data points in a given data set around the mean. it's the sum of the squared deviations from the mean, divided by n - 1.
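A small worked example with made-up numbers, comparing the by-hand calculation (divide by n - 1) with R's built-in var():

```r
x <- c(2, 4, 4, 6, 9)  # made-up data, mean is 5
n <- length(x)

# squared deviations from the mean: 9, 1, 1, 1, 16 (sum = 28)
s2_by_hand <- sum((x - mean(x))^2) / (n - 1)  # 28 / 4 = 7
s2_builtin <- var(x)                          # var() also divides by n - 1

s2_by_hand
s2_builtin
```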
common aesthetic attributes in R
x, y, color, alpha, size, shape, fill
== filter
"exactly equal to" - when used, it just gives you stuff which meets that exact criteria.
!= filter
"not equal to" - when used, it just gives you every single data point except for the ones that were asked to be filtered out.
logical operators R
&, |, %in%
what to include in a summary of a data set
- Qualities relevant to the question you're answering or claim you're making - Features that are aligned with the interest of your audience
do not include in a data set summary
- Qualities that are irrelevant, distracting, or deceptive - Replicated or assumed information
desiderata
- all information must be preserved; the most commonly used graphic that fulfills this is the dot plot. - Balance depiction of the general characteristics of the distribution with a fidelity to the individual observations. - allow for easy comparisons between groups - Synthesizes the magnitudes of all n observations. - As close as possible to all of the observations. - Is identical to as many observations as possible.
filter operators R
==, !=, <, >, <=, >=
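A few of these operators inside filter(), as a sketch. It assumes the msleep data set from ggplot2; the result names are made up:

```r
library(dplyr)
library(ggplot2)  # only loaded here for the msleep data set

herbivores     <- filter(msleep, vore == "herbi")   # exactly equal to
non_herbivores <- filter(msleep, vore != "herbi")   # not equal to
long_sleepers  <- filter(msleep, sleep_total >= 12) # at least 12 hours

# note: filter() drops rows where the condition comes out as NA,
# so animals with a missing vore show up in neither of the first two
```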
box plot
A graph that displays the highest and lowest quarters of data as whiskers, the middle two quarters of the data as a box, and the median. the "box" shows the bounds of the interquartile range and the line inside it is the median.
median
If we put the numbers in order from smallest to largest, then the number that is as close as possible to all observations will be the middle number, the (____)
Conditioning
It adds specificity to a claim and illuminates the relationship between variables.
filtering
The act of subsetting the rows of a data frame based on the values of one or more variables to extract the observations of interest. Filters are powerful because they can comb through the values of the data frame, which is where most of the information is.
alpha value command (alpha = )
This is a second way to deal with overplotting in R, besides jittering. Adding this inside a geom command such as geom_point() or geom_jitter() makes the points transparent, so spots where many points pile up look darker. Setting the number to 1 makes the points fully opaque, and setting the number to 0 makes them fully transparent.
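A minimal sketch of setting transparency. The msleep data set and the value 0.3 are assumptions for illustration:

```r
library(ggplot2)

p <- ggplot(msleep, aes(x = bodywt, y = sleep_total)) +
  geom_point(alpha = 0.3)  # 0 = invisible, 1 = solid; overlaps look darker
p
```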
mapping R
_________ that attribute in the plot to whatever data lives in the corresponding column in the data frame. When you do this it's always specific to the points in a data frame. you do this inside the aes() command, which can go in ggplot() (then it applies to every layer) or in a geom_xxx() (then it applies to that layer only)
Histogram
a bar graph depicting a frequency distribution. There are bars arranged along the x-axis according to their values with heights that correspond to the count of each value found in the data set, but this chart also involves aggregation that's determined by binwidth. use larger binwidths to see coarse structure in distribution.
violin plot
a graph that shows an approximation of the frequency distribution of a numerical variable in each group and its mirror image. it's really good for identifying differences between different groups.
dot plot
a one-dimensional scatter plot. Each observation shows up as a dot and its value corresponds to its location along the x-axis. Importantly, it fulfills our desiderata: given this graphic, one can recreate the original data perfectly. There was no information loss. Con: this type of graph becomes unwieldy with more data recorded
standard deviation
a quantity calculated to indicate the extent of deviation for a group as a whole. it's the square root of the sample variance, so it's in the same units as the data.
statistic
both an area of academic study and the object of that study. Any numerical summary of a data set - a mean or median, a count or proportion - is a ____
how to find center and spread
calculate a statistic (this is preferred to just eyeballing it)
Modality
captures the number of distinct peaks (or modes) that are present. multiple modes are often a hint that there is something more going on in the data.
logical vector
checks a condition and returns TRUE or FALSE for each element. e.g: with vector a <- c(1,2,3,4,5), the command a >= 5 returns FALSE FALSE FALSE FALSE TRUE, and class(a >= 5) returns "logical". These types of vectors work basically like numbers: TRUE counts as 1 and FALSE counts as 0.
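The example from this card, written out. Treating TRUE as 1 and FALSE as 0 is what makes sum() and mean() useful here:

```r
a <- c(1, 2, 3, 4, 5)

is_big <- a >= 5
is_big         # FALSE FALSE FALSE FALSE  TRUE
class(is_big)  # "logical"

sum(is_big)    # 1   (how many values are at least 5)
mean(a >= 3)   # 0.6 (proportion of values that are at least 3)
```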
summarize()
collapses that data frame down into a single row for each group (can be affected by the group_by command) and creates a new column for each new summary statistic
plot primary elements
data, aesthetic mapping, geometry
skew
describes the behavior of a distribution's tails: whether the right tail stretches out (right ___), the left tail stretches out (left ___), or if both tails are of similar length (symmetric).
theme R
ggplot has some built-in themes but you can also get more by installing extra packages. they change the colors and make the graphs look fun! e.g: theme_bw() makes it black and white.
color R
hue of the mark that represents the observation
%in%
it lets you look for multiple values at once instead of typing each one individually with |. e.g: to find rows with any of several names in the "name" column within the "msleep" data frame, you type this: filter(msleep, name _____ c("Little brown bat", "African elephant"))
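The same filter written both ways, as a sketch (msleep is the data set from ggplot2; the result names are made up):

```r
library(dplyr)
library(ggplot2)  # only loaded here for the msleep data set

wanted <- c("Little brown bat", "African elephant")

# one membership test...
with_in <- filter(msleep, name %in% wanted)

# ...instead of one == comparison per name joined by |
with_or <- filter(msleep, name == "Little brown bat" | name == "African elephant")
```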
words for distribution shape
modality, skew
geometry commands R
geom_point(), geom_bar(), geom_line(), geom_histogram(), geom_density(), geom_violin(), geom_dotplot(), geom_boxplot()
distribution most important characteristics
shape, center, spread
density plot
shows the shape of a variable's distribution by "smoothing out" its histogram to make a gentle curve. Besides the shift from bars to a smooth line, the ___ plot also changes the y-axis to feature a quantity called "density". We will return to define this term later in the course, but it's sufficient to know that the values on the y-axis of a density plot are rarely useful. The important information is relative: an area of the curve with twice the density of another area has roughly twice the number of observations. Tip the balance to feature what you find most interesting by adjusting the bandwidth (how much the curve is smoothed) of the ___ plot.
mean (arithmetic)
synthesizes the magnitudes by taking their sum, then keeps that sum from getting larger than any of the observations by dividing by the number of observations, "n." Con: this can be pulled strongly toward outliers.
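A tiny made-up example of the outlier problem: changing one value drags the mean a long way while the median stays put.

```r
x         <- c(1, 2, 3, 4, 5)
x_outlier <- c(1, 2, 3, 4, 50)  # same data, but the last value is an outlier

mean(x)            # 3
mean(x_outlier)    # 12  (sum 60 divided by n = 5)
median(x)          # 3
median(x_outlier)  # 3   (the middle number did not move)
```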
shape R
the (name for the figure with a certain amount of sides) of the mark representing the observation
fill R
the color of the inside of the representation of an observation
| logical operator
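the equivalent of saying "or" in a filter. it keeps a row if at least one of the conditions is true, and that includes rows where both are true (it's an inclusive or, not a strict either/or)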
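A quick check of how | behaves on two made-up logical vectors, next to & and xor() for comparison. Note that | is TRUE when both sides are TRUE:

```r
small  <- c(TRUE, TRUE,  FALSE, FALSE)  # made-up condition 1
sleepy <- c(TRUE, FALSE, TRUE,  FALSE)  # made-up condition 2

small | sleepy      # TRUE  TRUE  TRUE  FALSE  (at least one is true)
small & sleepy      # TRUE  FALSE FALSE FALSE  (both are true)
xor(small, sleepy)  # FALSE TRUE  TRUE  FALSE  (exactly one is true)
```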
geometry
the method of graphing our data. e.g: bar graph, line graph
mode
the most common observation result within a set of observation results. this is very useful for categorical data particularly
overplotting
the phenomenon of points obscuring one another because many values coincide
aesthetic mapping
the result of encoding variables in a data frame into visual variation in a plot. e.g: axis assignments, color, data location along the y-axis
scale commands R
they are xlim() and ylim(). these commands let you set the start and end points of the data you want to show in a chart. if you don't set these, then the software will automatically set the limits of your plot based on the range of the data.
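A sketch of setting both axis limits by hand. The msleep data set and the specific limits are assumptions for illustration:

```r
library(ggplot2)

p <- ggplot(msleep, aes(x = bodywt, y = sleep_total)) +
  geom_point() +
  xlim(0, 100) +  # only show body weights from 0 to 100
  ylim(0, 24)     # a day has 24 hours, so show the full possible range
p  # points outside the limits are dropped, with a warning
```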
size =
this command lets you change the size of a point in R. when you use it for mapping (inside aes(), with a column name), points get different sizes depending on the values in that column. when you use it for setting (outside aes(), with a number), it makes all the points stay at one certain size.
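The two uses of size = side by side, as a sketch. It assumes the msleep data set from ggplot2; the plot names and the value 3 are made up:

```r
library(ggplot2)

base <- ggplot(msleep, aes(x = bodywt, y = sleep_total))

# mapping: size is inside aes() and tied to a column,
# so each point's size depends on that animal's brain weight
# (rows with a missing brainwt are dropped with a warning when drawn)
mapped <- base + geom_point(aes(size = brainwt))

# setting: size is outside aes() and given a number,
# so every point gets the same size
fixed <- base + geom_point(size = 3)
```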
group_by()
this command lets you group the data by a specific column for viewing in a data frame. this is a really powerful function because it lets you change the behavior of downstream functions.
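A sketch of how group_by() changes what summarize() does downstream. It assumes the msleep data set from ggplot2; the column names mean_sleep and n are made up:

```r
library(dplyr)
library(ggplot2)  # only loaded here for the msleep data set

# without grouping: the whole data frame collapses to one row
overall <- msleep %>%
  summarize(mean_sleep = mean(sleep_total), n = n())

# with grouping: one row per diet type (missing vore forms its own group)
by_diet <- msleep %>%
  group_by(vore) %>%
  summarize(mean_sleep = mean(sleep_total), n = n())
```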
labs()
this command lets you relabel the x and y axes in a plot. you can also add titles and captions with title = , and caption = within this command.
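A sketch of relabeling a plot. It assumes the msleep data set from ggplot2; the label text is made up:

```r
library(ggplot2)

p <- ggplot(msleep, aes(x = bodywt, y = sleep_total)) +
  geom_point() +
  labs(
    x = "Body weight (kg)",
    y = "Total sleep (hours)",
    title = "Heavier mammals tend to sleep less",
    caption = "Data: msleep, from the ggplot2 package"
  )
p
```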
setting R
tweak the look of our aesthetic attributes in a way that doesn't have anything to do with the data in our data frame. you do this by giving a fixed value inside geom_xxx(), outside of aes(), e.g: geom_point(color = "blue")
jittering
unpile overplotted data by adding just a little bit of random noise to their x- and y-coordinate
annotations R
you can add this to a graph in R by adding an annotate() layer with +; it does not go inside aes(). use annotate("text") for a label and annotate("segment") for a line pointing to a certain part. then you give the location of both numerically in x and y axis units: x, y and label = for the text, and x, y, xend, yend for the segment.
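A sketch of a text label plus a pointer line. It assumes the msleep data set from ggplot2; the coordinates are picked by eye for illustration and would need adjusting for a real plot:

```r
library(ggplot2)

p <- ggplot(msleep, aes(x = bodywt, y = sleep_total)) +
  geom_point() +
  # the label, placed at a spot given in axis units
  annotate("text", x = 4000, y = 10, label = "African elephant") +
  # the line, running from (x, y) to (xend, yend)
  annotate("segment", x = 4500, y = 9, xend = 6500, yend = 4)
p
```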
jittering in R (the command)
you do this by replacing geom_point() with geom_jitter()
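A minimal sketch of the swap. It assumes the msleep data set from ggplot2; the width and height values are just illustrative:

```r
library(ggplot2)

p <- ggplot(msleep, aes(x = vore, y = sleep_total)) +
  # width/height control how much random noise is added in each direction;
  # height = 0 keeps the sleep values exactly where they are
  geom_jitter(width = 0.2, height = 0)
p
```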