stat quiz 2

Six ways to hone the message of data

mapping versus setting; labels for clarity; the importance of scale; overplotting; choosing a theme; annotations

exploratory data analysis

asking questions of it (the data), probing it for structure, and seeing how it responds. the aim of this is to uncover the shape and structure of your data and to uncover unexpected features

combining geometries in R

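add a + after the first geom and then the next geom function, e.g. ggplot(...) + geom_point() + geom_line(). each geom is drawn as a layer on top of the previous one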

%>%

pipe operator. it passes the output of one function directly into the next. e.g: you can filter for body weight less than 12 and then arrange the filtered rows in descending order of body weight. this is efficient because you don't have to save intermediate results
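
A minimal sketch of the filter-then-arrange example described above. It assumes the msleep data frame that ships with ggplot2 and its bodywt column (body weight in kg); the name light_mammals is made up for illustration.

```r
library(dplyr)
library(ggplot2)  # loaded only because it provides the msleep data frame

# Keep mammals with body weight under 12, then sort heaviest first.
# (bodywt is assumed to be the body weight column the card refers to.)
light_mammals <- msleep %>%
  filter(bodywt < 12) %>%
  arrange(desc(bodywt))

head(light_mammals)
```

Without the pipe the same thing would be written inside-out as arrange(filter(msleep, bodywt < 12), desc(bodywt)), which is harder to read.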

size R

the bigness of the mark representing the observation

alpha R

the level of transparency of the color

sample variance

used to measure the spread of the data points in a given data set around the mean.

common aesthetic attributes in R

x, y, color, alpha, size, shape, fill

== filter

"exactly equal to" - when used, it keeps only the rows whose value matches that exact criterion.

!= filter

"not equal to" - when used, it keeps every row except the ones whose value matches what you asked to filter out.

logical operators R

&, |, %in%

what to include in a summary of a data set

- Qualities relevant to the question you're answering or claim you're making - Features that are aligned with the interest of your audience

do not include in a data set summary

- Qualities that are irrelevant, distracting, or deceptive - Replicated or assumed information

desiderata

- all information must be preserved; the most commonly used graphic that fulfills this is the dot plot. - balances depiction of the general characteristics of the distribution with a fidelity to the individual observations. - allows for easy comparisons between groups. - synthesizes the magnitudes of all n observations. - is as close as possible to all of the observations. - is identical to as many observations as possible.

filter operators R

==, !=, <, >, <=, >=

box plot

A graph that displays the highest and lowest quarters of data as whiskers, the middle two quarters of the data as a box, and the median. the "box" shows the bounds of the interquartile range and the line inside it is the median.

median

If we put the numbers in order from smallest to largest, then the number that is as close as possible to all observations will be the middle number, the (____)

Conditioning

It adds specificity to a claim and illuminates the relationship between variables.

filtering

The act of subsetting the rows of a data frame based on the values of one or more variables to extract the observations of interest. Filters are powerful because they can comb through the values of the data frame, which is where most of the information is.

alpha value command (alpha = )

This is a second way to deal with overplotting in R, alongside jittering. Adding alpha = inside a geom such as geom_point() or geom_jitter() makes the points transparent, so stacked points show up darker. Setting the number to 1 makes the points fully opaque, and setting it to 0 makes them fully transparent.

mapping R

_________ that attribute in the plot to whatever data lives in the corresponding column in the data frame. the result is always specific to the individual observations in the data frame. you do this inside the aes() command, most often in ggplot() (aes() can also go inside a geom_xxx() layer)

Histogram

a bar graph depicting a frequency distribution. There are bars arranged along the x-axis according to their values with heights that correspond to the count of each value found in the data set, but this chart also involves aggregation that's determined by binwidth. use larger binwidths to see coarse structure in distribution.

violin plot

a graph that shows an approximation of the frequency distribution of a numerical variable in each group and its mirror image. it's really good for identifying differences between different groups.

dot plot

a one-dimensional scatter plot. Each observation shows up as a dot and its value corresponds to its location along the x-axis. Importantly, it fulfills our desiderata: given this graphic, one can recreate the original data perfectly. There was no information loss. Con: this type of graph becomes unwieldy with more data recorded

standard deviation

the square root of the variance. it measures how far observations typically fall from the mean, in the same units as the data.

statistic

both an area of academic study and the object of that study. Any numerical summary of a data set - a mean or median, a count or proportion - is a ____

how to find center and spread

calculate a statistic (this is preferred to just eyeballing it)

Modality

captures the number of distinct peaks (or modes) that are present. multiple modes are often a hint that there is something more going on in the data.

logical vector

checks a condition and returns TRUE or FALSE for each element. e.g: with vector a <- c(1,2,3,4,5), the command a >= 5 returns FALSE FALSE FALSE FALSE TRUE, and class(a >= 5) returns "logical". these vectors work basically like numbers: TRUE counts as 1 and FALSE counts as 0.
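
A small base-R sketch of a logical vector, using the same numbers as the card. The variable name is_big is made up for illustration.

```r
a <- c(1, 2, 3, 4, 5)

# The comparison is checked element by element.
is_big <- a >= 5
is_big          # FALSE FALSE FALSE FALSE  TRUE
class(is_big)   # "logical"

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
# and mean() gives the proportion of matches.
sum(is_big)     # 1
mean(is_big)    # 0.2
```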

summarize()

collapses that data frame down into a single row for each group (can be affected by the group_by command) and creates a new column for each new summary statistic
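
A minimal sketch of summarize() after group_by(), assuming the msleep data frame from ggplot2 and its vore (diet type) and sleep_total columns. The output column names n, mean_sleep, and median_sleep are made up for illustration.

```r
library(dplyr)
library(ggplot2)  # loaded only because it provides the msleep data frame

# One row per diet group, one new column per summary statistic.
sleep_by_diet <- msleep %>%
  group_by(vore) %>%
  summarize(
    n            = n(),                 # number of animals in the group
    mean_sleep   = mean(sleep_total),   # center: mean
    median_sleep = median(sleep_total)  # center: median
  )

sleep_by_diet
```

Dropping the group_by() line would collapse the whole data frame into a single row instead.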

plot primary elements

data, aesthetic mapping, geometry

skew

describes the behavior of a distribution's tails: whether the right tail stretches out (right ___), the left tail stretches out (left ___), or if both tails are of similar length (symmetric).

theme R

ggplot has some built-in themes, and you can also install more from add-on packages. they change the overall look of the graph (background, grid lines, fonts). e.g: theme_bw() gives a white background with a black border.

color R

hue of the mark that represents the observation

%in%

it lets you match against several values at once instead of typing each comparison individually and joining them with |. e.g: to find the rows with either of two names in the "name" column of the "msleep" data frame, you type this: filter(msleep, name _____ c("Little brown bat", "African elephant"))
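
A short sketch of the card's example with the blank filled in, next to the longer version written with |. It assumes the msleep data frame from ggplot2; the names two_animals and same_rows are made up for illustration.

```r
library(dplyr)
library(ggplot2)  # loaded only because it provides the msleep data frame

# %in% checks each name against the whole set of wanted values.
two_animals <- filter(msleep, name %in% c("Little brown bat", "African elephant"))

# The same rows, written out with one == comparison per value.
same_rows <- filter(
  msleep,
  name == "Little brown bat" | name == "African elephant"
)

two_animals$name
```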

words for distribution shape

modality, skew

geometry commands R

geom_point(), geom_bar(), geom_line(), geom_histogram(), geom_density(), geom_violin(), geom_dotplot(), geom_boxplot()

distribution most important characteristics

shape, center, spread

density plot

shows the shape of a variable's distribution by "smoothing out" its histogram to make a gentle curve. Besides the shift from bars to a smooth line, the ___ plot also changes the y-axis to feature a quantity called "density". We will return to define this term later in the course, but it's sufficient to know that the values on the y-axis of a density plot are rarely useful. The important information is relative: an area of the curve with twice the density of another area has roughly twice the number of observations. Tip the balance to feature what you find most interesting by adjusting the bandwidth (how much the curve is smoothed) of the ___ plot.

mean (arithmetic)

synthesizes the magnitudes by taking their sum, then keeps that sum from getting larger than any of the observations by dividing by the number of observations "n." Con: it is sensitive to outliers, so a single extreme value can pull it far from most of the data.
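
A small base-R sketch of the center and spread statistics from these cards, and of how one outlier moves the mean but not the median. The numbers are invented for illustration.

```r
x <- c(2, 3, 4, 5, 6)

mean_x   <- mean(x)    # 4    (sum of 20 divided by n = 5)
median_x <- median(x)  # 4    (the middle value once sorted)
var_x    <- var(x)     # 2.5  (sample variance: squared deviations sum to 10, divided by n - 1 = 4)
sd_x     <- sd(x)      # about 1.58, the square root of the variance

# Replace the largest value with an outlier.
x_out <- c(2, 3, 4, 5, 100)

mean_out   <- mean(x_out)    # 22.8, pulled toward the outlier
median_out <- median(x_out)  # 4, unchanged
```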

shape R

the geometric form (e.g. circle, triangle, square) of the mark representing the observation

fill R

the color of the inside of the representation of an observation

| logical operator

the equivalent of saying "or" in a filter. a row is kept if either condition is true, including rows where both are true

geometry

the method of graphing our data. e.g: bar graph, line graph

mode

the most common observation result within a set of observation results. this is very useful for categorical data particularly

overplotting

the phenomenon of points obscuring one another because many values coincide

aesthetic mapping

the result of encoding variables in a data frame into visual variation in a plot. e.g: axis assignments, color, data location along the y-axis

scale commands R

they are xlim() and ylim(). these commands let you set the start and end points of the data you want to show in a chart. if you don't set these, then the software will automatically set the limits of your plot based on the range of the data.

size =

this argument lets you change the size of the points in R. when you use it for mapping (inside aes()), points get different sizes according to the values in a column. when you use it for setting (outside aes()), all the points stay at one fixed size, and you enter a number to choose that size.

group_by()

this command lets you group the data by a specific column for viewing in a data frame. this is a really powerful function because it lets you change the behavior of downstream functions.

labs()

this command lets you relabel the x and y axes in a plot. you can also add titles and captions with title = , and caption = within this command.

setting R

tweak the look of our aesthetic attributes in a way that doesn't have anything to do with the data in our data frame. you do this inside a geom_xxx() layer, outside of aes()
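
A minimal sketch contrasting mapping and setting, assuming the msleep data frame from ggplot2 and its bodywt, sleep_total, and vore columns. The plot names and the color "steelblue" are arbitrary choices for illustration.

```r
library(ggplot2)

# Mapping: color sits inside aes(), so it varies with the vore column
# and ggplot adds a legend.
p_mapped <- ggplot(msleep, aes(x = bodywt, y = sleep_total, color = vore)) +
  geom_point()

# Setting: color, size, and alpha sit outside aes(), so every point
# gets the same fixed look regardless of the data.
p_set <- ggplot(msleep, aes(x = bodywt, y = sleep_total)) +
  geom_point(color = "steelblue", size = 3, alpha = 0.5)

p_mapped
p_set
```

A common slip is writing aes(color = "steelblue"): that maps color to a constant string instead of setting it, so the points do not come out steel blue.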

jittering

unpile overplotted data by adding just a little bit of random noise to their x- and y-coordinate

annotations R

you can add these to a graph in R by adding an annotate() layer with + (it does not go inside aes()): annotate("text") adds a text label, and annotate("segment") adds a line pointing to a certain part of the plot. you place both by giving their coordinates on the x and y axes (x and y for text; x, xend, y, yend for a segment), and the text itself goes in label = .
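
A sketch of a text annotation with a pointer line, combined with labs() and a theme. It assumes the msleep data frame from ggplot2; the offsets used to place the label, the label wording, and the axis titles are arbitrary choices for illustration.

```r
library(ggplot2)

# Look up the point we want to call out so the coordinates come from the data.
elephant <- msleep[msleep$name == "African elephant", ]

p_annotated <- ggplot(msleep, aes(x = bodywt, y = sleep_total)) +
  geom_point() +
  # Text label placed a little above and to the left of the point.
  annotate("text",
           x = elephant$bodywt - 2000, y = elephant$sleep_total + 4,
           label = "African elephant") +
  # Line segment running from just under the label to the point itself.
  annotate("segment",
           x = elephant$bodywt - 2000, xend = elephant$bodywt,
           y = elephant$sleep_total + 3.5, yend = elephant$sleep_total) +
  labs(x = "Body weight (kg)", y = "Total sleep (hours)",
       title = "Sleep versus body weight") +
  theme_bw()

p_annotated
```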

jittering in R (the command)

you do this by replacing geom_point() with geom_jitter()
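
A minimal sketch of both overplotting fixes together, assuming the msleep data frame from ggplot2. The width, height, and alpha values are arbitrary choices for illustration.

```r
library(ggplot2)

# Overplotted version: points in each diet group stack in a single column.
p_points <- ggplot(msleep, aes(x = vore, y = sleep_total)) +
  geom_point()

# Jittered version: a little random horizontal noise spreads the points out,
# and alpha makes any remaining overlaps show up darker.
# height = 0 leaves the sleep values themselves untouched.
p_jitter <- ggplot(msleep, aes(x = vore, y = sleep_total)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.5)

p_points
p_jitter
```

Because the noise is random, the jittered plot looks slightly different each time it is drawn unless a seed is set first with set.seed().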

