math 140 stats unit one exam
how to tell shape of a distribution in a box plot
- bigger plot on right side= skewed right since lower half of the data has less variability than upper half -center described by the median, spread described by the IQR
where do probability distributions come from
- can come from observed data - then we calculate the relative frequency of each outcome, which represents the empirical probability for that outcome
center of the distribution
- can think of as a typical value and choose a single value of the variable to represent the entire group
other techniques to mitigate effects of confounding variables
- control group (provides a baseline for comparison since it does not receive treatment) -with human participants, use of a control group may not be enough to establish whether a treatment really has an effect because of the placebo effect
random assignment
- controls the effects of confounding variables that a researcher cannot control directly or that are difficult to identify in advance -assign children to random treatment groups - goal of random assignment is to create similar groups with respect to age, weight, and other characteristics that may be a part of confounding variables -if random assignment works, average age for each treatment group should be about equal -goal is to create similar treatment groups because then any differences in the response variable are due to treatments -often make random assignments by flipping a coin (each participant has an equal chance of receiving any of the treatment options)
first step in data analysis
- create a graph of the distribution of the variable - in a graph that summarizes the distribution, can see the possible values of the variable and the number of individuals with each variable value or interval of values
analyze distribution of a quantitative variable
- describe overall pattern of data (shape, center, spread) and any deviations (outliers) - can use dot plots, histograms, and box plots
quantitative data analysis
- determine categorical variable -draw histogram
2 common strategies to control the effects of confounding variables
- direct control and random assignment
deviations from the mean for each point
- distances between each point and the mean (point-mean) - negative difference= data point to the left of the mean and positive difference= data point to the right of the mean - take absolute value of these differences, add them, and find the average which is a measure of spread about the mean called the average deviation from the mean (ADM)
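The ADM steps above can be sketched in a few lines of Python. This is only an illustration: the function name and the scores are made up, not taken from the course materials.

```python
def average_deviation_from_mean(values):
    """Return the ADM: the average of the absolute deviations from the mean."""
    mean = sum(values) / len(values)
    # Signed deviations: negative = left of the mean, positive = right of the mean.
    deviations = [x - mean for x in values]
    # Take absolute values, add them, and divide by the number of data points.
    return sum(abs(d) for d in deviations) / len(values)

scores = [2, 4, 6, 8, 10]  # hypothetical data
adm = average_deviation_from_mean(scores)
print(adm)
```

With these scores the mean is 6, the absolute deviations are 4, 2, 0, 2, 4, and their average is 12/5.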
quartile marks
- divide the data set into four groups with equal counts
to create a box plot from each distribution
- draw a box from Q1 to Q3 - draw a vertical line in the box at Q2 (median) - extend a tail from Q1 to the smallest value that is not an outlier and from Q3 to the largest value that is not an outlier -indicate outliers with asterisks -long box= large IQR= middle half has large variability in data and vice versa
how can we determine if two events are independent
- e.g. if being female does not depend on being in a health science program and vice versa - when two events are dependent, this does not mean a "cause and effect" relationship exists between them -if there is a large enough difference to suggest a relationship between being female and being enrolled in the health science program, these events are dependent - can also compare the probability that a student is female with the probability that a health science student is female -if P(A given B)= P(A), the two events A and B are independent
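The comparison of P(female) with P(female given health science) can be sketched as follows. The counts in the two-way table are hypothetical, chosen only to illustrate the arithmetic.

```python
# Hypothetical two-way table of counts: (gender, program) -> number of students.
counts = {
    ("female", "health"): 120,
    ("female", "other"): 480,
    ("male", "health"): 60,
    ("male", "other"): 340,
}

total = sum(counts.values())
female_total = counts[("female", "health")] + counts[("female", "other")]
health_total = counts[("female", "health")] + counts[("male", "health")]

# Marginal probability that a randomly selected student is female.
p_female = female_total / total
# Conditional probability that a health science student is female.
p_female_given_health = counts[("female", "health")] / health_total

# If the events were independent, this difference would be (close to) zero.
difference = p_female_given_health - p_female
print(p_female, p_female_given_health, difference)
```

How large the difference must be before we call the events dependent is a judgment call that these notes do not specify, so the sketch only reports the two probabilities.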
Population
- entire group of individuals or objects we want to study -since not possible to study the whole population, we collect data from a part of the population called a sample and we use the sample to draw conclusions about the population
probability affected discrete random variable
- for a discrete random variable like shoe size, the probability is affected by whether or not we include the end point of the interval -e.g. the area (and corresponding probability) is reduced if we consider only shoe sizes strictly less than 9
sample size
- for random samples, the bigger the sample the more accurate -if not a random sample, a large size does not necessarily guarantee reliable results - large samples tend to be more accurate than smaller samples if chosen randomly - precision of sample depends on sample size not population size -size of population does not affect accuracy of a random sample as long as the population is large
relative frequency
- from these counts/frequencies, we can determine a percentage of individuals with a given interval of variable values -this percentage= relative frequency -probability of an event A is the relative frequency with which that event occurs in a long series of repetitions
creating box plots in stat crunch
- graph box plot, select column, group by.... compute
how to create a histogram on stat crunch
- graph, histogram, choose variable - to compare multiple histograms, choose a variable (e.g. weight), then command-click or hold the mouse down and drag to select another variable
continuous random variables
- have numeric values that can be any number in an interval - ex: the exact weight of a person, foot length, measurements, height - with a discrete variable, you can count the possible values for the variable without rounding off - with a continuous variable, you cannot -random here means that the outcomes are uncertain in the short run but have a regular distribution or predictable pattern in the long run - we reserve the term random variable for quantitative variables
symmetrical
- identifying shape of distribution sets up our analysis
when can we add probabilities
- if two events are disjoint ; this works because the events have no outcomes in common= disjoint (e.g. can't have type A and type O blood at the same time, can't have no eggs and 1 egg in a nest at the same time)
modified vs regular box plots
- in modified box plot, outliers are marked with an asterisk - for a box plot that is not modified, the tails extend to the minimum and maximum values (can't see outliers)
observations about histograms
- individual variable values are not visible -grouping individuals into bins of equal-sized intervals is particularly useful when analyzing large data sets -can easily use percentages, also called relative frequencies, to describe the distribution -descriptions of shape, center, and spread, are affected by how the bins are defined
observations about dot plots
- individual variable values are visible particularly when data set is small -descriptions of shape, center, and spread are not affected by how the dot plot is constructed - we can accurately calculate the overall range (largest-smallest)
mean and standard deviation symbols
- mean of a normal distribution locates its center (Greek letter mu, μ) - Greek letter sigma (σ) represents the standard deviation of a normal distribution - the standard deviation determines the spread of the distribution; a normal curve is completely determined by specifying its mean and standard deviation
measure of center when data has an outlier
- median is a better summary since an outlier does not affect the median (it doesn't affect the order of scores), but a lower outlier makes the mean lower and a higher outlier makes the mean higher; the mean becomes too low/high to be a representative measure - the smaller the sample, the greater impact outliers have
confounding variables
- mix up our ability to determine if the explanatory variable causes a change in the response variable - weakens cause-effect relationship between explanatory and response variables
producing data
- need a representative sample from the population - observational studies and experiments -determine what you are measuring and collection of actual data
examples of quantitative variables
- number of boreal owl eggs in a nest - number of times a college student changes major -shoe size -weight of a student -foot lengths for adults - when the outcomes are quantitative, we call the variable a random variable
outliers
- observations that fall outside the overall pattern
common survey plans that produce unreliable and potentially biased results
- online polls; voluntary response sample (biased because only people with strong opinions participate) -mall surveys (ever-present white middle class/retired people); convenience sampling
how to add a distribution overlay to histogram stat crunch
- options, edit, under display options select overlay distribution - when you click one (e.g. normal), can specify parameters for distribution (sample mean and sample standard deviation for normal distribution), compute
how to customize summary stats in stat crunch
- options, edit, statistics has different options you can add to the table - to select or deselect options, hold command and click, then compute
how to modify a histogram to see the shape better on stat crunch
- options, edit, bins, width, change width to be larger - changes # of bins, easier to see skew - larger bin width= fewer bars= less detail - more detail= make bin width smaller= more bars
how to show numerical value of each bin on a histogram stat crunch
- options, edit, display, check value above bar for each bin and click compute
to add percentile stats to table (statcrunch)
- options, edit, in percentile box can enter different values (10, 90) for 10th and 90th percentiles, compute
to draw box plots horizontally instead of vertically
- options, edit, mark other option draw boxes horizontally, compute -numerical scale now on x-axis
to identify outliers in box plots stat crunch
- options, edit, other options, use fences to identify outliers, compute -click and drag to highlight outlier, corresponding row in data table is highlighted -to clear highlight, use clear button in bottom corner
how to create own custom statistic stat crunch
- options, edit, other statistic, type in (column=x) , to see different functions click build to give you list of functions that can be used -Ex, type in std(x)*std(x)= variance, add variance back into table (statistics) to verify custom statistic created, compute, should match
confounding variables observational studies
- other factors influencing results of an observational study - difficult to remove all of the factors that may have an influence, which is why an observational study gives weak/misleading evidence of a cause-effect relationship
features of a probability distribution
- outcomes described by the model are random, meaning that individual outcomes are uncertain, but there is a regular, predictable distribution of outcomes in a large number of repetitions - the model provides a way of assigning probabilities to all possible outcomes - the probability of each possible outcome can be viewed as the relative frequency of the outcome in a large number of repetitions, so like any other probability, it can be any value between 0 and 1 - the sum of the probabilities of all possible outcomes must be 1
information in a well stated research question
- population -variable (what we plan to measure) - numerical characteristics about the population related to variable (e.g. average, proportion, relationship, majority, etc.)
2 types of statistical research questions
- questions about population (select sample and observational study) - questions about cause and effect (experiment)
range vs IQR
- range measures the variability of a distribution by looking at the interval covered by all the data - the IQR measures the variability of a distribution by giving us the interval covered by the middle 50% of the data
role of the normal curve in statistical inference
- relate sample means or proportions to population means or proportions
research question cause and effect
- research question that focuses on a cause and effect relationship is common in disciplines that use experiments such as medicine or psychology - how one variable responds as another variable is manipulated -questions involve two variables -to provide convincing evidence, researcher designs an experiment
statistical adjustments
- researchers may use advanced techniques for making statistical adjustments within an observational study to control the effects of confounding variables that can influence the results - also use criteria when inferring a cause-effect relationship from observational studies -reasonable explanation: smoking causes cancer in rats so it would in humans too -multiple observational studies performed that vary in design so factors that confound one are not present in another
biased sampling plan elections
- sample using magazine subscriptions, lists of registered car owners, etc. -not representative of the American public; systematically underrepresented Democrats, so the poll results did not represent the population
describe distribution of data
- shape -typical value (center) -spread - want less detail for shape but more detail for center -for spread, if total= 50, 50%= 25 bears, how far out do you have to go to reach 25 bears?
how to describe patterns in quantitative data
- shape, center, spread, and outliers
descriptions of distributions with box plots
- shape, center, spread, outliers, and 5 number summary
compute summary statistics stat crunch
- stat menu, summary stats menu, columns, pick which column, compute - by default produces table with 11 summary stats
how to find q1 and q3 stat crunch
- stat, summary stats, columns, pick column, compute (or can personally add q1 and q3) - in small data sets, order numbers from lowest to highest divide into 4 parts - the value separating the 1 and 2 part= Q1, the value separating the 3rd and 4th part= Q3
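For a small data set, the hand method above can be sketched in Python. This assumes one common textbook convention: Q1 is the median of the lower half and Q3 is the median of the upper half, with the overall median left out of both halves when the count is odd. StatCrunch may use a different rule, so its values can differ slightly. The function name and the data are illustrative.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)  # order numbers from lowest to highest
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value so both halves have equal counts.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

data = [1, 3, 5, 7, 9, 11, 13, 15]  # hypothetical data
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```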
contingency frequency table
- stat, table, contingency, data - for comparison of multiple categories: stat, table, contingency, decide which variable goes in the row and which in the column, get 2 way table
categorical variables
- take category or label values and place an individual into one of several groups - each observation can be placed in only one category and the categories are mutually exclusive -e.g. smoker or nonsmoker, gender, race
quantitative variables
- take numerical values and represent some kind of measurement - age (can take on multiple numerical values), weight, height, etc.
how are quartiles used to measure variability about the median
- the IQR is the distance between first and third quartile marks - the IQR is a measurement of the variability about the median, tells us the range of the middle half of the data
in a table that summarizes the distribution of a categorical variable, we can see
- the different values (categories) the variable takes - how many times each value occurs (count) and how often each value occurs (converting counts to proportions)
joint probability equation
- the joint probability equals the product of the marginal and conditional probabilities - marginal probability*conditional probability= joint probability -P(A and B)= P(A)*P(B given A) -P(A and B)= P(A)*P(B) only when the two events are independent
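A quick numeric check of P(A and B)= P(A)*P(B given A), using hypothetical counts from a two-way table (A = female, B = health science program):

```python
# Hypothetical counts; none of these numbers come from the course data.
total_students = 1000
female_students = 600
female_health_students = 120

# Marginal probability P(A): uses a margin total over the overall total.
p_female = female_students / total_students
# Conditional probability P(B given A): restrict to the female row only.
p_health_given_female = female_health_students / female_students

# Joint probability two ways: via the product rule and directly from the table.
joint_from_rule = p_female * p_health_given_female
joint_from_table = female_health_students / total_students
print(joint_from_rule, joint_from_table)
```

Both routes give the same joint probability, up to floating-point rounding.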
Probability
- the machinery behind inferences since we infer something about a population based on a sample - since the sample is not necessarily the population, probability is involved since samples vary
margins of a two way table
- the numbers in the margins are totals for each row or column -where a row and column cross is where we see the number of individuals who fit a particular portion of each category
in a graph that summarizes the distribution of a quantitative variable, we can see...
- the possible values of the variable - the number of individuals with each variable value or interval of values
how to measure spread
- the spread of a distribution is a description of how the data varies -can use range, IQR, and std - when we use the median, Q1 to Q3 gives a typical range of values associated with the middle 50% of the data, and when we use the mean, mean + or - SD gives a typical range of values
when do we use standard deviation
- to compare the variability of two distributions - incorporate the standard deviation into our description of the pattern in the distribution of a quantitative variable
explanatory variable vs response variable
- to establish a cause and effect relationship, want to make sure the explanatory variable is the only thing that impacts the response variable - remove other factors affecting the response and manipulate only explanatory variable
exploratory data analysis
- to make sense of the data, need to explore and summarize it using graphs and different numerical measures (percentages and averages)
range of typical values
- to represent common variable values for the group (bin widths)
two way tables
- two-way tables for two categorical variables give us a useful snapshot of all of the data organized in terms of the two variables of interest - give us a practical context for talking about probability
ADM
- typical range of values (within 1 ADM of the mean) contains more than half of the values in the data set - ADM measures the average distance of the data from the mean - the larger the ADM, the more variable it is
how to adjust a histogram to show relative frequency (density) on y axis stat crunch
- under options, choose edit - in the type box, change frequency to relative frequency then click compute
scores with an outlier
- use median and IQR - outlier increases (or decreases) standard deviation and mean which makes it seem the data is more variable - the typical range based on the first and 3rd quartiles give a better summary since outlier does not affect quartile marks - same applies to skewed data
measuring spread about the mean
- use standard deviation (spread= +/- one std above and below the mean)
when to use mean vs median
- use the mean as a measure of center only for distributions that are reasonably symmetric with a central peak -use the median as a measure of center for all other cases - use median when outliers are present
choosing numerical summaries
- use the mean and standard deviation as measures of center and spread only for distributions that are reasonably symmetric with a central peak -when outliers are present, use the IQR and 5 number summary; also use these for skewed data
investigating the relationship between 2 categorical variables
- use the values of the explanatory variable to define the comparison groups - we then compare the distributions of the response variable for values of the explanatory variable - we look at how the pattern of conditional percentages differs between the values of the explanatory variable
likelihood/chance/probability statements
- use data to make a statement about the likelihood that a randomly selected student from a specific college is a health science major - the risk associated with not wearing a seat belt - the chance of a positive drug test for someone who does not use drugs when the test is 94% accurate
using the IQR to identify outliers
- value is greater than Q3+1.5*IQR or less than Q1-1.5*IQR
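The 1.5*IQR rule can be sketched as below. The quartile helper assumes the median-of-halves convention (software such as StatCrunch may compute quartiles slightly differently), and the function names and data are made up for illustration.

```python
from statistics import median

def q1_q3(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(upper)

def iqr_outliers(values):
    """Return the fences and the values that fall outside them."""
    q1, q3 = q1_q3(values)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    flagged = [x for x in values if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, flagged

data = [2, 4, 5, 6, 7, 8, 9, 40]  # hypothetical data with one large value
lower_fence, upper_fence, flagged = iqr_outliers(data)
print(lower_fence, upper_fence, flagged)
```

Here Q1 = 4.5 and Q3 = 8.5, so the IQR is 4 and only the value 40 lies beyond the upper fence.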
what do we mean when we say an event is random or due to chance
- we mean the event is unpredictable in the short run but has a regular and predictable behavior in the long run -true when tossing a coin; cannot predict whether an individual toss will be heads, but in the long run the outcomes have a predictable pattern (relative frequency of heads is close to 0.5) - can make probability statements only about random events
probabilities of all possible outcomes
- we think of all possible outcomes as variable values -each variable value has a probability - the variable values together with their probabilities are a probability distribution
risk
- when calculating the probability of a negative outcome - risk= another word for probability -often compare 2 risks by calculating the percentage change -calculate the difference (how much the risk is changed) and divide by the risk for the placebo group - percentage reduction of risk= (new treatment risk-reference (placebo) risk)/reference (placebo) risk
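The percentage change formula above, sketched with made-up risks (the two values are hypothetical, not from any study):

```python
def percent_change_in_risk(treatment_risk, reference_risk):
    """(new treatment risk - reference risk) / reference risk, as a percentage."""
    return 100 * (treatment_risk - reference_risk) / reference_risk

placebo_risk = 0.04    # hypothetical risk in the placebo (reference) group
treatment_risk = 0.02  # hypothetical risk in the new treatment group

change = percent_change_in_risk(treatment_risk, placebo_risk)
print(change)
```

A negative result means the risk went down; with these numbers the treatment group's risk is about 50% lower than the placebo group's.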
why does changing the bin size and the starting point of the first bin change the histogram so drastically
- when we change the bins, the data gets grouped differently which affects the appearance of the histogram - avoid histograms with too large or too small bin widths since they don't help us see variability or patterns in data
what happens to the probability histogram when our continuous random variable has more precision
- when we increase the precision of the measurement, we will have a larger number of bins in our histogram because each bin contains measurements that fall within a smaller interval of values - as the width of the intervals of the bins gets smaller, the probability histogram gets closer to the curve - if we continue to reduce the size of the intervals, the curve becomes a better and better way to estimate the probability histograms -normal distributions are normally used for continuous random variables
spread about the median
- which distribution has more variability - determines how you measure spread (e.g. by range), but don't use range when describing how data is distributed about the median -to measure variability about the median, use quartiles - if it has more data close to the median, the data set has less variability about the median
standard deviation vs mean sample vs model
- x bar= mean of data in a sample, mu (μ)= mean of a density curve defined by a mathematical model -s represents standard deviation of data in a sample; sigma (σ) represents the standard deviation of a density curve defined by a mathematical model
conditional percentages
-A way to approximate a percentage by dividing the number of times an event occurred in an experiment by the total number of respondents in that row or column. See relative frequency. - based on a specific condition -conditional percentages= numeric summary - conditional percentages are calculated separately for each value of the explanatory variable - when we try to understand the relationship between 2 categorical variables, we compare the distributions of the response variable for values of the explanatory variable - we look at how the pattern of conditional percentages differs between the values of the explanatory variable
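As a rough sketch, conditional percentages can be computed by dividing each cell by the total for its explanatory-variable group. The table below is hypothetical, with gender as the explanatory variable and program as the response variable.

```python
# Hypothetical two-way table: explanatory value -> {response value: count}.
table = {
    "female": {"health": 120, "other": 480},
    "male": {"health": 60, "other": 340},
}

def conditional_percentages(two_way_table):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for group, row in two_way_table.items():
        row_total = sum(row.values())
        result[group] = {resp: 100 * count / row_total for resp, count in row.items()}
    return result

percentages = conditional_percentages(table)
for group, row in percentages.items():
    print(group, row)
```

Comparing the rows (20% of females vs 15% of males in the health program, in this made-up table) is what lets us compare groups of different sizes.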
addition rule
-Considering mutually exclusive (disjoint) events, the probability of one or the other occurring is the sum of the probabilities of each event: P(A or B)= P(A) + P(B)
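A small sketch of the addition rule for disjoint events, using blood types. The probabilities below are illustrative values assumed for this example, not figures from the course.

```python
# Assumed, illustrative probability distribution for blood type.
blood_type_probs = {"O": 0.44, "A": 0.42, "B": 0.10, "AB": 0.04}

# A person cannot have type A and type O at the same time (disjoint events),
# so P(A or O) is just the sum of the two probabilities.
p_a_or_o = blood_type_probs["A"] + blood_type_probs["O"]

# All outcomes are disjoint, so their probabilities together must add to 1.
total_probability = sum(blood_type_probs.values())
print(p_a_or_o, total_probability)
```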
marginal proportions
-Ratios of the row or column totals to the overall total number of observations -doesn't help us determine the relationship between two categorical variables because it involves only one of the variables
Theoretical Probability
-What should occur or what we expect to happen in an experiment -expect for a coin toss to be 50/50 heads or tails - we determine the number of ways an event can occur and divide by the total number of possible outcomes - in situations where the outcomes are equally likely, we can use mathematics to calculate the probability instead of collecting data
exploring the relationship between 2 categorical variables
-amounts to comparing the distributions of the response variable for different values of the explanatory variable
median
-another way to identify typical value - middle of the data when the values are listed in order - divides the data into 2 equal groups where there are equal amounts of data below and above it
census
-attempt to include every individual from a population in a sample
mean
-average, x bar - to calculate the mean, add data values and divide by # of data points -mean is the fair share measure of center - we can understand the mean as the score Beth would have on every assignment if she always made the same grade -does not give us information about any individual homework score or about how the homework scores vary - also known as the balancing point of a distribution since the distances between each data point and the mean are balanced on each side of the mean -distances below the mean= negative, above the mean= positive
placebo effect
-because people in medical experiments improve when taking a placebo, a placebo group provides a baseline for comparing treatment - if a treatment produces better results than a placebo, have evidence that treatment is responsible for improvement
random sampling
-best way to eliminate bias - collecting a random sample is like pulling names from a hat; in a simple random sample everyone in the population has an equal chance of being chosen -also guarantees that the sample results do not change significantly from sample to sample; variability in results is due to chance
typical value
-center of distribution -normally the tallest bin (since it has the highest frequency) -can calculate by taking the total count, dividing by 2, and counting frequencies upward until you land on the bar with the middle value -changing bin width changes the typical value but it should be similar
Boxplots
-commonly used to summarize a distribution of a quantitative variable
inference
-conclusion we reach from our sample data that answers our original question about the population - to learn and draw conclusions about the opinions of the entire population based on our sample
examples of observational study and experiment
-observational study: conducting a survey, or dividing a class into two groups (one listening to music and one not) and having them keep a journal -experiment: word puzzles without music and word puzzles with music, then calculating the average number of words found
sampling plan
-describes exactly how we will choose the sample - a sampling plan is biased if it systematically favors certain outcomes - focus on surveys as sampling plans
steps used to rule out confounding variables
-direct control - random assignment -control group -placebo group -blinding -does not rule out chance variation between treatment groups
single blind experiment
-either researcher or participant does not know which treatment the participants receive
why can raw counts be misleading
-ex: if there are more females than males, comparing raw counts is misleading -instead we compare the percentage of females who responded to each category - by converting to percentages, we are reporting the results as though there are 100 females and 100 males - in general, we need to supplement our display (2 way table) with numeric summaries that allow us to compare the distributions; therefore, we always convert counts to percentages
two types of statistical investigations (producing data )
-experiments and observational studies -our approach to collecting data determines what we can conclude from the data
Dotplot
-graphs a dot for each case against a single axis - vertical axis= count/frequency, horizontal axis= variable values - can say how much protein varies by (1-6 grams) by using range - most of the cereals have 1-2 grams of protein
discrete random variables
-have numeric values that can be listed and often can be counted -e.g. the variable number of boreal owl eggs is a discrete random variable -shoe size= discrete random variable - blood type is categorical
blinding
-in experiments that use a placebo, participants do not know whether they are receiving drug or placebo - are blind to the treatment to prevent their own beliefs about the drug or placebo from confounding the results
direct control
-influencing length of time in washing hands (want all groups to wash hands for the same time) - amount of soap (all participants use one squirt) - stabilizing impact of confounding variable across treatments; differences in response variable cannot be due to differences in confounding variables
experiment
-intentionally manipulates one variable in an attempt to cause an effect on another variable - cause and effect relationship between 2 variables
probability distribution
-list of possible outcomes with associated probabilities -each variable value is assigned a probability - sum of all probabilities=1 - e.g. each blood type has a corresponding probability - the probabilities are numbers between 0 and 1 since each probability is a relative frequency - all outcomes are assigned a probability -the outcomes are random events; when we randomly choose a person we do not know their blood type but there is a predictable pattern in the outcomes that is described by the relative frequencies
bell shaped curve
-normal distribution/normal curves -continuous random variables - indicates that values closer to the mean are more likely and it becomes increasingly unlikely to take values far from the mean in either direction -even though all normal curves have the same bell shape, they vary in their center and spread
what information does a box plot not give us
-number of data points in the data set - number of data points within each quartile (even though each quartile contains the same number of data points) -pattern of the data within each quartile
observational study
-observes individuals and measures variables of interest but does not attempt to influence the responses -main purpose is to describe a group of individuals or to investigate an association between two variables; can investigate a relationship but since not manipulating one variable to cause an effect in another, does not provide convincing evidence of a cause and effect relationship
how to change starting point/bin width of a histogram stat crunch
-options, edit, bins, (can enter start at point and width)
to represent a probability distribution of a random variable
-probability histogram -probability distribution of a random variable X can be represented by a table that provides a way to assign probabilities to outcomes
how to use normal calculator stat crunch
-stat, calculator, normal
well designed experiment
-takes steps to eliminate the effects of confounding variables including random assignment of people to treatment groups, use of a placebo, or blind conditions
Conditional Probability
-the likelihood that a target behavior will occur in a given circumstance -e.g. if we select a female student at random, what is the probability that she is in the health sciences program -starting with a female (condition), then asking what is the probability that female is in the health sciences - a condition is given -can also be represented by a vertical bar - the probability of a categorical variable taking on a particular value given the condition that the other categorical variable has some particular value -only using a subset of the data which is determined by the given condition
Complement Rule
-the probability of an event occurring is 1 minus the probability that it doesn't occur -e.g. P(not a universal donor)= P(blood type is not O)= 1- P(type O) -if event A is "blood type is O", the complement of A is the event composed of all blood types except for O
Joint Probability
-the probability of the intersection of two events -e.g. the probability that a randomly selected student is both female and in the health sciences program - when we calculate a joint probability, we divide the count from an inner cell of the table by the overall total count in the lower right corner - the probability that 2 categorical variables each take on a specific value
Marginal Probability
-the values in the margins of a joint probability table that provide the probabilities of each event separately -same as marginal proportion - the probability of a categorical variable taking on a particular value without regard to the other categorical variable - use overall student data contained in the margins of the table - do not take into account the other categorical variable
spread
-the variability of the data - can measure using range but the outliers make it seem like data is much more variable than it is in reality (seen with salaries) -normally can look 1 bar above and 1 below - focus on middle 50% and how spread out is the middle 50% of the data
bins
-variable values divided into equal sized intervals (each bin is a bar) - height of bin= count/frequency
empirical probability
..., involves conducting an experiment to observe the frequency with which an event occurs -actually collecting data; actually flipping a coin multiple times (need a large sample) -empirical probability gets closer to theoretical probability with larger samples - an estimate, using data, of the likelihood that the event will happen
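A simulation can stand in for actually flipping a coin many times. This sketch uses Python's standard random module with a fixed seed so the run is repeatable; the function name and the sample sizes are arbitrary choices for illustration.

```python
import random

def empirical_probability_of_heads(n_flips, rng):
    """Flip a simulated fair coin n_flips times; return the relative frequency of heads."""
    heads = sum(1 for _ in range(n_flips) if rng.random() < 0.5)
    return heads / n_flips

rng = random.Random(0)  # fixed seed, an assumption made only for repeatability
small_estimate = empirical_probability_of_heads(10, rng)
large_estimate = empirical_probability_of_heads(10_000, rng)
print(small_estimate, large_estimate)
```

The estimate from 10 flips can land far from 0.5, while the estimate from 10,000 flips should sit close to the theoretical probability.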
Big picture of stats
1. Producing Data 2. Exploratory Data Analysis 3. Probability 4. Inference
why can't you talk about the shape for categorical data
- bars can be rearranged, so the shape has no meaning - can talk about typical value -for variability of categorical data, look at the # of bars (e.g. 1-2 categories or 12 categories)
why it is important to identify the explanatory variable
- because we always use the totals for the explanatory variable to calculate percentages
inflection point normal curve
- the x-values of the inflection points correspond to 1 standard deviation above and below the mean - at an inflection point the curve changes the direction of its bend and goes from bending upward to bending downward or vice versa
standard deviation
- a measurement of spread about the mean, similar to the average deviation - think of standard deviation as roughly the average distance of data from the mean, approximately equal to the average deviation
random variable
- a quantitative variable with outcomes that occur as a result of some random process (discrete and continuous) - a probability distribution of a random variable tells us the probabilities of all the possible outcomes (for discrete random variables) of the variable or ranges of values (for continuous random variables) - a probability distribution shows us the regular, predictable distribution of outcomes in a large number of repetitions of a random variable -for a discrete random variable, the probabilities of values are areas of the corresponding regions of the probability histogram for the variable
representative sample
- a subset of the population that reflects the characteristics of the population
histogram
- another way to display the distribution of a quantitative variable - histograms are useful for large data sets since they divide the variable values into equal sized intervals
questions asking to calculate relative frequency
- approximately what percentage of the sample has hip measurements between 85 and 90cm? - what percentage of the sample will wear large size sweat pants? - in these calculations, we assume that the value of the left-hand endpoint of each bin is included in the count for that bin and not the right-hand endpoint - bin for interval 85-90 has values of 85 but not 90
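A small sketch of the bin convention described above, where the left-hand endpoint is counted and the right-hand endpoint is not. The hip measurements are made-up values for illustration, and `relative_frequency` is a hypothetical helper name.

```python
# Hypothetical hip measurements in cm (made-up values, for illustration only).
hips = [82.0, 85.0, 86.5, 88.0, 89.9, 90.0, 91.2, 94.0, 97.5, 101.0]

def relative_frequency(values, left, right):
    """Share of values in the bin [left, right): left included, right excluded."""
    in_bin = [v for v in values if left <= v < right]
    return len(in_bin) / len(values)

share = relative_frequency(hips, 85, 90)
print(f"{share:.0%} of the sample is between 85 and 90 cm")
```

Here the value 85.0 is counted in the 85-90 bin, while 90.0 falls into the next bin.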
right skewed
a distribution with a tail that extends to the right
left skewed
A density curve where the left side of the distribution extends in a long tail. (Mean < median.)
mutually exclusive
Events that cannot occur at the same time.
uniform shape
a rectangular shape with the same amount of data for each variable value
empirical rule for normal curves
- within 1 std of the mean: central 68% of the data - 95% of values fall within 2 stds of the mean; therefore unlikely for a value to fall more than 2 stds away from the mean -values more than 2 stds away from the mean in a normal distribution= outliers - 99.7% of values fall within 3 stds of the mean - extremely unlikely for a value to fall more than 3 stds away from the mean -values more than 3 stds away from the mean are often called extreme outliers
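A quick check of the 68 / 95 / 99.7 percentages, sketched with Python's standard-library `statistics.NormalDist`. It uses a standard normal curve (mean 0, std 1), but the percentages are the same for any normal curve.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

for k in (1, 2, 3):
    # Area under the curve between k stds below and k stds above the mean.
    within = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} std of the mean: {within:.1%}")
```

The printed areas come out at roughly 68%, 95%, and 99.7%, which is where the rule gets its numbers.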
to find the mean of a data set in StatCrunch
stat, summary stats, columns, pick column (variable), pick statistics - to specifically choose one, options, edit, statistics column and look for IQR
independent events
The outcome of one event does not affect the outcome of the second event
std formula
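subtract the mean from each data point, square each result, add up the squares for all data points, divide by the number of data points minus 1, then take the square root (without the square root, the result is the variance)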
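A minimal sketch of the sample standard deviation steps, ending with the square root (stopping at the division gives the variance). The data values are made up, `sample_std` is a hypothetical helper name, and the result is compared against the standard library's `statistics.stdev`.

```python
import math
import statistics

def sample_std(data):
    """Sample standard deviation, computed step by step."""
    n = len(data)
    mean = sum(data) / n
    squared_devs = [(x - mean) ** 2 for x in data]  # subtract the mean, then square
    variance = sum(squared_devs) / (n - 1)          # add up, divide by n - 1
    return math.sqrt(variance)                      # square root gives the std

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values, for illustration only
print(sample_std(data), statistics.stdev(data))
```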
2 types of variables
categorical and quantitative
data
consists of individuals (objects or people) and variables that give us information about those individuals -variable= an attribute, such as a measurement or a label
what does IQR tell us
how spread out the middle 50% of the data is
5 number summary
min, Q1, median, Q3, max - median=Q2 -some quartiles exhibit more variability in the data even though each quartile contains the same amount of data - 25% of the data is at or below Q1, 50% at or below Q2, 75% at or below Q3 - uses quartiles to identify center and spread of a distribution - values between Q1 and Q3 give a typical range of values
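A sketch of computing the five-number summary and the IQR by hand. It assumes Q1 and Q3 are the medians of the lower and upper halves of the sorted data (leaving out the middle value when the count is odd); software such as StatCrunch may use a slightly different quartile rule. The data values and the function name are made up for illustration.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    leaving out the middle value when the number of data points is odd.
    """
    values = sorted(data)
    n = len(values)
    lower = values[: n // 2]        # lower half of the sorted data
    upper = values[(n + 1) // 2 :]  # upper half of the sorted data
    return values[0], median(lower), median(values), median(upper), values[-1]

data = [1, 3, 4, 6, 7, 8, 10, 12, 15]  # made-up values, for illustration only
low, q1, med, q3, high = five_number_summary(data)
iqr = q3 - q1  # how spread out the middle 50% of the data is
print(low, q1, med, q3, high, iqr)
```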
shape
to describe the shape of a distribution imagine sketching the outline of the data to emphasize the general trend