Chapter 1-5 Test (Stat)
Variable
A variable holds information about the same characteristic for many cases.
Quantitative variable
A variable in which the numbers act as numerical values is called quantitative. Quantitative variables always have units.
Categorical variable
A variable that names categories (whether with words or numerals) is called categorical
Population
All the cases we wish we knew about
Data table
An arrangement of data in which each row represents a case and each column represents a variable
Experimental unit
An individual in a study for whom or for which data values are recorded. Human experimental units are usually called subjects or participants.
re-express (transform)
Applying a simple function (such as a logarithm or square root) to the data can make a skewed distribution more symmetric or equalize spread across groups.
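As a rough illustration of re-expression, the sketch below applies a base-10 logarithm to a small set of made-up, right-skewed values and compares the mean and median before and after. The data and variable names are hypothetical; only the idea of transforming with a logarithm comes from the definition above.

```python
import math
import statistics

# Hypothetical right-skewed data: each value is ten times the previous one.
values = [1, 10, 100, 1000, 10000]

# Re-express with a base-10 logarithm (requires strictly positive values).
logged = [math.log10(v) for v in values]

# Before: the long right tail pulls the mean far above the median.
mean_before = statistics.mean(values)
median_before = statistics.median(values)

# After: mean and median agree, as they do in a symmetric distribution.
mean_after = statistics.mean(logged)
median_after = statistics.median(logged)

print(mean_before, median_before)
print(round(mean_after, 6), round(median_after, 6))
```

A mean well above the median is a common sign of right skew, so their agreement after the transform suggests the re-expressed values are closer to symmetric.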
Bar chart
Bar charts show a bar whose area represents the count of observations for each category of a categorical variable
Unimodal/Bimodal
Having one mode. This is a useful term for describing the shape of a histogram when it's generally mound-shaped. Distributions with two modes are called bimodal. Those with more than two are multimodal.
Units
A quantity or amount adopted as a standard of measurement
Simulation
A random re-enactment of data collection under one or more assumptions. If real data look very different from simulated data, then the assumptions are called into question
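A minimal sketch of a simulation, assuming the question is whether a coin is fair. The assumption being tested (a 50% chance of heads), the observed count, and the function name simulate_head_counts are all hypothetical choices made for illustration.

```python
import random

def simulate_head_counts(n_flips, n_trials, seed=0):
    """Return the number of heads in each of n_trials runs of n_flips fair flips."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [
        sum(rng.random() < 0.5 for _ in range(n_flips))
        for _ in range(n_trials)
    ]

observed_heads = 60  # hypothetical real result out of 100 flips
counts = simulate_head_counts(n_flips=100, n_trials=1000)

# Fraction of simulated runs with at least as many heads as the observed result.
share_as_extreme = sum(c >= observed_heads for c in counts) / len(counts)
print(share_as_extreme)
```

If very few simulated runs look like the real result, the fair-coin assumption is called into question.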
Relative frequency table
A relative frequency table lists the categories of a categorical variable and gives the fraction or percent of observations in each category.
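One way to build such a table is to count each category and divide by the total. The sketch below uses made-up eye-color labels; the data and names are assumptions for illustration.

```python
from collections import Counter

# Hypothetical categorical data: one label per case.
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "blue"]

counts = Counter(eye_colors)   # frequency table: category -> count
total = sum(counts.values())

# Relative frequency table: category -> fraction of all observations.
relative = {category: count / total for category, count in counts.items()}

for category in sorted(counts):
    print(f"{category:<6} {counts[category]:>2} {relative[category]:.1%}")
```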
Context
The context ideally tells Who was measured, What was measured, How the data were collected, Where the data were collected, and When and Why the study was performed.
Distribution
The distribution of a variable gives the possible values of the variable and the relative frequency of each value. The distribution of a quantitative variable slices up all the possible values of the variable into equal-width bins and gives the number of values falling into each bin.
Conditional distribution
The distribution of a variable restricting the Who to consider only a smaller group of individuals
Quartile
The lower quartile (Q1) is the value with a quarter of the data below it. The upper quartile (Q3) has three quarters of the data below it. The median and quartiles divide data into four parts with equal numbers of data values.
Categorical Data Condition
The methods in this chapter are appropriate for displaying and describing categorical data. Be careful not to use them with quantitative data
Shape
To describe the shape of a distribution, look for single vs. multiple modes, symmetry vs. skewness, and outliers and gaps.
Histogram (relative frequency histogram)
Uses adjacent bars to show the distribution of a quantitative variable. Each bar represents the frequency (or relative frequency) of values falling in each bin.
Independence
Variables are said to be independent if the conditional distribution of one variable is the same for each category of the other. If variables are not independent, we say there is an association.
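The check described above can be sketched by computing the conditional distribution within each row of a table of counts and comparing them. The tables, group labels, and function names below are hypothetical. With real sample data the conditional distributions are rarely identical, so in practice one looks for distributions that are roughly the same.

```python
import math

def conditional_distributions(table):
    """For each row category, return the proportions across column categories."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: n / row_total for col, n in cells.items()}
    return result

def looks_independent(table, tol=1e-9):
    """True if every row has the same conditional distribution (within tol)."""
    dists = list(conditional_distributions(table).values())
    first = dists[0]
    return all(
        math.isclose(d[col], first[col], abs_tol=tol)
        for d in dists
        for col in first
    )

# Hypothetical counts: rows are groups, columns are survey answers.
same_pattern = {"A": {"yes": 30, "no": 70}, "B": {"yes": 60, "no": 140}}
different_pattern = {"A": {"yes": 30, "no": 70}, "B": {"yes": 120, "no": 80}}

print(looks_independent(same_pattern))       # True: 30% yes in both groups
print(looks_independent(different_pattern))  # False: 30% yes vs. 60% yes
```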
Simpson's paradox
When averages are taken across different groups, they can appear to contradict the overall averages
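A small worked example may help. The counts below are invented for two hypothetical pilots: pilot_2 has the better on-time rate both by day and by night, yet pilot_1 has the better overall rate because the two pilots fly very different mixes of day and night flights.

```python
# Hypothetical counts per condition: (on_time, flights).
records = {
    "pilot_1": {"day": (90, 100), "night": (10, 20)},
    "pilot_2": {"day": (19, 20), "night": (60, 100)},
}

def rate(on_time, flights):
    """On-time proportion within a single condition."""
    return on_time / flights

def overall_rate(by_condition):
    """On-time proportion pooled over all conditions."""
    on_time = sum(k for k, _ in by_condition.values())
    flights = sum(n for _, n in by_condition.values())
    return on_time / flights

for pilot, by_condition in records.items():
    per_condition = {c: rate(*cell) for c, cell in by_condition.items()}
    print(pilot, per_condition, round(overall_rate(by_condition), 3))
```

Here the within-group comparisons (0.95 vs. 0.90 by day, 0.60 vs. 0.50 by night) favor pilot_2, while the pooled rates (about 0.833 vs. 0.658) favor pilot_1.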
Timeplot
displays data that change over time to show long-term patterns and trends
Record
Information about an individual in a database
Outliers
Extreme values that don't appear to belong with the rest of the data. They may be unusual values that deserve further investigation, or they may just be mistakes.
5 Number Summary
min, Q1, median, Q3, max
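A minimal sketch of computing the five numbers. The quartile rule used here (medians of the lower and upper halves, leaving the middle value out of both halves when the count is odd) is an assumption; it is one common convention, and others can give slightly different quartiles. The function name is hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of at least two numbers.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves of
    the sorted data, with the middle value excluded when the count is odd.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )

print(five_number_summary([2, 4, 4, 5, 7, 8, 9, 12]))
# (2, 4.0, 6.0, 8.5, 12)
```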
Parameter
numerically valued attribute of a model
standard deviation
the square root of the variance
Variance
the sum of squared deviations from the mean, divided by the count minus one
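The definition can be followed step by step, as in the sketch below. The data are made up; the standard deviation is then taken as the square root of the variance.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

mean = sum(values) / len(values)

# Squared deviation of each value from the mean.
squared_deviations = [(v - mean) ** 2 for v in values]

# Sum of squared deviations divided by the count minus one.
variance = sum(squared_deviations) / (len(values) - 1)
std_dev = math.sqrt(variance)

print(variance, std_dev)

# The standard library's sample variance uses the same n - 1 divisor.
print(math.isclose(variance, statistics.variance(values)))
```

For these values the mean is 5 and the squared deviations sum to 32, so the variance is 32/7.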
Interquartile Range (IQR)
the difference between the first and third quartiles
Range
the difference between the highest and lowest scores in a data set
Subject
A human experimental unit. Also called a participant.
Segmented Bar Chart
A bar chart whose bars are stacked on top of one another in a vertical graph, or lined up side-by-side in a horizontal graph. A segmented bar chart usually shows relative frequencies so that the distribution of the categorical variable can be more easily compared between different groups
Case
A case is an individual about whom or which we have data
Identifier variable
A categorical variable that assigns a unique value for each case, used to name or identify it
Contingency Table
A contingency table displays counts and, sometimes, percentages of individuals falling into named categories on two or more variables. The table categorizes the individuals on all variables at once to reveal possible patterns in one variable that may be contingent on the category of the other.
Nearly Normal Condition
A distribution is nearly Normal if it is unimodal and symmetric. We can check by looking at a histogram or a Normal probability plot.
Symmetric
A distribution is symmetric if the two halves on either side of the center look approximately like mirror images of each other.
Frequency table
A frequency table lists the categories of a categorical variable and gives the number of observations of each category
Marginal distribution
In a contingency table, the distribution of either variable alone is called the marginal distribution. The counts or percentages are the totals found in the margins (last row or column) of the table.
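To make the idea of margins concrete, the sketch below totals the rows and columns of a small, made-up contingency table stored as nested dictionaries. The table and names are assumptions for illustration.

```python
# Hypothetical contingency table: rows are groups, columns are answers.
table = {
    "A": {"yes": 30, "no": 70},
    "B": {"yes": 120, "no": 80},
}

# Row totals: the marginal distribution of the row variable (as counts).
row_totals = {row: sum(cells.values()) for row, cells in table.items()}

# Column totals: the marginal distribution of the column variable (as counts).
column_totals = {}
for cells in table.values():
    for col, n in cells.items():
        column_totals[col] = column_totals.get(col, 0) + n

# The same column margin expressed as percentages of the grand total.
grand_total = sum(row_totals.values())
column_percents = {col: 100 * n / grand_total for col, n in column_totals.items()}

print(row_totals)       # {'A': 100, 'B': 200}
print(column_totals)    # {'yes': 150, 'no': 150}
print(column_percents)  # {'yes': 50.0, 'no': 50.0}
```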
Area principle
In a statistical display, each data value should be represented by the same amount of area
Rescale
Multiplying each data value by a constant multiplies both the measures of position (mean, median, and quartiles) and the measures of spread (standard deviation and IQR) by that constant.
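A quick check of this property, assuming made-up heights converted from inches to centimeters (a multiplication by 2.54). The data and names are hypothetical.

```python
import math
import statistics

heights_in = [60, 62, 65, 65, 70, 74]        # hypothetical heights in inches
heights_cm = [2.54 * h for h in heights_in]  # rescale: inches -> centimeters

for summary in (statistics.mean, statistics.median, statistics.stdev):
    before = summary(heights_in)
    after = summary(heights_cm)
    # Each summary of the rescaled data should be 2.54 times the original.
    print(summary.__name__, math.isclose(after, 2.54 * before))
```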
Pie chart
Pie charts show how a "whole" divides into categories by showing a wedge of a circle whose area corresponds to the proportion in each category.
Respondent
Someone who answers, or responds to, a survey
Data
Systematically recorded information, whether numbers or labels, together with its context
normal percentile
The Normal percentile corresponding to a z-score gives the percentage of values in a standard Normal distribution found at that z-score or below.
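As a sketch, Python's standard library can stand in for a Normal table: the cumulative distribution function of a standard Normal gives the proportion at or below a z-score. The helper name normal_percentile is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def normal_percentile(z):
    """Percentage of a standard Normal distribution at or below z."""
    return 100 * standard_normal.cdf(z)

print(round(normal_percentile(0.0), 1))   # 50.0
print(round(normal_percentile(1.0), 1))   # 84.1
print(round(normal_percentile(-2.0), 1))  # 2.3
```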
Sample
The cases we actually examine in seeking to understand the larger population
Center
The place in the distribution of a variable that you'd point to if you wanted to attempt the impossible by summarizing the entire distribution with a single number. Measures of center include the mean and median.
Tails
The tails of a distribution are the parts that typically trail off on either side. Distributions can be characterized as having long tails (if they straggle off for some distance) or short tails (if they don't)
Boxplot
A boxplot displays the 5-number summary as a central box with whiskers that extend to the non-outlying data values; any outliers are shown individually.
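The pieces of a boxplot can be computed before anything is drawn. The sketch below makes two assumptions that are not part of the definition above: quartiles are taken as medians of the lower and upper halves, and a value counts as an outlier when it lies more than 1.5 IQRs beyond the nearer quartile (a widely used convention). The function name and data are hypothetical.

```python
import statistics

def boxplot_parts(values, k=1.5):
    """Return the box, whisker ends, and outliers for at least two numbers.

    Assumptions: quartiles are medians of the lower and upper halves, and
    values more than k * IQR beyond a quartile are flagged as outliers.
    """
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])
    q3 = statistics.median(data[-half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "box": (q1, statistics.median(data), q3),
        "whiskers": (inside[0], inside[-1]),  # most extreme non-outlying values
        "outliers": outliers,
    }

print(boxplot_parts([2, 4, 4, 5, 7, 8, 9, 30]))
# {'box': (4.0, 6.0, 8.5), 'whiskers': (2, 9), 'outliers': [30]}
```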
normal probability plot
a display to help assess whether a distribution of data is approximately normal; if it is nearly straight, the data satisfy the nearly normal condition
Skewed
A distribution is skewed if it's not symmetric and one tail stretches out farther than the other.
Uniform
a distribution that's roughly flat
Dot plot
a dot for each case against a single axis
Mode
A hump or local high point in the shape of the distribution of a variable. The apparent location of modes can change as the scale of a histogram is changed.
Spread
a numerical summary of how tightly the values are clustered around the "center". measures of spread include the IQR and standard deviation
Gap
a region of the distribution where there are no values
Shifting
adding a constant to each data value adds the same constant to the mean, the median, and the quartiles, but does not change the standard deviation or IQR
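A quick check of this property, assuming made-up test scores that all receive a 5-point bonus. The data and names are hypothetical.

```python
import math
import statistics

scores = [60, 62, 65, 65, 70, 74]  # hypothetical test scores
curved = [s + 5 for s in scores]   # shift: add 5 points to every score

# Measures of position move up by exactly the shift.
print(statistics.mean(curved) - statistics.mean(scores))
print(statistics.median(curved) - statistics.median(scores))

# The standard deviation, a measure of spread, is unchanged.
print(math.isclose(statistics.stdev(curved), statistics.stdev(scores)))
```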
Comparing Distributions
compare shape, center, spread
Mean
found by summing all the data values and dividing by the count
68-95-99.7 rule
in a normal model, about 68% of values fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations of the mean, and about 99.7% fall within 3 standard deviations of the mean
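The rule can be checked numerically with the standard library's Normal model. The mean of 100 and standard deviation of 15 below are arbitrary, hypothetical choices; the proportions are the same for any Normal model.

```python
from statistics import NormalDist

model = NormalDist(mu=100, sigma=15)  # hypothetical mean and standard deviation

shares = {}
for k in (1, 2, 3):
    low = model.mean - k * model.stdev
    high = model.mean + k * model.stdev
    # Proportion of the model lying within k standard deviations of the mean.
    shares[k] = model.cdf(high) - model.cdf(low)
    print(f"within {k} SD: {shares[k]:.1%}")
```

The printed proportions come out close to 68%, 95%, and 99.7%, which is why the rule is stated with "about".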
Stem and leaf display
shows quantitative data values in a way that sketches the distribution of the data
Z-score
A z-score tells how many standard deviations a value is from the mean and in which direction. A set of z-scores has a mean of zero and a standard deviation of one.
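A minimal sketch, using made-up data, of computing z-scores by subtracting the mean and dividing by the standard deviation, then confirming that the resulting values have mean 0 and standard deviation 1 (up to rounding error).

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

mean = statistics.mean(values)
sd = statistics.stdev(values)

# Negative z-scores lie below the mean, positive ones above it.
z_scores = [(v - mean) / sd for v in values]

print([round(z, 2) for z in z_scores])
print(math.isclose(statistics.mean(z_scores), 0, abs_tol=1e-12))
print(math.isclose(statistics.stdev(z_scores), 1))
```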
Quantitative Data Condition
the data are values of a quantitative variable whose units are known
Percentile
the ith percentile is the number that falls above i% of the data
Median
the middle value with half of the data above and half below it
Normal Model
useful family of models for unimodal, symmetric distributions
Statistic
value calculated from data to summarize aspects of the data
Standardized value
value found by subtracting the mean and dividing by the standard deviation
normality assumption
we must have a reason to believe a variable's distribution is normal before applying a normal model