Stats 1 Exam Chapters 1, 2, and 3
line charts
Used to display a time series, to spot trends, or to compare time periods. · Can display several variables at once; may require a two-scale line chart, which lets you compare variables that differ in magnitude or are measured in different units.
Statistics vs Statistic
Statistics- the science of collecting, organizing, analyzing, interpreting, and presenting data. Statistic- a single measure, reported as a number, used to summarize a sample data set. Ex: average height of a sample of students in a university
Data Set
consists of all the values of all of the observations we have chosen to observe.
sturges' rule
every time we double the sample size, we should add one bin. Formula: k = 1 + 3.3 log10(n), where n is the sample size and k is the number of bins.
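A minimal sketch of Sturges' rule, to show why doubling n adds about one bin (3.3 × log10(2) ≈ 0.99). The function name and the choice to round to the nearest whole number are assumptions for illustration; the rule itself only gives a suggested starting point.

```python
import math

def sturges_bins(n: int) -> int:
    """Suggested number of bins: k = 1 + 3.3 * log10(n), rounded (rounding is an assumption)."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return round(1 + 3.3 * math.log10(n))

# Each doubling of the sample size adds one bin.
for n in (16, 32, 64, 128, 256):
    print(n, sturges_bins(n))
```

The loop prints 5, 6, 7, 8, and 9 bins for the five sample sizes.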
error
refers to problems in sample methodology that lead to inaccurate estimates of a population parameter.
cluster sampling
select random geographical regions (e.g., zip codes) that represent the population.
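A rough sketch of one-stage cluster sampling: pick whole clusters at random, then keep every unit inside the chosen clusters. The zip codes, household labels, and function name are hypothetical, and keeping all units in each chosen cluster is an assumption (a second random draw inside each cluster is also common).

```python
import random

def cluster_sample(units_by_cluster, n_clusters, seed=None):
    """Randomly choose n_clusters clusters and return all units in each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(units_by_cluster), n_clusters)
    return {cluster: list(units_by_cluster[cluster]) for cluster in chosen}

# Hypothetical data: households grouped by zip code.
households = {
    "53703": ["A", "B", "C"],
    "53704": ["D", "E"],
    "53705": ["F", "G", "H", "I"],
    "53706": ["J"],
}
sample = cluster_sample(households, n_clusters=2, seed=1)
print(sample)
```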
nominal measurement
the weakest level of measurement. Identifies a category or a classification; same as qualitative data. EX: eye color or name
3 types of data sets:
univariate bivariate multivariate
time series data
when each observation in the sample represents a different, equally spaced point in time (years, months, days). EX: unemployment rate for 2020 or GDP for 2020
Cross-sectional data
when each observation represents a different individual unit (e.g., person, firm, or geographic area) at the same point in time. EX: traffic fatalities in the 50 U.S. states for 2020
7 sources of error:
§ Nonresponse bias- respondents differ from nonrespondents
§ Selection bias- self-selected respondents are atypical
§ Response error- respondents give false info
§ Coverage error- incorrect specification of frame or population
§ Measurement error- unclear survey instrument wording
§ Interviewer error- responses influenced by interviewer
§ Sampling error- random and unavoidable
relative frequencies
§ Relative frequencies are calculated as the absolute frequency for a bin divided by the total # of data values. Relative frequency = frequency / total # of data values (x 100 for percentage)
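The calculation above can be sketched as follows. The bin counts are made-up numbers and the function name is just for illustration.

```python
def relative_frequencies(frequencies):
    """Divide each bin's frequency by the total number of data values."""
    total = sum(frequencies)
    if total == 0:
        raise ValueError("need at least one data value")
    return [f / total for f in frequencies]

counts = [4, 10, 6]                      # hypothetical frequencies for three bins
rel = relative_frequencies(counts)       # proportions of the 20 data values
percent = [100 * r for r in rel]         # same thing as percentages
print(rel, percent)
```

Relative frequencies always sum to 1 (or 100%), which is a quick check on the arithmetic.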
systematic sampling
§ select every kth item from a list or sequence (e.g., restaurant customers)
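A minimal sketch of systematic sampling, assuming a random starting point among the first k items (the random start and the names are assumptions; the customer list is hypothetical).

```python
import random

def systematic_sample(items, k, seed=None):
    """Take every kth item, starting from a random position in the first k items."""
    if k < 1:
        raise ValueError("k must be at least 1")
    items = list(items)
    start = random.Random(seed).randrange(k)
    return items[start::k]

customers = list(range(1, 101))          # hypothetical customers numbered 1..100
picked = systematic_sample(customers, k=10, seed=7)
print(picked)
```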
stratified sampling
§ select randomly within defined strata (e.g., by age, occupation, gender)
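A rough sketch of stratified sampling: group the records into strata, then draw a simple random sample inside each one. The people, age groups, and the equal per-stratum sample size are all assumptions for illustration (proportional allocation is another common choice).

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, per_stratum, seed=None):
    """Group records by stratum, then sample randomly within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    return {
        name: rng.sample(members, min(per_stratum, len(members)))
        for name, members in strata.items()
    }

# Hypothetical (name, age group) records.
people = [
    ("Ann", "18-34"), ("Bo", "18-34"), ("Cy", "18-34"),
    ("Di", "35-54"), ("Ed", "35-54"),
    ("Flo", "55+"),
]
by_age = stratified_sample(people, stratum_of=lambda p: p[1], per_stratum=2, seed=3)
print(by_age)
```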
simple random sample
§ use random #'s to select items from a list (e.g., Visa card holders)
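A minimal sketch of a simple random sample using Python's standard `random` module. The list of card holders is hypothetical, and the seed is only there so the example is repeatable.

```python
import random

def simple_random_sample(items, n, seed=None):
    """Choose n distinct items so that every item has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(list(items), n)

card_holders = [f"holder_{i:03d}" for i in range(1, 201)]   # hypothetical list of 200
chosen = simple_random_sample(card_holders, n=10, seed=42)
print(chosen)
```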
interval scale of measurement
2nd strongest. def- scale points are represented as #'s, but the zero point is arbitrary: 0 does not mean the absence of the quantity. Distances b/w scale points have meaning, but ratios do not. EX: Celsius and Fahrenheit scales
Ordinal measurement
2nd weakest. def- codes represent a ranking of data values. The #'s show rank order, but the distance b/w ranks has no meaning. EX: what size automobile do you usually drive? 1. full size 2. compact 3. subcompact EX: recruiter's ranking of job candidates (outstanding, good, adequate, weak, unsatisfactory)
2 data types
Categorical data- qualitative, values that are words instead of numbers. Numerical data- quantitative and numeric values that arise from counting or measuring.
2 Types of Stats: Descriptive and Inferential
Descriptive stats- refers to the collection, organization, presentation, and summary of data (either using charts and graphs or using a numerical summary). Inferential stats- refers to generalizing from a sample to a population, estimating unknown population parameters, drawing conclusions, and making decisions
likert scale
a special case that is frequently used in survey research. The respondent is asked to indicate his or her agreement/disagreement on a 5 point or 7 point scale using verbal anchors. The coarseness refers to the number of scale points (typically 5 or 7).
sample
a subset of the population that we will actually analyze.
binary variables
categorical variables that have only 2 values. EX: marital status or gender
Categorical data types:
coding- representing a categorical variable using numbers; doesn't make the data numerical. EX: 1 for female and 2 for male. qualitative data- have values described by words rather than numbers. EX: eye color- blue, brown, green
Empirical Data
data collected through observations and experiments.
modal class and the 3 types:
def- a histogram bar that is higher than those on either side. Modal classes may be artifacts of the way bin limits are chosen. 3 types: § Unimodal- a single modal class § Bimodal- two modal classes § Multimodal- more than 2 modal classes
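A rough sketch of finding modal classes from a list of bin counts, using the "higher than the bars on either side" definition above. Treating a missing neighbor at either end as a bar of height 0, and ignoring ties, are assumptions made for this example.

```python
def modal_classes(counts):
    """Return the positions of bars that are strictly higher than both neighbors."""
    peaks = []
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else 0                  # assumption: edge neighbor = 0
        right = counts[i + 1] if i < len(counts) - 1 else 0
        if c > left and c > right:
            peaks.append(i)
    return peaks

def describe_modality(counts):
    """Label a histogram by its number of modal classes."""
    n = len(modal_classes(counts))
    if n == 0:
        return "no modal class"
    return {1: "unimodal", 2: "bimodal"}.get(n, "multimodal")

print(describe_modality([2, 5, 3, 1, 4, 2]))   # peaks at positions 1 and 4
```

Changing the bin limits changes the counts, so the same data can look unimodal with one set of bins and bimodal with another.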
random sampling and 4 random sampling techniques:
def- items are chosen by randomization or a chance procedure; used to produce a sample that is representative of the population. 4 types: 1 simple random sample 2 systematic random sample 3 stratified random sample 4 cluster sample
Numerical data types:
discrete data- a variable with a countable number of distinct values. EX: the # of medical patients (cannot be a fraction or decimal). continuous data- a numerical variable that can have any value within an interval. EX: distance, time, weight (can be a decimal or fraction, not only a whole #)
Observation
is a single member of a collection of items that we want to study, such as a person, firm, or region.
Variable
is a characteristic of the subject or individual, such as an employee's income or invoice amount.
sampling frame
is the group from which the sample is taken and is an attempt to capture the essence of the population (e.g., voter registration lists or a phonebook to represent voters).
sampling frame
is the group from which we take the sample.
3 non-random sampling techniques:
judgment sample- § use expert knowledge to choose "typical" items (e.g., which employees to interview) convenience sample- § use a sample that happens to be available (e.g., ask co-workers' opinions at lunch) focus groups- § in-depth dialog w/ a representative panel of individuals (e.g., iPod users)
four levels of measurement from the weakest to the strongest:
nominal ordinal interval ratio
census
o A census is an examination of all items in a defined population.
parameter
o A measurement or characteristic of the population (e.g., a mean or proportion). Usually unknown bc we can rarely observe the entire population. Usually, but not always, represented by a Greek letter.
target population
o A population may be defined either by a list (e.g., names of passengers on a flight) or by a rule (e.g., the customers who eat at Noodles & Co.) o Target population- contains all the individuals in which we are interested.
stem-and-leaf plots
o A simple way to visualize data o Is a tool of exploratory data analysis (EDA), which seeks to reveal essential data features in an intuitive way. o Basically a frequency tally that uses digits instead of tally marks o They reveal the central tendency and dispersion of data.
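A minimal sketch of building such a plot, assuming non-negative whole numbers with the tens digit(s) as the stem and the ones digit as the leaf. The data values and function name are made up for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one text line per stem; leaves are the ones digits in sorted order."""
    if not values:
        return []
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)        # stem = tens, leaf = ones
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

data = [23, 12, 47, 15, 21, 30, 23]          # hypothetical sample
for line in stem_and_leaf(data):
    print(line)
```

Each row works like a tally: longer rows mean more data values in that range, so the plot shows center and spread at a glance.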
population
o All of the items that we are interested in. May be either finite (e.g., all the passengers on a particular plane) or effectively infinite (e.g., all of the Cokes produced in an ongoing bottling process)
bin limits and bin widths
o Bin limits define the values to be included in each bin. o Usually, all bin widths are the same and their limits cannot overlap
bivariate data set
o Bivariate- has two variables § Ex: income, Age § Typical tasks: scatter plots, correlation
tables
o Can sometimes present information better than a graph, but even when a graph is preferable, a summary table is often a valuable way to make data available for further analysis. pivot tables
univariate data (stem, dot plot, histogram) can be described w/ four characteristics:
o Center- Where are the data values concentrated? o Variability- How spread out are the data values? o Shape- Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal? o Measurement- what are the units of measurement?
frequency distributions
o Frequency distribution is a table formed by classifying n data values into k classes called bins. o Frequencies also can be expressed as relative frequencies or percentages of the total number of observations.
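A rough sketch of classifying n data values into k equal-width bins. The data, the bin limits, and the convention that each bin includes its lower limit but not its upper limit are assumptions chosen for this example.

```python
def frequency_distribution(values, lower, width, k):
    """Count values in k bins of equal width; bin i covers [lower + i*width, lower + (i+1)*width)."""
    counts = [0] * k
    for v in values:
        i = int((v - lower) // width)
        if not 0 <= i < k:
            raise ValueError(f"{v} falls outside the bin limits")
        counts[i] += 1
    return [(lower + i * width, lower + (i + 1) * width, counts[i]) for i in range(k)]

data = [3, 7, 12, 15, 18, 21, 29]            # hypothetical sample, n = 7
table = frequency_distribution(data, lower=0, width=10, k=3)
for lo, hi, freq in table:
    print(f"{lo} to under {hi}: {freq}")
```

Because the limits do not overlap, every value lands in exactly one bin and the frequencies add up to n.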
frequency polygon and ogive
o Frequency polygon- is a line graph that connects the midpoints of the histogram bin intervals, plus extra intervals at the beginning and end so that the line will touch the X-axis. § Used when you need to compare 2 data sets (bc more than 1 frequency polygon can be plotted on the same scales) o Ogive- is a line graph of cumulative frequencies. § Useful for finding percentiles or in comparing the shape of the sample w/ a known benchmark like the normal distribution.
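A minimal sketch of the numbers behind an ogive: running totals of the bin frequencies, optionally divided by n. The bin counts are hypothetical and no plotting is done here.

```python
from itertools import accumulate

def cumulative_frequencies(counts):
    """Running total of bin frequencies (the heights plotted on an ogive)."""
    return list(accumulate(counts))

def cumulative_relative(counts):
    """Cumulative frequencies as proportions of the total; the last value is 1."""
    total = sum(counts)
    return [c / total for c in accumulate(counts)]

bin_counts = [2, 3, 2]                       # hypothetical frequencies for three bins
cum = cumulative_frequencies(bin_counts)
cum_rel = cumulative_relative(bin_counts)
print(cum, cum_rel)
```

Reading across from a cumulative proportion such as 0.50 to the curve is how an ogive is used to estimate percentiles.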
dot plots
o Is a simple graphical display of n individual values of numerical data. o They show variability (by displaying the range of the data), the center (by revealing where the data values tend to cluster and where the midpoint lies), and the shape (of the distribution if the sample is large enough) of the data.
multivariate data set
o Multivariate- has more than two variables § Ex: income, age, gender § Typical tasks: regression modeling
deceptive graphs (errors)
o Nonzero origin
  § Exaggerates the trend
  § Measured distances do not match the stated values or axis demarcations
o Elastic graph proportions
  § By shortening the x axis in relation to the y axis, vertical change is exaggerated.
  § Keep aspect ratio below 2.00
  § For a time series (x axis), this can make a sluggish sales or profit curve appear steep.
o Dramatic titles and distracting pictures
  § A title should be short and adequate
  § Pictures distract the reader or impart an emotional slant
o 3-D and novelty graphs
  § Depth introduces ambiguity in bar height (do we measure from the back or the front?)
  § Novelty charts like pyramid charts should be avoided since they distort the bar volume and make it harder to measure bar height.
o Rotated graphs
  § Making a graph 3-D and rotating it will make the trends appear to dwindle into the distance or loom alarmingly toward you
o Unclear definitions or scales
  § Missing or unclear units of measurement can render a chart useless
  § We must know the variable being plotted
o Vague sources
  § May indicate that the author lost the citation, didn't know the data source, or mixed data from several sources
o Complex graphs
  § Complicated visuals make the reader work harder.
  § Omit bonus detail or put it in the appendix
  § Apply the 10-second rule
o Gratuitous effects
  § Color and special effects attract attention.
  § Once the novelty wears off, audiences may find special effects annoying
o Estimated data
  § Estimated points should be noted
o Area trick
  § One of the most pernicious visual tricks is simultaneously enlarging the width of the bars as their height increases, so the bar misstates the true proportion.
  § It distorts the area
stacked column chart
o The bar height is the sum of several subtotals. o Areas may be compared by color to show patterns in subgroups, as well as showing the total. o Effective for any # of groups but works best when you have only a few.
univariate data set
o Univariate- has one variable § Ex: income § Typical tasks: histograms, basic stats (i.e. an avg)
Histograms
o A graphical representation of a frequency distribution. o A column chart whose y-axis shows the number of data values (or %) within each bin of a frequency distribution and whose x-axis ticks show the end points of each bin. o There should be no gaps between bars, unless a bin contains no data.
stacked dot plot
o can be used to compare 2 or more groups. § Ex: comparing median home prices for 150 US cities in 4 different regions
pie chart
o can only convey a general idea of data bc it is hard to assess areas precisely. o Are ineffective when they have too many slices o Correct use: to portray data that sum to a total (e.g. percent mkt shares)
pareto charts
o Displays categorical data, w/ categories shown in descending order of frequency, so that the most common categories appear first.
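A minimal sketch of the ordering step behind a Pareto chart, using `collections.Counter` from the standard library. The complaint categories are made up, and no chart is drawn here.

```python
from collections import Counter

def pareto_order(observations):
    """Count each category and list them from most to least frequent."""
    return Counter(observations).most_common()

# Hypothetical customer complaints.
complaints = ["late", "damaged", "late", "wrong item", "late", "damaged"]
ordered = pareto_order(complaints)
for category, freq in ordered:
    print(category, freq)
```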
outlier
o is an extreme value that is far enough from the majority of the data that it probably arose from a different cause or is due to measurement error.
bias
o refers to a systematic tendency to over- or underestimate a population parameter of interest.
skewness def and the 3 main types
o skewness is indicated by the direction of its longer tail. § If neither tail is longer, the histogram is symmetric. § Right-skewed (positively skewed)- histogram has a longer right tail, w/ most data clustered on the left side. § Left skewed (negatively skewed)- histogram has a longer left tail, w/ most data values clustered on the right side.
ratio measurement
the strongest level of measurement. def- possesses a meaningful zero that represents the absence of the quantity being measured. EX: balance sheet data ($20 mil in revenue is twice as much as $10 mil, and $0 means no revenue at all)
column charts/ bar charts
· Column chart- is a vertical display of data o Attribute data are displayed using a column to represent a category or attribute o The height of each column reflects a frequency or a value for that category. · Bar chart- is a horizontal display of data
scatter plots
· shows n pairs of observations (x1, y1), (x2, y2),..., (xn, yn) as dots or some other symbol on an X-Y graph. · Very important in stats · Is a starting point for bivariate data analysis · They investigate the relationship b/w 2 variables; we would like to know if there is an association b/w the 2 variables, and if so, what kind of association exists.