STATISTICS

histogram

-range of possible values is divided into evenly-sized intervals - vertical axis shown as a frequency or relative frequency - individuals are counted, creating the height of the column over the range
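
As a rough sketch of the counting step, assuming made-up exam scores and a hypothetical helper named histogram_counts (neither comes from the course material):

```python
def histogram_counts(values, start, width, n_bins):
    """Count the values falling in each of n_bins equal-width intervals."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if v == start + width * n_bins:
            idx = n_bins - 1  # put the right-most edge in the last interval
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


scores = [52, 58, 61, 67, 70, 73, 75, 78, 84, 91]  # hypothetical data
counts = histogram_counts(scores, start=50, width=10, n_bins=5)  # frequencies
relative = [c / len(scores) for c in counts]  # relative frequencies
print(counts)
print(relative)
```

Each count (or relative frequency) would be the height of one column.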

What is the fundamental difference between a matched pairs experiment and a case-control study?

Case-control studies are observational studies, not experiments

raw data

The original data as it was collected -must have: a unique identifier for each individual, all data for an individual in one row, each variable in one column

Which of the following is correct about categorical data?

The proportion of a given outcome for a sample of observations is given the symbol P-HAT, whereas the proportion of a given outcome for an entire population is given the symbol P.

frequency

When summarizing categorical data, counts can also be called

percentile

a measure of placement in a sorted quantitative dataset

When examining a time plot, we especially look for evidence of

a trend, cyclical pattern, or both

issues that affect our ability to estimate the properties of a population when using only sample data

amount of information made available (sample size), variability of data collected, potential bias

mistake

an error was made during the study or when recording the value -can be corrected or discarded -always keep detailed records

mean

average

The outcomes in a graph representing categorical data

can be sorted in any order

association, however strong, does NOT imply

causation

center (distribution of quantitative data)

center of mass (mean) vs midpoint (median)
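
A small illustration of the difference, using hypothetical values: one extreme value pulls the mean a long way but barely moves the median.

```python
import statistics

incomes = [30, 32, 35, 38, 40]   # hypothetical values, in thousands
with_outlier = incomes + [400]   # same data plus one extreme value

# Mean (center of mass) and median (midpoint) before and after.
print(statistics.mean(incomes), statistics.median(incomes))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```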

response bias

fancy term for lying or forgetting (especially on sensitive or personal issues); can be exacerbated by survey method (in person vs. by phone or online)

when describing association, need to find

form, direction, strength, outliers

form

general shape of the plot of points (linear, curved, clusters, no pattern)

"suspected" outlier flag

more than 1.5 × IQR above Q3 or more than 1.5 × IQR below Q1 (IQR = Q3 - Q1)
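
A common convention flags a value as a suspected outlier when it lies more than 1.5 × IQR below Q1 or above Q3. A minimal sketch, assuming hypothetical data and Python's default quartile method (textbooks and software use slightly different quartile conventions, so the fences can differ a little):

```python
import statistics


def suspected_outliers(values):
    """Return the values outside the 1.5 x IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


data = [10, 12, 13, 14, 15, 16, 18, 40]  # hypothetical data
print(suspected_outliers(data))
```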

strength

how closely the points fit the "form" -weak (lots of scatter) -strong (little scatter)

median

measure of center -splits the data set into sets of equal number of data points -50th percentile (50% smaller, 50% larger)

five number summary

minimum, Q1 (25th percentile), median, Q3 (75th percentile), maximum
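
One way to compute the summary, as a sketch with a hypothetical helper named five_number_summary; the quartiles follow Python's default method and may differ slightly from a hand calculation that uses another convention.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)
    return data[0], q1, median, q3, data[-1]


print(five_number_summary([7, 1, 4, 2, 6, 3, 5]))  # hypothetical data
```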

quantitative variable

must be on scaled axis: -dotplot -histogram

The variable EyeColor in the UCI student dataset is an example of

nominal data

nonresponse

some people chose not to answer/participate

variance

standard deviation squared
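
The relationship can be checked numerically. This sketch uses hypothetical data and the standard library, which offers both the sample version (divide by n - 1) and the population version (divide by n):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

s = statistics.stdev(data)       # sample standard deviation
sigma = statistics.pstdev(data)  # population standard deviation

# In both cases the variance is the standard deviation squared.
print(s ** 2, statistics.variance(data))
print(sigma ** 2, statistics.pvariance(data))
```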

With a simple random sample, nobody can game the system to purposefully amplify their answers over that of others because

the individuals are selected entirely by chance

To learn about the target population, anecdotal data and volunteer data are fundamentally biased data collection processes because

the individuals used are typically unusual and may have ulterior motives.

timeplots

trends or major changes over time, seasonal or cyclical variations

explanatory variable (independent variable)

x-value of a function

response variable (dependent variable)

y-value of a function

matched pairs experiment

(comparisons made at the individual level) imposed conditions are compared on pairs/sets of RELATED individuals -the subjects in each pair are closely matched, then randomly assigned a treatment -detects subtle effects of variables

Which of the following is correct notation for the standard deviation?

The standard deviation of a sample of observations is labeled with the English letter S, whereas the standard deviation of an entire population is labeled with the Greek letter SIGMA.

What is the main reason that it is easier to reach a conclusion of causation from an experimental study than from an observational study?

When conditions are imposed at random like in a randomized experiment, the conditions are not confounded with any important lurking variable

relative frequency

When summarizing categorical data, proportions can also be called
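
A quick sketch of counts versus proportions, assuming a made-up list of eye colors (not the actual UCI student dataset):

```python
from collections import Counter

# Hypothetical categorical data, one value per individual.
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue", "hazel", "brown"]

frequency = Counter(eye_colors)  # counts per outcome
n = len(eye_colors)
relative_frequency = {color: count / n for color, count in frequency.items()}

print(frequency)
print(relative_frequency)
```

Each relative frequency here is a sample proportion (p-hat) for that outcome.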

What is the fundamental difference between a sample survey of human beings that may suffer from nonresponse and data using a volunteer sample?

You can't participate in a sample survey unless you were selected by chance, and you can never answer any given question more than once

A bar graph is more versatile than a pie chart to plot categorical data because

a bar graph can display more than one categorical variable whereas a pie chart cannot

modified boxplot

a display for quantitative data that graphs the five-number summary on an axis and shows outliers if they exist (does not include them in whiskers)

scatterplot

a graphed cluster of dots, each of which represents the values of two variables -specifically two quantitative variables recorded for each individual (1 dot = 1 individual; the two variables for that individual form the (x, y) coordinates)

percentile AKA quantile

a measure of placement in the ordered data set -value splitting data set in two, some smaller or equal to the percentile -pth percentile= p% smaller or equal to
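
Read the other way around, the definition gives the percentile at which a value sits. A minimal sketch, with a hypothetical helper named percentile_rank and made-up scores:

```python
def percentile_rank(values, x):
    """Percent of the values that are smaller than or equal to x."""
    return 100 * sum(v <= x for v in values) / len(values)


scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]  # hypothetical data
print(percentile_rank(scores, 75))
```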

linear correlation coefficient (r)

a measure of the strength and direction of the linear relation between two quantitative variables -unitless (calculated relative to the mean and standard deviation of both variables) -bounded by -1 and 1 -meaningful only for linear or random patterns, not curved ones -always plot the data before computing any summary value -outliers have an impact
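
One standard way to compute r is to average the products of the standardized x and y values (dividing by n - 1). The sketch below assumes made-up data and a hypothetical function named correlation:

```python
import statistics


def correlation(xs, ys):
    """Linear correlation coefficient r from standardized values."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


heights_cm = [150, 160, 170, 180, 190]  # hypothetical data
weights_kg = [52, 58, 69, 77, 90]
r = correlation(heights_cm, weights_kg)

# Changing units (cm to m) leaves r unchanged, since r is unitless.
r_metres = correlation([h / 100 for h in heights_cm], weights_kg)
print(r, r_metres)
```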

extrapolation

a model-based prediction for a value outside the range of data used to create the model

double-blind procedure

an experimental procedure in which both the research participants and the research staff are ignorant (blind) about whether the research participants have received the treatment or a placebo. Commonly used in drug-evaluation studies. (informed consent for human subjects- ethical matter)

bar graphs

bar heights indicate a summary value for each outcome shown (some or all outcomes are shown, count/proportion of outcomes must be shown, versatile but can easily be misrepresented) -can be used to display anything for which the height of the bar represents a numerical value

wording effects

biased or leading questions, complicated/confusing statements can influence survey results

actual relationship

both variables are influenced by another variable (confounding) -one variable actually influences the other, directly or indirectly (need to explore all possibilities before reaching causality when interpreting data in context)

stratification

constrain the random sample so that it has x%, y%, z% of individuals of certain types (typically to fit the population makeup)

2 categorical variables

contingency/two-way tables, bar graphs, comparing conditional proportions (comparing distribution of eye color between male and female students)

systematic random sample

create your own sample by taking every nth individual on the population list (beware of potential patterns/cycles in the population) -does not guarantee a totally unbiased sample, but the sample is unlikely to be biased
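
A sketch of the idea, assuming a random starting point within the first n individuals (a common way to keep the selection random) and a hypothetical helper named systematic_sample:

```python
import random


def systematic_sample(population, step, rng):
    """Pick a random start among the first `step` items, then every step-th item."""
    start = rng.randrange(step)
    return population[start::step]


population = list(range(1, 101))  # hypothetical list of 100 ID numbers
rng = random.Random(0)            # seeded only to make the example repeatable
sample = systematic_sample(population, step=10, rng=rng)
print(sample)
```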

illusion

created by inappropriately lumping things together -resulting in a deceptive pattern or the appearance of no pattern

population data

data from every individual of interest -cons: expensive, time consuming, maybe impossible -pro: exact knowledge of the population

case-control observational study

data is recorded in an observational setting using 2 distinct random samples of individuals that differ by some feature; individuals with the feature are CASES, those without are CONTROLS

experimental study

deliberate treatments are imposed on the individuals and their responses are recorded -pros: influential factors can be controlled, concluding causation is possible -cons: unrealistic/simplistic setting

Boxplot

displays the 5-number summary as a central box with whiskers that extend to the non-outlying data values

multivariate analysis

examining a pattern in a single variable is interesting; multivariate analysis goes further: -comparing patterns across different groups -studying how a pattern changes over time -looking for patterns of association made by two or more variables *higher level of complexity

direction

if there is a pattern to the plot of points (positive, negative, no direction)

r^2

indicates what fraction of the variation in y can be explained by the linear regression model
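
One way to see this is to compute r^2 as 1 minus (unexplained variation / total variation in y). The sketch below assumes made-up data and a hypothetical function named r_squared:

```python
def r_squared(xs, ys):
    """Fraction of the variation in y explained by the least squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # Variation left over after fitting the line (squared residuals).
    unexplained = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its mean.
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - unexplained / total


print(r_squared([1, 2, 3, 4], [2, 4, 5, 8]))  # hypothetical data
```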

atypical individual

individual is fundamentally not representative of target population -belongs to different subgroup or is known to be atypical -values may be discarded if summary is desired for the majority group only

dotplot

individual value plot -each individual placed on scaled axis -for large data, one dot= multiple individuals

for every study, we need to identify...

individuals and variables studied, type and design of the study (objective)

completely randomized experiment

individuals are assigned to the different treatment groups completely at random, which creates independent random samples
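
A minimal sketch of random assignment, assuming 20 hypothetical subjects, two made-up treatment labels, and a helper named randomize:

```python
import random


def randomize(subjects, treatments, rng):
    """Shuffle the subjects, then deal them out to the treatment groups."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups


subjects = [f"subject_{i}" for i in range(1, 21)]  # hypothetical IDs
groups = randomize(subjects, ["treatment", "placebo"], random.Random(1))
print({name: len(members) for name, members in groups.items()})
```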

Longitudinal cohort study

individuals are observed repeatedly over time, examine the compounded effect of naturally occurring factors over time

independent samples

individuals compared are unrelated; the comparison is made at the group level ex: comparing GPA at the end of the first year in a random sample of freshmen who did or did not attend a seminar on learning skills and study habits

When describing quantitative data, an outlier

is a data point that does not fit the main pattern of the data (investigate; don't throw out unless justified)

double-blind experiment

neither the participants nor the experimenters know who is receiving which treatment until the study is over

quantitative data

numerical data (values of many individuals can be averaged) -discrete (only record whole numbers) -continuous (anything over an interval: decimals)

parameter

numerical summary of a population

sample surveys (polls)

observational design

case-control studies

observational studies using two distinct random samples that differ by one important feature

volunteer response sample

open to anyone who wants to participate, fundamentally biased, open to manipulation, potentially hugely different from target population ex: write-ins, polls, bots, tweets

outliers (distribution of quantitative data)

points that do not fit the main pattern

What is NOT an important issue affecting our ability to estimate the properties of a population when using only sample data?

population size

categorical data

qualitative description recorded for each individual (individuals with a given feature can be counted) -nominal (attribute) -ordinal (ranked)

coincidence

random occurrence of unrelated things

simple random sample

randomly selecting individuals in the population so that each has the same probability of being selected and all possible samples of size n have the same chance of being drawn ex: in a class of 100 students the instructor uses the roster to randomly pick 5 students' midterms to check that they were graded properly
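
The roster example could be sketched like this, assuming made-up student IDs; random.sample draws without replacement, so every set of 5 students is equally likely.

```python
import random

roster = [f"student_{i:03d}" for i in range(1, 101)]  # hypothetical roster of 100

rng = random.Random(42)           # seeded only so the example is repeatable
chosen = rng.sample(roster, k=5)  # 5 distinct students, chosen by chance
print(chosen)
```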

spread (distribution of quantitative data)

range (min to max)

observational study

record data on individuals without attempting to influence the responses -cons: lots of unknown factors, concluding causation is very difficult -pro: realistic setting

association (relationship)

refers to the idea that variables can vary together with some level of synchronicity; the existence of an overall pattern -deterministic (exact pattern) -statistical: example being weather (overall, but not an exact pattern)

legitimate value

represents the natural variability for the group and the variable measured -provides important information about location and spread -do not discard

conditional distribution

row percents and column percents of one factor, given the levels of the other factor

2 quantitative variables

scatterplot, correlation and regression (examining the distribution of height and weight among students)

convenience sampling

select a set of easily accessible individuals, representative of similar individuals but not the whole population ex: using college students for human behavioral studies

1 categorical variable and 1 quantitative variable

side-by-side dotplots or boxplots, comparing means or medians, variability and spread

pie charts

slices are scaled to proportion of each outcome that make up the categorical variable (all outcomes must be shown, count/proportion must be shown, only represent one variable in one group)

when choosing a variable, things that need to be considered

study's ultimate objective, what aspects of the goal can be recorded, would quantitative or categorical point of view be better, cost, speed, and accuracy

marginal distribution

summarizes each factor independently with proportions or percents

statistic

summary of values of sample data

shape (distribution of quantitative data)

symmetrical (homogeneous), skewed (right or left), multimodal (several types of occurrences), irregular

replication and randomization prevent what?

systematic bias and confounding ex: several individuals are studied for each condition, individuals are assigned to treatment using probability

sample data

the data are from only some of the individuals of interest -pros: cheaper, faster, typically doable -cons: uncertainty about the population

confounding variables

the effects of two or more variables on the response variable cannot be distinguished -major issue because there is no clear conclusion, especially in observational studies -makes it difficult to conclude causation

matched pairs, cross-over, repeated measures, time series

the individuals compared across conditions are clearly RELATED or identical; comparison is made at the individual level ex: comparing total sleep times the week before and the week after finals in a random sample of freshmen (same students both times)

probability sampling

the individuals/units are randomly selected therefore the sampling process is unbiased

undercoverage

the sampling process systematically leaves out or under-represents part of the population

standard deviation

the square root of the variance

least squares regression line

the unique line such that the sum of the squared vertical distances between the data points (residuals) and the line is as small as possible
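
For a single explanatory variable, the line has a closed-form solution: slope = sum of (x - mean x)(y - mean y) divided by sum of (x - mean x)^2, and the line passes through (mean x, mean y). A sketch with made-up data and hypothetical function names:

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances from the points to a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


xs, ys = [1, 2, 3, 4], [2, 4, 5, 8]  # hypothetical data
intercept, slope = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, intercept, slope)
other = sum_squared_residuals(xs, ys, intercept + 0.5, slope)  # a shifted line
print(intercept, slope, best, other)
```

Any other line, such as the shifted one above, gives a larger sum of squared residuals.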

bias

unconscious or conscious, should be prevented at all costs

anecdotal evidence

uniquely personal cases not representative of the target population ex: celebrity endorsements

The ideal way to organize electronically the raw data for a study is to

use one table and assign one row for each individual so that all the data for that individual is in that row.

two-way tables (contingency tables)

used to organize the counts of joint outcomes of two categorical variables (factors); 1 factor = rows, 1 factor = columns -each cell represents the intersection of a level of one factor with a level of the other factor
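
A small sketch of building such a table from raw records, then deriving a marginal and a conditional distribution from it. The (sex, eye color) records are made up for illustration:

```python
from collections import Counter

# Hypothetical raw data: one (sex, eye color) pair per student.
records = [("F", "brown"), ("F", "blue"), ("F", "brown"), ("F", "green"),
           ("M", "brown"), ("M", "blue"), ("M", "blue"), ("M", "green")]

table = Counter(records)                       # joint counts, one per cell
row_totals = Counter(sex for sex, _ in records)
col_totals = Counter(color for _, color in records)
n = len(records)

# Marginal distribution of eye color: ignores sex entirely.
marginal_eye = {color: count / n for color, count in col_totals.items()}

# Conditional distribution of eye color given sex (row percents as proportions).
conditional = {sex: {color: table[(sex, color)] / row_totals[sex]
                     for color in col_totals}
               for sex in row_totals}

print(table)
print(marginal_eye)
print(conditional)
```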

cross-sectional survey

uses 1 random sample drawn once from a population, comparisons can be made from the subgroups after the data set is collected

regression line

y= constant + slope * x -slope: how much we expect y to change on average for every unit increase in x -y-intercept: may or may not have a meaning in context -can be used to make predictions within range (on average)
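
A sketch of using the line for prediction while refusing to extrapolate. The intercept, slope, and data range below are hypothetical, and predict is an illustrative helper:

```python
def predict(x, intercept, slope, x_min, x_max):
    """Predict the average y for x, but only inside the observed x range."""
    if not (x_min <= x <= x_max):
        raise ValueError("x is outside the data range; this would be extrapolation")
    return intercept + slope * x


# Hypothetical fitted line y = 1.0 + 2.0 * x, built from x values in [0, 10].
print(predict(3, intercept=1.0, slope=2.0, x_min=0, x_max=10))

try:
    predict(11, intercept=1.0, slope=2.0, x_min=0, x_max=10)
except ValueError as err:
    print(err)
```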

