STA2023 Test1
Matched Pairs Design
Compare two treatment groups by using subjects matched in pairs that are somehow related or have similar characteristics.
Converting a Percentile to a Data Value
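Compute the locator L = (k ÷ 100) × n, where k is the percentile being used and n is the total number of values in the sorted data set. If L is a whole number, Pk is midway between the Lth value and the next value in the sorted list. If L is not a whole number, round L up to the next larger whole number; Pk is then the Lth value, counting from the lowest.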
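A minimal sketch of one common textbook procedure for this conversion, assuming the locator L = (k ÷ 100) × n with k between 1 and 99. The function name `percentile_to_value` is hypothetical, and other textbooks and software use slightly different percentile rules.

```python
def percentile_to_value(values, k):
    """Return the kth percentile P_k using the locator L = (k / 100) * n.

    Assumes 1 <= k <= 99 and a non-empty list of numbers.
    """
    data = sorted(values)
    n = len(data)
    if (k * n) % 100 == 0:
        # L is a whole number: average the Lth value and the next one.
        locator = (k * n) // 100
        return (data[locator - 1] + data[locator]) / 2
    # L is not whole: round it up and take the value at that position.
    # (k * n) // 100 is the zero-based index of the rounded-up locator.
    return data[(k * n) // 100]


scores = [1, 2, 3, 4, 5, 6, 7, 8]
print(percentile_to_value(scores, 25))  # L = 2, so average the 2nd and 3rd values
print(percentile_to_value(scores, 30))  # L = 2.4, round up to the 3rd value
```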
P-Value
the probability of getting paired sample data with a linear correlation coefficient r that is at least as extreme as the one obtained from the paired sample data.
Replication
the repetition of an experiment on more than one individual. Good use requires sample sizes that are large enough so that we can see effects of treatments.
Nonsampling error
the result of human error, including such factors as wrong data entries, computing errors, questions with biased wording, false data provided by respondents, forming biased conclusions, or applying statistical methods that are not appropriate for the circumstances.
Nonrandom sampling error
the result of using a sampling method that is not random, such as using a convenience sample or a voluntary response sample.
Regression
The regression line is the straight line that "best" fits the scatterplot of the data. Its equation is ŷ = b0 + b1x, where b0 is the y-intercept and b1 is the slope.
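A short sketch of how the intercept and slope could be computed, assuming "best" means the least-squares criterion (the usual choice in introductory courses). The function name `least_squares_line` is hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # y-intercept
    return b0, b1


b0, b1 = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9])
print(f"y-hat = {b0} + {b1}x")
```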
Delete Cases
One very common method for dealing with missing data. Delete all subjects having any missing values.
Pie Charts
A very common graph that depicts categorical data as slices of a circle, in which the size of each slice is proportional to the frequency count for the category
Pareto Charts
A Pareto chart is a bar graph for categorical data, with the added stipulation that the bars are arranged in descending order according to frequencies, so the bars decrease in height from left to right.
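As an illustration of the ordering rule only (no plotting), the sketch below counts categories and lists them from most to least frequent, which is the left-to-right bar order of a Pareto chart. The function name `pareto_order` and the sample labels are made up for this example.

```python
from collections import Counter


def pareto_order(labels):
    """Return (category, frequency) pairs sorted by descending frequency."""
    return Counter(labels).most_common()


complaints = ["late", "damaged", "late", "wrong item", "late", "damaged"]
for category, count in pareto_order(complaints):
    print(category, count)
```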
Randomized Block Design
A block is a group of subjects that are similar, but blocks differ in ways that might affect the outcome of the experiment.
Skewness
A distribution of data is skewed if it is not symmetric and extends more to one side than to the other.
Histogram
A graph consisting of bars of equal width drawn adjacent to each other (unless there are gaps in the data)
Bar Graphs
A graph of bars of equal width to show frequencies of categories of categorical (or qualitative) data. The bars may or may not be separated by small gaps.
Dotplots
A graph of quantitative data in which each data value is plotted as a point (or dot) above a horizontal scale of values. Dots representing equal values are stacked
Time-Series Graph
A graph of time-series data, which are quantitative data that have been collected at different points in time, such as monthly or yearly
Frequency Polygon
A graph using line segments connected to points located directly above class midpoint values. A frequency polygon is very similar to a histogram, but a frequency polygon uses line segments instead of bars.
Simple Random Sample
A sample of n subjects is selected in such a way that *every possible sample of the same size n* has the same chance of being chosen. Note that the requirement applies to every possible sample of size n, not just to every individual. Every simple random sample is a random sample, but not every random sample is a simple random sample.
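To make "every possible sample of the same size n" concrete, this sketch lists all possible samples from a tiny population and then draws one of them at random. The function names and the four-letter population are hypothetical; `random.Random.sample` draws without replacement, so each possible sample is equally likely.

```python
import random
from itertools import combinations


def all_samples(population, n):
    """List every possible sample of size n (order ignored)."""
    return list(combinations(population, n))


def simple_random_sample(population, n, seed=None):
    """Draw one sample of size n without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, n)


people = ["A", "B", "C", "D"]
print(all_samples(people, 2))           # the 6 possible samples of size 2
print(simple_random_sample(people, 2))  # one of them, chosen at random
```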
Scatterplot (or Scatter Diagram)
A scatterplot (or scatter diagram) is a plot of paired (x, y) quantitative data with a horizontal x-axis and a vertical y-axis. The horizontal axis is used for the first variable (x), and the vertical axis is used for the second variable (y).
Completely Randomized Experimental Design
Assign subjects to different treatment groups through a process of random selection.
Voluntary Response Sample
A sample in which the respondents themselves decide whether to participate. By their very nature, all such samples are seriously flawed because we should not make conclusions about a population on the basis of samples with a strong possibility of bias.
Rigorously Controlled Design
Carefully assign subjects to different treatment groups, so that those given each treatment are similar in ways that are important to the experiment.
Multistage Sampling
Collect data by using some combination of the basic sampling methods
Data
Collections of observations, such as measurements, genders, or survey responses
Cluster Sampling
Divide the population area into sections (or clusters), then randomly select some of those clusters, and choose all the members from those selected clusters.
Pictographs
Drawings of objects, called pictographs, are often misleading. Data that are one-dimensional in nature (such as budget amounts) are often depicted with two-dimensional objects (such as dollar bills) or three-dimensional objects (such as stacks of coins, homes, or barrels).
Double-Blind
Neither the experimenter nor the subject knows who receives the treatment and who receives the placebo. The subject doesn't know whether he or she is receiving the treatment or a placebo, and the experimenter (for example, the physician) does not know whether he or she is administering the treatment or a placebo.
CVDOT
Explore the data by analyzing the histogram to see what can be learned about "CVDOT": the Center of the data, the Variation, the shape of the Distribution, whether there are any Outliers, and Time.
Calculating a z score
For a sample: z = (x − x̄) ÷ s. For a population: z = (x − µ) ÷ σ. Note: s = standard deviation for sample; σ = standard deviation for population; µ = arithmetic mean for population; x̄ (x-bar) = arithmetic mean for sample.
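A small sketch of both formulas, using Python's standard `statistics` module for the sample mean and sample standard deviation. The function names and the data are illustrative only; results are rounded to two decimal places.

```python
import statistics


def z_score_sample(x, sample):
    """z = (x - x_bar) / s, using the sample mean and sample standard deviation."""
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # divides by n - 1
    return round((x - x_bar) / s, 2)


def z_score_population(x, mu, sigma):
    """z = (x - mu) / sigma, using known population parameters."""
    return round((x - mu) / sigma, 2)


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score_sample(9, data))
print(z_score_population(9, mu=5, sigma=2))
```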
Practical Significance
It is possible that some treatment or finding is effective, but common sense might suggest that the treatment or finding does not make enough of a difference to justify its use or to be practical.
Levels of Measurement
Nominal: categories only. Ordinal: categories with some order. Interval: differences but no natural zero point. Ratio: differences and a natural zero point.
10-90 percentile range
P90 - P10
Finding the Percentile of a Data Value
The process of finding the percentile that corresponds to a particular data value x is given by the following (round the result to the nearest whole number): Percentile of value x = (number of values less than x ÷ total number of values) × 100
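The calculation can be sketched as below. The function name `percentile_of_value` is hypothetical, and the rounding of exact halves upward is an assumption chosen to match the usual classroom rounding rule; integer arithmetic is used so that ties are not affected by floating-point error.

```python
def percentile_of_value(values, x):
    """Percentile of x = (count of values less than x / total count) * 100,
    rounded to the nearest whole number (halves round up)."""
    n = len(values)
    below = sum(1 for v in values if v < x)
    # floor(100 * below / n + 0.5), computed with integers only.
    return (200 * below + n) // (2 * n)


scores = [1, 2, 3, 4, 5, 6, 7, 8]
print(percentile_of_value(scores, 5))  # 4 of the 8 values are below 5
```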
Percentages
Some studies cite misleading percentages. Note that 100% of some quantity is all of it, but if there are references made to percentages that exceed 100%, such references are often not justified.
Midquartile range
(Q3 + Q1) ÷ 2
Interquartile Range (IQR)
Q3 - Q1
Semi-interquartile range
(Q3 - Q1) ÷ 2
The Gold Standard
Randomization with placebo/treatment groups is sometimes called the "gold standard" because it is so effective. (A placebo such as a sugar pill has no medicinal effect.)
Stemplots (or stem-and-leaf plot)
Represents quantitative data by separating each value into two parts: the stem (such as the leftmost digit) and the leaf (such as the rightmost digit).
Round-off Rule for z Scores *(and really any value in statistics)*
Round z scores to two decimal places (such as 2.31). If the first digit being dropped is 5 or greater, round up; if it is less than 5, round down.
Q2 (Second quartile):
Same as P50 and same as the median. It separates the bottom 50% of the sorted values from the top 50%.
Q3 (Third quartile):
Same as P75. It separates the bottom 75% of the sorted values from the top 25%.
Q1 (First quartile):
Same value as P25. It separates the bottom 25% of the sorted values from the top 75%.
Systematic Sampling
Select some starting point and then select every kth element in the population.
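A minimal sketch of the idea using list slicing. The function name `systematic_sample` and the zero-based `start` index are assumptions made for this example; in practice the starting point is often chosen at random.

```python
def systematic_sample(population, k, start=0):
    """Select the element at index `start`, then every kth element after it."""
    return population[start::k]


ids = list(range(1, 21))  # subjects numbered 1 through 20
print(systematic_sample(ids, k=5, start=2))  # every 5th subject, beginning with subject 3
```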
Using z Scores to Identify Significant Values
Significant values are those with z scores ≤ −2.00 or ≥ 2.00.
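The cutoff rule can be written as a small helper. The function name `classify_by_z` and the returned labels are illustrative, not standard terminology from any library.

```python
def classify_by_z(z):
    """Label a z score using the cutoffs -2.00 and 2.00."""
    if z <= -2.00:
        return "significantly low"
    if z >= 2.00:
        return "significantly high"
    return "not significant"


for z in (-2.31, 0.45, 2.00):
    print(z, classify_by_z(z))
```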
Stratified Sampling
Subdivide the population into at least two different subgroups (or strata) so that the subjects within the same subgroup share the same characteristics. Then draw a sample from each subgroup (or stratum).
Population
The complete collection of all measurements or data that are being considered. Typically, a population is the complete collection of data that we would like to make inferences about.
Class width
The difference between two consecutive lower class limits in a frequency distribution. With the number of classes usually between 5 and 20, class width ≈ (Max − Min) ÷ number of classes, rounded up to a convenient number.
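A rough sketch of the class-width calculation and the lower class limits that follow from it. The function names are hypothetical, and rounding up to the next whole number is an assumption; a textbook may round up to some other convenient value.

```python
import math


def class_width(minimum, maximum, num_classes):
    """Approximate class width: (max - min) / number of classes, rounded up."""
    return math.ceil((maximum - minimum) / num_classes)


def lower_class_limits(first_limit, width, num_classes):
    """Consecutive lower class limits differ by exactly the class width."""
    return [first_limit + i * width for i in range(num_classes)]


width = class_width(52, 100, 5)          # (100 - 52) / 5 = 9.6, rounded up
print(width)
print(lower_class_limits(50, width, 5))  # starting at a convenient value of 50
```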
Upper class limits
The largest numbers that can belong to each of the different classes
Class boundaries
The numbers used to separate the classes, but without the gaps created by class limits.
Statistics
The science of planning studies and experiments, obtaining data, and organizing, summarizing, presenting, analyzing, and interpreting those data and then drawing conclusions based on them
Lower class limits
The smallest numbers that can belong to each of the different classes
Class midpoints
The values in the middle of the classes. Each class midpoint can be found by adding the lower class limit to the upper class limit and dividing the sum by 2: (Lower class limit + Upper class limit) ÷ 2
Convenience Sampling
Use data that are very easy to get.
Impute Missing Values
We "impute" missing data values when we substitute values for them.
boxplot (or box-and-whisker diagram)
a graph of a data set that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile Q1, the median, and the third quartile Q3.
Parameter
a numerical measurement describing some characteristic of a population
Statistic
a numerical measurement describing some characteristic of a sample
Blinding
a technique in which the subject doesn't know whether he or she is receiving a treatment or a placebo. a way to get around the placebo effect, which occurs when an untreated subject reports an improvement in symptoms.
outlier in a modified boxplot
a data value that is above Q3 by an amount greater than 1.5 × IQR, or below Q1 by an amount greater than 1.5 × IQR.
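The 1.5 × IQR rule can be sketched as follows. Q1 and Q3 are passed in as arguments because quartile values depend on the method used to compute them; the function name `find_outliers` and the data are made up for this example.

```python
def find_outliers(values, q1, q3):
    """Return the values more than 1.5 * IQR below Q1 or above Q3."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


data = [1, 10, 12, 13, 14, 15, 40]
print(find_outliers(data, q1=10, q3=15))  # fences are 2.5 and 22.5
```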
Statistical Significance
achieved in a study if the likelihood of an event occurring by chance is 5% or less
Experiment
apply some treatment and then proceed to observe its effects on the individuals. (The individuals in experiments are called experimental units, and they are often called subjects when they are people.)
Nominal Level
characterized by data that consist of names, labels, or categories only, and the data cannot be arranged in some order (such as low to high). Example: Survey responses of yes, no, and undecided
Categorical (or qualitative or attribute) data
consists of names or labels (not numbers that represent counts or measurements). Example: The gender (male/female) of professional athletes
Quantitative (or numerical) data
consists of numbers representing counts or measurements. Example: The weights of supermodels Example: The ages of respondents
5-number summary
consists of these five values: 1.) Minimum 2.) First quartile, Q1 3.) Second quartile, Q2 (same as the median) 4.) Third quartile, Q3 5.) Maximum
Ratio Level
data can be arranged in order, differences can be found and are meaningful, and there is a natural zero starting point (where zero indicates that none of the quantity is present). Differences and ratios are both meaningful. Example: Class times of 50 minutes and 100 minutes
skewed data
data that is not symmetric and extends more to one side than to the other
Linear Correlation Coefficient r
denoted by r, and it measures the strength of the linear association between two variables. Note: r is always between −1 and 1. If r is close to −1 or close to 1, there appears to be a linear correlation. If r is close to 0, there does not appear to be a linear correlation.
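For reference, a short sketch of one standard way to compute r from paired data, using deviations from the means. The function name `linear_correlation` is hypothetical, and the example data are invented to show the two extreme cases.

```python
import math


def linear_correlation(xs, ys):
    """Return r for paired (x, y) data using sums of deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


print(linear_correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # points on a rising line
print(linear_correlation([1, 2, 3, 4], [8, 6, 4, 2]))  # points on a falling line
```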
Correlation
exists between two variables when the values of one variable are somehow associated with the values of the other variable.
Linear Correlation
exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line.
Data science
involves applications of statistics, computer science, and software engineering, along with some other relevant fields (such as sociology or finance).
Interval Level
involves data that can be arranged in order, and the differences between data values can be found and are meaningful. However, there is no natural zero starting point at which none of the quantity is present. Example: Years 1000, 2000, 1776, and 1492
Ordinal Level
involves data that can be arranged in some order, but differences (obtained by subtraction) between data values either cannot be determined or are meaningless. Example: Course grades A, B, C, D, or F
Pk variable
kth percentile (Example: P25 is the 25th percentile.)
L variable
locator that gives the position of a value (Example: For the 12th value in the sorted list, L = 12.)
Percentiles
measures of location, denoted P1, P2, . . . , P99, which divide a set of data into 100 groups with about 1% of the values in each group.
Quartiles
measures of location, denoted Q1, Q2, and Q3, which divide a set of data into four groups with about 25% of the values in each group
skewed to the left
negatively skewed; the distribution has a longer left tail
Observational Study
observing and measuring specific characteristics without attempting to modify the individuals being studied
Nonresponse
occurs when someone either refuses to respond or is unavailable.
Sampling error (or random sampling error)
occurs when the sample has been selected with a random method, but there is a discrepancy between a sample result and the true population result; such an error results from chance sample fluctuations
k variable
percentile being used (Example: For the 25th percentile, k = 25.)
skewed to the right
positively skewed; the distribution has a longer right tail
Big Data
refers to data sets so large and so complex that their analysis is beyond the capabilities of traditional software tools. Analysis of big data may require software simultaneously running in parallel on many different computers.
modified boxplot
regular boxplot constructed with these modifications: 1.) A special symbol (such as an asterisk or point) is used to identify outliers as defined above 2.) the solid horizontal line extends only as far as the minimum data value that is not an outlier and the maximum data value that is not an outlier.
Continuous Data
result from infinitely many possible quantitative values, where the collection of values is not countable. Example: The lengths of distances from 0 cm to 12 cm [a length could be 6.99998865545 cm, so there are infinitely many possible values between the two endpoints]
Discrete Data
result when the data values are quantitative and the number of values is finite, or "countable." Example: The number of tosses of a coin before getting tails [whole numbers such as 1, 2, 3, ..., 900; never a value like 1.4465565, which continuous data could take]
A z score (or standard score or standardized value)
the number of standard deviations that a given value x is above or below the mean.
z score
the number of standard deviations that a given value x is above or below the mean. A data value is significantly low if its z score is less than or equal to −2 or the value is significantly high if its z score is greater than or equal to +2.
n variable
total number of values in the data set
Randomization
when subjects are assigned to different groups through a process of random selection. The logic is to use chance as a way to create two groups that are similar.