SAS Programming II Business Analysis Applications - C748
Normal Distribution
A common theoretical distribution in statistics that is shaped like a bell with values concentrated near the mean. Shape of the distribution depends on the mean and standard deviation.
platykurtic distribution
A distribution that is characterized by being flatter than the normal distribution, that is, less peaked, with heavier flanks and thinner tails.
leptokurtic
A distribution that is often referred to as heavy-tailed and might sometimes also be referred to as an outlier-prone distribution; has a positive kurtosis value.
chi-square test
A formal test of association between two categorical variables.
Two-Sample T-Test
A hypothesis test for answering questions about the means of two populations. This test enables you to examine the differences between populations for one or more continuous variables. You can assess whether the means of the two populations are statistically different from each other.
cross-validation
A model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.
bootstrapping
A resampling method that tries to approximate the distribution of the parameter estimates to estimate the standard error.
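The resampling idea can be sketched in a few lines. This is a minimal illustration in Python (standard library only) rather than SAS; the sample values, the function name `bootstrap_se_of_mean`, and the number of resamples are assumptions chosen for the example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical sample of 12 measurements (illustrative values only)
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9, 6.2, 3.9, 5.0, 4.4]


def bootstrap_se_of_mean(data, n_resamples=2000):
    """Estimate the standard error of the mean by resampling with replacement."""
    means = []
    for _ in range(n_resamples):
        resample = random.choices(data, k=len(data))  # same size as the original sample
        means.append(statistics.mean(resample))
    # The spread of the resampled means approximates the standard error
    return statistics.stdev(means)


boot_se = bootstrap_se_of_mean(sample)
formula_se = statistics.stdev(sample) / len(sample) ** 0.5  # s / sqrt(n)
print(f"bootstrap SE: {boot_se:.3f}, formula SE: {formula_se:.3f}")
```

For the mean, the bootstrap estimate should land close to the textbook value s / sqrt(n); the method is more useful for statistics that have no simple standard error formula.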
representative
A sample from a population should be ________.
standard error
A statistic that measures the variability of your estimator; variability of a sample statistic.
μ1 - μ2 != 0
Alternative hypothesis
ODS Graphics
An extension of the SAS Output Delivery System. With this, statistical procedures produce graphs as automatically as they produce tables, and graphs are integrated with tables in the output.
outlier
An unusual observation that has a large residual compared to the rest of the points. Keyword to detect these: STUDENT
decreases
As the power of a test increases, the chances of a Type II error ________.
homogeneity
Assumption that all comparison groups have the same variance.
scale of measurement
Based on type of variable, used to determine the statistical procedures appropriate for use with that variable.
CLM option
Calculate the confidence limits for the mean
PROC MEANS
Calculates a standard set of statistics, including the minimum, maximum, and mean data values, as well as the standard deviation and n, which is the number of nonmissing values in the sample. If any statistic keywords are specified, only those statistics are reported in place of the defaults.
logistic regression
Can have categorical or continuous predictor variables, but has a categorical response variable.
influential observation
Can sometimes have a large residual compared to the rest of the points, but it is an observation so far away from the rest of the data that it singlehandedly exerts influence on the slope of the regression line. Keywords to detect these: COOKD, RSTUDENT, DFFITS
independent variable
Can take different values; affects or determines other variables. AKA predictor, explanatory, control, or input variable.
t distribution
Characterized by the degrees of freedom associated with the data. Arises when you're making inferences about a population mean and the population standard deviation is unknown and has to be estimated from the data. The t statistic, which follows this distribution, is the number of standard errors that the sample mean is from the hypothesized mean.
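As a worked illustration, the number of standard errors between the sample mean and the hypothesized mean can be computed directly. This is a Python sketch, not SAS output; the data values, the hypothesized mean of 12, and the function name `one_sample_t` are assumptions made for the example.

```python
import math
import statistics


def one_sample_t(data, mu0):
    """Return the t statistic and its degrees of freedom for H0: mean = mu0."""
    n = len(data)
    std_error = statistics.stdev(data) / math.sqrt(n)  # s / sqrt(n)
    t_stat = (statistics.mean(data) - mu0) / std_error
    return t_stat, n - 1


# Hypothetical sample and hypothesized mean
t_stat, df = one_sample_t([12.0, 15.0, 11.0, 14.0, 18.0], mu0=12.0)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

Here the sample mean is 14, so the statistic says how many standard errors 14 lies above the hypothesized value of 12.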
multiple logistic regression
Characterizes the relationship between a categorical response variable and multiple predictor variables.
effect coding
Compare one level to the average effect of all levels
post-hoc test
Confirms where the differences occurred between groups.
multivariate analysis
Considers multiple response variables when examining two or more variables. Examples: Factor analysis and clustering.
multivariable analyses
Considers only one response variable when examining two or more variables. Examples: multiple linear regression and n-way ANOVA
ALPHA= option
Construct confidence intervals with a different confidence level
linear regression
Continuous predictor (independent) variable and continuous response (dependent) variable.
platykurtic
Data is less heavily concentrated about the mean; has a negative kurtosis statistic.
quantitative data
Data that consists of counts or measurements
Continuous data
Data that consists of variables that are measured on a scale that has an infinite number of values and has no breaks or jumps.
Discrete data
Data that consists of variables that can have only a countable number of values within a measurement range.
Categorical data
Data that consists of variables that denote groupings or labels.
nominal categorical variable
Data that exhibits no ordering within its observed levels, groups, or categories. AKA qualitative or classification variable.
ordinal categorical
Data that has observed levels of the variable that can be ordered in some meaningful way that implies that the differences between the groups or categories are due to magnitude.
mode
Data value that occurs the most. Not informative in small data files.
interquartile range
Difference between the 25th and 75th percentiles. Robust estimate of the variability because changes in the upper and lower 25% of the data do not affect it, meaning it is resilient to outliers.
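The robustness claim is easy to check numerically. The sketch below is Python, not SAS, and uses made-up data; note that quartile conventions differ between tools, so the exact quartile values may not match what a SAS procedure reports for the same data.

```python
import statistics


def iqr(data):
    """Interquartile range: third quartile minus first quartile."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return q3 - q1


# Hypothetical data: the second list replaces the maximum with an extreme value
clean = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12]
with_outlier = [2, 4, 4, 5, 6, 7, 8, 9, 10, 120]

print("IQR:  ", iqr(clean), iqr(with_outlier))
print("range:", max(clean) - min(clean), max(with_outlier) - min(with_outlier))
```

The range jumps when the largest value becomes an outlier, while the interquartile range stays the same.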
PRINTALLTYPES
Display the analysis for all requested combinations of class variables
distribution of sample means
Distribution of all possible sample means from the population; always less variable than the data.
frequency table
Shows the distribution of your data by listing each value or level with its count and percentage.
bimodal distribution
Distributions with two peaks.
F-Test
Evaluates the assumption of equal variances in the two populations.
coefficient of variation
Expresses the root MSE as a percentage of the mean of the response variable (mean bulb weight in the course example).
σ1^2 = σ2^2
Null hypothesis (H0) of the F-test for equal variances. If it is true, the F statistic will be close to 1.
Nuisance Factors
Factors that can affect the outcome of your experiment but are not of interest in your study.
Type II error
Failing to reject the null hypothesis when it is actually false.
PROC GLM
General linear model
interval estimator
Gives us a range of values that is likely to contain the population mean; incorporates uncertainty that arises from random variability.
predictive modeling
Goal is to answer the question, if you know X, can you predict Y? Sample sizes are typically quite large and include many predictor variables, also called input variables. The focus is on the predictions of observations, rather than the parameters of the model. To assess a predictive model, you validate predictions using holdout sample data.
explanatory modeling
Goal is to develop a model that answers the question, how is X related to Y? Sample sizes are typically small and include few variables. The focus is on the parameters of the model. To assess the model, you use p-values and confidence intervals.
Box Plot
Graphical rendition of statistical data based on the minimum, first quartile, median, third quartile, and maximum.
one-way ANOVA
Has a continuous dependent, or response, variable and a categorical independent, or predictor, variable. The predictor can have two or more levels, but the model can have only one predictor variable.
Mallows' Cp statistic
Helps detect model bias. For prediction, look for Cp <= p. For parameter estimation, look for Cp <= 2p - pfull + 1, where p is the number of parameters in the model, including the intercept, and pfull is the number of parameters in the full model.
greater
If the p-value for the F statistic is _______ than alpha (usually 0.05), you fail to reject the null hypothesis.
less
If the p-value for the F statistic is _______ than alpha (usually 0.05), you reject the null hypothesis.
greater
If the p-value is _______ than the alpha (usually 0.05), you fail to reject the null hypothesis.
less
If the p-value is _______ than the alpha (usually 0.05), you reject the null hypothesis.
ANOVA assumptions
Independent observations, normally distributed errors, and equal variances
odds ratio
Indicates how much more likely it is, with respect to odds, that a certain event, or outcome, occurs in one group relative to its occurrence in another group.
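A small worked example may help. The counts below are hypothetical, and the sketch is written in Python rather than SAS.

```python
def odds(events, non_events):
    """Odds of the outcome: events divided by non-events."""
    return events / non_events


# Hypothetical 2x2 table: outcome counts in two groups of 100
odds_a = odds(events=30, non_events=70)  # group A
odds_b = odds(events=15, non_events=85)  # group B

odds_ratio = odds_a / odds_b
print(f"odds A = {odds_a:.3f}, odds B = {odds_b:.3f}, odds ratio = {odds_ratio:.3f}")
```

An odds ratio above 1 means the odds of the outcome are higher in group A than in group B; a value of 1 means no association.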
Standard deviation
Indicates how much variation there is from the mean, measuring how spread out the data is. Defined as the square root of the variance.
alternative hypothesis
Initial research hypothesis, that is, your proposed explanation; what you are attempting to demonstrate; hypothesis of inequality.
QRANGE
Interquartile range (Q3-Q1)
Welch
Is used in ANOVA when the assumption of homogeneity of variance is violated.
Coefficient of variation (C.V.)
Measure of the standard deviation expressed as a percentage of the mean.
Variance
Measure of variability of the data around the mean. Defined as the average squared difference of the observations from the mean.
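The variance, standard deviation, and coefficient of variation are tied together by simple arithmetic. The sketch below uses Python with hypothetical data; note that the sample variance divides the sum of squared differences by n - 1 rather than n.

```python
import statistics

# Hypothetical sample
data = [12.0, 15.0, 11.0, 14.0, 18.0]

mean = statistics.mean(data)
variance = statistics.variance(data)  # sample variance, divisor n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance
cv = 100 * std_dev / mean             # standard deviation as a percentage of the mean

print(f"mean={mean}, variance={variance}, std dev={std_dev:.3f}, CV={cv:.1f}%")
```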
adjusted odds ratio
Measures the effect of a single predictor variable on a response variable while holding all the other predictor variables constant.
p-value
Measures the probability of observing a value as extreme as the one observed or more extreme, assuming that the null hypothesis is true.
Kurtosis
Measures the tendency of your data to be concentrated toward the center or toward the tails of the distribution; measure of the peakedness of the data. The measure of tail thickness. The closer to 0, the closer the tails resemble normal distribution. Normal distribution actually has a value of 3, but SAS standardizes it to 0.
Skewness
Measures the tendency of your data to be more spread out on one side of the mean than on the other. The closer this is to 0, the more symmetric the data. Data skewed to the left (negative skewness) means the mean is less than the median. Data skewed to the right (positive skewness) means the mean is greater than the median.
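Skewness and kurtosis can both be computed from central moments. The sketch below uses the simple moment-based formulas in Python; SAS applies small-sample adjustments, so its reported values will differ somewhat. The data sets and function names are assumptions made for illustration.

```python
import statistics


def central_moment(data, k):
    """Average of (x - mean)^k over the data."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)


def skewness(data):
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    # Subtracting 3 puts the normal distribution at 0
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), statistics.mean(right_skewed), statistics.median(right_skewed))
```

The evenly spread data has zero skewness and negative excess kurtosis (flatter than normal), while the right-skewed data has positive skewness and a mean greater than its median.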
nominal logistic regression
More than two response levels and response measurement scale is nominal.
ordinal logistic regression
More than two response levels and response measurement scale is ordinal.
μ1 - μ2 = 0
Null hypothesis
p-value >= α
Fail to reject the null hypothesis
Cramer's V Statistic
One measure of the strength of an association between two categorical variables.
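Both the chi-square statistic and Cramer's V can be computed by hand from a table of counts. The Python sketch below uses hypothetical counts and reports only the magnitude of V; SAS reports a signed value for 2x2 tables.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under no association
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(table):
    """Magnitude of Cramer's V: 0 means no association, 1 means perfect association."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi_square(table) / (n * (k - 1)))


associated = [[30, 70], [15, 85]]   # hypothetical counts
independent = [[10, 20], [20, 40]]  # rows are proportional, so no association

print(chi_square(associated), cramers_v(associated))
print(chi_square(independent), cramers_v(independent))
```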
independent observations
One observation doesn't affect another observation, that is, no observation provides information about any other observation.
Descriptive statistics
Organize, describe, and summarize data using numbers and graphical techniques. This branch of statistics uses a set of standard measures such as percent, averages, and variability, as well as simple graphs, charts, and tables. AKA exploratory data analysis, or EDA
residuals have a cyclical shape
PROC AUTOREG can help take autocorrelation into account
BY Statement in GLM
Performs separate analyses of observations in different groups
PROC TTEST
Performs the two-sample t-test by default. It also computes confidence limits and uses ODS graphics to create graphs as part of its output. It automatically tests the assumption of equal variances, and provides an exact two-sample t-test when the assumption is met, and an approximate t-test when it is not met.
SHOWNULL
Places a vertical reference line at the mean value of the null hypothesis, which is 0 by default, in the interval plot.
μ
Population Mean
σ
Population Standard Deviation
σ^2
Population Variance
CLI
Produces confidence limits for an individual predicted value.
PROC CORR
Produces tables of variable information, simple descriptive statistics, and correlation statistics, including Pearson correlation coefficients and corresponding p-values. It also produces scatter plots or a scatter plot matrix.
Kurtosis
Rectangular, bimodal, and multimodal distributions tend to have low values of ____________.
p-value < α
Reject the null hypothesis
Type I error
Rejecting the null hypothesis when it's actually true.
two-sided t-test
Rejection region for the t statistic is contained in both tails of the data distribution.
HOVTEST=LEVENE
Requests a homogeneity of variance test for the groups defined by the MEANS effect, specifically Levene's test; Levene's is also the default
LSMEANS
Statement that computes least squares means for the specified effects and is used to request multiple comparison methods
PDIFF=ALL
Requests p-values for the differences between ALL the means
sum to 0
Residuals always __________________, regardless of the number of observations.
UNPACK
SAS puts each plot on a separate page.
CLASS statement
SAS will group the data by the variable defined in the statement.
x̄
Sample Mean
s
Sample Standard Deviation
s^2
Sample Variance
point estimator
Sample statistic used to estimate a population parameter. Examples: sample mean, sample standard deviation, sample variance
interval scale
Scale of measurement for continuous variables. Can be rank-ordered like ordinal data, but it also has a sensible spacing of observations such that differences between measurements are meaningful. Example: Measuring patient temperature. Lacks a true zero point, so ratios between numbers on the scale are not meaningful.
ratio scale
Scale of measurement for continuous variables. Is rank-ordered with meaningful spacing, includes a true zero point, and can therefore accurately indicate the ratio between two values on the measurement scale. Example: If an individual has zero dollars, this implies an absence of money, and one individual can have twice as much money as another.
Confidence intervals
Show a range of plausible values for the unknown population mean; the confidence level is chosen by the analyst, typically 95%.
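The interval is built as the sample mean plus or minus a critical value times the standard error. The Python sketch below is a large-sample approximation that uses a normal critical value because the standard library has no t quantile function; SAS uses the t distribution, which gives a slightly wider interval for small samples. The data and the function name `approx_ci` are hypothetical.

```python
import math
import statistics
from statistics import NormalDist


def approx_ci(data, confidence=0.95):
    """Approximate confidence interval for the mean (normal critical value)."""
    mean = statistics.mean(data)
    std_error = statistics.stdev(data) / math.sqrt(len(data))
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return mean - z * std_error, mean + z * std_error


# Hypothetical sample
data = [98.2, 99.1, 97.8, 98.6, 98.9, 98.4, 99.0, 98.1, 98.7, 98.5]

low95, high95 = approx_ci(data, 0.95)
low99, high99 = approx_ci(data, 0.99)
print(f"95%: ({low95:.2f}, {high95:.2f})  99%: ({low99:.2f}, {high99:.2f})")
```

Raising the confidence level widens the interval, which is the trade-off behind the ALPHA= option.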
R Squared
Shows how well the model terms fit the data points. Percentage of the variation in the response explained by the terms.
Adjusted R Squared
Shows how well the model terms fit the data points, but adjusts for the number of terms in the model, so adding terms that do not improve the fit lowers it.
range
Single value that measures the difference between the maximum and minimum values.
Dunnett's Method
Specialized multiple comparison test that allows you to compare a single control group, such as a placebo in a drug trial, to all other groups or treatments.
SLENTRY=
Specifies a significance level for a variable to enter the model.
SELECT=SL
Specifies significance level as the selection criterion.
ADJUST=
Specifies the adjustment method for multiple comparisons
variability
Spread or dispersion of the data
2 standard deviations
Statisticians often consider values that are more than _______________ from the mean as unusual.
measures of central tendency
Statistics that locate the center of the data. Examples: Mean, median, mode
t statistic
Number of standard errors that the sample mean is from the hypothesized mean. It follows the t distribution, which is symmetric like the normal distribution, except that it has thicker tails.
dependent variable
Takes different values in response to another variable. AKA response, outcome, or target variable
/ reg
Tells SAS to include a regression line fit to the scatter plot
omnibus test statistic
Tests whether the explained variance in a set of data is significantly greater than the unexplained variance, but not which groups are different from each other, if any.
mean
The average of all data values. Highly influenced by outliers. Useful for measuring the center of your data when the data is balanced on both sides.
Inferential statistics
The branch of statistics concerned with drawing conclusions about a population from analysis of a random sample drawn from that population. It is also concerned with the precision and reliability of those inferences. AKA explanatory modeling
effect size
The difference between the observed statistic and the hypothesized value
probability density function
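Function whose height at any point on the horizontal axis gives the relative likelihood of that value; the area under the curve over an interval gives the probability of falling in that interval.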
mean (µ) and standard deviation (σ)
The location and spread of a normal distribution depend on the value of __________.
effect
The magnitude of the expected change in the response variable presumably caused by the change in value of a predictor variable in the model.
median
The middle value in the data when the data is ordered. Less sensitive to the presence of outliers.
power
The probability that you correctly reject the null hypothesis. It is the ability of the statistical test to detect a true difference, or the ability to successfully reject a false null hypothesis.
R-Square
The proportion of variance in the response accounted for by the model. It's close to 0 if the independent variables do not explain much variability in the data, and it's close to 1 if the independent variables explain a relatively large proportion of the variability in the data.
F statistic
The ratio of the maximum sample variance of the two groups to the minimum sample variance of the two groups.
variance of the residuals is not constant
The response variable in the model might need some sort of transformation; natural log and square root transformations are very common.
method of least squares
This method provides the estimates by determining the line that minimizes the sum of the squared vertical distances between the data points and the fitted line.
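For a single predictor, the least squares estimates have closed-form solutions. The following Python sketch uses hypothetical data and computes the slope and intercept directly; it also confirms that the residuals from the fitted line sum to zero.

```python
def least_squares(x, y):
    """Return (intercept, slope) minimizing the sum of squared vertical distances."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"intercept={b0:.2f}, slope={b1:.2f}, sum of residuals={sum(residuals):.1e}")
```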
Tukey Method
This test compares all possible pairs of means, so it can only be used when you make pairwise comparisons. AKA Honestly Significant Difference test
linearity assumption is being violated
To account for the curvature in the data, the model needs a quadratic term added.
Histogram
To check the assumption that your random sample has a normal distribution, it can be useful to plot a histogram.
binary logistic regression
Response variable with two levels (dichotomous)
Collinearity
Two or more predictor variables are highly correlated with each other.
two-way ANOVA model
Use _______ when you have a continuous response variable and two categorical predictor variables.
PROC UNIVARIATE
Use to generate descriptive statistics, including skewness and kurtosis, quantiles or percentiles, frequency tables and extreme values. It also generates histograms and normal probability plots that assist you in assessing the distribution of your data.
PROC SGPLOT
Use to produce a wide variety of plot types, including scatter plots, line graphs, histograms with overlaid distribution curves, regression lines with confidence and prediction bands, dot plots, box plots, bar charts, etc. You can also overlay plots together to produce many different types of graphs.
PROC SGPANEL
Use to produce panels of plots for different levels of a factor or several different time periods, depending on the classification variable.
PROC SGRENDER
Use to produce plots from graph templates you have modified or written yourself.
PROC SGSCATTER
Use to produce several types of scatter plots.
Pooled t-test
Use when you are assuming the population variances are equal.
Satterthwaite t-test
Use when you are assuming the population variances are not equal.
hypothesis test
Uses sample data to evaluate a question about a population.
Percentile
Value of a variable below which a certain percentage of observations fall. The most commonly reported are quartiles, which break the data up into quarters.
Model Sum of Squares (SSM)
Variability between groups.
Error Sum of Squares (SSE)
Variability within groups; the variability left unexplained by the model.
Total Sum of Squares (SST)
Overall variability in the response variable; equals SSM + SSE.
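The three sums of squares partition the variability in the response: SST = SSM + SSE. The Python sketch below works through a small one-way ANOVA by hand with hypothetical group data, then derives R-square and the F statistic from the pieces.

```python
# Hypothetical responses for three groups
groups = {"A": [5, 7, 6], "B": [8, 9, 10], "C": [12, 11, 13]}

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)
group_means = {name: sum(values) / len(values) for name, values in groups.items()}

# Between-group variability: group means around the grand mean
ssm = sum(len(values) * (group_means[name] - grand_mean) ** 2 for name, values in groups.items())
# Within-group variability: observations around their own group mean
sse = sum((v - group_means[name]) ** 2 for name, values in groups.items() for v in values)
# Total variability: observations around the grand mean
sst = sum((v - grand_mean) ** 2 for v in all_values)

k, n = len(groups), len(all_values)
r_square = ssm / sst
f_stat = (ssm / (k - 1)) / (sse / (n - k))  # model mean square over error mean square

print(f"SSM={ssm}, SSE={sse}, SST={sst}, R-square={r_square:.2f}, F={f_stat:.1f}")
```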
homogeneity of variance
Variances in the two populations are equal
Normal Probability Plot
Visual method for determining whether or not your data comes from a distribution that is approximately normal.
Histogram
Visual representation of the frequency distribution of your data.
means class_var / Welch
Welch Test
null hypothesis
What you assume to be true when you start your analysis; hypothesis of equality.
effect
a
sample
a subset of the population
generalization
ability to predict outcomes for new data, also known as scoring
ANOVA
categorical predictor variable + continuous outcome variable =
Variables
characteristics or properties of data that can take on different values or amounts for different individuals in the population
convenience sampling
collecting a sample from a section of the population that is easily available
population
complete set of observations or the entire group of objects that you are researching
correlation analysis and linear regression
continuous predictor variable + continuous outcome variable =
Bivariate analysis
describes and explains the relationship between two variables and how they change, or covary, together.
central limit theorem
distribution of sample means is approximately normal, regardless of the population distribution's shape, if the sample size is large enough; about 30 observations.
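A quick simulation illustrates the theorem. This Python sketch draws repeated samples of size 30 from a strongly right-skewed exponential population with mean 1 and standard deviation 1; the seed, sample size, and number of replications are assumptions chosen for the example.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable


def sample_mean(n):
    """Mean of n draws from an exponential population (mean 1, std dev 1)."""
    return sum(random.expovariate(1.0) for _ in range(n)) / n


n = 30
means = [sample_mean(n) for _ in range(5000)]

print(f"mean of sample means:    {statistics.mean(means):.3f} (population mean 1.0)")
print(f"std dev of sample means: {statistics.stdev(means):.3f} (theory {1 / math.sqrt(n):.3f})")
```

The sample means center on the population mean, and their spread is close to sigma / sqrt(n), which is much smaller than the spread of the individual observations.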
UNITS statement
enables you to obtain an odds ratio estimate for a specified change in a predictor variable
ε
error term
ML estimation
estimate the parameters that are most likely to occur given the data and model assumptions
simple random sample
every possible sample of a given size in the population has an equal chance of being selected
Multivariate or multivariable analysis
examines two or more variables at the same time, in order to understand the relationships among them
variance of the sample mean
how much the value of the sample mean varies from sample to sample
observation number
i
β0
intercept parameter
indexes predictor variable
j
one-sided tests
look for a difference in one direction
discordant
model did not predict the order correctly
concordant
model predicted the order correctly
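Concordance is counted over every pairing of an observation that had the event with one that did not. The Python sketch below uses hypothetical predicted probabilities and compares them exactly; SAS may group nearly equal probabilities as ties, so its counts can differ slightly.

```python
def concordance(event_probs, nonevent_probs):
    """Count concordant, discordant, and tied event/non-event pairs."""
    concordant = discordant = tied = 0
    for p_event in event_probs:
        for p_nonevent in nonevent_probs:
            if p_event > p_nonevent:
                concordant += 1   # model ranked the event higher: correct order
            elif p_event < p_nonevent:
                discordant += 1   # model ranked the non-event higher: wrong order
            else:
                tied += 1
    return concordant, discordant, tied


# Hypothetical predicted probabilities of the event
events = [0.9, 0.7, 0.4]          # observations where the event occurred
nonevents = [0.8, 0.3, 0.4, 0.1]  # observations where it did not

counts = concordance(events, nonevents)
print(counts)
```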
Parameters
numerical values that summarize characteristics of a population
honest assessment
partitioning the available data into two data sets: a training data set and a validation data set
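The partitioning step can be sketched as a random split. This Python example is illustrative only; the 70/30 split, the seed, and the function name `split_data` are assumptions.

```python
import random


def split_data(rows, validation_fraction=0.3, seed=0):
    """Randomly partition rows into a training set and a validation set."""
    rng = random.Random(seed)  # local generator so the split is repeatable
    shuffled = rows[:]         # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_valid = round(len(shuffled) * validation_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


rows = list(range(100))  # stand-in for 100 observations
train, valid = split_data(rows)
print(len(train), len(valid))
```

The model is fit on the training rows only, and the validation rows are held back to check how well it predicts data it has not seen.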
PROC PLM
performs post-fitting statistical analyses and plotting for the contents of an item store
X
predictor variable
Univariate analysis
provides techniques for analyzing and describing a single variable at a time
joint sampling
randomly select input-target pairs
imputation
replace missing values with reasonable values
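One simple choice of "reasonable value" is the mean of the observed values. The Python sketch below shows that approach with hypothetical data, using `None` to mark a missing value; it is only one of several possible imputation methods.

```python
import statistics


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical variable with two missing values
imputed = impute_mean([10.0, None, 14.0, 12.0, None])
print(imputed)
```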
STORE statement
requests that the procedure save the context and results of the statistical analysis into an item store
Y
response variable
central limit theorem
sampling distribution of the sample means approaches a normal distribution as the sample size gets larger
SURVEYSELECT procedure
selects a random sample from a SAS data set
β1
slope parameter
%GLOBAL statement
specifies that a macro variable (inputs in the course example) is available for the entire SAS session
coefficient of variation
standard deviation expressed as a percentage of the mean
predictive models
statistical models used to predict future values of a response variable based on the existing values of predictor variables
Statistics
summarize characteristics of a sample
ONLY
suppresses the default plots
SIDES=L
syntax to specify a lower one-sided test
SIDES=U
syntax to specify an upper one-sided test
two-sided two-sample t-test
testing to see whether two group means are significantly different from each other
one-sided two-sample t-test
testing to see which of two group means is greater than or less than the other
observational data
the data was collected for operational purposes (such as tax or accounting purposes) unrelated to statistical analysis
logit transformation
transforms the probability scale to the real line (-∞, +∞)
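The logit is the natural log of the odds, and its inverse maps any real number back to a probability. A short Python sketch (the function names are chosen for illustration):

```python
import math


def logit(p):
    """Log odds: maps a probability in (0, 1) onto the real line."""
    return math.log(p / (1 - p))


def inverse_logit(x):
    """Maps any real number back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-x))


for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 3), round(inverse_logit(logit(p)), 3))
```

A probability of 0.5 maps to 0, probabilities below 0.5 map to negative values, and probabilities above 0.5 map to positive values.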
inferential models
used to test hypotheses about the data and characterize the relationships among variables
error term
εij
overall population mean
μ