SAS Programming II Business Analysis Applications - C748


Normal Distribution

A common theoretical distribution in statistics that is shaped like a bell with values concentrated near the mean. Shape of the distribution depends on the mean and standard deviation.

platykurtic distribution

A distribution that is characterized by being flatter than the normal distribution, that is, less peaked, with heavier flanks and thinner tails.

leptokurtic

A distribution that is often referred to as heavy-tailed and might sometimes also be referred to as an outlier-prone distribution; has a positive kurtosis value.

chi-square test

A formal test of association between two categorical variables.

Two-Sample T-Test

A hypothesis test for answering questions about the means of two populations. This test enables you to examine the differences between populations for one or more continuous variables. You can assess whether the means of the two populations are statistically different from each other.

cross-validation

A model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.

bootstrapping

A resampling method that tries to approximate the distribution of the parameter estimates to estimate the standard error.
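
A minimal sketch of the idea in Python (not SAS), assuming a small hypothetical sample and the sample mean as the estimator of interest. The name `bootstrap_se`, the number of resamples, and the seed are illustrative choices.

```python
import random
import statistics

def bootstrap_se(sample, n_resamples=2000, seed=0):
    """Bootstrap estimate of the standard error of the sample mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original sample.
        resample = rng.choices(sample, k=len(sample))
        means.append(statistics.mean(resample))
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
boot_se = bootstrap_se(data)
print(round(boot_se, 3))
```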

representative

A sample from a population should be ________.

standard error

A statistic that measures the variability of your estimator; variability of a sample statistic.
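
For the sample mean, the usual estimate is the sample standard deviation divided by the square root of n. A short Python sketch with hypothetical data (the function name `standard_error` is illustrative):

```python
import math
import statistics

def standard_error(sample):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(len(sample))

values = [10, 12, 9, 11, 13]  # hypothetical measurements
se = standard_error(values)
print(round(se, 4))
```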

μ1 - μ2 != 0

Alternative hypothesis

ODS Graphics

An extension of the SAS Output Delivery System. With this, statistical procedures produce graphs as automatically as they produce tables, and graphs are integrated with tables in the output.

outlier

An unusual observation that has a large residual compared to the rest of the points. Keyword to detect these: STUDENT

decreases

As the power of a test increases, the chances of a Type II error ________.

homogeneity

Assumption that all comparison groups have the same variance.

scale of measurement

Based on type of variable, used to determine the statistical procedures appropriate for use with that variable.

CLM option

Calculate the confidence limits for the mean

PROC MEANS

Calculates a standard set of statistics, including the minimum, maximum, and mean data values, as well as the standard deviation and n, which is the number of nonmissing values in the sample. If any statistics are requested explicitly, they replace the default set.

logistic regression

Can have categorical or continuous predictor variable, but has a categorical response variable.

influential observation

Can sometimes have a large residual compared to the rest of the points, but it is an observation so far away from the rest of the data that it singlehandedly exerts influence on the slope of the regression line. Keywords to detect these: COOKD, RSTUDENT, DFFITS

independent variable

Can take different values; affects or determines other variables. AKA predictor, explanatory, control, or input variable.

t distribution

Characterized by the degrees of freedom associated with the data. Arises when you're making inferences about a population mean and the population standard deviation is unknown and has to be estimated from the data. A t value is the number of standard errors that the sample mean is from the hypothesized mean.

multiple logistic regression

Characterizes the relationship between a categorical response variable and multiple predictor variables.

effect coding

Compare one level to the average effect of all levels

post-hoc test

Confirms where the differences occurred between groups.

multivariate analysis

Considers multiple response variables when examining two or more variables. Examples: Factor analysis and clustering.

multivariable analyses

Considers only one response variable when examining two or more variables. Examples: multiple linear regression and n-way ANOVA

ALPHA= option

Construct confidence intervals with a different confidence level

linear regression

Continuous predictor (independent) variable and continuous response (dependent) variable.

platykurtic

Data is less heavily concentrated about the mean; has a negative kurtosis statistic.

quantitative data

Data that consists of counts or measurements

Continuous data

Data that consists of variables that are measured on a scale that has an infinite number of values and has no breaks or jumps.

Discrete data

Data that consists of variables that can have only a countable number of values within a measurement range.

Categorical data

Data that consists of variables that denote groupings or labels.

nominal categorical variable

Data that exhibits no ordering within its observed levels, groups, or categories. AKA qualitative or classification variable.

ordinal categorical

Data that has observed levels of the variable that can be ordered in some meaningful way that implies that the differences between the groups or categories are due to magnitude.

mode

Data value that occurs the most. Not informative in small data files.

interquartile range

Difference between the 25th and 75th percentiles. Robust estimate of the variability because changes in the upper and lower 25% of the data do not affect it, meaning it is resilient to outliers.
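
A small Python sketch of the robustness claim, using hypothetical data. It relies on `statistics.quantiles` with its default method, which may not match the percentile definition SAS uses by default, so treat the exact quartile values as illustrative.

```python
import statistics

def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def value_range(values):
    """Range: maximum minus minimum."""
    return max(values) - min(values)

clean = [1, 2, 3, 4, 5, 6, 7]           # hypothetical data
with_outlier = [1, 2, 3, 4, 5, 6, 700]  # same data with one extreme value

# The range reacts strongly to the outlier; the IQR does not change.
print(value_range(clean), value_range(with_outlier))
print(iqr(clean), iqr(with_outlier))
```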

PRINTALLTYPES

Display the analysis for all requested combinations of class variables

distribution of sample means

Distribution of all possible sample means from the population; always less variable than the data.

frequency table

Distribution of your data

bimodal distribution

Distributions with two peaks.

F-Test

Evaluates the assumption of equal variances in the two populations.

coefficient of variation

In regression and ANOVA output, expresses the root MSE as a percentage of the mean of the response variable (in the course example, the mean bulb weight).

σ1^2 = σ2^2

F-Test for Equal Variance, if true then supports H0. F-Statistic will be close to 1.

Nuisance Factors

Factors that can affect the outcome of your experiment but are not of interest in your study.

Type II error

Failing to reject the null hypothesis when it is actually false.

PROC GLM

General linear model

interval estimator

Gives us a range of values that is likely to contain the population mean; incorporates uncertainty that arises from random variability.

predictive modeling

Goal is to answer the question, if you know X, can you predict Y? Sample sizes are typically quite large and include many predictor variables, also called input variables. The focus is on the predictions of observations, rather than the parameters of the model. To assess a predictive model, you validate predictions using holdout sample data.

explanatory modeling

Goal is to develop a model that answers the question, how is X related to Y? Sample sizes are typically small and include few variables. The focus is on the parameters of the model. To assess the model, you use p-values and confidence intervals.

Box Plot

Graphical rendition of statistical data based on the minimum, first quartile, median, third quartile, and maximum.

one-way ANOVA

Has a continuous dependent, or response variable, and a categorical independent, or predictor variable. Predictor variable can have two or more levels, but can only have one predictor variable.

Mallows' Cp statistic

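Helps detect model bias. For prediction, look for models with Cp <= p. For parameter estimation (explanatory analysis), look for models with Cp <= 2p - pfull + 1, where p is the number of parameters in the model, including the intercept, and pfull is the number of parameters in the full model.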

greater

If the p-value for the F statistic is _______ than alpha (usually 0.05), you fail to reject the null hypothesis.

less

If the p-value for the F statistic is _______ than alpha (usually 0.05), you reject the null hypothesis.

greater

If the p-value is _______ than the alpha (usually 0.05), you fail to reject the null hypothesis.

less

If the p-value is _______ than the alpha (usually 0.05), you reject the null hypothesis.

ANOVA assumptions

Independent observations, normally distributed errors, and equal variances

odds ratio

Indicates how much more likely it is, with respect to odds, that a certain event, or outcome, occurs in one group relative to its occurrence in another group.
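
As a worked example in Python, assuming hypothetical counts for two groups: the odds in each group are events divided by non-events, and the odds ratio is the ratio of those two odds.

```python
def odds(events, non_events):
    """Odds of the event: events / non-events."""
    return events / non_events

def odds_ratio(events_a, non_events_a, events_b, non_events_b):
    """Odds of the event in group A relative to group B."""
    return odds(events_a, non_events_a) / odds(events_b, non_events_b)

# Hypothetical 2x2 table: group A has 20 events and 80 non-events,
# group B has 10 events and 90 non-events.
ratio = odds_ratio(20, 80, 10, 90)
print(round(ratio, 2))  # odds in A are 2.25 times the odds in B
```

An odds ratio of 1 would indicate the same odds in both groups.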

Standard deviation

Indicates how much variation there is from the mean, measuring how spread out the data is. Defined as the square root of the variance.

alternative hypothesis

Initial research hypothesis, that is, your proposed explanation; what you are attempting to demonstrate; hypothesis of inequality.

QRANGE

Interquartile range (Q3-Q1)

Welch

Is used in ANOVA when the assumption of homogeneity of variance is violated.

Coefficient of variation (C.V.)

Measure of the standard deviation expressed as a percentage of the mean.

Variance

Measure of variability of the data around the mean. Defined as the average squared difference of the observations from the mean.
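
A short Python sketch tying together variance, standard deviation, and coefficient of variation on hypothetical data. Dividing by N gives the population form of the "average squared difference"; sample statistics divide by n - 1 instead.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

mean = statistics.mean(data)
pop_var = statistics.pvariance(data)  # sum of squared deviations / N
samp_var = statistics.variance(data)  # sum of squared deviations / (n - 1)
samp_sd = statistics.stdev(data)      # square root of the sample variance
cv = 100 * samp_sd / mean             # standard deviation as a percentage of the mean

print(mean, pop_var, round(samp_var, 4), round(samp_sd, 4), round(cv, 2))
```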

adjusted odds ratio

Measures the effect of a single predictor variable on a response variable while holding all the other predictor variables constant.

p-value

Measures the probability of observing a value as extreme as the one observed or more extreme, assuming that the null hypothesis is true.

Kurtosis

Measures the tendency of your data to be concentrated toward the center or toward the tails of the distribution; measure of the peakedness of the data. The measure of tail thickness. The closer to 0, the closer the tails resemble normal distribution. Normal distribution actually has a value of 3, but SAS standardizes it to 0.

Skewness

Measures the tendency of your data to be more spread out on one side of the mean than on the other. The closer this is to 0, the more symmetric the data. Data skewed to the left means the mean is less than the median. Data skewed to the right means the mean is greater than the median.

nominal logistic regression

More than two response levels and response measurement scale is nominal.

ordinal logistic regression

More than two response levels and response measurement scale is ordinal.

μ1 - μ2 = 0

Null hypothesis

p-value >= α

Fail to reject the null hypothesis

Cramer's V Statistic

One measure of the strength of an association between two categorical variables.
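
A sketch in Python of how the Pearson chi-square statistic and Cramer's V can be computed from a table of counts. The table is hypothetical, and the function returns the magnitude of V only; software may report a signed value for 2x2 tables.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and total count for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under no association.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n

def cramers_v(table):
    """Magnitude of Cramer's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    stat, n = chi_square(table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(stat / (n * k))

table = [[20, 30], [30, 20]]  # hypothetical counts
chi2, total = chi_square(table)
v = cramers_v(table)
print(chi2, round(v, 2))
```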

independent observations

One observation doesn't affect another observation, that is, no observation provides information about any other observation.

Descriptive statistics

Organize, describe, and summarize data using numbers and graphical techniques. This branch of statistics uses a set of standard measures such as percent, averages, and variability, as well as simple graphs, charts, and tables. AKA exploratory data analysis, or EDA

residuals have a cyclical shape

PROC AUTOREG can help take autocorrelation into account

BY Statement in GLM

Performs separate analyses of observations in different groups

PROC TTEST

Performs the two-sample t-test by default. It also computes confidence limits and uses ODS graphics to create graphs as part of its output. It automatically tests the assumption of equal variances, and provides an exact two-sample t-test when the assumption is met, and an approximate t-test when it is not met.

SHOWNULL

Places a vertical reference line at the mean value of the null hypothesis, which is 0 by default, in the interval plot.

μ

Population Mean

σ

Population Standard Deviation

σ^2

Population Variance

CLI

Produces confidence limits for an individual predicted value.

PROC CORR

Produces tables of variable information, simple descriptive statistics, and correlation statistics, including Pearson correlation coefficients and corresponding p-values. It also produces scatter plots or a scatter plot matrix.

Kurtosis

Rectangular, bimodal, and multimodal distributions tend to have low values of ____________.

p-value < α

Reject the null hypothesis

Type I error

Rejecting the null hypothesis when it's actually true.

two-sided t-test

Rejection region for the t statistic is contained in both tails of the data distribution.

HOVTEST=LEVENE

Requests a homogeneity of variance test for the groups defined by the MEANS effect, specifically Levene's test; Levene's is also the default

LSMEANS

Statement that computes least squares means; with options such as PDIFF= and ADJUST=, it requests multiple comparison methods

PDIFF=ALL

Requests p-values for the differences between ALL the means

sum to 0

Residuals always __________________, regardless of the number of observations.

UNPACK

SAS puts each plot on a separate page.

CLASS statement

SAS will group the data by the variable defined in the statement.

x̄

Sample Mean

s

Sample Standard Deviation

s^2

Sample Variance

point estimator

Sample statistic used to estimate a population parameter. Examples: the sample mean and the sample standard deviation, which estimate the population mean and the population standard deviation.

interval scale

Scale of measurement for continuous variables. Can be rank-ordered like ordinal data, but it also has a sensible spacing of observations such that differences between measurements are meaningful. Example: Measuring patient temperature. Has no true zero point, so ratios between numbers on the scale are not meaningful.

ratio scale

Scale of measurement for continuous variables. Is rank-ordered with meaningful spacing and includes a true zero point, so it can accurately indicate the ratio between two values on the measurement scale. Example: If an individual has zero dollars, this implies an absence of money, and one individual can have twice as much money as another.

Confidence intervals

Show a range of plausible values for the unknown population mean; the confidence level is chosen by the analyst, typically 95%.

R Squared

Shows how well terms (data points) fit a curve or line. Percentage explained by the terms.

Adjusted R Squared

Shows how well terms fit a curve or line, but adjusts for the number of terms in a model. Percentage explained by the terms.

range

Single value that measures the difference between the maximum and minimum values.

Dunnett's Method

Specialized multiple comparison test that allows you to compare a single control group, such as a placebo in a drug trial, to all other groups or treatments.

SLENTRY=

Specifies a significance level for a variable to enter the model.

SELECT=SL

Specifies significance level as the selection criterion.

ADJUST=

Specifies the adjustment method for multiple comparisons

variability

Spread or dispersion of the data

2 standard deviations

Statisticians often consider values that are more than _______________ from the mean as unusual.

measures of central tendency

Statistics that locate the center of the data. Examples: Mean, median, mode

t statistic

Test statistic whose sampling distribution (the t distribution) is symmetric like the normal distribution, except that it has thicker tails than a normal distribution.

dependent variable

Takes different values in response to another variable. AKA response, outcome, or target variable

/ reg

Tells SAS to include a regression line fit to the scatter plot

omnibus test statistic

Test whether the explained variance in a set of data is significantly greater than the unexplained variance, but not which groups are different from each other, if any.

mean

The average of all data values. Highly influenced by outliers. Useful for measuring the center of your data when the data is balanced on both sides.

Inferential statistics

The branch of statistics concerned with drawing conclusions about a population from analysis of a random sample drawn from that population. It is also concerned with the precision and reliability of those inferences. AKA explanatory modeling

effect size

The difference between the observed statistic and the hypothesized value

probability density function

The height of the function at any point on the horizontal axis

mean (µ) and standard deviation (σ)

The location and spread of a normal distribution depend on the value of __________.

effect

The magnitude of the expected change in the response variable presumably caused by the change in value of a predictor variable in the model.

median

The middle value in the data when the data is ordered. Less sensitive to the presence of outliers.

power

The probability that you correctly reject the null hypothesis. It is the ability of the statistical test to detect a true difference, or the ability to successfully reject a false null hypothesis.

R-Square

The proportion of variance in the response accounted for by the model. It's close to 0 if the independent variables do not explain much variability in the data, and it's close to 1 if the independent variables explain a relatively large proportion of the variability in the data.

F statistic

The ratio of the maximum sample variance of the two groups to the minimum sample variance of the two groups.

variance of the residuals is not constant

The response variable in the model might need some sort of transformation; natural log and square root transformations are very common.

method of least squares

This method provides the estimates by determining the line that minimizes the sum of the squared vertical distances between the data points and the fitted line.
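
For a single predictor, the least squares estimates have a closed form. The Python sketch below uses hypothetical data and illustrative names (`fit_line`); it also checks that the residuals from the fitted line sum to zero.

```python
def fit_line(x, y):
    """Least squares intercept (b0) and slope (b1) for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

x = [1, 2, 3, 4, 5]  # hypothetical predictor values
y = [2, 4, 5, 4, 5]  # hypothetical response values

b0, b1 = fit_line(x, y)
# Residuals are the vertical distances between the data and the fitted line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(round(b0, 2), round(b1, 2), round(sum(residuals), 10))
```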

Tukey Method

This test compares all possible pairs of means, so it can only be used when you make pairwise comparisons. AKA Honestly Significant Difference test

linearity assumption is being violated

To account for the curvature in the data, the model needs a quadratic term added.

Histogram

To check the assumption that your random sample has a normal distribution, it can be useful to plot a histogram.

binary logistic regression

Response variable has two levels (dichotomous)

Collinearity

Two or more predictor variables are highly correlated with each other.

two-way ANOVA model

Use _______ when you have a continuous response variable and two categorical predictor variables.

PROC UNIVARIATE

Use to generate descriptive statistics, including skewness and kurtosis, quantiles or percentiles, frequency tables and extreme values. It also generates histograms and normal probability plots that assist you in assessing the distribution of your data.

PROC SGPLOT

Use to produce a wide variety of plot types, including scatter plots, line graphs, histograms with overlaid distribution curves, regression lines with confidence and prediction bands, dot plots, box plots, bar charts, etc. You can also overlay plots together to produce many different types of graphs.

PROC SGPANEL

Use to produce panels of plots for different levels of a factor or several different time periods, depending on the classification variable.

PROC SGRENDER

Use to produce plots from graph templates you have modified or written yourself.

PROC SGSCATTER

Use to produce several types of scatter plots.

Pooled t-test

Use when you are assuming the population variances are equal.

Satterthwaite t-test

Use when you are assuming the population variances are not equal.
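
The pooled and Satterthwaite statistics differ only in how the standard error and degrees of freedom are formed. A Python sketch under stated assumptions: two small hypothetical groups, illustrative function and key names, and no p-values (those would need a t distribution, which the standard library does not provide).

```python
import math
import statistics

def two_sample_t(x, y):
    """Pooled and unequal-variance t statistics plus the folded F ratio."""
    n1, n2 = len(x), len(y)
    m1, m2 = statistics.mean(x), statistics.mean(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)

    # Folded F: larger sample variance over the smaller one.
    folded_f = max(v1, v2) / min(v1, v2)

    # Pooled t: one common variance estimate, df = n1 + n2 - 2.
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t_pooled = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

    # Unequal variances: separate variance terms, Satterthwaite df.
    a, b = v1 / n1, v2 / n2
    t_unequal = (m1 - m2) / math.sqrt(a + b)
    df_unequal = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

    return {
        "folded_f": folded_f,
        "t_pooled": t_pooled,
        "df_pooled": n1 + n2 - 2,
        "t_unequal": t_unequal,
        "df_unequal": df_unequal,
    }

result = two_sample_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])  # hypothetical groups
print(result)
```

With equal group sizes the two t statistics coincide; only the degrees of freedom differ.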

hypothesis test

Uses sample data to evaluate a question about a population.

Percentile

Value of a variable below which a certain percentage of observations fall, most commonly reported are quartiles, which break the data up into quarters.

Model Sum of Squares (SSM)

Variability between groups.

Error Sum of Squares (SSE)

Variability within groups; the variability left unexplained by the model.

Total Sum of Squares (SST)

Overall variability in the response variable; SST = SSM + SSE.
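
In the standard one-way ANOVA decomposition, the total sum of squares splits into a between-group (model) part and a within-group (error) part. A Python sketch with three hypothetical groups; all names are illustrative.

```python
def sums_of_squares(groups):
    """Return (SSM, SSE, SST) for a one-way layout given as a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)

    ssm = 0.0  # between groups: group means around the grand mean
    sse = 0.0  # within groups: observations around their own group mean
    for g in groups:
        group_mean = sum(g) / len(g)
        ssm += len(g) * (group_mean - grand_mean) ** 2
        sse += sum((v - group_mean) ** 2 for v in g)

    sst = sum((v - grand_mean) ** 2 for v in all_values)  # total
    return ssm, sse, sst

groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # hypothetical data
ssm, sse, sst = sums_of_squares(groups)

n_total = sum(len(g) for g in groups)
f_stat = (ssm / (len(groups) - 1)) / (sse / (n_total - len(groups)))
r_square = ssm / sst
print(ssm, sse, sst, f_stat, r_square)
```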

homogeneity of variance

Variances in the two populations are equal

Normal Probability Plot

Visual method for determining whether or not your data comes from a distribution that is approximately normal.

Histogram

Visual representation of the frequency distribution of your data.

means class_var / Welch

Welch Test

null hypothesis

What you assume to be true when you start your analysis; hypothesis of equality.

effect

a

sample

a subset of the population

generalization

ability to predict outcomes for new data, also known as scoring

ANOVA

categorical predictor variable + continuous outcome variable =

Variables

characteristics or properties of data that can take on different values or amounts for different individuals in the population

convenience sampling

collecting a sample from a section of the population that is easily available

population

complete set of observations or the entire group of objects that you are researching

correlation analysis and linear regression

continuous predictor variable + continuous outcome variable =

Bivariate analysis

describes and explains the relationship between two variables and how they change, or covary, together.

central limit theorem

distribution of sample means is approximately normal, regardless of the population distribution's shape, if the sample size is large enough; about 30 observations.
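
A small simulation sketch in Python, assuming a skewed (exponential) population with mean 1 and standard deviation 1. The seed, sample size of 30, and 2000 replicates are illustrative choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

def sample_mean(n):
    """Mean of n draws from an exponential population (mean 1, sd 1)."""
    return statistics.mean([rng.expovariate(1.0) for _ in range(n)])

sample_means = [sample_mean(30) for _ in range(2000)]

center = statistics.mean(sample_means)   # should be near the population mean, 1
spread = statistics.stdev(sample_means)  # should be near 1 / sqrt(30), about 0.18
print(round(center, 3), round(spread, 3))
```

The sample means cluster around the population mean and vary much less than individual observations do, even though the population itself is far from normal.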

UNITS statement

enables you to obtain an odds ratio estimate for a specified change in a predictor variable

ε

error term

ML estimation

estimate the parameters that are most likely to occur given the data and model assumptions

simple random sample

every possible sample of a given size in the population has an equal chance of being selected

Multivariate or multivariable analysis

examines two or more variables at the same time, in order to understand the relationships among them

variance of the sample mean

how much the value of the sample mean varies from sample to sample

observation number

i

β0

intercept parameter

indexes predictor variable

j

one-sided tests

look for a difference in one direction

discordant

model did not predict the order correctly

concordant

model predicted the order correctly
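
Concordant and discordant pairs can be counted by comparing every event with every non-event. A Python sketch with hypothetical predicted probabilities; ties are counted separately.

```python
def count_pairs(event_probs, non_event_probs):
    """Count concordant, discordant, and tied event/non-event pairs."""
    concordant = discordant = tied = 0
    for p_event in event_probs:
        for p_non in non_event_probs:
            if p_event > p_non:
                concordant += 1  # the event got the higher predicted probability
            elif p_event < p_non:
                discordant += 1  # the non-event got the higher predicted probability
            else:
                tied += 1
    return concordant, discordant, tied

# Hypothetical predicted probabilities of the event.
events = [0.8, 0.6]      # observations where the event occurred
non_events = [0.7, 0.2]  # observations where it did not

pairs = count_pairs(events, non_events)
print(pairs)
```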

Parameters

numerical values that summarize characteristics of a population

honest assessment

partitioning the available data into two data sets: a training data set and a validation data set
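
A minimal Python sketch of such a partition, assuming a simple random split. The function name, the 30% validation fraction, and the seed are illustrative; in SAS the split would typically be done with a procedure rather than by hand.

```python
import random

def split_data(rows, validation_fraction=0.3, seed=1):
    """Randomly partition rows into (training, validation) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_valid = round(len(shuffled) * validation_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]

train, valid = split_data(range(10))  # hypothetical observation IDs 0..9
print(len(train), len(valid))
```

The model is fit on the training set only, and the validation set is held back to assess its predictions.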

PROC PLM

perform post-fitting statistical analyses and plotting for the contents of the store item

X

predictor variable

Univariate analysis

provides techniques for analyzing and describing a single variable at a time

joint sampling

randomly select input-target pairs

imputation

replace missing values with reasonable values

STORE statement

requests that the procedure save the context and results of the statistical analysis into an item store

Y

response variable

central limit theorem

sampling distribution of the sampling means approaches a normal distribution as the sample size gets larger

SURVEYSELECT procedure

selects a random sample from a SAS data set

β1

slope parameter

%GLOBAL statement

specifies that a macro variable (in the course example, the inputs macro variable) is available for the entire SAS session

coefficient of variation

standard deviation expressed as a percentage of the mean

predictive models

statistical model is used to predict future values of a response variable based on the existing values of predictor variables

Statistics

summarize characteristics of a sample

ONLY

suppresses the default plots

SIDES=L

syntax to specify a lower one-sided test

SIDES=U

syntax to specify an upper one-sided test

two-sided two-sample t-test

testing to see whether two group means are significantly different from each other

one-sided two-sample t-test

testing to see which of two group means is greater than or less than the other

observational data

the data was collected for operational purposes (such as tax or accounting purposes) unrelated to statistical analysis

logit transformation

transforms the probability scale to the real line (-∞, +∞)
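
The logit is the natural log of the odds, and its inverse maps any real number back to a probability. A short Python sketch (function names are illustrative):

```python
import math

def logit(p):
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(x):
    """Map a value on the real line back to a probability."""
    return 1 / (1 + math.exp(-x))

# Probabilities below 0.5 map to negative values, above 0.5 to positive values.
for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 4), round(inverse_logit(logit(p)), 4))
```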

inferential models

used to test hypotheses about the data and characterize the relationships among variables

error term

εij

overall population mean

μ
