HAN 467

Samples

A subset of a population (ie. All persons who received prescriptions for that drug from a certain physician)

Standard Deviation

How far data falls from the middle (the mean). In a normal distribution: within 1 standard deviation - 68%; within 2 standard deviations - 95%; within 3 standard deviations - 99.7%

Marginal Probability

associated with a single event (flip coin 5 times, marginal probability of heads in a flip is 1/2)

Final Exam Notes

A) Mean, median, mode (differentiate) and their excel functions B) Dispersion definition and excel functions C) Demonstrate sampling with replacement D) What's the point of N-1?

What's the big picture?

A) Not all data will be available (may come at a high cost) B) Use data to make inferences C) Stating a question (ie. Are our Medicare claims properly completed?)

Chi-square (cont'd)

A) Performed using the [ =CHIDIST() ] function B) The critical value of chi-square can be found using the [ =CHIINV() ] function - the critical value takes two arguments: the probability at which we will reject the null and the degrees of freedom C) For a chi-square analysis, use the [ =CHITEST() ] function

Statistical vs Practical Significance

A) The greater the sample size, the smaller the difference between variables that is needed to reach statistical significance i. statistical significance is associated with the rejection of the null hypothesis ii. practical significance refers to the real-world importance of a difference found in a statistical test B) Chi-square tables can be extended directly by using n-by-two, n-by-n, and two-by-n tables

Type Errors (revisited)

A) Type I Error: - the mistake of rejecting the null hypothesis when it is actually true * the symbol alpha is used to represent the probability of a type I error* B) Type II Error: - the mistake of failing to reject the null hypothesis when it is actually false * the symbol beta is used to represent the probability of a type II error*

The Bartlett Test (part of ANOVA)

A) a chi-square test that tests the 'homogeneity of variances' across groups in ANOVA B) ANOVA will typically not be greatly affected by unequal variances unless the variances differ by about an order of magnitude

P-Value Method

A) Reject H0 if the p-value is less than or equal to alpha (where alpha is the significance level, such as 0.05) B) Fail to reject H0 if the p-value > alpha

Traditional Method

A) reject H0 if the test statistic falls within the critical region B) Fail to reject H0 if the test statistic doesn't fall within critical region

Outcome

a collection of events (ie. flip a coin five times; the set of results is an outcome, such as {HTTHT}) *mutually exclusive events or outcomes can't occur at the same time

Bar Graph (*Discrete Data)

a diagrammatic (ie. in the form of a diagram) comparison of discrete variables (Categorical data)

Independent variable

a variable whose values are assumed to be unaffected by other variables in a given analysis

Event

an observation to which a probability may be assigned (ie. a coin flip has one event that occurs: heads or tails) *two events are independent of each other if the occurrence of one does not change the probability of the other

Empirical probability

based on actual observations or event occurrences (ie. what's the proportion of blue M&Ms in a bag?) ** Observations are made using historical data

Textbook

material in text assumes data were drawn in either a simple random method or a stratified method with the number of observations drawn from each stratum being proportional to the size of the stratum

Probability

mathematical expression of 'chance' of the occurrence of an event (ie. flipping a coin)

=BINOM.DIST()

produces binomial probabilities *when the cumulative argument is TRUE, it always accumulates from the lowest number of successes (0) up to the specified number
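
As a rough cross-check outside Excel, the same binomial probabilities can be sketched in Python. The helper names `binom_pmf` and `binom_cdf` are made up for this illustration; the coin-flip numbers come from the joint probability card (five flips, probability of heads 1/2).

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """Probability of k or fewer successes (accumulates from 0 up to k)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

n, p = 5, 0.5
exactly_five_heads = binom_pmf(5, n, p)  # comparable to =BINOM.DIST(5,5,0.5,FALSE)
at_most_two_heads = binom_cdf(2, n, p)   # comparable to =BINOM.DIST(2,5,0.5,TRUE)
total = sum(binom_pmf(k, n, p) for k in range(n + 1))  # probabilities sum to 1

print(exactly_five_heads, at_most_two_heads, total)
```

The non-cumulative call matches the (1/2)^5 = 1/32 joint probability, and the total illustrates that the probabilities must sum to 1.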

Histogram (*Continuous Data)

represents the frequency distribution of continuous variables (Numerical data)

Sampled Population

the population that the actual sample is selected from *samples are generally drawn from a population close to a 'target' population* **Can be difficult to draw truly random sample from entire target population

Statistics

the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample (Collect, Analyze, Infer)

Important Traits of Binomial Probability

*Probabilities must sum to 1 ** The most 'ways' for an outcome to occur is in the middle of the distribution

Spreadsheets vs. Workbooks

- Excel workbooks are a collection of spreadsheets or tabs - Spreadsheet: 16384 columns x 1048576 rows

Excel Functions

- Functions or arithmetic referencing cells must start with an '=' - Arithmetic operators (+,-,/,*,^) may all be used to calculate values (e.g. with 2 in A1 and 3 in B1: =A1+B1 (2+3 = 5); =A1*B1 (2*3 = 6); =A1/B1 (2/3 = 0.67))

One-way analysis of Variance

- a technique used to compare the means of two or more samples that can be used only for numerical data A) Relating the analysis of variance to the t-Test i. The t-Test is a test of whether two mean values can be thought of as being different from each other ii. Analysis of variance examines the same question, but for more than 2 levels of a categorical variable

Confidence Intervals

- because a confidence interval estimate of a population parameter contains the likely values of that parameter, we reject a claim that the parameter has a value that is not included in the confidence interval

Extension of Pivot Table: Two Variables

- pivot tables with two variables may be created by dragging a second categorical variable into the 'Column Values' - clicking on the variable listed in the 'Values' area and selecting the 'Value Field Settings' will bring up this dialog - Within that dialog box, the user can choose the type of calculation to be used to summarize the data *A custom name for the summary may also be defined within that dialog box* (e.g. Calculating the average charge by DRG for Male and Female patients and overall with Pivot tables)

=IF() Function

- provides a way to make decisions under specific conditions - the =IF() function tests a condition and produces one result if the condition is true and a different result if the condition is false *A logical function: the result depends on whether the condition evaluates to TRUE or FALSE*

Measures of Central Tendency and Dispersion

- summary statistic that represents the center point or typical value of a dataset (where most values in a distribution fall) A) Mean B) Median C) Mode * Each of these measures calculates the location of the 'central point' using a different method* **ie. Histogram - 'symmetric'; Histogram - 'skewed'; 'Categorical' ** *** need to determine the type of data before choosing a measure of central tendency***

P-value

- the probability of getting a value of the test statistic that is at least as extreme as the one representing the sample data, assuming that the null hypothesis is true *null hypothesis is rejected if the P-value is very small, such as 0.05 or less*

Significance Level (denoted by alpha)

- the probability that the test statistic will fall in the critical region when the null hypothesis is actually true

Chi-Square Analysis

1. Make a hypothesis based on your basic biological question 2. Determine the expected frequencies 3. Create a table with observed frequencies, expected frequencies, and chi-square values using the formula: (O-E)^2/E 4. Find the degrees of freedom: (c-1)(r-1) 5. Find the critical value of chi-square in the Chi-Square Distribution table 6. If the critical value > your calculated chi-square value, you do not reject your null hypothesis; if your calculated value exceeds the critical value, you reject it.
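
The steps above can be sketched in Python for a small table. The observed counts are invented purely for illustration, and `chi_square` is a hypothetical helper name. The critical value 3.841 is the standard table entry for 1 degree of freedom at alpha = 0.05.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency
            stat += (o - e) ** 2 / e               # (O-E)^2/E
    df = (len(col_totals) - 1) * (len(row_totals) - 1)  # (c-1)(r-1)
    return stat, df

observed = [[10, 20], [30, 40]]  # hypothetical 2x2 table of observed counts
stat, df = chi_square(observed)
critical = 3.841                 # table value for df = 1, alpha = 0.05
reject_null = stat > critical

print(round(stat, 4), df, reject_null)
```

Here the calculated statistic is well below the critical value, so the null hypothesis would not be rejected for these made-up counts.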

Marginal, Joint and Conditional Probabilities

Sequential Events

Random sample

sample drawn in a manner whereby every member of the population has a known probability of being selected *Statistics assumes the presence of random samples* *Almost always the preferred sampling methodology*

Nonrandom Sample

sample drawn on a nonrandom basis *Generally not relevant*

Ordinal Level of Measurement

"if data can be arranged in some order, but differences (obtained by subtraction) between data values either cannot be determined or are meaningless" (e.g. A college professor assigns grades of A, B, C, D, or F. These grades can be arranged in order, but we can't determine differences between such grades.) ie. data can be arranged in order, but the interval space between numbers/variables isn't set or clear *Can't differentiate interval space between letter grades because they lack a numerical value that can be measured*

Special Traits of Normal Distribution

*68% of distribution is within 1 standard deviation of mean **95% of distribution is within 2 standard deviations of mean ***99.7% of distribution is within 3 standard deviations of mean

Chart Function - Column Chart

A) "Insert Chart" - data is skewed to the left * Y-axis is observations * X-axis is age categories (BIN values) **Think of BIN, as a way to categorize data ***There's also Line chart, Bar chart, Pie chart, XY-Scatter chart, but column chart is mostly used

Simple Addition Rule-Top

For mutually exclusive events, the marginal probability is equal to the sum of relevant joint probabilities (ie. What's the marginal probability of a true emergency? P[True] = P[True and 1st] + P[True and 2nd] + P[True and 3rd] )

Calculating Confidence Limits for Multiple Samples

*if a large number of samples were selected, approx 95% of the resulting confidence intervals would include the true population mean*

Critical value

- any value that separates the critical region from the values of the test statistic that do not lead to rejection of the null hypothesis

The =FREQUENCY() Function (cont'd)

- the =FREQUENCY() Function may be used to create counts by age ranges - to determine the age ranges of interest, use the min and max functions to determine the range of ages present in the data *these values may be used as a guide for choosing the 'bins'* (e.g. splitting the range into 6 bins) Steps: 1. Determine min and max 2. Calculate the range by getting the difference between the max and min 3. Divide the calculated range by 6 to get the bin width
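
A minimal Python sketch of the binning steps, assuming the notes mean six bins of equal width. The ages are invented, and `frequency` is a hypothetical helper that mimics how =FREQUENCY() counts: each bin holds values above the previous bin value and at or below its own, with one extra slot for anything above the last bin.

```python
from bisect import bisect_left

def frequency(data, bins):
    """Count values per bin; bins must be sorted in ascending order."""
    counts = [0] * (len(bins) + 1)
    for x in data:
        counts[bisect_left(bins, x)] += 1  # first bin whose value is >= x
    return counts

ages = [18, 23, 29, 30, 35, 36, 41, 47, 52, 60]  # hypothetical ages

# Step 1: min and max. Step 2: range. Step 3: divide the range by 6.
lowest, highest = min(ages), max(ages)
width = (highest - lowest) / 6
bins = [lowest + width * (i + 1) for i in range(6)]  # upper edge of each bin

counts = frequency(ages, bins)
print(bins)
print(counts)
```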

Functions that give results in more than one cell

A) =FREQUENCY() function - counts the number of observations that fall in each bin (at or below each specified bin value and above the previous one) - takes two arguments B) =RAND() function - inserts a random number into a cell - Excel can perform matrix math functions as well as multiple regression C) =MMULT() function - performs matrix multiplication (e.g. for 2x2 matrices)

Two-Sample Hypotheses

A) Null Hypothesis: - The frequency distribution of variable Y does not differ as a result of group membership in variable X. B) Non-Directional Alternative Hypothesis: - The frequency distribution of variable Y does differ as a result of group membership in variable X.

Decisions and Conclusions

Decision Criterion: - decision to reject or fail to reject the null hypothesis is usually made using either the traditional method of testing hypothesis, the P-value method, or the decision is sometimes based on confidence intervals

Hypothesis Testing Example

Example: A public health nurse knows that the height-for-age scales indicate that the height of six-year-old boys is approximately normally distributed with a mean of 48 inches. Suppose she wishes to select a sample of 100 six-year-old boys from a county to determine if they are growing at the expected rate. H0: mean = 48 H1: mean not = 48 Suppose the sample results in a mean of 49 inches and a standard deviation of 10 inches. Is the value 48 contained in a 95% confidence interval? CL: 49 +/- (2)x(10/sqrt(100)) = 49 +/- 2, or (47", 51") Since the H0 value of 48 is contained in the confidence limits, she concludes that the average height for six-year-old boys in her county is consistent with the national average.
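
The arithmetic in this example can be checked with a short Python sketch. The function name `confidence_limits` is made up for illustration, and the multiplier of 2 is the rounded value used in the example rather than an exact t or z value.

```python
from math import sqrt

def confidence_limits(mean, sd, n, multiplier=2):
    """Return (lower, upper) limits: mean +/- multiplier * standard error."""
    se = sd / sqrt(n)  # standard error of the mean
    return mean - multiplier * se, mean + multiplier * se

lower, upper = confidence_limits(mean=49, sd=10, n=100)
h0_value = 48
consistent_with_h0 = lower <= h0_value <= upper  # True means fail to reject H0

print(lower, upper, consistent_with_h0)
```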

What does the dollar sign mean in Excel?

It means that the row or column which comes after the dollar sign is anchored or absolute. When you copy Excel formulas, they will copy cells referred in that formula relative to the position where they are being copied to.

Variables

Review of types of variables: A) Categorical and Numerical B) Continuous Numerical C) Can also be referred to as the scale in which they are measured

Factorials

multiplies the number to which the factorial refers by every positive whole number less than it (ie. 4! = 4x3x2x1 = 24)

Sampling (cont'd)

other random sampling mechanisms - =RANDBETWEEN() - data analysis add-in: random numbers from different distributions - random numbers from Uniform distribution - random numbers from Normal distribution - random numbers from Bernoulli distribution (only generates discrete random numbers) - random numbers from Binomial distribution - random numbers from Poisson distribution (not a random number generator) - random numbers from Discrete distribution

Secondary Data

refers to data that was collected by someone other than the user. e.g. Common sources of secondary data for social science include censuses (information collected by the Federal Gov't, but used by an investigator to conduct research).

Data

the 'raw material' of statistical analysis *Two aspects of data: A) Data is recorded in regard to or about cases of observations B) Data is made up of variables

Imputation (In Statistics)

the process of replacing missing data with substituted values. i) Unit Imputation: - substituting for a data point ii) Item Imputation: - substituting for a component of a data point

Sample space

the set of all possible outcomes of an experiment (ie. coin flip: sample space is {head, tail})

Joint probabilities

the simultaneous occurrence of two or more events (ie. flip coin 5 times, joint probability of heads on all 5 flips is (1/2)^5 = (1/32))

Independence of Two Variables

(Chi-square and Type I & II errors) A) Chi-square is not a normal distribution, it tends to be heavily skewed to the right B) Due to the skewed distribution, these errors may become situational depending on the sample selected

Null Hypothesis (H0)

- a statement that the value of a population parameter (such as proportion, mean, or standard dev) is equal to some claimed value *we test the null hypothesis directly in the sense that we assume it is true and reach a conclusion to either reject H0 or fail to reject H0

The Grand Mean (SS_T) [Part of ANOVA]

*Note: (SS_B): Between-group sum of squares and (SS_W): Within-group sum of squares A) SS_T = SS_W + SS_B B) Can be calculated in Excel using the [ =SUM() ] function C) Can be used as a check for mean square error

Basic Functions

1) Click on Fx in Formula Ribbon to display insert Function dialog box A) =Average(cell references) will calculate the average of the cells listed in () B) =Sum(cell references) will calculate the sum of the cells listed in ()

Scale of Measurement (1 lowest; 4 highest)

1) Nominal - attributes are only named; weakest Property: A) Identity e.g.: gender 2) Ordinal - attributes can be ordered Properties: A) Identity B) Magnitude e.g. the results of a horse race (* the distance between scale points is not equal) 3) Interval - distance is meaningful Properties: A) Identity B) Magnitude C) Equal Distance e.g. Survey Rating: Strongly Agree, Agree, Neutral, Disagree, and Strongly Disagree *Zero point is arbitrary/ not set* 4) Ratio - absolute zero Properties: A) Identity B) Magnitude C) Equal Distance D) Absolute/True Zero e.g. Money *Zero point is set* *Best way to remember: No Oil In Rivers: The 1st letters of the scales in order from lowest to highest*

Dummy Variable

Also known as a Boolean indicator, it's a variable that takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. **Think Absolute/ True Zero**

Categorical Variables

variables whose levels are distinguished simply by names - can be represented by numbers - classified as "discrete" or "continuous" - "Gender" is a categorical variable; type of insurance is a "multilevel" categorical variable; ICD-10-CM is categorical

Sampling Distribution of the Mean

* Mean can be estimated by drawing successive samples ** a True Mean can be found by calculating the mean of all sample means A) Standard Error: - the standard deviation of the sample means for samples of a given size B) Random Number Generation - an excel add-in that draws multiple random samples from a population (use excel to calculate measures of Central Tendency and Dispersion)

Hypothesis Testing

*Inferential Statistics: - science of converting data collected via sampling into intelligent guesses about a population* Hypothesis Testing: - process of determining from a sample whether something could be true about a population or not Hypothesis Statements: A) Null Hypothesis (H0 or H-naught) - status quo or assumption prior to sampling B) Alternative Hypothesis (H1) - research hypothesis *Hypothesis test is performed to determine if null hypothesis should be rejected based on the sample data* **this test has a probability of making an error associated with it; this is analogous or comparable to the confidence level of a confidence interval**

The =MAX() & =MIN() Functions

- Determine min and max values of any data set *For frequency distribution, be sure to include enough categories, but not too many (5-8 preferred) i) Fewer than five categories is considered insignificant ii) More than eight is difficult to conceptualize at a single glance

More on Hypothesis Testing

- a hypothesis is a claim or statement about a property of a population *so, a hypothesis test is a standard procedure for testing a claim about a property of a population*

7.1 Confidence Interval

- a range within which we expect a true population value (mean,proportion, etc) to lie *population mean or proportion is often unknown* **Rely on a sample to make conclusions or inference about population mean or proportion** IE: Hospital CFO wishes to estimate mean cost of inpatient stay. The population value is unknown, so CFO relies on a sample of 100 cases to estimate mean cost

The F - Test (part of ANOVA)

- a statistical test in which the test statistic has an F distribution if the null hypothesis is true: $F = \dfrac{SS_B / df_B}{SS_W / df_W} = \dfrac{MS_B}{MS_W}$ A) Can be found using the =FDIST() Function-Pg 330 B) Excel contains an add-in for one-way analysis of variance C) Single-factor Data Analysis Add-In * [ =FINV() ] function - used to find F critical, or F crit - F crit is the value that must be reached in order to reach the desired level of confidence
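
To make the F ratio concrete, here is a minimal Python sketch that computes the between-group, within-group, and total sums of squares and then F. The three groups are invented numbers, `one_way_anova` is a hypothetical helper name, and the degrees of freedom are assumed to be (number of groups - 1) and (total observations - number of groups).

```python
def one_way_anova(groups):
    """Return (SS_B, SS_W, SS_T, F) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Between-group: group size times squared distance of group mean from grand mean
    ss_b = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group: squared distance of each value from its own group mean
    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    # Total: squared distance of each value from the grand mean
    ss_t = sum((x - grand_mean) ** 2 for x in all_values)

    df_b = len(groups) - 1
    df_w = len(all_values) - len(groups)
    f = (ss_b / df_b) / (ss_w / df_w)  # MS_B / MS_W
    return ss_b, ss_w, ss_t, f

groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]  # hypothetical data
ss_b, ss_w, ss_t, f = one_way_anova(groups)
print(ss_b, ss_w, ss_t, f)
```

The output also illustrates the check SS_T = SS_W + SS_B.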

Test Statistic

- a value computed from the sample data, and it's used in making the decision about the rejection of the null hypothesis. *the test statistic is found by converting the sample statistic to a score with the assumption that the null hypothesis is true*

Creating A Pivot Table

1. Highlight the data to include in the summary 2. From the Insert ribbon, select PivotTable 3. The 'Table/Range:' blank field should contain the range of data highlighted in first step 4. Click OK 5. Drag and drop the 'Gender' variable to the 'Row Labels' 6. Drag and drop the 'Gender' variable to the 'Values' 7. The count of patients by gender is produced 8. Functions of numerical variables may also be computed in the pivot table (sum, average, etc.)

The Excel Graphs

- graphical displays of data are useful in helping an audience understand the results of an analysis - Excel includes a very robust set of graphing tools *Insert Chart dialog box displays 11 types of graphs* A) Identifying and Formatting Chart Data 1. Select Data 2. Choose Chart Type 3. Click OK 4. Right Click to modify chart/graph as need be B) The Select Data Source Dialog Box *4 Distinct Areas: i) Chart Data range area ii) Switch Row/ Column Button iii) Labels iv) Legend Entries C) Formatting the finished graph - select the 'Design ribbon' and select layout style *The default column chart can be customized with titles/legends using the chart formatting ribbon*

Confidence Interval (Cont'd)

- the mean value for each potential random sample of 100 cases will be different since the cases making up each sample are different - the confidence interval will provide the CFO with a range of possible values for the mean cost and an associated probability that the population mean is contained in that range General form of a confidence interval: Point estimate (+/-) t(SE)[point estimate] *t is the number of standard errors on each side of the mean required to encompass a desired proportion of the t-distribution (e.g. 95%) IE: 0 SD shows the proportion in the range of 1SD below the mean ** [ =TINV() ] function returns the t-value corresponding with the 'two-tailed probability P' (P-value) and the specified degrees of freedom, df. TINV is the 'inverse' function of the TDIST function

Using the Histogram Tool

- part of the data analysis toolpak Steps: 1. Select the data analysis button from the data ribbon 2. Select Histogram > OK 3. Populate the histogram dialog as depicted 4. The same frequency table as that produced with the =FREQUENCY() Function will be created 5. If the 'chart output' is requested, then a graph of the frequency distribution will also be created

Distribution of Proportion

- proportion distributions can be treated as normal distributions in order to find Standard Error *performed using [ = NORMDIST() ] function * ** N-1 only performed if the population inequality sign in excel is > **

Interval Level of Measurement

- similar to ordinal level of measurement, but with the additional property that "the difference between any two data values is meaningful" e.g. The years 1000, 2000, 1776, and 1492

Ratio Level of Measurement

- similar to the interval level, but with the additional property that there is also a "natural zero starting point" e.g. Prices of college textbooks; Visits, total cost, and cost per visit (cost/visit) * each has equal intervals ** each has a true zero point

Alternative Hypothesis (H1 or Ha)

- statement that the parameter has a value that somehow differs from the null hypothesis. The symbolic form of the alternative hypothesis must use one of these symbols: [ <, >, or ≠ ]

=FREQUENCY() Function

- tabulating counts or frequencies is a common need in summarizing data - Excel offers two methods: 1) =FREQUENCY() Function 2) Pivot Tables *Typically data is one column* A) =FREQUENCY() is an array function that requires two arguments: i) Data (data to be summarized) ii) Bins (categories that define frequency distributions) * =FREQUENCY() is useful when summarizing numerical data in ranges for tables or histograms **Is typically initiated by the =MAX() & =MIN() Functions**

Critical Region (rejected region)

- the set of all values of the test statistic that cause us to reject the null hypothesis

ANOVA ("ANalysis Of VAriance")

- the test of differences i. Means can be tested and compared using ANOVA: $F = \dfrac{(\bar{x}_1 - \bar{x}_2)^2 / (1/n_1 + 1/n_2)}{MS_W}$ ii. F may have to be adjusted in order to account for the inflation of alpha: $\alpha_{Adj} = 1 - (1 - \alpha)^{1/c}$
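
A small Python sketch of the alpha adjustment. It assumes that c is the number of pairwise comparisons being made, which the card does not state explicitly; `adjusted_alpha` is a made-up name.

```python
def adjusted_alpha(alpha, c):
    """Per-comparison alpha, assuming c is the number of comparisons."""
    return 1 - (1 - alpha) ** (1 / c)

alpha = 0.05
c = 3  # assumption: e.g. three pairwise comparisons among three groups
alpha_adj = adjusted_alpha(alpha, c)

# Running c tests at alpha_adj brings the overall error rate back to alpha
overall = 1 - (1 - alpha_adj) ** c

print(round(alpha_adj, 6), round(overall, 6))
```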

Discrete Numerical Value

- values that only take on whole numbers - allow a different definition of mean and standard deviation * results will typically be similar, with the difference being that these formulas will treat each observation individually *

Dispersion

- variation in a set of data A) Range: - difference between the largest and smallest values B) Variance: - measures how far a data set is spread out *"the average of the squared differences from the mean", which provides a 'general idea of the spread of the data'* *A value of zero means that there is "no variability"; all the numbers in the data set are the same* [ =VAR() ] C) Standard Deviation: - square root of variance [ =STDEV() ] **N-1 is used as an unbiased estimator**
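
The effect of dividing by N-1 can be seen in a short Python sketch using the standard `statistics` module. The data values are invented. `statistics.variance` and `statistics.stdev` divide by n-1, as =VAR() and =STDEV() do, while `statistics.pvariance` divides by n.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations, mean = 5

data_range = max(data) - min(data)             # largest minus smallest
sample_variance = statistics.variance(data)    # sum of squared deviations / (n-1)
population_variance = statistics.pvariance(data)  # sum of squared deviations / n
sample_sd = statistics.stdev(data)             # square root of the sample variance

print(data_range, sample_variance, population_variance, sample_sd)
```

The sample variance is slightly larger than the population variance because the same sum of squared deviations (32) is divided by 7 rather than 8.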

Example of Distribution of a Proportion

1) In a sample of 80 forms, 60 had been filled out correctly; what's the probability that the population proportion exceeds 85%? 2) Sample Proportion = 60/80 = 0.75 3) Approximate with a normal distribution with a mean of 0.75 and a standard error of SqrRt((0.75(1-0.75))/80) = 0.048 4) Use the Excel [ =NORMDIST() ] Function: 1 - NORMDIST(0.85,0.75,0.048,1) = 0.019 * we subtract the value of the NORMDIST() function from one to get the 'tail' probability of more than 85%
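
The same calculation can be reproduced in Python as a sanity check, using `statistics.NormalDist` from the standard library in place of NORMDIST. This sketch keeps the unrounded standard error, so the result differs slightly from a calculation that uses 0.048 but still rounds to 0.019.

```python
from math import sqrt
from statistics import NormalDist

correct, n = 60, 80
p_hat = correct / n                    # sample proportion, 0.75
se = sqrt(p_hat * (1 - p_hat) / n)     # standard error of the proportion

# Upper-tail probability that the population proportion exceeds 85%
tail = 1 - NormalDist(mu=p_hat, sigma=se).cdf(0.85)

print(round(se, 3), round(tail, 3))
```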

Basics of Hypothesis Testing

1. Given a claim, identify the null hypothesis and the alternative hypothesis, and express them both in symbolic form. 2. Given a claim and sample data, calculate the value of the test statistic. 3. Given a significance level, identify the critical value(s). 4. Given a value of the test statistic, identify the P-value. 5. State the conclusion of a hypothesis test in simple, nontechnical terms. 6. Identify the type I and type II errors that could be made when testing a given claim.

Creating a Pareto Chart

A) Pareto charts are a 'method of looking at individual and cumulative frequencies at the same time' B) Pareto charts are 'used in quality assurance projects' to ensure that the most significant issues are addressed first C) The categories in a Pareto chart are arranged from most to least frequent D) Detailed instructions on creating a Pareto Chart (ie. The first three categories in a Pareto chart are the ones that need the most concern.)

Chi Square Test (additional)

A) Relating to or denoting a statistical method assessing the goodness of fit between observed values and those expected theoretically. B) Used to examine differences in the distributions of nominal data C) A mathematical comparison between expected frequencies and observed frequencies D) (i) Theoretical, or (ii) Expected, Frequencies: developed on the basis of some hypothesis E) (iii) Observed Frequencies obtained empirically through direct observation

Sampling

A) Samples and Populations - populations are the total 'N' of the focus of the study - samples are the 'n' of the values taken from 'N' to make the inference on 'N' 1) ***Parameters refer to populations*** 2) *** Statistics refer to samples*** - a statistic is a value derived from a sample to make an inference on a population B) Drawing a random sample - generally accepted that people can't truly be random *random samples are not representative* - a mechanism is required for random sampling - can also be done using the =RAND() Function in Excel
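
For comparison with the =RAND() approach, here is a minimal Python sketch of drawing a random sample with and without replacement. The population of 20 numbered cases and the seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(467)           # fixed seed so the draw can be repeated
population = list(range(1, 21))    # hypothetical population, N = 20 case IDs

# Without replacement: a case can be chosen at most once
without_replacement = rng.sample(population, 5)

# With replacement: a case goes back into the pool and may be drawn again
with_replacement = rng.choices(population, k=5)

print(without_replacement)
print(with_replacement)
```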

Types of Distributions

A) Skewed i) Left skewed shows the tail of the graph is on the left ii) Right skewed is reverse and is more common than left * Costs skew to right usually * B) Normal: - not skewed in any direction C) Uniform: - almost equal numbers at each interval, not usually seen in health statistics

T-Test for Comparing Two Groups

A) Stratified Sampling i. Sampling in which the researcher divides the population into groups or strata, and then performs random sampling ii. groups should be of equal size iii. IE: costs/discharge male vs. female B) Using a t-test to compare two means - using two groups will result in two means. A t-Test can be run to determine whether the means of the two samples could have come from two populations in which the true mean of costs was the same

Assumptions of Chi-square

A) 1 or more categories B) Independent observations C) A sample size of at least 10 D) Random sampling E) All observations must be used F)For the test to be accurate, the expected frequency should be at least 5

The Dollar Sign $ Convention for Cell References

A) Absolute and relative cell references - Excel formulas can refer to other cells - Relative references make copied formulas refer to a relative area - Absolute references make copied formulas refer to a specific area i) Absolute references are made using the ($) sign (e.g. $B$84) ii) Can also be accomplished using F4

Review of Types of Variables

A) Categorical variable: - names or labels (e.g. hair color, age, etc.) B) Numerical variable: - the measurement or number has a numerical meaning. (e.g. total rainfall measured in inches, heart rate, etc.) C) Continuous Numerical variable: - an unbroken set of observations; that can be measured on a scale, any numeric value, within a finite or infinite range of possible values. (e.g. Height, time in a race, weight of an animal) *Finite or infinite* D) Discrete Numerical variable: - countable in a finite amount of time. (e.g. you can count the change in your pocket; you can count the money in your bank account; you could also count the amount of money in everyone's bank account; number of cars on a train) *Only Finite*

Statistical Tests - 5 Tests

A) Chi-Square tests: - used to assess the independence of two categorical variables (ie. Test for Independence) **Only categorical variables** e.g. Is there a relationship between age and gender? B) t Test: - used to assess the independence of a numerical variable and a categorical variable - one variable is a discrete or continuous numerical variable and the other variable is categorical with only two values (ie. An independent samples t-test is used when you want to compare the means of a normally distributed interval dependent variable for two independent groups.) **1 numerical variable (discrete or continuous) & 1 categorical variable with only two values** ***1 dependent & 1 independent variable*** e.g. Compare cost per visit for males and females. i) cost per visit is the dependent, continuous numerical variable ii) gender is a single 2-value independent variable C) Analysis of Variance (ANOVA): - extension of the t-Test - used to assess two or more variables' independence from each other - one variable is a discrete or continuous numerical variable and the other variables are categorical with any number of values (ie. ANOVA is used to determine whether there are any statistically significant differences between the means of two or more independent groups.) **ANOVA provides a statistical test of whether the population means of several groups are equal, and therefore generalizes the t-Test to more than two groups** (ie. 1 variable either discrete or continuous; other variables are categorical) D) Regression: - used to assess the independence of two or more discrete or continuous numerical variables - may also include one or more categorical variables of no more than 2 values each (ie. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships.) Goal: Find the cause-and-effect relation between these variables. e.g. Finding the cause-and-effect relation between lifestyle choices and health status/costs E) LOGIT: - an extension of regression - *2-level categorical value* - used to examine the independence of 2 or more variables in which the dependent variable is a dichotomous variable (one that takes on only one of two possible values when observed or measured). The value is most often a representation of a measured variable (e.g. age: under 65 or 65 and over) or an attribute (e.g. gender: male or female) *The set of variables may be categorical or numerical, discrete or continuous* **dependent variable = 2-level categorical variable (e.g. insurance coverage: yes/no); independent variable = categorical or numerical** ***Also known as logistic regression. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.*** - the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). e.g. How do variables such as GRE, GPA, and prestige of undergraduate institution affect admission into graduate school?

Review Types of Statistical Testing for Variables

A) Chi-square test B) t Test C) ANOVA D) Regression E) LOGIT (Logistic Regression)

Small Expected Values in Cells

A) Chi-square tests assume relatively little about the data - this allows small expected values in any table cell to inflate the value of the chi-square and increase the likelihood of a Type 1 error B) Yates Correction - if the expected value in any cell of a two-by-two table is less than 10, dealing with small expected value can be done by taking the absolute value of the difference between the observed and expected frequencies and subtracting 0.5 before the value is squared
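
A minimal Python sketch of the correction as described on this card, comparing the uncorrected and corrected statistic for a two-by-two table. The counts are invented for illustration (their expected values are not actually small), and `chi_square_2x2` is a hypothetical helper name.

```python
def chi_square_2x2(observed, yates=False):
    """Chi-square for a 2x2 table, optionally with the Yates correction."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            diff = abs(o - e)
            if yates:
                diff -= 0.5  # subtract 0.5 before squaring, as the card describes
            stat += diff ** 2 / e
    return stat

observed = [[10, 20], [30, 40]]  # hypothetical counts
plain = chi_square_2x2(observed)
corrected = chi_square_2x2(observed, yates=True)

print(round(plain, 4), round(corrected, 4))
```

The corrected value is smaller, which is how the correction offsets the inflation of chi-square.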

Cluster vs Stratified?

A) Cluster - only a few of the groups or clusters actually have members represented in the final sample B) Stratified - all groups, or strata, have members represented in the final sample **Remember: stratified samples ensure that smaller groups are represented in the final sample**

Cell References

A) Contiguous range: B4:B8 or B4..B8 **A contiguous range of cells is a group of highlighted cells that are adjacent to each other, such as the range C1 to C5.** B) Non-contiguous range: B4:B8, D4:D8 **A non-contiguous range consists of two or more separate blocks of cells.**

Correlation vs Causation

A) Correlation: - a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables * usually associated with measuring a linear relationship ** a correlation between variables, however, does not automatically mean that the change in one variable is the cause of the change in the values of the other variable B) Causation: - indicates that one event is the result of the occurrence of the other event e.g. "there is a causal relationship between two events" => there's a cause and effect ***Correlation does not imply Causation***

Why is Dispersion Complicated?

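A) Cumulative Differences: - summing each value's difference from the mean does not work, because the sum of the negative differences equals the sum of the positive differences and the total is always zero B) Absolute Differences: - taking absolute values avoids the cancellation, but doesn't work well with the mean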
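
A small Python sketch of the problem, using hypothetical data. It shows the raw differences cancelling to zero, the absolute-difference alternative, and the squared-difference approach with an n - 1 divisor, which is what `statistics.variance` computes for a sample.

```python
import statistics

data = [2, 4, 4, 6, 9]  # hypothetical values
mean = statistics.mean(data)

# Differences from the mean: negatives and positives cancel out
deviations = [x - mean for x in data]
sum_dev = sum(deviations)

# Absolute differences avoid the cancellation
mean_abs_dev = sum(abs(d) for d in deviations) / len(data)

# Squared differences, divided by n - 1, give the sample variance
sample_var = sum(d ** 2 for d in deviations) / (len(data) - 1)
sample_sd = sample_var ** 0.5

print(sum_dev, mean_abs_dev, sample_var, round(sample_sd, 3))
```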

Cumulative Frequencies & Percentages

A) Cumulative Frequency - smallest BIN value in first BIN category - cumulative frequencies in 2nd and subsequent BINs B) Percentage & Cumulative Percentage Distribution: - Formulas are the same as cumulative frequency distribution. (Copy the range and paste)
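
As a rough illustration of the running totals described above, the sketch below builds cumulative frequencies and cumulative percentages in Python rather than Excel. The bin counts are hypothetical.

```python
from itertools import accumulate

frequencies = [4, 7, 6, 3]  # hypothetical counts for four bins, smallest bin first

# First bin keeps its own count; each later bin adds to the running total
cumulative = list(accumulate(frequencies))

total = sum(frequencies)
percentages = [100 * f / total for f in frequencies]
cumulative_pct = [100 * c / total for c in cumulative]

print(cumulative)
print(percentages)
print(cumulative_pct)
```

The last cumulative frequency equals the total count, and the last cumulative percentage is 100.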

Ch9 Final

A) Define t test; be able to identify examples of using in healthcare data analyses B) Be able to establish an H0 and an H1 hypothesis. Be able to differentiate and select correct use of same. Differentiate between H0 & H1 hypothesis that reflect 2 tail vs 1 tail tests C) Be able to identify the equation for finding t. (see pg 297 in text). D) What is "standard error of means"? E) What are degrees of freedom? How is d/f determined? F) Differentiate between 2-tailed and 1-tailed tests G) Be able to identify how to appropriately use the =TDIST() function. H) What arguments are used for the ()? I) Interpret results from a t test. Correctly identify a 1 tail result. J) What is the significance of a Type I vs Type II error? K) Identify examples of using a t test to compare 2 groups L) What does "stratified" sampling mean? How is a stratified sample selected? M) What are the steps in comparing 2 means? (pg 304-305) Be able to identify conclusion reached in the text regarding the example on pg 304-305 N) Know concepts related to unequal size samples (pg 307) What is a "control" group, "experimental" group--differentiate O) Differentiate a "pooled variance statistic" and identify what it does (pg 308) Was there a difference between the control group and the experimental group? What was the conclusion of the experiment? P) Considering the difference between the test related to cost differences and the experiment related to breast cancer awareness, is there an indication of the cause of the differences identified? Q) What is the Excel t test add-in function used for? - The t-Test is used to test the null hypothesis that the means of two populations are equal

Discrete & Continuous Numerical Variables

A) Discrete Numerical Variables: - variables that can only take on whole-number values, such as counts of people, events, or organizations (e.g. the number of E.D. patients is a discrete variable; a sample can never represent less than a whole number of persons; i.e., discrete variables take only integer values, no fractions or decimals) B) Continuous Numerical Variables: - variables that can take on any value, including whole numbers or decimal places. *While the number of persons in the E.D. is a discrete variable, the amount of time spent before being seen by a caregiver is a continuous numerical variable*

Summary of Discrete vs Continuous

A) Discrete data: - type of data that has clear spaces between values - discrete data is countable - contain distinct or separate values - Bar graph - (Function of graph): shows isolated points (e.g. Days of the week, number of students in class, number of cars in a parking lot) B) Continuous data: - data that falls in a continuous sequence, including fractions & decimals. - continuous data is measurable - can take any value in some interval - Histogram - (Function of graph): show connected points (e.g. market price of a product, Age, height/weight, temperature, time, etc.)

Roles of Statistical Applications

A) Documentation of Medicare Reimbursement Claims B) Emergency Trauma Color Code C) Calculating length of stay, readmission rates, and cost per case (Using ANOVA) D) Establishing statistical differences between two groups (physician billing) E) Calculating a standard hourly rate for personnel F) Breast Cancer education Effectiveness

The Distribution of Frequencies

A) Frequency Distribution: - a method of characterizing a data set by the manner in which it is distributed [ =FREQUENCY() ] * typically used after average and standard deviation functions * B) Normal Distributions: - a symmetrical, bell-shaped distribution in which most observations fall toward the middle [ =NORMDIST() ] For area, use [ =NORMDIST(x, mean, stdev, 1) ] * based on scales that are continuous, not discrete ** about 68 percent of all values are within 1 SD *** about 95 percent of all values are within 2 SD **** about 99.7 percent of all values are within 3 SD C) Cumulative Normal Distributions - sums all the probabilities of the normal distribution up to a given value
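
The 1, 2, and 3 SD shares can be checked with Python's standard library instead of Excel. In this sketch `share_within` is a hypothetical helper name; `NormalDist.cdf` plays the role of the cumulative normal distribution.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1


def share_within(k):
    """Share of values within k standard deviations of the mean."""
    # Cumulative probability up to +k minus cumulative probability up to -k
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

For a non-standard distribution, `NormalDist(mean, stdev).cdf(x)` gives the area up to x.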

Working and Moving around in a Spreadsheet

A) Highlighting cells: - can be accomplished by left-clicking, clicking on column letter/row number, and by Ctrl+Shift+Arrow B) Copying a cell to a range of cells - can include all values in highlighted range C) Moving data with drag and drop - can drag and move entire range of data D) Undoing changes - simple and easy

Small Expected Values (cont'd)

A) If df = 1 - Yates correction can be used when the degrees of freedom = 1 - The Fisher exact method can also be applied B) If df > 1 - it's best to increase the # of observations, if possible - cells can be collapsed where it makes logical sense in order to increase the expected values in cells

Independence of Two Variables

A) Mathematical Independence: - when marginal probabilities equal conditional probabilities in a joint probability table i. Recall: - marginal probabilities are associated with the frequency distribution of a single variable ii. In larger tables, it's necessary to look at as many conditional probabilities iii. Does not imply a relationship between variables *To test statistical independence, use the chi-square test*
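
A minimal sketch of the mathematical-independence check, assuming a table of counts keyed by (row, column). The function name `is_independent` and both tables are hypothetical, and exact fractions are used so the equality test is not affected by rounding.

```python
from fractions import Fraction


def is_independent(counts):
    """True when every conditional probability equals its marginal probability."""
    total = sum(counts.values())
    rows = {r for r, _ in counts}
    cols = {c for _, c in counts}
    for c in cols:
        # Marginal probability of the column category
        marginal = Fraction(sum(counts[(r, c)] for r in rows), total)
        for r in rows:
            # Conditional probability of the column category given the row
            row_total = sum(counts[(r, k)] for k in cols)
            if Fraction(counts[(r, c)], row_total) != marginal:
                return False
    return True


# Hypothetical joint frequency tables: shift by outcome
independent_counts = {("day", "yes"): 30, ("day", "no"): 20,
                      ("night", "yes"): 30, ("night", "no"): 20}
dependent_counts = {("day", "yes"): 40, ("day", "no"): 10,
                    ("night", "yes"): 20, ("night", "no"): 30}

print(is_independent(independent_counts), is_independent(dependent_counts))
```

Sample data rarely match exactly, which is why the chi-square test is used to judge statistical independence.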

Final Ch8 Guide

A) Mathematical independence and statistical independence: be able to identify meanings and utilization in given examples (pg 274) B) What is the significance of Chi-square? Be able to identify the definition (pg 275-276) and uses C) When are two variables considered statistically independent based on the Chi-square value? D) Does Excel have a Chi-square Function? How do we calculate the Chi-square value? E) What does the =CHIDIST() Function provide? What arguments are required to be included in the =CHIDIST() Function in Excel? What does the result (value) tell us about the H0? (pg 276-277) F) What arguments are required for the =CHIINV() Function? What does the =CHIINV() Function tell us about statistical independence? (pg 277) G) What is the =CHITEST() Function? What arguments are required? What does this function tell us? (pg 279) H) Is =CHIPROB() an Excel Function? If so, what does it tell us? I) Be able to distinguish between Type I and Type II errors in relation to Chi-square distributions. How does increasing the d/f change the appearance of a Chi-square distribution table? (pg 280) J) Is the Chi-square distribution skewed to the right or left or balanced? Why? K) What effect does increasing the sample size have on a Type II error associated with a normal or t distribution? (pg 281) L) What does "Practical significance" refer to? (pg 285)

Variable Types (cont'd)

A) Nominal B) Ordinal (scale) C) Interval (scale) D) Ratio (scale) E) Independent (causal) F) Dependent (caused)

One-Sample Hypothesis

A) Null Hypothesis: There is no significant difference between the observed and expected frequencies. B) Alternative Hypothesis: There is a significant difference between the observed and expected frequencies.

Types of Random Samples

A) Systematic - samples drawn by first dividing the population into a number of subsets equal to the number of observations desired in the sample and then drawing a specific observation from each subset *used to draw inferences about a population* B) Simple Random - samples drawn in such a way that every possible sample of a given size has an equal likelihood of being selected for the sample C) Stratified - samples drawn by dividing the total population into two or more groups, or strata, and then drawing a specific proportion of each stratum (the singular of strata) for the sample (e.g. those with specific surnames, such as Hispanic surnames, vs those without may form two strata; we would sample x families from one stratum and y families from the other) *Why use stratified samples?* - Ensures that groups in smaller strata are represented D) Cluster - samples drawn by first dividing the population into several groups, or clusters. Proceeds in two stages: 1) A set of the clusters is drawn using a systematic or simple random sample 2) Either all the members of each selected cluster or a sample of its members are included in the final sample (e.g. divide the population into ZIP code areas first, then randomly select a few ZIP code areas and sample within them)
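
The four designs can be compared side by side in a short Python sketch. Everything here is illustrative: the population of 100 records, the two groups "A" and "B", the 10 percent sampling fraction, and the ten equal-size clusters are all assumptions made for the example.

```python
import random

rng = random.Random(467)  # fixed seed so the sketch is repeatable

# Hypothetical population: 80 members of group A and 20 of group B
population = [{"id": i, "group": "A" if i < 80 else "B"} for i in range(100)]

# Simple random: every possible sample of size 10 is equally likely
simple = rng.sample(population, 10)

# Systematic: 10 subsets of 10, same position taken from each subset
step = len(population) // 10
start = rng.randrange(step)
systematic = population[start::step]

# Stratified: the same proportion (10%) is drawn from every stratum
stratified = []
for label in ("A", "B"):
    stratum = [p for p in population if p["group"] == label]
    stratified += rng.sample(stratum, len(stratum) // 10)

# Cluster: stage 1 picks a few clusters, stage 2 keeps all of their members
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
chosen = rng.sample(clusters, 2)
cluster_sample = [p for cluster in chosen for p in cluster]

print(len(simple), len(systematic), len(stratified), len(cluster_sample))
```

Only the stratified draw guarantees that the smaller group B appears in the sample; the cluster draw represents just two of the ten clusters.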

Type I and II errors for Two-Groups Example

A) TDIST() can be used to find the probability of a t value: Pt = TDIST(0.288, 98, 2) = 0.774 B) Both types of errors can occur when comparing means. The initial H0 was that the population mean for cost per discharge is exactly the same for both men and women. If H0 is true, the t value calculated for a large number of samples of discharges for 50 men and 50 women would be as large as 0.288 in 77 percent of the samples selected. Because the probability of getting a t value as large as 0.288 is very high, H0 is accepted C) Samples of unequal sizes can be compared, but the experiment may have to be adjusted in order to compensate. D) Remember: ***Type I Error occurs when H0 is rejected and is TRUE*** ***Type II Error occurs when H0 is accepted and is FALSE*** Unequal Sample Sizes: Pg 307-308. If a pooled variance exists, use the appropriate t test (Pg 308): $t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$ See Figure 9.7 text, Pg 309 for results & Pg 310 for formulas. Excel also has a t Test add-in (Pgs 312-316)
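
A hedged sketch of the pooled-variance t test for unequal sample sizes. The card does not spell out how the pooled variance is computed, so the usual weighted average of the two sample variances is assumed here; the function name `pooled_t` and all input numbers are hypothetical.

```python
import math


def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Return (t, degrees of freedom) for a pooled-variance two-sample t test."""
    # Assumed pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    # Standard error of the difference between the two means
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, n1 + n2 - 2


# Hypothetical control and experimental groups of unequal size
t, df = pooled_t(mean1=52, var1=16, n1=10, mean2=48, var2=25, n2=15)
print(round(t, 3), df)
```

The resulting t and degrees of freedom would then go into =TDIST(t, df, 2) for a two-tailed probability.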

Chi-Square Distribution

A) There is a family of χ² distributions, each determined by a single degrees-of-freedom value. B) For a single variable: df = k - 1, where k = the number of categories C) For multiple variables: df = (r - 1)(c - 1), where r = the number of rows and c = the number of columns *As the degrees of freedom increase, the sampling distribution approaches the normal distribution.* ** The chi-square statistic (X^2) is the total of (O-E)^2/E across all cells *** If the critical chi-square value > your calculated value, then don't reject your null hypothesis (the difference may be due to chance); if your calculated value > the critical value, reject the null hypothesis (there's a significant difference that's not due to chance) ***

T-Test (cont'd)

A) To interpret, realize that t is a test statistic; the probability associated with it is used to decide whether population means differ B) Can be linked to both Type I & Type II errors C) Can be either one-tailed or two-tailed Example: H0: The average cost per discharge is $6,000, or H0: c = $6,000, and H1: The average cost per discharge is not $6,000, or H1: c ≠ $6,000. Average cost/discharge = $6,586.30; Std Deviation = $5,262.73; Std Error (sample size = 100) = $5,262.73/sqrt(100) = $526.27. t = (6586.30 − 6000)/526.27 = 1.11 (value of t). t dist Excel function: =TDIST(1.11,99,2) = 0.268 D) Interpretation (pg 298): the probability of a t value at least this large if H0 is true is 0.268 (i.e., the probability of a sample value of $6,586.30 from a sample of 100 when the true avg cost per case = $6,000) - Accept H0 as true E) Data Analysis field in Excel used to take a larger sample (250 samples of 100) - Calculate mean & std error for samples and follow through with t test per Equation 9.1 (Refer to Chapter 6 & Exercise 6.03) F) Pg 299, Figure 9.1 is the distribution of the tests: 68% have results between -1 and +1; 93% are between -2 and +2; 7% are outside that range *** 7% probability of rejecting H0 when it's true! ***
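
The arithmetic in the example can be reproduced step by step. This sketch uses only the figures given on the card; the variable names are illustrative.

```python
import math

sample_mean = 6586.30   # average cost per discharge in the sample
hypothesized = 6000.00  # value stated in H0
sd = 5262.73            # sample standard deviation
n = 100                 # sample size

se = sd / math.sqrt(n)                 # standard error of the mean
t = (sample_mean - hypothesized) / se  # number of standard errors from H0
df = n - 1                             # degrees of freedom for =TDIST()

print(round(se, 2), round(t, 2), df)
```

These are the values passed to =TDIST(1.11,99,2) on the card.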

Two-tailed, left-tailed, right-tailed

A) Two-tailed Test: - critical region is the two extreme regions (tails) under the curve B) Left-tailed test: - critical region is the extreme left region (tail) under the curve C) Right-tailed test: - critical region is in the extreme right region (tail) under the curve

Type Errors

A) Type I Error: - error of concluding that the null hypothesis is false when it is actually true. *referred to as a False Positive (denoted as alpha or the 'level' of a hypothesis test) B) Type II Error: - error of concluding that the null hypothesis is true when it is actually false. *referred to as a False Negative (denoted as beta)

Ch. 10 Analysis of Variance

A) Understand the concepts associated with one-way analysis of variance (ANOVA) B) Understand the use of the F test to determine whether between group variances is large relative to within group variance C) Enable and use the Excel add-in for one-way ANOVA D) Apply the concepts associated with ANOVA for repeated measures for calculating the variation within people or groups

Ch 8 Statistical Tests for Categorical Data

A) Understand the concepts associated with the independence of two variables B) Understand statistical independence of two variables C) Understand and interpret the chi-square statistic

Equations for one-way Analysis of Variance

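Analysis of variance depends on two key pieces of information: A) Between group (explained) variance ($SS_B$): $SS_B = \sum_{j=1}^{m} n_j (\bar{x}_j - \bar{\bar{x}})^2$ B) Within group (error) variance ($SS_W$): $SS_W = \sum_{j=1}^{m} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2$ where $m$ is the number of groups, $n_j$ is the size of group $j$, $\bar{x}_j$ is the mean of group $j$, and $\bar{\bar{x}}$ is the grand mean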
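
The two sums of squares can be computed directly from grouped data, as in the sketch below. The function name `one_way_anova` and the three small groups are hypothetical; the F ratio uses the standard degrees of freedom (groups minus 1, and observations minus groups), which the card itself does not list.

```python
def one_way_anova(groups):
    """Return (SS between, SS within, F) for a list of groups of values."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        # Between groups: group size times squared distance of group mean from grand mean
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        # Within groups: squared distance of each value from its own group mean
        ss_within += sum((x - group_mean) ** 2 for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_stat


groups = [[4, 5, 6], [7, 8, 9], [1, 2, 3]]  # hypothetical data for three groups
ss_b, ss_w, f_stat = one_way_anova(groups)
print(ss_b, ss_w, f_stat)
```

A large F means the between-group variance is large relative to the within-group variance.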

Sorting Data

Sort functionality in Excel is useful for more than determining the order of data - may be used for sampling a population - may be used for identifying missing or outlier values *Sort dialog box allows sorting by multiple variables* (ie. Sort by Sex [Order A to Z], then Sort by Age [Order A to Z]) Result: Rows are sorted by Sex first, then by age (dependent on the initial sorting by Sex)

Variable

a characteristic of an observation or element of the sample that is assessed or measured. Variables must not be a constant!! No relationship exists between variable and constant. Value of a variable across all members of a sample is a 'statistic'

Final Exam Ch7

Define Confidence Interval - Be able to recognize examples of use of the CI - Establishment of confidence limits and the testing of hypotheses about the data—Independence Effect on the standard error as the sample size increases Understand "True Population Mean" Identify processes associated with Hypothesis testing - Know: H0 and H1-meaning and application; when is H0 accepted Differentiate and understand alpha and beta-one tailed vs two tailed test Definition of Inferential Statistics Type I and Type II errors: differentiate

Two-Groups Example

H0: Cm = Cw; H1: Cm not = Cw. Population divided into 2 groups, equal size sample from each group = Stratified Sample. In text example, 50 male and 50 female. Use the t test to decide if the true mean of the costs is the same (pg 305 in text): $t = \dfrac{6460.04 - 6177.30}{\sqrt{(2387156 + 24477861)/50}} = 0.288$ (the two terms under the square root are the squared std dev for men and for women)

Measures of Central Tendency and Dispersion (Cont'd)

In Excel, A) Mean (Average) = sum of data points/# of data points [ =AVERAGE() ] B) Median = middle value in an ordered set of values [ =MEDIAN() ] C) Mode = most commonly appearing value in a set of data [ =MODE() ]

Addition Rule-Bottom

Joint probabilities may also be expressed as 'or' (ie. P[A or B] = P[A] + P[B] - P[A and B] ) *If events are mutually exclusive, then P[A and B] is zero*
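
A short sketch of the addition rule using exact fractions. The function name `prob_a_or_b` and the playing-card events are illustrative assumptions, not part of the card.

```python
from fractions import Fraction


def prob_a_or_b(p_a, p_b, p_a_and_b=Fraction(0)):
    """P[A or B] = P[A] + P[B] - P[A and B]."""
    return p_a + p_b - p_a_and_b


# Not mutually exclusive: a card can be both a heart and a king
heart_or_king = prob_a_or_b(Fraction(13, 52), Fraction(4, 52), Fraction(1, 52))

# Mutually exclusive: a card cannot be both a heart and a spade, so P[A and B] = 0
heart_or_spade = prob_a_or_b(Fraction(13, 52), Fraction(13, 52))

print(heart_or_king, heart_or_spade)
```

Subtracting P[A and B] keeps the overlap (here the king of hearts) from being counted twice.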

Selecting Samples Sizes (Measurement Error)

Measurement Error (ME) - the half-width of the confidence interval. As sample size increases, the width of a confidence interval decreases (larger sample => more precise estimate). Recall that the square root of n is in the denominator of the SE of the sample mean. How to determine sample size: For estimating a mean: n = (t^2 x s^2)/(ME)^2. For estimating a proportion: n = (t^2 x p x (1-p))/(ME)^2
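
The two sample-size formulas translate directly into code. In this sketch the function names and input values are hypothetical, and rounding up to the next whole observation is an added assumption (a fraction of an observation cannot be sampled).

```python
import math


def n_for_mean(t, s, me):
    """Sample size for estimating a mean: n = (t^2 x s^2) / ME^2, rounded up."""
    return math.ceil((t ** 2 * s ** 2) / me ** 2)


def n_for_proportion(t, p, me):
    """Sample size for estimating a proportion: n = (t^2 x p x (1-p)) / ME^2, rounded up."""
    return math.ceil((t ** 2 * p * (1 - p)) / me ** 2)


# Hypothetical inputs: t = 2, standard deviation 50, desired measurement error 10
print(n_for_mean(t=2, s=50, me=10))

# Hypothetical inputs: t = 2, proportion 0.5, desired measurement error 0.1
print(n_for_proportion(t=2, p=0.5, me=0.1))
```

Halving the measurement error roughly quadruples the required sample size, because ME is squared in the denominator.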

Using Pivot Table to Generate Frequencies of Categorical Variables

Pivot tables and Pivot charts may be used to summarize data by categorical variables (e.g. Counts, Averages, Totals, etc.)

Data Analysis Pack

Statistical analysis tools that are available in Excel as an add-in, in addition to its built-in functions

How are statistics utilized in developing an organization's health policy and assisting healthcare administrators in their decision-making process?

Statistics allow informaticians to establish data structures, analyze data, and provide justified data that supports the clinical decision-making process. (e.g. The ever-increasing costs of providing hospital services have sparked a keen interest on the part of hospital administrators in practical mechanisms that can account for, and control or mitigate, those costs. The administrators for the Sea Coast Alliance, a system of eight hospitals, want to be able to use the previous case data to provide guidance on how to control costs. Because Sea Coast is associated with eight hospitals, it has a substantial volume of case data that the administrators believe can be useful in achieving their goal.)

One-dimensional Example

Suppose we want to know how people in a particular area will vote in general and go around asking them. Republican: 20 Democrat: 30 Other: 10 How will we go about seeing what's really going on? Hypothesis: Democrats should win district Solution: chi-square analysis to determine if our outcome is different from what would be expected if there was no preference Rep (observed): 20 ; Rep (Expected): 20 Dem (obs): 30 ; Dem (exp): 20 Other (obs): 10 ; Other (exp): 20 *plug in to formula to calculate total for chi-square* X^2(2) = 10 * df = 2, df = (k-1) But, X^2(alpha=0.05) =5.99 So, reject H0 Conclusion: * Note that all we really can conclude is that our data is different from the expected outcome given a situation* A) Although it would appear that the district will vote democratic, really we can only conclude they were not responding by chance B) Regardless of the position of the frequencies we'd have come up with the same result C) In other words, it is a non-directional test regardless of the prediction
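
The calculation in this example can be reproduced in a few lines. The observed counts and the 5.99 critical value come from the card; the variable names are illustrative.

```python
observed = {"Republican": 20, "Democrat": 30, "Other": 10}

# With no preference, every category is expected to get an equal share
total = sum(observed.values())
expected = total / len(observed)

# Chi-square: total of (O - E)^2 / E over the categories
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1

critical_05 = 5.99  # critical value for df = 2 at alpha = 0.05, as given on the card
reject_h0 = chi_square > critical_05

print(chi_square, df, reject_h0)
```

Swapping the counts between categories leaves the statistic unchanged, which is why the test is non-directional.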

Ch9 T-Tests for Related and Unrelated Data

The t-Test: A) Assesses whether a numerical variable, continuous or discrete, is independent of a categorical variable that takes on only two values B) Also provides the ability to assess whether a value found from a sample could have come from a population in which a hypothesized value is true C) The t test can produce a result that may range from −∞ to ∞, although practically speaking it is usually in a more finite range D) The =TDIST() function works only with positive values of t E) Therefore, when using the =TDIST() function to determine the probability of a negative t value, it is necessary to change the negative value to a positive one F) Because the t distribution is symmetrical with regard to the probability of being at −t or t, the conversion makes no difference. *Consequently, if the two-tail probability of a value of −2.3 were desired, the way to determine this would be by using =TDIST(2.3,df,2).*

T-test (Cont'd part 2)

Type I & II Errors and the t test: - New H0 of $4,500/discharge: H0: Avg cost per discharge = $4,500; H1: Avg cost per discharge not = $4,500. See Figure 9.2 pg 300. The t values are centered on 3.5; since the t value is greater than 2, H0 is rejected and H1 is accepted. However, 9% of the 250 t tests resulted in values < 2, so 9% of the 250 samples would have produced a Type II error (accepting H0 when it is false). Pg 300: 1-tail & 2-tail t tests: H0: c ≤ $6,000 and H1: c > $6,000. Only a sample mean > $6,000 can lead to rejecting H0. A one-tail test uses a lower cutoff in standard deviation units, approx. 1.7. See Fig 9.3 & 9.4 pg 302

Normal Distribution

a continuous distribution that contains an infinite or non-countable number of values *bell shaped; symmetrical around the mean **provides probabilities for continuous numerical variables ***Both sides have equal probabilities

Poisson Distribution

a discrete distribution like the binomial *concerned with number of observations that will occur in a small amount of time or region of space (whole numbers only) **Performed using =POISSON() function

Normal

a distribution of data that has roughly the same amount of data on either side of the middle and has its most common values around the middle of the whole data *symmetrical bell shaped data points* ** mean and median are the same value resulting in a symmetric distribution **

T - Distribution

a distribution that assumes a finite number of observations * there is a 'family' of t distributions, one for each degrees-of-freedom value A) Degrees of Freedom: - number of observations less 1 * the greater the degrees of freedom, the more the distribution will appear to be normal in shape * Use the [ =TDIST() ] function to find the probability of a t value. To calculate the number of standard deviations from the mean, t, use the [ =TINV() ] function

Binomial

a probability distribution that describes the behavior of a binary event (yes/no, dead/alive, readmitted or not)

Dependent Variable

a variable whose values are assumed to be affected or modified by the value of other variables in a given analysis

Nominal Level of Measurement

characterized by data that consist of names, labels, or categories. - data can't be arranged in an ordering scheme e.g. The colors of cars driven by college students (red, black, blue, white, etc.) * the only level of measurement where data can't be arranged in an ordering scheme*

Population

groups of entities about which there is an interest, generally large groups of individual persons, objects, or items from which samples can be taken (e.g. all persons that have used or ever will use a cholesterol-reducing drug)

A Priori Probability

known before the event occurs (e.g. the probability of heads in a coin flip is 1/2) *gambling is based on a priori probabilities

Binomial Probability

probability that an experiment results in a specified number of successes (e.g. predicting how many times a coin flip will result in tails) * p is the probability of success ** 1-p is the probability of failure
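
A minimal sketch of the binomial probability calculation. The card gives only p and 1-p, so the standard binomial formula is assumed here; the function name `binomial_pmf` and the coin-flip inputs are illustrative.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    # Number of ways to place k successes, times p^k, times (1-p)^(n-k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Probability of exactly 2 tails in 5 fair coin flips
two_tails = binomial_pmf(2, 5, 0.5)

# Probabilities over every possible number of successes add up to 1
total = sum(binomial_pmf(k, 5, 0.5) for k in range(6))

print(two_tails, total)
```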

Distribution

shows how often each value occurs in a data set

Microsoft Excel

spreadsheet developed by Microsoft for Windows, Mac OS X, and iOS. Features: A) Calculation B) Graphing Tools C) Pivot Tables D) Macro Programming Language (Visual Basic for Applications)

Bayes Theorem - Conditional Probability

the conditional probability of event A given that event B occurred is: P[A|B] = P[A and B]/P[B] (e.g. P[True|3rd Shift] = 0.188/0.251 = 0.748) *two events are independent when P[A|B] = P[A] (knowing B doesn't influence the probability of A)

Venn Diagrams

useful in computing joint probabilities. Mutually exclusive: the circles in the Venn diagram do not overlap. Non-mutually exclusive: the circles overlap each other

Numerical Variables

variables whose values are designated by numbers and have some meaning relative to each other (e.g. one number is larger than another) Typically measured by 3 Scales A) Ordinal: values assigned to the variable levels indicate that the levels are in an order of magnitude (e.g. surveys in which a scale is used to designate responses to one of several choices, such as a Likert scale) B) Interval: values indicate that the order of magnitude between variable levels is in equal intervals (e.g. Celsius temperature scale) C) Ratio: values assigned to the variable levels are indicative of: 1) order of magnitude 2) equal intervals 3) a real zero, where zero represents complete absence of the trait being measured (e.g. number of patients in an E.D. can be any number, including 0 patients) **Any numerical variable is assumed to be measured on at least an interval or a ratio scale

Standard Error (Cont'd)

when sample size is large relative to the population size, the standard error can be reduced by multiplying it by a value called the "finite population correction"
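
A sketch of this adjustment, which is commonly called the finite population correction. The card does not give the multiplier, so the commonly used form sqrt((N - n)/(N - 1)) is assumed; the function name `standard_error` and the numbers are hypothetical.

```python
import math


def standard_error(sd, n, population_size=None):
    """Standard error of the mean, optionally adjusted for a finite population."""
    se = sd / math.sqrt(n)
    if population_size is not None:
        # Assumed multiplier: sqrt((N - n) / (N - 1)), always below 1 when n > 1
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se


unadjusted = standard_error(sd=10, n=25)
adjusted = standard_error(sd=10, n=25, population_size=101)
print(unadjusted, round(adjusted, 4))
```

When the population is much larger than the sample, the multiplier is close to 1 and the adjustment makes little difference.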

