Biostatistics Final review

The sum of the deviations of the individual observations from their mean is?

0; the sum of the deviations of the individual data elements from their mean is always equal to zero. This is why we use the sum of squared deviations to measure spread. Example: for 1, 3, 5, 2, 4 with mean x bar = 3, (1-3)+(3-3)+(5-3)+(2-3)+(4-3) = 0.
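
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative.

```python
# Data from the example: 1, 3, 5, 2, 4
data = [1, 3, 5, 2, 4]
mean = sum(data) / len(data)                 # 15 / 5 = 3.0

# Deviations of each observation from the mean
deviations = [x - mean for x in data]        # [-2.0, 0.0, 2.0, -1.0, 1.0]

sum_dev = sum(deviations)                    # always 0 (up to rounding)
sum_sq_dev = sum(d ** 2 for d in deviations) # 4 + 0 + 4 + 1 + 1 = 10

print(sum_dev, sum_sq_dev)
```

The raw deviations cancel out, while the squared deviations do not, which is why the squared version is the one used to measure spread.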

Properties of normal distribution

1) The normal distribution is symmetric about the mean (i.e., P(X > mu) = P(X < mu) = 0.5). 2) The mean and variance (mu and sigma squared) completely characterize the normal distribution. 3) The mean = median = mode. 4) Approximately 68% of observations fall within mean +/- 1 sd, 95% within mean +/- 2 sd, and >99% within mean +/- 3 sd.

Rating | Freq. | Relative Freq. | % Freq. | Cumulative RF
Excellent | 4 | 1/5 | 20 | .20
Good | * (7) | * | * (35) | .55
Average | 6 | 3/10 | 30 | .85
Poor | 3 | 3/20 | 15 | *
The last category in a cumulative relative frequency distribution will have a cumulative relative frequency equal to?

1; A cumulative relative frequency distribution is a tabular summary of a set of data showing the relative frequency of items less than or equal to the upper class limit of each class. The relative frequency is defined as the fraction or proportion of the total number of items. To find the cumulative relative frequencies, add all the previous relative frequencies to the relative frequency for the current row. Thus, the last entry of the cumulative relative frequency column is one, indicating that one hundred percent of the data has been accumulated.
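
As a sketch of the procedure described above, the cumulative relative frequencies for the rating table can be built with a running sum. It uses the frequencies 4, 7, 6, and 3, with the starred "good" frequency taken as the 7 shown in parentheses. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import accumulate

# Frequencies from the rating table (the "good" count is the 7 shown in parentheses)
freqs = {"excellent": 4, "good": 7, "average": 6, "poor": 3}
n = sum(freqs.values())                      # total number of individuals

# Relative frequency = frequency / total
rel = [Fraction(f, n) for f in freqs.values()]

# Cumulative relative frequency = running sum of the relative frequencies
cum = list(accumulate(rel))

for rating, r, c in zip(freqs, rel, cum):
    print(rating, float(r), float(c))
```

The last entry of `cum` is exactly 1, and the sum of the frequencies equals the 20 individuals surveyed.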

When considering a contingency table test with 6 rows and 6 columns, this implies that the number of degrees of freedom for the test must be which of the following?

25; The degrees of freedom in a contingency table are given by (r-1)(c-1), where r and c are the number of rows and columns. Thus we have (6-1)(6-1) = 25.

Based on the following partial ANOVA table:
Source of Variation | SS | df | MS | F
Treatment | 100 | * | 25 | 6.25
Error | 60 | * | 4 |
Total | 160 | 19 | |
The numerator and denominator degrees of freedom (shown with the * in the above table) are:

4 and 15. In the ANOVA table, each sum of squares divided by its degrees of freedom gives a mean square, an estimate of the variation in the experiment. The total sum of squares (Total SS) involves n squared deviations, so its degrees of freedom are (n - 1). The sum of squares for treatments involves k (the number of populations) squared deviations, so its degrees of freedom are (k - 1). Finally, the sum of squares for error is a direct extension of the pooled variance estimate, and its degrees of freedom are (n - k). Each mean square is obtained as MS = SS/df. Therefore MS(treat) = SS(treat) / df(treat), so df(treat) = SS(treat) / MS(treat) = 100/25 = 4. Similarly, df(error) = SS(error) / MS(error) = 60/4 = 15.
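
The same back-calculation can be sketched in Python using only the SS and MS values from the partial table. The variable names are illustrative.

```python
# Values read from the partial ANOVA table
ss_treat, ms_treat = 100, 25
ss_error, ms_error = 60, 4

# MS = SS / df, so df = SS / MS
df_treat = ss_treat / ms_treat    # numerator degrees of freedom
df_error = ss_error / ms_error    # denominator degrees of freedom

# F is the ratio of the two mean squares
f_stat = ms_treat / ms_error

print(df_treat, df_error, f_stat)
```

The F value computed this way agrees with the 6.25 shown in the table.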

Suppose we are counting the number of patients out of 100 that have a particular disease XYZ, where the probability of any one individual having the disease is 0.40. Assuming all the patients are independent and the chance of disease is the same for each patient, what is the expected number of patients with the disease XYZ?

40; Here we have a binomial experiment with 100 identical trials, and the probability of success on a single trial equals 0.40. Therefore the expected number of patients with the disease XYZ (successes) is 100 x 0.40 = 40.
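
A short sketch of the same calculation: the shortcut n x p is compared with the definition of expected value, the sum of k x P(X = k) over all possible counts k. Only the standard library is used.

```python
from math import comb

n, p = 100, 0.40

# Shortcut for a binomial random variable: E[X] = n * p
expected = n * p

# Same quantity from the definition: sum over k of k * P(X = k)
expected_from_pmf = sum(
    k * comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)
)

print(expected, round(expected_from_pmf, 6))
```

Both routes should agree up to floating-point rounding.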

Select the most correct statement concerning relative risk and odds ratios: It is possible to calculate a relative risk when data are from a case-control study; Coefficients from logistic regression analysis yield relative risk; A relative risk of 10 has the same strength of association as a relative risk of 0.1; At least one variable should be normally distributed to calculate a relative risk; If the confidence interval for the relative risk does not contain 0, there is an association.

A relative risk of 10 has the same strength of association as a relative risk of 0.1. The risk ratio measures the increased risk of developing a disease after being exposed to a risk factor compared to not being exposed to the risk factor. It is given by RR = risk for the exposed / risk for the unexposed, and it is often referred to as the relative risk; it is a ratio of two proportions. Because 0.1 = 1/10, the two values describe the same tenfold difference in risk, in opposite directions.

Which one of the following items does not represent the value of biostatistics in assessing the health problems of the population and determining their extent? Finding patterns in the collected data; Deciding what information to gather to help identify the health problems; Accounting for possible inaccuracies in responses and measurements; Summarizing and presenting the information to best describe the target population

Accounting for possible inaccuracies in responses and measurements. A biostatistician's responsibility within a collaborating research team is to aid in the research design, analysis, and interpretation of the data. The other answers all describe tasks that would fall within a biostatistician's expertise area. A biostatistician would not be able to account for possible inaccuracies in the data. This is because the biostatistician only has access to the information contained within the data at hand and does not have information concerning the underlying reasons for inaccuracies in the data.

An investigator measures a continuous variable on four independent, disjoint groups of people and would like to know whether the means of the groups differ. Which statistical test should the investigator use to answer this question?

Analysis of variance (ANOVA); analysis of variance is the only choice that is appropriate for continuous outcome variables. Logistic regression is used for binary outcomes, Cox regression is used for survival outcomes, and chi-squared tests are used for pairs of categorical variables.

A longitudinal cohort study is conducted to assess risk factors for diabetes. Participants free of diabetes at the start of the study are followed for 10 years for development of diabetes. Consider the study described above and suppose that the outcome is change in blood glucose level over 10 years, what test would be used to assess whether BMI (measured as normal weight, overweight, and obese) is related to change in blood glucose level?

Analysis of variance; The outcome of interest is change in blood glucose level, a continuous variable, and we wish to test whether there is a difference in mean changes in blood glucose levels among three independent groups (persons of normal weight, overweight, and obese). Analysis of variance is used to test for a difference among more than two independent means.

Continuous variables

Assume, in theory, any value between a theoretical minimum and maximum. Quantitative, measurement variables. Example: systolic blood pressure.

A biostatistician uses sampling and estimation methods to monitor how well regulators are complying with policy and to determine possible interventions and/or preventive measures for health problems. These activities fall under: Assessment; Assurance and policy development; Policy development and assessment; Assurance

Assurance and policy development

We are interested in assessing disparities in infant morbidity by race/ethnicity. What kind of variable is race/ethnicity?

Categorical

The idea that Z has a standard normal distribution when Z = (x bar - mu) / (sigma / root n) is justified by:

Central Limit Theorem; this is a standard result. The Central Limit Theorem states that the mean (or sum) of a sufficiently large number of independent observations from a population with finite variance is approximately normally distributed, regardless of the shape of the population.

In a group of individuals, the probability of characteristic C is 0.4, and the probability of characteristic D is 0.2. The probability of their intersection is 0.10. Which of the following statements is correct? Characteristics C and D are not independent; C and D are mutually exclusive; C and D are independent and mutually exclusive; not enough information is given; C and D are independent

Characteristics C and D are not independent. Two events A and B are independent if the occurrence of one does not affect the probability of the occurrence of the other, which requires P(A and B) = P(A)P(B). Here P(C)P(D) = 0.4 x 0.2 = 0.08, which does not equal the given intersection probability of 0.10, so the events are not independent. They are also not mutually exclusive, because the intersection has nonzero probability.
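
A minimal check of the two conditions, using the probabilities given in the question. Independence requires P(C and D) = P(C)P(D), and mutual exclusivity requires P(C and D) = 0. The variable names are illustrative.

```python
import math

p_c, p_d, p_c_and_d = 0.4, 0.2, 0.10

# Independence: the joint probability must equal the product of the marginals
product = p_c * p_d                                  # about 0.08
independent = math.isclose(p_c_and_d, product)

# Mutually exclusive: the events can never occur together
mutually_exclusive = p_c_and_d == 0

print(product, independent, mutually_exclusive)
```

Both flags come out false: the joint probability of 0.10 is neither the product of the marginals nor zero.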

A longitudinal cohort study is conducted to assess risk factors for diabetes. Participants free of diabetes at the start of the study are followed for 10 years for development of diabetes. What test would be used to assess whether sex is related to incident diabetes?

Chi-square test of independence. In this analysis, the outcome is dichotomous (incident diabetes - yes or no) as is the predictor (sex). The data can be organized in a 2x2 table and the chi-square test of independence is used to assess whether there is a difference in the proportions of men and women who develop diabetes.

The following test would be used to compare education levels between groups (less than high school, completed high school, completed some college):

Chi-square test of independence: the outcome of interest is educational level, measured here as a 3-level ordinal variable, and interest lies in comparing the proportions of participants in each educational category between groups (participants assigned to the placebo as compared to participants assigned to the experimental group). The data can be organized into a 3x2 table for analysis.

The following test would be used to compare prevalent diabetes between groups (% with diabetes)

Chi-square test of independence; the outcome of interest is prevalent diabetes status, a dichotomous or indicator variable, and interest lies in comparing the proportions of participants with diabetes between groups (participants assigned to the placebo as compared to participants assigned to the experimental group). The data can be organized in a two by two table for analysis.

Assume that a researcher has measured weight in a sample of 100 overweight adults before and after a diet and exercise program conducted at the local health department's weekly Eat Healthy-Be Fit community program. To determine whether the mean weight decreased six weeks after the exercise program compared to the initial baseline measures, the researcher should:

Conduct a t-test for dependent samples. A t-test is a hypothesis test used to compare population means. In this case, the samples are dependent because both measurements are taken on the same individuals.

Now assume that the researcher has measured weight in a sample of 200 overweight adults who have been randomized to receive either the diet and exercise program or no program (to serve as a control group). All subjects are weighed at a baseline and again six weeks later. Choosing from the following analysis options, which is the most appropriate way to determine whether the diet program had an impact on weight loss?

Conduct an analysis of covariance using the weight at six weeks as the dependent variable, the diet and exercise program versus control group as the independent variable, and the baseline weight as a covariate. Analysis of covariance (ANCOVA) is a technique that involves a multiple regression model in which the study factors of interest are all treated as nominal variables. The variables being controlled for in the model (the covariates) may be measured on any scale. We want to use the baseline weight as a covariate to adjust for, or control for, any confounding by baseline weight.

A longitudinal cohort study is conducted to assess risk factors for diabetes. Participants free of diabetes at the start of the study are followed for 10 years for development of diabetes. Consider the study described above and suppose that the outcome is change in blood glucose level over 10 years. What test would be used to assess whether age is related to change in blood glucose level?

Correlation analysis. The goal of the analysis is to assess the relationship between two continuous variables - age and change in blood glucose level. Correlation analysis is one technique to quantify the direction (positive or negative) and strength of the association, if one exists.

We want to study whether individuals over 45 years are at greater risk of diabetes than those 45 and younger. What kind of variable is age?

Dichotomous (age is usually a continuous variable, but in this case it is either over 45 or 45 and younger, which makes it dichotomous)

An investigator would like to assess the association between two categorical variables, but a cross-tabulation of the variables reveals that some cells contain counts equal to zero. Which statistical test would be most appropriate in this situation?

Fisher's exact test; it is often used to test for association between two categorical variables when there are small cell counts (e.g. expected cell counts are less than or equal to 5 in a table).

Dichotomous and categorical

Frequencies and relative frequencies; bar charts (frequency or relative frequency)

Ordinal

Frequencies, relative frequencies, cumulative frequencies, and cumulative relative frequencies; histograms (frequency or relative frequency)

Numerical summaries of dichotomous, categorical, and ordinal variables are best organized through....

Frequency distribution tables: frequency and relative frequency for all three; cumulative frequency and cumulative relative frequency for ordinal variables only

Given the diagram with Group 1 (triangle) and Group 2 (diamond), both scenarios produce identical pairs of group means. What makes the difference between the two group means easier to detect in scenario 1 than in scenario 2?

Smaller within-group variability and larger between-group variability in scenario 1. In scenario 2 there is more variability within the groups, which causes the two groups to mix, so it is more difficult to detect the identical difference in the means. This basic example can be formalized by the analysis of variance, where the total variability in an experimental situation is partitioned into the sum of squares for treatment and the sum of squares for error.

Dichotomous variable

Have 2 possible responses (yes/no)

Definition of variance:

In probability theory and statistics, variance measures how far a set of numbers is spread out from its mean. A variance of zero indicates that all the values are identical.

Parametric vs non-parametric

In the literal meaning of the terms, a parametric statistical test is one that makes assumptions about the parameters (defining properties) of the population distribution(s) from which one's data are drawn, while a non-parametric test is one that makes no such assumptions.

Suppose the least squares line resulting from a simple linear regression analysis between weight (y in pounds) and height (x in inches) is as follows: y hat = 135 + 4x. The interpretation of this line is: if the height is increased by 1 inch, on average the weight is expected to: decrease by 1 pound; increase by 4 pounds; increase by 1 pound; decrease by 4 pounds

Increase by 4 pounds. The basic model in simple linear regression is given by y = B0 + B1x + e, and the slope B1 can be interpreted as the change in the mean of y for a one-unit change in x. Here the slope is 4, so weight increases by 4 pounds on average with each one-inch increase in height.

A doctor would like to estimate a patient's weight based on their age and gender. Age and gender are known as? Independent variables, response variables, outcome variables, dependent variables

Independent variables; in regression, the variables used to predict the response variable are the independent (predictor) variables.

If the chances for a second event to occur stay the same, regardless of the outcome of a first event, then the two events are: indeterminate, independent, mutually exclusive, equally likely

Independent; Two events A and B are independent if the occurrence of one does not affect the probability of the occurrence of the other. If A and B are not independent, they are considered dependent.

What does the abbreviation of IQR stand for?

Interquartile range; the IQR is calculated by subtracting the 25th percentile of the data from the 75th percentile of the data and is used to measure variability that is not as influenced by extreme values.

Which of the following is true regarding the sum of frequencies for all categories in the above summary table?

It is always equal to the number of individuals in the given data set

A survey of 20 individuals in a rural community asked them to rate the quality of their health care program on a scale of 4 categories: excellent, good, average, and poor. Their rating frequency distribution table is given below.
Rating | Freq. | Relative Freq. | % Freq. | Cumulative RF
Excellent | 4 | 1/5 | 20 | .20
Good | * (7) | * | * (35) | .55
Average | 6 | 3/10 | 30 | .85
Poor | 3 | 3/20 | 15 | *
Which of the following statements is correct regarding the sum of frequencies for all categories in the summary table above?

It is always equal to the number of individuals in the given data set. A frequency distribution is a tabular summary of data showing the frequency (or number) of items in each of several non-overlapping categories; therefore, the sum of the frequencies among all categories will always equal the number of elements (individuals) in the dataset.

Which is the most correct statement about a scatterplot? It is used to compare the means of two variables, it is used to investigate the relationship between two continuous variables, it shows the relationship between any two variables, it is a useless plot when the relationship between two variables is nonlinear, it is used to determine whether to perform a linear regression

It is used to investigate the relationship between two continuous variables; A scatterplot diagram is a plot of paired (x,y) data with a horizontal x-axis and a vertical y-axis. A scatterplot can be used to investigate the relationship between two continuous variables as well as to identify outliers within a data set.

Which of the following does not describe a measure of the variability of a continuous variable? Confidence interval, kurtosis, interquartile range, standard deviation

Kurtosis; in probability theory and statistics, kurtosis is a measure of the "peakedness" of the probability distribution of a real-valued random variable. Higher kurtosis means more of the variance is due to infrequent extreme deviations versus frequent modestly sized deviations.

In simple linear regression, what is a method of determining the slope and intercept of the best-fitting line? Least squares, r-square, least error, regression, minimum error

Least squares; simple linear regression involves data on a dependent variable y and one independent variable x (multiple regression extends this to x 1, x 2, etc.). Regression analysis involves finding the "best" mathematical model (within some restricted class of models) to describe y as a function of x or to predict y from x. The regression line is the graphical presentation of the regression equation. Residuals are used to determine the best-fitting line; each residual is the observed value minus the value predicted by the regression line. A straight line satisfies the least-squares property if the sum of the squares of its residuals is the smallest sum possible.
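
A minimal sketch of the least-squares calculation for one predictor, using the standard closed-form formulas: slope = Sxy / Sxx and intercept = y bar - slope x (x bar). The function name and the four data points are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sxy: sum of cross-products of deviations; Sxx: sum of squared x deviations
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, chosen only to illustrate the calculation
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
intercept, slope = least_squares_line(xs, ys)

# Residual = observed y minus the value predicted by the fitted line
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
sse = sum(r ** 2 for r in residuals)   # the quantity least squares minimizes

print(intercept, slope, sse)
```

Any other straight line through these points gives a larger sum of squared residuals than `sse`, and the residuals from the fitted line sum to zero up to rounding.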

Source | df | SS | MS
Treatments | k-1 | SST | MST = SST/(k-1)
Error | n-k | SSE | MSE = SSE/(n-k)
Total | n-1 | Total SS |
Based on the ANOVA table, how is the F-test computed from the components in the table?

MST/MSE; the test statistic for a one-way ANOVA is given by F=MST/MSE. We reject the null hypothesis for large values of F, using a right-tailed statistical test. When the null is true, this test statistic has an F-distribution with df1= k -1 and df2 = n-k

Inferential statistics

Make inferences about population parameters based on sample statistics

The z-score measures the relative position of one observation relative to others in a data set. What components are needed to compute a z-score?

Mean and standard deviation; a z-score measures the distance between an observation and the mean, measured in units of standard deviation.
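
A small sketch of the calculation, z = (x - mean) / standard deviation. The function name and the numbers in the usage lines are hypothetical.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, in units of standard deviation."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Hypothetical values: an observation of 130 in data with mean 120 and sd 10
z = z_score(130, mean=120, sd=10)
print(z)
```

A positive z-score means the observation lies above the mean, a negative one below it.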

The lengths of stay for six patients were 0, 0, 1, 2, 2, and 16 days. Which is (are) the best measure(s) to summarize these data? Median, mean, median and range, mean and standard deviation, median and standard deviation

Median and range; because the data are skewed and have an outlier (16 days), the median and range best summarize the data.
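
To see how strongly the outlier pulls the mean, the summaries for the six lengths of stay can be computed with Python's standard `statistics` module.

```python
import statistics

# Lengths of stay (days) from the question
stays = [0, 0, 1, 2, 2, 16]

mean_stay = statistics.mean(stays)        # pulled upward by the 16-day stay
median_stay = statistics.median(stays)    # average of the two middle values, 1 and 2
data_range = max(stays) - min(stays)

print(mean_stay, median_stay, data_range)
```

The mean of 3.5 days is larger than five of the six observations, while the median of 1.5 days sits in the middle of the typical stays.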

What to look at when determining type of analysis

Nature of primary outcome variable (continuous, dichotomous, categorical, time to event); number of comparison groups (one, 2 independent, 2 matched or paired, > 2); associations between variables (regression analysis)

Central Limit Theorem

Start with a non-normal population with mean mu and standard deviation sd. Take samples of size n; as long as n is sufficiently large (usually n >= 30 suffices), the distribution of the sample mean is approximately normal, so Z can be used to compute probabilities: Z = (x bar - mu) / (sd / root n)

The sensitivity of a particular screening test for a disease is 95%, and the specificity is 90%. Which of the following statements is most correct? If a person has the disease, there is a 5% chance that the test will be negative; If a person does not have the disease, there is a 5% chance that the test will be positive; Of 100 people sampled from a population with the disease, the test will correctly detect 90 individuals; Of 100 people sampled from a population with the disease, the test will correctly detect 95 individuals as positive for the disease; If a person tests positive, the probability of having the disease is 0.95.

Of 100 people sampled from a population with the disease, the test will correctly detect 95 individuals as positive for the disease; sensitivity is the proportion of truly diseased people in the screened population who are identified as diseased by the screening test. It is a measure of the probability of correctly diagnosing a case, or the probability that any given case will be identified by the test (i.e., true positives). Specificity is the proportion of truly non-diseased people who are so identified by the screening test. It is a measure of the probability of correctly identifying a non-diseased person with the screening test (i.e., true negatives).
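
The two definitions can be sketched as functions of the cells of a 2x2 screening table. The counts below are hypothetical, chosen so that they reproduce the 95% sensitivity and 90% specificity in the question for 100 diseased and 100 non-diseased people.

```python
def sensitivity(tp, fn):
    """Proportion of truly diseased people who test positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Proportion of truly non-diseased people who test negative."""
    return tn / (tn + fp)


# Hypothetical counts: 100 diseased (95 detected), 100 non-diseased (90 cleared)
sens = sensitivity(tp=95, fn=5)
spec = specificity(tn=90, fp=10)

print(sens, spec)
```

Neither quantity is the probability of disease given a positive test; that would also depend on how common the disease is in the screened population.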

A researcher is designing a new questionnaire to examine patient stress levels on a scale of 0 to 5. What type of outcome variable is being collected? Interval, ratio, binary, ordinal, nominal

Ordinal; data are at the ordinal level of measurement if they can be arranged in some order, but differences between data values either cannot be determined or are meaningless.

In the construction of box plots, the upper and lower fences are used to detect which of the following? Quartiles, maximum number, minimum number, outliers

Outliers; the lower fence is defined as Q1 - 1.5 (IQR). The upper fence is defined as: Q3 + 1.5 (IQR) where Q1 and Q3 are the lower and upper quartiles and IQR is the interquartile range. The upper and lower fences are boundaries to detect any measurements beyond those fences which are called outliers.
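
A minimal sketch of the fence rule. The quartiles are passed in directly, because software packages compute quartiles with slightly different conventions. The function name and all numbers are hypothetical.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the lower and upper quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical quartiles Q1 = 10 and Q3 = 20, so IQR = 10
lower, upper = fences(10, 20)

# Any measurement beyond the fences is flagged as an outlier
values = [4, 12, 18, 40]
outliers = [v for v in values if v < lower or v > upper]

print(lower, upper, outliers)
```

With these quartiles the fences are -5 and 35, so only the value 40 is flagged.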

Two kinds of estimates

Point estimate and confidence interval

A 911 emergency operator is flooded with calls during the daily rush hour period. What is the distribution that best describes this data set? Hypergeometric, poisson, binomial, normal

Poisson; the Poisson distribution is used to model data that represent the number of occurrences of a specified event in a given unit of time or space.

The assumption of a t-test for the difference between the means of two independent populations is that the respective: sample sizes are equal, populations are approximately normal, sample variances are equal, or all of the above

Populations are approximately normal; one of the assumptions for the t-test for two independent populations is normality

The ability to reject the null hypothesis when the null is in fact false is called?

Power; the power of a statistical test is defined as 1 - beta = P(reject H0 | Ha is true)

Let S denote the sample size and P denote the population size. Which of the following statements is the most correct? S is always smaller than or equal to P; S can be larger or smaller than P; S can be larger than P; S is always equal to P

S is always smaller than or equal to P

The probability distribution for all possible values of a given sample statistic is called?

Sampling distribution; the sampling distribution of a statistic is the distribution of values of the statistic over all possible samples of size n that could have been selected from the reference population.

Sensitivity

Sensitivity is the proportion of truly diseased people in the screened population who are identified as diseased by the screening test. It is a measure of the probability of correctly diagnosing a case or the probability that any given case will be identified by the test (e.g. true positives).

The proportion of people with a disease who are correctly identified by a screening test as having the disease is called:

Sensitivity: sensitivity is defined as the proportion of truly diseased people in the screened population who are identified as having the disease by the screening test (true positives).

Specificity

Specificity is the proportion of truly non-diseased people who are so identified by the screening test. It is a measure of the probability of correctly identifying a non-diseased person with the screening test (e.g. true negatives).

Making inferences regarding certain characteristics of the population based on the sample data is referred to as:

Statistical inference; SI is the use of statistics to make inferences concerning some unknown aspect of a population

Descriptive statistics

Summarize a sample selected from a population

The goal of an ANOVA statistical analysis is to determine whether or not:

The means of two or more populations are different. One of the simplest experimental designs is the completely randomized design, in which random samples are selected independently from each of g populations. An analysis of variance is used to test whether the g population means are the same or whether at least one mean differs from the others.

A type 1 error is defined as:

Rejecting the null hypothesis when the null hypothesis is true; the probability of making this error is alpha

Assume that a linear regression analysis is performed. Which of the following results would justify trying a different method of analysis for the data? The constant is not significant; plotting the residuals against the dependent variable gives a random cloud of points; the r^2 = 0.99; the slope coefficient = 0.001; the r^2 = 0.001

The r^2 = 0.001. The r^2 value indicates the proportion of variance in the criterion variable Y that is accounted for by the variation in the predictor variable X. In linear regression analysis, the set of predictor variables x 1, x 2, ... is used to explain the variability of the criterion variable Y. The r^2 value falls between 0 and 1, with a value closer to 1 meaning the model explains more of the variability in Y. A value of 0.001 means the linear model explains almost none of it.

Suppose a researcher calculates a confidence interval for a population mean based on a sample size of 9. Which of the following assumptions have been made? The sampled population is approximately normal; the sampling distribution of z is normal; the population standard deviation is known; no assumptions have been made

The sampled population is approximately normal. In general, a 95% confidence interval for a mean assumes that the sample mean is approximately normally distributed, which the central limit theorem provides for large samples. However, with a sample of 9 the central limit theorem does not apply, so the sampled population itself needs to be approximately normal.

When performing a nonparametric Wilcoxon rank-sum test, the first step is to combine the data values in the two samples and assign a rank of '1' to which of the following? The middle observation; the largest observation; the smallest observation; the most frequently occurring observation

The smallest observation; the ranking procedure for the Wilcoxon rank-sum test is to first combine the data from the two groups, and order the values from lowest to highest.

An epidemiologist attempts to predict the weight of an elderly person from demispan. She randomly chooses 70 elderly subjects in a particular geographic area and records their weight and demispan measurements in the form of (x_i, y_i) for i = 1, ..., 70. Given that the value of the Pearson correlation coefficient is zero, what can be deduced? There is a strong negative relationship between weight and demispan; there could be some nonlinear relationship between weight and demispan; there is no relation between weight and demispan; all pairs of values of weight and demispan are practically identical; there is an almost perfect relationship between weight and demispan.

There could be some nonlinear relationship between weight and demispan. The justification is that the Pearson correlation only measures linear relationships. A value of zero means that there is no linear relationship, but there could be a nonlinear one. For example, if the points are (-3, 9), (-2, 4), (-1, 1), (0, 0), (1, 1), (2, 4), (3, 9), then the Pearson correlation is zero even though Y = X squared exactly.
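
The seven points in this example can be checked directly. The sketch below computes the Pearson correlation from its definition using only the standard library; the function name is illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# The points from the example: y is exactly x squared
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)
```

The positive and negative cross-products cancel, so r is zero even though y is completely determined by x.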

A clinical trial is conducted to assess the efficacy of a new drug for increasing HDL cholesterol. A 95% confidence interval for the difference in increase in HDL cholesterol levels over 12 weeks between patients assigned to the new drug or to placebo is (-2.45, 5.97). Which of the following statements is most correct?

There is no significant difference in increases in HDL cholesterol levels measured over 12 weeks. The outcome of interest is the increase (or change) in HDL over 12 weeks. Because the confidence interval for the difference (between patients assigned to the new drug versus placebo) in increase in HDL includes zero (the null value), we do not have evidence of statistically significant difference in increase in HDL between groups.

What is involved in estimation

Trying to determine likely values for unknown population parameter

A clinical experiment with four treatment groups was analyzed using an ANOVA and a significant difference in the population means is found. Which of the following is a natural or next step?

Tukey's or a similar method of pairwise comparison. Once a significant difference among the population means is found after performing an ANOVA, we next examine pairwise comparisons to further identify the nature of the differences while adjusting for the multiple comparisons via Tukey's method or a similar method.

Which of the following statistical tests is not considered a nonparametric test? Mann-Whitney, Tukey's, Kruskal-Wallis, Wilcoxon rank-sum

Tukey's test; there are actually two Tukey's tests. One is a post hoc procedure for ANOVA, and the other is a test for additivity used in ANOVA. Neither is a nonparametric test.

Statistical Inference

Two broad areas: estimation and hypothesis testing

A longitudinal cohort study is conducted to assess risk factors for diabetes. Participants free of diabetes at the start of the study are followed for 10 years for development of diabetes. What test would be used to assess whether age is related to incident diabetes?

Two independent sample t test: the goal of the analysis is to compare mean ages (age is a continuous variable) between two independent groups (persons who develop diabetes over the 10 year follow-up as compared to those who do not).

A longitudinal cohort study is conducted to assess risk factors for diabetes. Participants free of diabetes at the start of the study are followed for 10 years for development of diabetes. Consider the study described above and suppose that the outcome is change in blood glucose level over 10 years. What test would be used to assess whether sex is related to change in blood glucose level?

Two independent samples t test

Hypothesis testing

A type of statistical inference: a statement is made about a parameter, and sample statistics support or refute the statement

Estimation

A type of statistical inference: a population parameter (mean, proportion) is unknown, and sample statistics are used to generate estimates

Summarizing location and variability

When there are no outliers, the sample mean and standard deviation summarize location and variability. When there are outliers, the median and the IQR summarize location and variability, where IQR = Q3 - Q1. Outliers are values < Q1 - 1.5(IQR) or > Q3 + 1.5(IQR). Q1 = 25th percentile; Q3 = 75th percentile; median = middle value.
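
The 1.5(IQR) rule can be sketched in Python as follows. This is a minimal illustration rather than part of the original card: the data list is hypothetical, and the quartiles use the standard library's "inclusive" method, which is only one of several quartile conventions (other textbooks and software may give slightly different Q1 and Q3).

```python
import statistics

def iqr_fences(values):
    """Return (q1, q3, lower_fence, upper_fence) using the 1.5 * IQR rule."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 2, 3, 4, 5, 6, 7, 8, 50]  # hypothetical data with one extreme value
q1, q3, low, high = iqr_fences(data)
outliers = [x for x in data if x < low or x > high]
print(q1, q3, low, high, outliers)
```

With an outlier flagged like this, the median and IQR would be the preferred summary for this list.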

normal distribution formula (z score)

Z = (x - mean) / standard deviation
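
A small sketch of the z score calculation, assuming a hypothetical mean of 100 and standard deviation of 15 (these numbers are not from the card). Note the parentheses in the code: the subtraction happens before the division.

```python
def z_score(x, mean, sd):
    """Standardize x: how many standard deviations x lies from the mean."""
    return (x - mean) / sd

# Hypothetical values: mean 100, standard deviation 15.
z = z_score(130, 100, 15)  # (130 - 100) / 15
print(z)
```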

Tchebysheff's Theorem states that, given a number k >= 1 and regardless of the shape of a population's frequency distribution, the proportion of observations falling within k standard deviations of the mean is:

at least 1 - (1/k^2)
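
To get a feel for the bound, here is a minimal sketch that evaluates 1 minus 1 over k squared for a few values of k. The function name is made up for illustration.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return 1 - 1 / k**2

for k in (1, 2, 3):
    print(k, chebyshev_lower_bound(k))
```

For k = 2 the bound is 75%, which is weaker than the roughly 95% that holds for a normal distribution. That is the price of making no assumption about shape.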

A longitudinal cohort study is conducted to assess risk factors for diabetes. Participants free of diabetes at the start of the study are followed for 10 years for development of diabetes. What test would be used to assess whether BMI (measured as normal weight, overweight, and obese) is related to incident diabetes?

chi-square test of independence; The outcome of interest is incident diabetes, a dichotomous or indicator variable, and the predictor is BMI. BMI is generally measured as a continuous variable, but here participants are classified into one of three ordinal categories. The goal of the analysis is to compare the proportions of participants who develop diabetes among the BMI categories. The data can be organized into a 3x2 table for analysis.
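
The 3x2 analysis can be sketched by hand in Python. The counts below are hypothetical, invented only to show the arithmetic. The p-value shortcut is valid only because this table has 2 degrees of freedom, where the chi-square upper-tail probability equals exp(-x/2). For other table sizes a statistics package would be needed.

```python
import math

# Hypothetical counts: rows are BMI categories, columns are (diabetes, no diabetes).
observed = [
    [20, 180],  # normal weight
    [30, 170],  # overweight
    [50, 150],  # obese
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under independence: row total * column total / grand total.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1)(c - 1) = 2

# For df = 2 only, the upper-tail probability is exactly exp(-x / 2).
p_value = math.exp(-chi_square / 2)
print(round(chi_square, 2), df, p_value)
```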

The probability that an event occurs if some condition is met is called a:

conditional probability; a conditional probability is the probability of an event assuming that another event occurred.

frequency bar chart versus histogram

Frequency bar chart: categorical data. Histogram: ordinal data. Either can show relative frequencies instead of counts (a relative frequency histogram or bar chart).

Ordinal and categorical/nominal

have more than two responses and responses are ordered and unordered, respectively

Location of box in box and whisker

If it is centered, the data are roughly symmetric; if it sits to one side or the other, the data are skewed either right or left

Frequency

is count

Standard deviation definition

A measure of variability among observations. Example: s = 19.4 means that, on average, the observations deviate (vary) from the sample mean by about 19 units (above or below). It describes how tightly the observations cluster around the mean: the smaller s is, the more tightly clustered the observations are around the mean

Normal Distribution

model for a continuous outcome; mean = median = mode

Standard summary for a continuous variable:

n = 75, xbar = 123.6, s = 19.4, where n = sample size, xbar = sample mean, and s = standard deviation

Continuous

n, xbar, and s, or median and IQR (if outliers); box-and-whisker plot

Continuous

or measurement, assume in theory any values between a theoretical minimum and maximum

Cumulative frequency

ordinal only - running counts (adding each category's count to the running total)

cumulative relative frequency

ordinal only - running percentages (adding each category's percentage to the running total)
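
Running counts and running proportions can be sketched with the counts from the rating table earlier in this set (4, 7, 6, 3, for n = 20). The variable names are illustrative only.

```python
from itertools import accumulate

# Counts from the rating table earlier in this set (n = 20), in rating order.
ratings = ["excellent", "good", "average", "poor"]
counts = [4, 7, 6, 3]

total = sum(counts)
cumulative_counts = list(accumulate(counts))                  # running counts
cumulative_rel_freq = [c / total for c in cumulative_counts]  # running proportions

for rating, cum, rel in zip(ratings, cumulative_counts, cumulative_rel_freq):
    print(f"{rating:<10} {cum:>3} {rel:.2f}")
```

The last running proportion is 1, matching the rule that the final category accumulates 100% of the data.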

Poisson is used when

outcome is a count; number of response categories is > 2; number of trials/replications is infinite; trials are independent. Example: exposure data (exposed or not), the likelihood of observing who has been exposed versus not. Used with larger numbers.
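
As a rough sketch, Poisson probabilities for a count can be computed from the standard formula P(X = k) = e^(-rate) * rate^k / k!. The formula is not stated on the card, and the rate of 2 events per interval is a hypothetical value chosen for illustration.

```python
import math

def poisson_pmf(k, rate):
    """P(X = k) for a Poisson count with the given mean rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)

rate = 2.0  # hypothetical average number of events per interval
probs = [poisson_pmf(k, rate) for k in range(5)]
print([round(p, 4) for p in probs])
```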

Binomial is used when

outcome is success/failure; number of response categories = 2; number of trials/replications is fixed; trials are independent. Example: a certain type of disease, where patients either have it or they don't. With 50 patients, what is the likelihood that any one has the disease? Whether one person has it is independent of whether another does. A question such as "out of 10 patients seen, what is the probability that a given number have the disease?" could be answered with the binomial.
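
The 10-patient question can be sketched with the standard binomial formula. The disease probability p = 0.1 is a hypothetical value, since the card does not give one.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.1  # 10 patients seen; p = 0.1 is a hypothetical disease probability
p_none = binomial_pmf(0, n, p)
p_at_least_one = 1 - p_none
print(round(p_none, 4), round(p_at_least_one, 4))
```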

Relative frequency

percent or proportion

Point estimate

the best single-valued estimate for a parameter

The Central Limit Theorem states that:

the sample mean is approximately normal; the central limit theorem states that if the sample size is large enough, the distribution of the sample means can be approximated by a normal distribution, even if the original population is not normally distributed. In other words, the distribution of the sample means approaches a normal distribution as the sample size increases.
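
A small simulation can make this concrete. The sketch below draws repeated samples from a right-skewed exponential population and summarizes their means. The population, the sample size of 50, and the 2000 repetitions are all arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 50        # size of each sample (hypothetical choice)
reps = 2000   # number of samples drawn (hypothetical choice)

# Exponential population with rate 1: right-skewed, mean 1, standard deviation 1.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_sd = 1 / math.sqrt(n)  # population sd divided by sqrt(n)
print(mean_of_means, sd_of_means, expected_sd)
```

The sample means should center near the population mean of 1, with a spread close to the population standard deviation divided by the square root of n, even though the population itself is far from normal.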

The following test would be used to compare mean ages between groups: (Mean +- SD)

two independent samples t test; the outcome of interest is age, which is a continuous variable, and the interest lies in comparing mean ages between two independent groups (participants assigned to the placebo as compared to participants assigned to the experimental group). The two independent samples t test is used to compare means of a continuous variable between two independent groups.
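
A minimal sketch of the test statistic, assuming the pooled-variance (equal variances) form of the two independent samples t test. The ages are hypothetical, and the function name is made up for illustration.

```python
import math
import statistics

def pooled_t(sample_a, sample_b):
    """Two independent samples t statistic with a pooled variance; returns (t, df)."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    df = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / se, df

placebo_ages = [50, 52, 54, 56, 58]        # hypothetical
experimental_ages = [54, 56, 58, 60, 62]   # hypothetical
t_stat, df = pooled_t(placebo_ages, experimental_ages)
print(t_stat, df)
```

The statistic would then be compared with a t distribution on n1 + n2 - 2 degrees of freedom.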

One can describe the F-distribution as a sampling distribution of the ratio of the following:

two sample variances, provided that the samples are independently drawn from two normal populations with equal variances. This is just the definition of the F statistic, which is typically used for comparing two population variances. If the parent populations are independent and normally distributed, the F statistic is calculated as F = var1/var2 or F = var2/var1, with the larger of the two variances in the numerator. This ratio has an F-distribution with degrees of freedom n1 - 1 and n2 - 1, where n1 and n2 are the sizes of the numerator and denominator samples
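
The ratio can be sketched as follows, with two hypothetical samples. The larger sample variance goes in the numerator, and the degrees of freedom follow the same order.

```python
import statistics

sample_1 = [2, 4, 6, 8, 10]  # hypothetical
sample_2 = [4, 5, 6, 7, 8]   # hypothetical

var_1 = statistics.variance(sample_1)
var_2 = statistics.variance(sample_2)

# Put the larger sample variance in the numerator.
if var_1 >= var_2:
    f_ratio = var_1 / var_2
    dfs = (len(sample_1) - 1, len(sample_2) - 1)
else:
    f_ratio = var_2 / var_1
    dfs = (len(sample_2) - 1, len(sample_1) - 1)

print(f_ratio, dfs)
```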

If all of the numbers in a list increase by 2, then the standard deviation is: unchanged, cannot be determined without the actual list of numbers, increased by 4, increased by 2

unchanged; adding a constant to every number in a list shifts the values (and the mean) by that constant but does not change the standard deviation
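
A quick check, using the example list 1, 3, 5, 2, 4 from earlier in this set and adding 2 to every value:

```python
import math
import statistics

data = [1, 3, 5, 2, 4]           # the example list used earlier in this set
shifted = [x + 2 for x in data]  # every number increased by 2

sd_before = statistics.stdev(data)
sd_after = statistics.stdev(shifted)
mean_shift = statistics.mean(shifted) - statistics.mean(data)

print(math.isclose(sd_before, sd_after), mean_shift)
```

The mean moves by 2, but each deviation from the mean is unchanged, so the standard deviation stays the same.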

percentile formula

x = mean + (z * standard deviation)
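
A short sketch of the percentile calculation, reading the formula as x = mean plus z times the standard deviation and assuming a normal distribution. The mean of 100, the standard deviation of 15, and the choice of the 90th percentile are hypothetical.

```python
from statistics import NormalDist

mean, sd = 100, 15              # hypothetical values
z = NormalDist().inv_cdf(0.90)  # z score for the 90th percentile, about 1.28
x = mean + z * sd               # value at the 90th percentile

print(round(z, 2), round(x, 1))
```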

If a population has a standard deviation σ, then the standard deviation of the mean of 100 randomly selected items from this population is:

σ/10; the standard deviation of the mean is given by σ/√n, and here n = 100, so √n = 10

