HIM 4331 FINAL EXAM

columns

For machine learning, the ________ of a matrix are called the features.

less; is

If the p value from your statistical test is _______ than the alpha significance level, the hypothesis test _________ statistically significant.

parameter

Imagine that the U.S. federal government had the means to survey all high school seniors in the U.S. concerning their plans for future education and employment, and found that 50 percent were planning to attend a 4-year college or university. This 50 percent is an example of a:

C. (95 - 85)/ 5

In a biology class, the scores on the final exam were normally distributed, with mean = 85 and standard deviation = 5. Susan got a final exam score of 95. Select the correct formula to compute her Z-score. A. (100-95)/5 B. (85-95)/5 C. (95 - 85)/ 5 D. (95-5)/5
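
As a worked check of option C, the Z-score formula can be sketched in Python. The function name `z_score` is illustrative and not part of the course material.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies from the mean."""
    return (x - mean) / sd

# Susan's score of 95, with mean 85 and standard deviation 5
susan_z = z_score(95, 85, 5)
print(susan_z)  # 2.0
```

A result of 2.0 means her score is two standard deviations above the class mean.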

7/35

In a class of 35 students, seven students received scores in the 70-79 range. What is the relative frequency of scores in this range?
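
A minimal sketch of the same calculation in Python, using the counts from the question (the variable names are illustrative):

```python
count_in_range = 7    # students scoring 70-79
total_students = 35   # class size

# Relative frequency = count in the category / total observations
relative_frequency = count_in_range / total_students
print(round(relative_frequency, 2))  # 0.2
```

The fraction 7/35 reduces to 0.2, or 20 percent of the class.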

A. x1

In the regression equation Y = a + b1x1 + b2x2 which variable represents a feature data value? A. x1 B. y C. b1 D. a

C. y

In the regression equation Y = a + b1x1 + b2x2 which variable represents the predicted value? A. a B. b1 C. y D. x1

reject, because the CI does not contain the exam score 80

Let alpha = 0.05. Your null hypothesis is H0: mean exam score on the first test = 80. You calculated a 95% confidence interval of (70, 79). Do you reject or fail to reject the null hypothesis?

C. (sample mean + 1.96se, sample mean - 1.96se)

Let se = standard error. Which correctly represents a 95% Confidence interval? A. (sample mean + 1.96, sample mean - 1.96) B. (sample mean + se, sample mean - se) C. (sample mean + 1.96se, sample mean - 1.96se) D. (sample mean + 2 se, sample mean - 2 se)
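
The formula in option C can be sketched as a small Python function. The name `ci_95` and the sample values (mean 80, se 2) are hypothetical, chosen only for illustration. Note that the sketch returns the bounds as (lower, upper), whereas the answer choices list the upper bound first.

```python
def ci_95(sample_mean, se):
    """Return the (lower, upper) bounds of a 95% confidence interval."""
    margin = 1.96 * se  # z value for 95% times the standard error
    return (sample_mean - margin, sample_mean + margin)

# Hypothetical sample mean and standard error
lower, upper = ci_95(80.0, 2.0)
print(round(lower, 2), round(upper, 2))  # 76.08 83.92
```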

quantitative-discrete

Number of visits per week is what kind of data?

internal validity

Problems with content validity and construct validity describe threats to:

mean

The cluster center is the _________ of all points in that cluster.
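
To illustrate, a cluster center can be computed as the column-wise mean of the points in the cluster. The three 2-D points below are made up for this sketch.

```python
import numpy as np

# Three hypothetical 2-D points assigned to the same cluster
points = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

# axis=0 averages down each column, giving one mean per feature
center = points.mean(axis=0)
print(center)  # [3. 4.]
```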

1.96

A 95% confidence level corresponds to a z value of ________.

below; -0.89

The distribution of height in centimeters for girls aged five years and no months has the distribution: X ~ N(109, 4.5). The Z-score for one girl is −0.89. This means that she is ______ the average height for girls of age five years and no months by _______ standard deviations.

C. z = (112 - 109) / 4.5

The distribution of height in centimeters for girls aged five years and no months is: X ~ N(109, 4.5). Which is the correct formula to calculate a Z score for a height of 112 centimeters?

defining the research questions

The first step in the research process is:

cluster

The manager of a department store decides to measure employee satisfaction by selecting four departments at random and conducting interviews with all the employees in those four departments. What type of survey design is this?

faster; before

The median time to complete a 60-minute exam was 40 minutes. You finished in 30 minutes. Your time was _______ than the median, indicating that you finished _____ half of the exam takers.

C. 120

The midterm grades on a chemistry exam, graded on a scale of 0 to 100, are below. Which grade is the outlier in this data? 64, 65, 65, 68, 70, 72, 75, 75, 75, 76, 78, 78, 81, 83, 84, 98, 100, 100, 120 A. 62 B. 84 C. 120 D. 100

true

True or False. Conducting a pilot study helps researchers to work out survey design and implementation errors.

false

True or False. Internal validity involves being able to apply a study's results to another similar healthcare organization.

false

True or False. The relative frequency of a value that occurs 10 times in 100 observations is ten.

true

True or false. A Pearson's r value of -2 indicates the regression line has a negative slope.

true

True or false. A Seaborn countplot is used to make a bar chart.

false

True or false. A Silhouette score of zero is desirable; it indicates clusters are far away from each other.

true

True or false. A coefficient of determination close to one indicates the regression model is better at prediction.

true

True or false. A correlation heatmap uses the absolute value of the target.

true

True or false. A feature can be any column except the target column.

false

True or false. A mean squared error close to zero indicates the model is a bad fit.

true

True or false. A negative Silhouette score indicates points are assigned to the wrong cluster.

true

True or false. A reduced dataset is created by removing columns that are highly correlated with the target.

false

True or false. A small standard error indicates the sample mean is not a good population mean estimator.

true

True or false. A smaller p value provides more evidence against the null hypothesis.

true

True or false. A subject may withdraw from a research study at any time.

false

True or false. According to the Empirical Rule, given a mean = 800 and standard deviation = 150, the formula for finding the percentage less than 600 would be (600-150)/800.

true

True or false. Calculating Silhouette coefficients helps determine the optimal number of clusters.

false

True or false. Clustering algorithms are an example of supervised learning.

true

True or false. Each point in a cluster is closer to its own cluster center than to the other cluster centers.

true

True or false. Feature selection can be conducted with Chi Square or ANOVA tests.

true

True or false. Feature selection would exclude columns that are strongly correlated with each other.

false

True or false. For KNN Classification, the data being analyzed must be normally distributed.

false

True or false. For K-nearest neighbors classification, the y value represents the label predictors.

true

True or false. If you draw random samples of size n, then as n increases, the sample means form a normal distribution called the sampling distribution.

true

True or false. In a normal distribution, the mean, median and mode are all equal.

true

True or false. If the p value is greater than or equal to the alpha value, we do not reject the null hypothesis.

true

True or false. K-nearest neighbors classification works by computing the smallest distance between the point(s) to be labelled and the existing points in the neighborhood.

true

True or false. Numpy row and column array indexes start with zero.

true

True or false. Pandas has two basic data structures, a series and a data frame.

false

True or false. Regression is qualitative supervised machine learning that predicts continuous numeric outcomes.

false

True or false. The K-means algorithm determines the number of clusters.

true

True or false. The Numpy array column indexing given by arr[1, 0:3] would not include column three.
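
A short sketch shows why the statement is true: the stop index of a slice is exclusive. The 3-by-4 array here is a made-up example.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)
# arr is:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# Row 1, columns 0, 1, and 2; column 3 (the value 7) is left out
selected = arr[1, 0:3]
print(selected)  # [4 5 6]
```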

true

True or false. The choice of K for KNN Classification has a major effect on the classifier created.

true

True or false. The data.describe command would return the mean, median, and count.

true

True or false. The goal of regression is to find an equation for the relationship between the predictors and the outcome variable.

true

True or false. The probability of committing a Type I error is decreased when you decrease your alpha significance level.

true

True or false. The regression coefficients represent the effect of a one-unit change in x on the target y value.

false

True or false. The standard deviation of the sample means is equal to the population standard deviation.
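
The statement is false because the standard deviation of the sample means (the standard error) is the population standard deviation divided by the square root of the sample size. A minimal sketch, with a hypothetical population standard deviation of 10 and sample size of 25:

```python
import math

population_sd = 10.0  # hypothetical population standard deviation
n = 25                # hypothetical sample size

# Standard error of the mean = sigma / sqrt(n)
standard_error = population_sd / math.sqrt(n)
print(standard_error)  # 2.0
```

With these numbers the standard error is 2.0, which is smaller than the population standard deviation of 10.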

false

True or false. The strength and direction of the x and y relationship is given by the coefficient of determination.

false

True or false. When using a correlation heatmap, you would keep the features with a Pearson's correlation value below 0.5.

false

True or false. When you sign an informed consent you release the researchers from any liability for negligence.

false

True or false. You are conducting a machine learning analysis with data that does not have a labeled outcome field. This is called supervised learning.

true

True or false. You should reject the null hypothesis if your test statistic is greater than the critical value.

true

True or false. You want to test that the mean age of graduate students at your college is greater than 28 years. Thus, your null hypothesis would be that the mean age of graduate students is less than or equal to 28 years of age.

false

True or false. Your model is stronger if it has columns that are not related to the outcome.

D. r >= 80%

We want to test that 80% or more of the students who take the RHIA exam pass on the first attempt. Let r = the pass rate for the RHIA exam for the first try. Which is the correct alternate hypothesis? A. r = 80% B. r < 80% C. r > 80% D. r >= 80%

histogram

What graph is not used for categorical data?

nominal

What type of measurement scale is being used when you measure blood type?

before building the model

When should you perform feature selection?

E. A and B

Which are limitations of k-means clustering? A. only handles linear boundaries B. slow for large data sets C. uses the most complex clustering algorithm D. All the above E. A and B

B. study of the number of data breaches per geographic region

Which best describes a quantitative study? A. study of patient perceptions of staff friendliness B. study of the number of data breaches per geographic region C. study to observe hand washing in an ambulatory surgery center D. study asking participants to describe their experiences at their last doctor's visit

B. arr[0:3 , :]

Which creates an array with rows 0, 1, and 2, and all the columns? A. arr[0:2 , :] B. arr[0:3 , :] C. arr[ : , 0:3] D. arr[:, 0:2 ]
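
The difference between options A and B can be checked with a small sketch. The 5-by-4 array is a made-up example.

```python
import numpy as np

arr = np.arange(20).reshape(5, 4)  # 5 rows, 4 columns

# Option B: rows 0, 1, and 2 (stop index 3 is exclusive), all columns
first_three_rows = arr[0:3, :]
print(first_three_rows.shape)  # (3, 4)

# Option A stops before row 2, so it returns only rows 0 and 1
first_two_rows = arr[0:2, :]
print(first_two_rows.shape)  # (2, 4)
```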

C. Amount of time students studied for a Biology quiz

Which is a continuous variable? A. Count of students living in college housing B. The different types of coffee purchased at Starbucks coffee shop C. Amount of time students studied for a Biology quiz D. Number of patients discharged in one day

D. pulse rate

Which is not a qualitative variable? A. blood type B. favorite color C. city D. pulse rate

A. convenience sampling

Which is not an element of an experimental study? A. convenience sampling B. double blind assignment to groups C. control group D. treatment group

B. continuous outcome

Which is not associated with supervised machine learning classification? A. discrete labels B. continuous outcome C. categorical outcome D. qualitative

C. filtering

Which is selecting only columns that are highly correlated to your target? A. sorting B. predicting C. filtering D. analyzing

A. Sci-Kit

Which is the Python package that contains most of the machine learning code? A. Sci-Kit B. Matplotlib C. Seaborn D. NumPy

B. Pandas

Which is the Python package that we used in class to load our comma separated value files? A. SciPy B. Pandas C. Seaborn D. Matplotlib

B. residual

Which is the difference between a y value in your data set and the y value calculated with multivariate regression? A. slope B. residual C. mean squared error D. adjusted R squared
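
A residual can be sketched with the regression equation Y = a + b1x1 + b2x2 used elsewhere in this set. All coefficient and data values below are hypothetical.

```python
# Hypothetical fitted coefficients
a, b1, b2 = 1.0, 2.0, 0.5

# One hypothetical observation: two feature values and the observed target
x1, x2 = 3.0, 4.0
y_observed = 10.0

# Predicted value from the regression equation: 1 + 6 + 2 = 9
y_predicted = a + b1 * x1 + b2 * x2

# Residual = observed y minus predicted y
residual = y_observed - y_predicted
print(residual)  # 1.0
```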

C. prediction

Which is the process of evaluating new points with your trained machine learning model? A. test B. train C. prediction D. all the above

C. target

Which is the quantity you want to predict with your machine learning model? A. column B. feature C. target D. all the above

B. arr[: , 0:8]

Which selects all rows, and columns 0 up to and including column 7? A. arr[0:8,:] B. arr[: , 0:8] C. arr[:, 0:7] D. arr[0:7,:]

C. decision tree

Which works with both continuous and categorical data? A. classification B. regression C. decision tree D. a or b

A. 15 + (2)(4.3)

You are given that the mean = 15 and standard deviation = 4.3. Which is the correct formula to calculate the number that is two standard deviations above the mean? A. 15 + (2)(4.3) B. 15 - (2)(4.3) C. 15 + 2 D. 15 + 4.3
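
Option A can be evaluated directly. This sketch uses the values from the question; the variable names are illustrative.

```python
mean = 15
sd = 4.3
k = 2  # number of standard deviations above the mean

value = mean + k * sd
print(round(value, 1))  # 23.6
```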

bar chart

You collect data on the color of cars driven by students in your statistics class, the best chart to display the data is:

C. histogram

You get data from the U.S. Census Bureau on the median household income for your city. Which chart is the better choice to group and display this data? A. pie chart B. bar chart C. histogram D. line chart

82%; 18%

You had to wait 90 minutes in the emergency room of a hospital before you could see a doctor. You learn this wait time was in the 82nd percentile of all wait times. This means that ______ percent of patients had a shorter wait time than you, and only ______ percent had a longer wait time.

mydata.Country

You have loaded some data into the Python variable mydata. Which command would you use to select the single column named Country from this data?
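
A minimal sketch of column selection, assuming `mydata` is a pandas DataFrame. The data here is invented for illustration, since the question does not show the actual file.

```python
import pandas as pd

# Hypothetical data standing in for the loaded file
mydata = pd.DataFrame({
    "Country": ["Canada", "Mexico"],
    "Visits": [3, 5],
})

# Attribute access selects the single column named Country
country = mydata.Country

# Bracket notation selects the same column
same_column = mydata["Country"]
print(country.equals(same_column))  # True
```

Bracket notation is the safer habit when a column name contains spaces or clashes with a DataFrame method name.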

B. median

You have numeric data with several outliers, which is the best measure to reflect the average value of the data set? A. mean B. median C. mode D. standard deviation

2

You have stored the following data in a Numpy array named myarr: 5, 2, 4, 6. What would the command myarr[1] return?
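
The answer follows from zero-based indexing, which a short sketch can confirm:

```python
import numpy as np

myarr = np.array([5, 2, 4, 6])

# Index 0 holds 5, so index 1 holds the second element
second = myarr[1]
print(second)  # 2
```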

C. paired t-test

You want to determine if the mean weight of your research group is lower after the members have been on a special low-fat diet for two weeks. Which is the best t-test for this research? A. one sample B. independent sample C. paired t-test D. a or b

retrospective

You want to determine the rate of postoperative infections in patients who previously had heart valve surgery and were discharged from the hospital. This is a _________ study.

50%

Your data is normally distributed. This implies that approximately ________ percent of your data will be above the mean.

better; better

Your daughter brings home test scores showing that she scored in the 80th percentile in math and the 76th percentile in reading for her grade. This means that your daughter scored ________ than 80 percent of the students in her grade on math and _________ than 76 percent of the students in reading.

Type II

Your null hypothesis is that the new medicine you are testing does not reduce systolic blood pressure. This hypothesis is actually false, but you failed to reject it. What type of error is this?

strong positive

A Pearson's r value of 0.87 indicates a ____________ association.

systematic

A health club is interested in knowing how many times a typical member uses the club in a week. They decide to ask every tenth customer on a specified day to complete a short survey including information about how many times they have visited the club in the past week. What kind of sampling is this?

statistic

A study finds that the mean amount spent on produce per visit to a grocery store by the customers in the sample is $12.84. This is an example of a:

intra-rater reliability

A survey was administered by the same instructor at two times with similar results. This is an example of:

most

Cluster centroids are relocated until they are all in the position that gives them maximum density; this means they have the ___________________ points closest to the cluster centers.

