HIM 4331 FINAL EXAM
columns
For machine learning, the ________ of a matrix are called the features.
less; is
If the p value from your statistical test is _______ than the alpha significance level, the hypothesis test _________ statistically significant.
parameter
Imagine that the U.S. federal government had the means to survey all high school seniors in the U.S. concerning their plans for future education and employment, and found that 50 percent were planning to attend a 4-year college or university. This is an example of a:
C. (95 - 85)/ 5
In a biology class, the scores on the final exam were normally distributed, the mean = 85, and a standard deviation = 5. Susan got a final exam score of 95. Select the correct formula to compute her Z-score? A. (100-95)/5 B. (85-95)/5 C. (95 - 85)/ 5 D. (95-5)/5
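A minimal sketch of the Z-score calculation from this card. The function name `z_score` is only illustrative; the numbers are the ones given in the question.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

# Susan's score of 95 with mean 85 and standard deviation 5
susan_z = z_score(95, 85, 5)
print(susan_z)
```

A positive result means the score is above the mean; a negative result means it is below.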
7/35
In a class of 35 students, seven students received scores in the 70-79 range. What is the relative frequency of scores in this range?
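As a quick check, relative frequency is the count in the range divided by the total number of observations. The variable names below are assumptions made for illustration.

```python
from fractions import Fraction

count_in_range = 7    # students scoring 70-79
total_students = 35   # class size

# Relative frequency as an exact fraction and as a decimal
rel_freq_fraction = Fraction(count_in_range, total_students)
rel_freq_decimal = count_in_range / total_students

print(rel_freq_fraction, rel_freq_decimal)
```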
A. x1
In the regression equation Y = a + b1x1 + b2x2 which variable represents a feature data value? A. x1 B. y C. b1 D. a
C. y
In the regression equation Y = a + b1x1 + b2x2 which variable represents the predicted value? A. a B. b1 C. y D. x1
reject, because the CI does not contain the exam score 80
Let alpha = 0.05. Your null hypothesis is H0: mean exam score on the first test = 80. You calculated a 95% confidence interval of (70, 79). Do you reject or fail to reject the null hypothesis?
C. (sample mean + 1.96se, sample mean - 1.96se)
Let se = standard error. Which correctly represents a 95% Confidence interval? A. (sample mean + 1.96, sample mean - 1.96) B. (sample mean + se, sample mean - se) C. (sample mean + 1.96se, sample mean - 1.96se) D. (sample mean + 2 se, sample mean - 2 se)
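A rough sketch of the 95% confidence interval formula, assuming the standard error is computed as the sample standard deviation divided by the square root of n. The sample values (mean 74.5, standard deviation 10, n = 25) are hypothetical and not taken from the exam.

```python
import math

def ci95(sample_mean, sample_sd, n):
    """Return (lower, upper) bounds of a 95% confidence interval."""
    se = sample_sd / math.sqrt(n)   # standard error of the mean
    margin = 1.96 * se              # z value for 95% confidence
    return (sample_mean - margin, sample_mean + margin)

# Hypothetical sample: mean 74.5, standard deviation 10, 25 students
low, high = ci95(74.5, 10.0, 25)

# A hypothesized mean outside the interval leads to rejecting H0 at alpha = 0.05
hypothesized_mean = 80
reject_null = hypothesized_mean < low or hypothesized_mean > high
print(low, high, reject_null)
```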
quantitative-discrete
Number of visits per week is what kind of data?
internal validity
Problems with content validity and construct validity describe threats to:
mean
The cluster center is the _________ of all points in that cluster.
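To illustrate, a cluster center can be computed as the column-wise mean of the points assigned to that cluster. The three points below are made up for the example.

```python
import numpy as np

# Hypothetical points belonging to one cluster (rows = points, columns = features)
cluster_points = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
])

# The center is the mean of each feature across all points in the cluster
center = cluster_points.mean(axis=0)
print(center)
```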
1.96
A 95% confidence level has a z value of ________.
below; 0.89
The distribution of height in centimeters for girls aged five years and no months has the distribution: X ~ N(109, 4.5). The Z-score for one girl is −0.89. This means that she is ______ the average height for girls of age five years and no months by _______ standard deviations.
C. z = (112 - 109) / 4.5
The distribution of height in centimeters for girls aged five years and no months is: X ~ N(109, 4.5). Which is the correct formula to calculate a Z score for a height of 112 centimeters?
defining the research questions
The first step in the research process is:
cluster
The manager of a department store decides to measure employee satisfaction by selecting four departments at random and conducting interviews with all the employees in those four departments. What type of survey design is this?
faster; before
The median time to complete a 60-minute exam was 40 minutes. You finished in 30 minutes. Your time was _______ than the median, indicating that you finished _____ at least half of the exam takers.
C. 120
The midterm grades on a chemistry exam, graded on a scale of 0 to 100, are below. Which grade is the outlier in this data? 64, 65, 65, 68, 70, 72, 75, 75, 75, 76, 78, 78, 81, 83, 84, 98, 100, 100, 120 A. 62 B. 84 C. 120 D. 100
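One common way to flag outliers is the 1.5 x IQR rule. The card itself does not name a method, so treat this as one possible check rather than the approach used in class. It uses the grades listed in the question.

```python
import numpy as np

grades = np.array([64, 65, 65, 68, 70, 72, 75, 75, 75, 76,
                   78, 78, 81, 83, 84, 98, 100, 100, 120])

# Quartiles and interquartile range (NumPy's default linear interpolation)
q1, q3 = np.percentile(grades, [25, 75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [int(g) for g in grades if g < lower_fence or g > upper_fence]
print(outliers)
```

The grade 120 also falls outside the stated 0 to 100 scale, which is a simpler reason to treat it as suspect.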
true
True or False. Conducting a pilot study helps researchers to work out survey design and implementation errors.
false
True or False. Internal validity involves being able to apply a study's results to another similar healthcare organization.
false
True or False. The relative frequency of a value that occurs 10 times in 100 observations is ten.
true
True or false. A Pearson's r value of -2 indicates the regression line has a negative slope.
true
True or false. A Seaborn countplot is used to make a bar chart.
false
True or false. A Silhouette score of zero is desirable; it indicates clusters are far away from each other.
true
True or false. A coefficient of determination close to one indicates the regression model is better at prediction.
true
True or false. A correlation heatmap uses the absolute value of the target.
true
True or false. A feature can be any column except the target column.
false
True or false. A mean squared error close to zero indicates the model is a bad fit.
true
True or false. A negative Silhouette score indicates points are assigned to the wrong cluster.
false
True or false. A reduced dataset is created by removing columns that are highly correlated with the target.
false
True or false. A small standard error indicates the sample mean is not a good population mean estimator.
true
True or false. A smaller p value provides more evidence against the null hypothesis.
true
True or false. A subject may withdraw from a research study at any time.
false
True or false. According to the Empirical Rule, given a mean = 800 and standard deviation = 150, the formula for finding the percentage less than 600 would be (600-150)/800.
true
True or false. Calculating Silhouette coefficients helps determine the optimal number of clusters.
false
True or false. Clustering algorithms are an example of supervised learning.
true
True or false. Each point in a cluster is closer to its own cluster center than to the other cluster centers.
true
True or false. Feature selection can be conducted with Chi Square or ANOVA tests.
true
True or false. Feature selection would exclude columns that are strongly correlated with each other.
false
True or false. For KNN Classification, the data being analyzed must be normally distributed.
false
True or false. For K-nearest neighbors classification, the y value represents the label predictors.
true
True or false. If you draw random samples of size n, then as n increases, the sample means form a normal distribution called the sampling distribution.
true
True or false. In a normal distribution, the mean, median and mode are all equal.
true
True or false. If the p value is greater than or equal to the alpha value we do not reject the null hypothesis.
true
True or false. K-nearest neighbors classification works by computing the smallest distance between the point(s) to be labelled and the existing points in the neighborhood.
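A bare-bones sketch of the nearest-neighbor idea described in this card, written with NumPy rather than a machine learning library. The function `knn_predict`, the training points, and the labels are all hypothetical, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Label x_new by majority vote among its k closest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indexes of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled data: two well-separated groups
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]], dtype=float)
y_train = ["A", "A", "A", "B", "B", "B"]

label_near_a = knn_predict(X_train, y_train, np.array([2.0, 2.0]))
label_near_b = knn_predict(X_train, y_train, np.array([8.5, 8.5]))
print(label_near_a, label_near_b)
```

Changing `k` changes how many neighbors vote, which is why the choice of K affects the resulting classifier.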
true
True or false. Numpy row and column array indexes start with zero.
true
True or false. Pandas has two basic data structures, a series and a data frame.
false
True or false. Regression is qualitative supervised machine learning that predicts continuous numeric outcomes.
false
True or false. The K-means algorithm determines the number of clusters.
true
True or false. The Numpy array column indexing given by arr[1, 0:3] would not include column three.
true
True or false. The choice of K for KNN classification has a major effect on the classifier created.
true
True or false. The data.describe command would return the mean, median, and count.
true
True or false. The goal of regression is to find an equation for the relationship between the predictors and the outcome variable.
true
True or false. The probability of committing a Type I error is decreased when you decrease your alpha significance level.
true
True or false. The regression coefficients represent the effect of a one-unit change in x on the target y value.
false
True or false. The standard deviation of the sample means is equal to the population standard deviation.
false
True or false. The strength and direction of the x and y relationship is given by the coefficient of determination.
false
True or false. When using a correlation heatmap, you would keep the features with a Pearson's correlation value below 0.5.
false
True or false. When you sign an informed consent you release the researchers from any liability for negligence.
false
True or false. You are conducting a machine learning analysis with data that does not have a labeled outcome field. This is called supervised learning.
true
True or false. You should reject the null hypothesis if your test statistic is greater than the critical value.
true
True or false. You want to test that the mean age of graduate students at your college is greater than 28 years. Thus, your null hypothesis would be that the mean age of graduate students is less than or equal to 28 years of age.
false
True or false. Your model is stronger if it has columns that are not related to the outcome.
D. r >= 80%
We want to test that 80% or more of the students who take the RHIA exam pass on the first attempt. Let r = the pass rate for the RHIA exam for the first try. Which is the correct alternate hypothesis? A. r = 80% B. r < 80% C. r > 80% D. r >= 80%
histogram
What graph is not used for categorical data?
nominal
What type of measurement scale is being used when you measure blood type?
before building the model
When should you perform feature selection?
E. A and B
Which are limitations of k-means clustering? A. only handles linear boundaries B. slow for large data sets C. uses the most complex clustering algorithm D. All the above E. A and B
B. study of the number of data breaches per geographic region
Which best describes a quantitative study? A. study of patient perceptions of staff friendliness B. study of the number of data breaches per geographic region C. study to observe hand washing in an ambulatory surgery center D. study asking participants to describe their experiences at their last doctor's visit
B. arr[0:3 , :]
Which creates an array with rows 0, 1, and 2, and all the columns? A. arr[0:2 , :] B. arr[0:3 , :] C. arr[ : , 0:3] D. arr[:, 0:2 ]
C. Amount of time students studied for a Biology quiz
Which is a continuous variable? A. Count of students living in college housing B. The different types of coffee purchased at a Starbucks coffee shop C. Amount of time students studied for a Biology quiz D. Number of patients discharged in one day
D. pulse rate
Which is not a qualitative variable? A. blood type B. favorite color C. city D. pulse rate
A. convenience sampling
Which is not an element of an experimental study? A. convenience sampling B. double blind assignment to groups C. control group D. treatment group
B. continuous outcome
Which is not associated with supervised machine learning classification? A. discrete labels B. continuous outcome C. categorical outcome D. qualitative
C. filtering
Which is selecting only columns that are highly correlated to your target? A. sorting B. predicting C. filtering D. analyzing
A. scikit-learn
Which is the Python package that contains most of the machine learning code? A. scikit-learn B. Matplotlib C. Seaborn D. NumPy
B. Pandas
Which is the Python package that we used in class to load our comma separated value files? A. SciPy B. Pandas C. Seaborn D. Matplotlib
B. residual
Which is the difference between a y value in your data set and the y value calculated with multivariate regression? A. slope B. residual C. mean squared error D. adjusted R squared
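A small sketch of a residual using the regression form Y = a + b1x1 + b2x2 from the earlier cards. The coefficient and data values are invented for illustration only.

```python
def predict(a, b1, b2, x1, x2):
    """Predicted y from a two-feature linear regression equation."""
    return a + b1 * x1 + b2 * x2

# Hypothetical fitted coefficients and one observation
a, b1, b2 = 2.0, 0.5, 1.5
x1, x2 = 4.0, 2.0
observed_y = 7.5

predicted_y = predict(a, b1, b2, x1, x2)
# Residual = observed value minus the value calculated by the model
residual = observed_y - predicted_y
print(predicted_y, residual)
```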
C. prediction
Which is the process of evaluating new points with your trained machine learning model? A. test B. train C. prediction D. all the above
C. target
Which is the quantity you want to predict with your machine learning model? A. column B. feature C. target D. all the above
B. arr[: , 0:8]
Which selects all rows, and columns 0 up to and including column 7? A. arr[0:8,:] B. arr[: , 0:8] C. arr[:, 0:7] D. arr[0:7,:]
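The slicing rule behind this card is that the stop index is excluded, so `0:8` covers columns 0 through 7. A brief sketch on a hypothetical 4 x 10 array:

```python
import numpy as np

# Hypothetical array with 4 rows and 10 columns, filled with 0..39
arr = np.arange(40).reshape(4, 10)

first_three_rows = arr[0:3, :]   # rows 0, 1, 2 and all columns
first_eight_cols = arr[:, 0:8]   # all rows, columns 0 through 7
row_one_slice = arr[1, 0:3]      # row 1, columns 0, 1, 2 (column 3 excluded)

print(first_three_rows.shape, first_eight_cols.shape, row_one_slice)
```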
C. decision tree
Which works with both continuous and categorical data? A. classification B. regression C. decision tree D. A or B
A. 15 + (2)(4.3)
You are given that the mean = 15 and standard deviation = 4.3. Which is the correct formula to calculate the number that is two standard deviations above the mean? A. 15 + (2)(4.3) B. 15 - (2)(4.3) C. 15 + 2 D. 15 + 4.3
bar chart
You collect data on the color of cars driven by students in your statistics class. The best chart to display the data is:
C. histogram
You get data from the U.S. Census Bureau on the median household income for your city. Which chart is the better choice to group and display this data? A. pie chart B. bar chart C. histogram D. line chart
82%; 18%
You had to wait 90 minutes in the emergency room of a hospital before you could see a doctor. You learn this wait time was in the 82nd percentile of all wait times. This means that ______ percent of patients had a shorter wait time than you, and only ______ percent had a longer wait time.
mydata.Country
You have loaded some data into the Python variable mydata. Which command would you use to select the single column named Country from this data?
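A short sketch of single-column selection in pandas. The DataFrame contents are hypothetical stand-ins for whatever was loaded into `mydata`; both access styles shown are standard pandas.

```python
import pandas as pd

# Hypothetical stand-in for data loaded from a CSV file
mydata = pd.DataFrame({
    "Country": ["USA", "Canada", "Mexico"],
    "Visits": [3, 5, 2],
})

by_attribute = mydata.Country     # attribute-style access
by_key = mydata["Country"]        # bracket-style access, works for any column name

print(by_attribute.equals(by_key))
```

Bracket access is the safer habit when a column name contains spaces or clashes with a DataFrame method name.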
B. median
You have numeric data with several outliers. Which is the best measure to reflect the average value of the data set? A. mean B. median C. mode D. standard deviation
2
You have stored the following data in a Numpy array named myarr: 5, 2, 4, 6. What would the command myarr[1] return?
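Because NumPy indexes start at zero, index 1 refers to the second element. A quick check using the values from the card:

```python
import numpy as np

myarr = np.array([5, 2, 4, 6])

first_element = myarr[0]    # index 0 is the first value
second_element = myarr[1]   # index 1 is the second value

print(first_element, second_element)
```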
C. paired t-test
You want to determine if the mean weight of your research group is lower after the members have been on a special low-fat diet for two weeks. Which is the best t-test for this research? A. one sample B. independent sample C. paired t-test D. A or B
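A hand-rolled sketch of the paired t statistic, which works on the per-person differences between the two measurements. The weights are invented, and the function name `paired_t_statistic` is only illustrative; a statistics library would normally also report the p value.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """t statistic for paired samples: mean difference over its standard error."""
    diffs = [a - b for a, b in zip(after, before)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)        # sample standard deviation
    se_diff = sd_diff / math.sqrt(len(diffs))
    return mean_diff / se_diff

# Hypothetical weights (lb) for the same five people before and after the diet
before = [200, 185, 210, 190, 175]
after = [196, 183, 204, 189, 172]

t_stat = paired_t_statistic(before, after)
print(round(t_stat, 2))
```

A negative t statistic here reflects lower weights after the diet; whether it is significant depends on comparing it with the critical value for n - 1 degrees of freedom.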
retrospective
You want to determine the rate of postoperative infections in patients who previously had heart valve surgery and were discharged from the hospital. This is a _________ study.
50%
Your data is normally distributed. This implies that approximately ________ percent of your data will be above the mean.
better; better
Your daughter brings home test scores showing that she scored in the 80th percentile in math and the 76th percentile in reading for her grade. This means that your daughter scored ________ than 80 percent of the students in her grade on math and _________ than 76 percent of the students in reading.
Type II
Your null hypothesis is that the new medicine you are testing does not reduce systolic blood pressure. This hypothesis is actually false, but you failed to reject it. What type of error is this?
strong positive
A Pearson's r value of 0.87 indicates a ____________ association.
systematic
A health club is interested in knowing how many times a typical member uses the club in a week. They decide to ask every tenth customer on a specified day to complete a short survey including information about how many times they have visited the club in the past week. What kind of sampling is this?
statistic
A study finds that the mean amount spent on produce per visit to a grocery store by the customers in the sample is $12.84. This is an example of a:
intra-rater reliability
A survey was administered by the same instructor at two times with similar results. This is an example of:
most
Cluster centroids are relocated until they are all in the position that gives them maximum density. This means they have the ___________________ points closest to the cluster centers.