CHAPTER 5-7 TEST

Réussis tes devoirs et examens dès maintenant avec Quizwiz!

What would be the value of the sum of squares due to regression (SSR) if the total sum of squares (SST) is 25.32 and the sum of squares due to error (SSE) is 6.89?

18.43

Which statement is NOT true?

Failing to reject the null hypothesis when it is false is a Type I error

refers to the degree of correlation among independent variables in a regression mode

Multicollinearity

Which of the following regression models is used to model a nonlinear relationship between the independent and dependent variables by including the independent variable and the square of the independent variable in the model?

Quadratic regression model

is a statistical procedure used to develop an equation showing how two variables are related.

Regression analysis

is the data set used to build the candidate models.

Training set

Larger values of α have the disadvantage of increasing the probability of making a

Type I error

refers to the data set used to compare model forecasts and ultimately pick a model for predicting values of the dependent variable

Validation set

can be used to partition observations in a manner to obtain clusters with the least amount of information loss due to the aggregation.

Ward's method

In which of the following scenarios would it be appropriate to use hierarchical clustering?

When binary or ordinal data needs to be clustered

A normally distributed error term with a mean of zero would

allow more accurate modeling

The _____ is an indication of how frequently interval estimates based on samples of the same size taken from the same population using identical sampling techniques will contain the true value of the parameter we are estimating.

confidence level

Single linkage is a measure of calculating dissimilarity between clusters by __

considering only the two most similar observations in the two clusters

A collection of text documents to be analyzed is called a

corpus

Assessing the regression model on data other than the sample data that was used to generate the model is known as

cross-validation

Observation refers to the

set of recorded values of variables associated with a single entity

In a random sample of 400 registered voters, 120 indicated they plan to vote for Trump for President. Determine a 95% confidence interval for the proportion of all the registered voters who will vote for Trump.

(0.25, 0.34)

A statistics teacher started class one day by drawing the names of 10 students out of a hat and asked them to do as many pushups as they could. The 10 randomly selected students averaged 15 pushups per person with a standard deviation of 9 pushups. Suppose the distribution of the population of number of pushups that can be done is approximately normal. What is the standard error of the mean?

2.876

In a simple linear regression analysis the quantity that gives the amount by which the dependent variable changes for a unit change in the independent variable is called the

slope of the regression line

Average linkage is a measure of calculating dissimilarity between two clusters by __

computing the average distance between every pair of observations between two clusters

The _____ is the range of values of the independent variables in the data used to estimate the regression model

experimental region

The process of making a conjecture about the value of a population parameter, collecting sample data that can be used to assess this conjecture, measuring the strength of the evidence against the conjecture that is provided by the sample, and using these results to draw a conclusion about the conjecture is known as

hypothesis testing

The degree of correlation among independent variables in a regression model is called

multicollinearity

Regression analysis involving one dependent variable and more than one independent variable is known as ____

multiple regression

In a simple linear regression model, y = ß0 + ß1x + ε the parameter ß1 represents the

slope of the true regression line

The random numbers generated using Excel's RAND function follows a _____ probability distribution between 0 and 1.

uniform

The goal of _____ is to use the variable values to identify relationships between observations

unsupervised learning

A variable used to model the effect of categorical independent variables in a regression model is known as a

dummy variable

Prediction of the mean value of the dependent variable y for values of the independent variables x1, x2, . . . , xq that are outside the experimental range is called

extrapolation

Prediction of the value of the dependent variable outside the experimental region is called

extrapolation

Complete linkage can be used to measure the distance between clusters that are the _____ in cluster analysis.

most different

A _____ is used to visualize sample data graphically and to draw preliminary conclusions about the possible relationship between the variables.

scatter chart

The least squares regression line minimizes the sum of the

squared differences between actual and predicted y values

A method for modifying variables that reduces bias prior to cluster analysis is

standardization

The process of making estimates and drawing conclusions about one or more characteristics of a population through analysis of sample data drawn from the population is known as

statistical inference

The process of converting a word to its stem, or root word, is referred to as

stemming

The graph of the simple linear regression equation is a(n)

straight line

A popular measure for weighing terms based on frequency and uniqueness is

term frequency times inverse document frequency

The finite correction factor should be used in the computation of the standard deviation of the sample mean and the standard population when n / N is

greater than 0.05

Suppose we had a data set of from a call center where customers were asked to choose between the following three options: hear account information, billing questions, and customer service. Using the given order of the three options, and using 0-1 dummy variables to encode the categorical variables, which of the following combinations would yield an entry "customer service"?

001

refers to the scenario in which the relationship between the dependent variable and one independent variable is different at different values of a second independent variable.

Interaction

is the dissimilarity measure that is more robust to outliers than Euclidean distance.

Manhattan distance

is a measure that computes the dissimilarity between a cluster AB and a cluster C by averaging the distance between A and C and the distance between B and C.

McQuitty's method

refers to the number of times a collection of items occurs together in a transaction data set.

Support count

A pizza shop advertises that they deliver in 30 minutes or less or it is free. People who live in homes that are located on the opposite side of town believe it will take the pizza shop longer than 30 minutes to make and deliver the pizza. A random sample of 50 deliveries to homes across town was taken and the mean time was computed to be 32 minutes. What is the appropriate symbol to represent the value, 32?

X=32

For a population with an unknown distribution, the form of the sampling distribution of the sample mean is

approximately normal for large sample sizes

In interval estimation, as the sample size becomes larger, the interval estimate

becomes narrower

As the number of degrees of freedom for a t distribution increases, the difference between the t distribution and the standard normal distribution

becomes smaller

In the graph of the simple linear regression equation, the parameter ß0 represents the _____ of the true regression line.

end-point

The _____ is a measure of the goodness of fit of the estimated regression equation. It can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation.

coefficient of determination

A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering is known as a

dendrogram

In a linear regression model, the variable that is being predicted or explained is known as _____. It is denoted by y and is often referred to as the response variable

dependent variable

A cluster's _____ can be measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster in a dendrogram.

durability

In the simple linear regression model, the _____ accounts for the variability in the dependent variable that cannot be explained by the linear relationship between the variables.

error term

In a linear regression model, the variable (or variables) used for predicting or explaining values of the response variable are known as the _____. It(they) is(are) denoted by x

independent variable

The coefficient of determination

is used to evaluate the goodness of fit

Fitting a model too closely to sample data, resulting in a model that does not accurately reflect the population is termed as

overfitting

The strength of the association rule is known as _____ and is calculated as the ratio of the confidence of an association rule to the benchmark confidence.

lift

A null and alternative hypothesis for a one proportion z test are given as H0: p = 0.8, Ha: p < 0.8. This hypothesis test is

lower-tailed

Statistical significance at the 0.01 level is _____ than significance at the 0.05 level.

more difficult to achieve

In k-means clustering, k represents the

number of clusters

k-means clustering is the process of

organizing observations into distinct groups based on a measure of similarity

The difference between the observed value of the dependent variable and the value predicted using the estimated regression equation is known as the

residual

A regression analysis involving one independent variable and one dependent variable is referred to as a

simple linear regression

The _____ is a measure of the error that results from using the estimated regression equation to predict the values of the dependent variable in the sample

sum of squares due to error (SSE)

The basis for using a normal probability distribution to approximate the sampling distribution of the sample means and population mean is

the central limit theorem

A procedure for using sample data to find the estimated regression equation is

the least squares method

A parameter is a numerical measure from a population, such as

u

A visual representation of a document or set of documents in which the size of the word is proportional to the frequency with which the word appears is called a __

word cloud

When the mean value of the dependent variable is independent of variation in the independent variable, the slope of the regression line is

zero

A Type I error is committed when

a true null hypothesis is rejected

The CEO of a company wants to estimate the percent of employees that use company computers to go on Facebook during work hours with 95% confidence. He selects a random sample of 150 of the employees and finds that 53 of them logged onto Facebook that day. What is the point estimate of the proportion of the population that logged onto Facebook that day?

0.35

What would be the coefficient of determination if the total sum of squares (SST) is 23.29 and the sum of squares due to regression (SSR) is 10.03?

0.43

A large manufacturing plant has analyzed the amount of time required to produce an electrical part and determined that the times follow a normal distribution with mean time μ = 45 hours. The production manager has developed a new procedure for producing the part. He believes that the new procedure will decrease the population mean amount of time required to produce the part. After training a group of production line workers, a random sample of 25 parts will be selected and the average amount of time required to produce them will be determined. If the switch is made to the new procedure, the cost to implement the new procedure will be more than offset by the savings in manpower required to produce the parts. Use the hypotheses: H0: μ ≥ 45 hours and Ha: μ < 45 hours. If the sample mean amount of time is = 43.118 hours with the sample standard deviation s = 5.5 hours, give the appropriate conclusion, for α = 0.025.

Do not reject H0, do not switch to the new procedure

Which of the following is true of Euclidean distances?

It is commonly used as a method of measuring dissimilarity between quantitative observations.

Which statement is true of an association rule?

It is ultimately judged on how actionable it is and how well it explains the relationship between item sets.

If the Euclidean distance were to be represented in a right triangle, which of the following would be considered the distance between two observations of a cluster?

The hypotenuse

A variable used to model the effect of categorical independent variables in a regression model which generally takes only the value zero or one is called

a dummy variable

It is impossible to construct a sampling frame for an _____ population.

infinite

An estimate of a population parameter that provides an interval of values believed to contain the value of the parameter is known as the

interval estimate

The prespecified value of the independent variable at which its relationship with the dependent variable changes in a piecewise linear regression model is referred to as the

knot

A _____ is an interval estimate of an individual y value, given values of the independent variables.

prediction interval

The value of the _____ is used to estimate the value of the population parameter

sample statistic

One reason a sample may fail to represent the population of interest is _

sampling error

is used to test the hypothesis that the values of the regression parameters ß1, ß2, ... ßq are all zero.

An F test

The population parameters that describe the y-intercept and slope of the line relating y and x, respectively, are

B0 and B1

The owners of a fast food restaurant have automatic drink dispensers to help fill orders more quickly. When the 12 ounce button is pressed, they would like for exactly 12 ounces of beverage to be dispensed. There is, however, some variation in this amount. The company does not want the machine to systematically over fill or under fill the cups. Which of the following gives the correct set of hypotheses?

H0: u = 12, Ha: u ≠ 12

refers to the use of sample data to calculate a range of values that is believed to include the value of the population parameter.

Hypothesis testing

Determine whether the alternative hypothesis is left-tailed, right-tailed, or two-tailed: H0: μ = 11, Ha: μ > 11

Right-tailed


Ensembles d'études connexes

History; Challenges Today: Technology

View Set

Ezra, Nehemiah, Esther Study Guide

View Set

Promulgated contracts practice exam

View Set

Preassessment Practice Questions

View Set

CompTIA Security+ Cert Prep 1: Threats, Attacks, and Vulnerabilities

View Set

10. AWS CCP Knowledge Review - Shared Responsibility Model

View Set

AZ-900: Microsoft Azure Fundamentals (Q&A Series - 2023)

View Set

Fin 341 Chapter 5 and rest of section 3

View Set