MGMT 473

Equation: Salary = 44.0073 + 6.6227GPA + 6.6071MIS + 6.7309Statistics. What is the additional salary a graduate would earn with an MIS degree?

$6,607
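
A minimal Python sketch of why the answer equals the MIS coefficient. It assumes Salary is measured in thousands of dollars (implied by the $6,607 answer, not stated in the equation) and that MIS and Statistics are 0/1 dummy variables; the 3.5 GPA is a made-up value.

```python
# Coefficients copied from the equation in the question.
INTERCEPT, B_GPA, B_MIS, B_STAT = 44.0073, 6.6227, 6.6071, 6.7309


def predicted_salary(gpa, mis, statistics):
    """Predicted salary (assumed to be in $1,000s); mis and statistics are 0/1 dummies."""
    return INTERCEPT + B_GPA * gpa + B_MIS * mis + B_STAT * statistics


# Hypothetical graduate: GPA of 3.5, no Statistics degree.
with_mis = predicted_salary(3.5, 1, 0)
without_mis = predicted_salary(3.5, 0, 0)

# Holding GPA and Statistics fixed, the difference is just the MIS coefficient.
mis_premium = with_mis - without_mis
print(round(mis_premium * 1000))  # dollars, under the $1,000s assumption
```

The GPA value drops out of the difference, so any GPA gives the same premium.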

What are the 6 phases of CRISP-DM?

1. Business understanding 2. Data understanding 3. Data preparation 4. Modeling 5. Evaluation 6. Deployment

What are the 5 steps of SEMMA?

1. Sample 2. Explore 3. Modify 4. Model 5. Assess

According to interviews and expert estimates, analytics professionals spend from ________________ of their time in the mundane task of collecting and preparing unruly data, before analytics can be applied (The New York Times, August 17, 2014).

50 - 80%

Calculate the Mean Absolute Deviation for the following data: We have observed the age of 3 individuals in a study, where the mean age is 40. The observed ages were 31, 40, and 49. What is the MAD?

6 Reason: MAD = (|31 - 40| + |40 - 40| + |49 - 40|)/3 = (9 + 0 + 9)/3 = 6
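
A short Python check of the MAD calculation, using the three ages from the question. It also shows that the absolute value has to be taken of each deviation separately: the signed deviations always sum to zero.

```python
ages = [31, 40, 49]
mean_age = sum(ages) / len(ages)

# Mean absolute deviation: average of the absolute deviations from the mean.
mad = sum(abs(age - mean_age) for age in ages) / len(ages)

# Signed deviations cancel out, so they cannot be summed first.
signed_sum = sum(age - mean_age for age in ages)

print(mean_age, mad, signed_sum)
```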

Which of the following is not a common approach for transforming categorical data? A. Mathematical transformation B. Dummy variables C. Category reduction D. Category scores

A. Mathematical transformation

Which of the following are reasons for data professionals to learn data wrangling skills? Analytics professionals can no longer rely on the IT department to provide data; Analytics professionals are superior to all other IT professionals; Organizations will be able to make decisions more rapidly; Analytics professionals need broader skill sets than data mining techniques

Analytics professionals can no longer rely on the IT department to provide data; Organizations will be able to make decisions more rapidly; Analytics professionals need broader skill sets than data mining techniques

What is the term used to describe computer systems that demonstrate human-like intelligence and cognitive functions, such as deduction, pattern recognition, and the interpretation of complex data?

Artificial intelligence

When selecting the cutoff values for performance measures, in some applications, the analyst may choose to increase or decrease the cutoff value to classify fewer or more observations into the target class. What are some reasons for doing this? Select all that apply. Uneven class distributions; Asymmetric misclassification costs; Personal preference; Even class distributions

Asymmetric misclassification costs; Uneven class distributions

Recall Organic Food Superstore from the introductory case. In that case, an Entity Relationship Diagram (ERD) for the store illustrates three entities: CUSTOMER, ORDER, and PRODUCT. The relationship between CUSTOMER and ORDER entities is 1:M because: A. An order can contain ONLY one product B. Each order can only belong to one customer C. A customer places many orders over time

B. Each order can only belong to one customer

Recall that we use nominal and ordinal measurement scales to represent categorical variables. Which of the examples below represent a nominal scale representation of a categorical variable? A. Performance of a manager (excellent, good, fair, poor). B. The temperature of the resort location C. Marital status (single, married, widowed, divorced, separated) D. Profit and inventory level of a distribution center

C. Marital status (single, married, widowed, divorced, separated)

Examples of transforming numerical data include: Calculating percentages; Individual's date of birth to age; Combining height and weight to create body mass index; There is no need to transform data

Calculating percentages; Individual's date of birth to age; Combining height and weight to create body mass index

If we have a third variable in the data set that is categorical, we can plot the two numerical variables and then add the third categorical variable. This scatter plot is called a scatter plot with a '________________' variable.

Categorical

When conducting data mining analysis, practitioners generally adopt either the ____________ or ___________

Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology or the Sample, Explore, Modify, Model, and Assess (SEMMA) methodology

What is the name of the chart that shows the improvement that a predictive model provides over a random selection in capturing the target class cases?

Cumulative lift chart

What are the most popular performance charts?

Cumulative lift chart; Decile-wise lift chart; Receiver operating characteristic (ROC) curve

Unsupervised data mining techniques are especially effective for:

Data exploration; Dimension reduction; Pattern recognition

What is the name of the chart that shows the improvement that a prediction model provides over a random selection but presents the information in 10 equal-sized intervals (e.g. every 10% of the observations)?

Decile-wise lift chart

A relationship between CUSTOMER and ORDER entities would be 1:M because

Each order can only belong to one customer

Which measure, one of the most widely used for evaluating similarity with numerical variables, is defined as the length of a straight line between two observations?

Euclidean distance
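
A small Python sketch of the straight-line (Euclidean) distance, with the Manhattan distance alongside for comparison. The two observations are made-up points, and the function names are illustrative.

```python
import math


def euclidean(a, b):
    """Length of the straight line between observations a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences along each variable."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical observations measured on two numerical variables.
p, q = (1, 2), (4, 6)
print(euclidean(p, q), manhattan(p, q))
```

The differences here are 3 and 4, so the Euclidean distance is the hypotenuse of a 3-4-5 triangle, while the Manhattan distance adds the two legs.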

In the semi-log regression model not all variables are transformed into logs. The semi-log model that transforms only the response variable is often called:

Exponential regression model

True or false: R^2 can decrease as we add more predictor variables to the linear regression model

False

True or false: The greater the k, the lesser the reliability of the k-fold method and the greater will be its computational cost

False
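
A rough Python sketch of k-fold partitioning, to illustrate the computational-cost half of the statement: the model is re-estimated once per fold, so the number of fits equals k. The function name `k_fold_splits` and the interleaved fold assignment are illustrative assumptions, not a particular library's method; in practice the observations would usually be shuffled first.

```python
def k_fold_splits(n, k):
    """Yield (training, validation) index lists, one pair per fold."""
    # Assign observation i to fold i % k (interleaved, no shuffling).
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        # Training set = every fold except the one held out for validation.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_splits(n=10, k=5))
for training, validation in splits:
    print(validation, training)
```

Each observation lands in exactly one validation fold, and each fold requires one model fit.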

In the case of a dummy variable categorizing a person's gender, we can define 1 for male and 0 for female. In this case, what would the reference category be?

Female

The formula for calculating the Matching coefficient is (the number of variables with matching outcomes)/(total number of variables). The _________ the value of the matching coefficient, the more similar the two observations are

Higher

Another commonly used transformation that captures nonlinearities is based on the natural logarithm. Which of the following variables are commonly log-transformed? Select all that apply. House prices; Age; Scores; Income

House prices; Income

Important risk factors for high blood pressure reported by the National Institute of Health include weight and ethnicity. High blood pressure is common in adults who are overweight and are African American. A public policy researcher in Atlanta surveyed 150 adult men about 5′10″ in height and in the 55-60 age group. Data were collected on their systolic pressure, weight (in pounds), and race (Black = 1 for African American, 0 otherwise). The resulting regression equation, which includes the interaction between weight and race, is: Systolic = 70.8312 + 0.4362Weight + 30.2482Black − 0.1118(Weight × Black). The interaction variable is negative and statistically significant at the 5% level. Interpret what a negative interaction implies in this example:

It implies that Black men carry their weight better in terms of systolic pressure than their non-Black counterparts.
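
A brief Python sketch of what the negative interaction does to the weight slope, using the coefficients from the equation above. The 200-pound starting weight is an arbitrary illustrative value.

```python
def systolic(weight, black):
    """Predicted systolic pressure; black is the 0/1 dummy from the question."""
    return 70.8312 + 0.4362 * weight + 30.2482 * black - 0.1118 * weight * black


# Effect of one extra pound, holding race fixed (200 lb is a made-up baseline).
slope_non_black = systolic(201, 0) - systolic(200, 0)
slope_black = systolic(201, 1) - systolic(200, 1)

print(round(slope_non_black, 4), round(slope_black, 4))
```

The partial effect of weight is 0.4362 when Black = 0 and 0.4362 − 0.1118 = 0.3244 when Black = 1, so each additional pound is associated with a smaller rise in predicted systolic pressure for Black men.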

What is the term in regression models when a predictor variable has a different partial effect on the outcome depending on the values of another predictor variable?

Interaction effect

Euclidean and Manhattan distance are suitable for numerical variables. What similarity measures should we use for categorical and binary data?

Matching coefficient; Jaccard's coefficient
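
A minimal Python sketch of both measures for binary data. The matching coefficient counts every agreement; Jaccard's coefficient, as sketched here, ignores the 0-0 matches and so only credits shared 1s. The two observations are made up, and the function names are illustrative.

```python
def matching_coefficient(a, b):
    """(number of variables with matching outcomes) / (total number of variables)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def jaccard_coefficient(a, b):
    """1-1 matches divided by the variables that are not 0-0 matches.

    Assumes at least one variable is 1 in either observation.
    """
    both_one = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    both_zero = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return both_one / (len(a) - both_zero)


# Hypothetical observations on five binary variables.
a = [1, 0, 0, 1, 0]
b = [1, 0, 0, 0, 0]
print(matching_coefficient(a, b), jaccard_coefficient(a, b))
```

With three 0-0 matches out of five variables, the matching coefficient looks high while Jaccard's coefficient is lower, which is the behavior wanted when negative outcomes matter less than positive ones.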

The logistic regression model cannot be estimated with standard ordinary least squares (OLS) procedures. Instead, we rely on which method?

Maximum likelihood estimation (MLE)

There are numerous applications where the relationship between the predictor variable and the response variable cannot be represented by a straight line and, therefore, must be captured by an appropriate curve. What are some simple transformations of the variables for nonlinear relationships? Select all that apply. Natural logarithms; Goodness of fit; Squares; Dummy variables

Natural logarithms; Squares

Which of the following is true of a data warehouse? It is a small-scale data warehouse or a subset of the enterprise data warehouse that focuses on one particular subject or decision area; One of its primary purposes is to support decision making; It can be designed to support the marketing department for analyzing customer behaviors, and it contains only the data relevant to such analyses; Data in a data warehouse are usually organized around subjects such as sales, customers, or products that are relevant to business decision making

One of its primary purposes is to support decision making; Data in a data warehouse are usually organized around subjects such as sales, customers, or products that are relevant to business decision making

An effective strategy for dealing with these issues is category reduction, where we collapse some of the categories to create fewer nonoverlapping categories. The first guideline states that categories with very few observations may be combined to create the ________________ category

Other

_____________ occurs when a predictive model is made overly complex to fit the quirks of given sample data. By making the model conform too closely to the sample data, its predictive power is compromised.

Overfitting

Examples of the use of prediction models include:

Predict the selling price of a house; Predict the spending of a customer

Which nonlinear regression model is appropriate when the slope, capturing the influence of x on y, changes in magnitude as well as sign?

Quadratic regression model

Which tool shows the sensitivity and specificity measures across all cutoff values and how accurately the model is able to classify both target and non target class cases overall?

ROC curve

Which of the following are reasons for missing values in data? Select all that apply. Respondents always provide all the requested information; Respondents decline to provide the information due to its sensitive nature; There are never missing values in data; Some of the questions do not apply to every respondent

Respondents decline to provide the information due to its sensitive nature; Some of the questions do not apply to every respondent

The most popular query language used today is ___________________. This popular query language is used for manipulating data in a relational database using relatively simple and intuitive commands.

SQL

The basic structure of a SQL statement is relatively simple and usually consists of three keywords. Which of the following are SQL keywords? Select all that apply. Where; Choose; Select; From

Select; Where; From
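
A small sketch of the three keywords in use, run through Python's built-in `sqlite3` module against an in-memory database. The `customer` table, its columns, and its rows are all hypothetical; `ORDER BY` is added only to make the result order predictable.

```python
import sqlite3

# In-memory database with a made-up customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer (name, city) VALUES (?, ?)",
    [("Ana", "Austin"), ("Ben", "Boston"), ("Cy", "Austin")],
)

# SELECT picks the columns, FROM names the table, WHERE filters the rows.
rows = conn.execute(
    "SELECT name FROM customer WHERE city = ? ORDER BY name",
    ("Austin",),
).fetchall()
conn.close()

print(rows)
```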

________________ measures gauge whether a group of observations are similar or dissimilar to one another

Similarity

_________________ data also allows us to review the range of values for each variable.

Sorting

This difference in scale distorts the true distance between observations and can lead to inaccurate results. It is common, therefore, to make the observations unit-free. How is this accomplished?

Standardizing; Normalizing
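
A minimal Python sketch of the two approaches, assuming "standardizing" means converting to z-scores with the sample standard deviation and "normalizing" means min-max scaling to the 0-1 range. The income values are made up.

```python
import statistics


def standardize(values):
    """z-scores: (value - sample mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def normalize(values):
    """Min-max scaling: (value - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical incomes in dollars; the results carry no units.
incomes = [30000, 50000, 70000]
print(standardize(incomes))
print(normalize(incomes))
```

Either way, a variable measured in dollars and one measured in years end up on comparable scales before distances are computed.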

In order to select the preferred model, we examine several goodness-of-fit measures. Which ones?

The coefficient of determination; The standard error of the estimate; The adjusted coefficient of determination

What is the linear regression model applied to a binary response variable called?

The linear probability regression model

What is true of the cross-validation method?

The sample is partitioned into a training set and a validation set to assess how well the estimated model predicts with unseen data; The holdout method is a cross-validation method; The k-fold cross-validation method is a cross-validation method

What is used to evaluate how well the sample regression equation fits the data?

The standard error of the estimate; The coefficient of determination, R^2

We use analysis of variance (ANOVA) in the context of the linear regression model to derive R^2. We denote the total variation in y as Σ(y_i − ȳ)^2, which is the numerator in the formula for the variance of y. What is this total variation called?

Total sum of squares
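
A short Python sketch of how the total sum of squares (SST) leads to R^2. The observed values and fitted values below are made up; the fitted values are taken from a least-squares line for these points, which is why the explained and unexplained parts add up to SST.

```python
# Hypothetical observed values and fitted values from a least-squares line.
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation in y
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # unexplained variation
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)            # explained variation

r_squared = 1 - sse / sst
print(sst, round(sse, 4), round(ssr, 4), round(r_squared, 4))
```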

As a common practice, in data partitioning in the oversampling technique, which data set is oversampled?

Training data set

True or false: A relational database consists of one or more logically related data files, where each data file is a two-dimensional grid that consists of rows and columns.

True

True or false: In a business setting, we might use a 1:1 relationship to describe a situation where each department can have only one manager and each manager can only manage one department.

True

True or false: In the holdout method, the sample data is partitioned into two independent and mutually exclusive data sets-the training set and the validation set

True

The target (success) class is Class 1; the non-target class is Class 0. In the confusion matrix there are 4 possible outcomes. When a Class 1 observation is correctly classified by the model, what would it be called?

True positive (TP)
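
A minimal Python sketch that tallies the four confusion-matrix outcomes and derives the accuracy rate, sensitivity, and specificity from them. The actual and predicted class labels are made up, and the function name is illustrative.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FN, TN, FP) with Class 1 as the target class."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1   # target case correctly classified
        elif a == 1 and p == 0:
            fn += 1   # target case missed
        elif a == 0 and p == 0:
            tn += 1   # non-target case correctly classified
        else:
            fp += 1   # non-target case flagged as target
    return tp, fn, tn, fp


# Hypothetical labels for six observations.
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]

tp, fn, tn, fp = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)   # correct predictions / total observations
sensitivity = tp / (tp + fn)         # share of Class 1 cases caught
specificity = tn / (tn + fp)         # share of Class 0 cases caught
print(tp, fn, tn, fp, accuracy, sensitivity, specificity)
```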

Which of the following are the very first tasks most data analysts perform to gain a better understanding and insights into the data? Sorting the data; Counting the data; Copying the data; Visually reviewing

Visually reviewing; Counting the data; Sorting the data

When is the RMSE performance measure most desirable?

When large errors are particularly undesirable
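
A small Python sketch of why RMSE suits that situation: squaring the errors before averaging makes one large miss count for more than several small ones. The two sets of prediction errors are made up, and MAD is used as the comparison measure.

```python
import math


def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mad(errors):
    """Mean absolute deviation of the errors."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical prediction errors with the same total absolute error.
even_errors = [2, 2, 2, 2]
one_big_error = [0, 0, 0, 8]

print(mad(even_errors), mad(one_big_error))
print(rmse(even_errors), rmse(one_big_error))
```

MAD rates the two models the same, while RMSE doubles for the model with the single large error.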

Oftentimes, a categorical variable is defined by more than two categories. For example, the mode of transportation used to commute may be described by three categories: Public Transportation, Driving Alone, and Car Pooling. Given k categories of a variable, the general rule is to create how many dummy variables?

k - 1
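
A minimal Python sketch of the k − 1 rule using the commute example from the question. The helper `make_dummies` is hypothetical, and the choice of Car Pooling as the reference category is arbitrary.

```python
def make_dummies(values, reference):
    """Create one 0/1 dummy per category, skipping the reference category."""
    categories = sorted(set(values) - {reference})
    return [{c: int(v == c) for c in categories} for v in values]


commute = ["Public Transportation", "Driving Alone", "Car Pooling", "Driving Alone"]
rows = make_dummies(commute, reference="Car Pooling")

for mode, row in zip(commute, rows):
    print(mode, row)
```

Three categories produce two dummy variables; an observation in the reference category is the one where both dummies equal 0.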

When comparing models with the same response variable, we prefer the model with a smaller Se. A smaller Se implies that there is _____________ dispersion of the observed values from the predicted values

less

Data ______________________ is a process that an organization uses to acquire, organize, store, manipulate, and distribute data

management

In addition to binning, another common approach is to create new variables through ________________ transformations of existing variables.

mathematical

Data ____________________ is the process of defining the structure of a database.

modeling

The z-score measures the distance of a given observation from the sample mean in terms of standard deviation. The z-score is an example of making observations unit free. This is an example of ________________ data

numerical

There are two common strategies for dealing with missing values: ___________________ & ___________________.

omission and imputation

Data _________ is the process of dividing a data set into a training, validation, and in some situations, an optional test data set

partitioning

It is important to develop ____________ measures that evaluate how well an estimated model will perform in an unseen sample, rather than making the evaluation solely on the basis of the sample data used to build the model

performance

The formula for the variance differs depending on whether we have a sample or a ______________.

population

The accuracy rate is calculated as the number of correct predictions divided by the _____________ number of observations

total

Data ____________________ is the data conversion process from one format or structure to another

transformation

True or false: Data in a data mart are organized using a multidimensional data model called a star schema, which includes dimension and fact tables.

True

A large lecture class has 280 students. Mean score of exam: 74. Standard deviation of exam: 8. Distribution of scores is bell-shaped. How many standard deviations above the mean would a score of 90 be?

z-score = (observed value - sample mean)/sample standard deviation = (90 - 74)/8 = 2

R^2 measures the percentage of sample variations of the response variable explained by the model. What is true when comparing linear and log-transformed regression models?

We need to compute the percentage of explained variations of y; We cannot compare the percentage of explained variations of y with that of ln(y)

Which of the following is NOT a correct statement about entity-relationship diagram (ERD) attributes? A. A primary key is an attribute that uniquely identifies each instance of an entity B. The relationships between entities can only be one-to-many C. An entity is a generalized category to represent persons, places, things, or events. D. A foreign key is the primary key of a related entity.

B. The relationships between entities can only be one-to-many

Finally, another common transformation of categorical variables is to create category ________________.

scores

If a linear regression model uses only one predictor variable, then the model is referred to as a ______________ linear regression model

simple

The _____________ coefficient measures the degree to which a distribution is not symmetric about its mean.

skewness

For 0 < B1 < 1, the log-log regression model implies a positive relationship between x and E(y); as x increases, E(y) increases at a _________ rate

slower

In situations where negative outcomes are not as important as positive outcomes, what is a more appropriate measure of similarity? a. Euclidean coefficient b. Manhattan coefficient c. Jaccard's coefficient d. Matching coefficient

c. Jaccard's coefficient

The key distinction between supervised and unsupervised data mining techniques is supervised data mining is: a. Effective for data exploration b. Effective for dimension reduction c. Effective for developing predictive models d. Effective for pattern recognition

c. Effective for developing predictive models

The example of momentum p is the product of the mass m and the velocity v of an object; that is, p = mv is an example of a _________ relationship

deterministic

A ______________ variable, also referred to as an indicator or a binary variable, is commonly used to describe two categories of a variable.

dummy

