BSAN 160 exam review


In text mining, tokenizing is the process of _________________.

breaking a text into simple units, like sentences or words

Confidence for Milk and Juice (that is, the probability P(Y|X) where X = Milk and Y = Juice) is ___________

50%

Using the complete linkage criteria, after creating the first cluster, the next step would be:

Create a cluster that includes {Kristen and Ben}, or add Kristen to the cluster {Dennis and Ross}

Which chart should Mia use to visualize the relative proportion of market share of Peloton in 2020 compared to its competitors Nordic Track, Myx Fitness, and Echelon?

Pie chart

Let's assume we are confronted with a one-stage decision problem that has 3 decision alternatives: Decision Alternative 1, 2 and 3, where each Decision Alternative has possible outcomes and probabilities associated with possible outcomes. We use the Expected Monetary Value (EMV) and pick Decision Alternative 2. Assume we conduct sensitivity analysis for this decision model. Which of the following statements is incorrect?

Using sensitivity analysis, the selected decision cannot change if we use the same decision criterion (e.g., EMV-maximizer)

"In decision modeling, using the Expected Monetary Value (EMV) criterion guarantees the best outcome."

False

Table 2. Confusion matrix (columns: true/observed class; rows: predicted class)
                                 True: Positive (Patient is healthy)   True: Negative (Patient is sick)
Predicted: Positive (healthy)    80                                    20
Predicted: Negative (sick)       40                                    60
QUESTION 4. What is the false positive (FP) count?

20

How many mistakes (misclassifications) did the model make?

60
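
The two answers above can be checked directly from the Table 2 counts. This is a minimal sketch with illustrative variable names; "positive" means "patient is healthy", as in the table.

```python
# Counts read from Table 2 (rows = predicted class, columns = true class).
tp = 80  # predicted healthy, truly healthy
fp = 20  # predicted healthy, truly sick
fn = 40  # predicted sick, truly healthy
tn = 60  # predicted sick, truly sick

total = tp + fp + fn + tn
misclassified = fp + fn               # every off-diagonal cell is a mistake
accuracy = (tp + tn) / total          # share of correct predictions
true_positive_rate = tp / (tp + fn)   # also called recall or sensitivity
true_negative_rate = tn / (tn + fp)   # also called specificity
precision = tp / (tp + fp)            # correct share of positive predictions

print("misclassified:", misclassified)
print("accuracy:", accuracy, "precision:", precision)
print("TPR:", round(true_positive_rate, 3), "TNR:", round(true_negative_rate, 3))
```

With these counts the true positive rate (80/120) comes out lower than the true negative rate (60/80).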

Which of the following is/are element(s) of decision models under uncertainty? A: Decision alternative B: Outcome variable C: Decision criterion

All of the above (A, B, and C)

In data mining, finding an affinity of two products to be commonly purchased together is known as

Association rule mining

Which of the following is/are predictive analytics method(s)? A) Boxplot B) Text analysis C) Simulation D) Regression analysis E) Clustering. Options: B, D and E; B, C and E; D and E

B, D, E

Which chart type below would be most helpful to compare the worldwide turnover rate with the tech sector turnover rate? Options: Line chart; Histogram; Bar chart; Scatterplot

Bar chart

Using the complete linkage criteria, after creating the first cluster, the next step would be:

Create a cluster that includes {Sam and Max}

Which of the following is a segmentation model that classifies the items in a dataset based on pairwise distances between these items until every observation is linked into one large group?

Hierarchical clustering

Using the Expected Monetary Value (EMV) criterion, which decision alternative should we choose?

Indifferent between Decision alternative 1 and Decision alternative 2

What are the decision variables in this model?

Number of snacks, drinks, sun protection items, and clothing items

Which of the following statements is incorrect?

Perfect classification is represented by AUC = 0.5.

Decision modeling is a __________ analytics method.

Prescriptive

Given a student was retained, the number of times the model predicted a student's retention correctly is called ____________. The value of this measure in this example is ___________.

Recall, 2/23

The Bag-of-Words method uses ____________ to extract feature from textual data.

Word frequencies in a text
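
A minimal sketch of tokenizing and Bag-of-Words using only the Python standard library. The helper names, the regular expression, and the example sentence are illustrative assumptions, not part of the course material.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text):
    """Map each distinct token to the number of times it occurs."""
    return Counter(tokenize(text))

doc = "The rain in Spain stays mainly in the plain"  # made-up example sentence
print(tokenize(doc))
print(bag_of_words(doc))
```

Word order is discarded: only the frequency of each token is kept as a feature.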

Descriptive Analytics

helps managers understand current events in the organization, including causes, trends, and patterns.

Which of the below statement(s) is/are correct? A: Information dashboards provide interactive visual displays of important information, so that the level of granularity of key insights can be modified by drilling in or moving out for more or less exploration. B: Visual analytics combines data visualization and different analytics methods such as descriptive, predictive and prescriptive analytics. C: Interactive information dashboards provide key insights as static information that focuses on better understanding of what happened.

A and B

Which of the following data preprocessing activities that Mia conducts is not associated with data cleaning?

Derive a new variable representing total time of class material from existing variables

T or F: Original (raw) data is usually collected from multiple data sources including various formats, and it is readily usable by analytics tools and algorithms

False

T or F: "Linear regression analysis can be used to predict an unknown value of a dependent variable using the values of a set of numeric and/or categorical independent variables."

True

T or F: "Based on the distance values in Table 1, the first cluster in a hierarchical clustering model will include {Joe and Hannah}."

True

Which of the below is a method for filling in missing values in data? Options: data cleaning; data reduction; data smoothing; data imputation

data imputation

In evaluating a two-class classification model, the accuracy is __________________.

the sum of correctly classified positives and correctly classified negatives divided by the sum of all positive (true and false) and negative (true and false) counts.

Assume there are two students, Neal and Jimmy, with the same high school GPA percentile and gender, and Neal's SAT score is one point higher than Jimmy's. Using the regression model shown in Question 3, Neal's predicted combined score would be _______.

0.06 higher than Jimmy's combined score

Consider the following decision problem. Lumos Company owns a yoga studio in Philadelphia. Because their current studio can only accommodate 20 people at a time, they experience challenges on days when more people show up to take classes. Lumos realized that they can potentially generate more net profit if they expand (i.e., open another yoga studio). Right now, in March 2021, they have to decide whether to expand or stay in the current location.
• If they decide to stay in the current location, by the end of 2021 their revenue will be $120,000 with certainty and the cost (associated with business maintenance) will be $20,000 with certainty.
• If they decide to expand, the cost (associated with business maintenance plus expansion cost) will be $50,000 with certainty. In case of expansion, the demand for yoga classes could increase with probability 0.6 or decrease with probability 0.4. If they expand and demand increases, by the end of 2021 their revenue will be $250,000. If they expand and demand decreases, by the end of 2021 their revenue will be $50,000.
(Keep in mind: Net profit = Revenue - Cost.)
Question 1. Compute the Expected Monetary Value (EMV) of net profit for decision alternative "stay in current location". Enter the numeric EMV value in the box below (do not use a dollar sign, do not use decimal points, do not use a comma separator for thousands).

100,000

Compute the Expected Monetary Value (EMV) of net profit for decision alternative "expand". Enter the numeric EMV value in the box below (do not use a dollar sign, do not use decimal points, do not use a comma separator for thousands).

120,000
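
The two EMV answers for Lumos follow from weighting each net profit by its probability. The sketch below uses the revenues, costs, and probabilities from the problem statement; the dictionary layout and function names are illustrative choices.

```python
# Each alternative is a list of (probability, revenue, cost) outcomes,
# taken from the Lumos problem statement.
alternatives = {
    "stay": [(1.0, 120_000, 20_000)],
    "expand": [(0.6, 250_000, 50_000), (0.4, 50_000, 50_000)],
}

def net_profits(outcomes):
    """Net profit = revenue - cost, paired with its probability."""
    return [(p, revenue - cost) for p, revenue, cost in outcomes]

def emv(outcomes):
    """Expected monetary value: probability-weighted average net profit."""
    return sum(p * profit for p, profit in net_profits(outcomes))

for name, outcomes in alternatives.items():
    profits = [profit for _, profit in net_profits(outcomes)]
    print(name, "EMV:", round(emv(outcomes)),
          "worst:", min(profits), "best:", max(profits))
```

The same loop also prints the worst and best net profit per alternative, which is what the "best worst outcome" and "best best outcome" questions about Lumos ask for.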

Consider the item set in Figure 1 (with 6 items: eggs, beer, juice which is represented by the red glass, milk, diapers, and bread) in a dataset with 5 transactions where each row in the dataset is a shopping purchase transaction. For example, transaction 1 is the first row in the dataset and includes the purchase of bread and milk, transaction 2 is the second row and includes the purchase of bread, diapers, beer, and eggs, etc. Figure 1. Itemset and a dataset derived from shopping data. Support for Diapers and Beer being purchased together, i.e., P(X&Y) where X = Diapers and Y = Beer, is ________

60%
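
Support and confidence can be sketched in a few lines. Figure 1 is not reproduced in this set, so only transactions 1 and 2 below come from the text; transactions 3 to 5 are assumed purely for illustration, chosen to be consistent with the answers given in this set.

```python
# Transactions 1 and 2 are described in the question text.
# Transactions 3-5 are ASSUMED for illustration (Figure 1 is not available).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "juice"},    # assumed
    {"bread", "milk", "diapers", "beer"},    # assumed
    {"bread", "milk", "diapers", "juice"},   # assumed
]

def support(items):
    """Fraction of transactions that contain every item: P(X & Y)."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """P(Y | X) = support(X and Y together) / support(X)."""
    return support(set(x) | set(y)) / support(x)

print("support(diapers, beer):", support({"diapers", "beer"}))
print("confidence(milk -> juice):", confidence({"milk"}, {"juice"}))
```

Support is a joint probability over all transactions, while confidence is conditional on the left-hand item being in the basket.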

Comparing two regression models (Model 1 and Model 2) developed using the same dataset, assume Model 1 has an R-squared of 0.58 and Model 2 has an R-squared of 0.79. Which of the following statement(s) is/are correct? A: Model 2 describes 79% of the variation in the given data B: Comparing both models and how well they explain the variation in the given data, Model 1 is a better fit compared to Model 2 C: The independent variables used in Model 1 do not capture 42% of the variation in the given data

A and C

Using the single linkage criteria, after creating the first cluster, the next step would be

Add Ben to the cluster {Dennis and Ross}

Using the single linkage criteria, after creating the first cluster, the next step would be:

Add Sam to the cluster {Joe and Hannah}

"One of the characteristics of a data warehouse is that it is non-volatile. That means, _______________________________________."

After data are entered into a data warehouse, previous data is not erased when new data is added to it

Which decision would you recommend if Lumos wants to choose the decision by using a weighted average of the possible payoffs - in other words uses the Expected Monetary Value (EMV) and picks the decision with the largest EMV with regards to net profit?

Choose to expand

Use the following description for Questions 4-6. Imagine we have a decision problem where we are asked to choose between two decision alternatives. Decision alternative 1 can result in a payoff of 10000 with probability 0.4 or a loss of 4000 with probability 0.6. Decision alternative 2 results in a payoff of 2000 with probability 0.5 or a payoff of 1200 with probability 0.5. If we look at the worst possible outcome for each decision alternative and choose the decision that has the best "worst outcome", which decision alternative should we choose?

Decision alternative 2
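
The three decision criteria asked about for these two alternatives (EMV, best "worst outcome", best "best outcome") can be compared side by side. A minimal sketch using the payoffs and probabilities from the description; the function names are illustrative.

```python
# (probability, payoff) pairs from the description; a loss is a negative payoff.
alt1 = [(0.4, 10_000), (0.6, -4_000)]
alt2 = [(0.5, 2_000), (0.5, 1_200)]

def emv(outcomes):
    """Probability-weighted average payoff."""
    return sum(p * payoff for p, payoff in outcomes)

def worst(outcomes):
    """Worst possible payoff (used by the best-worst-outcome criterion)."""
    return min(payoff for _, payoff in outcomes)

def best(outcomes):
    """Best possible payoff (used by the best-best-outcome criterion)."""
    return max(payoff for _, payoff in outcomes)

for name, alt in [("alternative 1", alt1), ("alternative 2", alt2)]:
    print(name, "EMV:", round(emv(alt)), "worst:", worst(alt), "best:", best(alt))
```

Both EMVs come out at 1600, so an EMV-maximizer is indifferent, while the worst-case criterion favors alternative 2 and the best-case criterion favors alternative 1.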

When querying a dimensional database, a user goes from summarized data (e.g., quarters) to its underlying details (e.g., months). The OLAP function that serves this purpose is:

Drill Down

Consider the following decision problem. You are preparing your backpack for a hike and need to decide how many items of different categories to pack. There are four item categories: snacks, drinks, sun protection items, and clothing items. Each snack, drink, sun protection item, and clothing item has a weight. For example, each snack weighs 60 grams, each water bottle weighs 25 grams, each sun protection item weighs 20 grams, and each clothing item weighs 400 grams. You can't carry more than 1600 grams in your backpack. You must have at least 3 snacks, at least 4 water bottles, and you can have at most 2 sunscreen items in the backpack for your hike. Your goal is to minimize the total weight of your backpack while satisfying all the constraints. QUESTION 3. Is the following statement true or false? "This is a linear programming model where the objective function is a maximization."

False

T or F: "In Association Rule Mining, confidence is a metric that represents the probability of observing the items A and B together in a given dataset, P(A&B)."

False

T or F: Decision support systems are computer-based support systems that integrate individuals' expertise and computer capabilities, and they have precise definitions agreed to by practitioners.

False

Which of the following is not a linkage criterion used in clustering models?

K-fold linkage
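
Single and complete linkage (both used in this set) and average linkage differ only in how they turn pairwise distances into a cluster-to-cluster distance. The sketch below uses hypothetical distances between four generic items A to D; these are not the values from the course's Table 1.

```python
# HYPOTHETICAL pairwise distances between four items (not the course's Table 1).
dist = {
    frozenset("AB"): 1.0,
    frozenset("AC"): 4.0,
    frozenset("AD"): 6.0,
    frozenset("BC"): 3.0,
    frozenset("BD"): 7.0,
    frozenset("CD"): 3.5,
}

def linkage(c1, c2, criterion):
    """Distance between two clusters under the given linkage criterion."""
    pair_distances = [dist[frozenset((a, b))] for a in c1 for b in c2]
    if criterion == "single":
        return min(pair_distances)    # closest pair of members
    if criterion == "complete":
        return max(pair_distances)    # farthest pair of members
    return sum(pair_distances) / len(pair_distances)  # average linkage

def next_merge(clusters, criterion):
    """Return the pair of clusters with the smallest linkage distance."""
    pairs = [(c1, c2) for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]]
    return min(pairs, key=lambda pair: linkage(pair[0], pair[1], criterion))

clusters = [("A", "B"), ("C",), ("D",)]  # A and B are closest, so they merge first
print("single:  ", next_merge(clusters, "single"))
print("complete:", next_merge(clusters, "complete"))
```

With these made-up distances, single linkage adds C to {A, B}, while complete linkage creates a new cluster {C, D}, which is why the "next step" questions depend on the criterion.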

Which chart should Mia use to visualize the number of new members joining the Peloton customer community every month from 2012 to 2020?

Line chart

Which of the following is not a constraint of this model? Options: You must have at least 3 snacks in the backpack; You can have at most 2 sunscreen items in the backpack; The number of items in each category that you take with you in your backpack cannot be negative; Minimum weight of all items combined in the backpack must be at least 1600 grams

Minimum weight of all items combined in the backpack must be at least 1600 grams

When querying a dimensional database, a user transforms the data coming from rows of a table into data grouped on several columns. The OLAP function that serves this purpose is:

Pivot

Which data mining method would be best suited to predict the length of night sleep (numerical) using the previous nights' length of night sleep in the past 60 days?

Time series

Using characteristics of first year undergraduate students, such as age, gender, major, location, workout/sports activities, if we developed a model to forecast which students are at risk of dropping out after the first year of college, decided which students to reach out to and offered them support services to reduce their risk of dropping out, what kind of analytics application would this work represent?

prescriptive analytics

Consider the decision problem used in Questions 3-6. Assume there is one more constraint that you have to satisfy. The total weight of snacks in your backpack cannot exceed the total weight of water bottles. Let's define x1 = number of snacks in your backpack and x2 = number of water bottles in your backpack. Which formulation below represents this new constraint?

60*x1 <= 25*x2

Consider the following sentence: "I enjoy taking a walk in the rain." Which of the following statements is incorrect?

After removing the stop words, the bigram method creates the vector: [ "enjoy", "taking", "walk", "rain"]
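
The statement above is incorrect because the vector it shows contains single words (unigrams); a bigram vector holds pairs of adjacent tokens. A minimal sketch, assuming a small hand-picked stop-word list (real stop-word lists are longer and vary by library):

```python
STOP_WORDS = {"i", "a", "in", "the"}  # ASSUMED minimal stop-word list

def tokens_without_stop_words(sentence):
    """Lowercase, drop the final period, split on spaces, remove stop words."""
    words = sentence.lower().rstrip(".").split()
    return [w for w in words if w not in STOP_WORDS]

def bigrams(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))

unigrams = tokens_without_stop_words("I enjoy taking a walk in the rain.")
print(unigrams)           # single-word tokens
print(bigrams(unigrams))  # adjacent-token pairs
```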

Which data mining method would be best suited to find out which days (categorical) and night sleep categories (binary) are frequently observed together?

Association Rule Analysis

Assume we fit a regression line to the scatterplot in Figure 1 from Question 1. Which of the following statement(s) is/are correct? A: The intercept of the line would represent the high school GPA percentile of a student given his/her/their SAT score is 0. B: The intercept of the line would represent the SAT score of a student given his/her/their high school GPA percentile is 0. C: The slope of the regression line would represent how much the SAT score changes if the high school GPA percentile changes by 1% D: The slope of the regression line will be positive

B, C and D

Which of the below statement(s) is/are correct? A: An important data transformation subtask is to select the relevant data using domain expert input, i.e., decide which sources and data to collect. B: When merging two data source tables A and B, using the full outer join method eliminates all rows from the resulting table that do not have corresponding rows in both source tables A and B. C: For numerical variables, normalizing the observed values between two values, such as 0 and 1, rescales the values so that variables with different means and/or standard deviations can be compared on a single scale. D: Identifying and reducing noise in the data is a subtask of data reduction.

C

Which decision would you recommend if Lumos wants to look at the best possible outcome for each decision and choose the decision that has the best "best outcome" with regards to net profit?

Choose to expand

Using the dataset of 178 students from Question 1, your team develops a regression model to predict the Combined score of a student (i.e., score that the school uses to rank applicants) using HSPercentile (i.e., high school GPA percentile and it takes values from 0 to 1 where 1 represents the 100th percentile meaning maximum), Gender (where Female=0 and Male=1) and the student's SAT score. The regression equation is given by: CombinedScore =118.95 + 1.91*HSPercentile -1.74*(Gender) + 0.06*SAT This model is a ______________ regression model.

Multiple Linear
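
The regression equation above can be evaluated directly, which also shows why one extra SAT point raises the predicted combined score by the SAT coefficient, as in the Neal and Jimmy question. The input values below are hypothetical; only the coefficients come from the question.

```python
def combined_score(hs_percentile, gender, sat):
    """Regression equation from the question (gender: Female=0, Male=1)."""
    return 118.95 + 1.91 * hs_percentile - 1.74 * gender + 0.06 * sat

# HYPOTHETICAL students who differ only by one SAT point.
jimmy = combined_score(hs_percentile=0.80, gender=1, sat=1200)
neal = combined_score(hs_percentile=0.80, gender=1, sat=1201)

print("difference:", round(neal - jimmy, 2))
```

Because the model is linear, the difference does not depend on the GPA percentile or gender values chosen, as long as both students share them.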

Which chart type below would be most helpful to show the relative proportions of turnover rate of different categories (e.g., computer games, Internet, computer software and other) within the tech sector that drive tech turnover the most? Options: Histogram; Pie chart; Bar chart; Scatterplot

Pie Chart

Assume we develop a regression model to predict the final grade of a student using the following variables: midterm grade, time spent studying for the final exam, number of other classes the student is taking the same term, whether the student took a similar class before (yes or no) and whether the student is female (1=female or 0=male). Which of the following statement(s) is/are correct?

This model is a multiple linear regression model

T or F: Data reduction can be applied to rows (observations) and/or columns (variables) in a given dataset

True

When analyzing the original data of household income of a selected population, analysts notice that 5% of observations are missing and entered in the dataset as N/A (not available). Further, they notice that there are a few extremely low household income values. Which of the following method(s) would be well-suited to prepare the data before conducting descriptive analysis, such as calculating descriptive statistics and creating histogram of household income? A: Use the original dataset to avoid introducing additional noise to data prior to analysis B: Identify the outliers in data with statistical techniques and remove the extremely low income values C: Identify the outliers in data with statistical techniques and replace the extremely low values using the mean of the income values to smooth the values D: Fill in missing values (imputations) with most appropriate values using zeros to indicate that these income values are missing

B and C

Which of the following data preprocessing activity/activities that Mia conducts would fall under data transformation? A: Identify and replace extremely high and low selling price values using appropriate imputation methods B: Convert number of bikes sold per month (numeric) into discrete categories using frequency-based bins C: Filter the data to ensure that only key performance and price features needed for the analysis are included in the data D: Reduce the range of values of quarterly market share (numeric) data to a standard range (e.g., 0 to 1 or -1 to +1) by using normalization or scaling techniques E: Oversample the less represented financial performance measurements

B and D

Which decision would you recommend if Lumos wants to look at the worst possible outcome for each decision and choose the decision that has the best "worst outcome" with regards to net profit?

Choose to stay in current location

Which data mining method would be best suited to find out which days are similar to each other with regards to length of mid-day nap and length of night sleep?

Clustering

If we look at the best possible outcome for each decision alternative and choose the decision that has the best "best outcome", which decision alternative should we choose?

Decision alternative 1

Assume there are 4 customers (Kristen, Dennis, Ben and Ross), and we want to develop a hierarchical clustering model to group the customers into clusters. The input variables used in this clustering model to calculate the distance values are: how much the individual spent for grocery shopping in winter 2020 (measured in $) and how much the individual spent for grocery shopping in summer 2020 (measured in $). Euclidean distance values (measured in $) between all pairs of customers are given in Table 1 below. For example, the Euclidean distance between Dennis and Ben is $360. Based on the distance values in Table 1, the first cluster in a hierarchical clustering model will include:

Dennis and Ross

Assume there are 4 students (Sam, Joe, Max and Hannah), and we want to develop a hierarchical clustering model to group the students into clusters. The input variables used in this clustering model to calculate the distance values are: Midterm exam grade (measured as grade points on a scale from 0 to 100) and Final exam grade (measured as grade points on a scale from 0 to 100). Euclidean distance values (measured as grade points) between all pairs of students are given in Table 1 below. For example, the Euclidean distance between Sam and Max is 11 points. T or F: "The input variables used in this model are not on the same scale, and this makes comparing the distance between students difficult. We need to convert the input variables to be on a similar scale by standardizing."

False

T or F: "Linear regression models represent the mathematical relationship between one or more dependent variables to explain or predict a binary (i.e., a variable that takes values 0=no and 1=yes) independent variable."

False

T or F: "In this linear programming model, taking 5 snacks, 6 water bottles, 2 sunscreen items and 3 clothing items is a feasible solution."

False
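
Feasibility in the backpack model means satisfying every constraint at once, which can be checked mechanically. A minimal sketch using the weights and limits from the problem statement; the dictionary keys are illustrative labels ("water" for drinks, "sunscreen" for sun protection items).

```python
WEIGHTS = {"snacks": 60, "water": 25, "sunscreen": 20, "clothing": 400}  # grams per item
MAX_WEIGHT = 1600  # grams

def total_weight(plan):
    """Total backpack weight in grams for a packing plan."""
    return sum(WEIGHTS[item] * count for item, count in plan.items())

def is_feasible(plan):
    """Check every constraint stated in the backpack problem."""
    return (
        all(count >= 0 for count in plan.values())  # no negative quantities
        and plan["snacks"] >= 3
        and plan["water"] >= 4
        and plan["sunscreen"] <= 2
        and total_weight(plan) <= MAX_WEIGHT
    )

plan = {"snacks": 5, "water": 6, "sunscreen": 2, "clothing": 3}
print(total_weight(plan), is_feasible(plan))
```

The plan from the question weighs 1690 grams, which exceeds the 1600-gram limit, so it is not feasible even though the item-count constraints hold.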

T or F: "Using the correlation between size and selling price, we can predict the selling price of a new house (that is not included in this dataset) if we know the size of that new house."

False

T or F: "Based on the results shown in Table 2, the true positive rate is higher than the true negative rate."

False

T or F: In the data preprocessing step to reduce the dimension of data prior to analysis, sampling the rows is more complex than selecting the columns (variables).

False

Mia decides to use imputation methods as part of the data preprocessing. What is the main purpose of imputation methods?

Fill in missing values with most appropriate values

Use the following description to answer Questions 1 - 5. According to the American Academy of Pediatrics, toddlers (i.e., children who are 12 to 36 months old) need plenty of sleep during the night. However, studies show that many toddlers struggle and resist sleeping through the night. Dr. Capan has a 20-month-old toddler whose night sleep is rather difficult to predict. To analyze her toddler's sleep, she collects daily data including:
• day of the week (categorical, where the categories are: Monday, ..., Sunday),
• length of mid-day nap (numerical, in minutes),
• type of snacks he had before going to bed (categorical, where the categories are: fruit, yoghurt, crackers),
• length of night sleep (numerical, in hours),
• night sleep category (binary, takes values 0=short or 1=long; short means night sleep < 12 hours, and long means night sleep >= 12 hours).
Which data mining method would be best suited to predict the length of night sleep (numerical) using the variables: day of the week, length of mid-day nap, and type of snacks he had before going to bed?

Linear Regression

A regression model is developed to predict which students will be retained in second year (i.e., still enrolled in second year). Using various characteristics of the 178 students in the given dataset, the variable we want to predict is "SecondFallRegistered" which is either yes (=1) if the student is still enrolled in second year (measure of retention), or no (=0) if student dropped out between the beginning of first year and second year of college. In this case, a ______________ regression model would be the best suited model to predict the variable SecondFallRegistered.

Logistic

Which data mining method would be best suited to predict the night sleep category (binary) using the variables: day of the week, length of mid-day nap, and type of snacks he had before going to bed?

Logistic Regression

T or F: During data transformation, depending on the context and purpose of preprocessing the data can be rescaled to a fixed range, and numeric variables can be converted to categorical variables

True
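
Both transformations named in the statement above are short to write out. This is a sketch with made-up sleep durations; the 12-hour cutoff matches the short/long night sleep category defined in the toddler sleep description, and the function names are illustrative.

```python
def min_max_scale(values):
    """Rescale numeric values linearly to the range [0, 1].

    Assumes the values are not all identical (otherwise hi - lo is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def to_category(hours, cutoff=12):
    """Convert a numeric duration into a binary category."""
    return "long" if hours >= cutoff else "short"

night_sleep_hours = [10.0, 11.5, 12.0, 13.0]  # MADE-UP observations
print(min_max_scale(night_sleep_hours))
print([to_category(h) for h in night_sleep_hours])
```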

Given the model predicted that a student would be retained, __________ is a measure that quantifies the ratio of number of students who actually were retained compared to all students that were predicted to be retained. The value of this measure in this example is ___________.

Precision, 2/46

What type of analytics seeks to recognize what is going on as well as the likely forecast and make decisions to achieve the best performance possible?

Prescriptive

________ is an important concept to consider when developing a data warehouse because data warehouses can grow quickly and issues can arise regarding the amount of data, e.g., the pace at which the data warehouse is expected to grow and the complexity of user queries.

Scalability

Using Figure 1, test the hypothesis that students with higher high school GPA percentile have a higher SAT score compared to students with a lower high school GPA percentile. In other words, we want to test if we increase the high school GPA percentile of a student by 1% then their SAT score will also increase. Which of the following methods would help to test this hypothesis?

Simple linear regression with high school percentile as the independent and SAT as the dependent variable

T or F: "A feasible solution of a linear programming model is a solution that represents the values for all decision variables that satisfies all the constraints."

True

T or F: "Classification learns a set of information on characteristics of the previously labeled items, objects, or events to place new instances (with unknown labels) into their respective groups."

True

T or F: "Due to potential risk of model overfitting, rather than using all available data we split the data into training and testing data, we use the training data for model development and evaluate model performance using the testing data."

True

T or F: "If the correlation between size and selling price of a house is 0.85, and we develop a simple regression model using size as independent and selling price as the dependent variable, the slope coefficient associated with size in the regression equation would have a positive sign."

True

T or F: "In a clustering model with two numerical input variables used for clustering, if the input variables are not on the same scale standardizing is used to convert the variables and compare them on a single scale."

True

T or F: "The relational data in a data warehouse are modified and analyzed using Online Analytical Processing (OLAP) tools. Commonly used OLAP tools are slice, dice, drill up and down, and pivot."

True

T or F: "Web structure mining focuses on navigation through a website by analyzing the links in Web documents, and Web content mining is related to extraction of information from the content of Web pages using text mining."

True

T or F: "When developing a data mining model, we split the original data into training data and testing data in order to evaluate the model performance in a dataset that was not used to develop the model."

True

T or F: Choice of visualization method that meets the presentation requirements for a given data depends on the data types available, purpose of the visual and context

True

T or F: Data is a collection of observations, experiments, and experiences that do not necessarily represent absolute facts that are universally true.

True

In ________, the complete data set is randomly split into mutually exclusive subsets and tested multiple times on each left-out subset, using the others as a training set.

k-fold cross-validation
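
The splitting step of k-fold cross-validation can be sketched with the standard library alone. The function name, the seed parameter, and the round-robin fold assignment are illustrative choices, not a specific library's implementation.

```python
import random

def k_fold_splits(n_items, k, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)        # random assignment to folds
    folds = [indices[i::k] for i in range(k)]   # k mutually exclusive subsets
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_splits(n_items=10, k=5):
    print("test:", sorted(test), "train size:", len(train))
```

Every observation lands in exactly one test fold, so the model is always evaluated on data it was not trained on.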

In text mining, what is a lexicon?

a catalog of words and scores (or categories) assigned to the words based on their meaning

A probability node of a decision tree for decision modeling represents __________________.

a time when the result of an uncertain outcome becomes known

Optimization is a _________ analytics method.

prescriptive

Which of the below is not a data preprocessing step? Options: data consolidation; data transformation; data separation; data reduction

data separation

Business Intelligence (BI)

is an umbrella term that combines architectures, databases, analytical tools, applications, and methodologies

