BSAN 160 - Activities and Quiz Questions

Which type of data visualization method can be helpful when the intention is to show relative proportions of dollars per department allocated by a university administration? A. scatterplot B. line chart C. pie chart D. histogram

C. pie chart

Which methods would your team use to forecast the risk factors that are most likely to result in a newly enrolled senior member's fall at home using existing Humana members' data? a) Descriptive analytics methods b) Central tendency methods c) Predictive analytics methods d) Prescriptive analytics methods

c) Predictive analytics methods

According to the American Academy of Pediatrics, toddlers (i.e., children who are 12 to 36 months old) need plenty of sleep during the night. However, studies show that many toddlers struggle and resist sleeping through the night. Dr. Capan has a 22-month-old toddler whose night sleep is rather difficult to predict. To analyze her toddler's sleep, she collects daily data including: • day of the week (categorical, where the categories are: Monday,..., Sunday), • length of mid-day nap (numerical, in Minutes), • type of snacks he had before going to bed (categorical where the categories are: fruit, yoghurt, crackers), • length of night sleep (numerical, in hours) • night sleep category (binary, takes values 0=short or 1=long --> short means night sleep < 12 hours, and long means night sleep >= 12 hours) Which data mining method would be best suited to predict the length of night sleep (numerical) using the previous nights' length of night sleep in the past 60 days? A. Time Series B. Association Rule Mining C. Logistic Regression D. Clustering

A. Time Series

Which type of data visualization method can be helpful when the intention is to show the relationship between time spent working out per week (in hours) and the average amount of sleep an individual gets per week (in hours)? A. scatterplot B. pie chart C. histogram D. boxplot

A. scatterplot

Figure 1 shows the Annual Salary distribution (histogram) at Department 1. Figure 2 shows Annual Salary distribution (histogram) at Department 2. Which of the following statement(s) is/are correct? Figure 1. Annual salary histogram for Department 1. Figure 2. Annual salary histogram for Department 2. A: Annual salary values are more spread out at Department 2 compared with Department 1. B: At Department 1, there are some outlier (extreme) salary values, e.g., there are a few employees that make a lot more than the majority of other employees C: At Department 2, the annual salary is symmetrically distributed with zero skewness. A and B

A: Annual salary values are more spread out at Department 2 compared with Department 1.

Which of the following is/are element(s) of decision models under uncertainty? A: Decision alternative B: Outcome (e.g., payoff or cost) C: Decision criterion All of the above (A, B, and C)

All of the above (A, B, and C)

Congratulations! The model your team developed to identify the factors that increase the risk of a senior individual falling at home and experiencing an injury was well received by upper-level management. Now, your team is tasked with turning this predictive model into a prescriptive analytics model. Which of the actions below would make your analytics model a prescriptive model? a) Create a data visualization tool that shows the fall risk distribution of Humana members b) Identify at-risk members soon after their enrollment and offer those individuals devices and services that detect risk factors to prevent falls at home c) Describe the variation in fall risk of Humana members to better understand low and high risk levels d) Predict the top 5% highest risk members within the newly enrolled member population

b) Identify at-risk members soon after their enrollment and offer those individuals devices and services that detect risk factors to prevent falls at home

What type of analytics seeks to forecast the future/unknown values of outcomes of interest using past and current available data? a) descriptive b) predictive c) prescriptive d) domain

b) predictive

Consider the following decision problem. Lumos Company owns a yoga studio in Philadelphia. Because their current studio can only accommodate 20 people at a time, they experience challenges on days when more people show up to take classes. Lumos realized that they can potentially generate more net profit if they expand (i.e., open another yoga studio). Right now, in March 2021, they have to decide whether to expand or stay in the current location. • If they decide to stay in current location, by the end of 2021 their revenue will be $120,000 with certainty and the cost (associated with business maintenance) will be $20,000 with certainty. • If they decide to expand, the cost (associated with business maintenance plus expansion cost) will be $50,000 with certainty. In case of expansion, the demand for yoga classes could increase with probability 0.6 or decrease with probability 0.4. If they expand and demand increases, by the end of 2021 their revenue will be $250,000. If they expand and demand decreases, by the end of 2021 their revenue will be $50,000. (Keep in mind: Net profit = Revenue - Cost). Compute the Expected Monetary Value (EMV) of net profit for decision alternative "expand". Enter the numeric EMV value in the box below (do not use a dollar sign, do not use decimal points, do not use a comma separator for thousands).

(250,000 - 50,000) * 0.6 + (50,000 - 50,000) * 0.4 = 120,000 + 0 = 120,000

Julie Miller, a manager at the health tech company, Wellness Path, oversees two departments in the wearables business line: Product Development (referred as Department 1), and Product Testing (referred as Department 2). Both departments focus on bringing impactful products to the market that provide competitive advantage due to their design features and interoperability with other devices. Table 1 below shows selected descriptive statistics for age in Department 1 and Department 2. Table 1. Selected descriptive statistics of age at Department 1 (left) and Department 2 (right). (a) Department 2 has a larger spread with regards to age compared to Department 1. (b) Age distribution at Department 1 is negatively skewed (c) Age distribution at Department 2 is negatively skewed (d) 95% of the age values at Department 1 are between the ages 29 and 41

(c) Age distribution at Department 2 is negatively skewed

Consider the following decision problem. Lumos Company owns a yoga studio in Philadelphia. Because their current studio can only accommodate 20 people at a time, they experience challenges on days when more people show up to take classes. Lumos realized that they can potentially generate more net profit if they expand (i.e., open another yoga studio). Right now, in March 2021, they have to decide whether to expand or stay in the current location. • If they decide to stay in current location, by the end of 2021 their revenue will be $120,000 with certainty and the cost (associated with business maintenance) will be $20,000 with certainty. • If they decide to expand, the cost (associated with business maintenance plus expansion cost) will be $50,000 with certainty. In case of expansion, the demand for yoga classes could increase with probability 0.6 or decrease with probability 0.4. If they expand and demand increases, by the end of 2021 their revenue will be $250,000. If they expand and demand decreases, by the end of 2021 their revenue will be $50,000. (Keep in mind: Net profit = Revenue - Cost). Compute the Expected Monetary Value (EMV) of net profit for decision alternative "stay in current location". Enter the numeric EMV value in the box below (do not use a dollar sign, do not use decimal points, do not use a comma separator for thousands).

(120,000 - 20,000) * 1.0 = 100,000

When analyzing the data of household income of a selected population, analysts notice that 5% of observations are missing and entered as N/A (not available). Which of the following method(s) can be used to prepare the data before conducting descriptive analysis? A: Identify the missing values and replace them using the mean income values of the selected population B: Identify the missing values and replace them using the median income values of the selected population C: Remove the records (rows) with missing household income value A, B and C

A, B and C

Consider the following dataset (Table 1) that an analytics team at Netflix collected from 10 customers including information about the viewers and their content consumption choices. Table 1. Netflix data where 1=yes, and 0=no. What is the support, P(X&Y), for the rule X ⇒ Y where X represents Married and Y represents Watched Breaking Bad = 1? A. 10% B. 20% C. 50% D. 80%

A. 10%

Assume there are 4 students (Sam, Joe, Max and Hannah), and we want to develop a hierarchical clustering model to group the students into clusters. The input variables used in this clustering model to calculate the distance values are: Midterm exam grade (measured as grade points on a scale from 0 to 100) and Final exam grade (measured as grade points on a scale from 0 to 100). Euclidean distance values (measured as grade points) between all pairs of students are given in Table 1 below. For example, the Euclidean distance between Sam and Max is 11 points. Table 1. Distance values between all student pairs. Using the single linkage criterion, after creating the first cluster, the next step would be: A. Add Sam to the cluster {Joe and Hannah} B. Add Max to the cluster {Joe and Hannah} C. Create a new cluster that includes {Sam and Max}

A. Add Sam to the cluster {Joe and Hannah}

Consider the following decision problem. Lumos Company owns a yoga studio in Philadelphia. Because their current studio can only accommodate 20 people at a time, they experience challenges on days when more people show up to take classes. Lumos realized that they can potentially generate more net profit if they expand (i.e., open another yoga studio). Right now, in March 2021, they have to decide whether to expand or stay in the current location. • If they decide to stay in current location, by the end of 2021 their revenue will be $120,000 with certainty and the cost (associated with business maintenance) will be $20,000 with certainty. • If they decide to expand, the cost (associated with business maintenance plus expansion cost) will be $50,000 with certainty. In case of expansion, the demand for yoga classes could increase with probability 0.6 or decrease with probability 0.4. If they expand and demand increases, by the end of 2021 their revenue will be $250,000. If they expand and demand decreases, by the end of 2021 their revenue will be $50,000. (Keep in mind: Net profit = Revenue - Cost). Which decision would you recommend if Lumos wants to look at the best possible outcome for each decision and choose the decision that has the best "best outcome" with regards to net profit? A. Choose to expand B. Choose to stay in current location C. Indifferent between expand and stay in current location

A. Choose to expand

Consider the following decision problem. Lumos Company owns a yoga studio in Philadelphia. Because their current studio can only accommodate 20 people at a time, they experience challenges on days when more people show up to take classes. Lumos realized that they can potentially generate more net profit if they expand (i.e., open another yoga studio). Right now, in March 2021, they have to decide whether to expand or stay in the current location. • If they decide to stay in current location, by the end of 2021 their revenue will be $120,000 with certainty and the cost (associated with business maintenance) will be $20,000 with certainty. • If they decide to expand, the cost (associated with business maintenance plus expansion cost) will be $50,000 with certainty. In case of expansion, the demand for yoga classes could increase with probability 0.6 or decrease with probability 0.4. If they expand and demand increases, by the end of 2021 their revenue will be $250,000. If they expand and demand decreases, by the end of 2021 their revenue will be $50,000. (Keep in mind: Net profit = Revenue - Cost). Which decision would you recommend if Lumos wants to choose the decision by using a weighted average of the possible payoffs - in other words uses the Expected Monetary Value (EMV) and picks the decision with the largest EMV with regards to net profit? A. Choose to expand B. Choose to stay in current location C. Indifferent between expand and stay in current location

A. Choose to expand
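
The EMV comparison for the Lumos problem can be checked with a short sketch. The revenues, costs, and probabilities come from the problem statement; the function and variable names are illustrative only.

```python
def emv(outcomes):
    """Expected monetary value: sum of probability * (revenue - cost)."""
    return sum(p * (revenue - cost) for p, revenue, cost in outcomes)

# Each outcome is (probability, revenue, cost), taken from the problem text.
alternatives = {
    "expand": [(0.6, 250_000, 50_000), (0.4, 50_000, 50_000)],
    "stay": [(1.0, 120_000, 20_000)],
}

emvs = {name: emv(outcomes) for name, outcomes in alternatives.items()}
best = max(emvs, key=emvs.get)  # alternative with the largest EMV
print(emvs, best)
```

Expand has the larger EMV (120,000 versus 100,000), which matches answer A.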

LinkedIn data shows that job retention is a challenge in every industry. Especially in the tech sector, the worldwide annual turnover rate changes year by year, with values between 10 and 14 percent depending on the year. Within the tech sector, different sub-sectors also show differences with regards to job retention. Sectors such as the computer games and computer software industries drive tech turnover the most compared to other sectors. Which chart type would be most helpful to show the trend of worldwide annual turnover rate over time? A. Line chart B. Histogram C. Boxplot D. Scatterplot

A. Line chart

If we want to analyze the impact of one unit increase/decrease in high school GPA on students' SAT score, which of the following methods should we use? A. Simple linear regression with HSGPA as the independent variable and SAT score as the dependent variable B. Simple linear regression with HSGPA as the dependent variable and SAT score as the independent variable C. Multiple linear regression with HSGPA as the independent variable and SAT score as the dependent variable D. Multiple linear regression with HSGPA as the dependent variable and SAT score as the independent variable

A. Simple linear regression with HSGPA as the independent variable and SAT score as the dependent variable

Fill in the blank. "In ___________, the original dataset is split once into two subsets, the model is developed using the training data, and the model is evaluated using the testing data." A. Simple/single split B. k-Fold cross validation C. Overfitting D. Area under the curve

A. Simple/single split

Select the correct answer. "__________ is calculated by taking the square root of the variance." A. Standard deviation B. Variance C. Skewness D. Mean

A. Standard deviation
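
A minimal sketch of the relationship between variance and standard deviation, using Python's standard library. The data values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

variance = statistics.pvariance(values)  # population variance
std_dev = math.sqrt(variance)            # standard deviation = sqrt(variance)

# The library's own population standard deviation should agree.
print(variance, std_dev, statistics.pstdev(values))
```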

In text mining, what is a lexicon? A. a catalog of words and scores (or categories) assigned to the words based on their meaning B. a catalog of words and scores (or categories) extracted from the links embedded in Web documents C. a catalog of quantitative web analytics metrics such as page views and click paths D. a catalog of systematic approaches to analyze the content of social media outlets

A. a catalog of words and scores (or categories) assigned to the words based on their meaning

Using Figure 1, which of the following statements is incorrect? Figure 1. Characteristics of survey participants A. This dataset has 6 variables B. "State" is an ordinal variable C. "Gender" is a categorical variable D. "Salary" and "Age" are numerical variables

B. "State" is an ordinal variable

What is the approximate predicted BAC for a person who weighs 140 pounds and had 3 drinks? Note that the intercept starts with 0.03, not 0.08! A. 0.039 B. 0.048 C. 0.150 D. 2.835

B. 0.048

Assume a two-class classification model is developed to classify hospitalized patients as "healthy" (referred to as positive) or "sick" (referred to as negative). The model is applied to 200 new patients to classify them as healthy or sick, and the predicted class for each patient (healthy / sick) is compared to the actual/observed class (healthy / sick). Table 2 shows the model evaluation results. Table 2. Model evaluation results. What is the false positive (FP) count? A. 80 B. 20 C. 40 D. 60

B. 20

Consider the item set in Figure 1 (with 6 items: eggs, beer, juice which is represented by the red glass, milk, diapers, and bread) in a dataset with 5 transactions where each row in the dataset is a shopping purchase transaction. For example, transaction 1 is the first row in the dataset and includes the purchase of bread and milk, transaction 2 is the second row and includes the purchase of bread, diapers, beer, and eggs, etc. Figure 1. Itemset and a dataset derived from shopping data. Support for Beer and Milk being purchased together is ________. A. 20% B. 40% C. 60% D. 80%

B. 40%

Consider the item set in Figure 1 (with 6 items: eggs, beer, juice which is represented by the red glass, milk, diapers, and bread) in a dataset with 5 transactions where each row in the dataset is a shopping purchase transaction. For example, transaction 1 is the first row in the dataset and includes the purchase of bread and milk, transaction 2 is the second row and includes the purchase of bread, diapers, beer, and eggs, etc. Figure 1. Itemset and a dataset derived from shopping data. Confidence for Milk and Juice (that is, the probability P(Y|X) where X = Milk and Y = Juice) is ___________ A. 20% B. 50% C. 60% D. 80%

B. 50%
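
Support and confidence can be sketched as below. Only transactions 1 and 2 are described in the text; transactions 3 to 5 are an assumption here, chosen to be consistent with the stated answers (40% and 50%), since the figure itself is not reproduced.

```python
transactions = [
    {"bread", "milk"},                      # transaction 1 (from the text)
    {"bread", "diapers", "beer", "eggs"},   # transaction 2 (from the text)
    {"milk", "diapers", "beer", "juice"},   # assumed
    {"bread", "milk", "diapers", "beer"},   # assumed
    {"bread", "milk", "diapers", "juice"},  # assumed
]

def count(items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)

def support(items):
    """P(X & Y): share of all transactions containing the itemset."""
    return count(items) / len(transactions)

def confidence(x, y):
    """P(Y | X): share of transactions with X that also contain Y."""
    return count(x | y) / count(x)

print(support({"beer", "milk"}))        # support for beer and milk
print(confidence({"milk"}, {"juice"}))  # confidence for milk => juice
```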

Which of the below is not an example of textual data? A. Tweets posted in Twitter during a given week B. A database including features of all active Twitter users such as their age and annual income C. Physician notes taken during patient visits in the hospital describing their observations and diagnoses D. Customer feedback on restaurants (describing what they liked and didn't like) shared on an online portal

B. A database including features of all active Twitter users such as their age and annual income

Assume there are 4 customers (Kristen, Dennis, Ben and Ross), and we want to develop a hierarchical clustering model to group the customers into clusters. The input variables used in this clustering model to calculate the distance values are: how much the individual spent for grocery shopping in winter 2020 (measured in $) and how much the individual spent for grocery shopping in summer 2020 (measured in $). Euclidean distance values (measured in $) between all pairs of customers are given in Table 1 below. For example, the Euclidean distance between Dennis and Ben is $360. Using the single linkage criterion, after creating the first cluster, the next step would be: A. Add Kristen to the cluster {Dennis and Ross} B. Add Ben to the cluster {Dennis and Ross} C. Create a cluster that includes {Kristen and Ben}

B. Add Ben to the cluster {Dennis and Ross}
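
The single-linkage merge order can be sketched as follows. Table 1 is not reproduced in the text, so only the Dennis-Ben distance (360) is taken from the source; every other distance below is hypothetical and chosen merely to agree with the stated answers.

```python
from itertools import combinations

# Pairwise distances in $. Only Dennis-Ben = 360 is from the text;
# all other values are hypothetical placeholders.
dist = {
    frozenset(("Dennis", "Ross")): 100,
    frozenset(("Dennis", "Ben")): 360,
    frozenset(("Ross", "Ben")): 400,
    frozenset(("Kristen", "Ross")): 450,
    frozenset(("Kristen", "Dennis")): 500,
    frozenset(("Kristen", "Ben")): 600,
}

def single_link(a, b):
    """Single linkage: smallest distance between any member of a and any member of b."""
    return min(dist[frozenset((x, y))] for x in a for y in b)

def next_merge(clusters):
    """Return the pair of clusters with the smallest single-linkage distance."""
    return min(combinations(clusters, 2), key=lambda pair: single_link(*pair))

clusters = [frozenset([name]) for name in ("Kristen", "Dennis", "Ben", "Ross")]
first = next_merge(clusters)
clusters = [c for c in clusters if c not in first] + [first[0] | first[1]]
second = next_merge(clusters)
print(sorted(first[0] | first[1]), sorted(second[0] | second[1]))
```

With these distances the first merge is {Dennis, Ross}, and Ben joins next because his closest member of that cluster (Dennis, 360) is nearer than any other remaining pair.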

Consider the following sentence: "I hate unloading the dishwasher." Which of the following statements is incorrect? A. Tokenization can find the most frequently used words in this sentence. B. After removing the stop words, the bigram method creates the vector: [ "hate", "unloading", "dishwasher"] C. "I" and "the" in this sentence would be considered as stop words. D. If we use a lexicon using emotion-word associations, the word "hate" could be assigned a negative sentiment.

B. After removing the stop words, the bigram method creates the vector: [ "hate", "unloading", "dishwasher"]
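
A short sketch shows why the vector in option B is not a bigram result: bigrams are pairs of adjacent tokens, not single words. The stop-word list here is a small illustrative assumption, not a standard list.

```python
import re

STOP_WORDS = {"i", "the", "a", "in"}  # small illustrative stop-word list

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bigrams(tokens):
    """Pairs of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))

tokens = [t for t in tokenize("I hate unloading the dishwasher.") if t not in STOP_WORDS]
pairs = bigrams(tokens)
print(tokens)  # unigrams after stop-word removal
print(pairs)   # bigrams after stop-word removal
```

The unigram list is ["hate", "unloading", "dishwasher"], while the bigrams are ("hate", "unloading") and ("unloading", "dishwasher").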

According to the American Academy of Pediatrics, toddlers (i.e., children who are 12 to 36 months old) need plenty of sleep during the night. However, studies show that many toddlers struggle and resist sleeping through the night. Dr. Capan has a 22-month-old toddler whose night sleep is rather difficult to predict. To analyze her toddler's sleep, she collects daily data including: • day of the week (categorical, where the categories are: Monday,..., Sunday), • length of mid-day nap (numerical, in Minutes), • type of snacks he had before going to bed (categorical where the categories are: fruit, yoghurt, crackers), • length of night sleep (numerical, in hours) • night sleep category (binary, takes values 0=short or 1=long --> short means night sleep < 12 hours, and long means night sleep >= 12 hours) Which data mining method would be best suited to find out which days (categorical) and night sleep categories (binary) are frequently observed together? A. Time series B. Association Rule Mining C. Linear Regression Analysis D. Logistic Regression Analysis

B. Association Rule Mining

In data mining, finding an affinity of two products to be commonly purchased together is known as A. Decision trees B. Association rule mining C. Supervised learning D. OLAP

B. Association rule mining

Employees of both departments were asked the following question: "On a scale from 0 (worst value) to 10 (best value), how would you rate your job satisfaction at your current work position?" Results are shown in Figure 3. Which of the following statements is correct? Figure 3. Boxplots of job satisfaction for Departments 1 and 2 where 0 is the worst possible satisfaction value and 10 is the best satisfaction value. A. At Department 1, 25% of all job satisfaction observations are below 2 B. At Department 1, 75% of all job satisfaction observations are below 4 C. Satisfaction scores show a smaller spread at Department 2 compared with Department 1 D. On average, employees at Department 1 are more satisfied with their current job position

B. At Department 1, 75% of all job satisfaction observations are below 4

Consider the following decision problem. Lumos Company owns a yoga studio in Philadelphia. Because their current studio can only accommodate 20 people at a time, they experience challenges on days when more people show up to take classes. Lumos realized that they can potentially generate more net profit if they expand (i.e., open another yoga studio). Right now, in March 2021, they have to decide whether to expand or stay in the current location. • If they decide to stay in current location, by the end of 2021 their revenue will be $120,000 with certainty and the cost (associated with business maintenance) will be $20,000 with certainty. • If they decide to expand, the cost (associated with business maintenance plus expansion cost) will be $50,000 with certainty. In case of expansion, the demand for yoga classes could increase with probability 0.6 or decrease with probability 0.4. If they expand and demand increases, by the end of 2021 their revenue will be $250,000. If they expand and demand decreases, by the end of 2021 their revenue will be $50,000. (Keep in mind: Net profit = Revenue - Cost). Which decision would you recommend if Lumos wants to look at the worst possible outcome for each decision and choose the decision that has the best "worst outcome" with regards to net profit? A. Choose to expand B. Choose to stay in current location C. Indifferent between expand and stay in current location

B. Choose to stay in current location
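
The best "best outcome" (maximax) and best "worst outcome" (maximin) criteria for the Lumos problem can be sketched together. Net profits are derived from the revenues and costs in the problem statement; the variable names are illustrative.

```python
# Net profit (revenue - cost) for every possible outcome of each decision.
net_profits = {
    "expand": [250_000 - 50_000, 50_000 - 50_000],  # demand up, demand down
    "stay": [120_000 - 20_000],                     # certain outcome
}

# Maximax: compare the best outcome of each decision, pick the largest.
maximax = max(net_profits, key=lambda d: max(net_profits[d]))

# Maximin: compare the worst outcome of each decision, pick the largest.
maximin = max(net_profits, key=lambda d: min(net_profits[d]))

print(maximax, maximin)
```

Expanding has the best upside (200,000 versus 100,000), while staying has the best downside (100,000 versus 0).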

_____ is an imputation method where the most recently available value of a variable is used to replace missing values until a new observation is available in the data. A. Visual analytics B. Last observation carry forward C. Data reduction D. Normalization

B. Last observation carry forward
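
A minimal sketch of last observation carried forward, assuming missing values are represented as `None`. The sample series is hypothetical.

```python
def locf(values):
    """Replace each missing value (None) with the most recent observed value.

    Leading missing values stay None because nothing has been observed yet.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Hypothetical series with gaps.
series = [7.5, None, None, 8.0, None]
print(locf(series))
```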

LinkedIn data shows that job retention is a challenge in every industry. Especially in the tech sector, the worldwide annual turnover rate changes year by year, with values between 10 and 14 percent depending on the year. Within the tech sector, different sub-sectors also show differences with regards to job retention. Sectors such as the computer games and computer software industries drive tech turnover the most compared to other sectors. Which chart type below would be most helpful to show the relative proportions of annual turnover rate of different sub-sectors (e.g., computer games, computer software and other)? A. Histogram B. Pie chart C. Bar Chart D. Scatterplot

B. Pie chart

Assume the performance of the regression model developed for student retention is being evaluated. For this purpose, we apply the model to a new dataset (new group of 72 first year students). At the beginning of the second year of college we compare what the model predicted and what actually happened (i.e., whether the student was retained or dropped out). The resulting confusion matrix is shown below: Given the model predicted that a student would be retained, __________ is a measure that quantifies the ratio of the number of students who actually were retained compared to all students that were predicted to be retained. The value of this measure in this example is ___________. A. Precision, 2/23 B. Precision, 2/46 C. Recall, 2/23 D. Recall, 2/46

B. Precision, 2/46

Historical data shows that the mean starting annual salary at a given department of a company is 95K and the median is 118K. Which of the following statements is true? A. Starting annual salary is negatively skewed, in other words there are a few employees that have much higher starting annual salaries than the majority of the employees B. Starting annual salary is negatively skewed, in other words there are a few employees that have much lower starting annual salaries than the majority of the employees C. Starting annual salary is positively skewed, in other words there are a few employees that have much higher starting annual salaries than the majority of the employees D. Starting annual salary is positively skewed, in other words there are a few employees that have much lower starting annual salaries than the majority of the employees

B. Starting annual salary is negatively skewed, in other words there are a few employees that have much lower starting annual salaries than the majority of the employees

Consider the following sentence: "I am watching a show about food in Italy, which is why I talk about food and Italy all the time." What are the missing numerical values in the Bag-of-Words vector below showing word frequencies in this sentence? A. [1, 1, 1, 2, 1, 1] B. [1, 1, 2, 2, 1, 1] C. [1, 2, 2, 1, 1, 1] D. [1, 1, 2, 2, 2, 1]

B. [1, 1, 2, 2, 1, 1]
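
Word frequencies for this sentence can be counted with a short sketch. The vocabulary of the Bag-of-Words vector in the question is not reproduced in the text, so the code simply counts every word in the sentence.

```python
import re
from collections import Counter

sentence = ("I am watching a show about food in Italy, "
            "which is why I talk about food and Italy all the time.")

# Lowercase and split into word tokens, then count occurrences.
counts = Counter(re.findall(r"[a-z]+", sentence.lower()))

repeated = sorted(word for word, n in counts.items() if n > 1)
print(repeated)  # words that occur more than once
```

The words "about", "food", "i", and "italy" each occur twice; every other word occurs once.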

When querying a 3-dimensional database, assume one of the dimensions is called "Location" and it shows different cities in the U.S., and the other dimensions are Product Category (laptop, PC, phone,..) and Month (January, February,...), and the cells show sales volume for the intersection of these dimensions. Assume the user wants to find out the sales volumes in Philadelphia and in April. The OLAP function that serves this purpose is: A. slice B. dice C. drill up D. pivot

B. dice

Fill in the blank. "In ___________, data is randomly split into mutually exclusive subsets and tested multiple times on each left-out subset, using others as a training set." A. Simple/single split B. k-Fold cross validation C. Overfitting D. Decision Tree Analysis

B. k-Fold cross validation
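
A minimal sketch of how k-fold cross validation partitions row indices, written in plain Python. The function name, the seed, and the example sizes are illustrative assumptions, not part of the course material.

```python
import random

def k_fold_splits(n_rows, k, seed=0):
    """Yield (train, test) index lists for k mutually exclusive folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)         # random assignment to folds
    folds = [indices[i::k] for i in range(k)]    # k disjoint subsets
    for i, test in enumerate(folds):
        # Every fold except the left-out one forms the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print(sorted(test), len(train))
```

Each row lands in exactly one test fold, so the model is evaluated once on every observation.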

The Bag-of-Words method uses ____________ to extract features from textual data. A. structured data B. word frequencies in a text C. syntax of a text D. semantics of a text

B. word frequencies in a text

Consider the following dataset (Table 1) that an analytics team at Netflix collected from 10 customers including information about the viewers and their content consumption choices. Table 1. Netflix data where 1=yes, and 0=no. What is the support, P(X&Y), for the rule X ⇒ Y where X represents Female and Y represents Watched Breaking Bad= 1? A. 10% B. 20% C. 30% D. 40%

C. 30%

Consider the following dataset (Table 1) that an analytics team at Netflix collected from 10 customers including information about the viewers and their content consumption choices. Table 1. Netflix data where 1=yes, and 0=no. What is the confidence, P(Y|X), for the rule X ⇒ Y where X represents Watched Breaking Bad = 1 and Y represents Watched Ozark = 0? A. 10% B. 20% C. 40% D. 60%

C. 40%

Consider the following dataset (Table 1) that an analytics team at Netflix collected from 10 customers including information about the viewers and their content consumption choices. Table 1. Netflix data where 1=yes, and 0=no. What is the confidence, P(Y|X), for the rule X ⇒ Y where X represents Single and Y represents Watched Breaking Bad = 1? A. 20% B. 40% C. 50% D. 80%

C. 50%

Assume a two-class classification model is developed to classify hospitalized patients as "healthy" (referred to as positive) or "sick" (referred to as negative). The model is applied to 200 new patients to classify them as healthy or sick, and the predicted class for each patient (healthy / sick) is compared to the actual/observed class (healthy / sick). Table 2 shows the model evaluation results. Table 2. Model evaluation results. How many mistakes (misclassifications) did the model make? A. 20 B. 40 C. 60 D. 80

C. 60

Consider the following sentence: "I enjoy taking a walk in the rain." Which of the following statements is incorrect? A. Tokenization can find the most frequently used words in this sentence. B. "I", "a", "in", "the" in this sentence would be considered as stop words. C. After removing the stop words, the bigram method creates the vector: [ "enjoy", "taking", "walk", "rain"] D. If we use a lexicon using emotion-word associations, the word "enjoy" could be assigned a positive sentiment.

C. After removing the stop words, the bigram method creates the vector: [ "enjoy", "taking", "walk", "rain"]

Which of the below is not a Text Mining goal? A. Identification of key phrases B. Identification of trends and popular topics C. Applying analytical methods to structured data D. Analyzing frequencies of words

C. Applying analytical methods to structured data

Assume there are 4 customers (Kristen, Dennis, Ben and Ross), and we want to develop a hierarchical clustering model to group the customers into clusters. The input variables used in this clustering model to calculate the distance values are: how much the individual spent for grocery shopping in winter 2020 (measured in $) and how much the individual spent for grocery shopping in summer 2020 (measured in $). Euclidean distance values (measured in $) between all pairs of customers are given in Table 1 below. For example, the Euclidean distance between Dennis and Ben is $360. Based on the distance values in Table 1, the first cluster in a hierarchical clustering model will include: A. Kristen and Dennis B. Dennis and Ben C. Dennis and Ross D. Ben and Ross

C. Dennis and Ross

Peloton is an American exercise equipment company founded in 2012. Their signature product, the Peloton bike, is a high-end indoor bicycle with a Wi-Fi-enabled touchscreen tablet that streams live and on-demand classes. According to available data, the company has sold more than 550,000 bikes and treadmills since 2013. Mia, a business analyst and an active new mom who is just back from a break from exercise has been considering buying an exercise bike. Before making the purchase decision she gathers data from multiple sources to analyze features of Peloton and its main competitors. While the raw data includes key features such as price, class types, usability, and market share, it requires preprocessing before any meaningful analysis can be conducted. Refer to Week 2 and 3 class content and reading material to respond to question below. Question 1. Which of the following data preprocessing activity/activities that Mia conducts would fall under data cleaning? A. Access and collect the sales data from different sources B. Derive new variables from the existing ones using mathematical functions C. Fill in missing values with most appropriate values D. Oversample the less represented rows in data

C. Fill in missing values with most appropriate values

Which of the statement(s) below is/are correct? A. Selecting the relevant data by deciding which data sources to collect is a data reduction subtask. B. Converting the numeric variables into discrete representations is a data consolidation subtask. C. For numerical variables, normalizing the observed values between 0 and 1 is a data transformation subtask. D. Reducing the number of attributes in data is a data transformation subtask.

C. For numerical variables, normalizing the observed values between 0 and 1 is a data transformation subtask.
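
Normalizing values to the range 0 to 1 is commonly done with min-max scaling. The sketch below assumes that method; the helper name and sample prices are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

prices = [10, 20, 30]  # hypothetical numeric variable
print(min_max_normalize(prices))
```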

Peloton is an American exercise equipment company founded in 2012. Their signature product, the Peloton bike, is a high-end indoor bicycle with a Wi-Fi-enabled touchscreen tablet that streams live and on-demand classes. According to available data, the company has sold more than 550,000 bikes and treadmills since 2013. Mia, a business analyst and an active new mom who is just back from a break from exercise has been considering buying an exercise bike. Before making the purchase decision she gathers data from multiple sources to analyze features of Peloton and its main competitors. While the raw data includes key features such as price, class types, usability, and market share, it requires preprocessing before any meaningful analysis can be conducted. Refer to Week 2 and 3 class content and reading material to respond to question below. Which chart should Mia use to visualize the number of new members joining the customer community every month from 2012 to 2020? A. Boxplot B. Histogram C. Line chart D. Pie chart

C. Line chart

Select the correct answer. "______ is a measure of central tendency and is the sum of all the values/observations divided by the number of observations in the data set." A. Dispersion B. Median C. Mean D. Standard Deviation

C. Mean

Peloton is an American exercise equipment company founded in 2012. Their signature product, the Peloton bike, is a high-end indoor bicycle with a Wi-Fi-enabled touchscreen tablet that streams live and on-demand classes. According to available data, the company has sold more than 550,000 bikes and treadmills since 2013. Mia, a business analyst and an active new mom who is just back from a break from exercise has been considering buying an exercise bike. Before making the purchase decision she gathers data from multiple sources to analyze features of Peloton and its main competitors. While the raw data includes key features such as price, class types, usability, and market share, it requires preprocessing before any meaningful analysis can be conducted. Refer to Week 2 and 3 class content and reading material to respond to question below. Which chart should Mia use to visualize the relative proportion of market share of Peloton in 2020 compared to its competitors Nordic Track, Myx Fitness, and Echelon? A. Histogram B. Line chart C. Pie chart D. Scatterplot

C. Pie chart

Assume the performance of the regression model developed for student retention is being evaluated. For this purpose, we apply the model to a new dataset (new group of 72 first year students). At the beginning of second year of college we compare what the model predicted and what actually happened (i.e., whether the student was retained or dropped out). The resulting confusion matrix is shown below: Given a student was retained, the number of times the model predicted a student's retention correctly is called _______. The value of this measure in this example is ___________. A. Precision, 2/23 B. Precision, 2/46 C. Recall, 2/23 D. Recall, 2/46

C. Recall, 2/23
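
The confusion matrix itself is not reproduced in this set, so the sketch below works backward from the stated answer: it assumes 23 students were actually retained and the model correctly flagged 2 of them, leaving 21 false negatives. Those counts are inferred for illustration, not read from the original table.

```python
from fractions import Fraction


def recall(true_positives, false_negatives):
    """Share of actual positives that the model predicted as positive."""
    return Fraction(true_positives, true_positives + false_negatives)


# Counts assumed from the answer 2/23 (the original matrix is not shown).
tp = 2   # retained students the model predicted as retained
fn = 21  # retained students the model predicted as dropped out
result = recall(tp, fn)
print(result)  # 2/23
```

Precision would instead divide the true positives by all predicted positives, which needs the false positive count from the missing matrix.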

Assume we fit a univariate regression line to the scatterplot shown in Figure 1 below. Which of the following statements is incorrect? Figure 1 shows the relationship between high school GPA percentile (HSPercentile) and SAT score. In this scatterplot each dot represents a student. HSPercentile is on the horizontal axis and SAT score is on the vertical axis. A. The slope of the regression line would represent how much the SAT score changes if the HSPercentile changes by 1%. B. The slope of the regression line will be positive. C. The intercept of the line would represent the HSPercentile of a student given his/her/their SAT score is 0. D. The intercept of the line would represent the SAT score of a student given his/her/their HSPercentile is 0.

C. The intercept of the line would represent the HSPercentile of a student given his/her/their SAT score is 0.

Assume we have data on houses sold in Philadelphia including their size (sq-foot), number of bedrooms, whether the house has a garage or not (1 = has garage, 0 = does not have a garage) and we use all these variables to predict the selling price ($) of a house. Which of the following statements is correct? A. We should use a scatter plot. B. We should use a univariate linear regression. C. We should use a multiple linear regression. D. We should use a logistic regression.

C. We should use a multiple linear regression.

Consider the following sentence: "We had to stay at home for two weeks." After removing the stop words, which of the below shows the result of word tokenization? A. ["We", "stay", "home", "weeks"] B. ["We", "had", "stay", "home", "two", "weeks"] C. ["stay", "home", "two", "weeks"] D. ["stay", "at", "home", "for", "two", "weeks"]

C. ["stay", "home", "two", "weeks"]
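
A minimal sketch of the two steps in this question, tokenizing and then dropping stop words. The stop-word list here is a tiny hand-picked assumption that covers only this sentence; real text mining libraries ship much longer lists.

```python
import re

# Hand-picked stop words for this one sentence (an assumption, not a standard list).
STOP_WORDS = {"we", "had", "to", "at", "for"}


def tokenize(text):
    """Split text into word tokens, discarding punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def remove_stop_words(tokens):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


sentence = "We had to stay at home for two weeks."
tokens = remove_stop_words(tokenize(sentence))
print(tokens)  # ['stay', 'home', 'two', 'weeks']
```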

In text mining, tokenizing is the process of _________________. A. translating the words in a text to a different language B. finding all the synonym words in a text C. breaking a text into simple units, like sentences or words D. assigning a score to each word in a text based on their negative or positive sentiment

C. breaking a text into simple units, like sentences or words

Which of the below is a method to deal with filling out the missing values in data? A. data merging B. data visualization C. data imputation D. data consolidation

C. data imputation

A regression model is developed to predict which students will be retained in second year (i.e., still enrolled in second year). Using various characteristics in the given dataset, the variable we want to predict is "SecondFallRegistered" which is either yes (=1) if the student is still enrolled in second year (measure of retention), or no (=0) if the student dropped out between the beginning of first year and second year of college. In this case, a ______________ would be the best suited model to predict the probability of retention of a student. A. Scatter plot B. Univariate linear regression C. Multiple linear regression D. Logistic regression

D. Logistic regression
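
To show why logistic regression suits a binary outcome, the sketch below passes a linear score through the logistic (sigmoid) function so the output always lies between 0 and 1. The single predictor and both coefficient values are hypothetical; they are not fitted to the student retention dataset.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def retention_probability(hs_percentile, intercept=-2.0, slope=0.05):
    """Hypothetical one-variable logistic model; coefficients are made up."""
    return sigmoid(intercept + slope * hs_percentile)


p_low = retention_probability(20)
p_high = retention_probability(90)
print(round(p_low, 3), round(p_high, 3))
```

A linear regression on the same 0/1 variable could produce predictions below 0 or above 1, which cannot be read as probabilities.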

Assume a regression model developed to predict the annual revenue of a company (y) using the number of employees (x1), number of production facilities (x2), number of warehouses (x3), and whether the company has a facility in Philadelphia (x4, where 1 = yes, the company has a facility in Philadelphia and 0 = no, the company doesn't have a facility in Philadelphia) is given as follows: y = 20000 + 300.24*x1 + 15000*x2 + 28000*x3 + 4500*x4. What is the predicted annual revenue of a company that has 1500 employees, 3 production facilities, 2 warehouses and does not have a facility in Philadelphia? A. $487,360 B. $501,260 C. $547,860 D. $571,360 E. $575,860

D. $571,360
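
The arithmetic behind this answer can be checked with a short sketch that plugs the inputs into the regression equation from the question. The function and parameter names are chosen for this example only.

```python
def predict_revenue(employees, facilities, warehouses, in_philadelphia):
    """Evaluate y = 20000 + 300.24*x1 + 15000*x2 + 28000*x3 + 4500*x4."""
    return (20000
            + 300.24 * employees
            + 15000 * facilities
            + 28000 * warehouses
            + 4500 * in_philadelphia)


# 1500 employees, 3 production facilities, 2 warehouses, no Philadelphia facility.
revenue = predict_revenue(1500, 3, 2, 0)
print(f"${revenue:,.0f}")

# Effect of one extra production facility, all else equal.
extra_facility = predict_revenue(1500, 5, 2, 0) - predict_revenue(1500, 4, 2, 0)
# Effect of having a Philadelphia facility, all else equal.
philadelphia_effect = predict_revenue(1500, 3, 2, 1) - predict_revenue(1500, 3, 2, 0)
```

Term by term: 20000 + 450360 + 45000 + 56000 + 0 = 571360. Each coefficient is the change in predicted revenue for a one-unit change in its variable, so one extra production facility adds 15000 and a Philadelphia facility adds 4500.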

Using the student retention dataset, your team develops a regression model to predict the CombinedScore (i.e., score used to rank students) using HSPercentile, Gender (where Female=0 and Male=1) and student's SAT score. The regression equation is given by: CombinedScore = 118.95 + 1.91*HSPercentile - 1.74*Gender + 0.06*SAT. Assume there are two students, Neal and Jimmy, with the same high school GPA percentile and gender. Neal's SAT score is one point lower than Jimmy's. Using the regression model shown in Question 3, Neal's predicted combined score would be _______. A. 1.91 lower than Jimmy's combined score B. 1.91 higher than Jimmy's combined score C. 0.06 higher than Jimmy's combined score D. 0.06 lower than Jimmy's combined score

D. 0.06 lower than Jimmy's combined score

Use the following description for Questions 5-6. A regression model is developed to predict the blood alcohol level of individuals (referred to as BAC) using their weight (referred to as weight, measured in pounds) and number of alcoholic beverages they consumed on a given evening (referred to as drinks). The regression model summary output from Excel Analysis ToolPak is shown in Figure 2 below. Figure 2. Regression model results - predicting BAC as a function of weight and drinks. Considering the model fit to given data, what percent of variation in data do the independent variables in this regression model explain? A. 0.039 % B. 3.986 % C. 15.00 % D. 95.17 %

D. 95.17 %

What is the main purpose of imputation methods? A. Normalize data to reduce the range of values to a standard range B. Convert numeric variables into discrete variables C. Improve the over- and under-sampling issues in data D. Fill in missing values with most appropriate values

D. Fill in missing values with most appropriate values
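
One simple imputation strategy is to replace each missing value with the mean of the observed values. The sketch below assumes missing entries are stored as None and uses made-up usability ratings; other strategies (median, most frequent value, model-based estimates) follow the same pattern.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the non-missing entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical usability ratings with two missing responses.
ratings = [4, None, 5, 3, None]
imputed = impute_mean(ratings)
print(imputed)
```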

Which of the following is a segmentation model that classifies the items in a dataset based on pairwise distances between these items until every observation is linked into one large group? A. Market-basket analysis B. Logistic regression C. Simple split D. Hierarchical clustering

D. Hierarchical clustering

Which chart type would be most helpful to show the distribution and skewness of tech sector annual turnover rate? A. Line chart B. Pie chart C. Scatterplot D. Histogram

D. Histogram

According to the American Academy of Pediatrics, toddlers (i.e., children who are 12 to 36 months old) need plenty of sleep during the night. However, studies show that many toddlers struggle and resist sleeping through the night. Dr. Capan has a 22-month-old toddler whose night sleep is rather difficult to predict. To analyze her toddler's sleep, she collects daily data including: • day of the week (categorical, where the categories are: Monday,..., Sunday), • length of mid-day nap (numerical, in Minutes), • type of snacks he had before going to bed (categorical where the categories are: fruit, yoghurt, crackers), • length of night sleep (numerical, in hours) • night sleep category (binary, takes values 0=short or 1=long --> short means night sleep < 12 hours, and long means night sleep >= 12 hours) Which data mining method would be best suited to predict the length of night sleep (numerical) using the variables: day of the week, length of mid-day nap, and type of snacks he had before going to bed? A. Time Series B. Association Rule Mining C. Linear Regression D. Logistic Regression

C. Linear Regression

Which data mining method would be best suited to predict the night sleep category (binary) using the variables: day of the week, length of mid-day nap, and type of snacks he had before going to bed? A. Time Series B. Association Rule Mining C. Linear Regression D. Logistic Regression

D. Logistic Regression

Which of the below would be considered a stop word? A. Two B. Sign C. Cut D. Me

D. Me

Assume we have data on houses sold in Philadelphia including their size (sq-foot), number of bedrooms, whether the house has a garage or not (1= has garage, 0 = does not have a garage) and we use all these variables to predict the selling price ($) of a house. The dependent variable(s) is/are: A. Size of a house B. Whether a house has a garage or not (1= has garage, 0 = does not have a garage) C. Size of a house, number of bedrooms, and whether the house has a garage or not (1= has garage, 0 = does not have a garage) D. Selling price of a house

D. Selling price of a house

Figure 1 shows the word cloud created using the posted content in a text-mining related blog on the internet. Figure 1. Word cloud example derived from a text-mining related blog. Which of the following statements is incorrect? A. Size of the words indicates frequency of the use of these words in the posted content B. Most frequently used words and acronyms are NLP, natural, language, and processing C. The content in this blog contains the word "linguistics" more frequently than the word "public" D. The content in this blog contains the word "layout" more frequently than the word "language"

D. The content in this blog contains the word "layout" more frequently than the word "language"

When querying a 3-dimensional database, a user goes from summarized data (e.g., quarters) to its underlying details (e.g., months). The OLAP function that serves this purpose is: A. slice B. dice C. drill up D. drill down

D. drill down

Which type of data visualization method can be helpful when the intention is to show the distribution of annual salary at the Research and Development department of a health tech company? A. scatterplot B. line chart C. pie chart D. histogram

D. histogram

When querying a dimensional database, a user transforms the data coming from rows of a table into data grouped on several columns. The OLAP function that serves this purpose is: A. slice B. dice C. drill down D. pivot

D. pivot

Peloton is an American exercise equipment company founded in 2012. Their signature product, the Peloton bike, is a high-end indoor bicycle with a Wi-Fi-enabled touchscreen tablet that streams live and on-demand classes. According to available data, the company has sold more than 550,000 bikes and treadmills since 2013. Mia, a business analyst and an active new mom who is just back from a break from exercise has been considering buying an exercise bike. Before making the purchase decision she gathers data from multiple sources to analyze features of Peloton and its main competitors. While the raw data includes key features such as price, class types, usability, and market share, it requires preprocessing before any meaningful analysis can be conducted. Refer to Week 2 and 3 class content and reading material to respond to question below. Which of the following data preprocessing activities that Mia conducts is not associated with data cleaning? A. Fill in missing values for usability evaluations of bike users B. Identify erroneous values regarding class features, e.g. length of class with a negative value C. Eliminate duplicate values regarding live and on-demand classes D. Derive a new variable representing total time of class material from existing variables

D. Derive a new variable representing total time of class material from existing variables

Assume there are 4 students (Sam, Joe, Max and Hannah), and we want to develop a hierarchical clustering model to group the students into clusters. The input variables used in this clustering model to calculate the distance values are: Midterm exam grade (measured as grade points on a scale from 0 to 100) and Final exam grade (measured as grade points on a scale from 0 to 100). Euclidean distance values (measured as grade points) between all pairs of students are given in Table 1 below. For example, the Euclidean distance between Sam and Max is 11 points. Table 1. Distance values between all student pairs. Is the statement true or false? "The input variables used in this model are not on the same scale, and this makes comparing the distance between students difficult. We need to convert the input variables to be on a similar scale by standardizing."

False

Comparing two univariate linear regression models (Model 1 and Model 2) developed using the same dataset, assume that Model 2 has an R-squared of 0.58, and Model 1 has an R-squared of 0.89. Is the following statement true or false? "Considering how well these two regression models explain the variation in the data, we would choose Model 1 because it has a better model fit compared with Model 2."

True

Is the following statement true or false? "Based on the results shown in Table 2, the true positive rate is higher than the true negative rate."

False

Is the following statement true or false? "Comparing two companies with the same features, except one has 4 and the other has 5 production facilities, the annual revenue of the company with 5 production facilities will be $28000 higher compared to the company that has 4 production facilities."

False

Is the following statement true or false? "Data reduction can only be applied to rows (observations) but not to columns (variables) in a given dataset."

False

Is the following statement true or false? "In Association Rule Mining, confidence is a metric that represents the probability of observing the items X and Y together in a given dataset, P(X&Y)."

False
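
The statement is false because P(X&Y) is the support of the rule, while confidence is the conditional probability P(Y|X) = support(X&Y) / support(X). The sketch below computes both on a handful of made-up shopping baskets; the item names are assumptions for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """Estimate P(Y | X) as support(X and Y) divided by support(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


# Hypothetical baskets (illustrative only).
baskets = [
    {"bike", "shoes"},
    {"bike", "shoes", "mat"},
    {"bike"},
    {"mat"},
]
rule_support = support(baskets, {"bike", "shoes"})          # P(X & Y)
rule_confidence = confidence(baskets, {"bike"}, {"shoes"})  # P(Y | X)
print(rule_support, rule_confidence)
```

Here bike and shoes appear together in 2 of 4 baskets (support 0.5), but among the 3 baskets containing a bike, 2 also contain shoes (confidence 2/3).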

Is the following statement true or false? "Linear regression models represent the relationship between one or more independent variables and a binary dependent variable (i.e., a variable that takes values 0=no and 1=yes)."

False

Is the following statement true or false? "Structured data represents organized data sources such as visual, text and web content."

False

Is the following statement true or false? "Decision support systems are computer-based support systems that have precise definitions agreed to by practitioners."

False

Is the statement true or false? "The input to the text mining algorithms is comprised of structured data."

False

Assume there are 4 students (Sam, Joe, Max and Hannah), and we want to develop a hierarchical clustering model to group the students into clusters. The input variables used in this clustering model to calculate the distance values are: Midterm exam grade (measured as grade points on a scale from 0 to 100) and Final exam grade (measured as grade points on a scale from 0 to 100). Euclidean distance values (measured as grade points) between all pairs of students are given in Table 1 below. For example, the Euclidean distance between Sam and Max is 11 points. Table 1. Distance values between all student pairs. Is the statement true or false? "Based on the distance values in Table 1, the first cluster in a hierarchical clustering model will include {Joe and Hannah}."

True
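
Agglomerative hierarchical clustering starts by merging the pair of observations with the smallest distance. Table 1 is not reproduced in this set, so in the sketch below only the Sam and Max distance (11) comes from the question; every other value is a placeholder chosen so that Joe and Hannah are closest, matching the stated answer.

```python
# Pairwise distances in grade points. Only Sam-Max = 11 is from the question;
# all other numbers are hypothetical placeholders.
distances = {
    frozenset({"Sam", "Max"}): 11,
    frozenset({"Sam", "Joe"}): 20,
    frozenset({"Sam", "Hannah"}): 18,
    frozenset({"Joe", "Max"}): 15,
    frozenset({"Joe", "Hannah"}): 5,
    frozenset({"Max", "Hannah"}): 14,
}

# The first merge joins the pair with the smallest distance.
first_pair = min(distances, key=distances.get)
print(sorted(first_pair))  # ['Hannah', 'Joe']
```

Later merges depend on the linkage rule (single, complete, average), which decides how the distance from a cluster to the remaining points is measured.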

Is the following statement true or false? "Choice of visualization method that meets the presentation requirements for a given piece of data depends on the data types available, purpose of the visual, and context."

True

Is the following statement true or false? "Classification learns the patterns in data using labeled output and input variables to predict the category of a new observation."

True

Is the following statement true or false? "Comparing two companies with the same features, except one has a facility in Philadelphia and the other does not, the annual revenue of the company without a facility in Philadelphia will be $4500 lower compared to a company that has a facility in Philadelphia."

True

Is the following statement true or false? "Descriptive statistics is about describing the sample data on hand, such as most likely values, extreme values, and spread."

True

Is the following statement true or false? "Due to the potential risk of model overfitting, rather than using all available data for model development, we split the data into training and testing data; we use the training data for model development and evaluate model performance using the testing data."

True
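
A minimal sketch of a random train/test split using only the standard library. The 25% test fraction, the fixed seed, and the 72 placeholder rows are assumptions for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(72))  # stand-in for 72 observations
train, test = train_test_split(rows)
print(len(train), len(test))  # 54 18
```

The model is fit on `train` only; performance measured on `test` then reflects how the model handles data it has not seen.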

Is the following statement true or false? "In a clustering model with two numerical input variables used for clustering, if the input variables are not on the same scale, standardizing is used to convert the variables and compare them on a single scale."

True
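
Standardizing usually means converting each value to a z-score: subtract the variable's mean and divide by its standard deviation. The sketch below uses the population standard deviation and two made-up variables on very different scales; both choices are assumptions for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert values to z-scores (mean 0, standard deviation 1)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant variable carries no distance information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical inputs on different scales.
income_dollars = [40000, 55000, 70000, 85000]
age_years = [25, 35, 45, 55]
z_income = standardize(income_dollars)
z_age = standardize(age_years)
print(z_income)
print(z_age)
```

Without this step, a Euclidean distance would be dominated by income simply because its numbers are larger.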

Is the following statement true or false? "Linear regression analysis can be used to predict an unknown value of a numerical dependent variable using numeric and/or categorical independent variables."

True

Is the following statement true or false? "Web structure mining focuses on navigation through a website by analyzing the links in Web documents, and Web content mining is related to extraction of information from Web pages using text mining."

True

Is the following statement true or false? "Tokenization is the process of breaking complex data like paragraphs into simple units called tokens. Specifically, sentence tokenization splits a paragraph into a list of words."

False

________ analytics help managers understand what is happening in the organization by analyzing trends and patterns in data. a) Descriptive b) Predictive c) Prescriptive d) Data warehouse

a) Descriptive

Which of the following statements describes the key takeaway points of Analytics Applications in Healthcare - Humana Example 1? a) Humana's prevention efforts aim to treat all injuries after a fall more efficiently. b) As a strategy to enhance well-being of its members, Humana promotes increased utilization of physical therapy as a better alternative than investing in high-cost analytics. c) Using analytics can reduce healthcare costs by identifying subgroups of Humana members that are at high risk of falling and managing their fall risk. d) For Humana members, cost of treatment and recovery are less expensive than preventative measures.

c) Using analytics can reduce healthcare costs by identifying subgroups of Humana members that are at high risk of falling and managing their fall risk.

Using characteristics of first year undergraduate students, such as age, gender, major, location, workout/sports activities, if we developed a model to forecast which students are at risk of dropping out after the first year of college, decided which students to reach out to and offered them support services to reduce their risk of dropping out, what kind of analytics application would this work represent? a) descriptive analytics b) data warehouse analytics c) prescriptive analytics d) predictive analytics

c) prescriptive analytics

According to the Humana Example 3 (Section 1.6, Pg. 32-33), which data source(s) mentioned in the example can be used to identify the patterns of risk factors that increase an individual's risk of falling? a) A: Medical claims b) B: Self-reported health risk assessment data c) C: Pharmacy data d) A, B, and C

d) A, B, and C

According to Humana Example 1, and the article by Gates et al. (2008), many risk assessment methods, such as screening tools to identify fall risk factors, already exist. In your own words, give two reasons why your team should not use one of the existing risk assessment methods.

• Existing models have limited reach: different test attributes may be needed to predict falls successfully in different populations. For example, the timescale over which a prediction is needed varies from a few days or weeks in hospitalized patients to a year or more for community-living populations. Tools developed for one population may therefore be less accurate when used in a different setting.
• Existing tools lack sufficient predictive power: they are not based on sound evidence that they are useful in discriminating between people who will fall and those who will not.

