BSAN Final Exam practice questions
(36)Refer to Figure 10. Which of the following statement is incorrect? Figure 10: Characteristics of survey participants
"State" is an ordinal variable
(56)Use the following description to answer Questions 6-8.Assume a regression model is developed to predict the annual revenue of a company (y) using the number of employees (x1), number of production facilities (x2), number of warehouses (x3), and whether the company has a facility in Philadelphia (x4, where 1= yes, the company has a facility in Philadelphia and 0 = no, the company doesn't have a facility in Philadelphia) is given as follows:y = 20000 + 300.24*x1 + 15000*x2 + 28000*x3 + 4500*x4 What is the predicted annual revenue of a company that has 1500 employees, 3 production facilities, 2 warehouses and does not have a facility in Philadelphia?
$571,360
(2) Refer to Figure 2 Figure 1 shows the Annual Salary distribution (histogram) in Department 1. Figure 2 shows the Annual Salary distribution (histogram) in Department 2. Which of the following statement(s) is/are correct?
( I )Annual salary values are more spread out at Department 2 compared with Department 1.
(50) When analyzing the data of household income of a selected population, analysts notice that 5% of observations are missing and entered as N/A (not available). Which of the following method(s) can be used to prepare the data before conducting descriptive analysis?
(I, II, & III)
79. Refer to Figure 14 Compute the Expected Monetary Value (EMV) of net profit for decision alternative "stay in current location". Enter the numeric EMV value in the box below Do not use a dollar sign, do not use decimal points, do not use a comma separator for thousands.
-
80. Refer to Figure 14 Compute the Expected Monetary Value (EMV) of net profit for decision alternative "expand". Enter the numeric EMV value in the box below Do not use a dollar sign, do not use decimal points, do not use a comma separator for thousands.
-
(17). Refer to figure 5 A regression model is developed to predict the blood alcohol level of individuals (referred as BAC) using their weight (referred as weight, measured in pounds) and number of alcoholic beverages they consumed on a given evening (referred as drinks). The regression model summary output from Excel Analysis ToolPak is shown in Figure 5 . What is the approximate predicted BAC for a person who weighs 140 pounds and had 3 drinks? Note that the intercept starts with 0.03, not 0.08!
0.048
(14)Using the student retention dataset, your team develops a regression model to predict the CombinedScore (i.e., score used to rank students) using HSPercentile, Gender (where Female=0 and Male=1) and student's SAT score. The regression equation is given by:CombinedScore =118.95 + 1.91*HSPercentile - 1.74*(Gender) + 0.06*SAT Assume there are two students, Neal and Jimmy, with the same high school GPA percentile and gender. Neal's SAT score is one point lower than Jimmy's. Using the regression model shown in Question 3, Neal's predicted combined score would be _______.
0.06 lower than Jimmy's combined score
(25)Consider the following dataset in Figure 7 that an analytics team at Netflix collected from 10 customers including information about the viewers and their content consumption choices. Figure 7. Netflix data where 1=yes, and 0=no What is the support, P(X&Y), for the rule X ⇒ Y where X represents Married and Y represents Watched Breaking Bad = 1?
10%
(30)Assume a two-class classification model is developed to classify hospitalized patients as "healthy" (referred as positive) and "sick" (referred as negative). When the model is applied to 200 new patients to classify them as healthy or sick, and predicted class for each patient (healthy / sick) is compared to the actual/observed class (healthy / sick). Figure 9 shows the model evaluation results. What is the false positive (FP) count?
20
(24)Consider the following dataset in Figure 7 that an analytics team at Netflix collected from 10 customers including information about the viewers and their content consumption choices. Figure 7. Netflix data where 1=yes, and 0=no .What is the support, P(X&Y), for the rule X ⇒ Y where X represents Female and Y represents Watched Breaking Bad= 1?
30%
(68)Consider the item set in Figure 1 (with 6 items: eggs, beer, juice which is represented by the red glass, milk, diapers, and bread) in a dataset with 5 transactions where each row in the dataset is a shopping purchase transaction. For example, transaction 1 is the first row in dataset and includes the purchase of bread and milk, transaction 2 is the second row and includes the purchase of bread, diapers, beer, and eggs, etc. Figure 11.Itemset and a dataset derived from shopping data. Support for Beer and Milk being purchased together is ________.
40%
(27)Consider the following dataset in Figure 7 that an analytics team at Netflix collected from 10 customers including information about the viewers and their content consumption choices. Figure 7. Netflix data where 1=yes, and 0=no What is the confidence, P(Y|X), for the rule X ⇒ Y where X represents Single and Y represents Watched Breaking Bad = 1?
50%
(69)Consider the item set in Figure 1 (with 6 items: eggs, beer, juice which is represented by the red glass, milk, diapers, and bread) in a dataset with 5 transactions where each row in the dataset is a shopping purchase transaction. For example, transaction 1 is the first row in dataset and includes the purchase of bread and milk, transaction 2 is the second row and includes the purchase of bread, diapers, beer, and eggs, etc. Figure 11.Itemset and a dataset derived from shopping data. Confidence for Milk and Juice (that is the probability P(Y|X) where X= Milk and Y =Juice) is ___________
50%
(31)Assume a two-class classification model is developed to classify hospitalized patients as "healthy" (referred as positive) and "sick" (referred as negative). When the model is applied to 200 new patients to classify them as healthy or sick, and predicted class for each patient (healthy / sick) is compared to the actual/observed class (healthy / sick). Figure 9 shows the model evaluation results. How many mistakes (misclassifications) did the model make?
60
(26)Consider the following dataset in Figure 7 that an analytics team at Netflix collected from 10 customers including information about the viewers and their content consumption choices. Figure 7. Netflix data where 1=yes, and 0=no What is the confidence, P(Y|X), for the rule X ⇒ Y where X represents Watched Breaking Bad = 1 and Y represents Watched Ozark = 0?
60%
(16). Refer to figure 5 A regression model is developed to predict the blood alcohol level of individuals (referred as BAC) using their weight (referred as weight, measured in pounds) and number of alcoholic beverages they consumed on a given evening (referred as drinks). The regression model summary output from Excel Analysis ToolPak is shown in Figure 5 . Considering the model fit to given data, what percent of variation in data do the independent variables in this regression model explain?
95.17 %
72. Which of the below is not an example of textual data?
A database including features of all active Twitter users such as their age and annual income
(29)Assume there are 4 customers (Kristen, Dennis, Ben and Ross), and we want to develop a hierarchical clustering model to group the customers into clusters. The input variables used in this clustering model to calculate the distance values are: how much the individual spent for grocery shopping in winter 2020 (measured in $) and how much the individual spent for grocery shopping in summer 2020 (measured in $).Euclidean distance values (measured in $) between all pairs of customers are given in Table 1 below. For example, the Euclidean distance between Dennis and Ben is $360. Figure 8 :Distance values between all customer pairs Using the single linkage criteria, after creating the first cluster, the next step would be:
Add Ben to the cluster {Dennis and Ross}
(71)Assume there are 4 students (Sam, Joe, Max and Hannah), and we want to develop a hierarchical clustering model to group the students into clusters. The input variables used in this clustering model to calculate the distance values are: Midterm exam grade (measured as grade points on a scale from 0 to 100) and Final exam grade (measured as grade points on a scale from 0 to 100). Euclidean distance values (measured as grade points) between all pairs of students are given in Table 1 below. For example, the Euclidean distance between Sam and Max is 11 points. Figure 12:Distance values between all student pairs Using the single linkage criteria, after creating the first cluster, the next step would be:
Add Sam to the cluster {Joe and Hannah}
78. Consider the following sentence: "I hate unloading the dishwasher." Which of the following statements is incorrect?
After removing the stop words, the bigram method creates the vector: [ "hate", "unloading", "dishwasher"]
(1) Refer to Figure 1 Julie Miller, a manager at the health tech company, Wellness Path, oversees two departments in the wearables business line: Product Development (referred as Department 1), and Product Testing (referred as Department 2). Both departments focus on bringing impactful products to the market that provide a (3)competitive advantage due to their design features and interoperability with other devices. Table 1 below shows selected descriptive statistics for age in Department 1 and Department 2. Question 1. Based on Table 1, which of the following statements is correct?
Age distribution at Department 2 is negatively skewed
73. Which of the below is not a Text Mining goal?
Applying analytical methods to structured data
(23)According to the American Academy of Pediatrics, toddlers (i.e., 12 to 36 months ) need plenty of sleep during the night. However, studies show that many toddlers struggle and resist sleeping through the night. Dr. Capan has a 22-month-old toddler whose night sleep is rather difficult to predict. To analyze her toddler's sleep, she collects daily data including :• (1)day of the week (categorical, where the categories are: Monday,..., Sunday), • (4)length of mid-day nap (numerical, in Minutes), • (3)type of snacks eaten b4 bed (categorical where the categories are: fruit, yoghurt, etc, • (2)length of night sleep (numerical, in hours) • (5)night sleep category (binary, takes values 0=short or 1=long --> short means night sleep < 12 hours, and long means night sleep >= 12 hours) Which data mining method would be best suited to find out which days (categorical) and night sleep categories (binary) are fr
Association Rule Mining
(63)In data mining, finding an affinity of two products to be commonly purchased together is known as
Association rule mining
(3) Refer to Figure 3 Employees of both departments were asked the following question: " On a scale from 0 (worst value) to 10 (best value), how would you rate your job satisfaction at your current work position?" Results are shown in Figure 3. Which of the following statement is correct? Boxplots of job satisfaction for Departments 1 and 2 where 0 is the worst possible satisfaction value and 10 is the best satisfaction value.
At Department 1, 75% of all job satisfaction observations are below 4
82. Figure to 14 Which decision would you recommend if Lumos wants to look at the best possible outcome for each decision and choose the decision that has the best "best outcome" with regards to net profit?
Choose to expand
83. Figure to 14 Which decision would you recommend if Lumos wants to choose the decision by using a weighted average of the possible payoffs - in other words uses the Expected Monetary Value (EMV) and picks the decision with the largest EMV with regards to net profit?
Choose to expand
81. Refer to Figure 14 Which decision would you recommend if Lumos wants to look at the worst possible outcome for each decision and choose the decision that has the best "worst outcome" with regards to net profit?
Choose to stay in current location
(28)Assume there are 4 customers (Kristen, Dennis, Ben and Ross), and we want to develop a hierarchical clustering model to group the customers into clusters. The input variables used in this clustering model to calculate the distance values are: how much the individual spent for grocery shopping in winter 2020 (measured in $) and how much the individual spent for grocery shopping in summer 2020 (measured in $).Euclidean distance values (measured in $) between all pairs of customers are given in Table 1 below. For example, the Euclidean distance between Dennis and Ben is $360. Figure 8 :Distance values between all customer pairs Based on the distance values in Figure 8, the first cluster in a hierarchical clustering model will include:
Dennis and Ross
(10)Peloton is an American exercise equipment company founded in 2012. Their signature product, the Peloton bike, is a high-end indoor bicycle with a Wi-Fi-enabled touchscreen tablet that streams live and on-demand classes. According to available data, the company has sold more than 550,000 bikes and treadmills since 2013.Mia, a business analyst and an active new mom who is just back from a break from exercise has been considering buying an exercise bike. Before making the purchase decision she gathers data from multiple sources to analyze features of Peloton and its main competitors. While the raw data includes key features such as price, class types, usability, and market share, it requires preprocessing before any meaningful analysis can be conducted. Which of the following data preprocessing activity that Mia conducts is not associated with data cleaning?
Derive a new variable representing total time of class material from existing variables
(32)Assume a two-class classification model is developed to classify hospitalized patients as "healthy" (referred as positive) and "sick" (referred as negative). When the model is applied to 200 new patients to classify them as healthy or sick, and predicted class for each patient (healthy / sick) is compared to the actual/observed class (healthy / sick). Figure 9 shows the model evaluation results. Based on the results shown in Figure 9, the true positive rate is higher than the true negative rate."
False
(38)"Structured data represents organized data sources such as visual, text and web content."
False
(46) Data reduction can only be applied to rows (observations) but not to columns (variables) in a given dataset.
False
(51)Linear regression models represent the relationship between one or more independent variables and a binary dependent variable (i.e., a variable that takes values 0=no and 1=yes).
False
(57)Use the following description to answer Questions 6-8.Assume a regression model is developed to predict the annual revenue of a company (y) using the number of employees (x1), number of production facilities (x2), number of warehouses (x3), and whether the company has a facility in Philadelphia (x4, where 1= yes, the company has a facility in Philadelphia and 0 = no, the company doesn't have a facility in Philadelphia) is given as follows:y = 20000 + 300.24*x1 + 15000*x2 + 28000*x3 + 4500*x4 Comparing two companies with the same features, except one has 4 and the other has 5 production facilities, the annual revenue of the company with 5 production facilities will be $28000 higher compared to the company that has 4 production facilities
False
(64)In Association Rule Mining, confidence is a metric that represents the probability of observing the items X and Y together in a given dataset, P(X&Y)
False
(69)Assume there are 4 students (Sam, Joe, Max and Hannah), and we want to develop a hierarchical clustering model to group the students into clusters. The input variables used in this clustering model to calculate the distance values are: Midterm exam grade (measured as grade points on a scale from 0 to 100) and Final exam grade (measured as grade points on a scale from 0 to 100). Euclidean distance values (measured as grade points) between all pairs of students are given in Table 1 below. For example, the Euclidean distance between Sam and Max is 11 points. Figure 12:Distance values between all student pairs The input variables used in this model are not on the same scale, and this makes comparing the distance between students difficult. We need to convert the input variables to be on a similar scale by standardizing.
False
74. Is the following statement true or false?"Tokenization is the process breaking complex data like paragraphs into simple units called tokens. Specifically, sentence tokenization splits a paragraph into a list of words."
False
88. Is the statement true or false?"In this linear programming (LP) model, producing 30 laptops, 40 PCs, and 150 tablets is a feasible
False
(11)What is the main purpose of imputation methods?
Fill in missing values with most appropriate values
(7)Peloton is an American exercise equipment company founded in 2012. Their signature product, the Peloton bike, is a high-end indoor bicycle with a Wi-Fi-enabled touchscreen tablet that streams live and on-demand classes. According to available data, the company has sold more than 550,000 bikes and treadmills since 2013.Mia, a business analyst and an active new mom who is just back from a break from exercise has been considering buying an exercise bike. Before making the purchase decision she gathers data from multiple sources to analyze features of Peloton and its main competitors. While the raw data includes key features such as price, class types, usability, and market share, it requires preprocessing before any meaningful analysis can be conducted. Which of the following data preprocessing activity/activities that Mia conducts would fall under data cleaning?
Fill in missing values with most appropriate values
(47)Which of the below statement(s) is/are correct
For numerical variables, normalizing the observed values between 0 and 1 is a data transformation subtask.
(67)Which of the following is a segmentation model that classifies the items in a dataset based on pairwise distances between these items until every observation is linked into one large group?
Hierarchical clustering
(4)Which chart type would be most helpful to show the distribution and skewness of tech sector's annual turnover rate?
Histogram
(35)In ___________, data is randomly split into mutually exclusive subsets and tested multiple times on each left-out subset, using others as a training set
K-Fold cross validation
(49) _____ is an imputation method where most recently available value of a variable is used to replace missing values until a new observation is available in the data.
Last observation carry forward
(5) LinkedIn data shows that job retention is a challenge in every industry. Especially in the tech sector, worldwide annual turnover rate changes year by year with values between 10 to 14 percent turnover rate depending on the year. Within the tech sector, different sub-sectors also show differences with regards to job retention. Sectors such as the computer games and computer software industries drive tech turnover the most compared to other sectors. Which chart type would be most helpful to show the trend of worldwide annual turnover rate over time?
Line chart
(8)Peloton is an American exercise equipment company founded in 2012. Their signature product, the Peloton bike, is a high-end indoor bicycle with a Wi-Fi-enabled touchscreen tablet that streams live and on-demand classes. According to available data, the company has sold more than 550,000 bikes and treadmills since 2013.Mia, a business analyst and an active new mom who is just back from a break from exercise has been considering buying an exercise bike. Before making the purchase decision she gathers data from multiple sources to analyze features of Peloton and its main competitors. While the raw data includes key features such as price, class types, usability, and market share, it requires preprocessing before any meaningful analysis can be conducted. Which chart should Mia use to visualize the number of new members joining the customer community every month from 2012 to 2020?
Line chart
(20)According to the American Academy of Pediatrics, toddlers (i.e., 12 to 36 months ) need plenty of sleep during the night. However, studies show that many toddlers struggle and resist sleeping through the night. Dr. Capan has a 22-month-old toddler whose night sleep is rather difficult to predict. To analyze her toddler's sleep, she collects daily data including :• (1)day of the week (categorical, where the categories are: Monday,..., Sunday), • length of mid-day nap (numerical, in Minutes), • (3)type of snacks eaten b4 bed (categorical where the categories are: fruit, yoghurt, etc, • (2)length of night sleep (numerical, in hours) • night sleep category (binary, takes values 0=short or 1=long --> short means night sleep < 12 hours, and long means night sleep >= 12 hours) Which data mining method would be best suited to predict the length of night sleep (numerical) using the variables 1,2,3
Linear Regression
(22)According to the American Academy of Pediatrics, toddlers (i.e., 12 to 36 months ) need plenty of sleep during the night. However, studies show that many toddlers struggle and resist sleeping through the night. Dr. Capan has a 22-month-old toddler whose night sleep is rather difficult to predict. To analyze her toddler's sleep, she collects daily data including :• (1)day of the week (categorical, where the categories are: Monday,..., Sunday), • (4)length of mid-day nap (numerical, in Minutes), • (3)type of snacks eaten b4 bed (categorical where the categories are: fruit, yoghurt, etc, • (2)length of night sleep (numerical, in hours) • (5)night sleep category (binary, takes values 0=short or 1=long --> short means night sleep < 12 hours, and long means night sleep >= 12 hours) Which data mining method would be best to predict the night sleep category using the variables: 1, 4, 3
Logistic Regression
(15) A regression model is developed to predict which students will be retained in second year (i.e., still enrolled in second year). Using various characteristics in the given dataset, the variable we want to predict is "SecondFallRegistered" which is either yes (=1) if the student is still enrolled in second year (measure of retention), or no (=0) if student dropped out between the beginning of first year and second year of college.In this case, a ______________ would be the best suited model to predict the probability of retention of a student.
Logistic regression
(39)______ is a measure of central tendency and is the sum of all the values/observations divided by the number of observations in the data set
Mean
86. Consider the following decision problem. A company sells three different products: laptops, PCs, and tablets. The company want to decide how many laptops, PCs, and tablets to produce next quarter to maximize their total net profit from selling these products. The net profit associated with selling: a laptop is $800, a PC is $1000, and a tablet is $300. Each laptop costs $500 to produce, each PC costs $650 to produce, and each tablet costs $200 to produce. Total cost associated with producing all the products (laptops, PCs and tablets) to be sold next quarter cannot exceed $180,000. The market research shows that the company can sell at most 50 PCs and at most 100 tablets What are the decision variables in this linear programming (LP) model?
Number of products to produce next quarter (laptop, PC and tablet)
(6) LinkedIn data shows that job retention is a challenge in every industry. Especially in the tech sector, worldwide annual turnover rate changes year by year with values between 10 to 14 percent turnover rate depending on the year. Within the tech sector, different sub-sectors also show differences with regards to job retention. Sectors such as the computer games and computer software industries drive tech turnover the most compared to other sectors. Which chart type below would be most helpful to show the relative proportions of annual turnover rate of different sub-sectors (e.g., computer games, computer software and other)?
Pie chart
(9)Peloton is an American exercise equipment company founded in 2012. Their signature product, the Peloton bike, is a high-end indoor bicycle with a Wi-Fi-enabled touchscreen tablet that streams live and on-demand classes. According to available data, the company has sold more than 550,000 bikes and treadmills since 2013.Mia, a business analyst and an active new mom who is just back from a break from exercise has been considering buying an exercise bike. Before making the purchase decision she gathers data from multiple sources to analyze features of Peloton and its main competitors. While the raw data includes key features such as price, class types, usability, and market share, it requires preprocessing before any meaningful analysis can be conducted. Which chart should Mia use to visualize the relative proportion of market share of Peloton in 2020 compared to its competitors?
Pie chart
(18) Refer to Figure 6 Assume the performance of the regression model developed for student retention is being evaluated. For this purpose, we apply the model to a new dataset (new group of 72 first year students). At the beginning of second year of college we compare what the model predicted and what actually happened (i.e., whether the student retained or dropped out). The resulting confusion matrix is shown below: Given the model predicted that a student would be retained, __________ is a measure that quantifies the ratio of number of students who actually were retained compared to all students that were predicted to be retained. The value of this measure is this example is ___________.
Precision, 2/46
(19) Refer to Figure 6 Assume the performance of the regression model developed for student retention is being evaluated. For this purpose, we apply the model to a new dataset (new group of 72 first year students). At the beginning of second year of college we compare what the model predicted and what actually happened (i.e., whether the student retained or dropped out). The resulting confusion matrix is shown below: Given a student was retained, the number of times the model predicted a student's retention correctly is called _______. The value of this measure is this example is ___________.
Recall, 2/23
(55)Assume we have data on houses sold in Philadelphia including their size (sq-foot), number of bedrooms, whether the house has a garage or not (1= has garage, 0 = does not have a garage) and we use all these variables to predict the selling price ($) of a house. The dependent variable(s) is/are:
Selling price of a house
(12) If we want to analyze the impact of one unit increase/decrease in high school GPA on students' SAT score, which of the following methods should we use?
Simple linear regression with HSGPA as the independent variable and SAT score as the dependent variable
(68)In ___________, the original dataset is split once into two subsets, model is developed using the training data and model is evaluated using the testing data
Simple/single split
(40)__________ is calculated by taking the square root of the variations.
Standard deviation
(44) Historical data shows that the mean starting annual salary at a given department of a company is 95K and the median is 118K. Which of the following statement is true?
Starting annual salary is negatively skewed, in other words there are a few employees that have much lower starting annual salaries than the majority of the employees
76. Figure 13 Figure 1. Word cloud example derived from a text-mining related blog Which of the following statements is incorrect?
The content in this blog contains the word "layout" more frequently than the word "language
(13) Refer to Figure 4 Assume we fit a univariate regression line to the scatterplot - shown in Figure 4 below. Which of the following statement is incorrect? Figure shows the relationship between high school GPA percentile (HSPercentile) and SAT score. In this scatterplot each dot represents a student. HSPercentile is on the horizontal axis and SAT score is on the vertical axis.
The intercept of the line would represent the HSPercentile of a student given his/her/their SAT score is 0.
87. Consider the following decision problem. A company sells three different products: laptops, PCs, and tablets. The company want to decide how many laptops, PCs, and tablets to produce next quarter to maximize their total net profit from selling these products. The net profit associated with selling: a laptop is $800, a PC is $1000, and a tablet is $300. Each laptop costs $500 to produce, each PC costs $650 to produce, and each tablet costs $200 to produce. Total cost associated with producing all the products (laptops, PCs and tablets) to be sold next quarter cannot exceed $180,000. The market research shows that the company can sell at most 50 PCs and at most 100 tablets Which of the following is not a constraint of this model?
The net profit gained from selling a laptop is $800.
(21)According to the American Academy of Pediatrics, toddlers (i.e., 12 to 36 months ) need plenty of sleep during the night. However, studies show that many toddlers struggle and resist sleeping through the night. Dr. Capan has a 22-month-old toddler whose night sleep is rather difficult to predict. To analyze her toddler's sleep, she collects daily data including :• (1)day of the week (categorical, where the categories are: Monday,..., Sunday), • length of mid-day nap (numerical, in Minutes), • (3)type of snacks eaten b4 bed (categorical where the categories are: fruit, yoghurt, etc, • (2)length of night sleep (numerical, in hours) • night sleep category (binary, takes values 0=short or 1=long --> short means night sleep < 12 hours, and long means night sleep >= 12 hours) Which data mining method would be best to predict the length of night sleep using the previous nights' length of night sleep in t
Time Series
(33)In a clustering model with two numerical input variables used for clustering, if the input variables are not on the same scale, standardizing is used to convert the variables and compare them on a single scale."
True
(34) Due to potential risk of model overfitting, rather than using all available data we split the data into training and testing data, we use the training data for model development and evaluate model performance using the testing data
True
(37)"Descriptive statistics is about describing the sample data on hand, such as most likely values, extreme values, and spread."
True
(45)Choice of visualization method that meets the presentation requirements for a given piece of data depends on the data types available, purpose of the visual, and context
True
(52)Linear regression analysis can be used to predict an unknown value of a numerical dependent variable using numeric and/or categorical independent variables.
True
(53)Comparing two univariate linear regression models (Model 1 and Model 2) developed using the same dataset, assume that Model 2 has an R-squared of 0.58, and Model 1 has an R-squared of 0.89.Is the following statement true or false?"Considering how well these two regression models explain the variation in the data, we would choose Model 1 because it has a better model fit compared with Model 2.
True
(58)Use the following description to answer Questions 6-8.Assume a regression model is developed to predict the annual revenue of a company (y) using the number of employees (x1), number of production facilities (x2), number of warehouses (x3), and whether the company has a facility in Philadelphia (x4, where 1= yes, the company has a facility in Philadelphia and 0 = no, the company doesn't have a facility in Philadelphia) is given as follows:y = 20000 + 300.24*x1 + 15000*x2 + 28000*x3 + 4500*x4 Comparing two companies with the same features, except one has a facility in Philadelphia and the other does not, the annual revenue of the company without a facility in Philadelphia will be $4500 lower compared to a company that has a facility in Philadelphia.
True
(62)Classification learns the patterns in data using labeled output and input variables to predict the category of a new observation
True
(70)Assume there are 4 students (Sam, Joe, Max and Hannah), and we want to develop a hierarchical clustering model to group the students into clusters. The input variables used in this clustering model to calculate the distance values are: Midterm exam grade (measured as grade points on a scale from 0 to 100) and Final exam grade (measured as grade points on a scale from 0 to 100). Euclidean distance values (measured as grade points) between all pairs of students are given in Table 1 below. For example, the Euclidean distance between Sam and Max is 11 points. Figure 12:Distance values between all student pairs Based on the distance values in Table 1, the first cluster in a hierarchical clustering model will include {Joe and Hannah}
True
84. Is the following statement true or false?"An optimal solution of a linear programming model is a solution that represents the best values for all decision variables and satisfies all the constraints."
True
85. Consider the following decision problem. A company sells three different products: laptops, PCs, and tablets. The company want to decide how many laptops, PCs, and tablets to produce next quarter to maximize their total net profit from selling these products. The net profit associated with selling: a laptop is $800, a PC is $1000, and a tablet is $300. Each laptop costs $500 to produce, each PC costs $650 to produce, and each tablet costs $200 to produce. Total cost associated with producing all the products (laptops, PCs and tablets) to be sold next quarter cannot exceed $180,000. The market research shows that the company can sell at most 50 PCs and at most 100 tablets "This is a linear programming (LP) model where the objective function is a maximization."
True
(54)Assume we have data on houses sold in Philadelphia including their size (sq-foot), number of bedrooms, whether the house has a garage or not (1= has garage, 0 = does not have a garage) and we use all these variables to predict the selling price ($) of a house. Which of the following statement is correct?
We should use a multiple linear regression.
75. Consider the following sentence: "We had to stay at home for two weeks." After removing the stop words, which of the below shows the result of word tokenization?
["stay", "home", "two", "weeks"]
77. Consider the following sentence: "I am watching a show about food in Italy, which is why I talk about food and Italy all the time." What are the missing numerical values in the Bag-of-Words vector below showing word frequencies in this sentence?
[1, 1, 2, 2, 1, 1]
(48)Which of the below is a method to deal with filling out the missing values in data
data imputation
(60) When querying a 3-dimensional database, assume one of the dimensions is called "Location" and it shows different cities in the U.S., and the other dimensions are Product Category (laptop, PC, phone,..) and Month (January, February,...), and the cells show sales volume for the intersection of these dimensions. Assume the user wants to find out the sales volumes in Philadelphia and in April. The OLAP function that serves this purpose is:
dice
(59)When querying a 3-dimensional database, a user goes from summarized data (e.g., quarters) to its underlying details (e.g, months). The OLAP function that serves this purpose is:
drill down
(42)Which type of data visualization method can be helpful when the intention is to show the distribution of annual salary at the Research and Development department of a health tech company?
histogram
(41)Which type of data visualization method can be helpful when the intention is to show relative proportions of dollars per department allocated by a university administration?
pie chart
(61)When querying a dimensional database, a user transforms the data coming from rows of a table into data grouped on several columns. The OLAP function that serves this purpose is:
pivot
(43) Which type of data visualization method can be helpful when the intention is to show relationship between time spent working out per week (in hours) and average amount of sleep an individual gets per week (in hours)?
scatterplot
89. Consider the decision problem used in Questions 2-5. Assume there is one more constraint. The total number of laptops produced next quarter cannot exceed the total number of tablets produced next quarter. Let's define x1 = number of laptops to be produced next quarter, x2 = number of PCs to be produced next quarter, and x3 = number of tablets to be produced next quarter. Which formulation below represents this new constraint?
x1 <= x3
