MISY 5366 - Data Warehousing & Data Mining

Question: There are 10 stages to building a predictive model. Which one of the following is not a valid stage? (Stages are listed in no particular order.) Please select one answer from the five choices below: A. Obtain the data. B. Explore, clean and preprocess the data. C. Determine the data mining task. D. Expand the data dimension. E. Use the algorithm to perform the task.

D. Expand the data dimension. The process calls for reducing the data dimension, not expanding it.

What are two actions an analyst can take to keep a model from overfitting? A. Continuously collect and use more data, and keep the differences between training errors and validation errors to a minimum. B. Not change the size of the data and keep the differences between training and validation errors to a minimum. C. Continuously collect and use more data and not worry about the differences between training and validation errors. D. Get training errors to zero, making a perfect fit, while validation errors do not have a value of zero. E. All of the above

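A. Collecting more data and keeping the difference between training and validation errors small both help guard against overfitting.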
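
The gap between training and validation error mentioned in the options above can be seen in a small sketch. The data here are synthetic (y = 2x plus noise), and the 1-nearest-neighbor "memorizer" is a hypothetical stand-in for any model that fits the training set perfectly; neither comes from the course material.

```python
import random

def one_nn_predict(train, x):
    """Predict y by copying the training record whose x is closest."""
    return min(train, key=lambda rec: abs(rec[0] - x))[1]

def mean_abs_error(train, data):
    """Mean absolute error of the 1-NN memorizer on the given records."""
    return sum(abs(y - one_nn_predict(train, x)) for x, y in data) / len(data)

random.seed(0)
# Synthetic records (assumption): y = 2x + Gaussian noise
records = []
for _ in range(200):
    x = random.uniform(0, 10)
    records.append((x, 2 * x + random.gauss(0, 1)))

train, valid = records[:120], records[120:]

train_mae = mean_abs_error(train, train)  # memorized records are reproduced exactly
valid_mae = mean_abs_error(train, valid)  # new records expose the memorized noise

print(f"training MAE:   {train_mae:.3f}")
print(f"validation MAE: {valid_mae:.3f}")
```

A training error of zero alongside a clearly larger validation error is the symptom of overfitting that the question describes.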

What is data mining? A. Data Mining is the procedure of extracting information from huge sets of data and using it for other applications B. Data mining aims to build data-centric products and make data-driven business decisions. C. Data mining is a multidisciplinary area of scientific study and is purely concerned with algorithms. D. It is the process of visualizing or displaying the data extracted in the form of different graphical or visual formats such as statistical representations, pie charts, bar graphs, graphical images, etc. E. Data mining refers to the vast amount of data that can be structured, semi-structured, and unstructured sets of data ranging in terms of tera-bytes.

A. Data mining is the process of discovering meaningful new correlations, patterns, and trends by analyzing large amounts of data stored in data repositories like files and databases, using different technologies along with statistical and mathematical techniques.

When doing linear regression with many variables available, it is best practice to ______ the number of predictors in an attempt to create the most accurate regression coefficients and minimize prediction variance. Two main methods of doing this are __________ 1 and __________ 2. While the first is more expensive and time-consuming, it is more accurate and suits smaller datasets. The second method is better suited to large datasets, but it can sometimes miss predictors that would perform well as a group but poorly on their own. A) Reduce, exhaustive search, popular subset selection algorithm B) Reduce, stepwise regression, exhaustive search C) Increase, exhaustive search, popular subset selection algorithm D) Increase, finding the biggest R², Mallows's Cp E) Reduce, backward elimination, forward selection

A) Reduce, exhaustive search, popular subset selection algorithm
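
The exhaustive search named in this answer can be sketched as follows. This is a minimal illustration, not the textbook's or XLMiner's implementation: it fits ordinary least squares on every non-empty subset of predictors and keeps the subset with the highest adjusted R². The data are synthetic, with the third predictor deliberately irrelevant, and all function names are assumptions.

```python
import random
from itertools import combinations

def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def adjusted_r2(rows, y, cols):
    """Fit OLS with an intercept on the chosen columns and return adjusted R^2."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(design, y))
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    n, p = len(y), len(cols)
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))

def exhaustive_search(rows, y, n_predictors):
    """Score every non-empty subset of predictors; return (best_score, best_subset)."""
    subsets = (s for size in range(1, n_predictors + 1)
               for s in combinations(range(n_predictors), size))
    return max((adjusted_r2(rows, y, s), s) for s in subsets)

rng = random.Random(1)
rows = [[rng.uniform(0, 10) for _ in range(3)] for _ in range(60)]
y = [3 * r[0] - 2 * r[1] + rng.gauss(0, 1) for r in rows]  # column 2 is pure noise

best_score, best_subset = exhaustive_search(rows, y, 3)
print(best_subset, round(best_score, 3))
```

With p predictors there are 2^p - 1 subsets to fit, which is why the question calls this method expensive and better suited to smaller problems.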

Each day organizations capture more and more of our data, creating the need for Data Scientists to sift through it and make sense of it. Which of the following is an accurate statement on the skillset needed by Data Scientists? A. A broad skillset with deep knowledge in key areas B. Statistics and Business are the only skills needed C. Programming is a required skill D. Business acumen suffices E. Machine Learning is a core skill

A. According to the textbook the skillset Data Scientists have is broad and shallow and, in some areas, deep, resembling the shape of the letter "T".

Step five of the data mining process, determine the data mining task, specifically references which data mining step? A. Develop an understanding of the purpose of the data mining project B. Explore, clean, and preprocess the data C. Partition the data D. Choose the data mining techniques to be used E. Use algorithms to perform the task

A. Develop an understanding of the purpose of the data mining project

There are many data visualizations that can be used to illustrate data and help users better understand what is being communicated. Bar charts are useful for many purposes, but some tasks require more specialized visualizations. What would be a good visualization for seeing the connections between associated individuals, and why? A. Network Graph because it can show the representation of people. B. Bar graph because it can show the many individuals at once. C. Map Chart because it can show the location of people. D. Scatter Plot because it can cluster people together showing how close they are. E. Histogram because it can show the frequency of how people are related.

A. Network Graph because it can show the representation of people.

Using data from the Boston_Housing.xls example in XL Miner, which variable-chart combination is most valuable when trying to determine high pollution areas in Boston? A. Plotting nitric oxide concentration (NOX) as the dependent variable and weighted distances to five Boston employment centers (DIS) as the independent variable using a scatter plot. B. Plotting nitric oxide concentration (NOX) as the dependent variable and the average number of rooms per dwelling (RM) as the independent variable using a scatter plot. C. Plotting weighted distances to five Boston employment centers (DIS) as the dependent variable and nitric oxide concentration (NOX) as the independent variable using a scatter plot. D. Plotting a line chart that shows nitric oxide concentration (NOX) and weighted distances to five Boston employment centers (DIS). E. Plotting weighted distances to five Boston employment centers (DIS) as the dependent variable and the average number of rooms per dwelling (RM) as the independent variable using a scatter plot.

A. Plotting nitric oxide concentration (NOX) as the dependent variable and weighted distances to five Boston employment centers (DIS) as the independent variable using a scatter plot.

Describe machine learning. A. The automatic attainment of information through computer programs using algorithms and data. B. The manual attainment of information such as surveys and questionnaires C. It is a way to learn about machines and how they work. D. It is never Linear Regression E. It can accurately predict the stock market trends.

A. The automatic attainment of information through computer programs using algorithms and data.

Data mining can best be described as a relationship or combination of what following terms? A. The merging of statistics and machine learning(AI) B. Any or all the following: predictive analytics, predictive modeling or machine learning C. Classical statistics combined with predictive modeling, predictive modeling or machine learning D. Whether results and patterns occurred by pure chance E. When computing power and data are limited

A. The merging of statistics and machine learning(AI)

In data mining, big data is a common and essential element. With the advance of the internet and mobile networks, collecting data is becoming easier and faster, and big data keeps getting bigger. Although a bigger data set can lead to better performance and results in data mining, big data also presents challenges. What are those challenges? A. Volume, Velocity, Variety, and Veracity B. Volume, Virtuous, Value, and Vision C. Volume, Value, Variety, and Vision D. Volume, Velocity, Value, and Veracity E. Volume, Velocity, Value, and Vision

A. Volume, Velocity, Variety, and Veracity

Which of the following software tools can be employed to achieve interactive visualizations? A. Spotfire, Tableau, JMP, and Watson Analytics B. Tableau, R, and Python C. JMP, Python, and Tableau D. R, Python, JMP, and Spotfire E. Watson Analytics, R, Spotfire, and Tableau

A. Spotfire, Tableau, JMP, and Watson Analytics

Question: What is the key difference between supervised and unsupervised learning? A.) Supervised learning is used to analyze patterns in data, while unsupervised learning is used to predict an output value. B.) Supervised learning is used to predict an output value, while unsupervised learning is used to analyze and learn patterns in that data. C.) Supervised learning is used to measure input variables, while unsupervised learning is used to measure output variables. D.) Supervised learning is used to measure output variables, while unsupervised learning is used to measure input variables. E.) None of the above

B - Supervised learning is used to predict an output value, while unsupervised learning is used to analyze and learn patterns in that data.

Question: The general approach of data mining makes it susceptible to which error? A.) Data sets that are too large B.) Overfitting of data C.) Inability to handle open data sets D.) Holdout data E.) Unsupervised learning

B. Overfitting of data.

What is the 10-step process to develop analytical projects through data mining? A) Develop Purpose -> Create Dataset -> Process Data -> Adjust Data Dimension -> Redefine Purpose -> Partition -> Choose Technology -> Choose Algorithms -> Interpret Results -> Present Data B) Develop purpose -> Obtain Dataset -> Process Data -> Adjust Data Dimension -> Redefine Purpose -> Partition -> Choose Technology -> Choose Algorithms -> Interpret Results -> Present Data C) Obtain Dataset -> Process Data -> Develop Purpose -> Adjust Data Dimension -> Redefine Purpose -> Partition -> Choose Technology -> Choose Algorithms -> Interpret Results -> Present Data D) Develop purpose -> Obtain Dataset -> Process Data -> Adjust Data Dimension -> Redefine Purpose -> Choose Technology -> Choose Algorithms -> Choose Technology -> Interpret Results -> Present Data E) Create Dataset -> Develop Purpose -> Process Data -> Adjust Data Dimension -> Redefine Purpose -> Partition -> Choose Technology -> Choose Algorithms -> Interpret Results -> Present Data

B) Develop purpose -> Obtain Dataset -> Process Data -> Adjust Data Dimension -> Redefine Purpose -> Partition -> Choose Technology -> Choose Algorithms -> Interpret Results -> Present Data

Question: Accuracy measures of classifier performance derive primarily from which of the following? A. Naïve Rule B. Classification Matrix C. Receiver Operating Characteristic (ROC) Curves D. Validation Data E. Average Misclassification Cost

B. Classification Matrix
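
As a brief illustration of how accuracy measures fall out of the classification (confusion) matrix, the sketch below tallies actual/predicted pairs for a two-class problem and derives accuracy, error rate, sensitivity, and specificity. The ten labels are made up for the example.

```python
from collections import Counter

def classification_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of classes occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical labels: 1 = class of interest, 0 = other class
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

matrix = classification_matrix(actual, predicted)
tp, fn = matrix[(1, 1)], matrix[(1, 0)]  # actual 1s: caught vs. missed
fp, tn = matrix[(0, 1)], matrix[(0, 0)]  # actual 0s: false alarms vs. correct

accuracy = (tp + tn) / len(actual)
error_rate = 1 - accuracy
sensitivity = tp / (tp + fn)  # share of actual 1s classified correctly
specificity = tn / (tn + fp)  # share of actual 0s classified correctly

print(f"accuracy={accuracy:.2f} error={error_rate:.2f} "
      f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```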

What is the goal of an unsupervised Task? A. Classification B. Clustering C. Regression D. Formatting E. Validate

B. Clustering
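
To make the idea of clustering concrete, here is a minimal one-dimensional k-means sketch. No outcome variable is supplied; the algorithm only groups values that lie close together. The values, the starting centers, and the function name are all assumptions chosen for illustration.

```python
def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and re-averaging."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = k_means_1d(values, centers=[0.0, 5.0])

print(centers)
print(clusters)
```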

Which type of chart is not effective for showing an unsupervised learning outcome? A. Scatter plot B. Line chart C. Histogram D. Box plot E. Heatmap

B. Line chart

Imagine you are a manager of a credit loan company in the South Texas area. You are planning to open a branch in Corpus Christi to offer credit cards to the residents of Corpus Christi. After you have completed the explore, clean, and preprocess the data steps, which of these variables would be the main variable for making predictions about each applicant, so that you can accept or reject the applicant? Remember, your main target customers are between 20 and 30 years old. A. Credit Score B. Payment History C. Length of Credit Card History D. Salary E. Location

B. Payment History - Even if an applicant doesn't have a credit score or hasn't used a credit card before, a lender will look at the applicant's payment history: whether the applicant has paid bills on time and, if there are several late payments, whether they were late by 30 days or more. In addition, based on the payment history of an applicant or customer, you can use a classification method to label applicants "high risk" or "low risk" and make predictions.

When using Excel for data mining, data is required to run a linear regression or a classification tree within the application. Which of the following statements is accurate regarding the use of Excel for data mining? A.) Companies, such as grocery chains, use them to solve issues with their displays in the store and selling a product. B.) The data can be a sample size and retrieved from external data sources to create an accurate representation of a population. C.) To solve for P (A|B) based on previous events. D.) So that everyone can learn to be a data scientist using Excel. E.) To allow companies to bring in big data and make a transition into a digital world or enter a new market.

B.) The data can be a sample size and retrieved from external data sources to create an accurate representation of a population.

Where in the data mining process is data visualization most useful? A) Understanding the Project B) Acquiring the Data C) Preprocessing the Data D) Reducing the Data Dimension E) Interpreting Results

C - preprocessing the data because visualization is most useful in the data mining process through its support of data cleaning to identify incorrect, missing, and duplicated data.

A popular numerical measure of predictive accuracy can be defined as: A.) Average error B.) Mean absolute percentage error C.) Mean absolute error/deviation D.) Root-mean-squared error E.) The total sum of squared errors

C.) Mean absolute error/deviation
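
The five measures listed in this question can be computed side by side. The sketch below uses four made-up actual and predicted values; the function and key names are assumptions. Note how the average error comes out to zero because positive and negative errors cancel, while the mean absolute error does not.

```python
import math

def error_measures(actual, predicted):
    """Compute common predictive accuracy measures from paired values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    sse = sum(e * e for e in errors)
    return {
        "average_error": sum(errors) / n,
        "mae": sum(abs(e) for e in errors) / n,
        "mape": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "rmse": math.sqrt(sse / n),
        "total_sse": sse,
    }

# Hypothetical validation records
actual = [100.0, 200.0, 400.0, 500.0]
predicted = [110.0, 190.0, 420.0, 480.0]

measures = error_measures(actual, predicted)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```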

When using a supervised data mining application, a model that can be continuously used to predict or classify records is much more useful than a model that provides a one-time analysis of data. Automation of the application requires the components within the system to communicate with each other via __________ in order to create the elements needed for a predictive algorithm. A. Multiple Linear Regression B. Categorical and Continuous Variables C. Application Programming Interfaces (APIs) D. Scoring E. Scatter Plots

C. Application Programming Interfaces (APIs)

In the field of Business Analytics, the modern era of Big Data has created which of the following new opportunities for using data: A. Regression models for describing "on-average" relationships between variables B. Veracity of data generated by organic distributed processes C. Creation of the "data scientist" profession within the field of data science D. Algorithms that learn directly from the data E. Overfitting of models to fit the "noise" of the data, not just the signal

C. Creation of the "data scientist" profession within the field of data science

Question: Which of the following roles implements the automation of Data Mining solutions? A. Data Analysts B. Business Intelligence practitioners C. Data Engineers D. Data Scientists E. Software Developers

C. Data Engineers

Which statement is true about standard data partition in XLMiner? A. The data rows can only be randomly partitioned. B. Data must be partitioned into three sets - training, validation and test sets. C. Data can be partitioned using either partition variable or random partition. D. The default percentages for random row pick up is 60% for validation set and 40% for training set. E. When using multiple linear regression algorithm data partitioning is not necessary.

C. Data can be partitioned using either partition variable or random partition.
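
The two partitioning approaches in this answer can be sketched in plain Python. This is an illustration of the idea only, not XLMiner's behavior: the 60/40 split, the seed value, the `partition` field, and its "t"/"v" codes are all assumptions.

```python
import random

def random_partition(records, train_fraction=0.6, seed=12345):
    """Shuffle a copy of the records and cut it into training and validation sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def variable_partition(records, key="partition"):
    """Split records using a column that already marks each row's partition."""
    train = [r for r in records if r[key] == "t"]
    valid = [r for r in records if r[key] == "v"]
    return train, valid

# Random partition of 100 record ids
ids = list(range(100))
train_ids, valid_ids = random_partition(ids)

# Partition driven by a hypothetical flag column
flagged = [{"id": 1, "partition": "t"}, {"id": 2, "partition": "v"},
           {"id": 3, "partition": "t"}]
train_rows, valid_rows = variable_partition(flagged)

print(len(train_ids), len(valid_ids))
print([r["id"] for r in train_rows], [r["id"] for r in valid_rows])
```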

How is a variable normalized and why is this done? A. 6 X m X p, where m is the number of outcome classes and p is the number of variables. This is done to avoid redundancies. B. Histograms and/or boxplots are used to normalize the distribution of values and detect outliers. This is done to find any other information that is relevant to the analysis of the variables. C. Dimension reduction involves eliminating unneeded variables transforming variables and creating new variables. This is done to determine the data mining task and if the data needs to be partitioned. D. Subtract the mean from each value and then divide by the standard deviation. This is done to bring all the variables onto the same scale so the algorithm can be implemented effectively. E. Simple linear regression where a regression line is drawn to minimize the sum of squared deviations. This is done as supervised learning algorithms for classification and prediction for any outlier data.

D. Subtract the mean from each value and then divide by the standard deviation. This is done to bring all the variables onto the same scale so the algorithm can be implemented effectively.
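
The procedure in option D of the question above (subtract the mean, then divide by the standard deviation) is the z-score. A minimal sketch, using two made-up variables on very different scales, shows how both end up on a common scale; the sample standard deviation is used here as an assumption.

```python
import statistics

def normalize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical variables measured in very different units
income = [30000, 45000, 60000, 75000, 90000]
age = [25, 35, 45, 55, 65]

z_income = normalize(income)
z_age = normalize(age)

print([round(z, 3) for z in z_income])
print([round(z, 3) for z in z_age])
```

After normalization each variable has mean 0 and standard deviation 1, so a distance-based method is not dominated by the variable with the largest raw units.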

Multidimensional Visualizations primarily aid in: A. Writing algorithms B. Adding color to bar charts C. Displaying complex information in an easily understood way D. Scoring new data E. Enhancing web search technology

C. Displaying complex information in an easily understood way

When evaluating the steps in data mining, why is it a good idea to split the data into three partitions: Training, Test, and Validation? A. This is an approach to dealing with missing values B. This is the best approach to choosing the best algorithm to deploy by assessing how well the algorithm performs. C. More partitions will reduce the likelihood of overfitting by decreasing the likelihood that we will be modeling the noise in the validation set by just using two sets to test the model. D. This will provide the answer to which data mining task is required (Classification, prediction, etc.). E. This will allow you to deploy the model and execute it on real records.

C. More partitions will reduce the likelihood of overfitting by decreasing the likelihood that we will be modeling the noise in the validation set by just using two sets to test the model.

Throughout the many chapters of the book, "Data Mining for Business Analytics" various methods can be applied to the data that is collected. The various methods can allow for a better fit of algorithms to help with predictive analytics. The question is it best practice to only apply one method and if so why or why not? A. Yes using only one method is the best practice because if multiple methods are used then it will confuse the user. B. Yes using only one method is the best practice because it is the fastest way to get an answer for large amounts of data. C. No using only one method is not the best practice because it is better to test multiple methods to find the best fit for the data also allowing better prediction. D. Yes using only one method is the best practice because it allows the best fit for the data. E. The method used doesn't matter and any method used will give the best fit of the data.

C. No using only one method is not the best practice because it is better to test multiple methods to find the best fit for the data also allowing better prediction.

_______ & _______ are not considered to be data mining techniques because they do not involve statistical modeling. A. R & Python B. Clustering & Regression C. OLAP & SQL D. Python & SQL E. Machine Learning & Artificial Intelligence

C. OLAP & SQL

The skills of data scientists involve the areas of statistics, machine learning, math, business, IT, and programming. The skillset of many data scientists can be seen as resembling what letter of the alphabet? a. O b. R c. T d. I e. C

C. The skill areas involved in data science are broad, and it is rare to find a data scientist who is an expert in every one of them. Therefore, data scientists resemble the letter T, where the vertical line of the T represents a scientist who has deep knowledge in one area, such as programming, and the top line of the T indicates a superficial knowledge of other subjects, such as business and statistics.

Question: What challenges does big data have, presented as big data's four V's? A. Vanity, Vetted, Variety, and Validate B. Validate, Variety, Vacation, and Vulnerable C. Volume, Veracity, Variety, and Velocity D. Velocity, Veracity, Vetted, and Versification E. Volume, Vetted, Validate, and Victory

C. Volume, Veracity, Variety, and Velocity

Question: Jennifer is the store manager for XYZ Grocery Store. Recently, her store has experienced a large amount of shrink within various departments. Jennifer wants to present a multi-colored model that provides insights and correlations of shrink per department. Jennifer is confident that by creating and distributing this visualization, each department can reduce the shrink and maximize profits. Which visualization should Jennifer present to her department? A.) Line Graph B.) Scatter Plot C.) Heat Map D.) Box and Whisker E.) Bar Graph

C.) Heat Map

Question: What are the five factors that affect the usefulness of a method? A.) Size of the dataset, types of patterns, meet some underlying assumptions of the method, noisy, and a sample B.) Color, Size, Shape, Multiple panels, and Animation C.) Size of the dataset, types of patterns, meet some underlying assumptions of the method, noisy, and the particular goal D.) Develop an understanding, obtain the dataset, explore clean preprocess the data, reduce the data dimension, and determine the data mining task E.) Size of the dataset, types of patterns, exploration, noisy, and a sample

C.) Size of the dataset, types of patterns, meet some underlying assumptions of the method, noisy, and the particular goal

What is NOT a natural criterion to use for judging a classifier to reduce the probability of making a misclassification error? A.) The Classification Matrix B.) Class Separation C.) The Multicollinearity Rule D.) The Naïve Rule E.) The Confusion Matrix

C.) The Multicollinearity Rule
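
The naïve rule listed among these options serves as a benchmark: classify every record as the most prevalent class and see whether a real classifier does better. The sketch below uses ten made-up labels and a hypothetical classifier's predictions.

```python
from collections import Counter

def naive_rule(actual):
    """Return the majority class and the accuracy of always predicting it."""
    majority_class, count = Counter(actual).most_common(1)[0]
    return majority_class, count / len(actual)

# Hypothetical labels and classifier output
actual = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
classifier_predictions = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

majority_class, baseline_accuracy = naive_rule(actual)
classifier_accuracy = sum(
    a == p for a, p in zip(actual, classifier_predictions)
) / len(actual)

print(f"naive rule predicts class {majority_class}: {baseline_accuracy:.2f}")
print(f"classifier accuracy: {classifier_accuracy:.2f}")
```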

Question: When evaluating predictive performance, the use of specific data sets is important for predictive accuracy because? A.) Using a training set of data will provide the best predictive accuracy. B.) When compared to validation performance, training performance offers the best fit due to its complex modeling. C.) Validation performance utilizing measures such as Mean Absolute Deviation and Error, Root-Mean-Squared Error, and Total Sum Squared Error, provide the best predictive accuracy. D.) Regression analysis is the preferred method to conduct predictive accuracy. E.) Average Error and Mean Absolute Percentage Error is the preferred method to conduct predictive performance for goodness-of-fit.

C.) Validation performance utilizing measures such as Mean Absolute Deviation and Error, Root-Mean-Squared Error, and Total Sum Squared Error, provide the best predictive accuracy.

Question: Data Science can be described as a broad concept that includes a combination of which of the following skills? A.) IT, business, machine learning, and math B.) Statistics, machine learning, and programming C.) IT, business, programming, and statistics D.) IT, business, programming, math, machine learning, and statistics E.) IT, business, programming, math, and machine learning

D - Data science is a combination of IT, business, programming, math, machine learning, and statistics.

Which of these is not included in the Modeling Process? A. Explore, clean, and preprocess the data B. Deploy the model. C. Interpret the results D. Discover the dataset to be used. E. Partition the data

D - Discover the dataset to be used.

An analyst has a large data set with various data attributes for skin cancer patients and wishes to develop a model that can be used to predict those attributes which may indicate that an individual is more prone to skin cancer. What method should be used to create such a model? A - The analyst should use linear regression since skin cancer tends to be hereditary and therefore various aspects of family lineage will likely be the most predictive. B - The analyst should choose whatever model the analyst is most familiar with. Different methods exist so an analyst can choose one that generally works better for him or her. The better the analyst knows how to use a particular method, the better that method will be in predicting any given data set. C - The analyst should use a neural network since it is specifically designed for use in the medical sciences. D - The analyst should develop multiple models using a variety of methods and test and compare the performance of each method to determine which method predicts best in this case. E - The analyst should use a classification tree because he or she is trying to classify patients as to which will get skin cancer and which will not.

D - The analyst should develop multiple models using a variety of methods and test and compare the performance of each method to determine which method predicts best in this case.

What is a major difference between supervised and unsupervised learning? A. An instructor is present during supervised learning and is not present during unsupervised learning. B. During unsupervised learning the data model is split into training and testing datasets while supervised learning is not split into training and testing datasets. C. Supervised learning requires a smaller sample size than unsupervised learning because the data analyst cannot supervise large datasets. D. During supervised learning the data model is split into training and testing datasets while unsupervised learning is not split into training and testing datasets. E. Unsupervised models provide a prediction while supervised models do not provide a prediction.

D. Supervised models are split into training and testing datasets, while unsupervised models are not.

What are supervised learning algorithms? (A) These are methods of consolidating a large number of records or cases into a smaller set. (B) These are used in classification and prediction where there is no outcome variable to predict or classify. (C) These are procedures of extracting information from huge sets of data and using it for other applications. (D) These are used in classification and prediction where the value of the outcome of interest is known. (E) These are processes to visualize or display the data extracted in the form of different graphical or visual formats such as statistical representations, pie charts, bar graphs, graphical images, etc.

D. Supervised learning algorithms are those where we have input variables and an output variable and use an algorithm to learn the mapping function from the input to the output. It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process.

If asked to provide a definition of big data in the context of data analytics, the most complete response would be: A. The large dataset used in data mining B. The term describes the variety of data used in data mining C. Data that is quickly captured and used in data mining D. A dataset used in data mining that is potentially very large as it is now easier to collect and store data. This dataset should also be considered in terms of the speed of collection, the authenticity of the data and the variety of data collected. E. A dataset used in data mining that has become larger in recent years as it is now easier to collect and store data

D. A dataset used in data mining that is potentially very large as it is now easier to collect and store data. This dataset should also be considered in terms of the speed of collection, the authenticity of the data and the variety of data collected.

You are a small business owner of a Bakery in your town, and during the months of March and August sales are typically 40% lower compared to other months. You conclude that March and August do not have a holiday which helps boost demand. Which one of the below models can you leverage to drive additional insights in order to decide on what actions will help yield an increase of sales for the months of March and August? A. Business Intelligence B. Data Mining C. Big Data D. Business Analytics E. Data Science

D. Business Analytics

Question: What tasks have become the "key elements" of Business Analytics in most organizations? A. Data mining and SQL B. Statistical modeling and automated algorithmic methods C. OLAP and SQL D. Classification, prediction, and pattern discovery E. Big data and artificial intelligence

D. Classification, prediction, and pattern discovery

From a process standpoint, what is the very last step of the data mining model, based on the choices given below? A. Data exploration and reduction B. Deployment of the Data Model C. Model Evaluation and Selection D. Prediction E. Data Preparation

B. Deployment of the Data Model. Deploying the model is the final step of the process.

Which of the following is an example of data mining, as defined in the text? A. Determining which of last quarter's marketing campaigns lead to the most direct sales. B. Using statistical analysis to infer the effect a $1 increase will have on the average demand for a box of cereal. C. Applying A-B testing rules to charge Mac users more than Windows users for hotel reservations. D. Grouping customers into different "personas" that receive different marketing materials and applying one of these personas to every new prospect. E. A detailed breakdown of weekly sales by salesperson that is used to determine monthly commissions.

D. Grouping customers into different "personas" that receive different marketing materials and applying one of these personas to every new prospect.

Utilization of visualization techniques are primarily included in which processes of data mining? A. Obtaining data and sampling data. B. Defining a purpose for a data mining project. C. Deployment. D. Initial data exploration, preprocessing data, and reporting. E. Partitioning training and validation data sets.

D. Initial data exploration, preprocessing data, and reporting.

When studying the terminology within data mining, it is important to understand the definitions of the terms and identify which ones are often used interchangeably. Please read the following choice options carefully. Each of the options includes 3 terms that are often used interchangeably, however one option is not correct. Please select the option where the words that are listed are not interchangeably used for each other within data mining. A. Predictor, Feature, Field B. Observation, Pattern, Sample C. Response, Dependent variable, Output variable D. Instance, Independent variable, Target variable E. Record, Case, Row

D. Instance, Independent variable, Target variable

You are an associate at a company, and you have a new and exciting task to tackle: find a model that instead of fitting data well, will make great predictions whenever it comes to the new records that are put in the system. What would be the best way to find the highest cumulative predictive values? A. MAE B. MAPE C. Total SSE D. Lift Curve E. RMSE

D. Lift Curve.
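
A lift (cumulative gains) curve ranks records by predicted score and tracks how many actual positives have been captured so far, compared with picking records at random. The sketch below uses ten made-up records; the function names are assumptions.

```python
def cumulative_gains(actual, scores):
    """Sort records by score (highest first) and accumulate the actual positives."""
    ranked = [a for _, a in sorted(zip(scores, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    gains, running = [], 0
    for a in ranked:
        running += a
        gains.append(running)
    return gains

def lift_at(actual, scores, k):
    """Positives captured in the top k records relative to random selection."""
    gains = cumulative_gains(actual, scores)
    expected_at_random = sum(actual) * k / len(actual)
    return gains[k - 1] / expected_at_random

# Hypothetical validation records: 1 = responder, with model scores
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.90, 0.80, 0.75, 0.60, 0.55, 0.40, 0.35, 0.30, 0.20, 0.10]

gains = cumulative_gains(actual, scores)
top4_lift = lift_at(actual, scores, 4)

print(gains)
print(f"lift in top 4 records: {top4_lift:.3f}")
```

A lift above 1 in the top-ranked records means the model finds responders faster than random selection would.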

Preprocessing and cleaning of data is an essential step required for building an effective model for data mining. Which of the following choices is completely correct as part of the steps involved in data preprocessing? A. Identify the type of variables and classify them as numerical or text, and continuous or categorical, using data mining routines that are highly capable of predicting the type of variable in all cases. So, one can rely on tools like XLMiner for their data mining features for effective classification of variables, and no manual variable conversion is needed. B. Select the variables. The more variables used in the prediction, the more robust the model is, as it allows us to establish relationships amongst them. C. Identify Outliers. Any value of data that is three standard deviations away from most of the data is considered an Outlier, which is an invalid data point. D. Normalizing and Rescaling of data. For some algorithms and models, it is important to rescale or normalize the data to bring all the variables to a common scale to build an effective data mining model. E. Missing values can be easily omitted without considering whether they are large or small in quantity.

D. Normalizing and Rescaling of data. For some algorithms and models, it is important to rescale or normalize the data to bring all the variables to a common scale to build an effective data mining model.

The discipline focused on the "average effect" an inference from a sample population may have on the whole population is best described as? A. Data Mining B. Artificial Intelligence C. Machine Learning D. Statistics E. Overfitting

D. Statistics

Jimmy Wong is a Data Analyst at ABC Corporation. His first project is to develop an application that will help Human Resources identify applicants that have a propensity to be successful within the organization. Jimmy is tasked with separating ABC Corporation's application data like years of experience and education, etc. into groups by using clustering methods. The second part of his project involves algorithms that are applied separately to each risk-level group to predict their propensity to be successful at ABC Corporation. What type of fundamental data mining method is Jimmy using? A. Supervised Learning methods B. Supervised Learning Algorithms C. Unsupervised Linear Regression D. Supervised Learning and Unsupervised Learning methods E. Clustering

D. Supervised Learning and Unsupervised Learning methods

You work at a car dealership in Dallas Texas. Your boss wants to expand their used car business and puts you in charge. You have decided to buy cars from auctions and then resell them at the dealership for a profit. The auction has provided you data on the cars that are going to be for sale; this data has 20 different variables ranging from price, make, model, mileage, etc.. You must select the cars to purchase that have the highest potential of profit margin for the dealership. This is not a one-time analysis; this will be done on-going in the future if done well. How do you know what variables are relevant and will help you make future vehicle predictive selections for the dealership? A. Go and collect more data, such as information on smoker/non-smoker vehicles B. Choose at random the variables you think may be helpful - no methodology applied C. Only use predictors that have values (numeric) not strings (words) D. Use an exhaustive search or subset selection algorithms to select the best predictors E. Chart all the data for the variables on scatter plots

D. Use an exhaustive search or subset selection algorithms to select the best predictors
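
A rough sketch of exhaustive subset selection, under stated assumptions: every combination of predictors is fitted with ordinary least squares on training data, and the subset with the lowest validation SSE is kept (ties go to the smaller subset). The selection criterion, the tiny solver, and the car data are all hypothetical; other criteria, such as adjusted R^2, are also common.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def select(row, cols):
    """Keep only the chosen predictor columns of one record."""
    return [row[c] for c in cols]


def best_subset(train_x, train_y, valid_x, valid_y, names):
    """Try every non-empty predictor subset; return (validation SSE, predictor names)."""
    best = None
    for size in range(1, len(names) + 1):
        for cols in combinations(range(len(names)), size):
            coef = fit_ols([select(r, cols) for r in train_x], train_y)
            sse = sum((predict(coef, select(r, cols)) - t) ** 2
                      for r, t in zip(valid_x, valid_y))
            # Require a real improvement so that smaller subsets win ties.
            if best is None or sse < best[0] - 1e-9:
                best = (sse, [names[c] for c in cols])
    return best


# Hypothetical auction data: mileage (thousands), age (years), doors.
names = ["mileage", "age", "doors"]
train_x = [[60, 5, 4], [30, 2, 2], [90, 6, 4], [45, 4, 2], [75, 3, 4], [20, 1, 2]]
train_y = [5.0, 6.5, 3.5, 5.75, 4.25, 7.0]   # profit depends only on mileage here
valid_x = [[50, 3, 4], [80, 5, 2], [35, 2, 4]]
valid_y = [5.5, 4.0, 6.25]

sse, chosen = best_subset(train_x, train_y, valid_x, valid_y, names)
```

With 20 predictors an exhaustive search would fit over a million models, which is why stepwise or other subset selection algorithms are often used instead.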

What does a predictive algorithm produce after new data is entered? A. Prototype Mode B. Manual Analysis C. Business Rules D. Predicted Classification E. Application Programming Interface (API)

D. Predicted Classification

User information collected from many sources on the web, social media and others provide data to feed business analytics. The output of these analytics is used to understand purchasing patterns, voting patterns, recommendations, and many other customer behaviors. Data mining is used to find these patterns in the data and predict outcomes. To process this data, there are several preliminary steps to apply to the datasets. Select which preliminary step below should be done to prepare the data for mining. A. Creating a sampling of a database B. Data is organized into a format that is standardized for mining C. Data should be cleaned as part of the preliminary steps. D. There are cases where the goal is to find rare events in the data. E. All of the above

E - All of the above

Following the question, only one of these statements is TRUE. A is incorrect. The statement is backwards-- A good explanatory model fits existing data more closely, while a good predictive model extrapolates more accurately. B Predictive models typically split data into several sets, or "partitions," while explanatory models use the entire dataset to create the model. C Explanatory models are primarily concerned with coefficients, predictive models are primarily concerned with prediction capability. D "Most likely" is an incorrect way to describe the chances of a company determining the best choice is to use the same input data for both their predictive and explanatory models. The class and magnitude of inputs are not "generally," or commonly the same. E Explanatory models are better than predictive models at approximating a given dataset (commonly, linear regression), and measuring the strength of data relationships is useful (R^2). Predictive models and explanatory models differ greatly; predictive models prefer to predict new data accurately whereas explanatory models seek to estimate the existing data better.

E Explanatory models are better than predictive models at approximating a given dataset (commonly, linear regression), and measuring the strength of data relationships is useful (R^2). Predictive models and explanatory models differ greatly; predictive models prefer to predict new data accurately whereas explanatory models seek to estimate the existing data better.

From a process perspective, which of the following methods would be least helpful in a predictive analysis? A. Linear Regression B. Regression Trees C. Neural Networks D. Ensembles E. Cluster Analysis

E. Cluster Analysis. Linear regression is the least sophisticated of the methods listed; nevertheless, it is one of the most approachable and popular techniques used in predictive analytics. Regression trees, neural networks, and ensembles are more complex, less popular, and harder to grasp, but they can also be used to make predictions. Cluster analysis is a segmentation technique used to find underlying relationships in the data. As such, it would be the least useful in predictive analysis.

The below statements describe and/or compare differences and/or similarities between predictive and explanatory models, as described by the assigned textbook. Only one statement is TRUE. Select that statement. A. A good predictive model fits existing data more closely, while a good explanatory model extrapolates more accurately. B. In explanatory models, data is typically split into training (or validation, testing, hold-out) sets, or "partitions," unlike predictive models, where the entire dataset is used to create the model. Using the entire dataset allows predictive models to determine the best relationship for estimating the data. C. Both explanatory models and predictive models are primarily concerned with the coefficients. D. If a company wanted to develop two data models concurrently, one predictive and one explanatory, they would most likely use the same inputs for both--although the models interpret the data with differing goals, class and magnitude of the input choices is generally the same. E. A good explanatory model approximates the dataset well and measuring the strength of data relationships is useful. Predictive models do not typically use the same class or magnitude of inputs as explanatory models.

E. A good explanatory model approximates the dataset well and measuring the strength of data relationships is useful. Predictive models do not typically use the same class or magnitude of inputs as explanatory models.
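
The partitioning idea mentioned in these statements can be sketched as follows. This is only an illustration: the 60/40 split, the seed, and the function name are assumptions, not values taken from the textbook.

```python
import random


def partition(records, train_fraction=0.6, seed=1):
    """Randomly split records into a training set and a validation set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset of ten record IDs.
records = list(range(10))
train, valid = partition(records)
```

A predictive model would be fitted on `train` and judged on `valid`, whereas an explanatory model would typically be fitted on all of `records`.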

Today's data mining takes large amounts of data and transforms it into useful information through business analytics methods that can be interpreted and analyzed to provide valuable insight into models that can be used for predicting outcomes of actions. If you were analyzing customer sales for Walmart, which of the following business analytics methods would be best to determine the action that will increase sales of sporting goods the most? A. Linear Regression B. Logistic Regression C. Cluster Analysis D. Collaborative Filtering E. All of the above

E. All of the above

Which statement regarding the concept of "Classification" in data mining is true: A. It is a data analytics method that is used to extract models in order to describe various data classes and predict future data trends and patterns. B. The process of extracting data from large data repositories, such as databases, warehouses, and other information repositories in order to predict a value of a numerical variable. C. It is used for estimating class-labeled test samples that are new, and randomly selected. D. It is the process of cleansing and storing data. E. Answers A and C.

E. Answers A and C.

You're a new employee at XYZ Corporation. On your first day, your boss, Mr. Givens, tasks you with gathering a sample from an external database using Excel/XLMiner. He specifically states that he needs a sample because the current database has over 200 million records. This is a time-sensitive assignment, and the team needs the sample to begin the process of multiple regression. Which tab and group would you select to begin the process of sampling? A. Select Analytic Solver, go to Solve Action group, select Optimize, select Complete Problem. B. Select Analytic Solver Tab, go to Data group and select file folder. C. Select Data Mining Tab, go to Data Mining group and select Database. D. Select Analytic Solver Tab, go to Data and select Database. E. Only C & D above

E. Only C & D above

A company's printing machine is more than 20 years old and has been printing at a much lower rate than before. The expert who had been maintaining the machine recently retired. The company hired a new expert, but he has little knowledge of this type of machine, as it is an older model than the ones he is familiar with operating. The company wants to minimize cost by replacing the part of the machine that is causing the malfunction. Before the former expert retired, he maintained a list of the machine parts he replaced throughout the years, and data associated with each part. Since we have a list of machine parts (Attributes) and the data associated with each part (Features), we can do some analysis to determine which machine part to replace. After performing some regression analysis, the company was able to determine which machine part to replace. What analysis was used to determine which machine part to replace? A. Data preparation B. Unsupervised C. Observation D. Prediction E. Supervised

E. Supervised. Supervised learning is used when the data (Attributes and Features) are known. Using regression analysis helps rule out the machine parts that have no significant effect on how fast the machine prints.

What are the four V's that big data presents as a challenge? A. Volume, Variety, Veracity, and Value B. Volume, Velocity, Veracity, and Value C. Volume, Velocity, Veracity, and Value D. Volume, Velocity, Variety, and Value E. Volume, Velocity, Variety, and Veracity

E. Volume, Velocity, Variety, and Veracity

Which three choices best describe the purpose and desired results of using multiple linear regression for data analysis? A. Classification; Unsupervised learning; determining if a customer will purchase a product or not. B. Clustering; Supervised learning; foretelling customer credit card activities C. Dimension reduction; Supervised learning; classifying credit card actions as fraudulent D. Prediction; Unsupervised learning; separating loan applicants into several risk groups E. Prediction; Supervised learning; determining failure time of equipment based on utilization

E. Prediction; Supervised learning; determining failure time of equipment based on utilization

True or False: The average error and the total sum of squared errors can understate or overstate the error in the data; thus, data analysts should always consider the RMS error as a true measure of the observed residuals before deploying the model.

True
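
A small sketch of why the average error can be misleading, using made-up numbers: positive and negative residuals cancel in the average error, while the total SSE grows with the number of records, so the RMS error gives a steadier picture. The function name and data are hypothetical.

```python
from math import sqrt


def error_metrics(actual, predicted):
    """Compute several error summaries from the residuals (actual - predicted)."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    return {
        "average_error": sum(residuals) / n,          # signed errors can cancel out
        "mae": sum(abs(e) for e in residuals) / n,    # mean absolute error
        "sse": sse,                                   # grows with the number of records
        "rmse": sqrt(sse / n),                        # in the same units as the outcome
    }


# Hypothetical validation data: every prediction is off by 5, in alternating directions.
actual = [10, 20, 30, 40]
predicted = [15, 15, 35, 35]

metrics = error_metrics(actual, predicted)
```

Here the average error is zero even though every prediction misses by 5, which the MAE and RMSE both report.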

Which of the following statements is true about Business Analytics? a) Business analytics is the process of using only qualitative data to make informed business decisions. b) Business analytics is the process of using quantitative data to make informed business decisions. c) Business analytics enable you to understand "what happened and what is happening," but you can't anticipate "What will happen?" d) Business analytics is the presentation of numerical data in visuals like charts, graphs, and diagrams. e) All of the above are true.

b) Business analytics is the process of using quantitative data to make informed business decisions.

