MBA 630

Please watch a video posted this week in Module 3. According to the video, what is data literacy in a business context?

Ability to understand, evaluate and apply data analytics to bring value to an organization

In RapidMiner, complete the Tutorial 5 Merging and Grouping. Based on the results of your analysis, which product has been sold most often?

Athsat

If you are interested only in itemsets that were purchased at least 30% of the time, what is the minimum support?

0.3
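
To illustrate how a minimum support of 0.3 filters itemsets, here is a minimal Python sketch. The transactions, item names, and the helper functions `support` and `frequent_itemsets` are all hypothetical (they are not the course data or a RapidMiner API), and the brute-force enumeration is only suitable for a toy example.

```python
from itertools import combinations

# Hypothetical toy transactions (not the course dataset).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
    {"butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Brute-force search: keep every itemset whose support meets the threshold."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(set(combo), transactions)
            if s >= min_support:
                result[combo] = s
    return result

frequent = frequent_itemsets(transactions, min_support=0.3)
for itemset, s in frequent.items():
    print(itemset, s)
```

With these toy transactions, the pair bread and butter appears in 3 of 10 baskets and is kept, while milk and butter appears in only 2 of 10 and is dropped.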

In the Statistics view, analyze the measurement of attributes. Using the operator Numerical to Polynominal, change the data type for all remaining attributes. Analyze the results, then analyze the attribute package_id. Which package_id value appears most often in the dataset?

1

What is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups?

Cluster analysis

Apply the operator Normalize to all numeric variables in the dataset. Then apply the operator k-NN Global Anomaly Score, leaving the parameters as default. Run the model and analyze the results. Analyze the outlier column. What is the average value of the outlier column? Round up the answer if necessary. (Note: the k-NN Global Anomaly Score operator is part of the Anomaly Detection extension; please download this extension in case you haven't done so.)

1.46

Questions 11-15 are based on the Chapter 5 Exercise data. Upload Chapter05Exercise data to RapidMiner, and answer the following questions: Upload the file, run and explore the results. Set the role of ReceiptID to id, and change the data type for all integers to binominal. Run the model. How many binominal attributes are in the dataset?

11

In the Statistics view analyze data types. Using operator Select Attributes, remove all attributes with Real data type from the analysis. How many attributes are left in the dataset?

12

Complete Lab 2 Tutorial Normalization and Outlier Detection. How many examples are in the dataset after you removed unnecessary attributes and outliers?

1299

Complete Lab 2 Tutorial Handle Missing Values. How many examples are in the dataset after you removed unnecessary attributes and filtered examples with missing data?

1306

In RapidMiner, retrieve the Titanic dataset and apply pivoting. How many females traveled in the first class?

144

In RapidMiner, retrieve the Titanic dataset and apply pivoting. How many males boarded the ship in Cherbourg?

157

Questions 8-15 are based on the file Taxi cancellation.xlsx. The purpose of this exercise is to clean the dataset and apply operators we used in class. Using the operator Read Excel, upload the 'Taxi Cancellations' file to RapidMiner. While uploading, have the "Replace errors with missing values" option checked. Connect the dataset to the results port, run the model and explore the results. How many attributes are in the dataset?

19

In RapidMiner, retrieve the Titanic dataset and apply pivoting. How many passengers traveling in the first class survived?

200

Reduce the number of categories in the attribute from_area_id. Using operator Replace Rare Values combine all values that appear less than 10 times (threshold = 10) to a new category Other. How many categories are now in the attribute? Save the process and close the file. (note: in case you have not done so, download the extension Operator Toolbox for the operator Replace Rare Values).

227

In operator FP-Growth, change min support value to 0.5. Run the model. What is the size of the largest itemset?

3

In the Statistics view, analyze the measurement of attributes. Using operator Numerical to Binominal change data type for attributes that are measured on a scale 0 and 1. How many replacements have been made?

3

In the Titanic dataset, inspect the Passenger Fare in the Statistics results. What is the average price of tickets? Round up the answer if necessary.

33.30

Replace the operator K-Means with operator K-Means(fast). Keep all other operators. In the parameters of K-Means(fast) change the number of k to 3 and have 'add as label' checked. Run the model and check the stats. What is the size of the largest cluster?

365

In the Design View, change the parameters of K-Means(fast). Change the 'numerical measure' to 'ManhattanDistance'. Run the model and check the stats. What is the size of the largest cluster?

468
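
Changing the numerical measure changes how the distance between two examples is computed, which can shift cluster assignments and sizes. A minimal sketch of the two distances on hypothetical points (the function names and points are illustrative, not RapidMiner internals):

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of the absolute differences per attribute."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical pair of examples with two numeric attributes.
p, q = (0.0, 0.0), (3.0, 4.0)
d_euclid = euclidean(p, q)
d_manhattan = manhattan(p, q)
print(d_euclid, d_manhattan)
```

For these two points the Euclidean distance is 5 and the Manhattan distance is 7, so the same pair of examples looks farther apart under the Manhattan measure.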

Complete Lab 2 Tutorial Handle Missing Values. How many attributes in the original dataset have missing data?

5

Questions 16-20 are based on the file SAT_Outliers.csv. The purpose of this exercise is to find and remove outliers. Using the operator Read CSV, upload the file 'SAT Outliers' to RapidMiner. While uploading, have the "Replace errors with missing values" option checked. Connect the dataset to the results port, run the model and explore the results. How many examples are in the dataset?

51

Add operators De-Normalize and Apply Model to bring data back to its original values. What is the average spending per state after removing outliers? Round up the answer if necessary. Save the process and close the file.

5178

Questions 11-15 are based on the "Chapter 4 Exercise" dataset. Upload Chapter04Exercise data to RapidMiner, and answer the following questions: Upload the file, run the model, and explore the results. How many attributes are in the dataset?

6

Using the operator Set Role, assign target role of ID to the attribute row#, and target role Label to the attribute Car_Cancellation. Then, using operator Select Attributes, remove all attributes with missing values over 1000 from the dataset. How many regular attributes are left in the dataset?

6

Add the operator K-Means. In the parameters, change the number of k to 3, and have 'add as label' checked. Run the model and check the stats. What is the size of the largest cluster?

621

Questions 16-20 are based on the "Chapter 6 Exercise" dataset. Upload Chapter06Exercise data to RapidMiner. Use the operator Select Attributes to remove the categorical variables First Name and Last Name from the analysis. Then use the operator Set Role to set the target role 'id' for the attribute Student_ID. How many regular attributes are now in the dataset?

7

Retrieve the Titanic dataset available in RapidMiner (New Process > Repository > Samples > Data > Titanic). Connect the dataset with the results port and run the model. Inspect the Statistics results. How many males are in the dataset?

843

In the Titanic dataset, inspect the Statistics results. What is the largest family size (parents and children) in the dataset?

9

Using operator Filter Examples, filter missing data from the attribute from_area_id. How many examples are now in the dataset? (hint: pay attention to 'invert filter' checkbox in the parameters of this operator).

9,985

In RapidMiner, which operator selects the optimal number of clusters based on the data?

X-means

In the results, explore the Lift values. Based on the lift value between snack foods and frozen foods, is it likely that frozen foods are bought when snack foods are bought?

Yes

Please watch a video posted this week in Module 3. According to the video, what is the best definition for a citizen data scientist?

a person who generates value by applying data analytics, but whose primary job function is outside of the field of data analytics

Cluster analysis is

an unsupervised technique

Market basket analysis is:

an unsupervised technique

Review the article "Create Association Rules" by RapidMiner available on Blackboard. What are the two parts of the association rule "If a customer buys pasta, s/he is 50% likely to also buy butter"?

antecedent and consequent

In RapidMiner, if the target variable has only two options YES and NO, what data type is it?

binominal

What measure of central tendency corresponds to the middle value of a numeric attribute?

median

What is the process of standardizing attributes by applying a distance-based algorithm that results in a mean value of 0 and a standard deviation of 1 for each numeric attribute in the dataset?

z-transformation
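
As a worked illustration, the sketch below standardizes a small hypothetical list of values and checks that the result has a mean of 0 and a standard deviation of 1. It assumes the population standard deviation (`statistics.pstdev`); a given tool may use the sample standard deviation instead, which changes the numbers slightly. The function name `z_transform` is illustrative.

```python
import statistics

def z_transform(values):
    """Subtract the mean from each value, then divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / std for v in values]

# Hypothetical raw scores: mean is 5, population standard deviation is 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = z_transform(scores)

print(z_scores)
print(statistics.mean(z_scores), statistics.pstdev(z_scores))
```

Each output value is a z-score: the raw value 9 sits two standard deviations above the mean, and the raw value 2 sits one and a half below it.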

Review the article "Benefits of Market Basket Analysis" by Quantzig available on Blackboard. Which of the following is NOT a benefit of market basket analysis according to the article?

Sell more items to influencers

Assume that the correlation between humidity and cement strength is -0.5959 and statistically significant. How can we interpret this result?

The more humidity the weaker the cement

Assume that the correlation between sleep and level of oxygen is 0.8622 and statistically significant. How can we interpret this result?

The more sleep the higher the level of oxygen

Which cluster solution is appropriate for nominal (categorical) data?

Hierarchical clustering

Review the McKinsey article "Defining the skills citizens will need in the future world of work" posted in this week's reading material. What are the four main categories of foundational skills according to the report?

A. Cognitive, Interpersonal, Self-Leadership, Digital
B. Self-control, Storytelling, Adaptability, Digital Ethics
C. Self-awareness, Agile thinking, Coaching, Collaboration
D. Problem solving, Confidence, Resolving conflicts, Digital fluency
Answer: A

Review the article "Beware Spurious Correlations" posted in this module readings. According to the article, which of the examples below may lead to a spurious correlation?

A. Comparing dissimilar variables
B. Manipulating the ranges to align data
C. Plotting unrelated data sets together
D. All of the above
Answer: D

Review the article "Turning Data into Unmatched Business Value" posted in this week's reading material. According to the article, what do industry leaders do differently in this space compared to other businesses?

A. Safeguard data so only people who understand data have tools and access to data resources
B. Pursue multi-cloud strategies, access data across systems, use AI-enabled automation
C. Analyze data in silos and separate internal and external data sources for decision making
D. Spend their time performing data wrangling to extract value for an organization
Answer: B

Review the McKinsey article "The data-driven enterprise of 2025" posted in this week's reading material. Which of these statements are true, according to the article?

A. Today, chief data officers (CDOs) and their teams function as a cost center
B. By 2025, data sharing platforms both within and between organizations will become the norm
C. By 2025, data management will be prioritized and automated for privacy, security, and resiliency
D. All of the above
Answer: D

In sampling, a population is:

A. a subset of observations
B. a complete set of observations
C. a random sample
D. a convenient sample
Answer: B

Which analysis is most appropriate if you are trying to predict housing prices and the target variable is continuous (numeric)?

A. classification
B. cluster analysis
C. segmentation
D. regression
Answer: D

In RapidMiner, what is the term for columns?

A. column
B. field
C. variable
D. attribute
Answer: D

What are the two main classes of machine learning algorithms?

A. reinforcement and segmentation
B. clustering and association rules
C. classification and regression
D. supervised and unsupervised
Answer: D

In RapidMiner, what is the term for rows?

A. row
B. record
C. case
D. example
Answer: D

What is the type of data that contains emails, news articles and text messages?

A. social networks
B. sensor
C. text
D. operational
Answer: C

In market basket analysis, which metric shows the probability that an item is purchased when another item is purchased?

Confidence

Review the article The five D's of data preparation. According to the article, which of the following does not belong to the five Ds of data preparation?

Deduce

Continue examining the correlation matrix. Is the following assumption true: the larger the engine size the smaller the horse power.

False

Continue examining the correlation matrix. Is the following assumption true: the smaller the vehicle weight, the smaller MPG.

False

What is a set of items that appear frequently together in a transaction dataset?

Frequent itemset

What is a set of items that are typically bought/consumed in a sequence by customers?

Frequent sequential pattern

What data type does the value 25 belong to?

Integer

Using the operator Filter Examples, filter out values equal to or greater than 3 in the Outlier column. What are the outlier states in the dataset?

KS and MT

In market basket analysis, which metric shows the likelihood that an item is purchased when another item is purchased, while accounting for how often the two would be bought together by coincidence?

Lift
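
The relationship between support, confidence, and lift can be sketched in a few lines of Python. The transactions and function names below are hypothetical, and the formulas are the standard textbook definitions rather than anything taken from RapidMiner's implementation.

```python
# Hypothetical toy baskets (not the course dataset).
transactions = [
    {"pasta", "butter"},
    {"pasta", "butter"},
    {"pasta", "butter"},
    {"pasta"},
    {"butter"},
    {"beer"},
    {"beer"},
    {"beer"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = support(both) / support(antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's baseline support."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

rule_confidence = confidence({"pasta"}, {"butter"}, transactions)
rule_lift = lift({"pasta"}, {"butter"}, transactions)
print(rule_confidence, rule_lift)
```

In this toy data the rule "pasta, then butter" has a confidence of 0.75 and a lift of 1.5. A lift above 1 means the two items appear together more often than would be expected if they were purchased independently.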

When the distance between two clusters is defined as the distance between the nearest pair of objects with each object in the pair belonging to a distinct cluster, it is called

Nearest neighbor linkage
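
A minimal sketch of this linkage rule, assuming Euclidean distance and two hypothetical clusters of 2-D points (the function name `single_linkage` is illustrative):

```python
import math
from itertools import product

def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of points, one taken from each cluster."""
    return min(math.dist(p, q) for p, q in product(cluster_a, cluster_b))

# Hypothetical clusters lying on the x-axis.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

nearest = single_linkage(cluster_a, cluster_b)
print(nearest)
```

Here the closest pair is (1, 0) and (4, 0), so the distance between the two clusters is 3, regardless of how far apart their other members are.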

Review articles "Market Basket Analysis Explained" and "Apriori Explained" by KDNuggets available on Blackboard. What is the goal of apriori algorithm?

Reduce the number of itemsets we need to examine

Visualize boxplots and explore the results. Based on the boxplots, which attribute has the highest variability of values?

Spending

In market basket analysis, which metric indicates the proportion of how frequently items are purchased together?

Support

Add the operator Cluster Model Visualizer. Run the results. Explore the Overview section of the ClusterModelVisualizer results. Which cluster has the lowest number of absences?

The largest cluster

Add the operator Correlation Matrix, connect the ports, and run the model. Examine the correlation matrix. Is the following statement true: attributes Vehicle Weight (Vweight) and Engine Size (Engine_Size) are highly positively correlated?

True

Continue examining the correlation matrix. Is the following assumption true: the larger the engine size the greater the number of cylinders.

True

Add operator Create Association Rules, change min confidence to 0.5. Run the model. What rule has the highest confidence value?

when frozen foods & beer wine and spirits are bought, snack foods are also bought

What is a statistical measure that is used to analyze the strength of relationships between attributes in the dataset?

correlation
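
As an illustration of how the sign of a correlation is read, the sketch below computes the Pearson correlation coefficient by hand for hypothetical values. This assumes Pearson's coefficient is the measure in question; the data and the function name `pearson` are made up for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical attributes.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # rises as x rises
y_down = [10, 8, 6, 4, 2]   # falls as x rises

r_up = pearson(x, y_up)
r_down = pearson(x, y_down)
print(r_up, r_down)
```

A coefficient near +1 means the attributes move together, and a coefficient near -1 means one falls as the other rises, which is how a value such as -0.5959 is interpreted as "more of one, less of the other".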

If you need to combine multiple datasets into one, what type of data preparation is it?

data blending

If you need to improve your existing dataset by dealing with missing data, what type of data preparation is it?

data cleansing

What is the process of detecting and correcting corrupt or inaccurate records from a dataset?

data cleansing process

Which of the phases of CRISP-DM deals with data cleansing and blending?

data preparation

Which method is a possible solution to deal with a missing value?

delete an example

Add operator FP-Growth, keep default parameters. Run the model. Which product has the highest support?

frozen_foods

When attributes in the dataset are measured using different scales, they may have different value ranges. What is the process of transforming the measures to bring all attributes to the same scale?

normalization
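
One common way to normalize is range (min-max) transformation, which rescales each attribute to the interval from 0 to 1; z-transformation is another. The sketch below shows the range method on hypothetical values, with an illustrative function name, and assumes the attribute has at least two distinct values.

```python
def min_max_scale(values):
    """Rescale values to [0, 1]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attributes measured on very different scales.
vehicle_weight = [1500, 2000, 2500, 3500]
mpg = [10, 20, 30, 50]

scaled_weight = min_max_scale(vehicle_weight)
scaled_mpg = min_max_scale(mpg)
print(scaled_weight, scaled_mpg)
```

After scaling, both attributes span the same range, so neither dominates a distance calculation simply because its raw numbers are larger.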

What is a data point that differs significantly from other observations in the dataset?

outlier

What are data points that are significantly different from the vast majority of the data?

outliers

What is a data transformation technique that rotates data into a wide table format and aggregates results?

pivoting
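
A minimal sketch of pivoting with the Python standard library: long-format rows are counted per combination and then laid out as a wide table with one row per group and one column per category. The rows below are hypothetical, not the Titanic data.

```python
from collections import Counter

# Hypothetical long-format rows: (sex, passenger class).
rows = [
    ("Female", "First"),
    ("Male", "First"),
    ("Female", "Second"),
    ("Female", "First"),
    ("Male", "Third"),
    ("Male", "Third"),
]

# Aggregate: count how often each (sex, class) combination occurs.
counts = Counter(rows)

# Rotate into a wide table: one row per sex, one column per class.
sexes = sorted({sex for sex, _ in rows})
classes = sorted({cls for _, cls in rows})
pivot = {sex: {cls: counts[(sex, cls)] for cls in classes} for sex in sexes}

for sex, row in pivot.items():
    print(sex, row)
```

Combinations that never occur show up as 0 because a `Counter` returns 0 for missing keys.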

What is a sequence of characters, similar to count\((.*)\)_(.*), that defines a search pattern and is often used in find-and-replace operations on strings?

regular expression
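
To show what the pattern in the question does, the sketch below applies it with Python's `re` module. The column name `count(Name)_Female` is a hypothetical example of the kind of string this pattern would match.

```python
import re

# The pattern from the question: literal "count(", a captured group,
# literal ")_", then a second captured group.
pattern = re.compile(r"count\((.*)\)_(.*)")

column = "count(Name)_Female"  # hypothetical column name

match = pattern.fullmatch(column)
groups = match.groups()

# Find and replace: keep only the second captured group.
renamed = pattern.sub(r"\2", column)
print(groups, renamed)
```

The escaped parentheses `\(` and `\)` match literal brackets, while the unescaped ones capture text, so the two groups here are "Name" and "Female" and the replacement leaves just "Female".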

In the Titanic dataset, apply the Decision Tree operator (Operators > Modeling > Predictive > Trees > Decision Tree). Which variable (attribute) has the highest predictive power?

sex

Review the article "Correlations in Finance" by Investopedia posted in this module readings. According to the article, if the stock is moving in the same direction as a benchmark index such as S&P 500, then

there is a positive correlation between the stock and the index

Analyze the Statistics results. Which attribute has the largest number of missing values?

to_city_id

What is the value obtained by the process of subtracting the mean from the individual raw score and then dividing the difference by standard deviation?

z-score

Review the article "If Your Data Is Bad, Your Machine Learning Tools Are Useless". According to the article, how much time do data scientists spend on data preparation?

up to 80% of the time

