MIS3300 Midterm 1


Ordinal

(categorical and order matters) Data that imply a certain order or ranking but don't necessarily behave like numbers on a number line. For example: i. Place in a race (1st, 2nd, 3rd, ... 45th, 46th, etc.). ii. Disagree, Somewhat Disagree, Neither Agree nor Disagree, Somewhat Agree, Agree. iii. Not Important, Somewhat Important, Important. iv. Cold, Warm, Hot.

Nominal

(categorical, order does not matter) e.g., continents, or the options in a website drop-down menu such as race

Partitional clustering

(non-hierarchical): A division of objects into non-overlapping subsets (clusters) such that each object is in exactly one cluster

Lift can range from _____. • 0 to 1 • 0 to positive infinity • -1 to 1 • -1 to positive infinity

0 to positive infinity
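
A minimal Python sketch of why lift starts at 0 and has no upper limit, assuming the usual definition lift(X => Y) = support(X and Y) / (support(X) × support(Y)). Lift is a ratio of non-negative numbers, so it cannot drop below 0, and nothing caps it at 1. The transactions and the helper names `support` and `lift` are made up for illustration.

```python
# Hypothetical market-basket data; items and counts are illustrative only.
transactions = [
    {"flour", "eggs", "sugar"},
    {"flour", "eggs"},
    {"flour", "sugar"},
    {"eggs", "sugar"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(antecedent, consequent, transactions):
    """support(X and Y) divided by support(X) * support(Y)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / (support(antecedent, transactions) * support(consequent, transactions))

# Items that sometimes appear together give a positive lift.
rule_lift = lift({"flour", "eggs"}, {"sugar"}, transactions)

# Items that never appear together give the minimum possible lift, 0.
never_together = lift({"milk"}, {"sugar"}, transactions)
```

A lift of 1 means the two item sets appear together about as often as chance would suggest; values above 1 suggest a positive association.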

Identify the interquartile range.

25% to 75% (from the 25th percentile to the 75th percentile)

correlation coefficient

A number from -1 to +1 that indicates the direction and strength of the linear relationship between two variables

Hierarchical clustering:

A set of nested clusters organized as a hierarchical tree. The hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains
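
A rough sketch of the nesting idea in Python. The card does not say how the distance between clusters is measured, so this assumes single linkage (distance between the closest members) on made-up one-dimensional values; `single_linkage` and `agglomerate` are hypothetical helper names.

```python
def single_linkage(a, b):
    """Distance between two clusters = distance between their closest members."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(values):
    """Repeatedly merge the two closest clusters until only one remains."""
    clusters = [(v,) for v in values]   # every object starts as its own cluster
    merges = []                          # record of each nested cluster formed
    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical values: two tight pairs and one far-away point.
merges = agglomerate([1, 2, 6, 7, 15])
```

Each entry in `merges` contains the clusters formed before it, which is the "hierarchical tree" the card describes; the last entry holds every object.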

Which aspect(s) of business is now open to data collection and analysis? • Accounting • Economics & Finance • Management • All answers are correct

All answers are correct

In cluster analysis, seed (or centroid) values • Often start as arbitrary values • Change and should stabilize as the analysis advances • Are closer to data points in their own clusters than to points in other clusters • All of the above

All of the above

To effectively introduce data mining models into the organizational flow, you should • Clearly communicate the model's function and limitations to stakeholders • Thoroughly test and prove the model • Plan for and monitor the model's implementation • All of the above

All of the above

Power BI has the ability to connect to a wide array of data sources; these include the following: • Database • Data cube • Text file • All of the answers are correct

All of the answers are correct

Which of the following are main types of data preparation or cleansing? • Handling missing data • Reducing attributes • Reducing rows • All of the answers are correct

All of the answers are correct

Target or label attribute

An attribute we'd like to predict. (attribute = feature = variable = column)

Data instance

An entity (such as an object or person) that is described by a collection of attributes, also known as a feature vector. A patient's medical record, a customer's profile, and an employee's information are all examples of a data instance. (data instance = record = case = observation = row)

In an association rule {X => Y}, X is the ____ and Y is the _____.

Antecedent; consequent

Extracting purchase patterns from existing records is an example of... • Cluster analysis • Classification data mining • Supervised data mining • Association analysis

Association analysis

An association analysis can be conducted on what type of data variables? • Binominal • Ordinal • Interval • None of the answers is correct

Binominal

"Qualitative variable" is likely another term for which type of data? • Quantitative • Numerical • Textual • Categorical

Categorical

Extracting groups of like-minded people in a given network is an example of... • Cluster analysis • Classification data mining • Supervised data mining • Association analysis

Cluster analysis

A measure of how sure we are that when one item set is flagged as true, the associated item set will also be flagged as true is known as • Confidence • Association Rule • Support • Frequency Pattern

Confidence

CRISP-DM

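Cross-Industry Standard Process for Data Mining. Its six phases are business understanding, data understanding, data preparation, modeling, evaluation, and deployment. The data understanding phase involves: i. Data Gathering ii. Data Description iii. Data Exploration iv. Data Quality Verification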

Which of the following is correct? • Information > Data > Knowledge • Data > Information > Knowledge • Knowledge > Information > Data • None

Data > Information > Knowledge

Data mining comes after which of the following steps in the CRISP-DM? • Data acquisition & understanding • Data transformation • Model evaluation • Model deployment

Data transformation

Taking some business action based upon what your model tells you is _____ in the CRISP-DM model.

Deployment

What type of analytics would be used to tell us why something occurred?

Diagnostic

With which other observation is A most likely to be clustered? • Euclidean distance from A to B = 5.00 • Euclidean distance from A to C = 3.16 • Euclidean distance from A to D = 11.31 • Euclidean distance from A to E = 4.12

Euclidean distance from A to C = 3.16
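
A small Python sketch of the reasoning: the observation with the smallest Euclidean distance to A is the most likely cluster-mate. The card gives only the distances, so the 2-D coordinates below are hypothetical, picked so that the distances come out to the card's values.

```python
import math

# Hypothetical coordinates, chosen only so the distances match the card.
points = {"A": (0, 0), "B": (3, 4), "C": (1, 3), "D": (8, 8), "E": (1, 4)}

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance from A to every other observation.
distances = {
    name: euclidean(points["A"], xy) for name, xy in points.items() if name != "A"
}

# The closest observation is the one A would most likely be clustered with.
nearest = min(distances, key=distances.get)
```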

Deployment

Finally, in the deployment phase, the plan of action created in the evaluation phase is carried out, and the success of the plan is measured to help inform future decisions.

In the association rule {Flour, Eggs} => {Sugar} the antecedent is • Flour • Eggs • Flour, eggs • All of the answers are correct

Flour, eggs

Which step is part of supervised data mining, but not unsupervised data mining? • Obtain a data set • Identify the target variable • Build a model • Apply the model

Identify the target variable

Data preparation

In the data preparation phase (which happens concurrently with the data understanding phase), you ensure that the data available are high-quality, meaning that they are accurate and can be useful for any analysis that you might conduct to better understand your situation and address the problem.

In statistical terms, _____ refers to how peaked or flat a distribution is with respect to the normal distribution. • Variance • Skewness • Kurtosis • Heteroscedasticity

Kurtosis
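
One way to put a number on "peaked or flat" in Python, assuming the population moment formula for excess kurtosis (fourth central moment divided by the squared variance, minus 3, so a normal distribution scores about 0). Statistical packages often apply sample-size corrections, so their values can differ. The data and the name `excess_kurtosis` are illustrative.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # variance (second moment)
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical data: evenly spread values versus values piled up at the center.
flat = [1, 2, 3, 4, 5]
peaked = [3, 3, 3, 3, 3, 3, 3, 1, 5]

flat_kurtosis = excess_kurtosis(flat)       # negative: flatter than normal
peaked_kurtosis = excess_kurtosis(peaked)   # positive: more peaked than normal
```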

Mean, median, and mode describe the ___ of the data.

Location

Data Science

Managing and analyzing massive sets of data for purposes such as target marketing, trend analysis, and the creation of individually tailored products and services.

Standard deviation

A measure of dispersion: the square root of the variance, which can be read as the typical distance of each data point from the overall mean

In a skewed distribution, the ___ may be a more appropriate measure of central tendency than the mean

Median
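
A quick Python illustration with made-up salary figures: one extreme value drags the mean far from where most of the data sit, while the median barely notices it.

```python
import statistics

# Hypothetical salaries (in thousands); the last value skews the data to the right.
salaries = [40, 42, 45, 47, 50, 52, 400]

mean_salary = statistics.mean(salaries)      # pulled upward by the 400
median_salary = statistics.median(salaries)  # middle value of the sorted list
```

Here the median is 47, which is close to what a "typical" person in the list earns, while the mean is above every salary except the outlier.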

In cluster analysis, the objective is to ____ the distance between each observation and its centroid and ____ the distance between centroids

Minimize; maximize

The median summary measure of location is not suited to ______ variables.

Nominal

A person's race would be an example of a(n) ________ variable, while a person's age would be an example of a(n) ________ variable.

Nominal; ratio

Editing your data in the Query Editor window in Power BI will ____ the data located in the original source.

Not change

Data understanding

Now that the problem and "the business" are understood, we need to collect available information regarding the situation and perform some initial explorations into it.

Missing data is also known in the database world as _____.

Null

Modeling

Once data are collected, the modeling phase can occur. I think the term "modeling" here, though, is a little unfortunate, since not all data analysis requires the creation of a model. I would have chosen the word "data analysis". I really should have been at that meeting. Create an unbiased model.

A movie has five possible ratings: G, PG, PG-13, R, and NC-17. Choose the data type. • Ordinal • Nominal • Ratio • Interval

Ordinal

Including two or more similar variables (employment and retirement status, age and DOB, etc.) in a cluster analysis could result in... • Meaningless results • Dummy variables • Overweighting the analysis • All of the above

Overweighting the analysis

Business understanding

(Or personal understanding) is to understand the problem and the entity that has the problem: how both are being or may be impacted by the situation, what outcomes are possible and desirable, what assets might be brought to bear in addressing the situation, and what constraints need to be placed on any possible solution.

What cluster analysis is, the type of data it requires, and the types of business questions it can answer

Data: quantitative (interval/ratio) and/or qualitative (ordinal/nominal) data may be used in cluster analysis. Business question: Do cases (e.g., customers, employees, etc.) tend to cluster into natural groups that we can use to take some action?

What correlation analysis is, the type of data it requires, and the types of business questions it can answer

Data: quantitative continuous, on an interval or ratio scale (at least for the type of correlation we cover in this course, Pearson's r). Business question: How does one variable relate to another?

A bar chart should be used when you need to visualize • Time-series data • Correlation • Part-to-whole relationships • Ranking or nominal comparisons

Ranking or nominal comparisons

In RapidMiner, the _____ operator allows you to filter the columns used in an analysis. • FP-Growth • Create Association Rules • Nominal to Numerical • Select Attributes

Select Attributes

The k-means clustering algorithm is _____ outliers. • Unable to identify • Not impacted by • Sensitive to • Outliers are not a consideration with cluster analysis

Sensitive to

This measurement of how dispersed or varied the values in an attribute are can be used to watch for inconsistent data; this measurement is known as:

Standard deviation

Ratio

Quantitative data with a true zero point (the scale starts at 0), e.g., a person's age

Variance

Average of the squared distances of each data point from the overall mean; for a set of values, it is a measure of how much the observed values tend to differ (or "vary") from the mean value.
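
A worked example in Python, using made-up values. It assumes the population form (divide by n); the sample form divides by n - 1 instead, and both are shown. The standard deviation is simply the square root of the variance.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

mean = sum(values) / len(values)
squared_distances = [(x - mean) ** 2 for x in values]

# Population variance: average squared distance from the mean.
population_variance = sum(squared_distances) / len(values)

# Sample variance: same sum, divided by n - 1.
sample_variance = sum(squared_distances) / (len(values) - 1)

# Standard deviation: square root of the variance, back in the original units.
population_sd = math.sqrt(population_variance)
```

The standard library's `statistics.pvariance` and `statistics.variance` give the same two results.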

In association analysis, _____ measures the "size" of the rule in the dataset. • Confidence • Lift • Support Count • Support

Support

In association analysis, it's important to identify thresholds for... • Lift and conviction • Support count and confidence • Support and confidence • Conviction and support

Support and confidence
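
A sketch of how the two thresholds act as filters, in Python. It assumes support is the fraction of transactions containing all items in the rule and confidence is support(X and Y) / support(X). The transactions, candidate rules, and threshold values are all made up for illustration.

```python
# Hypothetical market-basket data.
transactions = [
    {"flour", "eggs", "sugar"},
    {"flour", "eggs"},
    {"flour", "sugar"},
    {"eggs", "sugar"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the consequent shows up when the antecedent does."""
    return support(antecedent | consequent) / support(antecedent)

MIN_SUPPORT = 0.2      # hypothetical threshold
MIN_CONFIDENCE = 0.6   # hypothetical threshold

candidate_rules = [
    ({"flour", "eggs"}, {"sugar"}),
    ({"flour"}, {"eggs"}),
    ({"milk"}, {"sugar"}),
]

# Keep only the rules that clear both thresholds.
kept_rules = [
    (x, y)
    for x, y in candidate_rules
    if support(x | y) >= MIN_SUPPORT and confidence(x, y) >= MIN_CONFIDENCE
]
```

With these numbers only {flour} => {eggs} survives: the first rule is frequent enough but not confident enough, and the third never occurs at all.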

Interquartile range

The range of the middle 50% of the data
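
A short Python example with made-up data. Quartiles can be computed under several conventions, so other tools may give slightly different cut points; this uses the default ("exclusive") method of `statistics.quantiles`.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # hypothetical values

# n=4 splits the sorted data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)

# The interquartile range spans the middle 50% of the data.
iqr = q3 - q1
```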

Assume a cluster analysis with normalized variables. The resulting centroid table gives a value of 0.033 for the 'age' attribute in one of the clusters. This means that, prior to normalization, ... • The value of that variable was 0.033 • The value of that variable among observations in the cluster was about the average or the mean • The value of that variable was below the standard deviation value for the sample

The value of that variable among observations in the cluster was about the average or the mean
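
Why 0.033 means "about average", sketched in Python. This assumes the normalization was a z-score transformation (subtract the mean, divide by the standard deviation), which the card does not state outright; the ages are made up, and the population standard deviation is used here.

```python
import statistics

ages = [22, 30, 35, 41, 52]  # hypothetical raw ages

mean_age = statistics.mean(ages)
sd_age = statistics.pstdev(ages)

def z_score(x):
    """Normalized value: how many standard deviations x lies from the mean."""
    return (x - mean_age) / sd_age

def to_original(z):
    """Undo the normalization to get back to the raw scale."""
    return mean_age + z * sd_age

# A centroid value of 0.033 is only a sliver above the mean on the raw scale.
centroid_age = to_original(0.033)
```

A normalized value of 0 is exactly the mean, so 0.033 sits a small fraction of one standard deviation above it.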

How might inconsistent data cause later trouble in data mining activities? • They can produce invalid or meaningless results. • They can undermine the value of experts' opinions. • They can lead to overconfidence in the DM results. • None of these answers is correct.

They can produce invalid or meaningless results.

Segments

This is a very marketing-centric term for the groups.

Classes

This is another favored term of analyst-types. A "class" can also be a term used for any possible value of a categorical variable, which I guess is cool.

Clusters

This is the term that's native to "clustering analysis", though it often doesn't mean much to people who aren't as analytics-savvy as you.

Partitions

This term works well for people from the tech world.

In data cleansing, a common transformation is to remove leading and trailing spaces from values in a column. This is known as: • Truncating • Pruning • Trimming • Consolidating

Trimming
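
Outside Power BI, the same transformation looks like this in Python, where `str.strip()` removes leading and trailing whitespace but leaves interior spaces alone. The column values are hypothetical.

```python
# Hypothetical column values with stray spaces around them.
raw_cities = ["  Logan", "Provo  ", "  Salt Lake City  ", "Ogden"]

# strip() trims both ends of each value; spaces inside a value are kept.
trimmed_cities = [city.strip() for city in raw_cities]
```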

Mining data to uncover naturally existing patterns without trying to make predictions is called ______ mining. • Unsupervised • Supervised • Classification • Regression

Unsupervised

Prescriptive analytics

What do we need to do about this? Goes beyond simply predicting options in the predictive model and actually suggests a range of prescribed actions and the potential outcomes of each action. ... Predictive and prescriptive analytics are co-dependent disciplines that take business intelligence to unprecedented levels.

What association analysis is, the type of data it requires, and the types of business questions it can answer

What types of data instances tend to be associated with each other? Finding frequent patterns or associations among sets of items or objects in transaction databases, relational databases, and other information repositories

Business Intelligence

a set of theories, methodologies, architectures, and technologies that transform raw data into meaningful and useful information for business purposes

Predictive Analytics

a variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events

Unsupervised data

analysis refers to any data analysis that is descriptive in nature and in which there is no dependent variable (sometimes referred to as a target variable). Examples include clustering analysis, association rules analysis, and text mining. Generally speaking, unsupervised data analysis focuses on finding (non-causal) patterns within data.

Supervised data

analysis, on the other hand, includes a dependent (target) variable, which therefore makes it apply to any sort of analysis that can become predictive in nature. Forms of supervised data analysis include linear and logistic regression, decision tree analysis, linear discriminant analysis, neural network analysis, and random decision forests. Supervised data analysis techniques can be used both descriptively (e.g., "what was it that caused x to happen?") and/or predictively (e.g., "based on given factors, what do we expect the outcome of y to be?")

Consolidating

analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation

Prescriptive Analytics

automatically synthesizes big data, multiple disciplines of mathematical sciences and computational sciences, and business rules, to make predictions and then suggests decision options to take advantage of the predictions

Discrete

quantitative data that can be counted or measured only in whole numbers (integers)

K-means

clustering, a non-hierarchical technique, is the most commonly used clustering method in business analytics. Cluster analysis is performed on a table of raw data, where each row represents an object and the columns represent quantitative characteristics of the objects.
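
A bare-bones sketch of the k-means loop in Python, assuming Euclidean distance and seeds supplied by hand. It shows the behavior other cards in this set describe: seeds start as arbitrary values, move as the analysis advances, and stabilize. The points, seeds, and the name `k_means` are illustrative; real tools add refinements such as multiple random starts.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, seeds, max_iterations=100):
    centroids = [tuple(s) for s in seeds]
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:
            break  # centroids have stabilized
        centroids = new_centroids
    return centroids, clusters

# Hypothetical rows (two quantitative columns each) and arbitrary starting seeds.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
seeds = [(0, 0), (10, 10)]
centroids, clusters = k_means(points, seeds)
```

Each row ends up in exactly one cluster, which is what makes the method partitional rather than hierarchical.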

Direction

coefficient sign (+ / -) indicates the direction of the linear relationship

Strength

the coefficient's absolute size indicates the strength of the relationship: values near -1 or +1 are strong and values near 0 are weak (the coefficient itself ranges from -1 to +1)
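
A minimal Python sketch of Pearson's r, the correlation type these cards cover, showing sign as direction and absolute size as strength. The variable names and numbers are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation scaled by each variable's own variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(ss_x * ss_y)

# Hypothetical data.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]       # rises in lockstep with ad_spend
returns = [10, 8, 6, 4, 2]     # falls in lockstep with ad_spend
foot_traffic = [2, 1, 4, 3, 5] # rises with ad_spend, but not perfectly

r_perfect_positive = pearson_r(ad_spend, sales)
r_perfect_negative = pearson_r(ad_spend, returns)
r_strong_positive = pearson_r(ad_spend, foot_traffic)
```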

Categorical

data are data that belong to a category. They are often (but not always!) non-numerical values. Categorical data can be further broken down into three additional sub-types: Nominal, Dichotomous, Ordinal

Dichotomous

data refers to any variable for which the value must be one of two possible values, like an either/or situation. Such data are often represented as either a 0 or 1 or by using True/False values (sometimes even Yes/No). For instance: i. Married vs. Single ii. Dead vs. Alive iii. Bolivian vs. Not Bolivian iv. True vs. False

Quantitative

data that can be measured

Descriptive analytics

describes the use of a range of historic data to draw comparisons. Most commonly reported financial metrics are a product of descriptive analytics—for example, year-over-year pricing changes, month-over-month sales growth, the number of users, or the total revenue per subscriber.

Evaluation

during the evaluation phase, you take the analysis that was performed and create a plan.

N-Tile Values

include quartile, decile, and percentile values. They're sort of similar to median values, except they're looking at different "cuts" of the data after lining them up from smallest to largest.

Diagnostic analytics

is a form of advanced analytics which examines data or content to answer the question "Why did it happen?", and is characterized by techniques such as drill-down, data discovery, data mining and correlations.

Predictive analytics

is an area of statistics that deals with extracting information from data and using it to predict trends and behavior patterns. ... For example, "Predictive analytics—Technology that learns from experience (data) to predict the future behavior of individuals in order to drive better decisions."

Parsing

locating and identifying individual data elements in the source files and isolating these data elements in the target fields

Continuous

measured on a finer scale (age and height)

Non-hierarchical

methods divide a dataset of N objects into M clusters.

Correcting

modifying individual data components, manually if needed, so that they follow a consistent format, using both standard and custom business rules

descriptive statistics

numerical data used to measure and describe characteristics of groups. Includes measures of central tendency and measures of variation. · Central Tendency · Dispersion · Distribution

Nominal to Numerical

operator does just that. Any categorical variable that's subjected to its preprocessing powers will be transformed into a cavalcade of dummy variables, whose values will all be either 1 or 0.
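
The idea of dummy variables, sketched in plain Python rather than RapidMiner. The helper `to_dummies`, the column labels it produces, and the data are all hypothetical; RapidMiner's own column naming and coding options may differ.

```python
def to_dummies(values, name):
    """Turn one categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return [
        {f"{name} = {category}": int(value == category) for category in categories}
        for value in values
    ]

# Hypothetical categorical column.
regions = ["West", "East", "West", "South"]
dummy_rows = to_dummies(regions, "region")
```

Each row gets a 1 in exactly one of the new columns and 0 in the rest.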

Qualitative

or "categorical" data about qualities (eye color, softness)

linear correlation

or the extent to which two variables have a simple linear (straight line) relationship.

FP-Growth Operator

outputs frequency patterns that meet whatever criterion you establish; this just means that it identifies the itemsets that are present in a dataset and yields support values (so long as the support value is higher than the minimum requirement).
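
To see the kind of output described here, the Python sketch below finds every itemset that meets a minimum support. Note that it is a brute-force count over all item combinations, not the FP-Growth algorithm itself, which reaches the same result far more efficiently on large data. The transactions and the threshold are made up.

```python
from itertools import combinations

# Hypothetical market-basket data.
transactions = [
    {"flour", "eggs", "sugar"},
    {"flour", "eggs"},
    {"flour", "sugar"},
    {"eggs", "sugar"},
    {"milk"},
]

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset at or above min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(set(candidate) <= t for t in transactions)
            support = count / len(transactions)
            if support >= min_support:
                result[candidate] = support
    return result

frequent = frequent_itemsets(transactions, min_support=0.4)
```

The resulting itemsets and support values are what a rule-building step would then take as input.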

Central tendency

refers simply to the idea of understanding what's in "the middle" or what's at the center of a range of data values. It's a way of understanding what a "typical" value is for a given variable. We're going to discuss three measures of central tendency that are hopefully already familiar to you: the mean, median, and mode.

Matching

searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules

common sources of data

social networks; traditional business systems; internet of things

Data Mining

the computational process of discovering patterns in large data sets ("big data")

Business Analytics

the skills, technologies, applications, and practices for continuous, iterative exploration and investigation of past business performance to gain insight and drive business planning

Big Data

the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications

Descriptive Analytics

to gain insight from historical data with reporting, scorecards, clustering, etc.

