Business Intelligence Exam

options for dealing with missing values

eliminate data objects, estimate missing values (ie: take average of non-missing values), or ignore missing value during analysis
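A minimal sketch of the three options on a toy column. The `ages` list and the use of `None` as the missing-value marker are assumptions made for illustration only.

```python
from statistics import mean

# Hypothetical toy column; None marks a missing value (an assumption).
ages = [25, None, 31, 40, None]

# Option 1: eliminate the data objects that have a missing value.
dropped = [a for a in ages if a is not None]

# Option 2: estimate each missing value with the average of the non-missing ones.
fill_value = mean(dropped)
imputed = [fill_value if a is None else a for a in ages]

# Option 3: ignore missing values during the analysis itself,
# here by computing a statistic over the observed values only.
observed_max = max(a for a in ages if a is not None)

print(dropped)
print(imputed)
print(observed_max)
```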

a measure of the uncertainty associated with a random variable in information gain

entropy
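As a rough illustration, entropy over class labels can be computed as the sum of p_i * log2(1 / p_i), which equals -sum(p_i * log2(p_i)). The function name `entropy` and the sample labels are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    # p = n / total is the share of records in each class.
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))    # evenly mixed classes
print(entropy(["yes", "yes", "yes", "yes"]))  # a single class
```

An evenly mixed two-class set gives the maximum of 1 bit, while a set with a single class gives 0.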

what does predictive analytics enable?

data mining, text mining, web/media mining, forecasting

record data, transaction data, document data, graph data are all _________

data types

collection of data objects including their attributes

dataset

tree-structured plan of a set of attributes to test in order to predict an output

decision tree analysis

types of discriminative methods

decision tree, support vector machine, nearest neighbor

now the customer has a member card that he/she uses for every purchase; by compiling their transaction history (e.g., noticing that milk and salad were purchased together), the store can generate _________

information

typical purity functions

information gain, gain ratio, gini index
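One of these, the Gini index, is commonly computed as 1 minus the sum of squared class proportions. The sketch below assumes that standard definition; the function name `gini` and the labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_i ** 2)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini(["a", "a", "b", "b"]))  # evenly mixed node
print(gini(["a", "a", "a", "a"]))  # pure node
```

Lower values mean purer nodes; a node holding a single class scores 0.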

the store decides to give milk coupons to customers who also buy salad in order to increase sales; this represents _________________

intelligence

these features contain no information that is useful for the data mining task at hand

irrelevant features (ex: students' ID is often irrelevant for predicting students' GPA)

what does prescriptive analytics enable?

optimization, simulation, decision modeling, expert systems

can be ordered high to low (ex: final grade at the end of this course)

ordinal (categorical attribute)

data objects with characteristics that are considerably different than most of the other data objects in the dataset

outliers

business understanding phase of CRISP DM Process

outline objectives, requirements, convert information into an analytics problem, questions: What exactly do we want to do? How do we want to do it? What kinds of models should we apply? What kind of techniques will be required to complete models?

these features duplicate much or all of the information contained in one or more other attributes

redundant features (ex: price paid for a product and amount of sales tax paid)

main technique employed for data selection / often used for both the preliminary investigation of the data and the final data analysis

sampling

purpose of dimensionality reduction

-overcome curse of dimensionality -reduce amount of time and memory required by data mining algorithms -allow data to be more easily visualized -may help to eliminate irrelevant features or reduce noise

process of dealing with duplicate data

data cleaning

evaluation phase of CRISP DM

-Assess the model and the results in terms of accuracy and reliability -Validate that the model addresses the business problem defined earlier in the process

stop when these stopping conditions are met

-all records for a given node belong to the same class -there are no remaining attributes for further partitioning -there are no records left

what does CRISP DM Process stand for?

Cross Industry Standard Process for Data Mining

sigmoid function

1/(1+e^-y)
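A minimal sketch of the formula in Python. The function name `sigmoid` is an assumption; `y` is the same input as in the formula.

```python
from math import exp

def sigmoid(y):
    """Sigmoid function 1 / (1 + e^-y); maps any real y into (0, 1)."""
    # Note: a very large negative y would overflow exp(-y) in this naive form.
    return 1.0 / (1.0 + exp(-y))

print(sigmoid(0))
print(sigmoid(5), sigmoid(-5))
```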

artificial neural networks (ANN)

the artificial (computer-based) version of biological neural networks

in the information gain formula pi is

the probability that an arbitrary record in the dataset belongs to class i

key principle to sampling

the sample should be representative of the entire dataset

The set of records used for model construction

training set

special type of record data, where each record (____________) involves a set of items

transaction data (market basket)

how would you determine the best attribute at each iteration (in decision tree analysis)?

use purity metric to calculate homogeneity (choose the attribute that produces the "purest" nodes)
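A sketch of that selection step, assuming information gain (entropy before the split minus the weighted entropy after it) as the purity metric. The records, the attribute names `home` and `student`, and the helper names are all hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy of the target before the split minus weighted entropy after it."""
    before = entropy([r[target] for r in records])
    # Group the target labels by the value of the candidate attribute.
    partitions = defaultdict(list)
    for r in records:
        partitions[r[attribute]].append(r[target])
    after = sum(len(p) / len(records) * entropy(p) for p in partitions.values())
    return before - after

# Hypothetical training records; "cheat" is the target attribute.
records = [
    {"home": "yes", "student": "yes", "cheat": "no"},
    {"home": "yes", "student": "no", "cheat": "no"},
    {"home": "no", "student": "yes", "cheat": "yes"},
    {"home": "no", "student": "no", "cheat": "yes"},
]

# Choose the attribute whose split produces the purest nodes.
best = max(["home", "student"], key=lambda a: information_gain(records, a, "cheat"))
print(best)
```

Here `home` separates the classes perfectly, while `student` leaves each partition as mixed as the original set.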

descriptive analytics task: the study of the visual representation of data and of the relationships between attributes. main goal: communicate information clearly through graphic means

visualization

outcome of descriptive analytics

well-defined business problems and opportunities (ex: management reports providing info about sales)

target attribute

what we are trying to assign a label to (for ex, will they cheat: NO/YES)

modeling phase of CRISP DM

Apply data mining / analytics techniques to the final data set to answer the business question

other methodologies besides CRISP DM Process for data mining (data analytics)

KDD (knowledge discovery in databases), SEMMA

Data --> ________ --> ________ --> Intelligence

Data ---> INFORMATION ---> KNOWLEDGE ---> Intelligence

types of regression based methods

logistic regression, artificial neural networks

scale in the range of [0,1]

normalization
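A common way to do this is min-max scaling, (v - min) / (max - min). The sketch below assumes that variant; the function name and the handling of a constant column are illustrative choices.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no spread, so map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 20, 30]))
```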

purpose of aggregation

-Data reduction --> Reduce the number of attributes or objects --> Curse of dimensionality -Change the scale --> Cities aggregated into regions, states, countries, etc. -More "stable" data --> Aggregated data tends to have less variability

Querying & reporting; online analytical processing (OLAP); alerts and notifications; business analytics; all of these are a form of ____________

Business Intelligence

can answer questions like WHY something is happening / if trends continue, what will happen next & how we can optimize this decision

Business analytics

A retailer operating in the Midwest US in the early 2000s realized the value of analytics and decided to investigate customers' previous transactions to make better business decisions and identify purchase patterns. It applied association rule mining and found that when customers purchased diapers on Thursdays and Saturdays, they also commonly purchased beer. It then rethought some of its supply chain / shelf operations: it placed the beer and diaper aisles as far away from each other as possible (making customers walk through the entire store and buy more items) and made sure beer and diapers were sold at full price on Thursdays and Saturdays. Company? Data? Operations? Outcome?

Company: grocery store chain Data: purchasing transaction records Operations: shelf and pricing management Outcome: increased revenue

data preparation phase of CRISP DM Process

Convert initial raw data into final form (to be fed into the model) --> data transformation, cleaning, selection (ex: converting into table format, discretizing)

data understanding phase of CRISP DM Process

Identify the raw data set(s) that can help solve the problem from the previous step --> collect data, study strengths and limitations of the data, decide whether further investment is required

deployment phase of CRISP DM

Put the model into real use in order to realize a return on investment -->Generate reports -->Set up an automated, repeatable process -->Often returns to business understanding stage / new ideas for improving business performance

what factors are used to determine which classification method (discriminative, regression based, or probabilistic) to use?

accuracy, robustness, scalability, interpretability

outcome of predictive analytics

accurate projections of future states and conditions (ex: credit scoring)

Combining two or more objects (or attributes) into a single object (or attribute)

aggregation
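For instance, city-level rows can be rolled up into one row per region. The sales figures and names below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (city, region, sales) records.
sales = [
    ("Chicago", "Midwest", 120),
    ("Detroit", "Midwest", 80),
    ("Boston", "Northeast", 100),
]

# Combine the city-level objects into a single object per region.
totals = defaultdict(int)
for city, region, amount in sales:
    totals[region] += amount

print(dict(totals))
```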

step-by-step procedure used to implement a particular analytics technique

algorithm

C5.0, K-means, backpropagation are examples of ____________

algorithms

analytics task used for discovering interesting relationships between different variables in large transactional datasets (applicable in Grocery Store example)

association rule mining

a property or characteristic of an object

attribute

a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values

attribute transformation

age of a person, eye color are examples of ____________

attributes

sample training dataset contains

attributes and a target attribute

outcomes of prescriptive analytics

best possible business decisions and transactions

what are neural networks (NN)?

biologically inspired computer models that operate like a human brain -can learn from the data and recognize patterns -can be used for classification/prediction

Technologies and techniques to transform raw data into meaningful and useful insights for business purposes/ aka computer based intelligence

business intelligence

steps of CRISP DM Process

business understanding --> data understanding --> data preparation --> modeling --> evaluation --> deployment (may need to move back and forth)

-Has only a finite or countable set of values -Examples: eye color, letter grade, zip code -Can be different sub-types: nominal, ordinal and flag -Can be stored as string or numerical

categorical

classification has _____ class labels

categorical

type of attribute that has only a finite or countable set of values; can be stored as a string or a numerical value and can be different sub-types: nominal or ordinal

categorical attribute

eye color, letter grade, IP address, zip code are examples of

categorical attributes

2 types of attributes

categorical, continuous

the task of mapping an input attribute vector into its class label

classification

which analytics task corresponds to this example: predicting whether a potential customer will actually buy a product (class attribute: buyer/non-buyer; attribute to help predict: income of a person)

classification

analytics task that involves finding a model that distinguishes between a discrete number of classes in the data / used to predict the class of an observation

classification (predictive)

analytics task that is descriptive, involving grouping of observations in the data in such a way that the observations in one group are more similar to each other than those in different groups/ analyzes data observations without having their known class labels (creating own class labels)

clustering

classification, visualization, regression, graph mining, association rule mining, clustering

common analytics tasks

-Quantitative attributes -Examples: weight, number of students, temperature -Can be stored as numerical (real or integer)

continuous

quantitative attribute, can be stored as numerical (real or integer)

continuous attribute

weight, temperature, revenue, height are examples of _________

continuous attributes

prediction has _________ functions

continuous-valued

when data becomes increasingly sparse in the space it occupies; definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

curse of dimensionality

Customer completes a checkout; the transaction is entered into the database as _______

data

as dimensionality increases

data becomes increasingly sparse in the space it occupies (high dimensional data --> objects are sparser and dissimilar)

analytics type that concerns what has happened / what is happening; enables business reporting, dashboards, scoreboards, data warehousing

descriptive analytics (past)

three analytics categories

descriptive, predictive, prescriptive

converting continuous attributes to discrete ones (ie: for monetary income: consider everyone making less than 80,000 = low income, anyone making more than 80,000= high income --> now discrete with only 2 possible outcomes)

discretization
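The income example can be sketched as a single threshold test. The function name and the treatment of exactly 80,000 as high income are assumptions, since the card leaves that boundary case open.

```python
def discretize_income(income, threshold=80_000):
    """Map a continuous income to one of two discrete labels."""
    # Assumption: an income exactly at the threshold counts as "high".
    return "low" if income < threshold else "high"

incomes = [50_000, 79_999, 120_000]
print([discretize_income(x) for x in incomes])
```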

types of classification methods

discriminative, regression based, probabilistic (bayesian)

data type where each document becomes a word vector; each word is a component (attribute) of the vector (the value of each component is the number of times the corresponding word occurs in the document)

document data
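A small sketch of turning documents into word vectors. The two sample sentences, the whitespace tokenization, and the function name `term_vectors` are assumptions for illustration.

```python
from collections import Counter

def term_vectors(docs):
    """Return the sorted vocabulary and one word-count vector per document."""
    # Each distinct word becomes one component (attribute) of the vector.
    vocab = sorted({word for doc in docs for word in doc.split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())
        # The component value is how often the word occurs in this document.
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = term_vectors(["the team won the game", "the coach lost the ball"])
print(vocab)
print(vectors)
```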

-At start, all training records are at the root -Records are partitioned based on splitting attribute and its condition --> Splitting attributes (and their split conditions, if needed) are selected on the basis of a heuristic or statistical measure (attribute selection measure)

generic greedy algorithm

transform into machine-readable format via, for example, an adjacency matrix (e.g., HTML links between different websites)

graph data
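A minimal sketch of building an adjacency matrix from links between websites. The site names and links are hypothetical, and the links are treated as directed (row = source, column = destination).

```python
# Hypothetical directed links: (source site, destination site).
links = [("a.com", "b.com"), ("a.com", "c.com"), ("c.com", "a.com")]

# Give every site a fixed row/column position.
nodes = sorted({site for link in links for site in link})
index = {site: i for i, site in enumerate(nodes)}

# matrix[i][j] == 1 means site i links to site j.
matrix = [[0] * len(nodes) for _ in nodes]
for src, dst in links:
    matrix[index[src]][index[dst]] = 1

print(nodes)
for row in matrix:
    print(row)
```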

analytics task used as structured form of data mining, looking for interrelations and connections between data observations

graph mining (dataset is in graph form)

splitting the records based on an attribute test that optimizes certain criterion - in a divide and conquer manner

greedy strategy (decision tree)

what makes a sample representative?

if it has approximately the same property (of interest) as the original set of data

if increase size of hidden layers

training is harder to converge, but the additional neurons may let the network capture more obscure information

now a number of customers have historical records, and patterns are recognized; customers buy milk, then salad in the next two transactions; this pattern, not known before, is now new ___________

knowledge

Two-step process involves

learn the model (aka model construction), then apply the model (aka model usage). Existing data --> apply the algorithm to the data to learn the model --> the model fits the relationship between the attributes and the class labels --> new records --> model application (apply the new values to the model; the predicted class label is the outcome)

the process of providing an algorithm with a dataset (records in the data) in order to address the analytics problem

learning

the outcome of the learning process (patterns, rules, parameters, etc.)

model

general structure of ANN

multiple layers: -input layer -hidden layer -output layer; neurons (i.e., nodes); weights (Wij); bias values (θj)
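As a rough sketch of how these pieces fit together, a single neuron can be modeled as the weighted sum of its inputs plus its bias, passed through an activation function. The sigmoid activation and all names and numbers below are assumptions for illustration.

```python
from math import exp

def neuron_output(inputs, weights, bias):
    """Output of one neuron: sigmoid(sum(w_i * x_i) + bias)."""
    # Weighted sum of the inputs plus the neuron's bias value.
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Assumption: sigmoid is used as the activation function.
    return 1.0 / (1.0 + exp(-net))

# Hypothetical inputs and weights whose weighted sum is exactly zero.
print(neuron_output([1.0, 2.0], [0.5, -0.25], 0.0))
```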

random error or variance in the original value

noise

each row is one ___________

object / case / record / observation (set of measurements for one entity)

sampling is important for data miners because ...

obtaining the entire dataset of interest is too expensive or time consuming; processing the entire dataset of interest is too expensive or time consuming

final class label is the

predicted class of the entity

analytics type that concerns what WILL happen / why it will happen (estimation regarding likelihood of outcome)

predictive analytics

analytics type that concerns "what should I do/ why should I do it" / foresees, forecasts, recommends

prescriptive analytics

data consists of collection of records, each of which consists of a fixed set of attributes

record data (relational form)

analytics task that predicts value of continuous attribute (ex: predict income level, continuous/numeric attribute)

regression

zero mean and unit variance

standardization
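A minimal z-score sketch: subtract the mean and divide by the standard deviation. Using the population standard deviation (`pstdev`) rather than the sample one is an assumption, as are the function name and the data.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]

z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(z)
```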

