Business Intelligence Exam
options for dealing with missing values
eliminate data objects, estimate missing values (e.g., take the average of the non-missing values), or ignore the missing values during analysis
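A minimal sketch of the three options, assuming a hypothetical `ages` column in which `None` marks a missing value (the data and variable names are illustrative only):

```python
from statistics import mean

# Hypothetical column of ages; None marks a missing value.
ages = [25, None, 31, 40, None, 28]

# Option 1: eliminate the data objects that have a missing value.
dropped = [a for a in ages if a is not None]

# Option 2: estimate each missing value with the mean of the non-missing values.
fill = mean(dropped)
imputed = [a if a is not None else fill for a in ages]

# Option 3: ignore missing values during the analysis itself
# (here, while computing a maximum).
oldest = max(a for a in ages if a is not None)
```

Which option fits best depends on how many values are missing and how much the analysis can tolerate estimated values.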
a measure of the uncertainty associated with a random variable in information gain
entropy
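As a rough illustration, entropy can be computed from the class proportions in a list of labels. The function name `entropy` and the use of base-2 logarithms (bits) are assumptions of this sketch:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum -p * log2(p) over the proportion p of each class.
    return sum(-(n / total) * log2(n / total) for n in counts.values())

mixed = entropy(["yes", "yes", "no", "no"])   # evenly split: maximum uncertainty
pure = entropy(["yes", "yes", "yes", "yes"])  # one class only: no uncertainty
```

An evenly split two-class set gives 1 bit, and a set with a single class gives 0.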
what does predictive analytics enable?
data mining, text mining, web/media mining, forecasting
record data, transaction data, document data, graph data are all _________
data types
collection of data objects including their attributes
dataset
tree-structured plan for testing a set of attributes to arrive at an output
decision tree analysis
types of discriminative methods
decision tree, support vector machine, nearest neighbor
the customer now has a member card that they use for each purchase; by compiling their transaction history (e.g., noticing that milk and salad are purchased together), the store can generate _________
information
typical purity functions
information gain, gain ratio, gini index
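Of the three purity functions, the Gini index is the simplest to sketch. The version below assumes the common definition, 1 minus the sum of squared class proportions; the function name `gini` is illustrative:

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

split_evenly = gini(["a", "a", "b", "b"])  # most impure two-class node
single_class = gini(["a", "a", "a", "a"])  # perfectly pure node
```

Lower values mean purer nodes; a node with a single class scores 0.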
the store decides to give coupons to customers for milk who also buy salad in order to increase sales; this represents _________________
intelligence
these features contain no information that is useful for the data mining task at hand
irrelevant features (ex: students' ID is often irrelevant for predicting students' GPA)
what does prescriptive analytics enable?
optimization, simulation, decision modeling, expert systems
can be ordered high to low (ex: final grade at the end of this course)
ordinal (categorical attribute)
data objects with characteristics that are considerably different than most of the other data objects in the dataset
outliers
business understanding phase of CRISP DM Process
outline objectives, requirements, convert information into an analytics problem, questions: What exactly do we want to do? How do we want to do it? What kinds of models should we apply? What kind of techniques will be required to complete models?
these features duplicate much or all of the information contained in one or more other attributes
redundant features (ex: price paid for a product and amount of sales tax paid)
main technique employed for data selection / often used for both the preliminary investigation of the data and the final data analysis
sampling
purpose of dimensionality reduction
-overcome curse of dimensionality -reduce amount of time and memory required by data mining algorithms -allow data to be more easily visualized -may help to eliminate irrelevant features or reduce noise
process of dealing with duplicate data
data cleaning
evaluation phase of CRISP DM
-Assess the model and the results in terms of accuracy and reliability -Validate that the model addresses the business problem defined earlier in the process
stop when these stopping conditions are met
-all records for a given node belong to the same class -there are no remaining attributes for further partitioning -there are no records left
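The three stopping conditions can be written as a single check. This is only a sketch: `should_stop`, `labels`, and `remaining_attributes` are hypothetical names, and `labels` is assumed to hold the class labels of the records at the current node:

```python
def should_stop(labels, remaining_attributes):
    """Return True when a decision-tree node should not be split further."""
    if not labels:
        return True   # there are no records left
    if len(set(labels)) == 1:
        return True   # all records at this node belong to the same class
    if not remaining_attributes:
        return True   # no remaining attributes for further partitioning
    return False
```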
what does CRISP DM Process stand for?
Cross Industry Standard Process for Data Mining
sigmoid function
1/(1+e^-y)
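A direct translation of the formula, as a naive sketch (it does not guard against overflow for very large negative `y`):

```python
from math import exp

def sigmoid(y):
    """Squash any real number y into the open interval (0, 1)."""
    return 1 / (1 + exp(-y))

midpoint = sigmoid(0)  # 1 / (1 + 1) = 0.5
```

Large positive inputs approach 1 and large negative inputs approach 0, which is why the function is used to turn a weighted sum into a probability-like output.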
artificial neural networks (ANN)
the artificial (computer-based) version of neural networks (NN)
in the information gain formula, p_i is
the probability that an arbitrary record in the dataset belongs to class i
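For reference, the formula this card refers to is usually written as follows. The symbols D (the dataset), m (the number of classes), A (the splitting attribute), and D_v (the subset of D where A takes value v) are assumed notation, not taken from the cards:

```latex
\mathrm{Entropy}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i
\qquad
\mathrm{Gain}(A) = \mathrm{Entropy}(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\,\mathrm{Entropy}(D_v)
```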
key principle to sampling
the sample should be representative of the entire dataset
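One way to keep a sample representative with respect to a property of interest is stratified sampling, sketched below. Stratification is an assumption of this example (the cards do not name a specific technique), and the records and the `stratified_sample` helper are hypothetical:

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from each group so group proportions are preserved."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for members in groups.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical dataset: 20 buyers and 80 non-buyers.
records = [(i, "buyer" if i < 20 else "non-buyer") for i in range(100)]
sample = stratified_sample(records, key=lambda r: r[1], fraction=0.1)
```

The 10% sample keeps the original 20/80 split between buyers and non-buyers, so the property of interest is approximately the same in the sample as in the full dataset.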
the set of records (the training dataset) used for model construction
training set
special type of record data, where each record (____________) involves a set of items
transaction data (market basket)
how would you determine the best attribute at each iteration (in decision tree analysis)?
use purity metric to calculate homogeneity (choose the attribute that produces the "purest" nodes)
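A small sketch of this selection step, assuming information gain as the purity metric. The records, attribute names, and helper functions are all hypothetical:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy of the parent node minus the weighted entropy of its child nodes."""
    groups = defaultdict(list)
    for record in records:
        groups[record[attribute]].append(record[target])
    total = len(records)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy([r[target] for r in records]) - remainder

def best_attribute(records, attributes, target):
    """Pick the attribute whose split produces the purest child nodes."""
    return max(attributes, key=lambda a: information_gain(records, a, target))

# Hypothetical training records; "cheat" is the target attribute.
records = [
    {"refund": "yes", "married": "no", "cheat": "no"},
    {"refund": "yes", "married": "yes", "cheat": "no"},
    {"refund": "no", "married": "no", "cheat": "yes"},
    {"refund": "no", "married": "yes", "cheat": "yes"},
]
chosen = best_attribute(records, ["refund", "married"], "cheat")
```

Here splitting on `refund` yields two pure nodes, while splitting on `married` leaves both nodes evenly mixed, so `refund` is chosen.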
descriptive analytics task: the study of visual representations of data and of relationships between attributes. main goal: communicate information clearly through graphic means
visualization
outcome of descriptive analytics
well-defined business problems and opportunities (ex: management reports providing info about sales)
target attribute
what we are trying to assign a label to (for ex, will they cheat: NO/YES)
modeling phase of CRISP DM
Apply data mining / analytics techniques to the final data set to answer the business question
other methodologies besides CRISP DM Process for data mining (data analytics)
KDD (knowledge discovery in databases), SEMMA
Data --> ________ --> ________ --> Intelligence
Data ---> INFORMATION ---> KNOWLEDGE ---> Intelligence
types of regression based methods
logistic regression, artificial neural networks
scale in the range of [0,1]
normalization
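A minimal sketch of min-max normalization, one common way to rescale values into [0, 1]. The function name and the handling of a constant column are assumptions of this example:

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no spread, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_normalize([10, 20, 30, 50])
```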
purpose of aggregation
-Data reduction --> Reduce the number of attributes or objects --> Curse of dimensionality -Change the scale --> Cities aggregated into regions, states, countries, etc. -More "stable" data --> Aggregated data tends to have less variability
Querying & reporting; online analytical processing (OLAP); alerts and notifications; business analytics; all of these are a form of ____________
Business Intelligence
can answer questions like WHY something is happening / if trends continue, what will happen next & how we can optimize this decision
Business analytics
A retailer operating in the Midwest US in the early 2000s realized the value of analytics and decided to investigate customers' previous transactions to make better business decisions and identify purchase patterns. It applied an association rule mining method and found that when customers purchased diapers on Thursdays and Saturdays, they also commonly purchased beer. It rethought some of its supply chain / shelf operations: it separated the beer and diaper aisles as far from each other as possible (making customers walk through the entire store, leading them to buy more items) and made sure beer and diapers were sold at full price on Thursdays and Saturdays. Company? Data? Operations? Outcome?
Company: grocery store chain Data: purchasing transaction records Operations: shelf and pricing management Outcome: increased revenue
data preparation phase of CRISP DM Process
Convert initial raw data into final form (to be fed into the model) --> data transformation, cleaning, selection (ex: converting into table format, discretizing)
data understanding phase of CRISP DM Process
Identify the raw data set(s) that can help solve the problem from the previous step --> collect data, study strengths and limitations of the data, decide whether further investment is required
deployment phase of CRISP DM
Put the model into real use in order to realize a return on investment -->Generate reports -->Set up an automated, repeatable process -->Often returns to business understanding stage / new ideas for improving business performance
what factors are used to determine which classification method (discriminative, regression based, or probabilistic) to use?
accuracy, robustness, scalability, interpretability
outcome of predictive analytics
accurate projections of future states and conditions (ex: credit scoring)
Combining two or more objects (or attributes) into a single object (or attribute)
aggregation
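As an illustration, the sketch below combines city-level objects into state-level objects by summing a sales figure. The data and names are hypothetical:

```python
from collections import defaultdict

# Hypothetical city-level records: (city, state, sales).
sales = [
    ("Chicago", "IL", 120),
    ("Springfield", "IL", 30),
    ("Columbus", "OH", 80),
    ("Cleveland", "OH", 70),
]

# Aggregate cities into states: four objects become two.
by_state = defaultdict(int)
for city, state, amount in sales:
    by_state[state] += amount
```

The result has fewer objects and a coarser scale than the original data.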
step-by-step procedure used to implement a particular analytics technique
algorithm
C5.0, K-means, backpropagation are examples of ____________
algorithms
analytics task used for discovering interesting relationships between different variables in large transactional datasets (applicable in Grocery Store example)
association rule mining
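Association rules are commonly scored with support and confidence. These two measures are not defined in the cards, so treat the sketch below as background; the transactions are hypothetical:

```python
# Hypothetical market-basket transactions.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "salad"},
    {"diapers", "milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
```

In this toy data, diapers and beer appear together in half of the transactions, and two of the three diaper purchases also include beer.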
a property or characteristic of an object
attribute
a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
attribute transformation
age of a person, eye color are examples of ____________
attributes
sample training dataset contains
attributes and a target attribute
outcomes of prescriptive analytics
best possible business decisions and transactions
what are neural networks (NN)?
biologically inspired computer models that operate like a human brain -can learn from the data and recognize patterns -can be used for classification/prediction
Technologies and techniques to transform raw data into meaningful and useful insights for business purposes/ aka computer based intelligence
business intelligence
steps of CRISP DM Process
business understanding --> data understanding --> data preparation --> modeling --> evaluation --> deployment (may need to move back and forth)
-Has only a finite or countable set of values -Examples: eye color, letter grade, zip code -Can be different sub-types: nominal, ordinal and flag -Can be stored as string or numerical
categorical
classification has _____ class labels
categorical
type of attribute that has only a finite or countable set of values; can be stored as string or numerical, and can be of different sub-types: nominal or ordinal
categorical attribute
eye color, letter grade, IP address, zip code are examples of
categorical attributes
2 types of attributes
categorical, continuous
the task of mapping an input attribute vector into its class label
classification
which analytics task corresponds to this example: predicting whether a potential customer will actually buy a product (class attribute: buyer/non-buyer; attribute to help predict: income of a person)
classification
analytics task that involves finding a model that distinguishes between a discrete number of data classes / used to predict the class of an observation
classification (predictive)
analytics task that is descriptive, involving grouping of observations in the data in such a way that the observations in one group are more similar to each other than those in different groups/ analyzes data observations without having their known class labels (creating own class labels)
clustering
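A rough one-dimensional k-means sketch shows how groups can be formed without known class labels. K-means is used here only as an example algorithm; the data, starting centroids, and function name are assumptions:

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[0, 5])
```

The observations end up in two groups whose members are closer to each other than to members of the other group.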
classification, visualization, regression, graph mining, association rule mining, clustering
common analytics tasks
-Quantitative attributes -Examples: weight, number of students, temperature -Can be stored as numerical (real or integer)
continuous
quantitative attribute, can be stored as numerical (real or integer)
continuous attribute
weight, temperature, revenue, height are examples of _________
continuous attributes
prediction has _________ functions
continuous-valued
when data becomes increasingly sparse in the space it occupies; the definitions of density and of distance between points, which are critical for clustering and outlier detection, become less meaningful
curse of dimensionality
Customer completes a checkout; the transaction goes into the database as _______
data
as dimensionality increases
data becomes increasingly sparse in the space it occupies (high dimensional data --> objects are sparser and dissimilar)
analytics type that concerns what has happened / what is happening; enables business reporting, dashboards, scoreboards, data warehousing
descriptive analytics (past)
three analytics categories
descriptive, predictive, prescriptive
converting continuous attributes to discrete ones (ie: for monetary income: consider everyone making less than 80,000 = low income, anyone making more than 80,000= high income --> now discrete with only 2 possible outcomes)
discretization
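The income example can be sketched as a one-line rule. The function name is illustrative, and treating an income of exactly 80,000 as "high" is an assumption, since the card does not say which side the boundary falls on:

```python
def discretize_income(income, threshold=80_000):
    """Map a continuous income to one of two discrete labels."""
    # Assumption: an income exactly at the threshold counts as "high".
    return "low" if income < threshold else "high"

labels = [discretize_income(x) for x in (35_000, 79_999, 120_000)]
```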
types of classification methods
discriminative, regression based, probabilistic (bayesian)
data type where each document becomes a word vector; each word is a component (attribute) of the vector (the value of each component is the number of times the corresponding word occurs in the document)
document data
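A small sketch of turning documents into word vectors. The two documents are made up, and splitting on whitespace is a simplifying assumption (real text would need tokenization and normalization):

```python
from collections import Counter

# Hypothetical documents.
docs = ["data mining finds patterns in data", "text mining mines text"]

# Each distinct word becomes one component (attribute) of the vector.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Each component's value is the number of times that word occurs in the document.
vectors = [[Counter(doc.split())[word] for word in vocabulary] for doc in docs]
```

Each row of `vectors` lines up with `vocabulary`, so the collection can be handled like ordinary record data.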
-At start, all training records are at the root -Records are partitioned based on splitting attribute and its condition --> Splitting attributes (and their split conditions, if needed) are selected on the basis of a heuristic or statistical measure (attribute selection measure)
generic greedy algorithm
transform into a machine-readable format via, for example, an adjacency matrix (assume HTML links between different websites)
graph data
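A minimal sketch of building an adjacency matrix from links between pages. The page names and links are hypothetical:

```python
# Hypothetical web pages and directed links (source, destination).
pages = ["A", "B", "C"]
links = [("A", "B"), ("A", "C"), ("C", "A")]

index = {page: i for i, page in enumerate(pages)}

# matrix[i][j] == 1 means page i links to page j.
matrix = [[0] * len(pages) for _ in pages]
for src, dst in links:
    matrix[index[src]][index[dst]] = 1
```

Row A has ones in the B and C columns, row B is all zeros because B links nowhere, and row C has a one in the A column.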
analytics task used as structured form of data mining, looking for interrelations and connections between data observations
graph mining (dataset is in graph form)
splitting the records based on an attribute test that optimizes certain criterion - in a divide and conquer manner
greedy strategy (decision tree)
what makes a sample representative?
if it has approximately the same property (of interest) as the original set of data
if the size of the hidden layers is increased
training is harder to converge, but the additional neurons may capture more obscure information
now a number of customers have historical records, and patterns are recognized; customers buy milk, then salad in the next two transactions; this pattern, not known before, is now new ___________
knowledge
Two-step process involves
learn the model (aka model construction), then apply the model (aka model usage). Existing data --> apply an algorithm to the data to learn the model --> the model fits the relationship between the attributes and the class labels --> new actual values --> model application (feed the values to the model; the outcome is the predicted class label)
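The two steps can be sketched with a 1-nearest-neighbor classifier, chosen here only because it is short. The training data and the function names `learn` and `apply_model` are hypothetical:

```python
def learn(training_records):
    """Step 1, model construction: for 1-nearest-neighbor the model is the stored records."""
    return list(training_records)

def apply_model(model, income):
    """Step 2, model usage: predict the class label of the closest training record."""
    nearest = min(model, key=lambda record: abs(record[0] - income))
    return nearest[1]

# Hypothetical existing data: (income, class label).
training = [
    (20_000, "non-buyer"),
    (35_000, "non-buyer"),
    (90_000, "buyer"),
    (120_000, "buyer"),
]
model = learn(training)
prediction = apply_model(model, 100_000)
```

Other classification methods do more work in the learning step, but the same learn-then-apply split holds.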
the process of providing an algorithm with a dataset (the records in the data) in order to address the analytics problem
learning
the outcome of the learning process (patterns, rules, parameters, etc.)
model
general structure of ANN
multiple layers: -input layer -hidden layer -output layer; neurons (i.e., nodes); weights (Wij); bias values (θj)
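A forward pass through this structure can be sketched as follows. The weights and biases are arbitrary illustrative numbers, the sigmoid activation is an assumption, and the index convention `weights[j][i]` (weight from input i to neuron j) is chosen for this example only:

```python
from math import exp

def sigmoid(y):
    return 1 / (1 + exp(-y))

def layer(inputs, weights, biases):
    """Output of one layer: each neuron applies sigmoid to its weighted sum plus bias."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Input layer (2 values) --> hidden layer (2 neurons) --> output layer (1 neuron).
inputs = [1.0, 0.0]
hidden = layer(inputs, weights=[[0.5, -0.5], [0.3, 0.8]], biases=[0.0, -0.1])
output = layer(hidden, weights=[[1.0, -1.0]], biases=[0.0])
```

Training (for example with backpropagation) would adjust the weights and biases; this sketch only shows how values flow from the input layer to the output layer.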
random error or variance in the original value
noise
each row is one ___________
object / case / record / observation (set of measurements for one entity)
sampling is important for data miners because ...
obtaining the entire dataset of interest is too expensive or time consuming; processing the entire dataset of interest is too expensive or time consuming
final class label is the
predicted class of the entity
analytics type that concerns what WILL happen / why it will happen (estimation regarding likelihood of outcome)
predictive analytics
analytics type that concerns "what should I do/ why should I do it" / foresees, forecasts, recommends
prescriptive analytics
data consists of collection of records, each of which consists of a fixed set of attributes
record data (relational form)
analytics task that predicts value of continuous attribute (ex: predict income level, continuous/numeric attribute)
regression
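A simple least-squares line fit is one way to predict a continuous attribute. This is a sketch; the data and the function name `fit_line` are hypothetical:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that lies exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = slope * 5 + intercept  # prediction for an unseen x = 5
```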
zero mean and unit variance
standardization
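A minimal z-score sketch, assuming the population standard deviation and a column whose values are not all identical (otherwise the division by zero fails):

```python
from statistics import mean, pstdev

def standardize(values):
    """Shift and scale values so they have zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation; must be non-zero
    return [(v - mu) / sigma for v in values]

z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```

Unlike scaling into [0, 1], the standardized values are not bounded; they express how many standard deviations each value lies from the mean.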