Bus 491 Study guide
Numeric (Interval or ratio variables)
-Continuous -Integer -Most algorithms can handle numeric data -May occasionally need to "bin" into categories
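A minimal sketch of such binning in Python with pandas (the column name and cut points below are hypothetical, just for illustration):

```python
import pandas as pd

# Hypothetical data: a numeric "income" column to be binned into three categories
df = pd.DataFrame({"income": [28_000, 54_000, 91_000, 140_000, 37_000]})

# pd.cut assigns each value to a labeled interval ("bin")
df["income_band"] = pd.cut(df["income"],
                           bins=[0, 40_000, 100_000, float("inf")],
                           labels=["low", "medium", "high"])
print(df)
```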
Research questions for One-Way ANOVA
-Do accountants, on average, earn more than teachers? -Do people taking one of two new drugs have higher average T-cell counts than people in the control group? -Do people spend different amounts depending on what kind of credit card they use? -Does the type of fertilizer used affect the average weight of garlic grown at the Montana Gourmet Garlic Ranch?
Imputation
-Fill in the missing values with some reasonable value -Example: the mean within homogenous groups of the data -Categorical variables: separate category
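A minimal pandas sketch of mean imputation within homogeneous groups (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data: impute missing "age" with the mean age within each "segment"
df = pd.DataFrame({"segment": ["A", "A", "B", "B", "B"],
                   "age": [34, None, 51, None, 47]})

group_means = df.groupby("segment")["age"].transform("mean")
df["age"] = df["age"].fillna(group_means)

# For a categorical variable, missing values could instead become their own category:
# df["color"] = df["color"].fillna("Missing")
print(df)
```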
Unsupervised: Clustering
-Goal: Form groups (clusters) of similar records -Example: Cluster your customers into groups with similar demographic attributes -Each row is a case (customer, tax return, insurance claim) -Each column is an attribute (height, weight, hair color) -Often used as an intermediate step that leads to supervised learning/predictive analysis
Supervised: Classification
-Goal: Predict categorical target (outcome) variable -Examples: purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy... -Each row is a case (customer, tax return, application...) -Each column is a variable (predictor, target ...) -Target variable is often binary (yes/no)
Supervised: Prediction
-Goal: Predict numeric target (outcome) variable -Examples: sales, revenue, performance -As in classification: -Each row is a case (customer, tax return, application...) -Each column is a variable (predictor, target ...) -Taken together, classification and prediction constitute "predictive analytics" (narrow definition)
Unsupervised: Association Rules
-Goal: Produce rules that define "what goes with what" -Example: "If X was purchased, Y was also purchased" -Rows are transactions -Used in recommender systems - "Our records show you bought X, you may also like Y" -Also called "affinity analysis"
Used to examine the distribution of data values
-Histograms -Normal probability plots -Box plots
Why might having too many variables not be desirable?
-It may be expensive or not feasible to collect all predictors for future predictions -We may be able to measure fewer predictors more accurately -The more predictors, the higher the chance of missing data -Parsimony is an important property of good models -Estimates of regression coefficients are likely to be unstable, due to multicollinearity in models with many variables -Bias-variance trade-off
Purposes of ensuring data quality
-Minimize IT project risk -Make timely business decisions -Ensure regulatory compliance -Expand customer base
Categorical
-Ordered (low, medium, high), also called ordinal variables -Unordered (male, female), also called nominal variables -Naïve Bayes can use as-is -In most other algorithms, must create binary dummies
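A minimal sketch of creating binary dummies with pandas (the column name is hypothetical):

```python
import pandas as pd

# Hypothetical data: an unordered (nominal) predictor
df = pd.DataFrame({"hair_color": ["brown", "black", "blond", "brown"]})

# Most algorithms need binary dummies; drop_first avoids a redundant column
dummies = pd.get_dummies(df, columns=["hair_color"], drop_first=True)
print(dummies)
```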
Applicable data mining situations
-Predicting customer activity on credit cards from their demographics -Predicting the time to failure of equipment based on utilization and environmental conditions -Predicting expenditures on vacation travel based on historical frequent flyer data -Predicting staffing requirements at help desks based on historical data and product sales information -Predicting the impact of discounts on sales in retail outlets
Complete-Case Analysis (Omission)
-Use only the cases that have complete records in the analysis -Disadvantage: may lead to a disastrous reduction in data -Works if only a small number of records or a small set of variables have missing values
Unsupervised learning algorithms
-Used where there is no outcome variable to predict or classify -No "learning" from cases where such an outcome variable is known -Examples: association rules, dimension reduction methods, and clustering.
Categorical variables
-can be either numerical or text -can be ordered or unordered -can have categories such as high value, low value, and nil value -require special handling
Continuous variables
-can be handled by most data mining routines -in XLMiner all routines take _______ with the exception of the Bayes classifier
What to do in the case of missing data
1. Drop the records or 2. Impute a value or 3. Analyze the predictor
One-nearest neighbor can be extended to k>1
1. Find the k nearest neighbors to the record to be classified 2. Use a majority decision rule to classify the record, where the record is classified as a member of the majority class of the k neighbors
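A minimal sketch of this majority-vote rule in Python (the function name, data, and k value below are hypothetical, not from the course materials):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training records."""
    # Euclidean distance from the new record to every training record
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]           # indices of the k closest records
    votes = Counter(y_train[i] for i in nearest)  # count class labels among them
    return votes.most_common(1)[0][0]             # majority class

# Hypothetical tiny training set: two predictors, binary class labels
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array(["no", "no", "yes", "yes"])
print(knn_classify(X_train, y_train, np.array([5.5, 8.5]), k=3))  # -> "yes"
```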
Difficulties with the K-NN approach
1. Although no time is required to estimate parameters from the training data (as would be the case for parametric models such as regression), the time to find the nearest neighbors in a large training set can be prohibitive 2. The number of records required in the training set to qualify as large increases exponentially with the number of predictors p. This is because the expected distance to the nearest neighbor goes up dramatically with p unless the size of the training set increases exponentially with p. This is the "curse of dimensionality."
-Use atomic facts -Create single-process fact tables -Include a date dimension for each fact table -Enforce consistent grain -Disallow null keys in fact tables -Honor hierarchies -Decode dimension tables -Use surrogate keys -Conform dimensions -Balance requirements with actual data
10 essential rules for data modeling
A good rule of thumb for the number of cases for modeling
5(p+2) where p is the number of predictors
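A quick worked example of this rule of thumb, assuming p = 10 predictors:

```latex
n \ge 5(p + 2) = 5(10 + 2) = 60 \text{ cases}
```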
Predictive analysis
A combination of classification, prediction, and to some extent affinity analysis.
Event
A database action (create/update/delete) that results from a transaction
Classification
A form of data analysis in which one attempts to predict categorical data that is unknown or will occur in the future.
Prediction
A form of data analysis in which one attempts to predict numerical data that is unknown or will occur in the future.
10
A good rule of thumb is to have ______ records for every predictor variable
Data warehouse
A large integrated data storage facility that ties together the decision support systems of an enterprise.
Standard Deviation
A measure of dispersion expressed in the same units of measurement as your data (the square root of the variance)
Variance
A measure of dispersion of the data around the mean
A normal distribution
A normal distribution is _______: if you draw a line down the center, you get the same shape on either side.
Informational system
A system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications (Managers, business analysts, customers)
Operational system
A system that is used to run a business in real time, based on current data; also called a system of record (clerks, salespersons, administrators)
Exhaustive search
A very general problem-solving technique that consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem's statement.
Time-variant
Can study trends and changes
Low values of k (1, 3, ...)
Capture local structure in data (but also noise)
MOLAP
Causes the aggregations of the partition and a copy of its source data to be stored in a multidimensional structure in Analysis Services when the partition is processed.
ROLAP
Causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition's data source.
Transient data
Changes to existing records are written over previous records, thus destroying the previous data content
HOLAP
Combines attributes of both MOLAP and ROLAP. Causes the aggregations of the partition to be stored in a multidimensional structure in an SQL Server Analysis Services instance.
Training partition
Contains the data used to build the various models we are examining. The same partition is generally used to develop multiple models. [Build model(s)]
variable=volume of beverage or variable=temperature of beverage
Continuous
Subject-oriented
Customers, patients, students, and products
-Detailed -Historical -Normalized -Comprehensive -Timely -Quality controlled
Data after the ETL process
A good explanatory model is one that fits the data closely, whereas a good predictive model is one that predicts new cases accurately.
Difference between explanatory and predictive models
Data propagation
Duplicates data across databases, usually with near-real-time delay.
Important attribute of the predictive model
Even if we drop the first assumption and allow the noise to follow an arbitrary distribution, these estimates are very good for prediction.
Data mining
Extracting useful information from large data sets.
Market basket analysis
Finer grains have better _____________ capability
Data exploration
Full understanding of the data may require a reduction in its scale or dimension to allow us to see the forest without getting lost in the trees. Similar variables (i.e., variables that supply similar information) might be aggregated into a single variable incorporating all the similar variables.
-Explain observed events or conditions -Confirm hypotheses -Explore data for new or unexpected relationships
Goals of data mining
Dimension hierarchies
Help to provide levels of aggregation for users wanting summary information in a data warehouse.
Missing data
If the number of records with outliers is very small, they might be treated as __________.
"curse of dimensionality"
In a large training set, it takes a long time to find distances to all the neighbors and then identify the nearest one(s)
K=1 means use the single nearest record K=5 means use the 5 nearest records
K is the number of nearby neighbors to be used to classify the new record
Domain knowledge
Knowledge of the particular application being considered: direct mail, mortgage, finance, and so on, as opposed to technical knowledge of statistical or data mining procedures.
Test partition
Known as the "holdout" or "evaluation" partition, it is used whenever we need to assess the performance of the chosen model with new data. [Reevaluate model(s)]
Validation partition
Known as the "test partition," it is used to assess the performance of each model so that you can compare models and pick the best one. [Evaluate model(s)]
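A minimal sketch of creating the training, validation, and test partitions with scikit-learn (the 60/20/20 split and the synthetic data are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 records, 3 predictors, binary target
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# First hold out 20% as the test ("holdout") partition
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
# Then split the rest into training (60% of total) and validation (20% of total)
X_train, X_valid, y_train, y_valid = train_test_split(X_temp, y_temp, test_size=0.25, random_state=1)
```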
External data sources
Lack of control over data quality
Stepwise regression
Like forward selection except that at each step we consider dropping predictors that are not statistically significant, as in backward elimination.
Data visualization
Looking at each variable separately as well as at the relationships between variables. Numerical variables: histograms and box plots. Categorical variables: bar charts. Scatterplots for relationships between pairs of numerical variables.
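A minimal matplotlib sketch of a histogram and box plot for a hypothetical numeric variable:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical numeric variable: histogram and box plot show its distribution
values = np.random.default_rng(0).normal(loc=50, scale=10, size=200)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(values, bins=20)   # distribution shape
axes[0].set_title("Histogram")
axes[1].boxplot(values)         # median, quartiles, outliers
axes[1].set_title("Box plot")
plt.show()
```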
Dimensional model (usually implemented as a star schema)
Most common data model
Variable=type of beverage
Nominal
Lack of organizational commitment
Not recognizing poor data quality as an organizational issue
variable=size of beverage
Ordinal
Data entry
Poor data capture controls
Redundant data storage and inconsistent metadata
Proliferation of databases with uncontrolled redundancy and metadata
R^2
Proportion of variance accounted for by the model
High values of k
Provide more smoothing, less noise, but may miss local structure
Fact tables
Provide statistics for sales broken down by product, period, and store dimensions
Data Federation
Provides a virtual view of integrated data without actually bringing the data all into one physical centralized database
Operational data store
Provides option for obtaining current data
Non-updatable
Read-only, periodically refreshed
Dealing with the curse
Reduce the dimension of the predictors (e.g., with PCA). Use computational shortcuts that settle for "almost nearest neighbors."
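A minimal scikit-learn sketch of reducing the predictor dimension with PCA before running a nearest-neighbor (or any other) algorithm; the synthetic data and the 90% variance threshold are hypothetical choices:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 records with 50 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Keep enough principal components to explain ~90% of the variance,
# then run the distance-based algorithm on the reduced predictors
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```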
OLAP
Refers to specialized tools that make warehouse data easily available. An OLAP cube is a logical structure that defines the metadata. The term cube describes existing measure groups and dimension tables and should not be interpreted as having limited dimensions. A cube is a combination of all existing measure groups. A measure group is a group of measures that match the business logic of the data and is another logical structure that defines metadata so that client tools can access the data.
Machine Learning techniques
Rely on computational intensity and are less structured than classical statistical models (trees, neural networks).
Assess
SEMMA: Compare models using a validation dataset
Explore
SEMMA: Examine the dataset statistically and graphically
Sample
SEMMA: Take a _______ from the data set; partition into training, validation, and test datasets
Modify
SEMMA: Transform the variable and impute missing values
Model
SEMMA: Fit predictive models (e.g., regression tree, collaborative filtering)
Advantages of K-NN
Simple. No assumptions required about Normal distribution, etc. Effective at capturing complex interactions among variables without having to define a statistical model.
1. Develop an understanding of the purpose of the data mining project 2. Obtain the dataset to be used in the analysis 3. Explore, clean, and process the data 4. Reduce the data, if necessary, and (where supervised training is involved) separate them into training, validation, and test datasets 5. Determine the data mining task 6. Choose the data mining techniques to be used 7. Use algorithms to perform the task 8. Interpret the results of the algorithms 9. Deploy the model
Steps in data mining
Data marts
Subcategories of a data warehouse that focus on a single subject (e.g., credit rating data).
Association rules (affinity analysis)
Suggestion machine. The heart of a "recommender" system. Used by Netflix and Amazon.
Training Data
The data from which the classification and prediction algorithm "learns", or is "trained."
Interquartile range
The difference between the 25th and 75th percentiles
Range
The difference between the maximum and minimum data values.
Data management
The foundation for business analytics. Without correctly consolidated data, those working in the analytics, reporting, and solutions areas might not be working with the most current, accurate data.
When considering finding a k-value
The more complex and irregular the structure of the data, the lower the optimum value of K
Euclidean distance
The most popular distance measure is the__________.
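For reference, the Euclidean distance between two records x and y measured on p predictors is:

```latex
d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}
```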
Data reduction
The process of consolidating a large number of variables (or cases) into a smaller set.
Generalization
The purpose of predictive modeling is ________.
Forward selection
The simplest data-driven model building approach is called _________. In this approach, one adds variables to the model one at a time. At each step, each variable that is not already in the model is tested for inclusion, and the most significant of these variables is added, so long as its p-value is below some preset level. The algorithm stops when the contribution of additional predictors is not statistically significant. The main disadvantage of this method is that the algorithm will miss pairs or groups of predictors that perform very well together but perform poorly as single predictors.
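A minimal sketch of forward selection using statsmodels p-values (the alpha cutoff, function name, and synthetic data are assumptions for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_selection(X, y, alpha=0.05):
    """Add the most significant predictor at each step until no remaining
    candidate has a p-value below alpha (a hypothetical sketch)."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for col in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:   # no candidate is significant; stop
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical data: y depends on x1 and x2 but not x3
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
y = 2 * X["x1"] - 3 * X["x2"] + rng.normal(size=100)
print(forward_selection(X, y))
```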
Mallows Cp
This criterion assumes that the full model (with all predictors) is unbiased, although it may contain predictors that, if dropped, would reduce prediction variability. Good models are those whose Cp values are close to p+1 (where p is the number of predictors in the subset) and that use a small number of predictors.
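One common form of the criterion, where SSE_p is the error sum of squares of the subset model with p predictors, the sigma-hat term is the error variance estimate from the full model, and n is the number of records:

```latex
C_p = \frac{SSE_p}{\hat{\sigma}^2_{\text{full}}} + 2(p + 1) - n
```

Values of C_p near p+1 indicate a subset model with little bias.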
Supervised learning algorithms
Those in classification and prediction. Value of the outcome of interest is known (i.e. purchase or no purchase). Algorithm learns from this data. (simple linear regression analysis). The Y variable is the (known) outcome variable and the X variable is the predictor variable.
Unsupervised learning algorithms
Those in which there is no outcome variable to predict or classify. Hence there is no "learning" from cases where such an outcome variable is known. Association rules, dimension reduction methods, and clustering techniques.
Training
To find patterns and create an initial set of candidate models.
Validation
To assess the performance of each candidate model and select the best one.
Test
To measure performance of the selected model on unseen data. The test set can be an out-of-time sample of the data, if necessary.
normalize
To ______ we subtract the mean from each value and divide by the standard deviation; the resulting deviation from the mean is also known as the "z-score."
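A minimal NumPy sketch of this z-score normalization (the values are hypothetical):

```python
import numpy as np

# Hypothetical values to normalize
x = np.array([12.0, 15.0, 9.0, 20.0, 14.0])

# z-score: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()
print(z)
```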
Classification techniques
Used to identify those individuals whose demographic and other data closely matches that of our best existing customers.
Operational databases
Used to record individual transactions in support of routine business activity that can handle simple queries.
Variance inflation factor (VIF)
VIF = 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors
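A minimal sketch that computes VIF for each predictor by regressing it on the others (the function name and synthetic data are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(X):
    """For each column, regress it on the remaining columns and
    compute VIF = 1 / (1 - R^2)."""
    out = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

# Hypothetical data: x3 is nearly a copy of x1, so both get large VIFs
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
X["x3"] = X["x1"] + rng.normal(scale=0.01, size=100)
print(vif(X))
```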
Backward elimination
We start with all the predictors and then at each step eliminate the least useful predictor (according to statistical significance). The algorithm stops when all remaining predictors have significant contributions. The weakness of this algorithm is that computing the initial model with all predictors can be time consuming and unstable.
less
When performing data mining analysis, we want ________ than the total number of records that are available.
Overfitting
Where a model is fit so closely to the available sample of data that it describes not merely structural characteristics of the data but random peculiarities as well (fitting the noise, not just the signal).
Rows
Where can you find records in a data set?
Columns
Where can you find variables in a data set?
Validation data
Where the outcome is known, to see how well it does in comparison to other models.
K=1
Where we look for the record that is closest (the nearest neighbor) and classify the record as belonging to the same class as the closest neighbor.
1/3/5/6
Which of the following might constitute a case in a predictive model? 1. a household 2. loan amount 3. an individual 4. the number of products purchased 5. a company 6. a ZIP code 7. salary
Periodic data
_________ are never physically altered or deleted once they have been added to the store
Integrated
consistent naming conventions, formats, encoding structures; from multiple data sources
Independent variable
input variable / predictor / regressor / covariate
Dependent variable
outcome/response variable