Bus 491 Study guide


Numeric (Interval or ratio variables)

-Continuous -Integer -Most algorithms can handle numeric data -May occasionally need to "bin" into categories
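As a rough illustration of binning, the sketch below maps a numeric value to a category using sorted cut points. The function name `bin_value`, the cut points, and the labels are assumptions made for this example, not values from the course.

```python
from bisect import bisect_right

def bin_value(value, cut_points, labels):
    """Return the label of the bin that `value` falls into.

    cut_points must be sorted; labels needs one more entry than cut_points.
    A value equal to a cut point goes into the higher bin.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect_right(cut_points, value)]

# Hypothetical example: bin ages into three categories.
ages = [15, 34, 70]
age_bins = [bin_value(a, [18, 65], ["minor", "adult", "senior"]) for a in ages]
print(age_bins)
```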

Research questions for One-Way ANOVA

-Do accountants, on average, earn more than teachers? -Do people with one of two new drugs have higher average T-cell counts than people in the control group? -Do people spend different amounts depending on what kind of credit card they use? -Does the type of fertilizer used affect the average weight of garlic grown at the Montana Gourmet Garlic Ranch?

Imputation

-Fill in the missing values with some reasonable value -Example: the mean within homogeneous groups of the data -Categorical variables: separate category
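A minimal sketch of imputing with the group mean, assuming records are stored as dictionaries and a missing value is represented by `None`. The function name `impute_group_mean` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def impute_group_mean(records, group_key, value_key):
    """Return copies of `records` with missing values (None) in `value_key`
    replaced by the mean of the non-missing values in the same group."""
    observed = defaultdict(list)
    for rec in records:
        if rec[value_key] is not None:
            observed[rec[group_key]].append(rec[value_key])
    group_means = {g: mean(vals) for g, vals in observed.items()}

    filled = []
    for rec in records:
        new_rec = dict(rec)  # copy, so the input is left untouched
        if new_rec[value_key] is None and new_rec[group_key] in group_means:
            new_rec[value_key] = group_means[new_rec[group_key]]
        filled.append(new_rec)
    return filled

# Hypothetical data: income is missing for one teacher.
people = [
    {"job": "teacher", "income": 40},
    {"job": "teacher", "income": 50},
    {"job": "teacher", "income": None},
    {"job": "accountant", "income": 70},
]
imputed = impute_group_mean(people, "job", "income")
print(imputed[2]["income"])  # mean of the two observed teacher incomes
```

A group with no observed values at all keeps its missing entries, since there is nothing reasonable to fill in from.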

Unsupervised: Clustering

-Goal: Form groups (clusters) of similar records -Example: Cluster your customers into groups with similar demographic attributes -Each row is a case (customer, tax return, insurance claim) -Each column is an attribute (height, weight, hair color) -Often used as an intermediate step that leads to supervised learning/predictive analysis

Supervised: Classification

-Goal: Predict categorical target (outcome) variable -Examples: purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy... -Each row is a case (customer, tax return, application...) -Each column is a variable (predictor, target ...) -Target variable is often binary (yes/no)

Supervised: Prediction

-Goal: Predict numeric target (outcome) variable -Examples: sales, revenue, performance -As in classification: -Each row is a case (customer, tax return, application...) -Each column is a variable (predictor, target ...) -Taken together, classification and prediction constitute "predictive analytics" (narrow definition)

Unsupervised: Association Rules

-Goal: Produce rules that define "what goes with what" -Example: "If X was purchased, Y was also purchased" -Rows are transactions -Used in recommender systems - "Our records show you bought X, you may also like Y" -Also called "affinity analysis"
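To make "If X was purchased, Y was also purchased" concrete, the sketch below counts how often the rule holds: among transactions containing X, the fraction that also contain Y (commonly called the rule's confidence). The function name and the baskets are made up for illustration.

```python
def rule_confidence(transactions, x, y):
    """Among transactions containing item x, the fraction that also contain y.

    Returns None if x never appears.
    """
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return None
    return sum(1 for t in with_x if y in t) / len(with_x)

# Hypothetical transactions: each row is one basket of items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
    {"milk"},
]
print(rule_confidence(baskets, "bread", "milk"))
```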

Used to examine the distribution of data values

-Histograms -Normal probability plots -Box plots

Why might having too many variables be undesirable?

-It may be expensive or not feasible to collect all predictors for future predictions -We may be able to measure fewer predictors more accurately -The more predictors, the higher the chance of missing data -Parsimony is an important property of good models -Estimates of regression coefficients are likely to be unstable, due to multicollinearity in models with many variables -Bias-variance trade-off

Purposes of data quality

-Minimize IT project risk -Make timely business decisions -Ensure regulatory compliance -Expand customer base

Categorical

-Ordered (low, medium, high), also called ordinal variables -Unordered (male, female), also called nominal variables -Naïve Bayes can use as-is -In most other algorithms, must create binary dummies
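A small sketch of creating binary dummies from a categorical variable, assuming the values arrive as a plain list. The helper name `make_dummies` is hypothetical; statistical packages provide their own versions.

```python
def make_dummies(values):
    """Turn a list of category values into rows of 0/1 dummy columns.

    Returns (column_names, rows), with one column per distinct category.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

sizes = ["low", "high", "medium", "low"]
columns, dummy_rows = make_dummies(sizes)
print(columns)
print(dummy_rows)
```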

Applicable data mining situations

-Predicting customer activity on credit cards from their demographics -Predicting the time to failure of equipment based on utilization and environment conditions -Predicting expenditures on vacation travel based on historical frequent flyer data - Predicting staffing requirements at help desks based on historical data and product sales information -Predicting the impact of discounts on sales in retail outlets

Complete-Case Analysis (Omission)

-Use only the cases that have complete records in the analysis -Disadvantage: may lead to disastrous reduction in data -Works if only a small number of records or a small set of variables have missing values

Unsupervised learning algorithms

-Used where there is no outcome variable to predict or classify -No "learning" from cases where such an outcome variable is known -Examples: association rules, dimension reduction methods, and clustering.

Categorical variables

-can be either numerical or text -can be ordered or unordered -can have categories such as high value, low value, and nil value -require special handling

Continuous variables

-can be handled by most data mining routines -in XLMiner all routines take _______ with the exception of the Bayes classifier

What to do in the case of missing data

1. Drop the records or 2. Impute a value or 3. Analyze the predictor

One-nearest neighbor can be extended to k>1

1. Find the k nearest neighbors of the record to be classified 2. Use a majority decision rule to classify the record, where the record is classified as a member of the majority class of the k neighbors
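The two steps can be sketched as follows, assuming Euclidean distance and training data stored as (features, label) pairs. The function name `knn_classify` and the sample data are illustrative only.

```python
import math
from collections import Counter

def knn_classify(training, new_point, k=3):
    """Classify `new_point` by majority vote among its k nearest neighbors.

    `training` is a list of (features, label) pairs; distance is Euclidean.
    """
    if not 1 <= k <= len(training):
        raise ValueError("k must be between 1 and the number of training records")
    # Step 1: find the k nearest neighbors of the new record.
    neighbors = sorted(training, key=lambda rec: math.dist(rec[0], new_point))[:k]
    # Step 2: majority decision rule over the neighbors' classes.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training data: two numeric predictors and a binary class.
train = [
    ((1.0, 1.0), "nonowner"),
    ((1.5, 2.0), "nonowner"),
    ((2.0, 1.0), "nonowner"),
    ((5.0, 5.0), "owner"),
    ((6.0, 5.5), "owner"),
]
print(knn_classify(train, (1.2, 1.4), k=3))
print(knn_classify(train, (5.5, 5.0), k=1))
```

With a binary target, an odd k avoids tied votes. This sketch scans every training record, which is the cost that becomes prohibitive for large training sets.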

Difficulties with the K-NN approach

1. Although no time is required to estimate parameters from the training data (as would be the case for parametric models such as regression), the time to find the nearest neighbors in a large training set can be prohibitive 2. The number of records required in the training set to qualify as large increases exponentially with the number of predictors p. This is because the expected distance to the nearest neighbor goes up dramatically with p unless the size of the training set increases exponentially with p. This is known as the curse of dimensionality.

-Use atomic facts -Create single-process fact tables -Include a date dimension for each fact table -Enforce consistent grain -Disallow null keys in fact tables -Honor hierarchies -Decode dimension tables -Use surrogate keys -Conform dimensions -Balance requirements with actual data

10 essential rules for data modeling

A good rule of thumb for the number of cases for modeling

5(p+2) where p is the number of predictors

Predictive analysis

A combination of classification, prediction, and to some extent affinity analysis.

Event

A database action (create/update/delete) that results from a transaction

Classification

A form of data analysis in which one predicts categorical data that is unknown or will occur in the future.

Prediction

A form of data analysis in which one predicts numerical data that is unknown or will occur in the future.

10

A good rule of thumb is to have ______ records for every predictor variable

Data warehouse

A large integrated data storage facility that ties together the decision support systems of an enterprise.

Standard Deviation

A measure of dispersion expressed in the same units of measurement as your data (the square root of the variance)

Variance

A measure of dispersion of the data around the mean
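A short sketch of both measures. It uses the sample form (dividing by n - 1), which is an assumption here, since the cards do not say whether the sample or population version is meant.

```python
import math

def variance(values):
    """Sample variance: sum of squared deviations from the mean / (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)

def standard_deviation(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data), standard_deviation(data))
```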

A normal distribution

A normal distribution is _______. If you draw a line down the center, you get the same shape on either side.

Informational system

A system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications (Managers, business analysts, customers)

Operational system

A system that is used to run a business in real time, based on current data; also called a system of record (clerks, salespersons, administrators)

Exhaustive search

A very general problem-solving technique that consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem's statement.

Time-variant

Can study trends and changes

Low values of k (1, 3, ...)

Capture local structure in data (but also noise)

MOLAP

Causes the aggregations of the partition and a copy of its source data to be stored in a multidimensional structure in Analysis Services when the partition is processed.

ROLAP

Causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition's data source.

Transient data

Changes to existing records are written over previous records, thus destroying the previous data content

HOLAP

Combines attributes of both MOLAP and ROLAP. Causes the aggregations of the partition to be stored in a multidimensional structure in an SQL Server Analysis Services instance.

Training partition

Contains the data used to build the various models we are examining. The same partition is generally used to develop multiple models. [Build model(s)]

variable=volume of beverage or variable=temperature of beverage

Continuous

Subject-oriented

Customers, patients, students, and products

-Detailed -Historical -Normalized -Comprehensive -Timely -Quality controlled

Data after the ETL process

A good explanatory model is one that fits the data closely, whereas a good predictive model is one that predicts new cases accurately.

Difference between explanatory and predictive models

Data propagation

Duplicates data across databases, usually with near-real-time delay.

Important attribute of the predictive model

Even if we drop the first assumption and allow the noise to follow an arbitrary distribution, these estimates are very good for prediction.

Data mining

Extracting useful information from large data sets.

Market basket analysis

Finer grains have better _____________ capability

Data exploration

Full understanding of the data may require a reduction in its scale or dimension to allow us to see the forest without getting lost in the trees. Similar variables (i.e., variables that supply similar information) might be aggregated into a single variable incorporating all the similar variables.

-Explain observed events or conditions -Confirm hypotheses -Explore data for new or unexpected relationships

Goals of data mining

Dimension hierarchies

Help to provide levels of aggregation for users wanting summary information in a data warehouse.

Missing data

If the number of records with outliers is very small, they might be treated as __________.

"curse of dimensionality"

In a large training set, it takes a long time to find distances to all the neighbors and then identify the nearest one(s)

K=1 means use the single nearest record; K=5 means use the 5 nearest records

K is the number of nearby neighbors to be used to classify the new record

Domain knowledge

Knowledge of the particular application being considered: direct mail, mortgage, finance, and so on, as opposed to technical knowledge of statistical or data mining procedures.

Test partition

Also known as the "holdout" or "evaluation" partition; used whenever we need to assess the performance of the chosen model with new data. [Reevaluate model(s)]

Validation partition

Sometimes called the "test partition"; used to assess the performance of each model so that you can compare models and pick the best one. [Evaluate model(s)]

External data sources

Lack of control over data quality

Stepwise regression

Like forward selection except that at each step we consider dropping predictors that are not statistically significant, as in backward elimination.

Data visualization

Looking at each variable separately as well as looking at the relationships between variables. Numerical values: histograms/box plots. Categorical values: bar charts/scatterplots.

Dimension model (usually implemented as star schema)

Most common data model

Variable=type of beverage

Nominal

Lack of organizational commitment

Not recognizing poor data quality as an organizational issue

variable=size of beverage

Ordinal

Data entry

Poor data capture controls

Redundant data storage and inconsistent metadata

Proliferation of databases with uncontrolled redundancy and metadata

R^2

Proportion of variance accounted for by the model
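One common way to compute this proportion is 1 - SSE/SST, where SSE is the sum of squared prediction errors and SST is the total sum of squared deviations from the mean. The sketch below assumes plain lists of actual and predicted values; the function name is illustrative.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SSE / SST, the share of variance in `actual` explained."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    if sst == 0:
        raise ValueError("actual values are constant; R^2 is undefined")
    return 1 - sse / sst

y = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y, y))                     # perfect predictions
print(r_squared(y, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
```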

High values of k

Provide more smoothing, less noise, but may miss local structure

Fact tables

Provide statistics for sales broken down by product, period, and store dimensions

Data Federation

Provides a virtual view of integrated data without actually bringing the data all into one physical centralized database

Operational data store

Provides option for obtaining current data

Non-updatable

Read-only, periodically refreshed

Dealing with the curse

-Reduce dimension of predictors (e.g., with PCA) -Computational shortcuts that settle for "almost nearest neighbors"

OLAP

Refers to specialized tools that make warehouse data easily available. An OLAP cube is a logical structure that defines the metadata. The term cube describes existing measure groups and dimension tables and should not be interpreted as having limited dimensions. A cube is a combination of all existing measure groups. A measure group is a group of measures that match the business logic of the data and is another logical structure that defines metadata so that client tools can access the data.

Machine Learning techniques

Rely on computational intensity and are less structured than classical statistical models (trees/neural networks).

Assess

SEMMA: Compare models using a validation dataset

Explore

SEMMA: Examine the dataset statistically and graphically

Sample

SEMMA: Take a _______ from the data set; partition into training, validation, and test datasets
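A minimal sketch of random partitioning. The 50/30/20 split, the fixed seed, and the function name `partition` are assumptions chosen for the example; the cards do not prescribe particular proportions.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Randomly split records into training, validation, and test lists.

    Whatever is left after the training and validation shares goes to test.
    """
    if train_frac < 0 or valid_frac < 0 or train_frac + valid_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))
```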

Modify

SEMMA: Transform the variable and impute missing values

Model

SEMMA: Fit predictive models (e.g., regression tree, collaborative filtering)

Advantages of K-NN

-Simple -No assumptions required about Normal distribution, etc. -Effective at capturing complex interactions among variables without having to define a statistical model

1. Develop an understanding of the purpose of the data mining project 2. Obtain the dataset to be used in the analysis 3. Explore, clean, and process the data 4. Reduce the data, if necessary, and (where supervised training is involved) separate them into training, validation, and test datasets 5. Determine the data mining task 6. Choose the data mining techniques to be used 7. Use algorithms to perform the task 8. Interpret the results of the algorithms 9. Deploy the model

Steps in data mining

Data marts

Subcategories of a data warehouse that focus on a single subject (e.g., credit rating data).

Association rules (affinity analysis)

Suggestion machine. The heart of a "recommender" system. Used by Netflix and Amazon.

Training Data

The data from which the classification and prediction algorithm "learns", or is "trained."

Interquartile range

The difference between the 25th and 75th percentiles

Range

The difference between the maximum and minimum data values.

Data management

The foundation for business analytics. Without correctly consolidated data, those working in the analytics, reporting, and solutions areas might not be working with the most current, accurate data.

When considering finding a k-value

The more complex and irregular the structure of the data, the lower the optimum value of k

Euclidean distance

The most popular distance measure is the__________.

Data reduction

The process of consolidating a large number of variables (or cases) into a smaller set.

Generalization

The purpose of predictive modeling is ________.

Forward selection

The simplest data-driven model building approach is called _________. In this approach, one adds variables to the model one at a time. At each step, each variable that is not already in the model is tested for inclusion in the model. The most significant of these variables is added to the model, so long as its p-value is below some pre-set level. The algorithm stops when the contribution of additional predictors is not statistically significant. The main disadvantage of this method is that the algorithm will miss pairs or groups of predictors that perform very well together but perform poorly as single predictors.
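The greedy loop behind this approach can be sketched as below. Note the substitution: instead of testing p-values, the sketch takes a user-supplied `score` function (higher is better) and a `min_gain` threshold, both of which are assumptions made to keep the example self-contained. The toy scoring function and variable names are hypothetical.

```python
def forward_selection(candidates, score, min_gain=0.0):
    """Greedy forward selection.

    `score(subset)` returns a number where higher is better. At each step
    the candidate giving the largest score is added, as long as it improves
    the current score by more than `min_gain`.
    """
    selected = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        trials = [(score(selected + [c]), c) for c in remaining]
        top_score, top_var = max(trials, key=lambda pair: pair[0])
        if top_score - best <= min_gain:
            break  # no remaining variable adds enough; stop
        selected.append(top_var)
        remaining.remove(top_var)
        best = top_score
    return selected

# Toy score (hypothetical): each variable contributes a fixed amount,
# minus a small penalty for every variable included.
worth = {"income": 5.0, "age": 2.0, "zip": 0.05}

def toy_score(subset):
    return sum(worth[v] for v in subset) - 0.1 * len(subset)

chosen = forward_selection(worth, toy_score)
print(chosen)
```

Because each step evaluates variables one at a time, a pair that only helps jointly would never be picked, which is the weakness the card describes.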

Mallows Cp

This criterion assumes that the full model (with all predictors) is unbiased, although it may have predictors that, if dropped, would reduce prediction variability. Good models are those whose Cp value is near p+1, where p is the number of predictors in the subset.

Supervised learning algorithms

Those used in classification and prediction. The value of the outcome of interest is known (e.g., purchase or no purchase), and the algorithm learns from this data. Example: simple linear regression, where the Y variable is the (known) outcome variable and the X variable is the predictor variable.

Unsupervised learning algorithms

Those in which there is no outcome variable to predict or classify. Hence there is no "learning" from cases where such an outcome variable is known. Examples: association rules, dimension reduction methods, and clustering techniques.

Training

To find patterns and create an initial set of candidate models.

Validation

To compare the candidate models and select the best one.

Test

To measure performance of the selected model on unseen data. The test set can be an out-of-time sample of the data, if necessary.

normalize

To ______, we subtract the mean from each value and divide by the standard deviation of the resulting deviations from the mean. The result is also known as the "z-score".
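A minimal sketch of the z-score calculation using the standard library. It uses the sample standard deviation (`statistics.stdev`), which is an assumption; the card does not specify sample versus population.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values: subtract the mean, divide by the standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m) / s for x in values]

scores = z_scores([10, 20, 30])
print(scores)
```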

Classification techniques

Used to identify those individuals whose demographic and other data closely matches that of our best existing customers.

Operational databases

Used to record individual transactions in support of routine business activity that can handle simple queries.

Variance inflation value

VIF = 1/(1 - R^2)
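As a quick numeric check of the formula, the sketch below computes the VIF from a given R^2, where R^2 comes from regressing one predictor on all the other predictors. The function name `vif` is illustrative.

```python
def vif(r_squared):
    """Variance inflation factor for a predictor whose regression on the
    other predictors has the given R^2: VIF = 1 / (1 - R^2)."""
    if not 0 <= r_squared < 1:
        raise ValueError("R^2 must be in [0, 1)")
    return 1 / (1 - r_squared)

print(vif(0.0))  # predictor unrelated to the others
print(vif(0.9))  # predictor largely explained by the others
```

The closer R^2 gets to 1, the larger the VIF, which signals stronger multicollinearity.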

Backward elimination

We start with all the predictors and then at each step eliminate the least useful predictor (according to statistical significance). The algorithm stops when all remaining predictors have significant contributions. The weakness of this algorithm is that computing the initial model with all predictors can be time consuming and unstable.

less

When performing data mining analysis we want________ than the total number of records that are available.

Overfitting

Where a model is fit so closely to the available sample of data that it describes not merely structural characteristics of the data but random peculiarities as well (fitting the noise, not just the signal).

Rows

Where can you find records in a data set?

Columns

Where can you find variables in a data set?

Validation data

Where the outcome is known, to see how well it does in comparison to other models.

K=1

Where we look for the record that is closest (the nearest neighbor) and classify the record as belonging to the same class as the closest neighbor.

1/3/5/6

Which of the following might constitute a case in a predictive model? 1. a household 2. loan amount 3. an individual 4. the number of products purchased 5. a company 6. a ZIP code 7. salary

Periodic data

_________ are never physically altered or deleted once they have been added to the store

Integrated

consistent naming conventions, formats, encoding structures; from multiple data sources

Independent variable

input variable/regressor/covariate

Dependent variable

outcome/response variable

