C743 Data Mining 1 - Terminology

- Used to determine if a statistically significant difference exists between the means of 3 or more groups - Useful for testing groups to see if there's a difference between them, e.g. students from different colleges take the same exam & you want to see which college performs better on the exam - Analysis on 3 or more groups - Assumes error terms (residuals) are norm. dist., error terms (residuals) have constant variance (homoscedastic), trials & observations independent, data is norm. dist., groups/populations are independent - Null Hyp = means for each group are the same - Alt Hyp = at least one mean is different from the other means

ANOVA Test
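A minimal sketch of a one-way ANOVA in Python using SciPy's f_oneway; the exam-score groups below are made-up illustration data:

```python
from scipy import stats

# Made-up exam scores for students from three colleges
college_a = [78, 85, 90, 72, 88]
college_b = [81, 79, 95, 84, 77]
college_c = [70, 68, 75, 82, 74]

# One-way ANOVA: null hypothesis is that all group means are equal
f_stat, p_value = stats.f_oneway(college_a, college_b, college_c)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
# p < 0.05 -> at least one college's mean score differs from the others
```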

- Is an erroneous value resulting from an incorrect measurement, a calculation error, an input error, or a false declaration - Examples of aberrant values include: * Incoherent dates, such as events dated before the customer's date of birth * Customers categorized as 'private buyer' when in fact they are 'business' * Transaction amounts input in Dollars when they should have been in Euros * Telephone numbers that are not numbers at all

Aberrant Values

- Extract, store, analyse relevant customer information to provide a comprehensive overview of the customer in the business sense - Essentially collection and analysis of customer info with goal being to understand customer profile and needs more fully

Analytical CRM

- Best used for finding the most frequently occurring combinations of variables in a data set - Detects relationships or associations between specific values of independent categorical variables in large data sets - Good example of association rules being used in real life is how the Amazon store has "Customers who viewed this item also viewed" and "Frequently bought together" product recommendations for every item in the store. Those recommendations are derived from analysis done using association rules - Can be summarized as using association rules to find associations of the kind, "If X, then likely Y", where X and Y can be single values, items, words, etc. --> So if I buy X, then I would likely want to buy Y - Apriori is the main algorithm for detecting association rules

Association Rules/Analysis
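A toy sketch of the support/confidence arithmetic behind an "If X, then likely Y" rule, using made-up transactions (a real analysis would use the Apriori algorithm to enumerate frequent itemsets efficiently):

```python
# Made-up market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

x, y = {"bread"}, {"milk"}             # candidate rule: if bread, then milk
n = len(transactions)
support_xy = sum(1 for t in transactions if x | y <= t) / n   # P(X and Y)
support_x = sum(1 for t in transactions if x <= t) / n        # P(X)
confidence = support_xy / support_x                           # P(Y | X)

print(f"support(X->Y) = {support_xy:.2f}, confidence = {confidence:.2f}")
```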

- Data that pertains to customer opinion of company - Attitudes towards products - Reasons for buying products - How attractive a business's competitors are to a customer

Attitudinal Data

- Used to confirm homoscedasticity of data by testing that the variances are equal for all samples. - Best used for normally distributed data - DO NOT use Bartlett with non-norm. dist. data - Null Hyp = variance equal for all samples - Alt Hyp = variance is not equal for at least one pair of samples

Bartlett Test
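A minimal sketch of Bartlett's test with SciPy, assuming three made-up, roughly normal samples:

```python
from scipy import stats

# Made-up samples assumed to be roughly normally distributed
sample_1 = [10.1, 9.8, 10.3, 10.0, 9.9]
sample_2 = [10.5, 10.7, 10.2, 10.6, 10.4]
sample_3 = [9.7, 10.0, 9.9, 10.1, 9.8]

# Null hypothesis: all samples have equal variance (homoscedastic)
stat, p_value = stats.bartlett(sample_1, sample_2, sample_3)
print(f"statistic = {stat:.3f}, p = {p_value:.3f}")
# Small p -> reject equal variances; prefer Levene's test for non-normal data
```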

- Means analyzing 2 variables, i.e. comparing 2 sets of related data (like how much alcohol a bar sells on New Years vs any other day) - Examine bi-variate statistics of variables to: + Detect incompatibilities between variables, i.e. to see if variables are correlated or dependent + Detect links between dependent variable and independent variables and their interactions. This will help eliminate independent variables that have no effect on dependent variable + Detect links, relationships, or associations between the independent variables, which must be avoided in linear and logistic regression since independence of variables is assumed in those methods

Bi-variate

- Experiment where there are fixed number of independent trials and only two outcomes for each trial are possible, i.e. coin flip is either heads or tails

Binomial Experiment

- Sampling with replacement - Samples data over and over until the sample distribution is an approximation of the population distribution - Bootstrap can estimate the shape and spread for any distribution of interest - Bootstrap needs large sample sizes; small sample sizes are not recommended for bootstrap since small samples provide a biased estimate of the sample std. dev. and dist.

Bootstrapping
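A minimal sketch of a percentile bootstrap for the mean, using NumPy and a made-up sample:

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=500)    # made-up original sample

# Resample with replacement many times and record the statistic of interest
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(2000)]

# Percentile bootstrap confidence interval for the mean
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```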

- Customer Relationship Management - A way of using data about customers to manage and understand customer needs & expectations of a company/product - Goal is to increase profitability and customer loyalty while controlling risk and using the right channels to sell the right product at the right time - Uses data analysis about customers' history with a company to improve business relationships with customers - Specifically focuses on customer retention and ultimately driving sales growth

CRM

- Used to analyze CATEGORICAL data in contingency tables - Compares 2 variables in a contingency table to see if they have a relationship - Tests whether distributions of categorical variables differ from one another - Small Chi-Square test statistic means observed data fits the expected values (computed under independence) very well, i.e. there is no relationship - Large Chi-Square test statistic means observed data does not fit the expected values well, i.e. there is a relationship - Assumes variables are nominal (categorical) data, categories of data are independent, each cell has an expected count of at least 5, large sample size assumed, groups are independent, observations are independent - Null Hyp = proportions are equal, there is no relationship between the variables so they are independent - Alt Hyp = proportions are NOT equal, there is a relationship between the variables, variables are NOT independent

Chi-Square Test
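A minimal sketch of a chi-square test of independence on a made-up contingency table, using SciPy:

```python
from scipy.stats import chi2_contingency

# Made-up 2x2 contingency table: rows = customer group, columns = bought / did not buy
observed = [[30, 10],
            [20, 40]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
# Large chi2 / small p -> reject independence, i.e. the variables are related
```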

- Factor analysis method - Intended for the analysis of qualitative variables and categorical data - Useful for exploring the structure of categorical variables in a table - Presents a graphical display of the relationships between the rows and columns in a table, aka "maps" row and column variables - Can be used to convert qualitative values into quantitative values - Goal of correspondence analysis is to transform a data table into two sets of factor scores: 1 for the rows and 1 for the columns; the factor scores represent the similarity of the structure of the rows and the columns of the table - The factor scores can be plotted as maps, which display the essential information of the original table. - High frequency denotes a strong positive relationship, i.e. two positively related categories, A & B, are close - Low frequency denotes a strong negative relationship, i.e. two negatively related categories, A & B, are opposed - Intermediate frequency denotes a weak relationship - Strongest oppositions are on the first axis (horizontal) - Categories not related to others are in the center

Correspondence Analysis

1.) If highly precise model desired, linear regression, discriminant analysis, and logistic regression are the best choices 2.) For most robust model, avoid decision trees and neural networks 3.) For most concise model, linear regression, discriminant analysis, and logistic regression are preferred 4.) If data set is very small, avoid decision trees and neural networks 5.) If data set has missing values, decision trees, MARS, PLS regression or logistic regression can be used by coding the missing values as a special class 6.) For data sets with extreme values(outliers), MARS and decision trees are best as they are not sensitive to extreme values. For logistic regression and DISQUAL, divide variables into classes and place the extreme values into 1 class and the other values into another 7.) If independent variables are highly correlated, use decision trees, PLS regression, and regularized regression 8.) If data set is very large, avoid neural networks, SVMs, and logistic regression as using those methods will greatly increase compute times 9.) Neural networks are most useful when structure of data is not clear 10.) Best methods to use without having to prepare or homogenize data are decision trees, MARS, bagging and boosting. 11.) Linear Regression = use with continuous variables 12.) Discriminant Analysis = use for nominal dep. var. and continuous indep. var. 13.) Logistic Regression = use for qualitat. dep. var and continuous or qualitat. indep. var 14.) Neural Networks = use for continuous variables in range of [0,1] and transform the rest

Choosing Modeling Method

- An operation that places each variable from the population of a study into a specified class or classes based on the characteristics of the variable. - Variables are generally assigned to a class by using formulas, algorithms, rule sets, etc - There are usually two classes (e.g. club members vs non-club members, customers vs potential customers, etc) - Classification allows a model to be built that will help predict outcomes based on variables in their assigned classes (e.g. the probability that a customer will be a club member or not) - Dependent variable is qualitative

Classification

- Operation that assigns each individual from the population being studied into a specific class/group. - Class/group assignment is based on exploratory characteristics of the individual - Dependent variable is qualitative - Classification methods include: * Logistic regression * Linear discriminant analysis * Naive Bayes * Decision trees

Classification

- Exploratory analysis used to identify structures in dataset - Useful for organizing variables into smaller, homogeneous groups to simplify analysis and find patterns in data - Variables are grouped together based on similar traits, i.e factors like age, income, education level, etc - Cluster analysis is typically a highly subjective process since it is dependent on one's familiarity with the data - Grouping is determined by what you perceive to be common threads or patterns in the data - Distance-based clustering: variables are grouped based on their proximity or distance, i.e. cluster of cancer cases near Chernobyl - Conceptual clustering: variables are grouped by factors that the variables have in common, i.e. cancer clusters could be grouped by "people who smoke"

Cluster Analysis

- The whole population is subdivided into clusters/groups, a random selection of clusters is drawn, and data is then collected from the selected clusters - Example of cluster sampling: * A researcher is interested in data about city taxes in Florida. The researcher would collect data about taxes from a randomly selected set of cities in Florida, and then compile all of the tax data together to see a picture of the taxes in the cities across Florida as a whole. The individual cities themselves would be the clusters in this case.

Cluster Sampling

- Exclusive Clustering: Each variable can only belong in one single cluster. It cannot belong to any other cluster - Fuzzy Clustering: Data points are assigned a probability of belonging to one or more clusters - Overlapping Clustering: Each variable can belong to more than one cluster - Hierarchical Clustering: Each variable is given its own cluster, then a pair of clusters is joined together based on similarities, which combines 2 clusters into 1. This process is repeated until all variables are clustered. - Probabilistic Clustering: Data is clustered using algorithms that connect variables using distances or densities. This is done via computer.

Cluster Types

- Grouping objects (individuals or variables) with similar characteristics into a number of smaller groups called clusters (segments). - The groups are not defined in advance by the analyst; they are discovered during the grouping process - Clustering is descriptive, NOT predictive, as clustering is only useful for showing underlying patterns or associations between objects in a data set. No actual predictions can be made from the clusters themselves. - No dependent variable - Useful because it helps homogenize data by grouping similar objects together

Clustering

- Defined as a correlation or relationship between predictors (independent variables) in regression analysis - For example, height and weight have a high collinearity between them since generally speaking the taller a person is, the more that person weighs - Collinearity creates bias and skews results of analysis since variables are too closely related

Collinearity

- Transactional i.e. data describing an event, like the results of a transaction (orders, payments, deliveries, etc) Transaction data always has a time dimension to it and a numerical value - Product i.e. data that is describing a product, like shoes or cars - Customer i.e. data that describes a customer (customer ID, first and last name, etc) - Geodemographic i.e. data about a population in an area - Technical i.e. data that gives a status report on something (date of death, official titles, payer status, etc)

Commercial Sector Data Types

- Provides 2 dimensional view of contingency tables - Can be used to convert qualitative values into quantitative values

Component Analysis

- Used to indicate how strongly associated two different categorical/nominal variables are with each other. - Based on Pearson Chi-Square statistic - Test is performed on contingency tables - Cramer's V value is between 0 & 1. - Values closer to 0 indicate little association between variables. - Values closer to 1 indicate strong association between variables. - In 2x2 contingency table, Cramer's V = Phi coefficient - Can be used on tables with size > 2x2 (i.e. 3x3, 3x2, 4x4 etc)

Cramer V Test
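Cramér's V is not a single SciPy call, but it can be computed from the chi-square statistic as V = sqrt(chi2 / (n * (min(rows, cols) - 1))); a sketch on a made-up table:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[25, 15, 10],
                     [20, 30, 20]])           # made-up 2x3 contingency table

chi2, p, dof, expected = chi2_contingency(observed)
n = observed.sum()
min_dim = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))      # 0 = no association, 1 = strong
print(f"Cramer's V = {cramers_v:.3f}")
```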

- The following parameters should be considered when deciding how to deal with an aberrant value: * If the anomaly arises because the selected observation is outside the range of the study, delete the observation. * If the variable appears to be useless in describing the thing to be predicted, keep the observation but eliminate the variable, or treat its aberrant values as missing values. * If the variable is useful in describing the thing to be predicted, keep the variable and eliminate the bad observations

Criteria for Fixing Aberrant Values

- Used to reduce right skew of data points - Weaker than log transformation - CAN be used for zero or negative values

Cube Root Transformation

- Relational (customer reactions to marketing) - Attitudinal (customer satisfaction/loyalty) - Psychographic (customer personality) - Lifetime (how long one has been customer) - Channel - Sociodemographic (customer social info)

Customer Data

- Discriminant Analysis on Qualitative Variables (DIScrimination on QUALitative variables) - DISQUAL performs a Fisher's Discriminant Analysis on components from a Multiple Correspondence Analysis - The Fisher function can be expressed as a linear combination of indicators of categories, which is equivalent to assigning a coefficient to each of the categories

DISQUAL

- AKA 'binning' - Is the process of converting continuous data into a smaller number of finite values, i.e. taking the data and putting it into classes (bins) - An example of discretization (binning) is taking a bunch of measurements related to temperature outside and classifying that data into 'bins' that are either Hot or Cold - Variation of original data set is also maintained intact in binned data - Accomplished by assigning each value in a data set to a 'bin' - Tips for binning data include: * Avoid having too many differences in classes between variables * Don't use too many classes for a variable * Try not to make classes too small * Usually, 4 to 5 classes is acceptable

Data Discretization
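A minimal sketch of binning with pandas, using made-up temperature readings:

```python
import pandas as pd

temps = pd.Series([12, 18, 25, 31, 8, 22, 35, 15])   # made-up temperatures (C)

# Equal-width binning into labelled classes
bins = pd.cut(temps, bins=[-40, 15, 25, 50], labels=["Cold", "Mild", "Hot"])
print(bins.value_counts())

# Equal-frequency (quantile) binning into 4 classes
quartile_bins = pd.qcut(temps, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(quartile_bins)
```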

- Data mining application goes through a number of phases: - Development (construction of the model) in the decision-making environment - Testing (verifying the performance of the model) in the decision-making environment - Use in the production environment (application of the model to the production data to obtain the specified output data).

Data Mining Application Phases

1.) Define aims, i.e. desired outcome 2.) List existing and available data to use 3.) Collect and obtain new data if necessary 4.) Explore and prepare data 5.) Segment the population 6.) Create and validate predictive models 7.) Deploy the models 8.) Train model users 9.) Monitor models 10.) Improve models

Data Mining Steps

- Used to make variables proportional to each other - Normalization of continuous variables is done with mathematical functions (like exponent, log, etc.) to transform the data - In most cases, data normalization eliminates units of measurement for data, which enables you to more easily compare data from different places

Data Normalization

- Longest stage of data study. Done to decrease workload and increase efficiency of statistical analysis - Used to make data cleaner and easier to use - File handling (merging, aggregation, transpose, etc) - Data display, color individuals based on criteria - Detection, filtering, Winsorization of outliers - Analyze and impute missing values - Transform variables (recode, standardize, normalize, discretization, etc) - Create new variables - Select best independent variables, discretizations, interactions

Data Preparation Functions

1.) Examine the distribution of variables 2.) Detect rare, missing, aberrant, or extreme values 3.) Test for normality of distribution 4.) Identify the most useful (discriminant) variables for the model, i.e. the variables that contribute the most to the prediction 5.) Transform variables if necessary 6.) Choose the range of binned variables 7.) Create new variables if necessary 8.) Identify interactions between variables 9.) Auto variable selection 10.) Detect collinearity between variables 11.) Do sampling

Data Preparation Steps

- Classification AND Prediction method - Maps possible outcomes of a series of related choices - Allows researchers to weigh possible actions against one another based on their potential outcomes, benefits, etc - Starts at a single node, then branches out from there based on possible outcomes and choices - Considered predictive form of analysis for decision making - Useful for identifying criteria to divide data into classes - Detects two-way interactions between variables - Benefits: easy to read and understand, highly modular, great for picking the best of several options, easy to combine with other decision making methods - Disadvantages: become extremely complex very quickly

Decision Tree
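A minimal sketch of a decision tree classifier with scikit-learn; the built-in iris data stands in for real business data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# A shallow tree: limiting depth keeps the rules readable and curbs complexity
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))          # human-readable splitting rules
print(tree.predict(X[:5]))        # predicted classes for the first rows
```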

- Samples are considered dependent if members from one sample are related to members of the other sample - For example, if Sample A = Husbands and Sample B = Wives, then Sample A & Sample B are dependent because husbands and wives are related to each other

Dependence of Samples

- Used to find information that is 'buried' in a data set - Allows you to find patterns, clusters, groups, etc that were hidden in the data set - Essentially allows you to summarize the data - Does not allow predictions to be made from data, it merely allows associations between variables to be found

Descriptive (Exploratory) Data Mining

- Measure of the spread of data values around their center, e.g. range, variance, standard deviation

Dispersion

- Are essentially outliers - Will affect some methods of analysis more than others, logistic and linear regression are particularly vulnerable to outliers and extremes

Extreme Values

- Factor analysis is an exploratory/descriptive data mining method - Collapses a large number of variables into a few interpretable underlying factors - Key concept of factor analysis is that multiple observed variables have similar patterns of responses because they are all associated with a latent (i.e. not directly measured) variable - Detects links between variables and also identifies characteristics that most clearly separate objects from each other - Good for clustering and pattern recognition - Involves grouping variables with similar attributes into a matrix using linear algebra techniques - In FA there are the same number of factors as there are variables - Each factor captures a certain amount of the overall variance in the observed variables & the factors are always listed in order of how much variation they explain - The eigenvalue is a measure of how much of the variance of the observed variables a factor explains - Any factor with an eigenvalue ≥1 explains more variance than a single observed variable - For example, if a socioeconomic status factor built from three observed variables had an eigenvalue of 2.3, it would explain as much variance as 2.3 of those observed variables. This factor, which captures most of the variance in those three variables, could then be used in other analyses. - The factors that explain the least amount of variance are generally discarded. - Factor analysis methods include: Principal Component Analysis (PCA), Correspondence Analysis (CA), and also Multiple Correspondence Analysis (MCA)

Factor Analysis

- Data mining has some of the following distinctive features: * The development phase cannot be completed in the absence of data * The development of a model is primarily dependent on data * Development and testing are carried out in the same environment, with only the data sets differing from each other * To obtain an optimal model, moving frequently between testing and development is common * The data analysis for development and testing is carried out using a special-purpose program, i.e. SAS, SPSS, etc * Some programs also offer the use of the model, which can be a realistic option if the program is implemented on a server * Conciseness of the data mining models: unlike the instructions of a computer program, which are often relatively numerous, the number of instructions in a data mining model is nearly always small

Features of Data Mining Application Development

- Data preparation method done prior to start of statistical analysis in software - Includes the following steps and procedures: + Merging files together + Aggregation of files + Transposition of files

File handling

- Aberrant values can be remedied or fixed by doing one or more of the following: * Delete the observations in question (if they are not too numerous and if their distribution is suitably random, or if it is clear that they should never have been included in the sample) * Keep the observations but remove the variable from the rest of the analysis (if it is not considered essential, or replace it with a variable that is similar but has no aberrant values) * Keep the observations and the variable, but replace the aberrant value with another value assumed to be as close as possible to its true value * Keep the observations and the variable and use the variable as it is, tolerating a small margin of error in the results of the models. - Use other data sources, cross check between several variables to establish reliability of the 1st variable

Fixing Aberrant Values

- To remedy extreme values and outliers, do one or more of the following: * Exclude outliers with extreme values from the model while ensuring that no more than 2% of total values are excluded * Divide variable into classes, with the extreme values being placed into 1 class and your other normal values into the other class * Winsorize the variable

Fixing Extreme Values and Outliers

Remedies for missing values include: * Not using the variable(s) concerned if its contribution to the analysis of the problem is not essential * Replacing the variable with a similar variable that is not missing values * Replacing the missing value (imputation) with a value determined via a statistical method or replacing it with a value from an external source

Fixing Missing Values

- Regression model in which the choice of which independent (predictor) variables to use in the model is done by an automated procedure - Forward stepwise has a starting model with no predictors - Goes through independent variables 1 at a time and assesses the residual sum of squares for each variable to see which has the smallest sum of squares error. - Adds 1 variable to the model at a time, in ascending order of error, from smallest to largest. - Continues this process until a stopping rule is satisfied, i.e. all remaining variables not being used in the model have a p-value above some threshold.

Forward Stepwise Selection
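scikit-learn's SequentialFeatureSelector gives a forward stepwise flavour of selection, though it scores candidate variables by cross-validation rather than residual-sum-of-squares p-values; a sketch:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Start from an empty model and add predictors one at a time (forward direction)
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=4,
                                     direction="forward").fit(X, y)
print(selector.get_support())     # boolean mask of the selected predictors
```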

- Tests to see if data is normally distributed - Compares data with known distribution and lets you know if both sets have the same distribution - Non-parametric - Used to check normality assumption in ANOVA - Null Hyp = data comes from norm. dist. - Alt Hyp = data NOT from norm. dist. - Mean & variance need to be known in advance prior to running the test - Has weakness of being more sensitive in center of dist. and less sensitive at the tails of dist.

Kolmogorov-Smirnov Test

- Have distinctive feature of NOT relating directly to individuals - Provides details about a person's geographic environment, including: + Economics (# of businesses, working population, unemployment, local businesses and services, consumption habits of local population, etc) + Sociodemographics (population, wealth, average # and age of children, family structures, occupational information) + Housing (age, type, proportion of tenants, owners, etc) + Competition (presence of the business, presence of competitors, market share, etc.) - Less precise than other data since not directly related to individual - Has advantage of being able to target people or an entire population who are not yet customers of a business by targeting their geodemographic data - Enables companies to break into a new market more easily by targeting their services to the local population

Geodemographic Data

- Is any set of data that is NOT homoscedastic - It has data points that are unequal distances from the slope line on a graph. - If your data points on a graph are cone shaped, i.e. they look like < or >, then this is indicative of heteroscedastic data - Residual plots can suggest (but not prove) that the data is heteroscedastic - Park Test & White Test can be used to test for heteroscedastic data - Problems that stem from heteroscedastic data include the following: * Significance test will either run too high or too low (i.e. alpha is too high or low) if data is heteroscedastic * Std. errors will be biased along with corresponding test statistics and confidence intervals - Tests for Heteroscedasticity include: * Levene Test * Bartlett Test * Fisher Test

Heteroscedastic

- Creates clusters that have a predetermined ordering from top to bottom, i.e. files and folders used in your computer are organized in hierarchy - Grouping is based on distance or density - Two main methods of hierarchical clustering: + Divisive method: works in top-down manner. Begins with roots in which all objects are included in single cluster. At each iteration, the most heterogeneous cluster is divided into two. Process iterated until all objects are in their own cluster. - Divisive method good for identifying large clusters + Agglomerative method: works in bottom-up manner. Initially, each object is considered a single-element cluster. At each iteration, the two clusters that are the most similar are combined into a new, bigger cluster. Process iterated until all points are part of 1 single big cluster. - Agglomerative method good for identifying small clusters

Hierarchical Clustering
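A minimal sketch of agglomerative (bottom-up) hierarchical clustering with SciPy on made-up points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two made-up groups of 2-D points centred at (0, 0) and (5, 5)
points = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Agglomerative clustering using Ward linkage
Z = linkage(points, method="ward")

# Cut the tree to obtain 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```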

- Means having same scatter or variance - Data values or points on a graph are all roughly the same distance from the slope line - If your data points on a graph generally follow a straight line, i.e. your plot looks like / or \ then this is indicative of homoscedastic data - General rule of thumb, "If the ratio of the largest variance to the smallest variance is 1.5 or below, then the data is homoscedastic" - Based on the assumption of equal variances that assumes different samples have the same variance even if they came from different populations - ANOVA, T-Test, linear & logistic regression, all assume data is homoscedastic - Tests for Homoscedasticity include: * Bartlett's Test * Box's M Test * Brown-Forsythe Test * Hartley F Max Test * Levene Test

Homoscedastic

- Work on structured files (SAS, SPSS, DB2, etc.) rather than flat files - Limit analyses to the lines and variables relevant to the current process - Recode the variables and make them smaller by using formats - Create Booleans such as alphanumeric variables of length 1, rather than numerical variables. - Clearly define the length of the variables used, limiting it to the minimum possible - Remove intermediate files which are no longer required - Keep enough free space on the hard disk - Defragment the hard disk if necessary. - Do not place the analysed file or the temporary workspace on a remote network since network latency and speed will become an issue - Increase the amount of RAM.

How to Reduce Processing Times

- Use KEEP and DROP commands to analyze only the relevant variables - Use the LENGTH command to clearly define the length of the variables used - Use the PROC DATASETS LIB = WORK KILL NOLIST command to clear out the temp WORK directory often since it is not automatically purged until the end of the SAS session - Use BY command instead of CLASS in the MEANS procedure - Create index on variables used at least 3 times in a WHERE or BY filter - Use COMPRESS = YES command to reduce hard disk space occupied by file by removing all blank characters and spaces in data set - For copying tables, use PROC COPY or PROC DATASETS rather than a DATA step - Use TAGSORT option when sorting a large table - Use the PRESORTED option to sort the table if it has not been done already

Improve SAS Processing Times

- Is the process of replacing missing data with substituted values - Methods of imputing missing values include: + Replace the missing value with the most frequent value (qualitative variables) + Replacing missing value with the mean or median (numerical variables)

Imputation of Variables
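A minimal sketch of simple imputation with pandas, using a made-up table with one numerical and one qualitative variable:

```python
import pandas as pd

df = pd.DataFrame({
    "income":  [52000, None, 61000, 47000, None],   # numerical variable
    "segment": ["A", "B", None, "B", "B"],            # qualitative variable
})

# Numerical variable: replace missing values with the median
df["income"] = df["income"].fillna(df["income"].median())

# Qualitative variable: replace missing values with the most frequent category
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])
print(df)
```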

- Samples are considered independent if members from one sample are unrelated to members of the other sample - For example, if Sample A = Cats and Sample B = Dogs, then Sample A & Sample B are independent since Cats and Dogs are unrelated to each other because they are different animals

Independence of Samples

- Clustering algorithm - Useful for data sets that have NO labels, i.e. data without defined categories or groups - Unsupervised learning method - Allows researcher to find groups of data which are similar to each other and cluster them together, even if the data is not labeled - Clusters defined by centroids, there are k centroids in each data set, where k is chosen in advance by the analyst - A point is considered to belong to a particular cluster if the point is closer to that cluster's centroid than any other centroid - Iterative method, the process of assigning data points to the nearest centroid and then recomputing the centroids is repeated until the assignments converge

K-Means
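A minimal sketch of K-Means with scikit-learn on made-up two-dimensional data; note that k is supplied by the analyst:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three made-up blobs of points
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2)),
               rng.normal((0, 6), 1, (50, 2))])

# k (the number of centroids) is chosen by the analyst in advance
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)        # final centroid positions
print(km.labels_[:10])            # cluster assignment of the first points
```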

- Clustering method - AKA Self-Organizing Map (SOM) - Unsupervised learning neural network - Self organizes around the data - No variables to predict, it learns the structure of the data in order to distinguish clusters hidden within the data - Operates like a matrix that is made up of cells (nodes), vectors, and the magnitudes (weights) of each vector - Each node of the matrix (map) is defined by a vector - Each vector is adjusted during the 'learning' process - Clusters of data are formed after learning process completes - Learning process occurs as follows: 1.) From your data set, select a variable to analyze 2.) The closest node to the selected variable is found 3.) The magnitude (weight) of the vector of that node, and also of the surrounding nodes, is adjusted so that the node's vector is 'moved' towards the variable 4.) Steps 1 through 3 are repeated for a fixed number of times until the 'learning' process is complete.

Kohonen Network (Map)

- Rank based, non-parametric test - Equivalent to one-way ANOVA - Extends Wilcoxon Rank test to 3 or more groups - Used to determine statistical differences between 2 or more groups - Identifies if 1 group is systematically different from the others - 3 or more samples, non norm. dist., heteroscedastic - Null Hyp = All of the groups of data are from same distribution - Alt Hyp = At least 1 group of data is from different distribution - Assumes dependent variable is ordinal or continuous, independent variable consists of 2 or more categorical groups, groups are independent of each other

Kruskal-Wallis Test
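A minimal sketch of the Kruskal-Wallis test with SciPy on made-up, non-normal samples:

```python
from scipy import stats

# Made-up, non-normally distributed samples from three groups
group_1 = [1.2, 3.4, 2.2, 8.9, 0.7, 2.5]
group_2 = [4.5, 6.7, 5.1, 9.8, 7.2, 5.9]
group_3 = [1.1, 2.0, 1.8, 2.9, 1.5, 2.2]

# Null hypothesis: all groups come from the same distribution
h_stat, p_value = stats.kruskal(group_1, group_2, group_3)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```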

- Is a measure of whether data is heavy-tailed or light-tailed in relation to a normal distribution, i.e. more or less data in the tails - Positive excess kurtosis tells you that you have heavy tails (i.e. a lot of data in your tails), giving a sharply peaked histogram - Negative excess kurtosis means that you have light tails (i.e. little data in your tails), giving a nearly flat histogram - The raw (Pearson) kurtosis of a standard normal distribution is 3, which corresponds to an excess kurtosis of 0, so values close to that indicate a nearly normal distribution

Kurtosis

- Life Time Value - Measures the value of a customer over their lifetime based on the quantity and value of their total purchases with company - Takes into account a customer's propensity to buy a company's products - LTV is a combination of the main predictive indicators (propensity, attrition, risk) for a customer

LTV

- Used to verify homoscedasticity of data by testing that variance is equal for all samples. - Best used for data that is NOT normally distributed - DO NOT use Levene's test for norm. dist. data, use Bartlett instead if data is norm. dist. - Assumes samples from the populations under consideration are INDEPENDENT - Null Hyp = variance equal for all samples - Alt Hyp = variance is not equal for at least one pair of samples

Levene's Test

- Data that pertains to time in relation to customer - Age of customer - Length of time as customer - How long at current address - How long at current job - Time since last purchase or return

Lifetime Data

- Classification method, predictive in nature - LDA is based upon the concept of searching for a linear combination of variables (predictors) that best separates two classes (targets) - Quantitative input and qualitative output - DISQUAL is a subset of this method

Linear Discriminant Analysis

- Used to predict the value of a dependent variable Y using the values of 1 or more independent predictor variables X - Is a predictive data analysis method - Assumes the following: * Residuals (error terms) are norm. dist. * Best fitting regression line is straight line * Residuals (error terms) have constant variance at every value of X * Residuals (error terms) are independent * Residuals have mean of zero (error terms sum to 0)

Linear Regression
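A minimal sketch of linear regression with scikit-learn on made-up data generated from Y = 3X + 5 plus noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # one made-up predictor
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1, 100)      # Y = 3X + 5 + noise

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)                 # estimated slope and intercept
print(model.predict([[4.0]]))                        # predicted Y for X = 4
```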

- Changing a variable by doing one or more of the following: * Adding or subtracting a constant to a variable * Dividing or multiplying the variable by a constant - When a random variable is linearly transformed, a new random variable is created

Linear Transformation

- Used to reduce right skew of data points - If data points on a graph are all clustered to one side or the other, instead of distributed evenly across the graph, this indicates skew - Cannot be used for zero or negative values

Log Transformation

- Is a geodemographic typology based on sociodemographic data, lifestyles, behavior and preferences - Used in Experian credit reporting - Comprised of financial variables like education level, size of household, occupation, income, etc.

MOSAIC Analysis

- Another word for Association rules/analysis - Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items - Uses the exact same principles and rules as Association Analysis

Market Basket Analysis

- Student's T-Test (2 samples, norm. dist., homo) - Welch's T-Test (2 samples, norm. dist., hetero) - Wilcoxon-Mann-Whitney (2 samples, non-norm. dist., hetero) - Median Test (2 & 3+ samples, non-norm. dist. hetero) - ANOVA (3+ samples, norm. dist., homo) - Welch-ANOVA (3+ samples, norm. dist., hetero) - Kruskal-Wallis (3+ samples, non-norm. dist., hetero) - Jonckheere-Terpstra (3+ samples that are ordered, non-norm. dist., hetero)

Mean Comparison Tests

- Clustering algorithm to group data points - Akin to K-Means method as they work in the same manner

Moving Centers Method

- Correspondence analysis extended to more than 2 qualitative variables - Can be used to convert qualitative values into quantitative values

Multiple Correspondence Analysis (MCA)

- Each missing value is replaced with a number of plausible values (five is often enough), i.e the missing entries of the incomplete data set are replaced more than 1 time - From this, several complete data tables without missing values are obtained - Can even impute qualitative values

Multiple Imputation

- Classification algorithm - Based on finding functions that describe the probability of data belonging to a class given certain features - Predictor (independent) variables assumed independent - Requires LARGE data sets - Can group objects based on features (i.e. cars have 4 wheels, motorcycles have 2 wheels -> if object has 4 wheels it is most likely a car, if it has 2 wheels it is most likely motorcycle) - For example, a fruit has a high probability of being classified as an apple if it is red, round, and about 3 inches in diameter.

Naive Bayes Method
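A minimal sketch of the wheels example with scikit-learn's GaussianNB; the features and labels below are made up:

```python
from sklearn.naive_bayes import GaussianNB

# Made-up features: [number of wheels, weight in kg]
X = [[4, 1200], [4, 1500], [2, 200], [2, 180], [4, 1300], [2, 220]]
y = ["car", "car", "motorcycle", "motorcycle", "car", "motorcycle"]

model = GaussianNB().fit(X, y)
print(model.predict([[4, 1400], [2, 210]]))     # -> ['car' 'motorcycle']
print(model.predict_proba([[4, 1400]]))         # class probabilities
```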

- Kohonen maps - Basic operating principle of neural networks, which work by using layers of units and the connection between each unit to mimic function and learning ability of human brain - Used to learn structure of data and identify clusters of similar data in data set

Neural Clustering

- The perceptron is an algorithm for supervised learning of binary classifiers (binary classifiers are functions that can decide whether an input, which is represented by a vector of numbers, belongs to some specific class or not) - It is essentially a learning and classification algorithm

Neural Network Perceptron

- Extrapolates new information based on the present information available - Newly obtained information is either qualitative or quantitative - Used to forecast values or make predictions on potential values - Has a dependent variable, i.e. something you are trying to predict

Predictive Data Mining

- System of hardware and/or software that mimics the operation of neurons in the human brain - Designed to recognize patterns in data - Interprets sensory data through a kind of machine perception, labeling or clustering raw data input - The patterns recognized by neural networks are numerical and contained in vectors; input data is translated from its raw form into numerical vector values - Can be used for clustering as well as classifying data - Predictive, can be used as a predictive model to predict future outcomes or values based on past data - Require MASSIVE amounts of computer power - Capacity to handle non-linear relations between the variables is a MAJOR BENEFIT of neural networks

Neural Networks

- Falls into categories - Is NOT ranked or ordered in any way - Examples include: [blue, white, orange], [shirt, pants, jacket], etc

Nominal Data

- Does not assume anything about the underlying distribution of the data set - Applies to any distribution that is not normally distributed - Use only if you have to, i.e. your data is not normally distributed - Parametric tests are preferred over non-parametric since parametric tests are more powerful when their assumptions are met - Types of non-parametric tests include: * Kruskal-Wallis * Wilcoxon-Mann-Whitney (also called Mann-Whitney U)

Non-Parametric Test

- Bell curve, half of data will fall to left of mean, and half of data will fall to right of mean - Mean, median, mode are all equal - Total area under curve is 1 - Curve is symmetric at center, around the mean

Normal Distribution

- Concerned with managing the various marketing channels (sales force, call centres, voice servers, interactive terminals, mobile telephones, Internet, etc.) - Manages marketing campaigns to achieve the best implementation of the marketing strategies for the customers identified by the analytical CRM process - Also supplies analytical CRM process with new additional data for analysis. Creates data 'loop' between operational and analytical CRM

Operational CRM

- Falls into categories - Is ranked or ordered in some way - Examples include: [low, medium, high], [Scale of 1 to 5 for ratings], [good, bad, worst] etc

Ordinal Data

- Principal Component Analysis - Reduces the number of variables in a model by helping identify which variables to remove from model completely - Helps ensure that variables are independent of one another - Makes variables less interpretable - Measures how each variable is associated with each other via Covariance matrix - Breaks Covariance matrix down into 2 separate components, direction and magnitude - Show the directions in which data is dispersed via eigenvectors - Show importance of those different directions via eigenvalues - Combines predictor variables and allows the dropping of eigenvectors that are unimportant to model

PCA Analysis
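A minimal sketch of PCA with scikit-learn, standardizing first and keeping 2 components; the iris data is used as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)       # standardize before PCA

pca = PCA(n_components=2).fit(X_std)
print(pca.explained_variance_ratio_)            # share of variance per component
X_reduced = pca.transform(X_std)                # data projected onto 2 components
print(X_reduced.shape)
```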

- Makes assumptions about the population parameters and the distribution of the data - Applies to distributions that are normally distributed - Parametric tests are preferred over non-parametric since parametric tests are more powerful when their assumptions are met - Use parametric test over non-parametric tests if possible - Examples of parametric tests include: * ANOVA * T-Tests * Welch's T-Test * Pearson correlation

Parametric Test

- Shows the linear relationship between two sets of data, most typically via graph - Answers the question, "Can I draw a line graph to represent the data?" - Measures how well related two sets of data are - Does not give you any information about slope of line, only tells you if a relationship exists between data sets - Correlation ranges from -1 to 1 - Values closer to 1 mean strong positive correlation (relationship) between variables - Values closer to -1 mean strong negative correlation (relationship) between variables - Values closer to 0 mean there is no correlation (aka zero correlation) - Positive correlation = positive (upward) slope on graph - Negative correlation = negative (downward) slope on graph

Pearson Correlation

- Used to model which independent variables have statistically significant effects on the dependent variable. - Explains which X values work on the Y value - Best used for rare events as they tend to follow a Poisson distribution (i.e. # of bacteria found on a surface, # of 911 calls that end in death of suspect, # of colds caught at school) - Poisson regression is used for count data, i.e. the dependent variable is a non-negative integer count

Poisson Regression
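A minimal sketch of Poisson regression using statsmodels' GLM with a Poisson family, on made-up count data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 200)
counts = rng.poisson(lam=np.exp(0.5 + 0.8 * x))    # made-up event counts

X = sm.add_constant(x)                              # intercept + predictor
model = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(model.params)                                 # coefficients on the log-rate scale
```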

- Is an approach to estimate the computational complexity of an algorithm or computational problem. - Starts with assumption about prob. dist. of all possible inputs - Assumption is then used to design efficient algorithm or derive the complexity of known algorithm

Probabilistic Analysis

- Is an application of data mining in CRM - Studies the probability that a customer will be interested in a product or service - Enables targeted marketing campaigns to be refined - Saves a company money on marketing by advertising specific products to specific people based on the individual's propensity to consume a product

Propensity Analysis

- Qualitative data that describes consumers and customers based on psychological attributes like personality, values, opinions, attitudes, interests, and lifestyles - Typically used for segmenting consumers into unique groups based on psychographic profiles. - Psychographic profiles are often used in targeted marketing and advertising campaigns that are aimed at a group with particular psychographic profile traits Examples of psychographic data include: - Lifestyle - Personality (shy, prudent, ambitious, outgoing, etc) - Values ( conservative, liberal, materialistic, etc) - Risk aversion (trustful, mistrustful, anxious, demanding etc) - Knowledge - Focus of interest - Opinions and behavior

Psychographic Data

- Variables are grouped/clustered together based on categorical values, i.e. bands are grouped together in iTunes based on the genre of music (pop, rock, rap, etc, all have many different bands that play the same style of music)

Qualitative Clustering

- Data is categorical, descriptive - Cannot perform math operations on qualitative data - Is either Ordinal (ranked or ordered) or it is non-ordinal (not ranked or ordered) - Represents characteristics, i.e. a person's gender, marital status, hometown, etc

Qualitative Data

- Variables are grouped/clustered together based on numerical values (either continuous or discrete), i.e. best selling items on Amazon (they are best sellers because they have sold more quantity than other similar items)

Quantitative Clustering

- Data is continuous or discrete - Data can be measured or counted in some way

Quantitative Data

- RFM stands for Recency, Frequency, Monetary value - Cross-tabulates recency of last purchase in the period being studied with the frequency of purchases in that period, then examines the distribution of purchases - Answers the Where, When, How, Quantity, and What questions - Analyzes Where was product purchased? When was it purchased? How was it paid for? How much was purchased? What was purchased?

RFM Analysis

- Neural network - Supervised learning network - Works with only 1 hidden layer - Centres of the hidden layer in RBF network are not adjusted at each iteration during learning - In RBF network, the hidden neurons share space and are virtually independent of each other. This makes for faster convergence of RBF networks in the learning phase, which is one of their strong points. - Weakness of RBF is that it may need large number of units in its hidden layer, which increases execution time of network without always yielding perfect modelling of complex structures and irregular data

Radial Basis Function Networks

- Can create bias in an analysis by appearing more important than they really are. - Removing or replacing rare values with a more frequently occurring value is the best method for fixing rare values.

Rare Values

- Multiplication by the inverse - So x would be transformed into 1/x - Cannot be used for values of zero or on values with negative numbers

Reciprocal Transformation

- Data that pertains to customer interactions with company - Preferred method of contact - Preferred method of delivery - History of calls to customer service - History of complaints

Relational Data

- Used to test data sets to ensure that the normality assumption holds (i.e. data is normally distributed) - Testing for normality is critical for linear and logistic regression, t-tests, ANOVA, correlation, etc since they all assume data is norm. dist. - Results can be biased or even invalidated outright if normality assumption is violated - Test for normality include: * Shapiro-Wilk (P-P Plot) * Kolmogorov-Smirnov * Anderson-Darling

Tests for Normality

- Process of selecting a subset of individuals from a population, and collecting their data, to be used for analysis

Sampling

- Used to tell if a random sample comes from a population that is norm. dist. - Tests normality of sample data - Test statistic is given by W - Small values of W indicate sample is NOT norm. dist. - Value of 1 is perfect normal distribution - The closer to 0 value gets, the more non-normally distributed your data set is. - Null Hyp = sample comes from norm. dist. population - Alt Hyp = sample is NOT from norm. dist. population - Test has inherent bias in that the larger the sample size, the more likely it is that test results will be statistically significant

Shapiro-Wilk Test
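A minimal sketch of the Shapiro-Wilk test with SciPy, comparing a made-up normal sample with a skewed one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0, scale=1, size=50)
skewed_sample = rng.exponential(scale=1, size=50)

for name, data in [("normal", normal_sample), ("skewed", skewed_sample)]:
    w, p = stats.shapiro(data)
    print(f"{name}: W = {w:.3f}, p = {p:.4f}")
# W close to 1 and a large p -> consistent with a normal distribution
```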

- Each missing value is replaced with an assumed value - Has the drawback of underestimating the variability of the imputed data and the confidence intervals of the estimated parameters

Simple Imputation

- Draw n individuals at random, without replacement, from a population of N - Each individual has 1/N probability of being drawn

Simple Random Sampling

- Personal information about individual (name, gender, level of education, etc.) - Family information (family situation, number and ages of children, number of dependents, etc.) - Occupational (income, occupation, social category, # of working and retired people in the household, etc.) - Wealth (property ownership, owner or renter of home, value of residence, possession of a second home, etc.) - Geographical (length of time at the address, region of residence, place of residence, # of inhabitants living in residence, ZIP code, etc) - Environmental and Geodemographic (competition, population, working population, customer population, unemployment rates, economic potential, product ownership rates, etc., in the area of residence of the customer or prospect)

Socio-demographic Data

- Used to reduce right skew of data points - Weaker than log or cube transformations - CAN be used for zero values

Square Root Transformation

- Used to reduce left skew of data points - Moderate effect on distribution shape - Can only be used if variable values are zero or positive

Square Transformation

- Forward stepwise selection starts with no variables in model initially, variables are added into model one by one - Backward stepwise selection starts with all of the variables in the initial model, variables are removed from model one by one - Combined stepwise starts with forward stepwise then changes to backward stepwise, process alternates between forward and backward until no variables can be removed or added

Stepwise Selection

- Done by dividing the population into strata (levels/layers) and then drawing individuals at random from each stratum (level/layer). This creates a sub-sample for each stratum (level/layer). All of the sub-samples can then be brought together for analysis

Stratified Sampling

- AKA Un-paired t-test - Test compares two averages (means) and tells you if they are different from each other. - T-Test also tells you how significant the differences in means are; In other words it lets you know if those differences could have happened by chance. - Best used to determine if there is a difference between 2 independent groups, i.e men vs women, treatment vs control group, etc. - Null Hyp = the means are equal - Alt Hyp = the means are different

Student's T-Test
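A minimal sketch of an unpaired Student's t-test with SciPy on made-up treatment/control scores:

```python
from scipy import stats

# Made-up scores for two independent groups (treatment vs control)
treatment = [23.1, 25.4, 22.8, 26.0, 24.3, 25.1]
control   = [21.0, 20.5, 22.1, 19.8, 21.7, 20.9]

# Student's t-test assumes equal variances; set equal_var=False for Welch's test
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```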

- Multi-Layer Perceptron (MLP) - Radial Basis Function (RBF)

Supervised (Predictive) Learning Methods

- Supervised learning methods used for classification, regression, and detecting outliers in data - Builds predictive model that can assign new input objects into 1 category or another - Essentially SVM is used to separate classes from each other - ADVANTAGES: + Can model non-linear phenomena + Precision of prediction in certain cases + Robust - DISADVANTAGES: + Does not directly provide estimated probabilities + Sensitive to choice of Kernel parameters + Long computation time + Limited # of software programs that can implement SVM

Support Vector Machines

- Set of methods for analyzing data where the dependent variable is the period of time until an event of interest occurs - Time to event/survival time is what's being measured, so variable to be predicted is a period of time - The focus of SA is on the passage of time until an event of interest occurs - Objective of survival analysis to explain the variable 'length of survival' in terms of time in order to reveal factors (ie other variables) that contribute to increased survival time - Observations in SA are called censored when the information about their survival time is incomplete, i.e. the event of interest did not happen while person was under observation during a study - Unlike ordinary regression models, SA can correctly incorporate information from both censored and uncensored observations in estimating important model parameters - Churn and retention rates for customers is an ideal scenario to use survival analysis with

Survival Analysis

- Individuals are not drawn at random - Instead, individuals are drawn according to a fixed pattern or rule - Examples of systematic sampling include: * Selecting only customer ID numbers that end in an even number * Choosing only the order numbers that begin with an odd number * Selecting the 1st, 101st, 301st....n01st customer

Systematic Sampling

- Used to validate the model for statistical accuracy and ensure that no bias is present

Test Sample

- Tests for Heteroscedasticity include: * Levene Test (best all around, has low sensitivity to non-normality) * Bartlett Test (best for normal distributions) * Fisher Test (least robust method if data is not normally distributed)

Tests for Heteroscedasticity

- Tests for Homoscedasticity include: * Bartlett's Test * Box's M Test * Brown-Forsythe Test * Hartley F Max Test * Levene Test

Tests for Homoscedasticity

- Predictive method - Used to study sequences of measurements of a variable or variables, the measurements are most often made at regular time intervals. - Explains how the past affects the future or how two time series can "interact" with each other - Forecast future values of the series. - Serve as a control standard for a variable that measures the quality of product in some manufacturing situations - Plotted most frequently with line charts - Has defining characteristic that order of observations is critical, because there is dependency between observations, so changing the order of the observations could change the meaning of the data

Time Series Analysis

- Used to develop the statistical model

Training Sample

- Data that describes a transaction - Where - When - How much - What - Method of payment

Transactional Data

- Used to determine if a statistically significant difference exists between two sample means from two populations - Tests if the means of two independent groups are different - Assumes dependent variable is continuous, independent variable is categorical with 2 levels, samples are independent, samples are norm. dist., variance is equal in populations (homoscedastic), samples are random

Two-Sample T-Test

- Means analyzing 1 variable - Explores the statistics and details of 1 variable (i.e. examine the mean, median, mode, std. dev, outliers, etc.) - Useful for summarizing data and finding patterns - Examine uni-variate statistics of variables to: + Detect any anomalies in their distribution (i.e. find outliers or missing values) + Get idea of orders of magnitude (like averages) which will be useful in subsequent analysis + See how to discretize the variables if necessary

Uni-variate

- Association Analysis - Cluster Analysis - Neural Clustering - Factor Analysis

Unsupervised (Descriptive) Learning Methods

- Used to compare 2 sample means from 2 different populations to test if the means are equal to or different from each other. - Does NOT assume equal variance (homoscedasticity), unlike Student's T-Test - Assumes normal distribution in data, like Student's T-Test - Best used when sample sizes and variances are unequal - Gives essentially the same results as Student's T-Test when variances and sample sizes are equal - More robust than Student's T-Test since Welch is not sensitive to unequal variance or unequal sample sizes - 2 samples, norm. dist., data sets that are heteroscedastic (unequal variance) - Null Hyp = means are equal - Alt Hyp = means are not equal

Welch's T-Test

- Non-parametric, rank based test method for 2 samples - Used to determine if 2 samples come from the same population by comparing the ranks of the observations in each sample - Alternative to the t-test, used for non-norm. dist. data - Typically used on ordinal (ranked) data sets or when assumptions for the t-test are not met (i.e. data set is not norm. dist.) - Compares ranks of one group to the other group to find the test statistic - Assumes that the sample is drawn randomly, that observations appear only in one group (cannot be in both), ordinal measurement scale is assumed (i.e. one observation is always bigger or smaller than another observation) - Null Hyp = the 2 samples come from the same population (same distribution) - Alt Hyp = the samples come from different populations

Wilcoxon-Mann-Whitney Test
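A minimal sketch of the Wilcoxon-Mann-Whitney (Mann-Whitney U) test with SciPy on made-up, non-normal samples:

```python
from scipy import stats

# Made-up, non-normally distributed samples from two independent groups
group_a = [1.1, 2.3, 1.8, 9.5, 2.0, 1.6]
group_b = [3.4, 4.1, 3.9, 12.2, 4.4, 3.7]

# Null hypothesis: the two samples come from the same population
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat:.3f}, p = {p_value:.4f}")
```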

