C743
Accuracy rate
# of correctly completed fields / # of completed fields
Wilcoxon
2 samples, non-normal and heteroscedastic
Welch's t-test
2 samples, normal and heteroscedastic
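A minimal sketch of the statistic behind Welch's t-test, using only the standard library. The function name `welch_t` and the two samples are illustrative assumptions; the point is that the two variances are kept separate rather than pooled, which is what makes the test usable on heteroscedastic samples.

```python
import math
import statistics

def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error; variances are not pooled
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical samples with clearly different spreads
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(group_a, group_b)
print(round(t_stat, 3), round(dof, 2))
```

The p-value would then be read from a Student t distribution with `dof` degrees of freedom, which this sketch leaves out.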
recall rate
# of correctly completed fields / # of fields to be completed
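The two rates differ only in their denominator. A small sketch with made-up counts (100 fields to complete, 80 actually completed, 60 of those correct); the function names are illustrative.

```python
def accuracy_rate(correct: int, completed: int) -> float:
    """Correctly completed fields / completed fields."""
    return correct / completed

def recall_rate(correct: int, to_complete: int) -> float:
    """Correctly completed fields / fields that should have been completed."""
    return correct / to_complete

# Hypothetical extraction run
accuracy = accuracy_rate(correct=60, completed=80)
recall = recall_rate(correct=60, to_complete=100)
print(accuracy, recall)
```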
predefined requests
Fixed number of terms applied; changes in a dynamic way
ANOVA
3 or more samples, normality and homoscedasticity
ordinal
Ordered; treated the same way as discrete data, e.g. low, medium, high
simple random
Select individuals at random without replacement; every individual has the same probability of being drawn (1/N on the first draw from a population of N)
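A sketch of simple random sampling with the standard library. `random.sample` draws without replacement; the population of customer IDs, the seed, and the sample size are assumptions for illustration.

```python
import random

population = list(range(1, 1001))  # hypothetical customer IDs 1..1000
rng = random.Random(42)            # fixed seed only so the draw is repeatable
sample = rng.sample(population, k=50)  # no individual can be drawn twice
print(len(sample), len(set(sample)))
```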
reciprocal transformation
x to 1/x - not for zero or negative values
non-ordinal
Nominal, e.g. blue, white, orange
Extreme values
not necessarily aberrant
disambiguation
polysemy, ellipses, homographs, irony, anaphora, digit 0 vs. letter O
What are the 3 categories data can fall into?
Quantitative, qualitative, text
Missing values
Remove or statistically replace if no more than 15-20% are missing
Info extraction
Search for specific information in documents, without any comparison of themes
interval
Data broken up into small groups, where the differences between the groups matter, e.g. the difference between 100 and 90 is the same as the difference between 80 and 90
Categorical or nominal
Data broken up into categories that are not necessarily related to each other. Example: What color is the car?
bivariate
Data for two variables (usually two types of related data). Example: Ice cream sales versus the temperature on that day. The two variables are Ice Cream Sales and Temperature.
Transforming Variables
Data normalization: transform with a mathematical function. Data discretization: convert continuous data into a small, finite number of values.
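A sketch of both kinds of transformation on a made-up variable. Two choices here are assumptions, not something the card prescribes: z-score standardization stands in for "a math function", and equal-frequency cut points from `statistics.quantiles` stand in for the discretization rule.

```python
import statistics
from bisect import bisect_right
from collections import Counter

values = [12, 15, 18, 22, 25, 31, 38, 44, 52, 67, 80, 95]  # hypothetical amounts

# Normalization: apply a mathematical function to every value (here a z-score)
mean, sd = statistics.mean(values), statistics.stdev(values)
z_scores = [(v - mean) / sd for v in values]

# Discretization: replace each continuous value by one of 4 class labels
cuts = statistics.quantiles(values, n=4)          # 3 cut points (quartiles)
classes = [bisect_right(cuts, v) for v in values]  # class index 0..3
print(Counter(classes))
```

With quartile cut points each class receives the same number of observations, which also avoids classes that are too small.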
Example of tests of normality
Shapiro-Wilk (P-P plot), Kolmogorov-Smirnov, Anderson-Darling
discrete
Discrete data are those whose values belong to a finite or infinite subset of the set N of natural numbers, e.g. number of children, number of products bought
What is a test of normality doing?
It tests whether the data fit a normal distribution. The assumption of normality means that you should make sure your data roughly fit a bell-curve shape before running certain statistical tests or regressions.
Linguistic analysis
Language identification, identification of grammatical categories, disambiguation, recognition of compound words
qualitative
Qualitative data are not quantities, but they may be ordered; in this case we speak of ordinal qualitative data (e.g. 'low, medium, high'). Non-ordered qualitative data are called nominal. Ordinal data can be classed in the family of discrete data and treated in the same way. Two types: ordinal, nominal.
quantitative
Quantitative (or numerical) data may be continuous or discrete. What distinguishes continuous and discrete data from other types is that they are concerned with quantities, so we can perform arithmetical operations on them; moreover, they are ordered. Two types: continuous, discrete.
attitudinal data
attitude toward product, reasons for buying, attractiveness of competition
bin tips
Avoid large differences in the number of classes between variables; avoid too many classes for a variable; avoid classes that are too small; about 4 or 5 classes is good
psychographic data
lifestyle, personality, risk-aversion
LTV
lifetime value
non-monotonic
An example of a non-monotonic response is attendance at health spas, which will be lower for young and retired persons and higher for active persons.
R
Not good for large data sets; cost effective, open source, lots of documentation, more difficult to learn, frequent updates, similar to S-PLUS
geodemographic data
Not relating to the individual: mode of consumption, place of residence in economic terms, geocode
Sociodemographic data
personal, family, occupational, wealth, geographical, environmental, and geodemographic
systematic sampling
the individuals are drawn not at random, but in a regular way. If we carry out a 'one in a hundred' sampling, we take the first individual, then the 101st, then the 201st, and so on.
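The 'one in a hundred' rule can be written as a slice with a fixed step. The population of 1000 numbered individuals is a made-up example.

```python
population = list(range(1, 1001))  # hypothetical individuals numbered 1..1000
step = 100                         # 'one in a hundred'
systematic_sample = population[0::step]  # 1st, 101st, 201st, ...
print(systematic_sample)
```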
non-proportional stratified sampling
To take into account the variability of the phenomenon in each stratum
Levene's
to test if n samples have equal variances
test sample
to validate the model
commercial sector data types
transactional, product, customer, geodemographic, technical
log transformation
Used to reduce right skewness; not for zero or negative values
Square root transformation
Weaker than cube root and log. Reduces right skew. Can be applied to zero values.
cube root transformation
Weaker than log. Reduces right skew. Can be applied to zero or negative values.
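A sketch comparing the strength of the square root, cube root, and log transformations on a made-up right-skewed variable. The `skewness` helper (third standardized moment) and the data are assumptions for illustration; the stronger the transformation, the more the skewness coefficient drops.

```python
import math
import statistics

def skewness(xs):
    """Third standardized moment: positive means a long right tail."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - m) / sd) ** 3 for x in xs) / len(xs)

def cube_root(x):
    """Real cube root that also works for zero and negative values."""
    return math.copysign(abs(x) ** (1 / 3), x)

amounts = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]  # hypothetical right-skewed data

skews = {
    "raw": skewness(amounts),
    "square root": skewness([math.sqrt(x) for x in amounts]),
    "cube root": skewness([cube_root(x) for x in amounts]),
    "log": skewness([math.log(x) for x in amounts]),  # needs strictly positive x
}
for name, value in skews.items():
    print(f"{name}: {value:.2f}")
```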
transactional data
where, when, how, how much, what
predictive
Predictive (or explanatory) techniques are designed to extrapolate new information based on the present information, this new information being qualitative (in the form of classification or scoring) or quantitative (regression). There is a dependent variable.
monotonic
Always increasing or always decreasing
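A small check for monotonicity of a response across ordered classes. The function name and the spa attendance figures per age band are made up for illustration.

```python
def is_monotonic(ys):
    """True if the sequence never decreases or never increases."""
    pairs = list(zip(ys, ys[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

# Hypothetical attendance rates (%) by age band: young, active, older active, retired
spa_attendance = [12, 30, 34, 18]
print(is_monotonic(spa_attendance))  # rises then falls, so not monotonic
print(is_monotonic([1, 2, 2, 5]))
```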
Data Mining software
Icons can be moved and linked with arrows, e.g. IBM SPSS
contextual data
Information about the author
Regression assumptions
1) Linearity 2) independence of errors 3) homoscedasticity 4) normality of error distribution
phases of data Mining
1) define aims 2) list existing data 3) collect data 4) explore and prepare data 5) segment population 6) draw up and validate predictive models 7) deploy models 8) train model users 9) monitor models 10) enrich models
data prep steps
1) examine distribution of variables 2) detect rare, missing, aberrant, and/or extreme values 3) test normality 4) detect most discriminating variables 5) transform variables 6) choose ranges of binned variables 7) create new variables 8) detect interactions 9) automatic variable selection 10) detect collinearity 11) sampling
Cramer's V or chi squared
2 discrete vars
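A sketch of chi-squared and Cramér's V computed from a contingency table of two discrete variables. The function name and the 2x2 table of counts are made up; V rescales chi-squared to the range 0 to 1 so that tables of different sizes can be compared.

```python
import math

def cramers_v(table):
    """Return (chi-squared, Cramér's V) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count under independence
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))  # smaller of the two dimensions
    return chi2, math.sqrt(chi2 / (n * (k - 1)))

# Hypothetical counts: rows = region A/B, columns = bought yes/no
chi2, v = cramers_v([[30, 10], [10, 30]])
print(chi2, v)
```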
Nonparametric Wilcoxon-Mann-Whitney
2 groups
Median test
2 or more samples, non-normal and heteroscedastic
Student's t-test
2 samples, normality and homoscedasticity
Jonckheere-Terpstra
3 or more ordered samples, non-normal and heteroscedastic
Kruskal-Wallis
3 or more samples, non-normal and heteroscedastic
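A sketch of the Kruskal-Wallis H statistic, which replaces the values by their ranks in the pooled sample. The function name and the three groups are made up, and the sketch omits the correction factor normally applied when there are ties.

```python
def kruskal_wallis_h(*groups):
    """H statistic from pooled ranks (no tie correction in this sketch)."""
    pooled = sorted(x for g in groups for x in g)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # tied values occupy positions i+1..j and share their average rank
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    n = len(pooled)
    total = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

# Hypothetical, fully separated groups
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(round(h, 2))
```

Under the null hypothesis H is compared with a chi-squared distribution with (number of groups - 1) degrees of freedom.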
Welch-ANOVA
3 or more samples, normal and heteroscedastic
Nonparametric Kruskal-Wallis
> 2 groups
continuous
Continuous (or scale or interval) data are those whose values belong to an infinite subset of the set R of real numbers, e.g. wages, amount of purchases, height, weight
cluster sampling
Divide the population into clusters, draw clusters at random, then take the individuals from each drawn cluster
Stepwise selection
Forward: no variables in the model at the outset, add one at a time. Backward: all variables in, then remove one by one. Combined: alternate forward and backward steps until no variable can be added or removed.
Descriptive Data Mining
In descriptive methods, for reducing, summarizing and grouping data, there is no dependent variable, i.e. no privileged variable.
predictive data Mining
In predictive methods, which explain data, there is a dependent variable, in other words a variable to be explained, or a privileged variable.
What does the Shapiro-Wilk (p-p plot) do?
In the Shapiro-Wilk test, the cumulative distribution of the data is shown on a normal probability scale, called a P-P (probability-probability) plot, where a normal distribution is shown by a straight line with a slope of 1 (Figure 3.10). Thus the Shapiro-Wilk statistic is a way of measuring how far the graphic representation of the data deviates from the straight line.
square trans
Moderate effect on distribution shape. Can reduce left skew. Main use is fitting a quadratic equation. Only if the variable is zero or positive.
Parametric ANOVA
One discrete and one continuous variable
Ordinal
Order matters, but the differences between values do not. Examples: a scale of 1 to 10, or Temperature = "high", "medium", "low"
What does the Kolmogorov-Smirnov test do?
The Kolmogorov-Smirnov test involves measuring the maximum deviation D (in absolute terms) between the distribution function (cumulative distribution function) of the variable tested and the distribution function of a Gaussian variable (or, more generally, of any continuous variable whose distribution is to be compared with that of the observed variable).
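A sketch of the deviation D against a Gaussian distribution, using `statistics.NormalDist`. One assumption to flag: the mean and standard deviation are estimated from the sample itself, which strictly gives the Lilliefors variant of the test. Both samples are synthetic.

```python
import math
from statistics import NormalDist, fmean, stdev

def ks_statistic(xs):
    """Maximum gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(xs)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # the empirical CDF jumps from (i-1)/n to i/n at x; check both sides
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Synthetic samples: evenly spaced normal quantiles, and their exponentials
normal_like = [NormalDist().inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
skewed = [math.exp(z) for z in normal_like]
d_normal = ks_statistic(normal_like)
d_skewed = ks_statistic(skewed)
print(round(d_normal, 3), round(d_skewed, 3))
```

The skewed sample yields the larger D, that is, the larger departure from normality.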
Analytical CRM
The aim of analytical CRM is to extract, store, analyse and output the relevant information to provide a comprehensive, integrated view of the customer in the business, in order to understand his profile and needs more fully.
text
Uncoded texts, written in natural language, e.g. letters, reports
examine dist of vars
Univariate, bivariate
Univariate
Univariate means "one variable" (one type of data). Example: You weigh the pups and get these results: 2.5, 3.5, 3.3, 3.1, 2.6, 3.6, 2.4. The "one variable" is Puppy Weight.
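A univariate summary of the puppy weights from the card, as a sketch with the standard library; the choice of summary statistics is an assumption.

```python
import statistics

puppy_weights = [2.5, 3.5, 3.3, 3.1, 2.6, 3.6, 2.4]

summary = {
    "mean": statistics.mean(puppy_weights),
    "median": statistics.median(puppy_weights),
    "min": min(puppy_weights),
    "max": max(puppy_weights),
}
print(summary)
```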
lifetimes data
Age, length of time as customer, at current address, at current job, time since last purchase or claim
ratio
All the properties of interval data, plus a clear definition of 0.0: when the value is 0.0 there is none of the variable, e.g. height, weight
univariate visuals
Bar chart or frequency table for qualitative or discrete variables; box plot or histogram for continuous variables
SAS
Better for large data sets; most expensive, good interface, tech support, fairly easy to learn
Bartlett's
Better than Levene's if there is strong evidence that the data come from a normal or nearly normal distribution
operational CRM
concerned with managing the various channels (sales force, call centres, voice servers, interactive terminals, mobile telephones, Internet, etc.) and marketing campaigns for the best implementation of the strategies identified by the analytical CRM.
Collinearity
Correlation among predictors in multiple regression, e.g. height and weight are both used to predict something and the results will be skewed because height and weight are related
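A sketch of how collinearity between two predictors can be spotted: compute their Pearson correlation and, for the two-predictor case, the variance inflation factor 1 / (1 - r^2). The height and weight figures are made up, and the helper name is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical predictors (cm and kg)
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 88]

r = pearson_r(heights, weights)
vif = 1 / (1 - r ** 2)  # variance inflation factor when there are two predictors
print(round(r, 3), round(vif, 1))
```

A correlation this close to 1 means the two predictors carry almost the same information, so one of them is usually dropped or the two are combined.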
rare values
Create bias by appearing more important than they really are. Better to remove them or replace them with a more frequent value.
paratextual data
Date and purpose of the document
descriptive
Designed to bring out information that is present but buried in a mass of data (as in the case of automatic clustering of individuals and searches for associations between products or medicines). No dependent variable.
IT dev data mining features
Development can't be done without data; development and testing happen in the same environment
profitability data
Difference between profits and costs, LTV
stratified sampling
Divide the population into strata, then select at random from each stratum
Info retrieval
Docs in their totality
Aberrant values
erroneous value
Zipf's law
A few tens of words are enough to represent a large part of any corpus
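A sketch of how the share of a corpus covered by its most frequent words can be measured. The `coverage` helper and the toy sentence are assumptions; a real corpus would be needed to see the effect described by Zipf's law.

```python
from collections import Counter

def coverage(tokens, top_n):
    """Share of all tokens accounted for by the top_n most frequent words."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)

# Toy corpus, far too small to be representative
tokens = "the cat and the dog and the bird saw the cat".split()
print(round(coverage(tokens, 1), 2), round(coverage(tokens, 3), 2))
```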
training sample
For developing the model
lemmatization
Reducing a word to the form found in the dictionary
grouping variants
graphic, syntactic, semantic, synonyms, parasynonyms, full forms of abbr., expressions and metaphors
grouping analogies
Group terms in families of derivative terms, e.g. credit-loan-undertaking-debt
homoscedasticity
Homoscedasticity means "having the same scatter." For it to exist in a set of data, the points must be about the same distance from the line. Tests for it: Levene's, Bartlett's, Fisher
proportional stratified sampling
E.g. if 30% of customers in the population are over 60, then 30% of the stratified sample must be over 60
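A sketch of proportional allocation: each stratum contributes to the sample in the same proportion as in the population. The function, the customer records, and the `over_60` flag are made up for illustration; note that rounding the quotas can make the total differ slightly from `k` for other inputs.

```python
import random

def proportional_stratified_sample(population, stratum_of, k, seed=0):
    """Draw about k individuals, keeping each stratum's population share."""
    rng = random.Random(seed)
    strata = {}
    for individual in population:
        strata.setdefault(stratum_of(individual), []).append(individual)
    sample = []
    for members in strata.values():
        quota = round(k * len(members) / len(population))
        sample.extend(rng.sample(members, quota))  # random draw inside the stratum
    return sample

# Hypothetical population: 1000 customers, 30% of them over 60
customers = [{"id": i, "over_60": i < 300} for i in range(1000)]
sample = proportional_stratified_sample(customers, lambda c: c["over_60"], k=100)
print(len(sample), sum(c["over_60"] for c in sample))
```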
open requests
in form of keywords or free text - changes slowly
predictive qualitative
in the form of classification or scoring
relational data
Preferences for contact and delivery, calls to customer service, complaints
Statistical software
Programming windows or scrolling menus, e.g. SAS, S-PLUS
RFM
Recency, frequency, monetary value
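A sketch of the three RFM components computed from a transaction log. The function name, customers, dates, and amounts are made up; recency is taken here as days since the last purchase, which is one common convention.

```python
from datetime import date

def rfm(transactions, today):
    """Return {customer: (recency_days, frequency, monetary)}."""
    last, freq, total = {}, {}, {}
    for customer, day, amount in transactions:
        last[customer] = max(last.get(customer, day), day)  # most recent purchase
        freq[customer] = freq.get(customer, 0) + 1          # number of purchases
        total[customer] = total.get(customer, 0.0) + amount  # total spent
    return {c: ((today - last[c]).days, freq[c], total[c]) for c in last}

# Hypothetical transaction log: (customer, purchase date, amount)
transactions = [
    ("alice", date(2024, 1, 5), 40.0),
    ("alice", date(2024, 3, 1), 60.0),
    ("bob", date(2023, 11, 20), 15.0),
]
scores = rfm(transactions, today=date(2024, 3, 31))
print(scores)
```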
Predictive quantitative
regression
customer data
Relational, attitudinal, psychographic, lifetimes, channel, sociodemographic