C743

Accuracy rate

# of correctly completed fields / # of completed fields

Wilcoxon

2 samples, non-normal and hetero

Welch's t-test

2 samples, normal and hetero

recall rate

# of correctly completed fields / # of fields to be completed
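
The accuracy rate and recall rate cards are both simple ratios over field counts. A minimal sketch, assuming a hypothetical extraction run; the function names and the counts are illustrative and not from the source:

```python
def accuracy_rate(correct: int, completed: int) -> float:
    """Share of the completed fields that were completed correctly."""
    if completed == 0:
        raise ValueError("no completed fields")
    return correct / completed


def recall_rate(correct: int, to_complete: int) -> float:
    """Share of the fields that should be completed that were completed correctly."""
    if to_complete == 0:
        raise ValueError("no fields to complete")
    return correct / to_complete


# Hypothetical run: 50 fields should be filled, 40 were filled,
# and 30 of those 40 are correct.
acc = accuracy_rate(correct=30, completed=40)
rec = recall_rate(correct=30, to_complete=50)
print(acc, rec)
```

The two rates share a numerator; only the denominator changes, so a system that fills few fields can score a high accuracy rate and a low recall rate.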

predefined requests

# of fixed terms applied - changes in dynamic way

ANOVA

Three or more samples, normality and homoscedasticity

ordinal

ordered; treated the same as discrete, e.g. low, medium, high

simple random

select individuals at random without replacement; each has a prob of 1/n of being drawn
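
Simple random sampling without replacement can be sketched with the standard library. The helper name, the seed, and the customer list are assumptions made for illustration:

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct individuals; every individual is equally likely to be picked."""
    rng = random.Random(seed)  # a seed only makes the draw reproducible
    return rng.sample(population, k)


# Hypothetical population of 100 customer IDs.
customers = list(range(1, 101))
sample = simple_random_sample(customers, 10, seed=0)
print(sample)
```

`random.Random.sample` never returns the same element twice, which matches the "without replacement" part of the card.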

reciprocal transformation

x to 1/x - not for zero or negative values

non-ordinal

nominal, ie. blue, white, orange

Extreme values

not necessarily aberrant

disambiguation

polysemy, ellipses, homographs, irony, anaphora, digit 0 vs. letter O

What are the 3 categories Data can fall into?

quantitative, qualitative, text

Missing values

remove or stat replace if not more than 15-20% missing
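
A minimal sketch of statistical replacement, assuming mean imputation and a 20% cut-off taken from the upper end of the range on the card. The function name, the `None` marker for a missing value, and the data are hypothetical:

```python
from statistics import mean


def impute_mean(values, max_missing_share=0.20):
    """Replace missing entries (None) with the mean of the observed entries.

    Refuses to impute when the share of missing entries is above the threshold.
    """
    missing = sum(v is None for v in values)
    share = missing / len(values)
    if share > max_missing_share:
        raise ValueError(f"{share:.0%} missing, above the {max_missing_share:.0%} threshold")
    observed_mean = mean(v for v in values if v is not None)
    return [observed_mean if v is None else v for v in values]


# Hypothetical variable with 1 missing value out of 10 (10% missing).
incomes = [10.0, None, 30.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0]
filled = impute_mean(incomes)
print(filled)
```

The mean is only one choice of replacement statistic; the median or the most frequent value would follow the same pattern.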

Info extraction

search for specific info in docs without any comparison themes

interval

Data broken up into small groups. The differences between the groups matter, e.g. the difference between 100 and 90 is the same as the difference between 90 and 80.

Categorical or nominal

Data broken up into categories that are not necessarily related to each other. Example: What color is the car?

bivariate

Data for two variables (usually two types of related data). Example: Ice cream sales versus the temperature on that day. The two variables are Ice Cream Sales and Temperature.

Transforming Variables

Data normalization: transform with a math function. Data discretization: convert continuous data into a small number of finite values.
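
Both kinds of transformation can be sketched briefly. This example assumes min-max scaling as the math function and hand-picked bin edges; neither choice comes from the source, and the helper names are hypothetical:

```python
from bisect import bisect_right


def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; nothing to rescale")
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, edges, labels):
    """Map each continuous value to a class label.

    edges must be sorted; a value equal to an edge falls in the upper class.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return [labels[bisect_right(edges, v)] for v in values]


# Hypothetical ages.
ages = [20, 35, 50, 65, 80]
scaled = min_max_normalize(ages)
classes = discretize(ages, edges=[40, 60], labels=["low", "medium", "high"])
print(scaled)
print(classes)
```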

Example of tests of normality

Shapiro-Wilk (P-P plot), Kolmogorov-Smirnov, Anderson-Darling

discrete

Discrete data are those whose values belong to a finite or infinite subset of the set N of natural integers (e.g. number of children, number of products bought).

What is Test of normality doing?

It tests whether the data fit a normal distribution. The assumption of normality means that you should make sure your data roughly fit a bell curve shape before running certain statistical tests or regressions.

Linguistic analysis

Language ID, ID of grammar categories, disambiguation, recognition of compound words

qualitative

Qualitative data are not quantities, but they may be ordered; in this case we speak of ordinal qualitative data (e.g. 'low, medium, high'). Non-ordered qualitative data are called nominal. Ordinal data can be classed in the family of discrete data and treated in the same way. Subtypes: ordinal, nominal.

quantitative

Quantitative (or numerical) data may be continuous or discrete. What distinguishes continuous and discrete data from other types is that they are concerned with quantities, so we can perform arithmetical operations on them; moreover, they are ordered. Subtypes: continuous, discrete.

attitudinal data

attitude toward product, reasons for buying, attractiveness of competition

bin tips

Avoid large differences in the number of classes between vars. Avoid too many classes for a variable. Avoid classes that are too small. About 4 or 5 classes is good.

psychographic data

lifestyle, personality, risk-aversion

LTV

lifetime value

non-monotonic

An example of a non-monotonic response is attendance at health spas, which will be lower for young and retired persons and higher for active persons.

R

not good for large data sets; cost effective, open source, lots of documentation, more difficult to learn, freq updates, similar to S-PLUS

geodemographic data

not relating to the individual: mode of consumption, place of residence in economic terms, geocode

Sociodemographic data

personal, family, occupational, wealth, geographical, environmental, and geodemographic

systematic sampling

the individuals are drawn not at random, but in a regular way. If we carry out a 'one in a hundred' sampling, we take the first individual, then the 101st, then the 201st, and so on.
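
The "one in a hundred" rule maps directly onto a slice with a step. A minimal sketch, assuming a hypothetical numbered population of 1,000 individuals:

```python
def systematic_sample(population, step, start=0):
    """Take the individual at position `start`, then every `step`-th one after it."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return population[start::step]


# Individuals numbered 1..1000; a step of 100 picks the 1st, 101st, 201st, ...
individuals = list(range(1, 1001))
picked = systematic_sample(individuals, step=100)
print(picked)
```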

non-proportional stratified sampling

to take into account the variability of the phenomenon in each stratum

Levene's

to test if n samples have equal variances

test Sample

to validate the model

commercial sector data types

transactional, product, customer, geodemographic, technical

log transformation

used to reduce right skewness-not for zero or negative values

Square root trans

weaker than cube root and log. reduce right skew. can apply to zero vals

cube root trans

weaker than log. reduce right skew. can be zero or negative
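
The domain restrictions on the log, square root, and cube root cards can be made explicit in code. A minimal sketch with hypothetical function names and made-up right-skewed values:

```python
import math


def log_transform(x: float) -> float:
    """Natural log; only defined for strictly positive values."""
    if x <= 0:
        raise ValueError("log transformation needs x > 0")
    return math.log(x)


def square_root_transform(x: float) -> float:
    """Square root; accepts zero but not negative values."""
    if x < 0:
        raise ValueError("square root transformation needs x >= 0")
    return math.sqrt(x)


def cube_root_transform(x: float) -> float:
    """Cube root; keeps the sign, so zero and negative values are allowed."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


# Hypothetical right-skewed values, e.g. purchase amounts.
amounts = [1.0, 4.0, 9.0, 100.0, 10000.0]
logged = [log_transform(v) for v in amounts]
rooted = [square_root_transform(v) for v in amounts]
cubed = [cube_root_transform(v) for v in amounts]
print(logged[-1], cubed[-1], rooted[-1])
```

On these values the largest observation is pulled in most by the log, less by the cube root, and least by the square root, which is the ordering of strength the cards describe.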

transactional data

where, when, how, how much, what

predictive

Predictive (or explanatory) techniques are designed to extrapolate new information based on the present information, this new information being qualitative (in the form of classification or scoring) or quantitative (regression). Has a dependent var. Extrapolates new info.

monotonic

always increasing or always decreasing

Data Mining software

Icons can be moved and linked with arrows, IBM SPSS

contextual data

Info abt author

Regression assumptions

1) Linearity 2) independence of errors 3) homoscedasticity 4) normality of error distribution

phases of data Mining

1) define aims 2) list existing data 3) collect data 4) explore and prepare data 5) segment population 6) draw up and validate predictive models 7) deploy models 8) train model users 9) monitor models 10) enrich models

data prep steps

1) examine dist of vars 2) detect rare, missing, aberrant, and/or extreme vals 3) test normality 4) detect most discriminating vars 5) transform vars 6) choose ranges of binned vars 7) create new vars 8) detect interactions 9) automatic var selection 10) detect collinearity 11) sampling

Cramer's V or chi squared

2 discrete vars
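
A minimal sketch of how chi-squared and Cramér's V are computed for two discrete variables from a contingency table. The tables below are made up, the function names are hypothetical, and the code assumes every row and column total is non-zero:

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic of a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramers_v(table):
    """Cramér's V: chi-squared rescaled to lie between 0 and 1."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi_squared(table) / (n * (k - 1)))


# Hypothetical 2x2 tables: perfect association and no association.
perfect = [[10, 0], [0, 10]]
independent = [[5, 5], [5, 5]]
print(cramers_v(perfect), cramers_v(independent))
```

Because V is normalized by the sample size and table dimensions, it can be compared across tables of different sizes, which the raw chi-squared statistic cannot.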

Nonparametric Wilcoxon-Mann-Whitney

2 groups

Median test

2 or more samples, non-normal and hetero

Student's t-test

2 samples, normality and homoscedasticity

Jonckheere-Terpstra

3 or more samples, non-normal and hetero

Kruskal-Wallis

3 or more samples, non-normal and hetero

Welch-ANOVA

3 or more samples, normal and hetero
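
The test-selection cards in this set can be collected into one lookup. This is a sketch of the cards only, not a complete decision procedure; the function name is hypothetical, and treating non-normal data with equal variances the same as the non-normal heteroscedastic case is an assumption, since the cards do not cover that case:

```python
def suggest_tests(n_samples: int, normal: bool, equal_variances: bool) -> list:
    """Return the tests the flashcards associate with a given situation."""
    if n_samples < 2:
        raise ValueError("need at least two samples to compare")
    if not normal:
        # Assumption: non-normal data goes to the rank-based tests
        # whatever the variances look like.
        if n_samples == 2:
            return ["Wilcoxon-Mann-Whitney", "Median test"]
        return ["Kruskal-Wallis", "Jonckheere-Terpstra", "Median test"]
    if n_samples == 2:
        return ["Student's t-test"] if equal_variances else ["Welch's t-test"]
    return ["ANOVA"] if equal_variances else ["Welch-ANOVA"]


print(suggest_tests(2, normal=True, equal_variances=False))
print(suggest_tests(4, normal=False, equal_variances=False))
```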

Nonparametric Kruskal-Wallis

> 2 groups

continuous

Continuous (or scale or interval) data are those whose values belong to an infinite subset of the set R of real numbers (e.g. wages, amount of purchases). ie. height, weight

cluster sampling

Divide the population into clusters, randomly select some of the clusters, then take all (or a sample of) the individuals in the selected clusters

Stepwise selection

Forward: no vars in model at outset, then add one by one. Backward: all vars in, then remove one by one. Combined: forward and backward steps alternate until no vars can be removed or added.

Descriptive Data Mining

In descriptive methods, for reducing, summarizing and grouping data, there is no dependent variable, i.e. no privileged variable.

predictive data Mining

In predictive methods, which explain data, there is a dependent variable, in other words a variable to be explained, or a privileged variable.

What does the Shapiro-Wilk (p-p plot) do?

In the Shapiro-Wilk test, the cumulative distribution of the data is shown on a normal probability scale, called a P-P (probability-probability) plot, where a normal distribution is shown by a straight line with a slope of 1 (Figure 3.10). Thus the Shapiro-Wilk statistic is a way of measuring how far the graphic representation of the data deviates from the straight line.

square trans

Moderate effect on dist shape. Could reduce left skew. Main reason is to fit a quadratic equation. Only if var is zero or positive.

Parametric Anova

One discrete and one cont Var

Ordinal

Order matters, but the differences between values do not. Examples: a scale of 1 to 10, or temperature = "high", "medium", "low".

What does the Kolmogorov-Smirnov test do?

The Kolmogorov-Smirnov test involves measuring the maximum deviation D (in absolute terms) between the distribution function (cumulative distribution function) of the variable tested and the distribution function of a Gaussian variable (or, more generally, of any continuous variable whose distribution is to be compared with that of the observed variable).
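
The maximum deviation D can be computed by hand for a small sample. A minimal sketch, assuming the Gaussian mean and standard deviation are given rather than estimated; the function names and the sample are hypothetical, and the code returns only D, not a p-value:

```python
import math


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Distribution function of a Gaussian variable."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(sample, mu: float, sigma: float) -> float:
    """Largest absolute gap between the empirical and the Gaussian distribution functions."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        # The empirical function jumps from i/n to (i+1)/n at x,
        # so the gap is checked on both sides of the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Hypothetical sample compared with a standard normal distribution.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
print(ks_statistic(sample, mu=0.0, sigma=1.0))
```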

Analytical CRM

The aim of analytical CRM is to extract, store, analyse and output the relevant information to provide a comprehensive, integrated view of the customer in the business, in order to understand his profile and needs more fully.

text

Uncoded texts, written in natural language ie. letters, reports

examine dist of vars

Univariate Bivariate

Univariate

Univariate means "one variable" (one type of data). Example: You weigh the pups and get these results: 2.5, 3.5, 3.3, 3.1, 2.6, 3.6, 2.4. The "one variable" is Puppy Weight.

lifetimes data

age, length of time as cust, at current addr, at current job, time since last purchase or claim

ratio

all properties of interval; also has a clear definition of 0.0. When the value is 0.0 there is none of the var, e.g. height, weight

univariate visuals

bar chart or freq table for qual or discrete vars; box plot or histogram for continuous vars

SAS

better for large data sets; most expensive, good interface, tech support, fairly easy to learn

Bartlett's

better than Levene's if have strong evidence data come from normal or nearly normal dist

operational CRM

concerned with managing the various channels (sales force, call centres, voice servers, interactive terminals, mobile telephones, Internet, etc.) and marketing campaigns for the best implementation of the strategies identified by the analytical CRM.

Collinearity

correlation among predictors in multiple regression, e.g. if height and weight are both used to predict something, the results will be skewed because height and weight are related
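
One simple way to spot collinearity between two predictors is their Pearson correlation. A minimal sketch with made-up height and weight values; the function name is hypothetical, and a pairwise correlation is only one possible check:

```python
import math


def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical predictors: height in cm and weight in kg.
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 69, 80, 92]
r = pearson_r(heights, weights)
print(round(r, 3))
```

A value of r close to 1 or -1 signals that the two predictors carry nearly the same information.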

rare values

create bias by appearing more important than they really are. better to remove or replace with more freq value

paratextual data

date and purpose of doc

descriptive

designed to bring out information that is present but buried in a mass of data (as in the case of automatic clustering of individuals and searches for associations between products or medicines). No dependent var. Finds associations. Brings out info already present.

IT dev data Mining features

dev can't be done without data; dev and test happen in the same env

profitability data

diff btwn profits and cost, LTV

stratified sampling

divide population into strata (e.g. ranges), then randomly select from each stratum

Info retrieval

docs in their totality

Aberrant values

erroneous value

Zipf's law

a few tens of words are enough to represent a large part of any corpus
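
The claim can be checked on any corpus by measuring how much of the text the most frequent words cover. A minimal sketch, assuming whitespace tokenization and a toy sentence in place of a real corpus; the function name is hypothetical:

```python
from collections import Counter


def coverage_of_top_words(text: str, top_n: int) -> float:
    """Share of all word occurrences accounted for by the top_n most frequent words."""
    words = text.lower().split()
    counts = Counter(words)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(words)


# Toy corpus: 8 word occurrences, 5 distinct words.
corpus = "the cat and the dog and the bird"
share = coverage_of_top_words(corpus, top_n=2)
print(share)
```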

training sample

for developing the model

lemmatization

reducing a word to the form which is in the dictionary

grouping variants

graphic, syntactic, semantic, synonyms, parasynonyms, full forms of abbr., expressions and metaphors

grouping analogies

group terms in families of derivative terms, e.g. credit, loan, undertaking, debt

homoscedasticity

Homoscedasticity means "having the same scatter." For it to exist in a set of data, the points must be about the same distance from the line. Tests for it: Levene's, Bartlett's, Fisher.

proportional stratified sampling

e.g. if 30% of cust in the pop are over 60, then 30% of the stratified sample must be over 60
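
The 30% example can be sketched by giving each stratum a share of the sample equal to its share of the population. The strata names, sizes, and helper name are hypothetical, and rounding each allocation is an assumption that can make the total differ slightly from the requested size:

```python
import random


def proportional_stratified_sample(strata, sample_size, seed=None):
    """Draw from each stratum in proportion to its size in the population."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(sample_size * len(members) / total)  # this stratum's allocation
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical population of 1000 customers, 30% of them over 60.
strata = {
    "over_60": list(range(0, 300)),
    "60_or_under": list(range(300, 1000)),
}
drawn = proportional_stratified_sample(strata, sample_size=100, seed=1)
print({name: len(members) for name, members in drawn.items()})
```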

open requests

in form of keywords or free text - changes slowly

predictive qualitative

in the form of classification or scoring

relational data

pref for contact, delivery, calls to cust svc, complaints

Statistical software

programming windows or scrolling menus, SAS, S-PLUS

RFM

recency, frequency, monetary value
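
The three RFM components can be derived from a customer's purchase history. A minimal sketch, assuming each purchase is a (date, amount) pair and that recency is measured in days since the last purchase; the names and data are hypothetical:

```python
from datetime import date


def rfm(purchases, today):
    """Compute recency, frequency, and monetary value for one customer."""
    last_purchase = max(day for day, _ in purchases)
    return {
        "recency_days": (today - last_purchase).days,  # time since last purchase
        "frequency": len(purchases),                   # number of purchases
        "monetary": sum(amount for _, amount in purchases),  # total spent
    }


# Hypothetical purchase history.
history = [(date(2024, 1, 10), 40.0), (date(2024, 3, 5), 60.0)]
scores = rfm(history, today=date(2024, 3, 15))
print(scores)
```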

Predictive quantitative

regression

customer data

relational, attitudinal, psychographic, lifetimes, channel, sociodemographic

