BUAN 6356 Exam 2
OLS regression
Y = alpha + betax + E, where Y takes values in {0,1}
difference between classification and clustering& association rules
classification: supervised learning, dependent variable is discrete/nominal, essence is to find rules to separate objects into predetermined classes, needs at least training and validation datasets to train classifiers -clustering and association rules: unsupervised learning, essence is that clusters/rules are not predetermined
fraud
classifying credit card purchases into fraudulent and legitimate ones
examples of classification
classifying customers as loyal vs likely to discontinue contract, classifying credit risk as high vs low
classification applications
direct marketing, customer profiling, fraud, prediction of stock price, medical diagnosis
lift chart
graphically represents the improvement a mining model provides compared against a random guess and measures the change in terms of a lift score -a model whose curve is closer to the top and left is a better model
supervised learning
has a dependent variable, e.g. class A or B?
unsupervised learning
has no dependent variable, there is no predetermined class
testing data
helps estimate the performance of the chosen classifier
validation data
helps tune classifiers, e.g. pruning a decision tree, or help choose among different classifiers
overfitting problem
many possible splitting rules can near perfectly classify the training data but may not generalize to future datasets, particularly a problem with small datasets -too many branches end up fitting noise in the training data set -effect: great accuracy on training data, poor accuracy on unseen samples -error rate on validation data (e.g. average squared error) increases as the number of splits (complexity) increases
predictive task
mapping instances onto a predefined set of classes - classification
normalization
normalize to remove scale effect - divide by stdev -usually performed before PCA
new data
only input information is available
logistic regression model
p = 1/(1+e^-(alpha+betax+E)) -p = probability that event Y occurs, i.e. p(Y=1) -constrains estimated probabilities between 0 and 1 -if (alpha+betax)=0, p=.5 -as (alpha+betax) increases, p approaches 1 -as (alpha+betax) decreases, p approaches 0 -this expression is equivalent: ln(p/(1-p)) = alpha+betax+E -p/(1-p) = the odds -ln(p/(1-p)) = the natural logarithm of the odds, the log odds or logit -this model overcomes problems of linear regression: it selects regression coefficients to force predicted values for Y to be between 0 and 1; produces s-shaped regression predictions rather than a straight line; selects coefficients through maximum likelihood estimation
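a minimal sketch of the logistic transform in Python, assuming numpy and hypothetical values for alpha and beta:

    import numpy as np

    alpha, beta = -2.0, 0.5          # hypothetical coefficients
    x = np.array([0.0, 4.0, 10.0])   # hypothetical predictor values

    logit = alpha + beta * x         # alpha + betax, the log odds
    p = 1 / (1 + np.exp(-logit))     # logistic transform keeps p between 0 and 1
    print(p)                         # p = .5 exactly where alpha + betax = 0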
customer profiling
predicting whether a customer will respond positively to a product
direct marketing
predicting whether or not a customer will respond to mailing
neural network
simplified model of the human brain -a function approximator: transforms inputs to outputs
Mean Squared Error (MSE)
average of (actual-predicted)^2 over all cases
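a quick sketch with made-up numbers, assuming numpy (MSE is the average of the squared errors):

    import numpy as np

    actual    = np.array([10.0, 12.0, 9.0, 15.0])   # hypothetical observed values
    predicted = np.array([11.0, 11.5, 8.0, 16.0])   # hypothetical model predictions
    mse = np.mean((actual - predicted) ** 2)        # average of (actual-predicted)^2
    print(mse)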
if the target is categorical
use accuracy rate, misclassification rate, etc. -categorical: a prediction is either correct or not
if target is numerical
use mean squared error -numerical: a prediction can be close rather than exact, e.g. sales
scoring
using new dataset where the value of the dependent variable is unknown
backward propagation of errors
weights are adjusted by observing errors in output and propagating adjustments back through the network
odds ratio
(odds(X=x+1)/odds(X=x))
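since the log odds are linear in x, a one-unit increase in x multiplies the odds by e^beta; a toy check with hypothetical coefficients:

    import numpy as np

    alpha, beta = -2.0, 0.5                 # hypothetical coefficients
    odds = lambda x: np.exp(alpha + beta * x)

    x = 3.0
    print(odds(x + 1) / odds(x))            # odds ratio
    print(np.exp(beta))                     # same value: e^beta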
advantages of neural networks
-can handle most types of response - discrete/continuous -does not require prior knowledge of relationships between inputs and outputs -nonlinear relationships and interactions taken care of inherently -robust, works well even when training sample contains errors -prediction accuracy is comparable with other statistical methods
item-user matrix
-cells are user preferences for items -preferences can be ratings or binary
disadvantages of neural networks
-complex setup process -potentially long training time -black box, no insight into relationships between predictors and outcome -non-intuitive model - difficult to interpret results -no statistical test to assess importance of weights -may lead to overfitting
evaluating model performance
-commonly used statistics to compare alternative models (or to evaluate the performance of a single model) -compare the model that includes only alpha versus the model that includes alpha, beta -model chi square, percent correct predictions, pseudo R2
collaborative filtering
-community of users -opinions on items obtained implicitly or explicitly -opinions of others are used to predict a user's unknown opinion -key idea: filter info based on user preference -user based collaborative filtering: find users who share the same rating patterns with the user of interest, use ratings from these like minded users to estimate a prediction for the user of interest -item based collaborative filtering: build item-item matrix of relationships between pairs of items to find and recommend similar items
structure of neural network
-composed of many neurons that cooperate to perform the desired function -the output of a neuron is a function of the weighted sum of its inputs plus a bias: output = f(w1i1+...+wnin + bias), where f is the activation function
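a minimal sketch of one neuron's output, with hypothetical weights/inputs and a sigmoid assumed as the activation function:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))          # squashes the weighted sum into (0, 1)

    weights = np.array([0.4, -0.7, 0.2])     # hypothetical weights w1..wn
    inputs  = np.array([1.0, 2.0, 0.5])      # hypothetical inputs i1..in
    bias    = 0.1
    output = sigmoid(np.dot(weights, inputs) + bias)   # f(w1i1 + ... + wnin + bias)
    print(output)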
growing the tree using chi square & logworth
-compute the logworth of the best split for each interval-valued input variable -then select the variable that has the highest logworth and use its split, suppose it is balance -under each of the two balance nodes, we then find the logworth of the best split of age and continue the process -stopping criteria for splitting: threshold on the significance of the chi square value for splitting, no more improvement (lowest misclassification rate, the proportion of mismatch between prediction and observation)
chi square statistic
-computes a measure of how different the number of observations in each of the four cells is compared to the expected numbers; the p-value associated with the null hypothesis is computed -logworth of the p-value: logworth = -log10(pvalue) -low p-value -> high logworth -the split that generates the highest logworth for a given input variable is selected
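a sketch of the chi square test and logworth on a made-up 2x2 contingency table, assuming scipy:

    import numpy as np
    from scipy.stats import chi2_contingency

    # hypothetical 2x2 table: rows = the two split groups, columns = default / non-default
    table = np.array([[30, 70],
                      [10, 90]])

    chi2, pvalue, dof, expected = chi2_contingency(table)
    logworth = -np.log10(pvalue)             # low p-value -> high logworth
    print(chi2, pvalue, logworth)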
decision tree strengths
-easy to understand and interpret -easy to implement -running time is low -very popular
entropy
-entropy(A) = -sum(q_k log2(q_k)) -q_k = proportion of cases in set A that belong to class k (of m classes) -ranges between 0 (most pure) and log2(m) (equal representation of all classes)
information gain
-expected reduction in impurity (entropy, gini index) -suppose node N gets partitioned into M child nodes {c1...cM} given attribute A -information gain = entropy reduction = entropy of N - sum of weighted entropies of c1...cM -gain(A) = entropy of N - sum(weighted entropy of ci) -select the attribute with the highest information gain
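a sketch of entropy and information gain for a binary class using the definitions above (made-up counts):

    import numpy as np

    def entropy(counts):
        q = np.array(counts, dtype=float)
        q = q[q > 0] / q.sum()                  # class proportions (drop empty classes)
        return -np.sum(q * np.log2(q))          # -sum(q log2(q))

    parent   = [10, 10]                         # hypothetical class counts at node N
    children = [[8, 2], [2, 8]]                 # counts in child nodes c1, c2 after splitting on A

    n = sum(parent)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    print(entropy(parent) - weighted)           # gain(A) = expected reduction in entropy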
input variable selection
-for each split of an input variable a chi square statistic is calculated -a contingency table is formed that maps default and non-default against the partitioned input variable -for example, the null hypothesis might be that there is no difference between people with income <50k and >50k; if the null hypothesis is rejected, splitting leads to a purer subgroup -the lower the p-value, the more likely we are to reject the null hypothesis, meaning the income split is a discriminating factor
pearson correlation
-for each user pair, find co-rated items and calculate the correlation between the vectors of their ratings for those items -note that the average rating for each user is computed across all products they rated, not just the co-rated ones -correlation = sum((r1-r1bar)*(r2-r2bar)) / (sqrt(sum((r1-r1bar)^2)) * sqrt(sum((r2-r2bar)^2)))
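a sketch of the user-user Pearson correlation over co-rated items, with each user's mean taken over all items they rated (hypothetical ratings; NaN marks unrated items):

    import numpy as np

    r1 = np.array([4.0, np.nan, 3.0, 5.0, 2.0])   # hypothetical ratings of user 1 (NaN = not rated)
    r2 = np.array([5.0, 3.0, 2.0, np.nan, 1.0])   # hypothetical ratings of user 2

    r1bar, r2bar = np.nanmean(r1), np.nanmean(r2) # means over all rated items, not just co-rated
    co = ~np.isnan(r1) & ~np.isnan(r2)            # items rated by both users
    num = np.sum((r1[co] - r1bar) * (r2[co] - r2bar))
    den = np.sqrt(np.sum((r1[co] - r1bar) ** 2)) * np.sqrt(np.sum((r2[co] - r2bar) ** 2))
    print(num / den)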
gini index
-gini index for set A containing m classes of the response variable: I(A) = 1 - sum(q_k^2) -q_k = proportion of cases in set A that belong to class k -maximum value when all classes are equally represented, .5 in the binary case
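a quick check of the formula with made-up counts (.5 is the maximum in the binary case):

    import numpy as np

    counts = np.array([10, 10], dtype=float)   # hypothetical class counts in set A
    q = counts / counts.sum()                  # proportion of cases in each class
    print(1 - np.sum(q ** 2))                  # I(A) = 1 - sum(q^2) -> 0.5 here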
explanatory modeling
-goal: explain relationship between predictors and target
Principal Component Analysis
-goal: reduce a set of numerical variables -idea: remove overlap of information between these variables; information is measured by the sum of the variances of the variables -final product: a smaller number of numerical variables that contain most of the information -how?: create new variables that are linear combinations of the original variables, i.e. they are weighted averages of the original variables -these new variables are uncorrelated (no information overlap) and only a few contain most of the original information -the new variables are called principal components
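a minimal sketch on made-up data, assuming scikit-learn; variables are standardized first to remove scale effects:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))                     # hypothetical data: 100 rows, 4 numeric variables
    X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)    # make two variables share information

    Xs = StandardScaler().fit_transform(X)            # divide by stdev to remove scale effect
    pca = PCA()
    scores = pca.fit_transform(Xs)                    # principal components: uncorrelated linear combinations
    print(pca.explained_variance_ratio_)              # a few components carry most of the variance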
growing decision tree
-growing tree involves successively partitioning the data: recursively partitioning -if an input variable is binary, then the two categories can be used to split the data -if an input variable is interval, a splitting value is used to classify the data into two segments
recommender system
-guide people to interesting materials based on information from other similar people or information from other similar items
lift
-improvement obtained via modeling -confidence/support of consequent
classification terminology
-inputs = predictors = independent variables -outputs = responses = dependent variables: categorical outputs use classification techniques, numeric outputs use prediction/regression techniques -models = classifiers -with classification, we want to use a model to predict what output will be obtained from given inputs
common mistakes with predictive analysis
-learn things that aren't true, patterns may not represent any underlying rule -self selection bias: preaching to wrong population? -data may be at the wrong level of detail -learn things that are true, but not useful - most rules learned are already known -knowledge that cannot be used -data integrity
cosine similarity
-like the correlation coefficient, except do not subtract the means -cold start problem: for users with just one item, or items with just one user, neither cosine similarity nor correlation produces a useful metric
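a sketch of cosine similarity between two users' rating vectors on co-rated items (hypothetical ratings; note the means are not subtracted):

    import numpy as np

    r1 = np.array([4.0, 3.0, 5.0])             # hypothetical ratings on co-rated items
    r2 = np.array([5.0, 2.0, 4.0])
    print(np.dot(r1, r2) / (np.linalg.norm(r1) * np.linalg.norm(r2)))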
structural breaks
-logical breaks in a dataset, e.g. pooled vs split -chi squared value > critical value -> statistically different -or just add a dummy variable
three steps of classification
-model building: use training dataset -model validation: use validation dataset; controlling overfitting is a major purpose -model application, i.e. scoring
model chi squared
-model log likelihood ratio is LR = -2ln(L(alpha)/L(alpha, beta)) = -2[ln(L(alpha)) - ln(L(alpha, beta))] -the LR statistic is distributed chi square with i degrees of freedom, where i is the number of independent variables -use the model chi squared statistic to determine if the model (alpha, beta) is statistically significant
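a sketch of the LR test from two fitted log likelihoods (hypothetical values; scipy assumed for the chi square p-value):

    from scipy.stats import chi2

    ll_null = -250.0          # hypothetical ln L(alpha): intercept-only model
    ll_full = -230.0          # hypothetical ln L(alpha, beta): model with i predictors
    i = 3                     # number of independent variables

    LR = -2 * (ll_null - ll_full)         # -2[ln L(alpha) - ln L(alpha, beta)]
    print(LR, chi2.sf(LR, df=i))          # p-value from chi square with i degrees of freedom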
training
-modifying weights to better approximate desired function -based on training data -supervised training: supplies network with inputs and desired outputs, response of network to inputs is measured, weights modified to reduce difference between actual and desired outputs -unsupervised training: only supplies inputs, network adjusts its own weights so that similar inputs cause similar outputs, network identifies patterns and differences in inputs without external guidance
multilayer feedforward networks
-most common neural network -extension of the perceptron -multiple layers: hidden layers between input and output layers -activation function is not simply a threshold, usually a sigmoid function -more general function approximator, not limited to linear problems -information flows in one direction, the outputs of one layer act as inputs to the next layer
comparing measures
-no single measure is better than the others; factors to consider: speed and scalability, robustness, interpretability
omitted variable bias
-omitted variables can result in bias in the coefficient estimates -conduct LR test -chi squared value < critical value -> coefficients not statistically significant
epoch
-one iteration through the process of providing network with input and updating network's weights -typically many epochs required to train network
receiver operating characteristic (ROC) curve
-plots performance of a classifier in terms of its true positive rate (vertical axis) and false positive rate (horizontal axis) -true positive rate = #true positives/(#true positives + #false negatives) = sensitivity -false positive rate = #false positives/(#false positives + #true negatives) = 1-specificity -a model with a curve closer to the top and left is a good model
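a sketch of the two rates from hypothetical confusion-matrix counts at a single threshold (the ROC curve traces these rates across all thresholds):

    tp, fn = 40, 10           # hypothetical true positives, false negatives
    fp, tn = 15, 85           # hypothetical false positives, true negatives

    tpr = tp / (tp + fn)      # sensitivity (vertical axis)
    fpr = fp / (fp + tn)      # 1 - specificity (horizontal axis)
    print(tpr, fpr)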
typical recommendation process
-prediction -ranking -recommendation
avoid overfitting
-prepruning: halt tree construction early, do not split a node if this would result in the purity measure falling below a threshold -postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees; use a different dataset to select the best pruned tree (i.e. cross validation)
confusion matrix
-records source of error: false positives and negatives -can be used for model comparison
decision tree
-series of nested tests -each node (root and interior) represents a test on an input variable, either a nominal attribute or a numeric attribute -each leaf is a class assignment and provides a distribution over all possible classes; use the decision tree to classify new instances -number of levels of the tree -> complexity
entropy: case of two classes
-set S with p elements of class P and n elements of class N -entropy of set S is: E(S) = -(p/(p+n))log2(p/(p+n)) - (n/(p+n))log2(n/(p+n))
reducing categories
-single categorical variable with m categories is typically transformed into m or m-1 dummy variables -each dummy variable takes values of 0 or 1 -problem: can end up with too many variables -solution: reduce by combining categories that are close to each other -use pivot tables to assess outcome variable sensitivity
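a sketch of creating m-1 dummy variables with pandas on a hypothetical categorical column:

    import pandas as pd

    df = pd.DataFrame({"region": ["north", "south", "east", "south", "west"]})  # hypothetical data
    dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)    # m-1 dummy variables
    print(dummies)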
constructing lift chart
-sort cases in the test set based on their predicted probability of belonging to the class of interest, in descending order: the ranking of a case is more important than the predicted probability -divide the sorted cases into deciles, each interval representing 1/10 of the entire sample -for each decile, calculate the success rate -for each decile, calculate the lift factor = success rate of the decile / success rate of the entire test set
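a sketch of the decile lift calculation with pandas on hypothetical predicted probabilities and outcomes:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    df = pd.DataFrame({"prob": rng.uniform(size=1000)})                  # hypothetical predicted probabilities
    df["actual"] = (rng.uniform(size=1000) < df["prob"]).astype(int)     # hypothetical outcomes

    df = df.sort_values("prob", ascending=False).reset_index(drop=True)  # rank cases by predicted probability
    df["decile"] = df.index // 100 + 1                                   # 10 deciles of 100 cases each

    overall = df["actual"].mean()                                        # success rate of entire test set
    print(df.groupby("decile")["actual"].mean() / overall)               # lift factor per decile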
User-based collaborative filtering
-start with a single user who will be the target of recommendations -find other users who are most similar based on comparing preference vectors
maximum likelihood estimation
-statistical method for estimating the coefficients of the model -the likelihood function measures the probability of observing the particular set of dependent variable values (p1...pn) that occur in the sample -the higher the value of L, the higher the probability of observing (p1...pn) in the sample
interpreting coefficients
-the odds is intuitive -since ln(p/(1-p)) = alpha + betax + E -> p/(1-p) = e^(alpha+betax+E) -e^beta is the effect of the independent variable on the odds, i.e. the odds ratio
decision tree construction
-trees are constructed in a top-down recursive partitioning manner -all training examples are at the root initially -attributes are assumed to be categorical, discretized if necessary -examples are partitioned recursively based on selected input variables -basic algorithms are greedy -popular methods: ID3, C4.5, CART; similar ideas but they differ in how the tree is grown, splitting criteria, pruning methods, and termination criteria -basic idea: because trees are constructed by recursively partitioning instances, the training examples are split into purer and purer subgroups; group A is purer than group B if the members of A are more similar to each other than the members of B
motivation for logistic regression
-using regression models for classification -however the dependent variable is sometimes limited, e.g. voting, morbidity/mortality -logistic regression can be a good choice when the dependent variable is a dummy variable -dummy variable: 0/1 variable that is a binary choice
activation function
-usually a sigmoid function: f(kx) = 1/(1+e^-kx), where k = steepness/scaling factor
decision tree weaknesses
-volatile: small changes in the underlying data result in very different models -cannot capture interactions between variables -can result in large error -volatility is reduced through bootstrap aggregation
how are weights obtained?
-weights are critical to usefulness, i.e. to determining the appropriate function -obtained through training
attribute selection
-which attribute should be used for a split: choose the attribute that best partitions the relevant population into purer groups at each decision node -other measures of impurity: entropy and gini index -then use information gain to decide which attribute to use: how informative is the attribute in distinguishing among instances from different classes -ideally also try to minimize the number of splits/nodes to make the tree more compact and accurate
