CSCE 587 Midterm Review
Depending on the nature of the variable, you may need to include an __________ component on one branch.
"equal to"
If we define L as an itemset {shoes, purses} and we define our "support" threshold as 50%: if 50% of the transactions contain this itemset, then we say L is a _____
"frequent itemset".
Reasons to choose logistic regression: preserves the summary statistics of the training data
"the probabilities equal the counts"
Logistic regression reason to choose: It works well with _________ variables and correlated variables. In this case the prediction is not impacted but we lose some explanatory value with the fitted model.
(robust) redundant
Ex. of confidence interval: if you are estimating the mean value (mu) of a Gaussian distribution with std. dev sigma, and your estimate after n samples is X, then mu falls within
X +/- 2*sigma/sqrt(n), with about 95% probability
The ARMA Model is a combination of what two process models?
-Autoregressive: Yt is a linear combination of its last p values
-Moving average: Yt is a constant value plus the effects of a dampened white noise process over the last q time steps
LDA Method
-Construct discriminant equations to separate classes
-Maximize the ratio of between-class SS to within-class SS
-# of discriminants is min(#classes - 1, #features)
-D1 separates class 1 from classes 2, 3, ..., n; D2 separates class 2 from classes 3, ..., n; etc.
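A minimal sketch of this (toy two-class data; assumes scikit-learn is available). With 2 classes and 2 features, min(#classes - 1, #features) = 1 discriminant:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# two toy classes, shifted apart in feature space
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 3])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
print(lda.scalings_)         # the discriminant direction
print(lda.transform(X[:3]))  # projection of points onto the discriminant
```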
Seasonal Adjustment: Often, we know the "season"
-For both retail sales and CO2 concentration, we can model the period as being a year, with variation at the month level
Reason for Naive Bayes: computationally efficient
-Handles very high dimensional problems -Handles categorical variables with a lot of levels
•To construct tree T from training set S
-If all examples in S belong to some class in C, or S is sufficiently "pure", then make a leaf labeled C.
-Otherwise:
•select the "most informative" attribute A
•partition S according to A's values
•recursively construct sub-trees T1, T2, ..., for the subsets of S
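A hedged sketch of this procedure (assumes scikit-learn; its DecisionTreeClassifier implements an optimized CART, one of the algorithms named later in this review; the toy data and feature names here are hypothetical, borrowing the savings/housing attributes from the credit example below):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# toy training set S: two features, binary class C
X = [[0, 1], [0, 0], [1, 1], [1, 0]]
y = ["good", "bad", "good", "good"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["savings", "housing"]))
```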
stationary sequences
-Mean, variance and autocorrelation structure do not change over time
-In practice, this often means you must de-trend and seasonally adjust the data
-ARIMA in principle can make the data (more) stationary with differencing
Cautions for time series analysis
-No meaningful drivers: prediction is based only on past performance
•No explanatory value
•Can't do "what-if" scenarios
•Can't stress test
-It's an "art form" to select appropriate parameters
-Suitable for short-term predictions only
•In some cases, may have to fit a non-linear model
-Quadratic -Exponential
PCA Method
-Synthesis of orthogonal eigenvectors
-All PCs start at the origin of the coordinate axes
-The first PC is the direction of max variance
-Subsequent PCs are orthogonal to preceding PCs and describe max residual variance
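A minimal sketch (toy data with unequal variance per axis; assumes scikit-learn):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(100, 3) * [5.0, 2.0, 0.1]   # unequal variance by axis
pca = PCA(n_components=2).fit(X)
print(pca.components_)                 # orthogonal directions of max variance
print(pca.explained_variance_ratio_)   # variance explained by each PC
```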
Caution of linear regression: Assumes that each variable affects the outcome linearly and additively. Ways to alleviate this...
-Variable transformations and modeling variable interactions -A good idea to take the log of monetary amounts or any variable with a wide dynamic range
can alleviate the fact that Logistic regression assumes that each variable affects the log-odds of the outcome linearly and additively
-Variable transformations and modeling variable interactions can alleviate this -A good idea to take the log of monetary amounts or any variable with a wide dynamic range
The problem with explicitly modeling P(X1,...,Xn|Y) is that there are usually way too many parameters:
-We'll run out of space
-We'll run out of time
-And we'll need tons of training data (which is usually not available)
One sample t test
-compare sample to population
-Actually: sample mean vs. population mean
-Does the sample match the population?
Two sample t test
-compare two samples -sample distribution of the difference of the means
reasons to choose decision tree classifier
-computationally efficient to build
-easy to score data
-many algorithms can return a measure of variable importance
-in principle, decision rules are easy to understand
cautions for decision tree classifier
-decision surfaces can only be axis-aligned
-tree structure is sensitive to small changes in the training data
-a "deep" tree is probably overfit (because each split reduces the training data for subsequent splits)
reasons to choose time series analysis
-minimal data collection -designed to handle the inherent autocorrelation of lagged time series
cautions for decision tree classifier
-not good for outcomes that are dependent on many variables (related to the overfitting problem)
-doesn't naturally handle missing values
-in practice, decision rules can be fairly complex
•ARIMA adds a differencing term, d, to make the series more stationary
-rule of thumb:
•linear trend can be removed by d=1
•quadratic trend by d=2, and so on...
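A small numpy illustration of the rule of thumb (toy series): one difference removes a linear trend.

```python
import numpy as np

t = np.arange(20)
y = 3 * t + 5 + np.random.randn(20)   # linear trend plus noise
d1 = np.diff(y, n=1)                  # first difference: d=1
# d1 fluctuates around the slope (about 3) with no trend left in it
print(d1.mean())
```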
Assumptions for Hypothesis testing
-samples reflect an underlying distribution
-if two distributions are different, the samples should reflect this
-a difference should be testable
Reasons to choose decision tree classifier
-takes any input type (numerical, categorical); can handle ZIP code
-robust with redundant variables, correlated variables
-naturally handles variable interaction
-handles variables that have a non-linear effect on the outcome
Decision Tree Step 1: Pick the Most "Informative" Attribute
There are many ways to do it; we detail entropy-based methods here. Let p(c) be the probability of a given class. The entropy H = -sum_c p(c) log2 p(c) has the value 0 if some p(c) is 0 or 1, so for binary classification H = 0 means it is a "pure" node. H is maximum when all classes are equally probable: if the class probabilities are 50/50, then H = 1 (maximum entropy).
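A minimal Python sketch of this entropy calculation:

```python
import math

def entropy(class_probs):
    # H = -sum_c p(c) * log2 p(c); a term with p(c) = 0 contributes 0
    return -sum(p * math.log2(p) for p in class_probs if p > 0)

print(entropy([1.0, 0.0]))   # 0.0: a "pure" node
print(entropy([0.5, 0.5]))   # 1.0: maximum entropy for a binary class
```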
The outputs for linear regression are:
1) A set of coefficients that indicate the relative impact of each driver (and possibly how strongly the variables are correlated)
2) A linear expression predicting the outcome as a function of the drivers.
In K-means we define two measures of distance: between two data points (records), and between two clusters. Distance can be measured (calculated) in a number of ways, but four principles tend to hold true.
1. Distance is not negative (it is stated as an absolute value).
2. Distance from one record to itself is zero.
3. Distance from record i to record j is the same as the distance from record j to record i: since distance is stated as an absolute value, the start and end points can be reversed.
4. Distance between two records cannot be greater than the sum of the distances between each record and a third record (the triangle inequality).
# of parameters for modeling P(X1,...,Xn|Y):
2((2^n)-1)
# of parameters for modeling P(X1|Y),...,P(Xn|Y)
2n
For example, if we have the frequent itemset {shoes, purses, hats} and consider the subset {shoes, purses}: if 80% of the transactions that have {shoes, purses} also have {hats}, we define the Confidence for the rule {shoes, purses} implies {hats} as _____
80%
The selection of (p, d, q) appropriately is not very straightforward.
A complete understanding of the domain and very detailed analysis of trend and seasonality may be required.
In order to render a sequence stationary we need to remove the effects of trend and seasonality. The __________model (implemented with Box Jenkins) uses the method of differencing to render the data stationary.
ARIMA
AR model predicts yt as a linear combination of its last p values.
An autoregressive model is simply a linear regression of the current value of the series on one or more prior values of the same series. Several options are available for analyzing autoregressive models, including standard linear least squares techniques. They also have a straightforward interpretation.
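A hedged sketch of fitting an AR(p) model as an ordinary least-squares regression on the last p values (numpy only; toy random-walk series):

```python
import numpy as np

def fit_ar(y, p):
    # regress y_t on its previous p values with ordinary least squares
    lagged = np.array([y[t - p:t][::-1] for t in range(p, len(y))])
    X = np.column_stack([np.ones(len(lagged)), lagged])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef  # constant term followed by the p lag coefficients

y = np.cumsum(np.random.randn(200))   # toy series (a random walk)
print(fit_ar(y, p=2))
```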
Reasons to choose Apriori: it uses a clever observation to prune the search space this is known as
Apriori property
Though there are many other distance measures, the____________ _______________ is the most commonly used distance measure and many packages use this measure.
Euclidean distance
Caution of naive bayes: it is sensitive to correlated variables as the algorithm double counts the effect of the correlated variables.
For example people with low income tend to default and people with low credit tend to default. It is also true that people with low income tend to have low credit. If we try to score "default" with both low income and low credit as variables we will see the double counting effect in our model output and in the scoring.
Tree structure is sensitive to small variations in the training data.
If you have a large data set and you build a Decision Tree on one subset and another Decision Tree on a different subset, the resulting trees can be very different even though they come from the same data set. If you get a deep tree you are probably overfitting, as each split reduces the training data for subsequent splits.
Caution of K means clustering: K-means tends to produce rounded and equal sized clusters which are not always desirable because..
If you have clusters which are elongated or crescent shaped, K-means may not be able to find these clusters appropriately. The data in this case may have to be transformed before modeling.
Caution of K means clustering: It is important that the variables must be all measured on similar or compatible scales (Not scale-invariant!)
If you measure the living space of a house in square feet, the cost of the house in thousands of dollars (that is, 1 unit is $1000), and then you change the cost of the house to dollars (so one unit is $1), then the clusters may change.
WSS primarily is a measure of homogeneity.
In general, more clusters result in tighter clusters. But having too many clusters is overfitting.
A Stationary sequence is a random sequence in which the joint probability distribution does not vary over time.
In other words the mean, variance and auto correlations do not change in the sequence over time.
The general ARIMA (p, d, q) model gives a tremendous variety of patterns in the ACF and PACF, so it is not practical to state rules for identifying general ARIMA models.
In practice, it is seldom necessary to deal with values p, d, or q that are larger than 0, 1, or 2. It is remarkable that such a small range of values for p, d, or q can cover such a large range of practical forecasting situations.
Decision Tree - Example of Visual Structure
In the example here the outcomes are binary, although there could be more than 2 branches stemming from an internal node. For example, if the variable was categorical and had 3 choices, you might need a branch for each choice.
is defined as the difference between the base entropy and the conditional entropy of the attribute: InfoGain(A) = H(S) - H(S | A)
Information Gain
Give the equation for Lift(X -> Y) and explain what it means
Lift(X -> Y) = P(X and Y) / (P(X) * P(Y)): the ratio of the joint probability to the product of the independent probabilities.
Cross validation: if we perform k-fold cross validation, training a Naive Bayes model, how many models will we end up creating?
K models. One for each test partition.
Give an example of a type of data (data type) that k-means should not be used for and explain why.
K-Means should not be used for categorical data. K-means uses Euclidean distance. This measure is not well suited for categorical data.
Used for clustering numerical data, usually a set of measurements about objects of interest.
K-means clustering
WSS = sum_{i=1..K} sum_{j=1..n_i} |x_ij - c_i|^2
K: # of clusters; n_i: # of points in the ith cluster; c_i: centroid of the ith cluster; x_ij: jth point of the ith cluster
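A minimal Python sketch of this formula (toy points; each cluster contributes 0.5, worked out in the comment):

```python
import numpy as np

def wss(points, labels, centroids):
    # sum over clusters i, and points x_ij in cluster i, of |x_ij - c_i|^2
    return sum(np.sum((points[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))

pts = np.array([[0., 0.], [0., 1.], [5., 5.], [6., 5.]])
lab = np.array([0, 0, 1, 1])
cen = np.array([pts[lab == i].mean(axis=0) for i in range(2)])
print(wss(pts, lab, cen))   # 0.25 + 0.25 per cluster = 1.0 total
```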
This method is a supervised method in the sense that the class labels must be known (LDA/PCA/Both)
LDA
this selects component axes that maximize class separation
LDA
measures how many times more often X and Y occur together than expected if they were statistically independent. It is a measure of how X and Y are really related rather than coincidentally happening together.
Lift
•Used to estimate a continuous value as a linear (additive) function of other variables
Linear Regression
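A minimal numpy sketch (hypothetical drivers and outcome), producing the two outputs described below:

```python
import numpy as np

# hypothetical drivers (age, years of education) and outcome (income)
X = np.array([[25, 12], [30, 16], [40, 16], [50, 20]], dtype=float)
y = np.array([30e3, 55e3, 70e3, 95e3])

A = np.column_stack([np.ones(len(X)), X])      # prepend an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # b0, b1, b2
print(coef)       # relative impact of each driver
print(A @ coef)   # the linear expression predicting the outcome
```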
Do I suspect some of the inputs are correlated?
Logistic regression
Are there categorical variables with a large number of levels?
Naive Bayes
Do I suspect some of the inputs are irrelevant?
Naive Bayes
Is the problem high-dimensional?
Naive Bayes
We now describe the general algorithm. Our objective is to construct a tree T from a training set S. If all examples in S belong to some class "C" (good_credit, for example), or S is sufficiently "pure" (in our case node p(credit_good) is 70% pure), we make a leaf labeled "C".
Otherwise we select another attribute considered the "most informative" (savings, housing, etc.) and partition S according to A's values, similar to what we explained in the previous slide. We construct sub-trees T1, T2, ... for the subsets of S recursively until:
•all of the nodes are as pure as required, or
•you cannot split further per your specifications, or
•any other specified stopping criterion is met.
conditional probabilities for attribute X
P(X is low | +), P(X is norm | +), P(X is high | +), P(X is low | -), P(X is norm | -), P(X is high | -)
Bayesian Classification
Problem statement:
-given features X1, X2, ..., Xn
-predict a label Y
The holiday sales spike is an example of ____________.
Seasonality; The seasonal component of a series typically makes the interpretation of a series ambiguous. By removing the seasonal component, it is easier to focus on other components.
Caution of logistic regression: Assumes that each variable affects the log-odds of the outcome linearly and additively
So if we have some variables that affect the outcome non-linearly and the relationships are not actually additive the model does not fit well.
Cautions of linear regression: It cannot handle variables that affect the outcome in a discontinuous way (step functions)
A variable whose effect on the outcome jumps at a threshold cannot be captured by a single linear, additive term; such variables must be transformed (for example, binned into indicator variables) before the model fits well.
What one does is to take the data from the last n periods, average the data, and use that as the forecast for the next period. We count backwards in time, minus 1, minus 2, minus 3 and so forth until we have n data points, divide the sum of those by the number of data points, n, and that gives you the forecast for the next period.
So it's called a single moving average or simple moving average. The forecast is simply a constant value that projects the next time period. "n" is also the order of the moving averages.
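A minimal sketch (toy sales numbers):

```python
def sma_forecast(y, n):
    # forecast for the next period = average of the last n observations
    return sum(y[-n:]) / n

sales = [12, 15, 14, 16, 18, 17]
print(sma_forecast(sales, n=3))   # (16 + 18 + 17) / 3 = 17.0
```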
Explain why stratification is used in holdout estimation. What purpose does it serve?
Stratification is used to ensure that test + training data contain approximately the same percentage of each class. The purpose is to ensure that the partitions are representative of the sample population.
Specificity = proportion of patients with a negative index test among the patients who do not have the disease
TN / (TN + FP) = 35/(35+65); True negative / (true negative + False positive)
Sensitivity = proportion of patients with a positive index test among all patients that have the disease
TP / (TP + FN) = 65/(65+35); True positive / (True positive + False negative)
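A small Python check using the counts from this example:

```python
def sens_spec(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return sensitivity, specificity

print(sens_spec(tp=65, fn=35, tn=35, fp=65))   # (0.65, 0.35)
```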
are at the end of the last branch on the tree. These represent the outcome of all the prior decisions. The leaf nodes are the class labels, or the segment in which all observations that follow the path to the leaf would be placed.
The Leaf nodes
Reasons to choose logistic regression: provides a concise representation of the outcome via the coefficients.
The data is easy to score
Cross Validation: Explain cross validation. Why is it considered an unbiased assessment of the accuracy of a classifier?
The data set is partitioned into K approximately equal sized sets. A model is trained on K-1 sets and tested on the "left out" set. This process is repeated so that each partition is the test set. Model accuracy is then taken as the average over all the test sets. It is unbiased in the sense that all of the data is used to test the accuracy of the model.
Diagnostics
The diagnostics we used in regression can be used to validate the effectiveness of the model we built. The techniques of using hold-out data, performing N-fold cross validation, and using ROC / area-under-the-curve methods can be deployed with the Naïve Bayesian classifier as well.
In the case of the T-test, we assume that samples reflect the underlying distributions. But what is the underlying assumption about the distributions themselves when the two sample t-test is used?
The samples are assumed to be normally distributed.
What is the null hypothesis in the case of the two sample t-test?
The two samples come from the same distribution. Specifically, mu1 = mu2.
Prune
The word "prune" is used like it would be in gardening, where you prune away the excess branches of your bushes.
Practically based on the domain knowledge, a value for K is picked and the centroids are computed. Then..
Then a different K is chosen and the modeling is repeated to observe whether it enhances the cohesiveness of the data points within the cluster groups. However, if there is no apparent structure in the data we may have to try multiple values for K. It is an exploratory process.
In the k-means lab, you examined different values for k using the "knee" heuristic to pick the best value of k. Explain what the value for each k in the graph represents?
These values are the "within sum of squares". The value for a particular K is the sum of the WSS for each cluster.
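A hedged sketch of the heuristic (toy data; assumes scikit-learn, whose KMeans.inertia_ is the total WSS):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)   # toy data
wss_by_k = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_
            for k in range(1, 10)]
# plot k vs. wss_by_k and look for the "knee" where the curve flattens
print(wss_by_k)
```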
Reasons to caution Apriori: Can mistakenly find spurious (or coincidental) relationships that are practically not very useful.
This is addressed with lift and leverage measures
What do we expect the mean of the residuals to be if we have a good fit?
This value should be very close to zero.
Determine the prior for P(-) =
(True neg + False pos) / total = 13/25 of the observations
can be considered the overall dispersion of the data
WSS
is a measure of how tight on average each cluster is
WSS
Apriori is a bottom-up approach where we start with all the frequent itemsets of size 1 (for example shoes, purses, hats etc) first and determine the support. Then we start pairing them.
We find the support for, say, {shoes, purses} or {shoes, hats} or {purses, hats}. Suppose we set our threshold at 50%: we find those itemsets that appear in at least 50% of all transactions. We scan all the itemsets and "prune away" those that have less than 50% support (appear in less than 50% of the transactions), and keep the ones that have sufficient support.
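A minimal sketch of this bottom-up pass (toy transactions; a 50% support threshold):

```python
from itertools import combinations

transactions = [{"shoes", "purses"}, {"shoes", "hats"},
                {"shoes", "purses", "hats"}, {"purses"}]
min_support = 0.5

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# frequent itemsets of size 1, then grown to size-2 pairs; the Apriori
# property lets us pair only items that are themselves frequent
freq1 = [frozenset([i]) for i in {i for t in transactions for i in t}
         if support({i}) >= min_support]
freq2 = [a | b for a, b in combinations(freq1, 2)
         if support(a | b) >= min_support]
print(freq1, freq2)
```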
Model evaluation: consider the classification in the following confusion matrix.
What is the sensitivity? What is the specificity?
When the decision is numerical, the "greater than" branch is usually shown on the _____and "less than" on the _____.
When the decision is numerical, the "greater than" branch is usually shown on the right and "less than" on the left.
•Simple ad-hoc adjustment: take several years of data, calculate the average value for each month, and subtract that from Y1t
Y2t = Y1t - S*t
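A minimal numpy sketch of this ad-hoc adjustment (toy monthly data over three years):

```python
import numpy as np

y1 = np.tile(np.arange(12.0), 3) + np.random.randn(36)   # 3 years, monthly
months = np.arange(36) % 12
s = np.array([y1[months == m].mean() for m in range(12)])  # avg per month
y2 = y1 - s[months]                                        # Y2t = Y1t - S*t
```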
Consider a data set described by n features: if these are all binary features, does n place a limit on the depth of the decision tree?
Yes, the decision tree can have a max depth of n if all n features are binary features
Modeling a Time Series
Yt =Tt +St +Rt, t=1,...,n
Use an Exponential Trend Model if the percentage differences are more or less constant.
[ ((y2 - y1)/y1) * 100% ≈ ((y3 - y2)/y2) * 100% ≈ ... ≈ ((yn - y(n-1))/y(n-1)) * 100% ]
Use a Linear Trend Model if the first differences are more or less constant
[ (y2 - y1) ≈ (y3 - y2) ≈ ... ≈ (yn - y(n-1)) ]
Use a Quadratic Trend Model if the second differences are more or less constant.
[ (y3 - y2) - (y2 - y1) ≈ ... ≈ (yn - y(n-1)) - (y(n-1) - y(n-2)) ]
Apriori algorithm uses the notion of Frequent itemset
a set of items L that appears together "often enough"; meets a minimum support criterion
•The details vary according to the specific __________- CART, ID3, C4.5 - but the general idea is the same
algorithm
There are several ____________ that implement Decision Trees and the methods of tree construction vary with each one of them. CART,ID3 and C4.5 are some of the popular algorithms.
algorithms
Earliest of the association rule algorithms
apriori algorithm
It is apparent that if 50% of transactions have {shoes, purses} in them, then at least 50% of the transactions will have {shoes} and at least 50% will have {purses} in them. This is the ___ _____, which states that any subset of a frequent itemset is also frequent.
apriori property
Robust
are any statistics that yield good performance when data is drawn from a wide range of probability distributions, and that are largely unaffected by outliers or small departures from model assumptions in a given dataset. They are resistant to errors in the results.
The linear regression problem itself is solving for the
bi
this method is based on a linear transformation of feature space (LDA/PCA/Both)
both LDA and PCA
refer to the outcome of a decision and are represented by the connecting lines here.
branches
Autoregressive (AR) models
can be coupled with moving average (MA) models to form a general and useful class of time series models called Autoregressive Moving Average (ARMA) models. This is the simplest Box-Jenkins model.
is the center of a discovered cluster. K-means clustering provides this as an output. When the number of clusters is fixed to k, K-means clustering can be stated formally as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized.
centroid
caution of naive bayes: is not very reliable for the probability estimation and should be used for _____ ______ ______ only. Naïve Bayesian classifier in its simple form is used only with categorical variables and any continuous variables should be rendered discrete into intervals.
class label assignments
LDA Goal
classification
one sample t test
compare the mean of the sample with the mean of the population
Caution of linear regression: doesn't work well with discrete drivers that have a lot of distinct values; the model becomes
complex and computationally inefficient. For example, ZIP code
Logistic regression reason to choose: it has explanatory value and we can easily determine the relative impact of each variable on the outcome. The explanatory values are a little more _____ than linear regression.
complicated
Reason to choose linear regression: provides a ____ representation of the outcome via the coefficients
concise
Frequent itemsets are used to find rules X->Y with a minimum
confidence
The _____________ ____________for an estimate x of an unknown value mu is the interval that should contain the true value mu, to a desired probability
confidence interval
Reasons to caution Apriori: it requires many
database scans
PCA goal
dimension reduction
Caution for logistic regression: when you have ________ drivers with a large number of distinct values the model becomes complex and computationally inefficient.
discrete
Caution of naive bayes model: numeric values have to be rendered ____ into intervals
discrete (categorical); However, it is not necessary to have the continuous variables as "discrete": several standard implementations can handle continuous variables as well.
In terms of Cautions (-), the decision surface is axis-aligned and the decision regions are rectangular surfaces. However, if the true decision surface is not axis-aligned (say a triangular surface), the Decision Tree algorithms
do not handle this type of data well.
Reason to choose linear regression: the data is ____
easy to score
In LDA discriminants are
eigenvectors
What are the elbows? We look for the elbow of the curve which provides the optimal number of clusters for the given data.
elbows at k = 2, 4, 6
Caution for logistic regression: it Cannot handle variables that affect the outcome in a discontinuous way.
example: step functions
Reasons to caution Apriori: ______ time complexity
exponential
If we expand categorical variables (ZIP codes as a categorical value) we will end up with a lot of variables and the complexity of the solution becomes
extremely high
Picking K heuristic
find the "elbow" of the within-sum-of-squares (wss) plot as a function of K
An MA model adds to yt the effects of a dampened white noise process over the last q steps. This is a simple moving average or single moving average; it's probably the most basic of the _______ methods.
forecasting
explain why we have map function and reduce function for hadoop
Hadoop sorts key-value pairs between the map and reduce functions. This enables the reduce function to "reduce" all values associated with a given key.
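A minimal Python sketch simulating the shuffle/sort step between map and reduce (word count; not actual Hadoop code):

```python
from collections import defaultdict

def map_fn(line):                  # map: emit a (word, 1) pair per word
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):       # reduce: combine all values for one key
    return word, sum(counts)

groups = defaultdict(list)         # stand-in for Hadoop's shuffle/sort
for line in ["to be or not to be"]:
    for key, value in map_fn(line):
        groups[key].append(value)  # group values by key

print([reduce_fn(k, v) for k, v in groups.items()])
# [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```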
Hold-out data
how well does the model classify new instances
Reasons to choose K means clustering: it is easy to
implement
Reasons to choose Apriori: it is easy to
implement and parallelize ( but it is also computationally expensive)
ARMA models can be used when the series is weakly stationary;
in other words, the series has a constant variance around a constant mean. This class of models can be extended to non-stationary series by allowing differencing of the data series. These are called Autoregressive Integrated Moving Average (ARIMA) models. There are a large variety of ARIMA models.
Cautions of Linear regression: issue of infinite magnitude coefficients where the prediction is ______ in ranges.
inconsistent
The complexity of the solution (both in storage and computation) ________ as the number of variables increases.
increases
logistic regression assumes linearity of
independent variables and log odds
What are indicator variables used for? Give an example.
Indicator variables are used to handle categorical variables with more than 2 levels in linear regression. Ex.: T in {hot, med, cold}: y = b0 + b1*I(hot) + b2*I(med) + b3*I(cold)
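A minimal sketch of building the indicator columns (hypothetical levels):

```python
levels = ["hot", "med", "cold"]
temps = ["hot", "med", "cold", "hot"]    # a categorical driver
indicators = [[int(t == lvl) for lvl in levels] for t in temps]
# each row gives (I_hot, I_med, I_cold), which become regression columns
print(indicators)   # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```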
So the most informative attribute is the attribute with most __________ _______________. Remember, this is just an example. There are other information/purity measures, but InfoGain is a fairly popular one for inducing Decision Trees.
information gain
Caution of K means clustering: is sensitive to the___ ____ on the centroids
initial guess
numerical; there must be a distance metric defined over the variable space (e.g., Euclidean distance)
input
can be continuous or discrete
input variables for linear regression
are the decision or test points. Each refers to a single variable or attribute.
internal nodes
Reason to choose naive bayes: it is robust to irrelevant variables
irrelevant variables are distributed among all the classes and their effects are not pronounced
De-trending
is often applied to remove a feature thought to distort or obscure the relationships of interest.
Euclidean distance
is the most popular method for calculating distance. Euclidean distance is the "ordinary" distance that one could measure with a ruler: the straight-line distance between two points. In a single dimension, the Euclidean distance is the absolute value of the difference between two points. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is √((x1 - x2)² + (y1 - y2)²). In N dimensions, the Euclidean distance between two points p and q is √(∑_{i=1..N} (p_i - q_i)²), where p_i (or q_i) is the coordinate of p (or q) in dimension i.
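A minimal Python sketch of the N-dimensional formula:

```python
import math

def euclidean(p, q):
    # square root of the summed squared coordinate differences
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean((1, 1), (4, 5)))   # 5.0 (a 3-4-5 right triangle)
```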
If the support criterion is met we grow the frequent____ from size 1 to size K or until we run out of support.
itemsets
is a similar notion to lift but instead of a ratio it is the difference
leverage
measures the difference in the probability of X and Y appearing together in the data set compared to what would be expected if X and Y were statistically independent.
leverage; Leverage(X -> Y) = P(X and Y) - P(X) * P(Y)
There are other measures to evaluate candidate rules and we will define two such measures
lift and leverage
Are there mixed variable types?
logistic regression
Do I want class probabilities, rather than just class labels?
logistic regression
Do I want insight into how the variables affect the model?
logistic regression
We present a simple model for the time series with trend, seasonality and a random fluctuation. There is often a ____________ frequency cyclic term as well, but we are ignoring that for simplicity.
low
Estimation in logistic regression chooses the parameters that _____ the likelihood of observing the sample values
maximizes
Caution of Logistic regression: it does not handle ____ values well
missing
Caution of linear regression: does not handle _____ values well
missing
Reason to choose Naive bayes: it handles ____ values quite well
missing
ARIMA - difference the Yt d times to "induce stationarity". d is usually 1 or 2. "I" stands for integrated - the outputs of the model are summed up (or "integrated") to recover Yt
moving average: like a random walk, or Brownian motion
methods that handle missing values well
naive bayes and decision tree
problem given features x1,...xn predict class y: which methods could solve this problem
naive bayes, decision tree, logistic regression
Is the k-means algorithm deterministic? In other words, will you always get the same results with the same data? Explain
No, it is not deterministic. K random seeds are chosen, so it is possible to get different clusterings depending on the initial random seeds.
In practice, most people estimate the 95% confidence interval as the mean plus/minus twice the standard deviation. This is really only true if the data is ____ ______, but it is a helpful rule of thumb.
normally distributed
Categorical variables are expanded to a set of indicator variables,
one for each possible value.
used only for tests of the population mean
one sample t-test
the clustering algorithm is sensitive to ____. If the data has _____ and removing them is not possible, the results of the clustering can be substantially distorted
outliers
•the centers of each discovered cluster
•the assignment of each input datum to a cluster (its centroid)
output
Reasons to choose K means clustering: provides concise ______
output; the coordinates of the K cluster centers
A combination of AR and MA models The general non-seasonal model is known as ARIMA (p, d, q)
p is the number of autoregressive terms
d is the number of differences
q is the number of moving average terms
this method maps the original space to a set of orthogonal dimensions
pca
this method selects component axes that maximize variance
pca
Caution of K means clustering: K (# of clusters) should be decided ahead of the modeling process (a priori). Wrong guesses may lead to
possibly poor results and improper clustering
Reason to choose logistic regression: returns good ____ estimates of an event
probability
Overfitting data
refers to fitting the training data so well that we also fit its idiosyncrasies: aspects of the data that are not relevant in characterizing the underlying population
Reason to choose linear regression: have explanatory values and we can easily determine the
relative impact of each variable on the outcome.
Changing the ____(for example from feet to inches) can significantly influence the results.
scale
The Euclidean distance is influenced by the _____ of the variables.
scale
Unlike the trend and cyclical components, ______ components, theoretically, happen with similar magnitude during the same time period each year
seasonal
Explain why it is better to present separate sensitivity and specificity results instead of the overall accuracy.
sensitivity is the "true positive rate". Specificity is the "true negative rate". Presenting the overall accuracy in this case would hide the individual rates, making it harder for one to interpret and analyze the results.
ROC curve is a plot of
sensitivity vs. 1-specificity
in the case of a one sample t-test would a large variance in the sample increase or decrease the likelihood that the null hypothesis would be rejected?
A large variance decreases the likelihood of rejecting the null hypothesis; power is higher when the standard deviation is small.
Many time series analyses (Basic Box-Jenkins in particular) assume ________________ sequences:
stationary
Apriori property tells us how to prune the search space and
stop searching further if the support threshold criterion is not met
The common measures used by Apriori algorithm are ____ and ____.
support and confidence. We rank all the rules based on the support and confidence and filter out the most "interesting" rules
The Naïve Bayes Assumption: Assume that all features are independent given
the class label Y; equationally speaking: P(X1, ..., Xn | Y) = P(X1 | Y) * P(X2 | Y) * ... * P(Xn | Y)
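A minimal sketch of scoring under this assumption (all priors and conditional tables below are hypothetical, echoing the income/credit example earlier in this review):

```python
priors = {"+": 0.48, "-": 0.52}   # hypothetical class priors
cond = {                          # hypothetical tables P(X_i = value | Y)
    "income": {"+": {"low": 0.7, "high": 0.3}, "-": {"low": 0.2, "high": 0.8}},
    "credit": {"+": {"low": 0.6, "high": 0.4}, "-": {"low": 0.1, "high": 0.9}},
}

def score(features):
    # P(Y | x) is proportional to P(Y) * product over i of P(x_i | Y)
    out = {}
    for y, p in priors.items():
        for attr, value in features.items():
            p *= cond[attr][y][value]
        out[y] = p
    return out

print(score({"income": "low", "credit": "low"}))   # larger score wins
```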
null hypothesis of the one sample t test
the means are the same
Reason to choose Naive bayes: easy to implement and scoring data (predicting) is easy
the model is resistant to overfitting
the goal of logistic regression is to predict
the odds of the outcome dependent variable
confidence
the percent of transactions that contain X, which also contain Y
The term "often enough" is formally defined with a support criterion where the support is defined as
the percentage of transactions that contain "L"
Reason to choose linear regression: robust to redundant variables and correlated variables.
the prediction is not impacted but we lose some explanatory value with the fitted model
The output of the apriori are
the set of all rules X -> Y with minimum support and confidence
we can think of verifying the null hypothesis as verifying whether the mean values of two different groups is the same. If they are not the same...
then the alternative hypothesis is true: the introduction of new behavior did have an effect.
Training set role:
this is the data used to Train the model
validation set role
this is the data used to detect overfitting
test set role:
this is the data used to provide an unbiased evaluation of the final model
Time Series Analysis is not a common "_____" in a Data Scientist's tool kit.
tool; Though the models require minimal data collection and handle the inherent autocorrelations of lagged time series, it does not produce meaningful drivers for the prediction.
________ in a time series is a slow, gradual change in some property of the series over the whole interval under investigation.
trend
specificity=
true negative rate
What is another name for specificity
true negative rate; TN / (TN+FP)
Determine the prior for P(+) =
(True pos + False neg) / total = 12/25 of the observations
sensitivity =
true positive rate
what is another name for sensitivity
true positive rate; TP / (TP+FN)
The Euclidean equation ignores the relationship between __________
variables
Reasons to choose K means clustering: it is easy to assign new data to existing clusters by determining...
which centroid the new data point is closest to (i.e., which is the nearest cluster center)
Caution of K means clustering: K means works only on the ____ data and does not handle ____ variables
works only on the numerical data and does not handle categorical variables
Of course, you probably don't know sigma, but you do know the empirical standard deviation of your n samples, s. So you would estimate the 95% confidence interval as
x +/- 2*s/sqrt(n).
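A minimal Python sketch (toy samples):

```python
import math
import statistics

samples = [9.8, 10.1, 10.4, 9.9, 10.2, 10.0, 9.7, 10.3]
x = statistics.mean(samples)
s = statistics.stdev(samples)            # empirical standard deviation
half = 2 * s / math.sqrt(len(samples))   # the +/- 2*s/sqrt(n) half-width
print(x - half, x + half)                # approximate 95% CI for the mean
```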
Apriori Property
•Any subset of a frequent itemset is also frequent
-It has at least the support of its superset
In the example shown, the graph of CO2 concentrations measured over many years shows a linear upward trend. In climatology, for example, this CO2 trend due to urban warming might obscure a relationship between air temperature and CO2 concentration.
•In this example, we see a linear trend, so we fit a linear model
-T*t = mt + b
Examples of linear regression
•Predicting income as a function of number of years of education, age and gender (drivers).
•House price (outcome) as a function of median home price in the neighborhood, square footage, number of rooms.
•Neighborhood house sales in the past year based on economic indicators.
Rt
•Random fluctuation -Noise, or regular high frequency patterns in fluctuation
De-trending is a pre-processing step to prepare time series for analysis by methods that assume stationarity. A simple linear trend can be removed by subtracting a least-squares-fit straight line. In the example shown we fit a linear model and obtain the difference. The graph shown next is a de-trended time series.
•The de-trended series is then -Y1t = Yt - T*t
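A minimal numpy sketch of fitting and removing a linear trend (toy series):

```python
import numpy as np

t = np.arange(100)
y = 0.5 * t + 10 + np.random.randn(100)   # series with a linear trend
m, b = np.polyfit(t, y, deg=1)            # least-squares fit: T*t = m*t + b
y1 = y - (m * t + b)                      # de-trended series: Y1t = Yt - T*t
```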
Pick the Most "Informative" Attribute Information Gain
•The information that you gain, by knowing the value of an attribute •So the "most informative" attribute is the attribute with the highest InfoGain
St:
•The seasonal term (short term periodicity) -Retail sales fluctuate in a regular pattern over the course of a year. •Typically, sales increase from September through December and decline in January and February.
ARMA Model
•The simplest Box-Jenkins Model -combination of two process models
Tt
•Trend term -Sales of iPads steadily increased over the last few years: trending upward.
Linear regression: Ordinary Least Squares (OLS) The solution requires storage as the square of the number of variables and we need to invert a matrix
•storage quadratic in number of variables
•must invert a matrix
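A minimal numpy sketch of the OLS normal equations, showing the matrix that must be stored (quadratic in the number of variables) and inverted (toy data; in practice np.linalg.lstsq is preferred for numerical stability):

```python
import numpy as np

A = np.column_stack([np.ones(4), [1.0, 2.0, 3.0, 4.0]])  # intercept + 1 driver
y = np.array([2.1, 3.9, 6.2, 8.1])

gram = A.T @ A                      # p x p: storage quadratic in #variables
b = np.linalg.inv(gram) @ A.T @ y   # must invert a matrix
print(b)                            # intercept and slope
```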