Exam 2 Data Analytics
use cases- cluster similar documents
cluster news articles
use cases- geospatial clustering
cluster people by space and time, e.g., groups of states interested in watching the same shows
when the event is equally likely to occur and not occur, the odds are
exactly 1
parameters for random forest
Number of decision trees to train in parallel: -more trees generally give better performance but take longer to train -must make a trade-off based on computing power and performance. Number of variables to choose for splitting at each node: -must be less than the total number of IVs -recommended and default value is m = sqrt(M) -example: if there are 100 IVs, then m = 10 variables are randomly chosen at each node for splitting
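A minimal sketch of setting these two parameters, assuming scikit-learn and a fabricated illustrative dataset (neither is from the card itself):

```python
# Sketch: the two random forest parameters above, in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=100, random_state=0)

model = RandomForestClassifier(
    n_estimators=200,     # number of trees; more trees, longer training
    max_features="sqrt",  # m = sqrt(M) variables considered at each split
    random_state=0,
)
model.fit(X, y)
```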
odds vs probability
Odds: # of ways event can occur / # of ways event cannot occur. Probability: # of ways event can occur / total # of outcomes. Example: you have a box filled with 5 red balls, 3 blue balls, and 2 yellow balls (10 balls total) -the probability of taking a blue ball is 3/10 -the odds of taking a blue ball are 3/7
steps in supervised predictive modeling
*1. Obtain data:* -data is info about the problem that you're working on -includes both inputs and outputs
*2. Data pre-processing:* -use the appropriate data type for each column -code categorical variables -handle missing values -feature selection (select the features that are most relevant to your prediction)
*3. Split processed data into training and testing data:* -randomly split into training and testing -generally the proportion of training data is greater than the testing data, example: 75% training and 25% testing
*4. Learn from training dataset:* -use the known inputs and outputs in the training data to train a model to uncover the relationship between them
*5. Make predictions using testing dataset:* -use ONLY the inputs in the testing dataset to predict the output
*6. Evaluate machine learning model:* -compare the actual output and the predicted output -use performance measures to evaluate the model -measures depend on the model task --for classification: accuracy, AUC, sensitivity, specificity --for regression: MSE, MAD, etc.
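A minimal end-to-end sketch of steps 3-6, assuming scikit-learn; the model choice and the fabricated dataset are illustrative, not from the card:

```python
# Sketch of steps 3-6: split, train, predict, evaluate.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, random_state=0)  # stand-in for pre-processed data

# Step 3: random 75% / 25% train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 4: learn from the training dataset
model = LogisticRegression().fit(X_train, y_train)

# Step 5: predict using ONLY the testing inputs
y_pred = model.predict(X_test)

# Step 6: compare actual vs. predicted output
print(accuracy_score(y_test, y_pred))
```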
initialization: choose the number of clusters
*Empirical method*: -# of clusters: k = sqrt(N/2) for a dataset of N points -i.e. if N = 200, then k = 10
Other methods: -Elbow method: keep increasing the number of clusters as long as the error drops sharply, and stop once the decrease flattens out and becomes roughly linear (the "elbow")
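A sketch of the elbow method, assuming scikit-learn and matplotlib with a fabricated blob dataset:

```python
# Sketch: plot SSE (inertia) against k and look for the "elbow"
# where the decrease in error flattens out.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")  # the bend suggests a good k
plt.xlabel("k")
plt.ylabel("SSE")
plt.show()
```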
evaluating k-means cluster
*External*: employ criteria not inherent to the dataset -compare a clustering against prior or expert-specified knowledge using a clustering quality measure. *Internal*: unsupervised, criteria derived from the data itself -evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are -SSE, modularity, silhouette coefficient
maximum likelihood estimator
- Given the input variables, the maximum likelihood estimator chooses coefficients (slope and intercept) to maximize the probability of the observed pattern of classes - To maximize the likelihood, we must choose the slope and intercept that maximize Ln(k1) + Ln(k2) + ... + Ln(kn), where ki is the likelihood of sample i being in its class *you find the slope and intercept using the maximum likelihood estimator* NOT the sum of squared errors
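A tiny sketch of the quantity being maximized; the per-sample likelihoods below are made-up values, and numpy is assumed:

```python
# Sketch: the log-likelihood is the sum of the logs of each sample's
# likelihood of being in its observed class.
import numpy as np

likelihoods = np.array([0.9, 0.7, 0.8, 0.6])  # hypothetical ki values
log_likelihood = np.sum(np.log(likelihoods))  # Ln(k1) + Ln(k2) + ... + Ln(kn)
print(log_likelihood)  # MLE picks the slope and intercept that maximize this
```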
when do you stop an algorithm?
- a branch with entropy of 0 is a leaf node - a branch with fewer than a certain number of samples - have already split on all variables
unsupervised learning association
-*rule-based machine learning* method for discovering interesting relations between variables in large databases -rules do not extract an individual's preferences; rather, they find relationships between sets of elements across distinct transactions. Applications: -market basket analysis: take receipts from all of a store's customers and analyze patterns of buying -web intrusion detection
real life applications of clustering
-Netflix divides its 93 million users around the world into 1,300 "taste communities" -AI in dating apps -Southwest Airlines uses big data to deliver excellent customer service
confusion matrix for more than 2 classes
-calculate the sensitivity for each category -calculate the precision for each category -all values are between 0 and 1 -higher values are preferred for each of the above measures
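A sketch of per-class measures, assuming scikit-learn's classification_report; the labels are made up:

```python
# Sketch: per-category sensitivity (recall) and precision for 3 classes.
from sklearn.metrics import classification_report

actual    = ["cat", "dog", "bird", "dog", "cat", "bird", "dog"]
predicted = ["cat", "dog", "bird", "cat", "cat", "dog", "dog"]
print(classification_report(actual, predicted))  # one row of measures per class
```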
randomly chosen variable subset at each node for best split
-a problem contains a total of M variables -to ensure randomization, for each node of a tree: 1. randomly select m variables (m << M) for splitting 2. evaluate the information gain for each of the selected m variables 3. choose the variable with the maximum information gain for splitting -repeat the above steps at each node for all the decision trees that you are training
what is clustering
-a way of grouping *similar* objects together based on data describing the object samples, i.e. similarity is measured using some criteria -a form of *unsupervised learning*: deals with finding structure in unlabeled data, i.e. data that has input variables but does not include examples of the expected outcome/output -a method of *data exploration*: a way of looking for patterns or structure in the data that are of interest
mean squared error
-also measures the dispersion of errors, but larger errors get penalized more due to squaring -computed as the average of the squared errors over all time periods. MSE = (sum of squared errors) / (total number of errors)
advantages of using random forest
-avoids overfitting -less variance compared to a single decision tree -provides good prediction accuracy -useful for learning complex non-linear relationships
converting probability to class label
-class label refers to the output class or levels of categories in your DV (typically 2 categories) -the output of classification models (e.g. logistic regression) is a probability for each output class. Example: if our output classes are presence or absence of diabetes, then logistic regression will give: 1. probability of presence of diabetes = p 2. probability of absence of diabetes = q. -we choose a threshold between 0 and 1 to convert the probability to a class label. Example: if we set the threshold for presence of diabetes at 0.7, then when p >= 0.7 we classify that individual as diabetic, and as non-diabetic when p < 0.7
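A sketch of that thresholding step, assuming numpy; the probabilities are made up:

```python
# Sketch: convert predicted probabilities to class labels with a 0.7 threshold.
import numpy as np

p_diabetes = np.array([0.92, 0.40, 0.71, 0.65])  # hypothetical model outputs

labels = np.where(p_diabetes >= 0.7, "diabetic", "non-diabetic")
print(labels)  # ['diabetic' 'non-diabetic' 'diabetic' 'non-diabetic']
```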
use cases- recommendation engines
-cluster similar movies and tv shows together
disadvantages of using random forest
-complex and time consuming to train model
decision trees for prediction
-decision trees can be used for predicting continuous output -instead of using the info gain criterion, we aim to minimize the SSE at each split
unsupervised learning clustering
-determine the internal grouping in a set of unlabeled data -helps profile attributes of different groups -evaluation --external: experts review the results --internal: measure within-group similarity (this is what we use in this class)
advantages to decision trees
-easy to understand -non-parametric method: no assumptions about the space distribution or classifier structure
strengths of k-means
-fast and computationally efficient -simple and easy to implement -easy to interpret -measurable and efficient for large data collections
choosing machine learning algorithm
-interpretability -performance -computational requirements
common clustering algorithms
-k-means (will be covered) -hierarchical (will NOT be covered) -density-based
mean absolute deviation
-measures the dispersion of prediction errors -computed as the average of the absolute values of errors over all time periods -error is predicted value minus actual value. MAD = average of the absolute values of the errors
objective function of k-means
-minimize the total intra-cluster variance (the squared error function) -centroid of a cluster: the average of all training data points in that cluster
getting odds ratio from logistic regression coefficients
-the odds ratio can be calculated from the coefficient values (Bi) in the logistic regression -the odds ratio for variable xi is e^Bi -in other words, the exponent of the coefficient of a variable gives the odds ratio for that variable -if the coefficient of variable x is 0.25, then the odds ratio is e^0.25 = 1.2840
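A one-line check of that example, assuming numpy:

```python
# Sketch: the odds ratio is e raised to the logistic regression coefficient.
import numpy as np

coef = 0.25          # hypothetical coefficient Bi
print(np.exp(coef))  # 1.2840... = odds ratio for that variable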
k-means clustering
-partitional clustering approach -each cluster is associated with a *centroid* (center point) --a centroid is the mean of data points in a cluster --does not have to be one of the original data points -each point is assigned to the cluster with the closest centroid --closeness is measured using a distance metric -number of clusters, K, must be specified
real world applications of using random forest
-predicting and detecting Alzheimer's disease at hospitals -predicting whether a patient will miss an appointment -predicting the economic impact of adverse weather events -fraud detection by banks -predicting click-through rates for online retailers
disadvantages to decision trees
-prone to overfitting: model learns the noise along with the pattern
limitations of k-means
-sensitive to initial conditions: different initial conditions may produce different clusters, and the algorithm may be trapped in a local optimum -sensitive to outliers -k-means has problems when clusters have differing sizes, densities, or non-globular shapes
random forests
-supervised machine learning algorithm for classification and regression -combines the output from multiple decision trees to make predictions -overcomes the overfitting drawback of decision trees
supervised learning prediction
-task of approximating a mapping function (f) from input variables (x) to *continuous output variables (y)* -a continuous output variable is a real value, such as an integer or floating point value -input variables can be real-valued or discrete -a regression problem requires the prediction of a quantity
supervised learning classification
-task of approximating a mapping function (f) from input variables (x) to *discrete output variables (y)*, i.e. discrete outputs: MU can win or lose, colors like red or blue, patient has disease or no disease -requires the output to be classified into two or more classes -output variables are often called labels or categories -a problem with 2 classes is called a *binary classification problem*, and a problem with more than 2 classes is called a *multi-class classification problem* -it is common for classification models to predict a continuous value as the probability of a given example belonging to each output class, i.e. a patient may be assigned probabilities of 0.1 of being diabetic and 0.9 of being not diabetic; we categorize them as not diabetic due to the higher likelihood
choosing the right attribute for classification
1. compute the information gain of splitting the dataset by each of the predictors (IVs) in the dataset 2. choose the attribute with the LARGEST information gain 3. divide the dataset by its branches to construct the decision tree; these steps are repeated until you reach a stopping criterion
important properties of probability
1. probability of getting heads when you toss a coin? 0.5 or 1/2 2. what is the probability of getting 2 when you roll a die? 1/6 3. probability of getting 7 when you roll a die? 0 4. probability of getting heads or tails when you toss a coin? 1 5. probability of getting 1, 2, 3, 4, 5, or 6 when you roll a die? 1 *Property 1: Probability ranges from 0 to 1* *Property 2: Sum of probabilities of all possible events equals 1*
k-means clustering algorithm steps
1. select K points as initial centroids
2. repeat
3. form K clusters by assigning all points to the closest centroid
4. recompute the centroid of each cluster
5. until the centroids don't change
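A from-scratch sketch of those steps, assuming numpy; empty clusters are not handled here:

```python
# Sketch of the k-means loop: assign points to the nearest centroid,
# recompute centroids, stop when they no longer change.
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1
    while True:
        # step 3: assign each point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # step 5: stop when the centroids don't change
        if np.allclose(new_centroids, centroids):
            return labels, centroids
        centroids = new_centroids
```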
K-means example
Pizza outlet location: -a new pizza chain wants to establish 3 outlets in CoMo -based on market research, the company has identified potential customer locations -Task: what would be the best locations for these 3 outlets so that they can serve their customers effectively?
input/sample dataset = potential customer locations, value of k = 3
Step 1: randomly locate the 3 cluster centers
Step 2: form 3 clusters by assigning customer locations to the nearest outlet (compute the distance from each customer to each of the 3 chosen locations and assign them to the location with the shortest distance)
Step 3: change each cluster center (outlet) to the centroid (average) of its assigned customer locations, then repeat step 2; some customers will change clusters
Repeat steps 2 and 3 until the stopping criterion is reached. Stopping criterion: no change in clusters
decision tree terminology
Root node: the top-most decision node, which corresponds to the best predictor (the most important input variable). Splitting: the process of dividing a node into two or more sub-nodes. Decision node: a sub-node that splits into further sub-nodes. Leaf/terminal node: a node that does not split at all (typically this is the prediction of the outcome)
ensemble methods
Use multiple algorithms to obtain better predictive performance than could be obtained from any of the algorithms by itself -in many real-life problems, a single decision tree may not perform well (though it's very fast), so use multiple trees to create an ensemble method -all algorithms must learn different info 1. Bootstrapping 2. randomly chosen variable subset at each node for best split
use cases - image segmentation
break the image into meaningful or perceptually similar regions, e.g. self-driving cars
why is logistic regression necessary
cannot use simple or multiple regression to predict binary dependent variables: the assumptions of the simple regression line are violated (errors are not normally distributed), the line leads to some negative predicted values which are not accurate, and the errors do not have constant variance (again violating the regression assumptions)
Machine Learning
a subset of AI that uses algorithms to learn from and make predictions about data without being explicitly programmed -operates in one domain (in contrast to AI, which spans many) -allows computers to evolve behaviors based on empirical data -it is necessary for computers to acquire knowledge, just as intelligence requires knowledge
predictive analytics
analysis of *historical* info (and external data) to find patterns and predict future outcomes -history tends to repeat itself. Examples: -will MU win the next football match? -price of Google stock tomorrow? -disease risk of an individual 10 years from now? -patterns are established using machine learning, a subset of artificial intelligence
artificial intelligence (AI)
any technique that enables computers to mirror human intelligence, using logic, if-then rules, decision trees, and machine learning (including deep learning). Examples: -Google Home -Alexa -self-driving cars -many domains. Other subsets of AI: -computer vision -natural language processing
decision tree for prediction steps
at each node: 1. fit a simple regression model for each IV over all possible binary splits 2. choose the variable that minimizes the SSE for further splitting 3. if a stopping criterion is reached, STOP; otherwise go to step 1
stopping criteria: -the largest decrease in SSE is less than some threshold -a branch with fewer than a certain number of samples -all the points in the node have the same value for the IV
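A minimal regression-tree sketch, assuming scikit-learn; the parameters and dataset are illustrative:

```python
# Sketch: a regression tree that chooses splits minimizing squared error (SSE).
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

tree = DecisionTreeRegressor(
    criterion="squared_error",  # minimize SSE at each split
    min_samples_leaf=10,        # stopping rule: branches need a minimum number of samples
    random_state=0,
)
tree.fit(X, y)
print(tree.predict(X[:3]))      # continuous predictions, not class labels
```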
goal of clustering?
finding groups of objects such that the -data within a cluster is similar to one another -data across clusters is different from one another. Intra-cluster distances are minimized (distances between points within one cluster); inter-cluster distances are maximized (distances between two clusters)
when the event is more likely to occur, then the odds are
greater than 1
types of clusters
hard clustering: -objects belong exclusively to one cluster, e.g. k-means clustering. soft clustering: -objects can belong to multiple clusters -objects have a degree of association with each cluster, e.g. fuzzy clustering
simple regression
is used when the output is continuous
if outcome is all continuous and is not a class or a category,
it is a prediction problem
when the event is less likely to occur, then the odds are
less than 1
linear vs logistic model
linear: y = b0 + b1x; the linear model is a straight line. logistic: p = 1/(1 + e^-(b0 + b1x)); the logistic model is S-shaped
logistic regression model
ln[P/(1-P)] = β0 + β1x1 + β2x2 + ...
equivalently, the logistic regression model is p = 1/(1 + e^-(β0 + β1x1 + β2x2 + ...))
-p denotes the probability that the dependent variable is 1 -models the relationship between a binary DV and the IVs -p is between 0 and 1 -if p denotes the probability that the DV is 1, then q denotes the probability that the DV is 0 and is given by q = 1 - p
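A tiny sketch of evaluating that model, assuming numpy; the coefficients and inputs are made up:

```python
# Sketch: turn a linear combination of inputs into a probability p.
import numpy as np

b0, b1, b2 = -1.5, 0.8, 0.3   # hypothetical coefficients β0, β1, β2
x1, x2 = 2.0, 1.0             # hypothetical input values

p = 1 / (1 + np.exp(-(b0 + b1 * x1 + b2 * x2)))  # P(DV = 1), always in (0, 1)
q = 1 - p                                        # P(DV = 0)
print(p, q)
```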
logistic function
looks like a big S and transforms any value into the 0 to 1 range; can have errors
logistic regression
many business problems/decisions deal with understanding the probability associated with certain behaviors or events. These events are mostly dichotomous (binary) variables, for example: win or lose a game, die or survive a surgery, whether a stock will go up or down the next day. In these situations, the analyst must predict a binary dependent variable from a set of IVs. One of the fundamental machine learning approaches for classification of a binary dependent variable is logistic regression
evaluating supervised regression algorithm
mean absolute error: average of the absolute differences between predictions and actual values. mean squared error: average of the squared differences between predictions and actual values
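A sketch computing both measures, assuming numpy; the values are made up:

```python
# Sketch: errors are predicted minus actual, averaged over all samples.
import numpy as np

actual    = np.array([10.0, 12.0, 9.0, 11.0])
predicted = np.array([11.0, 10.0, 9.5, 13.0])

errors = predicted - actual
mae = np.mean(np.abs(errors))  # mean absolute error (MAD)
mse = np.mean(errors ** 2)     # mean squared error: larger errors penalized more
print(mae, mse)
```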
information gain
measures how well a given variable (attribute) separates the training examples according to the output class. Information gain uses the notion of entropy, commonly used in information theory. It is based on the decrease in entropy after a dataset is split on an attribute. Information gain = expected reduction of entropy: Gain(S,X) = Entropy(S) - Entropy(S,X)
entropy of an input variable
measures the entropy of a variable with respect to the output class. Entropy(S,X) = the sum, over each value of X, of (the probability of that value) x (the entropy of the examples with that value)
entropy
measures the impurity of a collection of examples; it depends on the distribution of the random variable p. Entropy = -p log2(p) - q log2(q), where S is the collection of training examples. If the sample is completely homogeneous, entropy is 0 (e.g. if all outcomes are yes, with zero no's, entropy is 0). If the sample is equally divided between yes's and no's, entropy = 1 (the max value entropy can have). If we have 9 yes's and 5 no's, the probability of yes is 9/14 and the probability of no is 5/14, so entropy = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94
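A sketch reproducing that worked example, assuming numpy:

```python
# Sketch: entropy of a two-class sample from its class counts.
import numpy as np

def entropy(counts):
    probs = np.array(counts) / sum(counts)
    probs = probs[probs > 0]          # treat 0*log2(0) as 0
    return -np.sum(probs * np.log2(probs))

print(entropy([9, 5]))   # ~0.94, as in the example above
print(entropy([14, 0]))  # 0.0: completely homogeneous
print(entropy([7, 7]))   # 1.0: equally divided
```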
unlike probability value
odds value varies from 0 to infinity
concept of odds in logistic regression
odds: the probability of an event occurring divided by the probability of that event NOT occurring
example: the probability of having a disease is 0.25. What are the odds of having the disease? -probability of having the disease = 0.25 -probability of not having the disease = 0.75 -odds(having disease) = 0.25/0.75 = 0.33
interpretation: for every 0.33 individuals who have the disease, there is 1 who does not. OR: for every 33 individuals who have the disease, there are 100 individuals who do NOT have the disease
the probability of having the disease is 0.25; what are the odds of NOT having the disease? 0.75/0.25 = 3
interpretation: for every 3 individuals who do NOT have the disease, there is 1 who has the disease
types of clustering methods
partitional clustering: -division of objects into non-overlapping clusters such that each object is in exactly one cluster, ex: k-means clustering
hierarchical clustering: -a set of nested clusters organized as a hierarchical tree (starts with one big cluster and divides into smaller clusters, or the other way around) -a degree of association with each cluster is assigned, ex: BIRCH
why clustering?
pattern detection -group related documents for browsing -group genes and proteins that have similar functionality -group stocks with similar price fluctuations. summarization -reduce the size of large data sets
Logistic regression examples
predicting demographic behavior, such as whether a person will or will not subscribe to a magazine; predicting whether a person will or will not respond to a direct mail campaign; predicting whether or not a cell phone customer will "churn" by the end of the year and switch to another carrier
supervised learning
the problem of developing a model using historical data (both INPUTS and OUTPUTS) to make predictions on new data where we do not have the answer -described as the mathematical problem of approximating a mapping function (f) from input variables (x) to output variables (y): y ~ f(x) -called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process -*the majority of practical machine learning uses supervised learning*
decision tree example
problem: predict whether a person cheats on income tax. input data: refund, marital status, and taxable income. output: cheat or no cheat
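A sketch of this example, assuming scikit-learn; the rows of data are fabricated for illustration:

```python
# Sketch: fit a small classification tree on made-up tax records.
from sklearn.tree import DecisionTreeClassifier

# columns: refund (1/0), married (1/0), taxable income (in $1000s)
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120],
     [0, 0, 95], [0, 1, 60], [1, 0, 220], [0, 0, 85]]
y = [0, 0, 0, 0, 1, 0, 0, 1]  # 1 = cheat, 0 = no cheat

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)  # info-gain-style splits
tree.fit(X, y)
print(tree.predict([[0, 0, 90]]))  # predict cheat/no cheat for a new person
```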
odds ratio
ratio of two odds; a measure of association between y and x
example: -probability of a male having the disease is 0.6 -probability of a female having the disease is 0.2
solution: odds ratio (OR) = odds of male having disease / odds of female having disease
odds of male having disease = 0.6/0.4 = 1.5
odds of female having disease = 0.2/0.8 = 0.25
odds ratio (OR) = 1.5/0.25 = 6
interpretation: the odds of a male having the disease are 6 times the odds of a female having the disease
science behind Netflix's popularity?
recommendations
bootstrapping
sampling with replacement from the original dataset: draw samples from the super population, replacing each observation before drawing the next, so the same observation can appear more than once
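A sketch of one bootstrap sample, assuming numpy; the data is a stand-in:

```python
# Sketch: a bootstrap sample is drawn with replacement, so the same
# observation can appear more than once and some are left out.
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # stand-in for an original dataset of 10 rows

bootstrap = rng.choice(data, size=len(data), replace=True)
print(bootstrap)
```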
logistic regression for classification
a set of IVs (ex: x1, x2, x3) has coefficients associated with the IVs (ex: b1, b2, b3); use them to predict an outcome variable which has two categories. The DV or outcome variable is categorical: it has categories
machine learning techniques
supervised learning: -develop a predictive model based on both input and output data -you know both the inputs and the outputs, and learn the patterns between them. unsupervised learning: -group and interpret data based only on input data -only given inputs
the odds ratio tells us
the relationship between an IV and a DV
if odds ratio is greater than 1
then the relationship between that IV and DV is positive
if odds ratio is less than 1
then the relationship between the IV and DV is negative
random forest for regression
to find the final prediction, take the average of all the trees' outputs
random forest for classification
to make the final prediction, take the majority vote of the trees' outputs
decision trees
a type of supervised algorithm that forms a tree structure, built top-down, that breaks a dataset into smaller and smaller subsets. key benefits: can be used for classification as well as prediction problems; can also provide a set of rules and insights; high interpretability
confusion matrix
used to compute the performance measures for classification algorithms
accuracy = sum of all correct predictions / total samples, or accuracy = (true positive + true negative) / (true positive + true negative + false positive + false negative)
sensitivity = the proportion of actual positives that are correctly identified as positive, or sensitivity = true positive / (true positive + false negative)
precision = true positive / (true positive + false positive)
F1 score = 2 x (precision x recall) / (precision + recall), where recall is the sensitivity value
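A sketch computing those measures from raw counts; the counts are made up:

```python
# Sketch: performance measures from confusion matrix counts.
tp, tn, fp, fn = 40, 45, 5, 10  # hypothetical counts

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)    # recall
precision   = tp / (tp + fp)
f1 = 2 * (precision * sensitivity) / (precision + sensitivity)
print(accuracy, sensitivity, precision, f1)
```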
unsupervised learning
where you only have the INPUT data (x) and no output variables -goal: model the underlying structure or distribution in the data in order to learn more about the data -called unsupervised learning because, unlike supervised learning above, there are no correct answers and no teacher -algorithms are left to their own devices to discover and present the interesting structure in the data -no right answer; just group the data to establish a pattern