Data Science Foundations: Data Mining

The two types of algorithms you'll find in data mining

Classical statistics: methods based on familiar statistics, typically transparent, possible to calculate by hand. Machine learning: more complex methods, often opaque, that require substantial computing power.

If you want to cross-validate a model to ensure you don't overfit or make a model too specific to your sample, what can you do in R?

Do your statistical modeling on a training set, then run the model on your testing data.

When clustering in Python, why do you remove any categorical variables?

because you cannot calculate distance with categorical variables
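A minimal sketch (not from the course; the DataFrame and column names are made up) showing why categorical columns are dropped before a distance calculation: Euclidean distance is only defined for numeric values, so you keep the numeric columns (or one-hot encode the categories first).

```python
# Keep only numeric columns before a distance-based clustering step,
# since Euclidean distance is undefined for unordered categories.
import pandas as pd
from scipy.spatial.distance import pdist

df = pd.DataFrame({
    "income": [42000, 58000, 39000, 71000],
    "age": [34, 45, 29, 52],
    "region": ["north", "south", "east", "west"],  # categorical column
})

numeric_only = df.select_dtypes(include="number")          # drops "region"
distances = pdist(numeric_only.values, metric="euclidean")  # pairwise distances
print(distances)
```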

When you are looking for whether individuals are changing between different methods of responding you are looking for _____.

different states

general categories of clustering

distance between points, distance from centroid, density of data, distribution models

If you want to capture correlations between the attributes of your variables when clustering you would use the _____ model.

distribution

What component of regression analysis is used to make predictions?

slope

When performing regression analysis in RapidMiner what process does not require making any selections for parameters and settings?

Apply model

Why would someone use unstructured data?

Assessing authorship in documents: seeing who wrote something or looking for changes in voice over time. Another is clustering groups of respondents; clustering is a big topic in a huge number of fields, and using natural text can be a great way of getting more insight into clusters. Another very common use is sentiment analysis in social media, news articles, and blog posts: telling whether people are basically saying positive or negative things about something.

Algorithms for sequence mining

Many of the most common algorithms in sequence mining are based on Apriori association analysis. Others serve different tasks: HMMs, or hidden Markov models, instead test for state changes. And depending on what you're trying to get out of your data, even basic hacks, adapting or repurposing common procedures like decision trees and logistic regression, can still provide useful insight while capturing some of the sequential elements.

software for data mining

Text interfaces: programming languages that use written commands; easy to share and repeat. Graphical interfaces: specialized applications that use menus, widgets, and virtual connections; easy-to-see process.

If you fail to set an outcome variable as a binomial in Set Role in RapidMiner, what error will you receive?

The variable will be treated as an integer value

[T/F] The goal of clustering is to group cases in order to categorize them for your customers to analyze their responses.

True

[T/F] The order of events matters in sequence mining.

True

[T/F] Running a breadth-first search through a tree diagram is inefficient with a large data set.

True

Algorithms for classification

k-nearest neighbors, naive Bayes, decision trees, random forests, support vector machines, artificial neural networks, k-means, logistic regression

statistical reasons for data reduction

Avoid multicollinearity (when your variables are correlated with each other, which can cause real instability in the procedures you run); gain increased degrees of freedom; avoid overfitting (creating a model that fits your sample data really well but doesn't generalize to other situations, one of the cardinal sins in data mining).

naive bayes

Begin with overall probabilities of group membership, then adjust the probability with each new piece of information, using Bayes' theorem. It's called "naive," by the way, because it ignores the relationships between predictor variables.
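A small illustrative sketch using scikit-learn's GaussianNB on made-up numbers (not the course's example): it starts from the class priors and adjusts the probabilities for a new case using Bayes' theorem.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])
y = np.array([0, 0, 1, 1])

model = GaussianNB().fit(X, y)
print(model.class_prior_)                  # overall probabilities of group membership
print(model.predict_proba([[3.5, 4.0]]))   # probabilities adjusted for the new information
```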

Association analysis types

collective strength, conviction, leverage, lift.

Some of the uses of sequence mining

genetic sequencing, recommendation engines, marketing offers, behavioral "state switching"

[T/F] "Anomaly" is another name for an outlier.

False

bivariate and multivariate outliers

these are cases with unusual combinations of scores. these kinds of outliers can grossly distort the perceived relationship between the variables

What is the first step in creating a training set in Python?

create a random uniform variable

A few important aspects of regression

(Slope and intercept) It gives you the slope and the intercept. The intercept may or may not be important, but the slope shows that when a score increases by a certain amount on one variable, you expect it to increase by this much on the other; that can be really useful for making predictions. (Correlation) Even with the same regression line, you can have different levels of fit: how well that line describes the relationship, that is, how tight the dots are to the line. That's described by the correlation coefficient. (Requires normality) There are a couple of assumptions in the standard approach. One is that it requires normality: your distributions need to be approximately normal, or bell-curved, for this to work well. (Gives fit and requires linearity) Also, the simplest version assumes linearity, a straight-line association. That's not always going to work; especially if you're doing something called logistic regression, where your outcome variable is yes/no, you'll have a different kind of line and need a different variation on regression.
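A toy sketch (numbers invented, not from the course) showing the pieces this card describes: the slope and intercept used for prediction, and the correlation coefficient as the measure of fit.

```python
from scipy.stats import linregress

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

fit = linregress(x, y)
print(fit.slope, fit.intercept)             # expected change in y per unit of x, plus the intercept
print(fit.rvalue)                           # correlation: how tight the dots are to the line
predicted = fit.intercept + fit.slope * 7   # prediction for a new x value
print(predicted)
```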

details about classification in general

1. The categories already exist; they've already been determined, and you know what the slots are that you want to put things into. 2. The major question is where you put the new cases, not what the buckets should be. 3. The accuracy, or the output, of the model is entirely dependent on the variables included in the data set. If you don't have a wide range of variables, or you don't have the most useful variables, you'll have a very hard time getting an accurate and useful classification system.

Random forests

A collection of decision trees that randomly select cases and variables, or features; more reliable and less prone to overfitting. Random forests have been described by some as the closest thing we have to an off-the-shelf procedure for data mining: very easy to implement, very flexible, very little data preparation required, and very easy to interpret.
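A hedged sketch using scikit-learn and the built-in iris data (not the course's example) of a random forest as an ensemble of trees built on random samples of cases and features.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # accuracy on held-out data
print(forest.feature_importances_)    # rough interpretability aid
```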

data reduction helps with

Data reduction helps with practical constraints, like simply being able to fit the data on the computer and work with it. It helps with interpretability, by allowing you to focus on larger patterns. And it also improves statistical stability by avoiding the confounding influence of multicollinearity and the risk of overfitting.

Anomaly Detection vs outliers

Anomalies are things that are not supposed to happen; they're unusual events that usually signal a problem. An anomaly can also tell you that you need to revisit the validity of your entire categorization system. One way of finding anomalies is through outliers, and outliers represent a more general category.

outlier detection

Anomalies can signal problems that must be fixed and addressed, like diseases or potential fraud, and they're one kind of outlier. Outliers in general create analytic challenges by distorting individual measures or relationships and potentially leading to mistaken conclusions. Fortunately, there are several methods for handling outliers. As we go through the procedures, you'll see that some handle outliers better than others, and that might be something you want to consider when making an informed choice about your analytical procedures.

The most common statistical method, and the method used in market-basket analysis, of the options shown is _____.

Apriori algorithm

Given a store with a set number of items, can you predict that if someone puts one item in their basket, they will put another specific one in as well? Using what?

Association analysis using the steps of frequent itemsets and rule generation

What approach to finding bivariate outliers is the most common?

Bivariate normal distribution

What is the easiest way to find univariate outliers in R?

Box Plots

Goals of classification

Classification as a technique complements clustering. Clustering creates the buckets or boxes and classification puts new cases into them. It parallels the distinction between supervised and unsupervised learning in machine learning, which has to do with whether you actually have a label or a category membership for cases or not. And as always, the validity of the classification system that you create is limited by the data that's provided to the algorithm. And so you do want to make sure that you do have enough data and you have valid and useful data to feed into your algorithm and your classification system.

Goals of clustering

Clusters are pragmatic groupings; they allow limited interchangeability. The approach, and the validity of the clustering, varies according to the individual person's purpose, the data that's fed into the system, and the algorithm used.

When using Python to detect anomalies, what will you do after finding univariate distributions?

Create scatter plots to look at the bivariate distributions

Isn't more data better?

Data reduction helps you simplify the dataset and focus on variables or constructs that are most likely to carry meaning and least likely to carry noise.

Eclat

Eclat works a little differently. It calculates support for a single-item itemset; if support for that item is above minsup, the minimum support level, it goes down from that item and adds another one to it. It doesn't move sideways to the next single item; it starts with that one and immediately goes down the tree with it. When support falls below minsup as you go down the tree, it moves over to the next single item and starts adding things onto that as well. This approach has a different name and its own challenges: it's called a depth-first search. It tries to take one item and run as far as it can with it before moving on to another single item. In general, these kinds of approaches are fast, but the itemsets can get really long and difficult to manipulate, so again there's a trade-off.

k-NN

Find the k cases in multidimensional space closest to the new one; k is a number you select. Then take a majority vote on their categories.
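A minimal scikit-learn sketch (toy setup, not from the course): choose k, find the k nearest cases, and take a majority vote.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)   # k = 5, a value you choose
knn.fit(X, y)
print(knn.predict([[5.9, 3.0, 4.8, 1.8]]))  # majority vote among the 5 closest cases
```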

algorithms based on machine learning

Hidden Markov models, which are used for analyzing sequential data and changes over time; support vector regression, closely related to support vector machines but used to predict a quantitative outcome; decision trees and, when taken in groups, random forests; LASSO regression, a modern variation on regression for choosing variables; the Apriori algorithm, used in association analysis, or market basket analysis; and word counts using word stems, handling stop words, and comparing frequencies.

How to choose between classification methods

Human-in-the-loop or black box models

distance algorithms

Imagine you have a collection of points in two-dimensional space. You measure the distance from every point to every other point; here you're measuring the Euclidean distance, the direct distance between each pair. These are also known as connectivity models, and one of the neat things you get from them is a hierarchical diagram. One of the interesting choices is whether you want a joining or a splitting approach, technically known as agglomerative or divisive: you either start with all the data in one cluster and take it apart, or start with all observations separate and join them together sequentially. One of the tricks is that these distance models really only find convex clusters; in two- or three-dimensional space they'll find clusters shaped like apples or watermelons but not bananas, because those curve back in, and that's a limitation of many of the methods here. Key points: Euclidean distance, connectivity models, hierarchical diagrams, joining or splitting, convex clusters, slow for big data.
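A small sketch of the agglomerative (joining) version using SciPy on made-up points: compute Euclidean distances, build the hierarchy, then cut it into a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.1, 4.8], [9, 1]])

tree = linkage(points, method="ward", metric="euclidean")  # hierarchical (dendrogram) structure
labels = fcluster(tree, t=3, criterion="maxclust")         # cut the tree into 3 clusters
print(labels)
```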

Anomaly Detection goals

In essence what you're trying to do is set up a system that allows you to find the unexpected so you can then react or respond to it appropriately

One of the big decisions you have to make with data reduction are the algorithms that you're going to use for this procedure. What are the two general categories?

Linear methods: straight lines through the data; use linear equations. Nonlinear methods: for high-dimensional manifolds; use complex equations.

problems caused by outliers

Outliers can distort univariate statistics like the mean or the variance: they can greatly inflate the variance and distort the mean. Probably more important, they can distort the relationships between variables. A single outlier can change a correlation from positive to negative, depending on where it's located, because outliers tend to have an extraordinarily large influence on relationship statistics. The consequence is misleading results: you can see things the wrong way, mistaking positive for negative or a significant relationship for a nonsignificant one, and reach the wrong conclusion. That leads to one other problem, failure to generalize. If your results were based primarily on the effects of one or a small number of outliers, those outliers may not occur again in other settings, so your conclusions will have been distorted by a very small number of cases, they won't generalize well, and you can have a failure in implementation.

bivariate outliers (specific ways to use)

One choice is to use distance measures, where you calculate each case's distance from the center. There are a lot of choices for that (I'll cover those under multivariate), but they ignore the possibility of two-dimensional visualization, which is one of the neat things about working with bivariate relationships. One option is to show a bivariate normal distribution, which is just an ellipse over a scatterplot, and look for cases outside that ellipse. A more sophisticated approach is density plots, or more accurately kernel density estimates. These are like topographical maps that follow the density of the data; they can have irregular shapes, and you look for cases that fall outside them.

What to do about outliers?

Number one is to delete them: remove the cases, just throw them out of the data set. You can do that if they are few in number, but you have to explain your choice, and you may even have to demonstrate that deleting them doesn't change anything substantive in the other analyses. You can transform the variables, for instance by taking a logarithm of the variable or squaring it, which makes the scores and the distribution more symmetrical and usually works better with most analyses. Or you can use a more robust measure, one that is not strongly influenced by outliers. People are familiar, for example, with the median being less influenced by outliers than the mean, and procedures like decision trees tend to be less influenced by outliers than some other procedures.

Classification.

Once you have clusters or groups and you've got new cases you want to know which ones they should go into

univariate outliers (specific ways to use)

One easy way is to use a variance- or standard deviation-based measure, where you look for cases that are several standard deviations away; for instance, you might compute z-scores. The problem is that the outliers themselves influence the variance or standard deviation, making it less likely that they will be flagged as outliers, so that's a somewhat problematic approach. A more common one is to use quartile- or percentile-based measures, where each case's distance from the rest is based on the interquartile range, the middle 50% of the scores; this is probably the most common approach and the one I generally use. One that's not really statistical per se is to use experience. Maybe you're in a field with common standards for unusual scores; in medicine they might say, "If your white cell count is above this level, you probably have an infection," and in psychology some tests define one score as outpatient level and another as inpatient level.
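A short sketch of the quartile-based approach described above, using the common 1.5 * IQR boxplot rule on made-up scores.

```python
import numpy as np

scores = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 55])  # toy data with one outlier

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = scores[(scores < lower) | (scores > upper)]
print(outliers)   # flags the 55
```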

a few reasons that you might want to do simplification or reduction in variables, or fields

One is practicality: storage, memory, time. There are also interpretive reasons: reduce noise (reduce distractions and meaningless information); focus on patterns (make regularities easier to see by "zooming out"); easier to interpret (larger patterns are easier to interpret and use).

Most common choice for data reduction/dimensionality reduction and why

Principal component analysis is probably the most common overall, and it's really easy to interpret. And you'll find that interpretability, and simply the need for interpretability, is going to have a strong influence on informing your choice among methods for data or dimensionality reduction.

What does the analogy of projecting a shadow mean?

Really what you're trying to do is take data from high dimensional space where each variable is a separate dimension, and you're trying to project a shadow into a lower dimensional one. It's the same way that you can have a three-dimensional object and project a two-dimensional shadow of it and still tell what it is.

Basic idea of regression

Really, the general idea here is that you want to use many variables to predict scores or outcomes on one.

Goals of Regression

Regression is fundamental to pretty much any data analytic project, including data mining. It allows you to use many variables to predict one, and it does make some important assumptions about data, and that's where a lot of the theoretical work in regression and statistical analysis is going on, as we'll see in the future videos in this chapter.

dimensionality reduction

That's where you're trying to find important variables, or combinations of variables, that are the most informative, so you can ignore some of the ones that are noisiest.

Algorithms for association analysis

The algorithms are important because they allow you, first, to find the itemsets: what items occur together often enough to even be worth considering? Most of them, but not all, also generate the rules, the if-then contingency statements that say if this is in the basket, then there's a certain probability that this other thing is as well. And of course there's a trade-off: the different algorithms vary in their speed and in their memory demands. You'll need to look carefully at the exact performance with your data set and your needs as you make an informed choice of algorithm for your own association analysis.

centroid based models

The center point, the centroid, is defined by a mean vector: the mean for that group on every variable in the dimensional space. The most common version is k-means, where you simply define how many centroids you want, and the algorithm figures out how far each point is from the different centroids. Like the distance models, centroid models find convex clusters, but with an extra limitation: they're generally of similar size. The other major issue is that you have to choose the value of k, the number of clusters you want. There are some interesting adaptations that work around that, but generally you just pick a number, give it to the computer, and see how well it works. Key points: defines groups by mean vectors, k-means, convex clusters of similar size, must choose k.
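A toy scikit-learn sketch: you choose k, and k-means places k centroids (mean vectors) and assigns each point to its nearest centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.cluster_centers_)   # the centroids (mean vectors)
print(km.labels_)            # cluster assignment for each point
```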

supervised learning, or labeled learning

The data now has a true class or outcome variable; accuracy is now the guiding consideration.

Suppose you are a grocery store manager and are interested in the likelihood customers who buy wine also buy beer. What would the "lift" statistic tell you?

The likelihood customers will have beer in their carts when they have wine as opposed to having beer by chance

multivariate outliers (specific ways to use)

There are distance measures that generally measure the Euclidean distance, a straight-line distance, from the center of the data set, the centroid. The most common version is the Mahalanobis distance, which is really just a straight vector measurement of how far something is from the standardized centroid of the data. There are, however, a lot of robust measures of distance: same idea, but not as sensitive to variations in the standard deviation or the variance of the scales. Then there are density measures, where you look at the local density of data in a multidimensional space. Multivariate kernel density estimation is the most common approach; these are more flexible and more robust, and they can give irregular shapes. That sounds like a good thing, but they tend to be really hard to describe and hard to generalize from one situation to another, so that's a trade-off.
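A NumPy sketch (made-up points) of the Mahalanobis idea: distance from the centroid that accounts for the variances and correlations of the variables.

```python
import numpy as np

X = np.array([[2.0, 3.0], [2.2, 2.9], [1.9, 3.1], [2.1, 3.0],
              [2.0, 2.8], [2.3, 3.2], [1.8, 2.9], [2.1, 3.1],
              [6.0, 9.0]])                      # last case is an outlier

centroid = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diffs = X - centroid
d2 = np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)  # squared Mahalanobis distances
print(np.sqrt(d2))   # the outlying last case should have the largest distance
```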

Algorithms for anomaly detection

There are methods for both visual and numerical analysis in terms of identifying outliers. Second, there are means-based methods, like Mahalanobis distance, and there are more robust methods, say, for instance, with univariate, the IQR, and with multivariate, the kernel density estimators. It's nice to have measures that are robust and are not so sensitive, but those are often harder to interpret and harder to generalize, and so it becomes a trade-off of what's important for your particular purposes with your particular data set.

univariate outliers

These are cases with unusual scores on a single variable. The case may be unusual on several other variables, but you're only focusing on one variable at a time. The problem is that such outliers tend to distort a lot of your statistics.

Groups of Convenience

They're not platonic, real universals that cut nature at its joints. They're groups of cases that are similar to each other, but with a pragmatic goal.

Sequence Mining

Time-ordered data, find events that typically precede or follow each other

SVM (support vector machine)

Uses the "kernel trick" to find the hyperplane that cleanly separates two groups. It's a very sophisticated model: the kernel trick makes it possible to find a hyperplane in very high dimensions, possibly more dimensions than you actually have in your data, to cleanly separate two different groups. It's an attempt to draw a straight line through what's a very squiggly surface by putting things up at angles in multiple dimensions.
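A scikit-learn sketch on synthetic ring-shaped data: an RBF-kernel SVM separates groups that no straight line in the original space could.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

svm = SVC(kernel="rbf", C=1.0).fit(X, y)
print(svm.score(X, y))   # near-perfect separation of the two rings
```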

Apriori

What Apriori does is calculate support for single-item itemsets. For one item at a time, it gets the support, meaning how common that item is in the data set you're looking at. If the support for that item is less than some designated minimum support amount, minsup, that you set, it throws that item out; it only keeps items above that critical value. Once it's gone through all the single-item evaluations, it goes to the two-item itemsets, applies the same minsup criterion, throws out the ones that are too low and keeps the rest, then goes to the three-item itemsets, and so on. There are a few challenges with this approach. Working through all the single items, then all the two-item sets, then all the three-item sets is called a breadth-first search through a tree diagram, or BFS, and it involves making a lot of passes through the data; in fact, you have to make as many passes as you'll have items in an itemset. Consequently, the computation time can grow exponentially. This approach, while very common and easy to deal with, can be inefficient with a large data set where you have either a large number of items or a lot of items in an itemset.
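A pure-Python sketch of the breadth-first idea (toy transactions, not an optimized implementation; real tools such as the mlxtend or arules packages do this far more efficiently): keep single items above minsup, then extend to two-item sets, and so on, making another pass over the data at each level.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "beer", "eggs"},
    {"milk", "beer", "cola"},
    {"bread", "milk", "beer"},
    {"bread", "milk", "cola"},
]
minsup = 0.4  # minimum support as a proportion of transactions

def support(itemset):
    # proportion of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
k = 2
while frequent:
    print(k - 1, [(set(s), support(s)) for s in frequent])
    # another full pass through the data for each itemset size (breadth-first)
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= minsup]
    k += 1
```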

Algorithms for classification

When you're doing classification as a machine learning exercise, you have an enormous range of options. These options vary, of course, in the difficulty of their computations and the assumptions that go into them, but more than anything they vary in their interpretability. When choosing among this huge range of algorithms, you want to choose based on the specific purposes to which you intend to put the results of your analysis, and specifically how they will be implemented. Keep those in mind and you'll be able to make a more informed and more useful analysis for your data science projects.

Association analysis Algorithms

Of the algorithms available, probably the most common is Apriori. Another very common one is Eclat, which stands for equivalence class transformations. There's also FP-Growth, which stands for frequent pattern growth; RElim, for recursive elimination; SaM, for split and merge; and JIM, for Jaccard itemset mining.

When using R, what is the first step you need to perform in hierarchical clustering?

create a distance matrix

decision trees

Find the variables and values that best split cases at multiple levels; follow the largest branches to a leaf at the end (the final category).

predicting scores.

find variables that can be used to predict outcomes. Use regression models

Human-in-the-loop

If humans make decisions using principles from the results, then use transparent methods like decision trees or naive Bayes for classification.

black box models

if the algorithm directly controls the outcome and accuracy is paramount, opaque methods like SVM and ANN are acceptable for classification.

distributional models.

Imagine we have our points and we draw a sort of normal shape around them; in a bivariate model that looks like an ellipse, so here I've drawn an approximate ellipse around the points. What we're doing is taking the clusters and modeling them as statistical distributions. You have a lot of choices, but by far the most common is a multivariate normal distribution. One major problem is that the distributional approach is prone to overfitting: if you give the computer free rein, it will just add more and more distributions until you end up with models that really only apply to the exact data you currently have. You can overcome that by telling it how many distributions to work with, sort of like having to tell k-means how many clusters you want. One advantage is that it's good at capturing the correlations or interrelations between the attributes in your data, which none of the other approaches do specifically. On the other hand, it's also limited to convex clusters, because that's the shape of the distributions. Key points: clusters modeled as statistical distributions, multivariate normal, prone to overfitting, good for correlations.
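A scikit-learn sketch on synthetic points: a Gaussian mixture models each cluster as a multivariate normal distribution, with the number of components specified up front.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
cluster_b = rng.normal(loc=[4, 4], scale=0.5, size=(50, 2))
points = np.vstack([cluster_a, cluster_b])

gm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(points)
print(gm.means_)          # centers of the fitted distributions
print(gm.covariances_)    # captures correlations between the attributes
```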

Text Mining

In unstructured text data, find words and phrases that define voices and distinguish texts. Text mining can generally give you insight into something even without the structured data that's so common in the rows and columns of spreadsheets and databases.

Classification

is an attempt to place new cases into the correct bucket. So, think of it as simply choosing the right bucket or box to put something in.

Types of classical statistics algorithms

linear regression, k-means classification, k-Nearest Neighbors, and hierarchical clustering.

The problem of having a large number of dimensions in which variables are associated with each other is _____.

multicollinearity

ANN (artificial neural networks)

Multiple layers of equations and weights used to model nonlinear outcomes. They're modeled after biological neurons, which receive input, do some processing, and send output, at several different levels.

For better interpretability you would like to have some additional graphics with your results. What PCA function would you select?

principal

Association analysis, or association rules learning,

Provides an entire collection of really powerful methods for finding groups of items that tend to go together. You're able to find how common each individual item is, as well as the conditional probabilities, or confidence, that go with each item. One of the most important things is that this analysis can suggest additional paths, like recommending additional items, or other elements you may need to investigate to make sure you get a well-rounded picture of the thing you're analyzing.

You are trying to find time-ordered data. What process allows you to determine that if a particular event occurs another event will likely happen next?

sequence mining

some strengths and limitations of clustering algorithms

Some only find convex shapes, some produce equal-size clusters, some are difficult for large data sets, and some are hard to describe parametrically. So when it comes down to actually conducting your cluster analysis, you'll need to be mindful of these options and try to choose an algorithm that fits both your data and the purposes or pragmatic applications of your analysis.

unsupervised learning, or unlabeled learning

the data doesn't have any "true" classes or criteria; groups are shaped by similarity, not accuracy.

Clustering or Cluster Analysis general

the general idea is to take an entire collection of cases or observations and put them together so like goes with like. Now the important thing to remember is these are not natural groupings, rather these are Groups of Convenience.

principal component analysis or PCA

The most common linear method is principal component analysis, or PCA. PCA is designed to reduce the number of variables, the number of dimensions, and it does this by trying to maximize the variability of the data in the lower-dimensional space into which it's projected. Considerations: rotation (with multiple components, rotated solutions can be easier to interpret); factor analysis (factors are closely related but based on a different theory); interpretability (for human use, the ability to interpret is critical; less so for machine learning).
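A scikit-learn sketch using the built-in iris data: standardize the variables, then project four dimensions down to two principal components.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # put variables on the same scale

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)           # 4 variables reduced to 2 components
print(pca.explained_variance_ratio_)        # variance retained by each component
print(scores[:5])
```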

If you were a marketer of technology products, what would "confidence" tell you in the context of association analysis?

the probability a laptop purchaser will also purchase flash drives

nonlinear dimensionality reduction, you have a few choices

There is a variation of principal component analysis called kernel PCA that uses the so-called kernel trick as a way of analyzing high-dimensional data, and there are related methods of getting the kernel. There are Isomaps, something called locally linear embedding, and a lot of other variations. A more interesting one, with a sort of semi-flexible kernel, is maximum variance unfolding. They work in different circumstances, but the trick is they're all pretty complicated, and it's hard to interpret the results. If you're using a straight black-box method, that doesn't matter, but if a human is involved, you'll want to emphasize interpretability.

categorical outliers

This is a case that's in an uncommon category. The general rule of thumb, and this is going to depend greatly on the kind of analysis you're doing and the size of your data set, is that if less than 10% of your cases are in a category (and we have less than 2% here in the PhD example), that represents a categorical outlier.

density algorithms

This is where you look at the multi-dimensional space, compute the density of points, and draw a border around them; you're connecting dense regions in k-dimensional space, where k is the number of variables or attributes you have. What's neat is that these approaches can model nonconvex clusters and clusters of different sizes, so they overcome some of the limitations of other approaches. It's also possible to ignore outliers or unusual cases, because they're not in dense regions. The major problem is that it's hard to describe exactly how the algorithm arrived at the clusters: you don't have a parametric distribution to say this cluster is defined by high scores on this variable and low scores on that one, because it might be an unusual shape that wraps around. Key points: connects dense regions in k dimensions, can model nonconvex clusters and clusters of different sizes, can ignore outliers, hard to describe.
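A scikit-learn sketch on synthetic banana-shaped data: DBSCAN connects dense regions, finds nonconvex clusters, and marks isolated points as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

points, _ = make_moons(n_samples=200, noise=0.05, random_state=0)  # two banana-shaped clusters
points = np.vstack([points, [[3.0, 3.0]]])                         # one isolated outlier

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(points)
print(set(labels))   # e.g. {0, 1, -1}: two nonconvex clusters plus noise (-1)
```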

nonlinear methods for data reduction. Specifically, nonlinear methods for dimensionality reduction

Useful for nonlinear manifolds (for example, a one-dimensional shape like an S placed in a higher-dimensional space; it's a manifold because it's embedded in that higher-dimensional space). Used in computer vision and similar fields. Difficult to interpret.

frequent itemsets and support

A frequent itemset is a combination of items that appear together, and you calculate an index called "support" that indicates how often those items appear together: supp(X) = (transactions containing X) / (total transactions T), i.e., the proportion of transactions T that contain itemset X. Set a minimum level: minsup.

rule generation and confidence

This is a collection of if/then statements, where you calculate something called "confidence," a metric similar to conditional probability: if they purchase this item (or these three items), then they have a high probability of purchasing this other item as well. conf(X => Y) = supp(X ∪ Y) / supp(X): if X occurs, what is the (conditional) probability of Y? Set a minimum level: minconf.
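A tiny pure-Python sketch computing support and confidence from a handful of made-up transactions, matching the formulas on these two cards.

```python
transactions = [
    {"wine", "cheese"},
    {"wine", "beer"},
    {"beer", "chips"},
    {"wine", "beer", "chips"},
]

def supp(itemset):
    # proportion of transactions that contain the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"wine"}, {"beer"}
confidence = supp(X | Y) / supp(X)   # conf(X => Y) = supp(X u Y) / supp(X)
print(supp(X), supp(X | Y), confidence)
```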

Anomaly Detection

When you're trying to find cases that are different from all the others, that don't fit into the clusters, and that aren't associated with other things in predictable ways. Sometimes you want to exclude those cases so you can focus on the rest, and sometimes you want to see why they are different and whether that gives you any clues into how normal situations operate.

What about correlated predictors and regression ? multicollinearity

When your predictor variables are associated with each other, the problem is that the slopes, or regression coefficients, associated with each predictor change when the context changes: with a different group of variables, and if they overlap, they come out differently. It also turns out that the order of entry of the variables into the equation can dramatically affect the values of these slopes or regression coefficients, so a lot of work has been done on methods for bringing in multiple correlated variables, which we'll cover in the video on algorithms. That said, one of the beautiful things about regression is that it's general purpose. Regression can be adapted to nearly any data situation: while I describe the straight-line situation with a quantitative outcome, it can be adapted to quantile regression, to logistic regression for binary outcomes, to multinomial outcomes, and to curved lines. There are so many things you can do with it, plus it has one major advantage: it's generally very easy to interpret the results of a regression equation.

Clustering

where you're trying to find sort of dense collections of points that are associated with each other, because that allows you for instance to treat those clusters as perhaps functionally similar for certain purposes.

Clustering Writers

Clustering the writers, the people who are creating the text. Because this is unprompted text (they're not responding to a specific question, and they're not creating it for you), it can be more informative than structured data, where people can get apprehensive or very brief. Some interesting uses include identifying illness; for instance, in psychology you can use tweets to identify depression, or even psychopathic tendencies. In a commercial setting, you can put the different writers into groups and tailor the offers you give them; if people are writing in questions to a company, it can help give more specialized responses; and you can even use it to create interventions.

[T/F] Classification creates "buckets" and clustering puts new cases into "buckets".

FALSE

[T/F] An advantage of regression analysis in Python is that you do not need to scale your variables.

False

[T/F] Stepwise regression, conducted in R, tends to have higher generalizability than stagewise regression.

False

[T/F] When you are doing text mining you will use structured data.

False

categories of choices for sequence mining

GSP, SPADE, FreeSpan, HMM

What type of data mining in Python provides the best way of looking for changes in states?

Hidden Markov model

classical regression methods

In terms of classical methods, there is simultaneous entry, where you take a whole collection of variables, throw them in all at once, and see how they work together as an ensemble. You can also do blocked entry, where you choose a group of variables, put them in, then add a second group, and a third group. Or there's stepwise: an automated procedure whereby the computer chooses the variable with the highest correlation with the outcome and puts that in, then uses partial correlations to choose the next highest, puts that in, and so forth. It sounds like a nice way to do things, handing it over to the data, but stepwise entry is, in many situations, massively prone to overfitting, producing models that only fit those exact data and capitalize on chance. That's a problem, so most people don't recommend stepwise entry; in fact, they recommend against it rather strongly. There are also nonlinear methods: if you have a curvilinear relationship, even within classical methods there are ways of dealing with that, usually by transforming a variable or taking a power of the variable.

modern regression methods

In the class of modern methods, there's LASSO regression, which stands for Least Absolute Shrinkage and Selection Operator. It's a nice way of doing something similar to stepwise regression, but without the risk of overfitting and the breakdown in generalization. There's also least angle regression, which is related in some ways. There's RFE, or recursive feature elimination, sort of like a stepwise procedure but actually in a class of embedded methods, often used with support vector machines in machine learning. And on the same topic of machine learning, very similar to the support vector machine, or SVM, there is support vector regression, or SVR. It uses very advanced, high-dimensional calculations based on what's called the kernel trick to find a hyperplane, sort of a flat plane, that can separate the data and predict values very cleanly. On the other hand, RFE and especially support vector regression can be very hard to interpret.
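A scikit-learn sketch on simulated data: LASSO shrinks the coefficients of uninformative predictors toward zero, giving stepwise-style variable selection with less risk of overfitting.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)  # only 2 informative predictors

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # coefficients for the uninformative predictors are driven toward 0
```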

Unstructured data

It's not numerical data that's in rows and columns or in lists, it's just a blob of text

Bag of Words

Interestingly, a lot of machine learning algorithms work this way: they break the text down into chunks, or tokens, and analyze it like that. Naive Bayes is one, neural networks are another, and so are k-means clustering, support vector machines, and the common TF-IDF (term frequency-inverse document frequency) vectorization. Sometimes you simply mark whether a word is present in a document or not (binary presence), or you weight it by how frequently it occurs (TF-IDF). Either way, you can still get meaning out of what you're doing.
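A scikit-learn sketch on made-up documents showing both options the card mentions: binary presence of each token versus TF-IDF weighting.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the product was great and shipping was fast",
    "terrible product, slow shipping",
    "great price, great product",
]

counts = CountVectorizer(binary=True).fit_transform(docs)   # binary presence of each word
tfidf = TfidfVectorizer().fit_transform(docs)               # weighted by frequency and rarity
print(counts.toarray())
print(tfidf.toarray().round(2))
```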

If you are mining text in R, what items would you remove to provide the most meaningful analysis?

Sparse items

hacks that allow you to make sense of data that has some order in it

In addition to these legitimate approaches and several other related ones, there are some hacks. These aren't, properly speaking, algorithms for sequential analysis, but they allow you to make sense of data that has some order in it. Number one is decision trees. Even though we use those for a lot of other things, like categorizing between two groups, if at least one of the variables in your data set measures previous conditions, so you actually capture a little bit of before and after, then that gives you a short-term time perspective that can be incorporated into the decisions in the decision tree. Another is logistic regression, in which you model a binary outcome. As with decision trees, if at least one of your variables is measured at t minus one, that's time minus one, or the time before, then you can get a really nice, parsimonious model of the current outcome that's easy to interpret.
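A hedged sketch of the logistic-regression hack (made-up data and column names): add a lagged, t minus 1, version of the outcome as a predictor so the model captures a bit of the sequence.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

events = pd.DataFrame({"active": [0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1]})
events["active_lag1"] = events["active"].shift(1)   # state at time t - 1
events = events.dropna()                            # first row has no previous state

model = LogisticRegression().fit(events[["active_lag1"]], events["active"])
print(model.coef_, model.intercept_)   # how strongly the previous state predicts the current one
```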

When you're looking at these various regression methods, what are a few things you want to think about?

Number one: how well can this method explain the current data? How well can it model the association between the predictors and the outcome in front of me? Some are better at that than others; however, they might do it by overfitting, which is a real problem. That gets us to the next question: how well does each method generalize to new data? What you find is that the modern methods are usually much better suited to generalization, and they often have cross-validation built in as a way of checking the assumptions of the original model. Then there's ease of calculation: a lot of the classical methods were built to be calculated by hand, which makes them easier to explain and demonstrate, but given that nobody does this by hand anymore, everything is done by computers, and computers keep getting faster, that's essentially a non-issue. On the other hand, there's ease of interpretation: can you explain what it all means? That might be really important. And then, perhaps ultimately, ease of application: can you take the results from your model and do something useful with them? For many people, nothing else matters as much as whether the model applies to new data and can be used to generate new insights.

Two general categories for algorithms for dealing with text.

One does the intuitive thing and focuses on the meaning of what's being said. For instance, you'll have algorithms that identify parts of speech (this is a verb, this is an adjective, and so on), identify sentiment (this is a positive statement, this is a negative statement), and use the meanings of words, like the topics of a text, to analyze the text. That's pretty sophisticated processing. What's interesting, though, is that the other approach, the so-called bag of words, also works well in many situations. These are methods that treat words simply as individual tokens of distinct categories without even understanding their meaning. They could be shapes for all you know, or numbers; in fact, it turns them into numbers. You lose the order and you don't look at the particular function of a word; you're simply counting how often it happens, and maybe what it happens next to.

The algorithm that you choose, the method that you use for measuring the association between variables can affect the meaning and interpretation of the results substantially. What are two general classes of regression.

One is classical methods, or algorithms, for regression. These are methods based on means, or averages, and squared deviations from predicted values. There is also a very broad category of what you can call modern methods: alternative methods for calculating distance and for choosing between predictors that may be correlated with each other.

This algorithm finds patterns in the database during the first pass, then splits them into sub-databases. The downside is that the algorithm is memory-intensive.

PrefixSpan

Algorithms for text mining General

The algorithms for mining text vary in their emphasis on meaning. Some place a lot of emphasis on it and try to model it with great care; others ignore it completely. Interestingly, the simple methods, like the plain old bag of words that simply indicates whether a word occurs or not, can be sufficient for certain tasks. The more complex methods are reserved for natural language processing, where the computer, for instance, is trying to understand what you're saying, infer your meaning, and answer your questions. Either way, you want to choose an algorithm that fits your goals and your task and helps you get the insight you need for your particular data science project.

Sequence mining general

Sequence mining allows us to mine data for common sequences, or orders, or chains: events or objects in a row. It can be used to tell somebody what to do next or to predict what's going to happen next, and in basic behavioral research it can also be used as evidence of state switching in cognition, affect, or behavior.

Goals of sequence mining

Specifically, what you're doing is you're looking for chains in the data, chains that repeat. And you can think of it this way, here we've got a whole lot of letters. It's just a random blob of code. And we wanna see if there are any regularities in that where we can say, well, if this and this happen, then this and this will probably happen next. As it turns out, there are some regularities in this data. The key here is that we're looking for temporal or ordinal associations. It's like association analysis. That's the market basket analysis, and you wanna say if a person gets x and y, then they also get z, and we're trying to do that. We're looking for events that go together, but the major difference is that in this case, the order of events matters. In a market basket, it doesn't matter if it's x, y, and z, or z, y, and x, or whatever. But with sequence mining, those are different and they matter. Now, some of the uses of sequence mining are, of course, genetic sequencing.

What is a "Bag of Words" in text mining?

The algorithms break text down into chunks or tokens and analyze the occurrence of those.

Algorithms for regression general

There's a wide selection of both classical and modern algorithms, with different strengths to each of them. The problem is some of these methods, especially the classical ones, are prone to overfitting and they have problems with generalization. Then again, probably what's even more important is the ability to both interpret and apply the results of what you're doing in a useful situation to get you some extra insight into what's going on with your data.

text mining general

There's value in unprompted information. Second, you can cluster people and get really unprompted evaluations, spontaneous evaluations, of people and things. And there are both commercial and basic research uses for all of these procedures.

Sentiment Analysis

This is where you're trying to measure how a person, a product, an idea, an event, is received: what people are saying about it. It's especially important on social platforms, where companies have the opportunity to see what people are saying about them and then tailor their message appropriately.

When you are mining text in Python you need _____ which is a collection of words that can be used to count things.

a corpus

HMM

Hidden Markov models. These are mostly looking at a different kind of question. The other algorithms are trying to find this, then this, then this; hidden Markov models are looking for switches in state conditions, or qualitatively distinct patterns of behavior. Is a person switching from one mode of doing something to a different mode? Is the economy switching in some way? What's really neat is that it's easy to test specifically framed hypotheses to see how well your theory about states matches the data using this approach.
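A hedged sketch assuming the third-party hmmlearn package (not part of the course materials): fit a two-state Gaussian HMM to a toy series and recover the most likely state at each time point, i.e., where the series switches modes.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
low_state = rng.normal(0, 1, size=(100, 1))    # toy series: one regime...
high_state = rng.normal(5, 1, size=(100, 1))   # ...then a shifted regime
series = np.vstack([low_state, high_state])

hmm = GaussianHMM(n_components=2, n_iter=100, random_state=0)
hmm.fit(series)
states = hmm.predict(series)     # inferred state for each time point
print(states[:5], states[-5:])   # the switch shows up around the midpoint
```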

Meaning based

Meaning-based approaches get into the field of NLP, or natural language processing, which is a very big field; it's how your phone knows what you're saying to it. Technically it's still not meaning, it's still a digital machine turning text into numbers, but it takes a more nuanced approach. For instance, this is where you get something like a hidden Markov model, or HMM, which tries to get at changes in operations and infer some of the behaviors behind what's happening. Or something like latent Dirichlet allocation, or LDA, which uses topic modeling: it tries to decide what the topic of a document is in order to form unobserved groups that can be used to understand text.

common cases for classification

spam filters, fraud detection, genetic testing, psychological diagnosis

GSP,

stands for Generalized Sequential Patterns; this is a very common, sort of basic, approach. It's very similar to the Apriori analysis we use in association analysis, but in this case it observes, or respects, the order of events: X, then Y, then Z is different from X, then Z, then Y, and different from X, Y, and Z simultaneously, whereas those would all be treated equivalently in a standard Apriori market basket analysis. GSP also allows you to use a sliding window to determine when events are, or should be treated as, simultaneous. The trick is that, like Apriori, it has to make a lot of passes through the data, so with a large data set GSP can be a rather slow way of looking for sequences.

SPADE

stands for Sequential Pattern Discovery using Equivalence classes (you have to be a little creative to get SPADE out of that). SPADE is designed to be faster: it primarily does fewer database scans by using what are called intersecting ID-lists. It starts with a vertical ID-list in the database, scans through it once, and then stretches it out into a two-dimensional matrix of the first and second items. It can then use the results of that as the first item, add on a third, and so forth. So it's faster, but there are other options that may be even better.

Which regression analysis method provides results that are the most difficult to interpret?

support vector regression

Comparing Voices.

There's a term, stylometry, analyzing styles, and it dates all the way back to the 1400s. Obviously it was done manually at that time, and it was used successfully in the 1960s for authorship. A really good place to use this is in fraud detection: if a person is entering text, or is speaking out loud and the speech is being transcribed, you can get an indication of whether this is the person it's supposed to be, and that's one major use of this in a business setting.

PrefixSpan

This is an example of what's called a pattern-growth algorithm, as opposed to an Apriori-based algorithm. It avoids Apriori's candidate generation altogether and instead focuses on a restricted portion of the data. PrefixSpan, by the way, stands for Prefix-projected Sequential Pattern Mining. It takes a first pass, finds some patterns in the database, splits them into sub-databases, and then grows them independently in each of those. It's a much faster algorithm, but it can be memory-intensive because it does a lot of things simultaneously.

Association Analysis ( Market Basket Analysis)

this is where you try to find that when A occurs, B occurs also or when A is present, B is generally present as well.

