609 Final
Minkowski Generalization
generalizes Manhattan and Euclidean distance into a single family of distance measures (parameterized by p); more robust, but still sensitive to the scale of the data points
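A minimal sketch in Python (assuming NumPy is available; the function name minkowski_distance is just illustrative):

    import numpy as np

    def minkowski_distance(x, y, p):
        # p=1 gives Manhattan distance, p=2 gives Euclidean distance
        return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

    # Same two points under different values of p
    print(minkowski_distance([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
    print(minkowski_distance([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)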
TFIDF
combines term frequency and inverse document frequency; the resulting weight is specific to a term within one document
T Test
compares two specific groups to determine whether the difference between them is statistically significant
Competitive Advantage
data and data science capability are complementary strategic assets -> value is dependent on other strategic decisions (context)
Topic Models
first model the set of topics in a corpus separately -> the topics are learned from the data (a latent information model); the terms associated with each topic and their weights are learned by the topic modeling process
Causality
looking for which events influence others; with secondary data, causality is inferred rather than observed
Inverse Document Frequency
measures sparseness: not too rare (depends on the application; impose a lower limit), not too common (doesn't distinguish anything or provide information; impose an upper limit); takes into account the distribution: if a term occurs in only a few documents, it is likely to be important in those documents. IDF = 1 + log(total # of documents / # of documents containing the term)
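A minimal sketch of the IDF weighting above in plain Python (the tiny corpus is made up for illustration):

    import math

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs",
    ]

    def idf(term, documents):
        # number of documents containing the term (must be at least 1)
        doc_count = sum(1 for doc in documents if term in doc.split())
        return 1 + math.log(len(documents) / doc_count)

    print(idf("the", corpus))   # common term -> low IDF
    print(idf("cat", corpus))   # rarer term  -> higher IDF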
Data Analytic Thinking
-assess whether and how data can improve performance -you don't have to DO data science, but understand it so you can be strategic with it (devise a few probing questions to determine the plausibility of proposals; help the people responsible for decision making comprehend it)
Learning Curve
A plot of the generalization performance against the amount of training data is called a learning curve. The learning curve is another important analytical tool.
Dataset
A schema and a set of instances matching the schema. Generally, no ordering on instances is assumed. Most data mining work uses a single fixed-format table or collection of feature vectors.
Bag of Words
treat every document like a collection of individual words
Confusion Matrix
A matrix showing the predicted and actual classifications. A confusion matrix is of size l × l, where l is the number of different label values. A variety of classifier evaluation metrics are defined based on the contents of the confusion matrix, including accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, sensitivity, specificity, positive predictive value, and negative predictive value.
Manhattan Distance
A measure of travel through a grid system, like navigating around the buildings and blocks of Manhattan, NYC; sums the absolute differences along each dimension
Cost (utility/loss/payoff)
A measurement of the cost to the performance task (and/or benefit) of making a prediction ŷ when the actual label is y. The use of accuracy to evaluate a model assumes uniform costs of errors and uniform benefits of correct classifications.
Dimension
An attribute or several attributes that together describe a property. For example, a geographical dimension might consist of three attributes: country, state, city. A time dimension might include 5 attributes: year, month, day, hour, minute.
Key Assets
Data: get the right information. Capability: extract useful knowledge from it.
Tech Stack
Presentation to User -> Data Warehouse (analytical) -> Extract, Transform, Load -> Data Lake -> Data Sources
Association Mining
Techniques that find conjunctive implication rules of the form "X and Y → A and B" (associations) that satisfy given criteria (Provost & Fawcett, Data Science for Business, p. 485).
Missing Value
The situation where the value for an attribute is not known or does not exist. There are several possible reasons for a value to be missing, such as: it was not measured; there was an instrument malfunction; the attribute does not apply, or the attribute's value cannot be known. Some algorithms have problems dealing with missing values.
4 Types of Predictive Analytics
classification, regression, clustering, association
Entropy
how much disorder exists in a dataset, group, or segment; computed as minus the sum, over values, of the probability of the value times the log (base 2) of that probability
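A minimal sketch of the entropy calculation in Python (the probability lists are illustrative):

    import math

    def entropy(probabilities):
        # entropy = -sum(p * log2(p)) over the segment's class proportions
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([0.5, 0.5]))   # 1.0   -> maximum disorder for two classes
    print(entropy([0.9, 0.1]))   # ~0.47 -> mostly one class, low disorder
    print(entropy([1.0]))        # 0.0   -> pure segment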
Induction
how we build models from data; we already have the target (ending) values and work backward to knowledge of how the attributes lead to them
Supervised Segmentation
identify segments to look for patterns, find subgroups by attribute, determine if attribute is important
Decision Trees
make the first split on the most informative attribute and continue until a stopping criterion is met; the splits also serve as decision boundaries; sometimes use probability estimation instead of a categorical target; use ranking to focus on the nodes with the greatest probability; use smoothing (Laplace correction) to temper frequency-based estimates, because small leaves may not be representative of the training data
Strategic Database
managers review high level statistics to make overarching decisions
Poisson Regression
optimal when predicting "counts." For example, you would use a Poisson regression to predict the vote counts received by each election candidate. As a result, the dependent variable would never be negative and would always be a whole number (i.e., an integer).
KDD
originally was an abbreviation for Knowledge Discovery from Databases. It is now used to cover broadly the discovery of knowledge from data, and often is used synonymously with data mining.
Bias-Variance Tradeoff
overly simple = ignores complexity, errors due to bias; overly complex = additional splits, errors due to variance, overfitting to the training data. *Impose limits on tree growth, but at a certain point errors from bias creep back in; find the balance.
Data Science
principles, processes, and techniques for understanding phenomena via automated analysis of data
Characteristics of a Good Measure
simple, easily obtainable, precisely definable, objective, robust
Data Visualization Theory
simplicity, certain charts, arrangement, provide context, creativity
Data Reduction/Latent Information/Movie Recommendation
take a large set of data and replace it with a smaller set that preserves the important information; latent = relevant but not observed explicitly in the data (observe the informational split and apply domain knowledge to interpret it)
Type 1 Decision
use discoveries about the data to make new decisions
Support Vector Machine
want to see which class is more likely than the others: estimate the probability of an instance belonging to a class, or use a score of how likely the instance is to belong to a class; the distance of an instance from the decision line gives a RANK of more or less likely. Erroneous predictions that occur within the margin get a minimal penalty (or none at all); outside of the margin they are penalized more and more severely (hinge loss). *If the classes are not linearly separable, balance the width of the margin against the penalty for misclassification.
Euclidean Distance
the straight-line distance, or shortest possible path, between two points; the most common distance measure but the least robust
Recall
true positives over (true positives + false negatives): TP / (TP + FN)
Precision
true positives over (true positives + false positives): TP / (TP + FP)
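A minimal sketch of both measures in Python (the counts are made up for illustration):

    def precision_recall(tp, fp, fn):
        # precision: of everything predicted positive, how much was actually positive
        # recall:    of everything actually positive, how much was found
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return precision, recall

    print(precision_recall(tp=80, fp=20, fn=40))  # (0.8, ~0.67)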
Bayesian Linear Regression
typically more accurate than a regular linear regression, and still dependent on the assumption of a linear relationship between the dependent variable and each independent variable; uses linear regression supplemented by additional information in the form of a prior probability distribution. Prior information about the parameters is combined with a likelihood function to generate estimates for the parameters.
Term Frequency
use the word COUNT (frequency) to differentiate how many times a word is used; sometimes importance increases with frequency. Normalize the case, stem (remove suffixes), and remove stopwords (common words); numbers are commonly regarded as unimportant details. Raw count vs. normalized (relative to the length of the whole document).
Linear Regression Assumptions
variables have a linear relationship; multivariate normality: distribution/skewness/kurtosis; no multicollinearity: independent variables not too highly correlated with each other; no autocorrelation: residuals should be independent of each other; homoscedasticity: variance should be the same along the regression line
5. Evaluation
verify the selected model can achieve the business objectives; need a strong evaluation framework and an updated cost/benefit analysis. Deliverables: evaluation report, process review, next steps (recommendations)
Fundamental Concepts
1. extracting useful knowledge from data to solve business problems can be approached systematically by following a process with reasonably well-defined stages 2. from a large mass of data, IT can be used to find informative descriptive attributes 3. if you look too hard at a set of data, you WILL find something, but it may not generalize beyond the data you are looking at (overfitting) 4. formulating data mining solutions/evaluating results involves thinking carefully about the context in which they will be used
Classifier
A mapping from unlabeled instances to (discrete) classes. Classifiers have a form (e.g., classification tree) plus an interpretation procedure (including how to handle unknown values, etc.). Most classifiers also can provide probability estimates (or other likelihood scores), which can be thresholded to yield a discrete class decision thereby taking into account a cost/benefit or utility function.
Cross-validation
A method for estimating the accuracy (or error) of an inducer by dividing the data into k mutually exclusive subsets (the "folds") of approximately equal size. The inducer is trained and tested k times. Each time it is trained on the dataset minus one of the folds and tested on that fold. The accuracy estimate is the average accuracy for the k folds or the accuracy on the combined ("pooled") testing folds.
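A minimal sketch using scikit-learn, assuming it is installed (the dataset and the choice of k=5 folds are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)  # 5 folds
    print(scores)         # accuracy on each held-out fold
    print(scores.mean())  # average accuracy across folds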
Attribute (field, variable, feature)
A quantity describing an instance. An attribute has a domain defined by the attribute type, which denotes the values that can be taken by an attribute. The following domain types are common: Categorical (symbolic): A finite number of discrete values. The type nominal denotes that there is no ordering between the values, such as last names and colors. The type ordinal denotes that there is an ordering, such as in an attribute taking on the values low, medium, or high. Continuous (quantitative): Commonly, subset of real numbers, where there is a measurable difference between the possible values. Integers are usually treated as continuous in practical problems.
Example/Instance
A single object of the world from which a model will be learned, or on which a model will be used (e.g., for prediction). In most data science work, instances are described by feature vectors; some work uses more complex representations (e.g., containing relations between instances or between parts of instances).
Principal Components Analysis
A statistical technique that replaces a large set of possibly correlated variables with a smaller set of uncorrelated components (linear combinations of the original variables) that preserve most of the variation in the data.
Model
A structure and corresponding interpretation that summarizes or partially summarizes a set of data, for description or prediction. Most inductive algorithms generate models that can then be used as classifiers, as regressors, as patterns for human consumption, and/or as input to subsequent stages of the data mining process.
OLAP
Online Analytical Processing. Usually synonymous with MOLAP (multi-dimensional OLAP). OLAP engines facilitate the exploration of data along several (predetermined) dimensions. OLAP commonly uses intermediate data structures to store precalculated results on multidimensional data, allowing fast computations. ROLAP (relational OLAP) refers to performing OLAP using relational databases.
Supervised Learning
Techniques used to learn the relationship between independent attributes and a designated dependent attribute (the label). Most induction algorithms fall into the supervised learning category.
ANOVA
analysis of variance; measures the relationship between a categorical and a numeric variable and estimates whether it is statistically significant (F statistic; larger is better)
4. Modeling
apply data mining techniques and tune models (calibrate parameters, assess model fit, assess model quality). Deliverables: modeling technique, test design, trained model, model assessment
3. Data Preparation
prepare the data for analysis (text to numeric, categorical to dummy variables, remove nulls, infer missing values, scale/normalize, transpose, non-linear transforms). Deliverables: feature selection report, data cleaning report, data creation report, data integration report
1. Business Understanding
think about the problem (use case, feasibility); explore and discover (data on hand, what to predict, impact); begin to break down the problem (subproblems). Deliverables: project objective and success criteria, review of the current scenario, project plan
Parametric Modeling
use data mining techniques to SET the parameters of a model with a predefined form, e.g., a linear classifier with a descriptive equation; ordinary least squares chooses the parameters that minimize the sum of squared errors
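A minimal sketch of ordinary least squares with NumPy (the small x/y arrays are made-up illustration data):

    import numpy as np

    # Ordinary least squares: choose parameters (slope, intercept) that
    # minimize the sum of squared errors
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

    X = np.column_stack([x, np.ones_like(x)])          # add intercept column
    (slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)
    print(slope, intercept)                            # roughly 2.0 and 0.1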
Artificial Intelligence
(AI) refers to programming logic, not the hardware itself but the software within, that replicates human logic, reasoning, and decision-making. Typically, this begins with some data mining that helps us understand human behavior. But creating the AI program doesn't always require that prior data be stored, drawn from, or updated. Sometimes AI is based on an optimization formula with an objective and constraints (like the Solver add-in in Microsoft Excel). That is why part of AI lies outside of the data mining concept.
Schema
A description of a dataset's attributes and their properties.
Naive Bayes Classifier
A family of algorithms that treat every feature as independent of every other feature. Generally p(E) never actually has to be calculated, for one of two reasons. First, if we are interested in classification, what we mainly care about is: of the different possible classes c, for which one is p(C|E) the greatest? In this case E is the same for all of them, so we can just look to see which numerator is larger. The Naive Bayes classifier performs surprisingly well for classification on many real-world tasks, because the violation of the independence assumption tends not to hurt classification performance, for an intuitively satisfying reason: to some extent we'll be double-counting the evidence, but as long as the evidence is generally pointing us in the right direction, the double-counting won't tend to hurt us for classification; in fact, it tends to make the probability estimates more extreme in the correct direction. Pros: simple, efficient, performs well on real-world tasks, the independence violation doesn't hurt. Cons: double-counting is only problematic if you use the actual probability estimates instead of the classification.
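A minimal sketch using scikit-learn's GaussianNB, assuming scikit-learn is installed (the dataset choice is illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = GaussianNB().fit(X_train, y_train)   # each feature treated as independent given the class
    print(model.score(X_test, y_test))           # classification accuracy
    print(model.predict_proba(X_test[:3]))       # probability estimates (may be overly extreme)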
AUC (area under the ROC Curve)
An evaluation metric that considers all possible classification thresholds. Useful when a single number is needed to summarize performance; 0.5 = random performance. The area under the ROC curve is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
CRISP-DM Process
Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
Cumulative Response and Lift Curves
the CRC plots the HIT RATE (% of positives correctly classified, i.e., tp rate) on the Y axis against the % of the population targeted on the X axis. Lift = the degree to which the model pushes positive instances up above the negatives (the CRC value at a given X divided by the diagonal line at that point). *Use carefully if the class priors are unknown: the shape is informative but the relationship of the values is not valid.
2. Data Understanding
Exploratory Data Analysis (EDA): descriptive statistics, review scatterplots, correlation matrix, potential target variable; identify techniques appropriate for your problem. Deliverables: initial data collection report, description report, exploration report, quality report
Combining Evidence Probabilistically
For any particular collection of evidence E, we probably have not seen enough cases with exactly that same collection of evidence to be able to infer the probability of class membership with any confidence.
Clustering
Hierarchical clustering focuses on the similarities between the individual instances and how similarities link them together. Agglomerative: bottom up or Divisive: top down. Key consideration is terminal cluster value (how many clusters) The most common method for focusing on the clusters themselves is to represent each cluster by its "cluster center," or centroid. Medoids = an actual data point is used as center instead of a mean.
Machine Learning
In data science, machine learning is most commonly used to mean the application of induction algorithms to data. The term is often used synonymously with the modeling stage of the data mining process. Machine Learning is the field of scientific study that concentrates on induction algorithms and on other algorithms that can be said to learn.
Induction
Induction is the process of creating a general model (such as a classification tree or an equation) from a set of data. Induction may be contrasted with deduction: deduction starts with a general rule or model and one or more facts, and creates other specific facts from them. Induction goes in the other direction: induction takes a collection of facts and creates a general rule or model. In the context of this book, model induction is synonymous with learning or mining a model, and the rules or models are generally statistical in nature.
Unsupervised Learning
Learning techniques that group instances without a pre-specified target attribute. (clustering, profiling, co-occurrence)
Class (label)
One of a small, mutually exclusive set of labels used as possible values for the target variable in a classification problem. Labeled data has one class label assigned to each example. For example, in a dollar bill classification problem the classes could be legitimate and counterfeit. In a stock assessment task the classes might be will gain substantially, will lose substantially, and will maintain its value.
Bayes' Rule
P(A|B) = P(B|A)P(A)/P(B) describes the probability of an event based on prior knowledge of conditions that might be related to the event. p(C = c) is the "prior" probability of the class, i.e., the probability we would assign to the class before seeing any evidence. p(E|C = c) is the likelihood of seeing the evidence E (the particular features of the example being classified) when the class C = c; one might see this as a "generative" question: if the world generated an instance of class c, how often would it look like E? This likelihood might be calculated from the data as the percentage of examples of class c that have feature vector E. p(E) is the likelihood of the evidence: how common is the feature representation E among all examples? This might be calculated from the data as the percentage occurrence of E among all examples. Use the result directly as a class probability, as a score/ranking, or choose the class with the maximum value. *The naive application makes an assumption of probabilistic independence.
Expected Value Framework
Problem structure, elements of analysis we can extract: create a confusion matrix and then calculate accuracy, compare to the class prior, create a probability matrix, and apply it to a cost-benefit matrix. In an expected value calculation the possible outcomes of a situation are enumerated; the expected value is then the weighted average of the values of the different possible outcomes, where the weight given to each value is its probability of occurrence. If the outcomes represent different possible levels of profit, an expected profit calculation weights heavily the highly likely levels of profit, while unlikely levels of profit are given little weight. EV = p(o1)*v(o1) + p(o2)*v(o2) + ..., where each oi is a possible decision outcome, p(oi) is its probability, and v(oi) is its value. The probabilities often can be estimated from the data, but the business values often need to be acquired from other sources. Evaluate the set of decisions made by a model when applied to a set of examples (necessary to compare one model to another): in the AGGREGATE, how well does each model do, and what is its expected value?
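A minimal sketch of the expected value calculation in Python with NumPy (the probability and cost-benefit numbers are made up for illustration):

    import numpy as np

    # Probability matrix: rates of (TP, FN, FP, TN) estimated from the data
    prob = np.array([[0.05, 0.02],    # actual positive: predicted pos, predicted neg
                     [0.10, 0.83]])   # actual negative: predicted pos, predicted neg

    # Cost-benefit matrix from the business side (same layout, illustrative values)
    value = np.array([[99.0, 0.0],    # benefit of a true positive, cost of a false negative
                      [-1.0, 0.0]])   # cost of a false positive, value of a true negative

    expected_value = np.sum(prob * value)   # weighted average of outcome values
    print(expected_value)                   # 0.05*99 + 0.10*(-1) = 4.85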
ROC
Receiver Operating Characteristic: a two-dimensional plot of a classifier with the false positive rate on the x axis and the true positive rate on the y axis (shows the trade-offs); shows the entire space of uncertainty; computed using only the actual positives and actual negatives, which decouples performance from the underlying conditions; the region of interest may change based on conditions, but the curve will not
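A minimal sketch using scikit-learn, assuming it is installed (the synthetic dataset and the choice of classifier are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, scores)   # one (FPR, TPR) point per threshold
    print(roc_auc_score(y_test, scores))               # area under the curve; 0.5 = random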
K-Means
The algorithm starts by creating k initial cluster centers, usually randomly, but sometimes by choosing k of the actual data points. Then the algorithm proceeds as follows. The clusters corresponding to these cluster centers are formed, by determining which is the closest center to each point. Next, for each of these clusters, its center is recalculated by finding the actual centroid of the points in the cluster. The cluster centers typically shift; the new centers are usually closer to what intuitively seems to be the center of each cluster. The process simply iterates: since the cluster centers have shifted, we need to recalculate which points belong to each cluster. Once these are reassigned, we might have to shift the cluster centers again. The k-means procedure keeps iterating until there is no change in the clusters (or possibly until some other stopping criterion is met).
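A minimal sketch using scikit-learn's KMeans, assuming it is installed (the toy points are made up for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    # Two obvious clusters of 2-D points
    points = np.array([[1, 1], [1.5, 2], [2, 1.5],
                       [8, 8], [8.5, 9], [9, 8.5]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)           # cluster assignment for each point
    print(kmeans.cluster_centers_)  # the final centroids after iteration stops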
Knowledge Discovery
The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. This is the definition used in "Advances in Knowledge Discovery and Data Mining," by Fayyad, Piatetsky-Shapiro, & Smyth (1996).
Data cleaning/cleansing
The process of improving the quality of the data by modifying its form or content, for example by removing or correcting data values that are incorrect. This step usually precedes the modeling step, although a pass through the data mining process may indicate that further cleaning is desired and may suggest ways to improve the quality of the data.
Coverage
The proportion of a dataset for which a classifier makes a prediction. If a classifier does not classify all the instances, it may be important to know its performance on the set of cases for which it is confident enough to make a prediction
Accuracy
The rate of correct (incorrect) predictions made by the model over a dataset (cf. coverage). Accuracy is usually estimated using an independent (holdout) dataset that was not used at any time during the learning process. More complex accuracy estimation techniques, such as cross-validation and the bootstrap, are commonly used, especially with datasets containing a small number of instances.
Data Mining
The term data mining is somewhat overloaded. It sometimes refers to the whole data mining process and sometimes to the specific application of modeling techniques to data in order to build models or find other patterns/regularities.
Model Deployment
The use of a learned model to solve a real-world problem. Deployment often is used specifically to contrast with the "use" of a model in the Evaluation stage of the data mining process. In the latter, deployment usually is simulated on data where the true answer is known.
Bivariate Relationships
Theory = a coherent group of tested propositions used as principles of explanation for predicting a class of phenomena (could include generalized explanations); easily explained using a scatterplot: add a trendline, the slope indicates positive or negative, and linear is not always best
i.i.d. sample
a set of independent and identically distributed instances
Generalization
applying the model to data we didn't use to build it; the goal of induction
Analytic Attitude
ask the right questions, ETL the relevant data, apply the appropriate analytic technique, and interpret and share results with key stakeholders
N-Gram sequences
bag of words doesn't consider word order; the next step up is sequences of adjacent words; useful when a phrase is significant but its components may not be; easy to generate, requiring no linguistic knowledge or complex parsing algorithm, BUT produces a huge set of features
Data Driven Decision Making
basing decisions on analysis of data rather than intuition (statistically more productive). Type 1 = new discoveries to be made in the data; Type 2 = decisions made repeatedly on a massive scale that benefit from an increase in accuracy
Bias, Variance, Ensemble Methods
build lots of models and combine them = ensemble (observed to improve generalization performance); k-NN can be viewed as a simple ensemble method
Profiling
characterize the typical behavior of an individual; can involve clustering; define a numeric function with parameters, define a goal/objective, and find the parameters that fit; 'normal' or typical could be different depending on your dataset
K-Nearest Neighbor
choose the several examples closest to the new example and predict the target by majority vote (classification) or averaging (regression); smaller k = overfitting, too many boundaries, too specific; larger k = a single classification space that always predicts the majority class; use similarity weighting so closer neighbors count more; use an odd k so there are no ties
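A minimal sketch using scikit-learn's KNeighborsClassifier, assuming it is installed (the dataset and k=5 are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # k=5 (odd, to avoid ties); weights='distance' makes closer neighbors count more
    knn = KNeighborsClassifier(n_neighbors=5, weights='distance').fit(X_train, y_train)
    print(knn.score(X_test, y_test))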
Link Prediction/Social Recommendation
predict connections between items that should or shouldn't be there; can also estimate the strength of a link; define similarity measures and possibly weight them
Multivariate Prediction
designate one variable as dependent (what you want to predict) and one or many independent variables (features, vectors) to explain the prediction; R-squared = coefficient of determination (a key output of the line of best fit)
Decision Forest Regression
ensemble method based on the Decision Trees algorithm. Decision trees are non-parametric models that perform a sequence of simple tests for each instance, traversing a binary tree data structure until a leaf node (decision) is reached. The advantages of decision tree-based algorithms are that they are quick and efficient to compute, they can handle non-linear relationships, and they handle non-"normal" variables relatively well. Decision Trees calculate "tree-like" branch structures with binary decision nodes (as opposed to traditional regression coefficients).
Fitting Graph
plots the error rate found in testing against model complexity; the base rate is the actual split of the data; holdout testing is useful but only to a certain extent, because it leaves information on the table that could have been used to train the model
Regression
estimating a value, numeric prediction
Unbalanced Classes
evaluation based solely on accuracy will not work, because one class is so rare
Type 2 Decision
existing decisions that repeat, look for incremental improvements in decision process based on data
Co-occurrence
grouping, market-basket analysis: when people buy one thing, what else are they likely to buy? Increase revenue from cross-selling, improve the customer experience, build out inventory for regional distribution centers and reduce shipping costs. Complexity control: only use rules that apply to a minimum percentage of transactions; look at the strength of the association (probability); threshold on how "surprising" it is = informational lift; look at the difference (instead of the ratio) = leverage
Filter Based Feature Selection
implements the process of selecting the TOP n variables with the strongest relationship to the dependent variable; it calculates a bivariate relationship (Pearson correlation by default) for each independent variable. This module will automatically adjust and keep selecting the best features.
Boosted Decision Tree Regression
is also an ensemble algorithm that is similar to a decision forest, but it allows distinct branches to reconverge later in the tree. This is ideal for independent variables that are highly correlated. The Boosted Decision Tree algorithm tends to improve accuracy for some value ranges, with the possible risk of reducing accuracy for other ranges.
Permutation Feature Importance
it is better in that it estimates the effect of each feature on the label using a trained model; it considers the intercorrelation among all features to give a much more accurate estimate of how impactful and useful each feature will be; it standardizes the coefficients so they can easily be rank-ordered from best to worst (features with negative values are not those with negative coefficients; they are the ones actually hurting the accuracy of your model by adding error); and it combines the coefficients for categorical features into a single value. Cons: it cannot be used to automatically select the best variables to go into a model, because it can only be calculated from the output of a trained model.
Named Entity Extraction
knowledge intensive, have to be trained/coded
Logistic Regression
a linear model of the log-odds of membership in the class of interest; taking the natural log of the odds makes the relationship linear in the attributes, and inverting that transform gives an estimate of class probability
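A minimal sketch in plain Python (the coefficients -1.0 and 2.0 are made up for illustration):

    import math

    def logistic(log_odds):
        # converts the linear model's output (log-odds) into a probability
        return 1.0 / (1.0 + math.exp(-log_odds))

    # Illustrative linear model: log-odds = -1.0 + 2.0 * x
    for x in (0.0, 0.5, 1.0, 2.0):
        print(x, logistic(-1.0 + 2.0 * x))   # probability of the class of interest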
Neural Network Regression
often referred to as "deep learning" and often used for complex problems such as image classification. Any statistical algorithm can be termed a "neural network" if it uses adaptive coefficient weights and can approximate non-linear functions of its inputs.
Transactional (Operational) Database
operational and mission critical; slowed down by using too many resources; low level of detail; normalized; supports daily operations
Ordinal Regression
optimal for dependent variables that represent a rank order, which implies that the distance between each ranking is not necessarily equal. For example, predicting the results of an election would be ideal for an ordinal regression, because the distances between the first and second place candidates and between the second and third place candidates are not equal in terms of votes received.
Ranking Classifiers
present model performance across the spectrum of possible values for the assumptions; either classify each instance and then take the appropriate action, OR create a ranking of all cases and take action on the top-most percentage of cases (score, representation, budgetary constraints, cost-benefits hard to estimate). If you only have a score, only assign the positive class at a relatively high level of certainty (conservative); even if given a probability, you can treat it like a score and create a ranking.
6. Deployment
put it in place and start using deliverables: deployment plan, monitor/maintenance plan, final report/presentation, documentation
Data Reduction
replace many attributes with a few 'meta' attributes
Analytical Database
restructure, add redundancies, add summarization, pre-calculated ratios and totals, used by analysts to explore/study data
Model
a simplified representation of reality created to serve a particular purpose; in data science, a formula for estimating an unknown value of interest, usually supervised; abstracts away some of the complexity of real life; minimizes uncertainty about the target value for decision making (what is the cost of being wrong, in either direction?)
Pearson Correlation Coefficient
a single number that describes the relationship (r); closer to 0 = weaker, beyond +/-0.5 = fairly strong
Alternatives to Line Charts
slope chart: change between start and end; cycle plot: trends and seasonality; highlight table: time horizontally, scan for hotspots
Unicorn
subject matter expert, math and statistics expert, computer science expert
Overfitting
the tendency of data mining to tailor models to the training data at the expense of generalization; as complexity increases, so does overfitting; logistic regression is more susceptible than SVM; use holdout data to test model fit
Text Mining
text is important and it is EVERYWHERE, but difficult because it is unstructured. Representation: choose the simplest, least expensive option that works; document = one piece of text, no matter how large or small; token/term = word; corpus = collection of documents
Information Gain
the reduction in entropy from a split: entropy of the whole minus the weighted entropy after the split (entropy of the parent minus the sum, over child groups, of the probability of the group times the entropy of the group); determine which split reduces entropy the MOST and use that first
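A minimal sketch in plain Python combining the entropy and information gain formulas above (the example split is made up for illustration):

    import math

    def entropy(labels):
        counts = {c: labels.count(c) for c in set(labels)}
        return -sum((n / len(labels)) * math.log2(n / len(labels)) for n in counts.values())

    def information_gain(parent, children):
        # entropy of the whole minus the weighted entropy of each child segment
        n = len(parent)
        return entropy(parent) - sum(len(child) / n * entropy(child) for child in children)

    parent = ['yes'] * 5 + ['no'] * 5
    split = [['yes'] * 4 + ['no'] * 1, ['yes'] * 1 + ['no'] * 4]   # a fairly informative split
    print(information_gain(parent, split))   # ~0.278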
Similarity
which instances of data may be similar to others based on what we know
Profit Curves
x-axis = threshold values, expressed as the % of instances targeted; y-axis = profit. Take the candidate classifiers, calculate the confusion matrix and expected value at every threshold from 0-100%, and plot; extremely sensitive to the assumptions in the model (class priors, cost-benefit estimation)