609 Final

Minkowski Generalization

generalizes Manhattan and Euclidean distance through a single parameter p (p = 1 gives Manhattan, p = 2 gives Euclidean); more flexible, but still sensitive to the scale of the data points
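
Not from the course notes: a minimal NumPy sketch of the Minkowski distance, where the example points x and y are made up for illustration. p = 1 reproduces Manhattan distance and p = 2 reproduces Euclidean distance.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

x, y = [1, 2, 3], [4, 6, 3]
print(minkowski_distance(x, y, p=1))  # Manhattan: 3 + 4 + 0 = 7.0
print(minkowski_distance(x, y, p=2))  # Euclidean: sqrt(9 + 16 + 0) = 5.0
```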

TFIDF

combines term frequency (computed within one document) with inverse document frequency (computed across the corpus), so each term's weight is specific to one document

T Test

compares two specific groups to each other to determine whether the difference between them is statistically significant

Competitive Advantage

data and data science capability are complementary strategic assets -> value is dependent on other strategic decisions (context)

Topic Models

first model the set of topics in a corpus separately; the topics are learned from the data (latent information), and the terms associated with each topic and their weights are learned by the topic-modeling process

Causality

looking for which events influence others; with secondary (observational) data, causality must be inferred

Inverse Document Frequency

measures sparseness: a term should be not too rare (depends on the application; impose a lower limit) and not too common (a term that appears everywhere doesn't distinguish anything or provide information; impose an upper limit). It takes the distribution into account: if a term occurs in only a few documents, it is likely to be important in those documents. IDF(t) = 1 + log(total # of documents / # of documents containing t)
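
A minimal sketch, not from the source, applying the IDF formula above to a tiny made-up corpus (each document represented as a set of terms); rarer terms get higher weights.

```python
import math

def idf(term, documents):
    """1 + log(total # of documents / # of documents containing the term)."""
    n_containing = sum(1 for doc in documents if term in doc)
    return 1 + math.log(len(documents) / n_containing)

corpus = [{"data", "science", "model"},
          {"data", "mining"},
          {"business", "strategy"}]
print(idf("data", corpus))      # common term, lower weight: 1 + log(3/2)
print(idf("strategy", corpus))  # rare term, higher weight: 1 + log(3/1)
```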

Data Analytic Thinking

-assess whether and how data can improve performance -you don't have to DO data science, but you should understand it well enough to be strategic with it (devise a few probing questions to determine the plausibility of proposals, and help the people responsible for decision making comprehend it)

Learning Curve

A plot of the generalization performance against the amount of training data is called a learning curve. The learning curve is another important analytical tool.

Dataset

A schema and a set of instances matching the schema. Generally, no ordering on instances is assumed. Most data mining work uses a single fixed-format table or collection of feature vectors.

Bag of Words

treat every document like a collection of individual words

Confusion Matrix

A matrix showing the predicted and actual classifications. A confusion matrix is of size l × l, where l is the number of different label values. A variety of classifier evaluation metrics are defined based on the contents of the confusion matrix, including accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, sensitivity, specificity, positive predictive value, and negative predictive value.
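Not from the source: a short Python sketch with invented TP/FP/TN/FN counts, showing how several of the metrics listed above fall directly out of the four cells of a 2x2 confusion matrix.

```python
# Invented counts for a 2x2 (binary) confusion matrix
TP, FP, TN, FN = 60, 10, 100, 30

accuracy    = (TP + TN) / (TP + FP + TN + FN)
precision   = TP / (TP + FP)   # positive predictive value
recall      = TP / (TP + FN)   # true positive rate / sensitivity
specificity = TN / (TN + FP)   # true negative rate
fpr         = FP / (FP + TN)   # false positive rate

print(accuracy, precision, recall, specificity, fpr)
# 0.8, ~0.857, ~0.667, ~0.909, ~0.091
```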

Manhattan Distance

A measure of travel through a grid system, like navigating around the buildings and blocks of Manhattan, NYC: the distance is summed along each dimension rather than measured in a straight line

Cost (utility/loss/payoff)

A measurement of the cost to the performance task (and/or benefit) of making a prediction ŷ when the actual label is y. The use of accuracy to evaluate a model assumes uniform costs of errors and uniform benefits of correct classifications.

Dimension

An attribute or several attributes that together describe a property. For example, a geographical dimension might consist of three attributes: country, state, city. A time dimension might include 5 attributes: year, month, day, hour, minute.

Key Assets

Data: get the right information. Capability: extract useful knowledge from it.

Tech Stack

Presentation to user -> Data warehouse (analytical) -> Extract, Transform, Load (ETL) -> Data lake -> Data sources

Association Mining

Techniques that find conjunctive implication rules of the form "X and Y → A and B" (associations) that satisfy given criteria (Provost & Fawcett, Data Science for Business, p. 485).

Missing Value

The situation where the value for an attribute is not known or does not exist. There are several possible reasons for a value to be missing, such as: it was not measured; there was an instrument malfunction; the attribute does not apply, or the attribute's value cannot be known. Some algorithms have problems dealing with missing values.

4 Types of Predictive Analytics

classification, regression, clustering, association

Entropy

how much disorder exists in a dataset, group, or segment; computed as the negative sum, over all values, of the probability of each value times the log (base 2) of that probability
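
A minimal sketch (not from the course notes) of the entropy formula above, applied to made-up class proportions.

```python
import math

def entropy(proportions):
    """Entropy = -sum of p * log2(p) over the class proportions."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))  # maximum disorder for two classes: 1.0
print(entropy([0.9, 0.1]))  # mostly one class, low disorder: ~0.469
print(entropy([1.0]))       # pure segment: 0.0
```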

Induction

how we build models from data: we already have the ending (target) values and build backward toward knowledge of how they come about

Supervised Segmentation

identify segments to look for patterns, find subgroups by attribute, determine if attribute is important

Decision Trees

make the first split on the most informative attribute and continue splitting until a stopping criterion is met; the splits also serve as decision boundaries. Sometimes probability estimation is used instead of a categorical target: use the ranking to focus on the nodes with the greatest probability, and use smoothing to weight the denominator more (the training data should be representative)

Strategic Database

managers review high level statistics to make overarching decisions

Poisson Regression

optimal when predicting "counts." For example, you would use a Poisson regression to predict the vote counts received by each election candidate. As a result, the dependent variable would never be negative and would always be a whole number (i.e., an integer).

KDD

originally was an abbreviation for Knowledge Discovery from Databases. It is now used to cover broadly the discovery of knowledge from data, and often is used synonymously with data mining.

Bias-Variance Tradeoff

overly simple = ignores complexity, errors due to bias; overly complex = additional splits, errors due to variance and overfitting to the training data. *Impose limits on tree growth, but past a certain point errors from bias creep back in, so find the balance

Data Science

principles, processes, and techniques for understanding phenomena via automated analysis of data

Characteristics of a Good Measure

simple, easily obtainable, precisely definable, objective, robust

Data Visualization Theory

simplicity, certain charts, arrangement, provide context, creativity

Data Reduction/Latent Information/Movie Recommendation

take a large set of data and replace it with a smaller set that preserves the important information; latent = relevant but not observed explicitly in the data (observe the informational split and apply domain knowledge)

Type 1 Decision

use discoveries about the data to make new decisions

Support Vector Machine

want to see which class is more likely than the others: estimate the probability of an instance belonging to a class, or use a score of which class an instance is more likely to belong to; the distance of an instance from the decision line gives a RANK of more or less likely. Erroneous predictions that occur within the margin get minimal penalty or none at all; predictions outside the margin are penalized more and more severely (hinge loss). *If the classes are not linearly separable, balance the width of the margin against the penalty for misclassification

Euclidean Distance

the straight-line distance, or shortest possible path, between two points; the most common distance measure but the least robust

Recall

true positives over (true positives + false negatives): TP / (TP + FN)

Precision

true positives over (true positives + false positives): TP / (TP + FP)

Bayesian Linear Regression

typically more accurate than a regular linear regression, and still dependent on the assumption of a linear relationship between the dependent variable and each independent variable. Uses linear regression supplemented by additional information in the form of a prior probability distribution: prior information about the parameters is combined with a likelihood function to generate estimates for the parameters

Term Frequency

use the word COUNT (frequency) to differentiate how many times a word is used; sometimes importance increases with frequency. Normalize the case, stem words (remove suffixes), remove stopwords (common words); numbers are commonly regarded as unimportant details. Raw count vs. normalized (relative to the total length of the document)

Linear Regression Assumptions

variables have a linear relationship; multivariate normality: distribution/skewness/kurtosis; multicollinearity: variables not too highly correlated with each other; autocorrelation: residuals should be independent of each other; homoscedasticity: variance should be the same along the regression line

5. Evaluation

verify that the selected model can achieve the business objectives; need a strong evaluation framework and an updated cost/benefit analysis. Deliverables: evaluation report, process review, next steps (recommendations)

Fundamental Concepts

1. extracting useful knowledge from data to solve business problems can be approached systematically by following a process with reasonably well-defined stages 2. from a large mass of data, IT can be used to find informative descriptive attributes 3. if you look too hard at a set of data, you WILL find something, but it may not generalize beyond the data you are looking at (overfitting) 4. formulating data mining solutions and evaluating results involves thinking carefully about the context in which they will be used

Classifier

A mapping from unlabeled instances to (discrete) classes. Classifiers have a form (e.g., classification tree) plus an interpretation procedure (including how to handle unknown values, etc.). Most classifiers also can provide probability estimates (or other likelihood scores), which can be thresholded to yield a discrete class decision thereby taking into account a cost/benefit or utility function.

Cross-validation

A method for estimating the accuracy (or error) of an inducer by dividing the data into k mutually exclusive subsets (the "folds") of approximately equal size. The inducer is trained and tested k times. Each time it is trained on the dataset minus one of the folds and tested on that fold. The accuracy estimate is the average accuracy for the k folds or the accuracy on the combined ("pooled") testing folds.
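A hedged illustration, not from the source: a scikit-learn sketch of 5-fold cross-validation (library, dataset, and model choice are all assumptions here), showing the per-fold accuracies and their average as the accuracy estimate.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, test on the held-out fold, repeated 5 times
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)         # accuracy on each fold
print(scores.mean())  # the cross-validated accuracy estimate
```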

Attribute (field, variable, feature)

A quantity describing an instance. An attribute has a domain defined by the attribute type, which denotes the values that can be taken by an attribute. The following domain types are common: Categorical (symbolic): A finite number of discrete values. The type nominal denotes that there is no ordering between the values, such as last names and colors. The type ordinal denotes that there is an ordering, such as in an attribute taking on the values low, medium, or high. Continuous (quantitative): Commonly, subset of real numbers, where there is a measurable difference between the possible values. Integers are usually treated as continuous in practical problems.

Example/Instance

A single object of the world from which a model will be learned, or on which a model will be used (e.g., for prediction). In most data science work, instances are described by feature vectors; some work uses more complex representations (e.g., containing relations between instances or between parts of instances).

Principal Components Analysis

A statistical technique that reduces a large set of possibly correlated variables to a smaller set of uncorrelated components that capture most of the variation in the data.

Model

A structure and corresponding interpretation that summarizes or partially summarizes a set of data, for description or prediction. Most inductive algorithms generate models that can then be used as classifiers, as regressors, as patterns for human consumption, and/or as input to subsequent stages of the data mining process.

OLAP

Online Analytical Processing. Usually synonymous with MOLAP (multi-dimensional OLAP). OLAP engines facilitate the exploration of data along several (predetermined) dimensions. OLAP commonly uses intermediate data structures to store precalculated results on multidimensional data, allowing fast computations. ROLAP (relational OLAP) refers to performing OLAP using relational databases.

Supervised Learning

Techniques used to learn the relationship between independent attributes and a designated dependent attribute (the label). Most induction algorithms fall into the supervised learning category.

ANOVA

analysis of variance; measures the relationship between a categorical and a numeric variable and estimates whether it is statistically significant (F statistic: larger is better)

4. Modeling

apply data mining techniques; tune models (calibrate parameters, assess model fit, assess model quality). Deliverables: modeling technique, test design, trained model, model assessment

3. Data Preparation

prepare the data for analysis (text to numeric, categorical to dummy variables, remove nulls, infer values, scale/normalize, transpose, non-linear transforms). Deliverables: feature selection report, data cleaning report, data creation report, data integration report

1. Business Understanding

think about the problem (use case, feasibility); explore and discover (the data you have, what to predict, the impact); begin to break down the problem into subproblems. Deliverables: project objective and success criteria, review of the current scenario, project plan

Parametric Modeling

use data mining techniques to SET the parameters of a model with a fixed form, e.g., a linear classifier with a descriptive equation; ordinary least squares minimizes the sum of the squared error terms

Artificial Intelligence

(AI) refers to programming logic—not the hardware itself, but the software within—that replicates human logic, reasoning, and decision-making. Typically, this begins with some data mining that helps us understand human behavior. But creating the AI program doesn't always require that prior data be stored, drawn from, or updated. Sometimes AI is based on an optimization formula with an objective and constraints (like the Solver add-in in Microsoft Excel). That is why part of the oval above representing AI lies outside of the data mining concept.

Schema

A description of a dataset's attributes and their properties.

Naive Bayes Classifier

A family of algorithms that treat every feature as independent of every other feature, given the class. Generally p(E) never actually has to be calculated, for one of two reasons. First, if we are interested in classification, what we mainly care about is: of the different possible classes c, for which one is p(C|E) greatest? In this case, E is the same for all classes, so we can just look to see which numerator is larger. The Naive Bayes classifier performs surprisingly well for classification on many real-world tasks. This is because violating the independence assumption tends not to hurt classification performance, for an intuitively satisfying reason: to some extent we will be double-counting the evidence, but as long as the evidence is generally pointing us in the right direction, the double-counting won't tend to hurt classification. In fact, it will tend to make the probability estimates more extreme in the correct direction. Pros: simple, efficient, performs well on real-world tasks, and the independence violation doesn't hurt much. Cons: double counting is only problematic if you use the actual probability estimates instead of just the classification.
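As a hedged sketch (not the course's example), a scikit-learn Naive Bayes classifier on synthetic data; the dataset, variant (GaussianNB), and parameters are all assumptions for illustration. Note how the predicted probabilities tend toward extreme values, as described above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data stands in for a real-world task
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print(nb.score(X_test, y_test))      # classification accuracy
print(nb.predict_proba(X_test[:3]))  # per-class probability estimates, often pushed toward 0 or 1
```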

AUC (area under the ROC Curve)

An evaluation metric that considers all possible classification thresholds. Useful when a single number is needed to summarize performance; 0.5 = randomness. The area under the ROC curve is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
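Not from the source: a tiny scikit-learn sketch with made-up labels and scores, illustrating AUC as the probability that a random positive outranks a random negative.

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                       # actual labels (invented)
y_score = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.55, 0.3]    # classifier scores (invented)

# Probability that a randomly chosen positive is scored above a randomly chosen negative
print(roc_auc_score(y_true, y_score))
```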

CRISP - DM process

Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment

Cumulative Response and Lift Curves

The cumulative response curve (CRC) plots the hit rate (tp rate, the % of positives correctly classified) on the Y axis against the % of the population targeted on the X axis. Lift = the degree to which the model pushes positive instances up above the negatives (the CRC value at a given X divided by the diagonal line at that point). *Use carefully if class priors are unknown: the shape is informative, but the relationship of the values is not valid

2. Data Understanding

Exploratory Data Analysis (EDA): descriptive statistics, review scatterplots, correlation matrix, identify a potential target variable; identify techniques appropriate for your problem. Deliverables: initial data collection report, data description report, data exploration report, data quality report

Combining Evidence Probabilistically

For any particular collection of evidence E, we probably have not seen enough cases with exactly that same collection of evidence to be able to infer the probability of class membership with any confidence.

Clustering

Hierarchical clustering focuses on the similarities between the individual instances and how similarities link them together. Agglomerative: bottom up or Divisive: top down. Key consideration is terminal cluster value (how many clusters) The most common method for focusing on the clusters themselves is to represent each cluster by its "cluster center," or centroid. Medoids = an actual data point is used as center instead of a mean.

Machine Learning

In data science, machine learning is most commonly used to mean the application of induction algorithms to data. The term is often used synonymously with the modeling stage of the data mining process. Machine Learning is the field of scientific study that concentrates on induction algorithms and on other algorithms that can be said to learn.

Induction

Induction is the process of creating a general model (such as a classification tree or an equation) from a set of data. Induction may be contrasted with deduction: deduction starts with a general rule or model and one or more facts, and creates other specific facts from them. Induction goes in the other direction: induction takes a collection of facts and creates a general rule or model. In the context of this book, model induction is synonymous with learning or mining a model, and the rules or models are generally statistical in nature.

Unsupervised Learning

Learning techniques that group instances without a pre-specified target attribute. (clustering, profiling, co-occurrence)

Class (label)

One of a small, mutually exclusive set of labels used as possible values for the target variable in a classification problem. Labeled data has one class label assigned to each example. For example, in a dollar bill classification problem the classes could be legitimate and counterfeit. In a stock assessment task the classes might be will gain substantially, will lose substantially, and will maintain its value.

Bayes' Rule

P(A|B) = P(B|A)P(A)/P(B); describes the probability of an event based on prior knowledge of conditions that might be related to the event. p(C = c) is the "prior" probability of the class, i.e., the probability we would assign to the class before seeing any evidence. p(E|C = c) is the likelihood of seeing the evidence E (the particular features of the example being classified) when the class C = c. One might see this as a "generative" question: if the world generated an instance of class c, how often would it look like E? This likelihood might be calculated from the data as the percentage of examples of class c that have feature vector E. p(E) is the likelihood of the evidence: how common is the feature representation E among all examples? This might be calculated from the data as the percentage occurrence of E among all examples. Use the result directly as a class probability, as a score/ranking, or choose the maximum. *Naive Bayes makes the assumption of probabilistic independence
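
A worked numeric example, not from the source, with invented prior and likelihoods, showing Bayes' Rule turning a prior into a posterior once evidence is observed.

```python
# Invented numbers: p(C|E) = p(E|C) * p(C) / p(E)
p_c = 0.01              # prior probability of the class (e.g., 1% of customers respond)
p_e_given_c = 0.80      # likelihood of the evidence when the class holds
p_e_given_not_c = 0.10  # likelihood of the evidence otherwise

# p(E) via total probability over the two classes
p_e = p_e_given_c * p_c + p_e_given_not_c * (1 - p_c)

posterior = p_e_given_c * p_c / p_e
print(posterior)  # ~0.075: the evidence raises the 1% prior to about 7.5%
```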

Expected Value Framework

Problem structure and the elements of analysis we can extract: create a confusion matrix, calculate accuracy, compare to the class prior, create a probability matrix, and apply it to a cost-benefit matrix. In an expected value calculation the possible outcomes of a situation are enumerated. The expected value is then the weighted average of the values of the different possible outcomes, where the weight given to each value is its probability of occurrence. If the outcomes represent different possible levels of profit, an expected profit calculation weights heavily the highly likely levels of profit, while unlikely levels of profit are given little weight. EV = p(o1)*v(o1) + p(o2)*v(o2) + ..., where each oi is a possible decision outcome, p(oi) is its probability, and v(oi) is its value. The probabilities often can be estimated from the data, but the business values often need to be acquired from other sources. Used to evaluate the set of decisions made by a model when applied to a set of examples, which is necessary to compare one model to another: in the AGGREGATE, how well does each model do, and what is its expected value?
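
A minimal NumPy sketch, not from the source, with invented probability and cost-benefit matrices for a targeted-marketing style problem; the expected value is the probability-weighted sum of the outcome values.

```python
import numpy as np

# Invented numbers. Rows = actual class (positive, negative),
# columns = predicted class (positive, negative).
probabilities = np.array([[0.05, 0.02],   # p(TP), p(FN)
                          [0.10, 0.83]])  # p(FP), p(TN)
values        = np.array([[99.0, 0.0],    # benefit of a TP, value of a FN
                          [-1.0, 0.0]])   # cost of a FP, value of a TN

# EV = sum over outcomes of p(o_i) * v(o_i)
expected_value = np.sum(probabilities * values)
print(expected_value)  # 0.05*99 + 0.10*(-1) = 4.85 per instance targeted
```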

ROC

Receiver Operating Characteristic: a two-dimensional plot of a classifier with the false positive rate on the x axis and the true positive rate on the y axis (shows the trade-offs), covering the entire space of uncertainty. Computed using only the actual positives and actual negatives, so it decouples classifier performance from the underlying conditions; the region of interest may change based on conditions, but the curve will not

K-Means

The algorithm starts by creating k initial cluster centers, usually randomly, but sometimes by choosing k of the actual data points. The clusters corresponding to these cluster centers are formed by determining which is the closest center to each point. Next, for each of these clusters, its center is recalculated by finding the actual centroid of the points in the cluster. The cluster centers typically shift toward what intuitively seems to be the center of each cluster. The process simply iterates: since the cluster centers have shifted, we need to recalculate which points belong to each cluster. Once these are reassigned, we might have to shift the cluster centers again. The k-means procedure keeps iterating until there is no change in the clusters (or possibly until some other stopping criterion is met).
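
As a hedged illustration (scikit-learn and the synthetic data are assumptions, not from the source), a short sketch running k-means on three loose groups of 2-D points and inspecting the final centroids and assignments.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points drawn around three made-up group centers
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the centroids after the iterations converge
print(km.labels_[:10])      # cluster assignment for the first 10 points
print(km.n_iter_)           # how many reassign/recenter iterations were needed
```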

Knowledge Discovery

The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. This is the definition used in "Advances in Knowledge Discovery and Data Mining," by Fayyad, Piatetsky-Shapiro, & Smyth (1996).

Data cleaning/cleansing

The process of improving the quality of the data by modifying its form or content, for example by removing or correcting data values that are incorrect. This step usually precedes the modeling step, although a pass through the data mining process may indicate that further cleaning is desired and may suggest ways to improve the quality of the data.

Coverage

The proportion of a dataset for which a classifier makes a prediction. If a classifier does not classify all the instances, it may be important to know its performance on the set of cases for which it is confident enough to make a prediction.

Accuracy

The rate of correct (incorrect) predictions made by the model over a dataset (cf. coverage). Accuracy is usually estimated using an independent (holdout) dataset that was not used at any time during the learning process. More complex accuracy estimation techniques, such as cross-validation and the bootstrap, are commonly used, especially with datasets containing a small number of instances.

Data Mining

The term data mining is somewhat overloaded. It sometimes refers to the whole data mining process and sometimes to the specific application of modeling techniques to data in order to build models or find other patterns/regularities.

Model Deployment

The use of a learned model to solve a real-world problem. Deployment often is used specifically to contrast with the "use" of a model in the Evaluation stage of the data mining process. In the latter, deployment usually is simulated on data where the true answer is known.

Bivariate Relationships

Theory = a coherent group of tested propositions used as principles of explanation for predicting a class of phenomena (could include generalized explanations). Bivariate relationships are easily explained using a scatterplot: add a trendline, whose slope indicates a positive or negative relationship; linear is not always best

i.i.d sample

a set of independent and identically distributed instances

Generalization

apply model to data we didn't use to build the model, induction

Analytic Attitude

ask the right questions, ETL the relevant data, apply the appropriate analytic technique, and interpret and share the results with key stakeholders

N-Gram sequences

a bag of words doesn't consider word order; the next step is to use sequences of adjacent words. Useful when a phrase is significant but its components may not be; easy to generate, requiring no linguistic knowledge or complex parsing algorithm, BUT produces a huge set of features
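
Not from the source: a scikit-learn sketch (CountVectorizer is an assumption for illustration) that extracts unigrams plus bigrams from two made-up documents; the printed feature list shows how quickly the feature set grows.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data science for business", "science for business leaders"]

# ngram_range=(1, 2) keeps single words plus adjacent word pairs (bigrams)
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())  # many more features than plain bag of words
print(X.toarray())
```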

Data Driven Decision Making

basing decisions on analysis of data rather than intuition (statistically more productive). Type 1 = new discoveries to be made in the data; Type 2 = decisions made repeatedly on a massive scale that benefit from an increase in accuracy

Bias, Variance, Ensemble Methods

build lots of models (e.g., recommendation models) and combine them = an ensemble (observed to improve generalization performance); kNN can be viewed as a simple ensemble method

Profiling

characterize the typical behavior of an individual; can involve clustering. Define a numeric function with parameters, define a goal/objective, and find the parameters that fit; what counts as 'normal' or typical could differ based on your dataset

K-Nearest Neighbor

choose the several closest examples to the new example and predict the target class (or regression value) by majority rule. Smaller k = overfitting, too many boundaries, too specific; larger k = a single classification space, so always the majority class. Use similarity weighting so closer neighbors are more important, and always use an odd k so there are no ties
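
A hedged scikit-learn sketch, not the course's example; the synthetic data and parameter choices are assumptions. It uses an odd k and distance weighting, matching the card above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Odd k avoids ties; weights="distance" makes closer neighbors count more
knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_train, y_train)
print(knn.score(X_test, y_test))
```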

Link Prediction/Social Recommendation

predict connections between items that should or shouldn't be there; can also estimate the strength of a link. Define similarity measures, and possibly weight them

Multivariate Prediction

designate one variable as the dependent variable (what you want to predict) and one or many independent variables (features/vectors) to explain the prediction; R-squared = coefficient of determination (a key output of the line of best fit)

Decision Forest Regression

ensemble method based on the Decision Trees algorithm. decision trees are non-parametric models that perform a sequence of simple tests for each instance, traversing a binary tree data structure until a leaf node (decision) is reached. The advantages of decision tree-based algorithms are that they are quickly and efficiently computed, they can handle non-linear relationships, and they handle non-"normal" variables relatively well. Decision Trees calculate "tree-like" branch structures with binary decision nodes (as opposed to traditional regression coefficients).

Fitting Graph

the error rate you find in testing; the base rate is the actual split of the data. Useful, but only to a certain extent, because holding data out leaves information on the table that could have been used to train the model

Regression

estimating a value, numeric prediction

Unbalanced Classes

evaluation based solely on accuracy will not work because one class is so rare

Type 2 Decision

existing decisions that repeat, look for incremental improvements in decision process based on data

Co-occurrence

grouping; market-basket analysis: when people buy one thing, what else are they likely to buy? Increase revenue from cross-selling, improve the customer experience, and use it to build out inventory for regional distribution centers and reduce shipping costs. Complexity control: only use rules that apply to a minimum # or % of transactions; look at the strength of the association (probability) against a threshold; how "surprising" the rule is = informational lift; looking at the difference (instead of the ratio) = leverage

Filter Based Feature Selection

implements the process of selecting the TOP n variables with the highest relationship with the dependent variable. It calculates a bivariate relationship (Pearson correlation by default) for each independent variable. This pill will automatically adjust and keep selecting the best features

Boosted Decision Tree Regression

is also an ensemble algorithm that is similar to a decision forest, but it allows distinct branches to reconverge later in the tree. This is ideal for independent variables that are highly correlated. The Boosted Decision Tree algorithm tends to improve accuracy for some value ranges at the possible risk of reducing accuracy for other ranges

Permutation Feature Importance

it is better in that it estimates the effect of each feature on the label using a trained model; it considers the intercorrelation among all features to give you a much more accurate estimate of how impactful and useful each feature will be. It standardizes the coefficients so that they can be easily rank-ordered from best to worst: those with negative values are not those with negative coefficients; they are those that are actually hurting the accuracy of your model by adding error. It also combines the coefficients for categorical features into a single value. Cons: it cannot be used to automatically select the best variables to go into a model, because it can only be calculated from the output of a trained model

Named Entity Extraction

knowledge intensive, have to be trained/coded

Logistic Regression

a linear regression of the log-odds of the class of interest: convert the probability to log-odds (via the natural log), fit a linear function, and end up with an approximation of the class likelihood
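
Not from the source: a scikit-learn sketch on synthetic data (the dataset and defaults are assumptions) showing that the model's linear score is the log-odds, and that passing it through the logistic function reproduces the predicted probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=4, random_state=2)
model = LogisticRegression().fit(X, y)

# The linear part of the model is the log-odds; the logistic function maps it to a probability
log_odds = model.decision_function(X[:3])
probs = 1 / (1 + np.exp(-log_odds))
print(probs)
print(model.predict_proba(X[:3])[:, 1])  # same numbers via the built-in method
```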

Neural Network Regression

often referred to as "deep learning" and often used for complex problems such as image classification. Any statistical algorithm can be termed a "neural network" if it uses adaptive coefficient weights and can approximate non-linear inputs

Transactional (Operational) Database

operational and mission critical (can be slowed down by using too many resources); low-level detail; normalized; supports daily operations

Ordinal Regression

optimal for dependent variables that represent a rank order, which implies that the distance between each ranking is not necessarily equal. For example, predicting the results of an election would be ideal for an ordinal regression, because the distances between the first and second place candidates and between the second and third place candidates are not equal in terms of votes received.

Ranking Classifiers

present model performance across the spectrum of possible values for the assumptions; either classify each instance and then take the appropriate action, OR create a ranking of all cases and take action on the top-most percentage of cases (useful when you only have a score, face representation or budgetary constraints, or costs and benefits are hard to estimate). If you only have a score, assign the positive class only at a relatively high level of certainty (conservative); even if given a probability, you can treat it like a score and create a ranking

6. Deployment

put the model in place and start using it. Deliverables: deployment plan, monitoring/maintenance plan, final report/presentation, documentation

Data Reduction

replace many attributes with a few 'meta' attributes

Analytical Database

restructure, add redundancies, add summarization, pre-calculated ratios and totals, used by analysts to explore/study data

Model

a simplified representation of reality created to serve a particular purpose. In data science, a model is a formula for estimating an unknown value of interest, usually supervised; it abstracts away some of the complexity of real life to minimize uncertainty about the target value for decision making. What is the cost of being wrong, in either direction?

Pearson Correlation Coefficient

a single number that describes the relationship (r); closer to 0 = weaker, above +/-0.5 = pretty strong
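
A minimal NumPy sketch, not from the source, computing Pearson's r for two made-up variables; the result lands well above the 0.5 rule of thumb on the card.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)

r = np.corrcoef(x, y)[0, 1]  # Pearson's r
print(r)  # ~0.85: above 0.5, so a pretty strong positive relationship
```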

Alternatives to Line Charts

slope chart: change between start and end; cycle plot: trends and seasonality; highlight table: time horizontally, scan for hotspots

Unicorn

subject matter expert, math and statistics expert, computer science expert

Overfitting

the tendency of data mining to tailor models to the training data at the expense of generalization; as complexity increases, so does overfitting. Logistic regression is more susceptible than SVM. Use holdout data to test model fit

Text Mining

text is important, it is EVERYWHERE, and it is difficult to work with because it is unstructured. Representation: use the simplest, least expensive option that works. Document = one piece of text, no matter how large or small; token/term = a word; corpus = a collection of documents

Information Gain

the reduction from the parent's entropy to the weighted entropy after a split; determine which split reduces entropy the MOST and use that first. IG = entropy of the whole - sum over groups of (probability of group x entropy of group)
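
A short sketch, not from the course notes, implementing the information gain formula above on an invented split of ten labels into two purer groups.

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    """Entropy of the whole minus the weighted entropy of each child group."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Invented split: 10 labels divided into two groups by some attribute
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"] * 1, ["yes"] * 1 + ["no"] * 4]
print(information_gain(parent, children))  # ~0.278: the split reduces disorder
```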

Similarity

which instances of data may be similar to others based on what we know

Profit Curves

x-axis = threshold values, expressed as the % of instances targeted; y-axis = profit. Take the candidate classifiers and calculate the confusion matrix and expected value for every threshold from 0-100%, then plot. Extremely sensitive to the assumptions in the model (class priors, cost-benefit estimates)

