CMPE188 Midterm
extraction
Feature ___________: Transform data from a high-dimensional space to a space of fewer dimensions
Q-Q
Quantile-Quantile (__-__) Plot: Graphs the quantiles of one univariate distribution against the corresponding quantiles of another. View: is there a shift going from one distribution to another?
Gini Index
The attribute that provides the smallest ____ _____ (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute).
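A minimal Python sketch (toy labels, illustrative function names) of how a candidate split could be scored with this criterion:

import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_i^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_index_of_split(left_labels, right_labels):
    """Weighted Gini impurity of a binary split; smaller is better."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

# Example: score one candidate splitting point on toy labels
print(gini_index_of_split(np.array([0, 0, 1]), np.array([1, 1, 1])))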
Quantile
________ Plot: Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences). Each value x_i is paired with f_i, indicating that approximately 100*f_i % of the data are <= x_i.
Machine learning
________ __________: a branch of artificial intelligence concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data.
Unsupervised
_____________ Learning: Clustering. Probability distribution estimation. Finding association (in features). Dimension reduction. Success: market segmentation, gene clustering, news aggregation, rule mining, image compression
Neighbor
k-Nearest _________ Algorithm: All instances correspond to points in the n-D space. The nearest neighbors are defined in terms of (usually Euclidean) distance. A flexible approach to estimating the class of a data point when the target function is discrete. For any given X we find the k closest neighbors to X in the training data, and examine their corresponding Y. If the majority of the Y's are true (for instance) we predict true. Needs: a distance metric, the number of neighbors k, how to fit to the local points, and an optional weighting function.
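A rough illustration in Python of the majority-vote rule with Euclidean distance (the toy points and labels are made up for the example):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of k closest points
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

X = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y = np.array(["o", "o", "x", "x"])
print(knn_predict(X, y, np.array([1.5, 1.5]), k=3))   # -> "o"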
V
4 Main _'s: Volume, Variety, Velocity (analysis of streaming data), and Veracity (uncertainty of data). 43 total _'s. They make it hard for the traditional ETL (Extract, Transform, Load) functions to scale to fully exploit the data.
Model
Building the ________: -Algorithm looks for pattern in the data collected as ground truth. -Each of the features is weighed based on the pattern. -Test data is classified based on the weights and the model learned from the ground truth. -The bigger the data the higher the chances of precise truth.
hierarchy
Concept ___________ organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse. Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by higher level concepts (such as youth, adult, or senior). facilitate drilling and rolling in data warehouses to view data in multiple granularity. Some can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
dimensionality
Curse of ______________: When dimensionality increases, data becomes increasingly sparse. Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful. The possible combinations of subspaces grow exponentially.
reduction
Data ________: Dimensionality reduction, Numerosity reduction, Data compression. balancing act between clarity of representation, ease of understanding; and oversimplification: loss of important or relevant information. much smaller in volume but yet produces almost the same analytical results. Methods for data reduction (also data size reduction or numerosity reduction): Regression and Log-Linear Models, Histograms, clustering, sampling, Data cube aggregation, Data compression
cleaning
Data _________: Handle missing data, smooth noisy data, identify or remove outliers, and resolve inconsistencies. Data in the real world is dirty: lots of potentially incorrect data, e.g., instrument faults, human or computer error, and transmission error; data may be incomplete, noisy, inconsistent, or intentionally wrong. 3 steps: data discrepancy detection, data migration and integration, and integration of the two processes (discrepancy detection and transformation).
integration
Data _________: Integration of multiple databases, data cubes, or files. Entity identification problem; Remove redundancies; Detect inconsistencies
scrubbing
Data __________: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
discrepancy
Data ___________ detection: Use metadata (e.g., domain, range, dependency, distribution). Check field overloading. Check the uniqueness rule, consecutive rule (e.g., bank check numbers), and null rule (what character should represent a null value). Use commercial tools such as data scrubbing or data auditing.
auditing
Data ___________: by analyzing data to discover rules and relationships to detect violators (e.g., correlation and clustering to find outliers)
Compression
Data ____________: String compression (lossless), audio/video compression (lossy). Data reduction and dimensionality reduction may also be considered forms of it. Example technique: wavelet transform.
transformation
Data _______________: A function that maps the entire set of values of a given attribute to a new set of replacement values s.t. each old value can be identified with one of the new values. Methods: -Smoothing: Remove noise from data -Attribute/feature construction -Aggregation: Summarization, data cube construction -Normalization: Scaled to fall within a smaller, specified range -Discretization: Concept hierarchy climbing
integration
Data migration and ____________: -Data migration tools: allow transformations to be specified -ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
50
Data traffic is growing at nearly ___% each year.
reduction
Dimensionality ___________: Reducing the number of random variables under consideration, via obtaining a set of principal variables. Advantages: Avoid the curse of dimensionality, Help eliminate irrelevant features and reduce noise, Reduce time and space required in data mining, Allow easier visualization. Methodologies: Principal Component Analysis, Feature subset selection, Feature creation
Correlation analysis
Discretization by __________ __________ (e.g., Chi-merge: χ2-based discretization): Bottom-up merge: Find the best neighboring intervals (those having similar distributions of classes, based on χ2 values) to merge. Merge performed recursively, until a predefined stopping condition
Classification
Discretization by _____________ (e.g., decision tree analysis): Supervised: Given class labels, e.g., cancerous vs. benign. Using entropy to determine split point (discretization point). Top-down, recursive split
Stream
Discretized ________ Processing: Run a streaming computation as a series of very small, deterministic batch jobs. Chop up the live stream into batches of X seconds; Spark treats each batch of data as RDDs and processes them using RDD operations; Finally, the processed results of the RDD operations are returned in batches.
noisy
Handle ______ data by: -Binning: First sort data and partition into (equal-frequency) bins Then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. -Regression: Smooth by fitting the data into regression functions -Clustering: Detect and remove outliers -Semi-supervised: Combined computer and human inspection, Detect suspicious values and check by human (e.g., deal with possible outliers)
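A small Python sketch of smoothing by (equal-frequency) bin means, assuming a simple sort-and-split binning; the sample values are just a toy example:

import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning, then replace each value with its bin's mean."""
    values = np.sort(values)
    bins = np.array_split(values, n_bins)     # approximately equal-frequency bins
    return np.concatenate([np.full(len(b), b.mean()) for b in bins])

data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
print(smooth_by_bin_means(data, 3))
# bins [4, 8, 15], [21, 21, 24], [25, 28, 34] -> smoothed to means 9, 22, 29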
Missing
How to Handle _________ Data: Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably. Fill in the missing value manually: tedious + infeasible? Fill it in automatically with a global constant, the attribute mean, the mean for all samples belonging to the same class, or an inference-based method such as the Bayesian formula or a decision tree.
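A hedged pandas sketch of the automatic fill-in options (the tiny DataFrame and column names are invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "income": [40.0, None, 55.0, None, 70.0],
    "label":  ["low", "low", "high", "high", "high"],
})

# Fill with a global constant
df["const_fill"] = df["income"].fillna(0)

# Fill with the overall attribute mean
df["mean_fill"] = df["income"].fillna(df["income"].mean())

# Fill with the mean of samples belonging to the same class
df["class_mean_fill"] = df["income"].fillna(
    df.groupby("label")["income"].transform("mean"))
print(df)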
Gain
Information _____: IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X? IG(Y|X) = H(Y) - H(Y | X)
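A short Python sketch of IG(Y|X) = H(Y) - H(Y|X) on toy arrays (helper names are illustrative):

import numpy as np

def entropy(labels):
    """H(Y) = -sum p_i log2 p_i over the class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, x):
    """IG(Y|X) = H(Y) - H(Y|X), where H(Y|X) is a weighted average over values of X."""
    h_y_given_x = 0.0
    for v in np.unique(x):
        mask = (x == v)
        h_y_given_x += mask.mean() * entropy(y[mask])   # P(X=v) * H(Y|X=v)
    return entropy(y) - h_y_given_x

y = np.array([1, 1, 0, 0])
x = np.array(["a", "a", "b", "b"])
print(information_gain(y, x))   # 1.0 bit: knowing X removes all uncertainty about Y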
true
It takes humongous data to see statistical patterns emerge and to have meaningful hypotheses generated from the data automatically?
classifier
Linear _________: Builds a classification model using a straight line. Used for binary classification of (categorical) data. f(x) is a linear function based on the example's attribute values. a) The prediction is based on the value of f(x), b) data above the separating line belongs to class 'x' (i.e., f(x) > 0), c) data below the line belongs to class 'o' (i.e., f(x) < 0). Ex: Linear Discriminant Analysis, Logistic Regression, Perceptron, SVM.
regression
Linear __________: Data modeled to fit a line. Linear equation: y = w X + b. Often uses the least-square method to fit line. Used to predict continuous values
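A minimal least-squares fit with NumPy on made-up points roughly following y = 2x + 1:

import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Least-squares fit of y = w*X + b via a degree-1 polynomial fit
w, b = np.polyfit(X, y, 1)
print(f"w={w:.2f}, b={b:.2f}")          # fitted slope and intercept
print("prediction at x=5:", w * 5 + b)  # predict a continuous value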
quality
Measures for data ______: -Accuracy: correct or wrong, accurate or not -Completeness: not recorded, unavailable, ... -Consistency: some modified but some not, dangling, ... -Timeliness: timely update? -Believability: how much the data can be trusted to be correct -Interpretability: how easily the data can be understood
Lunch
No Free _______ Theorem: There is a lack of inherent superiority of any classifier. If we make no prior assumption about the nature of the classification task, no classification method is superior overall; averaged over all possible tasks, no algorithm is even superior to random guessing.
PCA
Principal Component Analysis (____): How to find the 'best' low-dimensional space that conveys maximum useful information? A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The original data are projected onto a much smaller space, resulting in dimensionality reduction. Method: find the eigenvectors of the covariance matrix; these eigenvectors define the new space. Assumes relationships among variables are LINEAR. The eigenvector with the highest eigenvalue is the principal component of the data set. Trick: rotate coordinate axes.
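A compact NumPy sketch of the covariance-eigenvector method described above (random toy data, not tied to any course dataset):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 observations, 3 features
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)     # make one feature correlated

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvectors define the new axes

order = np.argsort(eigvals)[::-1]       # sort by eigenvalue, largest first
components = eigvecs[:, order[:2]]      # keep the top-2 principal components
X_reduced = Xc @ components             # project onto the smaller space
print(X_reduced.shape)                  # (100, 2): reduced from 3 dimensions to 2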
ROC
Receiver Operating Characteristic (____): Developed in WWII to statistically model false positive and false negative detections of radar operators. Better statistical foundations than most other measures. Standard measure in medicine and biology. Becoming more popular in ML. Properties: -Slope is non-increasing after a point -Each point on the ROC curve represents a different tradeoff (cost ratio) between false positives and false negatives / true positives -Slope of the line tangent to the curve defines the cost ratio -ROC area represents performance averaged over all possible cost ratios -If two ROC curves do not intersect, one method dominates the other -If two ROC curves intersect, one method is better for some cost ratios, and the other method is better for other cost ratios
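A brief scikit-learn sketch, assuming sklearn is available, of computing the ROC points and the area under the curve for toy scores:

from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 1, 0, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # each point: one FP/TP tradeoff
print("AUC:", roc_auc_score(y_true, y_scores))      # area averages over all cost ratios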
with
Sampling ________ replacement: A selected object is not removed from the population
without
Sampling ________ replacement: Once an object is selected, it is removed from the population
Entropy
Specific Conditional ________: H(Y |X=v) = The entropy of Y among only those records in which X has value v
eigenvalues
Since x is required to be nonzero, the ____________ must satisfy det(A - λI) = 0, which is called the characteristic equation. Solving it for values of λ gives the eigenvalues of matrix A.
SVM
Support Vector Machine (____): Used for text categorization, image classification, bioinformatics, hand-writing recognition. Advantages: Handles mixed variables, Handles missing data, Handles nonlinear data, Efficient for high dimensional data sets, Effective when dimensions > samples, Easy to understand, Predictive power
SVM
Support Vector Machines (____): finds the optimal separating hyperplane which maximizes the margin of the separated training data, i.e., the maximum marginal hyperplane. Used for both linear and nonlinear data. Uses a nonlinear mapping to transform the original training data into a higher dimension. With the new dimension, it searches for the linear optimal separating hyperplane (i.e., "decision boundary"). A kernel implicitly maps the data to a higher dimension (e.g., from 2D to 3D), making the classes linearly separable in the new space.
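A short scikit-learn sketch, assuming sklearn is available, of a kernel SVM on nonlinearly separable toy data:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Nonlinearly separable 2D data (concentric circles)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps into a higher-dimensional space where a
# maximum-margin linear separating hyperplane can be found
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))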
big data
Technology comprising tools and techniques to extract value from huge sets of data. Fusion: data coming together from various sources. Fission: analyzing that data
e
The difficulty of estimating f (unknown functions) will depend on the standard deviation of the _'s (random error with mean zero).
ML
Typical __ Task Process: -Collect: Collect the ground truth. Toughest challenge. -Extract: Extract features, working with the ground truth (training data). -Learn: Learn the model using one of the algorithms. -Apply: Apply the model to the test data (e.g., tweets). -Quantify: Quantify the accuracy of the model. -Tune: Fine-tune the algorithm for the best fit.
Co
__-training: Train two models with different, independent feature sets. Add the most confident instances from U of one model into L of the other (i.e., they 'teach' each other). Repeat.
true
X: observable variables (features) Y: target variables (class labels)?
Eager
_______ learning (previously discussed methods): Given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify
F
_-measure or balanced _-score (F1 score) is the harmonic mean of precision and recall: = 2*(precision*recall)/(precision+recall). Problem: Ignores domain characteristics - gives equal importance to precision and recall
Histograms
____________: reveal more than boxplots. x-axis are values, y-axis are frequencies.
Log-linear
___-_______ model: A math model that takes the form of a function whose logarithm is a linear combination of the parameters of the model, which makes it possible to apply (possibly multivariate) linear regression. Estimate the probability of each point (tuple) in a multi-dimension space for a set of discretized attributes, based on a smaller subset of dimensional combinations. Useful for dimensionality reduction and data smoothing. modeling is a generalized linear model where the output is assumed to have a Poisson distribution.
Non-parametric
___-__________ methods for data reduction: Do not assume models. Major families: histograms, clustering, sampling
kNN
____ Regression is similar to the kNN classifier. To predict Y for a given value of X, consider the k closest points to X in the training data and take the average of their responses. Distance-weighted nearest neighbor algorithm: weigh the contribution of each of the k neighbors according to its distance to the query, giving greater weight to closer neighbors.
Big Data
____ ____ CAN Perform Diagnostic, Predictive, and Prescriptive analysis; Find relations between elements/events; Monitor events realtime; Optimize operations, improve Quality of Life; Solve unsolved problems from the past; and Send focused, tailor-made communications. CAN NOT Translate a business problem into an analytics problem; Give definitive answers with 100% accuracy; Obsolete legacy machines; Make machines behave autonomously; Be the silver bullet for every problem.
Map Reduce
____ ______: Map - distribute the task among multiple computers. Reduce - take the results from each computer and combine them.
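A toy word-count sketch of the map and reduce phases in plain Python (single process, just to show the data flow):

from collections import defaultdict
from itertools import chain

documents = ["big data is big", "data mining uses big data"]

# Map: each "worker" emits (word, 1) pairs for its chunk of the input
mapped = chain.from_iterable(((w, 1) for w in doc.split()) for doc in documents)

# Reduce: combine the partial results by key
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one
print(dict(counts))   # {'big': 3, 'data': 3, 'is': 1, 'mining': 1, 'uses': 1}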
RDD
____ abstraction: Resilient Distributed Datasets. Partitioned collection of records. Spread across the cluster. Read-only. Caching dataset in memory, different storage levels available, fallback to disk possible.
axes
____ in a multi-dimensional space represent features.
ETL
____ methods can neither cope with the velocity of the data generation nor can deal with the veracity issues of the data. Hence there is a need for Big Data Analytics.
PC1
____: The eigenvalue with the largest absolute value indicates that the data have the largest variance along its eigenvector, the direction along which there is the greatest variation. In general, only a few directions manage to capture most of the variability in the data.
KNN
____: completely non-parametric, No assumptions are made about the shape of the decision boundary. Advantages: We can expect KNN to dominate Logistic Regression when the decision boundary is highly non-linear. Disadvantages: KNN does not tell us which predictors are important (no table of coefficients)
PC2
____: the direction with maximum variation left in data, orthogonal to the 1st PC
Data Cube
_____ _____: Queries regarding aggregated information should be answered using it when possible. Multiple levels of aggregation. Further reduce the size of data to deal with. Reference appropriate levels, use the smallest representation which is enough to solve the task. The lowest level: aggregated data for an individual entity of interest, E.g. a customer in a phone company data warehouse
Lazy
_____ learning (e.g., instance-based learning): Simply stores training data (or does only minor processing) and waits until it is given a test tuple. Less time in training but more time in predicting. Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified. Typical approach = the k-nearest neighbor approach.
Semi
_____-Supervised Learning: Can we improve the quality of learning by combining labeled and unlabeled data? Usually a lot more unlabeled data is available than labeled. Assumes a set of labeled data L and a set of unlabeled data U from the same distribution. Approaches: self-training and multi-model.
Self
_____-Training: Train supervised model on labeled data L. Test on unlabeled data U. Add the most confidently classified members of U to L. Repeat
Multi-view
_____-____ learning: Train multiple diverse models on L. Those instances in U which most models agree on are placed in L.
OLAP
______ Operations: -Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction -Drill down (roll down): reverse roll-up from higher level summary to lower level summary or detailed data, or introducing new dimensions -Slice and dice: Project and select -Pivot (rotate): Reorient the cube, visualization, 3D to series of 2D planes -Other operations: Drill across: involving (across) more than one fact table
Naive Bayes
______ _______: requires initial knowledge of many probabilities, which may not be available or may involve significant computational cost. Generative classifier. Requires each conditional probability to be non-zero. Strength: easy to implement, good results obtained in most of the cases. Weakness: assumes attributes are conditionally independent, therefore a loss of accuracy; practically, dependencies exist among variables which cannot be modeled by it. Use Bayesian Belief Networks to deal with dependencies. hypothesis = label
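A minimal scikit-learn sketch, assuming sklearn is available, of a (Gaussian) naive Bayes classifier on made-up numeric features:

from sklearn.naive_bayes import GaussianNB

# Tiny toy data: two numeric features, binary class label
X = [[1.0, 2.1], [1.2, 1.9], [3.0, 3.5], [3.2, 3.7]]
y = [0, 0, 1, 1]

clf = GaussianNB().fit(X, y)               # assumes conditional independence of features
print(clf.predict([[1.1, 2.0]]))           # -> [0]
print(clf.predict_proba([[1.1, 2.0]]))     # posterior P(class | features)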
Bayes
______' Theorem: Shows the relationship between a conditional probability and its inverse, i.e., it allows us to make an inference from the probability of a hypothesis given the evidence to the probability of that evidence given the hypothesis, and vice versa. Helps us find P(B | A) when P(A | B) is already known: P(B | A) = P(A | B) P(B) / P(A).
Spark
______: Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop. Efficient and usable. runs Discretized Stream Processing that uses RDD abstractions.
Model
_______ Validation and Testing: Test = estimate accuracy of the model. The known label of test sample is compared with the classified result from the model. Accuracy = % of test set samples that are correctly classified by the model. Test set is independent of training set. Validation = if the test set is used to select or refine models
Occam's Razor
_______ _______: Given two models of similar generalization errors, one should prefer the simpler model over the more complex model. For complex models, there is a greater chance that they were fitted accidentally by errors in the data. Therefore, one should include model complexity when evaluating a model.
Decision Tree
_________ _____: Leaves represent classifications and branches represent tests on features that lead to those classifications. Each internal node tests an attribute. Each branch corresponds to an attribute value. Each leaf node assigns a classification. There could be more than one tree that fits the same data! One of the simplest algorithms, no parameters. Outcome is transparent to the user! Able to handle both numerical and categorical data. Robust, performs well with large data in a short time. Methods include ID3, C4.5, and CART; a common splitting criterion is Information Gain (IG).
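A short scikit-learn sketch, assuming sklearn is available and using its bundled iris data, of a Gini-based decision tree:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)

# Internal nodes test an attribute; leaves assign a classification
print(export_text(tree))
print("training accuracy:", tree.score(X, y))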
Feature vector
_________ ________: a one-dimensional matrix that can have magnitude and direction. Stores the features for a particular observation in a specific order. Maps to the vector space (N+1 dimensions).
Logistic regression
_________ __________: Key idea: turns linear predictions into probabilities using the sigmoid function. More smooth than a linear probability model. Its parameters comprise the weights, which are determined by maximizing the log likelihood. Used for regression and classification.
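A tiny Python sketch of the sigmoid turning a linear score into a probability (the example weights are arbitrary, not learned):

import numpy as np

def sigmoid(z):
    """Squash a linear score into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, weights, bias):
    """Turn the linear prediction w.x + b into P(y=1 | x)."""
    return sigmoid(X @ weights + bias)

X = np.array([[0.5, 1.0], [2.0, 2.5]])
w = np.array([1.2, -0.7])      # weights would normally be learned by maximizing log likelihood
print(predict_proba(X, w, bias=0.1))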
Wavelet
__________ Transform: Decomposes a signal into different frequency subbands. Applicable to n-dimensional signals. Data are transformed to preserve relative distance between objects at different levels of resolution. Allow natural clusters to become more distinguishable. Used for image compression. C represents how closely correlated the wavelet is with this section of the signal. The higher C is, the more the similarity.
Attribute Creation
__________ _________(feature generation): Create new attributes (features) that can capture the important information in a data set more effectively than the original ones. Three general methodologies: -Attribute extraction: Domain-specific -Mapping data to new space (see: data reduction) -Attribute construction: Combining features (look for frequent patterns in data and combine). Data discretization (binning, histogram analysis etc)
Variance
__________ of a single random variable X provides a measure of how much the value of X deviates from the mean or expected value of X. Sample covariance is a generalization of the sample variance.
Multiple
__________ regression: Y = b0 + b1 X1 + b2 X2. Allows a response variable Y to be modeled as a linear function of multidimensional feature vector. Many nonlinear functions can be transformed into the above
Stratified
__________ sampling: Partition (or cluster) the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
Residual
__________ variation is information in A that is not retained in X.
Entropy
__________: measures the lack of information of a system. Shannon's measure of information is the number of bits to represent the amount of uncertainty (randomness) in a data source, and is defined as ________. = expected number of bits needed to encode class (+ or -) of randomly drawn members of S
Sampling
__________: obtaining a small sample s to represent the whole data set N. Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data. Key principle: Choose a representative subset of the data; Simple random sampling may have very poor performance in the presence of skew; Develop adaptive sampling methods, e.g., stratified sampling
Histogram
___________ Analysis: Divide data into buckets and store average (sum) for each bucket. Partitioning rules: Equal-width = equal bucket range; Equal-frequency (or equal-depth).
Supervised
___________ Learning: Prediction, Classification(discrete labels), Regression (real values). The training data such as observations or measurements are accompanied by labels indicating the classes which they belong to. New data is classified based on the models built from the training set. Key issue: generalization (can't just memorize the training set). Successes: face detection, steering an autonomous car, detecting credit card fraud, etc.
Generative
___________ classifier model p(Y, X): Models how the data was "generated"? "What is the likelihood this or that class generated this instance?" and pick the one with higher probability. Ex: Naive Bayes, Bayesian Networks (not covered)
Parametric
___________ methods (e.g., regression) for data reduction: Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
Nonlinear
___________ regression: Data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations
Correlation
____________ analysis for categorical data: Χ2 (chi-square) test. Null hypothesis: The two distributions are independent. The cells that contribute the most to the Χ2 value are those whose actual count is very different from the expected count. The larger the Χ2 value, the more likely the variables are related.
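A small SciPy sketch, assuming scipy is available, of the chi-square independence test on a toy contingency table:

import numpy as np
from scipy.stats import chi2_contingency

# Toy contingency table of observed counts for two categorical variables
observed = np.array([[250, 200],
                     [ 50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)   # large chi2 / small p-value: the variables are likely related
print(expected)        # expected counts under the null hypothesis of independence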
Redundant
____________ data: -Object identification: The same attribute or object may have different names in different databases -Derivable data: One attribute may be a "derived" attribute in another table, e.g., annual revenue -Redundant attributes may be able to be detected by correlation analysis and covariance analysis. Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
Clustering
____________: Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only. Can have hierarchical clustering and be stored in multi- dimensional index tree structures.
Classification
____________: Predict categorical class labels(discrete or nominal). Construct a model based on the training set and the class labels and use it in classifying new data
Overfitting
____________: You can perfectly fit to any training data. Zero bias, high variance. Two approaches: Stop growing the tree when further splitting the data does not yield an improvement; Grow a full tree, then prune the tree, by eliminating nodes.
clustering
____________: class label is unknown. group to form new categories. Principle: maximizing intra-class similarity and minimizing interclass similarity
Regression
_____________ analysis: A collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables. Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships. Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used
Binning
_____________: -Equal-width (distance) partitioning: Divides the range into N intervals of equal size: uniform grid. if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B -A)/N. The most straightforward, but outliers may dominate presentation. Skewed data is not handled well -Equal-depth (frequency) partitioning: Divides the range into N intervals, each containing approximately same number of samples. Good data scaling. Managing categorical attributes can be tricky
Discretization
_____________: Divide the range of a continuous attribute into intervals. Interval labels can then be used to replace actual data values. Can be performed recursively on an attribute. Three types of attributes: -Nominal: values from an unordered set, e.g., color, profession -Ordinal: values from an ordered set, e.g., military or academic rank -Numeric: numbers, e.g., integer or real values
Discriminative
______________ classifier model p(Y | X): Uses the data to create a decision boundary. Ex: Logistic Regression, Support Vector Machines. Strength: prediction accuracy is generally high compared to generative models; robust, works when training examples contain errors; fast evaluation of the learned target function compared to Bayesian networks (which are normally slow). Criticism: long training time; difficult to understand the learned function (weights); not easy to incorporate domain knowledge
Reinforcement
_______________ Learning: Given a sequence of examples/states and a reward after completing that sequence, learn to predict the action to take for an individual example/state.