CMPE188 Midterm


extraction

Feature ___________: Transform the data in the high-dimensional space to a space of fewer dimensions

Q-Q

Quantile-Quantile (__-__) Plot: Graphs the quantiles of one univariate distribution against the corresponding quantiles of another. View: is there a shift going from one distribution to the other?

Gini Index

The attribute that provides the smallest ____ _____ (or, equivalently, the largest reduction in impurity) is chosen to split the node (all possible splitting points for each attribute need to be enumerated).

Quantile

________ Plot: Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences). Each value xi is paired with fi, indicating that approximately 100*fi % of the data are <= xi.

Machine learning

________ __________: a branch of artificial intelligence concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data.

Unsupervised

_____________ Learning: Clustering. Probability distribution estimation. Finding association (in features). Dimension reduction. Success: market segmentation, gene clustering, news aggregation, rule mining, image compression

Neighbor

k-Nearest _________ Algorithm: All instances correspond to points in the n-D space. The nearest neighbors are defined in terms of (usually Euclidean) distance. A flexible approach to estimating the class of a data point when the target function is discrete. For any given X we find the k closest neighbors to X in the training data and examine their corresponding Y; if the majority of the Y's are true (for instance), we predict true. Needs: a distance metric, the number of neighbors k, how to fit to the local points, and an optional weighting function.
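
A minimal sketch of the idea in Python (Euclidean distance, majority vote; the array names and the choice k=3 are illustrative, not from the course):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the k closest neighbors
    nearest = np.argsort(dists)[:k]
    # majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]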

V

4 Main _'s: Volume, Variety, Velocity (analysis of streaming data), and Veracity (uncertainty of data); 43 _'s in total have been proposed. These properties make it hard for the traditional ETL (Extract, Transform, Load) functions to scale to fully exploit the data.

Model

Building the ________: -The algorithm looks for patterns in the data collected as ground truth. -Each of the features is weighted based on the pattern. -Test data is classified based on the weights and the model learned from the ground truth. -The bigger the data, the higher the chance of learning a precise model.

hierarchy

Concept ___________ organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse. Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior). Facilitates drill-down and roll-up in data warehouses to view data at multiple granularities. Some hierarchies can be automatically generated based on an analysis of the number of distinct values per attribute in the data set.

dimensionality

Curse of ______________: When dimensionality increases, data becomes increasingly sparse. Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful. The number of possible subspace combinations grows exponentially.

reduction

Data ________: Dimensionality reduction, numerosity reduction, data compression. A balancing act between clarity of representation and ease of understanding on one side, and oversimplification (loss of important or relevant information) on the other. The result is much smaller in volume yet produces almost the same analytical results. Methods for data reduction (also called data size reduction or numerosity reduction): regression and log-linear models, histograms, clustering, sampling, data cube aggregation, data compression.

cleaning

Data _________: Handle missing data, smooth noisy data, identify or remove outliers, and resolve inconsistencies. Data in the real world is dirty: lots of potentially incorrect data, e.g., from faulty instruments, human or computer error, and transmission error; data can be incomplete, noisy, inconsistent, or intentionally distorted. 3 steps: data discrepancy detection, data migration and integration, and integration of the two processes (discrepancy detection and transformation).

integration

Data _________: Integration of multiple databases, data cubes, or files. Entity identification problem; Remove redundancies; Detect inconsistencies

scrubbing

Data __________: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections

discrepancy

Data ___________ detection: Use metadata (e.g., domain, range, dependency, distribution). Check field overloading. Check the uniqueness rule, consecutive rule (e.g., bank check numbers), and null rule (what character should represent a null value). Use commercial tools such as data scrubbing or data auditing tools.

auditing

Data ___________: By analyzing data to discover rules and relationships and detect violators (e.g., using correlation and clustering to find outliers).

Compression

Data ____________: String compression (lossless), audio/video compression (lossy). Data reduction and dimensionality reduction may also be considered forms of it. Wavelet transform.

transformation

Data _______________: A function that maps the entire set of values of a given attribute to a new set of replacement values s.t. each old value can be identified with one of the new values. Methods: -Smoothing: Remove noise from data -Attribute/feature construction -Aggregation: Summarization, data cube construction -Normalization: Scaled to fall within a smaller, specified range -Discretization: Concept hierarchy climbing

integration

Data migration and ____________: -Data migration tools: allow transformations to be specified -ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface

50

Data traffic is growing at nearly ___% each year.

reduction

Dimensionality ___________: Reducing the number of random variables under consideration, via obtaining a set of principal variables. Advantages: Avoid the curse of dimensionality, Help eliminate irrelevant features and reduce noise, Reduce time and space required in data mining, Allow easier visualization. Methodologies: Principal Component Analysis, Feature subset selection, Feature creation

Correlation analysis

Discretization by __________ __________ (e.g., Chi-merge: χ2-based discretization): Bottom-up merge: Find the best neighboring intervals (those having similar distributions of classes, based on χ2 values) to merge. Merge performed recursively, until a predefined stopping condition

Classification

Discretization by _____________ (e.g., decision tree analysis): Supervised: Given class labels, e.g., cancerous vs. benign. Using entropy to determine split point (discretization point). Top-down, recursive split

Stream

Discretized ________ Processing: Run a streaming computation as a series of very small, deterministic batch jobs. Chop up the live stream into batches of X seconds; Spark treats each batch of data as RDDs and processes them using RDD operations; Finally, the processed results of the RDD operations are returned in batches.

noisy

Handle ______ data by: -Binning: First sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc. -Regression: Smooth by fitting the data to regression functions -Clustering: Detect and remove outliers -Semi-supervised: Combined computer and human inspection; detect suspicious values and have a human check them (e.g., to deal with possible outliers)
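
As a rough illustration of smoothing by bin means (equal-frequency bins; the sample values follow the common textbook example, and the function name is illustrative):

import numpy as np

def smooth_by_bin_means(values, n_bins=3):
    data = np.sort(np.asarray(values, dtype=float))   # sort first
    bins = np.array_split(data, n_bins)               # equal-frequency partition
    # replace every value in a bin by that bin's mean
    return [b.mean() for b in bins for _ in b]

# smooth_by_bin_means([4, 8, 9, 15, 21, 21, 24, 25, 26])
# -> [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]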

Missing

How to Handle _________ Data: Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably. Fill in the missing value manually: tedious + infeasible? Fill it in automatically with a global constant, the attribute mean, the mean for all samples belonging to the same class, or an inference-based method such as the Bayesian formula or a decision tree.

Gain

Information _____: IG(Y|X): I must transmit Y; how many bits on average would it save me if both ends of the line knew X? IG(Y|X) = H(Y) - H(Y|X)
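
A small sketch of computing IG(Y|X) = H(Y) - H(Y|X) for discrete variables (pure Python; the function names are illustrative):

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    # H(Y) minus the weighted average of H(Y | X = v) over the values v of X
    h_y_given_x = 0.0
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        h_y_given_x += len(subset) / len(ys) * entropy(subset)
    return entropy(ys) - h_y_given_x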

true

It takes humongous amounts of data for statistical patterns to emerge and for meaningful hypotheses to be generated from the data automatically?

classifier

Linear _________: Builds a classification model using a straight line. Used for binary classification (categorical data). f(x) is a linear function based on the example's attribute values. a) The prediction is based on the value of f(x), b) Data above the blue line belongs to class 'x' (i.e., f(x) > 0), c) Data below the blue line belongs to class 'o' (i.e., f(x) < 0). Ex: Linear Discriminant Analysis, Logistic Regression, Perceptron, SVM.

regression

Linear __________: Data modeled to fit a line. Linear equation: y = wX + b. Often uses the least-squares method to fit the line. Used to predict continuous values.
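
For example, a least-squares fit of y = wX + b with NumPy (the sample data points are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

w, b = np.polyfit(x, y, deg=1)   # least-squares slope and intercept
y_pred = w * x + b               # predict continuous values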

quality

Measures for data ______: -Accuracy: correct or wrong, accurate or not -Completeness: not recorded, unavailable, ... -Consistency: some modified but some not, dangling, ... -Timeliness: timely update? -Believability: how much the data can be trusted to be correct -Interpretability: how easily the data can be understood

Lunch

No Free _______ Theorem: There is a lack of inherent superiority of any classifier. If we make no prior assumption about the nature of the classification task, no classification method is superior overall. no algorithm is superior overall to random guessing.

PCA

Principal Component Analysis (____): How to find the 'best' low-dimensional space that conveys maximum useful information? A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The original data are projected onto a much smaller space, resulting in dimensionality reduction. Method: find the eigenvectors of the covariance matrix; these eigenvectors define the new space. Assumes relationships among variables are LINEAR. The eigenvector with the highest eigenvalue is the principal component of the data set. Trick: rotate the coordinate axes.
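
A minimal sketch of the method as described above, assuming a NumPy matrix X whose rows are observations (the function name and n_components default are illustrative):

import numpy as np

def pca(X, n_components=2):
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigen-decomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]        # largest eigenvalue (principal component) first
    components = eigvecs[:, order[:n_components]]
    return Xc @ components                   # project onto the smaller space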

ROC

Receiver Operating Characteristic (____): Developed in WWII to statistically model false positive and false negative detections by radar operators. Better statistical foundations than most other measures. Standard measure in medicine and biology. Becoming more popular in ML. Properties: -Slope is non-increasing after a point -Each point on the ROC represents a different tradeoff (cost ratio) between false positives and false negatives / true positives -The slope of the line tangent to the curve defines the cost ratio -ROC area represents performance averaged over all possible cost ratios -If two ROC curves do not intersect, one method dominates the other -If two ROC curves intersect, one method is better for some cost ratios, and the other method is better for other cost ratios

with

Sampling ________ replacement: A selected object is not removed from the population

without

Sampling ________ replacement: Once an object is selected, it is removed from the population

Entropy

Specific Conditional ________: H(Y |X=v) = The entropy of Y among only those records in which X has value v

eigenvalues

Since x is required to be nonzero, the ____________ must satisfy det(A - λI) = 0, which is called the characteristic equation. Solving it for values of λ gives the eigenvalues of matrix A.

SVM

Support Vector Machine (____): Used for text categorization, image classification, bioinformatics, hand-writing recognition. Advantages: Handles mixed variables, Handles missing data, Handles nonlinear data, Efficient for high dimensional data sets, Effective when dimensions > samples, Easy to understand, Predictive power

SVM

Support Vector Machines (____): finds the optimal separating hyperplane that maximizes the margin of the separated training data, i.e., the maximum marginal hyperplane. Used for both linear and nonlinear data. Uses a nonlinear mapping to transform the original training data into a higher dimension; with the new dimensions, it searches for the linear optimal separating hyperplane (i.e., the "decision boundary"). A kernel implicitly maps the data to a higher-dimensional space (e.g., from 2D to 3D), where the problem can become linearly separable.

big data

Technology comprising tools and techniques to extract value from huge sets of data. Fusion: data coming together from various sources. Fission: analyzing that data.

e

The difficulty of estimating f (unknown functions) will depend on the standard deviation of the _'s (random error with mean zero).

ML

Typical __ Task Process: -Collect: Collect the ground truth; the toughest challenge. -Extract: Extract features, working with the ground truth (training data). -Learn: Learn the model using one of the algorithms. -Apply: Apply the model to the test data (e.g., tweets). -Quantify: Quantify the accuracy of the model. -Tune: Fine-tune the algorithm for the best fit.

Co

__-training: Train two models with different independent feature sets. Add the most confident instances from U of one model into L of the other (i.e., they 'teach' each other). Repeat.

true

X: observable variables (features) Y: target variables (class labels)?

Eager

_______ learning (previously discussed methods): Given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify

F

_-measure or balanced _-score (F1 score) is the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall). Problem: ignores domain characteristics; gives equal importance to precision and recall.
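
For instance, computed directly from confusion-matrix counts (the numbers below are made up for illustration):

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# f1_score(tp=40, fp=10, fn=10) -> precision = recall = 0.8, so F1 = 0.8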

Histograms

____________: reveal more than boxplots. The x-axis shows values; the y-axis shows frequencies.

Log-linear

___-_______ model: A mathematical model that takes the form of a function whose logarithm is a linear combination of the parameters of the model, which makes it possible to apply (possibly multivariate) linear regression. Estimates the probability of each point (tuple) in a multi-dimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations. Useful for dimensionality reduction and data smoothing. This kind of modeling is a generalized linear model where the output is assumed to have a Poisson distribution.

Non-parametric

___-__________ methods for data reduction: Do not assume models. Major families: histograms, clustering, sampling

kNN

____ Regression is similar to the kNN classifier. To predict Y for a given value of X, consider the k closest points to X in the training data and take the average of their responses. Distance-weighted nearest neighbor algorithm: weigh the contribution of each of the k neighbors according to their distance to the query, giving greater weight to closer neighbors.
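
A sketch of the distance-weighted version in NumPy (weights = 1/distance; the array names and k value are illustrative):

import numpy as np

def knn_regress(X_train, y_train, x_query, k=5):
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-9)   # closer neighbors get greater weight
    return np.sum(weights * y_train[nearest]) / np.sum(weights)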

Big Data

____ ____ CAN Perform Diagnostic, Predictive, and Prescriptive analysis; Find relations between elements/events; Monitor events realtime; Optimize operations, improve Quality of Life; Solve unsolved problems from the past; and Send focused, tailor-made communications. CAN NOT Translate a business problem into an analytics problem; Give definitive answers with 100% accuracy; Obsolete legacy machines; Make machines behave autonomously; Be the silver bullet for every problem.

Map Reduce

____ ______: Map - distribute the task among multiple computers. Reduce - take the results from each computer and combine them.
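
The classic word-count illustration of the idea, sketched in plain Python (single process; a real MapReduce framework would distribute the map and reduce steps across machines):

from collections import defaultdict

docs = ["big data is big", "data is data"]

# Map: every document emits (word, 1) pairs
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Reduce: combine the counts for each word
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one
# counts -> {'big': 2, 'data': 3, 'is': 2}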

RDD

____ abstraction: Resilient Distributed Datasets. Partitioned collection of records. Spread across the cluster. Read-only. Caching dataset in memory, different storage levels available, fallback to disk possible.

axes

____ in a multi-dimensional space represent features.

ETL

____ methods can neither cope with the velocity of the data generation nor can deal with the veracity issues of the data. Hence there is a need for Big Data Analytics.

PC1

____: The eigenvalue with the largest absolute value will indicate that the data have the largest variance along its eigenvector, the direction along which there is greatest variation. In general, only few directions manage to capture most of the variability in the data.

KNN

____: completely non-parametric, No assumptions are made about the shape of the decision boundary. Advantages: We can expect KNN to dominate Logistic Regression when the decision boundary is highly non-linear. Disadvantages: KNN does not tell us which predictors are important (no table of coefficients)

PC2

____: the direction with maximum variation left in data, orthogonal to the 1st PC

Data Cube

_____ _____: Queries regarding aggregated information should be answered using it when possible. Multiple levels of aggregation. Further reduce the size of data to deal with. Reference appropriate levels, use the smallest representation which is enough to solve the task. The lowest level: aggregated data for an individual entity of interest, E.g. a customer in a phone company data warehouse

Lazy

_____ learning (e.g., instance-based learning): Simply stores training data (or does only minor processing) and waits until it is given a test tuple. Less time in training but more time in predicting. Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified. Typical approach: the k-nearest neighbor approach.

Semi

_____-Supervised Learning: Can we improve the quality of learning by combining labeled and unlabeled data? There is usually a lot more unlabeled data available than labeled. Assumes a set of labeled data L and a set of unlabeled data U from the same distribution. Approaches: self-training and multi-model learning.

Self

_____-Training: Train supervised model on labeled data L. Test on unlabeled data U. Add the most confidently classified members of U to L. Repeat
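
A hedged sketch of that loop, assuming a scikit-learn-style classifier with fit/predict_proba and NumPy arrays; the confidence threshold and round count are illustrative choices:

import numpy as np

def self_train(model, X_l, y_l, X_u, threshold=0.95, rounds=5):
    for _ in range(rounds):
        model.fit(X_l, y_l)                      # train on labeled data L
        if len(X_u) == 0:
            break
        proba = model.predict_proba(X_u)         # test on unlabeled data U
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        # add the most confidently classified members of U to L
        y_new = model.classes_[proba[confident].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, y_new])
        X_u = X_u[~confident]
    return model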

Multi-view

_____-____ learning: Train multiple diverse models on L. Those instances in U which most models agree on are placed in L.

OLAP

______ Operations: -Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction -Drill down (roll down): reverse roll-up from higher level summary to lower level summary or detailed data, or introducing new dimensions -Slice and dice: Project and select -Pivot (rotate): Reorient the cube, visualization, 3D to series of 2D planes -Other operations: Drill across: involving (across) more than one fact table

Naive Bayes

______ _______: requires initial knowledge of many probabilities, which may not be available or may involve significant computational cost. A generative classifier. Requires each conditional probability to be non-zero. Strength: easy to implement, good results obtained in most cases. Weakness: assumes attributes are conditionally independent, hence a loss of accuracy; in practice, dependencies exist among variables that cannot be modeled by it. Use Bayesian Belief Networks to deal with dependencies. hypothesis = label

Bayes

______' Theorem: Shows the relationship between a conditional probability and its inverse, i.e., it allows us to make an inference from the probability of a hypothesis given the evidence to the probability of that evidence given the hypothesis, and vice versa. Helps us find P(B|A) when P(A|B) is already known.
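
A tiny numeric sketch of P(B|A) = P(A|B) * P(B) / P(A); the disease-test numbers are made up for illustration:

# B = has disease, A = test positive (illustrative numbers)
p_b = 0.01             # prior P(B)
p_a_given_b = 0.95     # P(A|B)
p_a_given_not_b = 0.05
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)   # total probability
p_b_given_a = p_a_given_b * p_b / p_a                   # Bayes' theorem
# p_b_given_a is roughly 0.16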

Spark

______: Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop. Efficient and usable. runs Discretized Stream Processing that uses RDD abstractions.

Model

_______ Validation and Testing: Test = estimate accuracy of the model. The known label of test sample is compared with the classified result from the model. Accuracy = % of test set samples that are correctly classified by the model. Test set is independent of training set. Validation = if the test set is used to select or refine models

Occam's Razor

_______ _______: Given two models with similar generalization errors, one should prefer the simpler model over the more complex model. For a complex model, there is a greater chance that it was fitted accidentally to errors in the data. Therefore, one should include model complexity when evaluating a model.

Decision Tree

_________ _____: Leaves represent classifications and branches represent tests on features that lead to those classifications. Each internal node tests an attribute. Each branch corresponds to an attribute value node. Each leaf node assigns a classification. There could be more than one tree that fits the same data! One of the simplest algorithms, no parameters. The outcome is transparent to the user! Able to handle both numerical and categorical data. Robust, performs well with large data in a short time. Methods include ID3, C4.5, CART, IG.

Feature vector

_________ ________: a one-dimensional matrix. Can have magnitude and direction. Stores the features for a particular observation in a specific order. Maps to the vector space (N+1 dimensions).

Logistic regression

_________ __________: Key idea: turns linear predictions into probabilities using the sigmoid function. Smoother than a linear probability model. Its parameters are the weights. Used for regression and classification. Weights are determined by maximizing the log likelihood.
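
A sketch of how the sigmoid turns a linear score into a probability (the weights and bias here are assumed, not learned):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, weights, bias):
    return sigmoid(np.dot(weights, x) + bias)   # P(y = 1 | x)

# predict_proba(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1) is about 0.52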

Wavelet

__________ Transform: Decomposes a signal into different frequency subbands. Applicable to n-dimensional signals. Data are transformed to preserve relative distance between objects at different levels of resolution. Allows natural clusters to become more distinguishable. Used for image compression. The coefficient C represents how closely correlated the wavelet is with a given section of the signal; the higher C is, the greater the similarity.

Attribute Creation

__________ _________ (feature generation): Create new attributes (features) that can capture the important information in a data set more effectively than the original ones. Three general methodologies: -Attribute extraction: domain-specific -Mapping data to a new space (see: data reduction) -Attribute construction: combining features (look for frequent patterns in data and combine them), data discretization (binning, histogram analysis, etc.)

Variance

__________ of a single random variable X provides a measure of how much the value of X deviates from the mean or expected value of X. Sample covariance is a generalization of the sample variance.

Multiple

__________ regression: Y = b0 + b1 X1 + b2 X2. Allows a response variable Y to be modeled as a linear function of multidimensional feature vector. Many nonlinear functions can be transformed into the above

Stratified

__________ sampling: Partition (or cluster) the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
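
A hedged pandas sketch of proportional stratified sampling (the function name, column argument, and fraction are illustrative):

import pandas as pd

def stratified_sample(df, strata_col, frac=0.1, seed=42):
    # draw the same fraction from every stratum (e.g., every class label)
    return (df.groupby(strata_col, group_keys=False)
              .apply(lambda g: g.sample(frac=frac, random_state=seed)))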

Residual

__________ variation is information in A that is not retained in X.

Entropy

__________: measures the lack of information in a system. Shannon's measure of information is the number of bits needed to represent the amount of uncertainty (randomness) in a data source, and is defined as ________. Equals the expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S.

Sampling

__________: obtaining a small sample s to represent the whole data set N. Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data. Key principle: Choose a representative subset of the data; Simple random sampling may have very poor performance in the presence of skew; Develop adaptive sampling methods, e.g., stratified sampling

Histogram

___________ Analysis: Divide data into buckets and store average (sum) for each bucket. Partitioning rules: Equal-width = equal bucket range; Equal-frequency (or equal-depth).

Supervised

___________ Learning: Prediction, Classification(discrete labels), Regression (real values). The training data such as observations or measurements are accompanied by labels indicating the classes which they belong to. New data is classified based on the models built from the training set. Key issue: generalization (can't just memorize the training set). Successes: face detection, steering an autonomous car, detecting credit card fraud, etc.

Generative

___________ classifier model p(Y, X): Models how the data was "generated": "what is the likelihood this or that class generated this instance?" Pick the class with the higher probability. Ex: Naive Bayes, Bayesian Networks (not covered)

Parametric

___________ methods (e.g., regression) for data reduction: Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)

Nonlinear

___________ regression: Data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations

Correlation

____________ analysis for categorical data: Χ2 (chi-square) test. Null hypothesis: The two distributions are independent. The cells that contribute the most to the Χ2 value are those whose actual count is very different from the expected count. The larger the Χ2 value, the more likely the variables are related.
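
A sketch using SciPy's test of independence on a contingency table (the counts are made up for illustration):

from scipy.stats import chi2_contingency

observed = [[250, 200],     # rows and columns are two categorical attributes
            [50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed)
# a large chi2 (small p_value) suggests the two variables are related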

Redundant

____________ data: -Object identification: The same attribute or object may have different names in different databases -Derivable data: One attribute may be a "derived" attribute in another table, e.g., annual revenue -Redundant attributes may be able to be detected by correlation analysis and covariance analysis. Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.

Clustering

____________: Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter). Can use hierarchical clustering and be stored in multi-dimensional index tree structures.

Classification

____________: Predict categorical class labels (discrete or nominal). Construct a model based on the training set and the class labels, and use it to classify new data.

Overfitting

____________: You can perfectly fit to any training data. Zero bias, high variance. Two approaches: Stop growing the tree when further splitting the data does not yield an improvement; Grow a full tree, then prune the tree, by eliminating nodes.

clustering

____________: class labels are unknown. Group data to form new categories. Principle: maximizing intra-class similarity and minimizing inter-class similarity.

Regression

_____________ analysis: A collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables. Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships. Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used

Binning

_____________: -Equal-width (distance) partitioning: Divides the range into N intervals of equal size: a uniform grid. If A and B are the lowest and highest values of the attribute, the width of intervals will be W = (B - A)/N. The most straightforward, but outliers may dominate the presentation, and skewed data is not handled well. -Equal-depth (frequency) partitioning: Divides the range into N intervals, each containing approximately the same number of samples. Good data scaling. Managing categorical attributes can be tricky.

Discretization

_____________: Divide the range of a continuous attribute into intervals. Interval labels can then be used to replace actual data values. Can be performed recursively on an attribute. Three types of attributes: -Nominal: values from an unordered set, e.g., color, profession -Ordinal: values from an ordered set, e.g., military or academic rank -Numeric: real numbers, e.g., integer or real values

Discriminative

______________ classifier model p(Y | X): Uses the data to create a decision boundary. Ex: Logistic regression, Support vector machines. Strength: prediction accuracy is generally high compared to generative models; robust, works when training examples contain errors; fast evaluation of the learned target function compared to Bayesian networks (which are normally slow). Criticism: long training time, difficult to understand the learned function (weights), not easy to incorporate domain knowledge.

Reinforcement

_______________ Learning: Given a sequence of examples/states and a reward after completing that sequence, learn to predict the action to take for an individual example/state.

