Data Mining Test #1
Measures of data central tendency
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. Mean (algebraic measure): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
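As a quick check, a minimal Python version of the mean formula:

```python
def mean(xs):
    """Arithmetic mean: (1/n) * sum of the x_i."""
    return sum(xs) / len(xs)

print(mean([2, 4, 6, 8]))  # 5.0
```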
Similarity and proximity measures discussed in class
Similarity: a numerical measure of how alike two data objects are; it is higher when objects are more alike and often falls in the range [0, 1]. Dissimilarity: a numerical measure of how different two data objects are; it is lower when objects are more alike; the minimum dissimilarity is often 0, and the upper limit varies. Proximity refers to either a similarity or a dissimilarity.
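A minimal sketch of one common dissimilarity (Euclidean distance) and one common similarity (cosine similarity, which stays in [0, 1] for non-negative vectors); the sample vectors are arbitrary:

```python
import math

def euclidean(p, q):
    """Dissimilarity: 0 when identical, grows as objects differ."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_similarity(p, q):
    """Similarity: higher when objects are more alike."""
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norms

print(euclidean([1, 2], [4, 6]))          # 5.0
print(cosine_similarity([1, 0], [1, 1]))  # ~0.707
```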
Differences between data mining and traditional statistical methods?
Statistics is the traditional field that deals with the quantification, collection, analysis, and interpretation of data, and with drawing conclusions from it. Data mining is an interdisciplinary field that draws on computer science (databases, artificial intelligence, machine learning, graphical and visualization models), statistics, and engineering (pattern recognition, neural networks).
Data transformation techniques
The ranges of attribute (feature) values differ, so one feature might overpower another. Solution: normalization. Min-max normalization: $v' = \frac{v - \min_A}{\max_A - \min_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$. Z-score normalization: $v' = \frac{v - \mu}{\sigma}$. Decimal-scaling normalization: $v' = v / 10^j$, where $j$ is the smallest integer such that $\max(|v'|) < 1$.
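A sketch of the three techniques; the numeric values and default target range are made up for illustration:

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max]."""
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mu) / sigma

def decimal_scaling(v, j):
    """Decimal scaling: divide by 10^j so that max |v'| < 1."""
    return v / 10 ** j

print(min_max(73600, 12000, 98000))  # ~0.716
print(z_score(73600, 54000, 16000))  # 1.225
print(decimal_scaling(986, 3))       # 0.986
```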
Confusion Matrix
A confusion matrix contains information about the actual and predicted classifications produced by a classification system; the performance of such systems is commonly evaluated using the data in the matrix. a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative).
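A small sketch that tallies the four cells and derives accuracy from them; the label vectors are hypothetical:

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Return (TP, FN, FP, TN) counts for a binary classification."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

tp, fn, fp, tn = confusion_counts(["yes", "yes", "no", "no"],
                                  ["yes", "no", "yes", "no"])
print(tp, fn, fp, tn)                                # 1 1 1 1
print("accuracy:", (tp + tn) / (tp + fn + fp + tn))  # 0.5
```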
Correlation analysis through "Correlation coefficient" and "chi-square test" to detect redundancy
Correlation coefficient: if $r_{A,B} > 0$, A and B are positively correlated (A's values increase as B's do), and the higher the value, the stronger the correlation; $r_{A,B} = 0$: independent; $r_{A,B} < 0$: negatively correlated. Chi-square test: $\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$. The larger the $\chi^2$ value, the more likely the variables are related.
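A sketch of both measures; the expected counts in the chi-square helper come from the row and column marginals, and the sample data is hypothetical:

```python
import math

def pearson_r(a, b):
    """Correlation coefficient r_{A,B}."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_tot = [sum(r) for r in observed]
    col_tot = [sum(c) for c in zip(*observed)]
    total = sum(row_tot)
    return sum((obs - row_tot[i] * col_tot[j] / total) ** 2
               / (row_tot[i] * col_tot[j] / total)
               for i, row in enumerate(observed)
               for j, obs in enumerate(row))

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 (perfectly correlated)
print(chi_square([[250, 200], [50, 1000]]))   # ~507.9 => likely related
```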
Data preprocessing
Data preprocessing transforms the raw input data into an appropriate format for subsequent analysis.
The major steps of the KDD (knowledge discovery from databases) or data mining process
1. Selection 2. Preprocessing 3. Transformation 4. Data Mining 5. Interpretation/Evaluation
The major tasks of data preprocessing
Data cleaning: fill in missing values, identify outliers and smooth out noisy data, correct inconsistent data, and resolve redundancy caused by data integration. Missing data: ignore the tuple (usually done when the class label is missing, assuming the task is classification; not effective when the percentage of missing values per attribute varies considerably), or fill it in automatically with a global constant or the attribute mean. Noisy data: random error or variance in a measured variable; incorrect attribute values may be due to faulty data collection instruments, data entry problems, or data transmission problems; other data problems also require cleaning.
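A minimal sketch of the fill-in strategies with pandas; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values
df = pd.DataFrame({"age":    [25, np.nan, 47, 31],
                   "income": [50000, 62000, np.nan, 58000],
                   "label":  ["yes", "no", "yes", None]})

df = df.dropna(subset=["label"])                # ignore tuples with a missing class label
df["age"] = df["age"].fillna(df["age"].mean())  # fill with the attribute mean
df["income"] = df["income"].fillna(0)           # fill with a global constant
print(df)
```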
Different sampling techniques to reduce data, and their advantages and disadvantages
Dimensionality reduction (e.g., reduce dataset size). Feature selection: another way to reduce the dimensionality of data by removing redundant features. Attribute subset selection (e.g., remove unimportant attributes): select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features; this reduces the number of patterns in the results and makes them easier to understand. Numerosity reduction (e.g., fit data into models), discretization, and concept hierarchy generation. Advantages (of systematic sampling): the sample is easy to select, a suitable sampling frame can be identified easily, and the sample is spread evenly over the entire reference population. Disadvantages: the sample may be biased if a hidden periodicity in the population coincides with that of the selection, and it is difficult to assess the precision of the estimate from one survey.
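A sketch contrasting simple random sampling with systematic sampling (the periodicity caveat above applies to the systematic variant); the data is hypothetical:

```python
import random

data = list(range(1000))  # hypothetical records

# Simple random sampling without replacement
srs = random.sample(data, 100)

# Systematic sampling: every k-th record after a random start
k = len(data) // 100
start = random.randrange(k)
systematic = data[start::k]

print(len(srs), len(systematic))  # 100 100
```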
Impurity Measures
Entropy: $\text{Entropy}(S) = -\sum_{d \in \text{decisions}} p(d) \log p(d)$. Conditional entropy: $\sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v)$. Gini index: $\text{Gini}(D) = 1 - \sum_{i=1}^{k} p_i^2$, equivalently $\text{GINI}(t) = 1 - \sum_j [p(j \mid t)]^2$. Misclassification error: $\text{Error}(t) = 1 - \max_i P(i \mid t)$.
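A sketch of the three node-impurity measures applied to one hypothetical label set:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of p(d) * log2 p(d)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """GINI(t) = 1 - sum of p(j|t)^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def misclassification_error(labels):
    """Error(t) = 1 - max p(i|t)."""
    return 1 - max(Counter(labels).values()) / len(labels)

labels = ["yes", "yes", "yes", "no"]
print(entropy(labels))                  # ~0.811
print(gini(labels))                     # 0.375
print(misclassification_error(labels))  # 0.25
```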
How to use the binning method to handle noisy data?
Equal-width binning: divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N. Equal-depth binning: divides the range into N intervals, each containing approximately the same number of samples. The noisy values in each bin can then be smoothed, e.g., by the bin mean or the bin boundaries.
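A sketch of both partitioning schemes on a small hypothetical attribute:

```python
def equal_width_bins(values, n_bins):
    """Partition the range [A, B] into n_bins intervals of width (B - A) / n_bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)  # clamp the max value into the last bin
        bins[i].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Give each bin approximately the same number of samples."""
    s = sorted(values)
    depth = len(s) // n_bins
    return [s[i * depth:(i + 1) * depth] for i in range(n_bins - 1)] + [s[(n_bins - 1) * depth:]]

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(data, 3))  # [[4, 8], [15, 21, 21], [24, 25, 28, 34]]
print(equal_depth_bins(data, 3))  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```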
What is classification?
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class: find a model for the class attribute as a function of the values of the other attributes. Class labels are categorical (discrete or nominal). Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model.
Data Dispersion
Measures of dispersion measure how spread out a set of data is. Variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$
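A minimal Python version of the variance formula:

```python
def variance(xs):
    """Population variance: (1/n) * sum of (x_i - mu)^2."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

print(variance([2, 4, 6, 8]))  # 5.0
```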
Steps or procedures involved in classification?
Model construction (describing a set of predetermined classes): each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute; the set of tuples used for model construction is the training set; the model is represented as classification rules, decision trees, or mathematical formulae. Model usage (classifying future or unknown objects): estimate the accuracy of the model; the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test set samples that are correctly classified by the model; the test set is independent of the training set, otherwise over-fitting will occur; if the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
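A minimal sketch of the two phases with scikit-learn; the bundled iris data stands in for a real training/test set, and the classifier choice is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Phase 1: model construction on the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Phase 2: model usage, with accuracy estimated on the independent test set
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```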
What is Data Mining?
Non-trivial extraction of implicit, previously unknown and potentially useful information from data. Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns to support future decisions
Model performance evaluation
Performance of a model may depend on other factors besides the learning algorithm: class distribution, cost of misclassification, and the sizes of the training and test sets.
Data mining tasks?
Predictive methods: use some variables to predict unknown or future values of other variables. Descriptive methods: find human-interpretable patterns/rules that describe the data. Classification [predictive]: given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class or label, find a model for the class attribute as a function of the values of the other attributes; previously unseen records should be assigned a class as accurately as possible; a test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it. Clustering [descriptive]: partition the data set into clusters based on similarity and store only the cluster representation (e.g., centroid and diameter); clustering can be hierarchical and stored in multi-dimensional index tree structures. Association rule discovery [descriptive]: given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on occurrences of other items. Regression [predictive]: predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency; greatly studied in statistics and neural network fields. Deviation detection [predictive]: detect significant deviations from normal behavior. Frequent subgraph mining [descriptive]: an active research topic in the data mining community. A graph is a general model to represent data and has been used in many domains like cheminformatics and bioinformatics. Mining patterns from graph databases is challenging since graph-related operations, such as subgraph testing, generally have higher time complexity than the corresponding operations on itemsets, sequences, and trees, which have been studied extensively.
Differences between data mining and database query processing?
Query tools are tools that help analyze the data in a database: they provide query building, query editing, searching, finding, reporting, and summarizing functionality. Data mining is the extraction of previously unknown and interesting information from raw data, using statistical models to look for hidden patterns. Data miners are interested in finding useful relationships between different data elements.
Related measurements
ROC (receiver operating characteristic) curve as a way to measure classifier performance: it plots the true positive rate against the false positive rate as the decision threshold varies, and the area under the curve (AUC) summarizes performance (1.0 for a perfect model, 0.5 for random guessing).
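A minimal sketch computing the curve and its AUC with scikit-learn; the labels and scores are hypothetical:

```python
from sklearn.metrics import auc, roc_curve

# Hypothetical true labels and classifier scores
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", auc(fpr, tpr))
```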
What is a decision tree? Procedures to build a decision tree.
Decision trees do classification: they predict a categorical output from categorical and/or real inputs. Decision trees are the single most popular data mining tool: easy to understand, easy to implement, easy to use, and computationally cheap. The tree is constructed in a top-down, recursive, divide-and-conquer manner: at the start, all the training examples are at the root; test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain); and examples are partitioned recursively based on the selected attributes. Conditions for stopping the partitioning: all samples for a given node belong to the same class; there are no remaining attributes for further partitioning (a majority vote is employed for classifying the leaf); or there are no samples left.
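A compact sketch of the top-down recursive procedure, using information gain as the selection heuristic; the toy records and attribute names are hypothetical:

```python
import math
from collections import Counter

def entropy(rows, target):
    """Entropy of the class labels in a set of records."""
    n = len(rows)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(r[target] for r in rows).values())

def best_attribute(rows, attrs, target):
    """Pick the attribute with the highest information gain."""
    base = entropy(rows, target)
    def gain(a):
        parts = {}
        for r in rows:
            parts.setdefault(r[a], []).append(r)
        return base - sum(len(p) / len(rows) * entropy(p, target)
                          for p in parts.values())
    return max(attrs, key=gain)

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:  # all samples at this node share one class
        return labels[0]
    if not attrs:              # no attributes left: majority vote at the leaf
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, attrs, target)
    return {a: {v: build_tree([r for r in rows if r[a] == v],
                              [x for x in attrs if x != a], target)
                for v in set(r[a] for r in rows)}}

data = [{"outlook": "sunny", "windy": "no",  "play": "no"},
        {"outlook": "sunny", "windy": "yes", "play": "no"},
        {"outlook": "rainy", "windy": "no",  "play": "yes"},
        {"outlook": "rainy", "windy": "yes", "play": "no"}]
print(build_tree(data, ["outlook", "windy"], "play"))
```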
Overfitting and Underfitting
Underfitting: the training and test error rates of the model are both large when the size of the tree is very small, because the model is too simple. Overfitting: as the model becomes more complicated, the test error rate increases even though the training error rate continues to decrease. How to address overfitting: stop the algorithm before it becomes a fully-grown tree. Typical stopping conditions for a node: stop if all instances belong to the same class; stop if all the attribute values are the same. More restrictive conditions: stop if the number of instances is less than some user-specified threshold; stop if the class distribution of instances is independent of the available features (e.g., using a $\chi^2$ test); stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
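A sketch of these early-stopping (pre-pruning) conditions expressed as scikit-learn hyperparameters; the threshold values are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stop the algorithm before it grows a fully-grown tree
pruned = DecisionTreeClassifier(
    max_depth=4,                 # cap the depth of the tree
    min_samples_split=20,        # stop if a node holds fewer than 20 instances
    min_impurity_decrease=0.01,  # stop if a split barely improves impurity
).fit(X, y)
print("depth:", pruned.get_depth())
```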