Data Mining 1 - 2.3 Data Preprocessing
What is the range of correlation?
-1 to +1. +1/-1 means a perfect positive/negative linear relationship; 0 means no linear relationship.
Name four (4) techniques for Feature subset selection.
1) "Brute force" 2) Embedded 3) Filter 4) Wrapper
What is the purpose of dimensionality reduction? (4)
1) Avoid the curse of dimensionality 2) Reduce the time and memory required by the analysis algorithm 3) Allow the data to be more easily visualized 4) May help remove irrelevant features or reduce noise
What are three (3) methods for Feature creation?
1) Feature extraction 2) Mapping data to new space 3) Feature construction
What is Proximity?
Either Similarity or Dissimilarity
Name three types of distances?
Euclidean, Minkowski (a generalization of Euclidean), Mahalanobis
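A minimal sketch of computing these three distances in Python (assuming NumPy and SciPy; the points and the small data set used for the covariance matrix are made up for illustration):

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# Euclidean: square root of the sum of squared differences
d_euc = distance.euclidean(x, y)                     # 5.0

# Minkowski with parameter p (p=2 gives Euclidean, p=1 Manhattan)
d_min = distance.minkowski(x, y, p=3)

# Mahalanobis takes the scale and correlation of the attributes into
# account via the inverse covariance matrix of the data
data = np.array([[1.0, 2.0], [4.0, 6.0], [2.0, 1.0], [3.0, 5.0]])
VI = np.linalg.inv(np.cov(data.T))
d_mah = distance.mahalanobis(x, y, VI)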
What is the "Mapping data to new space" method in Feature creation?
Example: applying a Fourier transform to time-domain data to find patterns (which could otherwise be hidden in noise).
What is the Embedded technique for Feature subset selection?
Feature selection occurs naturally as part of the data mining algorithm (used e.g. by decision tree classifiers). It is algorithm-specific.
What is the Filter technique for Feature subset selection?
Features are selected before the data mining algorithm is run, independently of that algorithm.
When could Jaccard Coefficient benefit over SMC?
For example, if each row is a purchase in a store and each attribute is an item that is either bought (1) or not bought (0): most items in the store are not bought in any given transaction, so SMC will say that all rows are quite similar, since the 0's dominate the calculation.
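A small made-up example: let x = (1,0,0,0,0,0,0,0,0,0) and y = (0,0,0,0,0,0,1,0,0,0) be two transactions over 10 items. They agree on 8 attributes (all 0-0 matches), so SMC = 8/10 = 0.8 even though no item was bought in both. The Jaccard coefficient ignores the 0-0 matches: there are 0 matching presences and 2 attributes not involved in 0-0 matches, so J = 0/2 = 0, which better reflects that the two purchases have nothing in common.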
What is the "Feature extraction" method in Feature creation?
From raw data, create new features (e.g. from raw pixels, find more high-level patterns). Highly domain-specific (e.g. methods from image processing might not be applicable to other fields).
What is Feature creation?
From the original attributes, create a new set of attributes that captures the important information in a data set much more effectively. (The number of new attributes can then be smaller than the number of original attributes)
Name two properties of similarities?
1) s(x,y) = 1 only if x = y. 2) s(x,y) = s(y,x) for all x and y (symmetry).
What is the "Feature construction" method in Feature creation?
New features are constructed out of the original features (e.g. density from mass/volume, if that is more useful for the task).
What is standardization/normalization?
To make an entire set of values have a particular property (e.g. a specific mean or a specific standard deviation).
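A minimal sketch of z-score standardization in Python (assuming NumPy; the values are made up):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Subtract the mean and divide by the standard deviation, so the
# transformed values have mean 0 and standard deviation 1
z = (x - x.mean()) / x.std()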
What is the "Brute force" technique for Feature subset selection?
Tries every possible subset of features and chooses the best one. Gives the ideal result, but is impractical when there are many attributes (d attributes give 2^d candidate subsets).
What is high dimensionality?
Many attributes (could be thousands/tens of thousands). Data gets more sparse in the space it occupies.
What is the curse of dimensionality?
Many data mining algorithms work worse on high-dimensional data: it becomes hard to build a model for classification, and for clustering, density and distance become less meaningful.
What is attribute transformation?
A function that maps the entire set of values of an attribute to a new set of values. Could be done either with a simple function (e.g. log x, sqrt(x), |x|) or by standardization/normalization.
Where is Proximity used?
Clustering, nearest neighbor, and anomaly detection.
What is aggregation?
Combining two or more objects into a single object
What is discretization?
Continuous attribute -> categorical
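One way this might look in Python, assuming pandas and made-up ages and bin edges:

import pandas as pd

ages = pd.Series([5, 17, 34, 62, 80])

# Map the continuous attribute to ordered categories via chosen bins
age_group = pd.cut(ages, bins=[0, 18, 65, 120],
                   labels=["child", "adult", "senior"])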
What is binarization?
Continuous/discrete attribute -> binary
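A minimal sketch of binarization by thresholding (the threshold is an arbitrary choice for illustration):

import numpy as np

income = np.array([12_000, 55_000, 30_000, 90_000])

# 1 if the value exceeds the chosen threshold, otherwise 0
high_income = (income > 40_000).astype(int)          # [0, 1, 0, 1]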
What's the purpose of aggregation? (3)
1) Data reduction (reduce the number of attributes or objects) 2) Change of scale (e.g. cities aggregated into regions, countries...) 3) More stable (less variable) data
What is Dissimilarity?
Degree of how different two objects are. Lower -> more alike. Min = 0. Upper limit varies.
What is Similarity?
Degree of how much two objects are alike. Higher = more alike. Often between 0 and 1.
Where is cosine similarity often used?
Document vectors, where each attribute is the frequency of a word. Such vectors have few non-zero entries, so we need a measure that ignores 0-0 matches but also handles non-binary values.
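A minimal sketch of cosine similarity between two word-frequency vectors (assuming NumPy; the vectors are made up):

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

# cosine similarity = dot product / (length of d1 * length of d2);
# 0-0 positions contribute nothing to the dot product
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))   # about 0.31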
What is the difference between Simple random sampling and Stratified sampling?
In SRS, every object has the same probability of being drawn (so less frequent kinds of objects can be missed). In stratified sampling, the objects are first divided into groups, and objects are then drawn from each group (either the same number per group, or a number proportional to the group's size).
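A minimal sketch of the difference in Python, assuming pandas and a made-up "label" column used as the strata:

import pandas as pd

df = pd.DataFrame({"label": ["a"] * 90 + ["b"] * 10,
                   "value": range(100)})

# Simple random sampling: every row has the same chance of being drawn,
# so the rare group "b" may not appear at all
srs = df.sample(n=10, random_state=0)

# Stratified sampling: draw proportionally from each group,
# so group "b" is guaranteed to be represented
strat = df.groupby("label").sample(frac=0.1, random_state=0)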
What is Correlation?
Measure of linear relationship between attributes of objects.
How is Jaccard Coefficient calculated?
J = number of matching presences / number of attributes not involved in 0-0 matches
What's the effect of using a smaller/larger sample size?
Larger -> loses some of the advantage of sampling, but gives a good representation of the data. Smaller -> patterns can be missed.
What is feature subset selection?
Another way of reducing dimensionality, by removing redundant or irrelevant features.
Name one correlation coefficient?
Pearson's correlation
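A minimal sketch of computing Pearson's correlation in Python (assuming NumPy; the vectors are made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])

# Pearson's r = covariance(x, y) / (std(x) * std(y)), always in [-1, +1]
r = np.corrcoef(x, y)[0, 1]   # close to +1: strong positive linear relationship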
Name three properties for distances?
1) Positivity 2) Symmetry 3) Triangle inequality
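Written out in the same notation as the similarity properties: 1) d(x,y) >= 0 for all x and y, and d(x,y) = 0 only if x = y (positivity). 2) d(x,y) = d(y,x) for all x and y (symmetry). 3) d(x,z) <= d(x,y) + d(y,z) for all x, y and z (triangle inequality).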
Why is standardization/normalization used?
To avoid having variables with large values dominate the calculations.
What are two linear algebra techniques for dim. reduction?
Principal Components Analysis (PCA) Singular Value Decomposition (SVD)
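A minimal sketch of dimensionality reduction via SVD in NumPy (in practice scikit-learn's PCA is a common choice; the data here is random, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 objects, 10 attributes

# Center the data, then project onto the directions (right singular
# vectors) with the largest singular values
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T               # keep the top 2 components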
How is Simple matching coefficient (SMC) calculated?
SMC = number of matches / total number of attributes
What is sampling?
Selection of a representative subset of the data
What are Similarity coefficients?
Similarity measures between objects that have only binary attributes.
Name two Similarity coefficients?
Simple matching coefficient (SMC), Jaccard coefficient
Name 4 types of sampling?
1) Simple random sampling 2) Sampling without replacement 3) Sampling with replacement 4) Stratified sampling
What is the Wrapper technique for Feature subset selection?
Uses the data mining algorithm as a "black box": candidate subsets of features are evaluated by running the algorithm on them, and the subset that gives the best result is kept.
Why is sampling used in data mining?
When the whole data set is too large or too time-consuming to process.