Data Mining 1 - 2.3 Data Preprocessing

What is the range of correlation?

-1 to +1. A value of +1 is perfect positive correlation, -1 is perfect negative correlation, and 0 means no linear correlation.

Name four (4) techniques for Feature subset selection.

1) "Brute force" 2) Embedded 3) Filter 4) Wrapper

What is the purpose of dimensionality reduction? (4)

1) Avoid the curse of dimensionality 2) Reduce the time/space needed by the analysis algorithm 3) Allow the data to be visualized more easily 4) May help remove irrelevant features or noise

What are three (3) methods for Feature creation?

1) Feature extraction 2) Mapping data to new space 3) Feature construction

What is Proximity?

Either Similarity or Dissimilarity

Name three types of distances?

1) Euclidean 2) Minkowski (generalization of Euclidean) 3) Mahalanobis
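
As a minimal sketch of how the first two distances relate, the Minkowski distance takes a parameter (called `r` here, an assumed name) and reduces to the Euclidean distance when r = 2. The points `p` and `q` are made up for illustration. Mahalanobis is left out because it also needs the covariance matrix of the data.

```python
def minkowski(x, y, r):
    """Minkowski distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)


def euclidean(x, y):
    """Euclidean distance is the special case r = 2."""
    return minkowski(x, y, 2)


# Hypothetical points, chosen only for illustration.
p = (0, 0)
q = (3, 4)
print(euclidean(p, q))     # r = 2: straight-line distance
print(minkowski(p, q, 1))  # r = 1: city-block distance
```

With r = 1 the same function gives the city-block (Manhattan) distance, which is why Minkowski is described as a generalization.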

What is the "Mapping data to new space" method in Feature creation?

Example: applying a Fourier transform to time-domain data to reveal a periodic pattern that would otherwise be hidden in noise.
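
The idea can be sketched with a naive discrete Fourier transform written in plain Python. The signal below is an assumption made for illustration: a sine wave with 5 cycles per window plus seeded Gaussian noise. A real analysis would normally use an FFT library rather than this slow O(n^2) loop.

```python
import cmath
import math
import random


def dft_magnitudes(signal):
    """Magnitude of each frequency bin 0..n/2 of a naive DFT."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(
            signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        mags.append(abs(total))
    return mags


# Hypothetical noisy signal: 5 cycles over 64 samples plus Gaussian noise.
rng = random.Random(0)
n = 64
signal = [math.sin(2 * math.pi * 5 * t / n) + rng.gauss(0, 0.5) for t in range(n)]

mags = dft_magnitudes(signal)
# Skip bin 0 (the mean of the signal) and pick the strongest frequency.
dominant = max(range(1, len(mags)), key=lambda k: mags[k])
print(dominant)  # the bin of the hidden periodic component
```

In the time domain the samples look irregular, but in the frequency domain the periodic component stands out as a single large bin.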

What is the Embedded technique for Feature subset selection?

Feature selection happens naturally as part of the data mining algorithm itself (as in decision tree classifiers). It is algorithm-specific.

What is the Filter technique for Feature subset selection?

Features are selected before the data mining algorithm is run.

When can the Jaccard coefficient be preferable to SMC?

For example, if each row is a purchase in a store and each attribute is an item that is either bought (1) or not bought (0). Most items in the store will not be bought in any given transaction. SMC will then say that all rows are quite similar, since the 0-0 matches dominate the calculation. Jaccard ignores the 0-0 matches.
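
A small sketch of this effect, using two made-up baskets over ten items that have no purchased item in common. The function names `smc` and `jaccard` are chosen here for illustration.

```python
def smc(x, y):
    """Simple matching coefficient: matches / total number of attributes."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches / attributes not involved in 0-0 matches."""
    both_present = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    not_both_absent = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Convention assumed here: two all-zero vectors get similarity 0.
    return both_present / not_both_absent if not_both_absent else 0.0


# Hypothetical baskets: each buys one item, and not the same one.
basket_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
basket_b = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(smc(basket_a, basket_b))      # 0.8, inflated by the eight 0-0 matches
print(jaccard(basket_a, basket_b))  # 0.0, no shared purchases
```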

What is the "Feature extraction" method in Feature creation?

Create new features from the raw data (e.g. find higher-level patterns from pixels). Highly domain-specific (e.g. methods from image processing might not be applicable to other fields).

What is Feature creation?

From the original attributes, create a new set of attributes that captures the important information in a data set much more effectively. (The number of new attributes can then be smaller than the number of original attributes)

Name three properties of similarities?

s(x,y) = 1 only if x = y. s(x,y) = s(y, x) for all x and y. (Symmetry)

What is the "Feature construction" method in Feature creation?

New features are constructed out of the original features (e.g. density from mass/volume, if that is more useful for the task).
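
A minimal sketch of the density example. The record layout (dictionaries with `mass` and `volume` keys) and the sample values are assumptions made for illustration.

```python
def add_density(records):
    """Return copies of the records with a constructed 'density' feature."""
    return [{**r, "density": r["mass"] / r["volume"]} for r in records]


# Hypothetical objects: mass in grams, volume in cubic centimetres.
objects = [
    {"name": "sample_a", "mass": 193.0, "volume": 10.0},
    {"name": "sample_b", "mass": 27.0, "volume": 10.0},
]

with_density = add_density(objects)
for row in with_density:
    print(row["name"], row["density"])
```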

What is standardization/normalization?

To make an entire set of values have a particular property (e.g. a specific mean or a specific standard deviation).
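
Two common variants can be sketched as follows: z-score standardization (mean 0, standard deviation 1) and min-max scaling to the range 0 to 1. The function names and the income values are assumptions made for illustration.

```python
import statistics


def standardize(values):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize constant values")
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale constant values")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical incomes, much larger in magnitude than an attribute such as age.
incomes = [20000, 35000, 50000, 65000, 80000]
z_scores = standardize(incomes)
scaled = min_max_scale(incomes)
print(z_scores)
print(scaled)
```

After either transformation, an attribute measured in tens of thousands no longer outweighs one measured in tens when distances are computed.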

What is the "Brute force" technique for Feature subset selection?

Tries every possible subset of features and chooses the best one. Gives the ideal result, but is impractical when there are many attributes.
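
A sketch of the enumeration, which also shows why it does not scale: n features give 2^n - 1 non-empty subsets. The feature names and the `toy_score` function are hypothetical. In practice the score would come from evaluating a model on each subset.

```python
from itertools import combinations


def all_subsets(features):
    """Yield every non-empty subset of the features."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


def best_subset(features, score):
    """Brute force: score every subset and keep the highest-scoring one."""
    return max(all_subsets(features), key=score)


# Hypothetical features and usefulness values, for illustration only.
features = ["age", "income", "zip_code", "height"]
usefulness = {"age": 0.5, "income": 0.7, "zip_code": 0.05, "height": 0.0}


def toy_score(subset):
    # Stand-in for model quality: reward useful features, penalize subset size.
    return sum(usefulness[f] for f in subset) - 0.1 * len(subset)


subsets = list(all_subsets(features))
print(len(subsets))                      # 2**4 - 1 = 15
print(best_subset(features, toy_score))  # ('age', 'income')
```

With 30 features the same loop would already have to score more than a billion subsets.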

What is high dimensionality?

Many attributes (could be thousands/tens of thousands). Data gets more sparse in the space it occupies.

What is the curse of dimensionality?

Many data analysis algorithms work worse with high-dimensional data. (It becomes hard to build a classification model, and for clustering, density and distance become less meaningful.)

What is attribute transformation?

A function maps the entire set of values to a new set of values. Could be done with either a simple function, or by standardization/normalization.

Where is Proximity used?

Clustering, Nearest neighbor, and anomaly detection.

Aggregation

Combining two or more objects into a single object

What is discretization?

Continuous attribute -> categorical

What is binarization?

Continuous/discrete attribute -> binary
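
Both transformations can be sketched in a few lines. Equal-width binning and a single threshold are only one possible choice each, assumed here for illustration. The ages and the threshold of 18 are made up.

```python
def discretize_equal_width(values, n_bins):
    """Map each continuous value to a bin index 0..n_bins-1 of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def binarize(values, threshold):
    """Map each value to 1 if it is above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


# Hypothetical ages.
ages = [3, 17, 25, 42, 58, 90]
print(discretize_equal_width(ages, 3))  # [0, 0, 0, 1, 1, 2]
print(binarize(ages, 18))               # [0, 0, 1, 1, 1, 1]
```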

What is the purpose of aggregation? (3)

1) Data reduction (reduce the number of attributes or objects) 2) Change of scale (cities into regions, countries...) 3) Get more stable data
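
The change-of-scale case can be sketched by rolling city-level records up to regions. The record layout and the sales figures are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_sales(records):
    """Combine (city, region, amount) records into one total per region."""
    totals = defaultdict(int)
    for _city, region, amount in records:
        totals[region] += amount
    return dict(totals)


# Hypothetical city-level sales.
city_sales = [
    ("Oslo", "Norway", 120),
    ("Bergen", "Norway", 80),
    ("Stockholm", "Sweden", 150),
]

region_sales = aggregate_sales(city_sales)
print(region_sales)  # {'Norway': 200, 'Sweden': 150}
```

Three objects become two, which is both a data reduction and a change of scale.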

What is Dissimilarity?

Degree of how different two objects are. Lower -> more alike. Min = 0. Upper limit varies.

What is Similarity?

Degree of how much two objects are alike. Higher = more alike. Often between 0 and 1.

Where is cosine similarity often used?

Document vectors, where each attribute is the frequency of a word. Such vectors have few non-zero entries, so the measure needs to ignore 0-0 matches while also handling non-binary vectors.
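
A minimal sketch of cosine similarity on two made-up word-count vectors. Positions where both vectors are 0 add nothing to the dot product or to the lengths, so they are ignored automatically.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention assumed here for all-zero vectors
    return dot / (norm_x * norm_y)


# Hypothetical word counts for two documents over a 10-word vocabulary.
doc1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
doc2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

print(round(cosine_similarity(doc1, doc2), 3))  # 0.315
# Doubling every count (a document twice as long) leaves the similarity at about 1.
print(cosine_similarity(doc1, [2 * v for v in doc1]))
```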

What is the difference between Simple random sampling and Stratified sampling?

In simple random sampling (SRS), every object has the same probability of being drawn. (Less frequent objects can be missed.) In stratified sampling, the objects are divided into groups, and objects are drawn from each group (either the same number from each group or a number proportional to group size).
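
The difference can be sketched on a made-up data set with 95 "common" and 5 "rare" objects. The function names and the equal-count-per-group choice are assumptions made for illustration.

```python
import random
from collections import defaultdict


def simple_random_sample(items, k, rng):
    """Every object has the same chance of being drawn."""
    return rng.sample(items, k)


def stratified_sample(items, group_of, per_group, rng):
    """Split objects into groups, then draw the same number from each group."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Hypothetical data: 95 common objects and 5 rare ones.
rng = random.Random(42)
data = [("common", i) for i in range(95)] + [("rare", i) for i in range(5)]

srs = simple_random_sample(data, 10, rng)
strat = stratified_sample(data, lambda item: item[0], 5, rng)

print(sum(1 for label, _ in srs if label == "rare"))    # often 0 or 1
print(sum(1 for label, _ in strat if label == "rare"))  # always 5 here
```

The simple random sample may contain few or no rare objects, while the stratified sample is guaranteed to represent both groups.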

What is Correlation?

Measure of linear relationship between attributes of objects.
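
A sketch of Pearson's correlation coefficient in plain Python. The data is made up. The third example shows why the word "linear" matters: y depends completely on x, yet the correlation is 0.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


x = [1, 2, 3, 4, 5]
print(pearson(x, [2, 4, 6, 8, 10]))  # approximately +1: perfect positive
print(pearson(x, [10, 8, 6, 4, 2]))  # approximately -1: perfect negative

# y = x squared: a perfect but non-linear relationship.
print(pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # 0.0
```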

How is Jaccard Coefficient calculated?

J = number of matching presences (1-1 matches) / number of attributes not involved in 0-0 matches

What is the effect of using a smaller/larger sample size?

Larger -> loses some of the advantage of sampling, but represents the data well. Smaller -> can miss patterns.

What is feature subset selection?

Another way of reducing dimensionality: remove redundant or irrelevant features.

Name one correlation coefficient?

Pearson's correlation

Name three properties for distances?

1) Positivity 2) Symmetry 3) Triangle inequality

Why is standardization/normalization used?

To avoid having variables with large values dominate the calculations.

What are two linear algebra techniques for dim. reduction?

1) Principal Components Analysis (PCA) 2) Singular Value Decomposition (SVD)

How is Simple matching coefficient (SMC) calculated?

SMC= Number of matches/Total number of attributes

What is sampling?

Selection of a representative subset of the data

What are Similarity coefficients?

Similarity measures between objects that have only binary attributes.

Name two Similarity coefficients?

1) Simple matching coefficient (SMC) 2) Jaccard coefficient

Name 4 types of sampling

1) Simple random sampling 2) Sampling without replacement 3) Sampling with replacement 4) Stratified sampling
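
Sampling with and without replacement can be sketched with Python's standard `random` module. The population of ten integers is made up for illustration.

```python
import random

rng = random.Random(0)
population = list(range(10))  # hypothetical objects 0..9

# Without replacement: an object is removed once drawn, so no duplicates.
without_replacement = rng.sample(population, 5)

# With replacement: an object can be drawn again, so duplicates are possible
# and the sample can even be larger than the population.
with_replacement = rng.choices(population, k=20)

print(without_replacement)
print(with_replacement)
```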

What is the Wrapper technique for Feature subset selection?

Uses the data mining algorithm as a "black box" to find the best subset.

Why is sampling used in data mining?

When the whole data set is too big or too time-consuming to process.

