Comp 542 - Midterm 2
Ordinal data
Data whose values have a meaningful ranking or order (e.g., order of choices on a scale), without guaranteed equal spacing between values.
curse of dimensionality
- Volume of the feature space increases exponentially with the number of dimensions.
- Data becomes increasingly sparse in the space it occupies.
- Sparsity makes it difficult to achieve statistical significance for many methods.
- Definitions of density and distance (critical for clustering and other methods) become less useful: all distances start to converge to a common value (illustrated in the sketch below).
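A minimal NumPy sketch of that last point, under the assumption of uniformly random points; it shows the gap between the nearest and farthest pairwise distance shrinking relative to the mean distance as the dimension grows:

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(200, d))            # 200 random points in [0, 1]^d
    diffs = X[:, None, :] - X[None, :, :]     # pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists = dists[np.triu_indices(200, k=1)]  # keep each unique pair once
    spread = (dists.max() - dists.min()) / dists.mean()
    print(f"d={d:5d}  relative spread of distances = {spread:.3f}")

The printed spread drops steadily with d, which is why distance-based notions of "nearest" lose meaning in high dimensions.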
Regression
Regression models are used to predict a continuous value. Predicting the price of a house given its features (size, location, etc.) is one of the common examples of regression. It is a supervised technique.
data preprocessing is important as
Data preprocessing is responsible for 60% to 80% of the time of data scientists' and engineers' work. "Better data beats fancier algorithms." "Garbage in gets you garbage out."
categorical data
Data that consists of names, labels, or other nonnumerical values. Categorical data are variables that contain label values rather than numeric values, and the number of possible values is often limited to a fixed set. Categorical variables are often called nominal. Some examples: a "pet" variable with the values "dog" and "cat"; a "color" variable with the values "red", "green", and "blue"; a "place" variable with the values "first", "second", and "third" (this last one has a natural order, making it ordinal).
data transformation
Data transformation is the process of converting data from one format or structure into another. In data transformation, the data are transformed or consolidated into forms appropriate for mining, so the mining process may be more efficient and the patterns found may be easier to understand.
Interval Data
Differences between values can be found, but there is no absolute 0. (Temp. and Time)
Data Exploration
Discovery through numerical summaries and visualizations
Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple databases:
- Object identification: the same attribute or object may have different names in different databases.
- Derivable data: one attribute may be "derived" from attributes in another table, e.g., annual revenue.
Redundant attributes may be detected by:
- the correlation coefficient for numeric data
- the chi-square test for categorical data
Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
Major Tasks in Data Preprocessing
data cleaning, data integration, data transformation, data reduction
Binary
A binary variable has two possible outcomes: 1 (positive/present) or 0 (negative/absent). If there is no preference for which outcome should be coded as 0 and which as 1, the binary variable is called symmetric. If the outcomes are not of equal importance, the binary variable is called asymmetric.
Feature engineering is important
Better features mean flexibility. Better features mean simpler models. Better features mean better results.
continuous data
Continuous data can take any value (within a range). Examples: a person's height could be any value within the range of human heights, not just certain fixed heights; time in a race, which can be measured down to fractions of a second; a dog's weight; the length of a leaf.
Interval data
Interval data has equal spaces between the numbers and does not represent a temporal pattern. Examples include percentages, temperatures, and income. Interval data is the most precise measurement scale and very common. Although each value is a discrete number, e.g. 3.1 miles, it doesn't generally matter for machine learning purposes whether it is a continuous scale (i.e., whether infinitely smaller measurement sizes are possible), nor does it matter whether there is an absolute zero. Interval data is generally easy to work with, but you may want to create bins to cut down on the number of ranges. (ex: calendar date, temperature)
curse of dimensionality
Most data-mining problems involve large numbers of samples with many different types of features, and these data are very often high dimensional. This high dimensionality causes the problem known as "the curse of dimensionality." The curse of dimensionality arises from the geometry of high-dimensional spaces, and such data spaces are typical of data-mining problems.
Overfitting & Underfitting
Overfitting refers to a model that models the training data too well: it learns the detail and noise in the training data to the extent that this negatively impacts the model's performance on new data. Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough; specifically, underfitting occurs when the model or algorithm shows low variance but high bias. Bias (also known as algorithm bias or AI bias) is a phenomenon that occurs when an algorithm produces results that are systematically prejudiced due to erroneous assumptions in the machine learning process.
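A hedged sketch of both failure modes, using polynomial degree as a stand-in for model capacity (scikit-learn assumed; the data is synthetic and only for illustration):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

for degree in (1, 4, 15):   # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error
          mean_squared_error(y_te, model.predict(X_te)))   # test error

# Degree 1 underfits (both errors high); degree 15 overfits
# (training error very low, test error climbs back up).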
data transformation
Process of changing the data from their original form to a format suitable for performing a data analysis addressing research objectives.
How can we transform data
Smoothing•Attribute/feature construction •Aggregation•Normalization•Discretization•Concept hierarchy generation
Information gain does not work with
attributes with a large number of distinct values (it is biased toward such attributes, which causes overfitting issues).
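A minimal sketch of why, in pure Python/NumPy with hypothetical toy data: a unique row-ID attribute achieves the maximum possible information gain even though splitting on it cannot generalize.

import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(attribute, labels):
    total, n, remainder = entropy(labels), len(labels), 0.0
    for value in set(attribute):
        subset = [l for a, l in zip(attribute, labels) if a == value]
        remainder += len(subset) / n * entropy(subset)   # weighted child entropy
    return total - remainder

labels  = ["yes", "no", "yes", "yes", "no", "no"]
outlook = ["sun", "sun", "rain", "rain", "sun", "rain"]   # few distinct values
row_id  = [1, 2, 3, 4, 5, 6]                              # unique per row

print(information_gain(outlook, labels))  # modest gain
print(information_gain(row_id, labels))   # equals entropy(labels): a "perfect" but useless split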
Data types
attribute
Classification
Classification is a supervised learning approach in which the computer program learns from the data input given to it and then uses this learning to classify new observations. The data set may simply be bi-class (e.g., identifying whether a person is male or female, or whether an email is spam or not spam) or it may be multi-class. Some examples of classification problems are speech recognition, handwriting recognition, biometric identification, and document classification. Common types of classification algorithms in machine learning:
- Linear classifiers: logistic regression, naive Bayes classifier
- Nearest neighbor
- Support vector machines
- Decision trees
- Boosted trees
- Random forest
- Neural networks
feature selection methods
• Embedded methods
- Embedded methods learn which features best contribute to the accuracy of the model while the model is being created
- Regularization methods such as LASSO, elastic net, and ridge regression
- Decision tree algorithms
• Filter methods (see the filter-method sketch below)
- Features are selected before the algorithm is run, using some approach that is independent of the mining or machine learning task
- The features are ranked by a score such as correlation or the chi-square test and are either kept or removed from the dataset
- Used to remove redundant or irrelevant features: chi-squared test, Gini index, information gain, and correlation coefficient scores
• Wrapper methods
- Treat the selection of a set of features as a search problem, where different combinations are prepared, evaluated, and compared to other combinations
- Best-first search, random hill-climbing, or heuristic algorithms
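A hedged sketch of a filter method: score features with the chi-squared test and keep the top k, independently of any downstream model (scikit-learn assumed; the iris data is just a convenient stand-in with non-negative features, as chi2 requires).

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)           # 4 numeric, non-negative features
selector = SelectKBest(score_func=chi2, k=2)
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)                     # chi-squared score per feature
print(selector.get_support(indices=True))   # indices of the 2 kept features
print(X_reduced.shape)                      # (150, 2)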
Data Cleaning Machine Learning
Essentially the task of removing errors and anomalies, or replacing observed values with true values, in order to get more value from the data in analytics. Examples: filling in missing values, removing rows with missing values, fixing errors in the structure.
feature subset selection
Both filter and wrapper approaches require:
- A way to measure the predictive quality of the subset
- A strategy for searching the possible subsets; exhaustive search is usually infeasible, since the search space is the power set (2^d subsets for d features). (See the greedy sketch below.)
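Because the 2^d subsets cannot be searched exhaustively, wrapper strategies use heuristics instead. A minimal greedy forward-selection sketch (scikit-learn assumed; the k-NN estimator and cross-validation scoring are illustrative choices, not prescribed by the notes):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
remaining, selected = set(range(X.shape[1])), []

while remaining:
    # score each candidate subset = already-selected features plus one more
    scores = {f: cross_val_score(KNeighborsClassifier(),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    print(selected, round(scores[best], 3))
# In practice, stop early once the score no longer improves.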
Ratio data
(Equal spaces between values and a meaningful zero value, so the mean makes sense.) Values are ordered units with intermediate values. Ratio values are the same as interval values, with the difference that they do have an absolute zero. Good examples are height, weight, and length.
training data
data that is used to train a predictive model and that therefore must have known values for the target variable of the model
feature extraction
• Feature extraction is a process of automatically reducing the dimensionality of high-dimensional data into a much smaller set of features that can be modeled.
- Common examples include image, audio, and textual data, and tabular data with millions of attributes.
- For structured data, PCA and clustering are used (a PCA sketch follows below).
- For image data, line and edge detection.
- Depending on the domain, for image, video, and audio data, many of the same DSP (Digital Signal Processing) methods apply, such as the DWT (Discrete Wavelet Transform).
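A minimal sketch of feature extraction on structured data with PCA (scikit-learn assumed; the 64-dimensional digits dataset is used as an example of high-dimensional tabular input):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 1797 samples x 64 features
pca = PCA(n_components=10)                 # keep 10 derived features
X_small = pca.fit_transform(X)

print(X.shape, "->", X_small.shape)            # (1797, 64) -> (1797, 10)
print(pca.explained_variance_ratio_.sum())     # fraction of variance retained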
How does feature engineering work
• Identify features from attributes
• Estimate the usefulness of a feature
• Feature extraction
• Feature selection
• Feature construction
• Feature learning
Semi-Supervised Learning in the Real World
Semi-supervised learning models are becoming widely applicable in scenarios across a large variety of industries. A few of the most well-known examples:
- Speech analysis: a classic example of the value of semi-supervised learning models. Labeling audio files is typically a very intensive task that requires a lot of human resources, so applying SSL techniques can really help improve traditional speech analytics models.
- Protein sequence classification: inferring the function of proteins typically requires active human intervention.
- Web content classification: organizing the knowledge available in billions of web pages will advance different segments of AI. Unfortunately, that task typically requires human intervention to classify the content.
There are plenty of other scenarios for SSL models. However, not all AI scenarios can be tackled directly using SSL. A few essential characteristics should be present for a problem to be effectively solvable using SSL:
1. Sizable unlabeled dataset: in SSL scenarios, the size of the unlabeled dataset should be substantially bigger than the labeled data; otherwise, the problem could simply be addressed using supervised algorithms.
2. Input-output proximity symmetry: SSL operates by inferring classifications for unlabeled data based on proximity to labeled data points. Inverting that reasoning, SSL scenarios entail that if two data points are part of the same cluster (determined by a k-means algorithm or similar), their outputs are likely to be in close proximity as well; complementarily, if two data points are separated by a low-density area, their outputs should not be close.
3. Relatively simple labeling and low-dimensional nature of the problem: in SSL scenarios, it is important that inferring the labels does not become a problem more complicated than the original one. This is known in AI circles as the "Vapnik principle," which essentially states that in order to solve a problem we should not pick an intermediate problem of a higher order of complexity. Also, problems whose datasets have many dimensions or attributes are likely to become very challenging for SSL algorithms, as the labeling task will become very complex.
Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
- Distinctness: =, ≠
- Order: <, >
- Addition: +, -
- Multiplication: *, /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
Statistical Techniques
Univariate analysis is perhaps the simplest form of statistical analysis. Like other forms of statistics, it can be inferential or descriptive; the key fact is that only one variable is involved. Univariate analysis can yield misleading results in cases where multivariate analysis is more appropriate.
- Central tendency: mean, mode, and median
- Dispersion: the extent to which a distribution is stretched or squeezed; range, variance, maximum, minimum, quartiles (including the interquartile range), coefficient of variation, standard deviation, skewness, and kurtosis
- Skewness: a measure of symmetry, or more precisely the lack of symmetry. A distribution or data set is symmetric if it looks the same to the left and right of the center point (degree of departure from a symmetrical bell curve or normal distribution).
- Kurtosis: in probability theory and statistics, a measure of the "tailedness" of the probability distribution of a real-valued random variable.
Multivariate analysis is based on the statistical principle of multivariate statistics, which involves observation and analysis of more than one statistical outcome variable at a time.
- Numerical vs. numerical (bivariate): correlation matrix, a table showing correlation coefficients between variables. Each cell shows the correlation between two variables; a correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.
- Categorical vs. categorical (bivariate): Pearson's chi-squared test. (A short sketch of these summaries follows.)
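A hedged sketch of these univariate and bivariate summaries with pandas and SciPy (the DataFrame below is hypothetical toy data, used only to show where each statistic comes from):

import pandas as pd
from scipy import stats

df = pd.DataFrame({"income": [30, 35, 42, 51, 49, 300],   # note one extreme value
                   "age":    [22, 25, 31, 40, 38, 45],
                   "gender": ["F", "M", "F", "M", "M", "F"],
                   "bought": ["no", "no", "yes", "yes", "yes", "no"]})

# Univariate: central tendency, dispersion, skewness, kurtosis
print(df["income"].mean(), df["income"].median(), df["income"].mode()[0])
print(df["income"].var(), df["income"].std(), df["income"].quantile([.25, .5, .75]))
print(df["income"].skew(), df["income"].kurt())

# Bivariate, numerical vs. numerical: correlation matrix
print(df[["income", "age"]].corr())

# Bivariate, categorical vs. categorical: Pearson's chi-squared test
table = pd.crosstab(df["gender"], df["bought"])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p)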
Numeric data
is information that is measurable. It is always collected in number form, although there are other types of data that can appear in number form. An example of numerical data would be the number of people that attended a movie theater over the course of a month.
What is clustering?
is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields
Ratio Data
data with an absolute 0. Ratios are meaningful. (Length, Width, Weight, Distance)
Binary data
Binary data is discrete data that can be in only one of two categories — either yes or no, 1 or 0, off or on, etc. Binary can be thought of as a special case of ordinal, nominal, count, or interval data. Binary data is a very common outcome variable in machine learning classification problems. For example, we may want to create a supervised learning model to predict whether a tumor is malignant or benign. Binary data is common and merits its own category when thinking about your data.
What is Classification?
Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y). For example, spam detection in email service providers can be identified as a classification problem. This is a binary classification, since there are only 2 classes: spam and not spam. A classifier utilizes some training data to understand how the given input variables relate to the class; in this case, known spam and non-spam emails have to be used as the training data. When the classifier is trained accurately, it can be used to classify an unknown email. Classification belongs to the category of supervised learning, where the targets are also provided with the input data. There are many applications of classification in many domains, such as credit approval, medical diagnosis, and target marketing. There are two types of learners in classification: lazy learners and eager learners. 1. Lazy learners simply store the training data and wait until test data appears; when it does, classification is conducted based on the most related data in the stored training data. Compared to eager learners, lazy learners have less training time but more time in predicting. Ex.: k-nearest neighbor, case-based reasoning. 2. Eager learners construct a classification model based on the given training data before receiving data for classification. The model must be able to commit to a single hypothesis that covers the entire instance space. Due to the model construction, eager learners take a long time to train and less time to predict. Ex.: decision tree, naive Bayes, artificial neural networks.
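A minimal sketch contrasting a lazy learner (k-NN) with an eager learner (a decision tree) on the same data (scikit-learn assumed; iris is just a stand-in for the spam/not-spam example):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

lazy = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)      # just stores the data
eager = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # builds a model up front

print("k-NN accuracy:         ", lazy.score(X_te, y_te))
print("decision tree accuracy:", eager.score(X_te, y_te))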
Classification
Classifier: an algorithm that maps the input data to a specific category.
Classification model: a model that tries to draw some conclusion from the input values given for training; it will predict the class labels/categories for new data.
Feature: an individual measurable property of a phenomenon being observed.
Binary classification: a classification task with two possible outcomes. E.g., gender classification (male/female).
Multi-class classification: classification with more than two classes; each sample is assigned to one and only one target label. E.g., an animal can be a cat or a dog but not both at the same time.
Multi-label classification: a classification task where each sample is mapped to a set of target labels (more than one class). E.g., a news article can be about sports, a person, and a location at the same time.
data integration
Combines data from multiple sources into a coherent store. It may cause entity identification problems (schema conflicts and value conflicts).
- Schema conflict: A.cust-id ≡ B.cust-number; integrate metadata from the different sources.
- Data value conflict: "Bill Clinton" = "William Clinton"; data codes for pay_type may be "H" and "S" in one database but "1" and "2" in another. Possible reasons: different representations, different scales (e.g., metric vs. British units). (A small sketch follows.)
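A hedged pandas sketch of integrating two sources with a schema conflict (cust_id vs. cust_number) and a value conflict (pay-type codes); the column names and code mappings are illustrative assumptions, not part of the notes:

import pandas as pd

a = pd.DataFrame({"cust_id": [1, 2], "pay_type": ["H", "S"]})
b = pd.DataFrame({"cust_number": [1, 2], "pay_type": ["1", "2"]})

# Resolve the schema conflict: cust_id ≡ cust_number
b = b.rename(columns={"cust_number": "cust_id"})

# Resolve the value conflict: map "1"/"2" onto "H"/"S" before combining
b["pay_type"] = b["pay_type"].map({"1": "H", "2": "S"})

merged = pd.concat([a, b], ignore_index=True).drop_duplicates()
print(merged)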
Semi-Supervised Algorithm assumes the following
- Continuity assumption: points which are closer to each other are more likely to have the same output label.
- Cluster assumption: the data can be divided into discrete clusters, and points in the same cluster are more likely to share an output label.
- Manifold assumption: the data lie approximately on a manifold of much lower dimension than the input space. This assumption allows the use of distances and densities defined on the manifold.
Count data
Count data is discrete whole number data — no negative numbers here. Count data often has many small values, such as zero and one. Count data is usually treated similarly to interval data, but it is unique enough and widespread enough to merit its own category.
data preprocessing
Covers all activities needed to construct the final dataset (the data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks include data cleaning, data integration, data reduction, data transformation and discretization, and feature engineering.
what is feature engineering
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.
Feature Construction
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. It is manual, it is slow, it requires lots of human brain power, and it makes a big difference. Feature construction involves transforming a given set of input features to generate a new set of more powerful features which can then be used for prediction. Engineering a good feature space is a prerequisite for achieving high performance in any machine learning task. Higher-level features can be obtained from already available features and added to the feature vector; for example, for the study of diseases the feature 'Age' is useful and is defined as Age = 'Year of death' minus 'Year of birth'. This process is referred to as feature construction. Feature construction is the application of a set of constructive operators to a set of existing features, resulting in the construction of new features. Examples of such constructive operators include checking for the equality conditions {=, ≠}, the arithmetic operators {+, −, ×, /}, the array operators {max(S), min(S), average(S)}, as well as other more sophisticated operators, for example count(S, C), which counts the number of features in the feature vector S satisfying some condition C, or distances to other recognition classes generalized by some accepting device. Feature construction has long been considered a powerful tool for increasing both accuracy and understanding of structure, particularly in high-dimensional problems. Applications include studies of disease and emotion recognition from speech.
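A minimal pandas sketch of the constructions described above, the derived 'Age' feature plus a few arithmetic and aggregate operators (the columns are toy data assumed for illustration):

import pandas as pd

df = pd.DataFrame({"year_of_birth": [1901, 1920, 1935],
                   "year_of_death": [1975, 1980, 2001],
                   "visits_q1": [3, 0, 5],
                   "visits_q2": [1, 2, 4]})

df["age"] = df["year_of_death"] - df["year_of_birth"]              # arithmetic operator
df["total_visits"] = df[["visits_q1", "visits_q2"]].sum(axis=1)    # aggregate over a set
df["max_quarter"] = df[["visits_q1", "visits_q2"]].max(axis=1)     # max(S)
df["ever_visited"] = (df["total_visits"] > 0)                      # condition check

print(df)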
identify the feature
Find attributes useful for your modeling task - A feature is an attribute that is useful or meaningful to your problem
Interval Data
In an interval scale or interval variable, we know the exact difference between measured values. For example, between the boiling and freezing points of water there are 100 intervals (in Celsius), and the difference between 20 and 10 degrees Celsius is exactly the same as that between 90 and 80 degrees Celsius. Other examples are Fahrenheit temperature and IQ.
Split Dataset
In cases where cross validation is not applicable, it is common to separate the data in the ratio of 7:3 (70:30) for training and testing respectively. The general rule of thumb is to partition the data set into the ratio of 3:1:1 (60:20:20) for training, validation and testing respectively
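A hedged sketch of the 60:20:20 split described above, done as two calls to train_test_split (scikit-learn assumed; any feature matrix X and labels y would work):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off 20% as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
# Then split the remainder 75/25, which yields 60/20 of the original data
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # roughly 60% / 20% / 20%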
Regression
In machine learning, regression algorithms attempt to estimate the mapping function (f) from the input variables (x) to numerical or continuous output variables (y). In this case, y is a real value, which can be an integer or a floating point value. Therefore, regression prediction problems are usually quantities or sizes. For example, when provided with a dataset about houses, and you are asked to predict their prices, that is a regression task because price will be a continuous output. Examples of the common regression algorithms include linear regression, Support Vector Regression (SVR), and regression trees. Some algorithms, such as logistic regression, have the name "regression" in their names but they are not regression algorithms.
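A minimal regression sketch: fit a linear model that maps house size to a continuous price (scikit-learn assumed; the sizes and prices are made-up numbers for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

size_sqft = np.array([[750], [900], [1200], [1500], [1800]])       # input variable x
price = np.array([150_000, 175_000, 230_000, 280_000, 330_000])    # continuous target y

model = LinearRegression().fit(size_sqft, price)
print(model.predict([[1000]]))   # predicted price for a 1000 sq ft house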
Attribute subset selection
It is a wrapper method for feature selection, which treats the selection of a set of features as a search problem: different combinations are prepared, evaluated, and compared to other combinations (using best-first search, random hill climbing, or a heuristic algorithm).
What is machine learning?
It is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence.
test data machine learning
It is used to see how well machine learning algorithms can predict new answers based on its training. It helps assess the performance of the model.
Data Cleaning
Missing data may occur because:
- data is not always available
- equipment malfunction
- data not entered due to misunderstanding
- data not entered due to unavailability
- certain data may not have been considered important at the time of entry
How to handle missing data:
- Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious and possibly infeasible
- Fill it in automatically with: a global constant (e.g., "unknown", a new class?!); the attribute mean; the attribute mean for all samples belonging to the same class (smarter); or the most probable value (inference-based, such as a Bayesian formula or decision tree)
Noisy data:
- Noise is random error or variance in a measured variable
- Incorrect attribute values may be due to faulty data collection instruments, data entry problems, data transmission problems, technology limitations, or inconsistency in naming conventions
How to handle noisy data:
- Binning: first sort the data and partition it into (equal-frequency) bins, then smooth by bin means, bin medians, bin boundaries, etc.
- Regression: smooth by fitting the data to regression functions
- Clustering: detect and remove outliers
(A short sketch of imputation and binning follows.)
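A hedged pandas sketch of two of the remedies above: filling missing values (global constant, overall mean, per-class mean) and smoothing a column by equal-frequency binning; the DataFrame is toy data assumed for illustration.

import pandas as pd

df = pd.DataFrame({"income": [40, None, 55, None, 90, 75],
                   "city":   ["LA", None, "NY", "LA", "NY", "NY"],
                   "class":  ["low", "low", "mid", "mid", "high", "high"]})

df["city"] = df["city"].fillna("unknown")                         # global constant
df["income_mean"] = df["income"].fillna(df["income"].mean())      # attribute mean
df["income_by_class"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))              # mean within the same class

# Smooth by equal-frequency bins, then replace each value with its bin mean
bins = pd.qcut(df["income_mean"], q=3)
df["income_smoothed"] = df.groupby(bins)["income_mean"].transform("mean")
print(df)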
Nominal data
Nominal data is made of discrete values with no numerical relationship between the different categories — mean and median are meaningless. Animal species is one example. For example, pig is not higher than bird and lower than fish. Nationality is another example of nominal data. There is group membership with no numeric order — being French, Mexican, or Japanese does not in itself imply an ordered relationship. You can one-hot-encode or hash nominal features. Do not ordinal encode them because the relationship between the groups cannot be reduced to a monotonic function. The assigning of values would be random.
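A minimal sketch of one-hot encoding a nominal feature with pandas (the 'species' column is toy data):

import pandas as pd

df = pd.DataFrame({"species": ["pig", "bird", "fish", "bird"]})
encoded = pd.get_dummies(df, columns=["species"])   # one indicator column per category
print(encoded)
# species_bird, species_fish, species_pig: no order is implied between the columns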
Data Normalization
Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. For machine learning, every dataset does not require normalization. It is required only when features have different ranges.
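A hedged sketch of min-max normalization, rescaling two columns with very different ranges to a common [0, 1] scale (scikit-learn assumed; the numbers are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[50_000, 25],      # income in dollars, age in years
              [80_000, 40],
              [120_000, 60]], dtype=float)

X_scaled = MinMaxScaler().fit_transform(X)   # (x - min) / (max - min), per column
print(X_scaled)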
Outliers
Outliers can cause problems with certain types of models. For example, linear regression models are less robust to outliers than decision tree models. • In general, if you have a legitimate reason to remove an outlier, it will help your model's performance. • However, outliers are innocent until proven guilty. You should never remove an outlier just because it's a "big number." That big number could be very informative for your model. • We can't stress this enough: you must have a good reason for removing an outlier, such as suspicious measurements that are unlikely to be real data.
curse of dimensionality
The curse of dimensionality refers to how certain learning algorithms may perform poorly on high-dimensional data. First, it is very easy to overfit the training data, since we can have a lot of assumptions that describe the target label (in the case of supervised learning); in other words, we can easily express the target using the dimensions that we have. Second, we may need to increase the amount of training data exponentially to overcome the curse of dimensionality, and that may not be feasible. Third, in ML algorithms that depend on distance, like k-means for clustering or k-nearest neighbors, everything can become far from everything else, and it is difficult to interpret the distance between data points.
what is regression?
The labels in the learning dataset contain real numbers, and the machine learning algorithm produces a model that assigns a real number to unseen inputs. In this type of learning, the output label can be any numeric value within the given range.
Curse of Dimensionality States:
- The size of a data set yielding the same density of data points in an n-dimensional space increases exponentially with the number of dimensions.
- A larger radius is needed to enclose a given fraction of the data points in a high-dimensional space.
- Almost every point is closer to an edge than to another sample point in a high-dimensional space.
- Almost every point is an outlier.
What is supervised learning?
Supervised learning algorithms learn from the past to try to predict the future: they find patterns in labeled data and try to fit labels to unlabeled data. They model relationships and dependencies between the target prediction output and the input features, such that we can predict the output values for new data based on the relationships learned from previous data sets.
- Predictive model
- We have labeled data
- Classification or regression problems
In this kind of learning we teach or train the machine using data which is well labeled, meaning some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data), so the supervised learning algorithm analyzes the training data (the set of training examples) and produces a correct outcome from the labeled data. Examples of supervised learning techniques: decision tree, k-d forest, naive Bayes classifier, Bayesian network, linear regression, logistic regression, support vector machine, neural network.
Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution; this distinguishes unsupervised learning from supervised learning and reinforcement learning. The trainer does not provide labeled output in the learning dataset: the algorithm learns from unlabeled data and gathers information from it. For example, we can make clusters according to similar observations in the dataset (e.g., e-commerce grouping based on the products we search for). Clustering (the assignment of a set of observations into subsets called clusters) is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other and dissimilar to the data points in other groups. It is basically a collection of objects on the basis of similarity and dissimilarity between them. Algorithms: k-means, hierarchical cluster analysis, expectation maximization.
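A minimal unsupervised-learning sketch: k-means groups unlabeled points into clusters purely by similarity (scikit-learn assumed; the blobs are synthetic data, and the true labels are never shown to the algorithm):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # labels are discarded
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # learned cluster centres
print(kmeans.labels_[:10])       # cluster assignment for the first 10 points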
Useless data
Useless data is unique, discrete data with no potential relationship with the outcome variable. A useless feature has high cardinality. An example would be bank account numbers that were generated randomly. If a feature consists of unique values with no order and no meaning, that feature is useless and need not be included when fitting a model.
Estimate the Usefulness of a feature
You can objectively estimate the usefulness of features. - Features are allocated scores and can then be ranked by their scores. - A feature may be important if it is highly correlated with the dependent variable (the thing being predicted)
feature selection
feature selection is the process of selecting a subset of relevant features for use in model construction
What are the steps in machine learning
1. Data collection (gathering data)
2. Data preparation (cleaning data, feature engineering)
3. Choosing/defining a model
4. Training
5. Evaluation (testing the model)
6. Hyperparameter tuning
7. Prediction/deployment
feature construction
involves transforming a given set of input features to generate a new set of more powerful features which can then be used for prediction. Engineering a good feature space is a prerequisite for achieving high performance in any machine learning task.
semi supervised learning
is a class of machine learning tasks and techniques that also make use of unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data.
Assumptions used: the continuity assumption, the cluster assumption, and the manifold assumption.
Methods: 1. generative models, 2. low-density separation, 3. graph-based methods, 4. heuristic approaches.
validation set
is a sample of data held back from training your model that is used to give an estimate of model skill while tuning the model's hyperparameters. The validation dataset is different from the test dataset, which is also held back from the training of the model but is instead used to give an unbiased estimate of the skill of the final tuned model when comparing or selecting between final models.
discrete data
is based on counts. Only a finite number of values is possible, and the values cannot be subdivided meaningfully. For example, the number of parts damaged in shipment.
Data Discretization
is defined as a process of converting continuous data attribute values into a finite set of intervals with minimal loss of information.
Three types of attributes:
- Nominal: values from an unordered set, e.g., color, profession
- Ordinal: values from an ordered set, e.g., military or academic rank
- Numeric: real numbers, e.g., integers or reals
Discretization divides the range of a continuous attribute into intervals:
- Interval labels can then be used to replace actual data values
- Reduces data size
- Supervised vs. unsupervised
- Split (top-down) vs. merge (bottom-up)
- Can be performed recursively on an attribute
- Prepares data for further analysis, e.g., classification
Typical methods (all can be applied recursively):
- Binning: top-down split, unsupervised
- Histogram analysis: top-down split, unsupervised
- Clustering analysis: unsupervised, top-down split or bottom-up merge
- Decision-tree analysis: supervised, top-down split
- Correlation (e.g., χ2) analysis: unsupervised, bottom-up merge
(A binning sketch follows.)
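A hedged sketch of unsupervised discretization by binning with pandas: equal-width bins via cut and equal-frequency bins via qcut (the ages and bin labels are toy assumptions):

import pandas as pd

ages = pd.Series([3, 7, 15, 21, 24, 30, 38, 45, 52, 70])

equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "old"])   # same-width intervals
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])          # same count per bin

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))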
Semi-Supervised Learning
the algorithm is trained upon a combination of labeled and unlabeled data. Typically, this combination contains a very small amount of labeled data and a very large amount of unlabeled data. The basic procedure is that the programmer first clusters similar data using an unsupervised learning algorithm and then uses the existing labeled data to label the rest of the unlabeled data. The typical use cases of this type of algorithm have a common property: the acquisition of unlabeled data is relatively cheap, while labeling that data is very expensive.
curse of dimensionality
The more attributes there are, the easier it is to build a model that fits the sample data but is worthless as a predictor. The term refers to the phenomena that occur when classifying, organizing, and analyzing high-dimensional data that do not occur in low-dimensional spaces, specifically data sparsity and the loss of meaningful "closeness" of data.
Anomaly Detection
the process of identifying rare or unexpected items or events in a data set that do not conform to other items in the data set
Data Duplication
• In addition to detecting redundancies between attributes, duplication should also be detected at the data object level - where there are two or more identical tuples for a given unique data entry case • The use of denormalized tables (often done to improve performance by avoiding joins) is another source of data redundancy • Inconsistencies often arise between various duplicates, due to inaccurate data entry or updating some but not all data occurrences.
Data Visualization
• Univariate analysis: frequency distribution tables, bar charts, histograms, frequency polygons, pie charts
• Multivariate analysis (bivariate analysis):
- numerical vs. numerical: scatter plot
- categorical vs. categorical: cross table and stacked plot
Semi Supervised Learning Algorithms
▪ Self-Training ▪ Generative methods, mixture models ▪ Graph-based methods ▪ Co-Training ▪ Semi-supervised SVM ▪ Many others