Data Mining
itemset
A collection of one or more items
Discuss one of the distance measures that are commonly used for computing the dissimilarity of objects described by numeric attributes.
Euclidean distance: d(i, j) = sqrt((xi1 − xj1)^2 + (xi2 − xj2)^2 + ... + (xip − xjp)^2). Manhattan distance: d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|. Minkowski distance: d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h)^(1/h), the h-th root of the summed differences. Supremum distance: d(i, j) = max over f = 1, ..., p of |xif − xjf|.
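A minimal sketch of these four measures for two numeric objects (the function names and sample vectors are illustrative assumptions, not taken from any particular library):

```python
def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, h):
    # h = 1 gives Manhattan distance, h = 2 gives Euclidean distance
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def supremum(x, y):
    # Minkowski distance as h grows to infinity: the largest per-attribute difference
    return max(abs(a - b) for a, b in zip(x, y))

xi, xj = (1, 2), (3, 5)
print(euclidean(xi, xj), manhattan(xi, xj), minkowski(xi, xj, 3), supremum(xi, xj))
```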
What is the first step of association rule mining
Find all frequent itemsets: in other words, find every itemset that occurs at least as frequently as a predetermined minimum support count (the minimum number of transactions that must contain the itemset).
What is the second step of association rule mining
Generate strong association rules from the frequent itemsets: create rules that satisfy both the minimum support and minimum confidence thresholds.
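A brute-force sketch of both steps on a toy transaction list (the transactions, item names, and thresholds are made-up assumptions for illustration; real miners such as Apriori or FP-growth prune the search instead of enumerating every candidate):

```python
from itertools import combinations

transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread", "butter"}, {"milk", "bread"}]
min_support_count = 2      # step 1 threshold
min_confidence = 0.7       # step 2 threshold

items = set().union(*transactions)
support = {}
# Step 1: find all frequent itemsets by counting the transactions that contain them.
for size in range(1, len(items) + 1):
    for candidate in combinations(sorted(items), size):
        count = sum(1 for t in transactions if set(candidate) <= t)
        if count >= min_support_count:
            support[frozenset(candidate)] = count

# Step 2: generate strong rules A -> B from each frequent itemset.
for itemset, count in support.items():
    for k in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), k):
            a = frozenset(antecedent)
            confidence = count / support[a]     # support(A and B) / support(A)
            if confidence >= min_confidence:
                print(set(a), "->", set(itemset - a), f"conf={confidence:.2f}")
```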
What are association rules?
if-then statements that help to show the probability of relationships between data items within large data sets in various types of databases.
closed frequent itemset
an itemset that satisfies the properties of a closed itemset and also passes the minimum support threshold
Data integration
merges data from multiple sources into a coherent data store, such as a data warehouse.
Data cleaning
remove noise and correct inconsistencies in the data.
Binning
smooths a sorted set of data values by consulting the values' neighborhood: the sorted values are distributed into bins, and each value is replaced by its bin's mean, median, or closest boundary
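A small sketch of smoothing by bin means with equal-depth bins (the data values and bin size are arbitrary for the example):

```python
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(mean, 1)] * len(bin_values))   # replace each value with its bin mean
print(smoothed)   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```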
Data Characterization
Data characterization is a summary of the general characteristics or features of a target class of data. The data corresponding to the user-specified class is typically collected by a query. For example, to study the characteristics of software products with sales that increased by 10% in the previous year, the data related to such products can be collected by executing an SQL query on the sales database.
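A hedged sketch of characterization with pandas; the DataFrame and its column names (product, sales_growth, price, units_sold) are assumptions invented for illustration:

```python
import pandas as pd

sales = pd.DataFrame({
    "product": ["A", "B", "C", "D"],
    "sales_growth": [0.12, 0.03, 0.15, 0.25],
    "price": [99, 49, 199, 149],
    "units_sold": [1200, 300, 900, 1500],
})

# Collect the user-specified target class (products whose sales grew by at least 10%) ...
target = sales[sales["sales_growth"] >= 0.10]
# ... and summarize its general characteristics.
print(target[["price", "units_sold"]].describe())
```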
How can the data be preprocessed in order to help improve its quality?
Data cleaning Data integration Data reduction Data transformations
Data reduction
can reduce the data size by aggregating, eliminating redundant features, or clustering, for instance.
Data discrimination
comparison of the target class with one or a set of comparative classes
Data transformations
data are scaled to fall within a smaller range like 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.
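A short sketch of min-max normalization to the range [0.0, 1.0] (the values are arbitrary):

```python
values = [200, 300, 400, 600, 1000]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]   # scale each value into [0.0, 1.0]
print(normalized)   # [0.0, 0.125, 0.25, 0.5, 1.0]
```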
Regression
fits the data to a function, such as a linear regression line of best fit, and uses the fitted values in place of the observed values to smooth out the noise
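A short sketch of smoothing by linear regression with numpy (the x and y values are invented for the example):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])   # noisy observations
slope, intercept = np.polyfit(x, y, deg=1)      # least-squares best-fit line
smoothed = slope * x + intercept                # fitted values replace the noisy ones
print(smoothed)
```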
Regarding data mining methodology and user interaction, explain data mining challenges
Mining various and new kinds of knowledge Mining knowledge in multidimensional space Integrating new methods from multiple disciplines Boosting the power of discovery in a networked environment Handling uncertainty, noise, or incompleteness of data Pattern evaluation and pattern- or constraint-guided mining Interactive mining Incorporation of background knowledge Ad hoc data mining and data mining query languages Presentation and visualization of data mining results
The mean is in general affected by outliers
True
Not all numerical data sets have a median.
False
Discuss one of the factors comprising data quality and provide examples.
Accuracy Completeness Consistency Timeliness Believability Interpretability
Explain one challenge of mining a huge amount of data in comparison with mining a small amount of data.
Algorithms that deal with data need to scale well so that even vast amounts of data can be handled efficiently and in a reasonable amount of time; techniques that are practical on small data sets can become computationally infeasible on huge ones.
How would you catalog a boxplot, as a measure of dispersion or as a data visualization aid? Why?
As a data visualization aid. The boxplot shows visually how the boundaries relate to each other: where the minimum and maximum values lie, the interquartile range, and a line signifying the median. It does not give you a single specific measure, but it lets you visualize the spread of the data set. For example, with a boxplot of the grades in a class, if the box sits closer to the minimum boundary you can see that most scores were low.
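A sketch of the five-number summary that a boxplot visualizes (the grades are made up):

```python
import numpy as np

grades = np.array([45, 52, 55, 58, 60, 62, 65, 70, 78, 95])
summary = {
    "min": grades.min(),
    "Q1": np.percentile(grades, 25),
    "median": np.median(grades),
    "Q3": np.percentile(grades, 75),
    "max": grades.max(),
}
print(summary)   # the box runs from Q1 to Q3, the line marks the median, whiskers reach min/max
```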
closed itemset
An itemset X is closed if there exists no proper superset of X with the same support count. For example, if itemset A = {a, b} and there also exists an itemset B = {a, b, c} with the same support (every transaction containing a and b also contains c), then A is not closed, because B is a superset of A that carries the same support.
What are the data mining functionalities
Characterization and discrimination Mining of frequent patterns, associations, and correlations Classification and regression Clustering analysis Outlier analysis
What are the steps involved in data mining when viewed as a process of knowledge discovery?
Data Cleaning Data Integration Data Selection Data Transformation Data Mining Pattern Evaluation Knowledge Presentation
Why is data integration necessary? What are some of the challenges to consider and the techniques employed in data integration?
Data integration is used to combine multiple sources of the same type of data into a coherent store; more sources help guard against bias, and more data is generally better. The challenge is that different sources do not always label the same data in the same way. For example, customer_name in one source could be called cust_nm in another, and the name Bill in one source could be William in another yet refer to the same person; this is known as the entity identification problem. Correlation analysis is one technique employed during data integration: it measures how dependent or independent attributes are on one another and is used to keep redundancy in check. An attribute is redundant if it can be derived from an existing attribute; for example, years can be derived from days, so keeping both attributes in the same data set is redundant.
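A small sketch of using the correlation coefficient to flag a redundant numeric attribute pair (the attribute values are invented):

```python
import numpy as np

days = np.array([30, 365, 180, 730, 90])
years = days / 365.0                       # derivable from days, hence redundant
r = np.corrcoef(days, years)[0, 1]         # Pearson correlation coefficient
print(r)   # ~1.0, a strong hint that one of the two attributes can be dropped
```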
What is data mining?
Data mining is the process of sorting through large data sets to identify patterns and establish relationships to solve problems through data analysis
What do we understand by data quality and what is its importance?
Data has quality if it satisfies the requirements of its intended use. It depends on many factors, including accuracy, completeness, consistency, timeliness, believability, and interpretability. Data quality also depends on the intended use of the data: for some users the data may be inconsistent, while for others it may simply be hard to interpret. Data quality is extremely important in data mining because poor-quality data can be difficult to analyze, hard to use, unreliable, or outdated. In other words, a database of bad quality can defeat the whole purpose of having a database.
What do we understand by dissimilarity measure and what is its importance?
Dissimilarity is a measure of the difference between two objects: the greater the difference between the two objects, the higher the value. Its importance is that in some instances a low dissimilarity is itself meaningful. For example, two patients whose symptoms have low dissimilarity could point to an illness that is environmental or that can spread to others; likewise, a student's essay with low dissimilarity to another paper could indicate plagiarism.
In many real-life databases, objects are described by a mixture of attribute types. How can we compute the dissimilarity between objects of mixed attribute types?
There are two main approaches to determining the dissimilarity between objects of mixed attribute types. The first is to group each attribute type separately and perform a separate data mining analysis for each type; this is acceptable if the results are consistent, but in real-life projects it is usually not viable because analyzing the attribute types separately will most likely generate different results. The second, more accepted approach processes all attribute types together and performs only one analysis by combining the attributes into a single dissimilarity matrix.
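A minimal sketch of combining numeric and nominal attributes into one dissimilarity value (a Gower-style average; the objects, attribute names, and ranges are assumptions for illustration):

```python
def mixed_dissimilarity(a, b, numeric_ranges):
    """a, b: attribute dicts; numeric_ranges: (max - min) per numeric attribute."""
    total, count = 0.0, 0
    for attr, value_a in a.items():
        value_b = b[attr]
        if attr in numeric_ranges:                       # numeric attribute
            d = abs(value_a - value_b) / numeric_ranges[attr]
        else:                                            # nominal attribute
            d = 0.0 if value_a == value_b else 1.0
        total += d
        count += 1
    return total / count                                 # average contribution per attribute

p1 = {"age": 30, "income": 40000, "color": "red"}
p2 = {"age": 45, "income": 52000, "color": "blue"}
print(mixed_dissimilarity(p1, p2, {"age": 50, "income": 80000}))   # ~0.48
```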
Maximal Frequent Itemset
A frequent itemset that has no frequent superset (supersets may exist, but none of them pass the minimum support threshold); it therefore also meets the closed frequent itemset criteria.
outlier analysis
detects outliers, for example by clustering the data; values that fall outside of the clusters may be considered noise and removed
What are the differences between the measures of central tendency and the measures of dispersion?
The measures of central tendency are the mean, median, mode, and midrange. They are used to locate the middle or center of the data distribution, basically where most values fall. The measures of dispersion, on the other hand, are the range, quartiles, interquartile range, the five-number summary, boxplots, and the variance and standard deviation of the data. A low standard deviation means the observations lie close to the mean, whereas a high standard deviation means the opposite. Dispersion measures are mainly used to get an idea of how the data is spread out and to identify outliers.
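A sketch computing both kinds of measures on one small data set (the values are arbitrary):

```python
import statistics as st

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# central tendency
print("mean:", st.mean(data), "median:", st.median(data), "mode:", st.mode(data))
print("midrange:", (min(data) + max(data)) / 2)

# dispersion
q1, q2, q3 = st.quantiles(data, n=4)                  # quartiles
print("range:", max(data) - min(data), "IQR:", q3 - q1)
print("variance:", round(st.pvariance(data), 1), "std dev:", round(st.pstdev(data), 1))
```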
The mode is the only measure of central tendency that can be used for nominal attributes.
True. A nominal attribute is a symbol or name of a thing; an example would be hair color, with categories such as black, brown, blond, and red. A nominal attribute can also be represented using numbers, but these are not meant to be used quantitatively. It is not meaningful to compute the median or mean of a nominal attribute because its values have no inherent order. The mode, however, is valuable because it shows the most commonly occurring value.
What do we understand by similarity measure and what is its importance?
A similarity measure quantifies the similarity between two objects. Usually, large values are for similar objects and zero or negative values are for dissimilar objects. Similarity measures are important because they help us see patterns in data. They also give us knowledge about our data. Similarity measures are used in clustering algorithms. Similar data points are put into the same clusters, and dissimilar points are placed into different clusters.
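One common similarity measure is cosine similarity; a minimal sketch (the example vectors are arbitrary term counts):

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

doc1 = [5, 0, 3, 0, 2]   # e.g., term counts for two documents
doc2 = [3, 0, 2, 0, 1]
print(cosine_similarity(doc1, doc2))   # close to 1.0 -> very similar
```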
What is an outlier? Does an outlier need to be discarded always?
An outlier is an object which does not fit the general behavior of the model. In most data mining tasks, outliers are discarded. However, there are special circumstances, such as fraud detection, where the outliers themselves are the interesting cases; there is even a field of outlier analysis, used heavily in credit card fraud detection, where unusual amounts, locations, and so on are flagged.
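A small sketch of flagging outliers with the 1.5 × IQR rule often used alongside boxplots (the charge amounts are invented):

```python
import statistics as st

amounts = [25, 30, 28, 35, 27, 31, 29, 900]    # e.g., credit card charges
q1, _, q3 = st.quantiles(amounts, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in amounts if a < lower or a > upper]
print(outliers)   # [900] stands out and may warrant closer inspection (possible fraud)
```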
What do we understand by "frequent patterns"? How are they used in data mining? Please provide examples.
Frequent patterns, as the name indicates, are patterns that occur frequently in a data set. There are three categories of such patterns: itemsets, subsequences, and substructures. Frequent itemset mining is useful for discovering associations and correlations between items in a data set, which can help businesses make smart marketing decisions; one example is market basket analysis, which determines what items are frequently purchased together by customers (for instance milk and bread, or computers and antivirus software). Frequent substructure mining focuses on determining structures that occur frequently (subgraphs, subtrees, or sublattices); one of its applications is the analysis of substructures in chemical compounds. Frequent subsequence mining, or sequential pattern mining, focuses on discovering subsequences in a set of sequences (ordered lists of items); one of its main applications is determining what sequences of items are frequently bought by customers, which helps businesses understand customer behavior.
Please discuss the meaning of noise in data sets and the methods that can be used to remove the noise (smooth out the data).
Noise is random error or variance in a measured variable. Several methods can be used to remove noise and smooth out the data, such as binning, regression, and outlier analysis.