Data Mining Test 1


Noise

A random error or variance in a measured variable.

What are the 6 methods to handle Missing Values?

1. Ignore the tuple.
2. Fill in the missing value manually.
3. Use a global constant to fill in the missing value.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value (may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction).
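Methods 3, 4 and 5 are mechanical enough to sketch in a few lines. The example below is only an illustration: the records, the class labels, and the constant -1.0 are hypothetical, and the mean is used where the median would work equally well.

```python
from statistics import mean

# Hypothetical toy records: (class label, income). None marks a missing value.
records = [
    ("low_risk", 40.0),
    ("low_risk", None),
    ("low_risk", 60.0),
    ("high_risk", 20.0),
    ("high_risk", None),
]

def fill_constant(rows, constant=-1.0):
    """Method 3: replace every missing value with one global constant."""
    return [(c, constant if v is None else v) for c, v in rows]

def fill_attribute_mean(rows):
    """Method 4: replace missing values with the mean of all observed values."""
    m = mean(v for _, v in rows if v is not None)
    return [(c, m if v is None else v) for c, v in rows]

def fill_class_mean(rows):
    """Method 5: replace missing values with the mean of the tuple's own class."""
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_means[c] if v is None else v) for c, v in rows]

print(fill_attribute_mean(records)[1])  # ('low_risk', 40.0)
print(fill_class_mean(records)[1])      # ('low_risk', 50.0)
```

The class-based fill gives 50.0 rather than 40.0 for the same tuple because it only averages the other low_risk records.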

Semi-supervised learning

A class of machine learning techniques that make use of both labeled and unlabeled examples when learning a model. In one approach, labeled examples are used to learn class models and unlabeled examples are used to refine the boundaries between classes. For a two-class problem, we can think of the set of examples belonging to one class as the positive examples and those belonging to the other class as the negative examples.

Three types of measures

Distributive, Algebraic, Holistic
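The three types differ in how a measure behaves when the data are split into partitions. A distributive measure (sum, count) can be combined from per-partition results, an algebraic measure (average) is computed from a fixed number of distributive measures, and a holistic measure (median) needs the whole data set. A small sketch with made-up partitions:

```python
from statistics import median

# Hypothetical data split into two partitions.
partitions = [[1, 2, 3], [4, 100]]
all_values = [v for part in partitions for v in part]

# Distributive: per-partition sums and counts combine into the global result.
total = sum(sum(part) for part in partitions)
count = sum(len(part) for part in partitions)

# Algebraic: the average is a function of two distributive measures.
average = total / count

# Holistic: the median of partition medians is generally NOT the true median.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(total, count, average)           # 110 5 22.0
print(median_of_medians, true_median)  # 27.0 3
```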

Unsupervised learning

Essentially a synonym for clustering. The learning process is unsupervised since the input examples are not class labeled. Typically, we may use clustering to discover classes within the data. For example, an unsupervised learning method can take, as input, a set of images of handwritten digits. Suppose that it finds 10 clusters of data. These clusters may correspond to the 10 distinct digits of 0 to 9, respectively.

Data quality - Accuracy

Inaccurate data have incorrect attribute values. This can be caused by faulty instruments during data recording, human or computer error at data entry, or users deliberately entering incorrect values for fields they do not want to fill in (disguised missing data).

Interval-scaled data type

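A numeric attribute measured on a scale of equal-size units. Like all numeric attributes, it is quantitative: a measurable quantity represented in integer or real values. Values are ordered and the difference between two values is meaningful, but there is no true zero-point, so one value cannot be called a multiple of another. Numeric attributes can be interval-scaled or ratio-scaled. A temperature attribute in Celsius or Fahrenheit is interval-scaled.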

How is the Interquartile Range calculated?

Quartile 3 minus Quartile 1 (IQR = Q3 - Q1).
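A minimal sketch of the calculation with Python's standard library. Quartile conventions differ between textbooks and tools; the "inclusive" method used here is one choice among several, and the data are made up.

```python
from statistics import quantiles

def interquartile_range(values):
    """Return Q3 - Q1 using the 'inclusive' quartile convention."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical sample
print(interquartile_range(data))  # 4.0
```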

Data mining functionality: Association

The discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. (ex. a data mining system may find association rules like <omitted> where X is a variable representing a student. The rule indicates that of the students under study, 12% (support) major in computing science and own a personal computer. There is a 98% probability (confidence, or certainty) that a student in this group owns a personal computer.)
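Support and confidence can be computed by simple counting. The sketch below uses ten hypothetical students, so its numbers (30% and 75%) are not the 12% and 98% of the example above; the item names are also invented for illustration.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    return count(transactions, antecedent | consequent) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of antecedent transactions that also contain the consequent."""
    both = count(transactions, antecedent | consequent)
    return both / count(transactions, antecedent)

# Ten hypothetical students, each described as a set of attribute=value items.
students = (
    [{"major=CS", "owns=PC"}] * 3
    + [{"major=CS"}] * 1
    + [{"major=other", "owns=PC"}] * 4
    + [{"major=other"}] * 2
)

rule_if, rule_then = {"major=CS"}, {"owns=PC"}
print(support(students, rule_if, rule_then))     # 0.3
print(confidence(students, rule_if, rule_then))  # 0.75
```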

Data quality - timeliness

Whether data are recorded and updated on time affects their quality. For example, if sales representatives submit their sales records at different intervals, the data on hand at any given moment are incomplete, which makes the calculation of sales bonuses for top-performing representatives inaccurate.

What is Data Mining?

The process of discovering interesting patterns and knowledge from large amounts of data. The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically.

Data sets with one, two, or three modes are respectively called:

Unimodal, bimodal, and trimodal.
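As a rough illustration, the modes of a small data set can be found with `statistics.multimode` and the count of modes mapped to these names. The helper name and the sample values are hypothetical.

```python
from statistics import multimode

def describe_modality(values):
    """Return the list of modes and the name for that many modes."""
    modes = multimode(values)
    names = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    # More than three modes is reported as "multimodal". Note that when every
    # value occurs exactly once, multimode returns all of them, although such
    # a data set is usually said to have no mode.
    return modes, names.get(len(modes), "multimodal")

print(describe_modality([5, 5, 6]))           # ([5], 'unimodal')
print(describe_modality([1, 2, 2, 3, 3, 4]))  # ([2, 3], 'bimodal')
```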

Data mining functionality: Classification

Differs from prediction in that classification constructs a set of models (or functions) that describe and distinguish data classes or concepts, whereas prediction builds a model to predict missing or unavailable, and often numerical, data values. The two are similar in that both are tools for prediction: classification predicts the class label of data objects, while prediction typically predicts missing numerical data values.

Machine learning

A field that investigates how computers can learn or improve their performance based on data.

Active learning

A machine learning approach that lets users play an active role in the learning process. An active learning approach can ask a user (e.g., a domain expert) to label an example, which may be from a set of unlabeled examples or synthesized by the learning program.

Data mining functionality: Characterization

A summarization of the general characteristics or features of a target class of data. (ex. the characteristics of students can be summarized to produce a profile of all first-year computing science students at a university, which may include such information as a high GPA and a large number of courses taken.)

What is the knowledge discovery process?

Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, knowledge presentation.

What is the process of data discovery?

Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, knowledge presentation.

What are some Data Mining Functionalities?

Generalization, Association Rule Discovery and Correlation Analysis, Classification, Cluster Analysis, and Outlier Analysis.

knowledge presentation

The step where visualization and knowledge representation techniques are used to present mined knowledge to users.

How is mid-range calculated?

(Min + Max) / 2, that is, the average of the smallest and largest values in the data set.
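A one-function sketch of the mid-range; the salary values are hypothetical.

```python
def midrange(values):
    """Return the average of the smallest and largest values."""
    return (min(values) + max(values)) / 2

salaries = [30, 36, 47, 50, 110]  # hypothetical values, in thousands
print(midrange(salaries))  # 70.0
```

Because only the two extremes are used, a single extreme value such as 110 pulls the mid-range far from the bulk of the data.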

Data quality - Completeness

Missing data. Data may be missing because they were unavailable, because they were not considered useful at the time of recording and so were never recorded, or because of equipment malfunctions, etc.

Pattern Evaluation

To identify the truly interesting patterns representing knowledge based on interestingness measures.

Data transformation

Data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations.
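One common aggregation is rolling daily figures up to monthly totals. The sketch below assumes hypothetical daily sales keyed by ISO date strings; it is only one example of a summary operation.

```python
from collections import defaultdict

# Hypothetical daily sales: (ISO date, amount).
daily_sales = [
    ("2024-01-03", 120.0),
    ("2024-01-17", 80.0),
    ("2024-02-02", 50.0),
    ("2024-02-20", 25.5),
]

def monthly_totals(rows):
    """Aggregate daily amounts into one total per 'YYYY-MM' month."""
    totals = defaultdict(float)
    for date, amount in rows:
        totals[date[:7]] += amount  # first 7 characters are 'YYYY-MM'
    return dict(totals)

print(monthly_totals(daily_sales))  # {'2024-01': 200.0, '2024-02': 75.5}
```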

Nominal data type

Means "relating to names". The values are symbols or names of things, each representing a category, code, or state, with no meaningful order among them.

Outlier

A data object that does not comply with the general behavior or model of the data.

Data cube

A multidimensional data structure in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count.
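A very small sketch of the idea, using a `Counter` keyed by (quarter, city, item) tuples as the cube and count as the aggregate measure. The dimension names, the records, and the `roll_up` helper are hypothetical.

```python
from collections import Counter

# Hypothetical sales records with three dimensions: (quarter, city, item).
sales = [
    ("Q1", "Vancouver", "TV"),
    ("Q1", "Vancouver", "TV"),
    ("Q1", "Toronto", "phone"),
    ("Q2", "Vancouver", "phone"),
    ("Q2", "Toronto", "TV"),
    ("Q2", "Toronto", "TV"),
]

# Each cell of the cube stores the count for one combination of values.
cube = Counter(sales)

def roll_up(cells, keep):
    """Aggregate counts over every dimension whose index is not in keep."""
    out = Counter()
    for cell, n in cells.items():
        out[tuple(cell[i] for i in keep)] += n
    return out

print(cube[("Q1", "Vancouver", "TV")])   # 2
print(roll_up(cube, keep=(0,))[("Q1",)])  # 3
print(roll_up(cube, keep=())[()])         # 6
```

Keeping fewer dimensions gives a coarser summary; keeping none collapses the cube to a single overall count.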

Binary data type

A nominal attribute with only two categories or states: 0 or 1. 0 typically represents "absent", while 1 represents "present".

Ratio-scaled data type

A numeric attribute with an inherent zero-point. That is, if a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. Examples of ratio-scaled attributes include count attributes such as years of experience and number of words.

What is a data warehouse?

A repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

What are the primary factors that comprise data quality?

Accuracy, completeness, consistency, timeliness, believability, and interpretability.

Ordinal data type

An attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known. One example is drink sizes at a fast food restaurant, such as "Small", "Medium" and "Large".

Data Mining (as a process)

An essential process where intelligent methods are applied to extract data patterns.

Data mining functionality: Data Evolution

Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association, classification or clustering of time-related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.

Data mining functionality: Clustering

Analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. Each cluster that is formed can be viewed as a class of objects. Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.

Cluster Analysis

Analyzes data objects without consulting class labels. Clustering can be used to generate class labels for a group of data. Clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are rather dissimilar to objects in other clusters.
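To make the idea concrete, here is a bare-bones k-means sketch on one-dimensional values. k-means is only one of many clustering methods and is not named in the definition above; the values, starting centers, and iteration count are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group 1-D values around the given starting centers with plain k-means."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # no class labels are supplied
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)   # [2.0, 11.0]
print(clusters)  # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

The two resulting groups could then be treated as generated class labels for the data.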

Supervised learning

Basically a synonym for classification. The supervision in the learning comes from the labeled examples in the training data set. For example, in the postal code recognition problem, a set of handwritten postal code images and their corresponding machine-readable translations are used as the training examples, which supervise the learning of the classification model.

5-Number summary

Consists of the following: Minimum, Quartile 1 (Q1), Median, Quartile 3 (Q3) and Max.
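A short sketch of the five values using the standard library. As with any quartile calculation, the result depends on the quartile convention; the "inclusive" method and the sample data here are assumptions made for the example.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical sample
print(five_number_summary(data))  # (1, 3.0, 5.0, 7.0, 9)
```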

Outlier Analysis

Rather than being discarded as noise, outliers can be used to observe interesting behaviors. A typical application is fraud detection.
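One simple way to surface candidate outliers is the 1.5 x IQR rule of thumb, which flags values far below Q1 or far above Q3. The rule is a common convention rather than part of the definition above, and the transaction amounts below are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical card transaction amounts; 95 is far from the rest.
amounts = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(amounts))  # [95]
```

In a fraud detection setting, the flagged value would be a transaction worth a closer look rather than a record to delete.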

Data Quality - Interpretability

Reflects how easily the data are understood.

Data Quality - Believability

Reflects how much the data are trusted by users.

Data selection

The step where data relevant to the analysis task are retrieved from the database.

Data mining functionality: Discrimination

A comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. (ex. the general features of students with high GPAs may be compared with the general features of students with low GPAs. The resulting description could be a general comparative profile of the students, such as 75% of the students with high GPAs are fourth-year computing science students while 65% of the students with low GPAs are not.)

Data discrimination

A comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes. The target and contrasting classes can be specified by a user, and the corresponding data objects can be retrieved through database queries.

