Data Mining

What are the data mining functionalities?

Characterization and discrimination; mining of frequent patterns, associations, and correlations; classification and regression; clustering analysis; outlier analysis

Why is data quality important?

Bad-quality data can become difficult to analyze, hard to use, unreliable, or outdated. In other words, a database with bad quality can defeat the whole purpose of having a database.

How can the data be preprocessed in order to help improve its quality?

Data cleaning, data integration, data reduction, and data transformation

Attribute generalization

If there is a large set of distinct values for an attribute in the initial working relation, and there exists a set of generalization operators on the attribute, then a generalization operator should be selected and applied to the attribute

Attribute removal

If there is a large set of distinct values for an attribute of the initial working relation, but either there is no generalization operator on the attribute, or its higher-level concepts are expressed in terms of other attributes, then the attribute should be removed from the working relation.

Holistic measures

An aggregate function is holistic if there is no constant bound on the storage size needed to describe a subaggregate; that is, there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include median(), mode(), and rank(). A measure is holistic if it is obtained by applying a holistic aggregate function.
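
A minimal sketch in Python (example values made up) of why median() is holistic: no constant-size set of per-partition subaggregates suffices, so the median of the partition medians generally differs from the true median of the full data set.

```python
import statistics

# Made-up values split into two partitions.
partition_a = [1, 2, 3, 100]
partition_b = [4, 5, 6]
full_data = partition_a + partition_b

# Keep only one aggregate (the median) per partition, then combine them.
median_of_medians = statistics.median(
    [statistics.median(partition_a), statistics.median(partition_b)]
)

print(statistics.median(full_data))   # 4 -- the true median
print(median_of_medians)              # 3.75 -- combining subaggregates fails
```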

Full Materialization

Precompute all of the cuboids. The resulting lattice of computed cuboids is referred to as the full cube. This choice typically requires huge amounts of memory space in order to store all of the precomputed cuboids.
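
A rough sketch in Python (toy fact table, dimension and measure names made up) of full materialization: every subset of the dimensions is grouped and aggregated, producing 2^n cuboids, including the apex cuboid with no group-by.

```python
from itertools import combinations

# Toy fact table with three dimensions and a sales measure (values made up).
rows = [
    {"city": "NY", "item": "TV", "year": 2023, "sales": 10},
    {"city": "NY", "item": "PC", "year": 2024, "sales": 7},
    {"city": "SF", "item": "TV", "year": 2023, "sales": 5},
]
dims = ("city", "item", "year")

cube = {}
for k in range(len(dims) + 1):
    for group_by in combinations(dims, k):       # one cuboid per dimension subset
        cuboid = {}
        for r in rows:
            key = tuple(r[d] for d in group_by)
            cuboid[key] = cuboid.get(key, 0) + r["sales"]
        cube[group_by] = cuboid

print(len(cube))    # 2^3 = 8 cuboids
print(cube[()])     # {(): 22} -- the apex cuboid (total sales)
```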

What are the differences between the measures of central tendency and the measures of dispersion?

The measures of central tendency are the mean, median, mode, and midrange. They measure the location of the middle or center of the data distribution, that is, where most values fall. The measures of dispersion are the range, quartiles, interquartile range, five-number summary, boxplots, variance, and standard deviation. They give an idea of how the data are spread out and help identify outliers.

Galaxy Schema

Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars.

Distributive measures

Suppose the data are partitioned into n sets and we apply the function to each partition, resulting in n aggregate values. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to the entire data set (without partitioning), the function can be computed in a distributed manner. A measure is distributive if it is obtained by applying a distributive aggregate function; common examples include count(), sum(), min(), and max().
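
A minimal Python check (data made up) that sum() is distributive: summing the per-partition sums gives the same result as summing the entire data set without partitioning.

```python
data = [3, 8, 1, 7, 2, 9, 4]
partitions = [data[0:3], data[3:5], data[5:7]]   # n = 3 partitions

partial_sums = [sum(p) for p in partitions]      # one aggregate value per partition
assert sum(partial_sums) == sum(data)            # same result as without partitioning
print(partial_sums, sum(partial_sums))           # [12, 9, 13] 34
```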

What is the importance of dissimilarity measures?

In some instances, two objects with very low dissimilarity can indicate something negative; for example, two nearly identical homework or exam submissions may suggest cheating.

Star schema

The most common modeling paradigm, in which the data warehouse contains a large central table containing the bulk of the data, with no redundancy, and a set of smaller attendant tables, one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table

What do we understand by data normalization?

The process by which data is transformed to fall within a smaller range such as [−1,1] or [0.0, 1.0]. This attempts to give all attributes of the data set an equal weight.

What is data mining?

The process of sorting through large data sets to identify patterns and establish relationships to solve problems through data analysis

Please discuss the meaning of noise in data sets and the methods that can be used to remove the noise (smooth out the data).

Noise is the random error or variance found in measured variables. The data can be smoothed using binning, regression, and outlier analysis.

What is the importance of similarity measures?

They are important because they help us see patterns in data. They also give us knowledge about our data. They are used in clustering algorithms. Similar data points are put into the same clusters, and dissimilar points are placed into different clusters.

Discretization

Where the raw values of a numeric attribute are replaced by interval labels or conceptual labels. The labels can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users. (Concept hierarchy climbing)

Smoothing

Works to remove noise from the data. Techniques include binning, regression, and clustering.

Nonvolatile

a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.

Data mart

a data warehouse model that contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects.

In an online transaction processing system, the typical unit of work is

a short, atomic transaction (mostly read-only access is characteristic of OLAP systems, not OLTP)

snowflake schema

a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake

Pivot

a visualization operation that rotates the data axes in view to provide an alternative data presentation

Top-down view

allows the selection of the relevant information necessary for the data warehouse. This information matches current and future business needs.

Intuitively, the roll-up OLAP operation corresponds to concept ___ in a concept hierarchy

climbing

In attribute-oriented induction, data relevant to the task at hand is collected and then generalization is performed by either attribute generalization or __

attribute removal

Data reduction

can reduce the data size by aggregating, eliminating redundant features, or clustering.

Data discrimination

comparison of the target class with one or a set of comparative classes

Redundancy

an attribute is redundant if it can be derived from another attribute or set of attributes.

Among the data warehouse applications, __ applications support knowledge discovery

data mining

Subject-oriented

data warehouses typically provide a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.

Regression

smooths data by fitting it to a function; for example, linear regression derives the best-fit line through the data, and noisy values can be replaced by the fitted values
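
A small sketch of smoothing by linear regression, assuming NumPy is available; the noisy points are made up. The best-fit line is found by least squares and each noisy value is replaced by the fitted value.

```python
import numpy as np

# Made-up noisy observations of a roughly linear relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Fit y = a*x + b by least squares (a degree-1 polynomial).
a, b = np.polyfit(x, y, deg=1)

# Smoothed values: replace each observation with the best-fit line's prediction.
y_smoothed = a * x + b
print(round(a, 3), round(b, 3))
print(y_smoothed)
```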

Metadata

describe or define warehouse elements

Entity Identification Problem

different sources do not always label the same data in the same way; for example, a customer identifier may be called customer_id in one database and cust_number in another.

Consider a data cube measure obtained by applying the sum() function. The measure is

distributive

Data Value Conflict Detection and Resolution

for the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation, scaling, or encoding

Discrete Attribute

has a finite or countably infinite set of values

Spiral

involves the rapid generation of increasingly functional systems, with short intervals between successive releases. This is considered a good choice for data warehouse development, especially for data marts, because the turnaround time is short, modifications can be done quickly, and new designs and technologies can be adapted in a timely manner.

Concept description

is the most basic form of descriptive data mining. It describes a given set of task-relevant data in a concise and summative manner, presenting interesting general properties of the data. It consists of characterization and comparison (or discrimination).

Partial materialization

is the selective computation of a subset of the cuboids or subcubes in the lattice

A major distinguishing feature of an online analytical processing system is that

it manages large amounts of historical data

Data integration

merges data from multiple sources into a coherent data store, such as a data warehouse.

Drill-down

navigates from less detailed data to more detailed data. Can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions

Waterfall

performs a structured and systematic analysis at each step before proceeding to the next

Roll-up

performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction

Nominal attribute

refers to symbols or names of things; categorical. Nominal values can be represented by numbers, but those numbers are not meant to be used quantitatively. A nominal attribute has no meaningful mean or median, but it does have a mode

Data cleaning

remove noise and correct inconsistencies in the data.

The ___ OLAP operation performs a selection on one dimension of the given cube

slice

Binning

smooths sorted data values by consulting their neighborhood: the values are distributed into bins, and each value is then replaced by the bin mean, bin median, or closest bin boundary
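
A rough sketch in Python of smoothing by bin means (equal-frequency bins); the prices are example values. Each value is replaced by the mean of its bin.

```python
def smooth_by_bin_means(values, n_bins):
    # Sort the data, split it into (roughly) equal-frequency bins, and
    # replace every value in a bin by that bin's mean.
    ordered = sorted(values)
    bin_size = len(ordered) // n_bins
    smoothed = []
    for i in range(n_bins):
        start = i * bin_size
        end = (i + 1) * bin_size if i < n_bins - 1 else len(ordered)
        bin_values = ordered[start:end]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]      # 9 values, 3 bins of 3
print(smooth_by_bin_means(prices, 3))            # bin means: 9, 22, 29
```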

In the ___ schema some dimension tables are normalized generating additional tables

snowflake

In data warehouse development, with the ___ process changes to requirements can be resolved faster

spiral

Analytical processing

supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting. It generally operates on historic data in both summarized and detailed forms. The major strength is the multidimensional data analysis of data warehouse data.

Information processing

supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs. A current trend is to construct low-cost web-based accessing tools that are then integrated with web browsers.

Dimensionality reduction

the process of reducing the number of random variables or attributes under consideration. Techniques include wavelet transforms, principal components analysis, and attribute subset selection

Z-score normalization

the values for an attribute, A, are normalized based on the mean (i.e., average) and standard deviation of A: v'i = (vi − Ā) / σA, where Ā and σA are the mean and standard deviation of A
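
A minimal Python sketch of z-score normalization (example values made up), using the population standard deviation of the attribute.

```python
import statistics

values = [200, 300, 400, 600, 1000]              # made-up attribute values

mean_a = statistics.mean(values)                 # mean of A
stdev_a = statistics.pstdev(values)              # population standard deviation of A
z_scores = [(v - mean_a) / stdev_a for v in values]
print([round(z, 3) for z in z_scores])
```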

Tuple Duplication

there are two or more identical tuples for a given unique data entry case

Data compression

transformations are applied so as to obtain a reduced or "compressed" representation of the original data.

Continuous Attributes

typically represented as floating-point variables.

Data transformation strategies

1. Smoothing 2. Attribute construction (or feature construction) 3. Aggregation 4. Normalization 5. Discretization 6. Concept hierarchy generation for nominal data

How many cuboids are there in a 6-dimensional data cube if there were no hierarchies associated to any dimension?

64. With no concept hierarchies, each of the 6 dimensions either appears in a group-by or is aggregated away, giving 2^6 = 64 cuboids.

Correlation

A calculation used to determine how dependent or independent attributes are on each other. Correlation analysis is used to keep redundancy in check.
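
A small sketch of checking redundancy with Pearson's correlation coefficient; it assumes Python 3.10+ (for statistics.correlation) and uses made-up values. A coefficient near +1 or −1 suggests one attribute is largely redundant given the other; a value near 0 suggests independence.

```python
import statistics

# Made-up values of two numeric attributes measured on the same objects.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.1, 2.0, 2.9, 4.2, 5.0]

# Pearson's correlation coefficient (requires Python 3.10+).
r = statistics.correlation(x, y)
print(round(r, 3))   # close to 1.0 -> x and y are strongly correlated
```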

Integrated

A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records

Data transformations

A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.

Binary Attributes

A nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present.

Data warehouse

A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process.

Data Characterization

A summary of the general characteristics or features of a target class of data. The data corresponding to the user-specified class is typically collected by a query. For example, to study the characteristics of software products with sales that increased by 10% in the previous year, the data related to such products can be collected by executing an SQL query on the sales database.

Discuss one of the factors comprising data quality and provide examples.

Accuracy, completeness, consistency, timeliness, believability, and interpretability

Explain one challenge of mining a huge amount of data in comparison with mining a small amount of data.

Algorithms that mine data must scale well so that even vast amounts of data can be handled efficiently and in a reasonable amount of time.

Ordinal Attributes

An attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.

What is an outlier?

An object which does not fit in with the general behavior of the model.

How would you catalog a boxplot, as a measure of dispersion or as a data visualization aid? Why?

As a data visualization aid. A boxplot visually shows how the boundaries relate to each other: where the minimum and maximum values lie and the interquartile range, with a line marking the median. It does not give a single specific measure, but it lets you visualize the data set. For example, in a boxplot of the grades in a class, if the box sits close to the minimum boundary you can see that most scores were low.

What are the steps involved in data mining when viewed as a process of knowledge discovery?

1. Data cleaning 2. Data integration 3. Data selection 4. Data transformation 5. Data mining 6. Pattern evaluation 7. Knowledge presentation

Dice

Defines a subcube by performing a selection on two or more dimensions

Data reduction strategies.

Dimensionality reduction, numerosity reduction, and data compression

Algebraic measures

An aggregate function is algebraic if it can be computed by an algebraic function with M arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate function. For example, avg() (average) can be computed as sum()/count(), where both sum() and count() are distributive aggregate functions. A measure is algebraic if it is obtained by applying an algebraic aggregate function.
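
A minimal Python check (data made up) that avg() is algebraic: keeping the pair (sum, count) per partition, both distributive, is enough to reconstruct the exact overall average.

```python
data = [3.0, 8.0, 1.0, 7.0, 2.0, 9.0]
partitions = [data[0:2], data[2:4], data[4:6]]

partials = [(sum(p), len(p)) for p in partitions]        # M = 2 arguments per partition
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)

assert total_sum / total_count == sum(data) / len(data)  # avg() = sum() / count()
print(total_sum / total_count)                           # 5.0
```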

What are some of the challenges to consider and the techniques employed in data integration?

The challenges include the entity identification problem, redundancy, tuple duplication, and data value conflict detection and resolution. Correlation analysis is a technique employed to detect redundancy during data integration.

Discuss one of the distance measures that are commonly used for computing the dissimilarity of objects described by numeric attributes.

Euclidean distance: d(i, j) = sqrt((xi1 − xj1)^2 + (xi2 − xj2)^2 + ... + (xip − xjp)^2)
Manhattan distance: d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|
Minkowski distance: d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h)^(1/h)
Supremum distance: d(i, j) = max over f = 1, ..., p of |xif − xjf|
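
A self-contained Python sketch of the four distances above for two numeric vectors of equal length; the example points are made up.

```python
def minkowski(x, y, h):
    # Minkowski distance: the h-th root of the sum of |differences|^h.
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def euclidean(x, y):
    return minkowski(x, y, h=2)

def manhattan(x, y):
    return minkowski(x, y, h=1)

def supremum(x, y):
    # Limit of the Minkowski distance as h approaches infinity.
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1.0, 2.0), (3.0, 5.0)
print(euclidean(x, y))        # sqrt(2^2 + 3^2) ~= 3.606
print(manhattan(x, y))        # 2 + 3 = 5
print(minkowski(x, y, 3))     # (2^3 + 3^3)^(1/3) ~= 3.271
print(supremum(x, y))         # max(2, 3) = 3
```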

Time-variant

Every key structure in the data warehouse contains, either implicitly or explicitly, a time element.

Not all numerical data sets have a median. (T/F)

False

Does an outlier need to be discarded always?

In most cases of data mining, outliers are discarded. However, there are special circumstances, such as fraud detection, where outliers can be useful.

In many real-life databases, objects are described by a mixture of attribute types. How can we compute the dissimilarity between objects of mixed attribute types?

There are two main approaches for computing the dissimilarity between objects of mixed attribute types. The first is to group the attributes by type and perform a separate analysis for each type. This is feasible if the results are consistent, but in real applications the separate analyses are unlikely to generate compatible results, so the approach is rarely viable. The second, preferable approach is to process all attribute types together, performing a single analysis by combining the attributes into a single dissimilarity matrix.

Data mining methodology challenges

Mining various and new kinds of knowledge; mining knowledge in multidimensional space; integrating new methods from multiple disciplines; boosting the power of discovery in a networked environment; handling uncertainty, noise, or incompleteness of data; pattern evaluation and pattern- or constraint-guided mining

What is the importance of data reduction?

It can increase storage efficiency and reduce costs. It also allows analysis to take less time while yielding similar (if not identical) results.

Why is data integration necessary?

It merges data from multiple sources into a coherent store. Drawing on more sources helps counteract bias, and having more data is generally better for analysis.

What do we understand by similarity measure?

It quantifies the similarity between two objects. Usually, large values are for similar objects and zero or negative values are for dissimilar objects.

What do we understand by dissimilarity measure and what is its importance?

It measures the difference between two objects: the greater the difference between them, the higher the value.

Data normalization methods

Min-max normalization Z-score normalization Normalization by decimal scaling

Normalization by decimal scaling

Normalizes by moving the decimal point of the values of attribute A. The number of decimal places moved depends on the maximum absolute value of A: v'i = vi / 10^j, where j is the smallest integer such that max(|v'i|) < 1
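
A minimal Python sketch of normalization by decimal scaling (example values made up): every value is divided by 10^j, where j is the smallest integer that brings the largest absolute normalized value below 1.

```python
import math

def decimal_scale(values):
    max_abs = max(abs(v) for v in values)
    # Smallest j such that max(|v|) / 10^j < 1.
    j = math.floor(math.log10(max_abs)) + 1 if max_abs > 0 else 0
    return [v / (10 ** j) for v in values], j

values = [-986, 217, 450, 917]       # made-up attribute values
scaled, j = decimal_scale(values)
print(j)        # 3, because max |v| = 986 and 986 / 10^3 = 0.986 < 1
print(scaled)   # [-0.986, 0.217, 0.45, 0.917]
```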

Min-max normalization

Performs a linear transformation on the original data. It preserves the relationships among the original data values, but it will encounter an "out-of-bounds" error if a future input case falls outside the original data range of A. v'i = ((vi − minA) / (maxA − minA)) (new_maxA − new_minA) + new_minA
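
A minimal Python sketch of min-max normalization into a new range; the income bounds and the value 73,600 are example numbers.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Linear map of v from [min_a, max_a] onto [new_min, new_max].
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Suppose income ranges from 12,000 to 98,000; map 73,600 into [0.0, 1.0].
print(round(min_max(73600, 12000, 98000), 3))   # ~0.716
```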

Slice

Performs a selection on one dimension of the given cube, resulting in a subcube.

Numeric Attributes

Quantitative; that is, it is a measurable quantity, represented in integer or real values. Can be interval-scaled or ratio-scaled.

outlier analysis

Detects noisy values by identifying outliers, for example through clustering: values that fall outside all clusters may be considered noise and removed.

Numerosity reduction

Replaces the original data volume with alternative, smaller forms of data representation. These techniques may be parametric or nonparametric.

The mean is in general affected by outliers (T/F)

True

The mode is the only measure of central tendency that can be used for nominal attributes. (T/F)

True. For example, for hair color with categories such as black, brown, blond, and red, the mode is simply the most frequently occurring color.

What do we understand by data quality and what is its importance?

Data quality means the data satisfy the requirements of the intended use. It has many factors, including accuracy, completeness, consistency, timeliness, believability, and interpretability. Quality also depends on the intended use of the data: the same data may be unacceptably inconsistent for some users, while for others it may merely be hard to interpret.

Concept hierarchy generation for nominal data

Where attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level.
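
An illustrative Python sketch (hypothetical streets and cities) of climbing a concept hierarchy for a nominal attribute, street < city < country.

```python
# Hypothetical mappings defining the hierarchy street < city < country.
street_to_city = {"5th Avenue": "New York", "Market Street": "San Francisco"}
city_to_country = {"New York": "USA", "San Francisco": "USA"}

def generalize(street):
    city = street_to_city[street]        # street -> city
    country = city_to_country[city]      # city -> country
    return city, country

print(generalize("Market Street"))       # ('San Francisco', 'USA')
```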

Attribute construction (or feature construction)

Where new attributes are constructed and added from the given set of attributes to help the mining process.

Aggregation

Where summary or aggregation operations are applied to the data. This step is typically used in constructing a data cube for data analysis at multiple abstraction levels.

Normalization

Where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.

