2 Basics of Data Mining

What is data?

A collection of data objects and their attributes

Transaction Data

A special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
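
As a rough illustration of the grocery-store example, each transaction can be modeled as a set of items. The item names and the helper `count_containing` below are hypothetical, chosen only for this sketch.

```python
# Hypothetical shopping trips: each transaction is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
]

def count_containing(item, transactions):
    """Count how many transactions include the given item."""
    return sum(1 for t in transactions if item in t)

bread_count = count_containing("bread", transactions)
```

Using sets reflects that a transaction records which items are present, not an ordered list of fixed attributes.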

Record Data Sets

Data matrix, document data, transaction data

Record Data

Data that consists of a collection of records, each of which consists of a fixed set of attributes

Interval

Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
Examples: calendar dates, temperature in Celsius or Fahrenheit
Properties: distinctness, order, and addition
Operations: mean, standard deviation, Pearson's correlation, t and F tests
Transformation: new_value = a * old_value + b, where a and b are constants
Comments: The Fahrenheit and Celsius temperature scales differ in where their zero value is and in the size of a unit (degree).
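
A minimal sketch of the interval transformation, using the Celsius-to-Fahrenheit conversion (a = 1.8, b = 32) and made-up temperature readings. It checks that taking the mean before or after the conversion gives the same answer, which is why the mean is a meaningful operation for interval attributes.

```python
from statistics import mean

def celsius_to_fahrenheit(c):
    """Interval-scale transformation: new_value = a * old_value + b."""
    return 1.8 * c + 32.0

celsius = [0.0, 10.0, 20.0, 30.0]  # hypothetical readings
fahrenheit = [celsius_to_fahrenheit(c) for c in celsius]

# The mean commutes with the transformation.
mean_then_convert = celsius_to_fahrenheit(mean(celsius))
convert_then_mean = mean(fahrenheit)

# Differences are rescaled by a, so a 10-degree Celsius gap is an 18-degree Fahrenheit gap.
gap_fahrenheit = fahrenheit[1] - fahrenheit[0]
```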

Ratio

Description: For ratio attributes, both differences and ratios are meaningful. (*, /)
Examples: temperature in Kelvin, length, time, counts
Properties: distinctness, order, addition, and multiplication
Operations: geometric mean, harmonic mean, percent variation
Transformation: new_value = a * old_value, where a is a constant
Comments: Length can be measured in meters or feet.
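
The meters-or-feet comment can be checked with a short sketch. The lengths are made up, and `FEET_PER_METER` is an approximate conversion factor introduced here for illustration; the point is only that a change of unit (multiplying by a constant) leaves ratios unchanged.

```python
FEET_PER_METER = 3.28084  # approximate conversion factor (assumption for this sketch)

lengths_m = [2.0, 4.0]  # hypothetical lengths in meters
lengths_ft = [FEET_PER_METER * x for x in lengths_m]

# new_value = a * old_value keeps the ratio between any two values intact.
ratio_m = lengths_m[1] / lengths_m[0]
ratio_ft = lengths_ft[1] / lengths_ft[0]
```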

Nominal

Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, !=)
Examples: ID numbers, eye color, zip codes
Properties: distinctness
Operations: mode, entropy, contingency correlation, χ² test
Transformation: any permutation of values
Comments: If all employee ID numbers were reassigned, would it make any difference?
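
A small sketch of why relabeling nominal values makes no difference: the mode maps to the relabeled mode, and the entropy is unchanged. The eye-color sample and the `relabel` mapping are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

# A one-to-one renaming of the values (hypothetical labels).
relabel = {"brown": "A", "blue": "B", "green": "C"}
relabeled = [relabel[v] for v in eye_colors]

mode_original = Counter(eye_colors).most_common(1)[0][0]
mode_relabeled = Counter(relabeled).most_common(1)[0][0]
```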

Ordinal

Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Properties: distinctness and order
Operations: median, percentiles, rank correlation, run tests, sign tests
Transformation: an order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function
Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
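
The good/better/best comment can be sketched as follows, assuming a made-up list of ratings coded as {1, 2, 3} and recoded as {0.5, 1, 10}. Because the recoding is monotonic, the ordering of the objects and the median survive it.

```python
from statistics import median_low

ratings = [1, 3, 2, 3, 1, 2, 2]  # hypothetical: good=1, better=2, best=3
recode = {1: 0.5, 2: 1, 3: 10}   # order-preserving change of values
recoded = [recode[r] for r in ratings]

# Rank the objects by value under each coding (sorted is stable, so ties keep index order).
order_original = sorted(range(len(ratings)), key=lambda i: ratings[i])
order_recoded = sorted(range(len(recoded)), key=lambda i: recoded[i])

median_original = median_low(ratings)
median_recoded = median_low(recoded)
```

`median_low` is used so that the median is always one of the observed values, which is the natural choice when only order is meaningful.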

Important characteristics of structured data

Dimensionality, sparsity, and resolution

Document Data

Each document becomes a 'term' vector: each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document.
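
A minimal sketch of building term vectors, assuming whitespace tokenization and two made-up documents. The function name `term_vector` is hypothetical.

```python
from collections import Counter

documents = ["the team won the game", "the coach lost the ball"]  # hypothetical

# One vector component per distinct term, in a fixed (sorted) order.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def term_vector(document, vocabulary):
    """Return the count of each vocabulary term in the document."""
    counts = Counter(document.split())
    return [counts[term] for term in vocabulary]

vectors = [term_vector(doc, vocabulary) for doc in documents]
```

Here the vocabulary is `ball, coach, game, lost, team, the, won`, so both documents become vectors of length 7.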

Spatial-Temporal Data

Example: Average Monthly Temperature of land and ocean

Graph Data

Examples: Generic graph, Chemical data and HTML Links

Sparsity

For some data sets, such as those with asymmetric features, most attributes of an object have values of 0; in many cases, fewer than 1% of the entries are non-zero. In practical terms, sparsity is an advantage because usually only the non-zero values need to be stored and manipulated. This results in significant savings with respect to computation time and storage. Furthermore, some data mining algorithms work well only for sparse data.
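
Storing and manipulating only the non-zero values might look like the sketch below. It uses a plain dictionary from index to value; the helper names `to_sparse` and `sparse_dot` and the sample rows are assumptions for illustration, not a reference to any particular library.

```python
def to_sparse(row):
    """Keep only the non-zero entries as {index: value}."""
    return {i: v for i, v in enumerate(row) if v != 0}

def sparse_dot(a, b):
    """Dot product that touches only indices present in both sparse rows."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter dictionary
    return sum(v * b[i] for i, v in a.items() if i in b)

dense_a = [0, 0, 3, 0, 0, 0, 1, 0]
dense_b = [0, 2, 4, 0, 0, 0, 0, 0]

sparse_a = to_sparse(dense_a)
sparse_b = to_sparse(dense_b)
dot = sparse_dot(sparse_a, sparse_b)
```

The work done by `sparse_dot` depends on the number of non-zero entries rather than on the full row length.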

Genomic sequence data

GGTTCCGCCTTCA

Discrete Attribute

Has only a finite or countably infinite set of values. Examples: zip codes, counts, or the set of words in a collection of documents. Often represented as integer variables. Note: binary attributes are a special case of discrete attributes.

Continuous Attribute

Has real numbers as attribute values. Examples: temperature, height, or weight. Practically, real values can only be measured and represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables.

Data Matrix

If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute. Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.
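
As a small sketch, an m by n data matrix can be held as a list of rows. The values below are made up; each row is one object and each column one numeric attribute.

```python
# Hypothetical data matrix: 4 objects (rows), 3 numeric attributes (columns).
data_matrix = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [5.0, 6.0, 7.0],
    [7.0, 8.0, 9.0],
]

m = len(data_matrix)     # number of objects
n = len(data_matrix[0])  # number of attributes

# zip(*rows) walks the matrix column by column, i.e., attribute by attribute.
column_means = [sum(column) / m for column in zip(*data_matrix)]
```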

Differences Between Ratio and Interval

Is it physically meaningful to say that a temperature of 10° is twice that of 5° on the Celsius scale? On the Fahrenheit scale? On the Kelvin scale?
Consider measuring height above average. If Bill's height is three inches above average and Bob's height is six inches above average, would we say that Bob is twice as tall as Bill? Is this situation analogous to that of temperature?
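
One way to explore the temperature question is to compute the ratio of the same two temperatures on each scale. This is a sketch using the standard conversion formulas; the helper names are chosen for this example.

```python
def celsius_to_fahrenheit(c):
    return 1.8 * c + 32.0

def celsius_to_kelvin(c):
    return c + 273.15

c_low, c_high = 5.0, 10.0

ratio_celsius = c_high / c_low
ratio_fahrenheit = celsius_to_fahrenheit(c_high) / celsius_to_fahrenheit(c_low)
ratio_kelvin = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)
```

The three ratios disagree (2 on Celsius, 50/41 on Fahrenheit, roughly 1.018 on Kelvin). Only the Kelvin scale has a true zero, so only its ratio has a physical interpretation; the Celsius and Fahrenheit ratios depend on an arbitrary choice of zero point.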

Types of attributes

Nominal, Ordinal, Interval, Ratio

Asymmetric Attributes

Only presence (a non-zero attribute value) is regarded as important. Examples: words present in documents, items present in customer transactions.

Types of data sets

Record, graph, ordered

Distinction between attributes and attribute values

The same attribute can be mapped to different attribute values. Example: height can be measured in feet or meters.
Different attributes can be mapped to the same set of values. Example: attribute values for ID and age are both integers, but the properties of the attribute values can differ: ID has no limit, whereas age has a maximum and minimum value.

Ordered Data Sets

Spatial data, temporal data, sequential data, genetic sequence data

Dimensionality

The dimensionality of a data set is the number of attributes that the objects in the data set possess. Data with a small number of dimensions tends to be qualitatively different from moderate or high-dimensional data. Indeed, the difficulties associated with analyzing high-dimensional data are sometimes referred to as the curse of dimensionality. Because of this, an important motivation in preprocessing the data is dimensionality reduction.

Graph Data Sets

World Wide Web, molecular structures

Object

Described by a collection of attributes. Also known as a record, point, case, sample, entity, or instance.

Resolution

It is frequently possible to obtain data at different levels of resolution, and often the properties of the data are different at different resolutions. For instance, the surface of the Earth seems very uneven at a resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers. The patterns in the data also depend on the level of resolution. If the resolution is too fine, a pattern may not be visible or may be buried in noise; if the resolution is too coarse, the pattern may disappear. For example, variations in atmospheric pressure on a scale of hours reflect the movement of storms and other weather systems. On a scale of months, such phenomena are not detectable.
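
The effect of a coarser scale can be sketched by averaging consecutive readings. The pressure values (in hPa) are invented so that they oscillate at the fine scale, and `coarsen` is a hypothetical helper.

```python
def coarsen(values, window):
    """Average consecutive, non-overlapping windows of the given size."""
    return [
        sum(values[i:i + window]) / window
        for i in range(0, len(values) - window + 1, window)
    ]

# Hypothetical fine-scale pressure readings that swing up and down.
fine = [1010, 1002, 1010, 1002, 1010, 1002, 1010, 1002]
coarse = coarsen(fine, 4)

fine_range = max(fine) - min(fine)        # variation visible at the fine scale
coarse_range = max(coarse) - min(coarse)  # variation left after averaging
```

In this toy series the swing of 8 hPa seen at the fine scale vanishes entirely once the readings are averaged in blocks of four.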

Attribute value

numbers or symbols assigned to an attribute

Attribute

Property or characteristic of an object. Also known as a variable, field, characteristic, or feature.

The type of an attribute depends on

which of the following properties it possesses:
Distinctness: =, !=
Order: <, >
Addition: +, -
Multiplication: *, /
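
The relationship between these four properties and the four attribute types (nominal, ordinal, interval, ratio) can be written down as a lookup. This is only a sketch; the names `PROPERTIES` and `attribute_type` are hypothetical.

```python
# Each attribute type possesses all properties of the types before it.
PROPERTIES = {
    "nominal": {"distinctness"},
    "ordinal": {"distinctness", "order"},
    "interval": {"distinctness", "order", "addition"},
    "ratio": {"distinctness", "order", "addition", "multiplication"},
}

def attribute_type(properties):
    """Return the attribute type whose property set matches exactly."""
    wanted = set(properties)
    for name, props in PROPERTIES.items():
        if props == wanted:
            return name
    raise ValueError(f"no attribute type has exactly these properties: {sorted(wanted)}")

celsius_type = attribute_type(["distinctness", "order", "addition"])
```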

