2 Basics of Data Mining
What is data?
A collection of data objects and their attributes
Transaction Data
A special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
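A minimal sketch of the grocery example, assuming each transaction is stored as a Python set of item names. The item names and the helper `item_counts` are hypothetical and only illustrate the record-of-items structure.

```python
# Hypothetical grocery transactions: each record is the set of items
# bought during one shopping trip.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
]


def item_counts(records):
    """Count how many transactions contain each item."""
    counts = {}
    for items in records:
        for item in items:
            counts[item] = counts.get(item, 0) + 1
    return counts


counts = item_counts(transactions)
```

Using sets reflects that a transaction records which items are present, not their order.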
Record Data Sets
Data Matrix, Document Data, Transaction Data
Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes
Interval
Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
Examples: calendar dates, temperatures in Celsius or Fahrenheit
Attribute: distinctness, order, & addition
Operations: mean, standard deviation, Pearson's correlation, t and F tests
Transformation: new_value = a * old_value + b, where a and b are constants
Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
Ratio
Description: For ratio variables, both differences and ratios are meaningful. (*, /)
Examples: temperature in Kelvin, length, time, counts
Attribute: distinctness, order, addition, & multiplication
Operations: geometric mean, harmonic mean, percent variation
Transformation: new_value = a * old_value
Comments: Length can be measured in meters or feet.
Nominal
Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, !=)
Examples: ID numbers, eye color, zip codes
Attribute: distinctness
Operations: mode, entropy, contingency correlation, χ² test
Transformation: Any permutation of values
Comments: If all employee ID numbers were reassigned, would it make any difference?
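To illustrate why any permutation of values is allowed for a nominal attribute, the sketch below relabels a small, made-up eye-color sample and checks that the mode and the entropy are unaffected. The data and the `relabel` mapping are assumptions for illustration only.

```python
import math
from collections import Counter


def mode(values):
    """Most frequent value (assumes there is a unique most frequent value)."""
    return Counter(values).most_common(1)[0][0]


def entropy(values):
    """Shannon entropy in bits of the empirical value distribution."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


# Hypothetical sample of a nominal attribute.
eye_color = ["brown", "blue", "brown", "green", "brown", "blue"]

# A one-to-one relabeling of the names (a permutation of values).
relabel = {"brown": "A", "blue": "B", "green": "C"}
relabeled = [relabel[v] for v in eye_color]

same_mode = relabel[mode(eye_color)] == mode(relabeled)
same_entropy = math.isclose(entropy(eye_color), entropy(relabeled))
```

Both statistics depend only on how often each distinct value occurs, which is all a nominal attribute provides.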
Ordinal
Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Attribute: distinctness & order
Operations: median, percentiles, rank correlation, run tests, sign tests
Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
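The good, better, best comment can be checked with a small sketch: recoding {1, 2, 3} as {0.5, 1, 10} keeps the ordering of objects and maps the median to the median, while the mean is not preserved. The ratings list is made up for illustration.

```python
import statistics

# Hypothetical ratings: good = 1, better = 2, best = 3.
ratings = [1, 3, 2, 2, 3, 1, 3]

# Order-preserving (monotonic) recoding of the same three levels.
recode = {1: 0.5, 2: 1, 3: 10}
recoded = [recode[r] for r in ratings]

# The median commutes with the recoding (odd sample size, so the
# median is one of the observed values).
median_preserved = recode[statistics.median(ratings)] == statistics.median(recoded)

# The ordering of the objects is identical under both codings.
order_before = sorted(range(len(ratings)), key=ratings.__getitem__)
order_after = sorted(range(len(recoded)), key=recoded.__getitem__)
order_preserved = order_before == order_after

# The mean is not meaningful for ordinal data: recoding the mean of the
# ratings does not give the mean of the recoded values.
mean_before = statistics.mean(ratings)
mean_after = statistics.mean(recoded)
```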
Important characteristics of structured data
Dimensionality, sparsity, and resolution
Document Data
Each document becomes a 'term' vector: each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document.
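As a rough sketch of this representation, the code below builds term vectors for two made-up documents. Splitting on whitespace and sorting the vocabulary alphabetically are simplifying assumptions, not part of the definition.

```python
from collections import Counter

# Hypothetical documents.
docs = ["team play team win", "coach play game"]

# The vocabulary fixes one vector component per distinct term.
vocab = sorted({word for doc in docs for word in doc.split()})


def term_vector(doc, vocab):
    """Return the count of each vocabulary term in the document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]


matrix = [term_vector(doc, vocab) for doc in docs]
```

Each row of `matrix` is one document and each column is one term, so the result is a document-term matrix.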
Spatial-Temporal Data
Example: Average Monthly Temperature of land and ocean
Graph Data
Examples: Generic graph, Chemical data and HTML Links
Sparsity
For some data sets, such as those with asymmetric features, most attributes of an object have values of 0; in many cases, fewer than 1% of the entries are non-zero. In practical terms, _______________ is an advantage because usually only the non-zero values need to be stored and manipulated. This results in significant savings with respect to computation time and storage. Furthermore, some data mining algorithms work well only for _______________ data.
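The storage and computation savings can be sketched by keeping only the non-zero entries of each row in a dictionary keyed by column index. The helper names `to_sparse` and `sparse_dot` and the example rows are hypothetical.

```python
def to_sparse(row):
    """Keep only the non-zero entries, keyed by column index."""
    return {i: v for i, v in enumerate(row) if v != 0}


def sparse_dot(a, b):
    """Dot product that visits only the stored (non-zero) entries."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the row with fewer stored entries
    return sum(v * b[i] for i, v in a.items() if i in b)


# Two hypothetical rows with 1000 attributes each, almost all zero.
dense_a = [0] * 1000
dense_a[3], dense_a[250], dense_a[999] = 2, 1, 4
dense_b = [0] * 1000
dense_b[3], dense_b[500] = 5, 7

sparse_a, sparse_b = to_sparse(dense_a), to_sparse(dense_b)
result = sparse_dot(sparse_a, sparse_b)
```

Here three and two stored entries replace 1000 values per row, and the dot product touches only those entries.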
Genomic sequence data
GGTTCCGCCTTCA
Discrete Attribute
Has only a finite or countably infinite set of values. Examples: zip codes, counts, or the set of words in a collection of documents. Often represented as integer variables. Note: binary attributes are a special case of these.
Continuous Attribute
Has real numbers as attribute values. Examples: temperature, height, or weight. Practically, real values can only be measured and represented using a finite number of digits. These are typically represented as floating-point variables.
Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute. Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.
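A small sketch of the m by n layout, assuming three made-up objects with two numeric attributes; the attribute names `height_m` and `weight_kg` are hypothetical.

```python
# Hypothetical objects that share the same fixed set of numeric attributes.
objects = [
    {"height_m": 1.70, "weight_kg": 65.0},
    {"height_m": 1.82, "weight_kg": 80.5},
    {"height_m": 1.65, "weight_kg": 58.2},
]
attributes = ["height_m", "weight_kg"]

# One row per object, one column per attribute.
matrix = [[obj[name] for name in attributes] for obj in objects]
m, n = len(matrix), len(matrix[0])

# A column collects one attribute across all objects.
heights = [row[attributes.index("height_m")] for row in matrix]
```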
Differences Between Ratio and Interval
Is it physically meaningful to say that a temperature of 10° is twice that of 5° on the Celsius scale? On the Fahrenheit scale? On the Kelvin scale? Consider measuring the height above average. If Bill's height is three inches above average and Bob's height is six inches above average, then would we say that Bob is twice as tall as Bill? Is this situation analogous to that of temperature?
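One way to explore these questions is to compute the same ratio on different scales. In the sketch below, the average height of 69 inches is an assumed value chosen only for illustration; the temperature conversions are the standard ones.

```python
def celsius_to_fahrenheit(c):
    return 9 / 5 * c + 32


def celsius_to_kelvin(c):
    return c + 273.15


# Interval scale: the "ratio" of 10 degrees to 5 degrees depends on where zero is.
ratio_celsius = 10 / 5
ratio_fahrenheit = celsius_to_fahrenheit(10) / celsius_to_fahrenheit(5)
ratio_kelvin = celsius_to_kelvin(10) / celsius_to_kelvin(5)

# Ratio scale: changing the unit (new_value = a * old_value) keeps ratios intact.
FEET_PER_METER = 3.28084
ratio_meters = 6.0 / 3.0
ratio_feet = (6.0 * FEET_PER_METER) / (3.0 * FEET_PER_METER)

# Height above average behaves like an interval attribute.
AVERAGE_HEIGHT_IN = 69  # hypothetical average, in inches
bill = AVERAGE_HEIGHT_IN + 3
bob = AVERAGE_HEIGHT_IN + 6
ratio_heights = bob / bill
```

The Celsius ratio is 2, but the same two temperatures give a ratio of about 1.22 in Fahrenheit and about 1.02 in Kelvin, whereas the length ratio stays at 2 in both meters and feet. Bob's actual height is only slightly greater than Bill's, even though his height above average is twice as large.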
Types of attributes
Nominal, Ordinal, Interval, Ratio
Asymmetric Attributes
Only presence (a non-zero attribute value) is regarded as important. Examples: words present in documents, items present in customer transactions.
Types of data sets
Record, Graph, Ordered
Distinction between attributes and attribute values
The same attribute can be mapped to different attribute values. Example: height can be measured in feet or meters. Different attributes can be mapped to the same set of values. Example: attribute values for ID and age are both integers, but the properties of the attribute values can differ: ID has no limit, whereas age has a maximum and a minimum value.
Ordered Data Sets
Spatial Data, Temporal Data, Sequential Data, Genetic Sequence Data
Dimensionality
The _______________ of a data set is the number of attributes that the objects in the data set possess. Data with a small number of _______________ tends to be qualitatively different than moderate or high-_______________ data. Indeed, the difficulties associated with analyzing high-_______________ data are sometimes referred to as the curse of _______________. Because of this, an important motivation in preprocessing the data is _______________ reduction.
Graph Data Sets
World Wide Web, Molecular Structures
Object
Described by a collection of attributes. Also known as a record, point, case, sample, entity, or instance.
Resolution
It is frequently possible to obtain data at different levels of _______________, and often the properties of the data are different at different _______________s. For instance, the surface of the Earth seems very uneven at a _______________ of a few meters, but is relatively smooth at a resolution of tens of kilometers. The patterns in the data also depend on the level of _______________. If the _______________ is too fine, a pattern may not be visible or may be buried in noise; if the _______________ is too coarse, the pattern may disappear. For example, variations in atmospheric pressure on a scale of hours reflect the movement of storms and other weather systems. On a scale of months, such phenomena are not detectable.
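The atmospheric pressure example can be imitated with synthetic data. The sketch below assumes an hourly series with a made-up 48-hour oscillation standing in for passing weather systems, then averages it over 10-day blocks; the numbers are illustrative, not measurements.

```python
import math

HOURS = 720   # 30 days of hourly readings (synthetic)
PERIOD = 48   # hypothetical 48-hour weather cycle

# Synthetic pressure in hPa: a constant level plus an oscillation.
pressure = [1000 + 5 * math.sin(2 * math.pi * t / PERIOD) for t in range(HOURS)]


def coarsen(series, block):
    """Average consecutive blocks of readings to lower the time scale."""
    return [sum(series[i:i + block]) / block for i in range(0, len(series), block)]


def spread(series):
    """Difference between the largest and smallest value."""
    return max(series) - min(series)


fine_spread = spread(pressure)                  # hourly scale
coarse_spread = spread(coarsen(pressure, 240))  # 10-day averages
```

At the hourly scale the oscillation is clearly visible in the spread of values, while the 10-day averages are nearly constant, so the pattern disappears at the coarser scale.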
Attribute value
numbers or symbols assigned to an attribute
Attribute
Property or characteristic of an object. Also known as variable, field, characteristic, or feature.
The type of an attribute depends on
which of the following properties it possesses: distinctness (=, !=), order (<, >), addition (+, -), multiplication (*, /)
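The four attribute types can be read as a cumulative hierarchy of these properties. The lookup below is a minimal sketch of that idea; the `LEVELS` table and the function `attribute_type` are hypothetical names introduced for illustration.

```python
# Each type possesses all properties of the previous type plus one more.
LEVELS = [
    ("nominal", {"distinctness"}),
    ("ordinal", {"distinctness", "order"}),
    ("interval", {"distinctness", "order", "addition"}),
    ("ratio", {"distinctness", "order", "addition", "multiplication"}),
]


def attribute_type(properties):
    """Return the attribute type whose property set matches exactly."""
    properties = set(properties)
    for name, required in LEVELS:
        if properties == required:
            return name
    raise ValueError(f"no attribute type has exactly these properties: {sorted(properties)}")


zip_code_type = attribute_type({"distinctness"})
celsius_type = attribute_type({"distinctness", "order", "addition"})
length_type = attribute_type({"distinctness", "order", "addition", "multiplication"})
```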