Comp 541

• Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)

- Flexibility, e.g., low level: relational, high level: array

Data Integration

• Entity identification problem • Metadata • Redundancy • Correlation analysis (correlation coefficient, chi-square test)

Important Characteristics of Structured Data

Dimensionality • Curse of Dimensionality - Sparsity • Only presence counts - Resolution • Patterns depend on the scale

HOLAP

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP.

OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP

Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP server may allow large volumes of detailed data to be stored in a relational database, while aggregations are kept in a separate MOLAP store. Microsoft SQL Server 2000 supports a hybrid OLAP server.
Specialized SQL servers: To meet the growing demand for OLAP processing in relational databases, some database system vendors implement specialized SQL servers that provide advanced query language and query processing support for SQL queries over star and snowflake schemas.

Detecting Redundancy (1)

If an attribute can be "derived" from another attribute or a set of attributes, it may be redundant • Some redundancies can be detected by correlation analysis - Correlation coefficient for numeric data - Chi-square test for categorical data • These can also be used for data reduction
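A minimal sketch of both redundancy checks, using only the standard library. The function names and the sample numbers are made up for illustration; a correlation coefficient near +1 or -1, or a large chi-square statistic, suggests that one attribute may be derivable from the other.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two numeric attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def chi_square(table):
    """Chi-square statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical numeric attributes: the second is the first plus 10% tax.
price = [10.0, 20.0, 30.0, 40.0]
price_with_tax = [11.0, 22.0, 33.0, 44.0]
r = pearson_r(price, price_with_tax)

# Hypothetical 2x2 table of counts for two categorical attributes.
counts = [[250, 200], [50, 1000]]
chi2 = chi_square(counts)
```

Here `r` is 1 up to rounding, so `price_with_tax` is a candidate for removal, and `chi2` is about 507.9, far above what independent attributes would give for a 2x2 table.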

Apex cuboid

In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

What is OLAM

It is an integration of data mining and data warehousing. Online analytical mining (OLAM) integrates online analytical processing (OLAP) with data mining to mine knowledge in multidimensional databases.

Outliers

Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set

• Specialized SQL servers (e.g., Redbricks)

Specialized support for SQL queries over star/snowflake schemas

Concept Hierarchy Generation for Categorical Data

• Specification of a partial ordering of attributes explicitly at the schema level by users or experts • Specification of a portion of a hierarchy by explicit data grouping • Specification of a set of attributes, but not of their partial ordering

normalization by decimal scaling

v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
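A short sketch of decimal scaling, assuming the usual stopping rule that j is the smallest integer for which every scaled value has absolute value below 1. The function name and sample values are hypothetical.

```python
def decimal_scale(values):
    """Divide every value by 10**j, with j the smallest integer that
    brings the largest absolute value below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values], j

# Largest absolute value is 986, so j = 3 and values are divided by 1000.
scaled, j = decimal_scale([-986, 217, 45])
```

With these inputs `j` is 3 and the scaled values are -0.986, 0.217, and 0.045.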

Data Marts

What is a data mart? A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as Sales, Finance, or Marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.
How is it different from a data warehouse? A data warehouse, in contrast, deals with multiple subject areas and is typically implemented and controlled by a central organizational unit such as the corporate Information Technology (IT) group. Often, it is called a central or enterprise data warehouse. Typically, a data warehouse assembles data from multiple source systems. Nothing in these basic definitions limits the size of a data mart or the complexity of the decision-support data that it contains. Nevertheless, data marts are typically smaller and less complex than data warehouses; hence, they are typically easier to build and maintain.

z score normalization

z = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the attribute
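A minimal sketch of z-score normalization: subtract the mean, then divide by the standard deviation. It assumes the population standard deviation is intended (`statistics.pstdev`); the sample values are made up.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / standard deviation for every x."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]

# Mean is 5 and population standard deviation is 2 for this sample.
z = z_scores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

The first value, 2.0, maps to -1.5 and the last, 9.0, maps to 2.0. Swapping in `statistics.stdev` would give the sample-based variant.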

Extraction, Transformation, and Loading (ETL)

• Data extraction - get data from multiple, heterogeneous, and external sources • Data cleaning - detect errors in the data and rectify them when possible • Data transformation - convert data from legacy or host format to warehouse format • Load - sort, summarize, consolidate, compute views, check integrity, and build indices and partitions • Refresh - propagate the updates from the data sources to the warehouse

Concept Hierarchy Generation for Nominal Data

• Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts - street < city < state < country • Specification of a hierarchy for a set of values by explicit data grouping - {Urbana, Champaign, Chicago} < Illinois • Specification of only a partial set of attributes - E.g., only street < city, not others • Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values - E.g., for a set of attributes: {street, city, state, country}

Cuboid

• A data cube is a lattice of cuboids - A data cube is a metaphor for multidimensional data storage - The actual physical storage of such data may differ from its logical representation - The data cube can be viewed with different levels of cuboids • The base cuboid - The lowest level of summarization • The apex cuboid - The highest level of summarization
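To make the lattice concrete, the sketch below enumerates every cuboid of a cube as a subset of its dimensions, from the base cuboid (all dimensions) to the apex cuboid (none). The dimension names are hypothetical, and concept hierarchies within a dimension are ignored, so n dimensions give 2^n cuboids.

```python
from itertools import combinations

def cuboids(dimensions):
    """List all group-by subsets, most detailed first."""
    lattice = []
    for k in range(len(dimensions), -1, -1):
        lattice.extend(combinations(dimensions, k))
    return lattice

dims = ("time", "item", "location")
lattice = cuboids(dims)
base_cuboid = lattice[0]   # ("time", "item", "location"), the 3-D base cuboid
apex_cuboid = lattice[-1]  # (), the 0-D apex cuboid
```

For three dimensions the lattice has 8 cuboids: one 3-D, three 2-D, three 1-D, and one 0-D.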

Conceptual Modeling of Data Warehouses

• Modeling data warehouses: dimensions & measures - Star schema: A fact table in the middle connected to a set of dimension tables - Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake - Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
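A small star schema can be sketched with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_item`, `dim_time`) and the rows are assumptions for illustration only: one fact table holds the measure plus foreign keys, and each dimension table describes one dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day TEXT, quarter TEXT);
-- Fact table: one measure plus a key into each dimension table.
CREATE TABLE fact_sales (
    item_key INTEGER REFERENCES dim_item(item_key),
    time_key INTEGER REFERENCES dim_time(time_key),
    dollars_sold REAL
);
INSERT INTO dim_item VALUES (1, 'laptop', 'A'), (2, 'phone', 'B');
INSERT INTO dim_time VALUES (1, '2024-01-15', 'Q1'), (2, '2024-04-10', 'Q2');
INSERT INTO fact_sales VALUES (1, 1, 1000.0), (2, 1, 500.0), (1, 2, 1200.0);
""")

# Aggregate the measure along the time dimension (day -> quarter).
rows = conn.execute("""
SELECT t.quarter, SUM(f.dollars_sold)
FROM fact_sales AS f
JOIN dim_time AS t ON f.time_key = t.time_key
GROUP BY t.quarter
ORDER BY t.quarter
""").fetchall()
```

A snowflake schema would further split a dimension, for example moving `brand` out of `dim_item` into its own table, and a fact constellation would add a second fact table sharing `dim_time`.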

OLAP Server Architectures

• Relational OLAP (ROLAP) - Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware - Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services - Greater scalability

From On-Line Analytical Processing (OLAP) to On Line Analytical Mining (OLAM)

• Why online analytical mining? - High quality of data in data warehouses • DW contains integrated, consistent, cleaned data - Available information processing structure surrounding data warehouses • ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools - OLAP-based exploratory data analysis • Mining with drilling, dicing, pivoting, etc. - On-line selection of data mining functions • Integration and swapping of multiple mining functions, algorithms, and tasks

Ordinal attribute

An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.

Compare and contrast data mining, artificial intelligence, machine learning, deep learning, and statistics.

Data mining, artificial intelligence, machine learning, deep learning, and statistics are distinct but closely related. Data mining is the practice of examining large databases in order to generate new information. Artificial intelligence is the theory and development of computer systems able to perform tasks that normally require human intelligence. Machine learning is an application of artificial intelligence that gives systems the ability to learn and improve from experience automatically, without being explicitly programmed; it focuses on developing programs that can access data and learn from it themselves. Deep learning is a subfield of machine learning based on learning data representations, as opposed to task-specific algorithms. Statistics is the branch of mathematics that deals with the collection, analysis, interpretation, and presentation of masses of numerical data.
Despite these differences, the fields overlap pairwise. Data mining and artificial intelligence: both work in the same space of extracting value from data, although their goals differ greatly. Data mining and machine learning: data mining can depend on machine learning, since machine learning techniques can be used to build models of what lies behind the data and to predict future outcomes. Data mining and deep learning: deep learning algorithms are one of the many families of algorithms that can be used in data mining. Data mining and statistics: statistics is the core of data mining; data mining covers the entire process of data analysis, and statistics helps identify patterns and separate significant findings from random noise. Both are data-analysis techniques that support better decision making.
Artificial intelligence and machine learning: machine learning is a field of artificial intelligence; artificial intelligence is the broader concept of machines acting intelligently, while machine learning is a current application of it. Artificial intelligence and deep learning: deep learning is a subset of machine learning, which is in turn a subset of artificial intelligence. Artificial intelligence and statistics: work in artificial intelligence requires statistical analysis, because many of its methods come from the statistics domain, which is of particular interest to artificial intelligence researchers; successful work needs a statistical approach to extracting data and evaluating system quality. Machine learning and deep learning: deep learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain. Machine learning and statistics: although machine learning speaks of predictions, supervised learning, and so on, while statistics speaks of samples, populations, and so on, both pursue the same question: how do we learn from data? Both focus on drawing knowledge or insights from data, though their methods differ in a variety of ways, and both are used in pattern recognition, knowledge discovery, and data mining.
Deep learning and statistics: probability theory and statistics are the best foundation for work in deep learning, which requires techniques for analyzing high-dimensional parameter and input spaces. Mathematical understanding is fundamental here, as many deep learning concepts are derived from statistical concepts.

ROLAP

• Advantages - Can handle large amounts of data - Can leverage functionalities inherent in the relational database • Disadvantages - Performance can be slow - Limited by SQL functionalities

MOLAP

• Advantages - Excellent performance - Can perform complex calculations • Disadvantages - Limited in the amount of data it can handle - Requires additional investment

Major tasks in Data Preprocessing

• Data cleaning - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data integration - Integration of multiple databases, data cubes, or files • Data reduction - Dimensionality reduction - Numerosity reduction - Data compression • Data transformation and data discretization - Normalization - Concept hierarchy generation

Qualitative Flavors:

Binomial Data, Nominal Data, and Ordinal Data

What is Data Mining?

Given lots of data, discover patterns and models that are: - Valid: hold on new data with some certainty - Useful: should be possible to act on the item - Unexpected: non-obvious to the system - Understandable: humans should be able to interpret the pattern

Data cleaning

• Missing values - Use the most probable value to fill in the missing value (and five other methods) • Noisy data - Binning; regression; clustering
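As one illustration of handling noisy data, the sketch below smooths values by bin means: it sorts the data, partitions it into equal-frequency bins, and replaces each value with its bin's mean. The function name, the bin size, and the sample prices are assumptions chosen for the example.

```python
def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning followed by smoothing by bin means."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        # Every value in the bin is replaced by the bin mean.
        smoothed.extend([mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)
```

The three bins (4, 8, 15), (21, 21, 24), and (25, 28, 34) become 9, 22, and 29 respectively. Smoothing by bin medians or bin boundaries follows the same pattern with a different replacement value.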

What is a data warehouse?

It is a repository for long-term storage of data from multiple sources, organized so as to facilitate management decision making. The data are stored under a unified schema and are typically summarized. Data warehouses provide multidimensional data analysis capabilities, collectively referred to as online analytical processing.

What is data

collection of data objects and their attributes

Base cuboid

The bottom-most cuboid

Schema Integration and Object Matching

• custom_id and cust_number - Schema conflict • "H" and "S", and 1 and 2 for pay_type in one database - Value conflict • Solutions - metadata (data about data)

• Multidimensional OLAP (MOLAP)

- Sparse array-based multidimensional storage engine - Fast indexing to pre-computed summarized data

OLAP Operations

OLAP provides a user-friendly environment for interactive data analysis. A number of OLAP data cube operations exist to materialize different views of data, allowing interactive querying and analysis. The most popular end-user operations on dimensional data are:
Roll-up: The roll-up operation (also called drill-up or aggregation) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Example: consider a cube showing the temperature on certain days, recorded weekly (Fig 7: Example). Suppose we want to set up the levels hot (80-85), mild (70-75), and cool (64-69) for temperature. To do this we group columns and add up the values according to the concept hierarchy. This operation is called roll-up, and it produces a cube grouped by temperature level (Fig 8: Rollup). The concept hierarchy can be defined as hot --> day --> week.
Roll-down: The roll-down operation (also called drill-down) is the reverse of roll-up. It navigates from less detailed data to more detailed data, either by stepping down a concept hierarchy for a dimension or by introducing additional dimensions. Performing roll-down on the same cube (Fig 9: Rolldown) steps down a concept hierarchy that can be defined as day <-- week <-- cool. Drill-down occurs by descending the time hierarchy from the level of week to the more detailed level of day. New dimensions can also be added to the cube, because drill-down adds more detail to the given data.
Slice: Slice performs a selection on one dimension of the given cube, resulting in a subcube. For example, the selection temperature = cool on the cube above gives the subcube in Fig 10: Slicing.
Dice: The dice operation defines a subcube by performing a selection on two or more dimensions. For example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool OR temperature = hot) to the original cube gives a subcube that is still two-dimensional (Fig 11: Dice).
Pivot: Pivot, otherwise known as rotate, changes the dimensional orientation of the cube, i.e. it rotates the data axes to view the data from different perspectives (Fig 12: Pivot, a 2-D representation).
Other OLAP operations:
- Scoping: Restricting the view of database objects to a specified subset. Scoping lets users receive and update only the data values they wish to receive and update.
- Screening: Performed against the data or members of a dimension in order to restrict the set of data retrieved.
- Drill across: Accesses more than one fact table linked by common dimensions; combines cubes that share one or more dimensions.
- Drill through: Drills from the bottom level of a data cube down to its back-end relational tables.
In summary: Concept hierarchies organize the values of attributes or dimensions into abstraction levels. They are useful in mining at multiple abstraction levels. Typical OLAP operations include roll-up, drill-(down, across, through), slice-and-dice, and pivot (rotate), as well as some statistical operations. OLAP operations can be implemented efficiently using the data cube structure.
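The operations above can be sketched on a tiny in-memory cube. This is an illustrative model only: the cube is a plain dictionary keyed by (city, quarter, item), and the data, the city-to-country mapping, and the helper names are all hypothetical rather than taken from any OLAP product.

```python
from collections import defaultdict

# Hypothetical base cuboid: (city, quarter, item) -> units sold.
cube = {
    ("Chicago", "Q1", "phone"): 10,
    ("Chicago", "Q2", "phone"): 12,
    ("Urbana", "Q1", "phone"): 3,
    ("Urbana", "Q1", "laptop"): 4,
    ("Toronto", "Q1", "laptop"): 7,
    ("Toronto", "Q2", "phone"): 5,
}
city_to_country = {"Chicago": "USA", "Urbana": "USA", "Toronto": "Canada"}

def roll_up_location(cells, mapping):
    """Roll-up by climbing the location hierarchy: city -> country."""
    out = defaultdict(int)
    for (city, quarter, item), units in cells.items():
        out[(mapping[city], quarter, item)] += units
    return dict(out)

def drop_dimension(cells, axis):
    """Roll-up by dimension reduction: aggregate one axis away."""
    out = defaultdict(int)
    for key, units in cells.items():
        out[key[:axis] + key[axis + 1:]] += units
    return dict(out)

def slice_cube(cells, axis, value):
    """Slice: select a single value on one dimension."""
    return {k: v for k, v in cells.items() if k[axis] == value}

def dice_cube(cells, allowed):
    """Dice: select sets of values on two or more dimensions."""
    return {k: v for k, v in cells.items()
            if all(k[axis] in values for axis, values in allowed.items())}

by_country = roll_up_location(cube, city_to_country)
no_item = drop_dimension(cube, 2)
q1 = slice_cube(cube, 1, "Q1")
diced = dice_cube(cube, {0: {"Chicago", "Toronto"}, 1: {"Q1"}})
```

Drill-down is the inverse direction: it needs the more detailed cells (here, the original `cube`) to be available, which is why detailed data is kept alongside aggregates.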

OLAP vs OLTP

The following table summarizes the major differences between OLTP and OLAP system design. OLTP system: Online Transaction Processing (operational system). OLAP system: Online Analytical Processing (data warehouse).
• Source of data - OLTP: Operational data; OLTPs are the original source of the data. OLAP: Consolidation data; OLAP data comes from the various OLTP databases.
• Purpose of data - OLTP: To control and run fundamental business tasks. OLAP: To help with planning, problem solving, and decision support.
• What the data reveals - OLTP: A snapshot of ongoing business processes. OLAP: Multi-dimensional views of various kinds of business activities.
• Inserts and updates - OLTP: Short and fast inserts and updates initiated by end users. OLAP: Periodic long-running batch jobs refresh the data.
• Queries - OLTP: Relatively standardized and simple queries returning relatively few records. OLAP: Often complex queries involving aggregations.
• Processing speed - OLTP: Typically very fast. OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.
• Space requirements - OLTP: Can be relatively small if historical data is archived. OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP.
• Database design - OLTP: Highly normalized with many tables. OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas.
• Backup and recovery - OLTP: Backup religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability. OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.

Data Warehouse Usage

- Information processing • supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs - Analytical processing • multidimensional analysis of data warehouse data • supports basic OLAP operations, slice-dice, drilling, pivoting - Data mining • knowledge discovery from hidden patterns • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools

OLAP Server Architectures

• Relational OLAP (ROLAP) • Multidimensional OLAP (MOLAP) • Hybrid OLAP (HOLAP)

Typical OLAP Operations

• Roll up (drill-up): summarize data - by climbing up hierarchy or by dimension reduction • Drill down (roll down): reverse of roll-up - from higher level summary to lower level summary or detailed data, or introducing new dimensions • Slice and dice: project and select • Pivot (rotate): - reorient the cube, visualization, 3D to series of 2D planes • Other operations - drill across: involving (across) more than one fact table - drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

schemas for multidimensional databases

A multidimensional database (MDB) is a type of database that is optimized for data warehouse and online analytical processing (OLAP) applications. Multidimensional databases are frequently created using input from existing relational databases. Whereas a relational database is typically accessed using a Structured Query Language (SQL) query, a multidimensional database allows a user to ask questions like "How many Aptivas have been sold in Nebraska so far this year?" and similar questions related to summarizing business operations and trends. An OLAP application that accesses data from a multidimensional database is known as a MOLAP (multidimensional OLAP) application. A multidimensional database - or a multidimensional database management system (MDDBMS) - implies the ability to rapidly process the data in the database so that answers can be generated quickly. A number of vendors provide products that use multidimensional databases. Approaches to how data is stored and the user interface vary.

HOLAP

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.

Concept Hierarchy Generation

• Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse • Concept hierarchies facilitate drilling and rolling in data warehouses to view data at multiple granularities • Concept hierarchy formation: Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior) • Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers • Concept hierarchies can be automatically formed for both numeric and nominal data. For numeric data, use discretization methods; a concept hierarchy for a given numerical attribute defines a discretization of the attribute

data reduction

Data reduction techniques obtain a reduced representation of the data while minimizing the loss of information content. These include methods of dimensionality reduction, numerosity reduction, and data compression. Dimensionality reduction reduces the number of random variables or attributes under consideration. Methods include wavelet transforms, principal components analysis, attribute subset selection, and attribute creation. Numerosity reduction methods use parametric or nonparametric models to obtain smaller representations of the original data. Parametric models store only the model parameters instead of the actual data. Examples include regression and log-linear models. Nonparametric methods include histograms, clustering, sampling, and data cube aggregation. Data compression methods apply transformations to obtain a reduced or "compressed" representation of the original data. The data reduction is lossless if the original data can be reconstructed from the compressed data without any loss of information; otherwise, it is lossy.
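As one example of nonparametric numerosity reduction, the sketch below replaces a list of values with an equal-width histogram, storing only bucket ranges and counts. The function name and the sample ages are hypothetical, and the code assumes the values are not all identical (otherwise the bucket width would be zero).

```python
def equal_width_histogram(values, num_buckets):
    """Summarize values as (low, high, count) triples of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # The maximum value falls into the last bucket.
        index = min(int((v - low) / width), num_buckets - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, c)
            for i, c in enumerate(counts)]

ages = [0, 5, 8, 12, 15, 21, 25, 30]
hist = equal_width_histogram(ages, 3)
```

Eight values shrink to three buckets with counts 3, 2, and 3. This reduction is lossy: the individual ages cannot be recovered from the bucket counts.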

Importance of OLAM

OLAM is important for the following reasons:
• High quality of data in data warehouses: Data mining tools need to work on integrated, consistent, and cleaned data. These preprocessing steps are very costly. The data warehouses constructed by such preprocessing are valuable sources of high-quality data for OLAP as well as for data mining.
• Available information processing infrastructure surrounding data warehouses: Information processing infrastructure refers to accessing, integration, consolidation, and transformation of multiple heterogeneous databases, web-accessing and service facilities, and reporting and OLAP analysis tools.
• OLAP-based exploratory data analysis: Exploratory data analysis is required for effective data mining. OLAM provides facilities for data mining on various subsets of data and at different levels of abstraction.
• Online selection of data mining functions: Integrating OLAP with multiple data mining functions gives users the flexibility to select desired data mining functions and swap data mining tasks dynamically.

MOLAP

This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages:
• Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.
• Can perform complex calculations: All calculations are pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
• Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. The cube can still be derived from a large amount of data, but in that case only summary-level information is included in the cube itself.
• Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, adopting MOLAP technology is likely to require additional investment in human and capital resources.

Multidimensional Schemas

Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.
MOLAP vs ROLAP
1. MOLAP: Information retrieval is fast. ROLAP: Information retrieval is comparatively slow.
2. MOLAP: Uses sparse arrays to store data sets. ROLAP: Uses relational tables.
3. MOLAP: Best suited for inexperienced users, since it is very easy to use. ROLAP: Best suited for experienced users.
4. MOLAP: Maintains a separate database for data cubes. ROLAP: May not require space other than that available in the data warehouse.
5. MOLAP: DBMS facility is weak. ROLAP: DBMS facility is strong.

OLAP VS OLTP

• OLAP: Involves historical processing of information. OLTP: Involves day-to-day processing.
• OLAP: Used by knowledge workers such as executives, managers, and analysts. OLTP: Used by clerks, DBAs, or database professionals.
• OLAP: Used to analyze the business. OLTP: Used to run the business.
• OLAP: Focuses on information out. OLTP: Focuses on data in.
• OLAP: Based on star schema, snowflake schema, and fact constellation schema. OLTP: Based on the entity-relationship model.
• OLAP: Subject oriented. OLTP: Application oriented.
• OLAP: Contains historical data. OLTP: Contains current data.
• OLAP: Provides summarized and consolidated data. OLTP: Provides primitive and highly detailed data.
• OLAP: Provides a summarized and multidimensional view of data. OLTP: Provides a detailed and flat relational view of data.
• OLAP: The number of users is in the hundreds. OLTP: The number of users is in the thousands.
• OLAP: The number of records accessed is in the millions. OLTP: The number of records accessed is in the tens.
• OLAP: The database size is from 100 GB to 100 TB. OLTP: The database size is from 100 MB to 100 GB.
• OLAP: Highly flexible. OLTP: Provides high performance.

OLTP vs OLAP

OLTP (On-line Transaction Processing) is involved in the operation of a particular system. OLTP is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is on very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database holds detailed, current data, and the schema used to store transactional databases is the entity model (usually 3NF). Queries access individual records, for example updating your email address in a company database.
OLAP (On-line Analytical Processing) deals with historical or archival data. OLAP is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used by data mining techniques. An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas (usually star schema). Queries sometimes need to access large amounts of data, for example to find the company's profit over the last year.

Automatic Concept Hierarchy Generation

Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set - The attribute with the most distinct values is placed at the lowest level of the hierarchy - Exceptions, e.g., weekday, month, quarter, year
Example: street (674,339 distinct values) < city (3,567 distinct values) < province_or_state (365 distinct values) < country (15 distinct values)
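The distinct-value heuristic can be sketched in a few lines: sort the attributes by their number of distinct values, most distinct first, to get the hierarchy from lowest to highest level. The function name is hypothetical; the counts are the ones listed above.

```python
def generate_hierarchy(distinct_counts):
    """Order attributes from lowest level (most distinct values)
    to highest level (fewest distinct values)."""
    return sorted(distinct_counts, key=distinct_counts.get, reverse=True)

counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674339,
}
hierarchy = generate_hierarchy(counts)
```

The result is street, city, province_or_state, country. The heuristic is not foolproof: with 7 weekdays, 12 months, and 4 quarters it would place weekday above month, which is why such exceptions need manual correction.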

Extraction, Transformation, and Transportation

The main difference between independent and dependent data marts is how you populate the data mart; that is, how you get data out of the sources and into the data mart. This step, called the Extraction-Transformation-Transportation (ETT) process, involves moving data from operational systems, filtering it, and loading it into the data mart. With dependent data marts, this process is somewhat simplified because formatted and summarized (clean) data has already been loaded into the central data warehouse. The ETT process for dependent data marts is mostly a process of identifying the right subset of data relevant to the chosen data mart subject and moving a copy of it, perhaps in a summarized form. With independent data marts, however, you must deal with all aspects of the ETT process, much as you do with a central data warehouse. The number of sources is likely to be smaller, and the amount of data associated with the data mart is less than for the warehouse, given the focus on a single subject. The motivations behind the creation of these two types of data marts are also typically different. Dependent data marts are usually built to achieve improved performance and availability, better control, and lower telecommunication costs resulting from local access to data relevant to a specific department. The creation of independent data marts is often driven by the need to have a solution within a shorter time. Hybrid data marts simply combine the issues of independent and dependent data marts.

Data cube

A data cube (or datacube) is a multi-dimensional array of values, commonly used to describe a time series of image data. The data cube is used to represent data along some measure of interest. Even though it is called a "cube", it can be 1-dimensional, 2-dimensional, 3-dimensional, or higher-dimensional. Every dimension represents a new measure, whereas the cells in the cube represent the facts of interest.

Multidimensional OLAP (MOLAP) servers:

support multidimensional data views through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. The advantage of using a data cube is that it allows fast indexing to precomputed summarized data. Notice that with multidimensional data stores, the storage utilization may be low if the data set is sparse. In such cases, sparse matrix compression techniques should be explored (Chapter 5). Many MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets: Denser subcubes are identified and stored as array structures, whereas sparse subcubes employ compression technology for efficient storage utilization.

data cube

• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions - Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) - Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables • In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
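
The lattice of cuboids can be sketched in a few lines of Python. This is only an illustration: the three dimension names are hypothetical, and concept hierarchies within a dimension are ignored, so each cuboid is simply one subset of the dimensions (one possible group-by).

```python
from itertools import combinations

def cuboids(dimensions):
    """List every cuboid of the lattice, ordered from the apex (0-D) to the base (n-D)."""
    lattice = []
    for k in range(len(dimensions) + 1):
        # Each k-element subset of the dimensions is one k-D cuboid.
        for subset in combinations(dimensions, k):
            lattice.append(subset)
    return lattice

dims = ("item", "time", "location")  # hypothetical dimensions
lattice = cuboids(dims)
apex = lattice[0]    # 0-D cuboid: grand total over all dimensions
base = lattice[-1]   # n-D cuboid: no summarization

for cuboid in lattice:
    print(f"{len(cuboid)}-D cuboid: {cuboid}")
```

With n dimensions and no hierarchies, the lattice has 2^n cuboids, which is why only some of them are usually materialized.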

Three Data Warehouse Models

• Enterprise warehouse - collects all of the information about subjects spanning the entire organization • Data mart - a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart • Independent vs. dependent (sourced directly from the warehouse) data mart • Virtual warehouse - A set of views over operational databases - Only some of the possible summary views may be materialized

What Is Data Mining?

Data mining (knowledge discovery from data) - Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data - Data mining: a misnomer? • Alternative names - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. • Watch out: Is everything "data mining"? - Simple search and query processing - (Deductive) expert systems

Parametric Data Reduction: Regression and Log-Linear Models

Linear regression -Data modeled to fit a straight line -Often uses the least-square method to fit the line •Multiple regression -Allows a response variable Y to be modeled as a linear function of multidimensional feature vector •Log-linear model -Approximates discrete multidimensional probability distributions
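
As a minimal sketch of the parametric idea, the least-squares line below reduces a set of (x, y) points to just two stored parameters. The function name `fit_line` and the sample points are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]  # illustrative points lying exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

Once the two parameters are stored, the original points can be discarded (apart from possible outliers) and approximate values regenerated from the model.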

Data Mining Tasks

• Descriptive methods - Find human-interpretable patterns that describe the data • Example: Clustering • Predictive methods - Use some variables to predict unknown or future values of other variables • Example: Recommender systems

An attribute

An attribute is a property or characteristic of an object - Examples: eye color of a person, temperature, etc. - Attribute is also known as variable, field, characteristic, or feature • A collection of attributes describe an object - Object is also known as record, point, case, sample, entity, or instance

Duplicate Data

Data set may include data objects that are duplicates, or almost duplicates, of one another - Major issue when merging data from heterogeneous sources • Examples: - Same person with multiple email addresses • Data cleaning - Process of dealing with duplicate data issues

Discuss issues to consider during data integration.

Issues to consider during data integration include isolation, business needs, department needs, technological advancement, data problems, timing, and whether the solution will continue to work. Isolation: applications are built and deployed in isolation, so challenges arise with workflows as well as with compliance, technology upgrades, and additions. Business needs: even an enterprise that uses a small database will probably want to use multiple data products that do not automatically work together. Department needs: applications continually change, requiring the use of new applications. Technological advancement: although products continuously improve, data integration is rarely the top priority. Data problems: there will always be data that is incorrect, missing, incomplete, or in the wrong format, so businesses should first profile the data to assess its quality, both in the data source and in the environment into which it will be integrated. Timing: a data integration system is sometimes unable to handle both real-time data and periodic access. Continuity: new platforms and technologies come and go, and a data management platform that once worked seamlessly can become problematic. Companies should therefore think ahead and, whenever a change is made within their environment, account for its effect on data integration.

4. Discuss why a document-term matrix is an example of a data set that has asymmetric discrete or asymmetric continuous features.

A document-term matrix has asymmetric discrete features because the ij-th entry is the number of times that term j occurs in document i, and most documents contain only a small fraction of all possible terms. Zero entries are therefore not very meaningful, either in describing or in comparing documents. A document-term matrix can also have asymmetric continuous features: this happens if you apply TF-IDF (term frequency-inverse document frequency) normalization to the terms and normalize the documents to have an L2 norm of 1. Either way, the features remain asymmetric.

Data quality problems

- Noise and outliers - missing values - duplicate data

binary attribute

A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present. Binary attributes are referred to as Boolean if the two states correspond to true and false.

Attribute Subset Selection

Another way to reduce dimensionality of data •Redundant attributes -Duplicate much or all of the information contained in one or more other attributes -E.g., purchase price of a product and the amount of sales tax paid •Irrelevant attributes -Contain no information that is useful for the data mining task at hand -E.g., students' ID is often irrelevant to the task of predicting students' GPA

Attribute Subset Selection (1)

Attribute selection can help in the phases of the data mining (knowledge discovery) process - By attribute selection, • we can improve data mining performance (speed of learning, predictive accuracy, or simplicity of rules) • we can visualize the data for the model selected • we reduce dimensionality and remove noise.
Attribute (feature) selection is a search problem - Search directions • (Sequential) forward selection • (Sequential) backward selection (elimination) • Bidirectional selection • Decision tree algorithm (induction) - Search strategies • Exhaustive search • Heuristic search - Selection criteria • Statistical significance • Information gain • etc. - Greedy (heuristic) methods for attribute subset selection
Attribute Creation (Feature Generation): create new attributes (features) that can capture the important information in a data set more effectively than the original ones • Three general methodologies - Attribute extraction • Domain-specific - Mapping data to a new space (see: data reduction) • E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered) - Attribute construction • Combining features (see: discriminative frequent patterns in Chapter 7) • Data discretization

Attribute values

Attribute values are numbers or symbols assigned to an attribute • Distinction between attributes and attribute values - Same attribute can be mapped to different attribute values • Example: height can be measured in feet or meters - Different attributes can be mapped to the same set of values • Example: Attribute values for ID and age are integers • But properties of attribute values can be different - ID has no limit but age has a maximum and minimum value

How to Handle Noisy Data?

Binning (equal frequency, means, boundaries) -first sort data and partition into (equal-frequency) bins -then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. •Regression -smooth by fitting the data into regression functions •Clustering -detect and remove outliers •Combined computer and human inspection -detect suspicious values and check by human (e.g., deal with possible outliers)
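
The binning steps above can be sketched as follows. The helper names and the nine sample values are chosen for illustration only; ties in boundary smoothing are resolved toward the lower boundary, which is an assumption of this sketch.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the data and split it into bins holding (roughly) the same number of values."""
    data = sorted(values)
    size = len(data) // n_bins
    bins = [data[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(data[(n_bins - 1) * size:])  # last bin takes any remainder
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values
bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```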

Correlation Analysis (Numeric Data)

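Correlation coefficient (also called Pearson's product-moment coefficient):
$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum a_i b_i$ is the sum of the AB cross-product. • If $r_{A,B} > 0$, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation. • $r_{A,B} = 0$: no linear correlation (A and B are uncorrelated, which does not by itself imply independence); $r_{A,B} < 0$: negatively correlated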

Covariance

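Covariance (Numeric Data) • Covariance is similar to correlation:
$$Cov_{A,B} = E[(A - \bar{A})(B - \bar{B})] = \frac{1}{n}\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B}), \qquad r_{A,B} = \frac{Cov_{A,B}}{\sigma_A \sigma_B}$$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective mean or expected values of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B. The relation to $r_{A,B}$ holds when the covariance and the standard deviations use the same normalization (n or n - 1). • Positive covariance: If $Cov_{A,B} > 0$, then A and B both tend to be larger than their expected values. • Negative covariance: If $Cov_{A,B} < 0$, then if A is larger than its expected value, B is likely to be smaller than its expected value. • Independence: If A and B are independent, then $Cov_{A,B} = 0$, but the converse is not true: - Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence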
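
A small sketch makes the link between covariance and correlation concrete. It assumes the sample (n - 1) normalization throughout; dividing by n instead rescales the covariance but leaves the correlation unchanged, as long as the standard deviations use the same normalization. The data values are illustrative.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(a, b):
    """Sample covariance (n - 1 in the denominator)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def std(a):
    """Sample standard deviation, same normalization as covariance()."""
    return math.sqrt(covariance(a, a))

def pearson(a, b):
    """Correlation coefficient: covariance scaled by both standard deviations."""
    return covariance(a, b) / (std(a) * std(b))

a = [2, 3, 5, 4, 6]      # illustrative values of attribute A
b = [5, 8, 10, 11, 14]   # illustrative values of attribute B
print(covariance(a, b))  # positive: A and B tend to rise together
print(pearson(a, b))     # close to +1: strong positive correlation

# Covariance 0 without independence: y is fully determined by x.
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]
print(covariance(x, y))  # zero, although y depends on x
```

The second example shows why a covariance of 0 cannot be read as independence without further assumptions.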

Data Reduction 1: Dimension Reduction

Curse of dimensionality -When dimensionality increases, data becomes increasingly sparse -Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful -The possible combinations of subspaces will grow exponentially •Dimensionality reduction -Avoid the curse of dimensionality -Help eliminate irrelevant features and reduce noise -Reduce time and space required in data mining -Allow easier visualization •Dimensionality reduction techniques -Wavelet transforms -Principal Component Analysis -Supervised and nonlinear techniques (e.g., feature selection)

Data cube aggregation

Data Cube Aggregation • The lowest level of a data cube - the aggregated data for an individual entity of interest - e.g., a customer in a phone calling data warehouse. • Multiple levels of aggregation in data cubes - Further reduce the size of data to deal with • Reference appropriate levels - Use the smallest representation which is enough to solve the task

Data Reduction Strategies

Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results • Why data reduction? - A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set. • Data reduction strategies - Dimensionality reduction, e.g., remove unimportant attributes • Wavelet transforms • Principal Components Analysis (PCA) • Feature subset selection, feature creation - Numerosity reduction (some simply call it: Data Reduction) • Regression and Log-Linear Models • Histograms, clustering, sampling • Data cube aggregation - Data compression

data transformation

Data transformation routines convert the data into appropriate forms for mining. For example, in normalization, attribute data are scaled so as to fall within a small range such as 0.0 to 1.0. Other examples are data discretization and concept hierarchy generation. Data discretization transforms numeric data by mapping values to interval or concept labels. Such methods can be used to automatically generate concept hierarchies for the data, which allows for mining at multiple levels of granularity. Discretization techniques include binning, histogram analysis, cluster analysis, decision tree analysis, and correlation analysis. For nominal data, concept hierarchies may be generated based on schema definitions as well as the number of distinct values per attribute.

Why data warehouse?

Data warehousing is a must for an enterprise of any size that wants to make intelligent decisions, and it enables competitive advantage. Data warehousing essentially describes data and its relationships, and it is the foundation for business intelligence (BI). It draws a clear distinction between data and information. Data consists of recorded "facts", for example, sales amounts initiated by a customer. Information involves interpreting facts, identifying the relationships between them, and finding their more abstract meaning. Each characteristic, such as customer, store, or date, could serve as a predicate in queries. Data warehousing emphasizes organizing, standardizing, and formatting facts in such a way that we can derive information from them. BI is then concerned with acting on that information.

PCA

Find a projection that captures the largest amount of variation in data •The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space.

What is the knowledge discovery process?

It is an iterative sequence of the following steps.
1. Developing an understanding of the application domain, the relevant prior knowledge, and the goals of the end user.
2. Creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing: removing noise or outliers, collecting the information necessary to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.
4. Data reduction and projection: finding useful features to represent the data depending on the goal of the task, and using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.
5. Choosing the data mining task: deciding whether the goal of the KDD process is classification, regression, clustering, etc.
6. Choosing the data mining algorithm(s): selecting the method(s) to be used for searching for patterns in the data, deciding which models and parameters may be appropriate, and matching a particular data mining method with the overall criteria of the KDD process.
7. Data mining: searching for patterns of interest in a particular representational form or a set of such representations, such as classification rules or trees, regression, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.

PCA, step by step

Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data - Normalize input data: each attribute falls within the same range - Compute k orthonormal (unit) vectors, i.e., principal components - Each input data vector is a linear combination of the k principal component vectors - The principal components are sorted in order of decreasing "significance" or strength - Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (i.e., using the strongest principal components, it is possible to reconstruct a good approximation of the original data) • Works for numeric data only
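
The steps above can be sketched without any libraries for the case k = 1. This is a simplified illustration under stated assumptions: the data is only mean-centered (no range scaling), the strongest principal component is found by power iteration on the covariance matrix rather than by a full eigendecomposition, and the five 2-D points are made up.

```python
import math

def center(rows):
    """Subtract the per-attribute mean from every data vector."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance_matrix(centered):
    """d x d sample covariance matrix of mean-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Power iteration: converges to the unit eigenvector with the largest eigenvalue."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

points = [[2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1], [6.0, 5.9]]  # illustrative
centered = center(points)
pc1 = first_component(covariance_matrix(centered))

# Project each 2-D vector onto the component: the data is now 1-D.
reduced = [sum(a * b for a, b in zip(row, pc1)) for row in centered]
print(pc1)
print(reduced)
```

Because the points lie close to the line y = x, the first component points roughly along (0.71, 0.71), and a single coordinate per point keeps most of the variance.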

Data quality can be assessed in terms of several issues, including accuracy, completeness, and consistency. For each of the above three issues, discuss how data quality assessment can depend on the intended use of the data, giving examples. Propose two other dimensions of data quality.

Data accuracy refers to the degree to which data properly represents the real-life objects it is intended to model. To be useful for decision making, data must fall within the range of possible results. For example, accuracy is often measured by how well the data agrees with an identified source of correct information. Data completeness refers to whether all the data necessary to meet current and future business information needs is available, without the cost of obtaining it exceeding the benefit of its use. For example, to mail a customer a package, a company needs the person's complete mailing address; once it has that address, it can consider the customer's data complete for this purpose. Data consistency refers to the absence of differences when two or more representations of a thing are compared against a definition. For example, consistency is essential in a hearing test, where data should be collected to check whether hearing is consistent in both ears. Two other dimensions of data quality are timeliness and interpretability. Data timeliness refers to the degree to which data is available within a time frame that allows it to be useful for decision making. For example, timeliness matters for the data recorded between a patient's first sepsis diagnosis and the second. Data interpretability refers to the degree to which data is simple enough to be understood, so that analyzing it yields a real benefit. For example, interpretability matters while the audience is still being determined, since it is always important to know who is going to use the model and for what.

Interval-scaled attributes

Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to providing a ranking of values, such attributes allow us to compare and quantify the difference between values.

Data Reduction 2: Numerosity Reduction

Reduce data volume by choosing alternative, smaller forms of data representation • Parametric methods (e.g., regression) - Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers) - Ex.: Log-linear models: obtain the value at a point in m-D space as the product on appropriate marginal subspaces • Non-parametric methods - Do not assume models - Major families: histograms, clustering, sampling, ...

Handling Redundancy in Data Integration

Redundant data often occur when integrating multiple databases - Object identification: The same attribute or object may have different names in different databases - Derivable data: One attribute may be a "derived" attribute in another table, e.g., annual revenue • Redundant attributes may be detected by correlation analysis and covariance analysis • Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Curse of Dimensionality

Size - The size of a data set yielding the same density of data points in an n-dimensional space increases exponentially with the number of dimensions • Radius - A larger radius is needed to enclose a fraction of the data points in a high-dimensional space • Distance - Almost every point is closer to an edge than to another sample point in a high-dimensional space • Outlier - Almost every point is an outlier in a high-dimensional space
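
Two of these effects can be checked with simple arithmetic. The sketch below assumes data spread uniformly over a unit hypercube; the margin of 0.05 and the fraction of 10% are arbitrary illustrative choices.

```python
def fraction_near_boundary(d, eps=0.05):
    """Share of a unit hypercube's volume lying within eps of some face."""
    # The interior that stays clear of every face is a cube of side 1 - 2*eps.
    return 1.0 - (1.0 - 2.0 * eps) ** d

def edge_length_for_fraction(d, fraction=0.1):
    """Side of the sub-cube needed to enclose the given fraction of uniform data."""
    return fraction ** (1.0 / d)

for d in (1, 2, 10, 50):
    print(d, round(fraction_near_boundary(d), 4), round(edge_length_for_fraction(d), 4))
```

In 1 dimension only 10% of the volume is near the boundary, while in 50 dimensions more than 99% is; likewise, enclosing 10% of the data needs a side of 0.1 in 1 dimension but about 0.79 in 10 dimensions.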

Types of attributes

There are different types of attributes - Nominal • Examples: ID numbers, eye color, zip codes - Ordinal • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} - Interval • Examples: calendar dates, temperatures in Celsius or Fahrenheit. - Ratio • Examples: temperature in Kelvin, length, time, counts

Three Tier Data Warehouse

Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use the back end tools and utilities to feed data into the bottom tier. These back end tools and utilities perform the Extract, Clean, Load, and refresh functions. Middle Tier - In the middle tier, we have the OLAP Server that can be implemented in either of the following ways. By Relational OLAP (ROLAP), which is an extended relational database management system. The ROLAP maps the operations on multidimensional data to standard relational operations. By Multidimensional OLAP (MOLAP) model, which directly implements the multidimensional data and operations. Top-Tier - This tier is the front-end client layer. This layer holds the query tools and reporting tools, analysis tools and data mining tools.

Data Integration

Data integration: -Combines data from multiple sources into a coherent store •Schema integration: e.g., A.cust-id ≡B.cust-# -Integrate metadata from different sources •Entity identification problem: -Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton •Detecting and resolving data value conflicts -For the same real world entity, attribute values from different sources are different -Possible reasons: different representations, different scales, e.g., metric vs. British units

What is a data object?

Data sets are made up of data objects. A data object represents an entity—in a sales database, the objects may be customers, store items, and sales; in a medical database, the objects may be patients; in a university database, the objects may be students, professors, and courses. Data objects are typically described by attributes. Data objects can also be referred to as samples, examples, instances, data points, or objects. If the data objects are stored in a database, they are data tuples. That is, the rows of a database correspond to the data objects, and the columns correspond to the attributes. In this section, we define attributes and look at the various attribute types.

Missing Values

Reasons for missing values - Information is not collected (e.g., people decline to give their age and weight) - Attributes may not be applicable to all cases (e.g., annual income is not applicable to children) • Handling missing values - Eliminate Data Objects - Estimate Missing Values - Ignore the Missing Value During Analysis - Replace with all possible values (weighted by their probabilities)
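
Two of the handling strategies can be sketched as follows. Missing values are represented by `None`, and estimating a missing value by the attribute mean is just one possible estimator; the records and helper names are hypothetical.

```python
def drop_incomplete(rows):
    """Eliminate data objects that have any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows, attribute):
    """Estimate missing values of one attribute by the mean of the known values."""
    known = [r[attribute] for r in rows if r[attribute] is not None]
    mean = sum(known) / len(known)
    filled = []
    for r in rows:
        copy = dict(r)  # leave the original records untouched
        if copy[attribute] is None:
            copy[attribute] = mean
        filled.append(copy)
    return filled

people = [
    {"age": 30, "weight": 70.0},
    {"age": None, "weight": 82.0},  # age was not collected
    {"age": 50, "weight": 64.0},
]
complete = drop_incomplete(people)
filled = impute_mean(people, "age")
print(len(complete))     # 2
print(filled[1]["age"])  # 40.0
```

Dropping objects is simplest but loses data; estimation keeps every object at the cost of introducing values that were never observed.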

Types of data sets

Record - Data Matrix - If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute • Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute - Document Data - Transaction Data • Graph - World Wide Web - Molecular Structures • Ordered - Spatial Data - Temporal Data - Sequential Data - Genetic Sequence Data

Measures of similarity and dissimilarity

are used in data mining applications such as clustering, outlier analysis, and nearest-neighbor classification. Such measures of proximity can be computed for each attribute type studied in this chapter, or for combinations of such attributes. Examples include the Jaccard coefficient for asymmetric binary attributes and Euclidean, Manhattan, Minkowski, and supremum distances for numeric attributes. For applications involving sparse numeric data vectors, such as term-frequency vectors, the cosine measure and the Tanimoto coefficient are often used in the assessment of similarity.
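
Several of the measures named above fit in a few lines each. This is a sketch with illustrative vectors; the function names are not from any particular library.

```python
import math

def jaccard(x, y):
    """Jaccard coefficient for asymmetric binary vectors (0-0 matches are ignored)."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0

def minkowski(x, y, h):
    """Minkowski distance: h = 1 gives Manhattan, h = 2 gives Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def supremum(x, y):
    """Supremum distance: the largest difference over any single attribute."""
    return max(abs(a - b) for a, b in zip(x, y))

def cosine(x, y):
    """Cosine similarity, often used for sparse term-frequency vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

print(jaccard([1, 0, 1, 0, 0], [1, 1, 0, 0, 0]))  # 1 shared presence out of 3
print(minkowski((1, 2), (3, 5), 1))               # Manhattan: 5.0
print(minkowski((1, 2), (3, 5), 2))               # Euclidean: about 3.61
print(supremum((1, 2), (3, 5)))                   # 3
doc1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]             # illustrative term counts
doc2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(cosine(doc1, doc2))                         # about 0.94
```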

Data Integration

combines data from multiple sources to form a coherent data store. The resolution of semantic heterogeneity, metadata, correlation analysis, tuple duplication detection, and data conflict detection contribute to smooth data integration.

Data cleaning

routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Data cleaning is usually performed as an iterative two-step process consisting of discrepancy detection and data transformation.

Data Transformation

A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values • Methods - Smoothing: Remove noise from data - Attribute/feature construction • New attributes constructed from the given ones - Aggregation: Summarization, data cube construction - Normalization: Scaled to fall within a smaller, specified range • min-max normalization • z-score normalization • normalization by decimal scaling - Discretization: Concept hierarchy climbing
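
The three normalization methods listed above can be sketched as follows. The z-score version uses the population standard deviation, which is an assumption of this sketch, and the sample values are illustrative.

```python
import math

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and scale by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer making every |value| < 1."""
    biggest = max(abs(v) for v in values)
    j = 0
    while biggest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [12000, 73600, 98000]       # illustrative values
print(min_max(incomes))               # 73600 maps to about 0.716
print(z_score(incomes))               # values now have mean 0
print(decimal_scaling([-986, 917]))   # [-0.986, 0.917]
```

Min-max keeps the original relationships between values but is sensitive to outliers at the extremes; z-score is the usual choice when the true minimum and maximum are unknown.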

Compare and contrast between data, information and knowledge.

There is a difference among data, information, and knowledge. Data are raw facts gathered about someone or something. These raw facts can be basic and random. Data is usually text and numbers based on records and observations. It may be unorganized, it may or may not be useful, it may not be specific, and it does not depend on information. Information is refined data: facts concerning a certain event or subject. It is based on analysis and is always organized, useful, and specific. It depends on data; without data, information cannot be produced. Knowledge varies with context. It is the awareness or familiarity with a fact or situation gained through experience, and it involves understanding the rules needed to interpret information. Despite these differences, there are similarities. Data and information are similar in that data is a raw form of information, so they are essentially the same thing; the only difference is that data is unorganized and information is organized. Data and knowledge are similar in that both involve know-how, experience, insight, and understanding of facts and figures that relate to something specific. Information and knowledge are similar in that both have data as an essential component. In addition, both knowledge and information can be gained by observation, stored, retrieved, and processed further. Finally, information can unfold into knowledge, and knowledge can be converted into information.

Compare and contrast between database, data warehouse, data mart, data mining and big data.

There is a difference among a database, a data warehouse, a data mart, data mining, and big data. A database is a structured set of data held in a computer and accessible in a variety of ways. A data warehouse is a large store of data accumulated from a wide variety of sources within a company and used to guide management decisions. A data mart is the access layer of the data warehouse environment, used to get data out to the users; it is a subset of the data warehouse oriented to a specific business line or team. Data mining is the practice of examining large databases in order to generate new information. Big data is a term for data sets that are so big or complex that traditional data processing application software is inadequate to deal with them. Some of the challenges of big data are capturing data, data storage, data analysis, sharing, transferring, querying, and updating. Big data can be analyzed for insights that lead to better decisions and tactical business moves. Despite these differences, there are similarities. A database and a data warehouse are both repositories of information that store large amounts of data. A database and a data mart both store data in an organized way so that it is accessible. A database and data mining rely on each other, because data mining requires a large data set that can be accessed in a variety of ways, depending on the visualization one chooses. A database and big data are related in that the vast majority of big data is stored in databases so that it remains accessible despite its complexity.
A data warehouse and a data mart are both concepts that describe sets of tables created for reporting or analysis, separate from the systems that create the data. Data mining and data warehousing are both business intelligence tools used to turn information or data into actionable knowledge in pursuit of a goal. A data warehouse and big data are related in that big data, being large, complex, and drawn from a variety of sources, can be stored in a data warehouse. A data mart and data mining are related in that data mining can rely on the data mart: data mining examines large databases to generate new information, and it can examine the data mart, which holds the data relevant to its subject and supplies it when asked. A data mart and big data are related in that big, complex data sets can live in a data mart oriented to a specific business line or team. Data mining and big data both use large data sets to collect or report data that serves the needs of businesses or other recipients. They can also rely on each other: big data can be the asset, and data mining the handler used to extract beneficial results from it.

2. An educational psychologist wants to use association analysis to analyze test results. The test consists of 100 questions with four possible answers each.

(a) How would you convert this data into a form suitable for association analysis? Answer: Convert the original data into binary form, with one binary attribute for each question-answer pair: Q1=A, Q1=B, ..., Q100=D. An attribute is 1 if the student chose that answer to that question and 0 otherwise. (b) In particular, what type of attributes would you have and how many of them are there? Answer: Asymmetric binary attributes. There are 400 of them (100 questions × 4 possible answers).
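
The conversion in (a) can be sketched as a one-hot encoding of each answer sheet. The attribute naming scheme (`Q1=A`) and the sample answer sheet are assumptions made for illustration.

```python
CHOICES = "ABCD"

def to_binary_attributes(answers):
    """Map one student's answers (one letter per question) to 0/1 attributes."""
    row = {}
    for question, chosen in enumerate(answers, start=1):
        for choice in CHOICES:
            # One asymmetric binary attribute per (question, answer) pair.
            row[f"Q{question}={choice}"] = 1 if chosen == choice else 0
    return row

student = ["A", "B"] + ["D"] * 98  # hypothetical answers to the 100 questions
row = to_binary_attributes(student)
print(len(row))           # 400 attributes
print(sum(row.values()))  # 100 of them are 1, one per question
```

Only the 1s carry information (the answer actually chosen), which is what makes the attributes asymmetric.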

1. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.

(a) Time in terms of AM or PM. Answer: Binary, qualitative, ordinal
(b) Brightness as measured by a light meter. Answer: Continuous, quantitative, ratio
(c) Brightness as measured by people's judgments. Answer: Discrete, qualitative, ordinal
(d) Angles as measured in degrees between 0° and 360°. Answer: Continuous, quantitative, ratio
(e) Bronze, Silver, and Gold medals as awarded at the Olympics. Answer: Discrete, qualitative, ordinal
(f) Height above sea level. Answer: Continuous, quantitative, interval/ratio (depends on whether sea level is regarded as an arbitrary origin)
(g) Number of patients in a hospital. Answer: Discrete, quantitative, ratio
(h) ISBN numbers for books. (Look up the format on the Web.) Answer: Discrete, qualitative, nominal
(i) Ability to pass light in terms of the following values: opaque, translucent, transparent. Answer: Discrete, qualitative, ordinal
(j) Military rank. Answer: Discrete, qualitative, ordinal
(k) Distance from the center of campus. Answer: Continuous, quantitative, interval/ratio
(l) Density of a substance in grams per cubic centimeter. Answer: Continuous, quantitative, ratio
(m) Coat check number. Answer: Discrete, qualitative, nominal

What is data preprocessing?

is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

Ratio-scaled attributes

is a numeric attribute with an inherent zero-point. That is, if a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. In addition, the values are ordered, and we can also compute the difference between values, as well as the mean, median, and mode.

Data quality

is defined in terms of accuracy, completeness, consistency, timeliness, believability, and interpretability.

Numeric attribute

is quantitative (i.e., it is a measurable quantity) represented in integer or real values. Numeric attribute types can be interval-scaled or ratio-scaled. The values of an interval-scaled attribute are measured in fixed and equal units. Ratio-scaled attributes are numeric attributes with an inherent zero-point. Measurements are ratio-scaled in that we can speak of values as being an order of magnitude larger than the unit of measurement.

What is normalization?

is the process of organizing the columns (attributes) and tables (relations) of a relational database to reduce data redundancy and improve data integrity.

2. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various methods for handling this problem.

One method is to fill in the missing values manually, entering a value for every empty data cell. This is not an efficient approach: it is time-consuming and does not work for large databases. Another method is to fill in the missing value with a global constant, replacing every empty data cell with the same placeholder value, chosen so that it does not change the meaning of the data (though an analysis may then treat the constant as a class of its own). Another method is to ignore the tuple, which is usually done when the class label is missing. This method is not effective unless the tuple contains several attributes with missing values. Another method is to use the attribute mean (for numeric values) or the attribute mode (for categorical values) to replace any missing values. Finally, another method is to use the most probable value, inferred from the other attributes, to fill in the missing value.

Noisy Data

• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitation
  - inconsistency in naming convention
• Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

Data Quality: Why Preprocess the Data?

• Measures for data quality: a multidimensional view
  - Accuracy: correct or wrong, accurate or not
  - Completeness: not recorded, unavailable, ...
  - Consistency: some modified but some not, dangling, ...
  - Timeliness: timely update?
  - Believability: how far can the data be trusted to be correct?
  - Interpretability: how easily can the data be understood?

Data Cleaning as a Process

• Data discrepancy detection
  - Use metadata (e.g., domain, range, dependency, distribution)
  - Check field overloading
  - Check uniqueness rule, consecutive rule and null rule
  - Use commercial tools
    • Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
    • Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
  - Data migration tools: allow transformations to be specified
  - ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
• Integration of the two processes
  - Iterative and interactive (e.g., Potter's Wheel)

Data Cleaning

• Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., Occupation=" " (missing data)
  - noisy: containing noise, errors, or outliers
    • e.g., Salary="−10" (an error)
  - inconsistent: containing discrepancies in codes or names, e.g.,
    • Age="42", Birthday="03/07/2010"
    • Was rating "1, 2, 3", now rating "A, B, C"
    • discrepancy between duplicate records
  - intentional (e.g., disguised missing data)
    • Jan. 1 as everyone's birthday?

Incomplete (Missing) Data

• Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
• Missing data may need to be inferred

Data Reduction

Data reduction techniques obtain a reduced representation of the data while minimizing the loss of information content. These include methods of dimensionality reduction, numerosity reduction, and data compression. Dimensionality reduction reduces the number of random variables or attributes under consideration. Methods include wavelet transforms, principal components analysis, attribute subset selection, and attribute creation. Numerosity reduction methods use parametric or nonparametric models to obtain smaller representations of the original data. Parametric models store only the model parameters instead of the actual data. Examples include regression and log-linear models. Nonparametric methods include histograms, clustering, sampling, and data cube aggregation. Data compression methods apply transformations to obtain a reduced or "compressed" representation of the original data. The data reduction is lossless if the original data can be reconstructed from the compressed data without any loss of information; otherwise, it is lossy.

Data Mining: On What Kinds of Data?

• Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structure data, graphs, social networks and multi-linked data
  - Object-relational databases
  - Heterogeneous databases and legacy databases
  - Spatial data and spatiotemporal data
  - Multimedia database
  - Text databases
  - The World-Wide Web

How to Handle Missing Data?

• Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
  - a global constant: e.g., "unknown", a new class?!
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or decision tree
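Two of the automatic fill-in strategies listed above (the attribute mean, and the attribute mean within the same class) can be sketched in base R. The `sales` data frame and its `class` and `income` columns are hypothetical, invented only to make the example runnable.

```r
# Hypothetical sales records; NA marks a missing income (not from the source).
sales <- data.frame(
  class  = c("budget", "budget", "premium", "premium", "premium"),
  income = c(30, NA, 80, 100, NA)
)

# Option 1: fill with the overall attribute mean.
overall_mean   <- mean(sales$income, na.rm = TRUE)
filled_overall <- ifelse(is.na(sales$income), overall_mean, sales$income)

# Option 2: fill with the mean of the samples belonging to the same class.
class_mean      <- ave(sales$income, sales$class,
                       FUN = function(v) mean(v, na.rm = TRUE))
filled_by_class <- ifelse(is.na(sales$income), class_mean, sales$income)

filled_overall
filled_by_class
```

The class-wise version is the "smarter" option in the list because the fill value reflects the group the tuple belongs to rather than the whole data set.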

Data Cube Aggregation

Imagine that you have collected the data for your analysis. These data consist of the AllElectronics sales per quarter, for the years 2008 to 2010. You are, however, interested in the annual sales (total per year), rather than the total per quarter. Thus, the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter. This aggregation is illustrated in Figure 3.10. The resulting data set is smaller in volume, without loss of information necessary for the analysis task. Data cubes are discussed in detail in Chapter 4 on data warehousing and Chapter 5 on data cube technology. We briefly introduce some concepts here. Data cubes store multidimensional aggregated information. For example, Figure 3.11 shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each AllElectronics branch. Each cell holds an aggregate data value, corresponding to the data point in multidimensional space. (For readability, only some cell values are shown.) Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple abstraction levels. For example, a hierarchy for branch could allow branches to be grouped into regions, based on their address. Data cubes provide fast access to precomputed, summarized data, thereby benefiting online analytical processing as well as data mining. The cube created at the lowest abstraction level is referred to as the base cuboid. The base cuboid should correspond to an individual entity of interest such as sales or customer. In other words, the lowest level should be usable, or useful for the analysis. A cube at the highest level of abstraction is the apex cuboid. For the sales data in Figure 3.11, the apex cuboid would give one total—the total sales for all three years, for all item types, and for all branches. 
Data cubes created for varying levels of abstraction are often referred to as cuboids, so that a data cube may instead refer to a lattice of cuboids. Each higher abstraction level further reduces the resulting data size. When replying to data mining requests, the smallest available cuboid relevant to the given task should be used. This issue is also addressed in Chapter 4.
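The quarter-to-year aggregation described above can be sketched with base R's `aggregate`. The sales figures below are hypothetical placeholders; the values from the textbook's Figure 3.10 are not reproduced here.

```r
# Hypothetical quarterly sales for 2008 to 2010 (illustrative values only).
quarterly <- data.frame(
  year    = rep(2008:2010, each = 4),
  quarter = rep(paste0("Q", 1:4), times = 3),
  sales   = c(100, 200, 300, 400,
              150, 250, 350, 450,
              200, 300, 400, 500)
)

# Roll up from quarter level to year level: 12 rows become 3.
annual <- aggregate(sales ~ year, data = quarterly, FUN = sum)

# The apex cuboid collapses every dimension into one grand total.
apex <- sum(quarterly$sales)

annual
apex
```

Each roll-up shrinks the data while keeping what the coarser analysis needs, which is why the smallest cuboid that still answers the query is the one to use.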

Attribute Subset Selection

In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for four reasons:
• simplification of models to make them easier to interpret by researchers/users,[1]
• shorter training times,
• to avoid the curse of dimensionality,
• enhanced generalization by reducing overfitting[2] (formally, reduction of variance[1]).
The central premise when using a feature selection technique is that the data contains many features that are either redundant or irrelevant, and can thus be removed without incurring much loss of information.[2] Redundant and irrelevant features are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated.[3]

nominal attributes

Nominal means "relating to names." The values of a nominal attribute are symbols or names of things. Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as categorical. The values do not have any meaningful order. In computer science, the values are also known as enumerations. (Qualitative)

Basic statistical descriptions

provide the analytical foundation for data preprocessing. The basic statistical measures for data summarization include mean, weighted mean, median, and mode for measuring the central tendency of data; and range, quantiles, quartiles, interquartile range, variance, and standard deviation for measuring the dispersion of data. Graphical representations (e.g., boxplots, quantile plots, quantile-quantile plots, histograms, and scatter plots) facilitate visual inspection of the data and are thus useful for data preprocessing and mining.

Data discretization

transforms numeric data by mapping values to interval or concept labels. Such methods can be used to automatically generate concept hierarchies for the data, which allows for mining at multiple levels of granularity. Discretization techniques include binning, histogram analysis, cluster analysis, decision tree analysis, and correlation analysis. For nominal data, concept hierarchies may be generated based on schema definitions as well as the number of distinct values per attribute.
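Binning, the first technique named above, can be sketched with base R's `cut`. The `ages` vector, the break points, and the labels "youth", "adult", and "senior" are assumptions made for illustration; they are not taken from the source.

```r
# Hypothetical numeric attribute (not from the source).
ages <- c(13, 15, 16, 19, 20, 21, 25, 30, 35, 40, 45, 52, 70)

# Equal-width binning: split the observed range into 3 intervals.
width_bins <- cut(ages, breaks = 3)

# Mapping values to concept labels using hand-picked break points.
concept <- cut(ages,
               breaks = c(0, 19, 59, Inf),
               labels = c("youth", "adult", "senior"))

table(width_bins)
table(concept)
```

The interval labels and the concept labels form two levels of a simple concept hierarchy over the same numeric attribute.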

Heuristic Search in Attribute Selection

• Typical heuristic attribute selection methods:
  - Best single attribute under the attribute independence assumption: choose by significance tests
  - Best step-wise feature selection:
    • The best single attribute is picked first
    • Then the next best attribute conditioned on the first, ...
  - Step-wise attribute elimination:
    • Repeatedly eliminate the worst attribute
  - Best combined attribute selection and elimination
  - Optimal branch and bound:
    • Use attribute elimination and backtracking
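Step-wise forward selection from the list above can be sketched as a greedy loop. This is only one possible reading: the scoring criterion (adjusted R² of a linear model) is an assumption standing in for the significance tests mentioned above, and the helper name `forward_select` is hypothetical. The example uses R's built-in `attitude` data set.

```r
# Greedy forward selection: start empty, repeatedly add the attribute that
# most improves the score, and stop when no candidate improves it.
# Assumption: the score is the adjusted R^2 of lm(); other criteria work too.
forward_select <- function(data, response) {
  remaining  <- setdiff(names(data), response)
  selected   <- character(0)
  best_score <- -Inf
  while (length(remaining) > 0) {
    scores <- sapply(remaining, function(v) {
      f <- reformulate(c(selected, v), response = response)
      summary(lm(f, data = data))$adj.r.squared
    })
    if (max(scores) <= best_score) break   # no candidate helps: stop
    best       <- names(which.max(scores))
    best_score <- max(scores)
    selected   <- c(selected, best)
    remaining  <- setdiff(remaining, best)
  }
  selected
}

chosen <- forward_select(attitude, "rating")
chosen
```

Step-wise elimination is the mirror image: start with all attributes and repeatedly drop the one whose removal hurts the score least.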

PCA in 4 steps

1. Normalize the data. We're dealing with covariance, so it's a good idea to have features on the same scale.
2. Calculate the covariance matrix.
3. Find the eigenvectors of the covariance matrix.
4. Translate the data to be in terms of the components. This involves just a simple matrix multiplication.

I found some data on the psychology within a large financial company. The dataset is called attitude, and is accessible within R.

```r
# attitude ships with R's built-in datasets package, so no extra library is needed
attitude <- attitude  # put the data in our workspace in R
```

Ok, now let's do those 4 steps.

1. Normalize the data. This means standardizing each feature: just like you would calculate a z-score, subtract the mean and divide by the standard deviation, across the entire feature vector.

```r
# scale() centers each column on its mean and divides by its standard deviation
attitude <- as.data.frame(scale(attitude))
summary(attitude)     # means are all 0
sapply(attitude, sd)  # and sd's are all 1
```

2. Get the covariance matrix

```r
# this is actually really simple..thanks R :)
cov(attitude)
# Quiz question - Why is the diagonal all 1's?
# Because we normalized each feature to have a variance of 1!
```

3. The Principal Components. Remember, the principal components are the eigenvectors of the covariance matrix.

```r
x <- eigen(cov(attitude))
x$vectors
# just out of curiosity, I'm going to check that it did this right
# (%*% is matrix multiplication in R)
cov(attitude) %*% x$vectors[, 1]
# gives the same values as...
x$values[1] * x$vectors[, 1]
```

4. Putting the data in terms of the components. We do this by matrix multiplying the transpose of the feature vector (the matrix of chosen eigenvectors, A) and the transpose of the matrix containing the data (B): $\text{newData} = A^{\top} B^{\top}$. Why transpose? Theoretical reasons aside, the dimensions have to line up.

```r
A <- x$vectors[, 1:3]       # keep the first three principal components
B <- data.matrix(attitude)  # because we can't do matrix multiplication with data frames!
# now we arrive at the new data by the above formula!
newData <- t(A) %*% t(B)
```

And then just a hint of data cleaning so we can have a nice data frame to work with and run algorithms on.

```r
# note - in the newData matrix, each row is a feature, and each column is a data point.
# Let's change that
newData <- t(newData)
newData <- data.frame(newData)
names(newData) <- c("feat1", "feat2", "feat3")
```
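As a cross-check on the four steps above, the same projection can be compared against R's built-in `prcomp`. This is a sketch rather than part of the original walkthrough; the variable names `std`, `eig`, `scores_manual`, and `pca` are introduced here for illustration. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```r
# Manual route: standardize, take eigenvectors of the covariance matrix, project.
std <- scale(attitude)   # attitude is in R's built-in datasets package
eig <- eigen(cov(std))
scores_manual <- std %*% eig$vectors[, 1:3]

# Built-in route: prcomp centers and scales, then uses an SVD internally.
pca <- prcomp(attitude, center = TRUE, scale. = TRUE)

# The eigenvalues are the variances of the principal components.
all.equal(eig$values, pca$sdev^2)

# The projected data agree up to the sign of each component.
all.equal(abs(unname(scores_manual)), abs(unname(pca$x[, 1:3])))

# Share of total variance captured by the first three components.
sum(eig$values[1:3]) / sum(eig$values)
```

In practice `prcomp` is usually preferred over the eigen-decomposition route because the SVD is numerically more stable, but working through the manual steps shows what it computes.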

Data cube aggregation

A data cube is a three-dimensional (3D) (or higher) range of values that is generally used to explain the time sequence of an image's data. It is a data abstraction for evaluating aggregated data from a variety of viewpoints. It is also useful for imaging spectroscopy, as a spectrally-resolved image is depicted as a 3-D volume. A data cube can also be described as the multidimensional extension of two-dimensional tables. It can be viewed as a collection of identical 2-D tables stacked upon one another. Data cubes are used to represent data that is too complex to be described by a table of columns and rows. As such, data cubes can go far beyond 3-D to include many more dimensions.

What is the importance of the database?

A database management system is important because it manages data efficiently and allows users to perform multiple tasks with ease. A database management system stores, organizes and manages a large amount of information within a single software application.

What is an attribute

An attribute is a data field, representing a characteristic or feature of a data object. The nouns attribute, dimension, feature, and variable are often used interchangeably in the literature. The term dimension is commonly used in data warehousing. Machine learning literature tends to use the term feature, while statisticians prefer the term variable. Data mining and database professionals commonly use the term attribute, and we do here as well. Attributes describing a customer object can include, for example, customer ID, name, and address. Observed values for a given attribute are known as observations. A set of attributes used to describe a given object is called an attribute vector (or feature vector). The distribution of data involving one attribute (or variable) is called univariate. A bivariate distribution involves two attributes, and so on. The type of an attribute is determined by the set of possible values (nominal, binary, ordinal, or numeric) the attribute can have.

Quantitative flavors

Continuous Data and Discrete Data There are two types of quantitative data, which is also referred to as numeric data: continuous and discrete. As a general rule, counts are discrete and measurements are continuous. Discrete data is a count that can't be made more precise. Typically it involves integers. For instance, the number of children (or adults, or pets) in your family is discrete data, because you are counting whole, indivisible entities: you can't have 2.5 kids, or 1.3 pets. Continuous data, on the other hand, could be divided and reduced to finer and finer levels. For example, you can measure the height of your kids at progressively more precise scales—meters, centimeters, millimeters, and beyond—so height is continuous data.

Why Data Preprocessing is Beneficial to Data Mining?

• Less data - data mining methods can learn faster
• Higher accuracy - data mining methods can generalize better
• Simple results - they are easier to understand
• Fewer attributes - for the next round of data collection, savings can be made by removing redundant and irrelevant features

Discrete and Continuous Attributes

• Discrete Attribute
  - Has only a finite or countably infinite set of values
  - Examples: zip codes, counts, or the set of words in a collection of documents
  - Often represented as integer variables
  - Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
  - Has real numbers as attribute values

Chi Square Correlation Analysis

Do you remember how to test the independence of two categorical variables? This test is performed by using a Chi-square test of independence. Recall that we can summarize two categorical variables within a two-way table, also called an r × c contingency table, where r = number of rows and c = number of columns. Our question of interest is "Are the two variables independent?" This question is set up using the following hypothesis statements:
Null Hypothesis: The two categorical variables are independent.
Alternative Hypothesis: The two categorical variables are dependent.
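The test described above can be sketched in R by computing the statistic by hand and comparing it with the built-in `chisq.test`. The 2 × 2 table of counts (reading preference by gender) is a hypothetical example, not data from this card; `correct = FALSE` turns off the continuity correction so the built-in result matches the textbook formula.

```r
# Hypothetical 2 x 2 contingency table of observed counts (not from the source).
observed <- matrix(c(250, 50, 200, 1000), nrow = 2,
                   dimnames = list(reading = c("fiction", "non_fiction"),
                                   gender  = c("male", "female")))

# Expected count for each cell under independence: row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected.
chi_sq <- sum((observed - expected)^2 / expected)
dof    <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# Same test with the built-in function (no continuity correction).
builtin <- chisq.test(observed, correct = FALSE)

chi_sq
p_value
```

A p-value this small would lead us to reject the null hypothesis of independence for these made-up counts.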

What is data mining?

Knowledge mining from data. It is one of the steps in the knowledge discovery process. It is the computing process of discovering patterns in large data sets, involving methods at the intersection of machine learning, statistics, and database systems.

min-max normalization

Min-max normalization is a normalization strategy which linearly transforms x to y = (x-min)/(max-min), where min and max are the minimum and maximum values in X, and X is the set of observed values of x. It can be easily seen that when x=min, then y=0, and when x=max, then y=1. This means the minimum value in X is mapped to 0 and the maximum value in X is mapped to 1. So, the entire range of values of X from min to max is mapped to the range 0 to 1.

PCA

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (or sometimes, principal modes of variation).

Data Reduction 3: Data Compression

• String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless, but only limited manipulation is possible without expansion
• Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of signal can be reconstructed without reconstructing the whole
• Time sequence is not audio
  - Typically short and varies slowly with time
• Dimensionality and numerosity reduction may also be considered as forms of data compression

Business Analysis Framework

The business analyst gets information from the data warehouse to measure performance and make critical adjustments in order to win over other business holders in the market. Having a data warehouse offers the following advantages:
• Since a data warehouse can gather information quickly and efficiently, it can enhance business productivity.
• A data warehouse provides a consistent view of customers and items; hence, it helps us manage customer relationships.
• A data warehouse also helps bring down costs by tracking trends and patterns over a long period in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to understand and analyze the business needs and construct a business analysis framework. Each person has different views regarding the design of a data warehouse. These views are as follows:
• The top-down view - This view allows the selection of relevant information needed for a data warehouse.
• The data source view - This view presents the information being captured, stored, and managed by the operational system.
• The data warehouse view - This view includes the fact tables and dimension tables. It represents the information stored inside the data warehouse.
• The business query view - It is the view of the data from the viewpoint of the end-user.

Properties of Attribute Values

The type of an attribute depends on which of the following properties it possesses:
- Distinctness: = ≠
- Order: < >
- Addition: + -
- Multiplication: * /

- Nominal attribute: distinctness
- Ordinal attribute: distinctness & order
- Interval attribute: distinctness, order & addition
- Ratio attribute: all 4 properties

Why data mining?

We live in a world where vast amounts of data are collected daily. Analyzing such data is an important need.
• The Explosive Growth of Data: from terabytes to petabytes
  - Data collection and data availability
    • Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    • Business: Web, e-commerce, transactions, stocks, ...
    • Science: remote sensing, bioinformatics, scientific simulation, ...
    • Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• "Necessity is the mother of invention": data mining, the automated analysis of massive data sets

Why Data Preprocessing Is Important?

Welcome to the Real World!
• No quality data, no quality mining results!
• Preprocessing is one of the most critical steps in a data mining process

What is a database?

a structured set of data held in a computer, especially one that is accessible in various ways.

min-max normalization

If you want to normalize your data, you can do as you suggest and simply calculate $z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$, where $x = (x_1, \ldots, x_n)$ and $z_i$ is now your $i$th normalized data point.
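A minimal R sketch of this formula follows. The helper name `min_max` and the sample vector are hypothetical, and the function assumes the values are not all equal (otherwise the denominator is zero).

```r
# Min-max normalization: map the smallest value to 0 and the largest to 1.
# Assumes max(x) > min(x); a constant vector would divide by zero.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

x <- c(12, 20, 35, 50)  # hypothetical observations
z <- min_max(x)

z         # normalized values in [0, 1]
range(z)  # 0 and 1 by construction
```

Because the transformation is linear, the relative spacing between values is preserved; only the scale and offset change.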

Cov(A,B)

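For $n$ paired observations $(x_i, y_i)$ of attributes A and B: $\mathrm{Cov}(A,B) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$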
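The formula can be checked against R's built-in `cov`, which also divides by n - 1. The vectors `a` and `b` below are hypothetical sample values; the last lines show how dividing the covariance by both standard deviations gives the correlation coefficient.

```r
# Hypothetical paired observations of attributes A and B (not from the source).
a <- c(2, 4, 6, 8)
b <- c(1, 3, 2, 5)
n <- length(a)

# Sample covariance written out term by term.
cov_manual <- sum((a - mean(a)) * (b - mean(b))) / (n - 1)

# Correlation coefficient: covariance rescaled by the two standard deviations.
cor_manual <- cov_manual / (sd(a) * sd(b))

cov_manual
cor_manual
all.equal(cov_manual, cov(a, b))
all.equal(cor_manual, cor(a, b))
```

Unlike the covariance, the correlation coefficient always lies between -1 and 1, which makes it easier to compare across attribute pairs.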

Data visualization

techniques may be pixel-oriented, geometric-based, icon-based, or hierarchical. These methods apply to multidimensional relational data. Techniques have been proposed for the visualization of complex data, such as text and social networks.

3. Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why?

Daily temperature is likely to show more temporal autocorrelation than daily rainfall. A feature shows autocorrelation if observations that are closer to each other are more similar with respect to the values of that feature than observations that are farther away. It is more common for close locations to have similar temperatures than similar amounts of rainfall, because the amount of rainfall can differ sharply from one location to another. By the same reasoning, temperature on consecutive days tends to be similar, whereas rainfall can change abruptly from one day to the next.

