Data Mining ch 1

What is an example of data mining?

- Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly... in the Boston area)
- Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)

What can be used as a similarity measure?

- Euclidean distance, if the attributes are continuous.
- Other problem-specific measures.
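
As a minimal sketch of the Euclidean distance mentioned above, assuming each record is a tuple of continuous attributes (the function name and sample points are hypothetical, chosen only for illustration):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical records with two continuous attributes each.
p = (0.0, 0.0)
q = (3.0, 4.0)
print(euclidean_distance(p, q))  # 5.0
```

A smaller distance means the two records are more similar; attributes on very different scales would normally be rescaled first.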

What are description methods?

- Find human-interpretable patterns that describe the data.

What is an example of what is not data mining?

- Looking up a phone number in a phone directory
- Querying a Web search engine for information about "Amazon"

What are prescriptive methods?

- Optimization and simulation of algorithms.

What is the biggest challenge in today's world?

- Too much data
- Too little information

What are predictive methods?

- Use some variables to predict unknown or future values of other variables.

What is the difference between data mining and machine learning in their use of data?

Data mining uses large amounts of data to extract useful information and predict future outcomes, whereas machine learning relies less on the data itself and more on algorithms.

What is the difference between data mining and machine learning in terms of techniques?

Data mining uses a static set of algorithms and techniques, whereas machine learning algorithms and analytics are meant to keep improving for better outcomes.

How may data mining help scientists?

- In classifying and segmenting data
- In hypothesis formation

What is artificial intelligence?

- A broad term referring to computers and systems that are capable of essentially coming up with solutions to problems on their own.
- The solutions are not hardcoded into the program.
- Instead, the information needed to get to the solution is coded.
- AI uses the data and calculations to come up with a solution on its own.
- Data mining is an integral part of coding programs with the information, statistics, and data necessary for AI to create a solution.

Traditional techniques may be unsuitable due to

- Enormity of data
- High dimensionality of data
- Heterogeneous, distributed nature of data

What is the evaluation phase?

- Evaluate one or more models for effectiveness
- Determine whether the defined objectives were achieved
- Establish whether some important facet of the problem has not been sufficiently accounted for
- Make a decision regarding the data mining results before deploying to the field

What is the deployment phase?

- Make use of the models created
- Simple deployment example: generate a report
- Complex deployment example: implement a parallel data mining effort in another department
- In businesses, the customer often carries out deployment based on your model

What is the modeling phase?

- Select and apply one or more modeling techniques
- Calibrate model settings to optimize results
- If necessary, additional data preparation may be required to support a particular technique

What is machine learning?

- The study of algorithms that can extract information automatically.
- A way to discover new algorithms from experience.
- Involves algorithms that improve automatically through experience based on data.
- Machine learning uses data mining techniques and other algorithms to build models of what is happening behind the data in order to predict future outcomes.

What is regression?

Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency. Extensively studied in statistics and neural network fields.

What is the goal of fraud detection?

Predict fraudulent cases in credit card transactions.

What are examples of regression?

- Predicting sales amounts of a new product based on advertising expenditure.
- Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
- Time series prediction of stock market indices.
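
Predicting sales from advertising expenditure can be sketched with an ordinary least-squares line fit, which is one simple case of the linear model of dependency. The function name and the numbers below are made up for illustration and do not come from any real data set:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales amount.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 5.0 + intercept  # predicted sales for a spend of 5.0
print(slope, intercept, predicted)
```

The fitted line is then used to predict the continuous target (sales) for an unseen value of the predictor (advertising expenditure).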

What is the approach of Sky Survey Cataloging?

- Segment the image.
- Measure image attributes (features): 40 of them per object.
- Model the class based on these features.
- Success story: could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!

A test set is used to

determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
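
A minimal sketch of this training/test division, assuming labeled records of the form `(value, class)`. The helper names, the 30% test fraction, and the toy threshold "model" are all assumptions made for illustration:

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and divide them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train):
    """Toy model: the midpoint between the two class means."""
    class0 = [x for x, label in train if label == 0]
    class1 = [x for x, label in train if label == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def accuracy(threshold, test):
    """Fraction of test records whose predicted class matches the label."""
    correct = sum(1 for x, label in test if (1 if x > threshold else 0) == label)
    return correct / len(test)

# Hypothetical, well-separated data: class 0 near 1-5, class 1 near 101-105.
records = [(x, 0) for x in range(1, 6)] + [(x, 1) for x in range(101, 106)]

train, test = train_test_split(records)
threshold = fit_threshold(train)   # built from the training set only
test_accuracy = accuracy(threshold, test)  # validated on the held-out test set
print(len(train), len(test), test_accuracy)
```

The model never sees the test records while it is being built, so the accuracy on them estimates how it will behave on previously unseen data.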

What does data mining refer to?

Extracting knowledge from large amounts of data by uncovering previously unknown trends and patterns.

What is the goal of classification?

previously unseen records should be assigned a class as accurately as possible.

What is the goal of market segmentation?

subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

What is the data understanding phase?

- Collect data
- Perform exploratory data analysis (EDA)
- Assess data quality
- Optionally, select interesting subsets

What is the Business/Research Understanding Phase?

- Define project requirements and objectives
- Translate objectives into a data mining problem definition
- Prepare a preliminary strategy to meet the objectives

What is the data preparation phase?

- Prepares for modeling in subsequent phases
- Select cases and variables appropriate for analysis
- Cleanse and prepare data so it is ready for modeling tools
- Perform transformation of certain variables, if needed

What is step 1 of the Cross-Industry Standard Process for Data Mining (CRISP-DM)?

Business/Research Understanding phase

What is data?

- Facts and numbers that can be recorded
- Have implicit meaning concerning an event or an object

What is information?

- Knowledge derived from data
- Data presented in a meaningful context
- Data processed by summing, ordering, averaging, grouping, comparing, and other operations

What is an example of a sequential pattern in telecommunications alarm logs?

(Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)

What is the goal of inventory management?

A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.

What does data mining do for us?

- It is about solving problems by analyzing data already present in databases.
- It enables us to learn from daily actions and activities.
- It is about looking for patterns.

What is the approach of market segmentation?

- Collect different attributes of customers based on their geographical and lifestyle-related information.
- Find clusters of similar customers.
- Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters.

What is an example of a sequential pattern in point-of-sale transaction sequences?

Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies,Tcl_Tk)

What is the difference in the fields?

Data mining is a cross-disciplinary field that focuses on discovering properties of data sets. Machine learning, on the other hand, is a subfield of data science that focuses on designing algorithms that can learn from and make predictions on data.

How do data mining and machine learning differ in their rules?

Data mining is not capable of self-learning; rather, it follows predefined rules. Machine learning algorithms are self-defined and can change their rules based on the scenario.

What is the difference between data mining and machine learning when it comes to future outlook?

Data mining is used more for prediction, whereas machine learning is used more for optimization by applying new algorithms.

What is step 3 of the Cross-Industry Standard Process for Data Mining (CRISP-DM)?

Data preparation phase

What is step 2 of the Cross-Industry Standard Process for Data Mining (CRISP-DM)?

Data understanding phase

What is step 6 of the Cross-Industry Standard Process for Data Mining (CRISP-DM)?

Deployment phase

Association Rule Discovery

Descriptive

Clustering

Descriptive

Sequential Pattern Discovery

Descriptive

What is the origin of data mining?

Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

What is step 5 of the Cross-Industry Standard Process for Data Mining (CRISP-DM)?

Evaluation phase

What is the definition of clustering?

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
- Data points in one cluster are more similar to one another.
- Data points in separate clusters are less similar to one another.
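
The definition above does not name an algorithm. As one illustration, here is a minimal k-means sketch, a common clustering method, using Euclidean distance as the similarity measure. The points and the starting centroids are hypothetical, and `math.dist` requires Python 3.8 or later:

```python
import math

def k_means(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep the old centroid if empty
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical 2-D data points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, centroids = k_means(points, centroids=[(1, 1), (8, 8)])
print(clusters)
```

Points that end up in the same cluster are closer to each other than to points in the other cluster, which is the property the definition asks for.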

Association rule discovery

Given a set of records, each of which contains some number of items from a given collection:
- Produce dependency rules that will predict the occurrence of an item based on the occurrences of other items.
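
A dependency rule such as {diapers} --> {beer} is usually scored by its support and confidence, two standard measures that the card above does not mention. The sketch below computes both on a made-up set of transactions; the function names and the data are assumptions for illustration:

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, itemset):
    """Fraction of transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent appears in transactions that contain the
    antecedent (assumes the antecedent occurs at least once)."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

# Hypothetical market-basket records.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

rule_support = support(transactions, {"diapers", "beer"})
rule_confidence = confidence(transactions, {"diapers"}, {"beer"})
print(rule_support, rule_confidence)  # 0.6 0.75
```

Here the rule holds in 3 of 5 transactions overall, and in 3 of the 4 transactions that contain diapers.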

There is often information ------- in the data that is not readily evident

Hidden

What is the gain of document clustering?

Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Can machine learning and data mining be used for each other?

Machine Learning can be used for data mining. However, data mining can use other techniques besides or on top of machine learning.

What is step 4 of the Cross-Industry Standard Process for Data Mining (CRISP-DM)?

Modeling phase

Classification

Predictive

Deviation Detection

Predictive

Estimation

Predictive

Regression

Predictive

What is the approach of inventory management?

Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

What is the approach of the supermarket shelf management?

Process the point-of-sale data collected with barcode scanners to find dependencies among items.
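
One simple way to look for dependencies among items is to count how often each pair of items appears in the same basket and keep the pairs that reach a minimum count. This is a sketch under that assumption; the baskets, the function name, and the threshold are hypothetical:

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_count):
    """Count item pairs that co-occur in a basket and keep the frequent ones."""
    pair_counts = Counter()
    for basket in baskets:
        # sorted(set(...)) gives each pair one canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

# Hypothetical point-of-sale baskets.
baskets = [
    ["bagels", "cream cheese", "juice"],
    ["bagels", "cream cheese"],
    ["juice", "chips"],
    ["bagels", "cream cheese", "chips"],
]

pairs = frequent_pairs(baskets, min_count=3)
print(pairs)  # {('bagels', 'cream cheese'): 3}
```

Pairs that pass the threshold are candidates for shelf placement decisions; counting all pairs this way becomes expensive for large catalogs, which is where dedicated frequent-itemset algorithms come in.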

What is the goal of direct marketing?

Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell phone product.

Challenges of data mining

- Scalability
- Dimensionality
- Complex and heterogeneous data
- Data quality
- Data ownership and distribution
- Privacy preservation
- Streaming data

What is the goal of document clustering?

To find groups of documents that are similar to each other based on the important terms appearing in them.

What is the approach of document clustering?

- Identify frequently occurring terms in each document.
- Form a similarity measure based on the frequencies of different terms.
- Use it to cluster.
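
The first two steps can be sketched with raw term counts and cosine similarity. Cosine similarity is one common choice of measure, not one the card specifies, and the tokenization (lowercase, split on whitespace) and sample documents are simplifying assumptions:

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical search results for the query "Amazon".
docs = [
    "amazon rainforest trees river",
    "amazon river rainforest wildlife",
    "amazon online shopping orders",
]
tfs = [term_frequencies(d) for d in docs]

sim_forest = cosine_similarity(tfs[0], tfs[1])  # both about the rainforest
sim_mixed = cosine_similarity(tfs[0], tfs[2])   # rainforest vs. retailer
print(sim_forest, sim_mixed)
```

The two rainforest documents score higher with each other than with the shopping document, so a clustering step based on this measure would tend to group them together.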

What is the goal of the supermarket shelf management?

To identify items that are bought together by sufficiently many customers.

What is your job as a scientist?

- To make sense of data
- To discover the patterns
- To propose theories that can be used for predictions
- To make policies and govern based on data

What is the goal of Sky Survey Cataloging?

To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images

What is the goal of customer attrition?

To predict whether a customer is likely to be lost to a competitor.

What is the approach of fraud detection?

- Use credit card transactions and the information on the account holder as attributes (when does a customer buy, what does he buy, how often does he pay on time, etc.).
- Label past transactions as fraud or fair transactions. This forms the class attribute.
- Learn a model for the class of the transactions.
- Use this model to detect fraud by observing credit card transactions on an account.

What is the approach of customer attrition?

- Use detailed records of transactions with each of the past and present customers to find attributes.
- Label the customers as loyal or disloyal.
- Find a model for loyalty.

What is the direct marketing approach?

- Use the data for a similar product introduced before.
- We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
- Collect various demographic, lifestyle, and company-interaction related information about all such customers (type of business, where they stay, how much they earn, etc.).
- Use this information as input attributes to learn a classifier model.

