Data Mining Test 1


Tables

Simplest way of representing output: use the same format as the input! The main problem is selecting the right attributes; you have to experiment to decide which are not relevant to the concept to be learned.

Trees for numeric prediction

Regression: the process of computing an expression that predicts a numeric quantity. Regression tree: a "decision tree" where each leaf predicts a numeric quantity; the predicted value is the average value (of the class attribute) of the training instances that reach the leaf. Model tree: combines a regression tree with linear regression equations at the leaf nodes, so that linear patches approximate a continuous function.
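The leaf prediction rule described above (the average of the training instances that reach the leaf) can be sketched in Python; the numbers below are invented for illustration:

```python
# Minimal sketch: a regression-tree leaf predicts the mean of the numeric
# class values of the training instances routed to it. Toy data only.

def leaf_prediction(targets):
    """Average of the numeric class values that reach this leaf."""
    return sum(targets) / len(targets)

# e.g. five training instances routed to the same leaf
print(leaf_prediction([3.0, 4.0, 5.0, 4.0, 4.0]))  # 4.0
```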

Trees

"Divide-and-conquer" approach produces a tree. The value of one attribute rules out certain classes. Nodes involve testing a particular attribute; usually, an attribute value is compared to a constant. Other possibilities: comparing the values of two attributes, or using a function of one or more attributes. Leaves assign a classification, a set of classifications, or a probability distribution to instances. A new, unknown instance is routed down the tree for classification / prediction.

Linear models

Another simple representation, also called a regression model. Inputs (attribute values) and output are all numeric; the output is the sum of weighted attribute values. The trick is to find good values for the weights that give a good fit to the training data. Easiest to visualize in two dimensions: a straight line drawn through the data points represents the regression model / function. Can be applied to binary classification, not only numerical estimation: the line separates the two classes. Decision boundary: defines where the decision changes from one class value to the other. A prediction is made by plugging the observed values of the attributes into the expression: predict one class if the output ≥ 0, and the other class if the output < 0. The boundary becomes a high-dimensional plane (hyperplane) when there are multiple attributes.
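A minimal sketch of the weighted-sum model and its sign-based decision rule; the weights, attribute values, and class labels here are all invented:

```python
# Sketch of a linear model used for binary classification: the output is
# a weighted sum of attribute values, and the sign of the output picks
# the class. Weights and data are made up for illustration.

def linear_output(weights, attributes, bias=0.0):
    return bias + sum(w * x for w, x in zip(weights, attributes))

def classify(weights, attributes, bias=0.0):
    # Predict one class if the output >= 0, the other if < 0
    return "yes" if linear_output(weights, attributes, bias) >= 0 else "no"

print(classify([0.5, -1.0], [4.0, 1.0]))  # 0.5*4 - 1.0*1 = 1.0 >= 0 -> "yes"
print(classify([0.5, -1.0], [1.0, 2.0]))  # 0.5*1 - 1.0*2 = -1.5 < 0 -> "no"
```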

Association learning

Can be applied if no class is specified and any kind of structure is considered "interesting". Difference from classification learning: unsupervised, i.e., not told what to learn. Can predict any attribute's value, not just the class, and more than one attribute's value at a time. Hence there are far more association rules than classification rules, so constraints are necessary: minimum coverage and minimum accuracy.
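The coverage and accuracy constraints can be illustrated with a small sketch, taking coverage as the number of instances the rule predicts correctly (support) and accuracy as that count divided by the instances the rule applies to (confidence); the tiny weather dataset below is invented:

```python
# Sketch: coverage (support) and accuracy (confidence) of an association
# rule "if antecedent then consequent" over a list of attribute->value
# dicts. Dataset and rule are toy examples.

def rule_stats(dataset, antecedent, consequent):
    matches = [inst for inst in dataset
               if all(inst.get(a) == v for a, v in antecedent.items())]
    correct = [inst for inst in matches
               if all(inst.get(a) == v for a, v in consequent.items())]
    coverage = len(correct)                                      # support
    accuracy = len(correct) / len(matches) if matches else 0.0   # confidence
    return coverage, accuracy

weather = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
]
print(rule_stats(weather, {"outlook": "sunny"}, {"play": "no"}))  # (2, 1.0)
```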

Preparing the input

Denormalization is not the only issue. Problem: different data sources (e.g. sales department, customer billing department, ...). Differences: styles of record keeping, conventions, time periods, primary keys, errors. The data must be assembled, integrated, and cleaned up. A "data warehouse" provides a consistent point of access. External data may be required ("overlay data").

Missing values

Does the absence of a value have some significance? If yes, "missing" is a separate value. If no, "missing" must be treated in a special way. Solution A: assign the instance to the most popular branch. Solution B: split the instance into pieces; the pieces receive weight according to the fraction of training instances that go down each branch, and the classifications from the leaf nodes are combined using the weights that have percolated down to them.
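Solution B's weighted combination can be sketched as follows, assuming the branch weights and leaf class distributions are already known (all numbers invented):

```python
# Sketch of "Solution B": an instance with a missing attribute value is
# split among branches in proportion to the training instances that went
# down each branch; the leaf class distributions are then combined using
# those weights. All numbers below are invented.

def combine_leaves(weighted_leaves):
    """weighted_leaves: list of (weight, {class: probability}) pairs."""
    combined = {}
    for weight, dist in weighted_leaves:
        for cls, p in dist.items():
            combined[cls] = combined.get(cls, 0.0) + weight * p
    return combined

# Branch weights 0.6 / 0.4 (fractions of training instances per branch)
result = combine_leaves([(0.6, {"yes": 1.0}), (0.4, {"yes": 0.25, "no": 0.75})])
print(result)  # approximately {'yes': 0.7, 'no': 0.3}
```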

Clustering

Finding groups of items that are similar. Clustering is unsupervised: the class of an example is not known. Success is often measured subjectively.

Ordinal quantities

Impose an order on values, but no distance between values is defined. Example: attribute "temperature" in the weather data, with values "hot" > "mild" > "cool". Note: addition and subtraction don't make sense. Example rule: temperature < hot → play = yes. The distinction between nominal and ordinal is not always clear (e.g. attribute "outlook": is there an ordering?).

Interval quantities

Interval quantities are not only ordered but measured in fixed numerical units Example: attribute "year" Difference of two values makes sense Sum or product doesn't make sense

Ratio quantities

Ratio quantities are those for which the measurement scheme defines a zero point Example: attribute "distance" Distance between an object and itself is zero Ratio quantities are treated as real numbers All mathematical operations are allowed

Inaccurate values

Reason: the data has not been collected for the purpose of mining. Result: errors and omissions that don't affect the original purpose of the data but are critical to mining (e.g. age of customer in banking data). Typographical errors in nominal attributes: values need to be checked for consistency. Typographical, measurement, and rounding errors in numeric attributes: outliers need to be identified. What facility of Weka did we learn in lab that might be useful here? Errors may be deliberate (e.g. wrong zip codes). Other problems: duplicates, stale data.

Reasons: Missing values

Reasons: malfunctioning equipment, changes in experimental design, collation of different datasets, measurement not possible, user refusal to answer a survey question. A missing value may have significance in itself (e.g. a missing test in a medical examination). Most schemes assume that is not the case, so "missing" may need to be coded as an additional value.

Numeric estimation

Variant of classification learning where the output attribute is numeric (also called "regression") Learning is supervised Algorithm is provided with target values Measure success on test data

What are the 3 main structural descriptions?

- Rules: classification and association
- Decision trees
- Linear regression formulas

What are the 3 main learning strategies in Data Mining?

- Unsupervised clustering
- Supervised learning (classification, numerical estimation, prediction)
- Association learning (market basket analysis)

Data and Information

Data is recorded facts, whereas information is the patterns underlying the data. Information tells us what is useful about the data.

attribute

Each column in a collection of training data. The attributes can be divided into two types:
- the output attribute: the one we want to determine/predict
- the input attributes: everything else
Each instance is described by a fixed, predefined set of features, its "attributes". But the number of attributes may vary in practice (example: a table of transportation vehicles); a possible solution is an "irrelevant value" flag. A related problem: the existence of an attribute may depend on the value of another one (example: "spouse name" depends on "married?"); possible solutions are methods of data reduction. Possible attribute types ("levels of measurement"): nominal, ordinal, interval, and ratio. This simplifies to nominal and numeric.

example or instance.

Each row in a collection of training data

Data Mining

Employing machine learning techniques in order to find useful trends and patterns in data

Classification learning

Example problems: weather data, medical diagnosis, contact lenses, irises, labor negotiations, etc. Can you think of others? Classification learning is supervised: the algorithm is provided with actual outcomes, and the outcome is called the class attribute of the example. Success is measured on fresh data for which the class labels are known (test data, as opposed to training data). In practice, success is often measured subjectively: how acceptable the learned description is to a human user.

Overfitting

In general, working too hard to match the training examples can lead to an overly complicated model that doesn't generalize well.

Supervised Learning and Unsupervised Learning

In supervised learning, the outputs are given to train the machine toward the desired outputs. In unsupervised learning, outputs are not given, so the data is grouped into clusters.

Input Attribute and Output Attribute

Input attributes help us formulate our output attribute. An output attribute is the attribute we want to predict.

Shallow Knowledge and Hidden Knowledge

Shallow Knowledge is factual and easily manipulated in the database. Hidden knowledge is patterns in the data that are not easy to find. (That's why we have Data Mining.)

test examples

To test how well a model generalizes, we typically withhold some of the examples

Training Data and Test Data

Training Data is the data we used to formulate the patterns of the data. We use test data to see how correct the pattern is.

Output: representing structural patterns

Understanding the output is the key to understanding the underlying learning methods

Nominal attributes

have values that are "names" of categories.
- there is a small set of possible values
  attribute: Fever, possible values: {Yes, No}
  attribute: Diagnosis, possible values: {Allergy, Cold, Strep Throat}
  attribute: Outlook, possible values: {sunny, overcast, raining}
• In classification learning, the output attribute is always nominal.
• "Nominal" comes from the Latin word for name.
• No relation is implied among nominal values: no ordering or distance measure; can only test for equality.
• Also called "categorical", "enumerated", or "discrete".

Numeric attributes

have values that come from a range of numbers.
  attribute: Body Temp, possible values: any value in 96.0-106.0
  attribute: Salary, possible values: any value in $15,000-250,000
- you can order their values (the definition of the "ordinal" type): $210,000 > $125,000, 98.6 < 101.3

Concepts

kinds of things that can be learned Goal: intelligible and operational concept description E.g.: "Under what conditions should we play?" This concept is located somewhere in the input data

Attributes

measuring aspects of an instance We will focus on nominal and numeric attributes

Classification rule

predicts value of a given attribute

Association rule

predicts value of arbitrary attribute (or combination)

machine learning

the area of computer science that aims to build systems & algorithms that learn from data

Instances

the individual, independent examples of a concept Note: more complicated forms of input are possible

error rate of a model

the percentage of test examples that it misclassifies
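As a quick sketch of that definition, with invented labels:

```python
# Sketch: error rate = percentage of test examples the model misclassifies.
# Predicted and actual labels below are toy data.

def error_rate(predicted, actual):
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return 100.0 * wrong / len(actual)

print(error_rate(["yes", "no", "yes", "no"],
                 ["yes", "yes", "yes", "no"]))  # 25.0 (1 of 4 wrong)
```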

Data mining vs. data query

• Database queries in SQL are not the same thing as data mining. • Queries allow us to extract factual information. - "shallow knowledge" • In data mining, we attempt to extract patterns and relationships that go beyond mere factual information. - "hidden knowledge"

Why Data Pre‐processing?

• Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest
- e.g., occupation=" "
• noisy: containing errors or outliers
- e.g., Salary="-10"
• inconsistent: containing discrepancies in codes or names
- e.g., Age="42", Birthday="03/07/1997"
- e.g., was rating "1,2,3", now rating "A, B, C"
- e.g., discrepancy between duplicate records
• No quality data, no quality mining results! Quality decisions must be based on quality data.
• Data extraction, integration, cleaning, transformation, and reduction comprise the majority of the work of building target data.

Data Integration

• In designing a database, we try to avoid redundancies by normalizing the data.
• As a result, the data for a given entity (e.g., a customer) may be:
- spread over multiple tables
- spread over multiple records within a given table
• To prepare for data warehousing and/or data mining, we often need to denormalize the data.
- multiple records for a given entity → a single record
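A toy sketch of the denormalization step (table and field names invented), folding each customer's data into the matching order records:

```python
# Toy sketch of denormalization: customer data folded into each order
# record so a given entity's information sits in a single flat record.
# Table and field names are invented for illustration.

customers = {1: {"name": "Ada"}, 2: {"name": "Bo"}}
orders = [{"cust_id": 1, "item": "book"}, {"cust_id": 2, "item": "pen"}]

# Merge the customer record into each order record (one flat dict each)
denormalized = [{**customers[o["cust_id"]], **o} for o in orders]
print(denormalized[0])  # {'name': 'Ada', 'cust_id': 1, 'item': 'book'}
```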

Why is Data Dirty?

• Incomplete data (missing values) may come from:
- "Not applicable" data value when collected
- Different considerations between the time when the data was collected and when it is analyzed
- Human/hardware/software problems
• Noisy data (incorrect values) may come from:
- Faulty data collection instruments
- Human or computer error at data entry
- Errors in data transmission
• Inconsistent data may come from:
- Different data sources (resulting from integration)
- Functional dependency violation (e.g., modifying some linked data)

Handing Noisy Data

• Noise:
- random error or variance in a measured attribute
- outlier values
- more generally: non-predictive values
• Combined computer and human inspection
- detect suspicious values and check by human
- data visualization is the key tool
• Clustering
- detect and remove outliers
- also employs data visualization
• Regression
- smooth by fitting the data into regression functions
• Clustering
- employs techniques similar to discretizing
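One simple way to flag the suspicious values mentioned above is a z-score test, sketched here; the threshold of 2 and the data are arbitrary choices for this illustration, not a rule from the notes:

```python
# Sketch: flag outlier values by z-score (distance from the mean in
# standard deviations). Threshold and data are invented for illustration.
import statistics

def outliers(values, threshold=2.0):
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if sd and abs(v - mean) / sd > threshold]

print(outliers([10, 11, 9, 10, 12, 11, 95]))  # [95]
```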

Noise

• Noisy data is meaningless data • The term has often been used as a synonym for corrupt data • Its meaning has expanded to include any data that cannot be understood and interpreted correctly by machines - unstructured text for example

Handling Missing Values

• Options:
- Ignore them
• PRISM and ID3 won't work at all
• Naïve Bayes handles them fine
• J48 and nearest neighbor use tricks to get around them
- Remove all instances with missing attributes
• Unsupervised RemoveWithValues attribute filter in Weka
- Replace missing values with the most common value for that attribute
• Unsupervised ReplaceMissingValues attribute filter in Weka
• Only works with nominal values
• Issues?
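What "replace missing values with the most common value" means for a nominal attribute can be sketched as mode imputation; this is an illustration of the idea, not Weka's ReplaceMissingValues implementation:

```python
# Sketch of mode imputation for a nominal attribute: every missing marker
# is replaced by the most common observed value. Toy data only.
from collections import Counter

def impute_mode(values, missing="?"):
    present = [v for v in values if v != missing]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v == missing else v for v in values]

print(impute_mode(["sunny", "?", "sunny", "rainy", "?"]))
# ['sunny', 'sunny', 'sunny', 'rainy', 'sunny']
```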

Data Reduction

• Problematic attributes include:
- irrelevant attributes: ones that don't help to predict the class
• despite their irrelevance, the algorithm may erroneously include them in the model
- attributes that cause overfitting
• example: a unique identifier such as Patient ID
- redundant attributes: those that offer basically the same information as another attribute
• example: in many problems, date-of-birth and age provide the same information
• some algorithms may end up giving the information from these attributes too much weight
• We can remove an attribute manually in Weka by clicking the checkbox next to the attribute in the Preprocess tab and then clicking the Remove button.
- How to determine which to remove?
• Experimentation
• Correlation analysis (filters in Weka)
• Undoing preprocess actions:
- In the Preprocess tab, the Undo button allows you to undo actions that you perform, including:
• applying a filter to a dataset
• manually removing one or more attributes
- If you apply two filters without using Undo in between, the second filter will be applied to the results of the first filter.
- Undo can be pressed multiple times to undo a sequence of actions.

Discretizing Numeric Attributes

• We can turn a numeric attribute into a nominal/categorical one by using some sort of discretization.
• This involves dividing the range of possible values into subranges called buckets or bins.
- example: an age attribute could be divided into these bins:
  child: 0-12, teen: 13-17, young: 18-35, middle: 36-59, senior: 60+
• What if we don't know which subranges make sense?
• Equal-width binning divides the range of possible values into N subranges of the same size.
- bin width = (max value - min value) / N
- example: if the observed values are all between 0 and 100, we could create 5 bins as follows:
  width = (100 - 0)/5 = 20
  bins: [0-20], (20-40], (40-60], (60-80], (80-100]
- problems with this equal-width approach?
• Equal-frequency (or equal-height) binning divides the range of possible values into N bins, each of which holds the same number of training instances.
- example: say we have 10 training examples with the following values for the attribute we're discretizing:
  5, 7, 12, 35, 65, 82, 84, 88, 90, 95
  To select the boundary values for the bins, this method typically chooses a value halfway between the training examples on either side of the boundary.
  final bins: (-inf, 9.5], (9.5, 50], (50, 83], (83, 89], (89, inf)
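Both binning schemes can be sketched directly from the worked examples above; the boundary values reproduce the numbers in the notes:

```python
# Sketch of the two binning schemes: equal-width edges from the range,
# and equal-frequency boundaries halfway between neighboring sorted
# training values. Assumes len(values) is divisible by n.

def equal_width_edges(lo, hi, n):
    width = (hi - lo) / n           # bin width = (max - min) / N
    return [lo + i * width for i in range(n + 1)]

def equal_frequency_edges(values, n):
    values = sorted(values)
    per_bin = len(values) // n      # same number of instances per bin
    edges = []
    for i in range(1, n):
        left, right = values[i * per_bin - 1], values[i * per_bin]
        edges.append((left + right) / 2)  # halfway between neighbors
    return edges

print(equal_width_edges(0, 100, 5))   # [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]
vals = [5, 7, 12, 35, 65, 82, 84, 88, 90, 95]
print(equal_frequency_edges(vals, 5))  # [9.5, 50.0, 83.0, 89.0]
```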

Transforming the Data

• We may also need to reformat or transform the data.
- we can use a Python program to do the reformatting
- Weka also provides several useful filters
• One reason for transforming the data: many machine-learning algorithms can only handle certain types of data.
- some algorithms only work with nominal attributes: attributes with a specified set of possible values
• examples: {yes, no}, {strep throat, cold, allergy}
- other algorithms only work with numeric attributes

What are 3 problems that often occur in Data Mining?

◆ Problem 1: most patterns are not interesting (noise!) ◆ Problem 2: patterns may be inexact (or spurious) ◆ Problem 3: data may be garbled or missing

