Data Mining Test 1
Tables
Simplest way of representing output: use the same format as the input! Main problem: selecting the right attributes; you have to experiment to decide which are not relevant to the concept to be learned
Trees for numeric prediction
Regression: the process of computing an expression that predicts a numeric quantity. Regression tree: a "decision tree" where each leaf predicts a numeric quantity; the predicted value is the average value (of the class attribute) of the training instances that reach the leaf. Model tree: combines a regression tree with linear regression equations at the leaf nodes; linear patches approximate a continuous function
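The leaf-prediction rule above can be sketched in a few lines of Python. This is a minimal illustration, not a full tree learner; the targets reaching the leaf are made-up values.

```python
# Sketch: a regression-tree leaf predicts the average target value
# (class attribute) of the training instances that reach that leaf.
def leaf_prediction(training_targets):
    """Average of the class attribute over instances reaching this leaf."""
    return sum(training_targets) / len(training_targets)

# Suppose 4 training instances reach a leaf, with these numeric targets:
targets_at_leaf = [12.0, 15.0, 14.0, 13.0]
print(leaf_prediction(targets_at_leaf))  # 13.5
```

A model tree would instead store a linear regression equation at the leaf and evaluate it on the instance's attribute values.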
Trees
"Divide-and-conquer" approach produces the tree. The value of one attribute rules out certain classes. Nodes involve testing a particular attribute; usually the attribute value is compared to a constant. Other possibilities: comparing the values of two attributes, or using a function of one or more attributes. Leaves assign a classification, a set of classifications, or a probability distribution to instances. A new, unknown instance is routed down the tree for classification / prediction
Linear models
Another simple representation, also called a regression model. Inputs (attribute values) and output are all numeric. Output is the sum of weighted attribute values. The trick is to find good values for the weights that give a good fit to the training data. Easiest to visualize in two dimensions: a straight line drawn through the data points represents the regression model / function. Can be applied to binary classification, not only numerical estimation: the line separates the two classes. Decision boundary - defines where the decision changes from one class value to the other. A prediction is made by plugging the observed attribute values into the expression: predict one class if the output is ≥ 0, and the other class if the output is < 0. The boundary becomes a high-dimensional plane (hyperplane) when there are multiple attributes
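The weighted-sum-plus-threshold idea can be sketched directly. The weights and bias below are illustrative, not learned from data; a real learner would fit them to the training set.

```python
# Sketch of a linear model used for binary classification: the output is a
# weighted sum of the attribute values; predict one class if the output is
# >= 0, the other if it is < 0. Weights here are hypothetical, not learned.
def linear_output(weights, bias, attributes):
    return bias + sum(w * x for w, x in zip(weights, attributes))

def classify(weights, bias, attributes):
    return "class A" if linear_output(weights, bias, attributes) >= 0 else "class B"

w, b = [2.0, -1.0], -0.5           # hypothetical weights for two attributes
print(classify(w, b, [1.0, 1.0]))  # 2.0 - 1.0 - 0.5 = 0.5 >= 0 -> class A
print(classify(w, b, [0.0, 1.0]))  # -1.0 - 0.5 = -1.5 < 0     -> class B
```

The set of points where `linear_output` equals exactly zero is the decision boundary (a line in two dimensions, a hyperplane in more).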
Association learning
Can be applied if no class is specified and any kind of structure is considered "interesting" Difference from classification learning: Unsupervised I.e., not told what to learn Can predict any attribute's value, not just the class, and more than one attribute's value at a time Hence: far more association rules than classification rules Thus: constraints are necessary Minimum coverage and minimum accuracy
Preparing the input
Denormalization is not the only issue Problem: different data sources (e.g. sales department, customer billing department, ...) Differences: styles of record keeping, conventions, time periods, primary keys, errors Data must be assembled, integrated, cleaned up "Data warehouse": consistent point of access External data may be required ("overlay data")
Missing values
Does the absence of a value have some significance? If yes, "missing" is a separate value. If no, "missing" must be treated in a special way. Solution A: assign the instance to the most popular branch. Solution B: split the instance into pieces; pieces receive weight according to the fraction of training instances that go down each branch, and classifications from the leaf nodes are combined using the weights that have percolated to them
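Solution B can be sketched as follows. The branch fractions and leaf predictions are hypothetical; in a real tree they would come from the training data and the subtrees below the split.

```python
# Sketch of "Solution B": an instance missing the tested attribute is split
# among all branches, weighted by the fraction of training instances that
# went down each branch; leaf votes are then combined using those weights.
branch_weight = {"sunny": 0.4, "overcast": 0.3, "rainy": 0.3}   # hypothetical
leaf_prediction = {"sunny": "no", "overcast": "yes", "rainy": "yes"}

votes = {}
for branch, w in branch_weight.items():
    cls = leaf_prediction[branch]
    votes[cls] = votes.get(cls, 0.0) + w

prediction = max(votes, key=votes.get)
print(prediction)  # 'yes' (weight 0.6 beats 'no' with weight 0.4)
```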
Clustering
Finding groups of items that are similar Clustering is unsupervised The class of an example is not known Success often measured subjectively
Ordinal quantities
Impose an order on values, but no distance between values is defined. Example: attribute "temperature" in the weather data. Values: "hot" > "mild" > "cool". Note: addition and subtraction don't make sense. Example rule: temperature < hot → play = yes. The distinction between nominal and ordinal is not always clear (e.g. attribute "outlook" - is there an ordering?)
Interval quantities
Interval quantities are not only ordered but measured in fixed numerical units Example: attribute "year" Difference of two values makes sense Sum or product doesn't make sense
Ratio quantities
Ratio quantities are those for which the measurement scheme defines a zero point Example: attribute "distance" Distance between an object and itself is zero Ratio quantities are treated as real numbers All mathematical operations are allowed
Inaccurate values
Reason: the data has not been collected for the purpose of mining. Result: errors and omissions that don't affect the original purpose of the data but are critical to mining, e.g. age of customer in banking data. Typographical errors in nominal attributes: values need to be checked for consistency. Typographical, measurement, and rounding errors in numeric attributes: outliers need to be identified. What facility of Weka did we learn in lab that might be useful here? Errors may be deliberate, e.g. wrong zip codes. Other problems: duplicates, stale data
Missing values: reasons
Reasons:
- malfunctioning equipment
- changes in experimental design
- collation of different datasets
- measurement not possible
- user refusal to answer a survey question
A missing value may have significance in itself (e.g. a missing test in a medical examination). Most schemes assume that is not the case: "missing" may need to be coded as an additional value
Numeric estimation
Variant of classification learning where the output attribute is numeric (also called "regression"). Learning is supervised: the algorithm is provided with target values. Measure success on test data
What are the 3 main structural descriptions?
- Rules: classification and association
- Decision trees
- Linear regression formulas
What are the 3 main learning strategies in Data Mining?
- Unsupervised clustering
- Supervised learning (classification, numerical estimation, prediction)
- Association learning (market basket analysis)
Data and Information
Data is recorded facts, whereas information is the patterns underlying the data. Information tells us what is useful about the data.
attribute
Each column in a collection of training data. The attributes can be divided into two types:
- the output attribute - the one we want to determine/predict
- the input attributes - everything else
Each instance is described by a fixed predefined set of features, its "attributes". But: the number of attributes may vary in practice. Example: a table of transportation vehicles. Possible solution: an "irrelevant value" flag. Related problem: the existence of an attribute may depend on the value of another one. Example: "spouse name" depends on "married?". Possible solution: methods of data reduction. Possible attribute types ("levels of measurement"): nominal, ordinal, interval and ratio. Simplifies to nominal and numeric
example or instance.
Each row in a collection of training data
Data Mining
Employing machine learning techniques in order to find useful trends and patterns in data
Classification learning
Example problems: weather data, medical diagnosis, contact lenses, irises, labor negotiations, etc. Can you think of others? Classification learning is supervised: the algorithm is provided with actual outcomes. The outcome is called the class attribute of the example. Measure success on fresh data for which class labels are known (test data, as opposed to training data). In practice success is often measured subjectively: how acceptable the learned description is to a human user
Overfitting
In general, working too hard to match the training examples can lead to an overly complicated model that doesn't generalize well.
Supervised Learning and Unsupervised Learning
In supervised learning, the outputs are given to train the machine toward the desired outputs. In unsupervised learning, outputs are not given, so the data is grouped into different classes.
Input Attribute and Output Attribute
Input attributes help us formulate our output attribute. An output attribute is the attribute we want to predict.
Shallow Knowledge and Hidden Knowledge
Shallow Knowledge is factual and easily manipulated in the database. Hidden knowledge is patterns in the data that are not easy to find. (That's why we have Data Mining.)
test examples
To test how well a model generalizes, we typically withhold some of the examples from training
Training Data and Test Data
Training Data is the data we used to formulate the patterns of the data. We use test data to see how correct the pattern is.
Output: representing structural patterns
Understanding the output is the key to understanding the underlying learning methods
Nominal attributes
have values that are "names" of categories - there is a small set of possible values. Examples:
- Fever: {Yes, No}
- Diagnosis: {Allergy, Cold, Strep Throat}
- Outlook: {sunny, overcast, raining}
• In classification learning, the output attribute is always nominal.
• "Nominal" comes from the Latin word for name
• No relation is implied among nominal values
• No ordering or distance measure; can only test for equality
Also called "categorical", "enumerated", or "discrete"
Numeric attributes
have values that come from a range of numbers. Examples:
- Body Temp: any value in 96.0-106.0
- Salary: any value in $15,000-$250,000
You can order their values (definition of "ordinal" type): $210,000 > $125,000; 98.6 < 101.3.
Also called "numeric" or "continuous"
Concepts
kinds of things that can be learned Goal: intelligible and operational concept description E.g.: "Under what conditions should we play?" This concept is located somewhere in the input data
Attributes
measuring aspects of an instance We will focus on nominal and numeric attributes
Classification rule
predicts value of a given attribute
Association rule
predicts value of arbitrary attribute (or combination)
machine learning
the area of computer science that aims to build systems & algorithms that learn from data
Instances
the individual, independent examples of a concept Note: more complicated forms of input are possible
error rate of a model
the percentage of test examples that it misclassifies
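This definition translates directly into code. A small sketch, with made-up predictions for five hypothetical test examples:

```python
# Error rate = percentage of test examples that the model misclassifies.
def error_rate(predicted, actual):
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return 100.0 * wrong / len(actual)

# Hypothetical predictions vs. actual class labels for 5 test examples:
print(error_rate(["yes", "no", "yes", "yes", "no"],
                 ["yes", "no", "no",  "yes", "yes"]))  # 40.0 (2 of 5 wrong)
```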
Data mining vs. data query
• Database queries in SQL are not the same thing as data mining. • Queries allow us to extract factual information. - "shallow knowledge" • In data mining, we attempt to extract patterns and relationships that go beyond mere factual information. - "hidden knowledge"
Why Data Pre‐processing?
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of interest
- e.g., occupation=" "
• noisy: containing errors or outliers
- e.g., Salary="-10"
• inconsistent: containing discrepancies in codes or names
- e.g., Age="42", Birthday="03/07/1997"
- e.g., was rating "1,2,3", now rating "A, B, C"
- e.g., discrepancy between duplicate records
• No quality data, no quality mining results! Quality decisions must be based on quality data
• Data extraction, integration, cleaning, transformation, and reduction comprise the majority of the work of building target data
Data Integration
• In designing a database, we try to avoid redundancies by normalizing the data.
• As a result, the data for a given entity (e.g., a customer) may be:
- spread over multiple tables
- spread over multiple records within a given table
• To prepare for data warehousing and/or data mining, we often need to denormalize the data.
- multiple records for a given entity → a single record
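Denormalization can be sketched with plain Python dictionaries. The customer and order tables below are entirely hypothetical; the point is just folding multiple records per entity into one flat record.

```python
# Sketch of denormalization: join normalized "tables" so each entity
# (here, a customer) ends up as a single flat record. Data is made up.
customers = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
orders = [
    {"customer_id": 1, "amount": 50},
    {"customer_id": 1, "amount": 30},
    {"customer_id": 2, "amount": 20},
]

# One record per customer, with the order info folded in as a total:
flat = []
for cid, cust in customers.items():
    total = sum(o["amount"] for o in orders if o["customer_id"] == cid)
    flat.append({"customer_id": cid, "name": cust["name"], "total_spend": total})

print(flat)
# [{'customer_id': 1, 'name': 'Ann', 'total_spend': 80},
#  {'customer_id': 2, 'name': 'Bob', 'total_spend': 20}]
```

In practice this join would be done in SQL or with a data-frame library, but the shape of the result is the same: one row per entity.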
Why is Data Dirty?
• Incomplete data (missing values) may come from:
- "Not applicable" data value when collected
- Different considerations between the time when the data was collected and when it is analyzed
- Human/hardware/software problems
• Noisy data (incorrect values) may come from:
- Faulty data collection instruments
- Human or computer error at data entry
- Errors in data transmission
• Inconsistent data may come from:
- Different data sources (resulting from integration)
- Functional dependency violation (e.g., modify some linked data)
Handling Noisy Data
• Noise:
- random error or variance in a measured attribute
- outlier values
- more generally: non-predictive values
• Combined computer and human inspection
- detect suspicious values and check by human
- data visualization the key tool
• Clustering
- detect and remove outliers; also employs data viz
- employs techniques similar to discretizing
• Regression
- smooth by fitting the data into regression functions
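Smoothing by regression can be sketched with an ordinary least-squares line fit. The data points below are invented, roughly following y = 2x with noise added.

```python
# Sketch: smooth noisy data by fitting a least-squares line, then replace
# each noisy y with the fitted value on the line. Data is hypothetical.
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x, with noise
slope, intercept = fit_line(xs, ys)
smoothed = [slope * x + intercept for x in xs]   # points on the fitted line
```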
Noise
• Noisy data is meaningless data • The term has often been used as a synonym for corrupt data • Its meaning has expanded to include any data that cannot be understood and interpreted correctly by machines - unstructured text for example
Handling Missing Values
• Options:
- Ignore them
• PRISM and ID3 won't work at all
• Naïve Bayes handles them fine
• J48 and nearest neighbor use tricks to get around them
- Remove all instances with missing attributes
• Unsupervised RemoveWithValues attribute filter in Weka
- Replace missing values with the most common value for that attribute
• Unsupervised ReplaceMissingValues attribute filter in Weka
• Only works with nominal values
• Issues?
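The "replace with most common value" option can be sketched for a nominal attribute. Here `?` marks a missing value, following the ARFF convention; the attribute values are made up.

```python
# Sketch: replace missing nominal values with the attribute's most common
# (modal) value, as in the "most common value" option described above.
from collections import Counter

def replace_missing(values, missing="?"):
    present = [v for v in values if v != missing]
    mode = Counter(present).most_common(1)[0][0]   # most frequent value
    return [mode if v == missing else v for v in values]

outlook = ["sunny", "?", "rainy", "sunny", "?", "overcast"]
print(replace_missing(outlook))
# ['sunny', 'sunny', 'rainy', 'sunny', 'sunny', 'overcast']
```

One issue this makes visible: every missing value gets the same replacement, which can inflate the apparent frequency of the modal value.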
Data Reduction
• Problematic attributes include:
- irrelevant attributes: ones that don't help to predict the class
• despite their irrelevance, the algorithm may erroneously include them in the model
- attributes that cause overfitting
• example: a unique identifier such as Patient ID
- redundant attributes: those that offer basically the same information as another attribute
• example: in many problems, date-of-birth and age provide the same information
• some algorithms may end up giving the information from these attributes too much weight
• We can remove an attribute manually in Weka by clicking the checkbox next to the attribute in the Preprocess tab and then clicking the Remove button
- How to determine which to remove?
• Experimentation
• Correlation analysis (filters in Weka)
• Undoing preprocess actions:
- In the Preprocess tab, the Undo button allows you to undo actions that you perform, including:
• applying a filter to a dataset
• manually removing one or more attributes
- If you apply two filters without using Undo in between the two, the second filter will be applied to the results of the first filter
- Undo can be pressed multiple times to undo a sequence of actions
Discretizing Numeric Attributes
• We can turn a numeric attribute into a nominal/categorical one by using some sort of discretization
• This involves dividing the range of possible values into subranges called buckets or bins.
- example: an age attribute could be divided into these bins: child: 0-12, teen: 12-17, young: 18-35, middle: 36-59, senior: 60-
• What if we don't know which subranges make sense?
• Equal-width binning divides the range of possible values into N subranges of the same size.
- bin width = (max value - min value) / N
- example: if the observed values are all between 0-100, we could create 5 bins as follows: width = (100 - 0)/5 = 20; bins: [0-20], (20-40], (40-60], (60-80], (80-100]
- problems with this equal-width approach?
• Equal-frequency or equal-height binning divides the range of possible values into N bins, each of which holds the same number of training instances.
- example: let's say we have 10 training examples with the following values for the attribute that we're discretizing: 5, 7, 12, 35, 65, 82, 84, 88, 90, 95
- To select the boundary values for the bins, this method typically chooses a value halfway between the training examples on either side of the boundary
- final bins: (-inf, 9.5], (9.5, 50], (50, 83], (83, 89], (89, inf)
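Both binning schemes above can be sketched in a few lines. The example values are the ten training examples from the equal-frequency example; the cut points it produces match the "final bins" listed.

```python
# Equal-width binning: N subranges of the same size.
def equal_width_edges(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

# Equal-frequency binning: each bin holds the same number of training
# instances; each cut point is halfway between the examples on either
# side of the boundary.
def equal_frequency_edges(values, n_bins):
    vals = sorted(values)
    per_bin = len(vals) // n_bins
    return [(vals[i * per_bin - 1] + vals[i * per_bin]) / 2
            for i in range(1, n_bins)]

ages = [5, 7, 12, 35, 65, 82, 84, 88, 90, 95]
print(equal_width_edges(ages, 5))      # [5.0, 23.0, 41.0, 59.0, 77.0, 95.0]
print(equal_frequency_edges(ages, 5))  # [9.5, 50.0, 83.0, 89.0]
```

The equal-frequency cut points 9.5, 50, 83, 89 reproduce the bins (-inf, 9.5], (9.5, 50], (50, 83], (83, 89], (89, inf); the equal-width edges show why that approach can be skewed by outliers at either end of the range.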
Transforming the Data
• We may also need to reformat or transform the data.
- we can use a Python program to do the reformatting
- Weka also provides several useful filters
• One reason for transforming the data: many machine-learning algorithms can only handle certain types of data
- some algorithms only work with nominal attributes - attributes with a specified set of possible values
• examples: {yes, no} {strep throat, cold, allergy}
- other algorithms only work with numeric attributes
What are 3 problems that often occur in Data Mining?
◆ Problem 1: most patterns are not interesting (noise!) ◆ Problem 2: patterns may be inexact (or spurious) ◆ Problem 3: data may be garbled or missing