Week 8 Notes: Health Data Management


Supervised tasks (i.e., prediction) require data with an answer and can be broken down into two categories:

1. First is classification. Given a new item whose class is unknown, predict to which class it belongs. For instance, we may want to predict if a patient is at risk for suicide. Based on data from other patients whose suicide risk is known, a data mining model can be built to predict suicide at-risk status for new patients.
2. Second is regression. Given values for a set of variables, predict a new parameter value (not a class). For instance, we may want to predict the risk score of a patient with spinal cord impairment for developing a pressure ulcer.

The degree to which each of these characteristics is important informs the choices made during the KDD process, which consists of six steps:

1. First is selection. Here we select the data or sets of data that will be used for analysis. Many times, this will include data from your day-to-day database; however, data from third parties or other sources may be included too.
2. Second is preprocessing. This step handles noisy, redundant, inconsistent, and/or missing data.
3. Third is transformation. This step deals with scaling, smoothing, deriving new data attributes, and/or reducing the dimensionality of the data (i.e., selecting which attributes will remain in the dataset). Although the data has already been preprocessed, there may be additional changes needed for the data to be ready for data mining. For instance, some data mining algorithms expect attributes to be scaled.
4. Fourth is performing data mining. Here we match the goal of the KDD process to a data mining method. If the purpose is to predict class membership (e.g., patient suffered a fall or not), then a classification algorithm will be selected. If, instead, the purpose is to find similar patients, then a descriptive algorithm will be chosen.
5. Fifth is the interpretation and evaluation of the data mining model. Depending on the goal of the process, there may be more focus on interpreting the mined patterns or on evaluating the performance of the mined model. The KDD process can be refined based on these results.
6. Finally, the last step is to act on the discovered knowledge.

Unsupervised tasks do not require an answer with the data. We will discuss two categories of unsupervised data mining:

1. First, associations find items with some form of relationship between one another. For instance, Swanson used such a technique to discover a previously unknown association between fish oil and Raynaud's syndrome.
2. Second, clustering groups items with similar characteristics. For instance, grouping study subjects based on health factors.

Knowledge Discovery in Databases or KDD is the process of identifying patterns in data that have the following characteristics:

1. First, the patterns must be valid and supported by the data.
2. Second, the patterns should be new and previously unknown.
3. Third, the patterns should be useful. Ultimately the usefulness of the patterns will be determined by the user examining the results; however, there are mechanisms within the KDD process that can filter out patterns that are less likely to be useful. For instance, patterns which impact a small portion of the population can be removed.
4. Finally, the patterns should be understandable. Ideally, the user examining the results would be able to clearly see how the patterns were derived (i.e., the resulting patterns are easy to interpret). However, the ability to understand the patterns may be of less importance if the potential predictive power of the patterns is of greater interest.

What a document represents can be almost anything.

A document may be a progress note, discharge summary, etc. or even just segments or sentences within these document types.

If labels are not available, then the documents must be annotated so a label can be assigned.

A simple form of annotation consists of a human reading through each document within the corpus and then manually assigning a label.

To begin, k records are chosen at random and are designated as the cluster centers.

All other records then calculate their distance to each cluster center. Each record is then assigned to the closest cluster. After the initial assignment, the following three-step process repeats until the cluster centers stabilize -- i.e., they do not change location by a substantial distance. First, the center for each cluster is calculated. This is done by taking the average coordinates of each record assigned to the cluster. Second, records re-calculate their distance to the updated cluster centers. Third, records re-assign themselves to their closest cluster.
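To make the loop concrete, below is a minimal Python sketch of this procedure (NumPy-based; the data matrix X, the value of k, and the convergence tolerance are illustrative assumptions, not part of the original notes):

```python
import numpy as np

def k_means(X, k, tol=1e-4, seed=0):
    """Cluster the rows of X into k groups using the steps described above."""
    rng = np.random.default_rng(seed)
    # k records are chosen at random as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    while True:
        # Each record calculates its Euclidean distance to every center
        # and assigns itself to the closest cluster.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Each center is re-calculated as the average coordinates of the
        # records assigned to its cluster.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster empties out
                new_centers[j] = members.mean(axis=0)
        # Stop once centers no longer move by a substantial distance.
        if np.linalg.norm(new_centers - centers) < tol:
            return labels, new_centers
        centers = new_centers
```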

All subsequent steps assume separate training and test sets exist.

Any operation performed on the training set will then be applied to the test set.

Feature Selection and/or Engineering

As mentioned previously, term-by-document matrices are typically sparse with numerous terms.

The type of data mining performed will depend on the goal of the KDD process.

At the highest level, data mining can be broken down into supervised or unsupervised tasks.

The process of training and testing is then repeated x times, where each fold is used as a test set one time and all other folds are used for training.

Data is also commonly stratified so that the same proportion of positive and negative cases are present in the training and test sets.
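As a rough sketch, stratified x-fold cross-validation might look like the following in Python with scikit-learn (the toy data, fold count, and choice of Naive Bayes are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Toy stand-in for a term-by-document matrix (documents as rows) and labels.
X = np.random.default_rng(0).integers(0, 5, size=(40, 12))
y = np.array([0, 1] * 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    # Each fold serves as the test set exactly once; stratification keeps
    # the proportion of positive and negative cases similar in every fold.
    model = MultinomialNB().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
print(f"Mean accuracy over 5 folds: {np.mean(scores):.2f}")
```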

The final step in evaluating the performance of a trained model is to perform error analysis.

Documents incorrectly classified by the model should be manually examined to see if patterns of errors can be found. Error analysis is useful for highlighting limitations of the trained model. In addition, future STM projects may be better informed based on what is learned during error analysis.

The figure below shows an example of a decision tree for classifying smoking status within a document.

Each oval represents a decision point based on whether the specified term is found within the document with the given frequency. If the condition in the decision point is met, the line to the left is taken. Otherwise, the line to the right is taken. For example, if the term "smoke" occurs more than one time in the document, it is classified as "Smoker". If not, the decision point representing "pack" is then evaluated.

Having so many terms can present computational challenges such as increased training time and memory usage along with potentially impacting model performance (e.g., overfitting).

Feature selection can be used to reduce the matrix to a more manageable size for computational purposes while also retaining and/or creating generalizable features.

The second step in the STM process involves structuring the text into a form usable for text mining.

First, the text is converted into a term-by-document matrix. Second, the data in the matrix is then split into training and test sets. Third, the matrix is weighted to convey a sense of term importance within the documents. Finally, the size of the term-by-document matrix may be reduced by performing feature selection and/or engineering. Each sub-step is explained in further detail.

The numbers within the matrix represent how many times a specific term was found in a particular document.

For example, Documents 1 and 3 both have the term "smoking", whereas Document 2 is the only document with the term "cough." Notice that the matrix contains more terms than documents and that many of the values in the matrix are zero (i.e., the matrix is sparse). In a larger corpus, it is not uncommon to have tens of thousands of terms or more, with the vast majority of the matrix being sparse.

To illustrate, the three figures below show the step-wise process of k-Means Clustering.

For this example, we have set k = 5. In the first figure, five records are selected at random. These randomly selected records are now the center of their respective cluster.

The figure below shows a graphical example of k-Nearest Neighbors.

Here we are dealing with two-dimensional space because there are only two attributes (x1 and x2). The new record to be classified is shown via the arrow. The distance of all the other records to this new record will be calculated. If k is equal to three, then the classes of the three closest records are recorded and the most common class is selected. In this example, the new record would be assigned to the blue class.

Algorithms commonly used for STM include Naive Bayes and Support Vector Machines.

However, if being able to interpret the learned model is also a requirement, an algorithm such as decision tree induction may be more appropriate.

In the third figure, the records all re-calculate their distance to each cluster.

If another cluster is now closer, then the record re-assigns itself. Once again, the center for each cluster is re-calculated and adjusted as needed. This process of re-assigning and re-calculating continues until the clusters stabilize.

This reduced space allows documents with different, but related, term usages to map to the same dimension (reducing the sparseness of the matrix).

In addition, dimensions thought to represent noise (i.e., > k) are discarded.

In this example, there is perfect separation between classes -- i.e., no Class A records are in Class B's territory and vice-versa.

In more realistic datasets, perfect separation is not usually seen; however, SVM algorithms have the ability to allow slack in how perfect the separation must be.

Evaluate Performance

In the final step of the STM process, the trained model will be evaluated by classifying unseen documents from the test set. The table below illustrates the 2x2 contingency table created when comparing the predicted label of a document (based on the trained model) versus the actual label of a document (based on the reference standard).

The figure below is a graphical example of SVM.

In this example, there are two attributes (X1 and X2) and two classes (Class A and Class B).

Separate the Data

Once the term-by-document matrix is created, the data is then separated into training and test sets.

Feature selection can also be done to reduce the size of the matrix for computational reasons.

One technique removes terms that occur very infrequently in the matrix. For example, removing terms that only occur a single time typically results in a substantial reduction in features. Another technique keeps the k "best" features, where "best" is based on feature selection metrics (e.g., supervised global weighting functions such as X2).
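Below is a small Python sketch of both techniques (the toy counts and the chosen k are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(20, 100))  # toy term-by-document counts (docs as rows)
y = rng.integers(0, 2, size=20)         # toy binary labels

# Technique 1: remove terms that occur only a single time across the corpus.
keep = X.sum(axis=0) > 1
X_reduced = X[:, keep]

# Technique 2: keep the k "best" terms as scored by chi-square (X2).
X_best = SelectKBest(chi2, k=25).fit_transform(X_reduced, y)
print(X.shape, X_reduced.shape, X_best.shape)
```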

Regardless of how a document is defined,

STM represents each document as a bag of words - i.e., an unordered collection of words.

To illustrate, assume a grocery store would like to know which items customers are buying together.

Shown in the table below, we have five shopping carts with a variety of items from our store.

To help exclude uninteresting rules, minimum thresholds for support and confidence can be set; associations that do not meet them are removed.

Support is the fraction of the population that satisfies both the antecedent and the consequent. Confidence is how often the consequent is true when the antecedent is true.
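The sketch below computes both measures for a single candidate rule over a toy set of carts (the items shown are illustrative and not the actual table from these notes):

```python
# Five toy shopping carts (illustrative data).
carts = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"butter"},
]

def support(antecedent, consequent):
    # Fraction of all carts satisfying both sides of the rule.
    both = sum(1 for c in carts if antecedent <= c and consequent <= c)
    return both / len(carts)

def confidence(antecedent, consequent):
    # Of the carts containing the antecedent, how often the consequent holds.
    has_ante = [c for c in carts if antecedent <= c]
    return sum(1 for c in has_ante if consequent <= c) / len(has_ante)

print(support({"bread"}, {"milk"}))     # 2/5 = 0.4
print(confidence({"bread"}, {"milk"}))  # 2/3 ≈ 0.67
```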

The weights of Terms 1 and 4 highlight the difference between common and rare terms.

Term 1 is found in every document and hence is given no weight, whereas Term 4 is only found in a single document and thus is given the highest weight in this example.

If the goal is classification,

Term 5 should be weighted more heavily than Term 3 because both of its documents belong to the positive class (as opposed to being equally distributed across the classes).

The first step in the STM process is gathering a collection of documents called a corpus.

The corpus along with a label assigned to each document (i.e., class membership) is known as the reference standard and is used to train and evaluate text mining models. For example, the table below illustrates a corpus of three documents, with two of those documents having a label of smoker.

The figure below shows an example of a decision tree that predicts if we should go outside and play ball or not.

The decision is based on values from up to three attributes: weather outlook, wind conditions, and humidity.

Statistical text mining (STM) extends data mining to include text data.

The general process for performing text mining (which is similar to data mining) is discussed.

The figure below provides an example of global weighting functions.

The left-hand side of the figure presents a simplified term-by-document matrix, where a filled square indicates a term exists in the specified document and an unfilled square means a term does not exist. For example, Term 1 exists in all four documents, whereas Term 4 only exists in a single document. The vertical line in the matrix signifies the boundary between documents belonging to the positive and negative class.

The focus here will be on supervised STM with a single-label binary classification (e.g., FRI versus not-FRI; smoker vs. non-smoker).

The main sections of the STM process will be discussed.

The table below shows an example term-by-document matrix from the three documents in the previously discussed corpus.

The matrix was created by converting all document text to lowercase, ignoring punctuation marks, and then splitting words on spaces to create terms.

Create a Term-by-Document Matrix

The table below illustrates the structure of the term-by-document matrix. The matrix M has n rows representing terms and m columns representing documents - i.e., each distinct term from all documents within the corpus is shown down the side of the matrix, while each document from the corpus is represented across the top of the matrix. Mi,j contains the count of term i for document j.
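As a small illustration, the sketch below builds such a matrix from a toy corpus (the documents are made up for this example):

```python
from collections import Counter

docs = [
    "patient quit smoking last year",
    "persistent cough noted no fever",
    "smoking one pack per day",
]

# Lowercase the text and split on spaces to create terms.
tokenized = [d.lower().split() for d in docs]
terms = sorted({t for doc in tokenized for t in doc})

# M[i][j] holds the count of term i in document j (terms as rows,
# documents as columns, matching the structure described above).
counts = [Counter(doc) for doc in tokenized]
M = [[c[t] for c in counts] for t in terms]
for term, row in zip(terms, M):
    print(f"{term:12s} {row}")
```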

Derive Patterns

The third step of the STM process involves training a text mining model to find patterns within the training dataset that help distinguish between the two classes.

Weight the Matrix

The values within the term-by-document matrix are weighted to convey a sense of importance.

Patterns are then discovered based on counts of words within documents.

This statistical approach is in contrast to other text-based methods, such as Natural Language Processing, which attempt to infer the meaning of the text.

Supervised (Prediction):

To illustrate supervised data mining, three different classification algorithms and one regression algorithm are discussed briefly.

Unsupervised (Descriptive Patterns):

To illustrate unsupervised data mining, association rules and k-means clustering are introduced.

Thus, a supervised function, like X2, could be used that takes into account the distribution of term occurrence amongst classes.

With X2, both Terms 1 and 2 are given no weight because they do not have any discriminatory power, Terms 3 and 4 are given the same weight since they are the exact opposite of one another, and Term 5 is given the highest weight because it only occurs in positive documents.

Decision tree induction builds a decision tree automatically.

You may have seen decision trees before that were generated by individuals or based on best practices. Here, the decision tree is built using an algorithm. In an iterative fashion, the algorithm picks the attribute that results in the purest separation of data between the two classes being examined. Purity is determined via any of a number of user-selected measures, such as Gini index, information gain, gain ratio, accuracy, etc. Once an attribute is selected for splitting the data, the process repeats itself using any remaining attributes, until a stopping criterion is met. For instance, not enough data remains in any nodes of the tree.
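A minimal sketch with scikit-learn's tree learner (the bundled dataset and hyperparameter values are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)

# Gini index is one of the purity measures mentioned above; max_depth and
# min_samples_leaf act as stopping criteria (e.g., not enough data remains).
tree = DecisionTreeClassifier(criterion="gini", max_depth=3,
                              min_samples_leaf=10, random_state=0).fit(X, y)
print(export_text(tree))
```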

Latent Semantic Analysis (LSA) is

a common feature engineering technique that uses Singular Value Decomposition (SVD) to represent documents in a reduced k-dimensional space.
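A rough sketch of LSA via truncated SVD in scikit-learn (the toy documents and k = 2 are illustrative assumptions):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = ["patient quit smoking", "smoking one pack daily", "cough and fever"]
X = CountVectorizer().fit_transform(docs)  # sparse document-term counts

# Map documents into a reduced k-dimensional space; dimensions beyond k
# (thought to represent noise) are discarded.
lsa = TruncatedSVD(n_components=2, random_state=0)
X_reduced = lsa.fit_transform(X)
print(X_reduced.shape)  # (3 documents, k = 2 dimensions)
```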

For each type of task,

a small sample of algorithms is provided along with examples of their use.

IDF is

an unsupervised function which values less common terms.
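One common form of IDF, where N is the number of documents in the corpus and df(t) is the number of documents containing term t:

IDF(t) = log(N / df(t))

A term found in every document receives log(N/N) = log(1) = 0 (no weight), while a term found in only a single document receives the largest weight, log(N).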

New records are

classified based on the side of the hyperplane on which they are located.

Supervised tasks include

classifying documents into one or more classes, where documents and their class membership are known. For example, discharge summaries may be classified as related to a fall-related injury (FRI) or not, based on characteristics from known FRI and non-FRI discharge summaries.

Unsupervised means

data exists without any answer available. For these algorithms, the models associate or group data based on common characteristics.

Support Vector Machines (SVM)

define a linear or non-linear boundary, known as a hyperplane, that separates records from each class in n-dimensional space.

A sample of possible rules is shown in the table below,

derived from the data above. If we set support to be ≥ 0.4 (2/5), then only the first three rules would remain. If we also set the confidence to be ≥ 0.6 (2/3), then only the first and third rules would remain.

The margin is

determined by finding support vectors (or records) which delineate the boundaries of each class.

Supervised global weighting functions require

document class membership to be known, whereas unsupervised functions do not.

Association rule mining is concerned with

finding relationships between variables. (Note, association rules find co-occurrence, not causality.) Association rules are in the form if --> then (e.g., if a person buys bread then they tend to buy milk also), where "if" is the antecedent or left-hand-side and "then" is the consequent or right-hand-side.

The purpose of separating the data is to

have an unseen dataset to evaluate the trained text mining model.

The global component describes

how informative a term is across all documents.

The local component describes

how informative a term is to a document.

The right-hand side of the figure presents the global weights for each term calculated using two functions:

inverse document frequency (IDF) and chi-square (X2).

As IDF is unsupervised,

it does not make use of class membership. Thus, Terms 2 and 5 are both given the same weight because they occur in two of the four documents.

This algorithm looks at the k-nearest neighbors in n-dimensional space.

k is a parameter set by the user performing data mining. n is simply the number of attributes (excluding the class variable) available in the dataset. Whenever a new record is encountered, its distance to every other record with class information is calculated. (Generally, a measure such as Euclidean distance is used to calculate distance.) The classes assigned to those k closest neighbors are examined and the new record is assigned to the majority class.
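A minimal Python sketch of this procedure (the toy records and k = 3 are illustrative assumptions):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, new_record, k=3):
    """Assign new_record to the majority class of its k nearest neighbors."""
    # Euclidean distance from the new record to every labeled record.
    dists = np.linalg.norm(X_train - new_record, axis=1)
    nearest = np.argsort(dists)[:k]
    # Majority vote among the k closest neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy two-attribute (x1, x2) records with known classes.
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array(["blue", "blue", "blue", "red", "red", "red"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))  # -> blue
```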

k-Means Clustering

k-Means Clustering groups data into k clusters, where k is a user-specified parameter.

The matrix is weighted via the product of three weighting components:

local, global, and normalization.
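In notation (the symbols here are a common convention, not from the original notes), the final weight of term i in document j is w(i,j) = L(i,j) × G(i) × N(j), where L is the local component, G the global component, and N the normalization component.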

The second figure shows the assignment of the remaining records to their closest cluster.

In this step, each record calculated the distance to each of the five cluster centers and picked the closest cluster (designated by different colored dots). Once all records have assigned themselves to a cluster, the center of the cluster is re-calculated. The lines in the second figure show how the center of each cluster has changed.

Data mining is

one of the core elements of the KDD process. Data mining is the "...exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules" (Berry and Linoff, 1997). (Note, although this section refers to just data mining, many of the concepts introduced here are also relevant to text mining.)

The knowledge discovery in databases (KDD) process

outlines the basic steps involved with any data or text mining project.

A hyperplane is

placed in the middle of the margin.

Split validation takes a

proportion of documents for the training set and the remainder are placed in the test set.

Supervised means data is

provided to the data mining algorithm with the correct answer (e.g., whether the patient responded to the given treatment). The data mining algorithm then generates a model based on learning from the provided data.

Feature selection aims to

retain helpful features while removing terms that are unlikely to aid in classification.

Records which are at the furthest edge of either class are

selected as support vectors, which define the margin.

Two common methods of separating the data include

split validation and x-fold cross-validation.

Two major types of data mining tasks:

supervised and unsupervised learning.

STM, like data mining, can be broken down into two tasks:

supervised and unsupervised.

The table below lists some common statistics generated from

the 2x2 contingency table to evaluate model performance.
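Typical examples computed from the four cells (these are standard definitions, not necessarily the exact list in the original table):

Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision (positive predictive value) = TP / (TP + FP)
Recall (sensitivity) = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1 = 2 × (Precision × Recall) / (Precision + Recall)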

Feature engineering involves

the creation of new features.

x-fold cross-validation splits

the data into roughly x equally-sized folds.

Once a database has been implemented and is in use,

the data within that database can be used for more than just supporting day-to-day operations.

Starting at the root of the tree,

the first decision point relies on the value for the outlook attribute. If the outlook is overcast, then we should go outside and play ball. If the outlook is either sunny or rainy then the decision tree takes us to another decision point. Based on the value of either humidity or wind conditions, a final decision is produced by the model.

The normalization component reduces

the impact of document length.

For instance,

the linear regression equation below illustrates the health rating for a meal (Y) based on its fat (X1), fiber (X2), and sugar content (X3).

Regression deals with

the prediction of a value, rather than a class. Given a set of variables, X1, X2, ..., Xn, the goal is to predict the value of variable Y. To meet that goal, coefficients for the variables are calculated based on the data provided. The equation below is an example of linear regression, where a0, a1, ..., an are the coefficients determined by the algorithm. When a new record is encountered, its values for the variables X1, X2, ..., Xn are put into the formula to calculate Y.
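Based on that description, the equation takes the form:

Y = a0 + a1X1 + a2X2 + ... + anXn

For the meal example above, Y (the health rating) would be computed by applying the fitted coefficients to X1 (fat), X2 (fiber), and X3 (sugar).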

k-Nearest Neighbors operates under

the principle that similar data should belong to the same class, where similarity is based on distance.

Within the global weighting component,

the weighting functions may be supervised or unsupervised.

The hyperplane is

then placed in the middle of the largest margin found.

The goal of the SVM algorithm is

to define a hyperplane that has the largest separation between classes, otherwise known as maximizing the margin.
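A minimal sketch with scikit-learn (the synthetic data and parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two attributes (X1, X2) and two classes, as in the example discussed here.
X, y = make_blobs(n_samples=40, centers=2, random_state=0)

# A linear kernel learns a linear hyperplane; C controls how much slack is
# allowed in how perfect the separation must be.
svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("Support vectors per class:", svm.n_support_)
print("Predicted class for a new record:", svm.predict([[0.0, 5.0]]))
```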

The purpose of statistical text mining (STM) is

to extract patterns from documents.

Each cell of the table represents the number of documents which are

true positives (TPs), false positives (FPs), false negatives (FNs), and true negatives (TNs).

Unsupervised tasks include

uncovering patterns where no class membership of documents is required. For example, clustering may be used to find groups of documents that share similar characteristics to one another (e.g., patterns of word usage).

A number of classification algorithms are available,

with the selection of the "best" algorithm empirically chosen based on prediction performance.

