ISM 6404 FINAL- Chapters 4, 5 & 7
Reducing dimensionality of the matrix
-A domain expert goes through the list of terms and eliminates those that do not make much sense for the context of the study (this is a manual, labor-intensive process). -Eliminate terms with very few occurrences in very few documents. -Transform the matrix using SVD.
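A minimal sketch of the second option above (dropping terms with very few occurrences in very few documents). The thresholds `min_total` and `min_docs` and the toy counts are illustrative assumptions, not values from the course material.

```python
# Toy term-document data: one dict of term counts per document (made-up numbers).
tdm = [
    {"risk": 2, "project": 1, "zebra": 1},
    {"risk": 1, "project": 3},
    {"project": 2, "software": 1, "risk": 1},
]

def drop_rare_terms(matrix, min_total=2, min_docs=2):
    """Keep terms occurring at least min_total times in at least min_docs documents."""
    totals, doc_counts = {}, {}
    for row in matrix:
        for term, count in row.items():
            totals[term] = totals.get(term, 0) + count
            doc_counts[term] = doc_counts.get(term, 0) + 1
    keep = {t for t in totals if totals[t] >= min_total and doc_counts[t] >= min_docs}
    # Rebuild each document row with only the surviving terms.
    return [{t: c for t, c in row.items() if t in keep} for row in matrix]

reduced = drop_rare_terms(tdm)
```

Here "zebra" and "software" each appear once in a single document, so both are removed and the matrix shrinks from four term columns to two.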
Difference between SEMMA and CRISP-DM
-CRISP-DM takes a more comprehensive approach—understanding of the business and the relevant data—to data mining projects -SEMMA implicitly assumes that the data mining project's goals and objectives along with the appropriate data sources have been identified and understood
Text mining and security
-FBI, CIA data warehouse
Removing information (often in different formats and deep in databases)
-Sophisticated new tools, including advanced visualization tools, help to remove the information ore buried in corporate files or archival public records. Finding it involves massaging and synchronizing the data to get the right results. Cutting-edge data miners are also exploring the usefulness of soft data (i.e., unstructured text stored in such places as Lotus Notes databases, text files on the Internet, or enterprise-wide intranets) -sometimes necessary to use parallel processing for data mining -often combined with spreadsheets and other software development tools -striking it rich involves finding an unexpected result and requires the end users to think creatively throughout the process
Today's current computer technology
-advancing faster than anything else, both on the hardware and software front -text mining & analytics enables this -rapid growth in the amount of data collected, stored, and made available -virtually unstructured (doubling in size every 18 months)
Prediction (DM task)
-associated with forecasting (data and model based)
1. Establish the corpus (Text Mining Process)
-collect all documents related to the context being studied (XML, emails, webpages, notes, transcribed voice messages) -once collected, documents are transformed and organized into the same form for computer processing (ie. digital text excerpts in a file folder or webpages in a domain)
2. Create the term-document matrix (Text Mining Process)
-digitized and organized documents (the corpus) are used to create the TDM -rows: documents -columns: terms -indices: relationships between terms and documents
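A small sketch of building a term-document matrix with documents as rows, terms as columns, and raw counts as the indices. The two-document corpus and the simple lowercase regex tokenizer are assumptions for illustration only.

```python
import re
from collections import Counter

# Tiny illustrative corpus (assumed, not from the notes).
corpus = [
    "Project risk grows with project size",
    "Software risk is hard to manage",
]

def tokenize(text):
    """Lowercase the text and keep runs of letters as terms."""
    return re.findall(r"[a-z]+", text.lower())

counts = [Counter(tokenize(doc)) for doc in corpus]  # one term counter per document
terms = sorted(set().union(*counts))                 # column headers
tdm = [[c[t] for t in terms] for c in counts]        # rows = documents, cells = counts
```

Each row of `tdm` lines up with `terms`, so `tdm[0][terms.index("project")]` gives the count of "project" in the first document.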
bag-of-words model
-early form of text mining -text (sentence, para, or doc) is represented as a collection of words, disregarding grammar or order -still used in some simple classification tools
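A quick sketch of why bag-of-words disregards order: two sentences with opposite meanings produce the same bag. Splitting on whitespace is an assumed simplification.

```python
from collections import Counter

def bag_of_words(text):
    """Represent text as word counts, ignoring grammar and word order."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
same_bag = (a == b)  # both sentences contain exactly the same words
```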
Customer Relationship Management (CRM) (data mining application)
-goal is to create one-on-one relationships with customers by developing an intimate understanding of their needs and wants. As businesses build relationships with their customers over time through a variety of interactions (e.g., product inquiries, sales, service requests, warranty calls, product reviews, social media connections), they accumulate tremendous amounts of data. When combined with demographic and socioeconomic attributes, this information-rich data can be used to (1) identify most likely responders/buyers of new products/services (i.e., customer profiling); (2) understand the root causes of customer attrition to improve customer retention (i.e., churn analysis); (3) discover time-variant associations between products and services to maximize sales and customer value; and (4) identify the most profitable customers and their preferential needs to strengthen relationships and to maximize sales.
Text mining and marketing
-increase cross-selling and up-selling by analyzing the unstructured data generated by call centers (text generated notes or call transcriptions) -predict customer perceptions and purchasing behavior
Why Data Mining has gained attention in the business world?
-more intense competition at the global scale driven by customers' ever-changing needs and wants in an increasingly saturated marketplace -general recognition of the untapped value hidden in large data sources -consolidation of databases -exponential increase in data processing and storage capabilities -reduction of hardware/software cost -movement into demassification
classification/text categorization (extraction method)
-most common -for a given set of categories and collection of text documents, the goal is to find the correct topic for each document using models developed with a training data set that includes both the documents and actual document categories -main approaches: 1. knowledge engineering: an expert's knowledge is encoded into the system either declaratively or as procedural classification rules 2. machine learning (more popular): a general inductive process builds a classifier by learning from a set of preclassified examples -used for: automatic & semiautomatic (interactive) indexing of text, spam filtering, web page categorization under hierarchical catalogs, automatic generation of metadata, detection of genre
Classification (DM task)
-most common task -objective is to analyze historical data and automatically generate a model that can predict future behavior -related category of tools is rule induction -recent techniques such as SVM, rough sets, and genetic algorithms are gradually finding their way into the arsenal of classification algorithms. tools include: -neural networks (machine learning): more effective with a large # of variables/complex relationships, but need lots of training, which limits their use in data-rich domains -decision trees (machine learning): classify data into a finite # of classes based on the values of the input variables, hierarchy of if-then statements, faster than neural networks, appropriate for categorical & interval data, continuous variables require discretization (converting variables to ranges and categories) -logistic regression and discriminant analysis (from traditional statistics), and emerging tools such as rough sets, support vector machines (SVMs), and genetic algorithms
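A minimal sketch of the discretization step mentioned for decision trees (turning a continuous variable into ranges/categories). The cut points and labels are made-up assumptions.

```python
def discretize(value, cut_points, labels):
    """Return the label of the first range whose upper cut point exceeds value."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]  # value is at or above the last cut point

AGE_CUTS = [30, 60]                         # assumed boundaries
AGE_LABELS = ["young", "middle", "senior"]  # one more label than cut points

bins = [discretize(age, AGE_CUTS, AGE_LABELS) for age in [22, 30, 59, 75]]
```

After this step the tree can split on the category (young / middle / senior) instead of on the raw number.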
Cross Industry Standard Process for Data Mining (CRISP-DM)
-most popular -backtracking occurs often, focus on earlier steps 1. business understanding- what the study is for 2. data understanding- tasks, quantitative/qualitative/ordinal/nominal 3. data preparation- most time/effort (80%) 4. model building 5. testing and evaluation- critical/challenging, interaction needed, visualization 6. deployment- customer not analyst, maintenance
Challenges associated with the implementation of NLP
-part of speech tagging -text segmentation -word sense disambiguation -syntactic ambiguity -imperfect or irregular input -speech acts
Categories & examples of linguistic features used in deception detection
-quantity: verb count, noun-phrase count -complexity: average number of clauses, average sentence length -uncertainty: modifiers, modal verbs -nonimmediacy: passive voice, objectification -expressivity: emotiveness -diversity: lexical diversity, redundancy -informality: typographical error ratio -specificity: spatiotemporal and perceptual information -affect: positive affect, negative affect
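A rough sketch of computing a few of these cues from a statement. The exact definitions used here (word count for quantity, words per sentence for complexity, unique words divided by total words for lexical diversity) are assumptions; the notes do not give formulas.

```python
import re

def linguistic_features(text):
    """Compute simple cue values; assumes the text has at least one word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "word_count": len(words),                              # quantity
        "avg_sentence_length": len(words) / len(sentences),    # complexity
        "lexical_diversity": len(set(words)) / len(words),     # diversity
    }

features = linguistic_features("I was home. I was home all night.")
```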
Data mining v statistics
-similar and both look for relationships within data -Most people call statistics the "foundation of data mining." -The main difference between the two is that statistics starts with a well-defined proposition and hypothesis, whereas data mining starts with a loosely defined discovery statement. Statistics collects sample data (i.e., primary data) to test the hypothesis, whereas data mining and analytics use all the existing data (i.e., often observational, secondary data) to discover novel patterns and relationships. Another difference comes from the size of data that they use. Data mining looks for data sets that are as "big" as possible, whereas statistics looks for the right size of data (if the data is larger than what is needed/required for the statistical analysis, a sample of the data is used). The meaning of "large data" is rather different between statistics and data mining. A few hundred to a thousand data points are large enough to a statistician, but several million to a few billion data points are considered large for data mining studies.
Data mining seeks to identify 4 types of patterns
1. Associations- find the commonly co-occurring groupings of things, such as beer and diapers going together in market-basket analysis 2. Predictions-tell the nature of future occurrences of certain events based on what has happened in the past, such as predicting the winner of the Super Bowl or forecasting the absolute temperature of a particular day 3. Clusters- identify natural groupings of things based on their known characteristics, such as assigning customers in different segments based on their demographics and past purchase behaviors 4. Sequential relationships- discover time-ordered events, such as predicting that an existing banking customer who already has a checking account will open a savings account followed by an investment account within a year
Banking (data mining application)
(1) automating the loan application process by accurately predicting the most probable defaulters, (2) detecting fraudulent credit card and online banking transactions, (3) identifying ways to maximize customer value by selling them products and services that they are most likely to buy, and (4) optimizing the cash return by accurately forecasting the cash flow on banking entities (e.g., ATM machines, banking branches)
Insurance (data mining application)
(1) forecast claim amounts for property and medical coverage costs for better business planning, (2) determine optimal rate plans based on the analysis of claims and customer data, (3) predict which customers are more likely to buy new policies with special features, and (4) identify and prevent incorrect claim payments and fraudulent activities
Government and defense (data mining application)
(1) forecast the cost of moving military personnel and equipment; (2) predict an adversary's moves and, hence, develop more successful strategies for military engagements; (3) predict resource consumption for better planning and budgeting; and (4) identify classes of unique experiences, strategies, and lessons learned from military operations for better knowledge sharing throughout the organization
Healthcare (data mining application)
(1) identify people without health insurance and the factors underlying this undesired phenomenon, (2) identify novel cost-benefit relationships between different treatments to develop more effective strategies, (3) forecast the level and the time of demand at different service locations to optimally allocate organizational resources, and (4) understand the underlying reasons for customer and employee attrition
Retailing and Logistics (data mining application)
(1) predict accurate sales volumes at specific retail locations to determine correct inventory levels; (2) identify sales relationships between different products (with market-basket analysis) to improve the store layout and optimize sales promotions; (3) forecast consumption levels of different product types (based on seasonal and environmental conditions) to optimize logistics and, hence, maximize sales; and (4) discover interesting patterns in the movement of products (especially for the products that have a limited shelf life because they are prone to expiration, perishability, and contamination) in a supply chain by analyzing sensory and radio-frequency identification (RFID) data
Manufacturing and production (data mining application)
(1) predict machinery failures before they occur through the use of sensory data (enabling what is called condition-based maintenance); (2) identify anomalies and commonalities in production systems to optimize manufacturing capacity; and (3) discover novel patterns to identify and improve product quality
Integrate shallow and deep knowledge (overarching principle of deep QA)
(latest and greatest) Balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies
stop words (text mining)
(noise words) words that are filtered out prior to or after processing of natural language data (text)- there is no universally accepted list of stop words, most NLP tools use a list that includes articles (a, an, the), prepositions (of), auxiliary verbs (is, are, was, were), & context specific words deemed to not have value
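A small sketch of stop-word filtering. The stop list below is an illustrative assumption, since there is no universally accepted list.

```python
# Illustrative stop list only; real tools ship their own, usually longer, lists.
STOP_WORDS = {"a", "an", "the", "of", "is", "are", "was", "were"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

filtered = remove_stop_words("The cost of the project was high".split())
```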
Data Mining Tasks
1. prediction 2. association 3. clustering (segmentation)
Text-based deception-detection process
1. researchers prepare data for processing- handwritten statements are transcribed into a word processing file 2. features (cues) are identified that represent categories or types of language that are independent of the text content and can be readily analyzed
Visualization and time-series forecasting
2 techniques in data mining: -Visualization: used in conjunction with other data mining techniques to gain a clearer understanding of underlying relationships. As the importance of visualization has increased in recent years, a new term, visual analytics, has emerged. The idea is to combine analytics and visualization in a single environment for easier and faster knowledge creation -Time-series forecasting: the data consists of values of the same variable that is captured and stored over time in regular intervals. These data are then used to develop forecasting models to extrapolate the future values of the same variable
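A minimal sketch of extrapolating the next value of a series. A simple moving average is used here as an assumed stand-in; the notes do not name a specific forecasting model, and the sales figures are made up.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("need at least `window` observations")
    return sum(series[-window:]) / window

monthly_sales = [100, 110, 120, 130]  # same variable, regular intervals
next_month = moving_average_forecast(monthly_sales)
```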
Natural Language Processing (NLP)
A technology that converts human language (structured or unstructured) into data that can be translated then manipulated by computer systems; branch of artificial intelligence & computational linguistics -goal to move beyond syntax-driven text manipulation to a true understanding -important component of text mining
Deep QA
An architecture with an accompanying methodology and overlapping approaches that bring strengths to bear contributing to improvements in accuracy, confidence and speed. Overarching principles are massive parallelism, many experts, pervasive confidence estimation, and integration of the latest & greatest in text analytics. Ex) Watson on Jeopardy!
deception detection
Applying text mining to a large set of real-world criminal (POI) statements
topic tracking (text mining AA)
Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user
Brokerage and securities trading (data mining application)
Brokers and traders use data mining to (1) predict when and how much certain bond prices will change; (2) forecast the range and direction of stock fluctuations; (3) assess the effect of particular issues and events on overall market movements; and (4) identify and prevent fraudulent activities in securities trading
Computer hardware and software (data mining application)
Data mining can be used to (1) predict disk drive failures well before they actually occur, (2) identify and filter unwanted Web content and e-mail messages, (3) detect and prevent computer network security breaches and (4) identify potentially unsecure software products
Homeland security and law enforcement (data mining application)
Data mining has a number of homeland security and law enforcement applications. Data mining is often used to (1) identify patterns of terrorist behaviors (see Application Case 4.3 for an example of the use of data mining to track funding of terrorists' activities); (2) discover crime patterns (e.g., locations, timings, criminal behaviors, and other related attributes) to help solve criminal cases in a timely manner; (3) predict and eliminate potential biological and chemical attacks to the nation's critical infrastructure by analyzing special-purpose sensory data; and (4) identify and stop malicious attacks on critical information infrastructures (often called information warfare)
Travel industry (data mining application)
Data mining has a variety of uses in the travel industry. It is successfully used to (1) predict sales of different services (seat types in airplanes, room types in hotels/resorts, car types in rental car companies) in order to optimally price services to maximize revenues as a function of time-varying transactions (commonly referred to as yield management); (2) forecast demand at different locations to better allocate limited organizational resources; (3) identify the most profitable customers and provide them with personalized services to maintain their repeat business; and (4) retain valuable employees by identifying and acting on the root causes for attrition.
Entertainment industry (data mining application)
Data mining is successfully used by the entertainment industry to (1) analyze viewer data to decide what programs to show during prime time and how to maximize returns by knowing where to insert advertisements, (2) predict the financial success of movies before they are produced to make investment decisions and to optimize the returns, (3) forecast the demand at different locations and different times to better schedule entertainment events and to optimally allocate resources, and (4) develop optimal pricing policies to maximize revenues
part of speech tagging (NLP)
It is difficult to mark up terms in a text as corresponding to a particular part of speech (nouns, verbs) b/c the part of speech depends on the definition and context
Massive parallelism (overarching principle of deep QA)
Exploits massive parallelism in the consideration of multiple interpretations and hypotheses
Many experts (overarching principle of deep QA)
Facilitate the integration, application, and contextual evaluation of a wide range of loosely coupled probabilistic question and content analytics
Simple term-document matrix
First generation of the TDM- filtering -Stop terms or stop words- Exclude non-important terms such as articles, auxiliary verbs, terms used in all documents -Include terms or dictionary- predetermined terms under which the documents are to be indexed -Synonyms & specific phrases can be included -stemming
Information extraction (text mining AA)
Identification of key phrases and relationships within text by looking for predefined objects and sequences in text by pattern matching
Text Mining Process (3 steps)
If the output of a task is not what is expected, a backward redirection to the previous task execution is necessary 1. Establish the corpus 2. Create the term-document matrix 3. Extract the knowledge
Text mining and academic applications
Important to publishers who hold large databases of information requiring indexing for better retrieval
KDD data mining process
Knowledge discovery in databases -using data mining methods to find useful information and patterns in the data, as opposed to data mining, which involves using algorithms to identify patterns in data -comprehensive process that encompasses data mining -input is organizational data 1. data selection 2. data preprocessing 3. data transformation 4. data mining 5. interpretation/evaluation
NLP tasks
NLP has been applied to a variety of domains for a wide range of tasks via computer programs to automatically process natural human language that previously could only be done by humans. tasks: -question answering -automatic summarization -natural language generation: converts info from computer databases into readable human language -natural language understanding: converts samples of human language into more formal representations that are easier for the computer to manipulate -machine translation: one language to another -foreign language reading: assists nonnative user -foreign language writing -speech recognition -text-to-speech -text proofing -optical character recognition: pictures into text
Pervasive confidence estimation (overarching principle of deep QA)
No component commits to an answer; all components produce features and associated confidences, scoring different question & content interpretations. An underlying confidence-processing substrate learns how to stack and combine the scores
Data Mining
Process: implies that data mining comprises many iterative steps Nontrivial: means that some experimentation-type search or inference is involved, that is, not as straightforward as a computation of predefined quantities Valid: means that the discovered patterns should hold true on new data with a sufficient degree of certainty Novel: the patterns are not previously known to the user within the context being analyzed Potentially useful: the discovered patterns should lead to some benefit Ultimately understandable: makes business sense
SEMMA process
Sample, Explore, Modify, Model, and Assess -easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, model the variables to predict outcomes, and confirm a model's accuracy -driven by highly iterative experimentation cycle (as with CRISP-DM) -allows developer to determine how to model new questions raised by the previous results and thus proceed back to the exploration phase for additional refinement of the data
Text analytics v. text mining
Text Analytics: broader concept that includes information retrieval (searching & identifying relevant documents for a given set of key terms) and information extraction, data mining, and web mining. -Relatively new term -More commonly used in a business application context Text Mining: primarily focused on discovering new and useful knowledge from the textual data sources -More commonly used in academic research circles Both may be defined differently or synonymously
Text Analytics Formula
Text analytics= Information retrieval + information extraction + data mining + web mining Text analytics= information retrieval + text mining
Text mining (text data mining or knowledge discovery in textual databases)
The semiautomated process of extracting patterns (useful information and knowledge) from large amounts of unstructured data sources -Same as data mining except the input to process is a collection of unstructured data files (ie. word docs, pdf, xml) Step 1: imposing structure on the text-based data sources Step 2: Extracting relevant information & knowledge from this structured text-based data using data mining techniques & tools Benefits: Obvious in the areas where very large amounts of textual data are being generated (law- court orders, academic research, finance-quarterly reports, medicine- discharge summaries, biology- molecular interactions, technology- patent files, marketing- customer comments) and electronic communications- email (filter junk and prioritize mail)
Medicine (data mining application)
Use of data mining in medicine should be viewed as an invaluable complement to traditional medical research, which is mainly clinical and biological in nature. Data mining analyses can (1) identify novel patterns to improve survivability of patients with cancer, (2) predict success rates of organ transplantation patients to develop better organ donor matching policies, (3) identify the functions of different genes in the human chromosome (known as genomics), and (4) discover the relationships between symptoms and illnesses (as well as illnesses and successful treatments) to help medical professionals make informed and correct decisions in a timely manner
How data mining works
Using existing and relevant data obtained from within and outside the organization, data mining builds models to discover patterns among the attributes presented in the data set. Models are the mathematical representations (simple linear relationships/affinities and/or complex and highly nonlinear relationships) that identify the patterns among the attributes of the things (ex. customers, events) described within the data set. Some of these patterns are explanatory (explaining the interrelationships and affinities among the attributes), whereas others are predictive (foretelling future values of certain attributes)
morphology (text mining)
a branch of the field of linguistics and a part of NLP that studies the internal structure of words (patterns of word formation within or across a language/s)
term dictionary (text mining)
a collection of terms specific to a narrow field that can be used to restrict the extracted terms within a corpus
term-by-document/occurrence matrix (text mining)
a common representation schema of the frequency-based relationship btw terms & documents in a tabular format where terms are listed in columns, documents are listed in rows and the frequency btw the terms and docs is listed in cells as integer values
singular value decomposition/latent semantic indexing (text mining)
a dimensionality reduction method used to transform the term-by-document matrix to a manageable size by generating an intermediate representation of the frequencies using a matrix manipulation method similar to principal component analysis
speech acts (NLP)
a sentence can be an action and the structure may not show this (ex. can you pass the class? can you pass the salt?)
Tokenizing (text mining)
a token is a categorized block of text in a sentence. the block of text corresponding to the token is categorized according to the function it performs- has to be useful part of structured text
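A short sketch of tokenizing a sentence into categorized blocks of text. The three categories (NUMBER, WORD, PUNCT) and the regular expression are assumptions chosen for illustration.

```python
import re

# Each named group is one assumed token category.
TOKEN_PATTERN = re.compile(
    r"(?P<NUMBER>\d+(?:\.\d+)?)|(?P<WORD>[A-Za-z]+)|(?P<PUNCT>[^\w\s])"
)

def tokenize(sentence):
    """Return (category, text) pairs for every token found in the sentence."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(sentence)]

tokens = tokenize("Sales rose 4.5 percent.")
```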
imperfect or irregular input (NLP)
accents/vocal impediments in speech & typos/grammatical errors in text
Supervised v unsupervised learning algorithms
based on how the patterns are extracted, the learning algorithms of data mining methods are either: Supervised- training data includes both the descriptive attributes (i.e., independent variables or decision variables) as well as the class attribute (i.e., output variable or result variable) Unsupervised- training data includes only the descriptive attributes
Methods for performing data mining studies
classification, regression, clustering, association -most software tools employ more than one
concept linking (text mining AA)
connects related documents by identifying their shared concepts and by doing so, helps users find information that they perhaps would not have found using traditional search methods
Stanford University's NLP lab
developed methods that can automatically identify the concepts and relationships between those concepts in the text. By applying a unique procedure to large amounts of text, their algorithms automatically acquire hundreds of thousands of items of world knowledge and use them to produce significantly enhanced repositories for WordNet
term
extracted directly from the corpus for a specific domain by means of NLP methods
concepts(text mining)
features generated from a collection of documents by means of manual, statistical, rule-based, or hybrid categorization methodology (compared to terms, concepts are from higher level abstraction)
question answering (text mining AA)
finding the best answer to a given question through knowledge-driven pattern matching
syntactic ambiguity (NLP)
grammar for natural languages is ambiguous- multiple sentence structures are often possible -choosing among them requires a fusion of semantic and contextual information
Text mining and biomedical applications
great potential because: 1) published literature & publication outlets are expanding rapidly 2) compared to other fields, medical literature is more standardized and orderly- more mineable 3) terminology used is relatively consistent -location prediction systems
clustering (text mining AA)
grouping similar documents without having a predefined set of categories
categorization (text mining AA)
identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes
corpus (text mining)
in linguistics (plural corpora) is a large and structured set of texts, usually processed electronically now, prepared for the purpose of conducting knowledge discovery
Application areas of text mining
information extraction, topic tracking, summarization, categorization, clustering, concept linking, question answering
WordNet
is a laboriously hand-coded database of English words, their definitions, sets of synonyms and various semantic relations btw synonym sets -major resource for NLP applications but expensive to build/maintain -by automatically inducing knowledge into WordNet, the potential exists to make WordNet an even greater & more comprehensive resource for NLP at a fraction of the cost
stemming (text mining)
is the process of reducing inflected words to their stem (base or root) form
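A deliberately naive sketch of stemming by suffix stripping. The suffix list and the minimum stem length of 3 are assumptions; this is not the Porter algorithm or any production stemmer.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # assumed, intentionally tiny rule set

def naive_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["connected", "connecting", "connects", "connect"]]
```

All four inflected forms collapse to the same stem, so they would share one column in the term-document matrix.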
Data mining in business world
most used in financial, retail and healthcare sectors -detects/reduces fraud, identify customer behaviors, reclaim profitable customers, identify trading rules from historical data, increased profitability using market-basket analysis
3. Extract the knowledge (Text Mining Process)
novel patterns are extracted in the context of the specific problem being addressed- Categories of methods- 1. classification 2. clustering 3. association 4. trend analysis
Miner
often an end user, empowered by data drills and other powerful query tools to ask ad hoc questions and obtain answers quickly, with little or no programming skill
Clustering (DM task)
partitions a collection of things (objects, events, presented in a structured data set) into segments (or natural groupings) whose members share similar characteristics. Unlike in classification, in clustering the class labels are unknown. As the selected algorithm goes through the data set, identifying the commonalities of things based on their characteristics, the clusters are established. Because the clusters are determined using a heuristic-type algorithm, and because different algorithms may end up with different sets of clusters for the same data set, before the results of clustering techniques are put to actual use it may be necessary for an expert to interpret, and potentially modify, the suggested clusters. After reasonable clusters have been identified, they can be used to classify and interpret new data. -include optimization- the goal of clustering is to create groups so that the members within each group have maximum similarity and the members across groups have minimum similarity -most commonly used clustering techniques include k-means (from statistics) and self-organizing maps (from machine learning), which is a unique neural network architecture developed by Kohonen -Cluster analysis is a means of identifying classes of items so that items in a cluster have more in common with each other than with items in other clusters
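A bare-bones sketch of k-means on one variable, to show clusters forming without class labels. The data, the starting centroids, and the fixed iteration count are assumptions; real tools handle many variables and choose starting points differently.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as its cluster mean; keep the old one if empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters

ages = [18, 21, 22, 45, 47, 50]  # illustrative customer ages
centroids, clusters = k_means_1d(ages, centroids=[20.0, 40.0])
```

Different starting centroids can lead to different clusters for the same data, which is why an expert may need to interpret the result before it is used.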
Associations (association rule learning in data mining)
popular and well-researched technique for discovering interesting relationships among variables in large databases. Thanks to automated data-gathering technologies such as bar code scanners, the use of association rules for discovering regularities among products in large-scale transactions recorded by point-of-sale systems in supermarkets has become a common knowledge discovery task in the retail industry. In the context of the retail industry, association rule mining is often called market-basket analysis. Two commonly used derivatives of association rule mining : 1. link analysis- the linkage among many objects of interest is discovered automatically, such as the link between Web pages and referential relationships among groups of academic publication authors 2. sequence mining, relationships are examined in terms of their order of occurrence to identify associations over time -Algorithms used in association rule mining include the popular Apriori (where frequent itemsets are identified) and FP-Growth, OneR, ZeroR, and Eclat
association
refers to direct relationships between terms or set of terms (concepts). The concept set association rule A ⇒ C relating two frequent concept sets A and C can be quantified by the two basic measures of support and confidence. In this case, confidence is the percentage of documents that include all the concepts in C within the same subset of those documents that include all the concepts in A. Support is the percentage (or number) of documents that include all the concepts in A and C. ex) in a document collection the concept "Software Implementation Failure" may appear most often in association with "Enterprise Resource Planning" and "Customer Relationship Management" with significant support (4%) and confidence (55%), meaning that 4% of the documents had all three concepts represented together in the same document, and of the documents that included "Software Implementation Failure," 55% of them also included "Enterprise Resource Planning" and "Customer Relationship Management." used: published literature
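A small sketch of computing support and confidence as defined above. The five toy documents and the abbreviations SIF, ERP, and CRM (for the three concepts in the example) are assumptions; the numbers do not reproduce the 4% / 55% figures.

```python
# Each document is the set of concepts it contains (toy data).
docs = [
    {"SIF", "ERP", "CRM"},
    {"SIF", "ERP"},
    {"ERP", "CRM"},
    {"SIF", "ERP", "CRM"},
    {"CRM"},
]

def support(docs, concepts):
    """Fraction of documents that contain every concept in `concepts`."""
    return sum(concepts <= d for d in docs) / len(docs)

def confidence(docs, a, c):
    """Of the documents containing all of A, the fraction that also contain all of C.
    Assumes at least one document contains A."""
    return support(docs, a | c) / support(docs, a)

A = {"SIF"}
C = {"ERP", "CRM"}
rule_support = support(docs, A | C)       # 2 of 5 documents
rule_confidence = confidence(docs, A, C)  # 2 of the 3 documents that contain SIF
```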
Indices
relational measure that can be as simple as the number of occurrences of the term in respective documents After extracting terms, one has to decide the following: (1) What is the best representation of the indices? and (2) How can we reduce the dimensionality of this matrix to a manageable size? As opposed to showing the actual frequency counts, the numerical representation between terms and documents can be normalized using a number of alternative methods: log frequencies, binary frequencies, and inverse document frequencies (TF/IDF) -terms that appear with high frequency across most documents are not good discriminators
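A small sketch of the three normalizations applied to raw counts. The formulas below are one common formulation (log frequency as 1 + ln(wf), inverse document frequency as the log frequency multiplied by ln(N / df)); the notes do not spell them out, so treat them as an assumption, along with the toy term-document matrix and function names.

```python
import math

def log_freq(wf):
    """Dampen raw counts: 1 + ln(wf) for wf > 0, else 0."""
    return 1 + math.log(wf) if wf > 0 else 0.0

def binary_freq(wf):
    """Record only whether the term occurs in the document."""
    return 1 if wf > 0 else 0

def tf_idf(wf, n_docs, doc_freq):
    """Log frequency scaled by how rare the term is across documents."""
    if wf == 0:
        return 0.0
    return (1 + math.log(wf)) * math.log(n_docs / doc_freq)

terms = ["project", "risk"]
# Rows are documents, columns follow `terms` (raw occurrence counts, toy data).
matrix = [
    [3, 1],
    [1, 0],
    [2, 0],
]
n_docs = len(matrix)
# Number of documents in which each term appears at least once.
doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(len(terms))]

weights = [
    [tf_idf(row[j], n_docs, doc_freq[j]) for j in range(len(terms))]
    for row in matrix
]
```

"project" occurs in every document, so its weight drops to 0 everywhere, while "risk", which occurs in only one document, keeps a positive weight there.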
Two most popular clustering methods
scatter/gather clustering: a document browsing method that uses clustering to make human browsing of documents more efficient when a specific search query cannot be formulated; generates a table of contents for the user to modify -query-specific clustering: employs a hierarchical clustering approach where the documents most relevant to the posed query appear in small, tight clusters nested in larger clusters containing less similar documents, creating a spectrum of relevance levels among the documents; performs well for large document collections
text segmentation (NLP)
some written languages (Chinese, Japanese, Thai) do not have single-word boundaries; in these instances, the text-parsing task requires the identification of word boundaries, which is difficult -similar challenges in speech segmentation emerge when analyzing spoken language b/c words blend into each other
Data mining uses multiple disciplines
statistics, AI, machine learning, management science, information systems, databases
summarization (text mining AA)
summarizing a document to save time on the part of the reader
synonyms and polysemes (text mining)
synonyms are syntactically different words with identical or at least similar meanings (film and movie) polysemes or homonyms are syntactically identical words with different meanings (ex. bow)
Data Mining
term used to describe discovering or "mining" knowledge from large amounts of data. -process uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge or patterns from large data sets; these patterns can be in the form of business rules, affinities, correlations, trends, or prediction models -literature defines this as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases," where the data is organized in records structured by categorical, ordinal, and continuous variables
CRM and NLP and Wordnet
the goal of CRM is to maximize customer value by better understanding and effectively responding to their actual and perceived needs- an important area of CRM, where NLP is making a significant impact, is sentiment analysis
word frequency (text mining)
the number of times a word is found in a specific document
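A minimal sketch of counting word frequency for one document with the standard library. The sample sentence and the simple letters-only tokenizer are assumptions for illustration; real text mining pipelines typically add stop-word removal and stemming.

```python
import re
from collections import Counter

def word_frequency(document):
    """Count how many times each lowercase word occurs in a single document."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(tokens)

doc = "Text mining turns text into data; data mining finds patterns."
freq = word_frequency(doc)
```

Looking up `freq["mining"]` gives the count for that word in this one document, and a word that never appears returns 0.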
part-of-speech tagging (text mining)
the process of marking up the words in a text as corresponding to a particular part of speech (nouns, verbs) based on a word's definition & context
Sports (data mining application)
to optimally utilize their limited resources for a winning season
Goal of both text analytics and text mining
to turn unstructured textual data into actionable information through the application of NLP
unstructured data v structured data (text mining)
unstructured: does not have a predetermined format and is stored in the form of textual documents (for humans to process and understand) -structured: has a predetermined format, usually organized into records w simple data values (categorical, ordinal, & continuous variables) stored in databases (for computers to process and understand)
clustering (extraction method)
unsupervised process where objects are classified into "natural" groups called clusters *no prior knowledge; relevant docs tend to be more similar to each other than to irrelevant docs used: document retrieval to enable better web content searches, very large text collections like web pages -improved search recall: clustering is based on overall similarity rather than the presence of a single term, so when a query matches a document its whole cluster is returned -improved search precision: grouping documents into smaller groups of related documents, ordering them by relevance and returning only the documents from the most relevant groups -scatter/gather clustering and query-specific clustering
Data mining environment
usually a client/server architecture or web-based IS architecture
word sense disambiguation (NLP)
determining which meaning applies for words with more than one meaning; have to consider context