ISDS 2001 Test Four: Chapter 4
Why should retailers, especially omni-channel retailers, pay extra attention to advanced analytics and data mining?
Utilizing large and information-rich transactional and customer data (that they collect on a daily basis) to optimize their business processes is not a choice for large-scale retailers anymore, but a necessity to stay competitive.
Do you think data mining, while essential for fighting terrorist cells, also jeopardizes individuals' rights of privacy?
Yes, because it inevitably involves tracking personal and financial data of individuals. (As an opinion question, students' answers will vary.)
How do you think Hollywood performed, and perhaps still performs, this task without the help of data mining tools and techniques?
Most of it is done by gut feel and trial and error. This may keep the movie business a financially risky endeavor, but it also allows for creativity. Sometimes uncertainty is a good thing.
What are the most common data mining mistakes/blunders? How can they be minimized and/or eliminated?
Selecting the wrong problem for data mining. Ignoring what your sponsor thinks data mining is and what it really can and cannot do. Leaving insufficient time for data preparation (it takes more effort than one often expects). Looking only at aggregated results and not at individual records. Being sloppy about keeping track of the mining procedure and results. Ignoring suspicious findings and quickly moving on. Running mining algorithms repeatedly and blindly (it is important to think hard enough about the next stage of data analysis; data mining is a very hands-on activity). Believing everything you are told about data. Believing everything you are told about your own data mining analysis. Measuring your results differently from the way your sponsor measures them. Ways to minimize these risks are basically the reverse of these items.
What are the major data mining processes?
Similar to other information systems initiatives, a data mining project must follow a systematic project management process to be successful. Several data mining processes have been proposed: CRISP-DM, SEMMA, and KDD.
Can you think of other application areas for data mining not discussed in this section? Explain.
Students should be able to identify an area that can benefit from greater prediction or certainty. Answers will vary depending on their creativity.
Why do you think the early phases (understanding of the business and understanding of the data) take the longest in data mining projects?
Students should explain that the early steps are the most unstructured phases because they involve learning. Those phases (learning/understanding) cannot be automated. Extra time and effort are needed upfront because any mistake in understanding the business or data will most likely result in a failed BI project.
What do you think is the most prominent application area for data mining? Why?
Students' answers will differ depending on which of the applications (most likely banking, retailing and logistics, manufacturing and production, government, healthcare, medicine, or homeland security) they think is most in need of greater certainty. Their reasons for selection should relate to the application area's need for better certainty and the ability to pay for the investments in data mining.
Give examples of situations in which classification would be an appropriate data mining technique. Give examples of situations in which regression would be an appropriate data mining technique.
Students' answers will differ, but should be based on the following issues. Classification is for prediction that can be based on historical data and relationships, such as predicting the weather, product demand, or a student's success in a university. If what is being predicted is a class label (e.g., "sunny," "rainy," or "cloudy") the prediction problem is called a classification, whereas if it is a numeric value (e.g., temperature such as 68°F), the prediction problem is called a regression.
What would be your top five selection criteria for a data mining tool? Explain.
Students' answers will differ. Criteria they are likely to mention include cost, user-interface, ease-of-use, computational efficiency, hardware compatibility, type of business problem, vendor support, and vendor reputation.
What do you think are the reasons for these myths about data mining?
Students' answers will differ. Some answers might relate to fear of analytics, fear of the unknown, or fear of looking dumb.
Moving beyond the chapter discussion, where else can association be used?
Students' answers will vary.
What do you think about data mining and its implication for privacy? What is the threshold between discovery of knowledge and infringement of privacy?
There is a tradeoff between knowledge discovery and privacy rights. Retailers should be sensitive about this when targeting their advertising based on data mining results, especially regarding topics that could be embarrassing to their customers. Otherwise they risk offending these customers, which could hurt their bottom line.
Briefly describe the general algorithm used in decision trees.
A general algorithm for building a decision tree is as follows: (1) Create a root node and assign all of the training data to it. (2) Select the best splitting attribute. (3) Add a branch to the root node for each value of the split. Split the data into mutually exclusive (non-overlapping) subsets along the lines of the specific split and move them to the branches. (4) Repeat steps 2 and 3 for each and every leaf node until the stopping criterion is reached (e.g., the node is dominated by a single class label).
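The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's implementation: the answer does not say how the "best" splitting attribute is chosen, so Gini impurity is used here as an assumption, and the function names (gini, best_attribute, build_tree, predict) and the toy weather data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Assumed criterion: pick the attribute with the lowest weighted Gini impurity."""
    def weighted_impurity(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return min(attributes, key=weighted_impurity)

def build_tree(rows, labels, attributes):
    # Stopping criterion: the node holds a single class, or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf = majority class label
    attr = best_attribute(rows, labels, attributes)   # step 2
    remaining = [a for a in attributes if a != attr]
    subsets = {}                                      # step 3: one branch per value
    for row, label in zip(rows, labels):
        branch_rows, branch_labels = subsets.setdefault(row[attr], ([], []))
        branch_rows.append(row)
        branch_labels.append(label)
    # Step 4: repeat for every new node.
    return {"split_on": attr,
            "branches": {value: build_tree(r, l, remaining)
                         for value, (r, l) in subsets.items()}}

def predict(tree, row):
    """Follow branches until a leaf label is reached (unseen values raise KeyError)."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["split_on"]]]
    return tree

# Hypothetical toy data: step 1 assigns all of it to the root node.
rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rainy", "windy": "no"}, {"outlook": "rainy", "windy": "yes"}]
labels = ["play", "play", "play", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(predict(tree, {"outlook": "rainy", "windy": "yes"}))
```

Other splitting measures (for example, information gain) could be substituted in best_attribute without changing the overall loop.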
Is data mining a new discipline? Explain.
Although the term data mining is relatively new, the ideas behind it are not. Many of the techniques used in data mining have their roots in traditional statistical analysis and artificial intelligence work done since the early part of the 1980s. New or increased use of data mining applications makes it seem like data mining is a new discipline. In general, data mining seeks to identify four major types of patterns: Associations, Predictions, Clusters and Sequential relationships. These types of patterns have been manually extracted from data by humans for centuries, but the increasing volume of data in modern times has created a need for more automatic approaches. As datasets have grown in size and complexity, direct manual data analysis has increasingly been augmented with indirect, automatic data processing tools that use sophisticated methodologies, methods, and algorithms. The manifestation of such evolution of automated and semi-automated means of processing large datasets is now commonly referred to as data mining.
What are the main data mining application areas? Discuss the commonalities of these areas that make them a prospect for data mining studies.
Applications are listed near the beginning of section 4.3: CRM, banking, retailing and logistics, manufacturing and production, brokerage and securities trading, insurance, computer hardware and software, government and defense, travel, healthcare, medicine, entertainment, homeland security and law enforcement, and sports. The commonalities are the need for predictions and forecasting for planning purposes and to support decision making.
What are the major application areas for data mining?
Applications are listed near the beginning of this section (pp. 160-161): CRM, banking, retailing and logistics, manufacturing and production, brokerage and securities trading, insurance, computer hardware and software, government and defense, travel, healthcare, medicine, entertainment, homeland security and law enforcement, and sports.
How do you think the discussion between privacy and data mining will progress? Why?
As technology advances and more information about people becomes easier to get, the privacy debate will adjust accordingly. People's expectations about privacy will become tempered by their desires for the benefits of data mining, from individualized customer service to higher security. As with all issues of social import, the privacy issue will include social discourse, legal and legislative decisions, and corporate decisions. The fact that companies often choose to self-regulate (e.g., by ensuring their data is de-identified) implies that we may as a society be able to find a happy medium between privacy and data mining. (Answers will vary by student.)
Give examples of situations in which association would be an appropriate data mining technique.
Association rule mining is appropriate to use when the objective is to discover two or more items (or events or concepts) that go together. Students' answers will differ.
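As a rough illustration of finding items that "go together," the sketch below scores item pairs in a handful of made-up shopping baskets. Support and confidence are the usual measures for association rules, but the answer above does not name them, so treat them (and the pair_rules function, its min_support threshold, and the sample baskets) as assumptions for illustration only.

```python
from itertools import combinations

def pair_rules(transactions, min_support=0.5):
    """Return (lhs, rhs, support, confidence) for item pairs that co-occur often enough."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in combinations(items, 2):
        both = sum(1 for basket in baskets if a in basket and b in basket)
        support = both / n                      # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            lhs_count = sum(1 for basket in baskets if lhs in basket)
            confidence = both / lhs_count       # share of lhs baskets that also contain rhs
            rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical market-basket data.
transactions = [["bread", "milk"], ["bread", "butter"],
                ["bread", "milk", "butter"], ["milk"]]
rules = pair_rules(transactions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```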
List and briefly define the phases in the CRISP-DM process.
CRISP-DM provides a systematic and orderly way to conduct data mining projects. The process has six steps: (1) business understanding and (2) data understanding, which are developed concurrently; (3) data preparation for modeling; (4) model building; (5) testing and evaluation of the model results; and (6) deployment of the models for regular use.
What type of analytics help did Cabela's get from their efforts? Can you think of any other potential benefits of analytics for large-scale retailers like Cabela's?
Cabela's has long relied on SAS statistics and data mining tools to help analyze the data it gathers from sales transactions, market research, and demographic data associated with its large database of customers. Using SAS data mining tools and Teradata, Cabela's analysts create predictive models to optimize customer selection for all customer contacts. Cabela's uses these prediction scores to maximize marketing spending across channels and within each customer's personal contact strategy. These efforts have allowed Cabela's to continue its growth in a profitable manner. In addition, dismantling the information silos, and integration of SAS and Teradata, enabled them to create "a holistic view of the customer." Since this works so well for the sales/customer side of the business, it could also work in other areas as well. Supply chain is one example. Analytics could help produce a "holistic view of the vendor" as well. The clustering and association models helped the company understand the value of customers, using a five-point scale as illustrated in this quote, "We treat all customers well, but we can develop strategies to treat higher-value customers a little better".
What are the sources of data that retailers such as Cabela's use for their data mining projects?
Cabela's uses large and information-rich transactional and customer data (that they collect on a daily basis) to optimize business processes and stay competitive. In addition, through Web mining they track clickstream patterns of customers shopping online.
What was the reason for Cabela's to bring together SAS and Teradata, the two leading vendors in the analytics marketplace?
Cabela's was already using both for different elements of their business. Each of the two systems was producing actionable analysis of data. But because the systems were separate, too much time was spent constructing data marts and bringing together disparate data sources, which kept statisticians from working on analytics. Now, with the integration of the two systems, statisticians can leverage the power of SAS using the Teradata warehouse as one source of information.
What is the main difference between classification and clustering? Explain using concrete examples.
Classification learns patterns from past data (a set of information—traits, variables, features—on characteristics of the previously labeled items, objects, or events) in order to place new instances (with unknown labels) into their respective groups or classes. The objective of classification is to analyze the historical data stored in a database and automatically generate a model that can predict future behavior. Classifying customer-types as likely to buy or not buy is an example. Cluster analysis is an exploratory data analysis tool for solving classification problems. The objective is to sort cases (e.g., people, things, events) into groups, or clusters, so that the degree of association is strong among members of the same cluster and weak among members of different clusters. Customers can be grouped according to demographics.
Identify at least three of the main data mining methods.
Classification learns patterns from past data (a set of information—traits, variables, features—on characteristics of the previously labeled items, objects, or events) in order to place new instances (with unknown labels) into their respective groups or classes. The objective of classification is to analyze the historical data stored in a database and automatically generate a model that can predict future behavior. Cluster analysis is an exploratory data analysis tool for solving classification problems. The objective is to sort cases (e.g., people, things, events) into groups, or clusters, so that the degree of association is strong among members of the same cluster and weak among members of different clusters. Association rule mining is a popular data mining method that is commonly used as an example to explain what data mining is and what it can do to a technologically less savvy audience. Association rule mining aims to find interesting relationships (affinities) between variables (items) in large databases.
What is the major difference between cluster analysis and classification?
Classification methods learn from previous examples containing inputs and the resulting class labels, and once properly trained they are able to classify future cases. Clustering partitions pattern records into natural segments or clusters.
Give examples of situations in which cluster analysis would be an appropriate data mining technique.
Cluster algorithms are used when the data records do not have predefined class identifiers (i.e., it is not known to what class a particular record belongs).
What were the challenges, the proposed solution, and the obtained results?
Crime across the metro area was surging, there were budget pressures, and city leaders were growing impatient. The solution, a project called Blue CRUSH, involves IBM SPSS Modeler, which enables officers to unlock the intelligence hidden in the department's huge digital library of crime records and police reports going back nearly a decade. This has put a serious dent in Memphis area crime. Since the program was launched, the number of Part One crimes—a category of serious offenses including homicide, rape, aggravated assault, auto theft, and larceny—has plummeted, dropping 27 percent from 2006 to 2010.
Define data mining. Why are there many different names and definitions for data mining?
Data mining is the process through which previously unknown patterns in data are discovered. Another definition would be "a process that uses statistical, mathematical, and artificial learning techniques to extract and identify useful information and subsequent knowledge from large sets of data." This includes most types of automated data analysis. A third definition: Data mining is the process of finding mathematical patterns from (usually) large sets of data; these can be rules, affinities, correlations, trends, or prediction models. Data mining has many definitions because the term has been stretched beyond those limits by some software vendors to include most forms of data analysis in order to increase sales using the popularity of data mining.
Why is it important for many Hollywood professionals to predict the financial success of movies?
It is hard to predict box-office receipts for a given movie. The movie industry is the "land of hunches and wild guesses" due to the difficulty associated with forecasting product demand, making the movie business in Hollywood a risky endeavor. If Hollywood could better predict financial success, this would mitigate some of the financial risk.
Define data mining. Why are there many names and definitions for data mining?
Data mining is the process through which previously unknown patterns in data are discovered. Another definition would be "a process that uses statistical, mathematical, and artificial learning techniques to extract and identify useful information and subsequent knowledge from large sets of data." This includes most types of automated data analysis. A third definition: Data mining is the process of finding mathematical patterns from (usually) large sets of data; these can be rules, affinities, correlations, trends, or prediction models. Data mining has many definitions because the term has been stretched beyond those limits by some software vendors to include most forms of data analysis in order to increase sales using the popularity of data mining.
What are the most common myths about data mining?
Data mining provides instant, crystal-ball predictions. Data mining is not yet viable for business applications. Data mining requires a separate, dedicated database. Only those with advanced degrees can do data mining. Data mining is only for large firms that have lots of customer data.
What are the most common myths and mistakes about data mining?
Data mining provides instant, crystal-ball predictions. Data mining is not yet viable for business applications. Data mining requires a separate, dedicated database. Only those with advanced degrees can do data mining. Data mining is only for large firms that have lots of customer data.
Why do you think the most popular tools are developed by statistics companies?
Data mining techniques involve the use of statistical analysis and modeling. So it's a natural extension of their business offerings.
What are the main data preprocessing steps? Briefly describe each step and provide relevant examples.
Data preprocessing is essential to any successful data mining study. Good data leads to good information; good information leads to good decisions. Data preprocessing includes four main steps (listed in Table 4.1 on page 167). Data consolidation: access, collect, select, and filter data. Data cleaning: handle missing data, reduce noise, and fix errors. Data transformation: normalize the data, aggregate data, and construct new attributes. Data reduction: reduce the number of attributes and records, and balance skewed data.
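One relevant example for each of the four steps is sketched below on a tiny, made-up customer dataset. The specific techniques chosen (joining on a customer id, mean imputation, min-max normalization, dropping unused attributes) are assumptions picked for illustration, as are the preprocess function and all field names; they are only one of many ways to carry out each step.

```python
def preprocess(customers, purchases):
    # 1. Consolidation: collect spend per customer and keep only customers with purchases.
    spend = {}
    for purchase in purchases:
        spend[purchase["id"]] = spend.get(purchase["id"], 0.0) + purchase["amount"]
    records = [{**c, "spend": spend[c["id"]]} for c in customers if c["id"] in spend]

    # 2. Cleaning: replace missing ages with the mean of the known ages
    #    (assumes at least one age is known).
    known_ages = [r["age"] for r in records if r["age"] is not None]
    mean_age = sum(known_ages) / len(known_ages)
    for r in records:
        if r["age"] is None:
            r["age"] = mean_age

    # 3. Transformation: min-max normalize total spend into the range [0, 1].
    low = min(r["spend"] for r in records)
    high = max(r["spend"] for r in records)
    for r in records:
        r["spend_norm"] = (r["spend"] - low) / (high - low) if high > low else 0.0

    # 4. Reduction: keep only the attributes needed for modeling.
    return [{"age": r["age"], "spend_norm": r["spend_norm"]} for r in records]

# Hypothetical source data from two separate systems.
customers = [{"id": 1, "age": 30, "name": "A"}, {"id": 2, "age": None, "name": "B"},
             {"id": 3, "age": 50, "name": "C"}, {"id": 4, "age": 41, "name": "D"}]
purchases = [{"id": 1, "amount": 20.0}, {"id": 1, "amount": 30.0},
             {"id": 2, "amount": 100.0}, {"id": 3, "amount": 75.0}]
prepared = preprocess(customers, purchases)
print(prepared)
```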
Why do we need data preprocessing? What are the main tasks and relevant techniques used in data preprocessing?
Data preprocessing is essential to any successful data mining study. Good data leads to good information; good information leads to good decisions. Data preprocessing includes four main steps (listed in Table 4.1 on page 167). Data consolidation: access, collect, select, and filter data. Data cleaning: handle missing data, reduce noise, and fix errors. Data transformation: normalize the data, aggregate data, and construct new attributes. Data reduction: reduce the number of attributes and records, and balance skewed data.
What are the privacy issues in data mining?
Data that is collected, stored, and analyzed in data mining often contains information about real people. This includes identification, demographic, financial, personal, and behavioral information. Most of these data can be accessed through some third-party data providers. In order to maintain the privacy and protection of individuals' rights, data mining professionals have ethical (and often legal) obligations.
What are the privacy issues with data mining? Do you think they are substantiated?
Data that is collected, stored, and analyzed in data mining often contains information about real people. This includes identification, demographic, financial, personal, and behavioral information. Most of these data can be accessed through some third-party data providers. In order to maintain the privacy and protection of individuals' rights, data mining professionals have ethical (and often legal) obligations. As time goes on, this will continue to be a public debate. As technology advances and more information about people becomes easier to get, the privacy debate will adjust accordingly. People's expectations about privacy will become tempered by their desires for the benefits of data mining, from individualized customer service to higher security. As with all issues of social import, the privacy issue will include social discourse, legal and legislative decisions, and corporate decisions. The fact that companies often choose to self-regulate (e.g., by ensuring their data is de-identified) implies that we may as a society be able to find a happy medium between privacy and data mining.
List and briefly define at least two classification techniques.
Decision tree analysis. Decision tree analysis (a machine-learning technique) is arguably the most popular classification technique in the data mining arena. Statistical analysis. Statistical classification techniques include logistic regression and discriminant analysis, both of which make the assumptions that the relationships between the input and output variables are linear in nature, the data is normally distributed, and the variables are not correlated and are independent of each other. Case-based reasoning. This approach uses historical cases to recognize commonalities in order to assign a new case into the most probable category. Bayesian classifiers. This approach uses probability theory to build classification models based on the past occurrences that are capable of placing a new instance into a most probable class (or category). Genetic algorithms. The use of the analogy of natural evolution to build directed search-based mechanisms to classify data samples. Rough sets. This method takes into account the partial membership of class labels to predefined categories in building models (collection of rules) for classification problems.
What are the most popular commercial data mining tools?
Examples of these vendors include IBM (IBM SPSS Modeler), SAS (Enterprise Miner), StatSoft (Statistica Data Miner), KXEN (Infinite Insight), Salford (CART, MARS, TreeNet, RandomForest), Angoss (KnowledgeSTUDIO, KnowledgeSeeker), and Megaputer (PolyAnalyst). Most of the more popular tools are developed by the largest statistical software companies (SPSS, SAS, and StatSoft).
What are the main reasons for the recent popularity of data mining?
Following are some of the most pronounced reasons: More intense competition at the global scale driven by customers' ever-changing needs and wants in an increasingly saturated marketplace. General recognition of the untapped value hidden in large data sources. Consolidation and integration of database records, which enables a single view of customers, vendors, transactions, etc. Consolidation of databases and other data repositories into a single location in the form of a data warehouse. The exponential increase in data processing and storage technologies. Significant reduction in the cost of hardware and software for data storage and processing. Movement toward the de-massification (conversion of information resources into nonphysical form) of business practices.
Discuss what an organization should consider before making a decision to purchase data mining software.
Before making a decision to purchase data mining software, organizations should consider the standard criteria to use when investing in any major software: cost/benefit analysis, people with the expertise to use the software and perform the analyses, availability of historical data, and a business need for the data mining software.
Distinguish data mining from other analytical tools and techniques.
Students can view the answer in Figure 4.1 (p. 152), which shows that data mining is a composite or blend of multiple disciplines or analytical tools and techniques.
What recent factors have increased the popularity of data mining?
Following are some of the most pronounced reasons: More intense competition at the global scale driven by customers' ever-changing needs and wants in an increasingly saturated marketplace. General recognition of the untapped value hidden in large data sources. Consolidation and integration of database records, which enables a single view of customers, vendors, transactions, etc. Consolidation of databases and other data repositories into a single location in the form of a data warehouse. The exponential increase in data processing and storage technologies. Significant reduction in the cost of hardware and software for data storage and processing. Movement toward the de-massification (conversion of information resources into nonphysical form) of business practices.
What are some major data mining methods and algorithms?
Generally speaking, data mining tasks can be classified into three main categories: prediction, association, and clustering. Based on the way in which the patterns are extracted from the historical data, the learning algorithms of data mining methods can be classified as either supervised or unsupervised. With supervised learning algorithms, the training data includes both the descriptive attributes (i.e., independent variables or decision variables) as well as the class attribute (i.e., output variable or result variable). In contrast, with unsupervised learning the training data includes only the descriptive attributes. Figure 4.3 (p. 157) shows a simple taxonomy for data mining tasks, along with the learning methods, and popular algorithms for each of the data mining tasks.
What does it mean to have "a single view of the customer"? How can it be accomplished?
Having a single view of the customer means treating the customer as a single entity across whichever channels the customer utilizes. Shopping channels include brick-and-mortar, television, catalog, and e-commerce (through computers and mobile devices). Achieving this single view helps to better focus marketing efforts and drive increased sales.
Why do we need a standardized data mining process? What are the most commonly used data mining processes?
In order to systematically carry out data mining projects, a general process is usually followed. Similar to other information systems initiatives, a data mining project must follow a systematic project management process to be successful. Several data mining processes have been proposed: CRISP-DM, SEMMA, and KDD.
What is in-database analytics, and why would you need it?
In-database analytics refers to the practice of applying analytics directly to a database or data warehouse rather than the traditional practice of first transforming the data into the analytics application's data format. The time it takes to transform production data into a data warehouse format can be very long. In-database analytics eliminates this need.
What were the challenges, the proposed solution, and the obtained results?
Infinity P&C faces a significant challenge in recognizing fraudulent claims. Fraud represents a $20 billion exposure to the insurance industry and in certain venues could be an element in around 40 percent of claims. The solution involved using IBM SPSS predictive analytics tools. Working from "red flag" claims, the company used these tools to develop rules for rating and identifying potential fraud. A key benefit of the IBM SPSS system is its ability to continually analyze and score these claims, which helps ensure that each claim gets to the right adjuster at the right time. As a result of implementing the IBM SPSS analytics tools, Infinity P&C has doubled the accuracy of its fraud identification, contributing to a return on investment of 403 percent per a Nucleus Research study. With SPSS, Infinity P&C has reduced SIU referral time from an average of 45-60 days to approximately 1-3 days. Predictive analytics also helped with subrogation, the process of collecting damages from the at-fault driver's insurance company.
Are data mining processes a mere sequential set of activities? Explain.
No. Even though these steps are sequential in nature, there is usually a great deal of backtracking. Because data mining is driven by experience and experimentation, depending on the problem/situation and the knowledge/experience of the analyst, the whole process can be very iterative (i.e., one should expect to go back and forth through the steps quite a few times) and time consuming. Because later steps are built on the outcome of the earlier ones, one should pay extra attention to the earlier steps in order not to put the whole study on an incorrect path from the outset.
How did Infinity P&C improve customer service with data mining?
One out of five claims is fraudulent. Rather than putting all five customers through an investigatory process, SPSS helps Infinity 'fast-track' four of them and close their cases within a matter of days. This results in much happier customers, contributes to a more efficient workflow with improved cycle times, and improves retention due to an overall better claims experience.
What are the key differences between the major data mining methods?
Prediction: the act of telling about the future. It differs from simple guessing by taking into account the experiences, opinions, and other relevant information in conducting the task of foretelling. A term that is commonly associated with prediction is forecasting. Even though many believe that these two terms are synonymous, there is a subtle but critical difference between the two. Whereas prediction is largely experience and opinion based, forecasting is data and model based. That is, in order of increasing reliability, one might list the relevant terms as guessing, predicting, and forecasting, respectively. In data mining terminology, prediction and forecasting are used synonymously, and the term prediction is used as the common representation of the act.
What are the most popular free data mining tools?
Probably the most popular free and open source data mining tool is Weka. Others include RapidMiner and Microsoft's SQL Server.
Did Target go too far? Did it do anything illegal? What do you think Target should have done? What do you think Target should do next (quit these types of practices)?
Target might have made a tactical mistake, but they certainly didn't do anything illegal. They did not use any information that violates customer privacy; rather, they used transactional data that almost every other retail chain is collecting and storing (and perhaps analyzing) about their customers. Indeed, even the father apologized when realizing his daughter was actually pregnant. The fact is, we live in a world of massive data, and we are all as consumers leaving traces of our buying behavior for anyone to see.
What are some of the criteria for comparing and selecting the best classification technique?
The amount and availability of historical data. The types of data (categorical, interval, ratio, etc.).
How can data mining be used to fight terrorism? Comment on what else can be done beyond what is covered in this short application case.
The application case discusses use of data mining to detect money laundering and other forms of terrorist financing. The solution was using data mining techniques to find foreign exporters that are members of foreign terrorist organizations. Other applications could be to track the behavior and movement of potential terrorists, as well as text mining emails, blogs, and social media threads.
Discuss the differences between the two most commonly used data mining processes.
The main difference between CRISP-DM and SEMMA is that CRISP-DM takes a more comprehensive approach—including understanding of the business and the relevant data—to data mining projects, whereas SEMMA implicitly assumes that the data mining project's goals and objectives along with the appropriate data sources have been identified and understood.
What are the main differences between commercial and free data mining software tools?
The main difference between commercial tools, such as Enterprise Miner and Statistica, and free tools, such as Weka and RapidMiner, is computational efficiency. The same data mining task involving a rather large dataset may take considerably longer to complete with the free software, and in some cases it may not even be feasible (i.e., the software may crash due to inefficient use of computer memory).
Discuss the reasoning behind the assessment of classification models.
The model-building step also encompasses the assessment and comparative analysis of the various models built. Because there is not a universally known best method or algorithm for a data mining task, one should use a variety of viable model types along with a well-defined experimentation and assessment strategy to identify the "best" method for a given purpose.
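The idea of comparing several model types under one assessment strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two toy classifiers (a majority-class baseline and a 1-nearest-neighbor model), the function names, and the synthetic data are all hypothetical and are not taken from the chapter. The assessment strategy shown is k-fold cross-validation.

```python
from collections import Counter

def majority_classifier(train):
    """Baseline model: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def nearest_neighbor_classifier(train):
    """1-NN model: predict the label of the closest training record."""
    def predict(x):
        _, y = min(train, key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], x)))
        return y
    return predict

def cross_val_accuracy(data, build_model, k=5):
    """Average hold-out accuracy over k folds (fold i holds every k-th record)."""
    scores = []
    for i in range(k):
        test = data[i::k]
        train = [rec for j, rec in enumerate(data) if j % k != i]
        predict = build_model(train)
        correct = sum(predict(x) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / k

# Synthetic, made-up records: (features, class label).
data = [((x, x + 0.5), "low") for x in range(10)] + \
       [((x + 20, x + 20.5), "high") for x in range(10)]

# Same data, same folds, different model types: pick the better scorer.
results = {
    "majority": cross_val_accuracy(data, majority_classifier),
    "1-NN": cross_val_accuracy(data, nearest_neighbor_classifier),
}
best = max(results, key=results.get)
```

Because both models are scored on identical folds, the comparison reflects the models rather than a lucky split, which is the point of a well-defined experimentation strategy.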
What are some of the methods for cluster analysis?
The most commonly used clustering algorithms are k-means and self-organizing maps.
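As a rough illustration of how k-means works, the sketch below alternates between assigning each record to its nearest centroid and moving each centroid to the mean of its members. It is a minimal version, not a production implementation; the farthest-first choice of starting centroids is an assumption made here to keep the example deterministic, and the sample points are made up.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    # Assumed initialization: start with the first point, then repeatedly
    # add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))

    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [min(range(k), key=lambda i: sq_dist(p, centroids[i])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Made-up two-dimensional records forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
```

Note that the number of clusters k must be supplied by the analyst; the algorithm does not discover it.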
What is being predicted—class or numeric value
The purpose or objective
What are the top challenges for multi-channel retailers? Can you think of other industry segments that face similar problems?
The retail industry is among the most challenging because of the constant change that retailers have to deal with. Understanding customer needs, wants, likes, and dislikes is an ongoing challenge. As the volume and complexity of data increase, so does the time spent on preparing and analyzing it. Prior to the integration of SAS and Teradata, data for modeling and scoring customers was stored in a data mart. The data mart took a large amount of time to construct because it meant bringing together disparate data sources, which kept statisticians from working on analytics.
How can data mining be used for predicting financial success of movies before the start of their production process?
The way Sharda and Delen did it was to use data from movies between 1998 and 2005 as training data, and movies of 2006 as test data. They applied individual and ensemble prediction models, and were able to identify significant variables impacting financial success. They also showed that by using sensitivity analysis, decision makers can predict with fairly high accuracy how much value a specific actor (or a specific release date, or the addition of more technical effects, etc.) brings to the financial success of a film, making the underlying system an invaluable decision aid. Using predictive models in the early stages of movie production is an effective way to minimize investments in flops.
How did the Memphis Police Department use data mining to better combat crime?
They started mining MPD's crime data banks to help zero in on where and when criminals were hitting hardest. This began as a pilot program called Operation Blue CRUSH, or Crime Reduction Utilizing Statistical History. Shortly after all precincts embraced Blue CRUSH, predictive analytics became one of the most potent weapons in the Memphis police department's crime-fighting arsenal. Their use of data mining enabled them to focus police resources intelligently by putting them in the right place, on the right day, at the right time. Today, the MPD is continuing to explore new ways to exploit statistical analysis in its crime-fighting mission.
Identify at least five specific applications of data mining and list five common characteristics of these applications.
This question expands on the prior question by asking for common characteristics. Several such applications and their characteristics are listed on pp. 160-161.
Discuss the main data mining methods. What are the fundamental differences among them?
Three broad categories of data mining methods are prediction (classification or regression), clustering, and association. Prediction is the act of telling about the future. It differs from simple guessing by taking into account the experiences, opinions, and other relevant information in conducting the task of foretelling. A term that is commonly associated with prediction is forecasting. Even though many believe that these two terms are synonymous, there is a subtle but critical difference between the two. Whereas prediction is largely experience and opinion based, forecasting is data and model based. Classification is analyzing the historical behavior of groups of entities with similar characteristics, to predict the future behavior of a new entity from its similarity to those groups. Clustering is finding groups of entities with similar characteristics. Association is establishing relationships among items that occur together. The fundamental differences are: Prediction (classification or regression) predicts future cases or conditions based on historical data. Clustering partitions pattern records into natural segments or clusters. Each segment's members share similar characteristics. Association is used to discover two or more items (or events or concepts) that go together.
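To make the association category concrete, the following sketch counts how often pairs of items occur together in a set of transactions and reports support, confidence, and lift for each pair. It is a simplified illustration, not a full algorithm such as Apriori; the function name, the `min_support` threshold, and the basket data are all made up for the example.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.4):
    """Return (lhs, rhs, support, confidence, lift) for frequent item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # de-duplicate and fix the pair order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]      # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)  # values above 1 suggest a positive association
            rules.append((lhs, rhs, support, confidence, lift))
    return rules

# Hypothetical market baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]
rules = pair_rules(baskets)
beer_to_diapers = next(r for r in rules if r[0] == "beer" and r[1] == "diapers")
```

In this made-up data every basket with beer also contains diapers, so the rule beer -> diapers has a confidence of 1.0, while the reverse rule is weaker because diapers also appear without beer.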