Test 04
What is Prediction?
Prediction: the act of telling about the future, based on past data and experience.
3. What are the most popular free data mining tools?
Probably the most popular free and open source data mining tool is Weka. Others include RapidMiner and Microsoft's SQL Server.
2. What is a web crawler? What is it used for? How does it work?
A Web crawler (also called a spider or a Web spider) is a piece of software that systematically browses (crawls through) the World Wide Web for the purpose of finding and fetching Web pages. It starts with a list of "seed" URLs, goes to the pages of those URLs, and then follows each page's hyperlinks, adding them to the search engine's database. Thus, the Web crawler navigates through the Web in order to construct the database of websites.
10. Give examples of situations in which association would be an appropriate data mining technique.
Association rule mining is appropriate to use when the objective is to discover two or more items (or events or concepts) that go together. Students' answers will differ.
1. What is a search engine? Why are they important for today's businesses?
A search engine is a software program that searches for documents (Internet sites or files) based on the keywords (individual words, multi-word terms, or a complete sentence) that users have provided that have to do with the subject of their inquiry. This is the most prominent type of information retrieval system for finding relevant content on the Web. Search engines have become the centerpiece of most Internet-based transactions and other activities. Because people use them extensively to learn about products and services, it is very important for companies to have prominent visibility on the Web; hence the major effort of companies to enhance their search engine optimization (SEO).
Describe speech acts.
A sentence can often be considered an action by the speaker, and the sentence structure alone may not contain enough information to identify that action.
3. Is data mining a new discipline?
Although the term data mining is relatively new, the ideas behind it are not.
2. What is clickstream analysis? What is it used for?
Analysis of the information collected by Web servers can help us better understand user behavior. Analysis of this data is often called clickstream analysis. By using the data and text mining techniques, a company might be able to discern interesting patterns from the clickstreams.
1. What are the major application areas for data mining? There are fourteen.
CRM, banking, retailing and logistics, manufacturing and production, brokerage and securities trading, insurance, computer hardware and software, government and defense, travel, healthcare, medicine, entertainment, homeland security and law enforcement, and sports.
What is website usability?
How were visitors using my website? Relevant metrics include page views, time on site, downloads, click maps, and click paths.
4. What things can help Web pages rank higher in the search engine results?
Cross-linking between pages of the same website to provide more links to the most important pages may improve its visibility. Writing content that includes frequently searched keyword phrases, so as to be relevant to a wide variety of search queries, will tend to increase traffic. Updating content so as to keep search engines crawling back frequently can give additional weight to a site. Adding relevant keywords to a Web page's metadata, including the title tag and meta description, will tend to improve the relevancy of a site's search listings, thus increasing traffic. Normalizing the URLs of Web pages that are accessible via multiple URLs, and using canonical link elements and redirects, can help ensure that links to different versions of a URL all count toward the page's link popularity score.
2. What are the most popular application areas for sentiment analysis? Why?
Customer relationship management (CRM) and customer experience management are popular "voice of the customer (VOC)" applications. Other application areas include "voice of the market (VOM)" and "voice of the employee (VOE)."
Person-of-interest statements completed by people involved in crimes on military bases were analyzed using text mining techniques to determine which statements were truthful or deceptive. The study analyzed text-based testimonies of persons of interest in crimes. The deception detection used only text-based features (cues) and did NOT analyze the observed behavior of the witnesses during their testimony. 1. Why is it difficult to detect deception?
Deception detection is difficult in general, and when detection is limited to text alone, the problem is even harder.
What is decision tree analysis?
Decision tree analysis (a machine-learning technique) is arguably the most popular classification technique in the data mining arena.
What does the Apriori Algorithm do?
The Apriori algorithm finds itemsets that are common to at least a minimum number of the transactions (the minimum support). It uses a bottom-up approach: frequent itemsets are extended one item at a time, and candidates containing an infrequent subset are pruned.
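As a minimal sketch (illustrative pure Python with made-up baskets, not any particular library's API), the bottom-up search looks like this:

```python
def apriori(transactions, min_support):
    """Find all itemsets appearing in at least min_support transactions."""
    # Start from single-item candidates (the bottom of the lattice).
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while current:
        # Count support for each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Bottom-up step: join surviving k-itemsets into (k+1)-candidates.
        current = {a | b for a in survivors for b in survivors
                   if len(a | b) == len(a) + 1}
    return frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread"}]
print(apriori(baskets, 2))
```

With a minimum support of 2, the sketch reports {bread}, {milk}, and {milk, bread} as frequent, while {eggs} is pruned in the first pass.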
What is forecasting?
Forecasting: estimating a future data value based on past data values.
4. What are some major data mining methods and algorithms?
Generally speaking, data mining tasks can be classified into three main categories: prediction, association, and clustering.
What is sequence discovery?
Sequence discovery: finding time-based associations
1. How can data mining be used to fight terrorism? Comment on what else can be done beyond what is covered in this short application case.
The application case discusses use of data mining to detect money laundering and other forms of terrorist financing. The solution was using data mining techniques to find foreign exporters that are members of foreign terrorist organizations.
2. What are the main methods for extracting knowledge from a corpus?
The main categories of knowledge extraction methods are summarization, classification, clustering, association, and trend analysis.
9. What are some of the methods for cluster analysis?
The most commonly used clustering algorithms are k-means and self-organizing maps.
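A minimal k-means sketch (pure Python on made-up two-dimensional points; a real project would use a library implementation):

```python
import random

def kmeans(points, k, iters=20, seed=42):
    """Plain k-means: assign points to the nearest centroid, then re-average."""
    random.seed(seed)
    centroids = random.sample(points, k)          # initial centroids = k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Nearest centroid by squared Euclidean distance.
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        centroids = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
                     else centroids[j] for j, cl in enumerate(clusters)]
    return centroids, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.8, 8.2)]
centroids, clusters = kmeans(pts, 2)
print(sorted(centroids))
```

On this toy data the two centroids settle near (1.1, 0.9) and (7.9, 8.1), one per visible cluster.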
What is case-based reasoning?
This approach uses historical cases to recognize commonalities in order to assign a new case into the most probable category.
2. How can text/data mining be used to detect deception in text?
Through a process known as message feature mining, statements are transcribed for processing, then cues are extracted and selected. Text processing software identifies cues in the statements and generates quantified cues. Classification models are trained and tested on these quantified cues, with ground-truth labels (truthful or deceptive) established, for example, by law enforcement personnel. Feature-selection methods, along with 10-fold cross-validation, allow researchers to compare the prediction accuracy of different data mining methods (for example, neural networks).
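The k-fold cross-validation mentioned above can be sketched in a few lines (illustrative pure Python; the trivial scoring function is a stand-in for a real train-and-score routine):

```python
def cross_validate(data, k, train_and_score):
    """Average a scoring function over k train/test splits."""
    folds = [data[i::k] for i in range(k)]           # k non-overlapping folds
    scores = []
    for i in range(k):
        test = folds[i]
        # Every other fold forms the training set; fold i is held out once.
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(train_and_score(train, test))
    return sum(scores) / k

# Toy "scorer": fraction of test items unseen in training (always 1.0 here,
# which just demonstrates that the held-out fold is disjoint from training).
avg = cross_validate(list(range(10)), 5,
                     lambda tr, te: len(set(te) - set(tr)) / len(te))
print(avg)
```

With 10 folds instead of 5, this is exactly the 10-fold scheme used in the study: each statement is scored by a model that never saw it during training.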
2. Do you think data mining, while essential for fighting terrorist cells, also jeopardizes individuals' rights of privacy?
Yes, because it inevitably involves tracking personal and financial data of individuals. (As an opinion question, students' answers will vary.)
What is data cleaning?
data cleaning: handle missing data, reduce noise, fix errors
What is data consolidation?
data consolidation: access, collect, select and filter data
What is the Apriori Algorithm?
The Apriori algorithm is the most common algorithm for association rule mining and one of the most widely used algorithms in data mining.
6. What was the reason for Cabela's to bring together SAS and Teradata, the two leading vendors in the analytics marketplace?
Cabela's was already using both for different elements of their business. Each of the two systems was producing actionable analysis of data. But by being separate, too much time was required to construct data marts, bringing together disparate data sources and keeping statisticians from working on analytics. Now, with the integration of the two systems, statisticians can leverage the power of SAS using the Teradata warehouse as one source of information.
2. Give examples of situations in which classification would be an appropriate data mining technique.
Classification is appropriate for prediction based on historical data and relationships, such as predicting the weather, product demand, or a student's success in a university. If what is being predicted is a class label (e.g., "sunny," "rainy," or "cloudy"), the prediction problem is called classification, whereas if it is a numeric value (e.g., a temperature such as 68°F), the prediction problem is called regression.
5. How can you measure the impact of social media analytics?
First, determine what your social media goals are. From there, you can use analysis tools such as descriptive analytics, social network analysis, advanced analytics (predictive analytics and text analytics that examine the content of online conversations), and ultimately prescriptive analytics tools.
Discuss about imperfect or irregular input.
Foreign or regional accents and vocal impediments in speech and typographical or grammatical errors in texts make the processing of the language an even more difficult task.
1. What is meant by social analytics? Why is it an important business topic?
From a philosophical perspective, social analytics focuses on a theoretical object called a "socius," a kind of "commonness" that is neither a universal account nor a communality shared by every member of a body. Thus, social analytics in this sense attempts to articulate the differences between philosophy and sociology. From a BI perspective, social analytics involves "monitoring, analyzing, measuring and interpreting digital interactions and relationships of people, topics, ideas and content." In this perspective, social analytics involves mining the textual content created in social media (e.g., sentiment analysis, natural language processing) and analyzing socially established networks (e.g., influencer identification, profiling, prediction). This is an important business topic because it helps companies gain insight about existing and potential customers' current and future behaviors, and about the likes and dislikes toward a firm's products and services.
4. Why did IBM spend all that time and money to build Watson? Where is the ROI?
IBM's goal was to advance computer science by exploring new ways for computer technology to affect science, business, and society. If successful, this could give IBM a distinct competitive advantage in this important technological application area.
Describe part of speech tagging.
It is difficult to mark up terms in a text as corresponding to a particular part of speech because the part of speech depends not only on the definition of the term but also on the context within which it is used.
3. What are some of the benefits and challenges of NLP?
NLP moves beyond syntax-driven text manipulation (which is often called "word counting") to a true understanding and processing of natural language that considers grammatical and semantic constraints as well as the context. The challenges include: • Part-of-speech tagging. • Text segmentation. • Word sense disambiguation. • Syntactic ambiguity. • Imperfect or irregular input. • Speech acts.
3. What do you think are the main challenges for such an automated system of deception detection?
One challenge is that training the system depends on humans to ascertain the truthfulness of the statements in the training data itself. You cannot know for sure whether those statements are true or false, so you may be using incorrect training samples when "teaching" the machine learning system to predict lies in new text data.
3. What would be the expected benefits and beneficiaries of sentiment analysis in politics?
Opinions matter a great deal in politics. Because political discussions are dominated by quotes, sarcasm, and complex references to persons, organizations, and ideas, politics is one of the most difficult, and potentially fruitful, areas for sentiment analysis. By analyzing the sentiment on election forums, one may predict who is more likely to win or lose. Sentiment analysis can help understand what voters are thinking and can clarify a candidate's position on issues. Sentiment analysis can help political organizations, campaigns, and news analysts to better understand which issues and positions matter the most to voters. The technology was successfully applied by both parties to the 2008 and 2012 American presidential election campaigns.
1. Why is it important for companies to keep up with patent filings?
Patents provide exclusive rights to an inventor, granted by a country, for a limited period of time in exchange for a disclosure of an invention.
What's important to remember about the data mining definition?
Process - most common and comprehensive is CRISP-DM Novel - previously unknown patterns are discovered. Potentially useful - results should lead to some business benefit. Valid - discovered patterns should hold true on new data.
3. What is "search engine optimization"? Who benefits from it?
Search engine optimization (SEO) is the intentional activity of affecting the visibility of an e-commerce site or a website in a search engine's natural (unpaid or organic) search results. It involves editing a page's content, HTML, metadata, and associated coding to both increase its relevance to specific keywords and to remove barriers to the indexing activities of search engines. In addition, SEO efforts include promoting a site to increase its number of inbound links. SEO primarily benefits companies with e-commerce sites by making their pages appear toward the top of search engine lists when users query.
1. What is sentiment analysis? How does it relate to text mining?
Sentiment analysis tries to answer the question, "What do people feel about a certain topic?" by digging into opinions of many using a variety of automated tools. It is also known as opinion mining, subjectivity analysis, and appraisal extraction. Sentiment analysis shares many characteristics and techniques with text mining. However, unlike text mining, which categorizes text by conceptual taxonomies of topics, sentiment classification generally deals with two classes (positive versus negative), a range of polarity (e.g., star ratings for movies), or a range in strength of opinion.
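A toy illustration of two-class sentiment classification (the mini-lexicon below is hypothetical; real systems use large curated lexicons or trained models):

```python
# Hypothetical mini-lexicon mapping words to polarity scores.
LEXICON = {"great": 1, "love": 1, "good": 1,
           "bad": -1, "awful": -1, "boring": -1}

def sentiment(text):
    """Sum word polarities; the sign gives the positive/negative class."""
    score = sum(LEXICON.get(w.strip(".,!?").lower(), 0) for w in text.split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this movie, the plot is great!"))
print(sentiment("Awful pacing and a boring script."))
```

Note how this differs from topic classification: the classes are polarity (positive versus negative) rather than subject matter.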
1. What are the major data mining processes?
Several data mining processes have been proposed: CRISP-DM, SEMMA, and KDD.
4. What is social media analytics? What are the reasons behind its increasing popularity?
Social media analytics refers to the systematic and scientific ways to consume the vast amount of content created by Web-based social media outlets, tools, and techniques for the betterment of an organization's competitiveness. Data includes anything posted in a social media site. The increasing popularity of social media analytics stems largely from the similarly increasing popularity of social media together with exponential growth in the capacities of text and Web analytics technologies.
3. What is social media? How does it relate to Web 2.0?
Social media refers to the enabling technologies of social interactions among people in which they create, share, and exchange information, ideas, and opinions in virtual communities and networks. It is a group of Internet-based software applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content.
What is social network analysis?
Social network analysis (SNA) is the systematic examination of social networks. Dating back to the 1950s, social network analysis is an interdisciplinary field that emerged from social psychology, sociology, statistics, and graph (network) theory.
Talk about text segmentation.
Some written languages, such as Chinese, Japanese, and Thai, do not have single-word boundaries.
What are the top challenges for multi-channel retailers? Can you think of other industry segments that face similar problems?
The retail industry deals with change constantly. Understanding customer needs, wants, likes, and dislikes is an ongoing challenge. As the volume and complexity of data increase, so does the time spent on preparing and analyzing it. Prior to the integration of SAS and Teradata, data for modeling and scoring customers was stored in a data mart. This process required a large amount of time to construct, bringing together disparate data sources and keeping statisticians from working on analytics.
What is visualization?
Visualization: presenting results obtained through one or more of the other methods
1. What is Watson? What is special about it?
Watson is a question answering (QA) computer system developed by an IBM Research team. What makes it special is that it is able to compete at the human champion level in real time on the TV quiz show, Jeopardy!
4. What is Web content mining? How can it be used for competitive advantage?
Web content mining refers to the extraction of useful information from Web pages. The documents may be extracted in some machine-readable format so that automated techniques can generate some information about the Web pages. Collecting and mining Web content can be used for competitive intelligence (collecting intelligence about competitors' products, services, and customers), which can give your organization a competitive advantage.
3. What are the three main areas of Web mining?
Web content mining, Web structure mining, and Web usage (or activity) mining.
2. What is Web mining? How does it differ from regular data mining or text mining?
Web mining is the discovery and analysis of interesting and useful information from the Web and about the Web, usually through Web-based tools. Whereas regular data mining works mostly on structured, numeric data and text mining works on less structured collections of words, Web mining targets the Web's own data: page content, hyperlink structure, and usage logs.
5. What is Web structure mining? How does it differ from Web content mining?
Web structure mining is the process of extracting useful information from the links embedded in Web documents. By contrast, Web content mining involves analysis of the specific textual content of web pages. So, Web structure mining is more related to navigation through a website, whereas Web content mining is more related to text mining and the document hierarchy of a particular web page.
What are visitor profiles?
What do my visitors look like? These include keywords, content groupings, geography, time of day, and landing page profiles.
What are conversion statistics?
What does all this mean for the business? Metrics include new visitors, returning visitors, leads, sales/conversions, and abandonments.
What are traffic sources?
Where did they come from? These include referral websites, search engines, direct, offline campaigns, and online campaigns.
5. What would be your top five selection criteria for a data mining tool? There are eight.
cost, user-interface, ease-of-use, computational efficiency, hardware compatibility, type of business problem, vendor support, and vendor reputation.
What is data transformation?
data transformation: normalize the data, aggregate data, construct new attributes
1. What are the three types of data generated through Web page visits?
• Automatically generated data stored in server access logs, referrer logs, agent logs, and client-side cookies • User profiles • Metadata, such as page attributes, content attributes, and usage data.
3. What are the main applications of Web mining?
• Determine the lifetime value of clients. • Design cross-marketing strategies across products. • Evaluate promotional campaigns. • Target electronic ads and coupons at user groups based on user access patterns. • Predict user behavior based on previously learned rules and users' profiles. • Present dynamic information to users based on their interests and profiles.
In the patent analysis case study, text mining of thousands of patents held by Kodak and its competitors helped improve competitive intelligence and identified complementary products. If carefully analyzed, patent documents can help:
• Identify emerging technologies • Inspire novel solutions • Identify complementary inventions/products • Foster symbiotic partnerships
What is information extraction?
• Information extraction. Identification of key phrases and relationships within text by looking for predefined sequences in text via pattern matching.
What is question answering?
• Question answering. Finding the best answer to a given question through knowledge-driven pattern matching.
What is summarization?
• Summarization. Summarizing a document to save time on the part of the reader.
1. What are some of the main challenges the Web poses for knowledge discovery?
• The Web is too big for effective data mining. • The Web is too complex. • The Web is too dynamic. • The Web is not specific to a domain. • The Web has everything.
4. What are some of the criteria for comparing and selecting the best classification technique?
• The amount and availability of historical data • The types of data (categorical, interval, ratio, etc.) • What is being predicted—class or numeric value • The purpose or objective
What is topic tracking?
• Topic tracking. Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user.
4. What are commonly used Web analytics metrics? What is the importance of metrics?
• Website usability • Traffic sources • Visitor profiles • Conversion statistics. These metrics are important because they provide access to a lot of valuable marketing data, which can be leveraged for better insights to grow your business and better document your ROI. The insight and intelligence gained from Web analytics can be used to effectively manage the marketing efforts of an organization and its various products or services.
2. Why do we care about Watson?
IBM Research created Watson, an extraordinary computer system (a novel combination of advanced hardware and software) designed to answer questions posed in natural human language. Watson was capable of listening, understanding, responding, and winning in real time on the Jeopardy quiz show. Watson proved that machines can do things that require human creativity and intelligence.
2. How can text mining be used in security and counterterrorism?
In 2007, EUROPOL developed an integrated system capable of accessing, storing, and analyzing vast amounts of structured and unstructured data sources in order to track transnational organized crime. Another security-related application of text mining is in the area of deception detection.
7. What is in-database analytics, and why would you need it?
In-database analytics refers to the practice of applying analytics directly to a database or data warehouse, rather than the traditional practice of first transforming the data into the analytics application's format. The time it takes to transform production data into a data warehouse format can be very long. In-database analytics eliminates this need.
1. Why is it important for many Hollywood professionals to predict the financial success of movies?
It is hard to predict box-office receipts for a given movie, and movies are expensive, high-risk investments. Better forecasts of a movie's financial success help studios, distributors, and other Hollywood professionals decide which projects to back and how to market them.
Explain whether or not data mining is a new discipline.
Many of the techniques used in data mining have their roots in traditional statistical analysis and artificial intelligence work done since the early part of the 1980s. New or increased use of data mining applications makes it seem like data mining is a new discipline.
Discuss word sense disambiguation.
Many words have more than one meaning. Selecting the meaning that makes the most sense can only be accomplished by taking into account the context within which the word is used.
3. How do you think Hollywood did, and perhaps still is performing, this task without the help of data mining tools and techniques?
Most of it is done by gut feel and trial and error. This may keep the movie business a financially risky endeavor, but it also allows for creativity. Sometimes uncertainty is a good thing.
1. What is natural language processing?
Natural language processing (NLP) studies the problem of "understanding" the natural human language, with the view of converting depictions of human language (such as textual documents) into more formal representations (in the form of numeric and symbolic data) that are easier for computer programs to manipulate.
1. How did Infinity P&C improve customer service with data mining?
One out of five claims is fraudulent. Rather than putting all five customers through an investigatory process, SPSS helps Infinity 'fast-track' four of them and close their cases within a matter of days. This results in much happier customers, contributes to a more efficient workflow with improved cycle times, and improves retention due to an overall better claims experience.
What is regression?
Regression: a statistical estimation technique based on fitting a curve defined by a mathematical equation of known type but unknown parameters to existing data
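As a toy example (made-up data), fitting a straight line by ordinary least squares is exactly this: a curve of known form, y = a + b·x, with unknown parameters a and b estimated from existing data:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (known form, unknown a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept from the means.
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Past data values; estimate the curve, then forecast a future value.
a, b = fit_line([1, 2, 3, 4], [2.1, 4.0, 6.1, 8.0])
print(a + b * 5)   # forecast for x = 5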
1. How did the Memphis Police Department use data mining to better combat crime?
Shortly after all precincts embraced Blue CRUSH, predictive analytics became one of the most potent weapons in the Memphis police department's crime-fighting arsenal. Their use of data mining enabled them to focus police resources intelligently by putting them in the right place, on the right day, at the right time.
Do data mining projects need to follow a systematic project management process to be successful?
Similar to other information systems initiatives, a data mining project must follow a systematic project management process to be successful.
4. What do you think are the reasons for these myths about data mining?
Some answers might relate to fear of analytics, fear of the unknown, or fear of looking dumb.
What is statistical analysis?
Statistical classification techniques include logistic regression and discriminant analysis, both of which make the assumptions that the relationships between the input and output variables are linear in nature, the data is normally distributed, and the variables are not correlated and are independent of each other.
2. Why do you think the early phases (understanding of the business and understanding of the data) take the longest in data mining projects?
Students should explain that the early steps are the most unstructured phases because they involve learning. Those phases (learning/understanding) cannot be automated. Extra time and effort are needed upfront because any mistake in understanding the business or data will most likely result in a failed BI project.
2. Did Target go too far? Did it do anything illegal? What do you think Target should have done? What do you think Target should do next (quit these types of practices)?
Target might have made a tactical mistake, but they certainly didn't do anything illegal. They did not use any information that violates customer privacy; rather, they used transactional data that most every other retail chain is collecting and storing (and perhaps analyzing) about their customers.
1. What is text analytics? How does it differ from text mining?
Text analytics is a concept that includes information retrieval (e.g., searching and identifying relevant documents for a given set of key terms) as well as information extraction, data mining, and Web mining. By contrast, text mining is primarily focused on discovering new and useful knowledge from textual data sources. You can think of text analytics as a combination of information retrieval plus text mining.
3. Why is the popularity of text mining as a BI tool increasing?
The popularity of text mining as a BI tool is increasing because of the rapid growth in text data and the availability of sophisticated BI tools. The benefits of text mining are obvious in the areas where very large amounts of textual data are being generated, such as law (court orders), academic research (research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), and marketing (customer comments).
1. List and briefly discuss some of the text mining applications in marketing.
Text mining can be used to increase cross-selling and up-selling by analyzing the unstructured data generated by call centers. Text mining has become invaluable for customer relationship management. Companies can use text mining to analyze rich sets of unstructured text data, combined with the relevant structured data extracted from organizational databases, to predict customer perceptions and subsequent purchasing behavior.
2. What is text mining? How does it differ from data mining?
Text mining is the application of data mining to unstructured, or less structured, text files. As the names indicate, text mining analyzes words; and data mining analyzes numeric data.
2. How does NLP relate to text mining?
Text mining uses natural language processing to induce structure into the text collection and then uses data mining algorithms such as classification, clustering, association, and sequence discovery to extract knowledge from it.
Describe syntactic ambiguity.
The grammar for natural languages is ambiguous; that is, multiple possible sentence structures often need to be considered. Choosing the most appropriate structure usually requires a fusion of semantic and contextual information.
4. What are the main differences between commercial and free data mining software tools?
The main difference between commercial tools, such as Enterprise Miner and Statistica, and free tools, such as Weka and RapidMiner, is computational efficiency. The same data mining task involving a rather large dataset may take a whole lot longer to complete with the free software, and in some cases it may not even be feasible (i.e., crashing due to the inefficient use of computer memory).
What are genetic algorithms?
Genetic algorithms use the analogy of natural evolution to build directed-search mechanisms that classify data samples.
Why should a single view of the customer be accomplished?
Achieving this single view helps to better focus marketing efforts and drive increased sales.
2. What results did Infinity P&C obtain?
As a result of implementing the IBM SPSS analytics tools, Infinity P&C has doubled the accuracy of its fraud identification, contributing to a return on investment of 403 percent, according to a Nucleus Research study.
3. What are some promising text mining applications in biomedicine?
As in any other experimental approach, it is necessary to analyze the vast amount of data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.
3. List and briefly define the phases in the CRISP-DM process.
CRISP-DM provides a systematic and orderly way to conduct data mining projects. The process has six phases. First, an understanding of the business issues to be addressed and an understanding of the data are developed concurrently. Next, the data are prepared for modeling, models are built, model results are evaluated, and finally the models can be deployed for regular use.
3. What are the sources of data that retailers such as Cabela's use for their data mining projects?
Cabela's uses large and information-rich transactional and customer data (that they collect on a daily basis) to optimize business processes and stay competitive. In addition, through Web mining they track clickstream patterns of customers shopping online.
1. What do you think about data mining and its implication for privacy? What is the threshold between discovery of knowledge and infringement of privacy?
There is a tradeoff between knowledge discovery and privacy rights. Retailers should be sensitive about this when targeting their advertising based on data mining results, especially regarding topics that could be embarrassing to their customers. Otherwise they risk offending these customers, which could hurt their bottom line.
What are Bayesian classifiers?
This approach uses probability theory to build classification models based on the past occurrences that are capable of placing a new instance into a most probable class (or category).
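As a minimal sketch of the idea (the weather-style training data, attribute names, and smoothing choice here are hypothetical, not from the text), a naive Bayesian classifier scores each class by its prior probability times the conditional probabilities of the instance's attribute values, and picks the most probable class:

```python
from collections import Counter, defaultdict

# Hypothetical training data: (outlook, temperature) -> play? yes/no.
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("overcast", "hot"), "yes"),
    (("rain", "mild"), "yes"),
    (("rain", "cool"), "yes"),
    (("overcast", "cool"), "yes"),
]

def naive_bayes_classify(train, instance):
    """Return the most probable class for `instance`, assuming (naively)
    that attributes are independent given the class."""
    class_counts = Counter(label for _, label in train)
    # attr_counts[label][i][value] = occurrences of attribute i == value in class `label`
    attr_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, label in train:
        for i, value in enumerate(attrs):
            attr_counts[label][i][value] += 1

    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / len(train)                       # prior P(class)
        for i, value in enumerate(instance):
            # Laplace smoothing so an unseen value does not zero the product
            score *= (attr_counts[label][i][value] + 1) / (n + 2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(naive_bayes_classify(train, ("rain", "hot")))   # prints: yes
```

Because the class priors and the per-class attribute counts come straight from past occurrences, the model needs only one pass over the training data.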
What are rough sets?
This method takes into account the partial membership of class labels to predefined categories in building models (collection of rules) for classification problems.
4. What are the main data preprocessing steps?
• Data consolidation: access, collect, select, and filter data
• Data cleaning: handle missing data, reduce noise, fix errors
• Data transformation: normalize the data, aggregate data, construct new attributes
• Data reduction: reduce the number of attributes and records; balance skewed data
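The four preprocessing steps can be sketched on a tiny hypothetical dataset (the records, the age-validity rule, and mean imputation are illustrative assumptions, not prescribed by the text):

```python
# Hypothetical raw records: (age, income); None marks a missing value.
raw = [(25, 30000), (40, None), (35, 52000), (None, 48000), (30, 45000), (999, 50000)]

# Consolidation/selection: keep only records whose age is plausible (drops the 999).
selected = [r for r in raw if r[0] is None or 0 < r[0] < 120]

# Cleaning: fill missing values in attribute `idx` with that attribute's mean.
def fill_missing(records, idx):
    known = [r[idx] for r in records if r[idx] is not None]
    mean = sum(known) / len(known)
    return [tuple(mean if i == idx and v is None else v for i, v in enumerate(r))
            for r in records]

cleaned = fill_missing(fill_missing(selected, 0), 1)

# Transformation: min-max normalize attribute `idx` to the [0, 1] range.
def normalize(records, idx):
    vals = [r[idx] for r in records]
    lo, hi = min(vals), max(vals)
    return [tuple((v - lo) / (hi - lo) if i == idx else v for i, v in enumerate(r))
            for r in records]

normalized = normalize(normalize(cleaned, 0), 1)

# Reduction: drop an attribute (here, keep only the normalized age).
reduced = [(r[0],) for r in normalized]
print(reduced)
```

In practice each step would use domain knowledge (which values are implausible, which attributes to drop), but the pipeline shape is the same.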
What is data reduction?
Data reduction: reduce the number of attributes and records; balance skewed data.
1. What are the privacy issues in data mining?
Data that is collected, stored, and analyzed in data mining often contains information about real people. This includes identification, demographic, financial, personal, and behavioral information. Most of these data can be accessed through some third-party data providers. In order to maintain the privacy and protection of individuals' rights, data mining professionals have ethical (and often legal) obligations.
4. What does it mean to have "a single view of the customer"?
It means treating the customer as a single entity across whichever channels the customer utilizes.
1. What are the most popular commercial data mining tools?
Examples of these vendors include IBM (IBM SPSS Modeler), SAS (Enterprise Miner), StatSoft (Statistica Data Miner), KXEN (Infinite Insight), Salford (CART, MARS, TreeNet, RandomForest), Angoss (KnowledgeSTUDIO, KnowledgeSeeker), and Megaputer (PolyAnalyst). Most of the more popular tools are developed by the largest statistical software companies (SPSS, SAS, and StatSoft).
5. Briefly describe the general algorithm used in decision trees.
A general algorithm for building a decision tree is as follows:
1. Create a root node and assign all of the training data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of the split. Split the data into mutually exclusive (non-overlapping) subsets along the lines of the specific split and move the data to the branches.
4. Repeat steps 2 and 3 for each leaf node until a stopping criterion is reached (e.g., the node is dominated by a single class label).
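The four steps above can be sketched as a recursive ID3-style builder (the toy data, information-gain split criterion, and majority-class leaves are illustrative assumptions; real implementations add pruning and handle numeric attributes):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    """Recursively build a decision tree (steps 1-4 of the general algorithm)."""
    # Stopping criterion: node dominated by a single class, or no attributes left.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]       # leaf = majority class
    # Step 2: select the best splitting attribute (highest information gain).
    def gain(a):
        g = entropy(labels)
        for v in set(r[a] for r in rows):
            sub = [lab for r, lab in zip(rows, labels) if r[a] == v]
            g -= len(sub) / len(rows) * entropy(sub)
        return g
    best = max(attrs, key=gain)
    # Step 3: one branch per value; move the matching records to each branch.
    branches = {}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [lab for r, lab in zip(rows, labels) if r[best] == v]
        branches[v] = build_tree(sub_rows, sub_labels,
                                 [a for a in attrs if a != best])
    return (best, branches)

def classify(tree, instance):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[instance[attr]]
    return tree

# Hypothetical training data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"),
        ("rain", "cool"), ("overcast", "hot")]
labels = ["no", "no", "yes", "yes", "yes"]
tree = build_tree(rows, labels, [0, 1])
print(classify(tree, ("sunny", "cool")))   # prints: no
```

Step 4's recursion stops exactly when a node is pure, matching the stopping criterion in the text.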
Discuss the distinction between prediction and forecasting.
A term that is commonly associated with prediction is forecasting. Whereas prediction is largely experience and opinion based, forecasting is data and model based. That is, in order of increasing reliability, one might list the relevant terms as guessing, predicting, and forecasting, respectively. In data mining terminology, prediction and forecasting are used synonymously, and the term prediction is used as the common representation of the act.
2. How do you think the debate over privacy and data mining will progress? Why?
As technology advances and more information about people becomes easier to get, the privacy debate will adjust accordingly. People's expectations about privacy will become tempered by their desires for the benefits of data mining, from individualized customer service to higher security. As with all issues of social import, the privacy issue will include social discourse, legal and legislative decisions, and corporate decisions. The fact that companies often choose to self-regulate (e.g., by ensuring their data is de-identified) implies that we may as a society be able to find a happy medium between privacy and data mining. (Answers will vary by student.)
What is association?
Association: establishing relationships among items that occur together
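Association rules over item pairs can be sketched with the standard support and confidence measures (the market-basket transactions and thresholds below are hypothetical, purely for illustration):

```python
from itertools import combinations
from collections import Counter

# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk"},
]

def pair_rules(transactions, min_support=0.4, min_confidence=0.6):
    """Return rules X -> Y over item pairs, with their support and confidence."""
    n = len(transactions)
    item_count = Counter(i for t in transactions for i in t)
    pair_count = Counter(frozenset(p) for t in transactions
                         for p in combinations(sorted(t), 2))
    rules = []
    for pair, c in pair_count.items():
        support = c / n                      # fraction of baskets containing both
        if support < min_support:
            continue
        for x in pair:
            (y,) = pair - {x}
            confidence = c / item_count[x]   # P(Y | X)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules

for x, y, s, conf in pair_rules(transactions):
    print(f"{x} -> {y}  support={s:.2f}  confidence={conf:.2f}")
```

A rule such as butter -> bread with confidence 1.0 says every basket containing butter also contained bread; full-scale algorithms (e.g., Apriori) extend the same counting idea to itemsets larger than pairs.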
What does it mean for data to be unsupervised?
Based on the way in which the patterns are extracted from the historical data, the learning algorithms of data mining methods can be classified as either supervised or unsupervised. With unsupervised learning, the training data includes only the descriptive attributes; there is no class (output) attribute from which to learn.
What does it mean for data to be supervised?
Based on the way in which the patterns are extracted from the historical data, the learning algorithms of data mining methods can be classified as either supervised or unsupervised. With supervised learning algorithms, the training data includes both the descriptive attributes (i.e., independent variables or decision variables) as well as the class attribute (i.e., output variable or result variable).
1. Identify at least three of the main data mining methods.
Classification learns patterns from past data (a set of information—traits, variables, features—on characteristics of the previously labeled items, objects, or events) in order to place new instances (with unknown labels) into their respective groups or classes. Cluster analysis is an exploratory data analysis tool for solving classification problems. Association rule mining is a popular data mining method that is commonly used as an example to explain what data mining is and what it can do to a technologically less savvy audience.
8. What is the major difference between cluster analysis and classification?
Classification methods learn from previous examples containing inputs and the resulting class labels, and once properly trained they are able to classify future cases. Clustering partitions pattern records into natural segments or clusters.
What is classification?
Classification: analyzing the historical behavior of groups of entities with similar characteristics, to predict the future behavior of a new entity from its similarity to those groups
7. Give examples of situations in which cluster analysis would be an appropriate data mining technique.
Cluster algorithms are used when the data records do not have predefined class identifiers (i.e., it is not known to what class a particular record belongs).
What is clustering?
Clustering: finding groups of entities with similar characteristics
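A minimal k-means sketch shows clustering in action: points are assigned to their nearest centroid, centroids are recomputed, and the two steps repeat (the 2-D points, k=2, and first-k initialization are illustrative assumptions, not from the text):

```python
from math import dist  # Python 3.8+

def kmeans(points, k, iterations=20):
    """Minimal k-means: assign each point to its nearest centroid,
    recompute centroids as cluster means, repeat."""
    centroids = points[:k]           # naive initialization: first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters, centroids

# Two visually obvious groups of hypothetical 2-D points.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8)]
clusters, centroids = kmeans(points, 2)
print(clusters)
```

Note that no class labels are supplied anywhere: the groups emerge purely from similarity, which is what makes clustering an unsupervised method.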
2. What were the challenges of the Memphis Police Department?
Crime across the metro area was surging, there were budget pressures, and city leaders were growing impatient.
Why are there many different names and definitions for data mining?
Data mining has many names and definitions because the term has been stretched by some software vendors, who label most forms of data analysis as data mining in order to capitalize on its popularity and increase sales.
1. Define data mining. Why are there many different names and definitions for data mining?
Data mining is the process through which previously unknown patterns in data are discovered. Another definition would be "a process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge from large sets of data." This includes most types of automated data analysis. A third definition: Data mining is the process of finding mathematical patterns from (usually) large sets of data; these can be rules, affinities, correlations, trends, or prediction models.
2. Why do you think the most popular tools are developed by statistics companies?
Data mining techniques involve the use of statistical analysis and modeling. So it's a natural extension of their business offerings.
5. What type of analytics help did Cabela's get from their efforts? Can you think of any other potential benefits of analytics for large-scale retailers like Cabela's?
Using SAS data mining tools and Teradata, Cabela's analysts create predictive models to optimize customer selection for all customer contacts. Cabela's uses these prediction scores to maximize marketing spending across channels and within each customer's personal contact strategy. The clustering and association models helped the company understand the value of customers, using a five-point scale, as illustrated in this quote: "We treat all customers well, but we can develop strategies to treat higher-value customers a little better."
Why should retailers, especially omni-channel retailers, pay extra attention to advanced analytics and data mining?
Utilizing large and information-rich transactional and customer data (that they collect on a daily basis) to optimize their business processes is not a choice for large-scale retailers anymore, but a necessity to stay competitive.
2. How can data mining be used for predicting financial success of movies before the start of their production process?
Utilizing predictive models in the early stages of movie production is an effective way to minimize investment in flops.
How does Watson work?
Watson is built on the DeepQA framework. The hardware for this system involves a massively parallel processing architecture to enable simultaneous consideration of multiple interpretations and hypotheses. In terms of software, IBM's Watson was superior because of: (a) massive parallelism, with an underlying confidence-estimation subsystem that ranks and integrates candidate answers, and (b) many experts, which facilitate the integration, application, and contextual evaluation of a wide range of loosely coupled probabilistic question and content analytics.
What is categorization?
• Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.
What is clustering?
• Clustering. Grouping similar documents without having a predefined set of categories.
What is concept linking?
• Concept linking. Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods.
3. What are the most common myths about data mining?
• Data mining provides instant, crystal-ball predictions.
1. What are the main steps in the text mining process?
• Establish the Corpus: Collect and organize the domain-specific unstructured data
• Create the Term-Document Matrix: Introduce structure to the corpus
• Extract Knowledge: Discover novel patterns from the T-D matrix
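The first two steps can be sketched in a few lines: collect a corpus, then impose structure by counting term frequencies per document (the three-document corpus and the stopword list are hypothetical, chosen only to keep the matrix small):

```python
from collections import Counter

# Step 1 - establish the corpus: a hypothetical set of short documents.
corpus = [
    "data mining extracts patterns from data",
    "text mining applies data mining to text",
    "search engines crawl the web",
]

# Step 2 - create the term-document matrix (rows = terms, columns = documents).
stopwords = {"from", "to", "the"}
docs = [[w for w in doc.lower().split() if w not in stopwords] for doc in corpus]
terms = sorted(set(w for d in docs for w in d))
tdm = {t: [Counter(d)[t] for d in docs] for t in terms}

for term, counts in tdm.items():
    print(f"{term:10s} {counts}")
```

Step 3 (knowledge extraction) then applies data mining methods such as clustering or association to the rows and columns of this matrix; in practice the raw counts are usually reweighted (e.g., TF-IDF) first.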
2. What recent factors have increased the popularity of data mining?
• General recognition of the untapped value hidden in large data sources.
4. What are some popular application areas of text mining?
• Information extraction. Identification of key phrases and relationships within text by looking for predefined sequences in text via pattern matching.
• Topic tracking. Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user.
• Summarization. Summarizing a document to save time on the part of the reader.
• Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.
• Clustering. Grouping similar documents without having a predefined set of categories.
• Concept linking. Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods.
• Question answering. Finding the best answer to a given question through knowledge-driven pattern matching.
2. What are the main reasons for the recent popularity of data mining? There are seven.
• More intense competition at the global scale driven by customers' ever-changing needs and wants in an increasingly saturated marketplace
• General recognition of the untapped value hidden in large data sources
• Consolidation and integration of database records, which enables a single view of customers, vendors, transactions, etc.
• Consolidation of databases and other data repositories into a single location in the form of a data warehouse
• The exponential increase in data processing and storage technologies
• Significant reduction in the cost of hardware and software for data storage and processing
• Movement toward the de-massification (conversion of information resources into nonphysical form) of business practices
4. What are the most common tasks addressed by NLP? There are eleven.
• Question answering
• Automatic summarization
• Natural language generation
• Natural language understanding
• Machine translation
• Foreign language reading
• Foreign language writing
• Speech recognition
• Text-to-speech
• Text proofing
• Optical character recognition
5. What are the most common data mining mistakes/blunders? How can they be minimized and/or eliminated? There are ten.
• Selecting the wrong problem for data mining
• Ignoring what your sponsor thinks data mining is and what it really can and cannot do
• Leaving insufficient time for data preparation. It takes more effort than one often expects
• Looking only at aggregated results and not at individual records
• Being sloppy about keeping track of the mining procedure and results
• Ignoring suspicious findings and quickly moving on
• Running mining algorithms repeatedly and blindly. (It is important to think hard enough about the next stage of data analysis. Data mining is a very hands-on activity.)
• Believing everything you are told about data
• Believing everything you are told about your own data mining analysis
• Measuring your results differently from the way your sponsor measures them