Business Intelligence Midterm Review


22.What are the commonalities and differences between regression and correlation?

Correlation makes no a priori assumption of whether one variable is dependent on the other(s) and is not concerned with the relationship between variables; instead, it gives an estimate of the degree of association between the variables. Regression, on the other hand, attempts to describe the dependence of a response variable on one (or more) explanatory variables; it implicitly assumes that there is a one-way causal effect from the explanatory variable(s) to the response variable, regardless of whether the path of effect is direct or indirect. Also, whereas correlation is interested in the low-level relationships between two variables, regression is concerned with the relationships between all explanatory variables and the response variable.

Did DSS evolve into BI or vice versa?

DSS evolved into BI: DSS systems became more advanced in the 2000s with the addition of data warehousing capabilities and began to be referred to as Business Intelligence (BI) systems.

3. Where does the data for business analytics come from?

Data can come from a wide variety of locations. Examples can include business processes and systems, the Internet and social media, and machines or the Internet of Things.

How do you describe the importance of data in analytics? Can we think of analytics without data?

Data is the main ingredient in all forms of analytics. You cannot have analytics without data.

How is descriptive analytics different from traditional reporting?

Descriptive analytics gathers more data, often automatically. It makes results available in real time and allows reports to be customized.

What are the most common tasks addressed by NLP?

Following are among the most popular tasks:
• Question answering
• Automatic summarization
• Natural language generation
• Natural language understanding
• Machine translation
• Foreign language reading
• Foreign language writing
• Speech recognition
• Text-to-speech
• Text proofing
• Optical character recognition

Which companies are dominant in more than one category?

It appears that several larger IT companies have products and services in several of these areas. Examples include IBM, Microsoft, SAS, Dell, and SAP.

23.What is OLS? How does OLS determine the linear regression line?

The ordinary least squares (OLS) method aims to minimize the sum of squared residuals (the differences between the observed and estimated values) and leads to a mathematical expression for the estimated regression line.
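
To make the idea concrete, here is a minimal sketch of the closed-form OLS fit in Python with NumPy; the five data points are invented for illustration.

```python
# A minimal OLS sketch using NumPy; the data points are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form OLS estimates: the slope b and intercept a that minimize
# the sum of squared residuals, sum((y - (a + b*x))**2)
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()

residuals = y - (a + b * x)
print(f"y = {a:.3f} + {b:.3f}x, SSE = {(residuals ** 2).sum():.4f}")
```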

What is a web crawler? What is it used for? How does it work?

A Web crawler (also called a spider or a Web spider) is a piece of software that systematically browses (crawls through) the World Wide Web for the purpose of finding and fetching Web pages. It starts with a list of "seed" URLs, goes to the pages of those URLs, and then follows each page's hyperlinks, adding them to the search engine's database. Thus, the Web crawler navigates through the Web in order to construct the database of websites.
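
As an illustration of the seed-and-follow loop described above, here is a toy crawler sketch using only Python's standard library; the seed URL, helper names, and page limit are hypothetical, and a real crawler would add politeness rules (robots.txt, rate limiting).

```python
# A toy Web crawler sketch using only the standard library.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    queue, seen, fetched = deque(seed_urls), set(seed_urls), 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except OSError:
            continue  # skip unreachable pages
        fetched += 1
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:       # follow each page's hyperlinks
                seen.add(absolute)
                queue.append(absolute)
    return seen

# print(crawl(["https://example.com"]))  # hypothetical seed URL
```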

What is a performance measurement system? How does it work?

A performance measurement system is one component of a performance management system. The most popular performance measurement systems in use are some variant of Kaplan and Norton's balanced scorecard (BSC).

What are the three key components of a BPM system?

According to Colbert (2009), a BPM system encompasses three key components. The first is a set of integrated, closed-loop management and analytic processes (supported by technology) that addresses financial as well as operational activities. The second involves tools for businesses to define strategic goals and then measure and manage performance against those goals. And the third component involves a core set of processes, including financial and operational planning, consolidation and reporting, modeling, analysis, and monitoring of key performance indicators (KPIs), linked to organizational strategy.

46.What are the common characteristics of dashboards and other information visuals?

All well-designed dashboards share some common characteristics. They use visual components (e.g., charts, performance bars, sparklines, gauges, meters, stoplights) to highlight, at a glance, the data and exceptions that require action. They are transparent to the user, meaning that they require minimal training and are extremely easy to use. They combine data from a variety of systems into a single, summarized, unified view of the business. They enable drill-down or drill-through to underlying data sources or reports, providing more detail about the underlying comparative and evaluative context. They present a dynamic, real-world view with timely data refreshes, enabling the end user to stay up to date with any recent changes in the business. And they require little, if any, customized coding to implement, deploy, and maintain.

How do you think the discussion between privacy and data mining will progress? Why?

As technology advances and more information about people becomes easier to get, the privacy debate will adjust accordingly. People's expectations about privacy will become tempered by their desires for the benefits of data mining, from individualized customer service to higher security. As with all issues of social import, the privacy issue will include social discourse, legal and legislative decisions, and corporate decisions. The fact that companies often choose to self-regulate (e.g., by ensuring their data is de-identified) implies that we may as a society be able to find a happy medium between privacy and data mining.

Give examples of situations in which association would be an appropriate data mining technique.

Association rule mining is appropriate to use when the objective is to discover two or more items (or events or concepts) that go together.
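
For instance, a rule such as {milk} → {bread} can be scored by its support and confidence. A minimal sketch in Python, with invented market-basket transactions:

```python
# Support and confidence for one association rule; transactions are invented.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"milk", "bread", "butter"},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: {milk} -> {bread}
antecedent, consequent = {"milk"}, {"bread"}
sup = support(antecedent | consequent)
conf = sup / support(antecedent)
print(f"support={sup:.2f}, confidence={conf:.2f}")  # support=0.60, confidence=0.75
```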

List and describe the major components of BI.

BI systems have four major components: the data warehouse (with its source data), business analytics (a collection of tools for manipulating, mining, and analyzing the data in the data warehouse), business performance management (for monitoring and analyzing performance), and the user interface (e.g., a dashboard).

Compare BSC and Six Sigma as two competing performance measurement systems.

BSC is focused on improving overall strategy, whereas Six Sigma is focused on improving processes. BSC gives a longer-term view of the organization, whereas Six Sigma gives a snapshot at a particular point in time of its operational effectiveness. BSC focuses on long-term growth, whereas Six Sigma emphasizes current profitability. These are a few of the differences between the two. Some companies choose to give a more holistic performance assessment by combining elements of both approaches.

47.What are the best practices in dashboard design?

· Benchmark key performance indicators with industry standards.
· Wrap the dashboard metrics with contextual metadata (e.g., data source, data currency, refresh schedule).
· Prioritize and rank alerts/exceptions streamed to the dashboard.
· Enrich the dashboard with business users' comments.
· Present information in three different levels (visual dashboard, static report, and self-service cube).
· Pick the right visual construct using dashboard design principles.
· Provide for guided analytics.

What are the key similarities and differences between a two-tiered architecture and a three-tiered architecture?

Both provide the same user visibility through a client system that accesses a DSS/BI application remotely. The difference is behind the scenes and is invisible to the user: in a two-tiered architecture, the application and data warehouse reside on the same machine; in a three-tiered architecture, they are on separate machines.

Define BI.

Business Intelligence (BI) is an umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies. Its major objective is to enable interactive access (sometimes in real time) to data, enable manipulation of these data, and provide business managers and analysts the ability to conduct appropriate analysis.

What is business performance management? How does it relate to BI?

Business performance management (BPM) refers to the business processes, methodologies, metrics, and technologies used by enterprises to measure, monitor, and manage business performance. It is also known as corporate performance management (CPM), enterprise performance management (EPM), and strategic enterprise management (SEM). It can be considered to be a type of BI tool/technique. The most significant differentiator of BPM from any other BI tools and practices is its strategy focus. BPM encompasses a closed-loop set of processes that link strategy to execution in order to optimize business performance.

Give examples of situations in which cluster analysis would be an appropriate data mining technique.

Cluster algorithms are used when the data records do not have predefined class identifiers (i.e., it is not known to what class a particular record belongs).

How can a computer help overcome the cognitive limits of humans?

Computer-based systems are not limited in many of the ways people are, and this freedom from human cognitive limits gives them unique abilities for evaluating data. Examples include being able to store huge amounts of data, being able to run extensive numbers of scenarios and analyses, and being able to spot trends in vast datasets or models.

What things can help Web pages rank higher in the search engine results?

Cross-linking between pages of the same website to provide more links to the most important pages may improve its visibility. Writing content that includes frequently searched keyword phrases, so as to be relevant to a wide variety of search queries, will tend to increase traffic. Updating content so as to keep search engines crawling back frequently can give additional weight to a site. Adding relevant keywords to a Web page's metadata, including the title tag and metadescription, will tend to improve the relevancy of a site's search listings, thus increasing traffic. For Web pages that are accessible via multiple URLs, URL normalization using canonical link elements and redirects can help make sure links to different versions of the URL all count toward the page's link popularity score.

What are the most popular application areas for sentiment analysis? Why?

Customer relationship management (CRM) and customer experience management are popular "voice of the customer (VOC)" applications. Other application areas include "voice of the market (VOM)" and "voice of the employee (VOE)."

What is DMAIC? List and briefly describe the steps involved in DMAIC.

DMAIC is a closed loop performance improvement model that involves the following steps: define, measure, analyze, improve, and control. First, you define the goals, objectives, and boundaries of the improvement activity. Next, you measure the existing system, in order to monitor its performance against the goals. Then, you analyze the system to identify ways to eliminate the gap between the current performance of the system or process and the desired goal. This leads to improvement, which involves initiating actions to reduce these gaps. Finally, control involves modifying compensation and incentive systems, policies, procedures, manufacturing resource planning, budgets, operation instructions, or other management systems.

Why do you think the most popular tools are developed by statistics companies?

Data mining techniques involve the use of statistical analysis and modeling. So it's a natural extension of their business offerings.

What are the main data preprocessing steps? Briefly describe each step and provide relevant examples.

Data preprocessing is essential to any successful data mining study. Good data leads to good information; good information leads to good decisions. Data preprocessing includes four main steps (listed in Table 4.1 on page 167):
· Data consolidation: access, collect, select, and filter data.
· Data cleaning: handle missing values, reduce noise, fix errors.
· Data transformation: normalize the data, aggregate data, construct new attributes.
· Data reduction: reduce the number of attributes and records.

35.Why do you think there are many different types of charts and graphs?

Different types of charts are appropriate for conveying different types of information. Line graphs are good for time-series data. Bar charts are good for depicting nominal or numerical data that can be easily categorized. Pie charts should be used for depicting proportions. Scatter plots and bubble charts are good for illustrating relationships between two or three variables (bubble charts add a dimension via the size of the dot). Histograms are like bar charts, except they depict frequency distributions. Gantt charts and PERT charts are good at illustrating project timelines and task dependencies. Geographic maps, of course, show geographic information. Bullet graphs show progress toward a goal. Heat maps and highlight tables illustrate the comparison of continuous values across two categories using color. Tree maps are good for showing hierarchical information. Even though these charts and graphs cover a major part of what is commonly used in information visualization, they by no means cover it all. Nowadays, one can find many other specialized graphs and charts that serve a specific purpose.

What are the ingredients for an effective performance management system?

Effective performance management/measurement should focus on key factors. It should mix past, present, and future. Also, it should balance the needs of shareholders, employees, partners, suppliers, and other stakeholders. Performance measures should start at the top and flow to the bottom, and should involve targets that are based on research and reality rather than being arbitrary.

What are the most popular commercial data mining tools?

Examples of these vendors include IBM (IBM SPSS Modeler), SAS (Enterprise Miner), StatSoft (Statistica Data Miner), KXEN (Infinite Insight), Salford (CART, MARS, TreeNet, RandomForest), Angoss (KnowledgeSTUDIO, KnowledgeSeeker), and Megaputer (PolyAnalyst). Most of the more popular tools are developed by the largest statistical software companies (SPSS, SAS, and StatSoft).

How can you measure the impact of social media analytics?

First, determine what your social media goals are. From there, you can apply analysis tools ranging from descriptive analytics and social network analysis to advanced analytics (predictive analytics and text mining that examine the content of online conversations), and ultimately prescriptive analytics tools.

What are some major data mining methods and algorithms?

Generally speaking, data mining tasks can be classified into three main categories: prediction, association, and clustering. Based on the way in which the patterns are extracted from the historical data, the learning algorithms of data mining methods can be classified as either supervised or unsupervised. With supervised learning algorithms, the training data includes both the descriptive attributes (i.e., independent variables or decision variables) as well as the class attribute (i.e., output variable or result variable). In contrast, with unsupervised learning the training data includes only the descriptive attributes. Figure 4.3 (p. 157) shows a simple taxonomy for data mining tasks, along with the learning methods, and popular algorithms for each of the data mining tasks.
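
A miniature sketch of the supervised/unsupervised distinction, assuming scikit-learn is installed; the toy records and labels are invented:

```python
# Supervised vs. unsupervised learning in miniature (assumes scikit-learn).
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

X = [[25, 40000], [32, 60000], [47, 82000], [51, 90000], [23, 35000], [45, 78000]]
y = ["no", "yes", "yes", "yes", "no", "yes"]  # class attribute (output variable)

# Supervised: training data includes descriptive attributes AND the class attribute.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[30, 55000]]))

# Unsupervised: training data includes only the descriptive attributes.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)
```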

10.Why is the original/raw data not readily usable by analytics tasks?

It is often dirty, misaligned, overly complex, and inaccurate.

What are the sources of Big Data?

Major sources include clickstreams from Web sites, postings on social media, and data from traffic, sensors, and the weather.

Identify and discuss the role of middleware tools.

Middleware tools enable access to the data warehouse. Power users such as analysts may write their own SQL queries. Others may access data through a managed query environment. There are many front-end applications that business users can use to interact with data stored in the data repositories, including data mining, OLAP, reporting tools, and data visualization tools. All these have their own data access requirements. Those may not match with how a given data warehouse must be accessed. Middleware translates between the two.

45.List and describe the three layers of information portrayed on dashboards. The three layers of information found in most dashboards are:

· Monitoring. Graphical, abstracted data to monitor key performance metrics.
· Analysis. Summarized dimensional data to analyze the root cause of problems.
· Management. Detailed operational data that identify what actions to take to resolve a problem.

What is Six Sigma? How is it used as a performance measurement system?

Most companies use Six Sigma as a process improvement methodology that enables them to scrutinize their processes, pinpoint problems, and apply remedies. It's not used much as a performance management or measurement methodology. As a performance tool, it is aimed at reducing the number of defects in a business process to as close to zero DPMO (defects per million opportunities) as possible.

What are some of the benefits and challenges of NLP?

NLP moves beyond syntax-driven text manipulation (which is often called "word counting") to a true understanding and processing of natural language that considers grammatical and semantic constraints as well as the context. The challenges include:
· Part-of-speech tagging. It is difficult to mark up terms in a text as corresponding to a particular part of speech because the part of speech depends not only on the definition of the term but also on the context within which it is used.
· Text segmentation. Some written languages, such as Chinese, Japanese, and Thai, do not have single-word boundaries.
· Word sense disambiguation. Many words have more than one meaning. Selecting the meaning that makes the most sense can only be accomplished by taking into account the context within which the word is used.
· Syntactic ambiguity. The grammar for natural languages is ambiguous; multiple possible sentence structures often need to be considered.

Define OLTP.

OLTP (online transaction processing) is a type of computer processing where the computer responds immediately to user requests. Each request is considered to be a transaction, which is a computerized record of a discrete event, such as the receipt of inventory or a customer order.
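
A minimal OLTP-style transaction sketch using Python's standard-library sqlite3 module; the inventory table and quantities are invented:

```python
# One OLTP transaction: a discrete event such as a receipt of inventory.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, qty INTEGER)")
con.execute("INSERT INTO inventory VALUES ('widget', 100)")

try:
    with con:  # commits on success, rolls back automatically on error
        con.execute("UPDATE inventory SET qty = qty + 25 WHERE item = 'widget'")
except sqlite3.Error:
    pass  # the transaction was rolled back; the data stays consistent

print(con.execute("SELECT qty FROM inventory").fetchone())  # (125,)
```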

What processing technique is applied to process Big Data?

One computer, even a powerful one, could not handle the scale of Big Data. The solution is to push computation to the data, using the MapReduce programming paradigm.
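
To illustrate the paradigm (not any particular Big Data platform), here is a word-count sketch of the map-shuffle-reduce phases in plain Python; real systems such as Hadoop run these phases in parallel across many machines, pushing the computation to where the data resides:

```python
# A word-count sketch of the MapReduce programming paradigm.
from itertools import groupby
from operator import itemgetter

documents = ["big data needs big clusters", "data moves to computation no more"]

# Map: emit (key, value) pairs from each input split.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all pairs by key.
mapped.sort(key=itemgetter(0))
grouped = groupby(mapped, key=itemgetter(0))

# Reduce: aggregate the values for each key.
counts = {word: sum(v for _, v in pairs) for word, pairs in grouped}
print(counts)  # {'big': 2, 'clusters': 1, 'computation': 1, ...}
```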

What is an ODS?

An operational data store (ODS) is the database from which a business operates on an ongoing basis.

What is predictive analytics? How can organizations employ predictive analytics?

Predictive analytics is the use of statistical techniques and data mining to determine what is likely to happen in the future. Businesses use predictive analytics to forecast whether customers are likely to switch to a competitor, what customers are likely to buy, how likely customers are to respond to a promotion, and whether a customer is creditworthy. Sports teams have used predictive analytics to identify the players most likely to contribute to a team's success.

How has the Web influenced data warehouse design?

Primarily by making Web-based data warehousing possible.

What are the most popular free data mining tools? Why are they gaining overwhelming popularity (especially R)?

Probably the most popular free and open source data mining tool is Weka. Others include RapidMiner and Microsoft's SQL Server. Their popularity continues to grow because of their availability, features, and user communities. R remains especially popular as a default language for analytics because of its rich feature set for data manipulation.

What are the major DW implementation tasks that can be performed in parallel?

Reeves (2009) and Solomon (2005) provided some guidelines regarding the critical questions that must be asked, some risks that should be weighed, and some processes that can be followed to help ensure a successful data warehouse implementation. They compiled a list of 11 major tasks that could be performed in parallel:
· Establishment of service-level agreements and data-refresh requirements
· Identification of data sources and their governance policies
· Data quality planning
· Data model design
· ETL tool selection
· Relational database software and platform selection
· Data transport
· Data conversion
· Reconciliation process
· Purge and archive planning
· End-user support

What is scalability? How does it apply to DW?

Scalability refers to the degree to which a system can adjust to changes in demand without major additional changes or investments. DW scalability issues are the amount of data in the warehouse, how quickly the warehouse is expected to grow, the number of concurrent users, and the complexity of user queries. A data warehouse must scale both horizontally and vertically. The warehouse will grow as a function of data growth and the need to expand the warehouse to support new business functionality. Data growth may be a result of the addition of current cycle data (e.g., this month's results) and/or historical data.

What are the main steps in the text mining process?

See Figure 5.6 (p. 222). Text mining entails three tasks:
· Establish the corpus: Collect and organize the domain-specific unstructured data.
· Create the term-document matrix: Introduce structure to the corpus.
· Extract knowledge: Discover novel patterns from the T-D matrix.
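
A sketch of the corpus-to-term-document-matrix step, assuming scikit-learn is available; the three one-line "documents" are invented:

```python
# Building a term-document matrix from a tiny corpus (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "data mining discovers patterns in data",
    "text mining introduces structure to text",
    "knowledge is extracted from the matrix",
]

vectorizer = CountVectorizer()             # tokenizes and counts terms
tdm = vectorizer.fit_transform(corpus)     # documents x terms, sparse matrix
print(vectorizer.get_feature_names_out())  # the term vocabulary
print(tdm.toarray())                       # rows = documents, columns = term counts
```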

What is sentiment analysis? How does it relate to text mining?

Sentiment analysis tries to answer the question, "What do people feel about a certain topic?" by digging into opinions of many using a variety of automated tools. It is also known as opinion mining, subjectivity analysis, and appraisal extraction. Sentiment analysis shares many characteristics and techniques with text mining. However, unlike text mining, which categorizes text by conceptual taxonomies of topics, sentiment classification generally deals with two classes (positive versus negative), a range of polarity (e.g., star ratings for movies), or a range in strength of opinion.

Why is strategy the most important part of a BPM implementation?

Strategy is the art and the science of crafting decisions that help businesses achieve their goals. More specifically, it is the process of identifying and stating the organization's mission, vision, and objectives. Business strategy provides an overall direction to the enterprise, which is why it is so important.

Is it better to be the strongest player in one category or be active in multiple categories?

Student opinions will vary. It can be argued that cross-discipline strength provides better integration and insight, or that domination in multiple areas reduces competition or innovation.

What do you think are the reasons for these myths about data mining?

Students' answers will differ. Some answers might relate to fear of analytics, fear of the unknown, or fear of looking dumb.

What is text analytics? How does it differ from text mining?

Text analytics is a concept that includes information retrieval (e.g., searching and identifying relevant documents for a given set of key terms) as well as information extraction, data mining, and Web mining. By contrast, text mining is primarily focused on discovering new and useful knowledge from textual data sources. The overarching goal for both text analytics and text mining is to turn unstructured textual data into actionable information through the application of natural language processing (NLP) and analytics. However, text analytics is a broader term because of its inclusion of information retrieval. You can think of text analytics as a combination of information retrieval plus text mining.

Define Gini index. What does it measure?

The Gini index and information gain (entropy) are two popular ways to determine branching choices in a decision tree. The Gini index measures the purity of a sample. If everything in a sample belongs to one class, the Gini index value is zero.
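
A small Python function that computes the Gini index of a sample of class labels; the example labels are invented:

```python
# Gini index of a sample of class labels: 0 means a perfectly pure sample.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["yes", "yes", "yes", "yes"]))  # 0.0 -- everything in one class
print(gini(["yes", "no", "yes", "no"]))    # 0.5 -- maximally mixed (two classes)
```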

19.What is a box-and-whiskers plot? What types of statistical information does it represent?

The box-and-whiskers plot is a graphical illustration of several descriptive statistics about a given data set. The box plot shows the centrality, the dispersion, and the minimum and maximum ranges.
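
A minimal example of producing a box-and-whiskers plot, assuming matplotlib is installed; the data set is invented:

```python
# A box-and-whiskers plot (assumes matplotlib); the data set is invented.
import matplotlib.pyplot as plt

data = [7, 12, 14, 15, 15, 16, 18, 19, 21, 22, 24, 35]

plt.boxplot(data)  # box = quartiles (centrality/dispersion), whiskers = range,
                   # points beyond the whiskers = potential outliers
plt.title("Box-and-whiskers plot")
plt.show()
```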

24.What are the most commonly pronounced assumptions for linear regression?

The most commonly pronounced assumptions for linear regression include linearity, independence (of errors), normality (of errors), constant variance, and little or no multicollinearity among the explanatory variables.

What is Big Data analytics?

The term Big Data refers to data that cannot be stored in a single storage unit. Typically, the data is arriving in many different forms, be they structured, unstructured, or in a stream. Big Data analytics is analytics on a large enough scale, with fast enough processing, to handle this kind of data.

What are the three main areas of Web mining?

The three main areas of Web mining are Web content mining, Web structure mining, and Web usage (or activity) mining.

26.What is time series? What are the main forecasting techniques for time series data?

Time series forecasting is the use of mathematical modeling to predict future values of the variable of interest based on previously observed values. The main forecasting techniques for time series data include averaging methods (such as simple and weighted moving averages), exponential smoothing, and trend/regression analysis.
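
As an illustration, two of these techniques sketched in Python; the monthly demand figures are invented:

```python
# Moving average and exponential smoothing on an invented demand series.
demand = [120, 132, 128, 141, 150, 147, 158]

# Simple moving average over a 3-period window
window = 3
sma = sum(demand[-window:]) / window

# Exponential smoothing: forecast = alpha * actual + (1 - alpha) * prior forecast
alpha, forecast = 0.3, demand[0]
for actual in demand[1:]:
    forecast = alpha * actual + (1 - alpha) * forecast

print(f"3-period moving average: {sma:.1f}, exponential smoothing: {forecast:.1f}")
```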

What are the characteristics of Big Data?

Today Big Data refers to almost any kind of large data that has the characteristics of volume, velocity, and variety. Examples include data about Web searches, such as the billions of Web pages searched by Google, and data about financial trading, which operates in the order of microseconds.

What is Web content mining? How can it be used for competitive advantage?

Web content mining refers to the extraction of useful information from Web pages. The documents may be extracted in some machine-readable format so that automated techniques can generate some information about the Web pages. Collecting and mining Web content can be used for competitive intelligence (collecting intelligence about competitors' products, services, and customers), which can give your organization a competitive advantage.

Why do we need to define separate objectives, measures, targets, and initiatives for each of the four BSC perspectives?

BSC is designed to overcome the limitations of systems that are financially focused. An organization's vision and strategy should recognize the interrelation between financial and nonfinancial objectives, measures, targets, and initiatives. Therefore, nonfinancial objectives form a simple causal chain with "learning and growth" driving "internal business process" change, which produces "customer" outcomes that are responsible for reaching a company's "financial" objectives.

8. Can we use the same data representation for all analytics models? Why or why not?

No. Other data types, including textual, spatial, imagery, video, and voice, need to be converted into some form of categorical or numeric representation before they can be processed by analytics methods.

4. In your opinion, what are the top three data-related challenges for better analytics?

Opinions will vary, but examples of challenges include data reliability, accuracy, accessibility, security, richness, consistency, timeliness, granularity, validity, and relevance.

What are the three types of data generated through Web page visits?

· Automatically generated data stored in server access logs, referrer logs, agent logs, and client-side cookies
· User profiles
· Metadata, such as page attributes, content attributes, and usage data

Describe the major components of a data warehouse.

· Data sources. Data are sourced from operational systems and possibly from external data sources.
· Data extraction and transformation. Data are extracted and properly transformed using custom-written or commercial software called ETL.
· Data loading. Data are loaded into a staging area, where they are transformed and cleansed. The data are then ready to load into the data warehouse.
· Comprehensive database. This is the EDW that supports decision analysis by providing relevant summarized and detailed information.
· Metadata. Metadata are maintained for access by IT personnel and users. Metadata include rules for organizing data summaries that are easy to index and search.
· Middleware tools. Middleware tools enable access to the data warehouse from a variety of front-end applications.

List the alternative data warehousing architectures.

· Independent data marts architecture
· Data mart bus architecture with linked dimensional data marts
· Hub-and-spoke architecture (corporate information factory)
· Centralized data warehouse architecture
· Federated architecture

Give examples of situations in which association would be an appropriate data mining technique. Examples include the following:

· Sales transactions
· Credit card transactions
· Banking services
· Insurance service products
· Telecommunication services
· Medical records

What are the most common myths about data mining?

• Data mining provides instant, crystal-ball predictions.
• Data mining is not yet viable for business applications.
• Data mining requires a separate, dedicated database.
• Only those with advanced degrees can do data mining.
• Data mining is only for large firms that have lots of customer data.

What issues should be considered when deciding which architecture to use in developing a data warehouse? List the 10 most important factors.

1. Information interdependence between organizational units
2. Upper management's information needs
3. Urgency of need for a data warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data warehouse prior to implementation
7. Compatibility with existing systems
8. Perceived ability of the in-house IT staff
9. Technical issues
10. Social/political factors

28.What is a business report? What are the main characteristics of a good business report?

A business report is a written document that contains information regarding business matters. Business reporting (also called enterprise reporting) is an essential part of the larger drive toward improved managerial decision making and organizational knowledge management. The foundation of these reports is various sources of data coming from both inside and outside the organization. Creation of these reports involves ETL (extract, transform, and load) procedures in coordination with a data warehouse and then using one or more reporting tools. While reports can be distributed in print form or via e-mail, they are typically accessed via a corporate intranet. Primary characteristics of a good business report include clarity, brevity, completeness, and correctness.

31.What are the main components of a business reporting system?

A business reporting system includes several components. One is the online transaction processing system (ERP, POS, etc.) that records transactions. A second is a data supply that takes recorded events and transactions and delivers them to the reporting system. Next comes an ETL component that ensures quality and performs necessary transformations prior to loading the data into a data store. Then there is the data storage itself (such as a data warehouse). Business logic converts the data into the reporting outputs. Publication distributes or hosts the reports for end users. And finally assurance provides a quality control check on the reports and their dissemination.

38.Find two more kinds of charts that are not covered in this section and comment on their usability.

A concept map is a diagram that shows relationships between concepts, usually showing specific ideas and information as boxes and using arrows to connect them. Concept maps are often used by designers and engineers to organize ideas. Another type of chart is an organization chart (or org-chart). This is a hierarchical, tree-structured chart that shows how an organization is structured and how its parts and jobs are related. A motion chart is like a bubble chart in that it depicts data on dimensions of the x-axis, y-axis, size, and color of bubbles. In addition, however, it is also animated, so that bubbles move and resize themselves over time.

What is a data warehouse?

A data warehouse is defined in this section as "a pool of data produced to support decision making." This focuses on the essentials, leaving out characteristics that may vary from one DW to another but are not essential to the basic concept. The same paragraph gives another definition: "a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process." This definition adds more specifics, but in every case appropriately: it is hard, if not impossible, to conceive of a data warehouse that would not be subject-oriented, integrated, etc.

What is a DW? How can data warehousing technology help to enable analytics?

A data warehouse, introduced in Section 1.7, is the component of a BI system that contains the source data. As described in this section, developing a data warehouse usually includes development of the data infrastructure for descriptive analytics—that is, consolidation of data sources and making relevant data available in a form that enables appropriate reporting and analysis. A data warehouse serves as the basis for developing appropriate reports, queries, alerts, and trends.

Briefly describe the general algorithm used in decision trees.

A general algorithm for building a decision tree is as follows:
1. Create a root node and assign all of the training data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of the split. Split the data into mutually exclusive (non-overlapping) subsets along the lines of the specific split and move them to the branches.
4. Repeat steps 2 and 3 for each and every leaf node until a stopping criterion is reached (e.g., the node is dominated by a single class label).
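
A compact Python sketch of this recursion, using the Gini index (covered elsewhere in this review) to select the best split; the records and helper names are invented for illustration:

```python
# A toy recursive decision-tree builder; records and helpers are invented.
from collections import Counter

def gini(rows):
    n = len(rows)
    counts = Counter(r[-1] for r in rows)  # last column is the class label
    return 1 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows):
    # Step 2: choose the attribute/value split that most reduces impurity.
    best = None
    for col in range(len(rows[0]) - 1):
        for val in {r[col] for r in rows}:
            left = [r for r in rows if r[col] == val]
            right = [r for r in rows if r[col] != val]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, col, val, left, right)
    return best

def build(rows):
    # Step 4: stop when the node is dominated by a single class label.
    if gini(rows) == 0:
        return rows[0][-1]
    split = best_split(rows)
    if split is None:  # no useful split remains; return the majority class
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    _, col, val, left, right = split
    # Step 3: branch into mutually exclusive (non-overlapping) subsets.
    return {(col, val): build(left), (col, "not " + str(val)): build(right)}

records = [["sunny", "hot", "no"], ["rainy", "mild", "yes"],
           ["sunny", "mild", "no"], ["cloudy", "hot", "yes"]]
print(build(records))
```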

42.What is a high-powered visual analytics environment? Why do we need it?

A high-powered visualization environment is one in which high-performance, in-memory solutions are applied to exploring massive amounts of data in a very short time (almost instantaneously). Due to the increasing demand for visual analytics coupled with fast-growing data volumes, there is an ever-growing need to invest in highly efficient visualization systems. SAS Visual Analytics is an example of such an environment. These systems help to empower larger numbers of users, solve complex problems more quickly, and improve collaboration and information sharing. By enabling end-users, IT staff are freed up. In addition, these tools allow for growth at a self-determined pace.

27.What is a report? What are reports used for?

A report is any communication artifact prepared with the specific intention of conveying information in a presentable form to whoever needs it, whenever and wherever they may need it. It is usually a document that contains information (usually driven from data and personal experiences) organized in a narrative, graphic, and/or tabular form, prepared periodically (recurring) or on an as-required (ad hoc) basis, referring to specific time periods, events, occurrences, or subjects.

What is a search engine? Why are they important for today's businesses?

A search engine is a software program that searches for documents (Internet sites or files) based on the keywords (individual words, multi-word terms, or a complete sentence) that users have provided that have to do with the subject of their inquiry. This is the most prominent type of information retrieval system for finding relevant content on the Web. Search engines have become the centerpiece of most Internet-based transactions and other activities. Because people use them extensively to learn about products and services, it is very important for companies to have prominent visibility on the Web.

What is a social network? What is social network analysis?

A social network is a social structure composed of individuals/people (or groups of individuals or organizations) linked to one another with some type of connections/relationships. Social network analysis (SNA) is the systematic examination of social networks. Dating back to the 1950s, social network analysis is an interdisciplinary field that emerged from social psychology, sociology, statistics, and graph (network) theory.

Is data mining a new discipline? Explain.

Although the term data mining is relatively new, the ideas behind it are not. Many of the techniques used in data mining have their roots in traditional statistical analysis and artificial intelligence work done since the early part of the 1980s. New or increased use of data mining applications makes it seem like data mining is a new discipline. In general, data mining seeks to identify four major types of patterns: associations, predictions, clusters, and sequential relationships. These types of patterns have been manually extracted from data by humans for centuries, but the increasing volume of data in modern times has created a need for more automatic approaches. As datasets have grown in size and complexity, direct manual data analysis has increasingly been augmented with indirect, automatic data processing tools that use sophisticated methodologies, methods, and algorithms. The manifestation of such evolution of automated and semiautomated means of processing large datasets is now commonly referred to as data mining.

What is clickstream analysis? What is it used for?

Analysis of the information collected by Web servers can help us better understand user behavior; this analysis is often called clickstream analysis. By using data and text mining techniques, a company might be able to discern interesting patterns from the clickstreams.

List three of the terms that have been predecessors of analytics.

Analytics has evolved from other systems over time, including decision support systems (DSS), operations research (OR) models, and expert systems (ES).

What are the major application areas for data mining?

Applications are listed near the beginning of this section (pp. 160-161): CRM, banking, retailing and logistics, manufacturing and production, brokerage and securities trading, insurance, computer hardware and software, government and defense, travel, healthcare, medicine, entertainment, homeland security and law enforcement, and sports.

Define modeling from the analytics perspective.

As Application Case 1.6 illustrates, analytics uses descriptive data to create models of how people, equipment, or other variables operate in the real world. These models can be used in predictive and prescriptive analytics to develop forecasts, recommendations, and decisions.

What are some promising text mining applications in biomedicine?

As in any other experimental approach, it is necessary to analyze the vast amount of data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.

How can analytics aid in objective decision making?

As noted in the analysis of Application Case 1.4, problem solving in organizations has tended to be subjective, and decision makers tend to rely on familiar processes. The result is that future decisions are no better than past decisions. Analytics builds on historical data and takes into account changing conditions to arrive at fact-based solutions that decision makers might not have considered.

Is it a good idea to follow a hierarchy of descriptive and predictive analytics before applying prescriptive analytics?

As noted in the analysis of Application Case 1.5, it is important in any analytics project to understand the business domain and current state of the business problem. This requires analysis of historical data, or descriptive analytics. Although the chapter does not discuss a hierarchy of analytics, students may observe that testing a model with predictive analytics could logically improve prescriptive use of the model.

What is a balanced scorecard? Where did it come from?

The balanced scorecard (BSC) was first developed by Kaplan and Norton in their 1992 Harvard Business Review article, "The Balanced Scorecard: Measures That Drive Performance." It is a performance management system whose key feature is that it does not rely solely on financial measures of success. Over the past few years, BSC has become a generic term that is used to represent virtually every type of scorecard application and implementation, but it is intended to emphasize a strategic focus.

List and briefly define the phases in the CRISP-DM process.

CRISP-DM provides a systematic and orderly way to conduct data mining projects. The process has six steps. First, an understanding of the data and an understanding of the business issues to be addressed are developed concurrently. Next, data are prepared for modeling; models are then built and tested with the prepared data. The results of the models are evaluated against the business objectives, and finally the validated models are deployed. The process is highly iterative, with feedback loops between most of the steps.

Give examples of situations in which classification would be an appropriate data mining technique. Give examples of situations in which regression would be an appropriate data mining technique.

Classification is for prediction that can be based on historical data and relationships, such as predicting the weather, product demand, or a student's success in a university. If what is being predicted is a class label (e.g., "sunny," "rainy," or "cloudy") the prediction problem is called a classification, whereas if it is a numeric value (e.g., temperature such as 68°F), the prediction problem is called a regression.

Identify at least three of the main data mining methods.

Classification learns patterns from past data (a set of information—traits, variables, features—on characteristics of the previously labeled items, objects, or events) in order to place new instances (with unknown labels) into their respective groups or classes. The objective of classification is to analyze the historical data stored in a database and automatically generate a model that can predict future behavior. Cluster analysis is an exploratory data analysis tool for solving classification problems. The objective is to sort cases (e.g., people, things, events) into groups, or clusters, so that the degree of association is strong among members of the same cluster and weak among members of different clusters. Association rule mining is a popular data mining method that is commonly used as an example to explain what data mining is and what it can do to a technologically less savvy audience. Association rule mining aims to find interesting relationships (affinities) between variables (items) in large databases.

What is the major difference between cluster analysis and classification?

Classification methods learn from previous examples containing inputs and the resulting class labels, and once properly trained, they are able to classify future cases. Clustering partitions pattern records into natural segments or clusters.

List and briefly define the four most commonly cited operational areas for KPIs.

· Customer performance. Metrics for customer satisfaction, speed and accuracy of issue resolution, and customer retention.
· Service performance. Metrics for service-call resolution rates, service renewal rates, service-level agreements, delivery performance, and return rates.
· Sales operations. New pipeline accounts, sales meetings secured, conversion of inquiries to leads, and average call closure time.
· Sales plan/forecast. Metrics for price-to-purchase accuracy, purchase order-to-fulfillment ratio, quantity earned, forecast-to-plan ratio, and total closed contracts.

13.Why do we need data transformation? What are the commonly used data transformation tasks?

Data transformation is often needed to ensure that the data is in a format in which it can be used for analysis. During data transformation, the data is normalized and discretized, and new attributes are created.
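
A small sketch of two of these tasks, min-max normalization and discretization (binning), with invented income values and bin boundaries:

```python
# Min-max normalization and discretization; values and bins are invented.
incomes = [28000, 41000, 56000, 75000, 90000]

lo, hi = min(incomes), max(incomes)
normalized = [(v - lo) / (hi - lo) for v in incomes]  # rescale to [0, 1]

def discretize(value):
    # construct a new categorical attribute from a numeric one
    return "low" if value < 40000 else "medium" if value < 80000 else "high"

print(normalized)
print([discretize(v) for v in incomes])
```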

What was the primary difference between the systems called MIS, DSS, and executive support systems?

Many systems have been used in the past and present to provide analytics. Management information systems (MIS) provided reports on various aspects of business functions using captured information, while decision support systems (DSS) added the ability to use data with models to address unstructured problems. Executive support systems (ESS) added to these abilities by capturing understanding from experts and integrating it into systems via if-then-else rules or heuristics.

44.What are the graphical widgets commonly used in dashboards? Why?

Dashboards can include many kinds of visual widgets, including charts, performance bars, sparklines, gauges, meters, stoplights, geographic maps, etc. These help to highlight, at a glance, the data and exceptions that require action. A picture tells a thousand words, and through the use of many graphical widgets, a dashboard can convey a wealth of information to decision makers in a short time.

Describe data integration.

Data integration is an umbrella term that covers three processes that combine to move data from multiple sources into a data warehouse: accessing the data, combining different views of the data, and capturing changes to the data.

Define data mining. Why are there many different names and definitions for data mining?

Data mining is the process through which previously unknown patterns in data are discovered. Another definition would be "a process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge from large sets of data." This includes most types of automated data analysis. A third definition: data mining is the process of finding mathematical patterns from (usually) large sets of data; these can be rules, affinities, correlations, trends, or prediction models. Data mining has many names and definitions partly because some software vendors have stretched the term beyond its original limits to include most forms of data analysis, in order to increase sales by riding the popularity of data mining.

14.Data reduction can be applied to rows (sampling) and/or columns (variable selection). Which is more challenging?

Data reduction as it applies to variable selection is more complex. This is because variables to be studied must be selected and others discarded. This is typically done by individuals who are experts in the field.

6. What is data? How does data differ from information and knowledge?

Data refers to a collection of facts usually obtained as the result of experiments, observations, transactions, or experiences. Data may consist of numbers, letters, words, images, voice recordings, and so on, as measurements of a set of variables. Data is a raw commodity and does not become information or knowledge until after it is processed.

What is OLAP and how does it differ from OLTP?

Data stored in a data warehouse can be analyzed using techniques referred to as OLAP. OLAP is one of the most commonly used data analysis techniques in data warehouses. OLAP is an approach to quickly answer ad hoc questions that require data analysis. OLTP is concerned with the capture and storage of data. OLAP is concerned with the analysis of that data.

What are the privacy issues in data mining?

Data that is collected, stored, and analyzed in data mining often contains information about real people. This includes identification, demographic, financial, personal, and behavioral information. Most of these data can be accessed through some third-party data providers. In order to maintain the privacy and protection of individuals' rights, data mining professionals have ethical (and often legal) obligations.

32.What is data visualization? Why is it needed?

Data visualization, perhaps more appropriately called "information visualization," is the use of visual representations to explore, make sense of, and communicate data. It is closely related to the fields of information graphics, scientific visualization, and statistical graphics. What is portrayed in visualizations is the information (aggregations, summarizations, and contextualization) and not the data. Companies and individuals increasingly rely on data to make good decisions. Because data is so voluminous, there is a need for visual tools that help people understand it.

What is descriptive analytics? What are the various tools that are employed in descriptive analytics?

Descriptive analytics refers to knowing what is happening in the organization and understanding some underlying trends and causes of such occurrences. Tools used in descriptive analytics include data warehouses and visualization applications.

16.What are the main differences between descriptive and inferential statistics?

Descriptive statistics is all about describing the sample data on hand, and inferential statistics is about drawing inferences or conclusions about the characteristics of the population.

List the benefits of data warehouses.

Direct benefits include:
· Allowing end users to perform extensive analysis in numerous ways.
· A consolidated view of corporate data (i.e., a single version of the truth).
· Better and more timely information.
· A data warehouse permits information processing to be offloaded from costly operational systems onto low-cost servers.

What steps can an organization take to ensure the security and confidentiality of customer data in its data warehouse?

Effective security in a data warehouse should focus on four main areas:
Step 1. Establishing effective corporate and security policies and procedures. An effective security policy should start at the top and be communicated to everyone in the organization.
Step 2. Implementing logical security procedures and techniques to restrict access. This includes user authentication, access controls, and encryption.
Step 3. Limiting physical access to the data center environment.
Step 4. Establishing an effective internal control review process for security and privacy.

Describe the three steps of the ETL process.

Extraction: selecting data from one or more sources and reading the selected data.
Transformation: converting the data from their original form into whatever form the DW needs. This step often also includes cleansing the data to remove as many errors as possible.
Load: putting the converted (transformed) data into the DW.
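
A toy end-to-end sketch of the three steps, where Python lists and dicts stand in for real source systems and the warehouse; the rows and the cleansing rule are invented:

```python
# A toy ETL pipeline mirroring the extract, transform, and load steps.
source = [
    {"id": 1, "amount": "150.00", "region": " east "},
    {"id": 2, "amount": "n/a",    "region": "WEST"},
]

# Extraction: select and read data from the source (and drop bad rows)
extracted = [row for row in source if row["amount"] != "n/a"]

# Transformation: convert the data into the form the DW needs
transformed = [
    {"id": r["id"], "amount": float(r["amount"]), "region": r["region"].strip().lower()}
    for r in extracted
]

# Load: put the transformed data into the warehouse (a plain list here)
warehouse = []
warehouse.extend(transformed)
print(warehouse)
```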

What recent factors have increased the popularity of data mining?

Following are some of the most pronounced reasons:
• More intense competition at the global scale driven by customers' ever-changing needs and wants in an increasingly saturated marketplace.
• General recognition of the untapped value hidden in large data sources.
• Consolidation and integration of database records, which enables a single view of customers, vendors, transactions, etc.
• Consolidation of databases and other data repositories into a single location in the form of a data warehouse.
• The exponential increase in data processing and storage technologies.
• Significant reduction in the cost of hardware and software for data storage and processing.
• Movement toward the de-massification (conversion of information resources into nonphysical form) of business practices.

What recent technologies may shape the future of data warehousing? Why?

Following are some of the recently popularized concepts and technologies that will play a significant role in defining the future of data warehousing.
Sourcing: acquisition of data from diverse and dispersed sources
· Web, social media, and Big Data
· Open source software
· SaaS (software as a service), "the extended ASP model"
· Cloud computing
Infrastructure: architectural (hardware and software) enhancements
· Columnar (a new way to store and access data in the database)
· Real-time data warehousing
· Data warehouse appliances (all-in-one solutions to DW)
· Data management technologies and practices
· In-database processing technology (putting the algorithms where the data is)
· In-memory storage technology (moving the data into memory for faster processing)
· New database management systems
· Advanced analytics
As the world of business becomes more global and complex, the need for business intelligence and data warehousing tools becomes more prominent. Fast-improving information technology tools and techniques seem to be moving in the right direction to address the needs of future business intelligence systems.

What is meant by social analytics? Why is it an important business topic?

From a philosophical perspective, social analytics focuses on a theoretical object called a "socius," a kind of "commonness" that is neither a universal account nor a communality shared by every member of a body. Thus, social analytics in this sense attempts to articulate the differences between philosophy and sociology. From a BI perspective, social analytics involves "monitoring, analyzing, measuring and interpreting digital interactions and relationships of people, topics, ideas and content." In this perspective, social analytics involves mining the textual content created in social media (e.g., sentiment analysis, natural language processing) and analyzing socially established networks (e.g., influencer identification, profiling, prediction). This is an important business topic because it helps companies gain insight about existing and potential customers' current and future behaviors, and about the likes and dislikes toward a firm's products and services.

List some of the implementation topics addressed by Gartner's report.

Gartner's framework decomposes planning and execution into business, organization, functionality, and infrastructure components. At the business and organizational levels, strategic and operational objectives must be defined while considering the available organizational skills to achieve those objectives. Issues of organizational culture surrounding BI initiatives and building enthusiasm for those initiatives and procedures for the intra-organizational sharing of BI best practices must be considered by upper management—with plans in place to prepare the organization for change.

37.Why would you use a geographic map? What other types of charts can be combined with a geographic map?

Geographic maps are useful when the data set includes any kind of location data, including addresses, postal codes, state names or abbreviations, country names, latitude/longitude, or some type of custom geographic encoding. Maps can be used in conjunction with other charts and graphs. For instance, one can use maps to show distribution of customer service requests by product type (depicted in pie charts) by geographic locations.

List some other success factors of BI.

If the company's strategy is properly aligned with the reasons for DW and BI initiatives, and if the company's IS organization is or can be made capable of playing its role in such a project, and if the requisite user community is in place and has the proper motivation, it is wise to start BI and establish a BI Competency Center (BICC) within the company. The center could serve some or all of the following functions:
· The center can demonstrate how BI is clearly linked to strategy and execution of strategy.
· A center can serve to encourage interaction between the potential business user communities and the IS organization.
· The center can serve as a repository and disseminator of best BI practices between and among the different lines of business.
· Standards of excellence in BI practices can be advocated and encouraged throughout the company.
· The IS organization can learn a great deal through interaction with the user communities, such as knowledge about the variety of types of analytical tools that are needed.
· The business user community and IS organization can better understand why the data warehouse platform must be flexible enough to provide for changing business requirements.
· It can help important stakeholders like high-level executives see how BI can play an important role.
Another important success factor of BI is its ability to facilitate a real-time, on-demand agile environment.

What are some of the key system-oriented trends that have fostered IS-supported decision making to a new level?

Improvements and innovation in systems in many areas have facilitated the growth of decision-making systems. These areas include:
· Group communication and collaboration software and systems
· Improved data management applications and techniques
· Data warehouses and Big Data for information collection
· Analytical support systems
· Growth in processing and storage capabilities
· Knowledge management systems
· Always-on availability of all of these systems

How can text mining be used in security and counterterrorism?

In 2007, EUROPOL developed an integrated system capable of accessing, storing, and analyzing vast amounts of structured and unstructured data sources in order to track transnational organized crime. Another security-related application of text mining is in the area of deception detection.

What is the meaning of and motivation for balance in BSC?

In BSC, the term "balance" arises because the combined set of measures is supposed to encompass indicators that are financial and nonfinancial, leading and lagging, internal and external, quantitative and qualitative, and both short term and long term.

12.What does it mean to clean/scrub the data? What activities are performed in this phase?

In this step, problematic values in the data set (noise, outliers, and missing values) are identified and dealt with. The analyst will identify noisy values in the data and smooth them out, as well as address any missing values.
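
A minimal pandas sketch of this phase; the column name and all values are invented for illustration:

import numpy as np
import pandas as pd

# Hypothetical raw data with a missing value and one noisy outlier
df = pd.DataFrame({"income": [42000, 45000, np.nan, 47000, 9900000]})

# Handle missing values: here, impute with the median
df["income"] = df["income"].fillna(df["income"].median())

# Smooth noisy values: cap anything outside 1.5 x IQR (a common outlier rule)
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income"] = df["income"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)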

43.What is an information dashboard? Why are they so popular for BI software tools?

Information dashboards provide visual displays of important information that is consolidated and arranged on a single screen so that information can be digested at a single glance and easily drilled in and further explored. They are common components of most, if not all, performance management systems, performance measurement systems, BPM software suites, and BI platforms. Dashboards pack a lot of information into a single screen, which is one reason for their popularity.

List some capabilities of information systems that can facilitate managerial decision making.

Information systems can aid decision making because they have the ability to perform functions that allow for better communication and information capture, better storage and recall of data, and vastly improved analytical models that can be more voluminous or more precise.

5. What are the most common metrics that make for analytics-ready data?

It must be relevant to the problem at hand and meet the quality/quantity requirements. It also has to have a certain data structure in place with key fields/variables with properly normalized values and conform to organizational definitions.

25.What is logistic regression? How does it differ from linear regression?

Logistic regression is a very popular, statistically sound, probability-based classification algorithm that employs supervised learning. It differs from linear regression in one major respect: its output (response variable) is a class as opposed to a numerical variable.
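
A minimal scikit-learn sketch; the two input columns and class labels are invented toy values:

from sklearn.linear_model import LogisticRegression

# Toy data: two inputs (e.g., age, support calls) and a binary class (e.g., churn)
X = [[25, 1], [40, 3], [35, 2], [52, 8], [23, 1], [60, 9]]
y = [0, 1, 0, 1, 0, 1]

model = LogisticRegression().fit(X, y)
print(model.predict([[30, 2]]))        # predicted class label (0 or 1)
print(model.predict_proba([[30, 2]]))  # class membership probabilities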

17.List and briefly define the central tendency measures of descriptive statistics.

Measures of centrality are the mathematical methods by which we estimate or describe the central positioning of a given variable of interest. A measure of central tendency is a single numerical value that aims to describe a set of data by identifying or estimating the central position within the data. The arithmetic mean (or simply mean or average) is the sum of all the values/observations divided by the number of observations in the data set. The median is the center value in a given data set: the number in the middle of the data once it has been arranged/sorted in order of magnitude (either ascending or descending). The mode is the observation that occurs most frequently (the most frequent value in the data set).
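
All three measures can be computed with Python's standard library; the values below are invented for illustration:

import statistics

data = [3, 5, 5, 8, 9, 12, 5]

print(statistics.mean(data))    # arithmetic mean: sum of values / number of values
print(statistics.median(data))  # middle value of the sorted data
print(statistics.mode(data))    # most frequently occurring value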

18.List and briefly define the dispersion measures of descriptive statistics.

Measures of dispersion are the mathematical methods used to estimate or describe the degree of variation in a given variable of interest. The range is the difference between the largest and the smallest values in a given data set. Variance is a method used to calculate the deviation of all data points in a given data set from the mean. The standard deviation is a measure of the spread of values within a set of data; it is calculated by taking the square root of the variance. Mean absolute deviation is calculated by taking the absolute values of the differences between each data point and the mean and averaging them. Quartiles help us identify spread within a subset of the data. A quartile is a quarter of the number of data points given in a data set. Quartiles are determined by first sorting the data and then splitting the sorted data into four disjoint smaller data sets.
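
A small numpy sketch of these dispersion measures, using the same invented values:

import numpy as np

data = np.array([3, 5, 5, 8, 9, 12, 5])

print(data.max() - data.min())              # range: largest minus smallest
print(data.var(ddof=1))                     # sample variance
print(data.std(ddof=1))                     # standard deviation = sqrt(variance)
print(np.mean(np.abs(data - data.mean())))  # mean absolute deviation
print(np.percentile(data, [25, 50, 75]))    # quartiles Q1, Q2 (median), Q3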

Explain the importance of metadata.

Metadata, "data about data," are the means through which applications and users access the content of a data warehouse, through which its security is managed, and through which organizational management manages, in the true sense of the word, its information assets. Most database management systems would be unable to function without at least some metadata. Indeed, the use of metadata, which enable data access through names and logical relationships rather than physical locations, is fundamental to the very concept of a DBMS. Metadata are essential to any database, not just a data warehouse. (See answer to Review Question 2 of this section above.)

What is NLP?

Natural language processing (NLP) is an important component of text mining and is a subfield of artificial intelligence and computational linguistics. It studies the problem of "understanding" the natural human language, with the view of converting depictions of human language (such as textual documents) into more formal representations (in the form of numeric and symbolic data) that are easier for computer programs to manipulate.

9. What is a 1-of-N data representation? Why and where is it used in analytics?

Nominal or ordinal variables are converted into numeric representations using some type of 1-of-N pseudo variables (e.g., a categorical variable with three unique values can be transformed into three pseudo variables with binary values—1 or 0). This allows it to be used in predictive analytics.
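
A minimal pandas sketch of 1-of-N (one-hot) encoding, with an invented nominal variable:

import pandas as pd

# A nominal variable with three unique values becomes three binary pseudo variables
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
print(pd.get_dummies(df, columns=["color"], dtype=int))
# Resulting columns: color_blue, color_green, color_red, each coded 1 or 0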

Define OLAP.

OLAP (online analytical processing) is processing for end-user ad hoc reports, queries, and analysis. Separating the OLTP from analysis and decision support provided by OLAP enables the benefits of BI that were described earlier and provides for competitive intelligence and advantage as described next.

What would be the expected benefits and beneficiaries of sentiment analysis in politics?

Opinions matter a great deal in politics. Because political discussions are dominated by quotes, sarcasm, and complex references to persons, organizations, and ideas, politics is one of the most difficult, and potentially fruitful, areas for sentiment analysis. By analyzing the sentiment on election forums, one may predict who is more likely to win or lose. Sentiment analysis can help understand what voters are thinking and can clarify a candidate's position on issues. Sentiment analysis can help political organizations, campaigns, and news analysts to better understand which issues and positions matter the most to voters. The technology was successfully applied by both parties to the 2008 and 2012 American presidential election campaigns.

What are the major data mining processes?

Similar to other information systems initiatives, a data mining project must follow a systematic project management process to be successful. Several data mining processes have been proposed: CRISP-DM, SEMMA, and KDD.

What are the two common methods for polarity identification? Explain.

Polarity identification can be done via a lexicon (as a reference library) or by using a collection of training documents and inductive machine learning algorithms. The lexicon approach uses a catalog of words, their synonyms, and their meanings, combined with numerical ratings indicating the position on the N-P polarity associated with these words. In this way, affective, emotional, and attitudinal phrases can be classified according to their degree of positivity or negativity. By contrast, the training-document approach uses statistical analysis and machine learning algorithms, such as neural networks, clustering approaches, and decision trees to ascertain the sentiment for a new text document based on patterns from previous "training" documents with assigned sentiment scores.
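
A minimal sketch of the lexicon approach with a tiny invented word-polarity catalog; real systems use large lexicons (e.g., SentiWordNet) and also handle negation and intensifiers:

# Tiny invented lexicon: word -> score on the negative-positive (N-P) polarity scale
LEXICON = {"excellent": 1.0, "good": 0.5, "poor": -0.5, "terrible": -1.0}

def polarity(text: str) -> float:
    """Average the polarity scores of the lexicon words found in the text."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(polarity("The service was good but the food was terrible"))  # -0.25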

33.What are the historical roots of data visualization?

Predecessors to data visualization date back to the second century AD. Today's most popular visual forms date back a few centuries. Geographical exploration, mathematics, and popularized history spurred the creation of early maps, graphs, and timelines as far back as the 1600s. The now familiar line and bar charts date back to the late 1700s. Charles Joseph Minard used visualizations to graphically portray the losses suffered by Napoleon's army in the Russian campaign of 1812. The 1900s saw the rise of a more formal, empirical attitude toward visualization, which tended to focus on aspects such as color, value scales, and labeling. In the 2000s, the Internet emerged as a new medium for visualization and added interactivity to previously static graphics.

What are the key differences between the major data mining methods?

In data mining terminology, the major methods are defined as follows:
· Prediction: the act of telling about the future. It differs from simple guessing by taking into account experiences, opinions, and other relevant information in conducting the task of foretelling. A term commonly associated with prediction is forecasting. Even though many believe these two terms are synonymous, there is a subtle but critical difference between them: prediction is largely experience and opinion based, whereas forecasting is data and model based. That is, in order of increasing reliability, the relevant terms are guessing, predicting, and forecasting. In data mining terminology, prediction and forecasting are used synonymously, and the term prediction is used as the common representation of the act.
· Classification: analyzing the historical behavior of groups of entities with similar characteristics to predict the future behavior of a new entity from its similarity to those groups.
· Clustering: finding groups of entities with similar characteristics.
· Association: establishing relationships among items that occur together.
· Sequence discovery: finding time-based associations.
· Visualization: presenting results obtained through one or more of the other methods.
· Regression: a statistical estimation technique based on fitting a curve defined by a mathematical equation of known type but unknown parameters to existing data.
· Forecasting: estimating a future data value based on past data values.

What is prescriptive analytics? What kind of problems can be solved by prescriptive analytics?

Prescriptive analytics is a set of techniques that use descriptive data and forecasts to identify the decisions most likely to result in the best performance. Usually, an organization uses prescriptive analytics to identify the decisions or actions that will optimize the performance of a system. Organizations have used prescriptive analytics to set prices, create production plans, and identify the best locations for facilities such as bank branches.

What is "search engine optimization"? Who benefits from it?

Search engine optimization (SEO) is the intentional activity of affecting the visibility of an e-commerce site or a website in a search engine's natural (unpaid or organic) search results. It involves editing a page's content, HTML, metadata, and associated coding to both increase its relevance to specific keywords and to remove barriers to the indexing activities of search engines. In addition, SEO efforts include promoting a site to increase its number of inbound links. SEO primarily benefits companies with e-commerce sites by making their pages appear toward the top of search engine lists when users query.

Which data warehousing architecture is the best? Why?

See Table 3.1 Average Assessment Scores for the Success of the Architectures. What is interesting is the similarity of the averages for the bus, hub-and-spoke, and centralized architectures. The differences are sufficiently small that no claims can be made for a particular architecture's superiority over the others, at least based on a simple comparison of these success measures.

Why is the ETL process so important for data warehousing efforts?

Since ETL is the process through which data are loaded into a data warehouse, a DW could not exist without it. The ETL process also contributes to the quality of the data in a DW.

What is SVD? How is it used in text mining?

Singular value decomposition (SVD), which is closely related to principal components analysis, reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower dimensional space, where each consecutive dimension represents the largest degree of variability (between words and documents) possible.
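
A minimal numpy sketch of SVD applied to a hypothetical term-document matrix (all counts invented):

import numpy as np

# Hypothetical input matrix: 4 documents x 6 extracted terms
tdm = np.array([
    [2, 0, 1, 0, 0, 3],
    [0, 1, 0, 2, 1, 0],
    [1, 0, 2, 0, 0, 1],
    [0, 2, 0, 1, 3, 0],
])

U, s, Vt = np.linalg.svd(tdm, full_matrices=False)
k = 2                       # keep the two dimensions with the most variability
docs_2d = U[:, :k] * s[:k]  # each document projected into the reduced space
print(docs_2d)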

20.What are the two most commonly used shape characteristics to describe a data distribution?

Skewness is a measure of asymmetry in a unimodal distribution of the data, that is, one in which only one peak exists. Kurtosis is another measure used to characterize the shape of a unimodal distribution; it describes the peakedness of the distribution, that is, how tall and skinny (or flat and wide) the peak is.
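
Both shape measures can be computed with scipy; the values below are invented:

from scipy.stats import kurtosis, skew

data = [3, 5, 5, 8, 9, 12, 5]

print(skew(data))      # > 0: right-skewed; < 0: left-skewed; near 0: symmetric
print(kurtosis(data))  # excess kurtosis: > 0 means a taller/skinnier peak than normal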

What is social media analytics? What are the reasons behind its increasing popularity?

Social media analytics refers to the systematic and scientific ways to consume the vast amount of content created by Web-based social media outlets, tools, and techniques for the betterment of an organization's competitiveness. Data includes anything posted in a social media site. The increasing popularity of social media analytics stems largely from the similarly increasing popularity of social media together with exponential growth in the capacities of text and Web analytics technologies.

What is social media? How does it relate to Web 2.0?

Social media refers to the enabling technologies of social interactions among people in which they create, share, and exchange information, ideas, and opinions in virtual communities and networks. It is a group of Internet-based software applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content.

15.What is the relationship between statistics and business analytics?

Statistics can be used as a part of business analytics, either to help generate the analyses that underlie business reports or as a format for presenting results.

How does a data warehouse differ from a database?

Technically a data warehouse is a database, albeit with certain characteristics to facilitate its role in decision support. Specifically, however, it is (see previous question) an "integrated, time-variant, nonvolatile, subject-oriented repository of detail and summary data used for decision support and business analytics within an organization." These characteristics, which are discussed further in the section just after the definition, are not necessarily true of databases in general—though each could apply individually to a given one. As a practical matter most databases are highly normalized, in part to avoid update anomalies. Data warehouses are highly denormalized for performance reasons. This is acceptable because their content is never updated, just added to. Historical data are static.

Identify at least five specific applications of data mining and list five common characteristics of these applications.

This question expands on the prior question by asking for common characteristics. Several such applications and their characteristics are listed on pp. 160-161.

Why is the popularity of text mining as a BI tool increasing?

The popularity of text mining as a BI tool is increasing because of the rapid growth in text data and the availability of sophisticated BI tools. The benefits of text mining are obvious in the areas where very large amounts of textual data are being generated, such as law (court orders), academic research (research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), and marketing (customer comments).

List and briefly discuss some of the text mining applications in marketing.

Text mining can be used to increase cross-selling and up-selling by analyzing the unstructured data generated by call centers. Text mining has become invaluable for customer relationship management. Companies can use text mining to analyze rich sets of unstructured text data, combined with the relevant structured data extracted from organizational databases, to predict customer perceptions and subsequent purchasing behavior.

What is text mining? How does it differ from data mining?

Text mining is the application of data mining to unstructured, or less structured, text files. As the names indicate, text mining analyzes words, whereas data mining analyzes structured, numeric data.

How does NLP relate to text mining?

Text mining uses natural language processing to induce structure into the text collection and then uses data mining algorithms such as classification, clustering, association, and sequence discovery to extract knowledge from it.

List and briefly describe the four phases of the BPM cycle.

The BPM cycle contains four main phases. First is to strategize. This involves answering the question, "Where do we want to go?", and involves a high-level, long-term plan. Missions, visions, and objectives are key components of this phase. The second phase is to plan, which answers the question, "How do we get there?" Key elements here are a detailed operational plan and a financial plan including budget. The next phase is to monitor and analyze, which answers the question, "How are we doing?" Here is where KPIs, dashboards, reporting, and analytics are helpful. Finally come action and adjustment, based on comparing our analysis results against our plans. Sometimes this means changing the way we operate, and sometimes it means adjusting our strategy.

41.Why should storytelling be a part of your reporting and data visualization?

The central idea of business reporting is to tell a story. Everyone who has data to analyze has stories to tell, whether it's diagnosing the reasons for manufacturing defects, selling a new idea in a way that captures the imagination of your target audience, or informing colleagues about a particular customer service improvement program. Stories bring life to data and facts. They can help you make sense and order out of a disparate collection of facts. They make it easier to remember key points and can paint a vivid picture of what the future can look like. Stories also create interactivity—people put themselves into stories and can relate to the situation. People will be much more engaged and receptive if information is presented to them in a story format.

29.Describe the cyclic process of management and comment on the role of business reports.

The cyclic process of management, as illustrated in Figure 2.1, involves these steps: data acquisition leads to information generation which leads to decision making which leads to business process management. Perhaps the most critical task in this cyclic process is the reporting (i.e., information generation)—converting data from different sources into actionable information.

Describe the data warehousing process.

The data warehousing process consists of the following steps: 1. Data are imported from various internal and external sources. 2. Data are cleansed and organized consistently with the organization's needs. 3. Either (a) data are loaded into the enterprise data warehouse, or (b) data are loaded into data marts. 4. Either (a) data marts are created as subsets of the EDW, or (b) the data marts are consolidated into the EDW. 5. Analyses are performed as needed.

What are the main steps in carrying out sentiment analysis projects?

The first step when performing sentiment analysis of a text document is called sentiment detection, during which text data is differentiated between fact and opinion (objective vs. subjective). This is followed by negative-positive (N-P) polarity classification, where a subjective text item is classified on a bipolar range. Following this comes target identification (identifying the person, product, event, etc. that the sentiment is about). Finally come collection and aggregation, in which the overall sentiment for the document is calculated based on the calculations of sentiments of individual phrases and words from the first three steps.

What are some of the methods for cluster analysis?

The most commonly used clustering algorithms are k-means and self-organizing maps.
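
A minimal k-means sketch using scikit-learn, with invented two-dimensional points:

from sklearn.cluster import KMeans

# Toy points forming two visible groups
X = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [8, 9]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # the two cluster centroids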

What are the four perspectives that BSC suggests to view organizational performance?

The four perspectives are: customer, financial, internal business processes, and learning and growth. Customer: if customers are not satisfied, they will eventually find other suppliers that will meet their needs, so poor performance from this perspective is a leading indicator of future decline, even though the current financial picture may look good. Financial: timely and accurate funding data will always be a priority, and managers will do whatever is necessary to provide it, which should include risk analysis. Learning and growth: in the current climate of rapid technological change, it is becoming necessary for knowledge workers to be in a continuous learning and growing mode. Internal business processes: metrics based on this perspective allow managers to know how well their internal business processes and functions are running, and whether the outcomes of those processes (i.e., products and services) meet and exceed customer requirements (the mission).

34.What do you think is the "next big thing" in data visualization?

The future of data/information visualization is very hard to predict. We can only extrapolate from what has already been invented: more three-dimensional visualization, more immersive experiences with multidimensional data in virtual reality environments, and holographic visualization of information. There is a good chance that, before the end of this decade, we will see something invented in the information visualization realm that we have never seen before.

39.What are the main reasons for the recent emergence of visual analytics?

The growth of visual analytics correlates with the growth of analytics in general. More BI and analytics vendors are becoming aware that their customers require quick and preferably interactive visualizations, not just for their normal reporting systems, but also to illustrate predictive and prescriptive decision-making information. Many of the information visualization vendors are adding the capabilities to call themselves visual analytics solution providers. Conversely, analytics solution providers such as SAS are embedding their analytics capabilities into a high-performance data visualization environment that they call visual analytics.

What are the distinguishing features of KPIs?

The key features described in the book are strategy, targets, ranges, encodings, time frames, and benchmarks. KPIs embody strategic objectives and measure performance against specific targets, based on specified ranges of values. Encodings provide visual cues (e.g., color) to indicate how close or far from a target we are on a particular metric. Benchmarks provide something to compare against.

7. What are the main categories of data? What types of data can we use for BI and analytics?

The main categories of data are structured data and unstructured data. Both of these types of data can be used for business intelligence and analytics, although it is easier and more expedient to use structured data.

What are the main knowledge extraction methods from corpus?

The main categories of knowledge extraction methods are classification, clustering, association, and trend analysis.

11.What are the main data preprocessing steps?

The main data preprocessing steps include data consolidation, data cleaning, data transformation, and data reduction.

How does CRISP-DM differ from SEMMA?

The main difference between CRISP-DM and SEMMA is that CRISP-DM takes a more comprehensive approach—including understanding of the business and the relevant data—to data mining projects, whereas SEMMA implicitly assumes that the data mining project's goals and objectives along with the appropriate data sources have been identified and understood.

What are the main differences between commercial and free data mining software tools?

The main difference between commercial tools, such as Enterprise Miner and Statistica, and free tools, such as Weka and RapidMiner, is computational efficiency. The same data mining task involving a rather large dataset may take significantly longer to complete with the free software, and in some cases it may not even be feasible (i.e., the tool may crash due to inefficient use of computer memory).

What is a performance management system? Why do we need one?

The purpose of a performance management system is to (a) identify and articulate the strategic mission, goals, and objectives of an organization, and (b) assist managers in tracking the implementations of business strategy by comparing actual results against these strategic goals and objectives. The latter task is accomplished by a performance measurement system, which can be considered a subset of the overall performance management system. A performance measurement system typically comprises systematic methods of setting business goals together with periodic feedback reports that indicate progress against goals. This is a key and necessary element of the BPM process.

What is the reason for normalizing word frequencies? What are the common methods for normalizing word frequencies?

The raw indices need to be normalized in order to have a more consistent TDM for further analysis. Common methods are log frequencies, binary frequencies, and inverse document frequencies.
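
A small numpy sketch of these normalization schemes on a hypothetical term-document matrix; counts are invented, and exact formulas vary across implementations:

import numpy as np

# Hypothetical raw term counts: 3 documents x 4 terms
tdm = np.array([[2, 0, 5, 1],
                [0, 3, 1, 0],
                [1, 1, 0, 4]])

log_freq = np.log1p(tdm)         # log frequencies (log(1 + count), one common variant)
binary = (tdm > 0).astype(int)   # binary frequencies: term present (1) or absent (0)

# Inverse document frequency: down-weight terms that appear in many documents
n_docs = tdm.shape[0]
docs_with_term = (tdm > 0).sum(axis=0)
idf = np.log(n_docs / docs_with_term)
tf_idf = tdm * idf               # a simple tf-idf weighting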

Define analytics.

The term analytics largely replaces terminology that referred to the individual components of a decision support system with one broad word that encompasses business intelligence. More precisely, analytics is the process of developing actionable decisions or recommendations for actions based upon insights generated from historical data. Students may also refer to the eight levels of analytics and this simpler descriptive language: "looking at all the data to understand what is happening, what will happen, and how to make the best of it."

30.List and describe the three major categories of business reports.

There are a wide variety of business reports, which for managerial purposes can be grouped into three major categories: metric management reports, dashboard-type reports, and balanced scorecard-type reports. Metric management reports involve outcome-oriented metrics based on service level agreements and/or key performance indicators. Dashboard-type reports present a range of performance indicators on one page, with both static/predefined elements and customizable widgets and views. Balanced scorecard reports present an integrated view of a company's health and include financial, customer, business process, and learning/growth perspectives.

List the 11 categories of players in the analytics ecosystem.

These categories include: · Data Generation Infrastructure Providers · Data Management Infrastructure Providers · Data Warehouse Providers · Middleware Providers · Data Service Providers · Analytics Focused Software Developers · Application Developers: Industry Specific or General · Analytics Industry Analysts and Influencers · Academic Institutions and Certification Agencies · Regulators and Policy Makers · Analytics User Organizations

What are the differences and commonalities between dashboards and scorecards?

These terms are often used interchangeably, and they share many common features. The main difference is that scorecards are used by executives, managers, and staff to monitor strategic alignment and success with strategic objectives and targets. By contrast, dashboards are used at the operational and tactical levels. Managers, supervisors, and operators use operational dashboards to monitor detailed operational performance on a weekly, daily, or even hourly basis.

40.What is the difference between information visualization and visual analytics?

Visual analytics is the combination of visualization and predictive analytics. While information visualization is aimed at answering "what happened" and "what is happening" and is closely associated with business intelligence (routine reports, scorecards, and dashboards), visual analytics is aimed at answering "why is it happening," "what is more likely to happen," and is usually associated with business analytics (forecasting, segmentation, and correlation analysis).

What is Web mining? How does it differ from regular data mining or text mining?

Web mining is the discovery and analysis of interesting and useful information from the Web and about the Web, usually through Web-based tools. It differs from regular data mining, which works primarily on structured, numeric data, and is closely related to text mining, which is less structured because it is based on words instead of numeric data.

What is Web structure mining? How does it differ from Web content mining?

Web structure mining is the process of extracting useful information from the links embedded in Web documents. By contrast, Web content mining involves analysis of the specific textual content of web pages. So, Web structure mining is more related to navigation through a website, whereas Web content mining is more related to text mining and the document hierarchy of a particular web page.

What are the main applications of Web mining?

· Determine the lifetime value of clients. · Design cross-marketing strategies across products. · Evaluate promotional campaigns. · Target electronic ads and coupons at user groups based on user access patterns. · Predict user behavior based on previously learned rules and users' profiles. · Present dynamic information to users based on their interests and profiles.

Differentiate among a DM, an ODS, and an EDW.

An ODS (Operational Data Store) is the database from which a business operates on an ongoing basis. Both an EDW and a data mart (DM) are data warehouses. An EDW (Enterprise Data Warehouse) is an all-encompassing DW that covers all subject areas of interest to the entire organization. A data mart is a smaller DW designed around one problem, organizational function, topic, or other suitable focus area.

List several criteria for selecting a data warehouse vendor, and describe why they are important.

Six important criteria are: financial strength, ERP linkages, qualified consultants, market share, industry experience, and established partnerships. These are important to indicate that a vendor is likely to be in business for the long term, to have the support capabilities its customers need, and to provide products that interoperate with other products the potential user has or may obtain. One could add others, such as product functionality (Does it do what we need?), vendor strategic vision (Does their direction make sense for our future plans and/or is it consistent with industry trends?), and quality of customer references (What do their existing customers think of them?).

21.What is regression, and what statistical purpose does it serve?

Regression is a relatively simple statistical technique to model the dependence of a variable (response or output variable) on one (or more) explanatory (input) variables.
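
A minimal numpy sketch of fitting such a line by least squares; all values are invented toy data:

import numpy as np

# Toy data: response y depends roughly linearly on one explanatory variable x
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Fit y = b0 + b1*x; polyfit returns the slope first, then the intercept
b1, b0 = np.polyfit(x, y, deg=1)
print(f"y = {b0:.2f} + {b1:.2f} * x")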

36.What are the main differences among line, bar, and pie charts? When should you use one over the others?

Line graphs are good for time-series data. Bar charts are good for depicting nominal or numerical data that can be easily categorized. Pie charts should be used for depicting proportions. You shouldn't use pie charts if the number of categories is very large.

What are the benefits of a data warehouse?

· Better and more timely information. Information processing can be offloaded from costly operational systems; therefore, end-user information requests can be processed more quickly. · Enhanced system performance. A data warehouse frees production processing because some operational system reporting requirements are moved to DSS. · Simplification of data access. Indirect benefits arise when end users take advantage of these direct benefits.

What are some of the challenges associated with natural language processing?

· Syntactic ambiguity. The grammar for natural languages is often ambiguous; that is, multiple possible sentence structures often need to be considered. Choosing the most appropriate structure usually requires a fusion of semantic and contextual information. · Imperfect or irregular input. Foreign or regional accents and vocal impediments in speech and typographical or grammatical errors in texts make the processing of the language an even more difficult task. · Speech acts. A sentence can often be considered an action by the speaker. The sentence structure alone may not contain enough information to define this action.

What is a cube? What do drill down, roll up, and slice and dice mean?

The main operational structure in OLAP is based on a concept called cube. A cube in OLAP is a multidimensional data structure (actual or virtual) that allows fast analysis of data. Using OLAP, an analyst can navigate through the database and screen for a particular subset of the data (and its progression over time) by changing the data's orientations and defining analytical calculations. These types of user-initiated navigation of data through the specification of slices (via rotations) and drill down/up (via aggregation and disaggregation) are sometimes called "slice and dice." Commonly used OLAP operations include slice and dice, drill down, roll up, and pivot. · Slice: A slice is a subset of a multidimensional array (usually a two-dimensional representation) corresponding to a single value set for one (or more) of the dimensions not in the subset. · Dice: The dice operation is a slice on more than two dimensions of a data cube. · Drill Down/Up: Drilling down or up is a specific OLAP technique whereby the user navigates among levels of data ranging from the most summarized (up) to the most detailed (down).
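
A rough sketch of the cube idea using a pandas pivot table; the data are invented, and real OLAP engines operate on dedicated multidimensional structures:

import pandas as pd

# Hypothetical sales records with three dimensions and one measure
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "amount":  [100, 150, 80, 120, 90, 110],
})

# Cube-like structure: amount aggregated by region x product x quarter
cube = sales.pivot_table(values="amount", index=["region", "product"],
                         columns="quarter", aggfunc="sum", fill_value=0)

print(cube.xs("East", level="region"))     # "slice": fix one dimension at one value
print(cube.groupby(level="region").sum())  # "roll up": aggregate away product detail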

What skills should a DWA possess? Why?

· Familiarity with high-performance hardware, software, and networking technologies, since the data warehouse is based on those · Solid business insight, to understand the purpose of the DW and its business justification · Familiarity with business decision-making processes to understand how the DW will be used · Excellent communication skills, to communicate with the rest of the organization

2. Considering the new and broad definition of business analytics, what are the main inputs and outputs to the analytics continuum?

Because of the broader definition of business analytics, almost any data from almost any source can be considered an input. In the same way, after analytics has been performed, output can take a wide variety of forms depending on the specific business purpose.

When developing a successful data warehouse, what are the most important risks and issues to consider and potentially avoid?

· Starting with the wrong sponsorship chain · Setting expectations that you cannot meet · Engaging in politically naive behavior · Loading the data warehouse with information just because it is available · Believing that data warehousing database design is the same as transactional database design · Choosing a data warehouse manager who is technology oriented rather than user oriented · Focusing on traditional internal record-oriented data and ignoring the value of external data and of text, images, and, perhaps, sound and video · Delivering data with overlapping and confusing definitions · Believing promises of performance, capacity, and scalability · Believing that your problems are over when the data warehouse is up and running · Focusing on ad hoc data mining and periodic reporting instead of alerts

Give examples of companies in each of the 11 types of players.

Examples of companies by area include: · Data Generation Infrastructure Providers (Sports Sensors, Zepp, Shockbox, Advantech B+B SmartWorx, Garmin, Sensys Networks, Intel, Microsoft, Google, IBM, Cisco, Smartbin, SIKO Products, Omega Engineering, Apple, and SAP) · Data Management Infrastructure Providers (Dell, NetApp, IBM, Oracle, Teradata, Microsoft, Amazon (Amazon Web Services), IBM (Bluemix), Salesforce.com, Hadoop clusters, MapReduce, NoSQL, Spark, Kafka, Flume) · Data Warehouse Providers (IBM, Oracle, Teradata, Snowflake, Redshift, SAS, Tableau) · Middleware Providers (MicroStrategy, Plum, Oracle, SAP, IBM, SAS, Tableau, and many more) · Data Service Providers (Nielsen, Experian, Omniture, Comscore, Google, Equifax, TransUnion, Acxiom, Merkle, Epsilon, Avention, ESRI.org) · Analytics Focused Software Developers (Microsoft, Tableau, SAS, Gephi, IBM, KXEN, Dell, Salford Systems, Revolution Analytics, Alteryx, RapidMiner, KNIME, Rulequest, NeuroDimensions, FICO, AIMMS, AMPL, Frontline, GAMS, Gurobi, Lindo Systems, Maximal, NGData, Ayata, Rockwell, Simio, Palisade, Exsys, XpertRule, Teradata, Apache, Tibco, Informatica, SAP, Hitachi) · Application Developers: Industry Specific or General (IBM, SAS, Teradata, Nike, Sportvision, Acxiom, FICO, Experian, YP.com, Towerdata, Qualia, Simulmedia, Shazam, SoundHound, Musixmatch, Waze, Apple, Google, Amazon, Uber, Lyft, Curb, Ola, Facebook, Twitter, LinkedIn, Unmetric, Smartbin) · Analytics Industry Analysts and Influencers (Gartner Group, The Data Warehousing Institute, Forrester, McKinsey, INFORMS, AIS, Teradata, SAS) · Academic Institutions and Certification Agencies (IBM, Microsoft, MicroStrategy, Oracle, SAS, Tableau, Teradata, INFORMS) · Regulators and Policy Makers (Federal Communications Commission, Federal Trade Commission, International Telecommunication Union, National Institute of Standards and Technology) · Analytics User Organizations (many topic-specific and local groups)

What are some popular application areas of text mining?

· Information extraction. Identification of key phrases and relationships within text by looking for predefined sequences in text via pattern matching. · Topic tracking. Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user. · Summarization. Summarizing a document to save time on the part of the reader. · Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes. · Clustering. Grouping similar documents without having a predefined set of categories. · Concept linking. Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods. · Question answering. Finding the best answer to a given question through knowledge-driven pattern matching.

What are the most common data mining mistakes/blunders? How can they be minimized and/or eliminated?

· Selecting the wrong problem for data mining · Ignoring what your sponsor thinks data mining is and what it really can and cannot do · Leaving insufficient time for data preparation. It takes more effort than one often expects · Looking only at aggregated results and not at individual records · Being sloppy about keeping track of the mining procedure and results · Ignoring suspicious findings and quickly moving on · Running mining algorithms repeatedly and blindly. (It is important to think hard enough about the next stage of data analysis. Data mining is a very hands-on activity.) · Believing everything you are told about data · Believing everything you are told about your own data mining analysis · Measuring your results differently from the way your sponsor measures them Ways to minimize these risks are basically the reverse of these items.

List and discuss the most pronounced DW implementation guidelines.

· Senior management must support development of the data warehouse. The DW needs a project champion at a high position in the organization chart. Benefits of a DW project may be difficult to measure, so management support makes it more likely the project will receive funding. · Web-based data warehouses may need special security requirements. These ensure that only authorized users have access to the data. · Users should participate in the development process. Their participation is essential for data modeling and access modeling. User participation ensures that the DW includes the needed data and that decision makers can retrieve the data they need. · DW implementation requires certain skills from members of the development team: in-depth knowledge of database technology and the development tools used.

What are commonly used Web analytics metrics? What is the importance of metrics?

There are four main categories of Web analytics metrics: · Website usability: How were they using my website? These involve page views, time on site, downloads, click map, and click paths. · Traffic sources: Where did they come from? These include referral websites, search engines, direct, offline campaigns, and online campaigns. · Visitor profiles: What do my visitors look like? These include keywords, content groupings, geography, time of day, and landing page profiles. · Conversion statistics: What does all this mean for the business? Metrics include new visitors, returning visitors, leads, sales/conversions, and abandonments. These metrics are important because they provide access to a lot of valuable marketing data, which can be leveraged for better insights to grow your business and better document your ROI. The insight and intelligence gained from Web analytics can be used to effectively manage the marketing efforts of an organization and its various products or services.

List and briefly define at least two classification techniques.

• Decision tree analysis. Decision tree analysis (a machine-learning technique) is arguably the most popular classification technique in the data mining arena. • Statistical analysis. Statistical classification techniques include logistic regression and discriminant analysis, both of which make the assumptions that the relationships between the input and output variables are linear in nature, the data is normally distributed, and the variables are not correlated and are independent of each other. • Case-based reasoning. This approach uses historical cases to recognize commonalities in order to assign a new case into the most probable category. • Bayesian classifiers. This approach uses probability theory to build classification models based on the past occurrences that are capable of placing a new instance into a most probable class (or category). • Genetic algorithms. The use of the analogy of natural evolution to build directed search-based mechanisms to classify data samples. • Rough sets. This method takes into account the partial membership of class labels to predefined categories in building models (collection of rules) for classification problems.
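
As a small illustration of the first technique, a decision tree classifier can be fit in a few lines with scikit-learn; the inputs and labels below are invented toy values:

from sklearn.tree import DecisionTreeClassifier

# Toy training data: two inputs (e.g., age, income) and a binary class label
X = [[25, 40000], [45, 90000], [30, 52000], [50, 110000], [22, 30000]]
y = [0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.predict([[35, 60000]]))  # class assigned to a new case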

What are some of the main challenges the Web poses for knowledge discovery?

• The Web is too big for effective data mining. • The Web is too complex. • The Web is too dynamic. • The Web is not specific to a domain. • The Web has everything.

What are some of the criteria for comparing and selecting the best classification technique?

• The amount and availability of historical data • The types of data (categorical, interval, ratio, etc.) • What is being predicted (class or numeric value) • The purpose or objective

