acc 271 exam 2

What are the main steps in the text mining process?

1. establish the corpus 2. create the term-document matrix 3. extract knowledge

List the alternative data warehousing architectures discussed in this section

1. Independent data mart architecture 2. Data mart bus architecture with linked dimensional data marts 3. Hub-and-spoke architecture (Corporate Information Factory) 4. Centralized data warehouse architecture 5. Federated architecture

What are some of the criteria for comparing and selecting the best classification technique?

1. predictive accuracy 2. speed 3. robustness 4. scalability 5. interpretability

What are the most common tasks addressed by NLP?

1. question answering 2. automatic summarization 3. natural language generation 4. natural language understanding 5. machine translation 6. foreign language reading 7. foreign language writing 8. speech recognition 9. text-to-speech 10. text proofing 11. optical character recognition

KDD

1. raw sources for data 2. data selection targets data 3. data cleaning leads to preprocessed data 4. data transformation 5. data mining leads to extracted patterns 6. internalization leads to actionable insight

What are the main steps in carrying out sentiment analysis projects?

1. sentiment detection 2. N-P (negative or positive) polarity classification 3. target identification 4. tabulation and aggregation of the sentiment analysis results

What recent technologies may shape the future of data warehousing?

1. sourcing 2. infrastructure

Briefly describe the general algorithm used in decision trees

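1. Assign all training data to a root node. 2. Split the data on the attribute that best separates the classes. 3. Check each resulting branch: if it is pure (all records belong to one class), stop; if not, split that branch again.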
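
The split-until-pure loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attributes are categorical, the "best" split is the one with the lowest weighted Gini impurity (one common choice, not necessarily the one used in the course), and the names `gini`, `build_tree`, and `predict` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def build_tree(rows, labels, attributes):
    """Recursively split until a node is pure or no attributes remain."""
    if len(set(labels)) == 1:      # pure node: stop
        return labels[0]
    if not attributes:             # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]

    def weighted_impurity(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())

    best = min(attributes, key=weighted_impurity)   # assumed split criterion
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        branches[value] = build_tree(
            [r for r, _ in pairs], [l for _, l in pairs], remaining
        )
    return (best, branches)

def predict(tree, row):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

# Hypothetical training data
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
```

The "sunny" branch is pure after one split, so it stops there; the "rainy" branch is still mixed and is split again on `windy`. An attribute value never seen in training would raise a `KeyError` in `predict`, which a fuller implementation would need to handle.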

Characteristics of data warehouses?

1. subject-oriented 2. integrated 3. time-variant (data are kept across time periods such as weeks, months, and years) 4. nonvolatile (users cannot change or update the data) 5. Web based 6. relational/multidimensional 7. real time

What are some of the main challenges the Web poses for knowledge discovery?

1. the web is too big for effective data mining 2. the web is too complex 3. the web is too dynamic 4. the web is not specific to a domain 5. the web has everything

What are the most popular application areas for sentiment analysis?

1. voice of the customer 2. voice of the market 3. voice of the employee 4. brand management 5. financial markets 6. government intelligence 7. politics

sourcing technologies include

1. web, social media and big data 2. open source software 3. SaaS 4. cloud computing 5. data lakes

What are commonly used Web analytics metrics?

1. website usability 2. traffic sources 3. visitor profiles 4. conversion stats

important criteria in selecting an ETL tool?

1. Ability to read from and write to an unlimited number of data sources/architectures 2. Automatic capturing and delivery of metadata 3. A history of conforming to open standards 4. An easy-to-use interface for the developer and the functional user

common characteristics of banking and other financial data mining applications

1. Automate the loan application process 2. Detect fraudulent transactions 3. Maximize customer value (cross-selling, up-selling) 4. Optimize cash reserves with forecasting

List and briefly define the four most commonly cited operational areas for KPIs.

1. Customer performance 2. Service performance 3. Sales operations 4. Sales plan/forecast

What are the most common myths about data mining?

1. Data mining provides instant, crystal-ball predictions. 2. Data mining is not yet viable for mainstream business applications. 3. Data mining requires a separate, dedicated database. 4. Only those with advanced degrees can do data mining. 5. Data mining is only for large firms that have lots of customer data.

issues affecting the purchase of an ETL tool?

1. Data transformation tools are expensive 2. Data transformation tools may have a long learning curve 3. It is difficult to measure how the IT organization is doing until it has learned to use the data transformation tools

How can you measure the impact of social media analytics?

1. Descriptive analytics 2. Social network analysis 3. Advanced analytics

List several criteria for selecting a data warehouse vendor, and describe why they are important.

1. Financial strength 2. ERP linkages 3. Qualified consultants 4. Market share 5. Industry experience 6. Established partnerships. These criteria indicate that a vendor is likely to be in business for the long term, to have the support capabilities its customers need, and to provide products that interoperate with other products the potential user has or may obtain.

What things can help Web pages rank higher in the search engine results?

1. Hire a company that specializes in SEO 2. Pay the search engine providers to be listed in the paid sponsors' sections 3. Consider liberating yourself from dependence on search engines

common characteristics of customer relationship management data mining applications

1. Maximize return on marketing campaigns 2. Improve customer retention (churn analysis) 3. Maximize customer value (cross-selling, up-selling) 4. Identify and treat most valued customers

common characteristics of retailing and logistics data mining applications

1. Optimize inventory levels at different locations 2. Improve the store layout and sales promotions 3. Optimize logistics by predicting seasonal effects 4. Minimize losses due to limited shelf life

common characteristics of brokerage and securities data mining applications

1. Predict changes in certain bond prices 2. Forecast the direction of stock fluctuations 3. Assess the effect of events on market movements 4. Identify and prevent fraudulent activities in trading

measure

Measure the existing system.

operational planning can be

Tactic-centric (operationally focused) or budget-centric (financially focused)

word sense disambiguation

When a word has more than one meaning, selecting the meaning that makes the most sense can only be accomplished by taking into account the context within which the word is used.

how can we use descriptive analytics to measure the impact of social media analytics?

Uses simple statistics to identify activity characteristics and trends, such as how many followers you have, how many reviews were generated on Facebook, and which channels are being used most often.

authoritative pages

Web pages that are identified as particularly popular based on links by other Web pages and directories

What is an EDW?

a large scale data warehouse that is used across the enterprise for decision support. associated with medium or long term decision making.

what is a data warehouse?

a large store of data accumulated from a wide range of sources within a company and used to guide management decisions.

why are there many different names and definitions for data mining?

because the term is a misnomer: you are not mining for data but mining for patterns (knowledge) in the data

Why is a performance management system superior to a performance measurement system?

because measurement alone has little use without action

What is the main reason parallel processing is sometimes used for data mining?

because of the massive data amounts and search efforts

Why is the popularity of text mining as an analytics tool increasing?

because of the rapid growth in text data and availability of sophisticated BI tools.

why will new sourcing technologies shape the future of data warehousing

because they are mechanisms for acquisition of data from diverse and dispersed sources

Why do you think the early phases (understanding of the business and understanding of the data) take the longest in data mining projects?

because they are the most important steps: if we don't correctly understand the business or the data, all of the later work could be wasted

What is sentiment?

belief, view, opinion, and conviction

what does roll up mean?

involves computing all the data relationships for one or more dimensions. To do this, a computational relationship or formula might be defined
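
As a rough illustration of roll-up, the sketch below aggregates hypothetical sales facts from the city level up to the region level. The "computational relationship" is assumed to be a simple sum, and the data and the `roll_up` helper are invented for the example.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, city, month, amount)
sales = [
    ("East", "Boston", "Jan", 100),
    ("East", "Boston", "Feb", 150),
    ("East", "New York", "Jan", 200),
    ("West", "Seattle", "Jan", 120),
    ("West", "Seattle", "Feb", 80),
]

def roll_up(facts, key_index):
    """Sum the amount (last field) over every fact sharing the same key."""
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[key_index]] += fact[-1]
    return dict(totals)

by_city = roll_up(sales, 1)     # detailed level
by_region = roll_up(sales, 0)   # rolled up one level in the geography hierarchy
```

Here `by_region` summarizes the same facts as `by_city` at a coarser level of the geography dimension.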

What is a cube?

is a multidimensional data structure (actual or virtual) that allows fast analysis of data.

What is a search engine?

a software program that searches for documents (Internet sites or files) based on the keywords (individual words, multi-word terms, or a complete sentence) that users provide about the subject of their inquiry.

hub

is one or more Web pages that provide a collection of links to authoritative pages

In the Target case study, why did Target send a teen maternity ads?

Target's analytical model suggested she was pregnant based on her buying habits

predictions

tell the nature of future occurrences of certain events based on what has happened in the past

What is text analytics? how does it differ from text mining?

text analytics is a broader concept that includes information retrieval (e.g., searching and identifying relevant documents for a given set of key terms), as well as information extraction, data mining, and Web mining. whereas text mining is primarily focused on discovering new and useful knowledge from the textual data sources.

How does text mining differ from data mining?

text mining is done on unstructured data, whereas data mining is done on structured data in a database

What is social media? How does it relate to Web 2.0?

the enabling technologies of social interactions among people in which they create, share, and exchange information, ideas, and opinions in virtual communities and networks. It is a group of Internet-based software applications that build on the ideological and technological foundations of Web 2.0 and that allow the creation and exchange of user-generated content

all of these are true about data mining except?

the ideas behind data mining are relatively new

Link Analysis

the linkage among many objects of interest is discovered automatically, such as the link between Web pages and referential relationships among groups of academic publication authors

stemming

the process of reducing inflected words to their stem (or base or root) form. For instance, stemmer, stemming, stemmed are all based on the root stem
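
A toy suffix-stripping stemmer makes the idea concrete. This is a deliberately naive sketch, not a real stemming algorithm such as Porter's: the suffix list and the rule for undoubling a final consonant are assumptions chosen only so that the example words above reduce to "stem".

```python
SUFFIXES = ("ing", "ed", "er")  # assumed, far from complete

def naive_stem(word):
    """Strip one known suffix, then undouble a trailing doubled consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

stems = {w: naive_stem(w) for w in ["stemmer", "stemming", "stemmed", "stem"]}
```

Rules this crude will mangle many ordinary words (for example, "tell" becomes "tel"), which is why practical text mining tools rely on established stemmers.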

metadata in a data warehouse

these are maintained so that they can be assessed by IT personnel and users. this includes software programs about data and rules for organizing data summaries that are easy to index and search, especially with Web tools

why will new infrastructure technologies shape the future of warehousing?

they provide architecture and software enhancements

how has the Web influenced data warehouse design?

Web-based data warehousing requires a design choice: house the Web data warehouse with the transaction servers or on separate servers. Page-loading speed must be considered, so server capacity must be planned carefully.

comprehensive database

this is the enterprise data warehouse (EDW) used to support all decision analyses by providing relevant summarized and detailed information originating from many different sources.

A Web client that connects to a Web server, which is in turn connected to a BI application server, is reflective of a?

three tier structure

What are the key differences between a two-tiered architecture and a three-tiered architecture?

a three-tier system also has client (front-end) software that allows users to access and analyze data from the warehouse. A two-tier architecture is more economical than a three-tier one, but it can have performance problems with large data warehouses that serve data-intensive decision support applications.

what are common methods for normalizing word frequencies?

through log frequencies, binary frequencies, inverse document frequencies
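
The three normalizations can be written as small functions. The formulas below are one common textbook formulation and should be treated as an assumption: `wf` is the raw count of a term in a document, `n_docs` is the number of documents in the corpus, and `doc_freq` is the number of documents that contain the term.

```python
import math

def log_frequency(wf):
    """Dampen raw counts: 1 + log(wf) for wf > 0, else 0."""
    return 1 + math.log(wf) if wf > 0 else 0.0

def binary_frequency(wf):
    """Record only whether the term occurs in the document."""
    return 1 if wf > 0 else 0

def inverse_document_frequency(wf, n_docs, doc_freq):
    """Weight the log frequency by how rare the term is across the corpus."""
    if wf == 0:
        return 0.0
    return (1 + math.log(wf)) * math.log(n_docs / doc_freq)
```

A term that appears in every document gets an inverse document frequency of zero, since `log(n_docs / doc_freq)` is `log(1)`; a term that appears in only one document gets the largest weight.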

Give examples of situations in which association would be an appropriate data mining technique.

to identify strong relationships among different products. cross-selling
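
The strength of such a relationship is usually quantified with support and confidence. Those two measures are standard in association rule mining but are not defined in these notes, so the sketch below, with its invented baskets and helper names, is an illustration rather than part of the original answer.

```python
# Hypothetical market baskets
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets with the antecedent, the share that also hold the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

bread_milk_support = support({"bread", "milk"}, baskets)
bread_to_milk = confidence({"bread"}, {"milk"}, baskets)
```

A rule such as "bread implies milk" with high support and confidence is the kind of finding that drives cross-selling offers.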

What are the key similarities between a two-tiered architecture and a three-tiered architecture?

two tiered and 3 tiered structures both have a client workspace and an application server

model SEMMA

use a variety of statistical and machine learning models

model building

various modeling techniques are selected and applied to an already prepared data set to address the specific business need.

Which of the following statements about Web site conversion statistics is FALSE?

visitors who began a purchase on a website must complete it.

explore SEMMA

visualization and basic description of the data

popular open source data mining tools?

Weka, RapidMiner, KNIME, R, Python.

act and adjust phase of BPM cycle

What do we need to do differently?

strategize phase of BPM cycle

Where do we want to go?

polysemes

which are also called homonyms, are syntactically identical words (i.e., spelled exactly the same) with different meanings

Create the Term-Document Matrix

introduce structure to the corpus
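
A minimal sketch of this step: count how often each term occurs in each document and lay the counts out as a matrix with one row per document and one column per term. The two-document corpus and the whitespace tokenizer are assumptions for illustration; a real pipeline would also remove stop words and apply stemming.

```python
from collections import Counter

# Hypothetical corpus
corpus = {
    "d1": "data mining finds patterns",
    "d2": "text mining finds patterns in text",
}

def build_tdm(corpus):
    """Return (terms, matrix): matrix[i][j] is the count of terms[j] in document i."""
    counts = {name: Counter(text.lower().split()) for name, text in corpus.items()}
    terms = sorted(set().union(*counts.values()))
    matrix = [[counts[name][term] for term in terms] for name in corpus]
    return terms, matrix

terms, matrix = build_tdm(corpus)
```

The raw counts in `matrix` are what the normalization methods (log, binary, inverse document frequency) are later applied to.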

What is meant by social analytics?

"monitoring, analyzing, measuring and interpreting digital interactions and relationships of people, topics, ideas and content"

List the benefits of data warehouses.

- Allowing end users to perform extensive analysis in numerous ways - A consolidated view of corporate data (i.e. single version of the truth) - Better and more timely information - Enhanced system performance - Simplification of data access

data warehouse development approaches

- Inmon Model: EDW approach (top down) - Kimball Model: data mart approach (bottom up)

What are the ingredients for an effective performance management system?

- Measures should focus on key factors. - Measures should be a mix of past, present, and future. - Measures should balance the needs of shareholders, employees, partners, suppliers, and other stakeholders. - Measures should start at the top and flow down to the bottom. - Measures need to have targets that are based on research and reality rather than being arbitrary.

what are some of the challenges of NLP?

- Part-of-speech tagging - Text segmentation - Word sense disambiguation - Syntax ambiguity - Imperfect or irregular input - Speech acts

What are the three key components of a BPM system?

1. A set of integrated, closed-loop management and analytic processes, supported by technology, that addresses financial and operational activities 2. Tools for businesses to define strategic goals and then measure/manage performance against them 3. A core set of processes (methods and tools) for monitoring key performance indicators (KPIs), linked to organizational strategy

infrastructure technologies include

1. Columnar: a new way to store and access data in a database 2. Real-time data warehousing 3. Data warehouse appliances: all-in-one solutions for data warehousing 4. Data management technologies and practices 5. In-database processing technology: putting the algorithms where the data is 6. In-memory storage technology: moving data into memory for faster processing 7. New database management systems 8. Advanced analytics

Common tasks for the strategic planning process

1. Conduct a current situation analysis 2. Determine the planning horizon 3. Conduct an environment scan 4. Identify critical success factors 5. Complete a gap analysis 6. Create a strategic vision 7. Develop a business strategy 8. Identify strategic objectives and goals

describe the major components of a data warehouse.

1. Data sources. 2. Data extraction, transformation 3. data loading 4. Comprehensive database 5. metadata. 6. Data Marts. 7. Middleware tools.

What are the three types of data generated through Web page visits?

1. Data stored in server access logs, referrer logs, agent logs, and client-side cookies 2. User Profiles 3. Metadata, such as page attributes, content attributes, and usage data

What are some of the most popular application areas of text mining?

1. Information extraction 2. Topic tracking 3. Summarization 4. Categorization 5. Clustering 6. Concept linking 7. Question answering

List the 10 most important factors when deciding which architecture to use in developing a data warehouse

1. Information interdependence between organizational units 2. Upper management's information needs 3. Urgency of need for a data warehouse 4. Nature of end-user tasks 5. Constraints on resources 6. Strategic view of the data warehouse prior to implementation 7. Compatibility with existing systems 8. Perceived ability of the in-house IT staff 9. Technical issues 10. Social/political factors

3 Main types of Data Warehouses

1. ODS 2. data marts 3. EDW

What are the most common data mining mistakes/blunders?

1. Selecting the wrong problem for data mining 2. Ignoring what your sponsor thinks data mining is and what it really can/cannot do 3. Beginning without the end in mind 4. Not leaving sufficient time for data acquisition, selection, and preparation 5. Looking only at aggregated results and not at individual records/predictions 6. Being sloppy about keeping track of the data mining procedure and results 7. Using data from the future to predict the future 8. Ignoring suspicious findings and quickly moving on 9. Starting with a high-profile complex project that will make you a superstar 10. Running data mining algorithms repeatedly and blindly 11. Ignoring the subject matter experts 12. Believing everything you are told about the data 13. Assuming that the keepers of the data will be fully on board with cooperation 14. Measuring your results differently from the way your sponsor measures them 15. "If you build it, they will come": not worrying about how to serve it up

CRISP-DM process

1. business understanding 2. data understanding 3. data preparation 4. model building 5. testing and evaluation 6. deployment

What are the main knowledge extraction methods from corpus?

1. classification 2. clustering 3. association 4. trend analysis

List and briefly discuss some of the text mining applications in marketing.

1. Cross-selling and up-selling by analyzing the unstructured data that comes from call centers 2. Mining reviews and blogs from customers, which are a gold mine of customer sentiment 3. Customer relationship management 4. Enhancing retailers' ability to analyze product databases

What are the four perspectives that BSC suggests to view organizational performance?

1. customer 2. financial 3. internal business processes 4. learning and growth

What are the major application areas for data mining?

1. customer relationship management 2. banking 3. retailing and logistics 4. manufacturing and production 5. brokerage and securities trading 6. insurance 7. computer hardware and software 8. government and defense 9. travel industry 10. healthcare 11. medicine 12. entertainment industry 13. homeland security and law enforcement 14. sports

What steps can an organization take to ensure the security and confidentiality of customer data in its data warehouse?

1. Establish effective corporate policies and procedures; start at the top with executives and communicate downward to everyone 2. Implement logical procedures to restrict access, including user authentication, access controls, and encryption technology 3. Limit physical access to the data center environment 4. Establish an effective internal control review process with an emphasis on security and privacy

common characteristics of manufacturing and maintenance data mining applications

1. Predict/prevent machinery failures 2. Identify anomalies in production systems to optimize the use of manufacturing capacity 3. Discover novel patterns to improve product quality

When developing a successful data warehouse, what are the most important risks and issues to consider and potentially avoid?

1. Starting with the wrong sponsorship chain 2. Setting expectations that you cannot meet 3. Engaging in politically naive behavior 4. Loading the data warehouse with information just because it is available 5. Believing that data warehousing database design is the same as transactional database design 6. Choosing a data warehouse manager who is technology oriented rather than user oriented 7. Delivering data with overlapping and confusing definitions 8. Believing promises of performance, capacity, and scalability 9. Believing that your problems are over when the data warehouse is up and running 10. Focusing on ad hoc data mining and periodic reporting instead of alerts

4 phases of BPM cycle

1. Strategize 2. Plan 3. Monitor/analyze 4. Act/adjust

What are the distinguishing features of KPIs?

1. Strategy 2. Targets 3. Ranges 4. Encodings 5. Time frames 6. Benchmarks

What are the two common methods for polarity identification?

1. Using a lexicon as a reference library 2. Using a collection of training documents as the source of knowledge about the polarity of terms within a specific domain
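
The first method can be sketched as a lookup against a small word list. The lexicon below is a hypothetical six-word stand-in for a real sentiment lexicon, and summing word scores is only the simplest possible scoring rule (it ignores negation, intensity, and sarcasm).

```python
# Hypothetical mini-lexicon: +1 for positive terms, -1 for negative terms
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "poor": -1, "hate": -1}

def polarity(text):
    """Classify text as positive, negative, or neutral by summing lexicon scores."""
    score = sum(
        LEXICON.get(token.strip(".,!?").lower(), 0) for token in text.split()
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

examples = [
    polarity("I love this great phone!"),
    polarity("Poor battery, bad screen."),
    polarity("It arrived on Monday."),
]
```

The second method replaces the hand-built lexicon with term polarities learned from labeled training documents in the target domain.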

what are the 3 main areas of web mining?

1. Web content mining 2. Web structure mining 3. Web usage mining

List and discuss the most pronounced DW implementation guidelines.

1. Strong sponsorship is needed 2. A project champion is needed, with a focus on what the user needs, not the technology 3. Consider various risks 4. Assess risks at the inception phase of the data warehouse project 5. User participation in the development of data and access modeling 6. Team skills, including in-depth knowledge of the database technology

sequence mining

A pattern discovery method where relationships among the things are examined in terms of their order of occurrence to identify associations over time

What is Six Sigma?

A performance management methodology aimed at reducing the number of defects in a business process to as close to zero defects per million opportunities (DPMO) as possible
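
DPMO is computed by dividing the number of defects by the total number of defect opportunities and scaling to one million. A small sketch, with hypothetical parameter names and figures:

```python
def dpmo(defects, units, opportunities_per_unit):
    """Defects per million opportunities."""
    total_opportunities = units * opportunities_per_unit
    if total_opportunities <= 0:
        raise ValueError("there must be at least one defect opportunity")
    return defects * 1_000_000 / total_opportunities

# Hypothetical process: 17 defects found in 500 invoices,
# each invoice having 10 fields that could be wrong.
invoice_dpmo = dpmo(defects=17, units=500, opportunities_per_unit=10)
```

For this invented process the result is 3,400 DPMO, far from the near-zero level that Six Sigma aims for.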

What is a performance measurement system? How does it work

A performance measurement and management methodology that helps translate an organization's financial, customer, internal process, and learning and growth objectives and targets into a set of actionable initiatives

Dimensional Modeling

A retrieval-based system that supports high-volume query access

speech acts

A sentence can often be considered an action by the speaker. The sentence structure alone may not contain enough information to define this action. For example, "Can you pass the class?" requests a simple yes/no answer, whereas "Can you pass the salt?" is a request for a physical action to be performed.

what does slice mean?

A slice is a subset of a multidimensional array (usually a two-dimensional representation) corresponding to a single value set for one (or more) of the dimensions not in the subset.
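
To make this concrete, the sketch below stores a tiny cube as a dictionary keyed by (product, region, month) and slices it by fixing the month. The data and the `slice_cube` helper are hypothetical.

```python
# Hypothetical cube: (product, region, month) -> units sold
cube = {
    ("laptop", "East", "Jan"): 10,
    ("laptop", "West", "Jan"): 7,
    ("phone", "East", "Jan"): 20,
    ("laptop", "East", "Feb"): 12,
    ("phone", "West", "Feb"): 15,
}

def slice_cube(cube, axis, value):
    """Fix one dimension to a single value and drop it from the key."""
    return {
        key[:axis] + key[axis + 1:]: measure
        for key, measure in cube.items()
        if key[axis] == value
    }

january = slice_cube(cube, axis=2, value="Jan")
```

The result is a two-dimensional product-by-region table for January, with the month dimension no longer part of the key.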

independent data mart

A small data warehouse designed for a strategic business unit or a department

dependent data mart

A subset that is created directly from a data warehouse

What is a performance management system? Why do we need one?

A system that assists managers in tracking the implementations of business strategy by comparing actual results against strategic goals and objectives - Comprises systematic comparative methods that indicate progress (or lack thereof) against goals

Enterprise Application Integration

A technology that provides a vehicle for pushing data from source systems into a data warehouse

enterprise information integration

An evolving tool space that promises real-time data integration from a variety of sources, such as relational or multidimensional databases, Web services, etc.

snowflake schema

An extension of star schema where the diagram resembles a snowflake in shape

Data Understanding

Analysis of all current data along with identifying any data quality issues

What is clickstream analysis? What is it used for?

Analysis of the information collected by Web servers. By applying data and text mining techniques to these data, a company might be able to discern interesting patterns.

analyze

Analyze the system to identify ways to eliminate the gap between the current performance of the system or process and the desired goal.

topic tracking application of text mining

Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user.

Why is strategy the most important part of a BPM implementation?

Business strategy provides an overall direction to the enterprise

syntactic ambiguity

Choosing the most appropriate grammar structure usually requires a fusion of semantic and contextual information.

establish the corpus

Collect and organize the domain-specific unstructured data

concept linking application of text mining

Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods

Data extraction and transformation

Data are extracted and properly transformed using custom-written or commercial software called ETL.

data loading

Data are loaded into a staging area, where they are transformed and cleansed. The data are then ready to load into the data warehouse and/or DMs.

data sources

Data are sourced from multiple independent operational "legacy" systems and possibly from external data providers (such as the U.S. Census). Data may also come from an OLTP or enterprise resource planning (ERP) system. Web data in the form of Web logs may also feed to a data warehouse

Describe data integration

Data integration comprises the major processes of data access, data federation, and change capture.

What is OLAP and how does it differ from OLTP?

Data stored in a data warehouse can be analyzed using techniques referred to as OLAP (online analytical processing). OLAP is one of the most commonly used data analysis techniques in data warehouses; it is an approach to quickly answering ad hoc questions that require data analysis. OLTP (online transaction processing) is concerned with the capture and storage of data and is designed to best carry out day-to-day business functions. OLAP is concerned with the analysis of that data and provides answers to business and management queries.

categorical data

Data that consists of names, labels, or other nonnumerical values

define

Define the goals, objectives, and boundaries of the improvement activity.

What is DMAIC?

Define, Measure, Analyze, Improve, Control

interval data

Differences between values can be computed, but there is no absolute zero (e.g., temperature and time).

multidimensional presentation includes

Dimensions: products, salespeople, market segments, business units, geographical locations, distribution channels, country, or industry - Measures: money, sales volume, head count, inventory profit, actual versus forecast - Time: daily, weekly, monthly, quarterly, or yearly

Assess SEMMA

Evaluate the accuracy and usefulness of the models

Describe the three steps of the ETL process.

Extraction: selecting data from one or more sources and reading the selected data. Transformation: converting data from their original form to whatever form the DW needs. This step often also includes cleansing of the data to remove as many errors as possible. Load: putting the converted (transformed) data into the DW.
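
The three steps can be sketched end to end with the standard library. Everything specific here is an assumption made for illustration: the CSV text stands in for a source system, an in-memory SQLite database stands in for the DW, and the cleansing rule simply drops rows whose amount cannot be parsed.

```python
import csv
import io
import sqlite3

# Hypothetical source data with inconsistent formatting and one bad value
RAW = "customer,amount\n alice ,10.5\nBOB,not_a_number\ncarol,7\n"

def extract(text):
    """Extraction: read the selected data from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transformation: convert to the warehouse's form and cleanse errors."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # cleansing: skip rows with unparseable amounts
        clean.append((row["customer"].strip().title(), amount))
    return clean

def load(rows, conn):
    """Load: put the transformed data into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
loaded = conn.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
```

Keeping the three stages as separate functions mirrors the way ETL tools let each stage be configured and monitored independently.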

question answering application of text mining

Finding the best answer to a given question through knowledge-driven pattern matching

how can we use social network analytics to measure the impact of social media analytics?

Follows the links between friends, fans, and followers to identify connections of influence as well as the biggest sources of influence.

Business Understanding

Gain a clear understanding of the business problem that must be solved and how it impacts the company

clustering application of text mining

Grouping similar documents without having a predefined set of categories.

what is the reason for normalizing word frequencies?

In order to have a more consistent TDM for further analysis.

monitor / analyze phase of BPM cycle

How are we doing?

HITS

Hyperlinked induced topic search is a link-analysis algorithm that rates Web pages using the hyperlink information contained within them.
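
A compact sketch of the HITS iteration ties together the hub and authoritative-page definitions in these notes: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The link graph, the fixed iteration count, and the Euclidean normalization are assumptions for illustration.

```python
def hits(links, iterations=20):
    """Return (hub, authority) score dicts for a graph {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of pages that link to p
        auth = {
            p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages
        }
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of authority scores of the pages p links to
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link graph
links = {"portal": ["a", "b"], "blog": ["a"], "a": [], "b": []}
hub, auth = hits(links)
```

In this graph, page "a" ends up as the strongest authority because both other pages link to it, and "portal" is the strongest hub because it links to both authorities.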

What are the most popular commercial data mining tools?

IBM Cognos, Oracle Hyperion, SAP Business Objects, Tableau, Tibco, Qlik, MicroStrategy, Teradata, and Microsoft

information extraction application of text mining

Identification of key phrases and relationships within text by looking for predefined sequences in text via pattern matching.

categorization application of text mining

Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes

Give examples of situations in which classification would be an appropriate data mining technique.

If what is being predicted is a class label (e.g., "sunny," "rainy," or "cloudy"), the prediction problem is called classification.

how can we use advanced analytics to measure the impact of social media analytics?

Includes predictive analytics and text analytics that examine the content in online conversations to identify themes, sentiments, and connections that would not be revealed by casual surveillance.

improve

Initiate actions to eliminate the gap by finding ways to do things better, cheaper, or faster

control

Institutionalize the improved system by modifying compensation and incentive systems, policies, procedures, manufacturing resource planning, budgets, operation instructions, or other management systems

What is "search engine optimization?" Who benefits from it?

It is the intentional activity of affecting the visibility of an e-commerce site or a Web site in a search engine's natural (unpaid or organic) search results. The Web site being optimized benefits from it.

what is NLP?

It studies the problem of "understanding" the natural human language, with the view of converting depictions of human language (such as textual documents) into more formal representations (in the form of numeric and symbolic data) that are easier for computer programs to manipulate.

customer performance

Metrics for customer satisfaction, speed and accuracy of issue resolution, and customer retention.

sales plan/forecast

Metrics for price-to-purchase accuracy, purchase order-to-fulfillment ratio, quantity earned, forecast-to-plan ratio, and total closed contracts.

service performance

Metrics for service-call resolution rates, service renewal rates, service level agreements, delivery performance, and return rates.

What are some of the benefits of NLP?

NLP moves beyond syntax-driven text manipulation (which is often called "word counting") to a true understanding and processing of natural language that considers grammatical and semantic constraints as well as the context.

sales operations

New pipeline accounts, sales meetings secured, conversion of inquiries to leads, and average call closure time

What is SVD?

Singular value decomposition (SVD) reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower-dimensional space, where each consecutive dimension represents the largest possible degree of variability between words and documents. The goal is to find one or two salient dimensions that account for most of the variability.

SEMMA

Sample, Explore, Modify, Model, and Assess

What is scalability? How does it apply to DW?

Scalability refers to the degree to which a system can adjust to changes in demand without major additional changes or investments. DW scalability issues are the amount of data in the warehouse, how quickly the warehouse is expected to grow, the number of concurrent users, and the complexity of user queries.

What are the major data mining processes?

Several data mining processes have been proposed: CRISP-DM, SEMMA, and KDD.

Text segmentation

Some written languages, such as Chinese, Japanese, and Thai, do not have single-word boundaries. In these instances, the text-parsing task requires the identification of word boundaries, which is often a difficult task. Similar challenges in speech segmentation emerge when analyzing spoken language because sounds representing successive letters and words blend into each other

summarization application of text mining

Summarizing a document to save time on the part of the reader

synonyms

Synonyms are syntactically different words (i.e., spelled differently) with identical or at least similar meanings

scalability in selecting the best criteria for classification techniques

The ability to construct a prediction model efficiently given a rather large amount of data

multidimensionality

The ability to organize, present, and analyze data by several dimensions, such as sales by region, by product, by salesperson, and by time (four dimensions)

Tokenizing

The block of text corresponding to the token is categorized according to the function it performs.

speed in selecting the best criteria for classification techniques

The computational costs involved in generating and using the model, where faster is deemed to be better.

fact table

The fact table contains a large number of rows that correspond to observed facts and external links (i.e., foreign keys). A fact table contains the descriptive attributes needed to perform decision analysis and query reporting, and foreign keys are used to link to dimension tables.

Describe the data warehousing process.

The data warehousing process consists of the following steps: 1. Data are imported from various internal and external sources 2. Data are cleansed and organized consistently with the organization's needs 3. a. Data are loaded into the enterprise data warehouse, or b. Data are loaded into data marts. 4. a. If desired, data marts are created as subsets of the EDW, or b. The data marts are consolidated into the EDW 5. Analyses are performed as needed

what does dice mean?

The dice operation is a slice on more than two dimensions of a data cube
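
A minimal sketch of dicing, assuming a tiny made-up cube stored as a list of records; the `sales` data and the helper names `dice` and `slice_` are hypothetical and only illustrate the idea of restricting several dimensions at once:

```python
# Hypothetical mini data cube: each record is one cell with
# three dimensions (region, product, quarter) and one measure (amount).
sales = [
    {"region": "East", "product": "A", "quarter": "Q1", "amount": 100},
    {"region": "East", "product": "B", "quarter": "Q1", "amount": 150},
    {"region": "West", "product": "A", "quarter": "Q1", "amount": 200},
    {"region": "West", "product": "A", "quarter": "Q2", "amount": 250},
    {"region": "East", "product": "A", "quarter": "Q2", "amount": 120},
]

def dice(cube, **allowed):
    """Keep cells whose value on every named dimension is in the allowed set."""
    return [
        cell for cell in cube
        if all(cell[dim] in values for dim, values in allowed.items())
    ]

def slice_(cube, dim, value):
    """A slice fixes a single dimension to a single value."""
    return dice(cube, **{dim: {value}})

# Dice on three dimensions at once.
sub_cube = dice(sales, region={"East"}, product={"A"}, quarter={"Q1", "Q2"})
total = sum(cell["amount"] for cell in sub_cube)
```
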

Interpretability in selecting the best criteria for classification techniques

The level of understanding and insight provided by the model (e.g., how and/or what the model concludes on certain predictions)

predictive accuracy in selecting the best criteria for classification techniques

The model's ability to correctly predict the class label of new or previously unseen data.

robustness in selecting the best criteria for classification techniques

The model's ability to make reasonably accurate predictions, given noisy data or data with missing and erroneous values.

Star Schema

The most commonly used and the simplest style of dimensional modeling. It contains a fact table surrounded by and connected to several dimension tables.
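
A small sketch of a star schema using Python's built-in `sqlite3` module; the table names (`fact_sales`, `dim_product`, `dim_region`) and the sample rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the facts.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, name TEXT);

-- The central fact table links to each dimension by foreign key.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    amount     REAL
);

INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_region  VALUES (1, 'East'), (2, 'West');
INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 75.0);
""")

# Summarize the facts along one dimension (sales by region).
rows = conn.execute("""
SELECT r.name, SUM(f.amount)
FROM fact_sales AS f
JOIN dim_region AS r ON r.region_id = f.region_id
GROUP BY r.name
ORDER BY r.name
""").fetchall()
```
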

what are some methods for cluster analysis

The most commonly used clustering algorithms are k-means and self-organizing maps. Other methods include neural networks, fuzzy logic, and genetic algorithms.
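
As a rough illustration of k-means, here is a one-dimensional sketch in plain Python. The function name `k_means_1d`, the fixed iteration count, and the starting centroids are assumptions made for this example; real implementations usually work on many dimensions and stop when assignments no longer change.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = k_means_1d(
    [1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0]
)
```
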

data mining

The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases

part-of-speech tagging

The process of marking up the words in a text as corresponding to a particular part of speech based on a word's definition and context of its use

what does pivot mean?

This is used to change the dimensional orientation of a report or ad hoc query-page display.

Define Gini index. What does it measure?

A measure used to evaluate the goodness of a split in a decision tree. The Gini index has been used in economics to measure the diversity of a population; in decision trees it is used to determine the purity of a specific class as a result of a decision to branch along a particular attribute or variable.
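
A short sketch of how the Gini index can be computed for a candidate split, using the common impurity form 1 minus the sum of squared class proportions. The function names `gini` and `split_gini` are hypothetical:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one node: 0.0 means the node is pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini impurity of the two branches of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A lower `split_gini` value indicates purer branches, so a tree-building procedure would prefer the attribute whose split gives the smallest value.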

What issues should be considered when deciding which architecture to use in developing a data warehouse?

Which database management system (DBMS) should be used? - Will parallel processing and/or partitioning be used? - Will data migration tools be used to load the data warehouse? - What tools will be used to support data retrieval and analysis?

In text analysis, what is a lexicon?

a catalog of words, their synonyms, and their meanings

data mart

a data collection, smaller than the data warehouse, that addresses the needs of a particular department or functional area of the business.

How does a data warehouse differ from a transactional database?

A data warehouse differs from a transactional database in that it holds a variety of data organized by subject for many different areas of the company, not just transactions, and integrates it all in one place. It can be updated in real time and includes metadata.

what is six sigma?

a methodology aimed at reducing the number of defects in a business process

Is data mining a new discipline?

It is a new definition for the use of many disciplines. Data mining is tightly positioned at the intersection of many disciplines; it is an emerging field that has attracted much attention in a very short time.

What is a balanced scorecard (BSC)? Where did it come from?

A performance management system; it came from Kaplan and Norton.

terms

a single word or multiword phrase extracted directly from the corpus of a specific domain by means of NLP methods

what does drill down mean?

a specific OLAP technique whereby the user navigates among levels of data ranging from the most summarized (up) to the most detailed (down).

time series forecasting

based on the assumption that the future is an extension of the past. Historical data is used to predict future demand
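
One simple way to turn this assumption into a forecast is a moving average of the most recent observations. The sketch below is only an illustration; the function name `moving_average_forecast`, the window size, and the demand figures are made up:

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("need at least `window` observations")
    recent = history[-window:]
    return sum(recent) / window

demand = [100, 110, 120, 130]  # hypothetical historical demand
forecast = moving_average_forecast(demand)
```
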

Oper marts

an operational data mart created when operational data needs to be analyzed multidimensionally

Understanding customers better has helped Amazon and others become more successful. The understanding comes primarily from?

analyzing vast data amounts routinely collected

Web Crawlers (Spiders)

are used to read through the content of a Web site automatically.

stop words

are words that are filtered out prior to or after processing of natural language data (i.e., text).
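
A minimal sketch of stop-word filtering in plain Python. The `STOP_WORDS` set here is a tiny hypothetical list chosen for the example; practical lists are much longer and domain dependent:

```python
import re

# Hypothetical, deliberately short stop-word list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def remove_stop_words(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

terms = remove_stop_words("The corpus is a large set of texts")
```
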

What is the meaning of and motivation for balance in BSC?

arises because the combined set of measures is supposed to encompass indicators that are • Financial and nonfinancial • Leading and lagging • Internal and external • Quantitative and qualitative •Short term and long term

Testing and Evaluation

assesses the degree to which the selected model (or models) meets the business objectives and, if so, to what extent

What are some major data mining methods?

associations, prediction, clustering, sequential relationship, time series forecasting, visualization

deployment

can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.

In which stage of extraction, transformation, and load (ETL) into a data warehouse are anomalies detected and corrected?

cleanse

web based data warehouse design

client / web browser -> uses internet to connect to ->webserver-> webpages, application server, data warehousing

what is a major difference between cluster analysis and classification?

classification assigns items to predefined classes using labeled training data (supervised learning), whereas cluster analysis groups items by similarity without predefined class labels (unsupervised learning)

What is an ensemble model in data mining?

combines the outcomes of two or more models, which may be of different types or the same type (e.g., two decision trees).
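
One common way to combine classifier outcomes is majority voting. The sketch below assumes each model is just a function that returns a class label; the three threshold rules are hypothetical stand-ins for trained models:

```python
from collections import Counter

def majority_vote(models, x):
    """Return the class label predicted by the most models."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-ins for three trained classifiers.
models = [
    lambda x: "high" if x > 10 else "low",
    lambda x: "high" if x > 20 else "low",
    lambda x: "high" if x > 30 else "low",
]
```

With an odd number of two-class models, as here, the vote can never tie.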

What are the main differences between commercial and free data mining software tools?

Commercial tools must be paid for; open-source tools are free.

dimensional tables

contain classification and aggregation information about the central fact rows. contain attributes that describe the data contained within the fact table; they address how data will be analyzed and summarized. Dimension tables have a one-to-many relationship with rows in the central fact table.

Data Preparation

converting information from surveys or other data sources so it can be used in statistical analysis

what is metadata? explain the importance of it.

data about data, describe the structure of and some meaning about data, thereby contributing to their effective or ineffective use.

Data mining algorithms

decision trees, regression, k-means, k-nearest neighbor

What are the privacy issues in data mining?

deidentification of individual data records, sharing customer data without seeking consent

sequential relationships

discover time-ordered events, such as predicting that an existing banking customer who already has a checking account will open a savings account followed by an investment account within a year

All of the following are challenges associated with natural language processing EXCEPT

dividing a text up into individual words in English

What recent factors have increased the popularity of data mining?

due to advances in statistics, artificial intelligence, machine learning, management science, information systems (IS), and databases

Inmon Model: EDW approach (top down)

employing established database development methodologies and tools, such as entity-relationship diagrams (ERD) and an adjustment of the spiral development approach. The EDW is the ideal in this approach because it provides a consistent and comprehensive view of the enterprise

Middleware Tools

enable access to the data warehouse. Power users such as analysts may write their own SQL queries. Others may employ a managed query environment, such as Business Objects, to access data. There are many front-end applications that business users can use to interact with data stored in the data repositories, including data mining, OLAP, reporting tools, and data visualization tools.

miner

end user of data mining

Federated Architecture

existing data warehouses, data marts, and legacy systems-> data mapping/metadata logical /physical integration of common data elements-> end-user applications.

true or false: Open-source data mining tools include applications such as IBM SPSS Modeler and Dell Statistica.

False. These are commercial data mining tools.

imperfect or irregular input

foreign or regional accents and vocal impediments in speech and typographical or grammatical errors in texts make the processing of the language an even more difficult task.

sample SEMMA

generate a representative sample of the data

Give examples of situations in which cluster analysis would be an appropriate data mining technique

has been used extensively for fraud detection (both credit card and e-commerce fraud) and market segmentation of customers in contemporary CRM systems.

plan phase of BPM cycle

how do we get there

which data warehouse architecture uses a normalized relational warehouse that feeds multiple data marts?

hub and spoke

Which data warehousing architecture is the best? Why?

Hub-and-spoke or Data Mart Bus Architecture with Linked Dimensional Data Marts. The hub-and-spoke architecture is typically used with more enterprise-wide implementations and larger warehouses, but it was also the most expensive and time-consuming to implement.

clusters

identify natural groupings of things based on their known characteristics, such as assigning customers in different segments based on their demographics and past purchase behaviors

Give examples of situations in which regression would be an appropriate data mining technique.

if what is being predicted is a numeric value (e.g., temperature, such as 68°F), the prediction problem is called a regression.

real time data warehousing

implies that the refresh cycle of an existing data warehouse to update the data is more frequent (almost at the same time as the data becomes available at operational databases).

what are the pros of using an ensemble model in data mining?

improves accuracy and robustness

what are the cons of using an ensemble model in data mining?

increases complexity and reduces interpretability

What is Web mining? How does it differ from regular data mining or text mining?

is the process of discovering intrinsic relationships from Web data, which are expressed in the form of textual, linkage, or usage information. Web mining is essentially the same as data mining that uses data generated over the Web.

What is Web structure mining? How does it differ from Web content mining?

is the process of extracting useful information from the links embedded in Web documents. it focuses on the links not the content of a web page

What is sentiment analysis?

is trying to answer the question "What do people feel about a certain topic?" by analyzing data related to the opinions of many people using a variety of automated tools. It is used in a variety of domains, but its applications in CRM are especially noteworthy (they relate to customers' and consumers' opinions).

Why is the ETL process so important for data warehousing effort?

it helps with data integration and allows data to be loaded into a data warehouse in a form that is readable by computer models

what is an ODS?

It is an operational data store. It provides a fairly recent form of customer information file and is used as an interim staging area for a data warehouse. Its contents are updated throughout the course of business operations, and it is used for short-term decisions involving mission-critical applications.

associations

occurrences linked to a single event.

What would be the expected benefits and beneficiaries of sentiment analysis in politics?

one may predict who is more likely to win or lose. can help understand what voters are thinking and can clarify a candidate's position on issues. can help political organizations, campaigns, and news analysts to better understand which issues and positions matter the most to voters.

Breaking up a Web page into its components to identify worthy words/terms and indexing them using a set of rules is called

parsing the documents

operational plan

plan that translates an organization's strategic objectives and goals into a set of well defined tactics and initiatives, resource requirements, and expected results for some future time period (usually a year)

concepts

are features generated from a collection of documents by means of manual, statistical, rule-based, or hybrid categorization methodology.

What is business performance management? How does it relate to BI?

refers to the business processes, methodologies, metrics, and technologies used by enterprises to measure, monitor, and manage business performance. Data warehousing, which is a function of BI, is implemented to monitor business performance.

What is Web content mining? How can it be used for competitive advantage?

refers to the extraction of useful information from Web pages. It can also be used for information/news/opinion collection and summarization, sentiment analysis, and automated data collection and structuring for predictive modeling.

What is social media analytics?

refers to the systematic and scientific ways to consume the vast amount of content created by Web-based social media outlets, tools, and techniques for the betterment of an organization's competitiveness.

corpus

is a large and structured set of texts (now usually stored and processed electronically) prepared for the purpose of conducting knowledge discovery.

Modify SEMMA

select variable, transform variable representations

in the Wimbledon case study, the tournament used data for each match in real time to highlight

significant events

What is a social network? What is the need for SNA?

A social structure composed of individuals linking to each other. It is a theoretical construct useful in the social sciences to study relationships between individuals, groups, organizations, or even entire societies.

Hub-and-Spoke Architecture (Corporate Information Factory)

source systems-> staging area->normalized relational warehouse with atomic data->dependent data marts with some atomic/summarized data, end-user applications->dependent data marts with some atomic/ summarized data

Kimball Model: data mart approach (bottom up)

strategy is a "plan big, build small" approach. building one data mart at a time. This model applies dimensional data modeling, which starts with tables.

One-Tier Architecture

It tries to combine all of these tiers into one; it does not work well.

What skills should a DWA possess? Why?

• IT - Familiarity with high-performance hardware, software, and networking technologies, since the data warehouse is based on those • Solid business insight, to understand the purpose of the DW and its business justification • Familiarity with business decision-making processes to understand how the DW will be used • Excellent communication skills, to communicate with the rest of the organization

Visitor Profiles

• Keywords • Content groupings • Geography • Time of day • Landing page profiles

conversion statistics

• New visitors • Returning visitors • Leads • Sales/conversions • Abandonment/exit rate

website usability

• Page views • Time on site • Downloads • Click map • Click paths

Traffic Sources

• Referral Web sites • Search engines • Direct • Offline campaigns • Online campaigns

