Business Intelligence Chapter 1, Business Intelligence Chapter 3, BI Chapter 13, BI Exam 1, BI Exam 2, Chapter 14 BI, Chapter 11 BI, BI Chapter 8, Business Intelligence Chapter 5, Business Intelligence Chapter 7, Business Intelligence Chapter 7, Busi...

Information-as-a-Service (IaaS)

"Information on Demand" Goal is to make information available quickly to people, processes, and applications across the business

Financial KPIs

"What are the economic consequences of the organization's past actions?" Examples: operating income, expenses, return on capital, profit margin, cash flow, economic value added

Business Process KPIs

"What are the existing and emerging internal business processes in which the supply chain organization must excel?" Examples: efficiency, cost, throughput, quality, effectiveness

Learning and Growth KPIs

"What infrastructure is needed to foster long-term growth and improvement?" Examples: employee satisfaction, employee retention, skill sets, education and training, information technology

Customer KPIs

"What value proposition is delivered to key customer segments?" Examples: customer satisfaction, customer retention, customer acquisition, market share in target segments, valued services

nearest neighbor classifier

"if it walks like a duck, quacks like a duck, then it is probably a duck" - identify the prevailing class among neighboring examples

reasons for many different methods

- "no free lunch" theorem (no single algorithm is superior across all problems) - different methods have different merits and demerits (advantages and disadvantages)

star schema

- 1 fact table - x dimensions

Relational databases: advantages

- ACID - atomicity (all or nothing) - consistency - isolation - durability
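
The "all or nothing" property can be illustrated with Python's built-in `sqlite3` module. This is a minimal sketch; the `account` table, the names, and the `transfer` helper are hypothetical and only serve the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

In the second call the credit to `bob` is executed first and then undone when the debit violates the constraint, which is the atomicity behavior the card describes.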

Dimensions of Model Performance

- Accuracy - scalability - robustness - comprehensibility - justifiability - calibration

Stage Choice

- Analysis of the selection and the alternatives - Implementation - Select a particular course of action

NoSQL BASE

- Basically Available (replication of data among many different storage servers) - Soft state (NoSQL allows temporarily inconsistent data) - Eventually consistent (NoSQL ensures a consistent state at some future point)

What are the best practices in dashboard design?

- Benchmark Key Performance Indicators (KPIs) with industry standards - Wrap the dashboard metrics with contextual metadata - Validate the dashboard design by a usability specialist - Prioritize and rank alerts/exceptions streamed to the dashboard - Enrich the dashboard with business-user comments - Present information in three different levels - Pick the right visual construct using dashboard design principles - Provide for guided analytics

Properties of a distance measure

- D(A,B) >= 0 -> non-negativity - D(A,B) = D(B,A) -> symmetry - D(A,B) = 0 if and only if A = B -> identity - D(A,B) <= D(A,C) + D(C,B) -> subadditivity (triangle inequality)
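
These properties can be checked numerically on a small sample of points. The sketch below is illustrative only: `check_metric` and the sample points are hypothetical, and passing the check on a sample does not prove that a function is a metric in general.

```python
import math
from itertools import product

def euclidean(a, b):
    return math.dist(a, b)

def squared_euclidean(a, b):
    return math.dist(a, b) ** 2

def check_metric(points, d, tol=1e-12):
    """Return True if d satisfies the four distance properties on points."""
    for a, b in product(points, repeat=2):
        if d(a, b) < 0:                      # non-negativity
            return False
        if abs(d(a, b) - d(b, a)) > tol:     # symmetry
            return False
        if (d(a, b) <= tol) != (a == b):     # identity
            return False
    for a, b, c in product(points, repeat=3):
        if d(a, b) > d(a, c) + d(c, b) + tol:  # triangle inequality
            return False
    return True

points = [(0, 0), (3, 4), (1, 1), (6, 8)]
euclidean_ok = check_metric(points, euclidean)
squared_ok = check_metric(points, squared_euclidean)
```

Squared Euclidean distance is included as a contrast: it is non-negative and symmetric but breaks the triangle inequality on these points.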

Challenges for RDBMS

- Diversity - Connectivity - Data Size

Staging Area

- Information integration (temporary intermediate storage) - place where data is transformed

Business Intelligence (Definition)

- Concepts and methods for improving business decisions by using fact-based support systems. + Transformation of data into usable information + Enables easy interpretation of large volumes of data

business application of association rule

- Market basket analysis - web usage mining

relational model

- RDBMS - implemented as two dimensional models - queries in SQL

BI Front end technologies

- Reporting - Portals & Dashboards - Data mining - OLAP front end

Steps KDD Process

- Selection, Pre-Processing, Transformation, Data Mining, Interpretation/Evaluation

Cloud Computing service layer

- Services (complete business service) - Application (cloud-based software) - Development (software development platform) - Platform (cloud-based platforms) - Storage (data storage) - Hosting (physical data centers)

reporting types

- Standard (fixed frequency) - Event driven - Ad Hoc (user request)

Stage Intelligence

- Searching the environment for conditions that call for a decision - Problem definition

BI Systems Data Type and Origin

- Time: present & past - horizon: mid/Long term - granularity: detailed and aggregated

BI Systems Purpose and Target Audience

- User Group: focused - Focus: Business Object

BI Portals & Dashboards

- Visualization and data access - Reduce search costs - All information in one place

Virtual workspace

- abstraction of an execution environment that can be made dynamically available - resource limited - flexible software configuration

relevance of model assessment

- accountability - informed modeling decisions - nature of forecasting

Data Mining

- algorithm centric - characteristics: automated, discovers novel patterns - methods: predictive, descriptive

advantages of n-fold cross-validation

- approximates generalization error - increased robustness - less variance than split sample - uses all examples for model building
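
The point that every example is used for model building can be seen in a short sketch. The fold assignment and the mean-only "model" below are assumptions chosen to keep the example small; any learner could take its place.

```python
def n_fold_splits(n_examples, n_folds):
    """Yield (train_idx, test_idx); each example is tested exactly once."""
    indices = list(range(n_examples))
    for fold in range(n_folds):
        test_idx = indices[fold::n_folds]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx

def cross_validated_mae(y, n_folds):
    """Average held-out mean absolute error of a mean-only predictor."""
    fold_errors = []
    for train_idx, test_idx in n_fold_splits(len(y), n_folds):
        # "Model building": the prediction is the mean of the training fold.
        prediction = sum(y[i] for i in train_idx) / len(train_idx)
        mae = sum(abs(y[i] - prediction) for i in test_idx) / len(test_idx)
        fold_errors.append(mae)
    # Averaging over folds reduces the variance of the error estimate.
    return sum(fold_errors) / len(fold_errors)

y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
cv_error = cross_validated_mae(y, n_folds=3)
```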

KDD Process Interpretation/Evaluation

- assessment of derived patterns (validity, reliability, originality) - next steps

Virtual Machines

- abstraction of a physical host machine - hypervisor intercepts and emulates instructions from the VM

federated way of integrating heterogeneous databases

- build wrapper on top of db - ad-hoc approach

unsupervised learning

- build analytic models without a response variable

agglomerative clustering

- bottom-up approach - every case starts as its own cluster - merge cases to form larger clusters

algorithms

- C4.5 (continuous and categorical independent variables, categorical target variable) - CART (continuous and categorical independent variables, categorical and continuous target variables) - CHAID (categorical independent variables, categorical target variables)

business application of classification

- churn prediction, direct mail, defect prediction/quality management, credit scoring, acceptance scoring

Enterprise Data Warehouse

- collects information about subjects, spanning the entire organisation - corporate-wide data integration

indicators for measuring accuracy

- compare predicted to actual responses - regression (mean absolute error, mean squared error, root mean squared error) - classification (classification error, percentage correctly classified, precision, recall)
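
A minimal sketch of these indicators, using their standard definitions. The function names and the toy values are hypothetical.

```python
import math

def regression_errors(actual, predicted):
    """Mean absolute error, mean squared error, root mean squared error."""
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}

def classification_scores(actual, predicted, positive=1):
    """Percentage correctly classified (pcc), error, precision, recall."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    pcc = sum(1 for a, p in pairs if a == p) / len(pairs)
    return {
        "pcc": pcc,
        "error": 1 - pcc,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

reg = regression_errors([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
clf = classification_scores([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```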

scalability

- consumption of time resources - memory resources - sensitivity with respect to parameters - parallelization important

Regression DMM

- continuous dependent variable - continuous and/or discrete independent variables

KDD Process pre-processing

- conversion to standard analysis format - exploratory data analysis - aggregation

Data accuracy

- correct - unambiguous - consistent - complete

Metadata

- data about data - define warehouse objects - (source, time, missing fields, ...)

ETL Process

- data is identified and extracted - transformed for consistency - transported to the DWH

Cloud storage

- data storage capacity hired out to others - stored remotely, temporarily cached on desktop computers

Graph Database

- dealing with highly interconnected data - nodes and relationships can have properties - strength: traversing through the nodes by relationships - Neo4j

columnar

- design of data is a column - adding columns is quite inexpensive, is done row by row - each row can have different columns - HBase, Cassandra, Hypertable

assessing cluster solutions

- divisive: decrease heterogeneity within a cluster - agglomerative: increase heterogeneity within a cluster

indicators to measure predictive accuracy

- differ across modeling framework - emphasize different notion of performance - focus on classification in following

k-means problems

- different starting centroids result in different clusters - finding the true clusters is difficult - centroids may readjust themselves

association rule DMM

- discover rules of the form "IF A then B" - Sequence mining also looks for time-dependent relationships
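
Rules of this form are commonly scored by support and confidence. These two measures are not named on the card and are brought in here as an assumption; the baskets are hypothetical market-basket data.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule IF A THEN B."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]
# Rule: IF bread THEN butter
rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
```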

KDD Process Selection

- documentation of available data - review of data quality, availability over time, granularity - selection of data for further analyses

levels

- each level represents a position in the concept hierarchy - all - category - subcategory - product

post-processing

- eliminate the smallest clusters (outliers) - split high-SSE clusters (loose clusters) - merge close clusters with low SSE

NOSQL Properties (Base)

- eventual consistency - basically available - soft state

MOLAP Drawback

- extra cost for a separate MDDB - steeper learning curve - processing step can be quite lengthy - MOLAP tools have difficulty querying models with high cardinality

MOLAP Benefits

- fast - smaller due to compression - automated computation of higher level aggregates of the data - compact for low dimensional data sets - "natural indexing" - power and ease of analytical calculations

Extraction caveats (precautions)

- fast as possible - small as possible - infrequent as possible - changes in source system as small as possible

splitting rule

- for each splitting attribute, find best split (

distance measurement

- formal way to quantify similarity/dissimilarity (intra cluster distance, inter cluster distance)

external data sources

- from outside of the organization - marketing research, competitive information, economic forecasts, Web 2.0, EDI, RFID, EPCIS, purchased databases

Reporting engines

- graphical analysis - deriving personal performance indicators - used by IT departments and business analysts

KDD Process transformation

- handling of missing values - data reduction - encoding and projection

Why a separate data warehouse

- high performance for both systems - different functions and data

business application of clustering

- identification of homogeneous customer segments - document clustering - fraud detection

Disadvantages split sample method

- inefficient - high variance

OLAP concept hierarchies

- interactive data analysis - multiple dimensions defined by concept hierarchies - view data from different perspectives and aggregation levels

recursive partitioning approach

- local optimum - search next important variable - split tree accordingly and create a branch for each split value

Applications of cluster analysis

- marketing (understanding of customer populations, mass customization, identifying new products, classification of customers) - text analysis - fraud/anomaly detection

emphasize different notion of performance

- mean absolute error vs. mean squared error

Problems of data driven classifiers

- mislabeling - -> use the n nearest neighbors

solution of initial centroids problem k-means

- multiple runs - sample and use hierarchical clustering to determine centroids - start with more than k centroids - take the best

Data warehouse = integration in advance

- multiple sources are integrated in dwh - high performance

Stage Design

- possible consequences of the decision - development of possible solutions

classification of new examples

- nearest best match - no model estimation - only usage of available data (making regions of classes)
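
Classifying a new example by its nearest matches, with no model estimated beforehand, can be sketched as follows. The function name, the choice of Euclidean distance, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the prevailing class among the k training points nearest to query.

    train is a list of (features, label) pairs with equal-length feature tuples.
    """
    # No model is fitted: the stored examples themselves are consulted.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical examples: (walk score, quack score) -> class
train = [
    ((0.9, 0.8), "duck"),
    ((0.8, 0.9), "duck"),
    ((0.7, 0.7), "duck"),
    ((0.1, 0.2), "cat"),
    ((0.2, 0.1), "cat"),
]
prediction = knn_classify(train, (0.85, 0.75), k=3)
```

Using several neighbors instead of one makes the vote less sensitive to a single mislabeled example.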

Cloud Computing

- network based computing

ROLAP Benefits

- no extra costs for a proprietary system - familiar relational DBA skills and tools are used

Examples Data anomalies

- no unique key - naming, coding - meaning between groups - spelling - missing values - multiple encodings - multiple local standards - multiple names

pre processing

- normalize data - eliminate outliers

NOSQL

- not only using SQL - no fixed schema, allowing fields to be added to any record without constraints - often: open source - designed to work on large clusters

relational db data types

- numeric, strings, dates, uninterpreted, blobs

Cloud computing on demand

- on demand services - pay for use

Real World Server Architectures

- one ETL server, two DB servers (clustered), two report servers, one OLAP server, one DM server

Key/ Value

- pairs key to values - very high performance

comprehensibility

- prediction vs. insight - managers' trust (or distrust of black-box models) - difficult to measure

homogeneous ensemble

- produce base models with one prediction method

Starnet

- querying multidimensional databases using concept hierarchies

OLAP Query Characteristics

- read access to large amount of data - analysis of data relations - analysis of data by time - display of data across different dimensions - complex calculations - quick response

robustness

- real-world data is "noisy" (incomplete, incorrect) - real-world phenomena change over time - both affect the model (while it is built and afterwards)

advantages split sample method

- real world simulation of prediction model - easy to implement - fast - approximates generalization error

reasons for staging areas

- reduced load on operational systems & DWH - backup and recovery - auditing

data mining models

- regression - classification - clustering - association rule and sequential pattern mining

differ across modeling framework

- regression - classification - others

Data Quality

- relevant - useful - accurate - accessible

disadvantages consumption

- resource consumption

approach to simulate real-life-application

- resubstitution estimate - split sample estimate - cross validation estimate

Advantages of virtual machines

- run operating systems where physical hardware is out of reach - easier backup, creation of machines - test software on a clean installation - multiple OS possible (at one time) - debug problems - easy migration - run legacy systems

Relational databases: limits

- scalability (scaling across multiple servers is hard, joins over servers are difficult) - complexity (mapping data to tables is complex and slow) - SQL can only work with structured data

KDD Process Data Mining

- selection of data mining model - selection of data mining method (algorithm) - development/estimation of the model

linear classifiers

- separation by a "line" - good and bad part (over/under line)

non linear classifiers

- separation by a curve - good and bad part (over/under line)

strategies of distance measuring

- single linkage (closest objects) - complete linkage (farthest objects) - average linkage (mean of all pairwise distances) - centroid method (distance between centroids)
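
The four strategies differ only in how the distance between two clusters is derived from point-to-point distances. A minimal sketch, assuming Euclidean distance and two hypothetical clusters:

```python
import math
from itertools import product

def single_linkage(c1, c2):
    """Distance between the two closest objects."""
    return min(math.dist(a, b) for a, b in product(c1, c2))

def complete_linkage(c1, c2):
    """Distance between the two farthest objects."""
    return max(math.dist(a, b) for a, b in product(c1, c2))

def average_linkage(c1, c2):
    """Mean of all pairwise distances."""
    total = sum(math.dist(a, b) for a, b in product(c1, c2))
    return total / (len(c1) * len(c2))

def centroid(cluster):
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def centroid_distance(c1, c2):
    """Distance between the cluster centroids."""
    return math.dist(centroid(c1), centroid(c2))

left = [(0.0, 0.0), (0.0, 2.0)]
right = [(3.0, 0.0), (5.0, 0.0)]
distances = {
    "single": single_linkage(left, right),
    "complete": complete_linkage(left, right),
    "average": average_linkage(left, right),
    "centroid": centroid_distance(left, right),
}
```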

ROLAP Drawbacks

- slower performance - difficult implementation of some calculations - increased workload for it and end users

Types of Reporting

- standard - event driven - ad hoc

justifiability

- does the model agree with prior assumptions?

Document-oriented

- stores documents - document = hash = types - MongoDB, CouchDB

Data Mart

- subset of the EDW for a specific user group - selected subject - independent vs. dependent data marts

evaluating k-means clustering

- sum of squared error (SSE) - error = distance to centroid - multiple solutions: prefer the smallest SSE

Datawarehouse

- subject-oriented, integrated, time-variant, non-volatile collection of data to support management in its decision-making processes.

operational System Data Type and Origin

- time: present - horizon: short term - granularity: detailed

Divisive clustering

- top-down approach - iteratively split clusters into smaller sub-clusters

operational Systems Purpose and target audience

- user Group: large and heterogeneous - Focus: Business processes

Online Analytical Processing (OLAP)

- user centric - Multi dimensional - interactive - requires some IT-Skills affinity

BI Systems technology

- users: few - access frequency: low - usage pattern: irregular - response time: seconds - data volume: high - data updates: rare - storage: redundant - critical factors: database size, data quality

operational System technology

- users: many - access frequency: high - usage pattern: constant - response time: milliseconds - data volume: low - data updates: often - storage: normalized tables - critical factors: performance, concurrency, response time, fault tolerance

Mapping

- which operational attributes? - how to transform those? - mapping to dimensional models

k-means

- place random centroids - assign each point to the nearest centroid - compute new centroids (Euclidean) - repeat from step 2 until the centroids no longer move
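
The four steps can be sketched in plain Python. Choosing the initial centroids by sampling from the data points, the `seed` parameter, and the toy points are assumptions of this sketch; the returned SSE is the sum of squared distances to the assigned centroid.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k random starting centroids (here: sampled data points).
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sse = sum(math.dist(p, centroids[i]) ** 2
              for i, cl in enumerate(clusters) for p in cl)
    return centroids, clusters, sse

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters, sse = kmeans(points, k=2)
```

Running the function with several seeds and keeping the result with the smallest SSE is one way to handle the sensitivity to starting centroids.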

What are the critical success factors for Big Data analytics?

-A clear business need (alignment with the vision and the strategy) -Strong, committed sponsorship (executive champion) -Alignment between the business and IT strategy -A fact-based decision making culture -A strong data infrastructure

What is MapReduce? What does it do? How does it do it?

-A technique to distribute the processing of very large multi-structured data files across a large cluster of machines. -Aids organizations in processing and analyzing large volumes of multi-structured data. -Reads the input file and splits it into multiple pieces. These splits are then processed by multiple map programs running in parallel on the nodes of the cluster. -Each map program groups the data in its split by the type of geometric shape. -A reduce program takes the output from each map program and calculates the sum of the number of different types of geometric shapes.
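
The split, map, and reduce stages for the geometric-shape count can be imitated on one machine. This is only a sketch of the data flow, not of a real framework: the function names are hypothetical, and the map calls run sequentially here rather than in parallel on cluster nodes.

```python
from collections import defaultdict

def split_input(records, n_splits):
    """Split the input into n_splits pieces."""
    return [records[i::n_splits] for i in range(n_splits)]

def map_split(split):
    """Map: group one split by shape type and count within the split."""
    counts = defaultdict(int)
    for shape in split:
        counts[shape] += 1
    return dict(counts)

def reduce_counts(mapped):
    """Reduce: sum the per-split counts for every shape type."""
    totals = defaultdict(int)
    for partial in mapped:
        for shape, n in partial.items():
            totals[shape] += n
    return dict(totals)

shapes = ["square", "circle", "square", "triangle", "circle", "square"]
splits = split_input(shapes, 3)
mapped = [map_split(s) for s in splits]  # would run in parallel on a cluster
totals = reduce_counts(mapped)
```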

What is stream analytics? How does it differ from regular analytics?

-A term commonly used for extracting actionable information from continuously flowing/streaming data sources. -The science of analysis--to use data for decision making

What are important criteria when selecting an ETL tool?

-Ability to read from and write to an unlimited number of data source architectures -Automatic capturing and delivery of metadata -History of conforming to open standards -Easy-to-use interface for the developer and the functional user

When are column oriented organizations more efficient?

-An aggregate needs to be computed over many rows, but only for a notably smaller subset of all columns of data -New values of a column are supplied for all rows at once, because that column data can be written efficiently

What is Hadoop? How does it work?

-An open source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data. -A client accesses unstructured and semistructured data from sources including log files, social media feeds, and internal data stores. It breaks the data up into "parts," which are then loaded into a file system made up of multiple nodes running commodity hardware.

Data Mining Analysis Methods

-Analyzing customer buying patterns to predict future marketing and promotion campaigns. -Building budgets and other financial information. -Detecting fraud by identifying deceptive spending patterns. -Finding the best customers who spend the most money. -Keeping customers from leaving or migrating to competitors. -Promoting and hiring employees to ensure success for both the company and the individual.

What are the four major types of patterns that data mining seeks to identify?

-Associations -Predictions -Clusters -Sequential Relationships

How do the two approaches differ (approaches to recommendation systems).

-Collaborative filtering: The recommendation system is built on users' past behavior by keeping track of the history of all purchased items; it only needs to maintain the user ratings matrix. -Content filtering: Considers the specifications and characteristics of the items themselves rather than relying only on the user ratings matrix.

What are examples of social networks relevant to business activities?

-Communication Networks -Community Networks -Criminal Networks -Innovation Networks

What are the major characteristics and objectives of data mining?

-Data is presented in many formats. -The data mining environment is usually a client/server architecture or a Web-based IS architecture. -Sophisticated new tools help to extract the information "ore" buried in corporate files and public records. -The miner is often an end user with little or no programming skill. -Striking it rich often means finding an unexpected result and requires the end user to think creatively throughout the process. -Mined data can be analyzed and deployed quickly and easily. -Parallel processing is sometimes necessary for data mining.

What are the most common myths about data mining?

-Data mining provides instant, crystal-ball-like predictions -Data mining is not yet viable for business applications -Data mining requires a separate, dedicated database -Only those with advanced degrees can do data mining -Data mining is only for large firms that have lots of customer data

Describe the major components of a data warehouse

-Data sources: Data comes from multiple independent operational "legacy" systems and possibly from external data providers. -Data extraction and transformation: Uses custom-written or commercial software called ETL. -Data loading: Data starts in a staging area, is transformed and cleansed, then loaded into the data warehouse/data marts. -Comprehensive database: The EDW supports all decision analysis by providing relevant summarized and detailed information originating from many different sources. -Metadata: Includes software programs about data and rules for organizing data summaries. -Middleware tools: Enable access to the data warehouse.

What is Big Data analytics? How does it differ from regular analytics?

-Data that exceeds the reach of commonly used hardware environment and/or capabilities of software tools to capture, manage, and process it within a tolerable time span -The science of analysis--to use data for decision making

What are the use cases for Big Data and Hadoop?

-Data warehouse performance -Integrating data that provides business values -Interactive BI tools

What are the components of a Linear Programming Model?

-Decision Variables -Objective Function -Objective Function Coefficients -Constraints -Capacities -Input/Output Coefficients

What are the analysis tools for measuring social media?

-Descriptive analytics -Social network analysis -Advanced analytics

List ethical issues in analytics.

-Electronic surveillance -Ethics in DSS design -Software piracy -Invasion of individuals' privacy -Use of proprietary database -Use of intellectual property such as knowledge and expertise -Exposure of employees to unsafe environments related to computers -Computer accessibility for workers with disabilities -Accuracy of data, information, and knowledge -Protection of the rights of users -Accessibility to information -Use of corporate computers for non-work-related purposes -How much decision making to delegate to computers

What are direct benefits of implementing a data warehouse?

-End users can perform extensive analysis in numerous ways. -Consolidated view of corporate data -Better and more timely information -Enhance system performance -Data access is simplified

What are indirect benefits of implementing a data warehouse?

-Enhance business knowledge -Present a competitive advantage -Improve customer service and satisfaction -Facilitate decision making -Help reform business processes

What are the four main areas of effective security in a data warehouse?

-Establishing effective corporate and security policies and procedures; start at top management -Implementing logical security procedures and techniques to restrict access -Limit physical access to the data center environment. -Establish an effective internal control review process with an emphasis on security and privacy.

What are the major features of ES?

-Expertise -Symbolic reasoning -Deep knowledge -Self-knowledge

Describe the three steps of the ETL process

-Extraction: Reading data from one or more databases. -Transformation: Converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a data warehouse or simply another database. -Load: Putting the data into the data warehouse

What are some factors other than hardware, software, and network capabilities that have contributed to facilitating growth of decision support and analytics?

-Group communication and collaboration -Improved data management -Managing giant data warehouses and Big Data -Analytical support -Overcoming cognitive limits in processing and storing information -Knowledge management -Anytime, anywhere support

What are some of the key system-oriented trends that have fostered IS-supported decision making to a new level?

-Group communication and collaboration -Improved data management -Managing giant data warehouses and Big Data -Analytical support -Overcoming cognitive limits in processing and storing information -Knowledge management -Anywhere, anytime support

What are the things that help Web pages rank higher in the search engine results?

-Hire a company that specializes in search engine optimization to continuously improve sites appeal to changing practices of the search engines -Pay the search engine providers to be listed on the paid sponsors section -Consider reducing dependence on search engine traffic

What can cluster results be used for?

-Identify a classification scheme -Suggest statistical models to describe populations -Indicate rules for assigning new cases to classes for identification, targeting, and diagnostic purposes. -Provide measures of definition, size, and change in what were previously broad concepts -Find typical cases to label and represent classes. -Decrease the size and complexity -Identify outliers in a specific domain

What are factors that affect the architecture selection decision?

-Information interdependence between organizational units -Upper management's information needs -Urgency of need for a data warehouse. -Nature of end user tasks -Constraints on resources -Strategic view of the data warehouse prior to implemnetation -Compatibility with existing systems -Perceived ability of the in-house IT staff -Technical issues -Social and political factors

What are some of the major applications areas of artificial intelligence?

-Intelligent tutoring -Autonomous robots -Speech understanding -Automatic programming -Computer vision -Game playing -Expert systems -Intelligent agents -Natural language processing -Machine learning -Voice recognition -Neural networks -Genetic algorithms -Fuzzy logic

What are the benefits of implementing a data warehouse?

-Keepers: money saved by improving traditional decision support functions (20%) -Gathers: money saved due to automated collection and dissemination of information (30%) -Users: money saved or gained from decisions made using the data warehouse (50%)

How can visitor profiles be leveraged with web analytics and segmentation?

-Keywords -Content groupings -Geography -Time of Day -Landing page profiles

List and define the major components of an ES.

-Knowledge acquisition subsystem: accumulation, transfer, and transformation of problem-solving expertise from experts or documented knowledge sources to a computer program for constructing or expanding the knowledge base. -Blackboard: an area of working memory set aside as a database for description of the current problem, as characterized by the input data. -Explanation subsystem: can trace responsibility for conclusions to their sources and explain ES behavior by interactively answering: Why was a certain question asked by the ES? How was a certain conclusion reached? Why was a certain alternative rejected? What is the complete plan of decisions to be made in reaching the conclusion? -Knowledge-refining system: lets the ES analyze its own knowledge and its effectiveness, learn from it, and improve on it for future consultations.

When are row oriented organizations more efficient?

-Many columns of a single row are required at the same time and the row size is relatively small -Writing a new row if all of the column data is supplied at the same time.

Why has data mining become more popular?

-More intense competition at the global scale. -General recognition of the untapped value hidden in large data sources. -Consolidation and integration of database records. -Exponential increase in data processing and storage technologies. -Significant reduction in the cost of hardware and software for data storage. -Movement toward demassification of business practice

What is a cube? What do drill down, roll up, and slice and dice mean?

-Cube: Multidimensional data structure (actual or virtual) that allows fast analysis of data. -Drill down/up: The user navigates among levels of data ranging from the most summarized (up) to the most detailed (down). -Roll-up: Computing all of the data relationships for one or more dimensions. -Slice: Subset of a multidimensional array corresponding to a single value set for one (or more) of the dimensions not in the subset. -Dice: A slice on more than two dimensions of a data cube.
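
Slice, dice, and roll-up can be tried on a tiny in-memory cube. The dimensions, cell values, and helper names below are hypothetical; `dice` is sketched generically as restricting any number of dimensions to allowed value sets, and `slice_` as the special case of one fixed value.

```python
from collections import defaultdict

DIMS = ("product", "region", "quarter")

# Hypothetical cube cells: (product, region, quarter) -> sales
cube = {
    ("laptop", "north", "Q1"): 10,
    ("laptop", "south", "Q1"): 7,
    ("laptop", "north", "Q2"): 12,
    ("phone", "north", "Q1"): 20,
    ("phone", "south", "Q2"): 15,
}

def dice(cube, **allowed):
    """Keep cells whose coordinates lie in the allowed set of each named dimension."""
    return {
        key: value for key, value in cube.items()
        if all(key[DIMS.index(d)] in vals for d, vals in allowed.items())
    }

def slice_(cube, dim, value):
    """Fix a single value for one dimension."""
    return dice(cube, **{dim: {value}})

def roll_up(cube, keep):
    """Aggregate the measure over every dimension not listed in keep."""
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[DIMS.index(d)] for d in keep)] += value
    return dict(totals)

q1 = slice_(cube, "quarter", "Q1")
north_laptops = dice(cube, product={"laptop"}, region={"north"})
by_product = roll_up(cube, keep=("product",))
```

Drilling down corresponds to calling `roll_up` with more dimensions in `keep`, which returns finer-grained totals.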

What are conversion statistics?

-New & returning visitors -Leads -Sales Conversion -Abandonment / Exit Rates

What is NoSQL? How does it fit into the Big Data analytics picture?

-Not Only SQL. Processing large volumes of multi-structured data. -Serving up discrete data stored among large volumes of multi-structured data to end-users and automated Big Data applications. -Can work in conjunction with Hadoop.

What are challenges associated with implementing NLP?

-Part of speech tagging. -Text segmentation -Word sense disambiguation -Syntactic Ambiguity -Imperfect or Irregular Input -Speech Acts

Describe privacy concerns in analytics?

-Privacy is the right to be left alone and the right to be free from unreasonable personal intrusion -Internet uses and accesses data -Private info can aid in decision making, but hurts privacy

What are the big challenges that one should be mindful of when considering implementation of Big Data analytics?

-Process efficiency and cost reduction -Brand management -Revenue maximization, cross-selling, and up-selling -Enhanced customer experience -Churn identification, customer recruiting -Improved customer service -Identifying new products and market opportunities -Risk management -Regulatory compliance -Enhanced security capabilities

What are the types of organizations or professionals that comprise the analytics industry?

-Those who provide advice to analytics industry providers and users -Professional societies or organizations that are membership based and organized. -Analytics ambassadors, influencers, or evangelists who have presented their enthusiasm for analytics through seminars, books, or other publications.

What are the main types of a data warehouse?

-Provide decision support capability -Allows ready access to business information -Creates business insight

What are characteristics that differentiate between social and industrial media?

-Quality -Reach -Frequency -Accessibility -Usability -Immediacy -Updatability

What are the reasons for the upswing of open source software?

-Recession has driven up interest in low cost open source software -Open source tools are coming into a new level of maturity -Open source software augments traditional enterprise software without replacing it.

What are components of the inner petal of the analytics ecosystem?

-Regulators and policy makers -Analytics industry analysts & influencers -Academic institutions and certification agencies -Application Developers: industry specific or general

What are the use cases for Big Data and Hadoop?

-Repository -Active archive -Data warehouse performance -Integrating data that provides business values -Interactive BI tools

What does an LP allocation model assume?

-Returns from different allocations can be compared -Return from any allocation is independent of others. -All data are known with certainty -The resources are used in the most economical manner.

What are the two broad categories of SEOs?

-Techniques that search engines recommend as part of good site design. -Techniques of which search engines do not approve.

What are the data mining mistakes?

-Selecting the wrong problem for data mining. -Ignoring what your sponsor thinks data mining is and what it can/can't do. -Beginning without the end in mind. -Defining the project around a foundation that your data can't support. -Leaving insufficient time for data preparation. -Looking only at aggregated results and not at individual records. -Not keeping track of the data mining procedure and results. -Using data from the future to predict the future. -Ignoring suspicious findings and quickly moving on. -Starting with a high-profile, complex project first. -Running data mining algorithms repeatedly and blindly. -Ignoring the subject matter experts. -Believing everything you are told about the data. -Assuming full cooperation. -Measuring your results differently from the way your sponsor does. -Having an "if you build it, they will come" mindset.

What are the three key components of a BPM?

-Set of integrated, closed loop management and analytic processes that address financial and operational activities -Tools for business to define strategic goals and measure / manage performance against those goals. -Core set of processes linked to organizational strategy

When developing a data warehouse, what are the most important risks and issues to consider and avoid?

-Starting with the wrong sponsorship chain -Setting expectations that you cannot meet -Engaging in politically naïve behavior -Loading the warehouse with information just because it is available -Believing that data warehousing database design is the same as transactional database design -Choosing a data warehouse manager who is technology oriented rather than user oriented -Focusing on traditional internal record-oriented data and ignoring the value of external data and of text, images, and perhaps, sound and video. -Delivering data with overlapping and confusing definitions -Believing promises of performance, capacity, and scalability -Believing that your problems are over when the data warehouse is up and running -Focusing on ad hoc data mining and periodic reporting instead of alerts.

What are various risks and issues when developing a successful data warehouse?

-Starting with the wrong sponsorship chain -Setting expectations you can't meet -Engaging in politically naive behavior -Loading the warehouse with data just because it's available -Believing that data warehouse design is the same as transactional database design -Choosing a data warehouse manager who is technology oriented rather than user oriented. -Focusing on traditional internal record-oriented data and ignoring the value of external data -Delivering data with overlapping and confusing definitions. -Believing promises of performance, capacity, and scalability. -Believing that your problems are over when the data warehouse is up and running. -Focusing on ad hoc data mining and periodic reporting instead of alerts

What is the main difference between statistics and data mining?

-Statistics collects sample data to test the hypothesis whereas data mining and analytics use all the existing data to discover novel patterns and relationships. -Size of data varies

What are the most distinguishing features of KPIs?

-Strategy -Targets -Ranges -Encoding -Time frames -Benchmarks

What are the four perspectives that BSC suggests us to use to view organizational performance?

-The Customer Perspective -The Financial Perspective -The Learning and Growth Perspective -The Internal Business Processes Perspective

List the major characteristics of Web 2.0

-The ability to tap into the collective intelligence of users. The more users contribute, the more popular and valuable a Web 2.0 site becomes. -Data is made available in new or never-intended ways. Web 2.0 data can be remixed or "mashed up," often through Web service interfaces, much the way a dance-club DJ mixes music -Relies on user-generated and user-controlled content and data. -Lightweight programming technique and tools let nearly anyone act as a Web site developer. -The virtual elimination of software-upgrade cycles makes everything a perpetual beta or work-in-progress and allows rapid prototyping, using the Web as an application development platform. -Users can access applications entirely through a browser. -An architecture of participation and digital democracy encourages users to add value to the application as they use it. -A major emphasis is on social networks and computing. -There is strong support for information sharing and collaboration . -Fosters rapid and continuous creation of new business models.

What are some of the main challenges the Web poses for knowledge discovery?

-The web is too big for effective data mining -is too complex -is too dynamic -is not specific to a domain -has everything

What are best practices in social media analytics?

-Think of measurement as a guidance system -Track the elusive sentiment -Improve the accuracy of text analysis -Look at the ripple effect -Look beyond the brand -Identify your most powerful influencers -Look closely at the accuracy of your analytic tool -Incorporate social media intelligence into planning

List and briefly describe the best practices in social media analytics?

-Think of measurement as a guidance system, not a rating system -Track the elusive sentiment -Continuously improve the accuracy of text analysis -Look at the ripple effect -Look beyond the brand -Identify your most powerful influencers -Look closely at the accuracy of your analytic tool -Incorporate social media intelligence into planning

Why is master data management gaining popularity?

-Tighter integration with operational systems demands -Most data warehouses still lack MDM and data quality functions -Regulatory and financial reports must be perfectly clean and accurate

What are challenges with the Web?

-Too big for effective data mining -Too complex & dynamic -Not specific to a domain -Web has everything

What are difficulties that arise when analyzing multiple goals?

-Usually difficult to obtain an explicit statement of the organization's goals. -Goals and subgoals are viewed differently at different levels of the organization. -Decision maker may change the importance assigned to specific goals over time or for different decision scenarios. -Personal agendas -Participants assess the importance of goals differently.

What are the ways to manage multiple goals?

-Utility theory -Goal Programming -Expression of goals as constraints -Points system

What is special about the Big Data vendor landscape? Who are the big players?

-Vendors are able to develop their own Hadoop distributions, based on the Apache open source distribution, but with various levels of proprietary customization. -Cloudera, MapR, Hortonworks

What are the four categories of web analytics?

-Web site usability -Traffic sources -Visitor profiles -Conversion statistics

What will play a significant role in defining the future of data warehouse?

-Web, social media, and Big Data -Open source software -SaaS -Cloud Computing -Data lakes

KPIs

-linked to a strategy w/ an objective -defines the target and actual performance measure (e.g. increase repeat business for bike customers by 15%)

Transaction processing systems

... Support day-to-day operations.

Analytic Information systems

... Support decision making.

dependent data mart

... are sourced directly from EDW

independent Data Mart

... are sourced from one or more operational systems or external information providers

Information System

... concerns the information and communication systems in business and administration.

KDD definition

... is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data.

Information and communication systems

... is a sociotechnical system for satisfying the demand for information. It is a human/task/technology system.

Grievances

/ˈgri vəns/ : grievances a real or imagined wrong or other cause for complaint or protest, especially unfair treatment.

DWH Tiers

0. Data sources 1. Data storage 2. OLAP engine 3. Front-end tools

4 categories of KPI examples

1) Financial 2) Customer 3) Business Process 4) Learning and Growth

Costs of Using Low-Quality Information

1) Inability to track customers accurately. 2) Difficulty identifying the organization's most valuable customers. 3) Inability to identify selling opportunities. 4) Lost revenue opportunities from marketing to nonexistent customers. 5) The cost of sending undeliverable mail. 6) Difficulty tracking revenue because of inaccurate invoices. 7) Inability to build strong relationships with customers.

6 Dashboard Elements in Performance Point

1) Indicators 2) Filters 3) Reports 4) KPIs 5) Scorecards 6) Dashboard

The four primary reasons for low-quality information

1) Online customers intentionally enter inaccurate information to protect their privacy. 2) Different systems have different information entry standards and formats. 3) Data-entry personnel enter abbreviated information to save time or erroneous information by accident. 4) Third-party and external information contains inconsistencies, inaccuracies, and errors.

IMPORTANT difference between data, information, knowledge

1) data = facts, observations, raw numbers 2) information = a subset of data given meaning by its context, produced by manipulating raw data, e.g. number of sales today 3) knowledge = derived information, justified beliefs (logic, empirical observations) about relationships among concepts; decisions are more reliable if based on knowledge, not just data or information

6 Distinguishing features of KPIs

1) embody strategic objectives 2) measure performance against specific targets 3) targets have performance ranges (above, on, below) 4) ranges are encoded in software enabling visual display (e.g. red, yellow, green) 5) targets typically are assigned time frames by which they must be accomplished 6) targets are often measured against a benchmark (e.g. previous year's results)
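
Features 2 to 4 can be sketched in a few lines. This is only an illustration: the function name kpi_status, the 5% tolerance band, and the assumption that higher values are better are all hypothetical, not taken from the card.

```python
def kpi_status(actual, target, tolerance=0.05):
    """Encode a KPI's performance range as a stoplight color.

    Assumes higher values are better; the 5% tolerance band is an
    illustrative choice, not a standard.
    """
    if actual >= target:
        return "green"   # on or above target
    if actual >= target * (1 - tolerance):
        return "yellow"  # slightly below target
    return "red"         # well below target


# Target: increase repeat business by 15% (actual values are made up).
print(kpi_status(16.0, 15.0))
print(kpi_status(14.5, 15.0))
print(kpi_status(10.0, 15.0))
```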

IMPORTANT four synergistic capabilities of BI

1) organizational memory: collect quantitative data, accumulated over time 2) information integration: non-quantitative and external data 3) insight creation: apply analytics 4) presentation: display in visual and user friendly formats --> They provide input to each other

two types of integrity constraints

1) relational 2) business critical

6 Dashboard Characteristics

1) use of visual components (e.g. charts, performance bars, spark lines, gauges, meters, stoplights) to highlight, at a glance, the data and exceptions that require action 2) transparent to the user, meaning that they require minimal training and are extremely easy to use 3) combine data from a variety of systems into a single, summarized, unified view of the business 4) enable drill-down or drill-through to underlying data sources or reports 5) present a dynamic, real-world view with timely data updates 6) require little, if any, customized coding to implement, deploy, and maintain

advantages to using the web to access company databases

1) web browsers are much easier to use than directly accessing the database by using a custom-query tool 2) the web interface requires few or no changes to the database model 3) it costs less to add a web interface in front of a DBMS than to redesign and rebuild the system to support changes. Additional data-driven website advantages include: -Easy to manage content: Website owners can make changes without relying on MIS professionals; users can update a data-driven website with little or no training. -Easy to store large amounts of data: Data-driven websites can keep large volumes of information organized. Website owners can use templates to implement changes for layouts, navigation, or website structure. This improves website reliability, scalability, and performance. -Easy to eliminate human errors: Data-driven websites trap data-entry errors, eliminating inconsistencies while ensuring that all information is entered correctly.

Zhao described five levels of metadata management maturity:

Ad hoc, discovered, managed, optimized, and automated.

What are the steps of CRISP DM Process?

1. Business Understanding 2. Data Understanding 3. Data Preparation 4. Model Building 5. Testing & Evaluation 6. Deployment

What are the data processing steps?

1. Data Consolidation -- collect, select, and integrate 2. Data Cleaning -- impute values, reduce noise, eliminate duplicates 3. Data Transformation -- normalize, discretize, and create attributes 4. Data Reduction -- dimension, volume, and balance data
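
Two of these steps, imputing a missing value (cleaning) and normalizing (transformation), might look like the sketch below. The sample values are made up, and mean imputation and min-max scaling are just one common choice for each step.

```python
from statistics import mean

raw = [12.0, None, 20.0, 12.0, 16.0]  # hypothetical attribute, one value missing

# Data cleaning: impute the missing value with the mean of the observed values.
observed = [v for v in raw if v is not None]
fill = mean(observed)
cleaned = [fill if v is None else v for v in raw]

# Data transformation: min-max normalize the attribute to the range [0, 1].
lo, hi = min(cleaned), max(cleaned)
normalized = [(v - lo) / (hi - lo) for v in cleaned]

print(cleaned)
print(normalized)
```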

Describe the four steps managers take in making a decision.

1. Define the problem (A decision situation that may deal with some difficulty or with an opportunity) 2. Construct a model that describes the real-world problem 3. Identify possible solutions to the modeled problem and evaluate the solutions 4. Compare, choose, and recommend a potential solution to the problem

What are the steps of simulation?

1. Define the problem. 2. Construct the simulation model. 3. Test & validate the model 4. Design the experiment 5. Conduct the experiment

What are the steps of the text mining process?

1. Establish the corpus 2. Create the term document matrix. 3. Extract knowledge
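
A minimal sketch of steps 1 and 2, assuming whitespace tokenization and raw term counts (real projects usually add stemming, stop-word removal, and weighting). The two-document corpus is invented for illustration.

```python
from collections import Counter

# Step 1: establish the corpus (two toy documents).
corpus = ["data mining finds patterns", "text mining finds text patterns"]

# Step 2: create the term-document matrix (rows = documents, columns = terms).
tokenized = [doc.lower().split() for doc in corpus]
terms = sorted({term for doc in tokenized for term in doc})
tdm = [[Counter(doc)[term] for term in terms] for doc in tokenized]

print(terms)
for row in tdm:
    print(row)
```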

classification steps

1. Model construction - training set -> model 2. Model usage

What are the main steps in carrying out sentiment analysis projects?

1. Sentiment Detection 2. N-P Polarity Classification 3. Target Identification 4. Collection and Aggregation

What are the steps for a sentiment analysis?

1. Sentiment Detection: calculate the OS Polarity 2. NP Polarity Classification 3. Target Identification: Identify the target for sentiment 4. Collection & aggregation

structured approach to architecture development

1. high level corporate data model in short time 2. independent data marts can be implemented in parallel with the EDW 3. Distributed data marts can be constructed to integrate different data marts 4. EDW is constructed

BI evolution

1.0 DBMS based structured Content 2.0 web based, unstructured Content 3.0 mobile and sensor based content

How far does data warehousing trace back to?

1970s

complete but inaccurate information

2/31/10 is an example of complete but inaccurate information (February 31 does not exist)

Dimensions

4-dimensional array...

Enterprise Application Integration (EAI)

An alternative to ERP. EAI is middleware that can parse, duplicate, or transform data between applications. It allows integration without redefining business practices. EAI connects multiple isolated systems and makes them work together and share their data. ERP, in contrast, is a monolithic software block.

What are examples of traffic sources?

-Referral Web Sites -Search Engines -Direct Searches via bookmarking of web page or using URL -Offline campaigns -Online Campaigns

dendrogram

?!?

Define decision automation systems.

A business rule-based system that uses intelligence to recommend solutions to repetitive decisions (such as pricing). Is also called automated decision support: A rule-based system that provides a solution to a repetitive managerial problem.

What is a namespace? Why is it important in key-value database?

A collection of identifiers. Keys must be unique within a namespace.
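
As a rough illustration of why the namespace matters, the toy store below keys every value by a (namespace, key) pair. The class and method names are hypothetical and do not correspond to any particular key-value database.

```python
class KeyValueStore:
    """Toy key-value store in which keys are unique only within a namespace."""

    def __init__(self):
        self._data = {}

    def put(self, namespace, key, value):
        # The same key may appear in different namespaces without colliding.
        self._data[(namespace, key)] = value

    def get(self, namespace, key):
        return self._data[(namespace, key)]


store = KeyValueStore()
store.put("customers", "42", "Ann")
store.put("orders", "42", "mountain bike")
print(store.get("customers", "42"), "/", store.get("orders", "42"))
```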

data store

A data repository - either permanent or temporary - for data transformed by processes. Data Stores can be files or full database systems.

What is a data mart?

A departmental data warehouse that stores only relevant data

What is data visualization? Why is it needed?

A graphical, animation, or video presentation of data and the results of data analysis. The use of visual representations to explore, make sense of, and communicate data.

Define Gini index. What does it measure?

A metric that is used in economics to measure the diversity of the population. The same concept can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute/variable.
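
A small sketch of how the index is commonly computed for decision-tree branching, using the usual impurity form 1 - sum(p_i^2) over class proportions. The function names are illustrative, and each group is assumed to be non-empty.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of one non-empty group: 0.0 means the group is pure."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(groups):
    """Size-weighted Gini impurity of the groups produced by a branch."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * gini_impurity(group) for group in groups)


print(gini_impurity(["yes", "yes", "yes"]))       # a pure group
print(gini_impurity(["yes", "no"]))               # an evenly mixed group
print(gini_split([["yes", "yes"], ["no", "no"]])) # a branch that separates classes
```

A lower weighted value after the split indicates that branching on that attribute yields purer groups.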

What is Six Sigma? How is it used as a performance measurement system?

A performance management methodology aimed at reducing the number of defects in a business process to as close to zero defects per million opportunities (DPMO) as possible.
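
DPMO is usually computed as defects divided by total opportunities, scaled to one million. A minimal sketch follows; the function name and the sample figures are made up.

```python
def dpmo(defects, units, opportunities_per_unit):
    """Defects per million opportunities."""
    return defects * 1_000_000 / (units * opportunities_per_unit)


# Hypothetical process: 34 defects found in 10,000 units,
# each unit having 10 opportunities for a defect.
print(dpmo(34, 10_000, 10))
```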

What is a balanced scorecard? Where did it come from?

A performance measurement and management methodology that helps translate an organization's financial, customer, internal process, learning and growth objectives and targets into a set of actionable initiatives. Kaplan and Norton first articulated this methodology in their Harvard Business Review article in 1992.

What is a performance management system? Why do we need one?

A performance measurement system typically comprises systematic methods of setting business goals together with periodic feedback reports that indicate progress against goals. A system that assists managers in tracking the implementations of business strategy by comparing actual results against strategic goals and objectives.

What is a data warehouse?

A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized form.

Data warehouse

A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.

RDBMS vs DBMS

A relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model

What is an independent data mart

A small warehouse designed for a strategic business unit or department

What is a social network? What is social network analysis?

A social network is a social structure composed of individuals/people (or groups of individuals or organizations) linked to one another with some type of connections/relationships. Social network analysis (SNA) is the systematic examination of social networks. Dating back to the 1950s, social network analysis is an interdisciplinary field that emerged from social psychology, sociology, statistics, and graph (network) theory.

data cube

A special database used to store data in OLAP reporting

Key Performance Indicator (KPI)

A strategic objective AND METRICS that measures performance against a goal

What is a dependent data mart

A subset that is created directly from a data warehouse

What is a data cube

A two-dimensional, three-dimensional, or higher-dimensional object in which each dimension of the data represents a measure of interest

What are the differences and commonalities between dashboards and scorecards?

Dashboard: a visual presentation of critical data for executives to view. It allows executives to see hot spots in seconds and explore the situation. Scorecard: a performance measurement and management methodology that helps translate an organization's financial, customer, internal process, learning and growth objectives and targets into a set of actionable initiatives

What are examples of transaction processing?

ATM withdrawals, bank deposits, cash register scans at the grocery store, etc.

Data-as-a-service (DaaS)

Accessing data "where it lives", enriching data quality with centralization,

Out flow:

Accessing to obtain data by consumer ad hoc and routine. Delivery: to render data by warehouse via publish and subscribe mechanisms.

What is the percentage of test data set samples correctly classified by the model?

Accuracy Rate

What is the outcome of predictive analytics?

Accurate projections of future events and outcomes.

How does DaaS change the way data is handled?

Actual platform on which data resides doesn't matter. Any business process can access data wherever it resides. Customers can move quickly due to simplicity of the data access and the need for basic knowledge not expert knowledge

What includes predictive analysis and text analytics that examine the content in online conversations?

Advanced analytics

Web 2.0

Advanced Web (blogs, wikis, social networks), Objective (enhance creativity, information sharing, collaboration), Changing the web from passive to active, redefining what is on the Web as well as how it works, companies are adopting and benefiting

When would all items start in individual clusters and the clusters are joined together?

Agglomerative
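
The bottom-up idea can be sketched as follows, assuming one-dimensional points and single linkage (distance between the closest members); both are simplifying assumptions, and the function name is illustrative.

```python
def agglomerative(points, k):
    """Merge the two closest clusters (single linkage) until k clusters remain."""
    clusters = [[p] for p in points]  # every item starts in its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # join the closest pair
        del clusters[j]
    return clusters


print(agglomerative([1, 2, 10, 11, 30], 3))
```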

Down-Flow

Aging. To archive data into storage hierarchy

business intelligence examples

Airlines: Analyze popular vacation locations with current flight listings. Banking: Understand customer credit card usage and nonpayment rates. Health care: Compare the demographics of patients with critical illnesses. Insurance: Predict claim amounts and medical coverage costs. Law enforcement: Track crime patterns, locations, and criminal behavior. Marketing: Analyze customer demographics. Retail: Predict sales, inventory levels, and distribution. Technology: Predict hardware failures.

relational online analytical processing (ROLAP)

Analytical processing functions that use relational databases and familiar relational query tools to store and analyze multidimensional data

What is the process of developing actionable decisions or recommendations for actions based on insights generated from historical data?

Analytics

Who has developed analytics software for general use with data that has been collected in a data warehouse or is available through one of the platforms?

Analytics Focused Software Developers

What is a report? What are they used for?

Any communication artifact prepared with the specific intention of conveying information in a presentable form. -To ensure that all departments are functioning properly -To provide information -To provide the results of an analysis -To persuade others to act -To create an organizational memory (as part of a knowledge management system)

What is the most commonly used algorithm to discover association rules that attempts to find subsets that are common to at least a minimum number of the itemsets ?

Apriori Algorithm *uses bottom up approach*
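
A compact sketch of the bottom-up idea: count single items first, then build larger candidate itemsets only from itemsets that already met the minimum support. The function name, the support-count convention, and the basket data are illustrative assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
for itemset, count in apriori(baskets, min_support=3).items():
    print(sorted(itemset), count)
```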

What is a graphical assessment technique where the true positive rate is plotted on the y axis and the false positive is plotted on the x-axis?

Area Under the ROC Curve
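
The plotted points and the area can be sketched as below, assuming boolean class labels and a score where higher means "more likely positive". The function names and sample data are illustrative, and the area uses the trapezoidal rule over the thresholds present in the data.

```python
def roc_point(actual, scores, threshold):
    """Return (false positive rate, true positive rate) at one score threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    positives = sum(actual)
    negatives = len(actual) - positives
    return fp / negatives, tp / positives


def auc(actual, scores):
    """Area under the ROC curve via the trapezoidal rule."""
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)] + [roc_point(actual, scores, t) for t in thresholds]
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


actual = [True, True, False, True, False, False]  # hypothetical test set
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]           # hypothetical model scores
print(roc_point(actual, scores, 0.65))
print(auc(actual, scores))
```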

What is the most popular and most commonly used measure of central tendency?

Arithmetic Mean

What is the sum of all the values/observations divided by the number of observations in the data set?

Arithmetic Mean

Why is Big Data important? What has changed to put it in the center of the analytics world?

As more and more data becomes available in various forms and fashions, timely processing of the data with traditional means becomes impractical. The exponential growth, availability, and use of information, both structured and unstructured, brings Big Data to the center of the analytics world. Pushing the boundaries of data analytics uncovers new insights and opportunities for the use of Big Data.

What aims to find interesting relationships between variables in large databases?

Association Rule Mining

What finds the commonly co-occurring grouping of things?

Associations

What is a popular and well-researched technique for discovering interesting relationships among variables in large database?

Associations

What is an "authoritative page"? What is a "hub"? What is the difference between the two?

Authoritative page: web page that is identified as particularly popular based on links by other web pages and directories Hub: one or more web pages that provide a collection of links to authoritative pages Difference: hub will contain multiple authoritative pages where an authoritative page is just one link

How can sentiment analysis be used in predicting financial markets?

Automated analysis of market sentiments using social media, news, blogs, and discussion groups seems to be a proper way to compute the market movements. If done correctly, it can identify short-term stock movements based on the buzz in the market, potentially impacting liquidity and trading.

What is the creation of a shortened version of a textual document by a computer program that contains the most important points of the original document

Automatic Summarization

What are the three types of data generated through Web page visits?

Automatically generated data stored in server access logs, referrer logs, agent logs, and client-side cookies User profiles Metadata, such as page attributes, content attributes, and usage data.

IMPORTANT ETL process

BI tools can also directly help obtaining data and information (such as through extraction, transformation, and loading of data).

What is the main difference between BSC and Six Sigma?

BSC is focused on improving overall strategy and the Six Sigma is focused on improving processes.

Describe the reasoning procedures of forward chaining and backward chaining.

Backward Chaining: goal-driven approach in which you start from expectation of what is going to happen and then seek evidence that supports (or contradicts) your expectation. Forward Chaining: data-driven approach. Starts from available information as it becomes available or from a basic idea, and then we try to draw conclusions. The ES analyzes the problem by looking for the facts that match the IF part of its IF-THEN rules.
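
Both procedures can be sketched over a tiny rule base. The rules and fact names are invented for illustration, and the backward-chaining function assumes the rules contain no cycles.

```python
def forward_chain(facts, rules):
    """Data-driven: fire IF-THEN rules until no new facts can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conclusion not in known and set(conditions) <= known:
                known.add(conclusion)
                changed = True
    return known


def backward_chain(goal, facts, rules):
    """Goal-driven: look for evidence that supports the goal (acyclic rules only)."""
    if goal in facts:
        return True
    return any(
        all(backward_chain(c, facts, rules) for c in conditions)
        for conditions, conclusion in rules
        if conclusion == goal
    )


# Hypothetical rule base: (IF conditions, THEN conclusion).
rules = [
    (("has_fever", "has_cough"), "flu_suspected"),
    (("flu_suspected",), "recommend_rest"),
]
print(sorted(forward_chain({"has_fever", "has_cough"}, rules)))
print(backward_chain("recommend_rest", {"has_fever", "has_cough"}, rules))
```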

What is used when introducing structure to a collection of text based documents to classify them into two or more predetermined classes or to cluster them into natural groupings?

Bag of Words, e.g. Spam Filtering

What is the best known and most widely used performance management system that suggests people view the organization from four perspectives?

Balanced Scorecard (BSC)

What is both a performance measurement and a management methodology that helps translate an organization's financial, customer, internal process, and learning / growth objectives and targets into a set of actionable initiatives?

Balanced Scorecard

The E in BASE stands for eventually consistent. What does that mean?

Basically Available, Soft State, Eventually Consistent. Some replicas might be inconsistent for some period of time but will become consistent at some point.
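
A toy simulation of the idea, assuming a write is acknowledged after reaching one replica and copied to the others later. The class and its methods are hypothetical and do not model any real NoSQL product.

```python
class ReplicatedStore:
    """Toy store: writes reach replica 0 first and the other replicas later."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to the other replicas

    def write(self, key, value):
        self.replicas[0][key] = value  # acknowledged after one replica
        self.pending.append((key, value))

    def read(self, key, replica):
        return self.replicas[replica].get(key)  # may return stale data

    def propagate(self):
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()


store = ReplicatedStore()
store.write("stock", 5)
print(store.read("stock", 0), store.read("stock", 2))  # replicas disagree for now
store.propagate()
print(store.read("stock", 0), store.read("stock", 2))  # replicas now agree
```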

Cloud sourcing

Benefits: - high scale, low-cost providers - any time/place access via web browser - rapid scalability, cost and load sharing Concerns: - performance, reliability, SLAs - control of data and service parameters - application features and choices - lock-in effects, no migration between cloud providers - no standard API - privacy, security, compliance, trust

What is the outcome of prescriptive analytics?

Best possible business decisions and actions

What refers to data that is structured, unstructured , in a stream and so forth?

Big Data

Info & Info 2

Big data is one of the most promising technology trends occurring today. Of course, notable companies such as Facebook, Google, and Netflix are gaining the most business insights from big data currently, but many smaller markets are entering the scene, including retail, insurance, and health care. Over the next decade, as big data starts to improve your everyday life by providing insights into your social relationships, habits, and careers, you can expect to see the need for data scientists and data artists dramatically increase.

The four common characteristics of big data

Big data requires sophisticated tools to analyze all the unstructured information from millions of customers, devices, and machine interactions. Big data are analyzed for marketing trends in business as well as in the fields of manufacturing, medicine, and science

Who is the father of data warehousing?

Bill Inmon

What attempts to improve rankings in ways that are not approved by the search engines or involve deception?

Black Hat SEO

In which method is a fixed number of instances from the original data sampled (with replacement) for training, with the rest of the data set used for testing?

Bootstrapping
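
A minimal sketch, assuming the training sample is the same size as the data set and is drawn with replacement, so the instances that were never drawn form the test set. The function name and the fixed seed are illustrative.

```python
import random


def bootstrap_split(data, seed=0):
    """Sample len(data) instances with replacement; unsampled ones form the test set."""
    rng = random.Random(seed)
    n = len(data)
    picked = [rng.randrange(n) for _ in range(n)]  # indices, repeats allowed
    chosen = set(picked)
    train = [data[i] for i in picked]
    test = [data[i] for i in range(n) if i not in chosen]
    return train, test


train, test = bootstrap_split(list(range(10)))
print(train)
print(test)
```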

What is the plan big, build small approach that focuses on the request of a specific department?

Bottom Up Approach / Data Mart Approach (DM)

What is a graphical illustration of several descriptive statistics about a given data set?

Box-and-Whiskers Plot / Box Plot
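
The statistics such a plot summarizes can be computed as below. This sketch uses Python's statistics.quantiles with its default method; other tools may use slightly different quartile conventions, and the sample data are made up.

```python
from statistics import quantiles


def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, and maximum."""
    q1, median, q3 = quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```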

What approach uses probability theory to build classification models based on the past occurrence that are capable of placing a new instance into a most probable class/category?

Bayesian Classifiers

What represents the outcome of a test to classify a pattern using one of the attributes?

Branch

What focuses on listening to social media where anyone can post opinions that can damage or boost your reputation?

Brand Management

Who is an individual whose weak ties fill a structural hole, providing the only link between two individuals or clusters?

Bridge

What are the subcategories of distributions (SNA)?

Bridge, centrality, density, distance, structural holes, and tie strength

What is a collection of tools for manipulating, mining, and analyzing the data in the warehouse?

Business Analytics

What enables interactive access to data, manipulation of data, and the ability to conduct appropriate analysis?

Business Intelligence

What is an umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies?

Business Intelligence

What is based on the transformation of data to information, then decisions, and finally actions?

Business Intelligence

What are the business processes, methodologies, metrics, and technologies used by enterprises to measure, monitor, and manage business performance?

Business Performance Management (BPM)

What is used for monitoring and analyzing performance?

Business Process Management

What are enablers of descriptive analytics?

Business reporting, dashboards, scorecards, and data warehousing

What do the C and A in the CAP theorem stand for? Give an example of how designing for one of those properties can lead to difficulties in maintaining the other.

C is Consistency. A is Availability. P is Partition tolerance. When using a two-phase commit, the database favors consistency but at the risk of the most recent data not being available for a brief period of time. While the two-phase commit is executing, other queries to the data are blocked. The updated data is unavailable until the two-phase commit finishes. This favors consistency over availability.

What uses a sequence of six steps that starts with a good understanding of the business and the need for the data mining project and ends with the deployment of the solution that satisfies the specific business need?

CRISP DM

List and briefly define the phases in the CRISP-DM process

CRISP-DM provides a systematic and orderly way to conduct data mining projects. This process has six steps. First, an understanding of the data and an understanding of the business issues to be addressed are developed concurrently. Next, data are prepared for modeling; are modeled; model results are evaluated; and the models can be employed for regular use.

How does CRISP-DM differ from SEMMA?

CRISP-DM: A cross-industry standardized process of conducting data mining projects, which is a sequence of six steps that starts with a good understanding of the business and the need for the data mining project and ends with the deployment of the solution that satisfies the specific business need. SEMMA: An alternative process for data mining projects proposed by the SAS Institute. "Sample, Explore, Modify, Model, and Assess"

EDWs are used to provide data for many types of DSS including:

CRM, supply chain management (SCM), business performance management (BPM), business activity monitoring (BAM), product life-cycle management (PLM), revenue management, and sometimes even Knowledge Management Systems (KMS).

What represents the labels of multiple classes used to divide a variable into specific groups and represents a finite number of values with no continuum between them?

Categorical data / Discrete data, e.g. race, sex, age group, and educational level

What refers to a group of metrics that aim to quantify the importance or influence of a particular node within a network?

Centrality
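
One of the simplest metrics in this group is degree centrality: the share of other nodes a node is directly connected to. A sketch over an undirected edge list follows; the function name and the names in the network are made up.

```python
def degree_centrality(edges):
    """Fraction of the other nodes each node is directly connected to."""
    nodes = {node for edge in edges for node in edge}
    degree = {node: 0 for node in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {node: d / (len(nodes) - 1) for node, d in degree.items()}


edges = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]
for node, score in sorted(degree_centrality(edges).items()):
    print(node, round(score, 2))
```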

What warehousing architecture has a gigantic EDW that serves the needs of all organizational units and provides users with access to all the data in the data warehouse?

Centralized Data Warehouse

What is assumed that complete knowledge is available so that the decision maker knows exactly what the outcome of each course of action is?

Certainty

What is based on the identification, capture, and delivery of the changes made to enterprise data sources?

Change Capture

What is the most common data mining task, which analyzes the historical data stored in a database and automatically generates a model that can predict future behaviors?

Classification

What is the most frequently used data mining method for real world problems?

Classification

What is the primary source for accuracy estimation in classification problems?

Classification Matrix
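
A sketch of how such a matrix can be tallied and how the accuracy rate (the share of test samples classified correctly) follows from its diagonal. The function names and the labels are illustrative.

```python
from collections import Counter


def classification_matrix(actual, predicted):
    """Count how often each (actual class, predicted class) pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy_rate(matrix):
    """Correctly classified samples (the diagonal) divided by all samples."""
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / sum(matrix.values())


actual = ["yes", "yes", "no", "no", "yes"]     # hypothetical test labels
predicted = ["yes", "no", "no", "no", "yes"]   # hypothetical model output
matrix = classification_matrix(actual, predicted)
print(dict(matrix))
print(accuracy_rate(matrix))
```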

What is the difference between the clustering and the classification?

Classification learns the function between the characteristics of things and their membership through a supervised learning process whereas clustering is an unsupervised learning process where only the input variables are presented to the algorithm.

What is the major difference between cluster analysis and classification?

Classification methods learn from previous examples containing inputs and the resulting class labels, and once properly trained they are able to classify future cases. Clustering partitions pattern records into natural segments or clusters.

What are subcategories of prediction?

Classification, regression, and time series.

What shows the clickable photos, text links in the copy, downloads, and navigation on a page?

Click map

What can reveal where you might be losing visitors in a specific process?

Click paths

What analysis of the information collected by web servers can help better understand user behavior?

Clickstream Analysis

What are the sources of Big Data?

Clickstreams from websites, postings on social media sites, traffic, sensors, or weather

MOLAP Model

Client (storing with multidimensional arrays) -> application layer (MOLAP engine) -> warehouse - efficient storage and processing - complexity hidden from the user

ROLAP Model

Client -> SQL/MDX -> application layer (ROLAP engine) -> SQL -> warehouse server

What are the subcategories of segmentation for SNA?

Cliques and social circles, clustering coefficient, and cohesion.

What implies that optimum performance is achieved by setting goals and objectives, establishing initiatives and plans to achieve those goals, monitoring actual performance, and taking corrective action?

Closed Loop BPM Cycle

Define cloud computing. How does it relate to PaaS, SaaS, and IaaS?

Cloud computing offers the possibility of using software, hardware, platform, and infrastructure, all on a service-subscription basis. Cloud computing enables a more scalable investment on the part of a user. Like PaaS, etc., cloud computing offers organizations the latest technologies without significant upfront investment. In some ways, cloud computing is a new name for many previous related trends: utility computing, application service providers, grid computing, on-demand computing, software as a service (SaaS), and even older centralized computing with dumb terminals. But the term cloud computing originates from a reference to the Internet as a "cloud" and represents an evolution of all previous shared/centralized computing trends.

What has been used extensively for fraud detection and market segmentation of customers in contemporary CRM systems?

Cluster Analysis

What is the means of identifying classes of items so that items in a cluster have more in common with each other than with items in other clusters, and of identifying natural groupings of events or objects that share a common set of characteristics?

Cluster Analysis

What is used to sort cases into groups or clusters so that the degree of association is strong among members of the same cluster and weak among members of different clusters?

Cluster Analysis

What partitions a collection of things into segments whose members share similar characteristics, but the class labels are unknown?

Clustering

What are subcategories of segmentation?

Clustering and outlier analysis

What is the measure of the likelihood that two associates of a node are associates themselves?

Clustering coefficient
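
A minimal sketch of how a local clustering coefficient could be computed for one node: the fraction of pairs of the node's neighbors that are also connected to each other. The network, the names, and the function `local_clustering` are hypothetical and only serve as an illustration.

```python
from itertools import combinations


def local_clustering(graph, node):
    """Fraction of pairs of `node`'s neighbors that are linked to each other."""
    neighbors = graph[node]
    if len(neighbors) < 2:
        return 0.0  # fewer than two neighbors: no pairs to check
    pairs = list(combinations(sorted(neighbors), 2))
    linked = sum(1 for a, b in pairs if b in graph[a])
    return linked / len(pairs)


# Hypothetical undirected network stored as adjacency sets.
graph = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"ann"},
}

print(local_clustering(graph, "ann"))  # one of three neighbor pairs is linked
print(local_clustering(graph, "bob"))  # bob's two neighbors know each other
```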

What identifies the natural grouping of things based on their known characteristics?

Clusters

What is it called when an individual's problem solving capability is limited when a wide range of diverse information and knowledge is required?

Cognitive Limits

What is a system that stores data tables as sections of columns of data rather than as rows of data?

Columnar Database / Column Oriented Database Management Systems **much finer grain of control**

Name two data structures used in column family databases.

Columns and column families

What are semistructured decisions? Provide two examples.

Combination of standard and complex problems. Examples: trading bonds, setting marketing budgets for consumer products, performing capital acquisition analysis

Example of Low-Quality Information

Completeness. The customer's first name is missing. Another issue with completeness. The street address contains only a number and not a street name. Consistency. There may be a duplication of information since there is a slight difference between the two customers in the spelling of the last name. Similar street addresses and phone numbers make this likely. Accuracy. This may be inaccurate information because the customer's phone and fax numbers are the same. Some customers might have the same number for phone and fax, but the fact that the customer also has this number in the email address field is suspicious. Another issue with accuracy. There is inaccurate information because a phone number is located in the email address field. Another issue with completeness. The information is incomplete because there is not a valid area code for the phone and fax numbers.

What is the EDW component that supports all decision analysis by providing relevant summarized and detailed information originating from many different sources?

Comprehensive Database

Explain the importance of metadata

Metadata comprises information that increases our understanding of traditional data. It provides context to the reported data and enriching information that leads to the creation of knowledge.

What enables people to overcome their cognitive limits by quickly accessing and processing vast amounts of stored information?

Computerized Systems

What are features generated from a collection of documents by means of manual, statistical, rule based, or hybrid categorization methodology?

Concepts

What is the process called that predicts machinery failures before they occur through the use of sensory data?

Condition Based Maintenance

What are the metrics of measuring social network analysis?

Connections, distributions, and segmentation.

What represents the dimensional information coming from potentially disparate sources, but pertaining to the same subject?

Consistent data

What is the assumption that states that the response variables have the same variance in their error?

Constant Variance/Homoscedasticity

data mart

Contains a subset of data warehouse information

What are situations with unlimited numbers of possible events that follow density functions?

Continuous Distributions

What is a large and structured set of texts prepared for the purpose of conducting knowledge discovery?

Corpus

What gives an estimate on the degree of association between the variables?

Correlation
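
As an illustration of estimating the degree of association between two variables, here is a small sketch that computes Pearson's correlation coefficient, which is one common choice among several correlation measures. The function name `pearson_r` and the sample data are assumptions made for this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: advertising spend vs. units sold.
spend = [1, 2, 3, 4]
units = [2, 4, 6, 8]
print(pearson_r(spend, units))  # close to +1: perfect positive association
```

A value near +1 or -1 indicates a strong linear association, and a value near 0 indicates little or none.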

What is interested in low level relationships between two variables?

Correlation

When would you introduce structure to the corpus?

Create the term-document matrix
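
A rough sketch of what this step could look like: each document becomes a row of term counts. Tokenizing by whitespace and lowercasing is a simplifying assumption (real text mining usually adds stop-word removal and stemming), and `term_document_matrix` is a hypothetical helper name.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts terms[j] in docs[i]."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))  # vocabulary across all documents
    matrix = [[c[term] for term in terms] for c in counts]
    return terms, matrix


# Hypothetical two-document corpus.
docs = ["data mining finds patterns", "text mining mines text"]
terms, matrix = term_document_matrix(docs)
print(terms)
for row in matrix:
    print(row)
```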

Name two applications of ES in finance and describe their benefits.

Credit Analysis System: An ES can help a lender analyze a customer's credit card record and determine a proper credit limit. Rules in the knowledge base can also help assess risk and risk-management policies. Pension Fund Adviser: An ES that provides information on an employee's pension fund status. The system maintains an up-to-date knowledge base to give participants advice concerning the impact of regulation changes and conformance with new standards.

What is a multidimensional data structure that allows fast analysis of data and is defined as the capability of efficiently manipulating and analyzing data from multiple perspectives?

Cube

What creates a one on one relationships with customers by developing an intimate understanding of their needs and wants?

Customer Relationship Management

What are the most popular application areas for sentiment analysis? Why?

Customer relationship management (CRM) and customer experience management are popular "voice of the customer (VOC)" applications. Other application areas include "voice of the market (VOM)" and "voice of the employee (VOE)."

What are the perspectives for which an organization should develop objectives, measures, targets, and initiatives?

Customer, financial, internal business process, and learning & growth.

Data warehousing depends on:

DBMS, Extraction and conversion tools, internetworking techniques, front-end analysis tools, graphics

What is frequently a convenient first step to acquiring experience in constructing and managing a data warehouse while presenting business users with the benefits of better access to their data?

Data Mart (DM) Approach

What is a closed loop business improvement model and encompasses the steps of defining, measuring, analyzing, improving, and controlling a process?

DMAIC

How does a data warehouse differ from a database?

DW: A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized form. DB: A collection of files that are viewed as a single storage concept and are available to a wide range of users.

OLAP major task of ...

DWH

What is the collection of facts usually obtained as the result of experiments, observations, transactions, or experiences?

Data

What is the main ingredient for any BI, data science, and business analytics initiative?

Data

What is the ability to access and extract data from any data source?

Data Access

What is a term for professionals who were doing BI in the form of data compilation, cleaning, reporting, and perhaps some visualization?

Data Analyst

What is the integration of business view across multiple data stores?

Data Federation

What companies enable the generation and collection of data that may be used for developing analytical insights?

Data Generation Infrastructure Providers

What comprises three major processes that permit data to be accessed and made accessible to an array of ETL and analysis tools and the data warehousing environment: data access, data federation, and change capture?

Data Integration

What is a large storage location that can hold vast quantities of data in its native/raw format for future potential analytics consumption?

Data Lakes

What includes the organizations that provide hardware and software targeting the basic foundation for all management solutions?

Data Management Infrastructure Providers

What is usually smaller and focuses on a particular subject or department?

Data Mart

What architecture has the individual marts linked to each other via some kind of middleware?

Data Mart Bus Architecture

What is the process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge from large sets of data?

Data Mining

What is used to describe discovering or mining knowledge from large amounts of data?

Data Mining

What was used to describe the process through which previously unknown patterns in data were discovered?

Data Mining

What is the tedious and time demanding process that is necessary to convert the raw real world data into a well refined form for analytics algorithms?

Data Preprocessing

What is the term for a professional who utilizes predictive analysis, statistical analysis, and more advanced analytical tools and algorithms?

Data Scientists

What focuses on a specific industry sector and builds on existing relationships in that industry through niche platforms and services for data collection?

Data Service Providers

When would the data be transformed for better processing and aggregated?

Data Transformation

What contains a wide variety of data that presents a coherent picture of business conditions at a single point in time?

Data Warehouse

What is a discipline that results in applications that provide decision support capability, allows ready access to business information, and creates business insight?

Data Warehouse

What is a pool of data produced to support decision making and is a repository of current and historical data of potential interest to managers throughout the organization?

Data Warehouse

What is a subject oriented, integrated, time variant, nonvolatile collection of data in support of management's decision making process?

Data Warehouse

Who should possess solid business insight and be familiar with high performance software, hardware, and networking technologies?

Data Warehouse Administrator

What consists of an integrated set of servers, storage, operating systems, database management systems, and software specifically preinstalled and preoptimized for data warehousing?

Data Warehouse Appliances

What provides solutions for the mid-size to Big Data warehouse market, offering low-cost performance on data volumes in the terabyte to petabyte range?

Data Warehouse Appliances (low cost of ownership)

What are companies that include their own hardware to provide efficient data storage, retrieval, and processing?

Data Warehouse Providers, e.g., IBM, Oracle, and Teradata

What describes where the company wants to go, why it wants to go there, and what it will do when it gets there?

Data Warehousing Strategy

What means that data are easily and readily obtainable?

Data accessibility. Answers the question "Can we easily get the data when we need to?"

When would the data be cleaned and the values be identified and dealt with?

Data cleaning

What means that the data are accurately collected and combined/merged?

Data consistency

When would relevant data be collected from identified sources, necessary records and variables be selected, and the records coming from multiple data sources be integrated and merged?

Data consolidation

What means that data are correct and are a good match for the analytics problem?

Data content accuracy. Answers the question "Do we have the right data for the job?"

What means that the data should be up to date for a given analytics model and is recorded at or near the time of the event or observation, so that time-delay-related misrepresentation of the data is prevented?

Data currency/data timeliness

What requires that the variables and data values be defined at the lowest level of detail for the intended use of the data?

Data granularity

Describe the data warehousing process

Data imported from various external and internal sources are cleansed and organized in a manner consistent with the organization's needs. After the data are populated in the DW, data marts can be loaded for specific areas.

Multidimensional Data

Data is represented in cubes: facts (measures) + dimensions

Extract, transform, load (ETL)

Data management - The processes to extract, transform, cleanse, reengineer, and load source data into the data warehouse, and move the data from one location to another

What are the three main types of data warehouses?

Data marts, operational data store (ODS), and enterprise data warehouses (EDW)

What are some major data mining methods and algorithms?

Data mining tasks can be classified into three main categories: prediction, association, and clustering. Based on the way in which the patterns are extracted from the historical data, the learning algorithms of data mining methods can be classified as either supervised or unsupervised. With supervised learning algorithms, the training data includes both the descriptive attributes (i.e., independent variables or decision variables) as well as the class attribute (i.e., output variable or result variable). In contrast, with unsupervised learning the training data includes only the descriptive attributes.

What are enablers of predictive analytics?

Data mining, text mining, web/media mining, and forecasting

Where is the most time spent in analytics tasks?

Data preprocessing

What is the term that means that the variables in the data set are all relevant to the study being conducted?

Data relevancy

What means that all required data elements are included in the data set to build a predictive or prescriptive analytics model?

Data richness. Available variables portray enough dimensions of the underlying subject matter for an accurate and worthy analytics study.

What means that data is secured to only allow those people who have the authority and the need to access it and to prevent anyone else from reaching it?

Data security and data privacy

Meta data management

Data services - data that describes the meaning and structure of business data, as well as how it is created, accessed, and used

Application programming interface

Data source - Mechanism to populate source systems with raw data and to pull operational reports

Enterprise application integration/staging area

Data source - Provides an integrated common data interface and interchange mechanism for real-time and source systems.

Operational transaction systems

Data source - Systems that run day-to-day business operations and provide source data for the data warehouse and DSS environment

What refers to the originality and appropriateness of the storage medium where the data is obtained?

Data source reliability. Answers the question "Do we have the right confidence and belief in this data source?"

What are the major components of the data warehousing process?

Data sources, data extraction/transformation, data loading, comprehensive database, metadata, and middleware tools

The four major components of the data warehousing process

Data sources, data extraction (using custom-written or commercial software called ETL), data loading (data loaded to a staging area), comprehensive database, and metadata (used by IT personnel and users).

What is Big Data analytics?

Data that cannot be stored in a single storage unit. Data that is arriving in many different forms (structured, unstructured, or in a stream).

What is the term used to describe a match/mismatch between the actual and expected data values of a given variable?

Data validity

What are the four major components of business intelligence?

Data warehouse, business analytics, business process management, and user interface.

What are the parts that comprise the data warehousing architectures?

Data warehouse, data acquisition software (application server), client front end software (database server)

Data-mining tools

Data-mining tools use a variety of techniques to find patterns and relationships in large volumes of information that predict future behavior and guide decision making. help users uncover business intelligence in their data

The successful administration and management of a data warehouse entails skills and proficiency that go beyond those required of what traditional role?

Database Administrator

What is the component where the most work must be done to implement a data model and optimize it for query performance?

Database Management Systems (DBMS)

What area are the prediction models that differentiate deceptive statements from truthful ones classified as?

Deception Detection

What is the evolution of decision support, business intelligence, and analytics?

Decision Support Systems --> Enterprise/Executive Information Systems --> Business Intelligence --> Analytics --> Big Data

What conveniently organizes information and knowledge in a systematic, tabular manner to prepare it for analysis?

Decision Tables

What divides a training set until each division consists entirely or primarily of examples for one class?

Decision Tree

What shows the relationships of the problem graphically and can handle complex situations in a compact form?

Decision Tree

What classifies data into a finite number of classes based on the values of the input variables?

Decision Trees

What includes many input variables / attributes that may have an impact on the classification of different patterns?

Decision Trees

What are hierarchies of if-then statements and thus significantly faster than neural networks?

Decision Trees - classify data into a finite number of classes based on the values of the input variables

What describe alternative courses of action?

Decision Variables

What are some terms that are content free expressions and there is no universally accepted definition?

Decision support system, management information system

What are examples of an enterprise data warehouse?

Decision support systems, customer relationship management, supply chain management, revenue management, etc.

List and briefly define at least two classification techniques.

• Decision tree analysis. Decision tree analysis (a machine-learning technique) is arguably the most popular classification technique in the data mining arena.
• Statistical analysis. Statistical classification techniques include logistic regression and discriminant analysis, both of which make the assumptions that the relationships between the input and output variables are linear in nature, the data is normally distributed, and the variables are not correlated and are independent of each other.
• Case-based reasoning. This approach uses historical cases to recognize commonalities in order to assign a new case into the most probable category.
• Bayesian classifiers. This approach uses probability theory to build classification models based on the past occurrences that are capable of placing a new instance into a most probable class (or category).
• Genetic algorithms. The use of the analogy of natural evolution to build directed search-based mechanisms to classify data samples.
• Rough sets. This method takes into account the partial membership of class labels to predefined categories in building models (collection of rules) for classification problems.

Define strategic planning. Provide two examples.

Defining long-range goals and policies for resource allocation.

Describe the major steps in developing rule-based ES.

Defining the nature and scope of the problem: Identify the nature of the problem and define its scope. Some domains may not be appropriate for the application of an ES. Identifying proper experts: Find proper experts who have knowledge and are willing to assist in developing the knowledge base. Selecting the building tools: Choose a proper tool for implementing the system. Coding the system: The team can focus on coding the knowledge based on the tool's syntactic requirements. Evaluating the system: Evaluation includes both verification and validation. Verification ensures that the resulting knowledge base contains knowledge exactly the same as that acquired from the expert.

What are storage solution providers?

Dell and NetApp

What is the proportion of direct ties in a network relative to the total number possible?

Density
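
A small sketch of this ratio for an undirected network, assuming that each tie is counted once regardless of direction and that the total number possible is n(n-1)/2 for n actors. The function name `density` and the sample network are hypothetical.

```python
def density(nodes, edges):
    """Share of possible undirected ties that are actually present."""
    n = len(nodes)
    possible = n * (n - 1) // 2  # every unordered pair of distinct nodes
    if possible == 0:
        return 0.0
    unique_ties = {frozenset(edge) for edge in edges}  # (a, b) equals (b, a)
    return len(unique_ties) / possible


# Hypothetical network: four actors connected in a chain.
actors = ["a", "b", "c", "d"]
ties = [("a", "b"), ("b", "c"), ("c", "d")]
print(density(actors, ties))  # 3 of 6 possible ties
```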

What is a subset that is created directly from the data warehouse and uses a consistent data model to provide quality data?

Dependent Data Mart

What ensures that the end user is viewing the same version of the data that is accessed by all other data warehouse users?

Dependent Data Mart --high cost limits this to large companies

What refers to knowing what is happening in the organization and understanding some underlying trends and causes of such occurrences?

Descriptive / Reporting Analytics

What answers the questions "What happened?" and "What is happening?"

Descriptive Analytics

What is the entry level in the business analytics taxonomy?

Descriptive Analytics

What helps us convert our numbers and symbols into meaningful representations for anyone to understand and use?

Descriptive Statistics

What is used to describe the sample data on hand and summarize it in a meaningful way so that easily understandable patterns emerge?

Descriptive Statistics

What are the levels of decision/normative analytics?

Descriptive, predictive, and prescriptive

What is predictive analytics? How can organizations employ predictive analytics?

Determine what is likely to happen in the future based on statistical techniques. Organizations employ predictive analytics through data mining, using classification algorithms such as decision tree models and neural networks, in addition to association mining techniques to estimate relationships between different purchasing behaviors.

A data warehouse enables business users, typically managers, to be more effective in many ways, including:

Developing customer profiles. Identifying new-product opportunities. Improving business operations. Identifying financial issues. Analyzing trends. Understanding competitors. Understanding product performance

What cycle creates a huge database of documents / pages organized and indexed based on their content and information value?

Development Cycle

What are the two main cycles in search engines? Describe the steps in each cycle.

Development Cycle: Web Crawler; Document Indexer (Step 1: Preprocessing the documents; Step 2: Parsing the documents; Step 3: Creating the term-by-document matrix). Response Cycle: Query Analyzer; Document Matcher/Ranker.

What are the two main cycles of a search engine?

Development cycle and response cycle

Real-Time Location Intelligence

Devices that are constantly sending out location information, reality mining

What is the slice on more than two dimensions of a data cube?

Dice

What has one to many relationships with rows in the central fact table?

Dimension Table

What contain classification and aggregation information about central fact rows and the attributes that describe the data contained within the fact table?

Dimension Tables
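
To illustrate how a central fact table relates to a dimension table, here is a minimal star-schema sketch using Python's built-in `sqlite3` module. The table names, columns, and rows (`dim_product`, `fact_sales`) are invented for this example and do not come from any particular warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: descriptive attributes used for classification/aggregation.
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT,
    category   TEXT
);
-- Fact table: measures plus a foreign key into the dimension.
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Hardware'), (2, 'Mouse', 'Hardware'), (3, 'Antivirus', 'Software');
INSERT INTO fact_sales VALUES
    (1, 1, 900.0), (2, 2, 25.0), (3, 2, 30.0), (4, 3, 45.0);
""")

# One dimension row (e.g. 'Mouse') matches many fact rows: a one-to-many link.
rows = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON f.product_id = d.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(rows)
```

The query aggregates the measure in the fact table by an attribute that lives only in the dimension table, which is the typical access pattern a star schema is meant to support.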

What is a retrieval based system that supports high volume query access?

Dimensional Modeling -- star and snowflake schema

When the number of variables can be rather large, the analyst must reduce the number down to a manageable size. What is the process called?

Dimensional Reduction / Variable Selection

Data mining landscape

Discovery vs. verification. Models: prediction (regression, classification) vs. description (segmentation, association) (slide 12)

What involves a situation with a limited number of events that can take on only a finite number of values?

Discrete Distributions

What refers to building a model of a system where the interaction between different entities is studied?

Discrete Event Simulation

What is used to estimate or describe the degree of variation in a given variable of interest?

Dispersion -- used for judging central tendency.
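
A short sketch of three common dispersion measures (range, variance, and standard deviation) computed with Python's standard `statistics` module. Treating the data as a full population (`pvariance`, `pstdev`) rather than a sample is an assumption of this example, and `dispersion_summary` is a hypothetical helper name.

```python
import statistics


def dispersion_summary(values):
    """Range, population variance, and population standard deviation."""
    return {
        "range": max(values) - min(values),
        "variance": statistics.pvariance(values),
        "std_dev": statistics.pstdev(values),
    }


# Hypothetical observations of one variable of interest.
values = [2, 4, 4, 4, 5, 5, 7, 9]
summary = dispersion_summary(values)
print(summary)
```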

4 contributions of BI & their improvements

Dissemination of real-time information in a user-friendly fashion; creation of new knowledge based on the past; responsive and anticipative decisions based more closely on all the latest information; improved planning for the future through data and information about the past --> improvement in operational performance, customer service, and in identifying new opportunities

What is the minimum number of ties required to connect two particular actors?

Distance
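
One way to find this minimum number of ties is a breadth-first search, sketched below for an undirected network stored as adjacency sets. The function name `distance` and the sample network are assumptions made for illustration.

```python
from collections import deque


def distance(graph, start, goal):
    """Minimum number of ties between two actors, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return hops + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None  # no path connects the two actors


# Hypothetical network: a-b-c-d chain, a also tied to e, f is isolated.
network = {
    "a": {"b", "e"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": {"a"},
    "f": set(),
}
print(distance(network, "a", "d"))
```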

What is the frequency of data points counted and plotted over a small number of class labels or numerical ranges?

Distribution

four enterprise architecture models

Diversification model (low standardization, low integration): decentralized; different markets with different products and services; benefit from local autonomy.
Coordination model (low standardization, high integration): sharing of customers, products, suppliers, and partners; business unit leaders have autonomy.
Replication model (high standardization, low integration): independent units following highly standardized processes (e.g., McDonald's); units do not depend on each other.
Unification model (high standardization, high integration): integrated supply chains that share customer and supplier data (e.g., Dow Chemical).

When would all items start in one cluster and be broken apart?

Divisive

Describe two differences between document databases and relational databases.

Document databases do not require a fixed, predefined schema. Documents can have embedded documents and lists of multiple values within a document.

What happens when the user navigates among levels of data ranging from the most summarized (up) to the most detailed (down)?

Drill Up / Down
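
The cube operations on these cards can be sketched on a tiny in-memory cube. Here the cube is a plain dictionary keyed by (region, product, quarter); the data, the `DIMS` tuple, and the helper names `dice` and `roll_up` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

DIMS = ("region", "product", "quarter")

# Hypothetical cube cells: (region, product, quarter) -> sales measure.
cube = {
    ("East", "Laptop", "Q1"): 10,
    ("East", "Mouse", "Q1"): 4,
    ("East", "Laptop", "Q2"): 7,
    ("West", "Laptop", "Q1"): 5,
    ("West", "Mouse", "Q2"): 3,
}


def dice(cells, **allowed):
    """Keep cells whose value on every named dimension is in the allowed set."""
    def keep(key):
        return all(key[DIMS.index(dim)] in values for dim, values in allowed.items())
    return {key: value for key, value in cells.items() if keep(key)}


def roll_up(cells, *keep_dims):
    """Drill up: sum the measure over every dimension not listed in keep_dims."""
    positions = [DIMS.index(dim) for dim in keep_dims]
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[i] for i in positions)] += value
    return dict(totals)


diced = dice(cube, region={"East"}, product={"Laptop", "Mouse"}, quarter={"Q1"})
by_region = roll_up(cube, "region")
print(diced)
print(by_region)
```

Drilling down is the reverse direction: calling `roll_up` with more dimensions kept, for example `roll_up(cube, "region", "product")`, returns the more detailed view.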

What are the leading indicators / value drivers that measure activities that have a significant impact on outcome KPIs?

Driver KPI

Inmon model

EDW approach (top down)

What is an integral component of any data-centric project and consists of extraction, transformation, and loading of integrated and cleansed data?

ETL

What are the two metrics to evaluate search engines?

Effectiveness and Efficiency

How many players are involved in the analytics environment?

Eleven clusters: inner and outer petals and the seed of the flower

What is the process of intelligently combining the information created and provided by two or more information sources?

Ensemble Models. Improve accuracy and robustness of information outcomes while reducing uncertainty and bias associated with individual models.

What provides a vehicle for pushing data from source systems into the data warehouse and involves integrating application functionality and is focused on sharing functionality across systems?

Enterprise Application Integration (EAI)

EDW stands for

Enterprise Data Warehouse

What is a large scale data warehouse that is used across the enterprise for decision support and provides integration of data from many sources into a standard format?

Enterprise Data Warehouse

What is a mechanism for pulling data from source systems to satisfy a request for information and uses predefined metadata to populate views that make integrated data appear relational to end users?

Enterprise Information Integration (EII)

What is an evolving tool space that promises real time data integration from a variety of sources, such as relational databases, Web services, and multi-dimensional databases?

Enterprise Information Integration (EII)

What system collects all the data from every corner of the enterprise and integrates it into a consistent schema so that every part of the organization has access to the single version when and where needed?

Enterprise Resource Planning (ERP) systems

Types of integration technologies that enable data and metadata integration:

Enterprise application integration (EAI, a vehicle for pushing data from source systems into the data warehouse), Enterprise information integration (EII, promotes real-time data integration).

What measures the extent of uncertainty or randomness in a data set and is used to build subtrees so that the entropy of each final subset is 0?

Entropy
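
As a worked illustration, Shannon entropy of a set of class labels can be computed as the sum of p * log2(1/p) over the class proportions p. A subset containing only one class has entropy 0, which is the stopping condition the card refers to. The function name `entropy` and the labels are assumptions for this sketch.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a data subset."""
    total = len(labels)
    proportions = [count / total for count in Counter(labels).values()]
    return sum(p * log2(1 / p) for p in proportions)


print(entropy(["yes", "yes", "no", "no"]))  # evenly mixed: maximum uncertainty
print(entropy(["yes", "yes", "yes"]))       # pure subset: no uncertainty
```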

What is the monitoring, scanning, and interpretation of collected information?

Environmental Scanning and Analysis

When would you collect and organize the domain specific unstructured data?

Establish the corpus

ETL stands for

Extract, transform, and load

What systems were designed as graphical dashboards and scorecards so that they could serve as visually appealing displays while focusing on the most important factors for decision makers to keep track of the key performance indicators?

Executive Information Systems

What issues affect whether an organization will purchase tools or build the transformation process itself?

Expensive, long learning curve, and it's difficult to measure how the IT organization is doing until it has learned to use the tools.

What is an ES?

An expert system (ES) is a computer-based information system that uses expert knowledge to attain high-level decision performance in a narrowly defined problem domain.

What is an independent variable also known as?

Explanatory or input

How does sentiment appear in text?

Explicit - subjective sentence directly expresses an opinion AND Implicit - The text implies an opinion

When would you discover novel patterns from the T-D matrix?

Extract knowledge

What involves reading data from one or more databases?

Extraction

ETL

Extraction (select data from OLTP), Transformation (validate, clean, integrate), Load (move data into warehouse)
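
The three ETL stages can be sketched as three small functions. This is only an illustrative toy: the source rows stand in for an OLTP system, the in-memory SQLite database stands in for the warehouse, and the names `extract`, `transform`, `load`, and the `sales` table are assumptions of this example.

```python
import sqlite3


def extract(source_rows):
    """Extraction: read the selected records from the source system."""
    return list(source_rows)


def transform(rows):
    """Transformation: validate, clean, and standardize each record."""
    cleaned = []
    for row in rows:
        name = (row.get("customer") or "").strip()
        amount = row.get("amount")
        if not name or amount is None:
            continue  # validation: drop incomplete records
        cleaned.append((name.title(), float(amount)))
    return cleaned


def load(conn, rows):
    """Load: move the cleaned records into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


# Hypothetical operational data with typical quality problems.
source = [
    {"customer": " alice ", "amount": "19.5"},
    {"customer": "", "amount": "5"},
    {"customer": "bob", "amount": None},
    {"customer": "carol", "amount": 7},
]
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source)))
loaded = warehouse.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```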

What does ETL stand for?

Extraction, Transformation, and Load

What contains the descriptive attributes needed to perform decision analysis and query reporting?

Fact Table

What is the outcome when the predicted class is negative and the observed class is positive?

False Negative

What is the outcome when the predicted class is positive and the observed class is negative?

False Positive
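
The two cards above can be summarized in a small sketch that names the confusion-matrix cell for one binary prediction. The function name is a hypothetical helper chosen for illustration.

```python
def outcome(predicted, observed):
    """Name the confusion-matrix cell for one binary prediction.

    Both arguments are booleans: True means the positive class.
    """
    if predicted and observed:
        return "true positive"
    if predicted and not observed:
        return "false positive"
    if not predicted and observed:
        return "false negative"
    return "true negative"
```

For example, `outcome(predicted=False, observed=True)` gives "false negative".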

What skills should a DWA (Data Warehouse Administrator) possess?

Familiar with high-performance software, hardware, and networking technologies. Possess solid business insight, decision-making processes and communication skills.

Hub-and-spoke architecture

Famous data warehousing architecture today. Focus on building a scalable and maintainable infrastructure that includes a centralized data warehouse and several dependent data marts. Allows for easy customization of user interfaces and reports. Lacks a holistic enterprise view, and may lead to data redundancy and data latency.

What uses all possible means to integrate analytical resources from multiple sources to meet changing needs or business conditions?

Federated Data Warehouse

Where has the most common use of data mining been used on the commercial side?

Finance, retail, and healthcare sectors.

ad-hoc reports

From that point on, the actual reports are created by business end-users. Ad-hoc is Latin for "as the occasion requires." This means that with this BI model, users can use their reporting and analysis solution to answer their business questions "as the occasion requires," without having to request queries from IT.

What are unstructured decisions? Provide two examples.

Fuzzy, complex problems for which there are no cut-and-dried solution methods. Examples: writing a corporate mission statement, selecting a location for a company picnic.

Location-Based Analytics

Geospatial Analytics, Geocoding, Enables aggregate view of large geographic area, Integrate "where" into customer view

Consumer-oriented location-based analytics

Geospatial static approach (GPS navigation, data analysis), location-based dynamic approach (historic and current location demand analysis)

Organization-oriented location-based analytics

Geospatial static approach (examining geographic site locations), location-based dynamic approach (live location feeds; real-time marketing promotions)

What has been used in economics to measure the diversity of a population and can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute or variable?

Gini Index
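
A minimal sketch of the Gini index as a purity measure, assuming the common formulation of 1 minus the sum of squared class proportions. The function name and labels are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure subset, higher for mixed subsets."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

mixed = gini(["a", "a", "b", "b"])  # two classes, evenly split
pure = gini(["a", "a", "a"])        # only one class present
```

A branch whose subsets score closer to 0 separates the classes more cleanly.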

Classification

Given: a collection of records, each with a set of attributes, one of which is the class. Task: find a model for the class attribute. Goal: previously unknown records should be classified correctly.

What are factors that are forcing business managers to rethink how they integrate and manage their businesses?

Global competitive pressures, demand for ROI, management, investor inquiry, and government regulations

What calculates the values of the inputs necessary to achieve a desired level of an output/goal?

Goal Seeking
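
Goal seeking can be sketched as a numeric search for the input that produces a desired output. The version below uses bisection and assumes the model is continuous and that the target lies between the outputs at the two starting bounds. The profit model is a made-up example.

```python
def goal_seek(f, target, lo, hi, tol=1e-9):
    """Find x in [lo, hi] such that f(x) is approximately equal to target."""
    g_lo = f(lo) - target
    if g_lo * (f(hi) - target) > 0:
        raise ValueError("target is not bracketed by lo and hi")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        g_mid = f(mid) - target
        if g_lo * g_mid <= 0:
            hi = mid               # the solution lies in the lower half
        else:
            lo, g_lo = mid, g_mid  # the solution lies in the upper half
    return (lo + hi) / 2

# Hypothetical model: profit = 12 * units - 3000. How many units give 6000?
units = goal_seek(lambda u: 12 * u - 3000, target=6000, lo=0, hi=10_000)
```

Here `units` comes out at approximately 750, the input needed to reach the profit goal.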

What is speech analytics? How does it relate to sentiment analysis?

Growing field of science that allows users to analyze and extract information from both live and recorded conversations. Sentiment analysis, as it relates to speech analytics, focuses on assessing the emotional state expressed in a conversation and on measuring the presence and strength of positive and negative feelings that are exhibited by the participants.

Cluster Analysis

Grouping of objects (homogeneous within a group, heterogeneous between groups)

The technologies that come with Big Data are

Hadoop, MapReduce, NoSQL, and Hive

How can computers provide support for making structured decisions?

Help with operational and managerial controls. Therefore, it is possible to use a scientific approach for automating portions of managerial decision making.

Decision Support Data

Historic data that is queried intensively and stored in fewer, less normalized tables. Has large data volumes.

What combines the outcomes of two or more of the same type of models such as decision trees?

Homogeneous Ensemble Model

What is the extent to which actors form ties with similar vs. dissimilar others?

Homophily

What are the sub categories of connections in SNA?

Homophily, multiplexity, mutuality, network closure, propinquity.

What is it called when another firm develops and maintains the data warehouse?

Hosted Data Warehouse

What is one or more Web pages that provide a collection of links to authoritative pages and implicitly conferring the authorities on a narrow field?

Hub

What is the most famous data warehousing architecture today because it's focused on building a scalable and maintainable infrastructure?

Hub & Spoke Architecture - allows for easy customization of user interfaces and reports, but can have data redundancy and latency.

Human-generated data

Human-generated data is data that humans, in interaction with computers, generate. Human-generated structured data includes input data, click-stream data, or gaming data.

HOLAP

Hybrid OLAP - one part MOLAP and one part ROLAP

What is the most popular publicly known and referenced algorithm used to calculate hubs and authorities?

Hyperlink Induced Topic Search
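
A simplified sketch of the hub and authority calculation behind HITS, under the usual formulation: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of pages it links to. The link graph and page names are invented for illustration.

```python
def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

hub, auth = hits({"portal": ["a", "b"], "blog": ["a"]})
```

In this toy graph, "portal" ends up as the stronger hub and "a" as the stronger authority.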

What are the major hardware players that provide the infrastructure for database computing?

IBM, Dell, HP, Oracle.

What are tools used for predictive analytics?

IBM, Oracle, SAP, Teradata, Informatica

What companies provide indigenous hardware and software platforms?

IBM, Oracle, and Teradata

What is the level of understanding and insight provided by the model?

Interpretability

What is the OS polarity?

If the objectivity value is close to 1, then there is no opinion to mine.

How can analytics affect job satisfaction?

If the routine and mundane work can be done using an analytic system, then it should free up the managers and knowledge workers to do more challenging tasks. It was found that employees using ADS systems, especially those who are empowered by the systems, were more satisfied with their jobs.

Describe monotonic write consistency. Why is it so important?

If you were to issue several update commands, they would be executed in the order you issued them. This ensures that the results of a set of commands are predictable. Repeating the same commands with the same starting data will yield the same results.

How does clustering improve search effectiveness for text mining?

Improved search recall and search precision.

What is the integration of the algorithmic extent of data analytics into data warehousing?

In-Database Processing / In-Database Analytics *used for high-throughput, real-time application environments, including fraud detection, credit scoring, risk management, etc.*

What keeps the data permanently in the main memory?

In Memory Database

List and briefly discuss some of the text mining applications in marketing.

Increase cross-selling and up-selling by analyzing the unstructured data generated by call centers. Invaluable for customer relationship management. Analyze rich sets of unstructured text data, combined with the relevant structured data extracted from organizational databases, to predict customer perceptions and subsequent purchasing behavior.

What assumption states that the errors of the response variable are uncorrelated with each other?

Independence (weaker than actual statistical independence)

What is a small warehouse designed for strategic business unit or a department, but its source is not an enterprise data warehouse?

Independent Data Mart --lower cost & lower scale

What is the simplest and least costly architecture alternative?

Independent Data Marts *Developed to operate independent of each other and serve the needs of individual organizational units*

What is used to draw inferences or conclusions about the characteristics of the population?

Inferential Statistics

What are graphical models of a mathematical model that can facilitate the identification process?

Influence Diagram

What is the identification of key phrases and relationships within text by looking for predefined objects and sequences in text by way of pattern matching?

Information Extraction

What is the splitting mechanism used in ID3 which is perhaps the most widely known decision trees algorithm and was developed by Ross Quinlan?

Information Gain

What is used to identify and stop malicious attacks on critical information infrastructure?

Information Warfare

What are the measures to assess the success of an architecture?

Information quality, system quality, individual impacts, and organizational impacts.

How do document databases differ from key-value databases?

Instead of storing each attribute of an entity with a separate key, document databases store multiple attributes in a single document. Users can query and retrieve documents by filtering on key-value pairs within a document.
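
The contrast can be sketched with plain Python dictionaries standing in for the two kinds of store. The key naming scheme, the records, and the `find` helper are hypothetical and do not reflect any particular database's API.

```python
# Key-value style: one key per attribute of the entity.
kv_store = {
    "customer:42:name": "Ada",
    "customer:42:city": "Berlin",
}

# Document style: one document holds all attributes of the entity.
doc_store = [
    {"id": 42, "name": "Ada", "city": "Berlin"},
    {"id": 43, "name": "Grace", "city": "Boston"},
]

def find(documents, **criteria):
    """Return the documents whose fields match every key-value pair given."""
    return [
        d for d in documents
        if all(d.get(k) == v for k, v in criteria.items())
    ]

berliners = find(doc_store, city="Berlin")
```

In the key-value version, the application has to know each key in advance. In the document version, a query can filter on any field inside the document.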

What does it mean to place data from different sources into a consistent format?

Integrated

Stages of rational decision making

Intelligence -> Design -> Choice

What are the common characteristics of data scientists?

Intense curiosity, creativity, communication/interpersonal, domain expertise, problem definition, managerial, technical skills (data manipulation, programming/hacking/scripting, internet and social media/networking)

Location Intelligence (LI)

Interactive maps that further drill down to details about any location

What reflects intermediate outcomes in mathematical models?

Intermediate Result Variables

Describe the three managerial roles, and list some of the specific activities in each.

Interpersonal: managers interact with people inside and outside their work units (figurehead, leader, liaison). Informational: managers receive and communicate information with other people inside and outside the organization (monitor, disseminator, spokesperson). Decisional: managers use information to make decisions to solve problems or take advantage of opportunities (entrepreneur, disturbance handler, resource allocator, negotiator).

What are variables that can be measured on interval scales?

Interval Data, e.g., temperature

Define BI.

An umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies. Means different things to different people.

When is the accuracy calculated by leaving one sample out at each iteration of the estimation process?

Jackknifing

When is the complete data set randomly split into k mutually exclusive subsets of approximately equal size?

K-Fold Cross Validation / Rotation Estimation
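
A minimal sketch of the k-fold split, assuming records are addressed by index. The function name and the fixed seed are illustrative choices. Setting k equal to the number of records gives the leave-one-out case.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of roughly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

splits = list(k_fold_indices(n=10, k=5))
```

Each record appears in exactly one test fold, so every data point is used for testing once and for training k - 1 times.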

What represents a strategic objective and measures performance against a goal?

Key Performance Indicator (KPI)

What is descriptive analytics? What tools are employed in descriptive analytics?

Knowing what is happening in the organization and understanding some underlying trends and causes of such occurrences. Tools include reports, queries, alerts, and trends using various reporting tools and techniques. Major player is visualization.

KDD

Knowledge Discovery from Databases

What is a process of using data mining methods to find useful information and patterns in the data which involved using algorithms to identify patterns in data?

Knowledge Discovery in Databases (KDD)

When would an expert's knowledge about the categories be encoded into the system either declaratively or in the form of procedural classification rules?

Knowledge Engineering Approach

KDD

Knowledge discovery in databases - data mining as a front-end technology - data mining as one step within the KDD process

What are the two main approaches to text classification?

Knowledge engineering and machine learning.

What are other names of data mining?

Knowledge extraction, pattern analysis, data archaeology, information harvesting, pattern searching, and data dredging

What measures the degree to which a distribution is more or less peaked than a normal distribution?

Kurtosis

What represents the final class choice for a pattern?

Leaf Node

What is used when every data point is used for testing once, with as many models developed as there are data points?

Leave One Out *time-consuming, but best for small data sets*

What is the catalog of words, their synonyms, and their meanings for a given language, used to create a variety of special-purpose lexicons for sentiment analysis projects?

Lexicon

What is the best known technique in a family of optimization tools called mathematical programming?

Linear Programming *all relationships among variables are linear

What assumption states that the relationship between the response variable and the explanatory variable are linear?

Linearity

What are assumptions associated in linear regression?

Linearity, independence, normality, constant variance, and multicollinearity

When is the linkage among many objects of interest discovered automatically?

Link Analysis

What involves putting the data into the data warehouse?

Load

What is a very popular, statistically sound, probability-based classification algorithm that employs supervised learning?

Logistic Regression

What is used to classify a categorical variable?

Logistic Regression

Types of analytical processing

MOLAP (multidimensional online analytical processing) indexes directly into a multidimensional database and is an alternative to ROLAP. ROLAP (relational online analytical processing) is an alternative to MOLAP. HOLAP (hybrid online analytical processing) is a combination of ROLAP and MOLAP. SQL (pronounced "ess-que-el") stands for Structured Query Language and is used to communicate with a database. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems.

options for olap

MOLAP, ROLAP

When would a general inductive process build a classifier by learning from a set of preclassified examples?

Machine Learning Approach

The sources of structured data include:

Machine-generated data & Human-generated data (structured)

The sources of unstructured data include:

Machine-generated unstructured data & Human-generated unstructured data

List and describe the major components of BI.

Major objective is to enable interactive access (sometimes in real time) to data, to enable manipulation of data, and to give business managers and analysts the ability to conduct appropriate analysis. By analyzing historical and current data, situations, and performances, decision makers get valuable insights that enable them to make more informed and better decisions. Process is based on the transformation of data to information, then to decisions, and finally to actions.

Scope BI

Management Support Systems (Focus: Planning, organization, Control)

Business Advantages of a Relational Database 5) Increased Information Security

Managers must protect information, like any asset, from unauthorized users or misuse Security risks are increasing as more and more databases and DBMS systems are moving to data centers run in the cloud

Distance Calculation

Manhattan: |Bx - Ax| + |By - Ay|. Euclidean: square root of the sum of the squared differences along each axis.
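
The two distance measures on this card can be sketched as follows for points with any number of coordinates. The function names and sample points are illustrative.

```python
import math

def manhattan(a, b):
    """Sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of the squared differences along each axis."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

A, B = (0, 0), (3, 4)
d_manhattan = manhattan(A, B)  # 3 + 4 = 7
d_euclidean = euclidean(A, B)  # sqrt(9 + 16) = 5.0
```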

When can text mining be used to increase cross selling and up selling by analyzing the unstructured data generated by call centers?

Marketing Applications - invaluable for CRM!

What are subcategories of association?

Market basket, link analysis, and sequence analysis.

What is a family of tools designed to help solve managerial problems in which the decision maker must allocate scarce resources among competing activities to optimize a measurable goal?

Mathematical Programming

What is a simpler way to calculate the overall deviation from the mean and is calculated by measuring the absolute values of the differences between each data point and the mean?

Mean Absolute Deviation
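
As a quick sketch of the calculation described above (function name and sample values are illustrative):

```python
def mean_absolute_deviation(values):
    """Average absolute distance of each data point from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

# Mean is 5, absolute deviations are 3, 1, 1, 3, so the result is 2.0.
mad = mean_absolute_deviation([2, 4, 6, 8])
```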

What is a single numerical value that aims to describe a set of data by simply identifying or estimating the central position within the data?

Measure of Central Tendency

What is the measure of center value in a given data set?

Median

What is the most standardized and orderly making it a more minable information source?

Medical Literature

What can drive changes in business intelligence?

Mergers & acquisitions, regulatory requirements, and introduction of new channels.

What describes the structure and some meaning about data contributing to their effective or ineffective use?

Metadata

List and describe the three major categories of business reports.

Metric management reports: Business performance is managed through outcome-oriented metrics. Enterprise-wide agreed targets are tracked over a period of time. Dashboard-type reports: Present a range of different performance indicators on one page. Vendors would provide a set of predefined reports with static elements and fixed structure, but also allow for customization. Balanced scorecard-type reports: Present an integrated view of success in an organization. In addition to financial performance, they also include customer, business process, and learning and growth perspectives.

What is a worldwide source for access to Microsoft's SQL Server suite for academic purposes (teaching and research)?

Microsoft Enterprise Consortium

Where are data and models stored in the same relational database environment, making model management a considerably easier task?

Microsoft SQL Server

Who provides easy to use tools for reporting or descriptive analytics?

Middleware Providers i.e. Oracle, SAP, and IBM

Who provides tools that enable reporting or descriptive analytics?

Middleware industry players i.e. Microsoft SQL, Tableau, SAS

What enables access to the data warehouse?

Middleware tools

What is the observation that occurs most frequently?

Mode *most useful for data with a small number of unique values

What is the most common two step methodology of classification type?

Model development/training and model testing/deployment.

Kimball model

Model with the data mart approach (bottom up)

Inmon model

Model, also known as the EDW approach, emphasizes top-down development, employing established database development methodologies and tools, such as entity-relationship diagrams (ERD), and an adjustment of the spiral development approach.

Kimball model

Model, also known as the data mart approach, is a "plan big, build small" approach. A data mart is a subject-oriented or department-oriented data warehouse. It is a scaled-down version of a data warehouse that focuses on the requests of a specific department, such as marketing or sales.

What is the most common simulation method for business decisions that begins with building a model of the decision problem without having to consider the uncertainty of any variables?

Monte Carlo Simulation

What is a branch of the field of linguistics and a part of the NLP that studies the internal structure of words?

Morphology

What is the assumption that states that the explanatory variables are not correlated?

Multicollinearity

What involves data analysis in several dimensions and are generally shown in a spreadsheet format?

Multidimensional Analysis

HOLAP Model

Multidimensional Data Types, Relational Data Types, Tools -> OLAP API/SQL

What is the number of content forms contained in a tie?

Multiplexity

What are the systems that convert information from computer databases into readable human language?

Natural Language Generation

What is a subfield of artificial intelligence and computational linguistics that studies the problem of understanding natural human language, with the view of converting depictions of human language into more formal representations that are easier for computer programs to manipulate?

Natural Language Processing

What is the measure of the completeness of relational triads?

Network Closure

What is the term for describing analytics that relate to groups of people, social networks, supply chain networks, etc?

Network Science

What involves the development of mathematical structures that have the capability to learn from past experiences presented in the form of well structured data sets?

Neural Networks -Classification algorithm

What are the necessary conditions for a good expert?

No ES can be designed without the strong support of knowledgeable and supportive experts. A proper expert should have a thorough understanding of problem-solving knowledge, the role of ES and decision support technology, and good communication skills.

What are the two fundamental data structures in a graph database?

Nodes and relations. Also called vertices and edges.

What has finite non-ordered values?

Nominal Data

What contains measurements of simple codes assigned to objects as labels which are not measurements?

Nominal data - can be represented with binomial values (two possible values) or multinomial values (three or more possible values), e.g., the variable marital status (single, married, divorced)

What means that some experimentation-type search or inference is involved?

Nontrivial

What does it mean if users can't change or update the data?

Nonvolatile

What assumption states that the errors of the response variable are normally distributed?

Normality

What means that the patterns are not previously known to the user within the context of the system being analyzed?

Novel

What are numeric values?

Numeric Data

What represents the numeric values of specific variables?

Numeric Data / Continuous Data (scalable data) -- can be integer or real.

Operational data store

ODS. Provides a fairly recent form of customer information file (CIF). This type of database is often used as an interim staging area for a data warehouse. Used for short-term decisions. Holds only recent information and is not meant for long-term use; a data warehouse, on the other hand, stores permanent information. An ODS consolidates data from multiple source systems and provides a near-real-time, integrated view of volatile, current data.

What does an analyst use to navigate through the database and screen for a particular subset of the data by changing the data's orientations and defining analytical calculations?

OLAP

What is the approach to quickly answer ad hoc questions by executing multidimensional analytical queries against organizational data repositories?

OLAP

What is the most commonly used data analysis technique in data warehouses and has been growing in popularity due to the exponential increase in data volumes and the recognition of the business value of data driven analytics?

OLAP

What is OLAP and how does it differ from OLTP?

OLAP (online analytical processing) is an approach to quickly answer ad hoc questions by executing multidimensional analytical queries against organizational data repositories (example-data warehouses, data marts). OLTP(online transaction processing system) is a term used for a transaction system, which is primary responsible for capturing and storing data related to day-to-day business functions such as ERP, CRM, SCM, point of sale, and so forth.

most common analysis technique in data warehouse?

OLAP online analytical processing.

What is used for a transaction system that is primarily responsible for capturing and storing data related to day-to-day business functions such as ERP, CRM, SCM, POS, and so forth?

OLTP

What is a common representation schema of the frequency based relationship between the terms and documents in tabular format where terms are listed in columns?

Occurrence Matrix / Term by Document Matrix
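
A small sketch of how such a matrix might be built, assuming whitespace tokenization and lowercase normalization (real text mining pipelines typically also apply stemming and stop-word removal). Terms are the columns and documents are the rows; the sample documents are made up.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({t for tokens in tokenized for t in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[t] for t in terms])
    return terms, matrix

terms, matrix = term_document_matrix([
    "data mining finds patterns",
    "text mining mines text",
])
# terms  -> ['data', 'finds', 'mines', 'mining', 'patterns', 'text']
# matrix -> [[1, 1, 0, 1, 1, 0], [0, 0, 1, 1, 0, 2]]
```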

What refers to web measurement and analysis about you and your products that takes place outside your web site?

Off Site Web Analytics

What are the two main categories of Web analytics?

Off site and on site.

What measures visitors' behavior once they are on the web site and measures the performance in a commercial context?

On Site Web Analytics

How many values can be stored with a single key in a key-value database?

One

OLAP

Online Analytical Processing - live data - reporting

What is the term used for analyzing, characterizing, and summarizing structured data stored in organizational databases?

Online Analytical Processing (OLAP)

What handles a company's routine ongoing business and responds immediately to user requests?

Online Transaction Processing (OLTP)

Types of analytical processing activities:

Online analytical processing (OLAP), data mining, querying, reporting, and other decision-support applications.

OLAP vs OLTP

Online analytical processing vs. online transaction processing. OLTP is for capturing and storing data for day-to-day business functions such as ERP, CRM, SCM, point of sale, and so forth. It is not designed for ad hoc and complex queries that deal with a number of data items. OLAP, on the other hand, is designed to address this need by providing ad hoc analysis of organizational data much more effectively and efficiently. OLAP and OLTP rely on each other: OLAP uses the data captured by OLTP, and OLTP automates the business processes that are managed by decisions supported by OLAP.

OLTP

Online transaction processing (traditional relational DBMS)

What consolidates data from multiple source systems and provides a near real time, integrated view of volatile current data?

Operational Data Stores

What provides a fairly recent form of customer information file and is used as an interim staging area for a data warehouse?

Operational Data Stores

What is used for short term decisions involving mission critical applications rather than for the medium and long term decisions associated with EDW?

Operational Data Stores (think short term memory)

What translates an organization's strategic objectives and goals into a set of well-defined tactics and initiatives, resource requirements, and expected results for some future time period?

Operational Plan *key to success is integration

abstract architecture

Operational Systems -> ETL Process -> Data Warehouse -> Front-end Software -> Warehouse users

What decision support model used data that was obtained from the domain experts use of manual processes to build mathematical or knowledge to solve constrained optimization problems?

Operations Research

An ODS is a

Operational data store. A type of customer-information-file database that is often used as a staging area for a data warehouse.

What are some other names for sentiment analysis?

Opinion mining, subjectivity analysis, and appraisal extraction

What is the solution that has the highest degree of goal attainment associated with it known as?

Optimal Solution

What are enablers of prescriptive analytics?

Optimization, simulation, decision modeling, and expert systems.

What has finite ordered values?

Ordinal Data

What contains codes assigned to objects or events as labels that also represent the rank order among them?

Ordinal Data, e.g., credit score, age group.

What aims to minimize the sum of squared residuals and leads to a mathematical expression for the estimated value of the regression line?

Ordinary Least Squares Method
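
For a single explanatory variable, minimizing the sum of squared residuals leads to closed-form expressions for the slope and intercept of the regression line. The sketch below applies them; the function name and data points are illustrative.

```python
def ols_fit(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# These points lie exactly on y = 1 + 2x.
slope, intercept = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
```

With real data the points will not fall exactly on a line, and the fitted line is the one with the smallest total squared vertical distance to the points.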

What are lagging indicators that measure the output of past activity?

Outcome KPI (financial in nature)

What uses JavaScript embedded in the site page code to make image requests to a 3rd party analytics dedicated server whenever a page is rendered by a web browser?

Page Tagging

What is the most basic of measurements and is presented as the average page views per visitor?

Page Views

What enables multiple CPUs to process data warehouse query requests simultaneously and provides scalability?

Parallel Processing

What are some of the challenges of NLP?

Part-of-speech tagging: It is difficult to mark up terms in a text as corresponding to a particular part of speech because the part of speech depends not only on the definition of the term but also on the context within which it is used. Text segmentation: Some written languages, such as Chinese, Japanese, and Thai, do not have single-word boundaries. Word sense disambiguation: Many words have more than one meaning. Selecting the meaning that makes the most sense can only be accomplished by taking into account the context within which the word is used. Syntactic ambiguity: The grammar for natural languages is ambiguous; that is, multiple possible sentence structures often need to be considered. Choosing the most appropriate structure usually requires a fusion of semantic and contextual information. Imperfect or irregular input: Foreign or regional accents and vocal impediments in speech and typographical or grammatical errors in texts make the processing of the language an even more difficult task. Speech acts: A sentence can often be considered an action by the speaker. The sentence structure alone may not contain enough information to define this action

Motivations for Redundant (Analytic) Data Storage

Performance Accessibility

What assists managers in tracking the implementation of business strategy by comparing actual results against strategic goals and objectives?

Performance Measurement Systems

What are examples of decision analysis attributes?

Performance measures, operational metrics, aggregated measures, and all the others to analyze the organization's performance.

Describe a two-phase commit. Does it help ensure consistency or availability?

Phase 1: the database writes, or commits, the data to the disk of the primary server. Phase 2: The database writes data to the disk of the backup server. It helps ensure consistency because if the primary server fails, it can switch to the backup database.

What is used to change the dimensional orientation of a report or ad hoc query page display?

Pivot

What can be made at the word, term, sentence, or document level?

Polarity Identification

What are other areas that utilize sentiment analysis applications?

Politics, government intelligence, and e-Commerce sites.

What means that the discovered patterns should lead to some benefit to the user or task?

Potentially Useful

What is the act of telling about the future?

Prediction / Forecasting

What are categories of data mining tasks?

Prediction, Association, and Segmentation

What are the key differences among the major data mining methods?

Prediction: the act of telling about the future. It differs from simple guessing by taking into account the experiences, opinions, and other relevant information in conducting the task of foretelling. A term that is commonly associated with prediction is forecasting. Even though many believe that these two terms are synonymous, there is a subtle but critical difference between the two. Whereas prediction is largely experience and opinion based, forecasting is data and model based. That is, in order of increasing reliability, one might list the relevant terms as guessing, predicting, and forecasting, respectively. In data mining terminology, prediction and forecasting are used synonymously, and the term prediction is used as the common representation of the act.

What tells the nature of future occurrences of certain events based on what has happened in the past?

Predictions

What is most commonly used assessment factor for classification models that predicts the class label of new or previously unseen data?

Predictive Accuracy

What aims to determine what is likely to happen in the future and is based on statistical techniques?

Predictive Analytics

What answers the question "What will happen?" and "Why will it happen?"

Predictive Analytics

Where has the biggest growth in analytics been?

Predictive Analytics

What answers the question "What should I do?" or "Why should I do it?"

Prescriptive Analytics

What is used to provide a decision or a recommendation for a specific action?

Prescriptive Analytics

What is used to recognize what is going on as well as the likely forecast and make decisions to achieve the best performance possible?

Prescriptive Analytics

What is modeling a key element for?

Prescriptive analytics

In what simulation would one or more of the independent variables be probabilistic?

Probabilistic Simulation

What implies that data mining comprises many iterative steps?

Process

What is the tendency for actors to have more ties with geographically close others?

Propinquity

What is a performance dashboard? Why are they so popular for BI software tools?

Provide visual displays of important information that is consolidated and arranged on a single screen so that information can be digested at a single glance and easily drilled in and further explored. Gives a quick and accurate idea of what is going on within the organization.

Why is AaaS cost-effective?

Provides many virtual analytical applications with better scalability and higher cost savings. With growing data volumes and dozens of virtual analytical applications, chances are that more of them leverage processing at different times, usage patterns, and frequencies.

What contains both nominal and ordinal data?

Qualitative Data / Categorical Data

What is made up of result variables, decision variables, uncontrollable variables, and intermediate result variables?

Quantitative Models

What is a quarter of the number of data points given in a data set?

Quartile

What are useful measures of dispersion because they are much less affected by outliers or skewness in the data set?

Quartiles. The quartiles and the median are often reported together as the best choice of measures of dispersion and central tendency, respectively.
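
To see why quartiles resist outliers, here is a minimal sketch using Python's standard `statistics` module. The sample data is made up, and the "inclusive" quantile method is one of several conventions, chosen here as an assumption.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # "inclusive" treats the data as the full population, not a sample
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

data = [2, 4, 4, 5, 7, 9, 100]          # 100 is an outlier
low, q1, median, q3, high = five_number_summary(data)
value_range = high - low                 # 98: dominated by the outlier
interquartile_range = q3 - q1            # 4.0: barely affected by it
```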

What are the two components of a response cycle?

Query Analyzer and Document Matcher/Ranker

What employs a hierarchical clustering approach where the most relevant documents to the posed query appear in small tight clusters that are nested in larger clusters containing less similar documents, creating a spectrum of relevance levels among the documents?

Query Specific Clustering

What is the task of automatically answering a question posed in natural language?

Question Answering

What are the open source platforms that have emerged as popular industrial strength software tools for predictive analytics?

R, RapidMiner, and KNIME

What ranges from 0 to 1 with 0 indicating that the proposed model is NOT a good fit and 1 indicating that the proposed model is a perfect fit?

R² (R-squared)
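
As a worked illustration, R² can be computed with the common definition 1 - SS_res / SS_tot. The function name and sample numbers are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

perfect = r_squared([1, 2, 3], [1, 2, 3])    # every point predicted exactly
baseline = r_squared([1, 2, 3], [2, 2, 2])   # always predicting the mean
```

A model that only ever predicts the mean scores 0, and a model that reproduces every observation scores 1.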

What is the difference between the largest and smallest values in a given data set?

Range (simplest measure of dispersion)

What is the most popular general platform for data mining/data science?

RapidMiner

What includes measurement variables commonly found in the physical sciences and engineering?

Ratio Data, e.g., mass, length, time, plane angle, energy

What implies that the refresh cycle of an existing data warehouse to update the data is more frequent?

Real-Time Data Warehousing

Operational Data

Real-time data stored in a relational database optimized to support daily transactions. It consists of many normalized tables that are updated intensively.

What is prescriptive analytics? What kinds of problems can be solved by prescriptive analytics?

Recognize what is going on as well as the likely forecast and make decisions to achieve the best performance possible. These recommendations can be in the form of a specific yes/no decision for a problem, a specific amount (say, the price to charge for a specific item), or a complete set of production plans. They may be delivered in a report or through an automated decision rules system.

What attempts to describe the dependence of a response variable on one (or more) explanatory variables, where it implicitly assumes that there is a one-way causal effect from the explanatory variable(s) to the response variable?

Regression

What is a simple statistical technique to model the dependence of a variable on one (or more) explanatory variables?

Regression

What is concerned with the relationships between all explanatory variables and the response variable?

Regression

What is the most widely known and used analytics technique in statistics, used for hypothesis testing and prediction/forecasting?

Regression

What is a dependent variable also known as?

Response or output

What reflects the level of effectiveness of a system by indicating how well the system performs or attains its goals?

Result/Outcome Variables

What is the mantra for business intelligence?

Right information at the right time and in the right place.

When must the decision maker consider several possible outcomes for each alternative each with a given probability of occurrence?

Risk / Probabilistic / Stochastic Decision Making

What is a decision making method that analyzes the risk associated with different alternatives?

Risk Analysis

What is the model's ability to make reasonably accurate predictions given noisy data or data with missing and erroneous values?

Robustness

the three key factors that affect the presentation ability

Role: different user groups (CEO, middle manager, customer support, ...). Task: every task requires different content and format of the information. Preference: individuals differ in their preference (big picture vs. detail). --> a good BI solution should

OLAP operations

Roll up, Drill Down, Slice and dice, Pivot (rotate)

What involves computing all the data relationships for one or more dimensions?

Roll-Up

What are structured decisions? Provide two examples.

Routine and typically repetitive problems for which standard methods exist. finding an appropriate inventory level, choosing an optimal investment strategy

What system captured experts' knowledge in a format that computers could process so that it could be used for consultation, and allowed scarce expertise to be made available where and when needed?

Rule Based Expert Systems

Who are some examples of ETL providers?

SAS, Microsoft, Oracle, IBM

What are some tools used for predictive analytics?

SAS, SPSS, and IBM

What process starts with a statistically representative sample of the data, applies exploratory statistical and visualization techniques, selects and transforms the most significant predictive variables, models the variables to predict outcomes, and confirms a model's accuracy?

SEMMA

What are some data solution providers offering hardware and platform independent database management systems?

SQL Server family of Microsoft and SAP

What are the most commonly used database management systems?

SQL Server, Oracle, and DB2

What is a creative way of deploying information systems applications where the provider licenses its applications to customers for use as a service on demand?

SaaS (Extended ASP Model)

Business Applications of Regression

Sales predictions, financial forecasting, residual value estimation...

What are the steps of the SEMMA Data Mining Process?

Sample, Explore, Modify, Model, and Assess.

What is the ability to construct a prediction model efficiently given a rather large amount of data?

Scalability

What are the two most popular clustering methods for text mining?

Scatter/gather and query specific clustering

What is a software program that searches for documents, based on keywords users have provided that have to do with the subject of their inquiry?

Search Engine

What is the intentional activity of affecting the visibility of an e-commerce site or a web site in a search engine's natural search results?

Search Engine Optimization (SEO)

What are the main concerns for a data warehouse professional?

Security & privacy of information

What are URLs known as?

Seeds

What is the most common method for solving this risk analysis problem?

Select the alternative with the greatest expected value.
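
The expected-value rule above can be sketched in a few lines. The alternatives, probabilities, and payoffs are invented for illustration.

```python
def expected_value(outcomes):
    """Sum of probability * payoff over the possible outcomes of one alternative."""
    return sum(probability * payoff for probability, payoff in outcomes)

def best_alternative(alternatives):
    """Name of the alternative with the greatest expected value."""
    return max(alternatives, key=lambda name: expected_value(alternatives[name]))

# Hypothetical investment decision: (probability, payoff in %) per economic state
alternatives = {
    "bonds":   [(0.5, 12.0), (0.3, 6.0), (0.2, 3.0)],
    "stocks":  [(0.5, 15.0), (0.3, 3.0), (0.2, -2.0)],
    "deposit": [(1.0, 6.5)],
}
choice = best_alternative(alternatives)
```

Here bonds come out ahead (about 8.4 versus 8.0 for stocks and 6.5 for the deposit), even though stocks have the highest single payoff.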

Self-service business intelligence (SSBI)

Self-service business intelligence (SSBI) is an approach to data analytics that enables business users to access and work with corporate data even though they do not have a background in statistical analysis, business intelligence (BI) or data mining. Allowing end users to make decisions based on their own queries and analyses frees up the organization's business intelligence and information technology (IT) teams from creating the majority of reports and allows those teams to focus on other tasks that will help the organization reach its goals.

How can computers provide support to semistructured and unstructured decisions?

Semistructured- use combination of standard solution and human judgment. Management science can provide models for a portion of decision-making problems. For the unstructured a DSS (Decision Support System) can improve the quality of the info on which the decision is based with alternatives and potential impacts. Unstructured- can only be partially supported by standard computerized quantitative methods. Have to create customized solutions. Intuition and judgment may play a larger role.

What attempts to assess the impact of change in the input data or parameters on the proposed solution?

Sensitivity Analysis

What collects a massive amount of data at a faster rate and have been adopted by various sectors such as healthcare, sports, and energy?

Sensors

What is a technique used to detect favorable and unfavorable opinions toward specific products and services using a large number of textual data sources?

Sentiment Analysis

When are the relationships examined in terms of their order of occurrence to identify associations over time?

Sequence Mining

What is the discovery of time ordered events?

Sequential Relationships

What are the two technical ways of collecting the data for on site analytics?

Server log files analysis and page tagging

What are the main differences among line, bar and pie charts? When should you choose one over the other?

Line charts: show the relationship between two variables; they are most often used to track changes or trends over time. They connect individual data points to help infer changing trends over a period of time. Bar charts: used to compare data across multiple categories. Effective when you have nominal data or numerical data that splits into different categories to compare results. Pie charts: used to illustrate relative proportions of a specific measure; used to show percentages in categories. If the number of categories is more than 4, use a bar chart instead.

What are the major data mining processes?

Similar to other information systems initiatives, a data mining project must follow a systematic project management process to be successful. Several data mining processes have been proposed: CRISP-DM, SEMMA, and KDD.

Centralized data warehouse

Similar to the hub-and-spoke architecture, except that there are no dependent data marts; instead, one big enterprise data warehouse serves the needs of all organizational units. More holistic view. No data marts.

What partitions the data into two mutually exclusive subsets called a training set and a test set?

Simple Split
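
A minimal sketch of a simple split, assuming the 2/3 training and 1/3 test proportions mentioned elsewhere in this set. The function name and the fixed seed are illustrative choices.

```python
import random

def simple_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle the records and cut them into two mutually exclusive subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)     # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

training_set, test_set = simple_split(range(30))
```

Shuffling before cutting keeps any ordering in the original data (for example, by date or by class) from leaking into one subset only.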

What is normally used when a problem is too complex to be treated using numerical optimization techniques?

Simulation

What is the appearance of reality and is a technique for conducting experiments with a computer on a model of a management system?

Simulation

What reduces the overall dimensionality of the input matrix to a lower dimensional space where each consecutive dimension represents the largest degree of variability possible?

Singular Value Decomposition

What is a performance management methodology aimed at reducing the number of defects in a business process to as close to 0 DPMO as possible?

Six Sigma

What is a measure of asymmetry in a distribution of data that portrays a unimodal structure, where only one peak exists in the distribution?

Skewness

What is a subset of a multidimensional array corresponding to a single value set for one or more of the dimensions not in the subset?

Slice (3D cube)

What are commonly used OLAP operations?

Slice & Dice, drill down, roll-up, and pivot.

Slice And Dice

Slice and dice refers to a strategy for segmenting, viewing and understanding data in a database. Users slice and dice by cutting a large segment of data into smaller parts, and repeating this process until arriving at the right level of detail for analysis. Slicing and dicing helps provide a closer view of data for analysis and presents data in new and diverse perspectives.
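
The OLAP operations can be illustrated on a tiny in-memory "cube". This is only a sketch: the fact rows, dimension names, and helper functions are hypothetical, and a real OLAP server would do this over precomputed aggregates.

```python
from collections import defaultdict

DIMS = ("region", "product", "quarter")
# Hypothetical fact rows: (region, product, quarter, sales)
FACTS = [
    ("East", "Widget", "Q1", 100),
    ("East", "Widget", "Q2", 120),
    ("East", "Gadget", "Q1", 80),
    ("West", "Widget", "Q1", 90),
    ("West", "Gadget", "Q2", 60),
]

def slice_cube(facts, dim, value):
    """Slice: fix a single value on one dimension."""
    i = DIMS.index(dim)
    return [row for row in facts if row[i] == value]

def dice(facts, **allowed):
    """Dice: keep rows whose values fall in the given sets on several dimensions."""
    kept = facts
    for dim, values in allowed.items():
        i = DIMS.index(dim)
        kept = [row for row in kept if row[i] in values]
    return kept

def roll_up(facts, *group_dims):
    """Roll-up: total sales per combination of the chosen dimensions."""
    indexes = [DIMS.index(d) for d in group_dims]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in indexes)] += row[-1]
    return dict(totals)

by_region = roll_up(FACTS, "region")                    # coarse view
by_region_product = roll_up(FACTS, "region", "product") # drill down one level
```

Drilling down is the reverse of rolling up: grouping by more dimensions gives a finer-grained view of the same facts.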

What is a logical arrangement of tables in a multidimensional database in such a way that the entity relationship diagram resembles a snowflake in shape?

Snowflake Schema - dimensions are normalized into multiple related tables.

What is the mining of textual context created in social media and analyzing socially established networks for the purpose of gaining insight about existing and potential customers' current and future behaviors and about the likes and dislikes toward a firm's product/service?

Social Analytics

What are the enabling technologies of social interactions among people in which they create, share, and exchange information?

Social Media

What are the systematic and scientific ways to consume the vast amount of content created by web-based social media outlets, tools, and techniques for the betterment of an organization's competitiveness?

Social Media Analytics

What is a social structure composed of individuals/people linked to one another with some type of connections/relationships and provides a holistic approach to analyzing the structure and dynamics of social entities?

Social Network

What is a theoretical construct useful in the social sciences to study relationships between individuals, groups, organizations, or even societies?

Social Network

What follows the links between friends, fans, and followers to identify connections of influence as well as the biggest sources of influence?

Social Network Analysis

What is the systematic examination of social networks that view social relationships in terms of network theory?

Social Network Analysis (SNA)

What can be placed on a separate server in the network or on the transactional application databases themselves and can use event and process based approaches to proactively and intelligently measure and monitor operational processes?

Software Monitors / Intelligent Agents

internal data sources

Sources: OLTP, ERP, CRM. Kind of data: production, planning, sales, customer, marketing, organizational. Maintained in different formats: structured documents, unstructured documents.

What recent technologies may shape the future of data warehousing?

Sourcing: Web/Social Media/Big Data, Open source software, Software as a service, cloud computing Infrastructure: Columnar, Real-time data warehousing, DW appliances, data management technologies and practices, In-database processing technology, In-memory storage technology, New database management systems, advanced analytics

What converts spoken words to machine readable input?

Speech Recognition

What are the computation costs involved in generating and using the model where faster is deemed to be better?

Speed

What is a test on one or more attributes and determines how the data are to be divided further?

Split Point

What is the most popular end user modeling tool because it incorporates many powerful financial, statistical, mathematical, and other functions?

Spreadsheet

What is natural language processing?

A subfield of artificial intelligence and computational linguistics. It studies the problem of "understanding" the natural human language, with the view of converting depictions of human language (such as textual documents) into more formal representations (in the form of numeric and symbolic data) that are easier for computer programs to manipulate.

What is the measure of the spread of values within a set of data?

Standard Deviation

What is the most commonly used and the simplest style of dimensional modeling that contains a central fact table surrounded and connected to dimension tables?

Star Schema

What is designed to provide fast query response time, simplicity, and ease of maintenance for read only database structures?

Star Schema -dimensions are denormalized with each dimension being represented by a single table
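
A star schema can be sketched with Python's built-in `sqlite3` module: one central fact table joined to denormalized dimension tables. All table names, column names, and rows are hypothetical.

```python
import sqlite3

def build_star_schema():
    """Create a tiny in-memory star schema: one fact table, two dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            date_id INTEGER REFERENCES dim_date(date_id),
            amount REAL
        );
        INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
        INSERT INTO dim_date VALUES (1, 2024, 'Q1'), (2, 2024, 'Q2');
        INSERT INTO fact_sales VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 20.0), (1, 2, 30.0);
    """)
    return conn

def sales_by_category(conn):
    """A typical star-schema query: join the fact table to one dimension and aggregate."""
    return conn.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()

totals = sales_by_category(build_star_schema())
```

Because each dimension is a single table, a query needs only one join per dimension, which is what keeps star-schema queries simple and fast.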

multi dimensional data in relational db

Star schema, snowflake schema, fact constellation,

What is a collection of mathematical techniques to characterize and interpret data?

Statistics

What is the process of reducing inflected words to their stem form?

Stemming

What are words that are filtered out prior to or after processing of natural language data?

Stop words

What are the two aspects to managing data that can't be stored in a single unit?

Storing and processing.

What are the steps of a closed loop BPM strategy?

Strategize, plan, monitor/analyze, and act/adjust

What is a high level plan of action, encompassing a long period of time to achieve a defined goal?

Strategy

What are features of a KPI?

Strategy, targets, ranges, encoding, time frames, and benchmarks.

What is the absence of ties between two parts of a network?

Structural Holes

What do data mining algorithms use that can be classified as categorical or numeric?

Structured data -Categorical: Nominal, ordinal -Numerical: Interval, ratio

What enables users to determine how their business is performing and why?

Subject Oriented -- provides a more comprehensive view of the organization.

Characteristics of Data Warehousing include

Subject oriented (data organized by detailed subject such as sales, customer), integrated (consistent format), time variant (maintains historical data), nonvolatile (users can't change data; changes are recorded as new data).

What are characteristics of data warehousing?

Subject oriented, integrated, time variant, nonvolatile.

What are the types of metadata (based on pattern)?

Syntactic, structural, and semantic.

Meta-flow:

System modeling: to define structure of legacy systems, synthesizing to create valued, regulating to create modules for capturing.

What is a single word or multi-word phrase extracted directly from the corpus of a specific domain by means of NLP methods?

Term

When would rows represent documents and columns represent terms?

Term Document Matrix

What is used when the classifier is built and then tested, and has 1/3 of the data?

Test Set

What is more commonly used in a business application context?

Text Analytics -- relatively new term

What is frequently used in academic research?

Text Mining

What is the semiautomated process of extracting patterns from large amounts of unstructured data sources?

Text Mining

What is text analytics: How does it differ from text mining?

Text analytics is a concept that includes information retrieval (e.g., searching and identifying relevant documents for a given set of key terms) as well as information extraction, data mining, and Web mining. Text mining is the semi-automated process of extracting patterns (useful information and knowledge) from large amounts of unstructured data sources.

What are the main steps in the text mining process?

Text mining entails three tasks: 1. Establish the Corpus: Collect and organize the domain-specific unstructured data 2. Create the Term-Document Matrix: Introduce structure to the corpus 3. Extract Knowledge: Discover novel patterns from the T-D matrix
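
Steps 1 and 2 can be sketched as follows, with rows representing documents and columns representing terms. The three-sentence corpus and the tiny stop-word list are made up, and the sketch skips stemming for brevity.

```python
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, strip punctuation, and drop stop words."""
    words = (word.strip(".,!?;:").lower() for word in text.split())
    return [word for word in words if word and word not in STOP_WORDS]

def term_document_matrix(corpus):
    """Return (terms, matrix), where matrix[d][t] counts term t in document d."""
    tokenized = [tokenize(document) for document in corpus]
    terms = sorted({term for document in tokenized for term in document})
    matrix = [[document.count(term) for term in terms] for document in tokenized]
    return terms, matrix

# Step 1: establish the corpus (hypothetical customer comments)
corpus = [
    "The customer is happy.",
    "The customer returned the product.",
    "Happy customer, happy product!",
]
# Step 2: introduce structure
terms, matrix = term_document_matrix(corpus)
```

Step 3 (extracting knowledge) would then run clustering, classification, or association methods over the numeric matrix.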

What is a computer program that automatically converts normal language text into human speech?

Text to Speech / Speech Synthesis

What is the main difference between SEMMA and CRISP-DM?

CRISP-DM takes a more comprehensive approach, whereas SEMMA implicitly assumes that the data mining project's goals and objectives, along with the appropriate data sources, have been identified and understood.

Value of information

The ability to understand, digest, analyze, and filter information is key to growth and success for any professional in any industry

Define managerial control. Provide two examples.

The acquisition and efficient use of resources in the accomplishment of organizational goals.

What are issues that are pertaining to scalability?

The amount of data in a warehouse, how quickly the warehouse is expected to grow, the number of concurrent users, and the complexity of user queries. **must scale both horizontally and vertically

What are popular techniques for time series forecasting?

The averaging methods -- simple average, moving average, weighted moving average, and exponential smoothing.
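
Three of the averaging methods can be sketched as one-step-ahead forecasts. The series, window, weights, and smoothing constant are arbitrary illustrative values.

```python
def moving_average(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    return sum(series[-window:]) / window

def weighted_moving_average(series, weights):
    """Like a moving average, but with weights listed from oldest to newest."""
    recent = series[-len(weights):]
    return sum(w * x for w, x in zip(weights, recent)) / sum(weights)

def exponential_smoothing(series, alpha):
    """New forecast = alpha * latest actual + (1 - alpha) * previous forecast."""
    forecast = series[0]  # assumption: seed the forecast with the first observation
    for actual in series[1:]:
        forecast = alpha * actual + (1 - alpha) * forecast
    return forecast

sales = [10, 12, 14, 16]
simple_forecast = sum(sales) / len(sales)
ma_forecast = moving_average(sales, window=3)
wma_forecast = weighted_moving_average(sales, weights=[1, 2, 3])
es_forecast = exponential_smoothing(sales, alpha=0.5)
```

On this steadily rising series, the methods that weight recent observations more heavily give higher forecasts than the simple average.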

Define operational control. Provide two examples.

The efficient and effective execution of specific tasks. Examples: A/R, A/P.

What is "search engine optimization"? Who benefits from it?

The intentional activity of affecting the visibility of an e-commerce site or a website in a search engine's natural (unpaid or organic) search results. It involves editing a page's content, HTML, metadata, and associated coding to both increase its relevance to specific keywords and to remove barriers to the indexing activities of search engines. Primarily benefits companies with e-commerce sites by making their pages appear toward the top of search engine lists when users query.

What is the main difference between commercial and free data mining software tools?

The main difference between commercial tools, such as Enterprise Miner and Statistica, and free tools, such as Weka and RapidMiner, is computational efficiency. The same data mining task involving a rather large dataset may take a whole lot longer to complete with the free software, and in some cases it may not even be feasible (i.e., crashing due to the inefficient use of computer memory).

What are some methods for cluster analysis?

The most commonly used clustering algorithms are k-means and self-organizing maps.

Business Analytics and its goals

The process of creating new insights from information is known as business analytics. a) Business Intelligence --> Operational --> Here & Now. b) Business Analytics --> Strategic --> Future. Goals: extracting the knowledge buried inside enterprise databases (discovering unknown relationships); analytical decisions are put on a repeatable basis instead of being treated as an ad hoc activity.

Define analytics.

The process of developing actionable decisions or recommendations for actions based on insights generated from historical data.

Describe how ES perform inference.

The process of using the rules in the knowledge base along with the known facts to draw conclusions. Requires some logic embedded in a computer program to access and manipulate the stored knowledge. This program is an algorithm that, with the guidance of the inference rules, controls the reasoning process and is usually called the inference engine.

BI

The set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis/decision support purposes.

What is the difference between information visualization and visual analytics?

Information visualization: the use of visual representations to explore, make sense of, and communicate data. Visual analytics: the combination of visualization and predictive analytics.

What is the purpose of technology providers or the outer petals?

They provide technology, solutions, and training to analytics user organizations so they can employ these technologies in the most effective and efficient manner.

Data mart bus architecture

This architecture is a viable alternative to the independent data marts where the individual marts are linked to each other via some kind of middleware. Not optimal for complex data queries.

What are the key similarities and differences between a two-tiered architecture and a three-tiered architecture?

Three-Tier: Has client workstation, application server and database server (each in own tier). Data is processed twice and deposited in an additional multidimensional database. Separation of functions of the DW, eliminates resource constraints, easily create data marts. Two-Tier: Has client workstation and application/database server. Same hardware, but more economical. Can have problems with large DW with data intensive applications.

What is defined by the linear combination of time, emotional intensity, intimacy, and reciprocity?

Tie Strength

What is the structure of a two tier architecture?

Tier 1: Client Workstation Tier 2: Application & Database Server **more economical, but more performance problems

What is the structure of a three tier architecture?

Tier 1: Client Workstation Tier 2: Application Server Tier 3: Database Server **eliminates resource constraints and makes it possible to easily create DMs

3 tiers of data warehousing architecture (a 2-tier architecture, where the last two tiers work together, is more economical but not great for large companies).

Tier 1: Client workstation. Tier 2: Application server. Tier 3: Database server.

What is a situation in which it's not important to know exactly when the event occurred?

Time Independent

What is a sequence of data points of the variable of interest, measured and represented as successive points in time spaced at uniform time intervals?

Time Series

What assumes all the explanatory variables are aggregated and consumed in the response variable's time variant behavior?

Time Series Forecasting

What is the use of mathematical modeling to predict future values of the variable of interest based on previously observed values?

Time Series Forecasting

What measures the visitor's interaction with the website?

Time on Site

What is a categorized block of text in a sentence?

Tokenizing

online analytical processing (OLAP),

Tools to create an advanced data analysis environment that supports decision making, business modeling, and operations research.

What adapts traditional relational database tools to the development needs of an enterprise wide data warehouse and provides a consistent & comprehensive view of the enterprise?

Top Down Development / EDW Approach

How does traditional analytics make use of location-based data?

Traditional analytics produce visual maps that are geographically mapped and based on the traditional location data, usually grouped by the postal codes. The use of postal codes to represent the data is a somewhat static approach for achieving a higher level view of things

What is used by the model builder and has 2/3 of the data?

Training Set

In a simple split, what are the three mutually exclusive subsets used to prevent overfitting?

Training, validation, and testing.

What is a computerized record of a discrete event?

Transaction

What involves converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a data warehouse or simply another database?

Transformation

What is sentiment analysis? How does it relate to text mining?

Tries to answer the question, "What do people feel about a certain topic?" by digging into opinions of many using a variety of automated tools. It is also known as opinion mining, subjectivity analysis, and appraisal extraction. Unlike text mining, which categorizes text by conceptual taxonomies of topics, sentiment classification generally deals with two classes (positive versus negative), a range of polarity (e.g., star ratings for movies), or a range in strength of opinion

What is the outcome of when the predictive class is negative and the observed class is negative?

True Negative

What is the outcome of when the predictive class is positive and the observed class is positive?

True Positive

specificity

True Negative / (True Negative + False Positive)
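
The confusion-matrix terms above can be made concrete with a short sketch. The label lists are invented, and the sensitivity helper is added only for contrast with specificity.

```python
def confusion_counts(observed, predicted, positive=1):
    """Count true positives, false positives, true negatives, false negatives."""
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        if pred == positive and obs == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif obs == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

observed = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(observed, predicted)
accuracy = (tp + tn) / (tp + fp + tn + fn)
```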

What means that the pattern should make business sense that leads to the user saying they understand?

Ultimately Understandable

When would the decision maker consider situations in which several outcomes are possible for each course of action and the decision maker does not know the probability of occurrence of the possible outcomes?

Uncertainty

What are the factors that affect the result variables, but are not under the control of the decision maker?

Uncontrollable variables/Parameters

environmental scanning

Undirected viewing mode: limited, irregular information. Conditional viewing mode: controlling for internal data, external data monitored. Searching mode: seeking information to update existing knowledge. Enacting mode: experimentation and trying new behaviors.

What is composed of any combination of textual, imagery, voice, and Web content?

Unstructured data

Geographic information system (GIS)

Used to capture, store, analyze, and manage the data linked to a location, combined with integrated sensor technologies and GPS

What is a critical success factor in data warehouse development?

User participation

What uses animated computer graphic displays to present the impact of different managerial decisions?

VIS

What means that the discovered patterns should hold true on new data with a sufficient degree of certainty?

Valid

What is used to calculate the deviation of all data points in a given data set from the mean?

Variance
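
A minimal sketch of variance and its square root, the standard deviation. It uses the population form (dividing by n); the sample form divides by n - 1 instead. The data is illustrative.

```python
import math

def population_variance(values):
    """Mean of the squared deviations of each data point from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def population_std_dev(values):
    """Square root of the variance, expressed in the data's own units."""
    return math.sqrt(population_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]
variance = population_variance(data)
std_dev = population_std_dev(data)
```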

What is a simulation method that lets decision makers see what the model is doing and how it interacts with the decisions made, as they are made?

Visual Interactive Simulation (VIS)

Geocoding

Visual Maps, Postal codes, Latitude & Longitude

What is a significant technology that has become a key player in descriptive analytics?

Visualization

What is an integral part of analytic CRM and customer experience management systems that helps to better understand and better manage customer complaints/praises?

Voice of the Customer

What has been limited to employee satisfaction surveys and is a way to listen to what employees are saying?

Voice of the Employee

What is about understanding aggregate opinions and trends and helps companies with competitive intelligence and product development and positioning?

Voice of the Market

Out of the Vs that are used to define Big Data, in your opinion, which one is the most important? Why?

Volume, Variety, Velocity, Veracity, Variability, Value Proposition. -Value Proposition: A preconceived notion about "big" data is that it contains more patterns and interesting anomalies than "small" data. By analyzing large and feature rich data, organizations can gain greater business value that they may not have otherwise. Users can detect the patterns in small data sets using simple statistical analytics. Big analytics means greater insight and better decisions, something that every organization needs.

What is primarily Web site usage data focused and aims to describe what has happened on the Web site?

Web Analytics

What is the process of discovering intrinsic relationships from Web data which are expressed in the form of textual, linkage, or useful information?

Web Mining

What is the extraction of useful information from data generated through web page visits and transactions?

Web Usage Mining

Additional data warehouse characteristics include:

Web based, relational/multidimensional, client/server (for easy access by end users), real time (newer data warehouses provide real-time or active data-access and analysis capabilities), metadata (data about data: how it is all organized, how to use it, etc.).

What are characteristics that enable data warehouses to be tuned exclusively for data access?

Web based, relational/multidimensional, client/server, real time, include metadata.

What is the extraction of useful information from Web pages?

Web content mining

What is the taxonomy of web analytics?

Web content mining, web structure mining, and web usage mining.

What are automated techniques that are used to read through the content of a Web site?

Web crawlers

What is Web mining? How does it differ from regular data mining or text mining?

Web mining is the discovery and analysis of interesting and useful information from the Web and about the Web, usually through Web-based tools. Text mining is less structured because it's based on words instead of numeric data.

What is the process of extracting useful information from the links embedded in Web documents and identifies authoritative pages and hubs?

Web structure mining

What is the integration of data warehousing and Internet that offer important solutions for managing corporate data?

Web-Based Data Warehousing

What are the differences and commonalities between Web-based social media and traditional/industrial media?

Web-based social media differ from traditional/industrial media in that they are comparatively inexpensive and accessible, enabling anyone to publish or access/consume information. Industrial media generally require significant resources to publish information, as in most cases the articles go through many revisions before being published. Commonality: reach. Differences: quality, frequency, accessibility, usability, immediacy, updatability.

What are the two main components of the development cycle?

Web crawler and document indexer

What are commonly used Web analytics metrics? What is the importance of metrics?

Website usability, traffic sources, visitor profiles, and conversion statistics. They provide access to a lot of valuable marketing data, which can be leveraged for better insights to grow your business and better document your ROI. The insight and intelligence gained from Web analytics can be used to effectively manage the marketing efforts of an organization and its various products or services.

What open source data mining software includes a large number of algorithms for different data mining tasks and has an intuitive user interface most popular in educational circles?

Weka

What are examples of open source, free data mining software?

Weka, KNIME, RapidMiner

What is the outcome of descriptive analytics?

Well-defined business problems and opportunities.

What is structured as "What will happen to the solution if an input variable, an assumption, or a parameter value is changed?"

What If Analysis
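
A what-if question can be sketched in a few lines of Python. The profit model and all figures below are made up for illustration; the point is only that one input changes while the others are held at their baseline values.

```python
def profit(units_sold, unit_price, unit_cost, fixed_costs):
    """Toy profit model (an assumption for this sketch): margin times volume minus fixed costs."""
    return units_sold * (unit_price - unit_cost) - fixed_costs

# Hypothetical baseline scenario.
baseline = {"units_sold": 1000, "unit_price": 20.0, "unit_cost": 12.0, "fixed_costs": 5000.0}

def what_if(model, base_inputs, **changes):
    """Return how the model result changes when some inputs are overridden."""
    scenario = {**base_inputs, **changes}
    return model(**scenario) - model(**base_inputs)

# What happens to profit if the unit price drops from 20 to 18?
delta = what_if(profit, baseline, unit_price=18.0)
print(delta)
```

The same helper can be called with several overrides at once (for example a changed price and a changed volume) to compare scenarios.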

What is a distributed system?

When systems run on multiple servers, instead of just one computer.

How BI Can Answer Tough Customer Questions 2

Where has the business been? Historical perspective offers important variables for determining trends and patterns. Where is the business now? Looking at the current business situation allows managers to take effective action to solve issues before they grow out of control. Where is the business going? Setting strategic direction is critical for planning and creating solid business strategies

What conforms to the search engine's guidelines and involves no deception?

White Hat SEO

What is the difference between white hat and black hat SEO?

White hats tend to produce results that last a long time, whereas black hats anticipate that their sites may eventually be banned, either temporarily or permanently, once the search engines discover what they are doing.

What is a laboriously hand coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets?

WordNet (expensive to build and maintain for NLP)

What is the purpose of the analytics accelerators or the inner petals?

Works with both technology providers and users.

What is the most granular polarity identification?

Word level

What is the world's largest data and text repository?

World Wide Web (WWW)

What is the process used to optimally price services to maximize revenues as a function of time-varying transactions?

Yield Management

Inter Cluster

Between the groups (clusters)

data artist

a business analytics specialist who uses visual tools to help people understand complex data

Dashboard (in PP)

a collection of 1 or more related scorecards or report elements arranged in a set of web pages, hosted by SharePoint Server

Big data

a collection of large, complex data sets, including structured and unstructured data, which cannot be analyzed using traditional database methods and tools

record

a collection of related data elements (in the MUSICIANS table, these include "3, Lady Gaga, gag.tiff, Do not bring young kids to live shows")

Enterprise Data Warehouse (EDW)

a data warehouse for the enterprise

star schema

a data-modeling technique used to map multidimensional decision support data into a relational database.

primary key

a field (or group of fields) that uniquely identifies a given record in a table. In the table RECORDINGS, the primary key is the field RecordingID that uniquely identifies each record in the table. Primary keys are a critical piece of a relational database because they provide a way of distinguishing each record in a table; for instance, imagine you need to find information on a customer named Steve Smith. Simply searching the customer name would not be an ideal way to find the information because there might be 20 customers with the name Steve Smith

Scorecards

a high-level snapshot of organizational performance; displays a collection of KPIs and the performance targets for those KPIs

data warehouse

a logical collection of information, gathered from many operational databases, that supports business analysis activities and decision-making tasks. Its primary purpose is to combine information, more specifically strategic information, throughout an organization into a single repository in such a way that the people who need that information can make decisions and undertake business analysis (collect information from multiple systems in a common location that uses a universal querying tool)

In an OLAP a cube is

a multidimensional data structure (actual or virtual) that allows fast analysis of data. It provides the capability to efficiently manipulate and analyze data from multiple perspectives and is aimed at overcoming a limitation of relational databases. An analyst can navigate through the database and screen for a particular subset of the data by changing the data's orientation and defining analytical calculations. It is not as good at handling large amounts of data as a standard relational format is.

foreign key

a primary key of one table that appears as an attribute in another table and acts to provide a logical relationship between the two tables
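
The primary key / foreign key relationship can be illustrated with Python's built-in `sqlite3` module. This is a minimal sketch: the table and column names RECORDINGS, TRACKS, RecordingID, TrackNumber, and TrackTitle come from these cards, while `RecordingTitle` and the sample rows are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE RECORDINGS (RecordingID INTEGER PRIMARY KEY, RecordingTitle TEXT)"
)
conn.execute(
    """CREATE TABLE TRACKS (
           TrackNumber INTEGER,
           TrackTitle  TEXT,
           RecordingID INTEGER REFERENCES RECORDINGS(RecordingID),
           PRIMARY KEY (RecordingID, TrackNumber)
       )"""
)

conn.execute("INSERT INTO RECORDINGS VALUES (1, 'Example Album')")
conn.execute("INSERT INTO TRACKS VALUES (1, 'Example Track', 1)")

def try_insert_orphan_track():
    """Try to add a track that points at a recording that does not exist."""
    try:
        conn.execute("INSERT INTO TRACKS VALUES (1, 'Orphan Track', 99)")
        return True
    except sqlite3.IntegrityError:
        # The foreign key constraint rejects the row.
        return False

print(try_insert_orphan_track())
```

The rejected insert is also an instance of a relational integrity constraint: a track cannot reference a nonexistent recording.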

Extraction, transformation, and loading (ETL)

a process that extracts information from internal and external databases, transforms it using a common set of enterprise definitions, and loads it into a data warehouse. The data warehouse then sends portions (or subsets) of the information to data marts
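
The three ETL steps can be sketched as three small functions. Everything here is hypothetical: the source rows, the `sales` table, and the "common enterprise definitions" (trimmed names, numeric amounts, upper-case country codes) are stand-ins, and an in-memory SQLite database plays the role of the warehouse.

```python
import sqlite3

def extract():
    """Extract: pull raw rows from a source system (hard-coded here)."""
    return [
        {"customer": " Steve Smith ", "amount": "19.99", "country": "us"},
        {"customer": "Ana Lopez", "amount": "5.00", "country": "MX"},
    ]

def transform(rows):
    """Transform: apply one common set of definitions to every row."""
    return [
        (row["customer"].strip(), float(row["amount"]), row["country"].upper())
        for row in rows
    ]

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(transform(extract()), warehouse)
print(warehouse.execute("SELECT * FROM sales").fetchall())
```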

Information cleansing or scrubbing (2 of 3 core concepts of data warehousing)

a process that weeds out and fixes or discards inconsistent, incorrect, or incomplete information

dimensional modeling is

a retrieval based system that supports high-volume query access.

Independent Data Mart

a small warehouse designed for a strategic business unit (SBU) or a department, but its source is not an EDW.

A data warehouse is

a specially constructed data repository where data are organized so that they can be easily accessed by end users for several applications.

What is PerformancePoint Dashboard Designer?

a tool that you can use to create dashboards, scorecards, and reports and then publish them to a SharePoint site; Dashboard Designer is part of PerformancePoint Services in MS SharePoint Server 2012

What is an operational data store (ODS)?

a type of database often used as an interim area for a data warehouse

Dashboard definition

a visual display of the most important information needed to achieve one or more objectives; consolidated and arranged on a single screen so the information can be monitored at a glance

BI-... a) tool b) solution c) product d) process

a) BI tools are generic software sold by vendors such as Oracle, SAP, Microsoft Dynamics, and Sage b) BI solutions are customized software deployed within organizations c) The BI product is the result of BI, where information and knowledge are created d) The BI process is how the organization obtains, analyzes, and distributes information and knowledge

input and output for Organizational Memory

a) Input: Data, information and knowledge is stored as events occur b) Output: accumulated information & knowledge about the past (not necessarily integrated)

Online Analytical Processing (OLAP), its goals and features

a) OLAP queries the data warehouse; responses are precalculated b) Organizes data into cubes c) Dimensions summarize data and can be drilled down hierarchically d) OLAP allows users to quickly manipulate the analytic results across the different dimensions, with no waiting for queries or calculations

why BI gets more important

a) exploding data volumes: large data collections can make decisions even more difficult b) complicated decisions: increasingly difficult because of 24/7 worldwide complex processes and a larger diversity of information required to make a decision c) need for quick reflexes: market influences cause quick changes, so a decision has to be made within a window of opportunity; delays arise when converting, integrating, or delivering information/knowledge d) technological progress: better tools for organizations (ERP, DW systems) and a need for data or text mining

drill down

access data that is in a lower level of a hierarchically structured database.

Middleware tools enable

access to the data warehouse. Power users such as analysts may write their own SQL queries.

Relational DBMS

allow multiple access queries.

Active Data warehousing (as opposed to traditional data warehousing)

supports large numbers of users, including operational staff. An active data warehouse is a repository of any form of captured transactional data, kept so that it can be used to find trends

A relational database management system

allows users to create, read, update, and delete data in a relational database. Although the hierarchical and network models are important, this text focuses only on the relational database model

Metric

an analytical measurement intended to quantify the state of a system

dynamic catalog

an area of a website that stores information about products in a database (dynamic website information)

decision support system

an information system that helps managers understand specific kinds of problems and potential solutions and analyze the impact of different decision options using what if scenarios

data warehouse

an integrated, subject-oriented, time-variant, nonvolatile collection of data that provides support for decision making.

data-driven website

an interactive website kept constantly updated and relevant to the needs of its customers using a database (especially useful when a firm needs to offer large amounts of information, products, or services. Can help limit the amount of information displayed to customers based on unique search requirements)

What is an oper mart?

an operational data mart

market basket analysis

analyzes such items as websites and checkout scanner information to detect customers' buying behavior and predict future behavior by identifying affinities among customers' choices of products and services
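
One simple way to look for affinities among product choices is to count how often pairs of items appear in the same basket. The sketch below assumes made-up checkout baskets and reports the support of each pair (the share of baskets containing both items); real market basket tools go further than this.

```python
from collections import Counter
from itertools import combinations

def pair_support(baskets):
    """Return, for every item pair, the fraction of baskets that contain both items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

# Hypothetical checkout-scanner data.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
support = pair_support(baskets)
print(support[("bread", "milk")])
```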

SaaS

application is hosted as a service

Oper marts

are created when operational data needs to be analyzed multidimensionally. The data for an oper mart come from an ODS.

issue of scaling

attributes may have to be scaled to prevent domination of one attribute
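
For example, an income attribute measured in tens of thousands would swamp an age attribute in any distance calculation. Min-max scaling is one common remedy (chosen here as an illustration, not prescribed by these notes); it maps every attribute onto the range 0 to 1.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the smallest is 0.0 and the largest is 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attributes on very different scales.
incomes = [20000, 50000, 80000]
ages = [25, 40, 55]
print(min_max_scale(incomes))
print(min_max_scale(ages))
```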

heterogeneous ensembles

base models stem from different prediction methods

Attributes

descriptive information about the dimensions

splitter/branching node

binary decision

In-Flow DS flow

capturing data from legacy system, validating to test data for reality, repairing to examine and build data, transforming for consolidation, applying to move and load data.

Standard Data mining format

cases -> variables

non-exclusive approaches

cases are assigned to some clusters with some probability (Gaussian mixture)

chord /kɔrd/ or circus chart

chord chart is already implemented in Power BI: A chord diagram is a graphical method of displaying the inter-relationships between data in a matrix. The data is arranged radially around a circle with the relationships between the points typically drawn as arcs connecting the data together.

Reporting

classic approach to serving managers' information needs

model

classification rules, decision tree, mathematical formulae

ensemble methods

combining multiple methods

data dictionary

compiles all of the metadata about the data elements in the data model

business intelligence

comprehensive, cohesive, and integrated set of tools and processes used to capture, collect, integrate, store, and analyze data with the purpose of generating and presenting information used to support business decision making.

Business-critical integrity

constraints enforce business rules vital to an organization's success and often require more insight and knowledge than relational integrity constraints (e.g., no product returns are accepted after 15 days past delivery, which makes sense because of spoilage of produce)

A data mart...

contains data on one topic (e.g., marketing). A data mart can be a replication of a subset of data in the data warehouse. Data marts are a less expensive solution that can be replaced by or can supplement a data warehouse. Data marts can be independent of or dependent on a data warehouse.

root node

contains all data

fact constellation

contains multiple fact tables that share many dimension tables

biggest pitfalls associated with real-time information

continual change

accuracy or Percentage correctly classified

correctly classified examples / all examples
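
Accuracy, together with the error rate and sensitivity defined on other cards in this set, can be computed from the four cells of a confusion matrix. The counts below are invented for the example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, error rate, and sensitivity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total    # correctly classified / all examples
    error_rate = (fp + fn) / total  # misclassified / all examples
    sensitivity = tp / (tp + fn)    # true positives / all actual positives
    return accuracy, error_rate, sensitivity

# Hypothetical counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
accuracy, error_rate, sensitivity = classification_metrics(40, 10, 45, 5)
print(accuracy, error_rate, round(sensitivity, 3))
```

Accuracy and error rate always sum to 1, since every example is either classified correctly or not.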

Machine-generated data

created by a machine without human intervention. Machine-generated structured data includes sensor data, point-of-sale data, and web log data

database management system (DBMS)

creates, reads, updates, and deletes data in a database while controlling access and security. Managers send requests to the DBMS, and the DBMS performs the actual manipulation of the data in the database

What is metadata

data about the data. in a data warehouse, metadata describe the contents of a data warehouse and the manner of its acquisition and use

Data integration uses three things:

data access, data federation (integration of business views across multiple data stores), and change capture (based on the identification, capture, and delivery of changes made to enterprise data sources).

What solutions does business intelligence provide

data access, storage, data analysis and visualization technologies to support better decision making

data mart (1 of 3 core concepts of data warehousing)

data mart contains a subset of data warehouse information. To distinguish between data warehouses and data marts, think of data warehouses as having a more organizational focus and data marts as having a functional focus

A web-server is backed by both a

data warehouse and an application server. used for ease of access, platform independence, and lower cost.

The federated data warehouse

data warehouse architecture involves integrating disparate systems and analytical resources from multiple sources to meet changing needs or business conditions.

data warehouse parts

data warehouse itself, data acquisition (back-end), client (front-end).

Business Advantages of a Relational Database 4) Increased Information Integrity (Quality)

database design needs to consider integrity constraints

physical view of information

deals with the physical storage of information on a storage device

dice

defines a subcube by performing a selection on two or more dimensions

business rule

defines how a company performs certain aspects of its business and typically results in either a yes/no or true/false answer. Stating that merchandise returns are allowed within 10 days of purchase is an example of a business rule

data quality audits

determine the accuracy and completeness of its data. Most organizations determine a percentage of accuracy and completeness high enough to make good decisions at a reasonable cost, such as 85 percent accurate and 65 percent complete.

several obstacles of BI introduction

a fitting BI solution is difficult to find, because solutions are often expensive and the benefits are rather long term; business processes are often not consistently defined; the BI needs of business users are difficult to identify

classification DDM

discrete dependent variable; continuous and/or discrete independent variables

classification vs. regression

discrete dependent variable vs. continuous dependent variable

Business Advantages of a Relational Database 1) Increased Flexibility

distinction between logical and physical views is important in understanding flexible database user views

measuring distance

e.g., Euclidean distance (slide 39)
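
As a sketch, the Euclidean distance between two cases is the square root of the summed squared differences of their attributes. The second function shows how a nearest neighbor classifier might use it; the example points and labels are made up.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equally long numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor_class(query, examples):
    """Return the label of the example closest to the query.

    examples is a list of (features, label) pairs.
    """
    _, label = min(examples, key=lambda ex: euclidean_distance(query, ex[0]))
    return label

print(euclidean_distance((0, 0), (3, 4)))
print(nearest_neighbor_class((1, 1), [((0, 0), "duck"), ((5, 5), "goose")]))
```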

Transactional information

encompasses all of the information contained within a single business process or unit of work, and its primary purpose is to support daily operational tasks (Organizations need to capture and store transactional information to perform operational tasks and repetitive decisions such as analyzing daily sales reports and production schedules to determine how much inventory to carry)

Analytical information

encompasses all organizational information, and its primary purpose is to support the performance of managerial analysis tasks (Analytical information is useful when making important decisions such as whether the organization should build a new manufacturing plant or hire additional sales personnel. Analytical information makes it possible to do many things that previously were difficult to accomplish, such as spot business trends, prevent diseases, and fight crime; identify many unusual trends)

primary concepts of the relational database model

entities, attributes, keys, and relationships

technologies used for information integration

environmental scanning: events, trends, relationships, and the external environment that could influence the company (law changes, new technology, competitors); text mining: "reading" and analyzing text written in natural language; web mining: searching the web (forums, social media) and online text; RFID: information regarding the location of goods

Dirty data

erroneous or flawed data (complete removal of dirty data from a source is impractical or virtually impossible) dirty data is a business problem, not an MIS problem

exclusive approaches

every case is assigned to exactly one cluster (k-means)
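
The exclusive nature of k-means shows up in its assignment step: each case goes to the single nearest centroid. The sketch below covers only that one step, with hypothetical cases and centroids; a full k-means run would alternate it with recomputing the centroids.

```python
def assign_to_clusters(cases, centroids):
    """Assign every case to exactly one cluster: the one with the nearest centroid."""
    assignments = []
    for case in cases:
        # Squared Euclidean distance is enough for finding the minimum.
        distances = [
            sum((c - m) ** 2 for c, m in zip(case, centroid))
            for centroid in centroids
        ]
        assignments.append(distances.index(min(distances)))
    return assignments

cases = [(1, 1), (9, 9), (2, 1)]
centroids = [(0, 0), (10, 10)]
print(assign_to_clusters(cases, centroids))
```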

Specialized software tools

exist that use sophisticated procedures to analyze, standardize, correct, match, and consolidate data warehouse information

ETL

extract, transform, load

data scientist

extracts knowledge from data by performing statistical analysis, data mining, and advanced analytics on big data to identify trends, market changes, and other relevant information

Error rate

(false positives + false negatives) / all examples

Comparison Query Performance

fast for multidimensional databases; slow for relational databases as complexity increases

Advanced analytics

focuses on forecasting future trends and producing insights using sophisticated quantitative methods, including statistics, descriptive and predictive data mining, simulation, and optimization (uses data patterns to make forward-looking predictions to explain to the organization where it is headed)

logical view of information

focuses on how individual users logically access information to meet their own particular business needs

Starnet abstraction level

footprint

post-pruning

fully grown tree - complexity

Indicators

graphical symbols used in KPIs to show whether performance is on or off target (e.g. stoplight symbols)

Structured data

has a defined length, type, and format and includes numbers, dates, or strings such as Customer Address. (typically stored in a traditional system such as a relational database or spreadsheet and accounts for about 20 percent of the data that surrounds us)

determining number of clusters

heuristic approach: check which number of clusters yields the best objective value with the least effort

approaches of clustering

hierarchical (agglomerative and divisive) vs. non-hierarchical (exclusive and non-exclusive)

DBMS use three primary data models for organizing information

hierarchical, network, and the relational database, the most prevalent

Real-time information

immediate, up-to-date information

Dynamic information

includes data that change based on user actions. For example, static websites supply only information that will not change until the content editor changes the information. Dynamic information changes when a user requests information. A dynamic website changes information based on user requests such as movie ticket availability, airline prices, or restaurant reservations

Static information

includes fixed data incapable of change in the event of a user action

Filters

individual dashboard items that enable dashboard users to focus on specific information (e.g. geography filter enabling a user to view information for a specific geographical region)

multidimensional cube is

inflexible and does not support the ad hoc creation of multidimensional views of the products, services, and customers. Can't handle more than 30 gigabits of data.

Inmon vs. Kimball

Inmon: top-down, enterprise-wide, complex, subject-driven, low end-user accessibility, aimed at IT professionals. Kimball: bottom-up, simple method, data marts, process-oriented, dimensional modeling, high end-user accessibility.

Intra Cluster

within a cluster

Federated data warehouse

integrates analytical resources from multiple sources to meet changing needs or business conditions.

types of Data Sources

internal Data sources vs. external data sources

snowflake schema

is a logical arrangement of tables in a multidimensional database in such a way that the entity relationship diagram resembles a snowflake in shape.

Enterprise information integration (EII)

is a mechanism for pulling data from source systems to satisfy a request for information. It is an evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases.

Dependent Data Mart

is a subset that is created directly from the data warehouse. It has the advantage of using a consistent data model and providing quality data. A dependent data mart ensures that the end user is viewing the same version of the data that is accessed by all other data warehouse users. The high cost of data warehouses limits their use to large companies.

Unstructured data

is not defined, does not follow a specified format, and is typically free-form text such as emails, Twitter tweets, and text messages (Unstructured data accounts for about 80 percent of the data that surrounds us)

drill down

less detailed -> more detailed - stepping down a concept hierarchy or introducing additional dimensions (country -> state -> city)

prediction methods

linear vs. non-linear, parametric vs. non-parametric, homogeneous vs. heterogeneous, individual vs. ensemble

Data models

logical data structures that detail the relationships among data elements by using graphics or pictures

What is used to develop probabilistic models between one or more explanatory/predictor variables and a class/response variable?

Logistic regression

database

maintains information about various types of objects (inventory), events (transactions), people (employees), and places (warehouses) (store information) (core component of any system, regardless of size, is a database and a database management system)

Data warehousing used primarily to help

make informed decisions.

Relational Databases are not well suited for

manipulating records. Relational databases support a lot of data, support dynamic joining of data, and are a proven technology, but their performance is less than optimal and they cannot be used for purely optimized processing.

Aims of clustering

maximize intra-cluster homogeneity, maximize inter-cluster heterogeneity

Information integrity

measure of the quality of information

Dimensional modeling

a retrieval-based system that supports high-volume query access.

Data visualization tools

move beyond Excel graphs and charts into sophisticated analysis techniques such as controls, instruments, maps, time-series graphs, and more Data visualization tools can help uncover correlations and trends in data that would otherwise go unrecognized

Comparison Data preparation Time for Query

multidimensional data type fast at complexity,

nominal

multiple categories, no order

clustering DDM

no dependent variable, continuous and/or discrete (independent) variables

pre-pruning

not fully grown tree - disadvantages: consider focal node only - how to collect parameters (maximal depth)

Information integrity issues

occur when a system produces incorrect, inconsistent, or duplicate data (can cause managers to consider the system reports invalid and make decisions based on other sources)

Information inconsistency

occurs when the same data element has different values

Analysis paralysis

occurs when the user goes into an emotional state of over-analysis (or over-thinking) a situation so that a decision or action is never taken, in effect paralyzing the outcome In the time of big data, analysis paralysis is a growing problem. One solution is to use data visualizations to help people make decisions faster

How are data warehouses different from operational databases

operational databases are more product oriented, whereas data warehouses use subject orientation to give a more comprehensive view of the organization.

concept hierarchy

parent-child relationship among members of a dimension

Master data management (MDM)

practice of gathering data and ensuring that it is uniform, accurate, consistent, and complete, including such entities as customers, suppliers, products, sales, employees, and other critical entities that are commonly integrated across organizational systems

leaf node

prediction

Data mining takes analysis further by sifting through a large amount of data to find information using algorithms such as:

predictive modeling, database segmentation, link analysis, deviation detection.

Infographics

present the results of data analysis, displaying the patterns, relationships, and trends in a graphical format (exciting and quickly convey a story users can understand without having to analyze numbers, tables, and boring charts)

Distributed computing

processes and manages algorithms across many machines in a computing environment

simple classifiers

prototype based methods - rote learner (exact match) - nearest neighbor

Reports

provide access to interactive and static data in a variety of forms (e.g. analytic chart, analytic grid, Excel services, KPI details, web page)

OLAP tools

provide data access to end users. allow a user to "drill-down" into their data to view it at whatever level of detail they need.

Real-time systems

provide real-time information in response to requests. Many organizations use real-time systems to uncover key corporate transactional information

Metadata

provides details about data. (Metadata about an image could include its size, resolution, and date created. Metadata about a text document could contain document length, date created, author's name, and summary)

continuous variables

quantitative variables https://statistics.laerd.com/statistical-guides/types-of-variable.php

Two primary tools are available for retrieving information from a DBMS

query-by-example (QBE) tool and a structured query language (SQL)

n-fold cross-validation

randomly split data into n samples; in each round, 1 sample is used for model validation and the other n-1 for building the model
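
A minimal sketch of the splitting step, using only the standard library: the cases are shuffled, dealt into n folds, and each fold serves once as the validation sample while the remaining folds are used to build the model. The function name and the fixed seed are choices made for this example.

```python
import random

def n_fold_splits(num_cases, n, seed=0):
    """Yield (training_indices, validation_indices) for each of the n folds."""
    indices = list(range(num_cases))
    random.Random(seed).shuffle(indices)  # random split, reproducible via the seed
    folds = [indices[i::n] for i in range(n)]
    for i in range(n):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

for training, validation in n_fold_splits(num_cases=10, n=5):
    print(len(training), "cases for building,", len(validation), "for validation")
```

Every case ends up in a validation sample exactly once, so the n validation results together cover the whole data set.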

Data governance

refers to the overall management of the availability, usability, integrity, and security of company data

simulation of model applications

resubstitution estimate, split sample, N-fold cross validation

internal node

result of branching node

association detection

reveals the relationship between variables along with the nature and frequency of the relationships

pivot

rotates or inverts the data axes in view; goal: an alternative presentation of the data

Relational integrity constraints

rules that enforce basic and fundamental information-based constraints. For example, a relational integrity constraint would not allow someone to create an order for a nonexistent customer, provide a markup percentage that was negative, or order zero pounds of raw materials from a supplier

Integrity constraints

rules that help ensure the quality of information

resubstitution estimate

same data for estimation and assessment (single sample approach)

Machine-generated unstructured data

satellite images, scientific atmosphere data, and radar data

Business Advantages of a Relational Database 2) Increased Scalability and Performance

scalable to handle the massive volumes of information, the large numbers of users expected for the launch of the website, and need to perform quickly under heavy use

slice

selection of a single value on one dimension, resulting in a smaller cube (a slice)
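
To make the idea concrete, a cube can be modeled as a dictionary keyed by one value per dimension. The cube below (product, region, quarter, with units sold as the measure) is invented for the example; slicing fixes one dimension to a single value and leaves a smaller cube over the remaining dimensions.

```python
# Hypothetical cube: (product, region, quarter) -> units sold.
cube = {
    ("laptop", "north", "Q1"): 10,
    ("laptop", "south", "Q1"): 7,
    ("laptop", "north", "Q2"): 12,
    ("phone", "north", "Q1"): 20,
    ("phone", "south", "Q2"): 15,
}

def slice_cube(cube, dim_index, value):
    """Keep only cells where one dimension equals value, and drop that dimension."""
    return {
        key[:dim_index] + key[dim_index + 1:]: measure
        for key, measure in cube.items()
        if key[dim_index] == value
    }

# Slice on the quarter dimension (index 2): only Q1 remains.
q1 = slice_cube(cube, 2, "Q1")
print(q1)
```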

training set

set of tuples used for model construction

star schema

simplest form of dimensional modeling. Contains a central fact table surrounded by and connected to several dimension tables. The fact table contains a large number of rows that correspond to observed facts and external links.

slice and dice

to divide a quantity of information into smaller parts, especially in order to analyze it more closely or in different ways.

MOLAP

specialized database physically storing data in multidimensional form

split sample

split data into two sets: one for estimation and one for model assessment

ROLAP

star or snowflake schema in relational database

snowflake schema

star schema with normalization

The growing demand for real-time information

stems from organizations' need to make faster and more effective decisions, keep smaller inventories, operate more efficiently, and track performance more carefully

types of measures

stored vs. calculated

entity (also referred to as a table)

stores information about a person, place, thing, transaction, or event (ex. TRACKS, RECORDINGS, MUSICIANS, and CATEGORIES) -columns, attributes, fields-> (supplier, inventory, materials, distribution)

relational database model

stores information in the form of logically related two-dimensional tables

Roll up

summarize data by climbing up hierarchy or dimension reduction - day -> month -> quarter
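
Climbing from day to month can be sketched as a simple aggregation. The daily figures are made up, and the sketch assumes ISO-formatted date strings so that the first seven characters identify the month.

```python
from collections import defaultdict

# Hypothetical daily sales keyed by ISO date (YYYY-MM-DD).
daily_sales = {"2024-01-15": 100, "2024-01-20": 50, "2024-02-03": 70}

def roll_up_to_month(daily):
    """Summarize daily values one level up the time hierarchy (day -> month)."""
    monthly = defaultdict(int)
    for day, amount in daily.items():
        monthly[day[:7]] += amount  # "2024-01-15" -> "2024-01"
    return dict(monthly)

print(roll_up_to_month(daily_sales))
```

Drill down is the reverse direction: starting from the monthly totals and returning to the daily detail behind them.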

Uses for real-time location intelligence

targeting right customer based on their behavior over geographic locations

Data visualization

technologies that allow users to see or visualize data to transform information into a business perspective Data visualization is a powerful way to simplify complex data sets by placing data in a format that is easily grasped and understood far quicker than the raw data alone

Human-generated unstructured data

text messages, social media data, and emails

structured query language (SQL)

a language that asks users to write lines of code to answer questions against a database

information cube

the common term for the representation of multidimensional information

retention /rɪˈtɛn ʃən/

the continued possession, use, or control of something. Membership retention, pro-mentorship, retain, the meeting,

Attributes (also called columns or fields)

the data elements associated with an entity (attributes for the entity TRACKS are TrackNumber, TrackTitle, TrackLength, and RecordingID. Attributes for the entity MUSICIANS are MusicianID, MusicianName, MusicianPhoto, and MusicianNotes)

Information redundancy Business Advantages of a Relational Database 3) Reduced Information Redundancy

the duplication of data, or the storage of the same data in multiple places (can cause storage issues along with data integrity issues, making it difficult to determine which values are the most current or most accurate. Employees become confused and frustrated when faced with incorrect information causing disruptions to business processes and procedures. One primary goal of a database is to eliminate information redundancy by recording each piece of information in only one place in the database)

Information granularity /ˈgræn yə lər/

the extent of detail within the information (fine and detailed or coarse and abstract)

content creator

the person responsible for creating the original website content

content editor

the person responsible for updating and maintaining website content

Data mining

the process of analyzing data to extract information not offered by the raw data alone (can also begin at a summary information level (coarse granularity) and progress through increasing levels of detail (drilling down) or the reverse (drilling up))

extraction, transformation, and loading (ETL)

the processes used in a data warehouse. It includes extracting data from outside sources, transforming it to fit operational needs, and loading it into the end target (database or data warehouse)

multidimensional databases lack

the scalability and flexibility for DSS

data element (or data field)

the smallest or basic unit of information (can include a customer's name, address, email, discount rate, preferred shipping method, product name, quantity ordered, and so on)

Time-series information

timestamped information collected at a particular frequency

performing extensive ETL (extraction, transformation, load)

to move data to the data warehouse may be a sign of poorly managed data and a fundamental lack of a coherent data management strategy.

Why do we need BI?

to support better decision making and to increase the organizational knowledge base

query-by-example (QBE)

tool that helps users graphically design the answer to a question against a database

Business intelligence dashboards

track corporate metrics such as critical success factors and key performance indicators and include advanced capabilities such as interactive controls, allowing users to manipulate data for analysis. The majority of business intelligence software vendors offer a number of data visualization tools and business intelligence dashboards

two primary types of information

transactional and analytical

sensitivity

true positives / (true positives + false negatives)
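The formula can be checked with a short sketch. The function name `sensitivity`, the `positive` parameter, and the sample labels are assumptions made for this example.

```python
def sensitivity(y_true, y_pred, positive=1):
    """Return TP / (TP + FN): the share of actual positives that were found."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("sensitivity is undefined without positive examples")
    return tp / (tp + fn)


# Hypothetical labels: four actual positives, three of them predicted correctly.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
score = sensitivity(y_true, y_pred)
print(score)
```

False positives (the fifth example) do not affect sensitivity, because the denominator counts only the actual positives.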

ordinal

two or more categories with a meaningful order (e.g., bad, normal, good)

Multidimensional Database

usually contains a star model. Designed for slice-and-dice and drill-down analysis. Highly indexed. Provides data mining and drill-down capabilities.

Distributed database management system

would pull the requested data from databases across the organization, bring all the data back to the same place, and then consolidate it, sort it, and do whatever else was necessary to answer the user's question. The islands-of-data problem still existed.

differences between BI and other information technologies like: a) knowledge management b) data warehousing c) data mining d) decision support systems

BI: works with all kinds of data; takes data and information as input and results in *NEW* knowledge --- The others focus mainly on internal, structured data: a) Knowledge Management: takes information and knowledge as input and uses the existing knowledge optimally b) Data Warehousing: ETL obtains data from multiple systems and stores it in a single repository c) Data Mining: discovers hidden patterns in data and produces information d) Decision Support System: supports making an appropriate decision

List and describe the major components of BI.

· Architectures · Database tools · Analysis tools · Applications · Methodologies

