Information-as-a-Service (IaaS)
"Information on Demand" Goal is to make information available quickly to people, processes, and applications across the business
Financial KPIs
"What are the economic consequences of the organization's past actions?" Examples: operating income, expenses, return on capital, profit margin, cash flow, economic value added
Business Process KPIs
"What are the existing and emerging internal business processes in which the supply chain organization must excel?" Examples: efficiency, cost, throughput, quality, effectiveness
Learning and Growth KPIs
"What infrastructure is needed to foster long-term growth and improvement?" Examples: employee satisfaction, employee retention, skill sets, education and training, information technology
Customer KPIs
"What value proposition is delivered to key customer segments?" Examples: customer satisfaction, customer retention, customer acquisition, market share in target segments, valued services
nearest neighbor classifier
"if it walks like a duck, quacks like a duck, then it is probably a duck" - identify the prevailing class among neighboring examples
reasons for many different methods
- "no free lunch" theorem (no algorithm is superior) - different merits and demerits
star schema
- 1 fact table - x dimensions
Relational databases: advantages
- ACID - atomicity (all or nothing) - consistency - isolation - durability
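A minimal sketch of the "all or nothing" property, assuming SQLite through Python's standard `sqlite3` module. The `account` table, its `CHECK` constraint, and the `transfer` helper are hypothetical and only serve the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))


print(transfer(conn, "a", "b", 150), balances(conn))  # overdraft: rolled back
print(transfer(conn, "a", "b", 40), balances(conn))   # valid: committed
```

In the failed transfer the credit to `b` is undone together with the rejected debit, so no partial state survives.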
Dimensions of Model Performance
- Accuracy - scalability - robustness - comprehensibility - justifiability - calibration
Stage Choice
- Analysis of the selection and alternatives - Implementation - Select a particular course of action
NoSQL BASE
- Basically Available (replication of data among many different storage servers) - Soft state (NoSQL allows temporarily inconsistent data) - Eventually consistent (NoSQL ensures a consistent state at some future point)
What are the best practices in dashboard design?
- Benchmark Key Performance Indicators (KPIs) with industry standards -Wrap the dashboard metrics with contextual metadata -Validate the dashboard design by a usability specialist -Prioritize and rank alerts/expectations streamed to the dashboard - Enrich dashboard with business-user comments - Present information in three different levels - Pick the right visual construct using dashboard design principles - Provide for guided analytics
Properties of a distance measure
- D(A,B) >= 0 -> non-negativity - D(A,B) = D(B,A) -> symmetry - D(A,B) = 0 if and only if A = B -> identity - D(A,B) <= D(A,C) + D(C,B) -> subadditivity, triangle inequality
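As a hedged illustration, the four properties can be checked numerically for a candidate distance function on a handful of sample points. The helper name `check_metric` and the sample points are assumptions; passing the check on samples does not prove the properties in general.

```python
import math
from itertools import product


def check_metric(dist, points, tol=1e-12):
    """Check the four distance properties on every triple of sample points."""
    for a, b, c in product(points, repeat=3):
        assert dist(a, b) >= 0                               # non-negativity
        assert abs(dist(a, b) - dist(b, a)) <= tol           # symmetry
        assert (dist(a, b) <= tol) == (a == b)               # identity
        assert dist(a, b) <= dist(a, c) + dist(c, b) + tol   # triangle inequality
    return True


points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]
print(check_metric(math.dist, points))  # Euclidean distance passes
```

Squared Euclidean distance, by contrast, fails the triangle inequality on collinear points such as (0, 0), (1, 0), (2, 0).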
Challenges for RDBMS
- Diversity - Connectivity - Data Size
Staging Area
- Information integration (temporary intermediate storage) - place where data is transformed
Business Intelligence (Definition)
- Concepts and methods for improving business decisions using fact-based support systems. + Transformation of data into usable information + Enabling easy interpretation of large volumes of data
business application of association rule
- Market basket analysis - web usage mining
relational model
- RDBMS - implemented as two dimensional models - queries in SQL
BI Front end technologies
- Reporting - Portals&Dashboards - Data mining - OLAP Front END
Steps KDD Process
- Selection, Pre-Processing, Transformation, Data Mining, Interpretation/Evaluation
Cloud Computing service layer
- Services (complete business service) - Application (cloud based software) - Development (software development platform) - Platform (cloud based platforms) - Storage (data storage) - Hosting (physical data centers)
reporting types
- Standard (fixed frequency) - Event driven - Ad Hoc (user request)
Stage Intelligence
- Search the environment for conditions that call for a decision - Problem definition
BI Systems Data Type and Origin
- Time: present & past - horizon: mid/Long term - granularity: detailed and aggregated
BI Systems Purpose and Target Audience
- User Group: focused - Focus: Business Object
BI Portals & Dashboards
- Visualization and data access - Reduce search costs - All information in one place
Virtual workspace
- abstraction of an execution environment that can be made dynamically available - resource limited - flexible software configuration
relevance of model assessment
- accountability - informed modeling decisions - nature of forecasting
Data Mining
- algorithm centric - characteristics: automated, discover novel patterns - methods: predictive, descriptive
advantages of N-fold cross validation
- approximates generalization error - increased robustness - less variance than split sample - uses all examples for model building
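A minimal sketch of N-fold cross validation, showing why every example is used for model building and exactly once for testing. The function names, the interleaved fold assignment, and the toy "mean" model are assumptions for illustration.

```python
def n_fold_splits(n_examples, n_folds):
    """Yield (train_indices, test_indices); each example is tested exactly once."""
    indices = list(range(n_examples))
    for fold in range(n_folds):
        test = indices[fold::n_folds]            # every n_folds-th example
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def cross_validate(examples, n_folds, fit, error):
    """Average the held-out error over all folds."""
    fold_errors = []
    for train_idx, test_idx in n_fold_splits(len(examples), n_folds):
        model = fit([examples[i] for i in train_idx])
        fold_errors.append(error(model, [examples[i] for i in test_idx]))
    return sum(fold_errors) / len(fold_errors)


# Toy usage: the "model" is just the mean of the training targets.
data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
fit_mean = lambda train: sum(train) / len(train)
mean_abs_error = lambda model, test: sum(abs(y - model) for y in test) / len(test)
print(cross_validate(data, 3, fit_mean, mean_abs_error))
```

Averaging over folds is what reduces the variance compared with a single split sample.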
KDD Process Interpretation/Evaluation
- assessment of derived patterns (validity, reliability, originality) - next steps
Virtual Machines
- abstraction of a physical host machine - hypervisor intercepts and emulates instructions from the VM
federated way of integrating heterogeneous databases
- build wrapper on top of db - ad-hoc approach
unsupervised learning
- build analytic models without a response variable
agglomerative clustering
- bottom-up approach - every case starts as its own cluster - merge cases to form larger clusters
algorithms
- C4.5 (continuous and categorical independent variables, categorical target variable) - CART (continuous and categorical independent variables, categorical and continuous target variables) - CHAID (categorical independent variables, categorical target variables)
business application of classification
- churn prediction, direct mail, defect prediction/quality management, credit scoring, acceptance scoring
Enterprise Data Warehouse
- collects information about subjects, spanning the entire organisation - corporate-wide data integration
indicators for measuring accuracy
- compare predicted to actual responses - regression (mean absolute error, mean squared error, root-mean-square error) - classification (classification error, percentage correctly classified, precision, recall)
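The indicators named on this card can be computed as follows. This is a sketch under assumptions: the function names are illustrative, and a binary setting with a designated positive class is assumed for precision and recall.

```python
import math


def regression_errors(actual, predicted):
    """Mean absolute error, mean squared error and root-mean-square error."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    mse = sum(r * r for r in residuals) / len(residuals)
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}


def classification_scores(actual, predicted, positive=1):
    """Percentage correctly classified, classification error, precision, recall."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    pcc = sum(a == p for a, p in pairs) / len(pairs)
    return {
        "pcc": pcc,
        "error": 1 - pcc,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


print(regression_errors([3, 5, 7], [2, 5, 9]))
print(classification_scores([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
```

Note how the squared-error measures penalize the single large residual more heavily than the mean absolute error does.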
scalability
- consumption of time resources - memory resources - sensitivity with respect to parameters - parallelization important
Regression DMM
- continuous dependent variable - continuous and/or discrete independent variables
KDD Process pre-processing
- conversion to standard analysis format - exploratory data analysis - aggregation
Data accuracy
- correct - unambiguous - consistent - complete
Metadata
- data about data - define warehouse objects - (source, time, missing fields, ...)
ETL Process
- data is identified and extracted - transformation for consistency - transported to DWH
Cloud storage
- data storage capacity hired out to others - remotely, temporarily cached on desktop computers
Graph Database
- dealing with highly interconnected data - nodes and relationships can have properties - strength: traversing through the nodes by relationships - Neo4j
columnar
- design of data is a column - adding columns is quite inexpensive, is done row by row - each row can have different columns - Hbase, Cassandra, Hypertable
assessing cluster solutions
- divisive: decrease heterogeneity within a cluster - agglomerative: increase heterogeneity within a cluster
indicators to measure predictive accuracy
- differ across modeling framework - emphasize different notion of performance - focus on classification in following
k-means problems
- different starting centroids result in different clusters - hitting the true clusters is difficult - centroids may readjust themselves
association rule DMM
- discover rules of the form "IF A then B" - Sequence mining also looks for time-dependent relationships
KDD Process Selection
- documentation of available data - review of data quality, availability over time, granularity - selection of data for further analyses
levels
- each level represents a position in the concept hierarchy - all - category - subcategory - product
post-processing
- eliminate smallest clusters (outliers) - split high SSE clusters (loose clusters) - merge close clusters with low SSE
NOSQL Properties (Base)
- eventual consistency - basically available - soft state
MOLAP Drawback
- extra cost for a separate MDDB - increased learning curve - processing step can be quite lengthy - MOLAP tools have difficulty querying models with high cardinality
MOLAP Benefits
- fast - smaller due to compression - automated computation of higher level aggregates of the data - compact for low dimensional data sets - "natural indexing" - power and ease of analytical calculations
Extraction caveats (precautions)
- fast as possible - small as possible - infrequent as possible - changes in source system as small as possible
splitting rule
- for each splitting attribute, find best split (
distance measurement
- formal way to quantify similarity/dissimilarity (intra cluster distance, inter cluster distance)
external data sources
- from outside of the organization - marketing research, competitive information, economic forecast, web2.0, edi, efid, epcis, purchased databases
Reporting engines
- graphical analysis - derive personal performance indicators - used by IT departments and business analysts
KDD Process transformation
- handling of missing values - data reduction - encoding and projection
Why a separate data warehouse
- high performance for both systems - different functions and data
business application of clustering
- identification of homogeneous customer segments - document clustering - fraud detection
Disadvantages split sample method
- inefficient - high variance
OLAP concept hierarchies
- interactive data analysis - multiple dimensions defined by concept hierarchies - view data from different perspectives and aggregation levels
recursive partitioning approach
- local optimum - search next important variable - split tree accordingly and create a branch for each split value
Applications of cluster analysis
- marketing (understanding of customer populations, mass customization, identifying new products, classification of customers) - text analysis - fraud/anomaly detection
emphasize different notion of performance
- mean absolute error vs. mean squared error
Problems of data driven classifiers
- mislabeling - -> use the n nearest neighbors
solution of initial centroids problem k-means
- multiple runs - sample and use hierarchical clustering to determine centroids - more than k centroids - take the best
Data warehouse = integration in advance
- multiple sources are integrated in dwh - high performance
Stage Design
- Possible consequences of the decision - Development of possible solutions
classification of new examples
- nearest best match - no model estimation - only usage of available data (making regions of classes)
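A minimal sketch of classifying a new example by its nearest neighbors: no model is estimated, and the stored examples are used directly. The data layout (pairs of feature vector and label), Euclidean distance, and the name `knn_classify` are assumptions.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Return the prevailing class among the k training examples closest to query."""
    # Sort the stored examples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda example: math.dist(example[0], query))
    # Majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


train = [
    ((1.0, 1.0), "duck"), ((1.2, 0.9), "duck"), ((0.9, 1.3), "duck"),
    ((5.0, 5.0), "goose"), ((5.2, 4.8), "goose"),
]
print(knn_classify(train, (1.1, 1.0), k=3))  # prints "duck"
```

Voting over several neighbors instead of only the closest one makes the result less sensitive to a single mislabeled example.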
Cloud Computing
- network based computing
ROLAP Benefits
- no extra costs for a proprietary system - familiar relational DBA skills and tools are used
Examples Data anomalies
- no unique key - naming, coding - meaning between groups - spelling - missing values - multiple encodings - multiple local standards - multiple names
pre processing
- normalize data - eliminate outliers
NOSQL
- not only using SQL - no fixed schema, allowing fields to be added to any record without constraints - often: open source - designed to work on large clusters
relational db data types
- numeric, strings, dates, uninterpreted, blobs
Cloud computing on demand
- on demand services - pay for use
Real World Server Architectures
- one ETL, two db servers (clustered), two report servers, 1 Olap, 1 DM Server
Key/ Value
- pairs key to values - very high performance
comprehensibility
- prediction vs. insight - manager trust (or distrust of black-box models) - difficult to measure
homogeneous ensemble
- produce base models with one prediction method
Starnet
- querying multidimensional databases using concept hierarchies
Olap Query Characteristics
- read access to large amount of data - analysis of data relations - analysis of data by time - display of data across different dimensions - complex calculations - quick response
robustness
- real world data is "noisy" (incomplete, incorrect) - real-world phenomena change over time - affects the model (while building, afterwards)
advantages split sample method
- real world simulation of prediction model - easy to implement - fast - approximates generalization error
reasons for staging areas
- reduced load on operational systems&dwh - backup and recovery - auditing
data mining models
- regression - classification - clustering - association rule and sequential pattern mining
differ across modeling framework
- regression - classification - others
Data Quality
- relevant - useful - accurate - accessible
disadvantages consumption
- resource consumption
approach to simulate real-life-application
- resubstitution estimate - split sample estimate - cross validation estimate
Advantages of virtual machines
- run operating systems where physical hardware is out of reach - easier backup, creation of machines - test software on clean installation - multiple OS possible (at one time) - debug problems - easy migration - run legacy systems
Relational databases: limits
- scalability (scaling across multiple servers; joins over servers are difficult) - complexity (mapping data to tables is complex and slow) - SQL can only work with structured data
KDD Process Data Mining
- selection of data mining model - selection of data mining method (algorithm) - development/estimation of the model
linear classifiers
- separation by a "line" - good and bad part (over/under line)
non linear classifiers
- separation by a curve - good and bad part (over/under curve)
strategies of distance measuring
- single linkage (closest objects) - complete linkage (farthest objects) - average linkage (mean of all pairwise distances) - centroid method (distance between centroids)
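The four strategies can be sketched as functions that measure the distance between two clusters. Euclidean distance and the function names are assumptions; clusters are plain lists of coordinate tuples.

```python
import math
from itertools import product


def pairwise(c1, c2):
    """All distances between a point of c1 and a point of c2."""
    return [math.dist(a, b) for a, b in product(c1, c2)]


def single_linkage(c1, c2):
    return min(pairwise(c1, c2))        # closest pair of objects


def complete_linkage(c1, c2):
    return max(pairwise(c1, c2))        # farthest pair of objects


def average_linkage(c1, c2):
    distances = pairwise(c1, c2)
    return sum(distances) / len(distances)  # mean of all pairwise distances


def centroid(cluster):
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def centroid_linkage(c1, c2):
    return math.dist(centroid(c1), centroid(c2))  # distance between centroids


a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
for strategy in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(strategy.__name__, round(strategy(a, b), 3))
```

An agglomerative procedure would merge, at each step, the two clusters with the smallest value under the chosen strategy.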
ROLAP Drawbacks
- slower performance - difficult implementation of some calculations - increased workload for it and end users
Types of Reporting
- standard - event driven - ad hoc
justifiability
- Does the model agree with prior assumptions?
Document orientated
- stores documents - document = hash = types - MongoDB, CouchDB
Data Mart
- subset of EDW for a specific user group - selected subject - independent vs. dependent data marts
evaluating k-means clustering
- sum of squared error - error = distance to centroid - multiple solutions: prefer smallest SSE
Data warehouse
- subject-oriented, integrated, time-variant, persistent collection of data to support management in its decision-making processes.
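A short sketch of the evaluation rule on this card: compute the sum of squared errors (SSE) for each candidate clustering and prefer the smallest. The representation of a solution as a pair of centroid list and assignment list is an assumption.

```python
import math


def sse(points, centroids, assignment):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    return sum(
        math.dist(point, centroids[cluster]) ** 2
        for point, cluster in zip(points, assignment)
    )


points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
# Two hypothetical k-means solutions: (centroids, cluster index per point).
solution_a = ([(0.0, 0.5), (4.0, 0.5)], [0, 0, 1, 1])
solution_b = ([(2.0, 0.0), (2.0, 1.0)], [0, 1, 0, 1])

best = min([solution_a, solution_b], key=lambda s: sse(points, *s))
print(sse(points, *solution_a), sse(points, *solution_b))
```

Solution A groups the left and right pairs and has the far smaller SSE, so it is the one kept.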
operational System Data Type and Origin
- time: present - horizon: short term - granularity: detailed
Divisive clustering
- top down approach - iteratively split clusters into smaller sub clusters
operational Systems Purpose and target audience
- user Group: large and heterogeneous - Focus: Business processes
Online Analytical Processing (OLAP)
- user centric - Multi dimensional - interactive - requires some IT-Skills affinity
BI Systems technology
- users: few - Access frequency: low - usage pattern: irregular - Response time: seconds - data volume: high - data updates: rare - storage: redundant - critical factors: database size, data quality
operational System technology
- users: many - Access frequency: high - usage pattern: constant - Response time: milliseconds - data volume: low - data updates: often - storage: normalized tables - critical factors: performance, concurrency, response time, fault tolerance
Mapping
- which operational attributes? - how to transform those? - mapping to dimensional models
k-means
- set random centroids - assign each point to its nearest centroid - compute new centroids (Euclidean) - repeat from step 2 until the centroids no longer move
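The four steps on this card can be sketched as below. Assumptions: initial centroids are drawn from the data points, a seeded random generator is used for repeatability, an empty cluster keeps its old centroid, and `max_iter` is a safety limit that the card does not mention.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: random initial centroids
    assignment = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance).
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep centroid
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = k_means(points, k=2)
print(centroids, assignment)
```

Different seeds give different starting centroids, which is why several runs are usually compared.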
What are the critical success factors for Big Data analytics?
-A clear business need (alignment with the vision and the strategy) -Strong, committed sponsorship (executive champion) -Alignment between the business and IT strategy -A fact-based decision making culture -A strong data infrastructure
What is MapReduce? What does it do? How does it do it?
-A technique to distribute the processing of very large multi-structured data files across a large cluster of machines. -Aids organizations in processing and analyzing large volumes of multi-structured data. -Reads the input file and splits it into multiple pieces. These splits are then processed by multiple map programs running in parallel on the nodes of the cluster. -Example: each map program groups the data in its split by type of geometric shape; the reduce program takes the output from each map program and calculates the total number of each type of geometric shape.
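A single-machine sketch of the geometric-shape example: the input is split, each split is counted by a map step, and a reduce step sums the partial counts. The function names are hypothetical, and the map calls run sequentially here, whereas a real cluster would run them in parallel on separate nodes.

```python
from collections import Counter
from functools import reduce


def split_input(records, n_splits):
    """Cut the input into n_splits pieces."""
    return [records[i::n_splits] for i in range(n_splits)]


def map_count_shapes(split):
    """Map step: group the records of one split by shape type and count them."""
    return Counter(split)


def reduce_counts(partial_counts):
    """Reduce step: sum the per-split counts into one total per shape type."""
    return reduce(lambda total, part: total + part, partial_counts, Counter())


shapes = ["circle", "square", "triangle", "square", "circle", "square", "circle"]
partials = [map_count_shapes(s) for s in split_input(shapes, 3)]
totals = reduce_counts(partials)
print(dict(totals))
```

Because each map step only sees its own split, the work scales out by adding nodes; only the small partial counts travel to the reduce step.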
What is stream analytics? How does it differ from regular analytics?
-A term commonly used for extracting actionable information from continuously flowing/streaming data sources. -The science of analysis--to use data for decision making
What are important criteria when selecting an ETL tool?
-Ability to read from and write to an unlimited number of data source architectures -Automatic capturing and delivery of metadata -History of conforming to open standards -Easy to use interface for the developer and function user
When are column oriented organizations more efficient?
-An aggregate needs to be computed over many rows, but for a notably smaller subset of all columns of data -New values of a column are supplied for all rows at once because that column data can be written efficiently
What is Hadoop? How does it work?
-An open source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data. -A client accesses unstructured and semistructured data from sources including log files, social media feeds, and internal data stores. It breaks the data up into "parts," which are then loaded into a file system made up of multiple nodes running commodity hardware.
Data Mining Analysis Methods
-Analyzing customer buying patterns to predict future marketing and promotion campaigns. -Building budgets and other financial information. -Detecting fraud by identifying deceptive spending patterns. -Finding the best customers who spend the most money. -Keeping customers from leaving or migrating to competitors. -Promoting and hiring employees to ensure success for both the company and the individual.
What are the four major types of patterns that data mining seeks to identify?
-Associations -Predictions -Clusters -Sequential Relationships
How do the two approaches differ (approaches to recommendation systems).
-Collaborative filtering: The recommendations system is built based on the individual user's past behavior by keeping track of the previous history of all purchased items. -Content filtering: Relies on the user ratings matrix, by considering specifications and characteristics of items.
What are examples of social networks relevant to business activities?
-Communication Networks -Community Networks -Criminal Networks -innovation Networks
What are the major characteristics and objectives of data mining?
-Data is presented in many formats. -Data mining environment is usually a client/server architecture or a Web based IS architecture. -Sophisticated new tools help to remove the information ore buried in corporate files and public records. -Miner has little or no programming skill. -Striking it rich often means finding an unexpected result, which requires the end user to think creatively throughout the process. -Data mining can be analyzed and deployed quickly and easily. -Parallel processing is often necessary for data mining
What are the most common myths about data mining?
-Data mining provides instant, crystal-ball-like predictions -Data mining is not yet viable for business applications -Data mining requires a separate, dedicated database -Only those with advanced degrees can do data mining -Data mining is only for large firms that have lots of customer data
Describe the major components of a data warehouse
-Data sources: Multiple independent operational "legacy" systems and possibly from external data providers. -Data extraction and transformation: Uses custom-written or commercial software called ETL. -Data loading: Starts in staging area, transformed and cleansed, then loaded into data warehouse/data marts. -Comprehensive database: EDW to support all decision analysis by providing relevant summarized and detailed information originating from many different sources. -Metadata: Includes software programs about data and rules for organizing data summaries. -Middleware tools: enable access to the data warehouse .
What is Big Data analytics? How does it differ from regular analytics?
-Data that exceeds the reach of commonly used hardware environment and/or capabilities of software tools to capture, manage, and process it within a tolerable time span -The science of analysis--to use data for decision making
What are the use cases for Big Data and Hadoop?
-Data warehouse performance -Integrating data that provides business values -Interactive BI tools
What are the components of a Linear Programming Model?
-Decision Variables -Objective Function -Objective Function Coefficients -Constraints -Capacities -Input/Output Coefficients
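As a hedged illustration, the components listed above map onto the general form of a linear program as follows. The symbols are assumptions chosen for this sketch: $x_j$ are the decision variables, $c_j$ the objective function coefficients, $a_{ij}$ the input/output coefficients, and $b_i$ the capacities. A maximization problem with upper-bound constraints is shown; minimization works analogously.

```latex
\begin{aligned}
\text{maximize}\quad   & z = \sum_{j=1}^{n} c_j x_j
                       && \text{(objective function)} \\
\text{subject to}\quad & \sum_{j=1}^{n} a_{ij} x_j \le b_i, \quad i = 1, \dots, m
                       && \text{(constraints)} \\
                       & x_j \ge 0, \quad j = 1, \dots, n
                       && \text{(non-negativity)}
\end{aligned}
```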
What are the analysis tools for measuring social media?
-Descriptive analytics -Social network analysis -Advanced analytics
List ethical issues in analytics.
-Electronic surveillance -Ethics in DSS design -Software piracy -Invasion of individuals' privacy -Use of proprietary database -Use of intellectual property such as knowledge and expertise -Exposure of employees to unsafe environments related to computers -Computer accessibility for workers with disabilities -Accuracy of data, information, and knowledge -Protection of the rights of users -Accessibility to information -Use of corporate computers for non-work-related purposes -How much decision making to delegate to computers
What are direct benefits of implementing a data warehouse?
-End users can perform extensive analysis in numerous ways. -Consolidated view of corporate data -Better and more timely information -Enhance system performance -Data access is simplified
What are indirect benefits of implementing a data warehouse?
-Enhance business knowledge -Present a competitive advantages -Improve customer service and satisfaction -Facilitate decision making -Help reform business processes
What are the four main areas of effective security in a data warehouse?
-Establishing effective corporate and security policies and procedures; start at top management -Implementing logical security procedures and techniques to restrict access -Limit physical access to the data center environment. -Establish an effective internal control review process with an emphasis on security and privacy.
What are the major features of ES?
-Expertise -Symbolic reasoning -Deep knowledge -Self-knowledge
Describe the three steps of the ETL process
-Extraction: Reading data from one or more databases. -Transformation: Converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a data warehouse or simply another database. -Load: Putting the data into the data warehouse
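The three steps can be sketched end to end with the standard library. Everything concrete here is hypothetical: the CSV text stands in for a source system, the cleaning rules are examples, and an in-memory SQLite table named `sales` stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source extract with inconsistent formatting.
SOURCE_CSV = "customer,amount\n alice ,10.50\nBOB,3\n"


def extract(text):
    """Extraction: read the raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: convert records into the form the target expects."""
    return [(row["customer"].strip().title(), float(row["amount"])) for row in rows]


def load(conn, rows):
    """Load: put the transformed records into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
print(conn.execute("SELECT customer, amount FROM sales ORDER BY customer").fetchall())
```

Keeping the steps as separate functions mirrors the card: each can be tested, scheduled, or replaced on its own.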
What are some factors other than hardware, software, and network capabilities that have contributed to facilitating growth of decision support and analytics?
-Group communication and collaboration -Improved data management -Managing giant data warehouses and Big Data -Analytical support -Overcoming cognitive limits in processing and storing information -Knowledge management -Anytime, anywhere support
What are some of the key system-oriented trends that have fostered IS-supported decision making to a new level?
-Group communication and collaboration -Improved data management -Managing giant data warehouses and Big Data -Analytical support -Overcoming cognitive limits in processing and storing information -Knowledge management -Anywhere, anytime support
What are the things that help Web pages rank higher in the search engine results?
-Hire a company that specializes in search engine optimization to continuously improve sites appeal to changing practices of the search engines -Pay the search engine providers to be listed on the paid sponsors section -Consider reducing dependence on search engine traffic
What can cluster results be used for?
-Identify a classification scheme -Suggest statistical models to describe populations -Indicate rules for assigning new cases to classes for identification, targeting, and diagnostic purposes. -Provide measures of definition, size, and change in what were previously broad concepts -Find typical cases to label and represent classes. -Decrease the size and complexity -Identify outliers in a specific domain
What are factors that affect the architecture selection decision?
-Information interdependence between organizational units -Upper management's information needs -Urgency of need for a data warehouse. -Nature of end user tasks -Constraints on resources -Strategic view of the data warehouse prior to implementation -Compatibility with existing systems -Perceived ability of the in-house IT staff -Technical issues -Social and political factors
What are some of the major applications areas of artificial intelligence?
-Intelligent tutoring -Autonomous robots -Speech understanding -Automatic programming -Computer vision -Game playing -Expert system -Intelligent agents -Natural language processing -Machine learning -Voice recognition -Neural network -Genetic algorithms -Fuzzy logic
What are the benefits of implementing a data warehouse?
-Keepers: money saved by improving traditional decision support functions (20%) -Gathers: money saved due to automated collection and dissemination of information (30%) -Users: money saved or gained from decisions made using the data warehouse (50%)
How can visitor profiles be leveraged with web analytics and segmentation?
-Keywords -Content groupings -Geography -Time of Day -Landing page profiles
List and define the major components of an ES.
-Knowledge acquisition Subsystem: accumulation, transfer, and transformation of problem-solving expertise from experts or documented knowledge sources to a computer program for constructing or expanding the knowledge base. -Blackboard: an area of working memory set aside as a database for description of the current problem, as characterized by the input data. -Explanation Subsystem: can trace such responsibility and explain ES behaviors by interactively answering: why was a certain question asked by the ES, how was a certain conclusion reached, why was a certain alternative rejected, what is the completed plan of decisions to be made in reaching the conclusion? -Knowledge-refining system: can analyze their own knowledge and its effectiveness, learn from it, and improve on it for future consultations.
When are row oriented organizations more efficient?
-Many columns of a single row are required at the same time and the row size is relatively small -Writing a new row if all of the column data is supplied at the same time.
Why has data mining become more popular?
-More intense competition at the global scale. -General recognition of the untapped value hidden in large data sources. -Consolidation and integration of database records. -Exponential increase in data processing and storage technologies. -Significant reduction in the cost of hardware and software for data storage. -Movement toward demassification of business practice
What is a cube? What do drill down, roll up, and slice and dice mean?
-Cube: multidimensional data structure (actual or virtual) that allows fast analysis of data. -Drill down/up: user navigates among levels of data ranging from the most summarized (up) to the most detailed (down). -Roll up: computing all of the data relationships for one or more dimensions. -Slice: subset of a multidimensional array corresponding to a single value set for one (or more) of the dimensions not in the subset. -Dice: slice on more than two dimensions of a data cube.
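A toy sketch of these operations on a small cube held as a list of fact rows. The dimensions, the data, and the function names are hypothetical; real OLAP servers precompute or index such aggregates rather than scanning rows.

```python
from collections import defaultdict

DIMS = ("year", "region", "product")
# Hypothetical fact rows: (year, region, product, sales).
facts = [
    (2023, "North", "Bikes", 10), (2023, "North", "Helmets", 4),
    (2023, "South", "Bikes", 7), (2024, "North", "Bikes", 12),
    (2024, "South", "Helmets", 5),
]


def roll_up(rows, keep):
    """Aggregate sales over every dimension that is not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_cube(rows, dim, value):
    """Slice: fix a single value for one dimension."""
    i = DIMS.index(dim)
    return [row for row in rows if row[i] == value]


def dice(rows, **selected):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [
        row for row in rows
        if all(row[DIMS.index(d)] in values for d, values in selected.items())
    ]


print(roll_up(facts, ["year"]))            # most summarized view per year
print(roll_up(facts, ["year", "region"]))  # drill down: one level more detailed
print(slice_cube(facts, "year", 2023))
print(dice(facts, year={2023, 2024}, region={"North"}, product={"Bikes"}))
```

Drilling down is simply a roll-up that keeps more dimensions, which is why the two are shown with the same function.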
What are conversion statistics?
-New & returning visitors -Leads -Sales Conversion -Abandonment / Exit Rates
What is NoSQL? How does it fit into the Big Data analytics picture?
-Not Only SQL. Processing large volumes of multi-structured data. -Serving up discrete data stored among large volumes of multi-structured data to end-users and automated Big Data applications. -Can work in conjunction with Hadoop.
What are challenges associated with implementing NLP?
-Part of speech tagging. -Text segmentation -Word sense disambiguation -Syntactic Ambiguity -Imperfect or Irregular Input -Speech Acts
Describe privacy concerns in analytics?
-Privacy is the right to be left alone and the right to be free from unreasonable personal intrusion -Internet uses and accesses data -Private info can aid in decision making, but hurts privacy
What are the big challenges that one should be mindful of when considering implementation of Big Data analytics?
-Process efficiency and cost reduction -Brand management -Revenue maximization, cross-selling, and up-selling -Enhanced customer experience -Churn identification, customer recruiting -Improved customer service -Identifying new products and market opportunities -Risk management -Regulatory compliance -Enhanced security capabilities
What are the types of organizations or professionals that comprise the analytics industry?
-Provide advice to the analytics industry providers and users -Professional societies or organizations that are membership based and organized. -Analytics ambassadors, influences, or evangelists that have presented their enthusiasms for analytics through seminars, books, or other publications.
What are the main types of a data warehouse?
-Provide decision support capability -Allows ready access to business information -Creates business insight
What are characteristics that differentiate between social and industrial media?
-Quality -Reach -Frequency -Accessibility -Usability -Immediacy -Updatability
What are the reasons for the upswing of open source software?
-Recession has driven up interest in low cost open source software -Open source tools are coming into a new level of maturity -Open source software augments traditional enterprise software without replacing it.
What are components of the inner petal of the analytics ecosystem?
-Regulators and policy makers -Analytics industry analysts & influencers -Academic institutions and certification agencies -Application Developers: industry specific or general
What are the use cases for Big Data and Hadoop?
-Repository -Active archive -Data warehouse performance -Integrating data that provides business values -Interactive BI tools
What does an LP allocation model assume?
-Returns from different allocations can be compared -Return from any allocation is independent of others. -All data are known with certainty -The resources are used in the most economical manner.
What are the two broad categories of SEOs?
-Techniques that search engines recommend as part of good site design. -Techniques of which search engines do not approve.
What are the data mining mistakes?
-Selecting the wrong problem for data mining. -Ignoring what your sponsor thinks data mining is and what it can/can't do. -Beginning without the end in mind. -Define the project around a foundation that your data can't support. -Leaving insufficient time for data preparation. -Looking only at aggregated results and not at individual records. -Not keeping track of the data mining procedure and results. -Using data from the future to predict the future. -Ignoring suspicious findings and quickly moving on. -Starting with high profile complex project first. -Running data mining algorithms repeatedly and blindly. -Ignore the subject matter experts. -Believing everything you are told about the data. -Assuming full cooperation. -Measuring your results differently from the way your sponsor does. -If you build, they will come mindset.
What are the three key components of a BPM?
-Set of integrated, closed loop management and analytic processes that address financial and operational activities -Tools for business to define strategic goals and measure / manage performance against those goals. -Core set of processes linked to organizational strategy
When developing a data warehouse, what are the most important risks and issues to consider and avoid?
-Starting with the wrong sponsorship chain -Setting expectations that you cannot meet -Engaging in politically naïve behavior -Loading the warehouse with information just because it is available -Believing that data warehousing database design is the same as transactional database design -Choosing a data warehouse manager who is technology oriented rather than user oriented -Focusing on traditional internal record-oriented data and ignoring the value of external data and of text, images, and perhaps, sound and video. -Delivering data with overlapping and confusing definitions -Believing promises of performance, capacity, and scalability -Believing that your problems are over when the data warehouse is up and running -Focusing on ad hoc data mining and periodic reporting instead of alerts.
What are various risks and issues when developing a successful data warehouse?
-Starting with the wrong sponsorship chain -Setting expectations you can't meet -Engaging in politically naive behavior -Loading the warehouse with data just because it's available -Believing that data warehousing database design is the same as transactional database design -Choosing a data warehouse manager who is technology oriented rather than user oriented. -Focusing on traditional internal record-oriented data and ignoring the value of external data -Delivering data with overlapping and confusing definitions. -Believing promises of performance, capacity, and scalability. -Believing that your problems are over when the data warehouse is up and running. -Focusing on ad hoc data mining and periodic reporting instead of alerts
What is the main difference between statistics and data mining?
-Statistics collects sample data to test the hypothesis whereas data mining and analytics use all the existing data to discover novel patterns and relationships. -Size of data varies
What are the most distinguishing features of KPIs?
-Strategy -Targets -Ranges -Encoding -Time frames -Benchmarks
What are the four perspectives that BSC suggests us to use to view organizational performance?
-The Customer Perspective -The Financial Perspective -The Learning and Growth Perspective -The Internal Business Processes Perspective
List the major characteristics of Web 2.0
-The ability to tap into the collective intelligence of users. The more users contribute, the more popular and valuable a Web 2.0 site becomes. -Data is made available in new or never-intended ways. Web 2.0 data can be remixed or "mashed up," often through Web service interfaces, much the way a dance-club DJ mixes music -Relies on user-generated and user-controlled content and data. -Lightweight programming technique and tools let nearly anyone act as a Web site developer. -The virtual elimination of software-upgrade cycles makes everything a perpetual beta or work-in-progress and allows rapid prototyping, using the Web as an application development platform. -Users can access applications entirely through a browser. -An architecture of participation and digital democracy encourages users to add value to the application as they use it. -A major emphasis is on social networks and computing. -There is strong support for information sharing and collaboration . -Fosters rapid and continuous creation of new business models.
What are some of the main challenges the Web poses for knowledge discovery?
-The web is too big for effective data mining -is too complex -is too dynamic -is not specific to a domain -has everything
What are best practices in social media analytics?
-Think of measurement as a guidance system -Track the elusive sentiment -Improve the accuracy of text analysis -Look at the ripple effect -Look beyond the brand -Identify your most powerful influencers -Look closely at the accuracy of your analytic tool -Incorporate social media intelligence into planning
List and briefly describe the best practices in social media analytics?
-Think of measurement as a guidance system, not a rating system -Track the elusive sentiment -Continuously improve the accuracy of text analysis -Look at the ripple effect -Look beyond the brand -Identify your most powerful influencers -Look closely at the accuracy of your analytic tool -Incorporate social media intelligence into planning
Why is master data management gaining popularity?
-Tighter integration with operational systems demands -Most data warehouses still lack MDM and data quality functions -Regulatory and financial reports must be perfectly clean and accurate
What are challenges with the Web?
-Too big for effective data mining -Too complex & dynamic -Not specific to a domain -Web has everything
What are difficulties that arise when analyzing multiple goals?
-It is usually difficult to obtain an explicit statement of the organization's goals. -Goals and subgoals are viewed differently at different levels of the organization. -The decision maker may change the importance assigned to specific goals over time or for different decision scenarios. -Decision makers have personal agendas. -Participants assess the importance of the goals differently.
What are the ways to manage multiple goals?
-Utility theory -Goal Programming -Expression of goals as constraints -Points system
What is special about the Big Data vendor landscape? Who are the big players?
-Vendors are able to develop their own Hadoop distributions, based on the Apache open source distribution, but with various levels of proprietary customization. -Cloudera, MapR, Hortonworks
What are the four categories of web analytics?
-Web site usability -Traffic sources -Visitor profiles -Conversion statistics
What will play a significant role in defining the future of data warehouse?
-Web, social media, and Big Data -Open source software -SaaS -Cloud Computing -Data lakes
KPIs
-linked to a strategy w/ an objective -defines the target and actual performance measure (e.g. increase repeat business for bike customers by 15%)
Transaction processing systems
... Support day-to-day operations.
Analytic Information systems
... Support decision making.
dependent data mart
... are sourced directly from EDW
independent Data Mart
... are sourced from one or more operational systems or external information providers
Information System
... concerns the information and communication systems in business and administration.
KDD definition
... is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data.
Information and communication systems
... is a sociotechnical system for satisfying the demand for information. It is a human/task/technology system.
Grievances
/ˈgri vəns/ : grievances a real or imagined wrong or other cause for complaint or protest, especially unfair treatment.
DWH Tiers
0. Data sources 1. Data storage 2. OLAP engine 3. Front-end tools
4 categories of KPI examples
1) Financial 2) Customer 3) Business Process 4) Learning and Growth
Costs of Using Low-Quality Information
1) Inability to track customers accurately. 2) Difficulty identifying the organization's most valuable customers. 3) Inability to identify selling opportunities. 4) Lost revenue opportunities from marketing to nonexistent customers. 5) The cost of sending undeliverable mail. 6) Difficulty tracking revenue because of inaccurate invoices. 7) Inability to build strong relationships with customers.
6 Dashboard Elements in Performance Point
1) Indicators 2) Filters 3) Reports 4) KPIs 5) Scorecards 6) Dashboard
The four primary reasons for low-quality information
1) Online customers intentionally enter inaccurate information to protect their privacy. 2) Different systems have different information entry standards and formats. 3) Data-entry personnel enter abbreviated information to save time or erroneous information by accident. 4) Third-party and external information contains inconsistencies, inaccuracies, and errors.
IMPORTANT difference between data, information, knowledge
1) data = facts, observations, raw numbers 2) information = a subset of data given meaning by its context, produced by manipulating raw data, e.g. number of sales today 3) knowledge = derived from information; justified beliefs (from logic or empirical observation) about relationships among concepts; decisions are more reliable if based on knowledge, not just data or information
6 Distinguishing features of KPIs
1) embody strategic objectives 2) measure performance against specific targets 3) targets have performance ranges (above, on, below) 4) ranges are encoded in software enabling visual display (e.g. red, yellow, green) 5) targets typically are assigned time frames by which they must be accomplished 6) targets are often measured against a benchmark (e.g. previous year's results)
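Features 2 to 4 (targets, ranges, encoding) can be sketched in a few lines of Python. The function name `kpi_status` and the 10% tolerance band are illustrative assumptions, not part of any BI product.

```python
def kpi_status(actual, target, tolerance=0.10):
    """Encode performance against a target as a stoplight color.

    Assumes higher values are better. `tolerance` is a hypothetical
    band below the target that still counts as 'close' (yellow).
    """
    if actual >= target:
        return "green"   # on or above target
    if actual >= target * (1 - tolerance):
        return "yellow"  # slightly below target
    return "red"         # well below target


# Example: target is 115 repeat orders this quarter.
for value in (120, 105, 90):
    print(value, kpi_status(value, target=115))
```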
IMPORTANT four synergistic capabilities of BI
1) organizational memory: collect quantitative data, accumulated over time 2) information integration: non-quantitative and external data 3) insight creation: apply analytics 4) presentation: display in visual and user friendly formats --> They provide input to each other
two types of integrity constraints
1) relational 2) business critical
6 Dashboard Characteristics
1) use of visual components (e.g. charts, performance bars, spark lines, gauges, meters, stoplights) to highlight, at a glance, the data and exceptions that require action 2) transparent to the user, meaning that they require minimal training and are extremely easy to use 3) combine data from a variety of systems into a single, summarized, unified view of the business 4) enable drill-down or drill-through to underlying data sources or reports 5) present a dynamic, real-world view with timely data updates 6) require little, if any, customized coding to implement, deploy, and maintain
advantages to using the web to access company databases
1) web browsers are much easier to use than directly accessing the database by using a custom-query tool 2) the web interface requires few or no changes to the database model 3) it costs less to add a web interface in front of a DBMS than to redesign and rebuild the system to support changes. Additional data-driven website advantages include: -Easy to manage content: Website owners can make changes without relying on MIS professionals; users can update a data-driven website with little or no training. -Easy to store large amounts of data: Data-driven websites can keep large volumes of information organized. Website owners can use templates to implement changes for layouts, navigation, or website structure. This improves website reliability, scalability, and performance. -Easy to eliminate human errors: Data-driven websites trap data-entry errors, eliminating inconsistencies while ensuring that all information is entered correctly.
Zhao described five levels of metadata management maturity:
1. Ad-hoc, discovered, managed, optimized, and automated.
What are the steps of CRISP DM Process?
1. Business Understanding 2. Data Understanding 3. Data Preparation 4. Model Building 5. Testing & Evaluation 6. Deployment
What are the data preprocessing steps?
1. Data Consolidation -- collect, select, and integrate 2. Data Cleaning -- impute values, reduce noise, eliminate duplicates 3. Data Transformation -- normalize, discretize, and create attributes 4. Data Reduction -- dimension, volume, and balance data
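A minimal sketch of the cleaning and transformation steps, assuming rows are plain dictionaries. The function name, the mean imputation, and the min-max scaling are illustrative choices, not the only options.

```python
from statistics import mean


def clean_and_transform(rows, field):
    """Deduplicate rows, impute missing values, and min-max normalize one field."""
    # Data cleaning: eliminate exact duplicate rows, keeping first occurrences.
    unique = []
    for row in rows:
        if row not in unique:
            unique.append(row)

    # Data cleaning: impute missing values with the mean of the observed ones.
    observed = [r[field] for r in unique if r[field] is not None]
    fill = mean(observed)
    values = [fill if r[field] is None else r[field] for r in unique]

    # Data transformation: min-max normalize to the range [0, 1].
    low, high = min(values), max(values)
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]


rows = [
    {"id": 1, "income": 20},
    {"id": 2, "income": None},   # missing value
    {"id": 3, "income": 40},
    {"id": 1, "income": 20},     # duplicate record
]
print(clean_and_transform(rows, "income"))
```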
Describe the four steps managers take in making a decision.
1. Define the problem (A decision situation that may deal with some difficulty or with an opportunity) 2. Construct a model that describes the real-world problem 3. Identify possible solutions to the modeled problem and evaluate the solutions 4. Compare, choose, and recommend a potential solution to the problem
What are the steps of simulation?
1. Define the problem. 2. Construct the simulation model. 3. Test & validate the model 4. Design the experiment 5. Conduct the experiment
What are the steps of the text mining process?
1. Establish the corpus 2. Create the term document matrix. 3. Extract knowledge
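Step 2 can be sketched as follows, assuming whitespace tokenization and raw term counts. Real projects usually add stemming, stop-word removal, and weighting. Here rows are documents and columns are terms.

```python
from collections import Counter


def term_document_matrix(corpus):
    """Return (vocabulary, matrix) with one row per document and one column per term."""
    tokenized = [doc.lower().split() for doc in corpus]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    counts = [Counter(doc) for doc in tokenized]
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


corpus = ["data mining finds patterns", "text mining mines text"]
vocabulary, matrix = term_document_matrix(corpus)
print(vocabulary)
for row in matrix:
    print(row)
```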
classification steps
1. Model construction - training set -> model 2. Model usage
What are the main steps in carrying out sentiment analysis projects?
1. Sentiment Detection 2. N-P Polarity Classification 3. Target Identification 4. Collection and Aggregation
What are the steps for a sentiment analysis?
1. Sentiment Detection: calculate the O-S (objectivity-subjectivity) polarity 2. N-P Polarity Classification 3. Target Identification: identify the target of the sentiment 4. Collection & Aggregation
structured approach to architecture development
1. A high-level corporate data model is defined in a short time 2. Independent data marts can be implemented in parallel with the EDW 3. Distributed data marts can be constructed to integrate different data marts 4. The EDW is constructed
BI evolution
1.0 DBMS based structured Content 2.0 web based, unstructured Content 3.0 mobile and sensor based content
How far does data warehousing trace back to?
1970s
complete but inaccurate information
2/31/10 is an example of complete but inaccurate information (February 31 does not exist)
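A completeness check (is a value present?) would pass this date, while an accuracy check would not. A minimal sketch using the standard library; the helper name `is_valid_date` is an assumption for illustration.

```python
from datetime import date


def is_valid_date(year, month, day):
    """Return True only if the calendar date actually exists."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False


print(is_valid_date(2010, 2, 31))  # complete but inaccurate
print(is_valid_date(2010, 2, 28))
```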
Dimensions
4-dimensional array...
Enterprise Application Integration (EAI)
= alternative to ERP. EAI = middleware that can parse, duplicate, or transform data between applications. It allows integration without redefining business practices. EAI connects multiple isolated systems and makes them work together and share their data. ERP, in contrast, is a monolithic software block.
What are examples of traffic sources?
-Referral Web Sites -Search Engines -Direct Searches via bookmarking of web page or using URL -Offline campaigns -Online Campaigns
dendrogram
?!?
Define decision automation systems.
A business rule-based system that uses intelligence to recommend solutions to repetitive decisions (such as pricing). Is also called automated decision support: A rule-based system that provides a solution to a repetitive managerial problem.
What is a namespace? Why is it important in key-value database?
A collection of identifiers. Keys must be unique within a namespace.
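The importance of namespaces can be illustrated with a toy in-memory store: the same key can be reused in different namespaces without collision. The class `KeyValueStore` is a hypothetical sketch, not the API of any real key-value database.

```python
class KeyValueStore:
    """Toy key-value store where keys are unique only within a namespace."""

    def __init__(self):
        self._namespaces = {}

    def put(self, namespace, key, value):
        # Create the namespace on first use, then store the value under the key.
        self._namespaces.setdefault(namespace, {})[key] = value

    def get(self, namespace, key):
        return self._namespaces[namespace][key]


store = KeyValueStore()
store.put("customers", "42", "Ann")
store.put("orders", "42", "Bike")   # same key, different namespace
print(store.get("customers", "42"), store.get("orders", "42"))
```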
data store
A data repository - either permanent or temporary - for data transformed by processes. Data stores can be files or full database systems.
What is a data mart?
A departmental data warehouse that stores only relevant data
What is data visualization? Why is it needed?
A graphical, animation, or video presentation of data and the results of data analysis. The use of visual representations to explore, make sense of, and communicate data.
Define Gini index. What does it measure?
A metric that is used in economics to measure the diversity of the population. The same concept can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute/variable.
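A minimal sketch of the purity calculation, assuming the form commonly used in decision trees, 1 - sum(p_i^2) over the class proportions p_i. A value of 0 means the node is pure.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_impurity(["yes", "yes", "yes"]))  # pure node
print(gini_impurity(["yes", "no"]))          # evenly mixed two-class node
```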
What is Six Sigma? How is it used as a performance measurement system?
A performance management methodology aimed at reducing the number of defects in a business process to as close to zero defects per million opportunities (DPMO) as possible.
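DPMO is usually computed as defects divided by total opportunities, scaled to one million. A small sketch with made-up numbers; the parameter names are assumptions.

```python
def dpmo(defects, units, opportunities_per_unit):
    """Defects per million opportunities."""
    total_opportunities = units * opportunities_per_unit
    return defects * 1_000_000 / total_opportunities


# Hypothetical example: 17 defects found in 1,000 invoices,
# each invoice having 5 places where an error could occur.
print(dpmo(defects=17, units=1000, opportunities_per_unit=5))
```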
What is a balanced scorecard? Where did it come from?
A performance measurement and management methodology that helps translate an organization's financial, customer, internal process, learning and growth objectives and targets into a set of actionable initiatives. Kaplan and Norton first articulated this methodology in their Harvard Business Review article in 1992.
What is a performance management system? Why do we need one?
A performance measurement system typically comprises systematic methods of setting business goals together with periodic feedback reports that indicate progress against goals. A system that assists managers in tracking the implementation of business strategy by comparing actual results against strategic goals and objectives.
What is a data warehouse?
A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized form.
Data warehouse
A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.
RDBMS vs DBMS
A relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model
What is an independent data mart
A small warehouse designed for a strategic business unit or department
What is a social network? What is social network analysis?
A social network is a social structure composed of individuals/people (or groups of individuals or organizations) linked to one another with some type of connections/relationships. Social network analysis (SNA) is the systematic examination of social networks. Dating back to the 1950s, social network analysis is an interdisciplinary field that emerged from social psychology, sociology, statistics, and graph (network) theory.
data cube
A special database used to store data in OLAP reporting
Key Performance Indicator (KPI)
A strategic objective AND METRICS that measures performance against a goal
What is a dependent data mart
A subset that is created directly from a data warehouse
What is a data cube
A two-dimensional, three-dimensional, or higher-dimensional object in which each dimension of the data represents a measure of interest
What are the differences and commonalities between dashboards and scorecards?
Dashboard: a visual presentation of critical data for executives to view. It allows executives to see hot spots in seconds and explore the situation. Scorecard (balanced scorecard): a performance measurement and management methodology that helps translate an organization's financial, customer, internal process, learning and growth objectives and targets into a set of actionable initiatives
What are examples of transaction processing?
ATM withdrawals, bank deposits, cash register scans at the grocery store, etc.
Data-as-a-service (DaaS)
Accessing data "where it lives", enriching data quality with centralization,
Out flow:
Access: consumers obtain data, both ad hoc and routinely. Delivery: the warehouse renders data via publish-and-subscribe mechanisms.
What is the percentage of test data set samples correctly classified by the model?
Accuracy Rate
What is the outcome of predictive analytics?
Accurate projections of future events and outcomes.
How does DaaS change the way data is handled?
Actual platform on which data resides doesn't matter. Any business process can access data wherever it resides. Customers can move quickly due to simplicity of the data access and the need for basic knowledge not expert knowledge
What includes predictive analysis and text analytics that examine the content in online conversations?
Advanced analytics
Web 2.0
Advanced Web (blogs, wikis, social networks), Objective (enhance creativity, information sharing, collaboration), Changing the web from passive to active, redefining what is on the Web as well as how it works, companies are adopting and benefiting
When would all items start in individual clusters and the clusters are joined together?
Agglomerative
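A minimal sketch of the bottom-up idea on one-dimensional points, assuming single linkage (distance between the closest members) and a stopping rule of `k` clusters. Both choices are illustrative.

```python
def agglomerative(points, k):
    """Start with one cluster per point and merge the two closest until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


print(agglomerative([1, 2, 9, 10, 25], k=3))
```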
Down-Flow
Aging. To archive data into storage hierarchy
business intelligence examples
Airlines: Analyze popular vacation locations with current flight listings. Banking: Understand customer credit card usage and nonpayment rates. Health care: Compare the demographics of patients with critical illnesses. Insurance: Predict claim amounts and medical coverage costs. Law enforcement: Track crime patterns, locations, and criminal behavior. Marketing: Analyze customer demographics. Retail: Predict sales, inventory levels, and distribution. Technology: Predict hardware failures.
relational online analytical processing (ROLAP)
Analytical processing functions that use relational databases and familiar relational query tools to store and analyze multidimensional data
What is the process of developing actionable decisions or recommendations for actions based on insights generated from historical data?
Analytics
Who has developed analytics software for general use with data that has been collected in a data warehouse or is available through one of the platforms?
Analytics Focused Software Developers
What is a report? What are they used for?
Any communication artifact prepared with the specific intention of conveying information in a presentable form. -To ensure that all departments are functioning properly -To provide information -To provide the results of an analysis -To persuade others to act -To create an organizational memory (as part of a knowledge management system)
What is the most commonly used algorithm to discover association rules that attempts to find subsets that are common to at least a minimum number of the itemsets ?
Apriori Algorithm *uses bottom up approach*
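The bottom-up approach can be sketched as follows: frequent single items are found first, then combined into larger candidates, and any candidate with an infrequent subset is pruned. This is a compact teaching sketch that uses an absolute support count; the function name and the toy baskets are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        level = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return frequent


baskets = [{"milk", "bread"}, {"milk", "diapers"},
           {"milk", "bread", "diapers"}, {"bread"}]
for itemset, count in apriori(baskets, min_support=2).items():
    print(sorted(itemset), count)
```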
What is a graphical assessment technique where the true positive rate is plotted on the y axis and the false positive is plotted on the x-axis?
ROC curve, summarized by the Area Under the ROC Curve (AUC)
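A minimal sketch of how the curve and its area can be computed, assuming binary labels (1 = positive), distinct scores, and the trapezoid rule for the area. The function names are illustrative.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, best score first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one example at a time (assumes no tied scores).
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
print(auc(roc_points(labels, scores)))
```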
What is the most popular and most commonly used measure of central tendency?
Arithmetic Mean
What is the sum of all the values/observations divided by the number of observations in the data set?
Arithmetic Mean
Why is Big Data important? What has changed to put it in the center of the analytics world?
As more and more data becomes available in various forms and fashions, timely processing of the data with traditional means becomes impractical. The exponential growth, availability, and use of information, both structured and unstructured, brings Big Data to the center of the analytics world. Pushing the boundaries of data analytics uncovers new insights and opportunities for the use of Big Data.
What aims to find interesting relationships between variables in large databases?
Association Rule Mining
What finds the commonly co-occurring grouping of things?
Associations
What is a popular and well-researched technique for discovering interesting relationships among variables in large database?
Associations
What is an "authoritative page"? What is a "hub"? What is the difference between the two?
Authoritative page: a web page that is identified as particularly popular based on links from other web pages and directories. Hub: one or more web pages that provide a collection of links to authoritative pages. Difference: a hub links out to multiple authoritative pages, whereas an authoritative page is a single page that many others link to.
How can sentiment analysis be used in predicting financial markets?
Automated analysis of market sentiment using social media, news, blogs, and discussion groups seems to be a proper way to compute market movements. If done correctly, it can identify short-term stock movements based on the buzz in the market, potentially impacting liquidity and trading.
What is the creation of a shortened version of a textual document by a computer program that contains the most important points of the original document
Automatic Summarization
What are the three types of data generated through Web page visits?
Automatically generated data stored in server access logs, referrer logs, agent logs, and client-side cookies User profiles Metadata, such as page attributes, content attributes, and usage data.
IMPORTANT ETL process
BI tools can also directly help obtaining data and information (such as through extraction, transformation, and loading of data).
What is the main difference between BSC and Six Sigma?
BSC is focused on improving overall strategy and the Six Sigma is focused on improving processes.
Describe the reasoning procedures of forward chaining and backward chaining.
Backward Chaining: goal-driven approach in which you start from expectation of what is going to happen and then seek evidence that supports (or contradicts) your expectation. Forward Chaining: data-driven approach. Starts from available information as it becomes available or from a basic idea, and then we try to draw conclusions. The ES analyzes the problem by looking for the facts that match the IF part of its IF-THEN rules.
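Both procedures can be sketched over the same IF-THEN rules. The rule format (list of conditions, conclusion) and the toy medical rules are assumptions for illustration; the backward-chaining sketch assumes the rules contain no cycles.

```python
def forward_chain(facts, rules):
    """Data-driven: fire any rule whose IF part is satisfied until nothing new appears."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if set(conditions) <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts


def backward_chain(goal, facts, rules):
    """Goal-driven: a goal holds if it is a known fact or some rule concluding it
    has all of its conditions provable in turn (assumes acyclic rules)."""
    if goal in facts:
        return True
    return any(
        all(backward_chain(c, facts, rules) for c in conditions)
        for conditions, conclusion in rules
        if conclusion == goal
    )


rules = [(["fever", "cough"], "flu"), (["flu"], "rest")]
print(forward_chain({"fever", "cough"}, rules))
print(backward_chain("rest", {"fever", "cough"}, rules))
```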
What is used when introducing structure to a collection of text based documents to classify them into two or more predetermined classes or to cluster them into natural groupings?
Bag of Words, e.g. Spam Filtering
What is the best known and most widely used performance management system that suggests people view the organization from four perspectives?
Balanced Scorecard (BSC)
What is both a performance measurement and a management methodology that helps translate an organizations financial, customer, internal process, and learning / growth objectives and targets into a set of actionable initiatives?
Balanced Scorecard
The E in BASE stands for eventually consistent. What does that mean?
Basically Available, Soft State, Eventually Consistent. Some replicas might be inconsistent for some period of time but will become consistent at some point.
Cloud sourcing
Benefits: - high scale, low-cost providers - any time/place access via web browser - rapid scalability, cost and load sharing Concerns: - performance, reliability, SLAs - control of data and service parameters - application features and choices - lock-in effects, no migration between cloud providers - no standard API - privacy, security, compliance, trust
What is the outcome of prescriptive analytics?
Best possible business decisions and actions
What refers to data that is structured, unstructured , in a stream and so forth?
Big Data
Info & Info 2
Big data is one of the most promising technology trends occurring today. Of course, notable companies such as Facebook, Google, and Netflix are gaining the most business insights from big data currently, but many smaller markets are entering the scene, including retail, insurance, and health care. Over the next decade, as big data starts to improve your everyday life by providing insights into your social relationships, habits, and careers, you can expect to see the need for data scientists and data artists dramatically increase.
The four common characteristics of big data
Big data requires sophisticated tools to analyze all the unstructured information from millions of customers, devices, and machine interactions. Big data are analyzed for marketing trends in business as well as in the fields of manufacturing, medicine, and science
Who is the father of data warehousing?
Bill Inmon
What attempts to improve rankings in way that are not approved by the search engines or involve deception?
Black Hat SEO
What is it called when a fixed number of instances from the original data is sampled (with replacement) for training and the rest of the data set is used for testing?
Bootstrapping
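A minimal sketch of one bootstrap split, assuming sampling with replacement and a training sample the same size as the data set. The function name and the fixed seed are illustrative.

```python
import random


def bootstrap_split(data, seed=0):
    """Sample len(data) instances with replacement for training;
    instances never drawn form the test set."""
    rng = random.Random(seed)
    n = len(data)
    picked = [rng.randrange(n) for _ in range(n)]
    train = [data[i] for i in picked]
    chosen = set(picked)
    test = [data[i] for i in range(n) if i not in chosen]
    return train, test


train, test = bootstrap_split(list(range(20)))
print(len(train), len(test))
```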
What is the plan big, build small approach that focuses on the request of a specific department?
Bottom Up Approach / Data Mart Approach (DM)
What is a graphical illustration of several descriptive statistics about a given data set?
Box-and-Whiskers Plot / Box Plot
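The descriptive statistics a box plot typically draws are the five-number summary. A sketch using the standard library, assuming the "inclusive" quartile method; other quartile conventions give slightly different values.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(data)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


print(five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6]))
```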
What approach uses probability theory to build classification models based on the past occurrence that are capable of placing a new instance into a most probable class/category?
Bayesian Classifiers
What represents the outcome of a test to classify a pattern using one of the attributes?
Branch
What focuses on listening to social media where anyone can post opinions that can damage or boost your reputation?
Brand Management
Who is an individual whose weak ties fill a structural hole, providing the only link between two individuals or clusters?
Bridge
What are the subcategories of distributions (SNA)?
Bridge, centrality, density, distance, structural holes, and tie strength
What is a collection of tools for manipulating, mining, and analyzing the data in the warehouse?
Business Analytics
What enables interactive access to data, manipulation of data, and the ability to conduct appropriate analysis?
Business Intelligence
What is an umbrella term that combines architectures, tools, data bases, analytical tools, applications, and methodologies?
Business Intelligence
What is based on the transformation of data to information, then decisions, and finally actions?
Business Intelligence
What are the business processes, methodologies, metrics, and technologies used by enterprises to measure, monitor, and manage business performance?
Business Performance Management (BPM)
What is used for monitoring and analyzing performance?
Business Process Management
What are enablers of descriptive analytics?
Business reporting, dashboards, scorecards, and data warehousing
What do the C and A in the CAP theorem stand for? Give an example of how designing for one of those properties can lead to difficulties in maintaining the other.
C is Consistency. A is Availability. (P is Partition tolerance.) When using a two-phase commit, the database favors consistency but at the risk of the most recent data not being available for a brief period of time. While the two-phase commit is executing, other queries to the data are blocked. The updated data is unavailable until the two-phase commit finishes. This favors consistency over availability.
What uses a sequence of six steps that starts with a good understanding of the business and the need for the data mining project and ends with the deployment of the solution that satisfies the specific business need?
CRISP DM
List and briefly define the phases in the CRISP-DM process
CRISP-DM provides a systematic and orderly way to conduct data mining projects. This process has six steps. First, an understanding of the data and an understanding of the business issues to be addressed are developed concurrently. Next, data are prepared for modeling; are modeled; model results are evaluated; and the models can be employed for regular use.
How does CRISP-DM differ from SEMMA?
CRISP-DM: A cross-industry standardized process for conducting data mining projects, which is a sequence of six steps that starts with a good understanding of the business and the need for the data mining project and ends with the deployment of the solution that satisfies the specific business need. SEMMA: An alternative process for data mining projects proposed by the SAS Institute. "Sample, Explore, Modify, Model, and Assess"
EDW's are used to provide data for many types of DSS including:
CRM, supply chain management (SCM), business performance management (BPM), business activity monitoring (BAM), product life-cycle management (PLM), revenue management, and sometimes even Knowledge Management Systems (KMS).
What represents the labels of multiple classes used to divide a variable into specific groups and represents a finite number of values with no continuum between them?
Categorical data / Discrete data, e.g. race, sex, age group, and educational level
What refers to a group of metrics that aim to quantify the importance or influence of a particular node within a network?
Centrality
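One of the simplest metrics in this group is degree centrality: a node's number of ties divided by the maximum possible (n - 1). A sketch assuming an undirected network given as a list of edges; the function name is illustrative.

```python
def degree_centrality(edges):
    """Share of the other nodes each node is directly connected to."""
    nodes = {node for edge in edges for node in edge}
    degree = {node: 0 for node in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {node: d / (len(nodes) - 1) for node, d in degree.items()}


# Star network: A is tied to everyone else.
print(degree_centrality([("A", "B"), ("A", "C"), ("A", "D")]))
```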
What warehousing architecture has a gigantic EDW that serves the needs of all organizational units and provides users with access to all the data in the data warehouse?
Centralized Data Warehouse
In which decision situation is it assumed that complete knowledge is available, so that the decision maker knows exactly what the outcome of each course of action is?
Certainty
What is based on the identification, capture, and delivery of the changes made to enterprise data sources?
Change Capture
What is the most common data mining task, which analyzes the historical data stored in a database and automatically generates a model that can predict future behaviors?
Classification
What is the most frequently used data mining method for real world problems?
Classification
What is the primary source for accuracy estimation in classification problems?
Classification Matrix
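For a two-class problem the matrix holds four counts, and the accuracy rate is the share of correct predictions. A minimal sketch; the function names and the label encoding (1 = positive) are assumptions.

```python
def classification_matrix(actual, predicted, positive=1):
    """Return counts of true/false positives and negatives as a dict."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Correctly classified samples divided by all samples."""
    correct = counts["tp"] + counts["tn"]
    return correct / sum(counts.values())


matrix = classification_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(matrix, accuracy(matrix))
```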
What is the difference between the clustering and the classification?
Classification learns the function between the characteristics of things and their membership through a supervised learning process whereas clustering is an unsupervised learning process where only the input variables are presented to the algorithm.
What is the major difference between cluster analysis and classification?
Classification methods learn from previous examples containing inputs and the resulting class labels, and once properly trained they are able to classify future cases. Clustering partitions pattern records into natural segments or clusters.
What are subcategories of prediction?
Classification, regression, and time series.
What shows the clickable photos, text links in the copy, downloads, and navigation on a page?
Click map
What can reveal where you might be losing visitors in a specific process?
Click paths
What is the analysis of information collected by web servers that can help better understand user behavior?
Clickstream Analysis
What are the sources of Big Data?
Clickstreams from websites, postings on social media sites, traffic, sensors, or weather
MOLAP Model
Client (storing with multidimensional array)-> application layer (Molap Engine) -> Warehouse - efficient storage and processing - complexity hidden from the user
ROLAP Model
Client -> SQL/MDX -> Application layer (ROLAP ENGINE) -> SQL -> Warehouse server
What are the subcategories of segmentation for SNA?
Cliques and social orders, clustering coefficient, and cohesion.
What implies that optimum performance is achieved by setting goals and objectives, establishing initiatives and plans to achieve those goals, monitoring actual performance, and taking corrective action?
Closed Loop BPM Cycle
Define cloud computing. How does it relate to PaaS, SaaS, and IaaS?
Cloud computing offers the possibility of using software, hardware, platform, and infrastructure, all on a service-subscription basis. Cloud computing enables a more scalable investment on the part of a user. Like PaaS, etc., cloud computing offers organizations the latest technologies without significant upfront investment. In some ways, cloud computing is a new name for many previous related trends: utility computing, application service provider, grid computing, on-demand computing, software as a service (SaaS), and even older centralized computing with dumb terminals. But the term cloud computing originates from a reference to the Internet as a "cloud" and represents an evolution of all previous shared/centralized computing trends.
What has been used extensively for fraud detection and market segmentation of customers in contemporary CRM systems?
Cluster Analysis
What is the means of identifying classes of items so that items in a cluster have more in common with each other than with items in other clusters, and of identifying natural groupings of events or objects that share a common set of characteristics?
Cluster Analysis
What is used to sort cases into groups or clusters so that the degree of association is strong among members of the same cluster and weak among members of different clusters?
Cluster Analysis
What partitions a collection of things into segments whose members share similar characteristics, but the class labels are unknown?
Clustering
What are subcategories of segmentation?
Clustering and outlier analysis
What is the measure of the likelihood that two associates of a node are themselves associates?
Clustering coefficient
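As a minimal sketch of the idea, the local clustering coefficient of one node can be computed as the share of pairs of its associates that are directly linked to each other. The toy graph, the node names, and the helper `local_clustering` are illustrative assumptions, not part of the study set.

```python
from itertools import combinations


def local_clustering(graph, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    neighbors = graph[node]
    if len(neighbors) < 2:
        return 0.0  # no pairs of associates to compare
    pairs = list(combinations(sorted(neighbors), 2))
    linked = sum(1 for a, b in pairs if b in graph[a])
    return linked / len(pairs)


# Hypothetical undirected network stored as adjacency sets.
graph = {
    "ann": {"bob", "cy", "dee"},
    "bob": {"ann", "cy"},
    "cy": {"ann", "bob"},
    "dee": {"ann"},
}

# Only one of ann's three neighbor pairs (bob, cy) is linked.
coefficient = local_clustering(graph, "ann")
```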
What identifies the natural grouping of things based on their known characteristics?
Clusters
What is it called when an individual's problem solving capability is limited when a wide range of diverse information and knowledge is required?
Cognitive Limits
What is a system that stores data tables as sections of columns of data rather than as rows of data?
Columnar Database / Column Oriented Database Management Systems **much finer grain of control**
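The difference between row and column storage can be sketched with plain Python structures. This is only an illustration of the layout, assuming a made-up three-row sales table; real columnar engines add compression and indexing on top.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "region": "north", "sales": 120},
    {"id": 2, "region": "south", "sales": 80},
    {"id": 3, "region": "north", "sales": 50},
]

# Column-oriented layout: each attribute is stored as its own sequence.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregate over one attribute touches only that column.
total_sales = sum(columns["sales"])
```

In the columnar layout, a query that sums `sales` never has to read `id` or `region`, which is why this layout suits analytic workloads.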
Name two data structures used in column family databases.
Columns and column families
What are semistructured decisions? Provide two examples.
Combination of standard and complex problems. trading bonds, setting marketing budgets for consumer products, performing capital acquisition analysis
Example of Low-Quality Information
Completeness. The customer's first name is missing. Another issue with completeness. The street address contains only a number and not a street name. Consistency. There may be a duplication of information since there is a slight difference between the two customers in the spelling of the last name. Similar street addresses and phone numbers make this likely. Accuracy. This may be inaccurate information because the customer's phone and fax numbers are the same. Some customers might have the same number for phone and fax, but the fact that the customer also has this number in the email address field is suspicious. Another issue with accuracy. There is inaccurate information because a phone number is located in the email address field. Another issue with completeness. The information is incomplete because there is not a valid area code for the phone and fax numbers.
What is the EDW that supports all decision analysis by providing relevant summarized and detailed information originating from many different sources?
Comprehensive Database
Explain the importance of metadata
Metadata comprise information that increases our understanding of traditional data. They provide context to the reported data and enriching information that leads to the creation of knowledge.
What enables people to overcome their cognitive limits by quickly accessing and processing vast amounts of stored information?
Computerized Systems
What are features generated from a collection of documents by means of manual, statistical, rule based, or hybrid categorization methodology?
Concepts
What is the process called that predicts machinery failures before they occur through the use of sensory data?
Condition Based Maintenance
What are the metrics of measuring social network analysis?
Connections, distributions, and segmentation.
What represents the dimensional information coming from potentially disparate source, but pertaining to the same subject?
Consistent data
What is the assumption that states that the response variables have the same variance in their error?
Constant Variance/Homoscedasticity
data mart
Contains a subset of data warehouse information
What are situations with unlimited numbers of possible events that follow density functions?
Continuous Distributions
What is a large and structured set of texts prepared for the purpose of conducting knowledge discovery?
Corpus
What gives an estimate on the degree of association between the variables?
Correlation
What is interested in low level relationships between two variables?
Correlation
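One common estimate of the degree of association between two variables is the Pearson correlation coefficient. The sketch below assumes two small made-up series (`ad_spend`, `revenue`); the function name `pearson` is illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical data: revenue rises in lockstep with ad spend.
ad_spend = [1, 2, 3, 4, 5]
revenue = [2, 4, 6, 8, 10]
r = pearson(ad_spend, revenue)
```

Values near +1 or -1 indicate a strong linear association, and values near 0 indicate little or none. Correlation alone does not establish causation.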
When would you introduce structure to the corpus?
Create the term-document matrix
Name two applications of ES in finance and describe their benefits.
Credit Analysis System: An ES can help a lender analyze a customer's credit card record and determine a proper credit limit. Rules in the knowledge base can also help assess risk and risk-management policies. Pension Fund Adviser: An ES that provides information on an employee's pension fund status. The system maintains an up-to-date knowledge base to give participants advice concerning the impact of regulation changes and conformance with new standards.
What is a multidimensional data structure that allows fast analysis of data and is defined as the capability of efficiently manipulating and analyzing data from multiple perspectives?
Cube
What creates one-on-one relationships with customers by developing an intimate understanding of their needs and wants?
Customer Relationship Management
What are the most popular application areas for sentiment analysis? Why?
Customer relationship management (CRM) and customer experience management are popular "voice of the customer (VOC)" applications. Other application areas include "voice of the market (VOM)" and "voice of the employee (VOE)."
What are the perspectives for which an organization should develop objectives, measures, targets, and initiatives?
Customer, financial, internal business process, and learning & growth.
Data warehousing depends on:
DBMS, Extraction and conversion tools, internetworking techniques, front-end analysis tools, graphics
What is frequently a convenient first step to acquiring experience in constructing and managing a data warehouse while presenting business users with the benefits of better access to their data?
DM Approach
What is a closed loop business improvement model and encompasses the steps of defining, measuring, analyzing, improving, and controlling a process?
DMAIC
How does a data warehouse differ from a database
DW: A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized form. DB: A collection of files that are viewed as a single storage concept. Available to a wide range of users
OLAP major task of ...
DWH
What is the collection of facts usually obtained as the result of experiments, observations, transactions, or experiences?
Data
What is the main ingredient for any BI, data science, and business analytics initiative?
Data
What is the ability to access and extract data from any data source?
Data Access
What is a term for professionals who were doing BI in the form of data compilation, cleaning, reporting, and perhaps some visualization?
Data Analyst
What is the integration of business view across multiple data stores?
Data Federation
What companies enable the generation and collection of data that may be used for developing analytical insights?
Data Generation Infrastructure Providers
What comprises three major processes that permit data to be accessed and made accessible to an array of ETL and analysis tools and data warehousing environments: data access, data federation, and change capture?
Data Integration
What is a large storage location that can hold vast quantities of data in its native/raw format for future potential analytics consumption?
Data Lakes
What includes the organizations that provide hardware and software targeting the basic foundation for all management solutions?
Data Management Infrastructure Providers
What is usually smaller and focuses on a particular subject or department?
Data Mart
What architecture has the individual marts linked to each other via some kind of middleware?
Data Mart Bus Architecture
What is the process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge from large sets of data?
Data Mining
What is used to describe discovering or mining knowledge from large amounts of data?
Data Mining
What was used to describe the process through which previously unknown patterns in data were discovered?
Data Mining
What is the tedious and time demanding process that is necessary to convert the raw real world data into a well refined form for analytics algorithms?
Data Preprocessing
What is the term for a professional who utilizes predictive analysis, statistical analysis, and more advanced analytical tools and algorithms?
Data Scientists
What focuses on a specific industry sector and build on their existing relationships in that industry through their niche platforms and services for data collection?
Data Service Providers
When would the data be transformed for better processing and aggregated?
Data Transformation
What contains a wide variety of data that presents a coherent picture of business conditions at a single point in time?
Data Warehouse
What is a discipline that results in applications that provide decision support capability, allows ready access to business information, and creates business insight?
Data Warehouse
What is a pool of data produced to support decision making and is a repository of current and historical data of potential interest to managers throughout the organization?
Data Warehouse
What is a subject oriented, integrated, time variant, nonvolatile collection of data in support of management's decision making process?
Data Warehouse
Who possess solid business insight and be familiar with high performance software, hardware, and networking technologies?
Data Warehouse Administrator
What consists of an integrated set of servers, storage, operating systems, database management systems, and software specifically preinstalled and preoptimized for data warehousing?
Data Warehouse Appliances
What provides solutions for the mid-to-big data warehouse market, offering low-cost performance on data volumes in the terabyte to petabyte range?
Data Warehouse Appliances (low cost of ownership)
What are companies that include their own hardware to provide efficient data storage, retrieval, and processing?
Data Warehouse Providers, e.g., IBM, Oracle, and Teradata
What describes where the company wants to go, why it wants to go there, and what it will do when it gets there?
Data Warehousing Strategy
What means that data are easily and readily obtainable?
Data accessibility. Answers the question "Can we easily get the data when we need to?"
When would the data be cleaned and the values identified and dealt with?
Data cleaning
What means that the data are accurately collected and combined/merged?
Data consistency
When would relevant data be collected from identified sources, necessary records and variables be selected, and the records coming from multiple data sources be integrated and merged?
Data consolidation
What means that data are correct and are a good match for the analytics problem?
Data content accuracy. Answers the question "Do we have the right data for the job?"
What means that the data should be up to date for a given analytics model and is recorded at or near the time of the event or observation so that time-delay-related misrepresentation of the data is prevented?
Data currency/data timeliness
What requires that the variables and data values be defined at the lowest level of detail for the intended use of the data?
Data granularity
Describe the data warehousing process
Data imported from various external and internal sources are cleansed and organized in a manner consistent with the organization's needs. After the data are populated in the DW, data marts can be loaded for specific areas.
Multidimensional Data
Data is represented in cubes: facts (measures) + dimensions
Extract, transform, load (ETL)
Data management - The processes to extract, transform, cleanse, reengineer, and load source data into the data warehouse, and move the data from one location to another
What are the three main types of data warehouses?
Data marts, operational data store (ODS), and enterprise data warehouses (EDW)
What are some major data mining methods and algorithms?
Data mining tasks can be classified into three main categories: prediction, association, and clustering. Based on the way in which the patterns are extracted from the historical data, the learning algorithms of data mining methods can be classified as either supervised or unsupervised. With supervised learning algorithms, the training data includes both the descriptive attributes (i.e., independent variables or decision variables) as well as the class attribute (i.e., output variable or result variable). In contrast, with unsupervised learning the training data includes only the descriptive attributes.
What are enablers of predictive analytics?
Data mining, text mining, web/media mining, and forecasting
Where is the most time spent in analytics tasks?
Data preprocessing
What is the term that means that the variables in the data set are all relevant to the study being conducted?
Data relevancy
What means that all required data elements are included in the data set to build a predictive or prescriptive analytics model?
Data richness. The available variables portray enough dimensions of the underlying subject matter for an accurate and worthy analytics study.
What means that data is secured to only allow those people who have the authority and the need to access it and to prevent anyone else from reaching it?
Data security and data privacy
Meta data management
Data services - data that describes the meaning and structure of business data, as well as how it is created, accessed, and used
Application programming interface
Data source - Mechanism to populate source systems with raw data and to pull operational reports
Enterprise application integration/staging area
Data source - Provides an integrated common data interface and interchange mechanism for real-time and source systems.
Operational transaction systems
Data source - Systems that run day-to-day business operations and provide source data for the data warehouse and DSS environment
What refers to the originality and appropriateness of the storage medium where the data is obtained?
Data source reliability. Answers the question "Do we have the right confidence and belief in this data source?"
What are the major components of the data warehousing process?
Data sources, data extraction/transformation, data loading, comprehensive database, metadata, and middleware tools
The four major components of the data warehousing process
Data sources, data extraction (using custom-written or commercial software called ETL), data loading (data loaded to a staging area), comprehensive database, and metadata (used by IT personnel and users).
What is Big Data analytics?
Data that cannot be stored in a single storage unit. Data that is arriving in many different forms (structured, unstructured, or in a stream).
What is the term used to describe a match/mismatch between the actual and expected data values of a given variable?
Data validity
What are the four major components of business intelligence?
Data warehouse, business analytics, business process management, and user interface.
What are the parts that comprise the data warehousing architectures?
Data warehouse, data acquisition software (application server), client front end software (database server)
Data-mining tools
Data-mining tools use a variety of techniques to find patterns and relationships in large volumes of information that predict future behavior and guide decision making. They help users uncover business intelligence in their data.
Successful administration and management of a data warehouse entails skills and proficiency beyond those required of which traditional role?
Database Administrator
What is the component where the most work must be done to implement a data model and optimize it for query performance?
Database Management Systems (DBMS)
In what area are prediction models that differentiate deceptive statements from truthful ones classified?
Deception Detection
What is the evolution of decision support, business intelligence, and analytics?
Decision Support Systems --> Enterprise/Executive Information Systems --> Business Intelligence --> Analytics --> Big Data
What conveniently organizes information and knowledge in a systematic, tabular manner to prepare it for analysis?
Decision Tables
What divides a training set until each division consists entirely or primarily of examples for one class?
Decision Tree
What shows the relationships of the problem graphically and can handle complex situations in a compact form?
Decision Tree
What classifies data into a finite number of classes based on the values of the input variables?
Decision Trees
What includes many input variables / attributes that may have an impact on the classification of different patterns?
Decision Trees
What is a hierarchy of if-then statements and is thus significantly faster than neural networks?
Decision Trees -Classify data into a finite number of classes based on the values of the input variables
What describe alternative courses of action?
Decision Variables
What are some terms that are content free expressions and there is no universally accepted definition?
Decision support system and management information system
What are examples of an enterprise data warehouse?
Decision support systems, customer relationship management, supply chain management, revenue management, etc.
List and briefly define at least two classification techniques.
Decision tree analysis. Decision tree analysis (a machine-learning technique) is arguably the most popular classification technique in the data mining arena. • Statistical analysis. Statistical classification techniques include logistic regression and discriminant analysis, both of which make the assumptions that the relationships between the input and output variables are linear in nature, the data is normally distributed, and the variables are not correlated and are independent of each other. • Case-based reasoning. This approach uses historical cases to recognize commonalities in order to assign a new case into the most probable category. • Bayesian classifiers. This approach uses probability theory to build classification models based on the past occurrences that are capable of placing a new instance into a most probable class (or category). • Genetic algorithms. The use of the analogy of natural evolution to build directed search-based mechanisms to classify data samples. • Rough sets. This method takes into account the partial membership of class labels to predefined categories in building models (collection of rules) for classification problems.
Define strategic planning. Provide two examples.
Defining long-range goals and policies for resource allocation.
Describe the major steps in developing rule-based ES.
Defining the nature and scope of the problem: Identify the nature of the problem and define its scope. Some domains may not be appropriate for the application of an ES. Identifying proper experts: Find proper experts who have knowledge and are willing to assist in developing the knowledge base. Selecting the building tools: Choose a proper tool for implementing the system. Coding the system: The team can focus on coding the knowledge based on the tool's syntactic requirements. Evaluating the system: Evaluation includes both verification and validation. Verification ensures that the resulting knowledge base contains knowledge exactly the same as that acquired from the expert.
What are storage solution providers?
Dell and NetApp
What is the proportion of direct ties in a network relative to the total number possible?
Density
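As a worked sketch, the density of an undirected network with n nodes is the number of direct ties divided by n(n-1)/2, the number of ties possible. The four-node edge list and the helper `density` are illustrative assumptions.

```python
def density(num_nodes, edges):
    """Share of possible undirected ties that are actually present."""
    possible = num_nodes * (num_nodes - 1) / 2
    return len(edges) / possible


# Hypothetical network: 4 actors, 4 of the 6 possible ties exist.
edges = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}
network_density = density(4, edges)
```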
What is a subset that is created directly from the data warehouse and uses a consistent data model to provide quality data?
Dependent Data Mart
What ensures that the end user is viewing the same version of the data that is accessed by all other data warehouse users?
Dependent Data Mart --high cost limits this to large companies
What refers to knowing what is happening in the organization and understanding some underlying trends and causes of such occurrences?
Descriptive / Reporting Analytics
What answers the questions "What happened?" and "What is happening?"
Descriptive Analytics
What is the entry level in the business analytics taxonomy?
Descriptive Analytics
What helps us convert our numbers and symbols into meaningful representations for anyone to understand and use?
Descriptive Statistics
What is used to describe the sample data on hand and summarize it in a meaningful way so that easily understandable patterns emerge?
Descriptive Statistics
What are the levels of decision/normative analytics?
Descriptive, predictive, and prescriptive
What is predictive analytics? How can organizations employ predictive analytics?
Determine what is likely to happen in the future based on statistical techniques. Organizations employ predictive analytics through data mining, using classification algorithms such as decision tree models and neural networks, in addition to association mining techniques, to estimate relationships between different purchasing behaviors.
A data warehouse enables business users, typically managers, to be more effective in many ways, including:
Developing customer profiles. Identifying new-product opportunities. Improving business operations. Identifying financial issues. Analyzing trends. Understanding competitors. Understanding product performance
What cycle creates a huge database of documents / pages organized and indexed based on their content and information value?
Development Cycle
What are the two main cycles in search engines? Describe the steps in each cycle.
Development cycle: web crawler and document indexer (Step 1: preprocessing the documents; Step 2: parsing the documents; Step 3: creating the term-by-document matrix). Response cycle: query analyzer and document matcher/ranker.
What are the two main cycles of a search engine?
Development cycle and response cycle
Real-Time Location Intelligence
Devices that are constantly sending out location information, reality mining
What is the slice on more than two dimensions of a data cube?
Dice
What has one to many relationships with rows in the central fact table?
Dimension Table
What contain classification and aggregation information about central fact rows and the attributes that describe the data contained within the fact table?
Dimension Tables
What is a retrieval based system that supports high volume query access?
Dimensional Modeling -- star and snowflake schema
When the number of variables can be rather large, the analyst must reduce the number down to a manageable size. What is the process called?
Dimensional Reduction / Variable Selection
Data mining landscape
Discovery vs. verification. Models: prediction (regression, classification) vs. description (segmentation, association) (slide 12)
What involves a situation with a limited number of events that can take on only a finite number of values?
Discrete Distributions
What refers to building a model of a system where the interaction between different entities is studied?
Discrete Event Simulation
What is used to estimate or describe the degree of variation in a given variable of interest?
Dispersion -- used for judging central tendency.
4 contributions of BI and the resulting improvement
Dissemination of real-time information in a user-friendly fashion; creation of new knowledge based on the past; responsive and anticipative decisions based more closely on all the latest information; improved planning for the future through data and information about the past. Result: improvement in operational performance, customer service, and the identification of new opportunities.
What is the minimum number of ties required to connect two particular actors?
Distance
What is the frequency of data points counted and plotted over a small number of class labels or numerical ranges?
Distribution
four enterprise architecture models
Diversification model (low standardization, low integration): decentralized; different markets with different products and services; benefits from local autonomy. Coordination model (low standardization, high integration): sharing of customers, products, suppliers, and partners; business unit leaders have autonomy. Replication model (high standardization, low integration): independent units following highly standardized processes (e.g., McDonald's); units do not depend on each other. Unification model (high standardization, high integration): integrated supply chains that share customer and supplier data (e.g., Dow Chemical).
Which clustering approach starts with all items in one cluster and then breaks them apart?
Divisive
Describe two differences between document databases and relational databases.
Document databases do not require a fixed, predefined schema. Documents can have embedded documents and lists of multiple values within a document.
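Both differences can be sketched with plain Python dictionaries standing in for documents. The field names and the `collection` list are made up for illustration, and no particular document database API is implied.

```python
import json

# A document may embed another document and hold lists of values.
customer_doc = {
    "name": "Ada",
    "emails": ["ada@example.com", "ada@work.example"],
    "address": {"city": "Berlin", "zip": "10115"},
}

# No fixed schema: a second document can carry different fields.
prospect_doc = {"name": "Bo", "source": "webform"}

collection = [customer_doc, prospect_doc]

# Documents are commonly exchanged in a JSON-like form.
serialized = json.dumps(collection)

# Queries must tolerate fields that are absent in some documents.
cities = [doc["address"]["city"] for doc in collection if "address" in doc]
```

In a relational design, the same data would typically need separate tables for customers, e-mail addresses, and postal addresses, all declared up front.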
What happens when the user navigates among levels of data ranging from the most summarized (up) to the most detailed (down)?
Drill Up / Down
What are the leading indicators / value drivers that measure activities that have a significant impact on outcome KPIs?
Driver KPI
Inmon model
EDW approach (top down)
What is an integral component of any data-centric project and consists of extracting, transforming, and loading integrated and cleansed data?
ETL
What are the two metrics to evaluate search engines?
Effectiveness and Efficiency
How many players are involved in the analytics environment?
Eleven clusters: the inner and outer petals and the seed of the flower
What is the process of intelligently combining the information created and provided by two or more information sources?
Ensemble Models. -Improving accuracy and robustness of information outcomes while reducing uncertainty and bias associated with individual models.
What provides a vehicle for pushing data from source systems into the data warehouse and involves integrating application functionality and is focused on sharing functionality across systems?
Enterprise Application Integration (EAI)
EDW stands for
Enterprise Data Warehouse
What is a large scale data warehouse that is used across the enterprise for decision support and provides integration of data from many sources into a standard format?
Enterprise Data Warehouse
What is a mechanism for pulling data from source systems to satisfy a request for information and uses predefined metadata to populate views that make integrated data appear relational to end users?
Enterprise Information Integration (EII)
What is an evolving tool space that promises real time data integration from a variety of sources, such as relational databases, Web services, and multi-dimensional databases?
Enterprise Information Integration (EII)
What system collects all the data from every corner of the enterprise and integrates it into a consistent schema so that every part of the organization has access to the single version when and where needed?
Enterprise Resource Planning (ERP) systems
Types of integration technologies that enable data and metadata integration:
Enterprise application integration (EAI, a vehicle for pushing data from source systems to the data warehouse) and enterprise information integration (EII, which promotes real-time data integration).
What measures the extent of uncertainty or randomness in a data set and is used to build subtrees so that the entropy of each final subset is 0?
Entropy
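A minimal sketch of the measure, assuming Shannon entropy with base-2 logarithms over class labels: a perfectly mixed two-class set has entropy 1, and a pure set has entropy 0, which is the state a decision tree tries to reach in its final subsets. The label values are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Evenly mixed classes: maximum uncertainty for two classes.
mixed = entropy(["yes", "no", "yes", "no"])

# A single class: no uncertainty left, so no further split is needed.
pure = entropy(["yes", "yes", "yes"])
```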
What is the monitoring, scanning, and interpretation of collected information?
Environmental Scanning and Analysis
When would you collect and organize the domain specific unstructured data?
Establish the corpus
ETL stands for
Extract, transform, and load
What systems were designed as graphical dashboards and scorecards so that they could serve as visually appealing displays while focusing on the most important factors for decision makers to keep track of the key performance indicators?
Executive Information Systems
What issues affect whether an organization will purchase tools or build the transformation process itself?
Expensive, long learning curve, and it's difficult to measure how the IT organization is doing until it has learned to use the tools.
What is an ES?
An expert system is a computer-based information system that uses expert knowledge to attain high-level decision performance in a narrowly defined problem domain.
What is an independent variable also known as?
Explanatory or input
How does sentiment appear in text?
Explicit - subjective sentence directly expresses an opinion AND Implicit - The text implies an opinion
When would you discover novel patterns from the T-D matrix?
Extract knowledge
What involves reading data from one or more databases?
Extraction
ETL
Extraction (select data from OLTP), Transformation (validate, clean, integrate), Load (move data into warehouse)
What does ETL stand for?
Extraction, Transformation, and Load
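The three ETL stages can be sketched as three small functions. This is a toy illustration, assuming in-memory lists in place of real source systems and a real warehouse; the field names and cleaning rules are hypothetical.

```python
def extract(source_rows):
    """Read raw records from a source system (here: an in-memory list)."""
    return list(source_rows)


def transform(rows):
    """Validate and clean records before loading."""
    cleaned = []
    for row in rows:
        name = row.get("name", "").strip()
        if not name:
            continue  # drop records that fail the completeness check
        cleaned.append({"name": name.title(), "amount": float(row["amount"])})
    return cleaned


def load(rows, warehouse):
    """Append cleaned records to the target store."""
    warehouse.extend(rows)
    return warehouse


# Hypothetical operational data with inconsistent formatting.
source = [
    {"name": " alice ", "amount": "10.5"},
    {"name": "", "amount": "3"},
    {"name": "BOB", "amount": "7"},
]
warehouse = load(transform(extract(source)), [])
```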
What contains the descriptive attributes needed to perform decision analysis and query reporting?
Fact Table
What is the outcome when the predicted class is negative and the observed class is positive?
False Negative
What is the outcome when the predicted class is positive and the observed class is negative?
False Positive
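The four cells of a two-class classification matrix can be tallied by comparing observed and predicted labels. The sketch below assumes made-up labels and a hypothetical helper `confusion_counts`, then derives accuracy from the counts.

```python
def confusion_counts(observed, predicted, positive="yes"):
    """Tally true/false positives and negatives for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for obs, pred in zip(observed, predicted):
        if pred == positive:
            counts["TP" if obs == positive else "FP"] += 1
        else:
            counts["FN" if obs == positive else "TN"] += 1
    return counts


# Hypothetical observed classes and model predictions.
observed = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "no", "yes"]

counts = confusion_counts(observed, predicted)

# Accuracy is the share of cases on the matrix diagonal.
accuracy = (counts["TP"] + counts["TN"]) / len(observed)
```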
What skills should a DWA (Data Warehouse Administrator) possess?
Familiar with high-performance software, hardware, and networking technologies. Possess solid business insight, decision-making processes and communication skills.
Hub-and-spoke architecture
Famous data warehousing architecture today. Focus on building a scalable and maintainable infrastructure that includes a centralized data warehouse and several dependent data marts. Allows for easy customization of user interfaces and reports. Lacks a holistic enterprise view, and may lead to data redundancy and data latency.
What uses all possible means to integrate analytical resources from multiple sources to meet changing needs or business conditions?
Federated Data Warehouse
Where has data mining been used most commonly on the commercial side?
Finance, retail, and healthcare sectors.
ad-hoc reports
From that point on, the actual reports are created by business end-users. Ad-hoc is Latin for "as the occasion requires." This means that with this BI model, users can use their reporting and analysis solution to answer their business questions "as the occasion requires," without having to request queries from IT.
What are unstructured decisions? Provide two examples.
Fuzzy, complex problems for which there are no cut-and-dried solution methods. Examples: writing a corporate mission statement, selecting a location for a company picnic.
Location-Based Analytics
Geospatial Analytics, Geocoding, Enables aggregate view of large geographic area, Integrate "where" into customer view
Consumer-oriented location-based analytics
Geospatial static approach (GPS navigation, data analysis), location-based dynamic approach (historic and current location demand analysis)
Organization-oriented location-based analytics
Geospatial static approach (examining geographic site locations), location-based dynamic approach (live location feeds; real-time marketing promotions)
What has been used in economics to measure the diversity of a population and can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute or variable?
Gini Index
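As a rough illustration of how the Gini index scores the purity of a branch, the sketch below computes it for a list of class labels and for a candidate split. The helper names `gini_index` and `gini_split` and the sample labels are assumptions made for this example.

```python
from collections import Counter


def gini_index(labels):
    """Gini impurity: 0 for a pure group, larger for more mixed groups."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(groups):
    """Size-weighted Gini impurity of the groups produced by one split."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * gini_index(group) for group in groups)


mixed = ["yes", "yes", "no", "no"]
print(gini_index(mixed))                             # evenly mixed group
print(gini_split([["yes", "yes"], ["no", "no"]]))    # split into pure groups
```

A split that lowers the weighted impurity the most is the one a Gini-based tree would prefer.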
Classification
Given: a collection of records, each with a set of attributes, one of which is the class. Task: find a model for the class attribute. Goal: previously unknown records should be classified correctly.
What are factors that are forcing business managers to rethink how they integrate and manage their businesses?
Global competitive pressures, demand for ROI, management, investor inquiry, and government regulations
What calculates the values of the inputs necessary to achieve a desired level of an output/goal?
Goal Seeking
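Goal seeking can be sketched as a search for the input that makes a model hit a target output. The example below uses bisection, which is only one possible search method, and a hypothetical linear `profit` model; both are assumptions for illustration.

```python
def goal_seek(model, target, low, high, tol=1e-6, max_iter=200):
    """Find an input in [low, high] for which model(input) is close to target.

    Assumes the model is continuous and that the target lies between
    model(low) and model(high).
    """
    f_low = model(low) - target
    f_high = model(high) - target
    if f_low * f_high > 0:
        raise ValueError("target is not bracketed by model(low) and model(high)")
    for _ in range(max_iter):
        mid = (low + high) / 2
        f_mid = model(mid) - target
        if abs(f_mid) < tol:
            return mid
        if f_low * f_mid <= 0:
            high = mid                 # the solution lies in the lower half
        else:
            low, f_low = mid, f_mid    # the solution lies in the upper half
    return (low + high) / 2


def profit(units):
    """Hypothetical model: 12 per unit sold minus 5000 in fixed costs."""
    return 12.0 * units - 5000.0


units_needed = goal_seek(profit, target=10000.0, low=0.0, high=10000.0)
print(round(units_needed, 2))
```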
What is speech analytics? How does it relate to sentiment analysis?
Growing field of science that allows users to analyze and extract information from both live and recorded conversations. Sentiment analysis, as it relates to speech analytics, focuses on assessing the emotional state expressed in a conversation and on measuring the presence and strength of positive and negative feelings that are exhibited by the participants.
Cluster Analysis
Grouping of objects (homogeneous within a group, heterogeneous between groups)
The technologies that come with Big Data are
Hadoop, MapReduce, NoSQL, and Hive
How can computers provide support for making structured decisions?
Help with operational and managerial controls. Therefore, it is possible to use a scientific approach for automating portions of managerial decision making.
Decision Support Data
Historical data that is queried intensively and stored in fewer, less normalized tables. Has large data volumes.
What combines the outcomes of two or more of the same type of models such as decision trees?
Homogeneous Ensemble Model
What is the extent to which actors form ties with similar vs. dissimilar others?
Homophily
What are the sub categories of connections in SNA?
Homophily, multiplexity, mutuality, network closure, propinquity.
What is it called when another firm develops and maintains the data warehouse?
Hosted Data Warehouse
What is one or more Web pages that provide a collection of links to authoritative pages, implicitly conferring authority on them within a narrow field?
Hub
What is the most famous data warehousing architecture today because it is focused on building a scalable and maintainable infrastructure?
Hub & Spoke Architecture -allows for easy customization of user interfaces and reports, but can have data redundancy and latency.
Human-generated data
Human-generated data is data that humans generate in interaction with computers. Human-generated structured data includes input data, click-stream data, and gaming data.
HOLAP
Hybrid OLAP: part MOLAP and part ROLAP
What is the most popular publicly known and referenced algorithm used to calculate hubs and authorities?
Hyperlink-Induced Topic Search (HITS)
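The hub and authority idea can be sketched as an iterative update: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of the pages it links to. The code below is a simplified illustration on a made-up link graph, not a production implementation; the function name `hits` and the page names are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Compute hub and authority scores for a dict of page -> linked pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking in.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {page: v / norm for page, v in auth.items()}
        # Hub update: sum the authority scores of all pages linked to.
        hub = {page: sum(auth[t] for t in links.get(page, [])) for page in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {page: v / norm for page, v in hub.items()}
    return hub, auth


# Hypothetical link graph: two link collections pointing at three pages.
links = {"portal": ["a", "b", "c"], "blog": ["a", "b"], "a": [], "b": [], "c": []}
hub, auth = hits(links)
print(max(hub, key=hub.get))
```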
What are the major hardware players that provide the infrastructure for database computing?
IBM, Dell, HP, Oracle.
What are tools used for predictive analytics?
IBM, Oracle, SAP, Teradata, Informatica
What companies provide indigenous hardware and software platforms?
IBM, Oracle, and Teradata
What is the level of understanding and insight provided by the model?
Interpretability
What is the OS polarity?
If the objectivity value is close to 1, then there is no opinion to mine.
How can analytics affect job satisfaction?
If the routine and mundane work can be done using an analytic system, then it should free up the managers and knowledge workers to do more challenging tasks. It was found that employees using ADS systems, especially those who are empowered by the systems, were more satisfied with their jobs.
Describe monotonic write consistency. Why is it so important?
If you were to issue several update commands, they would be executed in the order you issued them. This ensures that the results of a set of commands are predictable. Repeating the same commands with the same starting data will yield the same results.
How does clustering improve search effectiveness for text mining?
Improved search recall and search precision.
What is the integration of the algorithmic extent of data analytics into data warehousing?
In-Database Processing / In-Database Analytics *used for high-throughput, real-time application environments, including fraud detection, credit scoring, risk management, etc.*
What keeps the data permanently in the main memory?
In Memory Database
List and briefly discuss some of the text mining applications in marketing.
Increase cross-selling and up-selling by analyzing the unstructured data generated by call centers. Invaluable for customer relationship management. Analyze rich sets of unstructured text data, combined with the relevant structured data extracted from organizational databases, to predict customer perceptions and subsequent purchasing behavior.
What assumption states that the errors of the response variable are uncorrelated with each other?
Independence (weaker than actual statistical independence)
What is a small warehouse designed for strategic business unit or a department, but its source is not an enterprise data warehouse?
Independent Data Mart --lower cost & lower scale
What is the simplest and least costly architecture alternative?
Independent Data Marts *Developed to operate independent of each other and serve the needs of individual organizational units*
What is used to draw inferences or conclusions about the characteristics of the population?
Inferential Statistics
What are graphical representations of a model that can facilitate the identification process?
Influence Diagram
What is the identification of key phrases and relationships within text by looking for predefined objects and sequences in text by way of pattern matching?
Information Extraction
What is the splitting mechanism used in ID3, which is perhaps the most widely known decision tree algorithm and was developed by Ross Quinlan?
Information Gain
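Information gain is commonly computed as the entropy of the parent node minus the size-weighted entropy of the child nodes after a split. The sketch below illustrates that calculation; the helper names and the sample labels are assumptions for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
useful_split = [["yes", "yes"], ["no", "no"]]    # children are pure
useless_split = [["yes", "no"], ["yes", "no"]]   # children as mixed as the parent
print(information_gain(parent, useful_split))
print(information_gain(parent, useless_split))
```

The attribute whose split yields the largest gain would be chosen for branching.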
What is used to identify and stop malicious attacks on critical information infrastructure?
Information Warfare
What are the measures to assess the success of an architecture?
Information quality, system quality, individual impacts, and organizational impacts.
How do document databases differ from key-value databases?
Instead of storing each attribute of an entity with a separate key, document databases store multiple attributes in a single document. Users can query and retrieve documents by filtering on key-value pairs within a document.
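The difference can be mimicked with plain Python dictionaries. This is only an analogy, not the API of any real database; the keys, the customer data, and the helper `find_documents` are made up for illustration.

```python
# Key-value style: one value per key, so every attribute needs its own key.
kv_store = {
    "customer:1:name": "Ada",
    "customer:1:city": "Berlin",
    "customer:2:name": "Grace",
    "customer:2:city": "Boston",
}

# Document style: all attributes of an entity live in a single document.
doc_store = {
    "customer:1": {"name": "Ada", "city": "Berlin"},
    "customer:2": {"name": "Grace", "city": "Boston"},
}


def find_documents(store, **criteria):
    """Return documents whose key-value pairs match all given criteria."""
    return [
        doc for doc in store.values()
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


berliners = find_documents(doc_store, city="Berlin")
print(berliners)
```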
What does it mean to place data from different sources into a consistent format?
Integrated
Stages of rational decision making
Intelligence -> Design -> Choice
What are the common characteristics of data scientists?
Intense curiosity, creativity, communication/interpersonal skills, domain expertise, problem definition, managerial skills, technical skills (data manipulation, programming/hacking/scripting, internet and social media/networking)
Location Intelligence (LI)
Interactive maps that further drill down to details about any location
What reflects intermediate outcomes in mathematical models?
Intermediate Result Variables
Describe the three managerial roles, and list some of the specific activities in each.
Interpersonal: managers interact with people inside and outside their work units (figurehead, leader, liaison). Informational: managers receive and communicate information with other people inside and outside the organization (monitor, disseminator, spokesperson). Decisional: managers use information to make decisions to solve problems or take advantage of opportunities (entrepreneur, disturbance handler, resource allocator, negotiator).
What are variables that can be measured on interval scales?
Interval data, e.g., temperature
Define BI.
An umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies. Means different things to different people.
When is the accuracy calculated by leaving one sample out at each iteration of the estimation process?
Jackknifing
When is the complete data set randomly split into k mutually exclusive subsets of approximately equal size?
K-Fold Cross Validation / Rotation Estimation
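A minimal sketch of the splitting step, assuming only the standard library: the sample indices are shuffled, dealt into k folds, and each fold serves once as the test set while the rest form the training set. The function name `k_fold_indices` is hypothetical. Setting k equal to the number of samples gives the leave-one-out variant.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of approximately equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(10, k=5))
for train, test in splits:
    print(sorted(test), len(train))
```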
What represents a strategic objective and measures performance against a goal?
Key Performance Indicator (KPI)
What is descriptive analytics? What tools are employed in descriptive analytics?
Knowing what is happening in the organization and understanding some underlying trends and causes of such occurrences. Tools include reports, queries, alerts, and trends using various reporting tools and techniques. Major player is visualization.
KDD
Knowledge Discovery from Databases
What is the process of using data mining methods to find useful information and patterns in data, which involves using algorithms to identify those patterns?
Knowledge Discovery in Databases (KDD)
When would an expert's knowledge about the categories be encoded into the system either declaratively or in the form of procedural classification rules?
Knowledge Engineering Approach
KDD
Knowledge discovery in databases - data mining is a front-end technology - data mining is a step within the KDD process
What are the two main approaches to text classification?
Knowledge engineering and machine learning.
What are other names of data mining?
Knowledge extraction, pattern analysis, data archaeology, information harvesting, pattern searching, and data dredging
What measures the degree to which a distribution is more or less peaked than a normal distribution?
Kurtosis
What represents the final class choice for a pattern?
Leaf Node
What is used when every data point is used for testing once, so that as many models are developed as there are data points?
Leave One Out *time consuming, but best for small data sets
What is the catalog of words, their synonyms, and their meanings for a given language, from which a variety of special-purpose versions can be created for use in sentiment analysis projects?
Lexicon
What is the best known technique in a family of optimization tools called mathematical programming?
Linear Programming *all relationships among variables are linear
What assumption states that the relationship between the response variable and the explanatory variables is linear?
Linearity
What are assumptions associated in linear regression?
Linearity, independence, normality, constant variance, and multicollinearity
When is the linkage among many objects of interest discovered automatically?
Link Analysis
What involves putting the data into the data warehouse?
Load
What is a very popular, statistically sound, probability-based classification algorithm that employs supervised learning?
Logistic Regression
What is used to classify a categorical variable?
Logistic Regression
Types of analytical processing
MOLAP (multidimensional online analytical processing) is an alternative to ROLAP (relational OLAP) that indexes directly into a multidimensional database. ROLAP (relational online analytical processing) is an alternative to MOLAP. HOLAP (hybrid online analytical processing) is a combination of ROLAP and MOLAP. SQL (pronounced "ess-que-el") stands for Structured Query Language and is used to communicate with a database. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems.
Options for OLAP
MOLAP, ROLAP
When would a general inductive process build a classifier by learning from a set of preclassified examples?
Machine Learning Approach
The sources of structured data include:
Machine-generated data & Human-generated data (structured)
The sources of unstructured data include:
Machine-generated unstructured data & Human-generated unstructured data
List and describe the major components of BI.
Major objective is to enable interactive access (sometimes in real time) to data, to enable manipulation of data, and to give business managers and analysts the ability to conduct appropriate analysis. By analyzing historical and current data, situations, and performances, decision makers get valuable insights that enable them to make more informed and better decisions. Process is based on the transformation of data to information, then to decisions, and finally to actions.
Scope BI
Management Support Systems (Focus: Planning, organization, Control)
Business Advantages of a Relational Database 5) Increased Information Security
Managers must protect information, like any asset, from unauthorized users or misuse Security risks are increasing as more and more databases and DBMS systems are moving to data centers run in the cloud
Distance calculation
Manhattan: |Bx - Ax| + |By - Ay|. Euclidean: square root of the sum of the squared axis differences, sqrt((Bx - Ax)^2 + (By - Ay)^2).
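Both distance measures can be written for points with any number of coordinates. The sketch below is illustrative; the points `A` and `B` are made-up values.

```python
import math


def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of the squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


A, B = (1, 2), (4, 6)
print(manhattan(A, B))   # |4 - 1| + |6 - 2|
print(euclidean(A, B))   # sqrt(3^2 + 4^2)
```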
When can text mining be used to increase cross selling and up selling by analyzing the unstructured data generated by call centers?
Market Applications -Invaluable for CRM!
What are subcategories of association?
Market basket, link analysis, and sequence analysis.
What is a family of tools designed to help solve managerial problems in which the decision maker must allocate scarce resources among competing activities to optimize a measurable goal?
Mathematical Programming
What is a simpler way to calculate the overall deviation from the mean and is calculated by measuring the absolute values of the differences between each data point and the mean?
Mean Absolute Deviation
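As a small worked example with made-up numbers, the mean absolute deviation averages the absolute distances of the data points from their mean:

```python
def mean_absolute_deviation(values):
    """Average absolute difference between each value and the mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample, mean is 5
mad = mean_absolute_deviation(data)
print(mad)
```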
What is a single numerical value that aims to describe a set of data by simply identifying or estimating the central position within the data?
Measure of Central Tendency
What is the measure of the center value in a given data set?
Median
What is the most standardized and orderly information source, making it more minable?
Medical Literature
What can drive changes in business intelligence?
Mergers & acquisitions, regulatory requirements, and introduction of new channels.
What describes the structure and some meaning about data contributing to their effective or ineffective use?
Metadata
List and describe the three major categories of business reports.
Metric management reports: business performance is managed through outcome-oriented metrics, with enterprise-wide agreed targets tracked over a period of time. Dashboard-type reports: present a range of different performance indicators on one page. Vendors provide a set of predefined reports with static elements and fixed structure, but also allow for customization. Balanced scorecard-type reports: present an integrated view of success in an organization. In addition to financial performance, they also include customer, business process, and learning and growth perspectives.
What is a worldwide source for access to Microsoft's SQL Server suite for academic purposes (teaching and research)?
Microsoft Enterprise Consortium
Where are data and models stored in the same relational database environment, making model management a considerably easier task?
Microsoft SQL Server
Who provides easy to use tools for reporting or descriptive analytics?
Middleware Providers i.e. Oracle, SAP, and IBM
Who provides tools that enable reporting or descriptive analytics?
Middleware industry players i.e. Microsoft SQL, Tableau, SAS
What enables access to the data warehouse?
Middleware tools
What is the observation that occurs most frequently?
Mode *most useful for data with a small number of unique values
What is the most common two step methodology of classification type?
Model development/training and model testing/deployment.
Kimball model
Model with the data mart approach (bottom up)
Inmon model
Model, also known as the EDW approach, emphasizes top-down development, employing established database development methodologies and tools, such as entity-relationship diagrams (ERD), and an adjustment of the spiral development approach.
Kimball model
Model, also known as the data mart approach, is a "plan big, build small" approach. A data mart is a subject-oriented or department-oriented data warehouse. It is a scaled-down version of a data warehouse that focuses on the requests of a specific department, such as marketing or sales.
What is the most common simulation method for business decisions that begins with building a model of the decision problem without having to consider the uncertainty of any variables?
Monte Carlo Simulation
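The idea can be sketched in two steps: first a deterministic model of the decision problem, then repeated sampling of the uncertain input. Everything in the example below (the `profit` model, its prices and costs, and the triangular demand distribution) is a made-up assumption for illustration.

```python
import random


def profit(demand, price=10.0, unit_cost=6.0, fixed_cost=2000.0):
    """Deterministic model: profit for a given demand, ignoring uncertainty."""
    return (price - unit_cost) * demand - fixed_cost


def simulate(n_trials=10_000, seed=42):
    """Sample uncertain demand many times and evaluate the model each time."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_trials):
        # Assumed demand distribution: low 300, high 900, most likely 600.
        demand = rng.triangular(300, 900, 600)
        outcomes.append(profit(demand))
    return outcomes


outcomes = simulate()
mean_profit = sum(outcomes) / len(outcomes)
loss_probability = sum(o < 0 for o in outcomes) / len(outcomes)
print(round(mean_profit, 1), round(loss_probability, 3))
```

The distribution of outcomes, rather than a single point estimate, is what the decision maker would inspect.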
What is a branch of the field of linguistics and a part of the NLP that studies the internal structure of words?
Morphology
What is the assumption that states that the explanatory variables are not correlated?
Multicollinearity
What involves data analysis in several dimensions and are generally shown in a spreadsheet format?
Multidimensional Analysis
HOLAP Model
Multidimensional Data Types, Relational Data Types, Tools -> OLAP API/SQL
What is the number of content forms contained in a tie?
Multiplexity
What are the systems that convert information from computer databases into readable human language?
Natural Language Generation
What is a subfield of artificial intelligence and computational linguistics that studies the problem of understanding natural human language, with the view of converting depictions of human language into more formal representations that are easier for computer programs to manipulate?
Natural Language Processing
What is the measure of the completeness of relational triads?
Network Closure
What is the term for describing analytics that relate to groups of people, social networks, supply chain networks, etc?
Network Science
What involves the development of mathematical structures that have the capability to learn from past experiences presented in the form of well structured data sets?
Neural Networks -Classification algorithm
What are the necessary conditions for a good expert?
No ES can be designed without the strong support of knowledgeable and supportive experts. A proper expert should have a thorough understanding of problem-solving knowledge, the role of ES and decision support technology, and good communication skills.
What are the two fundamental data structures in a graph database?
Nodes and relationships, also called vertices and edges.
What has finite non-ordered values?
Nominal Data
What contains measurements of simple codes assigned to objects as labels which are not measurements?
Nominal data - can be binomial (two possible values) or multinomial (three or more possible values), e.g., the variable marital status (single, married, divorced)
What means that some experimentation-type search or inference is involved?
Nontrivial
What does it mean if users can't change or update the data?
Nonvolatile
What assumption states that the errors of the response variable are normally distributed?
Normality
What means that the patterns are not previously known to the user within the context of the system being analyzed?
Novel
What are numeric values?
Numeric Data
What represents the numeric values of specific variables?
Numeric Data / Continuous Data (scalable data) -- can be integer or real.
Operational data store
ODS. Provides a fairly recent form of customer information file (CIF). This type of database is often used as an interim staging area for a data warehouse. Used for short-term decisions: it holds only recent information, whereas a data warehouse stores permanent information. An ODS consolidates data from multiple source systems and provides a near-real-time, integrated view of volatile, current data.
What does an analyst use to navigate through the database and screen for a particular subset of the data by changing the data's orientations and defining analytical calculations?
OLAP
What is the approach to quickly answer ad hoc questions by executing multidimensional analytical queries against organizational data repositories?
OLAP
What is the most commonly used data analysis technique in data warehouses and has been growing in popularity due to the exponential increase in data volumes and the recognition of the business value of data driven analytics?
OLAP
What is OLAP and how does it differ from OLTP?
OLAP (online analytical processing) is an approach to quickly answer ad hoc questions by executing multidimensional analytical queries against organizational data repositories (for example, data warehouses and data marts). OLTP (online transaction processing) is a term used for a transaction system, which is primarily responsible for capturing and storing data related to day-to-day business functions such as ERP, CRM, SCM, point of sale, and so forth.
most common analysis technique in data warehouse?
OLAP online analytical processing.
What is the term for a transaction system that is primarily responsible for capturing and storing data related to day-to-day business functions such as ERP, CRM, SCM, POS, and so forth?
OLTP
What is a common representation schema of the frequency based relationship between the terms and documents in tabular format where terms are listed in columns?
Occurrence Matrix / Term by Document Matrix
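A term-by-document matrix can be built by counting how often each term occurs in each document. The sketch below uses naive whitespace tokenization and two made-up documents; rows are documents and columns are terms, matching the layout described above. The function name is hypothetical.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (terms, rows): one row of term frequencies per document."""
    tokenized = [doc.lower().split() for doc in documents]
    terms = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in terms])
    return terms, rows


docs = ["data mining finds patterns", "text mining mines text"]
terms, matrix = term_document_matrix(docs)
print(terms)
for row in matrix:
    print(row)
```

Real text mining pipelines usually add steps such as stop-word removal and stemming before counting.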
What refers to web measurement and analysis about you and your products that takes place outside your web site?
Off Site Web Analytics
What are the two main categories of Web analytics?
Off site and on site.
What measures visitors' behavior once they are on the web site and measures the site's performance in a commercial context?
On Site Web Analytics
How many values can be stored with a single key in a key-value database?
One
OLAP
Online Analytical Processing - live data - reporting
What is the term used for analyzing, characterizing, and summarizing structured data stored in organizational databases?
Online Analytical Processing (OLAP)
What handles a company's routine ongoing business and responds immediately to user requests?
Online Transaction Processing (OLTP)
Types of analytical processing activities:
Online analytical processing (OLAP), data mining, querying, reporting, and other decision-support applications.
OLAP vs OLTP
Online analytical processing vs. online transaction processing. OLTP is for capturing and storing data for day-to-day business functions such as ERP, CRM, SCM, point of sale, and so forth. It is not suited to ad hoc and complex queries that deal with a number of data items. OLAP, on the other hand, is designed to address this need by providing ad hoc analysis of organizational data much more effectively and efficiently. OLAP and OLTP rely on each other: OLAP uses the data captured by OLTP, and OLTP automates the business processes that are managed by decisions supported by OLAP.
OLTP
Online transaction processing (traditional relational DBMS)
What consolidates data from multiple source systems and provides a near real time, integrated view of volatile current data?
Operational Data Stores
What provides a fairly recent form of customer information file and is used as an interim staging area for a data warehouse?
Operational Data Stores
What is used for short term decisions involving mission critical applications rather than for the medium and long term decisions associated with EDW?
Operational Data Stores (think short term memory)
What translates an organization's strategic objectives and goals into a set of well-defined tactics and initiatives, resource requirements, and expected results for some future time period?
Operational Plan *key to success is integration
abstract architecture
Operational Systems -> ETL Process -> Data Warehouse -> Front-end Software -> Warehouse users
What decision support model used data obtained from domain experts' use of manual processes to build mathematical or knowledge-based models to solve constrained optimization problems?
Operations Research
An ODS is a
Operational data store: a type of customer-information-file database that is often used as a staging area for a data warehouse.
What are some other names for sentiment analysis?
Opinion mining, subjectivity analysis, and appraisal extraction
What is the solution that has the highest degree of goal attainment associated with it known as?
Optimal Solution
What are enablers of prescriptive analytics?
Optimization, simulation, decision modeling, and expert systems.
What has finite ordered values?
Ordinal Data
What contains codes assigned to objects or events as labels that also represent the rank order among them?
Ordinal data, e.g., credit score, age group.
What aims to minimize the sum of squared residuals and leads to a mathematical expression for the estimated value of the regression line?
Ordinary Least Squares Method
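For a single explanatory variable, minimizing the sum of squared residuals leads to closed-form expressions for the slope and intercept of the regression line. The sketch below applies them to made-up data that lie exactly on y = 1 + 2x; the function name `ols_fit` is hypothetical.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]   # generated from y = 1 + 2x
intercept, slope = ols_fit(x, y)
print(intercept, slope)
```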
What are lagging indicators that measure the output of past activity?
Outcome KPI (financial in nature)
What uses JavaScript embedded in the site page code to make image requests to a 3rd party analytics dedicated server whenever a page is rendered by a web browser?
Page Tagging
What is the most basic of measurements and is presented as the average page views per visitor?
Page Views
What enables multiple CPUs to process data warehouse query requests simultaneously and provides scalability?
Parallel Processing
What are some of the challenges of NLP?
Part-of-speech tagging: It is difficult to mark up terms in a text as corresponding to a particular part of speech because the part of speech depends not only on the definition of the term but also on the context within which it is used. Text segmentation: Some written languages, such as Chinese, Japanese, and Thai, do not have single-word boundaries. Word sense disambiguation: Many words have more than one meaning. Selecting the meaning that makes the most sense can only be accomplished by taking into account the context within which the word is used. Syntactic ambiguity: The grammar for natural languages is ambiguous; that is, multiple possible sentence structures often need to be considered. Choosing the most appropriate structure usually requires a fusion of semantic and contextual information. Imperfect or irregular input: Foreign or regional accents and vocal impediments in speech and typographical or grammatical errors in texts make the processing of the language an even more difficult task. Speech acts: A sentence can often be considered an action by the speaker. The sentence structure alone may not contain enough information to define this action
Motivations for Redundant (Analytic) Data Storage
Performance, accessibility
What assists managers in tracking the implementation of business strategy by comparing actual results against strategic goals and objectives?
Performance Measurement Systems
What are examples of decision analysis attributes?
Performance measures, operational metrics, aggregated measures, and all the others to analyze the organization's performance.
Describe a two-phase commit. Does it help ensure consistency or availability?
Phase 1: the database writes, or commits, the data to the disk of the primary server. Phase 2: The database writes data to the disk of the backup server. It helps ensure consistency because if the primary server fails, it can switch to the backup database.
What is used to change the dimensional orientation of a report or ad hoc query page display?
Pivot
What can be made at the word, term, sentence, or document level?
Polarity Identification
What are other areas that utilize sentiment analysis applications?
Politics, government intelligence, and e-Commerce sites.
What means that the discovered patterns should lead to some benefit to the user or task?
Potentially Useful
What is the act of telling about the future?
Prediction / Forecasting
What are categories of data mining tasks?
Prediction, Association, and Segmentation
What are the key differences among the major data mining methods?
Prediction: the act of telling about the future. It differs from simple guessing by taking into account the experiences, opinions, and other relevant information in conducting the task of foretelling. A term that is commonly associated with prediction is forecasting. Even though many believe that these two terms are synonymous, there is a subtle but critical difference between the two. Whereas prediction is largely experience and opinion based, forecasting is data and model based. That is, in order of increasing reliability, one might list the relevant terms as guessing, predicting, and forecasting, respectively. In data mining terminology, prediction and forecasting are used synonymously, and the term prediction is used as the common representation of the act.
What tells the nature of future occurrences of certain events based on what has happened in the past?
Predictions
What is the most commonly used assessment factor for classification models, measuring how well a model predicts the class label of new or previously unseen data?
Predictive Accuracy
What aims to determine what is likely to happen in the future and is based on statistical techniques?
Predictive Analytics
What answers the question "What will happen?" and "Why will it happen?"
Predictive Analytics
Where has the biggest growth in analytics been?
Predictive Analytics
What answers the question "What should I do?" or "Why should I do it?"
Prescriptive Analytics
What is used to provide a decision or a recommendation for a specific action?
Prescriptive Analytics
What is used to recognize what is going on as well as the likely forecast and make decisions to achieve the best performance possible?
Prescriptive Analytics
What is modeling a key element for?
Prescriptive analytics
In what simulation would one or more of the independent variables be probabilistic?
Probabilistic Simulation
What implies that data mining comprises many iterative steps?
Process
What is the tendency for actors to have more ties with geographically close others?
Propinquity
What is a performance dashboard? Why are they so popular for BI software tools?
Provide visual displays of important information that is consolidated and arranged on a single screen so that information can be digested at a single glance and easily drilled in and further explored. Gives a quick and accurate idea of what is going on within the organization.
Why is AaaS cost-effective?
Provides many virtual analytical applications with better scalability and higher cost savings. With growing data volumes and dozens of virtual analytical applications, chances are that more of them leverage processing at different times, usage patterns, and frequencies.
What contains both nominal and ordinal data?
Qualitative Data / Categorical Data
What is made up of result variables, decision variables, uncontrollable variables, and intermediate result variables?
Quantitative Models
What is a quarter of the number of data points given in a data set?
Quartile
What is a useful measure of dispersion because it is much less affected by outliers or skewness in the data set?
Quartile. Reported along with the median as the best choice of measure of dispersion and central tendency.
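A small sketch of why quartiles are reported with the median: on a made-up data set with one extreme value, the quartiles barely notice the outlier while the range is dominated by it. The data and variable names are illustrative only.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 10, 12, 95]   # 95 is a deliberate outlier

# statistics.quantiles with n=4 returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                            # spread of the middle half of the data
value_range = max(data) - min(data)      # simplest measure of dispersion

print(q1, q2, q3, iqr, value_range)
```

The second quartile equals the median, and the interquartile range stays small even though the range is stretched by the single outlier.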
What are the two components of a response cycle?
Query Analyzer and Document Matcher/Ranker
What employs a hierarchical clustering approach where the most relevant documents to the posed query appear in small tight clusters that are nested in larger clusters containing less similar documents, creating a spectrum of relevance levels among the documents?
Query Specific Clustering
What is the task of automatically answering a question posed in natural language?
Question Answering
What are the open source platforms that have emerged as popular industrial strength software tools for predictive analytics?
R, RapidMiner, and KNIME
What ranges from 0 to 1 with 0 indicating that the proposed model is NOT a good fit and 1 indicating that the proposed model is a perfect fit?
R² (R-squared)
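One common way to compute this measure is 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper below is a minimal sketch with a hypothetical function name; note that the 0-to-1 range holds for a least-squares fit with an intercept, while an arbitrary set of predictions can score below 0.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(observed, [1.0, 2.0, 3.0, 4.0])     # predictions match exactly
mean_only = r_squared(observed, [2.5, 2.5, 2.5, 2.5])   # always predicts the mean
print(perfect, mean_only)
```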
What is the difference between the largest and smallest values in a given data set?
Range (simplest measure of dispersion)
What is the most popular general platform for data mining/data science?
RapidMiner
What includes measurement variables commonly found in the physical sciences and engineering?
Ratio Data i.e. Mass, length, time, plane angle, energy
What implies that the refresh cycle of an existing data warehouse to update the data is more frequent?
Real-Time Data Warehousing
Operational Data
Real-time data stored in relational database optimized to support daily transactions. Many tables that are normalized and is updated intensively.
What is prescriptive analytics? What kinds of problems can be solved by prescriptive analytics?
Recognize what is going on as well as the likely forecast and make decisions to achieve the best performance possible. These recommendations can be in the form of a specific yes/no decision for a problem, a specific amount (say, the price to charge for a specific item), or a complete set of production plans. They may be delivered in a report or through an automated decision rules system.
What attempts to describe the dependence of a response variable on one or more explanatory variables, where it implicitly assumes that there is a one-way causal effect from the explanatory variable(s) to the response variable?
Regression
What is a simple statistical technique to model the dependence of a variable on one or more explanatory variables?
Regression
What is concerned with the relationships between all explanatory variables and the response variable?
Regression
What is one of the most widely known and used analytics techniques in statistics, used for hypothesis testing and prediction/forecasting?
Regression
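To make the regression cards concrete, here is a minimal sketch of simple linear regression (one explanatory variable) fitted by ordinary least squares. The function name and the toy data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data: the response rises by 2 for each unit of the explanatory variable.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
forecast = intercept + slope * 5       # prediction for an unseen x value
print(slope, intercept, forecast)
```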
What is a dependent variable also known as?
Response or output
What reflects the level of effectiveness of a system by indicating how well the system performs or attains its goals?
Result/Outcome Variables
What is the mantra for business intelligence?
Right information at the right time and in the right place.
When must the decision maker consider several possible outcomes for each alternative each with a given probability of occurrence?
Risk / Probabilistic / Stochastic Decision Making
What is a decision making method that analyzes the risk associated with different alternatives?
Risk Analysis
What is the ability to make reasonably accurate predictions given noisy data or data with missing and erroneous values?
Robustness
the three key factors that affect the presentation ability
Role: different user groups (CEO, middle manager, customer support, ...). Task: every task requires different content and format of the information. Preference: individuals differ in their preference (big picture vs. detail). --> a good BI solution should
OLAP operations
Roll up, Drill Down, Slice and dice, Pivot (rotate)
What involves computing all the data relationships for one or more dimensions?
Roll-Up
What are structured decisions? Provide two examples.
Routine and typically repetitive problems for which standard methods exist. finding an appropriate inventory level, choosing an optimal investment strategy
What system captured experts' knowledge in a format that computers could process so that it could be used for consultation and allowed scarce expertise to be made available where and when needed?
Rule Based Expert Systems
Who are some examples of ETL providers?
SAS, Microsoft, Oracle, IBM
What are some tools used for predictive analytics?
SAS, SPSS, and IBM
What takes a statistically representative sample of data, applies exploratory statistical and visualization techniques, selects and transforms the most significant predictive variables, models the variables to predict outcomes, and confirms a model's accuracy?
SEMMA
What are some data solution providers offering hardware and platform independent database management systems?
SQL Server family of Microsoft and SAP
What are the most commonly used database management systems?
SQL Server, Oracle, and DB2
What is a creative way of deploying information systems applications where the provider licenses its applications to customers for use as a service on demand?
SaaS (Extended ASP Model)
Business Applications of Regression
Sales predictions, financial forecasting, residual value estimation...
What are the steps of the SEMMA Data Mining Process?
Sample, Explore, Modify, Model, and Assess.
What is the ability to construct a prediction model efficiently given a rather large amount of data?
Scalability
What are the two most popular clustering methods for text mining?
Scatter/gather and query specific clustering
What is a software program that searches for documents, based on keywords users have provided, that have to do with the subject of their inquiry?
Search Engine
What is the intentional activity of affecting the visibility of an e-commerce site or a web site in a search engine's natural search results?
Search Engine Optimization (SEO)
What are the main concerns for a data warehouse professional?
Security & privacy of information
What are URLs known as?
Seeds
What is the most common method for solving this risk analysis problem?
Select the alternative with the greatest expected value.
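The expected-value rule can be sketched in a few lines: weight each outcome by its probability, sum per alternative, and pick the largest. The investment alternatives and payoffs below are invented for the example.

```python
# Each alternative maps to a list of (probability, payoff) outcomes.
alternatives = {
    "bonds": [(0.5, 12.0), (0.5, 4.0)],
    "stocks": [(0.5, 20.0), (0.5, -6.0)],
    "cds": [(1.0, 6.5)],
}

def expected_value(outcomes):
    """Probability-weighted average payoff of one alternative."""
    return sum(p * payoff for p, payoff in outcomes)

expected = {name: expected_value(o) for name, o in alternatives.items()}
best = max(expected, key=expected.get)
print(expected, best)
```

Here the riskier alternative has the highest possible payoff but not the highest expected value, which is what the rule compares.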
Self-service business intelligence (SSBI)
Self-service business intelligence (SSBI) is an approach to data analytics that enables business users to access and work with corporate data even though they do not have a background in statistical analysis, business intelligence (BI) or data mining. Allowing end users to make decisions based on their own queries and analyses frees up the organization's business intelligence and information technology (IT) teams from creating the majority of reports and allows those teams to focus on other tasks that will help the organization reach its goals.
How can computers provide support to semistructured and unstructured decisions?
Semistructured- use combination of standard solution and human judgment. Management science can provide models for a portion of decision-making problems. For the unstructured a DSS (Decision Support System) can improve the quality of the info on which the decision is based with alternatives and potential impacts. Unstructured- can only be partially supported by standard computerized quantitative methods. Have to create customized solutions. Intuition and judgment may play a larger role.
What attempts to assess the impact of change in the input data or parameters on the proposed solution?
Sensitivity Analysis
What collects a massive amount of data at a faster rate and have been adopted by various sectors such as healthcare, sports, and energy?
Sensors
What is a technique used to detect favorable and unfavorable opinions toward specific products and services using a large number of textual data sources?
Sentiment Analysis
When are the relationships examined in terms of their order of occurrence to identify associations over time?
Sequence Mining
What is the discovery of time ordered events?
Sequential Relationships
What are the two technical ways of collecting the data for on site analytics?
Server log files analysis and page tagging
What are the main differences among line, bar and pie charts? When should you choose one over the other?
Line charts show the relationship between two variables; they are most often used to track changes or trends over time, connecting individual data points to help infer changing trends over a period of time. Bar charts are used to compare data across multiple categories; they are effective when you have nominal data or numerical data that splits into different categories to compare results. Pie charts are used to illustrate relative proportions of a specific measure, such as percentages in categories. If the number of categories is more than 4, use a bar chart instead.
What are the major data mining processes?
Similar to other information systems initiatives, a data mining project must follow a systematic project management process to be successful. Several data mining processes have been proposed: CRISP-DM, SEMMA, and KDD.
Centralized data warehouse
Similar to the hub-and-spoke architecture, except that there are no dependent data marts; instead, one big enterprise data warehouse serves the needs of all organizational units. More holistic view.
What partitions the data into two mutually exclusive subsets called a training set and a test set?
Simple Split
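A minimal sketch of a simple split, assuming the 2/3 training and 1/3 test proportions used elsewhere in this set. The function name, seed, and record list are hypothetical.

```python
import random

def simple_split(records, train_fraction=2 / 3, seed=42):
    """Shuffle once, then cut into two mutually exclusive subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(30))                   # stand-in for 30 data rows
train_set, test_set = simple_split(records)
print(len(train_set), len(test_set))
```

Shuffling before cutting avoids a split that simply follows the original row order.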
What is normally used when a problem is too complex to be treated using numerical optimization techniques?
Simulation
What is the appearance of reality and is a technique for conducting experiments with a computer on a model of a management system?
Simulation
What reduces the overall dimensionality of the input matrix to a lower dimensional space where each consecutive dimension represents the largest degree of variability possible?
Singular Value Decomposition
What is a performance management methodology aimed at reducing the number of defects in a business process to as close to 0 DPMO as possible?
Six Sigma
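DPMO stands for defects per million opportunities. As a quick worked sketch with made-up inspection numbers, it can be computed as follows (the function name is an assumption):

```python
def dpmo(defects, units, opportunities_per_unit):
    """Defects per million opportunities."""
    return defects / (units * opportunities_per_unit) * 1_000_000

# Hypothetical inspection: 17 defects found in 500 units,
# each unit having 10 opportunities for a defect.
result = dpmo(defects=17, units=500, opportunities_per_unit=10)
print(round(result))
```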
What is a measure of asymmetry in a distribution of data that has a unimodal structure, that is, only one peak in the distribution?
Skewness
What is a subset of multidimensional array corresponding to a single value set for one or more of the dimensions not in the subset?
Slice (3D cube)
What are commonly used OLAP operations?
Slice & Dice, drill down, roll-up, and pivot.
Slice And Dice
Slice and dice refers to a strategy for segmenting, viewing and understanding data in a database. Users slice and dice by cutting a large segment of data into smaller parts, and repeating this process until arriving at the right level of detail for analysis. Slicing and dicing helps provide a closer view of data for analysis and presents data in new and diverse perspectives.
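The OLAP operations named in these cards can be sketched on a tiny in-memory cube. This is only an illustration with hypothetical sales rows and helper names, not how an OLAP server implements them.

```python
from collections import defaultdict

# Hypothetical facts: each row is one cell of a year x region x product cube.
SALES = [
    {"year": 2023, "region": "East", "product": "Laptop", "amount": 100},
    {"year": 2023, "region": "West", "product": "Laptop", "amount": 150},
    {"year": 2023, "region": "East", "product": "Phone", "amount": 80},
    {"year": 2024, "region": "East", "product": "Laptop", "amount": 120},
    {"year": 2024, "region": "West", "product": "Phone", "amount": 90},
]

def slice_cube(rows, dimension, value):
    """Slice: fix a single value on one dimension."""
    return [r for r in rows if r[dimension] == value]

def dice_cube(rows, **allowed):
    """Dice: keep rows whose value on each named dimension is allowed."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]

def roll_up(rows, *dimensions):
    """Roll-up: total the amount, keeping only the listed dimensions."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dimensions)] += r["amount"]
    return dict(totals)

by_year = roll_up(SALES, "year")                       # coarse summary
by_year_region = roll_up(SALES, "year", "region")      # drill down one level
only_2024 = slice_cube(SALES, "year", 2024)
east_laptops = dice_cube(SALES, region={"East"}, product={"Laptop"})
```

Drilling down is the reverse of rolling up: adding a dimension back to the grouping exposes more detail.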
What is a logical arrangement of tables in a multidimensional database in such a way that the entity relationship diagram resembles a snowflake in shape?
Snowflake Schema - dimensions are normalized into multiple related tables.
What is the mining of textual context created in social media and analyzing socially established networks for the purpose of gaining insight about existing and potential customers' current and future behaviors and about the likes and dislikes toward a firm's product/service?
Social Analytics
What are the enabling technologies of social interactions among people in which they create, share, and exchange information?
Social Media
What are the systematic and scientific ways to consume the vast amount of content created by web-based social media outlets, tools, and techniques for the betterment of an organization's competitiveness?
Social Media Analytics
What is a social structure composed of individuals/people linked to one another with some type of connections/relationships and provides a holistic approach to analyzing the structure and dynamics of social entities?
Social Network
What is a theoretical construct useful in the social sciences to study relationships between individuals, groups, organizations, or even societies?
Social Network
What follows the links between friends, fans, and followers to identify connections of influence as well as the biggest sources of influence?
Social Network Analysis
What is the systematic examination of social networks that view social relationships in terms of network theory?
Social Network Analysis (SNA)
What can be placed on a separate server in the network or on the transactional application databases themselves and can use event and process based approaches to proactively and intelligently measure and monitor operational processes?
Software Monitors / Intelligent Agents
internal data sources
Sources: OLTP, ERP, CRM. Kind of data: production, planning, sales, customer, marketing, organizational. Maintained in different formats: structured documents, unstructured documents.
What recent technologies may shape the future of data warehousing?
Sourcing: Web/Social Media/Big Data, Open source software, Software as a service, cloud computing Infrastructure: Columnar, Real-time data warehousing, DW appliances, data management technologies and practices, In-database processing technology, In-memory storage technology, New database management systems, advanced analytics
What converts spoken words to machine readable input?
Speech Recognition
What are the computation costs involved in generating and using the model where faster is deemed to be better?
Speed
What is a test on one or more attributes and determines how the data are to be divided further?
Split Point
What is the most popular end user modeling tool because it incorporates many powerful financial, statistical, mathematical, and other functions?
Spreadsheet
What is natural language processing?
A subfield of artificial intelligence and computational linguistics. It studies the problem of "understanding" the natural human language, with the view of converting depictions of human language (such as textual documents) into more formal representations (in the form of numeric and symbolic data) that are easier for computer programs to manipulate.
What is the measure of the spread of values within a set of data?
Standard Deviation
What is the most commonly used and the simplest style of dimensional modeling that contains a central fact table surrounded and connected to dimension tables?
Star Schema
What is designed to provide fast query response time, simplicity, and ease of maintenance for read only database structures?
Star Schema - dimensions are denormalized, with each dimension being represented by a single table.
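A star schema can be sketched with Python's built-in sqlite3 module: one central fact table joined to single-table dimensions. The table names, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one denormalized table per dimension.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
    -- Central fact table: foreign keys to the dimensions plus a measure.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        revenue REAL
    );
    INSERT INTO dim_product VALUES (1, 'Laptop', 'Hardware'), (2, 'Phone', 'Hardware'),
                                   (3, 'Antivirus', 'Software');
    INSERT INTO dim_date VALUES (1, 2023, 'Q1'), (2, 2024, 'Q1');
    INSERT INTO fact_sales VALUES (1, 1, 1000), (2, 1, 500), (3, 1, 200),
                                  (1, 2, 1200), (3, 2, 300);
""")

# A typical star query: join the fact table to its dimensions and aggregate.
rows = conn.execute("""
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
print(rows)
```

Each dimension is reached with a single join from the fact table, which is what keeps queries simple and fast for read-only analysis.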
multi dimensional data in relational db
Star schema, snowflake schema, fact constellation
What is a collection of mathematical techniques to characterize and interpret data?
Statistics
What is the process of reducing inflected words to their stem form?
Stemming
What are words that are filtered out prior to or after processing of natural language data?
Stop words
What are the two aspects to managing data that can't be stored in a single unit?
Storing and processing.
What are the steps of a closed loop BPM strategy?
Strategize, plan, monitor/analyze, and act/adjust
What is a high level plan of action, encompassing a long period of time to achieve a defined goal?
Strategy
What are features of a KPI?
Strategy, targets, ranges, encoding, time frames, and benchmarks.
What is the absence of ties between two parts of a network?
Structural Holes
What do data mining algorithms use that can be classified as categorical or numeric?
Structured data -Categorical: Nominal, ordinal -Numerical: Interval, ratio
What enables users to determine how their business is performing and why?
Subject Oriented -- provides a more comprehensive view of the organization.
Characteristics of Data Warehousing include
Subject oriented (data organized by detailed subject such as sales or customer), integrated (consistent format), time variant (maintains historical data), nonvolatile (users can't change data; changes are recorded as new data).
What are characteristics of data warehousing?
Subject oriented, integrated, time variant, nonvolatile.
What are the types of metadata (based on pattern)?
Syntactic, structural, and semantic.
Meta-flow:
System modeling: to define structure of legacy systems, synthesizing to create valued, regulating to create modules for capturing.
What is a single word or multi-word phrase extracted directly from the corpus of a specific domain by means of NLP methods?
Term
When would rows represent documents and columns represent terms?
Term Document Matrix
What is used when the classifier is built and then tested, and has 1/3 of the data?
Test Set
What is more commonly used in a business application context?
Text Analytics -- relatively new term
What is frequently used in academic research?
Text Mining
What is the semiautomated process of extracting patterns from large amounts of unstructured data sources?
Text Mining
What is text analytics: How does it differ from text mining?
Text analytics is a concept that includes information retrieval (e.g., searching and identifying relevant documents for a given set of key terms) as well as information extraction, data mining, and Web mining. Text mining is the semi-automated process of extracting patterns (useful information and knowledge) from large amounts of unstructured data sources.
What are the main steps in the text mining process?
Text mining entails three tasks: 1. Establish the Corpus: Collect and organize the domain-specific unstructured data 2. Create the Term-Document Matrix: Introduce structure to the corpus 3. Extract Knowledge: Discover novel patterns from the T-D matrix
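The first two steps can be sketched on a toy corpus: collect the documents, then count term occurrences per document to build the matrix, with rows as documents and columns as terms. The stop word list, tokenizer, and documents are simplified assumptions; real pipelines also apply stemming and weighting.

```python
STOP_WORDS = {"the", "a", "is", "of", "and"}   # tiny illustrative stop list

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

# Step 1: establish the corpus (two made-up documents).
corpus = {
    "d1": "the price of the product is high",
    "d2": "the product quality is high and the price is fair",
}

# Step 2: create the term-document matrix (one row per document).
terms = sorted({t for text in corpus.values() for t in tokenize(text)})
matrix = {doc: [tokenize(text).count(t) for t in terms]
          for doc, text in corpus.items()}

print(terms)
for doc, counts in matrix.items():
    print(doc, counts)
```

Step 3, extracting knowledge, would then run clustering, classification, or association methods on this numeric matrix.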
What is a computer program that automatically converts normal language text into human speech?
Text to Speech / Speech Synthesis
What is the main difference between the SEMMA and Crisp DM?
The CRISP DM takes a more comprehensive approach and SEMMA implicitly assumes that the data mining project's goals and objectives along with the appropriate data sources have been identified and understood.
Value of information
The ability to understand, digest, analyze, and filter information is key to growth and success for any professional in any industry
Define managerial control. Provide two examples.
The acquisition and efficient use of resources in the accomplishment of organizational goals.
What are issues that are pertaining to scalability?
The amount of data in a warehouse, how quickly the warehouse is expected to grow, the number of concurrent users, and the complexity of user queries. **must scale both horizontally and vertically
What are popular techniques for time series forecasting?
The averaging methods -- simple average, moving average, weighted moving average, and exponential smoothing.
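The averaging methods listed above differ mainly in how much weight they give to recent observations. The sketch below produces a one-step-ahead forecast with each; the function names, weights, and smoothing constant are illustrative assumptions.

```python
def simple_average(series):
    """Forecast = mean of all past observations."""
    return sum(series) / len(series)

def moving_average(series, window):
    """Forecast = mean of the last `window` observations."""
    return sum(series[-window:]) / window

def weighted_moving_average(series, weights):
    """Weights are listed oldest to newest and need not sum to 1."""
    recent = series[-len(weights):]
    return sum(w * x for w, x in zip(weights, recent)) / sum(weights)

def exponential_smoothing(series, alpha):
    """Each new level blends the latest observation with the previous level."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

demand = [10, 12, 14, 16]               # made-up history with an upward trend
print(simple_average(demand))
print(moving_average(demand, window=2))
print(weighted_moving_average(demand, weights=[1, 3]))
print(exponential_smoothing(demand, alpha=0.5))
```

On a trending series, the methods that favor recent points track the trend more closely, though all of them lag behind it.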
Define operational control. Provide two examples.
The efficient and effective execution of specific tasks. A/R A/P
What is "search engine optimization"? Who benefits from it?
The intentional activity of affecting the visibility of an e-commerce site or a website in a search engine's natural (unpaid or organic) search results. It involves editing a page's content, HTML, metadata, and associated coding to both increase its relevance to specific keywords and to remove barriers to the indexing activities of search engines. Primarily benefits companies with e-commerce sites by making their pages appear toward the top of search engine lists when users query.
What is the main difference between commercial and free data mining software tools?
The main difference between commercial tools, such as Enterprise Miner and Statistica, and free tools, such as Weka and RapidMiner, is computational efficiency. The same data mining task involving a rather large dataset may take a whole lot longer to complete with the free software, and in some cases it may not even be feasible (i.e., crashing due to the inefficient use of computer memory).
What are some methods for cluster analysis?
The most commonly used clustering algorithms are k-means and self-organizing maps.
Business Analytics and its goals
The process of creating new insights from information is known as business analytics. a) Business Intelligence --> Operational --> Here & Now. b) Business Analytics --> Strategic --> Future. Goals: extracting the knowledge buried inside enterprise databases (discovering unknown relationships); analytical decisions are put on a repeatable basis instead of being treated as an ad hoc activity.
Define analytics.
The process of developing actionable decisions or recommendations for actions based on insights generated from historical data.
Describe how ES perform inference.
The process of using the rules in the knowledge base along with the known facts to draw conclusions. Requires some logic embedded in a computer program to access and manipulate the stored knowledge. This program is an algorithm that, with the guidance of the inference rules, controls the reasoning process and is usually called the inference engine.
BI
The set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis/decision support purposes.
What is the difference between information visualization and visual analytics?
Information visualization: the use of visual representations to explore, make sense of, and communicate data. Visual analytics: the combination of visualization and predictive analytics.
What is the purpose of technology providers or the outer petals?
They provide technology, solutions, and training to analytics user organizations so they can employ these technologies in the most effective and efficient manner.
Data mart bus architecture
This architecture is a viable alternative to the independent data marts where the individual marts are linked to each other via some kind of middleware. Not optimal for complex data queries.
What are the key similarities and differences between a two-tiered architecture and a three-tiered architecture?
Three-Tier: Has client workstation, application server and database server (each in own tier). Data is processed twice and deposited in an additional multidimensional database. Separation of functions of the DW, eliminates resource constraints, easily create data marts. Two-Tier: Has client workstation and application/database server. Same hardware, but more economical. Can have problems with large DW with data intensive applications.
What is defined by the linear combination of time, emotional intensity, intimacy, and reciprocity?
Tie Strength
What is the structure of a two tier architecture?
Tier 1: Client Workstation Tier 2: Application & Database Server **more economical, but more performance problems
What is the structure of a three tier architecture?
Tier 1: Client Workstation Tier 2: Application Server Tier 3: Database Server **eliminates resource constraints and makes it possible to easily create DMs
3 tiers of data warehousing architecture (a 2-tier architecture, where the last two tiers work together, is more economical but not great for large companies).
Tier 1: Client workstation. Tier 2: Application server. Tier 3: Database server.
What is a situation in which it's not important to know exactly when the event occurred?
Time Independent
What is a sequence of data points of the variable of interest, measured and represented as successive points in time spaced at uniform time intervals?
Time Series
What assumes all the explanatory variables are aggregated and consumed in the response variable's time variant behavior?
Time Series Forecasting
What is the use of mathematical modeling to predict future values of the variable of interest based on previously observed values?
Time Series Forecasting
What measures the visitor's interaction with the website?
Time on Site
What is a categorized block of text in a sentence?
Token (tokenizing is the process of assigning such meanings to blocks of text)
online analytical processing (OLAP)
Tools to create an advanced data analysis environment that supports decision making, business modeling, and operations research.
What adapts traditional relational database tools to the development needs of an enterprise wide data warehouse and provides a consistent & comprehensive view of the enterprise?
Top Down Development / EDW Approach
How does traditional analytics make use of location-based data?
Traditional analytics produce visual maps that are geographically mapped and based on the traditional location data, usually grouped by the postal codes. The use of postal codes to represent the data is a somewhat static approach for achieving a higher level view of things
What is used by the model builder and has 2/3 of the data?
Training Set
In a simple split, what are the three mutually exclusive subsets used to prevent overfitting?
Training, validation, and testing.
What is a computerized record of a discrete event?
Transaction
What involves converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a data warehouse or simply another database?
Transformation
What is sentiment analysis? How does it relate to text mining?
Tries to answer the question, "What do people feel about a certain topic?" by digging into opinions of many using a variety of automated tools. It is also known as opinion mining, subjectivity analysis, and appraisal extraction. Unlike text mining, which categorizes text by conceptual taxonomies of topics, sentiment classification generally deals with two classes (positive versus negative), a range of polarity (e.g., star ratings for movies), or a range in strength of opinion
What is the outcome of when the predictive class is negative and the observed class is negative?
True Negative
What is the outcome of when the predictive class is positive and the observed class is positive?
True Positive
specificity
True Negative / (True Negative + False Positive)
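A short sketch of the specificity formula, counting true negatives and false positives from paired observed and predicted class labels. The labels (1 = positive, 0 = negative) and data are made up for the example.

```python
def specificity(observed, predicted):
    """True negatives divided by all observed negatives (TN + FP)."""
    pairs = list(zip(observed, predicted))
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)   # correctly rejected
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)   # false alarms
    return tn / (tn + fp)

observed = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]
spec = specificity(observed, predicted)   # 4 true negatives, 1 false positive
print(spec)
```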
What means that the pattern should make business sense that leads to the user saying they understand?
Ultimately Understandable
When would the decision maker consider situations in which several outcomes are possible for each course of action and the decision maker does not know the probability of occurrence of the possible outcomes?
Uncertainty
What are the factors that affect the result variables, but are not under the control of the decision maker?
Uncontrollable variables/Parameters
environmental scanning
Undirected viewing mode: limited, irregular information. Conditional viewing mode: controlling for internal data, external data monitored. Searching mode: seeking information to update existing knowledge. Enacting mode: experimentation and trying new behaviors.
What is composed of any combination of textual, imagery, voice, and Web content?
Unstructured data
Geographic information system (GIS)
Used to capture, store, analyze, and manage the data linked to a location, combined with integrated sensor technologies and GPS
What is a critical success factor in data warehouse development?
User participation
What uses animated computer graphic displays to present the impact of different managerial decisions?
VIS
What means that the discovered patterns should hold true on new data with a sufficient degree of certainty?
Valid
What is used to calculate the deviation of all data points in a given data set from the mean?
Variance
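Variance and standard deviation are closely linked: the standard deviation is the square root of the variance. A minimal check using Python's statistics module on a made-up data set (population formulas are assumed here; sample formulas divide by n - 1 instead):

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]             # mean is 5

variance = statistics.pvariance(data)        # mean squared deviation from the mean
std_dev = statistics.pstdev(data)            # same units as the data

print(variance, std_dev, math.sqrt(variance))
```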
What is a simulation method that lets decision makers see what the model is doing and how it interacts with the decisions made, as they are made?
Visual Interactive Simulation (VIS)
Geocoding
Visual Maps, Postal codes, Latitude & Longitude
What is a significant technology that has become a key player in descriptive analytics?
Visualization
What is an integral part of analytic CRM and customer experience management systems that helps to better understand and better manage customer complaints/praises?
Voice of the Customer
What has been limited to employee satisfaction surveys and is a way to listen to what employees are saying?
Voice of the Employee
What is about understanding aggregate opinions and trends and helps companies with competitive intelligence and product development and positioning?
Voice of the Market
Out of the Vs that are used to define Big Data, in your opinion, which one is the most important? Why?
Volume, Variety, Velocity, Veracity, Variability, Value Proposition. -Value Proposition: A preconceived notion about "big" data is that it contains more patterns and interesting anomalies than "small" data. By analyzing large and feature rich data, organizations can gain greater business value that they may not have otherwise. Users can detect the patterns in small data sets using simple statistical analytics. Big analytics means greater insight and better decisions, something that every organization needs.
What is primarily Web site usage data focused and aims to describe what has happened on the Web site?
Web Analytics
What is the process of discovering intrinsic relationships from Web data which are expressed in the form of textual, linkage, or useful information?
Web Mining
What is the extraction of useful information from data generated through web page visits and transactions?
Web Usage Mining
Additional data warehouse characteristics include:
Web based, relational/multidimensional, client/server (for easy access by end users), real time (newer data warehouses provide real-time or active data-access and analysis capabilities), metadata (data about data: how it's all organized, how to use it, etc.).
What are characteristics that enable data warehouses to be tuned exclusively for data access?
Web based, relational/multidimensional, client/server, real time, include metadata.
What is the extraction of useful information from Web pages?
Web content mining
What is the taxonomy of web analytics?
Web content mining, web structure mining, and web usage mining.
What are automated techniques that are used to read through the content of a Web site?
Web crawlers
What is Web mining? How does it differ from regular data mining or text mining?
Web mining is the discovery and analysis of interesting and useful information from the Web and about the Web, usually through Web-based tools. Text mining is less structured because it's based on words instead of numeric data.
What is the process of extracting useful information from the links embedded in Web documents and identifies authoritative pages and hubs?
Web structure mining
What is the integration of data warehousing and Internet that offer important solutions for managing corporate data?
Web-Based Data Warehousing
What are the differences and commonalities between Web-based social media and traditional/industrial media?
Web-based social media differ from traditional/industrial media as they are comparatively inexpensive and accessible to enable anyone to publish or access/consume information. Industrial media generally require significant resources to publish information, as in most cases the articles go through many revisions before being published. Differences: quality, frequency, accessibility, usability, immediacy, updatability. Commonality: reach.
What are the two main components of the development cycle?
Webcrawler & Document Indexer
What are commonly used Web analytics metrics? What is the importance of metrics?
Website usability, traffic sources, visitor profiles, and conversion statistics. They provide access to a lot of valuable marketing data, which can be leveraged for better insights to grow your business and better document your ROI. The insight and intelligence gained from Web analytics can be used to effectively manage the marketing efforts of an organization and its various products or services.
What open source data mining software includes a large number of algorithms for different data mining tasks and has an intuitive user interface most popular in educational circles?
Weka
What are examples of open source, free data mining software?
Weka, KNIME, RapidMiner
What is the outcome of descriptive analytics?
Well-defined business problems and opportunities.
What is structured as "What will happen to the solution if an input variable, an assumption, or a parameter value is changed?"
What If Analysis
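A what-if analysis can be sketched by re-running a simple model with one input changed and comparing the result with the base case. The profit model and all figures below are hypothetical.

```python
def profit(units, price, unit_cost, fixed_cost):
    """Toy quantitative model: the result variable is profit."""
    return units * (price - unit_cost) - fixed_cost

# Base case with assumed inputs.
base = profit(units=1000, price=25.0, unit_cost=15.0, fixed_cost=4000.0)

# What if the unit cost (an uncontrollable variable) rises from 15 to 17?
scenario = profit(units=1000, price=25.0, unit_cost=17.0, fixed_cost=4000.0)

change = scenario - base
print(base, scenario, change)
```

Repeating this over a range of values for one input is essentially a sensitivity analysis of the solution to that input.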
What is a distributed system?
When systems run on multiple servers, instead of just one computer.
How BI Can Answer Tough Customer Questions 2
Where has the business been? Historical perspective offers important variables for determining trends and patterns. Where is the business now? Looking at the current business situation allows managers to take effective action to solve issues before they grow out of control. Where is the business going? Setting strategic direction is critical for planning and creating solid business strategies
What conforms to the search engine's guidelines and involves no deception?
White Hat SEO
What is the difference between white hat and black hat SEO?
White hats tend to produce results that last a long time, whereas black hats anticipate that their sites may eventually be banned, either temporarily or permanently, once the search engines discover what they are doing.
What is a laboriously hand coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets?
WordNet (expensive to build and maintain for NLP)
What is the purpose of the analytics accelerators or the inner petals?
Works with both technology providers and users.
What is the most granular polarity identification?
Word level
What is the world's largest data and text repository?
World Wide Web (WWW)
What is the process used to optimally price services to maximize revenues as a function of time-varying transactions?
Yield Management
Inter Cluster
Between the groups (clusters)
data artist
a business analytics specialist who uses visual tools to help people understand complex data
Dashboard (in PP)
a collection of 1 or more related scorecards or report elements arranged in a set of web pages, hosted by SharePoint Server
Big data
a collection of large, complex data sets, including structured and unstructured data, which cannot be analyzed using traditional database methods and tools
record
a collection of related data elements (in the MUSICIANS table, these include "3, Lady Gaga, gag.tiff, Do not bring young kids to live shows")
Enterprise Data Warehouse (EDW)
a data warehouse for the enterprise
star schema
a data-modeling technique used to map multidimensional decision support data into a relational database.
primary key
a field (or group of fields) that uniquely identifies a given record in a table. In the table RECORDINGS, the primary key is the field RecordingID that uniquely identifies each record in the table. Primary keys are a critical piece of a relational database because they provide a way of distinguishing each record in a table; for instance, imagine you need to find information on a customer named Steve Smith. Simply searching the customer name would not be an ideal way to find the information because there might be 20 customers with the name Steve Smith
Scorecards
a high-level snapshot of organizational performance; displays a collection of KPIs and the performance targets for those KPIs
data warehouse
a logical collection of information, gathered from many operational databases, that supports business analysis activities and decision-making tasks primary purpose is to combine information, more specifically, strategic information, throughout an organization into a single repository in such a way that the people who need that information can make decisions and undertake business analysis (collect information from multiple systems in a common location that uses a universal querying tool)
In an OLAP a cube is
a multidimensional data structure (actual or virtual) that allows fast analysis of data. It provides the capability of efficiently manipulating and analyzing data from multiple perspectives and aims to overcome a limitation of relational databases. An analyst can navigate through the database and screen for a particular subset of the data by changing the data's orientations and defining analytical calculations. It is not as well suited to large amounts of data as a standard relational format.
foreign key
a primary key of one table that appears as an attribute in another table and acts to provide a logical relationship between the two tables
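The two key definitions above can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table and field names RECORDINGS, TRACKS, RecordingID, TrackNumber and TrackTitle come from the cards, while the `Title` column, the composite key on TRACKS and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled

# RecordingID is the primary key: it uniquely identifies each recording.
conn.execute("CREATE TABLE RECORDINGS (RecordingID INTEGER PRIMARY KEY, Title TEXT)")

# RecordingID reappears in TRACKS as a foreign key linking the two tables.
conn.execute(
    "CREATE TABLE TRACKS ("
    " TrackNumber INTEGER, TrackTitle TEXT, RecordingID INTEGER,"
    " PRIMARY KEY (RecordingID, TrackNumber),"
    " FOREIGN KEY (RecordingID) REFERENCES RECORDINGS (RecordingID))"
)
conn.execute("INSERT INTO RECORDINGS VALUES (1, 'Example Album')")
conn.execute("INSERT INTO TRACKS VALUES (1, 'Example Track', 1)")

def insert_fails(sql):
    """Return True if the database rejects the statement as an integrity violation."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# A second recording with the same primary key value is rejected.
duplicate_key_rejected = insert_fails("INSERT INTO RECORDINGS VALUES (1, 'Another Album')")
# A track that points at a nonexistent recording is rejected.
orphan_track_rejected = insert_fails("INSERT INTO TRACKS VALUES (1, 'Orphan', 99)")
```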
Extraction, transformation, and loading (ETL)
a process that extracts information from internal and external databases, transforms it using a common set of enterprise definitions, and loads it into a data warehouse. The data warehouse then sends portions (or subsets) of the information to data marts
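The three ETL steps can be sketched in a few lines. Everything here is hypothetical: the two source systems, their field names, the `COUNTRY_CODES` mapping and the `sales` table are invented to show the idea of transforming rows to a common set of definitions before loading, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Extract: rows as they might arrive from two hypothetical source systems.
crm_rows = [{"cust": "Ann", "country": "US", "revenue": "1,200.50"}]
shop_rows = [{"customer_name": "Bob", "country": "United States", "sales": 300.0}]

# Transform: a shared enterprise definition for country values (assumed).
COUNTRY_CODES = {"US": "US", "United States": "US"}

def transform(name, country, amount):
    """Map one source row onto the common warehouse definitions."""
    if isinstance(amount, str):
        amount = float(amount.replace(",", ""))
    return (name, COUNTRY_CODES[country], amount)

rows = [transform(r["cust"], r["country"], r["revenue"]) for r in crm_rows]
rows += [transform(r["customer_name"], r["country"], r["sales"]) for r in shop_rows]

# Load: write the cleaned rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, country TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

total = warehouse.execute(
    "SELECT SUM(amount) FROM sales WHERE country = 'US'"
).fetchone()[0]
```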
Information cleansing or scrubbing (2 of 3 core concepts of data warehousing)
a process that weeds out and fixes or discards inconsistent, incorrect, or incomplete information
dimensional modeling is
a retrieval based system that supports high-volume query access.
Independent Data Mart
a small warehouse designed for a strategic business unit (SBU) or a department, but its source is not an EDW.
A data warehouse is
a specially constructed data repository where data are organized so that they can be easily accessed by end users for several applications.
What is PerformancePoint Dashboard Designer?
a tool that you can use to create dashboards, scorecards, and reports and then publish them to a SharePoint site; Dashboard Designer is part of PerformancePoint Services in MS SharePoint Server 2012
What is an operational data store (ODS)?
a type of database often used as an interim area for a data warehouse
Dashboard definition
a visual display of the most important information needed to achieve one or more objectives; consolidated and arranged on a single screen so the information can be monitored at a glance
BI-... a) tool b) solution c) product d) process
a) BI tools are generic software sold by vendors like Oracle, SAP, Microsoft Dynamics, Sage b) BI solutions are customized software, deployed within organizations c) BI product is the result of BI, where information & knowledge are created d) BI process is how the organization obtains, analyzes and distributes information
input and output for Organizational Memory
a) Input: Data, information and knowledge is stored as events occur b) Output: accumulated information & knowledge about the past (not necessarily integrated)
Online Analytical Processing (OLAP), its goals and features
a) OLAP queries the data warehouse; responses are pre-calculated b) Organizes data into cubes c) Dimensions summarize data and can be hierarchically drilled down d) OLAP allows users to quickly manipulate the analytic results across the different dimensions, with no waiting for queries or calculations
why BI gets more important
a) Exploding data volumes: large data collections can make decisions even more difficult b) Complicated decisions: increasingly difficult because of 24/7 worldwide complex processes and a larger diversity of information required to make a decision c) Need for quick reflexes: market influences cause quick changes, so a decision has to be made within the window of opportunity; delays arise when converting, integrating, or deriving information/knowledge d) Technological progress: better tools for organizations because of ERP and DW systems; need for data or text mining
drill down
access data that is in a lower level of a hierarchically structured database.
Middleware tools enable
access to the data warehouse. Power users such as analysts may write their own SQL queries.
Relational DBMS
allow multiple access queries.
Active Data warehousing (as opposed to traditional data warehousing)
allows for a large number of users and operational staff. An active data warehouse is a repository of any form of captured transactional data so that they can be used for the purpose of finding trends
A relational database management system
allows users to create, read, update, and delete data in a relational database. Although the hierarchical and network models are important, this text focuses only on the relational database model
Metric
an analytical measurement intended to quantify the state of a system
dynamic catalog
an area of a website that stores information about products in a database (dynamic website information)
decision support system
an information system that helps managers understand specific kinds of problems and potential solutions and analyze the impact of different decision options using what if scenarios
data warehouse
an integrated, subject-oriented, time-variant, nonvolatile collection of data that provides support for decision making.
data-driven website
an interactive website kept constantly updated and relevant to the needs of its customers using a database (especially useful when a firm needs to offer large amounts of information, products, or services. Can help limit the amount of information displayed to customers based on unique search requirements)
What is an oper mart?
an operational data mart
market basket analysis
analyzes such items as websites and checkout scanner information to detect customers' buying behavior and predict future behavior by identifying affinities among customers' choices of products and services
SaaS
application is hosted as a service
Oper marts
are created when operational data needs to be analyzed multidimensionally. The data for an oper mart come from an ODS.
issue of scaling
attributes may have to be scaled to prevent domination of one attribute
heterogenous ensembles
base models stem from different prediction methods
Attributes
descriptive information about the dimensions
splitter/branching node
binary decision
In-Flow DS flow
capturing data from legacy system, validating to test data for reality, repairing to examine and build data, transforming for consolidation, applying to move and load data.
Standard Data mining format
cases -> variables
non-exclusive approaches
cases are assigned to some clusters with some probability (Gaussian mixture)
chord /kɔrd/ or circus chart
chord chart is already implemented in Power BI: A chord diagram is a graphical method of displaying the inter-relationships between data in a matrix. The data is arranged radially around a circle with the relationships between the points typically drawn as arcs connecting the data together.
Reporting
classic Approach to serve Managers Information Needs
model
classification rules, decision tree, mathematical formulae
ensemble methods
combining multiple methods
data dictionary
compiles all of the metadata about the data elements in the data model
business intelligence
comprehensive, cohesive, and integrated set of tools and processes used to capture, collect, integrate, store, and analyze data with the purpose of generating and presenting information used to support business decision making.
Business-critical integrity
constraints enforce business rules vital to an organization's success and often require more insight and knowledge than relational integrity constraints. Example: no product returns are accepted after 15 days past delivery (makes sense because of spoilage of produce)
A data mart...
contain data on one topic (e.g., marketing). A data mart can be a replication of a subset of data in the data warehouse. Data marts are a less expensive solution that can be replaced by or can supplement a data warehouse. Data marts can be independent of or dependent on a data warehouse.
root node
contains all data
fact constellation
contains multiple fact tables that share many dimension tables
biggest pitfalls associated with real-time information
continual change
accuracy or Percentage correctly classified
correctly classified examples / all examples
Machine-generated data
created by a machine without human intervention. Machine-generated structured data includes sensor data, point-of-sale data, and web log (blog) data
database management system (DBMS)
creates, reads, updates, and deletes data in a database while controlling access and security. Managers send requests to the DBMS, and the DBMS performs the actual manipulation of the data in the database
What is metadata
data about the data. in a data warehouse, metadata describe the contents of a data warehouse and the manner of its acquisition and use
Data integration uses three things:
data access, data federation (integration of business views across multiple data stores) and change capture (based on the identification, capture and delivery of changes made to enterprise data sources).
What solutions does business intelligence provide
data access, storage, data analysis and visualization technologies to support better decision making
data mart (1 of 3 core concepts of data warehousing)
data mart contains a subset of data warehouse information. To distinguish between data warehouses and data marts, think of data warehouses as having a more organizational focus and data marts as having a functional focus
A web-server is backed by both a
data warehouse and an application server. used for ease of access, platform independence, and lower cost.
The federated data warehouse
data warehouse architecture involves integrating disparate systems and analytical resources from multiple sources to meet changing needs or business conditions.
data warehouse parts
data warehouse itself, data acquisition (back-end), client (front-end).
Business Advantages of a Relational Database 4) Increased Information Integrity (Quality)
database design needs to consider integrity constraints
physical view of information
deals with the physical storage of information on a storage device
dice
defines a subcube by performing a selection on two or more dimensions
business rule
defines how a company performs certain aspects of its business and typically results in either a yes/no or true/false answer. Stating that merchandise returns are allowed within 10 days of purchase is an example of a business rule
data quality audits
determine the accuracy and completeness of its data. Most organizations determine a percentage of accuracy and completeness high enough to make good decisions at a reasonable cost, such as 85 percent accurate and 65 percent complete.
several obstacles of BI introduction
It is difficult to find a fitting BI solution, because solutions are often expensive and the benefits are rather long term; business processes are often not consistently defined; the BI needs of business users are difficult to identify
classification DDM
discrete dependent variable; continuous and/or discrete independent variables
classification vs. regression
discrete dependent variable vs. continuous dependent variable
Business Advantages of a Relational Database 1) Increased Flexibility
distinction between logical and physical views is important in understanding flexible database user views
measuring distance
e.g. Euclidean distance (slide 39)
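A short sketch of Euclidean distance, together with the scaling issue noted in the "issue of scaling" card. The customer data (age, income) and the choice of min-max scaling are assumptions made for illustration; other scaling methods would serve the same purpose.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(rows):
    """Rescale every attribute (column) to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# Hypothetical customers described by (age, income).
customers = [(25, 30000), (60, 31000), (26, 90000)]

# Unscaled, income dominates: the 35-year age gap barely registers.
raw = euclidean(customers[0], customers[1])

# After scaling, both attributes contribute on a comparable range.
scaled = min_max_scale(customers)
rescaled = euclidean(scaled[0], scaled[1])
```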
Transactional information
encompasses all of the information contained within a single business process or unit of work, and its primary purpose is to support daily operational tasks (Organizations need to capture and store transactional information to perform operational tasks and repetitive decisions such as analyzing daily sales reports and production schedules to determine how much inventory to carry)
Analytical information
encompasses all organizational information, and its primary purpose is to support the performance of managerial analysis tasks (Analytical information is useful when making important decisions such as whether the organization should build a new manufacturing plant or hire additional sales personnel. Analytical information makes it possible to do many things that previously were difficult to accomplish, such as spot business trends, prevent diseases, and fight crime; identify many unusual trends)
primary concepts of the relational database model
entities, attributes, keys, and relationships
technologies used for information integration
Environmental scanning: events, trends, and relationships in the external environment that could influence the company (law changes, new technology, competitors). Text mining: "reading" and analyzing text written in natural language. Web mining: searching the web (forums, social media) and online text. RFID: information regarding the location of goods
Dirty data
erroneous or flawed data (complete removal of dirty data from a source is impractical or virtually impossible); dirty data is a business problem, not an MIS problem
exclusive approaches
every case is assigned to exactly one cluster (k-means)
Specialized software tools
exist that use sophisticated procedures to analyze, standardize, correct, match, and consolidate data warehouse information
ETL
extract, transform, load
data scientist
extracts knowledge from data by performing statistical analysis, data mining, and advanced analytics on big data to identify trends, market changes, and other relevant information
Error rate
(false positives + false negatives) / all examples
Comparison Query Performance
fast for multidimensional data types, slow for relational by increasing complexity
Advanced analytics
focuses on forecasting future trends and producing insights using sophisticated quantitative methods, including statistics, descriptive and predictive data mining, simulation, and optimization (uses data patterns to make forward-looking predictions to explain to the organization where it is headed)
logical view of information
focuses on how individual users logically access information to meet their own particular business needs
Starnet abstraction level
footprint
post-pruning
fully grown tree - complexity
Indicators
graphical symbols used in KPIs to show whether performance is on or off target (e.g. stoplight symbols)
Structured data
has a defined length, type, and format and includes numbers, dates, or strings such as Customer Address. (typically stored in a traditional system such as a relational database or spreadsheet and accounts for about 20 percent of the data that surrounds us)
determining number of clusters
heuristic approach: check which number of clusters yields the best objective value with the least effort
approaches of clustering
hierarchical vs. non-hierarchical; agglomerative vs. divisive; exclusive vs. non-exclusive
DBMS use three primary data models for organizing information
hierarchical, network, and the relational database, the most prevalent
Real-time information
immediate, up-to-date information
Dynamic information
includes data that change based on user actions. For example, static websites supply only information that will not change until the content editor changes the information. Dynamic information changes when a user requests information. A dynamic website changes information based on user requests such as movie ticket availability, airline prices, or restaurant reservations
Static information
includes fixed data incapable of change in the event of a user action
Filters
individual dashboard items that enable dashboard users to focus on specific information (e.g. geography filter enabling a user to view information for a specific geographical region)
multidimensional cube is
inflexible and does not support the ad hoc creation of multidimensional views of the products, services and customers. Can't handle more than 30 gigabits of data.
Inmon vs kimball
Inmon: top-down, enterprise-wide, complex, subject-driven, low end-user accessibility, aimed at IT professionals; whereas Kimball: bottom-up, simple method, data marts, process-oriented, dimensional modeling, high end-user accessibility.
Intra Cluster
within a cluster
Federated data warehouse
integrates analytical resources from multiple sources to meet changing needs or business conditions.
types of Data Sources
internal Data sources vs. external data sources
snowflake schema
is a logical arrangement of tables in a multidimensional database in such a way that the entity relationship diagram resembles a snowflake in shape.
Enterprise information integration
is a mechanism for pulling data from source systems to satisfy a request for information. It is an evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases.
Dependent Data Mart
is a subset that is created directly from the data warehouse. It has the advantage of using a consistent data model and providing quality data. A dependent data mart ensures that the end user is viewing the same version of the data that is accessed by all other data warehouse users. The high cost of data warehouses limits their use to large companies.
Unstructured data
is not defined, does not follow a specified format, and is typically free-form text such as emails, Twitter tweets, and text messages (Unstructured data accounts for about 80 percent of the data that surrounds us)
drill down
less detailed -> more detailed - stepping down a concept hierarchy or introducing additional dimensions (country -> state -> city)
prediction methods
linear vs. non-linear, parametric vs. non-parametric, homogeneous vs. heterogeneous, individual vs. ensemble
Data models
logical data structures that detail the relationships among data elements by using graphics or pictures
What is used to develop probabilistic models between one or more explanatory/predictor variables?
Logistic regression
database
maintains information about various types of objects (inventory), events (transactions), people (employees), and places (warehouses) (store information) (core component of any system, regardless of size, is a database and a database management system)
Data warehousing used primarily to help
make informed decisions.
Relational Databases are not well suited for
manipulating records. They support a lot of data, support dynamic joining of data, and are a proven technology, but performance is less than optimal and they cannot be used for purely optimized processing.
Aims of clustering
maximize intra-cluster homogeneity, maximize inter-cluster heterogeneity
Information integrity
measure of the quality of information
Dimensional modeling
a retrieval-based system that supports high-volume query access.
Data visualization tools
move beyond Excel graphs and charts into sophisticated analysis techniques such as controls, instruments, maps, time-series graphs, and more. Data visualization tools can help uncover correlations and trends in data that would otherwise go unrecognized
Comparison Data preparation Time for Query
multidimensional data type fast at complexity,
nominal
multiple variables, no order
clustering DDM
no dependent variable, continuous and/or discrete (independent) variables
pre-pruning
not fully grown tree - disadvantages: consider focal node only - how to collect parameters (maximal depth)
Information integrity issues
occur when a system produces incorrect, inconsistent, or duplicate data (can cause managers to consider the system reports invalid and will make decisions based on other sources)
Information inconsistency
occurs when the same data element has different values
Analysis paralysis
occurs when the user goes into an emotional state of over-analysis (or over-thinking) a situation so that a decision or action is never taken, in effect paralyzing the outcome. In the time of big data, analysis paralysis is a growing problem. One solution is to use data visualizations to help people make decisions faster
How are data warehouses different from operational databases
operational databases are more product oriented and data warehouses use subject orientation to give a more comprehensive view of the organization.
concept hierarchy
parent-child relationship among members of a dimension
Master data management (MDM)
practice of gathering data and ensuring that it is uniform, accurate, consistent, and complete, including such entities as customers, suppliers, products, sales, employees, and other critical entities that are commonly integrated across organizational systems
leaf node
prediction
Data mining takes analysis further by sifting through a large amount of data to find information using algorithms such as these:
predictive modeling, database segmentation, link analysis, deviation detection.
Infographics
present the results of data analysis, displaying the patterns, relationships, and trends in a graphical format (exciting and quickly convey a story users can understand without having to analyze numbers, tables, and boring charts)
Distributed computing
processes and manages algorithms across many machines in a computing environment
simple classifiers
prototype based methods - rote learner (exact match) - nearest neighbor
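A nearest neighbor classifier can be sketched as follows. The training points, the labels and the choice of `k` are invented for illustration; distance is plain Euclidean distance via `math.dist` (Python 3.8+), and the prediction is the prevailing class among the `k` closest training examples.

```python
import math
from collections import Counter

def nearest_neighbor_predict(training, query, k=1):
    """Predict the prevailing class among the k training examples closest to query."""
    neighbors = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled examples: (feature vector, class label).
training = [
    ((1, 1), "duck"),
    ((1, 2), "duck"),
    ((2, 1), "duck"),
    ((5, 5), "goose"),
    ((6, 5), "goose"),
]

prediction_a = nearest_neighbor_predict(training, (1.5, 1.5), k=3)
prediction_b = nearest_neighbor_predict(training, (5.5, 5.0), k=3)
```

With `k=1` and a query that exactly matches a training example, this behaves like the rote learner mentioned on the card.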
Reports
provide access to interactive and static data in a variety of forms (e.g. analytic chart, analytic grid, Excel services, KPI details, web page)
OLAP tools
provide data access to end users. allow a user to "drill-down" into their data to view it at whatever level of detail they need.
Real-time systems
provide real-time information in response to requests. Many organizations use real-time systems to uncover key corporate transactional information
Metadata
provides details about data (metadata about an image could include its size, resolution, and date created; metadata about a text document could contain document length, date created, author's name, and summary)
continuous variables
quantitative variables https://statistics.laerd.com/statistical-guides/types-of-variable.php
Two primary tools are available for retrieving information from a DBMS
query-by-example (QBE) tool and a structured query language (SQL)
n fold cross validation
randomly split the data into n samples: 1 for model validation, n-1 for building the model (repeated so that each sample is used once for validation)
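The splitting step of n-fold cross validation can be sketched in plain Python. The function name `n_fold_splits`, the fixed seed and the case count are assumptions for illustration; model building and scoring are left out, so only the index bookkeeping is shown.

```python
import random

def n_fold_splits(n_cases, n_folds, seed=0):
    """Yield (training_indices, validation_indices) once per fold."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)          # random assignment of cases to folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        # The other n-1 folds together form the training data for this round.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(n_fold_splits(n_cases=10, n_folds=5))
for training, validation in splits:
    print(sorted(validation), "held out;", len(training), "cases used for building")
```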
Data governance
refers to the overall management of the availability, usability, integrity, and security of company data
simulation of model applications
resubstitution estimate, split sample, N-fold cross validation
internal node
result of branching node
association detection
reveals the relationship between variables along with the nature and frequency of the relationships
pivot
rotates or inverts the data axes in view; goal: alternative presentation of the data
Relational integrity constraints
rules that enforce basic and fundamental information-based constraints. For example, a relational integrity constraint would not allow someone to create an order for a nonexistent customer, provide a markup percentage that was negative, or order zero pounds of raw materials from a supplier
Integrity constraints
rules that help ensure the quality of information
resubstitution estimate
same data for estimation and assessment (single sample approach)
Machine-generated unstructured data
satellite images, scientific atmosphere data, and radar data
Business Advantages of a Relational Database 2) Increased Scalability and Performance
scalable to handle the massive volumes of information, the large numbers of users expected for the launch of the website, and need to perform quickly under heavy use
slice
selection of single value, resulting in a smaller cube -> slice
training set
set of tuples used for model construction
star schema
simplest form of dimensional modeling. Contains a central fact table surrounded by and connected to several dimension tables. The fact table contains a large number of rows that correspond to observed facts and external links.
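A star schema in miniature, using Python's built-in `sqlite3`. The table names (`fact_sales`, `dim_product`, `dim_date`), columns and rows are invented for illustration: one central fact table holds the measures and links to two dimension tables, and a typical query joins the fact table to its dimensions and aggregates.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, month TEXT);
-- Central fact table: one row per observed sale, linked to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product (product_id),
    date_id INTEGER REFERENCES dim_date (date_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_date VALUES (1, '2024-01'), (2, '2024-02');
INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 2, 200.0);
""")

# Join the fact table to its dimensions and aggregate the measure.
rows = db.execute("""
    SELECT p.name, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY p.name, d.month
    ORDER BY p.name, d.month
""").fetchall()
```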
slice and dice
to divide a quantity of information up into smaller parts, especially in order to analyze it more closely or in different ways.
MOLAP
specialized database physically storing data in multidimensional form
split sample
split data into two sets, one for estimation and one for model assessment
ROLAP
star or snowflake schema in relational database
snowflake schema
star schema with normalization
The growing demand for real-time information
stems from organizations' need to make faster and more effective decisions, keep smaller inventories, operate more efficiently, and track performance more carefully
types of measures
stored vs. calculated
entity (also referred to as a table)
stores information about a person, place, thing, transaction, or event (ex. TRACKS, RECORDINGS, MUSICIANS, and CATEGORIES) -columns, attributes, fields-> (supplier, inventory, materials, distribution)
relational database model
stores information in the form of logically related two-dimensional tables
Roll up
summarize data by climbing up hierarchy or dimension reduction - day -> month -> quarter
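Roll-up and slice can be illustrated on a tiny set of fact rows. The data, the tuple layout and the helper names `roll_up_to_month` and `slice_region` are assumptions for illustration; a real OLAP tool would run these operations against a cube rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact rows: (date, region, product, sales).
facts = [
    ("2024-01-05", "North", "Widget", 100),
    ("2024-01-20", "South", "Widget", 150),
    ("2024-02-03", "North", "Gadget", 200),
    ("2024-02-14", "North", "Widget", 50),
]

def roll_up_to_month(rows):
    """Roll up: climb the time hierarchy day -> month and drop the product dimension."""
    totals = defaultdict(int)
    for date, region, _product, sales in rows:
        totals[(date[:7], region)] += sales
    return dict(totals)

def slice_region(rows, region):
    """Slice: select a single value on one dimension, giving a smaller cube."""
    return [row for row in rows if row[1] == region]

monthly = roll_up_to_month(facts)
north_only = slice_region(facts, "North")
```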
Uses for real-time location intelligence
targeting right customer based on their behavior over geographic locations
Data visualization
technologies that allow users to see or visualize data to transform information into a business perspective Data visualization is a powerful way to simplify complex data sets by placing data in a format that is easily grasped and understood far quicker than the raw data alone
Human-generated unstructured data
text messages, social media data, and emails
structured query language (SQL)
that asks users to write lines of code to answer questions against a database
information cube
the common term for the representation of multidimensional information
retention /rɪˈtɛn ʃən/
the continued possession, use, or control of something. Membership retention, pro-mentorship, retain, the meeting,
Attributes (also called columns or fields)
the data elements associated with an entity (attributes for the entity TRACKS are TrackNumber, TrackTitle, TrackLength, and RecordingID. Attributes for the entity MUSICIANS are MusicianID, MusicianName, MusicianPhoto, and MusicianNotes)
Information redundancy Business Advantages of a Relational Database 3) Reduced Information Redundancy
the duplication of data, or the storage of the same data in multiple places (can cause storage issues along with data integrity issues, making it difficult to determine which values are the most current or most accurate. Employees become confused and frustrated when faced with incorrect information causing disruptions to business processes and procedures. One primary goal of a database is to eliminate information redundancy by recording each piece of information in only one place in the database)
Information granularity /ˈgræn yə lər/
the extent of detail within the information (fine and detailed or coarse and abstract)
content creator
the person responsible for creating the original website content
content editor
the person responsible for updating and maintaining website content
Data mining
the process of analyzing data to extract information not offered by the raw data alone (can also begin at a summary information level (coarse granularity) and progress through increasing levels of detail (drilling down) or the reverse (drilling up))
extraction, transformation, and loading (ETL)
the processes used in a data warehouse. It includes extracting data from outside sources, transforming it to fit operational needs, and loading it into the end target (database or data warehouse)
multidimensional databases lack
the scalability and flexibility for DSS
data element (or data field)
the smallest or basic unit of information (can include a customer's name, address, email, discount rate, preferred shipping method, product name, quantity ordered, and so on)
Time-series information
timestamped information collected at a particular frequency
performing extensive ETL (extraction, transformation, load)
to move data to the data warehouse may be a sign of poorly managed data and a fundamental lack of a coherent data management strategy.
Why do we need BI?
to support better decision making and to increase organizational knowledge base
query-by-example (QBE)
tool that helps users graphically design the answer to a question against a database
Business intelligence dashboards
track corporate metrics such as critical success factors and key performance indicators and include advanced capabilities such as interactive controls, allowing users to manipulate data for analysis. The majority of business intelligence software vendors offer a number of data visualization tools and business intelligence dashboards
two primary types of information
transactional and analytical
sensitivity
true positives / (true positives + false negatives)
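Sensitivity, together with the accuracy and error rate defined on other cards in this set, can be computed from the four confusion-matrix counts. The labels below are invented, and the helper name `confusion_counts` is an assumption for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fn, fp, tn

# Hypothetical true classes and model predictions.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(actual, predicted)
total = tp + fn + fp + tn

accuracy = (tp + tn) / total      # correctly classified examples / all examples
error_rate = (fp + fn) / total    # (false positives + false negatives) / all examples
sensitivity = tp / (tp + fn)      # true positives / (true positives + false negatives)
```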
ordinal
two or more categories, rated (bad-normal-good)
Multidimensional Database
usually contain a star model. designed for slice and dice and drill down analysis. highly indexed databases. provides data mining and drill down capabilities.
Distributed database management system
would pull the requested data from databases across the organization, bring all the data back to the same place, and then consolidate it, sort it, and do whatever else was necessary to answer the user's question. The islands-of-data problem still existed.
differences between BI and other information technologies like: a) knowledge management b) data warehousing c) data mining d) decision support systems
x) all kind of data: BI: data & info as input, results in *NEW* knowledge --- x) Focuses mainly on internal, structured data: a) Knowledge Management: info & knowledge as input, using the existing knowledge optimally b) Data Warehousing: ETL obtains data from multiple systems, stores them in single repository c) Data Mining: discovering hidden patterns in data, produces information d) Decision Support System: making appropriate decision
List and describe the major components of BI.
· Architectures · Database tools · Analysis tools · Applications · Methodologies