ITM
Itemset
A collection of one or more items
Data Lake
A storage repository that holds a vast amount of raw data in its original format until the business needs it
Time Series Patterns - Seasonal Pattern
A time series that shows a recurring pattern over one year or less
Stationary time series
A time series whose statistical properties are independent of time.
Trivial Rule
Already known by anyone who is familiar with the business
Basket Analysis
An analysis of items frequently co-occurring in transactions
Data Warehouse
An integrated, subject-oriented, time-variant, non-volatile database that provides support for decision making.
Analytic implementation - Localized analytics
Analytic efforts are isolated. The organization collects transaction data efficiently but often lacks the right data for better decision-making.
Universal Principles for Data Ethics 10-12
Aspire to design practices that incorporate transparency, configurability, accountability, and auditability. Products and research practices should be subject to internal (and potentially external) ethical review. Governance practices should be robust, known to all team members and regularly reviewed.
Customer privacy dilemma - more customer data
Better understanding of the customer, but a bigger privacy concern
Characteristics of ERP
Broad functionality; integration of "modules"; operates in real time; evolving
Kotter's Steps for Leading Organizational Change - Manage
Build on the change; embed the change into the culture
Outliers
Can severely distort the representativeness of the clustering results
Applications of Cluster Analysis - Marketing
Cluster customers into smaller groups for additional analysis
Applications of Cluster Analysis - Psychology
Clusters are used to identify subcategories of illnesses
ERP (Enterprise Resource Planning)
Collection of integrated software for all functions of business management
Silhouette Coefficient
Combines the ideas of both cohesion and separation for individual points
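A minimal sketch of how the silhouette coefficient of a single point can be computed, assuming Euclidean distance and clusters with at least two points each. The function name and the toy points are illustrative, not from the course material. Here `a` is the mean distance to the other points in the same cluster (cohesion) and `b` is the lowest mean distance to the points of any other cluster (separation).

```python
import math

def silhouette(point_idx, points, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for one point."""
    p = points[point_idx]
    own = labels[point_idx]

    def mean_dist(cluster, skip_self):
        # Mean Euclidean distance from p to the points of one cluster.
        ds = [math.dist(p, q) for j, q in enumerate(points)
              if labels[j] == cluster and not (skip_self and j == point_idx)]
        return sum(ds) / len(ds)

    a = mean_dist(own, True)                       # cohesion
    b = min(mean_dist(c, False)                    # separation
            for c in set(labels) if c != own)
    return (b - a) / max(a, b)

# Hypothetical toy data: two well-separated clusters.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
labels = [0, 0, 1, 1]
s = silhouette(0, points, labels)
print(round(s, 2))
```

Values close to 1 indicate a point that sits well inside its own cluster and far from the others.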
Design of Data Warehouse storage - Data Cube
Combining 3 different attributes in a 3D shape
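As a rough illustration, a data cube can be sketched as a mapping from three attribute values to an aggregated measure. The three dimensions used here (product, region, quarter) and the sales figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records: (product, region, quarter, amount).
sales = [
    ("Laptop", "East", "Q1", 100),
    ("Laptop", "West", "Q1", 150),
    ("Phone", "East", "Q1", 80),
    ("Laptop", "East", "Q2", 120),
]

# Each cell of the cube is addressed by three attribute values.
cube = defaultdict(int)
for product, region, quarter, amount in sales:
    cube[(product, region, quarter)] += amount

def slice_cube(cube, product=None, region=None, quarter=None):
    """Sum the cells that match the fixed dimension values; None means 'all'."""
    wanted = (product, region, quarter)
    return sum(value for key, value in cube.items()
               if all(w is None or w == k for w, k in zip(wanted, key)))

print(slice_cube(cube, product="Laptop"))             # 370
print(slice_cube(cube, region="East", quarter="Q1"))  # 180
```

Fixing one dimension gives a slice; fixing several gives a smaller sub-cube (a dice).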
Kotter's Steps for Leading Organizational Change - Implement
Communicate the vision; remove obstacles; create short-term wins
Useful Rule
Contains high quality, actionable information
Statistic-based Text Mining
Count the number of times words occur; calculate the statistical proximity. Tends to produce many irrelevant results, or noise
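A minimal sketch of the counting step, assuming a simple lowercase regex tokenizer and two toy documents (both illustrative). It also hints at where noise comes from: the word "bank" is counted in both documents even though it means different things.

```python
import re
from collections import Counter

def word_counts(text):
    """Count how often each lowercase word occurs in a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Hypothetical documents.
docs = [
    "The bank approved the loan",
    "The river bank flooded",
]
counts = [word_counts(d) for d in docs]

print(counts[0]["the"])   # 2
print(counts[1]["bank"])  # 1

# Words shared by both documents, regardless of meaning.
shared = set(counts[0]) & set(counts[1])
print(sorted(shared))     # ['bank', 'the']
```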
Kotter's Steps for Leading Organizational Change - Prepare
Create a sense of urgency; form a powerful coalition; create a vision
data lake vs data warehouse - analytics
DW - batch reporting, visualisations; DL - machine learning, predictive analytics
data lake vs data warehouse - data
DW - data comes from business applications and operational databases; DL - data comes from IoT devices and social media
data lake vs data warehouse - schema
DW - designed prior to implementation; DL - designed at time of analysis
data lake vs data warehouse - price/performance
DW - fast query results using higher-cost storage; DL - slower query results using low-cost storage
data lake vs data warehouse - data quality
DW - highly curated data that serves as the truth; DL - raw data that is not curated
DELTA Model
Data - unique, accessible and available to you; Enterprise-wide - data and analytics available across the firm; Leaders - at all levels, who promote a data analytics culture; Targets - identifying the business areas that benefit; Analysts - to execute the strategy
Role of Data
Data --> Information --> Decision-making
Output of a data warehouse
Data mining; business intelligence; data/business analytics for decision making (explanatory analytics, predictive analytics)
Universal Principles for Data Ethics 7-9
Data can be a tool of both inclusion and exclusion. As far as possible, explain methods for analysis and marketing to data disclosers. Data scientists and practitioners should accurately represent their qualifications (and limits to their expertise), adhere to professional standards, and strive for peer accountability.
Unstructured Data
Data does not exist in a fixed location and can include text documents, PDFs, voice messages, emails
Types of Data Warehouses - Operational Data Store
The data warehouse is refreshed in real time. Preferred for routine activities such as storing employee records
Databases in Organizations - Middle management level
Deliver the data required for tactical planning; monitor the use of resources; evaluate performance; enforce security and privacy of data in the database
The relational (operational) database
Describes a precise set of data manipulation constructs
Issues for analytic project implementation - Data-related Challenges
Disparate data sources and data silos; data warehouses are not the only option; dirty data
Benefits of ERP
Efficiency; forecasting; collaboration; scalability; integrated information; cost savings
Regulations & Compliances for privacy - FCC Privacy Act
Ensures the accuracy and protects the privacy of every individual whose protected information is stored in Commission systems or records. Regulates the collection, maintenance, use, and dissemination of Privacy Act-protected information
Analytic implementation - Analytical aspirations
Executives make a commitment to broader use of analytics. The organization has business intelligence tools and data marts. Most data remains un-integrated, non-standardized and inaccessible.
Natural Language Processing (NLP)
Finds meaning in the text by recognizing a variety of word forms as having similar meanings and by analyzing sentence structure to provide a framework for understanding the text. Aims to achieve both speed and accuracy
Cluster Analysis - Objectives
Find useful groups of objects in data, such that items within a group are similar and items in different groups are dissimilar
Data Lake Benefits
Flexibility - allows data to remain in its native form making more data available for analysis
Support
Frequency of transactions that contain both X and Y
Modular structure of an ERP
Functionalities are grouped logically by business process and structured into modules. A module can be detached without affecting other modules. Modules are decided per department
Silhouette Value - 0.5 or more
Good evidence of reality of the clusters in the data
Types of Clustering - Partitional
Group objects into non-overlapping clusters, so that each data object is in exactly one cluster
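k-means is one widely used partitional method (it is not named in the card above, so treat it as an illustrative choice). The sketch below assumes Euclidean distance, a fixed number of iterations, and a naive initialisation that takes the first `k` points as starting centroids. Every point ends up with exactly one label.

```python
import math

def kmeans(points, k, iterations=10):
    """Tiny k-means sketch: returns one cluster label per point, plus centroids."""
    centroids = list(points[:k])  # naive initialisation (an assumption)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Hypothetical 2-D data with two visible groups.
points = [(1, 1), (1.5, 2), (5, 7), (6, 8), (1, 0.5), (5.5, 6.5)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 0, 1, 1, 0, 1]
```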
Applications of Cluster Analysis - Information Technology
Group search terms in clusters that best captures the query
Applications of Cluster Analysis - Biology
Group similar living things together
Types of Clustering - Hierarchical
Groups data into clusters where all data in each cluster is very similar. Does not have to assume any particular number of clusters
Analytic implementation - Analytical companies
High-quality data; enterprise-wide analytical plan; IT processes and governance principles; some embedded or automated analytics.
Characteristics of a data warehouse - time-variant
Historical data is accumulated over the time.
Association Rule Mining - Reducing Candidates (Apriori principle)
If an itemset is frequent, then all of its subsets must also be frequent. The support of an itemset never exceeds the support of its subsets
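A simplified sketch of how the Apriori principle reduces candidates, assuming transactions are given as Python sets and support is the fraction of transactions containing an itemset. The function name, the toy baskets, and the 0.6 threshold are illustrative. A size-k candidate is kept only if every one of its (k-1)-item subsets was already found frequent.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Level-wise frequent itemset generation with Apriori pruning."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        k += 1
        prev = set(level)
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori pruning: drop candidates with any infrequent (k-1)-subset.
        candidates = [c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))]
        level = [c for c in candidates if support(c) >= minsup]
    return frequent

# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = frequent_itemsets(transactions, minsup=0.6)
print(len(frequent))  # 8
```

With these baskets, {bread, diapers, beer} is never counted against the data because its subset {bread, beer} is infrequent.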
Approach to solve privacy issues - Utilitarian approach
If the overall harm exceeds the overall benefit, the practice is regarded as unethical. If personal harm (e.g., loss of pleasure) exceeds any benefit (e.g., convenience), it is regarded as unethical
Goal of ERP
Integrate everything
Role of the DBMS
Intermediary between the user and the database; enables data to be shared; presents the end user with an integrated view of the data; receives and translates application requests into the operations required to fulfill them
Classification
Known number of groups; assign new observations to a predetermined set of groups
Data Lake disadvantages
Lack of governance
Customer privacy dilemma - less customer data
Less understanding of the customer, but a smaller privacy concern
Silhouette Value - 0.25 or less
Little to no evidence of cluster reality
Confidence
Measures how often items in Y appear in transactions that contain X
Lift
A measure that takes statistical dependence into account. The higher the lift, the stronger the association rule
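The three rule measures can be sketched together, assuming support is the fraction of transactions containing an itemset, confidence(X -> Y) = support(X and Y) / support(X), and lift(X -> Y) = confidence(X -> Y) / support(Y). The baskets are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """How often items in y appear in transactions that contain x."""
    return support(transactions, x | y) / support(transactions, x)

def lift(transactions, x, y):
    """Confidence of x -> y relative to how common y is overall."""
    return confidence(transactions, x, y) / support(transactions, y)

# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
x, y = {"diapers"}, {"beer"}
print(round(support(transactions, x | y), 2))    # 0.6
print(round(confidence(transactions, x, y), 2))  # 0.75
print(round(lift(transactions, x, y), 2))        # 1.25
```

A lift above 1 means X and Y occur together more often than they would if they were independent; a lift of about 1 suggests no association.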
Analytic implementation - Analytically impaired
Missing or poor-quality data; multiple definitions of data; poorly integrated systems.
Issues for analytic project implementation - Team-related Challenges
Need for an analytics roadmap; internal vs. external expertise
Issues for analytic project implementation - Leadership Team Challenges
Old-school mindset; lack of continuous involvement
Characteristics of a data warehouse - non-volatile
Only aggregated data is integrated into the DW, and it is never revised, as opposed to transactional data, which can be changed
Approach to solve privacy issues - Kantian approach
People should be respected and treated as individuals capable of rational choice with regard to electronic monitoring
Regulations & Compliances for privacy - HIPAA
Privacy Rule - establishes national standards for the protection of certain health information; Security Rule - establishes national security standards for protecting certain health information that is held or transferred in electronic form
Types of Data Warehouses - Enterprise Data Warehouse
Provides a decision-making service; unified approach for organizing and representing data; ability to classify data according to subject and give access according to those divisions
Regulations & Compliances for privacy - General Data Protection Regulation (GDPR)
Regulation in EU law on data protection and privacy for all individuals within the European Union
Databases in Organizations - Operational management level
Represent and support company operations; produce query results; enhance the company's short-term operations
Text Analytics
Searches through unstructured text data to look for useful patterns. Around 80% of data in an organization is in the form of text documents
Universal Principles for Data Ethics 4-6
Seek to match privacy and security safeguards with privacy and security expectations. Always follow the law, but understand that the law is often a minimum bar. Be wary of collecting data just for the sake of having more data.
Inexplicable Rule
Seems to have no explanation and does not suggest a course of action
Time Series Patterns - Cyclical Pattern
Shows a periodic pattern lasting more than one year
Silhouette Value - 0.25-0.5
Some evidence of reality of the clusters in the data; more investigation needed
Types of Data Warehouses - Data Mart
Specially designed for a particular line of business, such as sales or finance. Data can be collected from multiple sources
Design of Data Warehouse storage - Star Schema
Stores multi-dimensional data in tables. Each table is connected to a central main (fact) table
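A small sketch of a star schema using Python's built-in `sqlite3` module with an in-memory database. The table and column names (`fact_sales`, `dim_product`, `dim_store`) and the rows are hypothetical. The central table holds the measures and points to the surrounding tables by key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Surrounding tables describe the dimensions.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
-- The central table stores the measures plus a key to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
INSERT INTO dim_store   VALUES (1, 'Boston'), (2, 'Denver');
INSERT INTO fact_sales  VALUES (1, 1, 1000.0), (2, 1, 500.0), (1, 2, 1200.0);
""")

# Aggregate the central table by an attribute of one dimension.
rows = conn.execute("""
    SELECT s.city, SUM(f.amount)
    FROM fact_sales f JOIN dim_store s ON f.store_id = s.store_id
    GROUP BY s.city ORDER BY s.city
""").fetchall()
print(rows)  # [('Boston', 1500.0), ('Denver', 1200.0)]
```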
Databases in Organizations - Top management level
Strategic decision planning; identify growth opportunities; define and enforce organizational policies; reduce costs and boost productivity; provide feedback
Cluster Cohesion
Sum of the weights of all links within a cluster
Cluster Separation
Sum of the weights between nodes in the cluster and nodes outside the cluster
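Both graph-based measures can be sketched on a small weighted graph, assuming links are stored as a dictionary from node pairs to weights. The node names and weights are made up for illustration.

```python
def cohesion(weights, cluster):
    """Sum of the weights of links with both endpoints inside the cluster."""
    return sum(w for (u, v), w in weights.items()
               if u in cluster and v in cluster)

def separation(weights, cluster):
    """Sum of the weights of links with exactly one endpoint in the cluster."""
    return sum(w for (u, v), w in weights.items()
               if (u in cluster) != (v in cluster))

# Hypothetical similarity graph: (node, node) -> link weight.
weights = {
    ("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.7,
    ("c", "d"): 0.2, ("d", "e"): 0.6,
}
cluster = {"a", "b", "c"}
print(round(cohesion(weights, cluster), 2))    # 2.4
print(round(separation(weights, cluster), 2))  # 0.2
```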
Issues for analytic project implementation - External Challenges
The big-bang approach vs. the low-risk approach; pretty visualizations vs. actionable insights
Cluster Analysis
The data preparation technique used in market segmentation to divide consumers into different homogeneous groups
Universal Principles for Data Ethics 1-3
The highest priority is to respect the persons behind the data. Account for the downstream uses of datasets. The consequences of utilizing data and analytical tools today are shaped by how they've been used in the past.
Analytic implementation - Analytical competitors
The organization is routinely reaping big benefits from its enterprise-wide analytics capability. The organization has a full-fledged analytic architecture that is enterprise-wide, fully automated and integrated into processes.
ETL Process (Extract, Transform, and Load)
The process of extracting data from source systems and bringing it into the data warehouse
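A minimal sketch of the three ETL steps, assuming the source system is a CSV export and the warehouse is an in-memory SQLite database. The column names, the cleaning rules, and the `orders` table are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system.
raw = "order_id,amount,country\n1, 19.99 ,us\n2,5.00,US\n3,,ca\n"

def extract(text):
    """Extract: read the rows out of the source format."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount and standardize values."""
    clean = []
    for r in rows:
        amount = r["amount"].strip()
        if not amount:
            continue
        clean.append((int(r["order_id"]), float(amount),
                      r["country"].strip().upper()))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders "
                 "(order_id INTEGER, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)  # [(1, 19.99, 'US'), (2, 5.0, 'US')]
```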
Privacy Definition
The right to be left alone; the right to control access to one's personal information; the right to withhold certain facts from public knowledge
Time-series Analysis
Uncovers a pattern in a time series and then extrapolates the pattern into the future. Assumes that similar patterns in the past will be repeated in the future. Based solely on past values
Structured Data
Typically numeric or categorical. Can be organized and formatted in a way that is easy for computers to read, organize, and understand. Can be inserted into a database in a seamless fashion.
Identifying Time Series Patterns (Forecasting)
Ultimately, the user should decide which model to use based on the software output and their managerial knowledge. Uses linear regression
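One way linear regression can be used for forecasting is to fit a straight trend line against the time index and extend it one period ahead. The sketch below assumes ordinary least squares with time steps 0, 1, 2, ... and a made-up sales series; it is only one of several models a user might compare.

```python
def fit_trend(values):
    """Least-squares line y = intercept + slope * t, with t = 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
             / sum((t - t_mean) ** 2 for t in range(n)))
    intercept = y_mean - slope * t_mean
    return intercept, slope

# Hypothetical series with a steady upward trend.
sales = [10, 12, 14, 16, 18]
intercept, slope = fit_trend(sales)

# Extrapolate the fitted line to the next time period.
forecast = intercept + slope * len(sales)
print(round(slope, 2), round(forecast, 2))  # 2.0 20.0
```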
Clustering
Unknown number of groups; assign observations to groups based on their similarities and differences
Association Rule Mining - Min support/confidence level
Used in frequent itemset generation and rule generation: keep all itemsets whose support >= minsup, then all rules whose confidence >= minconf
Characteristics of a data warehouse - subject-oriented
Uses multi-dimensions to slice-and-dice the aggregated data
Time Series Patterns - Horizontal Pattern
When data fluctuates randomly around a constant mean
Time Series Patterns - Trend Pattern
Gradual shifts or movements to relatively higher or lower values over a longer period of time, such as population change