SAP
Data Mining Model
A data mining model is the statistical technique that is chosen to find trends, patterns, and relationships.
Association Analysis
Association analysis is the technique of finding hidden relationships between and among sets of items that frequently occur together in a dataset. Association analysis examines a group of transactions and then determines the rules that predict the probability that a particular item will occur in a transaction based on the frequency with which other items occur. Association analysis is also called affinity analysis. Businesses can use these associations to promote and recommend items that often occur together; for example, pizza and soft drinks.
Data Mining and Statistics
At the core of data mining is statistical analysis. Significantly, some scientists and analysts believe that big data and data mining are driving advances in statistical methods. In addition, data mining is cross-disciplinary; that is, it can use artificial intelligence (AI), machine learning, databases, algorithms, and visualizations. With large datasets, AI and machine learning become important because they can automate the process.
Characteristics of Transactional Systems
Availability should be as close to 100% as possible; detailed down to the individual transaction; updatable; process transactions quickly; store current information (older data are archived); support the organization's business functions; require concurrency management to deal with users who try to access the same data at the same time; support business processes; process small, uniform transactions; optimized for quick writing and storage; data are functionally oriented.
Clustering
Clustering entails grouping together data points that are similar to one another. Each data point within a cluster is similar to the others within the cluster, and it is dissimilar to the data points in other clusters. In this figure, how many clusters of similar items do you see? Two.
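Grouping by similarity can be sketched with k-means, one common clustering algorithm. The source does not name a specific algorithm, so the choice of k-means, the `kmeans` helper, and the sample points below are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iterations=10):
    """Minimal k-means sketch: returns one cluster index per point."""
    # Assumption: seed the centroids with the first k points.
    # This is deterministic but not robust for real data.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

# Hypothetical data: two visibly separate groups of (x, y) points.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.9, 8.3), (0.9, 1.1), (8.2, 7.7)]
labels = kmeans(points, k=2)
print(labels)
```

Points that share a label are closer to their own cluster's center than to the other cluster's center, which is the "similar within, dissimilar across" property described above.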
Database Interactions
Create, Read, Update, Delete
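The four interactions map directly onto SQL statements. A minimal sketch using Python's built-in `sqlite3` module follows; the `customer` table and its values are hypothetical and not taken from the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Create: insert a new row.
conn.execute(
    "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)",
    (1, "Bike Shop A", "Dallas"),
)

# Read: query the row back.
row = conn.execute("SELECT name, city FROM customer WHERE id = ?", (1,)).fetchone()

# Update: change a value in place.
conn.execute("UPDATE customer SET city = ? WHERE id = ?", ("Miami", 1))
updated_city = conn.execute("SELECT city FROM customer WHERE id = ?", (1,)).fetchone()[0]

# Delete: remove the row.
conn.execute("DELETE FROM customer WHERE id = ?", (1,))
remaining = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]

print(row, updated_city, remaining)
conn.close()
```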
Database Anomalies
Create, update, and delete interactions can cause anomalies. Update anomalies occur when the same data are stored in multiple places. Insert anomalies result when there is no place within the table to store the new data until another event occurs. Delete anomalies occur when deleting some data results in the unintentional deletion of other data. Databases are "normalized" to eliminate anomalies.
Data Staging
Data are acquired, cleansed, and harmonized. With cheap data storage and large processing capacity, data mining can be applied to the complete dataset rather than to a sampling of it.
Structured Data - Spreadsheets
Data integrity may be a concern; file size is limited; easy to use.
Data Mining Overview
Data mining involves sifting through large volumes of data to obtain insights. It is a process. There is no prior knowledge of what the data will reveal. It helps discover patterns, relationships, and trends that are not evident using the techniques discussed thus far. It is used not only in business but also in medicine, government, climate study, entertainment, sports, and research in almost any area of study you can think of.
Summary
Data provisioning involves providing users and other systems with access to data. Structured data are easily read by computers, and unstructured data need to be processed before computers can use them. Data come from many sources: transactional systems, informational systems, legacy systems, web services, info agents, social media, and sensors. Transactional systems are used for a business's operations, and informational systems are used for analysis. When collecting data for analysis, we can collect all of them, sample them, or calibrate or scale them. Data may be collected automatically through continuous monitoring, feedback mechanisms, and intelligent control agents.
Descriptive Models for Data Mining
Descriptive data models are used to explore and describe the inherent connections and structure within data. These connections are not easily found by slicing-and-dicing techniques, especially when the number of attributes is large. Descriptive models do not attempt to predict a target variable. There are no independent or dependent variables. Instead, the model groups together cases that are similar to each other. These models are also called unsupervised models. Two models will be considered here: clustering and association analysis.
Fast Facts about SAP S/4HANA Enterprise Resource Planning (SAP ERP)
Enables a company to support and optimize its business processes; helps the organization run smoothly; real-time environment; scalable and flexible; collections of logically related transactions within identifiable business functions.
Data Sources
Legacy systems: usually older technologies, sometimes developed internally. Web services: HTTP invokes the application; platform agnostic via protocols such as SOAP (Simple Object Access Protocol); data are stored as XML. Web pages: crawlers and info agents; web scraping; web crawling.
SAP - The Intelligent Enterprise
Market leader in enterprise application software. One of the world's largest independent software manufacturers. 400,000 satisfied customers in 190 countries. SAP enables companies to • Streamline processes • Use live data • Predict customer trends • Connect entire businesses. Best-run technologies and solutions deployed end-to-end for your industry. Intelligence will reinvent industries and change business forever.
Non-integrated Special Purpose Transactional Systems
Not all businesses use ERPs. Smaller businesses may not need integrated systems. Functionally oriented systems such as point-of-sale or computer-aided design (CAD) often remain un-integrated.
What's New? SAP HANA
SAP HANA is an in-memory database that allows you to process data very fast. But SAP HANA can do much more with its different processing engines.
Fast Facts about SAP S/4HANA Master Data
Stored for a long time and seldom changed. Represent logically grouped data such as: • Customer Master • Material Master • Vendor Master • General Ledger accounts
Deployment
The next phase of the data mining process is to deploy the chosen and tested data model. This is where new technologies such as in-memory databases are used to perform real-time predictive analytics
The best model is chosen and tested against the last partition of data, the test partition.
The test partition provides an unbiased estimate of how the chosen model will perform with new data. The test partition does not impact the training, selection or validation of the model.
Benefits of an ERP
Transactional data are entered once and then can be shared. Changes to master data are entered once and used many times. The data processing and storage functionality of all of the business processes are consolidated in a single system, reducing IT costs.
Summary
We examined the various data models that allow us to make forecasts and predictions. For forecasting, we discussed the use of time series analysis to identify patterns, trends, and seasonality as well as the modeling techniques that enable us to separate the random values in a time series from those we can explain. For predictive modeling, we considered the two basic types of predictive models, estimations and classifications. It is important to note that predictive data models are supervised; that is, they need to be trained, validated, tested, and run in real-world scenarios. Models are evaluated and retrained from time to time.
Summary
You have been introduced to data mining and the data mining process. There are five phases in the data mining process: staging, training, validation, deployment, and monitoring. There are three types of models: descriptive, predictive, and prescriptive. Training, validation, and testing are used for predictive data mining. Data are partitioned into three datasets.
Descriptive models*
for analytics are used to describe trends, patterns, and relationships in existing data without making predictions for future data. Also known as unsupervised models. Examples: clustering, association analysis.
Structured Data
-Organized or modeled, typically in rows and columns -Fixed-width cells or fields -Values are prescribed; that is, each cell contains values related to a given input such as sales dollars, name, and so on. -Data are more easily scanned or examined -Understandable by computers -Structured text, numeric, dates
Unstructured Data
-Unorganized -Data can be of varying lengths -Values may not be topical; that is, they may be free format and not relate to each other, such as comments and suggestions -Data are more difficult to scan or examine -Need to be "translated" so that a computer can read the content -Unstructured text, audio, video, pictures, and graphics
SAP S/4HANA ...
... is the next generation Business Suite ... is the biggest innovation since SAP R/3 ... connects people, business networks and devices ... works in real time ... represents efficiency, simplicity and innovation ... master data is managed centrally, for example partners, customers and vendors
SAP - Run Simple
1972: Foundation of SAP; development of real-time data application software. 1981/82: Introduction of SAP R/2. 1986-89: Development of SAP R/3; SAP presents at the CeBIT in Hanover. 1993/94: Partnership with Microsoft, connecting SAP R/3 with the Windows NT operating system; IBM Corporation is now using SAP R/3. 1995/96: SAP joins the Internet; SAP R/3 can now be used online. 2002: 30th birthday of SAP; 3rd largest independent software provider. 2005/06: Announcement of the release of SAP ERP. 2009: Launch of Business Suite 7; optimization of business performance; reduced IT costs. 2011: Implementation of the first SAP in-memory computing product, the SAP HANA platform; data access possible in seconds. 2013: SAP Business Suite moves to SAP HANA; fastest-growing product in the history of enterprise software. 2015: SAP S/4HANA.
Structured Data - Databases
A bit of history. Hierarchical model: parent-child relationship. Network model: parent-child relationship, but children can have more than one parent. Object-oriented modeling: entities as objects. Relational databases: most transactions are processed on relational databases.
Data Mining Accuracy
A frequently encountered question is how accurate, and therefore reliable, a data mining activity is. Reliability is often tied to the application scenario of the data mining process. In the sciences, for example, accuracy and reliability are critical. Although no prediction can be 100% accurate, these scenarios require very high reliability. In business applications, the reliability question is somewhat different. Instead of asking how accurate the model is, we ask whether the data mining results offer an improvement over no data mining. We are interested in the added benefit of data mining over the absence of data mining. If the benefit outweighs the cost, then data mining activities become justified.
Enterprise Resource Planning Systems (ERP) are Transactional Systems
A fully integrated system to allow functional business areas to share data
Data Mining
After completing this chapter, you will be able to: Define data mining. Describe the data mining process. Explain the differences among descriptive, predictive, and prescriptive data models
Descriptive Models for Data Mining
After completing this chapter, you will be able to: Define the basic statistical models used in descriptive data mining. Use descriptive data mining techniques to gain insights. Identify the appropriate data models for various descriptive data mining problems.
Data Analytics Overview
After completing this chapter, you will be able to: Describe what data analytics is. Explain why the study of analytics is important. Recap examples of analytics in real-world situations, particularly business scenarios. Describe the structure of the model company GBI. Get familiar with GBI employees, who appear in many of the examples in this text.
Global Bike Inc. (GBI)
Company history: John Davis and Peter Weiss are co-CEOs; operations in the United States (US) and Germany (DE). The business: professional and prosumer cyclist market; GBI sells high-quality bicycles and cycling accessories; sells to retailer partners, not to the consumer directly.
Fast Facts about SAP S/4HANA Digital Transformation - Why?
DRIVERS OF CHANGE. Customer: contact with companies through social media, etc.; changing way of life; empowered and informed customer. Research: simulation; forecasting; new possibilities. Companies: new organization units; changing customer expectations. Technology: IoT; cloud; mobility; big data.
What is Data Analytics?
Data Analytics can answer these and other questions: What has happened in the past? Why did it happen? What could happen in the future? With what certainty? What actions can we take now to support or prevent certain events from happening in the future? Can some of the actions resulting from our discoveries be automated?
Why Study Analytics?
Demand for employees who understand and can analyze data; huge growth in the amount of data available; analytics can provide strategic advantages to an organization.
Summary
Descriptive data mining uses unsupervised models. Descriptive data mining seeks only to describe the current data and not to predict future values. Two descriptive models are clustering and association analysis. Descriptive data modeling helps us understand behaviors and transactions within our system
Data Sources - Sensors
Equipment sensors: equipment status; maintenance alerts; safety monitoring; input and output measurements; prediction of failure; meter reading for billing and other purposes; diagnosis and repair. Health monitoring sensors. Traffic and vehicles. Science and engineering. Satellites.
Predictive models are of two types:
Estimation models - attempt to approximate or otherwise determine outcomes based on multiple parameters and known relationships expressed as mathematical algorithms or parametric equations; that is, equations that express a set of quantities as functions of independent variables. Classification models or classifiers - categorize data, entities, and events to identify patterns that explain how different variables in a model contribute to an outcome
Monitoring
Even after a data model has been deployed, there is always room for improvement. To improve a data model, new data are added back into the training phase so that the model can learn and change as new behaviors, trends, and patterns are discovered.
What's New? Real Time Simplification
Example of data compression. Traditional database architecture: principles of the data model • Normalized data modeling (third normal form) • Avoid unwanted redundancies • Avoid inconsistencies and anomalies. Disadvantages - Frequent use of redundant data to increase the performance of, e.g., data aggregation - Higher effort to update redundant data. Principles of the S/4HANA data model • Storage of data in denormalized form • Single source of data • No longer a need for redundant data storage for tasks such as aggregation • Processing of aggregation and analytics on the fly • Checks for inconsistencies and anomalies due to denormalization are no longer a critical task
Structured Data - Flat File
Examples: SimpleDB, Tokyo Cabinet, MongoDB; .csv and .txt files
Data Acquisition
Identify the various types and sources of data for analysis and explain their characteristics. Describe the essential features of transactional and informational systems. Explain the uses of spreadsheets, flat files, and databases for data storage. Discuss data-collection techniques.
Number of Rules
If the transaction includes only two items, then there will be only 2 rules. Three items will generate 12 rules. With just 10 items, the number of rules grows to roughly 57,000! Excessive numbers of rules represent an obstacle to effective analysis. Not only are they time consuming to construct, but they are also unwieldy to understand and interpret.
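The counts quoted above match a standard closed-form expression for the number of possible rules over d items, R = 3^d - 2^(d+1) + 1. That formula is not stated in the source; it is offered here as the likely basis for the figures. The sketch below checks it against a brute-force enumeration of every non-empty antecedent paired with every non-empty, disjoint consequent. Both function names are hypothetical.

```python
from itertools import combinations

def rule_count_formula(d):
    """Closed-form count of possible association rules over d items."""
    return 3**d - 2**(d + 1) + 1

def rule_count_brute_force(d):
    """Count rules by enumerating antecedent/consequent pairs directly."""
    items = range(d)
    count = 0
    # The antecedent must leave at least one item for the consequent.
    for a_size in range(1, d):
        for antecedent in combinations(items, a_size):
            rest = [i for i in items if i not in antecedent]
            # Any non-empty subset of the remaining items is a valid consequent.
            for c_size in range(1, len(rest) + 1):
                count += sum(1 for _ in combinations(rest, c_size))
    return count

for d in (2, 3, 10):
    print(d, rule_count_formula(d), rule_count_brute_force(d))
```

For 10 items both approaches give 57,002, which is why the count is usually quoted as about 57,000.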
Data Sources - Informational Systems
Informational system (OLAP): provides a place for data to be stored and prepared for analytical purposes; holds large volumes of historical data; archived, aggregated data which may have come from a transactional system originally. Example - a data warehouse.
What is Data Analytics?
It is a process that involves: gathering data that are sometimes not in a usable form; cleaning up the data to make them usable; loading the data into storage models; manipulating them to discover the information.
Characteristics of Informational Systems
Less detailed than transactional systems; data are stored in summarized form. Data are extracted from other systems and loaded into the informational system periodically. Needs to be designed to handle a variety of queries and ad hoc reporting. Supports managerial decision making, frequently strategic planning. Optimized for quick reading. Data are historical. Data may be integrated across areas, times, regions, etc. Availability based on user authentication and authorizations.
Data Gathering Methods
Once the data sources are identified, we can gather or acquire the data. Gather all the data. Gather a sample of the data: sampling extracts only some of the data from the source. Sampling is appropriate when one or more of the following conditions exist: the analysts are certain that each data point is representative of the entire set; the source dataset is too large for the planned analysis; the application specifically calls for a data sample, as is the case with some accounting and regulatory compliance audits. Calibrate the data: use of known data relationships to predict possible dataset values. Scaling: standardizes data to a common scale, such as the zero mean and unit standard deviation of the standard normal distribution. Continuous monitoring and embedded audit modules: automated data collection frequently used to discover unusual values in the data. Feedback mechanisms: logs; exception reports. Intelligent control agents.
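Scaling is often done with z-score standardization, which shifts values to a mean of 0 and rescales them to a standard deviation of 1. The source does not say which scaling method it has in mind, so the z-score, the `standardize` helper, and the sample figures below are assumptions.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical sales amounts on an arbitrary scale.
sales = [120.0, 150.0, 90.0, 200.0, 140.0]
z_scores = standardize(sales)
print([round(z, 3) for z in z_scores])
```

Standardizing changes the scale of the data, not the shape of its distribution, so values measured in different units become comparable without being forced into a bell curve.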
Fast Facts about SAP S/4HANA Data Types in ERP Systems
Organizational data: Company Code, Plant, Storage Location, Distribution Channel, Purchasing Organization. Master data: Person, Material, Customer, Vendor, Work Centre. Transaction data: Purchase Order, Invoice, Quotation, Sales Order, Transportation Order.
Training Partition Trained
Predictive data mining models are supervised models; that is, they need to be trained on existing data in order to perform predictions for new data. The data are split into three partitions or sets: training, validation, and test.
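A three-way split can be sketched as a shuffle followed by slicing. The 60/20/20 proportions, the fixed seed, and the `partition` helper are assumptions chosen for illustration; the source does not specify partition sizes.

```python
import random

def partition(records, train_pct=60, validation_pct=20, seed=42):
    """Shuffle records and split them into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = len(shuffled) * train_pct // 100
    n_validation = len(shuffled) * validation_pct // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test

train, validation, test = partition(range(100))
print(len(train), len(validation), len(test))
```

Each record lands in exactly one partition, so the test partition stays untouched by training and validation.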
Relational Databases
Relationships are created using keys. Primary key: an attribute or combination of attributes that can be used to uniquely identify a specific row (record) in a table. Foreign key: an attribute in one table that is a primary key in another table; used to link the two tables.
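The sketch below shows both kinds of key with Python's built-in `sqlite3` module. The `customer` and `sales_order` tables are hypothetical. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

# customer_id is the primary key: it uniquely identifies each customer row.
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# sales_order.customer_id is a foreign key pointing at customer.customer_id.
conn.execute(
    "CREATE TABLE sales_order ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(customer_id),"
    " amount REAL)"
)
conn.execute("INSERT INTO customer VALUES (1, 'Bike Shop A')")
conn.execute("INSERT INTO sales_order VALUES (100, 1, 2500.0)")

# The foreign key links the two tables in a join.
rows = conn.execute(
    "SELECT o.order_id, c.name, o.amount FROM sales_order o"
    " JOIN customer c ON c.customer_id = o.customer_id"
).fetchall()

# An order that references a non-existent customer is rejected.
try:
    conn.execute("INSERT INTO sales_order VALUES (101, 99, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rows, rejected)
conn.close()
```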
Examples of Business Analytics
Retail - pricing, timing or pricing strategies, discounts, product placement, up-selling and cross-selling of products. Manufacturing - demand forecasting, production planning. Marketing - targeted marketing. Supply chain - vendor selection, optimizing distribution costs. Customer service/help desk - customized service. Forecasting and budgeting. Audit and analysis of internal controls - risk assessment. Governments - resource allocations, tax compliance. Utilities - demand forecasting, management of power supplies. Investors - determine which investments are acceptable. Science - in just about every area of science, analytics are important for interpretation of data. Medicine - many applications: risk factor identification, treatment plans, disease prevention and control. Sports - player acquisitions, analysis of player performance. Fraud prevention - credit card fraud.
Fast Facts about SAP S/4HANA On-Premise vs. Cloud
SAP S/4HANA Cloud Edition: Subscription licensing. Deployment in the private cloud, maintained by SAP • SAP provides system and controls maintenance • Automatic participation in quarterly innovation upgrades • In-App extensibility with limited ABAP • Current release cycles • SAP ERP embedded. SAP S/4HANA On-Premise: Traditional licensing with customer control of deployment and maintenance • Private control of deployment and maintenance • Hardware at company's location • Privately controlled data • Fewer release cycles • Individual requirements possible • Traditional ABAP extensibility up to core modification
Which of the following statements regarding SAP S/4HANA are true?
SAP S/4HANA provides scalable, real-time, predictive and simulation capabilities: YES. SAP S/4HANA is only available as a cloud solution: NO. SAP S/4HANA is made for siloed, non-integrated business: NO. SAP S/4HANA enables instant, contextual information and a personal experience: YES. SAP S/4HANA is SAP's new suite to help customers reimagine their business: YES.
Which of the following statements regarding SAP S/4HANA On-Premise are true?
SAP provides system and controls maintenance: NO. Privately controlled data: YES. Automatic participation in quarterly innovation upgrades: NO. Hardware at company's location: YES. Fewer release cycles: YES.
Fast Facts about SAP S/4HANA Organizational Unit
SAP terminology: Client, Company Code, Plant, Sales Organization, Division, Storage Location
Multivariate time series
Several variables vary over time. We want to model the interactions among them. Example: temperature and carbon dioxide concentration over long time periods
Univariate time series
Single variable that varies over time. Examples: interest rate, global temperature, inventory value, population
Data Sources - Social Media
Text and clicks from social networking sites: application programming interface (API) provided by the web site; crawl the site with automated scripts; custom applications. Text and clicks from news sites. Text and clicks from e-commerce sites: click stream analysis.
Data Analytics - The Convergence of Vocabulary
Statistics is used in data analytics. Computer science improves capabilities to perform data analytics. Domain knowledge in every area has its unique vocabulary and analytical applications.
Structured Data
Structured data is organized such that it can be read and used by people and computers. String, numeric, or dates. Stored in cells or fields of fixed length. Examples: spreadsheets, flat files, databases.
Processing Unstructured Text Data
Tagging - Extensible Markup Language (XML) and eXtensible Business Reporting Language (XBRL). Natural language processors (NLP) - people speak and the computer "translates" into commands it can understand. Example - Python. Image recognition - translation of pictures into information the computer can understand. Example - Jetpac. Artificial intelligence (AI) - computers that can reason. Example - IBM Watson and cognitive computing. Much more...
lift
The final metric used to evaluate a rule. Lift is the measure of how accurately a rule predicts affinity or association compared to the random (coincidental) co-occurrence of the items. Lift = Confidence / Support of the consequent. Lift values greater than 1 imply that the antecedent and consequent are associated (correlated) with each other. The higher the value, the stronger the affinity. Lift values equal to 1 imply that the two are independent of one another. Lift values less than 1 imply that the two are negatively correlated; that is, the occurrence of the antecedent has a negative effect on the occurrence of the consequent.
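Lift can be computed from a list of transactions once support and confidence are available. The sketch below uses five made-up transactions; the data and the helper names `support`, `confidence`, and `lift` are assumptions for illustration.

```python
# Hypothetical transactions: each is the set of items bought together.
transactions = [
    {"pizza", "soft drink"},
    {"pizza", "soft drink", "chips"},
    {"pizza"},
    {"soft drink"},
    {"chips"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Share of antecedent transactions that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Lift = confidence / support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)

rule_lift = lift({"pizza"}, {"soft drink"})
print(round(rule_lift, 3))
```

Here support for {pizza, soft drink} is 2/5, confidence is (2/5) / (3/5) = 2/3, and support of the consequent is 3/5, so lift is about 1.11: a mild positive association.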
Metrics for Association Rules
The frequency and strength of association rules are measured by three ratios: support, confidence, and lift.
Prediction interval
The interval within which we can forecast that the variable value will fall with a certain probability
Antecedents and consequents
Antecedents are the items on the left side of the arrow; consequents are the items on the right side. The general format of the rule is: If (antecedent(s)) Then (consequent(s)).
Data Provisioning
The process of providing users and systems with access to data. Internally, it includes authorizations and security. Externally, by permission or open source. Data are replicated prior to extraction from the source.
Association Rules
The rules that are generated by association analysis. An example of an association rule is {PIZZA-BY-THE-SLICE} → {SOFT DRINK 20 OZ}. This rule simply states that when people buy pizza by the slice, they also buy a 20-ounce soft drink.
Systems
There are three types of systems (a group of related things and events or processes): deterministic, chaotic, and nondeterministic (stochastic).
Nondeterministic systems (stochastic systems)
These systems do not have deterministic laws or rules. Instead, they are modeled using random variables, sometimes expressed as a range of values. Use of distributions; use of probabilities; best use of data mining.
Time Series Components
This figure shows the original time series, the trend component, the seasonal component and finally the random component.
Fast Facts about SAP S/4HANA Transaction Data
Transaction data is the system record of a business event. Depending on the business event, different master data and organizational data will be referenced. For example, during a sales order business event, the following data is stored • Organizational level: client, company code, sales organization • Master data: customer, material, pricing (condition) • Situational data: date, time, person, amount
Fast Facts about SAP S/4HANA Documents
Transactions are data sets that are generated when a business transaction is executed. A document is a record of the business transaction. It includes all relevant predefined information from the master data and organizational entities. Examples: • Sales Document • Purchasing Document • Material Document • Accounting Document. Document flow: the document flow as well as the order status allow the setting of the status at any point in time. SAP revises the status every time a change in a document takes place.
Unstructured Data
Unstructured data comes in many forms: images, video, audio, text, XML (tagged data). Unstructured data may not be understandable to a computer in its native form. It must be transformed to be shared by a computer.
Prescriptive models
answer the question, 'What action (decisions) should I take based on the results of the predictive data mining?'
Chaotic systems
are also deterministic; however, they are highly sensitive to slight fluctuations in inputs. The challenges in these systems are the same as in deterministic systems but are compounded by input value variations. Since slight differences in data input can cause erratic output values, models developed for chaotic systems can be unreliable.
E-commerce
cross-selling and up-selling products or services to customers based on their interests and previous purchases as well as reviews and interactions on social media.
Predictive
data mining involves the partitioning of datasets with known target variables into three subsets to train, validate, and test a model. This process is known as supervision of the model.
Trend
direction of the data change over time. Can be computed as: average, moving average, least-squares fit.
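Of the options listed, the moving average is the simplest to sketch: each trend value is the mean of a sliding window over the series. The window size of 3, the `moving_average` helper, and the sample values are assumptions for illustration.

```python
def moving_average(values, window):
    """Mean of each consecutive run of `window` values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical monthly sales with small ups and downs around a rising trend.
monthly_sales = [10, 12, 11, 13, 15, 14, 16, 18]
trend = moving_average(monthly_sales, window=3)
print(trend)  # [11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
```

The averaged series rises steadily even though the raw values do not, which is the direction of change the trend is meant to capture.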
Support
for a rule is the fraction or percentage of transactions that contain all of the items within the rule. Because support is a probability, its values range between 0 and 1. Support = (Number of transactions containing Antecedent + Consequent) / (Total of all transactions)
Customer relationship
identifying customer preferences and providing customized services to each customer.
Fraud detection
identifying unauthorized use of credit cards and fraudulent filings of tax returns
transaction
in association analysis refers to a collection of items that occur together in an identifiable event
Simple linear regression
is a mathematical model that creates an arithmetic equation to explain the relationship between variables. The goal of simple linear regression is to fit a straight line through the points on a chart between the dependent and independent variables. Once the equation for the straight line is known, then you can estimate the value of the dependent variable for any given value of the independent variable (within its interval of validity).
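The straight line is usually fitted by ordinary least squares, where the slope is the covariance of x and y divided by the variance of x. The sketch below assumes that method; the `fit_line` helper and the perfectly linear sample data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
ad_spend = [1, 2, 3, 4, 5]   # independent variable
revenue = [3, 5, 7, 9, 11]   # dependent variable
slope, intercept = fit_line(ad_spend, revenue)

# Estimate the dependent variable inside the observed range of x.
estimate = slope * 3.5 + intercept
print(slope, intercept, estimate)
```

The estimate is taken at x = 3.5, inside the observed range of 1 to 5; predictions far outside that range fall beyond the interval of validity mentioned above.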
cycle
is a pattern that displays highs and lows outside or in addition to the seasonal highs and lows. The length of a cycle does not need to be constant, in contrast to the fixed period of seasonality.
time series
is a sequence of values of an attribute, or variable, taken at equidistant intervals of time.
Time series analysis
is a technique that analysts use to: uncover any implicit structure (patterns or trends) in the data; model that structure to make forecasts.
Forecast
is an estimation of the value of a variable in the future. Example: Revenue forecast for next year
Confidence
is the probability that the consequent items occur in transactions that contain the antecedent items. Confidence = Support of the rule (antecedent + consequent) / Support of the antecedent.
deterministic system
is the most predictable of the three. Within a deterministic system, given the present state (the complete description of all system attributes at the present time), we can predict, at least in theory, with full certainty, all future states. The challenge is in discovering the rules that govern it. There is no need for data mining.
validation partition
is used to check how the model performs on predicting the outcomes. The actual and predicted outcomes are compared to estimate the model performance
training partition
is used to train the model based on existing data.
Prediction
is used to uncover and understand relationships between variables in order to predict future behavior. Example: Which customer is likely to buy a bicycle?
Seasonality
pattern of regular fluctuations in the data over time. The period could be months, quarters, seasons, weeks, etc.
Advertising
streaming targeted ads to online users based on their browsing history and social media activities.
Conditional Probability
that is, it measures the probability of an event given that another event has occurred. Its value varies between 0 and 1. We must be careful not to conclude that the rule with the highest confidence is the most reliable or most important. In fact, a major drawback of confidence is that it measures only the conditional probability of the consequent given the existence of the antecedent. What it misses is the probability that the consequent will occur in cases where the antecedent does not.
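A small worked example makes the drawback concrete. The counts below are invented for illustration: a rule {tea} → {coffee} can have high confidence even though buying tea makes coffee slightly less likely than it is among customers who buy no tea.

```python
# Hypothetical counts out of 1,000 transactions.
total = 1000
tea = 200              # transactions containing tea
coffee = 800           # transactions containing coffee
tea_and_coffee = 150   # transactions containing both

# Confidence of {tea} -> {coffee}: P(coffee | tea).
confidence = tea_and_coffee / tea

# What confidence ignores: P(coffee | no tea).
coffee_without_tea = (coffee - tea_and_coffee) / (total - tea)

# Lift compares confidence with the overall support of the consequent.
lift = confidence / (coffee / total)

print(confidence, coffee_without_tea, round(lift, 4))
```

Confidence is 0.75, which looks strong, yet coffee appears in 81.25% of transactions without tea, and the lift is below 1. Checking lift alongside confidence catches this case.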
User profile
understanding user behaviors on social media to create a user profile. This analysis can be used for advertising and targeted marketing.
Predictive models
use existing (current or historical) data to analyze and discover trends, patterns, and relationships with the goal of applying results to future data to make forecasts and predictions. Also known as supervised models. Two types: estimation and classification.