Data Science
- unsupervised learning technique; in an iterative process, EM determines the parameters of a model so that they explain the observed data optimally. For this purpose, the algorithm presumes the presence of unobserved categories. It alternately estimates (1) the membership in one of these categories and (2) the parameters which determine this category
EM (wiki)
› Model-based procedure • 'Imputation' of missing data + estimation of the model in one step • Limitation: the model must fit the type of analysis that you plan to execute
Full information maximum likelihood (FIML)
1. There is typically so much unstructured data that filtering is needed 2. Unstructured data require structuring 3. Unstructured data need to be matched to other data sources
Give three reasons why analyzing unstructured data is so much harder than analyzing structured data.
- generate accurate predictions, based on given observations in the dataset - obtain unbiased estimates of means, variances and regression parameters --> It is not necessarily the goal to estimate the correct (but missing) value
Goal of imputation
- File system which enables the storage of huge data volumes on the file systems of multiple computers (nodes) --> Splits huge data files into blocks (64 MB), each of which is saved on one node in a Hadoop cluster - the nodes are connected, making them one big file system -> responsible for storing data on the cluster
HDFS (Hadoop Distributed File System)
1. Column headers are values, not variable names 2. Multiple variables are stored in one column (e.g. gender and age: m_25) 3. Variables are stored in both rows and columns
3 common problems in messy datasets
• typically organized in rows and columns • fits in spreadsheets / relational databases • possibility to start analysis without further preparation • data can be accessed through SQL (Structured Query Language)
Structured Data
= standard way of mapping the meaning of a dataset to its structure → Makes it easier for any researcher to extract needed variables/information
Tidy Data
- refers to a number of networked computers - Tasks: increasing computing capacity and increasing availability - Nodes: the computers contained in a cluster
Computer Cluster:
- Entropy and Information Gain - Gini Impurity - Variance Reduction
Selection of split variable (Decision Trees)
• If a variable has a missing value for a certain object, then that object is not used for determining the effect of that variable. • The object is used for estimating all other effects • Can be useful for first analyses Difference to listwise deletion: only the value that is missing is excluded, not the whole row; all observed data is included
'Traditional/naive' imputation methods: Pairwise deletion
analysis process: bottom-up approach; effective communication: top-down approach
Analysis process vs effective communication
• averaging the prediction over a collection of predictors generated from bootstrap samples (both classification and regression) For Dummies: - Create a number of subsets of the data (D1, D2, etc.) by randomly assigning data points to the subsets; data points can also be picked twice for a subset - Train a different new model on each subset -> generates m different models - Result: an ensemble of different models (instead of an ensemble of different algorithms) - Collect the outputs of each model and combine them into the FINAL RESULT 1) In the classification case: the classifier counts the 'votes' and assigns the class with the most votes to Y 2) In the prediction/regression case: bagging can be applied to the prediction of continuous values by taking the average of the predictions for a given test sample
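A minimal sketch of this bagging procedure using scikit-learn's BaggingClassifier on made-up toy data (library choice and data are illustrative assumptions, not part of the course material):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Toy data with known class labels
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# 50 bootstrap samples -> 50 models (decision trees by default);
# the final prediction is a majority vote (or the mean for regression)
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=1)
bag.fit(X_train, y_train)
print("bagging accuracy:", bag.score(X_test, y_test))
```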
Bagging
2 Dimensions: Predefined Data (data is already given) & Framed Problem (clear, measurable problem described) PD Yes X FP Yes: Problem solving: often consultancies (most promising) PD No X FP Yes: Data modelling: problem is clear, but no data -> find data/KPIs to solve the problem PD Yes X FP No: Collateral Catch: often a "nice surprise" PD No X FP No: Data mining: often a disappointment
Big Data Strategies/ Analytical Strategies
Boosting involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models mis-classified.
Boosting
Step 1: create weak classifiers › Start with uniform weighting › During each step of learning • Increase weights of the examples which are not correctly learned by the weak learner • Decrease weights of the examples which are correctly learned by the weak learner › Idea: • Focus on difficult examples which are not correctly classified in the previous steps Step 2: combine weak classifiers › Weighted voting › Construct strong classifier by weighted voting of the weak classifiers › Idea • Better weak classifier gets a larger weight • Iteratively add weak classifiers › Increase accuracy of the combined classifier through minimization of a cost function
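A minimal sketch of this reweighting-and-voting idea using AdaBoost from scikit-learn (an assumed library choice; the toy data is made up):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# AdaBoost: each round increases the weight of misclassified examples and
# combines the weak learners (decision stumps by default) by weighted voting
boost = AdaBoostClassifier(n_estimators=100, random_state=2)
boost.fit(X_train, y_train)
print("boosting accuracy:", boost.score(X_test, y_test))
```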
Boosting
The area between Q1 & Q3 (the IQR) is where the middle 50% of the data lies. Q2 is the median of the data, i.e. 50% of values lie above and 50% below this point. The length of the whiskers is at most 1.5 IQR, but can be shorter if the data doesn't reach that far. Data points beyond the whiskers are shown as individual points --> "outliers"
Boxplot - How to read it
Yes. For supervised learning we simply use the outcome of the system as the prediction of the dependent variable. For unsupervised learning we can assume that the prediction for the dependent variable is the dependent variable of the majority in the group, or the average dependent variable for continuous DVs
Can you use both methods (i.e. supervised and unsupervised learning) for predicting a dependent variable? Explain.
1. Causes that can be ignored: - MCAR: missing completely at random - MAR: missing at random 2. Causes that can't be ignored: - MNAR: missing not at random
Causes of Missing Data
● Supervised learning (dependent techniques) ○ Training data includes both input and desired results ("historical data") ○ For some examples, the correct results are known and used as input ○ Construction of a proper training, validation and test set is crucial ○ Examples: (logistic) regression, neural networks, SVM ● Unsupervised learning (interdependent techniques) ○ Model is not provided with the correct results during the training ○ Can be used to cluster the input data in classes, based on their statistical properties only ○ Examples: clustering, factor analysis
Classification of ML algorithms
- Most cases are never purely MAR/MNAR Main question: How inaccurate are the estimation outcomes (e.g. predictions, parameters) if we cannot account for the reason of the missing values in our model? - sometimes it is important to know the causes, if this influences the accuracy of the predictions - but often it is not, since the effect of inaccuracies on conclusions is minimal
Conclusion of causes of missing values
- also called data scrubbing - = the act of detecting and correcting (or removing) incomplete, incorrect, inaccurate, irrelevant records from a dataset.
Data Cleansing
› Often significantly better than a single classifier derived from D (whole dataset) › Improved accuracy in prediction when using this method has been proven › Requirement: unstable classifier types are needed • Unstable means: a small change to the training data may lead to major decision changes. • Decision trees are a typical unstable classifier
Evaluation & Requirements of Bagging
If an artificial data set is created with known relationships between the variables, the outcomes of analyses where observations are deleted from the data (i.e. missings are generated on purpose) can be compared to this known relationship to investigate the effect of missings on the outcomes of the analyses.
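A minimal sketch of this simulation idea in Python (made-up data with a known slope of 0.5; the MNAR deletion mechanism is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n)        # known relationship: slope = 0.5

# Generate missings on purpose (MNAR here: larger y -> more likely missing)
p_miss = 1 / (1 + np.exp(-(y - y.mean())))
observed = rng.uniform(size=n) > p_miss       # rows that remain after deletion

slope_full = np.polyfit(x, y, 1)[0]
slope_listwise = np.polyfit(x[observed], y[observed], 1)[0]
# compare both estimates to the known slope 0.5 to see the effect of the missings
print(slope_full, slope_listwise)
```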
Explain how you can use simulated data to investigate the effect of missing data on the outcomes on your analyses.
- Form of multiple imputation - Missing values are imputed by copying values from similar records in the same dataset • random hot-deck imputation: a value is chosen randomly (can be done per segment) -> the missing value is imputed by randomly assigning one of the observed values to it • sequential hot-deck imputation: data is sorted according to one or more auxiliary variables - each missing value is imputed with the value from the first following record that has an observed value • predictive mean matching (pmm): a form of nearest-neighbor hot-deck imputation - a model predicts the missing data point and the closest observed value is used for the imputation
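A minimal pandas sketch of random hot-deck imputation (the tiny income column is made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": [2300, np.nan, 4100, np.nan, 3200, 2800]})

# random hot-deck: impute each missing value with a randomly drawn observed value
donors = df["income"].dropna().to_numpy()
missing = df["income"].isna()
df.loc[missing, "income"] = rng.choice(donors, size=missing.sum())
print(df)
```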
Hot Deck Imputation
• Build your storyline as a pyramid: define the core message first • Be exhaustive • Be mutually exclusive • Apply this to any form of (business) communication: text, presentation in PowerPoint, video, etc.
How to build a storyline?
○ Active controls, such as requiring additional user authentication or blocking data transfers
How to cope with privacy and security issues?
Potential solution: trimming -> cutting off x% of values (e.g. 5%) on both sides of the distribution --> This gets rid of outliers which exert extreme influence (outlier values weigh in quadratically) on e.g. the mean BUT: check if outliers make sense first, maybe there is a reason they are there (e.g. high sales because of a special promotion or because the competition ran out of stock)
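A minimal sketch of trimming with SciPy's trim_mean (the sales figures and the 12.5% cut are made up for illustration):

```python
import numpy as np
from scipy import stats

sales = np.array([12, 14, 13, 15, 11, 13, 14, 250])   # 250 = extreme outlier

print("mean:", sales.mean())
# trimmed mean: cut off 12.5% of values on each side of the sorted distribution
print("trimmed mean:", stats.trim_mean(sales, proportiontocut=0.125))
```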
How to deal with outliers?
• an image is classified according to its visual content (e.g. to identify consumers' response to a product) • applied e.g. for quality control, measuring consumers' emotions • deep neural networks can provide good results for visual recognition and classification • BUT: pattern recognition algorithms can be easily fooled • Other example: analysis of the effect of car design on sales Used tools: o Image morphing: computer-generated variations of car designs o Grid analysis: measure consistency in car designs by comparing pixels of car images o Algorithmic feature point detection: automatic detection of car design features
Image Classification
Merging data results in the creation of one new type of information: implied information, derived from declared (name, address, birthdate etc.), appended (payment status, billing information etc.) and overlaid information (age class, income class, household size, profession etc.) Implied data: - not collected, but calculated/implied - not always correct, but has a high value to the firm - is created by combining declared, appended, and overlaid information - e.g. CLV, relationship length, churn probability ● Declared: provided by the customer to the firm ● Appended: added after customer contact ● Overlaid: data from external providers that can provide additional information about objects/customers in the database that cannot be inferred from variables in the database; categorization ● Implied: calculating and combining variables into new ones
Implied Data
Process of replacing missing data with substituted values ->replaces missing data with estimated values based on other information
Imputation
› MAR methods (MI, EM, FIML) • Are ALWAYS at least as good as, • and are mostly better than the 'traditional / naive' methods (e.g., listwise deletion) › Methods for MNAR are NOT always better than MAR methods
Imputation methods: 'traditional/naive' vs. MAR vs. MNAR
... separate measurements of the effect of KPIs into integral measurements, which explain and forecast market, customer and brand KPIs to increase performance ... data silos into a centralized, integrated data environment ... analyzing averages of all customers into de-averaged target groups
In order to create value from merging data source, one has to transform...
No. Analyses that deal with pure MNAR data should control for this type of missings. MI is suited for MAR data. P(missing) depends on the value of the variable itself, hence it cannot be used for imputing. However, in practice, almost never purely MNAR. In those situations, it might be a good idea to use MI
Is multiple imputation a good approach for dealing with data that is Missing Not At Random? Explain your answer
● Data cleaning: missing values, noisy data, and inconsistent data › ● Data integration: merging data from multiple data stores › ● Data selection: select the data relevant to the analysis › ● Data transformation: aggregation (daily sales to weekly or monthly sales) or generalization (street to city; age to young, middle age and senior) › ● Data mining: apply intelligent methods to extract patterns › ● Pattern evaluation: interesting patterns should contradict the user's belief or confirm a hypothesis the user wished to validate › ● Knowledge presentation: visualization and representation techniques to present the mined knowledge to the users
KDD Process
• implicit (by contrast to explicit) • valid (patterns should be valid on new data) • novel (novelty can be measured by comparing to expected values) • potentially useful (should lead to useful actions) • understandable (to humans) patterns in data
Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying patterns in data, which are ...
• classifies words into different linguistic (verbs, pronouns etc.) or psychological (affect, cognition; work, home etc. ) categories, that tap social, cognitive and affective processes • computes the % of words in a text that are in each of these categories Can be used e.g. to measure the impact of consistency in advertising on long & short-term market share
LIWC
+ ability to handle big data + high flexibility with non-linear models + real-time updating - takes long to train - stopping criteria influences result - effects are not interpretable
List 3 advantages and 3 disadvantages of machine learning algorithms from your perspective.
› Probability that a value is missing depends on variables that are available in our data set › Example: • Observations for 'Age' are complete • Elderly people more often did not answer the question about household size - > p(x is missing) depends on y (observed variables) -> Since imputation estimates missing values based on information given (observed variables), no bias is created if imputation is based on pure MAR-relationships
MAR
› MCAR 1: • Reason for missing completely random › MCAR 2: • Reason for missing is unrelated to variables in the analysis • Example: - favorite soccer club -> p(x is missing) = 0.1 for all observations -> No biases if we analyze without controlling for the cause of missing values
MCAR
- MI generates multiple predictions for one missing value, by running multiple imputations --> the mean of all predicted values is then used as the final predicted/imputed value --> MI = running multiple (m) single imputations
MI - How does it work?
- Continuous DV x Unsupervised: Clustering and Dimensionality Reduction - Continuous DV x Supervised: Linear Regression, Decision Trees, Random Forests - Categorical DV x Unsupervised: Association Analysis - Categorical DV x Supervised: Trees, Logistic Regression
Machine Learning Algorithms
- data-driven procedure - two basic steps: 1. Deal with missing data (e.g. replace with plausible values) 2. Analyze as if there were no missing values › Basis: regression-based single imputation › "Drawback" of this method: imputed values are always precisely on the regression line › 'Real data' are never all on a straight line --> restore the 'lost error'
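A minimal sketch of the MI idea (m imputations with added noise to "restore the lost error", then pooling) using scikit-learn's experimental IterativeImputer; the data, the choice of m and the pooled statistic are illustrative assumptions:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[rng.uniform(size=X.shape) < 0.1] = np.nan     # ~10% missing values

# m single imputations; sample_posterior=True adds noise so imputed values
# are not all exactly on the regression line
m = 5
imputations = [
    IterativeImputer(estimator=BayesianRidge(), sample_posterior=True,
                     random_state=i).fit_transform(X)
    for i in range(m)
]

# pool an estimate across the m completed data sets (here simply column means)
pooled_means = np.mean([imp.mean(axis=0) for imp in imputations], axis=0)
print(pooled_means)
```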
Multiple Imputation (MI)
Difference between % of Promoters (would recommend brand/company to friends/family) and % of Detractors (wouldn't recommend brand/company to friends/family); the higher the better (max +100; min -100)
Net Promoter Score
+ prediction accuracy generally high + robust (also when training data contains errors or noisy data (e.g. distorted data)) + fast evaluation of the learned target function - Long training time - difficult to understand the learned target function - Not easy to incorporate domain knowledge
Neural Network Advantages and Disadvantages
● Like human neurons, each has an activation function, i.e. if the input is above a threshold value x → activation, if below → no activation ● Problem with the threshold point: if a change in an input number is very small but passes the classification threshold, then the prediction is completely different ● → The sigmoid activation function is a curve instead of a cutoff point (similar to logit)
Neural Networks
performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling - OLAP operations include roll-up (increasing the level of aggregation) and drill-down (decreasing the level of aggregation or increasing detail) - OLAP data is often combined with additional external information, e.g. weather records for sales of beachwear
OLAP (Online Analytical Processing)
refers to the relationship between two entities (see also entity-relationship model) A and B in which an element of A may be linked to many elements of B, but a member of B is linked to only one element of A. E.g. Customer and Order table: One customer can place multiple orders but one (specific) order can only be placed by one customer
One-to-many relationship
covers order entry and banking transactions: focus on detailed information, security and maximum processing volume of requests
Online transaction processing (OLTP)
5 questions to ask for opportunity finding: - Who - What - Where - When - Why Steps: 1. Define goal (e.g. increase revenue of beachwear) 2. Find drivers (e.g. age) 3. Quantify them using analytical tools (e.g. starters can increase revenue by 6 mln) 4. Take action (e.g. increase online ads for this target group)
Opportunity tree
Overfitting: the model learns the training set too well -> performs poorly on the test set Underfitting: the model is too simple, both training and test error are large
Over- & underfitting
- no conclusions - irrelevant insights - not precise enough (too long, too many (statistical) details)
Pitfalls of storytelling by data scientists
- can lead to bias - can create inaccurate predictions - can create problems with analysis --> imputation as a tool to avoid listwise deletion of cases
Problems of missing data
- Accuracy (Measurement errors, Validity of measurements, ...) - Consistency (Ranges logical?) - Completeness (missing variables/ observations)
Problems with Data Quality:
• ... is a language • Purpose: defining, managing and manipulating databases; helps you to specify what data you want to extract • ... is everywhere: many websites are dynamically generated from database software
SQL
1. Data Definition Language (DDL) • DDL is the part of the database language that is used to describe, change and delete data structures & related elements • All instructions to build and maintain the database structure • Used by database developers • Example commands: CREATE TABLE; CREATE VIEW 2. Data Control Language (DCL) • Used to grant or withdraw permissions • All instructions for maintaining the database • Used by administrators • Example command: GRANT 3. Data Manipulation Language (DML) • Used to read, write, change and delete data • All instructions to manipulate data • For end-users (e.g. those that need to analyze data) • Example commands: SELECT; INSERT; UPDATE; DELETE
SQL consists of 3 components of database language
If a spreadsheet is used to keep track of orders and the corresponding customer details, multiple orders of the same customer require repeated storage of the customer details, as a new order requires a new line with the specific order details. In a relational database, an order is linked to a unique customer id, and the customer details are stored once in a separate table.
Show with an example that a database can be used to avoid redundancies in the data
1. Analytical capabilities 2. Data & Tools (Statistical tools e.g. R, SPSS; Database Tools e.g. SQL) 3. Business Sense 4. Communication & Visualization
Skills Data Scientists unite
• Is an open-source cloud-computing framework • Provides an interface for programming entire clusters • Executes job by giving tasks to different "slaves" (other computers), which in turn provide results • Consists of different items, e.g. SparkSQL (enables to run SQL commands), Spark Streaming (enables processing of Data Streams), MLlib (library for Machine learning Algorithms)
Spark
RAW DATA ---type checking, normalizing---> TECHNICALLY CORRECT DATA ---fix and impute ---> CONSISTENT DATA ---estimate, analyze, derive etc. ----> STATISTICAL RESULTS ----tabulate, plot ---> FORMATTED OUTPUT
Statistical analysis value chain (De Jonge & van der Loo)
1. Big Data Assets --> possible problem: lack of integration of the flood of data 2. Big Data Capabilities (people, systems, etc.) --> capabilities not in place 3. Big Data Analytics --> which analytics to use? 4. Big Data Value --> how to create and measure value?
Steps in Value Creation Model and possible problems
Linear SVM: maximize the width of a linear classifier (margin) between two support vectors (extreme data points of different classes, which are closest to each other) --> looks only at those extreme data points, doesn't really take into account the other points that lie behind them Requirement: training data is necessary, incl. data for which the classes they belong to are known --> same technique for non-linear SVM (maximize the margin between the support vectors, just with a circle or another non-linear boundary instead of a straight line)
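A minimal sketch of a linear SVM with scikit-learn (the toy blob data is an assumption; switch the kernel for the non-linear case):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy training data with known class labels (labels are required for SVMs)
X, y = make_blobs(n_samples=100, centers=2, random_state=3)

svm = SVC(kernel="linear", C=1.0)     # use kernel="rbf" for a non-linear boundary
svm.fit(X, y)
# only these extreme points define the maximum-margin boundary
print("support vectors per class:", svm.n_support_)
```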
Support Vector Machines
- use pictures/videos instead of text to create maximum impact
Sweet spot of data, story, visual
• 'Automated version' of what market researchers have been doing for a long time: interpreting (semi-)unstructured texts • E.g. coding interviews, open questions in surveys • Transforms unstructured information into structured data o further statistical analysis using traditional marketing techniques (e.g. (logistic) regression, factor analysis, cluster analysis, etc.) • Example of a tool: Linguistic Inquiry and Word Count (LIWC)
Text analytics
○ Forest-RI (random input selection): randomly select, at each node, m attributes as candidates for the split at the node. The CART methodology is used to grow the trees to maximum size ○ Forest-RC (random linear combinations): creates new attributes (or features) that are linear combination of the existing attributes (reduces the correlation between individual classifiers)
Two methods to construct random forest:
• not directly suited for analyses • require structuring before they can be stored in a table or in a spreadsheet • subsequently, existing analysis techniques can be used o e.g. (logistic) regression, tree methods, or machine learning techniques such as neural networks, support vector machines, etc. • Unstructured data is relevant since it is hardly analyzed; however, only a small part will be used for analysis • huge amounts of it out there • requires much data storage, e.g. due to 'rich content' data (containing pictures and videos) (<-> structured data can often be stored more efficiently, e.g. as ASCII text) Unstructured data is not always completely unstructured (e.g. email: receiver & sender address and the text of the email are given, but further interpretation is needed to analyze the data, e.g. the sentiment of the email (negative sentiment -> refusal/denial))
Unstructured Data
• GROUP BY: o Groups records into one summary row o Returns one record for each group o Often involves aggregate functions: COUNT, MAX, SUM, AVG etc. o Can group by one or more columns • WHERE: o Aggregate functions are not allowed in WHERE; HAVING needs to be used for that instead • Aggregate functions are e.g. COUNT(..), MAX(..), SUM(..), etc.
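A minimal sketch of GROUP BY and HAVING, run through Python's built-in sqlite3 module on a made-up orders table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.0), (3, 2, 10.0), (4, 3, 50.0);
""")

# One summary row per customer; HAVING filters on the aggregate (not allowed in WHERE)
rows = con.execute("""
    SELECT customer_id, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    HAVING SUM(amount) > 25
""").fetchall()
print(rows)   # e.g. [(1, 2, 55.0), (3, 1, 50.0)]
```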
Useful things to know about commands:
Fatal Attraction: - High V2C, Low V2F - High delivering & low extracting firm --> Happy customers, but low profit (e.g. Zalando, Snapchat) Win/Win: - High V2C, High V2F - High delivering & extracting firm (e.g. Apple) Doomed to fail: - Low V2C, Low V2F - Low delivering & low extracting --> Companies who fail to create value for customer and for the firm (e.g. Startups) Enjoy while it lasts - Low V2C, High V2F - Low delivering & high extracting ( e.g. banks, insurances)
V2C & V2F model
Outliers are extreme values, unlike the rest of the sample, which without special treatment could lead to over-estimates.
What is an outlier?
Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files. Illustration: identify search term in clickstream data
What is parsing?
Supervised learning has a dependent variable with which you can compare the outcome of your system. In the unsupervised learning we don't make use of a dependent variable. The classification is done by finding similarities among the explanatory variables.
What is the main difference between supervised and unsupervised learning?
● How to determine whether a customer likes a product or not ○ To what extent is this based on a temporary trend or permanent preferences ● Use category analysis to suggest similar products ● Use visual analysis to suggest similar color schemes in other product categories ● → suggest products with a high likelihood to buy ● → customize the website with preferred product categories • Result: showing very similar products • How does it work? o Feature extraction (e.g. color?) o Calculate distances (how similar?) o Visual matching service (return most similar picture) o Apply (integrate in customer journey, test and improve)
deep learning:
○ Prediction tools: use some variables to predict unknown or future values of other variables (e.g. Classification, Regression) ○ Description tasks: Find human-interpretable patterns that describe the data (Clustering)
● Data Mining Tasks can be divided in two groups:
A database is a collection of information that is organized so that it can easily be accessed, managed, and updated. It consists of multiple tables: 'spreadsheets' with columns (fields) and rows (records)
Definition Database
Extract - Extraction of selected, relevant sources Transform - Processes of cleaning, unifying, translating, calculating etc. Load - Loading the processed/ transformed data in the intermediate environment
Extraction of data - ETL Process
Allows the researcher to make use of all information in the data Often: - Imputation only of IVs --> biased estimates - Imputation of both IVs & DVs --> unbiased estimates
What does imputation do?
1. Market level: V2C: - Product Awareness - Product Attractiveness - Product Uniqueness V2F: - Market volume/ size - Market growth - Number of competitors 2. Brand level: V2C: - Brand awareness - Brand consideration - Brand preference V2F: - Brand sales - Brand market share - Brand Equity 3. Customer Level V2C: - Customer Satisfaction - Net Promoter Score (NPS) V2F: - Customer Lifetime Value - Marketing ROI --> Metrics is a good starting point for Data Integration/ developing strategies to deal with big data, as it provides guidance to identify which data sources can be used/ which KPIs could be interesting
3 levels of V2C & V2F
1. spreadsheets do not automatically sort all columns 2. Redundancy of information: Redundancy in a spreadsheet can be reduced by a database, since a spreadsheet assumes a 1-1 relationship (same row). If the data consist of 1-many relationships, use of spreadsheets leads to redundancy since it requires that the same value needs to be stored multiple times. 3. Maintaining data (e.g. updating customer info) is easier in a database, since data are not stored multiple times
3 reasons why organizations prefer storing data (on customers, orders and products) on databases, and not in spreadsheets
- pattern recognition - Interpretation of analysis output - storytelling
3 ways to apply visualization
○ Filter: subsetting or removing observations based on some condition. ○ Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume). ○ Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means). ○ Sort: changing the order of observations
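A minimal pandas sketch of these four steps (the toy sales table and the revenue variable are made up for illustration):

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "units": [10, 3, 5, 8, 2],
    "price": [2.0, 2.0, 2.5, 2.5, 2.5],
})

tidy = (
    sales[sales["units"] > 2]                                  # filter: subset observations
    .assign(revenue=lambda d: d["units"] * d["price"])         # transform: add a variable
    .groupby("store", as_index=False)["revenue"].sum()         # aggregate: collapse to one value
    .sort_values("revenue", ascending=False)                   # sort: change the order
)
print(tidy)
```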
4 fundamental steps to make data tidy
Volume, Velocity (speed), Variety, Veracity, Value
5 V's of Big Data
1. Situation (Describe it, not controversial or raising discussion, recognizable for audience) 2. Complication (This is the problem or challenge to be solved/ addressed, clearly underpinned with arguments/ facts) 3. How to achieve? / Is this correct (i.e. will our results support this)? 4. Key Message (The answer to the complication, should create curiosity, only one single message!) 5. Which/ What/ Why? 6. Statements (Further breaking down the message, maximum of 3 submessages)
Building blocks for a clear storyline
A. Simple (how many variables are included and how they are related -> parsimoniousness) B. Evolutionary (start with a simple model and add more to it) C. Complete (no variables should be left out that have a significant influence) D. Adaptive E. Robust (adding or removing data points shouldn't make the model crash) -> A & C are contradicting: preference for 'parsimonious' models • Parsimonious: the explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory (get rid of variables that don't generate additional information)
Criteria for a good model
a copy of a subset of the data of a data warehouse (DW) that is created for a specific organizational unit or a specific application or analysis + fast to roll out - potential problems of data integration
Data Mart
1. Single-Source Problems a) Schema level (lack of integrity constraints) -> e.g. uniqueness b) Instance level (data entry errors) -> e.g. misspellings, duplicates 2. Multi-Source Problems a) Schema level (heterogeneous data models and schema designs) -> e.g. naming conflicts (same name used for different objects) b) Instance level (overlapping, contradicting and inconsistent data) -> e.g. inconsistent aggregating
Data Quality Problems in Data sources
storage place of data that remains under the control of one department and is isolated from the rest of the organization
Data Silo
- A data warehouse is a central database, optimized for analysis purposes. It unites and compresses data from multiple, usually heterogeneous sources Why data warehouses? -> Enable knowledge workers to make better and faster decisions - Data from external sources is extracted, transformed, loaded & refreshed into the data warehouse
Data Warehouse
› At each split a variable is selected as the basis on which the decision of splitting is made -> Selection of this split is based on the concept of entropy (mean information content) and information gain • Gini impurity (minimize misclassification) • Variance reduction (total reduction of the variance of the target variable) -> improving group homogeneity by comparing purity before and after the split › Goal: best classification with a minimum number of branches / technique to generate groups which are as homogeneous as possible within and as heterogeneous as possible between each other Evaluation: + interpretation is very simple and easy to follow -> used in practice + good performance on Big Data + approximates nonlinearity with linear combinations of predictors -> very flexible + uses simple rules to relate the outcome and the predictors - the prediction accuracy is relatively low - they can be very non-robust --> Again a trade-off between error "acceptance" (underfitting) and complexity (overfitting) has to be made
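A minimal sketch of how entropy and information gain score a candidate split (the parent/child label vectors are made up for illustration):

```python
import numpy as np

def entropy(labels):
    """Mean information content of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# Parent node with 10 observations, one candidate split into two child nodes
parent = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
left, right = np.array([1, 1, 1, 1, 0]), np.array([1, 0, 0, 0, 0])

# Purity before vs. weighted purity after the split; the variable with the
# largest information gain is selected as the split variable
weighted_child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted_child
print(round(info_gain, 3))
```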
Decision Trees
Boosting: - Models are built sequentially on modified versions of the data - Predictions of the models are combined through a weighted sum/vote - The boosting algorithm can be extended for numeric prediction - Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data Bagging (& Random Forest): - independent selection of data
Differences Bagging and Boosting
1. Star schema: - used by most data warehouses - one single fact table (with all IDs) and one table for each dimension (i.e. order, customer, salesperson) with one ID -> there is a single fact table, and all other tables are directly related to this fact table - does not provide the possibility to show attribute hierarchies/sub-attributes + Simplicity! Explanation: a star schema consists of a fact table that is directly linked to one or more tables that contain the details for the dimensions that are present in the fact table. This is a simple setup that is easy to implement and to explain. However, if the dimensions contain dimensions themselves, this leads to data redundancies. A star schema is not necessarily fully normalized. 2. Snowflake schema: - hierarchy of data/subcategories can be shown with the snowflake schema - e.g. fact table (middle) <- ProdNo (1st level) <- CategoryName (2nd level) -> not all tables relate directly to the fact table; some link to other tables + no data redundancy! Explanation: in a snowflake schema, if a dimension contains dimensions, these are specified in separate tables, leading to a nested structure of dimensions without data redundancy (a snowflake schema is fully normalized) Difference: star schema: all tables relate directly to the fact table; snowflake schema: not all tables relate directly to the fact table - this reflects the hierarchy in the dimensions.
Different ways of illustrating relationship between tables
ML: More heuristic ("shortcut to find satisfactory, but not necessarily optimal solution") Focus on improving performance of a learning agent + Predictive analysis Supervised and/ or unsupervised learning techniques Also looks at real time learning (DM not!) DM: Integrates theory and heuristics Focus on entire process of Knowledge Discovery, including data cleaning, learning, and integration, as well as visualization of results + focus on discovery of new patterns (exploratory data analysis) Unsupervised learning technique --> Many algorithms can be used for both purposes
Distinction ML and Data Mining:
› May lead to biased outcomes › Always leads to loss of statistical power (data is excluded - and hence, information is not used) › Reasonable approach if not more than 5% of the data points are lost › Larger percentages: often considerable loss in statistical power
Drawback of analysis of Complete Cases (listwise deletion)
The idea is to use multiple models to obtain better predictive performance than could be obtained from any of the individual models Popular ensemble methods: › Bagging: • averaging the prediction over a collection of predictors generated from bootstrap samples (both classification and regression) › Boosting: • weighted vote with a collection of classifiers that were trained sequentially from training sets giving priority to instances wrongly classified (classification) › Random Forest: • averaging the prediction over a collection of trees split using a randomly selected subset of features (both classification and regression) › Ensemble learning is a process that uses a set of models, • each of them obtained by applying a learning process to a given problem. › This set of models (ensemble) is integrated in some way to obtain the final prediction. › Aggregation of multiple learned models with the goal of improving accuracy. › Intuition: simulate what we do when we combine an expert panel in a human decision-making process
Ensemble learning
- data-driven procedure - two basic steps: 1. Deal with missing data (e.g. replace with plausible values) 2. Analyze as if there were no missing values › Alternate between: • E-step: predict missing data • M-step: estimate parameters (means, (co)variances) -> Runs until no major improvement from doing E, M, E, M, etc. can be observed anymore › Excellent parameter estimates › No standard errors! • Bootstrapping (independent resampling) • or multiple imputation
Expectation Maximization (EM) algorithm
TDL: only measures the ratio of the probability(y=1) in the highest 10% of the predicted outcomes compared to the rest - so to speak, how well you can predict the 1's. GINI: on the other hand is a measure of how well the 1's are predicted across all prediction deciles.
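A minimal sketch of computing TDL from predicted probabilities (the toy churn labels and predictions are made up; top_decile_lift is a hypothetical helper, not a standard library function):

```python
import numpy as np

def top_decile_lift(y_true, p_pred):
    """Churn rate in the 10% with the highest predicted probability vs. overall churn rate."""
    order = np.argsort(p_pred)[::-1]                  # sort by predicted probability, descending
    top = y_true[order][: max(1, len(y_true) // 10)]  # highest decile
    return top.mean() / y_true.mean()

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.2, size=1000)                           # 1 = churner
p = np.clip(y * 0.3 + rng.uniform(size=1000) * 0.7, 0, 1)     # toy, somewhat informative predictions
print("TDL:", round(top_decile_lift(y, p), 2))                # ~1 = random, ~2 = twice as good
```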
Explain the difference between Top Decile Lift (TDL) and the Gini-coefficient?
What would you like to show? 1) relationship between data points: - Scatter plot (relation between 2 variables) - Bubble chart (relation between 2 variables and size of observation) - World map (relation between geography and DV) - Network (relation between objects/individuals) 2) comparison of data points: - Column chart (vertical) - Bar chart (horizontal) (limited nr of subunits) - Bullet chart (complex comparisons, e.g. incl. target) - Line chart 3) composition of data: - Pie chart (useful with limited items) - Stacked chart (multiple subunits for multiple groups) - Waterfall chart (also suitable for negative values!) - Word cloud (for text analysis) 4) distribution of data: - Column histogram (few categories) - Line chart (many categories) - Lift chart (especially for predictive modelling)
How to choose the right chart
- Use descriptives/graphical summaries - Critical values for boxplots: x is less than Q1 - 1.5 x IQR or x is greater than Q3 + 1.5 x IQR -> this, however, is an arbitrary definition: not all outliers are captured by it and not all values flagged by it are outliers
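A minimal sketch of the 1.5 x IQR boxplot rule in NumPy (the sample values are made up for illustration):

```python
import numpy as np

x = np.array([12, 14, 13, 15, 11, 13, 14, 250])
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Values beyond the whisker limits are flagged by the boxplot rule
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers)   # flags 250; inspect before removing it
```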
How to detect an outlier?
• Personalization: o Improved customer experience by adding personalized content based on data and algorithms o generate more gross demand o BUT: respect the privacy of the customer --> Recommendations, content personalization, browse & search (help customers find the most interesting products) • Optimization: - smart usage of data to make better decisions → improve effectiveness • Pricing: o Intelligent pricing of articles by applying price optimization algorithms o Align prices with customers' willingness to pay
How to make money with Analytics
Integration at the individual level: This means that the individual (or read "household" for "individual") is identified in every data source and that we start linking the different data sources using the keys identifying the individual: the keys for combining two sources should correspond. Let's assume in our example that we want to integrate the customer data with the online data. In the customer data, the individual can be identified by a customer id, or for example a combination of address plus birth date. Integration at the intermediate level: uses a segmentation based on dimensions that can be identified in all sources to be integrated. The segmentation then becomes the common denominator to which all data should be aggregated for every time period. In our example of the insurance company, we could define the dimensions as age and income. Classifying each of these two dimensions in 5 classes would result in 25 segments to be identified in every data source per time period. Integration at the time level: the third option for data integration is the least advanced. In this option data will be aggregated to the time period that can be identified in the data sources and the time axis will be the dimension on which to compare the different data sources.
In Section 3.1 of Verhoef et al. (2016), different options for integrating data sources with different aggregation levels are discussed. Discuss these options, and give an example of each of the options.
› Probability that a value is missing depends on the value itself • Even after taking the values of the other observed variables into account › Example: income --> People who earn more are less likely to fill in the answer to this question -> p(x is missing) depends on x -> Imputation might make it worse in some MNAR cases; BUT: imputation methods often still perform better than traditional/naive approaches, and can be used if the MAR assumption is not satisfied; Also: there are often multiple causes for missing values -> e.g. partially MAR, partially MNAR › Was the example on the previous slide pure MAR? • Is p(missing) not depending on the size of the household?
MNAR
- supporting architecture for Big Data -> enables processing large datasets across a cluster of computers - breaks down complex big data problems/jobs into small units of work, which can be processed in parallel --> higher efficiency Involves 2 steps: 1. Map step: on the master node the data is divided into many smaller problems 2. Reduce step: analyzes and merges the input data from the map steps --> powerful parallel programming technique for distributed processing of big data files on clusters
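A minimal word-count sketch of the map and reduce steps in plain Python (single-machine illustration only, not an actual Hadoop job):

```python
from collections import Counter
from itertools import chain

documents = ["big data on a cluster", "map step and reduce step", "big cluster"]

# Map step: each "node" turns its chunk of data into (word, 1) pairs
mapped = [[(word, 1) for word in doc.split()] for doc in documents]

# Shuffle + reduce step: merge the intermediate pairs and sum the counts per key
reduced = Counter()
for word, count in chain.from_iterable(mapped):
    reduced[word] += count
print(reduced.most_common(3))
```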
MapReduce
"Machine Learning is concerned with computer programs that automatically improve their performance through experience. "
What is Machine Learning?
- supervised ML method, capable of classification (discrete) and regression tasks (continuous) - creates a number of decision trees; the higher the number of trees, the higher the robustness & accuracy - grow multiple trees -> use different, randomly selected subsets of data -> use different (randomly selected) variables (creation of random trees = random forest) How to benefit from this? - most trees can provide a correct prediction of the class for most of the data - the trees make mistakes at different places (weigh each other out) Result: when voting is conducted for each of the observations and the class of each observation is then decided based on the results of the random forest, it is expected to be closer to the correct classification + can be used for classification and regression tasks + can handle a high number of missing values & maintain accuracy + more trees prevent overfitting of the model + comparable in accuracy to boosting, but more robust to errors and outliers + insensitive to the number of attributes selected for consideration at each split + faster than bagging or boosting - good job at classification, but rather poor job at regression - little control over what the model does -> black box Applications: - finding loyal customers - identifying the likelihood of a customer liking a recommended product, based on similar kinds of customers - image classification - voice recognition
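A minimal sketch of a random forest with scikit-learn on made-up toy data (library and parameter choices are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=12, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

# Many trees, each grown on a bootstrap sample with a random subset of
# variables considered at each split; prediction by majority vote
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=4)
forest.fit(X_train, y_train)
print("random forest accuracy:", forest.score(X_test, y_test))
```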
Random Forest
Evaluation of texts with the goal of recognizing an expressed attitude as positive or negative
Sentiment Analysis
› Parsing › Correcting › Standardizing › Matching › Consolidating ● Parsing: locates and identifies individual data elements in the source files and then isolates these data elements in the target files ○ E.g. extracting a search term from clickstream data, converting a business card into a digital contact ● Correcting: corrects parsed individual data components using sophisticated data algorithms and secondary data sources ○ E.g. misspellings in search terms ('vanillia', 'vanilla', 'vanilloa' might all refer to the brand Vanilia), correcting wrongly parsed business card information ● Standardizing: applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules ○ E.g. adapting job titles according to a category, using country abbreviations rather than full names, always 'Ms.', matching 'Beth' with names like 'Elizabeth' or 'Bethany' ● Matching: searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications ○ E.g. matching phone numbers to people based on the same name and address in previous records; with different names in different sources ('Beth' vs. 'Elizabeth') you have to decide which one to use ● Consolidating: analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation → combining two data sources into one
Steps in cleansing technically correct data
Pyramid: 1. Audience (Determines the level of detail/how much you should go into detail in your motivation and proof) 2. Situation 3. Complication 4. Conclusions/ Key Messages (2-4: Part of Management Summary) 5. Motivation 6. Proof ( Chapters of your presentation; Use analysis as proof for motivation and conclusions)
Structure of your presentation
1. Filtering • Signal-to-noise ratio is low in many unstructured data applications 2. Structuring • Before they can be used in further analyses, unstructured data need structuring 3. Matching to data from other sources • Combination of data from various sources (structured and unstructured) lead to better insights 4. Companies lack skills to generate insights from unstructured data • Lack of Data Scientists
Structuring unstructured data (Short Version)
A normal join creates a subset of two tables, only selecting rows from both tables for values of the index that specifies the join that are present in both tables. In a left join, it selects all values of the index that specifies the join of the first table in the join statement. If the second table in the join statement does not contain some of the values of this index, a NULL value is reported for the columns of the second table that are selected in the output of the query. Example: suppose we have a table with customer details and a order table. If not all customers placed an order in a certain period that is specified in the query, the output of a query that contains a left join on the customer table and the order table will report all customers that are present in the customer table, and report NULL values in the output for those columns that are selected from the order table.
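A minimal sketch of the customer/order example as a LEFT JOIN, run through Python's built-in sqlite3 module (table contents are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Anna'), (2, 'Ben'), (3, 'Cem');
INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 35.0), (12, 3, 50.0);
""")

# LEFT JOIN keeps every customer; Ben has no order, so his order columns are NULL (None)
rows = con.execute("""
    SELECT c.name, o.order_id, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
""").fetchall()
print(rows)   # e.g. [('Anna', 10, 20.0), ('Anna', 11, 35.0), ('Ben', None, None), ('Cem', 12, 50.0)]
```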
What is a LEFT JOIN in SQL? Illustrate with an example when you would use a LEFT JOIN
1. Unstructured data require filtering • Signal-to-noise ratio is low in many unstructured data applications • Computer algorithms are necessary for filtering relevant content 2. Unstructured data require structuring • Before they can be used in further analyses, unstructured data need structuring • Classifying or grading unstructured data is a form of adding 'meta data' (= data about data) to this data • Can be relatively easy (the complexity of a document can be measured by byte count) or complex (recognizing facial expressions) • Structuring data in marketing: unstructured data in marketing like interviews, focus groups, surveys with open-end questions used to be structured by hand; however, this isn't feasible for large volumes, is expensive, etc. Solution: coding by computers. Computers need programming/training before they are able to correctly structure and/or interpret unstructured data 3. Unstructured data need to be matched to data from other sources • Combination of data from various sources (structured and unstructured) leads to better insights • Can be easy if one unique key is given that is present in every data set that needs to be merged (e.g. CustomerID) • In unstructured data, however, this is often not the case • Then matching needs to be done in a more difficult way, e.g. matching based on zip code and age -> mismatches more likely Consequence: computer-aided filtering, structuring and matching is likely to be noisy • Errors in selection of data points • Misclassification of meaning • Errors in matching data sources Many applications assume that the sheer amount of data cancels the errors out 4. Companies lack skills to generate insights from unstructured data • Lack of data scientists, who combine the following skills: • analytical skills to analyze the data • business sense to translate the analysis outcomes into insights • skills to generate business impact based on the insights
Structuring unstructured data (long version)
2 Dimensions: 1. Data Type: Structured & Unstructured 2. Data Source: Internal & external Structured x External: - Public Data - ZIP-code Data - Market Research Data Unstructured x External: - Social Media ( Google, FB etc) -> e.g. product reviews on amazon - Blogs Structured x Internal: - CRM data - Sales Data Unstructured x Internal: - Customer Contact Data - Mobile Data
Types of Data
Exploration & analysis of big data in order to discover meaningful patterns -> Find patterns which are implicit, novel, valid and potentially useful • Data Mining is one step in Knowledge Discovery in Databases (KDD)
What is Data Mining?
Imputation of missing data on a variable is replacing that missing by a value that is drawn from an estimate of the distribution of this variable. In single imputation, only one estimate is used. Advantage: easy to implement. Disadvantage: it ignores the uncertainty in the estimation of this distribution ● Sophisticated procedure where the imputation of the missing variable is based on various other, but known, characteristics of the subjects ● The missing value will be estimated based on other variables ● Single, because each missing is only imputed once ● The estimated distribution of results will never be identical of the true distribution, but at least leads to an unbiased estimation of the population distribution
What is single imputation? Discuss its advantages and disadvantages.
Velocity, Volume & Variety make Big Data so different from the data that was known/given until then: - huge volume - high variety of data (different datasets) - high speed of data (real-time customer data) • Increase of consumer data due to increasing online activities; data increases faster than storage possibilities • Elicits the new challenge of data storage; data strategies are required • ICT (Information and Communication Technologies) enables companies to store more data in integrated ERP (Enterprise Resource Planning) and CRM systems
What makes Big Data different to data that was given until then
- different names of the same attribute you want to merge datasets on (e.g. A.customer_id ≡ B.customerID) - look for data value conflicts, e.g. different scales - aggregation level might be different (session level vs. daily level)
What to keep in mind when merging data?
1. GINI coefficient: - focuses on overall performance of the model - the higher, the better; max: 1, min: 0 2. Top Decile Lift: - focuses on predicting the DV (churners) - TDL = group of people with the highest probability to churn • TDL approx. 1: model is not better than random selection • TDL approx. 2: model is twice as good at predicting churners - the higher the better 3. Look at calculation time -> the smaller the better BUT: % correctly classified cannot be compared across methods
What to take into account when judging ML method?
• Develop systems that can automatically adapt and customize themselves to individual users. o Personalized news or mail filter • Discover new knowledge from large databases (data mining). o Market basket analysis (e.g. diapers and beer) • Ability to mimic humans and replace certain monotonous tasks which require some intelligence o like recognizing handwritten characters • Develop systems that are too difficult/expensive to construct manually because they require specific detailed skills or knowledge tuned to a specific task (knowledge engineering bottleneck)
Why Machine Learning?
Because every data set has its own naïve model hit rate (HR). While for one dataset 90% could be a good HR because of a naïve HR of 30%, for another model with a naïve HR of 90% this HR might not be satisfactory.
Why can the hit-rate not be compared among different data-sets?
- lower error - Less overfitting --> Why? Each learner that is used has a sort of bias (e.g. linear regression data is linear) but when you put learners together, biases 'fight' each other
Why ensemble learning?
› Dummy Values, › Absence of Data, › Cryptic Data, › Contradicting Data, › Non-Unique Identifiers, › Data Integration Problems
Why is data 'dirty'?
Why needed? • For exploration of data • Understand and make sense of data • Communication of results of analysis • coping with the risk of information overload
Why visualization?