Data 2
Data-driven decision-making culture
"If experiments are costly, expert opinion by management is a plausible way to make decisions. But when experiments are cheap, they are likely to provide more reliable answers than opinions—even opinions from highly paid people. Furthermore, even when experienced managers have better-than-average opinions, it is likely that there are more useful things for them to do than sit around a table debating about which background colors will appeal to Web users. The right response from managers to such questions should be 'run an experiment.'" -(from Varian, H. 2010. Computer Mediated Transactions. American Economic Review:Papers and Proceedings 100, pp. 1-10.)
Instead of bringing huge amounts of data to a huge processing unit
(supercomputer), it is often better to distribute data to several smaller processing units, send the analytical code to them, and then combine the results together.
how to deal with Outliers
-Dealing with them is always a judgement call: remove them, replace them with some other value, or leave them as is. -Sometimes, we may be specifically interested in them! -Remove the datum -Use the average value of the other data -Never guess
ETL Uses
-Extract data from legacy systems -Cleanse the data to improve data quality and establish consistency -Load data into a target database
causes of bad data
-Users of data create "workarounds" instead of addressing root causes -People creating the data don't understand how others will use it -Creators and users of data have poor communication -NOT a cause: everyone is invested in getting the data right
impact of bad data on future decision making processes
-loss of memory -false ideas/logic -confusion and miscommunication
How data gets dirty
-Measurements can be wrong or inaccurate -Instrument: the right questions are not asked -Consistency: the same question is asked in multiple ways, e.g. is weight in pounds or ounces?
consequences of dirty data
-start distrusting your sources -errors in decision making -back to square one with sources
Hadoop
-stores data in smaller chunks across a network on different computers (nodes). -Stores real-time cable box activity for millions of customers by region
'Big Data' Is a Set of Technologies
1. 'Big Data' used to be a much-hyped term a few years ago. It can be understood as technologies that allow processing very large amounts of data so that they can be analyzed. -Hadoop and MapReduce are some of the most popular technologies for dealing with large amounts of data.
What can you do about the tyranny of success?
1. Be sure to check your KPIs against the SMART criteria. 2. Use multiple KPIs to measure success as a combination of multiple dimensions.
Some Best Practices
1. Focus on getting new data right. Why? 2. Limit time fixing old data. Why? 3. Data producers should communicate with data consumers. Why? 4. Have a mindset to check your work constantly. Why?
Setting Up an ETL Process
1. Inspect whatever metadata (data dictionary) are available to assess which columns map to each other in different datasets. 2. Decide how to choose the correct version of data that reside in multiple databases. 3. Set up rules for resolving other inconsistencies, duplicates, omissions and other problems in the data, and for validating the data. There are numerous tools available to create ETL processes for different purposes.
Relational Database: Benefits
1. Integrity 2. Flexibility 3. Efficiency. However, relational databases are more complex to operate and use than flat files.
MATCH function
An Excel lookup & reference function that identifies a searched item's position in a list. Syntax: =MATCH(lookup value, lookup range, match type) Ex: ProdMatch = MATCH(J2, Lookups!$A$1:$A$124, 0)
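A rough Python analogue of an exact-match MATCH (match type 0), offered only as a sketch. The function name `match_exact` and the sample product codes are made up for illustration. Like MATCH, it returns a 1-based position; where Excel would show #N/A, it returns None. Unlike Excel, the comparison here is case-sensitive.

```python
def match_exact(lookup_value, lookup_range):
    """Return the 1-based position of lookup_value in lookup_range, or None."""
    for position, item in enumerate(lookup_range, start=1):
        if item == lookup_value:
            return position
    return None  # Excel would show #N/A here


products = ["P-100", "P-200", "P-300"]  # hypothetical lookup column
print(match_exact("P-200", products))   # 2
print(match_exact("P-999", products))   # None
```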
The future of integration- API using EAI
Application Programming Interfaces (APIs) using Enterprise Application Integration (EAI) can be used in place of ETL for a more flexible, scalable solution that includes workflow integration. While ETL is still the primary data integration resource, EAI is increasingly used with APIs in web-based settings.
metrics and intuition
Are both needed in organizations -Aaltonen (2013)
ETL process often involves
Data cleansing to improve the quality and consistency of data -IBM (2020)
How are data stored in a relational database
Different types of data are stored in different tables -Rosemblum and Dorsey 2013
How an ELT process differs from an ETL process
ELT involves less data transformation before the data is loaded into a target system -IBM (2020)
ETL and other data integration methods
ETL and ELT are just two data integration methods, and there are other approaches that are also used to facilitate data integration workflows. Some of these include: -Change Data Capture(CDC) -Data replication -Data virtualization -Stream Data Integration(SDI)
ETL Benefits and Challenges
ETL solutions improve quality by performing data cleansing prior to loading the data to a different repository. -A time-consuming batch operation, ETL is recommended more often for creating smaller target data repositories that require less frequent updating, while other data integration methods are used to integrate increasingly larger volumes of changing data or real-time data streams.
ETL
Extract, Transform, Load: an automated process of extracting data from multiple sources, transforming the data into a consistent format, and loading the data into an analytical system. -A process that extracts, transforms, and loads data from multiple sources to a data warehouse or other unified data repository. -An integration process that combines data from multiple data sources into a single, consistent data store that is loaded into a data warehouse or other target system. -Provides the foundation for data analytics and machine learning workstreams. -Cleanses and organizes data in a way that addresses specific business intelligence needs, like monthly reporting, but it can also tackle more advanced analytics, which can improve back-end processes or end-user experiences.
ETL tools
In the past, organizations wrote their own ETL code. There are now many open source and commercial ETL tools and cloud services to choose from. Typical capabilities of these products include the following: -Comprehensive automation and ease of use -A visual, drag-and-drop interface -Support for complex data management -Security and compliance
Relational Database: Tables and Queries
Instead of one big two-dimensional matrix (or dataset), a relational database stores different data into different tables. A table resembles a dataset - it has rows and columns just as our datasets - but we usually need to combine data from different tables into a new table (dataset) before we can use it. The data are combined using database queries, e.g. -SELECT * FROM Employees;
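Combining tables with a query can be sketched with Python's built-in sqlite3 module. The Departments table, the column names, and the sample rows are assumptions made up for this illustration; only the Employees table name comes from the card above.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Departments (
        DeptID   INTEGER PRIMARY KEY,
        DeptName TEXT
    );
    CREATE TABLE Employees (
        EmpID  INTEGER PRIMARY KEY,
        Name   TEXT,
        DeptID INTEGER REFERENCES Departments(DeptID)  -- foreign key
    );
    INSERT INTO Departments VALUES (1, 'Sales'), (2, 'Finance');
    INSERT INTO Employees VALUES (10, 'Ada', 1), (11, 'Ben', 2), (12, 'Cy', 1);
""")

# Combine the two tables into one result set (a new "dataset").
rows = conn.execute("""
    SELECT e.Name, d.DeptName
    FROM Employees AS e
    JOIN Departments AS d ON e.DeptID = d.DeptID
    ORDER BY e.EmpID
""").fetchall()
print(rows)
conn.close()
```

Each department name is stored once in Departments, and the DeptID key links it to every employee row that needs it.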
integrity of data
It's easier to maintain when the same item is recorded in one place only.
Efficiency
It's faster to retrieve and update data when you don't have to plough through lots of redundant values.
the purpose of primary and foreign keys in a relational database
Keys help link different data together -Rosemblum and Dorsey 2013
Meaning of metrics can lead to mindlessly optimizing every individual aspect of business
Managers lose sight of the big picture and things that matter to customers -Aaltonen (2013)
Problems with Data Are a Big Deal (Redman 2013)
People doing analytics spend 50% of their time i) searching for data, ii) correcting errors, and iii) verifying correctness
Indicator Types
Quantitative, Qualitative, Leading, Lagging, Input, Process, Output, Directional, Actionable, Financial
If non-profits were funded solely on their success KPIs
Small non-profits serving the most vulnerable may be culled -Schambra (2013)
The Agency Problem
The data creator is often NOT the data consumer.
Why Do We Need ETL?
The power of data analytics is often based on combining data from different sources. However, data stored in different places are often formatted differently; such differences need to be resolved before the data can be combined. It can be very difficult to enforce a consistent schema for data even across the different departments of the same organization.
Flexibility
You can create different cuts into the data.
Outlier
an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Before abnormal observations can be singled out, it is necessary to characterize normal observations
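One common convention for characterizing "normal" is the interquartile range (IQR) rule: flag values more than 1.5 IQRs outside the middle half of the data. This is a sketch of that convention, not the definition used in the card. The function name, the multiplier k, and the sample ages are assumptions for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


ages = [23, 25, 27, 29, 31, 33, 35, 213]  # hypothetical sample
print(iqr_outliers(ages))  # [213]
```

Whether to remove, replace, or keep a flagged value remains a judgement call for the analyst.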
why data integration is important
because companies have data on multiple databases
Extract, Transform, Load (ETL) processes are often needed
because data from multiple sources need to be integrated for analysis - IBM (2020)
Bad data
can create a vicious cycle that kills a data-driven decision-making culture -Redman (2013)
integration
combining data from multiple sources and providing users with a unified view of them
CSV file
Comma-separated values: a common file format, along with Excel spreadsheet files, for transferring data between systems and applications. CSV files can be inspected using any text editor (they are just text files), which makes them convenient to work with. They contain just the plain dataset (no formatting, no formulas, no visualizations, etc.)
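Because a CSV file is plain text, it can be read with Python's standard csv module. In this sketch the column names and values are invented, and the text is held in a string rather than a file so the example is self-contained.

```python
import csv
import io

# Hypothetical CSV content; a real file would be opened with open(path, newline="").
raw_text = "product,weight_lb\nwidget,2.5\ngadget,4.0\n"

reader = csv.DictReader(io.StringIO(raw_text))  # first row becomes the header
rows = list(reader)

for row in rows:
    # Every value arrives as text; convert types explicitly.
    print(row["product"], float(row["weight_lb"]))
```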
SMART
criteria for good KPIs 1. Specific for the business (Group 1) 2. Measurable (Group 2) 3. Achievable by the organization (Group 3) 4. Relevant to success (Group 4) 5. Time-phased (Group 5)
Dirty Data
Data that contain errors. Ex: -Dirty values: "b" as gender, 213 as age -Missing values -Inconsistent date -Nonintegrated data: from various sources -Wrong granularity: level of detail
how to deal with ambiguous data
Data whose labels you don't understand. -Look at the metadata/data dictionary for the dataset and compare -Ask the source that created the data -Make an educated guess
The tyranny of success
The difference between success and failure, for instance in non-profits, may not be clearly defined.
ETL vs ELT
The difference is in the order of operations. -ELT copies or exports the data from the source locations, but instead of loading it to a staging area for transformation, it loads the raw data directly to the target data store to be transformed as needed. -ELT is particularly useful for high-volume, unstructured datasets, as loading can occur directly from the source. -ELT can be better suited to big data management since it doesn't need much upfront planning for data extraction and storage. -The ETL process, on the other hand, requires more definition at the outset. Specific data points need to be identified for extraction, along with any potential "keys" to integrate across disparate source systems. -While ELT has become increasingly popular with the adoption of cloud databases, it is the newer process, so best practices are still being established. -Both processes leverage a variety of data repositories, such as databases, data warehouses, and data lakes, and each process has its advantages and disadvantages.
179 assets were unintentionally included in a purchases contract
due to hidden rows becoming visible when the spreadsheet was converted to a PDF file - Moss (2021)
Relational Database
has a collection of tables that store data for different types of entities. -data are often stored in a database that has several advantages. Tables are made of rows (records, observations) and columns (variables). Fields are made of characters that can represent different types of data (data types). The structure of the database is described by a database schema
The idea of Hadoop/MapReduce
is deceptively simple, but it provides much more scalability because huge amounts of data can be processed on relatively cheap hardware. However, sometimes data are so interdependent that the approach becomes difficult: processing one data item may require knowledge of other data items (think of a social network).
the problem with scoring non-profit outcomes as "success" or "failure"
is that the distinction between success and failure is not always clear -Schambra (2013)
Key Performance Indicators (KPIs)
Measure whether a piece of equipment, a process, a team, an individual or an organization is operating at an adequate level of performance. They are often measured as continuous values that are divided into 'acceptable' and 'unacceptable' performance by a threshold value.
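The threshold idea can be sketched in a few lines. The function name, the `higher_is_better` flag, and the sample figures are assumptions for illustration, not part of any particular KPI framework.

```python
def kpi_status(value, threshold, higher_is_better=True):
    """Classify a continuous KPI value against a threshold."""
    ok = value >= threshold if higher_is_better else value <= threshold
    return "acceptable" if ok else "unacceptable"


# Hypothetical KPIs: on-time delivery rate (higher is better)
# and average response time in minutes (lower is better).
print(kpi_status(0.97, 0.95))                      # acceptable
print(kpi_status(12, 10, higher_is_better=False))  # unacceptable
```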
Scorecard
measures performance against several goals (KPIs). distinctive elements: -x and check mark -use of colors
MapReduce
Processes the pieces of data in parallel in different nodes and combines the results together. -Analyzes which programs people are most likely to pause and then skip commercials
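The map-then-combine pattern can be imitated on one machine. This sketch is not Hadoop code: the chunks stand in for data stored on different nodes, the map step runs sequentially here rather than in parallel, and the function names and viewing data are made up.

```python
from collections import Counter
from functools import reduce


def map_chunk(chunk):
    """Map step: count program views within one chunk (one node's data)."""
    return Counter(chunk)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical viewing logs, split into chunks as if stored on three nodes.
chunks = [
    ["news", "drama", "news"],
    ["sports", "news"],
    ["drama", "news"],
]

partials = [map_chunk(chunk) for chunk in chunks]  # would run in parallel
totals = reduce(reduce_counts, partials, Counter())
print(dict(totals))
```

Each chunk can be counted without looking at the others, which is what makes the work easy to distribute.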
Extract
raw data is copied or exported from source locations to a staging area. Data management teams can extract data from a variety of data sources, which can be structured or unstructured. Those sources include but are not limited to: -SQL or NoSQL servers -CRM and ERP systems -Flat files -Email -Web pages
Flat Files
Store data as a dataset that is more or less ready to be analyzed. -CSV (comma-separated values) files and Excel spreadsheet files are the most common flat files used to store data. -They contain one or more two-dimensional (rows and columns) datasets.
According to Tableau, if data are incorrect,
the analysis may produce results that look as if they were correct
Transform
the raw data undergoes data processing. the data is transformed and consolidated for its intended analytical use case. This phase can involve the following tasks: -Filtering, cleansing, de-duplicating, validating, and authenticating the data. -Performing calculations, translations, or summarizations based on the raw data. This can include changing row and column headers for consistency, converting currencies or other units of measurement, editing text strings, and more -Conducting audits to ensure data quality and compliance -Removing, encrypting, or protecting data governed by industry or governmental regulators -Formatting the data into tables or joined tables to match the schema of the target data warehouse.
Load
The transformed data is moved from the staging area into a target data warehouse. -Typically, this involves an initial loading of all data, followed by periodic loading of incremental data changes and, less often, full refreshes to erase and replace data in the warehouse. For most organizations that use ETL, the process is automated, well-defined, continuous and batch-driven. Typically, ETL takes place during off-hours when traffic on the source systems and the data warehouse is at its lowest.
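The three steps can be sketched end to end with the Python standard library. Everything specific here is an assumption for illustration: the two source layouts, the column names, the rule that the first source wins when a customer appears in both, and the `shipments` table. Real ETL tools handle far more cases.

```python
import csv
import io
import sqlite3

# Two hypothetical sources that record the same thing in different formats.
source_a = "customer_id,weight_lb\n1,10\n2,5\n"
source_b = "cust,weight_oz\n2,80\n3,32\n"


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows_a, rows_b):
    """Transform: unify column names and units, then de-duplicate."""
    unified = {}
    for row in rows_a:
        unified[int(row["customer_id"])] = float(row["weight_lb"])
    for row in rows_b:
        customer = int(row["cust"])
        pounds = float(row["weight_oz"]) / 16  # 16 ounces per pound
        unified.setdefault(customer, pounds)   # assumed rule: source A wins
    return sorted(unified.items())


def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE shipments (customer_id INTEGER PRIMARY KEY, weight_lb REAL)"
    )
    conn.executemany("INSERT INTO shipments VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(source_a), extract(source_b)), conn)
result = conn.execute("SELECT * FROM shipments ORDER BY customer_id").fetchall()
print(result)
conn.close()
```

In an ELT variant, the raw rows would be loaded into the target first and the unit conversion would be run there when needed.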
What Is an Indicator?
variables that measure (indicate) a business-relevant value. combine (aggregate) data into a single value. provide a forceful simplification of reality, which is both their benefit and a possible drawback.
