Big Data Exam 2
Machine learning
"is a current application of AI based around the idea that we should really just be able to give machines access to data and let them learn for themselves"
Consistency
- Values are the same/agree between data sets - best controlled with application and database level constraints
Data in Data Mining
- a collection of facts obtained as the result of experiences, observations or experiments - consists of numbers, letters, words, images, voice recordings - can be structured, unstructured or semi-structured
Documentation
- a data dictionary should be available to anyone in the organization who collects or works with data (data field names, types, ranges, constraints, defaults) - who is responsible for which data - where is the data collected - what is the data collection process
Operational data stores
- a database which integrates corporate data from different data sources in order to facilitate operational reporting in real-time or near real-time - used for short-term decisions rather than medium or long-term decisions - unlike the static content of a data warehouse, the contents of an ODS are updated throughout the course of business operations
Enterprise data warehouse
- a large-scale data warehouse for the enterprise - provides integration of data from many sources into a standard format - provides data for many types of Decision Support Systems (DSS) including supply chain management, product life-cycle management, revenue management and knowledge management systems
Data Warehouse
- a physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format - a pool of data produced to support decision making. It is also a repository of current and historical data of potential interest to managers throughout the organization
Business Intelligence
- a technology-driven process for analyzing data and presenting actionable information to help executives, managers and other corporate end users make informed business decisions - organizations use them to make sense of data and to make better decisions - a term that includes databases, applications, methodologies and other tools used for executive and managerial decision making
Data Warehousing
- accessing, organizing and integrating key operational data in a form that is consistent, reliable, timely and readily available wherever and whenever needed
6 quality dimensions
- accuracy - completeness - consistency - conformity - integrity - timeliness
Nonvolatile
- after data are entered into a DW, users cannot change or update them; any changes are recorded as new data - deletion is possible
Predictive analytics
- aims to determine what is likely to happen in the future. It is based on statistical techniques as well as other more recently developed techniques that fall under the categories of data mining or machine learning
Benefits of DW
- allows end users to perform extensive analysis - allows a consolidated view of corporate data - better and more timely information - enhanced system performance - simplification of data access
2-tier architecture
- application server runs on the same hardware platform as the data warehouse - advantage: more economical
Timeliness
- as up to date as possible - time should be reflected in the data and/or report - the more timely the data, the more costly and difficult to produce - very context oriented
Data Mining
- examples: certain names are more prevalent in certain US locations (O'Brien, O'Reilly) - grouping together similar documents returned by a search engine according to their context
Correction
- change data values that are not recognized - fix misspellings (autocorrect) - use default values to correct data type errors
Analysis techniques
- classification - regression - clustering - association analysis
Abbreviation expansion
- clearly define abbreviations - INC for Incorporated - ST for Street - USA for United States of America
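A minimal sketch of abbreviation expansion using a hand-built lookup table; the table contents, function name and sample text are illustrative assumptions, not part of the notes.

```python
# Lookup table of clearly defined abbreviations and their full forms
EXPANSIONS = {
    "INC": "Incorporated",
    "ST": "Street",
    "USA": "United States of America",
}

def expand_abbreviations(text: str) -> str:
    """Replace each defined abbreviation with its full form, word by word."""
    return " ".join(EXPANSIONS.get(word.upper(), word) for word in text.split())

print(expand_abbreviations("Acme INC 123 Main ST USA"))
# -> Acme Incorporated 123 Main Street United States of America
```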
Completeness
- contains all the required information - nothing is missing - all data is usable (no errors, all data is "understood") - if it doesn't exist at the time of the analysis, recreation is rarely successful - best controlled when planning process for data collection occurs
Include metadata
- data about data - contains information about how data is organized and how it can be used effectively
Organization
- data can be examined for errors more quickly if it is reasonably organized (sorted by entity, date, location)
Typographical and transcription
- data entry errors - misspelling and abbreviation errors - miskeyed letters
Big data quality
- data is inherently unclean (typically, the more it is unstructured, the less clean it is) - data can speak volumes but have little to say (noise) - real world data is messy
Planning
- data management is a process that must be guided from start to end - understanding the organization's needs and the data that supports those needs is the place to start - data structures and data collection should be controlled to best facilitate the needs of the organization
Types of data warehousing
- data marts - operational data stores - enterprise data warehouses
Subject oriented
- data organized by subject such as sales, products, customers - enables users to determine how their business is performing and why - provides a more comprehensive view of the organization
Time-variant (time series)
- data saved over multiple time periods (daily, weekly) - enables decision makers to detect trends, deviations and long-term relationships - every DW should support a time dimension for its data
Separation of Duties
- data should be collected at a minimum number of places in the organization - student data should be entered once by the registrar and then used by other areas
Data cleansing framework
- define and determine error types - search and identify error instances - correct errors - document error instances and error types - modify data entry procedures to reduce future errors
Act (feedback)
- determine what action to take - figure out how to implement the action - monitor and measure the impact of the action - evaluate the action based on success criteria you defined at the beginning - any revision?
Accuracy
- does the data correctly reflect what is true? - should agree with an identified source - may be difficult to detect because of data errors - best controlled when data is entered as close to the source as possible
Data cleansing process
- eliminate duplicate records - use a sorting method to find duplicates - non-exact duplicates are a problem to find
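A small sketch of duplicate elimination by sorting, assuming simple (name, city) records I made up: exact duplicates become adjacent after sorting, but non-exact duplicates still slip through.

```python
records = [
    ("John Smith", "Dayton"),
    ("Mary Jones", "Columbus"),
    ("John Smith", "Dayton"),   # exact duplicate - easy to catch after sorting
    ("Jon Smith", "Dayton"),    # non-exact duplicate - not caught by this method
]

deduped = []
for rec in sorted(records):
    if not deduped or rec != deduped[-1]:   # compare each record with the previous sorted one
        deduped.append(rec)

print(deduped)   # the exact duplicate is removed; "Jon Smith" remains
```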
Education
- everyone in the process should be responsible for ensuring data quality - a good understanding of why data quality is important and ways to manage quality is critical - everyone in the process must be proactive not only with the data of which they are in charge, but anything unusual they may see
Format Conformance
- expected format - ex: date formats differ between countries (month/day/year vs. day/month/year)
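A minimal sketch of enforcing format conformance for dates, assuming we want to normalize both US (month/day/year) and European (day/month/year) input to ISO yyyy-mm-dd; the source format must be known for each data source.

```python
from datetime import datetime

def to_iso(date_text: str, source_format: str) -> str:
    """Parse a date string in the stated source format and return it in ISO format."""
    return datetime.strptime(date_text, source_format).strftime("%Y-%m-%d")

print(to_iso("04/07/2023", "%m/%d/%Y"))  # US style -> 2023-04-07
print(to_iso("04/07/2023", "%d/%m/%Y"))  # EU style -> 2023-07-04
```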
Updating missing fields
- fill fields that are missing data if reasonable - may be caused by errors in original data
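A small sketch of filling missing fields with reasonable defaults, assuming dictionary-style records; the field names and default values are illustrative assumptions.

```python
DEFAULTS = {"country": "USA", "status": "active"}   # assumed reasonable defaults

def fill_missing(record: dict) -> dict:
    """Fill fields that are missing or empty with a default, when reasonable."""
    fixed = dict(record)
    for field, default in DEFAULTS.items():
        if not fixed.get(field):          # covers missing keys, None and empty strings
            fixed[field] = default
    return fixed

print(fill_missing({"name": "Mary Jones", "country": ""}))
# -> {'name': 'Mary Jones', 'country': 'USA', 'status': 'active'}
```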
Data cleansing
- has a definite process but that process is flexible - not all organizations view quality in the same way, so not all organizations clean in the same way - all processes include first finding errors, and then correcting them - not all organizations do it the same way, but it's important nevertheless
Data quality software
- helps with optimizing data quality; data profiling software is also available that helps in understanding data structure, relationships and content
Analyze
- includes selection of analytical techniques to use, building a model of the data, evaluation of analytical results, and creating reports and data visualizations to showcase the results - input data > select analysis techniques > model > model output
Three goals of Integrate
- integrate all data that is essential for our problem - clean the data to address data quality issues - transform the raw data to make it suitable for analysis
Decision support systems
- interactive computer-based systems which help decision makers utilize data and models to solve unstructured problems - couple the intellectual resources of individuals with the capabilities of the computer to improve the quality of decisions. It is a computer-based support system for management decision makers
Organize
- involves looking at the data to understand its nature, what it means, its quality and format - it is a part of the two-step data preparation process - aims for some preliminary exploration in order to gain a better understanding of the specific characteristics of the data
Implicit and explicit Nullness
- is absence of a value allowed? - implicit nulls: a missing value is allowed - explicit nulls: what value is to be used if data is missing? (e.g., a telephone number)
Prevention
- it can be very difficult to change data so quality data collection methods are necessary - includes both application and data structure design (dropdown boxes to minimize text entries, ranges, data types)
Data mining
- knowledge discovery from data - extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns of knowledge from huge amounts of data
Floating Data
- lack of clarity as to what types of data go into specific fields - data placed in wrong field - consider address 1 and address 2 fields: which is for the street address or are both for the street address?
Not data mining
- look up phone number in phone directory - query a web search engine for information about "Amazon"
Concerns about real-time DW
- not all data should be updated continuously - mismatch of reports generated minutes apart - may be cost prohibitive - may also be infeasible
3-tier architecture
- operational systems contain the data and the software for data acquisition in one tier, the application server is another tier, and the third tier includes the data warehouse - advantage: fewer resource constraints, enables data marts
Data cleansing guidelines
- planning - education - organization - separation of duties - prevention - documentation
Transformation errors
- reducing a data field's size may truncate existing data (Jones becomes Jon) - changing a data field's type may change existing data (a date becomes a number)
Descriptive analytics
- refers to knowing what is happening in an organization and understanding some underlying trends and causes of such occurrences - first involves consolidation of data sources - visualization is key to this exploratory analysis step - ex: dashboard
Active Data Warehouse
- strategic and tactical decisions - results measured with operations - comprehensive, detailed data available within minutes - high number of users accessing and querying the system simultaneously - flexible ad hoc reporting, as well as machine-assisted modeling to discover new hypotheses and relationships
Traditional Data Warehouse Environment
- strategic decisions only - results sometimes hard to measure - daily, weekly, monthly data currency acceptable; summaries often appropriate - moderate user concurrency - highly restrictive reporting used to confirm or check existing processes and patterns; often uses pre-developed summary tables or data marts
Association analysis
- the goal is to come up with a set of rules to capture associations within items or events. the rules are used to determine when items or events occur together - ex: the famous diaper-beer connection
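A toy sketch of the idea behind association analysis: count how often pairs of items occur together in transactions and keep the pairs above a support threshold. Real tools use algorithms such as Apriori; the baskets and threshold here are invented for illustration.

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]

# Count how often each pair of items appears together in a basket
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 3   # a pair must appear in at least 3 transactions
frequent_pairs = [pair for pair, count in pair_counts.items() if count >= min_support]
print(frequent_pairs)   # -> [('beer', 'diapers')]
```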
Best Practices for implementing DW
- the project must fit with corporate strategy - it is important to manage user expectations - the data warehouse must be built incrementally - adaptability must be built in from the start - the project must be managed by both IT and business professionals - only load data that have been cleansed/are of high quality - do not overlook training requirements
Integrity
- this primarily applies to relationships within the data structure - all data in the structure should be retrievable regardless of where it is - this is why keys are so important in the data structure - best controlled at the time of data structure development
Parsing
- dividing text into a group of tokens; used to find patterns in the data in order to standardize values - look for patterns (find two spaces in the name field, divide the field into first, middle and last)
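A minimal parsing sketch based on the "count the spaces" pattern above, assuming simple "First Middle Last" or "First Last" values; the function name is an assumption and real name fields (titles, suffixes) need more care.

```python
def parse_name(name_field: str) -> dict:
    """Split a free-text name field into first, middle and last name tokens."""
    tokens = name_field.split()
    if len(tokens) == 3:                             # two spaces found
        first, middle, last = tokens
    elif len(tokens) == 2:                           # one space found
        first, middle, last = tokens[0], "", tokens[1]
    else:
        first, middle, last = name_field, "", ""     # flag for manual review
    return {"first": first, "middle": middle, "last": last}

print(parse_name("Mary Ann Jones"))
# -> {'first': 'Mary', 'middle': 'Ann', 'last': 'Jones'}
```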
Overloaded Attributes
- too much information in one field - Joe Smith IRA - Joe Smith in Trust for Mary Smith - John and Mary Smith (implies they are married)
Regression
- when your model has to predict a numeric value instead of a category, then the task becomes a regression - example: predict the price of a stock
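A small regression sketch using numpy's least-squares line fit; the day indexes and prices are invented and only illustrate predicting a numeric value from past observations.

```python
import numpy as np

days   = np.array([1, 2, 3, 4, 5])                    # time index
prices = np.array([10.0, 10.5, 11.1, 11.4, 12.0])     # observed stock prices

slope, intercept = np.polyfit(days, prices, 1)        # fit a straight line to the history
predicted_day6 = slope * 6 + intercept                # numeric prediction, not a category
print(round(predicted_day6, 2))
```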
Steps to start a big data project
1. Define the problem 2. Assess the situation 3. Define the purpose
Factors in selecting a DW architecture
1. Information independence between organizational units 2. upper management's information needs 3. Urgency of need for a data warehouse 4. Nature of end-user tasks 5. constraints on resources 6. strategic view of the data warehouse prior to implementation 7. Compatibility with existing systems 8. Perceived ability of the in-house IT staff 9. Technical issues (particularly ETL) 10. Social/political factors
Important criteria in selecting an ETL tool
1. ability to read from and write to an unlimited number of data sources/architectures 2. automatic capturing and delivery of metadata 3. a history of conforming to open standards 4. an easy-to-use interface for the developer and the functional user
Issues affecting the purchase of an ETL tool
1. data transformation tools are expensive 2. data transformation tools may have a long learning curve
4 components to a BI system
1. data warehouse - storing and querying data 2. business analytics - for manipulating, mining and analyzing data 3. business performance management - monitoring and analyzing performance 4. user interface - controlling the system and visualizing data
Indirect benefits of data warehouse
1. enhance business knowledge 2. present competitive advantage 3. enhance customer service and satisfaction 4. facilitate decision making 5. help in reforming business processes
Conformity
For instances of similar/same data - same data type - same format - same size (date of graduation) - best controlled at the time of data structure creation
Pre-processing
Has two steps: organize and integrate
Summary statistics
Mode, mean, median and standard deviation provide numerical values to describe your data
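A quick sketch of those summary statistics using Python's standard library; the sample values are illustrative only.

```python
import statistics

values = [4, 8, 8, 9, 12, 15, 18]

print("mean   =", statistics.mean(values))
print("median =", statistics.median(values))
print("mode   =", statistics.mode(values))
print("stdev  =", statistics.stdev(values))   # sample standard deviation
```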
Classification
Predicting the weather as being sunny, rainy, windy or cloudy based on various factors involved
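A toy classification sketch, assuming scikit-learn is available; the features (humidity, wind speed) and labels are invented, and a real model would need far more data and proper evaluation.

```python
from sklearn.tree import DecisionTreeClassifier

X = [[85, 5], [90, 20], [30, 25], [20, 5], [70, 30]]   # [humidity %, wind speed]
y = ["rainy", "rainy", "windy", "sunny", "windy"]      # observed weather categories

model = DecisionTreeClassifier().fit(X, y)             # learn rules from labeled examples
print(model.predict([[25, 4]]))                        # predicted category for dry, calm conditions
```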
Statistics
_____ and data mining both look for relationships within data - we first make a hypothesis and collect sample data to test it
Mathematical modeling
______ and data mining both look to make decisions to achieve the best possible performance of the system - can use data mining results as an input to provide optimal actions based on the restrictions and limitations of the system
Sorting
a common method for eliminating duplicates, but non-exact duplicates can pose a large problem
Outlier
a data point that's distant from other data points. Plotting outliers will help check for errors or rare events in the data
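A minimal sketch of flagging outliers as points far from the mean, here more than two standard deviations away; the threshold is a common rule of thumb rather than a universal rule, and the data are invented.

```python
import statistics

values = [12, 13, 12, 14, 13, 95, 12, 13]

mean = statistics.mean(values)
stdev = statistics.stdev(values)

# Keep any point more than 2 standard deviations from the mean
outliers = [v for v in values if abs(v - mean) > 2 * stdev]
print(outliers)   # 95 is far from the other points and gets flagged
```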
Data mart
a data warehouse that stores only relevant data (smaller and subject-oriented)
Independent data mart
a small data warehouse designed for a strategic business unit or a department (not dependent on the larger data warehouse); an alternative to high cost data warehouses
Dependent data mart
a subset that is created directly from a data warehouse (dependent on the larger data warehouse)
Correlation graphs
can be used to explore the dependencies between different variables in the data
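A brief sketch of checking the dependency between two variables with a correlation coefficient; a correlation graph would show this visually, but the numeric value conveys the same relationship. The ad-spend and sales figures are invented.

```python
import numpy as np

ad_spend = [10, 20, 30, 40, 50]
sales    = [12, 24, 31, 45, 52]

r = np.corrcoef(ad_spend, sales)[0, 1]   # Pearson correlation between the two variables
print(round(r, 3))                       # close to 1.0 -> strong positive dependency
```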
Artificial intelligence
enabling machines to become "smart"
Real-time DW
enabling real-time data updates for real-time analysis and real-time decision making
Clustering
grouping a company's customer base into distinct segments for more effective targeted marketing like seniors, adults and teenagers
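A toy clustering sketch, assuming scikit-learn is available; the customer ages are invented, and k-means groups them into three segments roughly matching teenagers, adults and seniors.

```python
from sklearn.cluster import KMeans

ages = [[13], [15], [17], [34], [38], [41], [67], [70], [72]]   # one feature: customer age

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ages)
print(kmeans.labels_)            # segment assignment for each customer
print(kmeans.cluster_centers_)   # approximate center age of each segment
```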
Capture
includes anything that helps us retrieve data, including finding, accessing, acquiring and moving data
Integrate
includes integration of multiple data sources, cleaning data, filtering data, creating datasets which programs can read and understand, such as packaging raw data using a specific data format
Quality
is a moving target - less important when doing exploratory data analysis; critical when using data for decision support
Transform
is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data
Integrated
places data from different sources into a consistent format
Visualization
provides a quick and effective way to look at data - gives you an idea of where the hotspots are - shows what the distribution of the data is - shows correlations between variables
Web-based
provides efficient computing environment for web-based applications
Real time
provides real time or active data access and analysis capabilities
Business analytics
refers to the application of models directly to business data. it involves the use of DSS tools, especially models, in assisting decision makers
Act
reporting insights from analysis and determining actions from insights based on the purpose you initially defined
Prescriptive analytics
seeks to recognize what is going on as well as the likely forecast and to make decisions to achieve the best performance possible
Data mining
serves as the foundation for AI and machine learning
Definition of Data Cleansing
the assessment of data to determine quality failures (inaccuracy, incompleteness) and then improving the quality by correcting, as far as possible, any errors found
Normalization
the process of organizing tables and columns in a relational database to reduce redundancy and improve data integrity
Extract
the process of reading data from a database
Load
the process of writing the data into the target database
Extract, Transform, Load
three data base functions that are combined into one tool to pull data out of one database and place it into another database
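A compact ETL sketch that extracts rows from a CSV file, transforms them with a simple lookup table, and loads them into a SQLite table; the file name, column names and lookup values are all assumptions for illustration.

```python
import csv
import sqlite3

STATE_LOOKUP = {"OH": "Ohio", "CA": "California"}   # transformation rule (lookup table)

# Extract: read data out of the source database or file (a CSV file here)
with open("customers.csv", newline="") as f:
    rows = list(csv.DictReader(f))                  # expects columns: name, state

# Transform: convert values into the form the target database needs
for row in rows:
    row["state"] = STATE_LOOKUP.get(row["state"], row["state"])

# Load: write the transformed data into the target database
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, state TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(r["name"], r["state"]) for r in rows])
conn.commit()
conn.close()
```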
Standardization
use the legal name instead of a nickname (use Robert for Bob, Rob, Bobby and Robbie)
Graphing the general trends
will show you if there is a consistent direction in which the values of these variables are moving, like sales prices going up or down
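A minimal trend-plotting sketch, assuming matplotlib is installed; the monthly sales figures are invented, and a roughly rising line indicates a consistent upward direction.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales  = [100, 110, 108, 120, 130, 138]

plt.plot(months, sales, marker="o")   # line plot makes the general direction visible
plt.title("Monthly sales trend")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.show()
```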