Big Data Exam 2

Machine learning

"is a current application of AI based around the idea that we should really just be able to give machines access to data and let them learn for themselves"

Consistency

- Values are the same/agree between data sets - best controlled with application and database level constraints

Data in Data Mining

- a collection of facts obtained as the result of experiences, observations or experiments - consist of numbers, letters, words, images, voice recordings - structured, unstructured, semi-structured

Documentation

- a data dictionary should be available to anyone in the organization who collects or works with data (data field names, types, ranges, constraints, defaults) - who is responsible for which data - where is the data collected - what is the data collection process

Operational data stores

- a database which integrates corporate data from different data sources in order to facilitate operational reporting in real-time or near real-time - used for short-term decisions rather than medium- or long-term decisions - unlike the static content of a data warehouse, the contents of an ODS are updated throughout the course of business operations

Enterprise data warehouse

- a large-scale data warehouse for the enterprise - provides integration of data from many sources into a standard format - provides data for many types of Decision Support Systems (DSS) including supply chain management, product life-cycle management, revenue management and knowledge management systems

Data Warehouse

- a physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format - a pool of data produced to support decision making. It is also a repository of current and historical data of potential interest to managers throughout the organization

Business Intelligence

- a technology-driven process for analyzing data and presenting actionable information to help executives, managers and other corporate end users make informed business decisions - organizations use it to make sense of data and to make better decisions - a term that includes databases, applications, methodologies and other tools used for executive and managerial decision making

Data Warehousing

- accessing, organizing and integrating key operational data in a form that is consistent, reliable, timely and readily available wherever and whenever needed

6 quality dimensions

- accuracy - completeness - consistency - conformity - integrity - timeliness

Nonvolatile

- after data are entered into a DW, users cannot change or update them; any changes are recorded as new data - deletion is possible

Predictive analytics

- aims to determine what is likely to happen in the future. It is based on statistical techniques as well as other, more recently developed techniques that fall under the categories of data mining or machine learning

Benefits of DW

- allows end users to perform extensive analysis - allows a consolidated view of corporate data - better and more timely information - enhanced system performance - simplification of data access

2-tier architecture

- application server runs on the same hardware platform as the data warehouse - advantage: more economical

Timeliness

- as up to date as possible - time should be reflected in the data and/or report - the more timely the data, the more costly and difficult to produce - very context oriented

Data Mining

- certain names are more prevalent in certain US locations (O'Brien, O'Reilly) - grouping together similar documents returned by a search engine according to their context

Correction

- change data values that are not recognized - fix misspellings (auto-correct) - apply default values to correct data type errors
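
A hedged illustration of auto-correcting unrecognized values: the sketch below matches entries against a list of known-good values using the standard library's difflib. The city names, the function name and the 0.8 cutoff are illustrative assumptions, not part of the original material.

```python
import difflib

# Known-good values to correct against (invented for illustration).
valid_cities = ["Chicago", "Houston", "Phoenix", "Seattle"]

def correct(value, choices, cutoff=0.8):
    """Return the closest known value, or the original if nothing is close."""
    matches = difflib.get_close_matches(value, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else value

print(correct("Chicgo", valid_cities))   # -> Chicago
print(correct("Dallas", valid_cities))   # -> Dallas (left unchanged)
```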

Analysis techniques

- classification - regression - clustering - association analysis

Abbreviation expansion

- clearly define each abbreviation - INC for Incorporated - ST for Street - USA for United States of America
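
A minimal Python sketch of table-driven expansion, using only the three abbreviations listed above; the function name and the whole-token matching rule are assumptions for illustration.

```python
# Lookup table built from the examples above.
EXPANSIONS = {
    "INC": "Incorporated",
    "ST": "Street",
    "USA": "United States of America",
}

def expand(text):
    # Expand any whole token that appears in the table; leave others as-is.
    return " ".join(EXPANSIONS.get(tok, tok) for tok in text.split())

print(expand("123 Main ST"))  # -> 123 Main Street
```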

Completeness

- contains all the required information - nothing is missing - all data is usable (no errors, all data is "understood") - if it doesn't exist at the time of the analysis, recreation is rarely successful - best controlled when planning process for data collection occurs

Include metadata

- data about data - contains information about how the data is organized and how it can be used effectively

Organization

- data can be examined for errors more quickly if it is reasonably organized (sorted by entity, date, location)

Typographical and transcription

- data entry errors - misspelling and abbreviation errors - miskeyed letters

Big data quality

- data is inherently unclean (typically, the more it is unstructured, the less clean it is) - data can speak volumes but have little to say (noise) - real world data is messy

Planning

- data management is a process that must be guided from start to end - understanding the organization's needs and the data that supports those needs is the place to start - data structures and data collection should be controlled to best facilitate the needs of the organization

Types of data warehousing

- data marts - operational data stores - enterprise data warehouses

Subject oriented

- data organized by subject such as sales, products, customers - enables users to determine how their business is performing and why - provides a more comprehensive view of the organization

Time-variant (time series)

- data saved over multiple time periods (daily, weekly) - enables decision makers to detect trends, deviations and long-term relationships - every DW should support a time dimension for the data

Separation of Duties

- data should be collected at a minimum number of places in the organization - student data should be entered once by the registrar and then used by other areas

Data cleansing framework

- define and determine error types - search and identify error instances - correct errors - document error instances and error types - modify data entry procedures to reduce future errors

Act (feedback)

- determine what action to take - figure out how to implement the action - monitor and measure the impact of the action - evaluate the action based on success criteria you defined at the beginning - any revision?

Accuracy

- does the data correctly reflect what is true? - should agree with an identified source - may be difficult to detect because of data errors - best controlled when data is entered as close to the source as possible

Data cleansing process

- eliminate duplicate records - a sorting method is used to find duplicates - finding non-exact duplicates is a problem

Education

- everyone in the process should be responsible for ensuring data quality - a good understanding of why data quality is important and ways to manage quality is critical - everyone in the process must be proactive not only with the data of which they are in charge, but anything unusual they may see

Format Conformance

- expected format - ex: date formats differ between countries - month/day/year - day/month/year
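
A minimal sketch of enforcing one expected date format with the standard library; the choice of ISO 8601 (year-month-day) as the target format is an assumption.

```python
from datetime import datetime

def conform_date(value, source_format):
    """Parse a date in its regional format and emit one conforming format."""
    return datetime.strptime(value, source_format).strftime("%Y-%m-%d")

print(conform_date("03/14/2023", "%m/%d/%Y"))  # US month/day/year
print(conform_date("14/03/2023", "%d/%m/%Y"))  # day/month/year
```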

Updating missing fields

- fill fields that are missing data if reasonable - may be caused by errors in original data
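
A minimal sketch of filling missing fields using pandas (assumed available); the column names and the "unknown" default are invented for illustration.

```python
import pandas as pd

# A toy record set with one missing phone number.
df = pd.DataFrame({"name": ["Ann", "Bob"], "phone": [None, "555-0100"]})

# Fill the missing field with an explicit default value.
df["phone"] = df["phone"].fillna("unknown")
print(df)
```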

Data cleansing

- has a definite process, but that process is flexible - not all organizations view quality in the same way, so not all organizations clean in the same way - all processes include first finding errors and then correcting them - not all organizations do it the same way, but it's important nevertheless

Data quality software

- helps with optimizing data quality - data profiling software is also available that helps with understanding data structure, relationships and content

Analyze

- includes selection of analytical techniques to use, building a model of the data, evaluation of analytical results, and creating reports and data visualizations to showcase the results - input data > select analysis techniques > model > model output

Three goals of Integrate

- integrate all data that is essential for our problem - clean the data to address data quality issues - transform the raw data to make it suitable for analysis

Decision support systems

- interactive computer-based systems which help decision makers utilize data and models to solve unstructured problems - couple the intellectual resources of individuals with the capabilities of the computer to improve the quality of decisions. It is a computer-based support system for management decision makers

Organize

- involves looking at the data to understand its nature, what it means, its quality and its format - part of the two-step data preparation process - aims for some preliminary exploration in order to gain a better understanding of the specific characteristics of the data

Implicit and explicit Nullness

- is absence of a value allowed? - implicit nulls: missing allowed - explicit nulls: what value is to be used if data is missing? (telephone number)

Prevention

- it can be very difficult to change data so quality data collection methods are necessary - includes both application and data structure design (dropdown boxes to minimize text entries, ranges, data types)

Data mining

- knowledge discovery from data - extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns of knowledge from huge amounts of data

Floating Data

- lack of clarity as to what types of data go into specific fields - data placed in wrong field - consider address 1 and address 2 fields: which is for the street address or are both for the street address?

Not data mining

- look up phone number in phone directory - query a web search engine for information about "Amazon"

Concerns about real-time DW

- not all data should be updated continuously - mismatch of reports generated minutes apart - may be cost prohibitive - may also be infeasible

3-tier architecture

- operational systems contain the data and the software for data acquisition in one tier, the application server is another tier, and the third tier includes the data warehouse - advantage: fewer resource constraints, enables data marts

Data cleansing guidelines

- planning - education - organization - separation of duties - prevention - documentation

Transformation errors

- reducing data field size may truncate existing data (Jones becomes Jon) - changing data field type may change existing data (a date becomes a number)

Descriptive analytics

- refers to knowing what is happening in an organization and understanding some underlying trends and causes of such occurrences - first involves consolidation of data sources - visualization is key to this exploratory analysis step - ex: dashboards

Active Data Warehouse

- strategic and tactical decisions - results measured with operations - comprehensive, detailed data available within minutes - high number of users accessing and querying the system simultaneously - flexible ad hoc reporting, as well as machine-assisted modeling to discover new hypotheses and relationships

Traditional Data Warehouse Environment

- strategic decisions only - results sometimes hard to measure - daily, weekly, monthly data currency acceptable; summaries often appropriate - moderate user concurrency - highly restrictive reporting used to confirm or check existing processes and patterns; often uses pre-developed summary tables or data marts

Association analysis

- the goal is to come up with a set of rules to capture associations within items or events. The rules are used to determine when items or events occur together - ex: the famous diaper-beer connection
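
A minimal sketch of the idea using only the standard library: count co-occurring item pairs and compute rule confidence. The transactions are invented to echo the diaper-beer example.

```python
from itertools import combinations
from collections import Counter

# Invented market-basket transactions.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

# Count single items and co-occurring pairs across all transactions.
item_counts = Counter(i for t in transactions for i in t)
pair_counts = Counter(p for t in transactions for p in combinations(sorted(t), 2))

# Confidence of the rule diapers -> beer: P(beer | diapers).
confidence = pair_counts[("beer", "diapers")] / item_counts["diapers"]
print(f"diapers -> beer: confidence {confidence:.2f}")  # 2/3 = 0.67
```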

Best Practices for implementing DW

- the project must fit with corporate strategy - it is important to manage user expectations - the data warehouse must be built incrementally - adaptability must be built in from the start - the project must be managed by both IT and business professionals - only load data that have been cleansed and are of high quality - do not overlook training requirements

Integrity

- this primarily applies to relationships within the data structure - all data in the structure should be retrievable regardless of where it is - this is why keys are so important in the data structure - best controlled at the time of data structure development

Parsing

- to divide a string into a group of tokens - used to find patterns in the data in order to standardize values - look for patterns (find two spaces in the name field, divide the field into first, middle and last)
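
A minimal sketch of the two-space pattern described above; the dictionary output and the fallback behavior are illustrative assumptions.

```python
def parse_name(field):
    """Divide a name field into first, middle and last tokens when possible."""
    tokens = field.split()
    if len(tokens) == 3:
        first, middle, last = tokens
        return {"first": first, "middle": middle, "last": last}
    # Pattern not matched: leave the raw value for manual review.
    return {"raw": field}

print(parse_name("Mary Ann Smith"))
```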

Overloaded Attributes

- too much information in one field - Joe Smith IRA - Joe Smith in Trust for Mary Smith - John and Mary Smith (implies they are married)

Regression

- when your model has to predict a numeric value instead of a category, then the task becomes a regression - example: predict the price of a stock
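
A minimal sketch with NumPy (assumed available): fit a straight line to a fabricated price series and predict the next value.

```python
import numpy as np

# Invented stock prices over five days.
days = np.array([1, 2, 3, 4, 5])
prices = np.array([10.0, 10.4, 10.9, 11.2, 11.8])

# Fit a degree-1 polynomial (a line) and predict a numeric value for day 6.
slope, intercept = np.polyfit(days, prices, deg=1)
print(f"predicted price on day 6: {slope * 6 + intercept:.2f}")
```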

Steps to start a big data project

1. Define the problem 2. Assess the situation 3. Define the purpose

Factors in selecting a DW architecture

1. Information independence between organizational units 2. Upper management's information needs 3. Urgency of need for a data warehouse 4. Nature of end-user tasks 5. Constraints on resources 6. Strategic view of the data warehouse prior to implementation 7. Compatibility with existing systems 8. Perceived ability of the in-house IT staff 9. Technical issues (particularly ETL) 10. Social/political factors

Important criteria in selecting an ETL tool

1. ability to read from and write to an unlimited number of data sources/architectures 2. automatic capturing and delivery of metadata 3. a history of conforming to open standards 4. an easy-to-use interface for the developer and the functional user

Issues affecting the purchase of an ETL tool

1. data transformation tools are expensive 2. data transformation tools may have a long learning curve

4 components to a BI system

1. data warehouse - storing and querying data 2. business analytics - for manipulating, mining and analyzing data 3. business performance management - monitoring and analyzing performance 4. user interface - controlling the system and visualizing data

Indirect benefits of data warehouse

1. enhance business knowledge 2. present competitive advantage 3. enhance customer service and satisfaction 4. facilitate decision making 5. help in reforming business processes

Conformity

For instances of similar/same data - same data type - same format - same size (date of graduation) - best controlled at the time of data structure creation

Pre-processing

Has two steps: organize and integrate

Summary statistics

Mode, mean, median and standard deviation provide numerical values to describe your data
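
A minimal sketch using Python's standard statistics module on an invented sample.

```python
import statistics

# Invented sample values.
values = [4, 8, 8, 5, 7, 6, 8, 9]

print("mean:  ", statistics.mean(values))
print("median:", statistics.median(values))
print("mode:  ", statistics.mode(values))
print("stdev: ", statistics.stdev(values))
```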

Classification

Predicting the weather as being sunny, rainy, windy or cloudy based on various factors involved
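
A minimal sketch assuming scikit-learn is available; the two numeric factors (humidity, pressure) and all values are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training data: [humidity %, pressure hPa] -> weather category.
X = [[85, 1008], [30, 1022], [70, 1012], [20, 1025], [90, 1005]]
y = ["rainy", "sunny", "cloudy", "sunny", "rainy"]

# Fit a decision tree and classify a new observation.
model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[80, 1009]]))  # likely "rainy" on this toy data
```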

Statistics

_____ and data mining both look for relationships within data - we first make a hypothesis and then collect sample data to test our hypothesis

Mathematical modeling

______ and data mining both look to make decisions to achieve the best possible performance of the system - can use data mining results as an input to provide optimal actions based on the restrictions and limitations of the system

Sorting

a common method for eliminating duplicates, but non-exact duplicates can pose a large problem
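
A minimal sketch of the sorting method, including the non-exact-duplicate problem it cannot solve; the records are invented.

```python
# After sorting, exact duplicates become adjacent and are found in one pass.
records = ["Jones, Ann", "Smith, Bob", "Jones, Ann", "Smyth, Bob"]

records.sort()
duplicates = [a for a, b in zip(records, records[1:]) if a == b]
print(duplicates)  # ['Jones, Ann'] -- but "Smyth, Bob" slips through
```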

Outlier

a data point that's distant from other data points. Plotting outliers will help check for errors or rare events in the data
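
A minimal sketch flagging points far from the mean; the two-standard-deviation threshold is an assumption, and the data are invented.

```python
import statistics

# Invented data with one distant point.
data = [10, 12, 11, 13, 12, 11, 45]

mean = statistics.mean(data)
stdev = statistics.stdev(data)

# Flag anything more than 2 standard deviations from the mean.
outliers = [x for x in data if abs(x - mean) > 2 * stdev]
print(outliers)  # [45]
```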

Data mart

a data warehouse that stores only relevant data (smaller and subject-oriented)

Independent data mart

a small data warehouse designed for a strategic business unit or a department (not dependent on the larger data warehouse); an alternative to high-cost data warehouses

Dependent data mart

a subset that is created directly from a data warehouse (dependent on the larger data warehouse)

Correlation graphs

can be used to explore the dependencies between different variables in the data
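
A correlation matrix is the usual input to such a graph; the sketch below computes one with pandas (assumed available) on invented columns.

```python
import pandas as pd

# Invented variables to explore for dependencies.
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40],
    "sales":    [12, 24, 31, 45],
    "returns":  [5, 4, 6, 5],
})

# Values near +1 or -1 suggest strong dependencies between variables.
print(df.corr())
```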

Artificial intelligence

enabling machines to become "smart"

Real-time DW

enabling real-time data updates for real-time analysis and real-time decision making

Clustering

grouping a company's customer base into distinct segments for more effective targeted marketing like seniors, adults and teenagers
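
A minimal sketch assuming scikit-learn is available: k-means on invented customer ages, with k=3 to mirror the teenager/adult/senior segments.

```python
from sklearn.cluster import KMeans

# Invented customer ages, one feature per customer.
ages = [[16], [17], [19], [34], [38], [41], [67], [70], [72]]

# Group the customer base into three distinct segments.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ages)
print(kmeans.labels_)  # cluster label per customer
```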

Capture

includes anything that involves retrieving data: finding, accessing, acquiring and moving data

Integrate

includes integration of multiple data sources, cleaning data, filtering data, creating datasets which programs can read and understand, such as packaging raw data using a specific data format

Quality

is a moving target - less important when doing exploratory data analysis; critical when using data for decision support

Transform

is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data

Integrated

places data from different sources into a consistent format

Visualization

- provides a quick and effective way to look at data - gives you an idea of where the hotspots are - what the distribution of the data is - shows correlations between variables

Web-based

provides an efficient computing environment for web-based applications

Real time

provides real time or active data access and analysis capabilities

Business analytics

refers to the application of models directly to business data. It involves the use of DSS tools, especially models, in assisting decision makers

Act

reporting insights from analysis and determining actions from insights based on the purpose you initially defined

Prescriptive analytics

seeks to recognize what is going on as well as the likely forecast and to make decisions to achieve the best performance possible

Data mining

serves as the foundation for AI and machine learning

Definition of Data Cleansing

the assessment of data to determine quality failures (inaccuracy, incompleteness) and then improving the quality by correcting, as far as possible, any errors found

Normalization

the process of organizing tables and columns in a relational database to reduce redundancy and improve data integrity
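
A minimal sketch using Python's built-in sqlite3 module: repeated customer details are kept once in their own table and referenced by key, reducing redundancy. Table and column names are illustrative assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Customer details live in one table; orders reference them by key
# instead of repeating the name and city on every order row.
con.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name TEXT,
        city TEXT
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount REAL
    );
""")
```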

Extract

the process of reading data from a database

Load

the process of writing the data into the target database

Extract, Transform, Load

three database functions that are combined into one tool to pull data out of one database and place it into another database
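
A minimal end-to-end sketch using Python's built-in sqlite3 module; the schemas and the fixed exchange-rate transformation rule are invented for illustration.

```python
import sqlite3

# Two in-memory databases standing in for source and target systems.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (amount_usd REAL)")
source.execute("INSERT INTO sales VALUES (100.0), (250.0)")
target.execute("CREATE TABLE dw_sales (amount_eur REAL)")

# Extract: read rows from the source database.
rows = source.execute("SELECT amount_usd FROM sales").fetchall()

# Transform: apply a rule (an assumed fixed exchange rate).
transformed = [(amount * 0.9,) for (amount,) in rows]

# Load: write the converted rows into the target database.
target.executemany("INSERT INTO dw_sales VALUES (?)", transformed)
print(target.execute("SELECT * FROM dw_sales").fetchall())
```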

Standardization

- use the legal name instead of a nickname (use Robert for Bob, Rob, Bobby and Robbie)
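
A minimal sketch built from the example above: a lookup table maps nicknames to the legal name.

```python
# Nickname-to-legal-name table from the example above.
LEGAL_NAMES = {"Bob": "Robert", "Rob": "Robert",
               "Bobby": "Robert", "Robbie": "Robert"}

def standardize(name):
    # Return the legal name if the value is a known nickname.
    return LEGAL_NAMES.get(name, name)

print(standardize("Bobby"))  # -> Robert
```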

Graphing the general trends

will show you whether there is a consistent direction in which the values of these variables are moving, such as sales prices going up or down

