Data Science and Big Data Analytics
Big Data
Data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.
Parallel computing environments and Massively Parallel Processing
Because of the complexity of big data, the preferred approach for processing it is in:
Structured
Data that has a defined data type, format, and structure. Example: transaction data and OLAP
Semi-Structured
Textual data files with a discernible pattern that enables parsing. Example: XML data files that are self-describing and defined by an XML schema
"Quasi" Structured
Textual data with erratic data formats that can be formatted only with effort, tools, and time. Example: web clickstream data that may contain inconsistencies in data values and formats
Unstructured
Data that has no inherent structure and is usually stored as different types of files. Example: text documents, PDFs, images, and videos
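The four structure types above can be illustrated with a minimal Python sketch. All sample data (the transaction rows, the XML snippet, the example.com URLs, and the helper `query_term`) is hypothetical and only meant to show how much parsing effort each type tends to need.

```python
import csv
import io
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs

# Structured: fixed columns with known types (hypothetical transaction rows).
structured = io.StringIO("txn_id,amount\n1001,19.99\n1002,5.00\n")
total = sum(float(row["amount"]) for row in csv.DictReader(structured))

# Semi-structured: self-describing XML; the tags expose a parseable pattern.
xml_doc = "<orders><order id='1001'><amount>19.99</amount></order></orders>"
order_ids = [order.get("id") for order in ET.fromstring(xml_doc).iter("order")]

# Quasi-structured: clickstream URLs whose casing and parameters vary,
# so some normalization is needed before the values are usable.
clicks = [
    "http://example.com/search?q=big+data",
    "HTTP://EXAMPLE.COM/search?Q=analytics&page=2",
]

def query_term(url):
    """Return the search term, tolerating inconsistent parameter casing."""
    params = {k.lower(): v for k, v in parse_qs(urlparse(url).query).items()}
    return params.get("q", [None])[0]

terms = [query_term(u) for u in clicks]

# Unstructured: free text with no schema; only generic operations apply.
text = "Big data requires new architectures."
word_count = len(text.split())
```

The structured and semi-structured inputs parse with a single library call, while the clickstream needs custom normalization and the free text supports only generic operations such as word counting.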
44x
By what factor is the volume of data projected to increase from 2010 to 2020?
Huge volume of data, complexity of data types and structures, and speed or velocity of new data creation.
What are the 3 defining characteristics of big data?
Massively Parallel Processing
Processing that enables simultaneous, parallel data ingest, loading, and analysis.
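The idea behind Massively Parallel Processing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads stand in for the independent nodes of a real MPP system, and the names `load_and_aggregate` and `parallel_mean` are hypothetical. Each worker handles its own partition, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def load_and_aggregate(partition):
    """Each worker ingests its own partition and returns a partial (sum, count)."""
    return sum(partition), len(partition)

def parallel_mean(partitions, workers=4):
    """Process all partitions concurrently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(load_and_aggregate, partitions))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

# Hypothetical data already split into three partitions.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
mean = parallel_mean(partitions)
```

Because each partition is processed without reference to the others, the same pattern scales to many nodes; only the small partial results need to be brought together.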
True
True/False: Because most big data is unstructured or semi-structured, it requires different techniques and tools to process and analyze.
Non-structured (semi-structured, quasi-structured, and unstructured)
Which categories of data structure make up 80-90% of big data?
High volumes of data
What is one characteristic of big data that remains consistent?