Lecture 8- Big Data
Data Lake
-A large data pool in which the schema and data requirements are not defined until the data is queried
-While a data warehouse stores data in a predefined target structure with detailed metadata, a data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed
-Serves as a corporate big data repository
-A suitable concept for the storage of big data, as big data is inherently less structured and typically kept in its raw format
-Big data analysis can be performed on data in data lakes or directly in the original sources
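The idea that a schema is not defined until the data is queried (often called schema-on-read) can be sketched as follows. This is a minimal illustration, not a real data lake: the raw records, the field names, and the `query_clicks` helper are all made up for the example.

```python
import json

# Raw records kept in their native format (here: JSON lines).
# These sample records are hypothetical.
RAW_LAKE = [
    '{"user": "ann", "event": "click", "ms": 120}',
    '{"user": "bob", "event": "view"}',
    '{"device": "sensor-7", "temp_c": 21.5}',
]

def query_clicks(raw_lines):
    """Apply a structure only at query time (schema-on-read)."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        # The query decides which fields matter; records that do not fit are skipped.
        if record.get("event") == "click":
            rows.append({"user": record["user"], "ms": record.get("ms")})
    return rows

clicks = query_clicks(RAW_LAKE)
print(clicks)
```

A data warehouse would instead reject or transform the sensor record at load time, because its target structure is fixed before any data arrives.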
Big Data- DWH vs. Data Lake
Analogy: Agriculture vs. Hunting/Gathering
-We can think of a data warehouse as an area of land devoted to agriculture
-We can think of a data lake as a wild area of an equivalent size that is simply fenced and designated for hunting and gathering
-The preparation of the data lake is much simpler and cheaper, but the yield is smaller, less predictable, and different in nature
-A fully developed data warehouse and a data lake are at opposite ends of a broad spectrum of solutions for large analytical data repositories
Big Data Techniques
MapReduce is a common big data technique
-Parallel computing divides complex tasks into smaller tasks that are performed in parallel on multiple computers
-Using multiple computers at the same time (parallel computing) vastly reduces the time needed for processing
-The MapReduce technique utilizes regular commodity (i.e., cheap) computers
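The notes do not spell out the map and reduce steps, so the following is a minimal single-machine sketch of the usual pattern, using word counting as the task. It assumes that worker threads stand in for the separate commodity computers of a real cluster; the function names and sample input are illustrative only.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key (word)."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups

def reduce_phase(item):
    """Reduce: combine the values for one key into a single result."""
    word, counts = item
    return word, sum(counts)

def word_count(chunks, workers=3):
    # Assumption: threads play the role of separate machines in a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))   # chunks are mapped in parallel
        grouped = shuffle(mapped)
        return dict(pool.map(reduce_phase, grouped.items()))

chunks = ["big data big lake", "data lake data", "big warehouse"]
counts = word_count(chunks)
print(counts)
```

Each chunk is processed independently in the map step, which is what lets the work be spread across many inexpensive machines.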
Characteristics of Big Data
Massive volumes of diverse and rapidly growing data that are not formally modeled
-Characteristics of Big Data: Volume, Velocity, Variety
-Heterogeneous
  -From various sources such as smart devices, social media, sensors, etc.
  -Variety of formats, such as:
    -Semi-structured: web logs, emails, tweets, etc.
    -Unstructured: text, video, audio, etc.
-Not modeled up front for pre-determined operational and/or analytical queries (retrievals)
-Can encompass 80-90% (or even more)
-Some of it may be of use and some (actually most) of it will not
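To make the semi-structured versus unstructured distinction concrete, here is a small sketch that pulls fields out of a web-log line. The sample line is hypothetical and assumes the widely used Common Log Format; the pattern and the `parse_log_line` helper are illustrative, not part of any particular tool.

```python
import re

# Hypothetical web-log line, assumed to follow the Common Log Format.
LOG_LINE = '203.0.113.9 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

def parse_log_line(line):
    """Return the recognizable fields of a log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

parsed = parse_log_line(LOG_LINE)
print(parsed)
print(parse_log_line("free-form text has no such fields"))
```

A semi-structured record has a recoverable internal layout even though nobody modeled it in advance, while free text offers no such fields to extract.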
Big Data Methods
Standard database and data warehousing techniques cannot adequately deal with the diversity and volume of big data
-Big data methods allow organizations to analyze and get insight from big data
-Big data methods do not replace the database and data warehousing approaches developed for managing and utilizing formally modeled data assets
-Instead, they allow organizations to analyze and get insight from the kinds of data that are not suited for regular database and data warehouse techniques
Three Types of Data stored in Corporations and Organizations
Transactional Structured Data- Operational databases- Data modeled/structured and stored for anticipated, pre-determined operational use
Analytical Structured Data- Data warehouses and data marts- Data modeled/structured and stored for anticipated, pre-determined analytical use
Unstructured/Semi-structured, Un-modeled Data- Big Data
Big Data
-A part of the overall data strategy
-Not a separate isolated initiative
Big Data as a term