Big Data Analytics

Pig

(a high-level scripting platform for analyzing large datasets; runs on Hadoop)

Variability

-Data flows can be highly inconsistent
-Data flows peak when something big is trending on social media (e.g., a high-profile IPO coming up)
-Challenge is to create an efficient infrastructure (e.g., storage and processing capabilities) to deal with the peaks and lows of data flows
-Cloud storage and processing is one of the options for companies to deal with these fluctuations

Velocity

-The speed at which data is produced and the speed at which it must be processed (i.e., captured, stored, and analyzed) to be useful
-RFID tags, automated sensors, GPS devices, and smart meters are increasing the need to deal with torrents of data in near-real time
-Reacting quickly enough to keep up is a challenge for most organizations
-Technologies to deal with velocity are still emerging, e.g., data stream analytics and in-motion analytics (see the sketch below)
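To make data stream analytics concrete, here is a minimal Python sketch of in-motion processing: each reading is analyzed the moment it arrives, over a sliding window, instead of being stored first and queried later. The sensor values and window size are illustrative assumptions, not from the source.

```python
from collections import deque

def stream_average(readings, window=5):
    """Maintain a rolling average over the most recent `window` readings,
    emitting an updated value as each new reading arrives."""
    buf = deque(maxlen=window)
    for value in readings:
        buf.append(value)
        yield sum(buf) / len(buf)

# Simulated sensor feed: each value is processed as it arrives,
# without ever storing the full stream.
sensor_feed = [21.0, 21.4, 22.1, 35.9, 22.3, 21.8]
for avg in stream_average(sensor_feed):
    print(f"rolling average: {avg:.2f}")
```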

Hive

A data warehouse infrastructure to facilitate querying large datasets in distributed storage. Built on the Hadoop platform. Querying can be done using an SQL-like language, _____QL (see the query sketch below).
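As a hedged illustration of the SQL-like querying described above, here is a minimal sketch using PyHive, a third-party Python client for HiveServer2. The host, port, and the `sales` table with its columns are illustrative assumptions, not from the source.

```python
from pyhive import hive

# Connect to a (hypothetical) local HiveServer2 instance on its default port.
conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# The SQL-like query language lets analysts aggregate data that physically
# lives in distributed storage as if it were a relational table.
cursor.execute(
    "SELECT product_category, COUNT(*) AS orders "
    "FROM sales GROUP BY product_category"
)
for category, orders in cursor.fetchall():
    print(category, orders)
```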

Value

Big Data by itself, regardless of the size, type, or speed, is worthless. Big Data + "big" analytics = ____. With the ____ proposition, Big Data also brought about big challenges: effectively and efficiently capturing, storing, and analyzing Big Data. A new breed of technologies is needed (developed, purchased, hired, or outsourced ...).

Variety

Data comes in all types of formats
-Traditional databases, text documents, emails, video, audio, web, and sensor data
-80% to 85% of company data is in some sort of unstructured or semi-structured format
Challenges
-Tools and techniques to process the variety of data
-Merging the different data formats to get an integrated view of the data and information about customers and operations

C

In-memory analytics A. combines hardware, software, and storage in a single unit for performance. B. uses many processors in parallel. C. stores and processes the complete data set in RAM. D. Both (b) and (c).

Big Data core technology

MapReduce + Hadoop

HBase

Open source, non-relational distributed database modeled on Google's BigTable. Column-oriented database. Suited for sparse datasets (a usage sketch follows).
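A minimal sketch of the column-oriented, sparse-friendly model, using the HappyBase Python client (which assumes an HBase Thrift gateway is running locally). The table name, column family, and row keys are illustrative assumptions.

```python
import happybase

connection = happybase.Connection("localhost")
table = connection.table("users")

# Columns are addressed as family:qualifier; a row only stores the columns
# it actually has, which is why HBase suits sparse datasets.
table.put(b"user-001", {b"profile:name": b"Ada", b"profile:email": b"ada@example.com"})
table.put(b"user-002", {b"profile:name": b"Lin"})  # no email column stored at all

print(table.row(b"user-001"))
```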

In-memory analytics

Storing and processing the complete data set in RAM. Allows analytical computations and big-data processing in-memory, distributed across a dedicated set of nodes. Example tools: R, Tableau (a minimal sketch follows).
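A minimal pandas sketch of the idea: the entire data set lives in RAM, so analytical computations run without disk round-trips. The data values are illustrative, not from the source.

```python
import pandas as pd

# The whole data set is held in memory as a DataFrame.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "west"],
    "revenue": [120.0, 95.5, 210.0, 87.25],
})

# Every computation below operates on the RAM-resident data.
print(sales.groupby("region")["revenue"].sum())
print(sales["revenue"].describe())
```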

Grid Computing & MPP

Use of many machines and processors in parallel (MPP - massively parallel processing); see the sketch below.
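A minimal sketch of the parallel-processing idea using Python's standard multiprocessing module: the work is split into chunks that independent worker processes compute simultaneously. The chunk contents and the squaring task are illustrative.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Each worker handles its own slice of the data independently.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    chunks = [range(0, 1000), range(1000, 2000), range(2000, 3000)]
    with Pool(processes=3) as pool:
        partials = pool.map(process_chunk, chunks)  # runs in separate processes
    print(sum(partials))  # combine partial results into the final answer
```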

A

Which of the following describes the 'Variability' feature of big data? A. Data flows are characterized by 'peaks' and 'lows'. B. Data streaming analytics is one of the technologies to deal with variability. C. Data comes in all types of formats. D. None of the above

B

Which of the following describes the 'Variety' feature of big data? A. Data flows are characterized by 'peaks' and 'lows'. B. Data comes in all types of formats. C. Data streaming analytics is one of the technologies to deal with variety. D. Both (b) and (c)

Cassandra

An open source, distributed database system designed to handle large amounts of data spread across many commodity servers

Drill

An open source SQL query engine for interactively querying big data (e.g., data in Hadoop and NoSQL stores)

C

Big Data comes from all of the following sources, EXCEPT: A. Web logs B. Medical records C. MapReduce D. RFID

C

Cloud storage and processing is an option to deal with which of the following features of big data? A. Variety B. Velocity C. Variability D. All of the above

Appliances

Combine hardware, software, and storage in a single unit for performance and scalability

Volume

Most common trait of Big Data. Some factors that contribute to the exponential increase in data:
-Transaction-based data stored through the years
-Text data constantly streaming in from social media
-Increasing amounts of sensor data being collected
-Automatically generated RFID and GPS data
Storage is not the main issue now; the challenge is determining the relevance of the data and creating value from it. "Big" is a relative term as technology progresses (a terabyte is no longer a big deal as we move toward petabytes and zettabytes).

HCatalog

A table and storage management layer for Hadoop. Tables created using HCatalog can be accessed directly through MapReduce or Pig.

Sqoop

A tool for efficiently transferring bulk data between Hadoop and structured data stores (e.g., relational databases)

B

Which of the following is a challenge caused by the 'Velocity' feature of big data? A. Creating an efficient infrastructure to deal with the 'peaks' and 'lows' of big data. B. Reacting quickly enough to deal with torrents of data in near-real-time. C. Merging the different data formats to get an integrated view of operations. D. Both (a) and (b)

B

Which of the following is a feature of Hadoop? A. It stores and processes the complete data set in RAM. B. Hadoop clusters run on inexpensive commodity hardware. C. Processes only unstructured stream data. D. It is available as a single integrated software.

A

Which of the following is an example where MapReduce is used? A. Indexing the Web for search B. Updating a data warehouse C. Running SQL query on a large relational database D. Both (a) and (b)

C

Which of the following is true about Hadoop? A. Hadoop is a database management system for big data. B. Hadoop clusters run only on appliance hardware. C. MapReduce is a parallel processing model built on Hadoop. D. All of the above

C

Which of the following statements about Hadoop is NOT a myth? A. Hadoop is a single integrated product. B. Hadoop deals only with data volume. C. Hadoop and MapReduce are related, but not the same. D. Hadoop complements a data warehouse and rarely replaces it.

A

Which of the following statements is true about MapReduce? A. MapReduce is good at processing large volumes of multi-structured data in a timely manner. B. MapReduce is the name given to HDFS by the open source community. C. MapReduce is made up of a Name Node and Secondary Nodes. D. Both (a) and (b)

MapReduce

Distributes the processing of very large multi-structured data files across a large cluster of ordinary machines/processors. Goal: achieve high performance with "simple" computers. Developed and popularized by Google. Good at processing and analyzing large volumes of multi-structured data in a timely manner. Example tasks: indexing the Web for search, graph analysis, text analysis, machine learning, ...
How it works (a word-count sketch follows):
-The system first reads the input file and splits it into multiple pieces
-These splits are processed by multiple map programs running in parallel on the nodes of a cluster
-Each map program performs part of the computation (e.g., grouping data by color and shape)
-The output from each map program is collected and merged as input to the reduce program
-The reduce program performs the final computation based on the input given (e.g., sorting, counting)
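A minimal, single-machine Python sketch of the split/map/merge/reduce flow described above, applied to word counting; real MapReduce runs the same phases in parallel across cluster nodes. The input strings are illustrative.

```python
from collections import defaultdict

def map_phase(split):
    # Map: emit a (word, 1) pair for every word in this split of the input.
    return [(word, 1) for word in split.split()]

def shuffle(mapped_pairs):
    # Merge step: group all emitted values by key before the reduce phase.
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # Reduce: final computation per key, here counting occurrences.
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["big data big analytics", "data stream data"]  # pieces of the input file
mapped = [pair for split in splits for pair in map_phase(split)]
print(reduce_phase(shuffle(mapped)))  # {'big': 2, 'data': 3, 'analytics': 1, 'stream': 1}
```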

Hadoop

An open source framework for storing and analyzing massive amounts of distributed, unstructured data. Created by Doug Cutting. Hadoop clusters run on inexpensive commodity hardware so projects can scale out inexpensively. Now part of the Apache Software Foundation; as open source, hundreds of contributors continuously improve the core technology.
How it works (a conceptual sketch follows):
-Access unstructured and semi-structured data (e.g., log files, social media feeds, other data sources)
-Break the data up into "parts," which are then loaded into a file system made up of multiple nodes running on commodity hardware using HDFS
-Each "part" is replicated multiple times and loaded into the file system for replication and failsafe processing
-A node acts as the Facilitator and another as Job Tracker
-Jobs are distributed to the clients, and once completed, the results are collected and aggregated using MapReduce
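A conceptual, pure-Python sketch of the split-and-replicate idea described above: the input is broken into fixed-size parts, and each part is placed on several nodes so processing can survive a node failure. The block size, node names, and replication factor are illustrative assumptions, not HDFS defaults.

```python
from itertools import cycle

def split_into_parts(data: bytes, block_size: int):
    # Break the input into fixed-size "parts" (blocks).
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(parts, nodes, replication=3):
    # Assign each part to `replication` distinct nodes, round-robin style.
    placement = {}
    ring = cycle(range(len(nodes)))
    for idx, _ in enumerate(parts):
        start = next(ring)
        placement[idx] = [nodes[(start + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"massive amounts of distributed unstructured data" * 4
parts = split_into_parts(data, block_size=64)
print(place_replicas(parts, nodes=["node1", "node2", "node3", "node4"]))
```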

