Chapter 7 Study guide

A job _____ is a node in a Hadoop cluster that initiates and coordinates MapReduce jobs, or the processing of the data

tracker

In open source databases, the most important performance enhancement to date is the cost-based __________

optimizer

All of the following statements about MapReduce are true except

MapReduce runs without fault tolerance

HBase is a nonrelational _______ that allows for low-latency, quick lookups in Hadoop

database

In the Alternative Data for Market Analysis or Forecasts case study, satellite data was NOT used for

monitoring individual customer patterns

A newly popular unit of data in the Big Data era is the petabyte which is

10^15 bytes
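
For scale, a quick back-of-the-envelope conversion in Python (a minimal sketch; the decimal SI prefixes are standard units, not specific to this study guide):

    # Decimal (SI) byte units; a petabyte is 10**15 bytes.
    units = {"kilobyte": 10**3, "megabyte": 10**6, "gigabyte": 10**9,
             "terabyte": 10**12, "petabyte": 10**15, "exabyte": 10**18}
    for name, size in units.items():
        print(f"1 {name} = {size:,} bytes")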

List and describe four of the most critical success factors for Big Data

A clear business need: there must be a valid business reason, one that results in financial benefits, to invest in Big Data analytics, not simply a desire to try new technology.
Strong, committed sponsorship: it is difficult to succeed without it; for a departmental effort, sponsorship can be at the department level.
Alignment between the business and IT strategy: the analytics work should always support the business strategy and help the business succeed.
A strong data infrastructure: this requires marrying the old with the new to create a holistic infrastructure that works together.

What is Big Data's relationship to the cloud?

Amazon and Google have working Hadoop Cloud offerings

How does Hadoop work?

It breaks up big data into multiple parts so each part can be processed and analyzed at the same time on multiple computers
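
A minimal Python sketch of that idea, using the standard-library multiprocessing module to stand in for a cluster of machines (an illustration of split-and-combine, not Hadoop's actual API):

    from multiprocessing import Pool

    def process_part(lines):
        # Stand-in for per-node work: count the words in one chunk of the data.
        return sum(len(line.split()) for line in lines)

    if __name__ == "__main__":
        data = ["big data line %d with some words" % i for i in range(1000)]
        # Split the data into parts, one per "node".
        parts = [data[i::4] for i in range(4)]
        with Pool(4) as pool:
            partial_counts = pool.map(process_part, parts)  # parts processed at the same time
        print(sum(partial_counts))  # combine the partial results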

What is NoSQL as used for Big Data? Describe its major downsides

It is a style of database that processes large volumes of multi-structured data. It is designed to serve up discrete data stored in large volumes to end users and to automated Big Data applications. The downside is that many NoSQL tools lack mature management and monitoring tools.
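
A minimal sketch of the key-value/document style many NoSQL stores use, with plain Python dictionaries standing in for a store such as MongoDB or HBase (the record keys and fields are made up for illustration):

    # Schemaless "documents": records in the same store need not share the same columns.
    store = {}
    store["user:1001"] = {"name": "Alice", "clicks": 42}
    store["user:1002"] = {"name": "Bob", "location": "NYC", "tags": ["mobile", "trial"]}

    # Low-latency lookup by key, the access pattern NoSQL stores are built for.
    print(store["user:1002"]["location"])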

Define MapReduce

It is a technique popularized by Google that distributes the processing of very large multi-structured data files across a large cluster of machines.
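
A minimal word-count sketch of the map-shuffle-reduce pattern in plain, single-process Python (real MapReduce distributes the map and reduce phases across a cluster; the documents here are made up):

    from collections import defaultdict

    def map_phase(document):
        # Emit (key, value) pairs: one ("word", 1) per word.
        for word in document.split():
            yield word.lower(), 1

    def reduce_phase(word, counts):
        # Combine all the values emitted for the same key.
        return word, sum(counts)

    documents = ["Big Data is big", "MapReduce processes big data"]

    # Shuffle: group the mapped pairs by key.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_phase(doc):
            grouped[word].append(count)

    print(dict(reduce_phase(w, c) for w, c in grouped.items()))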

Describe data stream mining and how it is used

It is the process of extracting novel patterns and knowledge structures from continuous flows of ordered sequences of instances that can be read only a small number of times using limited computing and storage capabilities.
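
One way to make this concrete: a single-pass frequent-items sketch in Python (this uses the Misra-Gries algorithm, a standard stream-mining technique not named in the study guide; the click stream is made up). Each instance is read once, with a bounded number of counters:

    def frequent_items(stream, k=3):
        # Misra-Gries: at most k-1 counters are kept; truly frequent items survive.
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k - 1:
                counters[item] = 1
            else:
                # Decrement every counter; drop the ones that reach zero.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

    clicks = ["home", "cart", "home", "search", "home", "cart", "home"]
    print(frequent_items(clicks))  # candidate frequent pages after one pass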

In the world of Big Data, ________ aids organizations in processing and analyzing large volumes of multistructured data. Examples include indexing and search, and graph analysis

MapReduce

HBase, Cassandra, MongoDB, and Accumulo are examples of _____ databases

NoSQL

Which of the following sources is likely to produce Big Data the fastest?

RFID Tags

What are the differences between stream analytics and perpetual analytics? When would you use one or the other?

Stream analytics applies transaction-level logic to real-time observations within a defined window; perpetual analytics evaluates every incoming observation against all prior observations, where there is no window size. Use stream analytics when the volume of transactions is high and the time to decision is short; use perpetual analytics when the volume can be managed in real time and each observation must be weighed against everything seen before.
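
A minimal Python sketch of the contrast, using a made-up threshold rule on a stream of numeric readings: the stream version decides from a fixed-size window of recent observations, while the perpetual version evaluates each arrival against everything seen so far:

    from collections import deque

    def stream_alert(values, window=5, factor=2.0):
        # Stream analytics: decide using only a fixed-size window of recent observations.
        recent = deque(maxlen=window)
        for v in values:
            if recent and v > factor * (sum(recent) / len(recent)):
                yield ("window-alert", v)
            recent.append(v)

    def perpetual_alert(values, factor=2.0):
        # Perpetual analytics: every observation is evaluated against all prior observations.
        history = []
        for v in values:
            if history and v > factor * (sum(history) / len(history)):
                yield ("history-alert", v)
            history.append(v)

    readings = [10, 11, 9, 12, 10, 30, 11, 10, 28]
    print(list(stream_alert(readings)), list(perpetual_alert(readings)))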

Why are some portions of tape backup workloads being redirected to Hadoop clusters today?

They are being redirected to Hadoop clusters because data on tape is difficult and slow to retrieve, and it is often lost.

There is a clear difference between the type of information support provided by influential users versus the others on Twitter

True

List and describe the three main "V's" that characterize big data

Volume: the amount of data being produced, for example from social media, sensors, and GPS devices.
Variety: the different types of data, which can be structured, semi-structured, or unstructured.
Velocity: how fast the data is being produced and how fast it must be processed.

In a Hadoop "stack", what is a slave node?

a node where data is stored and processed

The problem of forecasting economic activity or microclimates based on a variety of data beyond the usual retail data is a very recent phenomenon and has led to another buzzword: ____

alternative data

Using data to understand customers/clients and business operations to sustain and foster growth and profitability is

an increasingly challenging task for today's enterprises

In-motion ___________ is often overlooked today in the world of BI and Big Data

analytics

___________ bring together hardware and software in a physical unit that is not only fast but also scalable on an as-needed basis

appliances

In the financial services industry, Big Data can be used to improve

both A and B (regulatory oversight and decision making)

As volumes of Big Data arrive from multiple sources such as sensors, machines, social media, and clickstream interactions, the first step is to ________ all the data reliably and cost effectively

capture

In the Analyzing Disease Patterns from an Electronic Medical Records Data Warehouse case study, what was the analytic goal?

determine differences in rates of disease in urban and rural populations

Hadoop is primarily a(n) ________ file system and lacks capabilities we'd associate with a DBMS, such as indexing, random access to data, and support for SQL.

distributed

In network analysis, what connects nodes?

edges
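
A minimal adjacency-list sketch in Python (the node names are made up): each edge connects a pair of nodes, and a node's degree counts the edges touching it:

    # Undirected network stored as an adjacency list: edges connect pairs of nodes.
    edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("carol", "dave")]

    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    # Degree = number of edges touching a node, a basic network-analysis measure.
    print({node: len(neighbors) for node, neighbors in graph.items()})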

As the size and the complexity of analytical systems increase, the need for more __________ analytical systems is also increasing to obtain the best performance

efficient

Big Data comes from __________

everywhere

In the Salesforce case study, streaming data is used to identify services that customers use most

false

Big Data uses commodity hardware, which is expensive, specialized hardware that is custom built for a client or application

false

Big Data simplifies data governance issues, especially for global firms

false

Hadoop and MapReduce require each other to work

false

In most cases, Hadoop is used to replace data warehouses

false

Which Big Data approach promotes efficiency, lower cost, and better performance by processing jobs in a shared, centrally managed pool of IT resources?

grid computing

__________ speeds time to insights and enables better data governance by performing data integration and analytic functions inside the database

in-database analytics
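
A minimal sketch of the idea using Python's built-in sqlite3 module as a stand-in database (the table and column names are made up): the aggregation runs inside the database engine, and only the small summary result crosses into the application:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("east", 120.0), ("east", 90.0), ("west", 200.0)])

    # In-database analytics: the SUM/GROUP BY runs inside the engine;
    # only the summarized result is returned to the application.
    for region, total in conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"):
        print(region, total)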

Allowing Big Data to be processed in memory and distributed across a dedicated set of nodes can solve complex problems in near real time with highly accurate insights. What is this process called?

in-memory analytics

________ of data provides business value; pulling data from multiple subject areas and numerous applications into one repository is the raison d'être for data warehouses

integration

The ____ Node in a Hadoop cluster provides client information on where in the cluster particular data is stored and if any nodes fail

name

In the Twitter case study, how did influential users support their tweets?

objective data

Big Data employs __________ processing techniques and nonrelational data storage capabilities in order to process unstructured and semistructured data

parallel

In a Hadoop "stack", what node periodically replicates and stores data from the Name Node should it fail?

secondary node

In the energy industry, _____________ grids are one of the most impactful applications of stream analytics

smart

Companies with the largest revenues from Big Data tend to be

the largest computer and IT services firms

Traditional data warehouses have not been able to keep up with

the variety and complexity of data

Current total storage capacity lags behind the digital information being generated in the world

true

Despite their potential, many current NoSQL tools lack mature management and monitoring tools

true

Big Data is being driven by the exponential growth, availability, and use of information

true

For low latency, interactive reports, a data warehouse is preferable to Hadoop

true

Hadoop was designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel

true

If you have many flexible programming languages running in parallel, Hadoop is preferable to a data warehouse

true

In Application Case 7.6, Analyzing Disease Patterns from an Electronic Medical Records Data Warehouse, it was found that urban individuals have a higher number of diagnosed disease conditions

true

In the opening vignette, Access Telecom built a system to better visualize customers who were unhappy before they canceled their service

true

It is important for Big Data and self service business intelligence to go hand in hand to get maximum value from analytics

true

MapReduce can be easily understood by skilled programmers due to its procedural nature

true

Satellite data can be used to evaluate the activity at retail locations as a source of alternative data

true

Social media mentions can be used to chart and predict flu outbreaks

true

The quality and objectivity of information disseminated by influential users of Twitter is higher than that disseminated by noninfluential users

true

The term "Big Data" is relative as it depends on the size of the using organization

true

Under which of the following requirements would it be more appropriate to use Hadoop over a data warehouse?

unrestricted, ungoverned, and sandbox explorations

What is the Hadoop Distributed File system designed to handle?

unstructured and semistructured non-relational data

The _________ of Big Data is its potential to contain more useful patterns and interesting anomalies than "small" data

value proposition

Data flows can be highly inconsistent, with periodic peaks, making data loads hard to manage. What is this feature of Big Data called?

variability

Organizations are working with data that meets the three V's (variety, volume, and __________) characterizations

velocity

___________ refers to the conformity to fact: accuracy, quality, truthfulness, or trustworthiness of the data

veracity

When considering Big Data projects and architecture, list and describe five challenges designers should be mindful of in order to make the journey to analytics competency less stressful

Data integration: the ability to combine data of different types quickly and at a reasonable cost.
Data volume: the ability to capture, store, and process huge volumes of data at an acceptable speed.
Processing capabilities: the ability to process the captured data quickly.
Data governance: the ability to keep up with the security, privacy, ownership, and quality issues of Big Data.
Skills availability: Big Data is being harnessed with new tools and looked at in new ways, and people with the skills to do this are in short supply.

List and briefly discuss the three characteristics that define and make the case for data warehousing

Data warehouse performance: for example, the cost-based optimizer improves query performance.
Integration of data that provides business value and answers business questions.
Interactive BI tools that provide access to data warehouse insights.

In the opening vignette, why was the Telecom company so concerned about the loss of customers, if customer churn is common in that industry?

Even though churn is common in the industry, the company was losing customers at a very high rate, which was concerning. It hoped that by analyzing this trend it could control churn and reduce the number of customers leaving.

