Chapter 7

All of the following statements about MapReduce are true EXCEPT

MapReduce runs without fault tolerance.

Hadoop is primarily a(n) ________ file system and lacks capabilities we'd associate with a DBMS, such as indexing, random access to data, and support for SQL.

distributed

In a network analysis, what connects nodes?

edges

Big Data comes from ________.

everywhere

Allowing Big Data to be processed in memory and distributed across a dedicated set of nodes can solve complex problems in near-real time with highly accurate insights. What is this process called?

in-memory analytics

In the world of Big Data, ________ aids organizations in processing and analyzing large volumes of multistructured data. Examples include indexing and search, graph analysis, etc.

MapReduce

________ refers to the conformity to facts: accuracy, quality, truthfulness, or trustworthiness of the data.

Veracity

HBase is a nonrelational ________ that allows for low-latency, quick lookups in Hadoop.

database

In the financial services industry, Big Data can be used to improve

regulatory oversight and decision making

MapReduce can be easily understood by skilled programmers due to its procedural nature.

True

Social media mentions can be used to chart and predict flu outbreaks.

True

The quality and objectivity of information disseminated by influential users of Twitter is higher than that disseminated by noninfluential users.

True

The term "Big Data" is relative, as it depends on the size of the organization using it.

True

In most cases, Hadoop is used to replace data warehouses.

False

In the Salesforce case study, streaming data is used to identify services that customers use most.

False

In the opening vignette, Access Telecom (AT) built a system to better visualize customers who were unhappy before they canceled their service.

True

It is important for Big Data and self-service business intelligence to go hand in hand to get maximum value from analytics.

True

Satellite data can be used to evaluate the activity at retail locations as a source of alternative data.

True

There is a clear difference between the type of information support provided by influential users versus the others on Twitter.

True

The problem of forecasting economic activity or microclimates based on a variety of data beyond the usual retail data is a very recent phenomenon and has led to another buzzword — ________.

alternative data

Using data to understand customers/clients and business operations to sustain and foster growth and profitability is

an increasingly challenging task for today's enterprises.

________ of data provides business value; pulling of data from multiple subject areas and numerous applications into one repository is the raison d'être for data warehouses.

Integration

In the Analyzing Disease Patterns from an Electronic Medical Records Data Warehouse case study, what was the analytic goal?

determine differences in rates of disease in urban and rural populations

In a Hadoop "stack," what is a slave node?

a node where data is stored and processed

As volumes of Big Data arrive from multiple sources such as sensors, machines, social media, and clickstream interactions, the first step is to ________ all the data reliably and cost effectively.

capture

In a Hadoop "stack," what node periodically replicates and stores data from the Name Node should it fail?

secondary node

In the energy industry, ________ grids are one of the most impactful applications of stream analytics.

smart

Traditional data warehouses have not been able to keep up with

the variety and complexity of data.

A job ________ is a node in a Hadoop cluster that initiates and coordinates MapReduce jobs, or the processing of the data.

tracker

Data flows can be highly inconsistent, with periodic peaks, making data loads hard to manage. What is this feature of Big Data called?

variability

When considering Big Data projects and architecture, list and describe five challenges designers should be mindful of in order to make the journey to analytics competency less stressful.

• Data volume: The ability to capture, store, and process the huge volume of data at an acceptable speed so that the latest information is available to decision makers when they need it.
• Data integration: The ability to combine data that is not similar in structure or source and to do so quickly and at reasonable cost.
• Processing capabilities: The ability to process the data quickly, as it is captured. The traditional way of collecting and then processing the data may not work. In many situations data needs to be analyzed as soon as it is captured to leverage the most value.
• Data governance: The ability to keep up with the security, privacy, ownership, and quality issues of Big Data. As the volume, variety (format and source), and velocity of data change, so should the capabilities of governance practices.
• Skills availability: Big Data is being harnessed with new tools and is being looked at in different ways. There is a shortage of data scientists with the skills to do the job.
• Solution cost: Since Big Data has opened up a world of possible business improvements, there is a great deal of experimentation and discovery taking place to determine the patterns that matter and the insights that turn to value. To ensure a positive ROI on a Big Data project, it is therefore crucial to reduce the cost of the solutions used to find that value.

Companies with the largest revenues from Big Data tend to be

the largest computer and IT services firms.

As the size and the complexity of analytical systems increase, the need for more ________ analytical systems is also increasing to obtain the best performance.

efficient

What is the Hadoop Distributed File System (HDFS) designed to handle?

unstructured and semistructured non-relational data

The ________ of Big Data is its potential to contain more useful patterns and interesting anomalies than "small" data.

value proposition

Organizations are working with data that meets the three V's (variety, volume, and ________) characterizations.

velocity

How does Hadoop work?

It breaks up Big Data into multiple parts so each part can be processed and analyzed at the same time on multiple computers.
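
The split-and-process-in-parallel idea above can be sketched on a single machine (a toy stand-in only: a Python thread pool plays the role of the Hadoop cluster, and the chunking scheme and function names are assumptions for illustration, not Hadoop's actual API):

```python
from concurrent.futures import ThreadPoolExecutor

def count_records(chunk):
    """The work done independently on one chunk (one 'node')."""
    return len(chunk)

def process_in_parallel(records, n_parts=4):
    # Break the data into roughly equal parts ...
    size = -(-len(records) // n_parts)  # ceiling division
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    # ... process every part at the same time, then combine the results.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        return sum(pool.map(count_records, chunks))

print(process_in_parallel(list(range(1000))))  # 1000
```

Unlike this sketch, real Hadoop distributes the chunks across separate machines and transparently reassigns work when a node fails.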

The ________ Node in a Hadoop cluster provides client information on where in the cluster particular data is stored and if any nodes fail.

Name

HBase, Cassandra, MongoDB, and Accumulo are examples of ________ databases.

NoSQL

Which Big Data approach promotes efficiency, lower cost, and better performance by processing jobs in a shared, centrally managed pool of IT resources?

grid computing

In the Alternative Data for Market Analysis or Forecasts case study, satellite data was NOT used for

monitoring individual customer patterns.

In the Twitter case study, how did influential users support their tweets?

objective data

A newly popular unit of data in the Big Data era is the petabyte (PB), which is

10^15 bytes.

What is Big Data's relationship to the cloud?

Amazon and Google have working Hadoop cloud offerings.

Define MapReduce.

As described by Dean and Ghemawat (2004), MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
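
The model Dean and Ghemawat describe can be sketched in miniature with the classic word-count example. This is plain single-process Python, not Hadoop's actual API; the function names (map_fn, reduce_fn, map_reduce) are illustrative assumptions:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    """Map phase: emit a (word, 1) pair for every word in a document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce phase: sum the counts emitted for one word."""
    return (word, sum(counts))

def map_reduce(documents):
    # Map: apply map_fn to every input record.
    intermediate = [pair for doc in documents for pair in map_fn(doc)]
    # Shuffle: group the intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce: apply reduce_fn to each group of values sharing a key.
    return dict(
        reduce_fn(word, (count for _, count in group))
        for word, group in groupby(intermediate, key=itemgetter(0))
    )

print(map_reduce(["big data big insight", "big cluster"]))
# {'big': 3, 'cluster': 1, 'data': 1, 'insight': 1}
```

In a real MapReduce implementation the map and reduce calls run in parallel on many commodity machines; the framework, not the programmer, handles the partitioning, shuffling, and fault tolerance.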

________ speeds time to insights and enables better data governance by performing data integration and analytic functions inside the database.

In-database analytics

Which of the following sources is likely to produce Big Data the fastest?

RFID tags

In the opening vignette, why was the Telecom company so concerned about the loss of customers, if customer churn is common in that industry?

The company was concerned because it was losing customers at a very high rate, faster than it was gaining them. Additionally, the company had identified that the loss of these customers could be traced back to customer service interactions. Because of this, the company felt that the loss of customers was something that could be analyzed and, hopefully, controlled.

Big Data is being driven by the exponential growth, availability, and use of information.

True

Current total storage capacity lags behind the digital information being generated in the world.

True

Despite their potential, many current NoSQL tools lack mature management and monitoring tools.

True

In-motion ________ is often overlooked today in the world of BI and Big Data.

analytics

In open-source databases, the most important performance enhancement to date is the cost-based ________.

optimizer

Big Data employs ________ processing techniques and nonrelational data storage capabilities in order to process unstructured and semistructured data.

parallel

Under which of the following requirements would it be more appropriate to use Hadoop over a data warehouse?

unrestricted, ungoverned sandbox explorations

List and describe four of the most critical success factors for Big Data analytics.

• A clear business need (alignment with the vision and the strategy). Business investments ought to be made for the good of the business, not for the sake of mere technology advancements. Therefore, the main driver for Big Data analytics should be the needs of the business at any level: strategic, tactical, and operational.
• Strong, committed sponsorship (executive champion). It is a well-known fact that without strong, committed executive sponsorship, it is difficult (if not impossible) to succeed. If the scope is a single or a few analytical applications, the sponsorship can be at the departmental level. However, if the target is enterprise-wide organizational transformation, which is often the case for Big Data initiatives, sponsorship needs to be at the highest levels and organization-wide.
• Alignment between the business and IT strategy. It is essential to make sure that the analytics work always supports the business strategy, and not the other way around. Analytics should play the enabling role in the successful execution of the business strategy.
• A fact-based decision-making culture. In a fact-based decision-making culture, the numbers, rather than intuition, gut feeling, or supposition, drive decision making. There is also a culture of experimentation to see what works and what doesn't. To create a fact-based decision-making culture, senior management needs to do the following: recognize that some people can't or won't adjust; be a vocal supporter; stress that outdated methods must be discontinued; ask to see what analytics went into decisions; and link incentives and compensation to desired behaviors.
• A strong data infrastructure. Data warehouses have provided the data infrastructure for analytics. This infrastructure is changing and being enhanced in the Big Data era with new technologies. Success requires marrying the old with the new for a holistic infrastructure that works synergistically.

What are the differences between stream analytics and perpetual analytics? When would you use one or the other?

• In many cases they are used synonymously. However, in the context of intelligent systems, there is a difference. Streaming analytics involves applying transaction-level logic to real-time observations. The rules applied to these observations take into account previous observations as long as they occurred in the prescribed window; these windows have some arbitrary size (e.g., last 5 seconds, last 10,000 observations, etc.). Perpetual analytics, on the other hand, evaluates every incoming observation against all prior observations, where there is no window size. Recognizing how the new observation relates to all prior observations enables the discovery of real-time insight.
• When transactional volumes are high and the time-to-decision is too short, favoring nonpersistence and small window sizes, this translates into using streaming analytics. However, when the mission is critical and transaction volumes can be managed in real time, then perpetual analytics is a better answer.
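
The window-versus-no-window distinction can be made concrete with a toy sketch (the class names and the running-average logic are illustrative assumptions, not part of any streaming product):

```python
from collections import deque

class StreamingAverage:
    """Streaming analytics: average over only the last `window` observations."""
    def __init__(self, window=5):
        self.recent = deque(maxlen=window)  # old values fall out automatically

    def observe(self, value):
        self.recent.append(value)
        return sum(self.recent) / len(self.recent)

class PerpetualAverage:
    """Perpetual analytics: average over every observation ever seen."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        self.total += value
        self.count += 1
        return self.total / self.count

s, p = StreamingAverage(window=2), PerpetualAverage()
for v in [1, 2, 3, 4]:
    last_s, last_p = s.observe(v), p.observe(v)
print(last_s)  # 3.5  (only the last two observations: 3 and 4)
print(last_p)  # 2.5  (all four observations)
```

The same incoming value produces different answers because the streaming version forgets everything outside its window, while the perpetual version weighs each new observation against the entire history.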

List and briefly discuss the three characteristics that define and make the case for data warehousing.

1) Data warehouse performance: More advanced forms of indexing, such as materialized views, aggregate join indexes, cube indexes, and sparse join indexes, enable numerous performance gains in data warehouses. The most important performance enhancement to date is the cost-based optimizer, which examines incoming SQL and considers multiple plans for executing each query as fast as possible.
2) Integrating data that provides business value: Integrated data is the unique foundation required to answer essential business questions.
3) Interactive BI tools: These tools allow business users to have direct access to data warehouse insights. Users are able to extract business value from the data and supply valuable strategic information to the executive staff.

What is NoSQL as used for Big Data? Describe its major downsides.

• NoSQL is a new style of database that has emerged, like Hadoop, to process large volumes of multi-structured data. However, whereas Hadoop is adept at supporting large-scale, batch-style historical analysis, NoSQL databases are aimed, for the most part (though there are some important exceptions), at serving up discrete data stored among large volumes of multi-structured data to end-user and automated Big Data applications. This capability is sorely lacking from relational database technology, which simply can't maintain needed application performance levels at Big Data scale.
• The downside of most NoSQL databases today is that they trade ACID (atomicity, consistency, isolation, durability) compliance for performance and scalability. Many also lack mature management and monitoring tools.

List and describe the three main "V"s that characterize Big Data.

• Volume: This is obviously the most common trait of Big Data. Many factors contributed to the exponential increase in data volume, such as transaction-based data stored through the years, text data constantly streaming in from social media, increasing amounts of sensor data being collected, automatically generated RFID and GPS data, and so forth.
• Variety: Data today comes in all types of formats, ranging from traditional databases to hierarchical data stores created by end users and OLAP systems, to text documents, e-mail, XML, meter-collected and sensor-captured data, to video, audio, and stock ticker data. By some estimates, 80 to 85 percent of all organizations' data is in some sort of unstructured or semistructured format.
• Velocity: This refers to both how fast data is being produced and how fast the data must be processed (i.e., captured, stored, and analyzed) to meet the need or demand. RFID tags, automated sensors, GPS devices, and smart meters are driving an increasing need to deal with torrents of data in near-real time.

________ bring together hardware and software in a physical unit that is not only fast but also scalable on an as-needed basis.

Appliances

Describe data stream mining and how it is used.

Data stream mining, as an enabling technology for stream analytics, is the process of extracting novel patterns and knowledge structures from continuous, rapid data records. A data stream is a continuous flow of ordered sequence of instances that in many applications of data stream mining can be read/processed only once or a small number of times using limited computing and storage capabilities. Examples of data streams include sensor data, computer network traffic, phone conversations, ATM transactions, web searches, and financial data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery. In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances in the data stream.
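
The one-pass constraint described above can be illustrated with a deliberately simple online learner: each labeled instance in the stream is read exactly once, the model predicts its class before learning from it, and nothing is stored except compact summary counts. The majority-class learner here is a hypothetical teaching example, not a real stream-mining algorithm:

```python
from collections import Counter

class MajorityClassLearner:
    """Predicts the most frequent class seen so far in the stream."""
    def __init__(self):
        self.counts = Counter()  # compact summary; raw instances are discarded

    def predict(self):
        # None until at least one instance has been observed.
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, label):
        self.counts[label] += 1

stream = ["flu", "flu", "cold", "flu", "cold", "flu"]
learner, correct = MajorityClassLearner(), 0
for label in stream:              # each instance is processed exactly once
    if learner.predict() == label:
        correct += 1
    learner.learn(label)
print(correct, "of", len(stream), "predicted correctly")  # 3 of 6
```

Practical stream miners (e.g., for sensor or clickstream data) use far richer models, but they share this shape: bounded memory, single-pass updates, and prediction on each new instance from knowledge of the previous ones.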

Big Data simplifies data governance issues, especially for global firms.

False

Big Data uses commodity hardware, which is expensive, specialized hardware that is custom built for a client or application.

False

Hadoop and MapReduce require each other to work.

False

Why are some portions of tape backup workloads being redirected to Hadoop clusters today?

• First, while it may appear inexpensive to store data on tape, the true cost comes with the difficulty of retrieval. Not only is the data stored offline, requiring hours if not days to restore, but tape cartridges themselves are also prone to degradation over time, making data loss a reality and forcing companies to factor in those costs. To make matters worse, tape formats change every couple of years, requiring organizations to either perform massive data migrations to the newest tape format or risk the inability to restore data from obsolete tapes.
• Second, it has been shown that there is value in keeping historical data online and accessible. As in the clickstream example, keeping raw data on a spinning disk for a longer duration makes it easy for companies to revisit data when the context changes and new constraints need to be applied. Searching thousands of disks with Hadoop is dramatically faster and easier than spinning through hundreds of magnetic tapes. Additionally, as disk densities continue to double every 18 months, it becomes economically feasible for organizations to hold many years' worth of raw or refined data in HDFS.

For low latency, interactive reports, a data warehouse is preferable to Hadoop.

True

Hadoop was designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel.

True

If you have many flexible programming languages running in parallel, Hadoop is preferable to a data warehouse.

True

In Application Case 7.6, Analyzing Disease Patterns from an Electronic Medical Records Data Warehouse, it was found that urban individuals have a higher number of diagnosed disease conditions.

True

