Chapter 7 & 8
What is a data scientist and what does the job involve?
A data scientist is a role or a job frequently associated with Big Data or data science. In a very short time it has become one of the most sought-after roles in the marketplace. Today, the most basic skill for data scientists is the ability to write code (in the latest Big Data languages and platforms). A more enduring skill will be the need for data scientists to communicate in a language that all their stakeholders understand, and to demonstrate the special skills involved in storytelling with data, whether verbally, visually, or, ideally, both. Data scientists use a combination of their business and technical skills to investigate Big Data, looking for ways to improve current business analytics practices (from descriptive to predictive and prescriptive) and hence to improve decisions for new business opportunities.
Big Data uses commodity hardware, which is expensive, specialized hardware that is custom built for a client or application.
False
Connectivity is not a part of the IoT infrastructure.
False
For cloud computing to be successful, users must have knowledge and experience in the control of the technology infrastructures.
False
Hadoop and MapReduce require each other to work.
False
IaaS helps provide faster information, but provides information only to managers in an organization.
False
In most cases, Hadoop is used to replace data warehouses.
False
In the Great Clips case study, the company uses geospatial data to analyze, among other things, the types of haircuts most popular in different geographic locations.
False
In the Salesforce case study, streaming data is used to identify services that customers use most.
False
In the classification of location-based analytic applications, examining geographic site locations falls in the consumer-oriented category.
False
SaaS combines aspects of cloud computing with Big Data analytics and empowers data scientists and analysts by allowing them to access centrally managed information data sets.
False
Siemens utilizes data sensors to track failure rates in household appliances.
False
Users definitely own their biometric data.
False
A critical emerging trend in analytics is the incorporation of location data. ________ data is the static location data used by these location-based analytic applications.
Geospatial
In this model, infrastructure resources like networks, storage, servers, and other computing resources are provided to client companies.
IaaS
________ provides resources like networks, storage, servers, and other computing resources to client companies.
IaaS
________ speeds time to insights and enables better data governance by performing data integration and analytic functions inside the database.
In-database analytics
________ of data provides business value; pulling of data from multiple subject areas and numerous applications into one repository is the raison d'être for data warehouses.
Integration
Data and text mining is a promising application of AaaS. What additional capabilities can AaaS bring to the analytic world?
It can also be used for large-scale optimization, highly-complex multi-criteria decision problems, and distributed simulation models. These prescriptive analytics require highly capable systems that can only be realized using service-based collaborative systems that can utilize large-scale computational resources.
This model began with the notion that data quality could happen in a centralized place, cleansing and enriching data and offering it to different systems, applications, or users, irrespective of where they were in the organization, computers, or on the network.
DaaS
Describe data stream mining and how it is used.
Data stream mining is the process of extracting novel patterns and knowledge structures from continuous, rapid data records. A data stream is a continuous flow of ordered instances that, in many applications of data stream mining, can be read or processed only once or a small number of times. In many data stream mining applications, the goal is to predict the class or value of new instances in the stream.
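The one-pass constraint described above can be illustrated with a toy learner. This is only a minimal sketch under stated assumptions: the class name OnlineCentroidClassifier, the function prequential_accuracy, and the sample readings are hypothetical and not taken from the textbook, and real stream-mining systems use far more capable models. Each instance is read once, used to test the current model, and then used to update it.

```python
from collections import defaultdict


class OnlineCentroidClassifier:
    """Toy one-pass classifier (hypothetical): keeps a running mean per class."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.means = {}

    def predict(self, x):
        # No prediction is possible before the first labeled instance arrives.
        if not self.means:
            return None
        # Pick the class whose running mean is closest to the new value.
        return min(self.means, key=lambda label: abs(x - self.means[label]))

    def update(self, x, label):
        # Incremental mean update, so no past instance has to be stored.
        self.counts[label] += 1
        old = self.means.get(label, 0.0)
        self.means[label] = old + (x - old) / self.counts[label]


def prequential_accuracy(stream):
    """Test-then-train: predict each instance once, then learn from it."""
    model = OnlineCentroidClassifier()
    correct = total = 0
    for x, label in stream:
        guess = model.predict(x)
        if guess is not None:
            total += 1
            correct += guess == label
        model.update(x, label)
    return correct / total if total else 0.0


# Made-up sensor readings with their class labels.
readings = [(1.0, "low"), (9.0, "high"), (1.5, "low"), (8.5, "high"), (2.0, "low")]
print(prequential_accuracy(iter(readings)))
```

Because the model only stores one count and one mean per class, memory use stays constant no matter how long the stream runs.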
How do the traditional location-based analytic techniques using geocoding of organizational locations and consumers hamper the organizations in understanding "true location-based" impacts?
Locations based on postal codes offer an aggregate view of a large geographic area. This poor granularity may not be able to pinpoint the growth opportunities within a region. The location of the target customers can change rapidly. An organization's promotional campaigns might not target the right customers.
Define MapReduce.
MapReduce is a programming model that is used to process and generate big datasets with a parallel distributed algorithm.
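As a rough illustration of the programming model, here is a single-machine word count written in the map, shuffle, and reduce style. It is a sketch only: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, nothing here uses the Hadoop API, and a real MapReduce job would run the map and reduce steps on many nodes in parallel.

```python
from collections import defaultdict


def map_phase(document):
    # Map: emit a (key, value) pair for every word in one input split.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Shuffle: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse the grouped values for one key into a single result.
    return key, sum(values)


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


docs = ["big data big insight", "data stream"]
print(word_count(docs))  # {'big': 2, 'data': 2, 'insight': 1, 'stream': 1}
```

The map and reduce functions never share state, which is what lets a framework distribute them across a cluster.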
All of the following statements about MapReduce are true EXCEPT
MapReduce runs without fault tolerance
The ________ Node in a Hadoop cluster provides client information on where in the cluster particular data is stored and if any nodes fail.
Name
Which of the following sources is likely to produce Big Data the fastest?
RFID tags
If you have many flexible programming languages running in parallel, Hadoop is preferable to a data warehouse.
True
In Application Case 7.6, Analyzing Disease Patterns from an Electronic Medical Records Data Warehouse, it was found that urban individuals have a higher number of diagnosed disease conditions.
True
In the Quiznos case, the company employed location-based behavioral targeting to narrow the characteristics of users who were most likely to eat at a quick-service restaurant.
True
In the opening vignette, Access Telecom (AT) built a system to better visualize customers who were unhappy before they canceled their service.
True
Internet of Things (IoT) is the phenomenon of connecting the physical world to the Internet.
True
It is important for Big Data and self-service business intelligence to go hand in hand to get maximum value from analytics.
True
MapReduce can be easily understood by skilled programmers due to its procedural nature.
True
One reason the IoT is growing exponentially is because hardware is smaller and more affordable.
True
RFID can be used in supply chains to manage product quality.
True
Satellite data can be used to evaluate the activity at retail locations as a source of alternative data.
True
Social networking Web sites like Facebook, Twitter, and LinkedIn are also examples of cloud computing.
True
The quality and objectivity of information disseminated by influential users of Twitter is higher than that disseminated by noninfluential users.
True
The term "Big Data" is relative as it depends on the size of the using organization.
True
The term cloud computing originates from a reference to the Internet as a "cloud" and represents an evolution of all of the previously shared/centralized computing trends.
True
There is a clear difference between the type of information support provided by influential users versus the others on Twitter.
True
List and describe the three main "V"s that characterize Big Data.
Volume: the sheer amount of data being generated and stored. Variety: the range of formats the data comes in (structured, semistructured, and unstructured). Velocity: how fast data is being produced and how fast it must be processed.
________ refers to the conformity to facts: accuracy, quality, truthfulness, or trustworthiness of the data.
Veracity
In a Hadoop "stack," what is a slave node?
a node where data is stored and processed
The problem of forecasting economic activity or microclimates based on a variety of data beyond the usual retail data is a very recent phenomenon and has led to another buzzword — ________.
alternative data
Using data to understand customers/clients and business operations to sustain and foster growth and profitability is
an increasingly challenging task for today's enterprises.
In the financial services industry, Big Data can be used to improve
both A & B
Why are some portions of tape backup workloads being redirected to Hadoop clusters today?
Tape backups are difficult to retrieve from: data stored offline takes a long time to retrieve, tape formats change over time, and tapes are prone to data loss. There is value in keeping historical data online and accessible.
In what ways can communications companies use geospatial analysis to harness their data effectively?
Communication companies often generate massive amounts of data every day. The ability to analyze the data quickly with a high level of location-specific granularity can better identify the customer churn and help in formulating strategies specific to locations for increasing operational efficiency, quality of service, and revenue.
Why are companies like IBM shifting to provide more services and consulting?
Customers see that significant value can be created with the application of analytics, and need help completing these tasks.
Server virtualization is the pooling of physical storage from multiple network storage devices into a single storage device.
False
________ is/are used to capture, store, analyze, and manage data linked to a location using integrated sensor technologies, global positioning systems installed in smartphones, or through RFID deployments in the retail and healthcare industries.
GIS
How does Hadoop work?
It breaks up Big Data into multiple parts so each part can be processed and analyzed at the same time on multiple computers.
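The split-and-process-in-parallel idea can be sketched on one machine, with a thread pool standing in for the cluster's nodes. This is an assumption-laden illustration, not how Hadoop is implemented: split_into_blocks, process_block, parallel_total, and the sample data are all hypothetical names, and real Hadoop clusters also replicate blocks and handle node failures.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, block_size):
    # Break the full data set into fixed-size parts.
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def process_block(block):
    # The work each "node" does on its own part of the data.
    return sum(block)


def parallel_total(records, block_size=4, workers=3):
    blocks = split_into_blocks(records, block_size)
    # Each worker thread plays the role of one computer in the cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_block, blocks))
    # Combine the partial results into the final answer.
    return sum(partial_results)


sales = list(range(1, 11))
print(parallel_total(sales))  # 55
```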
In the world of Big Data, ________ aids organizations in processing and analyzing large volumes of multistructured data. Examples include indexing and search, graph analysis, etc.
MapReduce
What new geometric data type in Teradata's data warehouse captures geospatial features?
ST_GEOMETRY
What are the differences between stream analytics and perpetual analytics? When would you use one or the other?
Stream analytics: applies transaction-level logic to real-time observations. The rules applied to these observations take previous observations into account only as long as they occurred within a prescribed window, and the window must have a defined size. Perpetual analytics: evaluates every incoming observation against all prior observations, with no window size. I would use stream analytics when the transactional volume is high and the time to decision is too short, favoring nonpersistence and smaller window sizes. I would use perpetual analytics when the mission is critical and the transaction volume can be managed in real time.
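The difference between a fixed window and no window can be seen with a running average. This is a minimal sketch: windowed_means, perpetual_means, and the readings are hypothetical, and a mean is just one convenient example of a rule applied to observations.

```python
from collections import deque


def windowed_means(observations, window_size):
    # Stream analytics: only the last `window_size` observations count.
    window = deque(maxlen=window_size)
    means = []
    for value in observations:
        window.append(value)  # the oldest value drops out automatically
        means.append(sum(window) / len(window))
    return means


def perpetual_means(observations):
    # Perpetual analytics: every new observation is weighed against all prior ones.
    total = 0.0
    means = []
    for count, value in enumerate(observations, start=1):
        total += value
        means.append(total / count)
    return means


readings = [10, 10, 10, 40, 40]
print(windowed_means(readings, 2))  # [10.0, 10.0, 10.0, 25.0, 40.0]
print(perpetual_means(readings))    # [10.0, 10.0, 10.0, 17.5, 22.0]
```

The windowed version reacts quickly to the jump from 10 to 40 and needs only a small buffer, while the perpetual version remembers the whole history and shifts more slowly.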
Why is separating the impact of analytics from that of other computerized systems a difficult task?
The trend is toward integrating systems
Which of the following is true of data-as-a-Service (DaaS) platforms?
There are standardized processes for accessing data wherever it is located.
Big Data simplifies data governance issues, especially for global firms.
False
Web-based e-mail services such as Google's Gmail are not examples of cloud computing.
False
While cloud services are useful for small and midsize analytic applications, they are still limited in their ability to handle Big Data applications.
False
What is Internet of Things (IoT) and how is it used?
IoT is the phenomenon of connecting the physical world to the Internet. In IoT, physical devices are connected to sensors that collect data on the operation, location, and state of a device. This data is processed using various analytics techniques for monitoring the device remotely from a central office or for predicting any upcoming faults in the device.
________ is the splitting of available bandwidth into channels.
Network virtualization
What is NoSQL as used for Big Data? Describe its major downsides.
NoSQL is a newer style of database that stores and processes unstructured data that is not in a tabular format. NoSQL databases are high-performance and highly scalable. The downside is that they trade ACID compliance for performance and scalability.
Using this model, companies can deploy their software and applications in the cloud so that their customers can use them.
PaaS
Which of the following allows companies to deploy their software and applications in the cloud so that their customers can use them?
PaaS
________ is a generic technology that refers to the use of radio-frequency waves to identify objects.
RFID
This model allows consumers to use applications and software that run on distant computers in the cloud infrastructure.
SaaS
________ is the masking of physical servers from server users.
Server virtualization
How does Siemens use sensor data to help monitor equipment on trains?
Siemens uses an IoT model and sensors attached to several key components of trains and other railway equipment to help evaluate its current working condition, and predict the need for future repair. By using a wide variety of different types of sensors, the company is able to evaluate a multitude of conditions. This evaluation can be on the train itself, or within the supporting infrastructure. By using analytics to monitor these sensors, the company is able to predict the need for repair prior to component failure.
In the opening vignette, why was the Telecom company so concerned about the loss of customers, if customer churn is common in that industry?
The loss was occurring at a very high rate. The company had been losing customers faster than it was gaining them. It was identified that the loss of customers could be traced back to customer service interactions.
Which of the following is true about the furtherance of homeland security?
There is a greater need for oversight
Current total storage capacity lags behind the digital information being generated in the world.
True
Data as a service began with the notion that data quality could happen in a centralized place, cleansing and enriching data and offering it to different systems, applications, or users, irrespective of where they were in the organization, computers, or on the network.
True
From massive amounts of high-dimensional location data, algorithms that reduce the dimensionality of the data can be used to uncover trends, meaning, and relationships to eventually produce human-understandable representations.
True
Hadoop was designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel.
True
Service-oriented DSS solutions generally offer individual or bundled services to the user as a service.
True
Social media mentions can be used to chart and predict flu outbreaks.
True
With RFID tags, a(n) ________ tag has a battery on board to energize it.
active
In-motion ________ is often overlooked today in the world of BI and Big Data.
analytics
The portion of the IoT technology infrastructure that focuses on controlling what and how information is captured is
applications
Pokémon GO is an example of a location-sensing ________ reality-based game.
augmented
As volumes of Big Data arrive from multiple sources such as sensors, machines, social media, and clickstream interactions, the first step is to ________ all the data reliably and cost effectively.
capture
Which of these is NOT a part of the IoT technology infrastructure?
electrical access
Smartbin has developed trash containers that include sensors to detect
fill levels
Which Big Data approach promotes efficiency, lower cost, and better performance by processing jobs in a shared, centrally managed pool of IT resources?
grid computing
Today, most smartphones are equipped with various instruments to measure jerk, orientation, and sense motion. One of these instruments is an accelerometer, and the other is a(n)
gyroscope
The portion of the IoT technology infrastructure that focuses on the sensors themselves is
hardware
Allowing Big Data to be processed in memory and distributed across a dedicated set of nodes can solve complex problems in near-real time with highly accurate insights. What is this process called?
in-memory analytics
By using ________, businesses can collect and analyze data to discern large-scale patterns of movement and identify distinct classes of behaviors in specific contexts.
location-enabled services
Location information from ________ phones can be used to create profiles of user behavior and movement.
mobile
What kind of location-based analytics is a real-time marketing promotion?
organization-oriented location-based dynamic approach
With RFID tags, a(n) ________ tag receives energy from the electromagnetic field created by the interrogator.
passive
For individual decision makers, ________ values constitute a major factor in the issue of ethical decision making.
personal
In general, ________ is the right to be left alone and the right to be free from unreasonable personal intrusion.
privacy
IaaS, AaaS and other ________-based offerings allow the rapid diffusion of advanced analysis tools among users, without significant investment in technology acquisition.
cloud
The portion of the IoT technology infrastructure that focuses on how to transmit data is
connectivity
GPS Navigation is an example of which kind of location-based analytics?
consumer-oriented geospatial static approach
Describe your understanding of the emerging term people analytics. Are there any privacy issues associated with the application?
•Applications such as using sensor-embedded badges that employees wear to track their movement and predict behavior have resulted in the term people analytics. This application area combines organizational IT impact, Big Data, and sensors, and raises privacy concerns. One company, Sociometric Solutions, has reported several such applications of its sensor-embedded badges. •People analytics creates major privacy issues. Should companies be able to monitor their employees this intrusively? Sociometric has reported that its analytics are only reported on an aggregate basis to its clients. No individual user data is shared. The company has noted that some employers want to get individual employee data, but its contract explicitly prohibits this type of sharing. In any case, sensors are leading to another level of surveillance and analytics, which poses interesting privacy, legal, and ethical questions.
Hadoop is primarily a(n) ________ file system and lacks capabilities we'd associate with a DBMS, such as indexing, random access to data, and support for SQL.
distributed
What is cloud computing? What is Amazon's general approach to the cloud computing services it provides?
•Wikipedia defines cloud computing as "a style of computing in which dynamically scalable and often virtualized resources are provided over the Internet. Users need not have knowledge of, experience in, or control over the technology infrastructures in the cloud that supports them." •Amazon.com has developed an impressive technology infrastructure for e-commerce as well as for business intelligence, customer relationship management, and supply chain management. It has built major data centers to manage its own operations. However, through Amazon.com's cloud services, many other companies can employ these very same facilities to gain advantages of these technologies without having to make a similar investment. Like other cloud-computing services, a user can subscribe to any of the facilities on a pay-as-you-go basis. This model of letting someone else own the hardware and software but making use of the facilities on a pay-per-use basis is the cornerstone of cloud computing.
In a network analysis, what connects nodes?
edges
As the size and the complexity of analytical systems increase, the need for more ________ analytical systems is also increasing to obtain the best performance.
efficient
Big Data comes from ________.
everywhere
In the Alternative Data for Market Analysis or Forecasts case study, satellite data was NOT used for
monitoring individual customer patterns
In the Twitter case study, how did influential users support their tweets?
objective data
In open-source databases, the most important performance enhancement to date is the cost-based ________.
optimizer
Big Data employs ________ processing techniques and nonrelational data storage capabilities in order to process unstructured and semistructured data.
parallel
Predictive analytics is beginning to enable development of software that is directly used by a consumer. One key concern in employing these technologies is the loss of ________.
privacy
A(n) ________ is operated solely for a single organization having a mission critical workload and security concerns.
private cloud
In a(n) ________ the subscriber uses the resources offered by service providers over the Internet.
public cloud
In a Hadoop "stack," what node periodically replicates and stores data from the Name Node should it fail?
secondary node
Services that let consumers permanently enter a profile of information along with a password and use this information repeatedly to access services at multiple sites are called
single-sign-on facilities
In the energy industry, ________ grids are one of the most impactful applications of stream analytics.
smart
The portion of the IoT technology infrastructure that focuses on how to manage incoming data and analyze it is
software backend
Companies with the largest revenues from Big Data tend to be
the largest computer and IT services firms.
Traditional data warehouses have not been able to keep up with
the variety and complexity of data
A job ________ is a node in a Hadoop cluster that initiates and coordinates MapReduce jobs, or the processing of the data.
tracker
A major structural change that can occur when analytics are introduced into an organization is the creation of new organizational ________.
units
Under which of the following requirements would it be more appropriate to use Hadoop over a data warehouse?
unrestricted, ungoverned sandbox explorations
What is the Hadoop Distributed File System (HDFS) designed to handle?
unstructured and semistructured non-relational data
The ________ of Big Data is its potential to contain more useful patterns and interesting anomalies than "small" data.
value proposition
Data flows can be highly inconsistent, with periodic peaks, making data loads hard to manage. What is this feature of Big Data called?
variability
Organizations are working with data that meets the three V's (variety, volume, and ________) characterizations.
velocity
AaaS in the cloud has economies of scale and scope by providing many ________ analytical applications with better scalability and higher cost savings.
virtual
Analytics can change the way in which many ________ are made by managers and can consequently change their jobs.
decisions
In the Analyzing Disease Patterns from an Electronic Medical Records Data Warehouse case study, what was the analytic goal?
determine differences in rates of disease in urban and rural populations
Big Data is being driven by the exponential growth, availability, and use of information.
True
Despite their potential, many current NoSQL tools lack mature management and monitoring tools.
True
For low latency, interactive reports, a data warehouse is preferable to Hadoop.
True
List and describe four of the most critical success factors for Big Data analytics.
1) A clear business need: Business investments need to be made for the good of the company, not for the sake of technology advancement. The driver should be a need of the business. 2) Strong, committed sponsorship: Without strong executive sponsorship, it is difficult to succeed with Big Data analytics. 3) Alignment between the business and IT strategy: Ensure that the analytical work is supporting the business strategy. 4) A fact-based decision-making culture: In a fact-based decision-making culture, the numbers, not gut feelings, drive decisions.
List and briefly discuss the three characteristics that define and make the case for data warehousing.
1) Data warehouse performance: driven largely by the cost-based optimizer. 2) Integration: integrating data provides business value and helps answer business questions. 3) Interactive BI tools: these give users access to the data in the warehouse.
When considering Big Data projects and architecture, list and describe five challenges designers should be mindful of in order to make the journey to analytics competency less stressful.
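- Data volume: the ability to capture, store, and process a huge volume of data at an acceptable speed. - Data integration: the ability to combine data that differs in structure or source, quickly and at a reasonable cost. - Processing capabilities: the ability to process data quickly, as it is captured. - Data governance: the ability to keep up with the security, privacy, ownership, and quality issues of Big Data. - Skill availability: the shortage of people with the skills to work with Big Data. - Solution cost: the need to keep costs in line with the value the project returns.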
A newly popular unit of data in the Big Data era is the petabyte (PB), which is
10^15 bytes
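To put that figure in context, a few lines of arithmetic convert between the decimal storage units. This is a small sketch; the UNITS table and the convert helper are hypothetical names used only for illustration.

```python
# Decimal (SI) storage units expressed in bytes.
UNITS = {
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
}


def convert(amount, source_unit, target_unit):
    # Go through bytes so any unit can be converted to any other.
    return amount * UNITS[source_unit] / UNITS[target_unit]


print(convert(1, "PB", "TB"))  # 1000.0
print(convert(1, "EB", "PB"))  # 1000.0
```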
What is Big Data's relationship to the cloud?
Amazon and Google have working Hadoop cloud offerings
________ bring together hardware and software in a physical unit that is not only fast but also scalable on an as-needed basis.
Appliances
HBase, Cassandra, MongoDB, and Accumulo are examples of ________ databases.
NoSQL
HBase is a nonrelational ________ that allows for low-latency, quick lookups in Hadoop.
database