ISM3540 Big Data Exam 2

Ace your homework & exams now with Quizwiz!

You are tasked with accumulating survey data on a web page and are responsible for it being free from dirty data once you close the survey and get the data to the researching team. Which is the best way to handle the possibility of dirty data

Build a website that validates data as the survey participant takes the survey.

Which data mining process/methodology is thought to be the most comprehensive, according to kdnuggets.com rankings?

CRISP-DM

Which broad area of data mining applications analyzes data, forming rules to distinguish between defined classes?

Classification

Which of the following is a data mining myth?

Data mining requires a separate, dedicated database.

Articles and auxiliary verbs are assigned little value in text mining and are usually filtered out.

T

Big Data is being driven by the exponential growth, availability, and use of information.

T

Categorization and clustering of documents during text mining differ only in the preselection of categories.

T

Clickstream analysis does not need users to enter their perceptions of the Web site or other feedback directly to be useful in determining their preferences.

T

Converting continuous valued numerical variables to ranges and categories is referred to as discretization.

T

Despite their potential, many current NoSQL tools lack mature management and monitoring tools.

T

For low latency, interactive reports, a data warehouse is preferable to Hadoop.

T

If using a mining analogy, "knowledge mining" would be a more appropriate term than "data mining."

T

If you have many flexible programming languages running in parallel, Hadoop is preferable to a data warehouse.

T

In data mining, classification models help in prediction.

T

In sentiment analysis, it is hard to classify some subjects such as news as good or bad, but easier to classify others, e.g., movie reviews, in the same way.

T

In text mining, if an association between two concepts has 7% support, it means that 7% of the documents had both concepts represented in the same document.

T

In the Tito's Vodka case study, trends in cocktails were studied to create a quarterly recipe for customers.

T

In the Wimbledon case study, designers balanced the needs of mobile and desktop computer users.

T

It is important for Big Data and self-service business intelligence to go hand in hand to get maximum value from analytics.

T

MapReduce can be easily understood by skilled programmers due to its procedural nature.

T

Regional accents present challenges for natural language processing.

T

Satellite data can be used to evaluate the activity at retail locations as a source of alternative data.

T

The cost of data storage has plummeted recently, making data mining feasible for more firms.

T

The quality and objectivity of information disseminated by influential users of Twitter is higher than that disseminated by noninfluential users.

T

Using data mining on data about imports and exports can help to detect tax avoidance and money laundering.

T

When a problem has many attributes that impact the classification of different patterns, decision trees may be a useful approach.

T

In the Target case study, why did Target send a teen maternity ads?

Target's analytic model suggested she was pregnant based on her buying habits.

In sentiment analysis, which of the following is an implicit opinion?

The customer service I got for my TV was laughable.

All of the following statements about data mining are true EXCEPT:

The ideas behind it are relatively new.

How does Hadoop work?

It breaks up Big Data into multiple parts so each part can be processed and analyzed at the same time on multiple computers.

Which of the following sources is likely to produce Big Data the fastest?

RFID tags

What do voice of the market (VOM) applications of sentiment analysis do?

They examine customer sentiment at the aggregate level.

Which of the following statements about Web site conversion statistics is FALSE?

Visitors who begin a purchase on most Web sites must complete it.

Search engine optimization (SEO) is a means by which

Web site developers can increase Web site search rankings.

Web site usability may be rated poor if

Web site visitors download few of your offered PDFs and videos

In text analysis, what is a lexicon?

a catalog of words, their synonyms, and their meanings

In the opening vignette, the architectural system that supported Watson used all the following elements EXCEPT

a core engine that could operate seamlessly in another domain without changes.

What does Web content mining involve?

analyzing the unstructured content of Web pages

Understanding customers better has helped Amazon and others become more successful. The understanding comes primarily from

analyzing the vast data amounts routinely collected.

In data mining, finding an affinity of two products to be commonly together in a shopping cart is known as

association rule mining.

What is the main reason parallel processing is sometimes used for data mining?

because of the massive data amounts and search efforts involved

In text mining, tokenizing is the process of

categorizing a block of text in a sentence.

A data mining study is specific to addressing a well-defined business task, and different business tasks require

different sets of data.

Which Big Data approach promotes efficiency, lower cost, and better performance by processing jobs in a shared, centrally managed pool of IT resources?

grid computing

Understanding which keywords your users enter to reach your Web site through a search engine can help you understand

how well visitors understand your products.

In the Influence Health case study, what was the goal of the system?

increasing service use

Identifying and preventing incorrect claim payments and fraudulent activities falls under which type of data mining applications?

insurance

What does the scalability of a data mining method refer to?

its ability to construct a prediction model efficiently given a large amount of data

What does the robustness of a data mining method refer to?

its ability to overcome noisy data to make somewhat accurate predictions

The data field "ethnic group" can be best described as

nominal data.

In the Twitter case study, how did influential users support their tweets?

objective data

What are the two main types of Web analytics?

off-site and on-site Web analytics

Breaking up a Web page into its components to identify worthy words/terms and indexing them using a set of rules is called

parsing the documents. *Think: break up into PARts

Prediction problems where the variables have numeric values are most accurately defined as

regressions

Third party providers of publicly available data sets protect the anonymity of the individuals in the data set primarily by

removing identifiers such as names and social security numbers.

In a Hadoop "stack," what node periodically replicates and stores data from the Name Node should it fail?

secondary node

In the Wimbledon case study, the tournament used data for each match in real time to highlight

significant events

Clustering partitions a collection of things into segments whose members share

similar characteristics

Companies with the largest revenues from Big Data tend to be

the largest computer and IT services firms.

In the research literature case study, the researchers analyzing academic papers extracted information from which source?

the paper abstract

All of the following statements about data mining are true EXCEPT

the process aspect means that data mining should be a one-step process to results.

In estimating the accuracy of data mining (or other) classification models, the true positive rate is

the ratio of correctly classified positives divided by the total positive count.

Under which of the following requirements would it be more appropriate to use Hadoop over a data warehouse?

unrestricted, ungoverned sandbox explorations

What is the Hadoop Distributed File System (HDFS) designed to handle?

unstructured and semistructured non-relational data

A newly popular unit of data in the Big Data era is the petabyte (PB), which is

10^15 bytes.

The Eckerson survey of 2002 estimated the total cost (to the US yearly economy) of dirty data to be approximately:

600 billion USD

Natural language processing (NLP) is associated with which of the following areas?

All of these (text mining, AI, computational linguistics)

What is Big Data's relationship to the cloud?

Amazon and Google have working Hadoop cloud offerings.

Big Data uses commodity hardware, which is expensive, specialized hardware that is custom built for a client or application.

F

Companies understand that when their product goes "viral," the content of the online conversations about their product does not matter, only the volume of conversations.

F

Consistent high quality, higher publishing frequency, and longer time lag are all attributes of industrial publishing when compared to Web publishing.

F

Data mining can be very useful in detecting patterns such as credit card fraud, but is of little help in improving sales.

F

Data mining requires specialized data analysts to ask ad hoc questions and obtain answers quickly from the system.

F

Data that is collected, stored, and analyzed in data mining is often private and personal. There is no way to maintain individuals' privacy other than being very careful about physical data security.

F

Descriptive analytics for social media feature such items as your followers as well as the content in online conversations that help you to identify themes and sentiments.

F

Hadoop and MapReduce require each other to work.

F

In a dataset where all values on an observation are supposed to be populated you encounter several which are empty (NULL). It is always best to replace these NULL values with the average of that column of data.

F

In most cases, Hadoop is used to replace data warehouses.

F

In sentiment analysis, sentiment suggests a transient, temporary opinion reflective of one's feelings.

F

In the Dell cases study, the largest issue was how to properly spend the online marketing budget.

F

In the Miami-Dade Police Department case study, predictive analytics helped to identify the best schedule for officers in order to pay the least overtime.

F

In the cancer research case study, data mining algorithms that predict cancer survivability with high predictive power are good replacements for medical professionals.

F

In the evolution of social media user engagement, the largest recent change is the growth of creators.

F

In the opening case, police detectives used data mining to identify possible new areas of inquiry.

F

K-fold cross-validation is also called sliding estimation.

F

Market basket analysis is a useful and entertaining way to explain data mining to a technologically less savvy audience, but it has little business significance

F

Ratio data is a type of categorical data

F

Search engine optimization (SEO) techniques play a minor role in a Web site's search ranking because only well-written content matters.

F

Search engines are only used in the context of the World Wide Web (WWW).

F

Since little can be done about visitor Web site abandonment rates, organizations have to focus their efforts on increasing the number of new visitors.

F

Statistics and data mining both look for data sets that are as large as possible.

F

Text analytics is the subset of text mining that handles information retrieval and extraction, plus data mining.

F

The entire focus of the predictive analytics system in the Infinity P&C case was on detecting and handling fraudulent claims for the company's benefit.

F

Web-based media has nearly identical cost and scale structures as traditional media.

F


Related study sets

Lab Manual Ch. 17 (Ears) & 18 (Mouth, Nose, Sinuses, Throat)

View Set

3b. Test Review - Represent Data

View Set

Syracuse University EAR 111 Exam 1

View Set

Theories of Organizational Behavior Mid-Term

View Set

Chapter 41: Management of Patients With Musculoskeletal Disorders 3

View Set

MEGA/MOCA exam flash cards early childhood education learning across curriculum ALL subjects

View Set

Integrated Chinese Lesson 18 Dialogue 2

View Set

The Dodd-Frank Wall Street Reform & Consumer Protection Act (2010)

View Set