ISM 3540 FSU exam 2
The Eckerson survey of 2002 estimated the total cost (to the US yearly economy) of dirty data to be approximately:
$600 Billion USD
The data field "ethnic group" can be best described as: A. nominal data B. interval data C. ordinal data D. ratio data
A. nominal data
A company/organization can encounter dirty data in the form of
All of these (invalid mailing address, invalid email address, duplicated data)
In the evolution of social media user engagement, the largest recent change is the growth of creators. True False
False
In the opening case, police detectives used data mining to identify possible new areas of inquiry. True False
False
K-fold cross-validation is also called sliding estimation. True False
False
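K-fold cross-validation is commonly called rotation estimation rather than sliding estimation. The sketch below shows the rotation idea: each fold is held out once as the test set while the rest is used for training. The function name k_fold_indices is hypothetical, and the sample is assumed to be unshuffled for simplicity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 5))  # 5 folds of 2 test samples each
```

Every sample lands in exactly one test fold, so each observation is used for testing once and for training k - 1 times.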
Market basket analysis is a useful and entertaining way to explain data mining to a technologically less savvy audience, but it has little business significance. True False
False
Open-source data mining tools include applications such as IBM SPSS Modeler and Dell Statistica. True False
False
Statistics and data mining both look for data sets that are as large as possible. True False
False
Text analytics is the subset of text mining that handles information retrieval and extraction, plus data mining. True False
False
In text mining, tokenizing is the process of
categorizing a block of text in a sentence
Companies with the largest revenues from Big Data tend to be
the largest computer and IT services firms.
Which of the following statements about Web site conversion statistics is FALSE? A. Web site visitors can be classed as either new or returning B. Visitors who begin a purchase on most Web sites C. The conversion rate is the number of people who take action divided by the number of visitors D. Analyzing exit rates can tell you why visitors left your Web site.
B. Visitors who begin a purchase on most Web sites
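Option C above defines the conversion rate as the number of people who take action divided by the number of visitors. A minimal arithmetic sketch, with made-up visitor counts and a hypothetical function name:

```python
def conversion_rate(actions, visitors):
    """Share of visitors who took the desired action (0.0 if there were no visitors)."""
    if visitors == 0:
        return 0.0
    return actions / visitors


# Hypothetical numbers: 30 purchases out of 1,200 visitors.
rate = conversion_rate(actions=30, visitors=1200)
print(f"{rate:.1%}")  # prints 2.5%
```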
What does Web content mining involve? A. analyzing the universal resource locator in Web pages B. analyzing the unstructured content of Web pages C. analyzing the pattern of visits to a Web site D. analyzing the PageRank and other metadata of a Web page
B. analyzing the unstructured content of Web pages
Prediction problems where the variables have numeric values are most accurately defined as: A. classifications B. regressions C. associations D. computations
B. regressions
You are tasked with accumulating survey data on a web page and are responsible for it being free from dirty data once you close the survey and get the data to the researching team. Which is the best way to handle the possibility of dirty data?
Build a website that validates data as the survey participant takes the survey.
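As a rough illustration of validating data while the participant is still taking the survey, the sketch below checks one response before it is stored. The field names (email, age), the age range, and the simplified email pattern are all assumptions made for the example, not part of the course material.

```python
import re

# Deliberately simple pattern for illustration; real email validation is more involved.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_response(response):
    """Return a list of problems found in one survey response (empty if clean)."""
    problems = []
    email = response.get("email", "").strip()
    if not EMAIL_PATTERN.match(email):
        problems.append("email: not a valid address")
    age = response.get("age", "").strip()
    if not age.isdecimal() or not 0 < int(age) < 120:
        problems.append("age: must be a whole number between 1 and 119")
    return problems


validate_response({"email": "pat@example.com", "age": "34"})  # returns []
```

A survey page would show the returned problems to the participant and refuse the submission until the list is empty, so bad values never reach the research team.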
In the opening vignette, the architectural system that supported Watson used all the following EXCEPT: A. Massive parallelism to enable simultaneous consideration of multiple hypotheses. B. An underlying confidence subsystem that ranks and integrates answers. C. A core engine that could operate seamlessly in another domain without changes. D. integration of shallow and deep knowledge
C. A core engine that could operate seamlessly in another domain without changes.
Web site usability may be rated poor if
Web site visitors download few of your offered PDFs and videos.
What is the main reason parallel processing is sometimes used for data mining? A. because the hardware exists in most organizations, and it is available to use. B. because most of the algorithms used for data mining require it C. because of the massive data amounts and search efforts involved D. because any strategic application requires parallel processing
C. because of the massive data amounts and search efforts involved
What does the scalability of a data mining method refer to? A. its ability to predict the outcome of a previously unknown data set accurately B. its speed of computation and computational costs in using the model C. its ability to construct a prediction model efficiently given a large amount of data D. its ability to overcome noisy data to make somewhat accurate predictions
C. its ability to construct a prediction model efficiently given a large amount of data
What are the two main types of Web analytics?
off-site and on-site Web analytics
Third party providers of publicly available data sets protect the anonymity of the individuals in the data set primarily by: A. asking data users to use the data ethically B. leaving in identifiers (e.g. name), but changing other variables C. removing identifiers such as names and social security numbers D. letting individuals in the data know their data is being accessed.
C. removing identifiers such as names and social security numbers
Natural language processing (NLP) is associated with which of the following areas? A. text mining B. Artificial intelligence C. Computational linguistics D. All of these
D. All of these
As discussed in class, which data mining process/methodology is the most widely-used and generally regarded as the most comprehensive? A. SEMMA B. proprietary organizational methodologies C. KDD Process D. CRISP-DM
D. CRISP-DM
What does the robustness of a data mining method refer to? A. its ability to predict the outcome of a previously unknown data set accurately B. its speed of computation and computational costs in using the model C. its ability to construct a prediction model efficiently given a large amount of data D. Its ability to overcome noisy data to make somewhat accurate predictions
D. Its ability to overcome noisy data to make somewhat accurate predictions
Which broad area of data mining applications partitions a collection of objects into natural groupings with similar features? A. associations B. visualization C. classification D. clustering
D. clustering
A data mining study is specific to addressing a well-defined business task, and different business tasks require: A. general organizational data B. general industry data C. general economic data D. different sets of data
D. different sets of data.
Understanding which keywords your users enter to reach your Web site through a search engine can help you understand A. the hardware your Web site is running on B. the type of Web browser being used by your Web site visitors C. most of your Web site visitors' wants and needs D. how well visitors understand your products
D. how well visitors understand your products
Breaking up a Web page into its components to identify worthy words/terms and indexing them using a set of rules is called A. preprocessing the documents B. document analysis C. creating the term-by-document matrix D. parsing the documents
D. parsing the documents
Which of the following is a data mining myth?
Data mining requires a separate, dedicated database.
In sentiment analysis, which of the following is an implicit opinion?
The customer service I got for my TV was laughable.
All of the following statements about data mining are true EXCEPT:
The ideas behind it are relatively new. OR a. the process aspect means that data mining should be a one-step process to results.
Companies understand that when their product goes "viral," the content of the online conversations about their product does not matter, only the volume of conversations. True False
False
Consistent high quality, higher publishing frequency, and longer time lag are all attributes of industrial publishing when compared to Web publishing. True False
False
Data mining can be very useful in detecting patterns such as credit card fraud, but is of little help in improving sales. True False
False
Data mining requires specialized data analysts to ask ad hoc questions and obtain answers quickly from the system. True False
False
Data that is collected, stored, and analyzed in data mining is often private and personal. There is no way to maintain individuals' privacy other than being very careful about physical data security. True False
False
In a dataset where all values on an observation are supposed to be populated you encounter several which are empty (NULL). It is always best to replace these NULL values with the average of that column of data. True False
False
In a dataset where all values on an observation are supposed to be populated you encounter several which are empty (NULL). It is best to just replace these NULL values with the average of that column of data. True False
False
In most cases, Hadoop is used to replace data warehouses. True False
False
In sentiment analysis, sentiment suggests a transient, temporary opinion reflective of one's feelings. True False
False
In the Dell case study, the largest issue was how to properly spend the online marketing budget. True False
False
In the Miami-Dade Police Department case study, predictive analytics helped to identify the best schedule for officers in order to pay the least overtime. True False
False
In the cancer research case study, data mining algorithms that predict cancer survivability with high predictive power are good replacements for medical professionals. True False
False
In the car insurance case study, text mining was used to identify auto features that caused injuries. True False
False
Search engine optimization (SEO) techniques play a minor role in a Web site's search ranking because only well-written content matters. True False
False
Search engines are only used in the context of the World Wide Web (WWW). True False
False
Since little can be done about visitor Web site abandonment rates, organizations have to focus their efforts on increasing the number of new visitors. True False
False
The entire focus of the predictive analytics system in the Infinity P&C case was on detecting and handling fraudulent claims for the company's benefit. True False
False
Web-based media has nearly identical cost and scale structures as traditional media. True False
False
Articles and auxiliary verbs are assigned little value in text mining and are usually filtered out. True False
True
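Filtering out articles and auxiliary verbs can be sketched as stop-word removal. The stop-word list below is a tiny assumed sample, and splitting on whitespace is a simplification chosen for the example.

```python
# Small illustrative stop-word list: a few articles and auxiliary verbs.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "has", "have", "had", "will", "can"}


def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    words = text.lower().split()
    return [word for word in words if word not in STOP_WORDS]


remove_stop_words("The customer has an issue")  # returns ['customer', 'issue']
```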
Big Data is being driven by the exponential growth, availability, and use of information. True False
True
Categorization and clustering of documents during text mining differ only in the preselection of categories. True False
True
Clickstream analysis does not need users to enter their perceptions of the Web site or other feedback directly to be useful in determining their preferences. True False
True
Converting continuous valued numerical variables to ranges and categories is referred to as discretization. True False
True
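A small sketch of discretization, turning a continuous age value into a category. The bin edges and labels are assumptions picked only for illustration.

```python
import bisect

# Assumed cut points: ages below 18, 18 to 34, 35 to 64, and 65 or older.
BIN_EDGES = [18, 35, 65]
BIN_LABELS = ["minor", "young adult", "middle-aged", "senior"]


def discretize(age):
    """Map a numeric age to the label of the range it falls in."""
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, age)]


discretize(40)  # returns 'middle-aged'
```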
Current total storage capacity lags behind the digital information being generated in the world. True False
True
Current use of sentiment analysis in voice of the customer applications allows companies to change their products or services in real time in response to customer sentiment. True False
True
During classification in data mining, a false positive is an occurrence classified as true by the algorithm while being false in reality. True False
True
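To make the false-positive definition concrete, the sketch below counts cases the classifier labeled true that were false in reality. The label lists are made-up data.

```python
def count_false_positives(actual, predicted):
    """Count cases predicted True whose actual label is False."""
    return sum(1 for a, p in zip(actual, predicted) if p and not a)


# Made-up labels for five cases.
actual = [True, False, False, True, False]
predicted = [True, True, False, False, True]
count_false_positives(actual, predicted)  # returns 2
```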
Hadoop was designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel. True False
True
If using a mining analogy, "knowledge mining" would be a more appropriate term than "data mining." True False
True
In data mining, classification models help in prediction. True False
True
In sentiment analysis, it is hard to classify some subjects such as news as good or bad, but easier to classify others, e.g., movie reviews, in the same way. True False
True
In text mining, if an association between two concepts has 7% support, it means that 7% of the documents had both concepts represented in the same document or sample. True False
True
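Support can be computed directly as the share of documents in which both concepts appear. In this sketch each document is assumed to be already reduced to a set of concepts, and the concept names are invented.

```python
def support(documents, concept_a, concept_b):
    """Fraction of documents that contain both concepts."""
    both = sum(1 for concepts in documents if concept_a in concepts and concept_b in concepts)
    return both / len(documents)


# Four made-up documents, each reduced to the concepts it mentions.
docs = [
    {"price", "battery"},
    {"battery", "screen"},
    {"price", "screen"},
    {"price", "battery", "screen"},
]
support(docs, "price", "battery")  # returns 0.5
```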
In the Tito's Vodka case study, trends in cocktails were studied to create a quarterly recipe for customers. True False
True
In the opening vignette, Access Telecom (AT) built a system to better visualize customers who were unhappy before they canceled their service. True False
True
Regional accents present challenges for natural language processing. True False
True
Social media mentions can be used to chart and predict flu outbreaks. True False
True
The cost of data storage has plummeted recently, making data mining feasible for more firms. True False
True
The quality and objectivity of information disseminated by influential users of Twitter is higher than that disseminated by noninfluential users. True False
True
The term "Big Data" is relative as it depends on the size of the using organization. True False
True
Using data mining on data about imports and exports can help to detect tax avoidance and money laundering. True False
True
When a problem has many attributes that impact the classification of different patterns, decision trees may be a useful approach. True False
True
Which of the following is NOT one of the "3 V's of Big Data"?
Veracity
Search engine optimization (SEO) is a means by which
Web site developers can increase Web site search rankings.
In text analysis, what is a lexicon? a. a catalog of words, their synonyms, and their meanings b. a catalog of customers, their words, and phrases c. a catalog of letters, words, phrases, and sentences d. a catalog of customers, products, words, and phrases
a catalog of words, their synonyms, and their meanings
In the opening vignette, the architectural system that supported Watson used all the following elements EXCEPT
a core engine that could operate seamlessly in another domain without changes.
The Survey of 2017 estimated the total cost (to the US yearly economy) of dirty data to be approximately: a. 3.1 Trillion USD b. 600 million USD c. 3.1 Billion USD d. 600 Trillion USD
a. 3.1 Trillion USD
Understanding customers better has helped Amazon and others become more successful. The understanding comes primarily from
analyzing the vast data amounts routinely collected
What is the Hadoop Distributed File System (HDFS) designed to handle? a. unstructured and semistructured relational data b. unstructured and semistructured non-relational data c. structured and semistructured relational data d. structured and semistructured non-relational data
b. unstructured and semistructured non-relational data
Which of the following sources is likely to produce Big Data the fastest? a. order entry clerks b. cashiers c. RFID tags d. online customers
c. RFID tags
Which broad area of data mining applications analyzes data, forming rules to distinguish between defined classes? a. associations b. visualization c. classification d. clustering
c. classification
Under which of the following requirements would it be more appropriate to use Hadoop over a data warehouse? a. ANSI 2003 SQL compliance is required b. online archives alternative to tape c. unrestricted, ungoverned sandbox explorations d. analysis of provisional data
c. unrestricted, ungoverned sandbox explorations
In the Influence Health case study, what was the goal of the system?
increasing service use
Data flows can be highly inconsistent, with periodic peaks, making data loads hard to manage. What is this feature of Big Data called? a. volatility b. periodicity c. inconsistency d. variability
d. variability
In the Analyzing Disease Patterns from an Electronic Medical Records Data Warehouse case study, what was the analytic goal?
determine differences in rates of disease in urban and rural populations
Understanding which keywords your users enter to reach your Web site through a search engine can help you understand
how well visitors understand your products
Breaking up a Web page into its components to identify worthy words/terms and indexing them using a set of rules is called
parsing the documents
Prediction problems where the variables have numeric values are most accurately defined as
regressions
In the Wimbledon case study, the tournament used data for each match in real time to highlight
significant events
In the research literature case study, the researchers analyzing academic papers extracted information from which source?
the paper abstract
In estimating the accuracy of data mining (or other) classification models, the true positive rate is what?
the ratio of correctly classified positives divided by the total positive count.
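The true positive rate described above can be sketched as correctly classified positives divided by all actual positives. The labels below are made-up data.

```python
def true_positive_rate(actual, predicted):
    """Correctly classified positives divided by the total number of actual positives."""
    positives = sum(1 for a in actual if a)
    if positives == 0:
        return 0.0  # no actual positives, so the rate is undefined; 0.0 is a convention here
    true_positives = sum(1 for a, p in zip(actual, predicted) if a and p)
    return true_positives / positives


# Three actual positives, two of them caught by the classifier.
actual = [True, True, True, False, False]
predicted = [True, False, True, True, False]
true_positive_rate(actual, predicted)  # two out of three positives found
```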
Clustering partitions a collection of things into segments whose members share
similar characteristics
A company/organization can encounter dirty data in the form of: A. All of these B. Invalid mailing address C. invalid email address D. duplicated data
A. All of these
You are tasked with accumulating survey data on a web page and are responsible for it being free from dirty data once you close the survey and get the data to the researching team. Which is the best way to handle the possibility of dirty data? A. Build a website that validates data as the survey participant takes the survey B. Let your friend throw a survey site together that accumulates the data and you export it into a spreadsheet and fix the data manually C. Have the survey site email you when it encounters data that is not formatted correctly D. Have the survey accumulate the data and then email the survey participant after the survey is processed asking them to retake it due to invalid data.
A. Build a website that validates data as the survey participant takes the survey
All of the following statements about data mining are true EXCEPT: A. The process aspect means that data mining should be a one-step process to results B. the novel aspect means that previously unknown patterns are discovered C. the potentially useful aspect means that results should lead to some business benefit D. the valid aspect means that the discovered patterns should hold true on new data
A. The process aspect means that data mining should be a one-step process to results
What do voice of the market (VOM) applications of sentiment analysis do? A. They examine customer sentiment at the aggregate level. B. They examine employee sentiment in the organization. C. They examine the stock market for trends. D. They examine the "market of ideas" in politics
A. They examine customer sentiment at the aggregate level.
In data mining, finding an affinity of two products to be commonly together in a shopping cart is known as: A. association rule mining B. cluster analysis C. decision trees D. artificial neural networks
A. association rule mining
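One common measure in association rule mining is confidence: of the carts that contain the first product, what share also contain the second. The sketch below uses invented baskets and a hypothetical function name.

```python
def rule_confidence(baskets, antecedent, consequent):
    """Share of baskets containing the antecedent that also contain the consequent."""
    with_antecedent = [basket for basket in baskets if antecedent in basket]
    if not with_antecedent:
        return 0.0
    return sum(1 for basket in with_antecedent if consequent in basket) / len(with_antecedent)


# Four made-up shopping carts.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
]
rule_confidence(baskets, "bread", "milk")  # 2 of the 3 bread baskets also hold milk
```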
Identifying and preventing incorrect claim payments and fraudulent activities falls under which type of data mining applications? A. insurance B. retailing and logistics C. customer relationship management D. computer hardware and software
A. insurance