Week 1-8

Given the following itemsets: {A,B,C} {B,C} {A,B,C,D} {A,C,D} {C,D} {B,C} {B,D} What is the interest of {A}->C ?

0.143
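A minimal Python sketch for checking this value, assuming interest is defined as the rule's confidence minus the fraction of baskets that contain the right-hand side. The helper names `support`, `confidence`, and `interest` are illustrative and not taken from the course material.

```python
# The seven baskets from the question.
baskets = [
    {"A", "B", "C"}, {"B", "C"}, {"A", "B", "C", "D"}, {"A", "C", "D"},
    {"C", "D"}, {"B", "C"}, {"B", "D"},
]

def support(itemset):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if set(itemset) <= basket)

def confidence(lhs, rhs):
    """support(lhs union rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

def interest(lhs, rhs):
    """Assumed definition: confidence minus the fraction of baskets with rhs."""
    return confidence(lhs, rhs) - support(rhs) / len(baskets)

print(round(interest({"A"}, {"C"}), 3))  # 1 - 6/7, rounded
```

The same helpers also cover the support and confidence cards that use this basket list.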

Given the following itemsets: {A,B,C} {B,C} {A,B,C,D} {A,C,D} {C,D} {B,C} {B,D} What is the confidence of {B,C}->D ?

0.25

If the support of {A,B} is 4 and the support of {A} is 8, what is the confidence of the rule {A}->B ?

0.5

The following are true labels and predicted labels for a test set, listed as (True, Predicted) pairs: (1, 0), (0, 0), (1, 0), (0, 0), (1, 0), (1, 0), (0, 0), (0, 0), (1, 1), (1, 0). What is the F1-measure value of the predictions?

29%
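A short Python sketch of the arithmetic behind this answer, assuming the label list is read as consecutive (true, predicted) pairs and that class 1 is the positive class.

```python
# (true, predicted) pairs read off the question, two values at a time.
pairs = [(1, 0), (0, 0), (1, 0), (0, 0), (1, 0),
         (1, 0), (0, 0), (0, 0), (1, 1), (1, 0)]

tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.2f}")
```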

Given the following itemsets: {A,B,C} {B,C} {A,B,C,D} {A,C,D} {C,D} {B,C} {B,D} What is the support of {A,C} ?

3

If the expected revenue per click for a particular advertiser is $0.02 and the bid for that advertiser was $0.50, what is the ClickThrough Rate?

4%

Assume you have the following set of preferences of people over seats at a table: John: {Seat1, Seat2, Seat3}, Mary: {Seat2, Seat4}, Bob: {Seat1, Seat3, Seat4}, Alice: {Seat3, Seat5}, Cindy: {Seat1, Seat2, Seat3}. What is the competitive ratio of the greedy algorithm vs. the optimal seating arrangement?

4/5
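One way to reproduce this ratio in Python is sketched below. It assumes people arrive in the order listed and that greedy gives each person the first free seat in their list. The optimal count uses a simple augmenting-path matching. The function names are illustrative.

```python
from fractions import Fraction

# Seat preferences from the question, in arrival order (an assumption).
prefs = {
    "John": [1, 2, 3],
    "Mary": [2, 4],
    "Bob": [1, 3, 4],
    "Alice": [3, 5],
    "Cindy": [1, 2, 3],
}

def greedy_matching(prefs):
    """Seat each person at the first free seat they like; never revise."""
    taken, matched = set(), 0
    for seats in prefs.values():
        for seat in seats:
            if seat not in taken:
                taken.add(seat)
                matched += 1
                break
    return matched

def optimal_matching(prefs):
    """Maximum bipartite matching via augmenting paths."""
    seat_owner = {}

    def try_assign(person, visited):
        for seat in prefs[person]:
            if seat in visited:
                continue
            visited.add(seat)
            # Take a free seat, or move the current owner elsewhere.
            if seat not in seat_owner or try_assign(seat_owner[seat], visited):
                seat_owner[seat] = person
                return True
        return False

    return sum(1 for person in prefs if try_assign(person, set()))

ratio = Fraction(greedy_matching(prefs), optimal_matching(prefs))
print(ratio)
```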

The following are true labels and predicted labels for a test set, listed as (True, Predicted) pairs: (1, 1), (0, 0), (1, 1), (0, 0), (1, 1), (1, 0), (0, 1), (0, 1), (1, 1), (1, 1). What is the accuracy of the predictions?

70%

Given the following sets of integers: A={1, 2, 3, 4} B={2, 3, 5, 7} C={2, 4, 6} Which pair has the highest Jaccard similarity?

A and C
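A small Python check of the three pairwise similarities, using the usual definition of Jaccard similarity as intersection size divided by union size.

```python
from itertools import combinations

sets = {"A": {1, 2, 3, 4}, "B": {2, 3, 5, 7}, "C": {2, 4, 6}}

def jaccard(x, y):
    """|x intersect y| / |x union y|."""
    return len(x & y) / len(x | y)

scores = {
    (a, b): jaccard(sets[a], sets[b])
    for a, b in combinations(sorted(sets), 2)
}
best = max(scores, key=scores.get)

for pair, score in scores.items():
    print(pair, round(score, 3))
print("highest:", best)
```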

Which of the following attribute types is most fitting for an attribute specifying whether a person is employed?

Binary

Identify all the data mining tasks that could be naturally associated with the following application: Recommending a possible friend for a user of a social network.

Collaborative Filtering, Classification, Link Analysis

Which of the following is an appropriate way of determining the number of clusters for the k-means method?

Compute average cluster diameter for different values of k, graph the diameter values over k, and choose a knee point. Compute average silhouette scores for different values of k and choose k that generates the largest value.

At a high level, what is the purpose of the following mapper and reducer combination?
*Mapper*
FOR EACH line IN input
    FOR EACH number IN line
        emit (<"s", number^2>)
*Reducer*
t = 0
FOR EACH value IN input list
    t = t + value
emit(<"t", t>)

Compute the sum of the squares of all integers in the input
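The mapper and reducer can be simulated locally in Python, as sketched here. This is not Hadoop code: the `run` driver and the whitespace-separated input format are assumptions made for illustration.

```python
from collections import defaultdict

def mapper(line):
    """Emit ("s", n^2) for every integer n on the line."""
    for token in line.split():
        yield ("s", int(token) ** 2)

def reducer(key, values):
    """Add up all values received for one key and emit the total."""
    t = 0
    for value in values:
        t += value
    yield ("t", t)

def run(lines):
    """Tiny stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output

result = run(["1 2 3", "4"])
print(result)  # 1 + 4 + 9 + 16 under the single key "s"
```

Because every mapper output shares the key "s", a single reducer call sees all the squares and emits one total.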

How does Hadoop know that DataNodes fail?

DataNodes stop sending heartbeat messages

Which of the following distance measures is most applicable for comparing strings?

Edit distance

In clustering tasks, each data instance contains a class attribute, and the goal is to find a function that maps instances into classes.

False

True or False: A node with high hubness value must necessarily have a low authority value.

False

What is the main advantage of using the MSApriori algorithm compared to the standard Apriori?

MSApriori allows for finding rules related to rare items, while limiting the total number of rules generated.

Which of the following attribute types is most fitting for an attribute specifying an object's color (red, green, white, etc.) ?

Nominal

Which of the following attribute types is most fitting for an attribute specifying a person's income as low, medium, or high?

Ordinal

Bagging method

Random forest

Assume you have a database with two relations (i.e. tables): customers and accounts. The schema for customers is composed of the following attributes: customerID (integer), name (string), address (string), phone (string). The schema for accounts is composed of the following attributes: customerID (integer), accountNumber (integer), balance (float). What is the SQL query to find all customer names who have at least one account with balance > $100,000?

SELECT name FROM customers, accounts WHERE customers.customerID=accounts.customerID AND balance>100000
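To try the query, a throwaway in-memory SQLite database can be built from Python's standard `sqlite3` module, as in this sketch. The rows inserted below are made-up sample data, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customerID INTEGER, name TEXT, address TEXT, phone TEXT);
CREATE TABLE accounts (customerID INTEGER, accountNumber INTEGER, balance REAL);
-- Hypothetical sample rows for illustration only.
INSERT INTO customers VALUES (1, 'Ann', '1 Main St', '555-0100');
INSERT INTO customers VALUES (2, 'Raj', '2 Oak Ave', '555-0101');
INSERT INTO accounts VALUES (1, 12345, 250000.0);
INSERT INTO accounts VALUES (2, 67890, 500.0);
""")

query = (
    "SELECT name FROM customers, accounts "
    "WHERE customers.customerID=accounts.customerID AND balance>100000"
)
names = [row[0] for row in conn.execute(query)]
print(names)
```

Note that a customer with several qualifying accounts would appear once per account with this query.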

Structured data

SQL and OLAP

The problem that points in high-dimensional spaces appear to be at about the same distance from each other is called:

The curse of dimensionality

What is the main purpose of a data warehouse?

To provide a repository for the purpose of analytics

True or False: Every complete graph is a maximum clique.

True

True or False: Every strongly connected component of a graph satisfies the property of a weakly connected component.

True

True or False: In Hadoop's HDFS, there is a single NameNode, whose purpose is to manage the file system namespace.

True

True or False: bagging is the best method

True

Assume you have a simple graph with 5 vertices: A, B, C, D, and E. What can we call a traversal of this graph if we start at A, then proceed to C, E, B, A, and finish at D?

Walk, Trail

Given the following adjacency matrix, what is the approximate rank vector after one iteration of the power iteration method (use PageRank model with beta=1)?
M = | 1/2  1/3   0  |
    | 1/2  1/3   0  |
    |  0   1/3   1  |

[5/18, 5/18, 4/9]'
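A sketch of the single power-iteration step in Python, assuming the initial rank vector is uniform (1/3 each) and that beta=1 means no teleportation, so the update is simply r' = M r. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction as F

# Column-stochastic matrix M from the question.
M = [
    [F(1, 2), F(1, 3), F(0)],
    [F(1, 2), F(1, 3), F(0)],
    [F(0),    F(1, 3), F(1)],
]

def power_step(M, r):
    """One iteration r' = M r (beta = 1, so no teleport term)."""
    n = len(r)
    return [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]

r0 = [F(1, 3)] * 3          # assumed uniform starting vector
r1 = power_step(M, r0)
print([str(x) for x in r1])
```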

A data schema is

a description of a data set's attributes and their properties

Triangular matrix method stores values as

arrays

Two PCY methods:

counting pairs and triangular matrix method

A-Priori assumes what?

downward closure

cliques

subsets of vertices of an undirected graph in which every two distinct vertices are adjacent

Which of the following algorithms would be more applicable to use for clustering when the number of clusters is known and time efficiency is an issue?

k-means

What are the two ways to check if clusters are valid?

knee point and silhouette score

Use the counting pairs method when there are

fewer values

PCY saves what?

memory

Use the triangular matrix method when there are

more values

Properties of an undirected graph

no direction, order doesn't matter, edges are set

cycle

non empty trail in which the only repeated vertices are the first and the last

walk

sequence of edges and vertices

Jaccard similarity is used for

sets

Properties of a directed graph

shows direction, order matters, edges are tuples

Support is

the number of baskets containing all items in I

path

trail with distinct edges and vertices

Counting pairs method stores values as

tuples

AdaBoost is susceptible to

uniform noise

MS A-Priori assumes what?

values are stored in ascending order of MIS value

Cosine similarity is used for

vectors

closed walk

walk that begins and ends on the same vertex

trail

walk with distinct edges

Given the following itemsets: {A,B,C} {B,C} {A,B,C,D} {A,C,D} {C,D} {B,C} {B,D} What will be in the candidate set C3 when applying the A-Priori algorithm with a support threshold of 3?

{ {A,B,C}, {A,C,D}, {B,C,D} }

What is the graph diameter of a complete graph of 10 vertices?

1

Assume you have 3 documents with the following terms: D1 = "computer", "web", "storage", "options" D2 = "computer", "game", "development" D3 = "web", "development", "frameworks" If the query Q is composed of terms "computer" and "development", what is the relevance of the query to document D2, using the TF.IDF measure?

1.16992

Assume you have 5 documents with the following terms: *D1 = "data", "web", "storage", "options" *D2 = "computer", "game", "development" *D3 = "web", "mining", "development", "frameworks" *D4 = "programming", "game", "course", "programming", "mining", "game" *D5 = "data", "mining", "course" If the query Q is composed of terms "game" and "mining", what is the relevance of document D4 to the query, using the TF.IDF measure?

1.69
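The score can be reproduced with the Python sketch below. It assumes TF is the term count divided by the count of the most frequent term in the document, IDF is log2(N / number of documents containing the term), and relevance is the sum of TF times IDF over the query terms. The function name `relevance` is illustrative.

```python
import math
from collections import Counter

docs = {
    "D1": ["data", "web", "storage", "options"],
    "D2": ["computer", "game", "development"],
    "D3": ["web", "mining", "development", "frameworks"],
    "D4": ["programming", "game", "course", "programming", "mining", "game"],
    "D5": ["data", "mining", "course"],
}

def relevance(query, doc_id, docs):
    """Sum of TF * IDF over the query terms for one document."""
    counts = Counter(docs[doc_id])
    max_count = max(counts.values())
    score = 0.0
    for term in query:
        doc_freq = sum(1 for terms in docs.values() if term in terms)
        if doc_freq == 0 or counts[term] == 0:
            continue  # term contributes nothing
        tf = counts[term] / max_count          # normalized term frequency
        idf = math.log2(len(docs) / doc_freq)  # inverse document frequency
        score += tf * idf
    return score

print(round(relevance(["game", "mining"], "D4", docs), 2))
```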

What is the worst case and best case competitive ratio for the greedy bipartite matching algorithm?

1/2 worst case and 1/1 best case

Given the following characteristic matrix:
Document 1  Document 2  Document 3
    1           1           0
    0           1           1
    1           0           1
    1           1           0
and permutations: P1 = (2,4,3,1), P2 = (1,2,4,3), P3 = (4,3,2,1). What is the minhash signature of document 3?

2,2,3
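A Python sketch of the signature computation. It assumes each permutation lists row numbers (1-based) in the order they are visited, and that the minhash value is the first row number in that order whose entry for the document is 1.

```python
# Characteristic matrix: row number -> (Document 1, Document 2, Document 3).
rows = {
    1: (1, 1, 0),
    2: (0, 1, 1),
    3: (1, 0, 1),
    4: (1, 1, 0),
}
permutations = [(2, 4, 3, 1), (1, 2, 4, 3), (4, 3, 2, 1)]

def minhash(doc_index, permutation):
    """First row in permuted order that has a 1 in the given column."""
    for row in permutation:
        if rows[row][doc_index] == 1:
            return row
    return None  # the document contains no rows at all

signature_doc3 = [minhash(2, p) for p in permutations]
print(signature_doc3)
```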

Assume you have the following data points: 1, 4, 15, 20, 42 and you execute the k-means algorithm to cluster the data with k=2. If 1 is the first cluster centroid and 4 the second cluster centroid, then what are the final centroid values after the k-means algorithm finishes?

2.5 and 25.67
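The iterations can be traced with a small one-dimensional k-means sketch in Python. The function name `kmeans_1d` and the stopping rule (stop when the centroids no longer change) are choices made for this illustration.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain k-means on numbers: assign to nearest centroid, then re-average."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

final_centroids, final_clusters = kmeans_1d([1, 4, 15, 20, 42], [1, 4])
print([round(c, 2) for c in final_centroids], final_clusters)
```

In this run the point 4 starts in the second cluster and moves to the first one once the second centroid has shifted to 20.25.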

The following are true labels and predicted labels for a test set, listed as (True, Predicted) pairs: (1, 1), (0, 0), (1, 1), (0, 0), (1, 1), (1, 0), (0, 1), (0, 1), (1, 1), (1, 1). What is the precision of the predictions?

71%

The following are true labels and predicted labels for a test set, listed as (True, Predicted) pairs: (1, 1), (0, 0), (1, 1), (0, 0), (1, 1), (1, 0), (0, 1), (0, 1), (1, 1), (1, 1). What is the recall of the predictions?

83%
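The accuracy, precision, and recall for this test set all follow from the four confusion-matrix counts. The Python sketch below assumes the labels are read as consecutive (true, predicted) pairs with 1 as the positive class. The name `confusion_counts` is illustrative.

```python
pairs = [(1, 1), (0, 0), (1, 1), (0, 0), (1, 1),
         (1, 0), (0, 1), (0, 1), (1, 1), (1, 1)]

def confusion_counts(pairs):
    """Return (TP, FP, FN, TN) for binary (true, predicted) pairs."""
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(pairs)
accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.0%} precision={precision:.0%} recall={recall:.0%}")
```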

Assume the output from Mappers in a MapReduce application is the following set of key/value pairs: <1, 1>, <1, 2>, <1, 3>, <2, 1>, <2, 2>, <3, 1> What is the set of inputs that are given to the reducers?

<1, [1, 2, 3]>, <2, [1, 2]>, <3, [1]>

What is the initial rank vector that should be used in the power iteration method to guarantee convergence?

A vector with any values that sum to 1

True negative

Actual = 0, Predicted = 0

False positive

Actual = 0, Predicted = 1

False negative

Actual = 1, Predicted = 0

True positive

Actual = 1, Predicted = 1

Boosting method

AdaBoost

In the PCY algorithm, if 3 pairs map to the same bucket and the minimum support is 2, which of the following statements is true?

Any pair that maps to this bucket may be frequent

Consider the following scenario: There are three advertisers A, B, and C. A bids on query x, B bids on x and y, and C bids on x, y, and z. All have budgets of $3. Given the following query stream: x x x y y y z z z What is the sequence of choices, assuming the worst case scenario, using the greedy algorithm?

C,C,C,B,B,B,_,_,_
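A Python sketch of how the worst case arises. It assumes each query costs the chosen advertiser $1 (so a $3 budget covers three queries) and models the worst case as greedy breaking ties in the order C, B, A. Both assumptions and the function name are for illustration only.

```python
# Which queries each advertiser bids on.
bids = {"A": {"x"}, "B": {"x", "y"}, "C": {"x", "y", "z"}}

def greedy_choices(stream, tie_order, budget=3):
    """Give each query to the first eligible advertiser in tie_order."""
    remaining = {advertiser: budget for advertiser in bids}
    choices = []
    for query in stream:
        pick = next(
            (a for a in tie_order if query in bids[a] and remaining[a] > 0),
            None,
        )
        if pick is not None:
            remaining[pick] -= 1  # assumed cost of $1 per query
        choices.append(pick if pick is not None else "_")
    return choices

worst = greedy_choices("xxxyyyzzz", tie_order="CBA")
best = greedy_choices("xxxyyyzzz", tie_order="ABC")
print(",".join(worst))
print(",".join(best))
```

With the opposite tie-breaking order all nine queries are served, which is the optimal assignment for this stream.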

You have the following data points: 1, 2, 4, 6, 9. Assume you have done agglomerative hierarchical clustering with clusters represented by their centroids (average), and at each step the clusters with the closest centroids were merged. If you cut off the process once two clusters existed, which points belonged to these clusters?

C1 = {1,2,4,6} C2 = {9}
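The merge sequence can be checked with a short Python sketch of centroid-based agglomerative clustering on one-dimensional points. The function names are illustrative, and ties (none occur here) would be broken by the first closest pair found.

```python
def centroid(cluster):
    return sum(cluster) / len(cluster)

def agglomerate(points, k):
    """Merge the two clusters with the closest centroids until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(centroid(clusters[i]) - centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return clusters

result = sorted(sorted(c) for c in agglomerate([1, 2, 4, 6, 9], 2))
print(result)
```

The merges happen in the order {1,2}, then {4,6}, then {1,2} with {4,6}, leaving 9 on its own.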

Identify all the data mining tasks that could be naturally associated with the following application: Recommending a wine based on the client's preferences (e.g. I want a sweet red wine).

Collaborative Filtering, Classification

True or False: According to the PageRank model, a page with more in-links will always have higher rank than a page with less in-links

False

True or False: An online algorithm will look at all the inputs before making a decision.

False

True or False: Based on the TF.IDF measure of similarity, if a term appears in many documents, it will be given a higher IDF value.

False

True or False: If both knn and decision tree learning are used and the trained models for both are generated, the knn method will be generally faster to make a prediction than a decision tree.

False

True or False: Occam's Razor favors more complex models over simpler ones.

False

True or False: The purpose of shingling is to reduce the representation size of a text document.

False

What are the two clustering methods?

HAC and k-means

Unstructured data

Hadoop

Which of the following algorithms would be the most appropriate to use for determining how many topics exist in a large collection of papers written about different topics?

Hierarchical agglomerative clustering

Which of the following measures is commonly used as a splitting criterion for decision tree induction?

Information gain

Assume you have a database with two relations (i.e. tables): customers and accounts. The schema for customers is composed of the following attributes: customerID (integer), name (string), address (string), phone (string). The schema for accounts is composed of the following attributes: customerID (integer), accountNumber (integer), balance (float). What is the SQL query to find the customer name with account number 12345?

SELECT name FROM customers, accounts WHERE customers.customerID=accounts.customerID AND accountNumber=12345

True or False: A machine learning algorithm may have overfit when its performance on training examples is significantly higher than on testing examples.

True

True or False: Assuming there is sufficient memory to store all pairs of frequent items, the PCY algorithm runs slower than the standard A-Priori algorithm.

True

True or False: Collaborative filtering is a technique most often used for recommendation systems

True

