CAP 4770


Suppose that the PAM k-medoids algorithm is being used. Assuming that in the current iteration the total cost of swapping a non-representative object with a representative object would yield a value S = -61.988, should the two of them be swapped? Enter 0 if they should be swapped, enter 1 if they should not be swapped.

0

Suppose that the k-medoids algorithm is being used to obtain a partitioning of a dataset of 2-dimensional points. Consider two representative points m1 and m2: m1 = (10, 45), m2 = (35, 26). Suppose that p = (8, 59) is currently assigned to the cluster represented by m1. If m2 were replaced by a random non-representative point r = (42, 17), would p be assigned to the cluster represented by r, or would p remain assigned to the cluster represented by m1? Assume the Euclidean distance is used as the dissimilarity measure. Enter 1 if p would be assigned to the cluster represented by r. Enter 0 if p would remain assigned to the cluster represented by m1.

0
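
The comparison behind this answer can be sketched in Python. The helper name `nearest_representative` is hypothetical (it is not from the course material); it simply picks the candidate representative with the smallest Euclidean distance to p.

```python
import math

def nearest_representative(p, representatives):
    """Return the label of the representative closest to p (Euclidean distance)."""
    return min(representatives, key=lambda label: math.dist(p, representatives[label]))

p = (8, 59)
# Representatives after the hypothetical swap: m2 has been replaced by r.
candidates = {"m1": (10, 45), "r": (42, 17)}
print(nearest_representative(p, candidates))  # m1, so the answer is 0
```

Here d(p, m1) is about 14.1 while d(p, r) is about 54.0, so p stays with m1.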

Suppose you have obtained a clustering of the dataset under consideration, using one of the methods studied in the course. Now you want to assess the quality of the clustering and you decide to examine the average silhouette coefficient, s. If s = -0.916, would you evaluate the quality of the clustering as good? Enter 1 if the answer is "Yes, it is likely good"; enter 0 if the answer is "No, it is not likely good".

0
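
As a reminder of why a negative value is bad, here is a minimal sketch of the silhouette coefficient of a single object, s = (b - a) / max(a, b). The function name and the sample values of a and b are illustrative assumptions, not data from the question.

```python
def silhouette(a, b):
    """Silhouette coefficient of one object.

    a: mean distance from the object to the other objects in its own cluster
    b: smallest mean distance from the object to the objects of any other cluster
    """
    return (b - a) / max(a, b)

# Illustrative values only: the object is far from its own cluster (a large)
# and close to another cluster (b small), so the coefficient is negative.
print(round(silhouette(a=9.0, b=1.0), 3))  # -0.889
```

Values near 1 indicate compact, well-separated clusters; values near -1 suggest objects sit closer to another cluster than to their own.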

Consider a data set corresponding to readings from a distance sensor: 3, 14, 9, 27, 7, 15, 55, 82, 95, 95. If normalization by decimal scaling is applied to the set, what would be the normalized value of the first reading, 3?

0.03
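
A minimal sketch of decimal scaling, assuming the usual definition: divide every value by 10^j, where j is the smallest integer such that the largest absolute scaled value is below 1. The function name `decimal_scale` is hypothetical.

```python
def decimal_scale(values):
    """Divide by 10**j, with j the smallest integer such that max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

readings = [3, 14, 9, 27, 7, 15, 55, 82, 95, 95]
print(decimal_scale(readings)[0])  # 0.03
```

The maximum reading is 95, so j = 2 and every value is divided by 100.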

Let x and y be vectors for comparison: x = (3, 20) and y = (19, 4). Compute the cosine similarity between the two vectors. Round the result to two decimal places.

0.35
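
The computation can be checked with a short sketch (the function name is an assumption for illustration): the dot product divided by the product of the vector norms.

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(round(cosine_similarity((3, 20), (19, 4)), 2))  # 0.35
```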

Suppose that the minimum and maximum values for the attribute temperature are 37 and 71, respectively. Map the value 54 to the range [0, 1] . Round your answer to 1 decimal place.

0.5
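
A minimal sketch of min-max normalization, with hypothetical parameter names; the target range defaults to [0, 1] as in the question.

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(round(min_max(54, 37, 71), 1))  # 0.5
```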

How many cuboids are there in a 7-dimensional data cube if there are no hierarchies associated with any dimension?

128
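
The count follows from the general formula: the total number of cuboids is the product of (L_i + 1) over all dimensions, where L_i is the number of levels of dimension i. With no hierarchies, L_i = 1 for every dimension, which gives 2^n. A small sketch (function name assumed for illustration):

```python
from math import prod

def cuboid_count(levels_per_dimension):
    """Product of (L_i + 1); L_i = 1 for a dimension without a hierarchy."""
    return prod(levels + 1 for levels in levels_per_dimension)

print(cuboid_count([1] * 7))  # 128
```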

Consider the data points p and q: p = (2, 13) and q = (17, 7). Compute the Minkowski distance between p and q using h = 4. Round the result to one decimal place.

15.1
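
A short sketch of the Minkowski distance of order h (the function name is an assumption for illustration):

```python
def minkowski(p, q, h):
    """Minkowski distance of order h between points p and q."""
    return sum(abs(a - b) ** h for a, b in zip(p, q)) ** (1 / h)

print(round(minkowski((2, 13), (17, 7), 4), 1))  # 15.1
```

Setting h = 1 gives the Manhattan distance and h = 2 gives the Euclidean distance.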

Suppose that a data set has been partitioned into two clusters, C1 and C2, with centroids c1 = (10, 5) and c2 = (1, 10), respectively. Cluster C1 has been assigned the points p1 = (5, 5), p2 = (2, 4), p3 = (8, 8), and cluster C2 the points p4 = (5, 7), p5 = (2, 5). Calculate the within-cluster variation of the given partitioning.

154
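
The within-cluster variation is the sum of squared Euclidean distances from each point to the centroid of its cluster. A minimal sketch, using a hypothetical dictionary layout that maps each centroid to its assigned points:

```python
def within_cluster_variation(clusters):
    """clusters maps each centroid (a tuple) to the list of points assigned to it."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(point, centroid))
        for centroid, points in clusters.items()
        for point in points
    )

partition = {
    (10, 5): [(5, 5), (2, 4), (8, 8)],  # C1 contributes 25 + 65 + 13 = 103
    (1, 10): [(5, 7), (2, 5)],          # C2 contributes 25 + 26 = 51
}
print(within_cluster_variation(partition))  # 154
```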

Assume that a data set has been partitioned into bins of size 3 as follows: Bin 1: 10, 14, 15; Bin 2: 16, 20, 21; Bin 3: 23, 27, 34. Which would be the first value of the second bin if smoothing by bin means is performed? Round your result to two decimal places.

19
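
Smoothing by bin means replaces every value in a bin with that bin's mean. A minimal sketch (function name assumed for illustration):

```python
def smooth_by_bin_means(bins):
    """Replace each value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

bins = [[10, 14, 15], [16, 20, 21], [23, 27, 34]]
print(smooth_by_bin_means(bins)[1][0])  # 19.0
```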

Consider the point p and assume that the k-means algorithm is being used. If c1 and c2 are the centroids of cluster 1 and cluster 2, respectively, which cluster would you assign p to? Use the data below for your analysis: p = (18, 9), c1 = (5, 6), c2 = (14, 8). (Enter 1 if p should be assigned to cluster 1, and 2 if p should be assigned to cluster 2.)

2

Let D be a dataset of two-dimensional points to be partitioned into 5 clusters. After applying two different centroid-based partitioning algorithms, two different partitions of 5 clusters are obtained, P1 and P2 (that is, P1 consists of 5 clusters and P2 consists of 5 clusters). Let E1 and E2 be the within-cluster sum of squares of P1 and P2, respectively. If E1 = 239.24 and E2 = 149.86, which one would you consider being of a better quality, P1 or P2? Enter 1 if P1 is your choice, 2 if it is P2.

2

Suppose that the data for analysis includes the attribute time. The time values for the data tuples are: 292, 274, 334, 247, 272, 293, 232, 316, 295, 251, 256. What is the value of the midrange? Round the result to the nearest integer.

283
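
The midrange is the average of the smallest and largest values. A quick check in Python (the variable names are illustrative):

```python
times = [292, 274, 334, 247, 272, 293, 232, 316, 295, 251, 256]

# Midrange: the mean of the minimum and the maximum.
midrange = (min(times) + max(times)) / 2
print(round(midrange))  # 283
```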

Suppose that the data for analysis includes the attribute salary. The salary values for the data tuples are (in increasing order): 33,488; 34,659; 35,831; 36,677; 37,284; 38,924. What is the value of the mean? Round the result to two decimal places.

36,143.83

Suppose that the k-means algorithm is being used. Assuming that in the previous iteration the following points were assigned to cluster C: p1 = (3, 10), p2 = (4, 5), p3 = (6, 9), what would be the x coordinate of the new center of C? Round your answer to 1 decimal place.

4.3
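
In k-means the new center is the coordinate-wise mean of the points assigned to the cluster. A minimal sketch (function name assumed for illustration):

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

cx, cy = centroid([(3, 10), (4, 5), (6, 9)])
print(round(cx, 1))  # 4.3
```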

Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order): 20, 24, 29, 35, 39, 45, 50, 55, 55, 64. What is the value of the median? Round the result to the nearest integer.

42
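
With an even number of values, the median is the average of the two middle values (here 39 and 45). This can be checked with Python's standard `statistics` module:

```python
import statistics

ages = [20, 24, 29, 35, 39, 45, 50, 55, 55, 64]

# For an even count, statistics.median averages the two middle values.
print(round(statistics.median(ages)))  # 42
```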

Consider a cluster C obtained with a centroid-based partitioning method. Assume that the center of C is c and p is a point in C, where c = (5, 6) and p = (4, 4). Calculate the squared error of p. Round your answer to 1 decimal place.

5

Consider the data points p and q: p = (7, 12) and q = (12, 9). Compute the Euclidean distance between p and q. Round the result to one decimal place.

5.8

Consider seven points in 1-D, the one-dimensional space. Suppose a partitioning into two clusters C1 and C2 has been obtained by a k-medoids method (i.e., k = 2). Let m1 and m2 be the representative objects of C1 and C2, respectively, with m1 = 26 and m2 = 14. Cluster C1 has been assigned the (non-representative) points p1 = 3, p2 = 34, and cluster C2 the points p3 = 17, p4 = 2, p5 = 4. Using the Manhattan distance (i.e., the absolute value of the difference between two points) as the dissimilarity measure, calculate the absolute error E of the given partitioning.

56
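
The absolute error sums the dissimilarity between each point and the representative object of its cluster. A minimal 1-D sketch, using a hypothetical dictionary that maps each medoid to its non-representative points:

```python
def absolute_error(clusters):
    """clusters maps each medoid to the points assigned to it (1-D, Manhattan distance)."""
    return sum(abs(p - medoid) for medoid, points in clusters.items() for p in points)

# C1: |3 - 26| + |34 - 26| = 31; C2: |17 - 14| + |2 - 14| + |4 - 14| = 25
print(absolute_error({26: [3, 34], 14: [17, 2, 4]}))  # 56
```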

Which of the following is NOT a data warehouse? -A system that supports users in making decisions based on current data -A data repository maintained separately from a business' operational database -A physically separate store of data obtained from application data found in an operational environment -A system that supports users in making decisions based on historic data

A system that supports users in making decisions based on current data

___ is a neural network algorithm. -Cross-validation -A Bayesian classifier -A support vector machine -Backpropagation

Backpropagation

___ are statistical classifiers. -Ensemble methods -Decision trees -Bayesian classifiers -IF-THEN rules

Bayesian classifiers

Consider the decision tree given below, which represents the concept fruit. An IF-THEN rule extracted from it is ___. -IF size = medium AND shape = round THEN fruit = mango -IF size = small THEN fruit = (yellow, green, cherry) -IF size = medium OR large THEN fruit = (mango, banana, watermelon) -IF color = green THEN fruit = lemon

IF size = medium AND shape = round THEN fruit = mango

______ describe or define warehouse elements. -Metadata -DBMSs -Queries -Operational algorithms

Metadata

___ is the ability of an algorithm to build a classifier efficiently in the presence of large amounts of data. -Robustness -Quality -Interpretability -Scalability

Scalability

The ___ of a classifier is the percentage of test set tuples that are correctly classified by it. -accuracy -overfit -Gini index -gain ratio

accuracy

Intuitively, the roll-up OLAP operation corresponds to concept ___ in a concept hierarchy. -cooperation -ascension -forecasting -specialization

ascension

An example of an ensemble method used for classification is ___. -bagging -backpropagation -clustering -support vector machines

bagging

Attribute-oriented induction is an alternative to the ___ approach for data generalization. -three tier architecture -back-end tools -data cube -concept hierarchies

data cube

To obtain a reliable estimate of the accuracy of a classifier, several methods can be used. The ___ method randomly partitions the initial data into k mutually exclusive subsets each of approximately the same size, S1, S2, ..., Sk. Training and testing occurs k times: in iteration i, the set Si is used as the test set while the remaining k-1 subsets are used to train the model. -holdout -bootstrap -cross-validation -selection

cross-validation
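
The procedure described in the question can be sketched as follows. This is a minimal illustration, not a library implementation; the function name `k_fold_splits` and the `seed` parameter are assumptions.

```python
import random

def k_fold_splits(data, k, seed=0):
    """Yield (train, test) pairs: fold i is the test set, the other k-1 folds train."""
    items = list(data)
    random.Random(seed).shuffle(items)       # random partitioning
    folds = [items[i::k] for i in range(k)]  # k mutually exclusive, near-equal subsets
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

for train, test in k_fold_splits(range(10), k=5):
    print(len(train), len(test))  # prints "8 2" five times
```

Each tuple appears in exactly one test set, so every tuple is used once for testing and k-1 times for training.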

Among the data warehouse applications, ___ applications support knowledge discovery. -analytical processing -information processing -star schema -data mining

data mining

IF-THEN rules can be extracted from a ___. -matrix -decision tree -neural network -data cube

decision tree

Consider a data cube measure obtained by applying the sum() function. The measure is ___. -distributive -holistic -algebraic -analytic

distributive

Consider the 2-D data cube
    LOCATION        RESOURCE
    South America   10,365
    North America   5,971
    Asia            2,840
which represents information on freshwater resources per country (in cubic kilometers). The cube contains the dimensions location and resource. The concept hierarchy for location is defined as the total order "country < continent." Which operation materializes the view provided below?
    LOCATION        RESOURCE
    Brazil          8,233
    United States   3,069
    Canada          2,902
    China           2,840
    Colombia        2,132
-roll-up -pivot -drill-down -slice

drill-down
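
Drill-down moves from the continent level to the more detailed country level. The inverse operation, roll-up, aggregates the country-level figures back to continents, which makes it easy to verify that the two views are consistent. A minimal sketch (the function name and dictionary layout are assumptions for illustration):

```python
from collections import defaultdict

def roll_up(measure_by_country, continent_of):
    """Aggregate a country-level measure to the continent level by summing."""
    totals = defaultdict(int)
    for country, value in measure_by_country.items():
        totals[continent_of[country]] += value
    return dict(totals)

by_country = {"Brazil": 8233, "United States": 3069, "Canada": 2902,
              "China": 2840, "Colombia": 2132}
continent_of = {"Brazil": "South America", "Colombia": "South America",
                "United States": "North America", "Canada": "North America",
                "China": "Asia"}
print(roll_up(by_country, continent_of))
# {'South America': 10365, 'North America': 5971, 'Asia': 2840}
```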

A major feature of a data warehouse is that ___. -old data is removed periodically to improve performance -it is time-variant -typical users include clerks and database professionals -it focuses on the day-to-day operations of an organization

it is time-variant

The steps of data classification are ___. -learning step and classification step -unsupervised learning step and supervised learning step -clustering step and testing step -training step and model construction step

learning step and classification step

Consider the data points p1 = (25, 31), p2 = (12, 3) and a query point p0 = (30, 4). Which point would be more similar to p0 if you used the supremum distance as the proximity measure?

p2
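
The supremum (Chebyshev) distance is the largest absolute difference over all coordinates. A short sketch (function and variable names are illustrative):

```python
def supremum_distance(p, q):
    """Largest absolute coordinate difference between p and q."""
    return max(abs(a - b) for a, b in zip(p, q))

p0 = (30, 4)
points = {"p1": (25, 31), "p2": (12, 3)}
closest = min(points, key=lambda name: supremum_distance(points[name], p0))
print(closest)  # p2
```

Here d(p1, p0) = max(5, 27) = 27 and d(p2, p0) = max(18, 1) = 18, so p2 is closer.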

A disadvantage of neural networks is ___. -they can't be used to classify patterns on which they have not been trained -poor interpretability -short training times -low tolerance of noisy data

poor interpretability

Tree ___ methods remove the least-reliable branches of a decision tree. -pruning -replication -splitting -boosting

pruning

The ___ OLAP operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. -rotate -drill-down -roll-up -slice

roll-up

In the ___ schema some dimension tables are normalized, generating additional tables. -snowflake -data mart -galaxy -star

snowflake

A disadvantage of support vector machines is that ___. -they provide a compact description of the learned model -in general, they are not accurate -they cannot be used for numeric prediction -the training time can be long for large datasets

the training time can be long for large datasets

In the ___ method, the process to design and construct a data warehouse is sequential, moving onto each phase only if the previous phase is complete. -top-down -spiral -bottom-up -waterfall

waterfall

