Data Analytics and Visualization Final

If points (0,3), (2,1), and (-2,2) are the only points assigned a cluster, what is the centroid for this cluster?

(0,2)
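
A minimal sketch of the centroid calculation, assuming the centroid is the coordinate-wise mean of the assigned points (the helper name `centroid` is illustrative, not from the course):

```python
# Points assigned to the cluster, taken from the question above.
points = [(0, 3), (2, 1), (-2, 2)]

def centroid(pts):
    """Return the coordinate-wise mean of a list of 2D points."""
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    return (mean_x, mean_y)

cluster_centroid = centroid(points)
print(cluster_centroid)  # (0.0, 2.0)
```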

What is the Pearson correlation coefficient r(X,Y) of: X = [0,1,2,3,4] Y = [14,12,10,8,6]

-1
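
The result can be checked by hand with the covariance-over-standard-deviations form of Pearson's r. The sketch below uses only the standard library; the function name `pearson_r` is an illustrative choice.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

X = [0, 1, 2, 3, 4]
Y = [14, 12, 10, 8, 6]
r = pearson_r(X, Y)
print(r)  # -1.0, because Y = 14 - 2*X is a perfect decreasing line
```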

What is the cosine similarity value for two documents with nothing in common?

0

What is the cosine similarity value for two identical documents?

1
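
Both cosine similarity answers can be reproduced with a small bag-of-words sketch. The example sentences and the whitespace tokenizer are assumptions made for illustration only.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two word-count vectors."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

# No shared words: the dot product is 0, so the similarity is 0.
disjoint = cosine_similarity("red apple pie", "blue ocean waves")
# Identical documents: the vectors point the same way, so the similarity is 1
# (up to floating-point rounding).
identical = cosine_similarity("red apple pie", "red apple pie")
```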

Which value remains in my_list after the following list operations? my_list = [0, 2, 4, 6, 8, 1, 3, 5, 7, 9] my_list = my_list[2:7] sorted(my_list) my_list = my_list[3]

1
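
Tracing the operations step by step shows why the answer is 1. The key detail is that `sorted()` returns a new list and leaves `my_list` unchanged.

```python
my_list = [0, 2, 4, 6, 8, 1, 3, 5, 7, 9]

my_list = my_list[2:7]         # indices 2..6 -> [4, 6, 8, 1, 3]
after_slice = list(my_list)    # copy kept only to inspect the intermediate state

sorted(my_list)                # returns a new sorted list; the result is discarded
my_list = my_list[3]           # element at index 3 of [4, 6, 8, 1, 3]

print(my_list)  # 1
```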

Given 1000 data points, if we set the minimum number of points required to split a node to 200 and the minimum leaf size to 300, what is the maximum possible depth of the resulting decision tree?

2

How many elements will the result of this list comprehension have? old = [1,3,7] new = [elem*2 for elem in old if elem > 1]

2

How many items remain from the original list after the following slicing operations? my_list = [0, True, 2, 'three', 4, 5, 6, 7, 8, 9, 'ten'] my_list = my_list[:8] my_list = my_list[0:5] my_list = my_list[2:4]

2

What is the Hamming distance between these two data points: A = (1, 2, 3, 4) B = (1, 5, -1, 4)

2

What is the L2 distance between these two data points: A = (1, 2, 3, 4) B = (1, 5, -1, 4)

5

What is the L1 distance between these two data points: A = (1, 2, 3, 4) B = (1, 5, -1, 4)

7
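
The three distance answers above (Hamming, L2, L1) can be verified together. This is a plain standard-library sketch; the variable names are illustrative.

```python
import math

A = (1, 2, 3, 4)
B = (1, 5, -1, 4)

# Hamming distance: number of positions where the values differ.
hamming = sum(a != b for a, b in zip(A, B))

# L1 (Manhattan) distance: sum of absolute differences.
l1 = sum(abs(a - b) for a, b in zip(A, B))

# L2 (Euclidean) distance: square root of the sum of squared differences.
l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(A, B)))

print(hamming, l1, l2)  # 2 7 5.0
```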

Select Multiple Missing data can reasonably be replaced with: A. A fixed value (e.g., 0) B. A computed reduction value (e.g., mean) C. A value randomly sampled from the column D. A value randomly sampled from the row E. A randomly generated value

A. A fixed value (e.g., 0) B. A computed reduction value (e.g., mean) C. A value randomly sampled from the column

What would be the best visual mapping for two quantitative data columns named A and B? A. A->x axis; B->y axis B. A->area; B->x axis C. A->angle; B->y axis D. A->x axis; B->color luminance

A. A->x axis; B->y axis

Which of the following can act as possible termination conditions in k-means fitting? A. Centroids do not change position between successive iterations. B. The assignment of items to clusters does not change between iterations. C. The within-cluster variance falls below a pre-set threshold. D. A fixed number of iterations have passed.

A. Centroids do not change position between successive iterations. B. The assignment of items to clusters does not change between iterations. C. The within-cluster variance falls below a pre-set threshold. D. A fixed number of iterations have passed.

Which of the following are disadvantages of choosing a decision tree classifier? A. Decision trees can easily overfit data. B. Decision trees are difficult to interpret. C. Decision trees are not stable algorithms.

A. Decision trees can easily overfit data. C. Decision trees are not stable algorithms.

Identify the correct statements related to document sentiment classification: A. Document sentiment classification is different from topic-based text classification. B. Three classes are possible: positive, negative, and neutral. C. Sentiment classification completely depends upon opinion words. D. We want to classify a whole document based on the overall sentiment of the opinion holder.

A. Document sentiment classification is different from topic-based text classification. B. Three classes are possible: positive, negative, and neutral. D. We want to classify a whole document based on the overall sentiment of the opinion holder.

Select Multiple When cleaning data, it's a good idea to switch these values to NaN. A. Extreme data B. Invalid data C. Missing data D. Prime number data

A. Extreme data B. Invalid data C. Missing data

Which of the following statements are true about the DBSCAN algorithm? A. It does not require prior knowledge of the number of desired clusters. B. It is capable of detecting outliers. C. For data points to be in a cluster, they must be in a distance threshold to a core point. D. It has strong assumptions for the distribution of data points in dataspace.

A. It does not require prior knowledge of the number of desired clusters. B. It is capable of detecting outliers. C. For data points to be in a cluster, they must be in a distance threshold to a core point.

What is the function of a tag in HTML? A. It indicates some behavior that your web browser uses for formatting B. It tells a web browser that this is the start of an HTML document C. It creates a web site on your computer D. It allows you to open your web browser E. It allows you to link to another web page

A. It indicates some behavior that your web browser uses for formatting

Python list comprehensions are appropriately used to perform which of the following operations? A. Map B. Reduce C. Filter D. Sort E. Slice

A. Map C. Filter
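
A short sketch of how one comprehension expresses a map, a filter, or both, reusing the `old` list from the earlier comprehension question:

```python
old = [1, 3, 7]

# Map: transform every element.
mapped = [elem * 2 for elem in old]            # [2, 6, 14]

# Filter: keep only elements that match a condition.
filtered = [elem for elem in old if elem > 1]  # [3, 7]

# Filter and map combined in one comprehension.
new = [elem * 2 for elem in old if elem > 1]   # [6, 14]
```

Reducing, sorting, and slicing are usually left to `sum()`, `sorted()`, and slice syntax rather than a comprehension.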

If I am using all features of my dataset and I achieve 100% accuracy on my training set but 70% on my testing set, what should I be correcting? A. Model overfit B. Model underfit C. The selected classification algorithm D. Missing data

A. Model overfit

Which multi-dimensional visualization would be best for finding potential correlations between all pairs of columns? A. Scatterplot Matrix B. Heatmap C. Boxplot D. Parallel Coordinates Plot

A. Scatterplot Matrix

Common way(s) to locate data in an HTML page using BeautifulSoup are: A. Search for tags and identifiers. B. Navigate tag hierarchy. C. Query its x,y position on the page. D. Search by the exact data value.

A. Search for tags and identifiers. B. Navigate tag hierarchy.

Which Python data structure would be best for representing a positive word dictionary, so that we can efficiently look up potential positive words in it? A. Set B. String C. List D. CSV File

A. Set
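
As an illustration, a set supports average constant-time membership tests with `in`, while a list has to be scanned element by element. The word list and review text below are made up for the example.

```python
# Hypothetical positive-word dictionary; a real lexicon would be much larger.
positive_words = {"good", "great", "excellent", "happy"}

review = "the food was great and the staff were happy to help"

# Each `in` check against a set is a hash lookup rather than a linear scan.
positive_hits = [word for word in review.split() if word in positive_words]

print(positive_hits)  # ['great', 'happy']
```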

Which of the following are reasons that might cause a random forest to overfit the input data? A. The number of trees B. The percentage of data used for training C. The depth of each tree

A. The number of trees B. The percentage of data used for training C. The depth of each tree

Which of the following statements is true about the k-Nearest Neighbors algorithm? A. The optimal value of k is different for each dataset. B. The algorithm decides the value of k. C. The optimal value of k can change with each question you ask about a dataset. D. The choice of k is irrelevant to the output.

A. The optimal value of k is different for each dataset. C. The optimal value of k can change with each question you ask about a dataset.

What are the two types of hierarchical clustering? A. Top-down (divisive) B. Bottom-up (agglomerative) C. DBSCAN D. dendrogram E. k-means

A. Top-down (divisive) B. Bottom-up (agglomerative)

You want to compute the average of the Quiz1 scores of all the students who got more than 70%. Which execution sequence of operations would be best? A. filter, then reduce B. reduce, then filter C. sort, then slice D. map, then filter

A. filter, then reduce
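
A minimal sketch of the filter-then-reduce sequence, using made-up Quiz1 scores (percentages):

```python
from functools import reduce

# Hypothetical Quiz1 scores, in percent.
quiz1 = [55, 72, 90, 68, 81]

# Filter: keep only the scores above 70%.
passing = list(filter(lambda score: score > 70, quiz1))    # [72, 90, 81]

# Reduce: collapse the remaining scores into a single sum, then average.
total = reduce(lambda acc, score: acc + score, passing, 0)  # 243
average = total / len(passing)

print(average)  # 81.0
```

Reducing first would average all five scores, after which there would be nothing left to filter.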

You have a table of VT students, which contains data about each student's Major and GPA. You want to compute each student's GPA differential in comparison to the average GPA of students of their same Major. Which sequence of operations would be best? A. groupby, reduce, join, map B. join, reduce, map, groupby C. reduce, map, groupby D. union, filter, sort

A. groupby, reduce, join, map
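
The four steps can be sketched with plain dictionaries. The student records are invented, and in practice a DataFrame library such as pandas would typically handle the grouping and joining.

```python
from collections import defaultdict

# Hypothetical student table.
students = [
    {"name": "Ann", "major": "CS", "gpa": 3.5},
    {"name": "Bob", "major": "CS", "gpa": 2.5},
    {"name": "Cy", "major": "Math", "gpa": 4.0},
    {"name": "Di", "major": "Math", "gpa": 3.0},
]

# 1. Groupby: bin the GPAs by major.
gpas_by_major = defaultdict(list)
for s in students:
    gpas_by_major[s["major"]].append(s["gpa"])

# 2. Reduce: summarize each group as a single average.
major_avg = {major: sum(g) / len(g) for major, g in gpas_by_major.items()}

# 3. Join: attach each student's major average to their row.
joined = [{**s, "major_avg": major_avg[s["major"]]} for s in students]

# 4. Map: compute the differential for every row.
differentials = {row["name"]: row["gpa"] - row["major_avg"] for row in joined}

print(differentials)  # {'Ann': 0.5, 'Bob': -0.5, 'Cy': 0.5, 'Di': -0.5}
```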

Select multiple The HTML5 Document Object Model for a web page: A. is organized in a tree structure. B. is composed of tags. C. contains attribute=value pairs. D. stores database tables.

A. is organized in a tree structure. B. is composed of tags. C. contains attribute=value pairs.

Which of these are true about the k-means algorithm we discussed in class? A. k-means can produce different clusterings for the same data and same k. B. k-means always selects exactly k clusters. C. k-means will leave outliers out of all of the k clusters. D. k-means clusters the data by selecting the k most representative rows from the dataset. E. k-means uses a distance matrix.

A. k-means can produce different clusterings for the same data and same k. B. k-means always selects exactly k clusters.

Imputing the mean preserves what attribute(s) of the data, if any? A. mean B. sum C. correctness D. variance

A. mean
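
A small sketch of why only the mean survives: filling the gaps with the mean adds values at the center of the distribution, which changes the sum and shrinks the variance. The column values are made up, and `None` stands in for a missing entry.

```python
import statistics

# Hypothetical column with two missing entries.
column = [2.0, None, 4.0, 6.0, None, 8.0]

observed = [v for v in column if v is not None]
mean = statistics.mean(observed)  # 5.0

# Mean imputation: replace every missing entry with the observed mean.
filled = [mean if v is None else v for v in column]

print(statistics.mean(observed), statistics.mean(filled))            # mean unchanged
print(sum(observed), sum(filled))                                    # sum grows
print(statistics.pvariance(observed), statistics.pvariance(filled))  # variance shrinks
```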

Which of the following can be used to create a DataFrame in pandas? A. A Python tuple B. A Python dictionary C. A scalar value D. Another pandas DataFrame E. A Python list F. A pandas Series

All of the answer choices A. A Python tuple B. A Python dictionary C. A scalar value D. Another pandas DataFrame E. A Python list F. A pandas Series

Which of the following are true of the Euclidean distance matrix m for a quantitative data table d? Assume: d contains n rows and p columns. m[a, b] = value of m at its row a, column b. dist(d, a, b) = Euclidean distance between rows a and b in d. A. m[a, a] == 0 B. m.shape == (n, n) C. m[a, b] == m[b, a] D. m[a, c] <= m[a, b] + m[b, c] E. m[a, b] == dist(d, a, b) F. m[a, b] >= 0

All of the answer choices A. m[a, a] == 0 B. m.shape == (n, n) C. m[a, b] == m[b, a] D. m[a, c] <= m[a, b] + m[b, c] E. m[a, b] == dist(d, a, b) F. m[a, b] >= 0

Define Lambda Expression

An unnamed (or anonymous) function to perform a computation

Define Inner Join

Associate rows between two DataFrames, keeping the intersection of matching values

Define Many-to-one Join

Associate rows between two DataFrames, where only one of the DataFrames has unique keys

Which of the following statements are true about cluster analysis in general? A. We can only visualize clustering results when the data is two-dimensional. B. Different clustering algorithms are able to detect clusters of different sizes and shapes. C. We must know the number of output clusters k before running all clustering algorithms. D. When clustering, we want to put two dissimilar data objects into the same cluster. E. Agglomerative clustering is an example of a distance-based clustering method. F. Cluster analysis doesn't require labeled training data; the goal is to find structure in the data. G. In order to perform cluster analysis, we need to have a similarity or distance measure between data objects.

B. Different clustering algorithms are able to detect clusters of different sizes and shapes. E. Agglomerative clustering is an example of a distance-based clustering method. F. Cluster analysis doesn't require labeled training data; the goal is to find structure in the data. G. In order to perform cluster analysis, we need to have a similarity or distance measure between data objects.

Which multi-dimensional visualization would be best for comparing the values in a table with over 100 rows and over 100 columns? A. Scatterplot Matrix B. Heatmap C. Boxplot D. Parallel Coordinates Plot

B. Heatmap

Which of the following are reasons to use PCA for dimensionality reduction? A. PCA can be used as a replacement for linear regression. B. Reducing the data dimensionality will enable a future algorithm in the analysis pipeline to run more quickly. C. Reducing the data dimensionality to 2D or 3D can give you an intuition of the shape of the data by plotting. D. Principal Components can be added to the original DataFrame to have more dimensions to use for future learning.

B. Reducing the data dimensionality will enable a future algorithm in the analysis pipeline to run more quickly. C. Reducing the data dimensionality to 2D or 3D can give you an intuition of the shape of the data by plotting.

Select Multiple Which of the following statements about Principal Component Analysis are true? A. Because the algorithm may reach a local rather than a global minimum, we should run multiple PCA initializations. B. The number of possible principal components is equal to the number of input dimensions. C. PCA can only reduce dimensionality from n dimensions to n-1 dimensions. D. We should z-score normalize data dimensions prior to running PCA

B. The number of possible principal components is equal to the number of input dimensions. D. We should z-score normalize data dimensions prior to running PCA

In ensemble learning, you aggregate the predictions for weak learners under the assumption that a collection of these models will give a better prediction than the individual models. Which of the following statements are true for weak learners used in these ensembles? A. They have high bias, so they cannot solve complex learning problems B. They usually overfit C. They don't usually overfit

B. They usually overfit

According to Munzner's ranking of visual properties (channels), which choice would be best for comparing quantitative data values? A. motion B. bar chart C. pie chart D. colormap

B. bar chart

Which multi-dimensional visualization would be best for understanding the distributions of each of the columns in a table? A. Scatterplot Matrix B. Heatmap C. Boxplot D. Parallel Coordinates Plot

C. Boxplot

A visualization technique to view a distribution of the popularity of words in a bag-of-words model is: A. Scatterplot matrix B. Regression line C. Histogram D. Scatterplot

C. Histogram

Feature normalization is an important step to perform before running most clustering algorithms. What is the reason behind this? A. The Euclidean distance computation will result in mathematical errors. B. You always get the same clusters when running the analysis repeatedly. C. In distance calculation it will give the same influence for all features. D. None of these.

C. In distance calculation it will give the same influence for all features.
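
To illustrate, the sketch below compares a Euclidean distance before and after z-score normalization. The age and income columns are invented, and the population standard deviation is used as one reasonable choice.

```python
import math
import statistics

def zscore(column):
    """Rescale a column to mean 0 and (population) standard deviation 1."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)
    return [(v - mu) / sigma for v in column]

# Hypothetical features on very different scales.
ages = [20, 30, 40]
incomes = [30000, 50000, 70000]

# Raw distance between the first two rows is dominated by income.
raw_dist = math.dist((ages[0], incomes[0]), (ages[1], incomes[1]))

# After normalization both features contribute equally.
z_ages, z_incomes = zscore(ages), zscore(incomes)
scaled_rows = list(zip(z_ages, z_incomes))
scaled_dist = math.dist(scaled_rows[0], scaled_rows[1])

print(raw_dist)     # roughly 20000: the 10-year age gap barely registers
print(scaled_dist)  # roughly 1.73: age and income now carry equal weight
```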

Throughout a large corpus of English documents, we should expect words such as "the," "a," and "and" to have consistently: A. Higher than average TFIDF B. Lower than average TF C. Lower than average TFIDF D. Higher than average TF

C. Lower than average TFIDF D. Higher than average TF
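
A toy corpus makes the effect visible. The sketch assumes one common TF-IDF variant (relative term frequency multiplied by log(N / document frequency)); other weightings exist, and the three documents are made up.

```python
import math
from collections import Counter

# Hypothetical three-document corpus.
corpus = ["the cat sat on the mat", "the dog chased the cat", "the bird sang"]
docs = [Counter(text.split()) for text in corpus]

def tf(word, doc):
    """Relative frequency of the word within one document."""
    return doc[word] / sum(doc.values())

def idf(word):
    """log(N / df): zero for a word that appears in every document."""
    df = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / df)

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

first = docs[0]
print(tf("the", first), tf("mat", first))        # "the" has the higher TF
print(tfidf("the", first), tfidf("mat", first))  # but the lower TF-IDF
```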

Which of these are true about the Multidimensional Scaling (MDS) algorithm we discussed in class? A. A higher stress value indicates a better MDS fit. B. MDS reduces dimensionality by selecting only the most important columns of the data. C. MDS uses a distance matrix. D. MDS projections can be reflected or rotated without changing its accuracy. E. MDS always reduces p dimensions to 2. F. MDS can produce different 2D projections for the same data. G. MDS attempts to preserve pairwise distances in the projection.

C. MDS uses a distance matrix. D. MDS projections can be reflected or rotated without changing its accuracy. F. MDS can produce different 2D projections for the same data. G. MDS attempts to preserve pairwise distances in the projection.

How would you best describe the Vector Space Model matrix expected for a large set of tweets, where each tweet was treated as a document? A. Dense B. Symmetric C. Sparse D. Diagonal

C. Sparse

The big difference between supervised and unsupervised learning algorithms is: A. Supervised algorithms can only be performed on quantitative data. B. Only clustering algorithms are supervised. C. Unsupervised algorithms don't require labels. D. You need to pay closer attention for errors with supervised algorithms.

C. Unsupervised algorithms don't require labels.

Imputation is best described as: A. finding missing values B. removing erroneous values C. assigning estimated values D. replacing dirty data with NaN

C. assigning estimated values

Which colormap would be best for visualizing quantitative data? A. solid red B. rainbow colormap C. blue saturation ramp colormap

C. blue saturation ramp colormap

The choice of k, the number of clusters to partition a set into: A. is a personal choice that shouldn't be discussed in results that are presented. B. should always be as large as your computer system can handle. C. depends on why and how you are clustering the data. D. has a maximum of 10.

C. depends on why and how you are clustering the data.

What is the purpose of using div tags in HTML? A. For adding titles B. For adding headings C. For creating different styles D. For creating different sections

D. For creating different sections

You want to combine a table of VT students and a table of VT dorm buildings to get a new table of the current VT campus that relates student data to data about the dorm they live in. Which operation would be best? A. One-to-one join B. Union or append C. Many-to-many join D. Many-to-one join

D. Many-to-one join
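
A minimal sketch of a many-to-one join: many students share one dorm, and each dorm key is unique in the dorm table. All names and capacities below are hypothetical.

```python
# Dorm table: keys are unique (the "one" side).
dorms = {
    "Dorm A": {"capacity": 800},
    "Dorm B": {"capacity": 1000},
}

# Student table: the dorm key repeats (the "many" side).
students = [
    {"name": "Ann", "dorm": "Dorm A"},
    {"name": "Bob", "dorm": "Dorm A"},
    {"name": "Cy", "dorm": "Dorm B"},
]

# Join: look up each student's dorm row and merge the two records.
joined = [{**s, **dorms[s["dorm"]]} for s in students]

for row in joined:
    print(row)
```

The result has one row per student, with the dorm data repeated for students who share a building.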

Select one: When representing documents with a bag-of-words model: A. The order of words in each document contains semantic meaning. B. Duplicate words are removed. C. You are forced to choose a large fixed vocabulary, such as the Oxford English Dictionary. D. Multiple documents can be aggregated by concatenating their bags.

D. Multiple documents can be aggregated by concatenating their bags.

Which multi-dimensional visualization would be best for finding multi-dimensional clusters of rows? A. Scatterplot Matrix B. Heatmap C. Boxplot D. Parallel Coordinates Plot

D. Parallel Coordinates Plot

Which chart would be best to visualize the distribution of a single quantitative data column? A. scatter plot B. pie chart C. bar chart D. histogram

D. histogram

A major goal of the QAC approach is to:

Describe a problem and solution in a meaningful way.

Which of the following tags is the correct way to create a hyperlink? A. <a>http://www.website.com</a> B. <a name="http://www.website.com">http://www.website.com</a> C. <a link="http://www.website.com">http://www.website.com</a> D. <a url="http://www.website.com">http://www.website.com</a> E. <a href="http://www.website.com">http://www.website.com</a>

E. <a href="http://www.website.com">http://www.website.com</a>

L2 distance is also known as

Euclidean Distance

Define parse

Extract information from an HTML page

True/False Dimension reduction always perfectly captures the high-dimensional data in low dimensional space, such that the high-dimensional data can be reconstructed from the low-dimensional data.

False

True/False Each dimension found using MDS represents only one attribute in the high-dimensional data.

False

True/False Matplotlib is the only way to create visualizations in Python.

False

True/False: Bar graphs and histograms are the same thing.

False

True/False: Data Science is a linear process of 6 steps with no iterative design

False

True/False: Markdown is a type of python code.

False

True/False: Reduce and Filter operations perform the same actions.

False

True/False: Since Notebooks have Markdown cells to provide explanations, I never need to add comments to my code.

False

True/False: Switching integer strings ("47") to integers (47) doesn't make a difference to future data processing.

False

True/False: The fit of an MDS solution is commonly assessed by a stress measure in which higher values of stress indicate better fits.

False

True/False: I can execute code cells in a Notebook in any order that I want without affecting the overall computation or exploration.

False

Define Imputation

Filling in missing data with representative values

Define Crawl

Find URLs of more pages to retrieve

What does Server-side script do?

Generate HTML page from a database

Categorical Distance is also known as

Hamming Distance

Define the Term: Slice

Identify data based on positional indices

Define the Term: Filter

Identify data that matches a condition

What is the type of the result of the following expression? "1,2,3,4".split(",")

List

L1 distance is also known as

Manhattan Distance

Are two runs of k-means clustering expected to yield the same clustering results?

No

Define the Term: Sort

Order data by its values

A simple linear regression on a dataset with columns A and B produced this result: B = 0.5*A + 3 Therefore, the Pearson correlation coefficient r(A,B) value most likely should be:

Positive

QAC is an acronym for:

Question, Analysis, Conclusion

Define Groupby

Reduce cardinality by binning rows according to a categorical variable

What does a Browser do?

Render an HTML page for human consumption

Define fetch

Retrieve an HTML page from the web

Define provenance

Sequence of code cells in a computational notebook that is used to compute a result

Define the Term: Reduce

Summarize data as a single value

Define Dimension Reduction

Transform data from a high-dimensional space to a low-dimensional space

Define the Term: Map

Transform data into new values through a function

True/False A DataFrame is similar to a dictionary because you can use the index labels to get and set values.

True

True/False Code cells in a Notebook can be executed multiple times.

True

True/False: A Notebook is composed of Markdown, Code, and the results of running that Code.

True

True/False: A pandas Series acts in a way similar to that of an array or list, and a DataFrame is composed of a set of Series.

True

True/False: The standard marker for missing data in pandas is NaN.

True

Define Histogram

Visualize the distribution of a quantitative variable using binned frequencies

As the k-Means algorithm runs, we currently have 3 centroids (0,1), (2,1), and (-1,2). Will points (2,3) and (2,0.5) be assigned to the same cluster in the next iteration?

Yes
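
The assignment step can be checked directly: each point goes to its nearest centroid by Euclidean distance. The helper name `nearest_centroid` is illustrative, and `math.dist` requires Python 3.8 or later.

```python
import math

centroids = [(0, 1), (2, 1), (-1, 2)]

def nearest_centroid(point):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

first = nearest_centroid((2, 3))     # distances: about 2.83, 2.0, 3.16
second = nearest_centroid((2, 0.5))  # distances: about 2.06, 0.5, 3.35

print(first, second)  # 1 1, so both points join the cluster of centroid (2, 1)
```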

You have a list of data about all the students' scores on Quiz 1. You want to curve the scores by adding 1 point to each student's score. Which operation would be best to perform?

map

