Data Analytics and Visualization Final
If points (0,3), (2,1), and (-2,2) are the only points assigned a cluster, what is the centroid for this cluster?
(0,2)
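A centroid is the coordinate-wise mean of the points in a cluster. The following is a minimal sketch of that arithmetic; the helper name `centroid` is illustrative, not something from the course.

```python
points = [(0, 3), (2, 1), (-2, 2)]

def centroid(pts):
    """Average each coordinate across all points in the cluster."""
    n = len(pts)
    # zip(*pts) regroups the points into one tuple per coordinate axis
    return tuple(sum(axis) / n for axis in zip(*pts))

print(centroid(points))  # (0.0, 2.0)
```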
What is the Pearson correlation coefficient r(X,Y) of: X = [0,1,2,3,4] Y = [14,12,10,8,6]
-1
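The result can be checked by hand-coding the Pearson formula (covariance divided by the product of the standard deviations). This is a sketch using only the standard library; the function name `pearson_r` is an assumption for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Numerator: sum of the products of the paired deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the root sums of squared deviations
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

X = [0, 1, 2, 3, 4]
Y = [14, 12, 10, 8, 6]
r = pearson_r(X, Y)
print(round(r, 6))  # -1.0
```

Y falls by exactly 2 for every unit increase in X, so the relationship is perfectly linear and decreasing.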
What is the cosine similarity value for two documents with nothing in common?
0
What is the cosine similarity value for two identical documents?
1
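Both cosine similarity answers can be reproduced with word-count vectors. This sketch assumes a simple whitespace tokenizer and made-up example documents; neither comes from the course material.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    # Counter returns 0 for words that are missing from the other document
    dot = sum(a[word] * b[word] for word in a)
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b)

same = cosine_similarity("good movie good plot", "good movie good plot")
disjoint = cosine_similarity("good movie good plot", "terrible acting")
print(round(same, 6), disjoint)  # 1.0 0.0
```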
Which value remains in my_list after the following list operations? my_list = [0, 2, 4, 6, 8, 1, 3, 5, 7, 9] my_list = my_list[2:7] sorted(my_list) my_list = my_list[3]
1
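Tracing the operations one line at a time shows why the answer is 1. The key detail is that `sorted()` returns a new list and leaves `my_list` unchanged.

```python
my_list = [0, 2, 4, 6, 8, 1, 3, 5, 7, 9]
my_list = my_list[2:7]   # keeps indices 2..6 -> [4, 6, 8, 1, 3]
sorted(my_list)          # builds a new sorted list; the result is discarded
my_list = my_list[3]     # index 3 of the unsorted slice -> 1
print(my_list)           # 1
```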
Given 1000 data points, if we set the minimum number of points required to split a node to 200 and the minimum leaf size to 300, what is the maximum possible depth of the resulting decision tree?
2
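One way to reason about this: a node can split only if it holds at least the minimum split size and enough points to give both children a legal leaf. The sketch below assumes binary splits and counts the root as depth 0; the function `max_depth` is a hypothetical helper, not a library call.

```python
def max_depth(n_points, min_split, min_leaf):
    """Deepest tree possible when every split must obey both limits."""
    # A split needs enough points to be attempted and enough to
    # leave each of the two children with at least min_leaf points.
    if n_points < min_split or n_points < 2 * min_leaf:
        return 0
    # The deepest tree peels off one minimum-size leaf per split
    # and keeps splitting the larger remainder.
    return 1 + max_depth(n_points - min_leaf, min_split, min_leaf)

print(max_depth(1000, 200, 300))  # 2
```

With these numbers, 1000 splits into 300 and 700, 700 splits into 300 and 400, and 400 cannot produce two leaves of at least 300.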
How many elements will the result of this list comprehension have? old = [1,3,7] new = [elem*2 for elem in old if elem > 1]
2
How many items remain from the original list after the following slicing operations? my_list = [0, True, 2, 'three', 4, 5, 6, 7, 8, 9, 'ten'] my_list = my_list[:8] my_list = my_list[0:5] my_list = my_list[2:4]
2
What is the Hamming distance between these two data points: A = (1, 2, 3, 4) B = (1, 5, -1, 4)
2
What is the L2 distance between these two data points: A = (1, 2, 3, 4) B = (1, 5, -1, 4)
5
What is the L1 distance between these two data points: A = (1, 2, 3, 4) B = (1, 5, -1, 4)
7
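The three distance answers above can be verified together. This is a short sketch using only the standard library.

```python
from math import sqrt

A = (1, 2, 3, 4)
B = (1, 5, -1, 4)

# Hamming: number of positions where the values differ
hamming = sum(a != b for a, b in zip(A, B))
# L1 (Manhattan): sum of absolute differences
l1 = sum(abs(a - b) for a, b in zip(A, B))
# L2 (Euclidean): square root of the sum of squared differences
l2 = sqrt(sum((a - b) ** 2 for a, b in zip(A, B)))

print(hamming, l1, l2)  # 2 7 5.0
```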
Select Multiple Missing data can reasonably be replaced with: A. A fixed value (e.g., 0) B. A computed reduction value (e.g., mean) C. A value randomly sampled from the column D. A value randomly sampled from the row E. A randomly generated value
A. A fixed value (e.g., 0) B. A computed reduction value (e.g., mean) C. A value randomly sampled from the column
What would be the best visual mapping for two quantitative data columns named A and B? A. A->x axis; B->y axis B. A->area; B->x axis C. A->angle; B->y axis D. A->x axis; B->color luminance
A. A->x axis; B->y axis
Which of the following can act as possible termination conditions in k-means fitting? A. Centroids do not change position between successive iterations. B. The assignment of items to clusters does not change between iterations. C. The within-cluster variance falls below a pre-set threshold. D. A fixed number of iterations have passed.
A. Centroids do not change position between successive iterations. B. The assignment of items to clusters does not change between iterations. C. The within-cluster variance falls below a pre-set threshold. D. A fixed number of iterations have passed.
Which of the following are disadvantages of choosing a decision tree classifier? A. Decision trees can easily overfit data. B. Decision trees are difficult to interpret. C. Decision trees are not stable algorithms.
A. Decision trees can easily overfit data. C. Decision trees are not stable algorithms.
Identify the correct statements related to document sentiment classification: A. Document sentiment classification is different from topic-based text classification. B. Three classes are possible: positive, negative, and neutral. C. Sentiment classification completely depends upon opinion words. D. We want to classify a whole document based on the overall sentiment of the opinion holder.
A. Document sentiment classification is different from topic-based text classification. B. Three classes are possible: positive, negative, and neutral. D. We want to classify a whole document based on the overall sentiment of the opinion holder.
Select Multiple When cleaning data, it's a good idea to switch these values to NaN. A. Extreme data B. Invalid data C. Missing data D. Prime number data
A. Extreme data B. Invalid data C. Missing data
Which of the following statements are true about the DBSCAN algorithm? A. It does not require prior knowledge of the number of desired clusters. B. It is capable of detecting outliers. C. For data points to be in a cluster, they must be in a distance threshold to a core point. D. It has strong assumptions for the distribution of data points in dataspace.
A. It does not require prior knowledge of the number of desired clusters. B. It is capable of detecting outliers. C. For data points to be in a cluster, they must be in a distance threshold to a core point.
What is the function of a tag in HTML? A. It indicates some behavior that your web browser uses for formatting B. It tells a web browser that this is the start of an HTML document C. It creates a web site on your computer D. It allows you to open your web browser E. It allows you to link to another web page
A. It indicates some behavior that your web browser uses for formatting
Python list comprehensions are appropriately used to perform which of the following operations? A. Map B. Reduce C. Filter D. Sort E. Slice
A. Map C. Filter
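A single comprehension can do both jobs: the `if` clause filters and the leading expression maps. The example reuses the `old` list from an earlier question.

```python
old = [1, 3, 7]

# Filter: keep elements greater than 1. Map: double each survivor.
new = [elem * 2 for elem in old if elem > 1]
print(new)  # [6, 14]
```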
If I am using all features of my dataset and I achieve 100% accuracy on my training set but 70% on my testing set, what should I be correcting? A. Model overfit B. Model underfit C. The selected classification algorithm D. Missing data
A. Model overfit
Which multi-dimensional visualization would be best for finding potential correlations between all pairs of columns? A. Scatterplot Matrix B. Heatmap C. Boxplot D. Parallel Coordinates Plot
A. Scatterplot Matrix
Common way(s) to locate data in an HTML page using BeautifulSoup are: A. Search for tags and identifiers. B. Navigate tag hierarchy. C. Query its x,y position on the page. D. Search by the exact data value.
A. Search for tags and identifiers. B. Navigate tag hierarchy.
Which Python data structure would be best for representing a positive word dictionary, so that we can efficiently look up potential positive words in it? A. Set B. String C. List D. CSV File
A. Set
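A set supports fast membership tests with the `in` operator, which is what a word-dictionary lookup needs. The word list and review text below are made up for illustration.

```python
# Hypothetical positive-word dictionary stored as a set
positive_words = {"good", "great", "excellent"}

review = "a great movie with excellent acting"
# Each "in" check is a hash lookup rather than a scan of the collection
hits = [word for word in review.split() if word in positive_words]
print(hits)  # ['great', 'excellent']
```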
Which of the following are reasons that might cause a random forest to overfit the input data? A. The number of trees B. The percentage of data used for training C. The depth of each tree
A. The number of trees B. The percentage of data used for training C. The depth of each tree
Which of the following statements is true about the k-Nearest Neighbors algorithm? A. The optimal value of k is different for each dataset. B. The algorithm decides the value of k. C. The optimal value of k can change with each question you ask about a dataset. D. The choice of k is irrelevant to the output.
A. The optimal value of k is different for each dataset. C. The optimal value of k can change with each question you ask about a dataset.
What are the two types of hierarchical clustering? A. Top-down (divisive) B. Bottom-up (agglomerative) C. DBSCAN D. dendrogram E. k-means
A. Top-down (divisive) B. Bottom-up (agglomerative)
You want to compute the average of the Quiz1 scores of all the students who got more than 70%. Which execution sequence of operations would be best? A. filter, then reduce B. reduce, then filter C. sort, then slice D. map, then filter
A. filter, then reduce
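Filter-then-reduce in plain Python might look like the sketch below. The scores are hypothetical, and "more than 70%" is taken to mean strictly greater than 70.

```python
quiz1 = [65, 72, 88, 70, 95]  # hypothetical Quiz1 percentages

# Filter: keep only the scores above 70
passing = [score for score in quiz1 if score > 70]
# Reduce: collapse the remaining scores into a single average
average = sum(passing) / len(passing)
print(passing, average)  # [72, 88, 95] 85.0
```

Reducing first would average every student, and the filter would then have nothing meaningful to act on.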
You have a table of VT students, which contains data about each student's Major and GPA. You want to compute each student's GPA differential in comparison to the average GPA of students of their same Major. Which sequence of operations would be best? A. groupby, reduce, join, map B. join, reduce, map, groupby C. reduce, map, groupby D. union, filter, sort
A. groupby, reduce, join, map
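The four steps can be sketched in plain Python as a stand-in for the equivalent table operations. The student rows are invented, and the variable names are illustrative only.

```python
from collections import defaultdict

# Hypothetical rows: (name, major, gpa)
students = [
    ("Ann", "CS", 3.5),
    ("Bob", "CS", 3.0),
    ("Cy", "Math", 4.0),
    ("Di", "Math", 3.0),
]

# groupby: bin the GPAs by major
by_major = defaultdict(list)
for _, major, gpa in students:
    by_major[major].append(gpa)

# reduce: one average GPA per major
major_avg = {major: sum(gpas) / len(gpas) for major, gpas in by_major.items()}

# join: attach the matching major average to every student row
joined = [(name, major, gpa, major_avg[major]) for name, major, gpa in students]

# map: compute each student's differential from that average
differential = {name: gpa - avg for name, _, gpa, avg in joined}
print(differential)  # {'Ann': 0.25, 'Bob': -0.25, 'Cy': 0.5, 'Di': -0.5}
```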
Select multiple The HTML5 Document Object Model for a web page: A. is organized in a tree structure. B. is composed of tags. C. contains attribute=value pairs. D. stores database tables.
A. is organized in a tree structure. B. is composed of tags. C. contains attribute=value pairs.
Which of these are true about the k-means algorithm we discussed in class? A. k-means can produce different clusterings for the same data and same k. B. k-means always selects exactly k clusters. C. k-means will leave outliers out of all of the k clusters. D. k-means clusters the data by selecting the k most representative rows from the dataset. E. k-means uses a distance matrix.
A. k-means can produce different clusterings for the same data and same k. B. k-means always selects exactly k clusters.
Imputing the mean preserves what attribute(s) of the data, if any? A. mean B. sum C. correctness D. variance
A. mean
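A small numeric check makes this concrete: filling a gap with the observed mean leaves the mean unchanged but changes the sum and shrinks the variance. The column values below are hypothetical.

```python
from statistics import mean, pvariance

column = [2.0, 4.0, None, 6.0]  # None marks the missing value

observed = [v for v in column if v is not None]
fill = mean(observed)
# Replace the missing entry with the mean of the observed values
imputed = [fill if v is None else v for v in column]

print(mean(observed), mean(imputed))            # 4.0 4.0
print(sum(observed), sum(imputed))              # 12.0 16.0
print(pvariance(observed) > pvariance(imputed)) # True
```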
Which of the following can be used to create a DataFrame in pandas? A. A Python tuple B. A Python dictionary C. A scalar value D. Another pandas DataFrame E. A Python list F. A pandas Series
All of the answer choices A. A Python tuple B. A Python dictionary C. A scalar value D. Another pandas DataFrame E. A Python list F. A pandas Series
Which of the following are true of the Euclidean distance matrix m for a quantitative data table d? Assume: d contains n rows and p columns. m[a, b] = value of m at its row a, column b. dist(d, a, b) = Euclidean distance between rows a and b in d. A. m[a, a] == 0 B. m.shape == (n, n) C. m[a, b] == m[b, a] D. m[a, c] <= m[a, b] + m[b, c] E. m[a, b] == dist(d, a, b) F. m[a, b] >= 0
All of the answer choices A. m[a, a] == 0 B. m.shape == (n, n) C. m[a, b] == m[b, a] D. m[a, c] <= m[a, b] + m[b, c] E. m[a, b] == dist(d, a, b) F. m[a, b] >= 0
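These properties can be checked on a tiny example. The sketch below builds the matrix as nested lists with `math.dist` (Python 3.8+) instead of an array library, so indexing is `m[a][b]` rather than `m[a, b]`; the data rows are made up.

```python
from math import dist

d = [(0, 0), (3, 4), (6, 8)]  # hypothetical table: n = 3 rows, p = 2 columns
n = len(d)

# m[a][b] holds the Euclidean distance between rows a and b
m = [[dist(row_a, row_b) for row_b in d] for row_a in d]

is_square = len(m) == n and all(len(row) == n for row in m)
zero_diagonal = all(m[a][a] == 0 for a in range(n))
symmetric = all(m[a][b] == m[b][a] for a in range(n) for b in range(n))
non_negative = all(value >= 0 for row in m for value in row)
# A tiny tolerance guards against floating-point rounding
triangle = all(
    m[a][c] <= m[a][b] + m[b][c] + 1e-12
    for a in range(n) for b in range(n) for c in range(n)
)
print(is_square, zero_diagonal, symmetric, non_negative, triangle)
```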
Define Lambda Expression
An unnamed (or anonymous) function to perform a computation
Define Inner Join
Associate rows between two DataFrames, keeping the intersection of matching values
Define Many-to-one Join
Associate rows between two DataFrames, where only one of the DataFrames has unique keys
Which of the following statements are true about cluster analysis in general? A. We can only visualize clustering results when the data is two-dimensional. B. Different clustering algorithms are able to detect clusters of different sizes and shapes. C. We must know the number of output clusters k before running all clustering algorithms. D. When clustering, we want to put two dissimilar data objects into the same cluster. E. Agglomerative clustering is an example of a distance-based clustering method. F. Cluster analysis doesn't require labeled training data; the goal is to find structure in the data. G. In order to perform cluster analysis, we need to have a similarity or distance measure between data objects.
B. Different clustering algorithms are able to detect clusters of different sizes and shapes. E. Agglomerative clustering is an example of a distance-based clustering method. F. Cluster analysis doesn't require labeled training data; the goal is to find structure in the data. G. In order to perform cluster analysis, we need to have a similarity or distance measure between data objects.
Which multi-dimensional visualization would be best for comparing the values in a table with over 100 rows and over 100 columns? A. Scatterplot Matrix B. Heatmap C. Boxplot D. Parallel Coordinates Plot
B. Heatmap
Which of the following are reasons to use PCA for dimensionality reduction? A. PCA can be used as a replacement for linear regression. B. Reducing the data dimensionality will enable a future algorithm in the analysis pipeline to run more quickly. C. Reducing the data dimensionality to 2D or 3D can give you an intuition of the shape of the data by plotting. D. Principal Components can be added to the original DataFrame to have more dimensions to use for future learning.
B. Reducing the data dimensionality will enable a future algorithm in the analysis pipeline to run more quickly. C. Reducing the data dimensionality to 2D or 3D can give you an intuition of the shape of the data by plotting.
Select Multiple Which of the following statements about Principal Component Analysis are true? A. Because the algorithm may reach a local rather than a global minimum, we should run multiple PCA initializations. B. The number of possible principal components is equal to the number of input dimensions. C. PCA can only reduce dimensionality from n dimensions to n-1 dimensions. D. We should z-score normalize data dimensions prior to running PCA
B. The number of possible principal components is equal to the number of input dimensions. D. We should z-score normalize data dimensions prior to running PCA
In ensemble learning, you aggregate the predictions for weak learners under the assumption that a collection of these models will give a better prediction than the individual models. Which of the following statements are true for weak learners used in these ensembles? A. They have high bias, so they cannot solve complex learning problems B. They usually overfit C. They don't usually overfit
B. They usually overfit
According to Munzner's ranking of visual properties (channels), which choice would be best for comparing quantitative data values? A. motion B. bar chart C. pie chart D. colormap
B. bar chart
Which multi-dimensional visualization would be best for understanding the distributions of each of the columns in a table? A. Scatterplot Matrix B. Heatmap C. Boxplot D. Parallel Coordinates Plot
C. Boxplot
A visualization technique to view a distribution of the popularity of words in a bag-of-words model is: A. Scatterplot matrix B. Regression line C. Histogram D. Scatterplot
C. Histogram
Feature normalization is an important step to perform before running most clustering algorithms. What is the reason behind this? A. The Euclidean distance computation will result in mathematical errors. B. You always get the same clusters when running the analysis repeatedly. C. In distance calculation it will give the same influence for all features. D. None of these.
C. In distance calculation it will give the same influence for all features.
Throughout a large corpus of English documents, we should expect words such as "the," "a," and "and" to have consistently: A. Higher than average TFIDF B. Lower than average TF C. Lower than average TFIDF D. Higher than average TF
C. Lower than average TFIDF D. Higher than average TF
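A toy corpus shows both effects. The sketch assumes one common TF-IDF variant, corpus-wide term count multiplied by log(N / document frequency); the course may have used a different weighting, and the three documents are invented.

```python
from collections import Counter
from math import log

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird and the dog",
]
tokenized = [doc.split() for doc in docs]

# TF: how often each word occurs across the corpus
tf = Counter(word for doc in tokenized for word in doc)
# DF: in how many documents each word occurs at least once
df = Counter(word for doc in tokenized for word in set(doc))

# Assumed weighting: tf * log(N / df)
tfidf = {word: tf[word] * log(len(tokenized) / df[word]) for word in tf}

avg_tf = sum(tf.values()) / len(tf)
avg_tfidf = sum(tfidf.values()) / len(tfidf)
print(tf["the"] > avg_tf, tfidf["the"] < avg_tfidf)  # True True
```

Because "the" occurs in every document, its inverse document frequency is log(1) = 0, which cancels out its high term frequency.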
Which of these are true about the Multidimensional Scaling (MDS) algorithm we discussed in class? A. A higher stress value indicates a better MDS fit. B. MDS reduces dimensionality by selecting only the most important columns of the data. C. MDS uses a distance matrix. D. MDS projections can be reflected or rotated without changing their accuracy. E. MDS always reduces p dimensions to 2. F. MDS can produce different 2D projections for the same data. G. MDS attempts to preserve pairwise distances in the projection.
C. MDS uses a distance matrix. D. MDS projections can be reflected or rotated without changing their accuracy. F. MDS can produce different 2D projections for the same data. G. MDS attempts to preserve pairwise distances in the projection.
How would you best describe the Vector Space Model matrix expected for a large set of tweets, where each tweet was treated as a document? A. Dense B. Symmetric C. Sparse D. Diagonal
C. Sparse
The big difference between supervised and unsupervised learning algorithms is: A. Supervised algorithms can only be performed on quantitative data. B. Only clustering algorithms are supervised. C. Unsupervised algorithms don't require labels. D. You need to pay closer attention for errors with supervised algorithms.
C. Unsupervised algorithms don't require labels.
Imputation is best described as: A. finding missing values B. removing erroneous values C. assigning estimated values D. replacing dirty data with NaN
C. assigning estimated values
Which colormap would be best for visualizing quantitative data? A. solid red B. rainbow colormap C. blue saturation ramp colormap
C. blue saturation ramp colormap
The choice of k, the number of clusters to partition a set into: A. is a personal choice that shouldn't be discussed in results that are presented. B. should always be as large as your computer system can handle. C. depends on why and how you are clustering the data. D. has a maximum of 10.
C. depends on why and how you are clustering the data.
What is the purpose of using div tags in HTML? A. For adding titles B. For adding headings C. For creating different styles D. For creating different sections
D. For creating different sections
You want to combine a table of VT students and a table of VT dorm buildings to get a new table of the current VT campus that relates student data to data about the dorm they live in. Which operation would be best? A. One-to-one join B. Union or append C. Many-to-many join D. Many-to-one join
D. Many-to-one join
Select one: When representing documents with a bag-of-words model: A. The order of words in each document contains semantic meaning. B. Duplicate words are removed. C. You are forced to choose a large fixed vocabulary, such as the Oxford English Dictionary. D. Multiple documents can be aggregated by concatenating their bags.
D. Multiple documents can be aggregated by concatenating their bags.
Which multi-dimensional visualization would be best for finding multi-dimensional clusters of rows? A. Scatterplot Matrix B. Heatmap C. Boxplot D. Parallel Coordinates Plot
D. Parallel Coordinates Plot
Which chart would be best to visualize the distribution of a single quantitative data column? A. scatter plot B. pie chart C. bar chart D. histogram
D. histogram
A major goal of the QAC approach is to:
Describe a problem and solution in a meaningful way.
Which of the following tags is the correct way to create a hyperlink? A. <a>http://www.website.com</a> B. <a name="http://www.website.com">http://www.website.com</a> C. <a link="http://www.website.com">http://www.website.com</a> D. <a url="http://www.website.com">http://www.website.com</a> E. <a href="http://www.website.com">http://www.website.com</a>
E. <a href="http://www.website.com">http://www.website.com</a>
L2 distance is also known as
Euclidean Distance
Define parse
Extract information from an HTML page
True/False Dimension reduction always perfectly captures the high-dimensional data in low dimensional space, such that the high-dimensional data can be reconstructed from the low-dimensional data.
False
True/False Each dimension found using MDS represents only one attribute in the high-dimensional data.
False
True/False Matplotlib is the only way to create visualizations in Python.
False
True/False: Bar graphs and histograms are the same thing.
False
True/False: Data Science is a linear process of 6 steps with no iterative design
False
True/False: Markdown is a type of python code.
False
True/False: Reduce and Filter operations perform the same actions.
False
True/False: Since Notebooks have Markdown cells to provide explanations, I never need to add comments to my code.
False
True/False: Switching integer strings ("47") to integers (47) doesn't make a difference to future data processing.
False
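Two quick comparisons show how the type changes later results: strings sort alphabetically and concatenate, while integers sort numerically and add. The values are made up for illustration.

```python
as_text = ["100", "47", "9"]
as_ints = [int(value) for value in as_text]

# Strings compare character by character, so "100" sorts before "47"
print(sorted(as_text))  # ['100', '47', '9']
print(sorted(as_ints))  # [9, 47, 100]

# "+" concatenates strings but adds integers
print("47" + "1", 47 + 1)  # 471 48
```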
True/False: The fit of an MDS solution is commonly assessed by a stress measure in which higher values of stress indicate better fits.
False
True/False: I can execute code cells in a Notebook in any order that I want without affecting the overall computation or exploration.
False
Define Imputation
Filling in missing data with representative values
Define Crawl
Find URLs of more pages to retrieve
What does Server-side script do?
Generate HTML page from a database
Categorical Distance is also known as
Hamming Distance
Define the Term: Slice
Identify data based on positional indices
Define the Term: Filter
Identify data that matches a condition
What is the type of the result of the following expression? "1,2,3,4".split(",")
List
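Running the expression confirms the type. Note that the elements stay strings until they are converted.

```python
parts = "1,2,3,4".split(",")
print(type(parts).__name__, parts)  # list ['1', '2', '3', '4']

# The pieces are still strings; convert them if numbers are needed
numbers = [int(part) for part in parts]
print(numbers)  # [1, 2, 3, 4]
```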
L1 distance is also known as
Manhattan Distance
Are two runs of k-means clustering expected to yield the same clustering results?
No
Define the Term: Sort
Order data by its values
A simple linear regression on a dataset with columns A and B produced this result: B = 0.5*A + 3 Therefore, the Pearson correlation coefficient r(A,B) value most likely should be:
Positive
QAC is an acronym for:
Question, Analysis, Conclusion
Define Groupby
Reduce cardinality by binning rows according to a categorical variable
What does a Browser do?
Render an HTML page for human consumption
Define fetch
Retrieve an HTML page from the web
Define provenance
Sequence of code cells in a computational notebook that is used to compute a result
Define the Term: Reduce
Summarize data as a single value
Define Dimension Reduction
Transform data from a high-dimensional space to a low-dimensional space
Define the Term: Map
Transform data into new values through a function
True/False A DataFrame is similar to a dictionary because you can use the index labels to get and set values.
True
True/False Code cells in a Notebook can be executed multiple times.
True
True/False: A Notebook is composed of Markdown, Code, and the results of running that Code.
True
True/False: A pandas Series acts in a way similar to that of an array or list, and a DataFrame is composed of a set of Series.
True
True/False: The standard marker for missing data in pandas is NaN.
True
Define Histogram
Visualize the distribution of a quantitative variable using binned frequencies
As the k-Means algorithm runs, we currently have 3 centroids (0,1), (2,1), and (-1,2). Will points (2,3) and (2,0.5) be assigned to the same cluster in the next iteration?
Yes
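The assignment step of k-means gives each point to its nearest centroid. This sketch assumes Euclidean distance and uses `math.dist` (Python 3.8+); the helper `nearest` is an illustrative name.

```python
from math import dist

centroids = [(0, 1), (2, 1), (-1, 2)]

def nearest(point):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))

print(nearest((2, 3)), nearest((2, 0.5)))  # 1 1
```

Both points are closest to the centroid at (2,1), at distances 2 and 0.5, so they land in the same cluster.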
You have a list of data about all the students' scores on Quiz 1. You want to curve the scores by adding 1 point to each student's score. Which operation would be best to perform?
map
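A map applies the same function to every element and returns one output per input. The scores below are hypothetical, and the lambda is the anonymous function being applied.

```python
scores = [7, 9, 10]  # hypothetical Quiz 1 scores

# Map: add 1 point to each score
curved = list(map(lambda score: score + 1, scores))
print(curved)  # [8, 10, 11]
```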