CMDA 3654 Final Exam Quizlet

If points (0,3), (2,1), and (-2,2) are the only points assigned a cluster, what is the centroid for this cluster?

(0, 2)
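
A minimal Python sketch of this calculation, assuming the centroid is the coordinate-wise mean of the assigned points (the variable names are illustrative):

```python
# Points currently assigned to the cluster (from the question).
points = [(0, 3), (2, 1), (-2, 2)]

# zip(*points) groups all x values together and all y values together;
# the centroid is the mean of each group.
centroid = tuple(sum(coords) / len(points) for coords in zip(*points))
print(centroid)  # (0.0, 2.0)
```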

What is the Pearson correlation coefficient r(X,Y) of: X = [0,1,2,3,4] Y = [14,12,10,8,6]

-1
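
One way to check this by hand is to compute r from its definition. The sketch below uses only the standard library; `pearson_r` is an illustrative helper name, not a library function:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y divided by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

X = [0, 1, 2, 3, 4]
Y = [14, 12, 10, 8, 6]
r = pearson_r(X, Y)
print(r)  # -1.0
```

Because Y falls by exactly 2 for every unit increase in X, the points lie on a straight line with negative slope, which is why r is exactly -1.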

Compute the total bag-of-words sentiment of this document: "I love to hate twitter very, very much." Positive dictionary: [love, very] Negative dictionary: [hate, worse]

2
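
A rough sketch of the scoring, assuming each positive word occurrence counts +1, each negative occurrence counts -1, and punctuation and case are ignored (the tokenizing rule and the function name are assumptions for illustration):

```python
import re

positive = {"love", "very"}
negative = {"hate", "worse"}

def bag_of_words_sentiment(text):
    # Lowercase the text and keep only runs of letters, so "very," matches "very".
    words = re.findall(r"[a-z']+", text.lower())
    # Each word contributes +1 if positive, -1 if negative, 0 otherwise.
    return sum((w in positive) - (w in negative) for w in words)

score = bag_of_words_sentiment("I love to hate twitter very, very much.")
print(score)  # 2
```

Here "love", "very", and "very" contribute +3 and "hate" contributes -1.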

Given 1000 data points, if we set the minimum number of points required to split a node to 200 and the minimum leaf size to 300, what is the maximum possible depth of the resulting decision tree?

2
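
The reasoning can be checked with a small search, assuming binary splits and counting depth as the number of splits on the longest root-to-leaf path (the constant and function names are illustrative):

```python
from functools import lru_cache

MIN_SPLIT = 200  # a node needs at least this many points to be split
MIN_LEAF = 300   # each child must receive at least this many points

@lru_cache(maxsize=None)
def max_depth(n):
    """Maximum number of split levels below a node that holds n points."""
    if n < MIN_SPLIT or n < 2 * MIN_LEAF:
        return 0  # the node cannot be split, so it stays a leaf
    # Try every allowed left/right division and keep the deepest outcome.
    return 1 + max(
        max(max_depth(left), max_depth(n - left))
        for left in range(MIN_LEAF, n - MIN_LEAF + 1)
    )

depth = max_depth(1000)
print(depth)  # 2
```

For example, 1000 can split into 300 and 700, and 700 can split into 300 and 400, but no node below 600 points can yield two leaves of 300.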

How many elements will the result of this list comprehension have? old = [1,2,3] new = [old*2 for i in old if i > 1]

2
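
Running the comprehension shows why: the filter `i > 1` keeps two iterations, and each one appends `old * 2` (the whole list repeated twice), not `i * 2`.

```python
old = [1, 2, 3]
new = [old * 2 for i in old if i > 1]

print(len(new))  # 2
print(new[0])    # [1, 2, 3, 1, 2, 3]
```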

What is the Hamming distance between these two data points: A = (1,2,3,4) B = (1,5,-1,4)

2

What is the L2 distance between these two data points: A = (1,2,3,4) B = (1,5,-1,4)

5

The simple linear regression equation returned by SciKitLearn is Y = 61.93*X - 1.79. What is the slope of the equation?

61.93

What is the L1 distance between these two data points? A = (1,2,3,4) B = (1,5,-1,4)

7
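
The Hamming, L2, and L1 answers for these two points can be reproduced with a short sketch (standard library only; the variable names are illustrative):

```python
import math

A = (1, 2, 3, 4)
B = (1, 5, -1, 4)

# Hamming: number of positions where the values differ.
hamming = sum(a != b for a, b in zip(A, B))
# L1 (Manhattan): sum of absolute differences.
l1 = sum(abs(a - b) for a, b in zip(A, B))
# L2 (Euclidean): square root of the sum of squared differences.
l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(A, B)))

print(hamming, l1, l2)  # 2 7 5.0
```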

What tag is the correct way to create a hyperlink?

<a href="http://www.website.com">http://www.website.com</a>

What tags create bolded text in HTML?

<b> and <strong>

What would be the best visual mapping for two quantitative data columns named A and B?

A -> x-axis B -> y-axis

What can act as possible termination conditions in k-means fitting?

A fixed number of iterations have passed, the assignment of items to clusters does not change between iterations, centroids do not change position between successive iterations, the within cluster variance falls below a pre-set threshold

What are some hazards of using web data?

A slow internet connection; losing the internet connection; the website changing the file's data without telling you

Which of the following can be used to create a DataFrame in pandas? a. a scalar value b. a Python list c. a Python dictionary d. a Python tuple e. a pandas Series f. another pandas DataFrame

All of the above

Which of the following are disadvantages of choosing a decision tree classifier?

Decision trees are not stable algorithms; decision trees can easily overfit data

A major goal of QQQ approach is to:

Describe a problem and solution in a meaningful way

The L2 Distance is also known as:

Euclidean Distance

A fit of an MDS solution is commonly assessed by a stress measure in which higher values of stress indicate better fits. (T/F)

False

A projection in which all dimension coefficients are 0.5 is different than a projection in which all dimension coefficients are 2. (T/F)

False

After the following code is executed to retrieve a JSON file: import requests response = requests.get("www.example.com/data.json") You can print a JSON representation of the website with the following code: print(response.json) (T/F)

False

Bar graphs and histograms are the same thing (T/F)

False

Data Science is a straightforward process with no iterative design (T/F)

False

Decision trees can only be used for classification (T/F)

False

Each dimension found using MDS represents only one attribute in the high-dimensional data. (T/F)

False

If data is provided to a Series or DataFrame in the form of a list or dictionary, the index and the data must be the same length. (T/F)

False

Markdown is a type of Python code. (T/F)

False

Matplotlib is the only way to create visualizations in Python (T/F)

False

Switching integer strings ("47") to integers(47) doesn't make a difference to future data processing (T/F)

False

The difference between content-based recommenders and collaborative recommenders is that collaborative recommenders can only consider opinions of other users, not the opinion of the current user (T/F)

False

The difference between content-based recommenders and collaborative recommenders is that content-based recommenders cannot consider user opinions at all, and are only based upon product characteristics. (T/F)

False

What is the type of the items stored within the data structure? [4.5, 3.2, 7.3]

Float

Categorical Distance is also known as:

Hamming Distance

A visualization technique to view a distribution of the popularity of words in a bag-of-words model is:

Histogram

Which chart would be best to visualize the distribution of a single quantitative data column?

Histogram

What is the function of a tag in HTML?

Indicates some behavior that your web browser uses for formatting

When splitting a decision tree node into branches, why should we prefer a metric that measures information gain rather than accuracy?

Information gain is more stable than accuracy; information gain chooses more impactful features closer to the root; decision trees are also prone to overfitting, so accuracy doesn't help them generalize

What is the type of the following literal? [4.5, 3.2, 7.3]

List

What is the type of the result of the following expression? "1,2,3,4".split(",")

List
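
A quick check in Python; it also shows that the pieces are still strings until they are converted, which matters for later processing:

```python
parts = "1,2,3,4".split(",")
print(type(parts).__name__, parts)  # list ['1', '2', '3', '4']

# The items are strings; convert them before doing arithmetic.
numbers = [int(p) for p in parts]
print(sum(numbers))  # 10
```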

The L1 Distance is also known as:

Manhattan Distance

When representing documents with a bag-of-words model:

Multiple documents can be aggregated by concatenating their bags

The standard marker for missing data in pandas is:

NaN

Are two runs of k-means clustering expected to yield the same clustering results?

No

A simple linear regression on a dataset with columns A and B produced this result: B = 0.5*A + 3. Therefore the Pearson correlation coefficient r(A,B) value most likely should be:

Positive

QQQ is an acronym for:

Qualitative, Quantitative, Qualitative

How would you best describe the Vector Space Model matrix expected for a large set of tweets, where each tweet was treated as a document?

Sparse

What are possible reasons for a single agglomerative clustering algorithm to produce two different dendrograms during two different clustering analyses?

The linkage function changed; a different subset of dimensions was included in the analysis; a different subset of data objects was included in the analysis

For what reasons might a random forest overfit input data?

The number of trees; the depth of each tree; the percentage of data used for training

In ensemble learning, you aggregate the predictions for weak learners under the assumption that a collection of these models will give a better prediction than the individual models. Which of the following statements are true for weak learners used in these ensembles?

They usually overfit

What are two types of hierarchical clustering?

Top-down (divisive); bottom-up (agglomerative)

"Imputation" refers to the act of replacing missing data values with substitute values (T/F)

True

A URL is easily stored as a string (T/F)

True

A dataframe is similar to a dictionary because you can use the index labels to get and set values. (T/F)

True

A notebook is composed of Markdown, code, and the results of running that code. (T/F)

True

A pandas Series acts in a way similar to that of an array or list. (T/F)

True

After the following code is executed to retrieve a text file: import requests response = requests.get("www.example.com/data.txt") You can print the text of the website with the following code: print(response.text) (T/F)

True

Solutions found by MDS can be reflected or rotated without changing the accuracy of the projection (T/F)

True

The difference between content-based recommenders and collaborative recommenders is that collaborative recommenders cannot use information about the product characteristics. (T/F)

True

The get function from the requests module consumes a string URL and returns a string representing the contents of the website. (T/F)

False (requests.get returns a Response object; the string contents are in its .text attribute)

The goal of multidimensional scaling is to construct a low-dimensional projection that preserves distances and patterns in the high-dimensional space. (T/F)

True

When creating a decision tree, a feature X can be used to split nodes at multiple levels (T/F)

True

The biggest difference between supervised and unsupervised learning algorithms is:

Unsupervised algorithms don't require labels

WMDS is an acronym for:

Weighted Multidimensional Scaling

As the k-means algorithm runs, we currently have 3 centroids (0,1), (2,1), (-1,2). Will points (2,3) and (2,0.5) be assigned to the same cluster in the next iteration?

Yes
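
The assignment step can be sketched as follows, assuming Euclidean distance and that each point goes to its nearest centroid (`nearest_centroid` is an illustrative helper; `math.dist` needs Python 3.8 or later):

```python
import math

centroids = [(0, 1), (2, 1), (-1, 2)]

def nearest_centroid(point):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

a = nearest_centroid((2, 3))
b = nearest_centroid((2, 0.5))
print(a, b)  # 1 1
```

Both points are closest to the centroid at (2,1), at distances 2 and 0.5 respectively.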

Imagine that you have 1000 input features in a dataset. You have to select the 100 features most important to the structure contained within the data. Is this an example of dimensionality reduction?

Yes

Is it possible that the assignment of items to clusters will not change between successive iterations in k-means?

Yes

In what scenarios is collaborative recommendation the most useful?

You collect reviews of many books from shoppers and use those reviews to determine which books to market; you record which articles are most frequently visited by users and use that knowledge to automate a website homepage

Which of the following code snippets are equivalent?
a. squared = []
   for value in values:
       squared.append(value**2)
b. squared = [return value**2 for value in values]
c. squared = []
   for value in values:
       value**2
d. squared = [value**2 for value in values]

a and d

Which of the following is true about cluster analysis in general? a. agglomerative clustering is an example of a distance-based clustering method b. we can only visualize clustering results when the data is two-dimensional c. different clustering algorithms are able to detect clusters of different sizes and shapes d. k means is a clustering algorithm that will always find accurate clusters in the data

a, c

Which of the following statements are true about cluster analysis in general? a. cluster analysis doesn't require labeled training data; the goal is to find structure in the data b. when clustering we want to put 2 dissimilar data objects into the same cluster c. in order to perform cluster analysis, we need to have a similarity measure between data objects d. cluster analysis can be performed on both categorical and quantitative data

a, c, d

Which of the following statements are true about the DBSCAN algorithm? a. For data points to be in a cluster, they must be in a distance threshold to a core point. b. It has strong assumptions for the distribution of data points in dataspace. c. It does not require prior knowledge of the number of desired clusters. d. It is capable of detecting outliers.

a, c, d

Missing data can be replaced with: a. a fixed value b. a value randomly sampled from the column c. value randomly sampled from the row d. a computed reduction value e. a randomly generated value

a. fixed value b. a value randomly sampled from the column d. a computed reduction value

When cleaning data, it's a good idea to switch these values to NaN: a. missing data b. invalid data c. extreme data d. prime number data

a. missing data b. invalid data c. extreme data

Sentiment Analysis is related to:

An application of natural language processing; an application of computational linguistics; text analytics used to identify and extract subjective information in source materials

Which of the following statements are true about the k-nearest neighbors algorithm? a. the choice of k is irrelevant to the output b. the algorithm decides the value of k c. the optimal value of k is different for each dataset d. the optimal value of k can change with each question you ask about a dataset

c, d

Name three relations to document sentiment classification

Classify a whole document based on the overall sentiment of the opinion holder; three classes are possible: positive, negative, and neutral; document sentiment classification is different from topic-based text classification

concat() vs append()

concat():
- takes a group of 2+ DataFrames and combines them by rows or columns
- the most efficient way to merge multiple DataFrames
- a function built into pandas
append():
- combines the rows of one DataFrame with another
- less efficient because full copies are made
- a shortcut because you don't have to specify as much
- a method of DataFrame types

What is the purpose of using div tags in HTML?

creating different sections

What function in BeautifulSoup will remove a tag from the HTML tree and destroy it?

decompose()

The choice of k, the number of clusters to partition a set into:

depends on why and how you are clustering the data

What is the best strategy for scraping the total number of "Customer Reviews" listed on the top left of a page with BeautifulSoup?

find('div', {'data-hook':'total-review-count'}) find('span', {'class':'a-size-medium'})

What function in BeautifulSoup allows you to retrieve all instances of an HTML tag?

find_all()

What function in BeautifulSoup allows you to retrieve the value associated with an HTML attribute?

get()

You have a table of VT students, which contains data about each student's major and gpa. You want to compute each student's GPA differential in comparison to the average GPA of students of their same Major. Which sequence of operations would be best?

groupby, reduce, join, map
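
The four steps can be sketched in plain Python. The student records below are hypothetical, invented only for illustration, and the sketch uses dictionaries rather than pandas:

```python
from collections import defaultdict

# Hypothetical records: (name, major, gpa)
students = [
    ("Ana", "CMDA", 3.6),
    ("Ben", "CMDA", 3.2),
    ("Cy", "CS", 3.0),
    ("Di", "CS", 3.8),
]

# groupby: collect the GPAs that belong to each major
gpas_by_major = defaultdict(list)
for _, major, gpa in students:
    gpas_by_major[major].append(gpa)

# reduce: collapse each group to a single value, its mean
mean_by_major = {m: sum(g) / len(g) for m, g in gpas_by_major.items()}

# join: attach the matching major mean to every student row
joined = [(name, gpa, mean_by_major[major]) for name, major, gpa in students]

# map: compute the differential one row at a time
differential = {name: round(gpa - mean, 2) for name, gpa, mean in joined}
print(differential)
```

The join is what puts each student's own GPA and their major's average on the same row, so the final map is a simple subtraction.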

Throughout a large corpus of English documents, we should expect words such as "the", "a", and "and" to have consistently:

Higher than average TF; lower than average TF-IDF

What tag creates italicized text in HTML?

<i> and <em>

Feature normalization is an important step to perform before running most clustering algorithms. What is the reason behind this?

It gives all features the same influence in the distance calculation

The HTML5 Document Object Model for a webpage:

Is composed of tags; is organized in a tree structure; contains attribute-value pairs

What is linkage?

method for determining the distance between two clusters

If I am using all features of my dataset and I achieve 100% accuracy on my training set but 70% on my testing set, what should I be correcting?

model overfit

The role of coefficients applied to each dimension in WMDS is to:

Remove dimensions from influencing the projection; make some dimensions more important than others in the projection; make some dimensions less important than others in the projection

Common ways to locate data in an HTML page using BeautifulSoup are:

Search for tags and identifiers; navigate the tag hierarchy

Name the three relations among sentiment, subjectivity, and emotion

Sentiment is not a subset of subjectivity; emotion is a subset of subjectivity; sentiment is not a subset of emotion

Which python data structure would be best for representing a positive word dictionary, so that we can efficiently look up potential positive words in it?

set

What HTML attribute allows you to apply a CSS style to a section of text?

style; class; id

Histograms are best used for showing:

the distribution of a column of values

Scatterplots are best for showing:

the relationship between two columns of values

Line plots are best used for showing:

the trend in a column of values

A linear regression produces the equation Y = 0.4*X + 3. This indicates that:

when X = 0; Y = 3

What are the coefficients of the simple linear regression Y = mx + b of : X = [0,1,2,3,4] Y = [14,12,10,8,6]

y = -2x + 14
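
A least-squares check of these coefficients, written as a sketch with the standard library only (the names `slope` and `intercept` stand for m and b):

```python
X = [0, 1, 2, 3, 4]
Y = [14, 12, 10, 8, 6]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Ordinary least squares for one predictor:
#   slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
    / sum((x - mean_x) ** 2 for x in X)
)
intercept = mean_y - slope * mean_x
print(slope, intercept)  # -2.0 14.0
```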

