Data And Analytics

Support Vector Machines (SVMs)

Similar to regression, but we're trying to find the dividing line (or really, the dividing hyperplane) rather than the best-fit line.
Support vectors: The data points that are closest to the hyperplane. These points define the separating line by determining the margin, so they are the most relevant to the construction of the classifier.
Hyperplane: A decision plane that separates sets of objects having different class memberships.
Margin: The gap between the two lines through the closest points of each class. It is calculated as the perpendicular distance from the separating line to the support vectors (the closest points). A larger margin between the classes is considered a good margin; a smaller margin is a bad margin.
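
The margin calculation can be sketched in a few lines. This is a minimal illustration, not a trained SVM: the line x1 + x2 - 3 = 0 and the four points are made-up values, and the function name is hypothetical.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

# Hypothetical separating line x1 + x2 - 3 = 0 with two toy points on each side.
w, b = (1.0, 1.0), -3.0
points = [(0.0, 1.0), (1.0, 1.0), (3.0, 2.0), (2.0, 2.0)]

distances = [distance_to_hyperplane(w, b, p) for p in points]

# The margin is set by the closest points; those points are the support vectors.
margin = min(distances)
support_vectors = [p for p, d in zip(points, distances) if math.isclose(d, margin)]
```

An actual SVM searches for the w and b that make this minimum distance as large as possible; here they are simply fixed by hand.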

Suppose you wanted to scrape the total number of "Customer reviews" listed on the top left of this page: https://www.amazon.com/Hutzler-571-Banana-Slicer/product-reviews/B0047E0EII Which strategy with BeautifulSoup would likely be best? find('span', {'class':'a-size-medium'}) find('span', {'class':'total-review-count'}) find('div', {'data-hook':'total-review-count'}) find('div', {'class':'a-row'})

find('div', {'data-hook':'total-review-count'})
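
The idea behind this answer is that a specific attribute such as data-hook identifies one element, while generic classes like a-row match many. The sketch below shows the same lookup using only Python's standard-library html.parser instead of BeautifulSoup, so it runs without extra packages. The HTML fragment and the review count in it are hypothetical, and the class name DataHookFinder is an assumption for illustration.

```python
from html.parser import HTMLParser

class DataHookFinder(HTMLParser):
    """Collect the text inside the first <div> with a given data-hook value.

    Simplification: assumes the target <div> contains no nested <div>.
    """

    def __init__(self, hook):
        super().__init__()
        self.hook = hook
        self.capturing = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if not self.done and tag == "div" and dict(attrs).get("data-hook") == self.hook:
            self.capturing = True

    def handle_endtag(self, tag):
        if self.capturing and tag == "div":
            self.capturing = False
            self.done = True

    def handle_data(self, data):
        if self.capturing:
            self.chunks.append(data)

# Hypothetical page fragment; the real page's markup and numbers may differ.
SAMPLE_HTML = """
<div class="a-row"><span class="a-size-medium">Customer reviews</span></div>
<div class="a-row" data-hook="total-review-count"><span>1,234 global ratings</span></div>
"""

finder = DataHookFinder("total-review-count")
finder.feed(SAMPLE_HTML)
review_count_text = "".join(finder.chunks).strip()
```

Note that both div elements share the class a-row, which is why searching by that class would be ambiguous.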

Random Forest

Disadvantages: A random forest is slow to generate predictions because it contains many decision trees. For every prediction, each tree in the forest must predict for the same input, and the results are then combined by voting. The model is also harder to interpret than a single decision tree, where you can follow one path through the tree to see how a decision was made.

If points (0,3), (2,1), and (-2,2) are the only points assigned a cluster, what is the centroid for this cluster? (1,1.5) (1,2) (0,2) (0,1.5)

Centroid = (mean of x, mean of y) = ((0 + 2 - 2)/3, (3 + 1 + 2)/3) = (0, 2)
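
The same arithmetic as a small sketch (the function name is an assumption for illustration):

```python
def centroid(points):
    """Mean of the x coordinates and mean of the y coordinates."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return (mean_x, mean_y)

cluster = [(0, 3), (2, 1), (-2, 2)]
center = centroid(cluster)
```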

The choice of k, the number of clusters to partition a set into: is a personal choice that shouldn't be discussed in results that are presented. should always be as large as your computer system can handle. depends on why and how you are clustering the data. ... has a maximum of 10.

depends on why and how you are clustering the data.

Given 1000 data points, if we set the minimum number of points required to split a node to 200 and the minimum leaf size to 300, what is the maximum possible depth of the resulting decision tree? 1 2 3 4 5

2
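
One way to check this answer is to try every split the two constraints allow. The sketch below is an upper-bound calculation under the assumption that the data permits any split sizes; it counts the root as depth 0, and the function name is hypothetical.

```python
from functools import lru_cache

def max_tree_depth(n_samples, min_samples_split, min_samples_leaf):
    """Largest depth reachable if every size-respecting split were possible."""

    @lru_cache(maxsize=None)
    def depth(n):
        # A node can split only if it has enough samples to be split
        # and both children can still meet the minimum leaf size.
        if n < min_samples_split or n < 2 * min_samples_leaf:
            return 0
        best = 0
        for left in range(min_samples_leaf, n - min_samples_leaf + 1):
            best = max(best, 1 + max(depth(left), depth(n - left)))
        return best

    return depth(n_samples)

deepest = max_tree_depth(1000, 200, 300)
```

With 1000 points, the root can split into, say, 300 and 700; the 700 node can split into 300 and 400; after that no node has the 600 samples needed to produce two leaves of 300.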

What is the total (positive - negative) bag-of-words sentiment of this document: "I love to hate twitter very, very much." Positive dictionary: [love, very] Negative dictionary: [hate, worse] 1 5 -1 2 4 0

2
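
A minimal sketch of this count, assuming simple lowercase tokenization that ignores punctuation (the function name and the regular expression are illustrative choices):

```python
import re

POSITIVE = {"love", "very"}
NEGATIVE = {"hate", "worse"}

def sentiment_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Count positive words minus negative words, counting repeats."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in positive for word in words)
    neg = sum(word in negative for word in words)
    return pos - neg

score = sentiment_score("I love to hate twitter very, very much.")
```

Here "love" and the two occurrences of "very" give 3 positive hits, and "hate" gives 1 negative hit. The dictionaries are stored as sets so each lookup is fast.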

Which of the following tags is the correct way to create a hyperlink? <a>http://www.website.com</a> <a href="http://www.website.com">http://www.website.com</a> <a name="http://www.website.com">http://www.website.com</a> <a url="http://www.website.com">http://www.website.com</a> <a link="http://www.website.com">http://www.website.com</a>

<a href="http://www.website.com">http://www.website.com</a>

The gini score is a metric that quantifies the purity of the node.

A gini score greater than zero implies that samples contained within that node belong to different classes. A gini score of zero means that the node is pure, that within that node only a single class of samples exist.
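
As a rough sketch, the Gini score of a node can be computed as 1 minus the sum of squared class proportions. The function name and the toy labels are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions in the node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure_node = gini(["a", "a", "a", "a"])    # only one class present
mixed_node = gini(["a", "a", "b", "b"])   # two classes, evenly split
```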

Random Forest

Advantages: Random forests are considered highly accurate and robust because many decision trees participate in each prediction. They are much less prone to overfitting than a single decision tree, mainly because averaging the predictions of many trees reduces variance. The algorithm can be used for both classification and regression problems. Random forests can also handle missing values in two ways: replacing missing continuous values with the median, or computing a proximity-weighted average of the missing values. You can also get the relative feature importance, which helps in selecting the features that contribute most to the classifier.

Which of the following can act as possible termination conditions in k-means fitting? A fixed number of iterations have passed. The assignment of items to clusters does not change between iterations. Centroids do not change position between successive iterations. The within-cluster variance falls below a pre-set threshold.

All correct

Which of the following statements are true about cluster analysis in general? Cluster analysis doesn't require labeled training data; the goal is to find structure in the data. When clustering, we want to put two dissimilar data objects into the same cluster. In order to perform cluster analysis, we need to have a similarity or distance measure between data objects. We must know the number of output clusters k before running all clustering algorithms. Agglomerative clustering is an example of a distance-based clustering method. We can only visualize clustering results when the data is two-dimensional. Different clustering algorithms are able to detect clusters of different sizes and shapes.

Cluster analysis doesn't require labeled training data; the goal is to find structure in the data. In order to perform cluster analysis, we need to have a similarity or distance measure between data objects. Agglomerative clustering is an example of a distance-based clustering method. Different clustering algorithms are able to detect clusters of different sizes and shapes.

Unsupervised Model

Clustering: Finding patterns and groupings in unlabeled data

A visualization technique to view a distribution of the popularity of words in a bag-of-words model is: Scatterplot matrix Scatterplot Histogram Regression line

Histogram

If I am using all features of my dataset and I achieve 100% accuracy on my training set but 70% on my testing set, what should I be correcting? Model under-fit Model overfit Missing data The selected classification algorithm

Model overfit

Which Python data structure would be best for representing a positive word dictionary, so that we can efficiently look up potential positive words in it? List String CSV File Set

Set

Which of the following are reasons that might cause a random forest to overfit the input data? The number of trees The depth of each tree The percentage of data used for training

The number of trees. The depth of each tree. The percentage of data used for training.

Which of the following statements is true about the k-Nearest Neighbors algorithm? The choice of k is irrelevant to the output. The algorithm decides the value of k. The optimal value of k is different for each dataset. The optimal value of k can change with each question you ask about a dataset.

The optimal value of k is different for each dataset. The optimal value of k can change with each question you ask about a dataset.

In ensemble learning, you aggregate the predictions for weak learners under the assumption that a collection of these models will give a better prediction than the individual models. Which of the following statements are true for weak learners used in these ensembles? They don't usually overfit They usually overfit They have high bias, so they cannot solve complex learning problems

They usually overfit

Identify the cosine similarity value for these pairs of documents: two identical documents; two documents with nothing in common. Options: 1, 0, -1, infinity

Two identical documents: 1. Two documents with nothing in common: 0.
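
Both values can be checked with a small bag-of-words sketch. Whitespace tokenization and the example sentences are assumptions for illustration.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two word-count vectors."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares nothing with any other
    return dot / (norm_a * norm_b)

identical = cosine_similarity("data science is fun", "data science is fun")
disjoint = cosine_similarity("data science", "banana slicer")
```

Because word counts are never negative, the similarity of two bag-of-words vectors cannot go below 0, which rules out -1 as an option.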

Decision Tree Parameters

criterion: How the impurity of a split is measured. The default value is "gini", but you can also use "entropy" as the impurity metric.
splitter: How the decision tree searches the features for a split.
max_depth: The maximum depth of the tree.
min_samples_split: The minimum number of samples a node must contain in order to be considered for splitting.
min_samples_leaf: The minimum number of samples required at a leaf node.
max_features: The number of features to consider when looking for the best split.

Which of the following are disadvantages of choosing a decision tree classifier? Decision trees are difficult to interpret. Decision trees are not stable algorithms. Decision trees can easily overfit data.

Decision trees are not stable algorithms. Decision trees can easily overfit data.

Advantages of decision trees

Decision trees can be used to predict both continuous and discrete values, i.e. they work for both regression and classification tasks. They require relatively little effort to train. They can be used to classify non-linearly separable data. They're very fast and efficient compared to KNN and other classification algorithms.

A major goal of the QAC approach is to: Describe a problem and solution in a meaningful way. Challenge critical thinking skills. A means of getting the CMDA major approved.

Describe a problem and solution in a meaningful way.

parse

Extract information from an HTML page

crawl

Find URLs of more pages to retrieve

What is the purpose of using div tags in HTML? For creating different styles For creating different sections For adding headings For adding titles

For creating different styles. For creating different sections.

Which of the following statements are true about the DBSCAN algorithm? For data points to be in a cluster, they must be in a distance threshold to a core point. It has strong assumptions for the distribution of data points in dataspace. It does not require prior knowledge of the number of desired clusters. It is capable of detecting outliers.

For data points to be in a cluster, they must be in a distance threshold to a core point. It does not require prior knowledge of the number of desired clusters. It is capable of detecting outliers.
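
These three properties show up in a compact sketch of the algorithm. This is a minimal, unoptimized version for illustration; the parameter names eps and min_pts and the toy points are assumptions, and a point counts as its own neighbor here.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point within eps of a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is also core: keep expanding
                queue.extend(j_neighbors)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (4, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The number of clusters is never passed in; it falls out of eps and min_pts, and the isolated point (4, 20) is labeled as noise.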

Server-side script

Generate an HTML page from a database

Throughout a large corpus of English documents, we should expect words such as "the," "a," and "and" to have consistently: Higher than average TF Lower than average TFIDF Higher than average TFIDF Lower than average TF

Higher than average TF. Lower than average TFIDF.
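
A tiny worked example shows why both answers hold. It assumes the plain definition idf = log(N / df) with no smoothing (libraries often use smoothed variants), and the three documents are made up.

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
corpus = [doc.split() for doc in docs]

def tf(term, tokens):
    """Term frequency: share of the document's words that are this term."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Inverse document frequency, unsmoothed: log(N / documents containing term)."""
    df = sum(term in tokens for tokens in corpus)
    return math.log(len(corpus) / df)

def tfidf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

first = corpus[0]
the_tf, mat_tf = tf("the", first), tf("mat", first)
the_tfidf, mat_tfidf = tfidf("the", first, corpus), tfidf("mat", first, corpus)
```

"the" appears in every document, so its IDF is log(1) = 0 and its TF-IDF vanishes even though its TF is the highest in the document.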

Feature normalization is an important step to perform before running most clustering algorithms. What is the reason behind this? In distance calculation it will give the same influence for all features. You always get the same clusters when running the analysis repeatedly. The Euclidean distance computation will result in mathematical errors. None of these.

In distance calculation it will give the same influence for all features
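
A short sketch of the effect, using z-score normalization as one common choice. The (age, income) rows are hypothetical values chosen so that the two features have very different scales.

```python
import math
import statistics

def zscore_columns(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]

# Hypothetical people: (age in years, income in dollars).
people = [(25, 40000), (45, 42000), (30, 90000)]

# On raw data the income column dominates the Euclidean distance:
# a 20-year age gap barely registers next to a 2000-dollar income gap.
raw_d01 = math.dist(people[0], people[1])
raw_d02 = math.dist(people[0], people[2])

scaled = zscore_columns(people)
scaled_d01 = math.dist(scaled[0], scaled[1])
scaled_d02 = math.dist(scaled[0], scaled[2])
```

After scaling, both columns have the same spread, so neither feature controls the distance just because of its units.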

k-Nearest Neighbors Lazy algorithm:

It does not build a model from the training data ahead of time; instead, all training data is used in the testing phase. This makes training faster and the testing phase slower and costlier. In the worst case, KNN must scan every data point to make a prediction, which takes more time, and it must keep all training data in memory.
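
The "lazy" behavior is visible in a minimal sketch: there is no fit step, and every prediction scans the full training list. The toy data, labels, and function name are assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points.

    train is a list of (features, label) pairs; all of it is scanned per query.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [
    ((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.0), "red"),
    ((6.0, 6.0), "blue"), ((7.0, 5.5), "blue"), ((6.5, 7.0), "blue"),
]

label_a = knn_predict(train, (1.2, 1.4))
label_b = knn_predict(train, (6.4, 6.1))
```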

What is the function of a tag in HTML? It allows you to open your web browser It tells a web browser that this is the start of an HTML document It creates a web site on your computer It allows you to link to another web page It indicates some behavior that your web browser uses for formatting

It indicates some behavior that your web browser uses for formatting

K-means clustering algorithm

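K-means clustering uses "centroids": K different randomly initialized points in the data. It assigns every data point to the nearest centroid, then moves each centroid to the mean of the points assigned to it, and repeats until a termination condition is met.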
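
A minimal sketch of this loop, assuming centroids are initialized by sampling k of the data points and that the run stops when assignments no longer change. The function signature, the seed parameter, and the toy points are illustrative assumptions.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=None):
    """Return (centroids, assignment) after iterating assign/update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)     # random initial centroids from the data
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # nothing moved: converged
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
final_centroids, labels = kmeans(points, k=2, seed=0)
```

Because the starting centroids are random, a different seed can number the clusters differently, and on less cleanly separated data it can produce a different clustering altogether.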

When representing documents with a bag-of-words model: You are forced to choose a large fixed vocabulary, such as the Oxford English Dictionary. The order of words in each document contains semantic meaning. Duplicate words are removed. Multiple documents can be aggregated by concatenating their bags.

Multiple documents can be aggregated by concatenating their bags.

Are two runs of k-means clustering expected to yield the same clustering results? Yes No

No

k-Nearest Neighbors Non-parametric:

There is no assumption about the underlying data distribution; the model structure is determined from the dataset.

Fetch

Retrieve an HTML page from the web

QAC is an acronym for: Query, Answer, Case; Question, Analysis, Conclusion; Quiz, Action, Constant; Question, Answer, Correlation

Question, Analysis, Conclusion

Which of the following are reasons to use PCA for dimensionality reduction? Reducing the data dimensionality will enable a future algorithm in the analysis pipeline to run more quickly. Reducing the data dimensionality to 2D or 3D can give you an intuition of the shape of the data by plotting. Principal Components can be added to the original DataFrame to have more dimensions to use for future learning. PCA can be used as a replacement for linear regression.

Reducing the data dimensionality will enable a future algorithm in the analysis pipeline to run more quickly. Reducing the data dimensionality to 2D or 3D can give you an intuition of the shape of the data by plotting.

Browser

Render an HTML page for human consumption

Common way(s) to locate data in an HTML page using BeautifulSoup are: Search for tags and identifiers. Navigate tag hierarchy. Query its x,y position on the page. Search by the exact data value.

Search for tags and identifiers. Navigate tag hierarchy.

How would you best describe the Vector Space Model matrix expected for a large set of tweets, where each tweet was treated as a document? Dense Sparse Diagonal Symmetric

Sparse

What are the two types of hierarchical clustering? Top-down (divisive) Bottom-up (agglomerative) k-means Dendrogram DBSCAN

Top-down (divisive) Bottom-up (agglomerative)

Kernel

Transforms an input data space into the required form. SVMs use a technique called the "kernel trick": the kernel takes a low-dimensional input space and transforms it into a higher-dimensional space. In other words, it converts a nonseparable problem into a separable one by adding more dimensions to it.
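
The effect of adding a dimension can be sketched with 1-D toy data. Note the simplification: this example builds the higher-dimensional features explicitly with a hypothetical map x -> (x, x^2), whereas a real kernel obtains the same effect through inner products without constructing the mapped points. The data and the threshold are made up.

```python
def feature_map(x):
    """Lift a 1-D point into 2-D by adding its square as a second feature."""
    return (x, x * x)

# Class 1 sits in the middle with class 0 on both sides, so no single
# cut point on the number line separates the two classes.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [0, 0, 1, 1, 1, 0, 0]

mapped = [feature_map(x) for x in xs]

# In the lifted space a horizontal line (second feature = 2) separates them.
threshold = 2.0
predicted = [1 if second < threshold else 0 for _, second in mapped]
```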

The big difference between supervised and unsupervised learning algorithms is: Supervised algorithms can only be performed on quantitative data. Unsupervised algorithms don't require labels. You need to pay closer attention for errors with supervised algorithms. Only clustering algorithms are supervised.

Unsupervised algorithms don't require labels.

Which of the following statements about Principal Component Analysis are true? PCA can only reduce dimensionality from n dimensions to n-1 dimensions. Because the algorithm may reach a local rather than a global minimum, we should run multiple PCA initializations. We should z-score normalize data dimensions prior to running PCA. The number of possible principal components is equal to the number of input dimensions.

We should z-score normalize data dimensions prior to running PCA. The number of possible principal components is equal to the number of input dimensions.

Identify the correct statements related to document sentiment classification. We want to classify a whole document based on the overall sentiment of the opinion holder. Sentiment classification completely depends upon opinion words. Document sentiment classification is different from topic-based text classification. Three classes are possible: positive, negative, and neutral

We want to classify a whole document based on the overall sentiment of the opinion holder. Document sentiment classification is different from topic-based text classification. Three classes are possible: positive, negative, and neutral

As the k-Means algorithm runs, we currently have 3 centroids (0,1), (2,1), and (-1,2). Will points (2,3) and (2,0.5) be assigned to the same cluster in the next iteration? Yes No

Yes
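
The assignment step for these two points can be checked directly. A minimal sketch, assuming Euclidean distance (the function name is illustrative):

```python
import math

centroids = [(0, 1), (2, 1), (-1, 2)]

def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest Euclidean distance to point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

cluster_a = nearest_centroid((2, 3), centroids)
cluster_b = nearest_centroid((2, 0.5), centroids)
same_cluster = cluster_a == cluster_b
```

Point (2,3) is at distance 2 from (2,1), closer than to either other centroid, and (2,0.5) is at distance 0.5 from (2,1), so both land in that centroid's cluster.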

The HTML5 Document Object Model for a web page: is composed of tags. is organized in a tree structure. contains attribute=value pairs. stores database tables.

is composed of tags. is organized in a tree structure. contains attribute=value pairs.

Which of these are true about the k-means algorithm we discussed in class? k-means can produce different clusterings for the same data and same k. k-means clusters the data by selecting the k most representative rows from the dataset. k-means always selects exactly k clusters. k-means will leave outliers out of all of the k clusters. k-means uses a distance matrix.

k-means can produce different clusterings for the same data and same k. k-means always selects exactly k clusters.

Supervised Learning

Uses labeled training data to guide the ML program toward superior forecasting accuracy. Classification: classifying labeled data. Regression: predicting trends using previous labeled data.

