CS397 - Natural Language Processing MIDTERM REVIEW
Why is it important to prevent overfitting in Decision Trees?
Overfitting occurs when the tree is too complex, capturing noise along with the underlying pattern. This can lead to poor generalization on unseen data.
What distinguishes parameters from hyperparameters in machine learning models?
Parameters are learned from the data during training, while hyperparameters are set before training and control the learning process. For example, the weights of a logistic regression model are parameters, while its regularization strength is a hyperparameter.
Why might a model trained with BoW/TF-IDF and Naive Bayes perform poorly on a relatively balanced dataset?
Poor performance can stem from the high dimensionality and sparsity inherent to BoW/TF-IDF representations, especially when the number of features far exceeds the number of data points, leaving too few examples to estimate reliable statistics.
How can document embeddings be created from word embeddings?
Document embeddings can be created by averaging the word embeddings for all words in the document or by using more sophisticated methods like doc2vec.
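A minimal sketch of the averaging approach, assuming a hypothetical embeddings dict that maps each word to a NumPy vector:

    import numpy as np

    def document_embedding(tokens, embeddings):
        # Average the vectors of all in-vocabulary words in the document.
        vectors = [embeddings[t] for t in tokens if t in embeddings]
        if not vectors:
            # No known words: fall back to a zero vector of the right size.
            return np.zeros(next(iter(embeddings.values())).shape)
        return np.mean(vectors, axis=0)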
What are essential pre-processing tasks in NLP? Provide a code snippet for one of them.
Essential pre-processing tasks include tokenization, stemming, lemmatization, and normalization. For example, tokenization can be done as follows:
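A minimal sketch using NLTK (assumes the nltk package is installed):

    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download('punkt')  # one-time download of the tokenizer model
    tokens = word_tokenize("NLP midterms are next week!")
    print(tokens)  # ['NLP', 'midterms', 'are', 'next', 'week', '!']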
What is the key advantage of FastText over Word2Vec?
FastText extends Word2Vec by treating each word as a bag of character n-grams, which lets it generate better embeddings for rare words and even build vectors for out-of-vocabulary words from their subword units.
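A sketch using gensim's FastText; the corpus and parameter values here are illustrative:

    from gensim.models import FastText

    sentences = [["natural", "language", "processing"], ["word", "embeddings"]]
    # min_n/max_n set the character n-gram lengths used as subwords
    model = FastText(sentences, vector_size=50, window=3, min_count=1,
                     min_n=3, max_n=6)
    # A vector can be assembled from n-grams even for an unseen word.
    vector = model.wv["languag"]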
In a given Python script for dataset processing, what are typical "TODO" tasks you might need to complete?
Typical TODO tasks include applying stemming and lemmatization, vectorizing the data (BoW and TF-IDF), and fitting a machine learning model such as Naive Bayes.
What is Word2Vec and how does it work?
Word2Vec is a method to produce word embeddings by using a neural network model to learn word associations from a large corpus of text.
Discuss an application of Word2Vec in NLP tasks.
Word2Vec embeddings capture linguistic patterns and word semantics, and serve as input features for tasks like sentiment analysis, machine translation, and topic modeling.
How does Word2Vec differ from traditional bag-of-words models?
Word2Vec uses dense vector representations and learns to predict words from context, capturing semantic relationships, unlike the sparse, high-dimensional vectors in bag-of-words.
How can you calculate the similarity between two word vectors?
You can calculate the cosine similarity between two word vectors to measure their similarity.
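A minimal NumPy sketch with toy vectors:

    import numpy as np

    def cosine_similarity(a, b):
        # cos(theta) = (a dot b) / (||a|| * ||b||)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # 1.0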
When might you choose a logistic regression model in NLP?
You might choose logistic regression for binary classification tasks in NLP, like sentiment analysis, when you need a probabilistic framework.
How can you identify if a dataset, such as one containing tweets about a movie, is labeled or not?
A dataset is labeled if each data point (e.g., each tweet) has an associated label or category, which is often used for supervised learning tasks.
Describe the Continuous Bag of Words (CBOW) model in Word2Vec.
CBOW predicts the target word based on context words. It averages the context word vectors and uses them to predict the target in the middle.
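A sketch of training CBOW with gensim, where sg=0 selects CBOW and sg=1 would select skip-gram (toy corpus):

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "lay", "on", "the", "rug"]]
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
    print(model.wv.most_similar("cat", topn=3))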
What is the difference between dependency parsing and constituency parsing?
Dependency parsing focuses on grammatical relationships between individual words (head-dependent links), while constituency parsing breaks a sentence down into nested phrases (constituents) according to a phrase-structure grammar.
Describe the GloVe model.
GloVe stands for Global Vectors for Word Representation. It is an unsupervised learning algorithm for generating word embeddings by aggregating global word-word co-occurrence statistics from a corpus.
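One common way to experiment with GloVe in Python is to load pretrained vectors through gensim's downloader (the model name below assumes the gensim-data catalog):

    import gensim.downloader as api

    glove = api.load("glove-wiki-gigaword-50")  # 50-dimensional GloVe vectors
    print(glove.most_similar("king", topn=3))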
What issue do LSTM and GRU units address in RNNs?
LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) units address the vanishing gradient problem in RNNs, enabling them to capture long-range dependencies.
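A minimal PyTorch sketch of an LSTM layer; the shapes are illustrative:

    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
    x = torch.randn(4, 10, 8)  # batch of 4 sequences, 10 time steps, 8 features
    output, (h_n, c_n) = lstm(x)  # gating lets gradients survive long spans
    print(output.shape)  # torch.Size([4, 10, 16])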
How does lemmatization differ from stemming?
Lemmatization uses vocabulary and morphological analysis to reduce a word to its dictionary form (lemma) based on its intended meaning, whereas stemming applies heuristic rules to cut off prefixes and suffixes and can produce non-words.
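A quick NLTK comparison (assumes the 'wordnet' data has been downloaded):

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download('wordnet')
    print(PorterStemmer().stem("studies"))            # 'studi' (not a real word)
    print(WordNetLemmatizer().lemmatize("studies"))   # 'study'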
Explain the basic principle of Logistic Regression in machine learning.
Logistic Regression is used for binary classification and relies on the logistic function. It estimates probabilities using Maximum Likelihood Estimation (MLE).
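At its core is the logistic (sigmoid) function, which squashes a linear score into a probability; a minimal sketch with toy weights:

    import numpy as np

    def sigmoid(z):
        # Maps any real-valued score to a probability in (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    # P(y=1 | x) = sigmoid(w dot x + b)
    w, b = np.array([0.8, -0.4]), 0.1
    x = np.array([1.0, 2.0])
    print(sigmoid(np.dot(w, x) + b))  # ~0.52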
What is Naive Bayes commonly used for in NLP?
Naive Bayes is commonly used for classification tasks in NLP, such as spam detection or sentiment analysis, due to its simplicity and effectiveness.
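A sketch of a BoW + Multinomial Naive Bayes sentiment classifier with scikit-learn (the tiny dataset is illustrative):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["great movie", "terrible plot", "loved it", "awful acting"]
    labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative
    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(texts, labels)
    print(clf.predict(["great acting"]))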
What is the purpose of negative sampling in Word2Vec?
Negative sampling improves training speed and efficiency by training the model to differentiate target words from a small number of noise words.
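In gensim's Word2Vec this is exposed through the negative parameter, the number of noise words drawn per positive example; a sketch:

    from gensim.models import Word2Vec

    sentences = [["negative", "sampling", "speeds", "up", "training"]]
    # negative=5: contrast each true (target, context) pair with 5 noise words
    model = Word2Vec(sentences, vector_size=50, min_count=1, negative=5)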
What is the key concept behind Support Vector Machines (SVMs)?
SVMs aim to find a maximum-margin hyperplane in the feature space that best separates the classes, and they can use kernel functions to handle non-linearly separable data.
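A sketch with scikit-learn's SVC, using an RBF kernel on a toy dataset that is not linearly separable:

    from sklearn.svm import SVC

    X = [[0, 0], [1, 1], [1, 0], [0, 1]]  # XOR layout: no separating line exists
    y = [0, 0, 1, 1]
    clf = SVC(kernel="rbf", C=1.0)  # the kernel maps points to a separable space
    clf.fit(X, y)
    print(clf.predict([[0.9, 0.1]]))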
Why is sentiment analysis considered a form of classification?
Sentiment analysis is considered a form of classification because it categorizes a piece of text according to the sentiment expressed in it.
Why is it important to remove stopwords in NLP?
Stopwords are common words that are usually filtered out because they contribute little to the understanding of the text and can skew analysis.
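A sketch using NLTK's English stopword list (assumes the 'stopwords' corpus has been downloaded):

    import nltk
    from nltk.corpus import stopwords

    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))
    tokens = ["the", "movie", "was", "surprisingly", "good"]
    print([t for t in tokens if t not in stop_words])  # ['movie', 'surprisingly', 'good']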
What does TF-IDF stand for and what does it measure?
TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how important a word is to a document within a collection: a term scores highly when it appears often in that document but rarely across the other documents.
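A sketch with scikit-learn's TfidfVectorizer on a toy corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat", "the dog barked", "the cat purred"]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary
    print(vectorizer.get_feature_names_out())
    print(X.toarray().round(2))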
What are some techniques that can improve classification results by manipulating the data?
Techniques include bootstrapping to increase the number of data points, applying dimensionality reduction to reduce feature space, normalizing features, and using Laplace smoothing to address zero probabilities in Naive Bayes.
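For instance, Laplace (add-one) smoothing in scikit-learn's MultinomialNB is controlled by the alpha parameter:

    from sklearn.naive_bayes import MultinomialNB

    # alpha=1.0 adds one pseudo-count per word, so a word never seen with a
    # class during training does not zero out the whole class probability.
    clf = MultinomialNB(alpha=1.0)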
What is the Bag-of-Words model?
The Bag-of-Words model represents text as an unordered collection of words, disregarding grammar and word order but keeping multiplicity.
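A sketch with scikit-learn's CountVectorizer:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the cat sat"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)
    print(vectorizer.get_feature_names_out())  # vocabulary (word order is lost)
    print(X.toarray())  # per-document word counts (multiplicity is kept)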
What is the purpose of stemming in text pre-processing?
Stemming reduces words to their stem or root form by stripping affixes (e.g., 'running' becomes 'run'), shrinking the vocabulary so that related word forms map to the same feature.
What is tokenization in NLP?
Tokenization is the process of breaking down text into smaller units, called tokens, which can be words, phrases, or symbols.