CSCI-544


Which of the following statements regarding sentence structure representation are correct? Phrases in constituency structure are represented by terminal nodes Syntactic structures can help resolve semantic ambiguity given the sentence "watching a model train" We can use constituency or dependency structure There is always a directed path from the root to every other node in the dependency structure of the sentence "I saw a girl with a telescope."

We can use constituency or dependency structure There is always a directed path from the root to every other node in the dependency structure of the sentence "I saw a girl with a telescope."

Which of the following is not a hyper-parameter when training an NN? Weight decay Learning rate Batch size Weights of each layer

Weights of Each Layer

Which statements regarding seq2seq are correct? Exhaustive search is an efficient and commonly used method for sequence generation The gradients will be back-propagated through decoder only When the model generates the special <EOS> token, the model stops generating The initial start token for decoder is the special token <BOS>

When the model generates the special <EOS> token, the model stops generating The initial start token for decoder is the special token <BOS>

When subclassing torch.utils.data.Dataset, which of these magic methods do you need to implement? __getitem__ __sizeof__ __eq__ __hash__

__getitem__
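
A minimal sketch of such a Dataset subclass; in practice __len__ is also expected alongside __getitem__ so a DataLoader can sample from it. The field names and data layout here are illustrative, not from the course.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    """Minimal Dataset wrapping pre-tokenized examples (illustrative)."""
    def __init__(self, token_ids, labels):
        self.token_ids = token_ids  # list of 1-D LongTensors
        self.labels = labels        # list of ints

    def __len__(self):
        # Number of examples; used by DataLoader for sampling.
        return len(self.labels)

    def __getitem__(self, idx):
        # Return one (input, label) pair for index idx.
        return self.token_ids[idx], torch.tensor(self.labels[idx])

# Usage: wrap in a DataLoader for batching and shuffling.
# loader = DataLoader(TextDataset(ids, labels), batch_size=32, shuffle=True)
```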

Compared to one-directional simple RNN, BLSTM: (select all that apply) can capture both past and future context in a sequence is computationally more efficient is better suited for tasks requiring long-range dependencies reduces the risk of vanishing gradient issues

can capture both past and future context in a sequence is better suited for tasks requiring long-range dependencies reduces the risk of vanishing gradient issues

Each hidden state in a simple RNN is recurrently computed with: (select all that apply) input at current step hidden state at current step input at previous step hidden state at previous step

input at current step, hidden state at previous step

ReLU function:

max(0,x)
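
For reference, a small sketch of how ReLU behaves element-wise in PyTorch (not course-provided code):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(torch.relu(x))          # tensor([0.0000, 0.0000, 0.0000, 1.5000])
print(torch.clamp(x, min=0))  # equivalent: max(0, x) applied element-wise
```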

You want to implement an MLP consisting of several linear layers with ReLU nonlinearities, emphasizing conciseness and simplicity. Which of these classes should your model class subclass? nn.Sequential nn.Module nn.Linear nn.LazyLinear

nn.Sequential
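
A minimal sketch of such an MLP built by subclassing nn.Sequential; the layer sizes are illustrative.

```python
import torch.nn as nn

class MLP(nn.Sequential):
    """Concise MLP: layers passed to nn.Sequential are applied in order."""
    def __init__(self, in_dim=300, hidden=128, out_dim=2):
        super().__init__(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

# forward() is inherited from nn.Sequential, so no extra code is needed.
```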

How do you move a tensor to a GPU in PyTorch? tensor.gpu() tensor.to('cuda') Use the torch.with_grad context-manager tensor.device('nvidia')

tensor.to('cuda')
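
A short sketch of moving tensors and a model to the GPU, falling back to CPU when CUDA is unavailable:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(8, 10)
x = x.to(device)                      # tensor moves are out-of-place: keep the return value

model = nn.Linear(10, 2).to(device)   # modules are moved in-place; .to() also returns the module
y = model(x)                          # inputs and parameters must live on the same device
```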

From which PyTorch module could you import things like loss functions and nonlinearities? torch.nn torch.tensor torch.autograd torch.optim

torch.nn

What is/are possible shape(s) of the weight matrices in a simple RNN given input length is n, input dimension is d and hidden state dimension is d'? (select all that apply) (d', d) (d', d') (d, d) (n, d)

(d', d) (d', d')
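
This can be checked directly on nn.RNN: the input-to-hidden matrix has shape (d', d) and the hidden-to-hidden matrix has shape (d', d'). A sketch with illustrative sizes:

```python
import torch.nn as nn

d, d_prime = 50, 128  # input dimension d, hidden state dimension d'
rnn = nn.RNN(input_size=d, hidden_size=d_prime, batch_first=True)

print(rnn.weight_ih_l0.shape)  # torch.Size([128, 50])  -> (d', d)
print(rnn.weight_hh_l0.shape)  # torch.Size([128, 128]) -> (d', d')
```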

Considering a classification task where ground-truth labels are [1, 0, 0, 1], suppose we have a model with predicted labels [1, 1, 1, 0]; then the precision of the predictions is 1/3 1/4 1/2 3/4

1/3 (precision = TP / (TP + FP); here TP = 1 and FP = 2)
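
A quick check of that computation (sketch):

```python
y_true = [1, 0, 0, 1]
y_pred = [1, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)  # 1
fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)  # 2
precision = tp / (tp + fp)
print(precision)  # 0.333... = 1/3
```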

In beam search decoding, suppose the beam size is 3, then how many candidates are we considering at step 2? 4 9 6 5

9 (each of the 3 beams kept after step 1 proposes its top 3 extensions, so 3 × 3 = 9 candidates are considered before pruning back to 3)

Why are Conditional Random Fields (CRFs) often preferred over MEMMs for sequence labeling? A) CRFs model dependencies between labels jointly rather than independently. B) CRFs do not require training on labeled data. C) CRFs completely remove the need for feature engineering. D) CRFs only use rule-based approaches for classification.

A) CRFs model dependencies between labels jointly rather than independently.

Which of the following strategies can be used for dependency parsing? Bidirectional LSTM Cubic Parsing Algorithm Stack-Pointer Network Deep BiAffine Parser

All of them

What is the main limitation of the Bag-of-Words (BoW) model in NLP? A) It captures word order information. B) It generates dense vector representations. C) It ignores the structure and order of words in a sentence. D) It uses contextual embeddings for better representation.

C) It ignores the structure and order of words in a sentence.

Which of the following are types of deep generative models? (Select all that apply) Auto-regressive models Diffusion models Principal Component Analysis (PCA) Normalizing flows

Auto-regressive models Diffusion models Normalizing flows

What does the "Maximum Entropy Markov Model (MEMM)" use that differentiates it from an HMM? A) It models joint probability instead of conditional probability.B) It uses feature-based representations instead of transition probabilities.C) It ignores sequential dependencies in labeling.D) It always assumes a uniform probability distribution.

B) It uses feature-based representations instead of transition probabilities.

What is the main drawback of using Recurrent Neural Networks (RNNs) for sequence processing? A) They are too simple and do not contain hidden layers. B) They suffer from vanishing and exploding gradients during training. C) They do not perform well on structured data. D) They cannot process numerical data.

B) They suffer from vanishing and exploding gradients during training

Which word embedding technique uses a probabilistic approach to predict word occurrences based on context? A) TF-IDF B) Word2Vec C) Bag of Words D) One-hot encoding

B) Word2Vec

Which of the following statements about BERT are correct? To fine-tune BERT on downstream data, the objective is identical to that in pre-training (i.e., masked token prediction) BERT utilizes contextual information to predict the masked token during training BERT-large is deeper (i.e., more layers) and broader (i.e., more parameters per layer) compared with BERT-base BERT needs labeled data for pretraining

BERT utilizes contextual information to predict the masked token during training BERT-large is deeper (i.e., more layers) and broader (i.e., more parameters per layer) compared with BERT-base

Which of the following metrics considers n-gram overlap between the reference and the prediction in translation? BERTScore BLEURT BLEU

BLEU

In TF-IDF, what does the "Inverse Document Frequency (IDF)" measure? A) How frequently a word appears in a single document B) How frequently a word appears in all documents C) How rare a word is across the entire corpus D) The importance of stop words in the document

C) How rare a word is across the entire corpus.
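
A small sketch of the usual IDF computation, idf(t) = log(N / df(t)), where df(t) is the number of documents containing t (exact smoothing variants differ by implementation; the documents here are made up):

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog barked",
    "a cat and a dog",
]
N = len(docs)

def idf(term):
    df = sum(1 for d in docs if term in d.split())  # document frequency
    return math.log(N / df)                         # rarer terms get larger IDF

print(idf("the"))  # in 2 of 3 docs -> log(3/2) ~ 0.405
print(idf("mat"))  # in 1 of 3 docs -> log(3)   ~ 1.099
```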

In the Naive Bayes classifier, what assumption is made about the features? A) They are conditionally dependent on one another. B) They follow a uniform probability distribution. C) They are independent given the class label. D) They are always continuous variables.

C) They are independent given the class label

Which of the following is NOT a key advantage of using Transformer models like BERT? A) They can capture long-range dependencies in text. B) They use self-attention mechanisms for context understanding. C) They require very little computational power compared to traditional NLP models. D) They can be fine-tuned for various NLP tasks.

C) They require very little computational power compared to traditional NLP models.

What is the primary function of the Viterbi algorithm in NLP? A) To generate embeddings for words B) To compute the TF-IDF score of terms in a document C) To find the most probable sequence of hidden states in an HMM D) To train deep learning models for sequence prediction

C) To find the most probable sequence of hidden states in an HMM

Which of the following statements accurately describes a step in the backpropagation algorithm for training a neural network? Forward pass, where inputs are passed through the network to compute the output. Adjusting weights and biases based on the difference between predicted and actual outputs. Initializing network parameters randomly before training. Calculating the gradient of the loss with respect to the weights and biases

Calculating the gradient of the loss with respect to the weights and biases

Which of the following methods could be used for preprocessing? A) Contracting and standardizing B) Removing extra spaces C) Substituting words with their synonyms D) Stemming

Contracting and standardizing Removing extra spaces Stemming

A valid residual connection in a Transformer encoder block is represented by the equation: Y=LayerNorm(F(X)×X) True False

False (a residual connection adds the input: Y = LayerNorm(F(X) + X), not a product)

A feedforward neural network (FNN) is good at modeling sentences of various lengths. True False

False

In a pre-norm Transformer architecture, the normalization layer is applied after the sublayer's transformation (e.g., after self-attention or the feed-forward network) to help stabilize training. True False

False (in a pre-norm architecture, LayerNorm is applied before the sublayer's transformation; applying it after is the post-norm design)

In the training loop, optimizer.zero_grad() should go after loss.backward() and before optimizer.step(). True False

False
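
For reference, the conventional ordering in a PyTorch training loop (a sketch with illustrative model and data): gradients are zeroed before the backward pass, so optimizer.step() sees the freshly accumulated gradients.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(4, 10), torch.tensor([0, 1, 1, 0])

for epoch in range(3):
    optimizer.zero_grad()          # clear gradients left over from the previous step
    loss = criterion(model(x), y)
    loss.backward()                # accumulate fresh gradients
    optimizer.step()               # update parameters using those gradients
```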

Is the following statement regarding the attention mechanism in seq2seq true or not: at each decoding step, the attention scores (after the softmax operation) summed over input tokens depend on the input length; the longer the input, the larger the attention sum. True False

False (after the softmax, the attention weights sum to 1 regardless of input length)

Is this statement true or false: BERT is an encoder-decoder model which can be used for both classification and generation. True False

False (BERT is an encoder-only model)

Self-attention builds in order information. True False

False

The hyperparameters in feedforward neural networks are learnable parameters. True False

False

True/False: BLSTM-CNN-CRF model outperforming BLSTM-CNN on POS-tagging suggests that structured models can help reduce gradient vanishing/exploding. True False

False

True/False: Since convolutional neural networks can divide input sequence into small context windows and compute them in parallel, it handles long dependency modeling well. True False

False

Word2Vec is a Factorization-based word embedding approach. True False

False

Sentiment analysis is a structured prediction task. True False

False, because the output is a single label and does not involve predicting internal structure within the sentence.

Which of the following is NOT a natural language? English First order logic C# C++

First order logic, C#, C++

Which of the following statements regarding semi-supervised NMT are correct? For self-learning NMT, the models are provided with parallel data and monolingual target data In noisy self-learning NMT, we add noise to synthetic target For back translation NMT, the models are provided with parallel data, synthetic source and genuine target data

For back translation NMT, the models are provided with parallel data, synthetic source and genuine target data

Which of the following statements about the multi-modal GPT-4o is CORRECT? GPT-4o cannot generate codes GPT-4o can understand audio input GPT-4o can generate stories GPT-4o can receive video input

GPT-4o can understand audio input GPT-4o can generate stories GPT-4o can receive video input

Which of the following procedures is NOT a standard step for classification tasks? A) Human study B) Evaluation C) Preprocessing D) Feature extraction

Human Study

Which statements about seq2seq decoding are WRONG? In exhaustive search, at each decoding step, the token with the highest probability is selected Beam search with k=1 is the same as greedy decoding Generation from beam search is guaranteed to be optimal In greedy decoding, the predicted token is sampled according to their probability

In exhaustive search, at each decoding step, the token with the highest probability is selected Generation from beam search is guaranteed to be optimal In greedy decoding, the predicted token is sampled according to their probability

Which of the following statements regarding vocabulary across multiple languages are correct? Information is not shared if we directly combine individual vocabularies Sequences become too long if we adopt byte-level vocabulary Binary Pair Encoding starts from a small vocabulary and learns to merge tokens Contextual information is easy to model with character-level vocabulary

Information is not shared if we directly combine individual vocabularies Sequences become too long if we adopt byte-level vocabulary Contextual information is easy to model with character-level vocabulary

Which of the following are challenges associated with batch normalization? It normalizes along the feature dimension only. It can be brittle when applied to data from different domains. It relies on running statistics during test time. It requires sufficiently large batch sizes for stable estimates.

It can be brittle when applied to data from different domains. It relies on running statistics during test time. It requires sufficiently large batch sizes for stable estimates.

What is/are true about word embeddings? It can be used to compute the similarity or distance between two different words It is a dense vector per word We can perform linear operations on top of word embeddings It contains contextual information from the sentence

It can be used to compute the similarity or distance between two different words It is a dense vector per word We can perform linear operations on top of word embeddings

Which of these is NOT a major benefit of PyTorch? It contains the largest collection of NLP datasets in one place on the internet It can compute gradients of the loss with respect to each contributing tensor It includes many neural network components, so you don't have to implement them yourself It allows computation to easily be moved to a GPU

It contains the largest collection of NLP datasets in one place on the internet

What improvements does latent diffusion provide over standard diffusion models? (Select all that apply) It introduces a latent space learned by a VAE to reduce computational cost It eliminates the need for iterative denoising It helps retain image structure while using fewer parameters It does not use noise scheduling

It introduces a latent space learned by a VAE to reduce computational cost, It helps retain image structure while using fewer parameters

What is False about non-convex optimization problems? There are multiple local optimal points Initialization will impact the final convergence point DNNs are non-convex It's guaranteed to converge to a global optimum

It's guaranteed to converge to a global optimum

Which of these models use feature vectors? Maximum Entropy Markov Model Hidden Markov Model Conditional Random Field

Maximum Entropy Markov Model and Conditional Random Field

Which of the following classifiers is a generative model? A) Logistic Regression B) Support Vector Machine (SVM) C) Naive Bayes D) Conditional Random Fields (CRFs)

Naive Bayes

What are the advantages of CNN? Parallel computation Simple architecture Good at modeling long dependencies Small context window

Parallel computation and simple architecture

Which of the following components are key parts of the original Transformer encoder block for machine translation? Batch Normalization Recurrent neural network layer Position-wise feed-forward network Multi-head self-attention

Position-wise feed-forward network Multi-head self-attention

Which of the following tasks is NOT a classification task? Grouping images into different categories such as tree, flower, etc Sentiment analysis Predicting book genres from one of the labels, e.g., fiction, narrative, etc Predicting stock price in the next few days

Predicting stock price in the next few days (this is a regression task)

Which of the following experiments include parameter updates? In-context learning Prefix tuning Fully fine-tuning Prompting

Prefix tuning Fully fine-tuning

Which of the following tasks is NOT suitable for a Bidirectional RNN? (select all that apply) Named Entity Recognition (NER) Part-of-Speech (POS) Tagging Sentiment Analysis on Full Sentences Real-time Speech Recognition

Real-time Speech Recognition

Which of the following strategies can be used to add noise to the encoder for self-learning of machine translation? Rephrase original input Dropout on word Add more monolingual source data Word permutations

Rephrase original input Dropout on word Word permutations

Given the vocabulary [first, second, third, Monday, Tuesday, January, February, equality, the], what's the corresponding bag-of-words vector representation for text "Martin Luther King Jr. Day, observed on the third Monday of January, honors the civil rights leader's legacy of promoting equality, justice, and nonviolent activism." [0, 0, 1, 1, 0, 1, 0, 1, 1] [0, 0, 0, 1, 1, 1, 1, 0, 3] [0, 0, 1, 1, 0, 1, 0, 1, 3] [1, 1, 1, 0, 0, 0, 0, 0, 0]

None of the listed options is exactly correct: "the" appears twice in the sentence, so the vector should be [0, 0, 1, 1, 0, 1, 0, 1, 2]
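
A sketch of the count that gives that vector (tokenization simplified to alphabetic words):

```python
import re
from collections import Counter

vocab = ["first", "second", "third", "Monday", "Tuesday",
         "January", "February", "equality", "the"]
text = ("Martin Luther King Jr. Day, observed on the third Monday of January, "
        "honors the civil rights leader's legacy of promoting equality, "
        "justice, and nonviolent activism.")

tokens = re.findall(r"[A-Za-z']+", text)      # strip punctuation, keep words
counts = Counter(tokens)
print([counts[w] for w in vocab])             # [0, 0, 1, 1, 0, 1, 0, 1, 2]
```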

Which of the following tasks are NOT sequence-to-sequence generation? Sentiment Analysis Speech to transcript generation English to French translation Question answering

Sentiment Analysis

Which statement about feature extraction is WRONG? Stop words are useless for sentiment analysis Deep learning automatically learns feature representations for different tasks Capitalization is a significant feature for sentiment analysis Negation words are important for sentiment analysis

"Stop words are useless for sentiment analysis" or "Capitalization is a significant feature for sentiment analysis"

In the context of Word2Vec, what is the purpose of sub-sampling? Sub-sampling is a technique to augment the training dataset with additional samples for better model generalization. Sub-sampling refers to adjusting the learning rate during training to control the convergence speed of the model. Sub-sampling is used to randomly remove a portion of the training data to reduce model overfitting. Sub-sampling involves skipping frequent words during training to improve efficiency and focus on informative words.

Sub-sampling involves skipping frequent words during training to improve efficiency and focus on informative words.

Why does the Transformer model split the queries, keys, and values into multiple heads in its multi-head attention mechanism? To simplify the self-attention computation by averaging multiple attention scores. To allow each head to focus on different positions and capture diverse relationships within the input. To enable the model to process longer sequences without increasing computational cost. To reduce the number of parameters by sharing weights among heads.

To allow each head to focus on different positions and capture diverse relationships within the input.

What is the primary purpose of adding positional embeddings to Transformer inputs? To encode the order of tokens since self-attention is order-agnostic. To enable residual connections. To improve the efficiency of the multi-head attention mechanism. To reduce the overall number of parameters.

To encode the order of tokens since self-attention is order-agnostic.

What is the primary purpose of the Viterbi algorithm in the context of Hidden Markov Models (HMMs)? To generate random sequences of observations based on the HMM parameters To estimate the transition and emission probabilities of the HMM To find the most likely sequence of hidden states given a sequence of observations To calculate the overall probability of a sequence of observations occurring under the HMM

To find the most likely sequence of hidden states given a sequence of observations
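
A compact sketch of Viterbi decoding on a small HMM (log-space omitted for brevity; the states, transition and emission tables are illustrative, not from the course):

```python
import numpy as np

states = ["N", "V"]                      # hidden states (illustrative)
pi = np.array([0.6, 0.4])                # initial state probabilities
A = np.array([[0.7, 0.3],                # A[i, j] = P(state j | state i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],           # B[i, o] = P(observation o | state i)
              [0.1, 0.3, 0.6]])
obs = [0, 2, 1]                          # observed sequence as indices

T, S = len(obs), len(states)
delta = np.zeros((T, S))                 # best path probability ending in each state
back = np.zeros((T, S), dtype=int)       # backpointers

delta[0] = pi * B[:, obs[0]]
for t in range(1, T):
    for j in range(S):
        scores = delta[t - 1] * A[:, j] * B[j, obs[t]]
        back[t, j] = np.argmax(scores)
        delta[t, j] = np.max(scores)

# Recover the most likely state sequence by following the backpointers.
path = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
print([states[s] for s in reversed(path)])
```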

Why is the dot product scaled in the scaled dot-product attention mechanism? To normalize the vectors to unit length To increase the magnitude of dot products for better gradient flow To prevent the softmax function from saturating due to large dot product values To reduce computational complexity

To prevent the softmax function from saturating due to large dot product values
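
A minimal sketch of scaled dot-product attention, with the dot products divided by sqrt(d_k) before the softmax:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # scaling keeps the softmax out of saturation
    weights = F.softmax(scores, dim=-1)                # each row sums to 1 over the keys
    return weights @ V

Q = torch.randn(2, 5, 64)   # (batch, query positions, d_k)
K = torch.randn(2, 7, 64)   # (batch, key positions, d_k)
V = torch.randn(2, 7, 64)
print(scaled_dot_product_attention(Q, K, V).shape)   # torch.Size([2, 5, 64])
```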

Given a fixed word vocabulary, we should expand the vocabulary when one word is not included in the vocabulary during preprocessing A) True B) False

True

In a traditional encoder-decoder architecture, using a single encoding vector to represent the entire source sentence is NOT sufficient for capturing all the necessary contextual information. True False

True

Is this statement true or false: When evaluating dependency parsing, the labeled attachment score does not exceed unlabeled attachment score. True False

True

Is this statement true or false: in autoregressive seq2seq generation, the probability of current step is based on given input (X) and historical generation (Y) True False

True

The outputs of feature extractors are feature vectors. True False

True

True/False: Character-level features can be concatenated into the word representation as input to RNNs for sequence labeling. True False

True

True/False: It is recommended to use some existing word embeddings (such as glove, word2vec) as input representations into RNNs (such as BLSTM). True False

True

True/False: Linear RNNs allow for more efficient parallel computation compared to standard RNNs because they remove sequential dependencies in hidden state updates. True False

True

True/False: The hidden state in an LSTM is computed by multiplying output state with cell state's activation. True False

True

What makes VAEs different from standard autoencoders? (Select all that apply) VAEs use a probabilistic latent space rather than a deterministic encoding VAEs enforce a structure on the latent space using KL divergence VAEs always produce sharper images than GANs VAEs rely solely on adversarial loss for training

VAEs use a probabilistic latent space rather than a deterministic encoding, VAEs enforce a structure on the latent space using KL divergence

Which of the following statements regarding prompting are correct? We can prompt autoregressive language models with few-shot learning. For emotion classification, we can use GPT-3 to directly generate the label. For large language models, fine-tuning is more efficient (i.e., consuming fewer computational resources) than prompting for downstream tasks. For movie sentiment classification, we can prompt BERT to predict the masked token and map prediction to the label set.

We can prompt autoregressive language models with few-shot learning. For emotion classification, we can use GPT-3 to directly generate the label. For movie sentiment classification, we can prompt BERT to predict the masked token and map prediction to the label set.

Which of the following statements about many-to-one, one-to-many and many-to-many NMT are correct? Suppose we have parallel data between language A and B, language C and B, then we can train a model that can support zero-shot translation between A and C. In many-to-many translation, we can directly disregard low-resource data, relying on high-resource data and zero-shot translation. To translate from language A to language B, English is always a good pivot language that we can translate from A to English first, and then English to B. In one-to-many NMT, we can prepend special tokens before sentences to identify the target language.

Suppose we have parallel data between language A and B, language C and B, then we can train a model that can support zero-shot translation between A and C. In one-to-many NMT, we can prepend special tokens before sentences to identify the target language.

Why are VAEs prone to generating blurry images? (Select all that apply) The Gaussian prior smooths the latent space, leading to interpolation effects The KL divergence regularization term limits capacity for fine-grained details VAEs do not optimize the Evidence Lower Bound (ELBO) VAEs suffer from mode collapse like GANs

The Gaussian prior smooths the latent space, leading to interpolation effects, The KL divergence regularization term limits capacity for fine-grained details

Which of these is assumed for a Hidden Markov Model? The observed output at time t depends only on the observed output at time t-1 The observed output at time t depends on both the observed output at time t-1 and the hidden state at time t The hidden state at time t depends only on the hidden state at time t-1 The hidden state at time t depends only on the observed output at time t

The hidden state at time t depends only on the hidden state at time t-1

Which of the following statements are differences/similarities between sequence-to-sequence generation and sequence labeling: For both models, the sequence length of output (Y) can be different from input (X) The space of output (Y) for sequence-to-sequence generation is much larger than that of sequence labeling Both models are composed of an encoder and a decoder

The space of output (Y) for sequence-to-sequence generation is much larger than that of sequence labeling

What are the challenges in exact density estimation models for image generation? (Select all that apply) The space of pixels is huge, making training difficult The sub-space of natural images is sparse relative to the full space These models often waste capacity on meaningless noise They do not use neural networks in any form

The space of pixels is huge, making training difficult, The sub-space of natural images is sparse relative to the full space, These models often waste capacity on meaningless noise

Which of the following statements about parameter-efficient fine-tuning are correct? Both the pre-trained model parameters and newly included parameters are updated in adapters during fine-tuning. Prefix tuning revises the computation of the attention module and feedforward layers with learnable parameters prepended to original vectors. We cannot incorporate adapter, prefix tuning and LoRA in transformers at the same time for parameter-efficient fine-tuning The update matrix in LoRA contains two trainable parameters with much lower rank than the original matrix

The update matrix in LoRA contains two trainable parameters with much lower rank than the original matrix
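
A rough sketch of the idea behind a LoRA update: the frozen weight W is augmented with a low-rank product B·A, and only A and B are trained. The class name, rank, and scaling here are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen Linear layer plus a trainable low-rank update (sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                            # original weights stay frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # trainable, rank << min(d_in, d_out)
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # trainable, zero-initialized
        self.scale = alpha / rank

    def forward(self, x):
        # W x + scale * B A x : the low-rank term is the only trained part.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), rank=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B are trainable
```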

What are the main advantages of generative models? (Select all that apply) They require labeled data for training They can model data distributions to generate new samples They are only useful for image-related tasks They can learn from unlabeled data

They can model data distributions to generate new samples, They can learn from unlabeled data

Which statements about sinusoidal positional embeddings are true? They enable length extension by providing absolute positional information. They use fixed sine and cosine functions to encode positions. They are learnable parameters that adjust during training. They are added to the input word embeddings to incorporate order.

They enable length extension by providing absolute positional information. They use fixed sine and cosine functions to encode positions. They are added to the input word embeddings to incorporate order.
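
A sketch of the fixed sinusoidal encoding from the original Transformer, which is added to the word embeddings; the dimensions below are illustrative.

```python
import math
import torch

def sinusoidal_positions(max_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(max_len).unsqueeze(1).float()            # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))           # per-dimension frequencies
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions: cosine
    return pe                            # fixed, not learned

pe = sinusoidal_positions(max_len=128, d_model=512)
# word_embeddings = word_embeddings + pe[:seq_len]   # added to the input embeddings
print(pe.shape)   # torch.Size([128, 512])
```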

Which statements are true regarding autoregressive models? (Select all that apply) They estimate the joint probability of data as a product of conditional probabilities They generate images pixel by pixel in a sequential manner They are computationally more efficient than GANs They require an invertible transformation to estimate the probability density

They estimate the joint probability of data as a product of conditional probabilities, They generate images pixel by pixel in a sequential manner

What are the main limitations of closed-form analytic solutions in generative modeling? (Select all that apply) They have limited capacity to model complex data distributions They require a large number of parameters for training They are theoretically grounded but lack flexibility They are always superior to neural network-based generative models

They have limited capacity to model complex data distributions, They are theoretically grounded but lack flexibility

What are key properties of diffusion models? (Select all that apply) They progressively transform noise into realistic images They use an iterative denoising process They require an explicit discriminator for training They are often computationally expensive to train

They progressively transform noise into realistic images They use an iterative denoising process They are often computationally expensive to train

Which of the following are characteristics of normalizing flow models? (Select all that apply) They use a sequence of invertible transformations to model data distributions They are based on adversarial training They can compute exact likelihoods of data They are always more efficient than VAEs

They use a sequence of invertible transformations to model data distributions, They can compute exact likelihoods of data

