Natural Language Processing Final Exam
What is a table that contains the frequency of each vocabulary word in a document called?
Document Vector Table
Vectorizing text steps
The steps include tokenization, removing stop words, stemming or lemmatizing, and vectorization.
Applications of POS Tagging:
Features in text modeling, autocomplete, word ambiguity resolution
True Statement:
A word with a high TF-IDF value has a high term frequency and a low document frequency.
Examples of stopwords:
There, in, such, for, and, it, a, an, the.
Regular expressions
They are a powerful and flexible method of specifying patterns. Once we have imported the re module, we can use re.findall() to find all substrings in a string that match a pattern.
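A minimal sketch of re.findall() in use. The sample sentence and both patterns are illustrative assumptions, not part of the exam material:

```python
import re

text = "The cat sat on the mat with another cat."  # illustrative sample text

# \b marks a word boundary; [a-z]at matches one lowercase letter followed by "at".
three_letter_at_words = re.findall(r"\b[a-z]at\b", text)

# \w+ matches each run of letters, digits, or underscores.
words = re.findall(r"\w+", text)

print(three_letter_at_words)
print(len(words))
```

re.findall() returns the non-overlapping matches as a list of strings, in the order they occur in the text.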
TF-IDF
This is a statistic based on how often a word occurs in a document, offset by how many documents in the corpus contain that word. It provides a numerical representation of how important a word is to a document.
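The arithmetic can be sketched with one common unsmoothed formulation: term frequency times log(N / document frequency). The toy corpus and the function names tf, idf, and tf_idf are assumptions for illustration; libraries often use smoothed variants that give different numbers.

```python
import math

# Illustrative toy corpus; each document is already tokenized.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing `term`).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    """Unsmoothed TF-IDF weight of `term` in `doc`."""
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", corpus[0], corpus))  # word present in every document
print(tf_idf("dog", corpus[1], corpus))  # word present in only one document
```

In this sketch a word that appears in every document gets a weight of zero, while a word confined to one document gets the highest weight.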
Word Mover's Distance WMD
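Word Mover's Distance (WMD) measures how dissimilar two documents are as the minimum cumulative distance that the embedded words of one document must travel to reach the embedded words of the other. It allows us to submit a query and return the most relevant documents.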
The process of automatically assigning parts-of-speech to words in text is called?
part-of-speech tagging, POS tagging, or just tagging.
What is spaCy good for?
spaCy is designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems or to pre-process text for deep learning.
What are the limits of the values for cosine similarity?
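-1 to 1 in general; 0 to 1 when the vectors have only non-negative components, such as word counts or TF-IDF weights.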
Regular Expression (RegEx or RegExp)
A sequence of characters that defines a search pattern. Usually, such patterns are used by string searching algorithms for "find" and "find and replace" operations on strings or for input validation.
What Are Word Embeddings?
A word embedding is a learned representation for text where words that have the same meaning have a similar representation. It is this approach to representing words and documents that may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.
Encoding techniques used during text processing:
Bag of Words, TF-IDF, Word2Vec
What model in NLP helps in extracting features out of the text and is helpful in ML algorithms?
Bag of words
RNNs and LSTM networks have applications in diverse fields, including:
Chatbots, Sequential pattern identification, Image/handwriting detection, Video and audio classification, Sentiment analysis, Time series modeling in finance.
Count Vectorizers:
Count Vectorizer is a way to convert a given set of strings into a frequency representation.
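A rough sketch of the idea using only the standard library. It assumes lowercasing and whitespace tokenization, which is simpler than what a library vectorizer does; the documents and the helper count_vector are illustrative.

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]  # illustrative strings

# Build a sorted vocabulary from whitespace-separated, lowercased tokens.
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})

def count_vector(doc):
    """Return the count of each vocabulary word in `doc`, in vocabulary order."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

vectors = [count_vector(doc) for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each row has one column per vocabulary word, so documents of different lengths become vectors of the same length.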
Products based on NLP?
Grammar checkers, voice assistants, chatbots
Which TF-IDF value exhibits that the word is important for one document, but it is not a common word for all documents?
High
LSTMs
In LSTMs, the rules that govern the information stored in the state (memory) are trained neural nets themselves—therein lies the magic. They can be trained to learn what to remember, while at the same time the rest of the recurrent net learns to predict the target label!
Language model
In an NLP context, a language model is a model of the probability distribution of word sequences.
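One of the simplest language models is a bigram model, which estimates the probability of each word given the previous word. The sketch below uses maximum-likelihood counts on a toy corpus; the corpus and the helper bigram_prob are assumptions for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict

tokens = "i like tea . i like coffee . i drink tea .".split()  # toy corpus

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for prev, word in zip(tokens, tokens[1:]):
    bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(bigram_prob("i", "like"))
print(bigram_prob("like", "tea"))
```

The probability of a whole sequence can then be approximated by multiplying the conditional probabilities of its consecutive word pairs.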
Cosine similarity
In data analysis, cosine similarity is a measure of similarity between two sequences of numbers. For defining it, the sequences are viewed as vectors in an inner product space, and the cosine similarity is defined as the cosine of the angle between them, that is, the dot product of the vectors divided by the product of their lengths.
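The definition translates directly into code. This is a minimal sketch that assumes both vectors have the same length and neither is all zeros; the function name and the sample vectors are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction
print(cosine_similarity([1, 0], [0, 1]))        # perpendicular
print(cosine_similarity([1, 0], [-1, 0]))       # opposite direction
```

Because only the angle matters, scaling a vector does not change the result. Vectors pointing in opposite directions give a negative value, which cannot happen when all components are non-negative.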
The Global Vectors for Word Representation, or GloVe, algorithm
It is an extension to the Word2Vec method for efficiently learning word vectors, developed by Pennington, et al. at Stanford.
LDA ________ a clustering algorithm in the strictest sense. Clustering algorithms typically produce one grouping per item being clustered whereas LDA produces a _________ distribution of groupings.
Is not, probabilistic
Natural Language Tool Kit (NLTK)
It can perform different operations such as tokenization, stemming, classification, parsing, tagging and semantic reasoning.
Which type of information is given by the "Bag of Words" algorithm?
It describes the occurrence of words within a document, or in other words, a vocabulary of known words.
Frequency distribution
It is a collection of items along with their frequency counts (e.g., the words of a text and their frequency of appearance).
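A frequency distribution over words can be sketched with collections.Counter from the standard library. The sample text is an illustrative assumption, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"  # illustrative
freq_dist = Counter(text.split())

# most_common(1) returns a list holding the single (item, count) pair with the top count.
most_common_word, count = freq_dist.most_common(1)[0]

print(most_common_word, count)
print(freq_dist["fox"])
```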
What is a Corpus?
It is a large and structured set of machine-readable texts that were produced in a natural communicative setting.
What is self-attention?
It is a mechanism relating different positions of a single sequence in order to compute a representation of the same sequence. Attention allows a Transformer to focus on parts of the input sequence while it predicts the output sequence.
Lemmatization
It is a process that maps the various forms of a word (such as appeared, appears) to the canonical or citation form of the word, also known as the lexeme or lemma (e.g., appear).
WordNet
It is a semantically oriented dictionary of English, consisting of synonym sets (synsets) organized into a network.
Number of words
It is the frequency of words in each document of the corpus.
What is Entity recognition?
It is the process used to classify entities found in a text into predefined categories, such as persons, objects, locations, organizations, dates, and events. A word vector, by contrast, refers to the mapping of words or phrases from a vocabulary to vectors of real numbers.
Tokenization
It is the segmentation of a text into basic units—or tokens—such as words and punctuation. Tokenization based on whitespace is inadequate for many applications because it bundles punctuation together with words. NLTK provides an off-the-shelf tokenizer nltk.word_tokenize().
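The limitation of whitespace splitting can be seen in a small comparison. The regular-expression tokenizer below is a simplified stand-in, not nltk.word_tokenize(); a real tokenizer handles contractions, abbreviations, and other cases more carefully. The sample sentence is illustrative.

```python
import re

sentence = "Hello, world! NLP is fun."  # illustrative sample

# Whitespace splitting leaves punctuation attached to the neighbouring word.
whitespace_tokens = sentence.split()

# \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

The first list contains "Hello," and "fun." as single tokens, while the second separates each punctuation mark into its own token.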
Which process groups together different forms of the same word?
Lemmatization
The process of grouping different forms of the same word together is known as:
Lemmatization
What is Machine Translation?
Machine translation is the task of automatically converting source text in one language to text in another language. In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language.
N-gram taggers can be defined for large values of n, but once n is larger than 3, we usually encounter the sparse data problem; even with a large quantity of training data, we see only a tiny fraction of possible contexts.
N-gram taggers
Which of the following is a package used for NLP in Python?
NLTK
What Is Natural Language Processing?
Natural language processing, in its simplest form, is the ability for a computer/system to truly understand human language and process it in the same way that a human does.
NMF (Non Negative Matrix Factorization)
Non-negative Matrix Factorization is a linear-algebraic model that factors high-dimensional vectors into a low-dimensional representation. Like principal component analysis (PCA), it reduces dimensionality, but NMF takes advantage of the fact that the vectors are non-negative. By factoring them into the lower-dimensional form, NMF forces the coefficients to also be non-negative.
Which application is not related to NLP?
Semantic Analysis
Which of the following steps is involved wherein the whole corpus is divided into sentences?
Sentence Segmentation
It is a subset of natural language processing and text analysis that detects positive or negative sentiments in a text.
Sentiment analysis
Sequence-to-sequence
Sequence-to-sequence networks can be built with a modular, reusable encoder-decoder architecture. The encoder model generates a thought vector, a dense, fixed-dimension vector representation of the information in a variable-length input sequence. A decoder can use thought vectors to predict (generate) output sequences, including the replies of a chatbot.
A technique used to extract the base forms of the words by removing affixes from them is called?
Stemming
Which technique is used to extract the base form of the words by removing affixes from them?
Stemming
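The idea of affix removal can be illustrated with a deliberately naive suffix stripper. This is not the Porter algorithm or any library stemmer; the suffix list, the minimum stem length of three characters, and the name naive_stem are all assumptions made for the sketch.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, deliberately tiny suffix list

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["jumping", "jumped", "jumps", "boxes", "sing"]:
    print(word, "->", naive_stem(word))
```

Because stemming only chops characters, the result is not guaranteed to be a dictionary word, which is the main difference from lemmatization.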
Which term is used to refer to the grammatical structure of a sentence?
Syntax
_____________ values help the computer understand which words are to be considered while processing natural language.
TFIDF
Which of the following names is given to a chatbot?
Talk bot, artificial conversational entity, chatterbot
Which term is used for the frequency of a word in one document?
Term Frequency
Latent Dirichlet Allocation
The idea behind latent Dirichlet allocation (LDA) is that documents are generated based on a set of topics. In this process, we assume that each document is a distribution over the topics, and each topic is a distribution over the terms. Each document, and each word, is generated by sampling from these distributions. The LDA learner works backward and tries to identify the distributions under which the observed documents are most probable.
BOW creates a set of ________ containing the count of word occurrences in the document that is easy to interpret.
Vectors
True about TextBlob
• It is a library for processing textual data • It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. • TextBlob is an open-source Python library for processing textual data
It is true about NLP:
• Natural language lacks mathematical precision. • Nuances of meaning make natural language understanding difficult. • A text's meaning can be influenced by its context and the reader's "world view."