NLP


What is a Dirichlet distribution and how many are used in LDA?

'Distribution of distributions' - it is the distribution of topics in documents and the distribution of words in topics. The more topics, the more dimensions in the distribution -> with three topics, imagine a triangle. LDA uses two: 1. one for the distribution of topics over documents 2. one for the distribution of words over topics
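
As a quick illustration (a minimal sketch, assuming numpy), sampling a three-dimensional Dirichlet gives a point on that triangle: a topic mixture whose components are non-negative and sum to 1.

import numpy as np

# Three topics -> each sample is a point on a triangle (2-simplex).
# alpha < 1 pushes samples toward the corners, i.e. documents
# dominated by a single topic.
alpha = [0.1, 0.1, 0.1]
topic_dist = np.random.dirichlet(alpha)
print(topic_dist, topic_dist.sum())  # components sum to 1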

What tasks can you use language technology for?

- Spam detection - POS tagging - Named Entity Recognition - Machine translation - Information extraction - Sentiment analysis - Paraphrasing - Q&A

Issues in tokenization

1) Apostrophes - Finland's / Finlands / Finland 2) One token or two? - San Francisco 3) Abbreviations - Ph.D. 4) Language-specific barriers - Chinese does not put spaces between words

What are the three types of text normalization that are needed for NLP tasks?

1) segmenting/tokenizing words in running text 2) normalizing word formats 3) segmenting sentences in running text

What are the properties of a neuron in a neural network?

1) weight, 2) bias, 3) activation function. Weight: an individual weight for each input to the neuron (which inputs are the more important ones?). Bias: one per neuron in the hidden layer; it encodes the negative of the threshold you want the neuron's weighted sum to exceed, e.g. if you only want values above 10 to activate, add -10 before the activation function. Activation function: transforms the neuron's weighted sum and passes the result on to the next neurons.
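
A minimal sketch of a single neuron (assuming numpy and a sigmoid activation; the values are illustrative):

import numpy as np

def neuron(x, w, b):
    z = np.dot(w, x) + b          # weighted sum of inputs plus bias
    return 1 / (1 + np.exp(-z))   # sigmoid activation function

x = np.array([0.5, -1.2, 3.0])    # inputs from the previous layer
w = np.array([0.4, 0.1, -0.7])    # one weight per input
b = -1.0                          # bias shifts the firing threshold
print(neuron(x, w, b))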

What is the process for a basic crawler operation?

1. Begin with known "seed" URLs / seed pages 2. Fetch and parse them; extract the URLs they point to and place those on a queue 3. Fetch each URL on the queue and repeat
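
A toy sketch of that loop in Python (standard library only; a regex stands in for a real HTML parser, and politeness, filtering, and robots.txt handling are all omitted):

import re
from collections import deque
from urllib.request import urlopen

def crawl(seed_urls, max_pages=10):
    queue, seen = deque(seed_urls), set(seed_urls)
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                       # skip unreachable pages
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)         # enqueue newly found URLs
    return seen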

What are the five steps in Top2Vec?

1. Doc2Vec -> convert each document and word to a vector 2. Dimensionality reduction 3. Clustering of documents - find the dense clusters (a cluster = a topic) 4. Find the centroid of each cluster; this will be the topic vector 5. Find the n closest word vectors to the topic vector
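
A hedged sketch using the third-party top2vec package, which runs all five steps internally (API as documented upstream; a real corpus needs to be much larger than this placeholder):

from top2vec import Top2Vec  # pip install top2vec

docs = ["document one text ...", "document two text ..."]  # your corpus
model = Top2Vec(documents=docs)                 # steps 1-5 happen here
topic_words, word_scores, topic_nums = model.get_topics()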

Name three of the main hyperparameters in LDA

1. Document density factor (alpha) -> Controls the number of topics expected in the document 2. Topic word density factor (beta) -> controls the distribution of words per topic in the document 3. Number of selected topics (K)
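
A minimal sketch of setting these three hyperparameters with gensim, where the eta argument plays the role of beta (toy corpus assumed):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["cat", "dog", "pet"], ["stock", "market", "trade"]]  # tokenized docs
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# num_topics = K, alpha = document density, eta = topic word density (beta)
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha=0.1, eta=0.01)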

What neural networks exist? Which one is best for NLP? Why?

1. Feedforward NN 2. RNN (recurrent NN) 3. CNN (convolutional NN). RNNs are generally the best fit for NLP, because they handle variable-length word sequences and carry context from earlier words forward through their recurrent state.

What are the two trade off goals in LDA?

1. For each document, allocate its words to as few topics as possible 2. For each topic, assign a high probability to as few terms as possible

What kinds of ambiguity exist?

1. Lexical ambiguity (a word has several meanings, e.g. 'bank') 2. Structural ambiguity (a sentence has several possible parses, e.g. 'I saw the man with the telescope')

Issues when working with NLP?

1. Non-standard English 2. Idioms / sayings 3. Issues with tokenization 4. Neologisms (bromance, retweet)

What are the complications with web crawling?

1. Not feasible with one machine 2. Malicious pages (spam) 3. Bandwidth to remote servers varies 4. How deep should you crawl the URL hierarchy?

What are the two main considerations for URL frontier?

1. Politeness: do not hit a web server too frequently 2. Priority/freshness: crawl some pages more often than others

What are the steps in vector space ranking?

1. Represent the query as a weighted tf-idf vector 2. Represent each document as a weighted tf-idf vector 3. Compute the cosine similarity score for the query vector and each document vector 4. Rank documents with respect to the query by score 5. Return top K documents
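
A minimal sketch of these five steps with scikit-learn (toy documents assumed):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat", "dogs chase cats", "stock markets fell"]
vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(docs)           # step 2
query_vec = vectorizer.transform(["cat"])           # step 1
scores = cosine_similarity(query_vec, doc_vecs)[0]  # step 3
top_k = scores.argsort()[::-1][:2]                  # steps 4-5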

What are the problems with LDA?

1. relies on bag-of-words, which ignores the ordering and semantics of words 2. requires a predefined number of topics 3. relies on lots of arbitrary preprocessing choices such as stemming, lemmatization, and custom stop-word lists

What is topic modeling?

> Unsupervised ML > Statistical modeling for discovering the topics that occur in a collection of documents > Used to organize, understand, search, and summarize electronic archives

What are hash values?

A hash function is any function that can be used to map data of arbitrary size to fixed-size values.

What is a lexicon?

A list of words that correspond to some meaning or class

What is cross-entropy?

A loss function that measures the performance of a classification model; the loss is non-negative, and a perfect model would have a loss of 0.
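
A minimal numpy sketch for a single prediction (one-hot label assumed):

import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))

# true class is the second one; the model assigns it probability 0.8
print(cross_entropy(np.array([0, 1, 0]), np.array([0.1, 0.8, 0.1])))  # ~0.22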

What is Word2Vec?

A method of constructing word embeddings, which can be trained through (1) skip-gram, and (2) Continuous Bag of Words (CBOW).
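
A minimal sketch with gensim, where the sg flag selects skip-gram (1) or CBOW (0) (toy corpus assumed):

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]  # tokenized corpus
skipgram = Word2Vec(sentences, sg=1, vector_size=50, window=2, min_count=1)
cbow = Word2Vec(sentences, sg=0, vector_size=50, window=2, min_count=1)
vec = skipgram.wv["cat"]   # 50-dimensional embedding for "cat"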

What is stemming?

A method to standardize words by removing suffixes.

What is Weighted Minimum Edit Distance?

A minimum edit distance where weights are added to the computation, motivated by (1) spell correction - some letters are more often mistyped than others, and (2) biology - some insertions/deletions are more likely than others.

What is a bag-of-words model? And its steps

A model that turns text into fixed-length vectors by counting how many times each word appears. 1. Define the vocabulary, i.e. the set of all unique words in the documents 2. Count how many times each word appears in each document. EX: Vocab = {the, cat, is, nice, cute} A. 'the cat is nice' = [1,1,1,1,0] B. 'the cat is cute' = [1,1,1,0,1]. Note: BoW models lose contextual information, since they only count words, not where or why they appear in the sentence.
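
The same example with scikit-learn's CountVectorizer (note that the column order follows the alphabetically sorted fitted vocabulary, not the order written above):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat is nice", "the cat is cute"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # ['cat' 'cute' 'is' 'nice' 'the']
print(X.toarray())                         # one count vector per document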

What is multinomial logistic regression?

A classification problem with more than two classes, e.g. sentiment classification with positive/negative/neutral. In such cases the probabilities must still add up to 1, so: P(pos) + P(neg) + P(neu) = 1 > Then, instead of the sigmoid function used for two classes, one must use the softmax function, which outputs a vector of probabilities
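
A minimal numpy sketch of softmax (made-up scores assumed):

import numpy as np

def softmax(z):
    z = z - z.max()        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # model scores for pos/neg/neu
print(softmax(scores))              # probabilities that sum to 1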

What is dynamic programming with respect to Minimum Edit Distance?

A tabular computation of the distance: the distance between two strings of lengths n and m is built up from the already-computed distances between their shorter prefixes.

What is Aspect-based sentiment analysis?

A text analysis technique that categorizes data by aspect and identifies the sentiment attributed to each one. Aspect = Category, feature, topic that is being talked about Sentiment = pos/neg opinions about a particular aspect EX: Analysing feedback by associating specific sentiments to different aspects of a product or service. "Great food but the service was bad" > Food was great + service was bad

What is information retrieval (IR) based question answering?

A three-stage procedure: 1. question processing -> detect question type, detect answer type, and formulate the queries to send to a search engine 2. passage retrieval -> retrieve a set of documents ranked by their relevance to the query 3. answer processing -> extract and rank candidate answers

What is perplexity?

A way of evaluating language models - it is a measure of how well a probabilistic model predicts a sample. The lower the perplexity, the higher the probability the model assigns to the sample.
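
A minimal sketch (assuming numpy and made-up per-word probabilities):

import numpy as np

# probabilities a model assigned to each word of a 4-word test sentence
word_probs = np.array([0.2, 0.1, 0.4, 0.25])
perplexity = np.exp(-np.mean(np.log(word_probs)))  # = P(W) ** (-1/N)
print(perplexity)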

What is Laplace smoothing?

A way of solving the problem of assigning zero probability to unseen events: add 1 to all counts, thereby avoiding any zeros in the counts.
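
A minimal add-one sketch for unigram probabilities (toy counts assumed):

from collections import Counter

counts = Counter(["the", "cat", "the"])  # corpus counts
vocab = {"the", "cat", "dog"}            # "dog" is unseen
N, V = sum(counts.values()), len(vocab)

def p_laplace(w):
    return (counts[w] + 1) / (N + V)     # add 1 to every count

print(p_laplace("dog"))  # 1/6: non-zero despite a zero count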

What is interpolation in N-grams?

A way of solving the problem of zero frequency n-grams: mix probability estimated from all the n-gram estimators, weighting and combining the trigram, bigram, and unigram counts.

What is backoff in N-grams?

A way of solving the problem of zero frequency n-grams: use the higher-order n-gram if the evidence is sufficient; if not, back off to bigrams, and otherwise use unigrams.

What is the chain rule of probability?

A way to compute the probability of an entire sequence (the joint probability) from conditional probabilities. P('I am fine thanks') = P(I) x P(am | I) x P(fine | I am) x P(thanks | I am fine).

What can word2vec be used for? What prediction method is used?

Applications: broad, such as translation, sentiment analysis, and QA. Prediction method: skip-gram (is word w likely to show up near word x?)

Why is euclidean distance a bad idea when comparing document similarity?

Because it will be large for vectors of different lengths, even when their term distributions are similar. > Use the angle instead; an angle of 0 means maximal similarity

What is a bigram model? And a trigram model?

Bigram: Calculates the probability of a word given all the previous words by only using the conditional probability of one preceding word. Trigram: Calculates the probability of a word given all the previous words by only using the conditional probability of two preceding words.

What is a generative classifier?

Classifies by finding similarities/main features of the class. Example: Naive Bayes

What is a discriminative classifier?

Classifies by finding the features that are most useful to distinguish between classes. Example: Logistic regression

What is Lemmatization?

Converting a word into its base form (lemma)

How to compute similarity?

Create n-grams (shingles), then hash all n-grams in the two documents (it is much cheaper to compare and permute hash values than to compare n-grams directly)

What is the difference between data mining and text mining?

Data mining is about analyzing large datasets to identify meaningful patterns, whereas text mining is about finding meaningful insights from unstructured text data by making it structured

What should any crawler do?

Distributed operation: designed to run on multiple distributed machines Scalable: increase the crawl rate by adding more machines (e.g. Twitter, which imposed rate restrictions) Performance/efficiency: permit full use of available processing and network resources Quality: fetch pages of higher quality first Continuous operation: continue fetching fresh copies of previously fetched pages Extensible: adapt to new data formats and protocols

What are the three assumptions behind LDA?

Distributional hypothesis = similar topics make use of similar words Statistical mixture hypothesis = documents talk about several topics, for which a statistical distribution can be determined Dirichlet distribution = the distribution of topics in a document and the distribution of words in topics are Dirichlet distributions

What is explicit and implicit politeness?

Explicit politeness: specifications from webmasters on what portions of site can be crawled Implicit politeness: Even with no specification, avoid hitting any site too often

What is extrinsic evaluation and Intrinsic evaluation of models?

Extrinsic evaluation is the best way to evaluate. It is an end-to-end evaluation: it applies each model to the given task and compares their accuracy, thus getting the full picture of which model performs best. Problem: time-consuming. Intrinsic evaluation uses training and test sets and compares the models based on their performance on the test set. Problem: could result in a bad performance approximation unless the test data looks like the training data.

What are filters and robots.txt?

Filters: regular expressions for URLs to be crawled / not crawled Once a robots.txt file is fetched from a site, do not fetch it repeatedly (doing so burns bandwidth) Cache robots.txt files

"Word tokens" meaning

How many instances of a given type are present in the text

"Word types" meaning

How many unique words

What is PageRank?

The long-run rate at which a node/doc will be visited by a random walk over the web graph Uses teleporting to avoid dead ends and getting stuck
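
A minimal power-iteration sketch with teleporting (assuming numpy and a made-up 3-page link graph):

import numpy as np

def pagerank(adj, damping=0.85, iters=50):
    n = adj.shape[0]
    out = adj.sum(axis=0)
    out[out == 0] = 1                  # crude dead-end handling
    M = adj / out                      # column-stochastic transition matrix
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        # with probability (1 - damping), teleport to a random page
        rank = (1 - damping) / n + damping * M @ rank
    return rank

# adj[i, j] = 1 means page j links to page i
adj = np.array([[0, 0, 1], [1, 0, 0], [1, 1, 0]], dtype=float)
print(pagerank(adj))  # long-run visit rates; sums to 1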

What is HITS? What are the two main components?

Hyperlink-Induced Topic Search - an algorithm used in link analysis to discover and rank webpages according to relevance. Two components: 1. Given a query to a search engine, the set of highly relevant web pages are called Roots. They are potential Authorities. 2. Pages that are not very relevant but point to pages in the Root are called Hubs. Thus, an Authority is a page that many Hubs link to, whereas a Hub is a page that links to many Authorities.

Is perplexity extrinsic or intrinsic evaluation?

Intrinsic

What is a language model?

Language modelling is the use of statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. We use it in - audio-to-text conversion - speech recognition - sentiment analysis - spell correction

What is Latent, Dirichlet, and Allocation in LDA?

Latent = hidden. The topics are hidden. Dirichlet = the model assumes that the topics in the documents and the words in those topics follow a Dirichlet distribution Allocation = giving something, which in this case means allocating topics to documents and words to topics

What is the Maximum Likelihood Estimate (MLE) in n-grams?

MLE is used to estimate the n-gram probabilities. To compute the probability of y given the word x, count the number of occurrences of "x y" and normalize by the total number of times x appears in the text. >> Out of all the times x appears, y follows X% of the time.
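
A minimal bigram MLE sketch (toy corpus assumed):

from collections import Counter

tokens = "i am fine i am happy i am fine".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

# P(y | x) = count(x y) / count(x)
print(bigrams[("am", "fine")] / unigrams["am"])  # 2/3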

What is the rule of thumb for number of back queues?

Mercator recommends three times as many back queues as crawler threads

What is N-gram?

N-gram means a sequence of N words where we compute P(w|h) being the probability of a word w given some history h.

In LDA topic model, what is the one observed variable, and the three latent/hidden variables?

Observed - word distribution per document (the words themselves) Hidden - word distribution per topic in the corpus - topic distribution per document - topic assignment per word in a document

Difference between PageRank and HITS

PageRank is considered more efficient and is query-independent HITS is query-dependent and better at dealing with topic-specific domains

How can you evaluate an information retrieval system?

Precision: fraction of retrieved docs that are relevant Recall: fraction of relevant docs that are retrieved F measure: a combined measure (weighted harmonic mean) of precision and recall

What are factoid questions?

Questions that have a short answer

What is tokenization?

Splitting a text/sentence/phrase into smaller units, also called tokens, e.g. into words or letters

What is the task of text classification about?

Determining which class a given text belongs to, based on some features of the text

What is special about Levenshtein distance (a minimum edit distance operation)?

That a substitution costs 2, whereas in the 'normal' edit distance insertion, deletion, and substitution only costs 1.

What is the Markov Assumption?

The assumption that the probability of a word depends only on the previous few words (in a bigram model, only on the single previous word).

What is probabilistic language modeling?

The goal is to compute the probability of a sentence or the probability of an upcoming word

Lemmatization vs. Stemming

The goal of both is to reduce a word to its base form. However, stemming only chops off the end, whereas lemmatization considers the context.

What is Minimum Edit Distance?

The minimum number of editing operations for converting one string into another, using insertion, deletion, and substitution. Each operation costs 1, but in Levenshtein distance a substitution costs 2.
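
A minimal dynamic-programming sketch; sub_cost=2 gives the Levenshtein variant described above (the textbook 'intention' -> 'execution' example):

def min_edit_distance(s, t, sub_cost=1):
    # dp[i][j] = distance between the prefixes s[:i] and t[:j]
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else sub_cost
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[m][n]

print(min_edit_distance("intention", "execution", sub_cost=2))  # 8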

What is Porter's algorithm?

The most common algorithm for stemming English; it consists of 5 phases of word reductions, applied sequentially.

What is lexical semantics?

The study of word meanings and their relations to other words.

What is information retrieval (IR)?

Finding material (usually documents) of an unstructured nature that satisfies an information need within large collections Basic assumptions: Collection: a set of documents Goal: retrieve documents with information that is relevant to the information needed by the user

What does PageRank not take into account?

Topics or different types of webpages (addressed by HITS or topic-specific PageRank)

What happens in the training and testing of an LDA model?

Training = Learn the latent/hidden variables on a collection of documents (training set) Test = Predict the topic distribution of an unseen document

What are ergodic Markov chains?

Used to compute the long-term visit rate of each page, based on the transition probabilities; an ergodic chain converges to a steady-state probability distribution

What is word embedding?

Vector representations of a particular word

What is the main idea and the goal behind LDA?

That each document can be described by a distribution of topics and each topic can be described by a distribution of words. Goal: to determine the mixture of topics that a document contains

What is a word lemma?

Words with the same stem / same word meaning. EX: cat and cats are the same lemma

How can you test performance, if you are not using a train/test set?

You can use extrinsic evaluation and apply the models to the actual task to see how well they perform; e.g. for spelling correction, you could apply two models to the task and see which model corrects the most misspelled words.

What does 'latent' mean in LDA?

Latent means hidden: the topics are not observed; only the words and the documents are, thus the topics are latent

Why is text normalization important?

So every text unit has the same form, e.g. USA vs. U.S.A., and will therefore be characterized as similar

What is a perceptron?

A single artificial neuron: it computes a weighted sum of its inputs plus a bias and passes the result through a threshold (step) activation function. It is the simplest trainable building block of a neural network.
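
A minimal sketch with the classic perceptron update rule (assuming numpy; here it learns logical AND):

import numpy as np

class Perceptron:
    def __init__(self, n_inputs, lr=0.1):
        self.w = np.zeros(n_inputs)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return 1 if np.dot(self.w, x) + self.b > 0 else 0  # step activation

    def fit(self, X, y, epochs=10):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                err = yi - self.predict(xi)
                self.w += self.lr * err * xi   # update only on mistakes
                self.b += self.lr * err

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
p = Perceptron(2)
p.fit(X, np.array([0, 0, 0, 1]))   # AND labels
print([p.predict(x) for x in X])   # [0, 0, 0, 1]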

