Feature Extraction

what is the most popular word embedding used in practice?

Word2Vec

what is the basis for Word2Vec?

a model that can predict a given word from its neighboring words, or predict the neighboring words from a given word, is likely to capture the contextual meaning of words

What is one-hot encoding?

a representation of each word as its own class, using a one-hot vector; it is like bag of words, except we keep a single word in each bag
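A minimal sketch of one-hot encoding over a toy vocabulary (the vocabulary and words below are made up for illustration):

```python
import numpy as np

# Hypothetical toy vocabulary; in practice it is built from the corpus.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))  # [0. 1. 0. 0. 0.]
```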

what is a corpus?

a set of documents

what is each element of a document-term matrix called?

a term frequency

how would you represent a corpus using vectors?

as a document-term matrix, where each row is a document and each column holds a value (such as a count) for one word
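One way to build such a matrix, sketched with scikit-learn's CountVectorizer on a made-up two-document corpus (assumes scikit-learn is available):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: each string is one document.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)      # rows = documents, columns = terms
print(vectorizer.get_feature_names_out())   # vocabulary used as column labels
print(dtm.toarray())                        # term counts per document
```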

how do you calculate document frequency (for TF-IDF)?

compute the term frequency (the number of times a word appears in a document), count the document frequency (the number of documents in which the word occurs), then divide term frequency by document frequency (see the sketch below)
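A small numpy sketch of that calculation on a made-up count matrix; the plain tf / df division follows the card above, while most libraries instead use tf * idf with idf = log(N / df):

```python
import numpy as np

# Document-term matrix of raw counts (rows = documents, columns = terms); made-up numbers.
dtm = np.array([
    [2, 1, 0],
    [1, 0, 3],
    [0, 1, 1],
])

tf = dtm                                # term frequency within each document
df = np.count_nonzero(dtm, axis=0)      # number of documents containing each term
tfidf = tf / df                         # the simplified tf / df weighting described above

# A more standard variant: tf * idf, with idf = log(N / df)
idf = np.log(dtm.shape[0] / df)
print(tf * idf)
```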

what is cosine similarity? how to calculate?

a measure of how similar two documents are: divide the dot product of the two document vectors by the product of their Euclidean norms (the magnitudes of the vectors). Thinking in terms of vector spaces, this is the cosine of the angle between the vectors (see the sketch below)
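A minimal numpy sketch with two made-up document vectors:

```python
import numpy as np

# Two document vectors (e.g., rows of a document-term matrix); values are illustrative.
a = np.array([2.0, 1.0, 0.0, 3.0])
b = np.array([1.0, 0.0, 1.0, 2.0])

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # cosine of the angle between the vectors, in [-1, 1]
```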

to do a deeper analysis of text (other than document level), what do you need to do?

come up with a numerical representation for each word (e.g., one-hot encoding)

using a document-term matrix, how can you determine how similar documents are (how many words they have in common)?

compute the dot product between the two row vectors; the greater the dot product, the more similar the two documents are (see the example below)
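For example, with two made-up rows of a document-term matrix:

```python
import numpy as np

# Two rows of a document-term matrix (term counts per document).
doc_a = np.array([2, 1, 0, 3])
doc_b = np.array([1, 0, 1, 2])

print(np.dot(doc_a, doc_b))  # 2*1 + 1*0 + 0*1 + 3*2 = 8; larger means more overlap
```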

what is a set of documents called

corpus (D)

what does GloVe give you?

it captures the similarities and differences between words

how does TF-IDF help with characterizing words?

it highlights words that are more unique to a document and de-emphasizes words that are frequent across the corpus (such as "cost" in a collection of financial documents)

what is a drawback of using dot product to check for document similarity? and what is a solution?

it only captures the portion of the documents that overlaps; it does not account for terms the documents do not share, so pairs that are very different can end up with the same dot product as pairs that are nearly identical. Solution: use cosine similarity

what is one limitation of the bag of words approach?

it treats every word as equally important

what are some properties of Word2Vec?

robust, distributed representation; vector size is independent of vocabulary size; train once, then store in a lookup table; deep learning ready!

if 2 pairs of words have a similar difference in their meaning, how are they represented in the word embedding space?

they should be approximately equally separated in the embedding space. This can be used for things like: finding synonyms, analogies, concepts of clustered words, positive/negative words

some examples of a document used for bag of words?

a student essay, a single tweet

what is t-SNE

t-distributed Stochastic Neighbor Embedding; great for visualizing embeddings (see the sketch below)
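A sketch of projecting embeddings down to 2-D with scikit-learn's TSNE (random vectors stand in for real word embeddings here):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in "embeddings": 100 words with 50-dimensional random vectors.
embeddings = np.random.rand(100, 50)

# Reduce to 2 dimensions for plotting; perplexity must be smaller than the sample count.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
points_2d = tsne.fit_transform(embeddings)
print(points_2d.shape)  # (100, 2)
```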

what is feature extraction?

taking cleaned text (tokenized, lemmatized, stripped) and locating the relevant features: the outstanding characteristics of the incoming information

what are representations that characterize a document as one unit good for?

the kinds of inferences you can make are typically at the document level, such as: document sentiment, document similarity, document topic mixture

What is a Bag-of-Words model?

treats each document as an unordered collection (or bag/set) of words

how does GloVe work?

it tries to directly optimize the vector representation of each word, using only co-occurrence statistics (unlike Word2Vec, which sets up an ancillary prediction task)

what is a better representation for bag of words that deals with duplicate words?

turn each document into a vector of numbers (e.g., word counts)

how can you highlight important words occurring uniquely in each document?

using TF-IDF

when does one-hot encoding break down? why? how to solve?

when we have a large vocabulary, because the size of the word representation grows with the number of words. Solve by limiting the word representation to a fixed-size vector, i.e., find an embedding for each word in some vector space that exhibits the desired properties

what is becoming the de facto choice for word representations? why?

word embeddings. Why? The distributional hypothesis: words that occur in the same contexts tend to have similar meanings

why does word embedding work well?

word vectors that have similar contexts tend to end up close together

can you use transfer learning for word embeddings?

yes; learned word embeddings can be reused, much like pre-trained image models such as VGG-16 or AlexNet (see the sketch below)
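A sketch of that idea using gensim's downloader (assumes the gensim package is installed and the named pre-trained model can be downloaded):

```python
import gensim.downloader as api

# Load pre-trained 50-dimensional GloVe vectors instead of training from scratch.
model = api.load("glove-wiki-gigaword-50")

vector = model["king"]                     # reuse the learned vector directly
print(model.most_similar("king", topn=3))  # nearest neighbors in the embedding space
```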

how does skip-gram Word2Vec work? (CBoW is similar)

pick any word from a sentence, convert it to a one-hot vector, and feed it into a neural network trained to predict a few surrounding (context) words; the outputs of a hidden layer become the corresponding word vector (a rough sketch follows)
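A rough numpy sketch of that training loop on a toy sentence; the corpus and all sizes are made up, and real Word2Vec adds optimizations (negative sampling, subsampling) omitted here:

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word_to_index = {w: i for i, w in enumerate(vocab)}
V, H, window, lr = len(vocab), 10, 2, 0.05

# Build (center, context) pairs from a sliding window over the sentence.
pairs = []
for i in range(len(corpus)):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            pairs.append((word_to_index[corpus[i]], word_to_index[corpus[j]]))

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, H))    # input -> hidden weights (the word vectors)
W_out = rng.normal(scale=0.1, size=(H, V))   # hidden -> output weights

for epoch in range(50):
    for center, context in pairs:
        h = W_in[center]                      # hidden layer: one-hot input just selects a row
        scores = h @ W_out
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                  # softmax over the vocabulary
        grad = probs.copy()
        grad[context] -= 1.0                  # cross-entropy gradient w.r.t. the scores
        grad_h = W_out @ grad                 # gradient flowing back to the hidden layer
        W_out -= lr * np.outer(h, grad)
        W_in[center] -= lr * grad_h

word_vectors = W_in                           # each row is the learned vector for one word
print(word_vectors[word_to_index["fox"]])
```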

what are some document level models?

TF-IDF, Bag of Words

what is the range for cosine similarity?

from -1 (most dissimilar) to 1 (most similar)

what are the 2 flavors of word2vec models?

CBoW (Continuous Bag of Words) - predicts the middle word from the surrounding words; Skip-gram - predicts the surrounding words given the middle word

what are some other approaches like Word2Vec?

GloVe (Global Vectors for Word Representation), word embeddings for deep learning

