Feature Extraction
what is the most popular word embedding used in practice?
Word2Vec
what is the basis for Word2Vec
a model that is able to predict a given word from its neighboring words, or to predict the neighboring words from a given word, is likely to capture the contextual meaning of words
What is one-hot encoding?
a representation of each word as its own class, using a one-hot vector. It's like bag of words, except we keep a single word in each bag
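a minimal sketch of one-hot encoding in Python (the vocabulary and example word are made up):

```python
# Build a one-hot vector for each word in a small, made-up vocabulary.
vocab = ["cat", "dog", "fish", "bird"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("dog"))  # [0, 1, 0, 0]
```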
what is a corpus?
a set of documents
what is each element of a document-term matrix called?
a term frequency
how would you represent a corpus using vectors?
as a document-term matrix: one row per document, one column per word, where each element holds that word's value (e.g. its term frequency) in that document
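a small sketch of building such a matrix from a toy, already-tokenized corpus (the documents are made up):

```python
from collections import Counter

# Toy corpus: each document is already tokenized and cleaned.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

# Vocabulary: one column per unique term across the corpus.
vocab = sorted({word for doc in corpus for word in doc})

# Each row is a document; each element is a term frequency.
doc_term_matrix = []
for doc in corpus:
    counts = Counter(doc)
    doc_term_matrix.append([counts[word] for word in vocab])

print(vocab)
print(doc_term_matrix)
```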
how do you calculate document frequency, (for TF-IDF)
calculate term frequency (the number of times a word appears in a document), count document frequency (the number of documents each word occurs in), then divide: term frequency / document frequency
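a sketch following the simple tf / df recipe on this card, using a made-up corpus; note that standard TF-IDF usually uses a log-scaled inverse document frequency rather than a plain ratio:

```python
from collections import Counter

# Toy tokenized corpus (illustrative).
corpus = [
    ["cost", "revenue", "cost", "profit"],
    ["cost", "team", "player", "goal"],
]

# Document frequency: number of documents each word occurs in.
df = Counter()
for doc in corpus:
    df.update(set(doc))

# Term frequency divided by document frequency, per document.
tf_idf = []
for doc in corpus:
    tf = Counter(doc)
    tf_idf.append({word: tf[word] / df[word] for word in tf})

print(tf_idf)
# "cost" appears in both documents, so its score is pulled down;
# words unique to one document keep their full term frequency.
```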
what is cosine similarity? how to calculate?
you can use it to determine how similar two documents are regardless of their lengths. To calculate: divide the dot product of the 2 document vectors by the product of their Euclidean norms (the magnitudes of the vectors). Thinking in vector spaces, this is the cosine of the angle between the vectors
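a minimal sketch of that calculation with NumPy (the document vectors are toy values):

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two toy document vectors (e.g. rows of a document-term matrix).
print(cosine_similarity([1, 2, 0, 1], [2, 4, 0, 2]))  # 1.0 (same direction)
print(cosine_similarity([1, 0, 0, 0], [0, 1, 0, 0]))  # 0.0 (no shared terms)
```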
to do a deeper analysis of text (other than document level), what do you need to do?
come up with a numerical representation for each word (e.g. one-hot encoding)
using a document-term matrix, how can you determine how similar documents are (how many words they have in common)?
compute the dot product between the 2 row vectors. The greater the dot product, the more similar the 2 documents are
what is a set of documents called
corpus (D)
what does GloVe give you?
it captures the similarities and differences between words
how does TF-IDF help with characterizing words?
it highlights words that are more unique to a document, and de-emphasizes words that are frequent across the corpus (such as "cost" in a corpus of financial documents)
what is a drawback of using dot product to check for document similarity? and what is a solution?
it only captures the overlapping portion of the documents; it does not account for terms the documents do not share. So a pair of very different documents can end up with the same dot product as a pair that is nearly identical. Solution: use cosine similarity
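a quick toy illustration of the pitfall: two identical documents and two very different documents can share the same dot product, while cosine similarity tells them apart (the vectors are made up):

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Identical documents vs. very different documents,
# yet both pairs have the same dot product (2).
identical_a, identical_b = [1, 1, 0, 0], [1, 1, 0, 0]
different_a, different_b = [2, 0, 5, 0], [1, 0, 0, 7]

print(np.dot(identical_a, identical_b), cosine_similarity(identical_a, identical_b))  # 2  1.0
print(np.dot(different_a, different_b), cosine_similarity(different_a, different_b))  # 2  ~0.05
```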
what is one limitation of the bag of words approach?
it treats every word as equally important
what are some properties of Word2Vec ?
robust, distributed representation; vector size independent of vocabulary size; train once, store in a lookup table; deep learning ready!
if 2 pairs of words have a similar difference in their meaning, how are they represented in the word embedding space?
they should be approximately equally separated in the embedding space. this can be used for things like: -finding synonyms -completing analogies -finding concepts from clustered words -identifying positive/negative words
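a hedged sketch of analogy arithmetic with vector offsets; the tiny 3-dimensional "embeddings" below are made-up stand-ins for vectors a real model would learn:

```python
import numpy as np

# Made-up 3-dimensional "embeddings"; real ones come from a trained model.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.2, 0.7]),
    "apple": np.array([0.1, 0.5, 0.4]),
}

def nearest(target, exclude):
    """Return the word whose vector is closest (by cosine) to the target."""
    best, best_sim = None, -2.0
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# king - man + woman should land near queen if the offsets are consistent.
target = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```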
some examples of a document used for bag of words?
a student essay, a single tweet
what is t-SNE
t-distributed Stochastic Neighbor Embedding great for visualizing embeddings
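a minimal sketch of visualizing embeddings with scikit-learn's TSNE; the random embedding matrix and word labels are placeholders for real learned vectors:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder: 50 random "word vectors" of dimension 100 stand in for real embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 100))
words = [f"word_{i}" for i in range(50)]

# Project to 2D; perplexity must be smaller than the number of points.
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y), fontsize=7)
plt.show()
```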
what is feature extraction?
taking cleaned text (tokenized/lemmatized/stripped) and extracting relevant features, the outstanding characteristics of the incoming information
what are representations that characterize a document as one unit good for?
the kinds of inferences you can make are typically at the document level, such as: document sentiment, document similarity, document topic mixture
What is a Bag-of-Words model?
treats each document as an unordered collection (or bag/set) of words
how does Glove work?
it tries to directly optimize the vector representation of each word, using only co-occurrence statistics (unlike word2vec, which sets up an ancillary prediction task)
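GloVe itself then fits word vectors to those statistics; below is only a sketch of the co-occurrence counting step, with a made-up corpus and window size:

```python
from collections import defaultdict

# Toy tokenized corpus (illustrative).
corpus = [["deep", "learning", "loves", "word", "vectors"],
          ["word", "vectors", "capture", "meaning"]]
window = 2  # neighbors on each side that count as "co-occurring"

cooccurrence = defaultdict(float)
for doc in corpus:
    for i, word in enumerate(doc):
        for j in range(max(0, i - window), min(len(doc), i + window + 1)):
            if i != j:
                cooccurrence[(word, doc[j])] += 1.0
                # Note: GloVe typically down-weights more distant pairs (e.g. by 1/distance).

print(cooccurrence[("word", "vectors")])  # counts summed across both documents
```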
what is a better representation for bag of words that deals with duplicate words?
turn each document into a vector of numbers (word counts), so repeated words are captured
how can you highlight important words occurring uniquely in each document?
using TF-IDF
when does one-hot encoding break down? why? how to solve?
when we have a large vocabulary to deal with: the size of the word representation grows with the number of words. Solve by limiting the word representation to a fixed-size vector, i.e. find an embedding for each word in some vector space that exhibits the desired properties
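a sketch of the fixed-size lookup-table idea: every word maps to a row of a (vocab size x embedding size) matrix, so the representation no longer grows with the vocabulary (the values here are random placeholders, not trained embeddings):

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

embed_dim = 4  # fixed size, independent of how large the vocabulary grows
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embed_dim))  # learned during training in practice

def embed(word):
    """Look up the fixed-size vector for a word."""
    return embedding_table[word_to_index[word]]

print(embed("cat"))  # a 4-dimensional vector, regardless of vocabulary size
```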
what is becoming the de facto choice for word representations? why?
word embeddings. why? the distributional hypothesis: words that occur in the same contexts tend to have similar meanings
why does word embedding work well?
word vectors for words that appear in similar contexts tend to end up close together
can you use transfer learning for word embeddings?
yes, you can reuse pre-trained word embeddings, much like pre-trained image models such as VGG16 or AlexNet
how does skip-gram word2vec work? (CBoW is similar)
you pick any word from a sentence, convert it to a one-hot vector, and feed it into a neural network trained to predict a few surrounding (context) words; the outputs of a hidden layer become the corresponding word vector
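a small sketch of the skip-gram setup, generating the (center word, context word) pairs the network is trained to predict; the sentence and window size are illustrative:

```python
# Generate (center, context) training pairs for skip-gram.
sentence = ["i", "love", "natural", "language", "processing"]
window = 2  # number of context words considered on each side

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if i != j:
            pairs.append((center, sentence[j]))

print(pairs[:4])
# In word2vec, the center word (as a one-hot vector) is fed to a small network
# trained to predict each context word; the hidden-layer weights become the embeddings.
```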
what are some document level models?
TF-IDF, Bag of Words
what is the range for cosine similarity?
from 1 (most similar) to -1 (most dissimilar)
what are the 2 flavors of word2vec models?
CBoW (Continuous Bag of Words) - predict the middle word from its context; Skip-gram - predict the context words given the middle word
what are some other approaches like word2Vec ?
GloVe (Global Vectors for Word Representation); word embeddings for deep learning