Text mining


What are the limitations of RNN and how do you tackle them?

Issues: exploding gradients (tackled with gradient clipping, sketched below) and vanishing gradients (tackled with better activation functions, careful weight initialization, and gated network architectures; LSTM and GRU handle long sequences better than a vanilla RNN).
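A minimal sketch, assuming PyTorch and an arbitrary toy LSTM, of how gradient clipping is typically applied between the backward pass and the optimizer step (the clipping threshold of 1.0 is just an illustrative choice):

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=50, hidden_size=64, batch_first=True)  # toy LSTM
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 20, 50)        # batch of 8 sequences, 20 steps, 50 features
target = torch.randn(8, 20, 64)   # dummy target just to form a loss

output, _ = model(x)
loss = nn.functional.mse_loss(output, target)
loss.backward()

# Rescale gradients so their global norm does not exceed 1.0 before updating.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```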

How does the distributional hypothesis motivate the way deep learning is used for creating word embeddings?

"you can tell a word by the company it keeps" We assume that words that occur in the same context are similar in meaning. Word embeddings are based on the fact that distributed representations try to comprehend a word's meaning by considering the company it keeps (context words).

Difference between Glove and Elmo?

Both GloVe and ELMo are pretrained on an unsupervised task over a large body of text. A key difference is that GloVe gives one static vector per word, which we simply reuse as features for a downstream task. With ELMo the pretrained model itself becomes a component of the downstream task and produces contextual vectors, so the same word can get different representations in different sentences.

Describe the structure of dialogs in terms of actions and goals.

A dialog is a conversational information exchange between two or more people, or between a human and a machine.
Task-oriented systems (Alexa, Google Assistant) are structured towards completing a task: you can clearly see the structure of the dialog, you ask something, the system hopefully replies with the right information, and you can ask follow-up questions.
Chatbots are unstructured and mimic open human conversation: whatever you ask, they reply, whether or not it is related to the conversation, and without a goal to reach.
Speech can be seen as actions: every sentence we utter is an action used to achieve a goal, such as asking or answering.
- Constatives: answering, claiming, confirming
- Directives: asking, advising, forbidding
- Commissives: promising, planning, opposing
- Acknowledgements: thanking, accepting, greeting
Difficulties when designing a dialog system:
- Common ground: speaker and listener must agree on a purpose.
- A speech act does not always follow the same structure, so the structure can be hard to find.
- Mixed initiative: who controls the dialog. A system that only answers questions is simple; it is harder to build a system that both answers and asks questions.
- Inference by implicature: humans give implicit answers, which are hard for a computer to capture.

Describe a language model, and explain two methods one can use to evaluate it (one extrinsic, and one intrinsic)

A language model is a statistical model that assigns probabilities to words and sentences. Typically we try to guess the next word w in a sentence given all previous words, often referred to as the "history" (BERT is a language model).
Extrinsic evaluation: evaluate the model by employing it in an actual task (such as machine translation) and looking at the final loss/accuracy. This is the best option, as it is the only way to see tangibly how different models affect the task we care about; however, it can be computationally expensive and slow, since it requires training a full system.
Intrinsic evaluation: find a metric that evaluates the language model itself, without taking the specific downstream task into account. While intrinsic evaluation is not as "good" as extrinsic evaluation as a final metric, it is a useful way of quickly comparing models. Perplexity is an intrinsic method: a measurement of how well a probability model predicts a sample, based on the negative log of the probability assigned to the test set. Better language models have lower perplexity, i.e. they assign higher probability to the test set. Perplexity is fast to calculate, which lets researchers weed out models unlikely to perform well in expensive, time-consuming real-world testing; but it is not good as a final evaluation, since it measures the model's confidence rather than its accuracy, and it is hard to make apples-to-apples comparisons across datasets with different context lengths, vocabulary sizes, or word- vs. character-based models.
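A minimal sketch of the perplexity computation, assuming a made-up list of per-token probabilities assigned by some language model to a test sequence:

```python
import math

token_probs = [0.2, 0.1, 0.05, 0.3]      # P(w_i | history) for each test token (invented values)
n = len(token_probs)
avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
perplexity = math.exp(avg_neg_log_prob)   # lower is better
print(perplexity)
```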

What is a virtual assistant and explain briefly how does it work?

A virtual assistant uses natural language processing (NLP) to match user text or voice input to executable commands; it relies on NLP and NLU to respond to queries, including complicated ones. First you need a device to capture your voice. The system then processes the speech and translates it into text, often using an HMM: it listens to the speech, finds phonemes (the smallest units of speech), compares them with pre-recorded ones, and uses mathematical models to turn what you said into text that the NLP system can process. The system then tries to understand the utterance by breaking each word down into its part of speech, identifies the important words, and executes the action corresponding to them. If you say "weather", it will open a weather app.

What is POS tagging, explain how you can design an LSTM to do POS-Tagging?

A bidirectional LSTM can be used for POS tagging. How to use it:
1. Use a tagged corpus; for Swedish, SUC (the Stockholm-Umeå Corpus).
2. Create input vectors, i.e. embeddings (e.g. word2vec, with min_count set so that even a word that occurs only once is still encoded in the model).
3. Build the neural network, as sketched below.
4. Make predictions, backpropagate, and train.
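A minimal sketch of such a network in PyTorch; the vocabulary size, tag set size, and dimensions are placeholder assumptions, not the SUC setup:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, hidden_dim=128, num_tags=17):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)   # 2x because bidirectional

    def forward(self, token_ids):                         # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))           # (batch, seq_len, 2*hidden)
        return self.out(h)                                # per-token tag scores

tagger = BiLSTMTagger()
scores = tagger(torch.randint(0, 5000, (2, 10)))          # 2 dummy sentences of 10 tokens
loss = nn.functional.cross_entropy(scores.view(-1, 17), torch.randint(0, 17, (20,)))
loss.backward()                                            # then optimizer.step() as usual
```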

What is BOW?

Bag of words. The bag-of-words (BOW) model is a representation that turns arbitrary text into fixed-length vectors by counting how many times each word appears. It is called a "bag" of words because any information about the order or structure of the words in the document is discarded; the model is only concerned with whether known words occur in the document, not where. How? BOW consists of a vocabulary of known words and a measure of the presence of those words: 1. build a vocabulary, 2. count how many times each word appears, 3. turn the counts into vectors, as in the sketch below.
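A minimal sketch of a bag-of-words representation with scikit-learn's CountVectorizer; the toy documents are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix of word counts
print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # fixed-length count vectors; word order is discarded
```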

What is the idea behind using N-Grams in NLP?

Checking the previous and following words to see how a word is used (noun, verb, etc.). N-grams can be used with different N: 3-grams, 4-grams and so on; in general, a very high N is not efficient. The idea with N-grams is that instead of computing the probability of a word given its entire history, we calculate it using only the neighbouring history, meaning we can predict the probability without looking too far into the past.
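A minimal sketch of the idea for bigrams: estimating P(w_i | w_{i-1}) from raw counts, i.e. conditioning only on the immediately preceding word instead of the full history (toy corpus, no smoothing):

```python
from collections import Counter

tokens = "i like pizza and i like pasta".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # Maximum-likelihood estimate: count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("i", "like"))   # 1.0 in this tiny corpus
```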

Define configuration in transition-based parsing? What are the transitions in the arc-standard algorithm?

Configuration: transition-based parsing is the task of finding dependencies between the words in a sequence, i.e. arcs between tokens that show the dependencies between words. A configuration is made of three parts: a buffer holding the tokens still to be processed, a stack holding (the heads of) tree fragments, and the set of dependency relations created so far. A transition moves one token from the buffer to the stack, removes a token from the stack, or creates a dependency; other systems may have different transitions, additional stacks, etc.
Transitions in the arc-standard algorithm (a transition-based dependency parser):
- shift: move the next word in the buffer onto the stack
- left-arc: add an arc from the top word of the stack to the second topmost word (and remove the dependent from the stack)
- right-arc: add an arc from the second topmost word to the top word (and remove the dependent from the stack)

How NN works

Data is fed through the input layer, which passes it to the hidden layers, where the processing takes place through weighted nodes. The nodes combine data from the input layer using weights: weights close to zero mean that changing the neuron's input barely changes the output, while negative weights mean that increasing the input decreases the output. Each neuron in the hidden and output layers takes the output of the previous layer as input and computes the sum-product of weights and inputs; this sum is passed through an activation function and then on towards the output layer, as in the sketch below.
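A minimal sketch of that weighted-sum-plus-activation step for one dense layer, written with NumPy; the 3-input, 2-neuron layer and its weights are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])                 # input layer values
W = np.array([[0.1, -0.3, 0.8],                # one weight row per hidden neuron
              [0.0,  0.5, -0.2]])
b = np.array([0.1, -0.1])

hidden = sigmoid(W @ x + b)                     # sum-product of weights and inputs, then activation
print(hidden)
```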

If you want to develop a hate speech detection system from English text, what are the general steps?

Different approaches:
- Keyword based: define a list of words. Doesn't work for all words and misses the meaning or context of a word.
- Source metadata: additional information from social media (demographics, location, timestamp) can help identify hate speech, but privacy is a problem.
- Machine learning: train a model to decide what is hate speech and what is not. We would build a classifier that labels texts as hate speech or not, for example a neural network such as an RNN with an LSTM layer; a simpler pipeline is sketched below.
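A minimal sketch of the machine-learning approach, using a TF-IDF + logistic regression pipeline in scikit-learn as a simpler stand-in for the RNN/LSTM mentioned above; the example texts and labels are invented for illustration:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["you are wonderful", "I hate this group of people",
         "have a nice day", "they should all disappear"]
labels = [0, 1, 0, 1]                     # 1 = hate speech, 0 = not (toy labels)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["I hate them"]))
```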

What are the different types of documents, what is document classification, provide some examples?

Document classification is the process of assigning documents, using ML, to one or more categories depending on their content.
Three types of documents:
- Unstructured (e.g. contracts): not organized into rows and columns, no associated data model.
- Semi-structured (e.g. invoices): a mix between the two; there is metadata that makes it possible to organize the content, e.g. in a relational database.
- Structured (e.g. surveys): easy to search and organize, with rows and columns.
Applications include movie reviews, hate speech detection, and spam/not-spam email classification.

Explain the main idea behind FastText word representation

FastText is a way of representing words with vectors. A major drawback of other embedding techniques is their problem with handling out-of-vocabulary (OOV) words. FastText treats words as compositions of character n-grams, so the word "jump" would be represented by pieces such as ju, jum, ump and mp. FastText can therefore generate embeddings for words that do not appear in the training corpus, by summing the representations of their character n-grams. FastText can also be combined with stemming words to their root, so that "running" and "runs" both become "run". A drawback is that it requires a lot of memory.
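A minimal sketch of training FastText subword embeddings with gensim; the toy corpus and hyperparameters are arbitrary assumptions, but the point is that an OOV word still gets a vector built from its character n-grams:

```python
from gensim.models import FastText

sentences = [["the", "cat", "jumps"], ["the", "dog", "runs"]]
model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=2, max_n=4)

# "jumping" never occurs in the corpus, but its character n-grams give it a vector.
print(model.wv["jumping"][:5])
```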

CNN

Filters in image processing: images are made up of pixels. The role of the convolution is to reduce the image into a form that is easier to process, without losing the features that are critical for a good prediction; the objective is to extract high-level features such as edges from the input image. The filter convolves (slides) across each n×n block of pixels of the input.
Pooling down-samples the image to reduce computation and dimensionality. Max pooling takes the maximum value under the filter and also acts as a noise suppressant; average pooling takes the average of all the values. A CNN is built from convolutional layers plus pooling layers.

Name two similarity measurement metrics for vectorized semantic representation, and tell which one is preferred? why?

Finding similarity between word vectors is straightforward; the two common measures use the angle and the magnitude of the vectors.
- Euclidean distance: considers both angle and magnitude (length).
- Cosine similarity: considers only the angle.
Cosine is preferred over Euclidean distance, since the magnitude of word vectors often does not carry much information about meaning, as illustrated below.
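A minimal sketch comparing the two measures for two toy vectors that point in the same direction but have different magnitudes:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 4 * a                                                  # same direction, larger magnitude

euclidean = np.linalg.norm(a - b)                          # large, because magnitudes differ
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # 1.0, because only the angle matters

print(euclidean, cosine)
```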

What do you understand by Gender Bias?

Gender bias is prejudice against one gender; it is transferred from humans to data and models and cuts across many fields. In NLP it shows up as systematic errors tied to gender: for example, an AI system asked to fill in the blanks completes "man is to king as woman is to queen", but the underlying issue arises when it fills in sentences like "father is to doctor as mother is to nurse". The inherent gender bias in that completion reflects an outdated perception of women in our society that is not based on fact or equality.
How it happens: general bias in society, incomplete or skewed training data, and bias in the human labels in the data.
Best practices for machine-learning teams to avoid gender bias: ensure diversity in the training samples (e.g. use roughly as many female as male audio samples in your training data), and ensure that the humans labeling the samples come from diverse backgrounds.

What is hate speech detection and list some methods applied to do hate detection?

Hate speech detection is the task of detecting whether a communication, such as a text, contains hatred or violence towards a person or a group. Different approaches:
- Keyword based: define a list of words. Doesn't work for all words and misses the meaning or context of a word.
- Source metadata: additional information from social media (demographics, location, timestamp) can help identify hate speech, but privacy is a problem.
- Machine learning: train a model to decide what is hate speech and what is not.

Dealing with Out Of Vocabulary words OOV

- Ignore: probably the worst solution of all; the model simply pretends not to see the word.
- UNK token: reserve a dimension in the feature space for unknown words. This may reduce some of the impact of OOV words, but only to a limited degree, especially when the OOV words play an important role in the task (e.g. some positive and negative words are OOV in sentiment analysis).
- Feature hashing: not really designed for OOV words; the idea is to blindly trust hash collisions to bring some good luck to the model, which of course can go wrong.
- Spell check: if many OOV words are caused by typos (e.g. in a chatbot), it is natural to use a spell checker. One caveat is that the spell checker may over-correct words that were already correct, so the corrected words should be used as an additional feature and combined with the other solutions.
- Subword: as in Facebook's fastText library, break the word/string down into multiple consecutive characters. For example, breaking "hello world" into 3-gram character pieces within each word gives features like hel, ell, llo, wor, orl, rld; breaking the string across words adds features such as lo⎵, o⎵w, ⎵wo (⎵ means white space). A small sketch follows.
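A minimal sketch of the within-word subword idea described above, splitting each word into consecutive character 3-grams:

```python
def char_ngrams(word, n=3):
    # All consecutive character n-grams inside a single word.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("hello"))   # ['hel', 'ell', 'llo']
print(char_ngrams("world"))   # ['wor', 'orl', 'rld']
```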

What are context vectors, encoders, and decoders in recurrent neural networks?

In an RNN-based sequence-to-sequence model there are two networks working as encoder and decoder. The encoder network takes the input sequence and maps it to an encoded representation of the sequence, the context vector. The context vector is then used by the decoder network to generate the output sequence. The context vector acts as a "memory" that captures information about what has been computed so far, enabling RNNs to remember past information and preserve information over long, variable-length sequences. Because of this, RNNs can take one or multiple input vectors and produce one or multiple output vectors.

What is skipgram?

In skip-gram embeddings we want to predict the surrounding words from a given word. Rather than predicting all the surrounding words at once, we create context-target pairs: we decide on a window size and pair the central word with each word in the window of surrounding words. Skip-gram is an architecture for computing word embeddings: it uses a word as input to predict the context words surrounding it within the window.

What is the difference between supervised and unsupervised classification?

In supervised classification the data has labels; in unsupervised classification it does not.

List some of the practical issues that NN might face and come up with some suggestions for the solution?

The issues can be in dataset preparation or in implementation and training.
Dataset preparation:
- Mislabeled data (human labeling errors)
- Imbalanced data: one class has many more instances than the others
- Batches with the same label: you may have to shuffle the data
- Too few training samples
Implementation and training:
- Data preprocessing must be done the same way on training and test data
- Data normalization must be computed on the training data only, and the same transformation then applied to the other sets
- Weight initialization: we should not set every weight to 0 (or to the same value), because then all neurons compute the same updates
- Too much capacity in the network can cause overfitting
- Regularization might overwhelm the data
- A too-high learning rate gives the model too much "kinetic energy"; a schedule such as step decay helps.

What is tokenization?

Tokenization is a morphological analysis: the task of dividing a document into pieces, tokens, separating the words of a sentence. It is well understood for programming languages but harder for natural languages, where the same word can have different meanings. Languages can be space-delimited (English, Swedish) or unsegmented (Chinese, Thai). The challenge of tokenization is that different words can have the same meaning, and the same meaning can be expressed by different words.
Techniques for space-delimited languages: white space, dictionary based, regular expressions (see the sketch below). For unsegmented languages: the maximum matching algorithm (a greedy algorithm that repeatedly takes the longest dictionary word in the sequence). Python libraries: spaCy, NLTK, Stanford.
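A minimal sketch of three techniques for a space-delimited language: white-space splitting, a regular expression, and NLTK's library tokenizer (the punkt model must be downloaded once):

```python
import re
import nltk

text = "Tokenization splits a document into tokens, doesn't it?"

print(text.split())                      # white-space tokenization
print(re.findall(r"\w+|[^\w\s]", text))  # regular-expression tokenization
# nltk.download('punkt')                 # uncomment on the first run
print(nltk.word_tokenize(text))          # library tokenizer (NLTK)
```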

For which task is a (MLP | RNN | CNN) useful?

An MLP is useful for classification problems where the number of classes is known. An RNN is useful for sequence-to-sequence tasks, where we predict or generate sequences. A CNN is useful for images and text, when the input is multi-dimensional and local patterns matter: in an image we can have local objects such as edges, and the same applies to sentences, where we convolve over the text to capture local patterns.

What is a POS tag? Make a sentence, and POS tag it.

POS stands for part-of-speech tagging: the task of finding the word class for each word. There are multiple approaches to POS tagging: rule-based (relies on a dictionary), hidden Markov models (observation probabilities), and decoding an HMM with algorithms such as Viterbi. Example sentence: "I like pizza" → I = pronoun, like = verb, pizza = noun. A library-based example is sketched below.
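A minimal sketch tagging the example sentence with NLTK's tagger (the required models must be downloaded once):

```python
import nltk

# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')   # first run only
tokens = nltk.word_tokenize("I like pizza")
print(nltk.pos_tag(tokens))   # e.g. [('I', 'PRP'), ('like', 'VBP'), ('pizza', 'NN')]
```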

Explain the followings in a language, and give examples: Morphemes, lexemes, syntax, and context.

A morpheme is the smallest meaningful unit of language: the base of a word, which cannot be split further while still carrying meaning, like the word "cat".
A lexeme is the set of inflected forms taken by a single word. Members of the lexeme run would be run, running, and ran. This would not include runner, since that is a derived term with a morpheme attached.
Syntax is the set of rules for constructing sentences: the arrangement of words in a specific order. Changing the position of one word can change the entire meaning, for example "the person ate a pizza" versus "the pizza ate a person".
Context is how words and phrases come together in a language to convey a particular meaning. This includes tone, body language, and emphasis on words; the way something is said can change the meaning, for example saying "great" loudly with a smile on your face versus saying "great" in a low tone while rolling your eyes.
Sentence: "She likes pizza"
Phonemes: sound notation, the units of sound
Morphemes: She | likes | pizza (three morphemes)
Lexemes: She | likes | pizza (three lexemes: tokens/words)
Syntax: She (subject) likes (verb) pizza (object)
Context: the context of "likes" is "She" and "pizza"

Describe the multi-class, multi-label, and multi-task classification problems.

Multi-class is a classification problem where the output is divided into several classes, for example classifying images of animals into multiple classes; in multi-class classification each image is assigned to exactly one class.
Multi-label classification is where instances can belong to several categories at once, for example classifying news articles into categories such as news, sport, and fashion, where one article can carry multiple labels.
Multi-task is where you predict multiple traits of the same instance: for example, for an image of an apple you want to predict both the class (apple) and the size.

What is one hot encoding? length of encoding depends upon what?

One-hot encoding is one of the techniques for representing features: each feature is represented by a binary vector whose size depends on the vocabulary. For a sentence of n words we get an n × V matrix, where V is the size of the vocabulary. For a dataset of S sentences, all of the same length T, with vocabulary size V, the dimension of the one-hot encoded input to the NN is S × T × V. A small sketch follows.
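A minimal sketch of one-hot encoding a short sentence against a toy vocabulary of size V = 5:

```python
import numpy as np

vocab = ["i", "like", "pizza", "and", "pasta"]        # V = 5
index = {word: i for i, word in enumerate(vocab)}

sentence = ["i", "like", "pizza"]                      # n = 3 words
one_hot = np.zeros((len(sentence), len(vocab)))        # shape n x V
for row, word in enumerate(sentence):
    one_hot[row, index[word]] = 1

print(one_hot)
```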

How do you define natural language processing?

NLP is the ability of computers to understand human language (natural language) in spoken and written form. It is a subset of AI. NLP combines statistical, rule-based, and deep learning models to enable computers to learn and process human language the way we humans do.

In human language we have polysemy and synonymy, what is the difference between them? Based on the above, what is synset in NLP?

Polysemy and synonymy are concepts in linguistics. Polysemy is a 1:N mapping of form to meaning (one word has multiple meanings, e.g. "table" as tabular data or as furniture). Synonymy is a 1:N mapping of meaning to form (one meaning has multiple words, e.g. beat, hit, strike all mean roughly the same thing). A synset is a set of synonyms that share a common meaning and are interchangeable without changing the value of the context they appear in. Synsets can be used in sentiment analysis; the lexical database WordNet can be used to browse synsets and find words that are meaningfully related to the word in question, as sketched below.
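A minimal sketch of browsing synsets for "table" with NLTK's WordNet interface (the wordnet corpus must be downloaded once):

```python
from nltk.corpus import wordnet

# import nltk; nltk.download('wordnet')          # first run only
for synset in wordnet.synsets("table")[:3]:
    print(synset.name(), "-", synset.definition())
```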

Justify why we use regularization methods, list the three major categories of regularization methods we discussed, and explain one of them in more detail.

Regularization methods are used to avoid overfitting: a modification to the learning algorithm intended to reduce generalization error, typically by reducing the complexity of the model. Overfitting can be seen by comparing validation loss and training loss: when validation loss goes up while training loss keeps going down, the model is overfitting. Such a model has high variance, meaning the predictions change if you change the training data, and flexibility in the model (many parameters) increases variance. In a NN, lower weights help; very high weights exaggerate the importance of one input, causing overfitting.
Techniques:
- Penalizing the parameters: add a penalty on the size of the weights to the loss. An L1 penalty uses the absolute values of the weights, an L2 penalty the squared values.
- Dropout: reduce complexity by eliminating nodes during training; it can be seen as an ensemble method where you drop nodes and effectively average over many sub-models.
- Early stopping: monitor validation loss vs training loss and stop training when validation loss starts going up, i.e. when the model starts overfitting.
- Adding noise: introduce noise into the data. Examples in NLP: easy data augmentation (EDA) with synonym replacement of the same POS ("driving a blue car" → "driving a silver car"), and back-translation, translating from one language to another and then back to the first.
A short sketch of two of these follows.
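A minimal sketch, assuming PyTorch and arbitrary layer sizes, of an L2 penalty (via weight_decay) and a Dropout layer, two of the techniques listed above:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # randomly zero out half the activations during training
    nn.Linear(64, 2),
)
# weight_decay adds an L2 penalty on the weights to the loss being optimized.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```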

What is Stemming and Lemmatization?

A stem is the base or root of a word obtained by chopping off the ending (removing affixes): the stem of running → run and of ponies → poni. Lemmatization finds the root word (the lemma) by relying on a dictionary: the lemma of ponies → pony. A small sketch follows.
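A minimal sketch of Porter stemming versus WordNet lemmatization in NLTK (the wordnet corpus must be downloaded once for the lemmatizer):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# import nltk; nltk.download('wordnet')   # first run only, for the lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"), stemmer.stem("ponies"))   # run poni
print(lemmatizer.lemmatize("ponies"))                     # pony
```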

What is semantics and how we can represent it? Specifically, discuss formal and vectorized semantic representation and give examples of each.

Semantics is the study of meaning: we use symbols to represent the meanings of words, for example the physical object mug is the meaning behind the word "mug".
Formal semantics: words can have different meanings, and we define symbols for those meanings; we can use a consensus lexicon for the different meanings of a single word.
How can we share the meaning of a word in our mind with a computer? The common way is vectorized semantics: representing words with numerical values. We transform words into vectors by embedding them in a vector space, exploiting the fact that words which occur in similar contexts have similar meanings. Tools for vectorized representations include word2vec, GloVe, and FastText.

What are some example problems of NLP/Text Mining?

One core problem in NLP is understanding the meaning and content of words. For humans it is easy to understand what someone's text or words mean, because we know the context of the sentence; for computers this is not as easy. The same word can have different meanings depending on the context, for example: "I ran to the store because we ran out of milk." / "Can I run something past you real quick?" / "The house is looking really run down." In all of these, the word "run" has a different meaning. There are also homonyms, different words with the same pronunciation (their, there), and synonyms, where it is hard for the computer to understand that different words have the same meaning (small, little, tiny). Slang is another issue, since new words are made up all the time.
Example applications: machine translation, sentiment analysis, topic classification, hate speech detection, spam detection, fraud detection, and risk management.

How are the processes of spam classification and machine translation different from each other?

Spam classification is a binary classification problem: detecting whether an email is spam or not. The process involves preprocessing the data to clean it (for example removing stopwords); the words are then vectorized and fed to a model for classification.
Machine translation is the process of translating text from one language to another. The text is preprocessed and converted into a numerical form the computer can work with, then fed to a model for translation.
Differences: machine translation is a seq2seq process while spam detection is binary classification. The MT model must understand the semantics, relationships and context of the words and needs to keep state, so RNN-style architectures are necessary (encoder-decoder, transformer, and so on). The MT model uses both NLU and NLG.

What is the difference between Syntactic and Semantic analysis, how parser can be applied as a method of syntactic analysis?

Syntactic analysis focuses on form and syntax, i.e. the relationships between the words in a sentence; semantic analysis focuses on the meaning of the words. Parsing determines the syntactic structure of a text based on the underlying grammar of the language: it analyses the words of a sentence and their arrangement in a way that shows the relationships among them. To parse is to "resolve a sentence into its component parts and describe their syntactic roles." Parsing refers to the formal analysis of a sentence by a computer into its constituents, resulting in a parse tree that shows their syntactic relations in visual form, which can be used for further processing and understanding. It helps in identifying parts of speech, phrases, and relationships.

Explain the F1-score, and justify its use over Accuracy

The F1-score sums up the performance of a model by combining precision and recall: F1 = 2 * (precision * recall) / (precision + recall). Precision: of all positive predictions, how many are really positive? Recall: of all real positive cases, how many are predicted positive? F1 works better than accuracy when you have an uneven class distribution: if the class proportions are very different, accuracy can be misleading. For example, with 100 instances of class 1 and 1 instance of class 2, a model that predicts everything as class 1 gets very high accuracy even though it is useless for class 2, as illustrated below.
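A minimal sketch of accuracy versus F1 on an imbalanced toy example where the model predicts the majority class for everything:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 100 + [1]          # 100 negatives, 1 positive
y_pred = [0] * 101                # model predicts "negative" for every instance

print(accuracy_score(y_true, y_pred))              # ~0.99, looks great
print(f1_score(y_true, y_pred, zero_division=0))   # 0.0, reveals the model is useless for the positive class
```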

What is frame semantic, some application that can use frame theory?

The idea of frame semantics is that individual words need a larger network of words and meanings for a listener or reader to understand a single word in context; frame theory assigns such semantic structures to natural language sentences. A semantic frame is a set of words that represent an event: for example, a commercial transaction frame contains verbs like buy and sell, which can be used to represent the transaction. Frames have semantic roles that the words in a sentence can take on, and verbs with related meanings are expected to evoke the same frame in syntactically different forms. One application is FrameNet, which is used in question answering.

What is the vanishing gradient problem and how it is tackled?

The vanishing gradient problem occurs when, during training, the gradient updates to the weights get closer and closer to zero, so the model can no longer learn. Some activation functions have very small derivatives: sigmoid squashes a large input space into a small output space (0-1), so big changes in the input result in only small changes in the output, and the rates at which the weights are updated become tiny. Remedies: use other activation functions such as ReLU, or use an LSTM. In an LSTM, the gradient flows through the activation vector of the forget gate, which helps the network regulate the gradient values at each time step by updating the forget gate's parameters. The forget gate's activation lets the LSTM decide at each time step whether particular information should be remembered, and update the model's parameters accordingly.

What are NLU and NLG? How can you differentiate between NLG and NLU?

They are both subsets of NLP. NLU is Natural Language Understanding and is used to discover the meaning of a text and find pieces of information within it; this is used in hate speech detection and sentiment analysis. NLG is Natural Language Generation, a method of generating text from structured data, for example generating a caption for an image or generating natural-language text from keywords. NLU reads text and converts it into structured data, while NLG writes text from structured data.

Suppose there are trained networks for encoder and decoder. Show the steps of how the encoder-decoder model translates "A B C" to "X Y Z".

Encoder-decoder models were created to handle seq2seq problems and consist of an encoder, a decoder and a context vector; the encoder and decoder are typically made up of LSTM cells. The encoder processes each source token "A", "B", "C" in turn and compresses the sequence into a fixed context vector, which it passes to the decoder. The decoder then predicts the output sequence token by token:
1. Feed a special start-of-sequence symbol to the decoder together with the encoder's final state; the decoder predicts the first output token X.
2. At the 2nd time step, X is fed back as input to the decoder, which predicts Y.
3. At the 3rd time step, Y is fed back and the decoder predicts Z.
4. Decoding stops when the decoder emits the stop symbol <END>.
A sketch of this loop follows.
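A minimal pseudocode-style sketch of that inference loop, assuming hypothetical `encoder` and `decoder` callables with this interface (not a real library API): `state = encoder(source_tokens)` and `token, state = decoder(previous_token, state)`.

```python
def translate(encoder, decoder, source_tokens, start="<SOS>", stop="<END>", max_len=20):
    # encoder and decoder are assumed, hypothetical trained networks.
    state = encoder(source_tokens)             # e.g. ["A", "B", "C"] -> final context vector
    token, output = start, []
    for _ in range(max_len):
        token, state = decoder(token, state)   # feed the previous prediction back in
        if token == stop:
            break
        output.append(token)
    return output                              # e.g. ["X", "Y", "Z"]
```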

Encoder-decoder

Encoder-decoder models were created to handle seq2seq problems and consist of an encoder, a decoder and a context vector; both are typically made up of LSTM cells. The encoder processes each token and transforms the sequence into a fixed context vector, which it passes to the decoder; the decoder reads the vector and predicts the output sequence token by token.
Testing phase: input a special symbol representing the start of the output sequence; the decoder uses this start symbol and the final state of the encoder to predict the first output token T1. At the 2nd time step, T1 is fed as input to the decoder, and so on until the stop symbol <END> is produced.
Training phase: slightly different and faster. The encoder works the same, accepting tokens and sending its final state to the decoder, but instead of feeding the predicted output T1 back into the decoder, we feed in the true token (teacher forcing). In real-world machine translation we have no "true" value at inference time, which is why we use the predicted output in the test phase but the true value in the training phase.

What is tokenization and provide an example? What is the difference between tokenizing the space delimited language and unsegmented language?

Tokenization is a morphological analysis: the task of dividing a document into pieces, tokens, separating the words of a sentence. It is well understood for programming languages but harder for natural languages, where the same word can have different meanings. Example: "I like pizza" → ["I", "like", "pizza"]. Languages can be space-delimited (English, Swedish) or unsegmented (Chinese, Thai), and the two need different approaches since the languages are built differently. The challenge of tokenization is that different words can have the same meaning, and the same meaning can be expressed by different words.
Techniques for space-delimited languages: white space, dictionary based, regular expressions. For unsegmented languages: the maximum matching algorithm (a greedy algorithm that repeatedly takes the longest dictionary word in the sequence).

What are the typical steps of a text mining system involving machine learning?

Tokenization: one of the first steps in NLP, the task of splitting a sequence of text into units; there is word-level, character-based, and subword-level tokenization.
Part-of-speech tagging: establishing the part of speech of each token in a text and marking it as such, i.e. determining whether a token is a noun, verb, adjective, adverb, etc. It is not a simple task: in "Book a hotel", book is a verb.
Text feature extraction for machine learning:
- Cleaning: removing unnecessary/unimportant parts of the text (making all letters lowercase, removing special characters, etc.). Example: "Does Michael like 99 cats?" → "does michael like 99 cats"
- Tokenization: splitting the text into its constituent parts (words, letters and/or symbols, depending on the embedding used). Example: "does michael like 99 cats" → ["does", "michael", "like", "99", "cats"]
- Stemming/lemmatization: switching all words to their base form (removing plurals, changing verbs to their root form, etc.). Example: ["does", "michael", "like", "99", "cats"] → ["do", "michael", "like", "99", "cat"]
- Replacing rare/special tokens: switching rare tokens (ones that show up only once or twice in the dataset) or special tokens (names, numbers, etc.) for tags. Example: ["do", "michael", "like", "99", "cat"] → ["do", "<NAME>", "like", "<NUMBER>", "cat"]
- Indexing: creating a token-to-number mapping and replacing each token by its number. Example: ["do", "<NAME>", "like", "<NUMBER>", "cat"] → [454, 3, 872, 2, 234]
- Embedding: using an embedder to get a vector to use as input to the machine learning model.

Describe the two paradigms of using a pre-trained model in transfer learning: feature extraction, and fine-tuning.

Transfer learning means using a pre-trained model to boost the performance of your model. Pre-trained models are usually trained on large datasets and improve results and training time, and they are more general. How: design your model and use the pre-trained model inside it; the pre-trained part can be a single layer or something more complex. Types: domain adaptation, cross-lingual learning, multi-task learning, sequential transfer learning.
Feature extraction: you pass your input through the pre-trained model, take its output, and use that as the input to your own model; typically you keep the pre-trained model frozen and add your own fully connected layer at the end, replacing the final layer. Using pre-trained word embeddings is an example.
Fine-tuning: you train both your model and the pre-trained model, connected together. 1. Obtain the pre-trained model. 2. Create a base model. 3. Freeze the initial layers of the pre-trained model. 4. Add new trainable layers, usually the final output layer. 5. Train the model with a normal learning rate. 6. Fine-tune by unfreezing parts of the base model and training with a small learning rate; the low learning rate improves performance on the new dataset while preventing overfitting.
A drawback is that you have to keep both models in memory, which requires a lot of space. The freezing pattern is sketched below.
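A minimal sketch, assuming PyTorch and a hypothetical `pretrained_base` module that outputs a 768-dimensional feature vector; only the freeze-versus-fine-tune pattern is the point here:

```python
import torch.nn as nn

def build_classifier(pretrained_base, feature_dim=768, num_classes=2, fine_tune=False):
    if not fine_tune:
        # Feature extraction: freeze the pretrained weights, train only the new head.
        for param in pretrained_base.parameters():
            param.requires_grad = False
    # Fine-tuning: leave (some of) the base trainable and train it with a small learning rate.
    head = nn.Linear(feature_dim, num_classes)   # new trainable output layer
    return nn.Sequential(pretrained_base, head)
```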

What is the difference between tokenizing the space delimited language and unsegmented language?

Unsegmented languages are, for example, Chinese and Thai; space-delimited languages are English and Swedish. You need different approaches since the languages are built differently. Techniques for space-delimited languages: white space, dictionary based, regular expressions. For unsegmented languages: the maximum matching algorithm (a greedy algorithm that repeatedly takes the longest dictionary word in the sequence).

How does LSTM work? Why is it useful and when?

LSTMs are used to tackle the problem of vanishing gradients. Long Short-Term Memory networks, usually just called LSTMs, are a special kind of RNN capable of learning long-term dependencies: they keep the useful information and forget what is not meaningful. In LSTMs, information flows through a mechanism known as the cell state, and the addition or removal of information is controlled by gates in each cell; a gate value of zero means let nothing through, while a value of one means let everything through. An LSTM has three such gates controlling the cell state:
- Forget gate: decides what information to throw away from the cell state.
- Input gate: outputs numbers between 0 and 1 and decides which values to update.
- Output gate: decides what to output, based on the cell state.

How do we evaluate a text processing system objectively?

We use different evaluation metrics depending on the model: accuracy, precision, recall, F1-score.
Language models use an evaluation corpus and perplexity: how well can we predict unseen text, e.g. a missing word?
Automatic speech recognition: you want the transcripts to be as close to the original sentences as possible, so you measure the error rate (how many words were wrong).
Machine translation: we can ask people who know both languages, or do automatic evaluation, e.g. edit distance from a correct translation, although we would need many correct translations for that. BLEU is a metric that compares the similarity between the output and reference translations, but it only looks at the surface words, not their meaning, so no synonyms. Beyond BLEU there are metrics based on it: METEOR considers synonyms, and hLEPOR is another BLEU-style metric.

What is word2vec?

Word2vec is one of the most popular techniques for learning word embeddings using a shallow neural network. Words with similar contexts are placed close together spatially, which helps capture the semantic meaning of a word. It can be obtained using two methods (both involving neural networks): skip-gram and continuous bag of words (CBOW). Like the concept of N-grams in the bag-of-words model, here we have a window size: the word2vec model captures relationships between words with the help of the window, using the skip-gram and CBOW methods; we use the window to keep track of the center word and its context. A small sketch follows.
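A minimal sketch of training word2vec with gensim on a toy corpus; sg=1 selects skip-gram, sg=0 would select CBOW, and the hyperparameters are arbitrary:

```python
from gensim.models import Word2Vec

sentences = [["i", "like", "pizza"], ["i", "like", "pasta"], ["dogs", "like", "bones"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["pizza"][:5])                   # part of the learned embedding vector
print(model.wv.most_similar("pizza", topn=2))  # nearest words in the embedding space
```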

Mention some of the Word2vector method? What is the main difference between CBOW and skip-Grams

Word2vec is a technique in NLP that uses a neural network to learn word associations: linear models in an encoder-decoder structure. It has two models for training embeddings in an unsupervised manner: continuous bag of words (CBOW), where the target word is the output, and skip-gram, where the target word is the input. Skip-gram and CBOW are mirrored versions of each other: skip-gram takes a single word as input and predicts the surrounding context words in the window as output, while CBOW takes the neighbouring context words as input and predicts a single word as output.

LSTM

An LSTM is a variant of RNN, a gated RNN: it applies a multi-layer long short-term memory RNN to an input sequence. The hidden units are replaced with LSTM cells, and a connection called the cell state is added to the cells. LSTMs were designed to mitigate the vanishing/exploding gradient problem. At each time step the next LSTM cell can choose to read from the cell state vector, write to it, or reset it.
Three gates, all with sigmoid activations:
- Input gate: controls whether the memory cell is updated.
- Forget gate: controls whether the memory cell is reset to 0.
- Output gate: controls whether the current cell state is made visible.
Another vector, c-bar, has a tanh activation, which distributes the gradient and helps prevent vanishing/exploding gradients. Each gate takes the hidden state and the current input x, concatenates the vectors, and applies a sigmoid; c-bar represents the new candidate values that can be added to the cell state once the gates are applied.

RNN

RNNs are very versatile:
1. Vector-to-sequence models: image captioning
2. Sequence-to-vector models: sentiment analysis
3. Seq2seq: word prediction (for sequences of the same length); the encoder-decoder variant is used for machine translation because it can handle different input/output lengths.
RNNs have the major benefit that they can process input sequences of any length with the same network. The input is fed sequentially to the network together with the previous network output: after the first input token is transformed into a prediction, this prediction is fed to the same network together with the next token, until all tokens have been seen. Thus, regardless of whether the input sequence is one word or 100 words, the network size remains the same (but for 100 words we have to backpropagate through 100 steps of the network to train it; what issues might this cause?).

