NLP - Quiz #3
How to train a Perceptron
1. Have a decision function. 2. Start all the feature weights at zero. 3. Loop over every piece of data and make a prediction. 4. Whenever a prediction is wrong, apply the update rule to adjust the weights.
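A minimal sketch of that recipe in Python, assuming binary labels in {-1, +1} and feature vectors as plain lists (train_perceptron and its arguments are illustrative names, not from the course):

```python
def predict(weights, features):
    # decision function: sign of the weighted sum of the features
    score = sum(w * f for w, f in zip(weights, features))
    return 1 if score >= 0 else -1

def train_perceptron(data, num_features, epochs=10):
    weights = [0.0] * num_features              # step 2: start at zero
    for _ in range(epochs):
        for features, label in data:            # step 3: loop and predict
            if predict(weights, features) != label:
                # step 4 (the update rule): nudge weights toward the answer
                for i, f in enumerate(features):
                    weights[i] += label * f
    return weights
```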
Time complexity for greedy decoding?
O(NC): at each of the N positions we score each of the C possible tags once.
Span
a subsection of text
Stratification? why good for training? why bad?
artificially sampling data to achieve a desired balance of classes -- when you divide data into groups, make sure each class is well represented in every group. Good: balanced representation, avoids bias toward common classes. Bad: the sample no longer accurately reflects the overall distribution of classes in the entire data set.
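For instance, scikit-learn's train_test_split can stratify on the labels (a sketch, assuming scikit-learn is installed):

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]        # toy feature vectors
y = [0] * 8 + [1] * 2               # imbalanced labels (80% / 20%)

# stratify=y preserves the 80/20 label ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
```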
Classification
assigning one or more classes to each item (AKA instance) by predicting the best class(es) for the item
Ontology
classes, properties and instances
text classification process
1. Choose an initial set of features, a representation of each piece of data (instance). 2. Train a model. 3. Tune the model by measuring performance on the dev set: adjust hyperparameters, change features, change the model if needed. 4. When you have a final model or set of models, evaluate performance on the test set.
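A sketch of that loop with scikit-learn (the bag-of-words features and logistic regression model are just placeholder choices, not the course's prescribed setup):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["great movie", "terrible film", "loved it", "hated it"]
train_labels = [1, 0, 1, 0]
dev_texts, dev_labels = ["pretty great movie", "terrible"], [1, 0]

vectorizer = CountVectorizer()                              # step 1: features
X_train = vectorizer.fit_transform(train_texts)
model = LogisticRegression().fit(X_train, train_labels)    # step 2: train

# step 3: measure on the dev set, then adjust features / hyperparameters
dev_accuracy = model.score(vectorizer.transform(dev_texts), dev_labels)
print(dev_accuracy)
```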
How much data goes into train, dev, and test?
An 80/10/10 split is typical. In general, train is the largest; dev and test are approximately equally sized.
What's dynamic programming?
An efficient approach to breaking down problems to make a full solution from partial solutions.
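A classic tiny example: memoized Fibonacci, where each partial solution is computed once and reused.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # each subproblem is solved once and cached, so building the full
    # solution from partial solutions is linear rather than exponential
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(50))  # 12586269025, computed instantly
```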
Viterbi Decoding is ______________ programming
DYNAMIC PROGRAMMING!
chrF
a character-level F-score for machine translation: it measures character n-gram overlap between the system output and a reference translation, so it doesn't care about individual words or the spaces between them.
What's an entity type?
Entity types allow us to create classes of entities/mentions
GPE vs LOC
GPEs (geo-political entities) are places with a government or political structure (countries, cities, states), so they can act like animate agents ("France announced..."), whereas LOCs are inanimate physical locations (mountains, rivers, regions).
why more common to use a log-linear model like logistic regression for classification?
It can produce probabilities, and overall there's much greater control over the mechanics of optimization.
Time complexity for viterbi decoding?
O(NC²): at each of the N positions we consider all C² pairs of (previous tag, current tag).
Micro/macroaveraging
Macroaverage: average across the per-class F1 scores ("average of averages"); gives all classes the same weight in the overall score. Microaverage: compute F1 across all data points; every data point contributes equally, so the score is biased by the distribution of the labels (common labels dominate).
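scikit-learn's f1_score exposes both through its average argument (a sketch, with made-up toy labels):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 2, 1]

print(f1_score(y_true, y_pred, average="macro"))  # mean of per-class F1s
print(f1_score(y_true, y_pred, average="micro"))  # pooled over all points
```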
Viterbi decoding data structures?
a matrix (2D list): one cell per (position, tag) pair for the scores, plus a matching matrix of backpointers to the best previous state.
Perceptron: principle of operation
Perceptrons work like a decision-making system. Imagine teaching a robot to recognize apples and bananas. For each fruit, the robot looks at features like color and shape, multiplies each feature by its importance (weight), and adds the results up. If the total is more than a certain threshold, the robot decides it's an apple; otherwise, it's a banana.
AI busts & booms?
A repeating cycle: technological advances raise expectations, the expectations outrun what the technology can actually deliver, and funding and interest collapse (an "AI winter") until new advances restart the cycle. Broader societal and economic factors feed into both phases.
Perceptron: update rule
The perceptron learns from mistakes using the update rule. If it makes an error in classifying a fruit, it adjusts the weights assigned to features. For example, if it mistakes a red banana for an apple, it might reduce the importance it gives to color. This update rule helps the perceptron get better at making correct decisions.
caricature of NLP process?
Typically we divide the data into the training, development (AKA validation), and test sets. Then we train models on train, tune/improve using dev, and do our final evaluation on test.
How to get an optimal tag assignment?
Viterbi Decoding
Feature set
a combination of features that represents the input
Class
a label that can be assigned to an item
Mention
a span or section of text that refers to a specific entity
Viterbi decoding
examine all possible tag assignments left to right, at each point identifying the best previous state for each state and what the resulting score would be
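A compact sketch, assuming log-score functions emit(tag, position) and trans(prev_tag, tag) supplied by the model (both hypothetical names):

```python
def viterbi(tags, n, emit, trans):
    C = len(tags)
    # scores[i][t]: best score of any tag sequence for words 0..i ending in tag t
    scores = [[0.0] * C for _ in range(n)]
    back = [[0] * C for _ in range(n)]       # best previous state at each point
    for t in range(C):
        scores[0][t] = emit(tags[t], 0)
    for i in range(1, n):
        for t in range(C):                   # O(NC^2): C^2 work per position
            best = max(range(C),
                       key=lambda p: scores[i - 1][p] + trans(tags[p], tags[t]))
            back[i][t] = best
            scores[i][t] = (scores[i - 1][best]
                            + trans(tags[best], tags[t]) + emit(tags[t], i))
    # follow the backpointers from the best final state to recover the path
    t = max(range(C), key=lambda s: scores[n - 1][s])
    path = [t]
    for i in range(n - 1, 0, -1):
        t = back[i][t]
        path.append(t)
    return [tags[s] for s in reversed(path)]
```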
how to choose a feature set?
experiment on the dev set / development set.
NER?
Named Entity Recognition: identifying entity names in text -- technology that helps computers identify and categorize specific, named things, such as people, places, dates, and organizations. It's like teaching a computer to recognize and understand important information in sentences.
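For example, spaCy ships a pretrained NER pipeline (a sketch; assumes the en_core_web_sm model has been downloaded):

```python
import spacy

# assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama visited Paris in June 2009.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "Barack Obama PERSON", "Paris GPE"
```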
Most common hyperparameters for discriminative models?
learning rate and regularization
Learning rate vs regularization?
learning rate is how much we change the parameters each time we update them -- how big a step to take to be most effective. Regularization is how we control the size of the parameters: many models like to overfit by setting large parameters, so common regularization schemes shrink the parameters slightly every step.
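One stochastic-gradient step makes both knobs concrete (a sketch; lr and l2 stand for the learning rate and L2 regularization strength):

```python
def sgd_step(weights, gradient, lr=0.1, l2=0.01):
    # lr scales how far we move; the l2 term shrinks every weight
    # slightly each step, discouraging the large parameters that overfit
    return [w - lr * (g + l2 * w) for w, g in zip(weights, gradient)]
```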
Greedy decoding
make your tagging decisions left to right, but decide the best tag immediately at each point (instead of waiting until the end of the sequence)
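A sketch, assuming a hypothetical score(tag, position, prev_tag) function from the model:

```python
def greedy_decode(tags, n, score):
    path, prev = [], None
    for i in range(n):
        # commit to the best tag at this position immediately: O(NC) overall
        prev = max(tags, key=lambda t: score(t, i, prev))
        path.append(prev)
    return path
```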
Features
often we convert the item being classified into features before classifying it
Parameters vs. Hyperparameters
parameters: like the INGREDIENTS of ML -- the internal values the model learns during training that determine the outcome. Hyperparameters: the external settings you control (e.g., learning rate, regularization strength) to get the BEST outcome; not part of the learned model itself, but they shape how good the final product is.
Regression vs classification
regression is predicting a continuous outcome based on features whereas classification is assigning a label
Most common hyperparameter for generative models?
smoothing
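For example, add-k smoothing of unigram probabilities (Laplace smoothing when k = 1; a sketch):

```python
def smoothed_prob(word, counts, vocab_size, k=1.0):
    # unseen words get a small nonzero probability instead of zero
    total = sum(counts.values())
    return (counts.get(word, 0) + k) / (total + k * vocab_size)

print(smoothed_prob("aardvark", {"the": 5, "cat": 3}, vocab_size=1000))
```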
whats an entity?
something we care about that we want to be able to refer to
cross validation? why not use it?
splitting your data into parts, training your model on some parts and testing it on the others, rotating so every part gets used for testing -- this checks that your model performs well across all of the data. We don't use it much because it requires training many models (e.g., 10 for 10-fold cross-validation), which is expensive.
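scikit-learn's cross_val_score shows the cost directly: cv=5 below trains five separate models (a sketch with toy data):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = [[i] for i in range(10)]
y = [0] * 5 + [1] * 5

# five folds -> five trained models, each tested on a held-out fifth
print(cross_val_score(LogisticRegression(), X, y, cv=5))
```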
information extraction
teaching computers to find and understand important details, like names of people or dates, in sentences.
entity linking
the process of connecting or linking a named entity mentioned in text to a specific entry or identity in a knowledge base or database.
when would macro f1 score be undefined?
when a class has no true positives, precision and recall are both zero, so F1 = 2PR/(P+R) divides by zero and is undefined -- and the macro average over the classes is then undefined too.
why use NER?
to figure out which entities appear in the data, as a first step toward entity linking and extracting information from the input.
Machine translation (MT)
translates data from a source language to a target language, for example English to Spanish
BLEU
way to measure how well a machine translates one language into another, by comparing n-gram overlap between the system output and human reference translations... the higher the BLEU score, the better the translation!!
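A sketch with NLTK's implementation (assumes nltk is installed and inputs are pre-tokenized; the weights restrict scoring to unigrams and bigrams so this toy example isn't zeroed out by missing 4-grams):

```python
from nltk.translate.bleu_score import sentence_bleu

references = [["the", "cat", "is", "on", "the", "mat"]]   # one or more references
candidate = ["the", "cat", "sat", "on", "the", "mat"]

# weights=(0.5, 0.5): unigram + bigram BLEU (the default uses up to 4-grams)
print(sentence_bleu(references, candidate, weights=(0.5, 0.5)))
```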