AI - quiz4
"Linearly separable"
a half-plane or hyperplane exists that accurately (or almost!) divides one class from the other
Loss functions for a perceptron (e.g. 0/1, L1, L2, cross-entropy)
"zero-one loss" -> the penalty for getting the decision wrong Wrong answer = 1 Correct answer = 0 Suppose that we have a training pair (x,y). So y is the correct output. Suppose that ŷ is the output of our unit (weighted average and then activation function) L2 norm: square(y - ŷ) cross-entropy loss: -(ylogŷ + (1 - y) log(1 - ŷ)) In both cases, our goal is to minimize this loss averaged over all our training pairs!
William Labov
(1975, "The boundaries of words and their meanings") showed that the relative probabilities of two labels (e.g. cup vs. bowl) change gradually as properties (e.g. aspect ratio) are varied.
The BERT language model
(Bidirectional Encoder Representations from Transformers) is a state-of-the-art language model introduced by researchers at Google in 2018, and it represents a significant advancement in NLP. It is trained by masking off words in training sentences and learning to predict what they were!
Applications of Computer Vision
* Obstacle avoidance
* Classification — a classifier can also register an object by producing its exact position, pose, and orientation.
* 3D reconstruction — if we have enough matched features, we can reconstruct the camera motion by extracting local features from a pair of images and using these as guide points to merge the two images into one. If we have enough training examples, or enough prior knowledge, it's possible to reconstruct 3D information from a single image.
* Image generation — relatively recent neural net algorithms can create random images similar to a set of training images. We can also add objects to a scene.
* Predicting the future — a much harder, but potentially more useful, task would be to predict the future from our current observations.
What can we tune?
* Parameters: values learned directly from the training set (e.g. probabilities in Bayes nets, weights on the elements in neural nets)
* Hyperparameters: tuning constants adjusted using the development data (e.g. the Laplace smoothing constant in naive Bayes)
* General design of the algorithm, e.g. neural net vs. HMM
* Geometry of the model, e.g. the number of units and the connections in a Bayes net or neural net
* Theory-based parameters, e.g. "a word must have at least one vowel."
Automated scripts are often used to adjust hyperparameters and parts of the model. This makes top-notch neural nets costly to fine-tune. It's essential to keep these adjustments in mind, as they can lead to models that don't work well with new data.
Basics of how perceptrons work
* have LIMITED capabilities but are particularly EASY to train
* used for binary classification: a perceptron categorizes input data into one of two classes
* the decision boundary is linear, so they are not suitable for problems where the boundary is nonlinear (e.g. the XOR problem)
EQUATIONS:
1) w1*x1 + ... + wn*xn + b >= 0
To avoid having to deal separately with the bias term b, we replace it with a fake feature x0 that is always set to 1, plus an additional weight w0:
2) w0*x0 + w1*x1 + ... + wn*xn >= 0
We'll follow the neural net terminology and call the final function the "activation function":
3) sign(w0*x0 + w1*x1 + ... + wn*xn) = 1 or -1
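A sketch of this decision rule with the fake x0 = 1 feature folded in (the weights and inputs are made-up examples):

```python
def perceptron_predict(weights, features):
    # Prepend the fake feature x0 = 1 so the bias is just another weight w0.
    x = [1.0] + list(features)
    total = sum(w * xi for w, xi in zip(weights, x))
    # Step-function activation: return the sign of the weighted sum.
    return 1 if total >= 0 else -1

# w0 (bias) = -1, w1 = 2, w2 = 0.5
print(perceptron_predict([-1.0, 2.0, 0.5], [0.5, 0.8]))   # -> 1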
Activation functions for a perceptron: equations for sigmoid and ReLU!
* a classical perceptron returns only 1 or -1, so it gives no model of uncertainty about the decision. When we move to differentiable units for building neural nets, the popular choices of differentiable activation functions include the logistic (aka sigmoid), tanh, and the rectified linear unit (ReLU).
sigmoid: an S-shaped curve, 1 / (1 + e^-x), returns values between 0 and 1
ReLU: an upward slope from (0,0); if x is positive, f(x) = x; if x is negative, f(x) = 0
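Both activation functions in Python (a minimal sketch):

```python
import math

def sigmoid(x):
    # Logistic function: S-shaped curve, output strictly between 0 and 1.
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # Rectified linear unit: 0 for negative inputs, identity for positive inputs.
    return max(0.0, x)

print(sigmoid(0.0), sigmoid(4.0), relu(-2.0), relu(3.0))
```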
L1 vs. L2 norm
Ways to measure the difference between f(xi) and yi in a regression model:
* L1 norm or Manhattan distance: SIGMA abs(xi - yi) (simpler, and less sensitive to outliers)
* L2 norm or straight-line distance: sqrt(SIGMA (xi - yi)^2) (more complicated; squaring makes it more sensitive to large errors)
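A quick sketch of both distances (function names mine):

```python
import math

def l1_distance(xs, ys):
    # Manhattan distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(xs, ys))

def l2_distance(xs, ys):
    # Straight-line (Euclidean) distance: square root of summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))

print(l1_distance([1, 2, 3], [2, 4, 3]))   # 3
print(l2_distance([1, 2, 3], [2, 4, 3]))   # sqrt(5)
```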
Entropy: definition, how it relates to evaluating possible splits in a decision tree
A good split produces subpools that are less diverse (more uniform). *SO Diversity/uniformity is measured using entropy****
* entropy = the number of bits required to (optimally) compress a pool of values
Entropy definition: -SIGMA over classes c of [P(c) * log2(P(c))]
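A minimal sketch of computing the entropy of a pool of labels (names and examples are mine):

```python
import math
from collections import Counter

def entropy(labels):
    # -sum over classes c of P(c) * log2(P(c)); measured in bits.
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(entropy(["a", "a", "b", "b"]))   # 1 bit: maximally diverse for 2 classes
print(entropy(["a", "a", "a", "a"]))   # 0 bits: completely uniform pool
```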
Batch vs. incremental training
Batch: train first, then test.
Incremental: test as we train.
CV - Image formation:
Cameras and human eyes map part of the 3D world onto a 2D picture.
Image formation (pinhole camera, real lenses, human eye):
- Pinhole camera: light rays pass through the pinhole (P) and form an image on the camera's sensor array.
- Real cameras use a lens to focus light coming in from a wider aperture.
Digitization (computer, human, including color):
- Digital cameras take the image from the lens and turn it into a grid of light intensity values, using filters to determine red, green, or blue light at each point. This data is then transformed into pixels, each having red, green, and blue values. Modern cameras create very high-resolution images, often more detailed than what AI can process.
- In contrast, our eyes have a special area, the fovea, that sees in high detail, while the surrounding areas see in lower detail, helping with tasks like avoiding obstacles. The fovea contains a dense pattern of cone cells; the outer region of the eye has a mix of cones and smaller rod cells. The irregular layout helps avoid "aliasing," in which incorrectly sampled high-frequency patterns create unwanted patterns in images.
Edge detection, segmentation
Adjusting weights for a differentiable unit using gradient descent. Why do we need the activation and loss functions to be differentiable? Main update equation (not details of all the derivatives)
Classical perceptrons have a step-function activation, so they return only 1 or -1, and accuracy is just the count of how many times we got the answer right. When we generalize to multi-layer networks, it is much more convenient to have a differentiable output, so that we can use gradient descent to update the weights. To do this, we need an activation function at each layer and one loss function at the very end. Gradient descent relies on computing the gradient of the loss function with respect to the model's weights. This calculation uses the chain rule, which propagates gradients backward through the network. If the loss or activation functions are not differentiable, this gradient can't be computed, and weight updates using gradient descent become impossible. Differentiability allows smooth and stable optimization of the weights with gradient-based methods such as gradient descent.
EQUATION (assuming the logistic/sigmoid activation function and the L2 loss):
wi = wi + α * (y - ŷ) * ŷ * (1 - ŷ) * xi
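A sketch of one such update for a single sigmoid unit, under the same sigmoid + L2 assumptions (the learning rate and example values are made up):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_weights(weights, features, y, alpha=0.1):
    # Forward pass: weighted sum of features, then sigmoid activation.
    x = [1.0] + list(features)                     # fake x0 = 1 for the bias
    y_hat = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    # Gradient-descent update for sigmoid activation + L2 loss:
    #   w_i <- w_i + alpha * (y - y_hat) * y_hat * (1 - y_hat) * x_i
    return [w + alpha * (y - y_hat) * y_hat * (1 - y_hat) * xi
            for w, xi in zip(weights, x)]

print(update_weights([0.0, 0.0, 0.0], [1.0, 2.0], y=1))
```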
CIFAR-10 dataset
Each image is 32 by 32, with 3 color values at each pixel. So we have 3072 feature values for each example image.
Overall training algorithm (e.g. epochs, random processing order)
Epoch = one pass of the entire training data through the update rule. Training a linear classifier uses multiple epochs.
Local correlations: if you combine multiple documents about science to form a dataset, then several consecutive data points (individual documents or chunks of them) might be closely related or similar because they all talk about science.
Because of local correlations, process the data in a random order, to give the model a more diverse mix of examples throughout its learning process (sketched below).
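A sketch of the overall loop, with shuffling each epoch to break up local correlations (the update rule is passed in as a parameter; the dummy rule and example data are mine):

```python
import random

def train(examples, weights, update_rule, epochs=10):
    # One epoch = one pass of the whole training set through the update rule.
    for _ in range(epochs):
        random.shuffle(examples)      # random processing order each epoch
        for features, label in examples:
            weights = update_rule(weights, features, label)
    return weights

examples = [([1.0, 2.0], 1), ([2.0, 0.5], -1), ([0.3, 0.8], 1)]
dummy_rule = lambda w, x, y: w        # stand-in for a real perceptron update
print(train(examples, [0.0, 0.0, 0.0], dummy_rule))
```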
Linear Classification Big Picture
Linear classifiers classify input items using a linear combination of feature values. Depending on the details, and the mood and theoretical biases of the author, a linear classifier may be called:
* a perceptron
* a single unit of a neural net
* logistic regression
* a support vector machine (SVM)
A key limitation of linear units is that class boundaries must be linear, and real class boundaries are frequently not linear. So linear units are typically
* used in groups (e.g. neural nets), and/or
* fed data after some prior processing (e.g. non-linear or dimension-reducing transformations).
Decision trees, random forests
Nodes: ask questions to classify data. Types: yes/no, multiple choice, value threshold (e.g., height > 6 feet).
Leaf (end point): the goal is for each leaf to hold one clear category; in reality, leaves sometimes contain mixed categories due to unclear features.
* Output can be binary (poisonous/not) or varied (types of plants).
* Strength: uses diverse data types, not just numbers.
Deep trees:
* Risk: can overfit (not generalize well).
* Solution: use a "random forest" -> multiple smaller trees, each made by random selection of features/questions. (A toy tree is sketched below.)
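A toy, hand-built tree showing the node types; the feature names and thresholds are hypothetical, not from any real dataset:

```python
def classify_mushroom(sample):
    # Each internal node asks a question about one feature; leaves return a category.
    if sample["cap_diameter_cm"] > 6:                              # value-threshold question
        return "poisonous" if sample["has_ring"] else "edible"    # yes/no question
    return "edible"

print(classify_mushroom({"cap_diameter_cm": 8, "has_ring": True}))   # poisonous
```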
Multi-class perceptrons
Perceptrons can easily be generalized to produce one of a set of class labels, rather than a yes/no answer. We use several parallel classifier units, each with its own set of weights, and join the individual outputs into one answer using argmax (sketched below).
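A minimal sketch of this: one linear unit per class, then argmax over the scores (weights and inputs are made up):

```python
def multiclass_predict(weight_vectors, features):
    # One parallel linear unit per class, each with its own weight vector (bias first).
    x = [1.0] + list(features)
    scores = [sum(w * xi for w, xi in zip(ws, x)) for ws in weight_vectors]
    # argmax: return the index of the class with the highest score.
    return max(range(len(scores)), key=lambda i: scores[i])

# Three classes, two input features.
weights = [[0.1, 1.0, -0.5], [-0.2, 0.3, 0.8], [0.0, -1.0, 0.4]]
print(multiclass_predict(weights, [0.6, 0.9]))
```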
Workarounds for limited training data
* Re-purposing layers trained for another purpose
* Creating training pairs by removing information
* Self-supervised, semi-supervised, and unsupervised methods
For example, current neural net software lets us train an entire multi-layer network as a unit, so we need correct outputs only for the final layer. This type of training optimizes the early layers only for this specific task.
Challenges with determining the correct answer:
* specific vs. general label
* unfamiliar objects, unfamiliar words
* context may affect the best label to choose
* deciding what's important in complex scenes, extended sentences
Specific vs. general label: labels that are too general might miss important nuances, while labels that are too specific can make the classification task unnecessarily complex and require a much larger amount of training data.
Unfamiliar objects or words: without prior exposure during training, systems can mislabel or misclassify unfamiliar items, leading to errors.
Context effect: systems not trained to understand context can produce incorrect or nonsensical labels. For example, the word "bank" in "I went to the bank" vs. "I sat by the river bank."
Deciding what's important: without clear guidance or training, a system might focus on less important details and miss the central theme or subject, leading to misinterpretations or misclassifications.
Types of classifiers
* Statistical (e.g. naive Bayes, HMMs)
* Simple traditional classifiers: decision trees, k-nearest neighbor
* Linear classifiers (perceptrons)
* Neural nets
Rule for updating perceptron weights
Suppose that x is a feature vector, y is the correct class label, and ŷ is the class label computed using our current weights. Then our update rule is:
* if ŷ = y -> do nothing
* otherwise -> wi = wi + α * (y - ŷ) * xi
α = the "learning rate" constant, which controls how quickly the update process changes in response to new data.
This training procedure will converge if:
* the data are linearly separable, and
* we throttle the size of the updates as training proceeds by decreasing α (e.g. α proportional to 1000/(1000 + t))
(A sketch of the rule appears below.)
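A minimal sketch of one perceptron update, assuming labels of 1/-1 and a fake x0 = 1 bias feature (the learning rate and example values are mine):

```python
def perceptron_update(weights, features, y, alpha=0.5):
    # Predict with the current weights (step-function activation).
    x = [1.0] + list(features)
    y_hat = 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else -1
    if y_hat == y:
        return weights                                   # right answer: do nothing
    # Wrong answer: w_i <- w_i + alpha * (y - y_hat) * x_i
    return [w + alpha * (y - y_hat) * xi for w, xi in zip(weights, x)]

print(perceptron_update([0.0, 0.0, 0.0], [1.0, 2.0], y=-1))
```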
Limitations of perceptrons and ways to address them
The decision boundary can only be a line (hyperplane). FIX:
* use multiple units (a neural net), and/or
* massage the input features to make the boundary linear
If there is overlap between the two categories, the learning process can thrash between different boundary positions! FIX:
* reduce the learning rate as learning progresses
* don't update weights any more than needed to fix the mistake on the current example
* cap the maximum change in weights (e.g. this example may have been a mistake)
If there is a gap between the two categories, the training process may have trouble deciding where to place the boundary line. FIX:
* switch to a closely-related learning method called a "support vector machine" (SVM). The big idea for an SVM is that only examples near the boundary matter, so we try to maximize the distance between the closest sample point and the boundary (the "margin").
One-hot representations
There is no (reasonable) way to produce a single output variable that takes discrete values (one per class), so we use a "one-hot" representation of the correct output. A one-hot representation is a vector with one element per class: the target class gets the value 1 and the other classes get 0. E.g. if we have 8 classes and we want to specify the third one, our vector will look like: [0, 0, 1, 0, 0, 0, 0, 0]
* works well for medium-sized collections!
* the network outputs a sequence of values v1 ... vn, BUT the vi can be negative, so we use SOFTMAX (see below)
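A one-line sketch of building such a vector (function name mine):

```python
def one_hot(target_class, num_classes):
    # Vector with one element per class: 1 for the target class, 0 elsewhere.
    return [1 if i == target_class else 0 for i in range(num_classes)]

# 8 classes, third class (index 2):
print(one_hot(2, 8))   # [0, 0, 1, 0, 0, 0, 0, 0]
```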
Logistic Regression vs Naive Bayes
While both methods use features to categorize data, logistic regression handles varied feature scales and correlated features better than naive Bayes.
what does overfitting refer to?
an overfit model is too closely tailored to the specific dataset it was trained on, often at the cost of its ability to generalize to new, unseen data.
What are we minimizing when we adjust the weights? What does the classification function consist of?
We are minimizing the loss, averaged over all the training pairs. The classification function consists of:
* the weighted sum of input feature values
* the activation function
* the loss function
Data for supervised training
correct or "gold" answers Noise in "correct" answers/annotation Annotators with limited training + annotation is done "quick and dirty" -> so trained annotators make errors. The supposedly correct answers may have been scraped off the web and not fully examined for correctedness.
Softmax
A differentiable version of argmax.
* maps each vi to e^vi / SIGMA_j(e^vj), which forces all values to be positive, and the denominator normalizes all values into the range [0, 1] (they sum to 1)
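A minimal sketch of softmax in Python (example inputs are made up):

```python
import math

def softmax(vs):
    # Maps each v_i to e^{v_i} / sum_j e^{v_j}: all outputs positive, summing to 1.
    exps = [math.exp(v) for v in vs]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, -1.0, 0.5]))   # the largest input gets the largest share
```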
Uses for classification
* labelling objects (using context and intrinsic properties)
* making decisions
CV - Classification:
* Identifying/naming objects in a picture
* Localizing/registering objects within a picture
* Visual question answering, captioning, semantic role labelling for a picture
These tasks are hard because:
* objects may be occluded - they disappear behind other objects
* objects have moving parts
* lighting varies
* accommodation - the same scene imaged with different focus points
* natural objects are rarely identical
Multi-layer system
Refers to a multi-layer perceptron (MLP), a kind of artificial neural network composed of multiple layers of nodes (or neurons): an input layer, one or more hidden layers, and an output layer. Each layer processes the inputs it receives, and the final output layer produces the prediction or classification. The use of multiple layers allows the network to learn more complex, non-linear patterns in the data.
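A tiny forward-pass sketch of such a network (2 inputs, 2 hidden sigmoid units, 1 output unit; all weights are made-up examples):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weight_rows):
    # One layer: each unit takes a weighted sum of the inputs (bias is the
    # first entry of its row) and passes it through the activation function.
    return [sigmoid(row[0] + sum(w * x for w, x in zip(row[1:], inputs)))
            for row in weight_rows]

def mlp_forward(inputs, hidden_weights, output_weights):
    hidden = layer(inputs, hidden_weights)        # hidden layer
    return layer(hidden, output_weights)          # output layer

hidden_w = [[0.1, 0.5, -0.3], [-0.2, 0.8, 0.4]]   # 2 hidden units
output_w = [[0.0, 1.0, -1.0]]                      # 1 output unit
print(mlp_forward([0.6, 0.9], hidden_w, output_w))
```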
k-nearest neighbors (how it works, what happens if you change k)
For k = 1: find the most similar training example and copy its label. More generally, k-nearest neighbors (k-NN) classifies an input based on the majority class among its k closest examples from the training data.
INCREASING k makes the classification smoother and reduces overfitting. BUT k-NN is memory-costly, since it stores all the training data, and slow at classification time.
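A minimal k-NN sketch using straight-line distance and a majority vote (the toy cup/bowl data echoes the Labov example above; all values are made up):

```python
import math
from collections import Counter

def knn_classify(query, training_data, k=3):
    # Sort training examples by distance to the query, take the k closest,
    # and return the majority label among them.
    by_distance = sorted(training_data, key=lambda ex: math.dist(query, ex[0]))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

data = [([1.0, 1.0], "cup"), ([1.1, 0.9], "cup"), ([1.2, 1.1], "cup"),
        ([3.0, 3.2], "bowl"), ([2.9, 3.1], "bowl")]
print(knn_classify([1.05, 1.0], data, k=1))   # nearest neighbour's label
print(knn_classify([2.0, 2.0], data, k=3))    # majority vote among 3 neighbours
```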