Natural Language Processing (AI)
Deep Learning + NLP
- Uses neural network-based methods - Introduced the concept of contextual understanding - Automatic feature engineering - Requires massive amounts of data but little human intervention
Data Splitting
Divide the dataset into training and testing sets. The training set is used to train the model, and the testing set is used to evaluate its performance.
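A minimal sketch of such a split, assuming scikit-learn is available (the toy texts, labels, and split ratio are illustrative choices, not from these notes):

```python
# Hypothetical example: split a small labelled text dataset into train and test sets.
from sklearn.model_selection import train_test_split

texts = ["great product", "terrible service", "loved it", "would not recommend"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# test_size=0.25 keeps one example aside for evaluation in this toy dataset
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)
print(len(X_train), "training examples,", len(X_test), "test example")
```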
Performance Evaluation
This is the process of using the trained model to make predictions on previously unseen, labelled data in order to measure how well it performs
F1 score
a number between 0 and 1; it is the harmonic mean of precision and recall
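As a formula, writing P for precision and R for recall:

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```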
Pre-processing
transforms the data into a format that is more easily and effectively processed
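A minimal pre-processing sketch using NLTK (lowercasing, tokenization, and stopword removal are illustrative steps; the exact pipeline depends on the task):

```python
# Illustrative pre-processing: lowercase, tokenize, drop stopwords and punctuation.
import nltk
from nltk.corpus import stopwords

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # stopword lists

def preprocess(text):
    tokens = nltk.word_tokenize(text.lower())
    stop = set(stopwords.words("english"))
    return [t for t in tokens if t.isalpha() and t not in stop]

print(preprocess("The quick brown fox jumps over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```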
Model
mathematical representation of the learning that has been acquired
Feature Extraction
means to extract and produce feature representations that are appropriate for the NLP task at hand.
Python and the Natural Language Toolkit (NLTK)
open-source collection of libraries, programs, and educational resources for building NLP programs
Multiclass and Multi-label Classification
reviewing textual data and assigning either exactly one label from more than two possible classes (multiclass) or one or more labels (multi-label) to each text.
TF-IDF
stands for Term Frequency-Inverse Document Frequency. It can be defined as the calculation of how relevant a word is to a document within a series or corpus - a statistical measure used to determine the mathematical significance of words in documents
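A minimal TF-IDF sketch, assuming scikit-learn (the toy corpus is illustrative):

```python
# Turn a tiny corpus into TF-IDF weighted vectors.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse matrix: documents x vocabulary
print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray().round(2))                # TF-IDF weight of each word in each document
```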
Syntax
studies sentence structure and the rules governing how words combine to form meaningful expressions
Bag of Words
text is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity.
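A minimal bag-of-words sketch using only the Python standard library (word counts, no grammar or order):

```python
# Bag of words: count token occurrences, ignoring order and grammar.
from collections import Counter

text = "the cat sat on the mat and the cat slept"
bag = Counter(text.split())
print(bag)
# 'the' appears 3 times, 'cat' 2 times, every other token once
```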
Types of Classification
- Binary - Multiclass and Multi-label
Techniques for Imbalanced Datasets
- Collect more data - Resample the dataset (over-sampling and under-sampling) - Generate synthetic samples (SMOTE) - Try different algorithms
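A minimal over-sampling sketch with SMOTE, assuming the imbalanced-learn package is installed (the synthetic toy dataset below is purely illustrative):

```python
# Over-sample the minority class with SMOTE so both classes are equally represented.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_resampled))  # classes are now balanced
```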
Quality Training Data
- Compatible with the task - Fairly balanced - Representative
Challenges in NLP
- Data quality - domains with limited data - Ambiguity - words having several meanings; sarcasm, irony, and figurative language can be difficult for machines to understand - Domain-specific language - different jargon and terminologies for different domains - Lack of interpretability - NLP models can be difficult to interpret - Ethics and bias - NLP applications can perpetuate biases in the data used to train them, leading to unfair and discriminatory outcomes - Privacy - NLP technology relies on vast amounts of data, which can be used to track people's behavior and preferences
Transfer Learning + NLP
- Enables the transfer of knowledge learned by a model on one task to another task - Needs less training data - Makes use of pre-trained models such as BERT and GPT-2
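A minimal sketch of reusing a pre-trained model via the Hugging Face transformers library (the default sentiment model the pipeline downloads, and the library's availability, are assumptions, not part of these notes):

```python
# Use a pre-trained transformer for sentiment analysis without training from scratch.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default pre-trained model
print(classifier("Transfer learning lets us reuse knowledge from pre-trained models."))
# e.g. [{'label': 'POSITIVE', 'score': ...}]
```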
Language Modeling Examples
- Next word prediction - Language generation - Text completion
Fundamentals of Linguistics
- Syntax - Semantics - Pragmatics
NLP Use Cases
- Text Classification - Language Modeling - Speech Recognition - Document Summarization and Question Answering - Machine Translation - Caption Generation
Traditional ML + NLP
- intersection of computer science and statistics where algorithms are used to perform a specific task without being explicitly programmed - recognizes patterns in the data and makes predictions once new data arrives.
Recall
- measure of how many actual positive observations are predicted correctly, i.e. how many observations of the positive class are actually predicted as positive - also known as sensitivity
Precision
- measure of correctness achieved in positive prediction - it tells us how many of all the instances predicted as positive are actually positive - also known as positive predictive value
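The standard formulas, in terms of true positives (TP), false positives (FP), and false negatives (FN):

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
```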
NLP Use Cases
- spam detection - language translation - virtual agents and chatbots
Supervised Learning
- text classification, sentiment analysis, named entity recognition, and machine translation
Unsupervised Learning
- text summarization, topic modeling, and word embedding
Supervised Learning
- trained on labeled dataset - output is predicted by the supervised learning model - predict outcomes for new data - Regression and classification tasks
Unsupervised Learning
- trained on unlabeled texts - hidden patterns are discovered using the unsupervised model - finds useful insights and hidden patterns in an unknown dataset - Clustering and association tasks
NLP Goal
to create computers/systems that can understand human language and communicate with humans in a natural way.
Text Classification Application
1. Language Detection 2. Sentiment Analysis 3. Spam Filtering 4. Email Routing
Dataset
A ________ to provide examples for training the classifier.
Tool
A _________ for generating and consuming the classifier.
Language Modeling
A concept that focuses on understanding the structure and grammar of natural language. It's like teaching a computer how sentences are formed and what words are likely to come next based on the context of the sentence.
Parsing
Analyzing the grammatical structure of a sentence and identifying its constituent parts.
Text Summarization
Automatically generating a summary of a text that captures the most important information.
Word sense disambiguation
Determining the correct meaning of a word based on its context
Internal Data
Generated from an organization's own apps and tools - chat apps - help desk software - survey tools
Large Language Models
Large language models are advanced AI systems that are capable of understanding and generating human language. They are built using complex neural network architectures, such as transformer models, inspired by the human brain.
Parts-of-speech Tagging
Meaning of the POS Tagging acronym.
Natural Language Generation
Using computer algorithms to automatically generate natural language text, such as news articles or product descriptions
Topic Modeling
a technique used to discover prominent and underlying topics within a collection of documents without any prior knowledge or labeled examples.
Text Classification
aka Text Categorization, is the activity of labeling natural language texts with relevant categories from a predefined set.
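A minimal text classification sketch, assuming scikit-learn (bag-of-words features plus a Naive Bayes classifier; the tiny spam/ham training set is illustrative):

```python
# Train a tiny text classifier and label a new sentence.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["win a free prize now", "claim your free reward",
               "meeting moved to 3pm", "see you at lunch tomorrow"]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["free prize waiting for you"]))  # expected: ['spam']
```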
N-grams
contiguous sequences of n words from a text; n-gram language models are built by counting how often word sequences occur in corpus text and then estimating their probabilities
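A minimal sketch of extracting n-grams with NLTK (bigrams here; counting them over a real corpus is how the probabilities would be estimated):

```python
# Extract bigrams (n=2) from a tokenized sentence and count them.
from collections import Counter
from nltk.util import ngrams

tokens = "the cat sat on the mat".split()
bigrams = list(ngrams(tokens, 2))
print(bigrams)
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
print(Counter(bigrams))
```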
Binary Classification
classifying data into two mutually exclusive groups or categories
Pragmatics
deals with how language is used in context and how the speaker, the listener, and the surrounding situation influence meaning
Semantics
focuses on the study of meaning in language. It explores how words, phrases, and sentences convey information and represent concepts
External Data
from the web - Web scraping or public datasets - Kaggle, Hugging Face Datasets
Word2Vec
group of related models that are used to produce word embeddings
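A minimal Word2Vec sketch, assuming the gensim library (the toy sentences and parameters are illustrative; real embeddings need a much larger corpus):

```python
# Train word embeddings on a toy corpus, then look up a vector and similar words.
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"],
             ["cats", "and", "dogs", "are", "pets"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=1)
print(model.wv["cat"].shape)                 # (50,) - the embedding for "cat"
print(model.wv.most_similar("cat", topn=2))  # nearest words in the embedding space
```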
POS Tagging
is a process of assigning a part of speech or lexical class marker to each word in a sentence (and all sentences in a corpus).
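A minimal POS tagging sketch with NLTK (tag names follow the Penn Treebank tagset):

```python
# Assign a part-of-speech tag to each token in a sentence.
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]
```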
Natural Language Processing
is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way.
Machine Learning in NLP
is about teaching computers to understand human language by giving them examples, adjusting their learning settings, and then using their learning to make predictions on new data.
Supervised Learning
is so named because the data scientist acts as a guide to teach the algorithm what conclusions it should come up with.
Named Entity Recognition (NER)
is the task of processing a text and identifying named entities in a sentence. Named entities are specific pieces of information, such as names of people, organizations, locations, dates, and more.
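A minimal NER sketch, assuming spaCy and its small English model are installed (an illustrative choice of library, not prescribed by these notes):

```python
# Identify named entities (people, organizations, locations, dates, ...) in a sentence.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
doc = nlp("Apple was founded by Steve Jobs in Cupertino in 1976.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Apple ORG / Steve Jobs PERSON / Cupertino GPE / 1976 DATE
```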
Accuracy
a valid choice of evaluation metric for classification problems that are well balanced and not skewed, i.e. where there is no class imbalance
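The standard formula, counting true/false positives and negatives:

```latex
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```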