Natural Language Processing (AI)

Deep Learning + NLP

- Uses neural network-based methods - Introduced the concept of contextual understanding - Automatic feature engineering - Requires massive amounts of data, but little human intervention

Data Splitting

Divide the dataset into training and testing sets. The training set is used to train the model, and the testing set is used to evaluate its performance.
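
A minimal sketch of such a split, using only the Python standard library. The function name `train_test_split`, the 80/20 ratio, and the sample data are assumptions made for illustration, not something the card prescribes.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labelled examples: (text, label)
data = [(f"message {i}", i % 2) for i in range(10)]
train, test = train_test_split(data, test_fraction=0.2)
print(len(train), len(test))  # 8 2
```

Shuffling before splitting avoids a test set that only contains the last examples collected.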

Performance Evaluation

This is the process where we use the trained model to make predictions on previously unseen, labelled data

F1 score

a number between 0 and 1 and is the harmonic mean of precision and recall
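
As a small illustration, the harmonic mean can be computed directly from precision and recall. The function name `f1_score` and the zero-division convention (return 0.0) are assumptions of this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, in the range 0 to 1."""
    if precision + recall == 0:
        return 0.0  # convention assumed here when both inputs are zero
    return 2 * precision * recall / (precision + recall)

print(f1_score(1.0, 1.0))  # 1.0
print(f1_score(0.5, 1.0) < 0.75)  # True: lower than the arithmetic mean
```

The harmonic mean is pulled toward the smaller of the two values, so a model cannot get a high F1 score by doing well on only one of them.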

Pre-processing

transforms the data into a format that is more easily and effectively processed
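
One possible pre-processing pipeline, sketched under assumptions: the steps (lowercasing, tokenising, stop-word removal) and the tiny `STOPWORDS` set are illustrative choices, and real projects usually take a fuller list from a library.

```python
import re

# Hypothetical, deliberately tiny stop-word list for the example.
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to"}

def preprocess(text):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]

print(preprocess("The movie is GREAT, and the cast is brilliant!"))
# ['movie', 'great', 'cast', 'brilliant']
```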

Model

mathematical representation of the learning that has been acquired

Feature Extracting

means to extract and produce feature representations that are appropriate for a type of NLP task.

Python and the Natural Language Toolkit (NLTK)

open-source collection of libraries, programs, and educational resources for building NLP programs in Python

Multiclass and Multi-label Classification

reviewing textual data and assigning each text either exactly one label chosen from several classes (multiclass) or one or more labels at once (multi-label).

TF-IDF

stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is to one document within a collection (corpus) of documents. - is a statistical measure of the significance of words in documents: the weight rises with how often the word occurs in the document and falls with how many documents contain it
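
A minimal sketch of one common TF-IDF variant: term frequency as count divided by document length, and inverse document frequency as log(N / df). Libraries often use smoothed or normalised variants, so their numbers can differ. The function name `tf_idf` and the toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]
scores = tf_idf(docs)
print(scores[0]["the"])  # 0.0, because "the" occurs in every document
```

In the first document, "sat" gets a higher weight than "cat" because it appears in fewer documents.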

Syntax

studies sentence structure and the rules governing how words combine to form meaningful expressions

Bag of Words

text is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity.
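
A short sketch of the idea using `collections.Counter`. Whitespace tokenisation and lowercasing are simplifying assumptions.

```python
from collections import Counter

def bag_of_words(text):
    """Map each word to the number of times it occurs (order is discarded)."""
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")
print(a == b)    # True: word order is not kept
print(a["the"])  # 2: multiplicity is kept
```

The two sentences mean different things but get the same representation, which is the main limitation of this model.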

Types of Classification

- Binary - Multiclass and Multi-label

Techniques for Imbalanced Datasets

- Collect more data - Resample the dataset (over-sampling and undersampling) - Generate synthetic samples (SMOTE) - Try different algorithms
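
Of the techniques listed, random over-sampling is the simplest to sketch. The code below duplicates minority-class examples until every class matches the largest one; it is not SMOTE, which creates new synthetic points rather than copies. The function name and sample data are assumptions.

```python
import random
from collections import Counter, defaultdict

def random_oversample(examples, seed=0):
    """examples: list of (text, label). Returns a class-balanced list."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra copies with replacement until the class reaches the target.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Hypothetical imbalanced data: 8 "ham" messages, 2 "spam" messages
data = [(f"ham {i}", "ham") for i in range(8)] + [(f"spam {i}", "spam") for i in range(2)]
balanced = random_oversample(data)
print(Counter(label for _, label in balanced))
```

Over-sampling should be applied to the training set only, so duplicated examples do not leak into the test set.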

Quality Training Data

- Compatible with the task - Fairly balanced - Representative

Challenges in NLP

- Data quality - domains with limited data - Ambiguity - words with several meanings, sarcasm, irony, and figurative language can be difficult for machines to understand - Domain-specific language - different jargon and terminology for different domains - Lack of interpretability - NLP models are difficult to interpret - Ethics and bias - NLP applications can perpetuate biases in the data used to train them, leading to unfair and discriminatory outcomes - Privacy - NLP technology relies on vast amounts of data, which can be used to track people's behavior and preferences

Transfer Learning + NLP

- Enables the transfer of knowledge learned by a model on one task or dataset to another task - Needs less training data - Uses pre-trained models - Examples: BERT and GPT-2

Language Modeling Examples

- Next word prediction - Language generation - Text completion

Fundamentals of Linguistics

- Syntax - Semantics - Pragmatics

NLP Use Cases

- Text Classification - Language Modeling - Speech Recognition - Document Summarization and Question Answering - Machine Translation - Caption Generation

Traditional ML + NLP

- intersection of computer science and statistics where algorithms are used to perform a specific task without being explicitly programmed - recognizes patterns in the data and makes predictions once new data arrives.

Recall

- the proportion of actual positive observations that are predicted correctly, i.e. how many observations of the positive class are actually predicted as positive - also known as sensitivity

Precision

- measure of the correctness of positive predictions - it tells us how many of all the observations predicted as positive are actually positive - also known as positive predictive value (specificity is a different metric, the true negative rate)
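
To make the two ratios concrete, here is a small sketch that counts true positives, false positives, and false negatives from label lists. The function name, the `positive=1` convention, and the example labels are assumptions.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # TP / all predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0     # TP / all actual positive
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision)  # 0.5 (2 of the 4 positive predictions are correct)
```

Here recall is 2/3, since 2 of the 3 actual positives were found; the two metrics have the same numerator but different denominators.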

NLP Use Cases

- spam detection - language translation - virtual agents and chatbots

Supervised Learning

- text classification, sentiment analysis, named entity recognition, and machine translation

Unsupervised Learning

- text summarization, topic modeling, and word embedding

Supervised Learning

- trained on labeled dataset - output is predicted by the supervised learning model - predict outcomes for new data - Regression and classification tasks

Unsupervised Learning

- trained on unlabeled texts - hidden patterns are discovered using the unsupervised model - Finding useful insights, hidden patterns from the unknown dataset. - Clustering and association tasks

NLP Goal

to create computers/systems that can understand human language and communicate with humans in a natural way.

Text Classification Application

1. Language Detection 2. Sentiment Analysis 3. Spam Filtering 4. Email Routing

Dataset

A ________ to provide examples for training the classifier.

Tool

A _________ for generating and consuming the classifier.

Language Modeling

A concept that focuses on understanding the structure and grammar of natural language. It's like teaching a computer how sentences are formed and what words are likely to come next based on the context of the sentence.

Parsing

Analyzing the grammatical structure of a sentence and identifying its constituent parts.

Text Summarization

Automatically generating a summary of a text that captures the most important information.

Word sense disambiguation

Determining the correct meaning of a word based on its context

Internal Data

Generated from the apps and tools - chat apps - help desk software - survey tools

Large Language Models

Large language models are advanced AI systems that are capable of understanding and generating human language. They are built using complex neural network architectures, such as transformer models, inspired by the human brain.

Parts-of-speech Tagging

Meaning of the POS Tagging acronym.

Natural Language Generation

Using computer algorithms to automatically generate natural language text, such as news articles or product descriptions

Topic Modeling

a technique used to discover prominent and underlying topics within a collection of documents without any prior knowledge or labeled examples.

Text Classification

aka Text Categorization, is the activity of labeling natural language texts with relevant categories from a predefined set.

N-grams

sequences of n consecutive words; an n-gram language model is built by counting how often such word sequences occur in a corpus and then estimating probabilities from those counts
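
A minimal sketch for n = 2 (bigrams), using unsmoothed maximum-likelihood estimates. The function name and the toy corpus are assumptions; practical models add smoothing for unseen sequences.

```python
from collections import Counter

def bigram_probabilities(tokens):
    """Estimate P(next | previous) from raw bigram counts."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # The last token never starts a bigram, so it is left out of the context counts.
    context_counts = Counter(tokens[:-1])
    return {
        (prev, nxt): count / context_counts[prev]
        for (prev, nxt), count in bigram_counts.items()
    }

tokens = "i like tea and i like coffee".split()
probs = bigram_probabilities(tokens)
print(probs[("i", "like")])    # 1.0
print(probs[("like", "tea")])  # 0.5
```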

Binary Classification

classifying data into two mutually exclusive groups or categories

Pragmatics

deals with how language is used in context and how the speaker, the listener, and the surrounding situation influence meaning

Semantics

focuses on the study of meaning in language. It explores how words, phrases, and sentences convey information and represent concepts

External Data

from the web - Web scraping or public datasets - Kaggle, Hugging Face Datasets

Word2Vec

group of related models that are used to produce word embeddings

POS Tagging

is a process of assigning a part of speech or lexical class marker to each word in a sentence (and all sentences in a corpus).

Natural Language Processing

is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way.

Machine Learning in NLP

is about teaching computers to understand human language by giving them examples, adjusting their learning settings, and then using their learning to make predictions on new data.

Supervised Learning

is so named because the data scientist acts as a guide to teach the algorithm what conclusions it should come up with.

Named Entity Recognition (NER)

is to process a text and identify named entities in a sentence. Named entities are specific pieces of information, such as names of people, organizations, locations, dates, and more.

Accuracy

a valid evaluation metric for classification problems whose classes are well balanced, i.e. not skewed and without class imbalance
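
A short sketch of why balance matters. With hypothetical data that is 95% negative, a model that always predicts the negative class still scores high accuracy. The function name and data are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1] * 5 + [0] * 95   # hypothetical skewed dataset
always_negative = [0] * 100   # never predicts the positive class
print(accuracy(y_true, always_negative))  # 0.95
```

The score looks strong even though none of the positive examples is found.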

