AI and National Security Quiz


synthetic data and its pitfalls

Data that is generated artificially rather than collected from the real world. It can be used to train machine learning models, especially when real-world data is scarce or difficult to collect. Benefits: it can be generated in large quantities, it can be privacy-preserving, and it can supply data for rare events or for situations too dangerous or unethical to collect in the real world. Pitfalls: realistic synthetic data can be difficult to generate, it can be biased (like all data), it can be expensive to produce, and it can more easily lead to overfitting.

Cost Function

A number that represents the difference between a model's estimated answers and the actual answers in an optimization problem. It typically starts out large and gets smaller as the model is trained.

Transformer

The underlying neural network architecture that produces content in modern generative models. It enables parallel processing of text, which makes training on very large datasets practical. Through positional encoding it notes the order and position of words, and through attention it learns the importance of word order from data; self-attention in particular allows the model to understand the complex relationships between words in an input sequence.
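
Below is a minimal sketch of the self-attention idea in NumPy (an illustration, not the full Transformer; the dimensions and weight matrices are arbitrary assumptions).

```python
# Scaled dot-product self-attention: every token's output is a weighted mix of
# all tokens' values, with weights based on query/key similarity.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how much each token attends to each other token
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for the softmax
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V                               # each position mixes information from all positions

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                              # e.g., 4 tokens with 8-dimensional embeddings
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (4, 8)
```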

Recall

What proportion of actual positives was identified correctly? Recall = True Positives / (True Positives + False Negatives).

data quality versus data readiness

The intrinsic character of the data: it measures how well a dataset meets criteria for accuracy, completeness, validity, consistency, uniqueness, timeliness, and fitness for purpose, and it is critical to all data governance initiatives within an organization. VERSUS the practical aspects of the data: an indication of how prepared the data is for a specific machine learning task.

Parameter

internal variables that the model learns from the training data during the training process. They are the coefficients or weights that define the relationships between input features and the model's predictions.

Compute supply chain

Involves (1) the sourcing of raw materials, (2) the design of the hardware, and (3) the fabrication or manufacturing of the final products.

data warehouse

A centralized repository that stores structured data from one or more disparate sources; often used for data mining, business intelligence (BI), and other analytical tasks.

Data splitting

Given a dataset, you need separate portions for training, testing, and validation. You should never use your validation data for training, or you will overfit to it.

Step zero for 12 steps of ML

"I wanna develop AI for X particular project. What are the first things I need to do as a part of step zero?" Go through the triad: Do we have the infrastructure and compute? Do we have the talent and the models to create what we need?

Tokenization

Key for natural language processing; the process of breaking down a stream of text into units of data, such as words, punctuation marks, or whitespace characters. Input text: "This is a sentence." Output tokens: ["This", "is", "a", "sentence", "."]
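
A minimal tokenization sketch in Python (a simple regex split; production NLP systems use more sophisticated tokenizers such as BPE or WordPiece).

```python
import re

def tokenize(text):
    # Keep runs of word characters as tokens and each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("This is a sentence."))  # ['This', 'is', 'a', 'sentence', '.']
```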

How AI can fail

- Do institutional dynamics lead to more biased outcomes?
- Lack of diversity and implicit/explicit bias in the workplace
- Bias within AI capabilities themselves
- It can spread harmful content
- Lack of transparency

AI Triad (and what's important about each factor)

1. Data: the foundation of artificial intelligence. AI systems, particularly machine learning models, rely on large, high-quality datasets to learn and make predictions. The quality, quantity, diversity, and representativeness of data are crucial to the success of AI applications, and data is often more important than the algorithm used.
2. Algorithms: the mathematical and computational techniques that AI systems use to process data and make decisions; they are what extract the analysis, intelligence, and patterns from the data, and they are a big part of why AI is more powerful and effective now. Machine learning algorithms such as neural networks, decision trees, and support vector machines are used to train models and extract patterns from data.
3. Compute: the power and infrastructure required for AI tasks, which have grown significantly in recent years. Training deep learning models in particular demands substantial resources, often involving GPUs (Graphics Processing Units) or specialized hardware like TPUs (Tensor Processing Units). Compute is also the easiest factor from which to gauge the intent and capabilities of adversaries.
The triad discounts talent; it takes a lot of skill to really hone it. The triad is useful, but the reality is a lot more complex than three main components.

Challenges of AI explainability

1. Humans have a difficult time interpreting the information from ML models because of the complexity of the models. 2. The metrics we apply to AI explainability are subjective; they aren't really precise.

hyperparameter

A configuration setting or parameter that is not learned from the data but is set prior to training a model. Hyperparameters are essential because they control various aspects of the training process and the architecture of the model. Unlike model parameters, which are learned from the data (e.g., weights in a neural network), hyperparameters are set by the machine learning engineer or researcher and must be tuned to optimize the model's performance.

What is an LLM

A neural network AI model with many parameters that can perform a wide variety of tasks, such as generating, classifying, or summarizing text. LLMs are trained using a combination of supervised learning, unsupervised learning, and reinforcement learning from human feedback (RLHF).

Parallelism

A technique used to perform multiple computations or tasks simultaneously, rather than sequentially, in order to speed up the execution of AI algorithms and models. Example: instead of computing 1 + 1 + 1 + 1 = 4 sequentially, you can assign one 1 + 1 to one CPU and the other 1 + 1 to another CPU; at scale this saves significant time.
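
A minimal sketch of that split-the-additions idea using Python's standard-library concurrent.futures (the chunking and worker count are illustrative choices).

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    return sum(chunk)

if __name__ == "__main__":
    numbers = [1, 1, 1, 1]
    chunks = [numbers[:2], numbers[2:]]                 # give each worker half of the additions
    with ProcessPoolExecutor(max_workers=2) as pool:
        partials = list(pool.map(partial_sum, chunks))  # the two halves are computed in parallel
    print(sum(partials))                                # 4
```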

backpropagation algorithm

The algorithm that computes the gradient of the cost function with respect to the weights, so the weights can be adjusted in small steps. It works backwards through the network, giving more weight to the neurons that originally pointed toward the correct answer (the input really was X, or closer to X than Y) and less weight to the neurons that said it was Y.
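
A minimal sketch of backpropagation for a tiny one-hidden-layer network in NumPy (the layer sizes, learning rate, and random data are arbitrary assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))            # 4 samples, 3 input features
y = rng.normal(size=(4, 1))            # target values

W1 = rng.normal(size=(3, 5)) * 0.1     # input -> hidden weights
W2 = rng.normal(size=(5, 1)) * 0.1     # hidden -> output weights
lr = 0.1

for step in range(500):
    h = np.tanh(x @ W1)                # forward pass: hidden activations
    pred = h @ W2                      # forward pass: output
    err = pred - y                     # how wrong the network is

    grad_W2 = h.T @ err / len(x)            # backward pass: blame for the hidden -> output weights
    grad_h = err @ W2.T * (1 - h ** 2)      # push the error back through tanh
    grad_W1 = x.T @ grad_h / len(x)         # blame for the input -> hidden weights

    W1 -= lr * grad_W1                 # nudge each weight downhill
    W2 -= lr * grad_W2

print(round(float(np.mean(err ** 2)), 4))   # the cost shrinks as training proceeds
```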

Data splitting

Also known as data partitioning, is a technique used in machine learning to divide a dataset into two or more subsets. The subsets are typically used for training, validation, and testing. The training set is used to train the machine learning model. The validation set is used to evaluate the performance of the model during training. The testing set is used to evaluate the performance of the model on unseen data.
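
A minimal train/validation/test split sketch using scikit-learn's train_test_split (the 70/15/15 ratios and toy data are assumptions, not from the source).

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # toy features
y = np.arange(100)                  # toy labels

# Carve off the test set first, then split the remainder into train and validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.15 / 0.85, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 70 / 15 / 15
```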

Gradient Descent

An optimization algorithm commonly used to train machine learning models and neural networks. Training data helps these models learn over time, and the cost function within gradient descent acts as a barometer, gauging accuracy with each iteration of parameter updates.
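
A minimal sketch of gradient descent on a one-parameter model y ≈ w * x, minimizing mean squared error (the data and learning rate are illustrative assumptions).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x                        # the true weight is 3; training should discover it

w = 0.0                            # initial guess
learning_rate = 0.01               # step size (see the "learning rate" entry below)

for step in range(200):
    pred = w * x
    cost = np.mean((pred - y) ** 2)        # MSE cost: large at first, shrinks as we train
    grad = np.mean(2 * (pred - y) * x)     # derivative of the cost with respect to w
    w -= learning_rate * grad              # step downhill along the gradient

print(round(w, 3))                 # ~3.0
```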

Data engineering

Broad field that encompasses the collection, storage, processing, and analysis of data.

Moore's Law

From the 1960s until the 2010s, engineering innovations that shrank transistors (miniature semiconductor devices) doubled the number of transistors on a single computer chip roughly every two years. Computer chips became millions of times faster and more efficient during this period.

What would your approach be to data for a machine learning solution?

Go through the process of sourcing, splitting, and evaluating your data.

GAN

Introduced in 2014 by Ian Goodfellow, GANs provide a means to generate new data that resembles the training data. A generator generates fake outputs; a discriminator (or evaluator) tries to differentiate real outputs from fake ones. The two networks "compete" against one another in a zero-sum game, in which the generator attempts to come up with fakes and the evaluator determines whether they are real or fake.
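
A minimal GAN sketch in PyTorch on a toy 1-D problem (the architectures, learning rates, and data distribution are assumptions; real GANs take far more care to train).

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))                 # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(3000):
    real = torch.randn(64, 1) + 3.0            # "real" data: Gaussian centered at 3
    fake = G(torch.randn(64, 1))               # generator output from random noise

    # Train the discriminator: label real samples 1 and fakes 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator: try to fool the discriminator into outputting 1 for fakes.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(G(torch.randn(1000, 1)).mean().item())   # should drift toward the real mean (~3)
```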

Learning systems

Machines that get little direct instruction and teach themselves what they need to know based on feedback

Importance of data

Often more important than the algorithm used; quantity, quality, and representativeness all matter.

AI misalignment

Often referred to as "alignment" or "value alignment," is a critical concept in the field of artificial intelligence ethics and safety. It refers to the problem of ensuring that advanced AI systems, particularly those with significant autonomy and decision-making capabilities, align with human values, goals, and objectives. The concern is that if AI systems are not properly aligned with human values, they may take actions or make decisions that are harmful, undesirable, or contrary to our intentions.

Adversarial learning

The process of extracting information about the behavior and characteristics of an ML system and/or learning how to manipulate the inputs into an ML system in order to obtain a preferred outcome.
- Poisoning attack: create a weakness in the machine learning system during the training phase that can later be exploited; if attackers can control or manipulate the training data a system sees, they can influence how the model behaves.
- Exploratory attack: white box - direct access to model parameters or data; black box - gain access to model parameters, algorithms, or data indirectly, e.g., via an Oracle attack or cross-model transferability.

Token encoding

Process of mapping tokens to unique integers. This is often done by creating a vocabulary of all the unique tokens in a dataset and then assigning each token a unique integer ID. Vocabulary: ["This", "is", "a", "sentence", "."] Token encoding: {"This": 0, "is": 1, "a": 2, "sentence": 3, ".": 4}
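
A minimal token-encoding sketch in Python, building the vocabulary from the card's example tokens.

```python
tokens = ["This", "is", "a", "sentence", "."]

# dict.fromkeys keeps the first occurrence of each token, giving an ordered vocabulary.
vocab = {tok: idx for idx, tok in enumerate(dict.fromkeys(tokens))}
print(vocab)                          # {'This': 0, 'is': 1, 'a': 2, 'sentence': 3, '.': 4}
print([vocab[t] for t in tokens])     # [0, 1, 2, 3, 4]
```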

Agent

Something that acts to achieve an objective; refers to a software program or entity that can perceive its environment, make decisions or take actions to achieve specific goals, and interact with its surroundings. Examples: self-driving cars, game-playing agents, recommendation agents, etc.

Bias-variance definition and their tradeoff in machine learning.

Bias is the error due to overly simplistic assumptions in the learning algorithm. A model with high bias pays little attention to the training data and is unable to capture the underlying patterns, which leads to underfitting, where the model is too simple to represent the data accurately; models with high bias tend to have poor predictive performance on both the training and test data. VERSUS variance, the error due to too much complexity in the learning algorithm. A model with high variance is highly sensitive to the training data and captures noise in addition to the underlying patterns, which leads to overfitting, where the model fits the training data very well but fails to generalize to new, unseen data; models with high variance have low error on the training data but may have high error on the test data. The challenge is to find the balance that minimizes the total error (the sum of bias and variance) and gives the best generalization performance on unseen data. In practice it is much more likely that you will overfit your training data, so the model will work less well in the real world than it did in the validation phase. Either way, high bias or high variance, you will see poor performance, and you need the right talent to make the call on what to do next with the network and algorithm.
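
A minimal NumPy sketch of the tradeoff: fitting noisy quadratic data with polynomials of increasing degree (the degrees, sample size, and noise level are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 12)
y = x ** 2 + rng.normal(scale=0.05, size=x.size)   # noisy samples of a quadratic
x_new = np.linspace(-1, 1, 200)                    # unseen points from the same range
y_new = x_new ** 2

for degree in (1, 2, 9):
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(degree, round(train_mse, 4), round(test_mse, 4))

# Degree 1 underfits (high bias: both errors stay high); degree 9 chases the noise
# (high variance: training error near zero but typically a worse test error than degree 2).
```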

learning rate

The step the algorithm takes in adjusting weights along the gradient. It determines the size of the steps taken during the optimization process, such as gradient descent, when updating the model's parameters (weights and biases) to minimize the loss function.

NIST AI Risk Management Framework components (7) + trade-off of frameworks

1. Valid and reliable
2. Safe (but how do I know it's safe?)
3. Secure and resilient
4. Accountable and transparent
5. Explainable and interpretable
6. Privacy-enhanced
7. Fair, with harmful bias managed
("PS SAFE" acronym.) The trade-off: sometimes you can get too reliant on the framework, and then you miss your blind spots around edge cases that don't fit the model.

Weights

Value given to the connection between neurons. It's the parameter that represents the strength of a connection between two neurons or nodes in adjacent layers. These control the contribution of one neuron's output to another neuron's input

Precision v recall

Precision: what proportion of positive identifications was actually correct? Precision = True Positives / (True Positives + False Positives). In a disease-screening example, precision tells you how many of the people the model predicted as having the disease actually have it; if precision is high, then when the model says someone has the disease, it's usually right. VERSUS Recall: what proportion of actual positives was identified correctly? Recall = True Positives / (True Positives + False Negatives). Recall tells you how many of the people who actually have the disease were correctly identified by the model; if recall is high, the model is good at finding most of the people with the disease.
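
A minimal worked example of the two formulas, using assumed counts from a hypothetical disease-screening model.

```python
tp, fp, fn = 80, 20, 10             # true positives, false positives, false negatives

precision = tp / (tp + fp)          # of those flagged as sick, how many really are? -> 0.8
recall = tp / (tp + fn)             # of those who really are sick, how many were found? -> ~0.889

print(round(precision, 3), round(recall, 3))
```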

Confusion Matrix (+ its four components)

A chart that helps us understand how well a machine learning model is performing, especially for classification tasks. It's a way to measure how many correct and incorrect predictions the model is making. Example (a model that classifies animals as cats or dogs): True Positives (TP): these are the animals that are truly cats, and the model correctly predicted them as cats. It's like saying, "Yes, it's a cat," and you're right! True Negatives (TN): these are the animals that are truly not cats (in this case, dogs), and the model correctly said they're not cats. It's like saying, "No, it's not a cat," and you're right! False Positives (FP): these are the animals that are actually not cats (dogs), but the model wrongly predicted them as cats. It's like saying, "Yes, it's a cat," but you're wrong because it's a dog. False Negatives (FN): these are the animals that are truly cats, but the model incorrectly said they're not cats. It's like saying, "No, it's not a cat," but you're wrong because it is a cat. By looking at these four numbers, we can calculate various important metrics to understand how well the model is doing, like accuracy, precision, recall, and F1-score. These metrics help us know if the model is good at identifying cats and if it sometimes makes mistakes.
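
A minimal sketch of the cat/dog example using scikit-learn's confusion_matrix (the six labels are made up for illustration).

```python
from sklearn.metrics import confusion_matrix

y_true = ["cat", "cat", "cat", "dog", "dog", "dog"]   # what the animals really are
y_pred = ["cat", "cat", "dog", "dog", "dog", "cat"]   # what the model predicted

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["cat", "dog"]))
# [[2 1]   -> TP (cats called cats) = 2, FN (cats called dogs) = 1
#  [1 2]]  -> FP (dogs called cats) = 1, TN (dogs called dogs) = 2
```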

Extract, Transform, and Load (ETL)

a data integration process that combines data from multiple data sources into a single, consistent data store that is loaded into a data warehouse or other target system. - Extracting the data from the source system. - Transforming the data to conform with the needs of the destination system. - Loading the data into the destination.
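
A minimal ETL sketch using only the Python standard library: extract rows from a CSV, transform them, and load them into a SQLite table (the file name and schema are hypothetical placeholders).

```python
import csv
import sqlite3

# Extract: read raw rows from the source system.
with open("sales_raw.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: conform the data to the destination's needs (cleanup, type conversion).
records = [(r["region"].strip().upper(), float(r["amount"])) for r in rows]

# Load: write the transformed records into the destination (a SQLite "warehouse" table).
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)", records)
con.commit()
con.close()
```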

data science

a field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.

Algorithm (recipes)

a set of step-by-step instructions that computers follow to perform a task

what's made AI more powerful in recent years using the AI triad

Access to more data, capabilities to create synthetic data, access to the cloud, and a cheaper cost of entry.

data types

Classifications of data that tell the computer how to interpret and store the data; they include integers, floats, strings, and booleans.
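
A minimal illustration in Python (the variable names are arbitrary examples).

```python
count = 42             # integer
temperature = 98.6     # float
label = "positive"     # string
is_valid = True        # boolean
print(type(count), type(temperature), type(label), type(is_valid))
```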

data sets

are collections of data that are organized in a structured way, often used for machine learning and other data analysis tasks.

databases (& where are they stored)

are structured collections of data that are stored on a computer system, often used to store and manage large amounts of data.

Risks of LLMs

LLMs can be biased, reflecting the biases in the data that they are trained on; they can be used to generate harmful content, such as disinformation and hate speech; and they are computationally expensive to train and use.

data lakes

Centralized repositories that store all types of data in their native format; often used for big data analytics and machine learning.

Exploratory data analysis (EDA)

The critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
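
A minimal EDA sketch with pandas (the file name is a hypothetical placeholder; the columns will vary by dataset).

```python
import pandas as pd

df = pd.read_csv("flights.csv")   # load the dataset to be explored

print(df.head())        # a quick look at the first rows
print(df.describe())    # summary statistics to spot anomalies and check assumptions
print(df.isna().sum())  # count of missing values per column
```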

Irreducible Error

Error due to noise or limitations in the data that prevents the model from getting to 100% accuracy; it cannot be improved any further, no matter the algorithm.

Electronic Design Automation (EDA)

Software that helps people design and check electronic hardware (chips and the gadgets built on them, such as smartphones, tablets, and computers), making sure it works before it is built. It is used to design the hardware, create blueprints, test and simulate ("sandbox") designs, predict how the device will operate in the real world, and create the instructions to build the device. It helps engineers save time and money down the line for a wide array of electronics.

Reinforcement Learning (RLHF)

A machine learning technique, used without labeled data, that trains a model to perform a task by providing it with human feedback. The model is rewarded for generating outputs that are aligned with the feedback and penalized for generating outputs that are not.

What's the difference between debugging and tuning?

Debugging: something is broken and has to be fixed; rectifying issues that cause the software to behave unexpectedly, crash, or produce incorrect results. VERSUS tuning: adjusting parameters so the software or model runs more efficiently.

Machine Learning

process of instructing computers to learn from data using compute power, algorithms, and data.

Supervised learning versus unsupervised learning algorithm (data labeled v unlabeled)

Supervised learning requires labeled data with known outcomes, while unsupervised learning works with unlabeled data to find patterns or structure. In supervised learning the algorithm learns to make predictions based on input-output pairs, while unsupervised learning focuses on discovering inherent structures or relationships within the data. Supervised learning is used for tasks like classification and regression, where there are specific target values to predict, whereas unsupervised learning is used for clustering and dimensionality reduction tasks.
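
A minimal scikit-learn sketch of the contrast: a supervised classifier fits labeled data, while an unsupervised clusterer finds structure without labels (the toy data and model choices are assumptions).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]])   # six one-feature samples
y = np.array([0, 0, 0, 1, 1, 1])                           # labels exist only for the supervised case

clf = LogisticRegression().fit(X, y)       # supervised: learns the input-output mapping
print(clf.predict([[1.1], [5.1]]))         # [0 1]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # unsupervised: labels never used
print(km.labels_)                          # two discovered clusters (cluster numbering may differ)
```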

Cloud Computing (& its benefits)

A technology paradigm that enables users to access and utilize computing resources (such as servers, storage, databases, networking, software, and more) over the internet, or "the cloud." It offers several advantages, including scalability, flexibility, cost-efficiency, and accessibility.

Layers

The highest-level building block in deep learning: a container that usually receives weighted input, transforms it with a set of mostly non-linear functions, and then passes these values onward as output through the neural network.

parameters v hyperparameters

Think of a robot learning a video game: parameters are the skills and knowledge the robot gains as it practices the game (for example, it learns how to jump, run, and collect points) VERSUS hyperparameters are the rules you set for the game: you decide how fast the game should be, how many lives the robot has, and how hard the game is to win.

Activation

the value of each neuron in the input layer.

data format

Used to store and transmit specific types of data, such as images, audio, and video; often defined by a set of rules that specify how the data should be structured and encoded.

data structure (within a computer already)

A way of organizing data in a computer so that it can be used efficiently; often used to store and retrieve data, as well as to implement algorithms.

