OpenAI Prep

the URL you send requests to

API endpoint

The amount of information the model can "remember" within a single prompt

Context window

Lets external systems call your app when a long-running task finishes, so you don't have to keep polling.
AMEX: instead of repeatedly calling Resy to check on the booking, we set up a webhook so Resy can call us with an update.
Granola: rather than blocking the UI, Granola sends audio to its transcription service, which processes in the background. When finished, the service sends a webhook to Granola: "Your transcript is ready." Granola then triggers:
-summary API call
-insights extraction
-memory updates

Define Webhook - give examples for AMEX and Granola
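The flow above can be sketched as a handler your app exposes for incoming webhook calls. The event names and payload fields here are made up for illustration; real Resy or transcription-service payloads will differ.

```python
import json

def handle_webhook(raw_body: str) -> str:
    """React to a push notification instead of polling for status."""
    event = json.loads(raw_body)
    if event["type"] == "booking.confirmed":
        return f"Table booked at {event['restaurant']}"
    if event["type"] == "transcript.ready":
        # In Granola's flow, this is where the summary API call,
        # insights extraction, and memory updates would be triggered.
        return f"Transcript {event['transcript_id']} is ready"
    return "ignored"

print(handle_webhook('{"type": "booking.confirmed", "restaurant": "Norcina"}'))
```

The key point: the external service initiates the call, so your app does no repeated status checks.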

A numerical vector that represents the meaning of a piece of text. Lets the system understand similarity; it enables semantic search and is the basis for RAG.

Embedding
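A minimal sketch of how similarity between embeddings is measured, using cosine similarity on toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the values here are invented):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: semantically close texts get nearby vectors.
king = [0.9, 0.8, 0.1]
queen = [0.88, 0.82, 0.15]
banana = [0.1, 0.05, 0.9]

assert cosine_similarity(king, queen) > cosine_similarity(king, banana)
```

Semantic search is just this comparison run against every stored vector, keeping the highest scores.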

Endpoints are the URLs I call to use the OpenAI API. Tool calls are instructions generated by the model that tell my application what action to take next. They operate on different layers

Endpoints are... Tool calls are...

A secret API key sent in the request header. Used to control access, billing, and permissions; this is how OpenAI secures access and tracks usage.

How is authentication handled?
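In practice the key travels in the `Authorization` header using the Bearer scheme. A sketch of building those headers (the placeholder key is obviously not real):

```python
import os

# Read the secret key from the environment rather than hard-coding it.
api_key = os.environ.get("OPENAI_API_KEY", "sk-placeholder")

headers = {
    "Authorization": f"Bearer {api_key}",   # proves who is calling
    "Content-Type": "application/json",     # body format for the request
}
```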

The real-time process of a model generating a response based on what it learned during training.
-All training is done.
-The model simply predicts tokens based on its learned patterns + your prompt

Inference
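A toy illustration of the "frozen weights, just predict the next token" idea: here the "model" is a hard-coded bigram table and decoding is greedy. Real models score tens of thousands of tokens with learned weights, but the loop shape is the same.

```python
# Toy inference loop: the "weights" (this table) never change at inference
# time; generation is just repeated next-token prediction.
bigrams = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"down": 1.0},
}

def generate(prompt, steps=3):
    tokens = prompt.split()
    for _ in range(steps):
        choices = bigrams.get(tokens[-1])
        if not choices:
            break
        tokens.append(max(choices, key=choices.get))  # greedy decoding
    return " ".join(tokens)

print(generate("the"))  # "the cat sat down"
```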

-ChatGPT
-API platform
-Assistants API
-GPT Store
-Audio and image models

List all of OpenAI's product offerings

Turbo models exist because most real-world applications need speed, scale, and affordability. Turbo gives teams 90-95% of the model quality at a fraction of the cost and latency, which makes large-scale adoption possible

One liner about turbo models:

Ex. grammar rules, common facts, how an email is usually written. Set during pre-training: the model adjusts its weights (parameters) every time it gets a prediction "wrong".

Parameters

You send a message to another computer on the internet asking it to do something, and it sends a message back with the result. A set of endpoints a client can call over HTTP to send requests and receive structured responses, usually in JSON.

REST API
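A sketch of such a call using only the standard library: structured JSON goes in, and an endpoint URL plus HTTP method identify the operation. The model name is an assumption, and the request is only built here, not sent (sending it would need a real API key).

```python
import json
import urllib.request

# Build the JSON body the endpoint expects.
body = json.dumps({
    "model": "gpt-4o-mini",   # assumed model name for illustration
    "messages": [{"role": "user", "content": "Hello"}],
}).encode()

# The URL is the endpoint; POST is the HTTP verb for this operation.
req = urllib.request.Request(
    "https://api.openai.com/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
```

Calling `urllib.request.urlopen(req)` would perform the round trip and return the JSON response.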

Rate limits = how much you can use per minute. Measured in: Requests per minute Tokens per minute

Rate Limits

How much of the API you're allowed to use per minute: API calls per minute and tokens per minute. Turbo models help because their speed and lower cost let more work fit within the same limits.

Rate limits
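When a limit is hit, the standard client-side response is retry with exponential backoff. A minimal sketch, using a plain `RuntimeError` as a stand-in for an HTTP 429 rate-limit error and tiny delays so it runs instantly:

```python
import time

def with_backoff(call, max_retries=3):
    """Retry a rate-limited call, doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError:                  # stand-in for a 429 response
            time.sleep(2 ** attempt * 0.01)   # 0.01s, 0.02s, 0.04s...
    raise RuntimeError("still rate limited after retries")

# Demo: a call that fails twice, then succeeds.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("429")
    return "ok"

print(with_backoff(flaky))  # "ok"
```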

-Humans rank different model outputs. -The model learns which responses humans prefer. -This shapes tone, style, helpfulness, guardrails

Reinforcement Learning from Human Feedback (RLHF)

-It ingests huge amounts of text. -The task is always the same: predict the next token. -By doing this billions of times, the model learns patterns, reasoning, structure, facts.

Supervised pre-training

How many requests a system can handle at the same time. Turbo = high throughput; frontier = lower throughput (but higher reasoning quality).
-How many API calls per second OpenAI can serve
-How many tokens per minute the system can process
-How much parallel load the model can handle without slowing down

Throughput, and what is the difference between turbo and frontier models?

The economic unit of AI. Tokens determine cost, speed, and accuracy.

Tokens
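Since tokens are the billing unit, cost estimates are just token counts times per-token prices. A sketch with hypothetical prices (the dollar figures are invented, not OpenAI's actual pricing):

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  price_in=0.15, price_out=0.60):
    """Estimated $ cost; prices are assumed $ per 1M tokens."""
    return (prompt_tokens * price_in +
            completion_tokens * price_out) / 1_000_000

# Output tokens typically cost more than input tokens.
print(estimate_cost(10_000, 2_000))
```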

Something the model triggers in its response: an instruction inside the model's response, aka "To answer this request, I need to call this external tool."

Tool call
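A sketch of what the app does with a tool call: parse the model's instruction and execute the tool itself. The shape here is simplified for illustration (the real API nests tool calls inside the response's message object), and `book_restaurant` is a hypothetical tool name.

```python
import json

# Simplified tool-call shape returned by the model.
tool_call = {
    "name": "book_restaurant",
    "arguments": json.dumps({"restaurant": "Norcina", "time": "19:00"}),
}

# The model only *describes* the action; the app runs it.
args = json.loads(tool_call["arguments"])
result = f"Calling {tool_call['name']} for {args['restaurant']} at {args['time']}"
print(result)
```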

Pre-training (model gains knowledge and learns patterns, logic, and reasoning) -> post-training/supervised fine-tuning (learns to follow instructions, how to be helpful, how to structure answers) -> reinforcement learning from human feedback (humans rank responses to improve personality, tone, politeness, safety) -> optional step is domain fine-tuning done by customers -> inference (model does its analysis and produces a response)

Walk through the LLM training process

User asks a question -> we embed the question into a query embedding -> query vector is compared against embeddings stored in the vector DB using semantic similarity and the most relevant chunks are identified -> top matching chunks are retrieved and served to the LLM along with the user's question (augmentation) -> tokenization happens -> LLM produces the response

Walk through the RAG workflow
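The retrieval and augmentation steps above can be sketched in miniature. Here simple word overlap stands in for embedding similarity, and three hard-coded strings stand in for the vector DB — everything in this example is invented for illustration.

```python
# Stand-in "vector DB": chunks that would normally be stored as embeddings.
docs = [
    "The meeting covered Q3 revenue and hiring plans.",
    "Python is a programming language.",
    "Our hiring plan adds five engineers in Q3.",
]

def score(query, doc):
    """Word overlap as a crude proxy for semantic similarity."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def retrieve(query, k=2):
    """Return the top-k most relevant chunks."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

# Augmentation: retrieved chunks are stuffed into the prompt with the question.
question = "What are the hiring plans?"
chunks = retrieve(question)
prompt = "Answer using:\n" + "\n".join(chunks) + "\nQ: " + question
```

A real pipeline replaces `score` with cosine similarity over embeddings, but the shape — embed, retrieve, augment, generate — is the same.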

The API lets developers embed OpenAI models into their own applications

What does OpenAI's API offer?

Chat endpoint → to summarize Embeddings endpoint → for search Audio endpoint → for transcription

What endpoints does Granola use when making API calls to OpenAI?

API Layer (your call → OpenAI) You call the OpenAI API endpoint. Model Layer (OpenAI → tool call) The model returns a tool call instruction telling your app what to do next.

What happens at the API layer vs the Model Layer with regards to tool calls?

Step 1 — AMEX calls OpenAI's chat/completions endpoint (This is the REST API call.) Step 2 — The model responds with a tool call Step 3 — AMEX uses this to call Resy's actual API endpoint (This is the real booking step.)

When a user says: "Book me a table at Norcina tonight." AMEX does:

-Sending a request (like sending a letter to a mailbox)
-To a specific address (this is the "endpoint")
-With instructions inside (the JSON request)
-The server does the work (runs GPT-4)
-And sends you back a response (JSON with the model's answer)
The client sends input parameters in JSON, and the model returns structured JSON with tokens, metadata, and any tool calls.

When you "call the OpenAI API," you're really:
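The response side of that round trip can be sketched too: parsing the structured JSON that comes back. This is a simplified shape modeled on the chat/completions response (fields trimmed for illustration).

```python
import json

# Simplified response shape: answer text plus token-usage metadata.
raw_response = json.dumps({
    "choices": [{"message": {"role": "assistant", "content": "Paris"}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 1, "total_tokens": 13},
})

parsed = json.loads(raw_response)
answer = parsed["choices"][0]["message"]["content"]
tokens_used = parsed["usage"]["total_tokens"]   # this is what billing counts
```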

enterprises need models that are cheaper, faster, and more scalable, even if they're slightly less powerful than the full, frontier model Turbo models help companies scale AI affordably and optimize for speed

Why do turbo models exist?

Every API request can include the maximum amount of output the model is allowed to generate. This lets developers manage cost, latency, and output length.

max tokens

Controls randomness/creativity: low temp = more deterministic, focused output; high temp = more varied, creative output.

temperature

How much of the model's probability distribution you let it sample from (nucleus sampling).

top-p
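The three knobs above (max tokens, temperature, top-p) travel together as request parameters. A sketch of how they might appear in a payload; the model name and values are assumptions for illustration:

```python
# Sampling and length controls as request parameters (illustrative values).
request = {
    "model": "gpt-4o-mini",   # assumed model name
    "messages": [{"role": "user", "content": "Write a haiku"}],
    "max_tokens": 60,    # cap on generated output (cost/latency control)
    "temperature": 1.1,  # higher -> more varied, creative sampling
    "top_p": 0.9,        # sample only from the top 90% of probability mass
}
```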

How long it takes the model to return a response after you send an API request. Influenced by: the model used, the number of tokens

what is latency, how is it influenced?

