OpenAI Prep
the URL you send requests to
API endpoint
The amount of information the model can "remember" within a single prompt
Context window
lets external systems call your app when a long-running task finishes, so you don't have to keep polling
AMEX: instead of repeatedly calling Resy to check on the booking, we set up a webhook so Resy can call us with an update
Granola: rather than blocking the UI, Granola sends audio to its transcription service, which processes it in the background. When finished, the service sends a webhook to Granola: "Your transcript is ready." Granola then triggers: the summary API call, insights extraction, and memory updates
Define Webhook - give examples for AMEX and Granola
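The Granola flow above can be sketched as a minimal webhook handler. The payload fields (`status`, `transcript_id`) are hypothetical, not a real Granola or transcription-service schema:

```python
import json

def handle_webhook(raw_body: str) -> str:
    """Dispatch on a (hypothetical) transcription-service webhook payload."""
    event = json.loads(raw_body)
    if event.get("status") == "transcript.ready":
        # Kick off the downstream steps Granola runs after transcription:
        # summary API call, insights extraction, memory updates.
        return f"processing transcript {event['transcript_id']}"
    return "ignored"

# The service POSTs this body to our endpoint instead of us polling it.
body = json.dumps({"status": "transcript.ready", "transcript_id": "tr_123"})
print(handle_webhook(body))  # processing transcript tr_123
```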
a numerical vector that represents the meaning of a piece of text; lets the system understand similarity; enables semantic search and is the basis for RAG
Embedding
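The "similarity" idea is usually cosine similarity between vectors. A toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two embedding vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy "embeddings": semantically close texts get geometrically close vectors.
cat, kitten, invoice = [1.0, 0.9, 0.0], [0.9, 1.0, 0.1], [0.0, 0.1, 1.0]
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```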
Endpoints are the URLs I call to use the OpenAI API. Tool calls are instructions generated by the model that tell my application what action to take next. They operate on different layers
Endpoints are... Tool calls are...
A secret API key sent in the request header. Used to control access, billing, and permissions; this is how OpenAI secures access and tracks usage
How is authentication handled?
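OpenAI uses a Bearer token in the `Authorization` header. A minimal sketch of building those headers (reading the key from an environment variable, with a placeholder fallback):

```python
import os

def auth_headers(api_key: str) -> dict[str, str]:
    """OpenAI authenticates each request with a Bearer token header."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

headers = auth_headers(os.environ.get("OPENAI_API_KEY", "sk-test"))
print("Authorization" in headers)  # True
```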
the real-time process of a model generating a response based on what it learned during training
-All training is done.
-The model simply predicts tokens based on its learned patterns + your prompt
Inference
-ChatGPT
-API platform
-Assistants API
-GPT Store
-Audio and image models
List all of OpenAI's product offerings
Turbo models exist because most real-world applications need speed, scale, and affordability. Turbo gives teams 90-95% of the full model's quality at a fraction of the cost and latency, which makes large-scale adoption possible
One liner about turbo models:
Ex. grammar rules, common facts, how an email is usually written
-Set during pre-training
-The model adjusts its weights (parameters) every time it gets something "wrong"
Parameters
You send a message to another computer on the internet asking it to do something, and it sends a message back with the result. A set of endpoints a client can call over HTTP to send requests and receive structured responses, usually in JSON
REST API
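A REST call is just an HTTP request with a JSON body. This builds (but never sends) a request to OpenAI's chat completions endpoint using only the standard library; nothing here touches the network:

```python
import json
import urllib.request

# JSON parameters go in the request body; headers describe the content.
payload = {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hi"}]}
req = urllib.request.Request(
    "https://api.openai.com/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.get_method())  # POST
```

Actually sending it would just be `urllib.request.urlopen(req)` (plus the auth header), and the response body would be JSON too.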
Rate limits = how much you can use per minute. Measured in: Requests per minute Tokens per minute
Rate Limits
how much of the API you're allowed to use per minute: API calls per minute and tokens per minute. Turbo models help because they are cheaper and faster, so more calls and tokens fit within the same limits.
Rate limits
-Humans rank different model outputs. -The model learns which responses humans prefer. -This shapes tone, style, helpfulness, guardrails
Reinforcement Learning from Human Feedback (RLHF)
-It ingests huge amounts of text. -The task is always the same: predict the next token. -By doing this billions of times, the model learns patterns, reasoning, structure, facts.
Pre-training (self-supervised next-token prediction)
how many requests a system can handle at the same time
Turbo = high throughput
Frontier = lower throughput (but higher reasoning quality)
-How many API calls per second OpenAI can serve
-How many tokens per minute the system can process
-How much parallel load the model can handle without slowing down
Throughput, and what is the difference between turbo and frontier models?
Economic unit of AI Determine cost, speed, and accuracy
Tokens
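"Economic unit" means billing is per token. A sketch of the arithmetic; the prices below are made-up placeholders, not real OpenAI pricing:

```python
# Tokens are the billing unit: cost = tokens used x per-token price.
PRICE_PER_1M_INPUT = 2.50    # dollars per 1M input tokens (hypothetical)
PRICE_PER_1M_OUTPUT = 10.00  # dollars per 1M output tokens (hypothetical)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one API call given its token usage."""
    return (input_tokens * PRICE_PER_1M_INPUT
            + output_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000

print(round(request_cost(1_000, 500), 6))  # 0.0075
```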
something the model triggers in its response; an instruction inside the model's response, i.e., "To answer this request, I need to call this external tool"
Tool call
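The key point: the model never runs the tool itself; it hands your app a structured instruction. A simplified sketch of that shape (real chat completions responses carry more fields, and `get_weather` is a hypothetical tool name):

```python
import json

# Simplified shape of a tool call inside a model message.
message = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_1",
        "function": {"name": "get_weather", "arguments": '{"city": "SF"}'},
    }],
}

def dispatch(msg: dict) -> str:
    """Read the model's instruction and decide what the app should do."""
    for call in msg.get("tool_calls", []):
        name = call["function"]["name"]
        # Arguments arrive as a JSON string, not a parsed object.
        args = json.loads(call["function"]["arguments"])
        return f"app should run {name}({args})"
    return "no tool call; plain text answer"

print(dispatch(message))  # app should run get_weather({'city': 'SF'})
```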
Pre-training (model gains knowledge and learns patterns, logic, and reasoning) -> post-training/supervised fine-tuning (learns to follow instructions, how to be helpful, how to structure answers) -> reinforcement learning from human feedback (humans rank responses to improve personality, tone, politeness, safety) -> optional step is domain fine-tuning done by customers -> inference (model does its analysis and produces a response)
Walk through the LLM training process
User asks a question -> we embed the question into a query embedding -> query vector is compared against embeddings stored in the vector DB using semantic similarity and the most relevant chunks are identified -> top matching chunks are retrieved and served to the LLM along with the user's question (augmentation) -> tokenization happens -> LLM produces the response
Walk through the RAG workflow
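The retrieve-then-augment steps above, as a toy sketch. The "vector DB" is a dict with made-up 2-dimensional embeddings; real ones come from an embeddings endpoint and have hundreds of dimensions:

```python
import math

def cos(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Pretend vector DB: chunk text -> stored embedding.
vector_db = {
    "Refunds take 5 days.":     [0.9, 0.1],
    "Our office is in Austin.": [0.1, 0.9],
}

def rag_prompt(question: str, query_embedding: list[float], top_k: int = 1) -> str:
    # Retrieval: rank stored chunks by semantic similarity to the query.
    ranked = sorted(vector_db, key=lambda c: cos(vector_db[c], query_embedding), reverse=True)
    context = "\n".join(ranked[:top_k])
    # Augmentation: the LLM gets the retrieved chunks alongside the question.
    return f"Context:\n{context}\n\nQuestion: {question}"

print(rag_prompt("How long do refunds take?", [0.95, 0.05]))
```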
The API lets developers embed OpenAI models into their own applications
What does OpenAI's API offer?
Chat endpoint → to summarize Embeddings endpoint → for search Audio endpoint → for transcription
What endpoints does Granola use when making API calls to OpenAI?
API Layer (your call → OpenAI) You call the OpenAI API endpoint. Model Layer (OpenAI → tool call) The model returns a tool call instruction telling your app what to do next.
What happens at the API layer vs the Model Layer with regards to tool calls?
Step 1 — AMEX calls OpenAI's chat/completions endpoint (This is the REST API call.) Step 2 — The model responds with a tool call Step 3 — AMEX uses this to call Resy's actual API endpoint (This is the real booking step.)
When a user says: "Book me a table at Norcina tonight." AMEX does:
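The three steps above, end to end, with both external calls stubbed out. Resy's real API and the exact tool schema are assumptions here, purely for illustration:

```python
import json

def call_openai(user_message: str) -> dict:
    """Step 1: stand-in for POSTing to /v1/chat/completions."""
    return {"tool_calls": [{"function": {
        "name": "book_table",
        "arguments": json.dumps({"restaurant": "Norcina", "time": "tonight"}),
    }}]}

def call_resy(restaurant: str, time: str) -> str:
    """Step 3: stand-in for Resy's actual booking endpoint."""
    return f"booked {restaurant} for {time}"

response = call_openai("Book me a table at Norcina tonight.")
# Step 2: the model's reply is not a booking -- it's an instruction to book.
call = response["tool_calls"][0]["function"]
args = json.loads(call["arguments"])
print(call_resy(**args))  # booked Norcina for tonight
```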
-Sending a request (like sending a letter to a mailbox)
-To a specific address (this is the "endpoint")
-With instructions inside (the JSON request)
-The server does the work (runs GPT-4)
-And sends you back a response (JSON with the model's answer)
The client sends input parameters in JSON, and the model returns structured JSON with tokens, metadata, and any tool calls
When you "call the OpenAI API," you're really:
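The JSON-in, JSON-out shape described above, sketched with a simplified response body (real responses include ids, usage counts, and more metadata):

```python
import json

# The request body is plain JSON parameters...
request_body = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Summarize this meeting."}],
    "max_tokens": 200,
    "temperature": 0.3,
}

# ...and the response comes back as structured JSON too (simplified shape).
response_body = json.loads('''{
  "choices": [{"message": {"role": "assistant", "content": "Here is a summary."}}],
  "usage": {"prompt_tokens": 12, "completion_tokens": 5}
}''')

print(response_body["choices"][0]["message"]["content"])  # Here is a summary.
```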
enterprises need models that are cheaper, faster, and more scalable, even if they're slightly less powerful than the full frontier model. Turbo models help companies scale AI affordably and optimize for speed
Why do turbo models exist?
Every API request can include a cap on the maximum number of output tokens the model is allowed to generate. This helps manage cost, latency, and output length.
max tokens
controls randomness in token sampling: low temp = more deterministic, less creative; high temp = more varied, more creative
temperature
how much of the model's probability distribution you let it sample from (nucleus sampling): keep only the smallest set of tokens whose cumulative probability reaches p
top-p
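The two knobs above, sketched on a toy next-token distribution. For simplicity, temperature is applied directly to probabilities here; real implementations scale logits before the softmax:

```python
def apply_temperature(probs: dict[str, float], temperature: float) -> dict[str, float]:
    """Lower temperature sharpens the distribution; higher flattens it."""
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: p / total for tok, p in scaled.items()}

def top_p_filter(probs: dict[str, float], p: float) -> list[str]:
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(tok)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

probs = {"the": 0.5, "a": 0.3, "zebra": 0.2}
print(top_p_filter(probs, 0.8))  # ['the', 'a'] -- the long tail is cut off
sharp = apply_temperature(probs, 0.5)
print(sharp["the"] > probs["the"])  # True: low temperature favors the top token
```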
how long it takes the model to return a response after you send an API request
influenced by: model, tokens,
what is latency, how is it influenced?
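Latency is just wall-clock time around the request. A sketch with a stub standing in for the real network + inference round trip:

```python
import time

def timed_call(fn, *args):
    """Measure wall-clock latency of any call (here, a stubbed model call)."""
    start = time.perf_counter()
    result = fn(*args)
    latency = time.perf_counter() - start
    return result, latency

def fake_model(prompt: str) -> str:
    time.sleep(0.01)  # stand-in for network + inference time
    return "response"

result, latency = timed_call(fake_model, "hello")
print(result, latency > 0)  # response True
```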
