LLMs in 15 minutes

The just-enough mental model for talking to customers about AI without saying anything embarrassing.

What an LLM is, in one sentence

A statistical model trained on a lot of text that predicts the next word (strictly, the next token, which is a word or a piece of one) given everything that came before. Scale that up and you get something that can carry on a coherent conversation, write code, summarise documents, and reason through problems.

That’s it. Everything else is an implication of this.
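To make "predict the next word" concrete, here is a toy sketch that counts which word follows which in a tiny made-up corpus and returns the most frequent follower. It is an illustration only: real LLMs use neural networks over tokens, not word counts, and the corpus and function names here are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram(text: str) -> dict:
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows: dict, word: str):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = follows.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Made-up training text, purely for illustration.
corpus = (
    "the parcel is out for delivery . "
    "the parcel is delayed . "
    "the parcel is out for delivery ."
)
model = train_bigram(corpus)

print(predict_next(model, "is"))      # most common word after "is" in the corpus
print(predict_next(model, "driver"))  # never seen, so None
```

Note that the toy model answers "out" after "is" because that is the most common continuation in its training text, not because it checked where any parcel actually is. That gap between plausible and true is the root of hallucination.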

What follows from “predict the next word”

| Property | Why |
|---|---|
| Sometimes confidently wrong (hallucinates) | Predicting plausible text ≠ predicting true text |
| Better with context | More input → more constraints on the prediction |
| Has a finite “context window” | Compute scales poorly with input length |
| Can be steered with examples (few-shot) | Pattern-matching is what it does |
| Costs ~$/million tokens, not $/query | Pricing reflects compute |
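Few-shot steering is easy to show in code: you put a handful of worked examples in the prompt and leave the last one unfinished for the model to complete. The sketch below only builds the prompt string; the "Message / Category" layout, the labels, and the function name are assumptions for illustration, not a required format.

```python
def build_few_shot_prompt(examples, query):
    """Build a prompt that shows labelled examples, then an unlabelled query."""
    blocks = []
    for text, label in examples:
        blocks.append(f"Message: {text}\nCategory: {label}\n")
    # Leave the final category blank so the model continues the pattern.
    blocks.append(f"Message: {query}\nCategory:")
    return "\n".join(blocks)


# Hypothetical examples, invented for this sketch.
examples = [
    ("Where is my parcel?", "tracking"),
    ("I was charged twice.", "billing"),
]
prompt = build_few_shot_prompt(examples, "My delivery is late again.")
print(prompt)
```

Because the model is a pattern-matcher, the most likely continuation after the final "Category:" is a label in the same style as the examples.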

Closed vs open-source

| | Closed-source (OpenAI, Anthropic, Google) | Open-source (Llama, Mistral, Qwen) |
|---|---|---|
| Quality on hard tasks | Higher today | Catching up |
| Cost per million tokens | Higher | Lower (especially self-hosted) |
| Data residency | Their cloud | Your cloud / on-prem |
| Fine-tuning | Limited (their API) | Full control |
| Latency | Network round-trip | Can be local |
| When to pick | Customer is fine with cloud, wants best quality | Customer needs on-prem or data residency, or cost-sensitive at scale |

Shipsy uses both. See Models — choosing & switching for how the platform routes between them.

Three numbers to know

  • Context window: how much input the model can see at once. GPT-4o: 128K tokens. Claude Sonnet: 200K. Gemini 1.5: 1M+. (A token ≈ ¾ of a word.)
  • Cost per million tokens: input is cheap; output typically costs ~3-5× more per token. Plan budgets accordingly.
  • Latency: time to the first token vs. time to the full response. For voice agents, first-token latency matters most.
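The first two numbers combine into simple back-of-envelope arithmetic. The sketch below uses the "a token ≈ ¾ of a word" rule of thumb from above; the prices are hypothetical placeholders (chosen so output is 4× input), not any vendor's actual rates, and the function names are made up for this example.

```python
def estimate_tokens(word_count: int) -> int:
    """Rough token count, assuming a token is about 3/4 of a word."""
    return round(word_count / 0.75)


def fits_context(input_tokens: int, context_window: int) -> bool:
    """True if the input fits inside the model's context window."""
    return input_tokens <= context_window


def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_m, output_price_per_m):
    """Cost of one request, given prices in USD per million tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Hypothetical prices, for illustration only.
INPUT_PRICE_PER_M = 2.0
OUTPUT_PRICE_PER_M = 8.0

input_tokens = estimate_tokens(1500)   # a 1,500-word prompt
output_tokens = estimate_tokens(300)   # a 300-word reply
cost = estimate_cost_usd(input_tokens, output_tokens,
                         INPUT_PRICE_PER_M, OUTPUT_PRICE_PER_M)

print(input_tokens, output_tokens)          # 2000 400
print(fits_context(input_tokens, 128_000))  # True
print(f"${cost:.4f} per request")           # $0.0072 per request
```

Under these assumed prices the short reply accounts for almost half the cost of the request, which is why long outputs deserve attention when budgeting.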

What LLMs are bad at

  • Math beyond simple arithmetic (use a tool / Python)
  • Anything requiring up-to-the-second data (use a tool / API)
  • Following long, complex instructions perfectly (break it down, add structure)
  • Doing the same thing twice the same way (output is sampled, so it varies; lower the temperature to reduce variation)
  • Knowing what they don’t know (they’ll often guess)

The job of an agent is to compose LLMs with tools and structure to compensate for these.
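A minimal sketch of that composition, assuming a stubbed-out model: the "model" decides whether a question needs a tool, and the agent runs the tool instead of letting the model guess the arithmetic. `stub_model`, `TOOLS`, and `run_agent` are hypothetical names for illustration; a real agent would call an actual LLM API and handle many more cases.

```python
import ast
import operator

# Arithmetic operators the calculator tool is allowed to apply.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str):
    """Evaluate +, -, *, / on plain numbers without using eval()."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)


def stub_model(question: str) -> dict:
    """Stand-in for an LLM call (assumption): asks for a tool when it sees digits."""
    if any(ch.isdigit() for ch in question):
        expression = question.removeprefix("What is ").rstrip("?")
        return {"tool": "calculator", "input": expression}
    return {"answer": "I can help with that."}


TOOLS = {"calculator": calculator}


def run_agent(question: str) -> str:
    """Ask the model; if it requests a tool, run the tool and report the result."""
    decision = stub_model(question)
    if "tool" in decision:
        result = TOOLS[decision["tool"]](decision["input"])
        return f"The result is {result}"
    return decision["answer"]


print(run_agent("What is 1234 * 5678?"))  # The result is 7006652
```

The same pattern covers the other weaknesses in the list: swap the calculator for an API call when the data must be fresh, or add a validation step when instructions are long.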

What CS folks actually need to remember

  1. The model isn’t magic. It’s a really good text-prediction engine wrapped in helpful API plumbing.
  2. Quality, cost, latency, and data residency trade off against each other. Work out which matters most for the customer’s situation and choose the model accordingly.
  3. Hallucination is mitigated by grounding (RAG), guardrails, and human-in-the-loop. See RAG, memory & vector DBs and Guardrails.

Changelog

  • 26 May 2026: Initial draft.