LLMs in 15 minutes
The just-enough mental model for talking to customers about AI without saying anything embarrassing.
What an LLM is, in one sentence
A statistical model trained on a large amount of text to predict the next word (strictly, the next token) given everything that came before. Scale that up and you get something that can carry on a coherent conversation, write code, summarise documents, and reason through problems.
That’s it. Everything else is an implication of this.
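To make "predict the next word" concrete, here is a toy sketch: a bigram model that counts which word follows which in a tiny corpus and always picks the most frequent follower. Real LLMs use neural networks over tokens rather than word counts, so treat this only as an illustration of the idea; the corpus and the function names are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram(text: str) -> dict:
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(model: dict, word: str):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def generate(model: dict, start: str, max_words: int = 5) -> str:
    """Repeatedly append the predicted next word, stopping at a dead end."""
    out = [start]
    for _ in range(max_words):
        nxt = predict_next(model, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)


# Hypothetical training text, invented for this sketch.
corpus = "the driver delivered the parcel and the driver called the customer"
model = train_bigram(corpus)

print(predict_next(model, "the"))    # "driver" follows "the" most often
print(generate(model, "parcel", 3))  # parcel and the driver
```

Note that the model has no notion of whether "the driver called the customer" is true. It only knows that this sequence is likely, which is the root of the hallucination behaviour described in the next section.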
What follows from “predict the next word”
| Property | Why |
|---|---|
| Sometimes confidently wrong (hallucinates) | Predicting plausible text ≠ predicting true text |
| Better with context | More input → more constraints on the prediction |
| Has a finite “context window” | Compute and memory grow quickly with input length (roughly quadratically for standard attention) |
| Can be steered with examples (few-shot) | Pattern-matching is what it does |
| Priced per million tokens, not per query | Pricing reflects compute, which scales with the number of tokens processed |
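As a sketch of what "steered with examples" looks like in practice, the snippet below assembles a few-shot prompt as plain text: an instruction, a handful of labelled examples, then the new input with the label left blank for the model to complete. The function name, the categories, and the example messages are all hypothetical, and no model is called here.

```python
def build_few_shot_prompt(instruction: str, examples: list, query: str) -> str:
    """Build a prompt where labelled examples establish the pattern to continue."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Message: {text}")
        lines.append(f"Category: {label}")
        lines.append("")
    # The final item has no label: the model's most likely continuation is the answer.
    lines.append(f"Message: {query}")
    lines.append("Category:")
    return "\n".join(lines)


# Hypothetical examples, invented for this sketch.
examples = [
    ("My parcel arrived broken", "damage"),
    ("Where is my order?", "tracking"),
]
prompt = build_few_shot_prompt(
    "Classify each customer message into a category.",
    examples,
    "The box was crushed when it got here",
)
print(prompt)
```

The examples add context, so they also consume tokens on every call. That is the trade-off between the "better with context" and "priced per token" rows above.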
Closed vs open-source
| | Closed-source (OpenAI, Anthropic, Google) | Open-source (Llama, Mistral, Qwen) |
|---|---|---|
| Quality on hard tasks | Higher today | Catching up |
| Cost per million tokens | Higher | Lower (especially self-hosted) |
| Data residency | Their cloud | Your cloud / on-prem |
| Fine-tuning | Limited (their API) | Full control |
| Latency | Network round-trip | Can be local |
| When to pick | Customer is fine with cloud, wants best quality | Customer needs on-prem or data residency, or cost-sensitive at scale |
Shipsy uses both. See Models — choosing & switching for how the platform routes between them.
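The "When to pick" row can be read as a simple decision rule. The sketch below encodes only that row; it is not how the Shipsy platform routes between models, and `CustomerNeeds` and `pick_model_family` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CustomerNeeds:
    """Hypothetical summary of the constraints from the table above."""
    needs_on_prem: bool = False
    needs_data_residency: bool = False
    cost_sensitive_at_scale: bool = False


def pick_model_family(needs: CustomerNeeds) -> str:
    """Apply the 'When to pick' row: any hard constraint points to open-source."""
    if needs.needs_on_prem or needs.needs_data_residency or needs.cost_sensitive_at_scale:
        return "open-source"
    # Customer is fine with cloud and wants the best quality.
    return "closed-source"


print(pick_model_family(CustomerNeeds()))
print(pick_model_family(CustomerNeeds(needs_data_residency=True)))
```

In a real conversation the answer is rarely this binary, but the rule is a useful first question to ask a customer.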
Three numbers to know
- Context window: how much input the model can see at once. GPT-4o: 128K tokens. Claude Sonnet: 200K. Gemini 1.5: 1M+. (A token ≈ ¾ of an English word.)
- Cost per million tokens: input tokens are cheap; output tokens typically cost ~3-5× more. Plan budgets accordingly.
- Latency: time to first token vs. time to full response. For voice agents, time to first token matters most.
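A back-of-the-envelope budget can be sketched from the first two numbers. The snippet below uses the ¾-word-per-token rule of thumb and per-million-token prices. The prices are placeholders chosen so that output costs 4× input; they are not any vendor's actual rates.

```python
import math

# Hypothetical prices in USD per million tokens (placeholders, not real rates).
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 12.00  # 4x input, inside the ~3-5x range mentioned above


def words_to_tokens(words: int) -> int:
    """Rule of thumb: one token is about 3/4 of a word, so tokens = words * 4/3."""
    return math.ceil(words * 4 / 3)


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a given volume of input and output tokens."""
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_M
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    )


# Example: 1M input tokens and 200K output tokens in a month.
monthly = estimate_cost(1_000_000, 200_000)
print(f"{words_to_tokens(750)} tokens for a 750-word document")
print(f"Estimated monthly cost: ${monthly:.2f}")
```

In this example output is only a sixth of the token volume but close to half of the bill, which is why long generated responses deserve attention when budgeting.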
What LLMs are bad at
- Math beyond simple arithmetic (use a tool / Python)
- Anything requiring up-to-the-second data (use a tool / API)
- Following long, complex instructions perfectly (break it down, add structure)
- Doing the same thing twice the same way (output is sampled; a lower temperature makes it more consistent)
- Knowing what they don’t know (they’ll often guess)
The job of an agent is to compose LLMs with tools and structure to compensate for these.
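A minimal sketch of that composition, using the arithmetic weakness as the example: instead of letting the model guess at a product, the agent hands the expression to a deterministic calculator tool. The model is stubbed out as a plain dictionary here, and the `{"tool": ..., "input": ...}` shape is an assumption for illustration, not any particular vendor's tool-calling format.

```python
import ast
import operator

# Arithmetic operators the calculator tool is allowed to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str):
    """Safely evaluate a basic arithmetic expression without using eval()."""
    def _eval(node):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


TOOLS = {"calculator": calculator}


def run_step(model_output: dict) -> str:
    """If the (stubbed) model asked for a known tool, run it; else return its text."""
    tool_name = model_output.get("tool")
    if tool_name in TOOLS:
        return str(TOOLS[tool_name](model_output["input"]))
    return model_output.get("text", "")


# Stubbed model outputs: one tool request, one plain answer.
print(run_step({"tool": "calculator", "input": "1234 * 5678"}))  # 7006652
print(run_step({"text": "Your parcel is out for delivery."}))
```

The same pattern covers live data: swap the calculator for a function that calls a tracking API, and the model's job shrinks to deciding which tool to call and phrasing the result.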
What CS folks actually need to remember
- The model isn’t magic. It’s a really good text-prediction engine wrapped in helpful API plumbing.
- Quality, cost, latency, and data residency are levers — pick the right one for the customer’s situation.
- Hallucination is mitigated by grounding (RAG), guardrails, and human-in-the-loop. See RAG, memory & vector DBs and Guardrails.
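Grounding can be sketched in a few lines: find the most relevant snippet, then tell the model to answer only from it. The retrieval below is a naive word-overlap score standing in for the vector search a real RAG system would use, and the documents and function names are invented for this example.

```python
import re


def _words(text: str) -> set:
    """Lowercase the text and split it into a set of words, dropping punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, documents: list, top_k: int = 1) -> list:
    """Rank documents by how many words they share with the question."""
    q_words = _words(question)
    ranked = sorted(documents, key=lambda d: len(q_words & _words(d)), reverse=True)
    return ranked[:top_k]


def build_grounded_prompt(question: str, documents: list) -> str:
    """Put the retrieved context in the prompt and restrict the answer to it."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical knowledge-base snippets, invented for this sketch.
docs = [
    "Refunds are processed within 5 business days.",
    "Drivers are assigned to orders by delivery zone.",
]
question = "How long until refunds are processed?"
print(build_grounded_prompt(question, docs))
```

The instruction to say "you don't know" gives the model a plausible continuation other than guessing. Guardrails and human review then cover the cases where it guesses anyway.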
Sources
Changelog
- 26 May 2026: Initial draft.