Deployment modes
At a glance
- Three deployment modes: cloud (default), on-prem, hybrid.
- Cloud runs on AWS ECS with a three-service split: API, worker, scheduler.
- Model choice drives deployment mode: closed-source LLMs (GPT-4o, Claude Sonnet) require cloud or hybrid, because inference runs in the provider’s cloud; open-source models (Llama, Mistral) enable fully on-prem.
- Data residency is the most common reason customers push for on-prem or hybrid.
Why this matters
“Where does our data go?” is the first question in every enterprise security review. Having a crisp answer — with a diagram — saves weeks of back-and-forth. Know the three modes and when to recommend each.
The three modes
Cloud (default)
The standard deployment for most customers.
| Component | Where it runs |
|---|---|
| Agent-platform API | AWS ECS (desiredCount >= 2) |
| Worker service | AWS ECS (desiredCount >= 2) |
| Scheduler service | AWS ECS (desiredCount >= 2) |
| LLM provider | Azure OpenAI (default) |
| Vector DB | Pinecone |
| Observability | New Relic + Elasticsearch + Langfuse |
Best for: customers who are comfortable with cloud, need the fastest time-to-value, and want access to the best models (GPT-4o, Claude Sonnet).
On-prem
The full stack runs inside the customer’s infrastructure.
| Component | Where it runs |
|---|---|
| Agent-platform | Customer’s Kubernetes or VMs |
| LLM | Open-source (Llama, Mistral — fine-tuned) |
| Vector DB | FAISS (local) |
| Observability | Customer’s monitoring stack |
Best for: banking (BDO), government, defense, or any customer with strict data-residency requirements that prohibit cloud LLM calls.
Trade-off: model quality. Open-source models are catching up but still trail GPT-4o and Claude Sonnet on complex reasoning and tool use.
Hybrid
Platform runs on-prem; LLM calls route to cloud. Data in transit is encrypted; data at rest stays on-prem.
Best for: customers who need data residency for stored data but accept that LLM inference happens in a cloud provider’s environment (with appropriate DPAs and SOC-2 coverage).
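For the "where does our data go?" conversation, the three modes can be summarized as a placement map. This is a minimal sketch, not platform code: the names `PLACEMENT` and `components_outside_customer` are hypothetical, and the hybrid entry lists only the two components the description above specifies (platform and LLM).

```python
# Where each component runs, per mode.
# "customer" = inside the customer's infrastructure; "cloud" = outside it.
# Hybrid lists only what the text states: platform on-prem, LLM calls to cloud.
PLACEMENT = {
    "cloud": {
        "platform": "cloud",
        "llm": "cloud",
        "vector_db": "cloud",
        "observability": "cloud",
    },
    "on-prem": {
        "platform": "customer",
        "llm": "customer",
        "vector_db": "customer",
        "observability": "customer",
    },
    "hybrid": {
        "platform": "customer",
        "llm": "cloud",
    },
}


def components_outside_customer(mode: str) -> list[str]:
    """Return the components that run outside the customer's infrastructure."""
    return sorted(
        name for name, where in PLACEMENT[mode].items() if where == "cloud"
    )
```

In this sketch, `components_outside_customer("hybrid")` returns only the LLM, which is the point a security reviewer usually wants confirmed.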
Choosing the right mode
Common customer situations
| Customer says | Recommend | Why |
|---|---|---|
| “We’re fine with cloud” | Cloud | Fastest, best models |
| “Our security team won’t approve cloud LLMs” | On-prem | Full control, open-source models |
| “Data must stay in our VPC but we want GPT-4o” | Hybrid | Best of both: data stays local, LLM calls route to Azure |
| “We’re a bank in the Philippines” (BDO pattern) | Start hybrid, assess on-prem | Banking regulators care about data at rest; inference in transit is usually acceptable with DPAs |
| “We’re in the EU / Middle East” | Cloud (regional) or hybrid | Check specific regulation; often cloud with regional hosting suffices |
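The first three rows of the table reduce to two questions: may prompts be sent to a cloud LLM, and must stored data stay in the customer’s environment? The sketch below encodes that as a starting recommendation only. The class and field names are hypothetical, and region-specific cases (the bank and EU / Middle East rows) still need a regulatory check that this function does not attempt.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    CLOUD = "cloud"
    ON_PREM = "on-prem"
    HYBRID = "hybrid"


@dataclass(frozen=True)
class Requirements:
    # Hypothetical field names, chosen for this sketch only.
    cloud_llm_calls_allowed: bool       # may prompts go to a cloud LLM provider?
    data_at_rest_must_stay_local: bool  # must stored data stay with the customer?


def recommend_mode(req: Requirements) -> Mode:
    """Return a starting recommendation that mirrors the table above."""
    if not req.cloud_llm_calls_allowed:
        # Security team rejects cloud LLMs: only open-source models on-prem fit.
        return Mode.ON_PREM
    if req.data_at_rest_must_stay_local:
        # Stored data stays local, inference routes to the cloud provider.
        return Mode.HYBRID
    return Mode.CLOUD
```

Checking the LLM constraint first reflects that it is the harder one: if cloud inference is prohibited, neither cloud nor hybrid is available regardless of where data is stored.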
Infrastructure details (cloud mode)
The agent-platform runs as three ECS services:
| Service | Purpose |
|---|---|
| API | Handles inbound requests (webhooks, REST), routes to supervisor agent |
| Worker | Executes agent workflows, tool calls, LLM inference |
| Scheduler | Runs scheduled/cron-based workflows (e.g., Maya’s monitoring loops) |
Each runs with desiredCount >= 2 for high availability. Health checks are monitored via New Relic with automated alerting.
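The high-availability rule can be checked mechanically. The following is a minimal sketch over plain dictionaries, assuming the three service names from the table. It does not call AWS: `SERVICES` and `under_provisioned` are hypothetical names, and in practice the counts would come from the ECS service definitions, where `desiredCount` is the field that sets how many tasks a service keeps running.

```python
# Assumed in-memory description of the three ECS services (not read from AWS).
SERVICES = {
    "api": {"desiredCount": 2},
    "worker": {"desiredCount": 2},
    "scheduler": {"desiredCount": 2},
}

MIN_DESIRED_COUNT = 2  # the high-availability floor stated above


def under_provisioned(services: dict[str, dict[str, int]]) -> list[str]:
    """Return the names of services running below the high-availability floor."""
    return sorted(
        name
        for name, spec in services.items()
        if spec.get("desiredCount", 0) < MIN_DESIRED_COUNT
    )
```

A service with no `desiredCount` entry is treated as zero, so a missing setting is reported as a violation, not silently passed.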
Sources
- Slack: #team-ai — deployment and infrastructure discussions
- See Architecture overview for how deployment fits in the platform
- See Security & compliance for data residency details
- Voice Agent Cost Structure & Deployment
Changelog
- 26 May 2026: Full content from Slack engineering discussions and architecture research.