Deployment modes

At a glance

Three deployment modes: cloud (default), on-prem, hybrid.
Cloud runs on AWS ECS with a three-service split: API, worker, scheduler.
Model choice drives deployment mode: closed-source LLMs require cloud inference (cloud or hybrid mode); open-source models (Llama, Mistral) enable fully on-prem deployments.
Data residency is the most common reason customers push for on-prem or hybrid.

Why this matters

“Where does our data go?” is the first question in every enterprise security review. Having a crisp answer — with a diagram — saves weeks of back-and-forth. Know the three modes and when to recommend each.

The three modes

Cloud (default)

The standard deployment for most customers.

Component	Where it runs
Agent-platform API	AWS ECS (`desiredCount >= 2`)
Worker service	AWS ECS
Scheduler service	AWS ECS
LLM provider	Azure OpenAI (default)
Vector DB	Pinecone
Observability	New Relic + Elasticsearch + Langfuse

Best for: customers who are comfortable with cloud, need the fastest time-to-value, and want access to the best models (GPT-4o, Claude Sonnet).

On-prem

The full stack runs inside the customer’s infrastructure.

Component	Where it runs
Agent-platform	Customer’s Kubernetes or VMs
LLM	Open-source (Llama, Mistral — fine-tuned)
Vector DB	FAISS (local)
Observability	Customer’s monitoring stack

Best for: banking (BDO), government, defense, or any customer with strict data-residency requirements that prohibit cloud LLM calls.

Trade-off: model quality. Open-source models are catching up but still trail GPT-4o and Claude Sonnet on complex reasoning and tool use.

Hybrid

Platform runs on-prem; LLM calls route to cloud. Data in transit is encrypted; data at rest stays on-prem.

Best for: customers who need data residency for stored data but accept that LLM inference happens in a cloud provider’s environment (with appropriate DPAs and SOC 2 coverage).
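
The residency properties of the three modes can be summarized as a small lookup. This is a minimal sketch for illustration, not platform code: `DeploymentMode`, `ModeProfile`, `MODE_PROFILES`, and `llm_calls_leave_customer_infra` are hypothetical names, and the values are taken only from the mode descriptions above.

```python
from dataclasses import dataclass
from enum import Enum


class DeploymentMode(Enum):
    CLOUD = "cloud"
    ON_PREM = "on-prem"
    HYBRID = "hybrid"


@dataclass(frozen=True)
class ModeProfile:
    platform_location: str  # where the agent-platform runs
    llm_location: str       # where LLM inference happens
    data_at_rest: str       # where stored data lives


# Hypothetical summary table; values mirror the descriptions in this page.
MODE_PROFILES = {
    DeploymentMode.CLOUD: ModeProfile("cloud", "cloud", "cloud"),
    DeploymentMode.ON_PREM: ModeProfile("on-prem", "on-prem", "on-prem"),
    DeploymentMode.HYBRID: ModeProfile("on-prem", "cloud", "on-prem"),
}


def llm_calls_leave_customer_infra(mode: DeploymentMode) -> bool:
    """Return True when inference happens outside the customer's infrastructure."""
    return MODE_PROFILES[mode].llm_location == "cloud"
```

A lookup like this makes the security-review answer easy to state: only on-prem keeps both stored data and inference inside the customer's environment, while hybrid keeps stored data local and sends LLM calls out.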

Choosing the right mode

Common customer situations

Customer says	Recommend	Why
“We’re fine with cloud”	Cloud	Fastest, best models
“Our security team won’t approve cloud LLMs”	On-prem	Full control, open-source models
“Data must stay in our VPC but we want GPT-4o”	Hybrid	Best of both: data stays local, LLM calls route to Azure
“We’re a bank in the Philippines” (BDO pattern)	Start hybrid, assess on-prem	Banking regulators focus on data at rest; sending data in transit to cloud inference is usually acceptable with DPAs in place
“We’re in the EU / Middle East”	Cloud (regional) or hybrid	Check the specific regulation; cloud with regional hosting often suffices
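
The first three rows of the table reduce to two questions: are cloud LLM calls allowed, and must stored data stay in the customer's environment? The sketch below encodes that as a first-pass heuristic. The function name `recommend_mode` and its parameters are hypothetical, and the regional and banking rows still need a case-by-case review that this code does not attempt.

```python
def recommend_mode(cloud_llm_allowed: bool, data_must_stay_local: bool) -> str:
    """First-pass recommendation; a simplification of the table above."""
    if not cloud_llm_allowed:
        # Security team rejects cloud LLM calls: only on-prem satisfies this.
        return "on-prem"
    if data_must_stay_local:
        # Stored data stays local, inference may go to a cloud provider.
        return "hybrid"
    # No residency constraint: the default mode.
    return "cloud"


if __name__ == "__main__":
    print(recommend_mode(cloud_llm_allowed=True, data_must_stay_local=True))
```

Checking the cloud-LLM constraint first reflects that it is the stricter requirement: if cloud inference is ruled out, the residency answer no longer changes the recommendation.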

Infrastructure details (cloud mode)

The agent-platform runs as three ECS services:

Service	Purpose
API	Handles inbound requests (webhooks, REST), routes to supervisor agent
Worker	Executes agent workflows, tool calls, LLM inference
Scheduler	Runs scheduled/cron-based workflows (e.g., Maya’s monitoring loops)

Each service runs with `desiredCount >= 2` for high availability. New Relic monitors the health checks and raises automated alerts.
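
The high-availability rule is simple to verify against a set of service definitions. The following is a sketch under stated assumptions: the service names (`api`, `worker`, `scheduler`), the input shape (a mapping of service name to desired count), and the function `services_below_ha_minimum` are hypothetical and not taken from the actual ECS configuration.

```python
MIN_DESIRED_COUNT = 2  # the high-availability minimum stated above

# Hypothetical service names for the three-service split.
REQUIRED_SERVICES = ("api", "worker", "scheduler")


def services_below_ha_minimum(
    desired_counts: dict[str, int],
    minimum: int = MIN_DESIRED_COUNT,
) -> list[str]:
    """Return required services whose desired count is missing or below the minimum."""
    return [
        name
        for name in REQUIRED_SERVICES
        if desired_counts.get(name, 0) < minimum
    ]


if __name__ == "__main__":
    # Example input: the scheduler is under-provisioned.
    print(services_below_ha_minimum({"api": 2, "worker": 3, "scheduler": 1}))
```

A missing service is treated as a count of zero, so an incomplete definition is flagged the same way as an under-provisioned one.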

Sources

Slack: #team-ai — deployment and infrastructure discussions
See Architecture overview for how deployment fits in the platform
See Security & compliance for data residency details
Voice Agent Cost Structure & Deployment

Changelog

26 May 2026: Full content from Slack engineering discussions and architecture research.
