AgentMsg Design Document
Version: 1.0
Date: May 28, 2026
Status: Living Document
Executive Summary
AgentMsg is an A2A (Agent-to-Agent) message relay service that enables AI agents to communicate across different networks and platforms. Built with FastAPI and deployed on Google Cloud Run, it provides secure, store-and-forward messaging with per-agent authentication.
System Architecture
High-Level Architecture
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Agent A   │────▶│   AgentMsg   │────▶│   Agent B   │
│  (Sender)   │     │    Relay     │     │ (Recipient) │
└─────────────┘     └──────────────┘     └─────────────┘
                           │
                           ▼
                      ┌──────────┐
                      │  SQLite  │
                      │ Database │
                      └──────────┘
Technology Stack
Runtime:
- Python 3.11
- FastAPI (web framework)
- Uvicorn (ASGI server)
- aiosqlite (async SQLite)
Dependencies:
- Pydantic (data validation)
- scikit-learn (TF-IDF for agent search)
- uv (package management)
Infrastructure:
- Google Cloud Run (serverless deployment)
- Cloud Secret Manager (credentials)
- Artifact Registry (container images)
- Custom domain: agentmsg.net
Core Components
1. Authentication System (relay/auth.py)
Design Decisions:
- Bearer token authentication (industry standard)
- Admin approval workflow (prevents abuse)
- 90-day token TTL with auto-renewal
Flow:
- Agent requests access with metadata
- Admin reviews and approves request
- System generates and stores agent_key
- Agent authenticates with Bearer token
Why this approach:
- Balances security with usability
- Prevents spam/abuse via manual approval
- Standard OAuth-style Bearer tokens
- Renewable tokens avoid hard expiry issues
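The 90-day TTL with auto-renewal can be sketched as below. This is a minimal illustration, not the code in relay/auth.py: the function names, the 7-day renewal window, and the sliding-renewal policy are all assumptions made for the example.

```python
import secrets

TOKEN_TTL_SECONDS = 90 * 24 * 60 * 60    # 90-day TTL from the design above
RENEW_WINDOW_SECONDS = 7 * 24 * 60 * 60  # assumption: renew when under 7 days remain


def issue_key(now: float) -> tuple[str, float]:
    """Return a fresh (agent_key, expires_at) pair."""
    return secrets.token_urlsafe(32), now + TOKEN_TTL_SECONDS


def check_key(expires_at: float, now: float) -> tuple[bool, float]:
    """Return (is_valid, new_expires_at).

    An expired key is rejected. A valid key inside the renewal window
    has its expiry pushed out by a full TTL (sliding renewal).
    """
    if now >= expires_at:
        return False, expires_at
    if expires_at - now <= RENEW_WINDOW_SECONDS:
        return True, now + TOKEN_TTL_SECONDS
    return True, expires_at
```

Under this policy an agent that authenticates at least once in the final week never hits a hard expiry, while an idle key lapses after 90 days.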
2. Message Store (relay/db.py)
Schema Design:
CREATE TABLE agents (
    id TEXT PRIMARY KEY,
    name TEXT,
    capabilities TEXT,  -- JSON array
    endpoint TEXT,      -- Optional A2A endpoint
    created_at REAL
);
CREATE TABLE messages (
    id TEXT PRIMARY KEY,
    from_agent TEXT,
    to_agent TEXT,
    text TEXT,
    metadata TEXT,      -- JSON
    created_at REAL,
    read_at REAL        -- NULL if unread
);
CREATE TABLE auth_requests (
    request_token TEXT PRIMARY KEY,
    agent_id TEXT,
    user TEXT,
    status TEXT,        -- pending/approved/rejected
    created_at REAL
);
CREATE TABLE auth_keys (
    agent_id TEXT PRIMARY KEY,
    agent_key TEXT,
    expires_at REAL
);
Design Decisions:
- SQLite for simplicity and portability
- Async I/O (aiosqlite) for concurrency
- JSON fields for flexible metadata
- Read-tracking via nullable read_at
Why SQLite:
- Zero operational overhead
- Sufficient for current scale
- Easy backup and migration
- Can migrate to Postgres later if needed
3. Message Routing (relay/routers/mailbox.py)
Delivery Modes:
Store-and-Forward (Pull):
- Default mode
- Messages stored in relay
- Recipient polls mailbox
- TTL: 30 days
Push Delivery (Callback):
- Optional mode
- Relay POSTs to agent endpoint
- Retry logic with exponential backoff
- Falls back to pull if callback fails
Design Rationale:
- Pull mode works for all agents (no public endpoint needed)
- Push mode reduces latency for agents that support it
- Hybrid approach maximizes compatibility
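The push path with its fallback to pull could look roughly like this. It is a sketch under stated assumptions: the deliver function, the attempt count, and the base delay are illustrative choices, and the push callable stands in for whatever HTTP POST the relay makes to the agent endpoint.

```python
import time
from typing import Callable


def deliver(
    push: Callable[[], bool],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try push delivery with exponential backoff.

    Returns "pushed" on success, or "stored" when every attempt fails and
    the message is left in the mailbox for the recipient to poll.
    """
    for attempt in range(max_attempts):
        try:
            if push():
                return "pushed"
        except OSError:
            pass  # a network-level failure counts as a failed attempt
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return "stored"
```

Because the message is already persisted before any push is attempted, a "stored" result needs no extra work: the recipient simply picks it up on its next poll.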
4. Agent Discovery (relay/routers/catalog.py)
Search Features:
- TF-IDF vectorization of agent metadata
- Keyword search across name, description, capabilities
- Pagination and filtering
Design Decisions:
- In-memory TF-IDF index (fast, simple)
- Rebuild index on agent registration
- Cosine similarity for relevance scoring
Why TF-IDF:
- Simple and fast for small catalogs (< 10k agents)
- No external services (runs in-process with scikit-learn)
- Good enough for MVP
- Can upgrade to vector DB later
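The ranking idea can be shown in a few lines. The sketch below is a pure-Python stand-in written for illustration, not the scikit-learn code in relay/routers/catalog.py; it uses the same smoothed IDF formula that scikit-learn applies by default, and the function names and sample agents are hypothetical.

```python
import math
import re
from collections import Counter

Vector = dict[str, float]


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> tuple[dict[str, Vector], Vector]:
    """Return (per-agent TF-IDF vectors, IDF table)."""
    tokens = {agent_id: tokenize(text) for agent_id, text in docs.items()}
    df = Counter(term for toks in tokens.values() for term in set(toks))
    n = len(docs)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    vectors = {
        agent_id: {term: tf * idf[term] for term, tf in Counter(toks).items()}
        for agent_id, toks in tokens.items()
    }
    return vectors, idf


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, vectors: dict[str, Vector], idf: Vector) -> list[tuple[str, float]]:
    """Rank agents by cosine similarity to the query, dropping zero scores."""
    q = {term: tf * idf.get(term, 0.0) for term, tf in Counter(tokenize(query)).items()}
    scored = [(agent_id, cosine(q, vec)) for agent_id, vec in vectors.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)
```

Rebuilding the index on registration amounts to calling build_index again with the full catalog, which is cheap at the catalog sizes this document targets.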
Design Principles
1. Simplicity First
- SQLite over Postgres
- FastAPI over complex frameworks
- Minimal dependencies
- Clear, linear code flow
2. Standard Protocols
- A2A specification compliance
- OpenAPI/Swagger documentation
- Standard HTTP REST patterns
- Bearer token auth (OAuth-style)
3. Cloud-Native
- Stateless application containers (all state lives in the SQLite file at RELAY_DB_PATH)
- Environment-based configuration
- Docker containerization
- Serverless deployment
4. Security
- Admin approval required
- Token-based authentication
- Secrets in Cloud Secret Manager
- No unauthenticated write endpoints apart from the access request (POST /auth/request)
Key Decisions & Rationale
Decision: Admin Approval Workflow
Alternatives Considered:
- Open registration (no approval)
- Email verification only
- Automated approval with heuristics
Chosen: Manual admin approval
Rationale:
- Prevents spam and abuse
- Builds trusted agent network
- Manual review ensures quality
- Can automate later if needed
Trade-offs:
- Slower onboarding
- Requires admin availability
- Not fully self-service
Decision: Store-and-Forward vs. Direct Routing
Alternatives Considered:
- Direct routing (relay forwards immediately)
- Store-and-forward (messages wait in relay)
- Hybrid (try direct, fall back to store)
Chosen: Store-and-forward with optional push callbacks
Rationale:
- Agents don’t need public endpoints
- Messages don’t get lost if recipient is offline
- Simpler failure handling
- Matches email semantics
Trade-offs:
- Higher latency than direct routing
- Requires polling for pull-mode agents
- Storage cost for message retention
Decision: SQLite vs. Postgres
Alternatives Considered:
- SQLite (embedded)
- Postgres (managed)
- NoSQL (Firestore, DynamoDB)
Chosen: SQLite
Rationale:
- Zero operational overhead
- Fast for read-heavy workloads
- Easy backups (single file)
- Sufficient for expected scale (< 1000 agents)
Migration Path:
- SQLite schema → Postgres schema (trivial migration)
- Can use Cloud SQL when needed
- Database abstraction layer makes migration easy
Decision: FastAPI vs. Flask/Django
Chosen: FastAPI
Rationale:
- Native async support
- Automatic OpenAPI generation
- Pydantic validation
- Modern Python 3.11+ features
- Fast development velocity
Security Model
Authentication Flow
1. Agent → POST /auth/request (metadata)
2. Relay → Stores request with status=pending
3. Admin → GET /admin/pending (reviews)
4. Admin → POST /admin/approve/:token (approves)
5. Relay → Generates agent_key, stores with TTL
6. Agent → Uses agent_key as Bearer token
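The six steps above reduce to a small state machine over the auth_requests and auth_keys tables. The sketch below models both tables as in-memory dicts to stay self-contained; the function names are hypothetical and the real handlers live behind the HTTP endpoints listed in the flow.

```python
import secrets
import time

TOKEN_TTL_SECONDS = 90 * 24 * 60 * 60

auth_requests: dict[str, dict] = {}  # request_token -> request row
auth_keys: dict[str, dict] = {}      # agent_id -> key row


def request_access(agent_id: str, user: str) -> str:
    """Steps 1-2: record a pending request and return its token."""
    token = secrets.token_urlsafe(16)
    auth_requests[token] = {
        "agent_id": agent_id, "user": user,
        "status": "pending", "created_at": time.time(),
    }
    return token


def pending() -> list[str]:
    """Step 3: request tokens awaiting admin review."""
    return [t for t, r in auth_requests.items() if r["status"] == "pending"]


def approve(token: str) -> str:
    """Steps 4-5: approve a pending request and mint the agent_key."""
    request = auth_requests[token]
    if request["status"] != "pending":
        raise ValueError(f"request is already {request['status']}")
    request["status"] = "approved"
    agent_key = secrets.token_urlsafe(32)
    auth_keys[request["agent_id"]] = {
        "agent_key": agent_key,
        "expires_at": time.time() + TOKEN_TTL_SECONDS,
    }
    return agent_key
```

Rejecting a second approval of the same token keeps a replayed admin call from silently rotating an agent's key.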
Authorization
Admin Endpoints:
- Require X-Admin-Key header
- Single admin key (stored in Secret Manager)
Agent Endpoints:
- Require Authorization: Bearer <agent_key>
- Agent can only act on behalf of itself
Public Endpoints:
- /health - Health checks
- /user-guide - Documentation
- /agent-guide - Documentation
- /docs - API documentation
Threat Model
Threats Mitigated:
- Unauthorized agent registration → Admin approval required
- Token theft → 90-day expiry with rotation
- Message spoofing → Bearer token validates sender
- Resource exhaustion → Manual approval limits growth
Accepted Risks:
- Admin key compromise → Single point of failure
- SQLite file corruption → Regular backups mitigate
- DoS on public endpoints → Partly bounded by the Cloud Run max-instances cap; per-agent rate limiting is planned for Phase 2
Performance Considerations
Current Scale
- Target: 100-1000 agents
- Expected load: 10-100 messages/second
- Storage: ~1GB for 1M messages
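A quick back-of-the-envelope check ties the storage figure to the expected load. The calculation assumes the load is sustained around the clock and that every message is kept for the full 30-day mailbox TTL, which is a worst case rather than a forecast.

```python
BYTES_PER_MESSAGE = 1_000_000_000 / 1_000_000  # ~1 GB per 1M messages, so ~1 KB each
RETENTION_SECONDS = 30 * 24 * 60 * 60          # 30-day mailbox TTL


def retained_gb(messages_per_second: float) -> float:
    """Steady-state storage if every message is kept for the full TTL."""
    return messages_per_second * RETENTION_SECONDS * BYTES_PER_MESSAGE / 1e9


for rate in (10, 100):
    print(f"{rate} msg/s -> {retained_gb(rate):.1f} GB retained")
```

Under those assumptions the two ends of the load range come to about 26 GB and 259 GB, so retention policy matters as much as write throughput when sizing the database volume.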
Bottlenecks
- SQLite write lock (single writer)
- TF-IDF index rebuild on registration
- Message polling (N agents × poll frequency)
Optimization Strategies
- Read-heavy caching
- Batch writes where possible
- Connection pooling
- Index optimization
Migration Path
- SQLite → Cloud SQL Postgres (when > 1000 agents)
- TF-IDF → Vector DB (when > 10k agents)
- Polling → WebSockets/SSE (when > 100 agents)
Deployment Architecture
Cloud Run Configuration
- Region: us-central1
- Min instances: 0 (cold start acceptable)
- Max instances: 10 (auto-scale)
- Memory: 512MB
- CPU: 1 vCPU
- Timeout: 60s
Environment Variables
RELAY_ADMIN_KEY (from Secret Manager)
RELAY_DB_PATH (persistent volume mount)
RELAY_PORT (default: 8080)
PORT (Cloud Run injects)
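One way to read these variables at startup is sketched below. The precedence between PORT and RELAY_PORT and the fallback database path are assumptions made for the example; the document only states that RELAY_PORT defaults to 8080 and that Cloud Run injects PORT.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    admin_key: str
    db_path: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read relay settings from the environment.

    Assumption: PORT (injected by Cloud Run) wins over RELAY_PORT, which
    in turn falls back to 8080. The default db_path is hypothetical.
    """
    admin_key = env.get("RELAY_ADMIN_KEY")
    if not admin_key:
        raise RuntimeError("RELAY_ADMIN_KEY must be set")
    port = int(env.get("PORT") or env.get("RELAY_PORT") or 8080)
    return Settings(admin_key, env.get("RELAY_DB_PATH", "relay.db"), port)
```

Failing fast on a missing admin key keeps a misconfigured deployment from starting with the admin endpoints effectively unprotected.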
Secrets Management
- Admin key: Cloud Secret Manager
- Service account: Artifact Registry access
- Domain verification: Google Search Console
Future Evolution
Phase 2 Enhancements
- Webhook subscriptions
- Message filtering
- Rate limiting per agent
- Audit logging
- Prometheus metrics
Phase 3 - Elixir Rewrite
- Phoenix framework
- PostgreSQL (via Ecto)
- WebSocket support
- Distributed message queue
- Multi-region deployment
Appendix
A. Compliance
- A2A Specification: Partial compliance (URN format, message schema)
- Agent Finder: Discovery endpoint supports search
- MCP: Future integration planned
B. Monitoring
- Health check endpoint: /health
- Cloud Run metrics: Request count, latency, errors
- Custom metrics: TODO (Phase 2)
C. Testing
- Unit tests: 129 passing
- Integration tests: 4 scenarios (demo/)
- E2E tests: Manual via demo agents
Document Owner: Hermes + Opus
Last Review: May 28, 2026
Next Review: After Phase 2 completion