AgentMsg

A2A Relay — Architecture Decision Record

Date: 2026-05-27
Status: Accepted
Author: Hermes (Alan Blount, sponsor)


Problem Statement

We need a relay/mailbox service so that any AI agent (Hermes, OpenClaw, Claude Code, Antigravity, etc.) can send and receive A2A messages regardless of NAT, firewalls, or whether the target agent is online at the time of sending. The service must:

  1. Allow agents behind NAT to receive inbound A2A tasks via polling or push
  2. Allow any agent to send A2A tasks to named agents without knowing their IP
  3. Be deployable to Cloud Run (zero-ops, scales to zero)
  4. Be AI Catalog / Agent Finder compliant for federated discovery
  5. Be testable with mocked LLM responses (BDD + TDD, record/replay)
  6. Have a clean path to add chat platform bridges later

Decision: Build a Thin Python Service — Do Not Adopt Any Existing Repo As-Is

Options Evaluated

❌ agentgateway/agentgateway — SKIP as primary

⚠️ eliasecchig/a2a-gateway — Adopt patterns, not codebase

❌ s-hiraoku/synapse-a2a — SKIP

✅ Build: a2a-relay — Thin Python FastAPI mailbox + relay


Architecture

┌──────────────────────────────────────────────────────────────────┐
│                        A2A RELAY SERVICE                         │
│                   (Cloud Run · public HTTPS URL)                 │
│                                                                  │
│  ┌────────────────┐    ┌─────────────────┐    ┌──────────────┐  │
│  │  A2A Inbound   │    │  Message Store  │    │ Agent Finder │  │
│  │  POST /tasks   │───▶│  (SQLite/PgSQL) │    │  Catalog     │  │
│  │  (from senders)│    │  per-agent      │    │  /.well-known│  │
│  └────────────────┘    │  mailbox        │    │  POST /search│  │
│                        └────────┬────────┘    └──────────────┘  │
│  ┌────────────────┐             │                                │
│  │  Agent Poll    │◀────────────┘                                │
│  │  GET /mailbox  │   (pull: agent polls when online)            │
│  │  /{agent_id}   │                                              │
│  └────────────────┘                                              │
│                                                                  │
│  ┌────────────────┐    ┌─────────────────────────────────────┐  │
│  │  Push Webhook  │───▶│  Outbound delivery to agent webhook │  │
│  │  (if agent has │    │  (A2A message/send to callback URL) │  │
│  │  registered    │    │  with exponential backoff retry     │  │
│  │  callback URL) │    └─────────────────────────────────────┘  │
│  └────────────────┘                                              │
│                                                                  │
│  ┌────────────────┐                                              │
│  │  Agent Registry│  POST /agents/register                       │
│  │  (who exists,  │  GET  /agents/{agent_id}/card               │
│  │  what caps,    │  (stores A2A agent card + callback URL)      │
│  │  callback URL) │                                              │
│  └────────────────┘                                              │
└──────────────────────────────────────────────────────────────────┘
         ▲                              ▲
         │ send task                    │ poll/receive
    ┌────┴─────┐                   ┌───┴──────┐
    │  Hermes  │                   │ OpenClaw │
    │ (NAT OK) │                   │ (NAT OK) │
    └──────────┘                   └──────────┘
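
The push path in the diagram retries webhook delivery with exponential backoff. The following is a minimal sketch of that retry loop, not the relay's actual delivery code. The names `backoff_delays` and `deliver_with_backoff` are hypothetical, the injected `send` callable stands in for the real outbound A2A call, and the base delay, cap, and attempt count are illustrative values that this ADR does not specify.

```python
import asyncio
from typing import Awaitable, Callable


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0, attempts: int = 5) -> list[float]:
    """Delays to sleep after each failed attempt except the last one.

    The delay grows geometrically (base, base*factor, ...) and is capped.
    All default values are assumptions for illustration.
    """
    return [min(cap, base * factor ** i) for i in range(attempts - 1)]


async def deliver_with_backoff(
    send: Callable[[], Awaitable[bool]],
    delays: list[float],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> bool:
    """Call send() until it succeeds or the retry budget is exhausted.

    len(delays) + 1 attempts are made in total. `sleep` is injectable so
    tests can run without real waiting.
    """
    for delay in [*delays, None]:
        try:
            if await send():
                return True
        except Exception:
            # Treat transport errors the same as a rejected delivery.
            pass
        if delay is None:
            return False  # last attempt failed, give up
        await sleep(delay)
    return False
```

Injecting `send` and `sleep` keeps the retry policy testable without a network or a clock, which fits the mocked-response testing requirement in the problem statement.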

Key Design Choices

1. Mailbox Pattern (Async, Durable)

2. Agent Registration

3. A2A Protocol Compliance

4. Agent Finder Compliance

5. Message Store: SQLite (Cloud Run friendly)

6. Testing Strategy


API Surface (MVP)

# Agent Management
POST   /agents/register          Register agent card + optional callback
GET    /agents/{id}/card         Fetch stored agent card
DELETE /agents/{id}              Deregister

# Mailbox (NAT-safe polling)
GET    /mailbox/{agent_id}       Poll: returns pending messages (cursor-based)
POST   /mailbox/{agent_id}/ack   Acknowledge message(s)

# A2A Relay Endpoint
POST   /a2a                      A2A JSON-RPC (relay's own A2A endpoint)
GET    /.well-known/agent.json   Relay's A2A agent card

# Agent Finder / AI Catalog
GET    /.well-known/ai-catalog.json   AI Catalog manifest
POST   /search                        Search over registered agents (TF-IDF, see services/search.py)

# Health
GET    /health                   Liveness
GET    /ready                    Readiness (DB connected)
GET    /metrics                  Prometheus metrics
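
The cursor-based poll and the ack endpoint above can be sketched against a small SQLite table. This is an illustration under stated assumptions: it uses the standard-library `sqlite3` module instead of `aiosqlite` to stay self-contained, and the table layout, the function names (`open_store`, `enqueue`, `poll`, `ack`), and the `{"messages": [...], "cursor": N}` response shape are hypothetical, not taken from this ADR.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the (assumed) per-agent mailbox table."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               agent_id TEXT NOT NULL,
               payload TEXT NOT NULL,
               acked INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return db


def enqueue(db: sqlite3.Connection, agent_id: str, task: dict) -> int:
    """Store an inbound task in the target agent's mailbox; return its id."""
    cur = db.execute(
        "INSERT INTO messages (agent_id, payload) VALUES (?, ?)",
        (agent_id, json.dumps(task)),
    )
    db.commit()
    return cur.lastrowid


def poll(db: sqlite3.Connection, agent_id: str,
         cursor: int = 0, limit: int = 50) -> dict:
    """Return unacknowledged messages with id > cursor, oldest first."""
    rows = db.execute(
        "SELECT id, payload FROM messages "
        "WHERE agent_id = ? AND acked = 0 AND id > ? ORDER BY id LIMIT ?",
        (agent_id, cursor, limit),
    ).fetchall()
    messages = [{"id": mid, "task": json.loads(body)} for mid, body in rows]
    # The next cursor is the last id returned, or unchanged if nothing new.
    next_cursor = messages[-1]["id"] if messages else cursor
    return {"messages": messages, "cursor": next_cursor}


def ack(db: sqlite3.Connection, agent_id: str, message_ids: list[int]) -> int:
    """Mark messages as acknowledged; return how many rows changed."""
    if not message_ids:
        return 0
    marks = ",".join("?" * len(message_ids))
    cur = db.execute(
        f"UPDATE messages SET acked = 1 "
        f"WHERE agent_id = ? AND acked = 0 AND id IN ({marks})",
        (agent_id, *message_ids),
    )
    db.commit()
    return cur.rowcount
```

Scoping the ack by `agent_id` means one agent cannot acknowledge another agent's messages, and polling again from an older cursor redelivers anything still unacknowledged.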

Non-Goals (MVP)


Tech Stack

Component      Choice                                                  Rationale
-------------  ------------------------------------------------------  ------------------------------------------------
Language       Python 3.12                                             ADK ecosystem, a2a-sdk, team familiarity
Web framework  FastAPI 0.115+                                          Async-native, Pydantic v2, OpenAPI free
A2A SDK        a2a-sdk>=1.0                                            Google’s official; same as a2a-gateway
Message store  SQLite (aiosqlite)                                      Zero-ops, Cloud Run compatible, swap to PG later
HTTP client    httpx (async)                                           Standard async HTTP; same as a2a-gateway
Testing        pytest-asyncio + respx + pytest-bdd + pytest-recording  Record/replay, BDD features, async
Container      Python 3.12 slim                                        Minimal, Cloud Run native
CI/CD          GitHub Actions → GHCR → Cloud Run                       Free, automated
Config         Pydantic Settings (env vars + .env)                     12-factor, Cloud Run compatible

File Structure

a2a-relay/
├── relay/
│   ├── __init__.py
│   ├── main.py           # FastAPI app factory
│   ├── config.py         # Pydantic settings
│   ├── models.py         # DB models (SQLite schema)
│   ├── db.py             # aiosqlite connection + migrations
│   ├── api/
│   │   ├── agents.py     # /agents/* endpoints
│   │   ├── mailbox.py    # /mailbox/* endpoints
│   │   ├── a2a.py        # /a2a relay endpoint + agent card
│   │   └── catalog.py    # /.well-known/* + /search
│   ├── services/
│   │   ├── router.py     # Route incoming A2A task → target mailbox
│   │   ├── delivery.py   # Webhook push delivery + retry
│   │   └── search.py     # Catalog search (TF-IDF)
│   └── a2a_client.py     # Outbound A2A calls (wraps httpx)
├── tests/
│   ├── conftest.py
│   ├── features/         # Gherkin .feature files (BDD)
│   │   ├── relay.feature
│   │   ├── mailbox.feature
│   │   └── catalog.feature
│   ├── unit/
│   │   ├── test_router.py
│   │   ├── test_delivery.py
│   │   └── test_search.py
│   ├── integration/
│   │   ├── test_agents_api.py
│   │   ├── test_mailbox_api.py
│   │   ├── test_a2a_api.py
│   │   └── test_catalog_api.py
│   └── cassettes/        # HTTP cassettes recorded by pytest-recording (VCR.py)
├── docs/
│   ├── ARCHITECTURE.md   # This file
│   └── DEPLOYMENT.md
├── .github/
│   └── workflows/
│       ├── test.yml
│       └── deploy.yml
├── Dockerfile
├── Makefile
├── pyproject.toml
└── README.md
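
The tree lists `services/search.py` as a TF-IDF catalog search. A minimal sketch of that ranking over agent descriptions follows, using only the standard library. The function names, the tokenizer, and the smoothed IDF formula are assumptions made for illustration; the ADR does not say which variant or library the relay uses.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query: str, agents: dict[str, str],
           top_k: int = 5) -> list[tuple[str, float]]:
    """Rank agents (id -> description) by summed TF-IDF of the query terms."""
    docs = {aid: Counter(tokenize(desc)) for aid, desc in agents.items()}
    n = len(docs)
    # Document frequency: in how many descriptions each term appears.
    df = Counter(term for counts in docs.values() for term in counts)
    query_terms = set(tokenize(query))

    scores: dict[str, float] = {}
    for aid, counts in docs.items():
        total = sum(counts.values()) or 1
        score = 0.0
        for term in query_terms:
            if term in counts:
                tf = counts[term] / total
                # Smoothed IDF so a term found in every document still counts.
                idf = math.log((1 + n) / (1 + df[term])) + 1.0
                score += tf * idf
        if score > 0:
            scores[aid] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

TF-IDF matches on shared words only, so a query for "mail" would not find an agent described as handling "messages".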

Deployment: Cloud Run

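# One-time setup
# Note: DATABASE_URL=:memory: keeps the mailbox in process memory only, so
# queued messages are lost whenever the instance restarts or scales to zero.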
gcloud run deploy a2a-relay \
  --image ghcr.io/<owner>/a2a-relay:main \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars "RELAY_API_KEY=<secret>,DATABASE_URL=:memory:" \
  --min-instances 0 \
  --max-instances 3 \
  --memory 512Mi \
  --port 8080

The Cloud Run URL becomes the relay’s public address.
Agents register with POST /agents/register including their callback URL (if any).
Agents behind NAT poll GET /mailbox/{agent_id} as often as they want.
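
On the agent side, one poll cycle is: fetch pending messages, handle each one, then acknowledge them. The sketch below shows that cycle under stated assumptions. `poll_once` is a hypothetical helper, the `fetch` and `ack` callables stand in for the HTTP calls to GET /mailbox/{agent_id} and POST /mailbox/{agent_id}/ack, and the `{"messages": [{"id", "task"}], "cursor": N}` response shape is assumed, not specified by this ADR.

```python
from typing import Callable


def poll_once(
    fetch: Callable[[int], dict],
    ack: Callable[[list[int]], None],
    handle: Callable[[dict], None],
    cursor: int = 0,
) -> int:
    """Run one poll cycle and return the cursor to use for the next one.

    fetch(cursor) returns the assumed page shape described above.
    """
    page = fetch(cursor)
    done: list[int] = []
    for message in page["messages"]:
        handle(message["task"])
        done.append(message["id"])
    # Ack only after the whole page was handled: if handle() raises,
    # nothing is acknowledged and the page can be fetched again.
    if done:
        ack(done)
    return page["cursor"]
```

Acknowledging after handling gives at-least-once delivery, so `handle` should tolerate seeing the same task twice.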


Phase 2 (After MVP)