A2A Relay — Project Notes & Research Log
This is the living research document for the a2a-relay project.
Architecture decisions, alternatives considered, lessons learned, and future roadmap.
What We Built
A store-and-forward message relay for A2A (Agent-to-Agent) communication. Agents register, get approved, and exchange messages via a central mailbox API.
Not a streaming RPC framework. Not a pub/sub bus. A simple async mailbox.
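To make the mailbox semantics concrete, here is a minimal in-memory sketch of store-and-forward delivery. The Mailbox class and its send, poll, and ack methods are illustrative assumptions that mirror the verbs of the /messages/* router; they are not the relay's actual code, which persists to SQLite rather than a dict.

```python
import itertools
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StoredMessage:
    id: int
    sender: str
    recipient: str
    body: str
    acked: bool = False


class Mailbox:
    """Hypothetical in-memory mailbox: messages wait until the recipient polls."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._boxes = defaultdict(list)  # recipient -> list of StoredMessage

    def send(self, sender: str, recipient: str, body: str) -> int:
        # Store the message; the recipient does not need to be online.
        msg = StoredMessage(next(self._ids), sender, recipient, body)
        self._boxes[recipient].append(msg)
        return msg.id

    def poll(self, recipient: str) -> list:
        # Return everything not yet acknowledged, oldest first.
        return [m for m in self._boxes[recipient] if not m.acked]

    def ack(self, recipient: str, message_id: int) -> bool:
        # Mark one message as handled so later polls skip it.
        for m in self._boxes[recipient]:
            if m.id == message_id and not m.acked:
                m.acked = True
                return True
        return False
```

The sender returns as soon as the message is stored, which is the property that separates this design from a streaming RPC call.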
Alternatives Considered (and why we didn’t use them)
agentgateway (Google)
- Full gRPC+protobuf stack, overkill for demo
- Requires proto compilation, complex routing config
- Would have taken days to customize
a2a-gateway (community)
- Abandoned / incomplete at time of evaluation
- No auth model, no mailbox semantics
Google Cloud Pub/Sub
- Would work but adds GCP dependency to every agent
- Agents need GCP credentials, not just a Bearer token
- Not portable for external/third-party agents
Decision: Custom FastAPI relay
- 39 files, 131 tests, 3 days to working MVP
- Agents only need HTTP + a Bearer token
- SQLite is fine for single-instance demo scale
- Full control over auth model and message schema
Auth Design: Option B (Per-Agent API Keys)
Agents self-register (POST /agents/register) → admin approves (POST /admin/approve/{token}) → agent uses Bearer token for all subsequent calls.
Key implementation detail: approval auto-creates the agent record in the registry. Early versions had a bug where the token was approved but the agent wasn’t in the main registry, causing “Agent not registered” errors even with valid Bearer tokens. Fixed in relay/routers/admin.py.
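The register, approve, authenticate flow and the fix described above can be sketched as follows. This is an in-memory illustration under stated assumptions: the Registry class and its method names are hypothetical and do not reflect the actual code in relay/routers/admin.py.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Hypothetical in-memory stand-in for the relay's token and agent tables."""

    pending: dict = field(default_factory=dict)   # token -> agent_id
    approved: dict = field(default_factory=dict)  # token -> agent_id
    agents: set = field(default_factory=set)      # main agent registry

    def register(self, agent_id: str) -> str:
        # Self-registration: issue a token that is useless until approved.
        token = secrets.token_urlsafe(32)
        self.pending[token] = agent_id
        return token

    def approve(self, token: str) -> None:
        # Approving the token and creating the agent record happen together,
        # so a valid token can never point at an unregistered agent.
        agent_id = self.pending.pop(token)  # KeyError if the token is unknown
        self.approved[token] = agent_id
        self.agents.add(agent_id)

    def authenticate(self, authorization: str) -> str:
        # Expect the value of an "Authorization: Bearer <token>" header.
        scheme, _, token = authorization.partition(" ")
        agent_id = self.approved.get(token) if scheme == "Bearer" else None
        if agent_id is None:
            raise PermissionError("invalid or unapproved token")
        if agent_id not in self.agents:
            raise LookupError("Agent not registered")
        return agent_id
```

The early bug corresponds to an approve step that updated the approved tokens but skipped the agents registry, which is exactly the case where authenticate raises "Agent not registered".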
Infra Decisions
SQLite over PostgreSQL
- Cloud Run is stateless: /tmp/relay.db resets on deploy
- For the demo, that’s fine (agents re-register)
- For production: Cloud SQL (PostgreSQL) or Firestore
- Migration path: swap relay/db/sqlite.py for a Cloud SQL adapter behind the same AsyncDB interface
Single Cloud Run instance
- --min-instances=1 --max-instances=1 recommended while on SQLite
- Multi-instance would need sticky routing or shared DB
- SQLite is not network-accessible across containers
uv over pip
- Faster, lockfile-based, reproducible
- uv sync --extra dev for development
- uv run for all commands, no venv activation needed
- Dockerfile uses pip install uv then uv sync --no-dev for production layer
Deployment Lessons Learned
1. python -m relay.main never started the server on Cloud Run
When the container CMD is python -m relay.main, Python runs relay/main.py as __main__, so an if __name__ == "__main__": block does fire correctly. The real issue was that uvicorn was never invoked: we had app = FastAPI() at module level and nothing calling uvicorn.run(). Fix: use uvicorn directly as the CMD.
2. Shell expansion in Dockerfile CMD (exec form)
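CMD ["uvicorn", "--port", "${PORT:-8080}"]: exec form does not run a shell, so ${PORT:-8080} is passed literally, not expanded. Cloud Run injects PORT (8080 by default), so hardcoding 8080 works as long as the service's container port is left at the default.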
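Where hardcoding the port is not wanted, one alternative is to resolve it in Python instead of relying on shell expansion. A minimal sketch: resolve_port is a hypothetical helper, not part of the relay, and the uvicorn call in the trailing comment assumes the app object is importable as relay.main:app.

```python
import os


def resolve_port(default: int = 8080) -> int:
    """Read PORT from the environment, falling back to a default."""
    raw = os.environ.get("PORT", "").strip()
    if not raw:
        return default
    port = int(raw)  # raises ValueError on non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"PORT out of range: {port}")
    return port


# Possible use from an entry point (assumes uvicorn is installed and that
# the app lives at relay.main:app):
#   uvicorn.run("relay.main:app", host="0.0.0.0", port=resolve_port())
```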
3. SA permissions for Cloud Build
Needs both roles/cloudbuild.builds.editor AND roles/storage.objectAdmin. objectCreator alone is not enough (Build needs to read back the source tarball from GCS).
4. Cloud Run service agent ≠ your SA
The Cloud Run service agent (service-PROJECT_NUMBER@serverless-robot-prod.iam.gserviceaccount.com) is a different identity from your deploy SA. Secret Manager access must be granted to both if the container needs to read secrets at runtime. (We use --set-secrets which injects as env vars — the service agent reads the secret during deployment.)
5. gcloud auth as non-root
Default gcloud install writes credentials to /root/.config/gcloud. Running as node user means those creds are inaccessible. Fix: copy the SA JSON to a node-readable path and gcloud auth activate-service-account --key-file=... as the node user. Then gcloud config set project ....
Demo Agents (in demo/agents/)
echo-agent
- Polls the relay, echoes every message back to sender with [ECHO] prefix
- Registers as echo-agent with the relay
- FastAPI app on port 9001
counter-agent
- Maintains a per-sender count in memory
- Replies with Count: N for each message received
- Demonstrates stateful agent behavior
- FastAPI app on port 9002
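The reply logic of both demo agents is small enough to sketch without the polling loop or the FastAPI wrapper. The function and class names here are hypothetical, and the single space after the [ECHO] prefix is an assumption about the exact reply format.

```python
from collections import Counter


def echo_reply(body: str) -> str:
    # Assumed format: prefix, one space, then the original body.
    return f"[ECHO] {body}"


class CounterAgent:
    """Hypothetical core of counter-agent: state lives only in process memory."""

    def __init__(self) -> None:
        self._counts = Counter()  # sender -> number of messages seen

    def handle(self, sender: str, body: str) -> str:
        # The body is ignored; only the sender's running total matters.
        self._counts[sender] += 1
        return f"Count: {self._counts[sender]}"
```

Because the counts are held in memory, restarting the agent process resets every sender to zero.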
Demo Scenarios (demo/scenarios/)
- 01_echo_roundtrip.py: Basic send/receive with echo-agent
- 02_counter_state.py: Stateful counter across multiple messages
- 03_multi_agent_routing.py: Two agents talking to each other via relay
- 04_concurrent_delivery.py: Parallel sends, verify all delivered
Future Work (not blocking demo)
Callback Push (xfail test)
test_callback_delivery is marked xfail — callback URL push not yet implemented.
When an agent registers with a callback_url, the relay should POST new messages there instead of waiting for the agent to poll. Implementation:
- Add callback_url field to agent record
- Background task: on POST /messages/send, if recipient has callback_url, do httpx.post(callback_url, json=message)
- Retry with exponential backoff, mark message delivered on 2xx
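Since callback push is not implemented yet, the following is only a sketch of how the retry step might look. The push_with_backoff function, its parameters, and the injected post hook are all hypothetical; the hook keeps the example free of network dependencies and could wrap httpx.post in practice.

```python
import time
from typing import Callable, Optional


def push_with_backoff(
    post: Callable[[str, dict], int],
    callback_url: str,
    message: dict,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver `message`; return True once a 2xx status is seen.

    `post` is a hypothetical transport hook returning an HTTP status code,
    e.g. lambda url, body: httpx.post(url, json=body).status_code
    """
    for attempt in range(max_attempts):
        status: Optional[int]
        try:
            status = post(callback_url, message)
        except Exception:
            status = None  # network errors count as a failed attempt
        if status is not None and 200 <= status < 300:
            return True  # caller marks the message delivered
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # delay doubles each retry
    return False  # leave the message in the mailbox for polling
```

Returning False rather than raising lets the relay fall back to ordinary polling when an agent's callback endpoint stays unreachable.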
Cloud SQL Migration
When moving beyond single-instance demo:
- Create Cloud SQL PostgreSQL instance in alanblount-demo
- Implement relay/db/postgres.py with same AsyncDB interface
- Set DATABASE_URL env var on Cloud Run
- Config auto-selects adapter based on URL scheme
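Adapter selection by URL scheme could look roughly like the sketch below. The ADAPTERS mapping and select_adapter function are hypothetical, and the module paths are derived from the file names mentioned above; only relay/db/sqlite.py exists today.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from URL scheme to adapter module path.
# relay.db.postgres is the planned adapter, not existing code.
ADAPTERS = {
    "sqlite": "relay.db.sqlite",
    "postgresql": "relay.db.postgres",
    "postgres": "relay.db.postgres",
}


def select_adapter(database_url: str) -> str:
    """Return the adapter module path for a DATABASE_URL."""
    scheme = urlsplit(database_url).scheme.lower()
    # Drop a driver suffix such as the "+asyncpg" in "postgresql+asyncpg".
    dialect = scheme.split("+", 1)[0]
    try:
        return ADAPTERS[dialect]
    except KeyError:
        raise ValueError(f"unsupported DATABASE_URL scheme: {scheme!r}") from None
```

Failing loudly on an unknown scheme avoids silently falling back to SQLite in a deployment that was meant to use Cloud SQL.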
Multi-Tenant / Per-Agent Namespacing
Currently all agents share one namespace. For production:
- Namespace agents by org_id or workspace_id
- Admin keys scoped to namespaces
- Separate Cloud Run services per tenant (or row-level security)
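One way to picture namespacing is a composite agent identity plus admin keys bound to a single org. This is a design sketch only: AgentKey, NamespacedRegistry, and the rule that messages never cross namespaces are assumptions for illustration, not decisions recorded in this log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentKey:
    """Hypothetical composite identity: an agent_id may repeat across orgs."""

    org_id: str
    agent_id: str


class NamespacedRegistry:
    def __init__(self) -> None:
        self._agents = set()      # approved AgentKey values
        self._admin_scopes = {}   # admin key -> org_id it may manage

    def add_admin_key(self, admin_key: str, org_id: str) -> None:
        self._admin_scopes[admin_key] = org_id

    def approve(self, admin_key: str, org_id: str, agent_id: str) -> AgentKey:
        # An admin key is only valid inside the namespace it was issued for.
        if self._admin_scopes.get(admin_key) != org_id:
            raise PermissionError("admin key not valid for this namespace")
        key = AgentKey(org_id, agent_id)
        self._agents.add(key)
        return key

    def can_message(self, sender: AgentKey, recipient: AgentKey) -> bool:
        # Assumption: messages never cross namespace boundaries.
        return (
            sender in self._agents
            and recipient in self._agents
            and sender.org_id == recipient.org_id
        )
```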
A2A Protocol Compliance
The relay currently uses its own message schema. To comply with Google’s A2A spec:
- Adopt Task, Message, Artifact types from the A2A spec
- Implement tasks/send, tasks/get, tasks/sendSubscribe endpoints
- See google/A2A for the OpenAPI spec
LiteLLM Failover
Once additional SA keys are available (alan-sandbox, zaf-sandbox):
- Configure LiteLLM router with all three GCP projects
- Model priority: alanblount-demo → alanblount-sandbox → alan-sandbox → zaf-sandbox
- Test with: litellm --config /path/to/router-config.yaml
Key Files
a2a-relay/
├── relay/
│   ├── main.py                 # FastAPI app factory
│   ├── config.py               # Pydantic Settings (RELAY_ prefix)
│   ├── models.py               # Agent, Message, Token Pydantic models
│   ├── db/
│   │   └── sqlite.py           # AsyncDB interface (aiosqlite)
│   ├── routers/
│   │   ├── admin.py            # /admin/*: approve, list, revoke
│   │   ├── agents.py           # /agents/*: register, status
│   │   └── mailbox.py          # /messages/*: send, poll, ack
│   └── cli.py                  # CLI: relay-cli (HTTP + DB-direct modes)
├── demo/
│   ├── run_demo.py             # Main demo runner (--relay-url flag)
│   ├── lib.py                  # Shared HTTP helpers
│   ├── agents/
│   │   ├── echo_agent.py
│   │   └── counter_agent.py
│   └── scenarios/
│       ├── 01_echo_roundtrip.py
│       ├── 02_counter_state.py
│       ├── 03_multi_agent_routing.py
│       └── 04_concurrent_delivery.py
├── tests/                      # pytest, Bearer auth, 129 pass / 1 xfail
├── docs/
│   ├── ARCHITECTURE.md         # Design doc (written first, before code)
│   ├── DEPLOYMENT.md           # This file's operational sibling
│   └── RESEARCH.md             # You are here
├── Dockerfile
├── pyproject.toml              # uv-managed deps
└── .env.example
Research References
- Google A2A Protocol spec
- Google Agents CLI / ADK
- agents-cli clone — indexed as project 27
- agentgateway — evaluated, too heavy
- Cloud Run docs — secrets
- Cloud Run docs — SQLite limitations