
A2A Relay — Project Notes & Research Log

This is the living research document for the a2a-relay project. It records architecture decisions, alternatives considered, lessons learned, and the future roadmap.


What We Built

A store-and-forward message relay for A2A (Agent-to-Agent) communication. Agents register, get approved, and exchange messages via a central mailbox API.

Not a streaming RPC framework. Not a pub/sub bus. A simple async mailbox.
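As a rough illustration of what "simple async mailbox" means here, the sketch below models the send / poll / ack cycle in memory. The class name, method names, and message fields are assumptions made for this example; the real relay exposes these operations over HTTP and persists messages in SQLite.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Mailbox:
    """In-memory stand-in for a store-and-forward mailbox (hypothetical)."""

    _queues: dict = field(default_factory=lambda: defaultdict(deque))
    _ids: count = field(default_factory=lambda: count(1))

    def send(self, sender: str, recipient: str, body: str) -> int:
        """Store a message for the recipient and return its id."""
        msg_id = next(self._ids)
        self._queues[recipient].append({"id": msg_id, "from": sender, "body": body})
        return msg_id

    def poll(self, recipient: str) -> list:
        """Return pending messages without removing them."""
        return list(self._queues[recipient])

    def ack(self, recipient: str, msg_id: int) -> bool:
        """Remove an acknowledged message; return False if it was not pending."""
        queue = self._queues[recipient]
        for msg in queue:
            if msg["id"] == msg_id:
                queue.remove(msg)
                return True
        return False
```

Messages stay in the queue until acknowledged, so a recipient that crashes between poll and ack sees the same message again on its next poll.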


Alternatives Considered (and why we didn’t use them)

agentgateway (Google)

a2a-gateway (community)

Google Cloud Pub/Sub

Decision: Custom FastAPI relay


Auth Design: Option B (Per-Agent API Keys)

Agents self-register (POST /agents/register) → admin approves (POST /admin/approve/{token}) → agent uses Bearer token for all subsequent calls.

Key implementation detail: approval auto-creates the agent record in the registry. Early versions had a bug where the token was approved but the agent wasn’t in the main registry, causing “Agent not registered” errors even with valid Bearer tokens. Fixed in relay/routers/admin.py.
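The register → approve → authenticate flow, including the fix described above, might look roughly like the following. This is an in-memory sketch, not the code in relay/routers/admin.py; the Registry class, its attribute names, and the record fields are all hypothetical.

```python
import secrets


class Registry:
    """Hypothetical in-memory registry; the real relay persists this in SQLite."""

    def __init__(self):
        self.pending = {}   # token -> agent name awaiting approval
        self.approved = {}  # token -> agent name allowed to authenticate
        self.agents = {}    # agent name -> agent record (the main registry)

    def register(self, name: str) -> str:
        """Self-registration: issue a token that is not yet usable."""
        token = secrets.token_urlsafe(16)
        self.pending[token] = name
        return token

    def approve(self, token: str) -> None:
        """Admin approval; raises KeyError for an unknown token."""
        name = self.pending.pop(token)
        self.approved[token] = name
        # The fix: create the agent record in the same step as the approval,
        # so an approved token can never point at a missing agent.
        self.agents.setdefault(name, {"name": name, "status": "active"})

    def authenticate(self, authorization: str) -> dict:
        """Resolve an 'Authorization: Bearer <token>' header value to an agent."""
        scheme, _, token = authorization.partition(" ")
        if scheme != "Bearer" or token not in self.approved:
            raise PermissionError("invalid or unapproved token")
        name = self.approved[token]
        if name not in self.agents:
            raise LookupError("Agent not registered")
        return self.agents[name]
```

Without the setdefault line in approve, authenticate would reach the LookupError branch for a valid token, which is the failure mode the early versions showed.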


Infra Decisions

SQLite over PostgreSQL

Single Cloud Run instance

uv over pip


Deployment Lessons Learned

1. python -m module appeared not to fire __main__ on Cloud Run

When a container CMD is python -m relay.main, Python locates relay/main.py and runs it as __main__, so the if __name__ == "__main__": block does fire correctly. The real issue was that uvicorn was never invoked: the module defined app = FastAPI() at module level, but nothing called uvicorn.run(). Fix: use uvicorn directly as the CMD.

2. Shell expansion in Dockerfile CMD (exec form)

CMD ["uvicorn", "--port", "${PORT:-8080}"]: exec form does not run a shell, so ${PORT:-8080} is passed to uvicorn literally instead of being expanded. Cloud Run sets PORT to 8080 by default, so the simplest fix is to hardcode it.
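If hardcoding ever becomes a problem, one alternative is to resolve the port in Python rather than in the Dockerfile. The helper below is a sketch of that idea; resolve_port is a hypothetical name and is not part of the relay.

```python
import os


def resolve_port(env=None, default: int = 8080) -> int:
    """Return the port to listen on, mirroring the shell's ${PORT:-8080}."""
    env = os.environ if env is None else env
    raw = env.get("PORT") or ""  # like ${VAR:-default}, treat empty as unset
    if not raw:
        return default
    port = int(raw)  # raises ValueError for non-numeric values
    if not 0 < port < 65536:
        raise ValueError(f"PORT out of range: {port}")
    return port
```

A launcher script could then pass the result to uvicorn.run(app, host="0.0.0.0", port=resolve_port()), which keeps the Dockerfile CMD free of shell syntax.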

3. SA permissions for Cloud Build

The deploy SA needs both roles/cloudbuild.builds.editor AND roles/storage.objectAdmin. roles/storage.objectCreator alone is not enough, because Cloud Build needs to read the source tarball back from GCS.

4. Cloud Run service agent ≠ your SA

The Cloud Run service agent (service-PROJECT_NUMBER@serverless-robot-prod.iam.gserviceaccount.com) is a different identity from your deploy SA. Secret Manager access must be granted to both if the container needs to read secrets at runtime. (We use --set-secrets which injects as env vars — the service agent reads the secret during deployment.)

5. gcloud auth as non-root

When gcloud is installed and authenticated as root, it writes credentials to /root/.config/gcloud. Running as the node user means those credentials are inaccessible. Fix: copy the SA JSON to a node-readable path and run gcloud auth activate-service-account --key-file=... as the node user, then gcloud config set project ...


Demo Agents (in demo/agents/)

echo-agent

counter-agent

Demo Scenarios (demo/scenarios/)

  1. 01_echo_roundtrip.py — Basic send/receive with echo-agent
  2. 02_counter_state.py — Stateful counter across multiple messages
  3. 03_multi_agent_routing.py — Two agents talking to each other via relay
  4. 04_concurrent_delivery.py — Parallel sends, verify all delivered
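The idea behind the concurrent-delivery scenario can be sketched without the relay at all. The example below is an assumption about the shape of the check, not the contents of 04_concurrent_delivery.py: it fires many simulated sends in parallel with asyncio.gather and verifies that nothing was lost or duplicated.

```python
import asyncio


async def send(mailbox: list, lock: asyncio.Lock, body: str) -> None:
    """Simulated send: yield to the event loop, then store the message."""
    await asyncio.sleep(0)
    async with lock:
        mailbox.append(body)


async def run_concurrent(n: int) -> list:
    """Launch n sends at once and return everything that was stored."""
    mailbox, lock = [], asyncio.Lock()
    await asyncio.gather(*(send(mailbox, lock, f"msg-{i}") for i in range(n)))
    return mailbox


delivered = asyncio.run(run_concurrent(50))
# Arrival order is not asserted, only that each message arrived exactly once.
assert sorted(delivered) == sorted(f"msg-{i}" for i in range(50))
```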

Future Work (not blocking demo)

Callback Push (xfail test)

test_callback_delivery is marked xfail — callback URL push not yet implemented. When an agent registers with a callback_url, the relay should POST new messages there instead of waiting for the agent to poll. Implementation:
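One possible shape for this, as a sketch only: try the callback first and fall back to the mailbox if the agent has no callback_url or the push fails. The function names, the dict-based agent and message, and the fall-back-on-failure behavior are assumptions for illustration; none of this exists in the relay yet.

```python
import json
import urllib.request


def http_post_json(url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST payload as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def deliver(agent: dict, message: dict, mailbox: list, post=http_post_json) -> str:
    """Push to callback_url when set; otherwise, or on failure, queue for polling."""
    url = agent.get("callback_url")
    if url:
        try:
            if 200 <= post(url, message) < 300:
                return "pushed"
        except OSError:
            pass  # network failure or HTTP error: fall through to the mailbox
    mailbox.append(message)
    return "queued"
```

Passing the poster in as a parameter keeps the delivery decision testable without a live callback endpoint, which may help when turning the xfail test into a passing one.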

Cloud SQL Migration

When moving beyond single-instance demo:

  1. Create Cloud SQL PostgreSQL instance in alanblount-demo
  2. Implement relay/db/postgres.py with same AsyncDB interface
  3. Set DATABASE_URL env var on Cloud Run
  4. Config auto-selects adapter based on URL scheme
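Step 4 could be as small as a scheme lookup. The sketch below assumes the adapters live at relay.db.sqlite and relay.db.postgres (the latter is the planned file from step 2) and that DATABASE_URL follows the usual scheme://... form; select_adapter itself is a hypothetical helper.

```python
from urllib.parse import urlparse


def select_adapter(database_url: str) -> str:
    """Map a DATABASE_URL scheme to the module expected to provide AsyncDB."""
    scheme = urlparse(database_url).scheme.lower()
    backend = scheme.split("+", 1)[0]  # "postgresql+asyncpg" -> "postgresql"
    if backend == "sqlite":
        return "relay.db.sqlite"
    if backend in ("postgres", "postgresql"):
        return "relay.db.postgres"
    raise ValueError(f"unsupported database URL scheme: {scheme!r}")
```

The returned module path could then be loaded with importlib.import_module, so the rest of the relay depends only on the AsyncDB interface.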

Multi-Tenant / Per-Agent Namespacing

Currently all agents share one namespace. For production:

A2A Protocol Compliance

The relay currently uses its own message schema. To comply with Google’s A2A spec:

LiteLLM Failover

Once additional SA keys are available (alan-sandbox, zaf-sandbox):

  1. Configure LiteLLM router with all three GCP projects
  2. Model priority: alanblount-demo → alanblount-sandbox → alan-sandbox → zaf-sandbox
  3. Test with: litellm --config /path/to/router-config.yaml

Key Files

a2a-relay/
├── relay/
│   ├── main.py          # FastAPI app factory
│   ├── config.py        # Pydantic Settings (RELAY_ prefix)
│   ├── models.py        # Agent, Message, Token Pydantic models
│   ├── db/
│   │   └── sqlite.py    # AsyncDB interface (aiosqlite)
│   ├── routers/
│   │   ├── admin.py     # /admin/* — approve, list, revoke
│   │   ├── agents.py    # /agents/* — register, status
│   │   └── mailbox.py   # /messages/* — send, poll, ack
│   └── cli.py           # CLI: relay-cli (HTTP + DB-direct modes)
├── demo/
│   ├── run_demo.py      # Main demo runner (--relay-url flag)
│   ├── lib.py           # Shared HTTP helpers
│   ├── agents/
│   │   ├── echo_agent.py
│   │   └── counter_agent.py
│   └── scenarios/
│       ├── 01_echo_roundtrip.py
│       ├── 02_counter_state.py
│       ├── 03_multi_agent_routing.py
│       └── 04_concurrent_delivery.py
├── tests/               # pytest, Bearer auth, 129 pass / 1 xfail
├── docs/
│   ├── ARCHITECTURE.md  # Design doc (written first, before code)
│   ├── DEPLOYMENT.md    # This file's operational sibling
│   └── RESEARCH.md      # You are here
├── Dockerfile
├── pyproject.toml       # uv-managed deps
└── .env.example

Research References