A2A Relay — Deployment Guide

Overview

The relay runs on Google Cloud Run (project: alanblount-demo, region: us-central1). The container image is built with Cloud Build and stored in Artifact Registry.


Service URLs

Environment        URL
Cloud Run (prod)   https://a2a-relay-462816930018.us-central1.run.app
Local dev          http://localhost:8765

GCP Setup (one-time)

SA Permissions Required

The service account general-agent-access@alanblount-demo.iam.gserviceaccount.com has these roles (confirmed working):

roles/aiplatform.user
roles/artifactregistry.reader
roles/artifactregistry.writer
roles/cloudbuild.builds.editor
roles/iam.securityReviewer          ← read own IAM policy
roles/run.admin                     ← deploy + set allUsers invoker
roles/run.developer
roles/secretmanager.admin
roles/secretmanager.secretAccessor
roles/storage.objectAdmin
roles/storage.objectCreator

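Grant the essential ones. The loop skips roles/aiplatform.user and the roles already covered by a broader role in the list (artifactregistry.reader, run.developer, storage.objectCreator):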

SA="general-agent-access@alanblount-demo.iam.gserviceaccount.com"
PROJECT="alanblount-demo"

for role in \
  roles/artifactregistry.writer \
  roles/cloudbuild.builds.editor \
  roles/iam.securityReviewer \
  roles/run.admin \
  roles/secretmanager.admin \
  roles/secretmanager.secretAccessor \
  roles/storage.objectAdmin; do
  gcloud projects add-iam-policy-binding "$PROJECT" \
    --member="serviceAccount:$SA" \
    --role="$role"
done

roles/iam.securityReviewer is the minimum for the SA to read its own IAM policy (get-iam-policy). Without it, every IAM diagnosis requires you to run gcloud manually.
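
To confirm the bindings took effect, the SA's roles can be read back and compared with the list above. This is a minimal sketch: missing_roles is a hypothetical helper (not part of the relay repo) that prints every required role absent from a granted list, and the gcloud call is left commented because it needs an authenticated session.

```shell
# Hypothetical helper: print each required role (arg 1, one per line)
# that does not appear in the granted list (arg 2, one per line).
missing_roles() {
  required="$1"
  granted="$2"
  printf '%s\n' "$required" | while IFS= read -r role; do
    [ -n "$role" ] || continue
    printf '%s\n' "$granted" | grep -qxF -- "$role" || printf '%s\n' "$role"
  done
}

# Same roles as the grant loop above.
REQUIRED_ROLES="roles/artifactregistry.writer
roles/cloudbuild.builds.editor
roles/iam.securityReviewer
roles/run.admin
roles/secretmanager.admin
roles/secretmanager.secretAccessor
roles/storage.objectAdmin"

# Usage (needs gcloud auth; $SA as defined above):
# granted="$(gcloud projects get-iam-policy alanblount-demo \
#   --flatten='bindings[].members' \
#   --filter="bindings.members:serviceAccount:$SA" \
#   --format='value(bindings.role)')"
# missing_roles "$REQUIRED_ROLES" "$granted"
```

Empty output means every required role is bound. Any role printed still needs to be granted.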

The Cloud Run service agent also needs secret access (granted automatically during deploy if you run the deploy command below):

# Cloud Run service agent format: service-PROJECT_NUMBER@serverless-robot-prod.iam.gserviceaccount.com
# Project number for alanblount-demo: 462816930018
gcloud secrets add-iam-policy-binding RELAY_ADMIN_KEY \
  --project=alanblount-demo \
  --member=serviceAccount:service-462816930018@serverless-robot-prod.iam.gserviceaccount.com \
  --role=roles/secretmanager.secretAccessor
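
For reuse in another project, the member string can be derived from the project number rather than typed by hand. A small sketch, assuming the service agent format quoted in the comment above; run_service_agent is a hypothetical helper name.

```shell
# Build the Cloud Run service agent address from a project number,
# following the format quoted in the comment above.
run_service_agent() {
  printf 'service-%s@serverless-robot-prod.iam.gserviceaccount.com\n' "$1"
}

# Look up the project number instead of hardcoding it (needs gcloud auth):
# PROJECT_NUMBER="$(gcloud projects describe alanblount-demo --format='value(projectNumber)')"
PROJECT_NUMBER="462816930018"   # documented above for alanblount-demo

MEMBER="serviceAccount:$(run_service_agent "$PROJECT_NUMBER")"
echo "$MEMBER"
```

The resulting $MEMBER value can be passed to --member in the binding command above.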

APIs to Enable

gcloud services enable \
  run.googleapis.com \
  cloudbuild.googleapis.com \
  secretmanager.googleapis.com \
  artifactregistry.googleapis.com \
  --project=alanblount-demo

Artifact Registry Repo

gcloud artifacts repositories create a2a-relay \
  --repository-format=docker \
  --location=us-central1 \
  --project=alanblount-demo

Secret Manager

# Create the admin key secret (store a strong random value)
echo -n "your-strong-admin-key-here" | \
  gcloud secrets create RELAY_ADMIN_KEY \
    --data-file=- \
    --project=alanblount-demo
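
One way to produce the "strong random value" without typing it into the shell is to generate it and pipe it straight into the create command. A sketch only: gen_admin_key is a hypothetical helper, and the 32-byte length is an assumption rather than a relay requirement.

```shell
# Hypothetical helper: emit 32 random bytes as 64 lowercase hex characters.
gen_admin_key() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Pipe the key directly into Secret Manager so it never appears in
# shell history (needs gcloud auth, so left commented here):
# gen_admin_key | gcloud secrets create RELAY_ADMIN_KEY \
#   --data-file=- \
#   --project=alanblount-demo
```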

Build & Deploy

Local Auth Setup

# Copy SA key to node-readable path and activate
cp /secrets/credentials/alanblount-demo-bf573405ac1f.json \
   /home/node/.config/gcloud/alanblount-demo-sa.json

gcloud auth activate-service-account \
  --key-file=/home/node/.config/gcloud/alanblount-demo-sa.json

gcloud config set project alanblount-demo

Build Image

cd /shared/workspace/open-source/a2a-relay

gcloud builds submit \
  --project=alanblount-demo \
  --tag=us-central1-docker.pkg.dev/alanblount-demo/a2a-relay/relay:latest \
  --suppress-logs

The build takes about 3–5 minutes. The image is about 340 MB (Python 3.11-slim base plus scikit-learn).

Deploy to Cloud Run

gcloud run deploy a2a-relay \
  --project=alanblount-demo \
  --region=us-central1 \
  --image=us-central1-docker.pkg.dev/alanblount-demo/a2a-relay/relay:latest \
  --platform=managed \
  --allow-unauthenticated \
  --set-secrets=RELAY_ADMIN_KEY=RELAY_ADMIN_KEY:latest \
  --set-env-vars=RELAY_DB_PATH=/tmp/relay.db,RELAY_LOG_LEVEL=INFO \
  --memory=512Mi \
  --cpu=1 \
  --min-instances=0 \
  --max-instances=3 \
  --port=8080

Note: Cloud Run injects PORT into the container, set to the value of --port (8080 here, which is also the default). The Dockerfile CMD does not read PORT; it hardcodes the same port:

CMD ["uvicorn", "relay.main:app", "--host", "0.0.0.0", "--port", "8080"]

(Hardcoded to 8080, which matches the --port=8080 passed to gcloud run deploy above.)

Verify Health

curl https://a2a-relay-462816930018.us-central1.run.app/health
# Expected: {"status": "ok", "agents": N, "messages": M}
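
Because the service is deployed with --min-instances=0, the first request after an idle period may hit a cold start, so a single curl can be slow or fail right after a deploy. The sketch below polls /health until the body reports "status": "ok". The helper names, attempt count, and delay are illustrative assumptions.

```shell
# Succeeds if a /health response body reports "status": "ok".
is_healthy() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Poll URL/health up to ATTEMPTS times, sleeping DELAY seconds between tries.
wait_for_health() {
  url="$1"; attempts="${2:-10}"; delay="${3:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if body="$(curl -fsS --max-time 10 "$url/health" 2>/dev/null)" && is_healthy "$body"; then
      return 0
    fi
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage:
# wait_for_health https://a2a-relay-462816930018.us-central1.run.app && echo "relay is up"
```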

Local Development

cd /shared/workspace/open-source/a2a-relay

# Install deps (uv required)
uv sync --extra dev

# Copy and edit env
cp .env.example .env
# Set RELAY_ADMIN_KEY in .env

# Start relay
uv run uvicorn relay.main:app --port 8765 --reload

# Or use the CLI
uv run python -m relay.cli --help

Running Tests

uv run pytest                    # all tests
uv run pytest -k test_mailbox    # specific module
uv run pytest -v --tb=short      # verbose

Test status: 129 passed, 1 skipped, 1 xfail (test_callback_delivery — callback push not yet implemented, documented as xfail).

Running the Demo

# Against local relay
uv run python demo/run_demo.py

# Against Cloud Run
uv run python demo/run_demo.py \
  --relay-url https://a2a-relay-462816930018.us-central1.run.app \
  --admin-key "$RELAY_ADMIN_KEY"

Demo scenarios:

  1. Echo agent round-trip — register, send, poll, verify echo
  2. Counter agent state — send multiple increments, verify count
  3. Multi-agent routing — two agents exchange messages via relay
  4. Concurrent delivery — parallel sends, verify all delivered
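
The Cloud Run invocation above expects RELAY_ADMIN_KEY to be set in the calling shell. A minimal sketch of a guard plus one way to load the value from Secret Manager; check_admin_key is a hypothetical helper, and the gcloud lines are commented out because they need an authenticated session with access to the secret.

```shell
# Fail early with a clear message when the admin key is not in the environment.
check_admin_key() {
  if [ -z "${RELAY_ADMIN_KEY:-}" ]; then
    echo "RELAY_ADMIN_KEY is not set" >&2
    return 1
  fi
}

# One way to populate it (needs gcloud auth and secret access):
# export RELAY_ADMIN_KEY="$(gcloud secrets versions access latest \
#   --secret=RELAY_ADMIN_KEY --project=alanblount-demo)"
#
# check_admin_key && uv run python demo/run_demo.py \
#   --relay-url https://a2a-relay-462816930018.us-central1.run.app \
#   --admin-key "$RELAY_ADMIN_KEY"
```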

Architecture Notes

See ARCHITECTURE.md for full design.

Key decisions:

Dockerfile Pitfall (RESOLVED)

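python -m relay.main exits immediately. Running a module with -m does execute it as __main__, so this most likely means relay/main.py only defines app and never starts a server (a __main__.py matters only for python -m relay, not for python -m relay.main). Start the server with uvicorn instead: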

CMD ["uvicorn", "relay.main:app", "--host", "0.0.0.0", "--port", "8080"]

Cloud Run Env Var Pitfall

Shell expansion such as ${PORT:-8080} does not happen in the exec form of a Dockerfile CMD (the JSON array), because no shell processes the arguments. Use the shell form or hardcode the port:

# ❌ Does NOT expand PORT in exec form
CMD ["uvicorn", "relay.main:app", "--port", "${PORT:-8080}"]

# ✅ Works — shell expands $PORT
CMD uvicorn relay.main:app --host 0.0.0.0 --port ${PORT:-8080}

# ✅ Also works: hardcode the port (must match --port on deploy; Cloud Run defaults to 8080)
CMD ["uvicorn", "relay.main:app", "--host", "0.0.0.0", "--port", "8080"]
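
The difference is easy to reproduce in a plain shell, outside Docker. In this sketch resolve_port is a hypothetical stand-in for what the shell-form CMD evaluates, and the final printf shows the literal string that an exec-form argument would hand to uvicorn.

```shell
# Shell form: a shell evaluates the default-value expansion before uvicorn starts.
resolve_port() {
  echo "${PORT:-8080}"
}

unset PORT
resolve_port          # PORT unset: falls back to 8080

PORT=9000
resolve_port          # PORT set: uses 9000

# Exec form: no shell is involved, so the argument stays a literal string.
printf '%s\n' '${PORT:-8080}'
```

The :- form also falls back to the default when PORT is set but empty.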

Troubleshooting

Build PERMISSION_DENIED

ERROR: (gcloud.builds.submit) PERMISSION_DENIED

The SA needs both roles/cloudbuild.builds.editor and roles/storage.objectAdmin; roles/storage.objectCreator alone is not enough.

Cloud Run TCP probe fails / no logs

Usually means the container exited immediately. Check:

  1. Is the Dockerfile CMD correct? (uvicorn not python -m relay.main)
  2. Is PORT hardcoded or using shell form?
  3. Check logs: gcloud run services logs read a2a-relay --project=alanblount-demo --region=us-central1 --limit=50
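
When the logs are empty, running the same image locally with the environment Cloud Run would provide can surface the startup error directly. A sketch, assuming Docker is installed and can pull from Artifact Registry; run_relay_locally is a hypothetical helper and local-test-key is a placeholder value, not a real key.

```shell
IMAGE="us-central1-docker.pkg.dev/alanblount-demo/a2a-relay/relay:latest"

# Run the image roughly the way Cloud Run would: PORT injected, port 8080 published.
# Defined as a function so nothing runs until it is called explicitly.
run_relay_locally() {
  docker run --rm \
    -e PORT=8080 \
    -e RELAY_DB_PATH=/tmp/relay.db \
    -e RELAY_ADMIN_KEY="${RELAY_ADMIN_KEY:-local-test-key}" \
    -p 8080:8080 \
    "$IMAGE"
}

# Usage (needs Docker; pulling needs registry auth):
# gcloud auth configure-docker us-central1-docker.pkg.dev
# run_relay_locally
# curl http://localhost:8080/health   # from a second terminal
```

If the container exits immediately here as well, the error printed by docker run usually points at the CMD or a missing dependency.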

“Agent not registered” despite valid auth

Fixed in relay/routers/admin.py — the approve endpoint now calls db.register_agent. If you see this on a clean deploy, check that the admin approval step was completed (not just token creation).

SQLite on Cloud Run

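Cloud Run instances are ephemeral: /tmp is an in-memory filesystem, so /tmp/relay.db is lost whenever an instance shuts down (including scale-to-zero under --min-instances=0 and every new revision), and each instance keeps its own separate copy. For production: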


Git Workflow

cd /shared/workspace/open-source/a2a-relay

# Always commit as Alan Blount
git -c user.name="Alan Blount" -c user.email="alan@zeroasterisk.com" \
  commit -m "feat: your message"

git push origin master

Remote: https://github.com/zeroasterisk/a2a-relay