TacitusMail
Infrastructure

Cross-replica signalling bus for chat and calls

Calls and chat stopped working whenever the two parties hit different web replicas. A Redis pub/sub fanout replaces the in-process registry.

In the first real load test of chat and calls we ran into a reproducible failure: a message from one user never reached the other. Turns out the WebSocket registry was per-process, and nginx's round-robin happily landed two users on two different replicas of the same web container. Signals went into the process-local registry and disappeared.

The fix is app/realtime/signaling_bus.py: every publish_to_user(user_id, payload) call delivers to the local registry first, then issues a Redis PUBLISH to tacitusmail:signal:{user_id}. Every web process runs a background PSUBSCRIBE loop that forwards matching messages into its own local registry. A per-process _origin tag prevents double delivery: the publishing process also receives its own message back from Redis, and the tag lets its subscribe loop skip what it has already delivered locally.
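To make the flow concrete, here is a minimal sketch of the idea, not the actual contents of signaling_bus.py. It swaps Redis for an in-memory FakeBroker so it runs without a server, and it models each WebSocket as a plain list. The Replica class, connect, deliver_local, and the choice to carry _origin as an extra key on the published JSON are all assumptions made for illustration; only publish_to_user, the channel name, and the _origin tag come from the post.

```python
import json
import uuid
from collections import defaultdict

CHANNEL_PREFIX = "tacitusmail:signal:"


class FakeBroker:
    """In-memory stand-in for Redis pub/sub (assumption, for illustration only).

    Like a pattern subscription on tacitusmail:signal:*, every subscriber
    sees every published message, including the publisher itself.
    """

    def __init__(self):
        self.subscribers = []

    def publish(self, channel, message):
        for callback in list(self.subscribers):
            callback(channel, message)


class Replica:
    """One web process: a process-local registry plus a bus subscription."""

    def __init__(self, broker):
        self.broker = broker
        self.origin = uuid.uuid4().hex      # per-process _origin tag
        self.sockets = defaultdict(list)    # user_id -> list of inboxes
        broker.subscribers.append(self.on_message)

    def connect(self, user_id):
        """Register a hypothetical WebSocket, modelled here as a list."""
        inbox = []
        self.sockets[user_id].append(inbox)
        return inbox

    def deliver_local(self, user_id, payload):
        for inbox in self.sockets.get(user_id, []):
            inbox.append(payload)

    def publish_to_user(self, user_id, payload):
        # Local-first delivery, then fan out to every other process.
        self.deliver_local(user_id, payload)
        message = json.dumps({**payload, "_origin": self.origin})
        self.broker.publish(CHANNEL_PREFIX + user_id, message)

    def on_message(self, channel, message):
        payload = json.loads(message)
        # Skip our own messages: they were already delivered locally.
        if payload.pop("_origin", None) == self.origin:
            return
        user_id = channel[len(CHANNEL_PREFIX):]
        self.deliver_local(user_id, payload)


broker = FakeBroker()
replica_a, replica_b = Replica(broker), Replica(broker)
alice_inbox = replica_a.connect("alice")
bob_inbox = replica_b.connect("bob")

replica_a.publish_to_user("bob", {"type": "offer"})   # crosses replicas
replica_a.publish_to_user("alice", {"type": "ping"})  # same replica
```

In this sketch the first call reaches bob only through the bus, which is the case that used to fail. The second call reaches alice through local delivery, and the _origin check stops replica_a from delivering it again when the message comes back from the broker.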

On iOS the same commit rewired call signalling end-to-end: CallManager now plugs its signal sender into the live ChatStore WebSocket instead of a dead RealtimeSession that was never initialised. The CallSignal Codable wrapper was the other root cause; it looked for a nested payload field that the server never emits. We replaced it with raw-dict dispatch via a new handleIncomingSignalRaw.

The Tacitus Mail team
Engineering posts and release notes from the people who write the code.
