June 22, 2025 · 12 min read

Why We Migrated Our Signaling Layer from Node.js to Go

Analyzing the performance bounds of the V8 event loop under high-concurrency WebSocket traffic and our exact migration strategy.

Go · Node.js · WebSockets · Performance

The Original Architecture

Our real-time signaling layer was originally built in Node.js. It handled WebSocket connections, presence updates, and message routing for a social platform growing toward 50,000 concurrent users. At ~10,000 connections, it ran fine. At 30,000+, we hit a wall.

Why Node.js Struggled

Node.js is single-threaded. Every event on the event loop — a new connection, a message, a timer — competes for the same thread. At scale, the event loop lag (the delay between scheduling a callback and executing it) grew from <1ms to 40–80ms. This directly caused:

  • Delayed presence updates ("User X is still shown as online 30 seconds after disconnecting")
  • Missed heartbeat timeouts causing stale connections to linger
  • CPU spikes up to 95% during reconnect storms

We profiled with clinic.js and confirmed the bottleneck was pure CPU contention on the event loop during message broadcast, not I/O.

The Go Rewrite Strategy

We did not rewrite everything at once. We used the strangler fig pattern:

1. Phase 1: Deploy the Go signaling service alongside Node.js, routing 5% of traffic to it.

2. Phase 2: Gradually shift traffic while monitoring error rates and latency.

3. Phase 3: Decommission Node.js once Go handled 100% of traffic stably for 7 days.
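One way to implement the percentage split in Phases 1 and 2 is to make it sticky per user by hashing the user ID, so a user who reconnects keeps hitting the same backend. A minimal sketch of that routing decision (`routeToGo` and the hash-bucket scheme are our illustration, not the exact production router):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// routeToGo decides whether a user falls in the Go rollout bucket.
// Hashing the user ID (rather than rolling a die per request) makes the
// split sticky, which keeps reconnect behavior consistent per user.
func routeToGo(userID string, rolloutPercent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(userID)) // hash.Hash32.Write never returns an error
	return h.Sum32()%100 < rolloutPercent
}

func main() {
	goHits := 0
	for i := 0; i < 10000; i++ {
		if routeToGo(fmt.Sprintf("user-%d", i), 5) {
			goHits++
		}
	}
	// With a roughly uniform hash this lands near 5% of users.
	fmt.Printf("routed to Go: %d of 10000\n", goHits)
}
```

Bumping `rolloutPercent` is then a single config change, and rolling back during an incident is just as cheap.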

Go's Goroutine Model

Each WebSocket connection in Go runs in its own goroutine, a lightweight green thread managed by the Go runtime. Goroutines start at ~2KB of stack, versus a default stack of 1MB or more for an OS thread, meaning we could handle 50,000 concurrent connections on a single 8-core instance.

func handleConnections(hub *Hub, conn *websocket.Conn) {
    client := &Client{hub: hub, conn: conn, send: make(chan []byte, 256)}
    hub.register <- client

    // Read and write in separate goroutines so a slow writer
    // never blocks inbound reads
    go client.readPump()
    go client.writePump()
}
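The Hub above follows the fan-out pattern popularized by the gorilla/websocket chat example. A self-contained sketch of the shape of ours follows; the channel names match the snippet above, but the field layout and the drop-slow-consumers policy are simplified assumptions:

```go
package main

import "fmt"

// Client mirrors the struct built in handleConnections; only the
// buffered send channel matters for the hub's logic.
type Client struct {
	send chan []byte
}

// Hub serializes all membership changes and broadcasts through
// channels, so the clients map needs no mutex.
type Hub struct {
	clients    map[*Client]bool
	register   chan *Client
	unregister chan *Client
	broadcast  chan []byte
}

func NewHub() *Hub {
	return &Hub{
		clients:    make(map[*Client]bool),
		register:   make(chan *Client),
		unregister: make(chan *Client),
		broadcast:  make(chan []byte),
	}
}

// Run is the hub's single event loop. Only this goroutine touches
// h.clients, which is what makes the lock-free map access safe.
func (h *Hub) Run() {
	for {
		select {
		case c := <-h.register:
			h.clients[c] = true
		case c := <-h.unregister:
			if _, ok := h.clients[c]; ok {
				delete(h.clients, c)
				close(c.send)
			}
		case msg := <-h.broadcast:
			for c := range h.clients {
				select {
				case c.send <- msg:
				default:
					// Slow consumer with a full buffer: drop it
					// rather than let it stall the whole broadcast.
					delete(h.clients, c)
					close(c.send)
				}
			}
		}
	}
}

func main() {
	h := NewHub()
	go h.Run()
	c := &Client{send: make(chan []byte, 256)}
	h.register <- c
	h.broadcast <- []byte("presence:update")
	fmt.Printf("client received: %s\n", <-c.send)
}
```

The key difference from the Node.js version is that a single stalled socket can no longer delay everyone else's messages; backpressure is isolated per connection.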

Results After 30 Days in Production

Metric                  Node.js   Go      Change
P99 Message Latency     78ms      11ms    -86%
CPU (50K concurrent)    91%       34%     -63%
Memory per connection   ~200KB    ~8KB    -96%

When You Should NOT Do This

If your concurrency requirements are under 10,000 connections and your team is deeply productive in Node.js, the migration cost is unlikely to pay off. We made this switch because we had hard latency SLAs and a roadmap to 200,000 concurrent users.