
Productive Robotics
WebSocket Memory Leak Fix
Diagnosed and fixed a safety-critical, four-layer cascading failure in production: JavaScript memory leak → Android WebView OOM → deadman heartbeat loss → robot safety fault during active welding. The robot stopped mid-weld with the arc still active and wire still feeding. This is the class of cross-layer systems debugging needed at Waymo, Tesla, and Boston Dynamics, where a software bug can cascade into a physical safety event.
Traced a production crash in which the tablet UI died mid-weld on the factory floor. The failure chain: WiFi drops briefly (common in industrial environments, where welders generate electromagnetic interference) → WebSocket reconnects to rosbridge → the ros_connected handler creates ~30 new ROSLIB.Topic subscriptions → the old subscriptions are NOT cleaned up → after several reconnect cycles, hundreds of orphaned subscriptions accumulate → the Android WebView (Chromium 90) runs out of memory → the WebView process crashes → the deadman heartbeat topic stops publishing → the robot's safety system detects the heartbeat loss → it triggers an emergency fault → the robot stops mid-weld with the arc still active and wire still feeding. A JavaScript scoping bug had cascaded through four layers into physical danger.
Root cause analysis uncovered 3 compounding bugs that kept the leak invisible. Bug 1 (JavaScript scoping): topic variables were declared with let inside the ros.on('connected') callback, so they were local to that callback and completely inaccessible from the ros.on('disconnected') handler. (Declaring them with var in the same place would have been equally unreachable; the problem was where they were declared.) The disconnect handler's try/catch silently swallowed every ReferenceError, making it appear that cleanup was working. Bug 2 (wrong object type): this.weldingRunState stored the last received message data (a plain JavaScript object), not the ROSLIB.Topic object, so each .unsubscribe() call invoked a nonexistent method on a data object and threw a TypeError that was swallowed the same way. Bug 3 (library internals): a previous fix attempt called ros.removeAllListeners(), which stripped roslib's own internal event listeners needed for WebSocket reconnection and permanently broke the connection after the first disconnect.
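Bugs 1 and 2 can be reproduced in isolation. The following is a minimal sketch, not the production code: FakeTopic stands in for ROSLIB.Topic, a Node EventEmitter stands in for the ROSLIB.Ros connection, and the topic name /welding_run_state and the state object are hypothetical. The catch blocks record the errors that the original handler discarded.

```javascript
const { EventEmitter } = require("events");

// Hypothetical stand-in for ROSLIB.Topic: it only tracks whether it is subscribed.
class FakeTopic {
  constructor(name) { this.name = name; this.active = false; }
  subscribe(callback) { this.active = true; this.callback = callback; }
  unsubscribe() { this.active = false; this.callback = null; }
}

const ros = new EventEmitter();  // stand-in for the ROSLIB.Ros connection
const created = [];              // every topic ever created, kept for inspection
const swallowed = [];            // errors the original try/catch would have hidden
const state = { weldingRunState: null };

ros.on("connected", () => {
  // Bug 1: this binding exists only inside the connect callback.
  let weldingRunStateTopic = new FakeTopic("/welding_run_state");
  created.push(weldingRunStateTopic);
  // Bug 2 setup: the field stores the latest message, not the topic object.
  weldingRunStateTopic.subscribe((msg) => { state.weldingRunState = msg; });
});

ros.on("disconnected", () => {
  try {
    weldingRunStateTopic.unsubscribe();   // ReferenceError: name not in scope here
  } catch (err) {
    swallowed.push(err);
  }
  try {
    state.weldingRunState.unsubscribe();  // TypeError: plain object has no such method
  } catch (err) {
    swallowed.push(err);
  }
});

// Simulate three brief WiFi drops.
for (let cycle = 0; cycle < 3; cycle++) {
  ros.emit("connected");
  created[created.length - 1].callback({ running: true }); // one message arrives
  ros.emit("disconnected");
}

const orphaned = created.filter((topic) => topic.active).length;
console.log("orphaned topics:", orphaned);
console.log("swallowed errors:", swallowed.map((err) => err.name));
```

Every cycle leaves its topic subscribed, and each cleanup attempt fails with an error that never reaches the console unless the catch block reports it.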
Built custom diagnostic instrumentation by monkey-patching ROSLIB.Topic.prototype.subscribe to intercept every subscription call: the patch incremented a counter (subCount++), logged the topic name and orphan count (subCount - unsubCount) to the console, then called the original subscribe via origSubscribe.call(this, callback). Recorded each subscribeId in a global array to identify exactly which subscriptions survived across reconnect cycles, making the previously invisible leak quantifiable. Wrapping library internals this way adds observability without modifying the library source, and is a standard diagnostic pattern for debugging third-party library behavior in production.
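The instrumentation described above might look roughly like the sketch below. It patches a stand-in Topic class rather than the real ROSLIB.Topic so that it runs without roslib installed. The unsubscribe wrapper, the debugSubscribeId property, and the liveSubscriptions array are assumptions made for illustration; the source only describes the subscribe patch and the subCount and unsubCount counters.

```javascript
// Hypothetical stand-in for ROSLIB.Topic so the sketch runs on its own.
class Topic {
  constructor(name) { this.name = name; }
  subscribe(callback) { this.callback = callback; }
  unsubscribe() { this.callback = null; }
}

let subCount = 0;
let unsubCount = 0;
let nextId = 0;
const liveSubscriptions = [];  // global record of subscriptions not yet released

// Keep references to the originals before replacing them.
const origSubscribe = Topic.prototype.subscribe;
const origUnsubscribe = Topic.prototype.unsubscribe;

Topic.prototype.subscribe = function (callback) {
  subCount++;
  this.debugSubscribeId = nextId++;  // assumed tracking property
  liveSubscriptions.push({ id: this.debugSubscribeId, name: this.name });
  console.log(`[subscribe] ${this.name} outstanding=${subCount - unsubCount}`);
  return origSubscribe.call(this, callback);
};

Topic.prototype.unsubscribe = function (callback) {
  const index = liveSubscriptions.findIndex((s) => s.id === this.debugSubscribeId);
  if (index !== -1) {
    // Count only unsubscribes that match a recorded subscription.
    liveSubscriptions.splice(index, 1);
    unsubCount++;
  }
  return origUnsubscribe.call(this, callback);
};

// Usage: the first topic is never cleaned up, the second one is.
const first = new Topic("/welding_run_state");
first.subscribe(() => {});
const second = new Topic("/welding_run_state");
second.subscribe(() => {});
second.unsubscribe();

console.log("still subscribed:", liveSubscriptions);
```

After this run, the only entry left in liveSubscriptions is the first topic, which is the kind of survivor the diagnostic was built to expose.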
The fix: hoisted all topic variables to outer var scope (matching the existing working pattern used for featureUpdateTopic), added explicit .unsubscribe() calls for all 7 topics in the disconnect handler without using removeAllListeners() (preserving roslib's internal reconnection listeners), and corrected the data-vs-topic variable confusion (this.jobRunState → jobRunStateTopic). This eliminated the cascade from a JavaScript scoping bug to a physical safety event. The debugging methodology traced a symptom (robot fault) backward through 4 abstraction layers (safety system → heartbeat → WebView → JavaScript) to the root cause (topic variables declared in a scope the disconnect handler could not reach), and demonstrates the cross-layer systems thinking required for safety-critical robotics software.
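The shape of the fix can be sketched as follows, under the same assumptions as before: FakeTopic and the EventEmitter are stand-ins for ROSLIB.Topic and ROSLIB.Ros, the topic names are hypothetical, and only two of the topics are shown. The extra "disconnected" listener registered first imitates a listener the library might keep for itself, to show that it survives because removeAllListeners() is never called.

```javascript
const { EventEmitter } = require("events");

let liveSubscriptions = 0;  // number of stand-in topics currently subscribed

// Hypothetical stand-in for ROSLIB.Topic.
class FakeTopic {
  constructor(name) { this.name = name; this.active = false; }
  subscribe(callback) {
    if (!this.active) { this.active = true; liveSubscriptions++; }
    this.callback = callback;
  }
  unsubscribe() {
    if (this.active) { this.active = false; liveSubscriptions--; }
    this.callback = null;
  }
}

const ros = new EventEmitter();  // stand-in for the ROSLIB.Ros connection

// Imitates a listener the library registers internally for reconnection.
let internalListenerCalls = 0;
ros.on("disconnected", () => { internalListenerCalls++; });

// Fix, part 1: declarations live in the scope shared by both handlers.
var jobRunStateTopic = null;
var weldingRunStateTopic = null;
const state = { jobRunState: null, weldingRunState: null };

ros.on("connected", () => {
  jobRunStateTopic = new FakeTopic("/job_run_state");
  jobRunStateTopic.subscribe((msg) => { state.jobRunState = msg; });
  weldingRunStateTopic = new FakeTopic("/welding_run_state");
  weldingRunStateTopic.subscribe((msg) => { state.weldingRunState = msg; });
});

ros.on("disconnected", () => {
  // Fix, parts 2 and 3: unsubscribe the topic objects (not the message data),
  // one by one, and leave every other listener on `ros` untouched.
  for (const topic of [jobRunStateTopic, weldingRunStateTopic]) {
    if (topic) topic.unsubscribe();
  }
  jobRunStateTopic = null;
  weldingRunStateTopic = null;
});

// Simulate five reconnect cycles.
for (let cycle = 0; cycle < 5; cycle++) {
  ros.emit("connected");
  ros.emit("disconnected");
}

console.log("live subscriptions:", liveSubscriptions);
console.log("internal listener calls:", internalListenerCalls);
```

With cleanup reaching the real topic objects, the live subscription count returns to zero after every disconnect, while the imitation internal listener still fires on each one.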