**We Replaced H.264 Streaming with JPEG Screenshots (and it Worked Better)**
It's time to admit defeat: our beautiful, hardware-accelerated, WebCodecs-powered, 60fps H.264 streaming pipeline over WebSockets was a disaster on real-world networks.
We're building Helix, an AI platform where autonomous coding agents work in cloud sandboxes. Users need to watch their AI assistants work, like "screen share, but the thing being shared is a robot writing code." Last week, we explained how we replaced WebRTC with a custom WebSocket streaming pipeline. This week: why that wasn't enough.
The constraint that ruined everything: it has to work on enterprise networks. And let's be honest, enterprise networks love HTTP and HTTPS but hate UDP, which gets blocked, deprioritized, or dropped altogether.
**Act I: Hubris (Also Known As "Enterprise Networking Exists")**
We tried WebRTC first. It worked great in dev, great in our cloud, but when we deployed it to an enterprise customer, the network just wouldn't cooperate. Outbound UDP was blocked, TURN servers were unreachable, and ICE negotiation failed. We could fight this, set up TURN servers, configure proxies, or work with IT departments – but where's the fun in that?
So we built a pure WebSocket video pipeline: H.264 encoding via GStreamer + VA-API (hardware acceleration), binary frames over WebSocket (Layer 7 only, works through any proxy), and WebCodecs API for hardware decoding in the browser. We were proud – we measured things in microseconds, implemented our own binary protocol, and even wrote Rust.
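For flavor, the browser half of that pipeline looks roughly like this. This is a hedged sketch, not our actual code: the codec string, the endpoint, and the one-frame-per-message framing are all assumptions.

```ts
// Sketch: decode binary H.264 frames arriving over a WebSocket using WebCodecs.
// Assumes each binary message is one encoded frame and the first one is a
// keyframe; codec string and endpoint are illustrative, not our real protocol.
const canvas = document.querySelector("canvas")!;
const ctx = canvas.getContext("2d")!;

const decoder = new VideoDecoder({
  output: (frame) => {
    // Hardware-decoded frame: paint it, then release it immediately.
    ctx.drawImage(frame, 0, 0, canvas.width, canvas.height);
    frame.close();
  },
  error: (e) => console.error("decode error", e),
});
decoder.configure({ codec: "avc1.42E01E" }); // H.264 Baseline; illustrative

const ws = new WebSocket("wss://example.com/stream"); // hypothetical endpoint
ws.binaryType = "arraybuffer";
let first = true;
ws.onmessage = (ev) => {
  decoder.decode(
    new EncodedVideoChunk({
      type: first ? "key" : "delta", // a real protocol tags frame type per message
      timestamp: performance.now() * 1000, // WebCodecs timestamps are microseconds
      data: ev.data as ArrayBuffer,
    }),
  );
  first = false;
};
```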
Then someone tried to use it from a coffee shop, and... "No, the video is definitely frozen. And now my keyboard isn't working." It was showing what the AI was doing 30 seconds ago, and the delay kept growing. We realized that 40Mbps video streams don't appreciate 200ms+ network latency.
**The Problem with H.264**
When the network can't keep up, frames buffer in the TCP/WebSocket layer. They arrive in order (thanks, TCP!) but increasingly delayed, and the video falls further and further behind real time: by the time you see a bug, the AI has already committed it to main. We tried lowering the bitrate, but the stream was still 30 seconds behind.
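To make the failure mode concrete (illustrative numbers, not measurements from our network): if the encoder produces 40 Mbps but the link only sustains 8 Mbps, the socket backlog grows by 32 Mbit (~4 MB) every second, and what you're watching drifts an extra four seconds behind real time for every second of wall clock. TCP never drops a frame; it just queues your lag.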
Our big brain moment was: "What if we only send keyframes?" H.264 keyframes (IDR frames) are self-contained – no dependencies on previous frames. Just drop all the P-frames on the server side, send only keyframes, and get ~1fps of corruption-free video.
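The filter itself is almost a one-liner. Here's a TypeScript sketch of the idea (assuming an Annex B byte stream; our real pipeline is Rust, and as you'll see, this is not where the story ends):

```ts
// Sketch: keep only self-contained IDR access units from an Annex B H.264
// stream, dropping everything else (P-frames, etc.). Illustrative only.
function isIdrFrame(frame: Uint8Array): boolean {
  for (let i = 0; i + 4 < frame.length; i++) {
    // Annex B start codes: 00 00 01 or 00 00 00 01
    const short = frame[i] === 0 && frame[i + 1] === 0 && frame[i + 2] === 1;
    const long =
      frame[i] === 0 && frame[i + 1] === 0 && frame[i + 2] === 0 && frame[i + 3] === 1;
    if (!short && !long) continue;
    const nalType = frame[i + (short ? 3 : 4)] & 0x1f; // low 5 bits = NAL unit type
    if (nalType === 5) return true; // type 5 = IDR slice: no dependency on prior frames
  }
  return false;
}

// Server side, conceptually: only forward frames where isIdrFrame(frame) is true.
```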
**The Solution: JPEG Screenshots**
But our WebSocket streaming layer sits on top of the Moonlight protocol, which decides that if you're not consuming P-frames, you're not ready for more frames. Period. We poked around for an hour or two, but without diving deep into the Moonlight protocol internals, we weren't going to fix this.
While debugging why the stream was frozen again, I opened our screenshot debugging endpoint in a browser tab: a pristine, 150KB JPEG of the remote desktop. Crystal clear. No artifacts. No waiting for keyframes. Just... pixels.
**The Hybrid Solution**
We didn't throw away the H.264 pipeline; we just needed to stop pushing massive video frames when the network was bad. So we added one control message: when the server receives it, it stops sending video frames, the client polls screenshots instead, and input keeps flowing over the same WebSocket. Everyone's happy.
15 lines of Rust. I am not joking.
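The client side is barely more. Here's a sketch of the switch: setVideoEnabled() is the real method from websockethost.ts, but the screenshot endpoint, polling interval, and plumbing shown here are assumptions.

```ts
// Sketch of the hybrid fallback: stop the video stream, poll JPEGs instead.
// setVideoEnabled() is real (websockethost.ts); the endpoint and interval
// below are illustrative assumptions.
function fallBackToScreenshots(
  host: { setVideoEnabled(on: boolean): void },
  img: HTMLImageElement,
) {
  host.setVideoEnabled(false); // one control message; server stops sending video

  setInterval(async () => {
    // Cache-buster so each poll fetches a fresh JPEG of the sandbox desktop.
    const res = await fetch(`/screenshot?t=${Date.now()}`); // hypothetical endpoint
    const blob = await res.blob();
    const url = URL.createObjectURL(blob);
    img.onload = () => URL.revokeObjectURL(url); // don't leak object URLs
    img.src = url;
  }, 166); // ~6 JPEGs per second

  // Input events keep flowing over the same WebSocket the whole time.
}
```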
**The Fix**
We almost shipped a hilarious bug – when you stop sending video frames, the WebSocket becomes basically empty. Just tiny input events and occasional pings. Our adaptive mode sees low latency and thinks: "Oh nice! Connection recovered! Let's switch back to video!" Video resumes. 40Mbps floods the connection. Latency spikes. Mode switches to screenshots. Latency drops. Mode switches to video. Latency spikes.
The fix was embarrassingly simple: once you fall back to screenshots, stay there until the user explicitly clicks to retry.
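In code, the fix is a one-way latch. A sketch (names and threshold hypothetical):

```ts
// Sketch: latency samples can demote us to screenshots, but only an explicit
// user action promotes us back to video. Names and threshold are hypothetical.
let mode: "video" | "screenshots" = "video";

function onLatencySample(rttMs: number) {
  // A quiet screenshot-mode socket will report great latency; never trust
  // that as a signal to resume video.
  if (mode === "video" && rttMs > 500) {
    mode = "screenshots"; // automatic demotion
  }
}

function onUserClickedRetryVideo() {
  mode = "video"; // promotion only on explicit user intent
}
```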
**Conclusion**
Sometimes the 15-year-old solution is the right one. We're building Helix, open-source AI infrastructure that works in the real world – even on terrible WiFi. Want to experience the joy of interacting with an agent desktop at 6 JPEGs a second yourself? Join us for the private beta on Discord.
**API and Code**
* [api/cmd/screenshot-server/main.go](https://github.com/helixml/api/blob/main/cmd/screenshot-server/main.go) (200 lines of Go that changed everything)
* [MoonlightStreamViewer.tsx](https://github.com/helixml/webapp/blob/main/src/components/MoonlightStreamViewer.tsx) (React component with adaptive logic)
* [websockethost.ts](https://github.com/helixml/webapp/blob/main/src/websocket-host.ts) (WebSocket client with setVideoEnabled())