Progress on video system issues

The last few weeks have been frustrating for everyone involved. But our new servers are in place, and our video ingest system is now rock solid in terms of stability, both software and bandwidth-wise.

The video edge system has been a different story. We removed our servers from Chicago and Dallas because of how problematic those datacenter locations had become: the provider blatantly lied to us, oversold bandwidth despite selling us dedicated 10Gbps bandwidth, and was outright incompetent, messing up routing frequently.

Our new primary edge server is located in Virginia, in the same building as our primary ingest server, but with a different company. We’ve had a couple of hiccups with congested routes on the edge server within the first 24 hours, but they mysteriously resolved after we verbally ripped into the company in question. Keep in mind, this is the same company that we had our Chicago and Dallas locations through. Ingest is through a totally different company, with, surprisingly, zero issues and incredible support from them when we’ve needed it.

Once the bandwidth and routing congestion issues were resolved, another weak point of our edge system showed itself: the edge software. This weak point doesn’t show itself during local testing or simulated load tests. But once the software is put in production, despite the same hardware, OS version, and so on, it becomes a completely different story.

This issue with the video edge software has plagued our platform for years, often going weeks without appearing and then suddenly showing itself. Our first thought was a DDoS attack. While we did experience multiple DDoS attacks over the last few weeks, extensive monitoring and coordination with the datacenter ruled out any type of attack as the cause. Rather, it’s being caused by an obscure memory leak that only happens under just the right circumstances, typically during high-traffic hours. Adding RAM is a band-aid, but there comes a point where no amount of additional RAM is a feasible fix for the problem.
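To illustrate why more RAM only delays a memory leak rather than fixing it, here is a minimal sketch. Every figure in it is hypothetical and chosen for illustration, not a measurement from our servers, and it assumes a constant leak rate, which is a simplification since the real leak only appears under certain conditions. The function name `hours_until_exhaustion` is made up for this example.

```python
def hours_until_exhaustion(total_ram_gb: float,
                           baseline_gb: float,
                           leak_gb_per_hour: float) -> float:
    """Estimate hours until a steadily leaking process runs out of RAM.

    All inputs are hypothetical illustration values:
      total_ram_gb      - RAM installed in the server
      baseline_gb       - memory the software needs when it is healthy
      leak_gb_per_hour  - assumed constant leak rate while the bug is active
    """
    if leak_gb_per_hour <= 0:
        raise ValueError("leak rate must be positive")
    # Headroom is whatever RAM is left after normal usage.
    headroom_gb = max(total_ram_gb - baseline_gb, 0.0)
    # With a constant leak, time to failure grows only linearly with headroom.
    return headroom_gb / leak_gb_per_hour


if __name__ == "__main__":
    # Hypothetical server sizes; the leak rate stays the same in every case.
    for ram_gb in (64, 128, 256):
        hours = hours_until_exhaustion(ram_gb, baseline_gb=16, leak_gb_per_hour=4)
        print(f"{ram_gb} GB RAM: about {hours:.0f} hours before memory runs out")
```

With these made-up numbers, doubling RAM from 64 GB to 128 GB moves the failure from about 12 hours to about 28 hours. The crash still happens, just later, which is why the leak itself has to be fixed.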

So what are the next steps? At the moment I am gutting the video player for desktop viewers, which is where most of our video traffic comes from. Basically everything under the hood of the video player is being updated and rewritten. This will help resolve the playback errors that sometimes happen, where the stream randomly stops and the loading message appears as the player tries to get the stream back, and it will help with load time. The look of the player won’t change. This video player update will be finished within the next 12 hours, barring any critical issues that may arise.

And what about the edge server software? That’s ultimately the issue, right? Yes, it is. I’m prototyping and testing new edge software that will not only be more reliable but will also reduce latency. We do not use industry-standard HLS or LL-HLS for stream delivery. Those are not ideal solutions when you’re trying to keep latency very low and scale without investing millions of dollars in infrastructure. That investment is the secret to how Twitch (Amazon) and YouTube Live (Google) achieve low latency through HLS and DASH: they can afford to dump hundreds of millions into hardware to absorb the very high resource cost of low latency via HLS and DASH.

If you’ve read this far, thank you! We’re just as frustrated as you are about the video issues of the last few weeks, and many of those issues were completely out of our hands. But rest assured we’re not stopping until they’re completely resolved and, by the looks of it, you’ll have a better streaming experience than before.

-Mark
