Video system crashes and overheating CPUs

For several months we’ve been trying to track down a very elusive cause of the video system crashing. We’ve tried everything from extensive configuration changes on all servers to code patches to the primary ingest software from its vendor. Two days ago we set up hyper logging of all OS events and waited for the primary ingest server to crash again. Like clockwork, it crashed last night and shed light on a serious issue. One of the CPUs in the primary ingest server was running at over 90°C at less than 50% load. This caused the OS to throttle that CPU down significantly, which locked up the ingest server software and set off a domino effect that ended in a complete video system outage.
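
For anyone curious what that symptom looks like as a check, here is a minimal sketch in Python. It is not the logging setup we actually used; the `CpuSample` type, the `find_suspect_cpus` function, and the sample readings are all made up for illustration. The only values taken from above are the two thresholds (over 90°C, under 50% load).

```python
from dataclasses import dataclass

# Thresholds taken from the symptom described above.
TEMP_LIMIT_C = 90.0    # hotter than this is suspicious
LOAD_LIMIT_PCT = 50.0  # ...when the load is below this

@dataclass
class CpuSample:
    """One hypothetical reading for a single CPU."""
    cpu_id: int
    temp_c: float
    load_pct: float

def find_suspect_cpus(samples):
    """Return the IDs of CPUs that run hot while lightly loaded."""
    return [
        s.cpu_id
        for s in samples
        if s.temp_c > TEMP_LIMIT_C and s.load_pct < LOAD_LIMIT_PCT
    ]

# Made-up readings: CPU 1 is hot at under half load, CPU 0 is fine.
samples = [CpuSample(0, 62.0, 45.0), CpuSample(1, 93.5, 48.0)]
print(find_suspect_cpus(samples))
```

A CPU that is hot under heavy load is not flagged here, since that can be normal. The combination of high temperature and low load is what points at a cooling or hardware problem.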

At 9am on July 12, 2018, the datacenter housing that server will replace it with an entirely new one. Unfortunately, this will cause about 15 to 30 minutes of downtime. But I’d say that’s minimal compared to the frustration the video system crashes have caused us and everyone using our sites for the last several months.

Of course, there’s always the possibility that this won’t entirely fix the issue. But based on how the video system crashes when that CPU goes into thermal protection, we’re fairly confident the new server will resolve it.

Thank you all for sticking with us and being so patient. The amount of lost sleep and frustration caused by this issue has been incredible and has taken its toll on us. (I’m personally taking a 30-hour nap after all this.) But we’re pushing forward to try to provide the best live streaming experience we possibly can.

Mark Vaughn
