This article has been cross-posted on Knack’s engineering blog.

At Knack, one of our key user experiences is the Classroom, a web-based audio and video conferencing environment with a real-time whiteboard. It looks like this:
We've had multiple iterations of the Classroom, and how we stream users' audio and video (A/V) in particular has evolved with the Classroom's adoption. Early on we relied on infrastructure and APIs from Jitsi, which worked well initially, but as a small team we found that we needed something a little more powerful, and we were open to more plug-and-play options. In the summer of 2020 we migrated to the Amazon Chime SDK for A/V, and we still use it today. It provides APIs with friendly documentation and even has a component library with hooks, which has really helped us feel like we can build anything we need with it.

However, it seemed to us that during COVID more and more browser-based A/V bugs were popping up in browser engines. Some we were aware of, and to others we were blind. We hadn't really solved observability with Chime. So as we were seeing new errors come in from Sentry and trying to track them manually, we needed a way to monitor and troubleshoot the larger Chime picture. We ended up coming across an article from AWS's blog that gave us a way to do just that: Monitoring and troubleshooting with Amazon Chime SDK meeting events.

The article describes aggregating Chime events and errors on the client and sending them to an API Gateway endpoint that triggers a Lambda function, so that they are ultimately nicely organized into an AWS CloudWatch dashboard that looks like this:
Source: Monitoring and troubleshooting with Amazon Chime SDK meeting events from AWS's blog

That blog post from AWS provides a CloudFormation template for deploying the dashboard, but at Knack we use Terraform as our Infrastructure-as-Code (IaC) solution, so we first had to rewrite it in Terraform. If you're not familiar with Terraform, the best way to get started is through their tutorials, which are excellent!

Once that was done, we were instantly able to get a better grasp of the errors our users were facing and see which ones we could solve and which we couldn't (mostly browser bugs). For the cases we couldn't solve, we could provide UI that lets users know their options so that they can still have a successful Classroom experience.

At the time of writing, we've been able to get our uncaught errors down and are handling every case we're aware of. If new errors pop up, we can see and research them immediately and put out fixes so that our users continue to have the best experience.

As a part of this blog post, we've open-sourced the Chime Events CloudWatch Dashboard in Terraform so anyone can stand it up and start monitoring their Chime application: