Ever wonder what happens when your favorite app or website glitches or isn’t working as expected? Behind the scenes, there is often a dedicated team ready to jump in and fix it, day or night. Artera is no different: we make sure our technology runs smoothly for our customers. We transformed how we handle this crucial on-call responsibility, which ultimately made our engineering organization stronger and more effective. In this post, I will share how we did it and the positive changes it brought.
The Challenge: Before Our Shift
Like many technology companies, Artera’s engineering teams have always provided 24×7 support to our customers. However, the process that supported us as a small, early-stage startup no longer scaled to keep pace with our growing customer base and our internal growth. Our on-call support process required a revamp to ensure we continued to meet the needs of our customers and improve other areas internally:
- Burnout Central. A handful of engineers were burned out from constantly being under pressure, which made being on-call look daunting to the rest of the engineers.
- Knowledge Bottleneck. Critical knowledge was held by a small circle of engineers, a dangerous practice because that knowledge could easily be lost if those engineers were unavailable.
- Everything is an Emergency. There was no prioritization of which issues should be tackled first, so even minor issues became a “sev-1,” which created many unintended disruptions and a constant sense of crisis.
Other factors also indicated that this was not a sustainable or scalable process, so we needed to make a change, and fast.
Introducing Our New On-Call Model
There is lengthy documentation on our process that the organization uses for guidance. For the purpose of this blog, we will share a summarized version and highlight the main areas of focus.
- Team Structure and Shared Responsibility. Instead of relying on a few engineers, we spread the on-call responsibilities across all individuals and teams. It feels more like a neighborhood watch system where everyone gets to play a part. Teams decide their own rotation, so it is much more manageable for everyone’s personal schedules and comfort. We also ask that 20% of every team’s capacity be dedicated to on-call work so urgent issues can be addressed promptly.
- Mode of Operations. An on-call engineer is responsible for resolving several types of issues that belong to their direct team, typically in this priority order: severity-1 defects, release failures, severity-2 and severity-3 defects, security vulnerabilities, and miscellaneous engineering work that may not have a direct owner. When there are no on-call issues to work on, the on-call engineer can pick up sprint work or help the team in other ways.
- Handling the “Oh No!” Moments. When we have large, but rare, customer-impacting issues, we treat them as severity-1 issues and it’s all hands on deck. We engage our Customer Support team immediately to inform our customers of the issue while engineers join a war room to fix the problem. An event commander leads the charge to ensure clear coordination, communication, and documentation. Following the incident, we always hold a blameless postmortem (a review focused on learning and improving, not pointing fingers) and create an in-depth root-cause analysis document that is shared with our impacted customers. We also share this document internally with other teams so we can learn together and prevent similar issues in the future.
- Team Norms and Guiding Principles. Changing how we work can be a big deal, especially for those who are new to being on-call. We wanted to establish some core values:
  - Be supportive of your peers. Provide coverage when others are in need, jump in and help answer questions in your domain, and work as a team to help resolve our defects.
  - Be empowered to flex your work hours as needed. Everyone is encouraged to hand off on-call responsibilities during daytime hours and take time off to recover if they worked off-hours.
  - Be open-minded: this is an evolving process and there aren’t always clear answers. We will all deal with ambiguity, but we are all empowered to help improve this process.
  - Be brave and ask questions along the way! If you are not sure what to do, or how to do something, please do not hesitate to ask other engineers or EMs.
  - Be accountable and meet the on-call expectations. Treat on-call work appropriately based on urgency and SLAs.
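The priority order described under Mode of Operations can be pictured as a simple ranked queue. The following is a minimal sketch, not our actual tooling: the `WorkType` and `Ticket` names and the ticket keys are hypothetical, and ranking severity-2 ahead of severity-3 within their shared tier is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class WorkType(IntEnum):
    """Lower value means the on-call engineer picks it up first."""
    SEV1_DEFECT = 1
    RELEASE_FAILURE = 2
    SEV2_DEFECT = 3   # assumption: sev-2 is handled before sev-3
    SEV3_DEFECT = 4
    SECURITY_VULNERABILITY = 5
    MISC_ENGINEERING = 6
    SPRINT_WORK = 7   # fallback when the on-call queue is empty


@dataclass(frozen=True)
class Ticket:
    key: str            # hypothetical ticket identifier
    work_type: WorkType


def next_ticket(queue: List[Ticket]) -> Optional[Ticket]:
    """Return the highest-priority ticket, or None if nothing is waiting."""
    return min(queue, key=lambda ticket: ticket.work_type, default=None)
```

With a queue holding a severity-3 defect, a release failure, and a severity-1 defect, `next_ticket` returns the severity-1 defect; with an empty queue it returns `None`, which is the cue to pick up sprint work.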
How We Made the Change Happen… Together
Shifting to this new model was a big cultural change. It meant we needed to build new habits and a new mindset. We wanted to make sure everyone was part of this journey, and here are the high-level steps we took to roll out the new on-call process:
- We created high-level process documentation covering our motivation and purpose, team structure, mode of operations, workflows and processes for different levels of issues, and various reference pages that may be helpful to on-call engineers. Clarity from the start was key.
- We shared the document and our honest narrative with our engineering management team to gather initial feedback from them and their teams. This ensured that, while the process was new and could feel daunting, it included feedback from the individuals who would put it into practice.
- We incorporated as much feedback as appropriate into this documentation and added an FAQ section to address questions asked by many individuals. This ensured that everyone felt heard, saw their feedback addressed, and had a place to ask questions.
- We communicated the change at our engineering all-hands meeting. We shared the narrative and general process again, and set the tone that this process will evolve with the organization. This is a journey of progress, not instant perfection.
- We kicked off our process with a 6-sprint trial period, with each sprint ending with a retrospective on where the process works and where it needs improvement. This ensured we eased our way into a massive change and gave ourselves grace to make small changes every two weeks.
The Operational Nuts & Bolts
You may be curious how we managed this new model in practice…
Initially, all of our on-call engineers joined a daily triage call to review new issues, share updates, discuss impediments, and get input from others. A dedicated member of the Customer Support team also attended to get an overall understanding of where open issues stood and to ask clarifying questions as needed. The daily triage call was very effective for the first 6-8 months as we learned how to support on-call independently but also together. Three years later, we no longer need the daily triage call because we work so well together asynchronously!
To track all the work, we also set up a dedicated space (we use Jira, but you can use any project management tool) that enabled us to better organize, prioritize, and monitor all issues. We use labels to categorize the type of ticket being addressed (sev1/2/3/4 for defects, critical/high/med/low for security vulnerabilities). Having an organizational system for all the tickets helps us track the work easily and supports reporting. We use these labels to build a dashboard that shows how many issues are active, how quickly they are resolved (helping us meet our target SLAs), and other insights that allow everything to be managed more efficiently.
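To make the dashboard idea concrete, here is a minimal sketch of the two numbers mentioned above: active issues per label and the share of resolved issues that met their SLA. It works on plain in-memory records rather than any project management tool's API, and the `Issue` fields and the `SLA_TARGETS` durations are hypothetical values chosen for illustration, not our real targets.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical SLA targets per label; real targets will differ.
SLA_TARGETS = {
    "sev1": timedelta(hours=4),
    "sev2": timedelta(days=2),
    "sev3": timedelta(days=10),
}


@dataclass
class Issue:
    key: str
    label: str                       # e.g. "sev1", "sev2", "sev3"
    opened: datetime
    resolved: Optional[datetime] = None


def active_counts(issues: List[Issue]) -> Counter:
    """Count unresolved issues per label."""
    return Counter(issue.label for issue in issues if issue.resolved is None)


def sla_hit_rate(issues: List[Issue], label: str) -> Optional[float]:
    """Fraction of resolved issues with this label that met the SLA target.

    Returns None when no issue with this label has been resolved yet.
    """
    resolved = [i for i in issues if i.label == label and i.resolved is not None]
    if not resolved:
        return None
    hits = sum(1 for i in resolved if i.resolved - i.opened <= SLA_TARGETS[label])
    return hits / len(resolved)
```

In practice the records would come from an export or query of the tracking tool; keeping the calculation separate from the data source makes it easy to swap tools later.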
Fast communication is also vital for this model. We made heavy use of Slack (our enterprise messaging platform). We created a dedicated channel for all on-call engineers so that 1) all relevant chat is in one place, 2) urgent issues get immediate visibility, and 3) integrations from our observability tools can notify the channel when we need eyes on internal failures, such as a failed release or a major testing failure.
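As an illustration of the third point, the sketch below posts a failure notice to a channel through a Slack incoming webhook, which accepts a JSON body with a `text` field. This is an assumption about how such an integration could be wired, not a description of our setup: the message format, the `build_alert` and `post_to_slack` helper names, and the webhook URL you pass in are all hypothetical.

```python
import json
import urllib.request


def build_alert(source: str, summary: str, severity: str) -> dict:
    """Build an incoming-webhook payload; the message format is illustrative."""
    return {"text": f":rotating_light: [{severity.upper()}] {source}: {summary}"}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,  # supplied by the caller; keep it out of source control
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A release pipeline could call `post_to_slack(url, build_alert("deploy-pipeline", "release failed", "sev2"))` from its failure hook so the on-call channel sees the problem right away.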
The Payoff: Better for Our Team and Our Customers
In the three years since we adapted and evolved our on-call process, we have seen tremendous value internally and externally. We saw a massive increase in our ability to complete on-call tickets alongside our product roadmap, we have clear statistics on how quickly we meet our service-level agreements, and we have built great relationships with all of our stakeholders thanks to the visibility and efficiencies that come with this new model. On-call has now become a self-running process within each team and part of our fundamental engineering practice at Artera. We were delighted to see that our new on-call model drastically reduced the three risks described earlier in this post. Here is what happened, along with our takeaways.
- Goodbye Burnout! Hello Balance! We understand that most engineers do not want to be “on-call” by default. Even if it rarely happens, people will naturally feel anxious about possibly being paged in the middle of the night or during important personal events. We created an environment where this responsibility is spread across as many people as possible, so an engineer may only end up on rotation a handful of times a year. Because we have global teams in Hungary, we can also provide coverage across various time zones to minimize off-hours pages.
- Knowledge is Power, Especially When Shared. During on-call rotations or incidents, engineers can end up outside their comfort zone and have to look into codebases or systems they aren’t familiar with, so they naturally collaborate with others and learn new things. Because the on-call responsibilities are shared across everyone, we also saw more ownership and accountability, and everyone wanted to make things better. This organically led to higher-quality code: to minimize off-hours paging, engineers became much more cautious about what is pushed to production. Teams also adopted a great habit of building better monitoring and alerting so we can catch potential issues before they become problems for other teams or even our customers.
- Focus Where It Counts. We carved out a clear and agreed-upon list of on-call responsibilities and put them in priority order with service-level agreements so engineers can easily swarm on more urgent work. As mentioned, every team reserves 20% of its capacity to support unplanned work. This allows the team to plan much more effectively because unplanned on-call work no longer disrupts the rest of the team. We have also set clear expectations on who should resolve what and when, which improves our internal processes.
What is Next?
We are writing this blog three years into this transformation and we can honestly say we are very proud of what we built. We know it wasn’t meant to be static, and have already made numerous changes to make our model more effective based on our own evolution. The key to our success has been making “on-call” a shared experience and responsibility. For Artera, we created a version that works for us. We will continue to focus on incremental enhancements to this process, driven by feedback from our amazing engineers, managers, and partners.
Founded in 2015, Artera is based in Santa Barbara, California and has been named a Deloitte Technology Fast 500 company (2021, 2022, 2023), and ranked on the Inc. 5000 list of fastest-growing private companies for four consecutive years. Artera is a two-time Best in KLAS winner in Patient Outreach.
For more information, visit www.artera.io.