Envato teams are responsible for the operation of the systems they build.
My team is trying something different to help onboard new people. We’re creating a set of infrastructure koans for them to complete. The koans are tasks that, once completed, will help folks navigate our infrastructure and systems and pick up the skills that are essential for supporting our services.
When someone joins the team, a new issue is created in one of our team’s GitHub repos using the koans document as a template. Once the new team member has completed all of the koans, they are added to the on-call rota and assigned a buddy who can help if things get tricky whilst on call.
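As a rough illustration of that first step, here is a minimal sketch of how a koans issue could be opened through the GitHub REST API. It is not necessarily how our team does it. The function name, the issue title format, and the idea of assigning the issue to the new team member are all assumptions made for this example.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_koans_issue_request(owner, repo, token, new_member, koans_template):
    """Build (but do not send) a request that opens a koans issue.

    All arguments are hypothetical placeholders: `koans_template` is the text
    of the koans document, and `new_member` is a GitHub username.
    """
    payload = {
        "title": f"Infrastructure koans: {new_member}",  # assumed title format
        "body": koans_template,
        "assignees": [new_member],
    }
    return urllib.request.Request(
        url=f"{GITHUB_API}/repos/{owner}/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would create the issue. Building the request separately from sending it makes the sketch easy to inspect without touching a real repo.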
The koans are not meant to be laid out step by step unless the task is complex or requires unusual knowledge. We hope this encourages folks to explore and internalise more than they would if following a to-do list.
Some Example Koans
Set yourself up on PagerDuty and read through past incidents.
View metrics for each of our systems in New Relic.
- What is the average response time?
- What do CPU, memory, and I/O utilisation look like on each server?
- What are the slowest transactions for the service? Dig into each transaction and see where the time is spent.
- Check the error analytics tab and look for any relationships between errors and past deployments.
- Check the availability and capacity reports.
- Look for trends in the web transactions and database reports.
Look up each of our services in Rollbar.
- What are the two most common errors being reported?
- Drill into the details of a recorded error.
- Are these errors we can live with? Should we create a task to fix them?
Open the AWS CloudWatch console.
- Look through the available dashboards and metrics.
- What CloudWatch alarms do we have configured for our production systems?
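The console is the intended route for this koan, but the same question can be answered from code. The sketch below assumes a boto3 CloudWatch client is passed in. CloudWatch calls its alerts "alarms", and the function name here is made up for the example.

```python
def list_metric_alarms(cloudwatch):
    """Return sorted (name, state) pairs for every metric alarm.

    `cloudwatch` is assumed to be a boto3 CloudWatch client, for example
    boto3.client("cloudwatch"), with credentials for the target account.
    """
    alarms = []
    # describe_alarms is paginated, so walk every page of results.
    paginator = cloudwatch.get_paginator("describe_alarms")
    for page in paginator.paginate():
        for alarm in page.get("MetricAlarms", []):
            alarms.append((alarm["AlarmName"], alarm["StateValue"]))
    return sorted(alarms)
```

Taking the client as an argument keeps the function free of any account or region details. The caller decides which environment to inspect.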
Open the AWS ECS console.
- How many task definitions do we have? How many available versions exist for each of them?
- Which systems make up each of our service clusters?
- How many repositories do we have in ECR?
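For anyone who wants to cross-check what the console shows, the counting questions above could be sketched as follows. This assumes boto3 ECS and ECR clients are passed in. The function names are hypothetical, and `list_task_definitions` returns only active revisions unless asked otherwise.

```python
from collections import Counter


def revisions_per_family(ecs):
    """Map each task definition family to its number of listed revisions.

    `ecs` is assumed to be a boto3 ECS client, e.g. boto3.client("ecs").
    """
    counts = Counter()
    for page in ecs.get_paginator("list_task_definitions").paginate():
        for arn in page["taskDefinitionArns"]:
            # ARNs end in "task-definition/<family>:<revision>".
            family_and_revision = arn.rsplit("/", 1)[1]
            family = family_and_revision.rsplit(":", 1)[0]
            counts[family] += 1
    return dict(counts)


def count_repositories(ecr):
    """Count ECR repositories visible to the client.

    `ecr` is assumed to be a boto3 ECR client, e.g. boto3.client("ecr").
    """
    pages = ecr.get_paginator("describe_repositories").paginate()
    return sum(len(page["repositories"]) for page in pages)
```

The number of keys in the dictionary returned by `revisions_per_family` answers the first half of the task definition question, and the values answer the second half.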
Access our ELK cluster and run some queries.
Run queries against our production database replicas.
Decrypt some database backups.
SSH into various servers in our infrastructure.