DevOps is nothing more than agile principles applied to the interface between software and infrastructure.
The goal is to accelerate development, improve quality and lessen the stress on the team. We apply a wide range of methodologies and tools, most of them revolving around the following values:
- Open; Ops and Dev people work as a team from the beginning of the project. Information is shared transparently across the team (e.g. Slack notifications on commits and code deployments), and incidents are addressed openly as a team (5 Whys & After Action Report).
- Iterative; we release and deliver small increments as quickly as possible. This relies heavily on automation (e.g. Ansible) and continuous everything (e.g. TravisCI, Pipelines).
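To make the Slack notifications mentioned above a little more concrete, here is a minimal sketch that builds the JSON body for a Slack incoming webhook. The helper name and the example values (project, version, environment, author) are hypothetical; the only thing assumed about Slack is that incoming webhooks accept a JSON object with a `text` field. Nothing is sent here: posting the body to a webhook URL is deliberately left out.

```python
import json


def build_deploy_message(project, version, environment, author):
    """Build the JSON body for a Slack incoming webhook (hypothetical helper)."""
    # Incoming webhooks accept a JSON object with a top-level "text" field.
    text = f"{author} deployed {project} {version} to {environment}"
    return json.dumps({"text": text})


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(build_deploy_message("api", "v1.4.2", "staging", "alice"))
```

A deployment script (for example one triggered from an Ansible playbook) could call this at the end of a release and post the result to the team channel.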
Best-practices & Toolbox
- Monitoring, Logging & Reporting; we monitor, log and report everything we can with TIGK (Telegraf, InfluxDB, Grafana & Kapacitor) and ELK (Elasticsearch, Logstash & Kibana). We also log errors that occur in the clients (React, iOS, Android…) with Sentry or Fabric.
- Configuration management & Infrastructure as code; we script everything we can, from deployment pipelines to infrastructure configuration, with Python and Ansible.
- Micro-services and Containers; we tend to prefer building SOA [1] with lots of small (often stateless) micro-services. We increasingly rely on Docker to deliver these services.
- Test-Driven Development & Continuous testing; we try to test most things (including things like Ansible playbooks), and we continuously run these tests and report the results in Slack with tools like TravisCI.
- Testing, Performance & Scalability; we continuously monitor code benchmarks (e.g. memory usage), run and report tests (even for Ansible) and stress-test services before releasing (using Locust, ApacheBench or Siege).
- ChatOps; we over-communicate all development-related events in Slack (from commits on GitHub to monitoring failures and deployment status). When possible, we allow users to run commands from Slack using a bot (aka ChatOps).
- Just-in-time engineering; we avoid premature optimization (e.g. caching) or over-engineering.
- 5 Whys & AAR; following an incident, the team gets together and investigates the root cause of the issue using the 5 Whys technique. We often send an AAR to stakeholders and/or clients afterwards. We’re not looking for a culprit; we’re using this as an opportunity to learn and grow.
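As an illustration of the code benchmark monitoring mentioned in the list above (e.g. memory usage), the following is a minimal sketch of a memory budget check that could run alongside the other tests. It uses Python's standard `tracemalloc` module. The workload function, the helper names and the 10 MiB budget are assumptions made for the example, not values from an actual project.

```python
import tracemalloc


def peak_memory_bytes(func, *args, **kwargs):
    """Run func and return (result, peak bytes allocated by Python while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns a (current, peak) tuple in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_report(n):
    # Hypothetical workload standing in for real application code.
    return [str(i) for i in range(n)]


def check_memory_budget(budget_bytes=10 * 1024 * 1024):
    """Return True when the workload stays under the (assumed) memory budget."""
    _, peak = peak_memory_bytes(build_report, 10_000)
    return peak <= budget_bytes


if __name__ == "__main__":
    print("within budget:", check_memory_budget())
```

A check like this can run in CI next to the unit tests, so that a regression in memory usage is reported the same way as a failing test.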
1: Spotify’s engineering culture is a great illustration of that concept.