OpsDev

Contents

Production First

There have been a number of articles recently discussing the concept of OpsDev.

Instead of leaving important architectural decisions to the last minute they should be part of the design from the beginning. Topics such as High Availability, Fault Tolerance, Service Discovery, Data Persistence, Centralised Logging and Monitoring are essential in modern applications and can be much more complicated if left to the last minute to implement.

From that I wanted to build a demo system that emulates a production environment. It should facilitate all the topics mentioned above in the simplest way possible to demonstrate each concept. The demo should provide a framework for experimenting with new tools and technologies.

Project Overview

https://github.com/jamesdmorgan/vagrant-ansible-docker-swarm

The project is available on github and will be extended as new components are added. The name may change as there is more than just an automated swarm

Everything will be automated so it should be trivial to get working. At the moment it uses around 12Gb of Ram. This can be reduced by using a smaller number of Virtual Machines. I wanted to use at least 3 managers to demonstrate clustering of both Docker Swarm and Consul.

Current Components

  • Vagrant - Management of Virtualbox VMs
  • Ansible - Provisioning of boxes
  • Docker 1.12 - Swarm creation on manager and worker boxes
  • Consul - External DNS, K/V store and dashboard
  • Registrator - Intended to register services but doesn’t currently work with 1.12
  • Consul-notifier - Temporary replacement to Registrator for registering Docker Swarm Services

Roadmap

  • RSyslog - Log shipping for container logs -> LogStash / filebeat
  • LogStash - Log file filtering
  • Elasticsearch - Log file indexing
  • Kibana - Log file UI

  • CollectD - System metrics
  • cAdvisor - Docker metrics
  • Riemann - Metric filtering and processing
  • InfluxDB - Metric storage
  • Grafana - Metric dashboard

  • Flocker - Volume management

Data persistence

Docker swarm brings amazing potential and new problems. Having containers re-scheduled causes problems when there are volumes / data persistence. Cattle containers work well in this environment. More complicated containers running Databases or k/v stores need to treated carefully. In my example I have a couple of containers that fall into this category. Mongo currently can move about the swarm. This would not work in a production environment as data loss is totally unacceptable. The options seem to be. Either don’t run in the Swarm, limit to specific hosts using constraints or use volume manager like Flocker. I will investigate this after I have centralised logging and monitoring.

I will write an individual post for each stage of the project. Starting with building the Virtual Machines with Vagrant & Ansible, through provisioning the Docker Swarm, service discovery, applications, logging and system monitoring.

comments powered by Disqus