Monitoring and Metrics

Overview

Our current monitoring solution is based almost entirely around Nagios and custom NRPE scripts. There are pros and cons to using Nagios, though I will not go into them here; there are many comparisons on the internet.

We need a lightweight, scalable solution that is easy to configure and manage, and that provides a mechanism for analysing the data and acting on it, not just collecting it.

Redefining monitoring

I believe we also need to move beyond simply detecting errors to predicting them. We also collect a huge amount of valuable information in the logs, much of it structured, and it provides invaluable insight into customer behaviour.

Monitoring can be broken down into four main tasks:

  • Collection

  • Routing

  • Visualisation

  • Alerting

We have a custom monitoring tool, running on each client, that parses the application logs and formats messages for the NRPE daemon. These results are picked up by Nagios and routed back to a central server using NSCA.
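
The in-house tool itself is not shown here, but any NRPE check follows the standard Nagios plugin contract: print a status line (optionally with performance data after a '|') and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A minimal sketch in Python; the log path and thresholds are purely illustrative:

    #!/usr/bin/env python
    # Minimal NRPE-style check: count ERROR lines in an application log
    # and report using the Nagios plugin exit-code convention.
    import sys

    LOG_PATH = "/var/log/app/application.log"  # hypothetical path
    WARN, CRIT = 5, 20                         # illustrative thresholds

    def main():
        try:
            with open(LOG_PATH) as f:
                errors = sum(1 for line in f if "ERROR" in line)
        except IOError as e:
            print("UNKNOWN - cannot read log: %s" % e)
            sys.exit(3)

        # Status text first, performance data after the '|'
        if errors >= CRIT:
            print("CRITICAL - %d errors|errors=%d" % (errors, errors))
            sys.exit(2)
        if errors >= WARN:
            print("WARNING - %d errors|errors=%d" % (errors, errors))
            sys.exit(1)
        print("OK - %d errors|errors=%d" % (errors, errors))
        sys.exit(0)

    if __name__ == "__main__":
        main()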

Many customers manage their Nagios configuration manually. This is both unworkable and bad practice. As part of the new customer rollouts and the move to infrastructure as code (the entire stack is managed by Ansible), I converted the existing Nagios & NRPE configuration to Ansible roles. This has simplified the deployment of new servers, though the stack still suffers from poor visualisation (currently nagiosgraph), performance and alerting.

Ideas

Data storage

The system produces a large amount of log data and metrics. We glean small amounts of information from the logs but only scratch the surface of the useful data available.

We need to identify what to treat as metrics and what to do with the rest of the log data. The latter may be less useful for real-time monitoring of the system, but could, if collected correctly, add insight into customer behaviour and trends. Discussions on application log data vs metrics suggest the two should be treated differently:

Grafana + InfluxDB for purely time-series metrics (specifically, monitoring applications and servers), and ELK for monitoring/diagnostics against log file sources.
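
On the metrics side, InfluxDB exposes a simple HTTP write endpoint that accepts points in its line protocol, which Grafana can then query as a data source. A minimal sketch; the host, database and measurement names are illustrative:

    # Write a single point to InfluxDB over its HTTP API using the
    # line protocol: measurement,tag_set field_set timestamp(ns).
    import time
    import urllib.request

    INFLUX_URL = "http://localhost:8086/write?db=metrics"  # hypothetical database

    point = "cpu_load,host=web01 value=0.64 %d" % int(time.time() * 1e9)
    req = urllib.request.Request(INFLUX_URL, data=point.encode("utf-8"))
    urllib.request.urlopen(req)  # InfluxDB replies 204 No Content on success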

ELK (Elasticsearch / Logstash & Kibana):

  • Elasticsearch for deep search and data analytics
  • Logstash for centralized logging, log enrichment and parsing
  • Kibana for powerful and beautiful data visualizations
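
Logstash can parse arbitrary log formats with grok patterns, but parsing is simplest when the application emits structured logs in the first place, which Logstash's json codec can map straight to fields. A minimal sketch, assuming the application can log JSON; the event and field names are illustrative:

    # Emit one JSON object per log line so Logstash can parse events
    # with its json codec instead of grok patterns.
    import json
    import logging

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log = logging.getLogger("app")

    def log_event(event, **fields):
        fields["event"] = event
        log.info(json.dumps(fields))

    log_event("order_placed", customer_id=42, total=19.99)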

From reading reviews and comparisons, Kibana seems to fall short when it comes to aggregations and building more complex queries. Grafana has Elasticsearch support, so it seems to make sense to use Grafana in place of Kibana.

I aim to focus primarily on metrics initially, then move on to taming the log files.

Metrics

Many of our clients have had success with a combination of CollectD / StatsD on the client, writing to the InfluxDB time-series database to store the data points, and then graphing this information using Grafana. They also leverage the existing NRPE scripts used for application monitoring via Icinga2, which also writes to InfluxDB.
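
The StatsD side of that pipeline is trivially simple: applications fire plain-text 'name:value|type' datagrams over UDP, and the daemon aggregates and flushes them to the backend. A minimal sketch; the host, port and metric names are illustrative:

    # StatsD wire protocol: plain-text 'name:value|type' datagrams over UDP.
    import socket

    STATSD_ADDR = ("localhost", 8125)  # StatsD's default UDP port
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(name, count=1):
        # '|c' marks a counter
        sock.sendto(("%s:%d|c" % (name, count)).encode(), STATSD_ADDR)

    def timing(name, ms):
        # '|ms' marks a timer, in milliseconds
        sock.sendto(("%s:%d|ms" % (name, ms)).encode(), STATSD_ADDR)

    incr("app.logins")
    timing("app.request_time", 123)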

  • CollectD could replace the system NRPE checks
  • CollectD can talk directly to InfluxDB to store the points
  • Existing log parsing could be replaced with Logster (see the sketch after this list)
  • Riemann could optionally sit between CollectD / StatsD and InfluxDB to provide contextual alerting and event-stream processing
  • Icinga2 would replace Nagios to improve the UI and configuration. Icinga2 can also talk to InfluxDB.
  • InfluxDB would be a data source for Grafana
  • Icinga2 can pull in graphs from Grafana
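
Logster parsers are plain Python classes, so the existing log-parsing logic could carry over largely unchanged. A minimal sketch, modelled on the SampleLogster parser that ships with Logster; the metric name is illustrative:

    # Count application ERROR lines and report them as a rate,
    # modelled on the SampleLogster parser that ships with Logster.
    from logster.logster_helper import MetricObject, LogsterParser

    class ErrorRateParser(LogsterParser):

        def __init__(self, option_string=None):
            self.errors = 0

        def parse_line(self, line):
            # Logster calls this once for each new line in the log
            if "ERROR" in line:
                self.errors += 1

        def get_state(self, duration):
            # Report errors per second over the sampling interval
            return [MetricObject("app.errors",
                                 self.errors / float(duration),
                                 "errors per sec")]

Logster would run this from cron against the application log, tracking its position between runs and shipping the resulting metric to whichever output is configured.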

CollectD vs Telegraf?

Telegraf is written by the InfluxDB team and is very new to the scene. It can receive stats from a number of plugins, including StatsD, and output to a host of services, including InfluxDB. There is some debate over why another tool is needed when CollectD / StatsD already exist, but, given time, it could provide a cleaner and simpler solution.

Resources

Deployment and Configuration

This section will be expanded as I start installing and configuring the different components.
