Monday, 30 March 2015

Nagios vs Sensu vs Icinga2

Choosing a suitable monitoring framework for your system is important. If you get it wrong, you might find yourself having to rewrite your checks and set up something different, most likely at great cost. Recently I looked into a few monitoring frameworks for a system and came to a few conclusions, which I'll share below.

Background

The System
At the time of this investigation, the system (which has a microservices architecture) was in the process of being "productionised".  It had no monitoring in place and had never been supported in production.  The plan was to introduce monitoring so that it could be supported and monitored 24x7 with the hope of achieving minimal downtime.

Warning - I am biased
Before we get started, I have to acknowledge a few biases I have. I have worked with Nagios in the past and found it to be a bit of a pain. However, this was probably because we created our checks in puppet, which added an extra layer of complexity to an already steep learning curve. I decided to re-evaluate Nagios because (a) we'd be creating our monitoring checks directly and (b) Nagios has moved on since.
I think I might also be biased towards newer technologies over older ones, for no better reason than that I'm currently at a startup that is working with a lot of new technologies.
  

Requirements

As follows:
  1. Highly scalable in terms of:
    1. handling complexity (presenting a large number of checks in a way that's easy to understand).
    2. handling load (can support lots of hosts with lots of checks).
  2. Secure.
  3. Good UI with:
    1. Access to historical alerts.
    2. Able to switch off alerts temporarily - possibly with comments e.g. “Ignoring unused failing web node - not causing an issue”.
  4. Easy to extend/change.
    1. Ability to define custom checks.
    2. Ability to add descriptive text to an alert e.g. “If this check fails it means users of our site won't be able to...”
    3. Easy to adjust alarm thresholds.   
  5. Good support for check dependencies.
    1. This is related to requirement 1.1) - Ideally the monitoring system will help the user separate cause from effect. When you have 100s of alerts firing, it becomes hard to establish the underlying cause (see earlier post on this). Without alert dependencies, every alert you add increases the risk of confusion during an incident. This is hugely important for your users when it comes to fixing a problem in 5 minutes instead of 30!
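
To make requirement 5 concrete, here is a minimal Python sketch of what dependency-aware alerting buys you. It is not taken from any of the frameworks below; the `Check` class, the `root_causes` function and the check names are all made up for illustration. The idea is simply that a failing check is only reported if none of the checks it depends on are also failing.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Check:
    # Hypothetical model of a monitoring check, for illustration only.
    name: str
    failing: bool
    depends_on: List[str] = field(default_factory=list)


def root_causes(checks: List[Check]) -> List[str]:
    """Return the names of failing checks whose dependencies are all healthy."""
    by_name = {c.name: c for c in checks}
    roots = []
    for check in checks:
        if not check.failing:
            continue
        # Suppress this alert if any check it depends on is also failing.
        parent_failing = any(
            by_name[dep].failing for dep in check.depends_on if dep in by_name
        )
        if not parent_failing:
            roots.append(check.name)
    return roots


checks = [
    Check("database", failing=True),
    Check("orders-api", failing=True, depends_on=["database"]),
    Check("web-frontend", failing=True, depends_on=["orders-api"]),
    Check("mail-relay", failing=False),
]

print(root_causes(checks))  # ['database']
```

Three alerts are firing here, but only the database one is surfaced as a likely cause. A real framework does considerably more than this (notifications, state changes, reachability), so treat it as a picture of the requirement rather than of any implementation.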

Nagios

Nagios is very popular, can do everything, but comes with several drawbacks.  For my proof of concept I extended Brian Goff's docker-nagios image. 

Pros:
  • Very popular so lots of support.
  • Huge number of features.
  • Good documentation.
Cons:
  • Steep learning curve due to its number of features.  This applies to both navigating the UI and writing checks.
  • Creating check dependencies is cumbersome as you have to reference checks via their service_description field.  This means you either use the description like an ID (i.e. not a description) or you duplicate your description in all the places you reference (depend on) your checks.
  • Running checks more often than once every 60 seconds comes with a "proceed at your own risk" disclaimer: "I have not really tested other values for this variable, so proceed at your own risk if you decide to do so!" See here.
  • UI feels old (at least it does to me).  I remember being very frustrated with the use of frames for the dashboard, which means it's hard to send people links, and if you hit F5 to refresh it takes you back to the homepage.

 

Sensu 

Sensu is a lightweight framework that's simple to extend and use.  I used Hiroaki Sano's sensu docker image to get my proof of concept up and running.

Pros:
Cons:
  • UI has a feature called stash which I don't understand and doesn't seem to be documented.
  • Dependencies can be configured, but they seem to have little effect on the dashboard... see issue I raised on github here.
  • Documentation is great as a beginner's guide and walkthrough, and I learnt the basics very quickly.  However, I soon found I needed a specification page detailing exactly what the json could and could not contain.  This will be fixed soon hopefully: https://github.com/sensu/sensu-docs/issues/192
  • Could not associate descriptive text with my checks e.g. "This checks for connectivity to the database which is required for..."

Icinga2

Originally a fork of the Nagios project (and now a complete rewrite), this framework has a huge number of features and a good-looking dashboard.  I used Jordan Jethwa's icinga2 docker image.

Pros:

Cons:
  • Found the documentation hard to understand at first - this is related to the high learning curve.
  • Can assign notes / free text to alerts, but the dashboard seems to only present them at quite a low level.  I couldn't find a way of customising the dashboard to display my "notes".  Perhaps this can be customised at the dashboard level via some feature I missed or that is undocumented.
 

Chosen Framework: icinga2

Sensu was sadly discarded because we felt there was a risk it would not scale in terms of handling the future complexity of the system.  This was mainly due to the apparent lack of support for dependencies.  The gaps in the documentation also made it feel like it wasn't quite ready to be adopted.  I'm definitely hopeful for sensu's future and look forward to seeing how it develops... it's definitely one to watch.

Nagios vs Icinga2.
They both have:
  • the same number of compatible plugins.   
  • lots of features at the cost of a high learning curve.
  • support for dependencies (so both should scale well in terms of complexity).
The differences:
  • icinga2 has a nicer UI - it feels more responsive.
  • icinga2 lets you dynamically create objects and their relationships with conditionals, which (I think) should result in less boilerplate and copy-pasted code than I have seen with nagios in the past.
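
To illustrate why conditionally created objects should cut down on copy-pasted definitions, here is a rough analogy in plain Python. It is not icinga2 configuration syntax, and the hosts, attributes, check names and the `apply_rules` function are all invented for the example. Each rule pairs a check with a condition, and the check is attached to every host that satisfies it, instead of being written out once per host.

```python
# Hypothetical inventory; in a real setup this would come from the
# monitoring configuration rather than a Python list.
hosts = [
    {"name": "web-1", "role": "web", "env": "prod"},
    {"name": "web-2", "role": "web", "env": "staging"},
    {"name": "db-1", "role": "db", "env": "prod"},
]

# Each rule is (check name, condition on the host).
rules = [
    ("http", lambda h: h["role"] == "web"),
    ("disk", lambda h: True),
    ("replication-lag", lambda h: h["role"] == "db" and h["env"] == "prod"),
]


def apply_rules(hosts, rules):
    """Return (host name, check name) pairs for every rule a host matches."""
    return [
        (host["name"], check)
        for host in hosts
        for check, matches in rules
        if matches(host)
    ]


for host_name, check in apply_rules(hosts, rules):
    print(host_name, check)
```

Adding a fourth web host means adding one line to the inventory; it picks up the "http" and "disk" checks without any new check definitions.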
Hopefully this will prove a good decision for the project!

Docker is great for these investigations

Being able to run someone else's docker image which gets an entire monitoring framework up, running and accessible in your browser in less than a minute is amazing.  Not only that, but you can also easily add/edit where necessary to get a feel for development.  It removes all the unwanted complexity in following setup guides.  Thanks to Brian Goff, Jordan Jethwa and Hiroaki Sano for their docker images.

If I missed anything...

Let me know.  I only had limited time and may well have missed some killer feature or another monitoring framework entirely that's way better than all of the above.

8 comments:

  1. Excellent post. I've now worked with all three, and agree with you on almost all points. I'm going with Icinga2 as well, although I don't like icingaweb2 at all, and am replacing it with Thruk which is very obviously inspired by the old Nagios interface, but fixes all the issues, such as the lack of direct links with the frames.

    Completely agreed that Sensu is one to watch, but the lack of complete functionality in any one of the three UIs that I worked with (uchiwa and sensuadmin being the other two) was really frustrating. The API access was pretty cool, though.

  2. Greetings from Seattle. We actually went along a similar path, and ended up migrating away from sensu to icinga2. We are also using ansible for all the things!

  3. I did the same thing. Sensu is cool, but it just lacks the maturity that we needed.

  4. Ran across this post the other day: http://thehackernews.com/2015/12/how-to-hack-instagram.html

    Had to laugh when I saw it was a flawed Sensu Admin which allowed him in and gave him all that access. Even gladder I went with Icinga2, although any software needs to be patched regularly. Also wonder why they had it publicly accessible.

  5. Thanks a lot, it's a very interesting overview and quite well done.

  6. "UI has a feature called stash which I don't understand and doesn't seem to be documented."

    Same stash as in Git. It temporarily hides an event from handling. Like "Ok, I know about that alarm but I will fix this later. Don't bother me now."

  7. Icinga/Nagios is not suited for cloud infrastructures with servers changing at every moment. Whenever the infrastructure changes, Icinga/Nagios has to be reconfigured and restarted. Although this can be automatically done with configuration management tools, it is still not a clean solution, since provisioning runs are performed only in certain intervals. The monitoring tool should be able to handle a changing environment on its own in real-time.

  8. Thank you for sharing your views on the monitoring options
