Introduction
In the first part of this mini-series we looked at setting up a Prometheus server, an exporter to report metrics, and Grafana as the graphical front-end for data display.
We left off with three remaining tasks: setting up an alert, routing it to a service like Slack, and securing the set-up by locking down ports and adding SSL. This is what we will be looking at in this second part of the blog post.
Alerts
A monitoring system that you need to stare at isn’t very helpful unless you can afford to do a lot of staring and never sleep. And while you’ll never be able to avoid staring entirely, it’s still best for your peace of mind to know that some things will be reported to you automatically.
Prometheus uses so-called alert rules to define conditions on metrics and what exactly to do when they are met. We’ll begin by adding an alert that fires when an instance goes down.
To set this up we need to make three changes:
- add an alert.rules file
- map this file into the container by editing docker-compose.yml
- update prometheus.yml to make it use the file
Here’s what alert.rules looks like:
ALERT service_down
  IF up == 0
More on what this does in a moment; let’s first add it to the container. For this we need to add the following line to the volumes section of the prometheus service in docker-compose.yml:
services:
  prometheus:
    ...
    volumes:
      ...
      - ./alert.rules:/etc/prometheus/alert.rules
Finally, we’ll need to tell Prometheus that this is where the alerts are defined. Simply append a top level entry rule_files:
to prometheus.yml
:
...
rule_files:
  - 'alert.rules'
Prometheus Expressions
The alert rule for our service status looks deceptively simple. The syntax is based on the Prometheus expression language and allows you to set up conditions based on complex queries over the metrics.
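To give a flavour of what such queries can look like, here is an expression that computes per-instance CPU usage from the node-exporter’s CPU counters. This is purely an illustration; the metric is called node_cpu in the exporter version used here, while newer releases call it node_cpu_seconds_total:

100 - (avg(rate(node_cpu{mode="idle"}[5m])) by (instance) * 100)

The inner rate() turns the ever-increasing idle-time counters into the fraction of time spent idle over the last five minutes, avg ... by (instance) averages that across CPU cores, and subtracting from 100 gives the busy percentage.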
In our initial example, we’re querying against what is probably the most basic metric available: the up state of the exporters. This binary metric (you can inspect it here) reports 1 or 0 for the configured exporters.
To see this in action, simply shut down the node-exporter:
docker-compose stop node-exporter
and refresh the graph or check the alerts page:
Load Check
What we’ve done here for the built-in up metric is easily done for others as well. So next we’ll set up an alert that fires when the load rises above 0.5. Add the following to alert.rules:
ALERT high_load
  IF node_load1 > 0.5
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} under high load",
    description = "{{ $labels.instance }} of job {{ $labels.job }} is under high load.",
  }
Note that we’ve also taken the opportunity to add an ANNOTATIONS section, the purpose of which will become apparent in a minute.
First let’s confirm we can trigger this alert by creating some load, for example by running
docker run --rm -it busybox sh -c "while true; do :; done"
We should be seeing the following after a while:
Alertmanager
Alerts themselves are metrics that can be displayed, which means they can easily be added to a Grafana dashboard:
The metric shown at the bottom in two different variants is the following:
ALERTS{alertname="high_load",alertstate="firing"}
The configuration for this can be imported from dashboard.json in this post’s github repo and you can inspect the set-up of the panels to see how to represent the values as shown above.
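If you would rather show a single number, for example in a singlestat-style panel, a query along the following lines (just a sketch) counts all currently firing alerts:

count(ALERTS{alertstate="firing"})

Note that this returns no data at all when nothing is firing, rather than 0.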
While it is useful to have this display, you will also want to be notified by other means, like a Slack channel or via email. To set this up we need to add another component to the mix, the Alertmanager, which is also part of the Prometheus project. We need to make only a handful of changes:
- extend docker-compose.yml with a section to launch the container
- in that same file, tell prometheus how to connect to the Alertmanager by passing in the -alertmanager.url flag
- provide an alertmanager.yml configuration file with our specific alert routes
So, in more detail, these are the additions to docker-compose.yml:
# docker-compose.yml
version: '2'
services:
  prometheus:
    ...
    command:
      - '-config.file=/etc/prometheus/prometheus.yml'
      - '-alertmanager.url=http://alertmanager:9093'
    ports:
      ...
  alertmanager:
    image: prom/alertmanager:0.1.1
    volumes:
      - ./alertmanager.yml:/alertmanager.yml
    command:
      - '-config.file=/alertmanager.yml'
volumes:
  ...
This is all that’s needed to launch the alertmanager service and connect prometheus to it. (Again, note how we can reference the service simply by its service name, thanks to the name resolution in the container network.)
Slack Receiver
The alertmanager takes care of routing any alerts that fire to whatever service is configured in its configuration file alertmanager.yml, which looks as follows:
# alertmanager.yml
route:
  receiver: 'slack'
receivers:
  - name: 'slack'
    slack_configs:
      - send_resolved: true
        username: 'Prometheus'
        channel: '#random'
        api_url: 'https://hooks.slack.com/services/<your>/<stuff>/<here>'
In this case we set up a Slack receiver for our alerts, which will result in the following message being posted when alerts occur:
In order to make this possible, you will need to set up an incoming webhook integration for your Slack team and update the api_url: config with the value you get from the integration.
You can see from the screenshot that you can also get notified when an alert is resolved, thanks to the send_resolved: true setting in the config file. There are a few other parameters you can set, as described in the Slack receiver documentation.
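The route section can be extended as well, for example to control how alerts are grouped into notifications and how often reminders are repeated. A sketch (field names as per the Alertmanager route documentation; defaults and availability may vary slightly between versions):

# alertmanager.yml
route:
  receiver: 'slack'
  group_by: ['alertname', 'instance']
  group_wait: 30s      # wait before sending the first notification for a new group
  group_interval: 5m   # wait before notifying about new alerts added to a group
  repeat_interval: 3h  # re-send notifications for alerts that keep firing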
SSL Configuration
The final piece necessary to make this set-up deployable is to protect the monitoring site with SSL and the easiest way to do that is with another dockerised service: lets-nginx. This is another great example of how you can add pieces to the puzzle of building up a service in a modular way.
We start by adding a section for the ssl service to our docker-compose.yml:
ssl:
  image: finestructure/lets-nginx:1.3
  ports:
    - "443:443"
  volumes:
    - letsencrypt:/etc/letsencrypt
    - letsencrypt_backups:/var/lib/letsencrypt
    - dhparam_cache:/cache
This sits at the same level as the other services, prometheus, grafana, etc., in that file.
You’ll notice that we are referencing three volumes here that we also need to add to the volumes section at the very end of the file. Simply append the following:
volumes:
  ...
  letsencrypt: {}
  letsencrypt_backups: {}
  dhparam_cache: {}
While it’s not strictly necessary to do this, it is advisable for the following reasons:
- lets-nginx requests new certs every time you launch the container if there are no valid certs
- if you don’t keep your certs around between restarts you may hit letsencrypt’s rate limit (currently 5 per week)
- creating the Diffie-Hellman parameters takes quite a while and you don’t want to re-create them on every start-up
With this we have set up a generic ssl container which is not configured in any way specific to our service yet. To do so, simply add the following three lines to a new environment: section, for example between image: and ports: and at the same level:
- EMAIL=<your email, e.g [email protected]>
- DOMAIN=<your domain, e.g. mydomain.com>
- UPSTREAM=grafana:3000
These three lines set up environment variables which determine the parameters lets-nginx uses during startup. EMAIL and DOMAIN configure your SSL cert while UPSTREAM tells nginx what host to proxy to.
There is one small detail we need to take care of before running this, and that is adding a dependency of ssl on grafana. The reason for this is to prevent the ssl container from exiting because the grafana host is not yet available, which can happen if ssl launches faster than grafana (typically the case, unless ssl has to compute the DH parameters first). Add the following sub-section to the ssl: entry:
depends_on:
  - grafana
And that is all there is to getting SSL for your service. If you run docker-compose up -d now you should be able to access your service on your public IP address via SSL. (Be aware that the first launch of ssl will be slow because of the DH parameter computation.)
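If you are curious what the ssl container is doing during that first launch, you can follow its output with docker-compose (if your docker-compose version does not support the -f flag, a plain docker-compose logs ssl works too):

docker-compose logs -f ssl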
Removing Open Ports
Of course, we’re not quite done yet. While we’ve added an SSL proxy to our grafana service, we haven’t closed the door yet on the unsecured ports of the other services. Doing so is as simple as removing all ports: sub-sections from docker-compose.yml except for the one for port 443 under ssl:.
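The relevant parts of docker-compose.yml then look roughly like this (a sketch, with everything else elided):

services:
  prometheus:
    ...            # ports: section removed
  node-exporter:
    ...            # ports: section removed
  grafana:
    ...            # ports: section removed
  ssl:
    ...
    ports:
      - "443:443"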
The ssl service will still be able to talk to grafana even without its ports: declared, because they are still exposed at the container network level, just not externally.
What this means is that you will no longer be able to connect to Prometheus directly at port 9090 or to node-exporter at port 9100. However, direct access is really only necessary while setting up the system or for trouble-shooting, as all the information gathered by those subsystems is displayed via Grafana.
Conclusion
This concludes our two-part mini-series about monitoring with Prometheus, Grafana & Docker. The configuration files are available on github in finestructure/blogpost-prometheus, with the tags part1 and part2 pointing to the state of the configuration at the end of each part.
There is one difference between the files in this repository and the description in this blog post, and that is how the configuration files are added to the containers. Throughout this series, this mapping was declared as follows:
...
volumes:
  ...
  - ./prometheus.yml:/etc/prometheus/prometheus.yml
This works fine as long as you run docker-compose against a local docker daemon. However, if you attempt to run this set-up with docker-machine, for example on Digital Ocean, it will fail. The reason is that the volume mapping to these local files will not ‘travel’ with the service description and the services will not find their configuration.
Therefore, we have made a small change in commit 3a01d23 to copy the configuration files into the images via specific Dockerfiles for the two services that require configuration files, prometheus and alertmanager. This makes it much easier to test the set-up with a hosted service, which in turn is an easy way to get a public IP for the SSL set-up.
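To give an idea of what that looks like, a Dockerfile for the prometheus service can be as small as the following sketch (the base image tag and exact file layout are assumptions here; see the repository for the real files):

# Dockerfile for the prometheus service
FROM prom/prometheus
COPY prometheus.yml /etc/prometheus/prometheus.yml
COPY alert.rules    /etc/prometheus/alert.rules

In docker-compose.yml the image: entry for the service is then replaced by a build: entry pointing at the directory containing this Dockerfile.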
If you have any questions or comments, please get in touch!