Monitoring with Prometheus, Loki, and Grafana

With the growth of microservice-based architectures, the points of failure are distributed across multiple applications and servers. This raises the need for an active monitoring solution that helps administrators and application developers learn about failures before the users of the systems notice them. In this blog, we briefly introduce the responsibilities of a monitoring system, followed by a short guide to a Prometheus- and Loki-based monitoring infrastructure.

Motivation

Monitoring infrastructure can help in the following aspects:

1. Identification of faults:

Faults such as network failures, unavailability, application exceptions, and resource overload are unavoidable when running a complex IT infrastructure. Monitoring helps in identifying faults so that they can be addressed in time. Often it is not only about faults that have already occurred, but also about faults that are about to occur as a consequence of other malicious behaviour or a lack of resources. The monitoring system should facilitate the identification of such signs by providing interactive dashboards and timely alerts. Fault identification only indicates that a fault exists; finding the exact point of failure usually requires further debugging.

2. Debugging

Debugging an identified problem requires a deep investigation of the fault or event that occurred. Root cause analysis involves examining various factors specific to the deployed service, viz. CPU usage, memory usage, network transmissions, internal logs, and exceptions. Monitoring dashboards make such information available to administrators without manual extraction. Monitoring the logs accelerates debugging, provided the logs can be visualized and filtered.

3. Data-driven insights

Analyzing the long-term and short-term data collected for a running service brings a multitude of strategic benefits. Historical data showing the resource usage of a service over time supports decisions related to acquiring software, comparing alternatives, and scaling the infrastructure up or down.

Now that we have listed the overall purpose of monitoring, let us discuss the general functionalities provided by monitoring tools. A typical monitoring tool provides mainly two functionalities: (1) monitoring, which involves gathering various metrics and logs, and (2) reporting, which involves visualization and alerting based on the collected metrics. In addition, some monitoring tools also perform administrative actions such as resource optimization (e.g. Amazon CloudWatch), remote management tooling (e.g. NinjaRMM), and IT workflow automation (e.g. OpManager).

Overview

In this blog, we talk about monitoring using the open-source monitoring tools Prometheus and Grafana Loki. After reading this blog, the reader should be able to perform the following tasks:

  1. Metrics Monitoring: Metrics collected from the containers running on different hosts include CPU usage, memory usage, and network transactions. We achieve this using Prometheus.
  2. Log Management: Logging gives an insight into the running applications and their behavior. Service providers can use the logs to deduce the causes of malfunctions and to get an overview of the target’s behavior. We realize this using Loki.
  3. Visualization and Alerting: Monitored metrics and logs are visualized and explored using different Grafana panels. Alertmanager is used to manage the alerts generated by Prometheus.

Deployment and Data Flow

The figure below shows a monitoring infrastructure deployment where a Monitoring server monitors different target hosts running on local or remote premises. Prometheus, Loki, Grafana, and Alertmanager are installed on the Monitoring server. Vector and cAdvisor are installed on the target hosts to extract logs and metrics, respectively. Vector collects the Docker logs and cAdvisor collects various Docker metrics. Prometheus scrapes the metrics from cAdvisor. Loki, on the other hand, does not have a scraping mechanism, so the logs are pushed by Vector to Loki. Both the metrics (collected by Prometheus) and the logs (collected by Loki) are visualized using Grafana. Generated alerts are forwarded to Alertmanager, which groups, deduplicates, and routes them to the relevant receivers, such as email servers.

Figure: The deployment and data flow

Setting up the Monitoring Server with Docker

The monitoring server is composed of Grafana, Prometheus, Loki, and Alertmanager. Optionally, you can also run Vector and cAdvisor if you want to monitor the monitoring server itself.

The monitoring server deployment can have the following file structure, where the conf/ directory holds all the configuration files and the data/ directory holds the Docker volumes mounted into the containers running the monitoring services.

.
+-- docker-compose.yaml
+-- conf/
|   +-- alertmanager.yaml
|   +-- Loki.yaml
|   +-- prometheus_alert.rules
|   +-- prometheus.yaml
+-- data/
|   +-- alertmanager/
|   +-- grafana/
|   +-- Loki/
|   +-- prometheus/
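
For orientation, a minimal sketch of what such a docker-compose.yaml might look like follows. The image tags, ports, volume paths, and environment values are illustrative assumptions based on the defaults of the respective images; the file downloaded in step 1 below is authoritative. The user ID matches the ownership set in step 2.

version: "3"
services:
  prometheus:
    image: prom/prometheus
    user: "5678:5678"
    ports:
      - "9090:9090"
    volumes:
      - ./conf/prometheus.yaml:/etc/prometheus/prometheus.yml:ro
      - ./conf/prometheus_alert.rules:/etc/prometheus/prometheus_alert.rules:ro
      - ./data/prometheus:/prometheus

  alertmanager:
    image: prom/alertmanager
    user: "5678:5678"
    ports:
      - "9093:9093"
    volumes:
      - ./conf/alertmanager.yaml:/etc/alertmanager/alertmanager.yml:ro
      - ./data/alertmanager:/alertmanager

  loki:
    image: grafana/loki
    user: "5678:5678"
    command: -config.file=/etc/loki/loki.yaml
    ports:
      - "3100:3100"
    volumes:
      - ./conf/Loki.yaml:/etc/loki/loki.yaml:ro
      - ./data/Loki:/loki

  grafana:
    image: grafana/grafana
    user: "5678:5678"
    ports:
      - "3000:3000"
    environment:
      # Initial admin password and pre-installed plugins (illustrative values; see step 7 below)
      - GF_SECURITY_ADMIN_PASSWORD=change-me
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    volumes:
      - ./data/grafana:/var/lib/grafana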

Let us set up the server by following the steps below:

  1. Download the Docker Compose file using wget:
     wget https://raw.githubusercontent.com/linksmart/blog/master/_posts/resources/2021-03-01-Monitoring-with-Prometheus-Loki-and-Grafana/monitoring-server/docker-compose.yaml
    
  2. Create the configuration directory and the directories used as Docker volumes. If the mounted data directories are not created beforehand, the Docker engine creates them with ownership that the containers cannot alter, which causes unexpected behavior.
     mkdir -p conf data data/grafana data/prometheus data/alertmanager data/loki
    

     Set the right ownership on the volumes so that the Docker containers have permission to read and write their contents. See the docker-compose.yaml file for the values used. The user IDs are set explicitly in docker-compose.yaml to avoid relying on default user IDs, which are difficult to keep track of. The user ID used below must match the one set in the docker-compose file.

     sudo chown -R 5678:5678 data
    
  3. Create and edit the Prometheus configuration file conf/prometheus.yaml. A sample configuration can be found here. scrape_configs specifies the jobs, i.e. the targets from which Prometheus pulls metrics. The alerting section configures Prometheus to send generated alerts to Alertmanager, which routes them further. A hedged sketch of this file appears after this list.
    More about the configuration can be found in the official documentation.

  4. Create and edit the Prometheus alert rules file conf/prometheus_alert.rules. A sample rule file can be found here. In the sample, the group targets triggers an alert whenever a Prometheus scrape target is down; the other two groups create alerts whenever a container running on a target server is down. A sketch appears after this list. More about the configuration can be found in the official documentation.

  5. Create and edit the Alertmanager configuration file conf/alertmanager.yaml. A sample configuration can be found here. Here, routing options such as mail servers and mailing lists are configured; a sketch appears after this list. More about the Alertmanager configuration can be found in the official documentation.

  6. Create and edit the Loki configuration file conf/Loki.yaml. A sample configuration can be found here. More about the Loki configuration can be found in the official documentation.
  7. Adjust the Grafana configuration. You can do this by setting environment variables through docker-compose.yaml, as illustrated in the compose sketch shown earlier. Grafana plugins can also be pre-installed using environment variables.

  8. Run all the services as docker containers.
    docker-compose up
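
For orientation, below is a hedged sketch of the three configuration files from steps 3–5, concatenated with comment headers. The target endpoints, rule thresholds, receiver names, and mail settings are illustrative assumptions; the linked sample files and the official documentation are authoritative.

# ---- conf/prometheus.yaml (sketch) ----
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/prometheus_alert.rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

# ---- conf/prometheus_alert.rules (sketch) ----
groups:
  - name: targets
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Scrape target {{ $labels.instance }} is down"
# Further groups alerting on individual containers can be defined analogously,
# e.g. based on cAdvisor's container_last_seen metric.

# ---- conf/alertmanager.yaml (sketch) ----
route:
  receiver: mail-admins
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: mail-admins
    email_configs:
      - to: admin@example.com
        from: alertmanager@example.com
        smarthost: smtp.example.com:587
        auth_username: alertmanager@example.com
        auth_password: change-me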
    

Setting Up the Monitoring Clients

Let us follow the steps below to export metrics to Prometheus.

1. Exporting the metrics to Prometheus

Export metrics using cAdvisor

If you are running your containers in a virtual machine and want to expose metrics related to these containers to Prometheus, cAdvisor can be a handy tool. You can run a Docker container of cAdvisor using the following command:

docker run --name cadvisor \
   --restart unless-stopped \
   -d \
   -v /:/rootfs:ro \
   -v /var/run:/var/run:rw \
   -v /sys:/sys:ro \
   -v /var/lib/docker/:/var/lib/docker:ro \
   -p 8080:8080 \
   --privileged \
   google/cadvisor:latest

Note that cAdvisor runs in privileged mode and mounts the root of the filesystem as a volume. Therefore, it is important to update this container regularly to avoid security issues.

Custom Prometheus exporters

If cAdvisor does not fit your use case, plenty of other exporters are listed on the official Prometheus page.

2. Adding the scrape config in Prometheus

Once the exporters are set up following the instructions in the previous section, edit prometheus.yaml on the monitoring server to add the following job under scrape_configs:

  - job_name: cadvisor_vm1
    scrape_interval: 5s
    static_configs:
      - targets: ['vm1:8080']

Setting up the Log Monitoring Clients

Exporting logs to Loki

If you are running your containers on a server and want to export their logs to Loki, Vector can be a handy tool. A sample configuration vector.toml for Vector is here; it is set up to export the Docker logs to Loki running on a sample server. Update the target Loki URL and the hostname (used for labelling) in the configuration, and run Vector using the following command:

docker run \
  -d \
  -v ~/vector.toml:/etc/vector/vector.toml:ro \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -p 8383:8383 \
  --restart unless-stopped \
  timberio/vector:latest-alpine 

You can also use Promtail to push the logs to Loki. Other supported clients for Loki are listed here.

Visualization of the Metrics and Logs

To visualize the Prometheus metrics and Loki logs in Grafana, the Prometheus and Loki data source plugins in Grafana need to be configured. First, log in to Grafana as admin with the default credentials. Then, follow the instructions in the official documentation of the Prometheus data source plugin and the Loki data source plugin to set them up.
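
If you prefer configuration files over clicking through the UI, Grafana's provisioning mechanism can register both data sources automatically at startup. A minimal sketch follows, assuming the service names and ports of the docker-compose setup above; the file would be mounted into the Grafana container under /etc/grafana/provisioning/datasources/.

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100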

Exploring Loki and Prometheus

Both the Loki and Prometheus data sources offer explore functionality in Grafana: Loki can be explored using LogQL queries and Prometheus using PromQL.

PromQL examples

  • CPU usage (in percent) of the containers running on a host, averaged over the last 5 minutes

    rate(container_cpu_user_seconds_total{image!="",job="<target name>"}[5m]) * 100

  • Memory usage of the containers running on a host

    container_memory_usage_bytes{image!="",job="<target name>"}

LogQL examples

  • Get the NGINX logs containing HTTP errors: this query shows the log lines with 4xx or 5xx responses for the resource endpoint /grafana:

    {containers="<container_name>"} |~ "/grafana" |~ "HTTP.1[.]1.. [4-5][0-9][0-9]"

  • Count of HTTP errors (4xx and 5xx responses) over the last 10 minutes for the resource endpoint /grafana:

    count_over_time({containers="<container_name>"} |~ "/grafana" |~ "HTTP.1[.]1.. [4-5][0-9][0-9]" [10m])

    The result is a metric, so it can be visualized using Grafana panels.

Visualization using Grafana panels

If the exporter to Prometheus is cAdvisor, then a ready-made cAdvisor/Prometheus dashboard can be used to visualize the Docker containers.

Figure: Monitoring the Docker containers running on a server

To visualize metrics derived from Loki logs, such as the NGINX error counts from the LogQL examples above, Graph or Stat panels can be used. To see the raw logs, the Logs panel can be used. There is also a recently published panel for displaying NGINX logs.

Use Case: The EFPF Ecosystem

The EFPF ecosystem, created in the European project EFPF (European Connected Factory Platform for Agile Manufacturing), is a federated platform ecosystem that interlinks multiple IoT platforms in the manufacturing domain. The objective of the EFPF ecosystem is to enable communication and collaboration among the connected digital manufacturing platforms and to support the creation of innovative cross-platform composite applications that offer more added value, ultimately helping companies meet the market demands for mass customization or lot-size-one manufacturing.

Figure: The EFPF ecosystem

Monitoring

The EFPF ecosystem consists of centrally deployed ecosystem enablers such as the Data Spine, EFPF Portal, and Marketplace, as well as other distributed tools, services, and platforms owned by different entities and deployed at different locations. The EFPF ecosystem administrator needs to ensure the good health and proper functioning of the ecosystem enablers as well as the connected platforms. Because of the distributed nature of the deployment, traditional monitoring solutions that target a centralized deployment cannot be used. Instead, cAdvisor collects the availability and resource usage information locally, and the central Prometheus-based monitoring server scrapes it. This information can then be visualized using customized Grafana dashboards, and alerts can be generated on detection of critical events using Alertmanager. The individual platform administrators, who would otherwise have to set up a complete monitoring solution from scratch for their respective platforms, can also use this monitoring infrastructure by simply installing data sources such as cAdvisor alongside their deployments.

Logging and debugging

In the EFPF ecosystem, a typical composite application orchestrates multiple services across different platforms to achieve a common objective. In cases where a composite application does not function as expected even though all its component services are up and running, a deep dive into the logs of the component services is needed to pinpoint the exact cause of the problem. In the EFPF ecosystem, with the connected platforms owned and managed by different organizations, this is especially difficult: contacting the system administrators of the different platforms to check the logs of the component services would be cumbersome and time-consuming. The log collection by Vector and the filtering and visualization functionalities provided by Loki and Grafana support debugging across multiple platforms and simplify the process of finding and fixing issues.

Benefits of historical insights

With historical insights into resource consumption available, the EFPF administrator and the individual platform administrators can optimize the allocation of resources. Moreover, if several tools or services with similar functionality are available, consumers in the EFPF ecosystem can make better-informed decisions based on this information and other quality-of-service parameters of those tools/services.

Acknowledgement

This work was funded by the European Commission (European Union) within the H2020 DT-ICT-07-2018-2019 project “European Connected Factory Platform for Agile Manufacturing” (EFPF), grant number 825075.


By Shreekantha Devasya and Rohit Deshmukh

Written on March 1, 2021