Monitoring Linux system with Grafana

This post shows some examples of how to monitor Linux system with Grafana. I use telegraf to collect data and it sends the collected data to influxdb.

Monitoring disk, RAM and CPU usage

Configure telegraf to collect relevant data

etc/telegraf/telegraf.conf:

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states
  report_active = false

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]

  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

Grafana panels

RAM

The query I use to get RAM usage in percentage:

from(bucket:"homebucket")|> range(start: -60m)|> filter(fn:(r) => r._measurement == "mem" and r._field == "used_percent" and r.host == "nuc" )

I use the Gauge panel type and configure it with the following options:

The end result is very simple and nice view:

Disk usage

For disk usage I have pretty much the same configuration, but the query is of course a little different (and panel’s title):

from(bucket:"homebucket")
  |> range(start: -1h)
  |> filter(fn: (r) =>
    r._measurement == "disk" and
    r._field == "used_percent" and
    r.host == "nuc"
  )

I know I could have combined RAM and disk usage queries to same panel but this was the way I did my initial setup and have not changed it since as it works well enough.

CPU usage

For CPU I use the default Time series panel type and the following query:

from(bucket:"homebucket")
  |> range(start: -15m)
  |> filter(fn: (r) =>
    r._measurement == "cpu" and
    r._field == "usage_system" and
    r.cpu == "cpu-total"
  )

There’s is not really any specific configurations for the panel and the end-result looks like this:

Monitoring systemd services

Configure telegraf to collect systemd data

etc/telegraf/telegraf.conf:

[[inputs.systemd_units]]
  ## Set timeout for systemctl execution
  timeout = "5s"
  #
  ## Filter for a specific unit type, default is "service", other possible
  ## values are "socket", "target", "device", "mount", "automount", "swap",
  ## "timer", "path", "slice" and "scope ":
  unittype = "service"
  #
  ## Filter for a specific pattern, default is "" (i.e. all)
  pattern = "*beat* *falco*"

I have some specific services that I want to monitor and I’m using pattern option to pick those only.

Grafana panel

I use the below query to get service status from the systemd_units data collection.

from(bucket:"homebucket")
  |> range(start: -15m)
  |> filter(fn: (r) =>
    r._measurement == "systemd_units" and
    r._field == "active_code"
  )

The query gives a number between 0-5 as a result and its meaning is defined in unit_active_state_table:

Value	Meaning	Description
0	active	unit is ~
1	reloading	unit is ~
2	inactive	unit is ~
3	failed	unit is ~
4	activating	unit is ~
5	deactivating	unit is ~

The display name by default has a value like active_code {active="active", host="nuc", load="loaded", name="auditbeat.service", sub="running"}. By setting the display name value and some value mappings I get more readable results to the panel.

This configuration sets “name” as display text from the above dictionary {...}.

This configuration maps integer values from unit_active_state_table to more understandable string values.

The result is a panel that shows the status of the monitored systemd services.