Enable Alerts
Grafana can send alerts through various channels, such as Discord, Telegram, or Email, when irregular behavior in the metrics or readings from Prometheus is detected. The following guide will configure Grafana notifications for Telegram.
It is convenient to temporarily open a text editor to store information needed for these steps.
8.1.1 Create Telegram Bot
You need a Telegram account in order to continue.
- Open the web-based or Desktop version of Telegram.
- Click on this link https://t.me/botfather and allow the BotFather application to open Telegram.
- A BotFather channel will open.
- Type `/newbot` in the message line.
- Send the message.
- Choose a full name for your bot.
- Choose a username for your bot. The username must end with `bot`.
- A message will appear with information about your bot.
- Highlight and copy the API token, then paste it into your text editor.
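If you want to double-check that the copied token works before continuing, you can call the Telegram Bot API's `getMe` method. Below is a minimal sketch using only the Python standard library; the token value is a placeholder you need to replace with your own.

```python
import json
import urllib.request

# Placeholder: replace with the API token you received from BotFather.
BOT_TOKEN = "123456:ABC-your-bot-api-token"

# getMe is a standard Telegram Bot API method that returns basic
# information about the bot if the token is valid.
url = f"https://api.telegram.org/bot{BOT_TOKEN}/getMe"

with urllib.request.urlopen(url) as response:
    data = json.load(response)

# A valid token returns {"ok": true, "result": {...bot info...}}.
print("Token valid:", data.get("ok"))
print("Bot username:", data.get("result", {}).get("username"))
```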
8.1.2 Create a Group
- Open the Telegram menu.
- Set up a new group.
- Choose a name for the group.
- Add your bot to the group by typing its exact username.
- Select the bot when it appears in the list and click `Create`.
- Send the message `/my_id` to the group to trigger an API update.
- Copy `https://api.telegram.org/bot<your-bot-api-token>/getUpdates` into your text editor.
- Replace `<your-bot-api-token>` with the API token of your bot.
8.1.3 Fetching the Chat ID
- Copy the link you just edited into a web browser.
- Look for the text that says `"id":`.
- Copy and paste the `id` value into your notes.

Ensure your Chat ID is copied in full. It might have a `-` in front.
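Instead of reading the raw JSON in the browser, you can also extract the Chat ID with a small script against the same `getUpdates` endpoint. This is only a sketch: the token is a placeholder, and it assumes you already sent a message (for example `/my_id`) to the group.

```python
import json
import urllib.request

# Placeholder: replace with the BOT API token from BotFather.
BOT_TOKEN = "123456:ABC-your-bot-api-token"

# getUpdates returns the messages the bot has recently seen,
# including the chat each message was sent in.
url = f"https://api.telegram.org/bot{BOT_TOKEN}/getUpdates"

with urllib.request.urlopen(url) as response:
    data = json.load(response)

# Each update that contains a message carries a "chat" object;
# for groups the chat ID is typically a negative number.
for update in data.get("result", []):
    chat = update.get("message", {}).get("chat", {})
    if chat:
        print("Chat ID:", chat["id"], "-", chat.get("title", chat.get("type")))
```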
8.1.4 Add Telegram Contact Points
- Return to Grafana.
- Log in using your credentials.
- On the left-hand menu, click `Alerting`.
- Click `Contact points` on the left side.
- Click `Add contact point`.
- Click `Add channel`.
- Fill in the following information:
  - `Name`: your notification channel name
  - `Integration`: Telegram
  - `BOT API Token`: your copied BOT API token
  - `Chat ID`: the copied Chat ID
- Click `Test` within the integration panel and wait until you see a new message in Telegram (see the sketch after this list for an independent check).
- Click `Save contact point`.
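You can also verify that the token and Chat ID pair works outside of Grafana by sending a message directly through the Telegram Bot API's `sendMessage` method. A minimal sketch, assuming the placeholder token and Chat ID are replaced with your own values:

```python
import json
import urllib.parse
import urllib.request

# Placeholders: replace with your own values from the previous steps.
BOT_TOKEN = "123456:ABC-your-bot-api-token"
CHAT_ID = "-123456789"  # group chat IDs usually start with a minus sign

# sendMessage is a standard Telegram Bot API method; parameters can be
# passed as a query string.
params = urllib.parse.urlencode({"chat_id": CHAT_ID, "text": "Grafana alert test"})
url = f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage?{params}"

with urllib.request.urlopen(url) as response:
    data = json.load(response)

# "ok": true means the message was delivered to the group.
print("Delivered:", data.get("ok"))
```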
8.1.5 Update Notification Policies
- On the left-hand menu, click `Alerting`.
- Click `Notification policies` on the left side.
- On the right of the default policy, click the 3-dot menu and choose `Edit`.
- Change `Default contact point` to your Telegram contact point.
- Click `Update default policy`.
8.1.6 Add Notifications to Metrics
To make sure the node notifies you when a process is down, follow these steps. If something is unclear, have a look at the picture of each alert further down.
Make sure you have the latest dashboard from this guide before continuing.
- Click the Grafana icon to get to the landing page.
- Click the LUKSO dashboard.
- Scroll down to the dashboard's `Alerts` section.
- Select each alert and click `Edit` in the 3-dot menu.
- Within the `Alert` tab, select `Create alert rule from this panel` if you do not already see an alert rule on the page that you can click on. Do not worry if you need to create it first; this is the default behavior, since you will have to create folders and groups first.
- Click `Preview` to render the graph and evaluate the metric values (you can also query Prometheus directly, as shown in the sketch after this list).
- Adjust the `Threshold` section to your liking to define when the alert should fire.
- In case the `Reduce` section shows multiple lines and one of them is `NaN`, set `Replace Non-numeric Values` to a custom number above the alert range. For single-line metrics that rely on clients or the network, it is recommended to set the `NaN` replacement to a number within the alert range, meaning that an alert is sent when the process or network is currently down.
- Within the `Alert evaluation behavior` section, add a `node-alerts` folder where all the alert data will be stored. If it already exists, select it from the panel. You can name the folder to your liking; it is only used to group alert data. It is recommended to always choose the same name for one network, node, or validator, so you do not mix up various targets and dashboards.
- Within the `Evaluation group` selection, add a `node-group`. If it already exists, select it from the panel. You can name the group to your liking; it is only used to group alert data. It is recommended to always choose the same name for one network, node, or validator, so you do not mix up various targets and dashboards.
- Scroll up and click `Save`.
- Repeat this for every alert on the dashboard.
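If you prefer to check metric values outside of Grafana before picking thresholds, you can query Prometheus directly over its HTTP API. A minimal sketch, assuming Prometheus listens on `localhost:9090` and that the `consensus-client-job` label matches the job name used in your Prometheus configuration:

```python
import json
import urllib.parse
import urllib.request

# Assumptions: Prometheus runs locally on port 9090 and the job label
# below matches your prometheus.yml scrape configuration.
PROMETHEUS_URL = "http://localhost:9090"
QUERY = 'up{job="consensus-client-job"}'

# /api/v1/query evaluates a PromQL expression at the current time.
url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})

with urllib.request.urlopen(url) as response:
    data = json.load(response)

# Each result holds the metric labels and a [timestamp, value] pair;
# a value of "1" means the process is up, "0" means it is down.
for result in data.get("data", {}).get("result", []):
    print(result["metric"].get("job"), "=", result["value"][1])
```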
8.1.7 Metrics Presets
Here are some example metrics that are included in the default dashboard. You can check the pictures and validate that everything is configured the same way as in the guide.
Consensus Process Down
1: Process up
0: Process down
up{job="consensus-client-job"}
Validator Process Down
1: Process up
0: Process down
up{job="validator-client-job"}
Consensus Process Restarted
1: Process up
0: Process down
NaN: Not available (likely down --> 0)
(time()-process_start_time_seconds{job="consensus-client-job"})/3600
Validator Process Restarted
1: Process up
0: Process down
NaN: Not available (likely down --> 0)
(time()-process_start_time_seconds{job="validator-client-job"})/3600
Below 40 Peers
above 30: Ideal healthy connections
below 30: Resyncing or weak connections
NaN: Not available (no connections --> 0)
p2p_peer_count{state="Connected",job="consensus-client-job"}
Participation Rate below 80%
above 80: Ideal healthy network
below 80: Unstable network
NaN: 2nd data feed (ignore metric --> 100)
beacon_prev_epoch_target_gwei / beacon_prev_epoch_active_gwei * 100
50 Slots Behind
below 50: Ideal syncing speed
above 50: Unstable syncing
NaN: Not available (likely unstable --> 51)
beacon_clock_time_slot-beacon_head_slot
No Hourly Earnings
above 0.0001: Earning rewards
below 0.0001: Syncing or negative rewards
NaN: Not available (likely negative rewards --> 0)
sum(validator_balance) - sum(validator_balance offset 1h) - count(validator_balance > 16)*32 + count(validator_balance offset 1h > 0)*32
Less than 2GB Free Memory
above 2000000000: More than 2GB remaining
below 2000000000: Less than 2GB remaining
(node_memory_MemFree_bytes{job="node-exporter-job"} or node_memory_MemFree{job="node-exporter-job"}) + (node_memory_Cached_bytes{job="node-exporter-job"} or node_memory_Cached{job="node-exporter-job"})
CPU Usage above 40%
above 4: More than 40% of computation resources used
below 4: Less than 40% of computation resources used
sum(irate(node_cpu_seconds_total{mode="user",job="node-exporter-job"}[5m])) or sum(irate(node_cpu{mode="user",job="node-exporter-job"}[5m]))
Disk Usage above 60%
above 0.6: Disk more than 60% occupied by tasks
below 0.6: Disk less than 60% occupied by tasks
(sum(node_filesystem_size_bytes{job="node-exporter-job"})-sum(node_filesystem_avail_bytes{job="node-exporter-job"}))/sum(node_filesystem_size_bytes{job="node-exporter-job"})
CPU Temperature above 75 °C
above 75: Processor is running hot
below 75: Processor is running normally
node_hwmon_temp_celsius{chip="platform_coretemp_0",job="node-exporter-job",sensor="temp1"}
Google Ping above 30ms
above 0.03: Connection takes longer than 30ms, not ideal
below 0.03: Connection takes less than 30ms, everything alright
probe_duration_seconds{job="google-ping-job"}
8.1.8 Configuring Notification Intervals
- Head over to the `Alerting` section on the left menu.
- Click on `Notification policies`.
- Click the 3-dot menu on the default notification policy.
- Choose `Edit` within the popup.
- Expand the `Timing Options` field.
The window should look similar to this one: one notification is sent every 5 minutes, and existing alerts are refreshed every 10 minutes. Grafana 9 will also send you a resolved message once the alert is no longer firing.