Enable Alerts
Grafana can send alerts through various channels, such as Discord, Telegram, or Email, when irregular behavior in the metrics or readings from Prometheus is detected. The following guide will configure Grafana notifications for Telegram.
It is convenient to temporarily open a text editor to store information needed for these steps.
8.1.1 Create Telegram Bot
You need a Telegram account in order to continue.
- Open the web-based or Desktop version of Telegram.
- Click on this link https://t.me/botfather and allow the BotFather application to open Telegram.
- A BotFather channel will open.
- Type `/newbot` in the message line.
- Send the message.
- Choose a full name for your bot.
- Choose a username for your bot. The username must end with `bot`.
- A message will appear with information about your bot.
- Highlight and copy the API token, then paste it into your text editor.
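If you want to double-check that the copied token works before continuing, you can call the Telegram Bot API's `getMe` method. Below is a minimal sketch using only the Python standard library; the token value is a placeholder you need to replace with your own.

```python
import json
import urllib.request

# Placeholder: replace with the API token you received from BotFather.
BOT_TOKEN = "123456:ABC-your-bot-api-token"

# getMe is a standard Telegram Bot API method that returns basic
# information about the bot if the token is valid.
url = f"https://api.telegram.org/bot{BOT_TOKEN}/getMe"

with urllib.request.urlopen(url) as response:
    data = json.load(response)

# A valid token returns {"ok": true, "result": {...bot info...}}.
print("Token valid:", data.get("ok"))
print("Bot username:", data.get("result", {}).get("username"))
```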
8.1.2 Create a Group
- Open the Telegram menu.
- Set up a new group.
- Choose a name for the group.
- Add your bot to the group by typing its exact username.
- Select the bot when it appears in the list and click `Create`.
- Send the message `/my_id` to the group to trigger an API update.
- Copy `https://api.telegram.org/bot<your-bot-api-token>/getUpdates` into your text editor.
- Replace `<your-bot-api-token>` with the API token of your bot.
8.1.3 Fetching the Chat ID
- Copy the link you just edited into a web browser.
- Look for the text that says `"id":`.
- Copy and paste the `id` value into your notes.

Ensure your Chat ID is copied in full. It might have a `-` in front.
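Instead of reading the raw JSON in the browser, you can also extract the Chat ID with a small script against the same `getUpdates` endpoint. This is only a sketch: the token is a placeholder, and it assumes you already sent a message (for example `/my_id`) to the group.

```python
import json
import urllib.request

# Placeholder: replace with the BOT API token from BotFather.
BOT_TOKEN = "123456:ABC-your-bot-api-token"

# getUpdates returns the messages the bot has recently seen,
# including the chat each message was sent in.
url = f"https://api.telegram.org/bot{BOT_TOKEN}/getUpdates"

with urllib.request.urlopen(url) as response:
    data = json.load(response)

# Each update that contains a message carries a "chat" object;
# for groups the chat ID is typically a negative number.
for update in data.get("result", []):
    chat = update.get("message", {}).get("chat", {})
    if chat:
        print("Chat ID:", chat["id"], "-", chat.get("title", chat.get("type")))
```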
8.1.4 Add Telegram Contact Points
- Return to Grafana.
- Log in using your credentials.
- On the left-hand menu, click `Alerting`.
- Click `Contact points` on the left side.
- Click `Add contact point`.
- Click `Add channel`.
- Fill in the following information:
  - `Name`: your notification channel name
  - `Integration`: Telegram
  - `BOT API Token`: your copied BOT API token
  - `Chat ID`: the copied Chat ID
- Click `Test` within the integration panel and wait until you see a new message in Telegram (see the sketch after this list for an independent check).
- Click `Save contact point`.
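You can also verify that the token and Chat ID pair works outside of Grafana by sending a message directly through the Telegram Bot API's `sendMessage` method. A minimal sketch, assuming the placeholder token and Chat ID are replaced with your own values:

```python
import json
import urllib.parse
import urllib.request

# Placeholders: replace with your own values from the previous steps.
BOT_TOKEN = "123456:ABC-your-bot-api-token"
CHAT_ID = "-123456789"  # group chat IDs usually start with a minus sign

# sendMessage is a standard Telegram Bot API method; parameters can be
# passed as a query string.
params = urllib.parse.urlencode({"chat_id": CHAT_ID, "text": "Grafana alert test"})
url = f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage?{params}"

with urllib.request.urlopen(url) as response:
    data = json.load(response)

# "ok": true means the message was delivered to the group.
print("Delivered:", data.get("ok"))
```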
8.1.5 Update Notification Policies
- On the left-hand menu, click `Alerting`.
- Click `Notification policies` on the left side.
- On the right of the default policy, click the 3-dot menu and choose `Edit`.
- Change `Default contact point` to your Telegram contact point.
- Click `Update default policy`.
8.1.6 Add Notifications to Metrics
To make sure the node notifies you when a process is down, follow these steps. If something is unclear, have a look at the picture of each alert further down.
Make sure you have the latest dashboard from this guide before continuing.
- Click the Grafana icon to get to the landing page.
- Click the LUKSO dashboard.
- Scroll down to the dashboard's `Alerts` section.
- Select each alert and click `Edit` in the 3-dot menu.
- Within the `Alert` tab, select `Create alert rule from this panel` if you do not already see an alert rule on the page that you can click on. Do not worry if you need to create it first; this is the default behavior, since you will have to create folders and groups first.
- Click `Preview` to render the graph and evaluate the metric values (you can also query Prometheus directly, as shown in the sketch after this list).
- Adjust the `Threshold` section to your liking to define when the alert should fire.
- In case the `Reduce` section shows multiple lines and one of them is `NaN`, set `Replace Non-numeric Values` to a custom number above the alert range. For single-line metrics that rely on clients or the network, it is recommended to set the `NaN` replacement to a number within the alert range, meaning that an alert is sent when the process or network is currently down.
- Within the `Alert evaluation behavior` section, add a `node-alerts` folder where all the alert data will be stored. If it already exists, select it from the panel. You can name the folder to your liking; it is only used to group alert data. It is recommended to always choose the same name for one network, node, or validator, so you do not mix up various targets and dashboards.
- Within the `Evaluation group` selection, add a `node-group`. If it already exists, select it from the panel. You can name the group to your liking; it is only used to group alert data. It is recommended to always choose the same name for one network, node, or validator, so you do not mix up various targets and dashboards.
- Scroll up and click `Save`.
- Repeat this for every alert on the dashboard.
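If you prefer to check metric values outside of Grafana before picking thresholds, you can query Prometheus directly over its HTTP API. A minimal sketch, assuming Prometheus listens on `localhost:9090` and that the `consensus-client-job` label matches the job name used in your Prometheus configuration:

```python
import json
import urllib.parse
import urllib.request

# Assumptions: Prometheus runs locally on port 9090 and the job label
# below matches your prometheus.yml scrape configuration.
PROMETHEUS_URL = "http://localhost:9090"
QUERY = 'up{job="consensus-client-job"}'

# /api/v1/query evaluates a PromQL expression at the current time.
url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})

with urllib.request.urlopen(url) as response:
    data = json.load(response)

# Each result holds the metric labels and a [timestamp, value] pair;
# a value of "1" means the process is up, "0" means it is down.
for result in data.get("data", {}).get("result", []):
    print(result["metric"].get("job"), "=", result["value"][1])
```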
8.1.7 Metrics Presets
Here are some example metrics that are included in the default dashboard. You can check the pictures and validate that everything is configured the same way as in the guide.
Consensus Process Down
1: Process up
0: Process down
up{job="consensus-client-job"}
Validator Process Down
1: Process up
0: Process down
up{job="validator-client-job"}
Consensus Process Restarted
1: Process up
0: Process down
NaN: Not available (likely down --> 0)
(time()-process_start_time_seconds{job="consensus-client-job"})/3600
Validator Process Restarted
1: Process up
0: Process down
NaN: Not available (likely down --> 0)
(time()-process_start_time_seconds{job="validator-client-job"})/3600
Below 40 Peers
above 30: Ideal healthy connections
below 30: Resyncing or weak connections
NaN: Not available (no connections --> 0)
p2p_peer_count{state="Connected",job="consensus-client-job"}
Participation Rate below 80%
above 80: Ideal healthy network
below 80: Unstable network
NaN: 2nd data feed (ignore metric --> 100)
beacon_prev_epoch_target_gwei / beacon_prev_epoch_active_gwei * 100
50 Slots Behind
below 50: Ideal syncing speed
above 50: Unstable syncing
NaN: Not available (likely unstable --> 51)
beacon_clock_time_slot-beacon_head_slot
No Hourly Earnings
above 0.0001: Earning rewards
below 0.0001: Syncing or negative rewards
NaN: Not available (likely negative rewards --> 0)
sum(validator_balance) - sum(validator_balance offset 1h) - count(validator_balance > 16)*32 + count(validator_balance offset 1h > 0)*32
Less than 2GB Free Memory
above 2000000000: More than 2GB remaining
below 2000000000: Less than 2GB remaining
(node_memory_MemFree_bytes{job="node-exporter-job"} or node_memory_MemFree{job="node-exporter-job"}) + (node_memory_Cached_bytes{job="node-exporter-job"} or node_memory_Cached{job="node-exporter-job"})
CPU Usage above 40%
above 4: More than 40% of computation resources used
below 4: Less than 40% of computation resources used
sum(irate(node_cpu_seconds_total{mode="user",job="node-exporter-job"}[5m])) or sum(irate(node_cpu{mode="user",job="node-exporter-job"}[5m]))
Disk Usage above 60%
above 0.6: Disk more than 60% occupied by tasks
below 0.6: Disk less than 60% occupied by tasks
(sum(node_filesystem_size_bytes{job="node-exporter-job"})-sum(node_filesystem_avail_bytes{job="node-exporter-job"}))/sum(node_filesystem_size_bytes{job="node-exporter-job"})
CPU Temperature above 75 °C
above 75: Processor is running hot
below 75: Processor is running normally
node_hwmon_temp_celsius{chip="platform_coretemp_0",job="node-exporter-job",sensor="temp1"}
Google Ping above 30ms
above 0.03: Connection takes longer than 30ms, not ideal
below 0.03: Connection takes less than 30ms, everything alright
probe_duration_seconds{job="google-ping-job"}
8.1.8 Configuring Notification Intervals
- Head over to the `Alerting` section on the left menu.
- Click on `Notification policies`.
- Click the 3-dot menu on the default notification policy.
- Choose `Edit` within the popup.
- Expand the `Timing Options` field.
The window should look similar to this one: one notification is sent every 5 minutes, and existing alerts are refreshed every 10 minutes. Grafana 9 will also send you a resolved message once the alert is no longer firing.