Why So Detailed?
I tend to write a blow-by-blow account of my processes when possible, which means the account can become quite lengthy (especially if I run into issues). Why record all this detail and not just what works? Sometimes I do record only what works, and sometimes I provide a summary so folks can jump right to the answers. But in my experience troubleshooting systems over the past 30 years, I’ve almost always found that articles provide too little information rather than too much.
I realize this is a balancing act. But I figure there are plenty of articles out there on how to do it the straightforward way…if that’s what you need, use one of them. I’m here for those who, like me, have spent hours…sometimes days…maybe weeks…desperately attempting to resolve a technical issue when there is no documentation available from others on the problems they’re encountering.
The Actual Process
Recently, while I was troubleshooting an issue on one of my virtual private servers (VPS), someone recommended I try New Relic or netdata to get some in-depth insight into the problem. I’ve used New Relic in the past and, while I like their product, it is fairly expensive for small projects…netdata on the other hand is free (thus far).
When I set up an account with netdata.cloud I was instructed to claim nodes and given a script to execute. It looks something like this:
sudo netdata-claim.sh -token={yourtoken} -rooms={yourroom} -url=https://app.netdata.cloud
Try running that on your server and you’ll run into a problem – netdata isn’t installed on your server! Surprisingly, there is no link on the claim page to instructions on how to install the agent. There are several variations on the install instructions available (Linux, Docker, Kubernetes, macOS, etc). The first time around I used the “Linux with one-line installer” – literally:
bash <(curl -Ss https://my-netdata.io/kickstart.sh)
(Yes, paste the entire thing, not just the curl portion).
Now For the Fun Stuff
(aka, when things don’t go smoothly and you are treading deep waters)
The above script failed about halfway through, and I had to rerun it once or twice before it completed. Eventually it did install, and I was able to claim the node using the sudo netdata-claim.sh... command that appears above.
I received all sorts of interesting information about the system from netdata and figured I’d try it on another server. This time around I was writing this post and noticed that there was an install option for “Linux with pre-built static binary”. That sounded like it might resolve the issues I had getting netdata installed on the previous server: the build seemed to be running out of memory or something similar (each rerun made it a little further), and having everything pre-built should ease that. So I gave it a try:
bash <(curl -Ss https://my-netdata.io/kickstart-static64.sh)
So far that seems to have been a mistake. Everything started off happy enough and then there were a few glaring red messages:
--- Add user netdata to required user groups ---
Group 'docker' does not exist.
FAILED Failed to add netdata user to secondary groups
Group 'nginx' does not exist.
FAILED Failed to add netdata user to secondary groups.
Group 'varnish' does not exist.
...
Group 'haproxy' does not exist.
...
Group 'squid' does not exist.
...
Group 'ceph' does not exist.
...
Group 'nobody' does not exist.
...
Group 'I2C' does not exist.
...
I feel like blaring RED FAILED messages may not have been required here. The issue is that I don’t have docker or nginx or varnish, haproxy, etc. installed on this server, so why should these user groups exist? So, false alarm, imho. Thereafter things became happy again – I love the green OK!
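If you want to confirm for yourself that these are false alarms, you can check which of the groups actually exist. Here’s a minimal sketch (not part of the netdata installer; existing_groups is a name I made up) that uses getent to print only the groups present on the system:

```shell
# Print only those group names (given as arguments) that exist on this system.
# Assumes a Linux box where getent is available.
existing_groups() {
  local group
  for group in "$@"; do
    # getent exits non-zero when the group is not in the group database
    if getent group "$group" >/dev/null 2>&1; then
      printf '%s\n' "$group"
    fi
  done
}

# Example (commented out so the snippet has no side effects when sourced):
# existing_groups docker nginx varnish haproxy squid ceph
```

If that prints nothing for the groups the installer complained about, the FAILED lines are presumably just noise about software that isn’t installed.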
Eventually I see this message:
^
|.-. .-. .-. .-. .-. . netdata .-. .-. .-. .-
| '-' '-' '-' '-' '-' is installed now! -' '-' '-' '-'
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--->
Sweet! But apparently it isn’t installed, installed yet. Things continue…and take a turn for the worse:
/dev/fd/63: line 214: /opt/netdata/etc/netdata/.install-type: Permission denied
NOTE: did not remove /tmp/netdata-kickstart-M5GUSGscMA/netdata-latest.gz.run
Okay, the first server required a few executions of the script before a successful install, maybe the same will work here. So I rerun the command:
bash <(curl -Ss https://my-netdata.io/kickstart-static64.sh)
The installer finds the existing install and attempts to update it instead of creating a new install (I appreciate this). Again I get the false alarms about non-existent user groups but that’s it for problems. I’m happily informed that “netdata is installed now!” and “Updated existing install at /opt/netdata/bin/netdata”. Sweet, I assume this means I’m good to go.
I attempt to claim this node:
sudo netdata-claim.sh -token={yourtoken} -rooms={yourroom} -url=https://app.netdata.cloud
And get this unhappy reply:
sudo: netdata-claim.sh: command not found
Okay, not great. What if I attempt to run the original script that eventually worked on the previous server?
bash <(curl -Ss https://my-netdata.io/kickstart.sh)
It finds the existing install and attempts to update…I’m getting good at ignoring the false alarms now and don’t go into a panic as the RED FAILED message blasts repeatedly onto my terminal. The upgrade completes successfully. I try rerunning the netdata-claim.sh command but am greeted by the same error:
sudo: netdata-claim.sh: command not found
Let’s try running that script one more time, just for kicks and giggles…runs successfully…try to claim, same command not found.
Well, I know the netdata software installed to /opt/netdata so I head over there. I cd bin once there and use ls to see what I’ve got. There it is – netdata-claim.sh! I try running the node claim script again:
sudo netdata-claim.sh -token={yourtoken} -rooms={yourroom} -url=https://app.netdata.cloud
I’m greeted by the same error. Duh, I need to tell it I want to use the netdata-claim.sh file in the current bin directory:
sudo ./netdata-claim.sh -token={yourtoken} -rooms={yourroom} -url=https://app.netdata.cloud
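The reason the bare command fails: sudo looks the command up on its PATH, and neither the current directory nor /opt/netdata/bin was on it (at least not on my server). Here’s a small sketch of how one might locate the script instead of guessing. find_claim_script is a hypothetical helper, and /usr/sbin is my assumption about where a non-static install puts the script:

```shell
# Print the full path of the first executable netdata-claim.sh found in the
# directories given as arguments; return non-zero if none has it.
find_claim_script() {
  local dir
  for dir in "$@"; do
    if [ -x "$dir/netdata-claim.sh" ]; then
      printf '%s\n' "$dir/netdata-claim.sh"
      return 0
    fi
  done
  return 1
}

# Example (commented out; the directory list is an assumption, adjust to taste):
# claim=$(find_claim_script /opt/netdata/bin /usr/sbin) && echo "found: $claim"
```

Calling the script by the full path it prints sidesteps the PATH question entirely.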
It kicks off, thank goodness. But what’s being written to the terminal isn’t all sunshine:
Unable to communicate with Netdata daemon, querying config from disk instead.
Unable to communicate with Netdata daemon, querying config from disk instead.
...
Connection attempt 1 successful
uv_pipe_connect(): connection refused
Make sure the netdata service is running.
The claim was successful but the agent could not be notified (0)- it requires a restart to connect to the cloud.
Okay, so I restart netdata:
systemctl restart netdata
I am prompted to authenticate, provide the requested password, and all seems well. Back over in my netdata cloud dashboard the node is flipping back and forth between displaying collected data and stating it is unreachable. Hmmm…So a next step for me is usually trying to uninstall and reinstall a piece of software. I found the uninstall netdata page but it requests that I first try reinstalling netdata. Okay, I’m game.
I try running the reinstall command for the static binary:
bash <(curl -Ss https://my-netdata.io/kickstart-static64.sh) --reinstall
This says it is successful but I’m still having issues with it on the cloud dashboard. So I try running it for the build script install:
bash <(curl -Ss https://my-netdata.io/kickstart.sh) --reinstall
It politely informs me that the static build is what is installed so it can’t run. Appears reinstalling isn’t going to work for me…back to my initial plan – uninstallation!
On this page I learn that netdata should have created a .environment file at /etc/netdata/. Well, it didn’t. In fact, the netdata directory doesn’t exist at all. So I create it:
cd /etc
sudo mkdir netdata
cd netdata
sudo nano .environment
Placing the file in this directory actually appears unnecessary (see below) but I didn’t know that at the time. The contents of .environment are simple:
NETDATA_PREFIX=""
NETDATA_ADDED_TO_GROUPS=""
I’m instructed to run the uninstaller:
/usr/libexec/netdata/netdata-uninstaller.sh --yes --env /etc/netdata/.environment
In the above you’ll notice that you specify where the .environment file is, so it could be anywhere (e.g. your home directory). Hmm, the uninstaller script can’t be found. Ahh, continue reading the uninstall directions:
Note: Existing installations may still need to download the file if it’s not present. To execute uninstall in that case, run the following commands:
NETDATA Learn, Uninstall.
wget https://raw.githubusercontent.com/netdata/netdata/master/packaging/installer/netdata-uninstaller.sh
chmod +x ./netdata-uninstaller.sh
./netdata-uninstaller.sh --yes --env <environment_file>
That seems to succeed but when I attempt to install using the build script I again get the message:
--- Found existing install of Netdata under: /opt/netdata ---
ABORTED Existing install is a static install, please use kickstart-static64.sh instead.
I decide to nuke the folder myself:
sudo rm -rf /opt/netdata
When I run the build script again (with --reinstall) there is no more complaining about a static install. Instead things seem to kick off hopefully.
The installer tells me that it needs to grab a few packages using apt-get update and apt-get install to perform the build. I have no problem with this.
Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
libssl-dev : Depends: libssl1.1 (= 1.1.1f-1ubuntu2.4) but 1.1.1g-1+ubuntu20.04.1+deb.sury.org+1 is to be installed
E: Unable to correct problems, you have held broken packages.
We are very sorry!
Installation of required packages failed.
What to do now:
1. Make sure your system is updated.
Most of the times, updating your system will resolve the issue.
2. If the error message is about a specific package, try removing that package from the command and run it again.
Depending on the broken package, you may be able to continue.
3. Let us know. We may be able to help.
...
FAILED
WARNING It failed to install all the required packages, but installation might still be possible.
Press ENTER to attempt netdata installation >
I do it, I press ENTER. But no, I’m getting all sorts of errors about make, libmosquitto, cmake, libJudy, libtoolize, pkg-config, libbpf…and on it goes. I try the install again, that worked once before – right? Nope, not this time.
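Looking back at the unmet dependency message, the libssl1.1 version that apt wanted to install carries a deb.sury.org suffix, which suggests it came from a third-party repository rather than from Ubuntu itself. Here’s a rough sketch for spotting installed packages with such versions. third_party_versions is a made-up helper name; it just filters "package version" pairs and doesn’t fix anything:

```shell
# Read "package version" lines on stdin and print only those whose version
# string mentions deb.sury.org (the suffix seen in the apt error above).
third_party_versions() {
  awk '$2 ~ /deb\.sury\.org/ { print $1 " " $2 }'
}

# Example on a Debian/Ubuntu system (commented out so sourcing has no side effects):
# dpkg-query -W -f='${Package} ${Version}\n' | third_party_versions
```

Whether such packages are the actual culprit on your system is something I can’t promise; it’s simply where I’d start looking.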
I did a sudo apt upgrade on the system to upgrade all the packages, then attempted an install again using the build script, and this time it seems to have installed successfully. I then ran netdata-claim.sh again but it gave the same warnings I saw previously (unable to communicate…, uv_pipe_connect(), and agent could not be notified). Rather than just restart the service I decided to reboot the entire system (sudo reboot).
After the system started again I checked the netdata cloud dashboard – the node was still showing as “unreachable”. I then tried rerunning netdata-claim.sh. This time I didn’t receive any of the previous warnings but instead:
The agent cloud base url is set to the url provided.
The cloud may have different credentials already registered for this agent ID and it cannot be reclaimed under different credentials for security reasons. If you are unable to connect use -id=$(uuidgen) to overwrite this agent ID with a fresh value if the original credentials cannot be restored.
Failed to claim node with the following error message: "already claimed"
I tried running netdata-claim.sh with the -id=$(uuidgen) param added to the regular params. This seemed to succeed but now I’m seeing the warnings I had previously when registering: unable to communicate…, uv_pipe_connect(), and agent could not be notified (0) – it requires a restart to connect to the cloud.
Okay, I feel a bit like I’m going in circles but here goes: sudo systemctl restart netdata. Whelp. No luck.
I stumbled on some documentation on removing and reclaiming a node, thought that might be helpful:
sudo rm -rf /var/lib/netdata/cloud.d/
Note that while this “removes” the node it doesn’t do so from the netdata cloud dashboard:
This node no longer has access to the credentials it was claimed with and cannot connect to Netdata Cloud via the ACLK. You will still be able to see this node in your War Rooms in an unreachable state.
NETDATA Learn Claim: Remove and Reclaim a Node
Ugh. That isn’t awesome. I don’t want to have “dead” nodes floating around my “war room.” I delete the war room (it wasn’t an important one anyway) and try re-adding the node without specifying a war room. It runs but gives an error about being “already claimed” again and doesn’t appear in the dashboard.
I notice when going to “Claim Nodes” in the Dashboard I have a bunch of “Recently Claimed Nodes” (all dmserver01 with different IDs). Clicking on them doesn’t do anything (though the mouse pointer turns to a pointing finger as if they are clickable). I find I can add nodes to a war room (when in a war room there is a green + button which shows a list of available nodes) but none of them seem to be operational. Cool.
Okay, I’m walking away for now.
I’m Back
I walked away for a while but now I’m back. I decided to install netdata on a third VPS using the initial methodology (build script rather than pre-built) and see if it worked. Specifically I:
- Ran sudo apt update
- Ran sudo apt upgrade and updated all packages that weren’t current.
- Ran the bash script: bash <(curl -Ss https://my-netdata.io/kickstart.sh)
- Ran the netdata-claim.sh script as instructed in netdata cloud.
- Was informed I need to restart the netdata service before it would connect to the cloud.
- Restarted the service: sudo systemctl restart netdata
- It appeared as unreachable and I was saddened.
- Used ssh -L 19999:ip-of-server:19999 username@ip-of-server and then accessed the local interface at https://name-of-server:19999/, which loaded.
- At some point noticed that the node was now accessible to netdata cloud.
What exactly caused it to start working? Darned if I know. Might have taken a few minutes to sync up, maybe accessing the local server fixed something? The latter seems unlikely, but so does the former. I’m flummoxed. I hate when I fix a problem and don’t know exactly what was wrong.
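If it really is just a matter of waiting, a small retry helper saves staring at the dashboard. This is a generic sketch, not something from the netdata docs. wait_for is a hypothetical name, and it can only tell you that a local check passes, not that the cloud connection is healthy:

```shell
# Retry a command until it succeeds.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_for() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Only sleep if there is another attempt coming
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (commented out): poll the service every 10 seconds, up to 30 times.
# wait_for 30 10 systemctl is-active --quiet netdata && echo "netdata is running"
```

A passing check here would at least rule out the service itself being down while the dashboard says “unreachable”.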
And oops, here we go again, it is unreachable. I had this previously with nodes where they would go in and out of operation.
Performing a tail /var/log/netdata/error.log -f shows a lot of errors.
Some Links
- I found out about the local netdata web ui in the Get started doc.
- I looked for the option to remove nodes from the cloud. I haven’t found one that actually takes them away, but did find a GitHub issue on how to remove them from a war room.
- I found GitHub issues about nodes being unreachable; they were insightful but didn’t resolve anything for me. (GitHub issue #9624, GitHub issue #8966)