I got hacked, and that has uncovered all the things I've been doing wrong
Apologies: I was sure I had taken more screenshots, but now I cannot find them. Another thing I did not handle well.
My VPS got hacked, and this is the postmortem of the whole situation, written down as a precaution and for future reference.
On Friday evening I took a look at my Grafana dashboard for the VPS and saw that for the last hour the CPU usage had been at a constant 100% on all cores, with a load of over 5.
A quick ssh to the machine and a look at htop showed that the CPU was being used by several instances of the /tmp/fghgf process. I killed it with killall, but of course it came back again like in that Chumbawamba song. I tried searching for that process name in every web search engine, but there were no definite results.
However, I also saw that from time to time a process called xmrig would run, and that gave a definite result: a crypto miner. So somebody was using my CPU to mine crypto. And here I thought this bullshit scam “currency” was close to dying out. Sigh.
The processes were being run by the prometheus user, so my first suspicion went in that direction. The first thing I did was to close port 9090 in UFW. I had it open so that my self-hosted Grafana could reach the Prometheus on the VPS for telemetry logging. The connection is secured by basic auth with a user and password; the password is a randomly generated UUID. I knew that blocking the port was not enough, but I hoped I had at least severed the connection between the attacker and my server. Now that I write this, it occurs to me that I could also have blocked outgoing traffic on port 9090.
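For reference, closing the port in both directions could look roughly like the sketch below. It is not a record of the exact commands I typed. It assumes the port had been opened with a plain "allow 9090/tcp" rule (a rule restricted to a source address would need a matching delete), and PROM_PORT and close_prometheus_port are names made up for the example.

```shell
#!/bin/sh
# Sketch only: PROM_PORT and close_prometheus_port are illustrative names.
# 9090 is the Prometheus port mentioned in the post.
PROM_PORT="${PROM_PORT:-9090}"

close_prometheus_port() {
    # Drop the rule that opened the port (assumes it was a plain allow rule).
    sudo ufw delete allow "${PROM_PORT}/tcp"
    # Explicitly refuse inbound connections to the port.
    sudo ufw deny in "${PROM_PORT}/tcp"
    # Refuse outbound connections to remote port 9090 as well.
    sudo ufw deny out "${PROM_PORT}/tcp"
    # Print the resulting rule set for a quick sanity check.
    sudo ufw status verbose
}
```

Calling close_prometheus_port applies the rules. Keep in mind that the outbound rule only covers connections to remote port 9090, so it would not stop a miner from reaching its pool on some other port.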
After some consideration I decided to go for the kinda nuclear option and remove everything related to Prometheus from the system with the command:
sudo deluser --remove-home --remove-all-files prometheus
This worked, and continuous monitoring with htop showed that nothing suspicious was running anymore, and the load came back to the usual 0.1 - 0.2. But I still had not gotten to the root of the problem, and I started to think about the next steps. And so the holes in my administration process came to light.
Insufficient backups
First, a short recap. On that VPS I am running my blog, which is a static site served by nginx; Umami, a self-hosted analytics service running in Docker; Commento, a blog comments engine, also in Docker; and finally GoToSocial, my Fediverse instance, running as a systemd service with a SQLite database.
I have made the typical error of a beginner sysadmin: I do backups, but I have never tested a full recovery from them.
Every few weeks, mostly before larger updates, I take a full snapshot of the server using the Hetzner snapshot functionality. Additionally, I do nightly backups of the GoToSocial database and user data using BorgBase, as described in the GTS documentation. What I am not backing up are the Umami and Commento databases, and that is a glaring omission. I am not backing up my blog, because it is just a static site and I can deploy it from my Forgejo instance anytime I want.
So yeah, if I wanted to just kill this infected machine and redo the whole setup from scratch on a fresh VPS, I could at least easily restore my blog. GTS would be a different story: I think a restore would be possible with the backups I have, but again, I have never tried it, and that is a problem. And I would lose the comments and the analytics data, which would be painful, but I could live with it.
Therefore, instead of doing the right thing and just killing and recreating the server, I tried to save my VPS.
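Closing that gap could be as simple as dumping both databases into a directory that the nightly Borg job already picks up. The sketch below is an assumption, not my actual setup: it assumes both services keep their data in PostgreSQL containers, and BACKUP_DIR as well as the container, user and database names passed to it are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes PostgreSQL containers. BACKUP_DIR and every
# container, user and database name used with it are hypothetical.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/docker-db}"

# Build the dump file name from a service name ($1) and a date stamp ($2).
dump_path() {
    echo "$BACKUP_DIR/$1-$2.sql"
}

# Dump one database: $1 = container, $2 = database user, $3 = database name.
dump_postgres() {
    target="$(dump_path "$1" "$(date +%F)")"
    mkdir -p "$BACKUP_DIR" || return 1
    # Compress only when pg_dump succeeded, so a failed dump
    # is not mistaken for a good backup.
    docker exec "$1" pg_dump -U "$2" "$3" > "$target" && gzip -f "$target"
}
```

A nightly job would then call something like "dump_postgres umami-db umami umami" for each service before the Borg run. The function returns a non-zero status when the dump fails, so the job can notice it.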
Fixing the VPS
The breakthrough came when Louis from the Fediverse sent me a link to an issue reported in Umami. It said that the version I had been using (v2.19) was indeed vulnerable, and that the vulnerability was being used to run the xmrig crypto miner. At the same time I was running clamav, the Linux antivirus scanner, on the VPS, and it did find two infected files in /var/lib/docker/overlay/.
I shut down the containers, pulled the most recent version (v3.0.2) and started them again. I reran the clamav scan, and this time no infected files were reported.
As an additional line of defense, I put a Hetzner Firewall in front of my VPS, on top of the UFW firewall running on it, as suggested by Grzegorz.
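In shell terms the update and rescan look roughly like this. It is a sketch under assumptions: it presumes the containers are managed with Docker Compose and that the image tag in the compose file already points at the fixed release. COMPOSE_DIR is a hypothetical path and both function names are made up for the example.

```shell
#!/bin/sh
# Sketch only: assumes a Docker Compose project.
# COMPOSE_DIR is a hypothetical location of the compose file.
COMPOSE_DIR="${COMPOSE_DIR:-/opt/umami}"

# Stop the stack, fetch the newer images and start it again.
# The subshell keeps the cd from leaking into the calling shell.
update_containers() {
    (
        cd "$COMPOSE_DIR" || exit 1
        docker compose down &&
        docker compose pull &&
        docker compose up -d
    )
}

# Rescan Docker's data directory:
# -r scans recursively, -i prints only infected files.
rescan_docker_data() {
    sudo clamscan -r -i /var/lib/docker
}
```

Running update_containers followed by rescan_docker_data repeats the two steps. A clean scan only says that clamav no longer recognizes anything in those files, not that the machine is guaranteed to be clean.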
Current situation
I have been monitoring my VPS closely for the last two days, and for now everything seems to be OK: the load is low, there are no suspicious processes visible in htop, and several subsequent runs of clamav have shown no issues. I still have not restored Prometheus; I’m leaving that for the time when I have more time (what a great sentence).
Future plans
This whole situation has shown all the things that I have been doing wrong with administering my server. I’m feeling bad because of it, of course, but it is also motivation to step up my Linux admin game. What I will need to do now:
— add the missing things (docker container DBs) to my nightly backup solution
— test my backups!!!
— add more (that means, any) automation so that I can quickly restore my VPS from scratch if a need ever occurs
— add alerting for prolonged high CPU load.
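For the last item, even a tiny cron job would have told me about the miner much earlier. A minimal sketch, assuming the 15-minute load average is a good enough stand-in for "prolonged" and that 4.0 is a sensible limit for this machine. THRESHOLD, LOADAVG_FILE and the function names are all invented for the example, and the actual notification (mail, webhook, whatever) is left out.

```shell
#!/bin/sh
# Sketch only: THRESHOLD and LOADAVG_FILE are assumed settings,
# and sending the notification is left to the caller.
THRESHOLD="${THRESHOLD:-4.0}"
LOADAVG_FILE="${LOADAVG_FILE:-/proc/loadavg}"

# Succeeds when load ($1) is strictly greater than the limit ($2).
# awk does the comparison because sh has no floating point arithmetic.
load_exceeds() {
    awk -v load="$1" -v limit="$2" 'BEGIN { exit !(load + 0 > limit + 0) }'
}

# The third field of /proc/loadavg is the 15-minute load average.
current_load() {
    cut -d ' ' -f 3 "$LOADAVG_FILE"
}

# Print a status line; return 1 when the load is above the threshold.
check_load() {
    load="$(current_load)"
    if load_exceeds "$load" "$THRESHOLD"; then
        echo "ALERT: 15-minute load $load is above $THRESHOLD"
        return 1
    fi
    echo "OK: 15-minute load $load"
}
```

Calling check_load every few minutes from cron and forwarding the ALERT line to some notification channel would cover the case from this post, where the load sat above 5 for an hour instead of the usual 0.1 - 0.2.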
Bottom Line
And just a few days ago I wrote in the Fediverse that my servers Just Work and I am happy with them xD The whole situation has been frustrating for me, but I guess it’s for the better: now I know very clearly what I need to fix and what new skills I need to acquire. All in all, it was just a random trojan and not something really, really serious.
Many thanks to Louis, Agnieszka and Grzegorz for the feedback and support!
Here’s the Fedi thread that I was writing as things were unfolding.
Thanks for reading!
