Anyone else seeing CPU at 90 percent plus after 1.2.5?
Started by Alledegly_noderunner 85b0a24d2efd401f...
Both my nodes are using almost max CPU. Memory is fine, maybe 160 connections. Nothing else seems wrong, just that.
CPU usage seems to be proportional to total connections in my experience, so the 160 connections probably aren't helping. However, the changelog for the latest update says it "Significantly improved transport logic for path request and announce handling", so that probably added extra overhead. Whether that extra overhead is from a bug or just normal functionality is hard to tell. My own node (which is a transport node) is perfectly fine, but I don't have any public backbone interfaces on it, so it might just be a connections thing.
On rns.sofia with Backbone TCP I have a median of 4.7% RNSD CPU load with 4h57min of uptime since the update. The machine is very calm and runs cool. :)
Yeah, it's strange. I got an email about CPU usage while I was asleep, and when I checked hours later the CPU was still at 90+ percent. I restarted the container and it was still going high. Checked my other VPS and it was the same; rebooted that one and still high. I've left them both for a couple of hours and the first one has now gone back to normal. Maybe it's stamps after a reboot, or it chills out once the announce rate limit kicks in for the majority of traffic? I'm just guessing.
Also seeing only around 10-20% CPU load on a very small single-core VM with around 400 clients on a BackboneInterface listener.
Are you seeing anything in the logs? I need to know a bit more about your setup to figure this one out. Are you using Backbone or TCPServer interfaces? What OS?
What version were you on before updating to 1.2.5? If it was significantly older, there could be a lot of initial cleanup running on old cached data. The handling of this was improved significantly recently, but if it had 100k+ cached destinations and announces (as some long-running nodes would have), the initial restructuring and cleanup could take quite a bit of time on very slow machines/VMs. A lot of that work is I/O bound, and especially on cheap VMs, I/O throughput can be quite bad. If you're then also running a container inside the VM, the throughput just tanks even further. That could explain what you were seeing, but hard to guess exactly without further info. What are the specs and setup roughly of how you're running it? Container inside hosted VM?
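If you want to sanity-check that, you could look at how much cached data the node is actually carrying, and whether the disk is the bottleneck while it churns. Something like this, assuming the default ~/.reticulum directory (adjust the path if your container mounts it elsewhere):

```
# Size of the node's stored/cached data
du -sh ~/.reticulum/storage

# Watch disk utilisation while rnsd is busy; consistently high
# %iowait/%util points at slow I/O rather than actual compute
iostat -x 5
```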
Do you have interface discovery broadcasts enabled? If you do, you will see CPU spikes just as the node starts up and computes discovery broadcast stamps for the interfaces that have discovery enabled, but depending on the target stamp value, that should subside within a few seconds to a minute or so. If you do have it enabled, what LXMF version is installed?
Are they both back to normal now?
If they're still doing it, try kicking the loglevel up to debug and observe live as it starts up; there's probably a hint in there.
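In case it helps, that should be something like this, assuming the default config location (on the 0-7 log scale, 6 is debug and 7 is extreme):

```
# In ~/.reticulum/config:
[logging]
  loglevel = 6
```

Or, if memory serves, just launch the daemon with extra verbosity flags, something like rnsd -vvv.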
Also, CPU usage shouldn't really be significantly affected by having many connections, unless of course they are actually doing something. Unless you're using TCPServerInterface - don't do that for large public nodes. Backbone is much more efficient; it only uses one thread and super fast epoll I/O no matter how many connections you have.
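For reference, a Backbone listener section in the config looks something like this (the section name and port here are just examples):

```
# In ~/.reticulum/config, in place of a TCPServerInterface section:
[[Backbone Listener]]
  type = BackboneInterface
  listen_on = 0.0.0.0
  port = 4242
```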
My big public node also seems fine; the pattern of CPU usage looks pretty similar to before I updated to 1.2.5. Sometimes a single core will get maxed out when someone is transferring a lot of data, and LXMF will spike all of the cores when generating propagation peering stamps (actually all of the cores minus one, but that's a custom patch I deploy on my node to make things run more smoothly).
Yeah, I really should make the core count configurable there :)
The second VPS was pinned for a few more hours but is back to normal now. I don't think I would have noticed if my provider hadn't sent that email about CPU usage; everything was responding fine. It has to be above 80 percent for an hour to trigger a notification via email. It was 1.2.4 to 1.2.5; I built them both from scripts I pull in from git, so in theory they are identical. I have another node on an OpenWRT box which can handle 500 or so connections without any sort of drama on 1.2.4. I watched it very closely for a few weeks before standing up the VPS, and it lived through the RU CGNAT drama like a champ. I will look through the logs today and see if I can spot any offenders. I know that as soon as I opened a port that was not discoverable, I got a connection attempt from Alibaba servers in the US, so not sure what's going on.
Yeah, I would say they're in a VM, KVM or something on Intel, Ubuntu 24.04 LTS, headless, and then one Docker container. I want to try Armbian on some new hardware to replace my OpenWRT node, and then I'm hoping to pull in the same script as on the VPS. I'm pretty new to using a VPS, but I'm getting there.
Thanks noderunner. Yeah, that is a bit of a weird situation then. Let me know if you find anything; I'm interested to know what went down there, if possible.