cross-posted from: https://sopuli.xyz/post/34381286
I’ve been having issues with my homelab ever since I set it up a few months ago. For some reason the server becomes unresponsive, as if it were offline. However, when I access its CLI, it spews out this message continuously.
I’ve tried entering commands directly into the CLI, but it shows an ‘input/output’ error instead. I cannot even get it to shut down through the CLI, so I have to manually pull the plug.
Here’s another screenshot of the logs in the CLI a few moments after the error occurred.
The issue is not even fixed when I switch it off and on again. Sometimes the homelab gets stuck indefinitely on the startup loading screen, fails to detect the system partition during the GRUB stage, crashes the Linux kernel, or refuses to boot altogether. It is only mitigated when I leave the homelab switched off for 5 minutes or so.
The weird thing is that there is no way to predict when this error will come up. On some occasions the server works completely unhindered for a few weeks straight, and on others it breaks down just a few minutes after startup. It doesn’t depend on what type of services I am hosting, all of which are lightweight.
Additionally, once it does start working again, there seems to be no record of the error in the logs, apart from the number of unsafe shutdowns. This, coupled with the fact that its occurrence is random, makes it difficult to debug or even document the matter. I’ve tried running several diagnostic tools, including smartctl, but I am unable to deduce anything useful from them.
Some specs and info about the homelab are as follows:
- Build: Pre-built Compact Mini PC
- CPU: Intel i7-14700
- RAM: 16GB
- Storage: 1TB SSD
- GPU: Integrated Intel UHD Graphics 770
- Operating System: Ubuntu 24.04 LTS
I would really appreciate it if you could point out the cause of this issue. This experience makes the server unreliable, which is why I don’t feel comfortable hosting anything valuable or sensitive on it yet. I can provide additional details or logs if required.
If GRUB is having problems too, not just Linux, I’d be inclined to blame hardware of some sort. Do you have another stick of NVMe that you can swap in, see if that makes the issue magically go away? Maybe run off a USB drive, see what happens?
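Before swapping the drive, it might be worth pulling the drive's own health counters in a form that is easy to compare between crashes. Here is a rough Python sketch that reads `smartctl -j -a` output and picks out a few fields. The JSON key names are what smartmontools 7.x emits for NVMe drives as far as I know, so treat them as assumptions and check them against your own output. The choice of which counters to flag is mine, and `/dev/nvme0` in the comment is just a placeholder.

```python
import json
import subprocess

# Assumed key names from smartctl's JSON output for NVMe drives
# (smartmontools 7.x); verify against your own `smartctl -j -a` output.
NVME_LOG_KEY = "nvme_smart_health_information_log"
WATCHED_FIELDS = (
    "critical_warning",
    "media_errors",
    "num_err_log_entries",
    "unsafe_shutdowns",
    "percentage_used",
    "available_spare",
)


def summarize_nvme_health(report: dict) -> dict:
    """Pick the watched counters out of a parsed smartctl JSON report."""
    log = report.get(NVME_LOG_KEY, {})
    # Missing fields come back as None rather than raising.
    return {field: log.get(field) for field in WATCHED_FIELDS}


def suspicious_fields(summary: dict) -> list:
    """Return the names of counters that hint at a drive problem.

    Only critical_warning and media_errors are flagged here; that
    selection is an illustrative choice, not an official threshold.
    """
    return [
        field
        for field in ("critical_warning", "media_errors")
        if summary.get(field)  # non-zero and not None
    ]


def read_smartctl(device: str) -> dict:
    """Run smartctl with JSON output for one device (usually needs root)."""
    result = subprocess.run(
        ["smartctl", "-j", "-a", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes as status bits
    )
    return json.loads(result.stdout)


# Example (placeholder device path):
#   summary = summarize_nvme_health(read_smartctl("/dev/nvme0"))
#   print(summary, suspicious_fields(summary))
```

A clean result here wouldn’t clear the drive, though. If it is dropping off the bus entirely, it may have no chance to record anything except another unsafe shutdown, which matches what you describe.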
Maybe less likely, but that processor is a 14th gen Intel desktop processor, one of the models affected by the voltage degradation problems. I burned up both a 13th gen and 14th gen processor myself. Looked like a variety of random errors, often related to memory, eventually not even managing to get through boot unless I disabled all but one of my cores. Might look into that. I assume that there’s a potentially-affected serial number range list somewhere.
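One quick thing you can check from a running system is which microcode revision the CPU is on, since Intel’s mitigation for this was shipped as microcode updates through BIOS releases. Here is a small Python sketch that parses `/proc/cpuinfo` for it. The `0x12B` threshold is my recollection of the revision Intel pointed to for the fix, so treat it as an assumption and confirm it against Intel’s and your vendor’s current guidance.

```python
import re

# Assumption: 0x12B is the minimum revision with Intel's mitigation.
# Confirm against Intel's and your BIOS vendor's current guidance.
ASSUMED_MINIMUM_REVISION = 0x12B

_MICROCODE_LINE = re.compile(
    r"^microcode\s*:\s*(0x[0-9a-fA-F]+)\s*$", re.MULTILINE
)


def microcode_revisions(cpuinfo_text: str) -> set:
    """Collect the distinct microcode revisions listed in cpuinfo text."""
    return {
        int(match.group(1), 16)
        for match in _MICROCODE_LINE.finditer(cpuinfo_text)
    }


def older_than(revisions: set, minimum: int) -> bool:
    """True if any reported revision is below the given minimum."""
    return any(rev < minimum for rev in revisions)


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read the kernel's CPU description (x86 Linux)."""
    with open(path) as handle:
        return handle.read()


# Example:
#   revs = microcode_revisions(read_cpuinfo())
#   print([hex(r) for r in revs], older_than(revs, ASSUMED_MINIMUM_REVISION))
```

Keep in mind that newer microcode is only meant to prevent further degradation. As far as I know it doesn’t repair a chip that is already damaged, so an up-to-date revision doesn’t rule the CPU out.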
And you can run memtest86 to bang on the memory and CPU, see if anything comes up. If it runs into errors, then it probably isn’t the NVMe at fault.