I have recently built a new PC, to be used as a server. For months now, I have been getting unexplained crashes, sometimes after a few minutes, sometimes after a few days, where the PC just reboots without any trace in the logs. Just normal occasional status logs, and then, a few seconds later, the log of a normal boot process.

This is slowly driving me crazy because I just can’t make out the issue. I have tried multiple different Linux installs, swapped out the SSD and PSU, and ran a RAM test, but this behaviour still persists.

Today something was different. Instead of rebooting, it showed me this blue screen, this time finally with a log. But I still can’t seem to make out the issues. Some quick internet searches show some very vague answers; everything from software to hardware, and psu to CPU.

Can any Linux wizard help me fix my problem? Link to the log

Update: I have now faced an even weirder issue. I booted up, installed cpupower like a comment suggested, installed man to look up its documentation and then the screen froze, and I was forced to reboot the PC by pressing the power button for 3s. When I booted back up, my bash history was reset to a state from a few days back (~/.bash_history mod time from 2 days ago), even though I had rebooted several times since then and had never had any persistence errors like this. man was also not installed anymore. Even weirder, cpupower was still installed. So it seems like some data was saved, while other files were discarded. I will now use a second SSD and try to replicate this. I now suspect some kind of storage issue, even though the two SSDs in question have never caused issues in my laptop. This seems scary; I have never witnessed such a weirdly corrupted Linux install, ever.

  • HiddenLayer555@lemmy.ml
    4 hours ago

    Off topic, but is it safe to share what I’m assuming is a stack trace/debug info QR code? Does it have any potentially sensitive data?

  • Gyroplast@pawb.social
    19 hours ago

    Screen freezes should also leave traces in your syslog, if they’re caused by any panic or GPU driver issue. You might want to check if your system is still accessible via SSH, if only the screen froze, and try killing X from there, if switching to text VTs doesn’t work. SysRq might become helpful, too.
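
    As a rough sketch of what that check could look like on a systemd machine: the previous boot’s kernel log only exists if the journal is persistent (an assumption here), and `prev_boot_kernel_log` / `crash_hints` are just illustrative helper names. The pattern list is neither complete nor authoritative.

    ```shell
    # Kernel messages from the previous boot. Assumes a persistent journal
    # (/var/log/journal exists); otherwise there is no earlier boot to show.
    prev_boot_kernel_log() {
        journalctl -k -b -1 --no-pager "$@"
    }

    # Keep only lines that commonly accompany a panic, oops or lockup.
    # The pattern list is a starting point and will produce false positives.
    crash_hints() {
        grep -iE 'panic|oops|BUG:|hung task|lockup|mce:|hardware error'
    }

    # Usage: prev_boot_kernel_log | crash_hints
    # SysRq state: 0 = disabled, 1 = all functions enabled, >1 = bitmask.
    # cat /proc/sys/kernel/sysrq
    ```

    If the machine died before the journal was flushed, the last few seconds may simply be missing, so an empty result does not prove that nothing was logged.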

  • Gyroplast@pawb.social
    19 hours ago

    screen froze, and I was forced to reboot the PC by pressing the power button for 3s

    seems like some data was saved, while other files were discarded

    I would not worry too much about a somehow “forgetful” file system immediately after a hard power cycle. This is exactly what happens if data could not be flushed to disk. Thanks to journaling, your FS does not get corrupted, but data lingering in caches is still lost and discarded on fsck, to retain a consistent fs. I would recommend repeating the installations you did before the crash, and maybe shoving a manual sync behind them, to make sure you don’t encounter totally weird “bugs” with man later, when you don’t remember this as a cause anymore. Your bash history is saved to file on clean shell exit only, and is generally a bit non-intuitive, especially with multiple interactive shells in parallel, so I would personally disregard the old .bash_history file as “not a fault, only confusing” and let that rest, too.

    Starting a long SMART self-test and keeping a keen eye on the drive’s error logs (smartctl -l error <drive>), or better yet, all available SMART info (-x), to see if anything seems fishy with your drive is a good idea anyway. Keep in mind that your mainboard / drive controller or its connection may just as well be (intermittently) faulty. In ye olden times, a defective disk cable or socket messed up my system once or twice. You will see particular faults in your syslog, though; this is not invisible. A failing drive or connection does not produce only a kernel panic; there will be some sprinkling of I/O errors as well. If your drive is SMART-OK but you clearly get disk I/O errors, it’s time to inspect and clean the SSD socket and contacts and re-seat the drive once more. If you never saw any disk I/O errors, and your disk’s logs are clean, I’d consider the SSD as not an issue.
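
    Spelled out, the drive checks could look roughly like this. /dev/sdX is a placeholder, the function names are made up for illustration, and the grep pattern only covers a few common kernel messages, so treat it as a sketch rather than a complete check.

    ```shell
    # Placeholder device; replace with your actual drive (see lsblk).
    drive=${drive:-/dev/sdX}

    # Start a long self-test. It runs inside the drive, so this returns at once.
    smart_start_long() { smartctl -t long "$1"; }

    # Afterwards: self-test results, the error log, and all other SMART info.
    smart_show() {
        smartctl -l selftest "$1"
        smartctl -l error "$1"
        smartctl -x "$1"
    }

    # Filter kernel log lines that point at storage trouble (reads stdin).
    io_error_lines() {
        grep -iE 'I/O error|failed command|medium error|nvme.*timeout'
    }

    # Usage (as root):
    #   smart_start_long "$drive"; smart_show "$drive"
    #   journalctl -k --no-pager | io_error_lines
    ```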

    If you encounter random kernel panics, random as in “in different and unrelated call stacks that do not make sense in any other way”, I agree that RAM is a likely culprit, or an electrical fault somewhere on the mainboard. It’s rare, but it happens. If you can, replace (only) the mainboard, or better yet, take a working PC with compatible parts, and replace the working MBO with your suspected broken one to see if the previously working machine now faults. “Carrying the fault with you” is easier/quicker than proving an intermittent fault gone.

    Unless you get different kernel panics, my money’s still on your c-states handling. I’d prefer the lowest level you can find to inhibit your CPUs from going to sleep, i.e. BIOS > kernel boot args > sysctl > cpupower, to keep the stack thin. If that is finicky somehow, you could alternatively boot with a single CPU and leave the rest disabled (bootarg nosmp). The point is just to find out where to focus your attention, not to keep this as a long-term workaround.

    To keep N CPUs running, I usually just background N infinite loops in bash:

    $ cpus=4; for i in $(seq 1 $cpus); do { while true; do true; done; } & done 
    [1] 7185
    [2] 7186
    [3] 7187
    [4] 7188
    

    In your case you might change that to:

    cpus=4; for i in $(seq 0 $((cpus - 1))); do { taskset -c $i bash -c 'while true; do sleep 1; done'; } & done
    

    To just kick each CPU every second, it does not have to be stressed. The taskset will bind each loop to one CPU, to prevent the system from cleverly distributing the tiny load. This could also become a terrible, terrible workaround to keep running if all else fails. :)
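
    If you leave those loops running in the background, it helps to be able to stop them again. A minimal sketch, with `start_kickers` / `stop_kickers` as made-up helper names, assuming `taskset` and `nproc` are available and that CPU indices 0..N-1 are all usable on your machine:

    ```shell
    # PIDs of the started loops, so they can be stopped again later.
    kicker_pids=""

    # Start one loop per CPU index 0..N-1; each wakes up once per second.
    # Without an argument, N defaults to the number of available CPUs.
    start_kickers() {
        cpus=${1:-$(nproc)}
        i=0
        while [ "$i" -lt "$cpus" ]; do
            taskset -c "$i" sh -c 'while true; do sleep 1; done' &
            kicker_pids="$kicker_pids $!"
            i=$((i + 1))
        done
    }

    # Stop all loops started by start_kickers.
    stop_kickers() {
        for pid in $kicker_pids; do
            kill "$pid" 2>/dev/null
        done
        kicker_pids=""
    }
    ```

    Call `start_kickers` once after boot and `stop_kickers` when you want the CPUs to idle normally again. The PIDs only live in the current shell, so both calls have to happen in the same session.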

  • Gyroplast@pawb.social
    1 day ago

    Looking at the call trace:

    [ 1641.073507] RIP: 0010:rb_erase+0x199/0x3b0
    ...
    [ 1641.073601] Call Trace:
    [ 1641.073608]  <TASK>
    [ 1641.073615]  timerqueue_del+0x2e/0x50
    [ 1641.073632]  tmigr_update_events+0x1b5/0x340
    [ 1641.073650]  tmigr_inactive_up+0x84/0x120
    [ 1641.073663]  tmigr_cpu_deactivate+0xc2/0x190
    [ 1641.073680]  __get_next_timer_interrupt+0x1c2/0x2e0
    [ 1641.073698]  tick_nohz_stop_tick+0x5f/0x230
    [ 1641.073714]  tick_nohz_idle_stop_tick+0x70/0xd0
    [ 1641.073728]  do_idle+0x19f/0x210
    [ 1641.073745]  cpu_startup_entry+0x29/0x30
    [ 1641.073757]  start_secondary+0x11e/0x140
    [ 1641.073768]  common_startup_64+0x13e/0x141
    [ 1641.073794]  </TASK>
    

    What’s happening here leading up to the panic is start_secondary followed by cpu_startup_entry, eventually ending up in CPU idle time management (tmigr), giving a context of “waking up/sleeping an idle CPU”. I’ve had a few systems in my life where somewhat aggressive power-saving settings in the BIOS were not cleanly communicated to Linux, so to say, causing such issues.

    This area is notorious for being subtly borked, but you can test this hypothesis easily by either disabling a setting akin to “Global C States” in your BIOS, which effectively disables power-saving for your CPUs, or try an equivalent setting of the kernel arguments processor.max_cstate=1 intel_idle.max_cstate=0, or even a cpuidle.off=1.
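
    To check which of these arguments the running kernel was actually booted with, something like the following could work. `cstate_args` is a hypothetical helper name, and the GRUB steps in the comments assume GRUB is your bootloader, which may not be the case on your install.

    ```shell
    # Print the C-state related arguments found in a kernel command line (stdin).
    cstate_args() {
        tr ' ' '\n' | grep -E '^(processor\.max_cstate|intel_idle\.max_cstate|cpuidle\.off)='
    }

    # Usage: cstate_args < /proc/cmdline
    # No output means none of the three arguments is set.
    #
    # With GRUB (an assumption), the arguments usually go into
    # GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by:
    #   grub-mkconfig -o /boot/grub/grub.cfg
    ```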

    This obviously costs you the power-saving capability of the CPUs, but if your system runs stable that way, you’re likely in the right ballpark and can look for a specific solution for that issue, possibly in a BIOS/firmware update. Here’s a not too shabby gist roughly explaining what c-states are. Don’t read too many of the comments, they’re more confusing than enlightening.

    The kernel docs I linked to above are comprehensive, and utterly indecipherable for a layperson. Instead of fumbling about in sysfs, try the cpupower tool/package to visualize the CPU idle settings, and try increasing enabled idle states until your system crashes again, to find out if a specific (deep) sleep state triggers your issue, and disable exactly that if you cannot find a bugfix/BIOS update.
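
    For reference, a sketch of what that could look like. The `cpupower` subcommands in the comments are the ones I’d reach for; `list_idle_states` is an illustrative helper that reads the same information from sysfs, assuming your kernel exposes the cpuidle directory at the usual path. The state number 2 is only an example.

    ```shell
    # cpupower idle-info            # show available idle states and their status
    # sudo cpupower idle-set -d 2   # disable idle state number 2 on all CPUs
    # sudo cpupower idle-set -e 2   # enable it again
    # sudo cpupower idle-set -E     # enable all idle states

    # List the idle states of one CPU from sysfs: index, name, disable flag.
    list_idle_states() {
        base=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
        for s in "$base"/state*; do
            [ -r "$s/name" ] || continue
            printf '%s %s disabled=%s\n' "${s##*/}" "$(cat "$s/name")" "$(cat "$s/disable")"
        done
    }
    ```

    Settings made with `cpupower idle-set` do not survive a reboot, which is convenient for experimenting: a crash puts you back at the defaults.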

    If this is your problem, to reproduce the panic, try leaving your system as idle as possible after bootup. If a panic happens regularly that way, try starting processes exercising all your CPUs - if the hypothesis holds, this should not panic at any time, as no CPU is ever idle.

    • Molecular5869@feddit.orgOP
      1 day ago

      Thanks, please check my updated post. I have disabled the relevant setting in my BIOS, installed cpupower and increased the idle state to the maximum value of 2. I have also tried states 0 & 1. Do I need to run the machine for longer or should it have crashed right away according to your hypothesis? I also can’t tell you if the BIOS setting already fixed my issue since I still can’t reproduce it.

      About your last paragraph, the system has had these issues mostly while idle, but that’s probably because my system is running idle most of the time anyway. I have also had the issue during low to medium loads, like transcoding audio via jellyfin. But I haven’t methodically run a process on all CPUs. How would I go about running a load that uses all cores? I don’t particularly want to run a stress test for hours (because loud), but at this time I’m really open to trying anything.

      Some time ago, I also enabled an option in my BIOS that generates a dummy load, because some forum post had suggested that a PSU issue could be at fault for unexplained reboots. I have a 500W PSU that is way overkill for my components, and some users suggested that some PSUs can turn off when the load is too low. The option did not fix my problem. I have since connected a weaker 220W PSU, which also didn’t help.

      • Gyroplast@pawb.social
        12 hours ago

        Do I need to run the machine for longer or should it have crashed right away according to your hypothesis?

        Sorry for muddying the waters with my verbosity. It should not crash anymore. I believe your kernel panic was caused when an idle CPU 6 was sent to sleep. Disabling C-states, or limiting them to C0 or C1, prevents your CPUs from going into (deep) sleep. Thus, by disabling or limiting c-states, a kernel panic should not happen anymore.

        I haven’t found a way to explicitly put a core into a specific c-state of your choosing, so best I can recommend now is to keep your c-states disabled or limited to C1, and just normally use your computer. If this kernel panic shows up again, and you’re sure your c-state setting was effective, then I would consider my c-state hypothesis as falsified.
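
        One way to check that the setting was effective, sketched under the assumption that your kernel exposes per-state `usage` counters under the cpuidle sysfs directory: take two snapshots a minute apart and compare them. If the counters of the deeper states do not move, the CPUs are not entering those states. `idle_usage` is a made-up helper name.

        ```shell
        # Print "cpuN stateM name usage" for every CPU and idle state.
        idle_usage() {
            root=${1:-/sys/devices/system/cpu}
            for s in "$root"/cpu[0-9]*/cpuidle/state*; do
                [ -r "$s/usage" ] || continue
                cpu=${s%/cpuidle/*}
                printf '%s %s %s %s\n' "${cpu##*/}" "${s##*/}" "$(cat "$s/name")" "$(cat "$s/usage")"
            done
        }

        # Usage: idle_usage > before; sleep 60; idle_usage > after; diff before after
        ```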

        If, however, your system runs normally for a few days, or “long enough for you to feel good about it” with disabled c-states, that would be a strong indication for having some kind of issue when entering deeper sleep modes. You may then try increasing the c-state limit again until your system becomes unstable. Then you know at least a workaround at the cost of some loss of power savings, and you can try to find specific issues with your CPU or mainboard concerning the faulty sleep mode on Linux.

        Best of luck!

        • Molecular5869@feddit.orgOP
          11 hours ago

          Thank you very much for your help so far, I will test the different methods and settings suggested in this thread over the next few weeks. I probably won’t find the time or motivation to methodically figure out the specific issue. That means that if at some point my system seems stable again, I will just leave everything as is and try to just be happy about it.

          But when my life gets less busy I’ll maybe have time to see this completely through.

          Anyways thanks to everyone, especially you, for taking the time to help me. I will update this post should I ever figure it out.

      • Ænima@lemm.ee
        18 hours ago

        Just my two cents as someone who does this a lot myself: only change one thing at a time when testing troubleshooting suggestions. I know the reply suggested a few things in succession, but that was showing progressive steps to confirm and identify the underlying cause. Doing them all at once fails to correctly identify the root cause at best, and at worst may have introduced new problems.

        I say this again, as someone who notoriously does this all the time. It’s a time-saver reflex, but one that will bite you in the ass eventually.

        • Molecular5869@feddit.orgOP
          17 hours ago

          Yes, I went too fast because I have been sitting on this for months now. Normally I would only change one thing at a time, but with this situation it can take anywhere from 5 minutes to multiple days to test one single thing. If it doesn’t crash for 48 hours, it might be because I fixed the issue, or it might just be a coincidence and it will crash in hour 49 ¯_(ツ)_/¯.

          But you’re right, I will attempt it the right way when I find the time, even though it will probably take weeks 😮‍💨.

          • Ænima@lemm.ee
            17 hours ago

            I know it sucks, but I’m glad you seem to have corrected the problem. As someone who does more harm than good with Linux systems myself, fixing a Linux issue without completely reinstalling the OS is impressive, and you should be proud to have accomplished such a feat!

  • bazsy@lemmy.world
    1 day ago

    What troubleshooting steps did you take so far? I would try these:

    • different OS, maybe a live usb running fedora or ubuntu if it is possible to emulate the workload where this appears
    • bios reset to defaults, no OC not even XMP
    • memtest, either the memtest86+ boot iso or the runtime memtester can detect obvious errors
    • long smart self test on OS drive and an fsck or scrub based on FS

    Also the logs show a very old nvidia gpu which is not supported by the new driver. I don’t know if this can cause crashes, haven’t used one in ages, maybe someone else has more insight.

    • Molecular5869@feddit.orgOP
      1 day ago

      Thanks. I have already tried the first three steps, and the same drives worked fine in another machine, so I don’t think the drive is at fault.

      About the GPU: I’m pretty sure the issue also happened before I connected the GPU.

      About simulating the workload:

      Because of these issues, I am not currently running anything of importance on this machine and it mostly idles. The sudden reboots don’t seem to be affected by workload.

  • MrPistachios@lemmy.today
    1 day ago

    I had issues with reboots on an old server and it turned out to be the memory, even though I didn’t find anything in the memtests. Maybe pull one stick out and try, and if it still happens, swap in the other stick and try again.

    • Molecular5869@feddit.orgOP
      1 day ago

      Thanks. I have run memtests and they all passed, so I thought I had ruled out the RAM. Now I will try your suggestion. It will likely take me several days to come to any conclusion, because I need to change only one thing at a time and then hope it stops happening, and I will only know for sure once the server has been running for, say, more than a week straight.

      • Gyroplast@pawb.social
        19 hours ago

        No, it is not. There is an issue with the installed GPU not being supported by the initializing driver, but this is entirely irrelevant for the reported fault and panic happening more than 1600 seconds later.

        Or would you argue the NIC is 100% the issue, because r8169 0000:04:00.0 enp4s0: Link is Down is literally right in the logs?

        • chonkyninja@lemmy.world
          3 hours ago

          I’ll bet $10 on the Nvidia drivers, OP is running 6.14.4 as am I, and the Nvidia drivers have a whole bunch of issues and require special patches to remove deprecated api calls.

          Also, for this kernel you need the Nvidia Open driver.

  • just_another_person@lemmy.world
    1 day ago

    Have you been keeping an eye on the CPU temps? What’s the usual workload of this machine at any given time?

    Also, any metrics over time would be useful. Maybe run Prometheus and Grafana.
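
    Short of a full Prometheus setup, even a tiny logger can help here. A sketch, assuming the temperatures show up under /sys/class/thermal (on some boards they only appear under hwmon instead); `log_temps` and the log path are made up for illustration:

    ```shell
    # Print one line per thermal zone: timestamp, zone, temperature in °C.
    # The kernel reports the value in millidegrees Celsius.
    log_temps() {
        base=${1:-/sys/class/thermal}
        for z in "$base"/thermal_zone*; do
            [ -r "$z/temp" ] || continue
            t=$(cat "$z/temp" 2>/dev/null) || continue
            printf '%s %s %s\n' "$(date +%FT%T)" "${z##*/}" "$((t / 1000))"
        done
    }

    # Usage: append a reading every 10 seconds and flush it to disk, so the
    # last lines before a sudden reboot have a chance to survive.
    # while true; do log_temps >> "$HOME/temps.log"; sync; sleep 10; done
    ```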

    • Molecular5869@feddit.orgOP
      1 day ago

      Thanks. I have not been actively keeping an eye on the temps, but I had already suspected an overheating issue, so I once ran a 15-minute stress test with temps never going beyond 60°C. Also, because the machine is so unreliable, it has been running idle for most of the time (archinstall standard with no additional packages installed). So I thought I had ruled this out, but I will look into running Grafana and/or Prometheus.

      • just_another_person@lemmy.world
        1 day ago

        If you already stressed it, I doubt that’s the issue.

        If you’re hitting regular segfaults, it’s almost always a CPU<>MEM issue. What’s your memory install like? (Single, dual, size, speed…etc)

        Have you also checked the memory timing config in your BIOS to make sure it matches what the manufacturer suggests with your CPU?

    • Molecular5869@feddit.orgOP
      1 day ago

      Thanks, I already ran memtest86+ multiple times and it passed every time. But another comment said that they had a similar issue where RAM was indeed the problem but memtest showed no errors. I will try removing one stick of memory to further rule out faulty RAM.

  • Molecular5869@feddit.orgOP
    1 day ago

    It’s of course also possible that this is a completely separate issue, but I would still like some tips. Please help me because I am about to buy an entirely new PC because this has been driving me nuts!!