Random application segfaults on Arch

Hi everyone,

Ever since I switched to Arch about two months ago, most applications have been segfaulting multiple times a day. There doesn’t seem to be any pattern to the crashes; sometimes they even happen while the system is idling (e.g. while I’m reading a news article).

Things I’ve tried without any luck so far:

  • Running Firefox in safe mode without any extensions
  • Switching from the regular to the LTS kernel
  • Disabling hardware acceleration in Firefox
  • Changing RAM speed and timings
  • Running Memtest (it passed without errors)
  • Replacing the entire RAM with a new certified kit
  • Using only a single RAM slot
  • Applying Ryzen fixes (iommu=soft, limiting C-states)
  • Using only a single CPU core (maxcpus=1)
  • Downgrading the Nvidia driver to 535xx
  • Using Nouveau instead of the nvidia driver
  • Using Openbox instead of KDE
  • Disabling zswap and THP
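
For anyone repeating the kernel-parameter steps, it may be worth confirming that options such as iommu=soft or maxcpus=1 were actually picked up after the reboot. The following is only a minimal sketch: it assumes the running kernel's command line is readable at /proc/cmdline, the helper names parse_cmdline and read_active_params are made up for illustration, and quoted values containing spaces are not handled.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_cmdline(text: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {parameter: value}.

    Flags without '=' (e.g. "quiet") map to None. Quoted values that
    contain spaces are not handled in this sketch.
    """
    params: Dict[str, Optional[str]] = {}
    for token in text.split():
        # Split on the first '=' only, so "root=UUID=abcd" keeps its value intact.
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_active_params(path: str = "/proc/cmdline") -> Dict[str, Optional[str]]:
    """Parse the command line the running kernel was booted with (assumed path)."""
    return parse_cmdline(Path(path).read_text())
```

Calling read_active_params().get("maxcpus") on the affected machine should then return "1" if the parameter took effect, or None if the bootloader entry was not updated.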

Here’s full journalctl from a day where both Spotify and Firefox crashed at the end, a few seconds after each other:

pastebin.com/BH0LMnD9

Some more info about my system:

  • Ryzen 5 3600X
  • MSI B450M PRO-VDH Max
  • 32GB RAM @ 3200MHz
  • Geforce RTX 2070 SUPER (using nvidia-dkms)
  • Plasma 5.27.10 on X11

I’m pretty sure that it’s not hardware related, because I’ve booted up a Debian 12 live image where everything ran for several hours without a crash. But it seems to be Arch related, as I also booted up a fresh EndeavourOS live image (so basically Arch), where applications also randomly segfaulted. Any idea why everything works fine on Debian but not on Arch? Debian uses the 6.1 kernel, which I already tried, so that’s not it.

Let me know if you need any more information that might help solve this issue. Thanks!

Edit [solved]: It looks like disabling PBO in the UEFI/BIOS did the trick. The strange thing is, after enabling it again, it’s still not crashing again. Someone suspected that the MoBo default/training settings were faulty, so I guess this was a very rare case here. That’s probably why it took so long to find a solution. Thanks everyone for helping me out!

SpaceCadet,

I’m pretty sure that it’s not hardware related

Random segfaulting is not something that “just happens” because of an OS misconfiguration, and if the same problem happens on Arch as well as on a clean EndeavourOS live image, that convinces me that it is in fact hardware related somehow. As you have already replaced the RAM, my guess is a CPU or motherboard issue.

Zen2/B450 is a widely used and well supported configuration on Linux that you normally shouldn’t have issues with, but Zen2 CPUs are rather notorious for having fragile memory controllers, and sometimes dodgy AGESA firmware releases that can cause issues on some CPUs. I used to have a 3600X myself that started crashing at idle around a particular firmware release of my motherboard, and it was fixed by a subsequent release.

BTW the fact that it doesn’t happen on Debian doesn’t necessarily mean that Arch is the culprit. It could just be that Debian is not triggering the fault because of different, perhaps more conservative, compiler optimizations.

As a last ditch effort, you could try resetting your entire UEFI (bios) settings to default, preferably by pulling the CMOS battery.

BTW, is it only GUI applications that are segfaulting? Or other programs as well? Do you have an old spare GPU you can test with?

vildis,

Could you try an older endeavour os image?

This sounds very much like a driver/firmware/hardware issue

Ludrol,

I would guess that this is a CPU/SSD issue, since you ran a live Debian image from a USB stick and did not encounter any crashes.

NoisyFlake,

But I also ran a live EndeavourOS from USB and the same crashes happened.

CameronDev,

Try increasing RAM voltage? Might make it more stable under load. I had a similar issue, clean memtest, but games would randomly crash. Increasing RAM voltage fixed it.

NoisyFlake,

What voltage should I try? It’s currently at 1.35V, and I’ve read somewhere that this is the highest “safe” voltage.

CameronDev, (edited )

I jumped to 1.4V, which afaik is safe, but I can’t guarantee anything. Going up slowly might be better, but stop at 1.4?

Corsair says 1.4 is safe: help.corsair.com/…/360052448851-Tips-on-safely-ov…

mmstick, (edited )

Make sure you have the latest firmware for your motherboard. This sounds like unstable voltages for memory, or an overly-aggressive PBO curve. Did you try disabling the XMP profile on the RAM, disabling PBO, and upping the voltages (within safe limits) of the SOC, DDR, and VDDP? You might find some useful info here[0] or here[1] if you intend to run your memory at 3200 MHz.

NoisyFlake,

Motherboard firmware is up-to-date, and I’ve already tried disabling XMP. I’ll give disabling PBO a try, thanks!

I don’t necessarily have to run at 3200MHz, if it means that the system is finally stable. But since it’s already crashing at the default 2133MHz, I suppose there’s no use in playing with the voltages?

mmstick, (edited )

It’s difficult to say with certainty what the issue is without trial and error. I would expect the motherboard’s manufacturer to make sure that their board can pass all tests with the standard JEDEC spec for DDR4 (2133 MHz).

Since you say that you’ve tried different RAM kits, another alternative could be the cleanliness of power from the power supply. Perhaps there is intermittent voltage droop, and you need to experiment with the Load Line Calibration settings to adjust for vdroop between idle and load. Disabling frequency boosting and manually setting the CPU frequency could help check if it’s related to that. PBO curves might be undervolting too much while idle.
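
To check from within Linux whether frequency boosting is currently active and what the cores are clocking at, something like the sketch below might help. It is only an illustration under assumptions: it expects the cpufreq sysfs layout under /sys/devices/system/cpu, with a global cpufreq/boost file (as exposed by the acpi-cpufreq driver) and per-core cpufreq/scaling_cur_freq files in kHz. The function name cpufreq_snapshot is made up. Note that this flag reflects the CPU's boost in general; PBO itself is a UEFI setting and is not shown here.

```python
from pathlib import Path


def cpufreq_snapshot(cpu_root="/sys/devices/system/cpu"):
    """Return (boost_enabled, {core_name: current_MHz}).

    boost_enabled is None when the driver exposes no global boost file.
    """
    root = Path(cpu_root)

    boost = None
    boost_file = root / "cpufreq" / "boost"  # assumed: acpi-cpufreq style flag
    if boost_file.exists():
        boost = boost_file.read_text().strip() == "1"

    freqs = {}
    for freq_file in root.glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        core = freq_file.parent.parent.name  # e.g. "cpu0"
        freqs[core] = int(freq_file.read_text()) / 1000  # kHz -> MHz
    return boost, freqs
```

Sampling this every few seconds while the machine idles could show whether crashes line up with cores dropping to their lowest clocks, which would fit the idle-undervolting theory.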

NoisyFlake,

I’m a bit speechless right now. I’ve disabled PBO and didn’t have a single crash since then, everything’s been running fine for hours. Just to make sure that this really was the issue, I’ve enabled PBO again - but still haven’t experienced any crashes in the last hours. I have no idea how simply disabling and then enabling the feature again fixed my issue, but for now it seems like all is well.

Do you have any explanation for this weird behavior?

Anyway, thank you very much for your suggestion, looks like this actually did the trick!

mmstick, (edited )

Sounds like voltage droop and/or a motherboard with faulty automatic “training” settings. I don’t recall if the Ryzen 3000 had custom PBO curves, but tweaking this can fix it. Alternatively, slightly upping LLC and the SOC and CPU voltages could help. That said, I’ve had my most stable overclock by disabling PBO entirely and using a manual CPU multiplier.

gbin,

The crashes are in the middle of browser code (both Firefox and the Chromium embedded in Spotify). If you try a simple mprime stress test (mprime-bin from the AUR), does it crash too?

cbarrick,

Yeah, this sounds somewhat like unstable hardware.

Definitely start with a stress test or memory test.

lemming741,

I had a 3700x that was doing that sort of thing. It seemed mostly random, but moving big files would crash it pretty often. It ran memtest86 for 3 days no problem. I replaced part by part, and it ended up being the CPU. I’d bought it second hand so it may have been abused.

vzq,

Can you enable core dumps and get stack traces? From there you should be able to figure out which shared library is broken.

NoisyFlake,

Uhm, isn’t that what can be found at the end of the journalctl log I posted? Or are you talking about something different?

avidamoeba, (edited )

Could be a defective library that’s used by many apps. Glibc, etc. That said, if something like this is that broken, others should be complaining about it too.

gbin,

One crash was in libxul and the other in libcef, so I doubt this is a specific lib.

avidamoeba,

Crashes on Arch, doesn’t crash on Debian:

Debian > Arch

Sanguine,

Not the point of this thread.

avidamoeba,

Of course.

zelifcam, (edited )

That is how you ask a question!

You’ve already addressed the few ideas I had. I’ll try to get a better look at the logs once I’m home.

Edit: what happens if you use the arch testing repos instead? Maybe there’s some software that’s been updated in the test repos, that’s currently behaving badly with your system and it’s just unfortunate timing?

NoisyFlake,

Hm, I’ve had this problem since my initial setup about 2-3 months ago. I think that if there were something wrong with the software in the repos, it would’ve been fixed by now, and I wouldn’t be the only one having this problem, right?

But of course, if you want I can give the testing repos a try :)

zelifcam,

2-3 months would certainly be enough time for a bad package to find its way out.

drwho,

Are you keeping an eye on system temperature?
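
If you want to log temperatures over time rather than glancing at a widget, a rough sketch along these lines could work. It assumes the kernel's hwmon sysfs interface under /sys/class/hwmon, where temp*_input files hold millidegrees Celsius and optional name / temp*_label files describe the chip and sensor (on Ryzen the CPU sensor usually comes from the k10temp driver). The function name read_hwmon_temps is made up for illustration.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/sensor": degrees Celsius} for every hwmon temperature input."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            # Prefer the human-readable label (e.g. "Tctl") when the driver provides one.
            label_file = sensor.with_name(sensor.name.replace("_input", "_label"))
            label = label_file.read_text().strip() if label_file.exists() else sensor.name
            try:
                # hwmon reports temperatures in millidegrees Celsius.
                temps[f"{chip_name}/{label}"] = int(sensor.read_text()) / 1000
            except (OSError, ValueError):
                continue  # skip sensors that cannot be read or return garbage
    return temps
```

Printing the result once a minute into a file would make it easy to see whether a crash coincided with a temperature spike.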

NoisyFlake,

Yeah, temperatures are usually between 40-50 °C, so that should be fine.

drwho,

Yeah, that should be fine.

Anything in the kernel message buffer? dmesg -T | less
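
If the buffer is long, tallying the segfault lines per process and faulting module can show quickly whether one library stands out. The sketch below is an illustration only: it assumes input in dmesg -T format and the usual x86 kernel message shape ("name[pid]: segfault at ... ip ... sp ... error N in module[...]"). The name count_segfaults is made up, and journalctl output would need its prefix stripped first.

```python
import re
from collections import Counter

# Assumed message shape, schematically:
#   "[timestamp] prog[123]: segfault at 0 ip ... sp ... error 4 in libfoo.so[...]"
# The process name may contain spaces, so it is matched up to the "[pid]" part.
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\[\]]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<module>[^\[\s]+)\[)?"
)


def count_segfaults(lines):
    """Tally segfaults per (process name, faulting module)."""
    tally = Counter()
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match:
            # The kernel omits the module when the address is in no mapped file.
            module = match["module"] or "?"
            tally[(match["comm"].strip(), module)] += 1
    return tally
```

Feeding it the lines of dmesg -T output (for example via sys.stdin) and printing tally.most_common() gives a quick overview. Crashes spread across many unrelated modules would point more toward hardware than toward one broken library.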

NoisyFlake,

I’m not sure, here’s the entire dmesg output: pastebin.com/MZfhB0xK
