Help w/ crash

Any advice on where to go from here? This console was running dmesg -w to try and catch an intermittent crash… And this is what I got. I am using an el cheapo USB wifi adapter that I’m suspicious of.

Everything was working fine until I rebuilt NixOS with Nvidia support… Now my old generations of the OS are crashing after a few minutes (display stays on, no response to input, keyboard lights don’t respond, SysRq doesn’t work).

stardreamer, (edited)

Look at the line with the asm_exc_invalid_op. That seems like a hardware fault caused by an invalid asm instruction to me. Either something wrong is being interpreted as an opcode (unlikely) or maybe the driver was compiled with extensions not available on the current machine.

OP, how old is your CPU? And how old is the NIC you are using?

Edit: did you use a custom driver for the NIC? I’m looking at the Linux source and rt_mutex_schedule does not exist. Never mind, I was checking 4.18 instead of 6.7; found it now. The bug is most likely inside a macro called preempt_disable(). Unfortunately most of the functions are heavily inlined and architecture-dependent, so you won’t get much out of it. But any preemption-related changes you made might well be causing the bug too.

mvirts,

It’s a 3770K… So super old? 😅 The USB NIC is this guy: CF-953AX a.aliexpress.com/_mNfj796

Maybe I should set up a config that doesn’t use a preemptible kernel for when I want faster wifi :P

Maybe this is my chance to actually fix something kernel related

Thanks for taking a look at this, your comments are super helpful.

stardreamer,

My suggestion would be to try compiling the kernel locally. It’s highly likely the one packaged in your distro was built with extensions that your CPU doesn’t have. Doing a local native compile should rule that out pretty quickly without having to disable any additional features.
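
Before committing to a full local build, a quick sanity check is to see which instruction-set extensions the CPU actually reports. The sketch below reads the `flags` line of `/proc/cpuinfo`; the list of extensions in `WANTED` is only an illustrative pick (it does not come from the crash log), and a missing flag only matters if the kernel or module was really built with that extension enabled.

```python
from pathlib import Path

# Illustrative selection of extensions to look for; adjust as needed.
WANTED = ("sse4_2", "avx", "avx2", "bmi2", "fma")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_extensions(flags, wanted=WANTED):
    """List the wanted extensions that the CPU does not advertise."""
    return [name for name in wanted if name not in flags]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        flags = parse_cpu_flags(cpuinfo.read_text())
        print("missing:", missing_extensions(flags) or "none of the checked ones")
    else:
        print("/proc/cpuinfo not found (not a Linux system?)")
```

If nothing relevant is missing, the extension theory becomes less likely and the preemption and driver angles are the better place to dig.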

mvirts,

Looks like dmesg isn’t being logged to disk… But I made my font smaller 😹 https://lemmy.world/pictrs/image/7d5b658b-f029-4cce-aeeb-ddcce9760426.jpeg

Definitely more to go on there. This happened while playing Minecraft with a small human, so I didn’t dig into it yet. I’m pretty sure the kernel I’m running was built by a derivation that applies some preempt patches, so I’ll start there. Ubuntu works fine with the adapter, but it’s also not a preemptible kernel.

Sorcaeden, (edited)

I don’t pretend to be an expert in this, and I have no idea what the state machine looks like for unauthenticated WiFi. My reading of the call stack is one of two things: either you were authenticated and the association with the AP dropped while a frame was being sent, and it puked; or it died while attempting to authenticate to an AP. I have no idea why a mutex would be taken, or on what, but apparently it timed out.

So why would this happen after a rebuild?

  1. freak accident/timing thing.
  2. I see multiple mt## modules loaded. Without having looked it up, I suspect they are driving a MediaTek chip in that dongle and are potentially conflicting.
  3. Lots of wifi devices I’ve seen recently load firmware separately from the driver, from /usr/lib (or lib64)/firmware. Maybe the firmware version changed from before and needs updating now, or you updated it before, or whatever.

I agree with others - I’d give you a fiver if it happens again without the adapter connected.
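
To see which mt## modules are loaded and how they relate to each other, something like the following could help. It is a rough sketch that parses `/proc/modules` (the same data `lsmod` shows); the `"mt"` prefix is a crude filter chosen for illustration, so it may also catch unrelated modules whose names happen to start with those letters.

```python
from pathlib import Path


def modules_with_prefix(proc_modules_text, prefix="mt"):
    """Map each loaded module whose name starts with `prefix` to the modules using it."""
    found = {}
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Expected fields: name, size, refcount, used-by list, state, address
        if len(fields) < 4 or not fields[0].startswith(prefix):
            continue
        users = [u for u in fields[3].split(",") if u and u != "-"]
        found[fields[0]] = users
    return found


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():
        for name, users in sorted(modules_with_prefix(proc_modules.read_text()).items()):
            print(f"{name}: used by {', '.join(users) or 'nothing'}")
    else:
        print("/proc/modules not found (not a Linux system?)")
```

Modules that use each other are normally parts of one driver stack. Two unrelated top-level drivers both claiming the dongle would be the more suspicious pattern.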

mvirts,

I think you’re right, it is a MediaTek chip and I used to add the USB device ID manually to load the module, but with NixOS 23.11 it started working automatically. I’m also running a preemptible kernel… Probably related now that I think about it :P

I should track down the firmware, that was one of the things I was looking into when setting up the device id hack.

I think this happened once before after uptime of about a week… But I didn’t get any information from that crash. Also, I’m remembering that some configurations were failing to see this wifi device and falling back to wired so maybe this has been a hidden problem since the new nixos release…

Thanks to everyone for your thoughts, it’s very helpful.

Rentlar,

Comm: wpa_supplicant, which is the wifi component, made me suspicious of your wifi hardware too, even before I saw the rest of your post. I’ve had the best success with PCIe-based wifi cards (if this is a desktop PC).

mvirts,

Agreed, this wifi stick was mega cheap on AliExpress so I went for it. I may take a look at the PCB in detail if removing it restores order to my PC. Yes, desktop PC (still hanging on to 2012 hardware woohoo!)

mozz,

Does it still get the error without the wifi adapter connected? The stack trace shows some network-related stuff (which doesn't necessarily mean that's where the issue arose, but it would be quite a coincidence given what you said).

That's the first thing I'd try, and if removing the adapter fixes it (long term) I wouldn't use the adapter anymore. Sometimes broken hardware breaks other hardware it's connected to.

If removing the adapter doesn't fix it, then the next thing I'd try is booting back into the known-good old OS, maybe removing the Nvidia card, basically simplifying everything one step at a time until it stops happening, if you can.

mvirts,

Next chance I get I’m booting without the USB wifi adapter. I’m worried I may have broken something because it was mostly stable before :/ lol I actually don’t have the Nvidia card yet, I ordered a cheap Tesla K80 that’s arriving on Tuesday 😹 and it already brokey system :P

That’s a good idea, I have an Ubuntu partition that I should try.

mkwt,

Don’t know much, but nl80211 in the stack is indicative that the crash happens in a WiFi driver.

Looks like maybe some bad behaviour with a mutex.

The2b, (edited)

You’re only showing us part of the error. There should be more above the list of loaded modules that will provide useful information.

dmesg > dmesg-out will give you the entire dmesg log as a text file, and you can cut out the irrelevant parts.
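
Cutting out the irrelevant parts can be semi-automated. The sketch below assumes the log was saved as `dmesg-out` as above, and that the report is bracketed by the kernel's usual `[ cut here ]` and `[ end trace ... ]` marker lines. Not every kind of crash prints a `cut here` line, so an empty result does not mean the log is clean.

```python
import sys
from pathlib import Path

CUT_HERE = "[ cut here ]"
END_TRACE = "[ end trace"


def extract_reports(dmesg_text):
    """Return each block from a 'cut here' marker through the next 'end trace' line."""
    reports, current = [], None
    for line in dmesg_text.splitlines():
        if CUT_HERE in line:
            current = [line]  # start (or restart) a report at the marker
        elif current is not None:
            current.append(line)
            if END_TRACE in line:
                reports.append("\n".join(current))
                current = None
    if current is not None:  # log ended mid-report, keep what we have
        reports.append("\n".join(current))
    return reports


if __name__ == "__main__":
    path = Path(sys.argv[1] if len(sys.argv) > 1 else "dmesg-out")
    if path.is_file():
        for report in extract_reports(path.read_text(errors="replace")):
            print(report)
            print()
    else:
        print(f"{path} not found")
```

Pasting the extracted block as text rather than a photo also makes it much easier for others to search for the function names in it.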

mvirts,

Good to know! I need to set that up next time, the whole system was unresponsive when I took the photo.

The2b,

In that case it should be in your logs. I believe the default is /var/log/dmesg.log*, depending on how many rotations have occurred since the error.

mvirts,

Lol I checked the system journal but forgot to check if the dmesg log is being written 😹 thanks for the reminder, going to take a look later today
