LE11 Nightly - Samba server freezes, including logs - NTFS3 KERNEL DRIVER IS SLOW

  • Hello all!

    I run an RPi4 h/w rev 1.5 with LE11 nightly-20220207-72b6d0d (RPi-LE).

    Wi-Fi is used to connected to my home network and the internet. The Ethernet port is used for a direct link to another RPi (preferred technology order changed in connman).

    It has a 5TB USB3 HDD connected to it with single partition formatted in NTFS, mounted by ntfs3g. That hard drive's partition is shared over Samba. The drive's content is 4K HDR videos with HBR audio, directly connected for performance purposes.

    There is another Raspberry Pi (RPi-Mgr), running DietPi, connected to the first using the private Ethernet link, mounting the above shared drive through this link, vers=3.1.1 (tried going down to 2.0 with the same results). This Pi manages the content on the USB hard drive.

    Randomly, and quite regularly, RPi-Mgr reports that the mounted network share stops responding, which creates havoc higher up in the software stack.

    I have tried using a network mount using SMB from my NAS, using the exact same workflow, with ideal results. So RPi-Mgr can be considered fine.

    So I've taken a look at RPi-LE, the Samba server.

    I may have been pretty annoying in another post, thinking that the h/w rev and firmware could be problematic, apparently in error... Oooops...

    I started turning the smbd logs to level 10 (stupidly verbose) about everything and went about the normal workflow, looking for the time the client reports the server as unresponsive. I then grepped the server smbd logs for that time. Uh oh... No result!

    So i went into the file, massively dense logs, and searched for the timestamp. Nada, zilch, zip, rien, nothing! There is a gap in all logs for about 20s! No information appears before or after indicative of any ill reaction. It's as if smbd took a break for an espresso cup and came back, as if nothing was wrong... Nothing weird appears in the journalctl logs about the rest of the system either.

    Now, that is annoying... There is something that is clearly happening, but I can't find any verbosity about it.

    No data, no help, no fix...

    What could I do to generate anything that would be useful?

    I have an identical pair of RPis, one running LE11 nightly-20220120-f7f2fd5 on Raspberry Pi 4 Model B Rev 1.2, without any issue at all. That on is in production, so I'm not too tempted to upgrade it. I just wish I had an installable image of that version handy to put on the dev kit as a comparison point.

    Nico

    Edited 3 times, last by camelreef (February 9, 2022 at 12:08 PM).

  • Please try to add "debugging" to the end of cmdline.txt, this enables extra debug logging (for connman and other services) which might provide important information. Then upload logs as usual (journalctl should be the most important).

    so long,

    Hias

  • Please try to add "debugging" to the end of cmdline.txt, this enables extra debug logging (for connman and other services) which might provide important information. Then upload logs as usual (journalctl should be the most important).

    so long,

    Hias

    It shall be done tomorrow!

    Nico

  • Please try to add "debugging" to the end of cmdline.txt, this enables extra debug logging (for connman and other services) which might provide important information. Then upload logs as usual (journalctl should be the most important).

    so long,

    Hias

    Doing it now, but, of course, there has been absolutely zero smb drops yet!

    I'm now cross-compiling an LE11 nightly-20220120-f7f2fd5 image. Seeing if that one works well or not could be a nice data point.

    Ah, crap, I can't... Debian is not hosting http://ftp.debian.org/debian/pool/ma….26.orig.tar.gz anymore...

    Darnit!

    Edited once, last by camelreef (February 10, 2022 at 12:07 PM).

  • Please try to add "debugging" to the end of cmdline.txt, this enables extra debug logging (for connman and other services) which might provide important information. Then upload logs as usual (journalctl should be the most important).

    so long,

    Hias

    Here is the journalctl with the system booted with the debugging option:

    http://ix.io/3Pd4

    NOT enlightening...

    For info, here are the timestamps when the client complained about the share being unavailable:

    What else could I turn on for you that could enable something more useful?

    Nico

  • Can you test with a simple, stock network configuration? i.e. plain LE installation with just wired (or wifi) network enabled but not both wifi and wired at the same time.

    I'm suspecting network / routing could be the culprit, that test should help to narrow it down.

    so long,

    Hias

  • Can you test with a simple, stock network configuration? i.e. plain LE installation with just wired (or wifi) network enabled but not both wifi and wired at the same time.

    I'm suspecting network / routing could be the culprit, that test should help to narrow it down.

    so long,

    Hias

    Single network, wired, was how it was until very recently, and the problem was exactly the same.

    I have changed to that second backend network only once I thought that the new bootloader/firmware/kernel/samba combo had stabilised things, in error, when I thought that I could put the kit in production at my friend's.

    Single network or two networks show exactly the same behaviour.

    In any case, this is all very simple networking, a directly connected class C on one interface, with the default route, another directly connected class C on another interface. The mount is using the IP address on that directly connected class C network. There is no need for any IP forwarding to be turned on.

    Nothing is confusing, the routing table is exactly as it should be, with proper metrics, as is the IP assignment on the interfaces.

    The issue is not network related.

    Edited once, last by camelreef (February 10, 2022 at 4:25 PM).

  • have you tried to do what is written in the log ?

    Code
    Feb 10 16:39:35 kodiplayer smbd[5958]:   If you are running a recent Samba version, and if you think this problem is not yet fixed in the latest versions, please consider reporting this bug, see https://wiki.samba.org/index.php/Bug_Reporting
  • Code
    Feb 10 17:00:10 kodiplayer smbd[6089]:   smb2_sendfile_send_data: sendfile failed for file Transit/Incoming/Game.of.Thrones.S03.2160p.BluRay.REMUX.HEVC.DTS-HD.MA.TrueHD.7.1.Atmos-FGT/Game.of.Thrones.S03E06.The.Climb.2160p.BluRay.REMUX.HEVC.DTS-HD.MA.TrueHD.7.1.Atmos-FGT.mkv (Connection reset by peer) for client ipv4:10.25.25.1:46624. Terminating

    "Connection reset by peer" means the client (i.e. the other machine) unexpectedly closed the connection.

    Might be worth looking into what the client is doing and also testing if the issue happens with other clients (eg a Windows box).

    so long,

    Hias

  • Code
    Feb 10 17:00:10 kodiplayer smbd[6089]:   smb2_sendfile_send_data: sendfile failed for file Transit/Incoming/Game.of.Thrones.S03.2160p.BluRay.REMUX.HEVC.DTS-HD.MA.TrueHD.7.1.Atmos-FGT/Game.of.Thrones.S03E06.The.Climb.2160p.BluRay.REMUX.HEVC.DTS-HD.MA.TrueHD.7.1.Atmos-FGT.mkv (Connection reset by peer) for client ipv4:10.25.25.1:46624. Terminating

    "Connection reset by peer" means the client (i.e. the other machine) unexpectedly closed the connection.

    Might be worth looking into what the client is doing and also testing if the issue happens with other clients (eg a Windows box).

    so long,

    Hias

    I would theorise that this is the client resetting things because all hell broke loose at some point, and not the root problem. It is a pretty rare occurrence.

    I am, however testing both your and my theories, moving large files back and forth using another client and watching the server's logs.

    Just a reminder from another thread, I had tested my client using another Samba server, one running on my old-ish Lenovo-EMC NAS, without any hiccup whatsoever, as I naturally wanted to find out if it was a server or client issue.


    have you tried to do what is written in the log ?

    Code
    Feb 10 16:39:35 kodiplayer smbd[5958]:   If you are running a recent Samba version, and if you think this problem is not yet fixed in the latest versions, please consider reporting this bug, see https://wiki.samba.org/index.php/Bug_Reporting

    Should I really push that issue upstream now? Is LE using the latest and greatest from Samba? Is the issue really a Samba bug?

    I have no problem contacting the Samba guys and following up, but I feel that there is a bit more that needs gathering before that.

    Also, LE is using Samba 4.13.17, samba.org announces that the latest stable is 4.15.5. I can't judge the LE decision to use an older version, but there must be a good one. I'm pretty sure that if I contact upstream, the first reaction will be to ask me to use the latest version first and go from there, as mentioned in the log message.

    Edited 3 times, last by camelreef: Merged a post created by camelreef into this post. (February 12, 2022 at 9:35 AM).

  • On versioning - Samba 4.13 is being tracked as it is in LE10(and one of the few packages that is backport Ed - for its security fixes). In March/April - security patches will stop being released for 4.13 and the :master will start tracking the to be released 4.16 release of Samba.

  • I'm pretty sure that if I contact upstream, the first reaction will be to ask me to use the latest version

    this might be the case.

    anyway, I would try it, it won't hurt.

    no one and nothing warrants that a bug in a prev. versions is still, but unknown, in the last version too

  • I absolutely hate problems coming up randomly or without an identified...

    I've been hammering the system with the usual workflow plus additional external transfers, and not a single drop or anything in the past 24 hours... Aaaaaaaaaaah!

  • I absolutely hate problems coming up randomly or without an identified...

    ... which might cause 20s caps in the logs, too ?

    RAM ?

    I've searched the samba bugzilla for "INTERNAL ERROR Signal 11"

    Bug List


    As I'm not a developer I can't judge if the newest report named "Regression: Samba 4.15.2 on macOS ..." (and of cause the patch => comment #3) makes sense/could help (in case if it would be able to be backported)

    but maybe the "Client specific logging" settings I found here:

    1. https://wiki.samba.org/index.php/Client_specific_logging

    - it seems it's something more then just "log level 10" -

    and

    2. against the log cap and because the error occurs randomly to set in the LE configuration Tool => System => Logging (on the right) => "Use persistent Logs" (systemd log survives reboots) in combination with the boot parameter HiassofT suggested (if not already set)

    - I couldn't exactly get in which log your 20s gap occured (samba log's, systemd log or both) -

    happy bug hunting ;)

    Edited 6 times, last by GDPR-7 (February 14, 2022 at 3:33 AM).

  • UK morning all!

    I have a bit of time to dedicate to this today.

    First, I don't believe that I've ever given the client's mount parameters:

    Code
    //10.25.25.2/content on /mnt/content type cifs (rw,relatime,vers=3.1.1,cache=strict,username=guest,uid=1000,forceuid,gid=1000,forcegid,addr=10.25.25.2,file_mode=0770,dir_mode=0770,soft,nounix,serverino,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=60,actimeo=1,x-systemd.automount)

    Who knows why there is an NFS mount option in there (actimeo)...

    The server's parameters are default LE, automatically sharing external discs.

    The gaps in logs were in the Samba logs with Level 10 on.

    Those Samba bugs are not really helpful, but I'm not the best person to judge.

    I'll try to coerce the LE samba to do some client-specific logging and come back here with data.

    [EDIT] It's on and working, giving very dense/verbose logs. How much you want to bet that the error won't crop up for hours, or at least before the /storage partition of my SD card is full?

    Oh, and a last data point, not the smallest: the similar production setup running an LE11 20220120 that I believed not subjected to the issue has displayed the same problem. This is depressing as it removes the possibility of a good known point...