[Orange Pi 3 LTS] Freezes

  • I have one possible explanation - IOMMU. The H6 driver doesn't have locking implemented in some functions, but others do. Without it, race conditions are possible when allocating and freeing memory during video decoding and rendering.

    I had a similar experience to gadget_guy, using an LE 12 build. I'll give disabling IOMMU a shot as well.

  • Note that this is just a test; the benefits of IOMMU are big. It protects unallocated memory from being used by HW (in other words, it prevents HW from corrupting memory) and it allows any amount of free memory to be allocated to video decoding (otherwise only CMA memory can be used, which is limited in size).

    So far no lock-ups (14 hours uptime). Kodi has locked up, but not the shell, so this is different. dmesg shows what seems to be a conflict between sugov (some sort of kernel thread or work queue?) and kodi.bin:

    http://ix.io/4Gr3



    I'm trying the performance governor. When I tried to change the governor while the system was in this state, the shell would hang/deadlock, so I rebooted and ran:

    echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor

    We will see how that goes; the sugov kthread is related to the schedutil governor.
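
    For anyone repeating this test, the same write can be done for every cpufreq policy at once. This is only a sketch: CPUFREQ_ROOT is a hypothetical override I added so the functions can be tried against a scratch directory, and it is not a kernel or LE setting. On a board with a single policy0 the echo above does the same thing.

```shell
#!/bin/sh
# Sketch: write one governor name into every cpufreq policy directory.
# CPUFREQ_ROOT is a hypothetical override for testing, not a real setting.
set_governor() {
    governor="$1"
    root="${CPUFREQ_ROOT:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$root"/policy*; do
        [ -d "$policy" ] || continue
        # Skip policies that do not list the requested governor as available.
        if [ -r "$policy/scaling_available_governors" ] &&
           ! grep -qw "$governor" "$policy/scaling_available_governors"; then
            echo "skip ${policy##*/}: $governor not available" >&2
            continue
        fi
        echo "$governor" > "$policy/scaling_governor"
    done
}

# Print "policyN: governor" for every policy that exposes a governor.
show_governors() {
    root="${CPUFREQ_ROOT:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$root"/policy*; do
        [ -r "$policy/scaling_governor" ] || continue
        echo "${policy##*/}: $(cat "$policy/scaling_governor")"
    done
}
```

    Run set_governor performance as root, then show_governors to confirm the change took. The setting does not survive a reboot.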

    Edited once, last by frakkin64: Merged a post created by frakkin64 into this post. (September 15, 2023 at 12:36 PM).

  • Anyway, I went through the CedarX vendor lib and found some differences regarding MPEG2 setup. You can try this kernel patch:

    http://ix.io/4Fsv

    However, I couldn't find any real-world difference. Maybe something can be done on the ffmpeg side.

    Yeah, that patch didn't make much of a difference with MPEG2. I'm OK with software decoding; most of the MPEG2 content is just OTA broadcasts and it's usually not much more than 1080, which neither the OPi3 nor the RPi4 has a problem decoding in software.

  • I'm trying the performance governor. When I tried to change the governor while the system was in this state, the shell would hang/deadlock, so I rebooted and ran:

    echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor

    We will see how that goes; the sugov kthread is related to the schedutil governor.

    jernej, I ran for 31.5 hours using the performance governor instead of schedutil, with no problems and no lock-ups. I just rebuilt on master with really just one patch for H6, which changes the governor to performance instead of schedutil. I'll let you guys know how that goes.

    There is a possibly related discussion on LKML. It's really about CPU hotplug (I guess for when they turn off performance cores vs. low-power cores in some Android phones), but it suggests there is potential lock contention between schedutil and cgroups.

    [PATCH] sched/schedutil: Fix deadlock between cpuset and cpu hotplug when using schedutil (kernel.org)

    It's just speculation at this point, and it seems like that discussion sort of died out.

  • Random thought .. I wonder if something like this is needed?

    Possibly, but it seems like there is definitely some sort of dependency/contention between the schedutil governor and Kodi. I decided to go with ondemand and it did freeze, so now I am back to testing performance (which I think is full speed?). So it's either a combination of scheduler + IOMMU disabling, or it didn't run long enough, or it is frequency switching.

    I am trying to enable lockdep & prove_locking now. I thought all the dependencies were there, but the kernel build just stripped those config options out. Can you trigger the config target via scripts/build, or do you usually do that outside of the LE build system?
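
    On the "full speed?" question: as far as I know the performance governor pins each policy to the highest frequency within its scaling limits, so one way to check is to compare scaling_cur_freq with scaling_max_freq. A rough sketch follows; CPUFREQ_ROOT is again a hypothetical override for trying it on a scratch directory, and the sysfs values are in kHz.

```shell
#!/bin/sh
# Sketch: report current vs. maximum allowed frequency for each cpufreq policy.
# CPUFREQ_ROOT is a hypothetical override for testing, not a real setting.
freq_report() {
    root="${CPUFREQ_ROOT:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$root"/policy*; do
        [ -r "$policy/scaling_cur_freq" ] || continue
        cur=$(cat "$policy/scaling_cur_freq")
        max=$(cat "$policy/scaling_max_freq")
        if [ "$cur" -ge "$max" ]; then
            state="at max"
        else
            state="below max"
        fi
        echo "${policy##*/}: cur=${cur} kHz max=${max} kHz ($state)"
    done
}
```

    Note that scaling_max_freq can itself be lowered (for example by thermal throttling), so "at max" here means the highest frequency currently allowed, not necessarily the top OPP.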

  • If you navigate to the build folder and do "make menuconfig" it should detect the local .config file which you can then diff/copy to update the one in the project/device linux folder. I normally just mod the defconfig directly and respin the image. Once in a while settings don't stick and I need to do a little digging in kernel Kconfig to understand dependencies, but over time (with gained experience) that occurs less :)

  • So it's either a combination of scheduler + iommu disabling, or it didn't run long enough, or it is frequency switching.

    Frequency switching can be a bit tricky, true.

    Can you trigger the config target via scripts/build, or do you usually do that outside of the LE build system?

    The easiest way is to execute make ARCH=arm64 menuconfig in the LE linux build folder and change whatever you want. Then diff the .config in the build folder against linux.aarch64.conf and pick the changes that are of interest. Don't take all the changes; there are some magic values which are replaced by the build script.
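
    The diff step can be narrowed to option lines only. This is a minimal sketch, assuming both files use the usual CONFIG_FOO=value and "# CONFIG_FOO is not set" syntax; config_changes is a helper name I made up, and the file paths you pass are whatever your build tree uses.

```shell
#!/bin/sh
# Sketch: print option lines that appear in NEW but not in OLD.
# Usage: config_changes OLD NEW
config_changes() {
    old="$1"
    new="$2"
    pattern='^(CONFIG_[A-Za-z0-9_]+=|# CONFIG_[A-Za-z0-9_]+ is not set)'
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    # Keep only option lines and sort them so comm can compare the two files.
    grep -E "$pattern" "$old" | sort > "$old_sorted"
    grep -E "$pattern" "$new" | sort > "$new_sorted"
    # Column 3 suppressed, column 1 suppressed: only lines unique to NEW remain.
    comm -13 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}
```

    For example, config_changes linux.aarch64.conf .config lists what menuconfig changed. The output still has to be picked through by hand because of the magic values mentioned above. The kernel tree also ships scripts/diffconfig, which does a similar comparison.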

  • So far it was just CONFIG_PROVE_LOCKING=y, and the Ethernet driver really hates it :). The network connections are super-super slow. I think I read that it is a bit heavy; oddly, the shell via serial is very responsive. I think there may also be a more generic lock debugging option that just logs locks acquired/released, but prove locking is supposed to check for deadlocks and report them before the deadlock actually happens.

    I will probably just let it sit there and hope there is a splat or something on the serial console. If that doesn't produce anything meaningful then I will try running with the performance governor on a nightly build and see what happens.

    I ended up going into a Linux source tree, copying over the config, running menuconfig, then copying the minimal adjustment back over.
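
    Since options can get stripped when a dependency is missing, it helps to check the final .config after the build. A small sketch; config_state is a helper name I made up, and the symbols in the usage note are only examples (the real dependencies are in lib/Kconfig.debug).

```shell
#!/bin/sh
# Sketch: print the value of CONFIG_<SYMBOL> in a kernel config file,
# or "unset" when the symbol is absent or marked "is not set".
# Usage: config_state FILE SYMBOL
config_state() {
    file="$1"
    sym="$2"
    line=$(grep -E "^CONFIG_${sym}=" "$file" | tail -n 1)
    if [ -n "$line" ]; then
        echo "${line#*=}"
    else
        echo "unset"
    fi
}
```

    For example: for s in PROVE_LOCKING LOCKDEP DEBUG_KERNEL; do echo "$s: $(config_state .config "$s")"; done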

  • frakkin64, if you don't mind messing with DT, there are some differences between mainline and the BSP worth testing. The BSP DT has clock latency set to 2000000 and is missing the OPP points for 1608 MHz and 1704 MHz. Would you mind testing this? It's all in sun50i-h6-cpu-opp.dtsi.

    Yes, can do. I just pulled down the BSP to take a look at the DT. It's quite a swing from ~244ms to 2s for transition latency.

    Edit: I guess it's more like 244us to 2ms. It seems to me this clock-latency-ns DT property is largely informational? I'll try dropping the two frequency points, but I get the impression that with BSPs they get to a point where it works and call it good enough.
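
    The unit mix-up is easy to make because clock-latency-ns is given in nanoseconds. A small conversion sketch; ns_to_human is a made-up helper, and 244144 is my assumption for the exact mainline value behind the "~244" above, while 2000000 is the BSP value quoted earlier.

```shell
#!/bin/sh
# Sketch: print a nanosecond count as microseconds or milliseconds
# with three decimal places, using integer arithmetic only.
# Usage: ns_to_human NANOSECONDS
ns_to_human() {
    ns="$1"
    if [ "$ns" -ge 1000000 ]; then
        whole=$((ns / 1000000))
        frac=$(((ns % 1000000) / 1000))
        unit="ms"
    else
        whole=$((ns / 1000))
        frac=$((ns % 1000))
        unit="us"
    fi
    printf '%d.%03d %s\n' "$whole" "$frac" "$unit"
}
```

    ns_to_human 244144 prints 244.144 us and ns_to_human 2000000 prints 2.000 ms. The value the running kernel actually uses can be read from /sys/devices/system/cpu/cpufreq/policy0/cpuinfo_transition_latency (also in nanoseconds), which may differ from the DT number.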

    Edited once, last by frakkin64 (September 17, 2023 at 6:07 PM).

  • The Ethernet issue is related to commit f8e03eae5; with that commit reverted, Ethernet is working. As a workaround, rmmod dwmac_sun8i; modprobe dwmac_sun8i will recover Ethernet.

    http://ix.io/4GGs - for full dmesg (which includes an unload and reload of the module noted above).

    Code
    [   15.837181] NETDEV WATCHDOG: eth0 (dwmac-sun8i): transmit queue 0 timed out 5584 ms

    After this splat, the network is super slow, lots of soft hang ups. As for the prove locking, it proved nothing (locked up with nothing useful). So I am running now with stock + performance governor, and if that yields any success then I will go into tweaking the transition latency & dropping some of the opps to match BSP.
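
    If someone wants to automate the workaround while the commit is still in their build, something like the following could work. It is only a sketch: watchdog_hits and reload_ethernet are names I made up, the match string is taken from the log line above, and the module name is the one from the workaround.

```shell
#!/bin/sh
# Sketch: count NETDEV WATCHDOG lines for one interface in kernel log text
# read from stdin. Prints 0 when there are none.
# Usage: dmesg | watchdog_hits eth0
watchdog_hits() {
    iface="$1"
    # grep -c exits non-zero on zero matches, so mask that status.
    grep -c "NETDEV WATCHDOG: ${iface} " || true
}

# Reload the Ethernet driver module (needs root).
reload_ethernet() {
    rmmod dwmac_sun8i && modprobe dwmac_sun8i
}
```

    Usage would be along the lines of: [ "$(dmesg | watchdog_hits eth0)" -gt 0 ] && reload_ethernet. Since dmesg keeps old messages, a repeated check would need to track which lines it has already seen.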

  • This also solves the Ethernet issue. As a side note, I was testing a vanilla nightly using just the performance governor, and it was stable for three days (no issues).

    So it seems like the problem is indeed related to frequency scaling, but I am a bit surprised as this CPU is used in a number of devices. Is there a reference document that indicates the voltages @ frequencies that should be used?