[Orange Pi 3 LTS] Freezes

  • A 2ms transition latency didn't really solve it. I'm going to assume it's unstable at one of the intermediate frequencies, because about 90% of the time it's at 480MHz and about 8% at around 1800MHz, and we know that with the performance governor it seems stable at 1800MHz all the time.

    I am assuming the voltages at those frequencies are the problem, so I'm going to drop the two clock rates between 1400 and 1800MHz.

  • So far 40 hours in and no lockup, using the schedutil governor with those two CPU OPPs dropped to align with the BSP. Not sure why they don't work; perhaps the derived frequencies (divided from the core clock) are incompatible (i.e. timings) with those core frequencies, or it is a voltage issue.

    Code
    LibreELEC:~ # uptime; cat /sys/devices/system/cpu/cpufreq/policy*/stats/time_in_state
     09:52:47 up 1 day, 16:32,  load average: 0.25, 0.32, 0.37
    480000 13646338
    720000 153534
    816000 16741
    888000 9042
    1080000 14350
    1320000 7532
    1488000 1507
    1800000 748489
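
    To turn a listing like the one above into shares of total time, something along these lines should work. It is only a sketch: the function name is made up, and it assumes awk is available on the box. Each time_in_state line is "<frequency in kHz> <time>", with the time counted in 10ms units.

```shell
# Hypothetical helper: read time_in_state lines ("<kHz> <ticks>") on stdin
# and print each frequency with its share of the total time.
time_in_state_pct() {
    awk '
        { freq[NR] = $1; ticks[NR] = $2; total += $2 }
        END {
            for (i = 1; i <= NR; i++) {
                printf "%d kHz %.2f%%\n", freq[i], (total > 0 ? 100 * ticks[i] / total : 0)
            }
        }'
}

# Usage on the device:
#   cat /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state | time_in_state_pct
```

    Fed the numbers above, this puts 480MHz at about 93.5% and 1800MHz at about 5.1% of the sampled time.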

    Below is the DT patch, just for OPi3 LTS:

    Here is the full set of patches I am running with on top of LE12 master.

  • It did actually freeze again. So, I am back on the stock CPU operating points and using performance governor. I'm going to give that at least a week to confirm.

    It seems like there is a lot of chasing my own tail (some of the earlier kernel crashes were likely related to playing with a DT overlay). I did try using cpuburn-a53 at different clock frequencies and naturally it hit thermal throttling, so I don't know if that was very useful; I only ran it for 3 minutes per clock on all cores. The only thing that yielded a clue and sent me down the governor/cpufreq path was disabling the IOMMU and the rcu_preempt stalls between kodi and the schedutil governor. But maybe that is a red herring, even though the behavior aligns with typical frequency/voltage issues for CPU cores (and even though I presume other H6-based devices are OK).

    I guess if others want to test the performance governor, the easiest thing to do is add the following to autostart.sh:

    Code
    echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  • Can you test some other frequency? IIRC it's possible to select an OPP through sysfs. If it doesn't matter which frequency is selected, as long as it's not switched, then the bug is in the frequency switching code.

    I have to give you a big thank you for your patience in debugging this issue.

  • IIRC it's possible to select an OPP through sysfs. If it doesn't matter which frequency is selected, as long as it's not switched, then the bug is in the frequency switching code.

    Yes, with the userspace governor or by limiting the min/max of one of the other governors. I'll have to give that a shot; for now I'm going to confirm the max clock is stable.

    I was also thinking along those lines, but my doubt is that I thought most of this is common code. I guess it could be an issue with the PMIC driver or the clock driver. The clock driver is going to be common to all H6 devices because, if I am not mistaken, the clock generator is part of the SoC. I am also noticing that most of the H6 devices in mainline use the same PMIC.
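
    The min/max route mentioned above can be sketched like this. The helper name and the optional directory argument are made up for illustration (the second argument is only there so it can be dry-run against a scratch directory), and the default assumes all cores sit under policy0. Setting both limits to the same value leaves the governor nothing to switch between.

```shell
# Hypothetical helper: pin a cpufreq policy to one frequency (in kHz)
# by setting scaling_min_freq and scaling_max_freq to the same value.
pin_cpu_freq() {
    khz=$1
    policy=${2:-/sys/devices/system/cpu/cpufreq/policy0}

    # Refuse frequencies the driver does not advertise.
    case " $(cat "$policy/scaling_available_frequencies") " in
        *" $khz "*) ;;
        *) echo "unsupported frequency: $khz" >&2; return 1 ;;
    esac

    # Keep min <= max at every step: raise max first when going up,
    # lower min first when going down.
    cur_max=$(cat "$policy/scaling_max_freq")
    if [ "$khz" -gt "$cur_max" ]; then
        echo "$khz" > "$policy/scaling_max_freq"
        echo "$khz" > "$policy/scaling_min_freq"
    else
        echo "$khz" > "$policy/scaling_min_freq"
        echo "$khz" > "$policy/scaling_max_freq"
    fi
}

# Usage on the device (run as root):
#   pin_cpu_freq 1080000
```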

  • Speaking of the PMIC in the DT, the schematic suggests that the IRQ from AXP805 is connected to the NMI pin on the H6, which is supposed to be interrupt number 128. Is this correct for the PMIC? Or is something just lost in translation here?

    interrupt-parent = <&r_intc>;

    interrupts = <GIC_SPI 96 IRQ_TYPE_LEVEL_LOW>;
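
    One way the two numbers might line up, assuming the 128 in the manual is a raw GIC interrupt ID: GIC IDs 0-31 are reserved for SGIs and PPIs, so a shared peripheral interrupt written as GIC_SPI n has ID n + 32. How r_intc itself interprets the specifier is not covered by this sketch.

```shell
# GIC interrupt IDs 0-31 are SGIs/PPIs, so SPI number n has ID n + 32.
gic_spi_to_id() {
    echo $(( $1 + 32 ))
}

# Usage:
#   gic_spi_to_id 96
```

    With the value from the DT fragment above, 96 + 32 gives 128.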

  • Not much of an update, but what I am doing is letting the device run for approximately 30 hours using performance governor at different clock frequencies (switching it manually by setting the scaling_max_freq via sysfs, governor will downclock automatically):

    I think this probably supports the current thinking that it's related to DVFS and the actual switching. I don't know if it is a driver issue, or perhaps something like the ramp-up times for the regulators being not quite right(?). My understanding with DVFS is you have to switch the voltage first, then bump the clock. I guess I am wondering: if you don't wait for the voltage ramp-up and adjust the clock before the voltage has stabilized, would that cause the SoC to crash (i.e. not enough voltage/current to support the higher clock rate)?

  • My understanding with DVFS is you have to switch the voltage first, then bump the clock.

    For raising frequency yes. For lowering it, it's the other way around.

    I guess I am wondering: if you don't wait for the voltage ramp-up and adjust the clock before the voltage has stabilized, would that cause the SoC to crash (i.e. not enough voltage/current to support the higher clock rate)?

    You can easily adjust the ramp-up time in DT with regulator-ramp-delay. For testing purposes, you can set it to some ridiculously high number, like 100 ms.
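
    The ordering described above (voltage first when raising the frequency, clock first when lowering it) can be written down as a small decision helper. This is only an illustration of the rule, not what the kernel code looks like; the function name and step labels are made up.

```shell
# Print the order of operations for one DVFS transition.
# Arguments: current frequency and new frequency, both in kHz.
dvfs_steps() {
    cur_khz=$1
    new_khz=$2
    if [ "$new_khz" -gt "$cur_khz" ]; then
        # Going up: the higher clock needs the higher voltage to be settled first.
        echo "set-voltage wait-for-ramp set-clock"
    elif [ "$new_khz" -lt "$cur_khz" ]; then
        # Going down: drop the clock first, then the voltage can follow.
        echo "set-clock set-voltage"
    else
        echo "nothing-to-do"
    fi
}

# Usage:
#   dvfs_steps 480000 1800000
```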

  • You can easily adjust the ramp-up time in DT with regulator-ramp-delay. For testing purposes, you can set it to some ridiculously high number, like 100 ms.

    I haven't done this yet, but will probably do it over the weekend. I stepped down through each frequency and ran for 30 hours on each, and there really is no problem.

    As for the value in the DT, what I read is that this value is in uV per us, and I believe it corresponds to the DVM setting in the AXP805 datasheet, although it's pretty unclear whether anything initializes REG 1A for this chip. The way it is described, it's the slope of the voltage rise/drop over time when an adjustment is made. Assuming about 15us/step, with the lowest increment being 10mV, this gives me a ramp delay of 640uV/us (strictly, 640uV/us is 10mV per 15.625us). That gives the second-longest delay; the only longer one is if the DVM slope is 31us/step, which seems less likely. I'm going to give that a shot over the weekend:

    regulator-ramp-delay = <640>;

    The current value, if my assumptions are correct, gives too short a delay, and the voltage may not have stabilized when the frequency is changed on the CPU. Basically, what I understood is that the driver code computes (target uV - current uV) / ramp-delay and feeds the result into the sleep routine. I'm going to guess this is more of a problem when ramping frequencies up than down.
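
    To get a feel for the numbers, the calculation described above can be tried out in the shell. This is a sketch of my understanding only: the function name is made up, the voltages in the usage line are illustrative rather than taken from the OPP table, and rounding up is an assumption about how the kernel handles the division.

```shell
# Settling time in us for a voltage change, rounded up.
# Arguments: current voltage (uV), target voltage (uV), ramp rate (uV/us).
ramp_delay_us() {
    cur_uv=$1
    new_uv=$2
    ramp=$3
    diff=$(( new_uv - cur_uv ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    # Integer division rounded up (assumed behaviour, see the note above).
    echo $(( (diff + ramp - 1) / ramp ))
}

# Usage (illustrative voltages: one 10mV step at 640uV/us):
#   ramp_delay_us 1000000 1010000 640
```

    A single 10mV step at 640uV/us comes out at 16us, so a larger regulator-ramp-delay value means a shorter wait, not a longer one.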

  • This is very good research and a good find. I always thought the ramp delay was in us, and it never occurred to me that it is a coefficient. Anyway, if this is the culprit, I would ask you to send a patch to the Linux mailing lists to update all H6 boards with the AXP805. It would also explain why my primary H6 development board, the Tanix TX6, doesn't show this kind of issue: it uses a fixed voltage regulator instead of the AXP805.

  • Wow, that's a great find, thanks guys for putting in all this effort!

    Is it possible that the handling of this AXP805 could also explain intermittent lack of ethernet on boot?

  • Anyway, if this is the culprit

    Unfortunately, it seems not to be. I turned on debug for PM/regulators, and it's interesting to see that with ondemand/schedutil it is bumping the clock/voltages probably every 30-40ms; that's a huge part of the debug logging going on. The other piece is the mmc, and naturally there are things that are not logged. The only thing pinging more in debug is the thermal side of it.

    Not really sure what else it could be. I feel confident it is related to DVFS, and I doubt it is a spinlock/mutex or anything in that realm, since that would definitely be more prevalent. I am assuming under/over voltage or the rate of switching the clock is throwing something off in the SoC and probably causing it to "crash". What's interesting is that I have caught it while it was still responsive, and all I did on the terminal was recall the previous command from history (up arrow) in the shell and it froze. Other times it is already frozen when I get there: the shell is disconnected, the device doesn't ping, the UI is non-responsive, and the serial console is non-responsive. The device is solid on the performance governor.

    Not really sure if there is any way to debug it further, other than maybe JTAG? I set clock-latency as well, to see if that has any bearing on it (thinking it would slow the rate of clock changes; I couldn't really tell whether it waits for the clock to be adjusted or not).

    If that doesn't work out, then I'm probably just going to set the performance governor.

  • I was recently reminded of the similar GPU issue on H6: https://lore.kernel.org/linux-arm-kern…HGUjKi@kista/T/ Maybe it's worth checking it out?

    Speaking of the GPU, I noticed that the OPP tables (sun50i-h6-gpu-opp.dtsi) are not included for any device other than Beelink GS1. I'll take a look at the CPU-related registers. I also assume the Tanix TX6 does dynamic frequency scaling (just with a fixed voltage?).

    My other thought was the protected clock patch, I noticed that was sort of abandoned and never mainlined. The feedback at the time was this patch may cause child clocks to also become protected.

    I started looking at the clock code, and like everything it is a lot of indirection, macros, etc that takes years to fully understand -- and even then, there are things probably still not understood. I know this feeling of knowing something and not knowing it, kudos to those few that have mastery over it all or even a good portion of the subsystems. :)

    Anyway, I think fixing regulator values in DT is still worth it.

    My feeling is it's correct as well, but I'm far from an expert on PMICs. It seems there is a lot of forgiveness in the electronics; maybe it's because of the overhead of the code to set + delay and everything in between. I guess I am thinking that's why this isn't the root cause. The other factor is that scaling the frequency up or down probably has an inherent delay: I would expect that, for the CPU clock to be adjusted, the CPU is actually halted, and the CCU waits for the clock to lock before releasing the CPU halt. Perhaps that timing is greater than the regulator stabilization time (which is 44us in the worst case). It may matter more to the GPU side than the CPU side.

  • Speaking of the GPU, I noticed that the OPP tables (sun50i-h6-gpu-opp.dtsi) are not included for any device other than Beelink GS1.

    Yeah, GS1 was first, but then nobody actually updated and tested other boards.

    I also assume the Tanix TX6 does dynamic frequency scaling (just with a fixed voltage?).

    Correct (for CPU). GPU scaling should just work if the table is included.

    I started looking at the clock code, and like everything it is a lot of indirection, macros, etc that takes years to fully understand -- and even then, there are things probably still not understood.

    My research into bugs often leads me to clocks. So yeah, it's a complex mechanism and, at least in the past, a source of many issues.

    I guess I am thinking that's why this isn't the root cause. The other factor is that scaling the frequency up or down probably has an inherent delay: I would expect that, for the CPU clock to be adjusted, the CPU is actually halted, and the CCU waits for the clock to lock before releasing the CPU halt. Perhaps that timing is greater than the regulator stabilization time (which is 44us in the worst case).

    That's entirely possible.

    I found something. The H6 user manual has a description of how to properly adjust the CPUX PLL. I'm sure that at least the delay at the end is missing. However, I also noticed that many other SoCs, like A64 and H3, reparent the CPU clock to something stable, like 24 MHz, switch the CPUX PLL and then reparent back. I suppose that this is also a valid option, at least until we can figure out a more direct process.