Posts by frakkin64

    Anyway, if this is the culprit

    Unfortunately, it seems not to be. I turned on debug logging for PM/regulators, and it's interesting to see that with ondemand/schedutil the clock/voltages are bumped roughly every 30-40ms; that accounts for a huge part of the debug output. The other piece is the mmc, plus naturally whatever isn't logged. The only thing logging more often is the thermal side.

    Not really sure what else it could be. I feel fairly confident it is related to DVFS, and I doubt it's a spinlock/mutex or anything in that realm, since that would definitely be more prevalent. I am assuming under/over voltage or the rate of switching the clock is throwing something off in the SoC and probably causing it to "crash". What's interesting is that I have caught it while it was still responsive, and all I did on the terminal was hit previous command history (up arrow) in the shell and it froze. Other times it is already frozen when I get there: the shell is disconnected, the device doesn't ping, and both the UI and the serial console are non-responsive. The device is solid on the performance governor.

    Not really sure if there is any way to debug it further, other than maybe a JTAG? I set clock-latency as well, to see if that has any bearing on it (thinking it would slow the rate of changing the clock, couldn't really tell if it is waiting for the clock to be adjusted or not).

    If that doesn't work out, then probably just going to set the performance governor.

    You can easily adjust ramp up time in DT with regulator-ramp-delay. For testing purposes, you can set it to some ridiculously high number, like 100 ms.

    I haven't done this yet, but will probably do it over the weekend. I stepped down through each frequency and ran for 30 hours on each, and there really is no problem. As for the value in the DT, from what I read it is in uV per us, and I believe it corresponds to the DVM setting in the AXP805 datasheet. It's pretty unclear whether anything initializes REG 1A for this chip, but as described it is the slope of the voltage rise/drop over time when an adjustment is made. Assuming roughly 15us/step and a lowest increment of 10mV, this gives me a ramp delay of 640uV/us (which is the 2nd highest delay; the only higher delay is if the DVM slope is 31us/step, which seems less likely). I'm going to give that a shot over the weekend:

    regulator-ramp-delay = <640>;

    The current value, if my assumptions are correct, is too short a delay, and the voltage may not be stabilized when the frequency is changed on the CPU. Basically, what I understood is that the driver code computes ( target uV - current uV ) / ramp-delay and feeds the result into the sleep routine. I'm going to guess this is more of a problem when ramping frequencies up than down.
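    The arithmetic above can be sketched roughly as follows. The function names and the two example voltages are made up for illustration and are not taken from the OPP table or the datasheet. One thing the numbers show: 640 uV/us corresponds to a 10 mV step every 15.625 us, whereas exactly 15 us per step would give about 667 uV/us.

```python
import math


def ramp_rate_uv_per_us(step_uv: int, step_time_us: float) -> float:
    """Slew rate implied by a DVM setting: one voltage step every step_time_us."""
    return step_uv / step_time_us


def settle_time_us(current_uv: int, target_uv: int, ramp_delay_uv_per_us: int) -> int:
    """Time to wait after a voltage change, rounded up to whole microseconds.

    Mirrors the (target uV - current uV) / ramp-delay idea described above.
    """
    return math.ceil(abs(target_uv - current_uv) / ramp_delay_uv_per_us)


# 10 mV per step; the step times are assumptions to compare against 640.
print(ramp_rate_uv_per_us(10_000, 15.625))
print(round(ramp_rate_uv_per_us(10_000, 15), 1))

# Hypothetical 0.80 V -> 1.10 V transition with regulator-ramp-delay = <640>.
print(settle_time_us(800_000, 1_100_000, 640), "us")
```

    With a smaller ramp-delay value the computed wait gets longer, which is why lowering the number in the DT is the conservative direction for testing.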

    "It's been a year and things are going well" .. which is roughly how long I've been saying "ask me again in six months" each time someone asks if LE supports RK3588.

    I know you are intimately aware of the Amlogic problems as well, and the difficulty of mainline support. It's quite a different story between paid engineers focused on the product versus unpaid volunteers/paid bounties to implement a feature. The open-source process is pretty difficult for the average user to engage in, especially that of the kernel; it's also a bit intimidating. No knock on folks working to pay their bills, but one of the problems is that manufacturers hire consultants to implement the product in mainline, and once they feel it is "good enough" they cut them loose. I am sure those guys feel some sense of responsibility for what they put out in the world, but they also need to eat, so that code doesn't get the same level of support as when they were paid to maintain it.

    Anyone that maintains open-source software for free is a real hero in my book.

    A bit on the fence to be honest, the RPi4 performs great, and I don't spend a lot of time in the menu. :)

    It would have been nice if there was a native on-board M.2 slot, but I would probably only use that for a desktop or server device. For LE, all of my content is on the NAS, and I just use SD cards to boot LE. I haven't seen a genuine SD card failure yet, but I already back up the configuration to the NAS. At this point I have 2 spare RPi4 boards, so I probably won't be picking up any more until RPi6.

    I am sure we will see great device support from the Raspberry Pi team; it's really hard to beat the fact that you can open an issue on GitHub and they will address it.

    Not much of an update, but what I am doing is letting the device run for approximately 30 hours using performance governor at different clock frequencies (switching it manually by setting the scaling_max_freq via sysfs, governor will downclock automatically):

    I think this supports the current thinking that it's related to DVFS and the actual switching. I don't know if it is a driver issue, or perhaps something like the ramp-up times for the regulators not being quite right(?). My understanding of DVFS is that you have to switch the voltage first, then bump the clock. I guess I am wondering: if you don't wait for the voltage ramp-up and adjust the clock before the voltage has stabilized, would that cause the SoC to crash (i.e. not enough voltage/current to support the higher clock rate)?
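    To make the ordering concrete, here is a small simulation of the sequence described above: voltage first (and wait for it to settle) when going up, clock first when going down. It only models the order of operations; the step names and the example frequencies/voltages are hypothetical and do not come from the kernel driver or the H6 OPP table.

```python
def dvfs_steps(cur_khz: int, cur_uv: int, new_khz: int, new_uv: int) -> list[tuple[str, int]]:
    """Return the ordered operations for one frequency/voltage transition."""
    steps: list[tuple[str, int]] = []
    if new_khz > cur_khz:
        # Scaling up: the higher clock needs the higher voltage to already be
        # stable, so raise the voltage, wait out the ramp, then raise the clock.
        if new_uv != cur_uv:
            steps.append(("set_voltage_uv", new_uv))
            steps.append(("wait_for_ramp_uv", abs(new_uv - cur_uv)))
        steps.append(("set_clock_khz", new_khz))
    else:
        # Scaling down: lower the clock first, then it is safe to drop the voltage.
        steps.append(("set_clock_khz", new_khz))
        if new_uv != cur_uv:
            steps.append(("set_voltage_uv", new_uv))
    return steps


# Hypothetical operating points, for illustration only.
for step in dvfs_steps(480_000, 820_000, 1_800_000, 1_160_000):
    print("up:  ", step)
for step in dvfs_steps(1_800_000, 1_160_000, 480_000, 820_000):
    print("down:", step)
```

    If the wait step were skipped or too short on the way up, the clock would change while the rail is still below the target voltage, which is the failure mode being asked about.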

    Speaking of the PMIC in the DT, the schematic suggests that the IRQ from AXP805 is connected to the NMI pin on the H6, which is supposed to be interrupt number 128. Is this correct for the PMIC? Or is something just lost in translation here?

    interrupt-parent = <&r_intc>;

    interrupts = <GIC_SPI 96 IRQ_TYPE_LEVEL_LOW>;

    IIRC it's possible to select opp through sysfs. If it doesn't matter which frequency is selected, as long as it's not switched, then the bug is in frequency switching code.

    Yes, with the userspace governor or by limiting the min/max of one of the other governors. I'll have to give that a shot; first I'm going to confirm the max clock is stable.

    I was also thinking along those lines, but my doubt is that I thought most of this is common code. I guess it could be an issue with the PMIC driver or the clock driver. The clock driver is going to be common to all H6 devices because, if I am not mistaken, the clock generator is part of the SoC. I am also noticing that most of the H6 devices in mainline use the same PMIC.

    It did actually freeze again. So, I am back on the stock CPU operating points and using performance governor. I'm going to give that at least a week to confirm.

    It seems like there is a lot of chasing my own tail (some of the earlier kernel crashes were likely related to playing with a DT overlay). I did try cpuburn-a53 at different clock frequencies and naturally it hit thermal throttling, so I don't know if that was very useful; I only ran it for 3 minutes per clock on all cores. The only clue that sent me down the governor/cpufreq path was disabling IOMMU and the rcu_preempt stalls between kodi and the schedutil governor. Maybe that is a red herring, although the behavior aligns with typical frequency/voltage issues for CPU cores (even though I presume other H6-based devices are OK).

    I guess if others want to test the performance governor, then the easiest thing to do is add the following to autostart.sh:

    Code
    echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
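    The one-liner above only touches cpu0's policy. On a board where all cores share one cpufreq policy that should be enough, but as a sketch for the general case, the Python below applies a governor to every policy directory it finds. The function name is made up here; the `policy*` layout under `/sys/devices/system/cpu/cpufreq` and the `scaling_available_governors` / `scaling_governor` files are the standard cpufreq sysfs entries. It needs root to write to sysfs.

```python
from pathlib import Path

CPUFREQ_ROOT = Path("/sys/devices/system/cpu/cpufreq")


def set_governor(governor: str, root: Path = CPUFREQ_ROOT) -> list[str]:
    """Write `governor` to every cpufreq policy that offers it.

    Returns the names of the policies that were changed. `root` is a
    parameter so the function can be tried against a scratch directory.
    """
    changed = []
    for policy in sorted(Path(root).glob("policy*")):
        available = (policy / "scaling_available_governors").read_text().split()
        if governor not in available:
            continue  # this policy does not offer the governor, leave it alone
        (policy / "scaling_governor").write_text(governor)
        changed.append(policy.name)
    return changed


# Example (run as root on the device):
#   print(set_governor("performance"))
```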

    So far 40 hours in and no lock up, using the schedutil governor with these two cpu opps dropped to align with the BSP. Not sure why they don't work; perhaps derived frequencies (divided from the core clock) are incompatible [i.e. timings] with those core frequencies, or it is a voltage issue.

    Code
    LibreELEC:~ # uptime; cat /sys/devices/system/cpu/cpufreq/policy*/stats/time_in_state
     09:52:47 up 1 day, 16:32,  load average: 0.25, 0.32, 0.37
    480000 13646338
    720000 153534
    816000 16741
    888000 9042
    1080000 14350
    1320000 7532
    1488000 1507
    1800000 748489
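
    To read those counters as percentages, a short Python sketch like the one below does the job. The sample is copied from the output above; the function name is made up. As far as I understand the cpufreq stats documentation, the second column is in units of 10 ms, but only the ratios matter here.

```python
def time_in_state_share(text: str) -> dict[int, float]:
    """Map each frequency (kHz) to its share of total time, in percent."""
    rows = [line.split() for line in text.strip().splitlines()]
    ticks = {int(freq): int(count) for freq, count in rows}
    total = sum(ticks.values())
    return {freq: 100.0 * count / total for freq, count in ticks.items()}


sample = """
480000 13646338
720000 153534
816000 16741
888000 9042
1080000 14350
1320000 7532
1488000 1507
1800000 748489
"""

for freq_khz, pct in time_in_state_share(sample).items():
    print(f"{freq_khz // 1000:>5} MHz  {pct:6.2f}%")
```

    For this run that works out to a bit over 93% of the time at 480 MHz and about 5% at 1800 MHz, with all the intermediate frequencies together making up the small remainder.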

    Below is the DT patch, just for OPi3 LTS:

    Here is the full set of patches I am running with on top of LE12 master.

    2ms transition latency didn't really solve it. I'm going to assume it's unstable at one of the intermediate frequencies, because 90% of the time it's at 480MHz and about 8% at 1800MHz. And we know it seems stable at 1800MHz all the time with the performance governor.

    I am assuming the voltages @ frequency are a problem. Going to drop out those two clock rates between 1400 and 1800MHz.

    This also solves the Ethernet issue. As a side note, I was testing the vanilla nightly with just the performance governor, and it was stable for three days (no issues).

    So it seems like the problem is indeed related to frequency scaling, but I am a bit surprised as this CPU is used in a number of devices. Is there a reference document that indicates the voltages @ frequencies that should be used?

    Thank you so much for your enormous efforts to track this bug down. It's so annoying to have to restart Kodi every few hours because it runs OOM. Can't wait to have a normally working instance again.

    I expect it will get worse the more Python add-ons you use, until it is fixed by the Kodi team. I was only using a few Python addons (with YouTube really being the primary one); everything else is binary addons (pvr.hts, for example), which seem unaffected by this change.

    Memory leak associated to Youtube when MPEG-DASH is enabled · Issue #520 · anxdpanic/plugin.video.youtube (github.com)


    I couldn't reproduce it with a simple requests.get loop in Python, so it's perhaps related to the embedded interpreter or Kodi built-ins bound to the embedded interpreter.

    This was the kludge test script I tried via command line:

    So there is a 60-second loop in the YouTube add-on that checks whether httpd is running, which is enabled if you use MPEG-DASH or the API configuration page. I typically use MPEG-DASH; I turned that off in settings and the leak is gone. Turned it back on (w/ an enable/disable toggle of the YouTube addon) and the leak returns.

    I suspect it is perhaps related to the act of "pinging the httpd server" that is causing the problem, the loop does two things:

    1. uses requests to do an HTTP GET to the internal http service running for the YouTube addon

    2. restarts the httpd service when the request fails (response is not 204)

    I really doubt it's #2, because I don't think YouTube would work at all in that case. IIRC ISA uses httpd to communicate back to the requesting player for some stuff (I remember that from the Netflix addon). Is the requests library perhaps leaking some sort of resource?
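    As a rough way to reason about the two steps, the sketch below models one pass of such a check plus a helper that measures Python-level allocation growth over many passes with `tracemalloc`. This is not the add-on's code: `check_httpd`, `traced_growth` and the injected `get_status`/`restart` callables are made up for illustration. Note that `tracemalloc` only sees allocations made through Python's allocator, so a leak inside Kodi's embedded interpreter or native bindings might not show up this way, which would be consistent with the leak not reproducing from the command line.

```python
import tracemalloc
from typing import Callable


def check_httpd(get_status: Callable[[], int], restart: Callable[[], None]) -> bool:
    """One pass of the check: ping the service, restart it unless it answers 204."""
    try:
        healthy = get_status() == 204
    except OSError:
        # Treat a connection failure the same as a bad status code.
        healthy = False
    if not healthy:
        restart()
    return healthy


def traced_growth(fn: Callable[[], object], iterations: int) -> int:
    """Bytes of traced Python memory gained while calling fn repeatedly."""
    tracemalloc.start()
    fn()  # warm-up call so one-time setup is not counted as growth
    before, _peak = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        fn()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


# Stand-in ping that always reports 204; swap in a real HTTP GET to test a service.
growth = traced_growth(lambda: check_httpd(lambda: 204, lambda: None), 1000)
print("traced growth in bytes:", growth)
```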