Possible SSD issues when used as a PVR

  • We have a couple of Intel-based NUCs (with a Celeron J4125 and 8GB of RAM) and a 1TB Kingspec SSD.

    We started using these on LibreELEC 10 with tvheadend, and they are primarily used as DVRs with two dual USB tuners.

    Over time we have been seeing some of these machines have issues where the drive goes into read-only mode, and I have attached some of the dmesg log from when that happened.

    Here is where it gets weird: at that point, upon a restart, one of two things can happen.

    1. A reboot and the machine starts up as normal.

    2. A reboot and the machine goes to the BIOS screen as it cannot see the SSD as a bootable drive.

    At this point, what we have found is that if we unplug the power from the machine and leave it off for at least an hour or so, it will boot OK. Sometimes it needs to be left without power for longer, but almost every time, leaving it unplugged for a period of time lets the machine boot normally.

    The thing is it hasn't happened on all the machines and it's very random. If anyone has any ideas it would be greatly appreciated.

  • SSDs fail differently from mechanical HDDs. The management firmware can persistently mark memory areas as bad. During the current boot, cells can go bad and cause a cascade of problems; on reboot those areas are marked bad, so capacity is reduced but the problems are avoided, until more cells go bad. In my experience, once cells start failing, reliability only heads downwards.

    LE automatically runs fsck on any filesystems marked dirty at boot, so if a drive (and thus its partitions and filesystems) had issues, we attempt a cleanup, and this might keep things working. In this respect, our distro packaging into two files (KERNEL and SYSTEM) means that as long as those files are intact, a reboot often gives a "clean" start; but equally, if those files are damaged, you have a total boot failure.

    Kingspec are a budget SSD manufacturer, so I would have lower expectations of drive lifespan compared to e.g. Samsung EVO 'pro' drives, and I would expect less-developed firmware, which increases the probability of low-level issues where the entire drive behaves badly or has problematic interactions with the BIOS etc.

    To me, the log looks like a dying drive. Make sure you have a backup of any important config.

  • Thank you for the info, we will definitely keep investigating.

    One question I still have (and I don't know if you have any ideas at all, but it's worth a shot) is this: more than once we have seen a reboot leave the drive unbootable, and it appears not to show up in the BIOS, but after unplugging the computer from power completely and letting it cold soak for at least an hour, it will then boot up as normal and work until the next failure.

    The big question I am trying to work out is why a reboot would not work, but a cold soak would get it back up and running.

  • Have you ever checked the SMART values of the drives?

    If there are SMART values, you can check the total bytes written. There are plenty of calculators on the web.

    In case wear-out is the problem, you might want to choose more reliable drives, as chewitt recommended: Samsung PM series or even Evo Pro. I use Micron 5300/5400. Or any other brand where you have the option of overprovisioning. The internal processes of SSDs (TRIM, garbage collection, and re-copying of data) are implemented very differently depending on the manufacturer. You might know that, unlike HDDs, SSDs aren't able to overwrite data blocks in place. An SSD always has to erase a block before it can be written again. This causes write overhead, which hurts overall endurance. Some manufacturers handle this better than others....

    However, this might not be the case for streamed data. The more you stream sequentially, the better it is in terms of endurance.

    What Kingspec model do you use? Are there TBW (terabytes written) or DWPD (drive writes per day) values available? Quite often the entry-level drives don't even mention these values....
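
    If a datasheet gives only one of the two ratings, the other can be estimated with the usual conversion. This is a rough sketch: the symbols C (capacity in TB) and Y (warranty period in years) are my own labels, and the example numbers are hypothetical, not taken from any Kingspec datasheet.

    ```latex
    \[
      \mathrm{DWPD} = \frac{\mathrm{TBW}}{C \times 365 \times Y}
    \]
    % Hypothetical example: a 1 TB drive rated for 600 TBW over 5 years
    \[
      \mathrm{DWPD} = \frac{600}{1 \times 365 \times 5} \approx 0.33
    \]
    ```

    Comparing the total bytes written reported by SMART against the TBW rating gives a similar rough idea of how much of the rated endurance has been used.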

    For example, a 128GB drive has 0% overprovisioning, so there is no room for re-copying data.

    A 120GB drive has around 7% overprovisioning, so there is some room for re-copying data and internal processes.

    The more OP, the more reliable the drive. The more often you write to the NAND, the earlier it wears out.
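
    The 0% and 7% figures above follow from the usual overprovisioning formula, assuming both drives carry 128 GB of raw NAND (an assumption for illustration; the actual raw capacity varies by model and is rarely published for entry-level drives):

    ```latex
    \[
      \mathrm{OP} = \frac{C_{\mathrm{raw}} - C_{\mathrm{user}}}{C_{\mathrm{user}}} \times 100\%
    \]
    % 120 GB exposed out of an assumed 128 GB raw:
    \[
      \frac{128 - 120}{120} \times 100\% \approx 6.7\% \approx 7\%
    \]
    % 128 GB exposed out of an assumed 128 GB raw:
    \[
      \frac{128 - 128}{128} \times 100\% = 0\%
    \]
    ```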

    hope this helps


    Maybe the Kingspec drives get stuck in their internal processes at a certain point, and unplugging the power causes a reset. That is a bit of a wild guess without knowing the wear-out level, but it could be the reason.

    Never use entry-level drives for valuable data.

    Edited 2 times, last by Bub4: A post by Bub4 was merged into this post. (May 16, 2023 at 1:37 PM).

  • We are still investigating this one, and I am reading up on a lot of different issues surrounding SSD use.

    From what I can gather, TRIM is enabled by default in LibreELEC, but I can't seem to find out whether fstrim needs to be run periodically or not. Does this need to be set up to aid drive performance and reliability over time? Keep in mind that, as a PVR, decently sized video files are regularly being 'recorded' onto it and then deleted once watched.

    The general use case is to have this PC powered on 24x7, so it doesn't get rebooted.

  • TRIM and garbage collection usually run in the background when the drive is idle. No manual action is necessary.

    As prices are currently at an all-time low, I'd recommend buying a Micron 5400 or Samsung PM893.

    These are enterprise level drives.

    Prices are likely to increase in the coming quarter...

    cheers

  • In general it's a bad idea to trim on each block free (i.e. using the discard mount option) as that leads to write amplification and will wear out SSDs quickly.

    It's better to regularly run "fstrim -a", typically once a week, via a systemd timer.

    LE doesn't ship fstrim.service/fstrim.timer systemd files, but you can easily add them to /storage/.config/system.d/ and then enable the timer with "systemctl enable fstrim.timer".

    Untested as I don't have any SSDs on my LE devices here, but these should work:

    fstrim.service:

    [Unit]
    Description=Discard unused blocks on filesystems
    
    [Service]
    Type=oneshot
    ExecStart=/usr/sbin/fstrim --all --verbose --quiet-unsupported

    fstrim.timer:
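
    A minimal sketch of a matching timer unit, assuming the weekly schedule suggested above. It is modeled on the stock fstrim.timer shipped with util-linux rather than taken from this post, and it is equally untested on LE:

    ```ini
    [Unit]
    Description=Discard unused blocks once a week

    [Timer]
    # Run once a week; a timer activates the service of the same name
    # (fstrim.service) unless Unit= says otherwise.
    OnCalendar=weekly
    # Allow systemd to coalesce the wake-up within a one-hour window.
    AccuracySec=1h
    # If a scheduled run was missed while the machine was off,
    # run it shortly after the next boot.
    Persistent=true

    [Install]
    WantedBy=timers.target
    ```

    Once the timer is enabled and started, "systemctl list-timers" should show when the next run is due.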

    so long,

    Hias