Crypto performance on SoCs - help me interpret

  • Hi everyone,

    First, a bit of preface so you can understand why I do what I do. You can skip to after the first horizontal line if you aren't interested in that.

    For many months, I was running a RaPi 3. It ran fine for the most part, but some media files would very occasionally stutter. I checked everything, from limitations of my server (HDD, etc.) to bad network connections - all for nothing. Iperf told me that the raw network connection between my Pi and the server was absolutely fine, too. So what gives? I was using SFTP for the connection. I changed it to NFS - and all the problems were gone. After some more digging around I found the culprit: when streaming the media files (or running an scp command on the Pi, for that matter), one of the RaPi cores went to 100% while the others kept idling. So it was obvious that the RaPi could not deliver the CPU performance to handle the crypto overhead. Right?

    Wrong - partially. Introducing the Rock64. I got that device a few months ago as a replacement for my RaPi. At this point, a big shout-out to everyone involved in getting the RK3328 to work so well with LibreELEC! It works great as a daily driver with Kwiboo's changes. However, the stuttering with SFTP was still there - bummer. Same behavior as on the RaPi, with one core maxed out while the others idle.


    After hours of searching on the net and not finding anything useful, I was ready to give up. But then I found two interesting things. First, I stumbled across the HPN-SSH project. I had read about it before, and I thought it would be a good way to get my feet wet with the LL build system. In short - I got it to work with the kitchen sink patch, and was able to try it out. I checked all 4 combinations of HPN and regular server and client. As you can see in the following table, there were some speed improvements, but nothing big.

    Quick preamble on how the tests were conducted:

    Both servers were identical Debian 9 VMs with OpenSSH 7.4p1 built from source, one with and one without the HPN kitchen sink patch applied. The data rate was read from an scp call on the Rock64 that copied to /dev/null. All HPN-related tests were conducted with commit 3e3cc3d of the original LL repo (not Kwiboo's branch!).
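The measurement described above could be scripted roughly like this. It is only a sketch: the helper names `rate_mb_s` and `bench_cipher`, the host and the file path are placeholders that were not part of the original tests, 1 MB is assumed to be 1048576 bytes, and the whole transfer is timed instead of reading the rate from scp's progress meter.

```shell
# Sketch only: helper names, host and path are placeholders.

# rate_mb_s BYTES SECONDS -> throughput in MB/s (1 MB = 1048576 bytes, assumed)
rate_mb_s() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / 1048576 / s }'
}

# bench_cipher CIPHER USER@HOST REMOTE_FILE
# Copies the remote file to /dev/null with one cipher forced and prints the rate.
bench_cipher() {
    size=$(ssh "$2" "wc -c < '$3'") || return 1
    start=$(date +%s)
    scp -q -c "$1" "$2:$3" /dev/null || return 1
    end=$(date +%s)
    elapsed=$((end - start))
    [ "$elapsed" -gt 0 ] || elapsed=1   # avoid dividing by zero on tiny files
    printf '%s %s MB/s\n' "$1" "$(rate_mb_s "$size" "$elapsed")"
}

# Example (not run here): loop over every cipher the local client offers.
# for c in $(ssh -Q cipher); do bench_cipher "$c" user@server /path/to/testfile; done
```

Because `date +%s` only has one-second resolution, a test file large enough to take at least a few tens of seconds keeps the rounding error small.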

    Cipher                          HPN server   HPN server       Regular server   Regular server
                                    HPN client   Regular client   HPN client       Regular client
    chacha20-poly1305@openssh.com   29.5MB/s     30.2MB/s         33.9MB/s         34.1MB/s
    aes128-ctr                      48.6MB/s     53.5MB/s         60.7MB/s         55.8MB/s
    [email protected]               64.5MB/s     60.2MB/s         65.5MB/s         62.2MB/s
    aes128-cbc                      74.4MB/s     64.7MB/s         76.3MB/s         72.4MB/s
    aes192-cbc                      73.0MB/s     60.1MB/s         70.3MB/s         60.4MB/s
    aes256-cbc                      56.5MB/s     57.3MB/s         65.3MB/s         54.5MB/s

    The numbers are a bit all over the place, but I deduced two things from them. First, for whatever reason, the HPN server seems a bit slower overall. Second - funnily enough - the fastest configuration is the HPN-patched client with a non-HPN server. What's odd is the HPN server + HPN client configuration, where some ciphers are faster and some are slower than their HPN server + regular client counterparts.

    These numbers underline my original problem very well, by the way: my favorite media file for testing (which triggers the stuttering right at the opening logo) peaks at a data rate of around 60MB/s - hence the stuttering. From the numbers, it seems the sftp connection is negotiated with chacha20-poly1305, as the other ciphers "should" be fast enough.
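Which cipher a connection really negotiates can be read from the client's debug output instead of being inferred from the data rate. A minimal sketch, assuming the `debug1: kex: server->client cipher: ...` line format printed by OpenSSH 7.x clients; `negotiated_cipher` is a made-up helper name and `user@server` is a placeholder.

```shell
# negotiated_cipher: read `ssh -v` debug output on stdin and print the
# server->client cipher (the direction a download uses).
# Assumes the OpenSSH 7.x wording of the kex debug line.
negotiated_cipher() {
    sed -n 's/^debug1: kex: server->client cipher: \([^ ]*\).*/\1/p' | head -n 1
}

# Example (not run here):
# ssh -v user@server true 2>&1 | negotiated_cipher
#
# Forcing another cipher for a test transfer:
# sftp -c aes128-ctr user@server
```

If the helper prints chacha20-poly1305@openssh.com, forcing one of the AES ciphers with `-c` (or a `Ciphers` line in the client config) would be a quick way to test the assumption above.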


    So, after the HPN tests were pretty much fruitless, I was looking for other options to get the load off that single core. And this is how I read about cryptodev on some forum (can't remember where). From what I could piece together from various forum posts, the architecture of both the RaPi 3 and the Rock64 (ARMv8) should support some kind of crypto operations, which should make it all much faster - provided you can access it. So I set out to integrate cryptodev into LL. I got it kinda working after two days, to... mixed results. Hence this thread.


    First, I was able to verify both that the cryptodev kernel module was loaded and that openssl was built correctly with cryptodev support:

    Code
    $ modinfo cryptodev
    filename:       /lib/modules/4.4.114/cryptodev-linux/cryptodev.ko
    license:        GPL
    description:    CryptoDev driver
    author:         Nikos Mavrogiannopoulos <[email protected]>
    depends:
    vermagic:       4.4.114 SMP mod_unload aarch64
    parm:           cryptodev_verbosity:0: normal, 1: verbose, 2: debug (int)
    Code
    $ openssl engine -t
    (cryptodev) BSD cryptodev engine
         [ available ]
    (dynamic) Dynamic engine loading support
         [ unavailable ]

    Next, I ran some openssl speed tests. The numbers are collected in the following table for an easier overview.

    cryptodev enabled   type          16 bytes     64 bytes     256 bytes    1024 bytes    8192 bytes
    no                  aes-128-cbc   137789.31k   388067.80k   698093.23k   892886.70k    971374.59k
    yes                 aes-128-cbc   56166.71k    194606.08k   921446.40k   4646713.60k   24582963.20k
    no                  sha1          15756.34k    57921.41k    179581.53k   377012.91k    557703.17k
    yes                 sha1          4224.80k     19347.53k    67981.19k    331761.66k    2233042.82k

    So, what do these numbers tell me? I was already aware that smaller block sizes tend to be slower with cryptodev. Fair enough - bigger block sizes see dramatic improvements though. So, any changes in the openssh speeds? Yes - data rates were actually slower. But not by much, so it might be within the margin of error.
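To make the table easier to read, the ratio between the cryptodev and the non-cryptodev run can be computed per block size. The sketch below is only an illustration; `speedup` is a made-up helper that takes two result lines in the layout `openssl speed` prints.

```shell
# speedup BASE_LINE ENGINE_LINE
# Each argument is one result line as printed by `openssl speed`, e.g.
# "aes-128-cbc 137789.31k 388067.80k ...". Prints ENGINE/BASE per column;
# awk's numeric conversion ignores the trailing "k".
speedup() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= NF; i++) base[i] = $i + 0 }
        NR == 2 {
            sep = ""
            for (i = 2; i <= NF; i++) {
                printf "%s%.2f", sep, ($i + 0) / base[i]
                sep = " "
            }
            print ""
        }'
}

# Example with the aes-128-cbc rows from the table:
# speedup "aes-128-cbc 137789.31k 388067.80k 698093.23k 892886.70k 971374.59k" \
#         "aes-128-cbc 56166.71k 194606.08k 921446.40k 4646713.60k 24582963.20k"
```

When collecting such lines, it may be worth adding `-elapsed` to the `openssl speed` call: by default the tool divides by the CPU time the process itself consumed, while `-elapsed` uses wall-clock time. With an engine that hands the work to the kernel, the two can differ a lot, which could be one reason for figures like the ones in the 8192-byte column.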

    >> The big question now - what is happening here? Integrating cryptodev certainly did something, but not what I expected or hoped. <<


    So, all tests so far have been done with the LL master at 3e3cc3d. After some more investigation, I figured that maybe something else had to be enabled on the kernel side of things. This is where Kwiboo's fork came into play, primarily because it uses an updated Rockchip kernel, which (among other things) exposes a new CONFIG_CRYPTO_DEV_ROCKCHIP option. After some googling, it seemed that the RK3328 includes its own crypto engine, which might just not be accessible with my previous setup yet (falling back to the ARMv8 internal thingy?). So, compile time again....

    So, first things first. I built the current Kwiboo part-6 branch and integrated cryptodev. Openssl speed tests - pretty much no difference. I also verified that rk_crypto was not loaded yet. So, next: a build with CONFIG_CRYPTO_DEV_ROCKCHIP enabled. However, this did nothing. rk_crypto was not loaded, and consequently there was no change in openssl performance. I am still looking into why that is, but currently I'm out of ideas.

    So, this is the second topic to discuss - Just what am I seeing here? Does cryptodev not pick up on rk_crypto? Or does it maybe not even support that? I'm totally out of my league here :-/
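One way to narrow this down is to look at which implementations the kernel has registered for each algorithm, since /proc/crypto lists them together with their priority. A minimal sketch, assuming the usual `key : value` layout of that file; `crypto_drivers` is a made-up helper name, and the driver name in the comment is only an example of what an entry can look like.

```shell
# crypto_drivers: read /proc/crypto-style text on stdin and print
# "algorithm driver priority" for every registered implementation.
# Example entry (illustrative):
#   name         : cbc(aes)
#   driver       : cbc-aes-ce
#   priority     : 300
crypto_drivers() {
    awk -F ' *: *' '
        $1 == "name"     { name = $2 }
        $1 == "driver"   { driver = $2 }
        $1 == "priority" { print name, driver, $2 }'
}

# Example (not run here):
# crypto_drivers < /proc/crypto | grep aes
# lsmod | grep rk_crypto
# zcat /proc/config.gz | grep CRYPTO_DEV_ROCKCHIP   # needs CONFIG_IKCONFIG_PROC
```

If no Rockchip-specific driver shows up in that list, anything that goes through the kernel crypto API would presumably end up on a generic or CPU-instruction-based implementation instead.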


    So, massive text wall. I would be glad for any input to sort this out, and interpret the numbers I am seeing here.


    EDIT 1:

    Well, I found out something neat while looking at the rockchip kernel config. The kernel can be compiled as aarch64 for supported cpus - such as the aforementioned RK3328 I'm using. So - did a recompile of LL with ARCH=aarch64 (instead of ARCH=arm, as mentioned in the readme), and bam, performance increased significantly. Openssl speed test saw performance improvements both with and without cryptodev. Transfer speeds for sftp also increased (for the most part - have yet to do a proper test sweep - So no numbers yet).

    I will do more tests tomorrow, and see if I can get some solid numbers. Also, I don't know if this radical performance change is just because of the aarch64 mode, or because of the additional "crypto" cpu flag in LL for RK3328 with aarch64. I will investigate.
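To see whether the running kernel reports the ARMv8 crypto instructions to userspace at all, the Features line in /proc/cpuinfo can be checked for entries such as `aes`, `pmull`, `sha1` and `sha2`. A small sketch; `has_cpu_feature` is a made-up helper name.

```shell
# has_cpu_feature FEATURE
# Succeeds if a "Features" line read from stdin lists FEATURE as a whole word.
has_cpu_feature() {
    grep -i '^Features' | tr -s ' \t' '\n\n' | grep -qx "$1"
}

# Example (not run here):
# for f in aes pmull sha1 sha2; do
#     has_cpu_feature "$f" < /proc/cpuinfo && echo "$f: yes" || echo "$f: no"
# done
```

Comparing the result between the ARCH=arm and ARCH=aarch64 builds might help separate the effect of the 64-bit mode from the effect of the "crypto" flag.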

    Edited once, last by Kernle32DLL (June 30, 2018 at 11:16 PM).

    • Official Post

    It's not a great surprise that crypto functions run better as 64-bit code on a 64-bit processor but you may hit two challenges running aarch64 LE images. The first is there are no aarch64 binary addons in our repo; because there are no official LE 9.0 aarch64 images. The second is addons requiring libwidevine (e.g. Netflix/Amazon) will not work as there are no 64-bit 'arm' versions of the lib in circulation to use. Being able to access DRM protected services is a hallmark feature/capability for Kodi 18 so this is the primary reason LE moved from aarch64 in LE 8.x releases to a split 64-bit kernel and 32-bit userspace arrangement (the same as Android) for LE 9.0.

    Questions on RK crypto things might find an audience in the #linux-rockchip channel on IRC, or perhaps their GitHub issue tracker:

    Issues · rockchip-linux/kernel · GitHub

  • It's not a great surprise that crypto functions run better as 64-bit code on a 64-bit processor but you may hit two challenges running aarch64 LE images. The first is there are no aarch64 binary addons in our repo; because there are no official LE 9.0 aarch64 images. The second is addons requiring libwidevine (e.g. Netflix/Amazon) will not work as there are no 64-bit 'arm' versions of the lib in circulation to use. Being able to access DRM protected services is a hallmark feature/capability for Kodi 18 so this is the primary reason LE moved from aarch64 in LE 8.x releases to a split 64-bit kernel and 32-bit userspace arrangement (the same as Android) for LE 9.0

    Darn, I totally forgot that. Makes sense. So no aarch64 then.

    But yeah, I will move the Rock64-specific crypto discussion (e.g. rk_crypto) somewhere else, as my intent for creating this thread was - for the most part - to understand what cryptodev does in this particular LL setup (is that an ARM thing? Rockchip-specific? Or something else?).