Hi everyone,
first, a bit of preface so you can understand why I do what I do. You can skip to after the first horizontal line, if you aren't interested in that.
For many months, I was running a RaPi 3. It was running fine for the most part, but some media files would very occasionally stutter. I checked everything, from limitations of my server (HDD, etc.) to bad network connections - all for nothing. Iperf told me that the raw network connection between my Pi and the server was absolutely fine, too. So what gives? I was using SFTP for the connection. I changed it to NFS - and all the problems were gone. After some more digging around I found the culprit: when streaming the media files (or running an scp command on the Pi, for that matter), one of the RaPi cores went to 100% while the others kept idling. So it was obvious that the RaPi could not deliver the CPU performance to handle the crypto overhead. Right?
Wrong - partially. Introducing the Rock64. I got that device a few months ago as a replacement for my RaPi. At this point, a big shout-out to everyone involved in getting the RK3328 to work so well with LibreELEC! It works great as a daily driver with Kwiboo's changes. However, the stuttering with SFTP was still there - bummer. Same behavior as on the RaPi: one core maxed out while the others idle around.
After hours of searching the net and not finding anything useful, I was ready to give up. But then I found two interesting things. First, I stumbled across the HPN-SSH project. I had read about it before, and I thought it would be a good way to get my feet wet with the LL build system. In short - I got it to work with the kitchen sink patch, and was able to try it out. I checked all 4 combinations of HPN and regular server and client. As you can see in the following table, there were some speed improvements, but nothing big.
Quick preamble on how the tests were conducted:
Both servers were identical Debian 9 VMs with OpenSSH 7.4p1 built from source, one with and one without the HPN kitchen sink patch applied. The data rate was read from an scp call on the Rock64 to /dev/null. All HPN-related tests were conducted with commit 3e3cc3d of the original LL repo (not Kwiboo's branch!).
| Cipher | HPN server, HPN client | HPN server, regular client | Regular server, HPN client | Regular server, regular client |
|---|---|---|---|---|
| chacha20-poly1305@openssh.com | 29.5MB/s | 30.2MB/s | 33.9MB/s | 34.1MB/s |
| aes128-ctr | 48.6MB/s | 53.5MB/s | 60.7MB/s | 55.8MB/s |
| [email protected] | 64.5MB/s | 60.2MB/s | 65.5MB/s | 62.2MB/s |
| aes128-cbc | 74.4MB/s | 64.7MB/s | 76.3MB/s | 72.4MB/s |
| aes192-cbc | 73.0MB/s | 60.1MB/s | 70.3MB/s | 60.4MB/s |
| aes256-cbc | 56.5MB/s | 57.3MB/s | 65.3MB/s | 54.5MB/s |
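For anyone who wants to repeat this kind of sweep, here is a rough sketch of how the scp calls could be generated, one per cipher. The host name, the remote path and the cipher list are placeholders I made up for illustration; only `scp -c <cipher>` and the copy to /dev/null come from the setup described above.

```python
import shlex

# Ciphers to sweep. Placeholder list: adjust it to what `ssh -Q cipher`
# reports on your own client.
CIPHERS = ["aes128-ctr", "aes128-cbc", "aes192-cbc", "aes256-cbc"]


def scp_command(cipher, remote_file, host="user@server"):
    """Build one scp call that pulls remote_file and discards it locally."""
    # host and remote_file are hypothetical values, not from the original tests.
    return ["scp", "-c", cipher, f"{host}:{remote_file}", "/dev/null"]


def sweep_commands(remote_file, host="user@server", ciphers=CIPHERS):
    """Return one shell-ready command line per cipher."""
    return [shlex.join(scp_command(c, remote_file, host)) for c in ciphers]


# Print the commands instead of running them, so nothing touches the network.
for line in sweep_commands("/srv/media/test.mkv"):
    print(line)
```

Printing the commands rather than executing them keeps the sketch harmless; the data rate still has to be read from scp's own progress output when the commands are run by hand.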
The numbers are a bit all over the place, but I deduced two things from them. First, for whatever reason, the HPN server seems a bit slower overall. Second, funnily enough, the fastest configuration for most ciphers is the HPN-patched client with a non-HPN server. What's odd is the HPN server + HPN client configuration, where some ciphers are faster and some are slower than their HPN server + regular client counterparts.
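To make those two observations a bit less hand-wavy, the table can be reduced to a mean per configuration and a count of how often each configuration is the fastest. This is just arithmetic on the numbers above; the configuration labels are shorthand I picked for this sketch, and the two @openssh.com ciphers are labeled as far as their names can be recovered.

```python
from statistics import mean

# Column order: server/client, matching the table.
CONFIGS = ["hpn/hpn", "hpn/regular", "regular/hpn", "regular/regular"]

# Throughput in MB/s, copied from the table, one row per cipher.
ROWS = {
    # First @openssh.com cipher; assumed to be chacha20-poly1305 from the text.
    "chacha20-poly1305@openssh.com": (29.5, 30.2, 33.9, 34.1),
    "aes128-ctr": (48.6, 53.5, 60.7, 55.8),
    # Second @openssh.com cipher; its name is not recoverable here.
    "second @openssh.com cipher": (64.5, 60.2, 65.5, 62.2),
    "aes128-cbc": (74.4, 64.7, 76.3, 72.4),
    "aes192-cbc": (73.0, 60.1, 70.3, 60.4),
    "aes256-cbc": (56.5, 57.3, 65.3, 54.5),
}


def mean_per_config(rows):
    """Average throughput of each server/client combination over all ciphers."""
    columns = zip(*rows.values())
    return {name: round(mean(col), 2) for name, col in zip(CONFIGS, columns)}


def wins_per_config(rows):
    """Count for how many ciphers each combination is the fastest one."""
    wins = dict.fromkeys(CONFIGS, 0)
    for values in rows.values():
        wins[CONFIGS[values.index(max(values))]] += 1
    return wins


print(mean_per_config(ROWS))
print(wins_per_config(ROWS))
```

By this count the regular server with the HPN client averages about 62 MB/s and is fastest for four of the six ciphers, while both HPN server columns average lower than their regular server counterparts.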
These numbers underline my original problem very well, by the way: my favorite media file for testing (which triggers the stuttering right at the opening logo) peaks at a data rate of around 60MB/s - hence the stuttering. From the numbers, it seems the SFTP connection is negotiated with chacha20-poly1305, as the other ciphers "should" be fast enough.
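Rather than guessing the negotiated cipher from the data rates, it can be read from the debug output that `ssh -v` or `sftp -v` writes to stderr. Below is a small sketch that pulls the cipher out of such output. The sample text is illustrative and was not captured from my setup, and the exact wording of the debug line may differ between OpenSSH versions.

```python
import re

# Matches the key-exchange summary lines printed in verbose mode.
KEX_LINE = re.compile(r"kex: (server->client|client->server) cipher: (\S+)")


def negotiated_ciphers(debug_output):
    """Map each direction to the cipher named in the debug output."""
    return {m.group(1): m.group(2) for m in KEX_LINE.finditer(debug_output)}


# Illustrative sample, not taken from the tests in this post.
SAMPLE = """\
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
"""

print(negotiated_ciphers(SAMPLE))
```

If the output confirms chacha20-poly1305, forcing a different cipher on the client side would be the obvious next thing to try.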
So, after the HPN tests turned out pretty much fruitless, I looked for other options to take the load off that single core. This is how I read about cryptodev on some forum (can't remember where). From what I could piece together from various forum posts, the architecture of both the RaPi 3 and the Rock64 (ARMv8) should support some kind of accelerated crypto operations, which should make it all much faster - provided you can access them. So I set out to integrate cryptodev into LL. I got it kinda working after two days, to... mixed results. Hence this thread.
First, I was able to verify that the cryptodev kernel module was indeed loaded and that openssl was built correctly with cryptodev support:
$ modinfo cryptodev
filename: /lib/modules/4.4.114/cryptodev-linux/cryptodev.ko
license: GPL
description: CryptoDev driver
author: Nikos Mavrogiannopoulos <[email protected]>
depends:
vermagic: 4.4.114 SMP mod_unload aarch64
parm: cryptodev_verbosity:0: normal, 1: verbose, 2: debug (int)
$ openssl engine -t
(cryptodev) BSD cryptodev engine
[ available ]
(dynamic) Dynamic engine loading support
[ unavailable ]
Next, I did some openssl speed tests. Here is one exemplary call to give you an idea - the other numbers are collected in the table that follows for an easier overview.
$ openssl speed -evp aes-128-cbc -engine cryptodev
engine "cryptodev" set.
Doing aes-128-cbc for 3s on 16 size blocks: 1088230 aes-128-cbc's in 0.31s
Doing aes-128-cbc for 3s on 64 size blocks: 1064252 aes-128-cbc's in 0.35s
Doing aes-128-cbc for 3s on 256 size blocks: 971838 aes-128-cbc's in 0.27s
Doing aes-128-cbc for 3s on 1024 size blocks: 726049 aes-128-cbc's in 0.16s
Doing aes-128-cbc for 3s on 8192 size blocks: 180051 aes-128-cbc's in 0.06s
OpenSSL 1.0.2o 27 Mar 2018
built on: reproducible build, date unspecified
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
compiler: /LibreELEC.tv/build.LibreELEC-RK3328.arm-9.0-devel/toolchain/bin/armv8a-libreelec-linux-gnueabi-gcc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS -march=armv8-a+crc -mtune=cortex-a53 -mabi=aapcs-linux -Wno-psabi -Wa,-mno-warn-deprecated -mcpu=cortex-a53 -mfloat-abi=hard -mfpu=crypto-neon-fp-armv8 -fomit-frame-pointer -Wall -pipe -Os -march=armv8-a+crc -mtune=cortex-a53 -fuse-ld=gold -O3 -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 56166.71k 194606.08k 921446.40k 4646713.60k 24582963.20k
| cryptodev enabled | type | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes |
|---|---|---|---|---|---|---|
| no | aes-128-cbc | 137789.31k | 388067.80k | 698093.23k | 892886.70k | 971374.59k |
| yes | aes-128-cbc | 56166.71k | 194606.08k | 921446.40k | 4646713.60k | 24582963.20k |
| no | sha1 | 15756.34k | 57921.41k | 179581.53k | 377012.91k | 557703.17k |
| yes | sha1 | 4224.80k | 19347.53k | 67981.19k | 331761.66k | 2233042.82k |
So, what do these numbers tell me? I was already aware that smaller block sizes tend to be slower with cryptodev. Fair enough - bigger block sizes see dramatic improvements, though. So, any changes in the OpenSSH speeds? Yes - data rates were actually slower. But not by much, so it might be within the margin of error.
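One caveat on reading the openssl speed output: unless `-elapsed` is given, openssl speed divides by CPU user time, not wall-clock time. In the log above each run is announced as "for 3s" but accounted as 0.06s to 0.31s, presumably because the work done inside the kernel is not counted as user time. The sketch below recomputes the reported figures from the log lines and, for comparison, the rate over the nominal 3 seconds. Treating the 3s interval as wall-clock time is my assumption here.

```python
import re

# The five result lines from the cryptodev run shown earlier.
LOG = """\
Doing aes-128-cbc for 3s on 16 size blocks: 1088230 aes-128-cbc's in 0.31s
Doing aes-128-cbc for 3s on 64 size blocks: 1064252 aes-128-cbc's in 0.35s
Doing aes-128-cbc for 3s on 256 size blocks: 971838 aes-128-cbc's in 0.27s
Doing aes-128-cbc for 3s on 1024 size blocks: 726049 aes-128-cbc's in 0.16s
Doing aes-128-cbc for 3s on 8192 size blocks: 180051 aes-128-cbc's in 0.06s
"""

LINE = re.compile(r"for (\d+)s on (\d+) size blocks: (\d+) \S+ in ([\d.]+)s")


def throughput(log):
    """Return (block size, k/s by reported time, k/s by nominal interval)."""
    rows = []
    for interval, size, count, reported in LINE.findall(log):
        total_bytes = int(count) * int(size)
        by_reported = total_bytes / float(reported) / 1000
        by_interval = total_bytes / int(interval) / 1000
        rows.append((int(size), by_reported, by_interval))
    return rows


for size, by_reported, by_interval in throughput(LOG):
    print(f"{size:5d} bytes: {by_reported:14.2f}k reported, {by_interval:12.2f}k over 3s")
```

The first column reproduces the figures printed by openssl (56166.71k for 16-byte blocks, for example). Measured over the full interval, the 8192-byte case comes out at roughly 491659k, which is below the 971374.59k of the run without cryptodev and would fit the slightly slower OpenSSH transfers. Rerunning the tests with `-elapsed` should show whether this reading is right.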
>> The big question now - what is happening here? Integrating cryptodev certainly did something, but not what I expected or hoped. <<
So, all tests so far have been done with the LL master at 3e3cc3d. After some more investigation, I figured that maybe something else had to be enabled on the kernel side of things. This is where Kwiboo's fork came into play, primarily because it uses an updated Rockchip kernel, which (among other things) exposes a new CONFIG_CRYPTO_DEV_ROCKCHIP option. After some googling, it seemed that the RK3328 includes its own crypto engine, which might just not be accessible with my previous setup yet (falling back to the ARMv8 internal thingy?). So, compile time again....
So, first things first. I built the current Kwiboo part-6 branch and integrated cryptodev. Openssl speed tests - pretty much no difference. I also verified that rk_crypto was not loaded yet. So, next: a build with CONFIG_CRYPTO_DEV_ROCKCHIP enabled. However, this did nothing. rk_crypto was not loaded, and subsequently there was no change in openssl performance. I am still looking into why that is, but currently I'm out of ideas.
So, this is the second topic to discuss - just what am I seeing here? Does cryptodev not pick up rk_crypto? Or does it maybe not even support it? I'm totally out of my league here.
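One way to narrow this down might be /proc/crypto, which lists every implementation registered with the kernel crypto API together with its driver name and priority. As far as I understand it, cryptodev requests algorithms by name through that API, and the kernel then hands out the implementation with the highest priority. The sketch below parses that file format and picks the preferred driver for an algorithm. The two sample entries and their driver names are made up for illustration and are not taken from my board.

```python
def parse_proc_crypto(text):
    """Split /proc/crypto-style text into one dict per registered algorithm."""
    entries = []
    for block in text.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                entry[key.strip()] = value.strip()
        if entry:
            entries.append(entry)
    return entries


def preferred_driver(entries, name):
    """Return the driver with the highest priority for an algorithm name."""
    candidates = [e for e in entries if e.get("name") == name]
    if not candidates:
        return None
    return max(candidates, key=lambda e: int(e.get("priority", 0)))["driver"]


# Hypothetical sample; on a real system read the text from /proc/crypto.
SAMPLE = """\
name         : cbc(aes)
driver       : cbc-aes-generic-example
module       : kernel
priority     : 100

name         : cbc(aes)
driver       : cbc-aes-hw-example
module       : example_hw_crypto
priority     : 300
"""

print(preferred_driver(parse_proc_crypto(SAMPLE), "cbc(aes)"))
```

If no Rockchip driver shows up in the real file at all, the hardware engine is not registered and cryptodev has nothing to pick up, regardless of how it was built.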
So, massive text wall. I would be glad for any input to sort this out, and interpret the numbers I am seeing here.
EDIT 1:
Well, I found out something neat while looking at the Rockchip kernel config. The kernel can be compiled as aarch64 for supported CPUs - such as the aforementioned RK3328 I'm using. So I recompiled LL with ARCH=aarch64 (instead of ARCH=arm, as mentioned in the readme), and bam, performance increased significantly. The openssl speed test saw performance improvements both with and without cryptodev. Transfer speeds for SFTP also increased (for the most part - I have yet to do a proper test sweep, so no numbers yet).
I will do more tests tomorrow, and see if I can get some solid numbers. Also, I don't know if this radical performance change is just because of the aarch64 mode, or because of the additional "crypto" cpu flag in LL for RK3328 with aarch64. I will investigate.
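As a first step for that investigation, it may help to check whether the running kernel reports the ARMv8 crypto instructions at all. On ARM systems /proc/cpuinfo carries a "Features" line per core, and the flags aes, pmull, sha1 and sha2 indicate the crypto extensions. A minimal sketch, with an illustrative sample line instead of real output from my Rock64:

```python
# Flags that indicate the ARMv8 crypto extensions in /proc/cpuinfo.
CRYPTO_FLAGS = {"aes", "pmull", "sha1", "sha2"}


def cpu_features(cpuinfo_text):
    """Collect the flags from every 'Features' line of cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            flags.update(value.split())
    return flags


def missing_crypto_flags(cpuinfo_text):
    """Return the crypto-related flags that the text does not report."""
    return CRYPTO_FLAGS - cpu_features(cpuinfo_text)


# Illustrative sample; on a real system read the text from /proc/cpuinfo.
SAMPLE = "processor\t: 0\nFeatures\t: fp asimd evtstrm aes pmull sha1 sha2 crc32\n"

print(missing_crypto_flags(SAMPLE))
```

This only shows what the kernel exposes. Whether OpenSSL actually uses those instructions still depends on how it was built, which is the part tied to the "crypto" flag.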