So I've tested it and I can confirm that the audio "clicks / pops" are significantly reduced by setting the governor to performance. They are not completely gone, though: they occur a lot less often and are less noticeable (only in quiet parts).
It seems to be related to the current CPU speed:
Low speed (powersave governor) -> more audio "drops" and bigger "gaps".
High speed (performance governor) -> fewer audio "drops" and smaller "gaps".
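
For anyone who wants to check which governor is active before comparing results, here is a minimal sketch. It assumes the standard Linux cpufreq sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`); the function name `read_governors` and the `cpu_root` parameter are just made up for this example.

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU that exposes cpufreq.

    cpu_root is a parameter only so the function can be pointed at a
    different directory; the default is the usual sysfs location.
    """
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        # gov_file = <cpu_root>/cpuN/cpufreq/scaling_governor
        cpu_name = gov_file.parent.parent.name
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


if __name__ == "__main__":
    for cpu, governor in read_governors().items():
        print(f"{cpu}: {governor}")
```

On the RK3399 the little and big cores may be on separate cpufreq policies, so it is worth checking that every core reports the governor you expect.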
Looks like a buffer underrun issue to me.
Also it seems that the common attribute of the affected audio streams is that they are all 32-bit (like the AAC 32-bit from the YouTube sample above). Each sample then takes twice as many bytes as in common 16-bit audio, so they need a bigger buffer to hold the same amount of playback time.
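
To illustrate the buffer size point, here is a rough calculation. The numbers are assumptions picked for the example (a 16384-byte buffer, 48 kHz, stereo), not values read from the actual driver, and `buffer_duration_ms` is a hypothetical helper.

```python
def buffer_duration_ms(buffer_bytes, sample_rate_hz, channels, bits_per_sample):
    """Playback time (in ms) that a buffer of the given byte size can hold."""
    bytes_per_frame = channels * bits_per_sample // 8
    frames = buffer_bytes // bytes_per_frame
    return 1000.0 * frames / sample_rate_hz


# Assumed example values, not measured on the board.
BUFFER_BYTES = 16384
SAMPLE_RATE_HZ = 48000
CHANNELS = 2

for bits in (16, 32):
    ms = buffer_duration_ms(BUFFER_BYTES, SAMPLE_RATE_HZ, CHANNELS, bits)
    print(f"{bits}-bit: {ms:.1f} ms")
# prints:
# 16-bit: 85.3 ms
# 32-bit: 42.7 ms
```

With the same byte size, the 32-bit stream has only half the time margin before the buffer runs dry, which would fit with the drops getting worse when the CPU is clocked down.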
EDIT: I've tested it on the current official image with the rockchip-4.4 kernel, but the problem seems to affect all RK3399 images.