The difference is only whether the CEC protocol is implemented completely in software (an algorithm in the kernel, driven by a timer, determines when the pin goes to 1 or 0) or in hardware (the driver only reads which message was received or sends one out). Obviously, you want hardware to do as much as possible, since that leaves the CPU free for better things. Also, time-sensitive events, like when the pin has to go to 1 or 0, are then handled entirely in hardware and thus not dependent on CPU load, which could otherwise introduce lag.
So it's interesting to me that the software implementation works better than the hardware one. Usually it's the other way around. The only explanation I can think of is the accuracy of the 32 kHz clock, which the HW CEC implementation uses for timing events. But I don't see a reason why that would be a problem. Maybe you can confirm that a dedicated crystal is indeed used by executing devmem 0x01f00004 w. It should return value 1.
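For reference, the check should look roughly like this (assuming the busybox devmem applet; the exact output formatting may differ):

```
# read the 32k clock source register (needs root)
devmem 0x01f00004 w
# expected output if the dedicated crystal is selected:
# 0x00000001
```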
The best way to determine what happened would be to first find the PR on GitHub which broke HW CEC. That could be done via git bisect.
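A bisect session would look roughly like this (the good revision below is a placeholder; use the last known build where HW CEC still worked):

```
git bisect start
git bisect bad HEAD              # current state, HW CEC broken
git bisect good <known-good-tag> # placeholder: last revision where HW CEC worked
# git checks out a commit halfway between; build it, test HW CEC, then report:
git bisect good                  # or: git bisect bad, depending on the test
# repeat until git prints the first bad commit, then clean up:
git bisect reset
```

Once you have the offending commit, matching it to the PR on GitHub is straightforward.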