In Perl there are two modules on CPAN for UTF-8 decoding:
- Encode
- Unicode::UTF8
Unicode::UTF8 claims to be faster. I wanted to see how fast it is compared to my own pure C implementation, which decodes UTF-8 (1 to 6 byte sequences, needing 31 bits) into 32-bit code points.
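To make the "1 to 6 bytes, 31 bits" part concrete, here is a minimal sketch of such a decoder. It is not the code that was benchmarked: the names utf8_decode_one and utf8_decode are made up for illustration, and the sketch only checks the lead byte, the continuation bytes and truncation. It does not reject overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): decode one sequence of the
 * original 1-6 byte UTF-8 scheme (up to 31 payload bits) starting at s.
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the sequence is truncated or malformed.
 * Overlong forms and surrogates are NOT rejected in this sketch. */
static size_t utf8_decode_one(const unsigned char *s, size_t len, uint32_t *cp)
{
    if (len == 0)
        return 0;

    unsigned char lead = s[0];
    size_t need;      /* total length of the sequence in bytes */
    uint32_t value;   /* payload bits collected so far */

    if (lead < 0x80)      { need = 1; value = lead; }         /* 0xxxxxxx */
    else if (lead < 0xC0) { return 0; }                       /* stray continuation byte */
    else if (lead < 0xE0) { need = 2; value = lead & 0x1F; }  /* 110xxxxx */
    else if (lead < 0xF0) { need = 3; value = lead & 0x0F; }  /* 1110xxxx */
    else if (lead < 0xF8) { need = 4; value = lead & 0x07; }  /* 11110xxx */
    else if (lead < 0xFC) { need = 5; value = lead & 0x03; }  /* 111110xx */
    else if (lead < 0xFE) { need = 6; value = lead & 0x01; }  /* 1111110x */
    else                  { return 0; }                       /* 0xFE and 0xFF are never lead bytes */

    if (need > len)
        return 0;                                             /* truncated input */

    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)                            /* must be 10xxxxxx */
            return 0;
        value = (value << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *cp = value;
    return need;
}

/* Hypothetical wrapper: decode a whole buffer into out, which has room for
 * out_cap code points. Returns the number of code points written, or
 * (size_t)-1 on malformed input or when out is too small. */
static size_t utf8_decode(const unsigned char *s, size_t len,
                          uint32_t *out, size_t out_cap)
{
    size_t n = 0;
    while (len > 0) {
        uint32_t cp;
        size_t used = utf8_decode_one(s, len, &cp);
        if (used == 0 || n == out_cap)
            return (size_t)-1;
        out[n++] = cp;
        s += used;
        len -= used;
    }
    return n;
}
```

With the first benchmark string, utf8_decode turns the 13 input bytes into 11 code points, matching the "chars: 11 bytes 13" line in the benchmark output. The 5 and 6 byte forms come from the original UTF-8 definition; current UTF-8 (RFC 3629) stops at 4 bytes and U+10FFFF.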
UTF-8 decoding has become a bottleneck and is worth a closer look. Newer SIMD implementations can validate up to 4 GB of UTF-8 per second with AVX-512.
Here is the result of a quick benchmark:
$octets: Chſerſplzon
chars: 11 bytes 13
                    Rate  Encode  Unicode::UTF8  TL::BVXS
Encode         1927529/s      --           -83%      -89%
Unicode::UTF8 11143773/s    478%             --      -36%
TL::BVXS      17311395/s    798%            55%        --

$octets: राज्यराज्य
chars: 9 bytes 27
                    Rate  Encode  Unicode::UTF8  TL::BVXS
Encode         1592888/s      --           -83%      -90%
Unicode::UTF8  9466311/s    494%             --      -42%
TL::BVXS      16287053/s    922%            72%        --
Mine (TL::BVXS) is the fastest at ~215 MB/s, but it is still far from what SIMD solutions reach. In my use case, decoding consumes 35% of the execution time. SIMD would not help much for short strings, though.