Montag, 25. April 2022

Benchmark UTF-8 decoding in Perl

 In Perl there are two modules on CPAN for UTF-8 decoding:

- Encode

- Unicode::UTF-8

Unicode::UTF-8 claims to be faster. I was interested how fast it is compared to my own implementation in pure C to decode utf8 (1-6 byte, needing 31 bits) into 32-bit code points.

UTF-8 became a bottleneck and is worth to have a look at. New SIMD implementations can validate up to 4 GB UTF-8 per second with AVX512.

That's the result of a quick benchmark:

$octets: Chſerſplzon chars: 11 bytes 13
                    Rate        Encode Unicode::UTF8      TL::BVXS
Encode         1927529/s            --          -83%          -89%
Unicode::UTF8 11143773/s          478%            --          -36%
TL::BVXS      17311395/s          798%           55%            --

$octets: राज्यराज्य chars: 9 bytes 27
                    Rate        Encode Unicode::UTF8      TL::BVXS
Encode         1592888/s            --          -83%          -90%
Unicode::UTF8  9466311/s          494%            --          -42%
TL::BVXS      16287053/s          922%           72%            --

Mine is fastest with ~215 MB/s but still is far away from SIMD solutions. In my use case the decoding consumes 35 % of the execution time. But SIMD would not help much for short strings.

Keine Kommentare:

Kommentar veröffentlichen