Monday, 25 April 2022

Benchmark UTF-8 decoding in Perl

In Perl there are two modules on CPAN for UTF-8 decoding:

- Encode

- Unicode::UTF8

Unicode::UTF8 claims to be faster. I was interested in how fast it is compared to my own pure-C implementation, which decodes UTF-8 (1-6 byte sequences, covering up to 31 bits) into 32-bit code points.

UTF-8 decoding has become a bottleneck and is worth a closer look. New SIMD implementations can validate up to 4 GB of UTF-8 per second with AVX-512.

These are the results of a quick benchmark:

$octets: Chſerſplzon chars: 11 bytes 13
                    Rate        Encode Unicode::UTF8      TL::BVXS
Encode         1927529/s            --          -83%          -89%
Unicode::UTF8 11143773/s          478%            --          -36%
TL::BVXS      17311395/s          798%           55%            --

$octets: राज्यराज्य chars: 9 bytes 27
                    Rate        Encode Unicode::UTF8      TL::BVXS
Encode         1592888/s            --          -83%          -90%
Unicode::UTF8  9466311/s          494%            --          -42%
TL::BVXS      16287053/s          922%           72%            --

Mine is the fastest at ~215 MB/s, but still far away from the SIMD solutions. In my use case the decoding consumes 35% of the execution time, but SIMD would not help much for such short strings.

Trial: port some code to "use standard"

I just wanted to know how it feels to adapt code to use standard;

I selected a small, unpublished module: Levenshtein::Simple. It is about 530 lines of code, not counting tests. A typical case of reinventing a better wheel: Text::Levenshtein does not support arrays as input and lacks an alignment method, which I needed in a simple and robust form.

Here are the changes as a diff:

49 changes, most of them explicit syntax for dereferencing.

2 barewords needed quoting: STDOUT and STDERR. But why are they there at all? They are not needed.

The nested ternary needed explicit braces. Now it's more readable.

1 LABEL was not allowed as a bareword. That's a bug in Guacamole. The label was a relic.

2 invalid characters in the POD, copied and pasted from a reference. use standard; slurps the source as UTF-8 and breaks on them. That's undocumented and the second bug.

In summary: simple changes, and IMHO the code is more readable now. A nice development tool, but not ready for production.

Tuesday, 19 April 2022

Unicode, UTF-8, utf8 and Perl-XS revisited

Are you sure you know everything about Unicode in Perl? You never can.

Imagine doing everything right in your code: you use a UTF-8 boilerplate, you use Test::More::UTF8, and still you get a 'wide character' warning and the test fails.

What happened?

Boiled down to diagnostic code, it has something to do with the utf8 flag:

#!perl

use strict;
use warnings;
use utf8;

binmode(STDOUT,":encoding(UTF-8)");
binmode(STDERR,":encoding(UTF-8)");

my $ascii  = 'abc';
my $latin1 = 'äöü';
my $uni    = 'Chſ';

print '$ascii:  ',$ascii,' utf8::is_utf8($ascii): ',utf8::is_utf8($ascii),"\n";
print '$latin1: ',$latin1,' utf8::is_utf8($latin1): ',utf8::is_utf8($latin1),"\n";
print '$uni:    ',$uni,' utf8::is_utf8($uni): ',utf8::is_utf8($uni),"\n";

my $file = 'utf8-flag-ascii.txt';

my $ascii_file;
open(my $in,"<:encoding(UTF-8)",$file)
    or die "Cannot open file: $file";

while (my $line = <$in>) {
    chomp($line);
    $ascii_file = $line;
}
close($in);

print '$ascii_file: ',$ascii_file,
    ' utf8::is_utf8($ascii_file): ',utf8::is_utf8($ascii_file),"\n";

This prints:

# with 'use utf8;'

$ascii:  abc utf8::is_utf8($ascii):
$latin1: äöü utf8::is_utf8($latin1): 1
$uni:    Chſ utf8::is_utf8($uni): 1
$ascii_file: abc utf8::is_utf8($ascii_file): 1

Without use utf8:

# without 'use utf8;'

$ascii:  abc utf8::is_utf8($ascii):
$latin1: äöü utf8::is_utf8($latin1):
$uni:    Chſ utf8::is_utf8($uni):
$ascii_file: abc utf8::is_utf8($ascii_file): 1

That's known and expected. Even under use utf8, Perl doesn't set the utf8 flag for string literals in the ASCII range.

Let's have a look at the XS interface when we want to process strings in C:

int
dist_any (SV *s1, SV *s2)
{
    int dist;

    STRLEN m;
    STRLEN n;
    /* NOTE:
        SvPVbyte would downgrade (undocumented and destructive)
        SvPVutf8 would upgrade (also destructive)
    */
    unsigned char *a = (unsigned char*)SvPV (s1, m);
    unsigned char *b = (unsigned char*)SvPV (s2, n);

    if (SvUTF8 (s1) || SvUTF8 (s2) ) {
        dist = dist_utf8_ucs (a, m, b, n);
    }
    else {
        dist = dist_bytes (a, m, b, n);
    }
    return dist;
}

With two strings involved we have a problem if one has the utf8 flag set and the other does not. Under the previous constraint, which required the flag on both strings, such a mixed pair would be treated as bytes, even when one string is utf8 and the other a string literal in the ASCII range (which is also valid UTF-8). In combination with SvPVbyte the utf8 string would be downgraded, and that caused the 'wide character' warning, because SvPVbyte changes the original string. The new code is not strictly correct either, because it could treat some other encoding as UTF-8. But I decided to be harsh and let users run into their own problems if they don't follow best practices.

The inconsistent treatment shows up when we decode from a file as UTF-8: then the utf8 flag is also set for strings in the ASCII range. A brute-force comparison of an English and a German dictionary, 50,000 words each, resulted in 292,898,337 calls, and the script needed 102 seconds of runtime. That's a rate of ~3 M/sec, and it is slow because it always needs to decode UTF-8, even though 100% of the English words are in the ASCII range. The nice lesson: with a small change in the code the decoding got faster, with an overall acceleration of 13% via Perl and 19% in pure C. But the decoding still consumes 35% of the runtime.