Wollmers about Perl, coding and IT: tuning

Posts mit dem Label tuning werden angezeigt. Alle Posts anzeigen

Montag, 2. Mai 2022

Tuning Levenshtein Distance

On 2022-01-19 I released an improved version of Text::Levenshtein::BV. BV means Bit Vector, because it uses a bit-parallel algorithm with runtime complexity of O(ceil(m/w)*n), where w is the width of a CPU register. It's 20 times faster than the popular Text::Levenshtein using the traditional or simple algorithm with O(n*n) on strings with length 10.

Last summer I developed a variant with the simple algorithm named Levenshtein::Simple, because working on arrays is more flexible than strings. A string is easy to split into an array of characters. The same module could be used for arrays of characters, graphemes, words, lines.

Another reason is, that the Levenshtein alignment (shortest edit script) is more important than just the distance. Most modules on CPAN calculate only the distance for similarity, which is easier and faster with LCS.

It was a surprise that my Levenshtein::Simple without tuning was 50% faster than Text::Levenshtein. During development of an XS implementation I tuned Levenshtein::Simple to beat Text::Fuzzy::PP.

These are the results for the pure Perl (PP) implementations measured in Rate per second:

                         N=10
Text::WagnerFischer     4,931
Text::Levenshtein       6,339
Text::Fuzzy::PP        11,164
Levenshtein::Simple    27,926
Text::Levenshtein::BV 118,153

There are many XS versions on CPAN for calculating Levenshtein distance. Only three of them work without problems. They all use the simple algorithm. Text::Levenshtein::Flexible uses the C implementation of PostgreSQL, but has tricky code. Perl strings at the XS interface can be bytes or UTF-8. In C UTF-8 can be processed by iterating over UTF-8 or convert UTF-8 to 32-bit integers (numerical code points). Working with integers including the conversion is faster.

Measured via Perl the benchmarks are (TL::BVXS is mine, simple is my implementation just for fun, uni is with Unicode strings, bytes with bytes):

                            N=10
Text::Levenshtein::XS    347,539
TL::Flexible           2,725,258
Text::Fuzzy            3,026,401
TL::BVXS_simple        6,056,132
TL::BVXS_uni           8,495,407
TL::BVXS_bytes        12,743,110

The same simple algorithm, also implemented in C is 2 times faster than that of PostgreSQL. BV is 4 times faster (on longer strings the difference is larger). Compared to Text::Levenshtein it's 2000 times faster.

In pure C without the XS interface distance on bytes works at a rate of 25 Mio./second, codepoints ~10 % slower.

Optimising for fun.

Montag, 25. April 2022

Benchmark UTF-8 decoding in Perl

In Perl there are two modules on CPAN for UTF-8 decoding:

- Encode

- Unicode::UTF-8

Unicode::UTF-8 claims to be faster. I was interested how fast it is compared to my own implementation in pure C to decode utf8 (1-6 byte, needing 31 bits) into 32-bit code points.

UTF-8 became a bottleneck and is worth to have a look at. New SIMD implementations can validate up to 4 GB UTF-8 per second with AVX512.

That's the result of a quick benchmark:

$octets: Chſerſplzon chars: 11 bytes 13
                    Rate        Encode Unicode::UTF8      TL::BVXS
Encode         1927529/s            --          -83%          -89%
Unicode::UTF8 11143773/s          478%            --          -36%
TL::BVXS      17311395/s          798%           55%            --

$octets: राज्यराज्य chars: 9 bytes 27
                    Rate        Encode Unicode::UTF8      TL::BVXS
Encode         1592888/s            --          -83%          -90%
Unicode::UTF8  9466311/s          494%            --          -42%
TL::BVXS      16287053/s          922%           72%            --

Mine is fastest with ~215 MB/s but still is far away from SIMD solutions. In my use case the decoding consumes 35 % of the execution time. But SIMD would not help much for short strings.

Dienstag, 19. April 2022

Unicode, UTF-8, utf8 and Perl-XS revisited

Are you sure knowing everything about Unicode in Perl? You never can.

Imagine doing everything right in your code, use a UTF-8 boilerplate, use Test::More::UTF8 and then get a 'wide character' warning and the test fails.

What happend?

Boiling it down to diagnostic code, it has something to do with the utf8-flag:

#!perl

use strict;
use warnings;
use utf8;

binmode(STDOUT,":encoding(UTF-8)");
binmode(STDERR,":encoding(UTF-8)");

my $ascii  = 'abc';
my $latin1 = 'äöü';
my $uni    = 'Chſ';

print '$ascii:  ',$ascii,' utf8::is_utf8($ascii): ',utf8::is_utf8($ascii),"\n";
print '$latin1: ',$latin1,' utf8::is_utf8($latin1): ',utf8::is_utf8($latin1),"\n";
print '$uni:    ',$uni,' utf8::is_utf8($uni): ',utf8::is_utf8($uni),"\n";

my $file = 'utf8-flag-ascii.txt';

my $ascii_file;
open(my $in,"<:encoding cannot="" die="" file:="" file="" line="<$in" my="" open="" or="" while="">) {
    chomp($line);
    $ascii_file = $line;
}
close($in);

print '$ascii_file: ',$ascii_file,
    ' utf8::is_utf8($ascii_file): ',utf8::is_utf8($ascii_file),"\n";

This prints:

# with 'use utf8;'

$ascii:  abc utf8::is_utf8($ascii):
$latin1: äöü utf8::is_utf8($latin1): 1
$uni:    Chſ utf8::is_utf8($uni): 1
$ascii_file: abc utf8::is_utf8($ascii_file): 1

Without use utf8:

# without 'use utf8;'

$ascii:  abc utf8::is_utf8($ascii):
$latin1: Ã¤Ã¶Ã¼ utf8::is_utf8($latin1):
$uni:    ChÅ¿ utf8::is_utf8($uni):
$ascii_file: abc utf8::is_utf8($ascii_file): 1

That's known and expected. Perl doesn't set the utf-flag under use utf8 for ASCII string literals.

Let's have a look into the XS interface if we want to process strings in C:

int
dist_any (SV *s1, SV *s2)
{
    int dist;

    STRLEN m;
    STRLEN n;
    /* NOTE:
        SvPVbyte would downgrade (undocumented and destructive)
        SvPVutf8 would upgrade (also destructive)
    */
    unsigned char *a = (unsigned char*)SvPV (s1, m);
    unsigned char *b = (unsigned char*)SvPV (s2, n);

    if (SvUTF8 (s1) || SvUTF8 (s2) ) {
        dist = dist_utf8_ucs (a, m, b, n);
    }
    else {
        dist = dist_bytes (a, m, b, n);
    }
    return dist;
}

With two strings involved we have a problem, if one has the utf-flag set and the other not. With a constraint 'SvUTF8 (s1) || SvUTF8 (s2)' both would be treated as bytes, even in the case that one is utf8 and the other a string literal in the ASCII range (which is also valid UTF-8). In combination with SvPVbyte the utf8 literal would be downgraded. That caused 'wide character' warning, because SvPVbyte changes the original string. The new code is not correct, because it could treat some other encoding as UTF-8. But I decided to be harsh and let the user run into his own problems (not using best practises).

The inconsequent treatment appears if we decode as UTF-8 from a file. Then the utf8-flag is also set for strings in the ASCII range. A brute force comparison of an English and a German dictionary, each 50,000 words resulted in 292,898,337 calls and the script needed 102 seconds runtime. That's a rate of ~3 M/sec. and slow, because it always needs to decode UTF-8, even if 100% of the English words are in the ASCII range. The nice lesson: With a small change in the code the decoding got faster with an overall acceleration of 13% via Perl and 19% in pure C. But the decoding still consumes 35% of the runtime.

Sonntag, 19. Juli 2015

Four Shades of Devel::NYTProf

Sometimes a question or remark helps us solving another problem, which has absolutely nothing to do with the question.

The question was about the colors in the reports of Devel::NYTProf. The colors are green, yellow, orange and red, where green means good and red means bad.

These colors appear in the columns "Statements" and "Time on line" of the reports. To illustrate the coloring I found a nice piece of code where nearly all combinations appear:

Line	State ments	Time on line	Code
51	1	2µs	for my $j ($bmin..$bmax) {
52	6	3µs	$bj = $b->[$j];
53	6	3µs	unless (defined $positions->{$bj}) {
54	2	1µs	$Vs->[$j] = $S;
55	2	8µs	next;
56			}
57	4	1µs	$y = $positions->{$bj};
58	4	1µs	$u = $S & $y; # [Hyy04]
59	4	3µs	$S = ($S + $u) \| ($S - $u); # [Hyy04]
60	4	3µs	$Vs->[$j] = $S;
61			}

The different colors are determined by the difference of the values to the median, measured in median absolute deviations. The colors in the column "Statements" are calculated independent from the column "Time on line". This explains, why the first column can be red while the second is green, and why the first column can be green and the second red.

In the above report we see that the body of the for loop is executed 6 times. The lines in the unless body is executed 2 times, and the part outside 4 times.

The unless condition including the body needs 12µs. The intention of the unless construction is an optimization. So let us estimate how much time this safes. The second part (lines 57-60) needs 8µs for 4 executions, that's 2µs per execution. If we can avoid lines 53-55 we need 2 more executions of lines 57-60. That would theoretically mean minus 12µs plus 4µs.

Of course we need to care for the undefined case. And we do not longer need the caching in line 52.

We get now the following report:

Line	State ments	Time on line	Code
52	1	700ns	for my $j ($bmin..$bmax) {
53	6	2µs	$y = $positions->{$b->[$j]} // 0;
54	6	900ns	$u = $S & $y; # [Hyy04]
55	6	1µs	$S = ($S + $u) \| ($S - $u); # [Hyy04]
56	6	3µs	$Vs->[$j] = $S;
57			}

This looks a lot shorter. We eliminated 5 lines of 11 lines. The total execution time is now below 8µs. This is less than one third of the 25µs before the tuning.

The code is from LCS::BV which now is another bit faster than after Tuning Algorithm::Diff.

Mittwoch, 10. Juni 2015

Tuning Algorithm::Diff

Motivation

There always is a need for speed. During tuning one of my applications in the field of OCR postprocessing, the profiler identified the LCS (Longest Common Subsequence) routine as the most time consuming part. But the big question was: Is it possible to make the already fast Algorithm::Diff and Algorithm::Diff::XS faster.

Reading the code again and again, trying some micro-optimizations there was only small progress.

OK, don't twiddle, try better algorithms.

So I spent a lot of time reading papers, then trying to implement the described algorithms. Each implementation took some time to debug. Some do not work as they should, i.e. they fail test cases. Some are slow in Perl.

After a while I had a collection of various implementations of the so called Hunt-Szymansky algorithm and improvements of it. This is the same algorithm Algorithm::Diff is based on.

During tuning a bit-vector implementation I compared it to my already tuned version of Algorithm::Diff, and merged a part of code of the bit-vector variant back. The result of the benchmark surprised me, as the now tuned Algorithm::Diff was only 10 percent slower than the bit-vector variant.

This motivated me to publish it as a reliable and portable pure Perl module under the name LCS::Tiny.

Here are the steps making it so fast.

Remove features

Algorithm::Diff::LCSidx has two additional parameters:

- a code ref to a key generating function

- a code ref to a compare function

Usually they would not be specified and the defaults are used, but there is still overhead for checking and calling.

Inlining

When Algorithm::Diff::LCSidx is called then four subroutines are involved. Inlining the subroutines into one subroutine reduces the lines of code and runtime overhead. Of course the code would maybe become less readable.

Squeezing

Try to use less instructions. Use Devel::NYTProf and benchmarks to check the changes. This sort of tuning can be very time consuming.

Here is an example from Algorithm::Diff:


sub _withPositionsOfInInterval
{
    my $aCollection = shift;    # array ref
    my $start       = shift;
    my $end         = shift;
    my $keyGen      = shift;
    my %d;
    my $index;
    for ( $index = $start ; $index <= $end ; $index++ )
    {
        my $element = $aCollection->[$index];
        my $key = &$keyGen( $element, @_ );
        if ( exists( $d{$key} ) )
        {
            unshift ( @{ $d{$key} }, $index );
        }
        else
        {
            $d{$key} = [$index];
        }
    }
    return wantarray ? %d : \%d;
}

This is how it looks after squeezing:


  my $bMatches;
  unshift @{ $bMatches->{$b->[$_]} },$_ for $bmin..$bmax;

Resulting module - LCS::Tiny

As bit-vector solutions can have portability problems, I decided to release the now very fast non-bit-vector implementation on CPAN as LCS::Tiny.

Additional speed - LCS::BV

After release of LCS::Tiny I got back to work on the bit-vector implementation, to make it portable for different lengths of words, and support sequences longer than word-length (32 or 64 bit).

Also I tried out improvements for a part of the algorithm. One resulted in a significant accelleration and is now implemented in the released module LCS::BV.

Benchmarks

Short sequences

A nice and typical average case for spelling-correction is the comparison of 'Chrerrplzon' with 'Choerephon'.


use Algorithm::Diff;
use Algorithm::Diff::XS;
use String::Similarity;

use Benchmark qw(:all) ;

use LCS::Tiny;
use LCS;
use LCS::BV;

my @data = (
  [split(//,'Chrerrplzon')],
  [split(//,'Choerephon')]
);

my @strings = qw(Chrerrplzon Choerephon);

if (1) {
    cmpthese( 50_000, {
       'LCS' => sub {
            LCS->LCS(@data)
        },
       'LCSidx' => sub {
            Algorithm::Diff::LCSidx(@data)
        },
        'LCSXS' => sub {
            Algorithm::Diff::XS::LCSidx(@data)
        },
        'LCSbv' => sub {
            LCS::BV->LCS(@data)
        },
        'LCStiny' => sub {
            LCS::Tiny->LCS(@data)
        },
        'S::Sim' => sub {
            similarity(@strings)
        },
    });
}

Gives the following result:

$ perl 50_diff_bench.t 
            (warning: too few iterations for a reliable count)
             Rate     LCS  LCSidx LCStiny   LCSXS   LCSbv  S::Sim
LCS        7022/s      --    -71%    -82%    -87%    -87%    -99%
LCSidx    23923/s    241%      --    -40%    -55%    -57%    -98%
LCStiny   39683/s    465%     66%      --    -25%    -29%    -96%
LCSXS     53191/s    657%    122%     34%      --     -4%    -95%
LCSbv     55556/s    691%    132%     40%      4%      --    -94%
S::Sim  1000000/s  14140%   4080%   2420%   1780%   1700%      --

Here the pure Perl LCS::BV beats Algorithm::Diff::XS (written in C).

Worst case

This example shows the weakness of the algorithms against more complicated cases.


my @data3 = ([qw/a b d/ x 50], [qw/b a d c/ x 50]);

$ perl 50_diff_bench.t 
            (warning: too few iterations for a reliable count)
          Rate     LCS LCStiny  LCSidx   LCSbv   LCSXS
LCS     35.2/s      --    -33%    -38%    -97%    -99%
LCStiny 52.7/s     50%      --     -7%    -95%    -98%
LCSidx  56.5/s     60%      7%      --    -94%    -98%
LCSbv   1020/s   2795%   1836%   1705%      --    -67%
LCSXS   3125/s   8766%   5828%   5428%    206%      --

Here the XS is fastest, because the stressed part is in pure C. The bit-vector implementation scales good, because of the parallelisms. The others fall back to worst case performance near the traditional implementation.