Posts mit dem Label C werden angezeigt. Alle Posts anzeigen
Posts mit dem Label C werden angezeigt. Alle Posts anzeigen

Montag, 2. Mai 2022

Tuning Levenshtein Distance

On 2022-01-19 I released an improved version of Text::Levenshtein::BV. BV means Bit Vector, because it uses a bit-parallel algorithm with runtime complexity of O(ceil(m/w)*n), where w is the width of a CPU register. It's 20 times faster than the popular Text::Levenshtein using the traditional or simple algorithm with O(n*n) on strings with length 10.

Last summer I developed a variant with the simple algorithm named Levenshtein::Simple, because working on arrays is more flexible than strings. A string is easy to split into an array of characters. The same module could be used for arrays of characters, graphemes, words, lines.

Another reason is, that the Levenshtein alignment (shortest edit script) is more important than just the distance. Most modules on CPAN calculate only the distance for similarity, which is easier and faster with LCS.

It was a surprise that my Levenshtein::Simple without tuning was 50% faster than Text::Levenshtein. During development of an XS implementation I tuned Levenshtein::Simple to beat Text::Fuzzy::PP.

These are the results for the pure Perl (PP) implementations measured in Rate per second:

                         N=10
Text::WagnerFischer     4,931
Text::Levenshtein       6,339
Text::Fuzzy::PP        11,164
Levenshtein::Simple    27,926
Text::Levenshtein::BV 118,153

There are many XS versions on CPAN for calculating Levenshtein distance. Only three of them work without problems. They all use the simple algorithm. Text::Levenshtein::Flexible uses the C implementation of PostgreSQL, but has tricky code. Perl strings at the XS interface can be bytes or UTF-8. In C UTF-8 can be processed by iterating over UTF-8 or convert UTF-8 to 32-bit integers (numerical code points). Working with integers including the conversion is faster.

Measured via Perl the benchmarks are (TL::BVXS is mine, simple is my implementation just for fun, uni is with Unicode strings, bytes with bytes):

                            N=10
Text::Levenshtein::XS    347,539
TL::Flexible           2,725,258
Text::Fuzzy            3,026,401
TL::BVXS_simple        6,056,132
TL::BVXS_uni           8,495,407
TL::BVXS_bytes        12,743,110


The same simple algorithm, also implemented in C is 2 times faster than that of PostgreSQL. BV is 4 times faster (on longer strings the difference is larger). Compared to Text::Levenshtein it's 2000 times faster.

In pure C without the XS interface distance on bytes works at a rate of 25 Mio./second, codepoints ~10 % slower.

Optimising for fun.

Dienstag, 19. April 2022

Unicode, UTF-8, utf8 and Perl-XS revisited

Are you sure knowing everything about Unicode in Perl? You never can.

Imagine doing everything right in your code, use a UTF-8 boilerplate, use Test::More::UTF8 and then get a 'wide character' warning and the test fails.

What happend? 

Boiling it down to diagnostic code, it has something to do with the utf8-flag:

#!perl

use strict;
use warnings;
use utf8;

binmode(STDOUT,":encoding(UTF-8)");
binmode(STDERR,":encoding(UTF-8)");

my $ascii  = 'abc';
my $latin1 = 'äöü';
my $uni    = 'Chſ';

print '$ascii:  ',$ascii,' utf8::is_utf8($ascii): ',utf8::is_utf8($ascii),"\n";
print '$latin1: ',$latin1,' utf8::is_utf8($latin1): ',utf8::is_utf8($latin1),"\n";
print '$uni:    ',$uni,' utf8::is_utf8($uni): ',utf8::is_utf8($uni),"\n";

my $file = 'utf8-flag-ascii.txt';

my $ascii_file;
open(my $in,"<:encoding cannot="" die="" file:="" file="" line="<$in" my="" open="" or="" while="">) {
    chomp($line);
    $ascii_file = $line;
}
close($in);

print '$ascii_file: ',$ascii_file,
    ' utf8::is_utf8($ascii_file): ',utf8::is_utf8($ascii_file),"\n";

This prints:

# with 'use utf8;'

$ascii:  abc utf8::is_utf8($ascii):
$latin1: äöü utf8::is_utf8($latin1): 1
$uni:    Chſ utf8::is_utf8($uni): 1
$ascii_file: abc utf8::is_utf8($ascii_file): 1

Without use utf8:

# without 'use utf8;'

$ascii:  abc utf8::is_utf8($ascii):
$latin1: äöü utf8::is_utf8($latin1):
$uni:    Chſ utf8::is_utf8($uni):
$ascii_file: abc utf8::is_utf8($ascii_file): 1

That's known and expected. Perl doesn't set the utf-flag under use utf8 for ASCII string literals.

Let's have a look into the XS interface if we want to process strings in C:

int
dist_any (SV *s1, SV *s2)
{
    int dist;

    STRLEN m;
    STRLEN n;
    /* NOTE:
        SvPVbyte would downgrade (undocumented and destructive)
        SvPVutf8 would upgrade (also destructive)
    */
    unsigned char *a = (unsigned char*)SvPV (s1, m);
    unsigned char *b = (unsigned char*)SvPV (s2, n);

    if (SvUTF8 (s1) || SvUTF8 (s2) ) {
        dist = dist_utf8_ucs (a, m, b, n);
    }
    else {
        dist = dist_bytes (a, m, b, n);
    }
    return dist;
}

With two strings involved we have a problem, if one has the utf-flag set and the other not. With a constraint 'SvUTF8 (s1) || SvUTF8 (s2)'  both would be treated as bytes, even in the case that one is utf8 and the other a string literal in the ASCII range (which is also valid UTF-8). In combination with SvPVbyte the utf8 literal would be downgraded. That caused 'wide character' warning, because SvPVbyte changes the original string. The new code is not correct, because it could treat some other encoding as UTF-8. But I decided to be harsh and let the user run into his own problems (not using best practises).

The inconsequent treatment appears if we decode as UTF-8 from a file. Then the utf8-flag is also set for strings in the ASCII range. A brute force comparison of an English and a German dictionary, each 50,000 words resulted in 292,898,337 calls and the script needed 102 seconds runtime. That's a rate of ~3 M/sec. and slow, because it always needs to decode UTF-8, even if 100% of the English words are in the ASCII range. The nice lesson: With a small change in the code the decoding got faster with an overall acceleration of 13% via Perl and 19% in pure C. But the decoding still consumes 35% of the runtime.

Samstag, 20. Juni 2015

Minimal Coding versus Indirection

Yesterday I tried to use a C implementation of the LCS problem within Perl via XS.

Reuse an existing library (libmba) sounded nice. It has good documentation and minimal foreign dependencies. The promise for libmba::diff is working for sequences of anything.

First problem was to get it compiling, stepping from one missing include to the next. Also had to patch a source file, because of some incompatible code for printing error messages.

Per default libmba::diff works on byte-strings. The comparison with String::Similarity showed only 40% of the speed of String::Similarity. Both use the same algorithm ["An O(ND) Difference Algorithm and its Variations", Eugene Myers,
 Algorithmica Vol. 1 No. 2, 1986, pp. 251-266;], both written in C, both interfaced from Perl via XS.

But ok, fast enough for the first step, can tune later.

In the next step I wanted to support array references instead of strings as input. This needs to call libmba::diff() with references of two subroutines, one for comparing two elements, one for indexing an element. Here I lost (or wasted) some time getting it compiled without warnings (see on Stackoverflow).

The speed slowed down now to 24%, 0.237 MHz versus 1 MHz (i.e. 1 million cases processed per second) of String::Similarity. 

What's the difference in detail?

Let's measure rough figures.

The lines of code of String::Similarity:

~/github/String-Similarity-1.04$ cloc --by-file-by-lang .
      17 text files.
      13 unique files.                              
      17 files ignored.

http://cloc.sourceforge.net v 1.62  T=0.20 s (45.6 files/s, 10777.8 lines/s)

[ details snipped ]

---------------------------------------------------------------------
Language           files          blank        comment           code
---------------------------------------------------------------------
make                   1            231            106            702
C                      2            127            193            570
YAML                   2              0              0             41
JSON                   1              0              0             39
Perl                   2             31             24             38
C/C++ Header           1              7             14              4
---------------------------------------------------------------------
SUM:                   9            396            337           1394
---------------------------------------------------------------------


And the same of LCS::XS:



~/github/LCS-XS$ cloc --by-file-by-lang .
      44 text files.
      42 unique files.                              
       8 files ignored.

http://cloc.sourceforge.net v 1.62  T=0.33 s (115.4 files/s, 37434.7 lines/s)

[ details snipped ]

----------------------------------------------------------------------
Language            files          blank        comment           code
----------------------------------------------------------------------
C/C++ Header           25           1109           3652           4317
C                       6            230            252           1376
make                    1            231            111            704
Perl                    3             70             63            146
JSON                    1              0              0             39
YAML                    2              1              1             28
----------------------------------------------------------------------
SUM:                   38           1641           4079           6610
----------------------------------------------------------------------

The lines of code in C are 1376 versus 570 for the same algorithm. The reason is more abstraction, more subroutines, more internal interfaces, more flexibility. This also causes more documentation for interfaces and more time to read the documentation and master the interfaces.

The size of the compiled objects is 7 KB versus 33 KB.

And a difference of factor 4 in speed.