Wollmers about Perl, coding and IT: diff

Posts mit dem Label diff werden angezeigt. Alle Posts anzeigen

Dienstag, 19. Januar 2016

Porting LCS::BV to Perl 6 - a fast core loop for Algorithm::Diff

Christmas happened and Perl6 is now reasonable stable. It's time to use it for real problems, learn more about pros and cons, detect the elegance.

Porting my CPAN distribution LCS::BV is a nice task. It's short code, has full test coverage and allows benchmarks. Also it could be expected that LCS::BV (a bit-vector implementation of the Longest Common Subsequence algorithm) will be faster than the Perl6 port of Algorithm::Diff and the Perl6 module Algorithm::LCS. LCS::BV is as fast as the XS version of A::D, see my blogpost Tuning Algorithm::Diff.

Test dependencies

LCS algorithms are complex beasts. It's nearly impossible to implement them without good test cases. One problem is, that these algorithms find a sequence of maximum length, but only one of maybe more possible solutions. All possible solutions are provided by the slow CPAN distribution LCS and the method allLCS() as explained in a another blog post. So I need to port this to Perl 6 first.

To compare the results during tests Test::Deep is needed. Should be no problem, sure somebody already ported it. Or not? No, there is no Test::Deep on modules.perl6.org. So I need to port this too, or the part needed. Or not?

This is the relevant part of the tests in Perl 5:

use Test::More;
use Test::Deep;

use LCS;
use LCS::BV;

cmp_deeply(
    LCS::BV->LCS(\@a,\@b),
    any(@{LCS->allLCS(\@a,\@b)} ),
    "$a, $b"
);

done_testing;

Trying to write a subroutine from bottom I got the following solution using any():


# say so $[[2,3],[2,4],] eqv $[[[0,1],[2,3],],[[0,1],[2,4],]].any;
  
my sub any_of ($result, $expected) {
  return so $result eqv $expected.any;
}
  
ok(
  any_of(
    LCS::BV::LCS($A, $B),
    LCS::All::allLCS($A, $B),
  ),
  "'$a', '$b'",
);

That's an example how many features of Perl 6 are worth the half of CPAN.

Length of an integer

In Perl 5 bit operations are limited to length of an integer. In most implementations this is 64 bits, but can be 32 bits. If an array is mapped to bits it needs ceil(array_length/integer_width) integers.

This needs the following code in Perl 5:


our $width = int 0.999+log(~0)/log(2);

use integer;
no warnings 'portable'; # for 0xffffffffffffffff

my $positions;

# $a ... array of elements, e.g. ['f','o','o']

$positions->{$a->[$_]}->[$_ / $width] |= 1 << ($_ % $width) 
    for $amin..$amax;

# gives {'f' => [ 0b001 ], 'o' => [ 0b110 ], }

Doing this the first time in Perl 6, I wondered how long integers are and how to find out. Took me very long with the result: infinite. That's awesome.

This part ported to Perl 6:


my $positions;
$positions{$a[$_]} +|= 1 +< $_  for $amin..$amax;

This is approximately half the code. You can imagine that the rest of the code also gets simpler. It also is the half.

Just 30 lines

For rosettacode.org I adapted the code for strings and removed the usual prefix and suffix optimization, see Bit Vector:


sub lcs(Str $xstr, Str $ystr) {
    my ($a,$b) = ([$xstr.comb],[$ystr.comb]);

    my $positions;
    for $a.kv -> $i,$x { $positions{$x} +|= 1 +< $i };

    my $S = +^0;
    my $Vs = [];
    my ($y,$u);
    for (0..+$b-1) -> $j {
        $y = $positions{$b[$j]} // 0;
        $u = $S +& $y;
        $S = ($S + $u) +| ($S - $u);
        $Vs[$j] = $S;
    }

    my ($i,$j) = (+$a-1, +$b-1);
    my $result = "";
    while ($i >= 0 && $j >= 0) {
        if ($Vs[$j] +& (1 +< $i)) { $i-- }
        else {
            unless ($j && +^$Vs[$j-1] +& (1 +< $i)) {
                $result = $a[$i] ~ $result;
                $i--;
            }
            $j--;
        }
    }
    return $result;
}

You can install LCS::BV and LCS::All via Panda or clone them from github.

Summary

Perl 6 has a lot of convenience, allows short and clear code. It's huge. But learning all the nice features is easier than I thought. Understanding e.g. any() and trying it out in the REPL is a magnitude faster, than searching in CPAN, reading docs and trying out some distributions.

Freitag, 20. November 2015

Testing fuzzy approximation with Test::Deep

Maybe the title of this post is misleading, but I will explain what I mean.

Last year I wrote a series of LCS (longest common subsequence) modules for speed and rock solid quality. Then I had a need for smarter alignment and added similarity to the algorithm, now released as LCS::Similar on CPAN.

The hardest part in developing LCS and friends was sorting out bogus algorithms from a collection of nearly 100 publications with pseudocode and sometimes code. It needed some time to get all cornercases as testcases.

But how to test an algorithm which returns one of maybe more than one optimal ("longest") solutions? With a method allLCS() returning all solutions and comparing the single solution against them.

Here is what the output looks like:


my $cases = {
  'case_01' => {
    'allLCS' => [
      [ [ 0, 0 ], [ 2, 2 ] ],
      [ [ 1, 0 ], [ 2, 2 ] ]
    ],
    'LCS' =>
      [ [ 0, 0 ], [ 2, 2 ] ],
  },
  'case_02' => {
    'allLCS' => [
      [ [ 1, 1 ], [ 3, 2 ], [ 4, 3 ], [ 6, 4 ] ],
      [ [ 1, 1 ], [ 3, 2 ], [ 5, 3 ], [ 6, 4 ] ],
      [ [ 2, 0 ], [ 3, 2 ], [ 4, 3 ], [ 6, 4 ] ],
      [ [ 2, 0 ], [ 3, 2 ], [ 5, 3 ], [ 6, 4 ] ],
      [ [ 2, 0 ], [ 4, 1 ], [ 5, 3 ], [ 6, 4 ] ]
    ],
    'LCS' =>
      [ [ 1, 1 ], [ 3, 2 ], [ 5, 3 ], [ 6, 4 ] ]
  },
};

The result of LCS is valid if it is in the results of allLCS.

At first I wrote the comparison myself but Test::Deep already provides it:


  use Test::More;
  use Test::Deep

  use LCS::Tiny;

  cmp_deeply(
    LCS::Tiny->LCS(\@a,\@b),
    any(@{$object->allLCS(\@a,\@b)} ),
    "LCS::Tiny->LCS $a, $b"
  );

  done_testing;

That's how we can test approximation. But if the comparison in LCS changes from eq to a similarity function returning values between 0 and 1, than additional pairs matched with similarity appear in the result. For a comparison we need any of allLCS is a subset of LCS.

This was the first approach which works not reliable:


use Test::Deep;

# THIS DOES NOT WORK
sub cmp_any_superset {
    my ($got, $sets) = @_;

    for my $set (@$sets) {
      my $ok = cmp_deeply(
        $got, supersetof(@$set)
      );
      return $ok if $ok;
    }
};

Trying around and reading the documentation of Test::Deep again and again, nearly giving up and write it myself, I gave one possible interpretation of the docs a chance:

  cmp_deeply(
    $lcs,
    any(
      $lcs,
      supersetof( @{$all_lcs} )
    ),
    "Example $example, Threshold $threshold"
  );

It works.

Sonntag, 19. Juli 2015

Four Shades of Devel::NYTProf

Sometimes a question or remark helps us solving another problem, which has absolutely nothing to do with the question.

The question was about the colors in the reports of Devel::NYTProf. The colors are green, yellow, orange and red, where green means good and red means bad.

These colors appear in the columns "Statements" and "Time on line" of the reports. To illustrate the coloring I found a nice piece of code where nearly all combinations appear:

Line	State ments	Time on line	Code
51	1	2µs	for my $j ($bmin..$bmax) {
52	6	3µs	$bj = $b->[$j];
53	6	3µs	unless (defined $positions->{$bj}) {
54	2	1µs	$Vs->[$j] = $S;
55	2	8µs	next;
56			}
57	4	1µs	$y = $positions->{$bj};
58	4	1µs	$u = $S & $y; # [Hyy04]
59	4	3µs	$S = ($S + $u) \| ($S - $u); # [Hyy04]
60	4	3µs	$Vs->[$j] = $S;
61			}

The different colors are determined by the difference of the values to the median, measured in median absolute deviations. The colors in the column "Statements" are calculated independent from the column "Time on line". This explains, why the first column can be red while the second is green, and why the first column can be green and the second red.

In the above report we see that the body of the for loop is executed 6 times. The lines in the unless body is executed 2 times, and the part outside 4 times.

The unless condition including the body needs 12µs. The intention of the unless construction is an optimization. So let us estimate how much time this safes. The second part (lines 57-60) needs 8µs for 4 executions, that's 2µs per execution. If we can avoid lines 53-55 we need 2 more executions of lines 57-60. That would theoretically mean minus 12µs plus 4µs.

Of course we need to care for the undefined case. And we do not longer need the caching in line 52.

We get now the following report:

Line	State ments	Time on line	Code
52	1	700ns	for my $j ($bmin..$bmax) {
53	6	2µs	$y = $positions->{$b->[$j]} // 0;
54	6	900ns	$u = $S & $y; # [Hyy04]
55	6	1µs	$S = ($S + $u) \| ($S - $u); # [Hyy04]
56	6	3µs	$Vs->[$j] = $S;
57			}

This looks a lot shorter. We eliminated 5 lines of 11 lines. The total execution time is now below 8µs. This is less than one third of the 25µs before the tuning.

The code is from LCS::BV which now is another bit faster than after Tuning Algorithm::Diff.

Samstag, 20. Juni 2015

Minimal Coding versus Indirection

Yesterday I tried to use a C implementation of the LCS problem within Perl via XS.

Reuse an existing library (libmba) sounded nice. It has good documentation and minimal foreign dependencies. The promise for libmba::diff is working for sequences of anything.

First problem was to get it compiling, stepping from one missing include to the next. Also had to patch a source file, because of some incompatible code for printing error messages.

Per default libmba::diff works on byte-strings. The comparison with String::Similarity showed only 40% of the speed of String::Similarity. Both use the same algorithm ["An O(ND) Difference Algorithm and its Variations", Eugene Myers,

Algorithmica Vol. 1 No. 2, 1986, pp. 251-266;], both written in C, both interfaced from Perl via XS.

But ok, fast enough for the first step, can tune later.

In the next step I wanted to support array references instead of strings as input. This needs to call libmba::diff() with references of two subroutines, one for comparing two elements, one for indexing an element. Here I lost (or wasted) some time getting it compiled without warnings (see on Stackoverflow).

The speed slowed down now to 24%, 0.237 MHz versus 1 MHz (i.e. 1 million cases processed per second) of String::Similarity.

What's the difference in detail?

Let's measure rough figures.

The lines of code of String::Similarity:


~/github/String-Similarity-1.04$ cloc --by-file-by-lang .
      17 text files.
      13 unique files.                              
      17 files ignored.

http://cloc.sourceforge.net v 1.62  T=0.20 s (45.6 files/s, 10777.8 lines/s)

[ details snipped ]

---------------------------------------------------------------------
Language           files          blank        comment           code
---------------------------------------------------------------------
make                   1            231            106            702
C                      2            127            193            570
YAML                   2              0              0             41
JSON                   1              0              0             39
Perl                   2             31             24             38
C/C++ Header           1              7             14              4
---------------------------------------------------------------------
SUM:                   9            396            337           1394
---------------------------------------------------------------------

And the same of LCS::XS:


~/github/LCS-XS$ cloc --by-file-by-lang .
      44 text files.
      42 unique files.                              
       8 files ignored.

http://cloc.sourceforge.net v 1.62  T=0.33 s (115.4 files/s, 37434.7 lines/s)

[ details snipped ]

----------------------------------------------------------------------
Language            files          blank        comment           code
----------------------------------------------------------------------
C/C++ Header           25           1109           3652           4317
C                       6            230            252           1376
make                    1            231            111            704
Perl                    3             70             63            146
JSON                    1              0              0             39
YAML                    2              1              1             28
----------------------------------------------------------------------
SUM:                   38           1641           4079           6610
----------------------------------------------------------------------

The lines of code in C are 1376 versus 570 for the same algorithm. The reason is more abstraction, more subroutines, more internal interfaces, more flexibility. This also causes more documentation for interfaces and more time to read the documentation and master the interfaces.

The size of the compiled objects is 7 KB versus 33 KB.

And a difference of factor 4 in speed.

Mittwoch, 10. Juni 2015

Tuning Algorithm::Diff

Motivation

There always is a need for speed. During tuning one of my applications in the field of OCR postprocessing, the profiler identified the LCS (Longest Common Subsequence) routine as the most time consuming part. But the big question was: Is it possible to make the already fast Algorithm::Diff and Algorithm::Diff::XS faster.

Reading the code again and again, trying some micro-optimizations there was only small progress.

OK, don't twiddle, try better algorithms.

So I spent a lot of time reading papers, then trying to implement the described algorithms. Each implementation took some time to debug. Some do not work as they should, i.e. they fail test cases. Some are slow in Perl.

After a while I had a collection of various implementations of the so called Hunt-Szymansky algorithm and improvements of it. This is the same algorithm Algorithm::Diff is based on.

During tuning a bit-vector implementation I compared it to my already tuned version of Algorithm::Diff, and merged a part of code of the bit-vector variant back. The result of the benchmark surprised me, as the now tuned Algorithm::Diff was only 10 percent slower than the bit-vector variant.

This motivated me to publish it as a reliable and portable pure Perl module under the name LCS::Tiny.

Here are the steps making it so fast.

Remove features

Algorithm::Diff::LCSidx has two additional parameters:

- a code ref to a key generating function

- a code ref to a compare function

Usually they would not be specified and the defaults are used, but there is still overhead for checking and calling.

Inlining

When Algorithm::Diff::LCSidx is called then four subroutines are involved. Inlining the subroutines into one subroutine reduces the lines of code and runtime overhead. Of course the code would maybe become less readable.

Squeezing

Try to use less instructions. Use Devel::NYTProf and benchmarks to check the changes. This sort of tuning can be very time consuming.

Here is an example from Algorithm::Diff:


sub _withPositionsOfInInterval
{
    my $aCollection = shift;    # array ref
    my $start       = shift;
    my $end         = shift;
    my $keyGen      = shift;
    my %d;
    my $index;
    for ( $index = $start ; $index <= $end ; $index++ )
    {
        my $element = $aCollection->[$index];
        my $key = &$keyGen( $element, @_ );
        if ( exists( $d{$key} ) )
        {
            unshift ( @{ $d{$key} }, $index );
        }
        else
        {
            $d{$key} = [$index];
        }
    }
    return wantarray ? %d : \%d;
}

This is how it looks after squeezing:


  my $bMatches;
  unshift @{ $bMatches->{$b->[$_]} },$_ for $bmin..$bmax;

Resulting module - LCS::Tiny

As bit-vector solutions can have portability problems, I decided to release the now very fast non-bit-vector implementation on CPAN as LCS::Tiny.

Additional speed - LCS::BV

After release of LCS::Tiny I got back to work on the bit-vector implementation, to make it portable for different lengths of words, and support sequences longer than word-length (32 or 64 bit).

Also I tried out improvements for a part of the algorithm. One resulted in a significant accelleration and is now implemented in the released module LCS::BV.

Benchmarks

Short sequences

A nice and typical average case for spelling-correction is the comparison of 'Chrerrplzon' with 'Choerephon'.


use Algorithm::Diff;
use Algorithm::Diff::XS;
use String::Similarity;

use Benchmark qw(:all) ;

use LCS::Tiny;
use LCS;
use LCS::BV;

my @data = (
  [split(//,'Chrerrplzon')],
  [split(//,'Choerephon')]
);

my @strings = qw(Chrerrplzon Choerephon);

if (1) {
    cmpthese( 50_000, {
       'LCS' => sub {
            LCS->LCS(@data)
        },
       'LCSidx' => sub {
            Algorithm::Diff::LCSidx(@data)
        },
        'LCSXS' => sub {
            Algorithm::Diff::XS::LCSidx(@data)
        },
        'LCSbv' => sub {
            LCS::BV->LCS(@data)
        },
        'LCStiny' => sub {
            LCS::Tiny->LCS(@data)
        },
        'S::Sim' => sub {
            similarity(@strings)
        },
    });
}

Gives the following result:

$ perl 50_diff_bench.t 
            (warning: too few iterations for a reliable count)
             Rate     LCS  LCSidx LCStiny   LCSXS   LCSbv  S::Sim
LCS        7022/s      --    -71%    -82%    -87%    -87%    -99%
LCSidx    23923/s    241%      --    -40%    -55%    -57%    -98%
LCStiny   39683/s    465%     66%      --    -25%    -29%    -96%
LCSXS     53191/s    657%    122%     34%      --     -4%    -95%
LCSbv     55556/s    691%    132%     40%      4%      --    -94%
S::Sim  1000000/s  14140%   4080%   2420%   1780%   1700%      --

Here the pure Perl LCS::BV beats Algorithm::Diff::XS (written in C).

Worst case

This example shows the weakness of the algorithms against more complicated cases.


my @data3 = ([qw/a b d/ x 50], [qw/b a d c/ x 50]);

$ perl 50_diff_bench.t 
            (warning: too few iterations for a reliable count)
          Rate     LCS LCStiny  LCSidx   LCSbv   LCSXS
LCS     35.2/s      --    -33%    -38%    -97%    -99%
LCStiny 52.7/s     50%      --     -7%    -95%    -98%
LCSidx  56.5/s     60%      7%      --    -94%    -98%
LCSbv   1020/s   2795%   1836%   1705%      --    -67%
LCSXS   3125/s   8766%   5828%   5428%    206%      --

Here the XS is fastest, because the stressed part is in pure C. The bit-vector implementation scales good, because of the parallelisms. The others fall back to worst case performance near the traditional implementation.