Freitag, 20. November 2015

Testing fuzzy approximation with Test::Deep

Maybe the title of this post is misleading, but I will explain what I mean.

Last year I wrote a series of LCS (longest common subsequence) modules for speed and rock solid quality. Then I had a need for smarter alignment and added similarity to the algorithm, now released as LCS::Similar on CPAN.

The hardest part in developing LCS and friends was sorting out bogus algorithms from a collection of nearly 100 publications with pseudocode and sometimes code. It needed some time to get all cornercases as testcases.

But how to test an algorithm which returns one of maybe more than one optimal ("longest") solutions? With a method allLCS() returning all solutions and comparing the single solution against them.

Here is what the output looks like:

my $cases = {
  'case_01' => {
    'allLCS' => [
      [ [ 0, 0 ], [ 2, 2 ] ],
      [ [ 1, 0 ], [ 2, 2 ] ]
    'LCS' =>
      [ [ 0, 0 ], [ 2, 2 ] ],
  'case_02' => {
    'allLCS' => [
      [ [ 1, 1 ], [ 3, 2 ], [ 4, 3 ], [ 6, 4 ] ],
      [ [ 1, 1 ], [ 3, 2 ], [ 5, 3 ], [ 6, 4 ] ],
      [ [ 2, 0 ], [ 3, 2 ], [ 4, 3 ], [ 6, 4 ] ],
      [ [ 2, 0 ], [ 3, 2 ], [ 5, 3 ], [ 6, 4 ] ],
      [ [ 2, 0 ], [ 4, 1 ], [ 5, 3 ], [ 6, 4 ] ]
    'LCS' =>
      [ [ 1, 1 ], [ 3, 2 ], [ 5, 3 ], [ 6, 4 ] ]

The result of LCS is valid if it is in the results of allLCS.

At first I wrote the comparison myself but Test::Deep already provides it:

  use Test::More;
  use Test::Deep

  use LCS::Tiny;

    any(@{$object->allLCS(\@a,\@b)} ),
    "LCS::Tiny->LCS $a, $b"


That's how we can test approximation. But if the comparison in LCS changes from eq to a similarity function returning values between 0 and 1, than additional pairs matched with similarity appear in the result. For a comparison we need any of allLCS is a subset of LCS.

This was the first approach which works not reliable:

use Test::Deep;

sub cmp_any_superset {
    my ($got, $sets) = @_;

    for my $set (@$sets) {
      my $ok = cmp_deeply(
        $got, supersetof(@$set)
      return $ok if $ok;

Trying around and reading the documentation of Test::Deep again and again, nearly giving up and write it myself, I gave one possible interpretation of the docs a chance:

      supersetof( @{$all_lcs} )
    "Example $example, Threshold $threshold"

It works.

Donnerstag, 15. Oktober 2015

How bad can code be - Part 1: The Gordian Knot

How bad can code be?

At the beginning of my career as a developer starting from scratch I became responsible for the maintenance of a package consisting of approximately 40 KLoCs (Lines of Code) Assembler. It was called "realtime controlling part", a sort of middleware controlling transactions, database recovery, asynchronity, and communication.

It existed in three variants for four different applications. First person A developed it for application AA, then he got datacenter manager. Then person B forked it for different requirements of HH and BB+CC. In summary 70 KLoCs partly similar, but not exactly the same.

The BB+CC version was the most complicated with a special feature called preread. It guesses in the queue of incoming transaction, which records should be read in advance, waits for all of them, and puts it into the application queue. Same in reverse order with the inserts and updates in the answer of the application. And exactly this most complicated part was written by not-so-skilled person C.

The preread module had around 4000 lines and was hard to read. No subroutines. Variables uses as switches, i.e. store the result of conditions, check or change them somewhere in the code. Of course you need to init them. You need a good name within the allowed 8 characters. You must predefine them at the end of the source. But debugging is a hell, because the more switches you use, the more possible combinations of states you have. Combined with spaghetti style code, jumping around cross-cross it's a Gordian Knot. Lady programmer D removed two dozens bugs raising in production. In my period I also solved a similar amount. One bug was not solvable. Sometimes the whole thing got sleeping, waiting for nothing, hanging.

This was the worst code in production I had to work with.

Lessons learned:

  • do not use the weakest skilled person for the most complicated task
  • avoid switches
  • no literals in the code, define them as constants
  • use structured programming style (needs discipline in Assembler)
  • subroutines should fit into a screen (24 x 80 at this time)
  • code reentrant, even if it's not necessary

Fortunately I got 1 year time and budget to rewrite this package.

It was only necessary to change to another network technology, so just rewrite this part of handling incoming and outgoing messages. Second it was also necessary to refine restart and recovery as the architecture changed to distributed processing, i.e. the local organizations got all midrange servers, and the communication changed.

I started with the communication interface as an isolated module, but the network servers were not available, planned 6 months later for first integration tests. So I learned to test with stubs and drivers, simulating the other parts. In a team of 12 persons you speak with the others, even if you are working alone. One day person O, a genius, but nearly autistic developer said to me "The whole thing is crap and completely wrong." Why? What's wrong? Should I do it this way? "No, it should be this way." He always answered with one short sentence. We always had short and loud discussions. After the discussions I returned to my room for thinking about the design. And then I visited him again.

With this ideas I wrote a dispatcher, controlling queues, reqeuing ready tasks, queuing free tasks into the memory pool, activating the handlers. It had some 100 lines. This made possible to reduce the control logic in the handlers. The handlers could be reduced to wait for work, do some stuff, subrequests to other tasks and wait for the result, do some stuff, send result to requesting task, wait for new work. No need for variables, only maps to the control blocks in the queue, only constants in the task modules. Hey, this is reentrant.

Now the problem was to factor out the necessary logic from the old code, and clean away the control logic. I did this during the easter days at home. Imagine a living room with the compile printouts covering the whole floor, me using markers in two different colors, marking the good and the bad. Of course it was not finished in one step. I refactored my own code 12 times along the progress. The horrible preread module was reduced from 4000 lines to 100 lines nearly linear, straight ahead code. Others were also reduced in size and the variants disappeared.

This software was in production nearly 20 years, without an outage, and without change.

Lessons learned:

  • you need 2 very good developers in a team
  • not all developers in a team need good communication skills
  • learn from critic, even from harsh critic
  • advantage by restriction
  • less is more
  • refactor often
  • conform to quality standards and best practices

Gordian Knots can be solved.

Next in this series: two huge Perl projects

Mittwoch, 14. Oktober 2015

Install cperl with perlbrew

A month ago Reini Urban released cperl-5.22.1 with promising features. It should be faster and consume less memory than the original perl. See here the details and status.

Maybe in one of the next releases cperl will have faster arrays and hashes. At the moment creation of hash or array elements is fast, i.e. fast enough for most applications, but a limiting factor e.g. for the data transport between Perl and C via XS.

Let's try.

Fortunately I found this article Прецизионные бенчмарки Perl containing instructions how to install cperl with perlbrew.

Of course you need perlbrew installed and well configured.

# create a new directory
$ mkdir cperl
$ cd cperl

# download cperl
$ wget \
   -O cperl-cperl-5.22.1.tar.gz

# brew it
$ perlbrew install --as=cperl-5.22.1 $(pwd)/cperl-cperl-5.22.1.tar.gz

# list installed perls
$ perlbrew list
* perl-5.20.1

# switch to use cperl
$ perlbrew use cperl-5.22.1
$ perl -v

This is cperl 5, version 22, subversion 1 (v5.22.1c) built for darwin-2level

Easy, isn't it?

Dienstag, 25. August 2015

Interface to Shopware with Perl

At work we often have the problem to migrate data to web shops or other applications.

Perl is a great help for task like this, get things done.

The possible methods to maintain data via Perl scripts are:
  • manipulate the database, e.g. using DBIx::Class. This can be a maintenance hell, if the schema of the main application changes.
  • compile CSV records, if the application can import it.
  • use an API, if the application supports it.
In the case of Shopware an REST API exists, but the documentation targets PHP and suggests to use the PHP layers.

After some trial and error I got success using an ordinary web browser, entering an URL like this:


Should be straight forward with Mojo::UserAgent, but it returned errors.

Exploring the details it took some time to detect the header Authorization: Digest in the browser communication.

Fortunately with the help of search engines I found the module Mojo::UserAgent::DigestAuth which needed some trial and error to understand, but works like this:

use Mojo::UserAgent::DigestAuth;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;

# e.g. 'http://username:apikey@shop.local/api/articles'
my $url = $protocol . $user . ':' . $apikey . '@' . $host . $path . $object;

my $tx = $ua->$_request_with_digest_auth(get => $url);
my $value = $tx->res->json;

May this post help all the people having a problem with the Shopware API or with Mojo::UserAgent::DigestAuth.

Elasticsearch with Mojo::UserAgent

Trying to find a more flexible and scalable solution for my key-value backends with more than 10 millions of keys each, stored in read-only dictd now, I wanted to try Elasticsearch.

Installation on my development station (a MacBook Air) was easy.

Next step is a perlish interface.

There are modules on CPAN interfacing to Elasticsearch, but with tons of dependencies, and a lot of documentation to learn.

What's about using the JSON REST-API of Elasticsearch directly with one of the perlish user agents?

Having Mojolicious installed on all machines, and knowing Mojo::UserAgent very well, I gave it a try.

Some minutes later I had it working:

use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;

# insert a record
my $tx = $ua->post('http://localhost:9200/nomen/child' => json =>
 "taxon"  => "Rosa",
 "rank"   => "Genus",
 "child"  => "Rosa canina",
 "system" => "gbif"

# query
my $result = $ua->post('http://localhost:9200/_search' => json => {
    "query" => {
        "query_string" => {
            "query"  => "Spermophilus",
            "fields" => ["taxon","child"]

It allows to use perl structures directly with the built-in JSON converter.

Easy and simple.

Next will be scalability and performance tests.

Sonntag, 19. Juli 2015

Four Shades of Devel::NYTProf

Sometimes a question or remark helps us solving another problem, which has absolutely nothing to do with the question.

The question was about the colors in the reports of Devel::NYTProf. The colors are green, yellow, orange and red, where green means good and red means bad.

These colors appear in the columns "Statements" and "Time on line" of the reports. To illustrate the coloring I found a nice piece of code where nearly all combinations appear:

Line State
on line
Calls Time
in subs
5112µsfor my $j ($bmin..$bmax) {
5263µs  $bj = $b->[$j];
5363µs  unless (defined $positions->{$bj}) {
5421µs    $Vs->[$j] = $S;
5528µs    next;
56  }
5741µs  $y = $positions->{$bj};
5841µs  $u = $S & $y; # [Hyy04]
5943µs  $S = ($S + $u) | ($S - $u); # [Hyy04]
6043µs  $Vs->[$j] = $S;

The different colors are determined by the difference of the values to the median, measured in median absolute deviations. The colors in the column "Statements" are calculated independent from the column "Time on line". This explains, why the first column can be red while the second is green, and why the first column can be green and the second red.

In the above report we see that the body of the for loop is executed 6 times. The lines in the unless body is executed 2 times, and the part outside 4 times.

The unless condition including the body needs 12µs. The intention of the unless construction is an optimization. So let us estimate how much time this safes. The second part (lines 57-60) needs 8µs for 4 executions, that's 2µs per execution. If we can avoid lines 53-55 we need 2 more executions of lines 57-60. That would theoretically mean minus 12µs plus 4µs.

Of course we need to care for the undefined case. And we do not longer need the caching in line 52.

We get now the following report:

Line State
on line
Calls Time
in subs
521700nsfor my $j ($bmin..$bmax) {
5362µs  $y = $positions->{$b->[$j]} // 0;
546900ns  $u = $S & $y; # [Hyy04]
5561µs  $S = ($S + $u) | ($S - $u); # [Hyy04]
5663µs  $Vs->[$j] = $S;

This looks a lot shorter. We eliminated 5 lines of 11 lines. The total execution time is now below 8µs. This is less than one third of the 25µs before the tuning.

The code is from LCS::BV which now is another bit faster than after Tuning Algorithm::Diff.

Samstag, 20. Juni 2015

Minimal Coding versus Indirection

Yesterday I tried to use a C implementation of the LCS problem within Perl via XS.

Reuse an existing library (libmba) sounded nice. It has good documentation and minimal foreign dependencies. The promise for libmba::diff is working for sequences of anything.

First problem was to get it compiling, stepping from one missing include to the next. Also had to patch a source file, because of some incompatible code for printing error messages.

Per default libmba::diff works on byte-strings. The comparison with String::Similarity showed only 40% of the speed of String::Similarity. Both use the same algorithm ["An O(ND) Difference Algorithm and its Variations", Eugene Myers,
 Algorithmica Vol. 1 No. 2, 1986, pp. 251-266;], both written in C, both interfaced from Perl via XS.

But ok, fast enough for the first step, can tune later.

In the next step I wanted to support array references instead of strings as input. This needs to call libmba::diff() with references of two subroutines, one for comparing two elements, one for indexing an element. Here I lost (or wasted) some time getting it compiled without warnings (see on Stackoverflow).

The speed slowed down now to 24%, 0.237 MHz versus 1 MHz (i.e. 1 million cases processed per second) of String::Similarity. 

What's the difference in detail?

Let's measure rough figures.

The lines of code of String::Similarity:

~/github/String-Similarity-1.04$ cloc --by-file-by-lang .
      17 text files.
      13 unique files.                              
      17 files ignored. v 1.62  T=0.20 s (45.6 files/s, 10777.8 lines/s)

[ details snipped ]

Language           files          blank        comment           code
make                   1            231            106            702
C                      2            127            193            570
YAML                   2              0              0             41
JSON                   1              0              0             39
Perl                   2             31             24             38
C/C++ Header           1              7             14              4
SUM:                   9            396            337           1394

And the same of LCS::XS:

~/github/LCS-XS$ cloc --by-file-by-lang .
      44 text files.
      42 unique files.                              
       8 files ignored. v 1.62  T=0.33 s (115.4 files/s, 37434.7 lines/s)

[ details snipped ]

Language            files          blank        comment           code
C/C++ Header           25           1109           3652           4317
C                       6            230            252           1376
make                    1            231            111            704
Perl                    3             70             63            146
JSON                    1              0              0             39
YAML                    2              1              1             28
SUM:                   38           1641           4079           6610

The lines of code in C are 1376 versus 570 for the same algorithm. The reason is more abstraction, more subroutines, more internal interfaces, more flexibility. This also causes more documentation for interfaces and more time to read the documentation and master the interfaces.

The size of the compiled objects is 7 KB versus 33 KB.

And a difference of factor 4 in speed.

Mittwoch, 10. Juni 2015

Porting some modules to Perl6

Playing around with Perl6 since 2008 I wanted to port two of my CPAN distributions, Set::Similarity and Bag::Similarity, to Perl6. It would be a nice example of using the Perl6 built-in types Set and Bag (also known as multisets).

Last year I visited the Austrian Perl Workshop and also stayed the additional two days on the hackaton. My intention was to discuss details of the long missing NFG (Normalization Form Grapheme) with the core developers, but they spent their time discussing GLR (Grand List Refactor/Redesign).

So I worked on the port of Set::Similarity and Bag::Similarity.

These modules are reference implementations of similarity coefficients known under the names Dice, Jaccard, cosine and overlap.

Input are always two sequences, which are tokenized if necessary, transformed to sets (arrays of unique tokens in Perl6) or bags (hashes in Perl5).

In case of strings as input they are tokenized into single characters by default, or into n-grams if n is specified.

Splitting into single characters is easy in Perl6, if you know that the method is named comb.

$ perl6
> say "banana".comb;
b a n a n a
Transforming it to a set

> say "banana".comb.Set;
set(b, a, n)
With sets we can use the intersection operator (a logical and reduction)

> say "banana".comb.Set (&) "anna".comb.Set;
set(a, n)
But Perl6 understands it shorter

> say "banana".comb (&) "anna".comb;
set(a, n)
N-grams are not so obvious (but there is more than one way ...)

> my $w = 4;'wollmersdorfer'.comb.rotor($w,$w-1).map({[~] @$_})
woll ollm llme lmer mers ersd rsdo sdor dorf orfe rfer
The Dice coefficient is defined as count(intersection(A,B)) / (count(A) + count(B)). With the pieces from above we can code:

> my $a="banana".comb; my $b="anna".comb; say ($a (&) $b)/($a.Set + $b.Set)
Very short, isn't it? Note, that Perl6 automatically uses the number of elements of a set in the context of numeric operations.
Next the Jaccard coefficient is defined as count(intersection(A,B)) / count(union(A,B)). If the operator for intersection is the logical and, then the operator for the union should be the logical or. Let's try.

> my $a = 'banana';my $b = 'anna'; my $union = ($a.comb (|) $b.comb)
set(b, a, n)
Put the pieces together:

> y $a = 'banana';my $b = 'anna'; my $jaccard = ($a.comb (&) $b.comb) / ($a.comb (|) $b.comb);
Understanding the code for for cosine and overlap is left as an exercise to the reader:

my $cosine = ($a.comb (&) $b.comb) / ($a.comb.Set.elems.sqrt * $b.comb.Set.elems.sqrt);
my $overlap = ($a.comb (&) $b.comb) / ($a.comb.Set.elems, $b.comb.Set.elems).min;
Is this short code really worth own modules? No, it isn't. Thus I contributed it to Perl6-One-Liners as chapter Text analysis.
Many thanks to the Perl6 hackers answering my questions.

Tuning Algorithm::Diff


There always is a need for speed. During tuning one of my applications in the field of OCR postprocessing, the profiler identified the LCS (Longest Common Subsequence) routine as the most time consuming part. But the big question was: Is it possible to make the already fast Algorithm::Diff and Algorithm::Diff::XS faster.

Reading the code again and again, trying some micro-optimizations there was only small progress.

OK, don't twiddle, try better algorithms.

So I spent a lot of time reading papers, then trying to implement the described algorithms. Each implementation took some time to debug. Some do not work as they should, i.e. they fail test cases. Some are slow in Perl.

After a while I had a collection of various implementations of the so called Hunt-Szymansky algorithm and improvements of it. This is the same algorithm Algorithm::Diff is based on.

During tuning a bit-vector implementation I compared it to my already tuned version of Algorithm::Diff, and merged a part of code of the bit-vector variant back. The result of the benchmark surprised me, as the now tuned Algorithm::Diff was only 10 percent slower than the bit-vector variant.

This motivated me to publish it as a reliable and portable pure Perl module under the name LCS::Tiny.

Here are the steps making it so fast.

Remove features

Algorithm::Diff::LCSidx has two additional parameters:

- a code ref to a key generating function
- a code ref to a compare function

Usually they would not be specified and the defaults are used, but there is still overhead for checking and calling.


When Algorithm::Diff::LCSidx is called then four subroutines are involved. Inlining the subroutines into one subroutine reduces the lines of code and runtime overhead. Of course the code would maybe become less readable.


Try to use less instructions. Use Devel::NYTProf and benchmarks to check the changes. This sort of tuning can be very time consuming.

Here is an example from Algorithm::Diff:

sub _withPositionsOfInInterval
    my $aCollection = shift;    # array ref
    my $start       = shift;
    my $end         = shift;
    my $keyGen      = shift;
    my %d;
    my $index;
    for ( $index = $start ; $index <= $end ; $index++ )
        my $element = $aCollection->[$index];
        my $key = &$keyGen( $element, @_ );
        if ( exists( $d{$key} ) )
            unshift ( @{ $d{$key} }, $index );
            $d{$key} = [$index];
    return wantarray ? %d : \%d;

This is how it looks after squeezing:

  my $bMatches;
  unshift @{ $bMatches->{$b->[$_]} },$_ for $bmin..$bmax;

Resulting module - LCS::Tiny

As bit-vector solutions can have portability problems, I decided to release the now very fast non-bit-vector implementation on CPAN as LCS::Tiny.

Additional speed - LCS::BV

After release of LCS::Tiny I got back to work on the bit-vector implementation, to make it portable for different lengths of words, and support sequences longer than word-length (32 or 64 bit).

Also I tried out improvements for a part of the algorithm. One resulted in a significant accelleration and is now implemented in the released module LCS::BV.


Short sequences

A nice and typical average case for spelling-correction is the comparison of 'Chrerrplzon' with 'Choerephon'.

use Algorithm::Diff;
use Algorithm::Diff::XS;
use String::Similarity;

use Benchmark qw(:all) ;

use LCS::Tiny;
use LCS;
use LCS::BV;

my @data = (

my @strings = qw(Chrerrplzon Choerephon);

if (1) {
    cmpthese( 50_000, {
       'LCS' => sub {
       'LCSidx' => sub {
        'LCSXS' => sub {
        'LCSbv' => sub {
        'LCStiny' => sub {
        'S::Sim' => sub {
Gives the following result:

$ perl 50_diff_bench.t 
            (warning: too few iterations for a reliable count)
             Rate     LCS  LCSidx LCStiny   LCSXS   LCSbv  S::Sim
LCS        7022/s      --    -71%    -82%    -87%    -87%    -99%
LCSidx    23923/s    241%      --    -40%    -55%    -57%    -98%
LCStiny   39683/s    465%     66%      --    -25%    -29%    -96%
LCSXS     53191/s    657%    122%     34%      --     -4%    -95%
LCSbv     55556/s    691%    132%     40%      4%      --    -94%
S::Sim  1000000/s  14140%   4080%   2420%   1780%   1700%      --
Here the pure Perl LCS::BV beats Algorithm::Diff::XS (written in C).

Worst case

This example shows the weakness of the algorithms against more complicated cases.

my @data3 = ([qw/a b d/ x 50], [qw/b a d c/ x 50]);
$ perl 50_diff_bench.t 
            (warning: too few iterations for a reliable count)
          Rate     LCS LCStiny  LCSidx   LCSbv   LCSXS
LCS     35.2/s      --    -33%    -38%    -97%    -99%
LCStiny 52.7/s     50%      --     -7%    -95%    -98%
LCSidx  56.5/s     60%      7%      --    -94%    -98%
LCSbv   1020/s   2795%   1836%   1705%      --    -67%
LCSXS   3125/s   8766%   5828%   5428%    206%      --
Here the XS is fastest, because the stressed part is in pure C. The bit-vector implementation scales good, because of the parallelisms. The others fall back to worst case performance near the traditional implementation.

Donnerstag, 4. Juni 2015

Loopify Recursions

During development it is sometimes convenient to use recursions. But in Perl recursions don't scale very well, they are consuming memory.

It happened to me, that one of my CPAN modules LCS made problems after extending the  test cases to large sequences (more than 200 elements) as input. Avoiding the warning with 

no warnings 'recursion';

is easy but not a good solution.

Here is the code of the traditional LCS algorithm using recursion. The first subroutine calculates the match matrix $c using a nested loop. At the end it calls _print_lcs() to backtrace the resulting LCS (Longest Common Sequence). See the full source code at github.

sub LCS {
  my ($self,$X,$Y) = @_;

  my $m = scalar @$X;
  my $n = scalar @$Y;

  my $c = [];
  my ($i,$j);
  for ($i=0;$i<=$m;$i++) {
    for ($j=0;$j<=$n;$j++) {
  for ($i=1;$i<=$m;$i++) {
    for ($j=1;$j<=$n;$j++) {
      if ($X->[$i-1] eq $Y->[$j-1]) {
        $c->[$i][$j] = $c->[$i-1][$j-1]+1;
      else {
        $c->[$i][$j] = max($c->[$i][$j-1], $c->[$i-1][$j]);
  my $path = $self->_print_lcs($X,$Y,$c,$m,$n,[]);
  return $path;

sub max {
  ($_[0] > $_[1]) ? $_[0] : $_[1];

sub _print_lcs {
  my ($self,$X,$Y,$c,$i,$j,$L) = @_;

  if ($i==0 || $j==0) { return ([]); }
  if ($X->[$i-1] eq $Y->[$j-1]) {
    $L = $self->_print_lcs($X,$Y,$c,$i-1,$j-1,$L);
    push @{$L},[$i-1,$j-1];
  elsif ($c->[$i][$j] == $c->[$i-1][$j]) {
    $L = $self->_print_lcs($X,$Y,$c,$i-1,$j,$L);
  else {
    $L = $self->_print_lcs($X,$Y,$c,$i,$j-1,$L);
  return $L;

Each recursion can also be implemented as a loop. So it should be possible to rewrite the subroutine _print_lcs() without calling itself.

First let's analyze the control flow. The recursion begins with the maximum values of $i=$m, and $j=$n, the length of the two sequences, which now are also indices of the matching matrix $c, i.e. the subroutine begins in bottom right corner and counts down $i and $j to the upper left corner of the matrix.

Along the way matching points are recorded in the array $L.

The recursive solutions has the convenience, that first the subresult can be calculated, and after that the current match is pushed at the end of the subresults. This way the resulting array is sorted from lower to higher while the recursion tracked back from higher to lower.

The loopified solution also counts down but now has to use unshift for recording the match points. See how simple the code looks now (full version at github):

sub _lcs {
  my ($self,$X,$Y,$c,$i,$j,$L) = @_;

  while ($i > 0 && $j > 0) {
    if ($X->[$i-1] eq $Y->[$j-1]) {
      unshift @{$L},[$i-1,$j-1];
    elsif ($c->[$i][$j] == $c->[$i-1][$j]) {
    else {
  return $L;