Wollmers about Perl, coding and IT: Similarity

Freitag, 20. November 2015

Testing fuzzy approximation with Test::Deep

Maybe the title of this post is misleading, but I will explain what I mean.

Last year I wrote a series of LCS (longest common subsequence) modules for speed and rock solid quality. Then I had a need for smarter alignment and added similarity to the algorithm, now released as LCS::Similar on CPAN.

The hardest part in developing LCS and friends was sorting out bogus algorithms from a collection of nearly 100 publications with pseudocode and sometimes code. It needed some time to get all cornercases as testcases.

But how to test an algorithm which returns one of maybe more than one optimal ("longest") solutions? With a method allLCS() returning all solutions and comparing the single solution against them.

Here is what the output looks like:


my $cases = {
  'case_01' => {
    'allLCS' => [
      [ [ 0, 0 ], [ 2, 2 ] ],
      [ [ 1, 0 ], [ 2, 2 ] ]
    ],
    'LCS' =>
      [ [ 0, 0 ], [ 2, 2 ] ],
  },
  'case_02' => {
    'allLCS' => [
      [ [ 1, 1 ], [ 3, 2 ], [ 4, 3 ], [ 6, 4 ] ],
      [ [ 1, 1 ], [ 3, 2 ], [ 5, 3 ], [ 6, 4 ] ],
      [ [ 2, 0 ], [ 3, 2 ], [ 4, 3 ], [ 6, 4 ] ],
      [ [ 2, 0 ], [ 3, 2 ], [ 5, 3 ], [ 6, 4 ] ],
      [ [ 2, 0 ], [ 4, 1 ], [ 5, 3 ], [ 6, 4 ] ]
    ],
    'LCS' =>
      [ [ 1, 1 ], [ 3, 2 ], [ 5, 3 ], [ 6, 4 ] ]
  },
};

The result of LCS is valid if it is in the results of allLCS.

At first I wrote the comparison myself but Test::Deep already provides it:


  use Test::More;
  use Test::Deep

  use LCS::Tiny;

  cmp_deeply(
    LCS::Tiny->LCS(\@a,\@b),
    any(@{$object->allLCS(\@a,\@b)} ),
    "LCS::Tiny->LCS $a, $b"
  );

  done_testing;

That's how we can test approximation. But if the comparison in LCS changes from eq to a similarity function returning values between 0 and 1, than additional pairs matched with similarity appear in the result. For a comparison we need any of allLCS is a subset of LCS.

This was the first approach which works not reliable:


use Test::Deep;

# THIS DOES NOT WORK
sub cmp_any_superset {
    my ($got, $sets) = @_;

    for my $set (@$sets) {
      my $ok = cmp_deeply(
        $got, supersetof(@$set)
      );
      return $ok if $ok;
    }
};

Trying around and reading the documentation of Test::Deep again and again, nearly giving up and write it myself, I gave one possible interpretation of the docs a chance:

  cmp_deeply(
    $lcs,
    any(
      $lcs,
      supersetof( @{$all_lcs} )
    ),
    "Example $example, Threshold $threshold"
  );

It works.

Mittwoch, 10. Juni 2015

Porting some modules to Perl6

Playing around with Perl6 since 2008 I wanted to port two of my CPAN distributions, Set::Similarity and Bag::Similarity, to Perl6. It would be a nice example of using the Perl6 built-in types Set and Bag (also known as multisets).

Last year I visited the Austrian Perl Workshop and also stayed the additional two days on the hackaton. My intention was to discuss details of the long missing NFG (Normalization Form Grapheme) with the core developers, but they spent their time discussing GLR (Grand List Refactor/Redesign).

So I worked on the port of Set::Similarity and Bag::Similarity.

These modules are reference implementations of similarity coefficients known under the names Dice, Jaccard, cosine and overlap.

Input are always two sequences, which are tokenized if necessary, transformed to sets (arrays of unique tokens in Perl6) or bags (hashes in Perl5).

In case of strings as input they are tokenized into single characters by default, or into n-grams if n is specified.

Splitting into single characters is easy in Perl6, if you know that the method is named comb.


$ perl6
> say "banana".comb;
b a n a n a

Transforming it to a set


> say "banana".comb.Set;
set(b, a, n)

With sets we can use the intersection operator (a logical and reduction)


> say "banana".comb.Set (&) "anna".comb.Set;
set(a, n)

But Perl6 understands it shorter


> say "banana".comb (&) "anna".comb;
set(a, n)

N-grams are not so obvious (but there is more than one way ...)


> my $w = 4;'wollmersdorfer'.comb.rotor($w,$w-1).map({[~] @$_})
woll ollm llme lmer mers ersd rsdo sdor dorf orfe rfer

The Dice coefficient is defined as count(intersection(A,B)) / (count(A) + count(B)). With the pieces from above we can code:


> my $a="banana".comb; my $b="anna".comb; say ($a (&) $b)/($a.Set + $b.Set)
0.4

Very short, isn't it? Note, that Perl6 automatically uses the number of elements of a set in the context of numeric operations.
Next the Jaccard coefficient is defined as count(intersection(A,B)) / count(union(A,B)). If the operator for intersection is the logical and, then the operator for the union should be the logical or. Let's try.


> my $a = 'banana';my $b = 'anna'; my $union = ($a.comb (|) $b.comb)
set(b, a, n)

Put the pieces together:


> y $a = 'banana';my $b = 'anna'; my $jaccard = ($a.comb (&) $b.comb) / ($a.comb (|) $b.comb);
0.666667

Understanding the code for for cosine and overlap is left as an exercise to the reader:


my $cosine = ($a.comb (&) $b.comb) / ($a.comb.Set.elems.sqrt * $b.comb.Set.elems.sqrt);
my $overlap = ($a.comb (&) $b.comb) / ($a.comb.Set.elems, $b.comb.Set.elems).min;

Is this short code really worth own modules? No, it isn't. Thus I contributed it to Perl6-One-Liners as chapter Text analysis.
Many thanks to the Perl6 hackers answering my questions.