Sunday, 14 February 2016

Credit the first or last uploader on meta::cpan?

Credit problem


This post is related to the post The "credit the last uploader" problem by RJBS.

A related thread on Twitter caused, IMHO, more misunderstandings than it resolved. Twitter is not the best medium for discussing complex problems.

RJBS nevertheless did a deep analysis in the post referenced above.

In summary, I mostly agree with his opinions. But the suggested solution

MetaCPAN Web should stop showing the name and image of the latest uploader as prominently, and should show the first-come user instead.

is, IMHO, not an improvement. It solves one problem and creates a new one.

Persons in the life cycle


At a minimum, a distribution on CPAN has an author and an uploader. In many cases the author is the same person as the uploader. Maybe the distribution has users. There can also be persons submitting issues or patches.

Distributions can also have co-maintainers, persons with the permission to upload new versions of the distribution.

Persons without permission can upload non-authorized versions. They are available but marked as non-authorized.

During the life of a distribution the maintainers (people with upload permission) can change.

In my opinion everybody contributing to a distribution should be attributed, including submitters of issues. Even bad contributions which are removed later improve the quality. In free software with unpaid developers, honor and attribution are important for motivation.

Prominent person


The release page of a distribution on meta::cpan shows the name and image of one person, the last uploader, in the top right corner.

Now RJBS argues that the first uploader of a distribution is more important.

Maybe, if the first uploader is still very active. But what does "very active" mean, and how can meta::cpan measure it? It can't.

Imagine distributions first uploaded 15 years ago. Meanwhile the maintainers changed, features were added, and the code was completely rewritten. Which person should be displayed?

Only one


If only one can be displayed, who should it be?

In my role as a user I prefer the last uploader. For me it is important to get a feeling for the quality, which is best associated with the picture and the name of a person. This sounds like prejudice, and it is. Experience and trust are main factors in human decisions. The picture and name of a person work like the brand of a computer manufacturer.

But if person A writes most of the code and person B is the last uploader, shouldn't A be displayed? No; as a user I prefer B, because the last uploader did the last quality check.

Solution


Let's look at very successful solutions like GitHub. GitHub allows repositories owned by organisations (where all members have permissions) or by single persons. Organisations are not possible on PAUSE without changes.

Another idea inspired by GitHub could be to display miniature pictures of all relevant persons. GitHub displays all contributors; in the case of CPAN these could be all current persons with upload permission. I prefer this solution.

If there is no better solution, nothing should be changed. The change suggested by RJBS could harm the motivation of maintainers; some could feel it as a slap in the face.

In any case, there should be broad consensus, not a decision by one or a few persons.

Edit: Some hours after finishing this post I discovered a similar post by Neil Bowers: It takes a community to raise a CPAN module.

Tuesday, 19 January 2016

Porting LCS::BV to Perl 6 - a fast core loop for Algorithm::Diff

Christmas happened, and Perl 6 is now reasonably stable. It's time to use it for real problems, learn more about its pros and cons, and discover its elegance.

Porting my CPAN distribution LCS::BV is a nice task. It's short code, has full test coverage, and allows benchmarks. It can also be expected that LCS::BV (a bit-vector implementation of the longest common subsequence algorithm) will be faster than the Perl 6 port of Algorithm::Diff and the Perl 6 module Algorithm::LCS. LCS::BV is as fast as the XS version of Algorithm::Diff; see my blog post Tuning Algorithm::Diff.

Test dependencies

LCS algorithms are complex beasts. It's nearly impossible to implement them without good test cases. One problem is that these algorithms find a sequence of maximum length, but return only one of possibly several solutions. All possible solutions are provided by the slow CPAN distribution LCS and its method allLCS(), as explained in another blog post. So I needed to port this to Perl 6 first.

To compare the results during tests, Test::Deep is needed. That should be no problem; surely somebody has already ported it. Or not? No, there is no Test::Deep on modules.perl6.org. So I need to port this too, or at least the part I need. Or not?

This is the relevant part of the tests in Perl 5:


use Test::More;
use Test::Deep;

use LCS;
use LCS::BV;

# @a, @b are the input sequences; $a, $b their string forms (test label)
cmp_deeply(
    LCS::BV->LCS(\@a,\@b),
    any(@{LCS->allLCS(\@a,\@b)} ),
    "$a, $b"
);

done_testing;

Trying to write a subroutine from the bottom up, I got the following solution using any():


# say so $[[2,3],[2,4],] eqv $[[[0,1],[2,3],],[[0,1],[2,4],]].any;
  
# True if $result is eqv to any element of @$expected
# (the junction from .any autothreads through eqv)
my sub any_of ($result, $expected) {
  return so $result eqv $expected.any;
}
  
ok(
  any_of(
    LCS::BV::LCS($A, $B),
    LCS::All::allLCS($A, $B),
  ),
  "'$a', '$b'",
);

That's one example of how the features of Perl 6 can be worth half of CPAN.

Length of an integer

In Perl 5, bit operations are limited to the length of an integer. In most implementations this is 64 bits, but it can be 32 bits. If an array is mapped to bits, it needs ceil(array_length/integer_width) integers; e.g. a 100-element array needs ceil(100/64) = 2 integers on a 64-bit perl.

This needs the following code in Perl 5:


# number of bits in a native integer (~0 is all ones)
our $width = int 0.999+log(~0)/log(2);

use integer;
no warnings 'portable'; # for 0xffffffffffffffff

my $positions;

# $a ... array of elements, e.g. ['f','o','o']
# $amin..$amax ... index range of the elements to map

$positions->{$a->[$_]}->[$_ / $width] |= 1 << ($_ % $width) 
    for $amin..$amax;

# gives {'f' => [ 0b001 ], 'o' => [ 0b110 ], }

Doing this for the first time in Perl 6, I wondered how long its integers are and how to find out. It took me quite a while to arrive at the answer: they are arbitrarily long. That's awesome.
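
A quick check in the REPL (a sketch from my own experiments) shows there is no word size to run into:


> 1 +< 100
1267650600228229401496703205376
> (1 +< 100).base(2).chars
101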

This part ported to Perl 6:


my $positions;
$positions{$a[$_]} +|= 1 +< $_  for $amin..$amax;

This is approximately half the code. You can imagine that the rest of the code also gets simpler; it, too, shrinks to about half the size.

Just 30 lines

For rosettacode.org I adapted the code for strings and removed the usual prefix and suffix optimization; see Bit Vector:


sub lcs(Str $xstr, Str $ystr) {
    my ($a,$b) = ([$xstr.comb],[$ystr.comb]);

    # map each character of $a to a bit vector of its positions
    my $positions;
    for $a.kv -> $i,$x { $positions{$x} +|= 1 +< $i };

    # forward pass: compute one bit-vector row per character of $b
    my $S = +^0;
    my $Vs = [];
    my ($y,$u);
    for (0..+$b-1) -> $j {
        $y = $positions{$b[$j]} // 0;
        $u = $S +& $y;
        $S = ($S + $u) +| ($S - $u);
        $Vs[$j] = $S;
    }

    # backtrack through the stored rows to recover one LCS
    my ($i,$j) = (+$a-1, +$b-1);
    my $result = "";
    while ($i >= 0 && $j >= 0) {
        if ($Vs[$j] +& (1 +< $i)) { $i-- }
        else {
            unless ($j && +^$Vs[$j-1] +& (1 +< $i)) {
                $result = $a[$i] ~ $result;
                $i--;
            }
            $j--;
        }
    }
    return $result;
}

You can install LCS::BV and LCS::All via Panda or clone them from GitHub.
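
Installation via Panda follows the usual pattern (assuming the modules are registered in the ecosystem under exactly these names):

$ panda install LCS::BV
$ panda install LCS::All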

Summary

Perl 6 offers a lot of convenience and allows short, clear code. It's huge, but learning all the nice features is easier than I thought. Understanding e.g. any() by trying it out in the REPL is a magnitude faster than searching CPAN, reading docs, and trying out some distributions.
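
For instance, a quick REPL session (my own sketch) shows how a junction behaves:

> my $j = (1, 3, 5).any;
any(1, 3, 5)
> so 3 == $j
True
> so 4 == $j
False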


Friday, 20 November 2015

Testing fuzzy approximation with Test::Deep

Maybe the title of this post is misleading, but I will explain what I mean.

Last year I wrote a series of LCS (longest common subsequence) modules for speed and rock-solid quality. Then I needed smarter alignment and added similarity to the algorithm, now released as LCS::Similar on CPAN.

The hardest part in developing LCS and friends was sorting out the bogus algorithms from a collection of nearly 100 publications with pseudocode and sometimes code. It took some time to collect all the corner cases as test cases.

But how do you test an algorithm which returns only one of possibly several optimal ("longest") solutions? With a method allLCS() returning all solutions, and comparing the single solution against them.

Here is what the output looks like:


my $cases = {
  'case_01' => {
    'allLCS' => [
      [ [ 0, 0 ], [ 2, 2 ] ],
      [ [ 1, 0 ], [ 2, 2 ] ]
    ],
    'LCS' =>
      [ [ 0, 0 ], [ 2, 2 ] ],
  },
  'case_02' => {
    'allLCS' => [
      [ [ 1, 1 ], [ 3, 2 ], [ 4, 3 ], [ 6, 4 ] ],
      [ [ 1, 1 ], [ 3, 2 ], [ 5, 3 ], [ 6, 4 ] ],
      [ [ 2, 0 ], [ 3, 2 ], [ 4, 3 ], [ 6, 4 ] ],
      [ [ 2, 0 ], [ 3, 2 ], [ 5, 3 ], [ 6, 4 ] ],
      [ [ 2, 0 ], [ 4, 1 ], [ 5, 3 ], [ 6, 4 ] ]
    ],
    'LCS' =>
      [ [ 1, 1 ], [ 3, 2 ], [ 5, 3 ], [ 6, 4 ] ]
  },
};

The result of LCS is valid if it is in the results of allLCS.

At first I wrote the comparison myself but Test::Deep already provides it:


  use Test::More;
  use Test::Deep;

  use LCS::Tiny;

  # @a, @b are the input sequences; $object is an LCS instance
  cmp_deeply(
    LCS::Tiny->LCS(\@a,\@b),
    any(@{$object->allLCS(\@a,\@b)} ),
    "LCS::Tiny->LCS $a, $b"
  );

  done_testing;

That's how we can test approximation. But if the comparison in LCS changes from eq to a similarity function returning values between 0 and 1, then additional pairs matched by similarity appear in the result. For the comparison we now need: any element of allLCS is a subset of the LCS result.
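
For illustration, such a similarity function might look like this (a hypothetical sketch, not the comparator actually shipped with LCS::Similar):

# hypothetical measure: 1 for equal strings, otherwise the length of
# the common prefix relative to the longer string
sub similarity {
    my ($x, $y) = @_;
    return 1 if $x eq $y;
    my $min = length($x) < length($y) ? length($x) : length($y);
    my $common = 0;
    $common++ while $common < $min
        && substr($x, $common, 1) eq substr($y, $common, 1);
    my $max = length($x) > length($y) ? length($x) : length($y);
    return $max ? $common / $max : 0;
}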

This was the first approach, which does not work reliably:


use Test::Deep;

# THIS DOES NOT WORK: cmp_deeply() registers a test result on every
# call, so each non-matching candidate reports a spurious failure
sub cmp_any_superset {
    my ($got, $sets) = @_;

    for my $set (@$sets) {
      my $ok = cmp_deeply(
        $got, supersetof(@$set)
      );
      return $ok if $ok;
    }
};

After trying things out and reading the documentation of Test::Deep again and again, nearly giving up and writing it myself, I gave one possible interpretation of the docs a chance:


  cmp_deeply(
    $lcs,
    any(
      $lcs,
      supersetof( @{$all_lcs} )
    ),
    "Example $example, Threshold $threshold"
  );

It works.

Thursday, 15 October 2015

How bad can code be - Part 1: The Gordian Knot

How bad can code be?

At the beginning of my career as a developer, starting from scratch, I became responsible for the maintenance of a package of approximately 40 KLoC (thousand lines of code) of Assembler. It was called the "realtime controlling part", a sort of middleware controlling transactions, database recovery, asynchronicity, and communication.

It existed in three variants for four different applications. First person A developed it for application AA; then he became datacenter manager. Then person B forked it for the different requirements of HH and of BB+CC. In summary: 70 KLoC, partly similar, but not exactly the same.

The BB+CC version was the most complicated, with a special feature called preread. It guesses from the queue of incoming transactions which records should be read in advance, waits for all of them, and puts them into the application queue. The same happens in reverse order with the inserts and updates in the application's answer. And exactly this most complicated part was written by the not-so-skilled person C.

The preread module had around 4000 lines and was hard to read. No subroutines. Variables used as switches, i.e. storing the result of conditions, checked or changed somewhere in the code. Of course you need to initialize them. You need a good name within the allowed 8 characters. You must predefine them at the end of the source. And debugging is hell, because the more switches you use, the more possible combinations of states you have. Combined with spaghetti-style code jumping criss-cross around, it's a Gordian Knot. Lady programmer D removed two dozen bugs that arose in production. In my period I solved a similar number. One bug was never solved: sometimes the whole thing just went to sleep, waiting for nothing, hanging.

This was the worst code in production I had to work with.

Lessons learned:

  • do not give the most complicated task to the least skilled person
  • avoid switches
  • no literals in the code; define them as constants
  • use a structured programming style (needs discipline in Assembler)
  • subroutines should fit on one screen (24 x 80 at that time)
  • write reentrant code, even if it's not necessary

Fortunately I got one year of time and budget to rewrite this package.

Only the change to another network technology was strictly necessary, so just the part handling incoming and outgoing messages had to be rewritten. Second, it was also necessary to refine restart and recovery, as the architecture changed to distributed processing, i.e. the local organizations all got midrange servers and the communication changed.

I started with the communication interface as an isolated module, but the network servers were not available; they were planned for first integration tests six months later. So I learned to test with stubs and drivers simulating the other parts. In a team of 12 persons you speak with the others, even if you are working alone. One day person O, a genius but nearly autistic developer, said to me: "The whole thing is crap and completely wrong." Why? What's wrong? Should I do it this way? "No, it should be this way." He always answered with one short sentence. We always had short and loud discussions. After the discussions I returned to my room to think about the design. And then I visited him again.

With these ideas I wrote a dispatcher controlling the queues: requeuing ready tasks, queuing free tasks into the memory pool, activating the handlers. It had some 100 lines. This made it possible to reduce the control logic in the handlers. A handler could be reduced to: wait for work, do some stuff, send subrequests to other tasks and wait for the results, do some stuff, send the result to the requesting task, wait for new work. No need for variables, only maps to the control blocks in the queue, only constants in the task modules. Hey, this is reentrant.

Now the problem was to factor out the necessary logic from the old code and clean away the control logic. I did this during the Easter days at home. Imagine a living room with the compile printouts covering the whole floor, and me with markers in two different colors, marking the good and the bad. Of course it was not finished in one step; I refactored my own code 12 times along the way. The horrible preread module was reduced from 4000 lines to 100 lines of nearly linear, straight-ahead code. Other modules also shrank, and the variants disappeared.

This software was in production for nearly 20 years, without an outage and without change.

Lessons learned:

  • you need 2 very good developers in a team
  • not all developers in a team need good communication skills
  • learn from criticism, even from harsh criticism
  • advantage by restriction
  • less is more
  • refactor often
  • conform to quality standards and best practices

Gordian Knots can be solved.

Next in this series: two huge Perl projects




Wednesday, 14 October 2015

Install cperl with perlbrew

A month ago Reini Urban released cperl-5.22.1 with promising features. It should be faster and consume less memory than the original perl. See the details and status here.

Maybe one of the next releases of cperl will have faster arrays and hashes. At the moment the creation of hash or array elements is fast, i.e. fast enough for most applications, but it is a limiting factor e.g. for the data transport between Perl and C via XS.
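
A quick way to check such claims yourself, once both perls are brewed, is a sketch like this with the core Benchmark module (numbers vary by machine; run it under perl and under cperl and compare the rates):

use Benchmark qw(cmpthese);

# measure the cost of creating many hash and array elements
cmpthese(-2, {
    hash_insert  => sub { my %h; $h{$_} = $_ for 1 .. 1_000 },
    array_insert => sub { my @a; push @a, $_ for 1 .. 1_000 },
});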

Let's try.

Fortunately I found the article Прецизионные бенчмарки Perl ("Precision Perl benchmarks") containing instructions on how to install cperl with perlbrew.

Of course you need perlbrew installed and well configured.


# create a new directory
$ mkdir cperl
$ cd cperl

# download cperl
$ wget https://github.com/perl11/cperl/archive/cperl-5.22.1.tar.gz \
   -O cperl-cperl-5.22.1.tar.gz

# brew it
$ perlbrew install --as=cperl-5.22.1 $(pwd)/cperl-cperl-5.22.1.tar.gz

# list installed perls
$ perlbrew list
  perl-5.16.3
  perl-5.18.2
* perl-5.20.1
  perl-5.21.11
  perl-5.22.0
  cperl-5.22.1

# switch to use cperl
$ perlbrew use cperl-5.22.1
$ perl -v

This is cperl 5, version 22, subversion 1 (v5.22.1c) built for darwin-2level
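
Note that perlbrew use only affects the current shell. To make cperl the default, perlbrew's usual switching applies:

# make it the default perl
$ perlbrew switch cperl-5.22.1

# return to the system perl
$ perlbrew switch-off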


Easy, isn't it?

Tuesday, 25 August 2015

Interface to Shopware with Perl

At work we often face the problem of migrating data to web shops or other applications.

Perl is a great help for tasks like this: get things done.

The possible methods to maintain data via Perl scripts are:
  • manipulate the database, e.g. using DBIx::Class. This can be a maintenance hell if the schema of the main application changes.
  • compile CSV records, if the application can import them.
  • use an API, if the application supports it.
In the case of Shopware a REST API exists, but the documentation targets PHP and suggests using the PHP layers.

After some trial and error I had success using an ordinary web browser, entering a URL like this:

http://username:apikey@shop.local/api/articles

This should be straightforward with Mojo::UserAgent, but it returned errors.

Exploring the details, it took some time to spot the header Authorization: Digest in the browser communication.

Fortunately, with the help of search engines, I found the module Mojo::UserAgent::DigestAuth, which needed some trial and error to understand, but it works like this:


use Mojo::UserAgent::DigestAuth;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;

# $protocol, $user, $apikey, $host, $path, $object are set elsewhere,
# e.g. 'http://username:apikey@shop.local/api/articles'
my $url = $protocol . $user . ':' . $apikey . '@' . $host . $path . $object;

my $tx = $ua->$_request_with_digest_auth(get => $url);
my $value = $tx->res->json;
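
For robustness, the transaction should be checked before decoding the body; a minimal sketch using Mojo::UserAgent's standard error reporting:

# bail out on connection or HTTP errors
if (my $err = $tx->error) {
    die $err->{code}
        ? "$err->{code} response: $err->{message}"
        : "Connection error: $err->{message}";
}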

May this post help all the people having problems with the Shopware API or with Mojo::UserAgent::DigestAuth.

Elasticsearch with Mojo::UserAgent

Trying to find a more flexible and scalable solution for my key-value backends, each with more than 10 million keys and currently stored in read-only dictd, I wanted to try Elasticsearch.

Installation on my development station (a MacBook Air) was easy.

The next step is a perlish interface.

There are modules on CPAN interfacing with Elasticsearch, but they come with tons of dependencies and a lot of documentation to learn.

What about using the JSON REST API of Elasticsearch directly with one of the perlish user agents?

Having Mojolicious installed on all machines and knowing Mojo::UserAgent very well, I gave it a try.

Some minutes later I had it working:


use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;

# insert a record
my $tx = $ua->post('http://localhost:9200/nomen/child' => json =>
  {
    "taxon"  => "Rosa",
    "rank"   => "Genus",
    "child"  => "Rosa canina",
    "system" => "gbif"
  }
);

# query
my $result = $ua->post('http://localhost:9200/_search' => json => {
    "query" => {
        "query_string" => {
            "query"  => "Spermophilus",
            "fields" => ["taxon","child"]
        }
    }
})->res->json;


It allows the use of Perl structures directly with the built-in JSON converter.
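
The matching documents can then be read from the decoded structure; assuming the standard Elasticsearch response layout, something like this does the job:

# walk the hits array of the decoded response
for my $hit (@{ $result->{hits}{hits} }) {
    printf "%s -> %s (score %.2f)\n",
        $hit->{_source}{taxon},
        $hit->{_source}{child},
        $hit->{_score};
}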

Easy and simple.

Next will be scalability and performance tests.