Wollmers about Perl, coding and IT: performance

Posts mit dem Label performance werden angezeigt. Alle Posts anzeigen

Montag, 2. Mai 2022

Tuning Levenshtein Distance

On 2022-01-19 I released an improved version of Text::Levenshtein::BV. BV means Bit Vector, because it uses a bit-parallel algorithm with runtime complexity of O(ceil(m/w)*n), where w is the width of a CPU register. It's 20 times faster than the popular Text::Levenshtein using the traditional or simple algorithm with O(n*n) on strings with length 10.

Last summer I developed a variant with the simple algorithm named Levenshtein::Simple, because working on arrays is more flexible than strings. A string is easy to split into an array of characters. The same module could be used for arrays of characters, graphemes, words, lines.

Another reason is, that the Levenshtein alignment (shortest edit script) is more important than just the distance. Most modules on CPAN calculate only the distance for similarity, which is easier and faster with LCS.

It was a surprise that my Levenshtein::Simple without tuning was 50% faster than Text::Levenshtein. During development of an XS implementation I tuned Levenshtein::Simple to beat Text::Fuzzy::PP.

These are the results for the pure Perl (PP) implementations measured in Rate per second:

                         N=10
Text::WagnerFischer     4,931
Text::Levenshtein       6,339
Text::Fuzzy::PP        11,164
Levenshtein::Simple    27,926
Text::Levenshtein::BV 118,153

There are many XS versions on CPAN for calculating Levenshtein distance. Only three of them work without problems. They all use the simple algorithm. Text::Levenshtein::Flexible uses the C implementation of PostgreSQL, but has tricky code. Perl strings at the XS interface can be bytes or UTF-8. In C UTF-8 can be processed by iterating over UTF-8 or convert UTF-8 to 32-bit integers (numerical code points). Working with integers including the conversion is faster.

Measured via Perl the benchmarks are (TL::BVXS is mine, simple is my implementation just for fun, uni is with Unicode strings, bytes with bytes):

                            N=10
Text::Levenshtein::XS    347,539
TL::Flexible           2,725,258
Text::Fuzzy            3,026,401
TL::BVXS_simple        6,056,132
TL::BVXS_uni           8,495,407
TL::BVXS_bytes        12,743,110

The same simple algorithm, also implemented in C is 2 times faster than that of PostgreSQL. BV is 4 times faster (on longer strings the difference is larger). Compared to Text::Levenshtein it's 2000 times faster.

In pure C without the XS interface distance on bytes works at a rate of 25 Mio./second, codepoints ~10 % slower.

Optimising for fun.

Montag, 25. April 2022

Benchmark UTF-8 decoding in Perl

In Perl there are two modules on CPAN for UTF-8 decoding:

- Encode

- Unicode::UTF-8

Unicode::UTF-8 claims to be faster. I was interested how fast it is compared to my own implementation in pure C to decode utf8 (1-6 byte, needing 31 bits) into 32-bit code points.

UTF-8 became a bottleneck and is worth to have a look at. New SIMD implementations can validate up to 4 GB UTF-8 per second with AVX512.

That's the result of a quick benchmark:

$octets: Chſerſplzon chars: 11 bytes 13
                    Rate        Encode Unicode::UTF8      TL::BVXS
Encode         1927529/s            --          -83%          -89%
Unicode::UTF8 11143773/s          478%            --          -36%
TL::BVXS      17311395/s          798%           55%            --

$octets: राज्यराज्य chars: 9 bytes 27
                    Rate        Encode Unicode::UTF8      TL::BVXS
Encode         1592888/s            --          -83%          -90%
Unicode::UTF8  9466311/s          494%            --          -42%
TL::BVXS      16287053/s          922%           72%            --

Mine is fastest with ~215 MB/s but still is far away from SIMD solutions. In my use case the decoding consumes 35 % of the execution time. But SIMD would not help much for short strings.

Mittwoch, 14. Oktober 2015

Install cperl with perlbrew

A month ago Reini Urban released cperl-5.22.1 with promising features. It should be faster and consume less memory than the original perl. See here the details and status.

Maybe in one of the next releases cperl will have faster arrays and hashes. At the moment creation of hash or array elements is fast, i.e. fast enough for most applications, but a limiting factor e.g. for the data transport between Perl and C via XS.

Let's try.

Fortunately I found this article Прецизионные бенчмарки Perl containing instructions how to install cperl with perlbrew.

Of course you need perlbrew installed and well configured.

# create a new directory
$ mkdir cperl
$ cd cperl

# download cperl
$ wget https://github.com/perl11/cperl/archive/cperl-5.22.1.tar.gz \
   -O cperl-cperl-5.22.1.tar.gz

# brew it
$ perlbrew install --as=cperl-5.22.1 $(pwd)/cperl-cperl-5.22.1.tar.gz

# list installed perls
$ perlbrew list
  perl-5.16.3
  perl-5.18.2
* perl-5.20.1
  perl-5.21.11
  perl-5.22.0
  cperl-5.22.1

# switch to use cperl
$ perlbrew use cperl-5.22.1
$ perl -v

This is cperl 5, version 22, subversion 1 (v5.22.1c) built for darwin-2level

Easy, isn't it?

Sonntag, 19. Juli 2015

Four Shades of Devel::NYTProf

Sometimes a question or remark helps us solving another problem, which has absolutely nothing to do with the question.

The question was about the colors in the reports of Devel::NYTProf. The colors are green, yellow, orange and red, where green means good and red means bad.

These colors appear in the columns "Statements" and "Time on line" of the reports. To illustrate the coloring I found a nice piece of code where nearly all combinations appear:

Line	State ments	Time on line	Code
51	1	2µs	for my $j ($bmin..$bmax) {
52	6	3µs	$bj = $b->[$j];
53	6	3µs	unless (defined $positions->{$bj}) {
54	2	1µs	$Vs->[$j] = $S;
55	2	8µs	next;
56			}
57	4	1µs	$y = $positions->{$bj};
58	4	1µs	$u = $S & $y; # [Hyy04]
59	4	3µs	$S = ($S + $u) \| ($S - $u); # [Hyy04]
60	4	3µs	$Vs->[$j] = $S;
61			}

The different colors are determined by the difference of the values to the median, measured in median absolute deviations. The colors in the column "Statements" are calculated independent from the column "Time on line". This explains, why the first column can be red while the second is green, and why the first column can be green and the second red.

In the above report we see that the body of the for loop is executed 6 times. The lines in the unless body is executed 2 times, and the part outside 4 times.

The unless condition including the body needs 12µs. The intention of the unless construction is an optimization. So let us estimate how much time this safes. The second part (lines 57-60) needs 8µs for 4 executions, that's 2µs per execution. If we can avoid lines 53-55 we need 2 more executions of lines 57-60. That would theoretically mean minus 12µs plus 4µs.

Of course we need to care for the undefined case. And we do not longer need the caching in line 52.

We get now the following report:

Line	State ments	Time on line	Code
52	1	700ns	for my $j ($bmin..$bmax) {
53	6	2µs	$y = $positions->{$b->[$j]} // 0;
54	6	900ns	$u = $S & $y; # [Hyy04]
55	6	1µs	$S = ($S + $u) \| ($S - $u); # [Hyy04]
56	6	3µs	$Vs->[$j] = $S;
57			}

This looks a lot shorter. We eliminated 5 lines of 11 lines. The total execution time is now below 8µs. This is less than one third of the 25µs before the tuning.

The code is from LCS::BV which now is another bit faster than after Tuning Algorithm::Diff.

Samstag, 20. Juni 2015

Minimal Coding versus Indirection

Yesterday I tried to use a C implementation of the LCS problem within Perl via XS.

Reuse an existing library (libmba) sounded nice. It has good documentation and minimal foreign dependencies. The promise for libmba::diff is working for sequences of anything.

First problem was to get it compiling, stepping from one missing include to the next. Also had to patch a source file, because of some incompatible code for printing error messages.

Per default libmba::diff works on byte-strings. The comparison with String::Similarity showed only 40% of the speed of String::Similarity. Both use the same algorithm ["An O(ND) Difference Algorithm and its Variations", Eugene Myers,

Algorithmica Vol. 1 No. 2, 1986, pp. 251-266;], both written in C, both interfaced from Perl via XS.

But ok, fast enough for the first step, can tune later.

In the next step I wanted to support array references instead of strings as input. This needs to call libmba::diff() with references of two subroutines, one for comparing two elements, one for indexing an element. Here I lost (or wasted) some time getting it compiled without warnings (see on Stackoverflow).

The speed slowed down now to 24%, 0.237 MHz versus 1 MHz (i.e. 1 million cases processed per second) of String::Similarity.

What's the difference in detail?

Let's measure rough figures.

The lines of code of String::Similarity:


~/github/String-Similarity-1.04$ cloc --by-file-by-lang .
      17 text files.
      13 unique files.                              
      17 files ignored.

http://cloc.sourceforge.net v 1.62  T=0.20 s (45.6 files/s, 10777.8 lines/s)

[ details snipped ]

---------------------------------------------------------------------
Language           files          blank        comment           code
---------------------------------------------------------------------
make                   1            231            106            702
C                      2            127            193            570
YAML                   2              0              0             41
JSON                   1              0              0             39
Perl                   2             31             24             38
C/C++ Header           1              7             14              4
---------------------------------------------------------------------
SUM:                   9            396            337           1394
---------------------------------------------------------------------

And the same of LCS::XS:


~/github/LCS-XS$ cloc --by-file-by-lang .
      44 text files.
      42 unique files.                              
       8 files ignored.

http://cloc.sourceforge.net v 1.62  T=0.33 s (115.4 files/s, 37434.7 lines/s)

[ details snipped ]

----------------------------------------------------------------------
Language            files          blank        comment           code
----------------------------------------------------------------------
C/C++ Header           25           1109           3652           4317
C                       6            230            252           1376
make                    1            231            111            704
Perl                    3             70             63            146
JSON                    1              0              0             39
YAML                    2              1              1             28
----------------------------------------------------------------------
SUM:                   38           1641           4079           6610
----------------------------------------------------------------------

The lines of code in C are 1376 versus 570 for the same algorithm. The reason is more abstraction, more subroutines, more internal interfaces, more flexibility. This also causes more documentation for interfaces and more time to read the documentation and master the interfaces.

The size of the compiled objects is 7 KB versus 33 KB.

And a difference of factor 4 in speed.