Reuse an existing library (libmba) sounded nice. It has good documentation and minimal foreign dependencies. The promise for libmba::diff is working for sequences of anything.
First problem was to get it compiling, stepping from one missing include to the next. Also had to patch a source file, because of some incompatible code for printing error messages.
Per default libmba::diff works on byte-strings. The comparison with String::Similarity showed only 40% of the speed of String::Similarity. Both use the same algorithm ["An O(ND) Difference Algorithm and its Variations", Eugene Myers,
Algorithmica Vol. 1 No. 2, 1986, pp. 251-266;], both written in C, both interfaced from Perl via XS.
But ok, fast enough for the first step, can tune later.
In the next step I wanted to support array references instead of strings as input. This needs to call libmba::diff() with references of two subroutines, one for comparing two elements, one for indexing an element. Here I lost (or wasted) some time getting it compiled without warnings (see on Stackoverflow).
The speed slowed down now to 24%, 0.237 MHz versus 1 MHz (i.e. 1 million cases processed per second) of String::Similarity.
What's the difference in detail?
Let's measure rough figures.
The lines of code of String::Similarity:
~/github/String-Similarity-1.04$ cloc --by-file-by-lang .
17 text files.
13 unique files.
17 files ignored.
http://cloc.sourceforge.net v 1.62 T=0.20 s (45.6 files/s, 10777.8 lines/s)
[ details snipped ]
---------------------------------------------------------------------
Language files blank comment code
---------------------------------------------------------------------
make 1 231 106 702
C 2 127 193 570
YAML 2 0 0 41
JSON 1 0 0 39
Perl 2 31 24 38
C/C++ Header 1 7 14 4
---------------------------------------------------------------------
SUM: 9 396 337 1394
---------------------------------------------------------------------
And the same of LCS::XS:
~/github/LCS-XS$ cloc --by-file-by-lang .
44 text files.
42 unique files.
8 files ignored.
http://cloc.sourceforge.net v 1.62 T=0.33 s (115.4 files/s, 37434.7 lines/s)
[ details snipped ]
----------------------------------------------------------------------
Language files blank comment code
----------------------------------------------------------------------
C/C++ Header 25 1109 3652 4317
C 6 230 252 1376
make 1 231 111 704
Perl 3 70 63 146
JSON 1 0 0 39
YAML 2 1 1 28
----------------------------------------------------------------------
SUM: 38 1641 4079 6610
----------------------------------------------------------------------
The lines of code in C are 1376 versus 570 for the same algorithm. The reason is more abstraction, more subroutines, more internal interfaces, more flexibility. This also causes more documentation for interfaces and more time to read the documentation and master the interfaces.
The size of the compiled objects is 7 KB versus 33 KB.
And a difference of factor 4 in speed.
The size of the compiled objects is 7 KB versus 33 KB.
And a difference of factor 4 in speed.