Skip to main content

Gestalt Diffing algorithm in Java

What we see depends mainly on what we look for. John Lubbock
We do not see things as they are, we see things as we are. Anais Nin



Gestalt Diffing algorithm in Java

Intro

When it comes down to diffing or LCS usually story ends with Myers algorithm. It is idea to create BitMatrix (or IntMatrix) of similarities between two strings and then to approach to a problem from graph point, and find the shortest path of changes between two arrays. It sounds as a great solution, and it is, but in reality it has few drawbacks.

It is algorithm that gives great results most of the time but it is a slow algorithm for some obvious cases, if strings are equal, or very similar. You create whole matrix but for your solution not all values from matrix are needed. It involves one more algorithm, finding the shortest path in it self.




Then improvements of Myers algorithm came, like to check on start are strings are equals, to check for same prefix or suffix, to run snake chase from both ends and so on. Most popular version of this improvements is google's diff_match_patch. There are and some alternative improvements with dynamic matrix and so on...

So what is Gestalt


Gestalt is configuration or pattern of elements so unified as a whole that it cannot be described merely as a sum of its parts. A gestalt has two or more parts (like figure and ground) that are so integrated together that we perceive them as one object.



Gestalt is a name of an algorithm that Ratcliff/Obershelp developed in 1983. It is actually more heuristic principle than algorithm (more here). It is like searching for Schrodinger's cat. Imagine that you own a cat. When you come home you want to find your cat. The way how normal algorithms work is, you search inch by inch in a room 1, then you move your search inch by inch to room 2 and so on. Gestalt way is to search for a cat on exact places where most likely it could or should be. You will search first in a cat box in room 1, than by the window in room2, and so on moving to less and less probable places. You would more try to guess where cat is than search for a real detail answer.

Cat is (probability that is in a cat box) + (probability that is on all other possible places).

It sounds a little bit ridiculous, but that is just because it is ridiculous. Gestalt way is recursive by its nature. What is very interesting people who implement this algorithm claim that it gives better results than Myers algorithm and that it is faster too in most cases.

Implementation

Since i haven't found Java implementation till now I decided to give it try and I implemented my variation of it.

So you start with two strings. First you will find longest equal sub-string in them, and then you will split both input strings and run recursion on parts that were not equal.

1:  public static List<Diff> diffList(char[] chars1, char[] chars2) {  
2:            
3:          List<Diff> res = new ArrayList<Diff>();  
4:            
5:          // if inputs are empty return empty list  
6:          if (chars1 == null && chars2 == null || chars1.length == 0 && chars2.length == 0) {  
7:              return res;  
8:          }  
9:            
10:          TextPart part = findLongesEqualStringWithImaginaryMatrix(chars1, chars2);  
11:            
12:          if (part == null) {  
13:              // end of reqursion  
14:              if (chars1 != null && chars1.length > 0) {  
15:                  res.add(Diff.createNewSimpleDiff(chars1, Operation.DELETE));  
16:              }  
17:              if (chars2 != null && chars2.length > 0) {  
18:                  res.add(Diff.createNewSimpleDiff(chars2, Operation.INSERT));  
19:              }  
20:          } else {  
21:              // split text to pre and after EQUAL  
22:              char[] orgPre = Arrays.copyOfRange(chars1, 0, part.getStart1());  
23:              char[] orgAfter = Arrays.copyOfRange(chars1, part.getEnd1(), chars1.length);  
24:              char[] chgPre = Arrays.copyOfRange(chars2, 0, part.getStart2());  
25:              char[] chgAfter = Arrays.copyOfRange(chars2, part.getEnd2(), chars2.length);  
26:                
27:              // do same for pre  
28:              List<Diff> diffPre = diffList(orgPre, chgPre);  
29:              // do same for after  
30:              List<Diff> diffAfter = diffList(orgAfter, chgAfter);  
31:                
32:              res.addAll(diffPre);  
33:              res.add(part);  
34:              res.addAll(diffAfter);  
35:          }  
36:            
37:          return res;  
38:      }  

Algorithm  for finding longest sub-string is not that important, and this can work with almost any of them, but it would be nice that we are in line with heuristic and we are searching first the most probable case. So I implemented here find longest sub-string by diagonal search on imaginary bit matrix of two strings. Instead building bitMatrix and than searching on it, I have found that it would be more easier to build and find longest sub-string in same time.

Start for this search is a diagonal that two strings are equal, biggest diagonal, and then move to smaller diagonals. If you already found equal sub-string of length N you will not go to a red corners of size NxN. Orange squares are parts that will go into next recursion.

 

We have double while loop, for diagonal +1 and diagonal -1, inside one more while loop that moves diagonals.

1:  public static SimpleTextPart findLongesPossibleEqualSubStringImaginaryMatrix(char[] chars1, char[] chars2) {  
2:          
3:      int max = chars1.length;  
4:      if(chars2.length > max) {  
5:        max = chars2.length;  
6:      }  
7:        
8:      int start1 = 0;  
9:      int start2 = 0;  
10:      int count = 0;  
11:      SimpleTextPart found = null;  
12:        
13:      int move = 0;  
14:      int i = 0;  
15:      int j = 0;  
16:      while (move < max) {  
17:        i = move;  
18:        j = 0;  
19:        if (found == null || found.getText().length() < Math.min(chars1.length - move, chars2.length)) {  
20:          while(i < chars1.length && j < chars2.length) {  
21:            System.out.println(chars1[i] + "-" + chars2[j]);  
22:            if (chars1[i] == chars2[j]) {  
23:              if (count == 0) {  
24:                start1 = i;  
25:                start2 = j;  
26:              }  
27:              count++;  
28:            } else {  
29:              if (count > 0) {  
30:                if (found == null || count > found.getText().length()) {  
31:                  found = SimpleTextPart.createNewEqualsTextPart(start1, start2, Arrays.copyOfRange(chars1, start1, start1 + count));  
32:                }  
33:                count = 0;  
34:              }  
35:            }  
36:            i++;  
37:            j++;  
38:          }  
39:          if (count > 0) {  
40:            if (found == null || count > found.getText().length()) {  
41:              found = SimpleTextPart.createNewEqualsTextPart(start1, start2, Arrays.copyOfRange(chars1, start1, start1 + count));  
42:            }  
43:            count = 0;  
44:          }  
45:          System.out.println("--------------------------------");  
46:        }  
47:          
48:        if (move == 0) {  
49:          move++;  
50:          continue;  
51:        }  
52:          
53:        count = 0;  
54:          
55:        i = 0;  
56:        j = move;  
57:        if (found == null || found.getText().length() < Math.min(chars1.length, chars2.length - move)) {  
58:          while(i < chars1.length && j < chars2.length) {  
59:            System.out.println(chars1[i] + "-" + chars2[j]);  
60:            if (chars1[i] == chars2[j]) {  
61:              if (count == 0) {  
62:                start1 = i;  
63:                start2 = j;  
64:              }  
65:              count++;  
66:            } else {  
67:              if (count > 0) {  
68:                if (found == null || count > found.getText().length()) {  
69:                  found = SimpleTextPart.createNewEqualsTextPart(start1, start2, Arrays.copyOfRange(chars1, start1, start1 + count));  
70:                }  
71:                count = 0;  
72:              }  
73:            }  
74:            i++;  
75:            j++;  
76:          }  
77:          if (count > 0) {  
78:            if (found == null || count > found.getText().length()) {  
79:              found = SimpleTextPart.createNewEqualsTextPart(start1, start2, Arrays.copyOfRange(chars1, start1, start1 + count));  
80:            }  
81:            count = 0;  
82:          }  
83:        }  
84:          
85:        System.out.println("==================================");  
86:        move++;  
87:      }  
88:        
89:      return found;  
90:    }  

We have and two classes to represent diff - Diff, and TextPart to represent part of text and enum that defines types of diffs (EQUAL, DELETE, INSERT).

You can brows full code at git here.

Conclusion


What I have found interesting about it:

It is algorithm that doesn't need optimizations. If strings are equal algorithm will end instantly since that is the first place where we will look for a cat. In fact the more strings are similar less time will be needed to make results.

It is based on divide and conquer principle by its nature, it reduce scope of calculations with each step of recursion.

It is based on growing biggest equal solution and it is forcing equal results before (delete and insert), which always produce great end-results.

What is bad about it:
I still do not understand how does it work. I am surprised with results each time when I run it.

Let me know what do you think about this implementation.
 
We must be willing to see things as they are, rather than as we hope, wish, or expect them to be. Unknown



Links
Mayers work An O(ND) Difference Algorithm and Its Variations
http://www.xmailserver.org/diff2.pdf
Google diff-match-patch that we use implementation of Mayers with optimizations
Description of google alg

Gestalt alg, Ratcliff/Obershelp


Comments

This is really an awesome post, thanks for it. Keep adding more information to this.mulesoft training in bangalore
Data Science said…
Amazingly by and large very interesting post. I was looking for such an information and thoroughly enjoyed examining this one. Keep posting. An obligation of appreciation is all together for sharing.data science course in gwalior

Popular posts from this blog

Javascript REST client

Note : I still work on text, code example should be fine. REST is the one of the most popular interfaces on the web today. One part to its success it owes to its simplicity. How number of sites that support REST grows we will need some fast and good solution to use those services on client side. Even if its possible to do it all, since REST is based on http, with old AJAX calls, it would be nice to have some little more... This is one approach that i choose. I create in javascript new object that has six methods five for REST methods, four for POST, PUT, GET and REMOVE plus one more to GET all resources, and one more to get html template to display your data. This code is based on jquery and json js libs. function RestServiceJs(newurl) { this.myurl = newurl; this.add = function(model, callback) { $.ajax({ type: 'POST', url: this.myurl, data: JSON.stringify(model), // '{"name":"' + model.name + '"}',

Use JPA with MongoDb and Datanucleus

Web applications nowadays have huge demand for data processing, and they need to get them, process them, and displayed them fast. Traditional relational tables and big SQL engines fail to serve that purpose. NoSQL movement is on the rise. There are huge numbers of alternatives to SQL like BigTable, HBase, CouchDB, Cassandra and MongoDB. They are all fast but you'll need to implement new way to work with them, since all of them have they own query languages. It would be nice that we can use them in similar or same way in our projects, after all ORM layer is there for that to decouple our objects from underlying storage. MongoDB is one of most popular NoSQL solutions and easiest one to install. For HBase i would had to install cygwin or virtual linux, and still will be hard to confgiure it right. MongoDb is easy, it plays out of the box, just unzip it, and it still offers some nice features like MapReduce, Replication and Sharding. Datanucleus is one of the leading ORM provid