Gestalt Diffing algorithm in Java

What we see depends mainly on what we look for. John Lubbock
We do not see things as they are, we see things as we are. Anais Nin

Gestalt Diffing algorithm in Java

Intro

When it comes down to diffing or LCS usually story ends with Myers algorithm. It is idea to create BitMatrix (or IntMatrix) of similarities between two strings and then to approach to a problem from graph point, and find the shortest path of changes between two arrays. It sounds as a great solution, and it is, but in reality it has few drawbacks.

It is algorithm that gives great results most of the time but it is a slow algorithm for some obvious cases, if strings are equal, or very similar. You create whole matrix but for your solution not all values from matrix are needed. It involves one more algorithm, finding the shortest path in it self.

Then improvements of Myers algorithm came, like to check on start are strings are equals, to check for same prefix or suffix, to run snake chase from both ends and so on. Most popular version of this improvements is google's diff_match_patch. There are and some alternative improvements with dynamic matrix and so on...

So what is Gestalt

Gestalt is configuration or pattern of elements so unified as a whole that it cannot be described merely as a sum of its parts. A gestalt has two or more parts (like figure and ground) that are so integrated together that we perceive them as one object.

Gestalt is a name of an algorithm that Ratcliff/Obershelp developed in 1983. It is actually more heuristic principle than algorithm (more here). It is like searching for Schrodinger's cat. Imagine that you own a cat. When you come home you want to find your cat. The way how normal algorithms work is, you search inch by inch in a room 1, then you move your search inch by inch to room 2 and so on. Gestalt way is to search for a cat on exact places where most likely it could or should be. You will search first in a cat box in room 1, than by the window in room2, and so on moving to less and less probable places. You would more try to guess where cat is than search for a real detail answer.

Cat is (probability that is in a cat box) + (probability that is on all other possible places).

It sounds a little bit ridiculous, but that is just because it is ridiculous. Gestalt way is recursive by its nature. What is very interesting people who implement this algorithm claim that it gives better results than Myers algorithm and that it is faster too in most cases.

Implementation

Since i haven't found Java implementation till now I decided to give it try and I implemented my variation of it.

So you start with two strings. First you will find longest equal sub-string in them, and then you will split both input strings and run recursion on parts that were not equal.

1:  public static List<Diff> diffList(char[] chars1, char[] chars2) {  
2:            
3:          List<Diff> res = new ArrayList<Diff>();  
4:            
5:          // if inputs are empty return empty list  
6:          if (chars1 == null && chars2 == null || chars1.length == 0 && chars2.length == 0) {  
7:              return res;  
8:          }  
9:            
10:          TextPart part = findLongesEqualStringWithImaginaryMatrix(chars1, chars2);  
11:            
12:          if (part == null) {  
13:              // end of reqursion  
14:              if (chars1 != null && chars1.length > 0) {  
15:                  res.add(Diff.createNewSimpleDiff(chars1, Operation.DELETE));  
16:              }  
17:              if (chars2 != null && chars2.length > 0) {  
18:                  res.add(Diff.createNewSimpleDiff(chars2, Operation.INSERT));  
19:              }  
20:          } else {  
21:              // split text to pre and after EQUAL  
22:              char[] orgPre = Arrays.copyOfRange(chars1, 0, part.getStart1());  
23:              char[] orgAfter = Arrays.copyOfRange(chars1, part.getEnd1(), chars1.length);  
24:              char[] chgPre = Arrays.copyOfRange(chars2, 0, part.getStart2());  
25:              char[] chgAfter = Arrays.copyOfRange(chars2, part.getEnd2(), chars2.length);  
26:                
27:              // do same for pre  
28:              List<Diff> diffPre = diffList(orgPre, chgPre);  
29:              // do same for after  
30:              List<Diff> diffAfter = diffList(orgAfter, chgAfter);  
31:                
32:              res.addAll(diffPre);  
33:              res.add(part);  
34:              res.addAll(diffAfter);  
35:          }  
36:            
37:          return res;  
38:      }

Algorithm for finding longest sub-string is not that important, and this can work with almost any of them, but it would be nice that we are in line with heuristic and we are searching first the most probable case. So I implemented here find longest sub-string by diagonal search on imaginary bit matrix of two strings. Instead building bitMatrix and than searching on it, I have found that it would be more easier to build and find longest sub-string in same time.

Start for this search is a diagonal that two strings are equal, biggest diagonal, and then move to smaller diagonals. If you already found equal sub-string of length N you will not go to a red corners of size NxN. Orange squares are parts that will go into next recursion.

We have double while loop, for diagonal +1 and diagonal -1, inside one more while loop that moves diagonals.

1:  public static SimpleTextPart findLongesPossibleEqualSubStringImaginaryMatrix(char[] chars1, char[] chars2) {  
2:          
3:      int max = chars1.length;  
4:      if(chars2.length > max) {  
5:        max = chars2.length;  
6:      }  
7:        
8:      int start1 = 0;  
9:      int start2 = 0;  
10:      int count = 0;  
11:      SimpleTextPart found = null;  
12:        
13:      int move = 0;  
14:      int i = 0;  
15:      int j = 0;  
16:      while (move < max) {  
17:        i = move;  
18:        j = 0;  
19:        if (found == null || found.getText().length() < Math.min(chars1.length - move, chars2.length)) {  
20:          while(i < chars1.length && j < chars2.length) {  
21:            System.out.println(chars1[i] + "-" + chars2[j]);  
22:            if (chars1[i] == chars2[j]) {  
23:              if (count == 0) {  
24:                start1 = i;  
25:                start2 = j;  
26:              }  
27:              count++;  
28:            } else {  
29:              if (count > 0) {  
30:                if (found == null || count > found.getText().length()) {  
31:                  found = SimpleTextPart.createNewEqualsTextPart(start1, start2, Arrays.copyOfRange(chars1, start1, start1 + count));  
32:                }  
33:                count = 0;  
34:              }  
35:            }  
36:            i++;  
37:            j++;  
38:          }  
39:          if (count > 0) {  
40:            if (found == null || count > found.getText().length()) {  
41:              found = SimpleTextPart.createNewEqualsTextPart(start1, start2, Arrays.copyOfRange(chars1, start1, start1 + count));  
42:            }  
43:            count = 0;  
44:          }  
45:          System.out.println("--------------------------------");  
46:        }  
47:          
48:        if (move == 0) {  
49:          move++;  
50:          continue;  
51:        }  
52:          
53:        count = 0;  
54:          
55:        i = 0;  
56:        j = move;  
57:        if (found == null || found.getText().length() < Math.min(chars1.length, chars2.length - move)) {  
58:          while(i < chars1.length && j < chars2.length) {  
59:            System.out.println(chars1[i] + "-" + chars2[j]);  
60:            if (chars1[i] == chars2[j]) {  
61:              if (count == 0) {  
62:                start1 = i;  
63:                start2 = j;  
64:              }  
65:              count++;  
66:            } else {  
67:              if (count > 0) {  
68:                if (found == null || count > found.getText().length()) {  
69:                  found = SimpleTextPart.createNewEqualsTextPart(start1, start2, Arrays.copyOfRange(chars1, start1, start1 + count));  
70:                }  
71:                count = 0;  
72:              }  
73:            }  
74:            i++;  
75:            j++;  
76:          }  
77:          if (count > 0) {  
78:            if (found == null || count > found.getText().length()) {  
79:              found = SimpleTextPart.createNewEqualsTextPart(start1, start2, Arrays.copyOfRange(chars1, start1, start1 + count));  
80:            }  
81:            count = 0;  
82:          }  
83:        }  
84:          
85:        System.out.println("==================================");  
86:        move++;  
87:      }  
88:        
89:      return found;  
90:    }

We have and two classes to represent diff - Diff, and TextPart to represent part of text and enum that defines types of diffs (EQUAL, DELETE, INSERT).

You can brows full code at git here.

Conclusion

What I have found interesting about it:

It is algorithm that doesn't need optimizations. If strings are equal algorithm will end instantly since that is the first place where we will look for a cat. In fact the more strings are similar less time will be needed to make results.

It is based on divide and conquer principle by its nature, it reduce scope of calculations with each step of recursion.

It is based on growing biggest equal solution and it is forcing equal results before (delete and insert), which always produce great end-results.

What is bad about it:
I still do not understand how does it work. I am surprised with results each time when I run it.

Let me know what do you think about this implementation.

We must be willing to see things as they are, rather than as we hope, wish, or expect them to be. Unknown

Links
Mayers work An O(ND) Difference Algorithm and Its Variations
http://www.xmailserver.org/diff2.pdf

Google diff-match-patch that we use implementation of Mayers with optimizations

https://code.google.com/archive/p/google-diff-match-patch/

Description of google alg

https://neil.fraser.name/writing/diff/

Gestalt alg, Ratcliff/Obershelp

http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/DDJ/1988/8807/8807c/8807c.htm

http://word-association.tumblr.com/post/26781144264/gestalt-principles-vs-algorithms

https://github.com/chrismwendt/diff-gestalt