ALGORITHMS IN PRACTICE

Create a “text-differentiator” using longest-common-subsequence.

.. the problem, use, solution and implementation!

Riddhi Nisar

--

https://riddhinn99.github.io/text-diff/

The Problem:

Let us first look at the basic problem: Let X and Y be two arbitrary strings of length u and v respectively. What is the longest common subsequence (LCS) between X and Y?

A subsequence is a subset of the original string where all the letters present in the subsequence appear in the same order in which they were present in the original string. Example:

(if the lines cross, it isn’t a subsequence since the order is changed)

Given two strings, we are supposed to find the longest subsequence which is common between both these strings.

The letters circled are the final answer

The Use:

Before jumping to the solution, let us think where and why would you require such a longest-common-subsequence finder? The first and one of the very useful cases is “gene-similarity-finder”. We can compare 2 strands of DNA by finding the longest common subsequence of bases [adenine (A), cytosine (C), guanine (G), or thymine (T)] between them, which will tell us how similar the 2 DNA strands are genetically.

Another important use-case is in finding the exact opposite of the LCS ie. finding what is the difference between 2 texts. So the best solution for this is to find out the LCS and compare it with the 2 texts to retrieve the difference. This is particularly useful in version control systems such as Git, which compare files and the difference between them between different commits.

In this tutorial/article, I’ll be creating a simple text-differentiator where you can paste 2 different texts and find how much of it is similar and how much different. We’ll use JavaScript to get the text from the inputs, find the difference, highlight them and show it as output. But first, let us see the solution:

The Solution:

Now lets see what a recursive solution to this problem might look like.

Consider stringsX and Y of length u and v respectively. Let LCS(Xᵤ , Yᵥ) be the longest common subsequence between X (length 0 to uᵗʰ character) and Y (length 0 to vᵗʰ character).

We must divide the problem LCS(Xᵤ , Yᵥ) into smaller subproblems. If the uᵗʰ character of X and vᵗʰ character of Y are equal, it means that we may consider this equal character in the answer of the final subsequence. So breaking it into a smaller problem: LCS between X and Y for length u and v respectively is one plus LCS of X and Y for length u-1 and v-1 (since the uᵗʰ character of X and vᵗʰ character of Y are equal).

LCS(Xᵤ , Yᵥ) = 1 + LCS(Xᵤ₋₁ , Yᵥ₋₁) -> (for equal characters)

But what if these characters aren’t equal? Then we have 2 choices: either consider the uᵗʰ character of X and ignore the vᵗʰ character of Y, or vice versa. Then, we will have to find whichever among these 2 choices produces the maximum length subsequence. Thus the recursive equation reduces to:

LCS(Xᵤ , Yᵥ) = max(LCS(Xᵤ , Yᵥ₋₁), LCS(Xᵤ₋₁ , Yᵥ)) -> (for unequal characters)

BASIC RECURSION TREE

Next, whenever you see a big recursive problem, always think of using Dynamic Programming(DP) if you can. We can use dynamic programming if a problem satisfies 2 conditions:

  1. Optimal sub-structure — a problem that can be divided into subproblems whose optimal solutions can make up the most optimal solution of the main problem too. Here, we see that LCS can be divided into smaller LCS subproblems and so LCS satisfies this condition.
  2. Overlapping subproblems — some subproblem occurs more than once in the recursion tree ie. we are repeatedly calculating the answer for the same subproblem. LCS satisfies this too.

Thus, we can use DP to store the values of recurring subproblems, so that we only compute them once.. hence converting the time complexity of LCS from exponential recursive O(2ⁿ) (look at the worst case recursion branches.. each problem gives rise to 2 subproblems.. so counting all the nodes in the tree, we get 2ⁿ subproblems to solve) to O(mn) for lengths of X and Y as m and n respectively.

Algorithm/ Code:

Let us use the bottom-up approach to solve LCS(one can also use the top-down approach, no harm in that, the time complexity remains the same.. but the recursive call stack increases, so I prefer the bottom-up approach here).

Using the bottom-up approach, we first create an empty matrix of rows= length(X)+1 and cols=length(Y)+1 . You might ask why +1? Thats because the first row and column are for empty characters and will be filled with 0s. The indexing of our letters in each word will begin from 1. Next, we loop through (i, j) for each element of the matrix and fill it according to the formula: (ignore the red boxes for now, just look at how each row is filled one by one using the formula)

Heres the JavaScript code snippet:

Now, this function only gives us the length of LCS. We want to find the actual LCS string. For that, we need to backtrack our steps in the matrix as shown by the red boxes in the figure above. The basic formula is: if both the characters at the given index (i, j) are equal, it means we’ve traversed along the diagonal to reach this value. So we append the character to the LCS in the reverse order and move back to (i-1,j-1). If the characters are not equal, we backtrack to the element having maximum value between (i,j-1) and (i-1,j) and append nothing to the reverse output LCS. Heres the JS snippet:

Final Implementation:

All this is well and good, but now we want to see this thing in action 😛

We use 2 basic input text-areas to let the user input the text. Then, we add an event listener on a submit button. Once the user clicks submit, findDifference(text1, text2) function is run.. which gets the text values from the input and finds the LCS between them.

Now, we need to show the same text in another <div> for both left and right and highlight the common letters of LCS with green, and others with red. For this, we use a function called highlightText(commonText, originalText).

The final step is to append the text returned by highlightText() to the DOM.

And, voila! We have a created a simple working text differentiator using the longest-common-subsequence algorithm in practice. 😁

The entire source code including the HTML, CSS and JS can be found here! You can also try out the text-diff I made here :)

Thanks for reading! If you found the article useful and/or fun, do give it a clap and follow me for more! 😋

And as always, you and I learn together. If you find something incorrect or if there is anything you’d like to add, leave a response and I’ll check it out. Cheers! ✨

--

--

Riddhi Nisar

The author is an engineer finally out of her hibernation mode, proactively chasing her curiosities.