Do two strings differ by N or less characters?

Recently I saw a very nice interview question: given two strings, can you tell quickly whether they differ by no more than one character? The two-pointers approach works well for this question.
However, when we expand the problem to ask "do two given strings differ by no more than N characters?" then it becomes problematic with the two-pointers approach.

The problem is that for a case like this one:

String 1: abc8
String 2: a123456bc7
N: 8

The answer is "yes", because you can remove all the numbers (8 numbers in total) and the two remaining strings will match. But if you're operating with pointers then when you get to the first mismatched element ('b' and '1') which pointers do you move? The first, second, or both? If you go this route recursively chances are you'll end up with an exponential solution.

One way to solve this problem is using Dynamic Programming. The idea requires solving the problem for smaller strings and use that information to solve the current problem. Suppose that we keep track of a matrix M[i,j] and the semantics is that it gives you the proper solution for when the string s1 ends at position "i" and when the string s2 ends at position "j". Suppose that we want now to find the value for M[i+1, j+1]. That value will be the following:

if s1[i+1] == s2[j+1], then that value should be the same as M[i,j]
if s1[i+1] !=  s2[j+1], then take the minimum between M[i,j+1] and M[i+1,j] and add 1

Doing that across the matrix and the min number of characters to be removed from the strings will be found in the M[s1.len, s2.len] cell. Code below, cheers, Marcelo.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace DifferByN
{
    class Program
    {
        static void Main(string[] args)
        {
            int n = 0;
            if (args.Length != 3 || !Int32.TryParse(args[2], out n))
            {
                Console.WriteLine("Usage: DifferByN <string1> <string2> <integer>");
                return;
            }

            string s1 = args[0];
            string s2 = args[1];

            Console.WriteLine(Differ(s1, s2, n));
        }

        static bool Differ(string s1, string s2, int n)
        {
            //O(1)-time optmization
            if (n < 0) return false;
            if (String.IsNullOrEmpty(s1) && String.IsNullOrEmpty(s2)) return true;
            if (String.IsNullOrEmpty(s1) && s2.Length <= n) return true;
            if (String.IsNullOrEmpty(s2) && s1.Length <= n) return true;
            if (s1.Length < s2.Length - n) return false;
            if (s2.Length < s1.Length - n) return false;

            //O(len1*len2)-space
            int[,] partialSolutions = new int[s1.Length + 1, s2.Length + 1];

            //O(len1+len2)-time
            for (int i = 0; i < s1.Length + 1; i++) partialSolutions[i, 0] = i;
            for (int i = 0; i < s2.Length + 1; i++) partialSolutions[0, i] = i;

            //O(len1*len2)-time
            for (int i = 1; i < s1.Length + 1; i++)
            {
                for (int j = 1; j < s2.Length + 1; j++)
                {
                    partialSolutions[i, j] = (s1[i - 1] == s2[j - 1]) ? partialSolutions[i - 1, j - 1] : (Math.Min(partialSolutions[i - 1, j], partialSolutions[i, j - 1]) + 1);
                }
            }

            Console.WriteLine("Min Diff: {0}", partialSolutions[s1.Length, s2.Length]);
            return partialSolutions[s1.Length, s2.Length] <= n;
        }
    }
}

Comments

  1. ah, DP classic - special case of Levenshtein distance :) This is nice and clean, but can be slightly improved, since you don't need to keep the entire matrix in memory, since the algorithm only uses 2 rows - current and previous (which would be swapped after each nested loop exit), so space usage can be reduced to O(|s1|) or O(|s2|) and using a row for a smaller string would obviously be better :)

    Keep up the good work, Marcelo!

    ReplyDelete
    Replies
    1. P.S.: in fact there is also at least one semi-useful runtime optimization, which is applicable for this problem, but not for general Levenshtein distance - when populating the `next` row, also update `min_distance` at every iteration of nested loop and check if it's > `n` after nested iteration is over, initially the min_distance can be set to `n+1`. This does not improve worst case time, but can significantly improve best case time, when we strings are very long, but have different prefixes.

      Delete

Post a Comment

Popular posts from this blog

Advent of Code - Day 6, 2024: BFS and FSM

Advent of Code - Day 7, 2024: Backtracking and Eval

Golang vs. C#: performance characteristics (simple case study)