Coding interview Essentials

3. Two string anagram

Two string anagram

Introduction

From a set of words, you can construct other words by only changing the arrangements of their characters. For instance, from the characters in "alerting" you can spell the following words:

  • "altering"

  • "integral"

  • "relating"

  • "triangle".

Words sharing the same characters set are called anagrams.

Being able to create good anagrams, especially if they are able to reflect or comments on the words they are generated from (for instance turning "Madam Curie" into "Radium came") is regarded as a rather difficult task. Computers have been used for a long time to find anagrams in long texts, as well as to generate the so-called anagram dictionaries. A special kind of dictionary, where all the letters in a word and all their transposition are arranged in alphabetical order, are often used in games like Scrabble[^2]. Often, at the core of such applications lies an efficient algorithm for determining if a word is an anagram of another word.

As it is pretty clear at this point, the problem discussed in this lesson is about anagrams, and more specifically, about determining the number of modifications you need to make to a word in order to make it a valid anagram of a another word. This kind of question is considered rather easy during a coding interview process, as it does not require particular insights or particularly tricky reasoning in order to come up with an efficient solution, aside from understanding the concept of an anagram, which is common knowledge.

That said, it is still very much worth studying this problem as it is frequently asked during the preliminary interview stages. Moreover, despite its simplicity, there is more than one neat and elegant approach leading to an efficient solution to this problem.

In the rest of the chapter we are going to have a look at three solutions, starting from the slow but easy to understand brute-force in Section 3.3.1 then touching briefly on a faster approach using sorting in Section 3.3.2,and finally, and finally, we will get to the optimal solution running in linear time in Section 3.3.3.

Problem statement

Write a function that given two string, \(a\) and \(b\) of length \(n\) and \(m\), respectively, determines the minimum number of character substitution, \(C(s, i, c)\), necessary to make the string \(a\) an anagram of the string \(b\).

Two strings are said to be anagrams of one another if you can turn the first string into the second by rearranging its letters.

A substitution operation \(C(s,i,c)\) modifies the string \(s\), by changing its \(i^{th}\) character into \(c\). Notice that deletions or additions of characters are not allowed. The only operation you can do is change a character of the first string into another one.

In other words, what is the minimum number of characters of the input strings that need to be modified (no addition or deletion) so that \(a\) becomes an anagram of \(b\)?

[ex:two_string_anagram:example1]

  • a = "aaa"

  • b = "bbb"

The function returns \(3\). All the characters of a need to be changed into ‘b’. [ex:anagrams:example1]


  • a = "tear"

  • b = "fear"

The function returns \(1\). All it is necessary is turning the first letter ‘t’ into a ‘f’.


  • a = "Protectional"

  • b = "Lactoprotein"

The answer for this case is \(0\) because Protectional is already an angram of Lactoprotein.

Clarification Questions

Are the letters of the string always only letters from the English alphabet?

Yes, letters are always from the English alphabet.

Should the function be case sensitive?

No. You can assume the input letters are always lower case.

Can the input string be modified? No, the input is immutable.

No, the input strings are immutable.

What value should be returned when there is no solution?

In such case you can return \(-1\).

Discussion

Let’s start by first quickly review what the word anagram means in the context of this problem. First of all, notice that both \(a\) and \(b\) contain a single word (which can be fairly long). Moreover, for \(a\) to be an anagram of \(b\), it has to be the case that exists an arrangement of characters in \(a\) that is equal to \(b\). In other words, the question to which we need to answer is: is it possible to shuffle the character of \(a\) such that we obtain \(b\)? For this to be the case, it must be that \(a\) and \(b\) contain the same set of characters meaning that sorting both \(a\) and \(b\) would make them equal. In addition, as a consequence of the fact that no addition or deletion is allowed, \(a\) and \(b\) must have the same length. On the other hand, if they have the same length then it is always possible to solve this problem because in the worst case, we can modify every letter of \(a\) (see Example [ex:two_string_anagram:example1]). Thus, the only case when the problem has no solution has been isolated: when \(n \neq m\) we must return \(-1\) otherwise we can proceed with our calculation knowing that a solution exists.

Brute-Force

One of the first options coming to mind is a solution where we generate all possible arrangements of the letters in \(a\), and for each of these arrangements, calculate the number of modifications necessary to convert it into \(b\). The key idea is that the cost of transforming a string into another is equal to the number positions having different letters. For instance, the cost of transforming "abcb" into "bbbb" is \(2\) because the two strings differ in the first and third letters.

Although it is simple to explain, this approach must be considered poor because the number of arrangements of a set of \(n\) letters grows as fast as \(n!\). Moreover, enumerating all the arrangements is no trivial task, unless we use a library function capable of doing that (for instance, the C++ standard library provides the function [^3] devoted to this purpose).

Listings [list:two_string_anagram_bruteforce] shows a C++ implementation of the idea above.

Listing 1: Brute force.

#include <algorithm>
#include <limits>
#include <string>

int count_different_letters(const std::string &a_perm, const std::string &b)
{
  assert(a_perm.size() == b.size());

  int count = 0;
  for (size_t i = 0; i < a_perm.length(); i++)
  {
    if (a_perm[i] != b[i])
      ++count;
  }
  return count;
}

int solution_brute_force(const std::string &a, const std::string &b)
{
  if (a.length() != b.length())
    return -1;

  std::string a_perm(a);
  sort(a_perm.begin(), a_perm.end());
  int ans = std::numeric_limits<int>::max();
  do
  {
    ans = std::min(ans, count_different_letters(a_perm, b));
    if (ans == 0)
      break;
  } while (std::next_permutation(a_perm.begin(), a_perm.end()));

  return ans;
}

Sorting

The brute-force solution does a lot of superfluous work, because it tries to find a permutation of the string \(a\) requiring minimal modifications to be morphed into \(b\). But is it really necessary to turn \(a\) into exactly \(b\), or is it sufficient to modify \(a\) so that it is equal to a particular permutation of \(b\)? After all, being an anagram is a transitive property: if \(a\) is a permutation of \(b\) and \(b\) is a permutation of \(c\), then \(a\) must also be a permutation of \(c\).

By definition, an anagram of \(b\) is any permutation of its characters, and therefore, the particular permutation in which the characters of \(b\) are sorted is a valid anagram on its own. It is much easier than checking all possible permutations, to modify \(a\) into the "sorted" anagram of \(b\) (where all of its characters are sorted), rather than to exactly \(b\) because all we need to do is to create a copy of both \(a\) and \(b\), sort both of them and then calculate the character-by-character difference. This approach works because if \(x\) is an anagram of \(b\) then \(x\) is also an anagram of ‘sort(b)’. In other words, it does not matter how the characters are arranged in \(a\) and \(b\), as the only thing that matters is the set of the characters appearing in \(a\) and \(b\): the order in which characters in both \(a\) and \(b\) appear does not matter.

Listings [list:two_string_anagram_sorting] shows how we can take advantage of this fact and write a fast solution for this problem.

Listing 2: Solution based on sorting.

int solution_sorting(const std::string &a, const std::string &b)
{
  if (a.length() != b.length())
    return -1;

  std::string aa(a);
  std::string bb(b);

  std::sort(aa.begin(), aa.end());
  std::sort(bb.begin(), bb.end());
  return count_different_letters(aa, bb);
}

Notice that, if the input was mutable, then, the additional space occupied by the copies of the string and could have been avoided.

The time complexity of Listing [list:two_string_anagram_sorting] is \(O(n log(n))\) (because of sorting). The space complexity is \(O(n)\) (we create copies of the input strings).

Histograms

There is another bit of information that we have not used yet: the alphabet from which the letters of \(a\) and \(ab\) are taken from is small. If the only thing that matters is the set of characters appearing in \(a\) and \(b\) (and not their order, as discussed above), then we can use the same idea at the core of the bucket sort algorithm to achieve a linear time complexity solution.

The key idea is to pre-process \(a\) and \(b\) so to calculate their per-character frequencies, denoted here as \(F_a\) and \(F_b\), respectively. An entry of \(F_a[\mathrm{c}]\) and \(F_b[\mathrm{c}]\), where \(\mathrm{c} \in \{\mathrm{a},\mathrm{b},\ldots,\mathrm{z}\}\) (a letter of the alphabet), contains the frequency of character \(\mathrm{c}\), in \(a\) and \(b\), respectively.

If \(F_a\) and \(F_b\) are the same, then \(a\) and \(b\) have exactly the same character set and \(a\) is an anagram of \(b\). Otherwise, it must be the case that some characters of \(a\) appear in \(b\) a different number of times. In this case, we can fix \(a\) in such a way to make sure that its frequencies \(F_a\) ey match the ones in \(F_b\). But the main question is still unanswered: how many operations are necessary to do so? In order to get this answer, it is useful to look at the difference (\(D\)) of \(F_a\) and \(F_b\).

\(D = F_a - F_b = \{D[\mathrm{a}] = (F_a[\mathrm{a}] - F_b[\mathrm{a}]), D[\mathrm{b}] = (F_a[\mathrm{b}] - F_b[\mathrm{b}]), D[\mathrm{c}] = (F_a[\mathrm{c}] - F_b[\mathrm{c}]), \ldots, D[\mathrm{z}] = (F_a[\mathrm{z}] - F_b[\mathrm{z}])\}\)

\(D[\mathrm{c}]\) (where \(\mathrm{c} \in \{\mathrm{a},\mathrm{b},\ldots,\mathrm{z}\}\)) contains the difference between the number of occurrences of the character \(\mathrm{c}\) in the string \(a\) and \(b\). Depending on whether the value of \(D[\mathrm{c}]\) is greater or smaller than \(0\), \(a\) has an excess or a deficit of the letter c, respectively.

Firstly, notice that \(\sum_{c=\mathrm{a}}^{\mathrm{z}} D[\mathrm{c}] = 0\). This observation stems from the fact that \(|a|=n=m=|b|\) (\(a\) and \(b\) must have equal length for this problem to have a solution, remember?) and that if \(a\) has an excess of a certain character \(\mathrm{c}\) then there must exist another character \(\mathrm{d} \neq \mathrm{c}\) that the string \(a\) has a shortage of. If that is not the case, it is impossible for \(a\) and \(b\) to have equal length.

We can use this fact to modify the excesses of the letters of \(a\), the ones having a positive value of \(D\) into some of the letters there is a shortage of so that eventually, every single value of \(D\) is zero. If \(D[\mathrm{c}] = x\) is going to take \(x\) modifications to transform the excess of characters \(\mathrm{c}\). The answer to this problem is, therefore, the sum of all the positive numbers of \(D\).

Listings [list:two_string_anagram_histogram] shows a possible implemenetation of the idea above.

Listing 3: C++ solution to the two string anagram problem using the histogram approach.

int solution_histogram(const std::string &a, const std::string &b)
{
  if (a.length() != b.length())
    return -1;

  std::array<int, 128> F = {0};
  for (int i = 0; i < a.size(); i++)
  {
    F[a[i] - 'a']++;
    F[b[i] - 'a']--;
  }

  int ans = 0;
  for (const auto x : F)
    if (x > 0)
      ans += x;

  return ans;
}

Notice how the array of differences of frequencies \(D\) can be easily calculated without explicitly computing the frequencies for the characters of \(a\) and \(b\) but by simply adding \(1\) to \(D[\mathrm{c}]\) when the letter \(\mathrm{c}\) appears in \(a\) and subtracting \(1\) when it does in \(b\).

The time and space complexity of the code above is \(O(n)\) and \(O(1)\) in space (we are using an array of \(128\) integers regardless of the size of the input). We cannot do any better than this, as all characters in the input strings must be at least read once.