Coding interview Essentials

5. Unique Elements in a collection

Unique Elements in a collection

Introduction

The problem presented in this section is probably one of the most popular possibly because it features an incredibly simple statement and it is and easy to understand. This problem has an intuitive brute-force solution that can be coded in a few minutes and that can be refined and optimized into a short and efficient one.

Problem statement

Given a string \(s\) of length \(n\), determine whether it does not contain duplicate characters.

  • Given s="graph" the function returns . There are no duplicates in \(s\).

  • Given s="tree" the function returns . Characters at indices \(2\) and \(3\) are the same.

Clarification Questions

What is the maximum size of the input?

The maximum length for the input string is \(10^6\).

Are all characters upper or lower case?

No, both upper and lower case might be present.

Is the function case-sensitive?

Yes.

Can I assume characters only alphanumeric characters are present in the input?

Yes. Upper and lower case Latin letters and numbers only.

Discussion

Being this problem so popular, the interviewer is expecting you to come up with a good solution in a relatively small time-window. For this reason the obvious \(O(n^2)\) solution should be immediately put on the whiteboard or at least spoken out-loud.

Brute Force

The trivial and easy approach in solving this problem works by looping over each character at index \(i\), and checking if \(s_i\) is present in any of the elements of \(s\) appearing at positions higher than \(i\). In other words we want to check whether the following is true: \(\exists \; j\) s.t. \(s_j=s_i\) and \(j>i\). This idea can be implemented as shown in Listing [list:unique_elements_brute_force1] using two simple nested loops.

Listing 1: C++ solution for determining all characters in a string are unique.

bool unique_elements_brute_force(const std::string &s)
{
  for (size_t i = 0; i < s.size(); i++)
    for (size_t j = i + 1; j < s.size(); j++)
      if (s[i] == s[j])
        return false;

  return true;
}

As a stylistic improvements to the code in Listing [list:unique_elements_brute_force1], Listing [list:unique_elements_brute_force2] uses the C++ standard library function to search for a duplicate of the character \(s_i\). This not only makes the code shorter and cleaner, but also shows to the interviewer that you are able to use the standard library and do not reinvent the wheel.

Listing 2: C++ solution for determining if all characters in a string are unique using \inline{std::find}

bool unique_elements_brute_force_std(const std::string &s)
{
  for (auto it = s.begin(); it != s.end(); it++)
    if (std::find(it + 1, s.end(), *it) != s.end())
      return false;
  return true;
}

Linear time - Hashset

In Listing [list:unique_elements_brute_force1] the internal loop is doing the hard work of searching for a duplicate of the character at index \(i\). We can trade space for time and reduce the complexity of the search for a duplicate of \(s_i\) down to \(O(1)\). The idea is that wen can use a set to keep track, as we loop over the characters of \(s\), of all the distinct characters seen so far. A search for a duplicate of \(s_i\) becomes a query into this set. If the query is positive then we know we have seen this character before, otherwise we inser \(s_i\) into the set and can continue processing the rest of \(s\). Listing [list:unique_elements_brute_force_map] shows how this idea can be implemented.

Listing 3: C++ solution for determining all characters in a string are unique in $O(n)$ using an hashset.

bool unique_elements_map(const std::string &s)
{
  std::unordered_set<char> L;
  for (size_t i = 0; i < s.size(); i++)
  {
    if (L.contains(s[i]))
      return false;
    L.insert(s[i]);
  }
  return true;
}

This approach effectively lowers the time complexity down to linear, but at the cost of some space. But how much space exactly? The intuition would suggest \(O(n)\) because that is the size of the input string, and afterall we might be inserting into the hashset all of the characters of \(s\). But the intuition is wrong as the string is made of characters from an alphabet \(\Sigma\) which has a (very) limited size, at most \(128\) (which is the size of the ASCII set) elements. The insert instruction will not be executed more than \(|\Sigma|\) times. Because of this the space complexity of this solution is \(O(1)\).

The previous argument can be expanded furthermore with the following idea: Every string with more than \(|\Sigma|\) character contains at least one duplicate(follows from the pigeon principle[^4]). The longest string with only unique characters is one of the permutations of "abcde…zABCD …Z123 …9". Thus the solution using the hashset has complexity of \(O(1)\) because in the worst case we can have \(|\Sigma|\) negative lookups. For this reason, we can limit our investigation to only strings that have size smaller or equal to \(|\Sigma|\) character. For all other strings we can immediately return false. Note that under the light of these new facts the brute-force approach also has a complexity of \(O(1)\) if \(i\) and \(j\) in Listing [list:unique_elements_brute_force1] are forced to stay below \(|\Sigma|\).

Armed with these new arguments, the solution we suggest to present during an actual interview only uses a array of booleans of size \(|\Sigma|\) storing the information regarding the presence of a character in the portion of \(s\) considered so far. If at any time the currently examined character has been already seen, then it is a duplicate. Listing [list:unique_elements_brute_force_final] shows an implementation of this idea.

Listing 4: C++ solution for determining all characters in a string are unique in $O(n)$ using an hashset.

bool unique_elements_final(const std::string &s)
{
  constexpr size_t ALPH_SIZE = 128;

  if (s.size() > ALPH_SIZE)
    return false;

  std::array<bool, ALPH_SIZE> F = {};
  for (size_t i = 0; i != s.size(); i++)
  {
    // index in F
    const int idx = s[i] - 'a';
    if (F[idx])
      return false;
    F[idx] = true;
  }
  return true;
}