Coding interview Essentials

10. First positive missing

First positive missing

Introduction

In this chapter, we are going to address a fairly common problem that has been asked primarily during on-site interviews and for which we can find quite a number of solutions making up for a wide range of time and space complexities.

There is no doubt that coming up with what is considered to be the best solution (in terms of asymptotic complexity) is challenging. This problem is definitely a tricky one, worth an in-depth look.

Depending on the examples provided, it might feature a short and purposely slightly vague statement. Therefore, more than for other problems, it is important to ask questions and clarify with the interviewer every unclear aspect of the statement before actually attempting to write any code for it, to make sure that the statement is understood fully [^7] and there is no misunderstanding about the details.

Problem statement

Write a function that, given an unsorted integer array \(A\), returns the smallest positive integer not contained in \(A\).


Given \(A=\{ 1, 0, -1, -2\}\) the answer is \(2\).


Given \(A=\{ 2, 3, -7, 6, 8, 1, -10, 15\}\) the answer is \(4\).


Given \(A=\{ 1, 0, -1, -2\}\) the answer is \(2\).

Clarification Questions

Are the input numbers always positive?

No, the array contains positive and negative numbers.

Are all the elements distinct?

No, the array might contains duplicates.

Can be the input array be modified?

Yes.

Can the size of the array be zero? In other words, can the array be empty?

No, the input array contains at least one element.

Is \(0\) a valid output?

No, only strictly positive numbers should be returned.

Discussion

This is actually a problem with a solid real-life application. Imagine for instance, how an OS might assign PID[^8]s to processes. One approach would be to keep a list of all the PIDs for all the processes running, and once a new one is fired up, the OS will assign to it the smallest PID not assigned already to any other process.

In a highly dynamic environment, like the OS with thousands of applications active at the same time, the focus of solving this task should be on sheer speed as you want the process to be up and running as fast as possible.

Brute-force

One of the simplest approaches we can think about consists of blindly searching \(A\) for the missing number, incrementally starting from \(1\). What this practically means is that we are going to perform a search operation in \(A\) for each number from \(1\) onward until the search fails. This algorithm is guaranteed to return always the smallest missing number, because we perform the searches in order, with the smallest numbers being searched first.

Listing [list:first_positive_missing_bruteforce] shows an implementation of this idea where we use as a means to do the actual search in the \(A\).

Listing 1: Two bruteforce solution implementations the problem of finding the smallest missing positive integer in an array.

int first_positive_missing_bruteforce1(const std::vector<int> A)
{
  int ans = 0;
  // until ans is found
  while (std::find(begin(A), end(A), ans) != end(A))
    ans++;
  return ans;
}

int first_positive_missing_bruteforce2(const std::vector<int> A)
{
  for (int i = 0;; i++)
  {
    // not found
    if (std::find(begin(A), end(A), i) == end(A))
      return i;
  }
}

However, this is considered to be a poor solution (as a rule of thumb, in the context of coding interviews, all brute-force solutions are) and has a complexity of \(O(n^2)\) time and \(O(1)\) space.

The advantages of this approach are that it is extremely easy and fast to write, and it is also very difficult to make implementations mistakes because the logic is simple and the amount of code involved is small.

Sorting

The next most intuitive (after brute-force) way to solve this problem is sorting the input because, as we will see, having the numbers sorted is helpful for easily coming up with a fast approach to this problem.

When the \(A\) is sorted, we know the positive numbers in it will all be appearing in an ordered fashion from a certain index \(k\geq 0\) onwards (the positions from index \(0\) to \(k-1\) are occupied by negatives or zeros).

We also know that, if no number is missing in \(A[k \ldots n-1]\), then we would expect to see:

  • \(A[k]=1\)

  • \(A[k+1]=2\)

  • \(A[k+2]=3\)

  • \(\ldots\)

  • \(A[n-1]=n-k+1\)

i.e. all numbers from \(1\) onwards appearing in their natural order \((1,2,3, \ldots (n-k+1))\) from the cell at index \(k\) to the end of \(A\). If any of these numbers is missing, then we would not be able to see such a sequence.

The goal of this problem is to find the first number that is missing from that sequence: we can do that by finding the first element among \(A[k \ldots n-1]\) where the condition \(A[k+i]=i+1\) with \((i=0,1,2, \ldots)\) is false. When this happens, we can conclude the missing number is \(i+1\). If such a cell does not exist (every cell satisfies the condition above), then we know that the missing number is \(A[n-1]+1\).

For instance, consider the array \(A=\{ 9,-7,0,4,5,2,0,1\}\). When sorted, the array becomes \(A=\{ -7,0,0,1,2,4,5,9\}\). The positives start at index \(k=3\):

  • \(A[3+0] = 1\) (test passes)

  • \(A[3+1] = 2\) (test passes)

  • \(A[3+2] = 4\) (test fails)

As we can see, the test fails after three tries, and therefore we can conclude the missing number is \(3\).

Now let’s consider the array \(B=\{ 3,-7,0,4,5,2,0,1\}\) which is exactly the same as in the previous example, with the exception we have swapped a \(9\) for a \(3\). When sorted, the array becomes \(B=\{ -7,0,0,1,2,3,4,5\}\) which contains no gaps between any of the positive numbers. As before, the positives start at index \(k=3\) but this time every element passes the test:

  • \(A[3+0] = 1\) (test passes)

  • \(A[4+1] = 2\) (test passes)

  • \(A[4+2] = 3\) (test passes)

  • \(A[4+3] = 4\) (test passes)

  • \(A[4+4] = 5\) (test passes)

We can clearly see that the missing number is \(6 = A[8]+1 = A[n-1]+1\).

An implementation of this idea is shown in Listing [list:first_positive_missing_sorting].

Listing 2: Solution to the problem of finding the smallest missing positive integer in an array.

int first_positive_missing_sorting(std::vector<int> A)
{
  std::sort(begin(A), end(A));

  auto it =
      std::find_if(begin(A), end(A), [](const auto &x) { return x >= 0; });

  int expected = 0;
  while (it != end(A) && (*it) == expected)
  {
    expected++;
    it++;
  }
  return expected;
}

Notice that:

  • the iterator always points at the currently evaluated element. It is initialized to either:

    • the smallest positive in the sorted array;

    • to one element past the end of the array if no positives are present.

    is moved to its initial location by using the function from the STL which runs in linear time. We might as well have used binary search to perform this task, but that would have not really helped us lowering the overall asymptotic time complexity, because the sorting operation itself costs \(O(nlog(n))\) and the subsequent loop runs in linear time.

  • is a variable holding the value that is expected to be found where is pointing to (the value \(i+1\) mentioned above).

  • if the runs to completion because we have examined every element of \(A\) () then points to .

  • if no positives are present, then the does not even run once and \(1\) is returned.

This is considered a good solution with an overall time and space complexity of \(O(nlog(n))\) and \(O(n)\), respectively. It is, however, not optimal, as we can come up with solutions with better time and space complexities, as we will see in the next section.

Linear time and space solution

Looking at the problem a bit closer, we can immediately notice that the missing number will always be in the range \([1,n]\), where \(n\) is the size of the input array. Why is this the case? We can convince ourselves this is the case by thinking about which input can possibly lead to the largest possible output: among all possible arrays of size \(n\), only one configuration leads to the highest missing number: \(A = \{1,2,3,4, \ldots ,n\}\) i.e. the configuration where all numbers from \(1\) to \(n\) are present. All the other configurations contain duplicates, negative or numbers higher than \(n\) which forces the input to have “holes” (i.e. missing numbers in the range \([1,n]\)).

This fact can be exploited to keep track of which positive numbers from \(1\) to \(n\) are present in \(A\). We can use an array of booleans flags of size \(n\) to store this information. Therefore, all we have to do is set the \(x^{th}\) flag to true for each number \(x\) in \(A\) in the range \([1,n]\). Eventually, this array of flags contains the answer, which can be found by scanning through it linearly to find the first false element which signals that this is the first missing element.

An implementation of the idea above is found in Listing [list:first_positive_missing_constant_space].

Listing 3: Solution to the problem of finding the smallest missing positive integer in an array.

int first_positive_missing_linear_space(std::vector<int> A)
{
  std::vector<bool> F(A.size(), false);

  for (const auto &x : A)
  {
    if (x >= 0 && x < A.size())
      F[x] = true;
  }
  for (size_t i = 0; i < F.size(); i++)
    if (!F[i])
      return i;

  return A.size();
}

The code in Listing [list:first_positive_missing_linear_space] works in two phases:

  1. For each number of \(A\) in the range \([1,n]\), we remember we have found it by marking the cell at index in .

  2. We find if exists the first cell in containing false which means the corresponding element was not found in \(A\). If a false cell does not exist then we can conclude the missing number is \(n+1\).

This approach is considered good and it has time and space complexity both equal to \(O(n)\).

As an alternative, instead of an array, we can use a hashmap to keep track of which element has been already seen as the array is scanned from left to right. This might have advantages in some cases, especially when \(n\) is large but there are many duplicates in \(A\).

Linear time and constant space solution

As mentioned above, the optimal solution does not use any additional space but shares the idea of keeping track of which element has been found with the solution described in Section 10.3.3.

In order not to use additional space, we have to somehow implement the functionality the array (in the code above) provides using the input array itself! Doing so sounds harder than it really is, especially until we realize that we can store the information about a positive number being pres

We know already from what we have seen in the previous solutions that we can safely ignore every negative number in \(A\), and that the largest output we can ever hope to get is always less than the number of positives in \(A\). For instance, if we have an array of size \(n\) with \(x\) negatives and \(y\) positives (\(n=x+y\)), then the largest possible output we can get is: \(y+1\).

As an example, let’s consider the array \(A=\{-1, -2, -3, 0, 1, 2, 3\}\) which has \(x=4\) negatives and \(y=3\) positives (\(n=7\)). We can see that the missing number is \(4=y+1\) and also that, if we substitute any positive (or negative or zeros for that matter) number with a different positive, the output of the function will not increase.

The idea is to loop through \(A\) and for each number \(x>0\) store the information about \(x\) being present in \(A\) in some cell of \(A\) itself. But which one? What we want is to have a mapping from the positives in \(A\) to indices of \(A\). We will use this mapping to choose the cell in which we remember whether a number is present or not by changing its sign.

We can start by noticing that, if \(A\) contains \(y\) positive, then we have \(y\) cells to which we can change the sign from positive to negative. In our quest to create this 1-to1 mapping one problem, the problem we face is that \(A\) is unsorted and positives and negatives are all shuffled around. Therefore, the first step would be to rearrange the elements so that all the positives appear before all the negatives. This way, creating the mapping becomes much easier. If all \(y\) positive are located from index \(0\) to \(y-1\), then every time we process an element of \(x\) of \(A\), which value is between \(1\) and \(y\) (\(1 \leq x \leq y\)), we can change the value of the cell at index \(x\) to remember the fact the value \(x\) is present in \(A\). Therefore, the main problem at this point is to rearrange \(A\) so that all positives appear before all negatives and zeros in an efficient manner.

Given an unsorted array, we can rearrange its element so that all positives before all negatives by using a two-pointer technique where we use (not surprisingly) two pointers \(s\) and \(e\) pointing initially to index \(0\) and \(n-1\), respectively. The idea is to keep moving \(s\) towards the end of the array and \(e\) towards the start, until \(s\) points to a positive and \(e\) points to a negative. At this point, \(s\) points to a value that should appear after the element pointed by \(e\) and therefore we swap those values. If we keep repeating this process until \(s>=e\), eventually all the pairs of misplaced elements (a negative appearing before a positive) are swapped and the final array is arranged so that all negatives appear in the first \(x\) positions of the array.

Consider for instance the array \(A=\{-1, 1, 2,-2 0,3,-3\}\). Initially \(s=0\) and \(e=6\). We first move \(s\) to the first positive which happens to appear at index \(1\). Similarly, we move \(e\) to the left towards the first negative which appear at index \(6\). The elements pointed by \(s\) and \(e\) are swapped, and \(A=\{-1, -3, 2,-2 0,3,1\}\).

The same process is repeated and after they are moved, \(s=2\) and \(e=4\). The values they point to are swapped leaving \(A=\{-1, -3, 0,-2 2,3,1\}\). When the pointers are moved next, they would cross, and this signals the rearrangement is finally complete. It is important to notice that the aim of this process is not to sort the array, but to simply make all the negatives appear before all the positives (therefore still allowing the positives and the negatives to appear in any order).

After all the numbers in \(A\) are processed this way, similarly to what we did in step \(2\) of the solution running in linear time and space (where we used the array ‘F’), we can scan the portion of \(A\) from index \(0\) to \(y\) looking for positives. If we find one at index \(k\), it means that the element \(k+1\) is missing from the array as if it was present it would have undergone a sign change. If all those cells contain negative, then it means that the missing number is \(n+1\).

This idea is implemented in Listing [list:first_positive_missing_linear_space] shown below.

Listing 4: Linear time and constantspace solution to the problem of finding the smallest missing positive integer in an array.

int divide_pos_neg(std::vector<int> N)
{
  int s = 0;
  int e = N.size() - 1;
  while (s <= e)
  {
    while (s <= e && N[s] > 0)
      s++;
    while (s <= e && N[e] <= 0)
      e--;
    if (s >= e)
      break;

    std::swap(N[s], N[e]);
  }
  return s;
}

int first_positive_missing_constant_space(std::vector<int> N)
{
  const int num_pos = divide_pos_neg(N);
  for (int i = 0; i < num_pos; i++)
  {
    const int ni = abs(N[i]);
    if (ni > 0 && ni - 1 < num_pos)
      N[ni - 1] *= -1;
  }

  for (int i = 1; i < num_pos; i++)
    if (N[i - 1] >= 0)
      return i;
  return num_pos + 1;
}

The two phases of this solution are packaged into two functions:

  • builds on it and uses the rearrangement to mark the presence of any element \(x\) in the range \([1,y]\) by changing the sign of the cell at index \(x-1\). When it is done with it, it proceeds in finding the answer by searching for the smallest index in \(i\) containing a positive.

  • is responsible for rearranging the input array as discussed above

The complexity of this approach is linear in time and constant in space which is optimal.