Coding interview Essentials

34. Median of two sorted arrays

Median of two sorted arrays

Introduction

The median is one of the most basic and important concepts in statistics and probability theory with applications in almost every field of science. It is defined as the value that splits a certain dataset into two equally sized halves: the higher and the lower half. For example the median of the dataset \(\{1,3,4,6,10,12,19\}\) is \(6\) because we have \(3\) elements greater and \(3\) elements smaller than \(6\). When the size of the dataset is even, no such element exists and thus the median is defined as the mean of the two middle elements; For instance given the dataset \(\{1,3,4,6,8,10,12,19\}\), the median is \(\frac{6+8}{2}=7\).

The problem covered in this chapter concerns finding the median from a dataset provided as two separate input lists of values (you can imagine, for instance, that each input set comes from a separate thread as part of a multithreaded application to analyze a large dataset). Although an obvious solution can be derived from the definition of median, this problem is still considered difficult to solve optimally in a coding interview context as it requires non-trivial insights and careful implementation.

We will start by dissecting the problem statement before we then take a deeper dive into a number of possible solutions beginning with a naive and inefficient approach and working up to the more more sophisticated and optimal approach.

Problem statement

You are given two sorted arrays \(A\) and \(B\) of size \(m\) and \(n\), respectively. Your task is to write a function that takes as input \(A\) and \(B\) and returns the median of the two sorted arrays. \(A\) and \(B\) con be considered to be proper subsets of a third dataset \(C = A \cup B\).


Given two sorted arrays:

  • \(A=[1,4,6,10,15]\)

  • \(B=[2,3,5,6]\)

The median is \(5\) (see Figure 38.1).


Given two sorted arrays:

  • \(A=[1,4,6,10]\)

  • \(B=[2,3,5,6]\)

The median is \(\frac{5+4}{2} = 4.5\) (see Figure 38.2).

[fig:median_sorted_arrays:example2] Figure 1: Example of median of two sorted arrays where the total number of elements is odd.

[fig:median_sorted_arrays:example1] Figure 2: Example of median of two sorted arrays where the total number of elements is even.

Clarification Questions

Can \(A\) or \(B\) be empty?

Yes, but you can assume that \(|A \cup B| > 0\) i.e. at most one of the input array can be empty.

Discussion

Let’s start our discussion by reviewing the concept of the median. The median of a collection \(C\) of \(n\) elements is (\(C_i\) represents the \(i^{th}\) element of \(C\)):

  • \(C_{\frac{n}{2}}\) if \(n\) is odd (see Figure 38.2)

  • \(\frac{C_{ \lfloor \frac{n}{2} \rfloor }+C_{ \lceil \frac{n}{2} \rceil }}{2}\) if \(n\) is even (see Figure 38.1)

In simpler terms the median of a sorted collection is the element which divides the collections into two equally sized halves, left and right, each with the same number of elements. If \(n\) is even, clearly such element does not exists and thus the median is the defined to be the mean of the two middle elements as shown in Figure 38.2. Additionally, note that because the collection is sorted then all the elements in the left half are smaller than or equal to the median and all the elements on the right half are larger.

Brute-force

Armed with the definition of median, we can immediately devise a simple and effective approach to find it given the two input sorted arrays. The only difference between the problem statement and the definition of median is that we are given two sorted arrays and not just one. Therefore the very first thing that should come to mind is to:

  1. create a third array \(C = A \cup B\), which is the combination of the two input arrays

  2. proceed by sorting \(C\),

  3. calculate the median (not forgetting to take into consideration the parity of \(|C|\))

This approach is clearly correct as it is a direct consequence and application of the definition of the median given above; but it is also far from being optimal. Listing [list:median_sorted_naive] shows a C++ implementation of this idea. Time and space complexities of this approach are \(O((n+m)log(n+m))\)(because of sorting) and \(O(s+m)\)(space required by the third array), respectively. Despite being sub-optimal this solution does have the benefit of being very short (only a few lines) and easy to read, explain and understand.

Listing 1: Naive implementation of solution to the problem of finding the median of two sorted arrays.

double mediam_sorted_arrays_naive(const std::vector<int> &A,
                                  const std::vector<int> &B)
{
  std::vector<int> C(A);
  C.insert(std::end(C), std::begin(B), std::end(B));
  std::sort(std::begin(C), std::end(C));

  const auto mid = C.size() / 2;
  return (C.size() & 1) ? C[mid] : (C[mid] + C[mid + 1]) / 2.0;
}

Brute-force improved

The brute-force approach can be improved somewhat if we use the fact that the arrays are already sorted. In the approach described in Section 38.3.1 we do not use this fact and therefore we are forced to sort the entire array \(C\) that we created by blindly juxtaposing \(A\) and \(B\) one after the other. By taking advantage of the fact that the inputs are sorted we can create the array \(C\) in a smarter way, so that it is already sorted. In order to do this we will use the fact that you can merge two sorted array into a third sorted array in linear time. You may already be familiar with this idea if you know how the famous merge-sort algorithm[@wiki:mergesort] works as the same operation is one of its two basic building blocks. Listing [list:median_sorted_naive_2] shows how this idea can be coded in C++. Note how most of the code is now taken by the function that is responsible for taking two sorted arrays (pay attention to the ) as input and returning a third sorted one.

Listing 2: Naive implementation of solution to the problem of finding the median of two sorted arrays using the merge part of merge-sort algorithm.

#include <cassert>
template <typename T>
std::vector<T> mergeSortedArrays(const std::vector<T> &A,
                                 const std::vector<T> &B)
{
  assert(std::is_sorted(std::begin(A), std::end(A)));
  assert(std::is_sorted(std::begin(B), std::end(B)));

  const int size = A.size() + B.size();
  std::vector<int> C;
  C.reserve(size);

  auto itA = std::begin(A);
  auto itB = std::begin(B);

  while (itA != std::end(A) && itB != std::end(B))
  {
    if (*itA < *itB)
      C.push_back(*itA++);
    else
      C.push_back(*itB++);
  }

  while (itA != std::end(A))
    C.push_back(*itA++);

  while (itB != std::end(B))
    C.push_back(*itB++);
  return C;
}

double mediam_sorted_arrays_merge(const std::vector<int> &A,
                                  const std::vector<int> &B)
{
  std::vector<int> C = mergeSortedArrays(A, B);

  const int mid = C.size() / 2;
  return (C.size() & 1) ? C[mid] : (C[mid] + C[mid + 1]) / 2.0;
}

The time and space complexities of this version are both \(O(n+m)\), much better than for the solution presented in Section 38.3.1. It is, however, still sub-optimal as this problem can be solved in logarithmic time. We are going to see how in Section 38.3.3.

Merge sorted arrays in linear time

How exactly can we merge two sorted arrays \(X\) and \(Y\) into a third sorted array \(Z\) in linear time? The basic idea is that we can build \(Z\) incrementally starting from an empty array and at each step of the process inserting one of the elements of \(X\) or \(Y\) depending on which one of the two contains the smallest element at that moment. In the implementation of this is achieved by using two iterators, and each pointing to the next element of \(X\) and \(Y\) to be inserted in \(Z\), respectively. The loop is responsible for comparing the two elements pointed by the iterators and always inserting the smallest one into \(Z\). Once an element is merged in the final array, the corresponding iterator is incremented so the next value will be considered at the next iteration. When one of the two iterators reaches the end of its array all we are left with are the remaining elements of the other collection that we can at this point blindly insert into \(Z\) because they are sorted (see the last two loops in the code). Figure [fig:median_sorted_array:mergearray] shows all the steps that are necessary to perform the merging for the arrays: \(X = \{1,3,5,8,15\}\) and \(X = \{2,4,7\}\). At step \(1\) (Figure 38.3) \(Z\) is initially empty and and point to the beginning of \(X\) and \(Y\), respectively. Because the element pointed by itA is smaller, it is selected for merging and thus itA is incremented. At step \(2\) (Figure 38.4) the element pointed by itB is smaller, and as in the previous step, it is merged in \(Z\) and itB is incremeneted. The same operations are performed for all the steps depicted in Figures 38.5, 38.6, 38.7, 38.8 and 38.9. Eventually it goes out of range signalling that all the elements in \(Y\) have been processed (see Figure 38.9). At this point then, as shown in Figure 38.10 we can safely merge all the elements in \(X\) into \(Z\) that is now ready to be returned (see Figure 38.11).

Figure 3: Step 1: Z is initially empty. itX and itY points to the beginning of X and Y, respectively. The element pointed by itX is merged as it is smaller. itX is advanced by one position.
Figure 4: Step 2: itY is smaller than itX, thus it is the one being merged. itB is also advanced by one position.
Figure 5: Step 3: itX is smaller than itY, thus it is the one being merged. itX is also advanced by one position.
Figure 6: Step 4: itY is smaller than itX, thus it is the one being merged. itB is also advanced by one position.
Figure 7: Step 5: itX is smaller than itY, thus it is the one being merged. itX is also advanced by one position.
Figure 8: Step 6:itY is smaller than itX, thus it is the one being merged. itB is also advanced by one position.
Figure 9: itY now points to the past-the-end element of Y. There are no more elements of Y to merge.
Figure 10: Step 7: All the remaining elements from the current location of itX to the end of X are merged into Z.
Figure 11: All the elements of X and Y have been merged. Z contains the element of X \cup Y and it is sorted.

Logarithmic solution

If we want to improve the \(O(n+m)\) solution we have at hand at this point we need to abandon the idea of constructing a third array containing all the elements from the inputs. There is no way this can be done in less than linear time as one must at least access the input element once. In reality, we do not actually need to have the array \(C\) at all.

The key insights are:

  • we know exactly what the size of the merged array \(C\) would be: \(n+m\).

  • we also know that the part of \(C\) to the left of the median, \(C_l\) would be made up from elements among the smallest values of \(A\) and \(B\). These values also lie in the left portions of \(A\) and \(B\). For instance, in the example in Figure 38.1 we can see that the left half (the first \(5\) elements) of \(A \cup B\) is made up from the first two elements of \(A\) and the smallest \(3\) elements of \(B\). Because \(C\) will be sorted, only the smallest elements of \(A\) and \(B\) will can be part of \(C_l\).

The problem here is that we do not know exactly how many elements of \(A\) will be part \(C_l\) but if we do then we also know immediately how many elements of \(B\) go to \(C_l\) and at that point we can calculate the median. We cannot directly find how many elements of \(A\) contribute to \(C_l\), but we can test fairly easily if the first \(i\) elements do. Let’s suppose we try to make \(C_l\) by using \(i\) elements from the left portion of \(A\). Because \(|C|=n+m\) then \(|C_l| = \frac{n+m}{2} = i+j\) where \(j\) is the number of elements from the left part of \(B\) contributing to \(C_l\). Thus if we take \(i\) elements from \(A\) we need to take \(j = (n+m)-i\) elements from \(B\). Once \(i\) and \(j\) are decided, we also know that the last element of \(C_l\), will be the maximum element among the first \(i\) elements of \(A\) and the first \(j\) elements of \(B\). From these arguments it follows that the right half of \(C\), \(C_r\) must contain all the remaining \(n-i\) elements of \(A\) and \(m-j\) of \(B\), and also that the first element of \(C_r\) will be the smallest element among them. Given:

  • \(M_l\) is the largest elements among the \(A[i]\) \(B[j]\)

  • \(m_r\) is the smallest element among the \(A[i+1]\) \(B[j+1]\)

then if \(i\) is indeed the right amount of elements from \(A\) belonging to \(C_l\) then, \(M_l \leq m_r\). If \(M_l > m_r\) then we need to understand whether we took too many or too few elements from \(A\) to be part of \(C_l\). We can check this by checking whether \(M_l\) belongs to \(B\) or \(A\), respectively. Thus if \(A[i] > B[j]\) we reduce or increase \(i\) by doing \(r = r-1\). Conversely if \(A[i] < B[i]\) then \(i\) is increased by moving the left boundary of the binary search range: \(l = l+1\).

Listing [list:median_sorted_binary] shows an implementation of the idea above.

Listing 3: Binary search solution to the \textit{median of two sorted arrays} problem.

size_t midpoint(const size_t l, const size_t r)
{
  assert(l <= r);
  return l + (r - l) / 2;
}
double mediam_sorted_arrays_binary_search(const std::vector<int> &A,
                                          const std::vector<int> &B)
{
  size_t l = 0, r = A.size() - 1;
  const size_t size_C      = A.size() + B.size();
  const size_t half_size_C = size_C / 2;

  auto median = 0.0;
  while (l <= r)
  {
    const size_t i = midpoint(l, r);
    const size_t j = half_size_C - i;
    /*
        const int idx_i = i - 1;
        const int idx_j = j - 1;
        if (A[i - 1] <= B[j] && B[j - 1] <= A[i])
          if (size % 2 == 0)
            return (std::max(A[i - 1], B[j - 1]) + std::min(A[i], B[j])) / 2.0;
          else
            return std::max(A[i - 1], B[j - 1]);

        if (A[i - 1] > B[j])
          r = i - 1;
        else
          l = i - 1;*/
  }
  return median;
}

https://leetcode.com/problems/median-of-two-sorted-arrays/ complete this code first use binary search to find i l , r is the range of elements of A initially l = 0 r = min(n, (n+m)/2) i = l+r/2 j = n+m-i if max among A[i] and B[j] <= min A[i+1],B[j+1] we have a median otherwise if A[i] > B[j] then r = i-1 else l = i+1