Coding interview Essentials

23. Find the majority element

Find the majority element

Introduction

It is election time and we are hired to make sure the vote counting is free of mistakes and quick. Our job is to determine the winner of this year election. Votes are collected as an unordered list and our task is to determine whether there is a clear winner i.e. someone with the majority of the votes (i.e. with more than 50% of them) or a new voting session is necessary.

This problem has been asked at Google and Yahoo interviews for software engineering positions and it is considered medium difficulty. Infact, despite the fact it is almost trivial to solve it in linear space, doing so in constant space proves to be quite a bit more challenging and requires non-trivial insights.

As we will see in the coming sections, this problem (and its solution) is a specialization of a more general problem where we need to find out if there is an element in the input list that appears more than \(\frac{n}{k}\) times. Clearly, under this definition, we have the majority element problem when \(k=2\).

Problem statement

Given an array \(N\) of size \(n\), find the majority element i.e. that element that appears more than \(\lfloor \frac{n}{2} \rfloor\) times. If such element does not exists, return \(-1\).


Given the array \([1,2,3,2,2,1,1,1]\), the function returns \(1\) because it appears \(4\) times in an array of length \(8\).


Given the input array \([2, 1, 2]\) the function return \(2\) because it is greater than \(\frac{3}{2}\).


Given the input array \([2, 1, 2,3,4,5]\) the function return \(-1\)no element appear more than \(3\) times.

Clarification Questions

What are the minimum and maximum values an element of the array can take?

The minimum and maximum values are \([-10^9, 10^9]\), respectively. This is a good question to ask because if the range is small then we can apply a solution based on bucket counting.

Can the input array \(N\) be modified or shuffled?

Yes, the input array can be modified.

Discussion

We will examine three different solutions for this problem. We begin by looking at the brute-force approach in section 27.3.1 Section 27.3.3 will then describe an approach that uses sorting to improve the time complexity of the brute-force approach. Finally, in section 27.3.5 we will examine the optimal approach using the Boyer-Moore algorithm.

Brute-force

The brute force solution is very simple and consist of looping through the array and for each element counting how many times it occurs in the input array. Although this approach is simple, it will not serve you well in an interview scenario as it is far from the optimum and the interviewer is certainly expecting a more sophisticated solution. Listing [list:majority_element:bruteforce] shows a possible implementation of this approach.

Listing 1: Sample Caption

int find_majority_element_brute_force(const std::vector<int>& N)
{
  const size_t threshold = N.size() / 2;
  for (const auto x : N)
  {
    const size_t countx = std::count(begin(N), end(N), x);
    if (countx > threshold)
      return x;
  }
  return -1;
}

Hash-map approach

A simple improvement on the solution in the section 27.3.1 can be made by using a hash-map to store the number of occurrence of each element in the input array. There cannot be more than \(n\) different numbers in the the array \(N\), thus with a single pass of the input and with a linear cost in space we can calculate the number of occurrence of each element and check if any of the counters at any points gets higher than \(\lfloor \frac{n}{2} \rfloor\).

A possible implementation of this approach is shown in Listing [list:majority_element:hashmap]. The complexity of this approach is \(O(n)\) for both space and time. In-fact, even in the worst case all the elements of the input array are only read and stored once.

Listing 2: Solution to the problem of finding the majority element in an array using hash-map.

int find_majority_element_hash_map(const std::vector<int>& N)
{
  const size_t threshold = N.size() / 2;

  std::unordered_map<int, int> C;
  std::pair<int, int> max_val = {0, 0};
  for (const auto x : N)
  {
    int& countx = C[x];
    countx++;
    if (countx > threshold)
      return x;
  }
  return -1;
}

Sorting - Counting

The approach described in section 27.3.2 is definitely faster then the quadratic brute-force but at a linear price in space. In order not to pay the price in space and to lower the time complexity down from quadratic, we could rely on the fact that in a sorted collection of elements all equal elements appear grouped together e.g. in \([1,1,2,2,3,3,3,4,4,9,9]\), all the \(1\)s appear at the beginning of the array, followed by all the \(2\)s, etc. We can then count the number of occurrences of each element in constant space as shown in Listing [list:majority_element:sorting]. The complexity of this approach is bounded by the sorting which costs \(O(nlog(n)\) time.

Listing 3: Solution to the problem of finding the majority element in an array using sorting.

int find_majority_element_sorting(std::vector<int>& N)
{
  const size_t threshold = N.size() / 2;
  if (N.size() == 0)
    return -1;

  std::sort(std::begin(N), std::end(N));

  // current.first = number
  // number.second = occurrences
  std::pair<int, int> current;
  for (size_t i = 0; i < N.size(); i++)
  {
    if (N[i] != current.first || i == 0)
      current = {N[i], 1};
    else
      current.second++;

    if (current.second > threshold)
      return current.first;
  }
  return -1;
}

Sorting - Median

We can, however, make even better use of the fact that we have a sorted collection. The key idea here is that if a majority element exists then this element must be the median. After all, by definition , the median element is right in the middle of the sorted collection. Since the majority element will be occupy more than half of the positions of the array it must also occupy the median position. All that is necessary after sorting the array is to check if the median value appears more than \(\lfloor \frac{n}{2} \rfloor\) times. This idea is implemented in Listing [list:majority_element:median] and has a complexity of \(O(nlog(n)\) due to sorting.

Listing 4: Linear time constant space solution to the problem of finding the majority element in an array.

int find_majority_element_median(std::vector<int>& N)
{
  if (N.size() == 0)
    return -1;
  std::sort(std::begin(N), std::end(N));
  const int midpoint = N.size() / 2;
  const int el       = N[midpoint];

  const size_t threshold = N.size() / 2;
  if (std::count(begin(N), end(N), el) > threshold)
    return el;
  return -1;
}

Boyer-Moore algorithm

The best approach to solving this problem in linear time and constant space is, however, to use the Boyer-Moore algorithm[@Boyer1991].

The algorithm uses two variables to maintain a candidate element \(el\) of the sequence and its current count \(count=0\) (initialized to \(0\)). It processes the elements one by one and will perform the following operations:

  • if we are processing the very first element of the sequence or , it will set \(count=1\) and \(el\) to that element (this is our first candidate).

  • otherwise, if the element currently processed is equal to it increments the counter i.e.

  • if the element currently processed is different, then it decrements the counter i.e. ;

At the end of this process the variable will contain a candidate majority element. If the array contains a majority element then is the one. The algorithm correctness can be derived from the fact that the counter will be incremented more times than it will be decremented for the majority element. If we cannot assume that a majority element always exists then a second pass on the array is necessary in order to count the number of occurrences of in the input array.

Listing [list:majority_element:moore] shows a possible implementation of the Boyer-Moore algorithm. The complexity of this approach is \(O(n)\) time and \(O(1)\) space because the input array is scanned twice and only two additional integer variables are used.

Listing 5: Linear time constant space solution to the problem of finding the majority element in an array.

int find_majority_element_linear(const std::vector<int>& nums)
{
  if (nums.size() <= 0)
    return -1;

  int el    = nums.front();
  int count = 0;
  for (size_t i = 0; i < nums.size(); i++)
  {
    if (nums[i] == el)
    {
      count++;
    }
    else
    {
      count--;
    }
    if (count == 0)
    {
      el    = nums[i];
      count = 1;
    }
  }
  // check that el appears > n/2 times
  if (std::count(begin(nums), end(nums), el) > nums.size() / 2)
    return el;
  return -1;
}

Find the element repeated \(\frac{n}{k}\) times.

[example:find_repeated_number_n_divided_3_times:exercice1] Write a function that, given an array \(A\) of \(n\) integers and an integer \(k\), returns any of its element that occurs more than \(\frac{n}{k}\). If such an element does not exists the function returns \(-1\).

[example:find_repeated_number_n_divided_3_times:example1]
Given \(A=\{1,2,1,3,1\}\) and \(k=3\), the function return \(1\) as it occurs \(3 > \Big\lfloor\frac{|A|}{3}\Big\rfloor =2\) times.

Boyer-Moore algorithm extended

Solution approach:

if you have three distinct elements in the array the solution does not change. You can ignore all three of them.

Therefore, just keep track of the frequencies of two elements. When you process a new element you can do the following:

1. if you do not have three elements in the frequency list. add it with frequency onesid 2. if the element is equal to another element in the list. increase its frequency 3. if it is a new element (not in the list), decrease the frequency of all in the list by one. Remove any element with frequency zero.

Listing 6: Sample Caption