Finding the Median of a Stream Using Heaps

When it comes to handling a dynamic set of numbers, finding the median efficiently is often a challenge that programmers face. The median is defined as the middle value when the numbers are sorted. If there's an even number of observations, the median is usually considered as the average of the two middle numbers. In our context, we'll explore how to maintain this median while processing a continuous stream of numbers.

The Algorithm Overview

We will use two heaps (priority queues) to allow us to efficiently find the median as we insert numbers into our dataset:

Max Heap (left heap): This will store the smaller half of the numbers.
Min Heap (right heap): This will store the larger half of the numbers.

This dual-heap approach allows us to keep track of the median efficiently:

The max-heap can access the largest number in the smaller half quickly.
The min-heap can access the smallest number in the larger half quickly.

How It Works

Insert: When inserting a new number, we must decide whether it belongs to the max-heap or the min-heap.
- If the number is less than or equal to the maximum of the max-heap, it goes into the max-heap.
- Otherwise, it goes into the min-heap.
Balance: After each insertion, we may need to balance the heaps. The sizes of the heaps should not differ by more than one.
- If the max-heap exceeds the size of the min-heap by more than one, move the maximum from the max-heap to the min-heap.
- If the min-heap exceeds the size of the max-heap, move the minimum from the min-heap to the max-heap.
Find Median:
- If the max-heap has more elements, the median is the root of the max-heap.
- If both heaps are of equal size, the median is the average of the roots of both heaps.

Implementation in Java

Here's a simple yet effective Java solution to manage the insertion of numbers and median calculation:

import java.util.Collections;
import java.util.PriorityQueue;

public class MedianFinder {
    private PriorityQueue<Integer> maxHeap; // Max-heap for the lower half
    private PriorityQueue<Integer> minHeap; // Min-heap for the upper half

    public MedianFinder() {
        maxHeap = new PriorityQueue<>(Collections.reverseOrder()); // Store smaller half
        minHeap = new PriorityQueue<>(); // Store larger half
    }

    public void addNum(int num) {
        // Always add to maxHeap first
        maxHeap.offer(num);
        
        // Ensure the largest of maxHeap is less than or equal to the smallest of minHeap
        if (!minHeap.isEmpty() && maxHeap.peek() > minHeap.peek()) {
            minHeap.offer(maxHeap.poll());
        }

        // Balance the sizes of the heaps
        if (maxHeap.size() > minHeap.size() + 1) {
            minHeap.offer(maxHeap.poll());
        } else if (minHeap.size() > maxHeap.size()) {
            maxHeap.offer(minHeap.poll());
        }
    }

    public double findMedian() {
        if (maxHeap.size() > minHeap.size()) {
            return maxHeap.peek();
        } else {
            return (maxHeap.peek() + minHeap.peek()) / 2.0;
        }
    }

    public static void main(String[] args) {
        MedianFinder medianFinder = new MedianFinder();
        medianFinder.addNum(1);
        System.out.println(medianFinder.findMedian()); // Outputs 1.0
        medianFinder.addNum(2);
        System.out.println(medianFinder.findMedian()); // Outputs 1.5
        medianFinder.addNum(3);
        System.out.println(medianFinder.findMedian()); // Outputs 2.0
    }
}

Explanation of Code

Initialization: Two heaps are defined, one as a max-heap and the other as a min-heap.
addNum Method: This method handles the addition of new numbers:
- It first adds the number to the max-heap.
- Then, it checks if it needs to move elements between heaps to keep them balanced.
findMedian Method: Depending on the sizes of the heaps, it calculates the median.

Why Heaps?

Utilizing heaps allows us to achieve efficient O(log n) time complexity for insertions while keeping the retrieval of the median at O(1). This feature becomes especially useful when processing large streams of data, where performance matters.

Example Walkthrough

Let’s consider we have a stream of numbers: 5, 10, 15, 20, 25.

Add 5: Max-Heap: [5], Min-Heap: []
Find Median: 5
Add 10: Max-Heap: [5], Min-Heap: [10]
Find Median: (5 + 10) / 2 = 7.5
Add 15: Max-Heap: [5], Min-Heap: [10, 15]
Balance: Max-Heap: [10], Min-Heap: [15]
Find Median: 10
And so forth...

Conclusion

Utilizing bi-heap data structures to compute the median on a streaming dataset is an efficient and elegant solution, and understanding it can distinguish you as a strong candidate during developer interviews.

Level Up Your Skills with Xperto-AI