EM-Vectorization
The EM algorithm for the two-coin problem estimates the biases of two coins when outcomes are mixed and unlabeled. Vectorization replaces nested loops with matrix operations, letting PyTorch handle $[B, N]$ computations in parallel. The E-step assigns soft probabilities (responsibilities) of each outcome belonging to coin A or B. The M-step updates the coin biases, and the best initialization is chosen via log-likelihood maximization.
Vectorization in the EM Algorithm: A Two-Coin Example
Introduction
The Expectation-Maximization (EM) algorithm is a core method for estimating parameters in probabilistic models with latent variables. In the classic two-coin problem, each observed flip comes from either Coin A or Coin B (we do not observe which). Each coin has an unknown bias: Coin A lands heads with probability $\pi_A$, and Coin B lands heads with probability $\pi_B$.
We observe a sequence of flips $o_1, o_2, \dots, o_N$ where each $o_n \in \{0,1\}$ (1 = heads, 0 = tails). The EM algorithm estimates $\pi_A$ and $\pi_B$ from these incomplete observations. This post explains how to vectorize EM so we can run many independent EM initializations in parallel using tensor operations (PyTorch), why that is faster, and how the math maps to the code.
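To make the setup concrete, here is a small data-generation sketch. The true biases, the 50/50 choice of coin, and all variable names are illustrative assumptions, not values from the post:

```python
import torch

torch.manual_seed(0)
N = 200                                   # number of observed flips
true_pi_A, true_pi_B = 0.8, 0.3           # hypothetical true biases (unknown to EM)

# Latent variable: which coin produced each flip (EM never observes this)
from_A = torch.rand(N) < 0.5
p_heads = torch.where(from_A, torch.tensor(true_pi_A), torch.tensor(true_pi_B))
outcomes = torch.bernoulli(p_heads)       # [N] tensor of 1s (heads) and 0s (tails)
```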
Step 1: Probability of a Single Observation
For a single outcome $o \in \{0,1\}$ and a coin with bias $\pi$, the probability of observing $o$ is:

$$P(o \mid \pi) = \pi\, o + (1 - \pi)(1 - o)$$
This single expression covers both cases:
- If $o = 1$, the expression becomes $\pi$.
- If $o = 0$, the expression becomes $1 - \pi$.
This algebraic trick lets us compute probabilities for heads or tails without branching.
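As a quick sanity check, the same trick works elementwise on tensors; a minimal sketch with illustrative values:

```python
import torch

pi = torch.tensor(0.7)                 # coin bias
o = torch.tensor([1.0, 0.0, 1.0])      # observed flips (1 = heads, 0 = tails)

# Branchless probability: pi where o == 1, 1 - pi where o == 0
p = pi * o + (1 - pi) * (1 - o)        # tensor([0.7000, 0.3000, 0.7000])
```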
Step 2: Vectorization Setup (Shapes and Broadcasting)
We will perform $B$ EM runs in parallel (different random inits). Let:
- $N$ = number of observed flips.
- $B$ = number of parallel EM runs.
Represent data and parameters as tensors:
- Outcomes tensor (row vector): $\mathbf{o} \in \mathbb{R}^{1 \times N}$.
- Parameter vectors: $\boldsymbol{\pi}_A, \boldsymbol{\pi}_B \in \mathbb{R}^{B}$.
We use `unsqueeze` to reshape the parameter vectors for broadcasting.
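A minimal sketch of this setup, with illustrative sizes and randomly initialized parameters:

```python
import torch

B, N = 64, 100                                    # illustrative sizes
outcomes = torch.randint(0, 2, (1, N)).float()    # [1, N] row vector of flips
pi_A = torch.rand(B)                              # [B] one bias guess per EM run
pi_B = torch.rand(B)

pi_A_col = pi_A.unsqueeze(1)                      # [B, 1]
pi_B_col = pi_B.unsqueeze(1)                      # [B, 1]
```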
Broadcasting expands these to shape $[B, N]$ so elementwise operations act across all runs and all observations at once.
Step 3: The E-Step (Expectation Step)
In the E-step, we compute the "responsibilities" — i.e., how much each coin (A or B) is responsible for generating each observed outcome.
For a single data point $o_n$:
- Probability that coin A generated $o_n$: $P(o_n \mid \pi_A) = \pi_A\, o_n + (1 - \pi_A)(1 - o_n)$
- Probability that coin B generated $o_n$: $P(o_n \mid \pi_B) = \pi_B\, o_n + (1 - \pi_B)(1 - o_n)$
The intuition:
- If $o_n = 1$ (heads), the formula reduces to $\pi_A$ or $\pi_B$.
- If $o_n = 0$ (tails), the formula reduces to $1 - \pi_A$ or $1 - \pi_B$.
In vectorized form:
- `prob_A` becomes a $[B, N]$ matrix, where each row corresponds to a different EM run and each column to a different data point.
- The same holds for `prob_B`.
Then we normalize to compute responsibilities:

$$\gamma_{A,n} = \frac{P(o_n \mid \pi_A)}{P(o_n \mid \pi_A) + P(o_n \mid \pi_B)}, \qquad \gamma_{B,n} = \frac{P(o_n \mid \pi_B)}{P(o_n \mid \pi_A) + P(o_n \mid \pi_B)}$$

This gives us two $[B, N]$ matrices of weights, $\gamma_A$ and $\gamma_B$.
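A minimal sketch of the E-step under these definitions (names like `eps` are illustrative; the post's exact code may differ). For intuition, if a flip is heads and one run currently has $\pi_A = 0.8$ and $\pi_B = 0.3$, then $\gamma_A = 0.8 / 1.1 \approx 0.73$ for that entry.

```python
import torch

eps = 1e-10
B, N = 64, 100
outcomes = torch.randint(0, 2, (1, N)).float()    # [1, N]
pi_A, pi_B = torch.rand(B), torch.rand(B)         # [B] current bias estimates

# Per-coin probabilities of each outcome, broadcast to [B, N]
prob_A = pi_A.unsqueeze(1) * outcomes + (1 - pi_A.unsqueeze(1)) * (1 - outcomes)
prob_B = pi_B.unsqueeze(1) * outcomes + (1 - pi_B.unsqueeze(1)) * (1 - outcomes)

# Responsibilities: normalize so gamma_A + gamma_B == 1 elementwise
denom = prob_A + prob_B + eps
gamma_A = prob_A / denom
gamma_B = prob_B / denom
```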
Step 4: The M-Step (Maximization Step)
In the M-step, we update the coin biases using weighted averages.
For coin A:

$$\pi_A^{\text{new}} = \frac{\sum_{n=1}^{N} \gamma_{A,n}\, o_n}{\sum_{n=1}^{N} \gamma_{A,n}}$$

For coin B:

$$\pi_B^{\text{new}} = \frac{\sum_{n=1}^{N} \gamma_{B,n}\, o_n}{\sum_{n=1}^{N} \gamma_{B,n}}$$
Interpretation:
- The numerator counts how many "effective heads" were assigned to each coin.
- The denominator counts the total "effective flips" assigned to each coin.
- Together, they compute the weighted fraction of heads.
In code:
```python
Total_heads_A = torch.sum(gamma_A * outcomes, dim=1)
Total_tails_A = torch.sum(gamma_A * (1 - outcomes), dim=1)
pi_A = Total_heads_A / (Total_heads_A + Total_tails_A + eps)
```
The same applies for coin B.
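For completeness, a mirrored sketch for coin B, reusing `gamma_B`, `outcomes`, and `eps` from the sketches above:

```python
Total_heads_B = torch.sum(gamma_B * outcomes, dim=1)
Total_tails_B = torch.sum(gamma_B * (1 - outcomes), dim=1)
pi_B = Total_heads_B / (Total_heads_B + Total_tails_B + eps)
```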
Step 5: Log-Likelihood Calculation
To choose the best run (since EM can converge to local optima), we compute the log-likelihood for each run:

$$\log L = \sum_{n=1}^{N} \log\big(0.5\, P(o_n \mid \pi_A) + 0.5\, P(o_n \mid \pi_B)\big)$$
This measures how well the current parameters explain the observed data.
In code:
- Compute `prob_A` and `prob_B` again.
- Average them: `total_prob = 0.5 * prob_A + 0.5 * prob_B`
- Take logs: `log_probs = torch.log(total_prob + eps)`
- Sum across data points (along `dim=1`).
The output is one log-likelihood per run, with shape $[B]$.
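Putting those steps together, a minimal sketch (assuming `prob_A`, `prob_B` of shape $[B, N]$ and `eps` as in the earlier sketches, and the uniform 0.5/0.5 mixture used in this post):

```python
total_prob = 0.5 * prob_A + 0.5 * prob_B                   # [B, N]
log_likelihood = torch.log(total_prob + eps).sum(dim=1)    # [B], one value per run
```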
Step 6: Selecting the Best Run
Since EM is sensitive to initialization, we repeat the process $B$ times with different random initial guesses.
At the end, we select the run with the highest log-likelihood:

$$b^{*} = \arg\max_{b}\, \log L_b$$
This gives us the most plausible estimates of $\pi_A$ and $\pi_B$ among the runs we tried.
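In code, this is a single `argmax` over the per-run log-likelihoods (variable names are illustrative):

```python
best = torch.argmax(log_likelihood)    # index of the best EM run
best_pi_A = pi_A[best]
best_pi_B = pi_B[best]
```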
Step 7: Vectorization Summary
The key insight of vectorization is that instead of looping over:
- each data point
- and each initialization
we let PyTorch broadcasting handle the alignment of shapes.
- `outcomes`: shape $[1, N]$
- `pi_A.unsqueeze(1)`: shape $[B, 1]$
- Result: a broadcast $[B, N]$ matrix
This means:
- One matrix stores all probabilities for coin A across all runs and all data points.
- Another for coin B.
- Then, all operations (sums, divisions, logs) are applied in parallel.
This drastically reduces Python-level loops and leverages optimized tensor math on GPU/MPS.
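To tie the steps together, here is a self-contained sketch of the whole vectorized loop under the assumptions above (uniform 0.5/0.5 mixture, the branchless Bernoulli probability from Step 1, illustrative names and defaults); it is not the post's exact implementation:

```python
import torch

def vectorized_em(outcomes, B=100, n_iters=50, eps=1e-10, seed=0):
    """Run B random EM initializations in parallel for the two-coin problem.

    Assumes outcomes is a tensor of 0s and 1s with shape [N] and a uniform
    0.5/0.5 mixture over the two coins.
    """
    torch.manual_seed(seed)
    o = outcomes.float().unsqueeze(0)            # [1, N]
    pi_A = torch.rand(B)                         # [B] random initial biases
    pi_B = torch.rand(B)

    for _ in range(n_iters):
        # E-step: responsibilities, broadcast to [B, N]
        prob_A = pi_A.unsqueeze(1) * o + (1 - pi_A.unsqueeze(1)) * (1 - o)
        prob_B = pi_B.unsqueeze(1) * o + (1 - pi_B.unsqueeze(1)) * (1 - o)
        denom = prob_A + prob_B + eps
        gamma_A, gamma_B = prob_A / denom, prob_B / denom

        # M-step: weighted fraction of heads per run
        pi_A = (gamma_A * o).sum(dim=1) / (gamma_A.sum(dim=1) + eps)
        pi_B = (gamma_B * o).sum(dim=1) / (gamma_B.sum(dim=1) + eps)

    # Score each run and keep the best initialization
    prob_A = pi_A.unsqueeze(1) * o + (1 - pi_A.unsqueeze(1)) * (1 - o)
    prob_B = pi_B.unsqueeze(1) * o + (1 - pi_B.unsqueeze(1)) * (1 - o)
    log_lik = torch.log(0.5 * prob_A + 0.5 * prob_B + eps).sum(dim=1)   # [B]
    best = torch.argmax(log_lik)
    return pi_A[best], pi_B[best], log_lik[best]
```

On data like the simulated `outcomes` from the introduction sketch, the best run's estimates should land near the true biases, possibly with the roles of the two coins swapped, since the mixture is symmetric in A and B.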
Final Thoughts
By vectorizing the EM algorithm:
- We can run dozens or hundreds of EM initializations in parallel.
- The code is simpler (no nested loops).
- The math maps directly to tensor operations.
This design is critical when working with larger datasets or more complex mixture models, where performance and stability matter.