Coordination, Waiting & Cancellation

CountDownLatch Never Reaches Zero

CountDownLatch Never Reaches Zero: practice diagnosing a Java concurrency bug whose symptoms include a hung main thread, an await() that never returns, and partially finished work.

  • One-shot synchronizers
  • CountDownLatch
  • Coordination
  • Java
  • Beginner

Production symptoms

  • Main thread hangs
  • Await never returns
  • Partial work finished

Failure scenario

Code

Java example
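// One count per expected task (this snippet assumes tasks holds exactly three).
CountDownLatch done = new CountDownLatch(3);

for (Task task : tasks) {
    executor.submit(() -> {
        if (!task.isValid()) {
            return; // BUG: early return skips countDown()
        }
        task.prepare(); // BUG: an exception thrown here also skips countDown()
        done.countDown();
    });
}

done.await(); // blocks forever if any task took one of the paths above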

Prod Symptoms

A startup or batch coordinator launches several workers and waits for all participants before moving to the next phase. One worker fails or returns before signaling completion, so the latch count never reaches zero and the coordinator waits indefinitely.

Key signal: A CountDownLatch tracks anonymous signals. It does not know which worker owns a count or whether that worker succeeded.

  • Some initialization or partition-complete logs appear
  • The coordinator never marks the service ready or completes the batch phase
  • Worker threads may already be gone while the coordinator remains in await()
  • CPU stays low because the coordinator is parked
  • A restart may change which worker fails, but it does not fix the accounting protocol

Run Locally

  • worker 1 and worker 3 finish
  • worker 2 throws before countDown()
  • The remaining latch count is 1 after all workers terminate
  • main remains in await()
  • The final "all workers done" line is never printed

What to look for

  • main waiting in CountDownLatch.await
  • A worker exception or early return path before countDown
  • Latch count initialized higher than the number of guaranteed signals

Run
javac CountDownLatchStuckDemo.java
java CountDownLatchStuckDemo

Inspect while stuck
jps
jstack <pid>
jcmd <pid> Thread.print

CountDownLatchStuckDemo.java
import java.util.concurrent.CountDownLatch;

public class CountDownLatchStuckDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch done = new CountDownLatch(3);
        Thread[] workers = new Thread[3];

        for (int i = 1; i <= 3; i++) {
            final int workerId = i;
            Thread worker = new Thread(() -> {
                System.out.println("worker " + workerId + " started");
                if (workerId == 2) {
                    throw new RuntimeException("failed before countDown");
                }
                sleepQuietly(300);
                System.out.println("worker " + workerId + " done");
                done.countDown();
            }, "worker-" + i);
            workers[i - 1] = worker;
            worker.start();
        }

        for (Thread worker : workers) {
            worker.join();
        }

        System.out.println("remaining count = " + done.getCount());
        System.out.println("main waiting for all workers");
        done.await();
        System.out.println("all workers done");
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

Note: Worker 2 throws before countDown. After all worker threads terminate, the latch still has one missing signal.

Diagnosis and fix

Explanation

CountDownLatch tracks a fixed number of anonymous signals. It does not know which worker owns each count or whether that worker succeeded.

Key signal: Define what one count means: successful completion or reaching a terminal state. CountDownLatch cannot represent both by itself.

  • The latch starts with one count per expected participant
  • Two workers call countDown(), reducing the demo count from three to one
  • The failed worker exits without signaling
  • An untimed await cannot complete while the count remains above zero
  • Calling await again does not repair or reset the one-shot latch
  • Putting countDown() in finally is correct when the latch represents terminal worker states, not successful results
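
The accounting described above can be reproduced without any worker threads. The following is a minimal sketch, not part of the original demo files: the class name LatchAccountingSketch and the helper opensAfter are made up for illustration, and the 50 ms timeouts are arbitrary. It uses a timed await only so the sketch terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration class, not part of the demo files.
class LatchAccountingSketch {

    // Creates a latch expecting `expected` signals, delivers only `delivered`,
    // then waits twice with a short timeout. Returns whether the latch opened.
    static boolean opensAfter(int expected, int delivered) {
        CountDownLatch latch = new CountDownLatch(expected);
        for (int i = 0; i < delivered; i++) {
            latch.countDown();
        }
        try {
            // A timed await returns false if the count is still above zero.
            boolean first = latch.await(50, TimeUnit.MILLISECONDS);
            // Waiting again neither resets the latch nor supplies the missing signal.
            boolean second = latch.await(50, TimeUnit.MILLISECONDS);
            System.out.println("expected=" + expected + " delivered=" + delivered
                    + " remaining=" + latch.getCount()
                    + " first=" + first + " second=" + second);
            return second;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        opensAfter(3, 2); // the demo's situation: one signal is missing
        opensAfter(3, 3); // every participant signalled
    }
}
```

With three expected signals and two delivered, both waits report false and the remaining count stays at 1. Only the third countDown() changes the outcome.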

How to Diagnose

Use the thread dump to find the waiting coordinator, then reconcile the expected participant count with actual task lifecycle events.

  • Find the coordinator parked in CountDownLatch.await()
  • Log or inspect the remaining count with getCount()
  • Compare the initial count with the number of tasks that were actually submitted
  • Check exception, early-return, rejection, and cancellation-before-start paths
  • Identify workers that started but never reached a terminal path
  • Remember that the dump shows the waiter, not which anonymous signal is missing

Commands
jps
jstack <pid>
jcmd <pid> Thread.print

Expected dump shape
"main" #... WAITING (parking)
  at jdk.internal.misc.Unsafe.park(Native Method)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:...)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:...)
  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:...)
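
Because the dump identifies only the waiter, it can help to record which participant owns each count. The sketch below is one possible approach under stated assumptions, not a JDK facility: NamedLatchSketch, arrive, missing, and runDemo are hypothetical names, and worker-2 returns early (rather than throwing) to keep the output free of stack traces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper: a latch that remembers which participants are still pending.
class NamedLatchSketch {
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final CountDownLatch latch;

    NamedLatchSketch(List<String> participants) {
        pending.addAll(participants);
        latch = new CountDownLatch(pending.size());
    }

    // Counts down only the first time a known participant arrives.
    void arrive(String participant) {
        if (pending.remove(participant)) {
            latch.countDown();
        }
    }

    boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        return latch.await(timeout, unit);
    }

    // Sorted snapshot of participants that have not signalled yet.
    Set<String> missing() {
        return new TreeSet<>(pending);
    }

    // Runs three workers; worker-2 exits early without signalling.
    static Set<String> runDemo() {
        List<String> names = List.of("worker-1", "worker-2", "worker-3");
        NamedLatchSketch done = new NamedLatchSketch(names);
        List<Thread> threads = new ArrayList<>();
        for (String name : names) {
            Thread t = new Thread(() -> {
                if (name.equals("worker-2")) {
                    return; // simulated early exit: no signal is sent
                }
                done.arrive(name);
            }, name);
            threads.add(t);
            t.start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
            System.out.println("latch opened: " + done.await(200, TimeUnit.MILLISECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.missing();
    }

    public static void main(String[] args) {
        System.out.println("missing signals from: " + runDemo());
    }
}
```

Logging missing() when a timed await expires turns "remaining count = 1" into a named participant, which narrows the search to one worker's exit paths.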

How to Fix

  • Put countDown() in finally when each started worker owns one terminal-state signal
  • Report worker success or failure through a separate result channel
  • Handle rejected or cancelled-before-start tasks because their worker finally block never runs
  • Use timed await when indefinite blocking would violate the operational budget
  • Treat timeout as workflow failure, not as successful completion
  • Prefer futures or completion-oriented executor APIs when results and failures must be aggregated
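
The last item can be sketched with ExecutorService.invokeAll, which returns one Future per task and, in its timed form, cancels tasks that have not completed when the timeout expires. This is an illustrative sketch under assumptions: InvokeAllSketch and runAll are hypothetical names, and the simulated failure in worker 2 mirrors the demo rather than any real workload.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration class: aggregates results and failures per task.
class InvokeAllSketch {

    static List<String> runAll(long timeoutMillis) {
        List<Callable<String>> tasks = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            final int workerId = i;
            tasks.add(() -> {
                if (workerId == 2) {
                    throw new IllegalStateException("simulated failure");
                }
                return "worker " + workerId + " ok";
            });
        }

        ExecutorService executor = Executors.newFixedThreadPool(3);
        List<String> report = new ArrayList<>();
        try {
            // Returns when every task is done or the timeout expires;
            // futures come back in the same order as the submitted tasks.
            List<Future<String>> futures =
                    executor.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            for (Future<String> future : futures) {
                try {
                    report.add(future.get());
                } catch (ExecutionException failed) {
                    // The task threw; its exception is preserved as the cause.
                    report.add("failed: " + failed.getCause().getMessage());
                } catch (CancellationException cancelled) {
                    // The task did not finish within the timeout.
                    report.add("cancelled (timeout)");
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            report.add("interrupted");
        } finally {
            executor.shutdownNow();
        }
        return report;
    }

    public static void main(String[] args) {
        System.out.println(runAll(2000));
    }
}
```

Here each task's outcome is attached to its own Future, so there is no separate count to keep in sync with the number of workers.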

CountDownLatchFinallyFixed.java
import java.util.concurrent.CountDownLatch;

public class CountDownLatchFinallyFixed {
    public static void main(String[] args) throws Exception {
        CountDownLatch done = new CountDownLatch(3);

        for (int i = 1; i <= 3; i++) {
            final int workerId = i;
            Thread worker = new Thread(() -> {
                try {
                    System.out.println("worker " + workerId + " started");
                    if (workerId == 2) {
                        throw new RuntimeException("failed during work");
                    }
                    sleepQuietly(300);
                    System.out.println("worker " + workerId + " done");
                } catch (RuntimeException error) {
                    System.out.println("worker " + workerId + " failed: " + error);
                } finally {
                    done.countDown();
                }
            }, "worker-" + i);
            worker.start();
        }

        System.out.println("main waiting for all workers");
        done.await();
        System.out.println("all workers reached a terminal state");
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

Note: The latch now tracks terminal worker states, not only successful worker states.
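
Since the latch only says that every worker reached a terminal state, success and failure need to travel separately. The sketch below combines several of the fixes listed above for an executor-based coordinator: countDown() in finally, a separate outcome map, a count for rejected submissions, and a timed await that treats timeout as failure. All names (LatchWithOutcomesSketch, runPhase, the outcome strings) are hypothetical, and the use of IllegalStateException to report a timeout is an assumption, not a recommendation from the original text.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration class: latch for terminal states, map for results.
class LatchWithOutcomesSketch {

    // Returns one outcome per worker id, or throws if the phase exceeds its budget.
    static Map<Integer, String> runPhase(int workerCount, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(workerCount);
        Map<Integer, String> outcomes = new ConcurrentHashMap<>();
        ExecutorService executor = Executors.newFixedThreadPool(workerCount);
        try {
            for (int i = 1; i <= workerCount; i++) {
                final int workerId = i;
                try {
                    executor.execute(() -> {
                        try {
                            if (workerId == 2) {
                                throw new IllegalStateException("simulated failure");
                            }
                            outcomes.put(workerId, "ok");
                        } catch (RuntimeException error) {
                            outcomes.put(workerId, "failed: " + error.getMessage());
                        } finally {
                            done.countDown(); // terminal state, success or not
                        }
                    });
                } catch (RejectedExecutionException rejected) {
                    // The task never started, so its finally block will never run.
                    outcomes.put(workerId, "rejected");
                    done.countDown();
                }
            }
            // Timeout is a workflow failure, not a successful completion.
            if (!done.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException(
                        "phase timed out, remaining count = " + done.getCount());
            }
            return new TreeMap<>(outcomes);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runPhase(3, 2000));
    }
}
```

The coordinator can then decide what to do with a failed or rejected entry, for example failing the whole phase, instead of inferring success from the latch alone.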