1. Introduction
In the previous article, we covered parallel-collectors, a small zero-dependency library that enables parallel processing for Stream API on custom thread pools.
Project Loom is the codename for the organized effort to introduce lightweight Virtual Threads (previously known as Fibers) to the JVM. The feature was finalized in JDK 21.
Let’s see how to leverage this in Parallel Collectors.
2. Maven Dependencies
If we want to start using the library, we need to add a single entry in Maven’s pom.xml file:
<dependency>
<groupId>com.pivovarit</groupId>
<artifactId>parallel-collectors</artifactId>
<version>3.0.0</version>
</dependency>
Or a single line in Gradle’s build file:
implementation 'com.pivovarit:parallel-collectors:3.0.0'
The newest version can be found on Maven Central.
3. Parallel Processing with OS Threads vs Virtual Threads
3.1. OS Thread Parallelism
Let’s see why parallel processing with Virtual Threads is a big deal.
We’ll start by creating a simple example. We’ll need an operation to parallelize, which is going to be an artificially delayed String concatenation:
private static String fetchById(int id) {
    try {
        Thread.sleep(1000);
    } catch (InterruptedException e) {
        // restore the interrupt flag instead of swallowing it
        Thread.currentThread().interrupt();
    }
    return "user-" + id;
}
We’ll also use custom code for measuring the execution time:
private static <T> T timed(Supplier<T> supplier) {
    var before = Instant.now();
    T result = supplier.get();
    var after = Instant.now();
    log.info("Execution time: {} ms", Duration.between(before, after).toMillis());
    return result;
}
Now, let’s create a simple parallel Stream processing example in which we create n elements and then process them on a pool of n threads with a parallelism of n:
@Test
public void processInParallelOnOSThreads() {
    int parallelProcesses = 5_000;
    var e = Executors.newFixedThreadPool(parallelProcesses);
    var result = timed(() -> Stream.iterate(0, i -> i + 1).limit(parallelProcesses)
      .collect(ParallelCollectors.parallel(i -> fetchById(i), toList(), e, parallelProcesses))
      .join());
    log.info("{}", result);
}
When we run it, we can see that it clearly does the job, because we don’t need to wait the 5000 seconds that sequential processing would take:
Execution time: 1321 ms
[user-0, user-1, user-2, ...]
But let’s see what happens if we try to increase the number of elements processed in parallel to 20_000:
[2.795s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (...)
[2.795s][warning][os,thread] Failed to start the native thread for java.lang.Thread "pool-1-thread-16111"
The OS-thread-based approach doesn’t scale since threads are expensive to create, and we quickly reach resource limits.
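To see why the pool had to be as large as the input in the first place, here’s a minimal JDK-only sketch. The class and method names (PoolSizingSketch, elapsedMillis) are hypothetical and not part of the article’s code or the library. It assumes each task simply blocks for a fixed delay: with fewer threads than blocking tasks, the tasks run in waves, so elapsed time grows roughly as ceil(tasks / threads) × delay.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper for illustration only; not part of parallel-collectors.
class PoolSizingSketch {

    // Runs `tasks` blocking jobs of `delayMillis` each on a fixed pool of
    // `threads` platform threads and returns the elapsed time in milliseconds.
    static long elapsedMillis(int tasks, int threads, long delayMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long start = System.nanoTime();
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(() -> sleep(delayMillis)));
            }
            for (Future<?> future : futures) {
                future.get(); // wait for every job to finish
            }
            return (System.nanoTime() - start) / 1_000_000;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    // Simulates a blocking call such as a remote fetch.
    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("8 tasks on 8 threads: " + elapsedMillis(8, 8, 100) + " ms");
        System.out.println("8 tasks on 4 threads: " + elapsedMillis(8, 4, 100) + " ms");
    }
}
```

On a typical machine, the second call should take about twice as long as the first, which is why keeping latency flat with OS threads means growing the pool with the input until the operating system refuses to create more threads.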
Let’s see what happens if we switch to Virtual Threads.
3.2. Virtual Thread Parallelism
Before Java 21, it wasn’t easy to come up with reasonable defaults for thread pool configuration. Luckily, Virtual Threads don’t require any: we can create as many of them as memory allows, and the JDK schedules them on a small, dedicated ForkJoinPool of carrier threads (separate from the common pool), making them perfect for running blocking operations!
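Before involving the library, we can check this behavior with the JDK alone. The following is a minimal sketch assuming JDK 21 or newer; the class and method names (VirtualThreadSketch, countVirtual) are hypothetical. It starts one Virtual Thread per blocking task and counts how many tasks actually ran on a Virtual Thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper for illustration only; requires JDK 21+.
class VirtualThreadSketch {

    // Submits `tasks` blocking jobs, one Virtual Thread per job, and returns
    // how many of them reported running on a Virtual Thread.
    static int countVirtual(int tasks, long delayMillis) {
        // ExecutorService is AutoCloseable since JDK 19; close() waits for submitted tasks.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<Boolean>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(executor.submit(() -> {
                    Thread.sleep(delayMillis); // blocking call parks only the Virtual Thread
                    return Thread.currentThread().isVirtual();
                }));
            }
            int virtual = 0;
            for (Future<Boolean> future : futures) {
                if (future.get()) {
                    virtual++;
                }
            }
            return virtual;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Tasks run on Virtual Threads: " + countVirtual(10_000, 50));
    }
}
```

No pool size appears anywhere in this sketch: the executor creates a fresh Virtual Thread for every submitted task, so there is nothing to tune.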
If we’re running Parallel Collectors 3.x, we can effortlessly leverage Virtual Threads:
@Test
public void processInParallelOnVirtualThreads() {
    int parallelProcesses = 5_000;
    var result = timed(() -> Stream.iterate(0, i -> i + 1).limit(parallelProcesses)
      .collect(ParallelCollectors.parallel(i -> fetchById(i), toList()))
      .join());
    log.info("{}", result);
}
As we can see, this is as easy as omitting the executor and parallelism parameters since Virtual Threads are the default execution utility.
If we try to run it, we can see that it actually completes faster than the original example:
Execution time: 1101 ms
[user-0, user-1, user-2, ...]
This is because we created 5000 Virtual Threads, which were scheduled on a small set of OS (carrier) threads instead of 5000 separate OS threads.
Let’s try to increase the parallelism to 20_000, which wasn’t possible with a classic Executor:
Execution time: 1219 ms
[user-0, user-1, user-2, ...]
Not only did this execute successfully, but it also completed faster than a job four times smaller on OS threads!
Let’s increase the parallelism to 100_000 and see what happens:
Execution time: 1587 ms
[user-0, user-1, user-2, ...]
This works just fine, although some overhead is now noticeable.
What if we increase the parallelism level to 1_000_000?
Execution time: 6416 ms
[user-0, user-1, user-2, ...]
2_000_000?
Execution time: 12906 ms
[user-0, user-1, user-2, ...]
5_000_000?
Execution time: 25952 ms
[user-0, user-1, user-2, ...]
As we can see, we can easily scale to high levels of parallelism that weren’t achievable with OS threads. This, alongside performance improvements on smaller parallel workloads, is the main benefit of leveraging Virtual Threads for parallel processing of blocking operations.
3.3. Virtual Threads and Older Versions of Parallel Collectors
The easiest way to leverage Virtual Threads is to upgrade to the newest possible version of the library, but if this isn’t possible, we can also achieve this with a 2.x.y version while running on JDK 21.
The trick is to manually provide Executors.newVirtualThreadPerTaskExecutor() as executor and Integer.MAX_VALUE as max parallelism level:
@Test
public void processInParallelOnVirtualThreadsParallelCollectors2() {
    int parallelProcesses = 100_000;
    var result = timed(() -> Stream.iterate(0, i -> i + 1).limit(parallelProcesses)
      .collect(ParallelCollectors.parallel(
        i -> fetchById(i), toList(),
        Executors.newVirtualThreadPerTaskExecutor(), Integer.MAX_VALUE))
      .join());
    log.info("{}", result);
}
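For intuition only, the idea of running a mapper for each element on a supplied executor and gathering the results in order can be approximated with plain CompletableFuture. This is a rough sketch and not how the library is implemented; it assumes JDK 21 or newer, and FanOutSketch and mapOnVirtualThreads are hypothetical names. Unlike the test above, it also closes the executor with try-with-resources.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper for illustration only; requires JDK 21+.
class FanOutSketch {

    // Applies `mapper` to the ids 0..count-1, each on its own Virtual Thread,
    // and returns the results in the original order.
    static <R> List<R> mapOnVirtualThreads(int count, Function<Integer, R> mapper) {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            // Start all tasks first so they run concurrently...
            List<CompletableFuture<R>> futures = IntStream.range(0, count)
              .boxed()
              .map(i -> CompletableFuture.supplyAsync(() -> mapper.apply(i), executor))
              .collect(Collectors.toList());
            // ...then wait for each result, preserving encounter order.
            return futures.stream()
              .map(CompletableFuture::join)
              .collect(Collectors.toList());
        }
    }

    // Simulated blocking lookup, shortened to 100 ms to keep the sketch quick.
    static String fetchById(int id) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "user-" + id;
    }

    public static void main(String[] args) {
        List<String> users = mapOnVirtualThreads(1_000, FanOutSketch::fetchById);
        System.out.println(users.size() + " results, first: " + users.get(0));
    }
}
```

Because every task gets its own Virtual Thread, there’s no meaningful upper bound to configure, which is why Integer.MAX_VALUE is a reasonable parallelism value in the 2.x workaround.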
4. Conclusion
In this article, we saw how to effortlessly leverage Virtual Threads with the Parallel Collectors library, an approach that turned out to scale much better than the classical OS-thread-based solution. Our test machine hit resource limits at around 16,000 OS threads, while scaling to millions of Virtual Threads was easily possible.
As always, code samples can be found over on GitHub.