1. Overview
We may wish to work with compressed files in Java. A common format is .gz, as generated by the GZIP utility.
Java’s standard library supports reading .gz files through the java.util.zip package. The format is commonly used for logs.
In this tutorial, we’ll explore reading compressed (.gz) files line by line in Java using the GZIPInputStream class.
2. Reading a GZipped File
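Let’s imagine we want to read the contents of a file into a List. First, we need to locate the file on our classpath. Note that getFile() returns the URL’s path without decoding it, so this shortcut assumes the path contains no spaces or other escaped characters: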
String filePath = Objects.requireNonNull(Main.class.getClassLoader().getResource("myFile.gz")).getFile();
Next, let’s get ready to read from this file into an empty list:
List<String> lines = new ArrayList<>();
try (FileInputStream fileInputStream = new FileInputStream(filePath);
     GZIPInputStream gzipInputStream = new GZIPInputStream(fileInputStream);
     // Name the charset explicitly instead of relying on the platform default
     InputStreamReader inputStreamReader = new InputStreamReader(gzipInputStream, StandardCharsets.UTF_8);
     BufferedReader bufferedReader = new BufferedReader(inputStreamReader)) {
    //...
}
Inside our try-with-resources block, we’ve defined a FileInputStream for reading the raw bytes of the GZIP file. Then, a GZIPInputStream decompresses those bytes, and an InputStreamReader converts the decompressed bytes into characters. Finally, a BufferedReader lets us read the text one line at a time.
Now, we can loop through the file to read line by line:
String line;
while ((line = bufferedReader.readLine()) != null) {
    lines.add(line);
}
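Putting the steps above together, a minimal self-contained sketch might look like the following. The class name GzipReadDemo, the writeGzip() helper, and the temporary file are assumptions made purely so the example can run without a pre-existing myFile.gz. The sketch also assumes the file contains UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipReadDemo {

    // Hypothetical helper: writes the given lines to a .gz file so the example has input.
    static void writeGzip(Path target, List<String> lines) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(target)), StandardCharsets.UTF_8)) {
            for (String line : lines) {
                writer.write(line);
                writer.write('\n');
            }
        }
    }

    // Reads every line of a GZIP-compressed text file into a list.
    static List<String> readLines(Path source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (InputStream fileInputStream = Files.newInputStream(source);
             GZIPInputStream gzipInputStream = new GZIPInputStream(fileInputStream);
             InputStreamReader inputStreamReader =
                 new InputStreamReader(gzipInputStream, StandardCharsets.UTF_8);
             BufferedReader bufferedReader = new BufferedReader(inputStreamReader)) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path tempFile = Files.createTempFile("sample", ".gz");
        try {
            writeGzip(tempFile, Arrays.asList("first", "second", "third"));
            System.out.println(readLines(tempFile));
        } finally {
            Files.deleteIfExists(tempFile);
        }
    }
}
```

This version holds every line in memory at once, so it suits files whose uncompressed content fits comfortably in the heap.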
3. Handling Large GZipped Files With Java Stream API
When confronted with large GZIP-compressed files, we may not have enough memory to hold every line at once. A streaming approach avoids this by letting us process each line as it’s read from the stream, without keeping the lines we don’t need.
3.1. Standalone Method
Let’s build a routine to collect lines from our file that match a specific substring:
try (InputStream inputStream = new FileInputStream(filePath);
     GZIPInputStream gzipInputStream = new GZIPInputStream(inputStream);
     // Name the charset explicitly instead of relying on the platform default
     InputStreamReader inputStreamReader = new InputStreamReader(gzipInputStream, StandardCharsets.UTF_8);
     BufferedReader bufferedReader = new BufferedReader(inputStreamReader)) {
    return bufferedReader.lines()
        .filter(line -> line.contains(toFind))
        .collect(Collectors.toList());
}
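This approach uses the lines() method to create a lazy stream of lines from the file. Then, the filter() operation selects the lines of interest, and collect() gathers them into a list.
The use of try-with-resources ensures the various file and input streams are correctly closed when everything is done. Because the stream is lazy, the terminal operation must run inside the try block, as collect() does here. Once the block exits, the underlying reader is closed and the stream can no longer be read.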
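To show the filtering routine in a runnable form, here’s a sketch that wraps it in a method. The names GzipFilterDemo, findInGzip(), and writeGzip() are hypothetical, and the sample data is written to a temporary file only so the example has something to search. UTF-8 content is assumed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipFilterDemo {

    // Hypothetical helper: creates a small .gz file to search.
    static void writeGzip(Path target, List<String> lines) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(target)), StandardCharsets.UTF_8)) {
            for (String line : lines) {
                writer.write(line);
                writer.write('\n');
            }
        }
    }

    // Returns only the lines containing toFind; non-matching lines are never stored.
    static List<String> findInGzip(Path source, String toFind) throws IOException {
        try (InputStream inputStream = Files.newInputStream(source);
             GZIPInputStream gzipInputStream = new GZIPInputStream(inputStream);
             InputStreamReader inputStreamReader =
                 new InputStreamReader(gzipInputStream, StandardCharsets.UTF_8);
             BufferedReader bufferedReader = new BufferedReader(inputStreamReader)) {
            // collect() is a terminal operation, so the file is fully read
            // before the try block closes the reader.
            return bufferedReader.lines()
                .filter(line -> line.contains(toFind))
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path tempFile = Files.createTempFile("sample", ".gz");
        try {
            writeGzip(tempFile, Arrays.asList("INFO started", "ERROR disk full", "INFO stopped"));
            System.out.println(findInGzip(tempFile, "ERROR"));
        } finally {
            Files.deleteIfExists(tempFile);
        }
    }
}
```

Only the matching lines end up in the returned list, so memory use depends on the number of matches rather than the size of the file.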
3.2. Using Consumer<Stream<String>>
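In the previous example, the surrounding try-with-resources looks after our .gz stream resources, which works because the stream is fully consumed before the block ends. We can’t simply return the Stream<String> to the caller, since the reader behind it would already be closed. Instead, to generalize the method, we can let the caller supply the operation to run on the Stream<String> while the file is still open: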
try (InputStream inputStream = new FileInputStream(filePath);
     GZIPInputStream gzipInputStream = new GZIPInputStream(inputStream);
     // Name the charset explicitly instead of relying on the platform default
     InputStreamReader inputStreamReader = new InputStreamReader(gzipInputStream, StandardCharsets.UTF_8);
     BufferedReader bufferedReader = new BufferedReader(inputStreamReader)) {
    consumer.accept(bufferedReader.lines());
}
This approach allows the caller to pass in a Consumer<Stream<String>> to operate on the stream of uncompressed lines. The code calls accept() on that Consumer to hand it the Stream while the resources are still open. This allows us to pass in anything we like to operate on the lines:
AtomicInteger count = new AtomicInteger();
useContentsOfZipFile(testFilePath, linesStream -> {
    linesStream.filter(line -> line.length() > 10).forEach(line -> count.incrementAndGet());
});
In this example, we’re providing a consumer that counts all of the lines longer than 10 characters.
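For completeness, here’s a runnable sketch of the consumer-based approach. The signature of useContentsOfZipFile() is inferred from the call above, not taken from the original source. The GzipConsumerDemo class, the writeGzip() helper, and the temporary file are assumptions added so the example can run on its own.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipConsumerDemo {

    // Hypothetical helper: creates a small .gz file for the demo.
    static void writeGzip(Path target, List<String> lines) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(target)), StandardCharsets.UTF_8)) {
            for (String line : lines) {
                writer.write(line);
                writer.write('\n');
            }
        }
    }

    // Assumed signature: opens the file, hands the line stream to the consumer,
    // and closes every resource once the consumer returns.
    static void useContentsOfZipFile(String filePath, Consumer<Stream<String>> consumer)
            throws IOException {
        try (InputStream inputStream = new FileInputStream(filePath);
             GZIPInputStream gzipInputStream = new GZIPInputStream(inputStream);
             InputStreamReader inputStreamReader =
                 new InputStreamReader(gzipInputStream, StandardCharsets.UTF_8);
             BufferedReader bufferedReader = new BufferedReader(inputStreamReader)) {
            consumer.accept(bufferedReader.lines());
        }
    }

    public static void main(String[] args) throws IOException {
        Path tempFile = Files.createTempFile("sample", ".gz");
        try {
            writeGzip(tempFile, Arrays.asList("short", "a considerably longer line", "tiny"));
            AtomicInteger count = new AtomicInteger();
            useContentsOfZipFile(tempFile.toString(), linesStream ->
                linesStream.filter(line -> line.length() > 10)
                           .forEach(line -> count.incrementAndGet()));
            System.out.println("Lines longer than 10 characters: " + count.get());
        } finally {
            Files.deleteIfExists(tempFile);
        }
    }
}
```

The consumer must finish its work before returning. If it stores the Stream for later use, the reader behind it will already be closed by the time it’s read.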
4. Conclusion
In this short article, we’ve looked at how to read .gz files in Java.
First, we looked at how to read the files into a list using BufferedReader and readLine(). Then, we looked at ways to treat the file as a Stream<String> to process the lines without having to load them all in memory at once.