Channel: Baeldung

Intro to the Apache Commons Compress Project


1. Introduction

In this tutorial, we’ll learn how to use Apache Commons Compress to compress, archive, and extract files. We’ll also learn about its supported formats and some of its limitations.

2. What Is Apache Commons Compress

Apache Commons Compress is a library that provides a standard interface for the most widely used compression and archiving formats. Its support ranges from the ubiquitous TAR, ZIP, and GZIP to lesser-known but still commonly used formats, like BZIP2, XZ, LZMA, and Snappy.

2.1. Difference Between Compressors and Archivers

An archiver (such as TAR) bundles a directory structure into a single file, while a compressor takes a stream of bytes and makes them smaller, saving space. Some formats (like ZIP) can act as an archiver and a compressor but are considered archivers by the library.
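
To make the distinction concrete, here's a minimal sketch that uses only the JDK's java.util.zip classes rather than Commons Compress, so it runs without extra dependencies. The class and method names are ours, chosen for illustration. The compressor turns one stream of bytes into a smaller one, while the archiver stores several named entries:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ArchiverVsCompressor {

    // Compressor: bytes in, (usually fewer) bytes out; it has no notion of file names
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(data);
        }
        return bytes.toByteArray();
    }

    // Archiver: bundles several named entries into a single stream
    static byte[] zip(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue());
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    // Lists the entry names stored in a ZIP archive
    static List<String> entryNames(byte[] archive) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```

Only the archive keeps track of names and directory structure, which is why formats like TAR are typically paired with a separate compressor.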

We can check the supported archive formats by looking at some of the static fields of the ArchiveStreamFactory class provided by Commons Compress. Conversely, we can look at CompressorStreamFactory for supported compressor formats.

2.2. Commons Compress and Additional Dependencies

Let’s start by adding commons-compress to our project:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>1.26.1</version>
</dependency>

Out of the box, Commons Compress works with TAR, ZIP, BZIP2, CPIO, and GZIP. But, for other formats, we need additional dependencies. Let’s add XZ, 7z, and LZMA support:

<dependency>
    <groupId>org.tukaani</groupId>
    <artifactId>xz</artifactId>
    <version>1.9</version>
</dependency>

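Finally, for Zstandard (ZSTD) support (LZ4 is implemented in pure Java inside Commons Compress and needs no extra dependency):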

<dependency>
    <groupId>com.github.luben</groupId>
    <artifactId>zstd-jni</artifactId>
    <version>1.5.5-11</version>
</dependency>

With these, we’ll avoid errors when reading or writing files of these types.
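
If we're unsure whether the optional libraries actually made it onto the classpath, a quick reflection check can tell us before we hit a runtime error. This is a minimal sketch: the helper name is hypothetical, and the usage note below assumes the class names org.tukaani.xz.XZInputStream and com.github.luben.zstd.ZstdInputStream, which come from the two dependencies above:

```java
class OptionalDependencies {

    // Returns true if the given class can be found, without initializing it
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, OptionalDependencies.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }
}
```

For example, isOnClasspath("org.tukaani.xz.XZInputStream") should return true once the xz dependency is resolved, and isOnClasspath("com.github.luben.zstd.ZstdInputStream") does the same for zstd-jni.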

3. Compressing and Decompressing Streams

While the library provides an abstraction for the operations these formats have in common, each format also has unique features. We access those through specific implementations, like GzipCompressorInputStream and LZMACompressorInputStream. Here, though, we’ll focus on CompressorStreamFactory, which gives us an implementation without referencing the specific class and so helps us write format-agnostic code.

3.1. Compressing a File

We must pass the desired compressing format to the factory method when compressing a file. Commons Compress contains a FileNameUtils class that we’ll use to get our file extension and pass it as the format. Then, we open an output stream, get a compressor instance, and write the bytes from our Path to it:

public class CompressUtils {
    public static void compressFile(Path file, Path destination)
      throws IOException, CompressorException {
        // e.g. "gz" for "simple.txt.gz"; must match a CompressorStreamFactory format name
        String format = FileNameUtils.getExtension(destination);
        // the input stream is opened here too, so it gets closed along with the others
        try (InputStream in = Files.newInputStream(file);
          OutputStream out = Files.newOutputStream(destination);
          BufferedOutputStream buffer = new BufferedOutputStream(out);
          CompressorOutputStream compressor = new CompressorStreamFactory()
            .createCompressorOutputStream(format, buffer)) {
            IOUtils.copy(in, compressor);
        }
    }
    // ...
}

Let’s test it with a simple text file:

@Test
void givenFile_whenCompressing_thenCompressed() throws Exception {
    Path destination = Paths.get("/tmp/simple.txt.gz");
    CompressUtils.compressFile(Paths.get("/tmp/simple.txt"), destination);
    assertTrue(Files.isRegularFile(destination));
}

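Note that we’re using GZIP here, which is denoted by the “gz” extension. We can use another supported format by changing the extension of the destination, as long as the extension matches the format name the factory expects (for example, the factory’s name for BZIP2 is “bzip2”, not “bz2”). Also, we can use any file type as input.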
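
Since not every common file extension matches the factory's format name, a small lookup can bridge the gap. The helper below is a hypothetical sketch (it isn't part of Commons Compress) that maps a few common extensions to the names we believe CompressorStreamFactory expects; it's worth checking them against the factory's constants for the version in use:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class CompressorFormats {

    private static final Map<String, String> BY_EXTENSION = new HashMap<>();

    static {
        BY_EXTENSION.put("gz", "gz");          // CompressorStreamFactory.GZIP
        BY_EXTENSION.put("bz2", "bzip2");      // CompressorStreamFactory.BZIP2
        BY_EXTENSION.put("xz", "xz");          // CompressorStreamFactory.XZ
        BY_EXTENSION.put("lzma", "lzma");      // CompressorStreamFactory.LZMA
        BY_EXTENSION.put("zst", "zstd");       // CompressorStreamFactory.ZSTANDARD
        BY_EXTENSION.put("lz4", "lz4-framed"); // CompressorStreamFactory.LZ4_FRAMED
    }

    // Falls back to the extension itself when we have no mapping for it
    static String forExtension(String extension) {
        String key = extension.toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(key, key);
    }
}
```

With this in place, compressFile() could pass CompressorFormats.forExtension(format) to the factory instead of the raw extension.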

3.2. Decompressing a Compressed File

Let’s decompress a file compressed with any of the supported formats. First, we open a buffered input stream for the file and create a compressor input stream, which detects the compression format by reading the first bytes of the file. The input is buffered because this detection needs a stream that supports mark and reset. Then, we write the decompressed bytes to an output stream, which gives us the decompressed file or archive:

public static void decompress(Path file, Path destination)
  throws IOException, CompressorException {
    try (InputStream in = Files.newInputStream(file);
      BufferedInputStream inputBuffer = new BufferedInputStream(in);
      OutputStream out = Files.newOutputStream(destination);
      CompressorInputStream decompressor = new CompressorStreamFactory()
        .createCompressorInputStream(inputBuffer)) {
        IOUtils.copy(decompressor, out);
    }
}

Let’s test it with a “tar.gz” file, which indicates it’s a TAR archive compressed with GZIP:

@Test
void givenCompressedArchive_whenDecompressing_thenArchiveAvailable() throws Exception {
    Path destination = Paths.get("/tmp/decompressed-archive.tar");
    // decompress() takes a Path, not a String
    CompressUtils.decompress(Paths.get("/tmp/archive.tar.gz"), destination);
    assertTrue(Files.isRegularFile(destination));
}

Note that any combination of supported archivers and compressors would work here without changing any code. For instance, we could use an “archive.cpio.xz” file as input instead. We could even decompress a GZIP’ed ZIP file. Most importantly, this method isn’t exclusive to archive files. Any compressed file can be decompressed with it.
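
To give an idea of how detection from the first bytes can work, here's a simplified sketch that recognizes three well-known signatures (GZIP, BZIP2, and XZ). It's our own illustration with hypothetical names, not the library's actual implementation, and it also shows why a stream that supports mark and reset is needed: we peek at the header and then rewind so nothing is lost:

```java
import java.io.IOException;
import java.io.InputStream;

class SignatureSniffer {

    // Peeks at the first bytes, then rewinds so the caller can still read the whole stream
    static String detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("Stream must support mark/reset");
        }
        byte[] head = new byte[6];
        in.mark(head.length);
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) {
                break; // stream is shorter than the longest signature
            }
            read += n;
        }
        in.reset();

        if (read >= 2 && (head[0] & 0xFF) == 0x1F && (head[1] & 0xFF) == 0x8B) {
            return "gz";
        }
        if (read >= 3 && head[0] == 'B' && head[1] == 'Z' && head[2] == 'h') {
            return "bzip2";
        }
        if (read >= 6 && (head[0] & 0xFF) == 0xFD && head[1] == '7' && head[2] == 'z'
          && head[3] == 'X' && head[4] == 'Z' && head[5] == 0) {
            return "xz";
        }
        return "unknown";
    }
}
```

Wrapping a file stream in a BufferedInputStream, as decompress() does, is what provides the mark/reset support this kind of peeking relies on.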

4. Creating and Manipulating Archives

To create archives, we need to specify the format we want. To simplify things, the Archiver class has a convenient method that archives a whole directory to a destination file:

public static void archive(Path directory, Path destination)
  throws IOException, ArchiveException {
    String format = FileNameUtils.getExtension(destination);
    new Archiver().create(format, destination, directory);
}

4.1. Combining an Archiver With a Compressor

We can also combine archivers and compressors to create a compressed archive in a single operation. To simplify this, we’ll consider the last extension as the compressor format and the extension preceding it as the archiver format. Then, we open a buffered output stream for the resulting compressed archive, create a compressor based on our compression format, and instantiate an ArchiveOutputStream that writes into our compressor:

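public static void archiveAndCompress(Path directory, Path destination)
  throws IOException, CompressorException, ArchiveException {
    // e.g. "gz" for "archive.tar.gz"
    String compressionFormat = FileNameUtils.getExtension(destination);
    // e.g. "tar": the extension that remains after dropping ".gz"
    String archiveFormat = FileNameUtils.getExtension(
      destination.getFileName().toString().replace("." + compressionFormat, ""));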
    try (OutputStream archive = Files.newOutputStream(destination);
      BufferedOutputStream archiveBuffer = new BufferedOutputStream(archive);
      CompressorOutputStream compressor = new CompressorStreamFactory()
        .createCompressorOutputStream(compressionFormat, archiveBuffer);
      ArchiveOutputStream<?> archiver = new ArchiveStreamFactory()
        .createArchiveOutputStream(archiveFormat, compressor)) {
        new Archiver().create(archiver, directory);
    }
}

In the end, we still use the Archiver, but now using a version of create() that receives an ArchiveOutputStream.
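
The way we derive the two formats from the file name can also be written without string replacement. The following is an alternative sketch with hypothetical names: it only looks at the last two dot-separated parts of the name and fails fast when either one is missing:

```java
class CompoundExtension {

    // Returns {archiveFormat, compressionFormat}, e.g. {"tar", "gz"} for "backup.tar.gz"
    static String[] split(String fileName) {
        int lastDot = fileName.lastIndexOf('.');
        if (lastDot <= 0) {
            throw new IllegalArgumentException("No compression extension in: " + fileName);
        }
        String withoutCompression = fileName.substring(0, lastDot);
        int previousDot = withoutCompression.lastIndexOf('.');
        if (previousDot <= 0) {
            throw new IllegalArgumentException("No archive extension in: " + fileName);
        }
        return new String[] {
            withoutCompression.substring(previousDot + 1),
            fileName.substring(lastDot + 1)
        };
    }
}
```

Calling CompoundExtension.split(destination.getFileName().toString()) would then give us both values to pass to the two factories.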

4.2. Unarchiving an Archive

With the Expander class, we can unarchive our uncompressed archive in a single line:

public static void extract(Path archive, Path destination)
  throws IOException, ArchiveException {
    new Expander().expand(archive, destination);
}

We pass the archive file and the directory where we want our files extracted to. This utility method takes care of opening (and closing) an input stream, detecting the archive type, iterating over all entries in the archive, and copying them to the directory we chose.

4.3. Extracting an Entry From an Existing Archive

Let’s write a method that extracts a single entry from an archive instead of the whole content:

public static void extractOne(Path archivePath, String fileName, Path destinationDirectory)
  throws IOException, ArchiveException {
    try (InputStream input = Files.newInputStream(archivePath); 
      BufferedInputStream buffer = new BufferedInputStream(input); 
      ArchiveInputStream<?> archive = new ArchiveStreamFactory()
        .createArchiveInputStream(buffer)) {
        ArchiveEntry entry;
        while ((entry = archive.getNextEntry()) != null) {
            if (entry.getName().equals(fileName)) {
                Path outFile = destinationDirectory.resolve(fileName);
                Files.createDirectories(outFile.getParent());
                try (OutputStream os = Files.newOutputStream(outFile)) {
                    IOUtils.copy(archive, os);
                }
                break;
            }
        }
    }
}

After opening an ArchiveInputStream, we keep calling getNextEntry() until we find an entry whose name matches. We create any missing parent directories and then write the entry’s contents to our destination directory. Because ArchiveStreamFactory detects archive formats only, the input has to be an uncompressed archive. Note that the file name can include a sub-directory inside the archive. Considering our archive contains a file named “some.txt” under “sub-directory”:

@Test
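void givenExistingArchive_whenExtractingSingleEntry_thenFileExtracted() throws Exception {
    // extractOne() detects archive formats only, so the input must be uncompressed
    Path archive = Paths.get("/tmp/archive.tar");
    String targetFile = "sub-directory/some.txt";
    CompressUtils.extractOne(archive, targetFile, Paths.get("/tmp/"));
    // Files.isRegularFile() takes a Path, not a String
    assertTrue(Files.isRegularFile(Paths.get("/tmp/sub-directory/some.txt")));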
}
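
Since the entry name ends up in a call to resolve(), it's worth guarding against names such as “../../etc/passwd” when the archive or the requested name comes from an untrusted source. The following is a minimal sketch with a hypothetical helper name; it uses only java.nio and could replace the plain destinationDirectory.resolve(fileName) call:

```java
import java.nio.file.Path;

class SafeEntryPaths {

    // Resolves an entry name against the target directory, rejecting names that escape it
    static Path resolveInside(Path targetDirectory, String entryName) {
        Path base = targetDirectory.toAbsolutePath().normalize();
        Path resolved = base.resolve(entryName).normalize();
        if (!resolved.startsWith(base)) {
            throw new IllegalArgumentException(
              "Entry is outside of the target directory: " + entryName);
        }
        return resolved;
    }
}
```

Normalizing before the startsWith() check is what catches “..” segments, and absolute entry names are rejected as well because resolve() returns them unchanged.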

4.4. Adding an Entry to an Existing Archive

Unfortunately, the library doesn’t give us an easy way to include a new entry into an existing archive. If we open the archive and call putArchiveEntry(), we’ll overwrite its contents. So, it’d also be necessary to rewrite all the existing entries before inserting a new one. Instead of creating a new method with the logic for this, we’ll reuse the methods we’ve created. We’ll extract the archive, copy the new file to the directory structure, archive the directory again, and then delete the old archive:

@Test
void givenExistingArchive_whenAddingSingleEntry_thenArchiveModified() throws Exception {
    Path archive = Paths.get("/tmp/archive.tar");
    Path newArchive = Paths.get("/tmp/modified-archive.tar");
    Path tmpDir = Paths.get("/tmp/extracted-archive");
    Path newEntry = Paths.get("/tmp/new-entry.txt");
    CompressUtils.extract(archive, tmpDir);
    assertTrue(Files.isDirectory(tmpDir));
    Files.copy(newEntry, tmpDir.resolve(newEntry.getFileName()));
    CompressUtils.archive(tmpDir, newArchive);
    assertTrue(Files.isRegularFile(newArchive));
    FileUtils.deleteDirectory(tmpDir.toFile());
    Files.delete(archive);
    Files.move(newArchive, archive);
    assertTrue(Files.isRegularFile(archive));
}

This deletes the original archive, so it’s advisable to keep a backup until we’ve verified the new one.
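
One way to keep the original around is to rename it instead of deleting it. Here's a minimal sketch using only java.nio; the class name, method name, and the “.bak” suffix are our own choices for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class ArchiveReplacer {

    // Keeps the old archive as "<name>.bak" and then moves the new one into place
    static Path replaceWithBackup(Path archive, Path newArchive) throws IOException {
        Path backup = archive.resolveSibling(archive.getFileName() + ".bak");
        // an older backup with the same name is overwritten
        Files.move(archive, backup, StandardCopyOption.REPLACE_EXISTING);
        Files.move(newArchive, archive);
        return backup;
    }
}
```

In the test above, this would take the place of the Files.delete() and Files.move() calls, leaving “/tmp/archive.tar.bak” behind for us to remove once we're satisfied with the result.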

4.5. Using a Concrete Implementation Directly for Exclusive Features

We can use the specific implementation class directly if we want exclusive features from each format. For example, instead of using ArchiveOutputStream, we’ll instantiate a ZipArchiveOutputStream so we can set its compression method and level directly:

public static void zip(Path file, Path destination) throws IOException {
    try (InputStream input = Files.newInputStream(file);
      OutputStream output = Files.newOutputStream(destination);
      ZipArchiveOutputStream archive = new ZipArchiveOutputStream(output)) {
        archive.setMethod(ZipEntry.DEFLATED);
        archive.setLevel(Deflater.BEST_COMPRESSION);
        archive.putArchiveEntry(new ZipArchiveEntry(file.getFileName().toString()));
        IOUtils.copy(input, archive);
        archive.closeArchiveEntry();
    }
}

It requires more code than just using the Archiver but gives us more control.

5. Limitations

While Apache Commons Compress offers a versatile toolkit for file compression and archiving, it’s essential to acknowledge certain limitations and considerations. Firstly, while the library provides extensive support for various compression and archive formats, handling multi-volume archives may pose challenges that need careful consideration. Additionally, encoding issues may arise, mainly when dealing with diverse file systems or non-standardized data.

Moreover, although the library provides comprehensive functionality, Apache suggests leveraging ZipFile for enhanced control in specific scenarios. Finally, the TAR format also has a dedicated page with considerations.

6. Conclusion

In this article, we saw how Apache Commons Compress is a valuable resource for efficient file compression and archiving solutions. By understanding its capabilities, limitations, and best practices, we can leverage this library effectively to streamline file management processes in a format-independent way.

As always, the source code is available over on GitHub.

       
