Channel: Baeldung

Merge Multiple PDF Files Into a Single PDF Using Java

1. Introduction

In modern business and document management workflows, merging multiple PDF files into a single PDF document is a common requirement. Typical use cases include assembling presentations, consolidating reports, and compiling several documents into a single package.

In Java, multiple libraries exist that provide out-of-the-box features to handle PDFs and merge them into a single PDF. Apache PDFBox and iText are among the popular ones.

In this tutorial, we’ll implement the PDF merge functionality using Apache PDFBox and iText.

2. Setup

Before diving into the implementation, let’s go through the necessary setup steps. We’ll add the required dependencies to the project and create helper methods for our tests.

2.1. Dependencies

We’ll use Apache PDFBox and iText for merging the PDF files. To use Apache PDFBox, we need to add the following dependency to the pom.xml file:

<dependency> 
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId> 
    <version>2.0.31</version> 
</dependency>

To use iText, we need to add the following dependency to the pom.xml file:

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.13.3</version>
</dependency>

2.2. Test Setup

Let’s create sample PDF files that we’ll use to test our logic. We can write a utility method that creates a PDF so that we can reuse it across different tests:

static void createPDFDoc(String content, String filePath) throws IOException {
    // try-with-resources closes the document even if writing a page fails
    try (PDDocument document = new PDDocument()) {
        for (int i = 0; i < 3; i++) {
            PDPage page = new PDPage();
            document.addPage(page);
            try (PDPageContentStream contentStream = new PDPageContentStream(document, page)) {
                contentStream.beginText();
                contentStream.setFont(PDType1Font.HELVETICA_BOLD, 14);
                contentStream.showText(content + ", page:" + i);
                contentStream.endText();
            }
        }
        document.save("src/test/resources/temp/" + filePath);
    }
}

In the above logic, we create a PDF document with three pages and write one line of text to each page in 14-point Helvetica Bold, one of the standard PDF fonts. Now that we have the createPDFDoc() method, let’s call it before each test to create two input files, and delete the temp directory after each test:

@BeforeEach
public void create() throws IOException {
    File tempDirectory = new File("src/test/resources/temp");
    tempDirectory.mkdirs();
    List.of(List.of("hello_world1", "file1.pdf"), List.of("hello_world2", "file2.pdf"))
        .forEach(pair -> {
            try {
                createPDFDoc(pair.get(0), pair.get(1));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
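}

@AfterEach
public void destroy() throws IOException {
    // Close the stream returned by Files.walk() so directory handles are released
    try (Stream<Path> paths = Files.walk(Paths.get("src/test/resources/temp/"))) {
        // Reverse order deletes files before the directories that contain them
        paths.sorted(Comparator.reverseOrder())
            .forEach(path -> {
                try {
                    Files.delete(path);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
    }
}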
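
The cleanup in destroy() relies on ordering: a path sorts after its parent directory, so walking the tree in reverse order deletes each file before the directory that contains it. The following is a minimal standalone sketch of that idea using only the JDK. The class name TempDirCleanup and the method deleteRecursively() are hypothetical names chosen for illustration and aren't part of the original example:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper, not part of the article's source code
final class TempDirCleanup {

    // Deletes root and everything beneath it; does nothing if root doesn't exist
    static void deleteRecursively(Path root) throws IOException {
        if (Files.notExists(root)) {
            return;
        }
        // Files.walk() keeps directory handles open, so close the stream when done
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order puts children before their parent directories
            paths.sorted(Comparator.reverseOrder())
                .forEach(path -> {
                    try {
                        Files.delete(path);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("pdf-merge-test");
        Files.createDirectories(root.resolve("nested"));
        Files.writeString(root.resolve("nested").resolve("file1.pdf"), "placeholder");
        deleteRecursively(root);
        System.out.println(Files.exists(root)); // false
    }
}
```

Guarding against a missing directory makes the cleanup safe to call even when a test fails before creating any files.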

3. Using Apache PDFBox

Apache PDFBox is an open-source Java library for working with PDF documents. It provides a range of functionalities to create, manipulate, and extract content from PDF files programmatically.

PDFBox provides a PDFMergerUtility helper class to merge multiple PDF documents. We can use the addSource() method to add PDF files. The mergeDocuments() method merges all the added sources, which results in the final merged PDF document:

void mergeUsingPDFBox(List<String> pdfFiles, String outputFile) throws IOException {
    PDFMergerUtility pdfMergerUtility = new PDFMergerUtility();
    pdfMergerUtility.setDestinationFileName(outputFile);
    // addSource() throws FileNotFoundException, which is an IOException,
    // so a plain loop lets it propagate without wrapping
    for (String file : pdfFiles) {
        pdfMergerUtility.addSource(new File(file));
    }
    pdfMergerUtility.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
}

As demonstrated above, the mergeDocuments() method takes an argument that configures memory usage during the merge. Here, we buffer exclusively in main memory, i.e., RAM. MemoryUsageSetting offers other options as well, such as setupTempFileOnly() for buffering in temporary files on disk and setupMixed() for a combination of RAM and disk.

We can write a unit test to verify if the merge logic works as expected:

@Test
void givenMultiplePdfs_whenMergeUsingPDFBoxExecuted_thenPdfsMerged() throws IOException {
    List<String> files = List.of("src/test/resources/temp/file1.pdf", "src/test/resources/temp/file2.pdf");
    PDFMerge pdfMerge = new PDFMerge();
    pdfMerge.mergeUsingPDFBox(files, "src/test/resources/temp/output.pdf");
    try (PDDocument document = PDDocument.load(new File("src/test/resources/temp/output.pdf"))) {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        String actual = pdfStripper.getText(document);
        String expected = """
            hello_world1, page:0
            hello_world1, page:1
            hello_world1, page:2
            hello_world2, page:0
            hello_world2, page:1
            hello_world2, page:2
            """;
        assertEquals(expected, actual);
    }
}

In the test above, we merged two PDF files using PDFBox into an output file and validated the merged content.
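
The merge logic assumes that every path in the list points to a real PDF file. As a hedged sketch of a pre-check we could run before merging, the helper below verifies that a file starts with the %PDF- header bytes. The class PdfHeaderCheck and its method looksLikePdf() are hypothetical names, and the check is deliberately strict: it rejects files with leading bytes before the header, which some lenient readers may still accept:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox or iText
final class PdfHeaderCheck {

    // Every PDF file is expected to begin with these bytes
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true only for regular files whose first bytes are "%PDF-"
    static boolean looksLikePdf(Path file) throws IOException {
        if (!Files.isRegularFile(file)) {
            return false;
        }
        try (InputStream in = Files.newInputStream(file)) {
            // Reads at most PDF_MAGIC.length bytes; shorter files yield a shorter array
            byte[] head = in.readNBytes(PDF_MAGIC.length);
            return Arrays.equals(head, PDF_MAGIC);
        }
    }
}
```

Calling looksLikePdf(Paths.get(file)) for each input before the merge lets us fail fast with a clear message instead of an exception from deep inside the library.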

4. Using iText

iText is another popular Java library for creating and manipulating PDF documents. It provides a wide range of features for generating PDF files that include text, images, tables, and other elements such as hyperlinks and form fields.

iText provides the PdfReader and PdfWriter classes that are very handy in reading input files and writing them to the output files:

void mergeUsingIText(List<String> pdfFiles, String outputFile) throws IOException, DocumentException {
    Document document = new Document();
    // try-with-resources closes the output stream even if the merge fails
    try (FileOutputStream fos = new FileOutputStream(outputFile)) {
        PdfWriter writer = PdfWriter.getInstance(document, fos);
        document.open();
        PdfContentByte directContent = writer.getDirectContent();
        // Create one reader per input file instead of assuming exactly two files
        for (String pdfFile : pdfFiles) {
            PdfReader pdfReader = new PdfReader(pdfFile);
            // iText page numbers start at 1
            for (int page = 1; page <= pdfReader.getNumberOfPages(); page++) {
                document.newPage();
                PdfImportedPage pdfImportedPage = writer.getImportedPage(pdfReader, page);
                directContent.addTemplate(pdfImportedPage, 0, 0);
            }
        }
        // Closing the document writes the remaining content to the stream
        document.close();
    }
}

In the above logic, we create a PdfReader for each input file, import each of its pages into the PdfWriter using the getImportedPage() method, and add the imported page to the writer’s direct content at the origin of a new page. Once all pages are added, closing the document writes the merged content to the output file, and the try-with-resources block closes the output stream fos.
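
Both merge methods write straight to the final output path, so a failure midway can leave a truncated file behind. One possible way to avoid this, sketched below with JDK classes only, is to write to a temporary file in the same directory and move it into place once the merge succeeds. SafeMergeOutput, MergeAction, and mergeToTempThenMove() are hypothetical names introduced for this sketch:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of PDFBox or iText
final class SafeMergeOutput {

    // Hypothetical callback: anything that writes the merged PDF to the given path
    @FunctionalInterface
    interface MergeAction {
        void writeTo(Path target) throws Exception;
    }

    // Runs the merge against a temp file and replaces finalOutput only on success
    static void mergeToTempThenMove(Path finalOutput, MergeAction action) throws Exception {
        Path directory = finalOutput.toAbsolutePath().getParent();
        Files.createDirectories(directory);
        // Same directory as the target, so the move doesn't cross file systems
        Path temp = Files.createTempFile(directory, "merge-", ".pdf.tmp");
        try {
            action.writeTo(temp);
            Files.move(temp, finalOutput, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; removes the partial file after a failure
            Files.deleteIfExists(temp);
        }
    }
}
```

Assuming the PDFMerge class from the tests, a call could look like mergeToTempThenMove(Paths.get(outputFile), temp -> pdfMerge.mergeUsingIText(files, temp.toString())). The callback declares Exception because iText’s DocumentException isn’t an IOException.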

We can verify our logic by writing a unit test:

@Test
void givenMultiplePdfs_whenMergeUsingITextExecuted_thenPdfsMerged() throws IOException, DocumentException {
    List<String> files = List.of("src/test/resources/temp/file1.pdf", "src/test/resources/temp/file2.pdf");
    PDFMerge pdfMerge = new PDFMerge();
    pdfMerge.mergeUsingIText(files, "src/test/resources/temp/output1.pdf");
    try (PDDocument document = PDDocument.load(new File("src/test/resources/temp/output1.pdf"))) {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        String actual = pdfStripper.getText(document);
        String expected = """
            hello_world1, page:0
            hello_world1, page:1
            hello_world1, page:2
            hello_world2, page:0
            hello_world2, page:1
            hello_world2, page:2
            """;
        assertEquals(expected, actual);
    }
}

Our test is almost the same as in the previous section. The only difference is that we’re calling the mergeUsingIText() method, which uses iText for merging the PDF files.
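
Both tests pass the input files explicitly, and the order of that list determines the page order of the merged document. When the inputs come from a directory instead, a sketch like the following could collect them in a predictable order. PdfInputCollector and listPdfFiles() are hypothetical names, and sorting by path is only one possible ordering:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of PDFBox or iText
final class PdfInputCollector {

    // Returns the .pdf files directly inside the directory, sorted by path
    static List<String> listPdfFiles(Path directory) throws IOException {
        try (Stream<Path> entries = Files.list(directory)) {
            return entries
                .filter(Files::isRegularFile)
                .filter(path -> path.getFileName().toString()
                    .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                .sorted() // directory listing order isn't guaranteed, so sort explicitly
                .map(Path::toString)
                .collect(Collectors.toList());
        }
    }
}
```

The resulting list can be passed to either merge method. If the output file lives in the same directory as the inputs, as it does in our tests, it should be written elsewhere or filtered out so that a later run doesn’t merge it again.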

5. Conclusion

In this article, we explored how we can merge PDF files using Apache PDFBox and iText. Both libraries are feature-rich and allow us to handle different types of content inside PDF files. We implemented merge functionality and also wrote tests to verify the results.

As usual, the complete source code for the examples is available over on GitHub.

       
