1. Overview
In this tutorial, we'll review different approaches to determine if the contents of two files are equal. We'll be using core Java Stream I/O libraries to read the contents of the files and implement basic comparisons.
To finish, we'll review the support provided in Apache Commons I/O to check for content equality of two files.
2. Byte by Byte Comparison
Let's start with a simple approach to reading the bytes from the two files to compare them sequentially.
To speed up reading the files, we'll use BufferedInputStream. It reads large chunks of bytes from the underlying InputStream into an internal buffer and serves our single-byte reads from that buffer. When the client has consumed all the bytes in the chunk, the buffer is refilled with another block of bytes from the stream.
Reading from this in-memory buffer is typically much faster than going to the underlying stream for every single byte.
Let's write a method that uses BufferedInputStreams to compare two files:
public static long filesCompareByByte(Path path1, Path path2) throws IOException {
    try (BufferedInputStream fis1 = new BufferedInputStream(new FileInputStream(path1.toFile()));
         BufferedInputStream fis2 = new BufferedInputStream(new FileInputStream(path2.toFile()))) {
        int ch = 0;
        long pos = 1; // 1-based position of the byte being compared
        while ((ch = fis1.read()) != -1) {
            if (ch != fis2.read()) {
                return pos;
            }
            pos++;
        }
        // The first file is exhausted; the files match only if the second one is too
        if (fis2.read() == -1) {
            return -1;
        } else {
            return pos;
        }
    }
}
We use the try-with-resources statement to ensure that the two BufferedInputStreams are closed at the end of the statement.
With the while loop, we read each byte of the first file and compare it with the corresponding byte of the second file. If we find a discrepancy, we return the position of the mismatching byte, counting from 1. If both streams end without a discrepancy, the files are identical and the method returns -1L.
If the files are of different sizes but the bytes of the smaller file match the corresponding bytes of the larger file, the method returns the position of the first byte that exists only in the larger file, that is, the size in bytes of the smaller file plus one.
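To make the return values concrete, here is a small, self-contained sketch that repeats the method and runs it against three temporary files. The class name ByteCompareDemo and the helper tempFileWith are hypothetical names introduced only for this illustration:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ByteCompareDemo {

    // Same logic as filesCompareByByte: returns -1 for identical files,
    // otherwise the 1-based position of the first differing byte.
    static long filesCompareByByte(Path path1, Path path2) throws IOException {
        try (BufferedInputStream fis1 = new BufferedInputStream(new FileInputStream(path1.toFile()));
             BufferedInputStream fis2 = new BufferedInputStream(new FileInputStream(path2.toFile()))) {
            int ch;
            long pos = 1;
            while ((ch = fis1.read()) != -1) {
                if (ch != fis2.read()) {
                    return pos;
                }
                pos++;
            }
            return fis2.read() == -1 ? -1 : pos;
        }
    }

    // Hypothetical helper for this demo: writes the text to a fresh temp file.
    static Path tempFileWith(String content) throws IOException {
        Path path = Files.createTempFile("byteDemo", ".txt");
        path.toFile().deleteOnExit();
        return Files.write(path, content.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        Path abc = tempFileWith("abc");
        Path abd = tempFileWith("abd");
        Path ab = tempFileWith("ab");

        System.out.println(filesCompareByByte(abc, abc)); // identical content: -1
        System.out.println(filesCompareByByte(abc, abd)); // third byte differs: 3
        System.out.println(filesCompareByByte(ab, abc));  // "ab" is a prefix of "abc": 3
    }
}
```

Because pos starts at 1, the prefix case reports 3 for a two-byte file: the position of the first byte that has no counterpart in the shorter file.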
3. Line by Line Comparison
To compare text files, we can write an implementation that reads the files line by line and checks for equality between them.
Let's work with a BufferedReader, which uses the same strategy as BufferedInputStream, copying chunks of data from the file to an internal buffer to speed up the reading process.
Let's review our implementation:
public static long filesCompareByLine(Path path1, Path path2) throws IOException {
    try (BufferedReader bf1 = Files.newBufferedReader(path1);
         BufferedReader bf2 = Files.newBufferedReader(path2)) {
        long lineNumber = 1; // 1-based number of the line being compared
        String line1 = "", line2 = "";
        while ((line1 = bf1.readLine()) != null) {
            line2 = bf2.readLine();
            if (line2 == null || !line1.equals(line2)) {
                return lineNumber;
            }
            lineNumber++;
        }
        // The first file is exhausted; the files match only if the second one is too
        if (bf2.readLine() == null) {
            return -1;
        } else {
            return lineNumber;
        }
    }
}
The code follows a similar strategy to the previous example. In the while loop, instead of reading bytes, we read a line from each file and check for equality. If all the lines are identical for both files, we return -1L, but if there's a discrepancy, we return the line number, counting from 1, where the first mismatch is found.
If the files have a different number of lines but the smaller file matches the corresponding lines of the larger file, the method returns the number of the first line that exists only in the larger file, that is, the number of lines of the smaller file plus one.
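One consequence worth seeing in action: BufferedReader.readLine strips the line terminator, so two files that differ only in their line endings compare as equal with this approach. The following sketch repeats the method and exercises it on a few temporary files. LineCompareDemo and tempFileWith are hypothetical names used only for this illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class LineCompareDemo {

    // Same logic as filesCompareByLine: returns -1 when all lines match,
    // otherwise the 1-based number of the first differing line.
    static long filesCompareByLine(Path path1, Path path2) throws IOException {
        try (BufferedReader bf1 = Files.newBufferedReader(path1);
             BufferedReader bf2 = Files.newBufferedReader(path2)) {
            long lineNumber = 1;
            String line1;
            while ((line1 = bf1.readLine()) != null) {
                String line2 = bf2.readLine();
                if (line2 == null || !line1.equals(line2)) {
                    return lineNumber;
                }
                lineNumber++;
            }
            return bf2.readLine() == null ? -1 : lineNumber;
        }
    }

    // Hypothetical helper for this demo: writes the text (UTF-8) to a fresh temp file.
    static Path tempFileWith(String content) throws IOException {
        Path path = Files.createTempFile("lineDemo", ".txt");
        path.toFile().deleteOnExit();
        return Files.writeString(path, content);
    }

    public static void main(String[] args) throws IOException {
        Path unix = tempFileWith("one\ntwo\n");
        Path windows = tempFileWith("one\r\ntwo\r\n");
        Path changed = tempFileWith("one\nTWO\n");
        Path shorter = tempFileWith("one\n");

        System.out.println(filesCompareByLine(unix, windows)); // only line endings differ: -1
        System.out.println(filesCompareByLine(unix, changed)); // second line differs: 2
        System.out.println(filesCompareByLine(shorter, unix)); // second line is missing: 2
    }
}
```

If line endings matter for the comparison, a byte-oriented approach is the better fit.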
4. Comparing with Files::mismatch
The method Files::mismatch, added in Java 12, compares the contents of two files. It returns -1L if the files are identical, and otherwise, it returns the position in bytes of the first mismatch.
This method internally reads chunks of data from the files' InputStreams and uses Arrays::mismatch, introduced in Java 9, to compare them.
Positions are zero-based, so for files that are of different sizes but for which the contents of the smaller file are identical to the corresponding contents in the larger file, it returns the size (in bytes) of the smaller file.
To see examples of how to use this method, please see our article covering the new features of Java 12.
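As a quick illustration of the return values described in this section, here is a minimal sketch that calls Files.mismatch on three temporary files. It assumes Java 12 or later. MismatchDemo and tempFileWith are hypothetical names used only for this example:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class MismatchDemo {

    // Hypothetical helper for this demo: writes the text to a fresh temp file.
    static Path tempFileWith(String content) throws IOException {
        Path path = Files.createTempFile("mismatchDemo", ".txt");
        path.toFile().deleteOnExit();
        return Files.write(path, content.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        Path abc = tempFileWith("abc");
        Path abcCopy = tempFileWith("abc");
        Path abd = tempFileWith("abd");
        Path ab = tempFileWith("ab");

        System.out.println(Files.mismatch(abc, abcCopy)); // identical content: -1
        System.out.println(Files.mismatch(abc, abd));     // byte at index 2 differs: 2
        System.out.println(Files.mismatch(ab, abc));      // "ab" is a prefix: size of "ab" = 2
    }
}
```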
5. Using Memory Mapped Files
A memory-mapped file is a kernel object that maps the bytes from a disk file to the computer's memory address space. The Java heap is bypassed, as our code manipulates the contents of the memory-mapped file as if it were accessing memory directly.
For large files, reading and writing data through memory-mapped files can be considerably faster than using the standard Java I/O library. The computer needs an adequate amount of memory to handle the job, though; otherwise, it may start thrashing.
Let's write a very simple example that shows how to compare the contents of two files using memory-mapped files:
public static boolean compareByMemoryMappedFiles(Path path1, Path path2) throws IOException {
    try (RandomAccessFile randomAccessFile1 = new RandomAccessFile(path1.toFile(), "r");
         RandomAccessFile randomAccessFile2 = new RandomAccessFile(path2.toFile(), "r")) {
        FileChannel ch1 = randomAccessFile1.getChannel();
        FileChannel ch2 = randomAccessFile2.getChannel();
        // Files of different sizes can never have identical contents
        if (ch1.size() != ch2.size()) {
            return false;
        }
        long size = ch1.size();
        // A single mapping cannot exceed Integer.MAX_VALUE bytes
        MappedByteBuffer m1 = ch1.map(FileChannel.MapMode.READ_ONLY, 0L, size);
        MappedByteBuffer m2 = ch2.map(FileChannel.MapMode.READ_ONLY, 0L, size);
        return m1.equals(m2);
    }
}
The method returns true if the contents of the files are identical; otherwise, it returns false.
We open the files using the RandomAccessFile class and access their respective FileChannel to get the MappedByteBuffer. This is a direct byte buffer that is a memory-mapped region of the file. In this simple implementation, we use its equals method to compare in memory the bytes of the whole file in one pass. Since FileChannel::map accepts at most Integer.MAX_VALUE bytes per mapping, this version only works for files smaller than 2 GB.
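A single call to FileChannel::map can cover at most Integer.MAX_VALUE bytes, so larger files need to be mapped piece by piece. The following is a minimal sketch of that idea, not a tuned implementation: it maps both files in windows of a caller-chosen size and compares each pair of windows. The class name MappedCompareDemo, the method compareByMappedWindows, and the windowSize parameter are assumptions made for this illustration:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedCompareDemo {

    // Compares two files window by window using read-only memory mappings.
    // windowSize is a hypothetical tuning knob: 1..Integer.MAX_VALUE bytes.
    static boolean compareByMappedWindows(Path path1, Path path2, long windowSize) throws IOException {
        if (windowSize <= 0 || windowSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("windowSize must be between 1 and Integer.MAX_VALUE");
        }
        try (FileChannel ch1 = FileChannel.open(path1, StandardOpenOption.READ);
             FileChannel ch2 = FileChannel.open(path2, StandardOpenOption.READ)) {
            long size = ch1.size();
            if (size != ch2.size()) {
                return false;
            }
            for (long offset = 0; offset < size; offset += windowSize) {
                // The last window may be shorter than windowSize
                long length = Math.min(windowSize, size - offset);
                MappedByteBuffer m1 = ch1.map(FileChannel.MapMode.READ_ONLY, offset, length);
                MappedByteBuffer m2 = ch2.map(FileChannel.MapMode.READ_ONLY, offset, length);
                if (!m1.equals(m2)) {
                    return false;
                }
            }
            return true;
        }
    }

    // Hypothetical helper for this demo: writes the text to a fresh temp file.
    static Path tempFileWith(String content) throws IOException {
        Path path = Files.createTempFile("mappedDemo", ".bin");
        path.toFile().deleteOnExit();
        return Files.write(path, content.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        Path a = tempFileWith("0123456789");
        Path b = tempFileWith("0123456789");
        Path c = tempFileWith("0123456X89");

        // A tiny window of 4 bytes forces several mappings per file
        System.out.println(compareByMappedWindows(a, b, 4)); // true
        System.out.println(compareByMappedWindows(a, c, 4)); // false
    }
}
```

In practice, a window of tens or hundreds of megabytes would be a more realistic choice. The 4-byte window here only serves to exercise the loop on small test files.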
6. Using Apache Commons I/O
The methods IOUtils::contentEquals and IOUtils::contentEqualsIgnoreEOL compare the contents of two streams or readers, which we can open on our files, to determine equality. The difference between them is that contentEqualsIgnoreEOL ignores line feed (\n) and carriage return (\r). This is useful because operating systems use different combinations of these control characters to mark a new line.
Let's see a simple example to check for equality:
@Test
public void whenFilesIdentical_thenReturnTrue() throws IOException {
    Path path1 = Files.createTempFile("file1Test", ".txt");
    Path path2 = Files.createTempFile("file2Test", ".txt");
    Files.writeString(path1, "testing line 1" + System.lineSeparator() + "line 2");
    Files.writeString(path2, "testing line 1" + System.lineSeparator() + "line 2");
    // Open the streams only after the content is written, and close them when done
    try (InputStream inputStream1 = new FileInputStream(path1.toFile());
         InputStream inputStream2 = new FileInputStream(path2.toFile())) {
        assertTrue(IOUtils.contentEquals(inputStream1, inputStream2));
    }
}
If we want to ignore newline control characters but otherwise check for equality of the contents:
@Test
public void whenFilesIdenticalIgnoreEOL_thenReturnTrue() throws IOException {
    Path path1 = Files.createTempFile("file1Test", ".txt");
    Path path2 = Files.createTempFile("file2Test", ".txt");
    Files.writeString(path1, "testing line 1 \n line 2");
    Files.writeString(path2, "testing line 1 \r\n line 2");
    // Close both readers once the comparison is done
    try (Reader reader1 = new BufferedReader(new FileReader(path1.toFile()));
         Reader reader2 = new BufferedReader(new FileReader(path2.toFile()))) {
        assertTrue(IOUtils.contentEqualsIgnoreEOL(reader1, reader2));
    }
}
7. Conclusion
In this article, we've covered several ways to implement comparison of the contents of two files to check for equality.
The source code can be found over on GitHub.