Quantcast
Channel: Baeldung
Viewing all articles
Browse latest Browse all 4536

Run-Length Encoding and Decoding in Java

$
0
0

1. Overview

In computer science, data compression techniques play an important role in optimizing storage and transmission efficiency. One such technique that has stood the test of time is Run-length Encoding (RLE).

In this tutorial, we’ll understand RLE and explore how to implement encoding and decoding in Java.

2. Understanding Run-Length Encoding

Run-length Encoding is a simple yet effective form of lossless data compression. The basic idea behind RLE is to represent consecutive identical elements, called a “run” in a data stream by a single value and its count rather than as the original run.

This is particularly useful when dealing with repetitive sequences, as it significantly reduces the amount of space needed to store or transmit the data.

RLE is well suited to compress palette-based bitmap images, such as computer icons and animations. A well-known example is the Microsoft Windows 3.x startup screen, which is RLE compressed.

Let’s consider the following example:

String INPUT = "WWWWWWWWWWWWBAAACCDEEEEE";

If we apply a run-length encoding data compression algorithm to the above string, it can be rendered as follows:

String RLE = "12W1B3A2C1D5E";

In the encoded sequence, each character follows the number of times it appears consecutively. This rule allows us to easily reconstruct the original data during decoding.

It’s worth noting that the standard RLE only works with “textual” input. If the input contains numbers, RLE can’t encode them in a nonambiguous way.

In this tutorial, we’ll explore two run-length encoding and decoding approaches.

3. Character Array Based Solution

Now that we understand RLE and how it works, a classic approach to implementing run-length encoding and decoding is converting the input strings into a char array and then applying the encoding and decoding rules to the array.

3.1. Creating the Encoding Method

The key to implementing run-length encoding is to identify each run and count its length. Let’s first look at the implementation and then understand how it works:

String runLengthEncode(String input) {
    StringBuilder result = new StringBuilder();
    int count = 1;
    char[] chars = input.toCharArray();
    for (int i = 0; i < chars.length; i++) {
        char c = chars[i];
        if (i + 1 < chars.length && c == chars[i + 1]) {
            count++;
        } else {
            result.append(count).append(c);
            count = 1;
        }
    }
    return result.toString();
}

Next, let’s quickly walk through the code above and understand the logic.

First, we use StringBuilder to store each step’s result and concatenate them for better performance.

After initializing a counter and converting the input string to a char array, the function iterates through each character in the input string.

For each character:

  • If the current character is the same as the next character and we are not at the end of the string, the count is incremented.
  • If the current character is different from the next character or we are at the end of the string, the count and the current character are appended to the result StringBuilder. The count is then reset to 1 for the next unique character.

Finally, the StringBuilder is converted to a string using toString() and returned as the result of the encoding process.

When we test our INPUT with this encoding method, we get the expected result:

assertEquals(RLE, runLengthEncode(INPUT));

3.2. Creating the Decoding Method

Identifying each run is still crucial to decode a RLE string. A run includes the character and the number of times it appears, such as “12W” or “2C“.

Now, let’s create a decoding method:

String runLengthDecode(String rle) {
    StringBuilder result = new StringBuilder();
    char[] chars = rle.toCharArray();
    int count = 0;
    for (char c : chars) {
        if (Character.isDigit(c)) {
            count = 10 * count + Character.getNumericValue(c);
        } else {
            result.append(String.join("", Collections.nCopies(count, String.valueOf(c))));
            count = 0;
        }
    }
    return result.toString();
}

Next, let’s break down the code and understand step by step how it works.

First, we create a StringBuilder object to hold step results and convert the rle string to a char array for later processing.

Also, we initialized an integer variable count to keep track of the count of consecutive occurrences.

Then, we iterate through each character in the RLE-encoded string. For each character:

  • If the character is a digit, it contributes to the count. The count is updated by appending the digit to the existing count using the formula 10 * count + Character.getNumericValue(c). This is done to handle multi-digit counts.
  • If the character is not a digit, a new character is encountered. The current character is then appended to the result StringBuilder repeated count times using Collections.nCopies(). The count is then reset to 0 for the next set of consecutive occurrences.

It’s worth noting that if we work with Java 11 or later, String.valueOf(c).repeat(count) is a better alternative to repeat the character c count times.

When we verify the decoding method using our example, the test passes:

assertEquals(INPUT, runLengthDecode(RLE));

4. Regex Based Solution

Regex is a powerful tool for dealing with characters and strings. Let’s see if we can perform run-length encoding and decoding using regex.

4.1. Creating the Encoding Method

Let’s first look at the input string. If we can split it into a “run array”, then the problem can be easily solved:

Input     : "WWWWWWWWWWWWBAAACCDEEEEE"
Run Array : ["WWWWWWWWWWWW", "B", "AAA", "CC", "D", "EEEEE"]

We cannot split the input string by a character to achieve this. Instead, we must split by a zero-width position, such as the position between ‘W‘ and ‘B‘, ‘B‘ and ‘A“, etc.

It isn’t hard to discover the rules of these positions: the characters before and after the position are different. Therefore, we can build a regex to match the required positions using look-around and back reference: “(?<=(\\D))(?!\\1)“. Let’s quickly understand what this regex means:

  • (?<=(\\D)) – Positive look-behind assertion ensures the match is after a non-digit character (\\D represents a non-digit character).
  • (?!\\1) – Negative lookahead assertion ensures the matched position isn’t before the same character as in the positive lookbehind. \\1 refers to the previously matched non-digit character.

The combination of these assertions ensures that the split occurs at the boundaries between consecutive runs of the same character.

Next, let’s create the encoding method:

String runLengthEncodeByRegEx(String input) {
    String[] arr = input.split("(?<=(\\D))(?!\\1)");
    StringBuilder result = new StringBuilder();
    for (String run : arr) {
        result.append(run.length()).append(run.charAt(0));
    }
    return result.toString();
}

As we can see, after we obtain the run array, the rest of the task is simply appending each run’s length and the character to the prepared StringBuilder.

The runLengthEncodeByRegEx() method passes our test:

assertEquals(RLE, runLengthEncodeByRegEx(INPUT));

4.2. Creating the Decoding Method

We can follow a similar idea to decode a RLE-encoded string. First, we need to split the encoded string and obtain the following array:

RLE String: "12W1B3A2C1D5E"
Array     : ["12", "W", "1", "B", "3", "A", "2", "C", "1", "D", "5", "E"]

Once we get this array, we can generate the decoded string by easily repeating each character, such as ‘W12 times, ‘B‘ once,  etc.

We’ll use the look-around technique again to create the regex to split the input string: “(?<=\\D)|(?=\\D+)“.

In this regex:

  • (?<=\\D) – Positive look-behind assertion ensures that the split occurs after a non-digit character.
  • | – Indicates the “OR” relation
  • (?=\\D+) – Positive lookahead assertion ensures the split occurs before one or more non-digit characters.

This combination allows the split to occur at the boundaries between consecutive counts and characters in the RLE-encoded string.

Next, let’s build the decoding method based on the regex-based split:

String runLengthDecodeByRegEx(String rle) {
    if (rle.isEmpty()) {
        return "";
    }
    String[] arr = rle.split("(?<=\\D)|(?=\\D+)");
    if (arr.length % 2 != 0) {
        throw new IllegalArgumentException("Not a RLE string");
    }
    StringBuilder result = new StringBuilder();
    for (int i = 1; i <= arr.length; i += 2) {
        int count = Integer.parseInt(arr[i - 1]);
        String c = arr[i];
        result.append(String.join("", Collections.nCopies(count, c)));
    }
    return result.toString();
}

As the code shows, we included some simple validations at the beginning of the method. Then, while iterating the split array, we retrieved the count and character from the array and appended the character repeated count times to the result StringBuilder.

Finally, this decoding method works with our example:

assertEquals(INPUT, runLengthDecodeByRegEx(RLE));

5. Conclusion

In this article, we first discussed how run-length encoding works and then explored two approaches to implementing run-length encoding and decoding.

As always, the complete source code for the examples is available over on GitHub.

       

Viewing all articles
Browse latest Browse all 4536

Trending Articles