Channel: Baeldung

Replace Non-Printable Unicode Characters in Java

1. Introduction

Non-printable Unicode characters include control characters, formatting marks, and other invisible symbols that appear in text but aren’t meant to be displayed. These characters can cause problems when we process, display, or store text, so it’s useful to have ways to replace or remove them as required.

In this tutorial, we’ll look at different ways to replace them.

2. Using Regular Expressions

Regular expressions provide a concise way to match and replace patterns in strings, and Java supports them through the Pattern and Matcher classes. We can use a simple pattern to find and remove non-printable Unicode characters as follows:

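// Uses java.util.regex.Pattern and java.util.regex.Matcher
@Test
public void givenTextWithNonPrintableChars_whenUsingRegularExpression_thenGetSanitizedText() {
    // The input contains newlines and tabs, which are control characters
    String originalText = "\n\nWelcome \n\n\n\tto Baeldung!\n\t";
    String expected = "Welcome to Baeldung!";
    // \p{C} matches any code point in the Unicode "Other" category
    String regex = "[\\p{C}]";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(originalText);
    // Replace every match with an empty string
    String sanitizedText = matcher.replaceAll("");
    assertEquals(expected, sanitizedText);
}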

In this test method, the \\p{C} class in the regular expression matches any character in the Unicode “Other” category, which covers control characters as well as format, surrogate, private-use, and unassigned code points. We compile the regular expression into a pattern using the Pattern.compile(regex) method, and then we create a Matcher object by calling pattern.matcher() with originalText as the argument.

Then, we call the Matcher.replaceAll() method to replace every matched character with an empty string, which removes it from the source text. Lastly, we compare sanitizedText with the expected string using the assertEquals() method.
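
The same idea can be wrapped in a small reusable helper. The following is only a sketch: the class name NonPrintableSanitizer and its strip() and replace() methods are hypothetical and not part of the original example. It compiles the pattern once and also allows a visible replacement instead of plain removal:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original article's code
final class NonPrintableSanitizer {

    // \p{C} is the Unicode "Other" category: control (Cc), format (Cf),
    // surrogate (Cs), private-use (Co), and unassigned (Cn) code points.
    // Compiling once avoids re-parsing the regex on every call.
    private static final Pattern NON_PRINTABLE = Pattern.compile("\\p{C}");

    private NonPrintableSanitizer() {
    }

    // Removes every character matched by \p{C}
    static String strip(String text) {
        return replace(text, "");
    }

    // Replaces every character matched by \p{C} with the given literal text.
    // quoteReplacement() keeps '$' and '\' in the replacement from being
    // interpreted as group references or escapes.
    static String replace(String text, String replacement) {
        return NON_PRINTABLE.matcher(text)
          .replaceAll(Matcher.quoteReplacement(replacement));
    }
}
```

With this sketch, NonPrintableSanitizer.strip("Zero\u200Bwidth") should return "Zerowidth", because the zero-width space (U+200B) belongs to the format category, while NonPrintableSanitizer.replace("a\tb", " ") should return "a b". Since \p{C} includes unassigned code points, the result for very recent characters may depend on the Unicode version bundled with the JDK in use.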

3. Custom Implementation

Alternatively, we can iterate over the code points of our text and drop unwanted characters based on their numeric values. Let’s take a simple example:

@Test
public void givenTextWithNonPrintableChars_whenCustomImplementation_thenGetSanitizedText() {
    String originalText = "\n\nWelcome \n\n\n\tto Baeldung!\n\t";
    String expected = "Welcome to Baeldung!";
    StringBuilder strBuilder = new StringBuilder();
    // Visit each Unicode code point of the input
    originalText.codePoints().forEach((i) -> {
        // Keep everything except ASCII control characters (0-31) and DEL (127)
        if (i >= 32 && i != 127) {
            strBuilder.append(Character.toChars(i));
        }
    });
    assertEquals(expected, strBuilder.toString());
}

Here, we use originalText.codePoints() and a forEach loop to iterate over the Unicode code points of the original text. The condition keeps only code points that are at least 32 and not equal to 127, which eliminates the ASCII control characters (0 to 31) and the DEL character (127). Note that this check only covers the ASCII range, so other invisible characters, such as the zero-width space (U+200B), pass through unchanged.

We then append each remaining code point to the StringBuilder object using the strBuilder.append(Character.toChars(i)) method.
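
If the custom loop should also catch invisible characters outside the ASCII range, one possible variation is to test each code point's Unicode general category with Character.getType() instead of comparing raw numbers. This is a sketch under the assumption that “non-printable” means the same categories that \p{C} covers; the class name CodePointFilter and its methods are hypothetical:

```java
// Hypothetical helper, not part of the original article's code
final class CodePointFilter {

    private CodePointFilter() {
    }

    // Assumption: a code point counts as non-printable if its general
    // category is one of the "Other" categories (Cc, Cf, Cs, Co, Cn).
    static boolean isNonPrintable(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.SURROGATE:
            case Character.PRIVATE_USE:
            case Character.UNASSIGNED:
                return true;
            default:
                return false;
        }
    }

    // Copies only the code points that are not classified as non-printable
    static String strip(String text) {
        StringBuilder result = new StringBuilder(text.length());
        text.codePoints()
          .filter(codePoint -> !isNonPrintable(codePoint))
          .forEach(result::appendCodePoint);
        return result.toString();
    }
}
```

Unlike the numeric check, this version should also drop characters such as the C1 control U+0085 and the zero-width space U+200B, at the cost of a slightly longer condition.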

4. Conclusion

In this tutorial, we addressed the problems caused by non-printable Unicode characters in text. We explored two approaches: using regular expressions with Java’s Pattern and Matcher classes, and implementing a custom solution that filters code points.

As always, the complete code samples for this article can be found over on GitHub.

       
