1. Overview
When dealing with Strings in Java, sometimes we need to encode them into a specific charset.
This tutorial is a practical guide showing different ways to encode a String to the UTF-8 charset; for a more technical deep-dive see our Guide to Character Encoding.
2. Defining the Problem
To showcase the Java encoding, we'll work with the German String “Entwickeln Sie mit Vergnügen”.
String germanString = "Entwickeln Sie mit Vergnügen";
byte[] germanBytes = germanString.getBytes();
String asciiEncodedString = new String(germanBytes, StandardCharsets.US_ASCII);
assertNotEquals(asciiEncodedString, germanString);
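Decoding these bytes with US_ASCII doesn't give us the original String back, because US_ASCII can't represent the non-ASCII ü character. When printed, the result looks like “Entwickeln Sie mit Vergn?gen”; the exact placeholder shown depends on the platform default charset used by getBytes() and on the console. But when we do the same with a String that contains only ASCII characters, we get the same String back: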
String englishString = "Develop with pleasure";
byte[] englishBytes = englishString.getBytes();
String asciiEncodedEnglishString = new String(englishBytes, StandardCharsets.US_ASCII);
assertEquals(asciiEncodedEnglishString, englishString);
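To make the failure visible without relying on the platform default charset, the following minimal sketch encodes the text explicitly as UTF-8 and then decodes the bytes as US_ASCII. The class name AsciiDecodingDemo and the helper decodeUtf8BytesAsAscii are illustrative names, not part of any library, and the ü is written as the escape \u00fc so that the example doesn't depend on the source file's encoding:

```java
import java.nio.charset.StandardCharsets;

class AsciiDecodingDemo {

    // Encodes the text as UTF-8, then (incorrectly) decodes those bytes as US-ASCII.
    static String decodeUtf8BytesAsAscii(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String german = "Vergn\u00fcgen"; // \u00fc is the ü character
        String decoded = decodeUtf8BytesAsAscii(german);

        // ü takes two bytes in UTF-8, and neither byte is valid US-ASCII,
        // so each one is replaced by U+FFFD (the replacement character).
        System.out.println(german.length());  // 9
        System.out.println(decoded.length()); // 10
        System.out.println(decoded.equals("Vergn\uFFFD\uFFFDgen")); // true

        // Pure ASCII text survives the round trip unchanged.
        System.out.println(decodeUtf8BytesAsAscii("Develop").equals("Develop")); // true
    }
}
```

Whether the replacement character shows up as ? or as another glyph depends on how the console renders U+FFFD.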
Let's see what happens when we use the UTF-8 encoding.
3. Encoding With Core Java
Let's start with the core library.
Strings are immutable in Java, which means we can't change the character encoding of an existing String. In fact, a charset only comes into play when we convert a String to bytes or bytes back to a String. To achieve what we want, we need to copy the bytes of the String and then create a new one with the desired encoding.
First, we get the String's bytes in UTF-8 and, after that, create a new String from those bytes using the same charset:
String rawString = "Entwickeln Sie mit Vergnügen";
byte[] bytes = rawString.getBytes(StandardCharsets.UTF_8);
String utf8EncodedString = new String(bytes, StandardCharsets.UTF_8);
assertEquals(rawString, utf8EncodedString);
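Since the round trip above ends with an equal String, it can help to look at the intermediate bytes to see what UTF-8 encoding actually produces. The sketch below is one way to do that; Utf8BytesDemo, utf8Hex, and utf8Length are hypothetical helper names introduced only for illustration:

```java
import java.nio.charset.StandardCharsets;

class Utf8BytesDemo {

    // Returns the UTF-8 bytes of the text as an uppercase hex string.
    static String utf8Hex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 so negative byte values print as two hex digits.
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    // Returns how many bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String rawString = "Entwickeln Sie mit Vergn\u00fcgen"; // \u00fc is ü

        System.out.println(utf8Hex("\u00fc"));   // C3BC
        System.out.println(rawString.length());  // 28 characters
        System.out.println(utf8Length(rawString)); // 29 bytes, because ü needs two
    }
}
```

The byte array is the UTF-8 encoded form; the String created from it is an ordinary Java String again.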
4. Encoding With Java 7 StandardCharsets
Alternatively, we can use the encode() and decode() methods of the Charset constants defined in the StandardCharsets class, which was introduced in Java 7.
First, we'll encode the String into bytes and, secondly, decode those bytes back into a String using UTF-8:
String rawString = "Entwickeln Sie mit Vergnügen";
ByteBuffer buffer = StandardCharsets.UTF_8.encode(rawString);
String utf8EncodedString = StandardCharsets.UTF_8.decode(buffer).toString();
assertEquals(rawString, utf8EncodedString);
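If we need a plain byte[] rather than a ByteBuffer, one option is to copy out only the bytes between the buffer's position and limit. This is a minimal sketch under that assumption; ByteBufferDemo and encodeToUtf8 are illustrative names. It avoids buffer.array(), because the backing array can be larger than the encoded data:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ByteBufferDemo {

    // Encodes the text with Charset.encode() and copies the result into a byte[].
    static byte[] encodeToUtf8(String text) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        // remaining() is the number of encoded bytes ready to be read.
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String rawString = "Vergn\u00fcgen"; // \u00fc is ü
        byte[] bytes = encodeToUtf8(rawString);

        System.out.println(bytes.length); // 10
        System.out.println(new String(bytes, StandardCharsets.UTF_8).equals(rawString)); // true
    }
}
```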
5. Encoding With Commons-Codec
Besides core Java, we can use Apache Commons Codec to achieve the same result.
Apache Commons Codec is a handy package containing simple encoders and decoders for various formats.
First, let's start with the project configuration. When using Maven, we have to add the commons-codec dependency to our pom.xml:
<dependency>
    <groupId>commons-codec</groupId>
    <artifactId>commons-codec</artifactId>
    <version>1.14</version>
</dependency>
Then, in our case, the most interesting class is StringUtils, which provides methods to encode Strings. Using this class, getting a UTF-8 encoded String is pretty straightforward:
String rawString = "Entwickeln Sie mit Vergnügen";
byte[] bytes = StringUtils.getBytesUtf8(rawString);
String utf8EncodedString = StringUtils.newStringUtf8(bytes);
assertEquals(rawString, utf8EncodedString);
6. Conclusion
Encoding a String into UTF-8 isn't difficult, but it's not that intuitive. This tutorial presents three ways of doing it, either using core Java or using Apache Commons Codec.
As always, the code samples can be found over on GitHub.