
Beware of Java's String.getBytes

Some subtle bugs originate in quirks of the language used to build the software. This blog post describes one such case so that fellow security researchers and developers who didn't know about it become aware of this potentially vulnerable pattern. In fact, I'm pretty sure that bugs similar to the one described here affect a bunch of products and codebases out there.

In previous posts, I've already described some bugs in Swiss Post's future e-voting system. While reading their Crypto-Primitives specification, which among other things describes the custom hashing algorithm Swiss Post implemented, I noticed something potentially interesting.

The hashing algorithm supports four input types: byte arrays, strings, integers and vectors. Before being hashed, strings are converted to a byte array via the 'StringToByteArray' algorithm.

However, comparing 'StringToByteArray' and 'ByteArrayToString' reveals a significant difference: invalid UTF-8 sequences are only considered in the latter. Let's see how this was implemented in the code:

File: crypto-primitives-master/src/main/java/ch/post/it/evoting/cryptoprimitives/internal/utils/
079:    /**
080:     * See {@link}
081:     */
082:    public static byte[] stringToByteArray(final String s) {
083:        checkNotNull(s);
085:        // Corresponds to UTF-8(S)
086:        return s.getBytes(StandardCharsets.UTF_8);
087:    }
089:    /**
090:     * See {@link}
091:     */
092:    public static String byteArrayToString(final byte[] b) {
093:        checkNotNull(b);
094:        checkArgument(b.length > 0, "The length of the byte array must be strictly positive.");
096:        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
097:        // The try-catch clause implements the pseudo-code's if statement
098:        try {
099:            // Corresponds to UTF-8^-1(B)
100:            return decoder.decode(ByteBuffer.wrap(b)).toString();
101:        } catch (CharacterCodingException ex) {
102:            throw new IllegalArgumentException("The byte array does not correspond to a valid sequence of UTF-8 encoding.");
103:        }
104:    }

As expected, at line 100, 'byteArrayToString' tries to decode the input in order to detect invalid UTF-8 sequences. On the other hand, at line 86, 'stringToByteArray' directly uses 'getBytes'.

Internally, 'getBytes' will encode the string before returning the bytes. Let's see Java's implementation:

File: java/lang/
    /**
     * Encodes this {@code String} into a sequence of bytes using the given
     * {@linkplain java.nio.charset.Charset charset}, storing the result into a
     * new byte array.
     *
     * <p> This method always replaces malformed-input and unmappable-character
     * sequences with this charset's default replacement byte array.  The
     * {@link java.nio.charset.CharsetEncoder} class should be used when more
     * control over the encoding process is required.
     *
     * @param  charset
     *         The {@linkplain java.nio.charset.Charset} to be used to encode
     *         the {@code String}
     *
     * @return  The resultant byte array
     *
     * @since  1.6
     */
    public byte[] getBytes(Charset charset) {
        if (charset == null) throw new NullPointerException();
        return encode(charset, coder(), value);
    }

As the description of the method clearly states, any malformed or unmappable character sequence is replaced with the charset's default "replacement byte array", which is equivalent to configuring an encoder with onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE). No exception is thrown.

One might expect the replacement to be the UTF-8 encoding of one of the Unicode specials, the replacement character (U+FFFD), which is what Java's charset decoders substitute for malformed input. However, Java's UTF-8 Charset encoder uses '?' (0x3F, decimal 63) instead.

File: 'sun.nio.cs.UTF_8'

This behavior does not seem to comply with Unicode's security recommendations.
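The replacement behavior can be observed directly with a few lines of Java. The following is a minimal sketch, not code from the Swiss Post repository: the class name 'GetBytesReplacementDemo' and the helpers 'lenient' and 'strictEncodingFails' are hypothetical names chosen for illustration. It encodes a string that ends in an unpaired surrogate, first with 'getBytes' and then with a fresh 'CharsetEncoder', which reports malformed input by default.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class GetBytesReplacementDemo {

    // Lenient path: String.getBytes silently replaces malformed input.
    static byte[] lenient(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Strict path: a fresh CharsetEncoder reports malformed input by default,
    // so encode() throws instead of substituting a replacement byte.
    static boolean strictEncodingFails(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String loneSurrogate = "A\uD8AF"; // 'A' followed by an unpaired high surrogate

        // Print the bytes produced by getBytes in hex.
        for (byte b : lenient(loneSurrogate)) {
            System.out.printf("%02x ", b);
        }
        System.out.println();

        System.out.println("strict encoder rejects input: "
                + strictEncodingFails(loneSurrogate));
    }
}
```

With 'getBytes', the unpaired surrogate comes out as the single byte 0x3F ('?'), while the strict encoder refuses the same input.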

It also differs from other languages such as C#, whose Encoding.UTF8.GetBytes follows the best practice and replaces the malformed character sequence with the UTF-8 encoded version (0xEF 0xBF 0xBD) of the replacement character.


Depending on the logic in which the vulnerable pattern appears, it may enable different attack scenarios.

Swiss Post E-Voting 

In this specific case, what we end up with is a hash collision vulnerability.
This simple PoC shows that two different strings are not injectively encoded, thus leading to a potential hash collision under the 'RecursiveHash' algorithm.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Objects;
import java.util.stream.Collectors;

public class poc {

    // Strict encoding: a fresh CharsetEncoder reports malformed input instead of replacing it.
    public static void encode(String string)
     throws CharacterCodingException {
     CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

     ByteBuffer encoded = encoder.encode(CharBuffer.wrap(string.toCharArray()));
     printhex(Arrays.copyOf(encoded.array(), encoded.limit()));

     System.out.println("\nEncoding end");
    }

    // Prints a byte array as space-separated hex values.
    public static void printhex(byte[] ars) {
       for (int i = 0; i < ars.length; i++) {
           System.out.print(String.format("%02x ", ars[i]));
       }
       System.out.println();
    }

    // Same logic as the vulnerable method: lenient encoding via getBytes.
    public static byte[] stringToByteArray(final String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Builds a String from the custom "+UXXXX" notation, one UTF-16 code unit per token.
    public static String buildCustomString(String dat) {
        String badString = Arrays.stream(dat.split("\\+U"))
                .filter(s -> !s.isEmpty())
                .map(s -> {
                    try {
                        return Integer.parseInt(s, 16);
                    } catch (NumberFormatException e) {
                        System.out.println("Error parsing int");
                        return null;
                    }
                })
                .filter(Objects::nonNull)
                .map(i -> Character.toString(i))
                .collect(Collectors.joining());

        return badString;
    }

    public static String DumpInfo(String inf) {
        String badString;

        System.out.println("=>> Original string (custom format): \"" + inf + "\"");
        System.out.println("[+] Building String");
        badString = buildCustomString(inf);

        System.out.println("[+] String badString => \"" + badString + "\"");
        System.out.print("[+] badString.toCharArray() =>\t\t");
        char[] hash3 = badString.toCharArray();

        for (int i = 0; i < hash3.length; i++) {
            System.out.print(String.format("%04x ", (int) hash3[i]));
        }
        System.out.println();

        System.out.print("[+] stringToByteArray(badString) =>     ");
        printhex(stringToByteArray(badString));

        return badString;
    }

    public static void main(String args[]) {
        String badString1;

        String base = "Question_1?";

        // +UD8AF is illegal so it will be mapped to '?' after UTF8 encoding.
        String data1 = "+U0051+U0075+U0065+U0073+U0074+U0069+U006f+U006E+U005f+U0031+UD8AF";

        badString1 = DumpInfo(data1);

        System.out.println("[+] badString after UTF-8 encoding: \"" + badString1 + "\"");
        System.out.print("[+] stringToByteArray(base) =>   \t");
        printhex(stringToByteArray(base));

        if (!badString1.equals(base)) {
            System.out.println("[+] 'badString' and 'base' are not equal");
        }

        try {
            //This will trigger an exception
            encode(badString1);
        }
        catch (Exception e) {
            System.out.println("[+] Exception: " + e);
        }
    }
}

In the PoC there are two strings, 'base' and 'badString'. The first one is a well-formed string that contains "Question_1?", while the latter contains "Question_1" plus an unpaired surrogate ('\uD8AF'), which cannot be encoded as valid UTF-8.

When 'stringToByteArray' is invoked on 'badString', its illegal sequence is replaced by '?', producing the same bytes, and therefore the same hash, as the 'base' string. Before being encoded, however, those strings are different, as the 'toCharArray()' and 'equals' comparisons show. This is important because it may allow certain checks to be bypassed, as 'equals' operates at the char array level.
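The collision itself can be illustrated with a much smaller sketch. Note the assumptions: SHA-256 stands in for the hash function here and is not Swiss Post's 'RecursiveHash', and the class 'CollisionSketch' and helper 'hashOfString' are hypothetical names. The only point is that any digest computed over the output of 'getBytes' inherits the non-injective encoding.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class CollisionSketch {

    // Hashes the lenient UTF-8 encoding of a string.
    // SHA-256 is an assumption used for illustration, not RecursiveHash.
    static byte[] hashOfString(String s) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            return sha256.digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String base = "Question_1?";
        String badString = "Question_1\uD8AF"; // ends in an unpaired surrogate

        // The strings differ at the char level...
        System.out.println("equals: " + base.equals(badString));

        // ...but their digests are identical after lenient encoding.
        System.out.println("same digest: "
                + Arrays.equals(hashOfString(base), hashOfString(badString)));
    }
}
```

The sketch reports that the two strings are not equal while their digests are.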

The output of the PoC is as follows

Swiss Post confirmed the bug (#YWH-PGM232-122), which has been patched in version 1.3.0.
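One way to avoid this class of bug is to encode with a 'CharsetEncoder' configured to report errors, mirroring what 'byteArrayToString' already does on the decoding side. The sketch below is an assumption about how such a fix could look, not necessarily the change Swiss Post shipped; 'StrictUtf8' and 'stringToByteArrayStrict' are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {

    // Encodes to UTF-8 and rejects input that getBytes would silently rewrite.
    static byte[] stringToByteArrayStrict(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
            // Copy only the encoded bytes; the backing array may be larger.
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(
                    "The string cannot be encoded as valid UTF-8.", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stringToByteArrayStrict("Question_1?").length);
        try {
            stringToByteArrayStrict("Question_1\uD8AF");
            System.out.println("accepted");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

REPORT is already the default action for a new encoder; setting it explicitly just documents the intent.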

Other products

Java's UTF-8 encoder replaces a malformed character sequence with a valid character, '?', which also plays an important role in URLs. This makes me think that, in addition to the potential cryptographic issues we have already seen, this vulnerable pattern might be used to bypass certain security-related logic.
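As a purely hypothetical illustration of that idea, consider a filter that rejects the URL query delimiter at the String level, followed by code that serializes the value with 'getBytes'. Nothing below is taken from a real product: 'FilterBypassSketch', 'looksSafe' and 'afterEncoding' are invented names, and the check is deliberately simplistic.

```java
import java.nio.charset.StandardCharsets;

class FilterBypassSketch {

    // Hypothetical check: reject values containing the URL query delimiter.
    static boolean looksSafe(String value) {
        return value.indexOf('?') < 0;
    }

    // What a downstream consumer sees after a lenient encode/decode round trip.
    static String afterEncoding(String value) {
        byte[] wire = value.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The unpaired surrogate sits where the attacker wants a '?'.
        String input = "report\uD84Fadmin=1";

        System.out.println("passes check: " + looksSafe(input));
        System.out.println("after encoding: " + afterEncoding(input));
    }
}
```

The input passes the check because it contains no '?', yet the bytes that leave the process decode to "report?admin=1".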

As long as the attacker can control the string, for instance after deserializing attacker-controlled JSON with Jackson, there will be a chance to abuse the replacement logic of Java's UTF-8 'getBytes'.

 ObjectMapper objectMapper = new ObjectMapper();
 String json = "{ \"badString\" : \"Question_1\\uD84F\" }";
 JsonNode jsonNode = objectMapper.readTree(json);
 String badString = objectMapper.readValue(jsonNode.get("badString").toString(), String.class);

If you happen to stumble upon a product with such a vulnerability, I would love to hear about it.
