Beware of Java's String.getBytes


Sometimes there are subtle bugs whose origin lies in quirks of the underlying language used to build the software. This blog post describes one of those cases, so that fellow security researchers and developers who didn't know about it can become aware of this potentially vulnerable pattern. In fact, I'm pretty sure that bugs similar to the one described here affect a number of products and codebases out there.


In previous posts, I've already described some bugs in Swiss Post's future E-voting system. While reading their Crypto-Primitives specification, which among other things describes the custom hashing algorithm Swiss Post implemented, I noticed something potentially interesting.




Basically, there are 4 different types that are supported: byte arrays, strings, integers and vectors. Before being hashed, strings are converted to a byte array via the 'StringToByteArray' algorithm.


However, by comparing 'StringToByteArray' and 'ByteArrayToString' we can spot a significant difference: invalid UTF-8 sequences are only taken into account in the latter. Let's see how this was implemented in the code:

File: crypto-primitives-master/src/main/java/ch/post/it/evoting/cryptoprimitives/internal/utils/ConversionsInternal.java
079:    /**
080:     * See {@link ch.post.it.evoting.cryptoprimitives.utils.Conversions#stringToByteArray}
081:     */
082:    public static byte[] stringToByteArray(final String s) {
083:        checkNotNull(s);
084: 
085:        // Corresponds to UTF-8(S)
086:        return s.getBytes(StandardCharsets.UTF_8);
087:    }
088: 
089:    /**
090:     * See {@link ch.post.it.evoting.cryptoprimitives.utils.Conversions#byteArrayToString}
091:     */
092:    public static String byteArrayToString(final byte[] b) {
093:        checkNotNull(b);
094:        checkArgument(b.length > 0, "The length of the byte array must be strictly positive.");
095: 
096:        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
097:        // The try-catch clause implements the pseudo-code's if statement
098:        try {
099:            // Corresponds to UTF-8^-1(B)
100:            return decoder.decode(ByteBuffer.wrap(b)).toString();
101:        } catch (CharacterCodingException ex) {
102:            throw new IllegalArgumentException("The byte array does not correspond to a valid sequence of UTF-8 encoding.");
103:        }
104:    }

As expected, at line 100, 'byteArrayToString' tries to decode the input in order to detect invalid UTF-8 sequences. On the other hand, at line 86, 'stringToByteArray' directly uses 'getBytes'.

Internally, 'getBytes' will encode the string before returning the bytes. Let's see Java's implementation:

File: java/lang/String.java
    /**
     * Encodes this {@code String} into a sequence of bytes using the given
     * {@linkplain java.nio.charset.Charset charset}, storing the result into a
     * new byte array.
     *
     * <p> This method always replaces malformed-input and unmappable-character
     * sequences with this charset's default replacement byte array.  The
     * {@link java.nio.charset.CharsetEncoder} class should be used when more
     * control over the encoding process is required.
     *
     * @param  charset
     *         The {@linkplain java.nio.charset.Charset} to be used to encode
     *         the {@code String}
     *
     * @return  The resultant byte array
     *
     * @since  1.6
     */
    public byte[] getBytes(Charset charset) {
        if (charset == null) throw new NullPointerException();
        return encode(charset, coder(), value);
    }


As the description of the method clearly states, any invalid character sequence will be replaced (onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE)) with the charset's default "replacement byte array"; it doesn't trigger any exception.

In Java, the replacement byte array for a default charset provider is one of the Unicode specials, the replacement character (U+FFFD). However, for some reason Java's UTF-8 CharsetEncoder uses '?' (0x3F, 63 decimal) instead.

File: 'sun.nio.cs.UTF_8'
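
To make the difference concrete, here is a minimal, self-contained sketch of mine (not part of the Swiss Post or JDK code) that contrasts 'getBytes' with a 'CharsetEncoder' left at its default REPORT error action:

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class GetBytesVsEncoder {
    public static void main(String[] args) {
        // "Question_1" followed by an unpaired surrogate, which cannot be encoded as valid UTF-8
        String bad = "Question_1\uD8AF";

        // String.getBytes() silently replaces the surrogate with '?' (0x3F)
        byte[] silent = bad.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(silent, StandardCharsets.UTF_8)); // prints "Question_1?"

        // A CharsetEncoder left at its default REPORT action surfaces the problem instead
        try {
            StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(bad));
            System.out.println("encoded without errors"); // not reached
        } catch (CharacterCodingException e) {
            System.out.println("Encoding rejected: " + e);
        }
    }
}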

This behavior does not seem to comply with Unicode's security recommendations.


It also differs from other languages such as C#, which follows the best practice: 'Encoding.UTF8.GetBytes' replaces the malformed character sequence with the UTF-8 encoding (0xEF 0xBF 0xBD) of the replacement character.
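
Java can be made to follow the same approach by overriding the encoder's replacement bytes. A minimal sketch of mine, assuming that replacing with U+FFFD is acceptable for the application:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReplaceWithFFFD {
    public static void main(String[] args) throws CharacterCodingException {
        // Replace malformed/unmappable input with U+FFFD encoded as UTF-8 (0xEF 0xBF 0xBD)
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { (byte) 0xEF, (byte) 0xBF, (byte) 0xBD });

        ByteBuffer out = encoder.encode(CharBuffer.wrap("Question_1\uD8AF"));
        while (out.hasRemaining()) {
            System.out.print(String.format("\\x%02X", out.get() & 0xFF)); // ends in \xEF\xBF\xBD
        }
        System.out.println();
    }
}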

Impact

This specific pattern may result in different attack scenarios, depending on the logic in which it has been identified.

Swiss Post E-Voting 

In this specific case, what we end up with is a hash collision vulnerability.
 
This silly PoC shows how two different strings are not injectively encoded, thus leading to a potential hash collision according to the 'RecursiveHash' algorithm.


import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.stream.Collectors;

public class poc {

    // Encoding with a plain CharsetEncoder (whose default error action is REPORT)
    // throws a CharacterCodingException on the unpaired surrogate.
    public static void encode(String string) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

        System.out.println("\nEncoding");

        ByteBuffer encoded = encoder.encode(CharBuffer.wrap(string.toCharArray()));

        System.out.println("\nEncoding end");
    }

    // Hex-dump a byte array (masking with 0xFF to avoid sign extension).
    public static void printhex(byte[] ars) {

        for (int i = 0; i < ars.length; i++) {
            System.out.print(String.format("\\x%04X", ars[i] & 0xFF));
        }

        System.out.print("\n");
    }

    public static byte[] stringToByteArray(final String s) {

        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Build a String from the custom '+UXXXX' notation, so that lone surrogates
    // (which cannot be encoded as valid UTF-8) can be embedded in it.
    public static String buildCustomString(String dat) {

        String badString = Arrays.stream(dat.split("\\+U"))
                .filter(s -> ! s.isEmpty()) 
                .map(s -> {
                    try {
                        return Integer.parseInt(s, 16);
                    } catch (NumberFormatException e) { 
                        System.out.println("Error parsing int");
                    }
                    return null; 
                })
                .map(i -> Character.toString(i)) 
                .collect(Collectors.joining());

        return badString;
    }

    // Build the string from the custom notation and dump both its chars
    // and the result of stringToByteArray().
    public static String DumpInfo(String inf) {
        String badString;

        System.out.println("=>> Original string (custom format): \"" + inf+"\""); 
        System.out.println("[+] Building String");
        badString = buildCustomString(inf);

        System.out.println("[+] String badString => \""+ badString+"\"");
        System.out.print("[+] badString.toCharArray() =>\t\t"); 
        char[] hash3 = badString.toCharArray();

        for(int i=0;i<hash3.length;i++){  
            System.out.print(String.format("\\x%04X",(short)hash3[i]));  
        }  

            System.out.print("\n");

        System.out.print("[+] stringToByteArray(badString) =>     "); 
        printhex(stringToByteArray(badString));

        return badString;
    }

    public static void main(String[] args) {

        String badString1;

        String base = "Question_1?";

        // +UD8AF is illegal so it will be mapped to '?' after UTF8 encoding.
        String data1 = "+U0051+U0075+U0065+U0073+U0074+U0069+U006f+U006E+U005f+U0031+UD8AF";

        badString1 = DumpInfo(data1);

        System.out.println("[+] badString after UTF-8 encoding: \""+ badString1 + "\"");
        System.out.print("[+] stringToByteArray(base) =>   \t"); 
        printhex(stringToByteArray(base));

        if(!badString1.equals(base)){
            System.out.println("[+] 'badString' and 'base' are not equal");
        }

        try {
            //This will trigger an exception
            encode(badString1);

        }
        catch (Exception e) {
            e.printStackTrace();
        }

        System.out.print("\n");

    }

}

In the PoC there are two strings, 'base' and 'badString'. The first one is a valid UTF-8 string that contains "Question_1?", while the latter contains "Question_1" plus an unpaired surrogate ('\uD8AF'), which cannot be encoded as a valid UTF-8 sequence.

When 'stringToByteArray' is invoked on 'badString', its illegal sequence is substituted with '?', resulting in the same hash that would be generated for the 'base' string. However, before being encoded, those strings are different, as can be checked in the 'toCharArray()' and 'equals' comparisons. This is important for potentially bypassing certain checks, as 'equals' operates at the char array level.
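
The collision can also be reproduced without the full PoC machinery. The sketch below is mine and uses SHA-256 merely as a stand-in for the hash applied to the resulting byte arrays (it is not the actual 'RecursiveHash' implementation): since 'getBytes' has already collapsed both strings to the same bytes, any hash computed over them collides.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class CollisionSketch {
    public static void main(String[] args) throws Exception {
        String base = "Question_1?";
        String bad = "Question_1\uD8AF"; // unpaired surrogate, silently turned into '?'

        System.out.println(base.equals(bad)); // false: the strings differ at the char level

        // Identical bytes after getBytes(), hence identical digests over those bytes
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] h1 = md.digest(base.getBytes(StandardCharsets.UTF_8));
        byte[] h2 = md.digest(bad.getBytes(StandardCharsets.UTF_8));
        System.out.println(Arrays.equals(h1, h2)); // true
    }
}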

The output of the PoC is as follows:

Swiss Post confirmed the bug (#YWH-PGM232-122), which has been patched in version 1.3.0.

Other products

The fact that Java's UTF-8 encoder replaces a malformed character sequence with a valid one, '?', which in turn plays an important role in URLs, makes me think that, in addition to the potential cryptographic issues we have already seen, this vulnerable pattern might be used to bypass certain security-related logic.

As long as the attacker can control the string, for instance after deserializing attacker-controlled JSON with Jackson, there is a chance to abuse Java's UTF-8 'getBytes' replacement logic.

ObjectMapper objectMapper = new ObjectMapper();
String json = "{ \"badString\" : \"Question_1\\uD84F\" }";
JsonNode jsonNode = objectMapper.readTree(json);
String badString = objectMapper.readValue(jsonNode.get("badString").toString(), String.class);
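
Continuing the snippet above (the comparison below is a hypothetical downstream check, added only for illustration, and assumes java.nio.charset.StandardCharsets is imported): any validation performed at the char level still sees the lone surrogate, while the bytes that later reach hashing, logging or URL building already contain the '?' that the check never saw.

// Hypothetical downstream logic, for illustration only (not from any real product)
System.out.println(badString.equals("Question_1?"));            // false: no '?' at the char level
System.out.println(new String(badString.getBytes(StandardCharsets.UTF_8),
        StandardCharsets.UTF_8));                                // prints "Question_1?"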

If you happen to stumble upon a product with such a vulnerability, I would love to hear about it.