How to convert Unicode URL to ASCII in Java

URLs can get quite complex once you add Internationalized Domain Names (IDNs) and Unicode characters to the path. Often you’ll want to view and store them in ASCII, so proper conversion becomes important. After much searching, I couldn’t find a great way to convert an entire Unicode URL to ASCII. Most examples just convert the domain to punycode and forget about the port, path, and query string. Most also don’t cover the case where the provided URL has no scheme on the front. I wanted a flexible conversion that handles all of those URL components, so I came up with some working code. It probably has flaws, but it works for most URL formats you will encounter and for the large variety I tested it with.
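
To make the gap concrete, here is a minimal sketch of the two JDK building blocks involved, shown separately. The class and method names (`AsciiBuildingBlocks`, `hostToAscii`, `percentEncode`) and the sample inputs are made up for illustration. `IDN.toASCII` only knows about host names, and `URI.toASCIIString` only percent-encodes and never produces punycode, so neither one covers a whole URL by itself.

```java
import java.net.IDN;
import java.net.URI;

public class AsciiBuildingBlocks {
    // Punycode-encodes each non-ASCII label of a host name.
    static String hostToAscii(String host) {
        return IDN.toASCII(host);
    }

    // Percent-encodes non-ASCII characters as UTF-8 octets. No punycode is involved.
    static String percentEncode(String url) {
        return URI.create(url).toASCIIString();
    }

    public static void main(String[] args) {
        // "\u00fc" is u-umlaut and "\u00e4" is a-umlaut; escapes keep this file ASCII-only.
        System.out.println(hostToAscii("b\u00fccher.de"));
        // prints xn--bcher-kva.de
        System.out.println(percentEncode("http://example.com/p\u00e4th?q=\u00fc"));
        // prints http://example.com/p%C3%A4th?q=%C3%BC
    }
}
```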

Here’s a good list of domains to test this with. You can add ports, Unicode paths, Unicode query parameters, and percent-encoded path characters to these for additional testing.



import com.google.common.base.CharMatcher; // Guava

import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

public class UnicodeUtil {
    public static String convertUnicodeURLToAscii(String url) throws URISyntaxException {
        if (url != null) {
            url = url.trim();
            // Handle international domains by detecting non-ASCII and converting them to punycode
            boolean isAscii = CharMatcher.ascii().matchesAllOf(url);
            if (!isAscii) {
                URI uri = new URI(url);
                boolean includeScheme = true;

                // URI needs a scheme to work properly with authority parsing
                if (uri.getScheme() == null) {
                    uri = new URI("http://" + url);
                    includeScheme = false;
                }

                String scheme = uri.getScheme() != null ? uri.getScheme() + "://" : "";
                String authority = uri.getRawAuthority() != null ? uri.getRawAuthority() : ""; // includes domain and port
                String path = uri.getRawPath() != null ? uri.getRawPath() : "";
                String queryString = uri.getRawQuery() != null ? "?" + uri.getRawQuery() : "";

                // Must convert domain to punycode separately from the path
                url = (includeScheme ? scheme : "") + IDN.toASCII(authority) + path + queryString;

                // Convert path and query from Unicode to percent-encoded ASCII
                url = new URI(url).toASCIIString();
            }
        }
        return url;
    }
}
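
If you want to experiment without pulling in Guava, the following is a rough, JDK-only sketch of the same two-step idea (punycode the authority, then percent-encode the rest). It is not a drop-in replacement for the method above. The names `UnicodeUrlSketch` and `toAsciiUrl`, the `"://"` check used to detect a scheme, the use of `URI.create` (which throws an unchecked `IllegalArgumentException`), and the sample URLs are all my own assumptions for illustration.

```java
import java.net.IDN;
import java.net.URI;

public class UnicodeUrlSketch {
    // True when every char is in the 7-bit ASCII range.
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    static String toAsciiUrl(String url) {
        if (url == null) {
            return null;
        }
        url = url.trim();
        if (isAscii(url)) {
            return url; // nothing to convert
        }

        // Assumption: the input has a scheme only if it contains "://".
        boolean hasScheme = url.contains("://");
        // Add a temporary scheme so the authority is parsed as an authority.
        URI uri = URI.create(hasScheme ? url : "http://" + url);

        String scheme = hasScheme ? uri.getScheme() + "://" : "";
        String authority = uri.getRawAuthority() != null ? uri.getRawAuthority() : "";
        String path = uri.getRawPath() != null ? uri.getRawPath() : "";
        String query = uri.getRawQuery() != null ? "?" + uri.getRawQuery() : "";

        // Step 1: punycode the authority (host plus optional port).
        String combined = scheme + IDN.toASCII(authority) + path + query;
        // Step 2: percent-encode whatever non-ASCII is left in the path and query.
        return URI.create(combined).toASCIIString();
    }

    public static void main(String[] args) {
        // Hypothetical inputs; "\u00fc" is u-umlaut and "\u00e4" is a-umlaut.
        String[] samples = {
            "http://b\u00fccher.de:8080/p\u00e4th?q=\u00fc", // port, Unicode path and query
            "http://b\u00fccher.de/a%20b",                   // already-encoded path character
            "b\u00fccher.de/p\u00e4th"                       // no scheme
        };
        for (String sample : samples) {
            System.out.println(toAsciiUrl(sample));
        }
    }
}
```

Existing percent-escapes such as `%20` pass through unchanged because the raw (undecoded) path and query are reused, and `toASCIIString` only touches non-ASCII characters.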