How to convert a Unicode URL to ASCII in Java


URLs get complicated once Internationalized Domain Names (IDNs) and Unicode characters in the path are involved. You’ll often want to view and store them in ASCII, so converting them properly matters. After much searching, I couldn’t find a good way to convert an entire Unicode URL to ASCII. Most examples convert only the domain to Punycode and forget about the port, path, and query string, and most don’t handle a URL that has no scheme on the front. I wanted a flexible conversion that covers all of those components, so I came up with the code below. It probably has flaws, but it worked for the wide variety of URL formats I tested it with.
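To make the problem concrete, here is a minimal sketch of the two separate conversions involved: Punycode for the host and percent-encoding for everything after it. The class and method names (ConversionPieces, hostToAscii, restToAscii) are made up for illustration, and the sample strings are my own rather than taken from any test list.

```java
import java.net.IDN;
import java.net.URI;

// Hypothetical helper class, for illustration only
class ConversionPieces {

    // Punycode-encodes each non-ASCII label of a host name (IDNA ToASCII).
    public static String hostToAscii(String host) {
        return IDN.toASCII(host);
    }

    // Percent-encodes non-ASCII characters as UTF-8 bytes; ASCII input is returned unchanged.
    // URI.create throws IllegalArgumentException if the input cannot be parsed.
    public static String restToAscii(String pathAndQuery) {
        return URI.create(pathAndQuery).toASCIIString();
    }

    public static void main(String[] args) {
        // Escapes keep this source file ASCII-only:
        // \u00FC is u-umlaut, \u00DF is sharp s, \u00E9 is e-acute
        System.out.println(hostToAscii("b\u00FCcher.de"));        // xn--bcher-kva.de
        System.out.println(restToAscii("/stra\u00DFe?q=\u00E9")); // /stra%C3%9Fe?q=%C3%A9
    }
}
```

Neither call is enough on its own: IDN.toASCII is meant for host names, while URI.toASCIIString percent-encodes rather than Punycode-encodes. That is why a full URL has to be split into its components first.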

Here’s a good list of domains to test this with. You can add ports, Unicode paths, Unicode query parameters, and percent-encoded path characters to them for additional testing.
https://blogs.msdn.microsoft.com/shawnste/2006/09/14/idn-test-urls/

package com.company.utils;

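import com.google.common.base.CharMatcher; // Guava, needed for the ASCII check below

import java.net.*;

public class UnicodeUtil {
    public static String convertUnicodeURLToAscii(String url) throws URISyntaxException {
        if(url != null) {
            url = url.trim();
            // Handle international domains by detecting non-ascii and converting them to punycode
            // CharMatcher.ascii() needs Guava 19+; older releases expose the CharMatcher.ASCII constant instead
            boolean isAscii = CharMatcher.ascii().matchesAllOf(url);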
            if(!isAscii) {
                URI uri = new URI(url);
                boolean includeScheme = true;

                // URI needs a scheme to work properly with authority parsing
                if(uri.getScheme() == null) {
                    uri = new URI("http://" + url);
                    includeScheme = false;
                }

                String scheme = uri.getScheme() != null ? uri.getScheme() + "://" : null;
                String authority = uri.getRawAuthority() != null ? uri.getRawAuthority() : ""; // includes domain and port
                String path = uri.getRawPath() != null ? uri.getRawPath() : "";
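                String queryString = uri.getRawQuery() != null ? "?" + uri.getRawQuery() : "";
                // Keep the fragment so that "#section" anchors are not silently dropped
                String fragment = uri.getRawFragment() != null ? "#" + uri.getRawFragment() : "";

                // Must convert domain to punycode separately from the path
                url = (includeScheme ? scheme : "") + IDN.toASCII(authority) + path + queryString + fragment;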

                // Convert path from unicode to ascii encoding
                url = new URI(url).toASCIIString();
            }
        }
        return url;
    }
}
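
If you’d rather try this without pulling in Guava, the sketch below mirrors the scheme, authority, path and query handling above using only the JDK. It is an illustration, not a drop-in replacement: the class name UnicodeUrlDemo and the method toAsciiUrl are made up, the ASCII check is a plain character scan, and it uses URI.create (which throws an unchecked IllegalArgumentException) instead of the checked URISyntaxException.

```java
import java.net.IDN;
import java.net.URI;

// Hypothetical demo class, for illustration only
class UnicodeUrlDemo {

    public static String toAsciiUrl(String url) {
        if (url == null) {
            return null;
        }
        url = url.trim();
        // Plain-JDK stand-in for the Guava ASCII check
        boolean isAscii = url.chars().allMatch(c -> c < 128);
        if (isAscii) {
            return url;
        }

        URI uri = URI.create(url);
        boolean includeScheme = uri.getScheme() != null;
        if (!includeScheme) {
            // Add a temporary scheme so the authority is parsed; it is left out of the result
            uri = URI.create("http://" + url);
        }

        String scheme = includeScheme ? uri.getScheme() + "://" : "";
        String authority = uri.getRawAuthority() != null ? uri.getRawAuthority() : "";
        String path = uri.getRawPath() != null ? uri.getRawPath() : "";
        String query = uri.getRawQuery() != null ? "?" + uri.getRawQuery() : "";

        // Punycode for the authority, then percent-encoding for whatever is still non-ASCII
        String converted = scheme + IDN.toASCII(authority) + path + query;
        return URI.create(converted).toASCIIString();
    }

    public static void main(String[] args) {
        // Escapes keep this source file ASCII-only:
        // \u00FC is u-umlaut, \u00DF is sharp s, \u00E9 is e-acute
        String[] samples = {
            "http://b\u00FCcher.de:8080/stra\u00DFe?q=\u00E9",
            "b\u00FCcher.de/stra\u00DFe",
            "http://example.com/already-ascii"
        };
        for (String sample : samples) {
            System.out.println(toAsciiUrl(sample));
        }
    }
}
```

With the first sample this should print http://xn--bcher-kva.de:8080/stra%C3%9Fe?q=%C3%A9, and the scheme-less second sample should come back as xn--bcher-kva.de/stra%C3%9Fe with no scheme added. One limitation worth testing for yourself: a scheme-less input that also carries a port (host:port/path) is likely to be rejected, because java.net.URI appears to treat the text before the first colon as a scheme name.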