Guidelines for Internationalized Domain Names
1. Introduction
This document presents a solution for enabling internationalized domain names and provides guidelines for choosing the local language character set to be used in domain names.
With the ongoing development of international standards, it is now possible to deploy content in local languages on the Internet. However, it is still not possible to access this content easily without knowledge of Latin script and English conventions, because the Domain Name System (DNS) is in Latin script and uses English conventions and abbreviations. One of the main reasons for this limitation is that host names in the DNS are restricted to a subset of the 7-bit ASCII standard (letters, digits and the hyphen); it is therefore not possible to directly encode the many languages whose scripts are covered only by the Unicode standard. A domain name which is capable of encoding languages written in scripts other than Latin is called an internationalized domain name [1].
Section 2 describes the process of enabling internationalized domain names for local languages. Section 3 presents guidelines for selecting an appropriate character set for a local language. Section 4 lists the input required to enable any language for internationalized domain names. An Excel workbook, IDNInfo.xls, accompanies this document for specifying language-specific information.
2. Internationalized Domain Names in Applications (IDNA)
Internationalized Domain Names in Applications (IDNA) is a proposed solution for internationalized domain names, specified in RFCs 3454 [2], 3490 [3], 3491 [4] and 3492 [5], collectively known as the IDNA standard. IDNA adds a layer between the DNS and the client at the application end, e.g., in the form of a plug-in in a web browser. This layer takes the domain name in a local language as input, normalizes it through the Nameprep process [4], and converts the non-ASCII string to a DNS-compatible ASCII Compatible Encoding (ACE) based on Punycode [5]. The ToASCII algorithm, described in [3], performs this conversion. The DNS protocol continues to resolve the ASCII-based domain name and returns the IP address of the host. No changes to the DNS infrastructure are required [1]. This is illustrated in Figure 1 below.
The details of the solution for internationalized domain names are described in the following sections.
2.1. Label Separation
A domain name consists of several labels which are separated by a separator. For example, the domain name "www.nu.edu.pk" has four labels: "www", "nu", "edu" and "pk". IDNA processing is done on individual labels, so the first step is to separate the domain name into labels. In the DNS, a full stop or period (".") is used as the label separator. For internationalized domain names, the IDNA standard recommends that the following Unicode characters be recognized as separators: U+002E (full stop), U+3002 (ideographic full stop), U+FF0E (fullwidth full stop) and U+FF61 (halfwidth ideographic full stop) [3]. However, some languages may prefer their own separation markers in addition to these. For example, Urdu has a sentence boundary marker "۔" (U+06D4) which can also be added to the label separator list [6]. The final separator list for Urdu will therefore be: U+002E (full stop), U+3002 (ideographic full stop), U+FF0E (fullwidth full stop), U+FF61 (halfwidth ideographic full stop) and U+06D4. These language-specific separators will be taken from IDNInfo.xls, Sheet "Other URL Specifications".
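The label separation step can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `IDNA_SEPARATORS`, `URDU_EXTRA_SEPARATORS` and `split_labels` are assumptions introduced here, and in practice the extra separators would be read from IDNInfo.xls rather than hard-coded.

```python
import re

# Separators recommended by the IDNA standard [3].
IDNA_SEPARATORS = "\u002E\u3002\uFF0E\uFF61"
# Language-specific addition for Urdu (U+06D4), as discussed above.
URDU_EXTRA_SEPARATORS = "\u06D4"


def split_labels(domain, extra_separators=""):
    """Split a domain name into labels on any recognized separator."""
    separators = IDNA_SEPARATORS + extra_separators
    # Build a character class so that each separator matches literally.
    pattern = "[" + re.escape(separators) + "]"
    return re.split(pattern, domain)


# A name that mixes the Urdu, ideographic and ASCII separators.
labels = split_labels("www\u06D4nu\u3002edu.pk", URDU_EXTRA_SEPARATORS)
print(labels)
```

Keeping the language-specific separators in a separate argument means the default behaviour stays limited to the four separators named in the IDNA standard.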
2.2. Validity Check
Each label will be checked for validity. A label should consist only of the characters allowed for the language to which the domain name belongs. These are the characters listed in IDNInfo.xls, Sheet "Character Set", with the option set to anything other than "NO" in the "Include in Domain Name" column. An invalid domain label will cause the process to fail, and no further processing will be done.
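A minimal sketch of this inclusion-based check is given below. The set `SAMPLE_ALLOWED` is a hypothetical stand-in for the rows of the "Character Set" sheet whose option is not "NO"; it contains only three Arabic-script letters for illustration and is not an actual language table. The function names are likewise assumptions.

```python
# Hypothetical allowed set; a real one would be loaded from IDNInfo.xls.
SAMPLE_ALLOWED = {"\u0627", "\u0628", "\u067E"}


def is_valid_label(label, allowed):
    """Return True if the label is non-empty and uses only allowed characters."""
    return bool(label) and all(ch in allowed for ch in label)


def check_labels(labels, allowed):
    """Raise ValueError on the first invalid label, stopping all processing."""
    for label in labels:
        if not is_valid_label(label, allowed):
            raise ValueError("invalid domain label: %r" % label)


check_labels(["\u0627\u0628", "\u067E\u0627"], SAMPLE_ALLOWED)  # passes silently
```

Raising an exception mirrors the rule that an invalid label ends the process with no further processing.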
2.3. Normalization
The following normalization steps will be performed.
I. The Nameprep procedure (RFC 3491) and ICANN have proposed some guidelines for selecting the character set allowed in labels [7]. These guidelines are summarized in Section 3. It is anticipated that some languages may need characters which are not allowed by these standards. For example, the cursive nature of Urdu sometimes requires the space character in a label for proper display [6]. The space is not a valid character according to Nameprep. Such characters should be marked as "YES (but not conformant with IDNA standard)" in the "Include in domain name" column of IDNInfo.xls, Sheet "Character Set". Such characters will be allowed to be entered as valid characters but will be removed from the label before further processing.
II. A language can have characters which are optional, i.e., text with those characters is considered roughly equivalent to text without them. For example, Arabic script has diacritics that mark short vowels. It is general practice to write text without these diacritics, although a user may choose to use them. The IDNA standard does not specify any special treatment for these characters. However, they can cause inconvenience for users or may be a source of fake domain names. For example, a person entering a domain name in Arabic script will expect the same behavior with or without these diacritics, which will not be the case if the diacritics are treated as normal characters. It has therefore been decided that such optional characters should be removed before further processing [6]. Such characters should be marked as "YES (but optional)" in the "Include in domain name" column of IDNInfo.xls, Sheet "Character Set". Such characters will be allowed to be entered as valid characters but will be removed from the label before further processing.
III. A script can contain characters which are visually similar but have separate code points in Unicode. Such characters may have the same linguistic meaning, or users of a particular language written in that script may confuse them as the same (they may be treated as different in some other language). For example, ١ (U+0661) and ۱ (U+06F1) both represent the digit one in Arabic script. The first representation is used with the Arabic language, whereas the second is used with other Arabic-script-based languages such as Persian and Urdu. Attackers can take advantage of such characters to create fake web sites [1]. To avoid this danger, one character from the group of similar-looking characters will be selected as the base character, and the others will be replaced with that base character before further processing. A character which needs to be replaced should be marked as "YES (variant of base character)" in the "Include in domain name" column of IDNInfo.xls, Sheet "Character Set", and the Unicode code point of the base character should be entered in the "Variant of Base Character" column. For example, while defining the character set for Urdu, U+0661 will be selected as the base character and entered in the "Variant of Base Character" column corresponding to U+06F1.
IV. Each label will be normalized according to Unicode normalization form NFKC [8], as proposed in the Nameprep process. The Unicode standard defines NFKC as Compatibility Decomposition followed by Canonical Composition. In NFKC, many formatting distinctions are removed. For example, "2⁵" (U+0032, U+2075) is normalized to "2" (U+0032) followed by "5" (U+0035); the superscript formatting is removed from the "5". Normalization support is available in various libraries, for example in Perl, in Java and in the open source ICU library.
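Steps I to IV can be sketched together using Python's standard `unicodedata` module. The two tables below are hypothetical stand-ins for the corresponding entries of the "Character Set" sheet: the space and one Arabic short-vowel diacritic (U+064E) are chosen only to illustrate steps I and II, and the variant mapping repeats the Urdu example from step III. The function name `normalize_label` is an assumption.

```python
import unicodedata

# Characters accepted on input but removed before further processing.
REMOVE_BEFORE_PROCESSING = {
    "\u0020",  # step I: "YES (but not conformant with IDNA standard)"
    "\u064E",  # step II: "YES (but optional)", a short-vowel diacritic
}

# Step III: variant character -> base character.
VARIANT_TO_BASE = {
    "\u06F1": "\u0661",
}


def normalize_label(label):
    """Apply removal, variant mapping and NFKC normalization, in that order."""
    kept = (ch for ch in label if ch not in REMOVE_BEFORE_PROCESSING)
    mapped = "".join(VARIANT_TO_BASE.get(ch, ch) for ch in kept)
    return unicodedata.normalize("NFKC", mapped)  # step IV


print(normalize_label("2\u2075"))  # superscript five loses its formatting
```

Note that full Nameprep involves more than NFKC (for example, case folding), so this sketch covers only the steps described above.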
2.4. Conversion to ASCII Equivalent
After normalizing the Unicode labels, each label needs to be converted into ASCII encoding. There are two ways to convert local language labels to ASCII equivalent labels. One is to use mapping tables which contain predefined local language labels along with their ASCII equivalents. The second is to use the ToASCII algorithm to convert non-ASCII strings to ASCII Compatible Encoding (ACE) at runtime.
2.4.1 Conversion using Mapping Tables
Some labels are a standard part of the DNS, such as "www", which is usually the first label of a domain name, and the last one or two labels of a domain name, known as the generic Top Level Domain (gTLD), e.g., ".com" and ".org", and the country code Top Level Domain (ccTLD), e.g., ".pk" for Pakistan as in www.nu.edu.pk. For the conversion of such labels, the first method will be used. A mapping table will be provided with language-specific translations of "www", gTLDs and ccTLDs. This will be needed until these labels are also internationalized by ICANN (work on local language gTLDs is already under way).
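A mapping table of this kind amounts to a simple lookup, as in the sketch below. The keys are placeholder strings standing in for the translations that each language team supplies in IDNInfo.xls; they are not real translations, and the names `SAMPLE_MAPPING` and `lookup_fixed_label` are assumptions made for this illustration.

```python
# Placeholder keys: real keys would be the local language translations.
SAMPLE_MAPPING = {
    "LOCAL-WWW": "www",
    "LOCAL-COM": "com",
    "LOCAL-PK": "pk",
}


def lookup_fixed_label(label, mapping):
    """Return the ASCII equivalent of a predefined label, or None if absent."""
    return mapping.get(label)


print(lookup_fixed_label("LOCAL-PK", SAMPLE_MAPPING))
```

Returning `None` for an unknown label lets the caller fall back to the runtime conversion used for all other labels.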
2.4.2 Conversion using IDNA Procedure
The remaining labels of the domain name will be passed through an IDNA-conformant non-ASCII to ACE conversion process. The normalized Unicode label will be converted to an ACE string using the ToASCII() algorithm [3], which relies on the Punycode encoding described in [5].
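As an illustration, Python's standard library ships a `punycode` codec and an `idna` codec, the latter based on RFC 3490. The sketch below uses them on the commonly cited label "bücher", which is an example chosen here and does not come from this document; `to_ascii_label` is a hypothetical helper name.

```python
def to_ascii_label(label):
    """Convert one label to its ACE form with Python's built-in IDNA codec."""
    return label.encode("idna").decode("ascii")


# Raw Punycode, without the "xn--" ACE prefix.
print("b\u00fccher".encode("punycode"))  # b'bcher-kva'

# Full ACE label, with the prefix added.
print(to_ascii_label("b\u00fccher"))     # xn--bcher-kva

# Labels that are already ASCII pass through unchanged.
print(to_ascii_label("www"))             # www
```

The built-in codec applies its own Nameprep step, so it does not reflect the language-specific removals and variant mappings of Section 2.3; those would have to be applied to the label beforehand.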
2.5. Request to DNS Server
After each label has been converted to ASCII, the labels will be concatenated to form a domain name which is compatible with the current Domain Name System. This address will then be forwarded as an HTTP request.
When the response returns from the HTTP server, the original domain name (as it was before processing) will be displayed in the address bar, so that the variant mapping and normalization do not affect the visual representation of the domain name.
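The concatenation step, and the idea of keeping the typed name for display, can be sketched as follows. This is a simplified illustration under stated assumptions: it uses Python's built-in `idna` codec for the per-label conversion, omits the language-specific validity check, normalization and mapping tables, and the names `PreparedName` and `prepare_request` are hypothetical.

```python
import re
from typing import NamedTuple

# The four separators recommended by the IDNA standard.
SEPARATORS = "[\u002E\u3002\uFF0E\uFF61]"


class PreparedName(NamedTuple):
    display: str   # what the user typed, shown unchanged in the address bar
    dns_name: str  # ASCII form that is actually sent for resolution


def prepare_request(typed):
    """Convert each label to ACE and join the results with ASCII full stops."""
    labels = re.split(SEPARATORS, typed)
    ace_labels = [label.encode("idna").decode("ascii") for label in labels]
    return PreparedName(display=typed, dns_name=".".join(ace_labels))


prepared = prepare_request("www\u3002b\u00fccher\uFF0Ede")
print(prepared.dns_name)
```

Carrying both forms in one value makes it hard to accidentally display the ACE string in place of the name the user entered.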
3. Guidelines for Character Set Selection
The following points should be kept in mind while deciding on a local language specific character set. These guidelines are based on the ICANN Guidelines [7].
1. The character set should comply with the Nameprep requirements described in RFC 3491. Nameprep [RFC 3491, Section 5] specifies a prohibition table for inappropriate code points. It is recommended that no code point listed in the table be included. If it is necessary to include a code point from that table, provide clear reasoning for the decision in the remarks. This table contains space characters, control characters, private use characters, non-character code points, etc.
2. The IDN application will use the "inclusion based" approach. Only the characters which are explicitly allowed for a language will be permitted in domain names of that language. Any character which is not in the character list will generate an error.
3. Each label of the domain name will be associated with a single script, as defined by the block division of the Unicode code chart. Exceptions to this guideline are languages for which there are established orthographic rules and conventions that require the use of multiple scripts. In such cases, characters from other scripts should be explicitly listed. If a character is not already listed in IDNInfo.xls, a new row will be added for it at the end of the sheet "Character Set". All six columns should be filled in for that character.
4. Inclusion of visually confusable characters in the character set can be a security risk, as mentioned in Section 2.3. A group of such characters should be added as variants of a single base character.
5. Code points which are not permissible are as follows.
I. Geometrical and line-drawing symbols such as those in the Unicode Box Drawing and Box Elements blocks
II. Symbols and icons that are neither alphanumeric nor ideographic language characters, such as typographic and pictographic dingbats
III. Characters with well-established functions as protocol elements
IV. Punctuation marks used solely to indicate the structure of sentences
V. Punctuation marks that are used within words, with the possible exception of those that are not excluded by any of the preceding points, are essential to the language represented by the IDN, and are associated with explicit prescriptive rules about the context in which they may be used.
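Guideline 1 can be checked mechanically. The sketch below uses Python's standard `stringprep` module, which exposes the Stringprep tables of RFC 3454, and tests candidates against the tables that RFC 3491, Section 5 lists as prohibited (C.1.2, C.2.2 and C.3 to C.9). The helper names `is_prohibited` and `prohibited_in` are assumptions, and the sample characters are chosen only for illustration.

```python
import stringprep

# Membership tests for the tables prohibited by Nameprep (RFC 3491, Section 5).
_NAMEPREP_PROHIBITED = (
    stringprep.in_table_c12,  # non-ASCII space characters
    stringprep.in_table_c22,  # non-ASCII control characters
    stringprep.in_table_c3,   # private use
    stringprep.in_table_c4,   # non-character code points
    stringprep.in_table_c5,   # surrogate codes
    stringprep.in_table_c6,   # inappropriate for plain text
    stringprep.in_table_c7,   # inappropriate for canonical representation
    stringprep.in_table_c8,   # change display properties or deprecated
    stringprep.in_table_c9,   # tagging characters
)


def is_prohibited(ch):
    """Return True if the character appears in any prohibited table."""
    return any(check(ch) for check in _NAMEPREP_PROHIBITED)


def prohibited_in(candidates):
    """List the code points of a candidate character set that are prohibited."""
    return ["U+%04X" % ord(ch) for ch in candidates if is_prohibited(ch)]


# Arabic letter alef, no-break space, and a private use code point.
print(prohibited_in("\u0627\u00A0\uE000"))
```

A report like this could be run over a proposed character set before it is entered in the sheet, so that any prohibited code point is either dropped or justified in the remarks.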
4. Required Input
As described in Section 2, some language-specific information is required to enable domain names for a language. The following input is required from each country.
1. Please specify any additional domain label separator(s) for your language, if required, in IDNInfo.xls, Sheet "Other URL Specifications" (see the discussion in Section 2.1).
2. Please specify the translation of "www" for your language in IDNInfo.xls, Sheet "Other URL Specifications" (see the discussion in Section 2.4.1).
3. As described in Section 2.4.1, translations of Top Level Domains (TLDs) are required for the mapping tables. Two sheets in IDNInfo.xls, namely gTLDs and ccTLDs, list the gTLDs and ccTLDs along with a brief description to help with correct translation. Provide translations for these TLDs in the next column. Make sure that the characters used in these translations are valid for domain names, i.e., that you have selected an option other than "NO" in "Character Set" for those characters (see point 4).
4. The sheet "Character Set" in IDNInfo.xls provides a list of the characters of your language. The first three columns give the character glyph, the Unicode code point and the character description according to the Unicode chart for each character. The next five columns are for language feedback.
a. The fourth column offers multiple options for the type of character. Choose one of them. If a character does not fall into any of the listed categories, choose "Other". These types are based on the character grouping given in the Unicode charts. The list is a superset for all languages taken into consideration, so not every type may be relevant to your language.
b. The fifth column gives options for selecting a particular character for domain names. Fill in this column as explained in Section 2.3, steps I, II and III.
c. If "YES (variant of base character)" is selected in the fifth column, also enter the Unicode code point of the corresponding base character in the sixth column. Otherwise leave it empty.
d. The reason for each choice MUST be given in the "Reason for Choice" column.
e. Any additional remarks may be given in the "Remarks" column.
5. References
[1] D. Butt, "Internationalized Domain Names," http://www.apdip.net/apdipenote/9.pdf, APDIP.net, 2006.
[2] P. Hoffman and M. Blanchet, "Preparation of Internationalized Strings ("stringprep")," http://www.rfc-editor.org/rfc/rfc3454.txt, 2002.
[3] P. Faltstrom, P. Hoffman, and A. Costello, "Internationalizing Domain Names in Applications (IDNA)," http://www.rfc-editor.org/rfc/rfc3490.txt, 2003.
[4] P. Hoffman and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)," http://www.rfc-editor.org/rfc/rfc3491.txt, 2003.
[5] A. Costello, "Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)," http://www.rfc-editor.org/rfc/rfc3492.txt, 2003.
[6] S. Hussain and N. Durrani, "Urdu Domain Names," IEEE Multitopic Conference INMIC '06, 2006.
[7] "Guidelines for the Implementation of Internationalized Domain Names," Version 2.2 draft 0.03, http://www.icann.org/topics/idn/idn-guidelines-26apr07.pdf, 2007.
[8] M. Davis and M. Durst, "Unicode Normalization Forms," http://www.unicode.org/reports/tr15/, 2005.