The Unicode Standard for Scripts of India (TUSSI)

A request to make the TUSSI specification
compatible with the ISCII Standard, and beyond.

By Andy White

It has been suggested that I rewrite this document so that it more clearly highlights differences from the current TUSSI spec. I have tried to do this for TUS3.0, which says very little on the subject. As for TUS4.0, it is yet to be published, so I cannot currently say where differences may lie.


Introduction

The Unicode standard's encoding model for the nine main scripts of India is based on the Indian Standard Code for Information Interchange (ISCII), but the two standards are currently incompatible insofar as a lossless round-trip conversion from ISCII to Unicode and back cannot be guaranteed.

Certain combining secondary forms of consonant can be explicitly encoded in ISCII. These forms cannot be explicitly encoded using the currently specified Unicode Indic encoding model.

Some forms of letters are not specified in ISCII at all. Such letters have been inconsistently encoded by various implementers of ISCII. Specifying the encoding of such letters will take the Unicode standard beyond that of the now outdated ISCII.

The Unicode standard is continually updated and improved. It cannot be expected to remain completely compatible with outdated and incompletely specified encoding systems such as ISCII. This proposal suggests that the Unicode Standard first move into line with ISCII and then move on to deal with other issues that ISCII overlooked.

 

The Unicode Standard for the Scripts of India (TUSSI)
(AKA "Unicode encoding model for scripts covered by ISCII-1988")

The TUSSI specification proposed in this document is for the following Indic scripts:

  • Bengali
  • Devanagari
  • Gujarati
  • Gurmukhi
  • Kannada
  • Malayalam
  • Oriya
  • Tamil
  • Telugu

The Unicode standard mostly encodes characters for these scripts in the same relative positions as those coded in positions A0-F4 of the ISCII-1988 standard.
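One visible consequence of this shared layout is that the nine Unicode blocks are parallel to one another: a given letter or sign sits at the same offset in each block. The following is a minimal sketch of that parallelism using Python's standard unicodedata module. It inspects only the Unicode side and does not map any ISCII byte values; the names BLOCK_START, KA_OFFSET, VIRAMA_OFFSET and char_at are made up for this illustration.

```python
import unicodedata

# Start of the Unicode block for each of the nine scripts.
# Each block is 0x80 code points wide.
BLOCK_START = {
    "DEVANAGARI": 0x0900, "BENGALI": 0x0980, "GURMUKHI": 0x0A00,
    "GUJARATI": 0x0A80, "ORIYA": 0x0B00, "TAMIL": 0x0B80,
    "TELUGU": 0x0C00, "KANNADA": 0x0C80, "MALAYALAM": 0x0D00,
}

# Two offsets picked for illustration: the letter KA and the Virama sign.
KA_OFFSET = 0x15
VIRAMA_OFFSET = 0x4D


def char_at(script, offset):
    """Return the character at the given offset inside a script's block."""
    return chr(BLOCK_START[script] + offset)


# Look up the character names found at the same two offsets in every block.
layout = {
    script: (
        unicodedata.name(char_at(script, KA_OFFSET)),
        unicodedata.name(char_at(script, VIRAMA_OFFSET)),
    )
    for script in BLOCK_START
}

for script, names in layout.items():
    print(script, names)
```

Only two offsets are checked here; as the text says, the correspondence holds for most positions, not all of them.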

Dead consonants

A dead consonant is a consonant that has lost its inherent vowel. A dead consonant can always be formed by combining a full consonant with Virama.

In common with the ISCII encoding model, the Unicode standard should not explicitly specify the rendered form that a given Consonant-Virama combination will take. The appearance of a dead consonant is context dependent, so it is left to the rendering mechanism to choose the most appropriate form to display. For example, a Consonant-Virama combination could be rendered as any one of the Reph, Half, Halant, or explicit Virama forms of the consonant.

Examples
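As a minimal sketch of the encoding side only, a dead consonant in Devanagari can be built in Python as a consonant followed by Virama. The helper name dead_consonant is hypothetical, and KA is simply a convenient sample letter. Nothing in the encoded sequence says which visual form a renderer will pick.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA


def dead_consonant(consonant, virama=VIRAMA):
    """Return the full consonant followed by Virama (hypothetical helper)."""
    return consonant + virama


dead_ka = dead_consonant(KA)

# The sequence records only "KA without its inherent vowel";
# the choice of Reph, half, or visible-Virama form is left to the renderer.
print([unicodedata.name(ch) for ch in dead_ka])

# Normalization does not reorder or compose this pair.
print(unicodedata.normalize("NFC", dead_ka) == dead_ka)
```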

Exceptions to this rule may apply when certain control characters are placed next to a Virama.

Consonants with Explicit Virama (Explicit Halant)

In the ISCII standard, encoding two Viramas in succession indicates that a conjunct formation should not take place between two consonants, and that a Virama sign should be visibly displayed.

To accomplish this goal the Unicode Standard adopts the convention of placing the character U+200C ZERO WIDTH NON-JOINER (ZWNJ) immediately after an encoded dead consonant. In this case, the Virama is always depicted as appropriate for the consonant to which it is attached.

Example
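A minimal sketch of this convention in Python, assuming Devanagari KA and SSA as sample consonants; the helper name explicit_virama is made up for this illustration.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER


def explicit_virama(consonant, virama=VIRAMA):
    """Encode a dead consonant whose Virama should stay visible
    (hypothetical helper): consonant, Virama, then ZWNJ."""
    return consonant + virama + ZWNJ


# KA with explicit Virama, followed by SSA: the ZWNJ asks the renderer
# not to form a conjunct between the two consonants.
cluster = explicit_virama("\u0915") + "\u0937"

print(" ".join(f"U+{ord(ch):04X}" for ch in cluster))
print([unicodedata.name(ch) for ch in cluster])
```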

 

Note. The Unicode Indic FAQ under 'How does Unicode differ from ISCII?' is wrong. It states that ISCII 'consonant+Virama+Virama' is encoded as 'consonant+Virama+ZWJ' in Unicode!

Forms of consonant that imply a Virama (Soft Halant or implicit Virama)

A Soft-Halant form of consonant is a dead consonant that may precede another full consonant. It does not display a visible Virama. Examples include Devanagari Half forms and Malayalam Chillaksharams. It does not include subscript, superscript or 'post base' forms of consonant such as 'Reph', '-kar' and '-phalaa' forms.

To request a rendering mechanism to display such a consonant, ISCII adopts the mechanism of placing a Nukta after a Virama. The Unicode Standard should not treat the Nukta character as a control character, and so instead shall use the convention of placing U+200D ZERO WIDTH JOINER (ZWJ) immediately after a Virama.

Examples
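To make the distinction between the three kinds of encoded Virama concrete, the following Python sketch scans a string and labels each Virama as plain, explicit (followed by ZWNJ) or soft (followed by ZWJ). The function names and the labels are made up for this illustration, and the Virama test relies on the assumption that the Virama sits at offset 0x4D in each of the nine script blocks.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER


def is_virama(ch):
    """True for the Virama of the nine scripts (U+0900..U+0D7F),
    assuming it sits at offset 0x4D of each 0x80-wide block."""
    cp = ord(ch)
    return 0x0900 <= cp <= 0x0D7F and cp % 0x80 == 0x4D


def classify_viramas(text):
    """Return (index, kind) for every Virama in text, where kind is
    'explicit', 'soft' or 'plain' depending on the following character."""
    result = []
    for i, ch in enumerate(text):
        if not is_virama(ch):
            continue
        following = text[i + 1] if i + 1 < len(text) else ""
        if following == ZWNJ:
            kind = "explicit"
        elif following == ZWJ:
            kind = "soft"
        else:
            kind = "plain"
        result.append((i, kind))
    return result


# KA + Virama + ZWJ + SSA: a request for the half form of KA.
print(classify_viramas("\u0915\u094D\u200D\u0937"))
```

A label of 'soft' only records what was encoded; as noted below, the form actually displayed is still up to the rendering mechanism.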

Note. The section in the Unicode Indic FAQ that deals with the invisible letter is wrong. Contrary to the FAQ, the sequence 'consonant+Virama+Nukta' is the correct way to create half forms in ISCII [section 6.3.2. ISCII-91].

In some Indic scripts, languages, or font designs, it may not be appropriate to include soft-Halant letter forms for all consonants. A ZWJ after a Virama shall only be said to be an encoded soft Halant: it is possible that some other form of consonant will be displayed in any subsequent rendering.

 

Consonant conjuncts

As with dead consonants, the Unicode Standard shall not specify a default combining behaviour for a given consonant cluster. However, mechanisms that can override or discourage a default formation are provided.


Controlling combining consonant combinations

The ZWNJ

The natural ligation of a Consonant Virama Consonant combination can be explicitly denied by encoding an explicit Virama, that is, by placing a ZWNJ immediately after the Virama.

Example


The ZWJ

A differing combining behaviour can be requested by encoding a soft Halant, that is, by placing a ZWJ immediately after the Virama.

Examples


The next section and the following summary will be rewritten soon. Suggested use of the CGJ will be limited to extreme cases only. In most cases use of ZWJ and ZWNJ combinations will be advocated.

The CGJ

Mentions of CGJ in this section can be ignored (see above note)!

In cases where the behaviours of ZWJ & ZWNJ do not suffice, the ISCII standard uses an invisible letter (INV).

The INV consonant marks the position of the base consonant in a consonant cluster so that any other surrounding consonants can take on appropriate secondary forms. This mechanism is needed when a consonant cluster can have differing semantics depending on the way it is subsequently rendered. It is also used to encode certain presentation forms of consonant.

In the Unicode standard, this process can be handled with U+034F COMBINING GRAPHEME JOINER (CGJ).
For Indic scripts, the CGJ must always be placed adjacent to a Virama. This allows a CGJ-Virama combination to combine with any character also adjacent to the CGJ and hence combine into a single grapheme (dead consonant).

Such a combined letter may be treated as a base consonant of the script in which the Virama belongs. Any other consonant adjacent to the Virama can then be regarded as a combining secondary form if appropriate.

In the absence of a letter adjacent to the CGJ, or in the case that the adjacent character is non-Indic, a fallback mechanism such as that described in section 5.14 'Rendering Non-Spacing Marks' can be used.

Examples

(Some scripts, languages or font designs may not have, or need to contain, all of the above forms; a CGJ may therefore only be considered a request to a rendering mechanism to display an appropriate representation.)

 

Note. The section in the Unicode Indic FAQ that deals with the invisible letter is wrong: The ISCII standard does not state that the INV letter is required to form vocalic L, LL & Ri etc. However, in some ISCII applications, INV may be required to form Isolated Vowel Sign Ri.

The FAQ states that the ISCII sequence 'INV Virama Ra' is to be encoded as 'Space Virama Ra' in Unicode. This raises the question: how does one then encode a genuine 'Space Virama Ra'? The FAQ's statement is clearly wrong, and the use of CGJ is surely a much better solution.

Summary of proposed ISCII-Unicode equivalents

ISCII            UNICODE
Virama Virama    Virama ZWNJ
Virama Nukta     Virama ZWJ
INV              CGJ
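The table can be read as a pair of rewrite rules, one for each direction. The following Python sketch applies it to lists of symbolic tokens and checks that a round trip returns the original sequence. It is only an illustration under stated assumptions: the token names ("KA", "VIRAMA", "INV" and so on) stand in for real ISCII bytes and Unicode code points, the function and table names are made up, and the INV to CGJ row is taken from the table as it stands even though the note above says that the CGJ proposal is to be revised.

```python
# Proposed equivalents, written as (ISCII sequence, Unicode sequence).
EQUIVALENTS = [
    (("VIRAMA", "VIRAMA"), ("VIRAMA", "ZWNJ")),
    (("VIRAMA", "NUKTA"), ("VIRAMA", "ZWJ")),
    (("INV",), ("CGJ",)),
]

ISCII_TO_UNICODE = EQUIVALENTS
UNICODE_TO_ISCII = [(dst, src) for src, dst in EQUIVALENTS]


def convert(tokens, table):
    """Rewrite tokens left to right using the first matching rule;
    tokens that match no rule are copied through unchanged."""
    out = []
    i = 0
    while i < len(tokens):
        for src, dst in table:
            if tuple(tokens[i:i + len(src)]) == src:
                out.extend(dst)
                i += len(src)
                break
        else:
            # No rule matched at this position.
            out.append(tokens[i])
            i += 1
    return out


iscii = ["KA", "VIRAMA", "VIRAMA", "SSA", "INV", "VIRAMA", "RA"]
as_unicode = convert(iscii, ISCII_TO_UNICODE)
round_trip = convert(as_unicode, UNICODE_TO_ISCII)

print(as_unicode)
print(round_trip == iscii)
```

A real converter would also need the per-script character tables and a decision on inputs that the table does not cover, such as three Viramas in a row.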


 

 

This document last updated December 29, 2002