****This proposal has not been submitted****
***This document is displayed for initial feedback***
ISO
INTERNATIONAL ORGANIZATION FOR STANDARDIZATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC 1/SC 2/WG 2
Universal Multiple-Octet Coded Character Set
(U C S)
ISO/IEC JTC1/SC2/WG2 N????
Date: 200?-??-??
Title: Proposal for addition of CONSONANT BASE MARKER
Source: Andy White
Status: Awaiting feedback from member bodies and user groups
Action: ***This proposal has not been submitted***
Summary
The Consonant Base Marker (CBM) is proposed in order to remove current
ambiguities in the semantics of grapheme clusters. Such ambiguities are
currently found in conjuncts involving Malayalam Chillu forms, Bengali
KhandaTa, and various Indic Above, Below and Post base forms.
Terms
C1, C2 etc: Nominal forms of a consonant
Halant or Virama: The character used after a consonant to "strip"
it of its inherent vowel
Base consonant: The nominal form of a consonant as would be found
in the Unicode code charts.
Full Conjunct: The form whereby two or more consonants fully combine
to form a new ligature.
Half consonant: Form in which consonants appear to the left of
the base consonant if they do not participate in a full ligature
Implicit Halant: An implied Halant form of a consonant such as
Malayalam chillaksharams. Such forms of consonants do not have a visible
Virama.
Explicit Halant: The nominal form of a consonant combined with
a visible Halant.
Above Base consonant: The form of a consonant that appears above
the base, such as the above base form of Ra, commonly called Reph
Below base consonant: The form in which consonants appear below
the base glyph
Post Base consonant: Form in which consonants appear to the right
of the base glyph. Examples include: Oriya "Ya", Malayalam "Ya"
and "Va"
Secondary form of consonant: Any form of consonant other than
the base form.
Default rendering: The form that a sequence C1 + Virama + C2 would
most commonly take in a given script.
Introduction
A sequence, 'C1 Halant C2' can take one of seven forms:
- C1_C2.FullConjunct
- C1_Half + C2_Base
- C1_ImplicitHalant + C2_Base
- C1_ExplicitHalant + C2_Base
- C1_Above + C2_Base
- C1_Base + C2_Below
- C1_Base + C2_PostBase
Of the above, only cases 2 and 4 can be explicitly encoded, i.e.:
Case 2: C1 Virama ZWJ C2 = C1_Half + C2_Base
Case 4: C1 Virama ZWNJ C2 = C1_ExplicitHalant + C2_Base
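The two explicitly encodable cases can be illustrated with a minimal Python sketch. The code points are the standard Unicode values for Bengali Ra, Ya and Virama and for ZWJ and ZWNJ; the function names are illustrative only and are not part of this proposal.

```python
# Code points from the Unicode Bengali block and General Punctuation.
RA = "\u09B0"      # BENGALI LETTER RA
YA = "\u09AF"      # BENGALI LETTER YA
VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER


def default_cluster(c1: str, c2: str) -> str:
    """C1 Virama C2: the resulting form is left to the script's default rendering."""
    return c1 + VIRAMA + c2


def half_form_cluster(c1: str, c2: str) -> str:
    """Case 2: C1 Virama ZWJ C2 requests C1_Half + C2_Base."""
    return c1 + VIRAMA + ZWJ + c2


def explicit_halant_cluster(c1: str, c2: str) -> str:
    """Case 4: C1 Virama ZWNJ C2 requests C1_ExplicitHalant + C2_Base."""
    return c1 + VIRAMA + ZWNJ + c2


def code_points(text: str) -> list:
    """List the code points of a string in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]


if __name__ == "__main__":
    print(code_points(default_cluster(RA, YA)))
    print(code_points(half_form_cluster(RA, YA)))
    print(code_points(explicit_halant_cluster(RA, YA)))
```

No comparable sequence exists for the remaining five cases, which is the gap described next.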
Currently all the other forms can only be derived by context. However,
in some cases context alone is not satisfactory. Differing forms may need
to be created depending on differing semantics of consonant clusters.
For example, the sequence of Bengali characters Ra Virama Ya has differing
semantics depending on the way that the characters combine.

The first form of RYA shown above is common and has the
semantic of R'YO or R'JO, e.g. in the Bengali word 'Akarjo'.
The second form is rarer and has the semantic RAW or RRO,
e.g. in the English word 'raw'.
Semantics
The CBM is to be used to mark the base consonant in a consonant
cluster to remove ambiguities of the cluster's semantics in cases where
a ZWJ or ZWNJ cannot suffice.
The CBM is to be placed within an Indic coded syllable representation,
so that the CBM falls between a Virama and the consonant to be marked.
For example, in the sequence C1 Virama C2, the second consonant can be
marked by placing a CBM to the right of the Virama: C1 Virama CBM C2.
The CBM has the semantic of requesting the marked consonant
to serve as the base character in the consonant cluster in which it occurs
(and hence the possible appearance of the cluster after any subsequent
rendering).
The CBM has the secondary implied semantic of requesting
that the consonant on the opposite side of the Virama appear in its secondary
form (if any).
The CBM neither requests nor prevents ligation of character forms.
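The placement rule can be sketched in Python as follows. This is a sketch under a loud assumption: the CBM has no assigned code point, so the Private Use Area code point U+E000 stands in for it here purely for illustration. The function names are likewise hypothetical.

```python
# ASSUMPTION: the CBM has no assigned code point. U+E000 (Private Use Area)
# is a stand-in used only for this illustration.
CBM = "\uE000"

VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA
TA = "\u09A4"      # BENGALI LETTER TA
MA = "\u09AE"      # BENGALI LETTER MA


def mark_following_consonant(text: str, virama_index: int) -> str:
    """Place a CBM to the right of the Virama: C1 Virama CBM C2 marks C2 as base."""
    if text[virama_index] != VIRAMA:
        raise ValueError("no Virama at the given index")
    return text[:virama_index + 1] + CBM + text[virama_index + 1:]


def mark_preceding_consonant(text: str, virama_index: int) -> str:
    """Place a CBM to the left of the Virama: C1 CBM Virama C2 marks C1 as base."""
    if text[virama_index] != VIRAMA:
        raise ValueError("no Virama at the given index")
    return text[:virama_index] + CBM + text[virama_index:]


if __name__ == "__main__":
    cluster = TA + VIRAMA + MA
    marked = mark_following_consonant(cluster, 1)
    print([hex(ord(ch)) for ch in marked])
```

In both functions the CBM ends up between the Virama and the consonant being marked, as the placement rule above requires.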
Detailed Example
The sequence of Bengali characters Ta Virama Ma needs to
have differing forms depending on the word in which it occurs.
In the word Sadaatmaa the Ta_Ma needs the default rendering of a Ta_Ma.conjunct.
In the word Satmaa the Ta_Ma needs to have the appearance
of Ta.implicitHalant + Ma (KhandaTa+Ma), hence the Ma is marked with a
base marker.

As the first form is the nominal form (expected or default
form) of 'Ta Virama Ma', the second form must be marked in some way.
Mechanisms currently available cannot resolve this problem, as they can
only imply half forms, Virama forms, and combining grapheme forms (ZWJ,
ZWNJ & CGJ). Khanda Ta is none of these. (See the old obsolete
Khanda Ta proposal)
Further Examples
Bengali
Ra + Virama + Ya would have the default rendering of Ra_above+Ya
(Ya+Reph), but there are cases where it needs to have the form Ra+Ya_PostBase
(Ra+Japhala).

Further information regarding this can be found in my original
problem statement and discussion
Malayalam

Kannada

Collation
The incorporation of the CBM should have no effect on current
collation and should be ignored in such processes. The CBM can be used
to aid phonetic sorting and transcription.
Implications for current rendering systems
Current rendering systems should have no problems with a
CBM inserted on the right hand side of a Virama, as it would be treated
as a non-Indic character and hence have the desired effect of causing preceding
characters to take their secondary forms (it would be treated similarly to ZWNJ).
A CBM inserted to the left of a Virama will have the effect of causing
letterforms to appear in their nominal forms.
Interoperability with other standards
With regard to ISCII, existing translators should continue to process
text containing CBMs with semantically correct results (although not 100% visually
correct).
The CBM will not cause any problems with respect to future translation.
In most cases the sequence:
CBM Virama
will be translated as
INV Virama
and
Virama CBM
will be translated as
Virama nukta
(Actual translation appears to be script specific)
In translation to ISO 15919 the CBM can be converted to the ambiguity
marker (the colon).
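The two ISCII substitutions described above can be sketched in Python as a rewrite over a list of symbolic tokens. The token names (CBM, VIRAMA, INV, NUKTA) are placeholders chosen for this illustration and are not actual ISCII byte values. As noted, a real translator may need script specific rules.

```python
# Symbolic tokens only; these are NOT ISCII byte values.
CBM = "CBM"
VIRAMA = "VIRAMA"
INV = "INV"
NUKTA = "NUKTA"


def translate_cbm(tokens: list) -> list:
    """Rewrite 'CBM Virama' as 'INV Virama' and 'Virama CBM' as 'Virama nukta'.

    Any other token is copied through unchanged, including a CBM that is
    not adjacent to a Virama.
    """
    out = []
    i = 0
    while i < len(tokens):
        pair = tokens[i:i + 2]
        if pair == [CBM, VIRAMA]:
            out.extend([INV, VIRAMA])
            i += 2
        elif pair == [VIRAMA, CBM]:
            out.extend([VIRAMA, NUKTA])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    print(translate_cbm(["C1", CBM, VIRAMA, "C2"]))
    print(translate_cbm(["C1", VIRAMA, CBM, "C2"]))
```

The sketch scans left to right, so each CBM is consumed by at most one substitution.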
PROPOSAL SUMMARY
ISO/IEC JTC 1/SC 2/WG 2
PROPOSAL SUMMARY FORM TO ACCOMPANY SUBMISSIONS
FOR ADDITIONS TO THE REPERTOIRE OF ISO/IEC 10646
Please fill all the sections A, B and C below.
(Please read Principles and Procedures Document for
guidelines and details before filling this form.)
See http://www.dkuug.dk/JTC1/SC2/WG2/docs/summaryform.html
for latest Form.
See http://www.dkuug.dk/JTC1/SC2/WG2/docs/principles.html
for latest Principles and Procedures document.
See http://www.dkuug.dk/JTC1/SC2/WG2/docs/roadmaps.html
for latest roadmaps.
(Form number: N2352-F (Original 1994-10-14; Revised
1995-01, 1995-04, 1996-04, 1996-08, 1999-03, 2001-05, 2001-09))
A. Administrative
1. Title: CONSONANT BASE MARKER
2. Requester's name: Andy White
3. Requester type (Member body/Liaison/Individual contribution):
Individual
4. Submission date: ??/??/??
5. Requester's reference (if applicable):
6. (Choose one of the following:) This is a complete proposal,
or, More information will be provided later: This is a complete proposal
B. Technical - General
1. (Choose one of the following:)
a. This proposal is for a new script (set of characters): No
Proposed name of script:
b. The proposal is for addition of character(s) to an existing block: Yes
Name of the existing block: BMP
2. Number of characters in proposal: One
3. Proposed category (see section II, Character Categories): Combining
Mark ?
4. Proposed Level of Implementation (1, 2 or 3)
(see clause 14, ISO/IEC 10646-1: 2000): Any level is acceptable
Is a rationale provided for the choice? ______________
If Yes, reference: _______________________________________________________
5. Is a repertoire including character names provided? Yes
a. If YES, are the names in accordance with the character naming guidelines
in Annex L of ISO/IEC 10646-1: 2000? Yes
b. Are the character shapes attached in a legible form suitable for
review? N/A
6. Who will provide the appropriate computerized font (ordered preference:
True Type, or PostScript format) for publishing the standard? Not
Applicable but I will provide font for examples if needed in the text
of the Standard
If available now, identify source(s) for the font (include address,
e-mail, ftp-site, etc.) and indicate the tools used: N/A
7. References: a. Are references (to other character sets, dictionaries,
descriptive texts etc.) provided? ***Not Yet***
b. Are published examples of use (such as samples from newspapers,
magazines, or other sources) of proposed characters attached? ***Not
Yet***
8. Special encoding issues: Does the proposal address other aspects
of character data processing (if applicable) such as input, presentation,
sorting, searching, indexing, transliteration etc. (if yes please enclose
information)? Yes
9. Additional Information:
Submitters are invited to provide any additional information about Properties
of the proposed Character(s) or Script that will assist in correct understanding
of and correct linguistic processing of the proposed character(s) or
script. Examples of such properties are: Casing information, Numeric
information, Currency information, Display behaviour information such
as line breaks, widths etc., Combining behaviour, Spacing behaviour,
Directional behaviour, Default Collation behaviour, relevance in Mark
Up contexts, Compatibility equivalence and other Unicode normalization
related information. See the Unicode standard at http://www.unicode.org/ for such information
on other scripts. Also see Unicode Character Database http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html
and associated Unicode Technical Reports for information needed for
consideration by the Unicode Technical Committee for inclusion in the
Unicode Standard.
C. Technical - Justification
1. Has this proposal for addition of character(s) been submitted before?
No
If YES explain
2. Has contact been made to members of the user community (for example:
National Body, user groups of the script or characters,
other experts, etc.)? Yes
If YES, with whom? **To Be provided**
If YES, available relevant documents:
3. Information on the user community for the proposed characters
(for example: size, demographics, information technology use, or
publishing use) is included? No
Reference: Indic Community (among others)
4. The context of use for the proposed characters (type of use;
common or rare) Common
Reference: As enclosed
5. Are the proposed characters in current use by the user community?
No
If YES, where? Reference: Character formed by the proposed are
6. After giving due considerations to the principles in Principles
and Procedures document (a WG 2 standing document) must the proposed
characters be entirely in the BMP? Yes
If YES, is a rationale provided? ***To be provided***
If YES, reference: enclosed
7. Should the proposed characters be kept together in a contiguous range
(rather than being scattered)? N/A
8. Can any of the proposed characters be considered a presentation form
of an existing character or character sequence? No - N/A
If YES, is a rationale for its inclusion provided?
If YES, reference:_
9. Can any of the proposed characters be encoded using a composed character
sequence of either existing characters or other proposed characters?
No
If YES, is a rationale for its inclusion provided?
If YES, reference:
10. Can any of the proposed character(s) be considered to be similar
(in appearance or function) to an existing character? No
If YES, is a rationale for its inclusion provided?
If YES, reference:
11. Does the proposal include use of combining characters and/or use
of composite sequences (see clauses 4.12 and 4.14
in ISO/IEC 10646-1: 2000)? ***Not Sure***
If YES, is a rationale for such use provided?
If YES, reference:
Is a list of composite sequences and their corresponding glyph images
(graphic symbols) provided? *******
If YES, reference:
12. Does the proposal contain characters with any special properties
such as control function or similar semantics? Yes
If YES, describe in detail (include attachment if necessary) Attached
13. Does the proposal contain any Ideographic compatibility character(s)?
No
If YES, is the equivalent corresponding unified ideographic character(s)
identified?
If YES, reference:
A.1 Submitter's Responsibilities
The national body or liaison organization (or any other organization
or an individual) proposing new character(s) or a new script shall provide:
- Proposed category for the script or character(s), character name(s),
  and description of usage.
- Justification for the category and name(s).
- A representative glyph(s) image on paper:
  If the proposed glyph image is similar to a glyph image of a previously
  encoded ISO/IEC 10646 character, then additional justification for
  encoding the new character shall be provided.
  Note: Any proposal that suggests that one or more
  of such variant forms is actually a distinct character requiring
  separate encoding, should provide detailed, printed evidence that
  there is actual, contrastive use of the variant form(s). It is insufficient
  for a proposal to claim a requirement to encode as characters
  in the Standard, glyphic forms which happen to occur in another character
  encoding that did not follow the Character-Glyph Model that guides
  the choice of appropriate characters for encoding in ISO/IEC 10646.
  Note: WG 2 has resolved in Resolution M38.12 not
  to add any more Arabic presentation forms to the standard and suggests
  users to employ appropriate input methods, rendering and font technologies
  to meet the user requirements.
- Mappings to accepted sources, for example, other standards, dictionaries,
  accessible published materials.
- Computerized/camera-ready font:
  Prior to the preparation of the final text of the next amendment or
  version of the standard a suitable computerized font (camera-ready
  font) will be needed. Camera-ready copy is mandatory for final text
  of any pDAMs before the next revision. Ordered preference of the fonts
  is True Type or PostScript format. The minimum design resolution for
  the font is 96 by 96 dots matrix, for presentation at or near 22 points
  in print size.
- List of all the parties consulted.
- Equivalent glyph images:
  If the submission intends using composite sequences of proposed or
  existing combining and non-combining characters, a list consisting
  of each composite sequence and its corresponding glyph image shall
  be provided to better understand the intended use.
- Compatibility equivalents:
  If the submission includes compatibility ideographic characters, identify
  the equivalent unified CJK Ideograph character(s).
- Any additional information that will assist in correct understanding
  of the different characteristics and linguistic processing of the
  proposed character(s) or script.
