Tamil Discussion archive


Unicode Project of UC Berkeley

I posted the following to the tamil.net list a while ago (May '97) when
we were actively discussing possible character encoding schemes.
Since we now have a larger spectrum of participants, I thought
of reposting it, as the contents are very relevant to what we are
discussing. We did not have much discussion on this at the time.
For those who have already read this or are aware of it, please accept
my apologies for the repost.

Sujatha has pointed out recently that there is too much
"re-inventing the wheel" taking place in the world of Tamil computing.

Many of the points raised below are exactly what we are grinding
through right now. If we can all agree quickly on a character
encoding scheme that will allow storage of Tamil texts even in
Unicode format, we can induce UC Berkeley to go ahead and
use our package to archive Tamil texts right away.  There is even
an explicit statement: "To undertake Prof. Hart's
Tamil project or similar projects with other non-roman scripts, a
minimum requirement for data entry will be Unicode conformant fonts
and software to support them."

Prof. Hart may also be thinking along these lines.
Maybe Prof. Hart can provide us an update on this project?

The following extracts, taken from the web site of the Univ. of
California, Berkeley, USA library, should be of interest to the
tamil.net community.

URL:  http://www.lib.berkeley.edu/SSEAL/SouthAsia/Unicode.html
(The above web page is freely accessible to anyone, and hence I presume
that the information is free for world-wide public dissemination!)

The project cited below is very important in view of the discussions we
had recently in Singapore at the TamilNet'97 Conference and the desire
of the Tamil.Net community to have a standard for Tamil computing.

Interestingly, the review part of this project was scheduled to take
place during July-Dec. 1996, and conclusions have already been drawn!

Beyond Transliteration: Digitizing Non-Roman Text Using Unicode
Conformant Fonts

Suzanne McMahon,  438 Doe Library,  Berkeley, CA  94720-6000
   Tel:   510-643-0849    Email:  smcmahon@library.berkeley.edu

In the first phase of the project, I will conduct a review of existing
digitized texts in languages using non-roman fonts, particularly Tamil
and Urdu. In addition I will review and evaluate available proprietary
and non-proprietary Unicode conformant fonts and software. Based on
this review I will formulate specifications for software and fonts
needed to complete a large-scale digitization project in Tamil and Urdu.
Working with consultants I will decide if existing software will meet
the needs of the project or if any special applications need to be
developed. I will then draft a funding request for Phase 2 of the
project, the development of needed software and fonts and the
digitization of sample Tamil and Urdu texts. I will also announce the
results of Phase 1 and plans for Phase 2 on suitable lists.

PART A: Need for Research

Last year Prof. George Hart, Tamil Chair in the Department of South and
Southeast Asian Studies, approached me about collaboration on a project
to digitize the corpus of pre-modern Tamil literature. David Farrell,
Berkeley AUL for Collections, Prof. Hart, and I met to discuss models
for the project, agreeing that for a prototype, data entry and tagging
would be done in the Library. I also met with Barclay Ogden, Digital Library
Coordinator, and Daniel Pitti, Head of the Electronic Text Unit to
discuss the project. Pitti explained that the first hurdle to
non-roman projects has been the lack of standards for non-roman fonts.
The recently approved Unicode standard has developed maps for most
modern languages, but software hasn't been developed yet to support 
many of the Unicode maps. The first step for completion of the 
pre-modern Tamil project is the development of appropriate Unicode 
compliant fonts and of software to handle fonts for digitization.

The English language has, until recently, dominated the Internet.
E-mail, gopher, and ftp use the ASCII standard, a 7-bit scheme allowing
for 128 character positions. Extended ASCII, the character standard for
microcomputer software, uses 8 bits to achieve 256 characters. Reliance
on ASCII has been severely limiting for languages written in non-Roman
scripts, and scholars have been forced to use various inadequate
transliteration schemes to handle non-Roman characters when digitizing
texts. With the advent of the Web, bitmapped transmissions of scanned
text are possible, but HTML (RFC 1866) relies on ISO-8859-1, known as
Latin-1, a version of extended ASCII, which is appropriate only for
English and Western European languages. Font and software developers
for languages written in non-Roman scripts have laid out non-standard
character maps, usually using ASCII positions 128-255 to accommodate
non-Roman characters. The non-standard character maps make it possible
to use non-roman characters in many applications and, after loading
special fonts, even on the Web, but at the expense of interoperability.
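The interoperability cost can be seen in a small sketch: bytes in the
128-255 range decode as Western European accented letters under
Latin-1, so text entered with a custom "Tamil font" encoding is
garbled on any machine without that font installed. The byte-to-glyph
map below is purely hypothetical, for illustration only; real 8-bit
Tamil font encodings of that era varied from vendor to vendor.

```python
# Hypothetical 8-bit "Tamil font" map: byte values in the upper half
# of the 256-character range are reassigned to Tamil glyphs.
tamil_font_map = {0xE0: "த", 0xE1: "மி", 0xE2: "ழ்"}

data = bytes([0xE0, 0xE1, 0xE2])   # text typed with the custom font

# With the special font installed, the bytes display as Tamil:
with_font = "".join(tamil_font_map[b] for b in data)
print(with_font)                   # தமிழ் ("Tamil")

# On any other machine, the same bytes decode as Latin-1 accented
# letters -- the text is meaningless to the reader:
print(data.decode("latin-1"))      # àáâ
```

The same byte stream thus has two unrelated readings, which is exactly
why such texts could not be exchanged or archived reliably.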
An Internet draft on internationalization of HTML
(draft-ietf-html-i18n-03.txt) suggests altering the HTML Document Type
Definition (DTD), i.e., the formal definition of the HTML syntax in
terms of SGML, by encompassing a larger character repertoire than
ISO-8859-1, while still remaining SGML compliant. The larger character
repertoire suggested is Unicode, the ISO 10646 BMP. Unicode is based on
a 16-bit encoding that permits 65,536 characters instead of the 256
characters of Latin-1. BMP stands for Basic Multilingual Plane,
sometimes referred to as Plane Zero. Unicode is only the
first plane of the larger ISO 10646. ISO 10646 allows for 32-bit
encoding. It is divided into roughly 32,000 planes with a capacity of
65,536 characters each, permitting over 2,000 million characters. While
Unicode, in order to save space, maps duplicate or very similar
characters from Chinese Han, Japanese Kanji, and Korean Hanja to one
position, ISO 10646 would need to be used for fully unique sets of the
characters, although the processing cost is prohibitive.
UTF-8 (UCS Transformation Format) is part of ISO 10646 and encodes
common characters compactly to speed up processing. Currently
only the Unicode plane of ISO 10646 has been implemented.
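The point about the BMP and UTF-8 can be illustrated with the Tamil
block, which the Unicode standard places at U+0B80-U+0BFF. A short
sketch (modern Python is used here purely for illustration; any
Unicode-aware environment behaves the same way):

```python
word = "தமிழ்"   # "Tamil" in Tamil script

# Each character occupies a single 16-bit code point in the BMP's
# Tamil block:
for ch in word:
    assert 0x0B80 <= ord(ch) <= 0x0BFF
print([hex(ord(ch)) for ch in word])
# ['0xba4', '0xbae', '0xbbf', '0xbb4', '0xbcd']

# UTF-8 keeps ASCII at one byte per character, while BMP characters
# outside ASCII take two or three bytes -- Tamil code points take three:
assert len("Tamil".encode("utf-8")) == 5    # 1 byte per ASCII letter
assert len(word.encode("utf-8")) == 15      # 3 bytes per Tamil letter
```

This is the sense in which UTF-8 is compact: text that is mostly
ASCII stays close to one byte per character instead of a fixed two.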
The global nature of the Internet has accelerated the push towards a
standard that can accommodate all the world's languages. It seems highly
likely that in the next few years the computer industry will switch from
ASCII to Unicode. Windows NT uses Unicode, and the Windows 4.x series
should have Unicode fully operational.
The Library's Electronic Text Unit is committed to producing the highest
quality digital text for the benefit of future scholarship. While many
non-standard fonts that use the upper half of the extended ASCII
character set are available, these fonts are useful only in a
stand-alone setting and often only with proprietary software. An example
is the offering of Macintosh fonts from Ecological Linguistics. To
undertake Prof. Hart's Tamil project or similar projects with other
non-roman scripts, a minimum requirement for data entry will be Unicode
conformant fonts and software to support them. Using fonts and software
that comply with the emerging standard will guarantee the consistency
and interoperability necessary for valid scholarship; Unicode conformant
fonts and software also open the possibility of distributing the text
digitally.
Goal: To review existing standards for fonts and digitizing text;
non-Roman text already digitized and projects currently in
progress, concentrating on languages in Indic scripts, particularly
Tamil and Urdu;  and existing Unicode conformant fonts and software 
suitable for digitizing modern and pre-modern Tamil and Urdu text.
PART B: Design and methodology

1. Review existing standards
The first step is a review of existing approved or draft standards,
including ASCII, Unicode, SGML, TEI, XML, and transliteration schemes.
This step will include gathering a small collection of standards for
easy reference during the course of the project.
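Among the standards above, SGML provides the markup framework and TEI
the tagging conventions for scholarly texts. A minimal TEI-style record
for one digitized text might look like the sketch below; the element
names follow the general TEI pattern but the record is illustrative
only (Python's standard XML parser is used just to check
well-formedness, since TEI documents of this period were SGML):

```python
import xml.etree.ElementTree as ET

# Illustrative TEI-style record for a digitized pre-modern Tamil text.
record = """<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Kuruntokai (sample)</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>...</p></body></text>
</TEI>"""

root = ET.fromstring(record)       # raises ParseError if malformed
assert root.tag == "TEI"
assert root.find(".//title").text == "Kuruntokai (sample)"
```

Keeping the bibliographic details in a tagged header like this is what
makes the digitized text searchable and exchangeable between projects.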
2. Review of digitized non-Roman text, particularly in Indic scripts, 
concentrating on  Tamil or Urdu 
In this step I'll review what has been digitized and how, including
projects in progress. Information to be gathered will include:
        * Bibliographic details of the text
        * Language of the text
        * Institutions/individuals involved in the work
        * Is the text available on gopher, the Web, CD-ROM, or other
                media?
        * Is the text transliterated or in the original font?
        * If in the original font, does it use a standard or
                non-standard character map?
        * Is it searchable? What other features are included that
                wouldn't be available in a printed version?

3. Review of Unicode conformant software and fonts
The third step is a review of proprietary and non-proprietary software
conformant with Unicode. This step will include evaluation of the
suitability of software identified in step 2 for the current project.
For instance, Gamma has a series of products to use Unicode with
Windows. A non-proprietary example is the TamilWeb in Singapore. Its
creators have developed a Unicode font and a Java applet to deliver it.
An example of software previously used with other digitization projects
is the database developed at the University of Chicago for the Thesaurus
Linguae Graecae project.
Criteria for evaluation will include:
        * Does the font scheme allow for expansion, for instance
                inclusion of pre-modern characters?
        * Will the software work across the network, across platforms?
        * To what degree does the software support SGML and TEI?
        * Will the software allow for manipulation of text? Addition of
                annotations and bookmarks?
        * Will the software allow for searches in Roman and non-Roman
                scripts?
        * Does the software support left-to-right and right-to-left data
                entry?
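The last criterion matters because Urdu is written in the Arabic
script, which runs right to left, while Tamil and Roman text run left
to right. Unicode records a directionality property for every
character, which entry software can consult; a small sketch using
Python's unicodedata module (illustrative only):

```python
import unicodedata

# Unicode assigns each character a bidirectional category:
# 'L' = left-to-right letter, 'AL' = right-to-left Arabic letter, etc.
assert unicodedata.bidirectional("த") == "L"        # Tamil: left-to-right
assert unicodedata.bidirectional("\u0627") == "AL"  # Arabic alef, as used in Urdu

def directions(text):
    """Return the set of bidi categories present in text."""
    return {unicodedata.bidirectional(ch) for ch in text}

# Mixed-direction data is what entry software must cope with:
print(directions("Tamil \u0627\u0631\u062f\u0648"))  # contains both 'L' and 'AL'
```

A Unicode conformant editor uses exactly this per-character property to
lay out mixed Tamil, Roman, and Urdu text correctly in one document.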

