Tamil Discussion archive


[WMASTERS] Re: Does 8-bit scheme clash with HTML standard?



Dear Mani:

>An example of practical character-subset frustration is ISO 8859-1
>versus Microsoft CP1252 (the Windows character set):
>Today, I already have the problem that Web page authors using
>MS-Windows use the CP1252 characters in the 0x80-0x9f range that
>are not part of ISO 8859-1 and therefore are IN THEORY not allowed
>in HTML.  I can't see these characters on my fully HTML-conforming
>system, and Web authors who are unaware of the proper subset
>of their available character set frequently and inappropriately use
>QUOTATION MARK (0x91 and 0x92 in CP1252, 0x2018 and 0x2019
>in Unicode) where they really should use just QUOTATION
>MARK (0x22) if the character set announced by HTTP for this
>Web page is ISO 8859-1.

Regarding your post, I have read emails raising this question
(how to handle characters that are in the 128-159 slots in HTML),
with answers on how Web authors using Windows should handle these
characters. I do not use Windows-based software to prepare my
HTML pages and so cannot check the proposed solution. Maybe
someone like Kumar Mallikarjunan, Srinivasan, or Muthu can
check it out?

I am copying below one relevant email that answers how the problem
is to be handled (taken from the webpage devoted to Latin-1,
http://www.pemberley.com/janeinfo/latin1.html).


From: Markus Kuhn <kuhn@cs.purdue.edu>
Newsgroups: comp.text.sgml, comp.std.internat,
Date: Thu, 24 Apr 1997 23:57:52 -0500
Message-ID: <336039D0.FD4@cs.purdue.edu>

       [Question: &#146; valid HTML or no?]

The characters 128-159 are not used in ISO 8859-1 and Unicode, the
character sets of HTML.
MS-Windows uses a superset of ANSI/ISO 8859-1, known to experts as
"Code Page 1252 (CP1252)", a Microsoft-specific character set with
additional characters in the 128-159 range (also known as the "C1"
range).
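
To see whether a page announced as ISO 8859-1 contains any of these
CP1252-only bytes, a check along the following lines could be used.
This is only a minimal Python sketch, assuming the raw bytes of the
page are already in hand; the function name find_c1_bytes is made up
for illustration.

```python
def find_c1_bytes(raw: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) for every byte in the 128-159 range."""
    return [(i, b) for i, b in enumerate(raw) if 0x80 <= b <= 0x9F]


# Byte 0x92 (146) is the CP1252 right single quotation mark.
print(find_c1_bytes(b"It\x92s"))  # [(2, 146)]
```

An empty result means the page stays within the part of CP1252 that
it shares with ISO 8859-1.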

All the CP1252 characters are also available in Unicode. For example, the
character 146 that you mentioned (RIGHT SINGLE QUOTATION MARK) has
the Unicode number 8217; therefore you should use this number in order
to conform to the HTML standard. Modern HTML browsers like Netscape 4.0
understand Unicode, and will automatically convert the Unicode character
back into the character 146 on MS-Windows machines, and into the
corresponding character on other systems.
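
The substitution described above (writing &#8217; instead of &#146;)
can be sketched in Python as follows. This is an illustration under
assumptions, not a tested authoring-tool fix: the function name
fix_c1_references is hypothetical, and the mapping comes from
Python's built-in "cp1252" codec, which follows a later revision of
the code page that also assigns slots 128, 142 and 158.

```python
import re


def fix_c1_references(html: str) -> str:
    """Rewrite &#128;..&#159; as the Unicode reference of the CP1252 character."""

    def repl(match: re.Match) -> str:
        code = int(match.group(1))
        if 128 <= code <= 159:
            try:
                char = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # Slot is unassigned in CP1252; leave the reference untouched.
                return match.group(0)
            return f"&#{ord(char)};"
        return match.group(0)

    return re.sub(r"&#(\d+);", repl, html)


print(fix_c1_references("It&#146;s"))  # It&#8217;s
```

References outside the 128-159 range, such as &#233;, are passed
through unchanged.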

The official CP1252<->Unicode conversion table is printed in the Unicode
standard for instance, and is available on
<ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/> in the file 

MS-Windows HTML-authoring software definitely should implement the
conversion table below! Please forward this mail to the developers of
your HTML authoring tool if this is currently done wrong.

The CP1252 characters that are not part of ANSI/ISO 8859-1, and that
therefore should always be encoded as Unicode characters greater than
255, are:

 Windows    Unicode    Char.
  char.    HTML code   test   Description of Character
--------   ---------   -----  ------------------------
ALT-0130   &#8218;     ‚      Single Low-9 Quotation Mark
ALT-0131   &#402;      ƒ      Latin Small Letter F With Hook
ALT-0132   &#8222;     „      Double Low-9 Quotation Mark
ALT-0133   &#8230;     …      Horizontal Ellipsis
ALT-0134   &#8224;     †      Dagger
ALT-0135   &#8225;     ‡      Double Dagger
ALT-0136   &#710;      ˆ      Modifier Letter Circumflex Accent
ALT-0137   &#8240;     ‰      Per Mille Sign
ALT-0138   &#352;      Š      Latin Capital Letter S With Caron
ALT-0139   &#8249;     ‹      Single Left-Pointing Angle Quotation Mark
ALT-0140   &#338;      Œ      Latin Capital Ligature OE
ALT-0145   &#8216;     ‘      Left Single Quotation Mark
ALT-0146   &#8217;     ’      Right Single Quotation Mark
ALT-0147   &#8220;     “      Left Double Quotation Mark
ALT-0148   &#8221;     ”      Right Double Quotation Mark
ALT-0149   &#8226;     •      Bullet
ALT-0150   &#8211;     –      En Dash
ALT-0151   &#8212;     —      Em Dash
ALT-0152   &#732;      ˜      Small Tilde
ALT-0153   &#8482;     ™      Trade Mark Sign
ALT-0154   &#353;      š      Latin Small Letter S With Caron
ALT-0155   &#8250;     ›      Single Right-Pointing Angle Quotation Mark
ALT-0156   &#339;      œ      Latin Small Ligature OE
ALT-0159   &#376;      Ÿ      Latin Capital Letter Y With Diaeresis
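
For text saved on Windows, the whole table can be applied in one step
rather than character by character. The following Python sketch
assumes the input really is CP1252 and that the page will be served
as ISO 8859-1; the function name cp1252_to_html_latin1 is
hypothetical. It relies on the standard "xmlcharrefreplace" error
handler, which writes any character outside the target encoding as a
decimal &#NNNN; reference.

```python
def cp1252_to_html_latin1(raw: bytes) -> bytes:
    """Convert CP1252 bytes to ISO 8859-1 bytes with numeric references."""
    # Decoding fails for the few byte values CP1252 leaves unassigned.
    text = raw.decode("cp1252")
    # Characters above 255 become &#NNNN;, Latin-1 characters stay single bytes.
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


print(cp1252_to_html_latin1(b"It\x92s"))  # b'It&#8217;s'
```

Accented Latin-1 letters such as byte 0xE9 pass through as they are,
so only the characters from the table get rewritten.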

Markus Kuhn, Computer Science grad student, Purdue
University, Indiana, US, email: kuhn@cs.purdue.edu


