Character set and encoding confusion

How do you turn a simple problem into a big mess? Inconsistent terminology is the way. Depending on the specification you read, the same concepts have different and conflicting names. Welcome to the world of character representation.

It starts with only two basic concepts: the letter “A” is assigned a code, say 435. Then the code 435 is converted into a sequence of bytes. THAT’S ALL! BUT THAT’S TWO ENCODINGS!
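
The two steps can be sketched in Python. This is only an illustration, not part of any specification: the 435 above is a made-up number, and in Unicode the real code point for “A” is 65. The variable names are arbitrary.

```python
# Step 1: character -> code point (an integer from the coded character set).
char = "A"
code_point = ord(char)                    # 65, i.e. U+0041 in Unicode

# Step 2: code point -> bytes (the character encoding, a.k.a. "charset").
utf8_bytes = char.encode("utf-8")         # one byte: 0x41
utf16_bytes = char.encode("utf-16-be")    # two bytes: 0x00 0x41

# A non-ASCII character makes the difference between the steps more visible:
# one code point, several possible byte sequences.
water = "水"
water_code_point = ord(water)             # 0x6C34
water_utf8 = water.encode("utf-8")        # three bytes: 0xE6 0xB0 0xB4
water_utf16 = water.encode("utf-16-be")   # two bytes: 0x6C 0x34

print(hex(water_code_point), water_utf8.hex(), water_utf16.hex())
```

The code point never changes; only the second step (the byte encoding) does.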

Now let’s explore the mess this generated:

  • Some definitions, according to HTML Document Representation:
    • The Document Character Set, or base character set, consists of:
      • A Repertoire: a set of abstract characters, such as the Latin letter “A”, the Cyrillic letter “I”, etc.
      • Code positions, or code points: a set of integer references to characters in the repertoire.

      Internally, for XML and HTML, it is Unicode, or equivalently the Universal Character Set (ISO/IEC 10646); the two are code-for-code identical.

    • Coded character set is the same as “Document Character Set”. Specifications use one of two naming pairs: either “character set” with “repertoire”, or “coded character set” with “character set”. In the second pair, “character set” means what the first pair calls “repertoire”!
    • Charset identifies a character encoding (not a character set, despite the name).
    • A character encoding is a method of converting a sequence of bytes into a sequence of characters (ISO-8859-1, UTF-8, US-ASCII, …).
    • Byte Order Mark (BOM): the first bytes of a stream, used to tell whether it is big-endian or little-endian (it wouldn’t be a problem if MS Notepad wasn’t so widely used).
    • Numeric character references (NCR) specify the code position of a character in the document character set: &#229; (å) or &#x6C34; (水). Note: in XML and HTML they are interpreted as Unicode characters, no matter what encoding you use for your document! In CSS, use \E5 and \6C34, and don’t ask why.
    • Character entity references are symbolic names: &aring; (å) or &Aring; (Å). They are usually used in HTML for manual authoring. They are lost if not defined in the DTD.
    • Character escape is the use of a character reference (either numeric or entity).
  • How clients and servers manage it is another mess.
  • The best resource I have found so far: W3C Internationalization, and more specifically Character encodings (and Authoring Techniques for XHTML & HTML Internationalization: Characters and Encodings).
    For non-XML/HTML problems, go to Unicode.
  • Follow-up: Database Connection; Security concern; for Linux: The Unicode HOWTO (Bruno Haible) and UTF-8 and Unicode FAQ for Linux by Markus Kuhn.
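
To make a few of the definitions above concrete, here is a small Python sketch. It rests on assumptions of mine: it uses the standard `codecs` and `html` modules, and `html.unescape` follows HTML5 rules, which may differ in detail from what an XML parser with a given DTD would do.

```python
import codecs
import html

# Charset / character encoding: one character, different byte sequences.
a_ring = "\u00e5"                           # å, code point U+00E5
latin1 = a_ring.encode("iso-8859-1")        # one byte: 0xE5
utf8 = a_ring.encode("utf-8")               # two bytes: 0xC3 0xA5

# Decoding with the wrong charset does not fail, it silently garbles.
garbled = utf8.decode("iso-8859-1")         # two characters instead of one

# BOM: the "utf-16" decoder reads it to pick the byte order.
little = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
from_little = little.decode("utf-16")
from_big = big.decode("utf-16")

# A UTF-8 BOM (the Notepad habit) carries no byte-order information;
# "utf-8-sig" strips it, plain "utf-8" keeps it as U+FEFF.
with_bom = codecs.BOM_UTF8 + utf8
stripped = with_bom.decode("utf-8-sig")
kept = with_bom.decode("utf-8")

# Numeric references name the Unicode code point, whatever the charset is.
ncr_decimal = html.unescape("&#229;")
ncr_hex = html.unescape("&#x6C34;")

# Entity references are symbolic names for the same code points.
entity_lower = html.unescape("&aring;")
entity_upper = html.unescape("&Aring;")

print(garbled, from_little, from_big, ncr_decimal, ncr_hex, entity_lower, entity_upper)
```

Both kinds of reference resolve to the same code point (U+00E5 for å), which is why they survive a change of document encoding while raw bytes do not.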

More references: i18nGurus.
