On 1 Aug, 09:15, Andreas Prilop <Prilop2....DeleteThis@trashmail.net> wrote:
> On Tue, 31 Jul 2007, Andy Dingley wrote:
> > Agreed, but that's by their definition as _characters_, not codepoints.
>
> You still struggle with basic terms.
So go on then, please enlighten us as to my error here. Anyone can
point to huge great references from the W3C or even Plan 9, but unless
you cite a specific piece of text here, and a specific statement in
the reference, then all you're doing is a feeble bit of proof by
authority.
> if you want to discuss further on this topic and if you want
> to be taken seriously.
I presume your unhelpful patronising attitude is holiday cover for
Jukka.
As to the issue here, then a "character" is "the smallest component of
written language that has semantic value; refers to the abstract
meaning and/or shape" (from Unicode 4.0). A "code point" is the
integer that refers to a location within the space defined by Unicode
(Unicode themselves are inconsistent as to what this space is called).
If you're being that picky, then it's "code point", not "codepoint".
The term code point isn't used outside the Unicode world, but the
concept could usefully be extended to describe other character sets,
within a bounded world of discourse where you're careful to define the
term beforehand.
So 8217 (0x2019) is a code point (right single quote). So is 39 (0x27)
a code point (apostrophe). As a purely typographic question, we should
discuss which of the two is better for representing apostrophes.
I think the answer to that is fairly clear.
As to which is prettier, then if you want a pretty apostrophe glyph:
choose a pretty typeface to render it with.
It's not unheard of for characters to be deliberately mis-encoded, so
as to gain a prettier glyph than the one intended for that character.
We've got four choices here: apostrophe, quote, prime and even an
acute accent! They all look much the same as glyphs, why not choose
the one that's prettiest and hang the accuracy of the markup used to
obtain it? Again, I think the answer is fairly clear to that.
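To make the four candidates concrete, here's a small Python sketch that
looks up their Unicode names. The specific code points for "prime" and
"acute accent" (U+2032 and U+00B4) are my reading of the list above, not
something stated in it.

```python
import unicodedata

# The four look-alike characters: apostrophe, right single quote,
# prime and acute accent (the last two code points are assumed).
candidates = ["\u0027", "\u2019", "\u2032", "\u00b4"]

# Map "U+XXXX" to the official Unicode character name.
names = {f"U+{ord(c):04X}": unicodedata.name(c) for c in candidates}

for code_point, name in names.items():
    print(code_point, name)
```

Four similar glyphs, four different characters, four different code points.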
Now, as to the mapping of windows-1252. We're talking about 0x92 which
is a "right single quote" and not an "apostrophe". As such, the
_character_ of "right single quote" is an exact match for the Unicode
character "right single quote" found at code point 0x2019 / &#8217;.
This does not mean that 0x92 = 0x2019, that the integers 146 =
8217 ! As far as we can attach the same definition of code point
("an integer that defines a location in codespace") to windows-1252,
then it's clear that these are quite different code points.
Their characters are the same. Their code points are mappable to each
other.
That's not the same thing as saying that the code points are the
same.
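A rough illustration in Python, using its built-in cp1252 codec as a
stand-in for the windows-1252 mapping table: the byte value and the
Unicode code point are different integers, yet they map to each other
and name the same character.

```python
import unicodedata

raw = b"\x92"                 # the windows-1252 byte under discussion
char = raw.decode("cp1252")   # transcode into the Unicode codespace

win_code = raw[0]             # integer position in windows-1252
uni_code = ord(char)          # integer position in Unicode
char_name = unicodedata.name(char)

print(win_code, uni_code, char_name)

# The mapping also runs the other way.
round_trip = char.encode("cp1252")
```

The integers differ (146 and 8217); only the character is shared.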
In particular, on the web we can deliver characters from the Unicode
codespace correctly through a number of different encodings, including
ISO-8859-* encodings that can't represent those characters directly.
Numeric character references are how we work around this. Note though
that all of these web pages (by definition) refer to code points in the
Unicode codespace, no matter what their encoding. If you use a literal
character from ISO-8859-* then it will be transcoded to the appropriate
Unicode code point (and
just to show you that I read the damn thing years ago, here's the
reference that describes the process
http://www.w3.org/TR/charmod/#sec-Transcoding
).
So if you're going to use &#8217; as a numeric character reference,
then use it - it'll work from any encoding (the caveat being the problem
if it _stops_ being a numeric character reference, owing to some part of
a supposedly transparent CMS changing it into the literal).
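As a sketch of why the reference survives any encoding, here's a Python
fragment that pushes the same markup through several encodings and
resolves the reference afterwards. html.unescape is only standing in
for what a browser does; the sample text is made up.

```python
import html

# The reference itself is pure ASCII, so every encoding can carry it.
markup = "It&#8217;s"

decoded = set()
for enc in ("ascii", "iso-8859-1", "windows-1252", "utf-8"):
    data = markup.encode(enc)                     # bytes on the wire
    decoded.add(html.unescape(data.decode(enc)))  # what the reader gets

# The literal character, by contrast, has no ISO-8859-1 representation.
try:
    "It\u2019s".encode("iso-8859-1")
    literal_fits_latin1 = True
except UnicodeEncodeError:
    literal_fits_latin1 = False

print(decoded, literal_fits_latin1)
```

Every encoding yields the same single result. That is also the CMS
hazard: once something swaps the reference for the literal, the
ISO-8859-1 case breaks.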
If you're going to use a literal character, then use it. It's probably
simplest, you just have to track that you've labelled it with an
appropriate encoding.
Personally my strong recommendation is to do this, and additionally to
_only_ use UTF-8 for _all_ of your content. It's easier to manage than
allowing variation. Expunge the ISO-8859-* encodings and the Windows
encodings.
Although 0x92 is a fine character to use as a literal in a
windows-1252 encoding (transcoding will map it for you) it's generally
a bad idea to use the arcane and obsolete windows encodings anyway.
Using 0x92 as a numeric character reference (i.e. &#146;) in a
windows-1252 encoding is ugly. It's wrong according to the standard
(146 is not the Unicode code point for this character; U+0092 is a C1
control) and you're relying on browser fix-ups to make it work. It
probably will work, but why do it? If you _want_ a numeric character
reference for "right single quote", &#8217; is the correct one to use.
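You can watch the fix-up happen with Python's html.unescape, which is
documented as following the HTML 5 rules for invalid references as well
as valid ones. Treat it as an approximation of browser behaviour, not a
guarantee about any particular browser.

```python
import html
import unicodedata

correct = html.unescape("&#8217;")   # the real Unicode code point
fixed_up = html.unescape("&#146;")   # windows-1252 value, repaired by the parser

# What code point 146 really is in Unicode: a C1 control character.
category_of_146 = unicodedata.category(chr(146))

print(repr(correct), repr(fixed_up), category_of_146)
```

Both come out as the right single quote, but the second only because
the parser quietly substitutes the windows-1252 mapping.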