[tex-live] Re: UTF-8 support
Vladimir Volovich
vvv@vsu.ru
Wed, 22 Jan 2003 16:35:02 +0300
"PO" == Petr Olsak writes:
PO> UTF-8 characters are interpreted by TeX as a sequence of
PO> commands, so don't use calls like \macro ä instead of \macro{ä}.
>> it is always good style to delimit macro arguments with braces
PO> It means that I won't completely switch all my old documents to
PO> UTF-8 because problems can occur! On the other hand, encTeX
PO> is a really robust solution.
>> with encTeX, the expansion of a multibyte UTF-8 character need not
>> be a single letter either; it can be a sequence of several tokens
>> (e.g. a call to a macro), so encTeX suffers from exactly the same
>> "problem": you can't be sure that one UTF-8 character in the input
>> file will be one token,
PO> NO! Please, read the encTeX manual before this discussion.
OK, sorry, - i didn't read it carefully enough...
nevertheless, i still see a problem:
encTeX cannot define all possible UTF-8 characters (there are far too
many of them), so some valid UTF-8 character in an input file will
not be translated at all. thus \macro ä MAY still be processed
incorrectly by encTeX: if the multibyte UTF-8 sequence for ä was not
defined in encTeX, \macro will get the first byte of the UTF-8
representation of ä as its argument instead of the whole character
(just the same effect as the one mentioned in the ucs package).
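The first-byte effect can be reproduced without encTeX. The following is
only a minimal sketch: it assumes an 8-bit engine such as pdfLaTeX and the
standard inputenc utf8 option (not the ucs package or encTeX discussed
here), and \mymacro is a hypothetical stand-in for \macro.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
% \mymacro is a hypothetical one-argument macro standing in for \macro
\newcommand{\mymacro}[1]{[#1]}
\begin{document}
% Braced argument: both bytes of the UTF-8 sequence for ä are passed together.
\mymacro{ä}

% Unbraced argument: on an 8-bit engine TeX takes only the next token, which
% is the first byte of ä. The line is left commented out because it is
% expected to raise an error or produce garbage.
% \mymacro ä
\end{document}
```

On engines that read UTF-8 natively (XeTeX, LuaTeX) ä is a single token, so
the unbraced call behaves differently there.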
with the ucs package (a pure TeX solution) you can at least generate
sensible warnings for undefined UTF-8 characters which may occur in
input files, without defining all 2^31 characters. this is not
achievable with encTeX: characters which were left undefined and
which appear in input files will fail horribly in encTeX without any
warning.
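As an illustration of such a diagnostic in a pure LaTeX setup, here is a
minimal sketch. It assumes the standard inputenc utf8 option rather than
the ucs package mentioned above, and the replacement text [snowman] is an
arbitrary placeholder. Without the \DeclareUnicodeCharacter line, LaTeX
stops with a message that identifies the undefined character instead of
silently typesetting stray bytes.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
% Define one extra character on demand. U+2603 and the replacement text
% are placeholders chosen for this sketch only.
\DeclareUnicodeCharacter{2603}{[snowman]}
\begin{document}
A character that is not predefined: ☃
\end{document}
```

Only the characters a document actually uses need a definition; everything
else is caught when it is first encountered.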
also, as far as i understand, encTeX is a very limited solution: it
mostly assumes that one uses a single text font encoding (e.g. T1)
throughout the document (just like TCX), and thus it does not provide
a solution for really multilingual UTF-8 documents.
if i'm wrong, please correct me: give an example of how one can use
encTeX to support e.g. the T1 and T2A font encodings (for e.g. French
and Cyrillic) in the same document. i think that there will be
problems because the same slot numbers in the T1 and T2A encodings
contain completely different glyphs, and reverse mappings will not
work correctly.
a pure (La)TeX solution works just fine, e.g. the ucs package or my
small utf-8 input encoding support at
CTAN:macros/latex/contrib/supported/t2/etc/utf-8
(see e.g. the multilingual example file in that directory)
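For readers who want to see the shape of such a document, here is a
minimal sketch of T1 and T2A used side by side with UTF-8 input. It is not
the example file from the CTAN directory above: it assumes the standard
inputenc utf8 option and that T2A-encoded fonts are installed. Each input
character is first mapped to an encoding-independent LaTeX command, and
only then to a slot in whichever font encoding is current.

```latex
\documentclass{article}
\usepackage[T2A,T1]{fontenc}   % T1 is listed last, so it is the default
\usepackage[utf8]{inputenc}
\begin{document}
% Typeset in T1:
Voilà un café très agréable.

% Typeset in T2A; the group keeps the encoding switch local.
{\fontencoding{T2A}\selectfont Привет, мир!}
\end{document}
```

In practice a language package would normally select the font encoding; the
explicit \fontencoding call is used here only to make the switch visible.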
PO> The second example: You have written that \write files include
PO> only the \'A notation of characters in LaTeX. Do you know of
PO> documents where you have to re-read the \write files in verbatim
PO> mode? I know such documents. What happens in LaTeX in that
PO> situation?
>> nothing bad - it is very well possible to write to files in LaTeX
>> using the ASCII LICR representation, and then read the files back:
>> you'll need to translate \ into, say, \textbackslash, and
>> characters like Á to \'A (which is a native representation in
>> LaTeX); then, when you read the file back, all will be correct:
>> * Á will be written as \'A, and read back as Á
>> * \'A will be written as \textbackslash 'A, and read back as \'A
>> so the verbatim representation will be preserved. (the fancyvrb
>> package contains a lot of this kind of machinery)
PO> The "\textbackslash dance" will help you if the native verbatim
PO> environment is used. But if you first set all \catcodes to 12
PO> (including backslash) and then \input the external file, no
PO> \textbackslash will help you.
PO> Sorry, I am not a TeX novice, I _know_ what I am saying. The
PO> LaTeX solution of UTF-8 encoding is not robust.
sorry, i don't understand your reasoning... are you saying that it is
impossible to achieve some effect with writing to files and verbatim
reading of files from TeX, using purely TeX machinery?
if so, could you describe what it is? you are not forced to change
the backslash's catcode to 12 when reading or writing files; nothing
prevents you from preserving the original catcode when reading.
i.e., what does encTeX buy us WRT verbatim that could not be achieved
without any extensions to TeX? could you give a small example?
Best,
v.