On Tue, Aug 03, 2004 at 03:40:42PM -0700, Martin Quinson wrote:
[...]
> I'm ok with being pedantic here, too. This approach would fit me:
> For the master:
>  - if no encoding specified, assume it is UTF-8
If you run "xgettext --from-code=UTF-8", no other charset can be used
for PO files, and translators may dislike being forced to use this
charset without any good reason.
I much prefer assuming ASCII by default (then UTF-8 if a fallback is
needed).
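
To make the order concrete, here is a rough sketch of the detection I
have in mind (Python, purely illustrative, and the function name is
made up). It also covers the "refuse to process" case quoted below:

    # Proposed order: accept pure ASCII silently, fall back to UTF-8,
    # and refuse anything else until the user names the encoding.
    def guess_master_encoding(raw: bytes) -> str:
        try:
            raw.decode("ascii")
            return "ascii"
        except UnicodeDecodeError:
            pass
        try:
            raw.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            raise SystemExit("neither ASCII nor valid UTF-8, "
                             "please specify the encoding explicitly")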
>  - if it's not valid UTF8, refuse to process until being given what
>    it is
> For translations:
>  - if not specified, suppose it's the same as the one in the
>    translated part of the po file
There is a problem I did not think about before: a few English man
pages contain non-ASCII characters, like euro-test in Debian. PO files
then have to be UTF-8 encoded, and generated man pages will also be
UTF-8 encoded, which is not the expected result, at least in Debian.
The easy solution is to use escape sequences (see groff_char(7))
instead of ISO-8859-1 characters, and hope that a similar solution
is always available. Then documentation should clearly state which
encodings can be used for original documents, depending on their
format.
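
For man pages the substitution could look like this (only a sketch;
the glyph names come from groff_char(7), the helper itself is
hypothetical):

    # Replace some ISO-8859-1 characters with named glyphs from
    # groff_char(7) so that the source stays pure ASCII.
    GROFF_GLYPHS = {
        "\u00e9": r"\['e]",   # e acute
        "\u00e8": r"\[`e]",   # e grave
        "\u00e7": r"\[,c]",   # c cedilla
        "\u20ac": r"\[Eu]",   # euro sign
    }

    def to_ascii_groff(text: str) -> str:
        return "".join(GROFF_GLYPHS.get(ch, ch) for ch in text)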
>  - could be cool if we could check that the encoding is not broken,
>    but I'm not sure whether it's even possible.
Double conversion from ISO-8859-1 to UTF-8 is a common error and seems
pretty hard to diagnose.
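
That case at least can be spotted with a heuristic (my own sketch,
not an existing tool): undoing one ISO-8859-1 to UTF-8 conversion
should not leave valid UTF-8 behind, so when it does, the text was
probably converted twice.

    # Heuristic: if undoing one ISO-8859-1 -> UTF-8 conversion still
    # yields valid UTF-8, the input was probably converted twice.
    # False positives are possible; this is only a sketch.
    def looks_double_encoded(raw: bytes) -> bool:
        if raw.isascii():
            return False  # pure ASCII cannot be double-encoded
        try:
            once = raw.decode("utf-8")         # suspected double encoding
            inner = once.encode("iso-8859-1")  # undo one conversion
            inner.decode("utf-8")              # still valid UTF-8?
        except (UnicodeDecodeError, UnicodeEncodeError):
            return False
        return True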
>  - during gettextization, assume it's UTF8 if no encoding is
>    provided, whine for a proper setting if it's not the case
> For po files:
>  - msgid must be in UTF8, no matter what happens.
>  - msgstrs have to be in the encoding specified in the po file
>    headers.
No, msgids and msgstrs must share the same encoding, which is why UTF-8
is the only sane encoding if msgids contain non-ASCII characters.
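
The PO header declares a single charset for the whole file, and it
applies to msgids and msgstrs alike:

    msgid ""
    msgstr ""
    "Content-Type: text/plain; charset=UTF-8\n"

There is no way to declare a separate charset for the msgids, so a
file whose msgids contain non-ASCII characters can only use UTF-8.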
Denis