On Wed, Aug 04, 2004 at 02:46:41PM +0200, Jordi Vilalta wrote:
Ok, let me summarize what we have said so far (thanks to everyone for
helping me better understand the limitations of PO files and the goals
of the encodings).
Here are the conditions we have to fulfil:
- msgids and msgstrs must share the same encoding
- msgids should only be ascii or utf-8
- ascii is preferred over utf-8 by translators
Fully right.
And here's a proposal of the processes:
* Handling the master document (in gettextize, translate and update):
- If a charset is specified in the command-line, convert from that to
utf-8 (and set the po charset to utf-8)
- Else, if the format module can detect the encoding from the document,
convert from this to utf-8 (and set the po charset to utf-8)
No, it must be ASCII by default because 'ascii is preferred over utf-8
by translators'.
- If nothing can determine the file encoding, assume it's in ascii and
don't convert anything (and set the po charset to something invalid, so
that the translator can set it)
If the master file contains non-ASCII characters, one can check whether
it is UTF-8 encoded. In that case, lib/Locale/Po4a/Po.pm has to write
"Content-Type: text/plain; charset=UTF-8\n"
instead of
"Content-Type: text/plain; charset=CHARSET\n"
in the POT file. If translated PO files already exist, they have to
be converted to UTF-8 so that they can be merged with the POT file.
If the master file is neither ASCII nor UTF-8 encoded, po4a-gettextize
must abort, because this has to be fixed by maintainers, not translators.
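
To make the check above concrete, here is a rough sketch in Python (po4a
itself is Perl, so this is only an illustration, not a patch). The function
names detect_master_charset and content_type_header are made up for this
example; the only things taken from the discussion are the three outcomes
(ASCII keeps the CHARSET placeholder, valid UTF-8 gets charset=UTF-8,
anything else aborts).

```python
def detect_master_charset(data: bytes) -> str:
    """Return the charset value to put in the POT header.

    Hypothetical helper: pure ASCII keeps the 'CHARSET' placeholder,
    valid UTF-8 yields 'UTF-8', anything else is rejected.
    """
    try:
        data.decode("ascii")
        return "CHARSET"  # nothing to declare, translators fill it in later
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Assumption: the caller turns this into an abort of the tool.
        raise ValueError("master document is neither ASCII nor UTF-8; "
                         "this has to be fixed by the maintainer")


def content_type_header(charset: str) -> str:
    """Format the Content-Type line as it appears in a PO/POT header."""
    return f'"Content-Type: text/plain; charset={charset}\\n"'
```

For example, content_type_header(detect_master_charset(data)) gives the
UTF-8 line for a UTF-8 master and the CHARSET line for a plain ASCII one.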
* Handling the input translated document (in gettextize):
- If the master document's charset is ascii (not specified in the po), we
should let the translated document remain in the specified charset (in
the command line or the format module's detected one (if nothing
detected, stop the process)), and set the po charset to it.
- If the master document's charset is utf-8, we should convert from the
specified charset (in the command line or the format module's detected
one) to utf-8.
Fine by me, but this seems to contradict your previous paragraph,
because you said that if no charset is specified, the PO file is UTF-8
encoded ;)
In the first case, the PO charset can stay unspecified until the
translator fixes it. The second case is troublesome: msgstrs really have
to be recoded into UTF-8, otherwise the PO file is pretty useless, since
this conversion cannot be performed afterwards. Maybe po4a-gettextize
should abort too.
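
The two cases can be sketched as follows, again in Python and only as an
illustration. The function gettextize_translation and its arguments are
hypothetical; "abort" is modelled as an exception, and an unknown charset
is represented by None.

```python
from typing import Optional, Tuple


def gettextize_translation(raw: bytes,
                           master_charset: str,
                           translation_charset: Optional[str]
                           ) -> Tuple[bytes, Optional[str]]:
    """Return (msgstr bytes, PO charset) for a translated document.

    master_charset is 'ascii' or 'utf-8'; translation_charset comes from
    the command line or the format module, or is None if unknown.
    """
    if master_charset == "ascii":
        # First case: keep the translation as-is. The PO charset may stay
        # unspecified (None) until the translator fixes it.
        return raw, translation_charset
    # Second case: the master is UTF-8, so msgstrs must be recoded now.
    if translation_charset is None:
        # Assumption: without a source charset the recoding is impossible,
        # so the tool gives up instead of writing a useless PO file.
        raise ValueError("cannot recode translation to UTF-8: "
                         "charset unknown")
    return raw.decode(translation_charset).encode("utf-8"), "utf-8"
```

Note that the recoding step will also fail loudly if the bytes do not
match the declared charset, which is probably what we want here.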
* Handling the output translated document (in translate):
- Use the charset specified in the command line, or the po file's charset
if nothing specified.
Ok.
* Handling the addendum (in translate):
- It should be converted from the specified charset in the command line
(mandatory) to the output document charset determined in the point
above.
Ok.
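
For the record, the output and addendum rules amount to something like the
sketch below (Python, for illustration only; output_charset and
recode_addendum are invented names, not po4a functions).

```python
from typing import Optional


def output_charset(cli_charset: Optional[str], po_charset: str) -> str:
    """Command-line charset wins; otherwise fall back to the PO charset."""
    return cli_charset if cli_charset else po_charset


def recode_addendum(raw: bytes, addendum_charset: str,
                    out_charset: str) -> bytes:
    """Convert the addendum from its (mandatory) charset to the output one.

    Raises UnicodeEncodeError if the addendum contains characters the
    output charset cannot represent.
    """
    return raw.decode(addendum_charset).encode(out_charset)
```

So an addendum given as UTF-8 and an output charset of iso-8859-1 works as
long as every character of the addendum exists in iso-8859-1.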
Did I miss something? Am I wrong in some points?
Sounds good.
Oh, and one last question for now: should we recode everything or just the
translated strings (assuming that's the only place where there can be
encoding issues...)?
The safest solution is to allow only ASCII-encoded non-translatable
material, and see if there are complaints.
Denis