Ok, let me summarize what we have said so far (thanks to everyone for
helping me better understand the limitations of po files and the goals of
the encodings).
Here are the conditions we have to fulfil:
- msgids and msgstrs must share the same encoding
- msgids should only be ascii or utf-8
- ascii is preferred over utf-8 by translators
And here's a proposal for the processes:
* Handling the master document (in gettextize, translate and update):
- If a charset is specified on the command line, convert from that to
utf-8 (and set the po charset to utf-8)
- Else, if the format module can detect the encoding from the document,
convert from this to utf-8 (and set the po charset to utf-8)
- If the file encoding can't be determined, assume it's ascii and don't
convert anything (and set the po charset to something invalid, so that
the translator has to set it)
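
To make the three cases above concrete, here is a rough sketch (in Python,
purely for illustration). The function name master_to_po and the
placeholder value "CHARSET" are my own assumptions, not existing code; the
placeholder only stands for "something invalid".

```python
from typing import Optional, Tuple

# Hypothetical marker for "the translator still has to fill this in".
UNKNOWN_CHARSET = "CHARSET"


def master_to_po(raw: bytes,
                 cli_charset: Optional[str] = None,
                 detected_charset: Optional[str] = None) -> Tuple[bytes, str]:
    """Return (bytes to store in the po, po charset) for a master document."""
    # The command line wins; otherwise use what the format module detected.
    source = cli_charset or detected_charset
    if source is not None:
        # Known encoding: recode to utf-8 and declare utf-8 in the po.
        return raw.decode(source).encode("utf-8"), "utf-8"
    # Unknown encoding: assume ascii, convert nothing, leave the charset unset.
    return raw, UNKNOWN_CHARSET
```

For example, master_to_po(b"hello") would leave the bytes untouched and
return the placeholder charset, while passing cli_charset="latin-1" would
recode the content to utf-8.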
* Handling the input translated document (in gettextize):
- If the master document's charset is ascii (not specified in the po), we
should leave the translated document in its own charset (the one given
on the command line or the one detected by the format module; if neither
is available, stop the process), and set the po charset to it.
- If the master document's charset is utf-8, we should convert from the
translated document's charset (given on the command line or detected by
the format module) to utf-8.
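
The same kind of sketch for the translated input document. Again, the
function name and the "CHARSET" placeholder are only assumptions for
illustration. I also assume that the command line takes precedence over the
detected charset, and that we stop in the utf-8 case too when no charset is
known (the rules above only say so for the ascii case).

```python
from typing import Optional, Tuple


def translated_to_po(raw: bytes,
                     master_po_charset: str,
                     cli_charset: Optional[str] = None,
                     detected_charset: Optional[str] = None) -> Tuple[bytes, str]:
    """Return (bytes to store in the po, po charset) for a translated document."""
    source = cli_charset or detected_charset
    if source is None:
        # Nothing specified and nothing detected: stop the process.
        raise ValueError("cannot determine the charset of the translated document")
    if master_po_charset.lower() == "utf-8":
        # Master is utf-8: msgstrs must be utf-8 as well, so recode.
        return raw.decode(source).encode("utf-8"), "utf-8"
    # Master is ascii (po charset still unset): keep the translation as it is
    # and record its charset in the po.
    return raw, source
```

Either way msgids and msgstrs end up sharing one encoding, since ascii
msgids are valid in any ascii-compatible charset chosen for the msgstrs.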
* Handling the output translated document (in translate):
- Use the charset specified on the command line, or the po file's charset
if nothing is specified.
* Handling the addendum (in translate):
- It should be converted from the charset specified on the command line
(mandatory) to the output document charset determined in the previous
point.
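
The last two points (output charset and addendum) could look roughly like
this. Both function names are hypothetical, and this is only a sketch of the
rule, not a proposal for the actual code.

```python
from typing import Optional


def output_charset(po_charset: str, cli_charset: Optional[str] = None) -> str:
    """Charset of the translated output: command line first, else the po's."""
    return cli_charset or po_charset


def recode_addendum(raw: bytes, addendum_charset: str, out_charset: str) -> bytes:
    """Recode the addendum from its (mandatory) charset to the output charset."""
    # encode() raises UnicodeEncodeError if the output charset cannot
    # represent a character of the addendum.
    return raw.decode(addendum_charset).encode(out_charset)
```

Note that recoding the addendum can fail when the output charset is narrower
than the addendum's (for instance utf-8 to iso-8859-1), so the real code
would need to decide what to do in that case.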
Did I miss something? Am I wrong on any of these points?
Oh, and one last question for now: should we recode everything or just the
translated strings (assuming that's the only place where there can be
encoding issues...)?
Regards,
Jordi Vilalta