[Po4a-devel]Encoding options


Jordi Vilalta

Tuesday, 3 August 2004

8:24 a.m.

Hi,

Today I had a quick look around the charset options and I got some questions:

I haven't seen any named option for choosing the po file charset. Should we have one (-P, --po-charset), or should we handle all the po files with the same charset?

po4a-translate doesn't have the option to select the localized file charset. Should we put it? Or may it take it from the po file?

What should the defaults point to? iso-8859-1? utf-8? Something else?

I'll try to put my hands on TransTractor and try to see how to integrate all this. Hopefully it won't hurt too much :)

Regards,
Jordi Vilalta


Denis Barbier

Tuesday, 3 August

3:54 p.m.

On Tue, Aug 03, 2004 at 03:24:13PM +0200, Jordi Vilalta wrote:

...

PO files declare their encoding in their header field, so this option is only relevant when generating PO files. This is done only once, and whoever generates PO files in this context often does not know which encoding is used, so it is better to have an invalid charset, as in

    "Content-Type: text/plain; charset=CHARSET\n"

and let translators put the right value.

...

> po4a-translate doesn't have the option to select the localized file charset. Should we put it? Or may it take it from the po file?
>
> What should the defaults point to? iso-8859-1? utf-8? Something else?

ISO-8859-1 is culturally biased; the only choices are UTF-8 and the charset of the PO file, so an --encoding=po|utf8 option should be sufficient. I have no opinion about the default value.

Denis
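To make the header convention above concrete, here is a minimal Python sketch of reading the charset a PO file declares, treating the unfilled CHARSET placeholder as "not declared yet". The function name and the regular expression are illustrative assumptions; this is not po4a's code (po4a is written in Perl).

```python
import re
from typing import Optional

# Matches the charset parameter of the Content-Type header field.
_CHARSET_RE = re.compile(
    r"Content-Type:[^\n]*?charset=([A-Za-z0-9_.:-]+)", re.IGNORECASE
)

# Template value that translators are expected to replace.
PLACEHOLDER = "CHARSET"


def po_declared_charset(header: str) -> Optional[str]:
    """Return the charset declared in a PO header, or None when the
    field is missing or still holds the CHARSET placeholder."""
    match = _CHARSET_RE.search(header)
    if match is None:
        return None
    value = match.group(1)
    return None if value == PLACEHOLDER else value
```

A caller resolving the proposed "po" choice would use the returned value and could refuse to continue when it is None.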

Jordi Vilalta

4:46 p.m.

On Tue, 3 Aug 2004, Denis Barbier wrote:

...

> On Tue, Aug 03, 2004 at 03:24:13PM +0200, Jordi Vilalta wrote:
> > Hi,
> >
> > Today I had a quick look around the charset options and I got some
> > questions:
> >
> > I haven't seen any named option for choosing the po file charset. Should
> > we have one (-P, --po-charset), or should we handle all the po files with
> > the same charset?
>
> PO files declare their encoding in their header field, so this option is only relevant when generating PO files. This is done only once, and whoever generates PO files in this context often does not know which encoding is used, so it is better to have an invalid charset, as in
>   "Content-Type: text/plain; charset=CHARSET\n"
> and let translators put the right value.

Ok, good point of view. The other part is the charset of the msgids, because they have to be converted from the master file's charset to something. What should it be?

...

> > po4a-translate doesn't have the option to select the localized file
> > charset. Should we put it? Or may it take it from the po file?
> >
> > What should the defaults point to? iso-8859-1? utf-8? Something else?
>
> ISO-8859-1 is culturally biased, the only choices are UTF-8 and the charset of the PO file, so a --encoding=po|utf8 should be sufficient. I have no opinion about the default value.

Well, here I meant (for example) when gettextizing: if no charset is specified for the master file on the command line, and the format module cannot determine which encoding it is using, we should convert it from something (the default?) to the po file encoding.

Uhmmm, now I've thought that maybe there should be no recoding between the master file and the po msgids. Am I right?

I think I need some sleep :P

Regards,
Jordi Vilalta

Denis Barbier

5:29 p.m.

On Tue, Aug 03, 2004 at 11:46:50PM +0200, Jordi Vilalta wrote: [...]

...

> > PO files declare their encoding in their header field, so this option
> > is only relevant when generating PO files. This is done only once, and
> > whoever generates PO files in this context often does not know which
> > encoding is used, so it is better to have an invalid charset, as in
> >   "Content-Type: text/plain; charset=CHARSET\n"
> > and let translators put the right value.
>
> Ok, good point of view. The other part is the charset of the msgids, because they have to be converted from the master file's charset to something. What should it be?

By default, ASCII, and UTF-8 if they contain non-ASCII characters. IIRC xgettext only accepts --from-code=UTF-8 as other values do not make sense because the encoding stored in the POT file must be compatible with all other encodings.

...

> > > po4a-translate doesn't have the option to select the localized file
> > > charset. Should we put it? Or may it take it from the po file?
> > >
> > > What should the defaults point to? iso-8859-1? utf-8? Something else?
> >
> > ISO-8859-1 is culturally biased, the only choices are UTF-8 and the
> > charset of the PO file, so a --encoding=po|utf8 should be sufficient.
> > I have no opinion about the default value.
>
> Well, here I meant (for example) when gettextizing: if no charset is specified for the master file on the command line, and the format module cannot determine which encoding it is using, we should convert it from something (the default?) to the po file encoding.
>
> Uhmmm, now I've thought that maybe there should be no recoding between the master file and the po msgids. Am I right?
>
> I think I need some sleep :P

It may be so, I do not follow you ;)

You were talking about po4a-translate and the localized file charset, and now about gettextizing the master file. In the latter case, if the master file contains only ASCII, no conversion is performed. Otherwise it has to be recoded into UTF-8, and there is indeed a problem if the original charset is not specified. One could check whether it is UTF-8, and fall back to ISO-8859-1 otherwise, but unspecified encodings really suck, so let's be pedantic and force those people to declare their encoding. After all, they know the encoding used in their English documentation, so they can add the right options to po4a tools.

Denis
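The strict policy Denis describes for the master file can be sketched in Python as follows. This is only an illustration under stated assumptions: the function and exception names are hypothetical, and nothing here is taken from po4a itself.

```python
from typing import Optional


class UndeclaredEncodingError(Exception):
    """A non-ASCII master file was given without a declared charset."""


def master_to_utf8(data: bytes, declared: Optional[str] = None) -> bytes:
    """Return master file bytes ready for msgid extraction.

    Pure ASCII input is returned unchanged. Anything else must come
    with a declared charset and is recoded into UTF-8.
    """
    try:
        data.decode("ascii")
        return data  # ASCII only: no conversion is performed
    except UnicodeDecodeError:
        pass
    if declared is None:
        # Be pedantic: do not guess between UTF-8 and ISO-8859-1.
        raise UndeclaredEncodingError(
            "master file contains non-ASCII bytes; declare its charset"
        )
    return data.decode(declared).encode("utf-8")
```

Raising an error instead of guessing keeps the decision with the document maintainers, who know which encoding they used.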

Martin Quinson

5:40 p.m.

Thanks for your work on encoding issues, dudes, I really suck at that.

On Wed, Aug 04, 2004 at 12:29:54AM +0200, Denis Barbier wrote:

...

On Tue, Aug 03, 2004 at 11:46:50PM +0200, Jordi Vilalta wrote:

...

> You were talking about po4a-translate and the localized file charset, and now about gettextizing the master file. In the latter case, if the master file contains only ASCII, no conversion is performed. Otherwise it has to be recoded into UTF-8, and there is indeed a problem if the original charset is not specified. One could check whether it is UTF-8, and fall back to ISO-8859-1 otherwise, but unspecified encodings really suck, so let's be pedantic and force those people to declare their encoding. After all, they know the encoding used in their English documentation, so they can add the right options to po4a tools.

I'm ok with being pedantic here, too. This approach would fit me:

For the master:
 - if no encoding is specified, it is supposed to be UTF8
 - if it's not valid UTF8, refuse to process until being told what it is

For translations:
 - if not specified, suppose it's the same as the one in the translated part of the po file
 - it could be cool if we could check that the encoding is not broken, but I'm not sure whether it's even possible
 - during gettextization, assume it's UTF8 if no encoding is provided, and whine for a proper setting if it's not the case

For po files:
 - msgid must be in UTF8. No matter what happens.
 - msgstr has to be in the encoding specified in the po file headers.

And once all this is implemented, we could stop assuming that master-document = english-document ;)

Again, I have no definitive idea of how all this should work; all this is merely a proposal.

Thanks, Mt.

-- 
In deepest France, there are mostly speleologists.
   -- Le Chat

Denis Barbier

Wednesday, 4 August

1:28 a.m.

On Tue, Aug 03, 2004 at 03:40:42PM -0700, Martin Quinson wrote: [...]

...

> I'm ok with being pedantic here, too. This approach would fit me:
>
> For the master:
>  - if no encoding is specified, it is supposed to be UTF8

If you run "xgettext --from-code=UTF-8", no other charset can be used for PO files, and translators may dislike being forced to use this charset without any good reason. I much prefer assuming ASCII by default (then UTF-8 if a fallback is needed).

...

>  - if it's not valid UTF8, refuse to process until being told what it is
>
> For translations:
>  - if not specified, suppose it's the same as the one in the translated part of the po file

There is a problem I did not think about before: a few English man pages contain non-ASCII characters, like euro-test in Debian. PO files then have to be UTF-8 encoded, and generated man pages will also be UTF-8 encoded, which is not the expected result, at least in Debian. The easy solution is to use escape sequences (see groff_char(7)) instead of ISO-8859-1 characters, and hope that a similar solution is always available. The documentation should then clearly state which encodings can be used for original documents, depending on their format.

...

>  - could be cool if we could check that the encoding is not broken, but I'm not sure whether it's even possible.

Double conversion from ISO-8859-1 to UTF-8 is a common error and seems pretty hard to diagnose.

...

>  - during gettextization, assume it's UTF8 if no encoding is provided, and whine for a proper setting if it's not the case
>
> For po files:
>  - msgid must be in UTF8. No matter what happens.
>  - msgstr has to be in the encoding specified in the po file headers.

No, msgids and msgstrs must share the same encoding, which is why UTF-8 is the only sane encoding if msgids contain non-ASCII characters.

Denis
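The double conversion mentioned in this message is hard to diagnose because its result is still perfectly valid UTF-8. The Python sketch below reproduces the mistake and shows one possible heuristic for undoing it. Both function names are made up for illustration, and the heuristic can give false positives on text that legitimately contains sequences such as "Ã©".

```python
from typing import Optional


def double_encode(text: str) -> str:
    """Simulate the mistake: UTF-8 bytes are misread as ISO-8859-1,
    so a later conversion to UTF-8 encodes them a second time."""
    return text.encode("utf-8").decode("iso-8859-1")


def try_undo_double_encoding(text: str) -> Optional[str]:
    """Return the repaired text if `text` looks doubly converted,
    otherwise None. This is a heuristic, not a proof."""
    try:
        repaired = text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None  # the round trip failed: not a double conversion
    return repaired if repaired != text else None
```

Since the garbled string encodes to valid UTF-8, a plain validity check on the file cannot detect the problem, which matches the concern raised above.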

Pierre Machard

2:57 a.m.

Hi,

On Wed, Aug 04, 2004 at 08:28:22AM +0200, Denis Barbier wrote: [...]

...

> >  - if it's not valid UTF8, refuse to process until being told what it is
> >
> > For translations:
> >  - if not specified, suppose it's the same as the one in the translated part
> >    of the po file
>
> There is a problem I did not think about before: a few English man pages contain non-ASCII characters, like euro-test in Debian. PO files then have to be UTF-8 encoded, and generated man pages will also be UTF-8 encoded, which is not the expected result, at least in Debian. The easy solution is to use escape sequences (see groff_char(7)) instead of ISO-8859-1 characters, and hope that a similar solution is always available. The documentation should then clearly state which encodings can be used for original documents, depending on their format.

That was the initial reason for filing this bug (about the non-existing -M). My po-files were in UTF-8 but I wanted ISO-8859-1 for manpages.

Perhaps the best thing would be to remove the -M option. After all, one can still use msgconv.

Cheers,
-- 
Pierre Machard <pmachard(a)debian.org> http://debian.org
GPG: 1024D/23706F87 : B906 A53F 84E0 49B6 6CF7 82C2 B3A0 2D66 2370 6F87
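For readers unfamiliar with what such a recoding step involves, here is a rough Python sketch of converting a PO file from one charset to another while keeping its header consistent. It only illustrates the idea; it is not how msgconv is implemented, and the function name is hypothetical. It assumes the header entry comes first in the file, so the first "charset=" occurrence is the header's.

```python
import re


def recode_po(data: bytes, source: str, target: str) -> bytes:
    """Recode PO file bytes from `source` to `target` and update the
    charset declared in the header.

    Raises UnicodeEncodeError if some character cannot be represented
    in the target charset.
    """
    text = data.decode(source)
    # Rewrite only the first charset declaration (assumed to be the header).
    text = re.sub(
        r"(charset=)[A-Za-z0-9_.:-]+",
        lambda m: m.group(1) + target,
        text,
        count=1,
    )
    return text.encode(target)
```

In practice the gettext tool itself is the safer choice; msgconv provides a --to-code option for this purpose.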

Jordi Vilalta

7:46 a.m.

Ok, let me summarize what we have said until now (thanks everyone for helping me understand better the limitations of the po files and the objectives of the encodings).

Here are the conditions we have to fulfil:

 - msgids and msgstrs must share the same encoding
 - msgids should only be ascii or utf-8
 - ascii is preferred over utf-8 by translators

And here's a proposal of the processes:

* Handling the master document (in gettextize, translate and update):

 - If a charset is specified in the command-line, convert from that to utf-8 (and set the po charset to utf-8)
 - Else, if the format module can detect the encoding from the document, convert from this to utf-8 (and set the po charset to utf-8)
 - If nothing can determine the file encoding, assume it's in ascii and don't convert anything (and set the po charset to something invalid, so that the translator can set it)

* Handling the input translated document (in gettextize):

 - If the master document's charset is ascii (not specified in the po), we should let the translated document remain in the specified charset (in the command line or the format module's detected one (if nothing detected, stop the process)), and set the po charset to it.
 - If the master document's charset is utf-8, we should convert from the specified charset (in the command line or the format module's detected one) to utf-8.

* Handling the output translated document (in translate):

 - Use the charset specified in the command line, or the po file's charset if nothing specified.

* Handling the addendum (in translate):

 - It should be converted from the specified charset in the command line (mandatory) to the output document charset determined in the point above.

Did I miss something? Am I wrong in some points?

Oh, and one last question for now: should we recode everything or just the translated strings (assuming that's the only place where there can be encoding issues...)?

Regards,
Jordi Vilalta
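The rules in this summary for translated documents and addenda are simple enough to sketch in Python. The sketch follows the summary as written, before any corrections made later in the thread. All function and parameter names are hypothetical.

```python
from typing import Optional, Tuple


def gettextize_translation(
    translated: bytes, translated_charset: Optional[str], master_is_ascii: bool
) -> Tuple[bytes, str]:
    """Input translated document (gettextize): return its bytes and
    the charset to declare in the PO file."""
    if translated_charset is None:
        # Neither the command line nor the format module gave a charset.
        raise ValueError("translated document charset is unknown")
    if master_is_ascii:
        # ASCII msgids: keep the translation in its own charset.
        return translated, translated_charset
    # UTF-8 msgids: msgstrs must share that encoding.
    return translated.decode(translated_charset).encode("utf-8"), "UTF-8"


def output_document_charset(cli_charset: Optional[str], po_charset: str) -> str:
    """Output translated document (translate): the command line wins,
    otherwise use the PO file's charset."""
    return cli_charset if cli_charset is not None else po_charset


def recode_addendum(addendum: bytes, addendum_charset: str, out_charset: str) -> bytes:
    """Addendum (translate): convert from its mandatory declared
    charset to the output document charset."""
    return addendum.decode(addendum_charset).encode(out_charset)
```

Each function maps to one bullet group of the summary, which makes it easy to see where the master document rules would need to plug in.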

Denis Barbier

2:44 p.m.

On Wed, Aug 04, 2004 at 02:46:41PM +0200, Jordi Vilalta wrote:

...

Fully right.

...

> And here's a proposal of the processes:
>
> * Handling the master document (in gettextize, translate and update):
>
>  - If a charset is specified in the command-line, convert from that to utf-8 (and set the po charset to utf-8)
>  - Else, if the format module can detect the encoding from the document, convert from this to utf-8 (and set the po charset to utf-8)

No, it must be ASCII by default because 'ascii is preferred over utf-8 by translators'.

...

>  - If nothing can determine the file encoding, assume it's in ascii and don't convert anything (and set the po charset to something invalid, so that the translator can set it)

If the master file contains non-ASCII characters, one can check whether it is UTF-8 encoded. In such a case, lib/Locale/Po4a/Po.pm has to write

    "Content-Type: text/plain; charset=UTF-8\n"

instead of

    "Content-Type: text/plain; charset=CHARSET\n"

in the POT file. If translated PO files already exist, they have to be converted to UTF-8 so that they can be merged with the POT file.

If the master file is not UTF-8 encoded, po4a-gettextize must abort because this has to be fixed by maintainers, not translators.

...

> * Handling the input translated document (in gettextize):
>
>  - If the master document's charset is ascii (not specified in the po), we should let the translated document remain in the specified charset (in the command line or the format module's detected one (if nothing detected, stop the process)), and set the po charset to it.
>  - If the master document's charset is utf-8, we should convert from the specified charset (in the command line or the format module's detected one) to utf-8.

Fine by me, but this seems to contradict your previous paragraph, because you said that if no charset is specified, the PO file is UTF-8 encoded ;)

In the first case, the PO charset can be left unspecified until the translator fixes it. The second case is troublesome: msgstrs really have to be recoded into UTF-8, otherwise the PO file is pretty useless, and this conversion cannot be performed afterwards. Maybe po4a-gettextize should abort too.

...

> * Handling the output translated document (in translate):
>
>  - Use the charset specified in the command line, or the po file's charset if nothing specified.

Ok.

...

> * Handling the addendum (in translate):
>
>  - It should be converted from the specified charset in the command line (mandatory) to the output document charset determined in the point above.

Ok.

...

> Did I miss something? Am I wrong in some points?

Sounds good.

...

> Oh, and one last question for now: should we recode everything or just the translated strings (assuming that's the only place where there can be encoding issues...)?

The safest solution is to allow only ASCII encoded non-translatable materials, and see if there are complaints.

Denis

Jordi Vilalta

3:45 p.m.

On Wed, 4 Aug 2004, Denis Barbier wrote:

...

> On Wed, Aug 04, 2004 at 02:46:41PM +0200, Jordi Vilalta wrote:
> > Ok, let me summarize what we have said until now (thanks everyone for
> > helping me understand better the limitations of the po files and the
> > objectives of the encodings).
> >
> > Here are the conditions we have to fulfil:
> >
> >  - msgids and msgstrs must share the same encoding
> >  - msgids should only be ascii or utf-8
> >  - ascii is preferred over utf-8 by translators
>
> Fully right.
>
> > And here's a proposal of the processes:
> > * Handling the master document (in gettextize, translate and update):
> >
> >  - If a charset is specified in the command-line, convert from that to
> >    utf-8 (and set the po charset to utf-8)
> >  - Else, if the format module can detect the encoding from the document,
> >    convert from this to utf-8 (and set the po charset to utf-8)
>
> No, it must be ASCII by default because 'ascii is preferred over utf-8 by translators'.

Well, this "detect" means that the document specifies the charset within itself (like the xml headers: <?xml encoding='iso-8859-1'?>), the format module checks it, and then this should be converted to utf-8.

...

> >  - If nothing can determine the file encoding, assume it's in ascii and
> >    don't convert anything (and set the po charset to something invalid, so
> >    that the translator can set it)
>
> If master file contains non-ASCII characters, one can check whether it is UTF-8 encoded. In such a case, lib/Locale/Po4a/Po.pm has to write
>   "Content-Type: text/plain; charset=UTF-8\n"
> instead of
>   "Content-Type: text/plain; charset=CHARSET\n"
> in the POT file. If translated PO files already exist, they have to be converted to UTF-8 so that they can be merged with the POT file.

Do you mean that an update on the master document can cause the change from ascii to utf-8 and we should convert the po files to utf-8 when updating?

...

> If master file is not UTF-8 encoded, po4a-gettextize must abort because this has to be fixed by maintainers, not translators.
>
> > * Handling the input translated document (in gettextize):
> >
> >  - If the master document's charset is ascii (not specified in the po), we
> >    should let the translated document remain in the specified charset (in
> >    the command line or the format module's detected one (if nothing
> >    detected, stop the process)), and set the po charset to it.
> >  - If the master document's charset is utf-8, we should convert from the
> >    specified charset (in the command line or the format module's detected
> >    one) to utf-8.
>
> Fine by me, but this seems in contradiction with your previous paragraph, because you said that if no charset is specified, PO file is UTF-8 encoded ;) In the first case, PO charset can be unspecified until translator fixes it. In the second case, it is troublesome, msgstrs really have to be recoded into UTF-8, otherwise the PO file is pretty useless, this conversion cannot be performed afterwards. Maybe po4a-gettextize should abort too.

Yes, that's what I meant. When the master (msgids) is utf-8, we should also convert the translated strings to utf-8 (before mixing the 2 po files).

...

> > * Handling the output translated document (in translate):
> >
> >  - Use the charset specified in the command line, or the po file's charset
> >    if nothing specified.
>
> Ok.
>
> > * Handling the addendum (in translate):
> >
> >  - It should be converted from the specified charset in the command line
> >    (mandatory) to the output document charset determined in the point
> >    above.
>
> Ok.
>
> > Did I miss something? Am I wrong in some points?
>
> Sounds good.
>
> > Oh, and one last question for now: should we recode everything or just the
> > translated strings (assuming that's the only place where there can be
> > encoding issues...)?
>
> The safest solution is to allow only ASCII encoded non-translatable materials, and see if there are complaints.

I also vote for this.

Regards,
Jordi Vilalta

Denis Barbier

6:16 p.m.

On Wed, Aug 04, 2004 at 10:45:36PM +0200, Jordi Vilalta wrote: [...]

...

> > No, it must be ASCII by default because 'ascii is preferred over utf-8
> > by translators'.
>
> Well, this "detect" means that the document specifies the charset within itself (like the xml headers: <?xml encoding='iso-8859-1'?>), the format module checks it, and then this should be converted to utf-8.

Which charset will be used for the POT file? Just to be clear, my opinion is that if an encoding is declared (either iso-8859-1 or utf-8) but the document only contains ASCII characters, the POT file should not declare its charset as UTF-8 (i.e. charset=CHARSET is left unchanged).

...

> > >  - If nothing can determine the file encoding, assume it's in ascii and
> > >    don't convert anything (and set the po charset to something invalid, so
> > >    that the translator can set it)
> >
> > If master file contains non-ASCII characters, one can check whether it
> > is UTF-8 encoded. In such a case, lib/Locale/Po4a/Po.pm has to write
> >   "Content-Type: text/plain; charset=UTF-8\n"
> > instead of
> >   "Content-Type: text/plain; charset=CHARSET\n"
> > in the POT file. If translated PO files already exist, they have to
> > be converted to UTF-8 so that they can be merged with the POT file.
>
> Do you mean that an update on the master document can cause the change from ascii to utf-8 and we should convert the po files to utf-8 when updating?

Yes, as soon as master document contains non-ASCII characters, PO files have to be UTF-8 encoded.

Denis

PS: I am away on Sunday for 2 weeks, and do not know if I will be able to read mails before leaving.
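The rule Denis settles on for the POT header can be written down in a few lines of Python. This is a sketch of the rule as stated in the thread, not the actual code in lib/Locale/Po4a/Po.pm; the function name is hypothetical and msgids are assumed to be already decoded strings.

```python
from typing import Iterable


def pot_content_type(msgids: Iterable[str]) -> str:
    """Return the Content-Type line to write in the POT header.

    ASCII-only msgids leave the CHARSET placeholder untouched, whatever
    encoding the master document declared. Any non-ASCII character
    forces UTF-8.
    """
    all_ascii = all(msgid.isascii() for msgid in msgids)
    charset = "CHARSET" if all_ascii else "UTF-8"
    return f'"Content-Type: text/plain; charset={charset}\\n"'
```

When an update of the master document flips the result from CHARSET to UTF-8, existing translated PO files would need to be converted to UTF-8 before merging, as discussed above.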


devel@lists.po4a.org
