On Wed, Feb 16, 2005 at 09:25:31PM +0100, Jordi Vilalta wrote:
On Wed, 16 Feb 2005, Nicolas François wrote:
>On Wed, Feb 16, 2005 at 01:00:09AM +0100, Jordi Vilalta wrote:
>>I was just gettextizing some man pages and I've noticed a problem when
>>trying to mix several po files:
>>
>>$ msgcat *.po
>>file1.po:19:10: invalid multibyte sequence
>>msgcat: found 1 fatal error
>>
>>I've found that there was a strange character in that position, and it
>>seems it's the equivalent of the man page's "\ ". What's its meaning? Why is
>>it handled with this strange byte? It seems we're generating non-compliant
>>po files :S
>
>Yes, "\ " is changed to 0xA0. Maybe this should be done only if the
>charset used supports this character (at least UTF-8 & latin-1).
Is it important to maintain a "\ " instead of converting it to a standard
space? When translators rewrite the message, (I think) they write standard
spaces, so the "\ " loses its possible utility. If it's important to
maintain them, I think it would be better to put "\ " in the po files.
I think it is important: when used in a macro argument, "\ " lets the same
argument continue across the space. For example:
.BI foo bar
has two arguments, the first one in bold face, the second one in italic
(and they are displayed joined).
Whereas:
.BI foo\ bar
consists of only one argument and is displayed as "foo bar" in bold face.
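
To make the argument-splitting rule concrete, here is a minimal sketch in
Python. It is only an illustration: split_macro_args is a hypothetical helper,
not po4a's splitargs, and it ignores double-quoted arguments and other roff
escapes.

```python
import re

def split_macro_args(line):
    """Split a roff macro line into (macro, args).

    Simplified assumption: arguments are separated by plain spaces, and
    a backslash followed by a space ("\\ ") stays inside the argument.
    """
    # A token is a run of escaped spaces or non-space characters.
    tokens = re.findall(r'(?:\\ |[^ ])+', line)
    return tokens[0], tokens[1:]

two_args = split_macro_args('.BI foo bar')    # two arguments: bold, italic
one_arg = split_macro_args(r'.BI foo\ bar')   # a single bold argument
```

With a standard space the second word becomes its own (italic) argument;
with "\ " it stays in the first (bold) one.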
>However, I'm surprised it generates an error. I'm only getting warnings
>(sometimes annoying):
>warning: The following msgid contains non-ASCII characters.
> This will cause problems to translators who use a character encoding
> different from yours. Consider using a pure ASCII msgid instead.
>
>(There is no warning when the charset is UTF-8)
>
>Can you point me to the man page you gettextized (I will need the original
>and translated man page)?
It has happened for example with the ldd man page (along with a lot more).
There's no need to use the translated one. Here's a simple example to
reproduce it:
- create a simple man page that contains this line (typical):
\-V\ \-\-version
- po4a-gettextize -f man -m file.man -p file.po
- edit file.po to set a valid charset
- msgcat file.po: with ascii and utf-8 charsets I get this:
file.po:19:10: invalid multibyte sequence
msgcat: found 1 fatal error
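
The error is consistent with how a lone 0xA0 byte behaves in each charset. A
small Python sketch (the msgid bytes are an assumed reconstruction of what
ends up in file.po, and decodes_as is a hypothetical helper):

```python
def decodes_as(data, charset):
    """Return True if the byte string is valid in the given charset."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False

# Assumed msgid line: "\ " has been replaced by the single byte 0xA0.
po_line = b'msgid "\\-V\xa0\\-\\-version"'

valid_utf8 = decodes_as(po_line, 'utf-8')      # lone 0xA0 is not valid UTF-8
valid_ascii = decodes_as(po_line, 'ascii')     # 0xA0 is outside ASCII
valid_latin1 = decodes_as(po_line, 'latin-1')  # 0xA0 is NBSP in latin-1

# In UTF-8 the no-break space is the two-byte sequence 0xC2 0xA0.
nbsp_utf8 = '\u00a0'.encode('utf-8')
```

So the byte is only meaningful when the PO header declares latin-1 (or another
8-bit charset that defines 0xA0).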
Thanks for the example. I can reproduce it (I didn't try the ascii
charset).

The conversion of '\ ' to 0xA0 was done because some French translators
used the latin-1 non-breakable space in the PO, and it was useful for them to
convert the 0xA0 char in the PO to a '\ ' in the man page (otherwise there
are other warnings).
What I propose is to keep the conversion of 0xA0 to '\ ' in post_trans,
but remove the opposite conversion in pre_trans. Thus the PO will be valid and
translators will be able (if they wish) to use 0xA0 in the msgstr (and
will have to set a correct charset in the header).
Implementation Note: in fact, in pre_trans, we may need to convert 0xA0 to
'\ ', because the opposite conversion is performed earlier on some strings
(it helps the splitargs subroutine).
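
A rough sketch of the proposed behaviour, written in Python for illustration
only. The names pre_trans and post_trans are borrowed from this thread; the
bodies are assumptions and not the actual code of the man module.

```python
NBSP = '\u00a0'  # the no-break space (byte 0xA0 in latin-1)

def pre_trans(text):
    """Prepare a string for the PO file (msgid side).

    Assumption: no '\\ ' -> NBSP conversion happens here any more, and
    any NBSP introduced by an earlier step is turned back into '\\ '.
    """
    return text.replace(NBSP, '\\ ')

def post_trans(text):
    """Prepare a translated string for the man page (msgstr side).

    A translator may have typed NBSP; roff needs '\\ ' instead.
    """
    return text.replace(NBSP, '\\ ')

msgid = pre_trans('\\-V' + NBSP + '\\-\\-version')
roff = post_trans('foo' + NBSP + 'bar')
```

With this, the msgid stays pure ASCII whatever the charset, while an NBSP
typed by a translator is still written to the man page as '\ '.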
Do you think we could keep the 0xA0 if the user specified a
$self->{TT}{'file_in_charset'} of UTF-8 or latin-1
(should we then check in_charset or out_charset?)
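
Such a check could look roughly like this (a Python sketch; charset_has_nbsp
is a hypothetical helper, and which charset to pass, input or output, is
exactly the open question):

```python
def charset_has_nbsp(charset):
    """Return True if the charset can represent the no-break space."""
    try:
        '\u00a0'.encode(charset)
        return True
    except (UnicodeEncodeError, LookupError):
        # Not encodable, or the charset name is unknown.
        return False

keep_for_utf8 = charset_has_nbsp('UTF-8')
keep_for_latin1 = charset_has_nbsp('latin-1')
keep_for_ascii = charset_has_nbsp('ascii')
```

Note that for UTF-8 the character would have to be written as 0xC2 0xA0, not
as the single byte 0xA0.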
I'm also asking this for the TeX module (there I'm translating
accented characters, i.e. \'e in the TeX file becomes é in the PO, which
is then translated back to \'e in the TeX file).
These translations are, IMHO, user friendly (when they do not break the
PO ;). They also help spell checkers or syntax checkers (acheck): for
example, in French a ';' has to be preceded by a non-breaking space...
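
For the TeX case, the round trip can be sketched as follows (Python, for
illustration; the three-entry table and the function names tex_to_po and
po_to_tex are assumptions, not the TeX module's actual code):

```python
# Tiny illustrative table; a real one would cover many more accents.
TEX_TO_CHAR = {
    "\\'e": 'é',
    '\\`e': 'è',
    '\\^e': 'ê',
}

def tex_to_po(text):
    """Replace TeX accent commands by the accented character."""
    for tex, char in TEX_TO_CHAR.items():
        text = text.replace(tex, char)
    return text

def po_to_tex(text):
    """Replace accented characters by the TeX accent command."""
    for tex, char in TEX_TO_CHAR.items():
        text = text.replace(char, tex)
    return text

in_po = tex_to_po("caf\\'e")
back_in_tex = po_to_tex(in_po)
```

The PO then contains non-ASCII text, so the same charset question applies as
for 0xA0.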
I would also like to have Martin's opinion on this point (I may have
advocated for this feature, but he did the commit).
Best Regards,
--
Nekral