CR-LF characters in docbook (and xml in general)

Wednesday, 14 December 2011

Hello

I have a problem with parsing message strings in DocBook para
elements.  Many authors spread the text in these elements over multiple
lines, e.g.

<para>This is a long paragraph which spans
          multiple lines.  I would like it to be presented
          as a po message string to be translated</para>

This is handled perfectly well by po4a-gettextize for most cases.
Unfortunately some authors are using editors which represent the
newline as a \r\n sequence instead of just \n.

The proper behaviour of an XML parser according to the spec
(http://www.w3.org/TR/REC-xml/#sec-line-ends) is to normalize line
ends before parsing: each \r\n sequence, and any \r not followed by
\n, is translated to a single \n.  Unfortunately the po4a tools
pass the \r characters through to the po files, which is confusing for
translators using tools like pootle.  The translator doesn't know
whether the \r characters are required or whether they are significant.
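
To illustrate the behaviour I would expect, here is a small sketch
using Python's standard xml.etree.ElementTree module (which sits on top
of expat).  It is only a demonstration of the line-end handling the
spec describes, not a statement about how po4a's XML module is
implemented; the helper name para_text and the sample input are made up
for the example.

```python
import xml.etree.ElementTree as ET


def para_text(xml_bytes: bytes) -> str:
    """Parse a single element and return its text as the parser reports it."""
    return ET.fromstring(xml_bytes).text


# Hypothetical input, as saved by an editor that writes CR-LF line endings.
crlf_source = b"<para>This is a long paragraph which spans\r\nmultiple lines.</para>"

# The parser hands the application "\n" only; the "\r" never reaches it.
text = para_text(crlf_source)
```

With a parser that follows the spec, the text seen by the application
contains no \r at all, so nothing ends up in the message string for the
translator to wonder about.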

I am not sure exactly how the XML module works, but would it be a
complicated fix to have the module strip the \r characters when it
processes the XML, as per the spec?

It is not critical as it is certainly simple enough to pre-process the
input files with tr or something similar, but it would be nice (and
make the workflow easier) if the xml parser did this by default.
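
For reference, the pre-processing workaround could look roughly like
the sketch below.  It applies the same rule as the spec (\r\n and lone
\r both become \n) rather than simply deleting every \r.  The function
names are my own, and working on raw bytes assumes the files are in an
ASCII-compatible encoding such as UTF-8 or Latin-1; it would not be
safe for UTF-16.

```python
from pathlib import Path


def normalize_line_ends(data: bytes) -> bytes:
    """Translate CR-LF pairs and lone CRs to LF, as an XML parser would."""
    # Replace the two-byte sequence first so a CR-LF pair yields one LF,
    # then convert any CR that was not followed by LF.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def normalize_file(path: str) -> None:
    """Rewrite one file in place with normalized line endings."""
    p = Path(path)
    p.write_bytes(normalize_line_ends(p.read_bytes()))
```

Calling normalize_file on each source file before running
po4a-gettextize should keep the \r characters out of the po files.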

Regards
Bob
