[Po4a-devel] Sgml bug in the tracker

Sunday, 8 May 2005

Hello,

There is a bug reported on the Alioth tracker against the Sgml module.

I did not notice it before.
Was there a notification on po4a-devel(a)lists.alioth?
Otherwise, is there a way to get some notifications from the tracker?

Then regarding the bug report:
 * I've already uploaded a simple fix for a typo reported in the bug
   report.
 * the SGML book uses the contrib and epigraph tags. Are those tags
   standard? Can I add them to the translate category?
 * for the main part of the bug report, I propose to escape '<', '>' and
   '&' to {PO4A-lt}, {PO4A-gt} and {PO4A-amp} before feeding nsgmls, and
   to change them back to the original characters in the cdata type.
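
A minimal sketch of the escaping round trip proposed in the last point,
not the actual patch. The helper names escape_markup and unescape_markup
are hypothetical and do not exist in po4a; only the three marker strings
come from the proposal above.

```perl
use strict;
use warnings;

# Marker strings taken from the proposal; the hash names are illustrative.
my %to_marker   = ('<' => '{PO4A-lt}', '>' => '{PO4A-gt}', '&' => '{PO4A-amp}');
my %from_marker = reverse %to_marker;

# Hypothetical helper: hide markup characters before the text reaches nsgmls.
sub escape_markup {
    my ($text) = @_;
    $text =~ s/([<>&])/$to_marker{$1}/g;
    return $text;
}

# Hypothetical helper: restore the original characters afterwards.
sub unescape_markup {
    my ($text) = @_;
    $text =~ s/(\{PO4A-(?:lt|gt|amp)\})/$from_marker{$1}/g;
    return $text;
}

my $cdata   = 'if ($a < $b && $b > $c)';
my $escaped = escape_markup($cdata);
print "$escaped\n";
print unescape_markup($escaped), "\n";
```

One limitation of this sketch: a document that already contains a literal
{PO4A-lt} would not survive the round trip unchanged.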

I also had some other issues with this PHP book:
 * around line 795, PO4A-beg/end are changed back to their SGML
   counterparts only if they appear at the beginning of a line.
   Why only at the beginning?
   This causes some PO4A-beg/end to be kept in the output document.
 * also, the content of the cdata is pushed, but the buffer is not
   flushed, so it can be pushed too early.
   In my patch, I appended the content of the cdata to $buffer.
   Should the content of cdata be verbatim? Shouldn't it be translated?
 * also, I don't really understand what is done with the leading spaces
   and the added trailing '\n', but this is probably not an issue.

 * around line 535, & is changed to {PO4A-amp} if it is not the beginning
   of an entity.
   This uses:
     while ($origfile =~ /^(.*?)&([^;\s]*);(.*)$/s) {
       ...
     }
   This regex is too permissive. It causes the following line:
     ]]><![CDATA[&d_op=viewdownload&cid=79\">Web Installer...
   to be changed into:
     ]]><![CDATA[_op=viewdownload=79\">Web Installer...

   I found the following grammar (for XML):
     http://www.w3.org/TR/REC-xml/#NT-Name
   It's probably too complicated (the Letter or Digit rules use a lot of
   Unicode chars). So I propose to only allow ASCII chars (with a non
   greedy match):
     while ($origfile =~ /^(.*?)&([A-Za-z_:][-_:.A-Za-z0-9]*?);(.*)$/s) {
       ...
     }

 * my last point: can anybody have a look at the sgmldiff between
   EN-Book.sgml and po4a-normalize.output?
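
To illustrate the entity regex point (around line 535), here is a small
standalone comparison of the permissive pattern and the ASCII-only one.
The helper first_entity and the sample strings are made up for this
example and are not taken from po4a or from the PHP book.

```perl
use strict;
use warnings;

# Hypothetical helper: return what the pattern captures as the entity
# name (second group), or undef if the pattern does not match.
sub first_entity {
    my ($re, $text) = @_;
    return $text =~ $re ? $2 : undef;
}

my $permissive = qr/^(.*?)&([^;\s]*);(.*)$/s;
my $strict     = qr/^(.*?)&([A-Za-z_:][-_:.A-Za-z0-9]*?);(.*)$/s;

# Made-up inputs: a query string that merely contains '&' and ';',
# and a real entity reference.
my $query  = 'a&b=1&c=2;';
my $entity = 'Tom &amp; Jerry';

for my $text ($query, $entity) {
    printf "%-18s permissive=%s strict=%s\n", $text,
        first_entity($permissive, $text) // '(no match)',
        first_entity($strict,     $text) // '(no match)';
}
```

With the permissive pattern, the query string is treated as if it
contained an entity named "b=1&c=2"; the ASCII-only pattern rejects it
and still recognizes "amp".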

I'm highly incompetent regarding SGML and I based my analysis on po4a and
sgmldiff outputs. So please stop me if any of the above statements is
wrong.

Attached is the patch I plan to commit this week.

TIA,
-- 
Nekral
