Sorry for the delay, I was playing with shadow ;)
On Mon, May 09, 2005 at 01:59:09AM +0200, Nicolas François wrote:
Hello,
There is a bug reported on the Alioth tracker against the Sgml module.
I did not notice it before.
Was there a notification on po4a-devel(a)lists.alioth?
There should have been. The bug tracker is configured to send everything to
po4a-devel(a)lists.alioth.debian.org
See
https://alioth.debian.org/tracker/admin/index.php?group_id=30267&atid...
Otherwise, is there a way to get some notifications from the
tracker?
Then regarding the bug report:
* I've already uploaded a simple fix for a typo reported in the bug
report.
* the SGML book uses the contrib and epigraph tags. Are those tags
standard? Can I add them to the translate category?
I dunno; please do so. If it helps for this document, it's good. There's
almost no chance that it breaks anything.
* for the main part of the bug report, I propose to escape '<', '>' and
'&' to {PO4A-lt}, {PO4A-gt} and {PO4A-amp} before feeding nsgmls, and
to change them back to the originals in the cdata type.
Great, that's what we have to do.
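To make sure we mean the same thing, here is a rough sketch of that round trip. The helper names (escape_cdata, unescape_cdata) are made up for illustration and are not what Sgml.pm uses; the sketch also assumes the CDATA text has already been isolated from the rest of the document.

```perl
use strict;
use warnings;

# Hypothetical helpers, not taken from Sgml.pm.
# Placeholders are the ones proposed in this thread.
my %to_placeholder = (
    '<' => '{PO4A-lt}',
    '>' => '{PO4A-gt}',
    '&' => '{PO4A-amp}',
);
my %from_placeholder = reverse %to_placeholder;

# Hide the markup characters before the text is fed to nsgmls.
sub escape_cdata {
    my ($text) = @_;
    $text =~ s/([<>&])/$to_placeholder{$1}/g;
    return $text;
}

# Restore the original characters when the cdata is written back.
sub unescape_cdata {
    my ($text) = @_;
    $text =~ s/(\{PO4A-(?:lt|gt|amp)\})/$from_placeholder{$1}/g;
    return $text;
}

my $raw     = 'if (a < b && b > c)';
my $escaped = escape_cdata($raw);
print "$escaped\n";
print unescape_cdata($escaped), "\n";
```

The only property that matters is that unescaping the escaped text gives back the input unchanged, as long as the input does not already contain one of the placeholders.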
I also had some other issues with this PHP book:
* around line 795, PO4A-beg/end are changed back to their SGML
counterparts only if they appear at the beginning of a line.
Why only at the beginning?
I can't remember. It's been a *long* time since I last dug into sgml.pm,
and I have bad memories of it. The code is a bit obscure,
and there is a bunch of stuff we should move to TransTractor (file
inclusion) or do another way (I dream of killing nsgmls).
This causes some PO4A-beg/end to be kept in the output document.
If so, this is a bug ;)
* also, the content of the cdata is pushed, but the buffer is not
flushed, so it can be pushed too early.
In my patch, I appended the content of the cdata to $buffer.
Should the content of cdata be verbatim? Shouldn't it be translated?
I think it should be verbatim. I'm not sure anymore about translation.
* also, I don't really understand what is done with the leading spaces
and the added trailing '\n', but this is probably not an issue.
What I absolutely want to avoid here is getting the whole document on only
one line since it kills any dream of addendum. So, I try to get one
structuring tag per line, and to add some spaces around to make this look
better. But this code may well be buggy too...
* around line 535, & is changed to {PO4A-amp} if it is not the beginning
of an entity.
This uses:
while ($origfile =~ /^(.*?)&([^;\s]*);(.*)$/s) {
...
}
This regex is too permissive. It causes the following line:
]]><![CDATA[&d_op=viewdownload&cid=79\">Web Installer...
to be changed into:
]]><![CDATA[_op=viewdownload=79\">Web Installer...
I found the following grammar (for XML):
http://www.w3.org/TR/REC-xml/#NT-Name
It's probably too complicated (the Letter or Digit rules use a lot of
Unicode chars). So I propose to only allow ASCII chars (with a non
greedy match):
while ($origfile =~ /^(.*?)&([A-Za-z_:][-_:.A-Za-z0-9]*?);(.*)$/s) {
...
}
Oops. :-/
By the way, you can make it greedy: ";" is not in the character class, so
it won't make any difference, will it?
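A small standalone check of both points, as a sketch only: names_matching is a hypothetical helper and the sample string is invented, neither comes from Sgml.pm or from the PHP book. It compares the permissive name pattern with the ASCII-only one, in greedy and non-greedy form.

```perl
use strict;
use warnings;

# Name patterns discussed above (everything between '&' and ';').
my $permissive   = qr/[^;\s]*/;
my $strict       = qr/[A-Za-z_:][-_:.A-Za-z0-9]*/;   # greedy
my $strict_lazy  = qr/[A-Za-z_:][-_:.A-Za-z0-9]*?/;  # non-greedy

# Hypothetical helper: list what each pattern accepts as an entity name.
sub names_matching {
    my ($name_re, $text) = @_;
    my @names;
    while ($text =~ /&($name_re);/g) {
        push @names, $1;
    }
    return @names;
}

# Invented sample: a query string containing a ';', then two real entities.
my $sample = 'x=1&y=2;z &amp; &foo.bar;';

print join(',', names_matching($permissive,  $sample)), "\n";  # y=2,amp,foo.bar
print join(',', names_matching($strict,      $sample)), "\n";  # amp,foo.bar
print join(',', names_matching($strict_lazy, $sample)), "\n";  # amp,foo.bar
```

The permissive pattern takes "y=2" for an entity name, while both strict variants only report the two real entities. Greedy and non-greedy agree here because the name must be followed directly by ";" and ";" cannot be part of the name.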
* my last point: can anybody have a look at the sgmldiff between
EN-Book.sgml and po4a-normalize.output?
I'm highly incompetent regarding SGML and I based my analysis on po4a and
sgmldiff outputs. So please stop me if any of the above statements is
wrong.
I'm rather short on time, but I'll try to do so. The statements look good.
Attached is the patch I plan to commit this week.
No need to wait that long ;)
Thanks again for your time,
Mt.