On Thu, Aug 07, 2025 at 02:31:02PM +0200, Martin Quinson wrote:
Hello Patrice,
many thanks for your email. I really appreciate the time you took to dig into the
po4a internals, and I'm sorry both that the documentation is not good enough in its
current state, and for the time it takes to answer. This is because it requires me
to dig back into the code, as I have forgotten almost everything since then. My
comments are inline below.
For some reason I missed your mail too...
----- On 28 Jul 25, at 1:10, Patrice Dumas via Devel
devel(a)lists.po4a.org wrote:
I added this comment to the TransTractor manual, just below the code chunk showing an
example of TransTractor implementation for dumb text files.
| By default, your document class will only read from the user-provided
| file. To handle file inclusion (e.g. with \include in LaTeX or similar),
| you need to call C<read()> on each file to be included. See the Tex.pm
| module for an example of an overridden C<read()> function that not only
| loads the master file but also searches for inclusion requests and
| gracefully adds the included files to the array of lines to be parsed.
Does this answer your question?
Yes.
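To make the idea concrete, here is a minimal, hypothetical sketch of an overridden read() that also loads included files. The module name, the exact read() signature, and the find_inclusions() helper are assumptions for illustration, not actual po4a or Tex.pm code:

```perl
package Locale::Po4a::MyFormat;    # hypothetical module name
use strict;
use warnings;
use parent qw(Locale::Po4a::TransTractor);

# Hypothetical sketch: buffer the master file as usual, then scan it
# for inclusion requests and queue each included file as well, so that
# everything ends up in the same array of lines to be parsed.
sub read {
    my ( $self, $filename, @rest ) = @_;
    $self->SUPER::read( $filename, @rest );
    for my $included ( find_inclusions($filename) ) {   # hypothetical helper
        $self->read( $included, @rest );   # recurse for nested includes
    }
}
```

The recursive call means that files included from included files are handled as well.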
Here is the paragraph I added just below. Does it answer your
question?
| Please note that using C<read()> to add content to the list and
| C<shiftline()> to get this content is optional. For example, the
| Sgml.pm module does not use this mechanism because it uses an external
| parser instead. Check the C<read()> function from Sgml.pm to see how
| the filenames are saved in a private array, and then how C<parse()> is
| overridden so as not to use the classical parse features but an SGML-specific
| C<parse_file()> instead. This latter function is very different from
| the rest of the po4a code base, as all the grunt work is done by the
| C<onsgmls> external binary.
Perfect.
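The mechanism described in that paragraph can be sketched roughly as follows (the field name and the loop are assumptions for illustration, not the actual Sgml.pm code):

```perl
# Hypothetical sketch of the Sgml.pm approach described above:
# read() only records the filenames, and parse() hands the real
# work to an external parser instead of using shiftline().
sub read {
    my ( $self, $filename ) = @_;
    push @{ $self->{files} }, $filename;   # hypothetical private array
}

sub parse {
    my $self = shift;
    for my $file ( @{ $self->{files} } ) {
        $self->parse_file($file);   # the grunt work is done by onsgmls
    }
}
```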
I added this at the end of the first added paragraph that I was
mentioning above.
| The included files are inlined in the translated document by the Tex.pm module:
| all the content is placed in the output translation document instead of
| creating separate files for the translation of included files. This is somewhat
| suboptimal, in particular if an included file is used in many files, but
| changing this would be quite complex and no document class does it so far.
| It would imply overloading the C<write()> function that is in charge of writing
| the content into the produced localized file, but I'm not sure that the usage
| benefit would be worth the code complexity. Localized files are disposable
| files that can be removed from disk once the document compilation is done,
| so the wasted space is quite debatable.
I fully understand that this is not optimal, but as noted, I think that changing
it would be really complex for a tiny usage advantage.
I think that the end of this paragraph is not useful as documentation.
It could go somewhere else, maybe in the TODO, but not in the documentation:
there is too much information about design and limitations, and it is not
useful to go into such detail for users of TransTractor.
Here is a proposal for this paragraph:
There is no possibility to write to multiple files corresponding to
each of the included input files. All the content should be placed in the
output translation document. The Tex.pm module can also be used as an
example showing how included files are inlined in the translated document.
Now, the reason why I asked is not at all about wasting disk space, but
it seemed to me that having the same organization in output as in input
could be useful for the translation process. However, it may not be the
case, maybe there is no practical use for having the output document
organized as the input document.
> I have another question about the files' encoding. Is it possible to
> change the encoding based on the information in the file(s) being
> read, both for input and output? In Texinfo, there is a
> @documentencoding directive that can be used to specify the encoding,
> and it can change within the document (more likely when including a
> file). It is becoming less and less relevant now that UTF-8 is
> increasingly used for every manual, but still if it is possible to do
> something it could be relevant.
This is a complex issue I didn't have an answer for. Encoding issues are very
difficult in Perl (at least, I found it difficult to get the encoding working during the
long-overdue refactoring of v0.70), and I think that an assert to ensure that the encoding
provided as @documentencoding matches the one provided in po4a would be much easier than
forcing po4a to obey the @documentencoding no matter what is given on the command line. I
mean, who's not using UTF-8 anyway?
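Such an assert could look like the following hedged sketch. get_out_charset() is documented in TransTractor; the regexp, the message, and where the check would live in a Texinfo module are illustrative assumptions:

```perl
# Hedged sketch: when the parser meets @documentencoding, check that
# the declared charset matches the one po4a will use for the output,
# and abort with a clear message otherwise.
if ( $line =~ /^\@documentencoding\s+(\S+)/ ) {
    my $declared = lc $1;
    my $expected = lc $self->get_out_charset();
    die "declared encoding '$declared' does not match po4a's '$expected'\n"
        if $declared ne $expected;
}
```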
Ok. Maybe there could be something more explicit about that in the
documentation. Currently, I do not really understand what po4a does
with regard to encoding and what is expected from parser writers.
There is something about this in the translate() description, which says
that strings are recoded, and something in the get_out_charset()
description, but it is not fully clear to me.
I think that it would be good if there was at least the following
information somewhere:
* do the parser and other similar functions (read(), shiftline(),
parse()...) get bytes or characters from the TransTractor code?
* are characters expected from the parser in the translate() call?
And more information on what is going on maybe could be useful.
There is also the issue of the encoding of file names: it is not clear
whether file names are byte strings or character strings.
> Also, there is a @documentlanguage command in Texinfo to specify the
> language of the document. It is used by processors, for example, to
> translate strings added to the output format; for instance, the "Appendix"
> string could be added and translated depending on the @documentlanguage.
> It can also be used to change hyphenation patterns and the like. Is it
> possible to have the information on the language being translated to,
> such as to add the @documentlanguage at the beginning of the translated
> document output? I have not seen anything about that in the
> documentation.
This feature does not exist yet, but it would be a nice addition. You can do it now by
retrieving the current language code with $self->{TT}{po_in}->{lang} from within a
TransTractor.
It looks like the feature I was looking for, so why do you say that it
does not exist yet? Do you mean that this interface should not be
relied upon?
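As a hedged illustration of what is meant, assuming a Texinfo-like module (the field is internal, so it may change between po4a versions):

```perl
# Hedged sketch: emit a @documentlanguage line at the top of the
# translated output, using the target-language code that po4a stores
# internally for the PO file being used.
my $lang = $self->{TT}{po_in}->{lang};   # internal field, per the discussion above
$self->pushline("\@documentlanguage $lang\n") if defined $lang;
```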
> It seems that there is an initialize function that can be redefined that
> is called (but only by new?). Is it possible to use that function to
> initialize the parser globally? Or should it be done in another way?
Sorry, I'm not sure I understand what you mean here.
I do not really remember what I meant either ;-). But I think I can
explain better with examples.
* Let's imagine that I have a setup function in the Perl module doing
the parsing that should be called once before doing any parsing. Where
should I put that call?
* Let's imagine that there is a function that should be called before
starting to parse a file. Where should this call go?
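If initialize() is indeed called by new(), the first case might look like this hedged sketch (My::Parser::setup() and the guard variable are hypothetical names, just to picture the question):

```perl
# Hypothetical sketch: one-time parser setup placed in an overridden
# initialize(), assuming new() calls it once per TransTractor object.
sub initialize {
    my ( $self, %options ) = @_;
    My::Parser::setup() unless $My::Parser::initialized++;   # hypothetical guard
}
```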
--
Pat