Hello Patrice,
many thanks for your email. I really appreciate the time you took to dig into the
po4a internals, and I'm sorry both that the documentation is not good enough in its
current state and that it took me so long to answer. Answering requires me to dig
back into the code, as I have forgotten almost everything about it since then. My
comments are inline below.
----- On 28 Jul 25, at 1:10, Patrice Dumas via Devel devel(a)lists.po4a.org wrote:
Hello,
I have questions I did not find answers to by reading the documentation,
mainly about TransTractor. They relate to the specifics of the Texinfo
format and of the parser I intend to use to define a TransTractor-derived
parser (which produces a whole tree for an input file and its
included files; that tree can then be split into translated paragraphs,
environments and lines and used to reconstitute the translated output
document).
I have a first question regarding the input. I did not understand
if the shiftline function provides the content of one file or of a
series of files. If a series of files is provided, I guess that it is
up to parse() to read the reference returned by shiftline and
determine when there is a change in input file. Maybe this could be
documented in the TransTractor manual.
I added this comment to the TransTractor manual, just below the code chunk showing an
example TransTractor implementation for dumb text files.
| By default, your document class will only read from the user-provided
| file. To handle file inclusion (e.g. with \include in LaTeX or similar),
| you need to call C<read()> on each file to be included. See the TeX.pm
| module for an example of an overridden C<read()> function that not only
| loads the master file but also searches for inclusion requests and
| gracefully adds the included files to the array of lines to be parsed.
Does this answer your question?
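To make the idea concrete, here is a rough, self-contained sketch of how inclusion
requests could be followed while collecting lines. It is not po4a code: the
collect_lines() helper and the "@include FILE" matching are assumptions made for
illustration only. The [line, "file:lineno"] pairs it returns play the role of the
(line, reference) pairs that shiftline() hands to parse().

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Spec;

# Hypothetical helper, NOT part of po4a: returns a flat list of
# [ $line, "file:lineno" ] pairs, following "@include FILE" directives.
sub collect_lines {
    my ($filename, $seen) = @_;
    $seen ||= {};
    # Refuse to recurse into a file that is already being read.
    die "Include loop detected on $filename\n" if $seen->{$filename}++;

    open(my $fh, '<', $filename) or die "Cannot read $filename: $!\n";
    my @lines = <$fh>;
    close($fh);

    my @out;
    my $lineno = 0;
    for my $line (@lines) {
        $lineno++;
        if ($line =~ /^\@include\s+(\S+)/) {
            # Resolve the included file relative to the including file.
            my $inc = File::Spec->catfile(dirname($filename), $1);
            push @out, collect_lines($inc, $seen);
        } else {
            push @out, [ $line, "${filename}:${lineno}" ];
        }
    }
    # Done with this file: it may legitimately be included again later.
    delete $seen->{$filename};
    return @out;
}
```

Each returned reference still points to the file the line really came from, so a
parser consuming these pairs can tell when the input file changes.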
Second, to what extent is it important to read the input file with
shiftline? My understanding is that the shiftline facility is there
to provide a way to read the input file line by line, but the caller of
the parser does not really care about how the input file is read; what
really matters are the calls to translate and to pushline. If parse() can
determine by itself the file name and the line number to provide as the
second argument of translate, those could be used instead of the shiftline
information. If several files are passed through shiftline, then the
parser needs to read all the input, if only to get the list of files to
process.
Here is the paragraph I added just below. Does it answer your question?
| Please note that using C<read()> to add content to the list and
| C<shiftline()> to get this content is optional. For example, the
| Sgml.pm module does not use this mechanism because it uses an external
| parser instead. Check the C<read()> function from Sgml.pm to see how
| the filenames are saved in a private array, and then how C<parse()> is
| overridden to not use the classical parse features but an SGML-specific
| C<parse_file()> instead. This latter function is very different from
| the rest of the po4a code base, as all the grunt work is done by the
| C<onsgmls> external binary.
Both for input and output, is there a way to handle included files?
It seems to me that ideally, one translated file for each included file
should be written along with the translation of the main input file,
with the include directives in the translated files modified to use the
translated file names. My reading of the documentation is that this could
be possible for translate(), as there is a reference argument that could,
in principle, specify a different file than the file passed by shiftline,
but it does not seem to be possible for pushline, which only accepts a
line and no information on the file(s) to write to. How is such a
situation supposed to be handled? I had a look at the TeX.pm code, and
it seems that read is redefined but also that the included files are not
translated as separate files but output together with the main file.
I added this at the end of the first added paragraph that I mentioned above.
| The included files are inlined in the translated document by the TeX.pm module:
| all the content is placed in the output translation document instead of
| creating separate files for the translation of included files. This is somewhat
| suboptimal, in particular if an included file is used in many files, but
| changing this would be quite complex and no document class does it so far.
| It would imply overloading the C<write()> function that is in charge of writing
| the content into the produced localized file, but I'm not sure that the usage
| benefit would be worth the code complexity. Localized files are disposable files
| that can be removed from disk once the document compilation is done, so the
| waste of space is very debatable.
I fully understand that this is not optimal, but as noted, I think that changing
it would be really complex for a tiny usage advantage.
I have another question about file encodings. Is it possible to
change the encoding based on the information in the file(s) being
read, both for input and output? In Texinfo, there is a
@documentencoding directive that can be used to specify the encoding,
and it can change within the document (more likely when including a
file). It is becoming less and less relevant now that UTF-8 is
increasingly used for every manual, but still if it is possible to do
something it could be relevant.
This is a complex issue that I don't have an answer for. Encoding issues are very
difficult in Perl (at least, I found it difficult to get the encoding working during the
long-overdue refactoring of v0.70), and I think that an assert to ensure that the encoding
declared with @documentencoding matches the one given to po4a would be much easier than
forcing po4a to obey @documentencoding no matter what is given on the command line. I
mean, who's not using UTF-8 anyway?
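As a rough illustration of the kind of assert I have in mind, here is a
self-contained sketch. Both functions are hypothetical helpers, not existing po4a
code, and the loose name comparison (case and punctuation ignored) is my own
assumption about what "matches" should mean.

```perl
use strict;
use warnings;

# Normalise a charset name for a loose comparison:
# "UTF-8", "utf8" and "Utf_8" all become "utf8".
sub normalize_charset {
    my ($name) = @_;
    (my $norm = lc $name) =~ s/[^a-z0-9]//g;
    return $norm;
}

# Hypothetical helper, NOT part of po4a: returns an error message when a
# "@documentencoding" line disagrees with the charset po4a was asked to
# use, and nothing (undef in scalar context) in every other case.
sub check_documentencoding {
    my ($line, $ref, $expected) = @_;
    return unless $line =~ /^\@documentencoding\s+(\S+)/;
    my $declared = $1;
    return if normalize_charset($declared) eq normalize_charset($expected);
    return "${ref}: \@documentencoding declares '$declared' "
         . "but the input charset given to po4a is '$expected'";
}
```

A parser could call such a check on every line it reads and die with the returned
message, leaving the actual decoding entirely to po4a.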
Also, there is a @documentlanguage command in Texinfo to specify the
language of the document. It is used by processors to translate strings
added to the output format; for example, the "Appendix" string could be
added and translated depending on the @documentlanguage.
It can also be used to change hyphenation patterns and the like. Is it
possible to have the information on the language being translated to,
so as to add the @documentlanguage at the beginning of the translated
document output? I have not seen anything about that in the
documentation.
This feature does not exist yet, but it would be a nice addition. You can do it now by
retrieving the current language code with $self->{TT}{po_in}->{lang} from within a
TransTractor.
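Here is a minimal sketch of that idea. MockTexinfo is only a stand-in so that the
snippet runs without po4a installed: its constructor, its pushline() and the
output array are simplified assumptions, and emit_documentlanguage() is a
hypothetical method name. Only the $self->{TT}{po_in}->{lang} access is the real
thing described above.

```perl
use strict;
use warnings;

package MockTexinfo;    # stand-in for a TransTractor-derived class

sub new {
    my ($class, $lang) = @_;
    # Mimic only the parts of the object that are used below.
    my $self = { TT => { po_in => { lang => $lang } }, mock_out => [] };
    return bless $self, $class;
}

# Simplified replacement for pushline(): store the output line.
sub pushline {
    my ($self, $text) = @_;
    push @{ $self->{mock_out} }, $text;
    return;
}

# Hypothetical method: emit "@documentlanguage" when the language of the
# translation is known, and do nothing otherwise.
sub emit_documentlanguage {
    my ($self) = @_;
    my $lang = $self->{TT}{po_in}->{lang};
    return unless defined($lang) && length($lang);
    $self->pushline("\@documentlanguage $lang\n");
    return;
}

package main;

my $doc = MockTexinfo->new("fr");
$doc->emit_documentlanguage();
print @{ $doc->{mock_out} };
```

In a real module, the call would go into parse() at the point where the header of
the translated document is written.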
It seems that there is an initialize function that can be redefined
that is called (but only by new?). Is it possible to use that function to
initialize the parser globally? Or should it be done in another way?
Sorry, I'm not sure I understand what you mean here.
Again, thanks for your time, and sorry for the time it took me to write this answer.
Cheers,
Mt