libwpd2 is the codename for the next iteration of libwpd. In this document, the older version will be referred to as libwpd1. Note that these are just names: it's all still libwpd.
This design is inspired by several sources: libwpd1, some of Ariya Hidayat's proposals, but most of all wv2 (a library for reading/writing Microsoft Word files).
This is only the first iteration of the document: I wrote it up to generate discussion. It is not complete. Let's talk about what's missing, and what should be changed.
I generally look down on re-writing a fully functioning project. It puts a damper on development of new features and generally causes a lot of pain. Most of the time, it's better to just suffer through an older iteration, warts and all. However, in this case, I think we are justified in doing so, given the issues we are addressing:
Additionally, libwpd1 is not that well developed. In its entirety, libwpd1 is less than 2500 lines of code, much of which may be straightforwardly reused in a next-generation library.
The importer works (or will work) :-) along an "assembly-line" type principle: the parser gathers information, instantiating byte functions as appropriate (which are represented as classes, allowing them to be re-used by the exporter), which in turn notify the DocumentReader of what was read, and then delete themselves. The DocumentReader keeps track of state and assembles structures (paragraph, section, etc.) based on the information passed from the lower levels. When a structure must be terminated (e.g.: on a paragraph break), the current structure is passed to the filter. The filter should then handle the structure appropriately, after which it may be deleted by libwpd and parsing may continue. The following two diagrams might help in visualizing this process:
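The import flow described above can be sketched roughly as follows. This is only an illustration: every class and function name here (Paragraph, Filter, ByteFunction, HardReturn, parse, CollectingFilter) is hypothetical, the "file format" is a toy in which '\n' stands in for a real hard-return function, and the byte function is a stack object rather than one that deletes itself.

```cpp
#include <cassert>
#include <string>
#include <vector>

// All names are hypothetical; they only illustrate the assembly-line flow.

// A finished high-level structure handed to the filter.
struct Paragraph {
    std::string text;
};

// Implemented by the target application (the "filter").
class Filter {
public:
    virtual ~Filter() {}
    virtual void handleParagraph(const Paragraph &para) = 0;
};

// Tracks state and assembles structures from low-level notifications.
class DocumentReader {
public:
    explicit DocumentReader(Filter &filter) : m_filter(filter) {}

    void characterRead(char c) { m_current.text += c; }

    // A paragraph break terminates the current structure: pass it to the
    // filter, then discard it so that parsing can continue.
    void paragraphBreakRead() {
        m_filter.handleParagraph(m_current);
        m_current = Paragraph();
    }

private:
    Filter &m_filter;
    Paragraph m_current;
};

// A byte function represented as a class: it notifies the reader of what
// was read and is destroyed as soon as it has done so.
class ByteFunction {
public:
    virtual ~ByteFunction() {}
    virtual void notify(DocumentReader &reader) const = 0;
};

class HardReturn : public ByteFunction {
public:
    void notify(DocumentReader &reader) const { reader.paragraphBreakRead(); }
};

// Toy parser: '\n' stands in for a real WordPerfect hard-return function.
void parse(const std::string &bytes, DocumentReader &reader) {
    for (std::string::size_type i = 0; i < bytes.size(); ++i) {
        if (bytes[i] == '\n') {
            HardReturn fn;      // instantiated by the parser
            fn.notify(reader);  // notifies the reader, then goes out of scope
        } else {
            reader.characterRead(bytes[i]);
        }
    }
}

// Example filter that simply collects the text of each paragraph.
class CollectingFilter : public Filter {
public:
    void handleParagraph(const Paragraph &para) { paragraphs.push_back(para.text); }
    std::vector<std::string> paragraphs;
};
```

Parsing the toy input "Hello\nWorld\n" through a DocumentReader attached to a CollectingFilter would hand the filter two paragraphs, one per break.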
The export case should only be slightly different: the filter should provide the higher-level structures, which may then be converted into lower level structures, and then written to WordPerfect byte-functions. As we can provide things in one consistent format for the exporter, it is anticipated that this will be easier than the import case: it should require much less state tracking, at the very least.
It is currently proposed that libwpd should depend on libgsf (for OLE support) and the STL.
libgsf is a relatively lightweight library (linking to it should hurt none of the target platforms for memory usage), robust, and will be used in wv2 (if it's good enough for them...). True, libgsf is written in C, which isn't our native language, but we should need only about 20 lines of libgsf code in libwpd (see 5.3). In my view, writing a bug-prone wrapper (and going to the trouble of making sure it interfaces well with all possible OLE libraries and libwpd) is a waste of time.
I am uncertain about the use of the STL. It may be a better choice to simply use glib-2.0's data structures (since that is required by libgsf anyway and the library is considerably lighter). This is open to discussion.
The parser needs a general way of skipping bytes and reading data in the stream. This should be provided by an interface filled by the target application: this interface should probably be something like that provided by libgsf.
It is important that libwpd be designed such that its parser never needs to go backward in the stream: seeking backward is not supported by OpenOffice.
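As a starting point for discussion, a forward-only stream interface might look something like the sketch below. The names WPInputStream and MemoryInputStream are hypothetical (this is not an existing libwpd or libgsf API); the point is only that the interface offers read and skip, and deliberately nothing that moves backward.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical interface, to be implemented by the target application.
class WPInputStream {
public:
    virtual ~WPInputStream() {}
    // Reads up to numBytes into buffer; returns the number of bytes read.
    virtual std::size_t read(unsigned char *buffer, std::size_t numBytes) = 0;
    // Skips forward up to numBytes; returns the number of bytes skipped.
    virtual std::size_t skip(std::size_t numBytes) = 0;
    virtual bool atEnd() const = 0;
    // Deliberately no seek() or rewind(): the parser may only move forward.
};

// Simple in-memory implementation, useful for testing a parser.
class MemoryInputStream : public WPInputStream {
public:
    MemoryInputStream(const unsigned char *data, std::size_t size)
        : m_data(data, data + size), m_pos(0) {}

    std::size_t read(unsigned char *buffer, std::size_t numBytes) {
        const std::size_t n = clamp(numBytes);
        if (n > 0)
            std::memcpy(buffer, &m_data[m_pos], n);
        m_pos += n;
        return n;
    }

    std::size_t skip(std::size_t numBytes) {
        const std::size_t n = clamp(numBytes);
        m_pos += n;
        return n;
    }

    bool atEnd() const { return m_pos >= m_data.size(); }

private:
    // Limits a request to the number of bytes that are still available.
    std::size_t clamp(std::size_t numBytes) const {
        const std::size_t remaining = m_data.size() - m_pos;
        return numBytes < remaining ? numBytes : remaining;
    }

    std::vector<unsigned char> m_data;
    std::size_t m_pos;
};
```

An application such as OpenOffice would supply its own implementation on top of whatever stream type it already has; the parser would only ever see the abstract class.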
To my knowledge, WordPerfect never uses OLE for anything particularly useful: it embeds objects (images, spreadsheets, etc.) in the WordPerfect document. WordPerfect OLE files just contain a wrapper, with an entry index pointing to the main document proper. In the import case, our OLE handling should be limited to finding the document in the OLE stream, and then passing that document to the parser (should be less than 20 lines of code). For the export case, we shouldn't bother with OLE at all. It's worse than useless.
Embedded objects are in the document stream, after the prefix area, but before the document area. We should pass the stream on to client applications/libraries (e.g.: gnumeric, kspread, a WordPerfect graphics parser), and then store the embedded information somewhere safe, for later use by the target application when it gets the appropriate message (in a box group).
This is for communication both between the low-level API and the high-level API, and between libwpd and the target application. I propose we use the functor approach, as seen in wv2.
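For readers unfamiliar with the idea, here is a minimal sketch of a functor-based callback. It is not wv2's actual code: FunctorBase, Functor, FootnoteData, HighLevelReader and handleLater are all hypothetical names chosen for illustration. The functor bundles an object, one of its member functions and some data, so that the receiver can invoke it later without knowing any of those types.

```cpp
#include <cassert>

// Abstract functor: something that can be invoked later with no arguments.
class FunctorBase {
public:
    virtual ~FunctorBase() {}
    virtual void operator()() const = 0;
};

// Binds a handler object, one of its member functions, and a piece of data.
template <class Handler, class Data>
class Functor : public FunctorBase {
public:
    typedef void (Handler::*Method)(const Data &);

    Functor(Handler &handler, Method method, const Data &data)
        : m_handler(handler), m_method(method), m_data(data) {}

    // Calls handler.method(data).
    void operator()() const { (m_handler.*m_method)(m_data); }

private:
    Handler &m_handler;
    Method m_method;
    Data m_data;
};

// Hypothetical payload and handler, for illustration only.
struct FootnoteData {
    int number;
};

class HighLevelReader {
public:
    HighLevelReader() : lastFootnote(0) {}
    void parseFootnote(const FootnoteData &data) { lastFootnote = data.number; }
    int lastFootnote;
};

// The receiving side sees only FunctorBase and decides when to call it.
void handleLater(const FunctorBase &callback) { callback(); }
```

The attraction of this approach is that the receiver (the high-level API, or the target application) can choose when, or whether, to trigger the work the functor represents.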
We will have to implement separate parsers for 5.x and 6+. The file formats are too different: we need to instantiate completely different byte groups depending on which version we are dealing with. This also means that we will need to define separate DocumentReader (and DocumentState) classes for the two different versions. It is probably worth spending some time defining an intelligent inheritance hierarchy here, to make sure we're not duplicating code.
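One possible shape for such a hierarchy is sketched below, purely as a basis for discussion. The class names (Parser, WP5Parser, WP6Parser), the FormatVersion enum and the createParser factory are hypothetical, and the byte thresholds are placeholders, not the real WordPerfect code ranges. The shared driver loop lives in the base class; only the version-specific decision is virtual.

```cpp
#include <cassert>
#include <string>

// Hypothetical version tag; how the version is detected is out of scope here.
enum FormatVersion { FORMAT_WP5, FORMAT_WP6 };

class Parser {
public:
    virtual ~Parser() {}

    // Shared driver, written once: counts the bytes that start a function.
    int countFunctions(const std::string &bytes) const {
        int count = 0;
        for (std::string::size_type i = 0; i < bytes.size(); ++i) {
            if (isFunctionByte(static_cast<unsigned char>(bytes[i])))
                ++count;
        }
        return count;
    }

protected:
    // Version-specific part, supplied by each subclass.
    virtual bool isFunctionByte(unsigned char byte) const = 0;
};

class WP5Parser : public Parser {
protected:
    // Placeholder threshold, NOT the real 5.x function-code range.
    bool isFunctionByte(unsigned char byte) const { return byte >= 0xC0; }
};

class WP6Parser : public Parser {
protected:
    // Placeholder threshold, NOT the real 6+ function-code range.
    bool isFunctionByte(unsigned char byte) const { return byte >= 0x80; }
};

// Factory: the caller owns the returned parser and must delete it.
Parser *createParser(FormatVersion version) {
    if (version == FORMAT_WP5)
        return new WP5Parser;
    return new WP6Parser;
}
```

The same pattern would apply to the DocumentReader and DocumentState classes: common state tracking in a base class, with per-version subclasses overriding only what actually differs.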
We will handle text in the lower level, more or less, in the same way as we do a byte group: pass the raw information up to the WPDocumentReader class. This is to avoid assumptions on how to convert things like extended or international characters. We may do some buffering with standard characters as an optimization (this is the idea behind the LLText class you can see in the diagram above).
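The buffering idea might be sketched as follows. Only the name LLText comes from the design; its interface here, and the TextSink and RecordingSink classes, are assumptions made for illustration (TextSink stands in for the relevant part of WPDocumentReader). Standard characters are accumulated into a run; an extended character flushes the run and is then passed up raw, unconverted.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the part of WPDocumentReader that receives text (hypothetical).
class TextSink {
public:
    virtual ~TextSink() {}
    virtual void standardText(const std::string &run) = 0;
    // Raw, unconverted identification of an extended character.
    virtual void extendedCharacter(unsigned char characterSet,
                                   unsigned char character) = 0;
};

// Low-level text buffer: batches standard characters as an optimization.
class LLText {
public:
    explicit LLText(TextSink &sink) : m_sink(sink) {}

    void addStandard(char c) { m_buffer += c; }

    // Extended characters are never buffered or converted here: flush what
    // we have so that ordering is preserved, then pass the raw values up.
    void addExtended(unsigned char characterSet, unsigned char character) {
        flush();
        m_sink.extendedCharacter(characterSet, character);
    }

    // Sends any pending run of standard characters to the sink.
    void flush() {
        if (!m_buffer.empty()) {
            m_sink.standardText(m_buffer);
            m_buffer.clear();
        }
    }

private:
    TextSink &m_sink;
    std::string m_buffer;
};

// Example sink that records what it receives.
class RecordingSink : public TextSink {
public:
    RecordingSink() : extendedCount(0) {}
    void standardText(const std::string &run) { runs.push_back(run); }
    void extendedCharacter(unsigned char, unsigned char) { ++extendedCount; }
    std::vector<std::string> runs;
    int extendedCount;
};
```

With this arrangement the higher level receives one notification per run of ordinary text instead of one per character, while remaining free to decide how each extended character should be converted.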
The DocumentReader class will probably represent text as UCS2, which is ooWriter's and (if I remember correctly) KWord's internal encoding. AbiWord-2.0 uses UCS4 internally, but a UCS2 -> UCS4 conversion is straightforward.
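To show why the conversion is straightforward: every UCS2 value is already the code point itself, so converting to UCS4 is just a matter of widening each 16-bit unit to 32 bits. A minimal sketch, assuming the text is held in std::vector (the function name ucs2ToUcs4 is hypothetical) and that the input is strict UCS2 with no surrogate pairs:

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>
#include <vector>

// Widens each 16-bit UCS2 unit to a 32-bit UCS4 code point.
// Assumes strict UCS2 input, i.e. no UTF-16 surrogate pairs.
std::vector<uint32_t> ucs2ToUcs4(const std::vector<uint16_t> &in) {
    std::vector<uint32_t> out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out.push_back(static_cast<uint32_t>(in[i]));
    return out;
}
```

The reverse direction is not always possible, which is one argument for keeping the narrower encoding inside the library and letting applications such as AbiWord widen it on their side.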
Dia source files for the UML diagrams: libwpd_lowlevel.dia, libwpd_highlevel.dia.