libwpd2 design

1.0 Introduction

libwpd2 is the codename for the next iteration of libwpd. In this document, the older version will be referred to as libwpd1. Note that these are just names: it's all still libwpd.

This design is inspired by several sources: libwpd1, some of Ariya Hidayat's proposals, but most of all wv2 (a library for reading/writing Microsoft Word files).

This is only the first iteration of the document: I wrote it up to generate discussion. It is not complete. Let's talk about what's missing, and what should be changed.

2.0 Why another version?

I generally look down on re-writing a fully functioning project. It puts a damper on development of new features and generally causes a lot of pain. Most of the time, it's better to just suffer through an older iteration, warts and all. However, in this case, I think we are justified in doing so, given the issues we are addressing:

  1. No clean separation of parsing/state logic. WordPerfect uses a token stream document model: that is, document properties are generated "on the fly" in response to byte functions within the file. Modern word processors (AbiWord, ooWriter, and KWord) have a more rigid "state model" which demands that each style change be accompanied by a complete declaration of what that style involves (ooWriter goes a step further and demands that all styles be declared "up-front", at the beginning of the document). To do a proper conversion from one model to the other, one needs to keep track of the document state at the time of every significant change in the document (e.g.: a paragraph or section break). libwpd1 made a mess of this, doing some of the state tracking itself and leaving the rest of the job to the file filter. This needs to be done in a more principled manner for conversion to be really robust.
  2. It is possible for libwpd to do more of the work required in producing a target document for the three word processors we desire to support. To avoid code-duplication (and work required by filter-maintainers), it is desirable to put this functionality in libwpd itself.
  3. Exporter work (not yet started for libwpd1) should, in the ideal case, re-use the patterns and code for the importer case. This is difficult, if not impossible, to do with libwpd1's design.

Additionally, libwpd1 is not that well developed. In its entirety, libwpd1 is fewer than 2500 lines of code, much of which may be straightforwardly reused in a next-generation library.

3.0 Design Goals for libwpd2

  1. Maximize code re-use on all levels (and between the import and export cases): should be self-explanatory.
  2. Integrate cleanly with target applications: KWord, AbiWord, and ooWriter. Duplicated logic in the target applications should be put into libwpd2.
  3. Make our decisions on converting WordPerfect structures as explicit as possible. It is impossible to represent a WordPerfect document 100% faithfully in any of our target applications: we should make it clear in our code where we are making sacrifices, and also offer a lower level of parsing where none of this information is lost.

4.0 A high-level view of the proposed structure of libwpd2

The importer works (or will work) :-) along an "assembly-line" type principle: the parser gathers information, instantiating byte functions as appropriate (which are represented as classes, allowing them to be re-used by the exporter), which in turn notify the DocumentReader of what was read, and then delete themselves. The DocumentReader keeps track of state and assembles structures (paragraph, section, etc.) based on the information passed from the lower levels. When a structure must be terminated (e.g.: on a paragraph break), the current structure is passed to the filter. The filter should then handle the structure appropriately, after which it may be deleted by libwpd and parsing may continue. The two UML diagrams listed in section 6.0 might help in visualizing this process.
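To make the assembly line a bit more concrete, here is a rough C++ sketch of the import path. Every name in it (Paragraph, Filter, CollectingFilter, ByteFunction, Character, CenterToggle, HardReturn, parse) is made up for illustration, and the byte values 0x01 to 0x03 are invented: they are not real WordPerfect function codes. For simplicity the parser owns each byte function and destroys it after the notification, rather than having the function delete itself. Treat it as a starting point for discussion, not as the proposed API.

```cpp
#include <memory>
#include <string>
#include <vector>

// All names below are hypothetical; none are taken from libwpd.

// High-level structure handed to the filter: a complete declaration of the
// paragraph, as the "state model" word processors expect.
struct Paragraph {
    std::string text;
    bool centered;
};

// Implemented by the target application's filter.
class Filter {
public:
    virtual ~Filter() = default;
    virtual void handleParagraph(const Paragraph &paragraph) = 0;
};

// Example filter that simply records what it receives.
class CollectingFilter : public Filter {
public:
    void handleParagraph(const Paragraph &p) override { paragraphs.push_back(p); }
    std::vector<Paragraph> paragraphs;
};

// Tracks document state and assembles structures from low-level notifications.
class DocumentReader {
public:
    explicit DocumentReader(Filter &filter) : m_filter(filter) {}
    void setCentered(bool on) { m_centered = on; }
    void appendText(char c) { m_text += c; }
    void endParagraph() {
        // Hand the finished structure to the filter, then discard it.
        m_filter.handleParagraph(Paragraph{m_text, m_centered});
        m_text.clear();  // the centering state deliberately carries over
    }
private:
    Filter &m_filter;
    std::string m_text;
    bool m_centered = false;
};

// Low-level byte functions, represented as classes.
class ByteFunction {
public:
    virtual ~ByteFunction() = default;
    virtual void notify(DocumentReader &reader) const = 0;
};

class Character : public ByteFunction {
public:
    explicit Character(char c) : m_c(c) {}
    void notify(DocumentReader &reader) const override { reader.appendText(m_c); }
private:
    char m_c;
};

class CenterToggle : public ByteFunction {
public:
    explicit CenterToggle(bool on) : m_on(on) {}
    void notify(DocumentReader &reader) const override { reader.setCentered(m_on); }
private:
    bool m_on;
};

class HardReturn : public ByteFunction {
public:
    void notify(DocumentReader &reader) const override { reader.endParagraph(); }
};

// Toy parser. The byte values 0x01..0x03 are invented for this sketch and
// are NOT real WordPerfect function codes.
void parse(const std::vector<unsigned char> &bytes, DocumentReader &reader) {
    for (unsigned char b : bytes) {
        std::unique_ptr<ByteFunction> fn;
        switch (b) {
        case 0x01: fn.reset(new CenterToggle(true)); break;
        case 0x02: fn.reset(new CenterToggle(false)); break;
        case 0x03: fn.reset(new HardReturn()); break;
        default:   fn.reset(new Character(static_cast<char>(b))); break;
        }
        fn->notify(reader);  // fn is destroyed at the end of each iteration
    }
}
```

Note how the filter never sees the "center on" token itself: it only receives paragraphs that already carry the state in effect when they were closed.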

The export case should only be slightly different: the filter should provide the higher-level structures, which may then be converted into lower-level structures, and then written out as WordPerfect byte functions. As we can provide things in one consistent format for the exporter, it is anticipated that this will be easier than the import case: it should require much less state tracking, at the very least.

5.0 Specific Issues: Dependencies, Streaming, OLE, Embedded Objects, Message Passing, Document Parsing

5.1 Dependencies

It is currently proposed that libwpd should depend on libgsf (for OLE support) and the STL.

libgsf is a relatively lightweight library (linking to it should hurt none of the target platforms for memory usage), robust, and will be used in wv2 (if it's good enough for them...). True, libgsf is written in C, which isn't our native language, but we should only need about 20 lines of libgsf code in libwpd (see 5.3). In my view, writing a bug-prone wrapper (and going to the trouble of making sure it interfaces well with all possible OLE libraries and libwpd) is a waste of time.

I am uncertain about the use of the STL. It may be a better choice to simply use glib-2.0's data structures (since that is required by libgsf anyway and the library is considerably lighter). This is open to discussion.

5.2 Streaming

The parser needs a general way of skipping bytes and reading data in the stream. This should be provided by an interface implemented by the target application: this interface should probably be something like that provided by libgsf.

It is important that libwpd be designed such that its parser never needs to go backward in the stream: seeking backward is not supported by OpenOffice.
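As a rough illustration of these requirements, a forward-only stream interface might look something like the following. The names InputStream and MemoryInputStream and the method names are my own assumptions for this sketch; they are not libgsf's API and not a settled design. The point is simply that the interface offers read and skip, and nothing that moves backward.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical forward-only input interface, to be implemented by the
// target application.
class InputStream {
public:
    virtual ~InputStream() = default;
    // Reads up to `count` bytes into `buffer`; returns the number actually read.
    virtual std::size_t read(unsigned char *buffer, std::size_t count) = 0;
    // Skips up to `count` bytes forward; returns the number actually skipped.
    virtual std::size_t skip(std::size_t count) = 0;
    virtual bool atEnd() const = 0;
    // Deliberately no rewind or backward seek.
};

// Example implementation backed by an in-memory buffer.
class MemoryInputStream : public InputStream {
public:
    explicit MemoryInputStream(std::vector<unsigned char> data)
        : m_data(std::move(data)), m_pos(0) {}

    std::size_t read(unsigned char *buffer, std::size_t count) override {
        const std::size_t n = std::min(count, m_data.size() - m_pos);
        if (n > 0) std::memcpy(buffer, m_data.data() + m_pos, n);
        m_pos += n;
        return n;
    }

    std::size_t skip(std::size_t count) override {
        const std::size_t n = std::min(count, m_data.size() - m_pos);
        m_pos += n;
        return n;
    }

    bool atEnd() const override { return m_pos == m_data.size(); }

private:
    std::vector<unsigned char> m_data;
    std::size_t m_pos;
};
```

A parser written only against InputStream cannot accidentally depend on backward seeking, since the operation does not exist.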

5.3 OLE

To my knowledge, WordPerfect never uses OLE for anything particularly useful: it embeds objects (images, spreadsheets, etc.) in the WordPerfect document. WordPerfect OLE files just contain a wrapper, with an entry index pointing to the main document proper. In the import case, our OLE handling should be limited to finding the document in the OLE stream, and then passing that document to the parser (should be fewer than 20 lines of code). For the export case, we shouldn't bother with OLE at all. It's worse than useless.

5.4 Graphics and Other Embedded Objects

Embedded objects are in the document stream, after the prefix area, but before the document area. We should pass the stream on to client applications/libraries (e.g.: gnumeric, kspread, a WordPerfect graphics parser), and then store the embedded information somewhere safe, for later use by the target application when it gets the appropriate message (in a box group).

5.5 Message Passing

Message passing covers communication both between the low-level API and the high-level API, and between libwpd and the target application. I propose we use the functor approach, as seen in wv2.
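For anyone unfamiliar with the idea, here is a generic sketch of a functor used for message passing. This is not wv2's actual code or API: Functor, ParagraphMessage, RecordingFilter and emitParagraph are all hypothetical names chosen for the example. The functor binds a receiver object to one of its member functions, so the sender only needs something it can call.

```cpp
#include <string>
#include <vector>

// Hypothetical functor: binds a handler object to one of its member
// functions so that it can be invoked like a plain function.
template <class Handler, class Message>
class Functor {
public:
    typedef void (Handler::*Method)(const Message &);
    Functor(Handler &handler, Method method)
        : m_handler(&handler), m_method(method) {}
    void operator()(const Message &message) const {
        (m_handler->*m_method)(message);
    }
private:
    Handler *m_handler;
    Method m_method;
};

// Hypothetical message type passed up to the receiver.
struct ParagraphMessage {
    std::string text;
};

// Hypothetical receiver, standing in for a target application's filter.
class RecordingFilter {
public:
    void onParagraph(const ParagraphMessage &m) { received.push_back(m.text); }
    std::vector<std::string> received;
};

// Sender side: knows only that `callback` is callable with a message,
// not what type of object ends up receiving it.
template <class Callback>
void emitParagraph(const std::string &text, const Callback &callback) {
    ParagraphMessage message;
    message.text = text;
    callback(message);
}
```

Usage would be along the lines of constructing `Functor<RecordingFilter, ParagraphMessage> f(filter, &RecordingFilter::onParagraph);` and then calling `emitParagraph("some text", f);`.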

5.6 Document Parsing: 5.x and 6+

We will have to implement separate parsers for 5.x and 6+. The file format is too different: we need to instantiate completely different byte groups depending on which version we are dealing with. This also means that we will need to define separate DocumentReader (and DocumentState) classes for the two different versions. It is probably worth spending some time defining an intelligent inheritance hierarchy here, to make sure we're not duplicating code.
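One possible shape for such a hierarchy, sketched very loosely: shared bookkeeping lives in a common base class, and only the version-specific behaviour is overridden. The names FormatVersion, WP5DocumentReader, WP6DocumentReader and makeReader are assumptions for this example, and how the version is actually detected from the file is left out on purpose.

```cpp
#include <memory>
#include <string>

// Hypothetical: the caller has already worked out which format it has.
enum class FormatVersion { WP5, WP6Plus };

// Common base class: anything both versions share goes here.
class DocumentReader {
public:
    virtual ~DocumentReader() = default;
    void appendText(const std::string &s) { m_text += s; }
    const std::string &text() const { return m_text; }
    // Version-specific behaviour is overridden in the subclasses.
    virtual std::string formatName() const = 0;
private:
    std::string m_text;
};

class WP5DocumentReader : public DocumentReader {
public:
    std::string formatName() const override { return "5.x"; }
};

class WP6DocumentReader : public DocumentReader {
public:
    std::string formatName() const override { return "6+"; }
};

// Picks the reader that matches the format version.
std::unique_ptr<DocumentReader> makeReader(FormatVersion version) {
    if (version == FormatVersion::WP5)
        return std::unique_ptr<DocumentReader>(new WP5DocumentReader());
    return std::unique_ptr<DocumentReader>(new WP6DocumentReader());
}
```

The same split would presumably apply to the DocumentState classes and to the parsers themselves.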

5.7 Document Parsing: Text

We will handle text in the lower level, more or less, in the same way as we do a byte group: pass the raw information up to the WPDocumentReader class. This is to avoid assumptions on how to convert things like extended or international characters. We may do some buffering with standard characters as an optimization (this is the idea behind the LLText class you can see in the diagrams listed in section 6.0).

The DocumentReader class will probably represent text as UCS2, which is ooWriter's and (if I remember correctly) KWord's internal encoding. AbiWord-2.0 uses UCS4 internally, but a UCS2 -> UCS4 conversion is straightforward.
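To show why the conversion is straightforward: assuming the input really is strict UCS2 (one 16-bit unit per character, no UTF-16 surrogate pairs), every character simply widens to 32 bits with the same value. The function name ucs2ToUcs4 is hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Widens each UCS2 code unit to a UCS4 code point of the same value.
// Assumes strict UCS2 input: surrogate pairs are NOT combined.
std::vector<std::uint32_t> ucs2ToUcs4(const std::vector<std::uint16_t> &in) {
    return std::vector<std::uint32_t>(in.begin(), in.end());
}
```

If surrogate pairs ever turn up in the input, this would need to become a proper UTF-16 decoder instead.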

6.0 Files

Dia source files for the UML diagrams: libwpd_lowlevel.dia, libwpd_highlevel.dia.


William Lachance
Last modified: Sat Nov 16 10:06:44 EST 2002