The programs gendev and wgml are present in the repository but their source is not. Inquiries made early in the history of Open Watcom pretty much established that the sources no longer existed. This project is intended to recreate the source code. One good reason for doing this is that only DOS and OS/2 versions exist. The DOS version of wgml is used heavily in producing the documentation under Windows (Win32) -- and the 64-bit version of Windows XP is known not to be able to run it itself (as opposed to using a third-party program). The advantage of eventually being able to use a Win32 version is thus apparent. Having the source would also allow for any changes that may be desired (such as to the command-line syntax).
I have never worked collaboratively. If I am behaving badly, let me know. My intent is to get as many people working on this as are willing to do so. But what can I do, in writing this page, but put down my ideas? Feel free to correct errors and suggest alternatives to suggested paths that you know are dead ends.
I have also never done a C project, and my C++ experience is that of an amateur. I have no doubt that the quality of the source code will depend on the contributions of others.
The oldest documentation I have found for wgml/gendev is located here. The document Waterloo SCRIPT is the only documentation of most of the control words supported (or not) by wgml. This is for version 88.1, which is dated "87DEC11".
The other files in that directory with names beginning with "script" document the state of wgml, including how to use various printers, in 1988. While interesting historically, they are not of much (if any) use in understanding and recreating the source for wgml 4.0 and gendev 4.1.
The primary documentation for wgml/gendev is the WGML Reference, which clearly states that it is based on version 90.1 of Waterloo Script. This document is copyrighted 1992.
The WGML 3.33 Update, when installed, includes a README directory with which wgml may be used to create a README document. This document is dated September 1992 and documents several new and changed features of wgml/gendev, which are present in wgml 4.0 and gendev 4.1.
In terms of documentation, then, we have three documents which, together, provide a great deal of information about how earlier versions of Script, the GML tags, wgml, and gendev worked. As it happens, most of this information does apply to wgml 4.0 and gendev 4.1, but not all of it. A fair amount of the effort recorded here was expended in determining exactly how gendev 4.1, and then wgml 4.0, actually behaved.
Notes on Terms
These notes are provided to help clarify some of the terminology used in this project.
Binary Device Files/Libraries
The entire purpose of gendev is to create and maintain binary device libraries. These consist in a set of binary device files encoding various :DEVICE, :DRIVER, and :FONT blocks plus the directory file "wgmlst.cop". wgml uses this information to produce, from the document specification, an output stream (usually written to a file rather than directly to the device) which will cause the device to produce the document.
On the computers currently targeted by Open Watcom, these binary device libraries take the form of a directory in which the binary device files are found. Examination of the WGML Reference suggests that, originally, there was exactly one binary device library, and it used an OS-defined and OS-provided "library" to hold the data.
Thus, the term "directory file" was originally unambiguous because the term "library" referred to the location of the data. When gendev/wgml were ported to the PC, the term "library" became purely conceptual, and the term "directory" then applied both to the location of the data and (as part of "directory file") to file "wgmlst.cop". In this Wiki, the term "directory file" always refers to the file "wgmlst.cop", and the term "directory" usually refers to a file system directory, but occasional instances where it refers to a "directory file" may yet be found. Corrections to those instances are ongoing.
The extension ".COP" was frequently used where "binary device file" was meant; this is also undergoing correction, and the older usage may still be found. The reason for this is historic: all of the binary device files available to me, whether in the Open Watcom repository or in the WGML 3.33 Update, use the extension ".COP". Indeed, the only way to produce a different extension is to provide it as part of the attribute member_name of a :DEVICE, :DRIVER, or :FONT block. Because the natural term to use for a binary file encoding a :DEVICE block is also "binary device file", a more detailed usage is found (and explained) in the discussion of member files.
Empty Strings and NULL
The term "NULL" is used exclusively to refer to NULL pointers.
The term empty string is used in place of the term null string, which is used in the WGML Reference, to avoid confusion with the use of NULL. An empty string in a source file has one of two forms:
'' or ""
On the command line, only the first form is valid: two double quotes ("") are interpreted by wgml as a missing value, producing this error message:
CL--004: Missing option value for 'device'
if it is the last item in the command stream, or, for the invocation
wgml test ( device "" incl
the message becomes
IO--008: For the device (or font) 'incl': The information file for this name cannot be found. If the device/font has been defined, the problem may be that the DOS SET symbol GMLLIB has not been correctly set to point to the device library.
which indicates that the "" is skipped and the next option is taken as the device name. In contrast, '', at least in copparse.exe, becomes a two-character string value. This presumably reflects how the command line is presented to the program, so wgml probably sees the same string.
In the binary device file format, an empty string has one of two forms:
- If the string is encoded as a length byte followed by an array of characters, then an empty string is the length byte only, with a value of "0x00". The array does not appear in any way.
- If the string is encoded as a character array terminated by '\0', then only the terminator appears.
Since the terminator '\0' is identical to a count of value "0x00", the effect in either case is the same: a single byte, of value "0x00".
An empty string in a binary member file is parsed as a NULL pointer when an empty string is a valid value for the corresponding attribute, and as a format error when gendev neither allows nor creates an empty string for the corresponding attribute.
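The two encodings, and the convention of returning a NULL pointer for an empty string, might be sketched as follows. This is only an illustration under stated assumptions: the function names, the in-memory buffer, and the return codes are hypothetical and are not taken from the project's parsing code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy len bytes into a new C string; an empty string becomes NULL.
 * Returns 0 on success, -1 if allocation fails. */
static int copy_or_null(const uint8_t *src, size_t len, char **out)
{
    *out = NULL;
    if (len == 0) {
        return 0;                   /* empty string: NULL pointer, no error */
    }
    *out = malloc(len + 1);
    if (*out == NULL) {
        return -1;
    }
    memcpy(*out, src, len);
    (*out)[len] = '\0';
    return 0;
}

/* Hypothetical: decode a length byte followed by that many characters.
 * *used receives the number of bytes consumed. Returns -1 on truncated
 * input or allocation failure. */
int read_counted_string(const uint8_t *buf, size_t avail, char **out, size_t *used)
{
    assert(buf != NULL && out != NULL && used != NULL);
    if (avail < 1 || (size_t)buf[0] > avail - 1) {
        return -1;
    }
    *used = 1 + (size_t)buf[0];
    return copy_or_null(buf + 1, buf[0], out);
}

/* Hypothetical: decode a character array terminated by '\0'. */
int read_terminated_string(const uint8_t *buf, size_t avail, char **out, size_t *used)
{
    assert(buf != NULL && out != NULL && used != NULL);
    const uint8_t *end = memchr(buf, '\0', avail);
    if (end == NULL) {
        return -1;                  /* no terminator within the buffer */
    }
    *used = (size_t)(end - buf) + 1;
    return copy_or_null(buf, (size_t)(end - buf), out);
}
```

Given the single byte 0x00, both functions consume one byte and set the result to NULL, which is the sense in which the two encodings of an empty string are indistinguishable. Whether an empty string is acceptable for a given attribute would still have to be decided by the caller.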
The DOS and OS/2 versions of wgml found in the Open Watcom repository both state that they are version 4.0.
The DOS and OS/2 versions of gendev found in the Open Watcom repository both state that they are version 4.1.
The binary device files produced by gendev 4.1 and read by wgml 4.0 are all identified as "V4.1 PC/DOS" in the header.
The Wiki currently reflects a more relaxed usage. An effort is in progress to correct this. The correct usage would be:
- Use "version 4.1" with binary device files and gendev only.
- Use "version 4.0" with wgml.
- Use "version 4.x" when both gendev and wgml are referred to or this version is referred to in general.
Until the correction is complete, some references to wgml as "version 4.1" or other errors may be encountered.
Keeping Priorities Straight
Implementing the code for parsing binary device files posed a classic problem: I had no way of predicting what form wgml would need the data in, and yet I had to provide it in some form both to show how it is done and to verify that I had described the format adequately. This problem persists with code intended specifically for use with wgml, and will apply to my future endeavors as well.
I was therefore compelled to make decisions which may turn out to be incorrect. For example, my decision to return each cop_device, cop_driver, and cop_font struct as a single block of memory on the heap, so that it can be freed with one statement without worrying about all the pointers it contains, also makes it very hard to modify, in particular, those areas where modification is most likely to be needed. This may still cause problems; however, the code written to load the binary library and create the "available fonts" revealed no problems with that decision.
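The single-block approach can be illustrated with a much-reduced stand-in for a struct such as cop_font. The type demo_font, its two fields, and the function name are hypothetical; the real structs carry many more fields, and this is a sketch of the technique rather than the project's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, much-reduced stand-in for a struct such as cop_font. */
typedef struct {
    char *defined_name;     /* points into the same allocation */
    char *member_name;      /* points into the same allocation */
} demo_font;

/* Allocate the struct and both strings as one heap block, so that a
 * single free() releases everything. Returns NULL on allocation failure. */
demo_font *demo_font_create(const char *defined_name, const char *member_name)
{
    assert(defined_name != NULL && member_name != NULL);
    size_t n1 = strlen(defined_name) + 1;
    size_t n2 = strlen(member_name) + 1;
    demo_font *f = malloc(sizeof *f + n1 + n2);
    if (f == NULL) {
        return NULL;
    }
    char *strings = (char *)(f + 1);    /* string area follows the struct */
    f->defined_name = memcpy(strings, defined_name, n1);
    f->member_name = memcpy(strings + n1, member_name, n2);
    return f;
}
```

Releasing the object is a single free(f). The cost is the one noted above: lengthening either string means reallocating the whole block and re-pointing every internal pointer.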
On the other hand, the version of cop_font used in the code written for wgml differs from that used originally because this struct needed to become an element in a linked list and the defined name turned out to be required so that the font which the cop_font contained data for could be identified. Thus, the structures and function signatures provided are not to be treated as being in their final form. As wgml reaches the point that it is known what it needs in terms of particular functions or of data structures, then changes will be made as needed to produce a properly-functioning program.
Note to Users of Linux and Related OSes
Currently, the research functions compile and link under Linux. So far as I know, nobody has tested them.
The function cfcheck() depends on being able to determine the size of disk files. For DOS/Windows/OS2, this is as simple as accessing a field in the DIRENT structure; but the Linux DIRENT does not have this field.
This difference has been factored out of cfcheck() and placed in lhdirect.h/lhdirect.c. This module includes the correct headers in either environment for the functions used and adds a function to obtain the file size. The Linux version of this function should be all that has to be implemented to get cfcheck() producing exactly the same output (when pointed at exactly the same directory) under Linux as under the other OSes.
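One plausible shape for the Linux side of that function uses the POSIX stat() call. The name lh_file_size and its signature are assumptions made for this sketch; the actual declaration in lhdirect.h may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Hypothetical Linux-side helper: returns the size in bytes of a regular
 * file, or -1 if the path cannot be examined or is not a regular file. */
long long lh_file_size(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0 || !S_ISREG(st.st_mode)) {
        return -1;
    }
    return (long long)st.st_size;
}
```

A caller walking a directory would need to build the full path from the directory name and the entry name before calling it, which is one reason the shared signature may need an extra parameter.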
It is my intention to factor out such differences and isolate them in lhxxx modules whenever they appear. There should be no #IFDEF sections separating Linux from the other OSes in any other source modules (there may be a few for #define macros in headers). Your cooperation in this is requested and appreciated. If this requires adjustment of the function signatures (to provide necessary additional parameters to the Linux version), then that is certainly what will happen, keeping in mind that both versions of these functions must have the same signature.
wgml emits CRLF characters ("0x0d" "0x0a") as output record separators. This works great in the DOS/OS2/Windows world, but may cause problems with Linux. It must be kept in mind that output records, conceptually at least, are sent to physical devices, prototypically line printers, which use CR to return the print head to the start-of-row position and LF to physically move the paper up -- and, presumably, do so whether the computer they are attached to is running one of DOS/OS2/Windows or is running Linux. So, simply replacing CRLF with a Linux newline may not actually work.
Note on DOS386
I discovered inadvertently that all of the research programs produced for 32-bit (extended) MS/PC-DOS, when invoked in an XP NTVDM running CMD, closed the window when they exited.
This did not happen in OS/2's MDOS. It does not appear to affect MS-DOS 6.22 in any way.
It is not clear if this is a problem or not. It is recorded here simply as something that may, or may not, be worth looking into in the future.
Note to Big-Endian Implementors
All binary device files available to me, whether part of the Open Watcom documentation build system or of the WGML 3.33 Update, are little-endian. This is quite obvious when the two-byte and four-byte values in the source files are compared with the values encoded by gendev in the binary files.
The code itself is also little-endian, since Open Watcom supports (at present) only the little-endian 80x86 processor family. The build system, however, does contain code generators (untouched, I gather from the newsgroup, for 20+ years) for other processors, some of which, I believe, are big-endian.
The function parse_header() does not check for endian as such; instead, it checks that the endian of the binary device file is the same as that of the program. It does this by comparing the numerical version as a two-byte integer. The value read and the value it is compared to will not match if the endian of the file is not the same as the endian of the program.
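The idea behind this check can be sketched in a few lines. The names, the two-value enum, and the placeholder version number used in the usage note are assumptions for illustration; they do not reproduce parse_header() or the real header contents.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum { HDR_OK, HDR_MISMATCH } hdr_status;    /* hypothetical */

/* Read two bytes in the program's native byte order and compare them
 * with the version the program expects. */
hdr_status check_version(const uint8_t *bytes, uint16_t expected)
{
    uint16_t found;

    assert(bytes != NULL);
    memcpy(&found, bytes, sizeof found);    /* native-order read */
    return found == expected ? HDR_OK : HDR_MISMATCH;
}

/* Store a version in native byte order, as a same-endian writer would. */
void write_version(uint8_t *bytes, uint16_t version)
{
    assert(bytes != NULL);
    memcpy(bytes, &version, sizeof version);
}
```

With a placeholder value such as 0x0102, bytes written by write_version() pass the check, and the same two bytes in swapped order fail it. Two limits of the technique are worth noting: a mismatch does not by itself distinguish a different-endian file from a file with a different version, and a value whose two bytes are equal would pass in either byte order.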
So, big-endian versions of gendev and wgml should work properly together as-is, provided that gendev is used to create a big-endian binary device library. The only situation in which problems arise is if the binary device library and the program have different endian. If the Open Watcom build system is expanded to be usable on big-endian processors, then this situation probably will arise, since there are only two binary device libraries and both are little-endian.
If it is ever necessary to modify gendev and wgml so that the programs can use binary device files with an endian different from their own, then there are several steps that will be needed:
- The parse_header() function would need to be modified to identify and report different-endian binary device files. This would involve extending the existing enum and the case statements in the calling programs which process the return value.
- Project-specific read and write functions would be needed so that different-endian files could be read from and written to properly.
The %binary2() device function inserts its argument into the output buffer as a two-byte integer. More specifically, a two-byte little-endian integer. To the extent that %binary4() can be said to work, it also uses a little-endian integer, in this case, one with four bytes.
In practice, %binary4() is not used in any source device file available to me, and %binary2() is only used with one argument -- "0" -- for which endian does not matter. But what people use these functions for in their own files cannot be predicted, and it would be helpful if a source file could be used with both big-endian and little-endian versions of gendev and wgml. Since, for literal arguments, %binary(), %binary1(), %binary2(), and %binary4() are compiled as if they were %image() functions, it is not possible for wgml to tell how the values were used in the source file. Presumably, the order in which the bytes must appear is determined by the device to which they are sent -- not by the endian of the computer the program is run on. On the other hand, many of these files are produced to be read by other programs, which may or may not need to know the endian of the data.
So, at this time, I can see two options:
- Standardize on little-endian.
- Provide little-endian and big-endian versions of %binary2() and %binary4() (for example: %le_binary2(), %be_binary2(), %le_binary4(), and %be_binary4()) in place of the current functions.
General Road Map
- Recreate gendev, to the extent that binary device files created by our gendev can be used by the existing wgml.
- Recreate wgml, to the extent that it can do what we need it to do.
- Test our gendev and wgml natively on 32-bit Windows and OS/2, on at least virtual DOS, and natively on DOS and Linux if possible.
- Rework the build system to use our wgml instead of the existing wgml. The existing wgml will, of course, still exist; it just won't be used in the build process.
This section is intended to "corral" subprojects and notes for recreating gendev.
The gendev road map appears at this point to be:
- Decode the binary device file format, at least as it is used in Open Watcom.
- Write test programs verifying that our understanding of the binary device file format as it is used in Open Watcom is correct.
- Explore how the current gendev performs such actions as: adding a new file to a library, removing a file from a library, and how it reacts when asked to add the same file, or a file with the same device/driver/font name, to an existing library.
- Write the replacement program.
These pages record the binary device file format as such:
- Binary Device Files
- Common File Blocks
- Device File Blocks
- Driver File Blocks
- Font File Blocks
- Device Functions
- Meta Data
These pages record progress toward decoding the device function language and its format in the binary device file:
These pages discuss other topics which apply to both wgml and gendev:
Implementation Notes (gendev)
This is intended to be a list of high-level or summary notes. The links given above for binary device files contain many additional notes.
- When invoked with no parameters, gendev currently waits for a command line to be entered. Initial tests suggested that, under Windows XP, at least, the only way to break out of the program is to close the window. This is incorrect: I was expecting Ctrl-C to terminate the program but, in fact, pressing the Enter key will do so. My intention nonetheless is to change this behavior so that invoking gendev with no parameters will display the list of possible parameters, as Open Watcom programs generally do.
- When exploring the binary device file format, it was observed that gendev adds new files to the front of the list in the directory file. Although only one file of each type (device, driver and font) was added, if this observation holds true, then it means that the file entry order in the directory file is (probably) not important, that is, that neither gendev nor wgml depend on it being in any particular order.
- When multiple files are processed at once (using the :INCLUDE tag) in an empty directory, then if one of the files named in an :INCLUDE tag is not found, no directory file is created: this implies that gendev creates/modifies "wgmlst.cop" after processing the source files and producing the binary device files.
- When exploring the binary file format, it was observed that each file is processed separately: processing a device file does not automatically result in the associated driver file and font file(s) being processed (unless, of course, the :INCLUDE tag is used to produce this result).
- When exploring both :DEVICE blocks and :DRIVER blocks, several items mentioned in the WGML Reference were found to have been dropped or altered, as shown by the README file that can be produced from the WGML 3.33 Update. The implication is that the documentation is for a version prior to version 3.33 and that we do not have documentation for version 4.1 of gendev or version 4.0 of wgml.
- When exploring the device file format, it became clear that the 3.33 and 4.1 versions do have different limits for some values. Thus, there were some changes between the two versions.
- When exploring the :DEVICEFONT block, it was found that "Sally" in a :DEVICEFONT did not match "Sally" in a :DEFAULTFONT but rather "sally" in the :DEVICEFONT did. The possibility that some, but not all, attribute values must be lower case needs to be examined.
- When exploring both :DEVICE and :DRIVER blocks, as noted on the pages Device File Blocks and Driver File Blocks in various places, gendev 3.33 enforces certain limits that 4.1 does not (the limits themselves are the same for both versions). However, this does not mean that 4.1 actually encodes the larger values: rather, they are allowed to wrap (or GP-fault 4.1). Our version should enforce these limits, just as 3.33 does, since there is no point in allowing out-of-range inputs to be accepted and mis-coded.
- Although multiple :VALUE blocks are allowed inside a :FINISH block, only the first is actually used by wgml. Perhaps this should be enforced by gendev, as it is for most other :VALUE blocks.
- The existence of junk bytes in P-buffers following the last CodeBlock means that designing gendev to produce files identical to those produced by gendev 4.1 is not likely to be possible; testing will instead need to be done to verify that wgml, whether using binary files produced by the existing gendev 4.1 or by our gendev, produces the same output.
- The curious case of PCGRDRV.COP discussed in Multiple CodeBlocks records another situation in which our gendev (and wgml) will not behave quite the same as the originals do. Unless, of course, this behavior turns out to be necessary to get our gendev or our wgml to function properly.
- gendev will need at least two tokenizers: one which uses whitespace as a separator but not within delimited strings; and one which uses "%" as the start character and "(" as the end character for device functions.
- gendev will need at least two parsers: one which enforces a strict order (:DEVICE, :FONTSWITCH) and one which enforces no order at all beyond requiring all attributes to precede the first include block (attributes, :DRIVER, :FONT).
- gendev accepts many device functions in contexts where wgml either ignores them or produces an "Abnormal program termination" message. Since our wgml must be able to work with binary files produced by gendev 4.1 or our gendev indistinguishably, our wgml will have to be written in such a way that it will do this -- not, however, with any actual existing devices, which avoid these errors. However, our gendev can, as suggested in many specific implementation notes, catch many of these problems at compilation time and issue an error message where it is most likely to do some good.
- Since they cause wgml 4.0 to GP-fault, gendev should probably not accept a :LINEPROC containing only a line pass number.
- gendev should object if so much code is provided that the uint16_t used for the size of the code, or of the object the code is embedded in, is not large enough. wgml, being a 32-bit program, should be able to handle much more code than gendev 4.1, if necessary.
- When implementing the code for finding the member name from the defined name, matching the "empty string" defined name turned out to be overly complex. Since this "feature" is probably neither used nor needed, the wgml code will be left not doing the match (which is simplest) and gendev should be written to not accept an empty string as a defined name.
- When gendev 4.1 was used with a file containing a list of :INCLUDE statements for source files one of which included a :DEVICE definition with only two :DEFAULTFONTS (all the other :DEVICE definitions had six), gendev 4.1 generated six :DEFAULTFONTS, the last four empty. When the file was processed alone, gendev 4.1 generated only two :DEFAULTFONTS. Our gendev should probably generate identical files regardless of which way it is used.
Implementation Notes (devices)
This section will accumulate such notes as occur from time to time. It is unlikely to be exhaustive or definitive.
:NEWLINE and Varying Line Heights
The section on computing line heights shows that it is possible for the values returned by device function %y_address() before and after a new line is specified to differ by amounts which depend on the font in use. Thus, for example, a device can use "2" with font 0 and "3" with fonts 1 on up, if the various relevant attributes have values which produce that result.
For clarity and compactness, the abbreviation ":NEWLINE x block" will be used for ":NEWLINE block with the value "x" for attribute advance" in this discussion. The term "vertical displacement" will be used to designate the difference in the values reported by device function %y_address() before and after the :NEWLINE blocks.
When :NEWLINE blocks are used to move to the new line, several interesting phenomena were discovered:
- A single :NEWLINE 1 block with font "3" at the top of a page produced a vertical displacement of "5". This may have been affected by a non-printing banner or the fact that a .bx control word was producing the output.
- A single :NEWLINE 1 block with font "0" produced a vertical displacement of "2".
- A single :NEWLINE 2 block with font "0" produced a vertical displacement of "5".
- A single :NEWLINE 1 block with font "1" produced a vertical displacement of "3".
With a slightly different setup, where font "0" corresponded to a vertical displacement of "1" and font "1" to one of "2", this was observed:
- A single :NEWLINE 1 was used with font "1" (font displacement "2") to produce a vertical displacement of "2".
- A single :NEWLINE 2 was used with font "0" (font displacement "1") to produce a vertical displacement of "2".
In effect, the :NEWLINE 1 block is taken to move device print position by a vertical displacement equal to the displacement of a particular font. The font (that is, the value returned by device function %font_number()) used to determine this value, however, appears to be not the current font when a :NEWLINE block is invoked but rather the last font of the prior line. This makes it hard to see how the :NEWLINE blocks could be written to detect and implement the displacement which wgml 4.0 is attributing to them.
A subsequent investigation revealed two interesting additional details:
- When a new section is entered, as discussed here, then a requirement to move down 10 lines produced 5 :NEWLINE 2 blocks even though the line height for both the "from" font and the "to" font, and the value returned by device function %line_height(), was "2". This suggests that the default font, the line height of which was "1", is in use for this purpose as well as for determining which font's :OUTTRANS table is to be used in this context.
- An instance was found in which a request for 7 lines was met with one :NEWLINE 2 block and 1 :NEWLINE 1 block, producing 6 lines at a line height of "2". On the desired line, wgml 4.0 first printed the top of a box formed with :BOX characters and then, with no :NEWLINE intervening, went back to the start of that line and printed the first line of text in the box (including the vertical line characters) on the same document line. The implication is that, if multiple line height values are to be allowed with devices which rely on :NEWLINE blocks, our wgml is going to have to improve how the situation is handled compared to wgml 4.0.
It would appear, then, that, when a device is designed, this rule should be followed:
If the device uses :NEWLINE rather than :ABSOLUTEADDRESS for vertical positioning, then each font should use the same line height. If this is not done, then a discrepancy may develop between where the device is printing the text and where wgml believes that the device is printing the text. With fan-fold paper, this is a recipe for disaster.
This does not mean that all the fonts must actually have the same height. It only means that the vertical sizing information for all of the fonts must produce the same line height. Differently-sized fonts would still actually print at different heights, but the line spacing would be constant.
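The arithmetic behind the rule can be illustrated with a deliberately simplified model in which each :NEWLINE x block moves the print position by x times a line height. This model is an assumption made for the example: it agrees with the second setup and with the two later observations, but it does not reproduce every measurement from the first setup (the displacement of "5" for a :NEWLINE 2 block, for instance). The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: total vertical displacement produced by a sequence of
 * :NEWLINE blocks, taken as the sum of advance * line_height. */
unsigned total_displacement(const unsigned *advances, size_t count,
                            unsigned line_height)
{
    unsigned total = 0;

    assert(advances != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        total += advances[i] * line_height;
    }
    return total;
}
```

Under this model, five :NEWLINE 2 blocks give 10 when evaluated with a line height of "1" but 20 when evaluated with a line height of "2", and one :NEWLINE 2 block followed by one :NEWLINE 1 block gives 6 at a line height of "2". The gap between such totals is the kind of discrepancy the rule is intended to avoid.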
Further experience suggests that wgml 4.0 is just applying the normal rules to these devices, and that our implementation will do much the same. Changing this to do better can wait until such time as it becomes necessary, that is, until such a device needs to be supported. Since this may never happen, no further action may ever be needed.
If :ABSOLUTEADDRESS is used, then there should be no problem, since the position designated by %y_address() is the position which will almost certainly be used in the implementation of that block.
gendev as a Research Program
After giving the matter considerable thought, I have finally devised an operational definition of the V4.1 binary format:
those items, and only those items, actually used by wgml from the binary files are part of the binary format.
There are several items that are not parsed because I do not see a need for them:
- The field next_codeblock discussed in the section on the binary device file format.
- The field DriverFile.unknown discussed in the section on the binary driver file format.
- The fields ShortFontStyle.unknownCount and ShortFontStyle.nulls discussed in the section on :FONTSTYLE blocks.
- The device function flags used by :FONTSTYLE and :FONTSWITCH blocks.
- Several fields in the CodeBlock structure.
- The inversions in field order associated with :LINEPROC blocks.
- The occasional "junk byte" associated with multiple CodeBlocks.
- The padding at the end of the binary file.
Note: Despite its name, the field Attributes.unknown in the :DRIVER block is almost certainly a count field, giving the number of data bytes following it. It is called "unknown" because those bytes are not all part of the Attributes.
The initial version of our gendev is not expected to accommodate these in any way: neither set flags, nor pad files, nor skip the occasional output position. This may cause problems with wgml 4.0: and that will tell us which, if any, of these items are, in fact, features which are part of the V4.1 binary format. There are two different types of problems that may occur:
- wgml 4.0 may simply refuse to work unless the feature is enabled; and
- wgml 4.0 may work without the feature, but may produce different results than it does when the feature is present.
The second case will, of course, also provide some insight into what use wgml 4.0 makes of the information.
Thus, gendev can reasonably be expected to become the ultimate research program. However, since wgml as used in the context of the Open Watcom document build system is what this project is concerned with recreating, it seems likely that the testing, some of it at least, will need to be done using the Open Watcom document build system. There are two immediately apparent requirements:
- the testing will need to be done in a copy of the Open Watcom document build system, a copy that can be modified without affecting Open Watcom itself; and
- the Open Watcom document build system itself will have to be reasonably well understood and, if necessary, modified to become a stable platform for testing.
Clearly, when the time comes to test wgml, the same requirements will apply.
The exact details of what needs to be done to the existing build system to become a stable testing platform can, of course, only be determined by examining it, although some preliminary notes can be found at Document Build System. However, a few items have come up in the newsgroup that should certainly be looked at as candidates for modification:
- Some files (those found so far appear to be related to the PostScript device but there may be others not related to PostScript) exist in multiple copies; the suspicion is that only one copy is actually used; the situation should be clarified and unneeded files removed (or renamed, if they should be retained for some reason).
- The primary directory file (ow\docs\gml\syslib\wgmlst.cop) is a mess, as documented on page Directory File Format. It needs to be replaced.
- Most of the binary font files in ow\docs\gml\syslib\ do not have source files. Adding the source files and creating a genall.pcd file which when processed by gendev causes all the files in the library to be recreated by gendev would probably make sense.
If you can think of anything else that might be worth looking at when the Open Watcom document build system is reviewed, or have additional information on any of the above topics, please add the information!
This section is intended to "corral" subprojects and notes for recreating wgml.
While an actual wgml road map cannot be given, it should be noted that the input side (both command-line parameters and document specification processing, as well as related issues) is being worked on by my co-implementor. The binary device file parsing project, including the device function parsing, is concerned with the output side (or at least that part of it that shapes the output for use with a specific device -- the boundaries are not clear at this point).
These pages, which record the binary device file format as such, also identify a few areas to check when wgml is implemented:
- Binary Device Files
- Common File Blocks
- Device File Blocks
- Driver File Blocks
- Font File Blocks
- Device Functions
- Meta Data
These pages discuss topics which apply to both wgml and gendev:
These pages relate directly to wgml:
- Augmented Devices
- Device Function Notes
- Drawing Boxes
- GML Tag Notes
- Keyword Statistics
- Page Layout Subsystem
- System Symbol Notes
- Tabs and Tabbing
- wgml Fonts
- wgml Sequencing
- When invoked with no parameters, wgml 4.0 currently waits for a command line to be entered. Initial tests suggested that, under Windows XP, at least, the only way to break out of the program is to close the window. This is incorrect: I was expecting Ctrl-C to terminate the program but, in fact, pressing the Enter key will do so. My intention nonetheless is to change this behavior so that invoking wgml with no parameters will display the list of possible parameters, as most Open Watcom programs do. This is already implemented.
- The documentation for wgml implies that it supports several printers by providing printer-specific forms of certain functions. My intention is to reduce the list of supported printers to one -- PostScript, and to do it in a way that can be used as a model if anyone wishes to add support for any other printer in the future.
- When exploring the binary file format, it was observed that wgml 4.0 will not work with version 3.33 binary device files.
- When exploring the binary device file format, it was observed that wgml cannot directly use the files defining devices, drivers or fonts; it can only use them when they have been processed by gendev into binary device files.
- When testing the use of extensions in search paths, it was found that, if the output file has the same name as the source file given on the wgml command line (which is quite common) and the extension for the output file matches any of the extensions wgml is using to find the source file, then wgml 4.0 loops endlessly, processing the output file as if it were the source over and over and over again. This happens even when the directory starts out empty, that is, before the output file exists. The reason for this appears to be that wgml creates the output file and runs the code for the :INIT block with "start" for the value of attribute place before it looks for the source file; this was seen to happen when the source file could not be found. Now, this should be detectable since running the code for the :INIT block implies that the :DRIVER has been loaded, which in turn implies that the :DEVICE has been loaded, which means that the value of the attribute output_suffix is available, and that value can certainly be checked against the extensions wgml is planning to use in its search. The documentation should probably advise caution in assigning a value to output_suffix, in adding an extension to a source file on the command line, or in providing a value for option ALTEXTENSION.
- At present, the implementation of our wgml does not allow paths in file names, either for document specifications or for option files. This was done because wgml 4.0's behavior was difficult to determine and the Open Watcom build system does not appear to use paths with file names but instead puts the paths in the environment variables, where they belong.
- At present, in our wgml code, all defined names are compared without regard to case. This may have to be reconsidered if our wgml is ever released beyond the Open Watcom document build system.
- At present, in our wgml code, defined names cannot contain spaces. Since neither of the device names used with Open Watcom contains spaces, this poses no problems. This may have to be reconsidered if our wgml is ever released beyond the Open Watcom document build system.
- At present, the parts of our wgml code that relate to the binary device library and the output buffer use mostly unsigned integers. This was done because the fields in the binary format appeared to me to be unsigned integers. When testing the effect of the :PAGESTART block, the :PAGEOFFSET block, and the :PAGEADDRESS block on the apparent internal state of wgml 4.0, I found that, when the value of attribute x_start was "0" and the value of attribute x_positive was "no", %decimal(%x_address()) would, in fact, produce a negative value. The flags in the :PAGEADDRESS block are both set to "true" (mostly by default) in all but two of the devices known to me, and both of those have a large, positive, nonzero value for the corresponding attribute in the :PAGESTART block. So, unless someone comes up with an actual device that they need to use with our wgml which allows negative values for positioning the print head, or some other reason for using signed values appears, the relevant parts of our wgml code will continue to use unsigned integers.
- While attempting to clarify the use of empty text_chars instances at the start of lines and the treatment of newline sequences in the document specification's text, I discovered that at least one :BANNER section (topodd, letlast) in the :LAYOUT section has an interesting effect on the DOCUMENT :INIT block: it causes an initial :VALUE block (at least) to use available font "2" rather than "0", and the first :FONTVALUE block (at least) to start with available font "2" (it is then done for available font "1", then for available font "2" again, but never for available font "0"). Note that other :LAYOUT items and :VALUE/:FONTVALUE blocks within the DOCUMENT :INIT block may show these effects as well; they were not tested. Review of the existing devices shows that only PS has a DOCUMENT :INIT block, and that that block contains a single :VALUE block that does not refer to the available font number. Since this behavior is hardly correct, our wgml will not do this unless it becomes necessary, which is unlikely even if it is released more widely. The :BANNER sections were being used to suppress the banners defined by the default layout to enhance control over the test documents.
- Various sections indicate items not implemented despite being researched. Research was done with test devices which implemented all of the :LINEPROC sub-blocks and, in several instances, it became clear that the weirder behavior seen would never be triggered by any actual devices. As this realization grew, willingness to not implement the unneeded behavior grew as well. Some items which might have been unimplemented under this criterion were implemented before the criterion was developed, and have been left in place, as they do no harm and might, in the long run, make sense or turn out to be needed. Any item not implemented because it appears to be not needed will be implemented if a need for it becomes apparent.
- Testing with boxtest.gml revealed some items which would have been treated as bugs and fixed, except for the fact that they appeared in the PS file, and so had to be duplicated. Note that concatenation was on in all cases. After our wgml is firmly established, they should probably be reversed to improve the appearance of the documentation.
- Macros invoked through a symbol have a space after preceding text even when no space is in the source document: "&LANGL.test&RANGLE." produces "<test >" rather than "<test>".
- Macros invoked through a symbol (but definitely not through a user tag) which follow a user tab have an extra space inserted as well. If the text is part of a column aligned by the user tab, the text affected will not be properly aligned.
- Tags which start at the beginning of a physical input record and which are followed (in the next physical input record) by control word CT insert a space anyway. What appears to happen is this: the prior physical input record's end-of-line character is converted into a space and this space is kept, while the CT applies to the similar space created from the end-of-line character of the physical input record in which the tag appears.
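Returning to the endless-loop item above: the suggested check of output_suffix against the extensions wgml plans to use in its search might look something like the following. This is only a sketch. The function names, the array of extensions, and the decision to compare without regard to case are all assumptions made for illustration, not anything taken from wgml 4.0 or from our code.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: compares two extensions without regard to case. */
static bool same_ext( char const * a, char const * b )
{
    while( *a != '\0' && *b != '\0' ) {
        if( tolower( (unsigned char)*a ) != tolower( (unsigned char)*b ) ) {
            return( false );
        }
        a++;
        b++;
    }
    return( *a == *b );     /* equal only if both strings ended together */
}

/* Returns true if output_suffix matches any of the count extensions in
 * search_exts, the (assumed) list wgml would try for the source file. */
bool suffix_collides( char const * output_suffix,
                      char const * const * search_exts, size_t count )
{
    size_t i;

    for( i = 0; i < count; i++ ) {
        if( same_ext( output_suffix, search_exts[i] ) ) {
            return( true );
        }
    }
    return( false );
}
```

If such a function returned true, wgml could issue a message and stop before creating the output file, rather than looping.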
Since the 4.1 version of gendev differs from the description in the documentation, it is clear that the documentation will have to be updated at some point to reflect the actual behaviour of the wgml/gendev produced.
Attempting to produce a PDF from openwatcom\docs\doc\wgmlref produced an error and a foreshortened file. This may, of course, reflect the version of GhostScript I am using to do the conversion as much as anything else -- but it will have to be investigated at some point.
The various pages cited above for gendev and wgml also contain information which will affect, or even go into, the documentation.
These pages record the device library formats:
- Binary Device Files
- Common File Blocks
- Device File Blocks
- Driver File Blocks
- Font File Blocks
- Device Functions
- Meta Data
This page records the device function format within the binary device file:
These pages discuss topics which apply to both wgml and gendev:
These pages relate directly to wgml:
- Augmented Devices
- Device Function Notes
- Drawing Boxes
- GML Tag Notes
- Page Layout Subsystem
- System Symbol Notes
- Tabs and Tabbing
- wgml Fonts
- wgml Sequencing
And, while the statistics may not need to go into the documentation, the list of keywords supported will need to be given (with documentation of what each supported keyword does):
From time to time, issues occur to me that can only become worth pursuing if, after this project is complete and the Open Watcom document build system uses our gendev and our wgml, it is decided to provide gendev and wgml to a larger audience (such as by distributing the programs and associated files with Open Watcom itself). I record them here in the hope that, having been written down, they will not distract me from the task at hand.
New Binary Format
There are several changes possible here.
The version number could be changed to "0x0100" and the version string to "WGML/OW V1.0" to identify the new format with Open Watcom.
The padding at the end of the file, if our gendev has to produce it for the benefit of wgml 4.0, can be dropped.
Several counted items can be shortened or otherwise improved:
- Since the concept of more than 64K binary device files is appalling, the count in the directory file could be reduced to two bytes (in effect, we would be using the version 3.33 directory file).
- The box characters could be preceded by "0x0B" and only the 11 characters actually used placed in the file, saving four bytes that have to be skipped.
- The underscore character could be preceded by "0x01" and occupy only one byte, saving another four bytes.
- In the Driver file, instead of
04 fill_char x_positive y_positive 00
where the "0x04" is not treated as a count field only because it is counting one of the Attributes plus the PageAddressBlock, we could instead have
01 fill_char 02 x_positive y_positive
which would occupy the same space but be much more consistent with the general format of these files.
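One attraction of the more consistent layout is that a single routine could then read every counted item. The following is a minimal sketch of such a routine, assuming a one-byte count followed by that many data bytes; the function name and the in-memory buffer interface are hypothetical and are not taken from the existing parsing code.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads one count-prefixed field from buf (len bytes) starting at *pos:
 * a count byte followed by that many data bytes. The data is copied to
 * out (capacity out_size) and *pos is advanced past the field.
 * Returns the count, or -1 if the field is truncated or does not fit. */
int read_counted( uint8_t const * buf, size_t len, size_t * pos,
                  uint8_t * out, size_t out_size )
{
    size_t count;
    size_t i;

    if( *pos >= len ) {
        return( -1 );
    }
    count = buf[*pos];
    if( count > out_size || count > len - *pos - 1 ) {
        return( -1 );
    }
    for( i = 0; i < count; i++ ) {
        out[i] = buf[*pos + 1 + i];
    }
    *pos += count + 1;
    return( (int)count );
}
```

With the proposed layout, two successive calls would return the fill_char and then the x_positive/y_positive pair, with no special case for the "0x04".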
Several aspects of the compiled form of the device functions could be revised so that they are shorter:
- The number of Offsets could be reduced to two, used for the parameters: our parsing code does not use the first offset at all and no device function uses more than two parameters.
- The four-byte struct used for numeric literals could either be shortened to two bytes (the last two bytes are always null, the struct can only encode two-byte integers anyway) or used to encode four-byte integers (most of the two-byte attribute values in version 3.33 became four-byte values in version 4.1, and this would be consistent with that change).
- The four-byte struct used for numeric literals is never used in the Header; all Headers could be ShortHeader structs.
- In point of fact, the four-byte struct used for numeric literals is only used in a Directive with op_code "0x3C". It need only be present in those Directives.
- Actually, a Directive instance with op_code "0x3C" could consist of the op_code followed immediately by the literal value: since the value is a literal, it cannot take parameters, so the Offsets instance is not needed.
- Similarly, a Directive instance with an op_code of "0x00", which indicates a literal char parameter, could be followed immediately by a CharParameter struct, since, again, this value takes no parameters.
- All flags can be omitted, unless, of course, testing shows that some need to be captured and used by our wgml, in which case those flags (only) will need to be kept. In particular, since the cb05_flag records the presence or absence of a :STARTVALUE block in a :FONTSTYLE block, and since the cop_driver struct will have a NULL pointer for that compiled block if there was none present, the cb05_flag can be omitted.
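To make the shortened numeric-literal Directive concrete, here is a sketch of what writing and reading it might look like if the op_code were followed immediately by a two-byte literal. The one-byte op_code, the little-endian byte order, and the function names are assumptions made for this illustration only; none of this describes the existing format.

```c
#include <stddef.h>
#include <stdint.h>

#define OP_NUMERIC_LITERAL 0x3C     /* op_code for a numeric literal */

/* Writes the proposed three-byte form into out: the op_code, then the
 * value as two bytes, low byte first (assumed). Returns bytes written. */
size_t encode_numeric_literal( uint8_t * out, uint16_t value )
{
    out[0] = OP_NUMERIC_LITERAL;
    out[1] = (uint8_t)( value & 0xFF );
    out[2] = (uint8_t)( value >> 8 );
    return( 3 );
}

/* Reads the proposed form back. Returns 1 and sets *value on success,
 * or 0 if in is too short or does not start with the op_code. */
int decode_numeric_literal( uint8_t const * in, size_t len, uint16_t * value )
{
    if( len < 3 || in[0] != OP_NUMERIC_LITERAL ) {
        return( 0 );
    }
    *value = (uint16_t)( in[1] | ( in[2] << 8 ) );
    return( 1 );
}
```

No Offsets instance appears anywhere in this form, which is the point of the proposal: a literal takes no parameters.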
It has been confirmed that we cannot redistribute anything in this file. We can certainly not, for example, remove .FON or .PCD files from it to provide every binary file in the Open Watcom repository with a source file. Indeed, it appears that even having it on our FTP server is improper, so referring to it in anything we distribute appears to be out of the question as well.
The existing wgml is based on the typical DOS pattern of providing a driver for each supported printer. This apparently was how the IBM mainframe and VAX VMS computers worked as well, since the device library was clearly used on those systems.
The Open Watcom repository contains only a few devices (WHELP, HELP, PS, TASA, TERM). wgml33.zip contains many more, but it is not available to us. Creating new drivers for new printers is something that only an owner of the printer could test properly. I could create one or more device/driver files and various font files for my Panasonic KX-P1624: the Operating Instructions contains the complete command sets (Epson LQ-2500 and IBM Proprinter XL24), character set charts, and proportional spacing tables, so it should be possible. But my Epson Stylus Color 880 documentation has none of that information. Unless such information is available online, a device/driver pair and font files would be impossible to write.
At least for Windows and, presumably, OS/2 and Linux, the freely-available GhostScript/GSView program can be used to print PS files to any printer supported by the OS. (Examination of the download page shows several versions for GhostScript, with a variety of available packages, but it appears that binaries for OS/2 as well as 32-bit and 64-bit Windows can be found, not always of the same or the latest version, and that Linux users are expected to make do with source code. GSView shows binaries for OS/2 and both 32-bit and 64-bit Windows and a source code package for Linux users.)
Doing this greatly simplifies the program, albeit at the cost of no longer supporting many of the devices found in the WGML 3.33 Update. This support can, of course, be added if needed in the future. A list of known restrictions and unsupported devices is given here.
Restricting record specifications to those starting with "t:" plus a number removes support for two groups of devices:
- This device, which uses "t:c:133":
- These devices, which use file type "f":
hpfaxsys hplaser hplaserii hplaserplus hplaserpluslegal mlt mltexpress qume tty x2700
This section was created because these device/driver pairs (there are several) accommodate what I believe are called "soft fonts", that is, fonts that must be downloaded to the printer, with two programs: readfont and loadfont. At present, output files using a "soft font" must be processed by loadfont before they can be sent to the printer.
Since wgml33.zip is not available, any HP printer support will have to be provided through the PS/PSDRV device with GhostScript/GSView providing the link to the printer driver. If this does not work, then an HP-printer user will have to solve the problem without taking anything (such as the source for readfont and loadfont) from wgml33.zip.
The PS/PSDRV device provided in wgml33.zip actually produces Encapsulated PostScript files. This is why they must have a file such as ezamble.ps prepended to them before they can be printed or converted to .PDF.
The PS/PSDRV device in the Open Watcom binary device library produces actual PostScript files, which can be printed or converted to .PDF as-is. Several copies of the source files exist and, in the case of PSDRV.PCD, they differ.
This creates a potential problem: if other existing document build systems than Open Watcom's exist and our PS/PSDRV device is substituted for the version already in use, which would presumably be the wgml33.zip version, this would break those build systems.
Ideally, the PS/PSDRV provided with wgml33.zip would be renamed as eps/epsdrv and altered to produce .EPS files. This would be a more accurate device name, and might provide an easier method of fixing a broken build than adapting to a different PS/PSDRV device, since only the command-line parameter "DEVICE" would need to be changed to keep the current behavior. Since wgml33.zip is not available, the best way to do this is to see if an earlier version of one of the PSDRV.PCD files matches the version in wgml33.zip: if it does, then we can clearly recover and use it, since the earlier versions of files in the Open Watcom repository are, themselves, in that repository.
Note that, for this to work, our wgml would have to treat the device filename prefix "eps" in the same way as it will have to treat "ps", which is to say, the same way that wgml 4.0 treats the prefix "ps", whatever that turns out to be.
Code Base Structure
This section describes the overall structure of the code created for this project. The code itself is located in bld\wgml.
At the moment, this section covers my contributions only. As others contribute to the project, they may or may not choose to update this section. Or add their own section. Or whatever they wish.
Prefixes And Special Function Names
This section discusses the prefixes and special function names which I found myself using while writing the code. There is no obligation to use them; they are discussed here to promote understanding.
If some of these items seem reminiscent of C++ classes, well, all my pre-wgml programming was in C++, and old habits die hard.
The typedef structs used to hold the data from the binary device library files are prefixed with "cop_". When the time came to create global variables to point to the instances, instead of
cop_device * cop_device;
which, even if it had worked, would have been very confusing, I created the prefix "bin_" (for "binary device library") and wrote
cop_device * bin_device;
Fonts were more complicated. I did use "bin_fonts" to point to the head of the linked list of cop_font instances, for conformity with "bin_device" and "bin_driver", but there are two more font typedef structs: opt_font for command-line OPTION and wgml_font for the "available fonts" actually used by wgml. The corresponding globals are "opt_fonts" (points to a linked list) and "wgml_fonts" (points to an array).
These are used to provide separate name spaces for various groups of functions. The groups are:
cop   parsing the binary device library
df    device function interpreter use
fb    function block interpretation, see below
ff    find file
ob    output buffer
A function starting with "fb_" interprets at least one function block. It may interpret more than one block, and may perform other actions as well. The highest-level "fb_" functions are task-oriented: they accomplish a particular task involved in producing the output for the device or file. The lowest-level "fb_" functions are specific to interpreting a particular function block.
Functions *_setup() and *_teardown()
These functions always appear prefixed.
The *_setup() functions initialize global and file-level local variables. They only exist when needed, and are to be used at the start of the program only.
The *_teardown() functions free any memory allocated for use with any pointers among the global and file-level local variables initialized by the corresponding *_setup() function. They are intended to be used by the program on the way out.
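As an illustration of the pattern, a setup/teardown pair for a module owning one allocated buffer might look like the sketch below. The "ob_" prefix is borrowed from the list above, but the variable, the buffer size, and the function bodies are hypothetical, and malloc() is used directly only to keep the example self-contained.

```c
#include <stdlib.h>

#define OB_BUF_SIZE 80      /* hypothetical buffer size */

/* File-level local variable owned by this (hypothetical) module. */
static char * ob_buffer = NULL;

/* Called once at the start of the program: initializes the module's
 * variables and allocates whatever they need. */
void ob_setup( void )
{
    ob_buffer = malloc( OB_BUF_SIZE );
    if( ob_buffer == NULL ) {
        exit( EXIT_FAILURE );
    }
    ob_buffer[0] = '\0';
}

/* Called once on the way out: frees whatever ob_setup() allocated and
 * leaves the pointer in a known state. */
void ob_teardown( void )
{
    free( ob_buffer );
    ob_buffer = NULL;
}
```

Resetting the pointer to NULL in the teardown function is a design choice, not a requirement: it makes an accidental second call harmless.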
The Binary Device File Subsystem
This section discusses the files whose names start with "cop" in ow\bld\wgml. Because files in the Open Watcom repository are expected to have names conforming to the FAT 8.3 file naming convention (the exceptions being files whose names are specified by a standard implemented by or the implementation of which is included in Open Watcom), "cop" is used as a code for "binary device file".
These are the files involved. The code, of course, is the best documentation:
copfiles.h declares all of the structs used to hold the parsed data from the binary device files encoding :DEVICE, :DRIVER, and :FONT blocks. These are slightly different from the versions used with the research programs, which were a bit inconsistent because they were developed at three different times. In addition, copfiles.h declares, and copfiles.c implements, the public API to the binary device file subsystem. This is the only file the rest of wgml should ever need to include.
copdir.h declares a struct and an enum used with binary device library directory files. It also declares, and copdir.c implements, the functions used to extract compact and extended entries from a directory file.
copdev.h, copdrv.h and copfon.h declare, and copdev.c, copdrv.c and copfon.c implement, the functions used in parsing the post-header part of binary device files encoding :DEVICE, :DRIVER, and :FONT blocks (respectively).
copfunc.h declares structs used with encoded device functions embedded in binary device files. It also declares, and copfunc.c defines, functions used in processing those functions as blocks of binary data.
As discussed in Common Attributes, wgml can use an empty string to search for a directory file entry. In particular, wgml will accept
on the command line and a zero-length string in a binary member file, but will not accept
on the command line, treating it (reasonably enough) as a missing device defined name.
The research program "copparse.c" was modified to test the value of the global variable tgt_path and, if it was two single quotes, to change it to a zero-length string. This works; however, since it was done in the research driver code, this means that wgml will have to do this eventually, if we wish to retain this behavior in our wgml.
copparse.exe will accept "" and treats it the same as an empty string. Our wgml will probably have to reject it.
The Error Message Quandary
The WGML Reference lists these error messages for both wgml and gendev:
LI--001 %s %nis not a library file or a directory
LI--002 %s %nis incompatible with this version of the library manager
The first would most reasonably be used when searching through the directories in the search path and a file named "wgmlst.cop" is found but it is not, in fact, a binary device file at all. The second would most reasonably be used when a binary device file is found but the version is wrong.
In both cases, the best place to put these messages would be the function invoking parse_header(). The reason for this is that parse_header() does not know the file name but the calling function, presumably, does. This inconvenient fact makes this a much less compelling example than I had hoped, and it remains to be seen where and if this problem actually arises.
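The division of labor described here, in which the header parser reports only what it found and the caller, which knows the file name, produces the message, can be sketched as follows. The enum and its tags are hypothetical stand-ins (the real cop_file_type tags may well differ), and the "%n" in the documented message text is simply dropped because its meaning has not been established.

```c
#include <stdio.h>

/* Hypothetical result codes; the real cop_file_type tags may differ. */
typedef enum {
    hdr_ok,
    hdr_not_library,    /* not a binary device file at all */
    hdr_bad_version     /* a binary device file, but the wrong version */
} hdr_result;

/* Formats the message for result into buf (size bytes). file_name is
 * supplied by the caller, since the header parser does not know it.
 * Returns the snprintf() result, or 0 when there is nothing to report. */
int format_header_error( char * buf, size_t size, char const * file_name,
                         hdr_result result )
{
    switch( result ) {
    case hdr_not_library:
        return( snprintf( buf, size,
            "LI--001 %s is not a library file or a directory", file_name ) );
    case hdr_bad_version:
        return( snprintf( buf, size,
            "LI--002 %s is incompatible with this version of the library manager",
            file_name ) );
    default:
        return( 0 );
    }
}
```

The caller would pass the formatted text to whatever function ends up emitting messages.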
Depending on how errors are handled, this can pose a problem. The choices appear to be:
- Put the text of each message in the place where it is used. This would make maintaining a naming system like that used by wgml and gendev difficult, but there is no reason that the existing practice of wgml and gendev needs to be duplicated.
- Reproduce the appearance of errors in wgml and gendev and consolidate all errors into a single large string table. A really large enum would be used to provide symbolic names to access the array. The problem here is that, while some error messages are common to both wgml and gendev, most are not. Each program would be stuck with a large number of strings which are never used.
- Keep the gendev and wgml codebases completely separate, and create separate error systems. The problem here is that both programs do many of the same things, up to a point, so functionality such as "look up this file using this search path" would have to be located in two places, which has maintenance implications.
- Allow gendev to use some of wgml's code (for simplicity, all common code would be located in wgml's code) and have three sets of error messages: one for wgml, one for gendev, and one for the common functions that emit errors. This would impose on wgml a distribution of the affected functions which might not be the distribution which would otherwise make sense.
- More experience with enums in C has "reminded" me that, unlike C++, enum tags in C share one large namespace, which means that, if they have different values for the same error message, they must also have different names.
- Have two sets of error messages, one for gendev and one for wgml, and have gendev use a function that converts wgml's enum tags to gendev's enum tags. This may appear impractical; however, if, for example, wgml uses "wgml_error()" for its error messages and gendev uses "gendev_error()" for its error messages, then gendev could provide a version of "wgml_error()" that does the conversion and calls "gendev_error()". Of course, there is likely to be more than one function involved in processing error messages, since, first, some of them are warnings, and, second, different messages will use different additional parameters.
- It has occurred to me that, since out_msg() is used for all error messages, it might make sense to wait until gendev and wgml are done to identify the error messages that are, in fact, emitted and organize them at that time.
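The option in which gendev supplies its own version of "wgml_error()" can be made concrete with a small sketch. Everything here is hypothetical: the enum tags, the table, and both functions are invented for illustration, and a real version would also need to handle warnings and additional parameters.

```c
#include <stdio.h>

/* Hypothetical tags. C enum constants share one namespace, so the two
 * sets must use distinct names even for the same message. */
typedef enum { wgml_err_no_file, wgml_err_bad_header, wgml_err_count } wgml_msg;
typedef enum { gd_err_bad_header, gd_err_no_file, gd_err_unknown } gendev_msg;

/* Maps each wgml tag to the gendev tag for the same message. */
static gendev_msg const wgml_to_gendev[wgml_err_count] = {
    gd_err_no_file,         /* wgml_err_no_file */
    gd_err_bad_header       /* wgml_err_bad_header */
};

/* Records the last tag reported, so the conversion can be observed. */
static gendev_msg last_reported = gd_err_unknown;

/* Stand-in for gendev's own reporting function. */
void gendev_error( gendev_msg msg )
{
    last_reported = msg;
    fprintf( stderr, "gendev error %d\n", (int)msg );
}

/* gendev's version of wgml_error(): shared code calls this name, and
 * gendev converts the tag before forwarding it. */
void wgml_error( wgml_msg msg )
{
    gendev_msg converted = gd_err_unknown;

    if( (int)msg >= 0 && msg < wgml_err_count ) {
        converted = wgml_to_gendev[msg];
    }
    gendev_error( converted );
}
```

The shared code never sees gendev's tags, so it can be compiled unchanged into either program.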
My intention is to proceed following these guidelines:
- All existing error messages will be removed.
- Appropriate error messages for conditions which a user can correct (or reporting an "internal error") will be output directly from what appears to be the best location for issuing them.
- wgml and gendev will share as much code as possible.
This section will be used to document expected interdependencies with, and between, wgml and gendev. My collaborator, who is working on wgml, has advised me that his code is experimental and that the style will change to use lower-case names separated by underscores. Those changes will be reflected here and in any Binary Device File Subsystem and/or gendev code that exists when the change is made. It should be noted the Binary Device File Subsystem code is not written in stone either and will be changed as needed.
Binary Device File Subsystem Items for wgml
These three functions are provided for use by wgml:
cop_device * get_cop_device( char const * defined_name );
cop_driver * get_cop_driver( char const * defined_name );
cop_font * get_cop_font( char const * defined_name );
The parameter in each case is the "defined name" of the device, driver, or font desired. A NULL pointer is returned if any problems are encountered.
The structs cop_device, cop_driver, and cop_font (and all of the structs they contain) are also available to wgml for its use.
It should be noted that alternative function signatures are possible:
- The device's defined name is a required command-line (or option file) parameter and, while it might be read in, converted to a cop_device *, and then discarded (this might happen if it is not ever needed for any other purpose), it might also be saved in a global variable, in which case get_cop_device() would not need a parameter but could use the global variable instead since there can be only one device file.
- If the cop_device * is stored in a global variable, then get_cop_driver() could be written to get the driver's defined name from the (unique) cop_device.
Also, there are other functions which may be provided if they turn out to be useful:
- A brief test done to determine how to store the CodeBlocks encoding the :VALUE and :FONTVALUE blocks in the :INIT block showed that each :FONTVALUE block is invoked multiple times, apparently once for each :DEVICEFONT and once for the :UNDERSCORE font if it is given as a font name rather than a font number. Since the :INIT block(s) are, presumably, executed fairly early in the process, this means that get_cop_font() will have to be invoked for those fonts fairly early as well. A function which does this for wgml might be worth writing once it can be tested.
- Many of the items from the binary device files are embedded rather deeply in the structs. Utility functions to find, say, the FontStyle for "bold" or the CodeBlock text for the :NEWLINE with a value of "3" for attribute advance might be helpful.
When the Device Function Language project is done, the additional functions needed should be easier to identify.
Binary Device File Subsystem Items for gendev
Buried in the file "copfiles.c" is this function, which actually converts the defined name to the corresponding member name:
char * get_member_name( char const * directory, char const * defined_name );
In the binary device file subsystem (that is, eventually at least, in wgml), this function is invoked for each directory in the search path until it returns a value other than NULL. The value returned is the member name exactly as found in the directory file.
For gendev, it could be invoked with ".\" (or perhaps even ""), since gendev only looks in the current directory when checking the member name. It needs the member name because it objects if the defined name in the file it is processing is found but the member name returned does not match the member name in that file.
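The "invoke it for each directory until it returns something other than NULL" logic is simple enough to sketch. In the sketch below, the lookup function is passed in as a pointer with the same signature as get_member_name() so that the example does not depend on the real code; find_member_name(), fake_lookup(), and the names "syslib", "example" and "example.cop" are all invented for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Pointer type matching the signature of get_member_name(). */
typedef char * (*member_lookup)( char const * directory,
                                 char const * defined_name );

/* Tries each of the count directories in dirs, in order, and returns the
 * first non-NULL member name, or NULL if no directory file has an entry. */
char * find_member_name( member_lookup lookup, char const * const * dirs,
                         size_t count, char const * defined_name )
{
    size_t i;

    for( i = 0; i < count; i++ ) {
        char * member = lookup( dirs[i], defined_name );
        if( member != NULL ) {
            return( member );
        }
    }
    return( NULL );
}

/* Stand-in lookup, for illustration only: pretends that only the
 * directory "syslib" has an entry, and only for "example". */
static char * fake_lookup( char const * directory, char const * defined_name )
{
    static char member[] = "example.cop";

    if( strcmp( directory, "syslib" ) == 0
        && strcmp( defined_name, "example" ) == 0 ) {
        return( member );
    }
    return( NULL );
}
```

For gendev, the same function could be called with a one-entry list containing only the current directory.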
Also found in "copfiles.c" is this function, which parses the header of an alleged binary device file and returns an enum tag indicating what it actually is:
cop_file_type parse_header( FILE * in_file );
It is possible that gendev will use this function directly, since it would probably be prudent to ensure that the "wgmlst.cop" file in the current directory actually is a directory file before updating it.
Additional functions may or may not be needed for gendev. In particular, the function(s) which actually create a binary device file may be placed here or may be part of gendev itself. Similarly, gendev may or may not use the same structs to hold the data to be written (which it will obtain by parsing the :DEVICE block, :DRIVER block, or :FONT block); different structs may turn out to work better.
Relevant Items in the Existing wgml Code
The binary device file subsystem uses these items provided by wgml:
- mem_alloc( size_t size );
- mem_realloc( void * p, size_t size );
- mem_free( void * p );
- out_msg( char *fmt, ... );
As noted above, any of these may be renamed. This section and any existing code will be updated to match. Also, the descriptions below are minimal; for full details, see the source code.
The first two are #define macros:
PATH_SEP      // "/" for Linux, "\\" otherwise
INCLUDE_SEP   // ':' for Linux, ';' otherwise
These are the global variables used:
char * Pathes;    // content of PATH Envvar
char * GMLlibs;   // content of GMLLIB Envvar
char * GMLincs;   // content of GMLINC Envvar
These are the functions used:
void *mem_alloc( size_t size );
void *mem_realloc( void * p, size_t size );
void mem_free( void * p );
void out_msg( char *fmt, ... );
The memory allocation functions permit the use of an allocator other than malloc(), if that becomes desirable:
- mem_alloc() emits an error message and ends the program if a memory allocation fails.
- mem_realloc() emits an error message and ends the program if a memory reallocation fails.
- mem_free() simply calls free().
mem_free() must be used with memory allocated by mem_alloc() or reallocated by mem_realloc(), in case an allocator other than malloc() is in use.
mem_realloc() behaves like realloc() in this respect: if it returns a different pointer from the one it was given, the original block has already been released, so the calling code must use only the returned pointer and must not mem_free() the original block of memory.
mem_alloc() and mem_realloc() only return if the memory allocation works. If either fails, exit() is eventually invoked. This has an interesting effect on the code using them: functions whose only possible error was a failed memory allocation now only return if they succeed; if they fail, the path to exit() is taken. This expands the number of functions whose return value, if a pointer, need not be compared to NULL.
In addition, those functions which could fail either from a memory allocation error, or a file error, or a formatting error can now only fail for a file error or a formatting error -- and, since the binary device file formats consist of a series of required (if, in some situations, trivial) blocks, a file error or premature end-of-file can be regarded as a formatting error. So the functions in the binary device file subsystem which return a NULL pointer as an error indicator generally can be regarded as reporting a corrupted binary device file, avoiding the need to distinguish "file error" from "formatting error".
The function out_msg() provides a common point for outputting error messages, with a familiar signature. This has several advantages:
- If all error messages use it, then, when gendev and wgml are done, the embedded error messages can be easily located and organized.
- This function could be modified to write to a file either instead of or in addition to the screen, or perform other actions as desired.
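As a rough illustration of the second point, here is a sketch of a message function with a printf-like signature that writes to the screen and, optionally, to a file as well. The names sketch_out_msg() and sketch_set_msg_file() are hypothetical, as is the choice of stderr for the screen; the real out_msg() takes char *fmt and may behave differently.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical optional second destination; NULL means "screen only". */
static FILE *sketch_msg_file = NULL;

void sketch_set_msg_file( FILE *fp )
{
    sketch_msg_file = fp;
}

/* Every message passes through this one point. */
void sketch_out_msg( const char *fmt, ... )
{
    va_list args;

    assert( fmt != NULL );

    va_start( args, fmt );
    vfprintf( stderr, fmt, args );
    va_end( args );

    if( sketch_msg_file != NULL ) {
        /* the argument list must be restarted before it is used again */
        va_start( args, fmt );
        vfprintf( sketch_msg_file, fmt, args );
        va_end( args );
    }
}
```

Because callers only ever see the one signature, redirecting or duplicating the output later would not require touching any of the call sites.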
These functions were used in "copparse.c". They would also be useful in gendev:
char *GML_get_env( char *name );
void get_env_vars( void );
They use getenv_s() to get the three environment variables into character strings.
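The general idea of copying an environment variable into a string owned by the program can be sketched as follows. This is not the code in "copparse.c": the function name is hypothetical, and plain getenv() is swapped in for getenv_s() so that the sketch also compiles with libraries that do not provide the bounds-checking functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of the variable's value, or NULL if the variable
 * is not set (or the copy could not be allocated). The caller frees it. */
char *sketch_get_env( const char *name )
{
    const char  *value;
    char        *copy;
    size_t      len;

    assert( name != NULL );

    value = getenv( name );     /* getenv_s() in the real code */
    if( value == NULL ) {
        return( NULL );
    }
    len = strlen( value ) + 1;
    copy = malloc( len );
    if( copy != NULL ) {
        memcpy( copy, value, len );
    }
    return( copy );
}
```

Taking a copy matters because the string returned by getenv() belongs to the library and may be overwritten by a later call.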
In addition, wgml includes items related to searching for source files, parsing command lines and parsing GML/Script. These items will be of interest when gendev itself is being built.
Changes to the wgml Code
One result of these investigations was the merging of the file-finding code in the main part of wgml and that in the binary device library code into a single module, "findfile". This involved:
- Using the function initialize_directory_list(), and related items, to acquire and preprocess the three environment variables. This should save considerable time, as compared to doing this directory-by-directory each time a file is sought -- Open Watcom documents typically run from 50 to 200 files, and at least one has 802 of them.
- Modifying the function search_file_in_dirs() to use the directory_list instances instead of the environment variables.
- Suppressing any path information given as part of the filenames, since the rules appear to be complicated, Open Watcom does not use paths as part of its filenames (so far as is known), and this is completely undocumented.
- Using a fourth directory_list instance to point to the current directory, allowing the current directory to be handled like any other.
- Producing the various file names before entering the loop and passing them to try_open() sequentially as parameters.
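The preprocessing step in the first item can be pictured with a small sketch: the value of an environment variable is split into its directories once, so that each later file search only walks an array. The struct and function names below are hypothetical and much simpler than the directory_list code in "findfile"; the fixed array size is likewise just an assumption for the sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_DIRS 32      /* arbitrary limit for this sketch */

typedef struct {
    size_t  count;
    char    *dirs[SKETCH_MAX_DIRS];
} sketch_dir_list;

/* Splits value at each sep character, skipping empty entries.
 * Returns the number of directories stored in list. */
size_t sketch_init_dir_list( sketch_dir_list *list, const char *value,
                             char sep )
{
    const char  *start = value;
    const char  *p;

    assert( list != NULL );
    list->count = 0;
    if( value == NULL ) {
        return( 0 );
    }
    for( p = value; ; p++ ) {
        if( *p == sep || *p == '\0' ) {
            size_t len = (size_t)( p - start );

            if( len > 0 && list->count < SKETCH_MAX_DIRS ) {
                char *dir = malloc( len + 1 );

                if( dir == NULL ) {
                    break;
                }
                memcpy( dir, start, len );
                dir[len] = '\0';
                list->dirs[list->count++] = dir;
            }
            if( *p == '\0' ) {
                break;
            }
            start = p + 1;
        }
    }
    return( list->count );
}

void sketch_free_dir_list( sketch_dir_list *list )
{
    assert( list != NULL );
    while( list->count > 0 ) {
        free( list->dirs[--list->count] );
    }
}
```

With one such list per environment variable, plus one for the current directory, a search function can treat all four in the same loop.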
It turned out that the function get_cop_file() was not as good a fit as I had hoped, and its sections stand out rather plainly in the code. Still, search_file_in_dirs() is now used for all files that need to be opened for reading.
Relevant Items in the watcom Project
The watcom project (ow\bld\watcom) is listed in file ow\projects.txt as "shared source code". It is quite large and no documentation of its contents was found, so this list may not be exhaustive. Also, I cannot say that I understand some of the more specialized bits (debugging formats, object-file formats, machine descriptions, code-generator stuff, dpmi stuff, dos-extender stuff, resource editor stuff, others I cannot even identify) well enough to know what they do. Overall, the bulk of this project appears to be concerned with common information needed by both the compiler and the code generator or the debugger.
Note on style: the watcom project is in a different style from that expected to be used for wgml and gendev. However, calling a function from another project should not be avoided for stylistic reasons.
These #define macros are very interesting:
#define DIR_SEP_CHAR // '/' for Unix, '\\' otherwise
#define DIR_SEP_STR // "/" for Unix, "\\" otherwise
#define PATH_SEP_CHAR // ':' for Unix, ';' otherwise
#define PATH_SEP_STR // ":" for Unix, ";" otherwise
#define _WGML_VERSION_ BAN_VER_STR // current version, e.g., "1.8 Limited Availability"
Note that _WGML_VERSION_ is in a group of macros of which it is said that "these can be what ever they want to be", so we could change it if something else was needed. There is, at present, no such macro for gendev, but that can be remedied if necessary or the existing wgml macro can be used for both.
These #define macros are similar to those given earlier, but are they what we want?
#define SYS_OPTION_CHAR // '-' for Unix, '/' otherwise
#define SYS_OPTION_STR // "-" for Unix, "/" otherwise
#define SYS_DIR_SEP_CHAR // '/' for Unix, '\\' otherwise
#define SYS_DIR_SEP_STR // "/" for Unix, "\\" otherwise
#define SYS_PATH_DELIM_CHAR // ':' for Unix, ';' otherwise
#define SYS_PATH_DELIM_STR // ":" for Unix, ";" otherwise
For our purposes, perhaps adding something like
#define OPTION_CHAR // '\\' for Unix, '/' otherwise
#define OPTION_STR // "\\" for Unix, "/" otherwise
would work better, together with using '-' or "-" as literals.
The header banner.h contains many macros which are used to construct the banners for Open Watcom components, and wgml uses some of them.
These functions look promising:
char *_getFilenameFullPath( char *buff, char const *name, size_t max );
int watcom_setup_env( void );
int BuildQuotedFName( char *buffer, int bufferlen, const char *path, const char *filename, const char *quote_char );
int UnquoteFName( char *dst, int maxlen, const char *src );
char *FindNextWS( char *str );
void *SortLinkedList( void *list, unsigned next_offset, int (*compare)(void*,void*), void *(*allocrtn)(unsigned), void (*freertn)(void*) );
unsigned char _dos_switch_char( void );
Here are some notes on these functions:
- _getFilenameFullPath() (in autodept.c) looks very similar to the wgml function getFilenameFullPath(); the wgml code was based on an existing project (as was the research code, although not necessarily the same one); perhaps that project was the source of the watcom function.
- watcom_setup_env() (in autoenv.c) is documented as setting the WATCOM environment variable if not already set and updating the PATH environment variable. The latter turns out to involve setting things like the HELP and BOOKSHELF environment variables; BEGINLIBPATH is only affected for newer versions of OS/2 (__OS220__ must be defined or nothing happens). This is probably not relevant to wgml/gendev.
- BuildQuotedFName() (in cmdlhelp.c) combines parameters path and filename, putting quotes around the result if either path or filename contains a space. It is not clear when or whether we would need this.
- UnquoteFName() (in cmdlhelp.c) removes quotes (") from, well, the comments say "filenames" but it should work with any character string. My research code uses this function.
- FindNextWS() (in cmdlhelp.c) finds the next whitespace character "allowing doublequotes to be used to specify strings with white spaces". I think the point is that whitespace inside text enclosed in " characters is ignored, but the comments aren't at all clear.
- SortLinkedList() (in sortlist.c) is one of several functions dealing with linked lists. If wgml/gendev use linked lists, these might be worth examining.
- _dos_switch_char() (in swchar.c) returns '-' for Unix, '/' for everything else except DOS, and the result of _DOS_Switch_Char(), which is in assembly, for DOS. My research code uses it, but it may or may not be what we want to use with wgml/gendev.
Notably missing is a function to skip whitespace and so find the start of the next token. When I did cfcheck, I found four or five such functions in the Open Watcom repository and based my own on one of them.
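For what it is worth, the pair of operations discussed here (skip leading whitespace, then find where the token ends while treating quoted text as a unit) can be sketched in a few lines. Both function names are hypothetical, and the quote handling reflects only my reading of the FindNextWS() comment, not a check of cmdlhelp.c.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns a pointer to the first non-whitespace character in str,
 * or to the terminating '\0' if there is none. */
char *sketch_skip_ws( char *str )
{
    assert( str != NULL );
    while( *str != '\0' && isspace( (unsigned char)*str ) ) {
        str++;
    }
    return( str );
}

/* Returns a pointer to the first whitespace character that is not
 * inside double quotes, or to the terminating '\0' if there is none. */
char *sketch_find_token_end( char *str )
{
    int in_quotes = 0;

    assert( str != NULL );
    for( ; *str != '\0'; str++ ) {
        if( *str == '"' ) {
            in_quotes = !in_quotes;
        } else if( !in_quotes && isspace( (unsigned char)*str ) ) {
            break;
        }
    }
    return( str );
}
```

Calling the two alternately walks a command line one token at a time, with a quoted file name such as "my file.gml" kept as a single token.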
The header symtab.h might be useful when working with device functions in gendev or macro definitions in wgml. Or it might be overkill.
The header utf8supp.h might or might not be useful, depending on whether or not the various source files are treated as Unicode encoded with UTF8.
The header watcom.h contains Endian-swappers that might someday be useful, if Endian ever becomes an active issue.
The header idedll.h is also very interesting: the description states that it defines the "[i]nterface for DLLs pluggable into the Watcom IDE (and wmake)".
Notes on Output
Someday, this section may have a description of the various functions and modules created to handle the output of wgml. Or maybe not, since a lot of the details are folded into the sections discussing the related concepts. At any rate, it will be used for notes which have no specific section to reside in.
Validating Record Types
The file outbuff.c contains code which validates the record type. A record type can be prepended to a file name (and is required with :BINCLUDE when it is given a text file). These file names can be given on the command line or as an attribute value. The :DRIVER block contains the default record type for the output device it is to be used with. The current location is less than ideal because it only validates record types for output devices or their image files. The ideal would be to move this into a global function that is called whenever a record type is to be used (by wgml) and, perhaps, for the :DRIVER block attribute, by gendev to ensure that the default given in the :DRIVER block is usable.
Now that both cfparse.exe and copparse.exe can display the compiled device functions in the same form as the source that gendev compiled to create them, the source for a binary device file can be reconstructed, subject to these limitations:
- the comments are unrecoverable as they do not appear in the binary format;
- multiple :INTRANS, :OUTTRANS, and :WIDTH blocks appear in the binary file as single blocks, so the reconstruction will only show (at most) one block of each type;
- included blocks in the :DRIVER block have no prescribed order (other than having to appear after all of the attributes) and so their order in the original source file cannot be completely reconstructed (the :LINEPROC blocks are ordered within each :FONTSTYLE block and the :LINEPROC block sub-blocks are ordered within the :LINEPROC, so some reconstruction is possible);
- some compiled %image() functions were actually compiled from those function sequences involving literal parameters which are discussed here; these are reported as a mixture of %binary() and %image() functions but the full complexity of the source cannot be recovered.
The reconstructed source file will, however, produce the same binary file when processed by gendev as the file that was analysed.
Device Functions Not Implemented
Device function %binary2() is only used with the literal argument "0", and the result appears to wgml as an instance of device function %image(). Device function %binary4() is never used.
These functions, currently, output their argument in little-endian order. This might or might not cause problems on big-endian computers; the issue is obscure at best. And these functions are not needed:
will insert 'FFEE' into the document. This technique can be used whenever it is necessary to insert the bytes of a two-byte value into the document in a particular order.
Thus, device functions %binary2() and %binary4() are not implemented.
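Should the byte-order question ever come up for some other multi-byte output, one way to take the host computer out of the picture is to build the bytes with shifts instead of copying the value from memory. The following is only a sketch with hypothetical function names; it does not describe what wgml does.

```c
#include <assert.h>
#include <stddef.h>

/* Stores the low 16 bits of value, least significant byte first.
 * The result is the same on little-endian and big-endian hosts. */
void sketch_put_le16( unsigned char *out, unsigned int value )
{
    assert( out != NULL );
    out[0] = (unsigned char)( value & 0xFFu );
    out[1] = (unsigned char)( ( value >> 8 ) & 0xFFu );
}

/* Stores the low 16 bits of value, most significant byte first. */
void sketch_put_be16( unsigned char *out, unsigned int value )
{
    assert( out != NULL );
    out[0] = (unsigned char)( ( value >> 8 ) & 0xFFu );
    out[1] = (unsigned char)( value & 0xFFu );
}
```

Since the shifts operate on the value and not on its representation in memory, no Endian-swapper is needed at all with this approach.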
Device Function and Function Block Activity Patterns
This summarizes when certain device functions and function blocks are and are not active.
- All device functions except
%clear3270() %clearPC() %endif() %ifeqn() %ifeqs() %ifnen() %ifnes() %image() %recordbreak() %setsymbol() %sleep() %text() %wait()
are inactive during the START :PAUSE block.
- The remaining type I device functions will be ignored in the remaining :PAUSE and all :FONTPAUSE blocks. The type II functions are all active in these blocks.
- Except for these device functions:
%flushpage() %textpass() %ulineon() %ulineoff()
all device functions are active in all blocks in a :DRIVER block.
- Device functions %textpass(), %ulineon(), and %ulineoff() are restricted to specific :LINEPROC sub-blocks.
- Device function %flushpage() is not active until the start of the initial vertical positioning. Alternatively, it is not active during the first line pass, but is active on the final line pass.
- Certain blocks produce output only on the first line pass, as detailed here.
- The remaining blocks produce output only on the final line pass.
- It remains to be determined whether any blocks are interpreted (as opposed to producing output) at any other time.
This section discusses the code in ow\bld\wgml\research.
As is always the case, the code itself is the best documentation.
The research code was done first and so retains some fairly primitive characteristics. The executables that link with production code use the conventions established by that code. The executables that are entirely built from code within research are slowly being migrated to these conventions.
File common.h declares a global variable, three functions which research programs are expected to define and use, and two functions intended for general use which are implemented in file common.c. More recently, various items needed to emulate parts of wgml have been added for use by those research programs which link to code written for wgml.
Files lhdirect.h and lhdirect.c allow the cfcheck research program to compile and link for all four targets even though it uses direct.h, which is not available in Linux. Note that full Linux functionality will depend on Linux users implementing the Linux version of the common function.
File research.h declares a global variable and four functions intended for use in the research programs which are implemented in file research.c.
File cfheader.h declares, and file cfheader.c implements, the function parse_header(), which verifies that a supposed binary device file is, in fact, a binary device file.
File cfdir.h declares two structs and six functions used in parsing directory files which are implemented in file cfdir.c.
File cfdev.h declares struct cop_device and its sub-structs and two functions, one of which verifies that the file does, in fact, encode a :DEVICE block and the other of which returns a structure containing the data from the encoded :DEVICE block. The functions are implemented in file cfdev.c, along with several local functions.
File cfdrv.h declares struct cop_driver and its sub-structs and two functions, one of which verifies that the file does, in fact, encode a :DRIVER block and the other of which returns a structure containing the data from the encoded :DRIVER block. The functions are implemented in file cfdrv.c, along with several local functions.
File cffon.h declares struct cop_font and its sub-structs and two functions, one of which verifies that the file does, in fact, encode a :FONT block and the other of which returns a structure containing the data from the encoded :FONT block. The functions are implemented in file cffon.c, along with several local functions.
File cffunc.h declares three structs and three functions which extract P-buffers and parse them into either a Variant A FunctionsBlock or into an array of CodeBlocks; the functions are implemented in file cffunc.c.
File cftrans.h declares three structs used for parsing encoded :INTRANS and :OUTTRANS blocks. These were separated out because they are needed by both cfdev.h and cffon.h.
Files cfcheck.h, cfcheck.c, cfcutils.c, and cfcusage.h are all specific to the research program cfcheck. This program was used to verify that each binary file's size is an exact multiple of 16, that directory files use only three designators, and that every binary file except the directory file encodes either a :DEVICE block, a :DRIVER block, or a :FONT block.
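The first of those checks is simple enough to sketch using only C Standard functions. The function name is hypothetical and this is not the cfcheck code; note also that the C Standard does not strictly require binary streams to support SEEK_END, although the common implementations do.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if the size of the open file is a multiple of 16,
 * 0 if it is not, and -1 if the size could not be determined.
 * The file position is reset to the start before returning. */
int sketch_size_is_multiple_of_16( FILE *fp )
{
    long size;

    assert( fp != NULL );

    if( fseek( fp, 0L, SEEK_END ) != 0 ) {
        return( -1 );
    }
    size = ftell( fp );
    if( size < 0 ) {
        return( -1 );
    }
    rewind( fp );
    return( ( size % 16 ) == 0 );
}
```

The file would be opened in binary mode ("rb") before the call, so that the size reported is the actual byte count.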
Files cfparse.h, cfparse.c, cfputils.c, and cfpusage.h are all specific to the current version of the research program cfparse.exe. This program was used to adjust the model of the binary device file format in the Wiki until the code based on that model parsed each and every known binary device file (except for PCGRDRV.COP, as discussed in Multiple CodeBlocks). It can currently display the structure of every binary device file (except PCGRDRV.COP), including the device functions encoded within that structure. This program could also serve as the starting point for a utility that reads a version 3.33 or version 4 binary file and generates source code which, when processed by gendev 4.1, produces a binary file with the same functionality as the original.
File copparse.c contains the main() and other research functions for copparse.exe, which is generally similar to cfparse.exe with these differences:
- the command-line parameter is a "defined name", not a file name;
- each invocation only processes one binary device file; and
- it uses the code in the binary device file subsystem for parsing the binary device files and was, in fact, written to test that subsystem.
File dfinterp.h defines and file dfinterp.c implements the function interpret_function(), which, when given a pointer to the start of a CodeBlock.function field, parses it and prints out the device functions/literals it finds. File dfinterp.c also includes some items that may well evolve into globals in wgml and a gazillion local functions, most of them implementing a specific device function (by printing out the function name, generally). This is an obvious base for building the wgml device function parser(s).
File findfunc.c defines several functions which implement findfunc.exe. Aspects of these function definitions will probably apply to or even be used in gendev, but further development when gendev is written will probably be needed.
Files heapchk.h/heapchk.c provide some simple heap checking functions.
File devldchk.c produces devldchk.exe, which loads the device library using the code written for wgml and then prints out the device name, the driver name, and the data from the available fonts. This program could be used as the base for creating a utility that would analyze an existing version 4 device library and report which driver and font files are related to which device files. It could also verify that each of the binary files implied by those relationships is actually present on the disk, and identify any that should be present but which are not.
As noted above, this is my first C project. I intend to use this section to record lessons learned, especially if I suspect it may be a while before I will be applying them.
Use of these functions was recommended and it has, indeed, been helpful. But it is not a panacea: while it is certainly better to get a run-time error message with a hint as to what the problem might be than to allow the program to crash and burn later, this is not something that should happen while wgml or gendev is actually being used. So, while wgml and gendev will use these functions, run-time checks will also be made, with messages to the user as appropriate.
In particular: _MAX_PATH will be used to enforce the maximum file path length whenever a file name is being worked with, including the relevant command-line parameters and the paths found in the environment variables. This check will be used especially when creating a file name from component parts. And, since file names and paths generally are provided by the user, messages will be emitted to advise the user of the problem giving the specific path and/or file name involved.
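A minimal sketch of such a check, applied when a file name is created from component parts, might look like this. SKETCH_MAX_PATH is a stand-in for _MAX_PATH (the value 260 is an assumption made only so the sketch compiles anywhere), the function name is hypothetical, and the message goes to stderr where the real code would use out_msg().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_PATH 260     /* stand-in for _MAX_PATH */

/* Joins dir, sep, and name into buf, which holds size bytes.
 * Returns 1 on success. If the result would not fit, tells the user
 * which path was involved, leaves buf empty, and returns 0. */
int sketch_build_path( char *buf, size_t size, const char *dir,
                       const char *sep, const char *name )
{
    size_t needed;

    assert( buf != NULL && size > 0 );
    assert( dir != NULL && sep != NULL && name != NULL );

    needed = strlen( dir ) + strlen( sep ) + strlen( name );
    if( needed >= size ) {      /* no room for the terminating '\0' */
        fprintf( stderr, "path too long (limit %lu characters): %s%s%s\n",
                 (unsigned long)( size - 1 ), dir, sep, name );
        buf[0] = '\0';
        return( 0 );
    }
    strcpy( buf, dir );
    strcat( buf, sep );
    strcat( buf, name );
    return( 1 );
}
```

Checking the total length before copying anything means the buffer is never overrun and the user sees the complete offending path in the message.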
Which Functions to Use
Whenever possible, C Standard functions will be used in preference to other variants. This lesson was learned with regard to file manipulation functions, but will be applied generally.
Functions complying with other standards will be used if no C Standard function exists and those functions are available in all targets, including Linux.
Functions specific to Watcom and available for all targets, including Linux, will be used if no standard function exists.
The attempt to smooth out header file discrepancies between Linux and the other targets represented by lhdirect.h/.c will continue as needed. All my code will compile and link for all four targets. Whether it works when executed under Linux is something I cannot verify and so will not concern myself with.