Tabs and Tabbing
This page discusses tabbing, a topic not covered in the WGML Reference. The Reference does contain some error messages that refer to tabs and tabbing, and those messages are used with control words such as TB, but the actual use of tab stops is not described.
There are three ways to introduce a character into a document which wgml 4.0 will treat as indicating a tab:
- By pressing the "Tab" key on the keyboard, which inserts the byte value "0x09".
- By using input translation to map some other character to the byte value "0x09".
- By using control word DC TB or control word TB to designate a particular character as a tab character.
In wgml 4.0, not all tab characters are treated the same way. In terms of their effect, there are two types of tabbing:
- Keyboard tabbing, which results from pressing the "Tab" key.
- wgml tabbing, which results from input translation or the use of a character designated by control word DC TB or control word TB.
It is interesting to note that input translation produces a tab byte that is identical in value to that produced by the keyboard, but which produces the same effect as that produced by a character designated by control word DC TB or control word TB.
It turns out that the difference between keyboard tabbing and wgml tabbing is essentially the difference between input tabbing and output tabbing.
Comparing devices PS and TASA shows no apparent difference in how keyboard tabs are treated, although the output from PS makes it clear that the number of characters is multiplied by the space character width of the current font to produce the horizontal position required. This, however, is nothing new.
Keyboard tabbing, in wgml 4.0, appears to be treated as tabbing within the input text:
- The first input paragraph begins with the first input record which contains text.
- The current input paragraph ends before and a new input paragraph starts after a break.
- The position of the first tab stop in an input paragraph is 8 characters from the start of the input record in which that paragraph begins.
- Tab stops continue throughout the entire paragraph, that is, all input text before the next break.
- The tab stops are 8 characters apart.
This produces some very strange effects in the output text, particularly when "script" is in effect. Thus, for example, if "@" is a tab character, then
:P. This is a@test.
This is a test.
:P.This is a@test.
This is a test.
The first example has the tab stop 16 characters from the left margin, the second does not.
Using the TASA device to explore this, a table can be constructed relating the value of the indent specified for :P. in the :LAYOUT and the offset of the tab in the second example from its position in the first:
Indent Value    Offset of First Tab (Columns)
0 or 0.0i       -3
0.1i            -2
0.2i            -1
0.3i             0
0.4i            +1
0.5i            +2
0.6i            +3
0.7i            +4
0.8i            +5
0.9i            +6
1.0i            +7
The most reasonable explanation of this is that wgml 4.0 is expanding keyboard tabs so early in processing that it is counting ":P." as part of the text to which they apply, producing three fewer space characters than it should. When the ":P." disappears in processing, the three missing space characters are still missing, and the tab position is shifted.
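The arithmetic behind this explanation can be sketched in a few lines. This is only an illustration under stated assumptions, not a description of wgml 4.0's code: the function names are hypothetical, the tab stops are taken to fall at every multiple of 8 characters counted from the start of the input record, and only a single input record is considered (the continuation of tab stops across the records of an input paragraph is not modelled).

```python
TAB_INTERVAL = 8  # keyboard tab stops are 8 characters apart, per the rules above


def expand_keyboard_tabs(record):
    """Expand each 0x09 to spaces up to the next multiple of TAB_INTERVAL,
    counting from the start of the input record (assumed model)."""
    return record.expandtabs(TAB_INTERVAL)


def spaces_for_first_tab(record):
    """Number of spaces the first keyboard tab in the record expands to."""
    before = record.index("\t")          # characters preceding the tab
    return TAB_INTERVAL - (before % TAB_INTERVAL)


with_tag = ":P.This is a\ttest."      # ":P." is counted as if it were text
without_tag = "This is a\ttest."      # what the count "should" have been

print(spaces_for_first_tab(with_tag))     # 4
print(spaces_for_first_tab(without_tag))  # 7
```

Under these assumptions the tagged record receives three fewer spaces than the untagged text would, which matches the offset of -3 observed for an indent of 0.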
That the keyboard tab positions continue throughout an input paragraph (which may be spread over many input records) is shown by the output produced when the first output line ends and the next output line in the same paragraph begins: as is the normal case, any spaces at the end of the first line are not emitted, and spaces that would appear at the start of the next line are replaced by the left margin (as affected by any relevant control words). The tab stops then appear at the left margin and every 8 characters thereafter; since the output lines are determined only after the various control words, tags, symbols, and functions have been processed, it is not possible for the end of the output lines to be known at the time when the tab stops are being expanded. The most likely explanation for this regularity is that the keyboard tab stops are applied continuously throughout the input paragraph.
Experiments with input translation show a similar result: when "script" is in effect, an offset of -1 is produced because the character used to mark other characters for input translation is removed from the output but is present when the keyboard tabs are expanded.
When "wscript" or "noscript" is in effect, keyboard tabs are still expanded as they are with "script", but the effect is hidden because each word which is followed by at least one space in the input is followed by only one space (unless it ends with one of ":", ".", "!", or "?", in which case it is followed by two spaces) in the output: the net effect is to treat the tab as a single space character. This might suggest a very simple implementation for our wgml, since only "wscript" is needed: replace each keyboard tab in the input with a space character. Unfortunately, when used with input translation, the output produced by a keyboard tab does vary depending on the tabbing behavior described above.
To illustrate this, consider this admittedly-unlikely bit of content ("~" is used for the input translation indicator):
This is a~@test.
produces this with "script"
This is a test.
and this with "wscript" or "noscript"
This is a test.
while the non-input-translated version produced
This is a test.
What is happening (using "|" for the space character) is:
is expanded to, in this case,
but "~|" is treated as a token; indeed, use of a test device shows that it appears in its own text_chars instance if it does not occur at the beginning, at the end, or within the text controlled by a text_chars instance. When the text_chars instance which controls it is input-translated, "~|" becomes "|", that is, a single space character (unless an actual input translation has been defined for the space character). When "wscript" or "noscript" is in effect, two space characters are seen: one from a text_chars instance, and one from the other spaces resulting from the expansion of the keyboard tab.
Unless there are no other space characters: from time to time, a keyboard tab will expand to only one space character. In that case, the input-translated space character will occur at the start or within a text_chars instance, and only one space character will appear in the output. This, however, depends on the position of the keyboard tab with regard to the start of an input buffer, and so, even when "wscript" or "noscript" is in effect, an accurate reproduction of wgml 4.0's output in general would require an expansion of keyboard tabs very early in input processing.
Of course, our wgml does not, at least initially, need to reproduce wgml 4.0's output in general; it only needs to reproduce its output when processing the Open Watcom documentation. So, this strategy is suggested:
- For now, do nothing with keyboard tabs.
- If comparing our wgml's output of the Open Watcom documentation with wgml 4.0's shows diffs caused by keyboard tabs, then the simplest processing that produces the proper result will need to be implemented.
The Open Watcom document build system does use keyboard tabs; it uses them to format the parameters to the FONT option in various option files. However, that use has already been accommodated in the command option tokenizer. It is not clear that keyboard tabs are used in the document specifications themselves.
If doing nothing works, then our wgml will treat keyboard tabs as if they were wgml tabs. This appears to represent a return to the behavior documented in Waterloo SCRIPT, under the title "Fill-String Processing", which starts by stating:
Tab Characters present in text input lines are expanded by SCRIPT into one or more "fill" characters (blanks, if not specified) to the next defined tab-stop position on the output line.
This is definitely not happening in wgml 4.0: the "next defined tab-stop position on the output line" is, in fact, being ignored.
It must, however, be admitted that the final example in the discussion of .tr in Waterloo SCRIPT contains this note:
.TR 05 40 This is an unsuccessful attempt to remove all TAB characters from the input and replace them with blanks. It will fail because TAB characters are expanded during input processing, not on output.
And that summarizes the problem neatly: the keyboard tabs are expanded during input processing, and yet they are somehow intended to use the output tab stops, which cannot be located until the text fitting on the output line containing the keyboard tab has been identified. It is, then, possible that the effects shown above existed in Script itself.
This section, then, may turn out to be of use only in updating the WGML Reference.
While the WGML Reference says nothing about tabs in the context of processing document specifications into documents, the document Waterloo SCRIPT has a great deal to say about them. However, Waterloo SCRIPT describes an older version of Script, and so is not entirely accurate. For example, it treats 0x05 as the keyboard tab character (and also as the default user-defined tab character); this was, perhaps, true for EBCDIC and/or the mainframes and terminals targeted by this version of Script, but for ASCII-based computers running, say, Windows or OS/2 or Linux or even DOS, the keyboard tab is 0x09, and the default user-defined tab character is 0x09 as well. That this is the case can be seen by looking at the value of the system variables &systb and &systab, which both contain 0x09.
The effects of wgml tabbing on the output are the same whether "script", "wscript" or "noscript" is in effect. Of course, to be used with "noscript" the device must define an INTRANS block mapping some character to 0x09 and the document LAYOUT must define an input translation prefix character, since control words (such as TI or TB) will be treated as text and so have no effect.
A distinction is drawn between a single horizontal base unit and a single column. For a typical character device there is no difference between the two; but, for the PS device, the difference is vital. This affects several of the formulas presented.
As it happens, this topic is a bit complicated, so the rest of the discussion is grouped into sub-topics.
The Default Tab Stops
This section explores the tab stops in effect when wgml 4.0 starts up. These are also the tab stops placed in effect when control word TB is used without parameters or when control word TP (which is not used by the Open Watcom documents) is used to remove all of the tabs previously defined. Thus, they are the default tab stops.
At first, the situation appeared clear: the input lines
.ju off
.ti > 09
.tb set >
.br
~>1~>2~>3~>4~>5~>6~>7~>8~>9~>a~>b~>c~>d~>e~>f~>g~>h~>i~>j~>k~>l~>m~>n~>o~>p~>q
.br
>1>2>3>4>5>6>7>8>9>a>b>c>d>e>f>g>h>i>j>k>l>m>n>o>p>q
when processed for the TASA device (with "script" or "wscript", but not "noscript") and a one inch left margin produced the output lines
1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q
which appeared to show that the tabs extended to the full width of the page (TASA has a page width of 132 characters -- all letters through "o" are on the same line, although you may need to use a scroll bar to see all of them) and then were wrapped to the next line (hence the " p q" lines). However, when processed for the PS device, the same lines produced, in effect, these output lines (reconstructed to match the text displayed by GhostView):
1 2 3 4 5 6 7 8 9 a b c d e 1 2 3 4 5 6 7 8 9 a b c d e
Testing with a test device configured with the same metrics as TASA and examination of the .ps file as text revealed the actual situation:
For this rather unlikely sort of input, all tab stops are on the same line, which, so far as wgml 4.0 is concerned, continues on without limit. However, the device definition provides a limit for the length of an output record, which is flushed when the record is full. Devices like TASA respond to this limit by printing a new line; but PS text is output in such a way that lines of any length can be specified: the only limit in PS is that the line is clipped at the right edge of the page and so does not wrap.
This testing also showed that the five characters used with TASA became 500 horizontal base units when PS was the device (except for the first, discussed below): the default tab stops, in other words, are every half-inch across the page, with no known limit.
Experimenting with the LAYOUT settings and the values of the PAGESTART and PAGEOFFSET block attribute x_start in the test devices shows that the first tab stop is measured from the left margin -- and that the left margin is computed as discussed here.
It is possible that other control words or tags affect this; however, experimenting with tag P and an indent of one inch showed clearly that the first tab stop is not only at the same position as it otherwise would be but that text positioned there will appear there, even though the indent would put the start of the first line of the paragraph further to the right. Further preliminary research showed that these control words affect wgml tab stop positions, possibly by affecting the left margin:
this one affects the initial tab stop position only:
and these do not affect the tab stop positions at all, although they do affect which tab stop is used:
HI IN UN
Control word IN is the only one used in the Open Watcom documents. It is, however, clear that the wgml tab stops should be stored with values based on a left margin of "0", and then adjusted to the actual margin since that value can change during document processing.
One result of this is that the statement in the document Waterloo SCRIPT in the discussions of TB and TP that
At the start of SCRIPT processing, tab-stop positions are initially defined as 6, 11, 16, ..., and 81.
is no longer correct or, more exactly, is only correct for character devices, and then only if the value of attribute x_start in the PAGESTART block is "0" and there are only 16 tab stops in use.
It is time to consider just what a tab stop is.
For a character device, a tab stop is the column in which the first character following the tab character will appear. Thus, the first tab stop is given above as "6" because the text following that tab stop is preceded by five space characters and so appears in column six. In the same way, when the normal one inch margin is in effect, ten spaces are emitted (or equivalent, using horizontal tabbing, when available) and the text begins in column 11.
When the PS device is considered, the first tab stop is 599 horizontal base units from the left margin; a one-inch left margin, in contrast, is 1000 horizontal base units from the left edge of the paper. The ABSOLUTEADDRESS block is implemented simply with the PostScript "moveto" operator; research shows that this operator positions the "current point" at the position indicated, at least for drawing lines. Text is displayed using the PostScript "show" operator, which is said to start at the "current point".
Very early tests established that, if everything else is identical, wgml 4.0 generates the same positions for test devices that use NEWLINE and those that use ABSOLUTEADDRESS. Thus, wgml 4.0 expects PS to work the same way TASA does: it expects that the first character after a one inch left margin will start 1001 horizontal base units from the left edge of the paper, and that the first character at the first tab stop will start 600 horizontal base units from the left margin. This expectation is not met by the PS device.
Of course, a difference of a single horizontal base unit when there are 1000 horizontal base units per inch is a bit hard to see: it makes no practical difference whether the "moveto" operator results in the first character being drawn starting at 1000 or 1001 for a one inch left margin, or the first character starting at 599 or 600 for the first tab stop. In terms of how wgml 4.0 treats tab stops, however, it makes sense to treat the text as starting 1 horizontal base unit past the "current point" because it leads to a formula for computing the location of the first tab stop that depends, certainly, on the device characteristics, but not on which device is being targeted.
Testing with various values of the command-line option "CPI" and with various values for the attribute horizontal_base_units of the DEVICE block produces these results for wgml 4.0:
- The CPI setting has no effect whatsoever: when constructing the first tab stop and computing the distance between them, wgml 4.0 uses 10 CPI only.
- The value of the attribute horizontal_base_units of the DEVICE block does have an effect on both the position of the first tab stop and the space used for each tab stop.
The formula for the space used for the first tab stop is:
first_tab_stop_space = (6 * ((horizontal_base_units/in) / 10cpi )) - 1
where the division is integer division. Note that this is 6 columns, where a "column" is one-tenth of the value of the attribute horizontal_base_units of the DEVICE block, minus 1.
The formula for the spacing between tabs is:
space_between_tabs = (5 * ((horizontal_base_units/in) / 10cpi ))
where the division is integer division. Note that this is 5 columns, where a "column" is one-tenth of the value of the attribute horizontal_base_units of the DEVICE block.
This table summarizes the situation. All values were observed during testing:
hbu/in    first stop    interstop distance
10        6             5
11-19     6             5
20        11            10
1000      599           500
1500      899           750
2000      1199          1000
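As a cross-check, the two formulas can be written out directly. This is a minimal sketch, assuming nothing beyond the formulas as stated; the function name is hypothetical and is not taken from wgml. Note that for 10 hbu/in the first formula yields 5, the number of spaces emitted, whereas the table lists 6, which appears to be the resulting column as described earlier for character devices.

```python
def default_tab_metrics(hbu_per_inch):
    """Return (first_tab_stop_space, space_between_tabs) in horizontal
    base units, following the two formulas above. Always uses 10 CPI,
    since the CPI option was observed to have no effect."""
    column = hbu_per_inch // 10          # integer division, as stated
    first_tab_stop_space = 6 * column - 1
    space_between_tabs = 5 * column
    return first_tab_stop_space, space_between_tabs


for hbu in (20, 1000, 1500, 2000):
    print(hbu, default_tab_metrics(hbu))
```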
Default Tabs and Output Lines
This section explores how tabs work when used inside text lines. This should be a bit more realistic than the examples used above.
With a left margin of one inch, a paragraph indent of 0.5 inch, 10 horizontal base units per inch, and the default tab stops, this example:
.tb set >
:P.This>illustrates some>embedded tabs.
produces this output:
This illustrates some embedded tabs.
which illustrates these points:
- The paragraph indent is in full effect: the text starts 15 spaces into the line.
- The first wgml tab starts the text following it ("illustrates") on column 21.
- The second wgml tab starts the text following it ("embedded") on column 41.
This is, of course, what we would expect: text following a tab character is placed at the next available tab stop, and the default tab stops are still in use. However, as shown below, this is not what is actually happening here: instead, the user-defined tabs are first exhausted and then the default tabs are applied based on the end position of the last text output.
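The columns reported for this example can be reproduced with the simple "next stop to the right" model. The sketch below is an assumption-laden illustration only: it models a fixed-width character device with 1-based columns, places the default stops at columns 6, 11, 16, ... relative to the left margin, and uses hypothetical names (default_stop_columns, layout). It says nothing about how wgml 4.0 arrives at the same positions internally.

```python
def default_stop_columns(left_margin_cols, count=40):
    """Absolute columns of the default stops: 6, 11, 16, ... past the margin.
    The count is an arbitrary bound for this sketch; no real limit is known."""
    return [left_margin_cols + 6 + 5 * i for i in range(count)]


def layout(text, start_col, left_margin_cols, tab_char=">"):
    """Return (column, text) pairs for each run of text between wgml tabs."""
    stops = default_stop_columns(left_margin_cols)
    placed = []
    col = start_col                      # column where the next character prints
    for i, piece in enumerate(text.split(tab_char)):
        if i > 0:                        # a tab precedes this piece
            col = next(s for s in stops if s > col)
        if piece:
            placed.append((col, piece))
            col += len(piece)
    return placed


# One inch margin (10 columns) plus 0.5 inch indent: text starts in column 16.
for col, piece in layout("This>illustrates some>embedded tabs.", 16, 10):
    print(col, piece)
```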
If the example is extended to:
.tb set >
:P.This>illustrates some>embedded tabs.>1>2>3>4>5>6>7>8>9>a>b>c>d>e>f>g>h>i>j>k>l>m>n>o>p>q
then the output becomes:
This illustrates some embedded tabs. 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q
where testing with the PS device shows that, so far as wgml 4.0 is concerned, this is one output line.
The last word ("tabs.") and the following text are placed on the next line in PS when this line with more text is used before "tabs.":
.tb set >
:P.This>illustrates some>embedded 12345678901234567890123456789012345678901234567 tabs.>1>2>3>4>5>6>7>8>9>a>b>c>d>e>f>g>h>i>j>k>l>m>n>o>p>q
A test device confirms that "tabs.", with the preceding space, would extend past the right margin. It also shows that each token preceded by a wgml tab is placed in its own text_chars instance. Further testing showed that placing a space character after a wgml tab would, when otherwise appropriate, cause the next word to appear on the next line, and that multiple wgml tabs with no intervening text behave exactly as might be expected: each tab is processed and the text following appears at the final tab stop.
On the other hand, since a wgml tab character is not a space character, it is clear that words separated only by wgml tab characters will be placed (initially) in a single text_chars instance. This would apply to "This>illustrates" and "some>embedded" as well as "tabs." plus the rather unrealistic tail appended above to "tabs.". And, since a wgml tab can be inserted through input translation, it is clear that the wgml tabs within the text controlled by a text_chars instance cannot be identified until after input translation has occurred.
After input translation has been done, determining that a text_line is full, that is, that the current text_chars will start a new text_line, uses these steps:
- Compute the width of the text up to the first wgml tab character and store it in field width of the text_chars instance.
- Perform the test for a full text_line normally.
- If the text_line is not full, then convert the text_chars instance into a doubly-linked list of text_chars instances, using the rules provided below, and append the entire doubly-linked list to the text_line.
- If the text_line is full, then reset the first text_chars instance to start at the left margin, convert the text_chars instance into a doubly-linked list of text_chars instances, using the rules provided below, and use the entire doubly-linked list to start the new text_line.
- Set the field last of the text_line to the last text_chars instance in the doubly-linked list.
- Set the current horizontal position to the position plus width of the last text_chars instance.
When the original text_chars instance is converted into a doubly-linked list, these rules apply:
- If there are no wgml tabs in the text controlled by the original text_chars instance, then no doubly-linked list is formed. The effect is the same as it would be had the text_chars instance simply been appended to the text_line.
- If there is at least one wgml tab in the text controlled by the original text_chars instance:
- Unless it is at the start of the line, the first text_chars instance in the linked list will control the text up to the first tab stop and it will be positioned where the original text_chars instance was.
- If it is at the start of the line, the first text_chars instance in the linked list will be positioned at the tab stop position.
- Empty text_chars instances are produced under various conditions, as discussed here.
The above sequence and rules have a few additional notes that should be made:
- All empty text_chars instances noted are created whether "script", "wscript", or "noscript" is in effect.
- The normal spacing differences between "script" and "wscript/noscript" apply. Thus, if the text has something like:
then, with "script", all the internal spaces will appear but with "wscript/noscript", only two will appear (two because of the stop).
It might be thought that it would be simpler to convert the text_chars instance into the doubly-linked list first, and then recompute the offsets if a new text_line is started, than to find the first wgml tab character twice (once in step 1 and once in either step 3 or step 4). However, it is much more efficient to find the first wgml tab character twice than to do the conversion.
This can be seen by considering what would be involved in recomputing the offsets:
- With respect to the default tab stops, the position of the end of the preceding word may be very different at the end of the old text_line and the start of the new text_line.
- Since multiple tabs can occur, the next tab stop on the old text_line would have to be computed and compared to the actual value used in the second text_chars instance in the doubly-linked list, recovering the number of tabs preceding that text_chars instance's text.
- The position of the same number of tab stops past the end of the first text_chars instance's text in the new text_line would then have to be computed.
- The difference between the position computed for the old and the new text_line instances would then need to be subtracted from the remaining (third onwards) text_chars instances in the doubly-linked list.
In contrast to this, finding the first tab takes one pass through the text in the original text_chars instance -- and that pass will end when the first wgml tab is encountered.
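The cheap first step can be sketched as follows. This is a hypothetical illustration: the function names, the fixed character width, and the simple comparison standing in for the "normal" full-line test are all assumptions, not wgml code.

```python
def width_to_first_tab(text, tab_char, char_width=1):
    """Width of the text up to (not including) the first wgml tab.
    The scan stops at the first tab; text without tabs is measured whole."""
    end = text.find(tab_char)
    if end == -1:
        end = len(text)
    return end * char_width              # assumes a fixed-width font


def starts_new_line(current_pos, width, right_margin):
    """Stand-in for the normal full-line test: does the text overrun?"""
    return current_pos + width > right_margin


print(width_to_first_tab("tabs.>1>2>3", ">"))
```

Only if this test passes or fails does the conversion into a list of text_chars instances take place, so the conversion is done once, at a known starting position.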
User-Specified Tab Stops
The control words TB and TP allow the author of a document specification to specify the columns which have tab stops assigned to them. This, I suspect, is the true use of wgml tabs: to set up columnar tables in the text. This section explores how such tab stops work.
Consider first this text:
.tr < >
.tb set >
.tb 28
>This is a line of text starting with "<".
which produces this output:
This is a line of text starting with ">".
when used with the TASA device.
If we examine this in more detail:
1 2 3 4 5 6 7 01234567890123456789012345678901234567890123456789012345678901234567890 This is a line of text starting with ">".
and keep in mind that the first column is used by TASA for printer control bytes and so is never part of the text to be printed out, even when it is a space character, then it is clear that the tab stop is the same as the default tab stops: the text starts 28 columns from the left margin.
If we examine the PostScript command for the same text:
3700 10466 am (This is a line of text starting with ">".)
and keep in mind the discussion above about how wgml 4.0 sees the PS device behaving, it is clear that the text is seen by wgml 4.0 as starting at 2701 horizontal base units from the left margin. This, of course, is not the same as how the first default tab stop was positioned: instead of subtracting one horizontal base unit from the specified column width, one column's width is being subtracted.
The formula for user-specified tab stops appears to be:
user_specified_tab_stop = ((column - 1) * ((horizontal_base_units/in) / 10cpi ))
where the division is integer division. Although not quite the same as either of the formulas used for the default tab stops, it is still true that each column is one-tenth of the value of the attribute horizontal_base_units of the DEVICE block, and that this value is added to the left margin, computed as discussed here.
This table summarizes the situation. All values were observed during testing, and are for column 28:
hbu/in    tab stop position
10        28
15        42
20        56
1000      2700
1500      4050
2000      5400
Note: further testing with a character device with an hbu of "6" and further consideration of the second line suggests a more complicated situation, which may eventually need to be accommodated in the coding. This is a pity, since the above formula allows the width of a tab stop to be computed once and then reused as a constant.
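The formula as written can be sketched as below; the function name is hypothetical. The sketch is checked only against the 1000, 1500, and 2000 rows of the table. It does not reproduce the rows for 10, 15, and 20, which fits the observation in the note that the situation is more complicated than the formula suggests.

```python
def user_tab_stop(column, hbu_per_inch):
    """Offset of a user-specified tab stop from the left margin, in
    horizontal base units, following the formula above."""
    column_width = hbu_per_inch // 10    # integer division, as stated
    return (column - 1) * column_width


for hbu in (1000, 1500, 2000):
    print(hbu, user_tab_stop(28, hbu))
```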
Now consider this text:
.tr < >
.tb set >
.tb 28 40
>This>is a line of text starting with "<".
which produces this output:
This is a line of text starting with ">".
when used with the TASA device. It should come as no surprise that the second word starts at 40 columns from the left margin. Nor should it be surprising that, when the PS device is used, the second word is preceded by a tab to 3900 horizontal base units from the left margin.
Thus, when multiple tab stops are specified using TB (and, presumably, TP), each tab stop is computed as shown above.
Finally, consider this text:
.tr < >
.tb set >
.tb 28
>This>is a line of text starting with "<".
which produces this output:
This is a line of text starting with ">".
when used with the TASA device. Here the second word starts 36 columns from the left margin, which is one of the default tab stops but is 8 columns from the first tab stop, suggesting that this is, indeed, a default tab stop. The position used with the PS device is 3099 horizontal base units from the left margin, which is a default tab stop, although not the same one, presumably because the first word takes up less horizontal space in PS than it does in TASA. Experimentation confirms that, if the first word is replaced by "This1234", the second word, when the TASA device is used, shifts to the next default tab stop (41 columns from the left margin).
So, user-specified tab stops replace only the default tab stops that lie to the left of the last user-specified tab stop. The remaining default tab stops are still in effect.
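This conclusion can be illustrated with a small sketch for a character device. Everything here is an assumption made for illustration: columns are 1-based and relative to the left margin, the default stops are taken as 6, 11, 16, ..., the upper bound is arbitrary, and the names effective_stops and next_stop are hypothetical.

```python
def effective_stops(user_cols, limit=200):
    """User-specified stops followed by the default stops lying to the
    right of the last user-specified stop. The limit is arbitrary."""
    defaults = range(6, limit + 1, 5)
    last_user = max(user_cols) if user_cols else 0
    return sorted(user_cols) + [d for d in defaults if d > last_user]


def next_stop(stops, print_col):
    """First stop strictly to the right of the current print column."""
    for stop in stops:
        if stop > print_col:
            return stop
    return None


stops = effective_stops([28])
print(next_stop(stops, 28 + len("This")))      # after "This" placed at 28
print(next_stop(stops, 28 + len("This1234")))  # after "This1234" placed at 28
```

With "This" at column 28 the print column is 32 and the next stop is 36; with "This1234" the print column is 36 and the next stop is 41, matching the TASA observations.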
Tab Stop Format
This section discusses the tab stop format. It is intended to provide a conceptual framework for investigating and implementing the effects of a tab stop.
The full tab stop format in the document Waterloo SCRIPT is quite complicated:
<'string'|char/>n<L|R|C|'char'>: the complete specification of a tab-stop position, including both the fill string or character and the alignment type. Each of the tab-stop positions on the .TB control line may be specified in this fashion.
The same description occurs, with the control word name changed, for both TB and TP.
The first element will be called the fill string. It can have any of these forms:
./ -- for a single character fill string
'+-' "+-" /+-/ -- for a multicharacter fill string
'.' "." /./ -- these are treated as multicharacter fill strings, not as fill characters
The last point was discovered by explicitly using a space character:
/ / " " / /
caused wgml 4.0 to insert a string composed entirely of spaces into the output file, while
produced this error:
SC--058: Right string delimeter missing
This element may be followed by one or more spaces.
The second element will be called the tab stop position. It must be a decimal value, which can be preceded by a "+" character. If the "+" character is used, then the decimal value is added to the prior tab stop position. If it is used with the first tab position, then it has no effect. Thus, these tab stop lists have identical results:
15 25
15 +10
+15 +10
If the second element is followed by a space, then the tab stop ends and a new tab stop begins (if any are left).
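The handling of the "+" prefix can be sketched as follows. This is a minimal illustration that parses only the position element: fill strings and alignments are ignored, tab stops are assumed to be separated by spaces, and the function name is hypothetical.

```python
def resolve_positions(spec):
    """Turn a list of tab stop positions, some possibly prefixed with "+",
    into absolute columns. A "+" on the first position has no effect."""
    positions = []
    for token in spec.split():
        value = int(token.lstrip("+"))
        if token.startswith("+") and positions:
            value += positions[-1]       # relative to the prior tab stop
        positions.append(value)
    return positions


for spec in ("15 25", "15 +10", "+15 +10"):
    print(spec, "->", resolve_positions(spec))
```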
The third element will be called the alignment. It must follow the tab stop position immediately. These are the only valid forms:
c C l L r R '.'
where the first is a space, and any character can appear between the single quotes. Note that double quotes and slashes are not allowed with an alignment character.
As might be expected, ill-formed tab stops produce a variety of error messages, depending on what the error is.
This section describes how left, center, and right alignment work. Alignment characters are not explored.
Detailed work with the PS device was actually done with a test device using the same metrics. This allowed character widths to be altered (and so allowed the exact point at which the current tab stop was skipped to be determined) without affecting the PS device itself.
The default tabs do not specify an alignment. This is documented to make them behave as if the alignment was "left". This has been confirmed by many tests.
Alignment "left" behaves exactly as described above: the text starts at the tab stop position. This is implicit in how the tab stop position is computed, and appears to be what the document Waterloo SCRIPT refers to as "left aligning in the standard typewriter fashion", that is "with the character following a tab being placed in the column specified".
Alignment "center" is done by subtracting one-half of the width of the text from the tab stop position used with alignment "left". For a character device, if the width of the text is an odd number, then the mid-point will fall on this position; otherwise, the second half will start at this position. The same computation is used for the PS device, but the mid-point will rarely fall at the start of a character.
Alignment "right" is done so that the last character falls on the tab stop position. For the PS device, the print position after the text has been printed is the tab stop position.
For alignment "left" and "center", the "tab stop position" for the PS device is one column to the left of what might otherwise be expected because of the way in which it is computed and how the PS device works: with a one-inch margin and a tab stop of "15" or "15l", the text will start at 2400, and with a tab stop of "15c", the text will be centered at 2400. This is consistent with the computation described above. For alignment "right", the "tab stop position" is, in fact, the tab stop position: with a one-inch margin and a tab stop of "15r", the text will end at 2500.
Tab stops are selected without regard to alignment or text width: the next tab stop with a tab stop position to the right of the current print head position is taken. Alignment "left" almost always works with the tab stop selected, since the text will follow that position and cannot possibly conflict with the prior text, unless these conditions apply:
- The text starts with a tab character.
- The first tab stop position is "1".
In this case, the "prior text" is, in fact, the left margin.
Alignments "right" and "center" can cause the tab stop selected initially to be skipped and the next tab stop used (tab stops will, in fact, be skipped until one that produces no conflict is found) if the current text would conflict with the prior text.
A conflict exists if the current tab would cause there to be no space between the current text and the prior text, whether they would overlap or simply be adjacent. The displaced text will use the alignment of the next tab stop. Thus,
.tb set >
.tb 15r 18c 30
Tab>test>line.
will result, for a character device, in "line." appearing (with "left" alignment) at column 30 because the last "t" in test is in column 15 and starting "line" in column 16 would leave no space between them. The PS device, in the case shown, would allow "line" to start as far left as 1501 before skipping the tab. Thus, the required space between the prior and current text is one horizontal base unit, not one column.
If a line begins with a tab, then the left margin, computed as discussed here, plays the role of the current print position and of the prior text.
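The selection and skipping rules can be sketched for a character device as shown below. This is a model built from the observations above, not the wgml 4.0 algorithm itself: the name place_text is hypothetical, and print_pos is assumed to be the column the next character would occupy if no tab were present (column 1 at the left margin).

```python
def place_text(tab_stops, print_pos, text_width):
    """Pick the first usable tab stop; return (start_column, alignment).

    tab_stops  -- list of (column, alignment) pairs, sorted by column
    print_pos  -- column the next character would occupy without tabbing
    text_width -- number of characters controlled by the tab character
    Returns None when every user tab stop is skipped (wgml 4.0 then falls
    back to the default tab stops, which this sketch does not model).
    """
    for column, alignment in tab_stops:
        if column <= print_pos:
            continue  # only stops to the right of the print position qualify
        if alignment == "left":
            start = column
        elif alignment == "center":
            start = column - text_width // 2
        else:  # "right"
            start = column - text_width + 1
        if start > print_pos:
            # At least one free column separates this text from the prior text.
            return start, alignment
        # Otherwise the text would overlap or touch the prior text: skip.
    return None


stops = [(15, "right"), (18, "center"), (30, "left")]
first = place_text(stops, 1, len("test"))
print(first)                                   # (12, 'right')
next_pos = first[0] + len("test")              # 16: column after "test"
print(place_text(stops, next_pos, len("line.")))  # (30, 'left')
```

The second call reproduces the example: centering "line." at column 18 would start it in column 16, directly after "test", so that stop is skipped and the text takes the alignment of the stop at column 30.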
This section describes how fill characters work. Fill character strings are not explored.
The default tabs do not specify a fill character. This is documented to make them behave as if the fill character was a space. This has been confirmed by many tests.
When the fill character is the default space character, the print position is simply set to the desired location. As discussed here, this may or may not result in space characters being emitted. When the fill character is specified as, for example, '.', then a string of these characters is output controlled by its own text_chars, that is, as a unit separate from the text before or the text after the fill characters.
The same algorithm is used for all devices. It is not what might be expected by examining character devices; in fact, the document Waterloo SCRIPT describes the algorithm but as being used for fill strings rather than fill characters:
The fill string is propagated in an internal workarea, and the particular column bounds of the tabulation gap are then used to extract the portion required to fill the tabulation gap. For example, a fill string of "abc" in a tabulation gap from column 5 to 9 inclusive will result in the character string "bcabc" filling the tabulation gap.
These tab stop specifications:
produce identical results with wgml 4.0 when using the PS device, so this algorithm is actually used for single fill characters as well as for character strings.
Or, rather, it is if the internal buffer is taken to have these boundaries:
- The start of the buffer corresponds to the left margin (that is, to a tab stop position of "1", the minimum value)
- The end of the buffer is at least at the actual tab stop position. An actual physical buffer would presumably extend to the right margin, or perhaps to the rightmost tab stop position of the current list of tab stops.
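The propagation described in the Waterloo SCRIPT quotation can be sketched in a few lines. This is an illustration only, assuming a character device with 1-based columns and a notional buffer whose first column is the left margin; the function name fill_gap is hypothetical.

```python
def fill_gap(fill: str, first_col: int, last_col: int) -> str:
    """Return the part of the repeated fill string that covers a tab gap.

    The fill string is imagined as repeated from column 1 (the left margin)
    onward; the gap covers first_col through last_col inclusive.
    """
    return "".join(
        fill[(col - 1) % len(fill)] for col in range(first_col, last_col + 1)
    )


print(fill_gap("abc", 5, 9))  # bcabc
print(fill_gap(".", 5, 9))    # .....
```

For a single fill character the result looks the same wherever the buffer starts on a character device, which is why the alignment of the buffer with the left margin only becomes visible with variable-width fonts.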
Using this algorithm has one visible effect on the output: when variable-width fonts are used, then, under these conditions:
- multiple lines use the same font or fonts with the same metrics; or
- multiple lines use fonts with metrics that are close enough to the same metrics that:
- the starting positions are the same;
- the numbers of fill characters are the same; and
- the fill char widths are identical
the fill characters will line up in vertical columns.
For fill characters, the internal buffer can be entirely notional, since all that is needed to create the fill string are these two values:
- The number of fill characters to be output (fill-char-count).
- The starting position of the fill characters (fill-start).
For simplicity, the computation for alignment "left" will be considered first.
The number of fill characters that will fit between the last print position and the current tab stop position, the tab-fill-count, must first be computed:
tab-fill-count = tab-gap-width / fill-char-width
where tab-gap-width is the width of the gap between the last text printed (or left margin) and the current tab stop position and fill-char-width is the width of the fill character, both in horizontal base units, and the result is truncated.
Now it is necessary to determine how many of these fill characters are, as it were, overlaid by the preceding text. This is done by determining the width of all the text preceding the tab stop (including the effects of any prior tab stops), in horizontal base units, that is, the text-width. The value text-count can then be computed by:
text-count = text-width / fill-char-width
if( text-width % fill-char-width > 0 ) text-count++
where the result of the division is truncated but the overall effect is to round text-count up if there is any remainder.
The fill-char-count and fill-start can now be found very easily:
fill-char-count = tab-fill-count - text-count
fill-start = (text-count + 1) * fill-char-width
That is, the fill string begins with the first fill character in the hypothetical string constructed above which is not overwritten, even in part, by the preceding text and extends to as close to the tab stop position as an integral number of fill characters can get.
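The formulas above can be transcribed directly. The sketch below does nothing more than that: the function name fill_layout is hypothetical, all widths are assumed to be integers in horizontal base units, and the sample values are arbitrary numbers chosen to show the rounding, not measurements from any device.

```python
def fill_layout(tab_gap_width: int, text_width: int, fill_char_width: int):
    """Return (fill_char_count, fill_start) for alignment "left".

    tab_gap_width   -- tab-gap-width as defined in the text
    text_width      -- width of all text preceding the tab stop
    fill_char_width -- width of one fill character
    """
    # Truncating division: how many fill characters fit in the gap.
    tab_fill_count = tab_gap_width // fill_char_width

    # Round up: a partly covered fill character counts as covered.
    text_count = text_width // fill_char_width
    if text_width % fill_char_width > 0:
        text_count += 1

    fill_char_count = tab_fill_count - text_count
    fill_start = (text_count + 1) * fill_char_width
    return fill_char_count, fill_start


print(fill_layout(1400, 420, 50))  # (19, 500)
print(fill_layout(14, 4, 1))       # (10, 5)
```

In the first call 420 is not a multiple of 50, so text-count is rounded up from 8 to 9; in the second, with a width of one unit per character, both divisions are exact.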
For alignments "center" and "right", the tab stop position is replaced by the start position of the text; otherwise, the algorithm is the same as for alignment "left". Fill characters do not appear following centered text.
In devices which use monospaced fonts in which all characters occupy one column, which is equal to one horizontal base unit, all of the divisions shown will be exact and the fill string will have one fill character in each column between the prior text and the tab stop position (or start position of the text).
In other devices, where the divisions may be associated with non-zero remainders, some additional effects appear:
- There may be a gap between the last character of the preceding text and the first fill character.
- There may be a gap between the last fill character and the start of the following text.
- A tab stop may be found to be useable and so be used even though the value of fill-char-count turns out to be "0". In this case, if the starting position of the text is to the left of the value of fill-start, then the starting position of the text is set equal to fill-start. The effect is to force the text to start where the first fill character would normally start.
Fill characters appear whenever a tab stop is used, even if no text exists. If a tab stop is skipped, then the fill characters of the skipped tab stop do not appear.
Fill characters use the font of the tab character. Thus, given these lines:
.tb set >
.tb 15 ./25r 30 45 60
Tab>:hp1.test:ehp1.>:hp2.line:ehp2..
.br; Tab>:hp1.test>:ehp1.:hp2.line:ehp2..
.br; Tab>:hp1.test:ehp1.:hp2.>line:ehp2..
The fill characters will be in the default font in the first line, font 1 in the second line, and font 2 in the third line.
Tab stop characters can also be superscripted. The effect is to superscript any fill characters and the tab stop position. That is, just as the font is reduced in size, so also are any horizontal positions specified between the code sequences used to produce superscripts. This can cause the fill characters to overprint the prior text; this will not be treated as a conflict so long as the specified horizontal positions do not produce a conflict.
Although wgml 4.0 does not process subscripts properly, there is no doubt that, if it did, subscripted tab characters would behave the same way as superscripted tab characters (except for being subscripted).
Tab Character Scope
This section discusses the scope of a tab character, that is, how much of the text following a tab character is affected by that tab character.
The basic rule appears to be:
a tab character controls all text following it up to the next tab stop character or the next break
There is a subtle irony here: on the one hand, space characters, as such, do not end the scope of a tab character but, on the other hand, the basic rule is entirely correct only when there are no space characters at any point between the tab character and either the next tab character or the next break. When space characters do occur, the basic rule is no longer entirely correct.
Unless otherwise noted, WSCRIPT is in effect (so that newline sequences and keyboard tabs are treated as space characters), concatenation is always "on", justification is always "off", and the left margin is '1i'.
This is the sort of input used in testing:
.tb set >
.tb 15 30 45 60
Tab>test line.>still more
When the input shown is processed by wgml 4.0, the result is exactly what would be expected if the tab character's scope ended at the first space character: "test" and "still" start at the appropriate tab stop position and "line" and "more" start one space to the right of the end of "test" and "still" (respectively).
When the alignment of the tab stops is changed to "right" or "center", then the basic rule becomes apparent: the phrases "test line" and "still more" are treated as units. So, for alignments "right" and "center", the tab scope definitely is not ended by the first space character. Also, with WSCRIPT in effect, these phrases have only one space between the words, not the two in the input text, so it isn't as simple as just using all the characters in the scope unaltered.
Now suppose a great deal of additional text is added to the line, both before "line." and after "still more", but that "line.>still" is still present. These are the rules derived from my tests:
- Tab stops are skipped as usual for alignments "center" and "right".
- If the tab stop used has the default ("left") alignment, then the text is broken into lines exactly as it is when no tabs are present, except for the sequence "line.>still", which follows the usual rules for tabbing:
- It is moved to the next line or not based on the length of "line." only.
- If it stays on the current line, then "still" may be placed past the right margin.
- If it moves to the next line, then "still" will be positioned based on the first useable tab position following "line.".
- If a tab stop with alignment "center" is used, then text may continue past the right column, even if it contains spaces. This is because a tab stop with alignment "center" is accepted if one-half the total width of the text will fit between the last text printed and the tab stop position. Whether the other half will fit within the right margin or not is not considered.
- If all user tabs are skipped, then the first default tab stop following the end of the already-placed text is used. This tab stop may be to the left of one or more of the user tab stops. This clarifies how default tabs and user tabs interact, and underlies the description given above.
- Since the default tab stops all use normal ("left") alignment, in the long run the text will be positioned normally.
These rules, carefully applied, explain a great deal of interesting behavior involving tab stops.
Up to this point, the tab characters have all had non-space characters on both sides. Now it is time to consider what happens when space characters are adjacent to tab characters.
The first case is when one or more spaces precede the tab character. Changing "line.>still" to "line. >still" shows that the only effect is to treat the space or spaces inserted by wgml 4.0 when WSCRIPT is in use as text when selecting the next tab stop and when applying alignments "center" and "right". Thus, with "line.", since it ends in a stop, the tab stop used for "still" must be far enough to the right that there is at least one column between the start position of "still" and the position of the second of the two spaces placed after "line." by wgml 4.0. For "line,", which does not end in a stop, the tab used for "still" must have a position one column to the right of the position of the space placed after "line," by wgml 4.0. Further, the phrases aligned using either alignment "center" or alignment "right" will be "test line, " or "test line. ".
However, if the first tab stop has center alignment, then the center point of "test line" will be moved to the left by one-half of the width of the spaces. When WSCRIPT is in effect, this will be the width of one or two spaces (two when the prior text ends with a stop), not the total width of the space characters actually present on the line.
If the spaces occur at the start of the line, then
- With concatenation on, a marker is emitted with its position at the left margin.
- With concatenation off, a marker is emitted with its position at the end of the space characters.
Concatenation off was explored only to the extent that the Open Watcom documents use it. This is normally done by using such tags as XMP, which turn off concatenation and (right) justification in the block between the start and end tag.
Since the effect seen is more a matter of how the first tab character's scope ends than of having a space in front of a tab character, testing was also done with placing one or more spaces after the final word "text". This, however, has no effect on the result: for alignments "center" and "right" the phrase used is "still more", with no spaces at the end. The most likely reason for this is that wgml 4.0 does not insert these spaces at the end of an output line.
The second case is when one or more spaces follow the tab character and are adjacent to it. Here the rules appear to be:
- For default ("left") alignment, what happens depends on whether or not the next word, if there is one, will fit on the same line within the margins:
- If there is a next word and it will fit, then every space preceding the text is output, even with WSCRIPT in effect.
- If there is no next word or there is and it will not fit, then the spaces preceding the first word are all ignored and the next word (if any) is moved to the next line.
- For alignment "right", there is no effect on the position of the end of the text; however, the tab stop will be skipped if all the spaces preceding the text will not fit.
- For alignment "center", the next tab stop will be taken if one-half of the total width of the space(s) preceding the text, truncated, will not fit. The mid-point used is shifted by one-half of the total width of the space(s) preceding the text:
- rounded up, if the tab character scope is terminated by a tab character; or
- truncated, if the tab character scope is terminated by a break.
These results have been confirmed with both a character device and with the PS device.
It was eventually discovered that wgml 4.0 behaves differently with text which is produced by a macro and text which is not in at least two cases:
- A tab marker produced to mark the position of a tab at the end of the prior input text will not be removed for a tab stop with alignment "left" unless a macro is not being processed.
- If spaces precede text but follow a tab character and the tab stop alignment is "left", then the position of the text will reflect the actual number of spaces in the text even with WSCRIPT in effect if a macro is not being processed but will use the spacing normally produced by WSCRIPT when a macro is being processed.
Whether these differences are intended or merely reflect how wgml 4.0 processes text is unknown.
Finally, the case where space characters, but no text, occur before, between, or after a tab character must be considered:
- If placed before the first tab character, however many they may be, they have no effect: the first tab stop position will be used unless the text will not fit.
- If enough are placed between two tab characters, then that alone can cause a tab stop to be skipped. The number required varies with the alignment of the tab stop involved.
- If placed after the last tab character before the next break, there is no effect.
The discussion of wgml 4.0 sequencing behavior found here includes some information on tabs, based on observations of default tabs, which may be explained by the rules identified here.
Tab Character Breaks
Under these conditions, a break is inserted by wgml 4.0 at the start of an input text line, even with concatenation ON and WSCRIPT specified:
- User tabs have been defined.
- The current output line:
- contains no tab characters; or
- contains tab characters and all user tabs have been used to position text.
- The next input line:
- contains one or more tab characters; or
- starts with a tab character.
Note that these situations are excluded:
- No user tabs are in use.
- If the current output line contains tab characters and ends with a tab character: since a tab character at the end of the input line does not position text on that input line, it is counted as "not used".
- The next input line contains no tab characters.
In these cases, the new input line is appended to the current output line and processed as usual.
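The conditions above amount to a simple predicate, sketched below. It is a restatement of the observed behavior, not the wgml 4.0 logic: the function and parameter names are hypothetical, and the caller is assumed to have already decided whether all user tabs were used to position text (with a trailing tab character counted as "not used", as noted above).

```python
def break_before_line(next_line: str, tab_char: str, user_tabs_defined: bool,
                      tabs_on_output_line: int, all_user_tabs_used: bool) -> bool:
    """Return True if a break would be inserted before next_line.

    next_line           -- the next input text line, after symbol expansion
    tab_char            -- the character currently acting as the tab character
    user_tabs_defined   -- whether any user tab stops are in effect
    tabs_on_output_line -- number of tab characters on the current output line
    all_user_tabs_used  -- whether every user tab stop has positioned text
    """
    if not user_tabs_defined:
        return False  # excluded: no user tabs are in use
    if tab_char not in next_line:
        return False  # excluded: next input line has no tab characters
    # Either no tabs so far on the output line, or the tab stops are used up.
    return tabs_on_output_line == 0 or all_user_tabs_used


print(break_before_line("Tab>test>line.", ">", True, 0, False))  # True
print(break_before_line("plain text", ">", True, 0, False))      # False
print(break_before_line(">more", ">", True, 2, False))           # False
```

When the predicate is False, the new input line is simply appended to the current output line.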
This is clearly intended to facilitate creating tables with user tabs: it allows each line in the table to be listed without having to separate them with control word BR.
Font changes and subscripting/superscripting before (but not including) the first tab character do not change this behavior, nor does placing a tab character in the value of a symbol which is expanded on the input line. Thus, at some point, wgml 4.0 must know whether or not an input line, considered as a whole, contains at least one tab character after any symbols have been expanded, before the line is broken into logical records.
Use of subscripting or superscripting before the first tab character does not work properly in wgml 4.0: enormous horizontal values are eventually generated, hiding the text, at least in some cases.
Note that this section is based on limited testing. Further details may become apparent as time goes on.
A tab marker is a text_chars instance that controls no text and which wgml 4.0 creates when tabbing. They appear to serve three different functions:
- to set the current print position to a specific location;
- to set the font when it changes;
- to set the type (normal, subscript, superscript) when it changes to either subscript or superscript.
These items should be mentioned:
- If a tab stop is skipped, then the tab markers associated with that tab stop do not appear.
- wgml 4.0 does not implement subscripts correctly; however, it does respond to the subscripting functions in the same way it does to superscripting functions. Whenever the behavior of superscripting is described below, the same behavior should be understood to apply to subscripting.
- The tab marker will have the font number and type of the tab character.
When used to mark the position of the current tab stop, this position is affected by the alignment as discussed here. Thus, a tab stop of "15" with the PS device will produce (with a one-inch margin) a tab marker at position "2400" for default ("left") alignment and alignment "center", and a tab marker at position "2500" for alignment "right".
When spaces occur following text and preceding a tab character, then a tab marker is used to mark the position which the tab character would normally occupy.
A tab character with no text or space characters between it and a following tab character or the next break produces a tab marker that, in effect, records the fact that the tab stop was used.
When two tab characters have only space characters between them, then the first tab character produces a tab marker that, compared to the tab marker produced when two tab characters are adjacent, is shifted to the right by the total width of the space characters, except when the alignment of the tab stop used with the first tab character is "center", in which case the position is shifted to the right by one-half of the total width of the space characters, rounded up if the total width of the space characters is odd.
However, when a font change begins immediately after the first tab character, then a second tab marker one space to the right of the tab marker for the first tab character is produced instead. This is troubling for two reasons:
- It implies that the behavior of wgml tabbing depends on logical record boundaries.
- Limited testing revealed a great many strange behaviors which, as they do not affect the Open Watcom documents, need not be further explored.
The strange behaviors were:
- When the font changes back before the second tab character, then the text following the second tab is offset by one space to the right from the tab stop position.
- If text occurs before the second tab character, whether the font changes back after the text and before the next tab character or not, then the marker for the first tab still appears but the text is offset by one space to the right.
It is very hard to see why, given
.tb set $
.tb 10 20
a line like
This is $:hp1. $a test:ehp1.
and a line like
This is $ $a test
should produce different output. Yet it clearly does, at least when spaces are involved. Note that these limited tests were all done with alignment left; alignment center might differ slightly.
When a font changes immediately after a tab character, and the text for that tab character does not start with spaces, a tab marker is used to mark the font change and the position used will be:
- that of the current tab stop when it has default ("left") alignment;
- that of the start of the text positioned by the current tab stop when it has alignment "right" or "center".
These are not, of course, actually different cases: when the default ("left") alignment is in use, then the text begins at the tab stop position. Clearly, this tab marker is used to ensure that any spaces output as a result of the tab stop position are in the proper font. If a fill character is specified, then the fill character string precedes the tab marker, which still appears.
When the type changes, then as many as two tab markers can appear:
- when the change is from normal to superscripting and occurs immediately before a tab character, then the tab marker used will have the position where the last text already output ended.
- when the change is superscripting to normal and occurs immediately after a tab character, then the tab marker used will have the position where the next non-superscripted text to be output will start.
If a fill character is specified, then if both tab markers would appear, then they still appear with the fill characters in between them. If the criteria above are not satisfied for one or the other (or both) of the tab markers, then any tab marker which does not appear still does not appear when fill characters are output.
Superscripts and font changes using tags interact badly in wgml 4.0. One reason for this is that the same character, ".", is used to terminate both the one-character superscript function (just as it is in symbol substitution) and the tag. However, the multicharacter subscript function, which appears to use parentheses to control its scope, does not interact well with tags producing font changes either. Testing showed that simply placing a highlighted phrase adjacent to a superscripted phrase, with no overlap, could cause problems that the PS interpreter would treat as errors. Using control word DC to change the character terminating the tag (DC MCS) did not change the situation in the limited tests performed.
The implementation of tabbing involves several different parts of the wgml code:
- various control words
- structs and data objects that represent tab stops and tab stop lists
- tab stop processing
- input-text processing
- interactions with other parts of wgml
Default tabbing has been implemented. This involved:
- extending the implementation of control word DC to set the user-defined tab character
- developing the struct encoding a tab stop
- defining the data objects needed to encode tab stop lists
- implementing functions to manage the default tab stops and to select the next tab stop given a horizontal position
- implementing a function to handle default tab stops and integrating its invocations into input-text processing
- exploring the interaction between default tab stops and overlong words
Note that the last item was not completely implemented, as noted here (toward the bottom). In retrospect, this was the first sign that tab stop scoping was not as simple as I thought.
User-defined tabs are currently being implemented. This has been done so far:
- control word TB has been partially implemented (control word TP is not used by the Open Watcom documents and so can be postponed)
- the struct has been revised to handle alignment better
- the function that handles default tab stops has been extended to handle non-default fill characters and non-default alignments
However, work is still ongoing.
Implementing both DC and TB allows them to be compared. They do not work identically. The system symbols $tb and $tab are set by wgml 4.0 to the user-defined tab character when ".dc tb" is used, but not when ".tb set" is used. Since these symbols are documented to contain the current user-defined tab character, our wgml sets these system symbols regardless of which control word is used to set the character.
Another difference is that ".dc tb 7e" and ".dc tb ~" both work, but ".tb set 7e" fails with this error message:
SC--057: Tab character defined by .TB SET must be one character
This difference has been preserved. It is interesting that even control words which perform the same action behave in different ways: it suggests that there may have been sets of control words which were developed independently of each other (possibly as successive generations).
Tab stop parsing is partially implemented; features not implemented (since they appear not to be needed by the Open Watcom documents) are:
- alignment characters
- fill character strings
The implementation identifies these and issues error messages if they are found; however, a fill character string consisting of a single character is treated as a fill character (if the single character happens to be a space, then it is treated as the default fill character), which differs from wgml 4.0. Should they need to be implemented, the code emitting the error messages can be replaced with code saving the values and single-character fill character strings can be treated as fill character strings, at least if that character is a space character.
Implementing fill character strings would require storing the string somewhere associated with the tab stop. No matter how this is done, resetting the user tab array, which is currently a matter of setting a data member to "0", would require releasing the memory involved. Also, depending on how this is done, it might be necessary to redo the code involved in creating and using groups of tab stops to use a linked list rather than an array. The implementation should be fairly straightforward, since it is currently implemented for fill characters. Of course, with multiple characters from a variable-width font, an actual buffer will probably have to be allocated and filled with repeated copies of the fill character string, rather than using a notional buffer, as the fill character implementation does.
Implementing alignment characters would involve adding a tag to an enum and a field for the character itself. Duplicating the actual behavior should be fairly straightforward: the code pattern used for al_center and al_right should work, although the test for whether or not to skip the tab stop may be a bit more complicated.
Testing is currently being done using a diff program to compare the output of our wgml and wgml 4.0 directly. In addition, the macros, symbols, and input text involved in the use of tabbing in the Open Watcom documents are being extracted and placed in a test file. This has revealed several issues, not all of them related to tabbing.
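The shape of that change can be pictured with a small model. This is purely illustrative and is written in Python rather than in the language of the wgml sources: only the names al_center and al_right come from the text above, while al_left, al_char, TabStop and its fields are hypothetical stand-ins for whatever the actual enum and struct contain.

```python
from dataclasses import dataclass
from enum import Enum


class Alignment(Enum):
    al_left = "left"      # hypothetical name for the default alignment
    al_center = "center"
    al_right = "right"
    al_char = "char"      # hypothetical new tag for alignment characters


@dataclass
class TabStop:
    column: int                              # tab stop position
    alignment: Alignment = Alignment.al_left
    fill_char: str = " "                     # default fill character is a space
    align_char: str = ""                     # new field; used only with al_char


# A stop like "15r", and a stop aligned on a hypothetical "." character.
right_stop = TabStop(column=15, alignment=Alignment.al_right)
decimal_stop = TabStop(column=40, alignment=Alignment.al_char, align_char=".")
print(right_stop.fill_char == " ", decimal_stop.align_char)  # True .
```

Keeping the alignment character in its own field means that existing stops keep their current behavior, and only code that tests for the new tag needs to look at it.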
The scope of a tab character has now been implemented to match wgml 4.0 except in two cases:
- When the scope is terminated by a break and more than one word is available but only one will fit on the output line, all of the spaces preceding the text are still reflected in the starting position.
- The difference in the center-point of alignment center between scopes terminated by tab characters and scopes terminated by breaks is ignored.
This was done because the testing showed that spaces following the tab character are used with left alignment, and that multiple-word scopes at the end of the line are used with right alignment. Although, so far, this has not happened with center alignment, it is actually easier to implement both center and right alignment together than to exclude center alignment.
When the case where spaces follow text between tab characters (for example, with "$" as the tab character, "$text $") was implemented, testing showed that the extra space(s) could cause the tab to be skipped. When the tab stop was skipped, the wgml 4.0 behavior made no sense, and so was not implemented.
Tabbing may also be of use internally. For example, when the entries to the Table of Contents or List of Figures, generated with the default :LAYOUT, are examined, lines such as these:
Simple Document ........................................ 1
Figure 1. Sample caption with multiple words ........... 1
are found. Examining the output shows that these are treated as normal text lines: each word, including the "..." bits, is in a separate text_chars instance. It is, of course, possible for each output line to be placed, as shown, in a buffer and then converted to a text_line as usual (.co off or equivalent would be needed to prevent the lines from being merged); however, it would also be possible (and, since the Table of Contents and List of Figures will likely have multiple lines in it, quite efficient) to modify the internal state to specify a tab stop at the appropriate point using "." as a fill character, and then construct something like (using ">" as a wgml tab character):
Simple Document > 1
Figure 1. Sample caption with multiple words > 1
and convert that to a text_line as usual, relying on the tabbing implementation to create the "..." bits. An additional tab stop with alignment "right" would be effective in positioning the page number.
And adding documentation of tabbing to the WGML Reference should be done at some point.