Code Generator Improvements

From Open Watcom


This page is intended to collect notes on possible improvements of the Open Watcom code generator (cg) with regard to generated code quality (speed or size). Most of the issues are minor in impact and the generated code is often just as good as the competition. The list is by its nature somewhat random. Note that bugs need to be discussed elsewhere.

Many of the following observations were made when building 16-bit code for BIOS/firmware applications.


Better intrinsics

Some intrinsic functions would benefit from being handled as pseudo-assembler instructions. That notably includes the inp()/outp() family of functions. For 8-bit I/O port addresses, immediates could be generated instead of requiring the DX register. And even for 16-bit port addresses, giving the register allocator full knowledge of the instructions could avoid unnecessary DX reloads.

More accurate intrinsics declarations

Consider the following:

#include <conio.h>
unsigned *pport;
void foo( unsigned val )
{
    outpw( *pport, val );
    outpw( *pport, val );
}

Compiled with 'wcc -s -oi', the resulting code is:

Segment: _TEXT BYTE USE16 00000017 bytes
0000                          foo_:
0000    53                        push        bx
0001    52                        push        dx
0002    56                        push        si
0003    89 C3                     mov         bx,ax
0005    8B 36 00 00               mov         si,word ptr _pport
0009    8B 14                     mov         dx,word ptr [si]
000B    EF                        out         dx,ax
000C    8B 36 00 00               mov         si,word ptr _pport
0010    8B 14                     mov         dx,word ptr [si]
0012    EF                        out         dx,ax
0013    5E                        pop         si
0014    5A                        pop         dx
0015    5B                        pop         bx
0016    C3                        ret

It is obvious that the code between the two OUT instructions is redundant. The intrinsic is not marked as leaving memory unmodified, so the cg must assume that the contents of the global may have changed between the two calls. Properly declaring the intrinsic as modifying no memory and no registers helps. That is accomplished by adding the following:

#pragma aux outpw modify nomemory exact []

The generated code now reads:

Segment: _TEXT BYTE USE16 00000011 bytes
0000                          foo_:
0000    53                        push        bx
0001    52                        push        dx
0002    56                        push        si
0003    89 C3                     mov         bx,ax
0005    8B 36 00 00               mov         si,word ptr _pport
0009    8B 14                     mov         dx,word ptr [si]
000B    EF                        out         dx,ax
000C    EF                        out         dx,ax
000D    5E                        pop         si
000E    5A                        pop         dx
000F    5B                        pop         bx
0010    C3                        ret

Many of the intrinsics built into the C and C++ compilers are unnecessarily conservative, sometimes causing the cg to generate larger/slower code than needed. Numerous intrinsics (inp()/outp(), ldiv(), lrotr()/lrotl(), etc.) modify no memory, but that isn't taken into consideration.
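The same treatment shown for outpw above could plausibly be extended to the rest of the port I/O family. The following is only a sketch following that pattern; whether 'modify nomemory exact []' is actually safe for each function (in particular for inp()/inpw(), which produce a result in a register) would need to be verified against the actual intrinsic expansions:

```
#include <conio.h>

/* Hypothetical declarations, mirroring the outpw pragma above. */
#pragma aux outp  modify nomemory exact []
#pragma aux outpw modify nomemory exact []
#pragma aux inp   modify nomemory exact []
#pragma aux inpw  modify nomemory exact []
```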

More aggressive register reuse

Consider the following:

int comp( signed char sc, unsigned char uc )
{
    return( sc > uc );
}

When built with wcc -oaxs -3, the generated code is not bad:

Segment: _TEXT BYTE USE16 00000010 bytes
0000                          comp_:
0000    53                        push        bx
0001    0F BE D8                  movsx       bx,al
0004    0F B6 C2                  movzx       ax,dl
0007    39 C3                     cmp         bx,ax
0009    0F 9F C0                  setg        al
000C    30 E4                     xor         ah,ah
000E    5B                        pop         bx
000F    C3                        ret

However, it is obvious that register BX doesn't need to be used at all. The cg could zero/sign extend the arguments in AX and DX where they are already, avoiding the use of BX and the need to save/restore the register. Note that a similar problem occurs when compiling with wcc -oaxs, but wcc -oaxsh avoids unnecessary register use. There are other similar situations where registers are used unnecessarily. Interestingly, Microsoft Visual C++ 1.52c has the same kind of problem when compiling with cl -c -Oxs -Gr and uses CX (instead of BX) for no good reason.

The problem is caused by the cg allocating a new register when converting a value instead of trying to reuse the already used register. The zero extension is allocated first and grabs register AX (because it is free), instead of sticking to DX since it's using DL already. The sign extension then can't use AX anymore and must allocate BX (the next free register).

The AssignConflicts() routine in cg/c/regalloc.c is responsible. Neither CalcSavings() nor SortConflicts() gives any preference to DX when zero extending DL; AX is used simply because it is the first available register. It should be possible to tweak the cg to prefer register reuse. That is, when zero extending DL, the result should be stored in DX if DX is available.

Note that sign extension is more difficult when generating code for 286 and older because only the CBW instruction is available and it is tied to AL/AX. In that case, register moves may be unavoidable. Zero extension is more flexible because AH/BH/CH/DH can all be zeroed directly.

The following is a good testcase. It is surprisingly difficult for 16-bit compilers when passing arguments in registers.

int comp1( signed char sc, unsigned char uc )
{
    return( sc < uc );
}
int comp2( unsigned char uc, signed char sc )
{
    return( sc < uc );
}
int comp3( signed char sc, unsigned char uc )
{
    return( uc < sc );
}
int comp4( unsigned char uc, signed char sc )
{
    return( uc < sc );
}

After Change 36188, Open Watcom generates good code for the above. That was achieved by modifying the CountRegMoves() routine in regalloc.c to also consider the cases where the first operand of a MOV or other instruction (notably CNV) overlaps the examined registers.

Adding to SP

Consider the following:

extern int __cdecl foo( int arg );
int __fastcall bar( int arg )
{
    return( foo( arg ) );
}

The generated code is as follows:

Segment: _TEXT BYTE USE16 00000008 bytes
0000                          @bar:
0000    50                        push        ax
0001    E8 00 00                  call        _foo
0004    83 C4 02                  add         sp,0x0002
0007    C3                        ret

While that isn't wrong, it would be much better to generate something like this:

Segment: _TEXT WORD USE16 00000006 bytes
0000                          @bar:
0000    50                        push        ax
0001    E8 00 00                  call        _foo
0004    5B                        pop         bx
0005    C3                        ret

If there is a "killed" register available (one that is unused and whose value doesn't need to be preserved), it is much better to generate a 1-byte POP than a 3-byte ADD SP. If a routine with one parameter which uses stack argument passing is called often, this can add up to noticeable savings.

When adding 4 to SP, it can still be advantageous to generate two POP instructions when optimizing for size. There is an analogous situation with the 386 cg. Note that on older CPUs, POP is slower than ADD SP, and should therefore only be used when optimizing for size.

Efficient short to long multiply

Consider the following:

long foo( short a, short b )
{
    return( (long)a * b );
}

This sort of code is not unusual - it comes up whenever two 16-bit operands need to be multiplied to produce a 32-bit result. The cg currently generates code which is far too long and inefficient, because the intermediate pseudo-assembler language cannot express the x86 MUL/IMUL instructions which produce a result that is twice the size of the operands. Thus to get a 32-bit result, the 16-bit operands are first converted to 32-bit and then the _I4M routine is called. Microsoft C 6.0 and later produce far better code (generated via cl -c -Ox -G2):

Segment: _TEXT WORD USE16 0000000C bytes
0000                          _foo:
0000    55                        push        bp
0001    8B EC                     mov         bp,sp
0003    8B 46 06                  mov         ax,word ptr 0x6[bp]
0006    F7 6E 04                  imul        word ptr 0x4[bp]
0009    C9                        leave
000A    C3                        ret
000B    90                        nop

That code is much shorter and requires no support routine calls. The compiler recognized that the operands were really 16-bit quantities and therefore IMUL is exactly the right instruction, taking 16-bit operands and producing a 32-bit result in DX:AX.

Fixing this probably won't be trivial and some kind of "extending multiply" pseudo-instruction may need to be added to the cg. The cg will also need to recognize the case where a 32-bit multiply is performed on operands which were converted from 16-bit.

Note that the 32-bit cg has an analogous problem with a multiply where the operands are int/long (32-bit) but the result is long long (64-bit).

There is also a potential improvement to be made for multiplies where the operands are 8-bit and the result 16-bit or 32-bit. That case isn't nearly as bad because the runtime support routine does not need to be called, but it would still be better to perform an 8-bit multiply and possibly convert the result rather than converting the two operands.

Better offset calculations (arithmetic optimization)

Consider the following:

typedef struct S1 {
    unsigned char   length;
    unsigned char   c;
    unsigned char   size;
} S1;
#define round_size( s ) \
    ( (&s->size - &s->length) + (16 - ((&s->size - &s->length) % 16)) )
int foo( S1 *ps )
{
    return( &ps->size - &ps->length );
} 
int bar( S1 *ps )
{
    return( round_size( ps ) );
}

The &s->size - &s->length expression is essentially a poor man's sizeof (although the result is not identical). However, the code generated by Open Watcom for bar() is quite bad, although the final result is correct. The code for foo() is much easier to understand:

0000                          foo_:
0000    52                        push        dx
0001    89 C2                     mov         dx,ax
0003    83 C2 02                  add         dx,0x0002
0006    29 C2                     sub         dx,ax
0008    89 D0                     mov         ax,dx
000A    5A                        pop         dx
000B    C3                        ret

The code essentially calculates AX + 2 - AX, which obviously equals 2. In a more complex expression such as the one in bar(), this leads to rather long code which should instead evaluate to a constant at compile time. A simpler and even more obvious testcase follows:

int foo( int a )
{
    return( a + 2 - a );
}

The optimization would probably be best performed in the cg when constructing the expression tree.

Note that not every compiler performs this seemingly simple optimization (e.g. IBM VisualAge C++ 3.5 and Borland C++ 5.5 do not). On the other hand, Microsoft C 6.0 does perform it.
