Performance tuning

From Open Watcom

Jump to: navigation, search

The following is a collection of notes and observations based on experience with improving the performance of Open Watcom tools, primarily the C compiler and code generator. The notes are concerned predominantly with data structure and algorithm selection and tuning, which are entirely generic, as opposed to lower and micro-level optimizations related to code generation and architecture specific instruction selection.

Contents

Performance Analysis

Finding bottlenecks is the first step in tuning application performance. It is discussed separately on the Performance Profiling page. Needless to say, without first analyzing performance, it is pointless to attempt any improvements.

Algorithms and Data Structures

Once the hotspots are identified, the real work starts. The programmer first needs to understand what the code is trying to accomplish, then replace it with more efficient algorithm. There are several common performance bottlenecks that show up often in practice.

Slow I/O

This was not a significant problem for Open Watcom, but showed up several times. Many programmers don't realize that unbuffered I/O can be extremely slow. It tends to involve OS calls which incur context switches. Reading binary files one word at a time using unbuffered I/O is a great way to kill performance. The remedy is simple - use C library buffered stream I/O, or, if you feel you can do a better job, create your own buffering scheme (this is generally not recommended). I/O on the operating system level should be done in at least 1K chunks, while 4-16K blocks tend to be even better. Going beyond 32-64K block size rarely provides measurable benefits.

Memory Management

A compiler is a good example of application that is heavily dependent on dynamic memory management. Complexity and structure of user source code can differ wildly at runtime and the compiler must be able to cope with very varied workloads.

The first rule of thumb is, don't try to be clever. In the typical case, you won't do a better job than your runtime library's malloc() and getting a memory allocator right is non-trivial task. Only consider custom memory management code if you know for a fact that the generic memory allocator isn't efficient enough.

That said, there are cases where custom memory allocation can be beneficial. Two real-world examples follow:

  • The C compiler's symbol table is a strange beast, because symbols are only added to it until the entire source module is parsed. After code is generated and symbols are no longer needed, the entire symbol table is discarded at once. Experience shows that a fixed-block allocator works well. Memory is allocated in large chunks from the OS; allocating a new entry usually only involves incrementing the 'next available' pointer by the size of allocated block. Needless to say, this scheme is extremely fast and typically impossible to beat performance-wise.
  • The code generator tends to allocate and free large numbers of small memory blocks. Several block sizes are very common. In this case, custom free lists based memory manager is a good solution. There are several free lists for blocks of different sizes (8, 16, 32, 64 bytes etc.). A memory allocation request always uses up entire block size, which means some memory may be wasted. However, freeing a memory block only involves adding it on the appropriate free list; if any block is available on the free list, to allocate memory the block only needs to be removed from the free list. These operations are extremely fast with no worst case performance - there is no searching involved for either allocation or freeing. This approach also tends to reuse memory blocks quickly, which improves cache and paging efficiency.

Caching Information

There are many instances where calculating some information is relatively expensive, but the data can be stored for later reuse. This may dramatically improve performance.

When the Watcom C 11.0 compiler implemented function inlining, the information about which function should be inlined and which shouldn't (note that this is determined dynamically, based on compile options) was often needed; determining the "inlineability" of a function was fairly expensive. Although barely noticeable in most cases, in complex source modules this created a severe performance bottleneck. Simply caching the inlining information in the symbol table sped up the compilation of certain large and complex source files roughly by a factor of 100.

Data Management

Sometimes an inefficient use of data structures can cause trouble. The code generator keeps a singly linked list of structure fields for generating debugging information. The list is only added to while parsing the source code, and later it's used to output debugging information when generating the object file. It is important to note that in this case, the data structure was chosen correctly.

However, it was not being properly used. Adding a new entry involved traversing the list from the first entry until the last, then linking in the new entry. This caused performance degradation for complex structures. All that was needed to improve performance was to keep a pointer to the last entry in the list and use that when adding to it. For one atypical testcase, the compiler was sped up by a factor of 50.

Data Structure Selection

In many cases, inappropriate data structures are used. If a data structure needs to be searched, a linked list is going to have very poor worst case performance. In such instances, a different data structure is often a good way to significantly improve performance.

The C compiler uses hash tables for various internal structures. The symbol table heavily relies on hashing; symbols are identified by name (ie. a string) and string comparisons are relatively expensive. Instead of hash tables, it might be possible to use trees; however, the trees would have to be balanced to avoid degeneration, and their management would be very complex.

We use simple hash tables instead; creating a good hash function is, of course, a topic unto itself. Such function has to be fast (ie. the hash value can be calculated very quickly) and must ensure minimum of collisions. For the C language keywords, we use a program that creates a custom 'perfect' hash function (by trial and error) to ensure a unique hash for each keyword.

In the past, we have successfully used hash tables to replace linked lists for some C compiler internal data structures where linear searches proved unsatisfactory. Experience shows that a well chosen data structure can make all the difference between unacceptably slow and blazingly fast performance.

Performance Verification

If any attempt to improve performance is undertaken, it is crucial to measure the performance and evaluate the impact of any changes. Intuition is often misleading and a change that sounds good on paper may have no real benefit, or can be even detrimental. Before any performance oriented change is made, it must be known to be an improvement over existing code. If the improvement is negligible, added complexity may not be worthwhile. If new code isn't any faster than old code, then it obviously can't be used at all.

In the case of a compiler, measuring performance is easy - the programmer simply needs to measure how long it takes to run a command. On today's machines, the testcases need to be relatively complex (anything including the full Win32 SDK headers is a good candidate); a simple "hello world" program will be compiled so quickly that the margin of error will be far too large. Besides, it makes little difference if a source module is compiled in 0.01 or 0.02 second. However, a difference between one and ten seconds is quite noticeable.

Maintaining Performance

In any complex application, it is important to continuously observe and measure performance. Changes to existing code may inadvertently cause performance problems. If such problems are identified quickly, it is often simple to slightly redesign the new code so that any negative performance impact is negligible.

Also, if a conscious effort is undertaken to improve performance, it is possible to release new versions of software that are faster than older versions despite additional features and complexity. Users love that!

Known Performance Deficiencies

Here's a brief list of known performance problems in the Open Watcom tools. These may or may not be addressed in the future.

  • There are currently no known performance problems with the C compiler. While it could probably be sped up slightly, any dramatic improvements are highly unlikely. The Open Watcom C compiler is noticeably faster than other compilers with comparable feature set.
  • The C++ compiler performs poorly when creating precompiled headers. This is likely caused by the custom memory manager that is used; for some reason, it is behaving badly in this scenario.
  • The C++ compiler is extremely slow when compiling complex C or C++ source modules. No attempt has been made to analyze the problem. Note that for simpler modules, the C++ compiler performs very well, in some cases faster than the already very fast C compiler.
  • The debugger takes relatively long to load DWARF debugging information for large executables. It is currently unknown why, or if the process can be sped up significantly.
  • The make utility can take a few seconds to process dependencies for complex projects on some filesystems. It is currently unknown whether this can be improved; the performance may be limited purely by the OS/filesystem.
Personal tools