Journal:   Dr. Dobb's Journal  March 1991 v16 n3 p129(6)
-----------------------------------------------------------------------------
Title:     Fast convex polygons. (Graphics programming) (tutorial)
Author:    Abrash, Michael.
AttFile:    Program:  GP-MAR91.ASC  Source code listing.

Summary:   Assembly language programming can improve graphics performance and
           help programmers understand the function of their programs.  Fast
           convex polygon filling can be performed by optimizing the existing
           polygon filling implementation.  The drawing portion of the
           program is easily accelerated by using the 'memset' library
           function, which uses the REP STOS instruction.  The code
           replacement yields a speed an order of magnitude faster than the
           old implementation.  The polygon edge tracing procedures are slow
           because they use floating-point calculations, but integer
           calculations are both faster and more accurate.  Modification of
           the drawing and edge tracing portions of the program increase
           performance almost 20 times.  Assembly language can be used in
           place of C to improve speed even further.
-----------------------------------------------------------------------------
Descriptors..
Topic:     Graphics Software
           Programming Instruction
           Painting
           Tutorial
           Type-In Programs
           Solids Modeling.
Feature:   illustration
           table
           program.
Caption:   Execution times of various versions of the convex polygon drawing
           code. (table)
           Drawing polygons. (program)

-----------------------------------------------------------------------------
Full Text:

Fast Convex Polygons

Last month, we explored the surprisingly intricate process of filling convex
polygons.  This month we're going to fill them an order of magnitude or so
faster.

Two thoughts may occur to some of you at this point: "Oh, no, he's not going
to get into assembly language and device dependent code, is he?" and, "Why
bother with polygon filling--or, indeed, any drawing primitives--anyway?
Isn't that what GUIs and third-party libraries are for?"

To which I answer, "Well, yes, I am," and, "If you have to ask, you've missed
the magic of microcomputer programming."  Actually, both questions ask the
same thing, and that is: "Why should I, as a programmer, have any idea how my
program actually works?"

Put that way, it sounds a little different, doesn't it?

GUIs, reusable code, portable code written entirely in high-level languages,
and object-oriented programming are all the rage now, and promise to remain
so for the foreseeable future.  The thrust of this technology is to enhance
the software development process by offloading as much responsibility as
possible to other programmers, and by writing all remaining code in modular,
generic form.  This modular code then becomes a black box to be reused
endlessly without another thought about what actually lies inside.  GUIs also
reduce development time by making many interface choices for you.  That, in
turn, makes it possible to create quickly and reliably programs that will be
easy for new users to pick up, so software becomes easier to both produce and
learn.  This is, without question, a Good Thing.

The "black box" approach does not, however, necessarily cause the software
itself to become faster, smaller, or more innovative; quite the opposite, I
suspect.  I'll reserve judgement on whether that is a good thing or not, but
I'll make a prediction: In the short run, the aforementioned techniques will
lead to noticeably larger, slower programs, as programmers understand less
and less of what the key parts of their programs do and rely increasingly on
general-purpose code written by other people.  (In the long run, programs
will be bigger and slower yet, but computers will be so fast and will have so
much memory that no one will care.)  Over time, PC programs will also come to
be more similar to one another--and to programs running on other platforms,
such as the Mac--as regards both user interface and performance.

Again, I am not saying that this is bad.  It does, however, have major
implications for the future nature of PC graphics programming, in ways that
will directly affect the means by which many of you earn your livings.  Not
so very long from now, graphics programming--all programming, for that
matter--will become mostly a matter of assembling in various ways components
written by other people, and will cease to be the all-inclusively creative,
mind-bendingly complex pursuit it is today.  (Using legally certified black
boxes is, by the way, one direction in which the patent lawyers are leading
us; legal considerations may be the final nail in the coffin of homegrown
code.)  For now, a PC programmer, to understand and even control every single
thing that happens on a computer if you so desire, to realize any vision you
may have.  Take advantage of this unique window of opportunity to create some
magic!

Neither does it hurt to understand what's involved in drawing, say, a filled
polygon, even if you are using a GUI.  You will better understand the
performance implications of the available GUI functions, and you will be able
to fill in any gaps in the functions provided.  You may even find that you
can out-perform the GUI on occasion by doing your own drawing into a system
memory bitmap, then copying the result to the screen.  You will also be able
to understand why various quirks exist, and will be able to put them to good
use.  For example, X Window follows the polygon drawing rules described last
month (although it's not obvious from the documentation); if you understood
last month's discussion, you're in good shape to use polygons under X.

In short, even though it runs counter to current trends, it helps to
understand how things work, especially when they're very visible parts of the
software you develop.  That said, let's learn more about filling convex
polygons.

Fast Convex Polygon Filling

When last we left the topic of filling convex polygons, the implementation we
had met all of our functional requirements.  In particular, it met stringent
rules that guaranteed that polygons would never overlap at shared edges, an
important consideration when building polygon-based images.  Unfortunately,
the implementation was also slow as molasses.  This month we'll work up
polygon filling code that's fast enough to be truly usable.

Last month's polygon filling code involved three major tasks, each performed
by a separate function: Tracing each polygon edge to generate a coordinate
list (performed by the function ScanEdge); drawing the scanned-out horizontal
lines that constitute the filled polygon (DrawHorizontalLineList); and
characterizing the polygon and coordinating the tracing and drawing
(FillConvexPolygon).  The amount of time that the sample program from last
month spent in each of these areas is shown in Table 1.  As you can see, half
the time was spent drawing and the other half was spent tracing the polygon
edges (the time spent in FillConvexPolygon was relatively minuscule), so we
have our choice of where to begin optimizing.

Fast Drawing

Let's start with drawing, which is easily sped up.  The original code used a
double-nested loop that called a draw-pixel function to plot each pixel in
the polygon individually.  That's a ridiculous approach in a graphics mode
that offers linearly mapped memory, as does VGA mode 13h, the mode with which
we're working.  At the very least, we could point a far pointer to the left
edge of each polygon scan line, then draw each pixel in that scan line in
quick succession, using something along the lines of *ScrPtr++ = FillColor;
inside a loop.

However, it seems silly to use a loop when the 80x86 has an instruction, REP
STOS, that's uniquely suited to filling linear memory buffers.  There's no
way to use REP STOS directly in C code, but it's a good bet that the memset
library function uses REP STOS, so you could greatly enhance performance by
using memset to draw each scan line of the polygon in a single shot.  That,
however, is easier said than done.  The memset function linked in from the
library is tied to the memory model in use; in small (tiny, small, or medium)
data models memset accepts only near pointers, so it can't be used to access
screen memory.  Consequently, a large (compact, large, or huge) data model
must be used to allow memset to draw to display memory--a clear case of the
tail wagging the dog.  This is an excellent example of why, although it is
possible to use C to do virtually anything, it's sometimes much simpler just
to use a little assembly code and be done with it.

At any rate, Listing One (page 146) shows a version of DrawHorizontalLineList
that uses memset to draw each scan line of the polygon in a single call.
When linked to last month's test program, Listing One increases pure drawing
speed (disregarding edge tracing and other nondrawing time) by more than an
order of magnitude over last month's draw-pixel-based code.  This despite the
fact that Listing One requires a large (in this case, compact) data model.
Listing One works fine with Turbo C++, but may not work with other compilers,
for it relies on the aforementioned interaction between memset and the
selected memory model.

At this point, I'd like to mention that benchmarks are notoriously
unreliable; the results in Table 1 are accurate only for the test program,
and then only when running on a particular system.  Results could be vastly
different if smaller, larger, or more complex polygons were drawn, or if a
faster or slower computer/VGA combination were used.  These factors
notwithstanding, the test program does fill a variety of polygons of varying
complexity sized from large to small and in between, and certainly the order
of magnitude difference between Listing One and the old version of
DrawHorizontalLineList is a clear indication of which code is superior.

Anyway, Listing One has the desired effect of vastly improving drawing time.
There are cycles yet to be had in the drawing code, but as tracing polygon
edges now takes 92 percent of the polygon filling time, it's logical to
optimize the tracing code next.

Fast Edge Tracing

There's no secret as to why last month's ScanEdge was so slow: It used
floating point calculations.  One secret of fast graphics is using integer or
fixed-point calculations, instead.  (Sure, the floating point code would run
faster if a math coprocessor were installed, but it would still be slower
than the alternatives; besides, why require a math coprocessor when you don't
have to?)  Both integer and fixed-point calculations are fast.  In many
cases, fixed-point is faster, but integer calculations have one tremendous
virtue: They're completely accurate.  The tiny imprecision inherent in either
fixed- or floating-point calculations can result in occasional pixels being
one off from their proper location.  This is no great tragedy, but after
going to so much trouble to ensure that polygons don't overlap at common
edges, why not get it exactly right?

In fact, when I tested out the integer edge tracing code by comparing an
integer-based test image to one produced by floating point calculations, two
pixels out of the whole screen differed, leading me to suspect a bug in the
integer code.  It turned out, however, that in those two cases, the floating
point results were sufficiently imprecise to creep from just under an integer
value to just over it, so that the ceil function returned a coordinate that
was one too large.  Floating point is very accurate--but it is not precise.
Integer calculations, properly performed, are.

Listing Two (page 146) shows a C implementation of integer edge tracing.
Vertical and diagonal lines, which are trivial to trace, are special-cased.
Other lines are broken into two categories: Y-major (closer to vertical) and
X-major (closer to horizontal).  The handlers for the Y-major and X-major
cases operate on the principle of similar triangles: The number of X pixels
advanced per scan line is the same as the ratio of the X delta of the edge to
the Y delta.  Listing Two is more complex than the original floating point
implementation, but not painfully so.  In return for that complexity, Listing
Two is more than 80 times faster--and, as just mentioned, it's actually more
accurate than the floating point code.

You gotta love that integer arithmetic.

The Finishing Touch: Assembly Language

The C implementation is now nearly 20 times as fast as the original, good
enough for most purposes.  Still, it requires that a large data model be used
(for memset), and it's certainly not the fastest possible code.  The obvious
next step is assembly language.

Listing Three (page 146) is an assembly language version of
DrawHorizontalLineList.  In actual use, it proved to be about 36 percent
faster than Listing One; better than a poke in the eye, but just barely.
There's more to these timing results than meets that eye, though.  Display
memory generally responds much more slowly than system memory, especially in
386 and 486 systems.  That means that much of the time taken by Listing Three
is actually spent waiting for display memory accesses to complete, with the
processor forced to idle by wait states.  If, instead, Listing Three drew to
a local buffer in system memory or to a particularly fast VGA, the assembly
implementation might well display a far more substantial advantage over the C
code.

And indeed it does.  When the test program is modified to draw to a local
buffer, both the C and assembly language versions get 0.29 seconds faster,
that being a measure of the time taken by display memory wait states.  With
those wait states factored out, the assembly language version of
DrawHorizontalLineList becomes almost three times as fast as the C code.

There is a lesson here.  An optimization has no fixed payoff; its value
fluctuates according to the context in which it is used.  There's relatively
little benefit to further optimizing code that already spends half its time
waiting for display memory; no matter how good your optimizations, you'll get
only a two-times speedup at best, and generally much less than that.  There
is, on the other hand, potential for tremendous improvement when drawing to
system memory, so if that's where most of your drawing will occur,
optimizations such as Listing Three are well worth the effort.

Know the environments in which your code will run, and know where the cycles
go in those environments.

Maximizing REP STOS

Listing Three doesn't take the easy way out and use REP STOSB to fill each
scan line; instead, it uses REP STOSW to fill as many pixel pairs as possible
via word-sized accesses, using STOSB only to do odd bytes.  Word accesses to
odd addresses are always split by the processor into 2-byte accesses.  Such
word accesses take twice as long as word accesses to even addresses, so
Listing Three makes sure that all word accesses occur at even addresses, by
performing a leading STOSB first if necessary.

Listing Three is another case in which it's worth knowing the environment in
which your code will run.  Extra code is required to perform aligned
word-at-a-time filling, resulting in extra overhead.  For very small or
narrow polygons, that overhead might overwhelm the advantage of drawing a
word at a time, making plain old REP STOSB faster.

Faster Edge Tracing

Finally, Listing Four (page 147) is an assembly language version of ScanEdge.
Listing Four is a relatively straightforward translation from C to assembly,
but is nonetheless almost twice as fast as Listing Two.

The version of ScanEdge in Listing Four could certainly be sped up still
further by unrolling the loops.  FillConvexPolygon, the overall coordination
routine, hasn't even been converted to assembly language, so that could be
sped up as well.  I haven't bothered with these optimizations because all
code other than DrawHorizontalLineList takes only 14 percent of the overall
polygon filling time when drawing to display memory; the potential return on
optimizing nondrawing code simply isn't great enough to justify the effort.
Part of the value of a profiler is being able to tell when to stop
optimizing; with Listings Three and Four in use, more than two-thirds of the
time taken to draw polygons is spent waiting for display memory, so
optimization is pretty much maxed out.  However, further optimization might
be worthwhile when drawing to system memory, where wait states are out of the
picture and the nondrawing code takes a significant portion (46 percent) of
the overall time.

Again, know where the cycles go.

By the way, note that all the versions of ScanEdge and FillConvexPolygon that
we've looked at are adapter independent, and that the C code is also machine
independent; all adapterspecific code is isolated in DrawHorizontalLineList.
This makes it easy to add support for other graphics formats, such as the
8514/A, the XGA, or, for that matter, a non-PC system.

Books of the Month

Two books share honors this month.  One isn't new, but is well worth having
if you're a PC graphics programmer.  Programmer's Guide to PC & PS/2 Video
Systems, by Richard Wilton (Microsoft Press, 1987) is pretty much the
standard graphics reference for the PC, covering all standards up through the
VGA, including MCGA, EGA, Hercules, CGA, and MDA (but not 8514/A).  For
8514/A programming, Graphics Programming for the 8514/A, by Jake Richter and
Bud Smith (M&T Books, 1990), is good; which is lucky, because it's the only
8514/A book I know of!