|
The International Computer Games
Association (ICGA) regularly organizes
computer chess world championships.
For quite a long time, large mainframe
computers won these championships.
Since 1992, however, only PC programs
have been world chess champions.
They have dominated the world,
increasing their playing strength by
about 30 ELO per year. (ELO is a statistical
measure of 100 points difference
corresponding to a 64% winning
chance. A certain number of ELO
points determines levels of expertise:
Beginner = ~1,000 ELO, International
Master = ~2,400, Grandmaster =
~2,500, World Champion = ~2,830.)
Today, the computer chess community
is highly developed, with special
machine rooms using virtual reality
and closed and open tournament
rooms. Anybody can play against
grandmasters or strong machines
through the Internet.
Hydra
The Hydra Project is internationally driven,
financed by PAL Computer Systems in
Abu Dhabi, United Arab Emirates. The
core team comprises programmer Chrilly
Donninger in Austria; researcher Ulf
Lorenz from the University of Paderborn in
Germany; chess grandmaster Christopher
Lutz in Germany; and project manager
Muhammad Nasir Ali in Abu Dhabi.
FPGAs from XilinxR are provided on PCI
cards from AlphaData in the United
Kingdom. The compute cluster is built by
Megware in Germany, supported by the
Paderborn Center for Parallel Computing.
Taiwan is involved as well.
The goal of the Hydra Project is literally
one of world dominance in the computer
chess world: a final, widely accepted
victory over human players. Indeed, we are
convinced that in 2005, a computer will be
the strongest chess entity, a world first.
Four programs stand out as serious contenders
for the crown:
- Shredder, by Stefan Meyer-Kahlen, the
dominating program over the last
decade
- Fritz, by Frans Morsch, the most wellknown
program
- Junior, by Amir Ban and Shay
Bushinsky, the current Computer
Chess World Champion
- Our program, Hydra, in our opinion
the strongest program at the moment
These four programs scored more than
95% of the points against their opponents
in the 2003 World Championship.
Computational speed, as well as sophisticated
chess knowledge, are the two most
important features of chess programs.
FPGAs play an important role in Hydra by
harnessing the demands on speed and program
sophistication. Additionally, FPGAs
provide these benefits:
- Implementing more knowledge
requires additional space, but nearly
no additional time.
- FPGA code can be debugged and
changed like software without long
ASIC development cycles. This is an important feature of FPGAs, because
the evolution of a chess program never
ends, and the dynamic progress in
computer chess enforces short development
cycles. Therefore, flexibility is at
least as important as speed.
- We can use a lot of fine-grain
parallelism.
Technical Description
The key feature that enables computer
chess programs to play as strong as . or
stronger . than the best human players is
their search algorithm. The programs perform
a forecast: given a certain position,
what can I do, what can my opponent do
next, and what can I do thereafter?
Modern programs use some variant of
the Alphabeta algorithm to examine the
resulting game tree. This algorithm is optimal
in the sense that in most cases it will
examine only O(bt/2) many leaves, instead of
bt many leaves, assuming a game tree depth
of t and a uniform branching of b. With the
help of upper and lower bounds, the algorithm
uses information that it collects during
the search process to keep the remaining
search tree small. This makes it a sequential
procedure that is difficult to parallelize, and
naive approaches waste resources.
Although the Alphabeta algorithm is
efficient, we cannot compute true values
for all positions in games like chess. The
game tree is simply far too large. Therefore,
we use tree search as an approximation procedure.
First, we select a partial tree rooted near the top of the complete game tree for
examination; usually we select it with the
help of a maximum depth parameter. We
then assign heuristic values (such as one
side has a queen more, so that side will
probably win) to the artificial leaves of the
pre-selected partial tree. We propagate
these values up to the root of the tree as if
they were true ones (Figure 1).
The key observation over the last 40
years of computer chess data is that the game
tree acts as an error filter. The larger the tree
that we can examine and the more sophisticated
its shape, the better its error filter
property. Therefore, what we need is speed.
The Hardware Architecture
Hydra uses the ChessBase/Fritz GUI running
on a Windows XP PC. It connects to
the Internet using ssh to our Linux cluster,
which itself comprises eight dual PC server
nodes able to handle two PCI buses simultaneously.
Each PCI bus contains one FPGA
accelerator card. One message passing interface
(MPI) process is mapped onto each of
the processors; one of the FPGAs is associated
with it as well. A Myrinet network interconnects
the server nodes (Figure 2).
The Software Architecture
The software is partitioned into two: the
distributed search algorithm running on
the Pentium nodes of the cluster and the
soft co-processor on the Xilinx FPGAs.
The basic idea behind our parallelization
is to decompose the search tree in
order to search parts of it in parallel and to
balance the load dynamically with the help
of the work-stealing concept.
First, a special processor, P0, gets the
search problem and starts performing the
forecast algorithm as if it would act sequentially.
At the same time, the other processors
send requests for work to other
randomly chosen processors. When Pi (a
processor that is already supplied with
work) catches such a request, it checks
whether or not there are unexplored parts
of its search tree ready for evaluation. These
unexplored parts are all rooted at the right
siblings of the nodes of Pi's search stack. Pi
sends back either a message that it cannot
perform work, or it sends a work packet (a
chess position with bounds) to the requesting
processor Pj. Thus, Pi becomes a master
itself, and Pj starts a sequential search on its
own. The processors can be master and
worker at the same time.
The relationship dynamically changes
during computation. When Pj has finished
its work (possibly with the help of other
processors), it sends an answer message to
Pi. The master/worker relationship between
Pi and Pj is released, and Pj becomes idle. It
again starts sending requests for work into
the network. When processor Pi finds out
that it has sent a wrong ab-window to one
of its workers (Pj), it makes a window message
follow to Pj. Pj stops its search, corrects
the window, and starts its old search from
the beginning. If the message contained a
"cutoff," which indicates superfluous work,
Pj just stops its work. We achieve speed-ups
of 12 on the 16 cluster entities, which
means that Hydra now examines 36 million
nodes per second.
At a certain level of branching, the
remaining problems are small enough that
we can solve them with the help of a
Configware coprocessor, benefiting from the
fine-grain parallelism inside the application.
We have a complete chess program on-chip,
consisting of modules for the search, the
evaluation, generating moves, and executing
or taking back moves. At present, we use 67
block RAMs, 9,879 slices, 5,308 TBUFs,
534 flip-flops, and 18,403 LUTs. An upper
bound for the number of cycles per search
node is nine cycles. We estimate that a pure
software solution would require at least
6,000 Pentium cycles. The longest path consists
of 51 logic levels, and the design runs at
30 MHz on a Virtex-II 1000. We have
just ported the design to a Virtex
XC2VP70-5 so that we can now run the
program with 50 MHz.
In software, a move generator is usually
implemented as a quad-loop: one loop over
all piece types; an inner loop over pieces of
that type; a second inner loop for all directions
where that piece can move; and the
most internal loop for the squares to which
the piece can move under consideration of
the current direction. This is quite a
sequential procedure, especially when we
consider that piece-taking moves should be
promoted to the front of the move list.
In a fine-grain parallel design, however,
we have a fast, small move generator, which
works very differently. In principle, the
move generator consists of two 8 x 8 chess
boards, as shown in Figure 3. The
GenAggressor and GenVictim modules
instantiate 64 square instances each. Both
determine to which neighbor square
incoming signals must be forwarded.
The square instances will send piece
signals (if there is a piece on that square),
respectively, forwarding the signals of far-reaching
pieces to neighbor square
instances. Additionally, each square can
output the signal "victim found." Then we
know that this square is a "victim" (a tosquare
of a legal move). The collection of
all "victim found" signals is input to an
arbiter (a comparator tree) that selects the
most attractive not-yet-examined victim.
The GenAggressor module takes the
arbiter's output as input and sends the signal
of a super-piece (a combination of all
possible pieces). For example, if a rook-move
signal hits a rook of our own, we
will find an "aggressor" (a from-square of
a legal move). Thus, many legal moves are
generated in parallel.
These moves must be sorted and we have
to mask those already examined. The moves
are sorted with the help of another comparator
tree and the winner is determined
within six levels of the tree. Sorting criteria
is based on the value of attacked pieces and
whether or not a move is a killer move.
Figure 4 shows the Finite State Machine
for the recursive Alphabeta algorithm. On
the left side of the figure you can see the
states of the normal search, including the
nullmove heuristic. The right side shows
the states of the quiescence search. The
main difficulty is controlling the timing, as
some of the sub-logic (the evaluation and
the move generator) may need two or three
cycles instead of one. We enter the search at
FS_INIT. If there is anything to do, and if
nullmove is not applicable, we come to the
start of the full search.
After possibly increasing the search
depth (not shown in Figure 4), we enter the
states FS_VICTIM and thereafter
FS_AGGR, which give us the next legal
move as described previously. Reaching the
state FS_DOWN corresponds to a recursive
call of the Alphabeta algorithm with a
1-point window (a,a + 1). If the search
remaining depth is greater than zero, we
look for a move in the state FS_START.
Otherwise we enter the quiescence search,
which starts with the evaluation inspection.
In the quiescence search we only consider
capture- and check-evasion moves.
The search stack (not shown) is realized
by six blocks of dual-port RAM, organized
as 16-bit wide RAMs. Thus, we can write
two 16-bit words into the RAM, or one 32-bit word at one point of time. A depth variable
controlled by the search FSM controls
the data flow. Various tables capture different
local variables of the recursive search.
Conclusion
We are quite optimistic that Hydra already
plays better chess than anybody else.
Nevertheless, we must now show this in a
series of matches.
At the same time, we want to maintain
the distance between Hydra and other
computer players and even increase it.
Therefore, in future versions of Hydra, we
plan to switch to newer generations of
Xilinx FPGAs, increase the number of
processors further, and fine-tune the evaluation
function.
For more information, visit www.hydrachess.com and www.chessbase.com.
Printable PDF version of this article with graphics. (4/18/05) 380 KB |