* The organization of the BigNum Library

As mentioned in bn.doc, the library should compile on anything with an
ANSI C compiler and 16 and 32-bit data types. (Non-power-of-2 word
lengths probably wouldn't be *too* hard, but the matter is likely to
remain academic.) However, assembly subroutines can be added in a
great variety of ways to speed up computations.

It's even possible to vary the word length dynamically at run time.
Currently, 80x86 and 680x0 assembly primitives have been written in 16
and 32-bit forms, as not all members of these families support 32x32->64
bit multiply. In future, 32/64 bit routines may be nice for the MIPS
and PowerPC processors. (The SPARC has a 64-bit extension, but it still
only produces a maximum 64-bit multiply result. The MIPS, PowerPC and
Alpha give access to 128 bits of product.)

The way that this works is that the file bn.c declares a big pile of
function pointers, and the first bnInit() call figures out which set
of functions to point these to. The functions are named so that
it is possible to link several sets into the same executable without
collisions.

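The dispatch scheme can be sketched as follows. This is an illustrative
miniature, not the library's actual code; the names demoAdd, demoAdd_16,
demoAdd_32 and demoInit are invented for the example.

```c
#include <stddef.h>

/* Two word-size-specific implementations, named so both can be
 * linked into the same executable without collision. */
static unsigned demoAdd_16(unsigned a, unsigned b) { return (a + b) & 0xFFFFu; }
static unsigned demoAdd_32(unsigned a, unsigned b) { return a + b; }

/* The public entry point is a function pointer, filled in once. */
unsigned (*demoAdd)(unsigned, unsigned) = NULL;

/* First call picks the function set; later calls are no-ops. */
void demoInit(int use32)
{
    if (demoAdd)
        return;
    demoAdd = use32 ? demoAdd_32 : demoAdd_16;
}
```

Callers only ever see the pointer, which is what allows the word size to
be chosen at run time.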
The library can store numbers in big-endian or little-endian word order,
although the order of bytes within a word is always the platform native
order. As long as you're using the pure C version, you can compile
independent of the native byte ordering, but the flexibility is available
in case assembly primitives are easier to write one way or the other.
(In the absence of other considerations, little-endian is somewhat more
efficient, and is the default. This is controlled by BN_XXX_ENDIAN.)

In fact, it would be possible to change the word order at run time,
except that there is no naming convention to support linking in
functions that differ only in endianness. (Which is because the
point of doing so is unclear.)

The core of the library is in the files lbn??.c and bn??.c, where "??"
is 16, 32, or 64. The 32 and 64-bit files are generated from the 16-bit
version by a simple textual substitution. The 16-bit files are generally
considered the master source, and the others generated from it with sed.

Usually, only one set of these files is used on any given platform,
but if you want multiple word sizes, you include one for each supported
word size. The files bninit??.c define a bnInit function for a given
word size, which calls bnInit_??() internally. Only one of these may
be included at a time, and multiple word sizes are handled by a more
complex bnInit function such as the ones in bn8086.c and bn68000.c,
which determine the word size of the processor they're running on and
call the appropriate bnInit_??() function.

The file lbn.h uses <limits.h> to find the platform's available data
types. The types are defined both as macros (BNWORD32) and as typedefs
(bnword32) which aren't used anywhere but can come in very handy when
using a debugger (which doesn't know about macros). Any of these may
be overridden either on the compiler command line (cc -DBN_BIG_ENDIAN
-DBNWORD32="unsigned long"), or from an extra include file BNINCLUDE
defined on the command line (cc -DBNINCLUDE=lbnmagic.h). This is the
preferred way to specify assembly primitives.

So, for example, to build a 68020 version of the library, compile the
32-bit library with -DBNINCLUDE=lbn68020.h, and compile and link in
lbn68020.c (which is actually an assembly source file, if you look).

Both 16- and 32-bit 80x86 code is included in lbn8086.h and .asm. That
code uses 16-bit large-model addressing. lbn80386.h and .asm use 32-bit
flat-model addressing.

Three particularly heavily used macros defined by lbn.h are BIG(x),
LITTLE(y) and BIGLITTLE(x,y). These expand to x (or nothing) on
a big-endian system, and y (or nothing) on a little-endian system.
These are used to conditionalize the rest of the code without taking
up entire lines to say "#ifdef BN_BIG_ENDIAN", "#else" and "#endif".

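A minimal sketch of how such macros can be defined and used (the real
lbn.h is more elaborate; next_index is an invented example function):

```c
/* Select one of two expansions by word-order configuration.
 * Little-endian is the default, as in the library. */
#if defined(BN_BIG_ENDIAN)
#define BIG(x)          x
#define LITTLE(y)
#define BIGLITTLE(x,y)  x
#else
#define BIG(x)
#define LITTLE(y)       y
#define BIGLITTLE(x,y)  y
#endif

/* Example use: step an index toward the more-significant word
 * without spending three lines on #ifdef/#else/#endif. */
int next_index(int i)
{
    return BIGLITTLE(i - 1, i + 1);
}
```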
* The lbn??.c files

The lbn?? file contains the low-level bignum functions. These universally
expect their numbers to be passed to them in (buffer, length) form and
do not attempt to extend the buffers. (In some cases, they do allocate
temporary buffers.) The buffer pointer points to the least-significant
end of the buffer. If the machine uses big-endian word ordering, that
is a pointer to the end of the buffer. This is motivated by considering
pointers to point to the boundaries between words (or bytes). If you
consider a pointer to point to a word rather than between words, the
pointer in the big-endian case points to the first word past the end of the
buffer.

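The pointer convention can be illustrated with a hypothetical accessor
(WORD_AT and get_word are invented names, not library functions):

```c
#include <stdint.h>

/* "buf" always points to the least-significant end of the number.
 * In big-endian word order that is one word PAST the end of the
 * array, so word i is found by counting backward. */
#ifdef BN_BIG_ENDIAN
#define WORD_AT(buf, i) ((buf)[-(long)(i) - 1])
#else
#define WORD_AT(buf, i) ((buf)[i])
#endif

/* Fetch word i (0 = least significant) of an n-word number. */
uint32_t get_word(const uint32_t *lsw, unsigned i)
{
    return WORD_AT(lsw, i);
}
```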
All of the primitives have names of the form lbnAddN_16, where the
_16 is the word size. All are surrounded by "#ifndef lbnAddN_16".
If you #define lbnAddN_16 previously (either on the command line or
in the BNINCLUDE file), the C code will neither define *nor declare* the
corresponding function. The declaration must be suppressed in case you
declare it in a magic way with special calling attributes or define it as
a macro.

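The guard pattern looks like this. The function body below is only an
illustration of the shape (a plain little-endian multi-word add with
carry); the library's real lbnAddN_16 also deals with word-order
configuration.

```c
#include <stdint.h>
#include <stddef.h>

#ifndef lbnAddN_16
/* num1 += num2, both len words, little-endian word order here.
 * Returns the carry out of the most significant word. */
uint16_t lbnAddN_16(uint16_t *num1, const uint16_t *num2, size_t len)
{
    uint32_t carry = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        carry += (uint32_t)num1[i] + num2[i];
        num1[i] = (uint16_t)carry;
        carry >>= 16;
    }
    return (uint16_t)carry;
}
#endif /* !lbnAddN_16 */
```

If the builder #defines lbnAddN_16 beforehand, both the definition and
the declaration vanish, leaving room for an assembly version or a macro.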
If you wish to write an assembly primitive, lbnMulAdd1_??, which
multiplies N words by 1 word and adds the result to N words, returning
the carry word, is by FAR the most important function - almost all of
the time spent performing a modular exponentiation is spent in this
function. lbnMulSub1_??, which does the same but subtracts the product
and returns a word of borrow, is used heavily in the division routine
and thus by GCD and modular inverse computation.

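What lbnMulAdd1 computes can be sketched in its 16-bit form, where a
plain uint32_t serves as the double-word type. This is a reference
sketch under an invented name (demoMulAdd1_16), not the library's
routine, which also handles word-order configuration.

```c
#include <stdint.h>
#include <stddef.h>

/* out += num * k, where num and out are len words long.
 * Returns the carry word.  The 32-bit accumulator cannot overflow:
 * 0xFFFF*0xFFFF + 0xFFFF + 0xFFFF = 0xFFFFFFFF exactly. */
uint16_t demoMulAdd1_16(uint16_t *out, const uint16_t *num,
                        size_t len, uint16_t k)
{
    uint32_t carry = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        carry += (uint32_t)num[i] * k + out[i];
        out[i] = (uint16_t)carry;
        carry >>= 16;
    }
    return (uint16_t)carry;
}
```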
These two functions are the only functions which *require* some sort
of double-word data type, so if you define them in assembly language,
the ?? may be the widest word your C compiler supports; otherwise, you
must limit your implementation to half of the maximum word size. Other
functions will, however, use a double-word data type if available.

Actually, there are some even simpler primitives which you can provide
to allow double-width multiplication: mul??_ppmm, mul??_ppmma and
mul??_ppmmaa. These are expected to be defined as macros (all arguments
are always side-effect-free lvalues), and must return two words of result
of the computation m1*m2 + a1 + a2. It is best to define all three,
although any that are not defined will be generated from the others in
the obvious way. GCC's inline assembler can be used to define these.
(The names are borrowed from the GNU MP package.)

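As one way such macros might look for 32-bit words on a compiler with a
64-bit type (exact definitions are up to each port; a real port might
use inline assembly instead):

```c
#include <stdint.h>

/* (ph:pl) = m1 * m2 */
#define mul32_ppmm(ph, pl, m1, m2) do {                        \
        uint64_t _t = (uint64_t)(m1) * (m2);                   \
        (ph) = (uint32_t)(_t >> 32);                           \
        (pl) = (uint32_t)_t;                                   \
    } while (0)

/* (ph:pl) = m1 * m2 + a1 */
#define mul32_ppmma(ph, pl, m1, m2, a1) do {                   \
        uint64_t _t = (uint64_t)(m1) * (m2) + (a1);            \
        (ph) = (uint32_t)(_t >> 32);                           \
        (pl) = (uint32_t)_t;                                   \
    } while (0)

/* (ph:pl) = m1 * m2 + a1 + a2; the sum cannot overflow two words,
 * since (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
#define mul32_ppmmaa(ph, pl, m1, m2, a1, a2) do {              \
        uint64_t _t = (uint64_t)(m1) * (m2) + (a1) + (a2);     \
        (ph) = (uint32_t)(_t >> 32);                           \
        (pl) = (uint32_t)_t;                                   \
    } while (0)
```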
There is also lbnMulN1_??, which stores the result rather than adding or
subtracting it, but it is less critical. If it is not provided, but
lbnMulAdd1_?? is, it will be implemented in terms of lbnMulAdd1_?? in the
obvious way.

lbnDiv21_??, which divides two words by one word and returns a quotient
and remainder, is greatly sped up by a double-word data type, macro
definition, or assembly implementation, but has a version which will run
without one. If your platform has a double/single divide with remainder,
it would help to define this, and it's quite simple.

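In 16-bit form, where the native 32-bit divide does the double-by-single
work directly, the operation amounts to the following sketch (invented
name demoDiv21_16; the callers are assumed to guarantee nh < d, so the
quotient fits in one word):

```c
#include <stdint.h>

/* Divide the two-word value (nh:nl) by d.  Stores the one-word
 * quotient through q and returns the remainder.  Requires nh < d. */
uint16_t demoDiv21_16(uint16_t *q, uint16_t nh, uint16_t nl, uint16_t d)
{
    uint32_t n = ((uint32_t)nh << 16) | nl;
    *q = (uint16_t)(n / d);
    return (uint16_t)(n % d);
}
```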
lbnModQ_?? (which returns a multi-precision number reduced modulo a
"quick" (< 65536) modulus) is used heavily by prime generation for trial
division, but is otherwise little used.

Other primitives may be implemented depending on the expected usage mix.
It is generally not worth implementing lbnAddN_?? and lbnSubN_?? unless
you want to start learning to write assembly primitives on something
simple; they just aren't used very much. (Of course, if you do, you'll
probably get some improvements, in both speed and object code size, so
it's worth keeping them in, once written.)

* The bn??.c files

While the lbn??.c files deal in words, the bn??.c files provide the
public interface to the library and deal in bignum structures. These
contain a buffer pointer, an allocated length, and a used length.
The lengths are specified in words, but as long as the user doesn't go
prying into such innards, all of the different word-size libraries
provide the same interface; they may be exchanged at link time, or even
at run time.

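The three-field shape described above might look like this sketch (field
and structure names are illustrative, not the library's declarations;
user code treats the structure as opaque, which is what lets word sizes
be swapped at link or run time):

```c
#include <stddef.h>

struct demoBigNum {
    void  *ptr;        /* least-significant end of the word buffer */
    size_t size;       /* words currently holding the value (used) */
    size_t allocated;  /* words available in the buffer */
};
```

The buffer pointer is a void * precisely because its element width
depends on which word-size library was linked in.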
The bn.c file defines a large collection of function pointers and one
function, bnInit. bnInit is responsible for setting the function pointers
to point to the appropriate bn??.c functions. Each bn??.c file
provides a bnInit_?? function which sets itself up; it is the job
of bnInit to figure out which word size to use and call the appropriate
bnInit_?? function.

If only one word size is in use, you may link in the file bninit??.c,
which provides a trivial bnInit function. If multiple word sizes are
in use, you must provide the appropriate bnInit function. See
bn8086.c as an example.

For maximum portability, you may just compile and link in the files
lbn00.c, bn00.c and bninit00.c, which determine, using the preprocessor
at compile time, the best word size to use. (The logic is actually
located in the file bnsize00.h, so that the three .c files cannot get out
of sync.)

The bignum buffers are allocated using the memory management routines in
lbnmem.c. These are word-size independent; they expect byte counts and
expect the system malloc() to return suitably aligned buffers. The
main reason for this wrapper layer is to support any customized allocators
that the user might want to provide.

* Other bn*.c files

bnprint.c is a simple routine for printing a bignum in hex. It is
provided in a separate file so that its calls to stdio can be eliminated
from the link process if the capability is not needed.

bntest??.c is a very useful regression test if you're implementing
assembly primitives. If it doesn't complain, you've probably
got it right. It also does timing tests so you can see the effects
of any changes.

* Other files

sieve.c contains some primitives which use the bignum library to perform
sieving (trial division) of ranges of numbers looking for candidate primes.
This involves two steps: using a sieve of Eratosthenes to generate the
primes up to 65536, and using that to do trial division on a range of
numbers following a larger input number. Note that this is designed
for large numbers, greater than 65536, since there is no check to see
if the input is one of the small primes; if it is divisible, it is assumed
composite.

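The first of the two steps is the classic sieve of Eratosthenes; a
standalone miniature over a small bound looks like this (the library's
own sieve runs to 65536 and then reuses the table for trial division):

```c
#include <stdbool.h>
#include <string.h>

#define LIMIT 100

/* After demo_sieve(), composite[i] is true iff i >= 2 is composite. */
static bool composite[LIMIT + 1];

void demo_sieve(void)
{
    int i, j;
    memset(composite, 0, sizeof composite);
    for (i = 2; i * i <= LIMIT; i++)
        if (!composite[i])                 /* i is prime */
            for (j = i * i; j <= LIMIT; j += i)
                composite[j] = true;       /* strike its multiples */
}
```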
prime.c uses sieve.c to generate primes: it first eliminates
numbers with trivial divisors, then does strong pseudoprimality tests
with some small bases. (Actually, the first test, to the base 2, is
optimized a bit to be faster when it fails, which is the common case,
but 1/8 of the time it's not a strong pseudoprimality test, so an extra,
strong, test is done in that case.)

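For reference, a strong pseudoprimality test to one base looks like the
sketch below, on machine-word integers rather than bignums (prime.c does
the same n-1 = d*2^s decomposition with modular exponentiation; the
128-bit type used here is a GCC/Clang extension):

```c
#include <stdint.h>
#include <stdbool.h>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (unsigned __int128)a * b % m;
}

static uint64_t powmod(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1;
    for (b %= m; e; e >>= 1, b = mulmod(b, b, m))
        if (e & 1)
            r = mulmod(r, b, m);
    return r;
}

/* true if odd n > 2 is a strong probable prime to the given base */
bool strong_prp(uint64_t n, uint64_t base)
{
    uint64_t d = n - 1, x;
    int s = 0;
    while (!(d & 1)) { d >>= 1; s++; }   /* n - 1 = d * 2^s, d odd */
    x = powmod(base, d, n);
    if (x == 1 || x == n - 1)
        return true;
    while (--s > 0) {                     /* up to s-1 squarings */
        x = mulmod(x, x, n);
        if (x == n - 1)
            return true;
    }
    return false;
}
```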
It prints progress indicators as it searches. The algorithm
searches a range of numbers starting at a given prime, but it does
so in a "shuffled" order, inspired by algorithm M from Knuth. (The
random number generator to use for this is passed in; if no function
is given, the numbers are searched in sequential order and the
return value will be the next prime >= the input value.)

germain.c operates similarly, but generates Sophie Germain primes;
that is, primes p such that (p-1)/2 is also prime. It lacks the
shuffling feature - searching is always sequential.

jacobi.c computes the Jacobi symbol between a small integer and a BigNum.
It's currently only ever used in germain.c.

* Sources

Obviously, a key source of information was Knuth, Volume 2,
particularly on division algorithms.

The greatest inspiration, however, was Arjen Lenstra's LIP
(Large Integer Package), distributed with the RSA-129 effort.
While very difficult to read (there is no internal documentation on
sometimes very subtle algorithms), it showed me many useful tricks,
notably the windowed exponentiation algorithm that saves so many
multiplies. If you need a more general-purpose large-integer package,
with only a minor speed penalty, the LIP package is almost certainly
the best available. It implements a great range of efficient
algorithms.

The second most important source was Torbjorn Granlund's gmp
(GNU multi-precision) library. A number of C coding tricks were
adapted from there. I'd like to thank Torbjorn for some useful
discussions and letting me see his development work on GMP 2.0.

Antoon Bosselaers, Rene' Govaerts and Joos Vandewalle, in their CRYPTO
'93 paper, "Comparison of three modular reduction functions", brought
Montgomery reduction to my attention, for which I am grateful.

Burt Kaliski's article in the September 1993 Dr. Dobb's Journal,
"The Z80180 and Big-number Arithmetic", pointed out the advantages (and
terminology) of product scanning to me, although the limited
experiments I've done have shown no improvement from trying it in C.

Hans Riesel's book, "Prime Numbers and Computer Methods for Factorization",
was of great help in designing the prime testing, although some of
the code in the book, notably the Jacobi function in Appendix 3,
is an impressive example of why GOTO should be considered harmful.
Papers by R. G. E. Pinch and others in Mathematics of Computation were
also very useful.

Keith Geddes, Stephen Czapor and George Labahn's book "Algorithms
for Computer Algebra", although it's mostly about polynomials,
has some useful multi-precision math examples.

Philip Zimmermann's mpi (multi-precision integer) library suggested
storing the numbers in native byte order to facilitate assembly
subroutines, although the core modular multiplication algorithms are
so confusing that I still don't understand them. His boasting about
the speed of his library (albeit in 1986, before any of the above were
available for study) also inspired me to particular effort to soundly
beat it. It also provoked a strong reaction from me against fixed
buffer sizes, and complaints about its implementation from Paul Leyland
(interface) and Robert Silverman (prime searching) contributed usefully
to the design of this current library.

I'd like to credit all of the above, plus the Berkeley MP package, with
giving me difficulty finding a short, unique distinguishing prefix for
my library's functions. (I have just, sigh, discovered that Eric Young
is using the same prefix for *his* library, although with the
bn_function_name convention as opposed to the bnFunctionName one.)

I'd like to thank the original implementor of Unix "dc" and "factor"
for providing useful tools for verifying the correct operation of
my library.

* Future

- Obviously, assembly-language subroutines for more platforms would
  always be nice.
- There's a special case in the division for a two-word denominator
  which should be completed.
- When the quotient of a division is big enough, compute an inverse of
  the high word of the denominator and use multiplication by that
  to do the divide.
- A more efficient GCD algorithm would be nice to have.
- More efficient modular inversion is possible. Do it.
- Extend modular inversion to deal with non-relatively-prime
  inputs. Produce y = inv(x,m) with y * x == gcd(x,m) mod m.
- Try some product scanning in assembly.
- Karatsuba's multiplication and squaring speedups would be nice.
- I *don't* think that FFT-based algorithms are worth implementing yet,
  but it's worth a little bit of study to make sure.
- More general support for numbers in Montgomery form, so they can
  be used by more than the bowels of lbnExpMod.
- Provide an lbnExpMod optimized for small arguments > 2, using
  conventional (or even Barrett) reduction of the multiplies, and
  Montgomery reduction of the squarings.
- Adding a Lucas-based prime test would be a real coup, although it's
  hard to give rational reasons why it's necessary. I have a number of
  ideas on this already. Find out if norm-1 (which is faster to
  compute) suffices.
- Split up the source code more to support linking with smaller subsets
  of the library.