Travis Cross d2edcad66e Merge Phil Zimmermann's libzrtp as a FreeSWITCH library
Thanks to Phil Zimmermann for the code and for the license exception
we needed to include it.

There remains some build system integration work to be done before
this code will build properly in the FreeSWITCH tree.
2012-03-31 23:42:27 +00:00

1039 lines
28 KiB
NASM

;;; Copyright (c) 1995, Colin Plumb.
;;; For licensing and other legal details, see the file legal.c.
;;;
;;; Assembly primitives for bignum library, 80x86 family.
;;;
;;; Several primitives are included here. Only lbnMulAdd1 is *really*
;;; critical, but once that's written, lnmMul1 and lbnSub1 are quite
;;; easy to write as well, so they are included here as well.
;;; lbnDiv21 and lbnModQ are so easy to write that they're included, too.
;;;
;;; All functions here are for large code, large data.
;;; All use standard "cdecl" calling convention: arguments pushed on the
;;; stack (ss:sp) right to left (the leftmost agrument at the lowest address)
;;; and popped by the caller, return values in ax or dx:ax, and register
;;; usage as follows:
;;;
;;; Callee-save (preserved by callee if needed):
;;; ss, esp, cs, eip, ds, esi, edi, ebp, high byte of FLAGS except DF,
;;; all other registers (CRx, DRx, TRx, IDT, GDT, LDT, TR, etc.).
;;; Caller-save (may be corrupted by callee):
;;; es, eax, ebx, ecx, edx, low byte of flags (SF, ZF, AF, PF, CF)
;;;
;;; The direction flag (DF) is either preserved or cleared.
;;; I'm not sure what the calling convention is for fs and gs. This
;;; code never alters them.
;; Not all of this code has to be '386 code, but STUPID FUCKING MASM (5.0)
;; gives an error if you change in the middle of a segment. Rather than
;; fight the thing, just enable '386 instructions everywhere. (And lose
;; the error checking.)
.386
_TEXT segment para public use16 'CODE' ; 16-byte aligned because '486 cares
assume cs:_TEXT
public _lbnMulN1_16
public _lbnMulAdd1_16
public _lbnMulSub1_16
public _lbnDiv21_16
public _lbnModQ_16
public _lbnMulN1_32
public _lbnMulAdd1_32
public _lbnMulSub1_32
public _lbnDiv21_32
public _lbnModQ_32
public _not386
;; Prototype:
;; BNWORD16
;; lbnMulAdd_16(BNWORD16 *out, BNWORD16 *in, unsigned len, BNWORD16 k)
;;
;; Multiply len words of "in" by k and add to len words of "out";
;; return the len+1st word of carry. All pointers are to the least-
;; significant ends of the appropriate arrays. len is guaraneed > 0.
;;
;; This 16-bit code is optimized for an 8086/80286. It will not be run
;; on 32-bit processors except for debugging during development.
;;
;; NOTE that it may be possible to assume that the direction flag is clear
;; on entry; this would avoid the need for the cld instructions. Hoewever,
;; the Microsoft C libraries require that the direction flag be clear.
;; Thus, lbnModQ_16 clears it before returning.
;;
;; Stack frame:
;; +--------+ bp+18
;; | k |
;; +--------+ bp+16
;; | len |
;; +--------+ bp+14
;; | |
;; +- in -+
;; | |
;; +--------+ bp+10
;; | |
;; +- out -+
;; | |
;; +--------+ bp+6
;; | |
;; +-return-+
;; | |
;; +--------+ bp+2
;; | old bp |
;; +--------+ bp
;;
;; Register usage for lbnMul1_16:
;; ds:[si] in
;; es:[di] out
;; bp k
;; cx loop counter (len/4)
;; dx,ax high,low parts of product
;; bx carry from previous multiply iteration
;;
;; Register usage for lbnMulAdd1_16 and lbnMulSub1_16:
;; ds:[si] in
;; es:[bx+si] out
;; bp k
;; cx loop counter (len/4)
;; dx,ax high,low parts of product
;; di carry from previous multiply iteration
;;
;; The reson for the difference is that straight mul can use stosw, but
;; the multiply and add or multiply and subtract add the result in, so
;; they have to reference es:[di] to add it in.
;;
;; The options are either "add ax,es:[di]; stosw" or "add es:[di],ax;
;; add di,2"; both take 10 cycles on an 80286, 27 on an 8086 and 35 on
;; an 8088 although the former is preferred since it's one byte smaller.
;; However, using [bx+si] is even faster; "add es:[bx+si],ax" takes
;; 7 cycles on an 80286, 25 on an 8086 and 33 on an 8088, as well as
;; being the smallest. (Of course, stosw, at 3 on an 80286, 11 on an
;; 8086 amd 15 on an 8088 wins easily in the straight multiply case over
;; mov es:[bx+si],ax, which takes 3/18/22 cycles and is larger to boot.)
;;
;; Most of these register assignments are driven by the 8086's instruction
;; set. The only really practical variation would be to put the multiplier
;; k into bx or di and use bp for carry, but if someone can make a faster
;; Duff's device using a lookup table, bx and di are useful because indexing
;; off them is more flexible than bp.
;;
;; Overview of code:
;;
;; len is guaranteed to be at least 1, so do the first multiply (with no
;; carry in) unconditionally. Then go to a min loop unrolled 4 times,
;; jumping into the middle using a variant of Duff's device.
;;
;; The loop is constructed using the loop instruction, which does
;; "} while (--cnt)". This means that we have to divide the count
;; by 4, and increment it so it doesn't start at 0. To gain a little
;; bit more efficiency, we actually increment the count by 2, so the
;; minimum possible value is 3, which will be shifted down to produce 0.
;; usually in Duff's device, if the number of iterations is a multiple
;; of the unrolling factor, you branch to just before the loop conditional
;; and let it handle the case of 0. Here, we have a special test for 0
;; at the head of the loop and fall through into the top of the loop
;; if it passes.
;;
;; Basically, with STEP being a multiply step, it's:
;;
;; STEP;
;; count += 2;
;; mod4 = count % 4;
;; count /= 4;
;; switch(mod4) {
;; case 3:
;; if (count) {
;; do {
;; STEP;
;; case 2:
;; STEP;
;; case 1:
;; STEP;
;; case 0:
;; STEP;
;; } while (--count);
;; }
;; }
;;
;; The switch() is actually done by two levels of branch instructions
;; rather than a lookup table.
_lbnMulN1_16 proc far
push bp
mov bp,sp
push ds
push si
push di
cld
les di,[bp+6] ; out
lds si,[bp+10] ; in
mov cx,[bp+14] ; len
mov bp,[bp+16] ; k
;; First multiply step has no carry in
lodsw
mul bp
stosw
;; The switch() for Duff's device starts here
;; Note: this *is* faster than a jump table for an 8086 and '286.
;; 8086: jump table: 44 cycles; this: 27/29/31/41
;; 80286: jump table: 25 cycles; this: 17/17/20/22
shr cx,1
jc SHORT m16_odd
inc cx
shr cx,1
jc SHORT m16_case2
jmp SHORT m16_case0
nop ; To align loop
m16_odd:
inc cx
shr cx,1
jnc SHORT m16_case1
jz SHORT m16_done ; Avoid entire loop in this case
m16_loop:
lodsw
mov bx,dx ; Remember carry for later
mul bp
add ax,bx ; Add carry in from previous word
adc dx,0
stosw
m16_case2:
lodsw
mov bx,dx ; Remember carry for later
mul bp
add ax,bx ; Add carry in from previous word
adc dx,0
stosw
m16_case1:
lodsw
mov bx,dx ; Remember carry for later
mul bp
add ax,bx ; Add carry in from previous word
adc dx,0
stosw
m16_case0:
lodsw
mov bx,dx ; Remember carry for later
mul bp
add ax,bx ; Add carry in from previous word
adc dx,0
stosw
loop m16_loop
m16_done:
mov ax,dx
stosw ; Store last word
pop di
pop si
pop ds
pop bp
ret
_lbnMulN1_16 endp
align 2
_lbnMulAdd1_16 proc far
push bp
mov bp,sp
push ds
push si
push di
cld
les bx,[bp+6] ; out
lds si,[bp+10] ; in
mov cx,[bp+14] ; len
mov bp,[bp+16] ; k
;; First multiply step has no carry in
lodsw
mul bp
add es:[bx],ax ; This time, store in [bx] directly
adc dx,0
sub bx,si ; Prepare to use [bx+si].
;; The switch() for Duff's device starts here
;; Note: this *is* faster than a jump table for an 8086 and '286.
;; 8086: jump table: 44 cycles; this: 27/29/31/41
;; 80286: jump table: 25 cycles; this: 17/17/20/22
shr cx,1
jc SHORT ma16_odd
inc cx
shr cx,1
jc SHORT ma16_case2
jmp SHORT ma16_case0
ma16_odd:
inc cx
shr cx,1
jnc SHORT ma16_case1
jz SHORT ma16_done ; Avoid entire loop in this case
ma16_loop:
lodsw
mov di,dx ; Remember carry for later
mul bp
add ax,di ; Add carry in from previous word
adc dx,0
add es:[bx+si],ax
adc dx,0
ma16_case2:
lodsw
mov di,dx ; Remember carry for later
mul bp
add ax,di ; Add carry in from previous word
adc dx,0
add es:[bx+si],ax
adc dx,0
ma16_case1:
lodsw
mov di,dx ; Remember carry for later
mul bp
add ax,di ; Add carry in from previous word
adc dx,0
add es:[bx+si],ax
adc dx,0
ma16_case0:
lodsw
mov di,dx ; Remember carry for later
mul bp
add ax,di ; Add carry in from previous word
adc dx,0
add es:[bx+si],ax
adc dx,0
loop ma16_loop
ma16_done:
mov ax,dx
pop di
pop si
pop ds
pop bp
ret
_lbnMulAdd1_16 endp
align 2
_lbnMulSub1_16 proc far
push bp
mov bp,sp
push ds
push si
push di
cld
les bx,[bp+6] ; out
lds si,[bp+10] ; in
mov cx,[bp+14] ; len
mov bp,[bp+16] ; k
;; First multiply step has no carry in
lodsw
mul bp
sub es:[bx],ax ; This time, store in [bx] directly
adc dx,0
sub bx,si ; Prepare to use [bx+si].
;; The switch() for Duff's device starts here
;; Note: this *is* faster than a jump table for an 8086 and '286.
;; 8086: jump table: 44 cycles; this: 27/29/31/41
;; 80286: jump table: 25 cycles; this: 17/17/20/22
shr cx,1
jc SHORT ms16_odd
inc cx
shr cx,1
jc SHORT ms16_case2
jmp SHORT ms16_case0
ms16_odd:
inc cx
shr cx,1
jnc SHORT ms16_case1
jz SHORT ms16_done ; Avoid entire loop in this case
ms16_loop:
lodsw
mov di,dx ; Remember carry for later
mul bp
add ax,di ; Add carry in from previous word
adc dx,0
sub es:[bx+si],ax
adc dx,0
ms16_case2:
lodsw
mov di,dx ; Remember carry for later
mul bp
add ax,di ; Add carry in from previous word
adc dx,0
sub es:[bx+si],ax
adc dx,0
ms16_case1:
lodsw
mov di,dx ; Remember carry for later
mul bp
add ax,di ; Add carry in from previous word
adc dx,0
sub es:[bx+si],ax
adc dx,0
ms16_case0:
lodsw
mov di,dx ; Remember carry for later
mul bp
add ax,di ; Add carry in from previous word
adc dx,0
sub es:[bx+si],ax
adc dx,0
loop ms16_loop
ms16_done:
mov ax,dx
pop di
pop si
pop ds
pop bp
ret
_lbnMulSub1_16 endp
;; Two-word by one-word divide. Stores quotient, returns remainder.
;; BNWORD16 lbnDiv21_16(BNWORD16 *q, BNWORD16 nh, BNWORD16 nl, BNWORD16 d)
;; 4 8 10 12
align 2
_lbnDiv21_16 proc far
mov cx,bp ; bp NOT pushed; note change in offsets
mov bp,sp
mov dx,[bp+8]
mov ax,[bp+10]
div WORD PTR [bp+12]
les bx,[bp+4]
mov es:[bx],ax
mov ax,dx
mov bp,cx
ret
nop ; To align loop in lbnModQ properly
_lbnDiv21_16 endp
;; Multi-word by one-word remainder.
;; BNWORD16 lbnModQ_16(BNWORD16 *q, unsigned len, unsigned d)
;; 6 10 12
_lbnModQ_16 proc far
push bp
mov bp,sp
push ds
mov bx,si
mov cx,10[bp] ; load len
lds si,6[bp] ; load q
std ; loop MSW to LSW
add si,cx
mov bp,12[bp] ; load d
add si,cx
xor dx,dx ; Set up for first divide
sub si,2 ; Adjust pointer to point to MSW
lodsw ; Load first word
cmp ax,bp ; See if we can skip first divide
jnc SHORT modq16_inner ; No such luck
mov dx,ax ; Yes! Modulus > input, so remainder = input
dec cx ; Do loop
jz SHORT modq16_done
modq16_loop:
lodsw
modq16_inner:
div bp
loop modq16_loop
modq16_done:
pop ds
mov ax,dx ; Return remainder
pop bp
mov si,bx
cld ; Microsoft C's libraries assume this
ret
_lbnModQ_16 endp
;; Similar, but using 32-bit operations.
;;
;; The differences are that the switch() in Duff's device is done using
;; a jump table, and lods is not used because it's slower than load and
;; increment. The pointers are only updated once per loop; offset
;; addressing modes are used, since they're no slower. [di] is used
;; instead of [bx+si] because the extra increment of di take only one
;; cycle per loop a '486, while [bx+si] takes one extra cycle per multiply.
;;
;; The register assignments are also slightly different:
;;
;; es:[si] in
;; ds:[di] out
;; ecx k
;; bp loop counter (len/4)
;; edx,eax high,low parts of product
;; ebx carry word from previous multiply iteration
;;
;; The use of bp for a loop counter lets all the 32-bit values go
;; in caller-save registers, so there's no need to do any 32-bit
;; saves and restores. Using ds:di for the destination saves one
;; segment override in the lbnMulN1_32 code, since there's one more
;; store to [di] than load from es:[si].
;;
;; Given the number of 32-bit references that this code uses, optimizing
;; it for the Pentium is interesting, because the Pentium has a very
;; inefficient implementation of prefix bytes. Each prefix byte, with
;; the exception of 0x0f *>> on conditional branch instructions ONLY <<*
;; is a 1-cycle non-pairiable instruction. Which has the effect of
;; forcing the instruction it's on into the U pipe. But this code uses
;; *lots* of prefix bytes, notably the 0x66 operand size override.
;;
;; For example "add [di],eax" is advised against in Intel's optimization
;; papers, because it takes 3 cycles and 2 of them are not pairable.
;; But any longer sequence would have a prefix byte on every instruction,
;; resulting in even more non-pairable cycles. Also, only two instructions
;; in the multiply kernel can go in the V pipe (the increments of si and
;; di), and they're already there, so the pairable cycles would be wasted.
;;
;; Things would be *quite* different in native 32-bit mode.
;;
;; All instructions that could go in the V pipe that aren't there are
;; marked.
;;
;; The setup code is quite intricately interleaved to get the best possible
;; performance out of a Pentium. If you want to follow the code,
;; pretend that the sections actually come in the following order:
;; 1) prologue (push registers)
;; 2) load (fetch arguments)
;; 3) first multiply
;; 4) loop unrolling
;;
;; The loop unrolling setup consists of taking the count, adjusting
;; it to account for the first multiply, and splitting it into
;; two parts: the high bits are a loop count, while the low bits are
;; used to find the right entry in the Duff's device jump table and
;; to adjust the initial data pointers.
;;
;; Known slack: There is one instruction in the prologue and one in
;; the epilogue that could go in the V pipe if I could find a U-pipe
;; instruction to pair them with, but all the U-pipe instructions
;; are already paired, so it looks difficult.
;;
;; There is a cycle of Address Generation Interlock in the lbnMulN1_32
;; code on the Pentium (not on a '486). I can't figure out how to
;; get rid of it without wasting time elsewhere. The problem is that
;; the load of bx needs to be done as soon as possible to let it
;; be set up in time for the switch(). The other problem is the
;; epilogue code which can waste time if the order of the pushed
;; registers is diddled with so that ds doesn't come between si and di.
;;
;; The increment of si after the last load is redundant, and the
;; copy of the high word of the product to the carry after the last
;; multiply is likewise unnecessary.
;;
;; In these cases, the operations were done that way in order to remove
;; cycles from the loop on the '486 and/or Pentium, even though it costs
;; a few overhead cycles on a '386.
;; The increment fo si has to be done early because a load based on si
;; is the first thing in any given multiply step, and the address
;; generation interlock on the '486 and Pentium requires that a full
;; cycle (i.e. possibly two instructions on a Pentium) pass between
;; incrementing a register and using it in an address.
;; This saves one cycle per multiply on a '486 and Pentium, and costs
;; 2 cycles per call to the function on a '386 and 1 cycle on a '486.
;;
;; The carry word is copied where it is so that the decrement of the loop
;; counter happens in the V pipe. The instruction between the decrement
;; of the loop counter and the branch should be a U-pipe instruction that
;; doesn't affect the flags. Thus, the "mov" was rotated down from
;; the top of the loop to fill the slot.
;; This is a bit more marginal: it saves one cycle per loop iteration on
;; a Pentium, and costs 2 cycles per call on a '386, '486 or Pentium.
;;
;; The same logic applies to the copy of the carry and increment of si
;; before the test, in case 0, for skipping the loop entirely.
;; It makes no difference in speed if the loop is executed, but
;; incrementing si before saves an address generation interlock cycle
;; On a '486 and Pentium in the case that the loop is executed.
;; And the loop is executed more often than not.
;;
;; Given that just one multiply on a '386 takes 12 to 41 cycles (with the
;; average being very much at the high end of that) 4 cycles of additional
;; overhead per call is not a big deal.
;;
;; On a Pentium, it would actually be easier to *not* unroll the loop
;; at all, since the decrement and compare are completely hidden
;; in the V-pipe and it wouldn't cost anything to do them more often.
;; That would save the setup for the unrolling and Duff's device at the
;; beginning. But the overhead for that is pretty minor: ignoring what's
;; hidden in the V pipe, it's two cycles plus the indirect jump.
;; Not too much, and special-casing the pentium is quite a hassle.
;; (For starters, you have to detect it, and since you're probably in
;; V86 mode, without access to the EFLAGS register to test the CPUID bit.)
align 16
_lbnMulN1_32 proc far
push bp ; U prologue ** Could be V
mov bp,sp ; V prologue
push si ; U prologue ** Could be V
mov bx,[bp+14] ; U load len ** Could be V (AGI!)r
push ds ; NP prologue
les si,[bp+10] ; NP load in
mov ecx,[bp+16] ; U load k
dec bx ; V loop unrolling
shl bx,2 ; U loop unrolling
push di ; V prologue
lds di,[bp+6] ; NP load out
mov bp,bx ; U loop unrolling ** Could be V
and bx,12 ; V loop unrolling
;; First multiply step has no carry in.
mov eax,es:[si] ; U first multiply
add si,bx ; V loop unrolling
mul ecx ; NP first multiply
mov [di],eax ; U first multiply
add di,bx ; V loop unrolling
;; The switch() for Duff's device. This jump table is (slightly!) faster
;; than a bunch of branches on a '386 and '486, and is probably better yet
;; on higher processors.
jmp WORD PTR cs:m32_jumptable[bx] ; NP loop unrolling
align 2
m32_jumptable:
dw OFFSET m32_case0, 0
dw OFFSET m32_case1, 0
dw OFFSET m32_case2, 0
dw OFFSET m32_case3, 0, 0, 0, 0 ; Get loop aligned properly
m32_case0:
add si,16 ; U Fix up si ** Could be V
test bp,bp ; V
mov ebx,edx ; U Remember carry for later
jbe SHORT m32_done ; V Avoid entire loop if loop count is 0
m32_loop:
mov eax,es:[si-12] ; U
add di, 16 ; V
mul ecx ; NP
add eax,ebx ; U Add carry in from previous word
adc edx,0 ; U
mov [di-12],eax ; U
m32_case3:
mov ebx,edx ; U Remember carry for later
mov eax,es:[si-8] ; U
mul ecx ; NP
add eax,ebx ; U Add carry in from previous word
adc edx,0 ; U
mov [di-8],eax ; U
m32_case2:
mov ebx,edx ; U Remember carry for later
mov eax,es:[si-4] ; U
mul ecx ; NP
add eax,ebx ; U Add carry in from previous word
adc edx,0 ; U
mov [di-4],eax ; U
m32_case1:
mov ebx,edx ; U Remember carry for later
mov eax,es:[si] ; U
mul ecx ; NP
add eax,ebx ; U Add carry in from previous word
adc edx,0 ; U
add si,16 ; V
mov [di],eax ; U
sub bp,16 ; V
mov ebx,edx ; U Remember carry for later
ja m32_loop ; V
m32_done:
mov [di+4],edx ; U
pop di ; V
pop ds ; NP
pop si ; U ** Could be V
pop bp ; V
ret ; NP
_lbnMulN1_32 endp
align 16
_lbnMulAdd1_32 proc far
push bp ; U prologue ** Could be V
mov bp,sp ; V prologue
push ds ; NP prologue
mov ecx,[bp+16] ; U load k
mov bx,[bp+14] ; V load len
push di ; U prologue ** Could be V
dec bx ; V loop unrolling
lds di,[bp+6] ; NP load out
shl bx,2 ; U loop unrolling
push si ; V prologue
les si,[bp+10] ; NP load in
mov bp,bx ; U loop unrolling ** Could be V
and bx,12 ; V loop unrolling
;; First multiply step has no carry in.
mov eax,es:[si] ; U first multiply
add si,bx ; V loop unrolling
mul ecx ; NP first multiply
add [di],eax ; U first multiply
adc edx,0 ; U first multiply
add di,bx ; V loop unrolling
;; The switch() for Duff's device. This jump table is (slightly!) faster
;; than a bunch of branches on a '386 and '486, and is probably better yet
;; on higher processors.
jmp WORD PTR cs:ma32_jumptable[bx] ; NP loop unrolling
align 2
ma32_jumptable:
dw OFFSET ma32_case0, 0
dw OFFSET ma32_case1, 0
dw OFFSET ma32_case2, 0
dw OFFSET ma32_case3, 0, 0 ; To get loop aligned properly
ma32_case0:
add si,16 ; U Fix up si ** Could be V
test bp,bp ; V
mov ebx,edx ; U Remember carry for later
jbe SHORT ma32_done ; V Avoid entire loop if loop count is 0
ma32_loop:
mov eax,es:[si-12] ; U
add di, 16 ; V
mul ecx ; NP
add eax,ebx ; U Add carry in from previous word
adc edx,0 ; U
add [di-12],eax ; U
adc edx,0 ; U
ma32_case3:
mov ebx,edx ; U Remember carry for later
mov eax,es:[si-8] ; U
mul ecx ; NP
add eax,ebx ; U Add carry in from previous word
adc edx,0 ; U
add [di-8],eax ; U
adc edx,0 ; U
ma32_case2:
mov ebx,edx ; U Remember carry for later
mov eax,es:[si-4] ; U
mul ecx ; NP
add eax,ebx ; U Add carry in from previous word
adc edx,0 ; U
add [di-4],eax ; U
adc edx,0 ; U
ma32_case1:
mov ebx,edx ; U Remember carry for later
mov eax,es:[si] ; U
mul ecx ; NP
add eax,ebx ; U Add carry in from previous word
adc edx,0 ; U
add si,16 ; V
add [di],eax ; U
adc edx,0 ; U
sub bp,16 ; V
mov ebx,edx ; U Remember carry for later
ja ma32_loop ; V
ma32_done:
pop si ; U ** Could be V
pop di ; V
mov ax,dx ; U return value low ** Could be V
pop ds ; NP
shr edx,16 ; U return value high
pop bp ; V
ret ; NP
_lbnMulAdd1_32 endp
align 16
_lbnMulSub1_32 proc far
push bp ; U prologue ** Could be V
mov bp,sp ; V prologue
push ds ; NP prologue
mov ecx,[bp+16] ; U load k
mov bx,[bp+14] ; V load len
push di ; U prologue ** Could be V
dec bx ; V loop unrolling
lds di,[bp+6] ; NP load out
shl bx,2 ; U loop unrolling
push si ; V prologue
les si,[bp+10] ; NP load in
mov bp,bx ; U loop unrolling ** Could be V
and bx,12 ; V loop unrolling
;; First multiply step has no carry in.
mov eax,es:[si] ; U first multiply
add si,bx ; V loop unrolling
mul ecx ; NP first multiply
sub [di],eax ; U first multiply
adc edx,0 ; U first multiply
add di,bx ; V loop unrolling
;; The switch() for Duff's device. This jump table is (slightly!) faster
;; than a bunch of branches on a '386 and '486, and is probably better yet
;; on higher processors.
jmp WORD PTR cs:ms32_jumptable[bx] ; NP loop unrolling
align 2
ms32_jumptable:
dw OFFSET ms32_case0, 0
dw OFFSET ms32_case1, 0
dw OFFSET ms32_case2, 0
dw OFFSET ms32_case3, 0, 0 ; To get loop aligned properly
ms32_case0:
add si,16 ; U Fix up si ** Could be V
test bp,bp ; V
mov ebx,edx ; U Remember carry for later
jbe SHORT ms32_done ; V Avoid entire loop if loop count is 0
ms32_loop:
mov eax,es:[si-12] ; U
add di, 16 ; V
mul ecx ; NP
add eax,ebx ; U Add carry in from previous word
adc edx,0 ; U
sub [di-12],eax ; U
adc edx,0 ; U
ms32_case3:
mov ebx,edx ; U Remember carry for later
mov eax,es:[si-8] ; U
mul ecx ; NP
add eax,ebx ; U Add carry in from previous word
adc edx,0 ; U
sub [di-8],eax ; U
adc edx,0 ; U
ms32_case2:
mov ebx,edx ; U Remember carry for later
mov eax,es:[si-4] ; U
mul ecx ; NP
add eax,ebx ; U Add carry in from previous word
adc edx,0 ; U
sub [di-4],eax ; U
adc edx,0 ; U
ms32_case1:
mov ebx,edx ; U Remember carry for later
mov eax,es:[si] ; U
mul ecx ; NP
add eax,ebx ; U Add carry in from previous word
adc edx,0 ; U
add si,16 ; V
sub [di],eax ; U
adc edx,0 ; U
sub bp,16 ; V
mov ebx,edx ; U Remember carry for later
ja ms32_loop ; V
ms32_done:
pop si ; U ** Could be V
pop di ; V
mov ax,dx ; U return value low ** Could be V
pop ds ; NP
shr edx,16 ; U return value high
pop bp ; V
ret ; NP
_lbnMulSub1_32 endp
;; Just for interest's sake, here's a completely Pentium-optimized version.
;; In addition to being smaller, it takes 8 + (8+mul_time)*n cycles, as
;; compared to the 10 + jmp_time + (8+mul_time)*n cycles for the loop above.
;; (I don't know how long a 32x32->64 bit multiply or an indirect jump
;; take on a Pentium, so plug those numbers in.)
; align 2
; nop ; To align loop nicely
;P_lbnMulAdd1_32 proc far
;
; push bp ; U prologue ** Could be V
; mov bp,sp ; V prologue
; push ds ; NP prologue
; mov ecx,[bp+16] ; U load k
; push si ; V prologue
; lds si,[bp+10] ; NP load in
; mov eax,[si] ; U first multiply
; push di ; V prologue
; mul ecx ; NP first multiply
; les di,[bp+6] ; NP load out
; add es:[di],eax ; U first multiply
; mov bp,[bp+14] ; V load len
; adc edx,0 ; U first multiply
; dec bp ; V
; mov ebx,edx ; U Remember carry for later
; je Pma32_done ; V
;Pma32_loop:
; mov eax,[si+4] ; U
; add di,4 ; V
; mul ecx ; NP
; add eax,ebx ; U Add carry in from previous word
; adc edx,0 ; U
; add si,4 ; V
; add es:[di],eax ; U
; adc edx,0 ; U
; dec bp ; V
; mov ebx,edx ; U Remember carry for later
; jne Pma32_loop ; V
;Pma32_done:
; pop di ; U ** Could be V
; pop si ; V
; pop ds ; NP
; mov ax,dx ; U return value low ** Could be V
; pop bp ; V
; shr edx,16 ; U return value high
; ret ; NP
;
;P_lbnMulAdd1_32 endp
;; Two-word by one-word divide. Stores quotient, returns remainder.
;; BNWORD32 lbnDiv21_32(BNWORD32 *q, BNWORD32 nh, BNWORD32 nl, BNWORD32 d)
;; 4 8 12 16
align 16
_lbnDiv21_32 proc far
mov cx,bp ; U bp NOT pushed; offsets differ
mov bp,sp ; V
; AGI
mov edx,[bp+8] ; U
mov eax,[bp+12] ; U
div DWORD PTR [bp+16] ; NP
les bx,[bp+4] ; NP
mov es:[bx],eax ; U
mov ax,dx ; V
shr edx,16 ; U
mov bp,cx ; V
ret ; NP
nop
nop
nop
nop ; Get lbnModQ_32 aligned properly
_lbnDiv21_32 endp
;; Multi-word by one-word remainder.
;; This speeds up key generation. It's not worth unrolling and so on;
;; using 32-bit divides is enough of a speedup.
;;
;; bp is used as a counter so that all the 32-bit values can be in
;; caller-save registers (eax, ecx, edx). bx is needed as a pointer.
;;
;; The modulus (in ebp) is 16 bits. Given that the dividend is 32 bits,
;; the chances of saving the first divide because the high word of the
;; dividend is less than the modulus are low enough it's not worth taking
;; the cycles to test for it.
;;
;; unsigned lbnModQ_32(BNWORD16 *q, unsigned len, unsigned d)
;; 6 10 12
_lbnModQ_32 proc far
xor ecx,ecx ; U Clear ecx (really, the high half)
push bp ; V
mov edx,ecx ; U Clear high word for first divide
mov bp,sp ; V
push ds ; NP
lds ax,[bp+6] ; NP Load dividend pointer
mov bx,[bp+10] ; U Load count ** Could be V
sub ax,4 ; V Offset dividend pointer
mov cx,[bp+12] ; U Load modulus ** Could be V
mov bp,bx ; V Copy count
shl bx,2 ; U Shift index
add bx,ax ; U Add base ** Could be V
; lea bx,[eax+ebp*4-4]; U Move pointer to high word
modq32_loop:
mov eax,[bx] ; U
sub bx,4 ; V
div ecx ; NP
dec bp ; U ** Could be V
jnz modq32_loop ; V
modq32_done:
pop ds ; NP
mov ax,dx ; U ** Could be V
pop bp ; V
ret ; NP
_lbnModQ_32 endp
;; int not386(void) returns 0 on a 32-bit (386 or better) processor;
;; non-zero if an 80286 or lower. The Z flag is set to reflect
;; ax on return. This is only called once, so it doesn't matter how
;; it's aligned.
_not386 proc far
;;
;; This first test detects 80x86 for x < 2. On the 8086 and '186,
;; "push sp" does "--sp; sp[0] = sp". On all later processors, it does
;; "sp[-1] = sp; --sp".
;;
push sp
pop ax
sub ax,sp
jne SHORT return
;; This test is the key one. It will probably detect 8086, V30 and 80186
;; as well as 80286, but I haven't had access to test it on any of those,
;; so it's protected by the well-known test above. It has been tested
;; on the 80286, 80386, 80486, Pentium and AMD tested it on their K5.
;; I have not been able to confirm effectiveness on the P6 yet, although
;; someone I spoke to at Intel said it should work.
;;
;; This test uses the fact that the '386 and above have a barrel shifter
;; to do shifts, while the '286 does left shifts by releated adds.
;; That means that on the '286, the auxilliary carry gets a copy of
;; bit 4 of the shift output, while on the '386 and up, it's trashed
;; (as it happens, set to 1) independent of the result. (It's documented
;; as undefined.)
;;
;; We do two shifts, which should produce different auxilliary carries
;; on a '286 and XOR them to see if they are different. Even on a
;; future processor that does something different with the aux carry
;; flag, it probably does something data-independent, so this will still
;; work. Note that all flags except aux carry are defined for shl
;; output and will be the same for both cases.
mov al,4
shl al,1 ; Expected to produce ac = 0 on a '286
lahf
shl al,1 ; Expected to produce ac = 1 on a '286
mov al,ah
lahf
xor al,ah ; Xor the flags together to detect the difference
mov ah,al ; Clear ah if al is clear, leave Z flag alone
return:
ret
_not386 endp
_TEXT ends
end