3 Commits

Author SHA1 Message Date
0d3021c44e v0.4.1: cm.xfer per-file scratch buffers + bidir fairness cap
Patch release.  Two related changes on the TX hot path: a perf
optimization that drops EmitData from O(N) heap allocs per frame
to a single per-file alloc, and a fairness cap on the inner emit
loop that keeps bidir sessions from starving the RX side.

Performance
-----------

* cm.xfer per-file scratch buffers in EmitData.  v0.4.0 did
  SetLength(RawBuf), SetLength(ZBuf), SetLength(Frame) per frame
  -- three Pascal heap allocations on every DATA packet -- plus
  a Stream.Position := FTxPos assignment issuing an lseek(2)
  syscall on TFileStream.  Mirrors legacy cometxfer.pas's
  GetMem-once-per-file pattern: three TBytes fields on the
  session (FTxRawBuf, FTxZBuf, FTxFrameBuf) sized at FBlockSize
  once in StartNextOutbound / StartFREQResponse / RPOS, reused
  across every frame for the file.  Cached FTxStreamPos: Int64
  tracks the read offset locally so we only Stream.Position :=
  on Start / RPOS / EOF-retry, not per frame.

  ~30% wallclock improvement on the 10 MB bidir continuous test.

Fixed
-----

* bidir fairness regression introduced by the scratch-buffer
  optimization.  The unbounded inner emit loop in Step() (added
  in v0.4.0 to undo v0.3.0's round-trip regression) filled the
  entire FWindowSize worth of frames per call.  With the
  scratch optimization making each EmitData ~10x cheaper, the
  inner loop emitted 8 frames in microseconds -- but Pump
  only reads one inbound frame per call, producing an 8:1
  send/recv ratio per cycle that starved the RX side.  Visible
  in test_xfer_continuous as small inbound files not completing
  before a concurrent big outbound finished.

  Fix: cap the inner loop at MAX_FRAMES_PER_STEP = 4.  Two
  Step calls now fill an 8-frame window with a Pump in between
  to drain peer bytes.  Tuned empirically: failures start at
  >= 5, throughput drops noticeably at 1-3.

All 12 tests in tests/run_tests.sh PASS.
2026-04-27 12:37:26 -07:00
b24e6a5b4a v0.4.0: HasPendingTx accessor (mirror of fpc-binkp 0.4.0)
Add public read-only TComSession.HasPendingTx returning True iff
there are outbound bytes already queued in the send stream that
haven't been pushed to the socket yet.  Mirror of the same
accessor added to fpc-binkp's TBPSession in 0.4.0.  Lets driver
loops pass it as the WantWrite flag in Transport.WaitReady;
primarily a perf improvement for high-volume sends (avoids
per-frame idle wait).

Recommended pattern:

  while Session.NextStep do
    Transport.WaitReady(True, Session.HasPendingTx, 50);

Wire protocol unchanged.  Consumers pinned to 0.3.0 can upgrade
in place; HasPendingTx is purely additive and callers that
don't use it keep the old behaviour.
CM_MIN_COMPATIBLE_VERSION stays at 0.1.0.
2026-04-25 10:34:20 -07:00
1408d94d99 v0.3.0: vendor cr.x25519 + cr.chacha20 from fpc-crypto v0.2.0
cm.crypto.pas drops its in-tree X25519 Montgomery ladder and
ChaCha20 RFC 8439 implementation, keeping only the Comet-
specific session glue on top: SHA-256 key derivation,
tx/rx ChaCha20 pair, CometCryptInit / Encrypt / Decrypt.

Primitives now live in fpc-crypto 0.2.0:
  TX25519Key + X25519Keypair/ScalarMult/SharedSecret  -> cr.x25519
  TChacha20State + ChaCha20Init/Crypt                 -> cr.chacha20

cm.driver.pas imports cr.x25519 explicitly (no re-export
gymnastics through cm.crypto) so its dependency graph
matches reality.  Same rationale as v0.2.0's treatment of
cr.ed25519 callers.

12/12 local tests PASS.  6-target cross-build (incl.
i386-go32v2) green.  Wire protocol unchanged -- the session-
layer key-derivation and frame encryption produce
byte-identical output to v0.2.0.
2026-04-24 12:08:19 -07:00
9 changed files with 636 additions and 353 deletions

View File

@@ -10,6 +10,111 @@ Semver intent:
- **minor** — additive features, new hooks, new capability flags
- **patch** — bug fixes, security hardening, internal perf
## 0.4.1 — 2026-04-27
Patch release: per-file scratch-buffer optimization on the TX
hot path. Drops EmitData from O(allocations) per frame to one
fixed allocation per file. Bidir fairness preserved via a
4-frame-per-Step cap on the inner emit loop.
### Performance
- **`cm.xfer`: per-file scratch buffers in EmitData**. v0.4.0's
`EmitData` did `SetLength(RawBuf, ChunkLen)`, `SetLength(ZBuf,
Bound)`, and `SetLength(Frame, HdrLen+...)` per frame -- three
Pascal heap allocations on every DATA packet, plus a
`Stream.Position := FTxPos` assignment that issued an `lseek(2)`
syscall on TFileStream. Mirrors legacy cometxfer.pas's
GetMem-once-per-file pattern: three `TBytes` fields on the
session (`FTxRawBuf`, `FTxZBuf`, `FTxFrameBuf`) sized at
`FBlockSize` once in `StartNextOutbound` / `StartFREQResponse`
/ RPOS, reused across every frame for the file. Cached
`FTxStreamPos: Int64` tracks our read offset locally so we
only `Stream.Position :=` on Start / RPOS / EOF-retry, not
every frame.
Combined gain on a 10 MB Comet send: ~30% wallclock on the
bidir continuous test, more on slower disks.
### Fixed
- **bidir fairness regression introduced by the scratch-buffer
optimization**. The unbounded inner emit loop in
`Step()` (introduced in v0.4.0 to undo the v0.3.0 round-trip
regression) fills the entire `FWindowSize` worth of frames
per call. With the scratch optimization making each EmitData
~10x cheaper, the existing loop emitted 8 frames in microseconds
-- but `Pump` only reads one inbound frame per call, producing
an 8:1 send/recv ratio per cycle that starved the RX side.
Visible in `test_xfer_continuous` as small inbound files not
completing before a concurrent big outbound finished.
Fix: cap the inner loop at `MAX_FRAMES_PER_STEP = 4`. Two
Step calls now fill an 8-frame window, with a Pump in between
to drain peer bytes. Tuned empirically: failures start at
>= 5, throughput drops noticeably at 1-3.
All 12 tests in `tests/run_tests.sh` PASS at this cap.
## 0.4.0 — 2026-04-25
### Added
- **`TComSession.HasPendingTx: Boolean`** — public read-only
accessor. True iff there are outbound bytes already
queued in the send stream that haven't been pushed to
the socket yet. Mirror of the same accessor added to
fpc-binkp's `TBPSession` in 0.4.0. Lets driver loops
pass it as the `WantWrite` flag in
`Transport.WaitReady`; primarily a perf improvement for
high-volume sends (avoids per-frame idle wait).
- Recommended driver loop pattern:
```pascal
while Session.NextStep do
Transport.WaitReady(True, Session.HasPendingTx, 50);
```
### Not changed
- Wire protocol unchanged; consumers pinned to 0.3.0 can
upgrade in place. `HasPendingTx` is purely additive --
callers that don't use it keep the old behaviour.
- `CM_MIN_COMPATIBLE_VERSION` stays at `0.1.0`.
## 0.3.0 — 2026-04-24
### Changed
- **X25519 + ChaCha20 primitives moved out to fpc-crypto
v0.2.0.** `cm.crypto.pas` thins down to the
Comet-specific session glue (SHA-256 key derivation +
paired tx/rx ChaCha20 streams keyed per direction).
Pure crypto primitives now live upstream:
- `TX25519Key`, `X25519Keypair`, `X25519ScalarMult`,
`X25519SharedSecret` → `cr.x25519`
- `TChacha20State`, `ChaCha20Init`, `ChaCha20Crypt`
→ `cr.chacha20`
`cm.driver.pas` gains `cr.x25519` in its uses-clause so
`TX25519Key` + `X25519Keypair` / `X25519SharedSecret`
resolve directly from the primitive library instead of
being re-exported through `cm.crypto`.
- Rationale: X25519 and ChaCha20 are crypto; they belong
in the crypto library regardless of consumer count.
The "wait for a second consumer" rule that delayed their
carve in v0.1.0 / v0.2.0 was appropriate for MD5 (which
was duplicated code across three libraries) but doesn't
serve primitives like X25519 / ChaCha20 where upstream
ownership is the obviously-correct home.
### Not changed
- `TCometCrypt` record still owns the paired ChaCha20
states and the SHA-256-chained key derivation -- that's
Comet session-layer behaviour, not a primitive.
- Wire protocol is identical to 0.2.0. Same test vectors,
same interop behaviour. 12/12 local test suite green
pre- and post-migration; 6-target cross-build clean.
## 0.2.0 — 2026-04-23
### Changed

View File

@@ -35,6 +35,8 @@ UNITS=(
src/cr.ed25519.sc.pas
src/cr.ed25519.bp.pas
src/cr.ed25519.ge.pas
src/cr.x25519.pas
src/cr.chacha20.pas
src/cm.crypto.pas
src/cm.zlib.pas
src/cm.cram.pas

View File

@@ -1,26 +1,30 @@
{
Comet - Direct TCP File Transfer for FidoNet
cometcrypt.pas - Session encryption: X25519 key exchange + ChaCha20
cm.crypto.pas - Session encryption: Comet-specific glue on
top of cr.x25519 + cr.chacha20 + cr.sha
Provides full session encryption for the Comet native protocol using
keys already configured for ED25519 authentication. No certificates,
no external libraries, no CA infrastructure required.
Key exchange:
Both sides generate ephemeral X25519 keypairs and exchange public
keys. The shared secret is computed via Diffie-Hellman. Ephemeral
keys provide forward secrecy -- compromising long-term keys does
not reveal past session content.
Encryption:
ChaCha20 stream cipher (RFC 8439) with per-direction keys and
counters. Each frame body (TYPE+SEQ+PAYLOAD+CRC32) is encrypted
after CRC computation and decrypted before CRC verification.
This unit is the session-layer wrapper that:
1. Takes an X25519 shared secret + both ephemeral
public keys (produced by cr.x25519) and derives a
pair of SHA-256-chained directional keys,
2. Keys two cr.chacha20 streams (send / receive), one
per direction, into a TCometCrypt record, and
3. Exposes CometCryptEncrypt / CometCryptDecrypt so
the frame encoder / decoder can XOR frame bodies
on the wire.
Key derivation:
session_key = SHA-256(shared_secret || ephemeral_pub_A || ephemeral_pub_B)
send_key = SHA-256(session_key || "comet-send-" || direction_byte)
recv_key = SHA-256(session_key || "comet-recv-" || direction_byte)
session_key = SHA-256(shared_secret || pub_A || pub_B)
(pub_A = originator's, pub_B = answerer's;
both sides hash in the same order)
Key-A = SHA-256(session_key || 'A')
Key-B = SHA-256(session_key || 'B')
Originator sends with Key-A, receives with Key-B.
Answerer sends with Key-B, receives with Key-A.
The X25519 key-exchange + ChaCha20 stream primitives
themselves live in fpc-crypto (cr.x25519 / cr.chacha20)
as of fpc-comet 0.3.0 / fpc-crypto 0.2.0.
Copyright (C) 2026 Ken Johnson
License: GPL-2.0
@@ -33,62 +37,21 @@ unit cm.crypto;
interface
uses
SysUtils, cr.ed25519, cr.sha;
SysUtils, cr.sha, cr.x25519, cr.chacha20;
type
{ X25519 key types }
TX25519Key = array[0..31] of Byte;
{ ChaCha20 cipher state }
TChacha20State = record
Key: array[0..31] of Byte; { 256-bit key }
Nonce: array[0..11] of Byte; { 96-bit nonce (zeroed, counter-based) }
Counter: QWord; { Block counter }
Block: array[0..63] of Byte; { Current keystream block }
Used: Integer; { Bytes used in current block }
end;
{ Session encryption context }
{ Session encryption context. One TCometCrypt per
session. Holds both directional ChaCha20 states plus
an activation flag. }
TCometCrypt = record
Active: Boolean; { Encryption is active }
TxState: TChacha20State; { Transmit cipher state }
RxState: TChacha20State; { Receive cipher state }
Active: Boolean; { Encryption is active }
TxState: cr.chacha20.TChacha20State; { Transmit cipher state }
RxState: cr.chacha20.TChacha20State; { Receive cipher state }
end;
{ ---- X25519 Key Exchange ---- }
{ Generate a random X25519 keypair.
PrivKey is the clamped 32-byte scalar.
PubKey is the 32-byte Montgomery u-coordinate. }
procedure X25519Keypair(out PrivKey, PubKey: TX25519Key);
{ Scalar multiplication on Curve25519 (Montgomery form).
R = K * U. Exported for testing. }
procedure X25519ScalarMult(const K, U: TX25519Key; out R: TX25519Key);
{ Compute shared secret = X25519(our_private, their_public).
Returns False if the result is all-zero (bad public key). }
function X25519SharedSecret(const OurPriv, TheirPub: TX25519Key;
out Secret: TX25519Key): Boolean;
{ ---- ChaCha20 Stream Cipher ---- }
{ Initialize a ChaCha20 cipher state with a 32-byte key. }
procedure ChaCha20Init(var State: TChacha20State;
const Key: array of Byte);
{ Encrypt/decrypt data in place using ChaCha20 (XOR with keystream). }
procedure ChaCha20Crypt(var State: TChacha20State;
Data: PByte; Len: LongWord);
{ ---- Session Encryption ---- }
{ Derive send/recv keys from shared secret and ephemeral public keys,
and initialize the encryption context. IsOriginator determines
key direction (originator's send key = answerer's recv key). }
and initialize the encryption context. IsOriginator determines key
direction (originator's send key = answerer's recv key). }
procedure CometCryptInit(var Crypt: TCometCrypt;
const SharedSecret, EphPubLocal, EphPubRemote: TX25519Key;
IsOriginator: Boolean);
@@ -110,250 +73,6 @@ implementation
session engine logs activation events at its own layer. }
{ ---- X25519 Montgomery Ladder ---- }
{ Clamp a 32-byte scalar for X25519 per RFC 7748. }
procedure X25519Clamp(var K: TX25519Key);
begin
K[0] := K[0] and 248;
K[31] := (K[31] and 127) or 64;
end;
{ Montgomery ladder scalar multiplication on Curve25519.
Computes Result = k * u where u is a Montgomery u-coordinate.
Uses field element operations from cr.ed25519. }
procedure X25519ScalarMult(const K, U: TX25519Key; out R: TX25519Key);
{ Montgomery ladder per RFC 7748 Section 5.
Clamps scalar and decodes u-coordinate per spec.
All temporaries are distinct to avoid aliasing bugs. }
var
Scalar: TX25519Key;
UCopy: TX25519Key;
U_fe, X2, Z2, X3, Z3: TFieldElement;
tA, tAA, tB, tBB, tE: TFieldElement;
tC, tD, tDA, tCB: TFieldElement;
tDAp, tDAm, tT: TFieldElement;
Swap, KT: LongInt;
I: Integer;
begin
{ decodeScalar25519: clamp scalar per RFC 7748 }
Move(K, Scalar, 32);
X25519Clamp(Scalar);
{ decodeUCoordinate: clear high bit per RFC 7748 }
Move(U, UCopy, 32);
UCopy[31] := UCopy[31] and $7F;
FE_FromBytes(U_fe, @UCopy[0]);
FE_1(X2);
FE_0(Z2);
FE_Copy(X3, U_fe);
FE_1(Z3);
Swap := 0;
for I := 254 downto 0 do
begin
KT := (Scalar[I shr 3] shr (I and 7)) and 1;
Swap := Swap xor KT;
FE_CSwap(X2, X3, Swap);
FE_CSwap(Z2, Z3, Swap);
Swap := KT;
FE_Add(tA, X2, Z2);
FE_Sq(tAA, tA);
FE_Sub(tB, X2, Z2);
FE_Sq(tBB, tB);
FE_Sub(tE, tAA, tBB);
FE_Add(tC, X3, Z3);
FE_Sub(tD, X3, Z3);
FE_Mul(tDA, tD, tA);
FE_Mul(tCB, tC, tB);
FE_Add(tDAp, tDA, tCB);
FE_Sq(X3, tDAp);
FE_Sub(tDAm, tDA, tCB);
FE_Sq(tDAm, tDAm);
FE_Mul(Z3, U_fe, tDAm);
FE_Mul(X2, tAA, tBB);
FE_Mul121666(tT, tE); { 121666 * E }
FE_Sub(tT, tT, tE); { 121665 * E (a24 for Curve25519) }
FE_Add(tT, tAA, tT); { AA + a24 * E }
FE_Mul(Z2, tE, tT); { z_2 = E * (AA + a24 * E) }
end;
FE_CSwap(X2, X3, Swap);
FE_CSwap(Z2, Z3, Swap);
FE_Invert(Z2, Z2);
FE_Mul(X2, X2, Z2);
FE_ToBytes(TED25519PublicKey(R), X2);
end;
{ The X25519 base point (u=9). }
const
X25519_BASEPOINT: TX25519Key = (
9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
);
procedure X25519Keypair(out PrivKey, PubKey: TX25519Key);
var
Seed: TED25519Seed;
begin
{ Generate 32 random bytes }
ED25519RandomSeed(Seed);
Move(Seed, PrivKey, 32);
X25519Clamp(PrivKey);
{ Public key = PrivKey * basepoint }
X25519ScalarMult(PrivKey, X25519_BASEPOINT, PubKey);
end;
function X25519SharedSecret(const OurPriv, TheirPub: TX25519Key;
out Secret: TX25519Key): Boolean;
var
I: Integer;
Check: Byte;
begin
X25519ScalarMult(OurPriv, TheirPub, Secret);
{ Check for all-zero result (invalid public key) }
Check := 0;
for I := 0 to 31 do
Check := Check or Secret[I];
Result := Check <> 0;
end;
{ ---- ChaCha20 Stream Cipher (RFC 8439) ---- }
type
TChaCha20Block = array[0..15] of LongWord;
function RotL32(X: LongWord; N: Integer): LongWord; inline;
begin
Result := (X shl N) or (X shr (32 - N));
end;
procedure QuarterRound(var A, B, C, D: LongWord); inline;
begin
A := A + B; D := D xor A; D := RotL32(D, 16);
C := C + D; B := B xor C; B := RotL32(B, 12);
A := A + B; D := D xor A; D := RotL32(D, 8);
C := C + D; B := B xor C; B := RotL32(B, 7);
end;
procedure ChaCha20Block(const Key: array of Byte;
Counter: LongWord; const Nonce: array of Byte;
out Output: array of Byte);
var
State, Working: TChaCha20Block;
I: Integer;
begin
{ Initialize state }
State[0] := $61707865; { "expa" }
State[1] := $3320646e; { "nd 3" }
State[2] := $79622d32; { "2-by" }
State[3] := $6b206574; { "te k" }
{ Key (little-endian LongWords) }
for I := 0 to 7 do
State[4 + I] := LongWord(Key[I*4]) or (LongWord(Key[I*4+1]) shl 8) or
(LongWord(Key[I*4+2]) shl 16) or (LongWord(Key[I*4+3]) shl 24);
{ Counter }
State[12] := Counter;
{ Nonce (little-endian LongWords) }
for I := 0 to 2 do
State[13 + I] := LongWord(Nonce[I*4]) or (LongWord(Nonce[I*4+1]) shl 8) or
(LongWord(Nonce[I*4+2]) shl 16) or (LongWord(Nonce[I*4+3]) shl 24);
{ Copy state to working }
Working := State;
{ 20 rounds (10 double-rounds) }
for I := 1 to 10 do
begin
{ Column rounds }
QuarterRound(Working[0], Working[4], Working[8], Working[12]);
QuarterRound(Working[1], Working[5], Working[9], Working[13]);
QuarterRound(Working[2], Working[6], Working[10], Working[14]);
QuarterRound(Working[3], Working[7], Working[11], Working[15]);
{ Diagonal rounds }
QuarterRound(Working[0], Working[5], Working[10], Working[15]);
QuarterRound(Working[1], Working[6], Working[11], Working[12]);
QuarterRound(Working[2], Working[7], Working[8], Working[13]);
QuarterRound(Working[3], Working[4], Working[9], Working[14]);
end;
{ Add original state }
for I := 0 to 15 do
Working[I] := Working[I] + State[I];
{ Serialize to bytes (little-endian) }
for I := 0 to 15 do
begin
Output[I*4] := Byte(Working[I]);
Output[I*4 + 1] := Byte(Working[I] shr 8);
Output[I*4 + 2] := Byte(Working[I] shr 16);
Output[I*4 + 3] := Byte(Working[I] shr 24);
end;
end;
procedure ChaCha20Init(var State: TChacha20State;
const Key: array of Byte);
begin
FillChar(State, SizeOf(State), 0);
Move(Key[0], State.Key[0], 32);
State.Counter := 0;
State.Used := 64; { Force new block generation on first use }
end;
procedure ChaCha20Crypt(var State: TChacha20State;
Data: PByte; Len: LongWord);
var
I: LongWord;
Avail: Integer;
begin
I := 0;
while I < Len do
begin
{ Generate new keystream block if needed }
if State.Used >= 64 then
begin
ChaCha20Block(State.Key, LongWord(State.Counter),
State.Nonce, State.Block);
Inc(State.Counter);
State.Used := 0;
end;
{ XOR data with keystream }
Avail := 64 - State.Used;
if Avail > Integer(Len - I) then
Avail := Integer(Len - I);
while Avail > 0 do
begin
Data[I] := Data[I] xor State.Block[State.Used];
Inc(I);
Inc(State.Used);
Dec(Avail);
end;
end;
end;
{ ---- Session Encryption Setup ---- }
procedure CometCryptInit(var Crypt: TCometCrypt;
const SharedSecret, EphPubLocal, EphPubRemote: TX25519Key;
IsOriginator: Boolean);
@@ -388,8 +107,8 @@ begin
SHA256Final(Ctx, SessionKey);
{ Derive directional keys:
Key-A = SHA-256(session_key || "comet-keystream-A")
Key-B = SHA-256(session_key || "comet-keystream-B")
Key-A = SHA-256(session_key || "A")
Key-B = SHA-256(session_key || "B")
Originator sends with Key-A, receives with Key-B.
Answerer sends with Key-B, receives with Key-A. }
DirByte := Ord('A');

View File

@@ -39,7 +39,7 @@ interface
uses
Classes, SysUtils,
log.types, mb.address,
cm.types, cm.frame, cm.cram, cr.ed25519, cm.crypto,
cm.types, cm.frame, cm.cram, cr.ed25519, cr.x25519, cm.crypto,
cm.session, cm.events, cm.transport, cm.provider, cm.config;
type
@@ -181,6 +181,16 @@ type
function Phase: TCometPhase;
function Result_: TCometSessionResult;
{ True iff there are outbound bytes already queued in the
send stream that haven't been pushed to the socket yet.
Driver loops use this to choose whether WaitReady should
also wait for socket WRITE-readiness -- when True,
select() returns the moment the kernel send buffer can
take more bytes, so high-volume transfers don't pay a
50 ms idle wait per frame. Mirror of the same accessor
on fpc-binkp's TBPSession. }
function HasPendingTx: Boolean;
{ ---- Hooks for cm.xfer ---- }
{ Queue a complete frame onto the outbound stream. Bytes
drain to the transport on the next NextStep / Pump. }
@@ -1288,6 +1298,12 @@ begin
Result := FResult;
end;
function TComSession.HasPendingTx: Boolean;
begin
Result := (FSendStream <> nil) and
((FSendStream.Size - FSendPos) > 0);
end;
procedure TComSession.EmitFrame(PktType, Seq: Byte; const Buf;
Len: SizeInt);
var

View File

@@ -21,9 +21,9 @@ interface
const
CM_VERSION_MAJOR = 0;
CM_VERSION_MINOR = 2;
CM_VERSION_PATCH = 0;
CM_VERSION = '0.2.0';
CM_VERSION_MINOR = 4;
CM_VERSION_PATCH = 1;
CM_VERSION = '0.4.1';
{ Oldest version a consumer pinned to CM_VERSION is
guaranteed to remain ABI/API-compatible with. Bumped

View File

@@ -76,6 +76,21 @@ type
FTxAcked: Byte;
FTxAckedPos: Int64;
FTxStartTime: TDateTime;
{ Per-file scratch buffers reused across all DATA frames
for this file. Allocated once in StartNextOutbound at
FBlockSize, never SetLength'd in the inner loop.
Mirrors legacy cometxfer.pas's GetMem(DataBuf,...) /
GetMem(CompBuf,...) once-per-file pattern. Per-frame
SetLength was the dominant heap cost in the regressed
0.4.0 EmitData path. }
FTxRawBuf: TBytes; { raw file bytes }
FTxZBuf: TBytes; { zlib-compressed staging }
FTxFrameBuf: TBytes; { final on-wire frame body }
{ Cached read offset. We read sequentially from the TX
stream; assigning Stream.Position on every frame is an
lseek() syscall on TFileStream. Track our own offset
and only assign Stream.Position on Start / RPOS. }
FTxStreamPos: Int64;
{ RPOS loop-detection: peer keeps sending RPOS to the
same offset means the file is corrupt or the link is
bad enough to be unrecoverable. }
@@ -472,6 +487,17 @@ begin
FTxSeq := 0;
FTxAcked := 0;
FTxAckedPos := 0;
{ Same per-file scratch alloc + position reset that
StartNextOutbound does -- FREQ-injected sends share the
same EmitData hot path. }
if Length(FTxRawBuf) < Integer(FBlockSize) then
SetLength(FTxRawBuf, FBlockSize);
if Length(FTxZBuf) < Integer(CMZlibBound(FBlockSize)) then
SetLength(FTxZBuf, CMZlibBound(FBlockSize));
if Length(FTxFrameBuf) < Integer(FBlockSize) + 8 then
SetLength(FTxFrameBuf, FBlockSize + 8);
FTxItem.Stream.Position := 0;
FTxStreamPos := 0;
EmitFinfo(FTxItem);
FTxState := txAwaitFinfoAck;
Log(llInfo, Format('FREQ: serving %s (%d bytes)',
@@ -757,6 +783,7 @@ begin
end;
FTxItem.Stream.Position := RPos;
FTxStreamPos := RPos;
FTxPos := RPos;
FTxSeq := 0;
FTxAcked := 0;
@@ -772,16 +799,14 @@ end;
procedure TCometXfer.EmitData;
var
RawBuf: TBytes; { uncompressed payload bytes }
Got: LongInt;
Got: LongInt;
ChunkLen: Integer;
UseZlib: Boolean;
ZBuf: TBytes; { zlib-compressed staging }
ZLen: Cardinal;
ZRes: TCometZlibResult;
Frame: TBytes; { final on-wire frame body }
HdrLen: Integer; { 4 (no zlib) or 5 (zlib enabled) }
MaxRaw: Integer; { max raw bytes that fit a frame }
UseZlib: Boolean;
ZLen: Cardinal;
ZRes: TCometZlibResult;
HdrLen: Integer; { 4 (no zlib) or 5 (zlib enabled) }
MaxRaw: Integer; { max raw bytes that fit a frame }
FrameLen: Integer; { final body length we emit }
begin
if FTxItem.Stream = nil then Exit;
UseZlib := (FSharedCaps and COPT_ZLIB) <> 0;
@@ -801,9 +826,11 @@ begin
ChunkLen := Integer(FTxItem.Size - FTxPos);
if ChunkLen <= 0 then Exit;
SetLength(RawBuf, ChunkLen);
FTxItem.Stream.Position := FTxPos;
Got := FTxItem.Stream.Read(RawBuf[0], ChunkLen);
{ Sequential read from the file stream. StartNextOutbound
+ HandleRpos are the only places that re-Seek the stream;
in steady state the stream's position already equals
FTxStreamPos (= FTxPos), so we can skip the lseek. }
Got := FTxItem.Stream.Read(FTxRawBuf[0], ChunkLen);
if Got <= 0 then
begin
Log(llError, Format('TX read failed at %d on %s',
@@ -811,42 +838,40 @@ begin
CloseTxFile(False, cmSendFailIO);
Exit;
end;
if Got < ChunkLen then
SetLength(RawBuf, Got);
Inc(FTxStreamPos, Got);
if UseZlib then
begin
SetLength(ZBuf, CMZlibBound(Length(RawBuf)));
ZLen := Length(ZBuf);
ZRes := CMZlibCompress(RawBuf[0], Length(RawBuf),
ZBuf[0], ZLen);
if ZRes = cmZlibOK then
ZLen := Length(FTxZBuf);
ZRes := CMZlibCompress(FTxRawBuf[0], Got,
FTxZBuf[0], ZLen);
if (ZRes = cmZlibOK) and (Integer(ZLen) < Got) then
begin
{ Compressed AND smaller -- send with comp_type=1. }
SetLength(Frame, HdrLen + Integer(ZLen));
PutLE32(@Frame[0], LongWord(FTxPos));
Frame[4] := 1; { comp_type = ZLIB }
Move(ZBuf[0], Frame[5], ZLen);
PutLE32(@FTxFrameBuf[0], LongWord(FTxPos));
FTxFrameBuf[4] := 1; { comp_type = ZLIB }
Move(FTxZBuf[0], FTxFrameBuf[5], ZLen);
FrameLen := HdrLen + Integer(ZLen);
end
else
begin
{ No gain or buffer issue -- send raw with comp_type=0. }
SetLength(Frame, HdrLen + Length(RawBuf));
PutLE32(@Frame[0], LongWord(FTxPos));
Frame[4] := 0; { comp_type = NONE }
Move(RawBuf[0], Frame[5], Length(RawBuf));
PutLE32(@FTxFrameBuf[0], LongWord(FTxPos));
FTxFrameBuf[4] := 0; { comp_type = NONE }
Move(FTxRawBuf[0], FTxFrameBuf[5], Got);
FrameLen := HdrLen + Got;
end;
end
else
begin
SetLength(Frame, 4 + Length(RawBuf));
PutLE32(@Frame[0], LongWord(FTxPos));
Move(RawBuf[0], Frame[4], Length(RawBuf));
PutLE32(@FTxFrameBuf[0], LongWord(FTxPos));
Move(FTxRawBuf[0], FTxFrameBuf[4], Got);
FrameLen := 4 + Got;
end;
FSession.EmitFrame(NPKT_DATA, FTxSeq, Frame[0], Length(Frame));
FSession.EmitFrame(NPKT_DATA, FTxSeq, FTxFrameBuf[0], FrameLen);
Inc(FTxSeq);
FTxPos := FTxPos + Length(RawBuf);
FTxPos := FTxPos + Got;
end;
{ ---- TX state transitions ---- }
@@ -889,6 +914,24 @@ begin
FTxSeq := 0;
FTxAcked := 0;
FTxAckedPos := 0;
{ Pre-allocate the per-file scratch buffers exactly once.
FBlockSize is the max raw payload + small headroom.
CMZlibBound covers worst-case zlib expansion. Reused
across every DATA frame for this file. }
if Length(FTxRawBuf) < Integer(FBlockSize) then
SetLength(FTxRawBuf, FBlockSize);
if Length(FTxZBuf) < Integer(CMZlibBound(FBlockSize)) then
SetLength(FTxZBuf, CMZlibBound(FBlockSize));
if Length(FTxFrameBuf) < Integer(FBlockSize) + 8 then
SetLength(FTxFrameBuf, FBlockSize + 8);
{ Sync our cached read offset with the stream's position.
Provider returns a fresh stream positioned at 0; record
that so EmitData can skip the per-frame Position setter. }
if FTxItem.Stream <> nil then
begin
FTxItem.Stream.Position := 0;
FTxStreamPos := 0;
end;
EmitFinfo(FTxItem);
FTxState := txAwaitFinfoAck;
end;
@@ -1088,7 +1131,10 @@ begin
FTxStart := Int64(Off);
FTxAckedPos := Int64(Off);
if FTxItem.Stream <> nil then
begin
FTxItem.Stream.Position := FTxPos;
FTxStreamPos := FTxPos;
end;
FTxState := txData;
Log(llDebug, Format('peer accepted %s @ %d',
[FTxItem.Name, FTxPos]));
@@ -1264,7 +1310,10 @@ begin
FTxAcked := 0;
FTxAckedPos := 0;
if FTxItem.Stream <> nil then
begin
FTxItem.Stream.Position := 0;
FTxStreamPos := 0;
end;
FTxState := txData;
end;
NEOFACK_SKIP: CloseTxFile(False, cmSendFailPeerSkipped);
@@ -1278,7 +1327,18 @@ end;
Pumps the TX state machine forward when there's window
capacity. Returns True if work was done this step. }
const
{ Max DATA frames a single Step() call will emit. See the
matching comment in the txData branch below for the rationale.
4 was tuned empirically: the test_xfer_continuous bidir test
starts to fail at >= 5 (RX side starves), and 1-3 give back
~30% of the legacy throughput by paying the Step round-trip
too often. }
MAX_FRAMES_PER_STEP = 4;
function TCometXfer.Step: Boolean;
var
MaxBurst: Integer;
begin
Result := False;
EnsureNegotiated;
@@ -1293,9 +1353,36 @@ begin
begin
if (FTxItem.Stream <> nil) and (FTxPos < FTxItem.Size) then
begin
if WindowOpen then
{ Inner-burst emit, capped at MAX_FRAMES_PER_STEP.
Legacy cometxfer.pas:751-866 streamed frames in a tight
inner loop while the window had room, only yielding when
the window closed (waiting on a DATAACK) or the file
ended. fpc-comet 0.3.0 inadvertently dropped that loop
and emitted one frame per Step() call, which paid the
outer-driver round-trip for every 16-64 KB. 10 MB
sends regressed from ~2 s (Comet 1.01) to ~7 s, so
v0.4.0 restored the inner loop.
v0.4.1 caps the inner loop at MAX_FRAMES_PER_STEP (4).
The cap matters for bidirectional sessions: the per-file
scratch-buffer optimization (introduced same release)
makes EmitData ~10x cheaper, and Pump only reads ONE
inbound frame per call, so an unbounded inner loop
produced an 8:1 send/recv ratio per cycle that starved
the RX side -- visible in test_xfer_continuous as small
inbound files not completing before a concurrent big
outbound finished. Cap of 4 keeps the emit-burst
per Step long enough to amortize Step's overhead
(perf preserved), short enough that two Step calls
fill a typical 8-frame window with a Pump in between
(bidir fairness preserved). }
MaxBurst := MAX_FRAMES_PER_STEP;
while (MaxBurst > 0) and WindowOpen and
(FTxPos < FTxItem.Size) do
begin
EmitData;
Dec(MaxBurst);
Result := True;
end;
end

181
src/cr.chacha20.pas Normal file
View File

@@ -0,0 +1,181 @@
{ cr.chacha20 -- ChaCha20 stream cipher (RFC 8439).
Incremental ChaCha20 keystream generator with per-state
buffering so arbitrary-length plaintext can be streamed
across multiple ChaCha20Crypt calls without needing to
align on 64-byte blocks.
Public API:
ChaCha20Init Install a 32-byte key, reset counter/nonce/block
ChaCha20Crypt XOR Len bytes of Data with keystream (in place)
Nonce is 96 bits (RFC 8439 §2.3); caller pre-fills
State.Nonce before the first ChaCha20Crypt. Counter is
incremented per 64-byte block. Same ChaCha20 is used for
both encrypt and decrypt (symmetric stream cipher).
Carved from Comet's cm.crypto 2026-04-24. }
unit cr.chacha20;
{$mode objfpc}{$H+}
{$R-}{$Q-} { Range + overflow checks OFF for crypto arithmetic }
interface
uses
SysUtils;
type
{ ChaCha20 cipher state -- 256-bit key, 96-bit nonce,
64-bit block counter, 64-byte keystream buffer. Caller
populates Nonce before first ChaCha20Crypt; Key comes
from ChaCha20Init; Counter + Block + Used are managed
internally. }
TChacha20State = record
Key: array[0..31] of Byte; { 256-bit key }
Nonce: array[0..11] of Byte; { 96-bit nonce (zeroed, counter-based) }
Counter: QWord; { Block counter }
Block: array[0..63] of Byte; { Current keystream block }
Used: Integer; { Bytes used in current block }
end;
{ Initialize a ChaCha20 cipher state with a 32-byte key.
Clears nonce + counter; caller sets nonce bytes before
calling ChaCha20Crypt. Used=64 forces first call to
generate a fresh block. }
procedure ChaCha20Init(var State: TChacha20State;
const Key: array of Byte);
{ Encrypt/decrypt Len bytes of Data in place. Operation is
symmetric: XOR with keystream. Safe to call repeatedly
on different-length chunks; internal Used/Counter track
position. }
procedure ChaCha20Crypt(var State: TChacha20State;
Data: PByte; Len: LongWord);
implementation
type
TChaCha20Block = array[0..15] of LongWord;
function RotL32(X: LongWord; N: Integer): LongWord; inline;
begin
Result := (X shl N) or (X shr (32 - N));
end;
procedure QuarterRound(var A, B, C, D: LongWord); inline;
begin
A := A + B; D := D xor A; D := RotL32(D, 16);
C := C + D; B := B xor C; B := RotL32(B, 12);
A := A + B; D := D xor A; D := RotL32(D, 8);
C := C + D; B := B xor C; B := RotL32(B, 7);
end;
procedure ChaCha20Block(const Key: array of Byte;
Counter: LongWord; const Nonce: array of Byte;
out Output: array of Byte);
var
State, Working: TChaCha20Block;
I: Integer;
begin
{ Initialize state }
State[0] := $61707865; { "expa" }
State[1] := $3320646e; { "nd 3" }
State[2] := $79622d32; { "2-by" }
State[3] := $6b206574; { "te k" }
{ Key (little-endian LongWords) }
for I := 0 to 7 do
State[4 + I] := LongWord(Key[I*4]) or (LongWord(Key[I*4+1]) shl 8) or
(LongWord(Key[I*4+2]) shl 16) or (LongWord(Key[I*4+3]) shl 24);
{ Counter }
State[12] := Counter;
{ Nonce (little-endian LongWords) }
for I := 0 to 2 do
State[13 + I] := LongWord(Nonce[I*4]) or (LongWord(Nonce[I*4+1]) shl 8) or
(LongWord(Nonce[I*4+2]) shl 16) or (LongWord(Nonce[I*4+3]) shl 24);
{ Copy state to working }
Working := State;
{ 20 rounds (10 double-rounds) }
for I := 1 to 10 do
begin
{ Column rounds }
QuarterRound(Working[0], Working[4], Working[8], Working[12]);
QuarterRound(Working[1], Working[5], Working[9], Working[13]);
QuarterRound(Working[2], Working[6], Working[10], Working[14]);
QuarterRound(Working[3], Working[7], Working[11], Working[15]);
{ Diagonal rounds }
QuarterRound(Working[0], Working[5], Working[10], Working[15]);
QuarterRound(Working[1], Working[6], Working[11], Working[12]);
QuarterRound(Working[2], Working[7], Working[8], Working[13]);
QuarterRound(Working[3], Working[4], Working[9], Working[14]);
end;
{ Add original state }
for I := 0 to 15 do
Working[I] := Working[I] + State[I];
{ Serialize to bytes (little-endian) }
for I := 0 to 15 do
begin
Output[I*4] := Byte(Working[I]);
Output[I*4 + 1] := Byte(Working[I] shr 8);
Output[I*4 + 2] := Byte(Working[I] shr 16);
Output[I*4 + 3] := Byte(Working[I] shr 24);
end;
end;
procedure ChaCha20Init(var State: TChacha20State;
const Key: array of Byte);
begin
FillChar(State, SizeOf(State), 0);
Move(Key[0], State.Key[0], 32);
State.Counter := 0;
State.Used := 64; { Force new block generation on first use }
end;
procedure ChaCha20Crypt(var State: TChacha20State;
Data: PByte; Len: LongWord);
var
I: LongWord;
Avail: Integer;
begin
I := 0;
while I < Len do
begin
{ Generate new keystream block if needed }
if State.Used >= 64 then
begin
ChaCha20Block(State.Key, LongWord(State.Counter),
State.Nonce, State.Block);
Inc(State.Counter);
State.Used := 0;
end;
{ XOR data with keystream }
Avail := 64 - State.Used;
if Avail > Integer(Len - I) then
Avail := Integer(Len - I);
while Avail > 0 do
begin
Data[I] := Data[I] xor State.Block[State.Used];
Inc(I);
Inc(State.Used);
Dec(Avail);
end;
end;
end;
end.

View File

@@ -19,9 +19,9 @@ interface
const
CRYPTO_VERSION_MAJOR = 0;
CRYPTO_VERSION_MINOR = 1;
CRYPTO_VERSION_MINOR = 2;
CRYPTO_VERSION_PATCH = 0;
CRYPTO_VERSION = '0.1.0';
CRYPTO_VERSION = '0.2.0';
CRYPTO_MIN_COMPATIBLE_VERSION = '0.1.0';

173
src/cr.x25519.pas Normal file
View File

@@ -0,0 +1,173 @@
{ cr.x25519 -- Curve25519 scalar multiplication (RFC 7748).
X25519 Diffie-Hellman key exchange over Curve25519 in
Montgomery form. Public API:
X25519Keypair random clamped scalar + matching u-coord
X25519ScalarMult R = k * u (exported for test vectors)
X25519SharedSecret DH shared secret with zero-check
Shares field arithmetic (GF(2^255-19)) with cr.ed25519 --
Ed25519 and X25519 are the same curve in different
coordinate forms, so the FE_* ops from cr.ed25519 are
reused rather than duplicated.
Carved from Comet's cm.crypto 2026-04-24. }
unit cr.x25519;
{$mode objfpc}{$H+}
{$R-}{$Q-} { Range + overflow checks OFF for crypto arithmetic }
interface
uses
SysUtils, cr.ed25519;
type
{ 32-byte Curve25519 u-coordinate / scalar. Same wire
layout as TED25519PublicKey / TED25519Seed but kept
distinct for type discipline. }
TX25519Key = array[0..31] of Byte;
{ Generate a random X25519 keypair.
PrivKey is the clamped 32-byte scalar.
PubKey is the 32-byte Montgomery u-coordinate (PrivKey * basepoint). }
procedure X25519Keypair(out PrivKey, PubKey: TX25519Key);
{ Scalar multiplication on Curve25519 (Montgomery form).
R = K * U. Exported for testability; consumers usually
call X25519Keypair + X25519SharedSecret instead. }
procedure X25519ScalarMult(const K, U: TX25519Key; out R: TX25519Key);
{ Compute shared secret = X25519(our_private, their_public).
Returns False if the result is all-zero (bad public key --
low-order point attack). }
function X25519SharedSecret(const OurPriv, TheirPub: TX25519Key;
out Secret: TX25519Key): Boolean;
implementation
{ Clamp a 32-byte scalar for X25519 per RFC 7748. }
procedure X25519Clamp(var K: TX25519Key);
begin
K[0] := K[0] and 248;
K[31] := (K[31] and 127) or 64;
end;
{ Montgomery ladder scalar multiplication on Curve25519.
Computes Result = k * u where u is a Montgomery u-coordinate.
Uses field element operations from cr.ed25519. }
procedure X25519ScalarMult(const K, U: TX25519Key; out R: TX25519Key);
{ Montgomery ladder per RFC 7748 Section 5.
Clamps scalar and decodes u-coordinate per spec.
All temporaries are distinct to avoid aliasing bugs. }
var
Scalar: TX25519Key;
UCopy: TX25519Key;
U_fe, X2, Z2, X3, Z3: TFieldElement;
tA, tAA, tB, tBB, tE: TFieldElement;
tC, tD, tDA, tCB: TFieldElement;
tDAp, tDAm, tT: TFieldElement;
Swap, KT: LongInt;
I: Integer;
begin
{ decodeScalar25519: clamp scalar per RFC 7748 }
Move(K, Scalar, 32);
X25519Clamp(Scalar);
{ decodeUCoordinate: clear high bit per RFC 7748 }
Move(U, UCopy, 32);
UCopy[31] := UCopy[31] and $7F;
FE_FromBytes(U_fe, @UCopy[0]);
FE_1(X2);
FE_0(Z2);
FE_Copy(X3, U_fe);
FE_1(Z3);
Swap := 0;
for I := 254 downto 0 do
begin
KT := (Scalar[I shr 3] shr (I and 7)) and 1;
Swap := Swap xor KT;
FE_CSwap(X2, X3, Swap);
FE_CSwap(Z2, Z3, Swap);
Swap := KT;
FE_Add(tA, X2, Z2);
FE_Sq(tAA, tA);
FE_Sub(tB, X2, Z2);
FE_Sq(tBB, tB);
FE_Sub(tE, tAA, tBB);
FE_Add(tC, X3, Z3);
FE_Sub(tD, X3, Z3);
FE_Mul(tDA, tD, tA);
FE_Mul(tCB, tC, tB);
FE_Add(tDAp, tDA, tCB);
FE_Sq(X3, tDAp);
FE_Sub(tDAm, tDA, tCB);
FE_Sq(tDAm, tDAm);
FE_Mul(Z3, U_fe, tDAm);
FE_Mul(X2, tAA, tBB);
FE_Mul121666(tT, tE); { 121666 * E }
FE_Sub(tT, tT, tE); { 121665 * E (a24 for Curve25519) }
FE_Add(tT, tAA, tT); { AA + a24 * E }
FE_Mul(Z2, tE, tT); { z_2 = E * (AA + a24 * E) }
end;
FE_CSwap(X2, X3, Swap);
FE_CSwap(Z2, Z3, Swap);
FE_Invert(Z2, Z2);
FE_Mul(X2, X2, Z2);
FE_ToBytes(TED25519PublicKey(R), X2);
end;
{ The X25519 base point (u=9). }
const
X25519_BASEPOINT: TX25519Key = (
9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
);
procedure X25519Keypair(out PrivKey, PubKey: TX25519Key);
var
Seed: TED25519Seed;
begin
{ Generate 32 random bytes }
ED25519RandomSeed(Seed);
Move(Seed, PrivKey, 32);
X25519Clamp(PrivKey);
{ Public key = PrivKey * basepoint }
X25519ScalarMult(PrivKey, X25519_BASEPOINT, PubKey);
end;
function X25519SharedSecret(const OurPriv, TheirPub: TX25519Key;
out Secret: TX25519Key): Boolean;
var
I: Integer;
Check: Byte;
begin
X25519ScalarMult(OurPriv, TheirPub, Secret);
{ Check for all-zero result (invalid public key -- low-order
point attack; RFC 7748 §6.1 mandates rejection). }
Check := 0;
for I := 0 to 31 do
Check := Check or Secret[I];
Result := Check <> 0;
end;
end.