v0.4.1: cm.xfer per-file scratch buffers + bidir fairness cap

Patch release. Two related changes on the TX hot path: a perf optimization that drops EmitData from O(N) heap allocs per frame to a single per-file alloc, and a fairness cap on the inner emit loop that keeps bidir sessions from starving the RX side. Performance ----------- * cm.xfer per-file scratch buffers in EmitData. v0.4.0 did SetLength(RawBuf), SetLength(ZBuf), SetLength(Frame) per frame -- three Pascal heap allocations on every DATA packet -- plus a Stream.Position := FTxPos assignment issuing an lseek(2) syscall on TFileStream. Mirrors legacy cometxfer.pas's GetMem-once-per-file pattern: three TBytes fields on the session (FTxRawBuf, FTxZBuf, FTxFrameBuf) sized at FBlockSize once in StartNextOutbound / StartFREQResponse / RPOS, reused across every frame for the file. Cached FTxStreamPos: Int64 tracks the read offset locally so we only Stream.Position := on Start / RPOS / EOF-retry, not per frame. ~30% wallclock improvement on the 10 MB bidir continuous test. Fixed ----- * bidir fairness regression introduced by the scratch-buffer optimization. The unbounded inner emit loop in Step() (added in v0.4.0 to undo v0.3.0's round-trip regression) filled the entire FWindowSize worth of frames per call. With the scratch optimization making each EmitData ~10x cheaper, the inner loop emitted 8 frames in microseconds -- but Pump only reads one inbound frame per call, producing an 8:1 send/recv ratio per cycle that starved the RX side. Visible in test_xfer_continuous as small inbound files not completing before a concurrent big outbound finished. Fix: cap the inner loop at MAX_FRAMES_PER_STEP = 4. Two Step calls now fill an 8-frame window with a Pump in between to drain peer bytes. Tuned empirically: failures start at >= 5, throughput drops noticeably at 1-3. All 12 tests in tests/run_tests.sh PASS.
v0.4.0: HasPendingTx accessor (mirror of fpc-binkp 0.4.0)
2026-04-27 12:37:26 -07:00 · 2026-04-25 10:34:20 -07:00 · 2026-04-24 12:08:19 -07:00
9 changed files with 636 additions and 353 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,111 @@ Semver intent:
 - **minor** — additive features, new hooks, new capability flags
 - **patch** — bug fixes, security hardening, internal perf

+## 0.4.1 — 2026-04-27
+
+Patch release: per-file scratch-buffer optimization on the TX
+hot path.  Drops EmitData from O(allocations) per frame to one
+fixed allocation per file.  Bidir fairness preserved via a
+4-frame-per-Step cap on the inner emit loop.
+
+### Performance
+
+- **`cm.xfer`: per-file scratch buffers in EmitData**.  v0.4.0's
+  `EmitData` did `SetLength(RawBuf, ChunkLen)`, `SetLength(ZBuf,
+  Bound)`, and `SetLength(Frame, HdrLen+...)` per frame -- three
+  Pascal heap allocations on every DATA packet, plus a
+  `Stream.Position := FTxPos` assignment that issued an `lseek(2)`
+  syscall on TFileStream.  Mirrors legacy cometxfer.pas's
+  GetMem-once-per-file pattern: three `TBytes` fields on the
+  session (`FTxRawBuf`, `FTxZBuf`, `FTxFrameBuf`) sized at
+  `FBlockSize` once in `StartNextOutbound` / `StartFREQResponse`
+  / RPOS, reused across every frame for the file.  Cached
+  `FTxStreamPos: Int64` tracks our read offset locally so we
+  only `Stream.Position :=` on Start / RPOS / EOF-retry, not
+  every frame.
+
+  Combined gain on a 10 MB Comet send: ~30% wallclock on the
+  bidir continuous test, more on slower disks.
+
+### Fixed
+
+- **bidir fairness regression introduced by the scratch-buffer
+  optimization**.  The unbounded inner emit loop in
+  `Step()` (introduced in v0.4.0 to undo the v0.3.0 round-trip
+  regression) fills the entire `FWindowSize` worth of frames
+  per call.  With the scratch optimization making each EmitData
+  ~10x cheaper, the existing loop emitted 8 frames in microseconds
+  -- but `Pump` only reads one inbound frame per call, producing
+  an 8:1 send/recv ratio per cycle that starved the RX side.
+  Visible in `test_xfer_continuous` as small inbound files not
+  completing before a concurrent big outbound finished.
+
+  Fix: cap the inner loop at `MAX_FRAMES_PER_STEP = 4`.  Two
+  Step calls now fill an 8-frame window, with a Pump in between
+  to drain peer bytes.  Tuned empirically: failures start at
+  >= 5, throughput drops noticeably at 1-3.
+
+  All 12 tests in `tests/run_tests.sh` PASS at this cap.
+
+## 0.4.0 — 2026-04-25
+
+### Added
+
+- **`TComSession.HasPendingTx: Boolean`** — public read-only
+  accessor.  True iff there are outbound bytes already
+  queued in the send stream that haven't been pushed to
+  the socket yet.  Mirror of the same accessor added to
+  fpc-binkp's `TBPSession` in 0.4.0.  Lets driver loops
+  pass it as the `WantWrite` flag in
+  `Transport.WaitReady`; primarily a perf improvement for
+  high-volume sends (avoids per-frame idle wait).
+- Recommended driver loop pattern:
+  ```pascal
+  while Session.NextStep do
+    Transport.WaitReady(True, Session.HasPendingTx, 50);
+  ```
+
+### Not changed
+
+- Wire protocol unchanged; consumers pinned to 0.3.0 can
+  upgrade in place.  `HasPendingTx` is purely additive --
+  callers that don't use it keep the old behaviour.
+- `CM_MIN_COMPATIBLE_VERSION` stays at `0.1.0`.
+
+## 0.3.0 — 2026-04-24
+
+### Changed
+
+- **X25519 + ChaCha20 primitives moved out to fpc-crypto
+  v0.2.0.**  `cm.crypto.pas` thins down to the
+  Comet-specific session glue (SHA-256 key derivation +
+  paired tx/rx ChaCha20 streams keyed per direction).
+  Pure crypto primitives now live upstream:
+    - `TX25519Key`, `X25519Keypair`, `X25519ScalarMult`,
+      `X25519SharedSecret` → `cr.x25519`
+    - `TChacha20State`, `ChaCha20Init`, `ChaCha20Crypt`
+      → `cr.chacha20`
+  `cm.driver.pas` gains `cr.x25519` in its uses-clause so
+  `TX25519Key` + `X25519Keypair` / `X25519SharedSecret`
+  resolve directly from the primitive library instead of
+  being re-exported through `cm.crypto`.
+- Rationale: X25519 and ChaCha20 are crypto; they belong
+  in the crypto library regardless of consumer count.
+  The "wait for a second consumer" rule that delayed their
+  carve in v0.1.0 / v0.2.0 was appropriate for MD5 (which
+  was duplicated code across three libraries) but doesn't
+  serve primitives like X25519 / ChaCha20 where upstream
+  ownership is the obviously-correct home.
+
+### Not changed
+
+- `TCometCrypt` record still owns the paired ChaCha20
+  states and the SHA-256-chained key derivation -- that's
+  Comet session-layer behaviour, not a primitive.
+- Wire protocol is identical to 0.2.0.  Same test vectors,
+  same interop behaviour.  12/12 local test suite green
+  pre- and post-migration; 6-target cross-build clean.
+
 ## 0.2.0 — 2026-04-23

 ### Changed
--- a/build.sh
+++ b/build.sh
@@ -35,6 +35,8 @@ UNITS=(
    src/cr.ed25519.sc.pas
    src/cr.ed25519.bp.pas
    src/cr.ed25519.ge.pas
+    src/cr.x25519.pas
+    src/cr.chacha20.pas
    src/cm.crypto.pas
    src/cm.zlib.pas
    src/cm.cram.pas
--- a/src/cm.crypto.pas
+++ b/src/cm.crypto.pas
@@ -1,26 +1,30 @@
 {
  Comet - Direct TCP File Transfer for FidoNet
-  cometcrypt.pas - Session encryption: X25519 key exchange + ChaCha20
+  cm.crypto.pas - Session encryption: Comet-specific glue on
+                  top of cr.x25519 + cr.chacha20 + cr.sha

-  Provides full session encryption for the Comet native protocol using
-  keys already configured for ED25519 authentication. No certificates,
-  no external libraries, no CA infrastructure required.
-
-  Key exchange:
-    Both sides generate ephemeral X25519 keypairs and exchange public
-    keys. The shared secret is computed via Diffie-Hellman. Ephemeral
-    keys provide forward secrecy -- compromising long-term keys does
-    not reveal past session content.
-
-  Encryption:
-    ChaCha20 stream cipher (RFC 8439) with per-direction keys and
-    counters. Each frame body (TYPE+SEQ+PAYLOAD+CRC32) is encrypted
-    after CRC computation and decrypted before CRC verification.
+  This unit is the session-layer wrapper that:
+    1. Takes an X25519 shared secret + both ephemeral
+       public keys (produced by cr.x25519) and derives a
+       pair of SHA-256-chained directional keys,
+    2. Keys two cr.chacha20 streams (send / receive), one
+       per direction, into a TCometCrypt record, and
+    3. Exposes CometCryptEncrypt / CometCryptDecrypt so
+       the frame encoder / decoder can XOR frame bodies
+       on the wire.

  Key derivation:
-    session_key = SHA-256(shared_secret || ephemeral_pub_A || ephemeral_pub_B)
-    send_key    = SHA-256(session_key || "comet-send-" || direction_byte)
-    recv_key    = SHA-256(session_key || "comet-recv-" || direction_byte)
+    session_key = SHA-256(shared_secret || pub_A || pub_B)
+                  (pub_A = originator's, pub_B = answerer's;
+                   both sides hash in the same order)
+    Key-A       = SHA-256(session_key || 'A')
+    Key-B       = SHA-256(session_key || 'B')
+    Originator sends with Key-A, receives with Key-B.
+    Answerer   sends with Key-B, receives with Key-A.
+
+  The X25519 key-exchange + ChaCha20 stream primitives
+  themselves live in fpc-crypto (cr.x25519 / cr.chacha20)
+  as of fpc-comet 0.3.0 / fpc-crypto 0.2.0.

  Copyright (C) 2026 Ken Johnson
  License: GPL-2.0
@@ -33,62 +37,21 @@ unit cm.crypto;
 interface

 uses
-  SysUtils, cr.ed25519, cr.sha;
+  SysUtils, cr.sha, cr.x25519, cr.chacha20;

 type
-  { X25519 key types }
-  TX25519Key = array[0..31] of Byte;
-
-  { ChaCha20 cipher state }
-  TChacha20State = record
-    Key:     array[0..31] of Byte;  { 256-bit key                          }
-    Nonce:   array[0..11] of Byte;  { 96-bit nonce (zeroed, counter-based) }
-    Counter: QWord;                 { Block counter                        }
-    Block:   array[0..63] of Byte;  { Current keystream block              }
-    Used:    Integer;               { Bytes used in current block          }
-  end;
-
-  { Session encryption context }
+  { Session encryption context.  One TCometCrypt per
+    session.  Holds both directional ChaCha20 states plus
+    an activation flag. }
  TCometCrypt = record
-    Active:  Boolean;               { Encryption is active                 }
-    TxState: TChacha20State;        { Transmit cipher state                }
-    RxState: TChacha20State;        { Receive cipher state                 }
+    Active:  Boolean;                           { Encryption is active  }
+    TxState: cr.chacha20.TChacha20State;        { Transmit cipher state }
+    RxState: cr.chacha20.TChacha20State;        { Receive cipher state  }
  end;

-
-{ ---- X25519 Key Exchange ---- }
-
-{ Generate a random X25519 keypair.
-  PrivKey is the clamped 32-byte scalar.
-  PubKey is the 32-byte Montgomery u-coordinate. }
-procedure X25519Keypair(out PrivKey, PubKey: TX25519Key);
-
-{ Scalar multiplication on Curve25519 (Montgomery form).
-  R = K * U. Exported for testing. }
-procedure X25519ScalarMult(const K, U: TX25519Key; out R: TX25519Key);
-
-{ Compute shared secret = X25519(our_private, their_public).
-  Returns False if the result is all-zero (bad public key). }
-function X25519SharedSecret(const OurPriv, TheirPub: TX25519Key;
-  out Secret: TX25519Key): Boolean;
-
-
-{ ---- ChaCha20 Stream Cipher ---- }
-
-{ Initialize a ChaCha20 cipher state with a 32-byte key. }
-procedure ChaCha20Init(var State: TChacha20State;
-  const Key: array of Byte);
-
-{ Encrypt/decrypt data in place using ChaCha20 (XOR with keystream). }
-procedure ChaCha20Crypt(var State: TChacha20State;
-  Data: PByte; Len: LongWord);
-
-
-{ ---- Session Encryption ---- }
-
 { Derive send/recv keys from shared secret and ephemeral public keys,
-  and initialize the encryption context. IsOriginator determines
-  key direction (originator's send key = answerer's recv key). }
+  and initialize the encryption context.  IsOriginator determines key
+  direction (originator's send key = answerer's recv key). }
 procedure CometCryptInit(var Crypt: TCometCrypt;
  const SharedSecret, EphPubLocal, EphPubRemote: TX25519Key;
  IsOriginator: Boolean);
@@ -110,250 +73,6 @@ implementation
  session engine logs activation events at its own layer. }


-{ ---- X25519 Montgomery Ladder ---- }
-
-{ Clamp a 32-byte scalar for X25519 per RFC 7748. }
-procedure X25519Clamp(var K: TX25519Key);
-begin
-  K[0]  := K[0] and 248;
-  K[31] := (K[31] and 127) or 64;
-end;
-
-{ Montgomery ladder scalar multiplication on Curve25519.
-  Computes Result = k * u where u is a Montgomery u-coordinate.
-  Uses field element operations from cr.ed25519. }
-procedure X25519ScalarMult(const K, U: TX25519Key; out R: TX25519Key);
-{ Montgomery ladder per RFC 7748 Section 5.
-  Clamps scalar and decodes u-coordinate per spec.
-  All temporaries are distinct to avoid aliasing bugs. }
-var
-  Scalar: TX25519Key;
-  UCopy: TX25519Key;
-  U_fe, X2, Z2, X3, Z3: TFieldElement;
-  tA, tAA, tB, tBB, tE: TFieldElement;
-  tC, tD, tDA, tCB: TFieldElement;
-  tDAp, tDAm, tT: TFieldElement;
-  Swap, KT: LongInt;
-  I: Integer;
-begin
-  { decodeScalar25519: clamp scalar per RFC 7748 }
-  Move(K, Scalar, 32);
-  X25519Clamp(Scalar);
-
-  { decodeUCoordinate: clear high bit per RFC 7748 }
-  Move(U, UCopy, 32);
-  UCopy[31] := UCopy[31] and $7F;
-
-  FE_FromBytes(U_fe, @UCopy[0]);
-  FE_1(X2);
-  FE_0(Z2);
-  FE_Copy(X3, U_fe);
-  FE_1(Z3);
-
-  Swap := 0;
-
-  for I := 254 downto 0 do
-  begin
-    KT := (Scalar[I shr 3] shr (I and 7)) and 1;
-    Swap := Swap xor KT;
-    FE_CSwap(X2, X3, Swap);
-    FE_CSwap(Z2, Z3, Swap);
-    Swap := KT;
-
-    FE_Add(tA, X2, Z2);
-    FE_Sq(tAA, tA);
-    FE_Sub(tB, X2, Z2);
-    FE_Sq(tBB, tB);
-    FE_Sub(tE, tAA, tBB);
-    FE_Add(tC, X3, Z3);
-    FE_Sub(tD, X3, Z3);
-    FE_Mul(tDA, tD, tA);
-    FE_Mul(tCB, tC, tB);
-
-    FE_Add(tDAp, tDA, tCB);
-    FE_Sq(X3, tDAp);
-
-    FE_Sub(tDAm, tDA, tCB);
-    FE_Sq(tDAm, tDAm);
-    FE_Mul(Z3, U_fe, tDAm);
-
-    FE_Mul(X2, tAA, tBB);
-
-    FE_Mul121666(tT, tE);       { 121666 * E                 }
-    FE_Sub(tT, tT, tE);         { 121665 * E  (a24 for Curve25519) }
-    FE_Add(tT, tAA, tT);        { AA + a24 * E               }
-    FE_Mul(Z2, tE, tT);         { z_2 = E * (AA + a24 * E)   }
-  end;
-
-  FE_CSwap(X2, X3, Swap);
-  FE_CSwap(Z2, Z3, Swap);
-
-  FE_Invert(Z2, Z2);
-  FE_Mul(X2, X2, Z2);
-  FE_ToBytes(TED25519PublicKey(R), X2);
-end;
-
-
-{ The X25519 base point (u=9). }
-const
-  X25519_BASEPOINT: TX25519Key = (
-    9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
-    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-  );
-
-procedure X25519Keypair(out PrivKey, PubKey: TX25519Key);
-var
-  Seed: TED25519Seed;
-begin
-  { Generate 32 random bytes }
-  ED25519RandomSeed(Seed);
-  Move(Seed, PrivKey, 32);
-  X25519Clamp(PrivKey);
-
-  { Public key = PrivKey * basepoint }
-  X25519ScalarMult(PrivKey, X25519_BASEPOINT, PubKey);
-end;
-
-
-function X25519SharedSecret(const OurPriv, TheirPub: TX25519Key;
-  out Secret: TX25519Key): Boolean;
-var
-  I: Integer;
-  Check: Byte;
-begin
-  X25519ScalarMult(OurPriv, TheirPub, Secret);
-
-  { Check for all-zero result (invalid public key) }
-  Check := 0;
-  for I := 0 to 31 do
-    Check := Check or Secret[I];
-  Result := Check <> 0;
-end;
-
-
-{ ---- ChaCha20 Stream Cipher (RFC 8439) ---- }
-
-type
-  TChaCha20Block = array[0..15] of LongWord;
-
-function RotL32(X: LongWord; N: Integer): LongWord; inline;
-begin
-  Result := (X shl N) or (X shr (32 - N));
-end;
-
-procedure QuarterRound(var A, B, C, D: LongWord); inline;
-begin
-  A := A + B; D := D xor A; D := RotL32(D, 16);
-  C := C + D; B := B xor C; B := RotL32(B, 12);
-  A := A + B; D := D xor A; D := RotL32(D, 8);
-  C := C + D; B := B xor C; B := RotL32(B, 7);
-end;
-
-procedure ChaCha20Block(const Key: array of Byte;
-  Counter: LongWord; const Nonce: array of Byte;
-  out Output: array of Byte);
-var
-  State, Working: TChaCha20Block;
-  I: Integer;
-begin
-  { Initialize state }
-  State[0]  := $61707865;  { "expa" }
-  State[1]  := $3320646e;  { "nd 3" }
-  State[2]  := $79622d32;  { "2-by" }
-  State[3]  := $6b206574;  { "te k" }
-
-  { Key (little-endian LongWords) }
-  for I := 0 to 7 do
-    State[4 + I] := LongWord(Key[I*4]) or (LongWord(Key[I*4+1]) shl 8) or
-                    (LongWord(Key[I*4+2]) shl 16) or (LongWord(Key[I*4+3]) shl 24);
-
-  { Counter }
-  State[12] := Counter;
-
-  { Nonce (little-endian LongWords) }
-  for I := 0 to 2 do
-    State[13 + I] := LongWord(Nonce[I*4]) or (LongWord(Nonce[I*4+1]) shl 8) or
-                     (LongWord(Nonce[I*4+2]) shl 16) or (LongWord(Nonce[I*4+3]) shl 24);
-
-  { Copy state to working }
-  Working := State;
-
-  { 20 rounds (10 double-rounds) }
-  for I := 1 to 10 do
-  begin
-    { Column rounds }
-    QuarterRound(Working[0], Working[4], Working[8],  Working[12]);
-    QuarterRound(Working[1], Working[5], Working[9],  Working[13]);
-    QuarterRound(Working[2], Working[6], Working[10], Working[14]);
-    QuarterRound(Working[3], Working[7], Working[11], Working[15]);
-    { Diagonal rounds }
-    QuarterRound(Working[0], Working[5], Working[10], Working[15]);
-    QuarterRound(Working[1], Working[6], Working[11], Working[12]);
-    QuarterRound(Working[2], Working[7], Working[8],  Working[13]);
-    QuarterRound(Working[3], Working[4], Working[9],  Working[14]);
-  end;
-
-  { Add original state }
-  for I := 0 to 15 do
-    Working[I] := Working[I] + State[I];
-
-  { Serialize to bytes (little-endian) }
-  for I := 0 to 15 do
-  begin
-    Output[I*4]     := Byte(Working[I]);
-    Output[I*4 + 1] := Byte(Working[I] shr 8);
-    Output[I*4 + 2] := Byte(Working[I] shr 16);
-    Output[I*4 + 3] := Byte(Working[I] shr 24);
-  end;
-end;
-
-
-procedure ChaCha20Init(var State: TChacha20State;
-  const Key: array of Byte);
-begin
-  FillChar(State, SizeOf(State), 0);
-  Move(Key[0], State.Key[0], 32);
-  State.Counter := 0;
-  State.Used := 64;  { Force new block generation on first use }
-end;
-
-
-procedure ChaCha20Crypt(var State: TChacha20State;
-  Data: PByte; Len: LongWord);
-var
-  I: LongWord;
-  Avail: Integer;
-begin
-  I := 0;
-  while I < Len do
-  begin
-    { Generate new keystream block if needed }
-    if State.Used >= 64 then
-    begin
-      ChaCha20Block(State.Key, LongWord(State.Counter),
-        State.Nonce, State.Block);
-      Inc(State.Counter);
-      State.Used := 0;
-    end;
-
-    { XOR data with keystream }
-    Avail := 64 - State.Used;
-    if Avail > Integer(Len - I) then
-      Avail := Integer(Len - I);
-
-    while Avail > 0 do
-    begin
-      Data[I] := Data[I] xor State.Block[State.Used];
-      Inc(I);
-      Inc(State.Used);
-      Dec(Avail);
-    end;
-  end;
-end;
-
-
-{ ---- Session Encryption Setup ---- }
-
 procedure CometCryptInit(var Crypt: TCometCrypt;
  const SharedSecret, EphPubLocal, EphPubRemote: TX25519Key;
  IsOriginator: Boolean);
@@ -388,8 +107,8 @@ begin
  SHA256Final(Ctx, SessionKey);

  { Derive directional keys:
-    Key-A = SHA-256(session_key || "comet-keystream-A")
-    Key-B = SHA-256(session_key || "comet-keystream-B")
+    Key-A = SHA-256(session_key || "A")
+    Key-B = SHA-256(session_key || "B")
    Originator sends with Key-A, receives with Key-B.
    Answerer sends with Key-B, receives with Key-A. }
  DirByte := Ord('A');
--- a/src/cm.driver.pas
+++ b/src/cm.driver.pas
@@ -39,7 +39,7 @@ interface
 uses
  Classes, SysUtils,
  log.types, mb.address,
-  cm.types, cm.frame, cm.cram, cr.ed25519, cm.crypto,
+  cm.types, cm.frame, cm.cram, cr.ed25519, cr.x25519, cm.crypto,
  cm.session, cm.events, cm.transport, cm.provider, cm.config;

 type
@@ -181,6 +181,16 @@ type
    function Phase: TCometPhase;
    function Result_: TCometSessionResult;

+    { True iff there are outbound bytes already queued in the
+      send stream that haven't been pushed to the socket yet.
+      Driver loops use this to choose whether WaitReady should
+      also wait for socket WRITE-readiness -- when True,
+      select() returns the moment the kernel send buffer can
+      take more bytes, so high-volume transfers don't pay a
+      50 ms idle wait per frame.  Mirror of the same accessor
+      on fpc-binkp's TBPSession. }
+    function HasPendingTx: Boolean;
+
    { ---- Hooks for cm.xfer ---- }
    { Queue a complete frame onto the outbound stream.  Bytes
      drain to the transport on the next NextStep / Pump. }
@@ -1288,6 +1298,12 @@ begin
  Result := FResult;
 end;

+function TComSession.HasPendingTx: Boolean;
+begin
+  Result := (FSendStream <> nil) and
+            ((FSendStream.Size - FSendPos) > 0);
+end;
+
 procedure TComSession.EmitFrame(PktType, Seq: Byte; const Buf;
  Len: SizeInt);
 var
--- a/src/cm.version.pas
+++ b/src/cm.version.pas
@@ -21,9 +21,9 @@ interface

 const
  CM_VERSION_MAJOR = 0;
-  CM_VERSION_MINOR = 2;
-  CM_VERSION_PATCH = 0;
-  CM_VERSION       = '0.2.0';
+  CM_VERSION_MINOR = 4;
+  CM_VERSION_PATCH = 1;
+  CM_VERSION       = '0.4.1';

  { Oldest version a consumer pinned to CM_VERSION is
    guaranteed to remain ABI/API-compatible with.  Bumped
--- a/src/cm.xfer.pas
+++ b/src/cm.xfer.pas
@@ -76,6 +76,21 @@ type
    FTxAcked:   Byte;
    FTxAckedPos: Int64;
    FTxStartTime: TDateTime;
+    { Per-file scratch buffers reused across all DATA frames
+      for this file.  Allocated once in StartNextOutbound at
+      FBlockSize, never SetLength'd in the inner loop.
+      Mirrors legacy cometxfer.pas's GetMem(DataBuf,...) /
+      GetMem(CompBuf,...) once-per-file pattern.  Per-frame
+      SetLength was the dominant heap cost in the regressed
+      0.4.0 EmitData path. }
+    FTxRawBuf:   TBytes;       { raw file bytes              }
+    FTxZBuf:     TBytes;       { zlib-compressed staging     }
+    FTxFrameBuf: TBytes;       { final on-wire frame body    }
+    { Cached read offset.  We read sequentially from the TX
+      stream; assigning Stream.Position on every frame is an
+      lseek() syscall on TFileStream.  Track our own offset
+      and only assign Stream.Position on Start / RPOS. }
+    FTxStreamPos: Int64;
    { RPOS loop-detection: peer keeps sending RPOS to the
      same offset means the file is corrupt or the link is
      bad enough to be unrecoverable. }
@@ -472,6 +487,17 @@ begin
  FTxSeq   := 0;
  FTxAcked := 0;
  FTxAckedPos := 0;
+  { Same per-file scratch alloc + position reset that
+    StartNextOutbound does -- FREQ-injected sends share the
+    same EmitData hot path. }
+  if Length(FTxRawBuf) < Integer(FBlockSize) then
+    SetLength(FTxRawBuf, FBlockSize);
+  if Length(FTxZBuf) < Integer(CMZlibBound(FBlockSize)) then
+    SetLength(FTxZBuf, CMZlibBound(FBlockSize));
+  if Length(FTxFrameBuf) < Integer(FBlockSize) + 8 then
+    SetLength(FTxFrameBuf, FBlockSize + 8);
+  FTxItem.Stream.Position := 0;
+  FTxStreamPos := 0;
  EmitFinfo(FTxItem);
  FTxState := txAwaitFinfoAck;
  Log(llInfo, Format('FREQ: serving %s (%d bytes)',
@@ -757,6 +783,7 @@ begin
  end;

  FTxItem.Stream.Position := RPos;
+  FTxStreamPos := RPos;
  FTxPos      := RPos;
  FTxSeq      := 0;
  FTxAcked    := 0;
@@ -772,16 +799,14 @@ end;

 procedure TCometXfer.EmitData;
 var
-  RawBuf:  TBytes;        { uncompressed payload bytes        }
-  Got:     LongInt;
+  Got:      LongInt;
  ChunkLen: Integer;
-  UseZlib: Boolean;
-  ZBuf:    TBytes;        { zlib-compressed staging           }
-  ZLen:    Cardinal;
-  ZRes:    TCometZlibResult;
-  Frame:   TBytes;        { final on-wire frame body          }
-  HdrLen:  Integer;       { 4 (no zlib) or 5 (zlib enabled)   }
-  MaxRaw:  Integer;       { max raw bytes that fit a frame    }
+  UseZlib:  Boolean;
+  ZLen:     Cardinal;
+  ZRes:     TCometZlibResult;
+  HdrLen:   Integer;       { 4 (no zlib) or 5 (zlib enabled)   }
+  MaxRaw:   Integer;       { max raw bytes that fit a frame    }
+  FrameLen: Integer;       { final body length we emit         }
 begin
  if FTxItem.Stream = nil then Exit;
  UseZlib := (FSharedCaps and COPT_ZLIB) <> 0;
@@ -801,9 +826,11 @@ begin
    ChunkLen := Integer(FTxItem.Size - FTxPos);
  if ChunkLen <= 0 then Exit;

-  SetLength(RawBuf, ChunkLen);
-  FTxItem.Stream.Position := FTxPos;
-  Got := FTxItem.Stream.Read(RawBuf[0], ChunkLen);
+  { Sequential read from the file stream.  StartNextOutbound
+    + HandleRpos are the only places that re-Seek the stream;
+    in steady state the stream's position already equals
+    FTxStreamPos (= FTxPos), so we can skip the lseek. }
+  Got := FTxItem.Stream.Read(FTxRawBuf[0], ChunkLen);
  if Got <= 0 then
  begin
    Log(llError, Format('TX read failed at %d on %s',
@@ -811,42 +838,40 @@ begin
    CloseTxFile(False, cmSendFailIO);
    Exit;
  end;
-  if Got < ChunkLen then
-    SetLength(RawBuf, Got);
+  Inc(FTxStreamPos, Got);

  if UseZlib then
  begin
-    SetLength(ZBuf, CMZlibBound(Length(RawBuf)));
-    ZLen := Length(ZBuf);
-    ZRes := CMZlibCompress(RawBuf[0], Length(RawBuf),
-                           ZBuf[0], ZLen);
-    if ZRes = cmZlibOK then
+    ZLen := Length(FTxZBuf);
+    ZRes := CMZlibCompress(FTxRawBuf[0], Got,
+                           FTxZBuf[0], ZLen);
+    if (ZRes = cmZlibOK) and (Integer(ZLen) < Got) then
    begin
      { Compressed AND smaller -- send with comp_type=1. }
-      SetLength(Frame, HdrLen + Integer(ZLen));
-      PutLE32(@Frame[0], LongWord(FTxPos));
-      Frame[4] := 1;        { comp_type = ZLIB }
-      Move(ZBuf[0], Frame[5], ZLen);
+      PutLE32(@FTxFrameBuf[0], LongWord(FTxPos));
+      FTxFrameBuf[4] := 1;        { comp_type = ZLIB }
+      Move(FTxZBuf[0], FTxFrameBuf[5], ZLen);
+      FrameLen := HdrLen + Integer(ZLen);
    end
    else
    begin
      { No gain or buffer issue -- send raw with comp_type=0. }
-      SetLength(Frame, HdrLen + Length(RawBuf));
-      PutLE32(@Frame[0], LongWord(FTxPos));
-      Frame[4] := 0;        { comp_type = NONE }
-      Move(RawBuf[0], Frame[5], Length(RawBuf));
+      PutLE32(@FTxFrameBuf[0], LongWord(FTxPos));
+      FTxFrameBuf[4] := 0;        { comp_type = NONE }
+      Move(FTxRawBuf[0], FTxFrameBuf[5], Got);
+      FrameLen := HdrLen + Got;
    end;
  end
  else
  begin
-    SetLength(Frame, 4 + Length(RawBuf));
-    PutLE32(@Frame[0], LongWord(FTxPos));
-    Move(RawBuf[0], Frame[4], Length(RawBuf));
+    PutLE32(@FTxFrameBuf[0], LongWord(FTxPos));
+    Move(FTxRawBuf[0], FTxFrameBuf[4], Got);
+    FrameLen := 4 + Got;
  end;

-  FSession.EmitFrame(NPKT_DATA, FTxSeq, Frame[0], Length(Frame));
+  FSession.EmitFrame(NPKT_DATA, FTxSeq, FTxFrameBuf[0], FrameLen);
  Inc(FTxSeq);
-  FTxPos := FTxPos + Length(RawBuf);
+  FTxPos := FTxPos + Got;
 end;

 { ---- TX state transitions ---- }
@@ -889,6 +914,24 @@ begin
  FTxSeq   := 0;
  FTxAcked := 0;
  FTxAckedPos := 0;
+  { Pre-allocate the per-file scratch buffers exactly once.
+    FBlockSize is the max raw payload + small headroom.
+    CMZlibBound covers worst-case zlib expansion.  Reused
+    across every DATA frame for this file. }
+  if Length(FTxRawBuf) < Integer(FBlockSize) then
+    SetLength(FTxRawBuf, FBlockSize);
+  if Length(FTxZBuf) < Integer(CMZlibBound(FBlockSize)) then
+    SetLength(FTxZBuf, CMZlibBound(FBlockSize));
+  if Length(FTxFrameBuf) < Integer(FBlockSize) + 8 then
+    SetLength(FTxFrameBuf, FBlockSize + 8);
+  { Sync our cached read offset with the stream's position.
+    Provider returns a fresh stream positioned at 0; record
+    that so EmitData can skip the per-frame Position setter. }
+  if FTxItem.Stream <> nil then
+  begin
+    FTxItem.Stream.Position := 0;
+    FTxStreamPos := 0;
+  end;
  EmitFinfo(FTxItem);
  FTxState := txAwaitFinfoAck;
 end;
@@ -1088,7 +1131,10 @@ begin
  FTxStart    := Int64(Off);
  FTxAckedPos := Int64(Off);
  if FTxItem.Stream <> nil then
+  begin
    FTxItem.Stream.Position := FTxPos;
+    FTxStreamPos := FTxPos;
+  end;
  FTxState := txData;
  Log(llDebug, Format('peer accepted %s @ %d',
    [FTxItem.Name, FTxPos]));
@@ -1264,7 +1310,10 @@ begin
        FTxAcked    := 0;
        FTxAckedPos := 0;
        if FTxItem.Stream <> nil then
+        begin
          FTxItem.Stream.Position := 0;
+          FTxStreamPos := 0;
+        end;
        FTxState := txData;
      end;
    NEOFACK_SKIP:  CloseTxFile(False, cmSendFailPeerSkipped);
@@ -1278,7 +1327,18 @@ end;
  Pumps the TX state machine forward when there's window
  capacity.  Returns True if work was done this step. }

+const
+  { Max DATA frames a single Step() call will emit.  See the
+    matching comment in the txData branch below for the rationale.
+    4 was tuned empirically: the test_xfer_continuous bidir test
+    starts to fail at >= 5 (RX side starves), and 1-3 give back
+    ~30% of the legacy throughput by paying the Step round-trip
+    too often. }
+  MAX_FRAMES_PER_STEP = 4;
+
 function TCometXfer.Step: Boolean;
+var
+  MaxBurst: Integer;
 begin
  Result := False;
  EnsureNegotiated;
@@ -1293,9 +1353,36 @@ begin
      begin
        if (FTxItem.Stream <> nil) and (FTxPos < FTxItem.Size) then
        begin
-          if WindowOpen then
+          { Inner-burst emit, capped at MAX_FRAMES_PER_STEP.
+
+            Legacy cometxfer.pas:751-866 streamed frames in a tight
+            inner loop while the window had room, only yielding when
+            the window closed (waiting on a DATAACK) or the file
+            ended.  fpc-comet 0.3.0 inadvertently dropped that loop
+            and emitted one frame per Step() call, which paid the
+            outer-driver round-trip for every 16-64 KB.  10 MB
+            sends regressed from ~2 s (Comet 1.01) to ~7 s, so
+            v0.4.0 restored the inner loop.
+
+            v0.4.1 caps the inner loop at MAX_FRAMES_PER_STEP (4).
+            The cap matters for bidirectional sessions: the per-file
+            scratch-buffer optimization (introduced same release)
+            makes EmitData ~10x cheaper, and Pump only reads ONE
+            inbound frame per call, so an unbounded inner loop
+            produced an 8:1 send/recv ratio per cycle that starved
+            the RX side -- visible in test_xfer_continuous as small
+            inbound files not completing before a concurrent big
+            outbound finished.  Cap of 4 keeps the emit-burst
+            per Step long enough to amortize Step's overhead
+            (perf preserved), short enough that two Step calls
+            fill a typical 8-frame window with a Pump in between
+            (bidir fairness preserved). }
+          MaxBurst := MAX_FRAMES_PER_STEP;
+          while (MaxBurst > 0) and WindowOpen and
+                (FTxPos < FTxItem.Size) do
          begin
            EmitData;
+            Dec(MaxBurst);
            Result := True;
          end;
        end
--- a/src/cr.chacha20.pas
+++ b/src/cr.chacha20.pas
@@ -0,0 +1,181 @@
+{ cr.chacha20 -- ChaCha20 stream cipher (RFC 8439).
+
+  Incremental ChaCha20 keystream generator with per-state
+  buffering so arbitrary-length plaintext can be streamed
+  across multiple ChaCha20Crypt calls without needing to
+  align on 64-byte blocks.
+
+  Public API:
+
+    ChaCha20Init   Install a 32-byte key, reset counter/nonce/block
+    ChaCha20Crypt  XOR Len bytes of Data with keystream (in place)
+
+  Nonce is 96 bits (RFC 8439 §2.3); caller pre-fills
+  State.Nonce before the first ChaCha20Crypt.  Counter is
+  incremented per 64-byte block.  Same ChaCha20 is used for
+  both encrypt and decrypt (symmetric stream cipher).
+
+  Carved from Comet's cm.crypto 2026-04-24. }
+
+unit cr.chacha20;
+
+{$mode objfpc}{$H+}
+{$R-}{$Q-}  { Range + overflow checks OFF for crypto arithmetic }
+
+interface
+
+uses
+  SysUtils;
+
+type
+  { ChaCha20 cipher state -- 256-bit key, 96-bit nonce,
+    64-bit block counter, 64-byte keystream buffer.  Caller
+    populates Nonce before first ChaCha20Crypt; Key comes
+    from ChaCha20Init; Counter + Block + Used are managed
+    internally. }
+  TChacha20State = record
+    Key:     array[0..31] of Byte;  { 256-bit key                          }
+    Nonce:   array[0..11] of Byte;  { 96-bit nonce (zeroed, counter-based) }
+    Counter: QWord;                 { Block counter                        }
+    Block:   array[0..63] of Byte;  { Current keystream block              }
+    Used:    Integer;               { Bytes used in current block          }
+  end;
+
+{ Initialize a ChaCha20 cipher state with a 32-byte key.
+  Clears nonce + counter; caller sets nonce bytes before
+  calling ChaCha20Crypt.  Used=64 forces first call to
+  generate a fresh block. }
+procedure ChaCha20Init(var State: TChacha20State;
+  const Key: array of Byte);
+
+{ Encrypt/decrypt Len bytes of Data in place.  Operation is
+  symmetric: XOR with keystream.  Safe to call repeatedly
+  on different-length chunks; internal Used/Counter track
+  position. }
+procedure ChaCha20Crypt(var State: TChacha20State;
+  Data: PByte; Len: LongWord);
+
+
+implementation
+
+
+type
+  TChaCha20Block = array[0..15] of LongWord;
+
+function RotL32(X: LongWord; N: Integer): LongWord; inline;
+begin
+  Result := (X shl N) or (X shr (32 - N));
+end;
+
+procedure QuarterRound(var A, B, C, D: LongWord); inline;
+begin
+  A := A + B; D := D xor A; D := RotL32(D, 16);
+  C := C + D; B := B xor C; B := RotL32(B, 12);
+  A := A + B; D := D xor A; D := RotL32(D, 8);
+  C := C + D; B := B xor C; B := RotL32(B, 7);
+end;
+
+procedure ChaCha20Block(const Key: array of Byte;
+  Counter: LongWord; const Nonce: array of Byte;
+  out Output: array of Byte);
+var
+  State, Working: TChaCha20Block;
+  I: Integer;
+begin
+  { Initialize state }
+  State[0]  := $61707865;  { "expa" }
+  State[1]  := $3320646e;  { "nd 3" }
+  State[2]  := $79622d32;  { "2-by" }
+  State[3]  := $6b206574;  { "te k" }
+
+  { Key (little-endian LongWords) }
+  for I := 0 to 7 do
+    State[4 + I] := LongWord(Key[I*4]) or (LongWord(Key[I*4+1]) shl 8) or
+                    (LongWord(Key[I*4+2]) shl 16) or (LongWord(Key[I*4+3]) shl 24);
+
+  { Counter }
+  State[12] := Counter;
+
+  { Nonce (little-endian LongWords) }
+  for I := 0 to 2 do
+    State[13 + I] := LongWord(Nonce[I*4]) or (LongWord(Nonce[I*4+1]) shl 8) or
+                     (LongWord(Nonce[I*4+2]) shl 16) or (LongWord(Nonce[I*4+3]) shl 24);
+
+  { Copy state to working }
+  Working := State;
+
+  { 20 rounds (10 double-rounds) }
+  for I := 1 to 10 do
+  begin
+    { Column rounds }
+    QuarterRound(Working[0], Working[4], Working[8],  Working[12]);
+    QuarterRound(Working[1], Working[5], Working[9],  Working[13]);
+    QuarterRound(Working[2], Working[6], Working[10], Working[14]);
+    QuarterRound(Working[3], Working[7], Working[11], Working[15]);
+    { Diagonal rounds }
+    QuarterRound(Working[0], Working[5], Working[10], Working[15]);
+    QuarterRound(Working[1], Working[6], Working[11], Working[12]);
+    QuarterRound(Working[2], Working[7], Working[8],  Working[13]);
+    QuarterRound(Working[3], Working[4], Working[9],  Working[14]);
+  end;
+
+  { Add original state }
+  for I := 0 to 15 do
+    Working[I] := Working[I] + State[I];
+
+  { Serialize to bytes (little-endian) }
+  for I := 0 to 15 do
+  begin
+    Output[I*4]     := Byte(Working[I]);
+    Output[I*4 + 1] := Byte(Working[I] shr 8);
+    Output[I*4 + 2] := Byte(Working[I] shr 16);
+    Output[I*4 + 3] := Byte(Working[I] shr 24);
+  end;
+end;
+
+
+procedure ChaCha20Init(var State: TChacha20State;
+  const Key: array of Byte);
+begin
+  FillChar(State, SizeOf(State), 0);
+  Move(Key[0], State.Key[0], 32);
+  State.Counter := 0;
+  State.Used := 64;  { Force new block generation on first use }
+end;
+
+
+procedure ChaCha20Crypt(var State: TChacha20State;
+  Data: PByte; Len: LongWord);
+var
+  I: LongWord;
+  Avail: Integer;
+begin
+  I := 0;
+  while I < Len do
+  begin
+    { Generate new keystream block if needed }
+    if State.Used >= 64 then
+    begin
+      ChaCha20Block(State.Key, LongWord(State.Counter),
+        State.Nonce, State.Block);
+      Inc(State.Counter);
+      State.Used := 0;
+    end;
+
+    { XOR data with keystream }
+    Avail := 64 - State.Used;
+    if Avail > Integer(Len - I) then
+      Avail := Integer(Len - I);
+
+    while Avail > 0 do
+    begin
+      Data[I] := Data[I] xor State.Block[State.Used];
+      Inc(I);
+      Inc(State.Used);
+      Dec(Avail);
+    end;
+  end;
+end;
+
+
+end.
--- a/src/cr.version.pas
+++ b/src/cr.version.pas
@@ -19,9 +19,9 @@ interface

 const
  CRYPTO_VERSION_MAJOR = 0;
-  CRYPTO_VERSION_MINOR = 1;
+  CRYPTO_VERSION_MINOR = 2;
  CRYPTO_VERSION_PATCH = 0;
-  CRYPTO_VERSION       = '0.1.0';
+  CRYPTO_VERSION       = '0.2.0';

  CRYPTO_MIN_COMPATIBLE_VERSION = '0.1.0';

--- a/src/cr.x25519.pas
+++ b/src/cr.x25519.pas
@@ -0,0 +1,173 @@
+{ cr.x25519 -- Curve25519 scalar multiplication (RFC 7748).
+
+  X25519 Diffie-Hellman key exchange over Curve25519 in
+  Montgomery form.  Public API:
+
+    X25519Keypair         random clamped scalar + matching u-coord
+    X25519ScalarMult      R = k * u (exported for test vectors)
+    X25519SharedSecret    DH shared secret with zero-check
+
+  Shares field arithmetic (GF(2^255-19)) with cr.ed25519 --
+  Ed25519 and X25519 are the same curve in different
+  coordinate forms, so the FE_* ops from cr.ed25519 are
+  reused rather than duplicated.
+
+  Carved from Comet's cm.crypto 2026-04-24. }
+
+unit cr.x25519;
+
+{$mode objfpc}{$H+}
+{$R-}{$Q-}  { Range + overflow checks OFF for crypto arithmetic }
+
+interface
+
+uses
+  SysUtils, cr.ed25519;
+
+type
+  { 32-byte Curve25519 u-coordinate / scalar.  Same wire
+    layout as TED25519PublicKey / TED25519Seed but kept
+    distinct for type discipline. }
+  TX25519Key = array[0..31] of Byte;
+
+{ Generate a random X25519 keypair.
+  PrivKey is the clamped 32-byte scalar.
+  PubKey is the 32-byte Montgomery u-coordinate (PrivKey * basepoint). }
+procedure X25519Keypair(out PrivKey, PubKey: TX25519Key);
+
+{ Scalar multiplication on Curve25519 (Montgomery form).
+  R = K * U.  Exported for testability; consumers usually
+  call X25519Keypair + X25519SharedSecret instead. }
+procedure X25519ScalarMult(const K, U: TX25519Key; out R: TX25519Key);
+
+{ Compute shared secret = X25519(our_private, their_public).
+  Returns False if the result is all-zero (bad public key --
+  low-order point attack). }
+function X25519SharedSecret(const OurPriv, TheirPub: TX25519Key;
+  out Secret: TX25519Key): Boolean;
+
+
+implementation
+
+
+{ Clamp a 32-byte scalar for X25519 per RFC 7748. }
+procedure X25519Clamp(var K: TX25519Key);
+begin
+  K[0]  := K[0] and 248;
+  K[31] := (K[31] and 127) or 64;
+end;
+
+{ Montgomery ladder scalar multiplication on Curve25519.
+  Computes Result = k * u where u is a Montgomery u-coordinate.
+  Uses field element operations from cr.ed25519. }
+procedure X25519ScalarMult(const K, U: TX25519Key; out R: TX25519Key);
+{ Montgomery ladder per RFC 7748 Section 5.
+  Clamps scalar and decodes u-coordinate per spec.
+  All temporaries are distinct to avoid aliasing bugs. }
+var
+  Scalar: TX25519Key;
+  UCopy: TX25519Key;
+  U_fe, X2, Z2, X3, Z3: TFieldElement;
+  tA, tAA, tB, tBB, tE: TFieldElement;
+  tC, tD, tDA, tCB: TFieldElement;
+  tDAp, tDAm, tT: TFieldElement;
+  Swap, KT: LongInt;
+  I: Integer;
+begin
+  { decodeScalar25519: clamp scalar per RFC 7748 }
+  Move(K, Scalar, 32);
+  X25519Clamp(Scalar);
+
+  { decodeUCoordinate: clear high bit per RFC 7748 }
+  Move(U, UCopy, 32);
+  UCopy[31] := UCopy[31] and $7F;
+
+  FE_FromBytes(U_fe, @UCopy[0]);
+  FE_1(X2);
+  FE_0(Z2);
+  FE_Copy(X3, U_fe);
+  FE_1(Z3);
+
+  Swap := 0;
+
+  for I := 254 downto 0 do
+  begin
+    KT := (Scalar[I shr 3] shr (I and 7)) and 1;
+    Swap := Swap xor KT;
+    FE_CSwap(X2, X3, Swap);
+    FE_CSwap(Z2, Z3, Swap);
+    Swap := KT;
+
+    FE_Add(tA, X2, Z2);
+    FE_Sq(tAA, tA);
+    FE_Sub(tB, X2, Z2);
+    FE_Sq(tBB, tB);
+    FE_Sub(tE, tAA, tBB);
+    FE_Add(tC, X3, Z3);
+    FE_Sub(tD, X3, Z3);
+    FE_Mul(tDA, tD, tA);
+    FE_Mul(tCB, tC, tB);
+
+    FE_Add(tDAp, tDA, tCB);
+    FE_Sq(X3, tDAp);
+
+    FE_Sub(tDAm, tDA, tCB);
+    FE_Sq(tDAm, tDAm);
+    FE_Mul(Z3, U_fe, tDAm);
+
+    FE_Mul(X2, tAA, tBB);
+
+    FE_Mul121666(tT, tE);       { 121666 * E                 }
+    FE_Sub(tT, tT, tE);         { 121665 * E  (a24 for Curve25519) }
+    FE_Add(tT, tAA, tT);        { AA + a24 * E               }
+    FE_Mul(Z2, tE, tT);         { z_2 = E * (AA + a24 * E)   }
+  end;
+
+  FE_CSwap(X2, X3, Swap);
+  FE_CSwap(Z2, Z3, Swap);
+
+  FE_Invert(Z2, Z2);
+  FE_Mul(X2, X2, Z2);
+  FE_ToBytes(TED25519PublicKey(R), X2);
+end;
+
+
+{ The X25519 base point (u=9). }
+const
+  X25519_BASEPOINT: TX25519Key = (
+    9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+  );
+
+procedure X25519Keypair(out PrivKey, PubKey: TX25519Key);
+var
+  Seed: TED25519Seed;
+begin
+  { Generate 32 random bytes }
+  ED25519RandomSeed(Seed);
+  Move(Seed, PrivKey, 32);
+  X25519Clamp(PrivKey);
+
+  { Public key = PrivKey * basepoint }
+  X25519ScalarMult(PrivKey, X25519_BASEPOINT, PubKey);
+end;
+
+
+function X25519SharedSecret(const OurPriv, TheirPub: TX25519Key;
+  out Secret: TX25519Key): Boolean;
+var
+  I: Integer;
+  Check: Byte;
+begin
+  X25519ScalarMult(OurPriv, TheirPub, Secret);
+
+  { Check for all-zero result (invalid public key -- low-order
+    point attack; RFC 7748 §6.1 mandates rejection). }
+  Check := 0;
+  for I := 0 to 31 do
+    Check := Check or Secret[I];
+  Result := Check <> 0;
+end;
+
+
+end.