Skip to content

fix: preserve split UTF-8 across PersistentTerminal feed() calls#9

Merged
remorses merged 2 commits intoremorses:mainfrom
aarcamp:multibyte-split-fix
Mar 6, 2026
Merged

fix: preserve split UTF-8 across PersistentTerminal feed() calls#9
remorses merged 2 commits intoremorses:mainfrom
aarcamp:multibyte-split-fix

Conversation

@aarcamp
Copy link
Copy Markdown
Contributor

@aarcamp aarcamp commented Mar 6, 2026

When reading from a PTY, byte chunks arrive at arbitrary boundaries. Multi-byte UTF-8 characters (emoji = 4 bytes, CJK = 3 bytes, accented chars = 2+ bytes) can be split across consecutive read() calls. The old code in PersistentTerminal.feed():

// OLD - each call decoded independently
str = data.toString("utf-8")      // Buffer path
str = new TextDecoder("utf-8").decode(data)  // Uint8Array path

Both of these treat each chunk as a complete, self-contained UTF-8 sequence. When a multi-byte character is split (e.g., the first 2 bytes of a 4-byte emoji in one call, the remaining 2 in the next), each independent decode produces U+FFFD replacement characters (�) instead of the intended character.

- use a persistent TextDecoder in streaming mode for Buffer/Uint8Array input
- share binary UTF-8 decoding across ptyToJson, ptyToText, and ptyToHtml
- accept Uint8Array in ghostty-terminal ansi/feed typings
- add regression tests for split multibyte chunk feeds
@remorses
Copy link
Copy Markdown
Owner

remorses commented Mar 6, 2026

is there a way to remove hasBinaryDecoderState state and use the decoder directly? cc @codex review

@aarcamp
Copy link
Copy Markdown
Contributor Author

aarcamp commented Mar 6, 2026

is there a way to remove hasBinaryDecoderState state and use the decoder directly? cc @codex review

Yes, it's an optimization. We could swap it out for simpler logic as follows -- it just means some extra TextDecoder() allocations in certain conditions:

diff --git a/src/ffi.ts b/src/ffi.ts
index 79bc6a3..6a4244c 100644
--- a/src/ffi.ts
+++ b/src/ffi.ts
@@ -255,7 +255,6 @@ export class PersistentTerminal {
   private _rows: number
   private _destroyed = false
   private _streamDecoder = new TextDecoder("utf-8")
-  private _hasBinaryDecoderState = false

   constructor(options: PersistentTerminalOptions = {}) {
     if (!native) {
@@ -297,14 +296,9 @@ export class PersistentTerminal {
     this.assertNotDestroyed()

     if (typeof data === "string") {
-      if (this._hasBinaryDecoderState) {
-        const flushed = this._streamDecoder.decode()
-        this._streamDecoder = new TextDecoder("utf-8")
-        this._hasBinaryDecoderState = false
-        if (flushed.length > 0) {
-          native!.feedTerminal(this._id, flushed)
-        }
-      }
+      // Discard any partial UTF-8 bytes from prior binary feeds before
+      // crossing into a plain string boundary.
+      this._streamDecoder = new TextDecoder("utf-8")

       if (data.length > 0) {
         native!.feedTerminal(this._id, data)
@@ -312,7 +306,6 @@ export class PersistentTerminal {
       return
     }

-    this._hasBinaryDecoderState = true
     const decoded = this._streamDecoder.decode(data, { stream: true })
     if (decoded.length > 0) {
       native!.feedTerminal(this._id, decoded)
@@ -337,7 +330,6 @@ export class PersistentTerminal {
     this.assertNotDestroyed()
     native!.resetTerminal(this._id)
     this._streamDecoder = new TextDecoder("utf-8")
-    this._hasBinaryDecoderState = false
   }

   /**
@@ -414,7 +406,6 @@ export class PersistentTerminal {
     if (this._destroyed) return
     this._destroyed = true
     this._streamDecoder = new TextDecoder("utf-8")
-    this._hasBinaryDecoderState = false
     native!.destroyTerminal(this._id)
   }

I can either push this update, or add comments to the existing PR so it's clearer what's happening, just lmk. :)

@remorses
Copy link
Copy Markdown
Owner

remorses commented Mar 6, 2026

Yeah I like this one better. I will publish after you push this commit

- remove _hasBinaryDecoderState bookkeeping
- recreate the streaming TextDecoder on string feeds, reset, and destroy
- add a regression test for binary-to-string feed boundaries
@aarcamp aarcamp force-pushed the multibyte-split-fix branch from fcc9405 to 6262592 Compare March 6, 2026 16:09
@remorses remorses merged commit 750870e into remorses:main Mar 6, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants