Text management using a piece chain

The core of this editor is a persistent data structure called a piece table, which supports all modifications in O(m) time, where m is the number of non-consecutive editing operations. This bound could be improved to O(log m) by using a balanced search tree; however, the additional complexity does not seem to be worth it for now.

The actual data is stored in buffers which are strictly append only. There are two types of buffers: a single fixed-size one holding the original file content, and multiple append-only ones storing the modifications.

A text, i.e. a sequence of bytes, is represented as a doubly linked list of pieces, each with a pointer into a buffer and an associated length. Pieces are never deleted but are instead kept around for undo/redo support. A span is a range of pieces, given by a start and an end piece. Changes to the text are always performed by swapping out an existing, possibly empty, span with a new one.
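
The structures described above could be sketched in C roughly as follows. The names Piece, Span and span_swap are illustrative assumptions, not necessarily the editor's actual identifiers. The swap assumes that the outward links of the new span (start->prev and end->next) were already set when its pieces were created.

```c
#include <assert.h>
#include <stddef.h>

/* One piece: a view onto `len` bytes of some buffer (hypothetical layout). */
typedef struct Piece Piece;
struct Piece {
    Piece *prev, *next;   /* neighbours in the doubly linked list */
    const char *data;     /* points into an append-only buffer */
    size_t len;           /* number of bytes this piece covers */
};

/* A span: an inclusive range of pieces; start == NULL means empty. */
typedef struct {
    Piece *start, *end;
} Span;

/* Replace `old` with `new_` in the chain. The pieces of `old` are left
 * untouched, so calling this again with the arguments exchanged reverts
 * the change. */
static void span_swap(Span old, Span new_)
{
    if (!old.start && !new_.start) {
        return;                                  /* nothing to do */
    } else if (!old.start) {                     /* pure insertion */
        new_.start->prev->next = new_.start;
        new_.end->next->prev = new_.end;
    } else if (!new_.start) {                    /* pure deletion */
        old.start->prev->next = old.end->next;
        old.end->next->prev = old.start->prev;
    } else {                                     /* replacement */
        old.start->prev->next = new_.start;
        old.end->next->prev = new_.end;
    }
}
```

Only the neighbours outside the old span are rewired, which is why a swapped-out span can later be swapped back in unchanged.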

An empty document is represented by two special sentinel pieces which always exist:

/-+ --> +-\
| |     | |
\-+ <-- +-/
 #1     #2

Loading a file from disk is as simple as mmap(2)-ing it into a buffer, creating a corresponding piece, and adding it to the doubly linked list. Hence loading a file is a constant-time operation, i.e. independent of the actual file size (assuming the operating system uses demand paging).
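
A minimal sketch of the loading step, assuming a POSIX system. FileBuffer, file_buffer_load and file_buffer_release are hypothetical names used only for illustration; the single piece for the file would then simply point at buf.data with length buf.len.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical buffer holding the original file content. */
typedef struct {
    const char *data;   /* start of the read-only mapping, NULL if empty */
    size_t len;         /* file size in bytes */
} FileBuffer;

/* Map `path` read-only. Returns 0 on success, -1 on failure.
 * An empty file yields a zero-length buffer without calling mmap,
 * because mapping zero bytes fails with EINVAL. */
static int file_buffer_load(const char *path, FileBuffer *buf)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }
    buf->data = NULL;
    buf->len = (size_t)st.st_size;
    if (buf->len > 0) {
        void *map = mmap(NULL, buf->len, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) {
            close(fd);
            return -1;
        }
        buf->data = map;
    }
    close(fd);   /* the mapping stays valid after the descriptor is closed */
    return 0;
}

/* Unmap the file content again. */
static void file_buffer_release(FileBuffer *buf)
{
    if (buf->data)
        munmap((void *)buf->data, buf->len);
    buf->data = NULL;
    buf->len = 0;
}
```

No byte of the file is read here; pages are only faulted in when the editor first touches them.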

/-+ --> +-----------------+ --> +-\
| |     | I am an editor! |     | |
\-+ <-- +-----------------+ <-- +-/
 #1             #3              #2

Insert

Inserting a chunk of data amounts to appending the new content to a modification buffer, followed by the creation of new pieces. An insertion in the middle of an existing piece requires the creation of 3 new pieces: two of them hold references to the text before and after the insertion point, respectively, while the third one points to the newly added text.

/-+ --> +---------------+ --> +----------------+ --> +--+ --> +-\
| |     | I am an editor|     |which sucks less|     |! |     | |
\-+ <-- +---------------+ <-- +----------------+ <-- +--+ <-- +-/
 #1            #4                   #5                #6      #2

       modification buffer content: "which sucks less"

During this insertion operation the old span [3,3] has been replaced by the new span [4,6]. Notice that the pieces in the old span were not changed; they therefore still point to their predecessors/successors and can thus be swapped back in.

If the insertion point happens to be at a piece boundary, the old span is empty, and the new span only consists of the newly allocated piece.
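
The insertion cases above could look roughly like this in C. This is a sketch under assumptions: piece_new, piece_insert and the Change record are hypothetical names, the new text is expected to have been appended to a modification buffer already, and allocation failure simply aborts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct Piece Piece;
struct Piece { Piece *prev, *next; const char *data; size_t len; };
typedef struct { Piece *start, *end; } Span;   /* start == NULL: empty */
typedef struct { Span old, new_; } Change;     /* what was swapped for what */

static Piece *piece_new(Piece *prev, Piece *next, const char *data, size_t len)
{
    Piece *p = malloc(sizeof *p);
    if (!p)
        abort();   /* keep the sketch short: no error recovery */
    p->prev = prev; p->next = next; p->data = data; p->len = len;
    return p;
}

/* Rewire the neighbours of `old` to `new_`; `old` itself stays untouched. */
static void span_swap(Span old, Span new_)
{
    if (!old.start && !new_.start) {
        return;
    } else if (!old.start) {
        new_.start->prev->next = new_.start;
        new_.end->next->prev = new_.end;
    } else if (!new_.start) {
        old.start->prev->next = old.end->next;
        old.end->next->prev = old.start->prev;
    } else {
        old.start->prev->next = new_.start;
        old.end->next->prev = new_.end;
    }
}

/* Insert `len` bytes at `data` at byte offset `off` (0 <= off <= p->len)
 * relative to piece `p`. */
static Change piece_insert(Piece *p, size_t off, const char *data, size_t len)
{
    Change c = { { NULL, NULL }, { NULL, NULL } };
    if (off == 0 && p->prev) {
        /* piece boundary in front of p: old span empty, one new piece */
        c.new_.start = c.new_.end = piece_new(p->prev, p, data, len);
    } else if (off == p->len) {
        /* piece boundary behind p (also covers the first sentinel) */
        c.new_.start = c.new_.end = piece_new(p, p->next, data, len);
    } else {
        /* middle of p: text before, new text, text after */
        Piece *before = piece_new(p->prev, NULL, p->data, off);
        Piece *mid = piece_new(before, NULL, data, len);
        Piece *after = piece_new(mid, p->next, p->data + off, p->len - off);
        before->next = mid;
        mid->next = after;
        c.old.start = c.old.end = p;
        c.new_.start = before;
        c.new_.end = after;
    }
    span_swap(c.old, c.new_);
    return c;
}
```

Applied to the example, inserting the 16 bytes "which sucks less" at offset 14 of piece #3 yields the three pieces #4, #5 and #6, and span_swap(c.new_, c.old) restores the original chain.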

Delete

Similarly, a delete operation splits the pieces at the appropriate places.

/-+ --> +-----+ --> +--+ --> +-\
| |     | I am|     |! |     | |
\-+ <-- +-----+ <-- +--+ <-- +-/
 #1       #7         #6      #2

Here the old span [4,5] got replaced by the new span [7,7]. The underlying buffers remain unchanged.
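
A deletion along these lines could be sketched as follows. The name range_delete is hypothetical, and the sketch assumes that the range lies completely within the text and that off is smaller than the length of the first affected piece.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct Piece Piece;
struct Piece { Piece *prev, *next; const char *data; size_t len; };
typedef struct { Piece *start, *end; } Span;   /* start == NULL: empty */
typedef struct { Span old, new_; } Change;

static Piece *piece_new(Piece *prev, Piece *next, const char *data, size_t len)
{
    Piece *p = malloc(sizeof *p);
    if (!p)
        abort();   /* keep the sketch short: no error recovery */
    p->prev = prev; p->next = next; p->data = data; p->len = len;
    return p;
}

/* Rewire the neighbours of `old` to `new_`; `old` itself stays untouched. */
static void span_swap(Span old, Span new_)
{
    if (!old.start && !new_.start) {
        return;
    } else if (!old.start) {
        new_.start->prev->next = new_.start;
        new_.end->next->prev = new_.end;
    } else if (!new_.start) {
        old.start->prev->next = old.end->next;
        old.end->next->prev = old.start->prev;
    } else {
        old.start->prev->next = new_.start;
        old.end->next->prev = new_.end;
    }
}

/* Delete `len` bytes starting at offset `off` inside piece `p`. */
static Change range_delete(Piece *p, size_t off, size_t len)
{
    Change c = { { NULL, NULL }, { NULL, NULL } };
    if (len == 0)
        return c;
    /* find the piece holding the last deleted byte; afterwards `end_off`
     * is the end of the deleted range relative to the start of `last` */
    Piece *last = p;
    size_t end_off = off + len;
    while (end_off > last->len) {
        end_off -= last->len;
        last = last->next;
    }
    /* keep what survives at the front of `p` and at the back of `last` */
    Piece *before = NULL, *after = NULL;
    if (off > 0)
        before = piece_new(p->prev, last->next, p->data, off);
    if (end_off < last->len)
        after = piece_new(before ? before : p->prev, last->next,
                          last->data + end_off, last->len - end_off);
    if (before && after)
        before->next = after;
    c.old.start = p;
    c.old.end = last;
    c.new_.start = before ? before : after;
    c.new_.end = after ? after : before;
    span_swap(c.old, c.new_);
    return c;
}
```

For the example, deleting 26 bytes starting at offset 4 of piece #4 removes the rest of #4 and all of #5, leaving a single new piece #7 of length 4 in front of #6.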

Cache

Notice that the common case of appending text to a given piece is fast, since the new data is simply appended to the buffer and the piece length is increased accordingly. In order to keep the number of pieces down, the most recently edited piece is cached and changes to it are done in place (this is the only time buffers are modified in a non-append-only way). As a consequence, such in-place changes cannot be undone on their own.
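
The fast path for appending to the cached piece might look like the following sketch. Buffer, Piece (reduced here to the two fields that matter) and cache_append are hypothetical names; the key assumption is that the shortcut only applies while the cached piece ends exactly where the used part of its buffer ends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical modification buffer with a fixed capacity. */
typedef struct {
    char *data;
    size_t len;   /* bytes used so far */
    size_t cap;   /* total capacity */
} Buffer;

/* Simplified piece: list links omitted for brevity. */
typedef struct {
    const char *data;
    size_t len;
} Piece;

/* Try to grow the cached piece in place. Returns false if the piece does
 * not end at the end of the buffer's used region or if the buffer is
 * full; the caller then has to fall back to a regular insert. */
static bool cache_append(Piece *cached, Buffer *buf, const char *text, size_t n)
{
    if (!cached || cached->data + cached->len != buf->data + buf->len)
        return false;
    if (n > buf->cap - buf->len)
        return false;
    memcpy(buf->data + buf->len, text, n);
    buf->len += n;
    cached->len += n;   /* no new piece, no span swap */
    return true;
}
```

Because no old and new span are recorded on this path, there is nothing that could be swapped back for the appended bytes alone.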

Undo/redo

Since the buffers are append only and the spans/pieces are never destroyed, undo/redo functionality is implemented by swapping the required spans/pieces back in.

As illustrated above, each change to the text is recorded by an old and a new span. An action consists of multiple changes which logically belong together and should thus also be reverted together. For example, a search and replace operation is one action with possibly many changes all over the text.
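
Grouping changes into an action could be sketched like this. The fixed-size array, MAX_CHANGES and the function names are assumptions made to keep the example short. Undo walks the changes from newest to oldest, redo from oldest to newest, so that every swap sees the chain in the state it was recorded against.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Piece Piece;
struct Piece { Piece *prev, *next; const char *data; size_t len; };
typedef struct { Piece *start, *end; } Span;   /* start == NULL: empty */
typedef struct { Span old, new_; } Change;

enum { MAX_CHANGES = 16 };   /* arbitrary limit for this sketch */

/* An action: changes that belong together, in the order they were made. */
typedef struct {
    Change changes[MAX_CHANGES];
    size_t count;
} Action;

/* Rewire the neighbours of `old` to `new_`; `old` itself stays untouched. */
static void span_swap(Span old, Span new_)
{
    if (!old.start && !new_.start) {
        return;
    } else if (!old.start) {
        new_.start->prev->next = new_.start;
        new_.end->next->prev = new_.end;
    } else if (!new_.start) {
        old.start->prev->next = old.end->next;
        old.end->next->prev = old.start->prev;
    } else {
        old.start->prev->next = new_.start;
        old.end->next->prev = new_.end;
    }
}

/* Revert all changes of the action, newest first. */
static void action_undo(const Action *a)
{
    for (size_t i = a->count; i-- > 0; )
        span_swap(a->changes[i].new_, a->changes[i].old);
}

/* Reapply all changes of the action, oldest first. */
static void action_redo(const Action *a)
{
    for (size_t i = 0; i < a->count; i++)
        span_swap(a->changes[i].old, a->changes[i].new_);
}
```

A search and replace that touches two pieces would record two changes in one Action, and a single action_undo call reverts both.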

The text states can be marked by means of a snapshotting operation. Snapshotting saves a new node to the history graph and creates a fresh Action to which future changes will be appended until the next snapshot.

Actions make up the nodes of a connected digraph, each representing a state of the file at some time during the current editing session. The edges of the digraph represent state transitions that are supported by the editor. The edges are implemented as four Action pointers (prev, next, earlier, and later).

The editor operations that execute the four aforementioned transitions are undo, redo, earlier, and later, respectively. Undo and redo behave in the traditional manner, changing the state one Action at a time. Earlier and later, however, traverse the states in chronological order, which may occasionally involve undoing and redoing many Actions at once.
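
One way the four pointers could be maintained is sketched below. Only the field names prev, next, earlier and later come from the description above; the History struct, the function names and the linking policy (redo follows the most recently created branch) are assumptions. Applying the recorded span swaps while moving between states is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Action Action;
struct Action {
    Action *prev, *next;       /* followed by undo and redo */
    Action *earlier, *later;   /* chronological neighbours */
};

typedef struct {
    Action *current;   /* state the text is currently in */
    Action *latest;    /* most recently created state */
} History;

/* Link a freshly created action into the graph. */
static void history_add(History *h, Action *a)
{
    a->prev = h->current;
    a->next = NULL;
    a->earlier = h->latest;
    a->later = NULL;
    if (h->current)
        h->current->next = a;   /* redo now leads to the new branch */
    if (h->latest)
        h->latest->later = a;
    h->current = h->latest = a;
}

/* Step back one action; returns false at the oldest state. */
static bool history_undo(History *h)
{
    if (!h->current || !h->current->prev)
        return false;
    h->current = h->current->prev;
    return true;
}
```

After undoing a state B back to A and then making a new change C, redo from A leads to C, while B stays reachable from C through the earlier pointer.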

Marks

Because we are working with a persistent data structure, marks can be represented as pointers into the underlying (append-only) buffers. To get the position of an existing mark it suffices to traverse the list of pieces and perform a range query on the associated buffer segments. This also integrates nicely with the undo/redo mechanism. If a span is swapped out, all contained marks (pointers) become invalid because they are no longer reachable from the piece chain. Once an action is undone and the corresponding span swapped back in, the marks become visible again. No explicit mark management is necessary.
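
Resolving a mark to a position could be sketched as follows. Mark, EPOS and mark_get are hypothetical names. The address comparison goes through uintptr_t because relational comparison of pointers into different buffers is not defined in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Piece Piece;
struct Piece { Piece *prev, *next; const char *data; size_t len; };

typedef const char *Mark;       /* a mark is just a buffer address */
#define EPOS ((size_t)-1)       /* "no such position" */

/* Return the byte offset of `mark` in the text starting at the sentinel
 * `begin`, or EPOS if no piece in the chain currently covers it. */
static size_t mark_get(const Piece *begin, Mark mark)
{
    uintptr_t m = (uintptr_t)mark;
    size_t pos = 0;
    /* visit the pieces between the two sentinels */
    for (const Piece *p = begin->next; p && p->next; p = p->next) {
        uintptr_t lo = (uintptr_t)p->data;
        if (m >= lo && m - lo < p->len)
            return pos + (size_t)(m - lo);
        pos += p->len;
    }
    return EPOS;
}
```

In the running example, a mark on the "!" resolves to offset 14 before the insertion, to 30 after it, and to 4 after the deletion, without the mark itself ever being updated.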

Properties

The main advantage of the piece chain as described above is that the cost of all operations is independent of the file size and instead linear in the number of pieces, i.e. editing operations. The original file buffer never changes, which means the mmap(2) can be performed read-only, making optimal use of the operating system's virtual memory / paging system.

The maximum editable file size is limited by the amount of memory a process is allowed to map into its virtual address space; this shouldn't be a problem in practice. The whole process assumes that the file can be used as is. In particular, the editor assumes that all input and the file itself are encoded as UTF-8. Supporting other encodings would require conversion using iconv(3) or similar upon loading and saving the document.

Similarly, the editor has to cope with the fact that lines can be terminated either by \n or \r\n. There is no conversion to a line-based structure in place. Instead, the whole text is exposed as a sequence of bytes. All addressing happens by means of zero-based byte offsets from the start of the file.
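
Translating a zero-based byte offset into a piece and an offset within it is a linear walk over the chain, as in this sketch. Location and text_locate are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Piece Piece;
struct Piece { Piece *prev, *next; const char *data; size_t len; };

/* A position inside the chain: which piece, and how far into it. */
typedef struct {
    const Piece *piece;   /* NULL if the offset is past the end */
    size_t off;
} Location;

/* Find the piece containing byte `pos` of the text that starts at the
 * sentinel `begin`. Runs in time linear in the number of pieces. */
static Location text_locate(const Piece *begin, size_t pos)
{
    size_t cur = 0;   /* offset of the first byte of the current piece */
    for (const Piece *p = begin->next; p && p->next; p = p->next) {
        if (pos < cur + p->len)
            return (Location){ p, pos - cur };
        cur += p->len;
    }
    return (Location){ NULL, 0 };
}
```

The byte at the located position is then simply loc.piece->data[loc.off].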

The main disadvantage of the piece chain data structure is that the text is not stored contiguously in memory, which makes seeking around somewhat harder. This also implies that standard library calls like the regex(3) functions cannot be used as is. However, this is the case for all but the simplest data structures used in text editors.
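
One simple, if not the most efficient, way around the non-contiguous storage is to copy the range of interest into a contiguous buffer first. The sketch below shows such a copy; text_copy is a hypothetical name, and whether the editor itself works this way is not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct Piece Piece;
struct Piece { Piece *prev, *next; const char *data; size_t len; };

/* Copy up to `n` bytes of the text, starting at byte offset `pos`, into
 * `out`. Returns the number of bytes copied, which is smaller than `n`
 * if the text ends first. */
static size_t text_copy(const Piece *begin, size_t pos, char *out, size_t n)
{
    size_t cur = 0;      /* offset of the first byte of the current piece */
    size_t copied = 0;
    for (const Piece *p = begin->next; p && p->next && copied < n; p = p->next) {
        if (pos < cur + p->len) {
            size_t off = pos - cur;
            size_t take = p->len - off;
            if (take > n - copied)
                take = n - copied;
            memcpy(out + copied, p->data + off, take);
            copied += take;
            pos += take;   /* continue at the start of the next piece */
        }
        cur += p->len;
    }
    return copied;
}
```

After appending a terminating NUL byte, the copy can be handed to functions that expect a C string, such as regexec(3).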
