Cracking the Editor's Back
Published Dec 24, 2020
In my Editor article, I talked about having a vector of lines; that
is, if your editor has 1,234 lines of text in a document, then there's
an array of 1,234 char * pointers to the actual text.
This doesn't scale. I use the standard C++ STL vector to hold the pointers to lines (and other data; each line ends up taking 24 bytes of overhead).
It's fine for small files, but as the file grows, we keep resizing the
vector. Under the covers, the vector does the same kinds of things that
you'd do in plain ordinary C if you were managing the growth of an
array: you'd call realloc().
The problem is, if the array can't be
extended in place (that is, there's no free space immediately after the
array), realloc() will have no option but to allocate a new chunk of
memory, and copy the existing content into it.
Now, C++'s vector is a little smarter than just a plain realloc() in
that it anticipates growth. It allocates the data area to be somewhat
bigger than required, and makes a note of the overallocation. If the
next allocation fits within the overallocated area, it's a win. If it
doesn't, then it has to do what realloc() does.
The improvement I made in my editor is that I keep a two-level arrangement of data structures:
struct OneLine
{
    uint32_t flags;    // line flags
    const char *ptr;   // pointer to line content
    OneList *list;     // optional list
};
#define BlockSizePower 4
#define BlockSize (1 << BlockSizePower)
#define BlockSizeMask (BlockSize - 1)
#define BlockNumber(_x) ((_x) >> BlockSizePower)
#define BlockOffset(_x) ((_x) & BlockSizeMask)
// content is stored in a number of blocks
class Listor
{
    ...
    int nused_;
    vector<OneLine *> blocks_;
};
Let's look at this.
Each line is of type OneLine. The structure is explained in detail in
the Sharding article; for our purposes it contains a pointer to the
line of text.
The BlockSizePower indicates the size of the block as a power of two.
Yes, it's tiny! I have that in place for testing, because I wanted to
exercise cross-block boundaries. I think a reasonable value would be
more like 10 or so (that is, it's currently 4, which means there are 16
lines per block; 10 would give you 1,024 lines per block and take 24kB
at a time).
The blocks_ vector holds a pointer to an array of lines, and the nused_
variable tells me how many of the lines are in use (we allocate based
on block boundaries, but use based on the actual number of lines).
With the tiny value of BlockSizePower given above, a 35-line file will
take three blocks. That is, the vector<OneLine *> blocks_ will have 3
pointers to 16-element arrays: the first holds lines 0..15, the second
16..31, and the third 32..35, with the rest of its slots unused. (Note
that line numbering starts at 1, so there's a dummy line 0 for
convenience.)
The two-level arrangement isn't as horrible to work with as you might
initially suspect. I provide a block_resize() function that manages the
blocks:
void
Listor::block_resize (size_t target)
{
    uint32_t target_blocks = BlockNumber (target + BlockSize - 1);
    uint32_t current_blocks = blocks_.size();

    // if shrinking, free the blocks that fall off the end
    if (target_blocks < current_blocks) {
        for (uint32_t i = target_blocks; i < current_blocks; i++) {
            delete [] blocks_ [i];
        }
    }
    // if we're making any changes to the size at all, do so now
    if (target_blocks != current_blocks) {
        blocks_.resize (target_blocks);
    }
    // if growing, allocate and zero the new blocks
    if (target_blocks > current_blocks) {
        for (uint32_t i = current_blocks; i < target_blocks; i++) {
            blocks_ [i] = new OneLine [BlockSize];
            memset (blocks_ [i], 0, sizeof (OneLine) * BlockSize);
        }
    }
    nused_ = target;
}
Once the block size management is taken care of, all you then need to do is traverse lines. Here's a sample of a utility function that tags a line range for later processing:
void
Listor::tag_range (lnum_t start, lnum_t end)
{
    while (start <= end) {
        flags (start, flags (start) | FlagTagged);
        start++;
    }
}
Conclusion
By decoupling the access to the line data via a small getter/setter:
const char *
ptr (uint64_t lnum)
{
    return ((blocks_ [BlockNumber(lnum)] + BlockOffset(lnum)) -> ptr);
}

void
ptr (uint64_t lnum, const char *p)
{
    (blocks_ [BlockNumber(lnum)] + BlockOffset(lnum)) -> ptr = p;
}

uint32_t
flags (uint64_t lnum)
{
    return ((blocks_ [BlockNumber(lnum)] + BlockOffset(lnum)) -> flags);
}

void
flags (uint64_t lnum, uint32_t f)
{
    (blocks_ [BlockNumber(lnum)] + BlockOffset(lnum)) -> flags = f;
}
I'm free to experiment with changing the underlying implementation of the storage for line pointers.
Frequently Anticipated Questions
- With a large monolithic array, you can just use memmove() to move the lines around (for example, when deleting a line). How do you manage that with this arrangement? In the case of deleting one line, yes, I have to manually go around and copy OneLine entries, potentially from one block to another (that's why I set the BlockSizePower to such a ridiculously small number initially -- this is the behaviour I wanted to test). Generally, though, the case I'm optimizing for is not deleting one line, but rather a g/rexp/d style of operation, where some significant number of lines are deleted. In that case, to compress the lines, I end up doing a copy with source and target pointers anyway -- so whether that crosses a block or not doesn't make much of a difference.
memmove()to move the lines around (for example, when deleting a line). How do you manage that with this arrangement? In the case of deleting one line, yes, I have to manually go around and copyOneLineentries potentially from one block to another (that's why I set theBlockSizePowerto such a ridiculously small number initially -- this is the behaviour I wanted to test). Generally, though, the case I'm optimizing for is not deleting one line, but rather ag/rexp/dstyle of operation where some significant number of lines are deleted. In that case, to compress the lines, I end up doing a copy with source and target pointers anyway -- so whether that crosses a block or not doesn't make much of a difference.