Cracking the Editor's Back
Published Dec 24, 2020
In my Editor article, I talked about having a vector of lines; that
is, if your editor has 1,234 lines of text in a document, then there's
an array of 1,234 char * pointers to the actual text.
This doesn't scale. I use the standard C++ STL vector to hold the pointers to lines (and other data; each line ends up taking 24 bytes of overhead).
It's fine for small files, but as the file grows, we keep resizing the
vector. Under the covers, the vector does the same kinds of things that
you'd do in plain ordinary C if you were managing the growth of an
array: you'd call realloc().
The problem is, if the array can't be
extended in place (that is, there's no free space immediately after the
array), realloc() will have no option but to allocate a new chunk of
memory, and copy the existing content into it.
Now, C++'s vector is a little smarter than just a plain realloc() in
that it anticipates growth. It allocates the data area to be somewhat
bigger than required, and makes a note of the overallocation. If the
next allocation fits within the overallocated area, it's a win. If it
doesn't, then it has to do what realloc() does.
The improvement I made in my editor is that I keep a two-level arrangement of data structures:
struct OneLine
{
    uint32_t flags;    // line flags
    const char *ptr;   // pointer to line content
    OneList *list;     // optional list
};
#define BlockSizePower 4
#define BlockSize (1 << BlockSizePower)
#define BlockSizeMask (BlockSize - 1)
#define BlockNumber(_x) ((_x) >> BlockSizePower)
#define BlockOffset(_x) ((_x) & BlockSizeMask)
// content is stored in a number of blocks
class Listor
{
    ...
    int nused_;
    vector<OneLine *> blocks_;
};
Let's look at this.
Each line is of type OneLine. The structure is explained in detail in
the Sharding article; for our purposes it contains a pointer to the
line of text.
The BlockSizePower indicates the size of the block as a power of two.
Yes, it's tiny! I have that in place for testing, because I wanted to
exercise cross-block boundaries. I think a reasonable value would be
more like 10 or so (that is, it's currently 4, which means there are 16
lines per block; 10 would give you 1,024 lines per block and take 24kB
at a time).
The blocks_ vector holds a pointer to an array of lines, and the nused_
variable tells me how many of the lines are in use (we allocate based
on block boundaries, but use based on the actual number of lines).
With the tiny value of BlockSizePower given above, a 35-line file will
take three blocks. That is, the vector<OneLine *> blocks_ will have 3
pointers to 16-element arrays: the first holds lines 0..15, the second
16..31, and the third 32..35, with the rest of its slots unused. (Note
that line numbering starts at 1, so there's a dummy line 0 for
convenience.)
The two-level arrangement isn't as horrible to work with as you might
initially suspect. I provide a block_resize() function that manages the
blocks:
void
Listor::block_resize (size_t target)
{
    uint32_t target_blocks = BlockNumber (target + BlockSize - 1);
    uint32_t current_blocks = blocks_.size();

    // if shrinking, free the blocks that fall off the end
    if (target_blocks < current_blocks) {
        for (uint32_t i = target_blocks; i < current_blocks; i++) {
            delete [] blocks_ [i];
        }
    }
    // if we're making any changes to the size at all, do so now
    if (target_blocks != current_blocks) {
        blocks_.resize (target_blocks);
    }
    // if growing, allocate and zero the new blocks
    if (target_blocks > current_blocks) {
        for (uint32_t i = current_blocks; i < target_blocks; i++) {
            blocks_ [i] = new OneLine [BlockSize];
            memset (blocks_ [i], 0, sizeof (OneLine) * BlockSize);
        }
    }
    nused_ = target;
}
Once the block size management is taken care of, all you then need to do is traverse lines. Here's a sample of a utility function that tags a line range for later processing:
void
Listor::tag_range (lnum_t start, lnum_t end)
{
    while (start <= end) {
        flags (start, flags (start) | FlagTagged);
        start++;
    }
}
Conclusion
By decoupling the access to the line data via a small getter/setter:
const char *
ptr (uint64_t lnum)
{
    return ((blocks_ [BlockNumber(lnum)] + BlockOffset(lnum)) -> ptr);
}

void
ptr (uint64_t lnum, const char *p)
{
    (blocks_ [BlockNumber(lnum)] + BlockOffset(lnum)) -> ptr = p;
}

uint32_t
flags (uint64_t lnum)
{
    return ((blocks_ [BlockNumber(lnum)] + BlockOffset(lnum)) -> flags);
}

void
flags (uint64_t lnum, uint32_t f)
{
    (blocks_ [BlockNumber(lnum)] + BlockOffset(lnum)) -> flags = f;
}
I'm free to experiment with changing the underlying implementation of the storage for line pointers.
Frequently Anticipated Questions
- With a large monolithic array, you can just use memmove() to move the lines around (for example, when deleting a line). How do you manage that with this arrangement? In the case of deleting one line, yes, I have to manually go around and copy OneLine entries, potentially from one block to another (that's why I set the BlockSizePower to such a ridiculously small number initially -- this is the behaviour I wanted to test). Generally, though, the case I'm optimizing for is not deleting one line, but rather a g/rexp/d style of operation, where some significant number of lines are deleted. In that case, to compress the lines, I end up doing a copy with source and target pointers anyway -- so whether that crosses a block or not doesn't make much of a difference.
memmove()to move the lines around (for example, when deleting a line). How do you manage that with this arrangement? In the case of deleting one line, yes, I have to manually go around and copyOneLineentries potentially from one block to another (that's why I set theBlockSizePowerto such a ridiculously small number initially -- this is the behaviour I wanted to test). Generally, though, the case I'm optimizing for is not deleting one line, but rather ag/rexp/dstyle of operation where some significant number of lines are deleted. In that case, to compress the lines, I end up doing a copy with source and target pointers anyway -- so whether that crosses a block or not doesn't make much of a difference.