The third edition

Published Apr 14, 2016

For context, in this article I was considering a linked list of lines for the purposes of being able to do fast insertions and deletions. That approach has since been discontinued in favour of the listor.

The third version of my editor is now in progress. The first version was a learning vehicle for C++; having gained some basic knowledge, the second version of the editor focussed on the weaknesses introduced in the first version, namely, the difficulties encountered in the visual portion. The third (and hopefully final!) version of the editor focuses on the following requirements:

- handling huge (multi-gigabyte) files
- iterators as the unifying abstraction
- multi-threading

Why?

Big data is becoming more and more important to me -- I regularly open >1GB files and it's a pain using VED (it takes a long time). You certainly can't manipulate such files (g//d, g//.m, %s///) in any reasonable time period, and going back and forth to the command line to do streaming ops (that should be handled by the editor in the first place) is a horrible kludge.

Iterators are, I think, the correct solution for unifying the visual part with the internal operations of things like buffers and the various visual commands.
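
As a rough illustration (the names here are just placeholders, not the real code), the unifying iterator could look something like this: both the screen-drawing code and a command like g// walk lines through the same interface, without knowing how the buffer stores them.

```cpp
// Minimal sketch of the unifying-iterator idea (hypothetical names).
#include <cstddef>
#include <memory>
#include <string>

struct LineIter {
    virtual ~LineIter() = default;
    virtual bool valid() const = 0;               // still pointing at a line?
    virtual const std::string& line() const = 0;  // text of the current line
    virtual void next() = 0;                      // advance to the next line
};

struct Buffer {
    virtual ~Buffer() = default;
    // The visual layer asks for an iterator at the window's top line;
    // a command asks for one at the start of its addressed range.
    virtual std::unique_ptr<LineIter> iterAt(std::size_t lineIndex) = 0;
};

// Both consumers end up with the same loop shape:
//   for (auto it = buf.iterAt(top); it->valid(); it->next()) { /* draw or match it->line() */ }
```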

Multi-thread support is the only way I can see to scale the editor up to absorb all available CPU; otherwise I'm going to run into single-threaded speed bounds, and I'd just as soon not do that.

An undercurrent to the conversation is that of line numbers. Yes, I've mostly weaned myself off of line numbers, but I still think of them as having meaning during visual mode -- maybe this is a problem that I'm perpetuating myself. I find that for window positioning, an "easy" algorithm is to use the "line number" as an anchor (indicating the top line of the window to display). For ordering lines, that is, in trying to determine if the source comes before or after the destination in arbitrary visual and line-mode commands, line numbers seem like the "obvious" solution. They also feel "optimal" in updating the marks when manipulating lines between window and journal contexts.
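
To make the mark-updating cost concrete, here's a toy sketch (hypothetical types, not the editor's actual ones) of a marks table keyed by line number: any delete above a mark forces a renumbering pass over the whole table.

```cpp
// Toy sketch: marks keyed by line number must be renumbered after a delete.
#include <cstddef>
#include <map>

using Marks = std::map<char, std::size_t>;  // mark name -> line number

// After deleting `count` lines starting at line `first`, every mark at or
// below the deleted range has to be adjusted.
void adjustMarksAfterDelete(Marks& marks, std::size_t first, std::size_t count) {
    for (auto& mark : marks) {
        std::size_t& lineNo = mark.second;
        if (lineNo >= first + count)
            lineNo -= count;        // mark was below the hole: shift it up
        else if (lineNo >= first)
            lineNo = first;         // mark was inside the deleted range: clamp it
    }
}
```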

Like I said, maybe it's me, but I think there are probably ways to keep the lines "intact" during all operations, and to combine this with multi-threaded support; that is, to break the buffer up into (relatively) equally sized chunks for each thread to work on independently. I need to think about what a synchronization operation costs after each operation performed -- in visual mode, everything has zero cost except scrolling, which has to be fast. Everything else is triggered by a human-speed interaction, and thus isn't as time sensitive. However, command-line operations are where the real performance needs to be. Issuing a g// operation results in a map/reduce problem that (a) needs to be thread distributed, and (b) needs to operate efficiently -- including its synchronization ops. For example, you certainly wouldn't want a delete operation scanning from the start of the file to figure out which line numbers it had deleted in order to update the marks tables.
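
As a sketch of that map/reduce shape (the function and its inputs are hypothetical, and it assumes the lines are already in memory), a g//-style count could split the lines into roughly equal chunks, let each thread own its partial result, and synchronize only once at the join:

```cpp
// Sketch: thread-distributed map/reduce over buffer chunks (hypothetical API).
#include <algorithm>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

std::size_t countMatches(const std::vector<std::string>& lines,
                         const std::string& pattern,
                         unsigned nThreads)   // assumed >= 1
{
    std::vector<std::size_t> partial(nThreads, 0);   // one slot per thread, no locking
    std::vector<std::thread> workers;
    const std::size_t chunk = (lines.size() + nThreads - 1) / nThreads;

    for (unsigned t = 0; t < nThreads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end   = std::min(lines.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                if (lines[i].find(pattern) != std::string::npos)
                    ++partial[t];            // map step: each thread owns its chunk
        });
    }
    for (auto& w : workers) w.join();        // the only synchronization point

    std::size_t total = 0;                   // reduce step: merge partial counts
    for (std::size_t n : partial) total += n;
    return total;
}
```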

Maybe the solution is to structure it more as a library. Create a base class, with an iterator, that knows how to multi-thread distribute a vast file space, with programmable limits on caching vs. disk. As a library, the onus is on us to create a clean, non-entangled interface: a set of function calls that allow the creation of a "window" with its own journalling, redo/undo, and disk-swap features, and of course iterator and multi-threaded map/reduce functionality. Once this is done, the visual component and the command-line component "simply" interface to it.
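
Something like the following outline (purely hypothetical names, declarations only) captures the shape of that library: the core owns the file space, caching limits, and threading, while the visual and command-line components only ever see windows, iterators, and a map/reduce call.

```cpp
// Declaration-only sketch of the library interface described above (hypothetical).
#include <cstddef>
#include <functional>
#include <memory>
#include <string>

class EditCore {
public:
    struct Limits { std::size_t cacheBytes; std::size_t maxThreads; };

    EditCore(const std::string& path, Limits limits);

    // A "window" carries its own journalling, undo/redo, and disk-swap state.
    class Window {
    public:
        void insertLine(std::size_t at, const std::string& text);
        void deleteLines(std::size_t first, std::size_t count);
        void undo();
        void redo();
    };
    std::unique_ptr<Window> openWindow();

    // Thread-distributed map/reduce over the whole file space: mapFn runs
    // per line in parallel, reduceFn merges the per-chunk results.
    std::size_t mapReduce(std::function<std::size_t(const std::string&)> mapFn,
                          std::function<std::size_t(std::size_t, std::size_t)> reduceFn);
};
```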

That's the current plan, anyway.

I find it interesting that nobody else has solved this problem (yet). There are editors out there (search Stack Overflow for "best editor to handle huge files"), but they have limits like 2G lines or 256GB of data size -- already on the edge of my requirements. Does nobody else have these requirements? What is everyone else doing for huge files? [And before you answer: I'm a great fan of the Unix philosophy of splitting up huge files using tools and stream processing and so on, but there are times when you just need to operate on the whole thing as one big chunk.]