Learning C++ by Writing a vi-like Editor


Forward to Part 2


2015 03 08

After decades of being unable (unwilling?) to learn C++, I finally said “screw it,” and decided to learn the language. This is my journal. In it, I give a step-by-step account of writing a vi-like editor, with copious code snippets and commentary on design/architectural decisions I made along the way (i.e., not just the “what” but also the “why” and “how”). The project is by no means finished; I add new content almost daily.

I strongly recommend the site “CPlusPlus.com” — it has excellent reference material and great examples!

Background

As people on the web are fond of saying, “There's not much information out there for the internals of editor implementations.” I'll add a +1 to that.

A buddy of mine (Paul) wrote a vi-like clone in 1989, and we've been porting it to various systems as we encountered them. He eventually abandoned it, and I've been maintaining it since. It's vi-like, so it has a few quirks that would probably annoy real vi users, but it has enough features in common that it serves as an excellent case study.

The editor started life in 1989 on a 16-bit QNX system, and was ported to VAX/VMS, OpenBSD, FreeBSD, and Linux, in various flavours of 16, 32 and finally 64-bit compiler models. It's a testament to Paul's original design that it has survived this long — I think 26 years is a good run! The editor is reasonably small, even today it's under 9k LOC.

The problems with the old editor were numerous, and were mainly as a result of the age of the code and our abilities at the time:

  1. The backbone of the editor, the line data structures, consisted of a dynamically reallocated array, with each element pointing to a line. This caused performance problems (inserting or deleting a line caused a shuffle of all affected lines in the array).
  2. Working line buffers, where each major subsystem would operate on “the line” were distributed haphazardly amongst the subsystems, and copying of the line buffers was rampant.
  3. The visual mode editor was 1,400 lines of complicated switch/case logic that was on the edge of maintainability.
  4. Undo was broken from the start, and I ended up finishing breaking it a few years ago, so I removed it entirely.
  5. Various functions need to be next-match based rather than next-matching-line based, for example dn doesn't work at all, and if it did, it would probably delete all the lines inclusive from here to the next match, when really it needs to delete from the cursor on the current line to the character before the next match (and it may involve joining the current line to the line containing said next match).

New Version

So what I hoped to accomplish with the new version was to learn C++, take advantage of ubiquitous multiprocessing and big memory, and produce a “sonic screwdriver” — something that I could whip out and point at any problem and not have it fail me.

Specifically, I wanted to achieve:

  1. fast line insert/delete (well, fast operations in general, but these were the existing bottlenecks),
  2. multi-threaded map/reduce style of operation for g// commands,
  3. “instant on” for huge files,
  4. object oriented “clean” design.

Base Objects

Fundamentally, a text editor lives and dies by files, lines, and buffers. So in my mind, this was the first order of business. Like I said in the intro, I was leery of learning C++ because it seemed to me to be the source of much bloat. During the early days of the editor project, there were many moments of seriously considering abandoning the project and just going back to a straight-C implementation.

Speed Surprise

So the first thing I wanted to see was, “how fast can I load the Giant File?” (I have a standard “Giant File” that I test with, it's 1.3M lines and weighs in at 556MB — not the “biggest file ever” but big enough). My benchmark consisted of loading the file and hitting :q and timing it. In the old editor, it was an unimpressive 13 seconds. In the Ubuntu 14.04LTS distribution of VIM, it was 1.8 seconds. Using the following C++ code:

int
Window::load (string fname)
{
    fstream     fs;
    string      s;

    fs.open (fname.c_str(), fstream::in);
    cout << "Loading... " << fname;
    while (getline (fs, s)) {
        lines.push_back (new Line (s));
    }
    fs.close ();
    dot = lines.begin();
    prompt = ":";

    cout << " / << lines.size() << endl;
}

It was 0.9 seconds.

Wow.

My main concerns coming into this were that it was going to be horribly slow, but it turned out to be faster than VIM, which I was impressed with. (OK, doing a wc -l giantfile only took 0.18 seconds, whereas wc giantfile took 2.8 seconds, but still.)

Window::lines is a list, which as I understand the underlying implementation is a doubly-linked list of objects. This immediately brought about another concern — how slow would it be to access a specific line number? I had come from the line-oriented UNIX V6 ed where you would specify line numbers directly, so this was a concern. I wrote a simple test program that allowed me to type in the line number, and it would print the line. Loading the Giant File, and typing 1311029 (one less than the number of lines in the file) came back almost instantly. The search algorithm was a horrible linear search, as shown in the prototypical “first command handler”:

typedef list <Line *> Line_l;
typedef Line_l::iterator Line_i;

void
edit (void)
{
    string      command;
    int         n;

    for (;;) {
        cout << (*window) -> prompt;
        cout.flush ();
        if (getline (cin, command)) {
            if (command [0] == 'q') {
                return;
            }
            n = atoi (command.c_str());
            for (Line_i it = (*window) -> lines.begin (); it != (*window) -> lines.end (); ++it) {
                if (--n == 0) {
                    cout << "Line " << n << ": " << (*it) -> content << endl;
                    break;
                }
            }
        }
    }
}

Note the possible reversion-to-C-style via the typedefs. I mean, seriously? All that typing? :-)

I then thought long and hard about where I would actually need to access the line number, and came up with the following use cases:

  1. on display, you need to be able to show what line number you're on,
  2. as a direct-access “goto” (for example, when you encounter “syntax error on line 77”, you want to be able to go to line 77)
  3. as a target for line oriented operations (e.g., g/spud/.m77)

The first two cases are interactive, meaning that “almost instantly” is fast enough. (Doing a bit of benchmarking showed that it was somewhere around 170 ms — like I said, fast enough.) The last case could be optimized — I could lookup the target line once, and then use the looked-up iterator as the line.

It's instructive to note where you specifically do not need to deal with line numbers:

  1. mark commands (e.g., ma from visual mode)
  2. command line relative commands (e.g., :.,.+10j)
  3. cursor movement commands (e.g., j from visual mode)
  4. ... and so on.

Well, it's instructive for me, anyway, because the original editor implemented the first one by storing the line number in the mark buffer, and then adjusting the mark buffers when the line pointers were adjusted.

The Classes

So, my initial definition for the classes used in the editor was as follows. A Window is the largest class, and holds one file's worth of information. Windows are organized in a list, because I want to be able to edit multiple files simultaneously (e.g., e *.cpp).

// in window.hpp:
class Window {
public:
    string  name;                   // the file name
    string  prompt;                 // line editor prompt
    Line_l  lines;                  // the lines constituting the file
    Line_i  dot;                    // the current line

    Window (string fname);          // constructor with filename
    Window ();                      // constructor with no filename
    int load (string fname);
};

// in main.c, as globals:
Window_l windows;                   // all editor windows
Window_i window;                    // current window

Yeah, I know, making everything public is a bad idea, but I'm just prototyping at this point, trying to get the feel for how things work and so on.

So, a Window consists of a name, prompt, a number of lines (which are the actual lines in the file) and an iterator known as dot which points to the current line (this is a line-oriented concept). What's the Line type look like, you might ask?

class Line
{
public:
    string          content;
    int             flags;
    Line (string s) { content = s; }
    const string operator[] (int n) { return ("hello"); }
};

Nothing too tricky; a string for the line itself, a flags integer to hold various line selection flags (e.g., if the line has been tagged by a g command), and an initializer for the line.

This is where I ran into my first problem. I had been trying to overload the operator[] to be able to access a line by number. I had been hoping to use syntax like the following in the prototypical “first command handler” (above) instead of the iterator loop I showed:

n = atoi (command.c_str());
cout << "Line " << n << ": " << (*window) -> lines [n].content << endl;

but kept getting compiler errors:

g++ -c -o edit.o edit.cpp
edit.cpp: In function 'void edit()':
edit.cpp:56:55: error: no match for 'operator[]' (operand types are 'std::list<Line*>' and 'int')
    cout << "Line " << n << ": " << (*window) -> lines [n].content << endl;
                                                       ^

If anyone has any ideas, please let me know! :-)

In the meantime, I've simply abandoned the operator[] approach, and instead used a member function:

class Window {
    Line_l          lines;      // the lines constituting the file
public:
    string          name;       // the file name
    string          prompt;     // line editor prompt
    Line_i dot;                 // the current line

    Window (string fname);      // constructor with filename
    Window ();                  // constructor with no filename
    int load (string fname);

    // access function
    Line * line (int n) {
        if (n < 1 || n > lines.size ()) {
            throw (EDERR_LINERANGE);
        }
        for (Line_i it = lines.begin (); it != lines.end (); ++it) {
            if (--n == 0) {
                return (*it);
            }
        }
    }
};

Here I've created a line member function which can be used almost like the operator[] approach:

n = atoi (command.c_str());
cout << "Line " << n << ": " << (*window) -> line (n).content << endl;

As you can see from the above, I also had to learn about try, throw and catch. In the main, I now have:

int
main (int argc, char **argv)
{
    optproc (argc, argv);
    for (;;) {
        try {
            edit ();
            exit (EXIT_SUCCESS);
        }
        catch (int e) {
            cout << "An exception occured, number " << e << endl;
        }
        catch (exception &e) {
            cout << "Standard exception: " << e.what() << endl;
        }
    }
    exit (EXIT_SUCCESS);
}

So if you try to go to a zero or negative line number, or one that's bigger than the number of lines in the file, the Window class will throw an EDERR_LINERANGE exception. This gets picked up by main and a (currently uninformative) message is displayed.

The Command Handler

I figured the next thing to start on, now that I had a working model for the lines, was the command handler. We're talking about the line-based handler, that is, the one that you invoke when you are in vi and hit the “:” key.

Generally speaking, the command handler handles commands of the form: zero, one or two addresses, followed by an optional command. The addresses are interesting, they can be line numbers, marks, or search patterns, including some arithmetic operations on those. For example, all of the following are legitimate addresses and commands:

1,$d
/while/,/}/s/^/ /
44l
g/^./.,.+j
/3/;.+3y"A
'a,'bw spud.txt

The commands, in order of appearance, (1) delete all lines in the file; (2) find the next occurrence of while and from there until the next occurrence of a closing brace, substitute for the beginning of the line with a tab; (3) print line 44 in an unambiguous manner; (4) find all lines that have at least one character on them and join them to the next line; (5) search for the next line containing a three, set the current line counter to that line, and yank from there to the three lines following appending into buffer A; and (6) write lines from mark a through to mark b to the file called spud.txt.

What an awesome editor. What a pain in the ass to implement!

I also plan to add a G command which effects the remainder of the command line over all of the loaded windows (rather than just the current one).

Address Parsing

2015 03 10

So in approaching the address parsing, I came up with an Address class which basically holds one addr iterator and a bunch of access functions:

class Address
{
    Line_i  addr;                                   // the target line
public:
    Line_i address () { return addr; };             // return iterator directly
    void address (Line_i a) { addr = a; };          // set iterator directly
    void address (string s, string::size_type &i);  // set iterator by string at position (and update)
    void address (string s);                        // set iterator by string (position assumed irrelevant)
    void address (int lnum);                        // set iterator to given line number
};

My thinking was along the lines of “since we've ditched the whole specify-addresses-as-numbers thing, why not have the addr be an iterator that points into the list of Lines?” This then meant that I would create a bunch of different access functions (all called address, of course), which would allow me to set the address iterator to point to a specific line. The string based access functions would parse the left or right hand side of the addresses shown in the example above.

Things went swimmingly for the first pass of the string-based functions:

void
Address::address (string s, string::size_type &i)
{
    int     offset;

    if (s.length () <= i) {
        return;
    }

    offset = 0;
    if (s [i] == '.') {
        addr = (*window) -> dot;
        i++;
    } else if (s [i] == '$') {
        addr = (*window) -> line ((*window) -> nlines ());
        i++;
    } else if (isdigit (s [i])) {
        addr = (*window) -> line (atoi (s.c_str () + i));
        while (i < s.length () && isdigit (s [i])) i++;
    }
    // handle +/- logic here
}

I got into trouble, though, when I went to add +n and -n logic. At this point, I was really thirsty for line numbers, and not these stupid iterators! What I really wanted to do was add the following after the “handle +/- logic here” comment:

    if (s.length () > i) {
        if (s [i] == '+') {
            addr += atoi (s.c_str () + i + 1);
        } else if (s [i] == '-') {
            addr -= atoi (s.c_str () + i + 1);
        }
    }

That is, simply add or subtract the specified amount just like I would if I was dealing with line numbers. Two problems came up, though — I might accidentally fall off the front or back of the list, and of course the compiler wasn't helping either:

g++  -g  -c -o edit.o edit.cpp
edit.cpp: In member function 'void Address::address(std::string, std::basic_string::size_type&)':
edit.cpp:84:9: error: no match for 'operator+=' (operand types are 'std::list::iterator {aka std::_List_iterator}' and 'int')
    addr += atoi (s.c_str () + i + 1);
         ^
edit.cpp:86:9: error: no match for 'operator-=' (operand types are 'std::list::iterator {aka std::_List_iterator}' and 'int')
    addr -= atoi (s.c_str () + i + 1);
         ^

Hmmm.... (deep thought).

Here's how I fixed it:

    if (s.length () > i) {
        if (s [i] == '+') {
            i++;
            addr = (*window) -> dot_plus (atoi (s.c_str () + i));
            skip_digits (s, i);
        } else if (s [i] == '-') {
            i++;
            addr = (*window) -> dot_plus (-atoi (s.c_str () + i));
            skip_digits (s, i);
        }
    }

It's a little messy — I can only increment i once I know that I have a “+” or “-” character, and then I need to skip the digits I accumulated.

The dot_plus function isn't too bad either:

Line_i
Window::dot_plus (int val)
{
    Line_i it = dot;

    if (!val) {
        return (dot);
    }

    if (val > 0) {
        while (val--) {
            if (++it == lines.end ()) {
                throw (EDERR_LINERANGE);
            }
        }
    } else {
        while (val++ && it != lines.begin ()) --it;
        if (val != 1) {
            throw (EDERR_LINERANGE);
        }
    }

    return (it);
}

I'm obviously missing something about the begin and end member functions of lists — I feel I'm doing something wrong, in that the two cases (going forward and going in reverse) are not symmetrical code-wise. I think this is going to be a key problem, because the whole iterator concept is designed around:

for (class::iterator i = class.begin (); i != class.end (); ++i) {
    // do stuff with (*i)
}

The main takeaway here is that begin is always accessible, and end is always inaccessible. This does present a small design paradigm shift. Suppose my file consists of ten lines, called 1 through 10. Obviously, begin points to line 1. But end, however, points to one past 10.

The Difference Between C++ Iterators and Editor Addresses

2015 03 11

Resolved! The main difference between standard C++ iterators and editor line numbers is that the size of the set of iterators is one bigger than the size of the set of editor line numbers. Or, put another way, editor line numbers are only valid in an inclusive range (e.g., 1 through 10 from the example above), whereas C++ iterators have one more element, the “end of list” element. So, this means that all of my editor loops must be structured as:

for (class::iterator i = class.begin (); /* nothing to see here */ ; ++i) {
    // do stuff with (*i)
    if (i == class.end ()) {
        break;
    }
}

You may be thinking, “but why can't you just add one more to the end value, and then just go ahead and use the value as per normal C++ convention?” Good question. Simple answer: dot. dot gets in the way. Consider the following, which is a general model for how processing in the editor works:

for ((*window) -> dot = start (); (*window) -> dot != end (); ++(*window) -> dot) {
    // do stuff with (*window) -> dot
}

Using the standard C++ “end as a special value” advances dot too far (by one entry). That means that I'd need to always have code at the end of the for loop that resets dot back one entry. And if dot was already at the first entry (at line 1), then there would need to be special logic there to not advance dot before the begin position of the list. Sounds like more work than handling the exit condition at the tail end of processing.

But wait, there are other problems! Consider the following commands:

g/pattern/.-2,.+2d
g/pattern/?first?;/second/d
g/pattern/.,.+j
g/pattern/s/$/\nnew line added/

Nasty stuff!

The first one deletes two lines before the matched pattern through to two lines after the matched pattern (the editor works by setting dot through the range of the g command). This means that dot becomes its own victim for deletion, and is removed from the list of lines — this makes it hard for g to continue to the “next” value of dot.

The second example shows that we can't simply sniff the command and try and figure out where the deletion takes place (and dance around it). That's because we can be tricky and specify that we go back from the current dot to the previous occurrence of the string “first,” and then from there delete until the next occurrence of the string “second,” both of which could be pretty much anywhere in the file.

The third is just another example of what can go wrong — here we're inviting the editor to join every second line to the previous one. (j says “join all lines in the specified address range.”) This is another cause of the constantly disappearing dot.

The fourth is an example of adding in new lines; not really a problem in this case, but worth mentioning for consistency.

Bummer. You'd almost want to go back to line numbers, but there's no need!

The old version of the editor implemented g by taking the address range given to g and “tagging” each line that matched. So, “30,50g/./s/^/\t/” would examine lines 30 through 50 inclusive, and find all the lines that had at least one character (i.e., non-blank lines). These lines would be tagged. Then, the command processor would go through all lines (optimized, perhaps, by the range given to g) and set dot to each tagged line in turn, invoking the command executioner on the line. This way, if any lines got added or consumed, it wouldn't really matter to the command processor.

You could speculate that if a line was moved, as in “g/pattern/.m33” for example, that some tagged-but-not-yet-processed lines might be moved out of range of the g command and would therefore be missed or processed twice. The rule of the g command is that it tags the lines; what happens to them is not its problem. For example, tagging two consecutive lines and having them each cause themselves and their next line to move somewhere, only ends up moving two lines total:
1
2
3
4
5
6
7
8
9
10
11
And executing
g/[34].,.+m7
You'd expect that line 3 would get tagged and cause lines 3 and 4 to move to after line 7, and that line 4 would get tagged, and cause either lines 4 and 5 to move after line 7, or somehow magically cause the new location of line 4 and its next line to move to after line 7. As a benchmark, vim moves lines 3 and 4 to after line 7 and is done. In practice, this doesn't come up much, in that the vast majority of g cases that I use are single line change commands, and a few are line delete or line insert. The only move case I use with any regularity is either g/pattern/.m0 or g/pattern/.m$ when I want to extract and group lines.

So this suggests an implementation. Retaining dot as an iterator of type list, we can process just like we did in the old editor. That is, take a first pass to mark all lines in range that match, and take a second pass to process matching lines. In the functions that do the work, we note if dot is a target, and if so, we reset dot backwards until we get outside of the affected range. For example, g/^spud/d, to delete all lines that match “spud” would tag lines 3 and 7:

1   this
2   is
3   spud
4   and
5   his
6   friend
7   spud
8   the
9   potato

We would start dot at line 1, “this,” and walk until we hit a tag (line 3). When we delete line 3, in the delete function we notice that we're deleting dot. At that point, we would walk dot back one entry to move it out of the way of the delete. Then, we would proceed at the next value of dot, which in this case is the next line after line 2 — line 4 (because line 3 is now gone).

In order to prototype this, I've decided to do the d command next.

Delete

Well, that was stupidly easy. I added a d handler into the switch/case of the main command decoder (which I'll present here for clarity in its entirety):

bool
Command::execute (void)
{
    Window_i w;

    switch (c) {
    case    'd':
        (*window) -> delete_lines (start (), end ());
        break;

    case    0:
        c = (*window) -> last_print;
        if (!c) {
            c = 'p';
        }
        // fall through
    case    'l':
    case    'p':
        for ((*window) -> dot = start (); ; (*window) -> dot++) {
            if (c == 'p') {
                cout << (*(*window) -> dot) -> content << endl;
            } else {
                format_list ((*(*window) -> dot) -> content);
            }
            if ((*window) -> dot == end ()) {
                break;
            }
        }
        (*window) -> last_print = c;
        break;

    case    'q':
        if (a [0] == '!') { // q! forces exit
            return (0);
        }

        if ((*window) -> modified) {
            throw (EDERR_MODIFIEDWINDOW);
        }

        for (w = windows.begin (); w != windows.end (); ++w) {
            if ((*w) -> modified) {
                break;
            }
        }
        if (w == windows.end ()) {
            return (0);
        }
        throw (EDERR_MODIFIEDWINDOWS);
        break;

    default:
        cout << "Unknown command (" << c << ")" << endl;
        break;
    }
    return (1);
}

You can see the code for the other commands too — if no command is specified (c has the value 0) we default to whatever the last l or p command was. The q (quit) looks to see if the file is modified (in which case it throws an exception) or it quits the editor by returning from the command handler. And, our friend the d command simply calls the delete_lines function in the Window class:

Line_i
Window::delete_lines (Line_i start, Line_l::iterator end)
{
    dot = lines.erase (start, ++end);
    modified = 1;
}

and of course sets the modified flag to indicate that a change has been made to the file.

At this point, I can load a file, select various line number and dot-relative line ranges for printing or listing (p vs l commands), and delete line ranges.

I can also crash the editor six ways to Sunday without trying very hard, such as opening an empty file and hitting ., or giving a backwards line range (e.g., 4,3p) and so on. I'm tracking those issues as I prototype, because I may yet change fundamental algorithms and class structures and so on, so I don't want to spend a bunch of time fixing things that are readily apparent only to find that the fixed code gets chucked as a result of a redesign.

So what's next, then?

Undo

2015 03 12

Well, the thing I'm the most nervous about is the “undo” functionality. Because the previous editor started out with a severely limited undo command, and I ended up breaking it more (to the point where I just removed it), I think this is an area of risk. How do you undo, then?

My initial thought was to model it on journaling. If I were to keep a journal of every change I made, I should be able to go back through that journal and unwind the changes I made, eventually restoring the file to its original state (if I cared to go back far enough). Let's examine that a little. Assume that I have a file that has ten lines, with the contents “one” through “ten” on the respective line:

one
two
three
four
five
six
seven
eight
nine
ten

Let's assume I start a counter (transaction_number) at the value 1, and assign it to each change, and then increment it. As I load the initial file, then, I would assign the counter value 1 to line 1 which contains “one,” 2 to line 2 which contains “two,” and so on. Easy peasy.

Let's now say I make my first change — I delete line 5. A few things need to happen:

  1. I remove line 5 from the Lines list,
  2. I put it on a journaling list.
  3. I make an entry in my journal that says “I just deleted the line identified as 5.”

And that's it! Now, if I hit the magical mystical “undo” button, I would look in my journal to see what the last thing I did was. It tells me that I deleted a line identified as transaction number 5. I can find this line easily in my journaling list. Oops. I don't know where to restore the line to!

On thinking about this some more, I think the right thing to do is to make the journal be “undo-oriented.” What I mean by that is that the journal should tell me what action to perform in order to undo the given action, rather than telling me what happened.

So, my journal should really say “append transaction 5 after transaction 4”. Doing that operation undoes the delete.

What about substitution? Let's say I changed line 7 to read “the new line seven.” In keeping with the “undo-oriented” nature of the journal, the log should say “remove transaction 11 and append transaction 7 after transaction 6.” Wait — where did transaction 11 come from???

When I changed line 7, I effectively created a new transaction with the change. The counter transaction_number that I talked about above now has the value 11, so that's the value that gets assigned to the new “version” of line 7. Line 7 (identified as transaction 7) was removed from the current list of lines, and placed on the journaling list. Transaction 11 was created with the new contents of line 7, and put in the original line 7's place — that is, right after line 6.

All editor operations can be expressed as either deleting a line, or introducing a new line. This means that the journal records can be very simple. What if I do 3,6d or g/pattern/d though? The algorithm suggested so far would create several journal entries, all along the lines of “add transaction x after transaction y”. Hitting undo would now restore only the last delete operation, not the whole command's worth. (It would still be possible, of course, to fully undo the operation, it's just that it would take a lot more undo commands to do so.)

There are two ways to handle that — either we introduce a “list of lists” (with each list element containing a list of undo operations that correspond to one aggregate operation), or we introduce a new journaling record type.

The “list of lists” idea isn't bad, except that the vast majority of operations would be one or two element operations (e.g., delete, insert, or delete and then insert). The new journaling record type, on the other hand, would serve as a marker, basically saying “undo to here.”

Both of the above could be variously permuted. The “list of lists” could have a shortcut version saying “this list only has one operation, and here it is,” and we could introduce an aggregate delete-then-insert operation. The new record type could just contain a count of records. If there is no count of records, then one operation is assumed.

A note about efficiency:
Who cares? It's an undo — this is a human-induced command where the human is really hoping the editor is able to go back to the previous state. If this takes a few seconds, so be it!

Ok, so what about redo? You just had to ask, didn't you?

redo is similar to undo in that it operates by replaying instructions from a journal. Compare and contrast:

Command undo journal entry redo journal entry
7d add transaction 7 after transaction 6 delete transaction 7
6s/spud/potato/ delete transaction 12 (the line containing “potato”)
add transaction 6 after transaction 5
delete transaction 6
add transaction 12 (the line containing “potato”) after transaction 5

I'm not sure, for the redo list, if I want to store the “precomputed answer,” or the instructions. In the table above, I showed storing the precomputed answer — transaction 12 already had the substitution done, all we needed to do was put it back in the appropriate place.

redo does have the curious property that once you enter a new command, the rest of the redo journal is dropped. That is, you can't do a bunch of undo's, insert a few commands, and then do some redo's to effectively continue from where you left off.

So, to implement undo and redo, we need to add some extra members to the Window class.

2015 03 13

And then he got an idea! An awful idea! The Grinch got a wonderful, awful idea! “I know just what to do!” The Grinch laughed in his throat.

Why do we need transaction numbers at all? Why can't we just use Line iterators? Consider the case of the ten line file; we have ten lines, which could all be the valid contents of an iterator. When I delete line 3, for example, I remove that entry from the list of entries constituting the file, and place it on the journaling list. When I add that line to the journaling list, I make a note that that line (as identified by its iterator) was after line 2 (as identified by its iterator). That's my journal record: “this line (the one associated with this journal entry) goes after this other line (given by an iterator into the active editor list.)” The beauty of this system is that the line I just deleted still exists (after all, I have to be able to restore it), and the line that it “goes after” must also exist (because currently it's a line in the active editor list.)

EDITOR FILE LIST          JOURNALING ENTRIES
+----------+
|  "one"   |
+----------+
     ^
     |
     v
+----------+
|  "two"   |<------+
+----------+       |
     ^             |      +----------+
     |             +------| "three"  |  [1] To undo, append me after "two"
     v                    +----------+
+----------+
|  "four"  |
+----------+
     ^
     |
     v
    ...
     ^
     |
     v
+----------+
|  "ten"   |
+----------+

ASCII art, what a lost art, eh? (Yeah, I need to figure out how to use a free drawing package or get these done on PowerPoint)

Ok, so what if we delete line 2?

Well, nothing bad happens:

EDITOR FILE LIST          JOURNALING ENTRIES
+----------+
|  "one"   |<--+
+----------+   |
     ^         |          +----------+
     |         |   +------| "three"  |  [1] To undo, append me after "two"
     v         |   |      +----------+
+----------+   |   |      +----------+
|  "four"  |   |   +----->|  "two"   |  [2] To undo, append me after "one"
+----------+   |          +----------+
     ^         |               |
     |         +---------------+
     v
    ...
     ^
     |
     v
+----------+
|  "ten"   |
+----------+

It only “looks” like we have a problem with line “three” — it seems to want to point to line “two” which is in the journaling list, not the editor list. But that's ok, because the only way we'll get to line “three” is if we first undo line “two,” and the moment we do that, we're back to the previous picture (where undoing line “three” works as expected).

Similarly with inserting a line; let's put a line before line “one” called line “"zero"”:

EDITOR FILE LIST          JOURNALING ENTRIES
+----------+
|  "zero"  |<--------------------------------------------------------------+
+----------+                                                               |
     ^                                                                     |
     |                                                                     |
     v                                                                     |
+----------+                                                               |
|  "one"   |<--+                                                           |
+----------+   |                                                           |
     ^         |          +----------+                                     |
     |         |   +------| "three"  |  [1] To undo, append me after "two" |
     v         |   |      +----------+                                     |
+----------+   |   |      +----------+                                     |
|  "four"  |   |   +----->|  "two"   |  [2] To undo, append me after "one" |
+----------+   |          +----------+                                     |
     ^         |               |                                           |
     |         +---------------+                                           |
     v                                                                     |
    ...                      (empty)    [3] To undo, delete this entry ----+
     ^
     |
     v
+----------+
|  "ten"   |
+----------+

Easy peasy. You'll note that the third journaling record doesn't involve a Line entry, it only has a pointer to the target entry.

For the multi-line delete, we don't need the bizarreness of the journaling sentinels or the list of lists either. (We still need it for the g command, as will become apparent shortly.) C++ already provides a clean way of doing the operations we need — the list splice function does exactly what we need!

Whereas we implemented our d command by using the erase function, we can instead remove the one or more lines by chopping them out of the editor file list and putting them onto the journaling entries list.

I implemented this by creating a Journal class which manages the undo and redo buffers:

enum journal_types
{
    JOURNAL_DELETE = 1,         // source is the removed transaction, target is its previous (to undo, append source after target)
    JOURNAL_APPEND = 2,         // source is not used, target is the new transaction (to undo, just remove target)
    JOURNAL_MULTI_START = 98,
    JOURNAL_MULTI_END = 99
};

typedef struct {
    journal_types           type;           // the type of entry
    Line_l                  source;         // the source line list, e.g. on a delete, these are the lines that get put back
    Line_i                  targets [2];    // the target lines, e.g. on a delete, targets [0] this is the line we append after, on an append, we delete from targets [0] through target [1]
}   journal_entry_t;

typedef list <journal_entry_t *>    journal_entry_l;
typedef journal_entry_l::iterator   journal_entry_i;

class Journal {
    journal_entry_l         undo_journal;
    journal_entry_l         redo_journal;
public:
    void    append_line (Line_i start, Line_i end); // record (and perform) a line addition (1 or more lines)
    void    delete_line (Line_i start, Line_i end); // record (and perform) a line deletion (1 or more lines)
    void    undo (void);                            // undo last operation
    void    redo (void);                            // redo last operation
};

typedef list <Journal *>    Journal_l;
typedef Journal_l::iterator Journal_i;

The enum journal_types indicates the operations that the journal can store. I said that an editor really has two operations; either a line gets deleted, or a new line gets added. The JOURNAL_MULTI_START and JOURNAL_MULTI_END are the sentinel markers for the g command — we'll discuss those at some point later, although you can probably already intuit how they work.

Next, I defined a good old fashioned C struct to hold the list of journaling entries. It contains enough information for each record, as per the diagrams above.

Finally, the Journal encapsulates the undo_journal and redo_journal journals, and provides the four functions that this class supports.

The delete_line and append_line functions are fairly straightforward:

void
Journal::delete_line (Line_i start, Line_i end)
{
    journal_entry_t     *j;

    ++end;
    undo_journal.push_back (j = new journal_entry_t);
    j -> type = JOURNAL_DELETE;
    j -> targets [0] = start;
    --j -> targets [0];
    j -> source.splice (j -> source.begin(), (*window) -> lines, start, end);
    ++j -> targets [0];
}

void
Journal::append_line (Line_i start, Line_i end)
{
    journal_entry_t     *j;

    undo_journal.push_back (j = new journal_entry_t);
    j -> type = JOURNAL_APPEND;
    j -> targets [0] = start;
    j -> targets [1] = end;
}

The one odd-ball thing is the -- and ++ around the splice function. I need to go one element before the place where I'm extracting from, do the extraction, and then go back to the target element. If I've understood everything correctly, that's because of where the insertion point is in the splice function. Maybe, maybe not. I'll come back to this and revise as appropriate.

Finally, the prototypical undo function:

void
Journal::undo (void)
{
    journal_entry_t     *j;

    if (undo_journal.rbegin() == undo_journal.rend()) {
        throw (EDERR_NOTHINGTOUNDO);
    }
    j = *undo_journal.rbegin();

    undo_journal.pop_back();
    switch (j -> type) {
    case    JOURNAL_DELETE:
        cout << "Asked to undo a delete\n";
        (*window) -> lines.splice (j -> targets [0], j -> source);
        break;

    case    JOURNAL_APPEND:
        cout << "Asked to undo an append\n";
        break;

    case    JOURNAL_MULTI_START:
        cout << "Undo encountered multi start\n";
        break;

    case    JOURNAL_MULTI_END:
        cout << "Undo encountered multi end\n";
        break;

    default:
        throw (EDERR_UNKNOWNUNDO);
    }
}

There. That wasn't so hard after all. I even tested it with a few boundary conditions:

It works!

2015 03 14

And speaking of boundary conditions, most of my current bugs relate to lines being out of order. That's the classic 4,3d case, for example. The following check takes care of this:
int
count_check (Line_i start, Line_i end)
{
    int     n;

    n = 0;
    while (start != (*window) -> lines.end ()) {
        n++;
        if (start == end) {
            return (n);
        }
        ++start;
    }
    throw (EDERR_ENDBEFORESTART);
}

And, as an added bonus, it also returns the number of lines involved in the range.

Redo

Undo was easy, what about redo? The classical vi doesn't have a command line redo command, so I'm going to use R.

C++ Paradigm — Insert, not Append

Well, I did say this was a learning exercise, right? Had I read and internalized the documentation for the list class, I would have paid more attention to the splice function's documentation that states:

Transfer elements from list to list. Transfers elements from x into the container, inserting them at position.

The reason I was having so much trouble (I kept needing to increment and decrement various iterators in order to make sure that they pointed to the right thing) is that I was assuming an appending model, rather than an inserting model.

Now the code is much cleaner! (But it does mean that I have to bore you with redrawn diagrams, though):

EDITOR FILE LIST          JOURNALING ENTRIES
+----------+
|  "one"   |
+----------+
     ^
     |
     v
+----------+
|  "two"   |
+----------+
     ^                    +----------+
     |             +------| "three"  |  [1] To undo, insert me before "four"
     v             |      +----------+
+----------+       |
|  "four"  |<------+
+----------+
     ^
     |
     v
    ...
     ^
     |
     v
+----------+
|  "ten"   |
+----------+

Here is the complete journaling code, starting with the class definitions:

enum journal_types
{
    JOURNAL_DELETE = 1,                                 // source is the removed transaction, target is its previous (to undo, append source after target)
    JOURNAL_APPEND = 2,                                 // source is not used, target is the new transaction (to undo, just remove target)
    JOURNAL_MULTI_START = 98,
    JOURNAL_MULTI_END = 99
};

typedef struct {
    journal_types           type;                       // the type of entry
    Line_l                  source;                     // the source line list, e.g. on a delete, these are the lines that get put back
    Line_i                  targets [2];                // on a delete:  targets [0] is the start, targets [1] is the end.
                                                        // on an append:  TBD
}   journal_entry_t;

typedef list <journal_entry_t *>    journal_entry_l;
typedef journal_entry_l::iterator   journal_entry_i;

class Journal {
    journal_entry_l         undo_journal;
    journal_entry_l         redo_journal;
public:
    int     append_line (Line_i start);                 // record (and perform) a line addition (1 or more lines)
    int     delete_line (Line_i start, Line_i end);     // record (and perform) a line deletion (1 or more lines)
    void    undo (void);                                // undo last operation
    void    redo (void);                                // redo last operation
};

typedef list <Journal *>    Journal_l;
typedef Journal_l::iterator Journal_i;

I think that this is a very clean implementation; the two journals, undo_journal and redo_journal are private members of Journal, and there are four functions that operate on the class. (We'll need two more functions later; one to start a multi-transaction record and one to terminate a multi-transaction record.)

The list management implementation is now a lot simpler. Effectively, a new operation (an append or a delete) causes list entries to be created at the end of undo_journal. Undo operations pop those off the end and process them, which involves putting them onto the end of the redo_journal. Redo operations pop those off the end and process them, which involves putting them onto the end of the undo_journal. And around and around it goes, until a new command is encountered (i.e., not an undo or redo). In that case, we kill the redo_journal list.

The append and delete logic:

int
Journal::delete_line (Line_i start, Line_i end)
{
    journal_entry_t     *j;

    ++end;
    undo_journal.push_back (j = new journal_entry_t);
    j -> type = JOURNAL_DELETE;
    j -> targets [0] = start;                           // begin delete here
    j -> targets [1] = end;                             // insert before this entry to restore
    j -> source.splice (j -> source.begin(), (*window) -> lines, start, end);
    redo_journal.clear ();                              // wipe redo journal, we're initiating a new command
    return (1);                                         // indicates "modified"
}

int
Journal::append_line (Line_i start)
{
    journal_entry_t     *j;
    string              s;
    int                 nlines;

    nlines = 0;
    cout << "Enter lines, terminating with ^D\n";
    (*window) -> dot = start;
    ++(*window) -> dot;                                  // append after, not "insert before" as the library would prefer, so add one
    while (getline (cin, s)) {
        (*window) -> dot = (*window) -> lines.insert ((*window) -> dot, new Line (s));
        ++(*window) -> dot;                              // advance dot so we can insert in front of the next one
        nlines++;
    }
    cin.clear ();
    cout << "Thank you for your input.\n";

    if (!nlines) {
        return (0);                                     // no lines were added
    }
    undo_journal.push_back (j = new journal_entry_t);
    j -> type = JOURNAL_APPEND;
    j -> targets [0] = ++start;                         // we've appended a bunch of stuff before this line
    j -> targets [1] = (*window) -> dot;                // up to here
    --(*window) -> dot;                                 // compensate for final dot advancement in while() loop
    redo_journal.clear ();                              // wipe redo journal, we're initiating a new command
    return (nlines);
}

And the undo and redo logic:

void
Journal::undo (void)
{
    journal_entry_t     *j;

    if (undo_journal.rbegin() == undo_journal.rend()) {
        throw (EDERR_NOTHINGTOUNDO);
    }
    j = *undo_journal.rbegin();

    switch (j -> type) {
    case    JOURNAL_DELETE:
        (*window) -> lines.splice (j -> targets [1], j -> source);
        redo_journal.splice (redo_journal.end(), undo_journal, --undo_journal.end());
        break;

    case    JOURNAL_APPEND:
        j -> source.splice (j -> source.begin(), (*window) -> lines, j -> targets [0], j -> targets [1]);
        redo_journal.splice (redo_journal.end(), undo_journal, --undo_journal.end());
        break;

    default:
        throw (EDERR_UNKNOWNUNDO);
    }
}

void
Journal::redo (void)
{
    journal_entry_t     *j;

    if (redo_journal.rbegin() == redo_journal.rend()) {
        throw (EDERR_NOTHINGTOREDO);
    }
    j = *redo_journal.rbegin();

    switch (j -> type) {
    case    JOURNAL_DELETE:
        j -> source.splice (j -> source.begin(), (*window) -> lines, j -> targets [0], j -> targets [1]);
        undo_journal.splice (undo_journal.end(), redo_journal, --redo_journal.end());
        break;

    case    JOURNAL_APPEND:
        (*window) -> lines.splice (j -> targets [1], j -> source);
        undo_journal.splice (undo_journal.end(), redo_journal, --redo_journal.end());
        break;

    default:
        throw (EDERR_UNKNOWNREDO);
    }
}

There's a certain symmetry in the code. Generally, this is a good indication that you've done things properly, both in terms of design and implementation.

It's interesting how the C++ “no spaces after a function name” style has started to creep up on me. I would never consider this in C — my style is to always put a space between the function name and the open round. Weird. In the early days of the editor project, I actively resisted this. Now, for functions that don't take arguments, it's “natural.” I still put spaces after ones that do take arguments. Wierder. I guess in my brain, I think idiomatically that the “()” is just syntactic sugar and can be safely grouped with the identifier. (It's almost like the & by-reference operator; maybe C++ needs a () by-call operator, that implicitly calls the function, LOL!)

What's Left?

So far, the editor allows you to print and list lines, append lines, delete lines, undo and redo operations, and quit. Major missing pieces are:

Not all of these are interesting (by that I mean, “sufficiently different in nature from what I've already talked about to warrant a bunch of web pages”). For example, window operations are basically more list operations — instead of the list containing a bunch of lines and us inserting and removing lines from that list, the list contains a bunch of windows, and we insert and remove windows from that list. Same with the file operations, at the very top of this article I explained the basics of how the initial editor file load works; reading, writing, and piping aren't significantly different.

As I proceed with development, I will add more content about things that I do find different and interesting (or surprisingly same when I had expected them to be different).

Ok, this afternoon I finished b, e, f, n, r, and w along with some infrastructure work for g, j, m, and s,

Let's Revisit the “Learning C++” Part

2015 03 15

So, the point of this article was two-fold, one was to rewrite the editor (and hence this article serves as a “how to write an editor” tutorial), and the second pleat of the fold was to learn C++. Have I accomplished the second? In order to answer that, let me itemize what I think I've learned about C++, what I know I haven't, and see what's left.

Things I've Learned

I estimate that I've spent around 30 hours on this project so far. wc tells me I've written 1524 lines of code (50 LOC/h) and 1447 lines of m4-based HTML (this blog).

The following are things I've learned by doing this exercise so far:

  1. try, throw, and catch
    • beginner level — I did a really crude job of just passing around an integer as my editor error code.
    • this was modeled after the original editor, which used setjmp and longjmp.
    • expanded this a little with some specific try/catch logic around streaming file operations.
  2. stream concepts
    • cout, cin, istream, ostream
    • functions (e.g., open, fail, flush)
  3. string operations
    • got the hang of what the string operations can do,
    • used some functions, values and operator overloads (like length, find, c_str, string::npos, +=),
  4. class semantics
    • declaration, public (vs default private), member functions
  5. list template
    • walking a list, begin, end, rbegin, rend, splice, clear, and related
  6. polymorphism
    • used both inside and outside of classes
  7. C++ unique features:
    • by-value (& operator)
    • new operator
    • bool
    • iterators
    • namespace (:: operator)

I'm by no means an expert on these topics, but I now have a feel for how they work.

There are also things that I know that I haven't even touched:

  1. inheritance
  2. memory management (constructors and destructors)
  3. friend functions
  4. operator overloading

And, I know I've botched a few things — for example, I tend to make more stuff public than I should (and just “reach in” and grab data).

So, the next thing I'm going to go do is clean up the code a bit — as I've learned things, I now see that there are cleaner ways of doing certain things, both from a strictly C++ perspective as well as from a “building an editor” perspective.

Hierarchy

In cleaning up the code, I discovered the next thing I need to learn about in C++ — inheritance and class hierarchies. Here's what happened. I have 5 existing classes (arranged from least-dependent to most-dependent):

  1. Line
    • contains one editor line, consisting of the content and a flags field.
    • doesn't depend on any other class (other than standard C++ library).
  2. Address
    • holds a start, end, and target address as an iterator into a Line (so depends on Line only).
    • it's used only by the Command class.
  3. Journal
    • contains the undo_journal and redo_journal, which consist of Line lists and iterators.
    • depends on Line (and in the upcoming re-org, will depend on Buffer).
  4. Window
    • contains all the information associated with one text editor window: the Line list of lines, the current value of dot, the last print mode, whether the editor window is dirty, the file name, and so on.
    • depends on Buffer (in the re-org) and Journal
  5. Command
    • holds information about a parsed command (Address information, left and right sides of substitutes, command name, qualifiers, arguments, as well as all the commands themselves).
    • depends on Address, but really wants to have access to all the goodies contained in the current Window (such as the list of lines associated with the window, where dot is, and so on).

The simplest representation of this is a strict hierarchy — who needs to know about who in order to do their job?

        +---------+
        | Command |
        +----+----+
             |
      +------+---------+
      |                |
      |           +----+---+
      |           | Editor |
      |           +----+---+
      |                |
      |           +----+---+
      |           | Window |
      |           +----+---+
      |                |
+---------+       +----+----+
| Address |       | Journal |
+---------+       +----+----+
      |                |
      +------+---------+
             |
          +--+---+
          | Line |
          +------+

So What About Multiple Inheritance?

2015 03 17

Truth be told, when I started this project, I thought I'd have to know a lot about inheritance, and multiple inheritance, and all that kind of stuff. So I'm a little disappointed! There is nothing in the current editor that requires inheritance, let alone multiple inheritance.

The diagram above is what I used in order to decide if I had a clean implementation.

Reviewers have said that maybe when I get into the regexp stuff, or possibly with the Journal class, or with exception handling, but ... nothing now.

Some Enhancements To vi

While designing the editor, I came up with some enhancements:

  1. " (double quote) syntax. The double quote, followed by one or two alphamerics, represents a single address or a range based on marks. Let me explain this a little better: "ab is equivalent to 'a,'b — it specifies the range of lines from mark a to mark b, but takes 2 keystrokes less.
  2. Global marks. Uppercase mark letters are global in scope, meaning for example that you can "ABt. to copy lines from global mark A through B to dot. To clarify, in one window you'd mark a certain line range with ka and kb commands; then, in any window(s), you could copy lines from that window to the current window. Global marks can also serve as targets, so g/spud/t'A will copy all lines to after global target A.

Marks, the Perfect Object

2015 03 18

In thinking about the “true purpose” of C++, I found an excellent embodiment of it — the Marks class. The Marks class holds the 26 global marks (marks A through Z inclusive) and the 37 local (per window) marks (marks 0 through 9, a through z (inclusive) and the “last line” mark).

Here's what the Marks class looks like:

class Marks
{
    Line_i          marks [37];         // local marks -- a-z + 0-9 + "current" == 26 + 10 + 1 = 37
    Line_l          dummy;
public:

    static Line_i   global_marks [26];  // global marks A-Z == 26

    Marks ();

    Line_i          get_mark (char c);
    string          find_marks (Line_i a);
    void            set_mark (char c, Line_i v);
};

These make use of a static member, the “global” marks array global_marks. A static member occupies memory exactly once, and must be instantiated somewhere (see below).

This means that the Marks class is contained within each Window, and consists of a local part and a global part.

Here's the implementation:

Line_i  Marks::global_marks [26];   // global marks A-Z == 26, dummy static initialization

Marks::Marks ()
{
    int     i;

    for (i = 0; i < sizeof (marks) / sizeof (marks [0]); i++) {
        marks [i] = dummy.end();
    }
}

void
Marks::set_mark (char c, Line_i v)
{
    if (isdigit (c)) {
        marks [c - '0' + 26] = v;
    } else if (isupper (c)) {
        global_marks [c - 'A'] = v;
    } else if (islower (c)) {
        marks [c - 'a'] = v;
    } else {
        throw (EDERR_MARKMUSTBEALPHANUM);
    }
}

Line_i
Marks::get_mark (char c)
{
    Line_i      r;

    if (isdigit (c)) {
        r = marks [c - '0' + 26];
    } else if (isupper (c)) {
        r = global_marks [c - 'A'];
    } else if (islower (c)) {
        r = marks [c - 'a'];
    } else {
        throw (EDERR_MARKMUSTBEALPHANUM);
    }
    if (r == dummy.end()) {
        throw (EDERR_MARKNOTSET);
    }
    return (r);
}

string
Marks::find_marks (Line_i a)
{
    string      o;
    char        m;

    // search local window first
    for (m = 'a'; m <= 'z'; m++) {
        if (marks [m - 'a'] == a) {
            o += m;
        }
    }
    for (m = '0'; m <= '9'; m++) {
        if (marks [m - '0' + 26] == a) {
            o += m;
        }
    }
    for (m = 'A'; m <= 'Z'; m++) {
        if (global_marks [m - 'A'] == a) {
            o += m;
        }
    }
    return (o);
}

The constructor, Marks::Marks sets each of the local mark array elements to a dummy end value.

Is there a better / cleaner way of doing this? I need a value that I can test against in order to determine if the mark has a value in it or if it's “uninitialized.” How should I initialize the global marks array?

The three functions, Marks::get_mark, Marks::set_mark, and Marks::find_marks access the otherwise private and reclusive marks array.

You'd think that I could have had a common function, say ascii_to_index, that would convert an upper or lower case character or a number into an index instead of computing the value in three places. It turns out to not be really worthwhile because I'd still need to test the value to see if it was upper or lower case, so I'd only be saving a few lines of code.

Another “duh” Moment

During development, I noticed that doing $b- on the Big File was really slow — it would take several seconds to print the results. Turns out the Window::line implementation was at fault:

Line_i
line (int n)
{
    if (n < 0 || n >= lines.size()) {
        throw (EDERR_LINERANGE);
    }

    for (Line_i it = lines.begin (); it != lines.end(); ++it) {
        if (n-- == 0) {
            return (it);
        }
    }
}

When converting the line number to an iterator, I would helpfully check the number against the expected valid range. (Line 0 is a special hack for the 0a command, which is why I test n < 0 instead of n <= 0.)

Turns out, on the Big File, lines.size took about 67ms to execute. I had thought that the list template would keep track of the number of objects in the list, but it turns out that it computes it each time. LOL.

The fixed version simply ignores the end of the range, and finds it by hitting lines.end (it also has a slight optimization for caching lines.end):

Line_i
line (int n)
{
    Line_i e;

    if (n < 0) {
        throw (EDERR_LINERANGE);
    }
    e = lines.end ();

    for (Line_i it = lines.begin (); it != e; ++it) {
        if (n-- == 0) {
            return (it);
        }
    }
    throw (EDERR_LINERANGE);
}

Now $b- is spiffy (my spies tell me around 500ns is spent in line doing the lookup for $).

Map Reduce

In the stated goals for this project, I said that I wanted to accomplish “multi-threaded map/reduce style of operation for g// commands.”

As the sidebar says, “the time has come to talk of many things.”

Basically, there are a number of operation classes that the g and v commands fall into:

  1. read-only (e.g., g/pattern/p),
  2. modify-in-place (e.g., g/pattern/s/lhs/rhs/ where the rhs doesn't introduce new lines),
  3. modify-with-extension (e.g., g/pattern/s/lhs/rhs1\nrhs2/),
  4. predictable move (e.g., g/pattern/.m't), and
  5. variable move/change (e.g., g/pattern/?pattern2?,/pattern3/m?pattern4?).

There may be some more esoteric ones, but these serve for purposes of illustration.

This means that there are a number of problems to solve:

  1. how to analyze threadability of work?
  2. how to distribute work amongst threads, where applicable?
  3. how to coordinate results of work performed by threads, where applicable?

The tagging part can always be multi-threaded — at the start of the g and v processing, iterators corresponding to the line range of the command are broken out into N lists (where N is the number of threads). This can be done in a round-robbin manner, for example. Then the N threads are told to analyze their respective lists of line ranges for initial pattern matches. When all threads have completed, the line ranges are reassembled (or interpreted as if they had been reassembled) in a reverse round-robbin manner. The result is an ordered list of matching lines, which are to be the targets of the g and v commands.

As far as the actual operation being performed, the determination of “threadability” can be initially done as a cheap hack — just scan the argument after the g or v command and see if it readily lends itself to classification into the above classes. This can be extended later as better heuristics are developed for classification.

A further consideration is that of “ordered” vs. “unordered.” An ordered operation, such as g/pattern/p means that the results have to show up in a certain order. The unordered operation, such as g/pattern/s/lhs/rhs/ means that we don't care about what order the operations are done in.

Ordering and threading are co-related — ordered and single threaded go together, as do unordered and multi-threaded. The two other cases, ordered and multi-threaded, and unordered and single-threaded, are not used.

But is it worth it?

Of course, as with any “pie in the sky” idea, a reality check is always in order. How much overhead is it to put iterators on a round-robin list and have threads process the lists for regexps as opposed to the total amount of work to do it in the main thread in the first place?

Some answers:

Regexp evaluation
100 ns through 4 μs depending on content
Add an entry to the end of a list
200 ns
Remove an entry from the list
160 ns
Simple string operation (c=a+b)
130 ns

Hmmm... It takes more time to do the list manipulation (360 ns total) than it does to do the regexp match (100 ns). So, this is kind of a wash — you'd need to have regexps that were expensive to do in order to make it worthwhile. The 4 μs regex was the pattern “a.*b” as applied to a line of text that had several “a”'s but no “b”'s.

Aha! I've been using lists! When I switch to vectors, I get different numbers:

Add an entry to the end of a vector
65 ns
Remove an entry from the vector
60 ns

And, thinking about this a bit further, there's really no need to “remove” an entry from the vector, we simply need to iterate to the next one. An iterator dereference and forward operation is 15 ns.

So now we're comparing 65 + 15 = 80 ns versus at least 100 ns. On an 8-CPU box, we have 8 x 100 + 8 x 80 = 1440 ns to do 8 operations (= 180 ns per operation).

Oh well, this isn't going to be the amazing time saver I initially thought. So I'm going to shelve this part of the project for just now.

2015 03 19

All is not wasted, however. The upshot of thinking about how to make the g and v commands thread-distributable brought on a significant implementation realization.

Generally speaking, the implementation of g and v on some vi implementations runs in O(n2). That's because the algorithm is:

  1. find and tag all matching lines
  2. while there are tagged lines:
    • starting from the first line, find the next tagged line,
    • perform the operation on the line,
    • untag the line

Where the n2 creeps in is because after an arbitrary operation, you're not guaranteed that the ordering of the lines hasn't been changed — so where do you get your “next” line from? The simplest way of doing it is to restart your search for the next line starting at the top of the file each time. So, if you've tagged all files, you end up doing the following:

  1. go to line 1, find first tagged line 1 (runtime += 1),
  2. go to line 1, find first tagged line 2 (runtime += 2),
  3. go to line 1, find first tagged line 3 (runtime += 3),
  4. and so on.

On a 1k line file, that's 1 + 2 + 3 + ... + 1000 = (1001) * 1000 / 2 = 500,500 operations.

In my version, I simply use a vector to store all the iterators to tagged lines, and then process the vector in sequence, destroying it at the end when it falls out of scope.

In the case where a target line is deleted (e.g., g/pattern/.,.+3d, and the pattern matched the deleted lines), we look through the vector into the iterator into the flags field and see if the line is tagged as deleted (the journaling guy does that for us). If so, we skip it.

Simple pimple.

The “Big File” takes 5s to do g/./.m0 on my editor, and I aborted the benchmark after 15 minutes on vim.

Get On With The Substitute Command Already!

All of the other commands are now done (some only in first draft mode, meaning that not all the bells and whistles are there). It's time to do the substitute.

The format of the substitute command is s/lhs/rhs/options, where the lhs is a regexp, and the rhs is the replacement. Whatever matches lhs gets replaced with rhs. There are special “pick up” and “put down” operations between the lhs and the rhs. For example, s/\(this\)\(that\)/\2\1/ would replace a text string “thisthat” with “thatthis.” The \( and \) indicate where a pick-up operation should happen (in this example, on the “this” and the “that”) and the \1 and \2 indicate which picked-up strings should be put down where.

The options after the rhs on the s command are optional. If present, they indicate an optional range of matches (e.g., {3 or just 7), optionally followed by the suffix “g,” meaning "apply globally."

To clarify, here are some examples:

  1. s/this/that/{3,5} — substitute third through fifth occurrence of “this” with “that.”
  2. s/this/that/g — substitute all occurrences of “this” with “that.”
  3. s/this/that/3g — substitute third and subsequent occurrences of “this” with “that.”
The standard C library provides regular expression functions (see regcomp, and regexec) which compile a regular expression and then search for it on a given string. The regexec returns an array of start/end offsets of where it found the matches — the first array element tells us where the entire regex matches, and subsequent array elements tell us where individual pickups matched. Consider the regex ..test.*\(th..\).*\(fi..\).... This is looking for any two characters, followed by the string “test” followed by any number of any character, followed by the string “th” followed by any two characters (and, by the way, pick up that four character string into pickup buffer 1), followed by any number of any character, followed by the string “fi” followed by any two characters (and pick up those four into the second pickup buffer), followed by any three characters.

This is what the array of start/end offsets looks like:

This is a test of this editor and it should find strings!
                ^ ^   ^                     ^   ^  ^
                A C   D                     E   F  B

A through B
Represent the start offset and end offset of the first array value, and indicates the complete match. I.e., this is the complete string that the lhs pattern matched.
C through D
This indicates the first pickup match, in this case, “this.”
E through F
This indicates the second pickup match, in this case, “find.”

So, in order to effect the substitution command s/..test.*\(th..\).*\(fi..\)/@1@ \1 @2@ \2 @3@/ (ignore the options for now), we need to:

This gives us:

:p
This is a test of this editor and it should find strings!
:s/..test.*\(th..\).*\(fi..\)/@1@ \1 @2@ \2 @3@/
This is @1@ this @2@ find @3@ strings!

A little tedious, to be sure. But it gets uglier. Adding the options means that we may have to skip several matches (for example, the options may specify that only the third and subsequent matches should be processed), and iterate over the string multiple times (for example, the options may demand that we substitute on the third through fifth match).

This implies the software design then:

:p
If this is thistle, then thistle is this!
:s/this/that/{2,3}
If this is thistle, then thistle is this!
   AAAA    BBBB          CCCC       DDDD
If this is thattle, then thattle is this!

Practically speaking:

Had there been a g qualifier in options, we would have continued until the end of the line.

Substitute

2015 03 20

Presented, for your amusement, is the Command::cmd_s code that does the substitution.

void
Command::cmd_s (void)
{
#define NMatches        9                               // substring pickup/putdown are 1..9 inclusive, so 9
    regex_t             rx;                             // the compiled regexp
    int                 match_num;
    Line_li             lindex;                         // line index for address range
    string::size_type   sindex;
    int                 start_match, end_match;         // for handling options in s/lhs/rhs/<options>; if "g" is used, end_match gets set to line length
    int                 substitutions;                  // indicates how many lines were substituted (also used as a flag for journaling transactions)
    int                 last_line;                      // flag indicating if we're on the last line (we may destroy the last line, so we need to cache this value)

    addr.require_012_default_dot();

    if (regcomp (&rx, l.c_str(), REG_ICASE | REG_EXTENDED)) {
        throw (EDERR_BADREGEX);
    }

    //  Handle the tail end "[start][,][end][g]" syntax ("g" sets end_match to -2; any negative value means "to the end of the line")
    get_one_or_two_numbers (a, sindex = 0, start_match, end_match);
    if (a [sindex] == 'g') {
        sindex++;
        if (end_match != -1) {
            throw (EDERR_GANDENDRANGE);
        }
        end_match = -2;
    }
    if (start_match == -1) {
        start_match = 0;
    }
    if (end_match == -1) {
        end_match = start_match;
    }

    if (a [sindex]) {
        throw (EDERR_GARBAGEAFTERENDOFS);
    }

    substitutions = 0;
    last_line = 0;
    for (lindex = addr.address (Address_start); !last_line; ++lindex) {
        const char          *original;                  // original string
        const char          *unmolested;                // original string, but don't move it once you've matched
        string              new_line;                   // the new line that is the output of the substitution
        regmatch_t          pmatch [NMatches];          // matched output and pickup indices

        last_line = lindex == addr.address (Address_end);   // are we on the last line?
        match_num = 0;
        original = (*lindex) -> content.c_str();
        new_line.clear();                               // clear target buffer

        while (regexec (&rx, unmolested = original, NMatches, pmatch, 0) == 0) {
            int             offset;                     // offset within string for multiple matches
            int             slash;                      // if escaping a slash
            const char      *rhs;                       // temporary for walking along the RHS

            offset = 0;

            if (match_num < start_match) {
                // this match is before the start_match indication, copy its data to the new line
                while (offset < pmatch [0].rm_eo) {
                    new_line += *original++;
                    offset++;
                }
            } else if (match_num >= start_match && (end_match < 0 || match_num <= end_match)) {
                // in the mid-game, we need to copy from the previous position to the start of the matching area
                while (offset < pmatch [0].rm_so) {      // just the tip!
                    new_line += *original++;
                    offset++;
                }

                // now parse the RHS, copying in direct characters or substituting for pickup references
                slash = 0;
                for (rhs = r.c_str(); *rhs; rhs++) {
                    int     q;                          // used to walk along the picked-up range

                    if (slash) {                        // currently in "escaped" mode
                        switch (*rhs) {
                        case    '\\':                   // backslash backslash == just one backslash
                            new_line += "\\";           // second slash means we got "\\", which is just a single slash
                            break;
                        case    't':                    // \t == tab
                            new_line += "\t";
                            break;
                        case    'n':                    // \n == bell
                            new_line += "\n";
                            break;
                        case    '&':                    // \& == & (plain "&" is "entire line")
                            new_line += "&";
                            break;
                        case    'x':                    // \xZZ == hex digit ZZ (must be two digits)
                            if (isxdigit (rhs [1] && isxdigit (rhs [2]))) {
                                char        hex;

                                hex = tolower (rhs [1]) >= 'a' && tolower (rhs [1]) <= 'f' ? (tolower (rhs [1]) - 'a' + 10) << 4 : (rhs [1] - '0') << 4;
                                hex+= tolower (rhs [2]) >= 'a' && tolower (rhs [2]) <= 'f' ?  tolower (rhs [2]) - 'a' + 10       : (rhs [2] - '0');
                                if (!hex) {         // and we don't allow the value 0x00
                                    throw (EDERR_BADHEXINRHS);
                                }
                                rhs += 2;               // we consumed two more characters than expected
                                new_line += hex;        // add hex characters to string
                            } else {
                                throw (EDERR_BADHEXINRHS);
                            }
                            break;
                        case    '1':    case    '2': case   '3':
                        case    '4':    case    '5': case   '6':
                        case    '7':    case    '8': case   '9':                    // a number indicates a put-down reference (0 not valid)
                            if (pmatch [*rhs - '0'].rm_so == -1 || pmatch [*rhs - '0'].rm_eo == -1) {
                                throw (EDERR_BADPUTDOWN);
                            }
                            for (q = pmatch [*rhs - '0'].rm_so; q < pmatch [*rhs - '0'].rm_eo; q++) {
                                new_line += unmolested [q];
                            }
                            break;
                        default:
                            throw (EDERR_UNKNOWNRHSBACKSLASH);
                        }
                        slash = 0;                      // transition back to normal mode

                    } else {                            // currently in "normal" mode
                        if (*rhs == '&') {              // a plain "&" means "put down whatever you found"
                            for (q = pmatch [0].rm_so; q < pmatch [0].rm_eo; q++) {
                                new_line += unmolested [q];
                            }
                        } else if (*rhs == '\\') {
                            slash = 1;
                        } else {
                            new_line += *rhs;
                        }
                    }
                }

                // and skip the rest of the match; it's covered by the rhs we just copied in
                original += pmatch [0].rm_eo - offset;
                offset = pmatch [0].rm_eo;
            } else {
                break;                                  // match has exceeded range, stop looking (we'll pick up the rest of the line shortly)
            }
            match_num++;
        }

        // we're done with matching, if we got any matches at all, it means we should replace the line
        if (match_num) {
            Line_li     tmp;

            new_line += original;                       // gather up any straggling characters
            tmp = lindex;
            ++lindex;                                   // set up next line to be analyzed *before* we delete and insert the substitution

            if (!substitutions++) {
                (*window) -> transaction_start();
            }
            (*window) -> delete_lines (tmp, tmp);        // delete the lines, 'cuz we're gonna replace them with "new_line"

            (*window) -> insert_lines (lindex, new_line);
            --lindex;                                   // set up next line to be analyzed *before* we delete and insert the substitution
        }
    }
    regfree (&rx);

    if (substitutions) {
        (*window) -> transaction_end();
        cout << substitutions << " substitutions\n";
    }
}
The above code was read and written by the editor. It's alive! I introduced a really cool pair of commands called >h and <h which convert to and from HTML. This means it does all the conversions of things like “>” to “&gt;” for me! I'm planning on expanding the > and < commands to take other characters, for example H to convert to and from hexadecimal, r to convert to and from rot13, and so on.

On an administrative note, I'm now about 50 hours in to the project, with 2,622 lines of code and 2,196 lines of blog.

Visual Mode

2015 03 21

With spring comes the visual mode. Command line mode is “done” as far as it needs to be done for right now. I have basic editor functionality: substitution, global commands, insert / delete lines, reading, writing, etc. Sure there's a bunch of stuff missing (options, yank buffers, split inserted lines based on newline, recursive g and v, ctags support, etc.)

But given the existing functionality, these are things that can be added incrementally.

So, it's time to move on to the second half of the project — visual mode.

Breaking it down, visual mode consists of the following major work items:

  1. ncurses interface (bring up screens, “getkey” functionality, colours, highlights)
  2. line presentation (tab expansion, basic horizontal and vertical scrolling, redraw on change)
  3. line editing (ability to move around on one line, inserting / deleting characters, etc)
  4. line-oriented visual functions (delete lines, insert lines)
  5. character-oriented visual functions (command-driven cursor movements and impact)

I'm sure there's more work, but this should do as an initial list :-)

ncurses

ncurses is a 200-year old library that dates back from Charles Babbage's Difference Engine days with stone tables used for standard output. (Well, maybe I'm exaggerating a teensy bit.)

And yet it works.

So I created a Visual class:

class Visual
{
    WINDOW      *w;                     // the ncurses window
    int         xsize, ysize;           // x and y size of screen
    int         xpos, ypos;             // our current cursor position
    Line_li     top_line;               // the top line

public:
    Visual();
    ~Visual();

    void        execute (void);         // this is the main loop for the visual processing
    void        refresh_screen (void);  // draw the screen
};

Some local variables dealing with the ncurses window (w), the screen size, current cursor position, and so on.

It has a constructor and destructor that do all the magic of bringing ncurses to life and shutting it down:

Visual::Visual ()
{
    w = initscr ();
    if (has_colors ()) {
        start_color ();
        // @@@ set up colors here
    }
    cbreak ();              // get characters immediately, but process ^C and friends (raw() would give you all characters)
                            // this means that we don't get at least ^Q, ^S, ^Z
    noecho ();              // we'll handle echoing of characters ourselves, thank you
    keypad (w, 1);          // enable processing of function keys, arrows, etc.
    nonl ();                // don't convert \n to \r\n on output
    intrflush (w, 0);       // don't flush output on ^C
    idlok (w, 1);           // ok to use hardware line insert/delete feature
    idcok (w, 1);           // ok to use hardware character insert/delete feature

    // capture the screen size parameters
    xsize = COLS;
    ysize = LINES;

    // capture window size change signals
    signal (SIGWINCH, resize_handler);

    // clear the screen
    erase ();

    // reset our screen positions
    move (ypos = 1, xpos = 0);

    // set up our top line
    top_line = (*window) -> dot;
}

And:

Visual::~Visual ()
{
    endwin ();
    signal (SIGWINCH, SIG_DFL);
    cout << "Screen was " << xsize << " by " << ysize << " characters\n";
}

But then I got to thinking about the purpose of this exercise — to learn more C++ stuff. If you think about it, (and sure, there might be some conversations in my head that you weren't privy to), the editor really consists of a bunch of sub-windows, mapped onto one “canvas.” In the basic case:

+-----------------+
| command window  |
+-----------------+
| text display    |
| area consisting |
| of multiple     |
| lines of text.  |
+-----------------+

Yes, I like my command window at the top. Long story (goes back to QNX2 days, LOL!)

However, if I wanted to display the line numbers for each line, I could add a window to the left of the text display area:

+---------------------+
|   command window    |
+---+-----------------+
| 1 | text display    |
| 2 | area consisting |
| 3 | of multiple     |
| 4 | lines of text.  |
+---+-----------------+

And, in the “hex magnification glass” mode, I'd like a hex display on the right side:

+-----------------------------------------------------------------------+
|                            command window                             |
+---+-----------------+-------------------------------------------------+
| 1 | text display    | 74 65 78 74 20 64 69 73 70 6c 61 79 20 20 20 20 |
| 2 | area consisting | 61 72 65 61 20 63 6f 6e 73 69 73 74 69 6e 67 20 |
| 3 | of multiple     | 6f 66 20 6d 75 6c 74 69 70 6c 65 20 20 20 20 20 |
| 4 | lines of text.  | 6c 69 6e 65 73 20 6f 66 20 74 65 78 74 2e 20 20 |
+---+-----------------+-------------------------------------------------+

Or whatever other ideas strike my fancy during development (e.g., diff output).

For long, multi-line commands, it may be desirable to either scroll the command window, or to simply expand it to make it 2 or more lines:

+-----------------------------------------------------------------------+
|                            command window                             |
|                           (multiple lines)                            |
+---+-----------------+-------------------------------------------------+
| 1 | text display    | 74 65 78 74 20 64 69 73 70 6c 61 79 20 20 20 20 |
| 2 | area consisting | 61 72 65 61 20 63 6f 6e 73 69 73 74 69 6e 67 20 |
| 3 | of multiple     | 6f 66 20 6d 75 6c 74 69 70 6c 65 20 20 20 20 20 |
| 4 | lines of text.  | 6c 69 6e 65 73 20 6f 66 20 74 65 78 74 2e 20 20 |
+---+-----------------+-------------------------------------------------+

So this means that I need to think about “subwindows” — which really sounds like a C++ class to me. (And given this approach, you could move the command window to the bottom if you really wanted to, or have multiple line windows displayed on the same screen, etc — they're just objects, people!)

Ahh... this is the “inheritance” I keep hearing about :-)

My base class, then, will be something that has an ncurses window identifier in it, some size information, maybe cursor locations, that kind of stuff. A new derived class will then inherit that base class, and provide additional functionality, such as line editing, etc.

58 hours, 3085 LOC and I now have basic multi-window page-up/page-down browsing capability in visual mode.

The code is ugly. It's just bits and pieces hacked together. But, while doing it, it became apparent how the individual objects should be structured. Each window should have its own object. This object knows how to update the window. I added another window at the top:

+-------------------------------------------------------+---------------+
|                    command window                     | status window |
+---+-----------------+---------------------------------+---------------+
| 1 | text display    | 74 65 78 74 20 64 69 73 70 6c 61 79 20 20 20 20 |
| 2 | area consisting | 61 72 65 61 20 63 6f 6e 73 69 73 74 69 6e 67 20 |
| 3 | of multiple     | 6f 66 20 6d 75 6c 74 69 70 6c 65 20 20 20 20 20 |
| 4 | lines of text.  | 6c 69 6e 65 73 20 6f 66 20 74 65 78 74 2e 20 20 |
+---+-----------------+-------------------------------------------------+

The status window displays things like what line number and column you're currently on.

BTW, if you don't know already, ncurses is a harsh mistress. You really need to read the documentation in order to squeeze out the last drop of performance :-) Using multiple windows, for example, you might be tempted to call wrefresh for each window as they are ready to be rendered. This is a suboptimal idea. Instead, there's a function, wnoutrefresh which flushes the window's changes to the master buffer. Then you call doupdate to actually update the terminal. The performance improvement can be reasonably dramatic, especially if you have several parallel windows (like the line number window and the line content window).

Designing the Visual Mode

2015 03 22

My problem is that I learned vi incrementally — I knew ed, so a lot of the time when I didn't know how to do something in vi I simply resorted to the ed line mode commands. Gradually, I learned.

“How do you make a new line?”

“Just hit o or O, depending on whether you want it below or above the current line.”

“Ok, so how do I delete a line?”

“Easy, just hit dd

“Is there an easy way to delete 20 lines?”

“Yup, just hit 20dd

And so on. So for me, vi's commands were a magical mystical sea of letters and modes that did special things at special times.

Let's look at this systematically. Apart from line mode, vi is in one of two modes — you're either actively modifying text, or you're in a command mode. The text modification mode is pretty standard, when you type characters they either overwrite the text underneath the cursor, or they insert themselves. Command mode is a little more exciting. Just like in the line mode editor, where we discovered that there are really only 2 operations (delete a line or insert a line), in command mode, there are effectively four submodes:

  1. move
  2. change
  3. delete
  4. yank

If you don't do anything special, “move” mode is the default. This mode allows you to use the various keys as commands to move the cursor around — by word (w command), by search for prev/next character (,, ;, f and t), search for prev/next match (n and N), and so on. <~-- -->

If you hit a _CMD(c) or _CMD(d) or _CMD(y) command, then the mode changes to one of the other three, but the underlying _QUOTE(move) mode still operates. For example, a plain _CMD(w) moves you to the next word, whereas _CMD(dw) deletes the word. Effectively, the way that _CMD(c), _CMD(d), and _CMD(y) work is that they say _QUOTE(mark the current location (line and position within line) and wait for further instructions.) The further instructions are the standard _CMD(move) command, which really just gives you a new target (line and position within the line). A complicated example is _CMD(dn); assume weve previously done a search for “spud” and we have the following lines of text on the screen. The cursor is at the highlighted position:

This is a test of this editor, and I'd really
like to change some stuff up to the next occurrence
of the word spud (but retain it and everything after
the word).

First off, just hitting n on its own would move the cursor as follows (range is highlighted in white):

This is a test of this editor, and I'd really
like to change some stuff up to the next occurrence
of the word spud (but retain it and everything after
the word).

However, hitting dn says “start at the 'u' in 'up' and delete everything from there until you find the next occurrence of the previously used search pattern.” This results in the screen changing to the following:

This is a test of this editor, and I'd really
like to change some stuff spud (but retain it and everything after
the word).

You can imagine that c does something very similar; it deletes just like d does, and then pops you into insert mode. y also does something similar; it copies everything that was highlighted in white into a buffer.

This implies the design — we need to get “move” mode working first.

The Importance of Being Earnest

2015 03 25

Or, more correctly, “the importance of representational systems.”

I got bogged down when I started implementing the cursor movement functionality. It was bizarrely difficult! How hard could Home, End, PgUp, PgDn, cursor up, etc. be?

My main difficulty was in the choice of representational system for the “window” paradigm. This is what I had in my head:

                     +-- x cursor
                     |
                     v
       +----------------------------------------+
       |         |                              |
       |         |yoffset                       |
       |         |                              |
top -> |      +--v----------+<-+ ysize          |
       |      |             |  |                |
       |      |             |  |                |
dot -> |      |      C      |  |                |<-- ycursor
       |      |             |  |                |
       |      |             |  |                |
       |      |             |  |                |
       |----->+-------------+<-+                |
       |xoff  ^---xsize-----^                   |
       |                                        |
       |                                        |
       |                                        |
       |                                        |
       +----------------------------------------+

The window (the inner square) represents the displayed area, and the outer square represents the entire file. The red C is the cursor location; it's expressed as ycursor and xcursor. The cursor always points to dot. The window is at some yoffset from the beginning of the file, and we have a reference variable called top that points to the first line in the window. The window is also at some xoff from the left edge of the entire file (I don't believe in wrapping lines, mine cut off at the right edge of the screen and you can scroll the whole window). And of course, this is based on the tab-expanded representation of the file (which only exists virtually).

As I started writing the code for simple things (like PgUp, End, , etc.), I got bogged down in details of how to keep all of these variables up to date, and do their range checks, and so on. This was complicated by the fact that I decided to keep the cursors in on-screen coordinates (that is, (0, 0) .. (ysize-1, xsize-1)). The hassle of whether the coordinates were 0-based or 1-based was a minor detail compared to trying to convert them to string indicies.

And then I wrote lots of code to try and display the window properly given all possible input cases and ... I despaired.

Rethinking the representational system I used, and stewing on the problem a bit, I came up with the following:

class Container
{
public:
    Line_li     first;      // first line iterator (points to .end() on empty container)
    unsigned    todot;      // how many down to get to dot?
    unsigned    offset;     // how many columns is the container shifted to the right?
    unsigned    sindex;     // character offset into dot's string
};

The fields are fairly self explanatory, todot is a convenience variable computed whenever I redraw the display. The representational system was string-index based. If you were displaying a line that was 493 characters long and the cursor was at character 490, then sindex would contain the value 490.

This now means that my cursor movement functions are more along the lines of the level of complexity you'd expect. Witness and PgDn:

void
Visual::execute_up (void)
{
    Window      &w = **window;
    Container   &c = w.container;

    if (w.dot != w.line (1)) {
        column_capture ();
        if (c.todot <= 0) --c.first;     // scroll the window
        --w.dot;
        column_restore ();
    }
}

void
Visual::execute_npage (bool fullpage)
{
    Window      &w = **window;
    int         n;

    n = fullpage ? text_ysize : text_ysize / 2;

    column_capture ();
    while (n--) {
        if (w.dot != w.dollar ()) ++w.dot;
    }
    column_restore ();
}

I would have expected that more complexity would be shifted to the screen rendering function, but it turned out to be almost the same size, but much more readable. Win win!

Ahhh... coding is fun again:

void
Visual::execute_tilde (void)
{
    Window      &w = **window;
    Container   &c = w.container;
    string      &s = (*w.dot) -> content;

    for (repeats = repeats ? repeats : 1; repeats && w.sindex < s.size(); repeats--) {
        s [w.sindex] = isalpha (s [w.sindex]) ? isupper (s [w.sindex]) ? tolower (s [w.sindex]) : toupper (s [w.sindex]) : s [w.sindex];
        w.sindex++;
    }
    repeats = 0;
}

Sometimes you just gotta indulge yourself in a nice long one-liner :-)

2015 03 27

4,416 lines of code, 80 hours, and there's a lot of functionality.

I think what's happening is I'm taking a break from learning C++ by just soaking in it and getting comfortable with the syntax. Most of the code that I've done is little fiddly bits (like the fjords) where it's maybe 10-30 lines of code that does a particular visual function (like the % key, or just today the << and >> indent keys).

I'm estimating “completion” at around 200 hours, and by that measure, somewhere around the sub 10k LOC mark. I've been doing everything except the actual visual line editor (as will be required by the :, a, A, c, C, i, I, o, O, s, and S commands).

Update

2015 04 03

At 110 hours and 5,968 LOC, the editor is starting to become usable in visual mode. I've installed it as /usr/local/bin/eng and try to use it for non-critical edits (stuff I can afford to loose). So far, I haven't lost anything :-)

Major functionality includes two line editors (by that I mean something that can edit a line using cursor controls, insert, delete, that kind of thing). One of them is in the : command and search handler (with history, I might add), and the other is a more generalized one for the full screen text area.

Along the way, I've been doing lots of bug fixes; the concept of dot being the hardest one to fully nail. At issue is that my brain still wants to think in terms of line numbers, whereas iterators to list elements that can't readily be compared for “in range” or “outside of range” still sometime elude me. Often I'll find myself coding stuff along the lines of:

Line_li restore = dot;
--restore;
// do stuff that might affect dot
dot = ++restore;

In a vain attempt to keep track of where I am in the editing buffer during operations. An approach that's been successful is to have anything that deletes lines be aware of dot, and move it out of the way. This is a much “simpler” approach, but doesn't address all of the edge cases of when dot should be or should not be moved to the line, the next line, or the last line, of a particular command.

On another note, I found that my Home and End keys (and others) on my Genuine IBM PC/AT Keyboard didn't work. So I discovered the magic of define_key:

define_key ("\x1b\x5b\x31\x7e", PCAT_HOME);
define_key ("\x1b\x5b\x34\x7e", PCAT_END);
define_key ("\x1b\x4f\x6a", GREY_STAR);
define_key ("\x1b\x4f\x6b", GREY_PLUS);
define_key ("\x1b\x4f\x6d", GREY_MINUS);
define_key ("\x1b\x5b\x45", DEAD_5);

These allow me to kludge those keys into being recognized by ncurses. (Truth be told, they hadn't worked for a long time. They don't work on various other keyboards too. Searching around on the net indicates that this is a long standing problem and nobody's really bothered to fix it. After discovering the define_key hack, I backported the hack to the original editor. Now I have Home and End again. Nice.)

Here it is, Friday afternoon, and I'm basically done. I'm using the editor now to type in these blog entries :-) (And discovering all the nasty “usability” issues that still need addressing, like z. and friends for navigation, auto-indent, w taking me to the end of the file instead of leaving me at the current line, f no longer working (but F working just fine), '' insisting that “the mark must be alphameric” rather than just taking me to the previously location, xp not switching characters around, the occasional crash, that kind of stuff).

Fine, 1 hour of concentrated effort fixed the above-noted z. and friends, w, f, '' and a few other issues. It's interesting to note that once you start “eating your own dog food” the difference in bug fixes and bug capturing is night and day. Before I started using the editor for actual work, I'd think that there's a few minor annoyances that I might get to “one day.” But the instant I started actually using it, the bug reports started piling in and the motivation to fix stuff climbed. Lesson learned!