Debugging a synchronous callback in loop defect

#programming #coding #debugging

A bug snuck by me while adding a new feature to Each: a behaviour in Fuse that dynamically adds UI elements. I had to change a lot of the code, so I was diligent in adding a heap of new tests. Each is likely our most tested behaviour now, but alas, a regression slipped through! Fortunately, we caught it during the pre-release testing, but that's still late in the process. What happened?

The defect

The setup of the defect, once we whittled it down, is roughly this:

<StackPanel>
    <Each Items="{items}">
        <PanelA/>
        <PanelB/>
    </Each>
</StackPanel>

This fragment requires an items list where we will be adding/removing items. For each item in the items array, we add two children to the StackPanel: PanelA and PanelB. When an item is removed from items we'd then expect both of those children to be removed. Unfortunately, only one of them is removed; the other remains forever!

A complete example would involve a JavaScript code block to control the items, but as it wasn't related to the problem I left it out here. The resulting simplicity makes it shocking that none of the tests managed to pick up the error. I'll get back to it later why they didn't.

The defect is in this cleanup code:

for (int n=0; n < av.Nodes.Count; ++n)
    Parent.BeginRemoveChild(av.Nodes[n], CompletedRemove);

This removes the children when an item is removed from the items list. av.Nodes is a list of the nodes that were added, in this case one of PanelA and PanelB. There are two items so we expect RemoveFromParent to be called twice, but it's only called once! Madness!

The danger of synchronous callbacks

I wrote once before about the infernal loop iterator, and this situation is one of those. It builds upon a problem with synchronous callbacks, something I also mention is an issue in my article on event programming.

What you don't see above is the definition of the CompletedRemove callback. A node may have a RemovingAnimation on it, which means it stays around a bit longer to animate its removal. The callback informs us of when this removal finally happens.

void CompletedRemove(Node n)
{
    //uninteresting bits...

    WindowItem wi;
    if (_dataMap.TryGetValue(n, out wi))
    {
        wi.Nodes.Remove(n);

        //...more bits
    }
}

When the child is removed from its parent, we remove it from our own list. Seeing this code alongside the for loop should make it easier to spot the problem. What if this code is called synchronously? Removing the first item in the Nodes list will reach the callback and shorten the list. By the time we get back to the for loop there won't be a second item, thus explaining why it's only called once.

It's not relevant here whether we use an indexed for loop or a foreach style iterator. They will both fail, though possibly in different manners. We use indexed loops in our codebase since they happen to be more efficient on some target platforms (yes, this is something that should be optimized).

This situation highlights the danger of having synchronous callback, or event functions. When the callback is made, it's uncertain how side-effects could affect code further back in the stack. For safety, we want to make callbacks, and any other message dispatch, near the top of the call stack. An asynchronous callback fixes our defect with Each.

Two fixes

It's a bit risky to change message ordering this late in the release process. We already did most of the manual testing so we may not catch subtle errors anymore. I instead made an alternative fix that changes the dependency above. It turns out we don't even need to track removals in Each, so the fix was to remove most of the CompletedRemove code.

We still want the preferred asynchronous callback,, so I also made that change, but only on the development version. It'll be subjected to more testing and use before it gets released.

But why wasn't it noticed?

This defect is a result of several orthogonal aspects of the code interacting with each other:

Multiple children must be inserted by the items in Each (a single template, as we call them, is not enough: the first item is always removed successfully).
The list must have items removed from it (if we just add items we won't notice the problem).
The removal callback must be synchronous (when it is asynchronous, which is common, the defect doesn't happen).

We had tests for all of these conditions. We even have tests for combinations of these. We didn't have a test that combined all three together! The test for changing the items list, with multiple templates, also involved a visual RemovingAnimation, thus not triggering the synchronous callback. The test which invoked a synchronous callback didn't have multiple templates.

There's significant difficulty in testing APIs with a high degree of orthogonality. We can't verify every combination. The goal in coding is to be as generic as possible such that combinations don't change behavior too much. Interactions are however unavoidable. In this case, the defect happened because a false, but reasonable, assumption was made about how the removal function worked inside the for loop.

The regression was my fault; I'm not trying to shift blame. I'm just struggling to figure out how to prevent this type of defect.