Design follows data structures
One of the funny things that’s happened in the last decade or two is that performance optimization got turned on its head.
Once upon a time, CPU performance meant optimizing instructions.
When I learned C, over 20 years ago, the books I read were quite particular about (y << 8) + (y << 6) being a much more efficient way to write y * 320.
These days, of course, compilers do that for you, if it’s even still a good idea (and CPUs have changed enough that it might not be).
On the one hand, this is evidence in favor of what your friendly local algorithms teacher always told you: machine details aren’t the important part, focus on the algorithms. Big O is your friend. Sometimes this advice has been scoffed at (“Constant factors matter!”), but in light of how much things have changed, it seems pretty true.
Except in one way: cache.
It’s less that the “machine details and constant factors” faction was wrong, and more that the details that matter have changed. And… it doesn’t look like they’re going to change again, at least until CPUs are no longer a thing, if that’s even possible. Multiple factors have converged:
- We got a lot more data to process, in general, and so memory is a bigger part of the performance picture.
- CPUs got so much faster, relative to access latencies from memory, that instructions matter far less than cache misses. CPUs spend a lot of time these days doing nothing, waiting for RAM to reply.
- Compilers have gotten so much better at optimizing code that we’re far more likely to be bottlenecked on what compilers aren’t optimizing: memory layout.
Why can’t compilers optimize data structures?
I think the answer to this question is obvious from a higher-level perspective. Obviously, compilers can’t invent red-black trees or hash tables to replace your code’s association lists.
But there are much lower-level questions here. Compilers can’t replace linear scans with binary search, either, but that hasn’t stopped code optimizations from having a pretty significant impact on performance. Why can’t compilers do any optimizations at a smaller scale with data representation, just as they can for code?
The problem compilers face is that data representation is a system boundary, from the compiler’s perspective.
If a function takes a certain format of data, the compiler is not free to just change that.
It doesn’t know all of the places this function may be called from.
Even in situations like static functions in C, where the compiler can see all call sites, compilers still do not really take advantage of that fact to alter their signatures.
The trouble is that functions are still taken to be a fundamental abstraction: intra-procedural and inter-procedural optimizations are different, and inter-procedural generally take each function’s signature as a given.
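As a small, hypothetical illustration of that boundary (the names here are made up, and C is used to match the examples later in the post): even when every call site is visible, the compiler treats the layout of the argument type as a given.

#include <stddef.h>

struct Sample {
    double value;    // the only field the loop below reads
    char   meta[56]; // bulky metadata the loop never touches
};

// static, so the compiler can see every call site in this file...
static double sum_values(const struct Sample *samples, size_t n) {
    // ...yet the loop still strides over the full 64-byte struct; the
    // compiler will not split the fields into separate arrays for us.
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += samples[i].value;
    return total;
}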
So what are we to do?
The inflexibility of data structures means that—when performance is a concern—we need to think about representation early. The design of the code involved can’t be easily changed later, because the code is written around the data structure choice.
But we’re also told to avoid “premature optimization,” so which is it?
On the one hand, not all early optimization is premature. On the other hand, prototypes are valuable in figuring out what to do, before you figure out how to do it well. So perhaps we’re talking about designing a partial re-write.
But I’m going to punt on this question until maybe a later post. There’s a different question I want to think about first.
Wasn’t object-oriented programming supposed to help?
Part of the argument for object-oriented programming is that encapsulation was supposed to hide data structure choice. We’re supposed to build against an interface, the concrete implementation of which is supposed to be swappable. So does OO help us swap out data structure choice at a later date?
Well, not so much. Or at least, there’s nothing special about OO here. Part of what helped the OO crowd make their case at first is that it was advocated at a time when non-modular procedural programming was dominant. Obviously, any amount of modularity is a win, and OO did bring that compared to the “plain C” practices of the day.
But the only major benefit OO (i.e. modular programming of any kind) brings here is “in the large.” It’s the idea that some modules should not depend on other modules. Limiting dependencies limits the scope of impact that a change in data structures has. That’s the primary mechanism that helps facilitate data structure changes, but we can do that with or without OO.
“In the small,” when we’re just considering code within a module, or a couple of related modules, OO does essentially nothing to help make data structure changes less painful. Of course, it’s not without dogma to hand out, despite this. This is how we’ve ended up with prescriptions that classes be littered with getters/setters, despite the frequent absence of any benefit over plain data.
The theory backing up getters/setters is that by depending on an interface (these methods) instead of an implementation (the fields), the actual representation could change in the future, and the methods could still work. This is supposed to make such a change easier, removing the need to rewrite all the users of that class.
In practice, this supposed benefit is often non-existent, and for performance-related changes it nearly always is. If we’re changing representation for performance reasons, generally “adapters” sitting between users and the actual representation aren’t going to help with that performance. In fact, many times, a data structure change requires changing access patterns in the algorithms, too.
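A rough sketch of why the accessor layer doesn’t shield callers (hypothetical names, with C-style accessor functions standing in for getters):

#include <stddef.h>

// Before: an accessor over the per-object representation.
struct Point { float x, y, z; };
float point_get_y(const struct Point *p) { return p->y; }

// After an AoS-to-SoA change there is no single object to hand to the
// accessor anymore; it needs the aggregate plus an index, so its signature
// changes, and every caller changes along with it.
struct Points { float *x, *y, *z; };
float points_get_y(const struct Points *p, size_t i) { return p->y[i]; }

The interface itself has to change shape, so the encapsulation didn’t spare the callers anything.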
But more insidious is that data structure changes for performance reasons are unlikely to respect our present object-oriented decomposition.
Aggregate data
We think compositionally about how we represent things. If we have widgets, we create a widget type. If we have a collection of widgets, we might reach for the “array of widgets” type.
But performance doesn’t always respect this compositional thinking. The classic example here is “array of struct” (i.e. our traditional, compositional approach) versus “struct of arrays.”
// We have points, so let's represent a point:
struct Point { float x, y, z; };

void update(struct Point p[]) {
    // p[0] identifies a point
    // p[0].x accesses one coordinate
}
Or, alternatively,
struct Points {
    // one parallel array per field
    float *x;
    float *y;
    float *z;
};

void update(struct Points p) {
    // We don't have a way to refer to a point by itself, only the aggregate
    // p.x[0] accesses one coordinate
}
The benefits of this transformation can be huge:
- For numerical data, the “SoA” approach can much more easily make use of SIMD instructions, to efficiently perform the same transformation on all elements of the array.
- For any data, access to just some of the data can be done in a much more cache-efficient manner.
(Slightly contrived for the above example, but let’s imagine we only care about the y coordinate, to apply the effect of gravity or something. The second representation means the cache would only be filled with y data, and not any of the unnecessary x or z data. This trick can be helpful even in general-purpose databases, where it’s called a column-oriented store. See the sketch just below.)
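As a rough sketch of that second point (the function name and constant are made up for illustration), an update that only needs y touches only the y array:

#include <stddef.h>

struct Points {
    float *x;
    float *y;
    float *z;
};

// Only p->y is touched, so the cache fills with y data alone, and the
// simple loop is easy for the compiler to vectorize. (Not real physics,
// just illustrating the access pattern.)
void apply_gravity(struct Points *p, size_t n, float dt) {
    for (size_t i = 0; i < n; i++)
        p->y[i] -= 9.81f * dt;
}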
But consider trying to make this data structure change to an object-oriented program. Point would be an object at first, but now we have to explode it. In fact, we don’t even have a Point object anymore; it’s gone away entirely. Instead, we’re just writing all our functions over the data structure Points.
This isn’t even an object-oriented way of thinking anymore.
This kind of transformation is antithetical to object-oriented programming. And when it comes to cache-friendly representations, this kind of thing is common. (Here’s a RustConf closing keynote about “OO” designs vs entity component systems. Note to self: a future post about ECS might be an interesting case study.)
So another reason object-oriented rules aren’t helping us cope with changing data structures is that the necessary changes have no real reason to be confined to the internals of any given class. They are easily cross-cutting.
Opposite land
What’s the easiest way to make an invasive change to data representation?
The basic premise of the object-oriented style here is that the easiest way is to allow current code that uses the class to continue to work just fine. That is, make an internal change, but keep external users unchanged because the interface doesn’t change. I’ve challenged whether this approach really works above, but what I’d like to do now is take it a step further: I don’t even think this is the best way.
The best approach to making invasive changes like this that I’ve found is to:
- Use a language with a good type system and strong support for data. (The best languages I have used on this front are OCaml, Haskell, Scala, and Rust, and it looks like Swift would qualify as well, though I haven’t actually used that one yet.)
- Make a “loud” breaking change to the types (sketched below). Deliberately ensure that compiler errors do arise as a result of the change.
- Chill out for a moderately mindless session of “go to next compiler error, fix error, repeat.” (Starting to have more than a hundred? Many changes are probably similar, maybe it’s time to learn vim macros…)
- It just works?
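To make the “loud” step concrete, here’s what it might look like on the earlier Point example, staying in C for consistency (even though its type system is weaker than the languages listed above):

#include <stddef.h>

// struct Point is deleted outright; the new representation replaces it,
// and update() changes shape along with it.
struct Points {
    float *x, *y, *z;
};

void update(struct Points *p, size_t n);

// Every leftover mention of struct Point, every p[i].x field access, and
// every call to the old update() is now a compile error. The next step is
// just walking through those errors until the build is clean again.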
I’ve mentioned this kind of refactoring before on the blog. If you’ve never experienced it, you should try it someday. But there seem to be a few major benefits:
- Since you’re changing all the users, there are no longer any performance concerns about compatibility shims or the like.
- You’re directly changing data representation from one to another, so you don’t run afoul of any enforced dogma about class organization (like what happens with aggregate data).
- Since you’re fully converting the representation and all users, there’s no accumulation of technical debt whatsoever. It’s very clean.
- You get to survey how the data structure gets used, which can sometimes help inform you about whether you’re making the right change.
Conclusion
All this leaves us with the question: if it’s so difficult to make invasive changes to data representation later, what do we do about it? I’m trying to put some thoughts together on what prototyping really means when it comes to software development, but that’ll have to wait for another time.
But one tool that does work is the one object-oriented programming did help bring into common practice: controlling dependencies. Encapsulation might not always matter as much at the level of individual classes, but the idea of limiting the scope of changes necessary to accommodate a representation change is a good one.
And of course, the most important way that applies: if you think a data representation might need to change, then don’t leak it into a system boundary. There are “easier to change” and “harder to change” designs, but leaking a representation into a boundary puts it in the “impossible to fix” category.
We’ve got enough of those.