I realize that title’s a little ridiculous; let me explain.
First of all, for those of you who perhaps aren’t computer programmers (or even who are, but who couldn’t care less about data structures), the below is a diagram depicting an array.
Nothing to see here. Just your average everyday array.
An array consists of a bunch of elements all right next to each other in a computer’s memory. It’s very simple, and very “easy” for the computer to deal with (it only needs to know where that first element is, and how many elements there are total—sort of like, if I know where you live, then I also know where your neighbor lives: right next to you!). But there’s one serious limitation with arrays: they’re fixed in size, meaning, once you’ve created one (allocated it), you can’t change how many elements it holds.
In contrast, the following is a diagram depicting a linked list.
Look at me, I'm a zany linked list!
A linked list offers a significant benefit over an array: elements can be added, removed, and inserted anywhere in the list, all in constant time. This is because a linked list “node” consists not only of its value (the element itself); it also includes a pointer to the next node in the list (and possibly also a pointer to the previous node, in which case we have a doubly-linked list).
Now here’s a curious puzzle. Suppose you’re a programmer and you have need of a collection of, say, integers. You know you’re going to be adding many values to this collection, but you can’t predict how many (this might sound contrived, but it’s a realistic scenario: consider real-time data analysis, where data points are provided by some external source and it’s the software’s responsibility to analyze the data as it’s received). Do you use an array? Keep in mind that since arrays are fixed in size, this means that if you need more elements than the array can hold you’ll need to create a new array, copy all your elements over, and repeat this process whenever you reach the capacity of the current array. On the other hand, what about a linked list? Adding to the end of a linked list is O(1), which makes this the better choice, right?
This reminds me, believe it or not, of a common mathematical experiment that we’re all probably familiar with. I seem to remember my mom posing this question to me when I was a kid; or maybe it was my dad. In any case, the experiment goes like this:
You have two choices. The first choice is that I give you $100 right now. The second is that I give you a penny right now, two pennies tomorrow, four pennies the next day, and I keep doubling the number of pennies every day for a month. Which do you pick?
I’m guessing that any adult reading this knows the right answer; in fact, I’d be surprised if any adult could see that question and not feel that the answer is obvious. But to a child, it is not obvious at all. In fact when you actually learn how much money you’d have after a month in the second case, as a child (and even as an adult, honestly), it’s pretty shocking! (For the record, on a 30-day month that would be $10,737,418.23.)
A similar illustration of this phenomenon (basically, exponential growth) that I’ve heard used before is that of folding a piece of paper in half enough times for the resulting folded-up page to reach the moon. (Guess how many folds it takes? 42—that’s it!)
What does this have to do with linked lists? Actually, you know, it’s funny: the real parallel is actually a backwards one. So, to make it less confusing, let me pose a variation on the above experiment:
You have two choices. The first choice is that you owe me 10 cents per day, every day, for the rest of your life. The second is that in one week you will owe me 10 cents, then in a month you’ll owe me 20, then in six months we’ll make it 40, then 80 after a year, and from there we’ll just keep doubling the amount of time that goes by before you owe me any more money.
The above is admittedly imprecise (I didn’t feel like doing the conversions from days to months & years in my head); but it captures the basic gist of what I’m trying to say. In the first case, let’s say you live 80 years; that’s nearly $3,000! On the other hand, let’s look at the second case. Supposing again that you lived 80 years, and let’s just say the first year you owed $2 (to make it easy). Your last payment would be when you’re 64 years old (even if you made it past 80, I doubt you’d make it to 128!) and it would be for a whopping $128.
In other words, taking the guaranteed constant cost (allocating linked list nodes) over the performance hit that occurs less and less frequently (creating a new array) is really not the best choice.
This is all related to a peculiar psychological phenomenon that Ed Rosenthal discusses in his excellent book The Era of Choice. In one chapter (I don’t recall which) he describes the following interesting experiment. Members of a study group are given two options: accept a $100 guaranteed reward, or take a 50/50 chance of receiving either $500 or nothing.
Which would you choose? As it turns out, the vast majority of participants in the study decided they’d choose the guaranteed $100.
OK, interesting. Before analyzing whether or not that’s smart, though, I should mention that Rosenthal then goes on to share that there was a second part to the study. It was essentially the exact same question, only backwards: instead of a guaranteed versus probabilistic gain, participants were asked to choose between a guaranteed and probabilistic loss. So, the options were: pay a guaranteed $100 fine, or take a 50/50 chance of either having to pay $500 or nothing.
Which would you choose this time? Same answer?
Here’s the statistical perspective. In the first case, the correct choice is to go with the 50/50 chance of receiving $500. Why? Because your expected outcome is $250 ((50% * $0) + (50% * $500) = $250). In the case of a guaranteed $100, your expected outcome is, obviously, $100. On the other hand, in the second part of the experiment, the correct choice is to go with the guaranteed $100 loss, for the exact same reason: your expected loss is $100, while in the 50/50 case your expected loss would be $250. (Yet in the experiment, the vast majority of participants opted for the 50/50 choice in the second part. So they made the wrong choice both times!)
This is a little difficult for some people to understand at first: How can your “expected value” be $250 when that’s not even a possibility? This is the same sort of objection people make to the seemingly nonsensical statistic sometimes spouted that “The average American family has 2.5 children” (I have no idea whether or not that’s accurate; I just know I’ve heard it before). The error in reasoning here is conflating an individual case with the average of all cases. Here’s what I mean.
Let’s say, rather than participating in the abovementioned experiment once, you were given the opportunity to participate in it a thousand times. And let’s say, as with most of the participants in the original study, your gut told you to go with the guaranteed $100 reward each time. Then your expected reward would be $100 per trial, for a total of $100,000. Sounds pretty nice, right? But what if you went the other way? If you opted for the 50/50 choice each time, then half the time (roughly) you would get $500, and half the time (roughly) you would get nothing. Averaged out over 1,000 trials, you could expect to be up $250,000—that’s $150,000 better than the other option!
Get the idea? We make many choices in our life, and often it is our nature to want to secure the guaranteed something, whatever it may be. But if we apply this strategy everywhere in our lives, we may ultimately be experiencing a suboptimal outcome, because we are continuously making poor statistical choices due to our fear of risk.
This seems all the more crystal clear when we imagine the 1,000 trials scenario in the context of the second part of the experiment (choosing between losses). Over 1,000 trials, opting for the 50/50 chance of loss each time, you could expect to be in debt $250,000 by the end of the experiment. Taking the guaranteed $100 route would put you in the hole only $100,000 (and if you participated in both 1,000-trial runs, this means the correct choice in both cases would eventually put you ahead $150,000, while making the wrong choice in both would put you behind $150,000).
Interestingly, I think software developers (being humans themselves) are susceptible to making similarly poor but understandably instinctive choices in their software design. “I don’t want the amortized statistical advantage of a resizable array; I’ll take the guaranteed performance win of a linked list for its O(1) appends!” Even though in the long run that will set me back, providing a small expected benefit in comparison to the greater expected benefit of the array.
Of course, I’m not saying that either approach is really the best choice.
But I’m also not really just talking about arrays and linked lists, here. Linked lists are the scapegoat of this post; in fact I’m talking about looking at the bigger picture and making well-informed choices based on overall expected benefit, rather than stressing out over edge cases. And that applies to you too, non-developers!