Garbage, garbage, everywhere

Today I want to respond to a comment that was left on a previous post with regard to the following LINQ query:

For Each prod As Product In Products.Where(Function(p) p.IsQuoting).OrderBy(Function(p) p.Expiration).ThenBy(Function(p) p.Strike)
    prod.SendQuotes()
Next

What’s wrong with the above code? In most cases, nothing, really. It gets the job done in a fairly obvious way, in a single, readable line of code.

That’s most cases. But most cases don’t take performance into consideration; and as you may recall, in the post in question I was specifically talking about performance.

In order to understand the performance impact of that code, it’s necessary to first understand garbage.

Back in the early days of computer programming, whenever a developer allocated some memory in code, he or she was responsible for also deallocating it.

This actually made for some pretty lean code. Imagine it like a society in which every citizen always deals with his or her own trash: rather than have collectors who come through on a regular basis, this society’s members all dispose of their own trash in one way or another. In an ideal world, this society would enjoy significant savings as a result of not needing to invest in the infrastructure to deal with trash on a large scale.

However, in the real world, such a society would almost certainly have a huge trash problem, since not every member of society is a responsible citizen. In software terms, not every program always deallocates its memory perfectly. This often results in what is referred to as a memory leak.

Developers using platforms like Java (JVM) or .NET (CLR) have this fanciful notion in their heads that “memory leaks can’t happen on my platform, because my language is garbage collected!” This is not entirely true to begin with (in .NET, at least, memory leaks are very possible and potentially very costly); but more importantly, it’s just a bad attitude to have. When you believe that you can do whatever the heck you want because your language is garbage collected, chances are you’re going to end up writing some pretty inefficient code.
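To make this concrete, here is a minimal sketch of a leak in a garbage-collected language. It’s written in Java (one of the platforms just mentioned) rather than VB, and the registry and handleRequest below are made up for illustration; the point is simply that anything still reachable from a long-lived root is, by definition, not garbage, so no collector will ever reclaim it:

```java
import java.util.ArrayList;
import java.util.List;

public class LeakDemo {
    // A long-lived static root: everything added here stays reachable forever.
    static final List<byte[]> registry = new ArrayList<>();

    static void handleRequest() {
        registry.add(new byte[1024]); // registered, but never removed
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++) {
            handleRequest();
        }
        // Roughly 10 MB the collector is never allowed to touch,
        // in a "garbage collected" language.
        System.out.println("buffers still reachable: " + registry.size());
    }
}
```

Run long enough, a registry like this will take a process down just as surely as a forgotten deallocation in a manually managed language.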

For those of you who may not know what garbage collection is in software, it’s pretty much what it sounds like: a process whereby a system will “clean up” the memory allocated by a program after that memory is no longer in use, without requiring software developers to do this themselves. So a language with garbage collection (e.g., C#) is like a society in which individual citizens do not concern themselves with disposing of their own waste. In other words, it’s like our society, which does have garbage collectors who handle everyone’s garbage.

Now consider this: whether a society expects its individual citizens to deal with their own trash or expects garbage collectors to do it, there’s still garbage to be dealt with. And that, no matter how you spin it, is a cost.

What the heck does this have to do with that LINQ query? I’m glad you asked. Let’s look at that first line again:

For Each prod As Product In Products.Where(Function(p) p.IsQuoting).OrderBy(Function(p) p.Expiration).ThenBy(Function(p) p.Strike)

How much garbage does this code create? I will count the throwaway objects:

  1. An enumerator to iterate the results of the Where call.
  2. A delegate object to execute p.IsQuoting on every iteration.
  3. Another enumerator to provide access to the result of OrderBy.
  4. Another delegate object to call p.Expiration for each iteration.
  5. Another enumerator to iterate over the results of ThenBy. This enumerator will need to store its own intermediate collection of Product objects, since it has to sort them and iterate over them in sorted order based on p.Expiration followed by p.Strike.
  6. Another delegate object to call p.Strike for every iteration.

See that? That’s 6 objects, not including the intermediate collection of Product objects that must be sorted before the For Each loop can iterate over it.

To put this in perspective: the above code could be written without requiring any new object allocations, with the possible exception of the intermediate collection (depending on whether or not the OrderBy/ThenBy sort could be performed in place).

To me, it’s like this: say you have a bowl of pasta for lunch every day. You could use the fork and bowl from your kitchen drawer every time, or you could use a new throwaway paper bowl and plastic fork for every single meal.

Which would you choose?


For the record, here’s how you could do the same work with no garbage:

' Initialization code somewhere--
' Let's say this ProductComparer class encompasses the OrderBy/ThenBy logic.

' I am not counting this as garbage, because the cost would be incurred ONCE,
' as opposed to every time the below code executes.
Dim comparer As New ProductComparer
Products.Sort(comparer)

' Now here's the clean, garbage-free code.
' (OK OK, there is still possibly an enumerator object...
' then again, there might not be; e.g., if Products is a List(Of Product),
' then its enumerator will actually be a value type which does not create
' garbage.)
For Each prod As Product In Products
    If prod.IsQuoting Then
        prod.SendQuotes()
    End If
Next

8 thoughts on “Garbage, garbage, everywhere”

  1. empi says:

    Thanks for your answer.
    However, I truly disagree with you. I think that the linq code is better. The only thing I would change is to eliminate ThenBy, so that the sort is performed only once (if I really need this code to be more performant).
    That’s how I see it:
    1. You point out that there will be 6 unnecessary objects created (and then garbage collected). Mostly these are just references. The amount of memory consumed is determined by the size of the Products collection. Only if the Products collection is really small (just a few elements) might this be noticeable; however, if the collection is that small, it really doesn’t matter how you deal with it.
    2. You assume that you may sort in place. Most of the time that won’t be a good assumption. What is more, you first sort the collection and then check which products satisfy the IsQuoting condition. If there are many products and only a few that meet this condition, your code will be much slower than the linq version.
    To be honest, if this isn’t some core logic that gets executed a thousand times per minute, then it’s an example of premature optimization.

    • Daniel says:

      empi, I think you’re falling into the trap so many other developers do, which is to think that, by default, any attempt at optimization must be premature.

      Consider this: every call to OrderBy will perform a sort (on an intermediate collection), making iteration alone O(n log n) assuming a Quicksort-based implementation. Forget about “thousands of times per minute”; if the code in question will be executed repeatedly throughout the lifetime of the application, this is wasteful.

      Remember that I was talking about high-performance code in the first place; if your software needs to react to events as quickly as possible in real time, there’s absolutely no reason to sort your data every time you iterate over it.

      As for whether an in-place sort is appropriate or not, that depends on your context, obviously. But it would still be much better to have a separate collection of Product objects which is sorted (once) in advance than to create a new collection and sort it on every call.

      If you want to investigate the issue further, I suggest you test out some sample code that benchmarks the performance of two approaches to iterating over a collection in sorted order:

      1. Iterate over the results of OrderBy on a 1000-item collection a million times.
      2. Copy a 1000-item collection, sort the copy, and iterate over that a million times.

      Again, in high-performance code, there’s no excuse for preferring the former when it is so savagely outperformed by the latter.
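      To make the comparison concrete, here is a rough sketch of those two scenarios in Java, with streams standing in for LINQ (the Product shape is hypothetical, and the absolute numbers will of course vary from machine to machine):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class SortEachTimeVsPresorted {
    record Product(int expiration, int strike) {}

    static final Comparator<Product> ORDER =
        Comparator.comparingInt(Product::expiration)
                  .thenComparingInt(Product::strike);

    // Scenario 1: sort on every pass, which is effectively what calling
    // OrderBy/ThenBy (here, Stream.sorted) before each iteration does.
    static long sortEveryPass(List<Product> products, int passes) {
        long sum = 0;
        for (int t = 0; t < passes; t++) {
            for (Product p : products.stream().sorted(ORDER).toList()) {
                sum += p.expiration(); // stand-in for SendQuotes()
            }
        }
        return sum;
    }

    // Scenario 2: copy and sort once up front, then just iterate the copy.
    static long presortedPass(List<Product> products, int passes) {
        List<Product> sorted = new ArrayList<>(products);
        sorted.sort(ORDER);
        long sum = 0;
        for (int t = 0; t < passes; t++) {
            for (Product p : sorted) {
                sum += p.expiration(); // same stand-in work
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<Product> products = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            products.add(new Product(rnd.nextInt(100), rnd.nextInt(100)));
        }

        long t0 = System.nanoTime();
        long a = sortEveryPass(products, 2_000);
        long t1 = System.nanoTime();
        long b = presortedPass(products, 2_000);
        long t2 = System.nanoTime();

        System.out.println("sort every pass: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("pre-sorted:      " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("same total work: " + (a == b)); // prints true
    }
}
```

      Scenario 1 pays the O(n log n) sort, plus the intermediate allocations, on every single pass; scenario 2 pays the sort exactly once.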

  2. empi says:

    Ok, I did it. And here is what I got for 100k iterations with different probabilities of IsQuoting. The linq code is significantly faster, and basically it shows that the times depend on the cost of the sort operation. I pasted my test code here: http://www.copypastecode.com/63970/

    Products count 1000, iterations 100000, possibility 0,01
    Generation only 00:10.89
    Generate and do nothing 00:12.611
    Sent so far 100000000
    LINQ 00:13.791
    Sent so far 1000629
    Custom 00:57.495
    Sent so far 998923
    ====================

    Products count 1000, iterations 100000, possibility 0,1
    Generation only 00:10.440
    Generate and do nothing 00:13.125
    Sent so far 100000000
    LINQ 00:16.807
    Sent so far 10003365
    Custom 00:57.993
    Sent so far 9995485
    ====================

    Products count 1000, iterations 100000, possibility 0,2
    Generation only 00:10.24
    Generate and do nothing 00:12.387
    Sent so far 100000000
    LINQ 00:20.596
    Sent so far 19993324
    Custom 00:57.16
    Sent so far 20000899
    ====================

    Products count 1000, iterations 100000, possibility 0,5
    Generation only 00:10.124
    Generate and do nothing 00:12.561
    Sent so far 100000000
    LINQ 00:33.555
    Sent so far 50000379
    Custom 00:58.919
    Sent so far 50004941
    ====================

    Products count 1000, iterations 100000, possibility 1
    Generation only 00:10.294
    Generate and do nothing 00:12.697
    Sent so far 100000000
    LINQ 00:56.475
    Sent so far 100000000
    Custom 00:56.712
    Sent so far 100000000
    ====================

    • Daniel says:

      Ugh, empi! I think you have misunderstood my suggestion in one catastrophic way: the code you used to test the two scenarios sorts before every iteration, which is exactly what I’m saying is wasteful! Take a look at my own demo program here: http://pastebin.com/vbL65M1V. I have included comments explaining that these are the two scenarios to examine:

      1. Calling OrderBy and iterating over the result every time a sequence is iterated.
      2. Sorting a sequence once and iterating over the result of that whenever the sequence is iterated.

      For reference, here are the results I got for 10,000 trials with a collection size of 1,000:

      Calling OrderBy on every iteration: Executed 10000 time(s) in 2228.6352 ms.
      Iterating over a pre-sorted sequence: Executed 10000 time(s) in 188.2718 ms.

      I have trouble believing you could actually dispute that there is a real performance impact from performing a sort before every iteration. I suspect this entire debate has sprouted from a misunderstanding regarding what my actual suggestion was in the first place. In any case, I’m sticking to my original argument on this one.

      • empi says:

        How can you compare these two approaches if you say that you already have a sorted collection? When you already have a sorted collection, there is nothing to compare.
        The Where clause in linq will act basically the same as your If inside the For Each.
        I guess you meant that you can sort in place using List.Sort, or create a temporary collection using IEnumerable.OrderBy. However, calling the linq solution terrible just because under some circumstances you may sort in place and keep the result…
        Going further, you could keep a list of the elements that meet the IsQuoting condition, and then you wouldn’t even have to perform the check every time you want to send quotes.

      • empi says:

        Btw, that’s a really nice programming blog. It’s great that you keep posting regularly. Keep it going!

      • Daniel says:

        empi, I never meant to imply that LINQ per se was terrible. Of course Array.Sort (or List<T>.Sort) is not going to somehow magically outperform OrderBy just because it isn’t LINQ.

        If you look again at my original post, what I was trying to say was this: some developers (such as my coworker, who wrote the code in question) are so infatuated with LINQ that they use it indiscriminately in places (or in ways) they shouldn’t, without considering the consequences. In the case of my coworker, he wanted to iterate over a sequence in a certain order; LINQ gave him the OrderBy method; he used that and went on his merry way, without thinking twice about what this entailed.

        The fact that there is a much more efficient way to tackle the same problem is what I was trying to highlight. And I believe that my coworker fell into this trap because LINQ makes it so easy to do so. But you’re right that the more efficient approach could have also been accomplished using LINQ, i.e., by using OrderBy instead of Sort for the up-front sort.
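        To sketch that last idea in Java (streams standing in for LINQ; the Product shape is hypothetical): pay the query-style sort once, cache the result, and keep the hot path free of sorting and allocation:

```java
import java.util.Comparator;
import java.util.List;

public class CachedSort {
    record Product(int expiration, int strike) {}

    // One-time cost, analogous to OrderBy(...).ThenBy(...).ToList() in LINQ.
    static List<Product> sortOnce(List<Product> products) {
        return products.stream()
            .sorted(Comparator.comparingInt(Product::expiration)
                              .thenComparingInt(Product::strike))
            .toList();
    }

    public static void main(String[] args) {
        List<Product> products = List.of(
            new Product(2, 10), new Product(1, 30), new Product(1, 20));

        List<Product> sorted = sortOnce(products); // query-style sort, done once

        // Hot path: no sort, no per-pass garbage beyond the enumerator.
        for (Product p : sorted) {
            System.out.println(p.expiration() + " " + p.strike());
        }
        // prints:
        // 1 20
        // 1 30
        // 2 10
    }
}
```

        Whether the one-time sort is spelled List.Sort or a materialized OrderBy, the point stands: the hot path only iterates.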

      • Daniel says:

        empi: Thanks for reading! 🙂
