The test app
has been updated to include two more tests: one for a hash table based approach, and one that uses a ByteArray approach. Looks like the ByteArray approach is the fastest, followed by the hash table, followed by the bitmap method. They’re all pretty close when you’re working with 10k or 100k records, but once you get into a million or more items then the ByteArray approach pulls ahead. Thanks to Alex Bustin
for sharing his code for the ByteArray test.
Recently I was thinking about how to do some complex Array manipulation, and how to do it fast. I needed to take a large source Array and filter it two different ways, producing two new filtered arrays, each of which is made up of a subset of the values in the source array. Then I wanted to know which values of the first filtered array were unique to that array (did not appear in filtered array #2) and vice versa. I came up with a technique that uses the bitmap manipulation of the Flash player to accomplish these goals.
An example of the problem:
Let’s say I have a source array that looks like this:
Then I filter that array to produce a new array, Filtered Array #1, that contains some but not all of the items. I do the same thing again, except this time using a different filter that produces Filtered Array #2. Now my data looks like this:
|Filtered Array #1
|Filtered Array #2
I want to know the unique values in Filtered Array #1 that do not appear in #2 (Item 5) and the unique values in Filtered Array #2 that do not appear in #1 (Item 1, Item 4). An added bonus would be some extra stuff like being able to know the union of both arrays (items that appear in either array 1 or 2) or the intersection (items that only appear in both filtered arrays).
The slow-ass way
So the first thing that comes to mind when trying to solve this is that you can simply do a for loop over the source array and check if each item appears in Filtered Array #1 and Filtered Array #2. This means you have one full loop over every item in the data set, and then you need to do a check for whether the item exists in each of the filtered arrays. The most basic way would be something like this:
var unique1:Array = ;
var unique2:Array = ;
for(var i:int=0; i
So that's the easiest way and it certainly works. I'm sure there are countless ways to do this in a more optimized way, like not looping over the source array but instead looping over the first filtered array, somehow marking all the items in the second filtered array if they were found in the first filtered array, then looping over the remaining items in the second filtered array. I'm sure someone will tell me that I should use a binary tree search since I know the source index, or that I should use a hashtable to lookup the existence of items faster. I bet all those different approaches would beat this simple example speed-wise, but I was still worried no matter how optimized an approach like this can get, it would still suck. (Side note: If you have an alternative approach that will handle many items with really fast performance, especially if you have an AS3 implementation, I'd love to hear about it.) So anyway, I decided I wanted to try something different.
Check this out. Say you have a source array of 10,000 items. Then you filter that down to a filtered array that might have a subset of about 4,000 of those items. The relationship of the items in the filtered array to the source array can be drawn using a bitmap like this:
This image is a 100 pixel by 100 pixel square, and each pixel within that square represents whether or not the item in the source array appears in the filtered array. So if the pixel at 0,0 is black that means that the first item in the source array is also in the filtered array. If a pixel is red, that means that that item does not appear in the filtered array.
Why did I choose to make a square image? I originally just did a single line 1px high. But Flash has a bitmap limitation of a max width or height of 2,880 pixels. So a single line means that you would be limited to arrays with only 2,880 items, and that's not the kind of problem I was trying to solve. The performance need arose from having to do this with a shitload of items. So if we use a square instead (easy to calculate the dimensions, just take the square root of the # of items), then we have a theoretical maximum number of items of 2,880 x 2,880 = about 8.3 million. I'm not going to try to filter more than 8 million items client-side, so I'm ok with that limitation.
OK, cool. Now we do the same thing for the second filtered array. Below is the generated image of both filtered arrays side by side. The red image represents Filtered Array #1 and the green represents Filtered Array #2:
But how do you create these images in an optimized way? Since the arrays are already being filtered, that means I'm using Array.filter() with a callback function to decide whether or not the item passes the filter test. This filter function is going to be run on every item in the array. So I use that function to also do my bitmap drawing. And check this out, the callback for a filterFunction tells you the index of the item in the array! So that gives you all the information you need to draw these images. You know that the function will get called one time for each item, and it tells you the index. So just do a simple BitmapData.setPixelAt() and once your filtering is complete your image has been generated. Dope.
So now we start with the real bitmap magic. We're going to use the compare() method of the BitmapData class to generate a third image that represents a pixel by pixel comparison of these two images. To use the compare() method you call it on one BitmapData and pass in a second BitmapData. The function will return a new BitmapData object that contains pixel comparisons. If the pixel in bitmap 1 is the same as the pixel in bitmap 2 then the new comparison bitmap will have a transparent pixel (0x00000000). If they are not the same, then it will return the "difference pixel", which contains the difference between each channel (RGB). So what happens when we run compare() on these two images? We get a new image something like this:
There are only a few possible pixel values now, and each unique pixel value tells us something special. So by looping over the pixels we can know whether the item occurs in both arrays, one array or the other, or neither array. And just like that, we got our answer.
Try out the demo application I created here. And you can view the full source here. Be forgiving, the source code is dirty as hell... no comments, no nothin'... I basically just cranked this out and decided to put it out there without taking the time to clean it up because a) I think it's a pretty cool approach and b) it's dinner time and I'm hungry.
So what about some benchmarks? I've put together a little demo app that you can play with to try out this method. It lets you run two algorithms, one that is a very simple for loop that loops over each item and calls array1.indexOf(item) and array2.indexOf(item) to see if the item is present in both filtered arrays. Then you can also run the bitmap test that I've described. Here are the results that I had on my machine when running both techniques on various numbers of items:
|# of items
|For loop text
|1 million items
These aren't meant to be scientific results. There are some pretty big factors that might influence speed. For the tests I have a filter function that just chooses at random (with 50% probability) which items to include in the two filtered arrays. If you change the number of items in either array the speeds will fluctuate. But the basic idea is that doing a for loop and checking for an item using indexOf fails miserably when you start getting over the tens of thousands of items. The bitmap method works even on a million records, albeit slowly at 3-4 seconds, but at least it works without timing out or crashing the browser.
And like I said before, I'm sure there are other algorithms that would be faster than looping over every item in the full array and checking using indexOf. If you have some implementations that solve the same problem but with better performance, please let me know in the comments.
I don't really know if I'm going to actually use this technique anywhere, but I might. I'm also thinking about performing other calculations on large datasets using similar methods. Essentially if you have two or more datasets, you can use this graphical technique to perform all kinda of other calculations. I'm only using the compare() function to test for a few very distinct possibilities, but you could imagine doing something much more detailed, like performing calculations based on the aggregate color difference of multiple array images. The bitmap manipulation in Flash is so damn fast, that it opens a whole new door if you try to think about using those methods for number crunching and not just pretty graphics.
P.S. I know it's been a long time since I did a blog post with code. I think this one geeks out to a pretty extreme level, so for anyone reading who was getting fed up with my non-technical ramblings, I hope this makes up for it.