I’ve been combing through the Wikileaks Iraq War Logs dataset and experimenting with different visualizations. This new one shows each individual death logged in the data. A single death is drawn as a single dot. The color of the dot indicates who was killed: either a civilian, a coalition soldier, an Iraqi soldier, or an enemy combatant. These classifications are taken directly from the military records; I did not categorize the data myself in any way. The dataset documents exactly 108,394 deaths, so exactly 108,394 dots are drawn.
Coalition soldiers are white dots, Iraqi forces are gray dots, enemy forces are blue dots, and civilians are red dots. At a glance you can see the shift from the heavy blue in the early days of the war to the overwhelming red. Let that soak in for a second. Every red dot is a civilian life.
Explore the visualization by selecting a different tab along the top (Years, Months, Incident Type, Category, Casualty Type) or by using the plus and minus buttons to zoom in and out of the visualization. I encourage you to experience the visualization in full screen (use the full-screen button on the bottom-right).
This visualization uses the dataset produced by the Guardian, which filtered the full Wikileaks dataset to include only records with one or more deaths logged. It contains 52,048 records that document 108,394 deaths.
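For anyone curious how records turn into dots, here is a minimal sketch in Python. It is not the code behind the visualization. The record layout (a `date` plus a `deaths` dictionary keyed by category) and the category names are assumptions made up for illustration, and the sample records are invented. It keeps only records that log at least one death, then expands each one into a single colored dot per death.

```python
from collections import Counter

# Hypothetical mapping from casualty category to dot color (colors as described above).
COLORS = {
    "civilian": "red",
    "enemy": "blue",
    "iraqi_forces": "gray",
    "coalition": "white",
}

def records_with_deaths(records):
    """Keep only the records that log one or more deaths."""
    return [r for r in records if sum(r["deaths"].values()) > 0]

def dots_for(records):
    """Expand each record into one (date, color) dot per logged death."""
    dots = []
    for r in records:
        for category, count in r["deaths"].items():
            dots.extend((r["date"], COLORS[category]) for _ in range(count))
    return dots

# Invented sample records, only to exercise the functions.
sample = [
    {"date": "2004-01-05", "deaths": {"civilian": 2, "enemy": 1, "iraqi_forces": 0, "coalition": 0}},
    {"date": "2004-01-06", "deaths": {"civilian": 0, "enemy": 0, "iraqi_forces": 0, "coalition": 0}},
    {"date": "2004-01-07", "deaths": {"civilian": 0, "enemy": 0, "iraqi_forces": 1, "coalition": 1}},
]

kept = records_with_deaths(sample)
dots = dots_for(kept)
color_counts = Counter(color for _, color in dots)
```

With the real data, the same two steps would yield 52,048 kept records and 108,394 dots.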
Please note that this data only contains incidents documented by Multi-National Force – Iraq and presents only a partial, incomplete record of the war. Please see this article about issues with this dataset.
I was honored to be asked to give a keynote presentation at 360|Flex in DC last month. All the sessions were recorded, and John Wilker was gracious enough to let me post the full video of my keynote.
This keynote was a bit different. I went out on a limb a bit and talked about the experimental projects that I’ve been working on, and my belief in the importance of pursuing fun experiments to stay invigorated and passionate about our work. It covers a number of mapping and data visualization projects I’ve been playing with, but the point was really that we all need to pursue what we’re passionate about. For me that happens to be maps right now, but everyone has their own unique areas of interest.
If you’re interested in mapping work then the projects I talk about should be right up your alley. But even if you’re not a map geek, I think the presentation is still interesting and (I hope!) inspirational.
You can also see the slide deck on its own, but I think the video gives much better context to the slides.
I’ve been playing with different ways of representing data (see my previous night lights example) and I decided to venture into 3D representations. I’ve used a full year of crime data for San Francisco from 2009 to create these maps. The full dataset can be downloaded from the city’s DataSF website.
A view from above
This view shows different types of crime in San Francisco viewed directly from above. The sun is shining from the east, as it would during sunrise.
I love how some of the features in these maps are pretty consistent across all the crime types, like the mountain ridge along Mission St., and how some of the features only crop up in one or two of the maps. The most unique map by far is the one for prostitution (more on that further down).
An alternate view
Here’s the same data but from a different angle, which helps show some of the differences.
UPDATE: Whoops, I screwed up originally and had a duplicate image. The original graphic showed the same map for Vandalism and Assault (both were the Vandalism map). This updated graphic has the correct map for Assault.
Many of the maps have peaks in the Tenderloin, which is that high area sort of in the north-east center area of the city. Some are extremely concentrated (narcotics) and some are far more spread out (vehicle theft).
My favorite map is the one for prostitution (maybe “favorite” is the wrong choice of words there). Nearly all the arrests for prostitution in San Francisco occur along what I’m calling the “Mission Mountain Ridge”, which runs up Mission St between 24th and 16th.
EDIT: I’ve been corrected. Upon closer inspection the prostitution arrests are peaking on Shotwell St. at the intersections of 19th and 17th. I’m sure the number of colorful euphemisms you can come up with that include the words “shot” and “well” are endless.
I love the way the mountain range casts a shadow over much of the city. There’s also a second peak in the Tenderloin (which I’m dubbing Mt. Loin).
Drug crimes are also interesting to look at, since so much of the drug activity in San Francisco is centered in a few distinct areas. We can see Mt. Loin rising high above all the other small peaks. The second highest peak is the 16th St. BART peak.
There are other consistent features in these maps, in addition to Mt. Loin and the Mission Range. There’s a valley that separates the peaks in the Mission and the peaks in the Tenderloin, which is where the freeway runs (Valley 101). You’ll also notice a division in many of the maps that separates the southeast corner. That’s the Hunter’s Point Riverbed (aka the 280 freeway).
These maps were generated from real data, but please don’t take them as being accurate. The data was aggregated geographically and artistically rendered. This is meant more as an art piece than an informative visualization.
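To give a feel for what “aggregated geographically” can mean, here is a rough Python sketch of one common approach: counting incidents per grid cell and treating the count as the height of the terrain. This is an assumed technique for illustration, not a description of how these maps were actually produced. The function name, the cell size, and the sample coordinates are all invented.

```python
from collections import Counter

def bin_incidents(points, min_lon, min_lat, cell_size):
    """Count incidents per grid cell; each count can be used as a terrain height.

    points is an iterable of (longitude, latitude) pairs, and cell_size is the
    width of a square cell in degrees (an arbitrary choice for this sketch).
    """
    grid = Counter()
    for lon, lat in points:
        col = int((lon - min_lon) // cell_size)
        row = int((lat - min_lat) // cell_size)
        grid[(row, col)] += 1
    return grid

# Invented coordinates roughly within San Francisco, only to exercise the function.
incidents = [(-122.41, 37.78), (-122.42, 37.79), (-122.48, 37.72)]
heights = bin_incidents(incidents, min_lon=-122.5, min_lat=37.7, cell_size=0.05)
```

Larger cells smooth the terrain into broad hills, while smaller cells produce sharper, more isolated peaks.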
I’m in love with the New York Times data visualization/infographics division. They consistently put out some of the most amazing visualization pieces (both in print and online) that I’ve ever seen. Their recent geographic analysis of Netflix ratings was absolutely superb. And we all probably saw their election maps (either for 2008 or 2004). They produce stunning displays that convey amazing amounts of information in a way that only interactive graphics can do. And they’re all done in Flash.
So when you see images showing the missing plugin icon on the New York Times website on the iPad or iPhone, that’s not just some annoying ad that’s not playing or a streaming video. That’s some of the most cutting edge visualization work that’s being produced today. And without Flash it simply doesn’t exist.
The New York Times (like all newspapers) is in crisis. They are trying to reinvent themselves in an online form. And as a news organization they are one of the most progressive and experimental out there. They are embracing the new medium by doing some of the best damn interactive graphic work I’ve ever seen. They make things that convey news and information in ways that draw people in and keep them coming back for more.
But without Flash they’re just a newspaper. And we all know newspapers are dying.
I’ve been reading William Playfair’s Commercial and Political Atlas, in which he invented the line chart. In the book, Playfair examines the imports and exports between Britain and various countries. To illustrate these trade relationships, Playfair created the first ever line charts that show the change in trade over time.
Each section of the book covered a different country, and each one contained a chart that showed the imports and exports like this:
Two line series are shown, one for imports and one for exports, and shading is used to show when there was a “balance in favor of England” (when there were more exports than imports).
I’ve been captivated by these charts and wanted to recreate them, but with modern data. You can find tons of US trade data at the US Census Bureau’s website, including a spreadsheet that has all the data in one place. I downloaded that data and put together a little application to create Playfair-esque charts.
The app displays all the countries that the US has trade data for, month by month going back as far as 1985. Each country is displayed in the list on the left with a sparkline chart of the trade data. A red fill indicates we are importing from a given country more than we are exporting, and a light green fill indicates we are exporting more than we are importing.
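The fill rule is simple enough to sketch in a few lines of Python. This is only an illustration of the rule described above, not the app’s actual code. The function names and the monthly figures are made up, and the numbers are not real Census data.

```python
def balance_fill(imports, exports):
    """Pick the fill color for one month of trade with a country."""
    if imports > exports:
        return "red"         # importing more than we export
    if exports > imports:
        return "lightgreen"  # exporting more than we import
    return None              # balanced month, nothing to fill

def fill_series(monthly):
    """Map (month, imports, exports) rows to (month, balance, fill color)."""
    return [
        (month, exports - imports, balance_fill(imports, exports))
        for month, imports, exports in monthly
    ]

# Invented figures, only to exercise the functions.
sample = [
    ("1985-01", 120.0, 150.0),
    ("1985-02", 140.0, 140.0),
    ("1985-03", 180.0, 130.0),
]
fills = fill_series(sample)
```

A positive balance here corresponds to what Playfair would have called a balance in our favor.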
Exploring the data
The charts tell some really interesting stories. Some of the charts show a nearly identical relationship of imports to exports, both growing at the same rates, like these charts of the UK and Guatemala.
Other charts show different relationships. Notice how exports to Hong Kong have been steadily increasing, but imports from Hong Kong have been declining.
Or we can see what imposing sanctions on a country looks like, as illustrated by sanctions on Burma that were put into place in 2003:
We can see the massive growth of China (and notice how interestingly seasonal each year is, peaking in October):
One final chart that I find very interesting isn’t a country at all, but the import and export of what is classified as “Advanced Technology Products”, which includes things like biotech and advanced electronics products. Notice how up until the early 2000s we were exporting more of these products than we were importing, but by 2002 that balance shifted and the gap continues to increase:
I had fun creating this app, but one thing I didn’t expect was how much fun researching the charts was going to be. The charts that stuck out with trends that were abnormal all had interesting stories to tell about the history of the country.
In closing, I’ll end with a quote from Playfair in which he describes the concept of displaying numeric values in a line chart (remember, he was the first person to actually do this):
As the eye is the best judge of proportion, being able to estimate it with more quickness and accuracy than any other of our organs, it follows, that wherever relative quantities are in question … this mode of representing it is peculiarly applicable; it gives a simple, accurate, and permanent idea, by giving form and shape to a number of separate ideas, which are otherwise abstract and unconnected.
Well said, Mr. Playfair, well said. Your charts are just as effective more than 200 years later.
On the RIAdventure conference I gave a presentation about the past, present, and future of data visualization as I see it (fun side note: RIAdventure is the only conference I can say I “went on”). Luckily, the organizers filmed the entire thing, and we now have the video of the whole presentation that you can watch. This presentation covered a brief history of the field of data visualization, with the focus on the invention (in the not too distant past) of many data visualization techniques we take for granted. The point of the historical exercise was to show that the new data we have before us presents new opportunities for invention. I talked about new trends I see emerging in the data itself (massive datasets, city data, your life data, stream data) and what those trends mean for us as data visualization software engineers (I also argue that everyone will be a “data viz” engineer to some degree in the future).
I hope you enjoy the presentation; it was a lot of fun to create and to present. I learned a ton from the research and it was exciting thinking about the future of the field. Below is the full video (low resolution streaming from Vimeo, or you can find higher resolution streaming from screencast here, or you can even download the full video file). Also embedded below are the slides that go along with the presentation, and you can always download the slides as a PDF.
One of the first time-series line charts ever drawn was a visualization of the great American credit crisis (but probably not the credit crisis that comes immediately to mind). If you were to look at this chart today you might even mistake it for the charts of the housing credit crisis of the past few years.
(see image credits below for image details)
This chart was created by William Playfair and published in 1786 in The Commercial and Political Atlas. In that work Playfair literally invented the line chart. This particular chart shows the imports and exports between Britain and America between 1700-1800. The red line is the line for exports from Britain to America, and the lighter yellow line is the imports from America. You can see the relationship of imports to exports stays relatively constant for the first 50 years (1700-1750) and then the exports start shooting up dramatically, at a rate much greater than the increase in imports.
I guess we know what a credit-driven catastrophe looks like. And it’s not only the image itself that looks similar; at times his words sound as if he’s writing today about our current financial mess.
Between 1750-1772 there was a rapid increase in exports from Britain to America. These exports were the result of many new merchants hoping to strike it big by shipping goods to the new settlers. But the reason things got out of control has to do with credit. Merchants started lending and borrowing on credit to finance their get-rich-quick schemes of selling stuff to America. Playfair writes (all emphasis added is mine),
Ever since the invention of paper credit, trade has had a latitude it did not before enjoy, and its progress being less natural, has become more intricate. That bound set and preserved by the nature of things was removed, when paper credit was first invented; previous to which, nothing represented wealth that was not wealth itself, or that was not physically worth the sum it represented; and in order to give credit in business, it was absolutely necessary either to possess, or to have borrowed capital.
And because of this new credit, people started making business decisions that were insane. They started shipping products to America before they knew they could sell them. Since the money was free they took irrational risks. And if your business venture failed miserably you could always just hide from your creditors in that new land of opportunity.
Of the eventual crash, Playfair writes,
For the first fifty years, we observe the simple and regular growth, from poverty to wealth, of a new country; during the succeeding twenty years, we are astonished at the extent and operation of a mad mercantile speculation carried on by our own country; and the period which succeeds, shews the catastrophe that so airy and so ill-founded a project was likely, sooner or later, to experience. There is not any branch of trade, which, from the nature of its progress, affords so much instruction as this. It merits equally the attention of the philosopher, the politician, and the merchant; for it throws light upon all the three different objects of their pursuits.
Isn’t that beautiful? Almost the same words could apply to the current financial crisis. And one final quote that I like, which also made me think of our current crisis:
Upon the manner in which business is conducted, depends something more than merely the gaining or losing a little money. The happiness of numbers of innocent individuals is frequently depending upon the success of projects, with the formation of which they had no concern. What numbers have been ruined, and how many more deprived of fortune, by our ill-conducted trade with America?
What numbers have been ruined indeed.
I’ve been reading the works of Playfair to understand the history of data visualization (in this same work he also invented the bar chart, and in a subsequent work he invented the pie chart). I wanted to make sure I understood the history of statistical charts, since as they say, those who cannot remember the past are condemned to repeat it. I didn’t realize that phrase would also apply so perfectly to the text accompanying the images.
* Image Credits
The first image above is from William Playfair’s Commercial and Political Atlas, 3rd edition, published in 1801. The scan is of a copy contained in the University of Pennsylvania’s Annenberg Rare Book and Manuscript Library. It was reproduced in a publication by Cambridge University Press entitled The Commercial and Political Atlas and Statistical Breviary, published in 2005, which was compiled by Howard Wainer and Ian Spence (and if you want to be even more technical the image above is a reproduction from a Google scan of the Cambridge University scan). As was decided in Bridgeman vs Corel Corp (full text), a reproduction of a work of art in the public domain is not protected by copyright. As was stated in that verdict: “While it may be assumed that this required both skill and effort, there was no spark of originality — indeed, the point of the exercise was to reproduce the underlying works with absolute fidelity. Copyright is not available in these circumstances.” I am reproducing the image here with that legal precedent in mind, and with the best of intentions. I would highly recommend that if you are interested in Playfair’s work you buy the reprint by Cambridge University Press. It contains full-color reproductions of the charts, and the introduction contains great biographic information about Playfair.
Images courtesy of the Image Science & Analysis Laboratory, NASA Johnson Space Center
As I was flying back home into San Francisco airport I was watching the city lights out the window and got struck by a bit of inspiration. I find cities beautiful, from the graffiti to the neon signs to the line of headlights on the highway. A city viewed from above at night is captivating. I wanted to try to recreate that same look, but by visualizing data (in one sense you can say that the real view of a city from above is already a visualization of population data).
I started searching for images of cities at night, and found these amazing images from NASA. All those images were taken from a space shuttle orbiting the earth. These images tell you a lot about the city, the layout, urban density, planning (or lack thereof). I wanted to take other meaningful data and create similar images.
All the visualizations below have been created with SpatialKey. However, this is some experimental work I’ve been playing with to generate the “night light” images, so it’s not released (and might not ever be). Basically this is a peek behind some of the R&D work I do for fun (yes, for a dataviz dork like me making fake “cities at night” images is my idea of fun).
Crime in San Francisco
This image is all crime in San Francisco for a 3-month period. You can see some of the same features that you can see in the NASA space image, such as Golden Gate Park and the Presidio (the area on the north-west edge of the city). All in all it’s interesting how similar the crime image looks compared to the NASA image. Downtown is the brightest spot in both images, which means that it’s literally the brightest area of the city (the most streetlights), and also has the most crime.
And here are breakdowns for a few different crime types. Notice how different the distributions are. Narcotics crimes are heavily clustered and can be found downtown (in the Tenderloin), in the Mission (near the 16th St BART station), and along Haight Street near Golden Gate Park. Whereas vehicle theft is scattered fairly evenly throughout the city.
Both San Francisco and New York publish their 311 data, which is when citizens call for city services. One category of 311 calls is to report graffiti. Graffiti is interesting in that it often follows specific city streets. When we look at the graffiti data for both cities we see specific streets that have far more graffiti than others. I love these images (particularly the one of SF) because they really look like a view of street lights from a plane.
Trees planted in San Francisco
Another one of my favorites of this set is data for all the trees that the city of San Francisco has planted since 1990 (all this SF data is available at datasf.org). You can see the heavy planting along Market St (which cuts diagonally through downtown), as well as along streets like Sunset Blvd (the street running north/south on the western side of the city).
Street lights (or SF as a giant lite-brite)
One final image of San Francisco we have is the locations of every street light in the city. I liked this image because it reminded me of playing with a Lite-Brite when I was a kid. It almost makes city planning feel like a grown-up version of playing with little plastic lights.
Data can often tell you far more about people than you originally think. In my previous post I presented some of the data from the history of the FlexCoders mailing list. I showed some of the details of the individual usage patterns for different people. One of those people was the Flex product manager, Matt Chotin. Matt’s involvement with FlexCoders is pretty interesting if you start to dig into the data. In this post I’ll try to identify some changing trends in his usage patterns and we’ll see if we can do some detective work to figure out why his behavior changed.
A little background: Matt has been involved in Flex since basically forever. He was an engineer at Macromedia and is now the product manager for Flex. Matt has been quite prolific on flexcoders over the years (in the overall ranking he’s #3). So to start I was interested in his overall post volume on the list. Take a look at the timeline showing his posts per month and you’ll notice there’s a distinct drop-off:
So if you notice the number of flexcoders posts going down it’s because my brain will be slowly atrophying as I move away from the details of our vast offering.
And that’s exactly what happened.
Seeing the correlation between a change in professional life and a drop in activity is cool, but we can dig deeper. Not only is this data telling us when Matt changed his behavior throughout the year, but we can also figure out something about his daily routines and how that changed as well. I started looking at when (as in what time of day) Matt was posting to the list.
Here’s a chart that shows the distribution of posts by hour of day and day of week. It groups the posts by the combination of what day and what hour they occur on.
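The grouping behind a chart like this is easy to sketch in Python. This is a minimal illustration rather than the code I used, and it assumes each post has already been parsed into a `datetime` in the poster’s local time zone. The function name and the sample timestamps are invented.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def posts_by_day_and_hour(timestamps):
    """Count posts for each (day of week, hour of day) combination."""
    counts = Counter()
    for ts in timestamps:
        # datetime.weekday() returns 0 for Monday through 6 for Sunday.
        counts[(DAY_NAMES[ts.weekday()], ts.hour)] += 1
    return counts

# Invented timestamps, only to exercise the function.
posts = [
    datetime(2007, 6, 4, 9, 15),   # a Monday morning
    datetime(2007, 6, 4, 9, 40),   # the same Monday, same hour
    datetime(2007, 6, 5, 21, 5),   # a Tuesday evening
]
heat = posts_by_day_and_hour(posts)
```

Each of the 7 x 24 possible cells then becomes one square in the chart, shaded by its count.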
So you can see that Matt posted the most on weekday mornings (around 9-11am on Monday-Friday) and weekday evenings (around 8-10pm Monday-Thursday, note that he rarely posts on Friday nights).
This pattern is actually very similar to Alex Harui’s activity as well, although Alex’s activity is weighted more toward work hours than evenings (except for Sunday night!).
I found the evening hotspots interesting (in both Matt’s and Alex’s cases). Clearly Matt was answering a lot of people’s questions from home after work hours.
I dug a bit further into Matt’s trends. Here’s the graph of his activity by hour of day for 2005:
We can see in 2005 he actually answered more questions in the evening than in the morning. Taking a look at 2006, this became even more pronounced: almost all his activity was at night (I wasn’t the only one who noticed this, see Ryan Stewart’s post about Matt posting at 9pm):
And then there was a change in 2007. The graph for 2007 shows that he started answering more questions during the workday. And that shift continued into 2008 and 2009, by which time almost all of Matt’s activity was during work hours.
If you dig even deeper into the data you can find out that the transition from mainly evening activity to work-day activity happened mostly during the months of April 2007 – June 2007. After about July 2007 Matt posted almost exclusively during the day. Taking a look at the release history of Flex, we see that the beta of Flex 3 came out in June 2007. So my guess is that Matt changed to a management role in May of 2006, but had far too much work to do to get Flex 3 ready and out the door between then and June 2007 (meaning his devotion to flexcoders had to be delegated to the evening hours). Finally once the Flex 3 beta was out the door he could devote some actual work hours to being involved in the community, instead of having to do it all from home.
As if knowing the intimate details about Matt’s daily routine isn’t enough, we can learn something about his historical vacation time off as well. Matt’s impressive in that he’s never gone a month without posting. If you go even more granular there are actually very few weeks that he missed (as his overall activity declined in 2009 this became more common). So if we look at Matt’s activity around the holidays something interesting pops out (well, it’s only interesting if you’re a total stalker, but if you’ve read this far then you probably are). Here are a few timelines of different years, showing columns grouped by week. In 2005 we see Matt was posting pretty regularly through the holidays. There actually was a 5 day stretch with no posts, but that was it (due to the way the weeks are grouped that gap doesn’t show in this chart).
I’ll be on vacation until mid-January so emails to me will go unanswered as will responses to various forums and blog comments 🙂 Happy Holidays to all!
The data never lies.
Looks like a long vacation over the holidays didn’t turn into a regular thing though, since he was right back at it the following year:
I’m not a total nut job
I know it seems like I’m obsessed with Matt Chotin. And regardless of whether that’s true or not, I do want to assure people I’m not totally off my rocker. This little experiment in data mining and analysis isn’t really about Matt. It’s about the stories data tells about all of us. There are mountains of public information out there about us all, and the tiny little bits that we put out there, even if those are just little Facebook or Twitter status messages, can say a lot about us. Sure, a single Facebook status message doesn’t tell anyone much, but when you look at all of them over a multi-year period you can start learning a lot about a person. And often that information that the aggregate data tells about us isn’t something we’re aware of. From this data experiment I know when Matt eats dinner (pretty typical range of 6-8pm), when he goes to bed (around midnight), and when he gets to work (again pretty normal between 8-9). And this is all from only 4,000 data points. With social networking and microblogging sites we’re starting to create thousands of little data points like this all the time.
Thanks to Matt Chotin
I ran this post by Matt first, since I know it’s a bit creepy. He was cool with me posting it, so thanks Matt! And thanks for all the years of hard work answering questions on flexcoders, we’re a stronger community because of it.
Happy Halloween! This is the dorkiest pumpkin I’ve ever carved. For those of you into data visualization or mapping, maybe you can recognize it:
This is a pumpkin representation of Charles Minard’s visualization of Napoleon’s march into Russia in 1812. This graphic is considered by some (e.g., Edward Tufte) to be the “best statistical graphic ever drawn.” The graph shows the size of Napoleon’s army as they marched to and from Moscow. You can see how the army shrank as they approached Moscow. Once they reached Moscow they found the city had been abandoned and burned. Then they marched back home, except it was through a brutal Russian winter that nearly killed the remaining army. By the time they returned home you can see the size of the army is just a small trickle.
Beyond just the two charts of the march to and from Moscow, the graphic also serves as a map, with the paths indicating where the troops were geographically. And below the map is a temperature chart that visualizes how severe the winter weather was, which correlates with some of the major drops in troops on the way home.
The carved pumpkin ended up being very hard to take a photo of because the graph wraps around over half the pumpkin’s circumference. So I tried to take a few pictures to get the different sides. I carved the march to and from Moscow, as well as the temperature chart along the bottom.