
I can’t think of anything that excites a greater sense of childlike wonder than to be in a country where you are ignorant of almost everything.

Bill Bryson


There is no royal road to learning; no short cut to the acquirement of any art.

Anthony Trollope

Only those who dare to fail greatly can ever achieve greatly.

Robert Kennedy

What Does the Western Look Like? – The Outtake

What Does A Western Really Look Like?

I watched 50 westerns and compressed them into single frames of form and light.

Not too long ago, I watched 50 westerns. The montage above represents 16 of them. The massive montage a few paragraphs down shows the rest.

Each image within the montage is a sum image of every 10th second of each film — that is, one frame from every 10 seconds of a western was extracted and summed with the others to create a real image (math, basically).
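A minimal sketch of the summing step described above, in plain Python. It assumes the frames have already been extracted and are all the same size; the function name `sum_image` and the tiny test frames are made up for illustration, and the sum is divided by the frame count so the result stays in a displayable 0 to 255 range (the original images were produced with ImageJ, not with this code).

```python
def sum_image(frames):
    """Average equally sized RGB frames pixel by pixel.

    Each frame is a list of rows; each row is a list of (r, g, b)
    tuples with channel values in 0..255.
    """
    if not frames:
        raise ValueError("need at least one frame")
    height, width = len(frames[0]), len(frames[0][0])
    # Running per-pixel, per-channel totals across all frames.
    totals = [[[0, 0, 0] for _ in range(width)] for _ in range(height)]
    for frame in frames:
        for y in range(height):
            for x in range(width):
                for c in range(3):
                    totals[y][x][c] += frame[y][x][c]
    n = len(frames)
    # Divide by the frame count so the sum becomes a viewable image.
    return [[tuple(round(v / n) for v in px) for px in row] for row in totals]


# Two made-up 1x2 "frames" standing in for sampled film frames.
frame_a = [[(200, 0, 0), (0, 0, 0)]]
frame_b = [[(0, 0, 100), (50, 50, 50)]]
result = sum_image([frame_a, frame_b])
print(result)  # [[(100, 0, 50), (25, 25, 25)]]
```

With hundreds of frames per film, any single shot contributes very little, which is why only persistent framing and color habits survive in the result.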

Only occasionally might one make out actual features of films when they are viewed this way — perhaps end-titles or credits or changes in aspect ratio, but certainly not characters, objects, or sets. Rather, these summed images show us patterns and colors: blobby shapes occupying the center frame, a shadowy vignette ringing the corners, and mottled concentrations of saturation bleeding into one another.

These shapes and colors are evocative in a way that tea leaves and tarot are: they don’t actually tell you much about what you’re looking at, but they allow you an emotional response confirmed or denied once you come to discover what the image “really” is. Consider these three:

Actually, I tricked you. These aren’t westerns, but sum images respectively of The Matrix (1999), The Godfather (1972), and The Wizard of Oz (1939). In hindsight, The Matrix’s green-code color palette and fixed-width font scroll are recognizable. But The Godfather isn’t nearly as dark as we might expect and The Wizard of Oz not as bright.

So, alone, these summed images can only tell us so much. But what if we compare a large group of them? Do westerns “sum” more or less the same? Is there a constitutive color palette? Is there a noticeable shift in their “average look” over time?

48 of the 50 westerns I watched and “summed.”

The Western, Saturated

To begin a comparative analysis of the set of filmic, washed-out color fields I had generated, I used the program ImageJ (about which I’ve written before) and a plugin that helps visualize the relationship between the summed images by plotting thumbnails of those images along two axes. I also used ImageJ to generate the summed film frames, using the Z-Projection feature.

I worked with simple interrelated features of images that are easily calculated (math again!) by computer: brightness, hue, and saturation, although the last two are not as useful for black-and-white films.
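As a rough illustration of how those three features can be computed, here is a sketch using Python's standard `colorsys` module. The function name `median_hsb` and the sample pixels are hypothetical, and this is not the ImageJ plugin's own calculation, only one plausible way to get a median hue, saturation, and brightness from a summed image's pixels.

```python
import colorsys
from statistics import median


def median_hsb(pixels):
    """Return (median hue in degrees, median saturation, median brightness).

    `pixels` is a flat list of (r, g, b) tuples with values in 0..255.
    Hue is an angle, so a plain median is only a rough summary for
    images whose hues sit near the 0/360 wrap-around.
    """
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    hue = median(h for h, _, _ in hsv) * 360
    sat = median(s for _, s, _ in hsv)
    val = median(v for _, _, v in hsv)
    return hue, sat, val


# Made-up pixels: a grayscale "film" and a fully saturated one.
gray = [(40, 40, 40), (128, 128, 128), (200, 200, 200)]
color = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
gray_hue, gray_sat, gray_val = median_hsb(gray)
color_hue, color_sat, color_val = median_hsb(color)
```

The grayscale pixels come out with zero hue and zero saturation, which is consistent with black-and-white films carrying little information on those two axes.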

Here is what it looks like to plot median hue by median saturation. In other words, from left to right the hue changes, where up to down we move from less saturated to more saturated.

In the bottom left, eleven of the black-and-white films are stacked on top of each other, occupying one space as, again, there’s not a whole lot of hue or saturation in black and white. Then, above them we see a vertical line of films with similar hues but varying saturations. We can easily see that by this measure Seven Men from Now (Budd Boetticher, 1956), which occupies the top spot, is the most saturated film I watched.

Outside of that tall left-most column, there are a few lower-saturation blue-hued films that stand out at the bottom middle. Decision at Sundown (Budd Boetticher, 1957) is the one from that group furthest to the right, just above and to the right of Johnny Guitar (Nicholas Ray, 1954). Before you look for yourself, yes, the two Boetticher films had different cinematographers.

Then, most obvious, is that horizontal line of four films in the middle. Here are those four up close, from left to right:

The Naked Spur (Anthony Mann, 1953)

Once Upon a Time in the West (Sergio Leone, 1968)

The Missouri Breaks (Arthur Penn, 1976)

Ride the High Country (Sam Peckinpah, 1962)

Of these four, The Naked Spur (Anthony Mann, 1953) is the most interesting to me. First, it doesn’t show nearly the same amount of lighter blue sky that the others do. Second, it clearly shows the film’s pattern of framing large red-hued shapes in the center with blue on the right and blue-green on the left.

The Naked Spur was filmed in Technicolor. It’s not the only western I watched in that process, but it’s the one whose summed frame most obviously shows off the deep saturation that Technicolor is known for. Another consideration is that The Naked Spur was one of the few I watched in the Academy ratio, so we might expect the film’s “average look” to be concentrated in one central place.

Ride the High Country (Sam Peckinpah, 1962) has the most unique hue of the group, somewhere in the magenta (300°) range. Magenta! This is surprising to me, as the film doesn’t seem particularly garish, and has a number of nighttime camp scenes — although perhaps those campfires are showing up as magenta-hued?

Ride the High Country was filmed in Metrocolor. Here are the frames I used to generate the summed image:

What’s visible in Ride the High Country that’s even more visible in The Missouri Breaks (Arthur Penn, 1976) is the obvious blue sky at the top of the frame. It’s clear a number of scenes were filmed outdoors, with a horizontal or up-looking camera framing some part of the sky.

The Missouri Breaks also demonstrates another feature visible in a number of these films: a darker magenta-hued shape center frame, with a yellow-red-hued falloff to the right and left. While it’s difficult to make out characteristics, this shape is a result of the film’s repeated use of shots that only show one character.

Whether closeups or medium shots, we’re more often looking at one person than a group. The Missouri Breaks is about a gang of thieves, but its strength comes from its emphasis on a battle of individual wills: Jack Nicholson’s rustler, Marlon Brando’s regulator, John McLiam’s land baron, and his love-interest daughter Kathleen Lloyd.

With that idea in mind, look again at the widescreen (2.35:1) frame of Once Upon a Time in the West (Sergio Leone, 1968). Rather than one darker shape in the center, it instead has a few dark shapes: one darker, bluish shape taking up the left third of the frame, then two smaller ovals on the right half. One explanation could be the director’s preference for framing closeups on the left, mixed with repeated long shots of evenly spaced groups of characters across the frame, such as these two shots from the opening scene:

The Western, Brightness

Another pattern shows up when we plot median hue vs. median brightness. From left to right the hue changes as it did in the first plot, while from top to bottom we move from brighter to darker.

Looking at this visualization, a few films stand out. First, I can see the majority of the westerns I saw are concentrated around one particular median hue value on the left, while this grouping of films is still differentiated by brightness and saturation. I would wager that a sampling of most color films of any genre would result in a similar spread: the soft, warm hue seems to me to be the color of light, mostly reflected off of human skin.

What is interesting, though, are the outliers: far on the top right is the magenta Ride the High Country, which we’ve already seen is unique in its hue. In the middle on the right is The Missouri Breaks. The brightest film is Johnny Guitar, in the center top, only a touch higher on the y-axis than Two Rode Together (John Ford, 1961). Locked solidly in the bottom left corner is The Gunfighter (Henry King, 1950). Here are those last two up close:

Johnny Guitar (Nicholas Ray, 1954)

The Gunfighter (Henry King, 1950)

In The Gunfighter, Gregory Peck’s character spends much of the film somberly waiting in a saloon for a chance to leave town, and the dark tone and vignetted corners reflect the oppressive interior locations.

Johnny Guitar, bright and evenly lit.

Conversely, Johnny Guitar is bright and evenly lit, particularly in its interiors. Here’s a typical shot. While we can see some occasional dark areas, a glance at the actors’ shadows on the floor shows us the numerous light sources on set.

Another interesting pattern from the above visualization is the cluster of films that occupies the upper left corner. They include Two Rode Together, Comanche Station (Budd Boetticher, 1960), Unforgiven (Clint Eastwood, 1992), Young Guns (Christopher Cain, 1988), High Plains Drifter (Clint Eastwood, 1973), and The Quick and the Dead (Sam Raimi, 1995), among others. These seem to be the most “western” in look, and out of all 50, one from this group, The Searchers (John Ford, 1956), most clearly demonstrates to me the genre’s basic look: a wide desert and a quarter of blue sky.

The Searchers (John Ford, 1956)

The sharp contrast between the two colors, limned by the faintest jagged suggestion of mountains, is compelling. Below are the sampled frames used to generate this image. We can see that even in the nighttime scenes there are either warm, fire-lit tones in the bottom two-thirds or blue silhouetting skies in the upper third.

Another mode of looking at The Searchers further demonstrates how and where this foreground-sky demarcation works.

Here are a few “barcode” views of the film. In each case, a number of frames are extracted from the film, and then compressed horizontally and presented left to right. Here is a sample of 1200 frames (at 24 frames per second, The Searchers works out to about 171,000 frames).

And here (respectively) are 400, 200, 100, 50, 25, and 10 frames (so you can see how this works):
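The barcode construction can be sketched in a few lines of plain Python. This is an assumption about how such views are typically built (sample frames evenly, average each row of a frame down to one pixel, place the resulting columns left to right), not the exact tool used for the images here; the name `barcode` and the toy frames are hypothetical.

```python
def barcode(frames, n_columns):
    """Build a barcode image from a list of frames.

    Frames are lists of rows of (r, g, b) tuples. `n_columns` frames are
    sampled evenly, each is squeezed to a one-pixel-wide column by
    averaging every row, and the columns are laid out left to right.
    """
    step = len(frames) / n_columns
    sampled = [frames[int(i * step)] for i in range(n_columns)]
    height = len(frames[0])
    image = []
    for y in range(height):
        row = []
        for frame in sampled:
            pixels = frame[y]
            # Average this row of the frame into a single pixel.
            row.append(tuple(sum(p[c] for p in pixels) // len(pixels)
                             for c in range(3)))
        image.append(row)
    return image


# Four made-up 2x2 frames: a blue "sky" row over a red "ground" row.
frames = [[[(0, 0, k), (0, 0, k)], [(k, 0, 0), (k, 0, 0)]]
          for k in (10, 20, 30, 40)]
strip = barcode(frames, 2)
print(strip)  # [[(0, 0, 10), (0, 0, 30)], [(10, 0, 0), (30, 0, 0)]]
```

Because each column keeps its vertical layout, a persistent sky-over-ground split shows up as a horizontal band running across the whole barcode.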

The Western, My Expectations Met

Last, I want to consider Heaven’s Gate (Michael Cimino, 1980). Gauzy, misty, soft, indistinct: it was the film that made me first want to explore this technique for trying to capture a film’s overall “look.” And, I expected it to look exactly as it does.

Heaven’s Gate (Michael Cimino, 1980)

This image perfectly matches my visual conception of the film — almost as if a scrim just like this image had been laid over the lens during filming.

If I were dreaming the film, this is what it would look like to me. If I were making a photo app, this would be a set of readymade photo filters. If I were a fan, this would be a nice poster or computer wallpaper.

As a scholar, though, what use are these average looks — which strip out virtually all narrative, characterization, plot, sound, dialogue, and action? I don’t yet have a cogent answer to that question, but I do have a strong suspicion that film studies will benefit from new modes of visualization such as this one, which represent film texts from an otherwise impossible perspective — in this case, along the z-axis that compresses the film’s time into a single frame of form and light.

Source: What Does the Western Look Like? – The Outtake

If Taxi Trips were Fireflies: 1.3 Billion NYC Taxi Trips Plotted


NYC Metro Area Taxi Dropoffs, 1.3 Billion points plotted.

The NYC Taxi and Limousine Commission (TLC) has publicly released a dataset of taxi trips from January 2009 through June 2016 with GPS coordinates for start and end points. Chris Whong originally sent a FOIA request to the TLC, getting them to release the data, and has produced a famous visualization, NYC Taxis: A Day in the Life. Mark Litwintschik benchmarked various relational database and big data technologies using this dataset given its moderate 400GB size. And notably, Todd W. Schneider produced some really nice summaries of the dataset, some of which are similar to work I show here. I actually was not aware of Todd’s work on this topic until after this post was written, so although there is a fair bit of overlap, this post and the graphics in it are original.

I downloaded the data files from the TLC website, and (very painfully) using Python, Dask, and Spark, have produced a cleaned dataset in Parquet format, which I make available for AWS users at the end of this post.

So I was curious, where do taxis pick up passengers, or more precisely, what does the distribution of taxi pickup locations look like? With 1.3 billion taxi pickups, plotting the distribution in a way that does not wash out detail is very challenging. Scatter plots are useless due to overplotting, and 2D histograms are a coarse form of density estimation that necessarily blurs or pixelates a lot of the details. Additionally, with the full dataset, the pickup locations alone total 21GB, which is more than the memory of my 16GB laptop. Out-of-core tools can solve that technical problem easily (and subsampling is easier than that), but what about the visual problem? Human eyes are incapable of absorbing 21GB of information in a plot.

The solution to this comes from an interesting library called Datashader. It dynamically generates a 2D Histogram at the resolution of your display (or a specified canvas). Each pixel on the display corresponds to certain histogram boundaries in the data. The library counts the number of data points that fall within those boundaries for each pixel, and this number is used to color the intensity of the pixel. Leveraging Dask, the creation of the histogram can scale to terabytes of data, and be spread across a cluster. Leveraging Bokeh, the final plot can be zoomed and panned. Using techniques from high dynamic range photography, intensity ranges are mapped so that maximum dynamic contrast is present at any zoom level, and in any given viewport.
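To make the idea concrete, here is a small plain-Python sketch of the two steps described above: counting points per pixel, then remapping counts to intensities by rank so that a few very busy pixels do not wash out everything else. This is not Datashader's API or implementation; the function names `bin_points` and `equalize`, the canvas size, and the sample points are all assumptions for illustration.

```python
def bin_points(points, x_range, y_range, width, height):
    """Count how many (x, y) points fall into each pixel of the canvas.

    Row 0 corresponds to the low end of y_range.
    """
    (x0, x1), (y0, y1) = x_range, y_range
    counts = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue  # point lies outside the current viewport
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        counts[row][col] += 1
    return counts


def equalize(counts):
    """Map nonzero counts to intensities in (0, 1] by rank.

    Ranking the distinct counts spreads contrast evenly, a simplified
    stand-in for histogram equalization. Empty pixels stay at 0.0.
    """
    distinct = sorted({c for row in counts for c in row if c > 0})
    rank = {c: (i + 1) / len(distinct) for i, c in enumerate(distinct)}
    return [[rank.get(c, 0.0) for c in row] for row in counts]


# Made-up points: one very busy pixel, one moderate, one nearly empty.
points = [(0.1, 0.1)] * 1000 + [(0.6, 0.1)] * 10 + [(0.6, 0.6)]
counts = bin_points(points, (0.0, 1.0), (0.0, 1.0), width=2, height=2)
intensity = equalize(counts)
```

With a linear mapping, the pixel holding a single point would be 1000 times dimmer than the busiest one and effectively invisible; with the rank mapping it still gets a third of the full intensity.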

Taxi Pickup Locations

This is what the map of taxi pickup locations (1.3 billion points) looks like over Manhattan, plotted using the Viridis perceptually uniform colormap.

NYC Taxi pickups map for Manhattan

The first thing I notice is how clearly I can see the street patterns. In parts of Brooklyn and Queens, the street pattern is sharp. In Manhattan, the pattern is ‘fuzzier’, especially near the southern tip of Manhattan and in Midtown south of Central Park. There are an awful lot of pickups that, according to GPS coordinates, fall over the Hudson or East rivers, and quite a few pickups that fall in the portion of Central Park where there are no roads. Obviously, not a lot of taxi trips are starting in the rivers surrounding Manhattan, but what this plot shows is instead how important GPS error is. The fuzziness arises from tall buildings which make it quite difficult to get a good GPS fix, and the taller the buildings, the fuzzier the streets look. More broadly, the Midtown area south of Central Park is very bright, indicating a lot of taxi trips start there.

Taxi pickups map for NYC Metro Area

The second image is also taxi pickups, but on a much wider scale. Zoomed out, most of Manhattan lights up like a beacon, indicating far more pickups in Manhattan than the surrounding area. But the airports, JFK and La Guardia in particular, also light up, showing nearly as much visual intensity (trips per unit area starting there) as Midtown.

Taxi Dropoff Locations

Now let’s examine the dropoff locations using the Inferno colormap.

NYC Taxi dropoffs map for Manhattan

At first glance, the dropoff locations look a lot like the pickup locations within Manhattan. The same regions, Midtown south of Central Park, and the southern tip of Manhattan show the brightest (and fuzziest) streets.

Taxi dropoffs map for NYC Metro Area

Zooming out to the broader metro area, the streets in Brooklyn and Queens are much sharper and brighter, indicating there are a lot more dropoffs in the outer boroughs than pickups, and indicating the GPS error in these regions tends to be lower, presumably due to fewer tall buildings. In fact, in some places it looks good enough to use as a street map, indicating a relatively even distribution of taxi dropoffs in Brooklyn and Queens. This is quite distinct from the pickups map, indicating that there are relatively few pickups in the outer boroughs, but a lot of dropoffs there. Many people take taxis from Manhattan to the outer boroughs, but a lot fewer take taxis from the outer boroughs into Manhattan.

Taxi Pickup and Dropoff Locations

The last two plots compare pickups and dropoffs on a pixel by pixel basis. Wherever pickups are higher than dropoffs, the pixel is shaded with the Viridis green and yellow colormap. Wherever dropoffs are higher than pickups, the pixel is shaded with the purple and orange Inferno colormap.
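A sketch of that per-pixel comparison, assuming the pickups and dropoffs have already been binned into two count grids of the same shape. The function name `compare`, the choice to shade by the size of the difference, and the handling of ties are my assumptions for illustration; the counts are made up.

```python
def compare(pickups, dropoffs):
    """Label each pixel by whichever count grid dominates it.

    Returns a grid of (label, margin) pairs, where label is "pickup",
    "dropoff", or None for a tie, and margin is the absolute difference
    that a colormap could then turn into an intensity.
    """
    labels = []
    for p_row, d_row in zip(pickups, dropoffs):
        row = []
        for p, d in zip(p_row, d_row):
            if p > d:
                row.append(("pickup", p - d))    # would use Viridis
            elif d > p:
                row.append(("dropoff", d - p))   # would use Inferno
            else:
                row.append((None, 0))            # tie or empty pixel
        labels.append(row)
    return labels


# Made-up 2x2 count grids.
pickups = [[5, 1], [0, 3]]
dropoffs = [[2, 4], [0, 3]]
labels = compare(pickups, dropoffs)
print(labels)
# [[('pickup', 3), ('dropoff', 3)], [(None, 0), (None, 0)]]
```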

Pickups (Yellow-Green) and Dropoffs (Orange) for Manhattan

In Manhattan, the Avenues (North-South streets) are lined with green, indicating more pickups than dropoffs. The cross streets (East-West) are orange, indicating more dropoffs. Practically, if I want to catch a taxi, it is probably easier to walk to the nearest avenue and pick one up there.

Pickups (Yellow-Green) and Dropoffs (Orange) for NYC Metro Area

Zooming out to the broader area, there are a few major streets in Brooklyn and Queens that are green, indicating significant numbers of pickups on those streets, while the other streets remain orange, showing dropoffs from the trips that started in Manhattan dominate. At JFK and La Guardia, the pickup and dropoff areas within the airport are highlighted, with portions shaded in green (pickups), and other portions shaded in orange (dropoffs).

What about GPS?

Plotting taxi pickup and dropoff locations using Datashader and Bokeh has shown that sometimes GPS coordinate data is quite inaccurate, indicating pickup and dropoff locations in the East or Hudson rivers. We see from the maps of pickups and dropoffs in Manhattan that GPS is strongly affected by tall buildings. Dropoffs in particular show a surprisingly even distribution across the outer boroughs, and every road and every bridge is highlighted. I find this surprising, as I would not expect many dropoffs to be occurring on the bridges, or in other locations where stopping and letting someone out of the taxi is discouraged, such as the Van Wyck Expressway, which leads to JFK. Yet such bridges and roads are highlighted, and that makes me wonder if this is a quirk of GPS. This is all speculation on my part, but what if GPS devices only update at a fixed interval, such as every two minutes, or whenever they can get a position lock? In that case, a taxi trip would end in a reasonable location, but the data would be recorded as the trip ending somewhere along the route. This would explain how large numbers of pickups and dropoffs occur in seemingly improbable locations.

Given the dataset goes back to 2009, and GPS receivers in smartphones have come a very long way since then, I am very curious if it is possible to see improvements in GPS accuracy in the taxi dataset. As a proxy for GPS error, I examined the number of pickup and dropoff locations that are in physically impossible locations, such as in the middle of the Hudson and East rivers. I then plotted the number of such impossible trips as a fraction of the total number of trips. Because total taxi trip volume has changed with the uptick in ride-hailing and ride-sharing services like Uber and Lyft, a rate rather than a raw count is necessary.
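The rate calculation itself is simple; a minimal sketch follows. It assumes each trip has already been tagged with its month and a boolean saying whether its location is impossible. That flag, the function name `impossible_rate`, and the sample numbers are all hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def impossible_rate(trips):
    """Return {month: fraction of trips flagged as impossible}.

    `trips` is an iterable of (month, is_impossible) pairs. Dividing by
    the monthly total keeps months with different trip volumes comparable.
    """
    total = defaultdict(int)
    bad = defaultdict(int)
    for month, is_impossible in trips:
        total[month] += 1
        bad[month] += bool(is_impossible)
    return {m: bad[m] / total[m] for m in sorted(total)}


# Made-up trips: 1 bad out of 200 in one month, 1 out of 1000 in another.
trips = (
    [("2009-01", True)] + [("2009-01", False)] * 199
    + [("2016-01", True)] + [("2016-01", False)] * 999
)
rates = impossible_rate(trips)
```

In this toy data the raw count of bad trips is identical in both months, yet the rate differs by a factor of five, which is the distinction the rate adjustment is meant to capture.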

Rate of Pickups and Dropoffs outside of the Taxi Zones, but within the local NYC area

Sure enough, the rate of pickups and dropoffs in impossible locations has fallen by a factor of 4 to 5 since 2009. It is unclear to me what could be causing an annual cycle in 2009–2012, where the error rate increases during summer months. Since 2011, the error rate has been falling substantially, possibly due to a changeover in taxi meters across the taxi fleet, or changes in how the GPS gets reported. The fact that the error rate is higher for dropoffs than for pickups suggests to me that there is probably some support for my theory that GPS devices only update at a fixed interval or whenever they can get a lock on position.

It is worth mentioning that this error rate of 0.5% to 0.1% is not necessarily representative of actual GPS errors in particular locations. For example, the fuzzy streets in Midtown south of Central Park indicate that position error is much higher there than 0.5%. Also, GPS position can be wrong in a way that does not put it over the water, but over an incorrect land location, which would not be detected by my crude proxy for GPS error.


I obtained, cleaned, and plotted the NYC taxi dataset. I produced some interesting visualizations of pickup and dropoff locations that show the majority of pickups and dropoffs occur within Manhattan and the JFK and La Guardia airports; however, there are a substantial number of taxi trips from Manhattan to Brooklyn and Queens. Far fewer trips start in the outer boroughs and end in Manhattan. I compared the pickups and dropoffs on a pixel by pixel basis, showing how the avenues in Manhattan have more taxi pickups than the cross streets, which have more dropoffs.

I also showed how the GPS locations have questionable accuracy. In Midtown, this is visible in the ‘fuzzy’ streets and in a fair number of points that show pickups in impossible locations like the Hudson or East rivers. There are also an awful lot of pickups and dropoffs in locations where it would be inconvenient to drop off a passenger, such as the Van Wyck Expressway, suggesting that the clear definition of such streets on the dropoffs map is a quirk of GPS devices updating infrequently. Analyzing the number of pickup and dropoff locations that happen to be in water shows a significant 4-5X decrease since 2009, which might be attributable to improvements in GPS technology in taxi meters. Nevertheless, the error in the GPS locations suggests they should be taken with a grain of salt.

I will be publishing more data analyses on this dataset over the coming weeks.

Data Availability

I make the code available in my NYC-transport GitHub repository. You can view the notebook used to make plots for this post on GitHub or NBViewer.

I have put the original Parquet format dataframe containing the taxi data and Uber data (not the subject of this post) on Amazon S3 in a requester pays bucket. If you start an EC2 instance in the US-East zone with a properly configured s3cmd, you can copy the files as follows. Be sure to be in the US-East zone, otherwise you may incur significant bandwidth charges.

s3cmd sync --requester-pays s3://transit-project/parquet/all_trips_spark.parquet .

The data is approximately 33GB in Snappy-compressed, columnar Parquet format. If reading with Dask, the PyArrow backend is required.

Source: If Taxi Trips were Fireflies: 1.3 Billion NYC Taxi Trips Plotted

If at first you don’t succeed, try, try again. Then quit. There’s no point in being a damn fool about it.

W.C. Fields

A step backwards, after making a wrong turn, is a step in the right direction.

Kurt Vonnegut