Related Work

Work related to “Viewfinder” includes the legendary but almost completely undocumented event that took place in 1977 during the filming of test scenes for Close Encounters of the Third Kind [3]. Gary Demos and Mal McMillan, working with Vilmos Zsigmond, Steven Spielberg, and Douglas Trumbull, suspended several four-inch glowing balls at known locations on a set, then filmed with a freely moving camera. The idea was to use these balls as “witness points” to accurately determine the pose of the moving camera. Eleven parameters in all were captured for every frame to allow compositing with 3D computer models. After being tracked by hand, the balls were matted out of the scene. McMillan developed the method based on a 1974 paper by Ivan Sutherland on 3D data input [4]. Though the results were purported to be impressive, the contract for the effects ultimately went to Trumbull, who had earlier developed the first fully instrumented motion-control camera. McMillan wrote up his method for “Witness Point Tracking” and distributed it informally; however, this work was never published [5].
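
To make the witness-point idea concrete, the following sketch shows how a camera pose can be recovered from known 3D reference points and their tracked 2D image positions, using OpenCV’s solvePnP. It is only an illustration of the general technique, not a reconstruction of McMillan’s unpublished method; the marker coordinates, camera intrinsics, and poses below are entirely made up.

```python
import numpy as np
import cv2

# Hypothetical 3D positions of the suspended markers, in set coordinates (meters).
witness_points = np.array([
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [2.0, 1.5, 0.0],
    [0.0, 1.5, 0.5],
    [1.0, 0.75, 1.0],
    [0.5, 1.0, 2.0],
], dtype=np.float64)

# Assumed pinhole intrinsics (focal length and principal point, in pixels).
K = np.array([[1200.0, 0.0, 640.0],
              [0.0, 1200.0, 360.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)  # no lens distortion in this sketch

# Stand-in for the hand-tracked 2D marker positions in one frame: here we
# simply project the markers through a made-up "true" camera pose.
rvec_true = np.array([0.1, -0.4, 0.05])
tvec_true = np.array([-1.0, 0.5, 6.0])
image_points, _ = cv2.projectPoints(witness_points, rvec_true, tvec_true, K, dist)

# Recover the camera pose for this frame from the 2D-3D correspondences.
ok, rvec, tvec = cv2.solvePnP(witness_points, image_points, K, dist)
R, _ = cv2.Rodrigues(rvec)        # 3x3 rotation of the recovered pose
camera_position = -R.T @ tvec     # camera center in set coordinates
print("recovered camera position:", camera_position.ravel())
```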

Integrating photographs into 3D computer models, often called image-based modeling and rendering, has a relatively long and rich history, including the use of models based on actual locations. Notable early examples include models of the UC Berkeley campus [6] in 1997 and of the USC campus [7] in 2003. The desire for visually rich, 3D models of the entire planet intensified with the release of Google Earth in 2004 and Microsoft Live Search Maps’ 3D feature in 2006. Both are 3D world models, both were initially made from urban and terrain data, and both are increasingly adding photographic data from satellites and special airplane cameras.

But ultimately, the big race is at ground level. Making 3D world models that can be credibly viewed from a human-scaled viewpoint near the ground is a major challenge. Google and Microsoft have access to urban data in many regions that includes the footprint and height of nearly every existing building, from which simple geometric structures can be extruded, sometimes referred to as “shoeboxes” [8].
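
As a rough illustration of the “shoebox” approach, the sketch below extrudes a 2D building footprint to a known height, producing a simple box-like mesh. The footprint and height are hypothetical; real pipelines work from GIS data and handle map projections, courtyards, and roof shapes that are ignored here.

```python
def extrude_footprint(footprint, height):
    """Extrude a closed 2D footprint polygon (counter-clockwise (x, y)
    vertices) to the given height, returning a simple "shoebox" mesh."""
    n = len(footprint)
    # Bottom ring (z = 0) followed by top ring (z = height).
    vertices = [(x, y, 0.0) for x, y in footprint] + \
               [(x, y, height) for x, y in footprint]
    faces = []
    # One quad wall per footprint edge.
    for i in range(n):
        j = (i + 1) % n
        faces.append((i, j, n + j, n + i))
    # Flat roof as a single n-gon (the ground face is usually omitted).
    faces.append(tuple(range(n, 2 * n)))
    return vertices, faces

# Example: a rectangular footprint extruded to a 20 m "shoebox".
verts, faces = extrude_footprint([(0, 0), (30, 0), (30, 12), (0, 12)], 20.0)
print(len(verts), "vertices,", len(faces), "faces")
```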

The structures can be photographically textured (though lighting and shadows are problematic) and increasingly refined. Both texturing and refinement are fairly massive tasks requiring human intervention, particularly to achieve convincing ground views. Google has encouraged community model building through its SketchUp 3D Warehouse [9] for sharing models. Impressive as the software and warehouse may be, building credible models remains a challenging task for novices.

An alternative approach to ground-level navigation is to remain entirely within the two-dimensional realm of photographs. Thousands of photos can be taken in sequence from a moving vehicle, and the resulting collection can be navigated entirely via transitions from a given photo to its nearest neighbor. In 2007, Google launched Street View [10], a feature of Google Maps made up of panoramic photographs shot from camera cars driving up and down every street in a given city. Several competitors appeared almost immediately, including Everyscape [11] (announced the same day), MapJack [12], Earthmine [13], and City8 (in China) [14], as well as similar demos from Microsoft. All of these derive from MIT’s Aspen Moviemap project, produced almost 30 years earlier, and from the moviemaps that followed it, including Paris [15], Palenque [16], San Francisco [17], Karlsruhe [18], and Banff [19]. The problem with all moviemaps, from Aspen to Street View, is that they have fairly strict navigational limitations because no underlying 3D geometry actually exists: they are nothing more than fast-access “lookup” databases, pulling the most appropriate photo and using increasingly sophisticated transitions to get to it.
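
The “lookup database” character of a moviemap can be sketched in a few lines: each stored view is a photo tagged with the position and heading at which it was shot, and navigation reduces to retrieving the stored view closest to the requested position and heading. The records and cost weighting below are hypothetical; real systems add spatial indexing and link views into a graph.

```python
import math

# Hypothetical moviemap records: photo file plus capture position and heading.
views = [
    {"photo": "street_0001.jpg", "x": 0.0,  "y": 0.0, "heading": 0.0},
    {"photo": "street_0002.jpg", "x": 10.0, "y": 0.0, "heading": 0.0},
    {"photo": "street_0003.jpg", "x": 10.0, "y": 0.0, "heading": 90.0},
]

def nearest_view(x, y, heading, heading_weight=0.05):
    """Return the stored view closest in position, with a small penalty
    for facing a different direction."""
    def cost(v):
        d = math.hypot(v["x"] - x, v["y"] - y)              # positional distance
        dh = abs((v["heading"] - heading + 180.0) % 360.0 - 180.0)  # heading gap
        return d + heading_weight * dh
    return min(views, key=cost)

print(nearest_view(9.0, 1.0, 80.0)["photo"])  # -> street_0003.jpg
```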

Such transitions from one photo to another also have a rich history (e.g., View Interpolation [20] and RealityFlyThrough [21]), but the in-between images will inevitably have varying degrees of credibility, ultimately depending on the content and properties of each particular image. Meanwhile, more sophisticated camera cars are emerging, some of which now include laser range-finding devices (LIDAR) that can acquire three-dimensional depth information. The resulting data takes the form of 3D “point clouds.” However, these are not actual 3D models, and converting both 2D photos and 3D point clouds into useful 3D models made up of meaningful geometry remains a central challenge.

In 2006, Microsoft and the University of Washington previewed PhotoSynth [22], a striking example of related ground-level modeling. Arbitrary photographs taken in and around a common location are automatically analyzed to produce a 3D point cloud, and the photographs are then spatialized within this point-cloud space. The system leverages modern computer vision techniques, such as the Scale Invariant Feature Transform (SIFT) to identify sets of points that match across photographs, and the Random Sample Consensus (RANSAC) algorithm to automatically separate reliable matched feature points from “outliers.” The camera parameters for each picture are then computed through an optimization process, which also determines which photos can be registered and which cannot. An interactive interface allows the user to see the original photographs superimposed onto the point cloud and to transition the view from one image to another.
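
The following sketch illustrates the two vision steps named above, SIFT matching followed by RANSAC filtering, for a single pair of photographs, using OpenCV. It is not PhotoSynth’s implementation: a full pipeline would match many pairs and then run a bundle-adjustment optimization to recover every camera’s parameters. The image paths are placeholders.

```python
import cv2
import numpy as np

img1 = cv2.imread("photo_a.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder paths
img2 = cv2.imread("photo_b.jpg", cv2.IMREAD_GRAYSCALE)

# Detect SIFT keypoints and descriptors in both photos.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Putative matches via nearest-neighbor descriptor matching with a ratio test.
matcher = cv2.BFMatcher()
good = []
for pair in matcher.knnMatch(des1, des2, k=2):
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# RANSAC keeps only matches consistent with a single epipolar geometry,
# discarding "outliers" as described above.
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
print(int(inlier_mask.sum()), "inliers out of", len(good), "putative matches")
```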

PhotoSynth is undeniably impressive, both in the quality of its vision and design and in its underlying technology. However, it has several fundamental limitations. It generates a 3D point cloud rather than an actual 3D model, and it relies on transitions between photographs whose in-between images do not always look credible. More significantly, its automation cannot handle all photos: in its initial tests, only around one quarter [23] of the photos retrieved from Flickr searches for particular locations (e.g., Notre Dame cathedral) could be brought successfully into correspondence in a scene model using this automated technique. The successful photos were those with the most similar lighting conditions and a minimum of transient objects, such as people. More expressive and personal photos, close-ups, abstractions, etc., were ruled out.

From some perspectives, this may seem positive. In 2007, Google acquired Panoramio [24], a Flickr-like photo sharing Web site in which uploaded photos are geo-tagged onto a map of the world, and tagged with text as well. Shortly after the acquisition, Google added Panoramio as a “layer” in Google Earth, asking its community of users to submit their photos for public inclusion. Panoramio’s acceptance policy [25] states that it will only select images of exterior places and will not select photos with people posing, cars, pets, flowers, close-ups, or events. Thus, what PhotoSynth does because of technological limitations, Panoramio does by editorial fiat.

Between July 2007 and February 2008, Google added over 3 million [26] Panoramio photos to Google Earth (presumably selected by hand). These photos appear unposed, as 2D pop-ups. Recently a small number of posed photos have appeared in Google Earth. In August 2007, Google added Gigapxl [27] photos as a layer in Google Earth and shortly afterward added GigaPan [28] images as another layer, together totaling perhaps a few hundred images. Both Gigapxl (a trademarked name) and GigaPan images are made up of on the order of a billion pixels (compared with today’s digital cameras, which typically capture between 5 and 10 megapixels). Gigapxl photos come from one giant film camera, while GigaPan images are shot with a conventional digital camera on a small robotic platform and then stitched together. Over the past several months, both Gigapxl and GigaPan images have been posed (presumably by hand), in the sense that when they are viewed, they appear perfectly registered with the Google Earth landscape. Version 2.2 of Google’s KML file format added a feature called PhotoOverlay [29], which can be used to add photos spatially in Google Earth. A small but lively community has written tutorials [30], produced panoramic spheres [31], and made cockpit overlays [32] to date.
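
For reference, the sketch below writes a minimal KML 2.2 PhotoOverlay: a photo URL together with a Camera pose and a ViewVolume field of view, which Google Earth uses to display the image registered against the landscape. The coordinates, angles, and URL are invented for illustration.

```python
# Write a minimal KML 2.2 PhotoOverlay file; all values are illustrative.
photo_overlay = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <PhotoOverlay>
    <name>Example posed photo</name>
    <Camera>
      <longitude>-122.4194</longitude>
      <latitude>37.7749</latitude>
      <altitude>50</altitude>
      <heading>45</heading>
      <tilt>90</tilt>
      <roll>0</roll>
    </Camera>
    <Icon>
      <href>http://example.com/photos/view.jpg</href>
    </Icon>
    <ViewVolume>
      <leftFov>-25</leftFov>
      <rightFov>25</rightFov>
      <bottomFov>-16</bottomFov>
      <topFov>16</topFov>
      <near>100</near>
    </ViewVolume>
    <Point>
      <coordinates>-122.4194,37.7749,50</coordinates>
    </Point>
  </PhotoOverlay>
</kml>
"""

with open("posed_photo.kml", "w") as f:
    f.write(photo_overlay)
```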

In summary:

  • Google has posed several hundred photos in Google Earth and has 4.5 million more geo-located photos to go;
  • Microsoft PhotoSynth has an impressive way of automatically combining sets of photos from the same area, but it produces no underlying 3D model and cannot handle most photos taken under arbitrary conditions;
  • Yahoo Flickr taps the power of the community to create and organize an enormous free-for-all database, but it has no 3D world models.