Results to Date

We built two browser-based applications that handle the first and second steps of the 2D-to-2D approach; both are relatively easy to use, but both still have room for improvement. We continue to experiment with various ways to view the posed photos in Google Earth. We have also begun proof-of-concept work on the second step of the 2D-to-3D approach.

2D-to-2D Results

Figure 5. Interface for step one of the 2D-to-2D approach.

In order to pose a free-standing photo in Google Earth in 3D, the PhotoOverlay KML element must be used. The PhotoOverlay element includes a Camera element, which specifies six of the parameters necessary for a pose: longitude, latitude, altitude, heading (pan), tilt, and roll. It also includes a ViewVolume element, which specifies the field of view the image occupies with respect to the current camera position, thereby supplying the remaining two pose parameters.
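
As an illustration of what our first application ultimately emits, the sketch below (Python, with an illustrative helper-function name) formats the six pose parameters as a KML Camera element; the values themselves come from the helper's interaction with the map, and the altitude mode shown is one reasonable choice rather than the only one.

def camera_kml(lon, lat, alt, heading, tilt, roll):
    # Format the six pose parameters as a KML Camera element.
    return (
        "<Camera>\n"
        f"  <longitude>{lon}</longitude>\n"
        f"  <latitude>{lat}</latitude>\n"
        f"  <altitude>{alt}</altitude>\n"
        f"  <heading>{heading}</heading>\n"
        f"  <tilt>{tilt}</tilt>\n"
        f"  <roll>{roll}</roll>\n"
        "  <altitudeMode>relativeToGround</altitudeMode>\n"
        "</Camera>\n"
    )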

The source photo itself can be specified by a simple URL, but much better quality can be had through the use of the ImagePyramid element, a hierarchical set of tiles derived from the original photo at multiple resolutions, allowing for quick access to the specific part of the image the user is currently viewing, at the appropriate resolution – no more and no less. The focus of our work on the 2D-to-2D process has been to make the generation of the PhotoOverlay element required to pose a specific photo in Google Earth as easy as possible.
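
For concreteness, the ImagePyramid bookkeeping referred to above amounts to roughly the following (Python; the level convention described in the comment is our reading of the KML scheme and should be treated as an assumption):

import math

def pyramid_levels(width, height, tile_size=256):
    # Level 0 fits in a single tile; each level doubles the resolution until
    # the full image size is reached (our reading of the ImagePyramid scheme).
    return int(math.ceil(math.log(max(width, height) / float(tile_size), 2))) + 1

def image_pyramid_kml(width, height, tile_size=256):
    return (
        "<ImagePyramid>\n"
        f"  <tileSize>{tile_size}</tileSize>\n"
        f"  <maxWidth>{width}</maxWidth>\n"
        f"  <maxHeight>{height}</maxHeight>\n"
        "  <gridOrigin>upperLeft</gridOrigin>\n"
        "</ImagePyramid>\n"
    )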

Since no server exists that can be queried for a 2D view of a 3D world model based on position and direction, we hacked our own. Our “faux server” runs Google Earth in the background on the same computer as our browser. The human helper uses the browser-based application to upload their photo, which appears as a thumbnail above a mashed-up Google map with a field-of-view indicator overlay. The helper zooms in to determine the location of the photo and then rotates the custom overlay to indicate the general direction. The location and direction data are formulated as a Camera KML element and sent to Google Earth via a network link. An initial height of five meters above the ground is assumed, but both the altitude and the tilt can be adjusted.
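
The plumbing behind this is straightforward; a minimal sketch is given below (Python), assuming Google Earth has loaded a NetworkLink that periodically re-fetches a local KML file. The file paths, refresh interval, and defaults are illustrative; the flyToView flag is what makes Google Earth fly its camera to the view contained in the fetched file, and camera_kml() is the helper sketched earlier.

# KML loaded into Google Earth once; it re-reads camera.kml every second and
# flies to the view that file contains.
NETWORK_LINK = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <NetworkLink>
    <name>Faux server camera</name>
    <flyToView>1</flyToView>
    <Link>
      <href>C:/fauxserver/camera.kml</href>
      <refreshMode>onInterval</refreshMode>
      <refreshInterval>1</refreshInterval>
    </Link>
  </NetworkLink>
</kml>
"""

def push_camera(lon, lat, heading, alt=5.0, tilt=90.0, roll=0.0):
    # Write the latest helper-chosen pose where the NetworkLink will find it.
    # The five-meter starting height and a horizon-level tilt match the
    # defaults described in the text; both can be adjusted by the helper.
    body = ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
            "<Document>\n"
            + camera_kml(lon, lat, alt, heading, tilt, roll) +
            "</Document>\n</kml>\n")
    with open("C:/fauxserver/camera.kml", "w") as f:
        f.write(body)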

Using the Google Earth COM API, a screen grab from the current camera location is returned and placed next to the photo. The helper can change position and direction and quickly see the resulting new view from Google Earth. Surprisingly, we’ve found this a more efficient way to navigate at ground level than Google Earth’s current built-in navigation. In addition, the Camera KML element used to generate the view can be copied directly into the PhotoOverlay that will ultimately pose the photo.
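
A hedged sketch of the screen-grab step follows (Python on Windows via pywin32). The ProgID and the SaveScreenShot call reflect our use of the Google Earth COM API and should be treated as assumptions if reproduced elsewhere; the file path is illustrative.

import win32com.client

def grab_current_view(path="C:/fauxserver/view.jpg", quality=90):
    # Attach to the running Google Earth instance and ask it to write a
    # screen grab of the current 3D view to disk.
    ge = win32com.client.Dispatch("GoogleEarth.ApplicationGE")
    ge.SaveScreenShot(path, quality)
    return path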

Figure 6. Interface for step two of the 2D-to-2D approach.

Once a screen shot is acquired, it can be loaded into our second browser-based application along with the photo (this is currently a manual process, but will eventually be automated). By adjusting position, scale, and transparency, the two images can be aligned as closely as possible. The application then automatically generates the ViewVolume KML element which specifies the field of view of the photo with respect to that of Google Earth’s camera (the latter is 60 degrees wide and n degrees high, where n is derived from the current height of the Google Earth window).
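
The field-of-view calculation itself is simple trigonometry. The sketch below (Python; our application actually does this in the browser) derives the ViewVolume angles from the pixel rectangle the aligned photo occupies in the Google Earth window, using the 60-degree horizontal field of view noted above; the coordinate convention (origin at the window's top-left corner) and the near value are assumptions of this sketch.

import math

def view_volume_kml(x0, y0, x1, y1, win_w, win_h, ge_hfov_deg=60.0):
    # Focal length of the Google Earth camera in pixels, from its 60-degree
    # horizontal field of view and the current window width.
    f = (win_w / 2.0) / math.tan(math.radians(ge_hfov_deg / 2.0))

    def ang(d):
        return math.degrees(math.atan(d / f))

    left = -ang(win_w / 2.0 - x0)   # photo's left edge, left of window center
    right = ang(x1 - win_w / 2.0)
    top = ang(win_h / 2.0 - y0)     # y grows downward from the top-left origin
    bottom = -ang(y1 - win_h / 2.0)
    return (
        "<ViewVolume>\n"
        f"  <leftFov>{left:.3f}</leftFov>\n"
        f"  <rightFov>{right:.3f}</rightFov>\n"
        f"  <bottomFov>{bottom:.3f}</bottomFov>\n"
        f"  <topFov>{top:.3f}</topFov>\n"
        "  <near>10</near>\n"
        "</ViewVolume>\n"
    )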

Finally, a publicly available Photoshop script [33] is used to generate ImagePyramid tiles from the original photo, and a text editor is used to put all the pieces together into a single PhotoOverlay KML file that can be opened in Google Earth (again, a manual process that will eventually be automated). At this point, the photo has been posed.
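
A sketch of that final assembly step, using the helper functions from the earlier sketches, is shown below (the tile directory layout and file names are assumptions about what the tiling script produces):

def photo_overlay_kml(camera, view_volume, image_pyramid, lon, lat, alt,
                      tile_href="tiles/$[level]/$[x]/$[y].jpg",
                      name="Posed photo"):
    # Combine the Camera, ViewVolume, and ImagePyramid fragments into one
    # PhotoOverlay. The $[level]/$[x]/$[y] placeholders in the Icon href are
    # KML's substitution scheme for ImagePyramid tiles.
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "<PhotoOverlay>\n"
        f"  <name>{name}</name>\n"
        + camera + view_volume + image_pyramid +
        f"  <Icon><href>{tile_href}</href></Icon>\n"
        f"  <Point><coordinates>{lon},{lat},{alt}</coordinates></Point>\n"
        "  <shape>rectangle</shape>\n"
        "</PhotoOverlay>\n"
        "</kml>\n"
    )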

Currently, Google Earth has three distinct classes of 3D models, each with its own degree of detail with which to match a pose. The most basic class is terrain. Next are “shoeboxes,” simple untextured models of architecture based on collected footprint and height data. Some urban areas are nothing more than large regions of shoeboxes, depending on the availability of the data; all of Japan, for example, has such data available. The third class of 3D models consists of geometrically dense, richly textured architecture created with Google SketchUp. Such areas currently exist primarily in major urban centers, but they are proliferating rapidly. Each of these classes has a different degree of tolerance and “fuzziness” with respect to posing photos.

Viewing Posed Photos Results

Once a PhotoOverlay has been loaded into Google Earth, the photo it poses can be seen in 3D space as a flat plane. A specialized viewing interface becomes available when the user double-clicks a PhotoOverlay: the camera flies to the nodal point of the pose, and the field of view is automatically adjusted so the posed photo appears centered in the view. The user can zoom in and pan around the photo at will before clicking “Exit Photo” to return to the normal viewing mode.

While this experience is adequate for viewing one posed photo at a time, design questions quickly arise when multiple photos are posed in close proximity. In Google Earth, posed photos fade out when the camera moves a significant distance away from them, but remain visible when close by. There are times when it would be appropriate to simultaneously view a group of photos posed close to one another, but other photo groups are best viewed one at a time. In the future, when the general public has the ability to pose any photo anywhere, a design strategy for sorting and displaying large numbers of posed photos will clearly be needed.

2D-to-3D Results

Figure 7. Proof of concept for the 2D-to-3D approach.

We have also implemented a technique that allows the user to pose a photograph by dragging features from a roughly aligned 3D model (points and line segments) to where they should appear on the photograph. In this case, the problem we are solving can be stated thus: we have a photograph of a real-world scene, and the view from which it was taken is what we want to find. We also have a virtual picture of the same scene, initially given by the user and (hopefully) close to the real one. Since the virtual view has been generated from a 3D model of the scene, we have access to the depth of each point in the virtual view. We also have a rough approximation of the camera's intrinsic parameters: focal length and center of projection (e.g., from the EXIF information saved in the picture when it was taken).
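
As an example of how such rough intrinsics can be obtained, the sketch below (Python with Pillow) reads the EXIF focal length and converts it to pixels under an assumed sensor width; the fallback values are arbitrary, and a coarse estimate is acceptable because the interactive optimization described next is meant to absorb it.

from PIL import Image  # Pillow

FOCAL_LENGTH_TAG = 37386  # EXIF "FocalLength", in millimeters

def rough_intrinsics(photo_path, sensor_width_mm=36.0):
    img = Image.open(photo_path)
    w, h = img.size
    exif = img._getexif() or {}
    focal_mm = float(exif.get(FOCAL_LENGTH_TAG, 35.0))  # fall back to 35 mm
    f_px = focal_mm * w / sensor_width_mm                # focal length in pixels
    cx, cy = w / 2.0, h / 2.0                            # assume a centered principal point
    return f_px, (cx, cy)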

If at least eight “good” correspondences between the real picture and the virtual picture are specified, the classical “eight-point” algorithm from computer vision can recover the geometry of this stereo pair, up to a single unknown transformation, using only the given matches [34]. Because depth values for the points are available, it is possible to remove this ambiguity and achieve what is technically called a metric reconstruction, i.e., the computed parameters (such as the translation and rotation of the camera) are an accurate representation of reality.
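
For reference, the underlying constraint can be written in standard notation (this is textbook material [34], not specific to our implementation):

\[ \mathbf{x}'^{\top} F \, \mathbf{x} = 0 , \]

where $\mathbf{x} \leftrightarrow \mathbf{x}'$ is a pair of corresponding image points in homogeneous coordinates and $F$ is the $3 \times 3$ fundamental matrix. Each correspondence yields one linear equation in the nine entries of $F$, which is itself defined only up to scale, so eight correspondences determine it linearly. With the (rough) intrinsics $K$, the essential matrix $E = K^{\top} F K$ decomposes into a rotation and a translation whose length is unknown; the depths available from the 3D model supply that missing scale.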

In practice, various errors can occur. Specifically, what happens when the number of corresponding points is less than the minimum required? In this case, the problem is under-constrained, and so not enough information is available to infer a solution.

Our approach relies on the extraordinary ability of a human user to detect matching features (points, lines) between two views (real-virtual) of the same scene. Our goal is to create a tool that can be used to interactively refine information provided by the human user. Using this input to artificially constrain the problem, we can do our best to automatically optimize the real pose of the camera and its projection parameters.

The artificial constraint is based on the assumption that the pose found by the user is close to the real one: we compensate for the lack of mathematical constraints by forcing our parameter search to take place within the neighborhood of the human guess. The optimization result gets closer to the correct one as the user provides additional correct matches, since each match adds information and improves the quality of the output. Moreover, thanks to the interactive nature of the approach, the result of the step-by-step correction can be seen in real time, allowing the user to choose new, appropriate matches to obtain a better pose.

For this reason we have implemented a fast, natural, and intuitive interface in which the user can see the picture overlaid on the model, change the point of view from which the model is observed, and add or modify correspondences between features in the real and virtual pictures simply by dragging points and lines and dropping them at the expected position. During each drag-and-drop operation, the optimization engine gathers all the available matches, launches the minimization process, and shows the results almost instantly. Because the result of each step-by-step correction is visible “online,” the user gets immediate feedback and can correct or choose appropriate matches to improve the pose of the picture.

By providing an intuitive, straightforward interface, we expect the user community will become increasingly skilled and fluent at performing the tasks required.

Any optimization problem is defined by the objective function to be minimized and by the constraints that any feasible solution must respect. In our case, the objective is a sum of squared errors computed over all the matches. Each term is the square of the distance, on the camera image plane, between the real feature and the reprojection of the virtual feature's position, where the latter depends on the camera parameters being sought.
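
Written out, the objective is (in our notation):

\[ E(\theta) \;=\; \sum_{i} \bigl\| \, \mathbf{x}_i - \pi(\theta;\, \mathbf{X}_i) \, \bigr\|^{2} , \]

where $\theta$ collects the camera parameters being sought, $\mathbf{x}_i$ is the feature marked by the user in the photograph, $\mathbf{X}_i$ is the corresponding 3D position known from the model (via the virtual view's depth), and $\pi$ reprojects $\mathbf{X}_i$ onto the image plane using $\theta$.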

As stated above, we also require that the solution lie close to the one provided by the user. To achieve this, an additional parameter penalizes solutions that stray far from the initial estimate, forcing the optimization algorithm to find a solution as close as possible to that estimate. In practice, we have found that the Levenberg-Marquardt optimization algorithm [35] works well for this kind of objective function.

Unlike existing techniques such as Facade [36], Canoma [37], and Photomodeler [38], we optimize only the camera parameters, which allows the optimization to converge rapidly. In Canoma, by contrast, the goal is to compute both the model parameters and the camera parameters at the same time: the user adds geometric primitives to “compose” the 3D model, along with correspondences between the model and the pictures. Canoma treats the solution as a function of time; at each step a constrained quadratic optimization problem, using the method of Lagrange multipliers, is solved to compute the smallest time variation of the current solution, which avoids sudden and unexpected changes in the solution and thus in the visual feedback to the user. The time variation is then used to compute the new solution via time integration.
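
A minimal sketch of this optimization step in Python with SciPy is given below. The function names, the parameterization, and the simple quadratic penalty that ties the solution to the initial estimate are ours for illustration, not the exact form used in our tool; method="lm" selects SciPy's Levenberg-Marquardt implementation.

import numpy as np
from scipy.optimize import least_squares

def refine_pose(theta0, matches, project, weight=1.0):
    # theta0: parameter vector implied by the user's rough alignment.
    # matches: list of (image_point, model_point) pairs from drag-and-drop.
    # project(theta, X): reprojection of model point X under parameters theta.
    def residuals(theta):
        reproj = [project(theta, X) - np.asarray(x, dtype=float)
                  for x, X in matches]
        prior = weight * (theta - theta0)   # keeps the solution near the guess
        return np.concatenate(reproj + [prior])
    return least_squares(residuals, theta0, method="lm").x

Re-running refine_pose after every drag-and-drop operation is what produces the “online” feedback loop described above.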

As a further add-on to our tool, we are working on automatic techniques that, combined with heuristics, attempt to compute vanishing points in the scene.

Due to perspective deformation, lines that are parallel in the real world project onto the camera's image plane as lines that converge at a point, typically quite distant from the center of projection. Such points are called vanishing points and are essentially the projections of real-world directions onto the image plane of the camera.

Given the vanishing points for three orthogonal directions in any given scene, there are straightforward computer vision techniques that can compute the orientation of the camera.
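
One standard formulation (again, textbook material rather than a description of our exact code): if $\mathbf{v}_x$, $\mathbf{v}_y$, $\mathbf{v}_z$ are the homogeneous image coordinates of the vanishing points of three mutually orthogonal scene directions, and $K$ is the intrinsic matrix, then

\[ \mathbf{r}_i = \pm \frac{K^{-1}\mathbf{v}_i}{\lVert K^{-1}\mathbf{v}_i \rVert}, \qquad R = \begin{bmatrix} \mathbf{r}_x & \mathbf{r}_y & \mathbf{r}_z \end{bmatrix}, \]

with the signs chosen to give a consistent, right-handed frame and $R$ re-orthogonalized (e.g., via SVD) to absorb noise in the detected vanishing points.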

In our case, computing the orientation in this way would reduce the number of parameters involved in the optimization process, thus reducing the number of matches required to pose the picture. This feature would therefore be useful, especially when posing pictures of buildings with predominantly straight lines.