Geospatial Photos — Can Every Pixel Have Real World Spatial Coordinates?
Depth maps are a foundational element for a wide variety of technologies ranging from augmented reality to autonomy to photography. In all these instances depth maps convey “information relating to the distance of the surfaces of scene objects from a viewpoint”. Take this example from Wikipedia of the depth map of a cube, where the darker the area, the closer it is to the viewpoint of the camera:
Traditionally depth maps required multiple images to create a stereo view. These stereo images could be opportunistic (a single camera taking pictures of an object from multiple perspectives) or dedicated (two fixed cameras with different perspectives taking pictures of the same object). More recently, neural nets have been developed to estimate depth maps from a single monocular camera and image. The cool bit is that depth maps are increasingly turning up as a native capability for cameras on mobile phones. We dove into this in more detail in a previous post.
Remote Sensing and Orthorectification
In the geospatial world we have our own version of depth mapping — creating digital elevation models (DEM) from multiple satellite or aerial images using stereophotogrammetry. Interestingly we go a step further and use our DEM data to better locate new aerial and satellite images using orthorectification. This raises the question: can we orthorectify mobile phone camera images/video?
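The stereo principle behind this can be sketched in a few lines: with two cameras a known baseline apart, the depth of a point falls out of its pixel disparity by similar triangles. This is a toy sketch with made-up numbers, not any particular photogrammetry pipeline:

```python
# Minimal sketch of stereo depth from disparity. Two cameras a baseline
# B apart see the same point shifted by a disparity d (in pixels);
# depth follows from similar triangles: Z = f * B / d.
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth (m) of a point seen with the given pixel disparity."""
    return focal_px * baseline_m / disparity_px

# e.g. a 1000 px focal length, 0.5 m baseline and 20 px disparity:
z = depth_from_disparity(1000.0, 0.5, 20.0)  # 25.0 m
```

Repeating this for every matched pixel pair across the two views is, in essence, how stereophotogrammetry builds an elevation surface.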
Since the Pixel8 team has been spending a good amount of time creating detailed 3D maps (a.k.a. DEMs) of Boulder it seemed like a fun challenge to see if we could orthorectify individual mobile phone photos. Before we start it is helpful to do a quick refresher on the orthorectification process. Photos collected by satellite and aerial platforms have a variety of distortions and anomalies inherent in their collection. These issues are largely generated by the fact that the subject of the photo has three dimensions while the photo has only two. The more variability in the third dimension (Z axis — a.k.a. elevation), the more error in the photo. To fix this, remote sensing scientists use elevation data (DEMs) to correct for the errors not captured by the sensor’s two dimensional data. DEMs come in two flavors: 1) Digital Terrain Models (DTM), which provide a bare surface of the earth with objects like vegetation and buildings removed, and 2) Digital Surface Models (DSM), which provide a holistic surface including objects like vegetation and buildings.
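The core geometric correction can be sketched numerically. In a vertical photo, an object of height h at radial distance r from the image center is displaced outward by roughly d = r·h/H, where H is the flying height; orthorectification removes that displacement. This is a simplified sketch of the classic relief displacement relation, ignoring lens and attitude distortions:

```python
def relief_displacement(radial_dist: float, height: float, flying_height: float) -> float:
    """Radial image displacement caused by terrain height above the datum.

    Classic photogrammetry relation: d = r * h / H.
    radial_dist and the result share units (e.g. mm on the image plane);
    height and flying_height share units (e.g. meters).
    """
    return radial_dist * height / flying_height

def orthorectify_radial(radial_dist: float, height: float, flying_height: float) -> float:
    """Corrected radial position after removing relief displacement."""
    return radial_dist - relief_displacement(radial_dist, height, flying_height)

# A 100 m hill imaged 80 mm from the center of a photo taken at 2000 m
# altitude is shifted outward by 4 mm; the correction pulls it back.
d = relief_displacement(80.0, 100.0, 2000.0)       # 4.0 mm
r_true = orthorectify_radial(80.0, 100.0, 2000.0)  # 76.0 mm
```

Feeding per-pixel heights from a DTM or DSM into this correction is the difference between the two orthorectification flavors described above.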
Not surprisingly, you can use the two different flavors of DEM to create two different types of orthorectification. The most common uses DTMs, as seen in the example below from Satpalda.
The second method utilizes DSMs and results in what is called a “true orthophoto”. In urban areas we often get occlusions from large buildings. Think of each pixel as a ray traced from the ground to the sensor (satellite, airplane, drone). If a building pixel blocks a potential ground pixel from being collected, we have an occlusion. A nice visual of an occlusion from Morton Nielsen’s work is below:
If we have a 3D model of the buildings and images with multiple perspectives it is possible to generate an image that includes the occluded areas. Again Nielsen’s excellent thesis goes into this in detail, but we can see the general flow: 1) provide a detailed DSM model to enable the determination of occlusions, 2) detect the obscured areas, 3) perform a euclidean distance transformation between the obscured areas and the pixel, and 4) repeat the process across images of the same location with multiple perspectives.
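The occlusion-detection step can be illustrated with a simple line-of-sight test along a single DSM profile: a cell is occluded if any surface between it and the sensor rises above the sight line. This is a toy illustration with hypothetical numbers, not Nielsen's actual algorithm:

```python
def occluded_cells(heights, sensor_x, sensor_z, cell_size=1.0):
    """Return indices of profile cells whose view of the sensor is blocked.

    heights  : list of surface elevations along a profile (a 1D DSM slice)
    sensor_x : horizontal sensor position, in the same units as cell_size
    sensor_z : sensor altitude
    """
    occluded = []
    for i, h in enumerate(heights):
        x = i * cell_size
        blocked = False
        for j, hj in enumerate(heights):
            xj = j * cell_size
            # Only surfaces strictly between the cell and the sensor matter.
            if min(x, sensor_x) < xj < max(x, sensor_x):
                # Height of the sight line at xj (linear interpolation).
                t = (xj - x) / (sensor_x - x)
                line_z = h + t * (sensor_z - h)
                if hj > line_z:
                    blocked = True
                    break
        if blocked:
            occluded.append(i)
    return occluded

# A tall "building" at cell 5 shadows the cells behind it from a sensor
# positioned over cell 0 at 100 units altitude:
profile = [0, 0, 0, 0, 0, 50, 0, 0, 0, 0]
print(occluded_cells(profile, sensor_x=0.0, sensor_z=100.0))  # [6, 7, 8, 9]
```

In a real true-orthophoto pipeline this test runs in 2D over the full DSM for every sensor position, and the detected gaps are then filled from images with other perspectives.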
Not only does this provide an aesthetically pleasing nadir perspective of the building, it also improves the image’s planimetric (spatial) accuracy.
Nested in the nice improvement to aesthetics and accuracy is the ability to merge multiple small orthophotos into a mosaic of a large geographic area. We take this for granted today in our global basemaps built from aerial and satellite imagery, but it was a fundamental unlock for large scale geographic mapping from remotely sensed data.
Terrestrial 3D Mapping
In the emerging work around AR and autonomy we see many of the same challenges. On one hand we have amazing advancements, like “on-the-fly” dynamic depth mapping to handle real-time occlusion for AR. On the other hand we struggle to persist these depth maps and stitch them into a persistent 3D map of the globe. Arguably we lack an orthorectification process to make a 3D terrestrial mosaic of the globe. This is even more challenging in AR/autonomy in that our mosaic also needs to update dynamically.
Satellite and aerial mosaics are updated infrequently. In the best case scenario — where you have persistent satellite collections — basemaps are current to within a year and updated piecemeal each quarter. This brings us back to the question we began with: is it possible to add depth with spatial coordinates to any camera image? If so, we are one step closer to creating both the persistence and mosaicking ability that would be ideal for AR/autonomy.
Let’s Give it a Go
We already have an open 3D map of Downtown Boulder we generated in one of our previous experiments. Recently we took that data and co-registered it with Nearmap aerial data as well as City of Boulder LiDAR.
If we follow the orthorectification metaphor, the combination of the three gives us a super accurate DSM to rectify images against. Next we can introduce a new photo to the equation and see if we can determine geospatially accurate depth and location from it.
Using just the photo above we can calculate both depth and keypoint descriptors from the image. In this case depth is calculated using both the data in the photos as well as our 3D map (DSM), while the keypoint descriptors come from just the image. This combination gives us a relative distance of each pixel to the camera as well as keypoints that can be used to match features from other images. The result of the two looks like this:
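The keypoint idea can be illustrated with a deliberately simple sketch: describe each keypoint by the intensity patch around it, then match descriptors between two images by the smallest sum of squared differences. Real pipelines use far more robust descriptors (SIFT-style or learned features); everything here is illustrative, not our actual matcher:

```python
def patch_descriptor(image, x, y, radius=1):
    """Flattened pixel patch around (x, y) as a simple descriptor.

    image is a 2D list of intensities indexed as image[row][col];
    the patch must lie fully inside the image.
    """
    return [image[y + dy][x + dx]
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)]

def match_descriptors(descs_a, descs_b):
    """For each descriptor in A, the index of the closest descriptor in B."""
    def ssd(d1, d2):
        # Sum of squared differences: smaller means more similar.
        return sum((a - b) ** 2 for a, b in zip(d1, d2))
    return [min(range(len(descs_b)), key=lambda j: ssd(d, descs_b[j]))
            for d in descs_a]
```

Matched keypoints like these are what let a new photo be tied to features already present in a 3D map built from other images.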
Now comes the rectification process of projecting our 2D photo onto our 3D model (DSM), and then extracting real world spatial coordinates from it. Similar to the traditional orthorectification process we take our 3D DSM and use it as an elevation data set to begin the rectification process for our 2D photo. In this case we take our keypoint descriptors from our 2D photo and rectify them to the DSM.
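One way to picture this projection step is ray marching: walk along the camera ray through a pixel until it meets the DSM surface, and take that intersection as the pixel's ground point. This is a simplified sketch in DSM grid coordinates, with made-up names and numbers, not our production code:

```python
def intersect_ray_with_dsm(origin, direction, dsm, step=0.1, max_dist=100.0):
    """March along a ray from the camera until it dips below the surface.

    origin    : (x, y, z) camera position in DSM grid coordinates
    direction : (dx, dy, dz) unit ray through the pixel
    dsm       : 2D list of elevations indexed as dsm[row][col] = dsm[y][x]
    Returns the approximate (x, y, z) surface point, or None if no hit
    within max_dist (accuracy is limited by the step size).
    """
    x, y, z = origin
    dx, dy, dz = direction
    t = 0.0
    while t < max_dist:
        px, py, pz = x + t * dx, y + t * dy, z + t * dz
        row, col = int(py), int(px)
        if 0 <= row < len(dsm) and 0 <= col < len(dsm[0]):
            if pz <= dsm[row][col]:
                return (px, py, pz)  # the ray has reached the surface
        t += step
    return None

# A camera 10 units above a flat DSM, looking straight down, hits the
# surface directly beneath it:
flat_dsm = [[0.0] * 5 for _ in range(5)]
hit = intersect_ray_with_dsm((2.0, 2.0, 10.0), (0.0, 0.0, -1.0), flat_dsm)
```

Doing this for every matched keypoint (rather than every pixel) is what anchors the photo to the DSM; the remaining pixels then inherit coordinates from the fitted camera pose.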
Once we have the single photo rectified to our DSM it is aligned to a real world spatial coordinate system. This means that all the pixels in our photo now have a latitude, longitude and altitude (meters above mean sea level derived from NAVD88). To illustrate this we’ve taken our photo and randomly sampled pixels from it and plotted their spatial coordinates.
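Turning a rectified point's local east/north/up offset from a known anchor into latitude, longitude and altitude can be sketched with a small-offset spherical approximation. The anchor coordinate below is a hypothetical point near downtown Boulder, and a real pipeline would use a proper geodetic library and datum transformations rather than this shortcut:

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; fine for small offsets

def offset_to_lat_lon_alt(anchor_lat, anchor_lon, anchor_alt,
                          east_m, north_m, up_m):
    """Convert a local ENU offset (meters) from an anchor into lat/lon/alt.

    Uses an equirectangular approximation: one degree of latitude is a
    fixed arc length, and longitude shrinks by cos(latitude).
    """
    lat = anchor_lat + math.degrees(north_m / EARTH_RADIUS_M)
    lon = anchor_lon + math.degrees(
        east_m / (EARTH_RADIUS_M * math.cos(math.radians(anchor_lat))))
    return (lat, lon, anchor_alt + up_m)

# A pixel 50 m east, 100 m north and 2 m above a hypothetical anchor
# near downtown Boulder at 1624 m elevation:
print(offset_to_lat_lon_alt(40.0178, -105.2797, 1624.0, 50.0, 100.0, 2.0))
```

Each sampled pixel run through a conversion like this yields the portable latitude/longitude/altitude triple described above.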
Each one of these coordinates in our sampling will give us a portable set of coordinates that can be rendered in any mapping platform. As a token example let’s take the bottom set of coordinates and plug them into Google Maps.
This is a validation of just one pixel. The cool aspect is that every pixel in the photo has the same ability to be plotted in a spatial coordinate system. Going back to our goal of leveraging commodity data to update industrial baselines, we are that much closer to being able to leverage every photo with EXIF data as a 3D map update. It is really exciting to see a convergence of the old and the new creating new opportunities for the geo-community.