The Case for Affordable and Open 3D Mapping to Accelerate Augmented Reality and Autonomy
Introduction
The goal of this blog post is to demonstrate how a detailed 3D point cloud can be built and accurately georectified to a real-world spatial coordinate system for a large geographic area using commodity video, with a particular focus on the cost of production at scale. Before we get to the meat of the result we are going to take a detour to highlight why mapping a large geographic area affordably is an important result. It is a circuitous trip, but one we think is worth taking.
Maps for Humans vs. Maps for Machines
The advent of Google Earth inspired a significant technological push and investment to map the world in 3D. These efforts encompassed Google's work plus Microsoft Virtual Earth, Apple Look Around, Vricon and others. Using a blend of expensive platforms (satellites, aerial cameras and dedicated sensor-laden cars), these companies collected imagery at massive scale. The aggregate imagery was then processed to optimize photo-realism, allowing humans to explore the Earth virtually. We can think of these as 3D maps for humans.
More recently there has been a second push to map the world in 3D for machines. Augmented reality (AR) and autonomy applications also need 3D maps to enable their use cases. On the AR side, companies like Unity, Ubiquity6, Scape, Fantasmo and 6D.ai provide solutions. For autonomy, mapping providers like CivilMaps, Carmera, DeepMap, Mapbox, Here and TomTom, among many others, address market needs.
A key difference between the requirements of maps for humans and maps for machines is that the latter need to be updated at a much higher frequency. At my current location, when I check the currency of Google Street View, I find:
While an out-of-date street view is an inconvenience for a human's virtual tour, it could be crippling for an autonomy or AR application. So, why don't applications like Google Street View update more frequently? In short, it is too expensive. Building and operating a unique fleet of specialized vehicles is a big investment. If we want to enable frequently updated maps for machines, we need a new paradigm. Ironically, we can learn a lot from the strategies used to leverage satellite and aerial imagery to create maps for humans.
How to Scale Economically
For almost two decades we've been able to fly anywhere in the world in Google Earth/Maps to see imagery of the planet. One feature that is often overlooked is the small line at the bottom of the map listing all the data sources you are actively viewing.
For the sharp-eyed reader, this sample list includes Google imagery, Clear Creek County Government, Landsat, Copernicus, Maxar Technologies, USGS and USDA Farm Service Agency. That's seven different imagery sources for this view of Louisville, CO. It would have been technically simpler to use one source of imagery, but you'd sacrifice some combination of coverage, affordability or resolution. This hybridization of imagery sources, and the ability to seamlessly fuse them together, is what makes virtual globes work both economically and as a user experience.
The viability of any platform is directly connected to its affordability to scale. Before Google Maps, creating applications that used maps was a very niche industry that required specialized expertise and a robust budget. By driving down the cost of creating maps at massive scale Google opened up mapping as a platform that anyone could integrate into their applications. OpenStreetMap pushed the affordability horizon even further creating a free source of data for vector features like roads, points of interest and buildings across the globe.
The key to this feast of affordability was the interoperability of data sources, both vector and raster. We often take spatial coordinate systems for granted (e.g. https://gdalbarn.com/), but an accurate system of latitude, longitude and altitude is what allows global map making to happen. In the context of this argument it is also what makes global maps for humans affordable. Each data source could have its own Cartesian coordinate system, but then we would not be able to blend disparate data sources together. IMHO this blending of sources is the heart of modern map making.
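As a concrete, if minimal, illustration of what a shared coordinate system buys you, here is a sketch using pyproj to move between a global geographic CRS and a projected UTM zone. The coordinates are just an example point in Boulder, CO; this is an illustration, not part of our pipeline.

```python
# A minimal sketch of why a shared spatial reference matters: pyproj (assumed
# installed) converts between coordinate reference systems so disparate sources
# can be fused in one frame.
from pyproj import Transformer

# WGS84 geographic coordinates (EPSG:4326) to UTM zone 13N (EPSG:32613),
# the projected zone covering Boulder, CO.
to_utm = Transformer.from_crs("EPSG:4326", "EPSG:32613", always_xy=True)

lon, lat = -105.2705, 40.0150  # an example point in Boulder, CO
easting, northing = to_utm.transform(lon, lat)
print(f"UTM 13N: {easting:.1f} E, {northing:.1f} N")
```

Once every source can be expressed in a frame like this, blending them is a data management problem rather than a geometry problem.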
Misunderstanding Absolute Accuracy
So, why are we waxing poetic about the virtues of real-world spatial coordinate systems (i.e. latitude, longitude, altitude) and absolute accuracy? There is a strong belief in the maps for machines (computer vision) community that you only need relative accuracy and Cartesian coordinate systems for machine navigation. In fact, this is a true statement. I think many computer vision specialists came into 3D mapping/navigation and saw this archaic adherence to absolute accuracy as a distraction, an unnecessary layer of rigor that did not improve engineering performance.
Our argument isn't that absolute accuracy is important for the navigation of machines, but that it is essential for creating affordability at scale. Relying on relative accuracy alone to build maps for machines is too expensive to scale. This economics argument has two facets: 1) the ability to efficiently manage and conflate numerous data providers into a real-world coordinate system and 2) more efficient computation of SfM models and bundle adjustment. We'll cover topic #2 later in the post and focus on topic #1 now.
Using a single source/sensor and relying on relative accuracy does make life far simpler. That said, there is a reason Google didn't power Maps with only one imagery source. Ignoring free and open data sources like Landsat and Sentinel is an expensive path, and different sensors/data sources have different beneficial attributes. Satellites can image anywhere but resolution is mediocre; aerial offers better resolution but you can't fly everywhere; Street View provides a unique perspective but covering and updating large geographies is challenging. The same decision, using a blend of data sources, was made by every global-scale commercial map provider (Microsoft, Apple, ESRI, Mapbox etc.).
We believe there is a lesson to be learned from these past experiences when it comes to building maps for machines. There is a choice between building a global map for machines from a single data source and utilizing a variety of data sources to create a hybrid global map. There are two key questions to ask in making that decision: How quickly can you map the globe? How often can you update your globe? These questions boil down to two key attributes, speed and cost. Autonomy map providers charge upwards of $5,000 per kilometer to drive their sensor-laden vehicles through a street network. If we take the World Bank's Global Roads Inventory Database estimate, there are 21 million kilometers of roads on Earth. That would be $105 billion to HD map the globe. The faster you want to map the globe with this approach, the more vehicles you need and the greater the expense. The same goes for updating the map using the same fleet. The cost of growing a fleet of specialty sensor-laden cars to meet the frequency of update demanded by the market is untenable for many companies already and will only become more so. This example is somewhat hyperbolic, but it illustrates the extremes of the single-source strategy.
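For the skeptical reader, the back-of-the-envelope arithmetic behind that figure is simple; the refresh cadence at the end is our own assumption added for illustration.

```python
# Back-of-the-envelope arithmetic behind the $105 billion figure; the quarterly
# refresh rate at the end is an assumption for illustration, not a market quote.
cost_per_km = 5_000          # USD per km, the quoted HD-mapping rate
global_road_km = 21_000_000  # World Bank Global Roads Inventory Database estimate

one_pass = cost_per_km * global_road_km
print(f"${one_pass / 1e9:.0f}B for a single global pass")  # -> $105B

refreshes_per_year = 4       # assumed update cadence
print(f"${one_pass * refreshes_per_year / 1e9:.0f}B per year at that cadence")
```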
Ultimately, maps for machines are a platform that a new world of AR and autonomy applications can be built upon. For this reason cost is a critical component. The lower you can drive the cost of the core infrastructure that enables autonomy and AR, the larger the number of applications that can afford to build a business on it. As long as costs stay high, applications will be niche/custom products for the small segment of the market that can afford them. Mapping and GIS in the 1990s is a convenient, if not perfect, metaphor. Maps for machines needs to be a volume business in order for AR and autonomy to succeed.
Mapping a Large Geography Economically
Let's see if we can provide an alternative to expensive single-sensor mapping. For our 3D mapping work there are two key dimensions to affordability: 1) collection cost and 2) compute cost. Reducing collection cost is the most obvious. Sensor-laden vehicles are expensive. The first Google Street View cars in 2007 required a $45,000 camera and $90,000 for the mount and onboard processing unit. Then it was another $125 to $700 per mile of video footage in operational costs. More modern HD mapping rigs for autonomy use cases (e.g. AutoNavi) can cost upwards of $1,000,000.
If we can replace these expensive custom rigs with commodity cameras there is an opportunity to drive cost down in a significant way. To date we’ve been testing both action cameras (e.g. GoPro) for video and mobile phones for photo collection. Compared to the various HD mapping and Street View cars our rigs are quite simple.
You can get a GoPro Fusion for $200 and a car suction mount for an additional $40. The beauty of this price point is that any car, bike or person can become a 3D data collector. Also, just about any camera can become a potential data collector, assuming the camera intrinsics are available or discoverable. Instead of a single sensor we get closer to any sensor.
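To make "intrinsics" concrete, here is a minimal pinhole-projection sketch. The focal length and principal point are hypothetical stand-ins rather than a real GoPro Fusion calibration, and a fisheye lens would additionally need a distortion model.

```python
import numpy as np

# A minimal pinhole-camera sketch. The focal length and principal point are
# hypothetical stand-ins, not a real GoPro Fusion calibration; a fisheye lens
# would also need a distortion model, omitted here.
fx = fy = 1450.0            # focal length in pixels (assumed)
cx, cy = 1920.0, 1440.0     # principal point for a 3840x2880 frame (assumed)

K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Project a 3D point in the camera frame (meters) into pixel coordinates.
X = np.array([2.0, -1.0, 10.0])
u, v, w = K @ X
print(f"pixel: ({u / w:.1f}, {v / w:.1f})")
```

If a camera's intrinsics like these can be recovered, its imagery can feed the same SfM pipeline as a purpose-built rig.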
This approach culminates in a strategy: use inexpensive commodity video to systematically seed a 3D world, then use the ubiquitous flow of crowdsourced geotagged photos to grow a live, updating map:
This seeding and updating strategy addresses the two core problems we've highlighted: 1) how quickly can you map the globe and 2) how often can you update it. When anyone with a camera can contribute videos and photos, the problem turns from a vehicle/hardware and logistics problem into a data organization problem. From a speed and cost perspective, the latter is inherently easier to solve.
While this strategy drives down the cost of collecting and updating the data needed to power a 3D world, it doesn't address the second big cost driver: compute. Turning photos into 3D point clouds (SfM), bundle adjusting and then georectifying those clouds to a spatial coordinate system is a massively heavy compute task. Scape has an excellent blog series with a very accessible explanation of the challenges of running these massive-scale solvers. Scape's solution was to build an incredibly efficient solver for the bundle adjustment problem called Apophis, which improved upon Google's Ceres solver. Even with the most efficient solvers, the bundle adjustment of large chunks of geography is computationally intensive and expensive. Geo-mythology holds that the early Google bundle adjustment work ate up so much compute it initially got kicked out of the Google data centers.
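To give a feel for why this blows up, here is a toy least-squares sketch of the problem's shape. It is nothing like Ceres or Apophis: for brevity the cameras are known and fixed, so it only refines point structure, but it shows how the parameter vector grows with every point (and, in a real adjustment, every camera) while the residual vector grows with every observation.

```python
import numpy as np
from scipy.optimize import least_squares

# Toy sketch of the bundle adjustment problem's shape, not a real solver. For
# brevity the cameras are known and fixed, so this is a structure-only
# refinement; a full adjustment would also optimize 6 pose parameters per camera.
rng = np.random.default_rng(0)
n_pts = 50
pts_true = rng.uniform([-5, -5, 8], [5, 5, 15], size=(n_pts, 3))
f = 1000.0                              # toy focal length, identity rotations
cam_x = np.array([0.0, 1.0, 2.0])       # three cameras offset along x (meters)

def project(pts, cx):
    p = pts - np.array([cx, 0.0, 0.0])  # world -> camera (translation only)
    return f * p[:, :2] / p[:, 2:3]     # perspective division

# One observation (2 residuals) per point per camera; noise mimics pixel error.
obs = np.concatenate([project(pts_true, cx) for cx in cam_x])
obs += rng.normal(scale=0.5, size=obs.shape)

def residuals(x):
    pts = x.reshape(-1, 3)
    pred = np.concatenate([project(pts, cx) for cx in cam_x])
    return (pred - obs).ravel()

x0 = (pts_true + rng.normal(scale=0.2, size=pts_true.shape)).ravel()
sol = least_squares(residuals, x0)
err = np.linalg.norm(sol.x.reshape(-1, 3) - pts_true, axis=1).mean()
print(f"mean point error after refinement: {err:.3f} m")
```

Scale this from 50 synthetic points and 3 cameras to millions of images and billions of points and the appeal of keeping individual jobs small becomes obvious.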
What if we didn't have to run these bundle adjustment jobs at massive scale? Instead, could we create a series of small jobs and then stitch the results back together? As a bonus, running small SfM and bundle adjustment jobs decreases the amount of error propagation and keeps errors uncorrelated. The upside is less compute and fewer errors. The downside is that you need to stitch all this data back together. The key ingredient for stitching the data back together is a real-world spatial coordinate system to serve as the common reference for fusing it. If we relied only on relative accuracy and a Cartesian coordinate system it wouldn't be possible. This partitioning -> compute -> fusion problem is what we've spent the majority of our time solving. If you use just GPS to position your point cloud partitions it looks like this:
In the animation below we can see how a reference assisted process takes the partitioned point clouds and stitches them back together using a spatial coordinate system.
Visually, we can then compare how well the point clouds are stitched back together against a high-resolution aerial orthophoto.
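Conceptually, the fusion step can be thought of as registering each partition's local reconstruction into the shared real-world frame using anchor points with known coordinates (GPS or ground control). The sketch below uses a plain Umeyama-style similarity transform on synthetic data; it illustrates the idea, not our actual alignment pipeline.

```python
import numpy as np

# Sketch of the fusion idea: register a partition's local reconstruction into a
# shared real-world frame using anchor points with known coordinates. A plain
# Umeyama-style similarity transform on synthetic data, not the real pipeline.
def fit_similarity(local_pts, world_pts):
    """Return scale s, rotation R, translation t mapping local -> world."""
    mu_l, mu_w = local_pts.mean(0), world_pts.mean(0)
    L, W = local_pts - mu_l, world_pts - mu_w
    U, S, Vt = np.linalg.svd(W.T @ L)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    s = (S * np.diag(D)).sum() / (L ** 2).sum()
    return s, R, mu_w - s * R @ mu_l

# Synthetic "partition": a local cloud plus the true transform into UTM meters.
rng = np.random.default_rng(1)
local = rng.uniform(0, 50, size=(1000, 3))
s_true, R_true = 1.7, np.eye(3)
t_true = np.array([476_000.0, 4_429_000.0, 1_650.0])
world = s_true * local @ R_true.T + t_true

# A handful of anchors with known world coordinates is enough to georectify.
s, R, t = fit_similarity(local[:10], world[:10])
aligned = s * local @ R.T + t
print(f"max alignment error: {np.abs(aligned - world).max():.6f} m")
```

Because every partition lands in the same coordinate reference system, the stitched clouds can simply be concatenated and compared against other georeferenced data like the orthophoto above.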
The combination of enabling any camera to collect data and our new approach to compute significantly changes the economics of building a global 3D map for machines. Next, we’ll specifically illustrate how this happens in practice and enumerate the cost savings.
The Power of Self-Service Interfaces and APIs
Driving the cost down is only effective if you can maintain quality while doing so. We've covered quantifying systematic measures of error from our georectification process in previous posts, and they currently range between 25–50cm. Combining inexpensive data collection gear with efficient compute to create high-accuracy data really becomes compelling when you allow anyone to do this themselves. To this end we've been working hard on a parallelized scale-out of our infrastructure to handle multiple concurrent jobs.
We are then hooking this back-end up to a simple (prerelease beta) self-service interface for running and managing your 3D mapping projects. Below we can see a time lapse of uploading a video and running it through the entire process.
The uploaded video and the 3D point cloud it creates are then maintained as artifacts that can continue to be built upon as we collect the rest of campus. The video below shows the parent video uploaded and the child streams generated from it to create the 3D model for this part of the University of Colorado Boulder campus.
The combination gives us a platform for collaboratively mapping a large geographic area using a number of disparate video and photo uploads. While these small-area examples are compelling, it only gets interesting when we push to a large geographic scale.
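For readers who would rather script against the back-end than click through an interface, here is a purely hypothetical sketch of what driving such a pipeline programmatically could look like. None of the endpoints, parameters or response fields below are the actual (prerelease) Pixel8 API; they are stand-ins to show the shape of the workflow.

```python
import requests

# Purely illustrative: the base URL, endpoints, parameters and response fields
# below are hypothetical stand-ins, not the actual (prerelease) Pixel8 API.
API = "https://api.example.com/v1"
headers = {"Authorization": "Bearer <token>"}

# 1) create a mapping project, 2) upload a video, 3) poll the processing job.
project = requests.post(f"{API}/projects",
                        json={"name": "cu-boulder"}, headers=headers).json()

with open("campus_walk.mp4", "rb") as f:
    job = requests.post(f"{API}/projects/{project['id']}/videos",
                        files={"video": f}, headers=headers).json()

status = requests.get(f"{API}/jobs/{job['id']}", headers=headers).json()
print(status.get("state"), status.get("point_cloud_url"))
```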
A Cost Breakdown of Mapping the University of Colorado Boulder
Given the unfortunate circumstances of the COVID-19 pandemic, mapping the University of Colorado campus was a solo affair. In three hours a single person mapped 25 linear kilometers, creating 63 GB of video. The total area collected was roughly 1 square kilometer.
The data took 24 hours to process with a conservative scale-out of our Kubernetes cluster. The cost of generating 63 GB and 25 linear km of georeferenced 3D point clouds was $38.95 in cloud compute. In this graph we can see May 27th, when the system was idle, and May 28th, when we processed the CU Boulder data.
On average that is $1.56 per linear kilometer of compute cost plus a camera cost of $200. This is a very significant decrease in production cost from current methods. In our previous post we highlighted that the accuracy of our approach is 25–50cm RMSE from a survey baseline (3DEP LiDAR). While this level of accuracy is not sufficient for all use cases, it is a level of affordability and accuracy that opens a new world of use cases.
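For anyone who wants to reproduce that kind of accuracy check, here is a rough sketch of one way to do it: nearest-neighbor distances from the SfM cloud to a survey-grade reference, summarized as RMSE. The file names are placeholders, and a rigorous comparison would also filter outliers and compare against a surface rather than raw points.

```python
import numpy as np
from scipy.spatial import cKDTree

# Rough sketch of an accuracy check: nearest-neighbor distances from the SfM
# cloud to a survey-grade reference (e.g. 3DEP LiDAR), summarized as RMSE.
# File names are placeholders; both clouds are assumed to share a metric CRS.
def rmse_to_reference(sfm_xyz: np.ndarray, ref_xyz: np.ndarray) -> float:
    tree = cKDTree(ref_xyz)
    dists, _ = tree.query(sfm_xyz, k=1)
    return float(np.sqrt(np.mean(dists ** 2)))

sfm = np.loadtxt("sfm_cloud_utm.xyz")      # placeholder: N x 3, meters
lidar = np.loadtxt("3dep_lidar_utm.xyz")   # placeholder: M x 3, meters
print(f"RMSE vs reference: {rmse_to_reference(sfm, lidar) * 100:.1f} cm")
```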
Too often in the diverse ecosystem of geospatial data we price data products at a level that hobbles the number of potential use cases. Most often this is by necessity: satellites, rockets, LiDAR scanners, aircraft time and drones are all expensive. The "cost of goods sold" becomes an albatross that leaves the geospatial industry relegated to niches in many cases. The same risk emerges for autonomy and augmented reality. If the cost of building "digital twins" to power these new technologies is exorbitant, it limits the use cases and companies that can participate. If only firms with the deepest pocketbooks can generate the requisite data to power applications, it will be tough to build a vibrant ecosystem. We've seen the open alternative definitively illustrated by OpenStreetMap.
The Results
Now for the fun part: showing off the results of mapping the University of Colorado's campus and sharing the open data. Eagle-eyed readers have probably noticed that our collection of the University of Colorado Boulder campus was done at the end of May and we are a good way through June now. While we were able to process and align all the data programmatically the day after collection back in May, we found an issue. Since the ground control surface we were using was collected, the university had done a significant amount of construction. Entirely new complexes had popped up and some older buildings had disappeared. In the most extreme cases this introduced alignment errors in our data. While the University of Colorado case was a bit extreme, it was an edge case we wanted to be able to handle. The good news is the team was able to devise new odometry and pose graph optimization routines that allow SfM reconstructions to bridge temporal gaps in the reference data. In this image we can see how the new method accurately structured and aligned the SfM point cloud even though the building in question was missing from the ground control.
With that challenge solved we can share the initial results. There are multiple options to play with the data: 1) raw data download in EPT and LAZ, 2) the Pixel8 interactive viewer (it will max out your processors) or 3) the Potree viewer (better performance but restricted perspectives). All data is open and licensed as CC BY-SA 4.0. Also, here are a few highlights of cool segments in the point cloud.
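If you grab the raw LAZ download, a minimal quick-start for loading it in Python might look like this; the file name is a placeholder for whichever tile you pull down.

```python
import laspy   # pip install "laspy[lazrs]" for LAZ support
import numpy as np

# Quick-start sketch for the open data release; the file name is a placeholder
# for whichever LAZ tile you download.
las = laspy.read("cu_boulder_tile.laz")
xyz = np.vstack([las.x, las.y, las.z]).T
print(f"{len(xyz)} points; bounds: {xyz.min(axis=0)} to {xyz.max(axis=0)}")
```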
One aspect you'll notice in the point cloud reconstructions is the prevalence of what we call "sky points". These are artifacts from the SfM process, where it picks up features in the sky and propagates them into the 3D point cloud. The "sky points" don't impact AR or autonomy use cases when you are using the data to power a feature database, but from a visual perspective they are distracting. These points can be removed programmatically in a post-process cleaning step, but we wanted to get the data out sooner rather than later. That said, it likely warrants a separate post on maximizing photo realism in these collects and the potential for mesh generation. This is particularly interesting for gaming use cases. Overall we are excited to push out the next iteration of the platform. Lots of work to do, but we are excited about the potential!
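For anyone who wants to clean the clouds themselves before we do, here is a naive sketch of one possible post-process filter. The height threshold is an assumption, and a production cleanup would more likely rely on per-cell statistics or reprojection-error filtering.

```python
import numpy as np

# Naive post-processing sketch for stripping "sky points": anything far above a
# rough ground estimate gets dropped. The 40 m threshold is an assumption; a
# production cleanup would use per-cell statistics or reprojection-error filters.
def drop_sky_points(xyz: np.ndarray, max_height_above_ground: float = 40.0) -> np.ndarray:
    ground_z = np.percentile(xyz[:, 2], 2)    # crude ground-level estimate
    return xyz[xyz[:, 2] < ground_z + max_height_above_ground]

cloud = np.loadtxt("cu_boulder_segment.xyz")  # placeholder N x 3 array
print(f"{len(cloud)} -> {len(drop_sky_points(cloud))} points after sky filtering")
```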