Rapid 3D Mapping Using Commodity Video
While our Boulder Mapping Experiment was a ton of fun, getting our friends together every weekend to take photos was a big ask. So we’ve been experimenting with other imagery capture modalities that could scale up more quickly. One modality we are particularly excited about is video. Not only can most camera phones take video, but the number of dash and action cameras has exploded in recent years. The cherry on top is the proliferation of companies partnering with fleet vehicles and crowdsourcing to enable video at scale: Mapillary, OpenStreetCam, Carmera, Nexar, MobilEye+ESRI, Tesla, etc. There is a surfeit of geotagged video being generated. If we can crack the nut of turning it into high accuracy 3D maps, it could open new possibilities for creating a crowdsourced 3D map of the globe.
To simulate video collection from moving vehicles we used a simple rig: a GoPro 360 Max attached to a ski helmet while biking through our Boulder test area. Yeah, it looks a little funny in action…
We’ve also tested putting the GoPro on a pole for pedestrian collection of areas that can’t be driven or biked. Starting our video collection testing with the GoPro 360 Max had its pros and cons. In the “pro” category, a 360 action camera gives you lots of photos from a variety of angles, which is quite handy for photogrammetry. In the “con” category, the GoPro uses a good bit of proprietary processing to create its 360 degree views, leaving you with an open ended reverse engineering challenge.
GoPro 360 Calibration and Photogrammetry Prep
Since the new-ish GoPro 360 Max wasn’t in any of the camera databases commonly used for photogrammetry, we first tried to calibrate the sensor ourselves. There are some nice online tutorials for calibrating GoPro cameras to do things like remove wide angle distortion. So we dove in, printed up a black and white chessboard, and started following the protocols.
The bad news was we couldn’t get good enough parameters from the calibration to derive a camera model and focal length that gave us a solid SfM (Structure from Motion) result. Fortunately, the photogrammetry framework we use supports splitting 360 degree images for processing. The hitch was that the results were pretty lousy. Then Pramukta had a clever idea: split the images with overlap. One of the key principles that makes photogrammetry work is having multiple pixels in common between multiple images; these common pixels allow triangulation through lines of sight. To create our overlap, Pramukta divided our 360 degree images into eight 60 degree segments instead of eight clean 45 degree segments. This makes a bit more sense when visualized.
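The geometry of the overlapping split is easy to sketch. With eight segments the centers sit 45 degrees apart, so giving each segment a 60 degree field of view leaves 15 degrees shared with each neighbor. The helper below is a hypothetical illustration of that angle math only; in a real pipeline each yaw range would be reprojected from the equirectangular panorama into a perspective crop.

```python
def segment_yaw_ranges(n_segments=8, fov_deg=60.0):
    """Yaw (start, end) in degrees for each overlapping segment of a
    360 panorama. Centers are evenly spaced; a field of view wider
    than the spacing is what creates the overlap."""
    step = 360.0 / n_segments  # 45 degrees between segment centers
    return [((i * step - fov_deg / 2.0) % 360.0,
             (i * step + fov_deg / 2.0) % 360.0)
            for i in range(n_segments)]

# Adjacent segments share fov_deg - step degrees of view: 60 - 45 = 15.
ranges = segment_yaw_ranges()
```

With a clean 45 degree split (`fov_deg=45.0`) the overlap collapses to zero, which is exactly why the original segmentation starved SfM of common pixels.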
The results from the new segmentation strategy were super encouraging. The increase in overlap gave us more pixels in common across our input images, producing much better SfM results. Here we can see an SfM output using the 360 overlap segmentation approach.
Overall, we consistently got solid results with the new technique. In the image below we can see more results from the new approach across our Boulder test area.
Compared to our “still photography” derived point clouds, the GoPro captures ground detail and overall coverage better, but lacks some of the detail found in the still photography approach, especially at high angles. The biggest advantage was speed: the GoPro 360 mapping took us seven minutes by bike versus several hours taking still photos on foot. When it comes to 3D mapping large geographies, video is an ideal format to leverage.
The Magic of Using Odometry for Alignment
Another big advantage of using video as a collection modality is the ability to leverage “visual odometry” when you want to align your data to a survey reference. Traditional odometry is “the use of data from the movement of actuators to estimate change in position over time through devices such as rotary encoders to measure wheel rotations.” The concept of “visual odometry” extends this idea by deriving the same information from sequential camera images. This is perfect for GPS tagged video from action cams like the GoPro, vehicle cameras, or dashcams.
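Whatever the sensor, wheel encoders or sequential camera frames, odometry boils down to chaining relative motion estimates into a trajectory. A minimal planar sketch (the function name and motion format are our own, not from any particular library):

```python
import math

def integrate_odometry(rel_motions, start=(0.0, 0.0, 0.0)):
    """Chain per-step (dx, dy, dtheta) estimates, expressed in the
    vehicle's own frame, into a global (x, y, heading) trajectory."""
    x, y, th = start
    trajectory = [start]
    for dx, dy, dth in rel_motions:
        # Rotate the body-frame step into the global frame, then advance.
        x += dx * math.cos(th) - dy * math.sin(th)
        y += dx * math.sin(th) + dy * math.cos(th)
        th += dth
        trajectory.append((x, y, th))
    return trajectory

# Five one-meter steps straight ahead end up five meters from the start.
path = integrate_odometry([(1.0, 0.0, 0.0)] * 5)
```

In visual odometry the per-step motions come from matching features between consecutive frames; the integration step, and the way small per-step errors accumulate over distance, is the same.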
Traditionally, the challenge of combining SfM and “visual odometry” for large scale 3D mapping is the propagation of errors. The larger the area you are generating photogrammetry for, the further triangulation errors propagate. Small models can be quite accurate, but as a model grows, accuracy slips. To solve this problem for “still photography” we build roughly 30 photo models and stitch them together. This doesn’t work particularly well for video. Instead, we segment video into chunks of time. Each segment is then sampled at a regular interval for frames that are fed into the photogrammetric model.
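The time-chunking and regular-interval sampling can be expressed as simple index arithmetic. A sketch, where the 30 second segment length and 2 Hz sampling rate are illustrative placeholders rather than our tuned values:

```python
def sample_frames(total_frames, fps, segment_s=30.0, sample_hz=2.0):
    """Split a video into fixed-length time segments, then pick frame
    indices at a regular interval within each segment. Each returned
    list feeds one small photogrammetric model."""
    seg_len = int(segment_s * fps)           # frames per time segment
    step = max(1, round(fps / sample_hz))    # frames between samples
    return [list(range(start, min(start + seg_len, total_frames), step))
            for start in range(0, total_frames, seg_len)]

# A one-minute 30fps clip becomes two segments of 60 sampled frames each.
segments = sample_frames(1800, 30)
```

Each inner list would then be handed to the SfM pipeline as an independent model, keeping any triangulation error confined to that segment.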
Winnie built a super cool visualization of these time segments, color coded and aligned in geographic space. Each color is a separate point cloud from a subset of video frames. Each point cloud is then aligned to the reference data.
This segmentation prevents SfM errors from propagating beyond each small model. It keeps overall errors small and uncorrelated, which is key to high overall accuracy for neighborhoods, cities, and eventually the globe.
How Accurate is Video for 3D Mapping?
Aesthetically, the GoPro 360 derived photogrammetric results were quite nice, but the far more important question is: are they accurate? The irony in all this work is that we strive for visually realistic renderings of reality, but the most compelling use case for the data is machine use, not people. The machine doesn’t care if its maps look nice, just that they are accurate to reality, whether for augmented reality or autonomy use cases. That said, let’s look at the accuracy of our GoPro 360 video derived point clouds co-registered to our Nearmap reference data.
The map above visualizes the residuals from the RMSE calculation, color coded by how much error there is from each point to the reference. The interesting pattern in the results is that error is generally minimal (<18cm) at ground level and increases at higher elevations. The rough correlation between error and elevation makes sense: the camera is at roughly ground level and loses accuracy for the points furthest from it. Across the entire GoPro data set there is an average 50cm RMSE against the Nearmap baseline. This is a solid result, but not quite as good as the 25cm RMSE we achieved with the mobile phone photo derived point clouds.
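The underlying computation is per-point: each point in the video-derived cloud gets a residual, its distance to the nearest point in the reference cloud, and the RMSE summarizes those residuals. A toy sketch (brute-force nearest neighbor for clarity; production pipelines use a KD-tree and run after co-registration):

```python
import numpy as np

def point_residuals(cloud, reference):
    """Distance from each cloud point to its nearest reference point."""
    d = np.linalg.norm(cloud[:, None, :] - reference[None, :, :], axis=2)
    return d.min(axis=1)

# Toy example: a cloud shifted 0.1 m off a coarse 5x5 reference grid.
xs = np.arange(0.0, 5.0)
reference = np.array([(x, y, 0.0) for x in xs for y in xs])
cloud = reference + np.array([0.1, 0.0, 0.0])

residuals = point_residuals(cloud, reference)
rmse = float(np.sqrt(np.mean(residuals ** 2)))  # ~0.1 m for this toy data
```

These per-point residuals are exactly the kind of attribute that can be stored alongside each point for the color-coded map above.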
There are a few caveats to the 50cm RMSE that should be noted. It includes additional error generated by transient objects like cars, aerial occlusions like overhangs, and artifacts like sky points. We can see some examples of these in the image below, the most prominent being the car and sky point artifacts.
Fortunately, the sky point artifacts can be pretty easily removed in post processing. With a good bit more work, cars can also be removed from the data. Both of these steps would improve the RMSE score. We are a bit stumped on the occlusion induced errors, but the likely path forward is a “best” pixel analysis done in post processing after co-registration. This doesn’t solve the error problem, but it removes the operational issue in deployment. If you’d like to explore the data, you can download the .laz of the GoPro 360 point cloud here (the “residual” column holds the per-point error used in the RMSE calculation).
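For the sky points, the simplest post-processing filter is a height cut: drop anything implausibly far above local ground level. A minimal sketch, where the 30 m ceiling is an illustrative threshold rather than a value from our pipeline, and smarter filters would also consider local point density:

```python
import numpy as np

def drop_sky_points(points, ground_z=0.0, max_height=30.0):
    """Keep only points within max_height meters of ground level.
    points is an (N, 3) array of x, y, z coordinates."""
    return points[points[:, 2] - ground_z <= max_height]

# Two street-level points survive; the stray 120 m "sky" point is cut.
pts = np.array([[0.0, 0.0, 1.5],
                [2.0, 1.0, 8.0],
                [1.0, 3.0, 120.0]])
cleaned = drop_sky_points(pts)
```

In hilly terrain like Boulder, `ground_z` would need to come from a local terrain model rather than a single constant.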
Commodity video is a great data source for large scale 3D mapping. It isn’t quite as accurate as the point clouds we derived from our mobile app, but the coverage was markedly better. Another upside to using video is the ubiquity of collection at city scale and larger. We’ll further explore the potential of video in a future post utilizing the eight cameras available in a Tesla Model 3:
The best part: it isn’t just armies of Teslas driving our cities; there is a plethora of cameras collecting data. A bonus is that the photogrammetric process removes personal information. People are transient so they aren’t rendered, and the detail isn’t good enough to capture things like license plates. This doesn’t obviate the issues with the raw input data, but the natural obfuscation of large scale photogrammetry is a nice positive externality.
Recently there was a Tweet from Christopher Beddow asking about mounting a GoPro to a drone for OpenStreetMap work:
What if we did not need drones to augment OpenStreetMap with pixel level precision? Just upload your GoPro (or similar) video and go. This also opens up lots of interesting trajectories for 3D annotations, but one step at a time. First challenge: demonstrate we can scale. Map and process a small city in a day!