tile-smush, a faster tile-join for MBTiles files (Oct 5, 2024)


tile-smush is a command-line tool to merge multiple .mbtiles files into a single file. It takes 85% less time than tile-join, but only works for .mbtiles files that don't have overlapping layers.

Use case

My workflow for building the HikerAtlas map runs on GitHub Actions. GitHub's free runners are somewhat underpowered: 4 cores and 16 GB of RAM.

To avoid being killed by the OOM killer, I build the map in thematic layers. GitHub, to its credit, lets you use up to 20 runners simultaneously. So even though the total CPU time is higher, the wall-clock time is dictated by whichever layer is slowest.

At the end of this step, we have 9 .mbtiles files covering all of North America.

We'd like to join them into a single merged.mbtiles file.

Option 1: go-pmtiles

This post is about .mbtiles, but honestly, I'd be happy to solve it with .pmtiles, too.

The de facto standard for manipulating .pmtiles files is go-pmtiles.

Issue #105 in that repo tracks building just this feature, but unfortunately, it's unimplemented as of October 2024.

Option 2: tile-join

Mapbox's tippecanoe project has a utility called tile-join. It can merge .mbtiles files, albeit a bit slowly.

I grabbed a random Docker image and ran it on a test extract of Nova Scotia. It took 2 minutes, which felt slow. While it was running, I noticed a lot of time spent in syscalls.

My spidey senses were tingling, so I checked out the repo, built it and ran it outside of Docker. This time, it took only 30 seconds (!), and had much less syscall contention.

The issue seemed to be that tile-join creates and destroys a thread for every tile in the .mbtiles archive. This is inefficient: more time goes to the kernel overhead of thread lifecycle management than to productive tile-joining work. That overhead seems to be exacerbated inside the Docker container.

Then I saw that Felt maintained a fork of tippecanoe and gave it a try. It was even worse -- taking almost 7 minutes!

The performance could probably be improved via a thread pool. I toyed with patching it, but ultimately decided not to. Even if I patched tile-join to use threads more efficiently, it would still be a general-purpose tool: it has to do a lot of work to behave well in a wide variety of scenarios.
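To make the thread-pool idea concrete, here is a minimal Python sketch. It is not tile-join's code (which is C++); process_tile is a hypothetical stand-in for the per-tile work, and both function names are invented for this illustration. The only point is the difference in how many threads get created.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def process_tile(tile_id):
    """Hypothetical stand-in for per-tile work; returns the id of the thread that ran it."""
    return threading.get_ident()


def thread_per_tile(tile_ids):
    """Create, start and join one thread per tile (simplified: one at a time)."""
    started = 0
    for tile_id in tile_ids:
        t = threading.Thread(target=process_tile, args=(tile_id,))
        t.start()
        t.join()
        started += 1
    return started  # number of threads created


def pooled(tile_ids, workers=4):
    """Run the same work on a fixed pool; at most `workers` threads are ever created."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        idents = set(pool.map(process_tile, tile_ids))
    return len(idents)  # number of distinct threads that did any work
```

For 100 tiles, the first function creates 100 threads; the second never creates more than 4, no matter how many tiles there are.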

Option 3: tile-smush

But I don't need a wide variety of scenarios. Just my scenario: .mbtiles files that might have overlapping tiles, but never overlapping layers. That is, tile 0/0/0 might be present in both land.mbtiles and water.mbtiles, but the layer water_labels is only in water.mbtiles.
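That precondition can be checked before merging. The sketch below is an illustration, not part of tile-smush: it assumes each input stores a "json" row in its metadata table whose vector_layers array lists the layer ids, as the MBTiles spec describes for vector tilesets. The function names are made up for this example.

```python
import json
import sqlite3


def layer_ids(path):
    """Return the set of vector layer ids declared in an .mbtiles file's metadata."""
    con = sqlite3.connect(path)
    try:
        row = con.execute(
            "SELECT value FROM metadata WHERE name = 'json'"
        ).fetchone()
    finally:
        con.close()
    if row is None:
        return set()  # no vector_layers metadata to inspect
    return {layer["id"] for layer in json.loads(row[0]).get("vector_layers", [])}


def overlapping_layers(paths):
    """Map each layer id declared by more than one file to the files declaring it."""
    owners = {}
    for path in paths:
        for layer in layer_ids(path):
            owners.setdefault(layer, []).append(path)
    return {layer: files for layer, files in owners.items() if len(files) > 1}
```

An empty result from overlapping_layers(["land.mbtiles", "water.mbtiles"]) would mean no layer id is declared by both files.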

When tile-join merges two files, it does so with deep knowledge of each tile. Every tile gets decoded: it's unzipped, and then the internal protobuf structures are interrogated to rebuild a representation of the tile in tile-join's internal data structures.

That makes perfect sense for a general-purpose tool. Luckily, the nature of my constrained use case suggests a faster approach:
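One way to exploit that constraint, sketched here under assumptions rather than copied from tile-smush's source: in the Mapbox Vector Tile format, a tile is a protobuf message whose layers are a repeated field, and concatenating two serialized protobuf messages yields a message with the repeated fields of both. So if no layer appears in more than one input, tiles can be merged without decoding any features. The sketch assumes the tiles are gzip-compressed, and merge_tile_blobs is a hypothetical name.

```python
import gzip


def merge_tile_blobs(blobs):
    """Merge gzip-compressed vector tiles whose layers do not overlap.

    `blobs` holds the tile_data values for one z/x/y, one per input file
    that contains that tile.
    """
    if len(blobs) == 1:
        # Only one input has this tile: pass the bytes through untouched.
        return blobs[0]
    # Concatenating the decompressed protobuf bytes appends each tile's
    # layers to the merged tile; no feature decoding is needed.
    raw = b"".join(gzip.decompress(blob) for blob in blobs)
    return gzip.compress(raw)
```

The tile is never parsed. That only works because the layers are disjoint; if two inputs both contained a layer with the same name, the result would have a duplicated layer, which the vector tile spec does not allow.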

As always, I have a hammer called tilemaker, so everything looks like a nail.

I forked tilemaker into tile-smush and ripped out all the PBF, SHP, GeoJSON, etc. code.

With the approach above, my Nova Scotia test case completes in ~6 seconds -- a nice speedup!

Even better, my production use case of North America completes in 20 minutes (down from 130 minutes).

Usage

tile-smush is published as a Docker image at ghcr.io/hikeratlas/tile-smush:master.

A driver script for it is available at docker-tile-smush.

Invoke it like:

./tile-smush input1.mbtiles input2.mbtiles [...] inputN.mbtiles

It will emit a merged.mbtiles.
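A quick sanity check on the output is to count tiles per zoom level. This is a sketch that assumes merged.mbtiles exposes the standard MBTiles tiles table (or view) with a zoom_level column; tiles_per_zoom is a made-up helper, not something tile-smush ships.

```python
import sqlite3


def tiles_per_zoom(path):
    """Return {zoom_level: tile_count} for an .mbtiles file."""
    con = sqlite3.connect(path)
    try:
        rows = con.execute(
            "SELECT zoom_level, COUNT(*) FROM tiles "
            "GROUP BY zoom_level ORDER BY zoom_level"
        ).fetchall()
    finally:
        con.close()
    return dict(rows)
```

Since every tile present in any input should also be present in the merged file, each zoom level's count in merged.mbtiles should be at least as large as the largest count among the inputs.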