Proposal: add route tables to osm2pgsql

NOTE: The proponent has received fairly unambiguous feedback from the maintainers of osm2pgsql that the proposal in its current form is unacceptable, and it appears unlikely that any change that continues to address the requirements of this project can be developed that will meet all the constraints needed to make it acceptable. Alternative approaches are being sought.

[WITHDRAWN] Proposal: Include route relation information in osm2pgsql's database schema

Kevin B. Kenny | email | OSM user 'ke9tv' | Github user 'kennykb'

4 September 2018

Executive summary

Handling of route concurrences (where two numbered routes follow the same way for a time), overlays of different route networks, and pictorial route shields are features that have been requested for OpenStreetMap renderings for a long time. It appears that the current database schema of osm2pgsql is not quite adequate to the task. The relation between numbered routes and ways is inherently many-to-many. It requires at least one auxiliary table in the database beyond the four already present, to track it effectively.

This proposal introduces two more tables beyoud point, line, polygon and roads. The first, route, represents route relations and holds their tags; it is structured similarly to the four tables that are already there except that it has no geometry - geometry belongs to the relation members, not to the relation. The second, route_line, represents the fact that a given line is a member of a route relation. It is intentionally structured to be minimalistic; each row holds only the OSM ID of the relation, the OSM ID of the member, the position of the member in the relation's list of members, and the member's role in the relation.

The proposal includes several command-line options on osm2pgsql to allow for:

initial construction of the route tables on the database at import time, or updating the route tables in response to changes
building or rebuilding of the route tables on an existing database, which can be done provided that slim tables are available
removal of the route tables in the event that it is decided that a particular database instance no longer requires them.

There is a proof of concept rendering available that gives a feel for the sort of labeling that this technique can provide. In the area shown can be seen pictorial markers for route relations (including concurrencies like MA 110/113), classic OSM-style markers for ref=* tags on ways that do not participate in route relations (also including concurrencies).

Background

Rendering of pictorial route markers and rendering of concurrent routes have both been requested widely for several years, and relatively little actual progress has been made, owing to the difficulty of tracking route relations.

There have been a several attempts at solving this problem; perhaps the most successful has been that of Phil! Gold (repository), which was briefly used on openstreetmap.us and can still be seen (in an out-of-date version) in at least one rendering. (The specific linked-to area shows the somewhat insane concurrencies that are sometimes present in North America, with no fewer than nine numbered routes from four different networks running over the same way!)

Unfortunately, all these approaches have proven to be brittle and difficult to install and maintain. They suffer one or more of the drawbacks:

requiring that the renderer execute SQL queries that are impenetrably complex and intolerably slow.
requiring additional tables that are not supported by osm2pgsql and have expensive maintenance requirements, generally needing a full rebuild on every update.
requiring read/write connections to the database (and hence, a patched version of mapnik.
requiring a side channel to the database, whereby both the renderer and the database engine have access to the same host filesystem
requiring access to the slim tables at render time (or, in general, from outside osm2pgsql).
being unable to cope with minutely updates.

Any of these would be a show-stopper for widespread deployment. Gold's system, despite its attractive rendering and its comprehensive approach to overlaid route networks, suffers from all six.

The proponent used a lightly-modified version of Gold's system for a few years to render US-centric topographic maps for his own use and that of some of his friends. Recently, he decided to do a fairly comprehensive upgrade of his computing infrastructure, advancing from years-obsolete versions of the toolchain to reasonably current ones (moving from Ubuntu 14.04LTS to 18.04). The management of route symbols was by far the most difficult part of the conversion, and he resolved to reimplement it in such a way that it would not require maintaining a fork of any of the basic tools. He largely succeeded, for a limited workflow:

running with a continent-sized extract from GeoFabrik,
updating no more often than daily,
rendering only a portion of the US statically, owing to limited resources on a home computer,
tolerating the use of simple auxiliary tables, built at update time from the slim tables.

This workflow has now been stable for a short while on his personal server - in a view that is intended to preview maps produced for download to mobile devices. The rendering surely has its problems, but its handling of numbered highways appears to be satisfactory enough to prove the concept.

This limited success prompted several of the proponent's corresponents to ask whether the process could be scaled up to a large server facing the general public (and perhaps, eventually, one or more of the main renderings at openstreetmap.org). There are a number of technical hurdles to clear. Some of them are downstream in the toolchain (in tools such as CartoCSS and Mapnik). For the purpose of this discussion, these are ignored, in favour of focusing on the requirements for importing OSM data to the database. That aspect is the fulcrum about which the rest of the system turns, and clearly the one that must be addressed to clear the 'log jam' and get forward motion going on the remaining issues.

Requirements

The key requirement for route rendering is that at run time, a renderer must be able to locate:

all numbered routes, or numbered routes of a given importance (e.g., all numbered motorways, trunks and primary roads),
lying at least partially within a given region of interest,
with the geometry divided in such a way that no two ways are repeated,
with the set of routes enumerated for each identified way,
with ways having the same sets of routes coalesced as far as possible, to aid in the optimal placement of markers.
with each route to which a way belongs identified by network and reference, to allow for the selection of the appropriate pictorial marker.

The roads table has at least been an attempt to address these requirements, but does not at present allow concurrencies. Cascading the networks and references down from route relations into the table has proven to be an intractable problem, leading to the associated issue's having languished for nearly four years.

The difficuly lies in either the ability to make the above query efficiently, or to keep the intermediate data needed to support it up to date when accepting updates from upstream. This arises because the roads or line tables, even augmented with the rels table (from the slim tables), violate Edgar Codd's first principle of database normalization: that the domain of each attribute must contain only atomic (indivisible) values, and that the value of each attribute must consist of only a single value from its domain. This is true neither of the rels table, where the rows contain not only arrays of relation members, but heterogeneous ones at that: relations, ways and nodes are all grouped together, along with their roads. Neither is it true of the roads table, where the ref attribute is expected to consist of a series of ordered pairs of network and reference number, separated by semicolons. Attempting to update a relation membership causes the need of the application performing the update to track multiple dependencies. Moreover, the desire to coalesce ways that belong to the same sets of relations leads to a situation where anything that touches a relation needs to propagate down to an arbitrary number of ways, and may need to split and recombine them.

Proposal, part 1: Add additional tables to track route members

The problem is greatly simplified if we normalize the database and promote the fact that 'way W is a member of relation R at position P with role X' to a first-class object in the database. This gives rise to a schema something like:

    CREATE TABLE planet_osm_route(
        osm_id BIGINT NOT NULL PRIMARY KEY, -- OSM ID of the relation
        type TEXT,      --    ...
        route TEXT,     -- Tags on the relation
        network TEXT,   --    ...
        "ref" TEXT,     --    ...
        -- other keys and values as needed, possibly including:
        tags HSTORE
    )

    CREATE TABLE planet_osm_route_way(
        rel_id BIGINT NOT NULL,  -- ID of the route relation
        idx INT NOT NULL ,       -- Numeric position of the way in the relation
        osm_id BIGINT NOT NULL,  -- ID of the way
        "role" TEXT,             -- Role of the way in the relation
        PRIMARY KEY(rel_id, idx),
        FOREIGN KEY(rel_id) REFERENCES planet_osm_route(osm_id)
            ON DELETE CASCADE
    )

    CREATE INDEX idx_route_way_osm_id
        ON planet_osm_route_way USING BTREE(osm_id)

If we confine ourselves to route relations (there's no need to include the far more common multipolygons, as far as the proponent can tell), the resulting table is small by OSM standards. For the North American continent, there are roughly fifty thousand routes, comprising roughly a million and a half members. PostgreSQL's memory caching can handle tables of this size well enough that JOIN operations run at the speed of main memory.

The proof-of-concept implementation includes these tables, albeit with names in less generic form. The file to which the link refers also contains the code by which the tables can be initialized from the slim tables. On the proponent's workflow, where continent-sized diffs from GeoFabrik are applied on a daily basis, rerunning this process after each update is not unduly burdensome; it takes considerably less time than applying an update of that scale.

Implementation note: It is worth noticing that the route table has exactly the same structure as point, line, polygon and roads, except that it lacks geometry. (Relations in OSM by themselves have no geometry, after all.) These four tables are all handled by the table_t class in osm2pgsql. It would probably be best to address adding the route table by making appropriate changes to this class. By contrast, the route_way table is sui generis and would require some new handling. Given that the table_t class also tracks a good deal of common information about the database infrastructure, the best structure for the code modification may be to refactor it into a base class that manages the database connection and a derived class that has the table-specific information, with at least two derived classes to begin with, one for the existing four tables and one for route_way.

Proposal, part 2: Add command-line options to enable routes

The existing suite of command-line options must be augmented to handle the route tables. It is imagined that a single option, --routes, can specify this. It would interact with the other options in the following ways:

If running with -overwrite, or when creating the initial set of tables (no -append or -overwrite options), it will create the route tables using the above schema (deferring primary key installation and index creation until after import is complete). As each relation is encountered in the input, the necessary rows will be added to the two tables based on its content.
If running with -append, a check must first be made for whether the route tables are already present. If they are, then each deletion, creation, or update to a route relation will give rise to the corresponding update to the route tables: note that the cascade on the foreign key relation between route_way and route simplifies the deletion logic by eliminating the need to ferret out the ways when deleting the relation.
If running with -append, and the database does not contain the route tables, all is not lost if running in slim mode. The route tables will be created from the slim tables before proceeding with the -append operation. This mode of operation allows for creating the route tables on a database that does not already have them.

Open issues

The foregoing discussion addresses only the pgsql back end, which the proponent understands to be the mainstream of most current rendering chains. It is not entirely clear whether the gazetteer or multi back ends would benefit from these tables, and the proponent hopes to keep the footprint of this change upon the code base as small as possible. @pnorman has suggested that any eventual use of this material in the main renderer means that a schema change such as this one must cascade into the Lua API of osm2pgsql.

Consider this section of the document to be a plea to those developers who understand the requirements of the other back ends to explain them and give pointers to the code areas that must be modified to accomplish this task.

Issues intentionally out of scope

If routes are to be rendered with clusters of pictorial shields, then there also needs to be database support for looking up the graphical elements, and for repackaging the data in such a way that Mapnik's GroupSymbolizer can handle them. The SQL queries that perform these tasks are somewhat complex and arcane, and the proponent thinks it best to develop them once, and install stored procedures in the database that perform the queries. In that way, the rendering code can see relatively simple procedure calls rather than horribly messy SQL, and the SQL code can be commented fully and hidden inside the walls. The curious may examine these procedures at the 'osm-shields' repository.

It is thought best at the current time to consider these as being formally out of the ambit of osm2pgsql. The function of osm2pgsql shall be to prepare the tables that the procedures use, or more specifically, the tables that depend on OSM content. Actually querying the tables shall be the function of other portions of the toolchain.

Directions for future work

Whither the 'roads' table?

If stored procedures are used to identify routes for a way (and coalesce ways that belong to the same routes), and if additional indices are added to the database to select 'major' roads for rendering at low zoom levels, for example:

CREATE INDEX planet_osm_idx_line_primary
ON planet_osm_line
USING GIST(way)
WHERE 'highway' IN ('motorway', 'trunk', 'primary')

then it is not clear that the roads table continues to serve a useful purpose. It is possible that we may want to have a command-line option (--no-roads might be a name for it) to avoid creating this table, saving the space and time needed to manage it. Of course, this idea will have to depend on more extensive tests of the performance of low-zoom-level rendering without the roads table, but it appears to be worth considering.

Would this idea benefit other types of relations?

It is not immediately obvious that routes are the only relations for which a scheme like this one might benefit the renderers. Multipolygons are clearly out of scope, because they really exist to deal with the limitations of the upstream data model. We might want to think, though, if there are relations other than routes where the association between relation and member is many-to-many, and whether a renderer, router or data analysis program might need the information. Except for the names of things (and the specific choice of relation tags to include), the structure of the route tables generalizes to other types of relation, and an obvious extension would allow relation members to be points, polygons or other relations instead of ways.

One possibility that comes to mind is group relations. For instance, there is a group relation in the database for the Essex Chain of Lakes in the Adirondacks. (The individual lakes have the unimaginative names of 'First Lake' through 'Eighth Lake'.) It would be nice if, at a low zoom level, a renderer might be able to suppress the individual names and simply place a label for the group. At present, with the osm2pgsql data model, there's no visibility to this relation at all. My guess is that this is true even for the Gazetteer data model, since a Nominatim query for 'Essex Chain of Lakes' yields no results. It is hard to imagine how group relations could actually become useful until and unless some functionality like this is provided for them.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly