Split up canonical.json? #2243

bhousel · 2019-01-04T04:11:24Z

canonical.json is about 34000 lines now, and the file is starting to get kind of cumbersome to edit.

Should we split it up?
Maybe create a folder hierarchy like:

brands/
- amenity.json
- shop.json

or nest another level by key/value:

brands/
- amenity/
  - bank.json
  - fuel.json
  - pharmacy.json
  - ...
- shop/
  - car.json
  - car_repair.json
  - chemist.json
  - clothes.json
  - ...


Adamant36 · 2019-01-04T04:43:44Z

It might be worth going with the second option so things are easier to organize later on when a lot more companies are inevitably added to the database. Plus, it already takes a lot of time to get through a single category like banks. So the less we have to sift through things that aren't related to a specific category that's being focused on or that already have information the better.

Maybe it would allow for easier tagging projects based on the database at some point also.

bhousel · 2019-01-04T05:28:59Z

Also.. burying the lede a bit, but this project can expand to do more than just brands.

I'm thinking of maybe doing other stuff with it too, like:

transit companies: operator / operator:wikipedia / operator:wikidata
road routes: network / network:wikipedia / network:wikidata

ImreSamu · 2019-01-04T14:16:24Z

From a local data maintainer's point of view, I prefer separate country files:

countries:
- ca.json - Canada - brand/name suggestions
- us.json - U.S - brand/name suggestions
- hu.json - Hungary - brand/name suggestions
- ....

Pros:

- As a local maintainer, I only need to check my own country's data.
  - less data -> easier to find data 'bugs' -> better data quality
  - less data -> simple to create country HTML/Excel lists and post them to the local mailing list (right now ~90% is noise)
  - top-down limiting/cleaning is hard / not working ( [limiting local brand] - amenity/cafe|Cafe Coffee Day #2107 )
- Based on local community decisions (bottom-up)
  - moves the data problems to the country community level.
- Half-solution for figurative/unregistered/local brands ( amenity/cafe|Caffè Nero )
  - in this case we don't add amenity/cafe|Caffè Nero for the Hungarian names.
- Good solution for shops with generic words ( fix "hu" (shop/convenience|Abc) - general hungarian word - remove (need extra check in Poland!) #2087 )
- Local language support on wikidata tags
  - we can use brand:wikipedia=hu:McDonald’s ; https://wiki.openstreetmap.org/wiki/Multilingual_names is a big topic in Europe.
  - localisation is important: in Hungary, Russian was compulsory in elementary schools from 1949 until 1989, so some older OSM contributors don't know English.
- We can add more localized info ( website=www.mcdonalds.hu )

Cons:

- Some brands would need to be added to every country (data duplication).
- Would need some quality tools.
- Missing iD Editor support (a lot of new problems)
- ....

Without country separation, I would need some way to exclude entries by country code. ( ExcludeCountryCodes= [ 'hu'] ?? )
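A minimal sketch of what such an exclusion could look like, assuming each entry carried a hypothetical `excludeCountryCodes` list. The field name, the function name, and the entry layout are illustrative only and are not part of the current canonical.json format:

```python
def suggestions_for_country(entries, country_code):
    """Return the entries that are not excluded for the given country."""
    code = country_code.lower()
    return {
        path: entry
        for path, entry in entries.items()
        # 'excludeCountryCodes' is a hypothetical field, assumed for this sketch.
        if code not in (c.lower() for c in entry.get("excludeCountryCodes", []))
    }


# Illustrative data only; not copied from the real index.
entries = {
    "amenity/cafe|Caffè Nero": {"excludeCountryCodes": ["hu"], "tags": {"amenity": "cafe"}},
    "amenity/fast_food|McDonald's": {"tags": {"amenity": "fast_food"}},
}

hu = suggestions_for_country(entries, "HU")  # the Caffè Nero entry is filtered out
gb = suggestions_for_country(entries, "gb")  # nothing is excluded for 'gb'
```

Entries without the field are treated as valid everywhere, so existing data would not need to change.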

bhousel · 2019-01-04T14:27:41Z

Sorry @ImreSamu but I don't see us ever splitting this repo by country. The cons you listed above are pretty significant. Anyway wikidata is not split up by country - it's universal.

kymckay · 2019-01-05T13:46:54Z

I think nesting by key/value makes the most sense since that's essentially how the entries are grouped already - which means there's potential to reduce the amount of repetition induced by the <key>/<value>|<name> syntax since 2/3 of that information would already be captured by the directory structure.

> but this project can expand to do more than just brands

This has also crossed my mind once or twice

> Anyway wikidata is not split up by country - it's universal.

I do wonder if there's a good way we could mirror this and capture multinational brand information in a single entry 🤔 but, that's a thought for another time

kymckay · 2019-01-26T23:44:52Z

Got bored tonight and wrote a quick python script that will split canonical into sub-directories:

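import json
import os

root = 'config/brands/'
files = {}

# Group entries from the flat file by key, then value: files[key][value][name]
with open('config/canonical.json', encoding="utf8") as canonical:
    data = json.load(canonical)

    for path in data:
        # Each path looks like '<key>/<value>|<name>'; split on the first '|' only,
        # in case a name itself contains that character
        tagging, name = path.split('|', 1)
        key, value = tagging.split('/')

        files.setdefault(key, {}).setdefault(value, {})[name] = data[path]

# Write one file per key/value pair: config/brands/<key>/<value>.json
for key in files:
    os.makedirs(root + key, exist_ok=True)

    for value in files[key]:
        with open(root + key + "/" + value + ".json", "w", encoding="utf8") as out:
            json.dump(files[key][value], out, ensure_ascii=False, indent=2)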

matkoniecz · 2019-01-29T21:57:21Z

For me it is easier to edit a single file. What kind of problems does it present? It is not that large.

Adamant36 · 2019-01-29T23:55:11Z

It's pretty slow to load when you click on the file in GitHub, and sometimes it will only load half of it. At least for me, and I have a pretty good computer/internet. It's also not exactly easy to browse through, and it's really easy to lose your place. The only way I can reliably edit stuff is by using search in whatever code editor I'm using.

There would probably be other benefits to splitting it up too. Off the top of my head, it would allow anyone else who uses the index to more easily include only the POI types they want in their software. Modularity in general is a good thing.
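As a rough sketch of that use case, assuming the index were split into `<root>/<key>/<value>.json` files (the second layout proposed above), a consumer could load just the categories it cares about. The function name `load_categories` and the per-file layout are hypothetical:

```python
import json
from pathlib import Path


def load_categories(root, wanted):
    """Load only the requested 'key/value' categories from a split index.

    `wanted` is an iterable of strings such as 'amenity/bank'.
    Categories with no matching file are skipped.
    """
    selected = {}
    for category in wanted:
        path = Path(root) / (category + ".json")
        if path.is_file():
            with open(path, encoding="utf8") as f:
                selected[category] = json.load(f)
    return selected
```

For example, `load_categories("brands", ["amenity/bank", "amenity/fuel"])` would read two small files instead of parsing the whole index.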

tas50 · 2019-02-16T16:09:07Z

It would be very beneficial to have the data split using the key/value method shown in option 2. I just updated the allnames file locally, and the updates to the count values in canonical.json are so numerous that the diff crashes both SourceTree and Atom. It would be a lot easier to manage large changes like this with smaller files in the future.

bhousel · 2019-03-03T04:32:48Z

Just a heads up that I'm going to start splitting up canonical.json now! It sounds like the second option is the preferred one, so that's what I'll do.

All the PRs are merged or cancelled so there shouldn't be any huge conflicts from doing this.

bhousel · 2019-03-04T19:51:07Z

I just did this! config/canonical.json is no more.
Going forward, update the files under brands/**/*.json instead.
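For tools that still expect the old flat structure, a sketch like the following could rebuild it from the split files. It assumes each `<key>/<value>.json` file maps a bare name to its entry, as in the split script earlier in this thread; the files actually committed may be laid out differently, and `merge_split_files` is a hypothetical helper name:

```python
import json
from pathlib import Path


def merge_split_files(root):
    """Rebuild a flat {'key/value|name': entry} dict from root/<key>/<value>.json files."""
    merged = {}
    # Sort the paths so the result is built in a stable order across runs.
    for path in sorted(Path(root).glob("*/*.json")):
        key, value = path.parent.name, path.stem
        with open(path, encoding="utf8") as f:
            for name, entry in json.load(f).items():
                merged[f"{key}/{value}|{name}"] = entry
    return merged
```

Calling `merge_split_files("brands")` would then return a dict keyed the same way as the old canonical.json, under the layout assumption above.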

I've updated the documentation too, but let me know if anything is confusing or broken..

There are still around 600 or so "Research Needed" issues open that contain old instructions, but I hope people will be able to figure it out.

bhousel added frozen and removed frozen labels Jan 4, 2019

Adamant36 added the question Not Actionable - just a question about something label Feb 18, 2019

bhousel mentioned this issue Mar 3, 2019

fixes#2315 Updates co-op store information. #2345

Closed

This was referenced Mar 3, 2019

Filter out generic tailors #2374

Merged

More Japanese Brands (part 3) #2375

Merged

bhousel closed this as completed in 016f918 Mar 4, 2019

bhousel mentioned this issue May 4, 2019

Autofill "operator:wikidata" using the Operator field openstreetmap/iD#5484

Closed

bhousel mentioned this issue Jul 9, 2019

Add Greyhound bus lines (US) #2863

Closed

1ec5 mentioned this issue Jul 9, 2019

Public transportation networks #2864

Closed
