geocoded-bucharest-family-medicine-providers

The list of family medicine offices in Bucharest with approximate coordinates

About

The output.json file lists the family medicine doctors in Bucharest, together with geolocation information. It always contains the most recent list.

It has the following structure:

data = [
    {
        "title": str,
        "description": [str],  # list of strings
        "latitude": float,
        "longitude": float
    },
    ...
]
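
As a rough illustration of how a consumer might check a file against the structure above, here is a minimal sketch. The function name `validate_entries` is hypothetical and not part of this repository. It also assumes that whole-number coordinates parsed by `json` as `int` are acceptable.

```python
from typing import Any


def validate_entries(data: Any) -> list[str]:
    """Return a list of problems found; an empty list means the data matches."""
    if not isinstance(data, list):
        return ["top-level value is not a list"]

    problems: list[str] = []
    for i, entry in enumerate(data):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: not an object")
            continue
        if not isinstance(entry.get("title"), str):
            problems.append(f"entry {i}: 'title' is not a string")
        description = entry.get("description")
        if not (isinstance(description, list)
                and all(isinstance(d, str) for d in description)):
            problems.append(f"entry {i}: 'description' is not a list of strings")
        for key in ("latitude", "longitude"):
            value = entry.get(key)
            # bool is a subclass of int in Python, so reject it explicitly.
            # Assumption: ints are accepted alongside floats.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"entry {i}: '{key}' is not a number")
    return problems


# Usage with an in-memory sample (made-up values, not real data):
sample = [{"title": "Example office", "description": ["line 1"],
           "latitude": 44.43, "longitude": 26.1}]
print(validate_entries(sample))
```

To check the real file, pass the result of `json.load` on output.json to the same function.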

View on a map

https://cristidraghici.github.io/generic-map-with-pois/?api=https://cristidraghici.github.io/geocoded-bucharest-family-medicine-providers/output.json

Versions

We use simple versioning for both the code and the output files. Releases are tagged v1, v2, v3, etc. Before starting work on a new version of the parser, we save the current output in the ./.archive/ folder, in a newly created subfolder named after the corresponding version.
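
The archiving step could be scripted roughly as follows. This is only a sketch: the helper `archive_current` is hypothetical, and the repository may well do this by hand. It copies the listed files into `.archive/<version>/` and refuses to overwrite a version that already exists.

```python
import shutil
from pathlib import Path


def archive_current(root: Path, version: str, files: list[str]) -> Path:
    """Copy the given files from the project root into .archive/<version>/."""
    target = root / ".archive" / version
    # exist_ok=False: raise FileExistsError instead of overwriting an archive.
    target.mkdir(parents=True, exist_ok=False)
    for name in files:
        source = root / name
        if source.is_file():  # silently skip files that are not present
            shutil.copy2(source, target / name)
    return target
```

For example, `archive_current(Path("."), "v2", ["input.xlsx", "output.json"])` would create `./.archive/v2/` with copies of both files.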

File structure

This is a deliberately simple (YOLO) structure whose purpose is to keep older versions in the git repository. The files are small, so the storage cost is low, and it is worth paying to be sure the data will always be available.

  • .cache contains the cache from previous runs. If you pass the --cache param when running the script, it uses the cached data when available and also updates the cache at the end of the run;
  • .archive contains a history of results from running the parser. Each folder (v1, v2, etc.) stores the source file and the outputs generated by that version of the parser. We do not keep the parser files themselves, but each version folder corresponds to a tagged release of the script;
  • we keep the current source and outputs at the root of the project.
.cache/
|-- addresses_cache.json
|-- coordinates_cache.json
.archive/
|-- v1/
| |-- 20230721_Lista cabinete medicina de familie_20.07.2023
| |-- input.xlsx
| |-- output.json
|-- v2/
| |-- ...
20240401_Lista cabinete medicina de familie_01.04.2024
index.html
input.xlsx
output.json
geocode_medical_addresses.py
...
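
The --cache behaviour described above (use cached data if available, then update the cache at the end of the run) can be sketched like this. The helper names are hypothetical, and the sketch assumes each cache file is a flat JSON object mapping a key to a stored value. The real layout of addresses_cache.json and coordinates_cache.json is not documented here.

```python
import json
from pathlib import Path
from typing import Any, Callable


def load_cache(path: Path) -> dict[str, Any]:
    """Read the cache file if it exists; otherwise start with an empty cache."""
    if path.is_file():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def lookup(key: str, cache: dict[str, Any], compute: Callable[[str], Any]) -> Any:
    """Return the cached value for key, computing and storing it on a miss."""
    if key not in cache:
        cache[key] = compute(key)
    return cache[key]


def save_cache(path: Path, cache: dict[str, Any]) -> None:
    """Write the (possibly updated) cache back to disk at the end of the run."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(cache, ensure_ascii=False, indent=2),
                    encoding="utf-8")
```

A run would call `load_cache` once, go through `lookup` for every address, and call `save_cache` once at the end, so that expensive lookups are not repeated across runs.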

How to use

The source list is neither consistent nor in a proper format. This is why we start with separate parsers, which can later be merged if needed. It is also the reason why we store the source in this repo.

These are some examples of how to run the script:

  • python geocode_medical_addresses.py
  • python ./geocode_medical_addresses.py --addresses --geocodes --excel --json --cache
  • python ./geocode_medical_addresses.py --addresses --geocodes --excel --json --cache --dev

New data sources

Main source:

Here are some ideas about how to handle the newly downloaded files:

  • we keep the filename as close to the source as possible;
  • before starting, remember to create a release for the parser and also save the current output in the ./.archive folder;
  • make a minimal cleanup of the file (remove the formatting and the headers from the file), using a previous source file as a model.

Coordinates

We use OSM and Nominatim to get the coordinates for each address. If an address is not found automatically, go to the Nominatim website, search for the address that caused the error, manually find something close, then update the manual_address column in the Excel file (./input.xlsx).
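
For illustration, a lookup against the public Nominatim search endpoint might look like the sketch below. It is not the code in geocode_medical_addresses.py. The function names and the User-Agent string are hypothetical, and preferring manual_address over the original address when it is filled in is an assumption about how the column is used. The public Nominatim service also expects an identifying User-Agent and a low request rate.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Optional

NOMINATIM_SEARCH_URL = "https://nominatim.openstreetmap.org/search"


def build_search_url(address: str) -> str:
    """Build a free-form search URL that asks for at most one JSON result."""
    query = urllib.parse.urlencode({"q": address, "format": "json", "limit": 1})
    return f"{NOMINATIM_SEARCH_URL}?{query}"


def parse_first_result(payload: str) -> Optional[tuple[float, float]]:
    """Return (latitude, longitude) of the first hit, or None if nothing matched."""
    results = json.loads(payload)
    if not results:
        return None
    # Nominatim returns "lat" and "lon" as strings in its JSON output.
    return float(results[0]["lat"]), float(results[0]["lon"])


def fetch_url(url: str) -> str:
    """Perform the HTTP GET; the User-Agent value is a placeholder."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "family-medicine-geocoder-example"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


def geocode(address: str, manual_address: Optional[str] = None,
            fetch: Callable[[str], str] = fetch_url) -> Optional[tuple[float, float]]:
    """Geocode manual_address when provided (assumption), else the raw address."""
    return parse_first_result(fetch(build_search_url(manual_address or address)))
```

A `None` result is the case in which the manual_address column would need to be filled in by hand. The `fetch` parameter exists only so the network call can be replaced in tests.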
