[Feature] Process consideration to keeping consistent data formatting #390

Open
chadwpetersen opened this issue May 31, 2020 · 14 comments
Labels
enhancement New feature or request

Comments

@chadwpetersen
Collaborator

Is your feature request related to a problem? Please describe.

Changes to CSV file headers can cause issues for services that rely on the CSV formats provided by this awesome repo. They can break downstream services that expect data in a specific format, i.e. data at a particular column index, and possibly under particular column header names.

To help give the community data guarantees, it would be great to keep these changes to a minimum where possible, and at least manage them with some lean process. :)

Describe the solution you'd like

  • We could first consider making column changes append-only. If you want to add something new to a file, consider appending it after the last existing column, so that it does not break any column indexes that others currently rely on.

  • We could also consider adding a version suffix to the file name when we want to introduce a backwards-incompatible change, i.e. reordering or renaming columns. The new file might get _v2.csv added to its name. That way people get time to upgrade to the new file before we eventually deprecate the old one.
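
As a rough illustration of the two proposals above, here is a minimal Python sketch. The function names and the example column names are hypothetical and not part of the repo; it only shows how an append-only header change could be detected and how a versioned file name could be derived.

```python
def is_append_only(old_header, new_header):
    """Return True if every old column keeps its old index in new_header.

    New columns are only allowed after the last existing column.
    """
    return new_header[:len(old_header)] == old_header


def versioned_name(filename, version):
    """Build a versioned file name, e.g. data.csv -> data_v2.csv."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:
        # No extension: just append the version suffix.
        return f"{filename}_v{version}"
    return f"{stem}_v{version}.{ext}"


# Hypothetical headers, for illustration only.
old = ["date", "total_cases"]
print(is_append_only(old, ["date", "total_cases", "deaths"]))  # appended column
print(is_append_only(old, ["date", "deaths", "total_cases"]))  # inserted column
print(versioned_name("district_data.csv", 2))
```

A check like `is_append_only` could decide whether a change is safe to make in place or whether it needs a new versioned file.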

Describe alternatives you've considered

  • We could add to the README that these file formats can break at any time, as they are not yet stable.

  • We could also consider having a means of agreeing on a format before it is used, e.g. having the community vote (but this might be a bit too much, I think).

Additional context

The reason I ask is that these types of changes should be considered backwards incompatible: downstream services that rely on these files can break if they expect values to be in certain columns with certain headings, which is not a great experience when things like this change.

I have experienced a few breakages related to the district data files and was hoping a simple, lean process could be considered when maintaining these data files so as to give the community some data structure guarantees. :)

@chadwpetersen chadwpetersen added the enhancement New feature or request label May 31, 2020
@vukosim
Member

vukosim commented May 31, 2020

Just for context, for the district and sub-district data, there was a question of creating a consolidated file that has all of the entries in a standard format by the demarcation board. This might be a good place to pick this up. If we can automate that conversion to produce that single file, that could be a good thing. We can then phase out the use of any names and use key files to reconcile. Thoughts?

@vukosim vukosim added this to the Repo Cleanup and Enhacements milestone May 31, 2020
@shaze
Contributor

shaze commented May 31, 2020

Does anyone have an example of the demarcation file?

I agree with the proposals of @chadwpetersen, though we really want to discourage such changes. Sometimes they are forced on us. For example, Ekurhuleni was releasing Ekurhuleni East and North and then split them into East 1 and East 2.

@vukosim
Member

vukosim commented May 31, 2020

See the example of the Limpopo districts file. @JosephSefara has been using the demarcation names.

@vukosim
Member

vukosim commented May 31, 2020

Also, I forgot, @shaze: we have the 2018 Demarcation key in https://github.com/dsfsi/covid19za/blob/master/data/district_data/LM_2018.csv

@shaze
Contributor

shaze commented May 31, 2020

Great -- this is a coarser level than we are reporting. The main issues have been at the sub-municipality level. I've had a quick look at the Demarcation Board website -- I can't see a convenient spreadsheet at a lower level -- there are maps.

@JosephSefara
Contributor

JosephSefara commented Jun 1, 2020

Great -- this is a coarser level than we are reporting. The main issues have been at the sub-municipality level. I've had a quick look at the Demarcation Board website -- I can't see a convenient spreadsheet at a lower level -- there are maps.

@shaze Are you referring to a ward (a sub-municipality) or a subplace (like a suburb)?

@shaze
Contributor

shaze commented Jun 1, 2020

There are three levels of sub-municipality data that I have seen:
-- a suburb
-- a ward
-- a region, e.g. Ekurhuleni North 1

Regions are a collection of wards for sure. I think wards are generally collections of suburbs, but looking at the maps I have of my neighbourhood, I'm not sure that this is always followed.

@chadwpetersen
Collaborator Author

I think the demarcation key will help with standardising the district column names; that is a good idea. It might then be worthwhile to also standardise the case details part (total cases, recoveries, deaths). It seems that not all files have them defined in the same columns with the same column names.

@vukosim
Member

vukosim commented Jun 4, 2020

@heerden can you help with your inputs?

@vhschalk
Collaborator

vhschalk commented Jun 4, 2020

I am reviewing the current keys and will give my recommendation tomorrow for a flexible system.

@vhschalk
Collaborator

vhschalk commented Jun 6, 2020

My thoughts have settled on all the great suggestions in this thread.

I have a few practical steps we can start with, that will lead to a governable specification for data consistency for existing district data, future modifications and any districts we need to onboard.

The README in data/district_data should define everything we agree on here.

A combined key file should then reflect the titles and the level of districts. I will submit the first draft soon.

While the demarcation keys are a great starting point, I see that they do not always align with the names reported in each province's media releases. Where a demarcation equivalent is available, the combined key file will then also serve to convert existing titles to it.
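
To make the conversion idea concrete, here is a small Python sketch. The key file layout (`reported_title`, `demarcation_name`) and the placeholder names are assumptions for illustration only; the real combined key file may look quite different.

```python
import csv
import io

# Hypothetical key file contents; the column names and values are placeholders.
KEY_CSV = """reported_title,demarcation_name
Media Name A,Demarcation Name A
Media Name B,Demarcation Name B
"""


def load_title_map(key_file):
    """Read a key file and map each reported title to its demarcation name."""
    reader = csv.DictReader(key_file)
    return {row["reported_title"]: row["demarcation_name"] for row in reader}


def to_demarcation(title, title_map):
    """Return the demarcation equivalent, or the title unchanged if unknown."""
    return title_map.get(title, title)


title_map = load_title_map(io.StringIO(KEY_CSV))
print(to_demarcation("Media Name A", title_map))
print(to_demarcation("Unlisted Region", title_map))
```

Falling back to the original title keeps rows for regions that have no demarcation equivalent yet, which seems to match the "if it is available" caveat above.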

We should also list the data collection leads for every province, to keep everyone in the loop. It might be worthwhile to list all our stakeholders as well, so we can contact them directly if there are "breaking changes". Internally, our API and the notebooks that commit calculated data can be seen as stakeholders.

Automated checks (GitHub Actions) can also be added to validate submitted data against the combined key file. The province lead can then be notified that there is a breaking change that has not gone through the governance process.
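
A validation step of this kind could be as simple as the following Python sketch, which a CI job might run on each submitted file. The function name and the expected column list are hypothetical; in practice the expected columns would presumably come from the combined key file.

```python
import csv
import io


def check_header(data_file, expected_columns):
    """Compare a CSV header with the expected columns.

    Returns (missing, unexpected): expected columns absent from the file,
    and file columns that are not in the expected list.
    """
    header = next(csv.reader(data_file), [])
    missing = [c for c in expected_columns if c not in header]
    unexpected = [c for c in header if c not in expected_columns]
    return missing, unexpected


# Hypothetical submitted data and expected columns, for illustration only.
submitted = io.StringIO("date,total_cases,recoveries\n2020-06-06,1,0\n")
missing, unexpected = check_header(submitted, ["date", "total_cases", "deaths"])
if missing or unexpected:
    print(f"Header mismatch: missing={missing}, unexpected={unexpected}")
```

A check script could exit with a non-zero status on a mismatch so that the workflow run fails and the province lead is alerted.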

The goal is still to prevent any stakeholder's workflow from breaking. The only issue I see with versioning is that you will need to keep updating two sets of data files for a while until the old version is deprecated. If this is not an issue for the province leads, then we should keep this option open if there is no way to patch the data.

Appending data columns at the end of the data file might remain the best option, as we do not know how the stakeholder is reading the data. They might be using the column index number. We should thus encourage them to rather use keys.
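
The difference between reading by column index and reading by key can be shown with a short Python sketch. The CSV contents and column names here are made up for illustration; they are not taken from the repo's files.

```python
import csv
import io

# The same data before and after a column is inserted in the middle.
OLD = "date,total_cases\n2020-05-31,10\n"
NEW = "date,region,total_cases\n2020-05-31,North,10\n"


def total_by_index(text):
    """Read the second column of the first data row (fragile)."""
    rows = list(csv.reader(io.StringIO(text)))
    return rows[1][1]


def total_by_name(text):
    """Read the 'total_cases' column of the first data row by header name."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return rows[0]["total_cases"]


print(total_by_index(OLD), total_by_index(NEW))  # index-based result changes
print(total_by_name(OLD), total_by_name(NEW))    # key-based result is stable
```

Reading by key survives reordering and inserted columns, but it still breaks if a column is renamed, so the header names themselves would need the same guarantees.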

I am not going to suggest drastic changes to any existing data but will need to consider each province, case by case.

@vukosim
Member

vukosim commented Jun 8, 2020

Hey @chadwpetersen and @shaze please take a look at the pull request.

@vhschalk
Collaborator

@shaze you mentioned you have a new "recovery" column for the GP districts. You can add it to the key file, which is a data column key file.

@shaze
Contributor

shaze commented Jul 15, 2020

Yes, I will. First, though, I want to add "Deaths", which we have in the data but not in the keys.

5 participants