[Feature] Process consideration to keeping consistent data formatting #390

chadwpetersen · 2020-05-31T18:27:55Z

Is your feature request related to a problem? Please describe.

CSV file header changes can cause issues for services that rely on the CSV formats provided by this awesome repo. They can break downstream services that expect data in a specific format, i.e. data at a particular column index and maybe even under particular column header names.

To help give the community some data guarantees, it would be great to keep these changes to a minimum where possible, and at least manage them with some lean process. :)

Describe the solution you'd like

We could first consider making column changes to a file append-only. So if you want to add a new column, consider appending it after the existing columns; that way it does not break any column indexes others currently rely on.
We could also consider adding some sort of version suffix to the file name when we want to introduce a backwards-incompatible change, i.e. reordering or renaming columns. The new file might get a _v2.csv suffix. That way people get time to upgrade to the new file before we deprecate the old one.
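
A minimal sketch of how these two ideas could look in practice, in Python. The function names, column names and the file name below are hypothetical and not taken from the repo; the check only covers the "append-only" rule as described above.

```python
from pathlib import Path


def is_append_only(old_header, new_header):
    """True if every old column keeps its position and new columns only come after them."""
    return new_header[:len(old_header)] == old_header


def versioned_name(file_name, version):
    """Build a versioned file name, e.g. 'example.csv' -> 'example_v2.csv'."""
    p = Path(file_name)
    return p.with_name(f"{p.stem}_v{version}{p.suffix}").name


# Hypothetical headers, for illustration only.
old = ["date", "district", "cases"]
appended = ["date", "district", "cases", "deaths"]   # compatible change
reordered = ["date", "cases", "district"]            # breaking change

print(is_append_only(old, appended))
print(is_append_only(old, reordered))
print(versioned_name("district_example.csv", 2))
```

A change that fails `is_append_only` would be a candidate for a new versioned file rather than an in-place edit.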

Describe alternatives you've considered

We could add a note to the README that these file formats can change at any time, as they are not yet stable.
We could also consider having a means of agreeing on a format before it is used, e.g. a community vote (but this might be a bit too much, I think).

Additional context

The reason I ask is that these types of changes should be considered backwards incompatible: downstream services that rely on these files can break if they expect the values to be in certain columns with certain headings, which is not a great experience when things like this change.

I have experienced a few breakages related to the district data files and was hoping a simple, lean process could be considered when maintaining these data files so as to give the community some data structure guarantees. :)


vukosim · 2020-05-31T19:51:03Z

Just for context: for the district and sub-district data, there was a question of creating a consolidated file that has all of the entries in the standard format used by the demarcation board. This might be a good place to pick this up. If we can automate that conversion to produce a single file, that could be a good thing. We can then phase out the use of any names and use key files to reconcile. Thoughts?

shaze · 2020-05-31T20:06:30Z

Does anyone have an example of the demarcation file?

I agree with the proposals of @chadwpetersen, though we really want to discourage breaking changes. Sometimes they are forced on us. For example, Ekurhuleni was releasing Ekurhuleni East and North and then split them into East 1 and East 2.

vukosim · 2020-05-31T20:20:29Z

See the example of the Limpopo districts file. @JosephSefara has been using the demarcation names.

vukosim · 2020-05-31T20:21:34Z

Also forgot @shaze we have the 2018 Demarcation key in https://github.com/dsfsi/covid19za/blob/master/data/district_data/LM_2018.csv

shaze · 2020-05-31T20:32:05Z

Great -- this is a coarser level than we are reporting. The main issues have been at sub-municipality level. I've had a quick look at the Demarcation Board web site -- can't see a convenient spreadsheet at a lower level -- there are maps.

JosephSefara · 2020-06-01T07:04:10Z

> Great -- this is a coarser level than we are reporting. The main issues have been at sub-municipality level. I've had a quick look at the Demarcation Board web site -- can't see a convenient spreadsheet at a lower level -- there are maps.

@shaze Are you referring to a ward (which is a sub-municipality unit) or a subplace (like a suburb)?

shaze · 2020-06-01T11:47:23Z

There are three levels of sub-municipality data that I have seen:
-- a suburb
-- a ward
-- a region, e.g. Ekurhuleni North 1

Regions are a collection of wards for sure. I think wards are generally collections of suburbs, but looking at the maps I have of my neighbourhood, I'm not sure that this is always followed.

chadwpetersen · 2020-06-02T08:57:23Z

I think the demarcation key will help with standardising the district column names; that is a good idea. It might then be worthwhile to standardise the case details part as well (total cases, recoveries, deaths). It seems that not all files define them in the same columns with the same column names.

vukosim · 2020-06-04T06:01:16Z

@heerden can you help with your inputs?

vhschalk · 2020-06-04T15:31:06Z

I am reviewing the current keys and will give my recommendation tomorrow for a flexible system.

vhschalk · 2020-06-06T10:01:38Z

My thoughts have settled on all the great suggestions in this thread.

I have a few practical steps we can start with, that will lead to a governable specification for data consistency for existing district data, future modifications and any districts we need to onboard.

The README in data/district_data should define everything we agree on here.

A combined key file should then reflect the titles and the level of districts. I will submit the first draft soon.

While the demarcation keys are a great starting point, I see how they do not always align with the reported media releases for each province. Where a demarcation equivalent is available, the combined key file will then also serve as a conversion from existing titles to that equivalent.
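
As a rough illustration of the conversion idea, a lookup from a reported title to its demarcation equivalent could look like the sketch below. The column names `reported_title` and `demarcation_name` and all sample values are made up for this example; they are not the layout of the actual key file.

```python
import csv
import io

# Made-up key file content; the real combined key file may look different.
KEY_CSV = """reported_title,demarcation_name
Example North,Example Metro
Example East 1,Example Metro
"""


def load_key(text):
    """Read the key file into a dict of reported title -> demarcation name."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["reported_title"]: row["demarcation_name"] for row in reader}


def to_demarcation(title, key):
    """Return the demarcation name, or None when the title is not in the key file."""
    return key.get(title)


key = load_key(KEY_CSV)
print(to_demarcation("Example East 1", key))
print(to_demarcation("Unknown Region", key))
```

Returning `None` for unknown titles keeps gaps visible, so a title that has no demarcation equivalent is flagged instead of being silently guessed.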

We should also list the data collection leads for every province, to keep everyone in the loop. It might be a worthwhile task to list all our stakeholders as well, so we can contact them directly if there are "breaking changes". Internally, our API and the notebooks that commit calculated data can be seen as stakeholders.

Automated checks (GitHub Actions) can also be added to validate submitted data against the combined key file. The province lead can then be notified that there is a breaking change that has not gone through the governance process.

The goal is still to prevent any stakeholder's workflow from breaking. The only issue I see with versioning is that you will need to keep updating two sets of data files for a while until the old version is deprecated. If this is not an issue for the province leads, then we should keep this option open for cases where there is no way to patch the data.
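
One possible shape for such a check, as a sketch only: a small Python script that a CI job could run and that returns a non-zero exit code when a data file has columns missing from the key file. The function names and the sample columns are assumptions for illustration, and how the key columns are loaded from the real key file is left out.

```python
import csv
import io


def unknown_columns(data_text, key_columns):
    """Return the data-file header columns that are not listed in the key file."""
    header = next(csv.reader(io.StringIO(data_text)), [])
    return [col for col in header if col not in key_columns]


def check(data_text, key_columns):
    """Return an exit code: 0 when every column is known, 1 otherwise."""
    extra = unknown_columns(data_text, key_columns)
    for col in extra:
        print(f"column not in key file: {col}")
    return 1 if extra else 0


# Hypothetical data and key columns, for illustration only.
sample = "date,district,cases,deaths\n2020-06-06,Example North,10,1\n"
print(check(sample, {"date", "district", "cases"}))
```

In a workflow, the script would end with `sys.exit(check(...))` so that the job fails and the province lead can be notified.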

Appending new data columns after the existing ones might remain the best option, as we do not know how each stakeholder is reading the data. They might be using the column index number. We should thus encourage them to use column keys (header names) instead.
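
To show why reading by key is more robust than reading by index, here is a small sketch using Python's standard `csv` module. The two sample files and their column names are invented for the example.

```python
import csv
import io

# Invented sample data: the same row before and after a column reorder.
OLD = "date,district,cases\n2020-05-31,Example North,10\n"
NEW = "date,cases,district\n2020-05-31,10,Example North\n"


def cases_by_index(text):
    """Read the third column by position; breaks when columns are reordered."""
    rows = list(csv.reader(io.StringIO(text)))
    return [row[2] for row in rows[1:]]


def cases_by_key(text):
    """Read the 'cases' column by header name; survives reordering and appended columns."""
    return [row["cases"] for row in csv.DictReader(io.StringIO(text))]


print(cases_by_index(OLD), cases_by_index(NEW))
print(cases_by_key(OLD), cases_by_key(NEW))
```

Reading by key still breaks if a column is renamed, which is why renames would need the versioning route discussed earlier in the thread.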

I am not going to suggest drastic changes to any existing data but will need to consider each province, case by case.

vukosim · 2020-06-08T08:16:47Z

Hey @chadwpetersen and @shaze please take a look at the pull request.

vhschalk · 2020-07-14T16:28:12Z

@shaze you mentioned you have a new "recovery" column for the GP districts. You can add it to the key file, which is a data column key file.

shaze · 2020-07-15T16:12:41Z

Yes, I will. First, though, I want to add "Deaths", which we have in the data but not in the keys.

chadwpetersen added the enhancement New feature or request label May 31, 2020

vukosim assigned vukosim and chadwpetersen May 31, 2020

vukosim added this to the Repo Cleanup and Enhacements milestone May 31, 2020

vukosim assigned JosephSefara, chadwpetersen and shaze and unassigned vukosim and chadwpetersen Jun 1, 2020

vhschalk mentioned this issue Jun 7, 2020

First draft of combined district key conversion csv #424

Merged

vhschalk mentioned this issue Jun 28, 2020

Consistent data formatting #505

Merged
