Personal data design considerations #31

ellisonbg · 2019-10-03T05:45:47Z

In reviewing some PRs/issues on telemetry, a couple of things have come up for me that I want to capture.

Personal data

Not all telemetry data is the same. The GDPR does a good job of describing personal data:

https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/key-definitions/what-is-personal-data/

I want to make sure that we are designing the system in a manner that 1) forces each schema to declare whether it collects personal data and 2) enables an operator to filter out personal data easily. An example:

Let's say there is a schema that records a user opening a notebook. The name of the notebook is personal data, but the mere act of opening the notebook is not (unless the event is tagged with a username). An operator shouldn't have to dive into the details of that schema and worry about removing the notebook names, but should instead be able to filter out the personal data with a single flag. Additionally, an operator should have a simple way of enabling or disabling the logging of usernames with events. If we don't make it easy for operators to reason about and configure these things, they will end up collecting personal data even when they don't need or want to.
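To make the notebook example concrete, here is a minimal sketch of what "declare it, then filter with one flag" could look like. Everything in it is an assumption for illustration: the `personal-data` key, the schema layout, and the helper names are hypothetical and are not part of any existing telemetry API.

```python
# Hypothetical schema: every property must say whether it is personal data.
# The "personal-data" key is an illustrative name, not an existing convention.
NOTEBOOK_OPENED_SCHEMA = {
    "title": "notebook-opened",
    "properties": {
        "notebook_name": {"type": "string", "personal-data": True},
        "username": {"type": "string", "personal-data": True},
        "timestamp": {"type": "string", "personal-data": False},
    },
}


def validate_declarations(schema):
    """Reject a schema if any property omits its personal-data declaration."""
    missing = [
        name
        for name, spec in schema["properties"].items()
        if "personal-data" not in spec
    ]
    if missing:
        raise ValueError(f"missing personal-data declaration: {missing}")


def strip_personal_data(schema, event, allow_personal_data=False):
    """Return a copy of the event; drop personal properties unless allowed."""
    if allow_personal_data:
        return dict(event)
    props = schema["properties"]
    # Properties unknown to the schema are treated as personal, to be safe.
    return {
        key: value
        for key, value in event.items()
        if not props.get(key, {}).get("personal-data", True)
    }
```

With `allow_personal_data=False`, an operator keeps the fact that a notebook was opened (the timestamp) without the notebook name or username, and never has to read the schema to get there.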

Lawful basis

The GDPR is also helpful in describing a range of different "lawful bases" for processing personal data. Because Jupyter is deployed across a wide range of situations, we have to design a system that is quickly configurable for these different lawful bases:

https://gdpr-info.eu/art-6-gdpr/

Yes, sometimes the lawful basis is consent, and we need to make it really easy for operators to get consent and inform the users. But we shouldn't overfit for that lawful basis in a manner that makes it difficult for other bases. Users still have rights (possibly different ones) under the other lawful bases, and we want to make sure users get those rights as needed. I don't want an operator to have to choose between offering full consent or no protections at all.

I realize that GDPR isn't a universal code or international law. But it is nonetheless a good starting point for understanding the different questions.

Zsailer · 2019-10-03T16:36:14Z

@ellisonbg, thank you for opening this discussion!

personal data

forces each schema to declare whether it collects personal data and 2) enables an operator to filter out personal data easily.

I think #30 handles both of these criteria (ignoring the arbitrary choice of sensitivity level names for now).

[MRG] Add multi-level security awareness to event formatter #30 requires each event schema to tag every property with a sensitivity level. If an event handler's sensitivity level is lower than a given property's, that property is filtered out of the emitted event. The event is still emitted, but without the sensitive data.

Is your suggestion that we need to document that a schema contains sensitive pieces at the top level of each schema? Or am I misunderstanding?
The operator sets a single attribute, the sensitivity level, on each logging handler, e.g.:
```
handler.event_level = 'confidential'
```
Is your main concern that this is too complicated—that this requires operators to look at the schemas in detail to know the appropriate sensitivity level?
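For readers who have not looked at #30, the filtering rule described above can be sketched roughly as follows. This is not the code from that PR: the ordering of the level names, the `property_levels` mapping, and the function name are assumptions made for illustration.

```python
# Assumed ordering, from least to most sensitive. The real level names in #30
# may differ; only 'unclassified' and 'confidential' appear in this thread.
LEVELS = ["unclassified", "confidential"]


def filter_by_level(event, property_levels, handler_level):
    """Keep only the properties whose level does not exceed the handler's."""
    limit = LEVELS.index(handler_level)
    return {
        key: value
        for key, value in event.items()
        if LEVELS.index(property_levels[key]) <= limit
    }


# Hypothetical per-property tags, as a schema might declare them.
property_levels = {"action": "unclassified", "notebook_name": "confidential"}
event = {"action": "open", "notebook_name": "taxes.ipynb"}

public_view = filter_by_level(event, property_levels, "unclassified")
admin_view = filter_by_level(event, property_levels, "confidential")
```

Under these assumptions the event is always emitted; only the set of properties it carries changes with the handler's level.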

lawful basis

we have to design a system that is quickly configurable for these different lawful bases

This is a great point!

I see two ways we could address this with the current design:

We provide "recipes" in the documentation that configure the telemetry system for different lawful bases.
We add a high level flag to EventLog with pre-configured "recipes" that automatically configure the telemetry system to follow some base.
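As a rough sketch of the second option, a "recipe" could be a named bundle of settings that a single high-level flag selects. The `TelemetryConfig` class, its fields, and the values in each recipe are all hypothetical placeholders chosen for illustration; they are not legal guidance and not an existing EventLog option.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TelemetryConfig:
    """Hypothetical bundle of settings a recipe would pre-configure."""
    require_consent: bool
    notify_users: bool
    collect_personal_data: bool


# Placeholder recipes keyed by lawful basis. The values are illustrative only.
RECIPES = {
    "consent": TelemetryConfig(
        require_consent=True, notify_users=True, collect_personal_data=True
    ),
    "legitimate_interests": TelemetryConfig(
        require_consent=False, notify_users=True, collect_personal_data=False
    ),
}


def config_for(lawful_basis, **overrides):
    """Look up a recipe and let the operator override individual settings."""
    try:
        base = RECIPES[lawful_basis]
    except KeyError:
        raise ValueError(f"no recipe for lawful basis {lawful_basis!r}") from None
    return replace(base, **overrides)
```

An operator would then write something like `config_for("consent")` and only reach for overrides when a deployment needs to deviate from the recipe.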

ellisonbg · 2019-10-04T03:28:06Z

Thanks, this helps. I can imagine two use cases that you are getting at here:

1. An operator wants to omit individual properties that are sensitive.
2. An operator wants to omit entire events that have any sensitive property.

It sounds like your current approach is doing (1). Do you think it is viable to use the idea here and cover both use cases: allow the operator to pick the sensitivity level and which "mode" they want? If they pick mode (2), filter out entire events rather than individual properties. Does this make sense?
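One way to picture the two modes side by side is a single filter function with a `mode` argument. This is only a sketch: the mode names, the level ordering, and the `property_levels` mapping are assumptions, not anything that exists in #30.

```python
# Assumed ordering, from least to most sensitive.
LEVELS = ["unclassified", "confidential"]


def apply_filter(event, property_levels, handler_level, mode="properties"):
    """Filter an event for a handler.

    mode="properties": drop only the properties above the handler's level.
    mode="events": drop the whole event (return None) if any property is
    above the handler's level.
    """
    limit = LEVELS.index(handler_level)
    allowed = {
        key: value
        for key, value in event.items()
        if LEVELS.index(property_levels[key]) <= limit
    }
    if mode == "properties":
        return allowed
    if mode == "events":
        return allowed if len(allowed) == len(event) else None
    raise ValueError(f"unknown mode: {mode!r}")
```

In this sketch a handler that receives `None` would simply skip emitting the event.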

In terms of having templates for different lawful bases, I like that mental model – the operator configuration would boil down to:

What schemas do you want to collect and where do you want to send them?
What is your lawful basis (that determines consent, notification, disclosure, etc.)?
Do you want to filter out personal data?

gclen · 2019-10-07T01:10:08Z

This looks great! There was one thing I had a question about.

I'm imagining two sets of events: one at the "unclassified" level (which everyone could see) and another at the "confidential" level where access is restricted to a set of administrators. How will the permissions be set such that only the administrators can see the confidential information? More generally, how should permissions be controlled on the events output by the handlers?

Zsailer · 2019-10-07T22:57:16Z

More generally, how should permissions be controlled on the events output by the handlers?

#30 addresses your point exactly. Operators can set the sensitivity level in each handler by setting an event_level attribute. See the example in the top comment:

For operators

It's simple to configure this option. Simply add an attribute to each handler:
```
import logging

handler = logging.FileHandler('events.log')
handler.event_level = 'confidential'
```

In your example case, you can set up two handlers: one for administrators and one for everyone to see. The administrator event log includes confidential data because its event_level is set to 'confidential'.
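A rough sketch of that two-handler setup, using only the standard `logging` module, might look like the following. The `LevelAwareFormatter` class, the `property_levels` field passed through `extra`, and the level ordering are assumptions for illustration and are not the implementation in #30. In-memory streams stand in for the two log files.

```python
import io
import json
import logging

# Assumed ordering, from least to most sensitive.
LEVELS = ["unclassified", "confidential"]


class LevelAwareFormatter(logging.Formatter):
    """Hypothetical formatter that drops properties above its event_level."""

    def __init__(self, event_level):
        super().__init__()
        self.event_level = event_level

    def format(self, record):
        limit = LEVELS.index(self.event_level)
        visible = {
            key: value
            for key, value in record.msg.items()
            if LEVELS.index(record.property_levels[key]) <= limit
        }
        return json.dumps(visible, sort_keys=True)


def make_handler(stream, event_level):
    """Build a handler whose output is limited to the given level."""
    handler = logging.StreamHandler(stream)
    handler.event_level = event_level
    handler.setFormatter(LevelAwareFormatter(event_level))
    return handler


public_stream, admin_stream = io.StringIO(), io.StringIO()
logger = logging.getLogger("telemetry-sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(make_handler(public_stream, "unclassified"))
logger.addHandler(make_handler(admin_stream, "confidential"))

# The same event goes to both handlers; each one formats its own view.
logger.info(
    {"action": "open", "notebook_name": "taxes.ipynb"},
    extra={
        "property_levels": {
            "action": "unclassified",
            "notebook_name": "confidential",
        }
    },
)
```

Who can read each destination is outside this sketch; restricting access to the administrator log would be left to whatever stores it, for example file permissions on the log file.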

Zsailer · 2019-10-07T23:13:01Z

Do you think it is viable to use the idea here and cover both usage cases - allow the operator to pick the sensitivity level and which "mode" they want

@ellisonbg Absolutely. If we merge #30, we can easily craft a PR for "modes". I think #30 provides finer grained sensitivity control and "modes" offer a higher level sensitivity control.

In terms of having templates for different lawful bases, I like that mental model – the operator configuration would boil down to:

What schemas do you want to collect and where do you want to send them?

What is your lawful basis (that determines consent, notification, disclosure, etc.)?

Do you want to filter out personal data?

👍

Zsailer · 2020-05-19T17:47:31Z

@ellisonbg, check out #46—I think that design is a good first step to handling personal data.

ellisonbg mentioned this issue Oct 3, 2019

[MRG] Add multi-level security awareness to event formatter #30

Closed

yuvipanda changed the title ~~Possible gaps in system design~~ Personal data design considerations Oct 3, 2019
