Personal data design considerations #31

Open
ellisonbg opened this issue Oct 3, 2019 · 6 comments

@ellisonbg
Collaborator

In reviewing some PRs/issues on telemetry, a couple of things have come up for me that I want to capture.

Personal data

Not all telemetry data is the same. The GDPR does a good job of describing personal data:

https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/key-definitions/what-is-personal-data/

I want to make sure that we are designing the system in a manner that 1) forces each schema to declare whether it collects personal data and 2) enables an operator to filter out personal data easily. An example:

Let's say there is a schema that records a user opening a notebook. The name of the notebook is personal data, but the mere act of opening the notebook is not (unless the event is tagged with a username). An operator shouldn't have to dive into the details of that schema and worry about removing the notebook names, but should instead be able to filter out the personal data with a single flag. Additionally, an operator should have a simple way of enabling or disabling the logging of usernames with events. If we don't make it easy for operators to reason about and configure these things, they will end up collecting personal data, even when they don't need or want to.
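To make the single-flag idea concrete, here is a minimal sketch. The `personal_data` schema key, the `allow_personal_data` flag, and the helper names are all hypothetical; nothing here is an existing schema convention or API.

```python
# Hypothetical schema: each property declares whether it holds personal data.
NOTEBOOK_OPENED_SCHEMA = {
    "title": "notebook-opened",
    "properties": {
        "notebook_name": {"type": "string", "personal_data": True},
        "username": {"type": "string", "personal_data": True},
        "kernel": {"type": "string", "personal_data": False},
    },
}


def schema_collects_personal_data(schema):
    """Return True if any property in the schema is marked as personal data."""
    return any(
        prop.get("personal_data", False)
        for prop in schema["properties"].values()
    )


def strip_personal_data(schema, event, allow_personal_data=False):
    """Return a copy of the event without personal properties unless allowed.

    Properties missing from the schema are treated as personal data, so
    undeclared fields are dropped rather than leaked.
    """
    if allow_personal_data:
        return dict(event)
    props = schema["properties"]
    return {
        key: value
        for key, value in event.items()
        if not props.get(key, {}).get("personal_data", True)
    }
```

With this shape, an operator would only ever touch `allow_personal_data`, and the notebook name and username would drop out of the event without anyone reading the schema in detail.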

Lawful basis

The GDPR is also helpful in describing a range of different "lawful bases" for processing personal data. Because Jupyter is deployed across a wide range of situations, we have to design a system that is quickly configurable for these different lawful bases:

https://gdpr-info.eu/art-6-gdpr/

Yes, sometimes the lawful basis is consent and we need to make it really easy for operators to get consent and inform the users. But we shouldn't overfit for that lawful basis in a manner that makes it difficult for other bases. Users still have rights (possibly different ones) under the other lawful bases, and we want to make sure users get those rights as needed. I don't want an operator to have to choose between offering full consent, or no protections at all.

I realize that GDPR isn't a universal code or international law. But it is nonetheless a good starting point for understanding the different questions.

@yuvipanda yuvipanda changed the title Possible gaps in system design Personal data design considerations Oct 3, 2019
@Zsailer
Member

Zsailer commented Oct 3, 2019

@ellisonbg, thank you for opening this discussion!

personal data

  1) forces each schema to declare whether it collects personal data and 2) enables an operator to filter out personal data easily.

I think #30 handles both of these criteria (ignoring the arbitrary choice of sensitivity level names for now).

  1. [MRG] Add multi-level security awareness to event formatter #30 requires each event schema to tag each property with a sensitivity level. If an event handler's sensitivity level is lower than that of a given property, that property will be filtered out of the emitted event. The event will still be emitted, but it won't contain the sensitive data pieces.

    Is your suggestion that we need to document that a schema contains sensitive pieces at the top level of each schema? Or am I misunderstanding?

  2. The operator sets a single attribute, the sensitivity level, on each logging handler, e.g.:

    handler.event_level = 'confidential'

    Is your main concern that this is too complicated—that this requires operators to look at the schemas in detail to know the appropriate sensitivity level?

lawful basis

we have to design a system that is quickly configurable for these different lawful bases

This is a great point!

I see two ways we could address this with the current design:

  1. We provide "recipes" in the documentation that configure the telemetry system for different lawful bases.
  2. We add a high-level flag to EventLog with pre-configured "recipes" that automatically configure the telemetry system for a given lawful basis.
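As a rough illustration of what a "recipe" lookup could look like, here is a minimal sketch. The `Recipe` fields, the `RECIPES` table, and `recipe_for` are hypothetical names, and the settings attached to each basis are placeholders for illustration only, not a statement of what either basis actually requires. The two basis names come from Article 6 of the GDPR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """Hypothetical bundle of telemetry settings for one lawful basis."""
    require_consent: bool
    notify_users: bool
    filter_personal_data: bool


# Placeholder values: an operator would need to decide the real settings.
RECIPES = {
    "consent": Recipe(
        require_consent=True, notify_users=True, filter_personal_data=False
    ),
    "legitimate_interests": Recipe(
        require_consent=False, notify_users=True, filter_personal_data=True
    ),
}


def recipe_for(lawful_basis):
    """Look up the recipe for a lawful basis, failing loudly if unknown."""
    try:
        return RECIPES[lawful_basis]
    except KeyError:
        raise ValueError(f"No recipe for lawful basis: {lawful_basis!r}") from None
```

The same table could back either option: printed in the documentation as a reference, or selected through a single high-level flag.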

@ellisonbg
Collaborator Author

Thanks, this helps. I can imagine two usage cases that you are getting at here:

  1. An operator wants to omit individual properties that are sensitive.
  2. An operator wants to omit entire events that have any sensitive property.

It sounds like your current approach is doing (1). Do you think it is viable to use the idea here and cover both usage cases - allow the operator to pick the sensitivity level and which "mode" they want? If they pick mode (2), filter out entire messages, rather than properties. Does this make sense?
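A minimal sketch of the two modes, assuming each property carries a sensitivity level and that levels are ordered from least to most sensitive. The mode names, the `apply_mode` helper, and the ordering of the two levels are assumptions made for illustration; only the level names "unclassified" and "confidential" appear in this thread.

```python
# Assumed ordering, least to most sensitive.
LEVELS = ["unclassified", "confidential"]


def apply_mode(event, property_levels, handler_level, mode="filter_properties"):
    """Apply one of two hypothetical filtering modes to an event.

    "filter_properties" drops only the properties above the handler's level.
    "drop_event" returns None if any property is above the handler's level.
    """
    max_rank = LEVELS.index(handler_level)
    too_sensitive = {
        key for key in event
        if LEVELS.index(property_levels[key]) > max_rank
    }
    if mode == "filter_properties":
        return {k: v for k, v in event.items() if k not in too_sensitive}
    if mode == "drop_event":
        return None if too_sensitive else dict(event)
    raise ValueError(f"Unknown mode: {mode!r}")
```

Both modes share the same per-property tagging, so mode (2) would not need any extra information in the schema.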

In terms of having templates for different lawful bases, I like that mental model – the operator configuration would boil down to:

  • What schemas do you want to collect and where do you want to send them?
  • What is your lawful basis (that determines consent, notification, disclosure, etc.)?
  • Do you want to filter out personal data?

@gclen

gclen commented Oct 7, 2019

This looks great! There was one thing I had a question about.

I'm imagining two sets of events: one at the "unclassified" level (which everyone could see) and another at the "confidential" level where access is restricted to a set of administrators. How will the permissions be set such that only the administrators can see the confidential information? More generally, how should permissions be controlled on the events output by the handlers?

@Zsailer
Member

Zsailer commented Oct 7, 2019

More generally, how should permissions be controlled on the events output by the handlers?

#30 addresses your point exactly. Operators can set the sensitivity level in each handler by setting an event_level attribute. See the example in the top comment:

For operators

It's simple to configure this option. Simply add an attribute to each handler:

import logging 

handler = logging.FileHandler('events.log')
handler.event_level = 'confidential'

In your example case, you can set up two handlers: one for administrators and one for everyone to see. The administrator event log includes confidential data by setting its event_level to 'confidential'.
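For illustration, here is a minimal sketch of the two-handler setup using only the standard library. The `event_level` attribute name comes from this thread; the `SensitivityFormatter` class, the `make_handler` helper, the level ordering, and the example property names are hypothetical and are not the implementation in #30. In-memory streams stand in for the log files.

```python
import io
import json
import logging

# Assumed ordering, least to most sensitive.
LEVELS = ["unclassified", "confidential"]


class SensitivityFormatter(logging.Formatter):
    """Hypothetical formatter that drops properties above a given level."""

    def __init__(self, event_level, property_levels):
        super().__init__()
        self.max_rank = LEVELS.index(event_level)
        self.property_levels = property_levels

    def format(self, record):
        event = record.msg  # the event is assumed to be logged as a dict
        visible = {
            key: value for key, value in event.items()
            if LEVELS.index(self.property_levels[key]) <= self.max_rank
        }
        return json.dumps(visible, sort_keys=True)


def make_handler(stream, event_level, property_levels):
    """Build a stream handler tagged with an event_level."""
    handler = logging.StreamHandler(stream)
    handler.event_level = event_level
    handler.setFormatter(SensitivityFormatter(event_level, property_levels))
    return handler


PROPERTY_LEVELS = {"action": "unclassified", "notebook_name": "confidential"}

public_stream, admin_stream = io.StringIO(), io.StringIO()
logger = logging.getLogger("telemetry_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(make_handler(public_stream, "unclassified", PROPERTY_LEVELS))
logger.addHandler(make_handler(admin_stream, "confidential", PROPERTY_LEVELS))

# One event, two views: the notebook name only reaches the admin stream.
logger.info({"action": "open", "notebook_name": "taxes.ipynb"})
```

Note that this only separates what each handler writes. Restricting who can read the administrator output (for example, through file permissions on the destination) is a separate concern that the sketch does not address.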

@Zsailer
Member

Zsailer commented Oct 7, 2019

Do you think it is viable to use the idea here and cover both usage cases - allow the operator to pick the sensitivity level and which "mode" they want

@ellisonbg Absolutely. If we merge #30, we can easily craft a PR for "modes". I think #30 provides finer-grained sensitivity control and "modes" offer higher-level sensitivity control.

In terms of having templates for different lawful bases, I like that mental model – the operator configuration would boil down to:

  • What schemas do you want to collect and where do you want to send them?
  • What is your lawful basis (that determines consent, notification, disclosure, etc.)?
  • Do you want to filter out personal data?

👍

@Zsailer
Member

Zsailer commented May 19, 2020

@ellisonbg, check out #46—I think that design is a good first step to handling personal data.
