Expunging Data #435

Open · alexjg opened this issue Sep 30, 2021 · 7 comments

alexjg (Contributor) commented Sep 30, 2021

I've been trying to imagine how to expunge data from an automerge document and had a few thoughts. Imagine a sort of automerge equivalent of a mailing list. There's a document at some well known location and anyone can post changes to that document. As with a mailing list, some users will occasionally post spam, or offensive material. Other honest users may continue to build changes on top of the document containing this unwanted data. Some time later a moderator comes online and notices the offensive material. The moderator doesn't want to continue hosting offensive material on their server, but they also don't want to lose the changes which have been made since the offensive material was posted. What can they do?

The most obvious step is to discard the change that the spammer made, along with any dependent changes. However, if many users have since made changes to the document, their documents will reference the offending change(s) as dependencies, so those changes would also have to be discarded. This is because automerge (in its current incarnation) does not do anything enormously clever when calculating the dependencies of a change: we just take the tips of the change graph that we know about and reference them as dependencies of any new change. This means that even if a new change didn't actually touch any data that the offending change touched, it would still have to be removed.
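To make that concrete, here's a rough sketch of how dependencies are currently chosen (the types are made up for illustration, not automerge's real internals): a new change references every head of the change graph, regardless of what its operations touch.

```typescript
// Sketch only: illustrative types, not automerge's actual data structures.
interface Change {
  hash: string;
  deps: string[]; // hashes of the changes this one depends on
  ops: unknown[];
}

// The "tips" (heads) are the changes that no other known change depends on.
function currentHeads(changes: Change[]): string[] {
  const referenced = new Set(changes.flatMap((c) => c.deps));
  return changes.map((c) => c.hash).filter((h) => !referenced.has(h));
}

function makeChange(changes: Change[], ops: unknown[]) {
  // deps are all current heads, even if `ops` never touch anything those
  // heads wrote, so expunging any head drags this new change down with it.
  return { deps: currentHeads(changes), ops };
}
```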

One way around this would be to make automerge cleverer when calculating dependencies. When we add a new change, we could examine the operations within that change and then examine the current change graph to find the tips of the last changes which touched the same objects. The new change we produce would reference those tips as dependencies instead of the tips of the entire change graph. I'm not sure about this solution. Firstly, I'm not even sure it would be valid, as I think we could end up with multiple changes from the same actor with the same startOp. Secondly, it might be a lot of work to figure out which tips to reference. Thirdly, a single change can potentially touch many objects, so we might still have to discard a lot of changes, even if only a small subset of the operations within them are problematic. Finally, it depends on every implementation doing this clever work in order to be able to usefully expunge data.
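For concreteness, the per-object bookkeeping might look something like this (a sketch with hypothetical names, assuming we maintain a map from object IDs to the latest changes touching each object):

```typescript
// Sketch only: `objectTips` maps each object ID to the hashes of the most
// recent changes that touched that object.
interface Op {
  objectId: string; // the object this operation modifies
}

function depsForChange(ops: Op[], objectTips: Map<string, string[]>): string[] {
  const deps = new Set<string>();
  for (const op of ops) {
    for (const tip of objectTips.get(op.objectId) ?? []) deps.add(tip);
  }
  // May omit heads that only touched unrelated objects: this is what lets
  // unrelated changes survive an expunge, but also what breaks the
  // "one linear sequence of changes per actor" assumption.
  return [...deps];
}
```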

An alternative approach would be to directly remove the change from the document. This would leave a hole in the change graph, which would initially manifest as an error about missing dependencies when processing a change that references the hole. To accommodate this we could maintain a data structure containing a list of expunged dependency hashes. This list could be encoded in the document encoding, passed out of band into an automerge implementation, or sent as a message in the sync protocol. Just knowing about the hole doesn't solve the problem though: there may now be operations in the document which reference missing operations, and I don't think there's any sensible way to interpret these operations other than to discard them and any dependents. To achieve this we would need to store not only the expunged dependency hashes, but also the actor ID, startOp, and number of operations in each expunged change. This does still leave a DoS attack open, where a malicious actor submits a large number of unwanted changes; expunging these changes then requires storing the hash of every expunged change, which could amount to a lot of storage. I'm not sure there's a good way around that.
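For illustration, one possible shape for that expunge list (field names are assumptions, not an existing automerge structure):

```typescript
// Sketch only: metadata needed to recognise both the missing dependency
// hash and any operations that referenced its operations.
interface ExpungedChange {
  hash: string;    // hash of the removed change, for dependency checking
  actorId: string; // actor that produced the removed change
  startOp: number; // counter of its first operation
  numOps: number;  // how many operations it contained
}

// An op ID `counter@actor` falls inside an expunged change if its counter
// lies in [startOp, startOp + numOps) for the same actor.
function referencesExpunged(
  actor: string,
  counter: number,
  expunged: ExpungedChange[]
): boolean {
  return expunged.some(
    (e) => e.actorId === actor && counter >= e.startOp && counter < e.startOp + e.numOps
  );
}
```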

Anyway, has anyone else had thoughts about this? Especially with regard to moderation and offensive content.

ept (Member) commented Oct 1, 2021

Good question. Here are some initial thoughts.

Option A, trying to be clever about dependencies: we'd have to be very careful in defining what causes a dependency to be introduced. For example, you have an object containing the set of mailing list messages; if change X posts a message and then change Y posts a message, you don't necessarily want Y to depend on X, since the messages may be in different threads. Maybe we could say that Y depends on X if Y overwrites a map key or list element that was previously written by X, or if Y modifies an object that was created by X, or if Y references a list element that was inserted by X.
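As a sketch, that rule might be expressed like this (the op fields are illustrative, not automerge's real operation format):

```typescript
// Sketch only: Y depends on X if Y overwrites something X wrote, modifies
// an object X created, or references a list element X inserted.
interface Op {
  obj: string;      // object being modified
  key?: string;     // map key or list element being written
  insert?: boolean; // true if this op inserts a new list element
  elemId?: string;  // list element this op references, if any
  makeObj?: string; // ID of an object this op creates, if any
}

function dependsOn(yOps: Op[], xOps: Op[]): boolean {
  const xWrites = new Set(
    xOps.filter((op) => op.key !== undefined).map((op) => `${op.obj}/${op.key}`)
  );
  const xCreated = new Set(xOps.flatMap((op) => (op.makeObj ? [op.makeObj] : [])));
  const xInserted = new Set(
    xOps.filter((op) => op.insert && op.key !== undefined).map((op) => op.key as string)
  );
  return yOps.some(
    (op) =>
      (op.key !== undefined && xWrites.has(`${op.obj}/${op.key}`)) || // overwrites X's write
      xCreated.has(op.obj) ||                                         // modifies object X created
      (op.elemId !== undefined && xInserted.has(op.elemId))           // references element X inserted
  );
}
```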

If we go with fine-grained dependencies, our current concept of actors doesn't make much sense any more. The definition of actors relies on the fact that a change always depends on previous changes by the same actor, i.e. that the changes by a particular actor form a linear sequence. This is no longer the case if we define dependencies on the basis of the objects they modify. startOp no longer being unique is a consequence of this. If we go with this option, I think we might have to get rid of actors entirely and rely only on the hash graph. This would carry a performance cost and make the file format more verbose, since hashes don't compress as well as actorIds do.

I would be hesitant to introduce features that impose a performance cost on all apps using Automerge, even if they do not need the expunging feature, but it might be possible to isolate the feature in a way that you only pay the price if you're actually using it.

Option B is to say that hash chaining is incompatible with expunging unwanted data from the change history, because hash chaining requires immutability, and here you really want mutability. We could allow hashes to be rewritten when data is expunged (à la Git rebase), or we could simply rely less on hashes for expressing dependencies.

The document data format makes it possible to simply leave out values that you no longer want, although a consequence is that you can then no longer reconstruct the original change history from the compressed document. Maybe that is fine: after all, most databases use a mutable model where they only store the current state of the database, but not the entire change history; they may retain all writes for a while in a WAL or a replication log, but that log eventually gets truncated and discarded. If we say that Automerge is essentially a replicated database, we could apply the same logic by removing the detailed change history from time to time.

If we take a compressed Automerge document as given, and don't try to recompute the change history that led to that state, then it doesn't matter that we can no longer compute the hashes of the original changes. We'd still keep a log of recent changes (just like a database WAL), but we wouldn't keep that log forever. For the purpose of syncing up two Automerge nodes, if they're only slightly out of sync (all changes that one node needs are in the other node's log), they could send each other recent changes, but if they have diverged a lot, they need to send each other their entire documents and merge them.
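A sketch of that sync decision, assuming each node keeps a bounded log of recent changes keyed by hash (all names here are hypothetical):

```typescript
// Sketch only: decide between shipping recent changes and falling back to
// a whole-document exchange.
interface Node {
  heads: string[];              // hashes of the current document heads
  log: Map<string, Uint8Array>; // recent changes, keyed by hash (like a WAL)
  document: Uint8Array;         // compressed whole-document encoding
}

function syncPayload(
  sender: Node,
  receiverHeads: string[]
): { kind: "changes"; changes: Uint8Array[] } | { kind: "document"; doc: Uint8Array } {
  // If every head the receiver reports is still in our log (or is one of
  // our own heads), the divergence is small enough to ship changes.
  const smallGap = receiverHeads.every(
    (h) => sender.log.has(h) || sender.heads.includes(h)
  );
  if (smallGap) {
    // A real protocol would walk the dependency graph; here we just send
    // everything in the log the receiver hasn't named.
    const changes = [...sender.log.entries()]
      .filter(([hash]) => !receiverHeads.includes(hash))
      .map(([, bytes]) => bytes);
    return { kind: "changes", changes };
  }
  // Diverged beyond the log: exchange entire documents and merge.
  return { kind: "document", doc: sender.document };
}
```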

We would still need some kind of tombstone mechanism to ensure that when two nodes send each other their entire documents, and merge them, they don't inadvertently resurrect data that was supposed to be deleted. But I suspect those tombstones could be quite lightweight, and not be a significant performance concern or DOS vector.

My current thinking is that option B is the more promising of the two options. I'd be interested to hear what you think.

alexjg (Contributor, Author) commented Oct 5, 2021

Option B seems more appealing to me for broadly similar reasons. When you say the document format makes it possible to "leave out values", does that mean just omitting the changes you no longer want from the compressed document? Does that in turn mean adding logic to ignore dependent operations (e.g. sequence insert operations which reference operations in the missing change)? Or do you mean to also strip out the dependent operations (and thus change the hashes of dependent changes, which I suppose is what you mean by relying less on hashes)? What does a tombstone look like in this scenario?

I'm also interested in how you work with this when you're not using the document format. If you imagine something like Secure Scuttlebutt, with each individual publishing a feed of signed changes which are merged at the edge, it would be nice for peers to be able to maintain a denylist for particular changes but still be able to reconstitute any changes which are not dependent on the blocked changes. To do this I think you would have to have logic in the implementation for skipping blocked changes and dependent operations as there is no way to strip out the operations from the changes.

ept (Member) commented Oct 5, 2021

The compressed document essentially contains the set of all operations that have been performed on the document, arranged in a particular order. If the value you want to remove was a string that was assigned to a key in a map via a set operation, you could either leave out that set operation entirely, or replace its value with null or empty string. If there is an entire object you want to remove, you could leave out all operations pertaining to that object. It's a bit trickier if you want to remove a list element, since subsequent list insertions may reference the ID of the list element you're removing — in that case, if you leave out one list element you'd also have to leave out all other list elements that depend on it; or you could keep the insertion op, but replace its value with null.
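A sketch of that filtering step over the document's operation set (the op shape is made up for illustration):

```typescript
// Sketch only: drop assignments to unwanted values, but keep list
// insertions with a nulled value so that later insertions which reference
// their element IDs stay valid.
interface DocOp {
  opId: string; // e.g. "12@actor"
  action: "set" | "insert" | "makeObject" | "del";
  value?: string | number | null;
}

function expungeOps(ops: DocOp[], unwanted: Set<string>): DocOp[] {
  return ops.flatMap((op) => {
    if (!unwanted.has(op.opId)) return [op];
    if (op.action === "insert") {
      // Keep the insertion so subsequent elements still have a valid
      // predecessor, but blank out the objectionable value.
      return [{ ...op, value: null }];
    }
    return []; // plain assignments can simply be left out
  });
}
```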

The current implementation will complain if you do this, since leaving out values or ops will mean that the original change history cannot be reconstructed. But we could just add a flag to tell the backend not to try to reconstruct the original changes.

If you're working with a log of signed changes, you don't really have the option of leaving out values that you don't want (because that would change all the hashes and invalidate all the signatures, like git rebase). In that case I guess you could maintain a list of change hashes you want to ignore. However, you'd still be storing the objectionable content in the log, even if it's not being displayed to users. Would that be acceptable? If you're taking this route, couldn't you simply add a normal change that deletes the values you don't want? That would also have the property of keeping the objectionable data in the log but making it invisible.
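For comparison, the "normal change" route looks something like this in the JS API (hedged from memory of the 1.x interface): the spam disappears from the materialised document, but its bytes remain in the change history.

```typescript
import * as Automerge from "automerge";

let doc = Automerge.from({ messages: ["hello", "unwanted spam"] });
doc = Automerge.change(doc, "remove spam", (d) => {
  // Tombstoned, not expunged: the original text is still recoverable from
  // the change log by anyone holding the full history.
  d.messages.splice(1, 1);
});
```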

alexjg (Contributor, Author) commented Oct 5, 2021

Filtering the document format makes sense. Tombstones here would basically be the operation IDs which you're ignoring then?

With the log of signed changes you could allow replacing the missing change with a placeholder which just has the hash of the missing change, the hashes of its dependencies, the actor ID, the startOp, and the number of operations that were in it. That way, later changes can tell if an operation depends on something in the missing change and skip that operation. In fact, the logic for this feels like it would be the same as the logic for filtering the document format, so maybe there's something we could reuse between the two scenarios?
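A sketch of how a consumer of the log might use such placeholders (structures are hypothetical, and it assumes the log is processed in causal order so skips propagate forward):

```typescript
// Sketch only: ops from expunged changes form a skip set; any op that
// references a skipped op ID is itself skipped, transitively.
type LogEntry =
  | {
      kind: "change";
      hash: string;
      deps: string[];
      actorId: string;
      startOp: number;
      ops: { ref?: string }[]; // ref: op ID this op depends on, if any
    }
  | {
      kind: "placeholder";
      hash: string;
      deps: string[];
      actorId: string;
      startOp: number;
      numOps: number;
    };

function applyLog(log: LogEntry[]): { ref?: string }[] {
  const skipped = new Set<string>(); // op IDs whose effects must be ignored
  for (const e of log) {
    if (e.kind === "placeholder") {
      for (let i = 0; i < e.numOps; i++) skipped.add(`${e.startOp + i}@${e.actorId}`);
    }
  }
  const applied: { ref?: string }[] = [];
  for (const e of log) {
    if (e.kind !== "change") continue;
    e.ops.forEach((op, i) => {
      const opId = `${e.startOp + i}@${e.actorId}`;
      if (op.ref !== undefined && skipped.has(op.ref)) {
        skipped.add(opId); // dependents of skipped ops are skipped too
        return;
      }
      applied.push(op);
    });
  }
  return applied;
}
```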

It occurs to me that this filtering logic would also be a way of making progress when you're simply unable to find a missing dependency, not just when you're deliberately removing one.

ept (Member) commented Oct 7, 2021

Tombstones here would basically be the operation IDs which you're ignoring then?

Yes.

a placeholder which just has the hash of the missing change, the hashes of its dependencies, the actor ID, the startOp and the number of operations that were in it

Yes, that sounds like it would work. However, there is a potential trust issue: say you replace a change with a placeholder and send me the placeholder. I then cannot recompute the hash, so I have to take your word for it that the fields in the placeholder are correct. If you send different people different versions of the placeholder, e.g. with different actorIds, you might be able to make their views of the CRDT state inconsistent, even though they have the same hash graph. In other words, this scheme does not ensure Byzantine Eventual Consistency.

Another potential issue: say I want to figure out what the original change was. If I have a guess as to what the removed operations might have been, I could search the possible operation values by brute force until I find one that matches the hash. This would allow changes with sufficiently low entropy to be "un-deleted". This might not be a problem if you don't care about confidentiality of deleted data.
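To illustrate, a minimal sketch of that brute-force search (using Node's crypto module; the serialisation of the change is hypothetical):

```typescript
import { createHash } from "crypto";

// Sketch only: if the deleted payload had low entropy (e.g. a yes/no vote),
// an attacker holding the change's metadata can recover it by enumeration.
function unDelete(
  knownHash: string,
  metadata: string,
  candidates: string[]
): string | null {
  for (const payload of candidates) {
    const h = createHash("sha256").update(metadata + payload).digest("hex");
    if (h === knownHash) return payload; // the "deleted" content, recovered
  }
  return null;
}
```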

Both of these issues could probably be addressed by using a cryptographic commitment scheme instead of a plain hash function. This would incur additional computational and storage costs though, so it depends on your threat model whether this is worthwhile. If you're interested I can try sketching out what that might look like.

alexjg (Contributor, Author) commented Oct 7, 2021

In the model I'm interested in, the nodes you're replicating from are untrusted, so the first issue would be a problem; the data is public, though, so I'm less concerned about the second issue.

I would be interested in a commitment scheme. Off the bat, I imagine that instead of publishing each signed change, each peer would publish a signed Merkle tree of the change plus its dependencies?

ept (Member) commented Oct 7, 2021

If you only care about the first issue, you probably don't need a commitment scheme; you could simply use a two-level hash like Hash(actorId + seq + deps + time + startOp + Hash(message + ops)). That way, message and ops (the only fields that can contain objectionable content) could be replaced by their hash, while the outer hash still checks the consistency of the metadata fields.
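A sketch of that two-level construction (Node's crypto module; the field serialisation is made up for illustration, the point is just that message and ops can later be replaced by their inner hash):

```typescript
import { createHash } from "crypto";

const sha256 = (data: string): string =>
  createHash("sha256").update(data).digest("hex");

interface ChangeMeta {
  actorId: string;
  seq: number;
  deps: string[];
  time: number;
  startOp: number;
}

// Serialisation is illustrative; a real encoding would need to be canonical.
const metaString = (m: ChangeMeta): string =>
  [m.actorId, m.seq, m.deps.join(","), m.time, m.startOp].join("|");

function changeHash(meta: ChangeMeta, message: string, ops: string): string {
  // The inner hash covers the only fields that can hold objectionable content.
  const inner = sha256(message + ops);
  return sha256(metaString(meta) + inner);
}

// To expunge, publish (meta, inner) instead of (meta, message, ops): the
// outer hash can still be verified against the untouched metadata.
function verifyPlaceholder(meta: ChangeMeta, inner: string, expected: string): boolean {
  return sha256(metaString(meta) + inner) === expected;
}
```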
