Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ghost facts #2978

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

guillaume-duong-bib
Copy link

@guillaume-duong-bib guillaume-duong-bib commented May 13, 2024

Description

Attempt to fix #2930 and close #2971. (update: as of 2024/06/14, it should fix them all)

This PR is an attempt to fix the 4 following issues:

  1. A fact that has been loaded into an operation from a fact source will remain in the fact_store even after being deleted from the fact source. Consequences:
  • When starting a new operation with the same fact source, the deleted fact will still be used.
  • Unnecessary clutter that may become noticeable over time, potential for unintended data leak (agreed that Caldera shouldn't be used in a way that exposes sensitive data, but still).
  1. A fact that has been loaded into an operation from a fact source will be duplicated in the fact_store when its value is updated in the fact source (very similar to the first item). Consequences:
  • When starting a new operation with the same fact source, the fact will have both the old and the new value (or more if it is updated several times), and there is no way to stop that apart from creating a new fact source or fiddling with the fact_store manually.
  • Unnecessary clutter that may become noticeable over time, potential for unintended data leak (agreed that Caldera shouldn't be used in a way that exposes sensitive data, but still).
  1. A fact that has been loaded into an operation from a fact source will remain in the fact_store even when the fact source is deleted. Consequences:
  • Unnecessary clutter that may become noticeable over time, potential for unintended data leak (agreed that Caldera shouldn't be used in a way that exposes sensitive data, but still).
  1. A fact that has been generated by an operation (from parsers) will remain in the fact_store even after the operation is deleted, seemingly never to be used again. Consequences:
  • Unnecessary clutter that may become noticeable over time, potential for unintended data leak (agreed that Caldera shouldn't be used in a way that exposes sensitive data, but still).

Status (last update 2024/06/14)

At this point, all of the items mentioned are fixed, i.e.:

  • When a fact is deleted from a Fact Source, it is no longer available to the future operations (item 1).
  • When a fact is updated within a Fact Source, only its updated value is available to the future operations (item 2).
  • Additionally, the facts artifacts are cleaned up when their source (a.k.a. Fact Source or Operation) is deleted (item 3 and 4).

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Item 1 & 2 and explanations

Situation: I created a Fact Source FactSource1 with 2 facts in it named Source1Fact1 & Source1Fact2, then loaded them in an operation with an adversary that contains a single ability. This ability is echo #{Source1Fact1}. There is nothing that uses Source1Fact2.

Here's the fact_store after running the operation then shutting down the server:

unique: Source1Fact1xyz, trait: Source1Fact1, value: xyz, created: 2024-06-14 09:27:55.013666+00:00, score: 0, source: a2a4ecb6-2529-4d78-adc6-ee78e1be212f, origin_type: OriginType.SEEDED, collected_by: []    
unique: Source1Fact2abc, trait: Source1Fact2, value: abc, created: 2024-06-14 09:36:35.491128+00:00, score: 0, source: a2a4ecb6-2529-4d78-adc6-ee78e1be212f, origin_type: OriginType.SEEDED, collected_by: []

Note that facts are loaded all the same whether they are used or not in the operation.

Now for the tests:

  • Item 1: Deleting a fact from a fact source sends a PUT to /api/v2/sources/xxx with the fact removed from the json, which makes sense. The Fact Source object is then updated, but here's the first issue: both facts, including the deleted one, are still in fact_store exactly the same. For now, this does not raise any visible problem. But start a new operation that uses FactSource1 as Fact Source and has an ability that uses Source1Fact1... And it somehow finds it, even though this fact does not exist in the Fact Source. The cause seems to be Operation.all_facts() with the line seeded_facts = await knowledge_svc_handle.get_facts(criteria=dict(source=self.source.id)). As we saw, the deleted fact remains in the fact_store, so this line finds every fact that has ever been loaded from this specific Fact Source.
    • We can solve this 2 ways: being more specific in the facts retrieval to avoid the deleted facts, or actually delete the facts from the internal store upon deletion from the Fact Source.
  • Item 2: it's the same cause as for item 1. Updating a fact's value in a Fact Source sends a PUT to /api/v2/sources/xxx, which updates the Fact Source but keeps previously loaded values the same in the internal store:
unique: Source1Fact1xyz, trait: Source1Fact1, value: xyz, created: 2024-06-14 09:27:55.013666+00:00, score: 0, source: a2a4ecb6-2529-4d78-adc6-ee78e1be212f, origin_type: OriginType.SEEDED, collected_by: []      
unique: Source1Fact2abc, trait: Source1Fact2, value: abc, created: 2024-06-14 09:36:35.491128+00:00, score: 0, source: a2a4ecb6-2529-4d78-adc6-ee78e1be212f, origin_type: OriginType.SEEDED, collected_by: []      
unique: Source1Fact2def, trait: Source1Fact2, value: def, created: 2024-06-14 10:10:33.464996+00:00, score: 0, source: a2a4ecb6-2529-4d78-adc6-ee78e1be212f, origin_type: OriginType.SEEDED, collected_by: []
  • Again, 2 solutions: find a way not to load facts artefacts, or update the internal fact store when a Fact Source is updated.

I tried quite a bit to modify the method that updates fact sources by trying to keep track of old configurations to be able to reflect changes on the internal fact store, but with not much of a satisfactory result. However, a much more straightforward way I found was, at the operation's start, to retrieve the facts directly from the Fact Source object itself instead of the fact_store / fact_ram (commits a253e6b & 3ad8849).

This solves the issue and should not have any side effect since it uses a new method without changing knowledge_svc.

Item 3 & 4

  • Item 3: Deletion of data related to deleted fact sources:
    • Create a fact source with 1 fact/value, 1 rule, 1 relationship
    • Load it into an operation
    • Stop the server and check that the fact and the relationship have been added to the fact_store (rules are unaffected in my tests)
    • Restart the server and Delete the fact source
    • Stop the server and check that the fact and relationship have been deleted from the fact_store
  • Item 4: Deletion of data related to deleted operations:
    • Create and launch an operation that generates facts and relationships
    • Stop the server and check that facts and relationships have been added to the fact_store
    • Restart the server and Delete the operation
    • Stop the server and check that the facts and relationships have been deleted from the fact_store

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

@elegantmoose
Copy link
Contributor

@guillaume-duong-bib Looks fine code wise. Manually testing now.

Question - did you find some easy way to look at the fact_store? For (3) I can confirm if the facts are present by downloading the operation report. But for (4) there is no operation report, so curious how you were getting at the fact_store without doing basically the entire Caldera file read method (with encryptor).

@guillaume-duong-bib
Copy link
Author

guillaume-duong-bib commented Jun 6, 2024

Question - did you find some easy way to look at the fact_store? For (3) I can confirm if the facts are present by downloading the operation report. But for (4) there is no operation report, so curious how you were getting at the fact_store without doing basically the entire Caldera file read method (with encryptor).

I have disabled encryption on my test server and just read the file as such (with the server shut down):

with open('data/fact_store','rb') as f:
   fact_store = pickle.loads(f.read())

Same for object_store.

@guillaume-duong-bib Looks fine code wise. Manually testing now.

Great to hear. What are your thoughts about (1) and (2)?
In my opinion, these are the items that require the most some fix, but they are also the ones whose fix may break something.

If you agree that these are bugs though, I will attempt to fix them once I get some free time.

@elegantmoose
Copy link
Contributor

elegantmoose commented Jun 7, 2024

@guillaume-duong-bib could you re pull in master.

Im getting weird mimetype errors now when trying to test your branch...hmm

@guillaume-duong-bib
Copy link
Author

@elegantmoose synchronisation done.

I haven't seen these errors, where/when do you get them? I will check again later this week, most likely on Wednesday.

@guillaume-duong-bib
Copy link
Author

Item 1 & 2 are now fixed, see the updated Status and Tests at the top.
I didn't encounter any of the errors you mentioned while looking for a fix then testing it @elegantmoose.

@guillaume-duong-bib guillaume-duong-bib changed the title [WIP] Fix ghost facts Fix ghost facts Jun 14, 2024
Copy link
Contributor

@elegantmoose elegantmoose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im still testing, but this code LGTM, and a best possible, least effort solution for the moment. (The alternative being extensive changes to data service and knowledge service as crux of problem is that we have 2 services managing the same data)

seeded_facts = []
if self.source:
seeded_facts = await knowledge_svc_handle.get_facts(criteria=dict(source=self.source.id))
seeded_facts = await data_svc_handle.get_facts_from_source(self.source.id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clenk How do you feel about this?

Im given pause as all_facts() is called a lot in code base. However, this change seems to be equivalent as only getting facts for the given source.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from fixing the issues mentioned, what does change is that modifying the base fact source will directly be reflected on the seeded_facts part, even during an operation. But is that an expected behavior, or is a simple doc update to warn users enough?
I can't think of/have not found anything else that could be troublesome.

@elegantmoose elegantmoose requested a review from clenk June 18, 2024 16:54
@elegantmoose
Copy link
Contributor

Tested all cases locally. Would like @clenk to take a quick look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fact_store's lifecycle? Updating fact sources' facts generates ghost facts during operations
2 participants