
Staging #2041

Merged: 44 commits merged into master on May 20, 2024

Conversation

doomedraven (Collaborator)

No description provided.

Tommy Beadle and others added 21 commits April 1, 2024 16:38
…odule import.

Calling Database() just returns a proxy to a real database object that
may or may not exist when the Database object is created. That's fine as
long as no attributes/methods are accessed until the init_database
function has been called, which must be called only once during the
program's startup. For testing purposes, the reset_database function can
be called before calling init_database again. This is useful to create a
fresh in-memory database for the test. Note that the Database objects
that were created in other modules at module-level scope will still be
valid and calling methods on them will be proxied to the new database
instance.

load_vms_tags() and load_vms_exists() have been modified so that they
are not called during import of web_utils; instead they are called where
needed, and the database is only contacted the first time, with the
resulting value cached. No database queries should be made simply as a
result of importing modules!
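
A minimal sketch of the lazy, cached pattern described above; query_vms() is a hypothetical stand-in for the real database call:

```python
import functools

def query_vms() -> bool:
    """Hypothetical helper that performs the actual database query."""
    ...

@functools.lru_cache(maxsize=None)
def load_vms_exists() -> bool:
    # Nothing runs at import time; the database is contacted only on the
    # first call, and the cached result is returned to later callers.
    return query_vms()
```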
Use a scoped_session to make it easy to always get the current database
session, with one session per thread. In the scheduler and web requests,
wrap the unit of work (a main-loop iteration in the scheduler and a
request/response in the Django app) in a transaction so that all of it
or none of it will be committed. This also removes the need for the
classlock decorator around all of the Database methods, as well as the
need for handling commit/rollback behavior in each of those methods.
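
Roughly, the session/transaction model being described; the engine URL and the work done inside the block are assumptions:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("sqlite://")  # placeholder URL
Session = scoped_session(sessionmaker(bind=engine))  # one session per thread

def run_unit_of_work():
    # One scheduler iteration or one request/response: everything inside
    # commits together on success, or rolls back together on error.
    with Session.begin():
        session = Session()
        ...  # all Database method calls share this transaction
```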

This does make us more aware of when the database is being used, since
we must explicitly start a transaction. In sqlalchemy 2.0+, we can add
`autobegin=False` to the sessionmaker arguments, which will make it
easier to see when database access is being requested (for example, by
accessing an attribute of an object that is "expired", i.e. used outside
of the transaction it was retrieved in). Without this knowledge, we
could be making unnecessary calls to the database and not know it.

This type of transaction handling also makes it easier to do things like
find and reserve a machine. Improvements to this will be coming in a PR
soon, but for now we can at least use `with_for_update()` to lock the
row of the machine that is found; since that single query is not its own
transaction, the row stays locked for the remainder of the enclosing
transaction and we are guaranteed that no other thread will grab it.
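
For illustration, the locking idea might look like this inside the enclosing transaction; the Machine model and the locked filter are assumptions here:

```python
# session is the thread-local session from the sketch above.
machine = (
    session.query(Machine)      # assumed model
    .filter_by(locked=False)    # assumed column
    .with_for_update()          # row lock held until commit/rollback
    .first()
)
# No other thread can lock this machine until the transaction ends.
```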
- Use fixtures appropriately (i.e. be more pytest-y); see the fixture
  sketch after this list.
- Honor the updated transaction handling. This generally consists of
  wrapping the API call in a transaction as it would be during a web
  request or an iteration of the scheduler. Allow that transaction to be
  committed to make sure rows are written to the database correctly. Use
  another transaction to do validation that the database was updated as
  expected. We need to wrap those in a transaction so that it doesn't
  create an implicit one upon the first database command sent to it.
- Enable engine.echo so that, upon test failure or if `-s` is supplied
  to `pytest`, the SQL statements issued to the database are output.
  This enables manual verification that it's issuing the commands we
  expect it to.
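
A minimal sketch of such a fixture, assuming reset_database()/init_database() (described above) take no arguments here:

```python
import pytest
from lib.cuckoo.core.database import Database, init_database, reset_database

@pytest.fixture
def db():
    # Fresh in-memory database per test; module-level Database() objects
    # elsewhere keep proxying to the new instance.
    reset_database()
    init_database()
    return Database()
```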
It is, so far, unchanged except for things required to make it work from
a file separate from scheduler.py.
- Improve the handling of tasks, machines, and how they are updated in
  the database. Use transactions and locks appropriately so that changes
  are more atomic.
- Remove the need for holding the machine_lock for long periods of time
  in the main loop and remove the need for batch scheduling tasks.
- Make the code a lot cleaner and more readable, including separation of concerns
  among various classes. Introduce a MachineryManager class that does
  what the name suggests. In the future, this could have an API added
  that could provide us a way to dynamically update machines in the
  database without having to update a conf file and restart cuckoo.py.
Also rework how Config caching is invalidated to make it work better for
tests.
More tests still need to be added for the Scheduler and AnalysisManager,
but this at least gets us a little more testing than what was being done
before.
…sk if the only machines with that tag are "reserved."
…arent process after forking.

Use `engine.dispose` as described in
https://docs.sqlalchemy.org/en/14/core/pooling.html#using-connection-pools-with-multiprocessing-or-os-fork.
The benefit of this approach is that connection pools can still be used
in the child processes.
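
The pattern from those docs looks roughly like this; the pool size, URL, and worker body are assumptions, and dispose(close=False) requires a recent 1.4 release:

```python
from multiprocessing import Pool
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # placeholder URL

def initializer():
    # In each child: discard the connections inherited from the parent
    # without closing them (the parent still owns them); the child then
    # builds its own pool on first use.
    engine.dispose(close=False)

def run_task(item):
    with engine.connect() as conn:
        ...  # queries here use the child's own, fresh pool

if __name__ == "__main__":
    with Pool(4, initializer=initializer) as pool:
        pool.map(run_task, range(10))
```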
- Use the correct logging level.
- Update the aux modules that are available.
- Use a data file that actually exists.
@cccs-mog (Contributor) commented Apr 5, 2024

So far:

The scaling semaphore is not working with a dynamic number of machines: the machinery options contain no machines parameter when an autoscaling/dynamic number of machines is used.

Getting a warning: WARNING: Rooter returned error: inetsim_disable() missing 4 required positional arguments: 'inetsim_ip', 'dns_port', 'resultserver_port', and 'ports'

I was not yet able to test the rest because, for unknown reasons, the database does not seem to be initialized correctly, i.e. it gets stuck at getattr for every method call. It seems incomplete to me; shouldn't the function's parameters be passed down as well, in parentheses, like getattr(_DATABASE, attr)(params)?

Am I missing something?

@doomedraven (Collaborator, Author)

I looked a bit into inetsim but didn't see what is wrong; once I have some spare time again, I will check that.

@tbeadle (Contributor) commented Apr 8, 2024

> So far: The scaling semaphore is not working with a dynamic number of machines: the machinery options contain no machines parameter when an autoscaling/dynamic number of machines is used.

Could you please explain a little more about this? I don't quite understand.

> I was not yet able to test the rest because, for unknown reasons, the database does not seem to be initialized correctly, i.e. it gets stuck at getattr for every method call. It seems incomplete to me; shouldn't the function's parameters be passed down as well, in parentheses, like getattr(_DATABASE, attr)(params)?
>
> Am I missing something?

Do you have a traceback or any other information that you can give? Database.__getattr__ should not be passing any other parameters. It is only proxying the getattr call to the _Database object. So if you do this:

```python
db = Database()
db.some_func(param1)
```

accessing db.some_func will call the __getattr__ function, which will get that method from the _Database class and then pass param1 to it.
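
For reference, a minimal sketch of the proxy pattern being described; only the Database/_Database/init_database names come from the PR, and the bodies are illustrative:

```python
class _Database:
    ...  # the real implementation, created once at startup

_DATABASE = None  # set by init_database()

def init_database():
    global _DATABASE
    _DATABASE = _Database()

class Database:
    def __getattr__(self, attr):
        # Only the attribute lookup is proxied; the caller supplies the
        # arguments, so db.some_func(param1) resolves the method here
        # and then calls it with param1.
        return getattr(_DATABASE, attr)
```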

```python
if machines_limit:
    # The ScalingBoundedSemaphore is used to keep feeding available machines from the pending tasks queue
    log.info("upper limit for ScalingBoundedSemaphore = %d", machines_limit)
    retval = ScalingBoundedSemaphore(value=len(machinery_opts["machines"]), upper_limit=machines_limit)
```

Using the machines parameter from the machinery config won't yield any result in a dynamic scenario like AZ or AWS, as it is empty.

"""Set machine manager options.
@param options: machine manager options dict.
"""
self.options = options
mmanager_opts = self.options.get(self.module_name)
if not isinstance(mmanager_opts["machines"], list):
mmanager_opts["machines"] = str(mmanager_opts["machines"]).strip().split(",")

With dynamic machinery like AZ and AWS this is not set, hence the problem parsing it here.


Would you mind trying out this patch?
machine_lock.patch.txt


Seems to initialize fine, thanks for the patch!


DEBUG: # Active analysis: 1; # Pending Tasks: 0; # Specific Pending Tasks: {}; # Available Machines: 10; # Available Specific Machines: {'win10x64': 9, 'windows': 9, 'ub20x64': 1, 'linux': 1}; # Locked Machines: 1; # Specific Locked Machines: {'win10x64': 1, 'windows': 1}; # Total Machines: 11; # Lock value: 2/2

It seems the lock value doesn't correctly follow the number of available machines. I guess this is a work in progress because of this:

```python
# Here be dragons! Making these calls on the ScalingBoundedSemaphore is not
# thread safe.
```


Could you try adding some debug logging in update_limit() and check_for_starvation() to make sure that they are getting called every 10 seconds (or whatever you have scaling_semaphore_update_timer set to) and make sure they're getting called with accurate parameter values?


So the thread in question, which invokes update_limit() and check_for_starvation(), is just never started (it's missing a start()); this is the source of the problem!

@tbeadle commented Apr 10, 2024

I've created #2055 to start the thread. Thanks for noticing that!
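
For illustration only (the real fix is in #2055): the monitor thread needs an explicit start() call, roughly like this, with stubs standing in for the real methods:

```python
import threading
import time

def update_limit():  # stub for the real semaphore method named above
    ...

def check_for_starvation():  # stub for the real semaphore method
    ...

def monitor_loop():
    while True:
        update_limit()
        check_for_starvation()
        time.sleep(10)  # scaling_semaphore_update_timer

thread = threading.Thread(target=monitor_loop, daemon=True)
thread.start()  # the call that was missing, so the loop never ran
```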

@cccs-mog (Contributor) commented Apr 8, 2024

Yes, you are right regarding the Database class and getattr; it seems my calls were not being processed correctly because of Rooter returned error: inetsim_disable() missing 4 required positional arguments: 'inetsim_ip', 'dns_port', 'resultserver_port', and 'ports', which is now fixed. I will continue testing, but so far it works great, minus small details like the one flagged below!

@cccs-mog (Contributor) commented Apr 8, 2024

More updates: I am also facing an issue where the number of pending tasks stays at 0, or only very few tasks are processed/active at a time. Let me know if you need more information to investigate, and whether this is something specific to AZ/dynamic machinery; the database is full of tasks. It also seems that the submit.py utility doesn't work anymore. I will investigate it in more detail.

Tommy Beadle and others added 2 commits April 8, 2024 14:22
For scaling machinery, we can't set up the scaling semaphore until after
the machinery has been initialized, since that's when the database is
synchronized with the state of the cloud infrastructure and we need to
use that to determine the limit of the semaphore.
@tbeadle (Contributor) commented Apr 16, 2024

@cccs-mog

> Restarting cape is not cleaning up correctly; the process is not fully killed, so the port is still in use.

Which port are you referring to? How are you attempting to stop cape?

> Restarting cape on AZ gives an error because I think it doesn't close the vmss(s) correctly.

Can you give more details about the error you're seeing (e.g. a traceback)? I don't use azure at all. I've looked at the code again and everything seems sane, but I must be missing something.

> Still having issues with submitting tasks

Earlier you said the database is full of tasks, but the pending count is 0. What state are those tasks in?

What exactly happens when you try to use submit.py?

Thanks! Unfortunately, I won't be able to respond until after Sunday, though.

@doomedraven (Collaborator, Author)

Hello guys, I'm not sure how this is relevant to this branch, but I see that it doesn't work on master either: I had pending 'static' tasks that weren't processed, and it wasn't moving on to tasks that need a VM. I'm looking into the code to spot where we changed that.

@cccs-mog (Contributor) commented Apr 17, 2024

1. I am talking about port 2042, and I am killing the service via systemctl. From the process viewer it seems that some threads are still running.

2. Regarding azure, the only trace or hint of the problem is [root] CRITICAL: CuckooCriticalError: No machines available, which is solved by creating a new vmss for Azure; note that I am using reset_pool_size=true. This behavior was not there consistently before trying this PR, which is why it is mentioned. I am trying to debug the issue, but it might take some time.

```
File "CAPEv2/cuckoo.py", line 143, in <module>
    cuckoo_main(max_analysis_count=args.max_analysis_count)
  File "CAPEv2/cuckoo.py", line 104, in cuckoo_main
    sched.start()
  File "CAPEv2/lib/cuckoo/core/scheduler.py", line 289, in start
    self.machinery_manager.initialize_machinery()
  File "CAPEv2/lib/cuckoo/core/machinery_manager.py", line 241, in initialize_machinery
    raise CuckooCriticalError("No machines available")
lib.cuckoo.common.exceptions.CuckooCriticalError: No machines available
```

3. The status is "reported", which I think is a new status? The tasks get added to the DB and I get the tasks successfully created, but the queue is still 0. The tasks in the DB are in the "reported" state.

@doomedraven (Collaborator, Author)

I see a general logic issue here and in master with pending tasks of any non-file category, like static or pcap. If relevant_machine_is_available = self.db.is_relevant_machine_available(task) is not satisfied, the task won't be processed, but if the task is static or pcap it also won't move on to other tasks. I'm trying to spot where the code changed; I will probably add a quick check of the category. There is category_checks, but it runs too late and is never really reached:

```python
if self.task.category in ("file", "pcap", "static"):
```

@tbeadle (Contributor) commented Apr 17, 2024 via email

@doomedraven (Collaborator, Author)

If you're asking about my issues: yes, for sure, but that one affects master, so I've put a temp fix in master for now. Since I will be off for two weeks, I will try to find the proper place to put that in staging, but I can't promise anything.

@doomedraven (Collaborator, Author) commented Apr 19, 2024

OK, I see it is solved here:

```python
if task_candidate.category not in CATEGORIES_NEEDING_VM:
    # This task can definitely be processed because it doesn't need a machine.
    task = task_candidate
    break
```

@doomedraven (Collaborator, Author)

But I would love for us to think about optimizing the case where we have static jobs: all VMs running is fine, but static jobs are not processed, and while tasks are in VMs the servers are idling when they could be processing. Maybe we can do this once the main feature is done.

@tbeadle (Contributor) commented Apr 23, 2024

> 1. I am talking about port 2042, and I am killing the service via systemctl. From the process viewer it seems that some threads are still running.

Could it be that you have tasks running when you attempt to stop the service and just aren't waiting long enough? A normal SIGTERM, as issued by systemctl stop, will allow the tasks that are currently running to finish, and then the process will exit. See

```python
def signal_handler(self, signum, frame):
```

> 2. Regarding azure, the only trace or hint of the problem is [root] CRITICAL: CuckooCriticalError: No machines available, which is solved by creating a new vmss for Azure; note that I am using reset_pool_size=true. This behavior was not there consistently before trying this PR, which is why it is mentioned. I am trying to debug the issue, but it might take some time.

Again, I've never used Azure, but it would seem to me that having a vmss available for Cape to use (if I'm correctly understanding what a vmss is) is a reasonable requirement, so this may be a "works as expected" type of thing. I'm not sure.

> 3. The status is "reported", which I think is a new status? The tasks get added to the DB and I get the tasks successfully created, but the queue is still 0. The tasks in the DB are in the "reported" state.

"reported" is not a new status. It is what the task goes to after cape-processor is finished with it. Out of curiosity, are these tasks submitted for static analysis, as opposed to dynamic analysis that requires a VM?

rkoumis and others added 2 commits April 25, 2024 07:49
* Added more tests for the analyzer

- fixed a couple of bugs

* Added more tests and fixed a couple more bugs

- typing hints now compatible with python 3.8
- in `in_protected_path` we were not correctly checking for files in a protected directory
- in `handle_interop` we now check ANALYSIS_TIMED_OUT
- in `inject_process` we now update LASTINJECT_TIME
- Updated the tests to match the fixed behaviour

* Ran `poetry lock --no-update` as prompted by GitHub

* Added tests; reduced the scope of patch()
@doomedraven (Collaborator, Author)

@cccs-mog were you able to check this? It would be nice to merge, but while ensuring that it works for you too.

@cccs-mog (Contributor) commented May 6, 2024

Hi @doomedraven and @tbeadle, I tested the new version and I am still facing the same issues. I haven't had much time to look into it yet; the issue seems to be that the status is not reflected correctly for tasks and machines. Regarding the machines, the status is NULL, which is obviously not a valid status. I'm not sure how the machinery would affect the task statuses. Looking into it.

A possible reason why the machines don't get a status on restart / are not counted: AZ overloads _initialize, which is probably fine; the machines are found in the database, so they are not added again, but somehow between the initialization and the query for machines the table gets emptied out, which makes it fail. Could it be that the deletion is done in another thread and the timing is bad? We could always just assume it's empty on the AZ side on every restart, which might solve that particular issue.

Regarding tasks: now I am totally unable to create tasks using the submit utilities or anything else, so perhaps the issue is in the add() function. Not totally sure. Will continue exploring.

@nbargnesi (Contributor)

> Regarding tasks: Now I am totally unable to create tasks using submit utilities

I see this as well. I think we just need to wrap a transaction around the db calls in utils/submit.py:

1. add_url
2. demux_sample_and_add_to_db
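
A hedged sketch of that wrapping, mirroring the per-request transaction model from this PR; the db.session handle and the add_url arguments are assumptions:

```python
from lib.cuckoo.core.database import Database

db = Database()
url = "http://example.com/sample"  # placeholder submission

# One transaction around the db calls so the added rows are committed.
with db.session.begin():  # assumed scoped-session handle on the proxy
    task_id = db.add_url(url)
```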

@tbeadle mentioned this pull request May 8, 2024
Update submit.py so that it properly uses database transactions to
commit the changes.

Co-authored-by: Tommy Beadle
@cccs-mog (Contributor) commented May 8, 2024

Thanks for fixing this.
Now, while running tasks, I get this annoying message:

2024-05-08 21:03:55,649 [lib.cuckoo.core.machinery_manager] DEBUG: Task #XX: no machine available yet for task requiring machine '', platform '' or tags '[<Tag(2,'win10x64')>]'

It doesn't seem to affect the analysis. Is this expected behavior with this new way of scheduling? If so, maybe this message should be limited/disabled?

I found a few instances of these as well:

sqlalchemy.orm.exc.StaleDataError: UPDATE statement on table 'machines' expected to update 1 row(s); 0 were matched.

@doomedraven (Collaborator, Author)

Ouch, it's been two weeks since the last commit; any progress here, guys? I would love to merge it, but I don't want to break anyone's sandbox.

@nbargnesi (Contributor)

> 2024-05-08 21:03:55,649 [lib.cuckoo.core.machinery_manager] DEBUG: Task #XX: no machine available yet for task requiring machine '', platform '' or tags '[<Tag(2,'win10x64')>]'. It doesn't seem to affect the analysis. Is this expected behavior with this new way of scheduling? If so, maybe this message should be limited/disabled?

You're seeing debug messages, which are going to be verbose.

Maybe you're running with -d/--debug?

```python
parser.add_argument("-d", "--debug", help="Display debug messages", action="store_true", required=False)
```

@doomedraven (Collaborator, Author)

@cccs-mog I'm going to merge this to master, but please hold off for one to two weeks or test it on dev; we'll help you get it working in the cloud, but I need this merged, as there is a lot of good code here.

@doomedraven merged commit df14080 into master on May 20, 2024 (3 checks passed)
@cccs-mog (Contributor)

I am running with debug, yes, so if there is no particular issue on that front, this is fine with me. The only remaining issue is machines not getting created on restart unless I switch vmss: between the initialization and the query for machines, the table gets emptied out, which makes it fail. I think it's really the deletion happening between the machinery checks. I will try assuming it's empty on the AZ side on every restart to see if this resolves the issue. Otherwise it seems usable; I will continue using it on dev for a few weeks and reach out with any issues. Thanks everyone!
