
Staging #2041

Merged: 44 commits merged into master on May 20, 2024

Conversation

doomedraven (Collaborator)

No description provided.

Tommy Beadle and others added 21 commits April 1, 2024 16:38
…odule import.

Calling Database() just returns a proxy to a real database object that
may or may not exist when the Database object is created. That's fine as
long as no attributes/methods are accessed until the init_database
function has been called, which must be called only once during the
program's startup. For testing purposes, the reset_database function can
be called before calling init_database again. This is useful to create a
fresh in-memory database for the test. Note that the Database objects
that were created in other modules at module-level scope will still be
valid and calling methods on them will be proxied to the new database
instance.

load_vms_tags() and load_vms_exists() have been modified so that they
are not called during import of web_utils; instead they are called where
needed, and the database is only contacted the first time, with the
resulting value cached. No database queries should be made simply as a
result of importing modules!
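
A minimal sketch of the lazy, cached pattern described above; query_vms() is a hypothetical stand-in for the real database call:

```python
import functools

def query_vms() -> bool:
    """Hypothetical helper that performs the actual database query."""
    ...

@functools.lru_cache(maxsize=None)
def load_vms_exists() -> bool:
    # Nothing runs at import time; the database is contacted only on the
    # first call, and the cached result is returned to later callers.
    return query_vms()
```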
Use a scoped_session to make it easy to always get the current database
session, with one session per thread. In the scheduler and web requests,
wrap the unit of work (a main-loop iteration in the scheduler and a
request/response in the Django app) in a transaction so that all of it
or none of it will be committed. This also removes the need for the
classlock decorator around all of the Database methods, as well as the
need for handling commit/rollback behavior in each of those methods.
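
Roughly, the session/transaction model being described; the engine URL and the work done inside the block are assumptions:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("sqlite://")  # placeholder URL
Session = scoped_session(sessionmaker(bind=engine))  # one session per thread

def run_unit_of_work():
    # One scheduler iteration or one request/response: everything inside
    # commits together on success, or rolls back together on error.
    with Session.begin():
        session = Session()
        ...  # all Database method calls share this transaction
```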

This does make us more aware of when the database is being used, since
we must explicitly start a transaction. In sqlalchemy 2.0+, we can add
`autobegin=False` to the sessionmaker arguments, which will make it
easier to see when database access is being requested (for example, by
accessing an attribute of an object that is "expired", i.e. used outside
of the transaction it was retrieved in). Without this knowledge, we
could be making unnecessary calls to the database and not know it.

This type of transaction handling also makes it easier to do things like
find and reserve a machine. Improvements to this will be coming in a PR
soon, but for now we can at least use `with_for_update()` to lock the
row of the machine that is found; since that single query is not its own
transaction, the row stays locked for the remainder of the enclosing
transaction and we are guaranteed that no other thread will grab it.
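
For illustration, the locking idea might look like this inside the enclosing transaction; the Machine model and the locked filter are assumptions here:

```python
# session is the thread-local session from the sketch above.
machine = (
    session.query(Machine)      # assumed model
    .filter_by(locked=False)    # assumed column
    .with_for_update()          # row lock held until commit/rollback
    .first()
)
# No other thread can lock this machine until the transaction ends.
```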
- Use fixtures appropriately (i.e. be more pytest-y); see the fixture
  sketch after this list.
- Honor the updated transaction handling. This generally consists of
  wrapping the API call in a transaction as it would be during a web
  request or an iteration of the scheduler. Allow that transaction to be
  committed to make sure rows are written to the database correctly. Use
  another transaction to do validation that the database was updated as
  expected. We need to wrap those in a transaction so that it doesn't
  create an implicit one upon the first database command sent to it.
- Enable engine.echo so that, upon test failure or if `-s` is supplied
  to `pytest`, the SQL statements issued to the database are output.
  This enables manual verification that it's issuing the commands we
  expect it to.
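
A minimal sketch of such a fixture, assuming reset_database()/init_database() (described above) take no arguments here:

```python
import pytest
from lib.cuckoo.core.database import Database, init_database, reset_database

@pytest.fixture
def db():
    # Fresh in-memory database per test; module-level Database() objects
    # elsewhere keep proxying to the new instance.
    reset_database()
    init_database()
    return Database()
```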
It is, so far, unchanged except for things required to make it work from
a file separate from scheduler.py.
- Improve the handling of tasks, machines, and how they are updated in
  the database. Use transactions and locks appropriately so that changes
  are more atomic.
- Remove the need for holding the machine_lock for long periods of time
  in the main loop and remove the need for batch scheduling tasks.
- Make the code a lot cleaner and more readable, including separation of concerns
  among various classes. Introduce a MachineryManager class that does
  what the name suggests. In the future, this could have an API added
  that could provide us a way to dynamically update machines in the
  database without having to update a conf file and restart cuckoo.py.
Also rework how Config caching is invalidated to make it work better for
tests.
More tests still need to be added for the Scheduler and AnalysisManager,
but this at least gets us a little more testing than what was being done
before.
…sk if the only machines with that tag are "reserved."
…arent process after forking.

Use `engine.dispose` as described in
https://docs.sqlalchemy.org/en/14/core/pooling.html#using-connection-pools-with-multiprocessing-or-os-fork.
The benefit of this approach is that connection pools can still be used
in the child processes.
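
The pattern from those docs looks roughly like this; the pool size, URL, and worker body are assumptions, and dispose(close=False) requires a recent 1.4 release:

```python
from multiprocessing import Pool
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # placeholder URL

def initializer():
    # In each child: discard the connections inherited from the parent
    # without closing them (the parent still owns them); the child then
    # builds its own pool on first use.
    engine.dispose(close=False)

def run_task(item):
    with engine.connect() as conn:
        ...  # queries here use the child's own, fresh pool

if __name__ == "__main__":
    with Pool(4, initializer=initializer) as pool:
        pool.map(run_task, range(10))
```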
- Use the correct logging level.
- Update the aux modules that are available.
- Use a data file that actually exists.
@cccs-mog (Contributor) commented Apr 5, 2024

So far:

The scaling semaphore is not working with a dynamic number of machines: the machinery options contain no machines parameter when an autoscaling/dynamic number of machines is used.

Getting a warning: WARNING: Rooter returned error: inetsim_disable() missing 4 required positional arguments: 'inetsim_ip', 'dns_port', 'resultserver_port', and 'ports'

I was not yet able to test the rest because, for unknown reasons, the database does not seem to be initialized correctly, i.e. it gets stuck at getattr for every method call. It seems incomplete to me; shouldn't the function's parameters be passed down as well, in parentheses, like getattr(_DATABASE, attr)(params)?

Am I missing something?

@doomedraven (Collaborator, Author)

I looked a bit into inetsim but didn't see what is wrong; once I have some spare time again, I will check that.

@tbeadle (Contributor) commented Apr 8, 2024

> So far: The scaling semaphore is not working with a dynamic number of machines: the machinery options contain no machines parameter when an autoscaling/dynamic number of machines is used.

Could you please explain a little more about this? I don't quite understand.

> I was not yet able to test the rest because, for unknown reasons, the database does not seem to be initialized correctly, i.e. it gets stuck at getattr for every method call. It seems incomplete to me; shouldn't the function's parameters be passed down as well, in parentheses, like getattr(_DATABASE, attr)(params)?
>
> Am I missing something?

Do you have a traceback or any other information that you can give? Database.__getattr__ should not be passing any other parameters. It is only proxying the getattr call to the _Database object. So if you do this:

```python
db = Database()
db.some_func(param1)
```

accessing db.some_func will call the __getattr__ function, which will get that method from the _Database class and then pass param1 to it.
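
For reference, a minimal sketch of the proxy pattern being described; only the Database/_Database/init_database names come from the PR, and the bodies are illustrative:

```python
class _Database:
    ...  # the real implementation, created once at startup

_DATABASE = None  # set by init_database()

def init_database():
    global _DATABASE
    _DATABASE = _Database()

class Database:
    def __getattr__(self, attr):
        # Only the attribute lookup is proxied; the caller supplies the
        # arguments, so db.some_func(param1) resolves the method here
        # and then calls it with param1.
        return getattr(_DATABASE, attr)
```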

```python
if machines_limit:
    # The ScalingBoundedSemaphore is used to keep feeding available machines from the pending tasks queue
    log.info("upper limit for ScalingBoundedSemaphore = %d", machines_limit)
    retval = ScalingBoundedSemaphore(value=len(machinery_opts["machines"]), upper_limit=machines_limit)
```

Using the machines parameter from the machinery config won't yield any result in a dynamic scenario like AZ or AWS, as it is empty.

"""Set machine manager options.
@param options: machine manager options dict.
"""
self.options = options
mmanager_opts = self.options.get(self.module_name)
if not isinstance(mmanager_opts["machines"], list):
mmanager_opts["machines"] = str(mmanager_opts["machines"]).strip().split(",")

With dynamic machinery like AZ and AWS this is not set, hence the problem parsing it here.


Would you mind trying out this patch?
machine_lock.patch.txt


Seems to initialize fine, thanks for the patch!


DEBUG: # Active analysis: 1; # Pending Tasks: 0; # Specific Pending Tasks: {}; # Available Machines: 10; # Available Specific Machines: {'win10x64': 9, 'windows': 9, 'ub20x64': 1, 'linux': 1}; # Locked Machines: 1; # Specific Locked Machines: {'win10x64': 1, 'windows': 1}; # Total Machines: 11; # Lock value: 2/2

It seems the lock value doesn't correctly follow the number of available machines. I guess this is a work in progress because of this:

```python
# Here be dragons! Making these calls on the ScalingBoundedSemaphore is not
# thread safe.
```


Could you try adding some debug logging in update_limit() and check_for_starvation() to make sure that they are getting called every 10 seconds (or whatever you have scaling_semaphore_update_timer set to) and make sure they're getting called with accurate parameter values?


So the thread in question, which invokes update_limit() and check_for_starvation(), is just never started (it's missing a start()); this is the source of the problem!

@tbeadle commented Apr 10, 2024

I've created #2055 to start the thread. Thanks for noticing that!
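
For illustration only (the real fix is in #2055): the monitor thread needs an explicit start() call, roughly like this, with stubs standing in for the real methods:

```python
import threading
import time

def update_limit():  # stub for the real semaphore method named above
    ...

def check_for_starvation():  # stub for the real semaphore method
    ...

def monitor_loop():
    while True:
        update_limit()
        check_for_starvation()
        time.sleep(10)  # scaling_semaphore_update_timer

thread = threading.Thread(target=monitor_loop, daemon=True)
thread.start()  # the call that was missing, so the loop never ran
```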

@cccs-mog (Contributor) commented Apr 8, 2024

Yes, you are right regarding the Database class and getattr; it seems my calls were not being processed correctly because of Rooter returned error: inetsim_disable() missing 4 required positional arguments: 'inetsim_ip', 'dns_port', 'resultserver_port', and 'ports', which is now fixed. I will continue testing, but so far it works great, minus small details like the one flagged below!

@cccs-mog (Contributor) commented Apr 8, 2024

More updates: I am also facing an issue where the number of pending tasks stays at 0, or only very few tasks are processed/active at a time. Let me know if you need more information to investigate, and whether this is something specific to AZ/dynamic machinery; the database is full of tasks. It also seems that the submit.py utility doesn't work anymore. I will investigate it in more detail.

Tommy Beadle and others added 2 commits April 8, 2024 14:22
For scaling machinery, we can't set up the scaling semaphore until after
the machinery has been initialized, since that's when the database is
synchronized with the state of the cloud infrastructure and we need to
use that to determine the limit of the semaphore.
@tbeadle (Contributor) commented Apr 16, 2024

@cccs-mog

> Restarting cape is not cleaning up correctly; the process is not fully killed, so the port is still in use.

Which port are you referring to? How are you attempting to stop cape?

> Restarting cape on AZ gives an error because I think it doesn't close the vmss(s) correctly.

Can you give more details about the error you're seeing (e.g. a traceback)? I don't use azure at all. I've looked at the code again and everything seems sane, but I must be missing something.

> Still having issues with submitting tasks

Earlier you said the database is full of tasks, but the pending count is 0. What state are those tasks in?

What exactly happens when you try to use submit.py?

Thanks! Unfortunately, I won't be able to respond until after Sunday, though.

@doomedraven (Collaborator, Author)

Hello guys, I'm not sure how this is relevant to this branch, but I see that it doesn't work on master either: I had pending 'static' tasks that weren't processed, and it wasn't moving on to tasks that need a VM. I'm looking into the code to spot where we changed that.

@cccs-mog (Contributor) commented Apr 17, 2024

1. I am talking about port 2042, and I am killing the service via systemctl. From the process viewer it seems that some threads are still running.

2. Regarding azure, the only trace or hint of the problem is [root] CRITICAL: CuckooCriticalError: No machines available, which is solved by creating a new vmss for Azure; note that I am using reset_pool_size=true. This behavior was not there consistently before trying this PR, which is why it is mentioned. I am trying to debug the issue, but it might take some time.

```
File "CAPEv2/cuckoo.py", line 143, in <module>
    cuckoo_main(max_analysis_count=args.max_analysis_count)
  File "CAPEv2/cuckoo.py", line 104, in cuckoo_main
    sched.start()
  File "CAPEv2/lib/cuckoo/core/scheduler.py", line 289, in start
    self.machinery_manager.initialize_machinery()
  File "CAPEv2/lib/cuckoo/core/machinery_manager.py", line 241, in initialize_machinery
    raise CuckooCriticalError("No machines available")
lib.cuckoo.common.exceptions.CuckooCriticalError: No machines available
```

3. The status is "reported", which I think is a new status? The tasks get added to the DB and I get the tasks successfully created, but the queue is still 0. The tasks in the DB are in the "reported" state.

@doomedraven (Collaborator, Author)

I see a general logic issue here and in master with pending tasks of any non-file category, like static or pcap. If relevant_machine_is_available = self.db.is_relevant_machine_available(task) is not satisfied, the task won't be processed, but if the task is static or pcap it also won't move on to other tasks. I'm trying to spot where the code changed; I will probably add a quick check of the category. There is category_checks, but it runs too late and is never really reached:

```python
if self.task.category in ("file", "pcap", "static"):
```

@tbeadle (Contributor) commented Apr 17, 2024 via email

@doomedraven (Collaborator, Author)

If you're asking about my issues: yes, for sure, but that one affects master, so I've put a temp fix in master for now. Since I will be off for two weeks, I will try to find the proper place to put that in staging, but I can't promise anything.

@doomedraven (Collaborator, Author) commented Apr 19, 2024

OK, I see it is solved here:

```python
if task_candidate.category not in CATEGORIES_NEEDING_VM:
    # This task can definitely be processed because it doesn't need a machine.
    task = task_candidate
    break
```

@doomedraven (Collaborator, Author)

But I would love for us to think about optimizing the case where we have static jobs: all VMs running is fine, but static jobs are not processed, and while tasks are in VMs the servers are idling when they could be processing. Maybe we can do this once the main feature is done.

@tbeadle (Contributor) commented Apr 23, 2024

> 1. I am talking about port 2042, and I am killing the service via systemctl. From the process viewer it seems that some threads are still running.

Could it be that you have tasks running when you attempt to stop the service and just aren't waiting long enough? A normal SIGTERM, as issued by systemctl stop, will allow the tasks that are currently running to finish, and then the process will exit. See

```python
def signal_handler(self, signum, frame):
```

> 2. Regarding azure, the only trace or hint of the problem is [root] CRITICAL: CuckooCriticalError: No machines available, which is solved by creating a new vmss for Azure; note that I am using reset_pool_size=true. This behavior was not there consistently before trying this PR, which is why it is mentioned. I am trying to debug the issue, but it might take some time.

Again, I've never used Azure, but it would seem to me that having a vmss available for Cape to use (if I'm correctly understanding what a vmss is) is a reasonable requirement, so this may be a "works as expected" type of thing. I'm not sure.

> 3. The status is "reported", which I think is a new status? The tasks get added to the DB and I get the tasks successfully created, but the queue is still 0. The tasks in the DB are in the "reported" state.

"reported" is not a new status. It is what the task goes to after cape-processor is finished with it. Out of curiosity, are these tasks submitted for static analysis, as opposed to dynamic analysis that requires a VM?

rkoumis and others added 2 commits April 25, 2024 07:49
* Added more tests for the analyzer

- fixed a couple of bugs

* Added more tests and fixed a couple more bugs

- typing hints now compatible with python 3.8
- in `in_protected_path` we were not correctly checking for files in a protected directory
- in `handle_interop` we now check ANALYSIS_TIMED_OUT
- in `inject_process` we now update LASTINJECT_TIME
- Updated the tests to match the fixed behaviour

* Ran `poetry lock --no-update` as prompted by GitHub

* Added tests; reduced the scope of patch()
@doomedraven (Collaborator, Author)

@cccs-mog were you able to check this? It would be nice to merge, but while ensuring that it works for you too.

@cccs-mog (Contributor) commented May 6, 2024

Hi @doomedraven and @tbeadle, I tested the new version and I am still facing the same issues. I haven't had much time to look into it yet; the issue seems to be that the status is not reflected correctly for tasks and machines. Regarding the machines, the status is NULL, which is obviously not a valid status. I'm not sure how the machinery would affect the task statuses. Looking into it.

A possible reason why the machines don't get a status on restart / are not counted: AZ overloads _initialize, which is probably fine; the machines are found in the database, so they are not added again, but somehow between the initialization and the query for machines the table gets emptied out, which makes it fail. Could it be that the deletion is done in another thread and the timing is bad? We could always just assume it's empty on the AZ side on every restart, which might solve that particular issue.

Regarding tasks: now I am totally unable to create tasks using the submit utilities or anything else, so perhaps the issue is in the add() function. Not totally sure. Will continue exploring.

@nbargnesi (Contributor)

> Regarding tasks: Now I am totally unable to create tasks using submit utilities

I see this as well. I think we just need to wrap a transaction around the db calls in utils/submit.py:

1. add_url
2. demux_sample_and_add_to_db
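
A hedged sketch of that wrapping, mirroring the per-request transaction model from this PR; the db.session handle and the add_url arguments are assumptions:

```python
from lib.cuckoo.core.database import Database

db = Database()
url = "http://example.com/sample"  # placeholder submission

# One transaction around the db calls so the added rows are committed.
with db.session.begin():  # assumed scoped-session handle on the proxy
    task_id = db.add_url(url)
```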

@tbeadle mentioned this pull request May 8, 2024
Update submit.py so that it properly uses database transactions to
commit the changes.

Co-authored-by: Tommy Beadle
@cccs-mog (Contributor) commented May 8, 2024

Thanks for fixing this.
Now, while running tasks, I get this annoying message:

2024-05-08 21:03:55,649 [lib.cuckoo.core.machinery_manager] DEBUG: Task #XX: no machine available yet for task requiring machine '', platform '' or tags '[<Tag(2,'win10x64')>]'

It doesn't seem to affect the analysis. Is this expected behavior with this new way of scheduling? If so, maybe this message should be limited/disabled?

I found a few instances of these as well:

sqlalchemy.orm.exc.StaleDataError: UPDATE statement on table 'machines' expected to update 1 row(s); 0 were matched.

@doomedraven (Collaborator, Author)

Ouch, it's been two weeks since the last commit; any progress here, guys? I would love to merge it, but I don't want to break anyone's sandbox.

@nbargnesi (Contributor)

> 2024-05-08 21:03:55,649 [lib.cuckoo.core.machinery_manager] DEBUG: Task #XX: no machine available yet for task requiring machine '', platform '' or tags '[<Tag(2,'win10x64')>]'. It doesn't seem to affect the analysis. Is this expected behavior with this new way of scheduling? If so, maybe this message should be limited/disabled?

You're seeing debug messages, which are going to be verbose.

Maybe you're running with -d/--debug?

```python
parser.add_argument("-d", "--debug", help="Display debug messages", action="store_true", required=False)
```

@doomedraven (Collaborator, Author)

@cccs-mog I'm going to merge this to master, but please hold off for one to two weeks or test it on dev; we'll help you get it working in the cloud, but I need this merged, as there is a lot of good code here.

@doomedraven merged commit df14080 into master on May 20, 2024 (3 checks passed)
@cccs-mog (Contributor)

I am running with debug, yes, so if there is no particular issue on that front, this is fine with me. The only remaining issue is machines not getting created on restart unless I switch vmss: between the initialization and the query for machines, the table gets emptied out, which makes it fail. I think it's really the deletion happening between the machinery checks. I will try assuming it's empty on the AZ side on every restart to see if this resolves the issue. Otherwise it seems usable; I will continue using it on dev for a few weeks and reach out with any issues. Thanks everyone!
