Staging #2041
Conversation
…odule import. Calling Database() just returns a proxy to a real database object that may or may not exist when the Database object is created. That's fine as long as no attributes or methods are accessed until the init_database function has been called, which must happen exactly once during the program's startup. For testing purposes, the reset_database function can be called before calling init_database again; this is useful for creating a fresh in-memory database for the test. Note that Database objects that were created in other modules at module-level scope will still be valid, and calling methods on them will be proxied to the new database instance. load_vms_tags() and load_vms_exists() have been modified so that they are not called during import of web_utils but only where needed; the database is contacted just the first time, and the resulting value is cached. No database queries should be made simply as a result of importing modules!
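The proxy behavior described above can be sketched as follows. This is an illustrative stand-in, not the actual CAPE implementation: `_RealDatabase` and the function bodies are hypothetical, and only the shape (module-level proxy, one-time `init_database`, test-only `reset_database`) mirrors the description.

```python
_DATABASE = None  # the real database object, created by init_database()

class Database:
    """Safe to instantiate at module-import time: attribute access is
    forwarded to the real database, which may not exist yet."""
    def __getattr__(self, attr):
        if _DATABASE is None:
            raise RuntimeError("init_database() has not been called yet")
        # Returning the bound attribute means calls like db.add_task(...)
        # are proxied automatically: getattr returns the method and the
        # caller supplies the arguments.
        return getattr(_DATABASE, attr)

class _RealDatabase:  # hypothetical stand-in for the real class
    def add_task(self, target):
        return f"added {target}"

def init_database():
    global _DATABASE
    if _DATABASE is not None:
        raise RuntimeError("init_database() may only be called once")
    _DATABASE = _RealDatabase()

def reset_database():
    # For tests only: allows init_database() to be called again to get
    # a fresh database; existing Database proxies remain valid.
    global _DATABASE
    _DATABASE = None

db = Database()   # fine at import time: no attributes touched yet
init_database()   # called once during program startup
print(db.add_task("sample.exe"))  # -> added sample.exe
```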
Use a scoped_session to make it easy to always get the current database session, with one per thread. In the scheduler and web requests, wrap the unit of work (a main loop iteration in the scheduler and a request/response in the Django app) in a transaction so that all of it or none of it will be committed. This also removes the need for the classlock decorator around all of the Database methods, as well as the need for handling commit/rollback behavior in each of those methods. It does require us to be more aware of when the database is being used, by explicitly starting a transaction. In SQLAlchemy 2.0+, we can add `autobegin=False` to the sessionmaker arguments, which will make it easier to see when database access is being requested (such as by using an attribute of an object that is "expired", i.e. accessed outside of the transaction it was retrieved in). Without this knowledge, we could be making unnecessary calls to the database and not know it. This type of transaction handling also makes it easier to do things like finding and reserving a machine. Improvements to this will be coming in a PR soon, but for now we can at least use `with_for_update()` to lock the row of the machine that is found; since that single request is not its own transaction, the row stays locked for the rest of the same transaction and we are guaranteed that no other thread will grab it.
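A minimal sketch of this pattern, using an in-memory SQLite database and a hypothetical `Machine` table; the real CAPE schema and session wiring differ, and this only illustrates the scoped_session plus explicit-transaction shape described above.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, scoped_session, sessionmaker

Base = declarative_base()

class Machine(Base):  # hypothetical table for illustration
    __tablename__ = "machines"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    locked = Column(Integer, default=0)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
# scoped_session: one session per thread; every caller in the same thread
# gets the same session object.
Session = scoped_session(sessionmaker(bind=engine))

session = Session()
with session.begin():  # one transaction around the whole unit of work
    session.add(Machine(name="win10x64"))
    # On databases that support it, with_for_update() row-locks the found
    # machine until the transaction ends, so no other thread can grab it.
    # (SQLite ignores FOR UPDATE, so here it is illustrative only.)
    machine = (
        session.query(Machine)
        .filter_by(name="win10x64")
        .with_for_update()
        .one()
    )
    machine.locked = 1
# begin() committed everything above, or rolled all of it back on error.
```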
- Use fixtures appropriately (i.e. be more pytest-y).
- Honor the updated transaction handling. This generally consists of wrapping the API call in a transaction, as it would be during a web request or an iteration of the scheduler. Allow that transaction to be committed to make sure rows are written to the database correctly. Use another transaction to do the validation that the database was updated as expected. We need to wrap those in a transaction so that an implicit one is not created upon the first database command sent.
- Enable engine.echo so that, upon test failure or if `-s` is supplied to `pytest`, the SQL statements issued to the database are output. This enables manual verification that it's issuing the commands we expect it to.
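The act-then-verify transaction pattern from the second bullet can be sketched with stdlib sqlite3 standing in for the project's real Database fixtures; all names here are illustrative, not the actual test code.

```python
import sqlite3

def make_db():
    # Stand-in for a fixture that yields a fresh in-memory database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, target TEXT)")
    conn.commit()
    return conn

def add_task(conn, target):
    # Stand-in for the API call under test.
    conn.execute("INSERT INTO tasks (target) VALUES (?)", (target,))

def fetch_targets(conn):
    return [row[0] for row in conn.execute("SELECT target FROM tasks")]

def test_add_task():
    conn = make_db()
    # Transaction 1: perform the API call and commit it, as a web
    # request or a scheduler iteration would.
    with conn:
        add_task(conn, "sample.exe")
    # Transaction 2: a separate, explicit transaction to validate that
    # the row really reached the database.
    with conn:
        rows = fetch_targets(conn)
    assert rows == ["sample.exe"]

test_add_task()
```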
It is, so far, unchanged except for things required to make it work from a file separate from scheduler.py.
- Improve the handling of tasks, machines, and how they are updated in the database. Use transactions and locks appropriately so that changes are more atomic.
- Remove the need for holding the machine_lock for long periods of time in the main loop and remove the need for batch scheduling of tasks.
- Make the code a lot cleaner and more readable, including separation of concerns among various classes. Introduce a MachineryManager class that does what the name suggests. In the future, this could have an API added that would provide a way to dynamically update machines in the database without having to update a conf file and restart cuckoo.py.
Also rework how Config caching is invalidated to make it work better for tests. More tests still need to be added for the Scheduler and AnalysisManager, but this at least gets us a little more testing than what was being done before.
…sk if the only machines with that tag are "reserved."
…arent process after forking. Use `engine.dispose` as described in https://docs.sqlalchemy.org/en/14/core/pooling.html#using-connection-pools-with-multiprocessing-or-os-fork. The benefit of this approach is that connection pools can still be used in the child processes.
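A sketch of the pattern from the linked SQLAlchemy docs: dispose of the parent's pool before forking so each child builds a fresh pool on first use, which is what lets connection pools still be used in the children. The engine URL, child function, and fork start method (assumed POSIX) are illustrative.

```python
import multiprocessing
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # populate the parent's pool

def child():
    # The child inherits the engine but not usable pooled connections;
    # because the pool was disposed before the fork, its first use here
    # builds a brand-new pool, so pooling still works in the child.
    with engine.connect() as conn:
        assert conn.execute(text("SELECT 1")).scalar() == 1

# Per the linked docs: drop the parent's pooled connections before forking
# so that no socket is shared between parent and child.
engine.dispose()
ctx = multiprocessing.get_context("fork")  # assumes a POSIX platform
p = ctx.Process(target=child)
p.start()
p.join()
```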
Database and scheduler overhaul
- Use the correct logging level.
- Update the aux modules that are available.
- Use a data file that actually exists.
Updates to scheduler overhaul
So far: I'm getting a warning. I was not yet able to test the rest because, for unknown reasons, the database does not seem to be initialized correctly, i.e. it gets stuck at getattr for every method call. It seems incomplete to me: shouldn't the parameters of the function be passed down as well, in parentheses, like getattr(_DATABASE, attr)(params)? Am I missing something?
I looked a bit into inetsim but didn't see what is wrong; once I have some spare time again, I will check that.
Could you please explain a little more about this? I don't quite understand.
Do you have a traceback or any other information that you can give?
lib/cuckoo/core/machinery_manager.py
Outdated
if machines_limit:
    # The ScalingBoundedSemaphore is used to keep feeding available machines from the pending tasks queue
    log.info("upper limit for ScalingBoundedSemaphore = %d", machines_limit)
    retval = ScalingBoundedSemaphore(value=len(machinery_opts["machines"]), upper_limit=machines_limit)
Using the machines parameter from the machinery config won't yield any result for a dynamic scenario like AZ or AWS, since it is empty there.
"""Set machine manager options.
@param options: machine manager options dict.
"""
self.options = options
mmanager_opts = self.options.get(self.module_name)
if not isinstance(mmanager_opts["machines"], list):
    mmanager_opts["machines"] = str(mmanager_opts["machines"]).strip().split(",")
This does not hold for dynamic machinery like AZ and AWS, which creates a problem when parsing this.
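One defensive way to handle both cases is to normalize the conf value so that an empty value from dynamic machinery yields an empty list rather than `['']`. `parse_machines_option` is a hypothetical helper for illustration, not the actual patch that was applied.

```python
def parse_machines_option(value):
    """Normalize the 'machines' conf value.

    Static machinery lists machine names as a comma-separated string;
    dynamic machinery (e.g. az, aws) leaves the option empty, so an
    empty list must come back rather than [''].
    """
    if isinstance(value, list):
        return value
    value = str(value or "").strip()
    return [name.strip() for name in value.split(",") if name.strip()]

print(parse_machines_option("cuckoo1, cuckoo2"))  # -> ['cuckoo1', 'cuckoo2']
print(parse_machines_option(None))                # -> []
```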
Would you mind trying out this patch?
machine_lock.patch.txt
Seems to initialize fine, thanks for patching!
```
DEBUG: # Active analysis: 1; # Pending Tasks: 0; # Specific Pending Tasks: {}; # Available Machines: 10; # Available Specific Machines: {'win10x64': 9, 'windows': 9, 'ub20x64': 1, 'linux': 1}; # Locked Machines: 1; # Specific Locked Machines: {'win10x64': 1, 'windows': 1}; # Total Machines: 11; # Lock value: 2/2
```
It seems the lock value doesn't correctly follow the number of available machines. I guess this is a work in progress because of this:
```
# Here be dragons! Making these calls on the ScalingBoundedSemaphore is not
# thread safe.
```
Could you try adding some debug logging in update_limit() and check_for_starvation() to make sure that they are getting called every 10 seconds (or whatever you have scaling_semaphore_update_timer set to), and make sure they're getting called with accurate parameter values?
So the thread in question, which invokes update_limit() and check_for_starvation(), is just never started (it is missing a start() call), so this is the source of the problem!
I've created #2055 to start the thread. Thanks for noticing that!
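A minimal illustration of the bug discussed here; the worker function and names are stand-ins, not CAPE's actual code. The point is that a threading.Thread object does nothing until .start() is called.

```python
import threading
import time

calls = []

def semaphore_maintenance():
    # Stand-in for the loop that would periodically call update_limit()
    # and check_for_starvation().
    for _ in range(3):
        calls.append("tick")
        time.sleep(0.01)

t = threading.Thread(target=semaphore_maintenance, daemon=True)
t.start()  # forgetting this line means the loop never runs at all
t.join()
print(len(calls))  # -> 3
```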
Yes, you are right regarding the Database class and getattr. It seems my calls were not processing correctly because of "Rooter returned error: inetsim_disable() missing 4 required positional arguments: 'inetsim_ip', 'dns_port', 'resultserver_port', and 'ports'", which is now fixed. I will continue testing, but so far it works great, minus small details like the one flagged below!
More updates: I am also facing an issue where the number of pending tasks stays at 0, or just very few tasks are processed/active at a time. Let me know if you need more information to investigate, and whether this is something specific to AZ/dynamic machinery. The database is full of tasks. It also seems that the submit.py utility doesn't work anymore. I will investigate it in more detail.
For scaling machinery, we can't set up the scaling semaphore until after the machinery has been initialized, since that's when the database is synchronized with the state of the cloud infrastructure and we need to use that to determine the limit of the semaphore.
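The ordering constraint can be sketched as follows. FakeDatabase, FakeScalingMachinery, and count_machines_available are hypothetical stand-ins for the real classes; only the sequencing (initialize the machinery, then size the semaphore from the synchronized database) mirrors the description above.

```python
import threading

class FakeDatabase:
    def __init__(self):
        self.machine_count = 0
    def count_machines_available(self):
        return self.machine_count

class FakeScalingMachinery:
    def __init__(self, db):
        self.db = db
    def initialize(self):
        # Synchronizes the database with the state of the cloud
        # infrastructure; only after this does the count mean anything.
        self.db.machine_count = 5

db = FakeDatabase()
machinery = FakeScalingMachinery(db)
machinery.initialize()                 # must run first
limit = db.count_machines_available()  # now reflects the cloud state
# Sizing the semaphore before initialize() would have used 0.
machine_lock = threading.BoundedSemaphore(limit)
```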
… from monitor option filter
Which port are you referring to? How are you attempting to stop cape?
Can you give more details about the error you're seeing (e.g. a traceback). I don't use azure at all. I've looked at the code again and everything seems sane, but I must be missing something.
Earlier you said the database is full of tasks, but the pending count is 0. What state are those tasks in? What exactly happens when you try to use submit.py? Thanks! Unfortunately, I won't be able to respond until after Sunday, though.
Hello guys, not sure how this is relevant to this branch, but I see that on master it doesn't work either. I had pending 'static' tasks that weren't processed, and it wasn't moving on to tasks that need a VM. I'm looking into what changed in the code, to try to spot where we changed that.
3) The status is "reported", which I think is a new status? The tasks get added to the DB and I get "tasks successfully created", but the queue is still 0. The tasks in the DB are in the "reported" state.
I see a logic issue in general, here and in master, about pending any task that is not in the file category, like static or pcap.
I saw that particular issue while I was working on the scheduler and think that it should be addressed in staging.
On Wed, Apr 17, 2024, 2:19 PM doomedraven wrote:
> I see a logic issue in general, here and in master, about pending any task that is not in the file category, like static or pcap. If this code is not satisfied: relevant_machine_is_available = self.db.is_relevant_machine_available(task), it won't process the task; but if the task is static or pcap, it won't get further to other tasks. I'm trying to spot where the code changed; I will probably add a quick check of the category. There is category_checks, but it is too late and it never really reaches: if self.task.category in ("file", "pcap", "static"):
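The gap described in this exchange can be sketched as follows. `can_start`, the task dict, and the set of VM-requiring categories are assumptions for illustration, not CAPE's actual scheduler code; the point is that machine availability must not gate tasks that never need a machine.

```python
# Assumption for illustration: which categories need a VM.
VM_REQUIRED_CATEGORIES = {"file", "url"}

def can_start(task, relevant_machine_is_available):
    # static/pcap tasks don't need a machine, so a lack of available
    # machines must not block them (or the tasks queued behind them).
    if task["category"] not in VM_REQUIRED_CATEGORIES:
        return True
    return relevant_machine_is_available

print(can_start({"category": "static"}, False))  # -> True
print(can_start({"category": "file"}, False))    # -> False
```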
If you're asking about my issues: yes, for sure, but that one affects master, so I put a temp fix in master for now. As I will be off for 2 weeks, I will try to find the proper place to put that in staging, but I can't promise anything.
OK, I see it is solved here.
But I would love it if we could think about optimizing the case where we have static jobs: all VMs running is fine, but static jobs are not processed, and while tasks are in a VM, servers are idling when they could be processing. Maybe we can do this once the main feature is done, though.
Could it be that you have tasks running when you attempt to stop the service and just aren't waiting long enough? A normal SIGTERM is handled by the code at CAPEv2/lib/cuckoo/core/scheduler.py, line 270 in cc4d767.
Again, I've never used Azure, but it would seem to me like having a vmss available for Cape to use (if I'm correctly understanding what a vmss is) is a reasonable requirement, so this may be a "works as expected" type of thing. I'm not sure.
"reported" is not a new status. It is what the task goes to after cape-processor is finished with it. Out of curiosity, are these tasks submitted for static analysis, as opposed to dynamic analysis that requires a VM?
* Added more tests for the analyzer; fixed a couple of bugs
* Added more tests and fixed a couple more bugs
  - typing hints now compatible with python 3.8
  - in `in_protected_path` we were not correctly checking for files in a protected directory
  - in `handle_interop` we now check ANALYSIS_TIMED_OUT
  - in `inject_process` we now update LASTINJECT_TIME
  - Updated the tests to match the fixed behaviour
* Ran `poetry lock --no-update` as prompted by github
* Added tests; reduced the scope of patch()
@cccs-mog were you able to check this? It would be nice to merge, but while ensuring that it works for you too.
Hi @doomedraven and @tbeadle, I tested the new version and I am still facing the same issues. I haven't had much time to look into it yet; the issue seems to be that the status is not getting reflected correctly for tasks and machines. Regarding the machines, the status is NULL, which is obviously not a valid status. I'm not sure how the machinery would be affecting the task statuses; looking into it. A possible reason why the machines don't get a status on restart / are not counted: AZ overloads _initialize, which is probably fine; the machines are found in the database, so they are not added again, but somehow between the initialization and the query for machines, the table gets emptied out, which makes it fail. Could it be that the deletion is made in another thread and the timing is bad? We could always just assume it's empty on the AZ side on every restart, which might solve that particular issue. Regarding tasks: now I am totally unable to create tasks using the submit utilities or anything else, so perhaps the issue is in the add() function. Not totally sure; will continue exploring.
I see this as well. I think we just need to wrap a transaction around the db calls in |
Update submit.py so that it properly uses database transactions to commit the changes. Co-authored-by: Tommy Beadle
Thanks for fixing this. Found a few instances of these as well:
Ouch, it's been 2 weeks since the last commit. Any progress here, guys? I would love to merge it but don't want to break anyone's sandbox.
You're seeing debugging messages, which are going to be verbose. Maybe you're running with the setting at line 116 in 21c679f.
@cccs-mog I'm going to merge this to master, but please be on hold for 1-2 weeks or test it on dev; we're going to help you get it working on cloud, but I need this in, as there's a lot of good code here.
I am running with debug, yes, so if there is no particular issue on that front, this is fine with me. The only remaining issue is the machines not getting created on restart unless I switch vmss: between the initialization and the query for machines, the table gets emptied out, which makes it fail. I think it's really the deletion which is happening between the machinery checks. I will try assuming it's empty on the AZ side on every restart to see if this resolves the issue. Otherwise it seems usable; I will continue using it on dev for a few weeks and reach out with any issues. Thanks everyone!