
Add memory cache limit #10258

Open · wants to merge 12 commits into main
Conversation

@PacificDou (Contributor) commented Apr 23, 2024

Description:
This PR implements the proposal discussed in #9871.

Basically, it gives users the flexibility to control the memory cache via an additional parameter, mem_cache_limit (default 0.7):

  1. mem_cache_limit < 0.0: no cache is used in memory (same as cache=False);
  2. 0.0 < mem_cache_limit < 1.0: use that fraction of the currently available memory;
  3. mem_cache_limit > 1.0: use the value directly as a byte count.

For example (a short usage sketch follows below):

  1. mem_cache_limit=0.7: 70% of the currently available memory;
  2. mem_cache_limit=1024*1024*1024: 1024x1024x1024 bytes = 1 GiB.
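
A minimal usage sketch, assuming mem_cache_limit is passed as a regular train argument alongside cache (the call below is only an illustration, not code taken from this PR):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# Cache images in RAM, but use at most 50% of the currently available memory.
model.train(data="coco8.yaml", epochs=1, imgsz=640, cache=True, mem_cache_limit=0.5)

# Alternatively, cap the RAM cache at an absolute size of 2 GiB.
model.train(data="coco8.yaml", epochs=1, imgsz=640, cache=True, mem_cache_limit=2 * 1024**3)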

Please note that there will always be a memory buffer (https://github.com/ultralytics/ultralytics/blob/5a82b511074617b9daa25000e5672c5d56188374/ultralytics/data/base.py#L87) inside the dataset, even if cache=disk or cache=False, to cache a few batches.
If the memory required for this buffer is larger than mem_cache_limit, those images are still cached, so the actual memory usage can be higher than mem_cache_limit. This should not be a problem, because the buffer is usually quite small.

In addition, if multi-process workers are used, mem_cache_limit only controls the memory usage inside a single worker process.
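
To make the three cases above concrete, here is a minimal sketch of how such a limit could be resolved into a byte budget; it simply restates the rules above and is not the PR's actual implementation (psutil is assumed to be available):

import psutil

def resolve_mem_cache_limit(mem_cache_limit: float) -> int:
    """Illustrative only: translate mem_cache_limit into a byte budget."""
    if mem_cache_limit < 0.0:
        return 0  # no RAM cache at all (same effect as cache=False)
    if mem_cache_limit < 1.0:
        # a fraction of the memory currently available on this machine
        return int(psutil.virtual_memory().available * mem_cache_limit)
    return int(mem_cache_limit)  # absolute limit, in bytes

print(resolve_mem_cache_limit(0.7))        # roughly 70% of available RAM, in bytes
print(resolve_mem_cache_limit(1024 ** 3))  # exactly 1 GiB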

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Introducing memory cache limits for improved data handling in training 🚀.

📊 Key Changes

  • Added mem_cache_limit configuration option in default.yaml to control memory cache usage.
  • Implemented memory cache limits within the data loading process to better manage RAM usage during training, preventing potential bottlenecks.
  • Updated dataset classes to support partial caching based on the new memory cache limit, enabling more efficient use of available system memory.

🎯 Purpose & Impact

  • Efficient Memory Usage: The introduction of a memory cache limit aims to optimize RAM usage during the data loading phase of training models. This is especially crucial for training on large datasets or on systems with limited memory resources.
  • Enhanced Performance: By allowing for partial dataset caching within the specified memory limits, training processes can achieve a balance between speed and system resource utilization. This ensures smoother training sessions without overwhelming the system's memory.
  • Flexibility and Control: Users now have the flexibility to configure the memory cache usage according to their system's capabilities, making the training process more adaptable and reducing the risk of system crashes or slowdowns due to excessive memory consumption.

Overall, this update enhances the Ultralytics framework's usability and efficiency, especially in environments where memory is at a premium.

@PacificDou (Contributor Author)

I have read the CLA Document and I sign the CLA

codecov bot commented Apr 23, 2024

Codecov Report

Attention: Patch coverage is 85.71429%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 70.49%. Comparing base (40e5ac3) to head (5e4da0c).

Files Patch % Lines
ultralytics/data/base.py 85.71% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10258      +/-   ##
==========================================
- Coverage   70.66%   70.49%   -0.18%     
==========================================
  Files         122      122              
  Lines       15591    15600       +9     
==========================================
- Hits        11018    10997      -21     
- Misses       4573     4603      +30     
Flag Coverage Δ
Benchmarks 35.37% <21.42%> (-0.26%) ⬇️
GPU 37.35% <21.42%> (-0.01%) ⬇️
Tests 66.75% <85.71%> (+<0.01%) ⬆️


@PacificDou (Contributor Author)

Codecov Report

Attention: Patch coverage is 86.66667%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 75.72%. Comparing base (5a82b51) to head (00c9e64).

❗ Current head 00c9e64 differs from pull request most recent head abf95eb. Consider uploading reports for the commit abf95eb to get more accurate results

Files Patch % Lines
ultralytics/data/base.py 86.66% 2 Missing ⚠️

Added an additional test case for a low RAM limit setting.
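
For reference, a hypothetical shape such a test could take (names and values below are illustrative, not the actual test added in this PR):

from ultralytics import YOLO

def test_low_mem_cache_limit():
    """Train briefly with a tiny RAM-cache budget and verify training still completes."""
    model = YOLO("yolov8n.yaml")
    model.train(data="coco8.yaml", epochs=1, imgsz=32, cache=True, mem_cache_limit=1024)  # ~1 KB budget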

@PacificDou (Contributor Author)

Got the following failure from GitHub Actions:

=========================== short test summary info ============================
FAILED tests/test_python.py::test_export_coreml - AttributeError: 'torch._C.Node' object has no attribute 'getModuleHierarchy'
======= 1 failed, 71 passed, 9 skipped, 11 warnings in 554.44s (0:09:14) =======
Error: Process completed with exit code 1.

However, the same test passes on my local machine:

========================================================== test session starts ===========================================================
platform linux -- Python 3.10.12, pytest-8.1.1, pluggy-1.4.0
rootdir: /home/shuyang/ultralytics
configfile: pyproject.toml
collected 1 item

tests/test_python.py . [100%]

========================================================== slowest 30 durations ==========================================================
22.99s call tests/test_python.py::test_export_coreml

(2 durations < 0.005s hidden. Use -vv to show these durations.)
=========================================================== 1 passed in 23.01s ===========================================================

@glenn-jocher (Member)

Hey there! It seems like you encountered an issue when running GitHub Actions for exporting to CoreML, but it worked fine on your local setup. 🤔 This discrepancy might be due to differences in the environment setup, library versions, or even the PyTorch and torchvision versions between your local machine and the GitHub Actions runner.

A potential step to align both environments more closely could involve ensuring that the same versions of PyTorch, torchvision, and coremltools are used in both cases. You could verify and match these versions in your GitHub Actions workflow file, possibly by setting up a more explicit environment for your tests. Here's a quick example of how you could specify versions in your workflow:

- name: Set up Python
  uses: actions/setup-python@v5
  with:
    python-version: '3.10'
- name: Install dependencies
  run: |
    pip install torch==<your_version> torchvision==<your_version> coremltools==<your_version>
    pip install -r requirements.txt

Replace <your_version> with the specific versions you're using locally. This might help in ensuring that the testing environments are as identical as possible, potentially resolving the attribute error you encountered. If the issue persists, it may be worth checking the latest compatibility notes or updates for coremltools or the related exporting function within the repository. Hope this helps!

@PacificDou (Contributor Author) commented Apr 23, 2024

Hi @glenn-jocher, I tried the main branch with the original code (no changes) and got the same error for the same test case on GitHub Actions. So I guess there might be some version conflict in the test scripts.

It seems that your recent commit fixed it!!

@Burhan-Q added the enhancement (New feature or request) label on Apr 23, 2024
@ultralytics deleted a comment from glenn-jocher on Apr 23, 2024
@Burhan-Q (Member)

@PacificDou thanks for opening a PR! Please also make sure to update the train arguments table in the documentation. Additionally, we appreciate any unit tests you can write to test the changes you're contributing.

@PacificDou (Contributor Author)

Hi @Burhan-Q , I updated the documentation and also added a test case.
