TSC Meeting Notes
Travis Addair edited this page May 29, 2020 · 3 revisions
Attendees: Travis Addair, Lin Yuan, Can Karakus, Enrico Minack, Fardin Abdi, Nicolas Castet, Josh Romero, Todd Mytkowicz
Updates:
- Elastic Horovod PR landed
- LSF / jsrun support
- Sync batch norm for PyTorch
- NCCL GPU Allgather
- Dropped Python 2 support
- Hotfix 0.19.3
- Network interface discovery
- Hotfix 0.19.4
- Sync Batch Norm
- TensorFlow 2.2
Upcoming / recent talks:
- GTC: Travis, Lin, Adasum
- SparkAI Summit
- Open Source Summit
Josh: How to maintain vendor-specific code paths?
- Ensure CI tests pass
- If missing test coverage, not responsibility of author
- Experimental repo?
- Downstream, separate tests
- How to handle very experimental hardware?
- Experimental fork / branch
- Experimental module?
- How do TensorFlow, PyTorch solve these issues?
- Invest time in figuring this out.
- Add item to track this.
Enrico: Horovod on Spark
- Gloo
- Elastic
- Spark 3
- Lightning Estimators
- Petastorm dataset converter API (Databricks)
Elastic mode upcoming
- Dynamic world size
- Gradient predivision (should be solved by above)
- NCCL async error handling
- Reset limit
- Don’t sync state when workers are gracefully removed
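As a rough illustration of why gradient predivision interacts with a dynamic world size, here is a minimal pure-Python sketch. It is not Horovod code: `allreduce_sum` and `average_with_predivision` are hypothetical names, and the allreduce is simulated with plain lists. The assumption is that each worker divides its gradients by the world size before a sum-allreduce, so the divisor must reflect the current number of workers rather than a value cached at startup.

```python
from typing import List


def allreduce_sum(per_worker_grads: List[List[float]]) -> List[float]:
    """Simulated sum-allreduce: element-wise sum across all workers."""
    return [sum(values) for values in zip(*per_worker_grads)]


def average_with_predivision(per_worker_grads: List[List[float]]) -> List[float]:
    """Average gradients by dividing on each worker before the sum.

    The world size is read at call time (hypothetical design choice), so the
    result stays a true average when workers are added or removed.
    """
    world_size = len(per_worker_grads)
    scaled = [[g / world_size for g in grads] for grads in per_worker_grads]
    return allreduce_sum(scaled)


# Two workers, then four workers after a simulated scale-up.
two_workers = average_with_predivision([[2.0, 4.0], [4.0, 8.0]])
four_workers = average_with_predivision([[4.0], [8.0], [4.0], [0.0]])
```

If the divisor were fixed at the initial world size instead, the second call would divide by 2 rather than 4 and no longer produce an average, which is presumably the problem the dynamic world size item is expected to resolve.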
CMake
- Coming soon!
Adasum
- Paper coming soon
- Big perf benefits shown
- NCCL:
- all-to-all (Josh/Todd)
- 2.7 has send/recv
Updates:
- AWS credits
- TF 2.2 support
Upcoming / recent talks
- Josh @ ScaledML
- Travis @ GTC
- Lin @ GTC
- Adasum @ GTC (3/26, afternoon)
Python 2 support?
- Solution: keep nightly tests, but pin MXNet
- Drop once other frameworks drop support
Elastic Horovod deep dive
Introductions
- Name, role, how you use Horovod in your work
LFAI: Path to graduation (Jacqueline)
Upcoming talks / blog posts / announcements
- Jan 29: Seattle Applied Deep Learning Meetup, Fardin Abdi (Uber)
- Feb 27: Scaled ML (Bay Area), Josh Romero (NVIDIA)
Outstanding issues:
- Continuous integration
- AWS Credits and plan for long term support
- Ideas to reduce costs from GPU tests
- TensorFlow Keras 2.1 support (#1688)
H1 2020 planning
Action Items
- Research opportunities for collaboration between other LFAI projects
- Insight into usage of Horovod (horovod.torch vs torch.distributed, horovod.tensorflow vs tf.distribute.MirroredStrategy)
- Investigate multi-node unit tests