TSC Meeting Notes

Travis Addair edited this page May 29, 2020 · 3 revisions

May 29, 2020

Attendees: Travis Addair, Lin Yuan, Can Karakus, Enrico Minack, Fardin Abdi, Nicolas Castet, Josh Romero, Todd Mytkowicz

Agenda

Updates:

  • Elastic Horovod PR landed
  • LSF / jsrun support
  • Sync batch norm for PyTorch
  • NCCL GPU Allgather
  • Dropped Python 2 support
  • Hotfix 0.19.3
    • Network interface discovery
  • Hotfix 0.19.4
    • Sync Batch Norm
    • TensorFlow 2.2

Upcoming / recent talks:

  • GTC: Travis, Lin, Adasum
  • SparkAI Summit
  • Open Source Summit

Josh: How to maintain vendor-specific code paths?

  • Ensure CI tests pass
  • If missing test coverage, not responsibility of author
  • Experimental repo?
    • Downstream, separate tests
    • How to handle very experimental hardware?
  • Experimental fork / branch
  • Experimental module?
  • How do TensorFlow, PyTorch solve these issues?
    • Invest time in figuring this out.
    • Add item to track this.

Enrico: Horovod on Spark

  • Gloo
  • Elastic
  • Spark 3
  • Lightning Estimators
  • Petastorm dataset converter API (Databricks)

Elastic mode upcoming

  • Dynamic world size
  • Gradient predivision (should be solved by above)
  • NCCL async error handling
  • Reset limit
  • Don’t sync state when workers are gracefully removed

CMake

  • Coming soon!

Adasum

  • Paper coming soon
  • Big perf benefits shown
  • NCCL:
    • all-to-all (Josh/Todd)
    • 2.7 has send/recv

March 6, 2020

Agenda

Updates:

  • AWS credits
  • TF 2.2 support

Upcoming / recent talks

  • Josh @ ScaledML
  • Travis @ GTC
  • Lin @ GTC
  • Adasum @ GTC (3/26, afternoon)

Python 2 support?

  • Solution: keep nightly tests, but pin MXNet
  • Drop once other frameworks drop support

Elastic Horovod deep dive

January 24, 2020

Agenda

Introductions

  • Name, role, how you use Horovod in your work

LFAI: Path to graduation (Jacqueline)

Upcoming talks / blog posts / announcements

  • Jan 29: Seattle Applied Deep Learning Meetup, Fardin Abdi (Uber)
  • Feb 27: Scaled ML (Bay Area), Josh Romero (NVIDIA)

Outstanding issues:

  • Continuous integration
  • AWS Credits and plan for long term support
  • Ideas to reduce costs from GPU tests
  • TensorFlow Keras 2.1 support (#1688)

H1 2020 planning

Action Items

  • Research opportunities for collaboration between other LFAI projects
  • Insight into usage of Horovod (horovod.torch vs torch.distributed, horovod.tensorflow vs tf.distribute.MirroredStrategy)
  • Investigate multi-node unit tests