
Training with continue and fork mode terminated due to unhandled system error #1098

Open
drremo1 opened this issue Apr 1, 2023 · 0 comments
Labels
bug Something isn't working

Comments

drremo1 commented Apr 1, 2023

Bug Description

Hello, I've been trying to train the sota/2019 pre-trained dev-clean transformer model for 1 more epoch using flashlight's train continue mode. However, training fails to start because the pre-trained models from wav2letter are not compatible with flashlight. I then installed wav2letter v0.2 to retrain the pre-trained models with train continue, but it fails with this error:

I0401 12:50:50.303958 29472 Train.cpp:80] Parsing command line flags
I0401 12:50:50.303997 29472 Train.cpp:81] Overriding flags should be mutable when using `continue`
terminate called after throwing an instance of 'std::runtime_error'
  what():  unhandled system error
*** Aborted at 1680324650 (unix time) try "date -d @1680324650" if you are using GNU date ***
PC: @     0x7f4669af9e87 gsignal
*** SIGABRT (@0x3e800007320) received by PID 29472 (TID 0x7f4697288380) from PID 29472; stack trace: ***
    @     0x7f468f59f980 (unknown)
    @     0x7f4669af9e87 gsignal
    @     0x7f4669afb7f1 abort
    @     0x7f466a4ee957 (unknown)
    @     0x7f466a4f4ae6 (unknown)
    @     0x7f466a4f4b21 std::terminate()
    @     0x7f466a4f4d54 __cxa_throw
    @     0x55673215c6f8 fl::detail::ncclCheck()
    @     0x55673215ddd7 fl::distributedInit()
    @     0x5567320cb387 w2l::initDistributed()
    @     0x556731e3eab2 main
    @     0x7f4669adcc87 __libc_start_main
    @     0x556731ea7e4a _start
Aborted

I tried using train fork and the error still persists. The error does not occur when using train alone.

Reproduction Steps

This is what I ran:

wav2letter/build/Train continue /mnt/d/198 --minloglevel=0 --logtostderr=1 --rndv_filepath=
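For reference, since the stack trace points at fl::distributedInit() and ncclCheck(), here is a sketch of the same invocation run as a single process with distributed training turned off. This assumes the build exposes wav2letter's enable_distributed flag and that disabling it skips the NCCL init path; I have not verified either against this exact binary:

wav2letter/build/Train continue /mnt/d/198 --minloglevel=0 --logtostderr=1 --enable_distributed=false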

Is there another way to train the pre-trained models for just 1 epoch?

drremo1 added the bug (Something isn't working) label on Apr 1, 2023