
JSONDecodeError in report_builder.py #364

Open
SolarSupremacy opened this issue Aug 23, 2023 · 20 comments

@SolarSupremacy

SolarSupremacy commented Aug 23, 2023

Tried running this from the command line in multiple different ways and can't get it to work. VizTracer does work when used inline in Python, but inline usage doesn't work for multiprocessing (which I need it to).

C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm>.\venv\Scripts\viztracer.exe --open .\src\main.py
Loading trace data from processes 0/2
Traceback (most recent call last):
  File "C:\Users\SolarSupremacy\AppData\Local\Programs\Python\Python310\lib\multiprocessing\util.py", line 300, in _run_finalizers
    finalizer()
  File "C:\Users\SolarSupremacy\AppData\Local\Programs\Python\Python310\lib\multiprocessing\util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm\venv\lib\site-packages\viztracer\main.py", line 553, in exit_routine
    self.save()
  File "C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm\venv\lib\site-packages\viztracer\main.py", line 534, in save
    builder.save(output_file=ofile)
  File "C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm\venv\lib\site-packages\viztracer\report_builder.py", line 180, in save
    self.generate_report(f, output_format="json", file_info=file_info)
  File "C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm\venv\lib\site-packages\viztracer\report_builder.py", line 162, in generate_report
    self.prepare_json(file_info=file_info)
  File "C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm\venv\lib\site-packages\viztracer\report_builder.py", line 106, in prepare_json
    self.load_jsons()
  File "C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm\venv\lib\site-packages\viztracer\report_builder.py", line 72, in load_jsons
    self.jsons.append(get_json(j))
  File "C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm\venv\lib\site-packages\viztracer\report_builder.py", line 30, in get_json
    return orjson.loads(json_str)
orjson.JSONDecodeError: unexpected end of data: line 1 column 1576961 (char 1576960)
C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm>.\venv\Scripts\python.exe -m viztracer .\src\main.py
Loading trace data from processes 1/2
Traceback (most recent call last):
  File "C:\Users\SolarSupremacy\AppData\Local\Programs\Python\Python310\lib\multiprocessing\util.py", line 300, in _run_finalizers
    finalizer()
  File "C:\Users\SolarSupremacy\AppData\Local\Programs\Python\Python310\lib\multiprocessing\util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm\venv\lib\site-packages\viztracer\main.py", line 553, in exit_routine
    self.save()
  File "C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm\venv\lib\site-packages\viztracer\main.py", line 534, in save
    builder.save(output_file=ofile)
  File "C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm\venv\lib\site-packages\viztracer\report_builder.py", line 180, in save
    self.generate_report(f, output_format="json", file_info=file_info)
  File "C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm\venv\lib\site-packages\viztracer\report_builder.py", line 162, in generate_report
    self.prepare_json(file_info=file_info)
  File "C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm\venv\lib\site-packages\viztracer\report_builder.py", line 106, in prepare_json
    self.load_jsons()
  File "C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm\venv\lib\site-packages\viztracer\report_builder.py", line 72, in load_jsons
    self.jsons.append(get_json(j))
  File "C:\Users\SolarSupremacy\Repositories\SolarDevelopmentRepo\solarCustomsPycharm\venv\lib\site-packages\viztracer\report_builder.py", line 30, in get_json
    return orjson.loads(json_str)
orjson.JSONDecodeError: unexpected end of data: line 1 column 1540097 (char 1540096)

edit: Figured out how to format it better.

@SolarSupremacy
Author

Since I don't know what else to do (other than dig into the viztracer source code and debug there)... here are more errors. I've seen many different JSON errors that all point to the file being improperly formatted in some way or another.

json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 34381825 (char 34381824)

@gaogaotiantian
Owner

Interesting. Could you share the script (preferably a minimal reproducible example) so I can check it out? For now the only thing I know is that there's something wrong with the JSON built in the subprocess, which is not that helpful as this is a corner case.

Also are you using the latest release version of VizTracer?

@gaogaotiantian
Owner

The JSON file is generated manually by VizTracer to achieve better performance, which could have bugs with certain combinations of characters. I just need to figure out which.

@SolarSupremacy
Author

Another error:

json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 34381825 (char 34381824)

I'm on viztracer 0.15.6, just installed from pip today and it seems up to date.

I'll see about providing you a minimal reproducible example at my next convenience, but the code is complicated. If there's a way I can just grab the JSON that VizTracer is generating and send that, that would be much easier for me.

@SolarSupremacy
Author

SolarSupremacy commented Aug 23, 2023

I've also tried this on two different (Windows) computers to see if maybe only one computer had the issue, but it's happening on both.

Also... I don't know if it's related to viztracer, but I'm having another issue with my code that I didn't notice before investigating this viztracer issue. I have a process that just won't die. The process writes to a PipeConnection, then closes it when it's done and returns from the target function. The process is also a daemon. For both of those reasons (it closes the pipe and is a daemon), the process should close when main is done. And of course, when main is done, that's when viztracer says "waiting for child processes to close". I've tried using close(), kill(), and terminate() on the process and it just won't end.

edit: I think this issue occurs with or without viztracer profiling the code. I just hadn't figured out which process was the one that was always not closing, forcing me to end the process.

@gaogaotiantian
Owner

The JSON file that has issues is a temporary file, so it's not easy (though not impossible) to find. If the process failed, the temporary JSON files should be somewhere in your temporary directory.

The better solution is still a minimal reproducible example so I can reproduce it somehow. However, if you can figure out how the JSON file gets corrupted, that could also work.
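
If you want to hunt for them, something like this should surface any leftover trace JSONs that fail to parse (the result_*.json naming is only a guess based on your output file names, not a VizTracer guarantee):

import glob
import json
import os
import tempfile

# Sketch: scan the temp directory for leftover trace JSON files and report any
# that fail to parse. The result_*.json pattern is an assumption.
for path in glob.glob(os.path.join(tempfile.gettempdir(), "result_*.json")):
    try:
        with open(path, "rb") as f:
            json.loads(f.read())
    except json.JSONDecodeError as exc:
        print(f"{path}: corrupted ({exc})")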

@SolarSupremacy
Author

SolarSupremacy commented Aug 23, 2023

stdout:

(venv) PS C:\Users\Ethan Randall\Documents\GitHub\SolarDevelopment\solarCustomsPycharm> viztracer .\src\main.py ; vizviewer result.json
# My script printing debug stuff... ignore this as the "DONE! NO MORE PROCESSES!" is a lie.
Data Queue growing! 8 queue size, 1253 frames, 4323.65 frames/s, 1245 proc, 4296.04 proc/s
Data Queue growing! 0 queue size, 6385 frames, 4960.69 frames/s, 6385 proc, 4960.69 proc/s
Data Queue growing! 1 queue size, 11474 frames, 5017.27 frames/s, 11473 proc, 5016.84 proc/s
Data Queue growing! 0 queue size, 16519 frames, 5025.93 frames/s, 16519 proc, 5025.93 proc/s
Data Queue growing! 0 queue size, 21488 frames, 5013.05 frames/s, 21488 proc, 5013.05 proc/s
<Process name='Process-1' parent=10880 started daemon>
Data Queue growing! 103 queue size, 27139 frames, 5118.73 frames/s, 27036 proc, 5099.31 proc/s
CSV is done. Queue size: 0
DONE! NO MORE PROCESSES!
[<Process name='Process-1' pid=12684 parent=10880 started daemon>]
# End of my script printing debug stuff. Everything below is viztracer and the stack trace.
Loading trace data from processes 1/2
Traceback (most recent call last):
  File "C:\Users\Ethan Randall\AppData\Local\Programs\Python\Python310\lib\multiprocessing\util.py", line 300, in _run_finalizers
    finalizer()
  File "C:\Users\Ethan Randall\AppData\Local\Programs\Python\Python310\lib\multiprocessing\util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "C:\Users\Ethan Randall\Documents\GitHub\SolarDevelopment\solarCustomsPycharm\venv\lib\site-packages\viztracer\main.py", line 553, in exit_routine
    self.save()
  File "C:\Users\Ethan Randall\Documents\GitHub\SolarDevelopment\solarCustomsPycharm\venv\lib\site-packages\viztracer\main.py", line 534, in save
    builder.save(output_file=ofile)
  File "C:\Users\Ethan Randall\Documents\GitHub\SolarDevelopment\solarCustomsPycharm\venv\lib\site-packages\viztracer\report_builder.py", line 180, in save
    self.generate_report(f, output_format="json", file_info=file_info)
  File "C:\Users\Ethan Randall\Documents\GitHub\SolarDevelopment\solarCustomsPycharm\venv\lib\site-packages\viztracer\report_builder.py", line 162, in generate_report
    self.prepare_json(file_info=file_info)
  File "C:\Users\Ethan Randall\Documents\GitHub\SolarDevelopment\solarCustomsPycharm\venv\lib\site-packages\viztracer\report_builder.py", line 106, in prepare_json
    self.load_jsons()
  File "C:\Users\Ethan Randall\Documents\GitHub\SolarDevelopment\solarCustomsPycharm\venv\lib\site-packages\viztracer\report_builder.py", line 72, in load_jsons
    self.jsons.append(get_json(j))
  File "C:\Users\Ethan Randall\Documents\GitHub\SolarDevelopment\solarCustomsPycharm\venv\lib\site-packages\viztracer\report_builder.py", line 32, in get_json
    return json.loads(json_str)
  File "C:\Users\Ethan Randall\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Ethan Randall\AppData\Local\Programs\Python\Python310\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Ethan Randall\AppData\Local\Programs\Python\Python310\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 25366529 (char 25366528)

This is a Google Drive link: https://drive.google.com/drive/folders/1PUa6LD_Q9RoY0DoWpF0PJQWrejlV0CgE?usp=drive_link

In that folder, there are the two raw files that were generated, associated with the stack trace exception above.

Just in case Google Drive mangles the files (which I don't think it would), and for your convenience, I also added a .zip (compressed by Windows, not 7z) and a .7z (compressed by 7z). If you have 7z, great, and I trust 7z won't have altered the data. If you don't, the .zip should still contain the raw data, but if anything were to mangle the binary data, it would be Windows.

Here are the checksums for those files (calculated by 7z on the raw files):

Name: result_10880.json
Size: 149748687 bytes (142 MiB)
CRC32: 984D90ED
CRC64: 08A71E7061B5D235
SHA256: b29d45bc907b489313156c42def8d122d57986156fe25ed4ac3b6a0ad1e918c9
SHA1: eb8ee48e633e7ff72e0b9b3b8305bea213c096ac
BLAKE2sp: 4eaaf87befa64ed4cc637c7877b1f5ba848ae5bae0e13e64b92ef1acc983a04c

Name: result_12684.json
Size: 25366528 bytes (24 MiB)
CRC32: D59F9624
CRC64: 6D9858C832417C42
SHA256: b686422c79b4c0543c677b7c9a8af7cc50d5fb2b238ce50bb2c9f027a1a700e6
SHA1: f467ea9d1fd62b69cea287f1806e59e16676e02a
BLAKE2sp: 0e570e795e6e5bb7af6c206681f25ee1bc9af5d2d490467b1100394074050947

I can also tell you... if the files are meant to have a standard encoding (like UTF-8) or are even meant to be ASCII, there's clearly an issue with one or both of them, because Notepad++ gets very unhappy when I try to view them. NP++ just hung on me and I had to force close it as well. When I manage to open one and scroll down a ways, there's corrupted text/Unicode spam everywhere. I can't copy the text into this comment, though. When I do, it appears normal.

@gaogaotiantian
Owner

The file requires access.

@SolarSupremacy
Author

Apologies. Thought I made it public. I gave you access (including edit access in case you want to send files back to me for testing).

@gaogaotiantian
Owner

Your file result_12684.json is not a complete JSON file; open it with Python and check the last characters. I believe the reason is that your subprocess got interrupted somehow (presumably by your main process) and viztracer could not finish writing the file. From what we have now, it does not seem to be a viztracer issue; it's more likely that your program does not allow a graceful exit for the subprocess.
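
For example, a quick way to check the tail (using the filename from the Drive folder above):

# Print the last few bytes of the trace file; a complete JSON report ends with
# a closing brace, while a truncated one stops mid-token.
with open("result_12684.json", "rb") as f:
    f.seek(-64, 2)  # 2 = seek relative to the end of the file
    print(f.read())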

@SolarSupremacy
Author

SolarSupremacy commented Aug 24, 2023

Every time I run viztracer, I have to do Ctrl+C to "skip" the "waiting for child processes to close". That seems like a logical reason why there wouldn't be a graceful exit.

The problem now is... I've got to figure out why Python won't close this process when it absolutely should. Its target function returns. The pipe it writes to is emptied and then closed (on both ends). It doesn't make any sense, and I can't even terminate() it... Is there any way the viztracer process is interfering with my child process's ability to close? How is it converting the process's data into JSON? Is the issue that my child process itself is being abruptly ended, or that a viztracer process is being interrupted?

@SolarSupremacy
Author

I identified the problem and found the solution.

By adding a time.sleep() of at least 2 seconds (1 second isn't enough) at the end of the main process's inline code (the last line of code run by the entire program), it works. It seems like the process does get closed properly and it just takes a couple of seconds for garbage collection or something to mark the process as terminated... whatever viztracer relies on to check if there are active child processes.

But if the process ends after the main.py returns, viztracer won't detect that. It just hangs on "waiting for child processes to close...".

After typing that out... I realized what the proper solution here should be:

Manually calling a "kill" or "end" method of the class that generated the process, which joins the process (and threads) to block until they are closed. I was hoping I wouldn't need to do that so I could move on with processing and let GC take care of it, but this does seem more proper, and obviously the ability to .join() a thread or process exists for a reason. I'm sure there's also a dunder method (not sure if __del__ would do it) that would let me use a with statement for my process-generating class, with exiting the with statement calling a method of that class that does the .join() blocking until it's cleaned up, roughly as in the sketch below.
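
A rough sketch of that with-statement idea (class and function names here are made up, not from my actual code):

import multiprocessing as mp

class ManagedWorker:
    """Hypothetical wrapper that guarantees the child process is joined on exit."""

    def __init__(self, target, *args):
        self._proc = mp.Process(target=target, args=args, daemon=True)

    def __enter__(self):
        self._proc.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Block until the child has fully exited, so anything hooked into its
        # shutdown (like viztracer saving its trace) gets a chance to finish.
        self._proc.join()

def some_target():
    pass  # placeholder for the real worker function

if __name__ == "__main__":
    with ManagedWorker(some_target):
        pass  # main-process work goes here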

I tested this lazily (just by joining the process directly from the main thread via the class's reference, doesn't matter) and viztracer works just fine.

It actually looks visually different than it did when I was testing it the other day. The webpage looks very clean and easier to use. There must have been an older version of it that I was accessing somehow (or you just updated it). In either case, it looks pretty sexy now and I'm glad I found the solution.

I'm not sure if you would consider this behavior a bug of viztracer or not, but I have learned better programming habits from this (properly cleaning up threads and processes when they are done). Thanks for being with me on the journey. If you need anything else from me (for help in fixing this bug), feel free to ask. I don't think I need any more help though as I identified the issue and found the solution (which I should have been doing to begin with).

Thanks man! If you don't need anything from me either, feel free to close the issue.

@gaogaotiantian
Owner

The way VizTracer works is to add a hook when the subprocess ends, which:

  1. Creates a tmp file indicating the trace is being saved
  2. Saves the trace (might take a while)
  3. Deletes the tmp file indicating the trace is already saved

However, if you try to terminate() the subprocess from the main process, it would "kill" the subprocess, and it's highly possible it's in step 2. So the subprocess never finishes saving the trace (hence the incomplete trace file), and the main process does not know the subprocess is finished (requiring Ctrl+C to escape).

For multiprocessing to work, you have to ensure all the processes end gracefully; that's just by definition. So yes, having a "join" in the main process makes a lot of sense and it's a good habit. VizTracer is not able to deal with such situations on its own; if the user forcefully kills the subprocess, there's nothing VizTracer can do and it simply won't work.
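
A minimal sketch of that sentinel-file pattern, with illustrative names rather than VizTracer's actual implementation:

import atexit
import json
import os

def save_trace_with_sentinel(trace_data, output_path):
    # Illustrative sketch only, not VizTracer's actual code.
    sentinel = output_path + ".viztmp"     # 1. mark "save in progress"
    open(sentinel, "w").close()
    with open(output_path, "w") as f:      # 2. write the trace (can be slow)
        json.dump(trace_data, f)
    os.remove(sentinel)                    # 3. mark "save finished"
    # If the process is killed during step 2, the sentinel never gets removed and
    # the JSON is left truncated, which matches the symptoms reported in this issue.

# Registered in the subprocess so it runs on a normal exit; terminate() /
# TerminateProcess on Windows skips it entirely.
atexit.register(save_trace_with_sentinel, {"traceEvents": []}, "result_example.json")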

@Hyfro-Cyrup

Hyfro-Cyrup commented Nov 2, 2023

Sorry to revive this, but I'm having what I believe to be a similar issue. Sometimes when multiprocessing workers use shared data, vizviewer gets stuck thinking a child process is still running when it isn't. So it gets stuck on "Wait for child processes to finish, Ctrl+C to skip" (even though we now know pressing Ctrl+C will cut off the stream and leave a string that json can't decode).
Here's a fairly minimal example:

import multiprocessing as mp
import time

def worker(currentMin, timeout=float('inf'), startTime=0):
    while True:

        # it only works when this next line is commented
        x = currentMin.value

        if time.perf_counter() - startTime > timeout:
            #-- exit early because of timeout
            break


def multiprocess(timeout=float('inf')):

    t0 = time.perf_counter()

    with mp.Pool() as pool:
        with mp.Manager() as manager:

            integer = manager.Value('i', 1)

            for _ in range(100):
                pool.apply_async(worker, (integer, timeout, t0))

            # end all the processes
            pool.close()
            pool.join()

if __name__ == '__main__':

    val = multiprocess(timeout=10)

    print(mp.active_children()) # []

I am using viztracer 0.16.0 on python 3.11.6.

Thanks

@gaogaotiantian
Owner

Which OS are you using? I can't repro this with the code provided on Linux.

@Hyfro-Cyrup

I'm on Windows 10.

@TTianshun
Contributor

TTianshun commented Dec 1, 2023

(Quoting @Hyfro-Cyrup's reproduction example above.)

I've reproduced this on Windows. It seems the saving process was terminated by mp.Manager(), so the viztmp file couldn't be deleted (which is why it gets stuck on "Wait for child processes to finish, Ctrl+C to skip"). And the JSON file is incomplete because the saving process was terminated, which caused the traceback. I'm not sure whether there's a proper way to avoid the process being terminated on Windows, but I'll try to make a PR to fix this.

@gaogaotiantian
Owner

Hmm, I don't think there's an easy way to prevent that. Windows kills processes without signals to catch, so a lot of multiprocessing functionality won't work on Windows.

@TTianshun
Contributor

Hmm, I don't think there's an easy way to prevent that. Windows kills processes without signals to catch, so a lot of multiprocessing functionality won't work on Windows.

I've tried to use signal to fix this, but it's really hard to do. I think there may be an imperfect solution for this: the current ReportBuilder raises an exception when there's an invalid JSON file, so how about skipping the invalid JSON files and combining the valid ones (a rough sketch below)? Another way to fix it is to patch multiprocessing.managers.BaseManager and make it wait longer before terminate(), so the process has enough time to save its trace data.
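
Something along the lines of this sketch (not the current ReportBuilder code; names are illustrative):

import orjson

def load_valid_jsons(paths):
    # Skip trace files that fail to parse instead of aborting the whole report.
    valid, skipped = [], []
    for path in paths:
        try:
            with open(path, "rb") as f:
                valid.append(orjson.loads(f.read()))
        except orjson.JSONDecodeError:
            skipped.append(path)
    if skipped:
        print(f"Warning: skipped corrupted trace files: {skipped}")
    return valid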

Do you think it's necessary to warn users when they terminate a subprocess on Windows? I found that there won't be any trace data for the terminated process.

@gaogaotiantian
Owner

Yeah, we can combine only the valid ones and send a warning about the invalid ones to users. Increasing the timeout is not a good idea because that won't "fix" the problem: there's no time guarantee for the trace saving to be done. The method is a bit intrusive as well. I don't think the benefit is worth the cost.

You can't warn users, that's the issue. There's no hook you can use when a process is terminated; otherwise we could stop the termination and save the trace. You'll have the same thing on Linux for SIGKILL. Those are OS-level signals that mandate the process be killed immediately.

You can do something smart in the multi-process case, when you have a main process as a monitor; that's what we are doing now when waiting for other sub-processes to finish. You could check whether the process being waited on still exists while waiting for it, and somehow deal with it if it does not.
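
A minimal sketch of that monitoring idea, assuming the third-party psutil package and an illustrative sentinel-file name (not VizTracer's actual waiting code):

import os
import time

import psutil  # third-party; assumed available for the PID existence check

def wait_for_child(pid, sentinel_path, poll_interval=0.5):
    # Wait for the sentinel file the child is supposed to delete once its trace
    # is saved, but bail out if the child process no longer exists.
    while os.path.exists(sentinel_path):
        if not psutil.pid_exists(pid):
            print(f"Process {pid} is gone but left {sentinel_path} behind; skipping it.")
            return False
        time.sleep(poll_interval)
    return True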
