
[Bug]: JSONDecodeError: Extra data: line 1 column 4 (char 3) on download_llama_dataset for PaulGrahamEssay #13114

Open · authentichamza opened this issue Apr 25, 2024 · 1 comment
Labels: bug (Something isn't working), triage (Issue needs to be triaged/prioritized)

Comments


authentichamza commented Apr 25, 2024

Bug Description

I have a public notebook that I created for public RAG benchmarking. It was working perfectly fine with v0.10.11, but now download_llama_dataset fails with:

JSONDecodeError: Extra data: line 1 column 4 (char 3)

Exact output:
RUN ID : 226e64cb-761d-4ae8-8d58-64d00e15ecbd
PaulGrahamEssayDataset

Version

v0.10.31

Steps to Reproduce

download_llama_dataset('PaulGrahamEssayDataset', './data')
(Screenshot attached: 2024-04-26 02:29:44)
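
For reference, a minimal standalone reproduction (a sketch assuming the documented top-level import path for llama-index 0.10.x, llama_index.core.llama_dataset, and its two-argument form):

from llama_index.core.llama_dataset import download_llama_dataset

# Downloads rag_dataset.json and the source files into ./data and returns
# the parsed dataset plus its source documents.
rag_dataset, documents = download_llama_dataset("PaulGrahamEssayDataset", "./data")
print(f"{len(documents)} source documents downloaded")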

Relevant Logs/Tracebacks

JSONDecodeError                           Traceback (most recent call last)
<ipython-input-17-2c82eb180180> in <cell line: 29>()
     27     return None
     28 
---> 29 download_dataset(dataset_name)
     30 dataset_path = find_source_files_dir(f"data/{dataset_name}/{run_id}")
     31 rag_dataset_path = find_rag_dataset_json(f"data/{dataset_name}/{run_id}")

4 frames
<ipython-input-17-2c82eb180180> in download_dataset(name)
      9         os.makedirs(f"data/{name}/{run_id}")
     10     print(name)
---> 11     download_llama_dataset(
     12       name, custom_path=f"data/{name}/{run_id}", show_progress=True
     13     )

/usr/local/lib/python3.10/dist-packages/llama_index/core/download/dataset.py in download_llama_dataset(dataset_class, llama_datasets_url, llama_datasets_lfs_url, llama_datasets_source_files_tree_url, refresh_cache, custom_dir, custom_path, source_files_dirpath, library_path, disable_library_cache, override_path, show_progress)
    223 
    224     # fetch info from library.json file
--> 225     dataset_info = get_dataset_info(
    226         local_dir_path=dirpath,
    227         remote_dir_path=llama_datasets_url,

/usr/local/lib/python3.10/dist-packages/llama_index/core/download/dataset.py in get_dataset_info(local_dir_path, remote_dir_path, remote_source_dir_path, dataset_class, refresh_cache, library_path, source_files_path, disable_library_cache)
     81             str(remote_dir_path), f"/{library_path}"
     82         )
---> 83         library = json.loads(library_raw_content)
     84         if dataset_class not in library:
     85             raise ValueError("Loader class name not found in library")

/usr/lib/python3.10/json/__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    344             parse_int is None and parse_float is None and
    345             parse_constant is None and object_pairs_hook is None and not kw):
--> 346         return _default_decoder.decode(s)
    347     if cls is None:
    348         cls = JSONDecoder

/usr/lib/python3.10/json/decoder.py in decode(self, s, _w)
    338         end = _w(s, end).end()
    339         if end != len(s):
--> 340             raise JSONDecodeError("Extra data", s, end)
    341         return obj
    342 

JSONDecodeError: Extra data: line 1 column 4 (char 3)


@authentichamza added the bug and triage labels on Apr 25, 2024

dosubot bot commented Apr 25, 2024

The JSONDecodeError you're encountering suggests there's extra data beyond a single JSON object or array in the JSON being parsed, or there's an issue with the JSON syntax. This error can occur at various points, such as when reading local cache, fetching remote content, or processing dataset information.
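
For illustration, this exact message can be reproduced by feeding a plain-text error body (for example, the "404: Not Found" text that raw.githubusercontent.com returns for a missing file) to json.loads: the decoder parses "404" as a complete JSON number, three characters long, and then trips over the colon.

import json

try:
    # "404" is a valid JSON number, so decoding stops after 3 characters and
    # the ":" is reported as extra data at char 3 / column 4.
    json.loads("404: Not Found")
except json.JSONDecodeError as exc:
    print(exc)  # Extra data: line 1 column 4 (char 3)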

To address this issue:

  1. Check JSON Files in the Remote Repository: Ensure the JSON files (like library.json or dataset-specific card.json) are correctly formatted without extra data after the main JSON object.

  2. Review Changes Between Versions: Investigate any modifications in how JSON files are handled between versions v0.10.11 and v0.10.31 that might introduce extra data into the JSON content.

  3. Debug JSON Content: Before decoding, print the raw JSON content to pinpoint where the extra data is coming from (see the sketch after this list).

  4. Verify Complete File Fetching: Make sure HTTP requests for JSON files retrieve complete file content without truncation or additional data.
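
A minimal sketch of step 3, assuming a plain requests fetch of the same library.json that get_dataset_info reads; the URL below is a placeholder, so substitute the llama_datasets_url your installed version actually builds (see llama_index/core/download/dataset.py):

import json
import requests

# Placeholder URL: replace with the actual llama_datasets_url + "/library.json"
# that your installed llama-index version requests.
LIBRARY_JSON_URL = "https://example.com/llama_datasets/library.json"

resp = requests.get(LIBRARY_JSON_URL)
print(resp.status_code)
print(repr(resp.text[:200]))  # inspect the raw content before decoding

try:
    library = json.loads(resp.text)
    print("Valid JSON; top-level entries:", list(library)[:5])
except json.JSONDecodeError as exc:
    print(f"Not valid JSON ({exc}); the server likely returned an error page instead of the library file.")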

If these steps don't resolve the issue, consider reaching out to the LlamaIndex project maintainers for more specific support.

