Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error for some reads when conducting mapping step #42

Open
GYQ-form opened this issue Sep 10, 2023 · 4 comments
Open

Error for some reads when conducting mapping step #42

GYQ-form opened this issue Sep 10, 2023 · 4 comments

Comments

@GYQ-form
Copy link

Hello

Thank you for your convenient reed2tree tool and I am very sorry to cause you trouble again.

When I followed the coronavirus dataset analysis tutorial provided by your team, I met some trouble conducting the mapping step:

image

A subprocess.CalledProcessError occurred when conducting this step when mapping the species ERR7552499. So I solely selected a job file, say, r2t_map_BC512-ERR7552499 for running to see what was going on. I got some key information in the output:

[INPUT] Error while reading paired end reads.
[INPUT] Names of mates don't match: ERR7552499.1544703 and ERR7552499.1544751.
[INPUT] NextGenMap expects paired end read names with the format: <read name>/<mate nunber> (e.g. @HWI-ST1176_0172:8:1101:1234:1934/1)
[INPUT] Use -d/--pe-delimiter to specify a different delimiter than '/'
[INPUT] If the format of the read names is correct this error might be caused by missing mates in the input file. Please check your input files or use ngm-utils interleave to match mate pairs.
[INPUT] If you are sure that your input files are valid use --skip-mate-check
This error is fatal. Quitting...
, found '['  (at char 0), (line:1, col:1)
2023-09-10 09:21:31,659 - read2tree.wrappers.read_mappers.parser - ERROR - , found '['  (at char 0), (line:1, col:1)

It seemed like there was something wrong with the paired-end reads file, so I unzipped and checked the content of ERR7552499__1.fastq:

@ERR7552499.1 1 length=99
AACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTCATTCACTGTACTCTGTTTAACACCAGTTTACTCATT
+ERR7552499.1 1 length=99
:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,F
@ERR7552499.2 2 length=85
AACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAATGAACCTCTAACACAAGACCATGTTGACATACTAGGACCT
+ERR7552499.2 2 length=85
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
...

Is this problem caused by too many 'F' in the quality control line? However, all these reads files are downloaded using the sradownloader as in the tutorial . How can I fix it?

Any help would be greatly appreciated.

Best regards,
Yuqiao Gong

@sinamajidian
Copy link
Contributor

Hi Yuqiao Gong

The read aligner expects to see similar record id when it processes paired-end reads. For example, the first line of ERR7552499__1.fastq should be @ERR7552499.1/1 and the first line of ERR7552499__2.fastq should be @ERR7552499.1/2 , each of them ending with /1 or /2.

This is mentioned in the log you sent us as well
[INPUT] NextGenMap expects paired end read names with the format: <read name>/<mate nunber> (e.g. @HWI-ST1176_0172:8:1101:1234:1934/1)

So, the record id of your fastq files have different format than expected for read2tree.
Could you tell us the first few lines of ERR7552499__2.fastq? and also please tell us the name of output files of sradownloader

This step prints some outputs, please share them with us as well.

Thanks!

@GYQ-form
Copy link
Author

Thanks for your prompt response!

the content of ERR7552499__2.fastq is like:

@ERR7552499.1 1 length=99
AATGAGTAAACTGGTGTTAAACAGAGTACAGTGAATGACAGATTTTAAAGTTCGTTTAGAGAACAGATCTACAAGAGATCGAAAGTTGGTTGGTTTGTT
+ERR7552499.1 1 length=99
FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF:FFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFF,:FFFF
@ERR7552499.2 2 length=85
AGGTCCTAGTATGTCAACATGGTCTTGTGTTAGAGGTTCATTTAGAGAACAGATCTACAAGAGATCGAAAGTTGGTTGGTTTGTT
+ERR7552499.2 2 length=85
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFF,F,FFFFFFFFFFFFFFFFFFFFFF

However, some reads are successfully mapped(e.g. SRR12980426), but their format is similar to ERR7552499:

SRR12980426__1.fastq

@SRR12980426.1 1 length=249
TGGATGACAAAGATCCAAATTTCAAAGATCAAGTCATTTTGCTGAATAAGCATATTGACGCATACAAAACATTCCCACCAACAGAGCCTAAAAAGGACAAAAAGAAGAAGGCTGATGAAACTCAAGCCTTACCGCAGAGACAGAAGAAACAGCAAACTGTGACTCTTCTTCCTGCTGCAGATTTGGATGATTTCTCCAAACAATTGCAACAATCCATGAGCAGTGCTGACTCAACTCAGGCCTAAACTC
+SRR12980426.1 1 length=249
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGCFGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFDGGGGGGGGGGGGGGFEG
@SRR12980426.2 2 length=251
CATCAGGAGATGCCACAACTGCTTATGCTAATAGTGTTTTTAACATTTGTCAAGCTGTCACGGCCAATGTTAATGCACTATTATCTACTGATGGTAACAAAATTGCCGATAAGTATGTCCGCAATTTACAACACAGACTTTATGAGTGTCTCTATAGAAATAGAGATGTTGACACAGACTTTGTGAATGAGTTTTACGCATATTTGCGTAAACATTTCTCAATGATGATACTCTCTGACGATGCTGTTGTG
+SRR12980426.2 2 length=251
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFFFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGC

SRR12980426__2.fastq

@SRR12980426.1 1 length=251
ACACACTGATTAAAGATTGCTATGTGAGATTAAAGTTAACTACATCTACTTGTGCTATGTAGTTACAAGAATTCATTCTGCACAAGAGTAGACTATATATCGTAAACGGAAAAGCGAAAACGTTTATATAGCCCATCTGCCTTGTGTTGTCTGCTTTAGTTTAGGCCTTACTTGAGTCAGCCCTGCTCATGTTTTTTTGCAATTGTTTGGCGAAATCATCCAAATCTTCTTCAGTAAGTTTTTTCACAGTT
+SRR12980426.1 1 length=251
CCCCCGFGGFGGGGGGGGGGGFGFGGGGGGGGGGGGGGGGGAFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFEGGGGGGGGGGGCFG>CFCCFEE8DFGGGGGG===EFGFGGAFGGG8FFFF9,,9,,,:9C,,,EF?8,,59C??FEEFF,4,6,?DE@,,,88++@+C=EDF9EEGFE,++4BDFF5<,5<D<:E,,,,7<,,3,,,,,,*435=,,==
@SRR12980426.2 2 length=250
GTTGAGAGCAAAATTCATGAGGTCCTTTAGTAAGGTCAGTCTCAGTCCAACATTTTGCTTCAGACATAAAAACATTGTTTTGATAATAAAGAACTGACTTAAAGTTCTTTATGCTAGCCACTAGACCTTGAGATGCATAAGTGCTATTGAAACACACAACAGCATCGTCAGAGAGTATCATCATTGAGAAATGTTTACGCAAATATGCGTAAAACTCATTCACAAAGTCTGTGTCAACATCTCTATTTCT
+SRR12980426.2 2 length=250
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGECEECG?GGGGFGGGGDD@GDGGGGGGGGGGGGGGECEFEEFGGGGGGGGGGGGGGGGGFCFGGGGGGGGGFGGFGGGGGGGGG?FCFFGCFFGFF

It is very weird.

Additionally, I followed the tutorial provided by your team to use sradownloader to download reads you used for coronavirus dataset analysis from NCBI, that is

python $paper_repo/corona/sradownloader --nogeo --noena --outdir reads data/used_accession.txt --threads 8

the output files are all the SRA reads provided in used_accession.txt (e.g. ERR7552499__1.fastq.gz, ERR7552499__2.fastq.gz, etc)

The full output of this step is:

> bash jobs/r2t_map_BC512-ERR7552499
--- Generating reference for mapping from folder ---
Re-loading references for mapping from folder: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 82/82 [00:00<00:00, 12925.44 species/s]
--- Mapping of reads to reference sequences ---
Mapping reads to species:   0%|                                                                                                                                                                             | 0/1 [00:00<?, ? species/s]2023-09-10 14:58:00,160 - read2tree.Mapper - INFO - ERR7552499: --- Mapping of reads to BC512 reference species ---
/home/gurc/mambaforge/envs/evolution/lib/python3.10/subprocess.py:961: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  self.stdout = io.open(c2pread, 'rb', bufsize)
/home/gurc/mambaforge/envs/evolution/lib/python3.10/subprocess.py:966: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  self.stderr = io.open(errread, 'rb', bufsize)
[MAIN] NextGenMap 0.5.5
[MAIN] Startup : x64 (build Jul  3 2020 02:47:43)
[MAIN] Starting time: 2023-09-10.14:58:00
[CONFIG] Parameter:  --affine 0 --argos_min_score 0 --bam 1 --bin_size 2 --block_multiplier 2 --broken_pairs 0 --bs_cutoff 6 --bs_mapping 0 --cpu_threads 4 --dualstrand 1 --fast 0 --fast_pairing 0 --force_rlength_check 0 --format 2 --gap_extend_penalty 5 --gap_read_penalty 20 --gap_ref_penalty 20 --hard_clip 0 --keep_tags 0 --kmer 13 --kmer_min 0 --kmer_skip 2 --match_bonus 10 --match_bonus_tc 2 --match_bonus_tt 10 --max_cmrs 2147483647 --max_equal 1 --max_insert_size 1000 --max_polya -1 --max_read_length 0 --min_identity 0.650000 --min_insert_size 0 --min_mq 0 --min_residues 0.500000 --min_score 0.000000 --mismatch_penalty 15 --mode 0 --no_progress 0 --no_unal 1 --ocl_threads 1 --output /tmp/ngm_t6d9clkq/BC512_OGs.fa.bam --overwrite 1 --pair_score_cutoff 0.900000 --paired 1 --parse_all 1 --pe_delimiter / --qry1 ../reads/ERR7552499__1.fastq.gz --qry2 ../reads/ERR7552499__2.fastq.gz --qry_count -1 --qry_start 0 --ref /tmp/ngm_t6d9clkq/BC512_OGs.fa --ref_mode -1 --sensitive 0 --silent_clip 0 --skip_mate_check 0 --skip_save 0 --slam_seq 0 --step_count 4 --strata 0 --topn 1 --trim5 0 --update_check 0 --very_fast 0 --very_sensitive 0
[NGM] Opening for output (BAM): /tmp/ngm_t6d9clkq/BC512_OGs.fa.bam
[SEQPROV] Encoding reference sequence.
[SEQPROV] Size of reference genome 0 Mbp (max. 17179 Mbp)
[SEQPROV] Allocating 15682 (26426) bytes for the reference.
[SEQPROV] BinRef length: 15679ll (elapsed 0.000092)
[SEQPROV] 0 reference sequences were skipped (length < 10).
[SEQPROV] Writing encoded reference to /tmp/ngm_t6d9clkq/BC512_OGs.fa-enc.2.ngm
[SEQPROV] Writing to disk took 0.00s
[PREPROCESS] Building reference table
[PREPROCESS] Allocated 1 hashtable units (tableLocMax=2^32.000000, genomeSize=2^14.936500)
[PREPROCESS] Building RefTable #0 (kmer length: 13, reference skip: 2)
[PREPROCESS]    Number of k-mers: 67108865
[PREPROCESS]    Counting kmers took 0.04s
[PREPROCESS]    Average number of positions per prefix: 1.000799
[PREPROCESS]    Index size: 335544325 byte (67108865 x 5)
[PREPROCESS]    Generating index took 1.67s
[PREPROCESS]    Allocating and initializing prefix Table took 0.00s
[PREPROCESS]    Number of prefix positions is 8769 (4)
[PREPROCESS]    Size of RefTable is 35076
[PREPROCESS]    Number of repetitive k-mers ignored: 0
[PREPROCESS]    Overall time for creating RefTable: 1.71s
[PREPROCESS] Writing RefTable to /tmp/ngm_t6d9clkq/BC512_OGs.fa-ht-13-2.3.ngm
[PREPROCESS] Writing to disk took 0.21s
[PREPROCESS] Max. k-mer frequency set so 100!
[INPUT] Input is paired end data.
[INPUT] Opening file ../reads/ERR7552499__1.fastq.gz for reading
[INPUT] Opening file ../reads/ERR7552499__2.fastq.gz for reading
[INPUT] Input is Fastq
[INPUT] Estimating parameter from data
[INPUT] Reads found in files: 1552620
[INPUT] Average read length: 219 (min: 38, max: 222)
[INPUT] Corridor width: 37
[INPUT] Average kmer hits pro read: 0.315271
[INPUT] Max possible kmer hit: 69
[INPUT] Estimated sensitivity: 0.300000
[INPUT] Estimating parameter took 3.592s
[INPUT] Input is Fastq
[INPUT] Input is Fastq
[OPENCL] Available platforms: 2
[OPENCL] AMD Accelerated Parallel Processing
[OPENCL] Selecting OpenCl platform: AMD Accelerated Parallel Processing
[OPENCL] Platform: OpenCL 1.2 AMD-APP (1214.3)
[OPENCL] 1 CPU device found.
[OPENCL] Device 0: Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz (Driver: 1214.3 (sse2,avx))
[OPENCL] 112 CPU cores available.
[INPUT] Error while reading paired end reads.
[INPUT] Names of mates don't match: ERR7552499.1544703 and ERR7552499.1544751.
[INPUT] NextGenMap expects paired end read names with the format: <read name>/<mate nunber> (e.g. @HWI-ST1176_0172:8:1101:1234:1934/1)
[INPUT] Use -d/--pe-delimiter to specify a different delimiter than '/'
[INPUT] If the format of the read names is correct this error might be caused by missing mates in the input file. Please check your input files or use ngm-utils interleave to match mate pairs.
[INPUT] If you are sure that your input files are valid use --skip-mate-check
This error is fatal. Quitting...
, found '['  (at char 0), (line:1, col:1)
2023-09-10 14:58:42,145 - read2tree.wrappers.read_mappers.parser - ERROR - , found '['  (at char 0), (line:1, col:1)
Mapping reads to species:   0%|                                                                                                                                                                             | 0/1 [00:41<?, ? species/s]
Traceback (most recent call last):
  File "/home/gurc/mambaforge/envs/evolution/bin/read2tree", line 16, in <module>
    main(sys.argv[1:], exe_name=exe_name(), desc=desc)
  File "/home/gurc/mambaforge/envs/evolution/lib/python3.10/site-packages/read2tree/main.py", line 352, in main
    Mapper(args, ref_set=reference.ref)  # Run the mapping
  File "/home/gurc/mambaforge/envs/evolution/lib/python3.10/site-packages/read2tree/Mapper.py", line 80, in __init__
    self._map_reads_to_references(ref_set)
  File "/home/gurc/mambaforge/envs/evolution/lib/python3.10/site-packages/read2tree/Mapper.py", line 289, in _map_reads_to_references
    processed_reads = self._call_wrapper(ref_tmp_file_handle, reads,
  File "/home/gurc/mambaforge/envs/evolution/lib/python3.10/site-packages/read2tree/Mapper.py", line 104, in _call_wrapper
    ngm = ngm_wrapper()
  File "/home/gurc/mambaforge/envs/evolution/lib/python3.10/site-packages/read2tree/wrappers/read_mappers/ngm.py", line 66, in __call__
    self.result = self._read_result(error, self.ref_input, self.tmp_folder)  # store result
  File "/home/gurc/mambaforge/envs/evolution/lib/python3.10/site-packages/read2tree/wrappers/read_mappers/ngm.py", line 117, in _read_result
    result = parser.to_dict(outfile, output)
  File "/home/gurc/mambaforge/envs/evolution/lib/python3.10/site-packages/read2tree/wrappers/read_mappers/parser.py", line 43, in to_dict
    reads_mapped, total_reads, mapping_time = self.parse(stdout)
TypeError: cannot unpack non-iterable NoneType object

and the content of r2t_map_BC512-ERR7552499 is:

#!/bin/bash

read2tree  --standalone_path ../marker_genes/ --reads ../reads/ERR7552499__1.fastq.gz ../reads/ERR7552499__2.fastq.gz --output_path ./ --single_mapping BC512 --threads 4 --species_name ERR755249

@GYQ-form
Copy link
Author

Sorry to bother you again, but I have run some other jobs during this period. I found all reads that cannot be successfully mapped share one thing in common: Just as I said, there are too many 'F' in the quality control line.

All the reads files were downloaded following your instructions in the tutorial. Are my reads files different from what you used for coronavirus analysis? If so, could you please tell me where and how you downloaded these reads files so that I can reproduce your work?

Thank you for your patient answer.

@sinamajidian
Copy link
Contributor

Sorry, I forgot to answer there, it just slipped my mind. The list of coronavirus accession IDs are in the supplementary of paper.
Please let us know, if there is any other questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants