test: #7 read sequencer #27

Open
ninsch3000 opened this issue Oct 27, 2023 · 0 comments

ninsch3000 commented Oct 27, 2023

README description

Read Sequencer

Overview

Read Sequencer is a Python package that simulates sequencing.
It reads FASTA files, simulates sequencing with a specified read length, and writes the resulting sequences to a new FASTA file.

Installation from GitHub

Read Sequencer requires Python 3.9 or later.

Install Read Sequencer from GitHub using:

git clone https://git.scicore.unibas.ch/zavolan_group/tools/read-sequencer.git
cd read-sequencer
pip install . 

Usage

usage: read-sequencer [-h] [-i INPUT] [-r READ_LENGTH] [-n N_RANDOM] [-s CHUNK_SIZE] output 
Simulates sequencing of DNA sequences specified by an FASTA file.

positional arguments:
  output                path to FASTA file

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        path to FASTA file
  -r READ_LENGTH, --read-length READ_LENGTH
                        read length for sequencing
  -n N_RANDOM, --n_random N_RANDOM
                        n random sequences. Just used if input fasta file is not specified.
  -s CHUNK_SIZE, --chunk-size CHUNK_SIZE
                        chunk_size for batch processing

Docker

The Docker image is available on Docker Hub: https://hub.docker.com/r/grrchrr/readsequencer

docker pull grrchrr/readsequencer
docker run grrchrr/readsequencer readsequencer --help

Contributors and Contact Information

Christoph Harmel - [email protected]
Michael Sandholzer - [email protected]
Clara Serger - [email protected]

Original issue description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation/-/issues/7

Read sequencing

Simulate the sequencing of reads on the template of terminal fragments. Reads are copies of fixed length starting from the 5' end of fragments. If the desired read length is larger than the fragment length, sequencing would in principle proceed into the 3' adaptor and then would perhaps yield random bases. For simplicity, here we assume that random nucleotides are introduced in this case.

Input:

  1. Fasta-formatted file of sequences of terminal fragments from transcripts
  2. Number of reads to sample
  3. Read length (number of sequencing cycles)
  4. Dictionary of nucleotide frequencies used to pad the read if the input fragment is too short.

Output:
Fasta-formatted file of reads of identical length, representing 5’ ends of the terminal fragments.

To generate each read, a terminal fragment is chosen from input 1, with replacement. Then a segment of the specified read length (input 3) is extracted from the terminal fragment. If the terminal fragment is shorter than the read length, then random nucleotides are added to the 3' end according to the probabilities given in input 4, until the read length is reached. A unique name should be created for each read, and the name and read should be written to the output file in fasta format. The process is repeated for the specified number of reads (input 2).
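The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the package's actual code: the names `make_read` and `simulate_reads`, the `read_<i>` naming scheme, the `seed` parameter, and uniform sampling of fragments are all assumptions made for the example.

```python
import random

NUCLEOTIDES = "ACGT"


def make_read(fragment, read_length, padding_freqs, rng):
    """Copy read_length bases from the 5' end; pad the 3' end if too short.

    Hypothetical helper: padding_freqs maps each of A, C, G, T to a
    relative weight (input 4 in the description above).
    """
    read = fragment[:read_length]
    missing = read_length - len(read)
    if missing > 0:
        weights = [padding_freqs[n] for n in NUCLEOTIDES]
        # random.choices accepts relative weights; they need not sum to 1.
        read += "".join(rng.choices(NUCLEOTIDES, weights=weights, k=missing))
    return read


def simulate_reads(fragments, n_reads, read_length, padding_freqs, seed=None):
    """Return {read name: read} for n_reads reads of identical length.

    Assumption: fragments are drawn uniformly, with replacement.
    """
    rng = random.Random(seed)
    pool = list(fragments.values())
    reads = {}
    for i in range(n_reads):
        fragment = rng.choice(pool)  # sampling with replacement
        reads[f"read_{i + 1}"] = make_read(fragment, read_length, padding_freqs, rng)
    return reads
```

For example, `simulate_reads({"f1": "ACGTACGT", "f2": "AC"}, 10, 5, {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})` would return ten reads of length 5; reads drawn from `f2` keep the prefix `AC` and receive three random bases.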

Pipeline overview description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation
The terminal fragments from the previous step are sampled according to input #5 to pick a fragment for sequencing. Then a piece of length input #8 is taken from the 5' end of the fragment to form a read. If the fragment is shorter than the read length (input #8), the fragment is padded with random sequence, given a vector of relative probabilities for A, C, G, T to appear in the random sequence (input #8). The output of this step is a FASTA file of "sequenced reads", which is the output of the simulation.

Project design plan

https://git.scicore.unibas.ch/zavolan_group/tools/read-sequencer/-/issues/1

Project design: read_sequencer

Input:

- FASTA: terminal fragment sequences
- total number of reads
- read length
- padding nucleotide frequencies

Output:

- FASTA: sequenced reads

Function design:

- read_in_fasta(file_path)
    - reads lines of the FASTA into dictionary of strings or pandas dataframe
    - option to generate synthetic sequences that include primers and variable length
    
- simulate_sequencing(n_reads, sequences, read_length, padding_probabilities):
    - initiate results dict
    - wrapper function that iterates over reads:
        - per read do read_sequence():
            - sample one sequence from the pool of given sequences according to relative 
            - locate position of primer sequence
            - from this position read sequence
                - if: read_length > length_sequence:
                    - add random nucleotides according to padding_probabilities to the end of
                      the sequence until read_length is reached
                - will the 'sequencing' be affected by leading/lagging nucleotides (markov chains etc)
                  which can affect the correct sequencing result?      
            - store sequenced reads as string to result dictionary
        - return results dict which contains all sequencing reads as a FASTA file 

- needed lower level functions:
    - generate_dummy_data() / load_dummy_data()
    - read_sequence()
    - add_nucleotides_to_end()
    - sample_sequence()
    - locate_primer_site()
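
As a rough illustration of the I/O side of this plan, the sketch below parses a FASTA file into a dictionary of strings and writes reads back out. `read_in_fasta` follows the name in the plan; `write_fasta` is a hypothetical helper that the plan does not list, and neither is taken from the package's actual code.

```python
def read_in_fasta(file_path):
    """Parse a FASTA file into {record name: sequence}.

    Sequences that span several lines are joined; blank lines are skipped.
    """
    parts_by_name = {}
    name = None
    with open(file_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the full header line (without '>') as the record name.
                name = line[1:].strip()
                parts_by_name[name] = []
            elif name is not None:
                parts_by_name[name].append(line)
    return {n: "".join(parts) for n, parts in parts_by_name.items()}


def write_fasta(reads, file_path):
    """Write {name: sequence} to file_path, one sequence line per record."""
    with open(file_path, "w") as handle:
        for name, sequence in reads.items():
            handle.write(f">{name}\n{sequence}\n")
```

Using a plain dictionary keeps the sketch free of dependencies; the plan also mentions a pandas dataframe as an alternative container.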