An Emacs package for pulling data from 2bit files

davep/2bit.el

2bit.el

Introduction

2bit.el is a package for Emacs that provides a collection of functions, macros and interactive commands that can be used to extract data from 2bit format files. These are files that can hold DNA sequences in a compressed format. You'll see samples of this if you look at the human genome data, for example.

Emacs Lisp support

The package is built such that it provides functions and macros for working with 2bit files, and then goes on to build interactive commands on top of this code. The elisp-level support code includes:

2bit-open

2bit-open should be called to "open" a 2bit file for further use:

(2bit-open FILE &optional MASKING)

FILE is the path to the 2bit file you want to open. MASKING is an optional parameter that says whether mask blocks should be taken into account when reading bases from the file. Note that in my testing, mask block handling tends to make reading data a lot slower.
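As a minimal sketch of calling 2bit-open, the wrapper below checks that the file is readable before opening it. The name my-2bit-open-checked is hypothetical and not part of 2bit.el, and passing t for MASKING assumes any non-nil value turns mask handling on. It also assumes 2bit.el is already loaded, for example with (require '2bit).

```elisp
;; Hypothetical wrapper, not part of 2bit.el.
;; Assumes 2bit.el is already loaded, e.g. with (require '2bit).
(defun my-2bit-open-checked (file &optional masking)
  "Open FILE with `2bit-open', signalling an error if it is not readable.
MASKING is passed through to `2bit-open' unchanged."
  (let ((path (expand-file-name file)))
    (unless (file-readable-p path)
      (user-error "Cannot read 2bit file: %s" path))
    (2bit-open path masking)))
```

With this in place, (my-2bit-open-checked "hg38.2bit") opens the file without mask handling, and (my-2bit-open-checked "hg38.2bit" t) opens it with mask handling, at the cost of the slower reads mentioned above.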

2bit-sequence-count

2bit-sequence-count can be used to quickly get the count of how many sequences are held in the 2bit file:

(2bit-sequence-count FILE)

FILE can either be the path to a 2bit file, or a value returned from 2bit-open.

2bit-sequence-names

2bit-sequence-names can be used to quickly get a list of the names of all the sequences held in a 2bit file:

(2bit-sequence-names FILE)

FILE can either be the path to a 2bit file, or a value returned from 2bit-open.
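Since both functions accept a path directly, a quick summary of a file can be put together without calling 2bit-open first. The helper below is a sketch under that reading; my-2bit-summary is a hypothetical name, and the code assumes 2bit.el is already loaded.

```elisp
;; Hypothetical helper, not part of 2bit.el.
;; Assumes 2bit.el is already loaded, e.g. with (require '2bit).
(defun my-2bit-summary (file)
  "Return a plist with the sequence count and sequence names of FILE.
FILE is a path to a 2bit file or a value returned from `2bit-open'."
  (list :count (2bit-sequence-count file)
        :names (2bit-sequence-names file)))
```

For example, (my-2bit-summary "hg38.2bit") would return a plist whose :count is the number of sequences and whose :names is the list of their names, assuming hg38.2bit is in the current directory.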

2bit-sequence

2bit-sequence can be used to quickly get a named sequence from a 2bit file:

(2bit-sequence FILE SEQUENCE)

FILE can either be the path to a 2bit file, or a value returned from 2bit-open.

SEQUENCE is the name of the sequence to get.

2bit-sequence-dna-size

2bit-sequence-dna-size returns the size of the DNA contained in the given sequence.

(2bit-sequence-dna-size SEQUENCE)

SEQUENCE must be a value returned from 2bit-sequence.

2bit-bases

2bit-bases can be used to get a string of bases from a sequence in a 2bit file:

(2bit-bases SEQUENCE START END)

SEQUENCE is a value returned from a call to 2bit-sequence. START and END describe the sub-sequence to grab. Positions are zero-based; START is inclusive and END is exclusive, so END minus START bases are returned.
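Genomic coordinates are often quoted as 1-based and inclusive at both ends, so a small conversion can help avoid off-by-one errors. The sketch below shows one way to do that; my-2bit-range and my-2bit-bases-1-based are hypothetical helpers, not part of 2bit.el, and the second one assumes 2bit.el is already loaded.

```elisp
;; Hypothetical helpers, not part of 2bit.el.
;; Assumes 2bit.el is already loaded, e.g. with (require '2bit).
(defun my-2bit-range (first last)
  "Convert 1-based inclusive FIRST..LAST into a (START . END) pair.
The result is zero-based, inclusive of START and exclusive of END."
  (cons (1- first) last))

(defun my-2bit-bases-1-based (sequence first last)
  "Return bases FIRST..LAST (1-based, inclusive) of SEQUENCE.
SEQUENCE must be a value returned from `2bit-sequence'."
  (let ((range (my-2bit-range first last)))
    (2bit-bases sequence (car range) (cdr range))))
```

Under this convention the first ten bases of a sequence are positions 1 to 10 in 1-based terms, which corresponds to a START of 0 and an END of 10.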

2bit-with-file

2bit-with-file is a simple macro provided as a convenience wrapper when working with a 2bit file:

(2bit-with-file (HANDLE FILE)
  ...body...)

HANDLE is the name to give to the "handle" of the 2bit file, and FILE is the file to open. For example:

;; Get a list of the sizes of each of the numbered chromosomes in the Human
;; Genome.
(2bit-with-file (hg "hg38.2bit")
  (cl-loop for chr from 1 to 22
           collect (2bit-sequence-dna-size (2bit-sequence hg (format "chr%d" chr)))))

Emacs commands

The following interactive commands are available:

2bit-insert-bases

2bit-insert-bases simply inserts the requested sequence at the current point in the current buffer.

2bit-insert-fasta

2bit-insert-fasta simply inserts the requested sequence at the current point in the current buffer, formatted in FASTA format.
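For readers unfamiliar with FASTA, a record is a header line starting with ">" followed by the sequence split over fixed-width lines. The function below is only an illustration of that layout, not the code 2bit-insert-fasta uses; my-fasta-record is a hypothetical name, and the default width of 60 columns is an assumption.

```elisp
;; Hypothetical helper, not part of 2bit.el: format a string of bases as a
;; FASTA record. The default WIDTH of 60 is an assumption; the width that
;; 2bit-insert-fasta uses is not stated here.
(defun my-fasta-record (name bases &optional width)
  "Return NAME and BASES as a FASTA record, wrapping BASES at WIDTH columns."
  (let ((width (or width 60))
        (lines (list (concat ">" name)))
        (pos 0))
    ;; Collect the header, then successive WIDTH-sized slices of BASES.
    (while (< pos (length bases))
      (push (substring bases pos (min (length bases) (+ pos width))) lines)
      (setq pos (+ pos width)))
    (mapconcat #'identity (nreverse lines) "\n")))
```

Combined with 2bit-bases, something like (insert (my-fasta-record "chr1" bases)) would place a record at point, where bases is a string returned from 2bit-bases.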

Example

Here's a simple recording of a sample of using 2bit-insert-fasta to create a FASTA file from a 2bit file: