Refactor of sequence collections and alignments #1861

GavinHuttley · 2024-05-10T03:13:38Z

The problem

First, let's define the sequence collection classes. These are classes that allow manipulations of multiple sequences, either as unaligned or aligned groups.

The current implementations of classes in cogent3/core/alignment.py are excessively complex. We have two alignment classes that share the same API but differ in two important respects:

- ArrayAlignment can be sliced with a stride (e.g. aln[::3] works), but Alignment cannot.
- Alignment can be annotated, but ArrayAlignment cannot.

There are performance differences too: ArrayAlignment is based on numpy arrays and so is potentially more memory efficient.

So we need to reduce these to a single class that has the union of features, but is more efficient. In addition, the SequenceCollection class will also be updated.

The solution

We will decouple the interfaces (i.e. SequenceCollection and Alignment) from the details of their implementation. We will do this through two types of objects:

1. a SeqData class which handles the underlying storage of sequences (and its counterpart AlignedSeqData for aligned sequences)
2. a single sequence "view" class, e.g. the SeqView class which is already in use for sequences

Objects under (1) and (2) will satisfy an interface defined by abstract base classes.

The purpose of the classes under (1) is to return views [i.e. (2)] on their data. Those views are bound to the Sequence instances (as is done currently) and the Aligned instances that belong to their SequenceCollection and Alignment containers, respectively.
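As a rough illustration of what "an interface defined by abstract base classes" could look like, here is a minimal sketch. The class names `SeqDataABC` and `SeqViewABC`, and the exact method signatures, are assumptions made for this example and are not taken from cogent3.

```python
from abc import ABC, abstractmethod
from typing import Optional


class SeqDataABC(ABC):
    """Hypothetical interface for the storage classes, i.e. (1)."""

    @abstractmethod
    def get_seq_str(
        self, *, seqid: str, start: Optional[int] = None, stop: Optional[int] = None
    ) -> str:
        """Return a named sequence, or a segment of it, as a string."""

    @abstractmethod
    def get_view(self, seqid: str) -> "SeqViewABC":
        """Return a view, i.e. (2), on one member of the storage."""


class SeqViewABC(ABC):
    """Hypothetical interface for the single-sequence views, i.e. (2)."""

    @property
    @abstractmethod
    def parent(self) -> SeqDataABC:
        """The storage object that generated this view."""

    @abstractmethod
    def __getitem__(self, segment: slice) -> "SeqViewABC":
        """Return a new view with updated slice state."""
```

Any concrete storage or view class would then have to implement every abstract method before it can be instantiated, which is what lets SequenceCollection and Alignment depend only on the interface.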

Responsibilities of `SeqData` classes

These classes provide storage of data indexed by sequence name. They provide methods for getting a view of an individual member of the storage. They also provide methods used by those views to select subsets of individual members (e.g., a slice of a sequence) for different return types, e.g., strings and numpy arrays. These classes also carry attributes such as a moltype.
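The responsibilities above might be sketched as follows. `ToySeqData` is a hypothetical stand-in, not the proposed class: the method names mirror the `get_seq_str()` / `get_seq_array()` names mentioned later for AlignedSeqData, and `bytes` is used in place of a numpy array purely to keep the example dependency-free.

```python
from types import MappingProxyType


class ToySeqData:
    """Toy read-only storage of sequences indexed by name (illustrative only)."""

    def __init__(self, data, moltype="dna"):
        # copy the input, then wrap it so the storage cannot be mutated later
        self._data = MappingProxyType({str(k): str(v) for k, v in data.items()})
        self.moltype = moltype

    @property
    def names(self):
        return tuple(self._data)

    def get_seq_str(self, *, seqid, start=None, stop=None):
        """Return a segment of the named sequence as a string."""
        return self._data[seqid][start:stop]

    def get_seq_array(self, *, seqid, start=None, stop=None):
        """Return the same segment as bytes (assumption: stands in for a numpy array)."""
        return self.get_seq_str(seqid=seqid, start=start, stop=stop).encode("ascii")


store = ToySeqData({"s1": "ACGGT", "s2": "TTA"})
print(store.get_seq_str(seqid="s1", start=1, stop=4))  # CGG
```

Note that the storage knows nothing about the collection that owns it; it only answers requests for named segments.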

Responsibilities of `SeqDataView` classes

These are views into the SeqData classes and are bound to a .data (or .seq) attribute of Sequence and Aligned.

Objects are sliced as per SeqView; that is, the operation is used to update the slice state, but the underlying data is only modified when needed.

The view classes have a .parent attribute, the SeqData container that generated the view.
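A minimal sketch of this lazy behaviour, under stated assumptions: `ToyStore` and `ToyView` are hypothetical names, and a Python `range` is used here to record the slice state (the real SeqView may track it differently). Slicing the view only produces a new `range`; the parent is consulted only when the string is requested.

```python
class ToyStore:
    """Stand-in for a SeqData container (hypothetical)."""

    def __init__(self, data):
        self._data = dict(data)

    def seq_len(self, seqid):
        return len(self._data[seqid])

    def get_seq_str(self, *, seqid, start=None, stop=None, step=None):
        return self._data[seqid][start:stop:step]


class ToyView:
    """View on one sequence; records slicing, reads data only on demand."""

    def __init__(self, *, parent, seqid, positions=None):
        self.parent = parent  # the container that generated this view
        self.seqid = seqid
        # the range lists the selected parent positions; no data is copied
        self._positions = (
            range(parent.seq_len(seqid)) if positions is None else positions
        )

    def __getitem__(self, segment):
        if not isinstance(segment, slice):
            raise TypeError("this sketch only supports slices")
        # slicing a range composes the new slice with the recorded one
        return ToyView(
            parent=self.parent, seqid=self.seqid, positions=self._positions[segment]
        )

    def __len__(self):
        return len(self._positions)

    def __str__(self):
        pos = self._positions
        if len(pos) == 0:
            return ""
        # a negative range stop means "run past index 0", which a slice spells None
        stop = pos.stop if pos.stop >= 0 else None
        return self.parent.get_seq_str(
            seqid=self.seqid, start=pos.start, stop=stop, step=pos.step
        )


store = ToyStore({"s1": "ACGTACGTAC"})
view = ToyView(parent=store, seqid="s1")[2:9][::3]
print(str(view))  # GCA
```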

Qualifiers

SeqData is immutable, and it knows nothing of the enclosing class.

Refactor of `SequenceCollection`

At present, the SequenceCollection.seqs attribute points to Sequence instances. The Sequence._seq attribute is a SeqView instance that contains the actual string.

We will change the above so that:

- SequenceCollection has a ._seqdata attribute which is a SeqData instance.
- SequenceCollection.seqs is a property that indexes into the SeqData instance, either via sequence name or integer.
- Indexing returns a Sequence, but now Sequence._seq is a SeqDataView, which is linked to the parent storage.

We need to modify SequenceCollection to take a SeqData instance. All conversions of standard sequence data into a SeqData instance should be handled via make_unaligned_seqs(new_style=True).
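The intended wiring could look roughly like the sketch below. All class names are hypothetical toys, the `_SeqsIndexer` helper is an assumption about how the `.seqs` property might support indexing, and the conversion step performed by `make_unaligned_seqs` is not reproduced.

```python
class ToySeqData:
    """Stand-in for SeqData: name-indexed storage."""

    def __init__(self, data):
        self._data = dict(data)

    @property
    def names(self):
        return tuple(self._data)

    def get_seq_str(self, *, seqid):
        return self._data[seqid]


class ToySeqDataView:
    """Stand-in for SeqDataView: knows its parent storage and its sequence name."""

    def __init__(self, *, parent, seqid):
        self.parent = parent
        self.seqid = seqid

    def __str__(self):
        return self.parent.get_seq_str(seqid=self.seqid)


class ToySequence:
    """Stand-in for Sequence: its _seq attribute is a view, not a string."""

    def __init__(self, *, name, seq):
        self.name = name
        self._seq = seq

    def __str__(self):
        return str(self._seq)


class _SeqsIndexer:
    """Hypothetical helper returned by .seqs; indexes by name or integer."""

    def __init__(self, seqdata):
        self._seqdata = seqdata

    def __getitem__(self, key):
        names = self._seqdata.names
        name = names[key] if isinstance(key, int) else key
        if name not in names:
            raise KeyError(name)
        view = ToySeqDataView(parent=self._seqdata, seqid=name)
        return ToySequence(name=name, seq=view)

    def __len__(self):
        return len(self._seqdata.names)


class ToySequenceCollection:
    def __init__(self, seqdata):
        self._seqdata = seqdata

    @property
    def seqs(self):
        return _SeqsIndexer(self._seqdata)


coll = ToySequenceCollection(ToySeqData({"s1": "ACGT", "s2": "GGA"}))
print(coll.seqs["s2"], coll.seqs[0].name)  # GGA s1
```

In this arrangement Sequence objects are created on demand when `.seqs` is indexed, so the collection itself only ever holds the single storage object.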

Refactor of `Alignment` and `Aligned`

At present, the Alignment.seqs attribute points to Aligned instances. The Aligned.data attribute is a Sequence and the Aligned.map attribute is an IndelMap instance.

The challenge here is that alignments can be sliced, e.g. aln[::3]. If we mirror the design for unaligned sequences then we would make Alignment.seqs an AlignedSeqData instance, and that means slicing of the Alignment needs to be recordable somehow by AlignedSeqData. One way to handle this is to have a single attribute that is a SliceView (perhaps called SliceRecord) which can be applied to data not bound to it, e.g.

seq = ...
slice_record = SliceRecord(start, stop, step)
new_slice_record = slice_record[2:4:3]
sliced = new_slice_record.apply(seq)

The advantage of this approach is that the slice operation only needs to be applied to a single SliceRecord object.
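To make the idea concrete, here is a runnable sketch of a slice record that is not bound to any data. This is an assumption-laden toy, not the proposed implementation: it only supports forward slices (non-negative start, step of at least 1), and it stores its state in a Python `range` so that slicing the record composes correctly with what was recorded before.

```python
class SliceRecord:
    """Toy record of a forward slice, independent of the data it is applied to."""

    def __init__(self, start, stop, step=1):
        if start < 0 or stop < start or step < 1:
            raise ValueError("this sketch only supports forward slices")
        self._span = range(start, stop, step)

    @property
    def start(self):
        return self._span.start

    @property
    def stop(self):
        return self._span.stop

    @property
    def step(self):
        return self._span.step

    def __len__(self):
        return len(self._span)

    def __getitem__(self, segment):
        if not isinstance(segment, slice):
            raise TypeError("expected a slice")
        if segment.step is not None and segment.step < 1:
            raise ValueError("this sketch only supports forward slices")
        span = self._span[segment]  # range slicing composes the two slices
        # an empty result can have stop < start, so clamp before re-wrapping
        return SliceRecord(span.start, max(span.stop, span.start), span.step)

    def apply(self, seq):
        """Apply the recorded slice to any sliceable object."""
        return seq[self.start : self.stop : self.step]


record = SliceRecord(0, 12, 1)[2:10:3]
# one record, applied to several sequences that are not bound to it
print(record.apply("ABCDEFGHIJKL"), record.apply("abcdefghijkl"))  # CFI cfi
```

Because the record holds only three integers, an Alignment could share one instance across all of its sequences and update it once per slicing operation.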

When a user indexes Alignment.seqs, AlignedSeqData returns an Aligned instance which has a single .seq attribute. This is now an AlignedSeqDataView. This view does not record slicing, but relies on that being done by AlignedSeqData when the AlignedSeqData.get_seq_str() or AlignedSeqData.get_seq_array() methods are called. Those methods return both the IndelMap and the Sequence.
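One possible state flow for slicing, sketched with hypothetical toy classes. Several simplifications are assumed: only forward slices are handled, the column selection is recorded as a `range`, and `get_seq_str()` returns the gapped string rather than the IndelMap and Sequence pair described above.

```python
class ToyAlignedSeqData:
    """Stand-in for AlignedSeqData: shared gapped storage plus one slice record."""

    def __init__(self, gapped, columns=None):
        lengths = {len(s) for s in gapped.values()}
        if len(lengths) != 1:
            raise ValueError("aligned sequences must all have the same length")
        self._gapped = gapped  # shared between sliced instances, never copied
        self._columns = range(lengths.pop()) if columns is None else columns

    def __getitem__(self, segment):
        if not isinstance(segment, slice):
            raise TypeError("expected a slice")
        if segment.step is not None and segment.step < 1:
            raise ValueError("this sketch only supports forward slices")
        # step 1 of the flow: slicing updates only the column record
        return ToyAlignedSeqData(self._gapped, self._columns[segment])

    def get_view(self, seqid):
        # step 2: indexing hands out a view that holds no slice state
        return ToyAlignedSeqDataView(parent=self, seqid=seqid)

    def get_seq_str(self, *, seqid):
        # step 3: the record is applied only when data is requested
        cols = self._columns
        return self._gapped[seqid][cols.start : cols.stop : cols.step]


class ToyAlignedSeqDataView:
    """Stand-in for AlignedSeqDataView: defers all slicing to its parent."""

    def __init__(self, *, parent, seqid):
        self.parent = parent
        self.seqid = seqid

    def __str__(self):
        return self.parent.get_seq_str(seqid=self.seqid)


data = ToyAlignedSeqData({"s1": "AC-GTA", "s2": "ACT-TA"})
codon_pos1 = data[::3]
print(str(codon_pos1.get_view("s1")), str(codon_pos1.get_view("s2")))  # AG A-
```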

Tasks

- Discuss designing an abstract base class to define an interface for recording slice history
- Merge in changes from develop branch into seq-collections-refactor
- Write out the state flow for a slicing operation on an Alignment

GavinHuttley added this to the 2024 Q2 release milestone May 14, 2024
