Refactor of sequence collections and alignments #1861

Open
1 of 3 tasks
GavinHuttley opened this issue May 10, 2024 · 0 comments

The problem

First, let's define the sequence collection classes. These are classes that allow manipulations of multiple sequences, either as unaligned or aligned groups.

The current implementations of classes in cogent3/core/alignment.py are excessively complex. We have two classes for alignments that share the same API but differ in two important respects:

  • ArrayAlignment can be sliced with a stride, e.g. aln[::3] works, but Alignment does not support this
  • Alignment can be annotated but ArrayAlignment does not support this

There are performance differences too: ArrayAlignment is based on numpy arrays and so is potentially more memory efficient.

So we need to reduce these to a single class that has the union of features, but is more efficient. In addition, the SequenceCollection class will also be updated.

The solution

We will decouple the interfaces (i.e. SequenceCollection and Alignment) from the details of their implementation. We will do this through two types of objects:

  1. a SeqData class which handles the underlying storage of sequences (and its counterpart AlignedSeqData for aligned sequences)
  2. a single sequence "view" class, e.g. the SeqView class which is already in use for sequences

Objects under (1) and (2) will satisfy an interface defined by abstract base classes.

The purpose of the classes under (1) is to return views [i.e. (2)] on their data. Those views are bound to the Sequence (as is done currently) and Aligned instances that belong to their SequenceCollection and Alignment containers.

Responsibilities of SeqData classes

These classes provide storage of sequence name indexed data. They provide methods for getting a view of an individual member of the storage. They also provide methods used by those views to select subsets of individual members (e.g., a slice of a sequence) for different return types, e.g., string and numpy arrays. These classes also have a moltype etc...
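The responsibilities above could be sketched roughly as follows. This is only an illustration, not the proposed implementation: the method names get_seq_str() and get_seq_array() are borrowed from the AlignedSeqData methods mentioned later in this issue, while SeqDataABC, the names property and the constructor signature are assumptions. To keep the sketch dependency-free, a standard-library array.array stands in for the numpy array a real implementation would return.

```python
from abc import ABC, abstractmethod
from array import array
from types import MappingProxyType


class SeqDataABC(ABC):
    """Hypothetical interface for name-indexed sequence storage."""

    @abstractmethod
    def get_seq_str(self, name, start=None, stop=None):
        """Return the selected part of a sequence as a string."""

    @abstractmethod
    def get_seq_array(self, name, start=None, stop=None):
        """Return the selected part of a sequence as an array of byte values."""


class SeqData(SeqDataABC):
    """Minimal sketch of read-only storage keyed by sequence name."""

    def __init__(self, data, moltype="dna"):
        # Copy the input and expose it through a read-only mapping so the
        # storage cannot be changed after construction.
        self._data = MappingProxyType({str(k): str(v) for k, v in data.items()})
        self.moltype = moltype  # assumption: a plain label here

    @property
    def names(self):
        return tuple(self._data)

    def get_seq_str(self, name, start=None, stop=None):
        return self._data[name][start:stop]

    def get_seq_array(self, name, start=None, stop=None):
        # array("B") is a stand-in for a numpy uint8 array
        return array("B", self.get_seq_str(name, start, stop).encode("ascii"))


sd = SeqData({"s1": "ACGT", "s2": "GGCC"})
print(sd.names, sd.get_seq_str("s1", 1, 3), list(sd.get_seq_array("s2")))
```

Note that the storage knows nothing about the collection that holds it; it only answers requests for named members.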

Responsibilities of SeqDataView classes

These are views into the SeqData classes and are bound to a .data (or .seq) attribute of Sequence and Aligned.

Objects are sliced as per SeqView; that is, a slice operation only updates the slice state of the view, and that state is applied to the underlying data only when needed.

The view classes have a .parent attribute, the SeqData container that generated the view.
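A minimal sketch of such a view, assuming a storage object that exposes a get_seq_str(name) method. ToyStore and the positions argument are hypothetical, and the real SeqView tracks its slice state differently. Here the slice state is held as a Python range, which composes correctly under repeated slicing, and the parent data is read only when the view is converted to a string.

```python
class ToyStore:
    """Hypothetical stand-in for a SeqData container."""

    def __init__(self, data):
        self._data = dict(data)

    def get_seq_str(self, name):
        return self._data[name]


class SeqDataView:
    """Sketch of a lazy view on one member of a storage object."""

    def __init__(self, parent, name, positions=None):
        self.parent = parent  # the container that generated the view
        self.name = name
        if positions is None:
            positions = range(len(parent.get_seq_str(name)))
        self._positions = positions  # slice state, held as a range

    def __getitem__(self, index):
        if not isinstance(index, slice):
            raise TypeError("only slices are supported in this sketch")
        # Slicing a range yields a new range, so no sequence data is touched.
        return SeqDataView(self.parent, self.name, self._positions[index])

    def __len__(self):
        return len(self._positions)

    def __str__(self):
        # The parent data is read only at this point.
        full = self.parent.get_seq_str(self.name)
        return "".join(full[i] for i in self._positions)


store = ToyStore({"s1": "ACGTACGTAC"})
view = SeqDataView(store, "s1")
sub = view[2:8][::2]
print(str(sub), len(sub), sub.parent is store)
```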

Qualifiers

SeqData is immutable, and it knows nothing of the enclosing class.

Refactor of SequenceCollection

At present, the SequenceCollection.seqs attribute points to Sequence instances. The Sequence._seq attribute is a SeqView instance that contains the actual string.

We will change the above so that:

  • SequenceCollection has a ._seqdata attribute which is a SeqData instance.

  • SequenceCollection.seqs is a property that indexes into the SeqData instance, either via sequence name or integer.

  • Indexing returns a Sequence, but now Sequence._seq is a SeqDataView, which is linked to the parent storage.

We need to modify SequenceCollection to take a SeqData instance. All conversions of standard sequence data into a SeqData instance should be handled via make_unaligned_seqs(new_style=True).
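The arrangement described in the list above might look something like the following. All classes here are simplified stand-ins written for this sketch, not the cogent3 classes: ToySeqData, ToyView, SeqsIndexer and the get_view() method are hypothetical names. The point is only the wiring: the collection holds the storage in ._seqdata, .seqs indexes into it by name or integer, and each returned Sequence holds a view linked back to the storage.

```python
class ToySeqData:
    """Hypothetical stand-in for the storage class."""

    def __init__(self, data):
        self._data = dict(data)

    @property
    def names(self):
        return tuple(self._data)

    def get_seq_str(self, name):
        return self._data[name]

    def get_view(self, name):
        return ToyView(self, name)


class ToyView:
    """Hypothetical stand-in for SeqDataView."""

    def __init__(self, parent, name):
        self.parent = parent
        self.name = name

    def __str__(self):
        return self.parent.get_seq_str(self.name)


class Sequence:
    """Simplified stand-in: wraps a view rather than a raw string."""

    def __init__(self, view):
        self._seq = view
        self.name = view.name

    def __str__(self):
        return str(self._seq)


class SeqsIndexer:
    """Supports lookup by sequence name or by integer position."""

    def __init__(self, seqdata):
        self._seqdata = seqdata

    def __getitem__(self, key):
        if isinstance(key, int):
            key = self._seqdata.names[key]
        return Sequence(self._seqdata.get_view(key))


class SequenceCollection:
    """Simplified stand-in for the refactored collection."""

    def __init__(self, seqdata):
        self._seqdata = seqdata

    @property
    def seqs(self):
        return SeqsIndexer(self._seqdata)


coll = SequenceCollection(ToySeqData({"a": "ACGT", "b": "TTGG"}))
print(coll.seqs["b"], coll.seqs[0])
```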

Refactor of Alignment and Aligned

At present, the Alignment.seqs attribute points to Aligned instances. The Aligned.data attribute is a Sequence and the Aligned.map attribute is an IndelMap instance.

The challenge here is that alignments can be sliced, e.g. aln[::3]. If we mirror the design for unaligned sequences, then we would make Alignment.seqs an AlignedSeqData instance, which means slicing of the Alignment needs to be recordable somehow by AlignedSeqData. One way to handle this is to have a single attribute which is a SliceView (perhaps called SliceRecord) that can be applied to data not bound to it, e.g.

seq = ...
slice_record = SliceRecord(start, stop, step)
new_slice_record = slice_record[2:4:3]
sliced = new_slice_record.apply(seq)

The advantage of this approach is that the slice operation only needs to be applied to a single SliceRecord object.
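One possible shape for such an object is sketched below. It follows the usage shown above (a constructor taking start, stop and step, slicing via [], and an apply() method), but everything else is an assumption: in particular, the sketch assumes 0 <= start and stop <= len(seq), and it stores the state as a Python range so that successive slices compose correctly.

```python
class SliceRecord:
    """Sketch of a slice history that is not bound to any data."""

    def __init__(self, start, stop, step=1):
        # assumption: start and stop are concrete positions within the data
        self._range = range(start, stop, step)

    def __getitem__(self, index):
        if not isinstance(index, slice):
            raise TypeError("only slices are supported in this sketch")
        r = self._range[index]  # composing slices is delegated to range
        return SliceRecord(r.start, r.stop, r.step)

    def __len__(self):
        return len(self._range)

    def apply(self, seq):
        """Apply the recorded slice to any sliceable object."""
        r = self._range
        if len(r) == 0:
            return seq[:0]
        # With a negative step the recorded stop can fall below zero,
        # which must become None to mean "run to the start of seq".
        stop = r.stop if r.stop >= 0 else None
        return seq[r.start:stop:r.step]


aligned = {"s1": "ACGTACGTACGT", "s2": "AC--ACGTAC-T"}
record = SliceRecord(0, 12, 1)
third_positions = record[2::3]  # one slice operation, recorded once
sliced = {name: third_positions.apply(s) for name, s in aligned.items()}
print(sliced)
```

The same record is applied to every row, so the rows stay consistent with one another without each needing its own slice state.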

When a user indexes Alignment.seqs, AlignedSeqData returns an Aligned instance which has a single .seq attribute. This is now an AlignedSeqDataView. This view does not record slicing, but relies on that being done by AlignedSeqData when the AlignedSeqData.get_seq_str() or AlignedSeqData.get_seq_array() methods are called. Those methods return both the IndelMap and the Sequence.

Tasks

  • Discuss designing an abstract base class to define an interface for recording slice history
  • Merge in changes from develop branch into seq-collections-refactor
  • Write out the state flow for a slicing operation on an Alignment
GavinHuttley added this to the 2024 Q2 release milestone May 14, 2024