The problem
First, let's define the sequence collection classes. These are classes that allow manipulations of multiple sequences, either as unaligned or aligned groups.
The current implementations of the classes in cogent3/core/alignment.py are excessively complex. We have two classes for alignments that have the same API but differ in two important aspects:
- ArrayAlignment can be sliced with a stride, e.g. aln[::3] works, but Alignment does not support this.
- Alignment can be annotated, but ArrayAlignment does not support this.
There are performance differences too: ArrayAlignment is based on numpy arrays and so is potentially more memory efficient.
So we need to reduce these to a single class that has the union of features, but is more efficient. In addition, the SequenceCollection class will also be updated.
The solution
We will decouple the interfaces (i.e. SequenceCollection and Alignment) from the details of their implementation. We will do this through two types of objects:
1. a SeqData class which handles the underlying storage of sequences (and its counterpart AlignedSeqData for aligned sequences)
2. a single sequence "view" class, e.g. the SeqView class which is already in use for sequences
Objects under (1) and (2) will satisfy an interface defined by abstract base classes.
The classes under (1) exist to return views [i.e. (2)] on their data. Those views are bound to the Sequence (as is done currently) and Aligned instances that belong to their SequenceCollection and Alignment containers.
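The two interfaces could be written down as abstract base classes along the following lines. This is only a sketch: the names SeqDataABC and SeqViewABC, and the exact method signatures, are assumptions made for illustration and are not taken from cogent3.

```python
from abc import ABC, abstractmethod
from typing import Optional


class SeqDataABC(ABC):
    """Hypothetical interface for the storage classes, i.e. (1)."""

    @abstractmethod
    def get_seq_str(
        self, *, seqid: str, start: Optional[int] = None, stop: Optional[int] = None
    ) -> str:
        """Return (a segment of) one stored sequence as a string."""

    @abstractmethod
    def get_view(self, seqid: str) -> "SeqViewABC":
        """Return a view on one member of the storage."""


class SeqViewABC(ABC):
    """Hypothetical interface for the view classes, i.e. (2)."""

    @property
    @abstractmethod
    def parent(self) -> SeqDataABC:
        """The storage object that generated this view."""

    @abstractmethod
    def __getitem__(self, index: slice) -> "SeqViewABC":
        """Return a new view with an updated slice state."""

    @abstractmethod
    def __str__(self) -> str:
        """Resolve the view against its parent storage."""
```

Any class that subclasses one of these without implementing every abstract method cannot be instantiated, which is what lets SequenceCollection and Alignment rely on the interface rather than on a particular implementation.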
Responsibilities of SeqData classes
These classes store data indexed by sequence name. They provide methods for getting a view of an individual member of the storage. They also provide the methods those views use to select subsets of individual members (e.g., a slice of a sequence) for different return types, e.g., strings and numpy arrays. These classes also have a moltype, etc.
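A minimal sketch of such a storage class is shown below. The class body, the method signatures (including the step argument) and the choice of uint8 arrays of ASCII codes are assumptions for illustration only; the actual design may differ.

```python
import numpy


class SeqData:
    """Hypothetical name-indexed, read-only sequence storage."""

    def __init__(self, data, moltype="dna"):
        # Each sequence is held as a uint8 array of ASCII codes. Arrays made
        # by frombuffer from a bytes object are read-only.
        self._data = {
            name: numpy.frombuffer(seq.encode("ascii"), dtype=numpy.uint8)
            for name, seq in data.items()
        }
        self.moltype = moltype

    @property
    def names(self):
        return tuple(self._data)

    def get_seq_array(self, *, seqid, start=None, stop=None, step=None):
        """Return the selected segment as a numpy array."""
        return self._data[seqid][start:stop:step]

    def get_seq_str(self, *, seqid, start=None, stop=None, step=None):
        """Return the selected segment as a string."""
        arr = self.get_seq_array(seqid=seqid, start=start, stop=stop, step=step)
        return arr.tobytes().decode("ascii")
```

For example, `SeqData({"s1": "ACGTAC"}).get_seq_str(seqid="s1", start=1, stop=5)` selects positions 1 to 4 of the stored sequence without altering the storage.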
Responsibilities of SeqDataView classes
These are views into the SeqData classes and are bound to a .data (or .seq) attribute of Sequence and Aligned.
Objects are sliced as per SeqView; that is, the operation is used to update the slice state, but the underlying data is only modified when needed.
The view classes have a .parent attribute, the SeqData container that generated the view.
Qualifiers
SeqData is immutable, and it knows nothing of the enclosing class.
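The lazy slicing behaviour can be sketched as follows. Here the slice state is kept as a Python range, so repeated slicing only composes ranges and the parent storage is consulted once, when a string is requested. Both classes are simplified stand-ins written for this illustration, not the cogent3 implementations.

```python
class SeqData:
    """Simplified stand-in for the storage class."""

    def __init__(self, data):
        self._data = dict(data)

    def get_seq_str(self, *, seqid, start=None, stop=None, step=None):
        return self._data[seqid][start:stop:step]

    def get_view(self, seqid):
        indices = range(len(self._data[seqid]))
        return SeqDataView(parent=self, seqid=seqid, indices=indices)


class SeqDataView:
    """Hypothetical view: records slicing, holds no sequence data."""

    def __init__(self, *, parent, seqid, indices):
        self.parent = parent  # the SeqData that generated this view
        self.seqid = seqid
        self._indices = indices  # a range describing the selected positions

    def __getitem__(self, index):
        if not isinstance(index, slice):
            raise TypeError("this sketch only supports slices")
        # Slicing a range composes the new slice with the recorded one.
        return SeqDataView(
            parent=self.parent, seqid=self.seqid, indices=self._indices[index]
        )

    def __len__(self):
        return len(self._indices)

    def __str__(self):
        r = self._indices
        if len(r) == 0:
            return ""
        # A negative range stop means "run past index 0" for a reverse slice.
        stop = None if r.stop < 0 else r.stop
        return self.parent.get_seq_str(
            seqid=self.seqid, start=r.start, stop=stop, step=r.step
        )
```

Note that the view only refers back to its parent; the storage object holds no reference to the views or to any enclosing collection.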
Refactor of SequenceCollection
At present, the SequenceCollection.seqs attribute points to Sequence instances. The Sequence._seq attribute is a SeqView instance that contains the actual string.
We will change the above so that:
- SequenceCollection has a ._seqdata attribute which is a SeqData instance.
- SequenceCollection.seqs is a property that indexes into the SeqData instance, either via sequence name or integer.
- Indexing returns a Sequence, but now Sequence._seq is a SeqDataView, which is linked to the parent storage.
We need to modify SequenceCollection to take a SeqData instance. All conversions of standard sequence data into a SeqData instance should be handled via make_unaligned_seqs(new_style=True).
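The intended relationships can be sketched as below. All classes here are simplified stand-ins that reuse the names from the text for readability; the SeqsIndexer helper is hypothetical, and the sketch constructs the SeqData directly rather than going through make_unaligned_seqs.

```python
class SeqData:
    """Simplified stand-in for the storage class."""

    def __init__(self, data):
        self._data = dict(data)
        self.names = tuple(self._data)

    def get_seq_str(self, seqid):
        return self._data[seqid]


class SeqDataView:
    """Simplified stand-in: a view linked to its parent storage."""

    def __init__(self, parent, seqid):
        self.parent = parent
        self.seqid = seqid

    def __str__(self):
        return self.parent.get_seq_str(self.seqid)


class Sequence:
    """Simplified stand-in whose _seq attribute is a view, not a string."""

    def __init__(self, name, seq):
        self.name = name
        self._seq = seq

    def __str__(self):
        return str(self._seq)


class SeqsIndexer:
    """Hypothetical helper that resolves a name or integer to a Sequence."""

    def __init__(self, seqdata):
        self._seqdata = seqdata

    def __len__(self):
        return len(self._seqdata.names)

    def __getitem__(self, key):
        name = self._seqdata.names[key] if isinstance(key, int) else key
        if name not in self._seqdata.names:
            raise KeyError(name)
        return Sequence(name, SeqDataView(self._seqdata, name))


class SequenceCollection:
    def __init__(self, seqdata):
        self._seqdata = seqdata

    @property
    def seqs(self):
        return SeqsIndexer(self._seqdata)
```

With this arrangement, `coll.seqs["b"]` and `coll.seqs[1]` both build a Sequence on demand, and each one's `_seq.parent` is the collection's single SeqData instance.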
Refactor of Alignment and Aligned
At present, the Alignment.seqs attribute points to Aligned instances. The Aligned.data attribute is a Sequence and the Aligned.map attribute is an IndelMap instance.
The challenge here is that alignments can be sliced, e.g. aln[::3]. If we mirror the design for unaligned sequences, then we would make Alignment.seqs an AlignedSeqData instance, which means slicing of the Alignment needs to be recordable somehow by AlignedSeqData. One way to handle this is to have a single attribute which is a SliceView (perhaps called SliceRecord) that can be applied to data not bound to it.
The advantage of this approach is that the slice operation only needs to be applied to a single SliceRecord object.
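A slice record that is not bound to any data could look something like the sketch below. The class, its for_length constructor and its apply method are all hypothetical, written only to illustrate the idea of recording a slice once and applying it to every row.

```python
class SliceRecord:
    """Hypothetical record of slice operations that holds no sequence data."""

    def __init__(self, indices):
        self._indices = indices  # a range describing the selected positions

    @classmethod
    def for_length(cls, length):
        """Start with a record that selects every position."""
        return cls(range(length))

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, index):
        if not isinstance(index, slice):
            raise TypeError("this sketch only supports slices")
        # Slicing a range composes the new slice with the recorded one.
        return SliceRecord(self._indices[index])

    def apply(self, data):
        """Select the recorded positions from any sliceable data."""
        r = self._indices
        if len(r) == 0:
            return data[:0]
        # A negative range stop means "run past index 0" for a reverse slice.
        stop = None if r.stop < 0 else r.stop
        return data[r.start:stop:r.step]


# One record, updated once, then applied to every row of the alignment.
rows = {"s1": "ATGCCGTAA", "s2": "AT-CCGGAA"}
record = SliceRecord.for_length(9)[::3]
sliced = {name: record.apply(seq) for name, seq in rows.items()}
```

Because apply accepts any sliceable object of the recorded length, the same record works for strings, lists and numpy arrays.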
When a user indexes Alignment.seqs, AlignedSeqData returns an Aligned instance which has a single .seq attribute. This is now an AlignedSeqDataView. This view does not record slicing, but relies on that being done by AlignedSeqData when the AlignedSeqData.get_seq_str() or AlignedSeqData.get_seq_array() methods are called. Those methods return both the IndelMap and the Sequence.
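The division of labour between the storage and the view might be sketched as follows. This is a deliberately reduced illustration: it stores gapped strings, keeps the slice state as a range, and omits the IndelMap entirely. The way slicing produces a new AlignedSeqData that shares the underlying data is an assumption of this sketch, not something stated above.

```python
class AlignedSeqData:
    """Hypothetical aligned storage that also holds the single slice state."""

    def __init__(self, gapped, indices=None):
        self._gapped = gapped  # shared between slices, never modified
        length = len(next(iter(gapped.values())))
        self._indices = range(length) if indices is None else indices

    def __getitem__(self, index):
        if not isinstance(index, slice):
            raise TypeError("this sketch only supports slices")
        # Only the slice state changes; the sequence data is shared.
        return AlignedSeqData(self._gapped, self._indices[index])

    def get_seq_str(self, seqid):
        """Apply the recorded slice when the data is actually requested."""
        r = self._indices
        if len(r) == 0:
            return ""
        # A negative range stop means "run past index 0" for a reverse slice.
        stop = None if r.stop < 0 else r.stop
        return self._gapped[seqid][r.start:stop:r.step]

    def get_view(self, seqid):
        return AlignedSeqDataView(self, seqid)


class AlignedSeqDataView:
    """Hypothetical view that records no slicing of its own."""

    def __init__(self, parent, seqid):
        self.parent = parent
        self.seqid = seqid

    def __str__(self):
        return self.parent.get_seq_str(self.seqid)
```

In this sketch a strided slice such as `asd[::3]` touches only one range object, however many sequences the alignment contains.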
Tasks
- Discuss designing an abstract base class to define an interface for recording slice history
- Merge changes from the develop branch into seq-collections-refactor
- Write out the state flow for a slicing operation on an Alignment