
Memory consumption for extremely large search spaces #97

Open
grosenberger opened this issue Oct 26, 2023 · 4 comments

@grosenberger

Hi Michael,

When using Sage with very large search spaces (e.g. many PTMs, non-specific digestion, etc.), the memory consumption frequently goes beyond the available resources on standard workstations. In such scenarios, MSFragger partitions the search space and iteratively processes them.

I was wondering whether similar functionality could be implemented in Sage. For example, a "batch size" parameter could be set manually (or estimated from available memory) to partition the search space. There are several ways this could be implemented; one option would be to partition the candidate peptide precursors by precursor m/z and score different parts of the data against different partitions. For DIA, the partitions could correspond to the precursor isolation windows; for DDA, it might make sense to simply select the m/z range according to the batch size. The main search algorithm would then iterate over the partitions for scoring, and the per-partition results would be assembled before ML and statistical validation. The search space could be generated partition-by-partition on the fly and kept in memory, or alternatively exported to disk (similar to how MSFragger does it).
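
For illustration, a rough sketch of the kind of m/z partitioning I have in mind (hypothetical only; the batch count and ranges below are not existing Sage parameters):

# Hypothetical sketch of partitioning the precursor m/z range into batches.
# None of these names correspond to existing Sage options.

def make_mz_partitions(min_mz: float, max_mz: float, n_batches: int):
    """Split the precursor m/z range into contiguous, equal-width partitions."""
    width = (max_mz - min_mz) / n_batches
    return [(min_mz + i * width, min_mz + (i + 1) * width) for i in range(n_batches)]

# e.g. 10 partitions covering 400-1600 m/z; each partition of the peptide
# database would be generated, scored against the matching spectra, and
# released before the next one, with the per-partition PSMs pooled prior
# to ML and statistical validation.
for low, high in make_mz_partitions(400.0, 1600.0, 10):
    print(f"partition: {low:.1f}-{high:.1f} m/z")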

What do you think about these options? Would there be a preferred solution?

Best regards,
George

@lazear (Owner) commented Oct 26, 2023

Hi George,

I agree that it's a necessity for large search spaces. I have been messing around with some internal database splitting, but it's not ready for prime-time yet.

In the meantime, it's possible to perform external database splitting: generate slices of the FASTA file, run Sage multiple times, then combine the results and rescore. Perhaps not ideal, but this is essentially what would be done with internal database splitting as well. See below for an example Python script accomplishing this.

import subprocess

import pandas as pd
from Bio import SeqIO

SLICES = 5

# Load the full protein database once.
records = list(SeqIO.parse("fasta/human_contaminant.fasta", format="fasta"))

# Ceiling division so the final slice also picks up the remainder of the records.
N = -(-len(records) // SLICES)

for i in range(SLICES):
    # Write one FASTA slice to disk.
    with open(f"fasta/human_slice_{i}.fasta", "w") as f:
        SeqIO.write(records[i * N : (i + 1) * N], f, format="fasta")

    # Search this slice only, writing a Percolator-style .pin file.
    cmd = [
        "sage",
        "search.json",
        "-o",
        f"semi_{i}",
        "-f",
        f"fasta/human_slice_{i}.fasta",
        "--write-pin",
        "HeLa_chytry_HCD_1.mzML.gz",
    ]
    subprocess.run(cmd, check=True)

# Combine the per-slice results, keeping only the best-scoring PSM per spectrum.
dfs = [pd.read_csv(f"semi_{i}/results.sage.pin", sep="\t") for i in range(SLICES)]

pd.concat(dfs).sort_values(by="ln(hyperscore)", ascending=False).drop_duplicates(
    subset=["FileName", "ScanNr"], keep="first"
).to_csv("sliced.pin", sep="\t", index=False)
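
The combined sliced.pin can then be rescored in a single pass, for example with mokapot (a separate tool, not part of Sage; sketched here just as one option - Percolator works on the same file as well):

# Rescore the combined PIN across all slices in one pass (assumes mokapot is installed).
import mokapot

psms = mokapot.read_pin("sliced.pin")
results, models = mokapot.brew(psms)
results.to_txt()  # writes PSM- and peptide-level results with q-values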

@patrick-willems commented Nov 2, 2023

Hey, just a question related to this issue. Could it be that by sorting on the hyperscore and retaining only the best match you might lose hits (this also isn't compatible with chimeric searching)? Would a valid alternative be to split the search into consecutive searches over precursor m/z ranges (against the whole FASTA, instead of splitting the FASTA)? I once tried this (by generating alternative JSONs in a loop), but the memory consumption did not decrease.

@grosenberger (Author)

Thanks for the feedback! We have used similar workarounds before; FragPipe also uses similar mechanisms for very large databases.

@lazear (Owner) commented Nov 2, 2023

> Hey, just a question related to this issue. Could it be that by sorting on the hyperscore and retaining only the best match you might lose hits (this also isn't compatible with chimeric searching)? Would a valid alternative be to split the search into consecutive searches over precursor m/z ranges (against the whole FASTA, instead of splitting the FASTA)? I once tried this (by generating alternative JSONs in a loop), but the memory consumption did not decrease.

Interesting that this didn't decrease memory consumption - setting peptide_min_mass and peptide_max_mass will restrict the number of final peptides kept and fragments generated (the filter is applied after digestion).
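
For example, something along these lines (a sketch only; it assumes peptide_min_mass and peptide_max_mass live under the "database" section of search.json, and the mass windows would need to be adjusted to your search):

# Sketch: consecutive searches over peptide mass windows, each against the
# whole FASTA, by rewriting peptide_min_mass / peptide_max_mass in the config.
# Assumes these keys sit under the "database" section of search.json.
import json
import subprocess

windows = [(500.0, 1500.0), (1500.0, 2500.0), (2500.0, 3500.0), (3500.0, 5000.0)]

with open("search.json") as f:
    base = json.load(f)

for i, (lo, hi) in enumerate(windows):
    cfg = dict(base)
    cfg["database"] = dict(base["database"])
    cfg["database"]["peptide_min_mass"] = lo
    cfg["database"]["peptide_max_mass"] = hi

    with open(f"search_mass_{i}.json", "w") as f:
        json.dump(cfg, f, indent=2)

    # Same invocation pattern as the slicing script above, one mass window per run.
    subprocess.run(
        ["sage", f"search_mass_{i}.json", "-o", f"mass_{i}", "--write-pin",
         "HeLa_chytry_HCD_1.mzML.gz"],
        check=True,
    )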

That is a valid point about chimeric searches, but those are already kind of heuristic (subtractive method vs something potentially smarter). One potential alternative would be to pre-digest the FASTA database (and pass in "$" as the cleavage enzyme to Sage), and then chunk the FASTA database by peptide mass. That should help with improving chimeric searches and possibly make it go faster as well - this is basically what would be implemented internally.
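
For reference, a rough sketch of that pre-digest-and-chunk idea (it assumes pyteomics for the in-silico digestion and mass calculation; the mass boundaries, enzyme rule, and file names here are arbitrary):

# Pre-digest a FASTA and chunk the resulting peptides by monoisotopic mass.
# Each chunk can then be searched with Sage using "$" as the cleavage site,
# so the entries are treated as already-digested peptides.
# Assumes Biopython and pyteomics are installed; boundaries are arbitrary.
from Bio import SeqIO
from pyteomics import mass, parser

MASS_EDGES = [500.0, 1500.0, 2500.0, 3500.0, 5000.0]
chunks = {i: set() for i in range(len(MASS_EDGES) - 1)}

for record in SeqIO.parse("fasta/human_contaminant.fasta", format="fasta"):
    peptides = parser.cleave(
        str(record.seq), parser.expasy_rules["trypsin"], missed_cleavages=2
    )
    for pep in peptides:
        # Skip peptides containing non-standard residues (e.g. X, B, Z).
        if any(aa not in mass.std_aa_mass for aa in pep):
            continue
        m = mass.fast_mass(pep)
        for i in range(len(MASS_EDGES) - 1):
            if MASS_EDGES[i] <= m < MASS_EDGES[i + 1]:
                chunks[i].add(pep)
                break

# Write each mass chunk as a peptide-level FASTA; each chunk would then be
# searched with a config whose enzyme cleave_at is set to "$" so Sage does
# not digest the entries further.
for i, peptides in chunks.items():
    with open(f"fasta/peptides_chunk_{i}.fasta", "w") as f:
        for j, pep in enumerate(sorted(peptides)):
            f.write(f">chunk{i}_pep{j}\n{pep}\n")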
