
Memory consumption for extremely large search spaces #97

Open
grosenberger opened this issue Oct 26, 2023 · 4 comments

@grosenberger

Hi Michael,

When using Sage with very large search spaces (e.g. many PTMs, non-specific digestion, etc.), the memory consumption frequently goes beyond the available resources on standard workstations. In such scenarios, MSFragger partitions the search space and iteratively processes them.

I was wondering whether similar functionality could be implemented in Sage. For example, a "batch size" parameter could be set manually (or estimated from available memory) to partition the search space. There are several ways this could be implemented; one option would be to partition the candidate peptide precursors by precursor m/z and score different parts of the data against different partitions. For DIA, the partitions could correspond to the precursor isolation windows; for DDA, it might make sense to simply select the m/z range according to the batch size. The main search algorithm would then iterate over the partitions for scoring, and the per-partition results would be assembled before ML and statistical validation. The search space could be generated partition-by-partition on the fly and kept in memory, or alternatively exported to disk (similar to how MSFragger does it).
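
For illustration, a rough sketch of the kind of m/z partitioning I have in mind (hypothetical only; the batch count and ranges below are not existing Sage parameters):

# Hypothetical sketch of partitioning the precursor m/z range into batches.
# None of these names correspond to existing Sage options.

def make_mz_partitions(min_mz: float, max_mz: float, n_batches: int):
    """Split the precursor m/z range into contiguous, equal-width partitions."""
    width = (max_mz - min_mz) / n_batches
    return [(min_mz + i * width, min_mz + (i + 1) * width) for i in range(n_batches)]

# e.g. 10 partitions covering 400-1600 m/z; each partition of the peptide
# database would be generated, scored against the matching spectra, and
# released before the next one, with the per-partition PSMs pooled prior
# to ML and statistical validation.
for low, high in make_mz_partitions(400.0, 1600.0, 10):
    print(f"partition: {low:.1f}-{high:.1f} m/z")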

What do you think about these options? Would there be a preferred solution?

Best regards,
George

@lazear (Owner) commented Oct 26, 2023

Hi George,

I agree that it's a necessity for large search spaces. I have been messing around with some internal database splitting, but it's not ready for prime-time yet.

In the meantime, it's possible to perform external database splitting: generate slices of the FASTA file, run Sage multiple times, then combine the results and rescore. Perhaps not ideal, but this is essentially what would be done with internal database splitting as well. See below for an example Python script accomplishing this.

import subprocess

import pandas as pd
from Bio import SeqIO

SLICES = 5

# Load the full protein database once.
records = list(SeqIO.parse("fasta/human_contaminant.fasta", format="fasta"))

# Ceiling division so the final slice also picks up the remainder of the records.
N = -(-len(records) // SLICES)

for i in range(SLICES):
    # Write one FASTA slice to disk.
    with open(f"fasta/human_slice_{i}.fasta", "w") as f:
        SeqIO.write(records[i * N : (i + 1) * N], f, format="fasta")

    # Search this slice only, writing a Percolator-style .pin file.
    cmd = [
        "sage",
        "search.json",
        "-o",
        f"semi_{i}",
        "-f",
        f"fasta/human_slice_{i}.fasta",
        "--write-pin",
        "HeLa_chytry_HCD_1.mzML.gz",
    ]
    subprocess.run(cmd, check=True)

# Combine the per-slice results, keeping only the best-scoring PSM per spectrum.
dfs = [pd.read_csv(f"semi_{i}/results.sage.pin", sep="\t") for i in range(SLICES)]

pd.concat(dfs).sort_values(by="ln(hyperscore)", ascending=False).drop_duplicates(
    subset=["FileName", "ScanNr"], keep="first"
).to_csv("sliced.pin", sep="\t", index=False)
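
The combined sliced.pin can then be rescored in a single pass, for example with mokapot (a separate tool, not part of Sage; sketched here just as one option - Percolator works on the same file as well):

# Rescore the combined PIN across all slices in one pass (assumes mokapot is installed).
import mokapot

psms = mokapot.read_pin("sliced.pin")
results, models = mokapot.brew(psms)
results.to_txt()  # writes PSM- and peptide-level results with q-values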

@patrick-willems commented Nov 2, 2023

Hey, just a question related to this issue. Could it be that by sorting on the hyperscore and retaining only the best match you might lose hits (this also isn't compatible with chimeric searching)? Would a valid alternative be to split the search into consecutive searches over precursor m/z ranges (against the whole FASTA, instead of splitting the FASTA)? I once tried this (by generating alternative JSONs in a loop), but the memory consumption did not decrease.

@grosenberger (Author)

Thanks for the feedback! We have used similar workarounds before; FragPipe also uses similar mechanisms for very large databases.

@lazear (Owner) commented Nov 2, 2023

> Hey, just a question related to this issue. Could it be that by sorting on the hyperscore and retaining only the best match you might lose hits (this also isn't compatible with chimeric searching)? Would a valid alternative be to split the search into consecutive searches over precursor m/z ranges (against the whole FASTA, instead of splitting the FASTA)? I once tried this (by generating alternative JSONs in a loop), but the memory consumption did not decrease.

Interesting that this didn't decrease memory consumption - setting peptide_min_mass and peptide_max_mass will restrict the number of final peptides kept and fragments generated (the filter is applied after digestion).
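
For example, something along these lines (a sketch only; it assumes peptide_min_mass and peptide_max_mass live under the "database" section of search.json, and the mass windows would need to be adjusted to your search):

# Sketch: consecutive searches over peptide mass windows, each against the
# whole FASTA, by rewriting peptide_min_mass / peptide_max_mass in the config.
# Assumes these keys sit under the "database" section of search.json.
import json
import subprocess

windows = [(500.0, 1500.0), (1500.0, 2500.0), (2500.0, 3500.0), (3500.0, 5000.0)]

with open("search.json") as f:
    base = json.load(f)

for i, (lo, hi) in enumerate(windows):
    cfg = dict(base)
    cfg["database"] = dict(base["database"])
    cfg["database"]["peptide_min_mass"] = lo
    cfg["database"]["peptide_max_mass"] = hi

    with open(f"search_mass_{i}.json", "w") as f:
        json.dump(cfg, f, indent=2)

    # Same invocation pattern as the slicing script above, one mass window per run.
    subprocess.run(
        ["sage", f"search_mass_{i}.json", "-o", f"mass_{i}", "--write-pin",
         "HeLa_chytry_HCD_1.mzML.gz"],
        check=True,
    )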

That is a valid point about chimeric searches, but those are already kind of heuristic (subtractive method vs something potentially smarter). One potential alternative would be to pre-digest the FASTA database (and pass in "$" as the cleavage enzyme to Sage), and then chunk the FASTA database by peptide mass. That should help with improving chimeric searches and possibly make it go faster as well - this is basically what would be implemented internally.
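
For reference, a rough sketch of that pre-digest-and-chunk idea (it assumes pyteomics for the in-silico digestion and mass calculation; the mass boundaries, enzyme rule, and file names here are arbitrary):

# Pre-digest a FASTA and chunk the resulting peptides by monoisotopic mass.
# Each chunk can then be searched with Sage using "$" as the cleavage site,
# so the entries are treated as already-digested peptides.
# Assumes Biopython and pyteomics are installed; boundaries are arbitrary.
from Bio import SeqIO
from pyteomics import mass, parser

MASS_EDGES = [500.0, 1500.0, 2500.0, 3500.0, 5000.0]
chunks = {i: set() for i in range(len(MASS_EDGES) - 1)}

for record in SeqIO.parse("fasta/human_contaminant.fasta", format="fasta"):
    peptides = parser.cleave(
        str(record.seq), parser.expasy_rules["trypsin"], missed_cleavages=2
    )
    for pep in peptides:
        # Skip peptides containing non-standard residues (e.g. X, B, Z).
        if any(aa not in mass.std_aa_mass for aa in pep):
            continue
        m = mass.fast_mass(pep)
        for i in range(len(MASS_EDGES) - 1):
            if MASS_EDGES[i] <= m < MASS_EDGES[i + 1]:
                chunks[i].add(pep)
                break

# Write each mass chunk as a peptide-level FASTA; each chunk would then be
# searched with a config whose enzyme cleave_at is set to "$" so Sage does
# not digest the entries further.
for i, peptides in chunks.items():
    with open(f"fasta/peptides_chunk_{i}.fasta", "w") as f:
        for j, pep in enumerate(sorted(peptides)):
            f.write(f">chunk{i}_pep{j}\n{pep}\n")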
