BRENDA content collaboration #2

Open
Midnighter opened this issue Feb 20, 2018 · 20 comments

@Midnighter

Hi,

I'm currently working on upgrading my parser for the BRENDA flat file download. I've implemented a few SQLAlchemy models that seemed fitting for the content. Is there any interest on your side in the content of BRENDA?

@jonrkarr
Member

Rik van Rosmalen has also written a BRENDA parser
https://gitlab.com/wurssb/brenda-parser

Currently, it dumps all of BRENDA into either a SQLite DB or a JSON file.

One of the main issues right now is that BRENDA's download does not include a metabolite reference table or any cross-references. However, UniChem does cross-reference metabolites to BRENDA via InChI, and has all their data open. This could make integration possible.

@jonrkarr
Member

@Midnighter, we're finally starting to work on BRENDA. We're trying to determine if BRENDA contains a record of the reaction associated with each K_cat and K_m (which SABIO-RK clearly displays). Neither the website nor the text file shows this information, but BRENDA's SBML output seems to contain it. I suspect that the SBML output contains inferred kinetic parameters, rather than directly measured kinetic constants. Do you know what information is encoded in the SBML output?

Any code we write will be shared via this repo.

We tried to use Rik's code. Unfortunately, it appears to be out of date with respect to the current format of the BRENDA text file.

@Midnighter
Author

> We're trying to determine if BRENDA contains a record of the reaction associated with each K_cat and K_m

I don't fully understand what you want to achieve. Given a specific Kcat or Km value, you want to list all reactions (by EC-code) that have this value? This should be possible with a SQL query, however, there are many reactions in BRENDA that specify Kcat and Km as ranges rather than fixed values. The same EC-code can also have different Kcat and Km values in different organisms, of course.

I still haven't finished my BRENDA work as it was not high priority to me. I do have a branch that uses pyparsing to go over the flat file and it's quite promising. I can try to deliver a working version by the end of May.

@jonrkarr
Member

SABIO-RK contains information about the exact reaction associated with each measured kinetic parameter. In addition, SABIO-RK often presents pairs of kinetic parameters that were measured together (e.g., paired k_cat, K_m).

In contrast, the BRENDA website, text file, and SOAP interface present coarser information. This is why we have preferred to work with SABIO-RK, even though SABIO-RK is also difficult to scrape. The BRENDA website only displays the EC number associated with each kinetic measurement, and the website doesn't present pairs of parameters.

It appears that BRENDA annotates reactions more coarsely than SABIO-RK. However, BRENDA's SBML output suggests that the underlying BRENDA database might have finer-grained reaction information than what is presented in the BRENDA website, text file, and SOAP interface. We haven't found any documentation about the SBML output. We're trying to understand what those files mean, and whether they offer a way to pull more information out of BRENDA than what is provided in the text file.

@Midnighter
Author

I have not found a way to reliably scrape all SBML output files from BRENDA, as I think this previously required paid access. It would, of course, be preferable to the terrible text format.

With regard to the information that you are looking for: BRENDA gives entries for the K_cat value divided by the K_m value, for example,

KKM	#2# 314 (#2# recombinant isozyme, pH 7.5, 30°C <45>) <45>

So one could look at the matching K_m value (by protein and citation), in this case

KM	#2# 0.165 {GMP}  (#2# recombinant isozyme, pH 7.5, 30°C <45>) <45>

FYI, this is for EC-code 2.7.4.8 and this specific entry is for

PR	#2# Bacillus subtilis   <45>

So that would give you what you are looking for?
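As a rough illustration of the matching described above, here is a minimal Python sketch that splits a flat-file entry into its parts and compares the KKM and KM entries by protein id, comment, and citation. The regular expression, the `Entry` type, and the `parse_entry` and `match_key` helpers are assumptions made for this example, based only on the three entries quoted in this comment. They are not a full grammar for the BRENDA text file.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: FIELD <tab> #proteins# value {substrate} (comment) <refs>
ENTRY_RE = re.compile(
    r"^(?P<field>[A-Z]+)\s+"             # field code such as KM, TN, KKM
    r"#(?P<proteins>[\d,]+)#\s+"         # protein ids between '#'
    r"(?P<value>\S+)"                    # value, kept as text
    r"(?:\s+\{(?P<substrate>[^}]*)\})?"  # optional {substrate}
    r"\s*(?:\((?P<comment>.*)\))?"       # optional (comment)
    r"\s*<(?P<refs>[\d,]+)>\s*$"         # literature references
)


class Entry(NamedTuple):
    field: str
    proteins: tuple
    value: str
    substrate: Optional[str]
    comment: Optional[str]
    refs: tuple


def parse_entry(line: str) -> Entry:
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    return Entry(
        field=match["field"],
        proteins=tuple(match["proteins"].split(",")),
        value=match["value"],
        substrate=match["substrate"],
        comment=match["comment"],
        refs=tuple(match["refs"].split(",")),
    )


def match_key(entry: Entry) -> tuple:
    # Entries are treated as related when protein, comment, and citation agree.
    return (entry.proteins, entry.comment, entry.refs)


kkm = parse_entry("KKM\t#2# 314 (#2# recombinant isozyme, pH 7.5, 30°C <45>) <45>")
km = parse_entry("KM\t#2# 0.165 {GMP}  (#2# recombinant isozyme, pH 7.5, 30°C <45>) <45>")
print(match_key(kkm) == match_key(km))
```

Values are kept as strings on purpose, since BRENDA also reports ranges rather than single numbers.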

@jonrkarr
Member

Basically, we're trying to infer the link between the SP entries and the TN, KM, and KKM entries.

I don't think the BRENDA text files provide enough information to reconstruct this.

  • Each PR entry can be associated with multiple SP entries
  • Each PR entry can have multiple associated KM, TN, and KKM entries, far more than the number of substrates or products of a single reaction.
  • Each RF entry can be associated with many PR, KM, TN, and KKM entries

This is what motivated us to look at the other BRENDA outputs, to try to extract this mapping out of BRENDA.

@jonrkarr
Member

I'll contact BRENDA to ask them about the SBML output. I can share what I learn.

@Midnighter
Author

It would be super nice to just get a database dump rather than having to jump through so many hoops.

@jonrkarr
Member

I'm looking to understand if the text file lacks relationships between KM and TN entries that the underlying database captures, and if these relationships are captured, I'd like to obtain this information.

A database dump would be nice. Any format with this relational information would be an improvement.

@Midnighter
Author

I still think it's possible to tell these apart, however, if you look at the comment in each entry.

TN	#2# 52 {GMP}  (#2# recombinant isozyme, pH 7.5, 30°C <45>) <45>

There is only one entry in each section that has the same protein reference #2#, comment (...) and literature reference <45>.

I'm not sure what you gain from the SP entry. The substrate is already provided in the KM and TN entries.

So if you start with KM or TN entries you should be able to identify all the information that you need?

I've only looked at a few examples, though, so I'm easily proven wrong. Also, it'd be painful to parse the information in this way so something structured is definitely preferable 👍
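The pairing idea in this comment could be sketched roughly as follows. The snippet assumes the entries have already been parsed into plain tuples of (protein id, value, substrate, comment, reference). The helper names `index_by_key` and `pair_unique` are hypothetical. A KM/TN pair is reported only when the key occurs exactly once in each section, so ambiguous keys are skipped rather than guessed.

```python
from collections import defaultdict

# Pre-parsed entries: (protein id, value, substrate, comment, reference).
km_entries = [("2", "0.165", "GMP", "recombinant isozyme, pH 7.5, 30°C", "45")]
tn_entries = [("2", "52", "GMP", "recombinant isozyme, pH 7.5, 30°C", "45")]


def index_by_key(entries):
    """Group values by (protein, substrate, comment, reference)."""
    index = defaultdict(list)
    for protein, value, substrate, comment, ref in entries:
        index[(protein, substrate, comment, ref)].append(value)
    return index


def pair_unique(km_entries, tn_entries):
    """Pair KM and TN values whose key is unique in both sections."""
    km_index = index_by_key(km_entries)
    tn_index = index_by_key(tn_entries)
    pairs = {}
    for key, km_values in km_index.items():
        tn_values = tn_index.get(key, [])
        if len(km_values) == 1 and len(tn_values) == 1:
            pairs[key] = (km_values[0], tn_values[0])
    return pairs


for key, (km_value, tn_value) in pair_unique(km_entries, tn_entries).items():
    print(key, "KM =", km_value, "TN =", tn_value)
```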

@jonrkarr
Member

It shouldn't be this hard.

Inferring the reaction associated with each KM, TN entry from the substrate information

The substrate of each KM or TN entry doesn't contain information about the entire reaction. The reaction can't be inferred from the substrate because the metabolite can participate in multiple reactions.

For example, you can't infer the reaction associated with this TN

TN      #114# 1646 {NADH}  (#114# cosubstrate acetaldehyde, pH 8.0, 60°C <215>)
        <215>

because multiple SP entries involve NADH

SP      #96# hexaldehyde + NADH + H+ = 1-hexanol + NAD+ (#96# 7% activity
        compared to benzyl alcohol <156>) <156>
SP      #96# hydrocinnamaldehyde + NADH + H+ = hydrocinnamyl alcohol + NAD+
        (#96# 12% activity compared to benzyl alcohol <156>) {r} <156>
SP      #96# nonyl aldehyde + NADH + H+ = 1-nonanol + NAD+ (#96# 25% activity
        compared to benzyl alcohol <156>) <156>
SP      #96# octyl aldehyde + NADH + H+ = 1-octanol + NAD+ (#96# 29% activity
        compared to benzyl alcohol <156>) <156>
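To make the ambiguity concrete, the following Python sketch lists every quoted SP equation in which a given metabolite participates. The equations are copied from the SP entries above. The `participants` and `reactions_with` helpers are hypothetical names, and splitting on " + " and " = " (with surrounding spaces) is an assumption that happens to hold for these four equations.

```python
sp_equations = [
    "hexaldehyde + NADH + H+ = 1-hexanol + NAD+",
    "hydrocinnamaldehyde + NADH + H+ = hydrocinnamyl alcohol + NAD+",
    "nonyl aldehyde + NADH + H+ = 1-nonanol + NAD+",
    "octyl aldehyde + NADH + H+ = 1-octanol + NAD+",
]


def participants(equation):
    """Return the set of metabolite names on both sides of an equation."""
    names = set()
    for side in equation.split(" = "):
        # Split on ' + ' with spaces so that names like 'H+' stay intact.
        names.update(name.strip() for name in side.split(" + "))
    return names


def reactions_with(metabolite, equations):
    """Return every equation in which the metabolite takes part."""
    return [eq for eq in equations if metabolite in participants(eq)]


candidates = reactions_with("NADH", sp_equations)
print(len(candidates), "candidate reactions for a TN entry on NADH")
```

With more than one candidate, the substrate alone cannot identify the reaction.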

Inferring pairs of KM, TN, KKM, SP from unique tuples of substrates, comments, and references

This is an interesting idea. This might work for inferring relationships between KM and TN entries. I don't think this will work for inferring relationships between KKM and other entries because they don't include substrates. The SP entries don't appear to have the same comments as KM and TN entries.

Example from 1.1.1.1:

KKM	#115# 3.6 (#115# cosubstrate NADP+, pH 8.0, 60°C <215>) <215>
KKM	#115# 67.2 (#115# cosubstrate NADP+, pH 8.0, 60°C <215>) <215>
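A small Python sketch of why these two entries defeat tuple matching: both share the same protein id, comment, and reference, so the key is no longer unique. The tuples below are transcribed by hand from the two KKM lines above, and the tuple layout is an assumption made for the example.

```python
from collections import Counter

# (protein id, value, comment, reference); KKM entries carry no substrate.
kkm_entries = [
    ("115", "3.6", "cosubstrate NADP+, pH 8.0, 60°C", "215"),
    ("115", "67.2", "cosubstrate NADP+, pH 8.0, 60°C", "215"),
]

# Count how often each (protein, comment, reference) key occurs.
key_counts = Counter(
    (protein, comment, ref) for protein, _value, comment, ref in kkm_entries
)

# Keys seen more than once cannot be matched to a single KM or TN entry.
ambiguous = [key for key, count in key_counts.items() if count > 1]
print(ambiguous)
```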

@Midnighter
Author

Okay, that's a clear counterexample. Let's see if you get a reply from BRENDA. I tried once some years back and never got an answer. I was probably not persistent enough.

Given the way that the textual data is structured, I would definitely manually check a number of examples to see if the associations presented by BRENDA are correct...

@jonrkarr
Member

FYI, I think the SBML output would also be difficult to use. It times out easily. You'd have to figure out how to make the queries small enough not to time out. One possibility is to iterate over each EC and each organism.

for ec_code in ec_codes:
    for organism in organisms:
        # get_sbml is a placeholder for one small SBML export request
        get_sbml(ec_code, organism)
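A slightly fuller sketch of that loop, assuming the caller supplies the function that performs a single request. Nothing here reflects an actual BRENDA API: `fetch_all` and `fake_fetch` are hypothetical, and the stub only simulates a timeout so that the bookkeeping can be shown. Pairs that time out are collected so they can be retried or split further.

```python
from itertools import product


def fetch_all(ec_codes, organisms, fetch):
    """Call fetch(ec_code, organism) for every pair; collect timeouts."""
    results, timed_out = {}, []
    for ec_code, organism in product(ec_codes, organisms):
        try:
            results[(ec_code, organism)] = fetch(ec_code, organism)
        except TimeoutError:
            timed_out.append((ec_code, organism))
    return results, timed_out


def fake_fetch(ec_code, organism):
    # Stand-in for a real request; pretends one query is too large.
    if ec_code == "1.1.1.1":
        raise TimeoutError
    return f"<sbml for {ec_code} / {organism}>"


results, timed_out = fetch_all(
    ["2.7.4.8", "1.1.1.1"], ["Bacillus subtilis"], fake_fetch
)
print(len(results), "fetched;", len(timed_out), "timed out")
```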

@jonrkarr
Member

Also, the SBML output is missing some of the information shown in the HTML preview of the SBML:

  • No enzyme info (UniProt id)
  • No comments
  • No references

The SBML does give insight into how to parse temperature and pH from the comments:

  • r'(^|,[ \n])(\d+(\.\d+)?)°C(,[ \n]|$)'
  • r'(^|,[ \n])pH[ \n](\d+(\.\d+)?)(,[ \n]|$)'
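As a quick check, the two patterns above can be applied to a comment quoted earlier in this thread. One assumption in this sketch: the protein marker (#2#) and reference marker (<45>) are stripped first, because the temperature pattern expects the value to be followed by a comma or the end of the string. `MARKER_RE` and `parse_conditions` are hypothetical names.

```python
import re

# The two patterns quoted above, unchanged.
TEMPERATURE_RE = re.compile(r'(^|,[ \n])(\d+(\.\d+)?)°C(,[ \n]|$)')
PH_RE = re.compile(r'(^|,[ \n])pH[ \n](\d+(\.\d+)?)(,[ \n]|$)')

# Assumption: '#n#' protein markers and '<n>' reference markers are removed
# before matching, otherwise a trailing ' <45>' blocks the end-of-string anchor.
MARKER_RE = re.compile(r'#[\d,]+#|<[\d,]+>')


def parse_conditions(comment):
    """Return (temperature in °C, pH) as floats, or None where not found."""
    text = MARKER_RE.sub('', comment).strip()
    temperature = TEMPERATURE_RE.search(text)
    ph = PH_RE.search(text)
    return (
        float(temperature.group(2)) if temperature else None,
        float(ph.group(2)) if ph else None,
    )


print(parse_conditions("#2# recombinant isozyme, pH 7.5, 30°C <45>"))
print(parse_conditions("#96# 7% activity compared to benzyl alcohol <156>"))
```

Comments without a recognizable temperature or pH simply yield None for that value.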

@jonrkarr
Member

I'm looking into your suggestion about matching tuples of protein ids, comments, and references. This might work for pairing k_cats with K_ms, but I don't think this works for inferring the reaction associated with each k_cat/K_m. It doesn't look like these relationships have been encoded into the text file. While you can find pairs of entries with overlapping protein ids, substrates, comments, and references, it appears to be difficult to unambiguously resolve relationships. I think trying to infer relationships is likely to infer false relationships that are not present in the underlying database. At least for our purposes, we're hesitant to add additional interpretation on top of the BRENDA data.

In spite of these problems, I think BRENDA is doing exactly what you've suggested to build the SBML output. However, I think this is difficult to replicate because we don't know the details of how BRENDA's database is encoded into the text file.

@jonrkarr
Member

I got a response from the BRENDA team:

  • Recently, they have begun to track the specific reaction associated with each KM and TN. However, I don't think we have a way to access this information, or to discern which entries have this metadata.
  • For the oldest curated entries (entries curated > 15 years ago), there is no way to discern the reaction associated with KM and TN because these entries don't have sufficient metadata to attempt to infer the associated reaction. The BRENDA team is slowly filling in this missing metadata.
  • For most entries, the organism, comments, and references can potentially be used to infer the specific reaction associated with each KM and TN. However, there's no way to avoid inferring false relationships.
  • We don't have any timestamps that we can use to discern when an entry was curated.

For Datanator, we're hesitant to infer false relationships. We want Datanator to be as free of interpretation as possible so that our downstream projects have as much control over the representation of experimental data as possible.

@Midnighter
Author

Thanks for the input. Any word on accessing all SBML or other structured data set?

@jonrkarr
Member

The BRENDA team didn't respond to my question about the SBML output. I suspect the reactions in the SBML output are inferred from common enzymes, comments, and references. I think the temperature and pH are also inferred by similar string pattern matching of the comments.

There's no other more structured output available. In any case, this wouldn't have the missing relationships because they have never been recorded.

If you're looking for a more structured dataset, I recommend SABIO-RK.
