5. Local cache

The LocalCompoundCache class, found in local_compound_cache, provides methods to generate Compound objects as well as storing and retrieving these compounds from a local component contribution database.

This notebook will highlight the following use-cases:

  1. Adding compounds and retrieving them from the coco namespace using add_compounds

  2. Adding compounds and retrieving them using get_compounds

  3. Options to control behavior for get_compounds and add_compounds

5.1. Requirements

  • equilibrator-assets: !pip install equilibrator-assets

  • openbabel: !pip install openbabel-wheel or !conda install -c conda-forge openbabel

  • chemaxon (including license): cxcalc must be in “PATH”

5.2. Initialize the local compound cache

[1]:
import pandas as pd
from equilibrator_assets.local_compound_cache import LocalCompoundCache
lc = LocalCompoundCache()

5.3. Generating a new local cache

A copy of the default zenodo cache must be used for the local_cache.

You can skip this cell if the local cache already exists

[2]:
# Copies the default zenodo compounds.sqlite cache to file location
# If that location already exists, user is prompted to delete
lc.generate_local_cache_from_default_zenodo('compounds.sqlite')
compounds.sqlite already exists.
Delete existing file and replace?
Proceed? (yes/no): yes
Deleting compounds.sqlite
Copying default Zenodo compound cache to compounds.sqlite

5.4. Loading an already existing local cache

[3]:
# load the local cache from the .sqlite database
lc.load_cache('compounds.sqlite')
Loading compounds from compounds.sqlite

5.5. Creating and adding compounds to the coco namespace

add_compounds provides a method to take a data frame consisting of compound information and generating and adding new compounds into the database. When generated, three compound properties must be defined:

  1. struct - a SMILES string representing the compound structure

  2. coco_id - an ID (string) enabling use with the equilibrator-api parser, e.g. my_compound can be accessed using coco:my_compound

  3. name - the name of a compound that will appear when creating plots for analyses such as Max-min Driving Force (MDF)

To generate compounds, a DataFrame must be provided following this example:

struct

coco_id

name

CCO

etoh

Ethanol

C/C1=CC(\O)=C/C(=O)O1

TAL

Triacetic Acid Lactone

[4]:
# Generating an example .csv for adding compounds
# 3A4HA is already present, but custom names can be added
# to the coco namespace
compound_df = pd.DataFrame(
    data=[
        ["OC(=O)C1=CC(NC(=O)C2=CC=CC=C2)=C(O)C=C1", "3B4HA", "3-Benzamido-4-hydroxybenzoic acid"],
        ["NC1=C(O)C=CC(=C1)C(O)=O", "3A4HA", "3-Amino-4-hydroxybenzoic acid"]
    ],
    columns=["struct","coco_id", "name"]
)

lc.add_compounds(compound_df, mol_format="smiles")
# added compound has the ID 3B4HA that can be access as coco:3B4HA
# and prints as 3-Amino-4-hydroxybenzoic acid in plots
coco3B4HA = lc.get_compounds("OC(=O)C1=CC(NC(=O)C2=CC=CC=C2)=C(O)C=C1")
print(coco3B4HA)
Compound(id=694325, inchi_key=RKCVLDMDZASBEO-UHFFFAOYSA-N)

5.6. Using the coco namespace to define reactions with equilibrator-api

This method uses the equilibrator_api and the LocalCompoundCache to enable custom-compound use.

[5]:
from equilibrator_api import ComponentContribution, Q_
# the local cache is passed to ComponentContribution
cc = ComponentContribution(ccache = lc.ccache)
[6]:
# use coco:ID to access user-specified coco namespace
rxn = cc.parse_reaction_formula("coco:3B4HA + kegg:C00001 = coco:3A4HA + kegg:C00180")
if not rxn.is_balanced():
    print('%s is not balanced' % rxn)

cc.p_h = Q_(7)  # set pH
cc.ionic_strength = Q_("100 mM")  # set I

print(f"ΔG0 = {cc.standard_dg(rxn)}")
print(f"ΔG'0 = {cc.standard_dg_prime(rxn)}")
print(f"ΔG'm = {cc.physiological_dg_prime(rxn)}")
ΔG0 = (39.1 +/- 3.4) kilojoule / mole
ΔG'0 = (-2.0 +/- 3.4) kilojoule / mole
ΔG'm = (-19.1 +/- 3.4) kilojoule / mole

5.7. Using get_compounds to directly generate Compound objects

The get_compounds method accepts a single string or a list of strings that are molecule structures in either smiles or inchi form. The database is queried for each molecule and any misses are generated and inserted into the database. A list of compounds is returned.

Generated compounds are assigned an ID that is one greater than the current largest ID.

[7]:
compound_list = lc.get_compounds(["CC(=O)O", "CC(O)C(=O)O", 'CCCOP(=O)(O)O', "OCC(N)C(O)CO"])

for c in compound_list:
    print("-" * 80)
    print(c)
    print(f"pK_a: {c.dissociation_constants}")
    print(f"pK_Mg: {c.magnesium_dissociation_constants}")
    print("Microspecies: ")
    for ms in c.microspecies:
        print(f"{ms}, ddg_over_rt = {ms.ddg_over_rt:.1f}")
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
--------------------------------------------------------------------------------
Compound(id=28, inchi_key=QTBSBXVTEAMEQO-UHFFFAOYSA-M)
pK_a: [4.54]
pK_Mg: [MagnesiumDissociationConstant(compound_id=28, number_protons=3, number_magnesiums=1)]
Microspecies:
CompoundMicrospecies(compound_id=28, z=-1, nH=3, nMg=0), ddg_over_rt = 0.0
CompoundMicrospecies(compound_id=28, z=0, nH=4, nMg=0), ddg_over_rt = -10.5
CompoundMicrospecies(compound_id=28, z=1, nH=3, nMg=1), ddg_over_rt = -186.1
--------------------------------------------------------------------------------
Compound(id=2667, inchi_key=JVTAAEKCZFNVCJ-UHFFFAOYSA-M)
pK_a: [3.78]
pK_Mg: []
Microspecies:
CompoundMicrospecies(compound_id=2667, z=-1, nH=5, nMg=0), ddg_over_rt = 0.0
CompoundMicrospecies(compound_id=2667, z=0, nH=6, nMg=0), ddg_over_rt = -8.7
--------------------------------------------------------------------------------
Compound(id=694326, inchi_key=MHZDONKZSXBOGL-UHFFFAOYSA-N)
pK_a: [6.84, 1.82]
pK_Mg: []
Microspecies:
CompoundMicrospecies(compound_id=694326, z=-2, nH=7, nMg=0), ddg_over_rt = 0.0
CompoundMicrospecies(compound_id=694326, z=-1, nH=8, nMg=0), ddg_over_rt = -15.7
CompoundMicrospecies(compound_id=694326, z=0, nH=9, nMg=0), ddg_over_rt = -19.9
--------------------------------------------------------------------------------
Compound(id=694327, inchi_key=PMLGQXIKBPFHJZ-UHFFFAOYSA-N)
pK_a: [13.69, 8.92]
pK_Mg: []
Microspecies:
CompoundMicrospecies(compound_id=694327, z=-1, nH=10, nMg=0), ddg_over_rt = 52.1
CompoundMicrospecies(compound_id=694327, z=0, nH=11, nMg=0), ddg_over_rt = 20.5
CompoundMicrospecies(compound_id=694327, z=1, nH=12, nMg=0), ddg_over_rt = 0.0

5.7.1. Highlighting local cache persistence

Compounds remain in the local cache between runs. To highlight this, two compounds are added to local cache and given ids. The cache is reloaded and the compounds are queried in reverse, showing the ids remain with the specific compound.

[8]:
# get two new compounds
cpds_before = lc.get_compounds(["C(CC)CCOP(=O)(O)O", "C(CCC)CCOP(=O)(O)O"])

print('Before Reload')
for cpd in cpds_before:
    print(f"\tID: {cpd.id}, InChI Key: {cpd.inchi_key}")

print("\n")
# reload cache
lc.ccache.session.close()
lc.load_cache('compounds.sqlite')
print("\n")

# query compounds in reverse
# ids stay with inchi keys, indicating compound persistence in the local cache
cpds_after = lc.get_compounds(["C(CCC)CCOP(=O)(O)O", "C(CC)CCOP(=O)(O)O"])

print('After Reload')
for cpd in cpds_after:
    print(f"\tID: {cpd.id}, InChI Key: {cpd.inchi_key}")
Before Reload
        ID: 694328, InChI Key: NVTPMUHPCAUGCB-UHFFFAOYSA-N
        ID: 694329, InChI Key: PHNWGDTYCJFUGZ-UHFFFAOYSA-N


Loading compounds from compounds.sqlite


After Reload
        ID: 694329, InChI Key: PHNWGDTYCJFUGZ-UHFFFAOYSA-N
        ID: 694328, InChI Key: NVTPMUHPCAUGCB-UHFFFAOYSA-N

5.8. Exploring More Options of add_compounds and get_compounds

There are a number of options to further control the behavior of get_compounds that will be explained below:

  1. Varying the inchi-key connectivity for searches

  2. Handling compound creation errors

    • Investigting Log

    • Bypassing Chemaxon

    • Inserting Empty Compounds

    • Returning Failed Compounds

5.8.1. Inchi-key block control over searches

The connectivity_only option in get_compounds allows for the use of only the first block in the InChI key to be used in a search, otherwise the first two blocks will be used.

An example is shown with D-Glucose and L-Glucose. The connectivity-only searches yield the same results, as is expected.

[9]:
cc = ComponentContribution()
TRAINING_IDS = cc.predictor.params.train_G.index

d_glucose_con = lc.get_compounds('C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O', connectivity_only=True)
d_glucose = lc.get_compounds('C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O', connectivity_only=False)
l_glucose_con = lc.get_compounds('O[C@@H]1[C@@H](O)[C@@H](OC(O)[C@H]1O)CO', connectivity_only=True)
l_glucose = lc.get_compounds('O[C@@H]1[C@@H](O)[C@@H](OC(O)[C@H]1O)CO', connectivity_only=False)

print("D-Glucose Search")
print(f"Two InChI Key blocks: {d_glucose}\nIn training data: {d_glucose.id in TRAINING_IDS}")
print(f"\nConnectivity Only: {d_glucose_con}\nIn training data: {d_glucose_con.id in TRAINING_IDS}")

print('\n')
print("L-Glucose Search")
print(f"Two InChI Key blocks: {l_glucose}\nIn training data: {l_glucose.id in TRAINING_IDS}")
print(f"\nConnectivity Only: {l_glucose_con}\nIn training data: {l_glucose_con.id in TRAINING_IDS}")

D-Glucose Search
Two InChI Key blocks: Compound(id=93, inchi_key=WQZGKKKJIJFFOK-DVKNGEFBSA-N)
In training data: False

Connectivity Only: Compound(id=43, inchi_key=WQZGKKKJIJFFOK-GASJEMHNSA-N)
In training data: True


L-Glucose Search
Two InChI Key blocks: Compound(id=11639, inchi_key=WQZGKKKJIJFFOK-ZZWDRFIYSA-N)
In training data: False

Connectivity Only: Compound(id=43, inchi_key=WQZGKKKJIJFFOK-GASJEMHNSA-N)
In training data: True

5.8.2. Handling Compound Creation Errors

Sometimes compounds fail to be decomposed. This is due to chemaxon errors or the structure being invalid. As a result, there are a few workarounds to this problem. Users can specify two options, bypass_chemaxon and save_empty_compounds, to get around these errors.

bypass_chemxon will attempt to create a compound from the user-specified structure. If the compound cannot be decomposed even without bypass_chemaxon=True then it can still be saved as an empty compound by specifying save_empty_compounds=True.

There are two ways to log results, an error_log, which saves to a .tsv, or through a pandas dataframe.

5.8.3. Viewing error log

[ ]:
log_df = pd.DataFrame()
smiles = [
        # Decomposed with chemaxon
        "OC(=O)C1=CC(NC(=O)C2=CC=CC=C2)=C(O)C=C1",
        # Decomposed with bypass_chemaxon
        "C1(CC(OC(C(C1)=O)CO)O)=O",
        # Cannot be decomposed
        "CC(=O)OC1C=C2C3Cc4ccc(c(c4C2(CCN3C)C=C1OC)O)OC",
    ]

compounds = lc.get_compounds(
    smiles,
    bypass_chemaxon=False,
    save_empty_compounds=False,
    error_log="compound_creation_log.tsv",
    log_df=log_df
)

5.8.4. compound_creation_log

The log gives the structure, the method that can insert the compound, and the status.

[ ]:
# succesfully insert with chemaxon
cc_log = pd.read_csv("compound_creation_log.tsv", sep="\t", index_col=[0])
print(f"----------\nSaved Log\n{cc_log.to_string()}\n")
print(f"----------\nDataFrame Log\n{log_df.to_string()}")
[12]:
smiles = [
        # Decomposed with chemaxon
        "OC(=O)C1=CC(NC(=O)C2=CC=CC=C2)=C(O)C=C1",
        # Decomposed with bypass_chemaxon
        "C1(CC(OC(C(C1)=O)CO)O)=O",
        # Cannot be decomposed
        "CC(=O)OC1C=C2C3Cc4ccc(c(c4C2(CCN3C)C=C1OC)O)OC",
    ]

compounds = lc.get_compounds(
    smiles,
    bypass_chemaxon=True,
    save_empty_compounds=False,
)
# Succesfully insert empty compound

==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
WARNING:equilibrator_assets.generate_compound:One or more compounds were unable to be decomposed. Rerun specifying error_log or log_df to view details.

[13]:
smiles = [
        # Decomposed with chemaxon
        "OC(=O)C1=CC(NC(=O)C2=CC=CC=C2)=C(O)C=C1",
        # Decomposed with bypass_chemaxon
        "C1(CC(OC(C(C1)=O)CO)O)=O",
        # Cannot be decomposed
        "CC(=O)OC1C=C2C3Cc4ccc(c(c4C2(CCN3C)C=C1OC)O)OC",
    ]

compounds = lc.get_compounds(
    smiles,
    bypass_chemaxon=True,
    save_empty_compounds=True,
    log_df=log_df
)
# Succesfully insert empty compound
print(f"{log_df.to_string()}")
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
                                           struct                    inchi_key    method status
0         OC(=O)C1=CC(NC(=O)C2=CC=CC=C2)=C(O)C=C1  RKCVLDMDZASBEO-UHFFFAOYSA-N  database  valid
1                        C1(CC(OC(C(C1)=O)CO)O)=O  UBJAOLUWBPQQJC-UHFFFAOYSA-N  database  valid
2  CC(=O)OC1C=C2C3Cc4ccc(c(c4C2(CCN3C)C=C1OC)O)OC  DNOMLUPMYHAJIY-UHFFFAOYSA-N     empty  valid

5.8.5. Returning failed compounds

If the shape of the output list of get_compounds must be the same as the input, set return_failures==True. This returns None for compounds that failed.

The below only returns 2 compounds, as the third one cannot be decomposed.

[14]:
smiles = [
        # Already in database
        "CCO",
        # Decomposed with bypass_chemaxon
        "C1(CCC(OC(C(C1)=O)CO)O)=O",
        # Cannot be decomposed
        "CCC(=O)OC1C=C2C3Cc4ccc(c(c4C2(CCN3C)C=C1OC)O)OC",
    ]
lc.get_compounds(smiles)
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
WARNING:equilibrator_assets.generate_compound:One or more compounds were unable to be decomposed. Rerun specifying error_log or log_df to view details.

[14]:
[Compound(id=287, inchi_key=LFQSCWFLJHTTHZ-UHFFFAOYSA-N),
 Compound(id=694332, inchi_key=XNYRWWMIBJOLGA-UHFFFAOYSA-N)]

Set return_fails=True to return None for failed compounds

[15]:
compounds = lc.get_compounds(smiles, return_fails=True)
smiles_dict = dict(zip(smiles, compounds))
print("\nCompounds:")
print(compounds)
print("\nDict:")
print(smiles_dict)
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
WARNING:equilibrator_assets.generate_compound:One or more compounds were unable to be decomposed. Rerun specifying error_log or log_df to view details.


Compounds:
[Compound(id=287, inchi_key=LFQSCWFLJHTTHZ-UHFFFAOYSA-N), Compound(id=694332, inchi_key=XNYRWWMIBJOLGA-UHFFFAOYSA-N), None]

Dict:
{'CCO': Compound(id=287, inchi_key=LFQSCWFLJHTTHZ-UHFFFAOYSA-N), 'C1(CCC(OC(C(C1)=O)CO)O)=O': Compound(id=694332, inchi_key=XNYRWWMIBJOLGA-UHFFFAOYSA-N), 'CCC(=O)OC1C=C2C3Cc4ccc(c(c4C2(CCN3C)C=C1OC)O)OC': None}