5. Local cache

The LocalCompoundCache class, found in local_compound_cache, provides methods to generate Compound objects as well as storing and retrieving these compounds from a local component contribution database.

This notebook will highlight the following use-cases:

  1. Adding compounds and retrieving them from the coco namespace using add_compounds

  2. Adding compounds and retrieving them using get_compounds

  3. Options to control behavior for get_compounds and add_compounds

5.1. Requirements

  • equilibrator-assets: !pip install equilibrator-assets

  • openbabel: !pip install openbabel-wheel or !conda install -c conda-forge openbabel

  • chemaxon (including license): cxcalc must be in “PATH”

5.2. Initialize the local compound cache

[1]:
import pandas as pd
from equilibrator_assets.local_compound_cache import LocalCompoundCache
lc = LocalCompoundCache()

5.3. Generating a new local cache

A copy of the default zenodo cache must be used for the local_cache.

You can skip this cell if the local cache already exists

[27]:
# Copies the default zenodo compounds.sqlite cache to file location
# If that location already exists, user is prompted to delete
lc.generate_local_cache_from_default_zenodo('compounds.sqlite')
compounds.sqlite already exists.
Delete existing file and replace?
Proceed? (yes/no): yes
Deleting compounds.sqlite
Copying default Zenodo compound cache to compounds.sqlite

5.4. Loading an already existing local cache

[3]:
# load the local cache from the .sqlite database
lc.load_cache('compounds.sqlite')
Loading compounds from compounds.sqlite

5.5. Creating and adding compounds to the coco namespace

add_compounds provides a method to take a data frame consisting of compound information and generating and adding new compounds into the database. When generated, three compound properties must be defined:

  1. struct - a SMILES string representing the compound structure

  2. coco_id - an ID (string) enabling use with the equilibrator-api parser, e.g. my_compound can be accessed using coco:my_compound

  3. name - the name of a compound that will appear when creating plots for analyses such as Max-min Driving Force (MDF)

To generate compounds, a DataFrame must be provided following this example:

struct

coco_id

name

CCO

etoh

Ethanol

C/C1=CC(\O)=C/C(=O)O1

TAL

Triacetic Acid Lactone

[4]:
def display_compound_result(cpd_result, print_identifiers: bool = True):
    print("structure =", cpd_result.structure)
    print("method = ", cpd_result.method)
    print("status = ", cpd_result.status)
    if cpd_result.compound is not None:
        print("Compound ID =", cpd_result.compound.id)
        print("pK_a =", cpd_result.compound.dissociation_constants)
        print("pK_Mg =", cpd_result.compound.magnesium_dissociation_constants)
        print("InChIKey =", cpd_result.compound.inchi_key)
        print("standardized SMILES =", cpd_result.compound.smiles)

        if print_identifiers:
            print("\nidentifiers\n-----------")
            for _id in cpd_result.compound.identifiers:
                print(_id.registry.namespace, ":", _id.accession)

        print("\nmicrospecies\n------------")
        for _ms in cpd_result.compound.microspecies:
            print(f"charge = {_ms.charge}, number of H+ = {_ms.number_protons}, number of Mg2+ = {_ms.number_magnesiums}, ΔΔG/RT = {_ms.ddg_over_rt:.2f}")
[5]:
# Generating an example .csv for adding compounds
# 3A4HA is already present, but custom names can be added
# to the coco namespace
compound_df = pd.DataFrame(
    data=[
        ["OC(=O)C1=CC(NC(=O)C2=CC=CC=C2)=C(O)C=C1", "3B4HA", "3-Benzamido-4-hydroxybenzoic acid"],
        ["NC1=C(O)C=CC(=C1)C(O)=O", "3A4HA", "3-Amino-4-hydroxybenzoic acid"]
    ],
    columns=["struct","coco_id", "name"]
)

lc.add_compounds(compound_df, mol_format="smiles")
# added compound has the ID 3B4HA that can be access as coco:3B4HA
# and prints as 3-Amino-4-hydroxybenzoic acid in plots
cpd_results = lc.get_compounds(["OC(=O)C1=CC(NC(=O)C2=CC=CC=C2)=C(O)C=C1"])
[6]:
display_compound_result(cpd_results[0])
structure = OC(=O)C1=CC(NC(=O)C2=CC=CC=C2)=C(O)C=C1
method =  database
status =  valid
Compound ID = 694325
pK_a = [8.92, 4.23]
pK_Mg = []
InChIKey = RKCVLDMDZASBEO-UHFFFAOYSA-N
standardized SMILES = OC1=C(NC(=O)C2=CC=CC=C2)C=C(C=C1)C([O-])=O

identifiers
-----------
coco : 3B4HA
synonyms : 3-Benzamido-4-hydroxybenzoic acid

microspecies
------------
charge = -2, number of H+ = 9, number of Mg2+ = 0, ΔΔG/RT = 20.54
charge = -1, number of H+ = 10, number of Mg2+ = 0, ΔΔG/RT = 0.00
charge = 0, number of H+ = 11, number of Mg2+ = 0, ΔΔG/RT = -9.74

5.6. Using the coco namespace to define reactions with equilibrator-api

This method uses the equilibrator_api and the LocalCompoundCache to enable custom-compound use.

[7]:
from equilibrator_api import ComponentContribution, Q_
# the local cache is passed to ComponentContribution
cc = ComponentContribution(ccache = lc.ccache)
[8]:
# use coco:ID to access user-specified coco namespace
rxn = cc.parse_reaction_formula("coco:3B4HA + kegg:C00001 = coco:3A4HA + kegg:C00180")
if not rxn.is_balanced():
    print('%s is not balanced' % rxn)

cc.p_h = Q_(7)  # set pH
cc.ionic_strength = Q_("100 mM")  # set I

print(f"ΔG0 = {cc.standard_dg(rxn)}")
print(f"ΔG'0 = {cc.standard_dg_prime(rxn)}")
print(f"ΔG'm = {cc.physiological_dg_prime(rxn)}")
ΔG0 = (39.1 +/- 3.4) kilojoule / mole
ΔG'0 = (-2.0 +/- 3.4) kilojoule / mole
ΔG'm = (-19.1 +/- 3.4) kilojoule / mole

5.7. Using get_compounds to directly generate Compound objects

The get_compounds method accepts a single string or a list of strings that are molecule structures in either smiles or inchi form. The database is queried for each molecule and any misses are generated and inserted into the database. A list of compounds is returned.

Generated compounds are assigned an ID that is one greater than the current largest ID.

[9]:
cpd_results = lc.get_compounds(["CC(=O)O", "CC(O)C(=O)O", 'CCCOP(=O)(O)O', "OCC(N)C(O)CO"])

for cpd_res in cpd_results:
    print("-" * 80)
    display_compound_result(cpd_res, print_identifiers=False)
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
==============================
*** Open Babel Warning  in InChI code
  #1 :Omitted undefined stereo
--------------------------------------------------------------------------------
structure = CC(=O)O
method =  database
status =  valid
Compound ID = 28
pK_a = [4.54]
pK_Mg = [MagnesiumDissociationConstant(compound_id=28, number_protons=3, number_magnesiums=1)]
InChIKey = QTBSBXVTEAMEQO-UHFFFAOYSA-M
standardized SMILES = CC([O-])=O

microspecies
------------
charge = -1, number of H+ = 3, number of Mg2+ = 0, ΔΔG/RT = 0.00
charge = 0, number of H+ = 4, number of Mg2+ = 0, ΔΔG/RT = -10.45
charge = 1, number of H+ = 3, number of Mg2+ = 1, ΔΔG/RT = -186.15
--------------------------------------------------------------------------------
structure = CC(O)C(=O)O
method =  database
status =  valid
Compound ID = 2667
pK_a = [3.78]
pK_Mg = []
InChIKey = JVTAAEKCZFNVCJ-UHFFFAOYSA-M
standardized SMILES = CC(O)C([O-])=O

microspecies
------------
charge = -1, number of H+ = 5, number of Mg2+ = 0, ΔΔG/RT = 0.00
charge = 0, number of H+ = 6, number of Mg2+ = 0, ΔΔG/RT = -8.70
--------------------------------------------------------------------------------
structure = CCCOP(=O)(O)O
method =  database
status =  valid
Compound ID = 694326
pK_a = [6.84, 1.82]
pK_Mg = []
InChIKey = MHZDONKZSXBOGL-UHFFFAOYSA-N
standardized SMILES = CCCOP([O-])([O-])=O

microspecies
------------
charge = -2, number of H+ = 7, number of Mg2+ = 0, ΔΔG/RT = 0.00
charge = -1, number of H+ = 8, number of Mg2+ = 0, ΔΔG/RT = -15.75
charge = 0, number of H+ = 9, number of Mg2+ = 0, ΔΔG/RT = -19.94
--------------------------------------------------------------------------------
structure = OCC(N)C(O)CO
method =  database
status =  valid
Compound ID = 694327
pK_a = [13.69, 8.92]
pK_Mg = []
InChIKey = PMLGQXIKBPFHJZ-UHFFFAOYSA-N
standardized SMILES = [NH3+]C(CO)C(O)CO

microspecies
------------
charge = -1, number of H+ = 10, number of Mg2+ = 0, ΔΔG/RT = 52.06
charge = 0, number of H+ = 11, number of Mg2+ = 0, ΔΔG/RT = 20.54
charge = 1, number of H+ = 12, number of Mg2+ = 0, ΔΔG/RT = 0.00

5.7.1. Highlighting local cache persistence

Compounds remain in the local cache between runs. To highlight this, two compounds are added to local cache and given ids. The cache is reloaded and the compounds are queried in reverse, showing the ids remain with the specific compound.

[10]:
# get two new compounds
cpds_before = lc.get_compounds(["C(CC)CCOP(=O)(O)O", "C(CCC)CCOP(=O)(O)O"])

print('Before Reload')
for cpd in cpds_before:
    print("-" * 80)
    display_compound_result(cpd)

print("\n")
# reload cache
lc.ccache.session.close()
lc.load_cache('compounds.sqlite')
print("\n")

# query compounds in reverse
# ids stay with inchi keys, indicating compound persistence in the local cache
cpds_after = lc.get_compounds(["C(CCC)CCOP(=O)(O)O", "C(CC)CCOP(=O)(O)O"])

print('After Reload')
for cpd in cpds_after:
    print("-" * 80)
    display_compound_result(cpd)
Before Reload
--------------------------------------------------------------------------------
structure = C(CC)CCOP(=O)(O)O
method =  database
status =  valid
Compound ID = 694328
pK_a = [6.83, 1.81]
pK_Mg = []
InChIKey = NVTPMUHPCAUGCB-UHFFFAOYSA-N
standardized SMILES = CCCCCOP([O-])([O-])=O

identifiers
-----------
synonyms : 694328

microspecies
------------
charge = -2, number of H+ = 11, number of Mg2+ = 0, ΔΔG/RT = 0.00
charge = -1, number of H+ = 12, number of Mg2+ = 0, ΔΔG/RT = -15.73
charge = 0, number of H+ = 13, number of Mg2+ = 0, ΔΔG/RT = -19.89
--------------------------------------------------------------------------------
structure = C(CCC)CCOP(=O)(O)O
method =  database
status =  valid
Compound ID = 694329
pK_a = [6.83, 1.81]
pK_Mg = []
InChIKey = PHNWGDTYCJFUGZ-UHFFFAOYSA-N
standardized SMILES = CCCCCCOP([O-])([O-])=O

identifiers
-----------
synonyms : 694329

microspecies
------------
charge = -2, number of H+ = 13, number of Mg2+ = 0, ΔΔG/RT = 0.00
charge = -1, number of H+ = 14, number of Mg2+ = 0, ΔΔG/RT = -15.73
charge = 0, number of H+ = 15, number of Mg2+ = 0, ΔΔG/RT = -19.89


Loading compounds from compounds.sqlite


After Reload
--------------------------------------------------------------------------------
structure = C(CCC)CCOP(=O)(O)O
method =  database
status =  valid
Compound ID = 694329
pK_a = [6.83, 1.81]
pK_Mg = []
InChIKey = PHNWGDTYCJFUGZ-UHFFFAOYSA-N
standardized SMILES = CCCCCCOP([O-])([O-])=O

identifiers
-----------
synonyms : 694329

microspecies
------------
charge = -2, number of H+ = 13, number of Mg2+ = 0, ΔΔG/RT = 0.00
charge = -1, number of H+ = 14, number of Mg2+ = 0, ΔΔG/RT = -15.73
charge = 0, number of H+ = 15, number of Mg2+ = 0, ΔΔG/RT = -19.89
--------------------------------------------------------------------------------
structure = C(CC)CCOP(=O)(O)O
method =  database
status =  valid
Compound ID = 694328
pK_a = [6.83, 1.81]
pK_Mg = []
InChIKey = NVTPMUHPCAUGCB-UHFFFAOYSA-N
standardized SMILES = CCCCCOP([O-])([O-])=O

identifiers
-----------
synonyms : 694328

microspecies
------------
charge = -2, number of H+ = 11, number of Mg2+ = 0, ΔΔG/RT = 0.00
charge = -1, number of H+ = 12, number of Mg2+ = 0, ΔΔG/RT = -15.73
charge = 0, number of H+ = 13, number of Mg2+ = 0, ΔΔG/RT = -19.89

5.8. Exploring More Options of add_compounds and get_compounds

There are a number of options to further control the behavior of get_compounds that will be explained below:

  1. Varying the inchi-key connectivity for searches

  2. Handling compound creation errors

    • Investigting Log

    • Bypassing Chemaxon

    • Inserting Empty Compounds

    • Returning Failed Compounds

5.8.1. Inchi-key block control over searches

The connectivity_only option in get_compounds allows for the use of only the first block in the InChI key to be used in a search, otherwise the first two blocks will be used.

An example is shown with D-Glucose and L-Glucose. The connectivity-only searches yield the same results, as is expected.

[11]:
cc = ComponentContribution()
TRAINING_IDS = cc.predictor.params.train_G.index

d_glucose_con = lc.get_compounds(['C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O'], connectivity_only=True)[0]
d_glucose = lc.get_compounds(['C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O'], connectivity_only=False)[0]
l_glucose_con = lc.get_compounds(['O[C@@H]1[C@@H](O)[C@@H](OC(O)[C@H]1O)CO'], connectivity_only=True)[0]
l_glucose = lc.get_compounds(['O[C@@H]1[C@@H](O)[C@@H](OC(O)[C@H]1O)CO'], connectivity_only=False)[0]
[12]:
print("D-Glucose Search")
print(f"Two InChI Key blocks: {d_glucose.compound.id in TRAINING_IDS}")
display_compound_result(d_glucose, print_identifiers=False)

print(f"\nConnectivity Only: {d_glucose_con.compound.id in TRAINING_IDS}")
display_compound_result(d_glucose_con, print_identifiers=False)

print('\n')
print("L-Glucose Search")
print(f"Two InChI Key blocks: {l_glucose.compound.id in TRAINING_IDS}")
display_compound_result(l_glucose, print_identifiers=False)

print(f"\nConnectivity Only: {l_glucose_con.compound.id in TRAINING_IDS}")
display_compound_result(l_glucose_con, print_identifiers=False)
D-Glucose Search
Two InChI Key blocks: False
structure = C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O
method =  database
status =  valid
Compound ID = 93
pK_a = [13.58, 12.69, 11.3]
pK_Mg = []
InChIKey = WQZGKKKJIJFFOK-DVKNGEFBSA-N
standardized SMILES = OC[C@H]1O[C@H](O)[C@H](O)[C@@H](O)[C@@H]1O

microspecies
------------
charge = -3, number of H+ = 9, number of Mg2+ = 0, ΔΔG/RT = 86.51
charge = -2, number of H+ = 10, number of Mg2+ = 0, ΔΔG/RT = 55.24
charge = -1, number of H+ = 11, number of Mg2+ = 0, ΔΔG/RT = 26.02
charge = 0, number of H+ = 12, number of Mg2+ = 0, ΔΔG/RT = 0.00

Connectivity Only: True
structure = C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O
method =  database
status =  valid
Compound ID = 43
pK_a = [13.58, 12.69, 11.3]
pK_Mg = []
InChIKey = WQZGKKKJIJFFOK-GASJEMHNSA-N
standardized SMILES = OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O

microspecies
------------
charge = -3, number of H+ = 9, number of Mg2+ = 0, ΔΔG/RT = 86.51
charge = -2, number of H+ = 10, number of Mg2+ = 0, ΔΔG/RT = 55.24
charge = -1, number of H+ = 11, number of Mg2+ = 0, ΔΔG/RT = 26.02
charge = 0, number of H+ = 12, number of Mg2+ = 0, ΔΔG/RT = 0.00


L-Glucose Search
Two InChI Key blocks: False
structure = O[C@@H]1[C@@H](O)[C@@H](OC(O)[C@H]1O)CO
method =  database
status =  valid
Compound ID = 11639
pK_a = [13.58, 12.69, 11.3]
pK_Mg = []
InChIKey = WQZGKKKJIJFFOK-ZZWDRFIYSA-N
standardized SMILES = OC[C@@H]1OC(O)[C@@H](O)[C@H](O)[C@H]1O

microspecies
------------
charge = -3, number of H+ = 9, number of Mg2+ = 0, ΔΔG/RT = 86.51
charge = -2, number of H+ = 10, number of Mg2+ = 0, ΔΔG/RT = 55.24
charge = -1, number of H+ = 11, number of Mg2+ = 0, ΔΔG/RT = 26.02
charge = 0, number of H+ = 12, number of Mg2+ = 0, ΔΔG/RT = 0.00

Connectivity Only: True
structure = O[C@@H]1[C@@H](O)[C@@H](OC(O)[C@H]1O)CO
method =  database
status =  valid
Compound ID = 43
pK_a = [13.58, 12.69, 11.3]
pK_Mg = []
InChIKey = WQZGKKKJIJFFOK-GASJEMHNSA-N
standardized SMILES = OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O

microspecies
------------
charge = -3, number of H+ = 9, number of Mg2+ = 0, ΔΔG/RT = 86.51
charge = -2, number of H+ = 10, number of Mg2+ = 0, ΔΔG/RT = 55.24
charge = -1, number of H+ = 11, number of Mg2+ = 0, ΔΔG/RT = 26.02
charge = 0, number of H+ = 12, number of Mg2+ = 0, ΔΔG/RT = 0.00

5.8.2. Handling Compound Creation Errors

Sometimes compounds fail to be decomposed. This is due to chemaxon errors or the structure being invalid. As a result, there are a few workarounds to this problem. Users can specify two options, bypass_chemaxon and save_empty_compounds, to get around these errors.

bypass_chemxon will attempt to create a compound from the user-specified structure. If the compound cannot be decomposed even without bypass_chemaxon=True then it can still be saved as an empty compound by specifying save_empty_compounds=True.