Tutorial
========


Introduction to Component Contributions
***************************************

All of eQuilibrator's Gibbs energy estimates are based on the *Component Contribution* method :footcite:`noor_consistent_2013`, which combines *Group Contribution* :footcite:`mavrovouniotis_group_1988` :footcite:`mavrovouniotis_group_1990` :footcite:`mavrovouniotis_estimation_1991` :footcite:`jankowski_group_2008` with the more accurate *Reactant Contribution* :footcite:`noor_integrated_2012` by decomposing each reaction into two parts and applying one of the methods to each of them. This method gives priority to the reactant contributions over group contributions while guaranteeing that all estimates are consistent, i.e. will not violate the first law of thermodynamics.

This tutorial focuses on usage and applications, and these do not require understanding the underlying calculations. However, a reader who is interested in learning how Component Contribution works, would benefit from reading the original publication :footcite:`noor_consistent_2013`.

Parsing chemical formulas
*************************

In this section, we will learn how to convert strings containing reaction formulae
to ``Reaction`` objects, using the parsing functions in ``equilibrator-api``.

First, one must create an object from the ``ComponentContribution`` class::
    
    from equilibrator_api import ComponentContribution, Q_
    cc = ComponentContribution()

Calling the constructor can take a few seconds, since it is loading the entire database
of compounds into RAM. **Important notice:** when you do this for the first time after installing
``equilibrator-api``, this database will be downloaded from a remote website (Zenodo) and
that can take about 10 minutes (or more, depending on your bandwidth).

Notice that we also imported ``equilibrator_api.Q_``, which is a shorthand to the ``Quantity`` object that
will help us define physical values with proper units.
The default aqueous conditions are pH of 7.5, ionic strenght of 250 mM, and pMg of 3.
You can change any or all of them by direct assignment:

.. code-block:: python

    cc.p_h = Q_(7.0)
    cc.p_mg = Q_(2.5)
    cc.ionic_strength = Q_("0.1M")

Note that changing these aqueous conditions can be also be done later on (e.g. for checking how
sensitive a reaction's Gibbs energy is to pH).

Creating a ``Compound`` object
------------------------------

First, we will see how to find compounds in the database (which we call the ``compound_cache``).
The simplest way is to have an accession from on of the supported repositories:

================= ================== =============================
Namespace         Pattern            Example
================= ================== =============================
metanetx.chemical MNXM{int}          metanetx.chemical:MNXM5
bigg.metabolite   {str}              bigg.metabolite:h2o
kegg              {C/D/G}{int:05d}   kegg:C00001
chebi             CHEBI:{int}        chebi:CHEBI:30616
sabiork.compound  {int}              sabiork.compound:34
metacyc.compound  {str} or CPD-{int} metacyc.compound:ATP
hmdb              HMDB{int}          hmdb:HMDB0000538
swisslipid        SLM:{int}          swisslipid:SLM:000000002
reactome          R-ALL-{int}        reactome:R-ALL-113592
lipidmaps         LM**{int}          lipidmaps:LMFA00000001
seed              cpd{int:05d}       seed:cpd00002
================= ================== =============================

For example, we are interested in the compound *ATP* and we found it in `ChEBI <https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:30616>`_
(with the identifier ``CHEBI:30616``). We can now use the following command:

.. code-block:: python

    atp_compound = cc.get_compound("chebi:CHEBI:30616")

The first colon separates the namespace from the ID, while the second one is just part of the identifier itself. Not all
databases will have two colons, for example

.. code-block:: python

    atp_compound = cc.get_compound("kegg:C00002")

Now that we created a compound, we can print some of its properties:

.. code-block:: python

    print(f"InChI = {atp_compound.inchi}")
    print(f"InChIKey = {atp_compound.inchi_key}")
    print(f"formula = {atp_compound.formula}")
    print(f"molecular mass = {atp_compound.mass}")
    print(f"pKas = {atp_compound.dissociation_constants}")
    
If you don't have an database accession for your compound, you can also get it
using the InChI:

.. code-block:: python

    acetate_compound = cc.get_compound_by_inchi("InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1")

Searching for a ``Compound``
----------------------------

If you don't have an exact identifier (or full InChI), you have two options for
searching the database: by InChIKey or by common name.

To search by name, simply use the function ``search_compound``. The search is approximate
(using N-grams) and tries to find the best match in terms of edit-distance (similar to
the *I'm Feeling Lucky* option on Google search). However,
it can be unreliable for long compound names (and a bit slow for large datasets).

.. code-block:: python

    acetate_compound = cc.search_compound("acetat")

The other search option is based on InChIKeys. ``search_compound_by_inchi_key`` returns
a list of all the matching compounds whose InChIKey matches the query as their prefix.
For example:

.. code-block:: python

    cc.search_compound_by_inchi_key("MNQZXJOMYWMBOU")
    
returns a list of 3 compound objects, corresponding to D-glyceraldehyde, L-glyceraldehyde,
and glyceraldehyde (with unspecific chirality).


Using the ``parse_reaction_formula`` function
---------------------------------------------

Now, we can parse our first formula using the method ``parse_reaction_formula``.
In the following example, we use the BiGG database for the compound IDs to create the ATPase reaction:

.. code-block:: python

    atpase_reaction = cc.parse_reaction_formula(
        "bigg.metabolite:atp + bigg.metabolite:h2o = bigg.metabolite:adp + bigg.metabolite:pi"
    )

Note that the namespace ``bigg.metabolite`` has to be repeated before each compound identifier,
and that several namespaces can be combined within the same reaction. For example:

.. code-block:: python

    atpase_reaction = cc.parse_reaction_formula(
        "chebi:CHEBI:30616 + kegg:C00001 = hmdb:HMDB01341 + seed:cpd00009"
    )

We highly recommend that you check that the reaction is atomically balanced (conserves atoms) and charge balanced (redox neutral). We've found that it's easy to accidentally write unbalanced reactions in this format (e.g. forgetting to balance water is a common mistake) and so we always check ourselves. Calling the function ``atpase_reaction.is_balanced()`` returns a boolean with the answer.

Now, try to create more reactions of your own using ``parse_reaction_formula`` and make sure they are balanced.

Using the ``search_reaction`` function
--------------------------------------

Although ``parse_reaction_formula`` is the most common way to create reactions,
there are a few alternatives that fit some specific use-cases.

One option is to use the ``search_reaction`` function, which works similarly to
``search_compound`` by choosing the best match for each of the compound names.
For example:

.. code-block:: python
    
    formate_dehydrogenase_reaction = cc.search_reaction("formate + NAD = CO2 + NADH")

In this case, the resulting reaction object is correct (and the reaction is indeed chemically balanced).
However, things can become more fuzzy when dealing with compounds that have many versions
with similar names, such as sugars with multiple chiral centers. For instance,

.. code-block:: python
   
    glucose_isomerase_reaction = cc.search_reaction("beta-D-glucose = alpha-D-glucose")

In this case, the β-D-glucose (InChIKey = WQZGKKKJIJFFOK-VFUOTHLCSA-N) is correctly identified, but the product chosen by
eQuilibrator is D-glucose (InChIKey = WQZGKKKJIJFFOK-GASJEMHNSA-N) and not α-D-glucose
(InChIKey = WQZGKKKJIJFFOK-DVKNGEFBSA-N). These mistakes can be unpredictable and depend
on the synonym lists in various chemical databases which are not always well-curated.
There is absolutely no guarantee that the first hit would be the intended compound.
If you must use the reaction search option, it is vital that you carefully go through
the results and make sure no mistakes have been made. As demonstrated in the last example, simply 
checking the chemical balancing is not enough.

Using the ``Reaction`` constructor
----------------------------------

The third option for creating a ``Reaction`` object is by first creating your own
set of ``Compound`` objects (e.g. using ``get_compound`` based on one of the namespaces).
For example, one can store them in a dictionary with the IDs as the keys.

The main input argument for the constructor is a dictionary with ``Compound`` as keys and
the stoichiometric coefficients (float) as the values.

.. code-block:: python
   
    from equilibrator_api import Reaction
    compound_ids = ["h2o", "adp", "atp", "pi"]
    compound_dict = {cid : cc.get_compound(f"bigg.metabolite:{cid}") for cid in compound_ids}
    atpase_reaction = Reaction({
        compound_dict["atp"]: -1,
        compound_dict["h2o"]: -1,
        compound_dict["adp"]: 1,
        compound_dict["pi"]: 1,
    })

Estimating standard Gibbs energies
**********************************

Once you have a ``Reaction`` object, you can use it to get estimates for the Gibbs energy.


Transformed standards Gibbs energy of a single reaction
-------------------------------------------------------

Run the following command::

    standard_dg_prime = cc.standard_dg_prime(atpase_reaction)

The return value is a ``Measurement`` object (from the ``pint`` package) which contains the 
mean estimated value and its uncertainty (in terms of standard deviation). 


Transformed standard Gibbs energies of multiple reactions
---------------------------------------------------------

Often when working with metabolic models, we need to work with more than one reaction at a time. Although running ``standard_dg_prime()`` on each reaction is possible, this means that we ignore any data about the dependencies between the estimates. In reality, errors in reaction Gibbs energies are highly correlated, due to the fact that they often have metabolites in common. Furthermore, the Component Contribution method uses sub-structures to estimate ΔGs and these are even more coupled (for example, consider two transaminase reactions where the vector of group changes is identical, even if the reactants themselves are different).

As an example, we start by defining three additional reactions:

.. code-block:: python

    shikimate_kinase_reaction = cc.parse_reaction_formula(
        "bigg.metabolite:skm + bigg.metabolite:atp = bigg.metabolite:skm5p + bigg.metabolite:adp"
    )
    tyramine_dehydrogenase_reaction = cc.parse_reaction_formula(
        "kegg:C00483 + bigg.metabolite:nad + bigg.metabolite:h2o = kegg:C04227 + bigg.metabolite:nadh"
    )
    fluorene_dehydrogenase_reaction = cc.parse_reaction_formula(
        "kegg:C07715 + bigg.metabolite:nad + bigg.metabolite:h2o = kegg:C06711 + bigg.metabolite:nadh"
    )
    reactions = [atpase_reaction, shikimate_kinase_reaction, tyramine_dehydrogenase_reaction, fluorene_dehydrogenase_reaction]

    for r in reactions:
        assert r.is_balanced(), "One of the reactions is not chemically balanced"

Now that we store all our reactions in a list, we can use the ``standard_dg_prime_multi`` function, which is a batch version of the ``standard_dg_prime`` function:

.. code-block:: python

    (
        standard_dg_prime, dg_uncertainty
    ) = cc.standard_dg_prime_multi(reactions, uncertainty_representation="cov")

Since the software calculates the full covariance matrix for the uncertinaty of the estimates,
we cannot use the ``Measurement`` class (which stores only one standard deviation per estimate). Therefore, there are two return values:

    - ``standard_dg_prime`` (Quantity) - an array containing the CC estimation of the reactions' standard transformed energies
    - ``dg_uncertainty`` (Quantity) - the uncertainty covariance matrix, which can be given in 4 different formats (determined by the ``uncertinaty_representation`` flag):

        - ``cov`` - the full covariance matrix
        - ``precision`` - precision matrix (i.e. the inverse of the covariance matrix)
        - ``sqrt`` - a sqaure root of the covariance, based on the uncertainty vectors
        - ``fullrank`` - full-rank square root of the covariance which is a compressed form of the sqrt result

In the above example, we used the ``cov`` option, which means that ``dg_uncertainty`` stores the full covariance matrix:

+--------------------------+------+----------------+----------------------+----------------------+
|                          |ATPase|Shikimate       |Tyramine              |Fluorene              |
+--------------------------+------+----------------+----------------------+----------------------+
|                          |      |kinase          |dehydrogenase         |dehydrogenase         |
+==========================+======+================+======================+======================+
|**ATPase**                |  0.09|   0.02         |   0.05               |   0.05               |
+--------------------------+------+----------------+----------------------+----------------------+
|**Shikimate kinase**      |  0.02|   8.07         |   -0.42              |   -0.69              |
+--------------------------+------+----------------+----------------------+----------------------+
|**Tyramine dehydrogenase**|  0.05|   -0.42        |   0.73               |   0.68               |
+--------------------------+------+----------------+----------------------+----------------------+
|**Fluorene dehydrogenase**|  0.05|   -0.69        |   0.68               |   2.24               |
+--------------------------+------+----------------+----------------------+----------------------+

All values are in units of :math:`kJ^2 / mol^2`. 
The diagonal values are the uncertainties of single reactions (the square of the standard deviation),
while the off-diagonal values represent the covariances between reactions. For a detailed explanation
of the importance of these values and how to use them in sampling and optimization problems, see :ref:`Covariance`.


Multi-compartment reactions
---------------------------

Reactions that involve membrane transport are a bit more complicated than single-compartment
reactions. First, we need to indicate the location of each reactant (i.e. in which of the two
compartments it is present). In addition, we need to provide extra information, such as
the electrostatic potential, the pH/I/pMg on both sides of the membrane, and the amount
of protons that are being transported together with the rest of the compounds.

The calculations are based on Haraldsdottir et al. :footcite:`haraldsdottir_quantitative_2012`, specitically
by adding the following term to the calculated value of :math:`\Delta G'`

:math:`-N_H \cdot RT\ln\left( 10^{\Delta pH} \right) - Q \cdot F \Delta \Phi`

where *R* is the gas constant, *T* is the temperature (in Kalvin), *F* is Faraday's constant (the total electric charge of one mol of electrons -- 96.5 kC/mol), *ΔpH* is the difference in pH between initial and final compartment, :math:`N_H` is the net number of hydrogen ions transported from initial to final compartment, and *Q* is the stoichiometric coefficient of the transported charges.

This code example below shows how to estimate :math:`\Delta_{r}G'^{\circ}` for glucose uptake through the phosphotransferase system. The result is -44.8 :math:`\pm` 0.6 kJ/mol. Note that we only account for uncertainty stemming from the component-contribution method. All other estimates (e.g. based on electrostatic forces) are assumed to be precise:

.. code-block:: python

    from equilibrator_api import ComponentContribution, Q_

    cytoplasmic_p_h = Q_(7.5)
    cytoplasmic_ionic_strength = Q_("250 mM")
    periplasmic_p_h = Q_(7.0)
    periplasmic_ionic_strength = Q_("200 mM")
    e_potential_difference = Q_("0.15 V")
    cytoplasmic_reaction = "bigg.metabolite:pep = bigg.metabolite:g6p + bigg.metabolite:pyr"
    periplasmic_reaction = "bigg.metabolite:glc__D = "

    cc = ComponentContribution()
    cc.p_h = cytoplasmic_p_h
    cc.ionic_strength = cytoplasmic_ionic_strength
    standard_dg_prime = cc.multicompartmental_standard_dg_prime(
        cc.parse_reaction_formula(cytoplasmic_reaction),
        cc.parse_reaction_formula(periplasmic_reaction),
        e_potential_difference=e_potential_difference,
        p_h_outer=periplasmic_p_h,
        ionic_strength_outer=periplasmic_ionic_strength,
    )

    print(standard_dg_prime)

References
**********

.. footbibliography::