Source code for glycosylator.core.Glycan

"""
Glycans are represented by the `Glycan` class. The `Glycan` is actually just a derivative of a `buildamol.Molecule` with some additional methods and attributes to handle glycan-specific operations. 
The `Glycan` class is used to represent glycan molecules and is used to generate glycan molecules from IUPAC strings, graph structures, or to extend/crop glycan molecules to match a given IUPAC string. 
The `Glycan` class also has methods to draw 2D (SNFG) and 3D representations of the glycan molecule.

Making Glycans from IUPAC strings
---------------------------------
The IUPAC nomenclature has been widely established to describe glycans in textual form because their SMILES tend to be very long and cumbersome to work with.
`Glycans` have a `from_iupac` method that can read an IUPAC string and produce a 3D model for the glycan. Alternatively one may use the top-level `read_iupac` function
to the same effect, or pass an IUPAC string to the `glycan` function. 

.. code-block:: python

    import glycosylator as gl 

    my_glycan = gl.read_iupac("my new glycan", "Man(b1-4)Glc")

The `Glycan` class also has a `to_iupac` method that can convert a glycan molecule to an IUPAC string.

.. code-block:: python

    iupac_string = my_glycan.to_iupac()

Making Glycans from other inputs
--------------------------------
Since the `Glycan` class inherits from the `buildamol.Molecule` it supports many more inputs such as RDKit molecules, SMILES, or PDB files.
You can either use a dedicated classmethod such as `from_pdb` to get a desired glycan molecule, or trust that the top-level `glycan` function
can automatically figure out what kind of input you are providing. This function is very versatile and can be provided with a variety of inputs
which are automatically processed behind the scenes. It is the most convenient way for users to obtain a glycan structure.

.. code-block:: python

    # make a non-standard glucose from SMILES
    smiles = "NCC1OC(O)C(O)C(O)C1O"
    nitrogen_glucose = gl.glycan(smiles)

Modifying individual sugars
---------------------------
If you do not feel like working with SMILES in order to make small chemical changes on individual sugars or even whole glycans,
you can use the flexibility of `BuildAMol`, which Glycosylator is built upon, to help you out. Let's say we we want to make a phospho-glucose
we can do something like this:

.. code-block:: python

    import glycosylator as gl
    import buildamol as bam

    # get a glucose 
    glc = gl.glycan("GLC")

    # now use buildamol directly to modify
    # the molecule
    bam.phosphorylate(glc, at_atom="C6")
    glc.remove_atoms("O6", "HO6")

"""

from typing import Union
import warnings

import glycosylator.utils as utils
import buildamol.structural as structural
import glycosylator.resources as resources
import buildamol.core as core

from buildamol.extensions.bio.glycans.glycan import (
    _segments_to_mol as _parse_iupac_graph,
)

import pandas as pd


[docs] def glycan(g: Union[str, list], id: str = None, _topology=None): """ The toplevel function to generate an entire glycan molecule from either an IUPAC/SNFG string, a list of residues from a graph structure, or just get a single residue glycan molecule (e.g. one Glucose, etc.). Parameters ---------- g : str or list The glycan string to parse. The string may be a single sugar residue's name, in which case this function is used like `buildamol.molecule`, or can be an entire glycan structure in IUPAC/SNFG condensed format - currently, neither extended nor short formats are supported (refer to the `read_iupac` function for more information). Alternatively, a list of residues from a graph structure can be passed (refer to the `read_graph` function for more information). id : str The id of the molecule to create. If not provided, the id will be the same as the input string. _topology A particular topology to use. If None, the default topology is used. Returns ------- Glycan The created Glycan molecule. """ if isinstance(g, str): try: return read_iupac(id, g, _topology) except: try: mol = core.molecule(g) if isinstance(mol, list): mol = mol[0] mol.id = id if id is not None else g mol = Glycan(mol) return mol except: try: mol = Glycan.from_glycosmos(g, _topology) return mol except: raise ValueError( f"Failed to interpret input '{g}'. Could not parse as IUPAC/SNFG string and not get a molecule from it. If it is a IUPAC/SNFG string, perhaps one of the residues could not be found in the database or a linkage is not available in the used topology? Try building the glycan directly." ) elif isinstance(g, list): mol = read_graph(id, g) return mol
[docs] def read_iupac(id: str, s: str, _topology=None) -> "Glycan": """ Make a molecule from an IUPAC glycan string in condensed format. Parameters ---------- id : str The id of the molecule to create s : str The glycan string to parse. The string must be in IUPAC condensed format - currently, neither extended nor short formats are supported. _topology A particular topology to use. If None, the default topology is used. Returns ------- molecule : Glycan The created Glycan molecule. Examples -------- To generate a small glycan of the structure: ``` ~ --- NAG MAN \\ / (14bb) (16ab) \\ / NAG -(14bb)- BMA \\ (13ab) \\ MAN ``` the IUPAC/SNFG string would be: >>> iupac = "Man(a1-6)[Man(a1-3)]b-Man(a1-4)GlcNAc(b1-4)GlcNAc(b1-" # notice the final "b1-" to indicate where the glycan attaches to a scaffold Which can be parsed into a molecule with: >>> mol = read_iupac("my_glycan", iupac) """ if isinstance(s, str): s = utils.__default_IUPACParser__(s) else: raise TypeError( f"Expected string, got {type(s)} instead. Perhaps you meant to use the `read_graph` function?" ) mol = Glycan(_parse_iupac_graph(id, s, _topology)) return mol
# Alias read_snfg = read_iupac
[docs] def write_iupac(mol: "Glycan") -> str: """ Write a molecule as an IUPAC string in condensed format. Parameters ---------- mol : Glycan The molecule to write. Returns ------- iupac : str The IUPAC/SNFG string in condensed format. """ return mol.to_iupac()
write_snfg = write_iupac
[docs] def read_graph(id: str, g: list, _topology=None) -> "Glycan": """ Build a molecule from a glycan graph. Parameters ---------- id : str The id of the molecule to create g : list A list of tuples in the form (parent, child, linkage), where parent and child are strings designating the residues and an "@{id}" suffix to distinguish individual residues. The linkage must be a valid id of any defined linkage in the provided or default topology, e.g. "14bb" or "16ab" for the default CHARMM topology (see example below). _topology A particular topology to use. If None, the default topology is used. Returns ------- molecule : Glycan The created Glycan molecule. Examples -------- To generate a small glycan of the structure: ``` ~ --- NAG MAN \\ / (14bb) (16ab) \\ / NAG -(14bb)- BMA \\ (13ab) \\ MAN ``` We can formulate a graph structure for the glycan above as follows: >>> graph = [ ("NAG@1", "NAG@2", "14bb"), ("NAG@2", "BMA@1", "14bb"), ("BMA@1", "MAN@1", "13ab"), ("BMA@1", "MAN@2", "16ab"), ] # notice the @{id} after each residue name The `@{id}` suffix is used to distinguish between residues with the same name, for example the two mannoses (`MAN@1` and `MAN@2`). The in the example above the ids reflect the number of residues with the same name, hence `NAG@2` connects to `BMA@1` "the second NAG connecting to the first BMA". However, this is not a strict requirement. Any numeric or string value that will mark each residue as a unique node is acceptable - that is, each combination of one particuar residue is identified by a unique "{name}@{id}". Hence, also the following graph is valid where the index simply reflects the order of the residues in the molecule: >>> graph = [ ("NAG@1", "NAG@2", "14bb"), ("NAG@2", "BMA@3", "14bb"), ("BMA@3", "MAN@4", "13ab"), ("BMA@3", "MAN@5", "16ab"), ] or even >>> graph = [ ("NAG@a", "NAG@b", "14bb"), ("NAG@b", "BMA@c", "14bb"), ("BMA@c", "MAN@d", "13ab"), ("BMA@c", "MAN@e", "16ab"), ] # here the ids are simply letters we can then create a molecule using: >>> mol = read_graph("my_glycan", graph) """ if not isinstance(g, list): raise ValueError("g must be either a string or a list") if not isinstance(g[0], (tuple, list)) or len(g[0]) != 3: raise ValueError("g must be a list of tuples of length 3") mol = _parse_iupac_graph(id, g, _topology) return mol
class GlycanTree: """ Stores the meta-glycan structure as a tree structure for easier access when plotting glycan schematics. """ def __init__(self): # REMARK # Originally, the whole thing was supposed to be a networkx graph, or a dictionary, # but strangely, during removal residues could not be found in the graph using `i in graph` # even though iterations over the graph clearly showed them being present `i == some_residue for i in graph.nodes` # for instance showed some True values. Since the issues persisted with various other implementations, this one finally seemed to solve # the problem, so we stick with it even though it is not particularly elegant. self._segments = [] self._linkages = [] self._connectivity = {} @property def segments(self): return self._segments @property def linkages(self): return self._linkages def add(self, residue1, residue2, linkage): self._segments.append((residue1, residue2)) self._linkages.append(linkage) if residue1 not in self._connectivity: self._connectivity[residue1] = [] self._connectivity[residue1].append(residue2) def remove(self, residue1, residue2): idx = self._segments.index((residue1, residue2)) self._segments.pop(idx) self._linkages.pop(idx) def get_linkage(self, residue1, residue2): idx = self._segments.index((residue1, residue2)) return self._linkages[idx] def has_matching_segment(self, residue1, residue2): return (residue1, residue2) in self._segments or any( residue1.matches(j) and residue2.matches(l) for j, l in self._segments ) def get_linkage_with_matching(self, residue1, residue2): for i, (j, l) in enumerate(self._segments): if residue1.matches(j) and residue2.matches(l): return self._linkages[i] return None def update(self, other): self._segments.extend(other._segments) self._linkages.extend(other._linkages) def __iter__(self): return iter(self._segments) def __len__(self): return len(self._segments)
[docs] class Glycan(core.Molecule): """ A glycan molecule Parameters ---------- id : str The id of the molecule structure: buildamol.Structure or biopython.PDB.Structure or buildamol.Molecule The structure of the molecule """ def __init__( self, structure, root_atom: Union[str, int, core.Atom] = None, model: int = 0, chain: str = None, ): if isinstance(structure, core.Molecule): model = model or structure.model.id chain = chain or structure._chain.id super().__init__( structure.structure, structure.root_atom, model, chain, ) self.add_bonds(*structure._bonds) else: super().__init__(structure, root_atom, model, chain) self._glycan_tree = GlycanTree() self._scaffold = None
[docs] @classmethod def from_pdb(cls, filename: str): new = core.Molecule.from_pdb(filename) new = cls(new) try: new.infer_glycan_tree() except: pass return new
[docs] @classmethod def from_compound(cls, compound: str, by: str = None, root_atom=None): new = core.Molecule.from_compound(compound, by, root_atom) # this is the only method that does not use cls() to create a new instance in the buildamol implementation... return cls(new)
[docs] @classmethod def from_iupac(cls, id: str, iupac: str, _topology=None) -> "Glycan": """ Generate a glycan molecule from an IUPAC/SNFG string """ new = read_iupac(id, iupac, _topology) try: new.infer_glycan_tree() except: pass return new
[docs] @classmethod def from_glycosmos(cls, id: str, _topology=None) -> "Glycan": """ Generate a glycan molecule from a GlyCosmos/GlyTouCan ID """ iupac = resources.get_iupac_from_glycosmos(id) return cls.from_iupac(id, iupac, _topology)
from_glytoucan = from_glycosmos from_snfg = from_iupac
[docs] def get_glytoucan_id(self) -> str: """ Get the GlyTouCan ID of the glycan molecule Returns ------- str The GlyTouCan ID (if available) """ return resources.get_glytoucan_id_from_iupac(self.to_iupac())
get_glycosmos_id = get_glytoucan_id
[docs] def search_glytoucan_ids(self) -> list: """ Find GlyTouCan IDs for glycans that are partial matches of the glycan. Returns ------- list A list of GlyTouCan IDs (if available) """ return resources.find_glytoucan_ids_from_iupac(self.to_iupac())
search_glycosmos_ids = search_glytoucan_ids
[docs] def find_glytoucan_ids(self) -> list: warnings.warn( "This method is deprecated. Use `search_glytoucan_ids` instead.", DeprecationWarning, ) return self.search_glytoucan_ids()
find_glycosmos_ids = find_glytoucan_ids
[docs] def to_iupac(self, add_terminal_conformation: bool = True) -> str: """ Generate an IUPAC/SNFG string from the glycan molecule Parameters ---------- add_terminal_conformation : bool Whether to add the terminal conformation of the first residue as `(a1-` or `(b1-` to the end string. """ return utils.make_iupac_string(self, add_terminal_conformation)
to_snfg = to_iupac
[docs] def infer_glycan_tree(self): """ Infer the glycan tree connectivity in case of a glycan molecule that was loaded externally """ if not self.root_atom: self.root_atom = 1 # now be sure to fill the glycan tree connections = self.get_residue_connections(triplet=False) connections = self._AtomGraph.direct_edges(self.root_atom, edges=connections) for bond in connections: a, b = bond res_a, res_b = a.parent, b.parent id = utils.iupac.make_link_id(bond) _link = core.Linkage(id=id, description="<auto-generated: labels only>") _link.atom1 = a _link.atom2 = b self._glycan_tree.add(res_a, res_b, _link)
[docs] def hist(self) -> pd.DataFrame: """ Get a histogram of the glycan residues Returns ------- pd.DataFrame The histogram """ hist = {} for res in self.get_residues(): name = resources.id_to_name(res.resname) if name is None: name = res.resname hist[name] = hist.get(name, 0) + 1 return pd.DataFrame( {"residue": [i for i in hist.keys()], "count": [i for i in hist.values()]} )
[docs] def attach( self, other: "core.Molecule", link: Union[str, "core.Linkage"] = None, at_residue: Union[int, "core.Residue"] = None, other_residue: Union[int, "core.Residue"] = None, use_patch: bool = True, inplace: bool = True, other_inplace: bool = False, _topology=None, ) -> "Glycan": out = core.Molecule.attach( self, other, link, at_residue, other_residue, use_patch, inplace, other_inplace, _topology, ) # now we need to add a new strip to the glycan tree # since we cannot know beforehand which residues were actually connected # we need to retrace some steps that are taken inside `attach` because the patcher or stitcher # is the only object that actually remembers the residues that it connected if isinstance(link, str): _topology = _topology or resources.get_default_topology() link = _topology.get_patch(link) factory = ( structural.__default_keep_keep_patcher__ if link.has_IC and use_patch else structural.__default_keep_keep_stitcher__ ) at_residue = factory._target_residue other_residue = factory._source_residue # add a new strip to the glycan tree out._glycan_tree.add(at_residue, other_residue, link) # if we have additional glycan strips from the attached structure, add these as well... if isinstance(other, Glycan): out._glycan_tree.update(factory.source._glycan_tree) return out
[docs] def extend_to(self, full_iupac: str, inplace: bool = True, _topology=None): """ Extend a partial glycan to a glycan matching the provided IUPAC/SNFG string. Note that this can only extend the glycan from the root-onward. It cannot retro-fit a root given some leaf glycans. Parameters ---------- full_iupac : str The full IUPAC/SNFG string which the glycan should be extended to. inplace : bool Whether to extend the glycan in place or return a new glycan. _topology A particular topology to use. If None, the default topology is used. """ full = Glycan.from_iupac(None, full_iupac, _topology) if self.root_residue is not None: full.rename_chain("A", self.root_residue.parent.id) obj = self if inplace else self.copy() present_segments = [ (i.resname, j.resname) for i, j in obj._glycan_tree.segments ] for segment in full._glycan_tree: # if the segment is already present in the current glycan, skip it _s = (segment[0].resname, segment[1].resname) if _s in present_segments: present_segments.remove(_s) continue # if obj._glycan_tree.has_matching_segment(*segment): # continue # get the linkage to attach with link = full._glycan_tree.get_linkage(*segment) # get the residue to attach to from the loaded compounds # they should be available since all standard IUPAC-represented glycans should be there incoming = Glycan.from_compound(segment[1].resname, by="id") # get the residue in the current glycan to attach to attach_residue = None candidates = (i for i in obj.get_residues() if i.matches(segment[0])) for candidate in candidates: if link.can_apply(obj, incoming, target_residue=candidate): attach_residue = candidate break if attach_residue is None: raise ValueError( f"Could not find a residue matching {segment[0]} to attach to in the current glycan" ) # attach the residue obj.attach( incoming, link, at_residue=attach_residue, other_inplace=True, _topology=_topology, ) return obj
[docs] def crop_to(self, full_iupac: str, inplace: bool = True, _topology=None): """ Crop a glycan to a glycan matching the provided IUPAC/SNFG string. Note that this can only crop the glycan from the leaves-onward. It cannot retro-fit a root given some leaf glycans. Parameters ---------- full_iupac : str The full IUPAC/SNFG string which the glycan should be cropped to. inplace : bool Whether to crop the glycan in place or return a new glycan. _topology A particular topology to use. If None, the default topology is used. """ full = Glycan.from_iupac(None, full_iupac, _topology) obj = self if inplace else self.copy() present_segments = [ (i.resname, j.resname) for i, j in obj._glycan_tree.segments ] for segment in present_segments: if not full._glycan_tree.has_matching_segment(*segment): obj.remove_residues( *[i for i in obj.get_residues() if i.matches(segment[0])] ) return obj
[docs] def remove_residues(self, *residues: Union[int, "core.Residue"]) -> list: residues = core.Molecule.remove_residues(self, *residues) _strips_to_remove = set() for strip in self._glycan_tree: for residue in residues: if strip[0] == residue or strip[1] == residue: _strips_to_remove.add(strip) for s in _strips_to_remove: self._glycan_tree.remove(*s) return residues
[docs] def clashes_with_scaffold( self, clash_threshold: float = 1.0, ignore_hydrogens: bool = True, coarse_precheck: bool = True, ) -> bool: """ Check if the glycan clashes with the scaffold Parameters ---------- clash_threshold : float The minimum distance to consider a clash ignore_hydrogens : bool Whether to ignore hydrogens coarse_precheck : bool Whether to use a coarse pre-check to speed up the process. This may lead to false negatives, especially if the scaffold has very large residues (e.g. lipids with long carbon chains). Returns ------- bool Whether the glycan clashes with the scaffold """ if self._scaffold is None: return False return ( len( self.find_clashes_with( self._scaffold, clash_threshold=clash_threshold, ignore_hydrogens=ignore_hydrogens, coarse_precheck=coarse_precheck, ) ) > 0 )
[docs] def snfg( self, ax=None, axis="y", **kwargs, ) -> "utils.visual.plt.Axes": """ Draw the SNFG 2D schematic of the glycan Parameters ---------- ax : matplotlib.Axes The axes to draw on axis : str The orientation of the glycan y-axis = vertical, x-axis = horizontal. Returns ------- matplotlib.Axes The axes """ v = utils.visual.GlycanViewer2D(self) return v.draw( ax=ax, axis=axis, **kwargs, )
draw2d = snfg # def show2d( # self, # axis="y", # **kwargs, # ): # """ # Draw and show the 2D schematic of the glycan # Parameters # ---------- # axis : str # The orientation of the glycan y-axis = vertical, x-axis = horizontal. # node_size : int # The size of the nodes # """ # ax = self.draw2d( # axis=axis, # **kwargs, # ) # if kwargs.get("svg", False): # return ax # utils.visual.plt.show() def __repr__(self): return f"Glycan({self.id})"
__all__ = [ "Glycan", "read_iupac", "read_snfg", "write_snfg", "write_iupac", "read_graph", "glycan", ] if __name__ == "__main__": # g = glycan("G78791QP") # g.show2d() # # s = "Gal(b1-3)GlcNAc(b1-3)[Gal(b1-3)GlcNAc(b1-3)[Gal(b1-4)GlcNAc(b1-6)]Gal(b1-4)GlcNAc(b1-6)][Gal(b1-4)GlcNAc(b1-2)]Gal(b1-4)Glc" # # glc = Glycan.from_iupac(None, s) # # glc2 = glc % "14bb" + glc import glycosylator as gl g = Glycan.from_iupac(None, "Gal(b1-4)GlcNAc") g.extend_to("Gal(b1-4)[Gal(b1-4)Glc(b1-6)]GlcNAc") g.root = g.get_atom(1) graph = g.make_atom_graph() edges = g.get_residue_connections(triplet=False, direct_by="root") env = gl.optimizers.DistanceRotatron(graph, edges) out = gl.optimizers.optimize(g, env) g.show() # g.show_snfg() # print(g.hist()) # print(g) # f = "/Users/noahhk/GIT/glycosylator/examples/my_first_glycan.pdb" # g = Glycan.from_pdb(f) # g.infer_glycan_tree() # g.show2d() # _read_snfg = gl.connect(glc, glc, "14bb") # _read_snfg = gl.connect(_read_snfg, glc, "16ab") # _read_snfg = gl.connect(_read_snfg, glc, "12ab", at_residue_a=-2) # out = gl.connect(_read_snfg, _read_snfg, "13ab") # out.show() # _removed = out.remove_residues(4) # print(out._glycan_tree) # out.show2d() # pass