geometricus package

Submodules

geometricus.geometricus module

geometricus.geometricus.Shapemer

An integer (when a trained model is used) or a list of integers, one per moment (the older binning approach)

alias of Union[bytes, tuple]

geometricus.geometricus.Shapemers

A list of Shapemer types

alias of List[Union[bytes, tuple]]

class geometricus.geometricus.Geometricus(protein_keys: List[Union[str, Tuple[str, str]]], shapemer_to_protein_indices: Dict[Union[bytes, tuple], List[Tuple[Union[str, Tuple[str, str]], int]]], proteins_to_shapemers: Dict[Union[str, Tuple[str, str]], List[Union[bytes, tuple]]], shapemer_keys: List[Union[bytes, tuple]], proteins_to_shapemer_residue_indices: Dict[Union[str, Tuple[str, str]], List[Union[bytes, tuple]]], resolution: Optional[Union[float, ndarray]] = None)[source]

Bases: object

Class for storing embedding information

protein_keys: List[Union[str, Tuple[str, str]]]

List of protein names = rows of the output embedding

shapemer_to_protein_indices: Dict[Union[bytes, tuple], List[Tuple[Union[str, Tuple[str, str]], int]]]

Maps each shapemer to the proteins which have it and to the corresponding residue indices within these proteins

proteins_to_shapemers: Dict[Union[str, Tuple[str, str]], List[Union[bytes, tuple]]]

Maps each protein to a list of shapemers in order of its residues

shapemer_keys: List[Union[bytes, tuple]]

List of shapemers found

proteins_to_shapemer_residue_indices: Dict[Union[str, Tuple[str, str]], List[Union[bytes, tuple]]]

Maps each protein to a set of residue indices covered by the current residue’s shapemer in order of its residues

resolution: Union[float, ndarray] = None

Multiplier that determines how coarse- or fine-grained each shape is. This can be a single number applied to all four moment invariants, or a numpy array of four numbers, one per invariant. (Used only by the older approach to binning shapemers.)
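To illustrate how a resolution multiplier controls granularity, here is a minimal sketch of the binning idea. It assumes the simplest possible scheme (scale each invariant, then truncate to an integer); the helper name bin_invariants is hypothetical, and the library may transform the invariants before scaling them.

```python
from typing import Sequence, Tuple, Union


def bin_invariants(
    invariants: Sequence[float],
    resolution: Union[float, Sequence[float]],
) -> Tuple[int, ...]:
    """Hypothetical helper: scale each invariant and truncate it to an integer bin."""
    if isinstance(resolution, (int, float)):
        # A single number is applied to every invariant.
        resolution = [resolution] * len(invariants)
    return tuple(int(value * factor) for value, factor in zip(invariants, resolution))


invariants = [0.26, 1.04, 2.5, 3.99]
coarse = bin_invariants(invariants, 1.0)               # one multiplier for all
fine = bin_invariants(invariants, 2.0)                 # higher resolution, more bins
per_invariant = bin_invariants(invariants, [1, 10, 1, 1])
```

A larger multiplier separates values that a smaller one would place in the same bin, which is why it makes the resulting shapes finer-grained.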

classmethod from_protein_files(input_files: Union[Path, str, List[str]], model: Optional[ShapemerLearn] = None, split_infos: Optional[List[SplitInfo]] = None, moment_types: Optional[List[str]] = None, resolution: Optional[Union[float, ndarray]] = None, n_threads: int = 1, verbose: bool = True)[source]

Creates a Geometricus object from protein structure files

Parameters:
  • input_files

    Can be any of:

    • A list of structure files (.pdb, .pdb.gz, .cif, .cif.gz)
    • A list of (structure_file, chain)
    • A list of PDBIDs or PDBID_chain or (PDB ID, chain)
    • A folder with input structure files
    • A file which lists structure filenames or “structure_filename, chain” on each line
    • A file which lists PDBIDs or PDBID_chain or “PDBID, chain” on each line

  • model – Trained ShapemerLearn model. If this is not None, shapemers are generated using the trained model, and split_infos, moment_types, and resolution are ignored

  • split_infos – List of SplitInfo objects

  • moment_types – List of moment types to use

  • resolution – Multiplier that determines how coarse- or fine-grained each shape is. This can be a single number applied to all four moment invariants, or a numpy array of four numbers, one per invariant. (Used only by the older approach to binning shapemers.)

  • n_threads – Number of threads to use

  • verbose – Whether to print progress

Return type:

Geometricus object

classmethod from_invariants(invariants: Union[Generator[MultipleMomentInvariants], List[MultipleMomentInvariants]], protein_keys: Optional[List[ProteinKey]] = None, model: Optional[ShapemerLearn] = None, resolution: Optional[Union[float, np.ndarray]] = None)[source]

Makes a Geometricus object from a list (or generator) of MultipleMomentInvariants objects

Parameters:
  • invariants – List or generator of MultipleMomentInvariants objects

  • protein_keys – List of protein names = rows of the output embedding. If None, takes all keys in invariants

  • model – if given, uses this model to make the shapemers

  • resolution – Multiplier that determines how coarse- or fine-grained each shape is. This can be a single number applied to all four moment invariants, or a numpy array of four numbers, one per invariant. (Used only by the older approach to binning shapemers.)

map_shapemers_to_indices(protein_keys=None)[source]

Maps each shapemer to the proteins which have it and to the corresponding residue indices within these proteins, i.e. shapemer → list of (protein_key, residue_index)
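The mapping described here is an inverted index over proteins_to_shapemers. The following is a minimal sketch of that idea with plain dictionaries; the function name invert_shapemers and the toy data are illustrative assumptions, not the library's implementation.

```python
from collections import defaultdict


def invert_shapemers(proteins_to_shapemers):
    """Hypothetical helper: shapemer -> [(protein_key, residue_index), ...]."""
    index = defaultdict(list)
    for protein_key, shapemers in proteins_to_shapemers.items():
        # Shapemers are stored in residue order, so the position is the residue index.
        for residue_index, shapemer in enumerate(shapemers):
            index[shapemer].append((protein_key, residue_index))
    return dict(index)


# Toy data: protein keys may be a PDB ID or a (PDB ID, chain) tuple.
proteins_to_shapemers = {
    "1abc": [(1, 2), (3, 4), (1, 2)],
    ("2xyz", "A"): [(3, 4), (5, 6)],
}
shapemer_index = invert_shapemers(proteins_to_shapemers)
```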

map_protein_to_shapemer_indices(protein_keys=None, shapemer_keys=None)[source]

Maps each protein to a list of shapemer indices, in residue order, where each index is the position of that residue’s shapemer in shapemer_keys
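As a rough illustration of this mapping, the sketch below turns one protein's shapemers into positions within shapemer_keys and then tallies them into a count row. Both helper names are hypothetical, and treating the count row as one row of the embedding (one column per shapemer) is an assumption.

```python
def to_shapemer_indices(shapemers, shapemer_keys):
    """Hypothetical helper: replace each shapemer by its position in shapemer_keys."""
    position = {shapemer: i for i, shapemer in enumerate(shapemer_keys)}
    return [position[shapemer] for shapemer in shapemers]


def count_row(indices, alphabet_size):
    """Hypothetical helper: count how often each shapemer index occurs."""
    row = [0] * alphabet_size
    for i in indices:
        row[i] += 1
    return row


shapemer_keys = [(1, 2), (3, 4), (5, 6)]
indices = to_shapemer_indices([(3, 4), (1, 2), (3, 4)], shapemer_keys)
row = count_row(indices, len(shapemer_keys))
```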

map_shapemer_to_residues(shapemer: Union[bytes, tuple]) → Dict[Union[str, Tuple[str, str]], Set[int]][source]

Gets residue indices within a particular shapemer across all proteins.
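One way to picture this lookup is shown below. It is a sketch under the assumption that each residue position stores the residue indices covered by its shapemer (as in proteins_to_shapemer_residue_indices); the function name and toy data are hypothetical.

```python
def residues_for_shapemer(shapemer, proteins_to_shapemers, proteins_to_residue_indices):
    """Hypothetical helper: protein_key -> set of residue indices covered by `shapemer`."""
    result = {}
    for protein_key, shapemers in proteins_to_shapemers.items():
        for position, current in enumerate(shapemers):
            if current == shapemer:
                covered = proteins_to_residue_indices[protein_key][position]
                result.setdefault(protein_key, set()).update(covered)
    return result


proteins_to_shapemers = {"1abc": [(1, 2), (3, 4), (1, 2)], "2xyz": [(3, 4)]}
proteins_to_residue_indices = {
    "1abc": [[0, 1, 2], [1, 2, 3], [2, 3, 4]],
    "2xyz": [[0, 1]],
}
hits = residues_for_shapemer((1, 2), proteins_to_shapemers, proteins_to_residue_indices)
```

Proteins that do not contain the shapemer are simply absent from the result.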

get_count_matrix(protein_keys=None, shapemer_keys=None)[source]
geometricus.geometricus.make_count_matrix(residues_list, alphabet_size: int)[source]

geometricus.moment_utility module

geometricus.moment_utility.nb_mean_axis_0(array: ndarray) → ndarray[source]

Same as np.mean(array, axis=0), but compiled with numba’s njit

class geometricus.moment_utility.MomentInfo(moment_function: Callable[[int, int, int, numpy.ndarray, numpy.ndarray], float], mu_arguments: List[Tuple[int, int, int]])[source]

Bases: object

moment_function: Callable[[int, int, int, ndarray, ndarray], float]
mu_arguments: List[Tuple[int, int, int]]
geometricus.moment_utility.mu(p, q, r, coords, centroid)[source]

Central moment
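For reference, a central moment of order (p, q, r) is commonly defined as the sum over all points of (x - cx)^p (y - cy)^q (z - cz)^r, where (cx, cy, cz) is the centroid. The sketch below follows that textbook definition in plain Python; whether the library normalizes the sum is not stated here, so the unnormalized form is an assumption, and the helper names are hypothetical.

```python
def centroid_of(coords):
    """Mean position of a list of (x, y, z) points."""
    n = len(coords)
    return tuple(sum(point[i] for point in coords) / n for i in range(3))


def central_moment(p, q, r, coords, centroid):
    """Unnormalized central moment of order (p, q, r)."""
    cx, cy, cz = centroid
    return sum(
        (x - cx) ** p * (y - cy) ** q * (z - cz) ** r
        for x, y, z in coords
    )


# Four corners of a 2 x 2 square in the z = 0 plane.
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
centroid = centroid_of(coords)
mu_200 = central_moment(2, 0, 0, coords, centroid)
mu_100 = central_moment(1, 0, 0, coords, centroid)
mu_002 = central_moment(0, 0, 2, coords, centroid)
```

First-order central moments vanish by construction, and mu_002 is zero here because all points lie in one plane.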

geometricus.moment_utility.O_3(mu_200, mu_020, mu_002)[source]
geometricus.moment_utility.O_4(mu_200, mu_020, mu_002, mu_110, mu_101, mu_011)[source]
geometricus.moment_utility.O_5(mu_200, mu_020, mu_002, mu_110, mu_101, mu_011)[source]
geometricus.moment_utility.F(mu_201, mu_021, mu_210, mu_300, mu_111, mu_012, mu_003, mu_030, mu_102, mu_120)[source]
geometricus.moment_utility.make_formula(name, formula_string)[source]

Generate code from one of the formula in Appendix 4A of “2D and 3D Image Analysis by Moments”

Parameters:
  • name – moment_name

  • formula_string – formula copy-pasted from PDF

geometricus.moment_utility.phi_2(mu_020, mu_011, mu_110, mu_200, mu_002, mu_101)[source]
geometricus.moment_utility.phi_3(mu_020, mu_011, mu_110, mu_200, mu_002, mu_101)[source]
geometricus.moment_utility.phi_4(mu_030, mu_021, mu_120, mu_003, mu_111, mu_201, mu_102, mu_210, mu_012, mu_300)[source]
geometricus.moment_utility.phi_5(mu_030, mu_021, mu_120, mu_003, mu_201, mu_102, mu_210, mu_012, mu_300)[source]
geometricus.moment_utility.phi_6(mu_030, mu_021, mu_120, mu_003, mu_111, mu_201, mu_102, mu_210, mu_012, mu_300)[source]
geometricus.moment_utility.phi_7(mu_030, mu_021, mu_120, mu_003, mu_111, mu_201, mu_102, mu_210, mu_012, mu_300)[source]
geometricus.moment_utility.phi_8(mu_030, mu_021, mu_120, mu_003, mu_111, mu_201, mu_102, mu_210, mu_012, mu_300)[source]
geometricus.moment_utility.phi_9(mu_030, mu_021, mu_120, mu_101, mu_003, mu_200, mu_110, mu_201, mu_111, mu_102, mu_210, mu_020, mu_012, mu_002, mu_011, mu_300)[source]
geometricus.moment_utility.phi_10(mu_030, mu_021, mu_120, mu_101, mu_003, mu_200, mu_110, mu_201, mu_111, mu_102, mu_210, mu_020, mu_012, mu_002, mu_011, mu_300)[source]
geometricus.moment_utility.phi_11(mu_030, mu_021, mu_120, mu_101, mu_003, mu_200, mu_110, mu_201, mu_102, mu_210, mu_012, mu_020, mu_002, mu_011, mu_300)[source]
geometricus.moment_utility.phi_12(mu_030, mu_021, mu_120, mu_101, mu_003, mu_200, mu_110, mu_201, mu_111, mu_102, mu_210, mu_020, mu_012, mu_002, mu_011, mu_300)[source]
geometricus.moment_utility.phi_13(mu_030, mu_021, mu_120, mu_101, mu_003, mu_200, mu_110, mu_201, mu_111, mu_102, mu_210, mu_012, mu_020, mu_002, mu_011, mu_300)[source]
geometricus.moment_utility.CI(mu_000, mu_200, mu_020, mu_002, mu_110, mu_101, mu_011, mu_111, mu_210, mu_201, mu_120, mu_021, mu_012, mu_102, mu_003, mu_030, mu_300, mu_013, mu_103, mu_130, mu_310, mu_301, mu_031, mu_112, mu_121, mu_211, mu_022, mu_202, mu_220, mu_400, mu_040, mu_004)[source]
class geometricus.moment_utility.MomentType(value)[source]

Bases: Enum

Different rotation invariant moments (order 2 and order 3)

Choose from [‘O_3’, ‘O_4’, ‘O_5’, ‘F’, ‘phi_2’, ‘phi_3’, ‘phi_4’, ‘phi_5’, ‘phi_6’, ‘phi_7’, ‘phi_8’, ‘phi_9’, ‘phi_10’, ‘phi_11’, ‘phi_12’, ‘phi_13’, ‘CI’]

O_3, O_4, and O_5 are second order moments from [1] and F is a third order moment from [2]. These four moments are used in the original Geometricus manuscript [3].

phi_{2-13} are independent third order moments from [4].

CI is the chiral invariant moment from [5].

[1] Mamistvalov, Alexander G. “N-dimensional moment invariants and conceptual mathematical theory of recognition n-dimensional solids.” IEEE Transactions on pattern analysis and machine intelligence 20.8 (1998): 819-831.

[2] Flusser, Jan, Jirí Boldys, and Barbara Zitová. “Moment forms invariant to rotation and blur in arbitrary number of dimensions.” IEEE Transactions on Pattern Analysis and Machine Intelligence 25.2 (2003): 234-246.

[3] Durairaj, Janani, et al. “Geometricus represents protein structures as shape-mers derived from moment invariants.” Bioinformatics 36.Supplement_2 (2020): i718-i725.

[4] Flusser, Jan, Tomas Suk, and Barbara Zitová. 2D and 3D image analysis by moments. John Wiley & Sons, 2016.

[5] Hattne, Johan, and Victor S. Lamzin. “A moment invariant for evaluating the chirality of three-dimensional objects.” Journal of The Royal Society Interface 8.54 (2011): 144-151.

O_3 = MomentInfo(moment_function=CPUDispatcher(<function O_3>), mu_arguments=[(2, 0, 0), (0, 2, 0), (0, 0, 2)])
O_4 = MomentInfo(moment_function=CPUDispatcher(<function O_4>), mu_arguments=[(2, 0, 0), (0, 2, 0), (0, 0, 2), (1, 1, 0), (1, 0, 1), (0, 1, 1)])
O_5 = MomentInfo(moment_function=CPUDispatcher(<function O_5>), mu_arguments=[(2, 0, 0), (0, 2, 0), (0, 0, 2), (1, 1, 0), (1, 0, 1), (0, 1, 1)])
F = MomentInfo(moment_function=CPUDispatcher(<function F>), mu_arguments=[(2, 0, 1), (0, 2, 1), (2, 1, 0), (3, 0, 0), (1, 1, 1), (0, 1, 2), (0, 0, 3), (0, 3, 0), (1, 0, 2), (1, 2, 0)])
phi_2 = MomentInfo(moment_function=CPUDispatcher(<function phi_2>), mu_arguments=[(0, 2, 0), (0, 1, 1), (1, 1, 0), (2, 0, 0), (0, 0, 2), (1, 0, 1)])
phi_3 = MomentInfo(moment_function=CPUDispatcher(<function phi_3>), mu_arguments=[(0, 2, 0), (0, 1, 1), (1, 1, 0), (2, 0, 0), (0, 0, 2), (1, 0, 1)])
phi_4 = MomentInfo(moment_function=CPUDispatcher(<function phi_4>), mu_arguments=[(0, 3, 0), (0, 2, 1), (1, 2, 0), (0, 0, 3), (1, 1, 1), (2, 0, 1), (1, 0, 2), (2, 1, 0), (0, 1, 2), (3, 0, 0)])
phi_5 = MomentInfo(moment_function=CPUDispatcher(<function phi_5>), mu_arguments=[(0, 3, 0), (0, 2, 1), (1, 2, 0), (0, 0, 3), (2, 0, 1), (1, 0, 2), (2, 1, 0), (0, 1, 2), (3, 0, 0)])
phi_6 = MomentInfo(moment_function=CPUDispatcher(<function phi_6>), mu_arguments=[(0, 3, 0), (0, 2, 1), (1, 2, 0), (0, 0, 3), (1, 1, 1), (2, 0, 1), (1, 0, 2), (2, 1, 0), (0, 1, 2), (3, 0, 0)])
phi_7 = MomentInfo(moment_function=CPUDispatcher(<function phi_7>), mu_arguments=[(0, 3, 0), (0, 2, 1), (1, 2, 0), (0, 0, 3), (1, 1, 1), (2, 0, 1), (1, 0, 2), (2, 1, 0), (0, 1, 2), (3, 0, 0)])
phi_8 = MomentInfo(moment_function=CPUDispatcher(<function phi_8>), mu_arguments=[(0, 3, 0), (0, 2, 1), (1, 2, 0), (0, 0, 3), (1, 1, 1), (2, 0, 1), (1, 0, 2), (2, 1, 0), (0, 1, 2), (3, 0, 0)])
phi_9 = MomentInfo(moment_function=CPUDispatcher(<function phi_9>), mu_arguments=[(0, 3, 0), (0, 2, 1), (1, 2, 0), (1, 0, 1), (0, 0, 3), (2, 0, 0), (1, 1, 0), (2, 0, 1), (1, 1, 1), (1, 0, 2), (2, 1, 0), (0, 2, 0), (0, 1, 2), (0, 0, 2), (0, 1, 1), (3, 0, 0)])
phi_10 = MomentInfo(moment_function=CPUDispatcher(<function phi_10>), mu_arguments=[(0, 3, 0), (0, 2, 1), (1, 2, 0), (1, 0, 1), (0, 0, 3), (2, 0, 0), (1, 1, 0), (2, 0, 1), (1, 1, 1), (1, 0, 2), (2, 1, 0), (0, 2, 0), (0, 1, 2), (0, 0, 2), (0, 1, 1), (3, 0, 0)])
phi_11 = MomentInfo(moment_function=CPUDispatcher(<function phi_11>), mu_arguments=[(0, 3, 0), (0, 2, 1), (1, 2, 0), (1, 0, 1), (0, 0, 3), (2, 0, 0), (1, 1, 0), (2, 0, 1), (1, 0, 2), (2, 1, 0), (0, 1, 2), (0, 2, 0), (0, 0, 2), (0, 1, 1), (3, 0, 0)])
phi_12 = MomentInfo(moment_function=CPUDispatcher(<function phi_12>), mu_arguments=[(0, 3, 0), (0, 2, 1), (1, 2, 0), (1, 0, 1), (0, 0, 3), (2, 0, 0), (1, 1, 0), (2, 0, 1), (1, 1, 1), (1, 0, 2), (2, 1, 0), (0, 2, 0), (0, 1, 2), (0, 0, 2), (0, 1, 1), (3, 0, 0)])
phi_13 = MomentInfo(moment_function=CPUDispatcher(<function phi_13>), mu_arguments=[(0, 3, 0), (0, 2, 1), (1, 2, 0), (1, 0, 1), (0, 0, 3), (2, 0, 0), (1, 1, 0), (2, 0, 1), (1, 1, 1), (1, 0, 2), (2, 1, 0), (0, 1, 2), (0, 2, 0), (0, 0, 2), (0, 1, 1), (3, 0, 0)])
CI = MomentInfo(moment_function=CPUDispatcher(<function CI>), mu_arguments=[(0, 0, 0), (2, 0, 0), (0, 2, 0), (0, 0, 2), (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1), (2, 1, 0), (2, 0, 1), (1, 2, 0), (0, 2, 1), (0, 1, 2), (1, 0, 2), (0, 0, 3), (0, 3, 0), (3, 0, 0), (0, 1, 3), (1, 0, 3), (1, 3, 0), (3, 1, 0), (3, 0, 1), (0, 3, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1), (0, 2, 2), (2, 0, 2), (2, 2, 0), (4, 0, 0), (0, 4, 0), (0, 0, 4)])
get_moments_from_coordinates(mus: List[float])[source]
geometricus.moment_utility.get_moments_from_coordinates(coordinates: numpy.ndarray, moment_types: List[MomentType] = (MomentType.O_3, MomentType.O_4, MomentType.O_5, MomentType.F)) → List[float][source]

Gets rotation-invariant moments for a set of coordinates

Parameters:
  • coordinates

  • moment_types – Which moments to calculate. Choose from [‘O_3’, ‘O_4’, ‘O_5’, ‘F’, ‘phi_2’, ‘phi_3’, ‘phi_4’, ‘phi_5’, ‘phi_6’, ‘phi_7’, ‘phi_8’, ‘phi_9’, ‘phi_10’, ‘phi_11’, ‘phi_12’, ‘phi_13’, ‘CI’]

Return type:

list of moments
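To make "rotation-invariant" concrete, the sketch below checks one well-known invariant: the sum mu_200 + mu_020 + mu_002, which is the trace of the second-order moment matrix and does not change under rotation. That O_3 has exactly this form is an assumption (its argument list matches), and all helper names are hypothetical.

```python
import math


def central_moment(p, q, r, coords):
    """Unnormalized central moment of order (p, q, r) about the centroid."""
    n = len(coords)
    cx = sum(x for x, _, _ in coords) / n
    cy = sum(y for _, y, _ in coords) / n
    cz = sum(z for _, _, z in coords) / n
    return sum((x - cx) ** p * (y - cy) ** q * (z - cz) ** r for x, y, z in coords)


def second_order_trace(coords):
    """Sum of the three pure second-order central moments."""
    return (
        central_moment(2, 0, 0, coords)
        + central_moment(0, 2, 0, coords)
        + central_moment(0, 0, 2, coords)
    )


def rotate_z_90(coords):
    """Rotate every point by 90 degrees about the z axis."""
    return [(-y, x, z) for x, y, z in coords]


coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 2.0)]
rotated = rotate_z_90(coords)
trace_before = second_order_trace(coords)
trace_after = second_order_trace(rotated)
```

The individual moments do change under the rotation (mu_200 and mu_020 swap), but their sum does not, which is the property a shape descriptor needs.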

geometricus.protein_utility module

geometricus.protein_utility.ProteinKey

A protein key is either its PDB ID (str) or a tuple of (PDB ID, chain)

alias of Union[str, Tuple[str, str]]

class geometricus.protein_utility.Structure(name: Union[str, Tuple[str, str]], length: int, coordinates: ndarray)[source]

Bases: object

Class to store basic protein structure information

name: Union[str, Tuple[str, str]]

PDB ID or (PDB ID, chain)

length: int

Number of residues

coordinates: ndarray

Coordinates

geometricus.protein_utility.parse_structure_file(input_value: Union[Path, Path, str, str])[source]

Parses a protein structure file (.pdb, .pdb.gz, .cif, .cif.gz) or a PDBID or PDBID_Chain and returns a prody AtomGroup object

Parameters:

input_value (filename or (filename, chain) or PDBID or PDBID_Chain or (PDBID, chain)) –

Return type:

prody AtomGroup object

geometricus.protein_utility.get_structure_files(input_value: Union[Path, str, List[str]]) → List[Union[str, str, str]][source]

Gets a list of structure files or PDB IDs from an input representing one of:

  • A list of structure files (.pdb, .pdb.gz, .cif, .cif.gz)
  • A list of (structure_file, chain)
  • A list of PDBIDs or PDBID_chain or (PDB ID, chain)
  • A folder with input structure files
  • A file which lists structure filenames or “structure_filename, chain” on each line
  • A file which lists PDBIDs or PDBID_chain or “PDBID, chain” on each line

Parameters:

input_value

Return type:

List of structure files or (structure_file, chain) or PDBIDs or (PDB ID, chain)

geometricus.protein_utility.group_indices(input_list: List[int]) → List[List[int]][source]

Groups the indices of equal neighbouring values, e.g. [1, 1, 1, 2, 2, 3, 3, 3, 4] -> [[0, 1, 2], [3, 4], [5, 6, 7], [8]]
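The example above can be reproduced with a short sketch based on itertools.groupby. It assumes the grouping is over runs of consecutive equal values (the example is consistent with this, but does not rule out other readings); the function name is hypothetical.

```python
from itertools import groupby
from typing import List


def group_consecutive_indices(values: List[int]) -> List[List[int]]:
    """Hypothetical helper: indices of each run of consecutive equal values."""
    groups = []
    start = 0
    for _, run in groupby(values):
        length = len(list(run))
        groups.append(list(range(start, start + length)))
        start += length
    return groups


grouped = group_consecutive_indices([1, 1, 1, 2, 2, 3, 3, 3, 4])
```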

geometricus.protein_utility.get_alpha_indices(protein: AtomGroup) → List[int][source]

Gets indices of alpha carbons of a prody AtomGroup object

geometricus.protein_utility.get_beta_indices(protein: AtomGroup) → List[int][source]

Gets indices of beta carbons of a prody AtomGroup object (if a beta carbon doesn’t exist, the alpha carbon index is returned)

geometricus.protein_utility.get_sequences_from_fasta_yield(fasta_file: Union[str, Path], comments='#') → tuple[source]

Returns an iterator over (accession, sequence) pairs read from a FASTA file

Parameters:
  • fasta_file

  • comments – ignore lines containing any of these strings

Return type:

(accession, sequence)
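A generator of this kind can be sketched with the standard library alone. The version below is an illustration, not the library's code: the name read_fasta is hypothetical, it reads from an open text handle rather than a path, and it assumes the accession is the first whitespace-delimited token after ">".

```python
import io


def read_fasta(handle, comments=("#",)):
    """Hypothetical helper: yield (accession, sequence) pairs from FASTA text."""
    accession, chunks = None, []
    for line in handle:
        line = line.strip()
        # Skip blank lines and lines containing any comment marker.
        if not line or any(marker in line for marker in comments):
            continue
        if line.startswith(">"):
            if accession is not None:
                yield accession, "".join(chunks)
            header = line[1:].split()
            accession, chunks = (header[0] if header else ""), []
        elif accession is not None:
            chunks.append(line)
    if accession is not None:
        yield accession, "".join(chunks)


example = io.StringIO(">seq1 first protein\nMKT\nAIL\n# a comment\n>seq2\nGGS\n")
records = list(read_fasta(example))
as_dict = dict(records)
```

Wrapping the generator in dict() gives the accession-to-sequence mapping that the dict-returning variant describes.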

geometricus.protein_utility.get_sequences_from_fasta(fasta_file: Union[str, Path], comments='#') → dict[source]

Returns a dict mapping accession to sequence from a FASTA file

Parameters:
  • fasta_file

  • comments – ignore lines containing any of these strings

Returns:

{accession: sequence}

Return type:

dict

geometricus.protein_utility.get_rmsd(coords_1: np.ndarray, coords_2: np.ndarray) → float[source]

RMSD of paired coordinates: the square root of the mean squared Euclidean distance between corresponding points
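The definition can be written out directly. This is a plain-Python sketch of the standard RMSD formula on lists of (x, y, z) tuples, not the library's numpy implementation; the function name rmsd is hypothetical.

```python
import math


def rmsd(coords_1, coords_2):
    """Root-mean-square deviation between two equally long lists of 3D points."""
    if len(coords_1) != len(coords_2) or not coords_1:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(coords_1, coords_2)
    )
    return math.sqrt(total / len(coords_1))


a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
shifted = rmsd(a, b)      # every point is displaced by 2 along z
identical = rmsd(a, a)
```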

geometricus.protein_utility.get_rotation_matrix(coords_1: np.ndarray, coords_2: np.ndarray)[source]

Superposes paired coordinates on each other using Kabsch superposition (SVD). Assumes centered coordinates.

Parameters:
  • coords_1 – numpy array of coordinate data for the first protein; shape = (n, 3)

  • coords_2 – numpy array of corresponding coordinate data for the second protein; shape = (n, 3)

Return type:

rotation matrix for optimal superposition
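The Kabsch procedure can be sketched in a few lines of numpy. This is an illustration under stated assumptions, not the library's code: the name kabsch_rotation is hypothetical, and the convention chosen here (the returned matrix R is applied as coords_2 @ R to bring coords_2 onto coords_1) may differ from the library's.

```python
import numpy as np


def kabsch_rotation(coords_1: np.ndarray, coords_2: np.ndarray) -> np.ndarray:
    """Rotation R minimizing ||coords_2 @ R - coords_1|| for centered (n, 3) inputs."""
    covariance = coords_2.T @ coords_1
    u, _, vt = np.linalg.svd(covariance)
    # Flip the last singular vector if the result would be a reflection.
    if np.linalg.det(u @ vt) < 0:
        u[:, -1] *= -1
    return u @ vt


# Four non-coplanar points whose centroid is the origin.
coords_1 = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [-1.0, -2.0, -3.0]])
# Rotate them by 90 degrees about z to obtain the second set.
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
coords_2 = coords_1 @ rot_z.T

rotation = kabsch_rotation(coords_1, coords_2)
superposed = coords_2 @ rotation
```

Both inputs must be centered first (subtract each set's mean), because the SVD step only solves for rotation, not translation.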

geometricus.protein_utility.alignment_to_numpy(alignment)[source]