How Computers “See” Molecules


To a computer, Edvard Munch's "The Scream" is nothing more than a grid of pixel values. It has no sense of why swirling lines in a twilight sky convey the agony of a scream. That's because modern digital computers fundamentally process only binary signals [1,2]; they don't inherently comprehend the objects and emotions we perceive.

To mimic human intelligence, we first need an intermediate form (a representation) to "translate" our sensory world into something a computer can handle. For images, this may mean extracting edges, colors, shapes, and so on. Likewise, in Natural Language Processing (NLP), a computer sees human language as an unstructured stream of symbols that has to be turned into numeric vectors or other structured forms. Only then can it begin to map raw input to higher-level concepts (i.e., constructing a model).

Human intelligence also depends on internal representations.

In psychology, a representation refers to an internal mental symbol or image that stands for something in the outside world [3]. In other words, a representation is how information is encoded within the brain: the symbols we use (words, images, memories, artistic depictions, etc.) to stand for objects and concepts.

Our senses don't simply transfer the external world directly into our brains; instead, they convert sensory input into abstract neural signals. For instance, the eyes convert light into electrical signals on the retina, and the ears turn air vibrations into nerve impulses. These neural signals are the brain's representation of the external world, which is used to reconstruct our perception of reality, essentially constructing a "model" in our mind.

Between ages one and two, children enter the early symbolic stage that Piaget described [4]. That is when kids start using one thing to represent another: a toddler might hold a banana up to their ear and babble as if it's a phone, or push a box around pretending it's a car. This kind of symbolic play is important for cognitive development, since it shows the child can move beyond the here-and-now and project the concepts in their mind onto reality [5].

Without our senses translating physical signals into internal codes, we couldn’t perceive anything [5].

"Garbage in, garbage out". The quality of a representation sets an upper bound on the performance of any model built on it [6,7].

Much of the progress in human intelligence has come from improving how we represent knowledge [8].

One of the core goals of education is to help students form effective mental representations of new knowledge. Seasoned educators use diagrams, animations, analogies, and other tools to present abstract concepts in a vivid, relatable way. Richard Mayer argues that meaningful learning happens when learners form a coherent mental representation or model of the material, rather than simply memorizing disconnected facts [8]. In meaningful learning, new information integrates into existing knowledge, allowing students to transfer and apply it in novel situations.

However, in practice, factors like limited model capacity and finite computing resources constrain how complex our representations can be. Compressing input data inevitably risks information loss, noise, and artifacts. So, as a first step, developing a "good enough" representation requires balancing several key properties:

  • It should retain the information critical to the task. (A clear problem definition helps filter out the rest.)
  • It should be as compact as possible: minimizing redundancy and keeping dimensionality low.
  • It should separate classes in feature space. Samples from the same class cluster together, while those from different classes stay far apart.
  • It should be robust to input noise, compression artifacts, and shifts in data modality.
  • Invariance. Representations should be invariant to task‑irrelevant changes (e.g., rotating or translating an image, or changing its brightness).
  • Generalizability.
  • Interpretability.
  • Transferability.

These limitations on representation complexity are somewhat analogous to the limited capacity of our own working memory.

Human short-term memory, on average, can only hold about 7±2 items at once [9]. When too many independent pieces of information arrive at the same time (beyond what our cognitive load can handle), our brains bog down. Cognitive psychology research shows that with the right guidance (by adjusting how information is represented), people can reorganize information to overcome this apparent limit [10,11]. For instance, we can remember a long string of digits more easily by chunking them into meaningful groups (which is why phone numbers are often split into shorter blocks).

Now, shifting from images and language to the microscopic world of molecules, we face the same challenge: how do we translate real-world molecules into a form that a computer can understand? With the right representation, a computer can infer chemical properties or biological functions, and ultimately map those to higher‑level concepts (e.g., a drug's activity or a molecule's protein binding). In this article, we'll explore the common methods that let computers "see" molecules.

Chemical Formula

Perhaps the most straightforward depiction of a molecule is its chemical formula, like C8H10N4O2 (caffeine), which tells us there are 8 carbon atoms, 10 hydrogen atoms, 4 nitrogen atoms, and 2 oxygen atoms. However, its very simplicity is also its limitation: a formula conveys nothing about how those atoms are connected (the bonding topology), how they're arranged in space, or where functional groups are positioned. That's why isomers like ethanol and dimethyl ether both share C2H6O yet differ completely in structure and properties.

Chemical formula and 2D structures of ethanol and dimethyl ether. Image by author.

Linear String

Another common way to represent molecules is to encode them as a linear string of characters, a format widely adopted in databases [12,13].

SMILES

The most classic example is SMILES (Simplified Molecular Input Line Entry System) [14], developed by David Weininger in the 1980s. SMILES treats atoms as nodes and bonds as edges, then "flattens" them into a 1D string via a depth‑first traversal, preserving all the connectivity and ring information. Single, double, triple, and aromatic bonds are denoted by the symbols "-", "=", "#", and ":", respectively. Numbers are used to mark the beginning and end of rings, and branches off the main chain are enclosed in parentheses. (See more in SMILES – Wikipedia.)

SMILES is simple, intuitive, and compact for storage. Its extended syntax supports stereochemistry and isotopes. There's also a rich ecosystem of tools supporting it: most chemistry libraries allow us to convert between SMILES and other standard formats.

However, without an agreed-upon canonicalization algorithm, the same molecule can be written in multiple valid SMILES forms. This can potentially lead to inconsistencies or "data pollution", especially when merging data from multiple sources.
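To see canonicalization in action, here is a minimal sketch using RDKit [19]: two different but valid SMILES spellings of caffeine collapse to the same canonical string.

```python
from rdkit import Chem

# Two valid SMILES strings for caffeine, written with different atom orderings.
smiles_variants = [
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
]

# MolToSmiles produces RDKit's canonical SMILES by default,
# so both variants map to a single string.
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in smiles_variants}
print(canonical)  # one unique canonical SMILES
```

Note that canonicalization here is RDKit's own algorithm; different toolkits may produce different (but internally consistent) canonical forms.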

InChI

Another widely used string format is InChI (International Chemical Identifier) [15], introduced by IUPAC in 2005 to generate globally standardized, machine-readable, and unique molecular identifiers. InChI strings, though longer than SMILES, encode more details in layers (including atoms and their bond connectivity, tautomeric state, isotopes, stereochemistry, and charge), each with strict rules and priority. (See more in InChI – Wikipedia.)

Because an InChI string can become very lengthy as a molecule grows more complex, it is often paired with a 27‑character InChIKey hash [15]. InChIKeys aren't human‑friendly, but they're ideal for database indexing and for exchanging molecular identifiers across systems.

Figure 2: Linear representations of caffeine. Image by author.
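As a rough sketch, RDKit [19] (when built with InChI support, as in the standard distributions) can also produce the InChI and InChIKey for the same molecule:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")  # caffeine

inchi = Chem.MolToInchi(mol)        # layered, standardized identifier
inchikey = Chem.MolToInchiKey(mol)  # fixed-length 27-character hash of the InChI
print(inchi)
print(inchikey)
```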

Molecular Descriptor

Many computational models require numeric inputs. Compared with linear string representations, molecular descriptors turn a molecule's properties and patterns into a vector of numerical features, delivering satisfactory performance in many tasks [7, 16-18].

Todeschini and Consonni describe a molecular descriptor as the result of a logical and mathematical procedure that transforms chemical information encoded in a symbolic representation of a molecule into a useful number, or the result of a standardized experiment [16].

We can think of a set of molecular descriptors as a standardized "physical exam sheet" for a molecule, asking questions like:

  • Does it have a benzene ring?
  • How many carbon atoms does it have?
  • What is the predicted octanol-water partition coefficient (LogP)?
  • Which functional groups are present?
  • What is its 3D conformation or electron distribution like?

Their answers can take various forms, such as numerical values, categorical flags, vectors, graph-based structures, tensors, etc. Because every molecule in our dataset is described using the same set of questions (the same "physical exam sheet"), comparisons and model inputs become straightforward. And since each feature has a clear meaning, descriptors improve the interpretability of the model.
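As an illustrative sketch (not an exhaustive exam sheet), a few of these questions can be answered with RDKit's [19] built-in descriptor functions:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")  # caffeine

features = {
    "MolWt": Descriptors.MolWt(mol),                       # molecular weight
    "NumAromaticRings": Descriptors.NumAromaticRings(mol),
    "MolLogP": Descriptors.MolLogP(mol),                   # estimated LogP
    "NumHAcceptors": Descriptors.NumHAcceptors(mol),       # hydrogen-bond acceptors
    "TPSA": Descriptors.TPSA(mol),                         # topological polar surface area
}
print(features)
```

Every molecule in a dataset answers the same questions, so the resulting vectors can be stacked directly into a feature matrix.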

Of course, just as a physical exam sheet can't capture absolutely everything about a person's health, a finite set of molecular descriptors can never capture all aspects of a molecule's chemical and physical nature. Computing descriptors is usually a non-invertible process, inevitably resulting in a loss of information, and the results are not guaranteed to be unique. Therefore, there are many kinds of molecular descriptors, each focusing on different aspects.

Thousands of molecular descriptors have been developed over time (implemented in, for example, RDKit [19], CDK [20], Mordred [17], etc.). They can be broadly categorized by the dimensionality of information they encode (these categories aren't strict divisions):

  • 0D: formula‑based properties independent of structure (e.g., atom counts or molecular weight).
  • 1D: sequence-based properties (e.g., counts of certain functional groups).
  • 2D: derived from the 2D topology (e.g., eccentric connectivity index [21]).
  • 3D: derived from 3D conformation, capturing geometric or spatial properties (e.g., charged partial surface area [22]).
  • 4D and higher: these incorporate additional dimensions such as time, ensemble, or environmental factors (e.g., descriptors derived from molecular dynamics simulations, or from quantum chemical calculations like HOMO/LUMO).
  • Descriptors obtained from other sources, including experimental measurements.

Molecular fingerprints are a special kind of molecular descriptor that encodes substructures into a fixed-length numerical vector [16]. Some commonly used molecular fingerprints are summarized in [23]; a classic example is MACCS [24].

Similarly, human fingerprints and product barcodes can also be seen as (or converted to) fixed-format numerical representations.
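For instance, here is a minimal sketch of computing the 167-bit MACCS fingerprint [24] with RDKit [19]:

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")  # caffeine

fp = MACCSkeys.GenMACCSKeys(mol)   # fixed-length bit vector (167 bits)
bits = fp.ToBitString()
print(len(bits), bits.count("1"))  # total length and number of set bits
```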

Different descriptors describe molecules from different aspects, so their contributions to different tasks naturally vary. In one study predicting the aqueous solubility of drug-like molecules, over 4,000 computed descriptors were evaluated, but only about 800 made significant contributions to the prediction [7].

Figure 3: Some molecular descriptors of caffeine from PubChem, DrugBank, and RDKit. Image by author.

Point Cloud

Sometimes, we want our models to learn directly from a molecule's 3D structure. For instance, this is important when we're interested in how two molecules might interact with each other [25], want to search the possible conformations of a molecule [26], or wish to simulate its behavior in a certain environment [27].

One straightforward way to represent a 3D structure is as a point cloud of its atoms [28]. In other words, a point cloud is a collection of the coordinates of the atoms in 3D space. However, while this representation shows which atoms are near one another, it doesn't explicitly tell us which pairs of atoms are bonded. Inferring connectivity from interatomic distances (e.g., via cutoffs) can be error-prone, and may miss higher‑order chemistry like aromaticity or conjugation. Furthermore, our model must account for changes in raw coordinates due to rotation or translation. (More on this later.)
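As a small sketch, RDKit [19] can generate a 3D conformer and expose it as exactly such a point cloud (conformer embedding is stochastic, so the coordinates below are just one plausible geometry):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C"))  # caffeine with explicit H
AllChem.EmbedMolecule(mol, randomSeed=0)   # generate one 3D conformer

coords = mol.GetConformer().GetPositions()                # (num_atoms, 3) array of xyz positions
symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]   # element label of each point
print(coords.shape, symbols[:5])
```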

Graph

A molecule can also be represented as a graph, where atoms (nodes) are connected by bonds (edges). Graph representations elegantly handle rings, branches, and complex bonding arrangements. For instance, in a SMILES string, a benzene ring has to be "opened" and denoted by special symbols, whereas in a graph, it is simply a cycle of nodes connected in a loop.

Molecules are commonly modeled as undirected graphs (since bonds have no inherent direction) [29-31]. We can further "decorate" the graph with additional domain-specific knowledge to make the representation more interpretable: tagging nodes with atom features (e.g., element type, charge, aromaticity) and edges with bond properties (e.g., order, length, strength). Therefore,

  • (uniqueness) each distinct molecular structure can correspond to a unique graph, and
  • (reversibility) we can reconstruct the original molecule from its graph representation.
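A minimal sketch of turning a SMILES string into such a decorated graph with RDKit [19], using only atomic number and bond order as features (real models typically add many more):

```python
import numpy as np
from rdkit import Chem

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")  # caffeine

# Node features: one attribute per atom (here, just the atomic number).
atom_features = np.array([atom.GetAtomicNum() for atom in mol.GetAtoms()])

# Edge list with a simple bond attribute (bond order as a float).
edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx(), b.GetBondTypeAsDouble())
         for b in mol.GetBonds()]

# Symmetric adjacency matrix of the undirected molecular graph.
n = mol.GetNumAtoms()
adjacency = np.zeros((n, n))
for i, j, order in edges:
    adjacency[i, j] = adjacency[j, i] = order

print(atom_features.shape, adjacency.shape, len(edges))
```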
Figure 4: Ball-and-stick model and two representations of caffeine's 3D conformation (gray: carbon; blue: nitrogen; plum: hydrogen; red: oxygen). Image by author.

Chemical reactions essentially involve breaking bonds and forming new ones. Using graphs makes it easier to track these changes. Some reaction‑prediction models encode reactants and products as graphs and infer the transformation by comparing them [32,33].

Graph Neural Networks (GNNs) can directly process graphs and learn from them. Using the molecular graph representation, these models can naturally handle molecules of arbitrary size and topology. In fact, many GNNs have outperformed models that rely only on descriptors or linear strings across many molecular tasks [7,30,34].

Often, when a GNN makes a prediction, we can inspect which parts of the graph were most influential. These "important bits" often correspond to actual chemical substructures or functional groups. In contrast, if we were looking at a particular substring of a SMILES string, it is not guaranteed to map neatly to a meaningful substructure.

A graph doesn't always have to mean just the direct bonds connecting atoms. We can construct different kinds of graphs from molecular data depending on our needs, and sometimes these alternative graphs yield better results for particular applications. For example:

Complete graph: Every pair of nodes is connected by an edge. It may introduce redundant connections, but can be used to let a model consider all pairwise interactions.
Bipartite graph: Nodes are divided into two sets, and edges only connect nodes from one set to nodes from the other.
Nearest-neighbor graph: Each node is connected only to its nearest neighbors (according to some criterion), to keep complexity under control.
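As a toy sketch (with random points standing in for atom or residue positions), a nearest-neighbor graph can be built from coordinates alone:

```python
import numpy as np

def knn_edges(coords: np.ndarray, k: int = 3):
    """Connect each point to its k nearest neighbors by Euclidean distance."""
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)               # exclude self-connections
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k closest points
    return [(i, int(j)) for i in range(len(coords)) for j in neighbors[i]]

coords = np.random.default_rng(0).normal(size=(10, 3))  # stand-in coordinates
print(knn_edges(coords, k=3)[:5])
```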

Extensible Graph Representations

We can incorporate chemical rules or impose constraints within molecular graphs. In molecular design, (early) SMILES‑based generative models often ended up proposing invalid molecules, because: (1) assembling characters can break SMILES syntax, and (2) even a syntactically correct SMILES may encode an impossible structure. Graph‑based generative models avoid these pitfalls by constructing molecules atom by atom and bond by bond (under user-specified chemical rules). Graphs also allow us to impose constraints: requiring or forbidding specific substructures, enforcing 3D shapes or chirality, and so on; thus guiding generation toward valid candidates that meet our goals [35,36].

Molecular graphs can also handle multiple molecules and their interactions (e.g., drug-protein binding, protein-protein interfaces). "Graph-of-graphs" approaches treat each molecule as its own graph, then deploy a higher-level model to learn how they interact [37]. Alternatively, we can merge the molecules into one composite graph that includes all atoms from both partners, adding special (dummy) edges or nodes to mark their contacts [38].

So far, we've been considering the standard graph of bonds (the 2D connectivity), but what if the 3D arrangement matters? Graph representations can certainly be augmented with 3D information: 3D coordinates can be attached to each node, or distances and angles can be added as attributes on the edges, to make models more sensitive to differences in 3D configurations. A better option is to use models like SE(3)-equivariant GNNs, which ensure their outputs (or key internal features) transform consistently (or stay invariant) under any rotation or translation of the input.

In 3D space, the special Euclidean group SE(3) describes all possible rigid motions (any combination of rotations and translations). (It is sometimes described as a semidirect product of the rotation group SO(3) with the translation group R3.) [28]

When we say a model or a function has SE(3) invariance, we mean that it gives the same result no matter how we rotate or translate the input in 3D. This kind of invariance is often a vital requirement for many molecular modeling tasks: a molecule floating in solution has no fixed reference frame (i.e., it can tumble around in space). So, if we predict some property of the molecule (say, its binding affinity), that prediction shouldn't be influenced by the molecule's orientation or position.
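A small numerical sketch of this idea: apply a random rigid motion to a set of stand-in coordinates and check that pairwise distances (a simple invariant feature) do not change.

```python
import numpy as np
from scipy.spatial.transform import Rotation

coords = np.random.default_rng(1).normal(size=(12, 3))  # stand-in atom positions

# A random element of SE(3): a rotation followed by a translation.
R = Rotation.random().as_matrix()
t = np.array([5.0, -2.0, 3.0])
moved = coords @ R.T + t

def pairwise_distances(x):
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

# Distances are unchanged, so any prediction built only on them is SE(3)-invariant.
print(np.allclose(pairwise_distances(coords), pairwise_distances(moved)))  # True
```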

Sequence Representations of Biomacromolecules

We've talked mostly about small molecules. But biological macromolecules (like proteins, DNA, and RNA) can contain thousands or even millions of atoms. SMILES or InChI strings become extremely long and complicated, leading to massive computational, storage, and analysis costs.

This brings us back to the importance of defining the problem: for biomacromolecules, we're often not interested in the precise position of every atom or the exact bonds between each pair of atoms. Instead, we care about higher-level structural patterns and functional modules: a protein's amino acid backbone and its alpha‑helices or beta‑sheets, which fold into tertiary and quaternary structures. For DNA and RNA, we may care about nucleotide sequences and motifs.

We describe these biological polymers as sequences of their building blocks (i.e., the primary structure): proteins as chains of amino acids, and DNA/RNA as strings of nucleotides. There are well-established codes for these building blocks (defined by IUPAC/IUBMB): for instance, in DNA, the letters A, C, G, T represent the bases adenine, cytosine, guanine, and thymine, respectively.

Static Embeddings and Pretrained Embeddings

To convert a sequence into numerical vectors, we can use static embeddings: assigning a fixed vector to each residue (or k-mer fragment). The simplest static embedding is one-hot encoding (e.g., encoding adenine A as [1,0,0,0]), turning a sequence into a matrix. Another approach is to learn dense (pretrained) embeddings by leveraging large databases of sequences. For example, ProtVec [39] breaks proteins into overlapping 3‑mers and trains a Word2Vec‑like model (commonly used in NLP) on a large corpus of sequences, assigning each 3-mer a 100-dimensional vector. These learned fragment embeddings have been shown to capture biochemical and biophysical patterns: fragments with similar functions or properties cluster closer together in the embedding space.

k-mer fragments (or k-mers) are substrings of length k extracted from a biological sequence.
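A minimal sketch of both ideas for a short DNA sequence (one-hot vectors and overlapping k-mers):

```python
import numpy as np

seq = "ATGCGTA"

# One-hot encoding: each base becomes a fixed 4-dimensional vector (A, C, G, T).
alphabet = "ACGT"
one_hot = np.array([[1 if base == a else 0 for a in alphabet] for base in seq])
print(one_hot.shape)  # (7, 4)

# k-mer fragments: overlapping substrings of length k.
k = 3
kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
print(kmers)  # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA']
```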

Tokens

Inspired by NLP, we can treat a sequence as if it were a sentence composed of tokens or words (i.e., residues or k-mer fragments), and then feed them into deep language models. Trained on massive collections of sequences, these models learn biology's "grammar" and "semantics" just as they do for human language.

Transformers can use self‑attention to capture long‑range dependencies in sequences; we essentially use them to learn a "language of biology". Meta's ESM series of models [40-42] trained Transformers on hundreds of millions of protein sequences. Similarly, DNABERT [43] tokenizes DNA into k‑mers for BERT-style training on genomic data. The embeddings obtained this way have been shown to encapsulate a wealth of biological information. In many cases, these embeddings can be used directly for downstream tasks (i.e., transfer learning).
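As a hedged sketch of how such embeddings can be obtained in practice, the Hugging Face transformers library exposes small public ESM-2 checkpoints (the model name below, facebook/esm2_t6_8M_UR50D, is one such checkpoint; larger models follow the same pattern):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "facebook/esm2_t6_8M_UR50D"        # a small ESM-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # an arbitrary short protein sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-token embeddings (including special tokens); mean-pooling gives one vector per protein.
per_residue = outputs.last_hidden_state[0]
protein_embedding = per_residue.mean(dim=0)
print(per_residue.shape, protein_embedding.shape)
```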

Descriptors

In practice, sequence-based models often combine their embeddings with physicochemical properties, statistical features, and other descriptors, such as the proportion of each amino acid in a protein, the GC content of a DNA sequence, or indices like hydrophobicity, polarity, charge, and molecular volume.
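Many of these features are trivial to compute directly from the sequence; for example (a toy sketch with arbitrary example sequences):

```python
from collections import Counter

protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # arbitrary example protein
dna = "ATGCGTACGTTAGC"                          # arbitrary example DNA

# Amino-acid composition: fraction of each residue type in the protein.
counts = Counter(protein)
aa_composition = {aa: counts[aa] / len(protein) for aa in sorted(counts)}

# GC content: fraction of G and C bases in the DNA sequence.
gc_content = (dna.count("G") + dna.count("C")) / len(dna)

print(aa_composition)
print(round(gc_content, 3))
```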

Beyond the major categories above, there are some other, less conventional ways to represent sequences. Chaos Game Representation (CGR) [44], for example, maps DNA sequences to points in a 2D plane, creating distinctive image patterns for downstream analysis.

Structural Representations of Biomacromolecules

The complex structure of a protein determines its functions and specificities [28]. Simply knowing the linear sequence of residues is often not enough to fully understand a biomolecule's function or mechanism (i.e., the sequence-structure gap).

Structures tend to be more conserved than sequences [28, 45]. Two proteins might have very divergent sequences but still fold into highly similar 3D structures [46]. Solving the structure of a biomolecule can therefore provide insights that we wouldn't get from the sequence alone.

Granularity and Dimensionality Control

A single biomolecule may contain on the order of 10³-10⁵ atoms (or even more). Encoding every atom and bond explicitly in numerical form produces prohibitively high-dimensional, sparse representations.

Adding dimensions to the representation can quickly run into the curse of dimensionality. As we increase the dimensionality of our data, the "space" we ask our model to cover grows exponentially. Data points become sparser relative to that space (it's like having a few needles in an ever-expanding haystack). This sparsity means a model may need vastly more training examples to find reliable patterns. Meanwhile, the computational cost of processing the data often grows polynomially or worse with dimensionality.

Not every atom is equally important for the question we care about: we often turn to controlling the granularity of our representation, or reducing dimensionality in smart ways (such data often has a lower-dimensional effective representation that can describe the system without significant performance loss [47]):

  • For proteins, each amino acid can be represented by the coordinates of just its alpha carbon (Cα); a minimal Biopython sketch follows this list. For nucleic acids, one might represent each nucleotide by the position of its phosphate group, or by the center of its base or sugar ring.
  • Another example of controlled granularity comes from how AlphaFold [49] represents a protein using backbone rigid groups (or frames). Essentially, for each amino acid, a small set of main-chain atoms, typically N, Cα, and C (and perhaps O), is treated as a unit. The relative geometry of these atoms is nearly fixed (covalent bond lengths and angles don't vary significantly), so the unit can be regarded as a rigid block. Instead of tracking each atom individually, the model tracks the position and orientation of the entire block in space, reducing the risks associated with excessive degrees of freedom [28] (i.e., errors from the internal movement of atoms within a residue).
Figure 5: Heavy atoms in a protein backbone with dihedral angles. Image derived from [28].
  • If we have a large set of protein structures (or a long molecular dynamics trajectory), it can be useful to cluster those conformations into a few representative states. This is often done when constructing Markov state models: by clustering continuous states into a finite set of discrete "metastable" states, we can simplify a complex energy landscape into a network of a few states connected by transition probabilities.

Many coarse-grained molecular dynamics force fields, such as MARTINI [50] and UNRES [51], have been developed to represent structural details using fewer particles.

  • To capture side-chain effects without modeling all internal atoms or adding excessive degrees of freedom, a common approach is to represent each side chain with a single point, typically its center of mass [52]. Such side-chain centroid models are often used alongside backbone models.
  • The 3Di alphabet introduced by Foldseek [53] defines a 3D interaction "alphabet" of 20 states that describe protein tertiary interactions. Thus, a protein's 3D structure can be converted into a sequence of 20 symbols, and two structures can be aligned by aligning their 3Di sequences.
  • We can also spatially crop or focus on just part of a biomolecule. For instance, if we're studying how a small drug molecule binds to a protein (say, in a dataset like PDBBind [54], which is full of protein-ligand complexes), we may feed only the binding pockets and ligands into our model.
  • Combining different granularities or modalities of information.
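A minimal sketch of the first point above (one Cα coordinate per residue) using Biopython's Bio.PDB module; "protein.pdb" is a placeholder path for any structure file:

```python
import numpy as np
from Bio.PDB import PDBParser

structure = PDBParser(QUIET=True).get_structure("example", "protein.pdb")  # placeholder file

# Coarse-grain to one point per residue: the alpha-carbon position.
ca_coords = np.array([
    residue["CA"].get_coord()
    for residue in structure.get_residues()
    if "CA" in residue          # skip waters, ligands, and residues without a Cα
])
print(ca_coords.shape)  # (num_residues, 3)
```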

Point Cloud

We could model a biomacromolecule as a large 3D point cloud of its atoms (or residues). As noted earlier, the same limitations apply.

Distance Matrix

A distance matrix records all pairwise distances between certain key atoms (for proteins, commonly the Cα of each amino acid), and is inherently invariant to rotation and translation, since distances are unchanged by rigid motions. A contact map simplifies this further by indicating only which pairs of residues are "close enough" to be in contact. However, both representations lose directional information, so not all structural details can be recovered from them alone.
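A small numpy sketch (random coordinates stand in for Cα positions; in practice they would come from a structure, e.g., the Biopython example above):

```python
import numpy as np

ca_coords = np.random.default_rng(2).normal(scale=10.0, size=(50, 3))  # stand-in Cα positions

# Pairwise distance matrix: symmetric and unchanged by rotation/translation.
dist = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)

# Contact map: 1 where two residues are within a chosen cutoff (8 Å is a common choice).
contact_map = (dist < 8.0).astype(int)
print(dist.shape, int(contact_map.sum()))
```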

Graph

Just as we can use graphs for small molecules, we can use graphs for macromolecular structures [55,56]. Instead of atoms, each node might represent a larger unit (see Granularity and Dimensionality Control). To improve interpretability, additional knowledge, such as residue descriptors and known interaction networks within a protein, can also be incorporated into nodes and edges. Note that the graph representation for biomacromolecules inherits many of the advantages we discussed for small molecules.

For macromolecules, edges are often pruned to keep the graph sparse and manageable in size: essentially a form of local magnification that focuses on local substructures, while far-apart relationships are treated as background context.

General dimensionality reduction methods such as PCA, t-SNE, and UMAP are also widely used to analyze the high-dimensional structural data of macromolecules. While they don't give us representations for computation in the same sense as the others we've discussed, they help project complex data into lower dimensions (e.g., for visualization or insight).

Latent Space

When we train a model (especially a generative model), it often learns to encode data into a compressed internal representation. This internal representation lives in a space of lower dimension, known as the latent space. Think of the original data as London's urban layout, dense and intricate; the latent space is like a "map" that captures its essence in simplified form.

Latent spaces are usually not directly interpretable, but we can explore them by seeing how changes in latent variables map to changes in the output. In molecular generation, if a model maps molecules into a latent space, we can take two molecules (say, as two points in that space) and generate a path between them. Ochiai et al. [57] did this by taking two known molecules as endpoints, interpolating between their latent representations, and decoding the intermediate points. The result was a set of new molecules that blended features of both originals: hybrids that might have mixed properties of the two.
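A toy sketch of this interpolation step (the latent vectors below are random placeholders; in a real model they would come from the encoder, and each intermediate point would be passed through the decoder to produce a candidate molecule):

```python
import numpy as np

rng = np.random.default_rng(3)
z_a = rng.normal(size=32)   # placeholder latent vector of molecule A
z_b = rng.normal(size=32)   # placeholder latent vector of molecule B

# Linear interpolation between the two endpoints in latent space.
path = [(1 - t) * z_a + t * z_b for t in np.linspace(0.0, 1.0, num=7)]
print(len(path), path[0].shape)  # 7 latent vectors of dimension 32 along the path
```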


—— About the Author ——

Tianyuan Zheng
[email protected] | [email protected]
Computational Biology, Bioinformatics, Artificial Intelligence

Department of Computer Science and Technology
Department of Applied Mathematics and Theoretical Physics
University of Cambridge


References

  1. Patterson DA, Hennessy JL. Computer organization and design ARM edition: the hardware software interface. Morgan Kaufmann; 2016 May 6.
  2. Harris S, Harris D. Digital Design and Computer Architecture, RISC-V Edition. Morgan Kaufmann; 2021 Jul 12.
  3. Kosslyn SM, Koenig O. Wet mind: The new cognitive neuroscience. Simon and Schuster; 1992.
  4. Piaget J, Cook M. The origins of intelligence in children. New York: International Universities Press; 1952.
  5. Bergen D. The role of pretend play in children’s cognitive development. Early Childhood Research & Practice. 2002;4(1):n1.
  6. Bengio Y, Courville A, Vincent P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence. 2013 Mar 7;35(8):1798-828.
  7. Zheng T, Mitchell JB, Dobson S. Revisiting the application of machine learning approaches in predicting aqueous solubility. ACS omega. 2024 Jul 31;9(32):35209-22.
  8. Mayer RE. Multimedia learning. In Psychology of learning and motivation 2002 Jan 1 (Vol. 41, pp. 85-139). Academic Press.
  9. Miller GA. The magical number seven, plus or minus two: Some limits on our capability for processing information. Psychological review. 1956 Mar;63(2):81.
  10. Chase WG, Simon HA. Perception in chess. Cognitive psychology. 1973 Jan 1;4(1):55-81.
  11. Simon HA. How Big Is a Chunk? By combining data from several experiments, a basic human memory unit can be identified and measured. Science. 1974 Feb 8;183(4124):482-8.
  12. Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, Li Q, Shoemaker BA, Thiessen PA, Yu B, Zaslavsky L. PubChem 2025 update. Nucleic acids research. 2025 Jan 6;53(D1):D1516-25.
  13. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The protein data bank. Nucleic acids research. 2000 Jan 1;28(1):235-42.
  14. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of chemical information and computer sciences. 1988 Feb 1;28(1):31-6.
  15. Heller S, McNaught A, Stein S, Tchekhovskoi D, Pletnev I. InChI-the worldwide chemical structure identifier standard. Journal of cheminformatics. 2013 Jan 24;5(1):7.
  16. Todeschini R, Consonni V. Molecular descriptors for chemoinformatics: volume I: alphabetical listing/volume II: appendices, references. John Wiley & Sons; 2009 Oct 30.
  17. Moriwaki H, Tian YS, Kawashita N, Takagi T. Mordred: a molecular descriptor calculator. Journal of cheminformatics. 2018 Feb 6;10(1):4.
  18. Jaganathan K, Tayara H, Chong KT. An explainable supervised machine learning model for predicting respiratory toxicity of chemicals using optimal molecular descriptors. Pharmaceutics. 2022 Apr 11;14(4):832.
  19. RDKit: Open-source cheminformatics. https://www.rdkit.org
  20. Willighagen EL, Mayfield JW, Alvarsson J, Berg A, Carlsson L, Jeliazkova N, Kuhn S, Pluskal T, Rojas-Chertó M, Spjuth O, Torrance G. The Chemistry Development Kit (CDK) v2. 0: atom typing, depiction, molecular formulas, and substructure searching. Journal of cheminformatics. 2017 Jun 6;9(1):33.
  21. Sharma V, Goswami R, Madan AK. Eccentric connectivity index: A novel highly discriminating topological descriptor for structure− property and structure− activity studies. Journal of chemical information and computer sciences. 1997 Mar 24;37(2):273-82.
  22. Stanton DT, Jurs PC. Development and use of charged partial surface area structural descriptors in computer-assisted quantitative structure-property relationship studies. Analytical Chemistry. 1990 Nov 1;62(21):2323-9.
  23. Boldini D, Ballabio D, Consonni V, Todeschini R, Grisoni F, Sieber SA. Effectiveness of molecular fingerprints for exploring the chemical space of natural products. Journal of Cheminformatics. 2024 Mar 25;16(1):35.
  24. Durant JL, Leland BA, Henry DR, Nourse JG. Reoptimization of MDL keys to be used in drug discovery. Journal of chemical information and computer sciences. 2002 Nov 25;42(6):1273-80.
  25. Kitchen DB, Decornez H, Furr JR, Bajorath J. Docking and scoring in virtual screening for drug discovery: methods and applications. Nature reviews Drug discovery. 2004 Nov 1;3(11):935-49.
  26. Friedrich NO, Meyder A, de Bruyn Kops C, Sommer K, Flachsenberg F, Rarey M, Kirchmair J. High-quality dataset of protein-bound ligand conformations and its application to benchmarking conformer ensemble generators. Journal of chemical information and modeling. 2017 Mar 27;57(3):529-39.
  27. Karplus M, McCammon JA. Molecular dynamics simulations of biomolecules. Nature structural biology. 2002 Sep 1;9(9):646-52.
  28. Zheng T, Rondina A, Micklem G, Lio P. Challenges and Guidelines in Deep Generative Protein Design: 4 Case Studies.
  29. Duvenaud DK, Maclaurin D, Iparraguirre J, Bombarell R, Hirzel T, Aspuru-Guzik A, Adams RP. Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems. 2015;28.
  30. Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE. Neural message passing for quantum chemistry. In: International conference on machine learning 2017 Jul 17 (pp. 1263-1272). PMLR.
  31. Wu Z, Pan S, Chen F, Long G, Zhang C, Yu PS. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems. 2020 Mar 24;32(1):4-24.
  32. Jin W, Coley C, Barzilay R, Jaakkola T. Predicting organic reaction outcomes with Weisfeiler-Lehman network. Advances in neural information processing systems. 2017;30.
  33. Shi C, Xu M, Guo H, Zhang M, Tang J. A graph to graphs framework for retrosynthesis prediction. In: International conference on machine learning 2020 Nov 21 (pp. 8818-8827). PMLR.
  34. Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V. MoleculeNet: a benchmark for molecular machine learning. Chemical science. 2018;9(2):513-30.
  35. Lim J, Ryu S, Kim JW, Kim WY. Molecular generative model based on conditional variational autoencoder for de novo molecular design. Journal of cheminformatics. 2018 Jul 11;10(1):31.
  36. Maziarka Ł, Pocha A, Kaczmarczyk J, Rataj K, Danel T, Warchoł M. Mol-CycleGAN: a generative model for molecular optimization. Journal of Cheminformatics. 2020 Jan 8;12(1):2.
  37. Wang H, Lian D, Zhang Y, Qin L, Lin X. Gognn: Graph of graphs neural network for predicting structured entity interactions. arXiv preprint arXiv:2005.05537. 2020 May 12.
  38. Jiang D, Hsieh CY, Wu Z, Kang Y, Wang J, Wang E, Liao B, Shen C, Xu L, Wu J, Cao D. InteractionGraphNet: a novel and efficient deep graph representation learning framework for accurate protein–ligand interaction predictions. Journal of medicinal chemistry. 2021 Dec 8;64(24):18209-32.
  39. Asgari E, Mofrad MR. Continuous distributed representation of biological sequences for deep proteomics and genomics. PloS one. 2015 Nov 10;10(11):e0141287.
  40. Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, Guo D, Ott M, Zitnick CL, Ma J, Fergus R. Biological structure and performance emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences. 2021 Apr 13;118(15):e2016239118.
  41. Rao R, Meier J, Sercu T, Ovchinnikov S, Rives A. Transformer protein language models are unsupervised structure learners. Biorxiv. 2020 Dec 15:2020-12.
  42. Rao RM, Liu J, Verkuil R, Meier J, Canny J, Abbeel P, Sercu T, Rives A. MSA transformer. In: International conference on machine learning 2021 Jul 1 (pp. 8844-8856). PMLR.
  43. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021 Aug 1;37(15):2112-20.
  44. Jeffrey HJ. Chaos game representation of gene structure. Nucleic acids research. 1990 Apr 25;18(8):2163-70.
  45. Illergård K, Ardell DH, Elofsson A. Structure is three to 10 times more conserved than sequence—a study of structural response in protein cores. Proteins: Structure, Function, and Bioinformatics. 2009 Nov 15;77(3):499-508.
  46. Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. The EMBO journal. 1986 Apr 1;5(4):823-6.
  47. Roel-Touris J, Don CG, V. Honorato R, Rodrigues JP, Bonvin AM. Less is more: coarse-grained integrative modeling of large biomolecular assemblies with HADDOCK. Journal of chemical theory and computation. 2019 Sep 20;15(11):6358-67.
  48. Duong VT, Diessner EM, Grazioli G, Martin RW, Butts CT. Neural Upscaling from Residue-Level Protein Structure Networks to Atomistic Structures. Biomolecules. 2021 Nov 30;11(12):1788.
  49. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A. Highly accurate protein structure prediction with AlphaFold. nature. 2021 Aug 26;596(7873):583-9.
  50. Marrink SJ, Risselada HJ, Yefimov S, Tieleman DP, De Vries AH. The MARTINI force field: coarse grained model for biomolecular simulations. The journal of physical chemistry B. 2007 Jul 12;111(27):7812-24.
  51. Liwo A, Baranowski M, Czaplewski C, Gołaś E, He Y, Jagieła D, Krupa P, Maciejczyk M, Makowski M, Mozolewska MA, Niadzvedtski A. A unified coarse-grained model of biological macromolecules based on mean-field multipole–multipole interactions. Journal of molecular modeling. 2014 Aug;20(8):2306.
  52. Cao F, von Bülow S, Tesei G, Lindorff‐Larsen K. A coarse‐grained model for disordered and multi‐domain proteins. Protein Science. 2024 Nov;33(11):e5172.
  53. Van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CL, Söding J, Steinegger M. Fast and accurate protein structure search with Foldseek. Nature biotechnology. 2024 Feb;42(2):243-6.
  54. Wang R, Fang X, Lu Y, Yang CY, Wang S. The PDBbind database: methodologies and updates. Journal of medicinal chemistry. 2005 Jun 16;48(12):4111-9.
  55. Ingraham J, Garg V, Barzilay R, Jaakkola T. Generative models for graph-based protein design. Advances in neural information processing systems. 2019;32.
  56. Jing B, Eismann S, Suriana P, Townshend RJ, Dror R. Learning from protein structure with geometric vector perceptrons. arXiv preprint arXiv:2009.01411. 2020 Sep 3.
  57. Ochiai T, Inukai T, Akiyama M, Furui K, Ohue M, Matsumori N, Inuki S, Uesugi M, Sunazuka T, Kikuchi K, Kakeya H. Variational autoencoder-based chemical latent space for giant molecular structures with 3D complexity. Communications Chemistry. 2023 Nov 16;6(1):249.