TL;DR: Machine learning (ML) is usually associated with the development of predictive/forecasting applications for an ever-increasing number of purposes. However, ML is also involved in data mining, i.e., the identification of hidden patterns in the data itself. In this sense, the RCSB PDB database constitutes a valuable resource of biological data, containing thousands of protein 3D structures obtained by different experimental techniques. Among these, X-ray diffraction is the method that gives the most accurate description of the position of each atom in the structure of proteins. Analyzing the dataset of X-ray-determined protein structures (over 160,000 structures), a pattern can be found if we take into consideration both the position of amino acids and their chemical type. The biological role of this pattern is still not completely understood, but it seems well conserved across the structures of very different organisms, from animals and plants to fungi and bacteria.
Starting from the beginning
The journey began some years ago. At the time, I was doing research on bacterial adhesins, proteins on the surface of bacteria that are responsible for attaching the bacteria to surfaces such as our teeth, or to other bacteria to form biofilms (bacterial communities). I remember that back then I was still impressed by one of the subjects I had studied years earlier at college: elegant and beautiful organic chemistry. So, when I came across a classification of amino acids (the building blocks of proteins) according to their chemical type, it was a natural thing to try to relate, somehow, the properties of adhesins (their stickiness to surfaces) to the chemical type of the amino acids in their structure.
Following advice from a fellow PhD student, I decided to try to come up with a method to quantify the organization of amino acids according to their chemical type in the 3D structure of adhesins. Having a background in biology/biotechnology myself, I would not have quantified it directly but rather assessed it qualitatively. But this other PhD student was a physicist, and over lunch he convinced me that a more serious approach would require quantification.
Now, my intuition was that if amino acids of the same type sit together in the 3D structure (forming clusters), that would facilitate specific chemical interactions with surfaces. Accordingly, the goal was to generate a numeric descriptor of amino acid clustering according to chemical type (hydrophobic, polar, acidic, basic, special). Although this was a side project during my doctorate program, my thesis directors helped out and suggested measuring distances between amino acids to generate that descriptor. And so I did, and now I had a numerical parameter that described the degree to which amino acids of the same type were grouped together in the 3D structure of a protein.
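To make the idea of a distance-based clustering descriptor concrete, here is a minimal sketch. It is not the definition of the descriptor used in this study, which is not spelled out here; it simply counts pairs of same-type residues that lie close together. The residue-to-family table is a common textbook grouping and is an assumption, as are the function name `same_type_contacts` and the 8 Å cutoff.

```python
import math

# Assumed five-family classification (a common textbook grouping; the
# article does not list which residues belong to each family).
CHEMICAL_TYPE = {
    "ALA": "hydrophobic", "VAL": "hydrophobic", "ILE": "hydrophobic",
    "LEU": "hydrophobic", "MET": "hydrophobic", "PHE": "hydrophobic",
    "TYR": "hydrophobic", "TRP": "hydrophobic",
    "SER": "polar", "THR": "polar", "ASN": "polar", "GLN": "polar",
    "ASP": "acidic", "GLU": "acidic",
    "ARG": "basic", "HIS": "basic", "LYS": "basic",
    "CYS": "special", "GLY": "special", "PRO": "special",
}


def same_type_contacts(residues, cutoff=8.0):
    """Count pairs of same-type residues closer than `cutoff` (in Å).

    `residues` is a list of (three_letter_name, (x, y, z)) tuples.
    Illustrative stand-in only, NOT the published definition of Q.
    """
    count = 0
    for i in range(len(residues)):
        name_i, xyz_i = residues[i]
        type_i = CHEMICAL_TYPE.get(name_i)
        if type_i is None:  # skip residues outside the table
            continue
        for j in range(i + 1, len(residues)):
            name_j, xyz_j = residues[j]
            if CHEMICAL_TYPE.get(name_j) != type_i:
                continue
            if math.dist(xyz_i, xyz_j) < cutoff:
                count += 1
    return count
```

A count like this grows with protein size, which is the same kind of size dependence discussed further on for the real descriptor.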
The hypothesis was simple: the parameter/descriptor should yield distinctive values for adhesins (in comparison to other proteins), and that would explain why adhesins actually adhere to surfaces. For that purpose, I had available an entire database of thousands of entries of 3D structural data of proteins, the RCSB PDB database. Also, powerful libraries like Biopython greatly facilitated the extraction and treatment of data, so the development phase, even though there was no generative AI back then, was relatively fast.
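As an illustration of the extraction step, the sketch below reads one representative atom per residue (the alpha carbon, CA) from the text of a PDB file, using the fixed column layout of ATOM records. It is a dependency-free approximation written for this article's readers; in practice a library such as Biopython handles the many edge cases of the format far more robustly. The function name `read_ca_atoms` is hypothetical.

```python
def read_ca_atoms(pdb_text):
    """Return [(residue_name, (x, y, z)), ...] for alpha carbons.

    Only ATOM records are read, so ligands and ions stored as HETATM
    (including calcium, whose atom name is also "CA") are ignored.
    """
    residues = []
    seen = set()
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":  # columns 13-16: atom name
            continue
        # Chain ID (column 22) plus residue number and insertion code
        # (columns 23-27) identify a residue; keep its first CA only,
        # which skips alternate locations and repeated models.
        key = (line[21], line[22:27])
        if key in seen:
            continue
        seen.add(key)
        name = line[17:20].strip()  # columns 18-20: residue name
        xyz = (
            float(line[30:38]),  # columns 31-38: x in Å
            float(line[38:46]),  # columns 39-46: y in Å
            float(line[46:54]),  # columns 47-54: z in Å
        )
        residues.append((name, xyz))
    return residues
```

Using a single atom per residue is a simplification chosen for brevity; whether the original analysis used alpha carbons, centroids, or all atoms is not stated here.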
What I found was not what I was looking for. There was no distinctive descriptor value for adhesins. Instead, the descriptor seemed to depend only on the number of amino acids, following a relatively straight curve. That was boring; I knew that the descriptor would tend to increase with protein size (number of amino acids), but the hope was to find some irregularities, for example that for adhesins the value of the descriptor was notably higher than in other families of proteins. So, that was a “failure”: nothing interesting had come up. I spent about three days reviewing the process, trying to find some bug in the code or in the methodology. After all, I only got that simple straight curve when plotting the descriptor vs protein size (number of amino acids).
OK, but why that relatively straight curve? Actually, when fitting a simple model, the R² was very high (0.979). This may mean something. And besides, why that particular slope of the curve, and not another? This may mean something, not about adhesins, but maybe about proteins in general. The fact is, the curve tells us that any two proteins, as long as they have the same number of amino acids, will have a very similar value of the descriptor. This may not seem so relevant at first, but let’s think about it. Any two proteins? That means, for example, one globular, almost spherical protein and another very elongated one. Or, for example, a protein in mammals and another one in plants. Any pair, or actually any set of proteins with an equal number of amino acids, would have nearly the same value of the descriptor. There was something there.
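For readers who want to see what such a fit involves, here is a minimal sketch of an ordinary least-squares straight-line fit together with the coefficient of determination R². That the "simple model" is a straight line is an assumption based on the description of the curve, and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared). Assumes at least two
    distinct x values and that the y values are not all identical.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared
```

Here `xs` would hold the number of amino acids of each protein and `ys` the corresponding descriptor values. An R² close to 1 means that protein size alone accounts for almost all of the variation in the descriptor.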
The Mosaic Q model
Months and years went by. I only had one equation, and that was not enough evidence. When completing my PhD, I decided to look for help to better understand this apparently conserved property. After talking to different researchers I came across José Antonio, a professor of mathematics at the University of Seville. We were talking about characterizing the shape and size of clusters of amino acids of the same type in the 3D structure of proteins, and somehow relating that to the empirical equation. However, the task was not easy, because it is not straightforward to determine where a cluster starts and ends in the 3D structure of proteins. For example, one group of amino acids in the 3D structure could be interpreted as one large cluster with an irregular form, or rather as two or more smaller, roughly spherical clusters that just happened to be close in the structure. Then he came up with an idea. Instead of trying to derive a model of cluster size and shape from the experimental data, why not generate different models of clusters and see which one best fitted the results?
This involved the generation and execution of multiple stochastic simulations, where different combinations of size and shape of the clusters were tested. For simplicity, we supposed that along with the hydrophobic core (where all the hydrophobic amino acids group together), the rest of the amino acids (polar, acidic, basic, special) always grouped together according to their type in clusters of the same size and shape, although these clusters were randomly placed across the protein structure. Other variables, such as total protein size and shape, were also determined stochastically to ensure enough diversity of structures in the samples. At the same time, different quality controls were implemented to generate realistic enough structures, ensuring for example normal inter-amino-acid distances and amino acid frequencies. For each combination of cluster size/shape analyzed, 10,000 artificial structures were generated and the descriptor (now named Q) was computed.
What we found was encouraging. Each combination of cluster size/shape produced a distinct curve when plotting Q vs the number of amino acids (n). One combination (Shape I / 8 amino acids per cluster) seemed to fit the experimental data really well. This does not mean that the model describes exactly how amino acids cluster in natural proteins, but it is at least a good approximation. Moreover, we visualized multiple proteins, coloring amino acids according to their type, and the Shape I / 8 amino acids combination seemed quite reasonable. But more importantly, the big takeaway from the simulations is that they provide a framework to understand that the real, experimental curve of Q vs n found in nature is not trivial, but rather depends on the size and shape of the clusters. In other words, there is a specific way in which amino acids tend to cluster in the 3D structure of proteins, which can be quantified using Q. We call this the Mosaic Q model. Different mosaic models could exist in theory, but the results show that only one actually occurs.
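The general flavor of such a simulation can be conveyed with a toy sketch. It is a heavily simplified illustration and not the simulation code used in this study: residues sit on a cubic lattice, the innermost fraction forms the hydrophobic core, and the rest are grouped into same-type clusters of a fixed size around randomly chosen seeds. Cluster shape is not modeled at all. The function name `simulate_mosaic`, the lattice, the 3.8 Å spacing and the 40% core fraction are all assumptions made for illustration.

```python
import math
import random

NON_CORE_TYPES = ["polar", "acidic", "basic", "special"]


def simulate_mosaic(n, cluster_size=8, core_fraction=0.4, spacing=3.8, seed=0):
    """Return a list of (chemical_type, (x, y, z)) for a toy structure."""
    rng = random.Random(seed)

    # Lattice points sorted by distance from the origin; keeping the n
    # closest ones gives a roughly spherical blob of residues.
    half = math.ceil(n ** (1 / 3))
    points = [
        (i * spacing, j * spacing, k * spacing)
        for i in range(-half, half + 1)
        for j in range(-half, half + 1)
        for k in range(-half, half + 1)
    ]
    points.sort(key=lambda p: math.dist(p, (0.0, 0.0, 0.0)))
    points = points[:n]

    # Innermost residues form the hydrophobic core.
    n_core = int(core_fraction * n)
    types = ["hydrophobic"] * n_core + [None] * (n - n_core)

    # Grow clusters: pick a random unassigned seed residue and give it and
    # its nearest unassigned neighbours the same randomly chosen type.
    unassigned = set(range(n_core, n))
    while unassigned:
        candidates = sorted(unassigned)
        seed_idx = rng.choice(candidates)
        nearest = sorted(
            candidates, key=lambda i: math.dist(points[i], points[seed_idx])
        )
        label = rng.choice(NON_CORE_TYPES)
        for i in nearest[:cluster_size]:
            types[i] = label
            unassigned.discard(i)

    return list(zip(types, points))
```

Repeating a generator like this for many values of `n` and several values of `cluster_size`, and computing a clustering descriptor on each artificial structure, yields one curve per configuration that can then be compared with the experimental one.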
In the next images we see an overview of the analysis. Here is how the mosaic, its quantification via Q, and the conserved curve (R² = 0.979) found across >160,000 experimental structures are related.
Overview of the empirical equation between the descriptor Q and the number of residues (n). Hydrophobic amino acids are depicted in white, polar in green, acidic in orange, basic in blue and special in cyan. The results show that the value of Q depends only on the number of residues, not on protein shape, organism of origin, etc. Over 160,000 X-ray-determined structures available in the RCSB database were analyzed. Entries where Q/n > 1.032 Å (which correspond to < 0.07% of the total) were discarded as outliers. RCSB PDB data are in the public domain (CC0) and free for any use. Image by authors.
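The outlier rule mentioned in the caption is simple enough to express in a few lines. The sketch below assumes each entry is stored as an (n, Q) pair; the function name `drop_outliers` is hypothetical, while the 1.032 Å threshold is the one quoted above.

```python
Q_PER_RESIDUE_LIMIT = 1.032  # Å, threshold quoted in the figure caption


def drop_outliers(entries, limit=Q_PER_RESIDUE_LIMIT):
    """Keep the (n, q) pairs whose Q/n does not exceed `limit`."""
    return [(n, q) for n, q in entries if q / n <= limit]
```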
It must be noted that the original Q descriptor only quantifies clusters of hydrophobic, polar, acidic and basic residues (not special residues). An alternative definition of the descriptor (named Q_alt) also includes the clusters of special residues in its computation.
On the other hand, here is a plot of the curves obtained by the stochastic simulations, showing how one configuration best fits the experimental results, but more importantly, how their slope varies according to the configuration (size and shape) of the clusters in each one.
Stochastic simulations help to understand the experimental results. A series of stochastic simulations demonstrates that the shape of the experimental curve is non-trivial and that it depends on the size and morphology of the amino acid clusters, since the slope obtained is different in each case. They also provide a good candidate model for approximating the mosaic in real structures. Image by authors.
Sharing the analysis with the community
Some analyses in biology, and in the experimental sciences in general, gain depth when multiple collaborators work on them. For this reason, we started building tools and visual material so that anyone could check the presence of the Mosaic Q pattern for themselves. In this effort, a third member of the team (the biologist María Ángeles) was fundamental.
We started by sharing direct visualizations of multiple protein structures so that the mosaic can be appreciated. When proteins are rendered with each amino acid colored by its chemical type, the pattern becomes visible to the eye. For example, here are some instances of famous proteins, namely hemoglobin, the molecular machine CRISPR-Cas9, the tumor suppressor p53, and the small enzyme lysozyme. The mosaic pattern is apparently present in most proteins, including these famous cases:
Direct visualization of the Mosaic Q in some well-known proteins. The presence of a distinctive, non-random disposition of amino acids according to their chemical type (the Mosaic Q model) can also be assessed directly in a visual manner. Here, some examples of famous proteins are depicted showing their residues colored by their chemical type. In this manner, the presence of the typical clusters of the mosaic can be appreciated. Image by authors.
Besides our own rendered images, the analysis has taken a collaborative approach. In particular, anyone interested can render proteins of their own choosing and add them as further evidence of the Mosaic Q pattern, building up a shared repository of images. The objective is to check whether the mosaic pattern is present, that is, whether, in addition to the hydrophobic core (white), the remaining amino acids group into clusters of similar size and shape across the 3D structure. As of June 2026, more than 50 volunteers have taken part as independent observers, visualizing structures of their own choice from the RCSB database with the open-source software Jmol. This collaboration adds a layer of independence to the study, since each observer chooses which protein to render from the many thousands of structures currently present in the RCSB database.
For those who would rather compute than visualize it, a series of tools is available for the computation of Q and Q_alt (PyPI, bio.tools), via the protein-mosaic-q package. Basic usage involves these lines:
from mosaicq import calculate_q, calculate_q_alt
q = calculate_q("protein.pdb")
q_alt = calculate_q_alt("protein.cif")
As stated before, Q covers the four main chemical families (hydrophobic, polar, acidic, basic). Q_alt extends the calculation to include the special amino acids. The library is built on top of Biopython and requires Python ≥ 3.9. A Colab notebook is also available for running the computation with no local setup at all.
The study has also been accepted and merged into the bioinformatics platform Galaxy Europe, within its Proteomics section. There, users can upload any PDB or mmCIF file, run the computation in the browser and download the Q, Q_alt and image results. The image repository is also registered with FAIRsharing, giving it a persistent, citable record (see references).
A few caveats are worth stating. The analysis covers X-ray structures only, which generally offer the highest resolution compared with other methods. Still, further studies also need to address structures obtained via other experimental techniques such as cryo-EM or NMR, or even 3D structures predicted computationally. In addition, even though the mathematical signal of the mosaic pattern is clear, especially by the standards of biology, there is still some limited variation of the Q descriptor among proteins. Whether that small variation is simply statistical noise or has actually been selected in evolution remains an open question.
Ultimately, the tools developed are open for anyone to use and extend. Plenty of questions remain, especially concerning the biological role of the pattern and how it affects protein-protein interactions and interactions with other biomolecules. This is of interest because research in this field could eventually help in the design of protein-based drugs or drugs that target proteins. If you find the pattern interesting, or think you can come up with exceptions to it, the tools above are a good place to start.
References
[1] FAIRsharing.org, Proteins Mosaic Q Project Repository (2026), DOI: 10.25504/FAIRsharing.9f9f9c
[2] H. M. Berman et al., The Protein Data Bank (2000), Nucleic Acids Research 28(1):235–242
[3] P. J. A. Cock et al., Biopython: freely available Python tools for computational molecular biology and bioinformatics (2009), Bioinformatics 25(11):1422–1423
[4] Galaxy Europe, Protein Mosaic Q tool (2026), usegalaxy.eu
[5] Jmol Development Team, Jmol: an open-source Java viewer for chemical structures in 3D (2024), jmol.org
The library and source code are released under the MIT license.
