GLink Entanglement Tutorial
glink-entanglement is a Python workflow for finding candidate protein entanglements from an all-atom PDB structure. It combines three ideas:
- heavy-atom residue contacts define possible loop-closing contacts;
- C-alpha traces are used to calculate partial Gaussian linking values;
- Topoly confirms whether a candidate loop has a crossing residue.
The package then provides a second command, glink-cluster, to merge many similar raw contacts into a smaller set of representative entanglements.
This tutorial walks through installation, a complete command-line workflow, how to interpret the output files, how to tune the main parameters, and how to call the code from Python.
When to use this workflow
Use glink-entanglement when you have an all-atom protein structure and want a residue-level screen for loop-threading entanglements. Typical inputs include:
- an experimental PDB structure;
- an AlphaFold model saved as PDB;
- a representative structure extracted from a molecular dynamics trajectory.
The method is designed for protein chains. It uses protein and not name H* to select heavy atoms, then calculates topology on each chain separately. If your biological assembly has multiple chains, each chain is analyzed independently.
What the package does
The workflow has two stages.
First, glink scans one PDB file and writes raw entangled contacts:
glink -f PDB/2ww4.pdb -o out
Second, glink-cluster groups similar raw contacts into representative entanglements:
glink-cluster \
-r out/2ww4_glink_contacts.csv \
-o clustered \
-g Human
The first command produces a CSV of contact-level entanglement candidates. The second command produces a CSV of clustered representatives.
Installation
Clone the repository and install the package:
git clone https://github.com/vuqv/glink_entanglement.git
cd glink_entanglement
pip install .
For development, use editable mode:
pip install -e .
The package requires Python 3.10 or newer and installs the following Python dependencies:
MDAnalysisnumpypandasscipytopoly
After installation, confirm that the two command-line tools are available:
glink --help
glink-cluster --help
If either command is not found, check that the active shell is using the same Python environment where you installed the package:
python -m pip show glink-entanglement
which glink
which glink-cluster
Input requirements
The main input is an all-atom PDB file:
PDB/2ww4.pdb
The structure should satisfy these conditions:
- protein atoms are recognizable by MDAnalysis;
- each residue has a C-alpha atom;
- residue ordering within each chain is meaningful;
- chain or segment identifiers are present when the file contains multiple chains;
- missing residues are either absent from the analysis intentionally or repaired before running.
Hydrogen atoms are ignored. Contacts are defined using protein heavy atoms.
Step 1: Detect raw GLink contacts
Run glink on a PDB file:
glink -f PDB/2ww4.pdb -o out
The -f argument gives the input PDB. The -o argument can be either an output directory or a full CSV path.
If -o is a directory, the output file is named from the PDB stem:
out/2ww4_glink_contacts.csv
You can also write to an explicit file:
glink \
-f PDB/2ww4.pdb \
-o out/my_2ww4_contacts.csv
At the end of a successful run, the command prints how many contacts were written and the runtime:
Wrote 91 contacts to out/2ww4_glink_contacts.csv
Runtime: 12.345678 seconds
Your exact runtime will depend on the protein size and Topoly settings.
How contacts are defined
For each chain, glink selects:
protein and not name H*
Then it marks two residues as a contact when:
- any pair of heavy atoms is within the contact cutoff;
- the residues are separated by at least four positions along the sequence.
The default heavy-atom contact cutoff is:
4.5 A
You can change it with:
glink \
-f PDB/2ww4.pdb \
-o out \
--contact_cutoff 5.0
A larger cutoff usually increases the number of candidate contacts. A smaller cutoff is stricter and may miss loose but meaningful loop-closing contacts.
How Gaussian linking is calculated
For each contact between residues i and j, the segment from i to j is treated as the loop. The C-alpha trace is used as the protein curve.
The package calculates two partial Gaussian linking values:
-
gn: linking between the loop and the N-terminal side of the chain; -
gc: linking between the loop and the C-terminal side of the chain.
The absolute values are rounded into integer-like scores:
-
Gn: rounded form ofabs(gn); -
Gc: rounded form ofabs(gc).
By default, the rounding threshold is:
0.6
For example, an absolute value at or above the threshold is rounded up to the nearest integer score. You can tune the threshold:
glink \
-f PDB/2ww4.pdb \
-o out \
--GLN_threshold 0.6
Contacts with both Gn == 0 and Gc == 0 are discarded before Topoly confirmation.
Topoly confirmation
Gaussian linking is used as a candidate screen. A contact is only written to the final raw CSV when Topoly finds a crossing residue on a side with nonzero rounded linking.
The logic is:
- if
Gn != 0andGc == 0, the contact must have an N-terminal crossing; - if
Gn == 0andGc != 0, the contact must have a C-terminal crossing; - if
Gn != 0andGc != 0, either side can confirm the contact; - if Topoly finds no relevant crossing, the contact is dropped.
The default Topoly density in this package is:
0
This is faster than Topoly’s package default. If you need the higher-density surface, run:
glink \
-f PDB/2ww4.pdb \
-o out \
--topoly_density 1
You can also tune Topoly’s crossing-reduction distances:
glink \
-f PDB/2ww4.pdb \
-o out \
--topoly_min_dist 10 6 5
The three values correspond to crossing, loop, and tail-end distances.
Raw CSV output
The raw CSV has one row per retained entangled contact. A typical file begins like this:
chain,resid_i,resname_i,resid_j,resname_j,contact_i_index,contact_j_index,gn,gc,Gn,Gc,crossingsN,crossingsC,crossings
SYSTEM,15,LEU,208,SER,14,207,-0.80898,-0.57081,1,0,-7,,-7
SYSTEM,15,LEU,209,ASN,14,208,-0.80928,-0.66592,1,1,-7,,-7
The most important columns are:
| Column | Meaning |
|---|---|
chain | Chain or segment identifier used by MDAnalysis. |
resid_i, resid_j | Residue IDs for the contact that closes the loop. |
resname_i, resname_j | Residue names for the contact residues. |
contact_i_index, contact_j_index | Zero-based C-alpha indices used internally. |
gn, gc | Raw partial Gaussian linking values. |
Gn, Gc | Rounded absolute linking values. |
crossingsN | Topoly crossing residues on the N-terminal side. |
crossingsC | Topoly crossing residues on the C-terminal side. |
crossings | Combined crossing list used by clustering. |
Crossing labels preserve chirality. For example:
-7
+172
The sign is the crossing chirality reported by Topoly, and the number is the mapped PDB residue ID.
Step 2: Cluster raw contacts
Raw contact files can contain many rows describing essentially the same entanglement. glink-cluster reduces them to representative entanglements.
Run clustering with an organism preset:
glink-cluster \
-r out/2ww4_glink_contacts.csv \
-o clustered \
-g Human
Available organism presets are:
| Organism | Cutoff |
|---|---|
Human | 52 |
Ecoli | 57 |
Yeast | 49 |
You can also provide a custom spatial cutoff:
glink-cluster \
-r out/2ww4_glink_contacts.csv \
-o clustered \
-c 52
To choose the exact output file:
glink-cluster \
-r out/2ww4_glink_contacts.csv \
-o clustered \
-g Human \
-w clustered/2ww4_representative_entanglements.csv
If -w is not provided, the output file is named:
<input_csv_stem>_clustered.csv
For the example above, that gives:
clustered/2ww4_glink_contacts_clustered.csv
How clustering works
The clustering step performs several filters and merges:
- Read the raw
glinkCSV. - Drop rows without crossing residues.
- Group rows by chain and exact crossing set.
- For each exact crossing set, keep the shortest loop as the first representative.
- Merge larger overlapping loops when they share chain, crossing count, chirality sequence, overlapping loop endpoints, and nearby crossing residues.
- Spatially cluster the remaining representatives using coordinates built from:
(resid_i, resid_j, crossing_residue_ids...)
Clustering is separated by chain, number of crossings, and chirality sequence. This prevents a one-crossing entanglement from being merged into a two-crossing entanglement, even if the loop endpoints are nearby.
Clustered CSV output
The clustered CSV has one row per representative entanglement:
cluster_id,chain,resid_i,resid_j,gn,gc,Gn,Gc,crossings,num_contacts,contacts
0,A,53,78,-0.70753,0.0,1,0,-45,3,52-79;53-78;53-79
The columns are:
| Column | Meaning |
|---|---|
cluster_id | Integer cluster ID. |
chain | Chain or segment identifier. |
resid_i, resid_j | Representative loop-closing contact. |
gn, gc | Raw linking values for the representative contact. |
Gn, Gc | Rounded linking values for the representative contact. |
crossings | Representative crossing residue set. |
num_contacts | Number of raw contacts represented by the cluster. |
contacts | Semicolon-delimited raw contact list. |
In the example row above, the representative entanglement uses loop-closing residues 53 and 78, has one crossing residue at -45, and summarizes three raw contacts.
End-to-end example
Here is a complete workflow using the example PDB directory from the repository:
git clone https://github.com/vuqv/glink_entanglement.git
cd glink_entanglement
pip install -e .
mkdir -p out clustered
glink \
-f PDB/2ww4.pdb \
-o out \
--GLN_threshold 0.6 \
--contact_cutoff 4.5 \
--topoly_density 0 \
--topoly_min_dist 10 6 5
glink-cluster \
-r out/2ww4_glink_contacts.csv \
-o clustered \
-g Human
Inspect the outputs:
head out/2ww4_glink_contacts.csv
head clustered/2ww4_glink_contacts_clustered.csv
For a quick count:
wc -l out/2ww4_glink_contacts.csv
wc -l clustered/2ww4_glink_contacts_clustered.csv
Remember that wc -l includes the header line.
Running many PDB files
For a directory of PDB files:
mkdir -p out clustered
for pdb in PDB/*.pdb; do
glink -f "$pdb" -o out
done
for csv in out/*_glink_contacts.csv; do
glink-cluster -r "$csv" -o clustered -g Human
done
Use a custom cutoff if the organism presets are not appropriate:
for csv in out/*_glink_contacts.csv; do
glink-cluster -r "$csv" -o clustered -c 52
done
Calling the package from Python
The command-line tools are the most convenient interface, but the core functions can also be imported.
To calculate raw contacts:
from glink_entanglement.glink import calculate_pdb_glink
df = calculate_pdb_glink(
"PDB/2ww4.pdb",
threshold=0.6,
cutoff=4.5,
topoly_density=0,
topoly_min_dist=(10, 6, 5),
)
df.to_csv("out/2ww4_glink_contacts.csv", index=False)
print(df.head())
To cluster a raw CSV:
from glink_entanglement.clustering import cluster_glink
clustered = cluster_glink(
"out/2ww4_glink_contacts.csv",
cutoff=52,
)
clustered.to_csv("clustered/2ww4_glink_contacts_clustered.csv", index=False)
print(clustered)
This is useful when you want to integrate entanglement detection into a larger Python analysis pipeline.
Choosing parameters
Start with the defaults:
glink \
-f structure.pdb \
-o out \
--GLN_threshold 0.6 \
--contact_cutoff 4.5 \
--topoly_density 0 \
--topoly_min_dist 10 6 5
Then adjust one parameter at a time.
Use --contact_cutoff when the contact definition is too strict or too permissive. A value around 4.5 A is a reasonable heavy-atom contact cutoff.
Use --GLN_threshold when you want to change how aggressively raw gn and gc values are rounded into nonzero Gn and Gc scores.
Use --topoly_density 1 when you want Topoly’s denser surface calculation and can tolerate a slower run.
Use glink-cluster -c when the organism preset is not appropriate for the proteins being analyzed.
Practical interpretation
A row in the raw CSV means:
- two residues form a heavy-atom contact;
- the loop closed by that contact has nonzero rounded Gaussian linking against one side of the chain;
- Topoly found at least one crossing residue consistent with the nonzero side.
A row in the clustered CSV means:
- one or more raw contacts were judged to represent the same entanglement class;
- the reported contact is the representative loop-closing contact;
-
num_contactstells you how many raw contacts were summarized.
For downstream analysis, the clustered CSV is often the easier starting point. Use the raw CSV when you need contact-level detail or want to understand why a particular representative was selected.
Troubleshooting
If glink writes zero contacts, check the PDB first:
- Does the file contain protein atoms recognized by MDAnalysis?
- Are C-alpha atoms present?
- Are chain breaks or missing residues expected?
- Is the contact cutoff too small?
- Is the GLN threshold too strict?
If Topoly is slow, keep the package default:
--topoly_density 0
Use density 1 only for cases where the slower, denser Topoly calculation is worth the extra cost.
If all atoms appear under chain SYSTEM, your PDB may not contain explicit chain or segment identifiers. The calculation can still run, but adding chain identifiers is better when you need chain-specific interpretation.
If clustering fails with a missing-column error, make sure the input to glink-cluster is the raw CSV produced by glink, not a manually edited table or a previously clustered CSV.
If the shell cannot find glink or glink-cluster, reinstall in the active environment:
python -m pip install -e .
Then open a new shell or refresh the shell’s command cache:
hash -r
Summary
glink-entanglement turns an all-atom protein PDB into two useful tables:
- a raw contact-level table from
glink; - a representative entanglement table from
glink-cluster.
The raw table is best for auditing individual loop-closing contacts. The clustered table is best for reporting and comparing entanglement patterns across structures.