# import library
import json
import pandas as pd
import requests
from colorama import Fore, Back, Style
from itertools import groupby
from operator import itemgetter
Introduction
Proteins are complex biomolecules crucial for the functioning of living organisms. They come in various shapes and sizes, each designed for specific functions within cells. Some proteins are well-structured, while others exhibit intrinsic disorder. Understanding the level of disorder in a protein is essential for unraveling its biological role. In this blog post, we’ll explore how we can determine if a protein is fully intrinsically disordered, contains structured regions with disordered loops, or is entirely structured, using insights from Necci et al.’s 2016 research (Necci, Piovesan, and Tosatto (2016)). Given a protein sequence and structural ensemble, how do we know if it is fully IDP, protein with some IDR regions or it is fully structured with some loops?
Classifying Proteins Based on Disorder:
Necci and colleagues proposed a classification scheme for intrinsically disordered proteins (IDPs) based on the number of consecutive disordered residues:
Short IDR: Proteins with \(5-19\) consecutive disordered residues.
Long IDR: Proteins with \(\ge 20\) consecutive disordered residues.
Fully Disordered Protein: Proteins with \(\ge 50\) consecutive disordered residues and more than 95% of their content being disordered.
This classification scheme provides a straightforward way to categorize proteins based on their disorder characteristics, offering valuable insights into their potential functions.
Mapping Sequence Features:
To determine a protein’s disorder characteristics, we need to map sequence features from two primary resources: DISPROT and MobiDB. DISPROT is a database that provides information about where structured and disordered regions are located in a protein sequence, while MobiDB offers valuable data on protein mobility and disorder.
Using these resources, we can efficiently evaluate the disorder characteristics of a given protein entry in the PED (Protein Ensemble Database). By cross-referencing the information from DISPROT and MobiDB with the PED entries, we can gain a comprehensive understanding of a protein’s structural properties.
What is MobiDB?
MobiDB comprises different types of disorder annotations and different quality levels of disorder evidence. (More information can be found here: Monzon et al. (2020))
The source of evidence in MobiDB can be classified depending on the quality of evidence:
High-quality: data are imported from manually curated external databases.
Intermediate-quality: annotations are derived from experimental data such as X-ray diffraction, NMR, and NMR chemical shifts.
Lowest-quality: Disorder predictions are provided by different computational methods and represent the lowest quality and confidence, even though they have by far the largest coverage by including all UniProt proteins
In order to provide the most reliable disorder annotation for each protein, MobiDB combines all its data sources into a consensus annotation, prioritizing curated and indirect pieces of evidence over predictions.
Curated Data: High-quality annotations in MobiDB come from external resources which provide manually curated evidence. Intrinsically disordered regions are extracted from three databases, DisProt, FuzDB, and UniProt. DisProt is the largest database which covers manually curated ID annotations. In the latest version of MobiDB, DisProt disordered regions are propagated by homology, exploiting GeneTree alignments with a similarity constraint of 80%.
Derived Data: Indirect experimental data sources provide annotations from X-ray experiments, NMR three-dimensional models, and NMRchemical shift data. Indirect disorder information is derived from PDB in three different ways by considering (1) missing, (2) high temperature, and (3) mobile residues. Missing residues are those for which the detected electron density is not sharp enough to model the corresponding structure.
Predicted Data: To define a residue as disordered, at least five out of eight predictors have to agree with the prediction.
Code implementation
Import libraries
Retrieve disorder information from MobiDB
Next, we define a function to retrieve disorder-related information for a protein using the DISPROT/MobiDB database.
Here, we only use data with highest confidence from MobiDB. All highest confidence sources of evidence are treated equally.
Specifically, the following categories will be used:
curated-disorder-priority
: Residues are annotated with the source with the highest of evidence from curated datahomology-disorder-priority
: Residues are annotated with the source with the highest of evidence from homology (see above for more information)derived-missing_residues-priority
: Residues are annotated when at least 90% of the children are annotatedprediction-disorder-priority
: Residues are annotated with the source with the highest of evidence from prediction (see above for more information)derived-mobile-th_90
: Mobile residues, residues are annotated when at least 90% of the children are annotated
def get_mobidb_disordered_data(uniprot):
"""
Retrieve disorder-related information for a protein using the DISPROT/MobiDB database.
This function takes a UniProt ID as input and queries the DISPROT/MobiDB database to retrieve disorder-related
information for the specified protein. It returns a list of intervals representing disordered regions based on
specific keywords, with a preference for "curated" disorder information.
Args:
uniprot (str): The UniProt ID of the protein for which disorder information is to be retrieved.
Returns:
list: A list of intervals representing disordered regions. Each interval is represented as a tuple (start, end).
Example:
disordered_regions = get_mobidb_disordered_data("P12345")
print("Disordered Regions:")
for interval in disordered_regions:
print(f"[{interval[0]}, {interval[1]}]")
Note:
- The function queries the DISPROT/MobiDB database via its API.
- It prioritizes "curated" disorder information when available.
- Disorder information is returned based on specific keywords.
- If no disorder information is found, an empty list is returned.
"""
# The information about these triplets can be found here: https://mobidb.org/help#vocabulary
= ['curated-disorder-priority',
keywords 'homology-disorder-priority',
'derived-missing_residues-priority',
'prediction-disorder-priority', 'derived-mobile-th_90'] #, , 'derived-mobile_context_dependent-th_90'
= 'https://mobidb.org/api/download?format=json&acc=' + uniprot
url
# Check if the ID exists in DISPROT/MOBIDB
= requests.get(url)
res
if res.status_code == 200:
try:
= res.json()
result except:
print("ID DOES NOT EXITS IN THE DATABASE")
return [] # Return an empty list if JSON parsing fails
= []
disordered_regions
for key in keywords:
if key in result.keys():
= result[key]['regions']
regions # print(key, regions)
tuple(regions))
disordered_regions.append(
# print(disordered_regions)
if len(disordered_regions) == 0:
print("NO DISORDER REGION FOUND")
return "Fully structured"
return disordered_regions
return [] # Return an empty list if the ID does not exist in the database
# Example:
"P02751") get_mobidb_disordered_data(
[([1, 47], [272, 296], [2083, 2198], [2432, 2477]),
([153, 166], [702, 730], [2330, 2339])]
As we can see, here the function only extracts disorder regions from various criterias and not doing anything. To be more clear, we need to sort out, combine continous regions in a compact form for more intuitative.
Merge data in compact, continuous region form
As an example, the get_mobidb_disordered_data
prints all regions without sorting. This is quite messy. We will define a merge_data
function to sort, combine the results and print in compact form:
def merge_data(data_sets):
"""
Merge and compact intervals from multiple data sets.
This function takes a list of data sets, where each data set consists of one or more intervals.
It combines intervals from all data sets, merges overlapping intervals, and further merges adjacent intervals.
The resulting merged intervals are printed in compact form.
Args:
data_sets (list): A list of data sets, where each data set is represented as a tuple of intervals.
Each interval is represented as a list with two elements: [start, end].
Returns:
None: The function prints the merged intervals but does not return a value.
Example:
data_sets = [([1, 56],), ([1, 56],), ([59, 68],), ([57, 58], [150, 160])]
merge_data(data_sets)
Output:
Compact Merged Intervals:
[[1, 68], [150, 160]]
Note:
- Intervals within each data set are merged if they overlap.
- Intervals across different data sets are combined and merged.
- Adjacent intervals are further merged into compact intervals.
"""
# print(data_sets)
if isinstance(data_sets, str):
# if protein is fully structure, data_sets will be string- then do nothing
return []
if len(data_sets) == 0:
# print("[ID does not exist in Database]")
return []
# Initialize an empty list to store all intervals
= []
all_intervals
# Iterate through each data set and collect all intervals
for intervals in data_sets:
all_intervals.extend(intervals)
# Sort the intervals by their start values
=lambda x: x[0])
all_intervals.sort(key
# Initialize a list to store the merged intervals
= []
merged_intervals
# Iterate through the sorted intervals and merge overlapping intervals
for interval in all_intervals:
if not merged_intervals or interval[0] > merged_intervals[-1][1]:
# If the interval does not overlap with the last merged interval, add it as a new merged interval
merged_intervals.append(interval)else:
# If the interval overlaps with the last merged interval, merge them
-1] = [merged_intervals[-1][0], max(merged_intervals[-1][1], interval[1])]
merged_intervals[
# Further merge adjacent intervals
= []
final_merged_intervals = merged_intervals[0]
current_interval
for interval in merged_intervals[1:]:
if current_interval[1] + 1 == interval[0]:
1] = interval[1]
current_interval[else:
final_merged_intervals.append(current_interval)= interval
current_interval
final_merged_intervals.append(current_interval)
# Print the merged intervals in compact form
# print("Disorder regions:")
# print(tuple(final_merged_intervals))
return tuple(final_merged_intervals)
Count how many disorder residues in a given sequence
Alright, Now we have the information about where is the disorder regions of full sequence (MobiDB) and where is the sequence in PED mapping to MobiDB. Question is how many residues in PED sequence are disorder? For that purpose, we define a count_overlap
function to do this job.
This function takes two arguments (as docs of function below)
first_range
is the region of PED sequence, as a tuple
second_ranges
is the disorder regions from MobiDB, is a tuple or list of tuples in case sequence contains many disorder regions and not continous.
def count_overlap(first_range, second_ranges):
"""
Count the number of overlapping numbers between two range tuples.
Args:
first_range (tuple): A tuple representing the first range as (start, end).
second_ranges (tuple or list of tuples): A tuple or list of tuples representing the second ranges.
Returns:
list: A list of counts, where each count corresponds to the number of overlapping numbers for each region
in the second_ranges.
Example:
first_range = (57, 160)
second_ranges = ([1, 68], [150, 160])
overlaps = count_overlap(first_range, second_ranges)
print(overlaps) # Output: [12, 11]
"""
# print(first_range, second_ranges)
if isinstance(second_ranges, tuple):
= list(second_ranges)
second_ranges # print(second_ranges)
= []
overlaps
if not bool(second_ranges):
# when disorder region is empty-fully structure-return an empty list
return []
for second_range in second_ranges:
# print(second_range)
= max(first_range[0], second_range[0])
start = min(first_range[1], second_range[1])
end
if start <= end:
= end - start + 1
overlap_count
overlaps.append(overlap_count)else:
0)
overlaps.append(
return overlaps
# Example usage:
= (57, 160)
first_range = ([1, 68],)#, [150, 160]
second_ranges = count_overlap(first_range, second_ranges)
overlaps print(overlaps)
[12]
Classify protein based on the length of disorder regions
We have the length of IDR regions, here we define a function to classify proteins into catagories as stated in the introduction:
Fully IDP
Long IDR
Short IDR
Fully Structured
def classify_idr_regions(length_of_idr, length_in_ped):
"""
Classifies Intrinsic Disorder Regions (IDR) based on specified criteria.
Args:
length_of_idr (list of int): A list of integer values representing the lengths
of individual IDR regions.
length_of_pdb (int): The total length of the Protein Data Bank (PDB) structure.
Returns:
str: A classification string for the IDR regions based on the following criteria:
- If any region has a length greater than or equal to 50 and accounts for
more than 95% of the PDB length, it is classified as "Full IDP."
- If any region has a length greater than or equal to 20, it is classified as
"Long IDR."
- If any region has a length between 5 and 19 (inclusive), it is classified as
"Short IDR."
- If all regions have lengths less than 5, the entire structure is classified
as "Fully Structured."
Examples:
>>> length_of_idr = [1, 10, 69]
>>> length_of_pdb = 69
>>> classification = classify_idr_regions(length_of_idr, leng_of_pdb)
>>> print(classification)
"Long IDR"
Note:
- If multiple criteria apply to different regions, the most strict criteria
are applied. The priority is "Full IDP" > "Long IDR" > "Short IDR" > "Fully Structured."
- This function assumes that the input values are valid and correctly represent
the lengths of IDR regions and the PDB length.
"""
if length_in_ped < 20:
return "Not Classified"
= []
classifications
for idr_length in length_of_idr:
if idr_length >= 50 and (idr_length / length_in_ped) >= 0.95:
# the second condition should be sum(length_of_idr)/ length_in_ped it will make more sense
"Full IDP")
classifications.append(elif idr_length >= 20:
"Long IDR")
classifications.append(elif 5 <= idr_length <= 19:
"Short IDR")
classifications.append(else:
"Undefined")
classifications.append(
# Determine the final classification based on priority
if "Full IDP" in classifications:
return "Fully IDP"
elif "Long IDR" in classifications:
return "Long IDR"
elif "Short IDR" in classifications:
return "Short IDR"
else:
return "Fully Structured"
# Example usage:
# length_of_idr = [3, 4]
# length_in_ped = 18
# classification = classify_idr_regions(length_of_idr, length_in_ped)
# print("Classification:", classification)
Combine all together:
Function to do the main task: get information from PED, pass uniprot to call MobiDB then do all other stuffs:
"""
This function work very well. need some modifies for printing.
"""
def get_ped_stats(PEDID):
"""
Retrieve and display protein ensemble deposition (PED) statistics for a given PED entry.
This function queries the Protein Ensemble Deposition (PED) API to retrieve statistics and information for a
specific PED entry identified by its PEDID. It provides details such as the entry ID, title, and information
about chains and fragments within the entry. Additionally, it calculates the number of overlapping residues
between fragment positions and disordered regions retrieved from the DISPROT/MobiDB database.
Args:
PEDID (str): The PED entry ID for the entry to retrieve statistics.
Returns:
None: The function prints the PED statistics and overlap counts but does not return a value.
Example:
get_ped_stats("PED12345")
Note:
- This function requires the 'requests' library for HTTP requests and 'colorama' for colored output.
- It queries the PED API to obtain entry information.
- It retrieves disordered region information using the DISPROT/MobiDB database.
- Overlapping residues between fragment positions and disordered regions are calculated and displayed.
"""
= "https://deposition.proteinensemble.org/api/v1/entries/" + PEDID
url = requests.get(url)
res if res.status_code == 200:
= res.json()
res print("PED ID\t# chains in entry\tProtein name\t\"Length in PED (tag counted)\"\tUniProt\tLength UniProt\t\"Disordered region from MobiDB/DisProt\"\t\"PDB region (align to Uniprot)\"\tLength of IDR\t Classification")
= res['construct_chains']
construct_chains
for chain in construct_chains:
if len(construct_chains) == 1:
# chain_name = chain['chain_name']
= PEDID
entry else:
# chain_name = res['entry_id'] + '_' + chain['chain_name']
= PEDID + '_' + chain['chain_name']
entry
= len(chain['fragments'])
n_fragments = chain['fragments']
fragments
for fragment, fragment_stats in zip(fragments, chain['fragments_stats']):
= fragment['description']
protein_name = fragment_stats['length_total_pdb']
length_in_ped = fragment_stats['uniprot']
uniprot = fragment_stats['length_total_uniprot']
length_uniprot
= tuple()
mobi_disorder_regions if fragment_stats['uniprot'] is not None:
= merge_data(get_mobidb_disordered_data(fragment_stats['uniprot']))
mobi_disorder_regions
= tuple([fragment['start_position'], fragment['end_position']])
pdb_region = count_overlap(pdb_region, mobi_disorder_regions)
length_of_idr = classify_idr_regions(length_of_idr, length_in_ped)
classification
print(f"{entry}\t{len(construct_chains)}\t{protein_name}\t{length_in_ped}\t{uniprot}\t{length_uniprot}\t{mobi_disorder_regions}\t{pdb_region}\t{length_of_idr}\t{classification}")
elif res.status_code == 404:
print(f"Entry {PEDID} does not exist")
return
Example usage:
=19
ID='PED'+f'{ID:05d}'
PEDID get_ped_stats(PEDID)
PED ID # chains in entry Protein name "Length in PED (tag counted)" UniProt Length UniProt "Disordered region from MobiDB/DisProt" "PDB region (align to Uniprot)" Length of IDR Classification
PED00019_A 2 None 160 O14558 160 ([1, 68], [150, 160]) (1, 160) [68, 11] Long IDR
PED00019_B 2 Heat shock protein beta-6 160 O14558 160 ([1, 68], [150, 160]) (1, 160) [68, 11] Long IDR
Explanation of the example
Here, we are looking at the PED entry PED00019
.
This entry comprises two chains: Chain A
and Chain B
. To enhance clarity, we opt to represent the name as ENTRY+CHAIN
for both.
Chain A consists of 160 residues and can be traced back to the UniProt ID O14558, which is also characterized by 160 residues.
The disorder regions, as identified by MobiDB, are situated at positions [1, 68]
and [150, 160]
.
In the PDB data within PED, the entire sequence, spanning positions [1, 160]
, is cataloged.
Upon a careful comparison of PED PDB data and the disorder regions, we observe the presence of two distinct disorder regions, one spanning 68 residues and the other 11 residues in length.
Similar explaination for chain B.
Notes:
In some cases, PED PDB can be complicated such as single chain is combined of multiple segment from multiple UniProt sequence, we need to manually combine information from that.