💜 PLiCat (Protein–Lipid Interaction Categorization Tool)
We present PLiCat (Protein–Lipid Interaction Categorization Tool), a robust tool for predicting the lipid categories that interact with proteins, using protein sequence as the only input. Built on a combined architecture that fuses the ESM C and BERT models, PLiCat delivers accurate and interpretable predictions, distinguishing lipid-binding signatures among the eight major lipid categories defined by LIPID MAPS. PLiCat is intended as a practical tool for exploring lipid-binding specificity and for rational protein design.
- Paper: https://academic.oup.com/bib/article/26/6/bbaf665/8377155
- GitHub Repository: https://github.com/Noora68/PLiCat
- Online Demo: https://colab.research.google.com/drive/1wGSZsy7KyYoJf2PiXzP4SVLXonl-cWb9?usp=sharing
- Datasets: https://huggingface.co/datasets/Noora68/PLiCat-0.1.0
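The dataset can be browsed on the Hub or pulled with the standard 🤗 `datasets` API. A minimal sketch (inspect the printout before relying on any split or column names, since they are not documented here):

```python
# Minimal sketch: fetch the PLiCat dataset from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("Noora68/PLiCat-0.1.0")
print(ds)  # inspect the available splits and columns before relying on field names
```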
(Figure: overall schematic framework of PLiCat.)
Model Details
- Architecture: ESM Cambrian + BERT + classification head
- Task: Multi-label protein-lipid binding prediction
- Fine-tuned from: ESMC_300m (EvolutionaryScale/esmc-300m-2024-12) + bert-base-uncased
- Developed by: Noora68
- Framework: PyTorch + HuggingFace Transformers
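The fusion logic itself ships inside the `plicat_model` package. Purely as an illustration of the general pattern (sequence-level ESM and BERT embeddings concatenated and passed through a multi-label head), a minimal sketch might look like the following; the dimensions, layer sizes, and pooling choice are assumptions, not PLiCat's actual implementation:

```python
import torch
import torch.nn as nn

class FusionHeadSketch(nn.Module):
    """Illustrative ESM+BERT fusion classifier; all sizes are assumed, not PLiCat's."""

    def __init__(self, esm_dim: int = 960, bert_dim: int = 768, num_labels: int = 9):
        super().__init__()
        # Project the concatenated embeddings down to per-category logits
        self.head = nn.Sequential(
            nn.Linear(esm_dim + bert_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_labels),
        )

    def forward(self, esm_pooled: torch.Tensor, bert_pooled: torch.Tensor) -> torch.Tensor:
        # esm_pooled: (batch, esm_dim), bert_pooled: (batch, bert_dim)
        # Returns raw logits; sigmoid is applied downstream for multi-label probabilities.
        return self.head(torch.cat([esm_pooled, bert_pooled], dim=-1))
```

The `num_labels = 9` default matches the nine entries of the lipid dictionary used below (NotLipidType plus the eight LIPID MAPS categories).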
Model usage workflow:
- Load the model and tokenizer
- Process the input sequence (tokenize → batch → pad → mask)
- Run inference to obtain logits → probabilities
- Output the results and mark high-confidence categories
Install the latest version:

```bash
pip install plicat_model
```
Usage:
```python
from plicat_model import PLiCat
import torch
from torch.nn.utils.rnn import pad_sequence
from esm.tokenization import EsmSequenceTokenizer

# Set device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = EsmSequenceTokenizer()

# Default lipid type dictionary
default_dict = {
    "0": "NotLipidType",
    "1": "Fatty Acyl (FA)",
    "2": "Prenol Lipid (PR)",
    "3": "Glycerophospholipid (GP)",
    "4": "Sterol Lipid (ST)",
    "5": "Polyketide (PK)",
    "6": "Glycerolipid (GL)",
    "7": "Sphingolipid (SP)",
    "8": "Saccharolipid (SL)"
}

# Load pretrained PLiCat model
model = PLiCat.from_pretrained("Noora68/PLiCat-0.4B").to(device)
model.eval()  # ensure evaluation mode (disables dropout)

# Example protein sequence
sequence = "MDSNFLKYLSTAPVLFTVWLSFTASFIIEANRFFPDMLYFPM"

# Tokenize the sequence -> input_ids
input_ids = torch.tensor(tokenizer.encode(sequence))

# Add batch dimension: (batch_size=1, length)
input_ids = input_ids.unsqueeze(0)

# Pad to the longest sequence in the batch (a no-op for a single sequence)
input_ids_padded = pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)

# Build attention mask: 1 for real tokens, 0 for padding
attention_mask = (input_ids_padded != tokenizer.pad_token_id).long()

# Move tensors to the same device as the model
input_ids_padded = input_ids_padded.to(device)
attention_mask = attention_mask.to(device)

# Forward pass (no gradient needed during inference)
with torch.no_grad():
    outputs = model(input_ids_padded, attention_mask)

# Convert logits to probabilities using sigmoid
probs = torch.sigmoid(outputs['logits'])

# Move to CPU and convert to a numpy array
probs = probs.squeeze().detach().cpu().numpy()

# Print results: add a check mark if probability > 0.6
for i, p in enumerate(probs):
    mark = " √" if p > 0.6 else ""
    print(f"{default_dict[str(i)]:<25}: {p:.4f}{mark}")
```
The output of the example above is:
```text
NotLipidType             : 0.0007
Fatty Acyl (FA)          : 0.1092
Prenol Lipid (PR)        : 0.9178 √
Glycerophospholipid (GP) : 0.6059 √
Sterol Lipid (ST)        : 0.0083
Polyketide (PK)          : 0.0026
Glycerolipid (GL)        : 0.0771
Sphingolipid (SP)        : 0.0002
Saccharolipid (SL)       : 0.0000
```
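With a single sequence, the padding step above is a no-op; for several sequences it does real work. A sketch of batched inference under the same setup as the example above (reusing `model`, `tokenizer`, `device`, and `default_dict`; the second sequence is an arbitrary dummy):

```python
# Batched inference sketch: same model, tokenizer, and default_dict as above.
sequences = [
    "MDSNFLKYLSTAPVLFTVWLSFTASFIIEANRFFPDMLYFPM",
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK",  # dummy example
]

# Tokenize each sequence, then pad the batch to its longest member
batch = [torch.tensor(tokenizer.encode(s)) for s in sequences]
input_ids = pad_sequence(batch, batch_first=True, padding_value=tokenizer.pad_token_id)
attention_mask = (input_ids != tokenizer.pad_token_id).long()

with torch.no_grad():
    outputs = model(input_ids.to(device), attention_mask.to(device))

probs = torch.sigmoid(outputs['logits']).cpu().numpy()  # shape: (batch, 9)
for seq, row in zip(sequences, probs):
    hits = [default_dict[str(i)] for i, p in enumerate(row) if p > 0.6]
    print(f"{seq[:20]}...: {hits}")
```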
Limitations
- Trained only on lipid-binding protein data and may not generalize to other functions.
- Model performance is best with sequence lengths under 500 residues (see the truncation sketch after this list).
- Dataset size is limited compared to large-scale protein corpora.
- Model may reflect biases present in training data (e.g., under-representation of certain lipid types).
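If a sequence exceeds that length, one simple (and lossy) workaround is to truncate before tokenization. Whether the package itself truncates or windows long inputs internally is not documented here, so treat this as a hedged sketch:

```python
MAX_RESIDUES = 500  # taken from the performance note above; not enforced by the package

def truncate_sequence(seq: str, max_len: int = MAX_RESIDUES) -> str:
    """Naive N-terminal truncation; sliding windows with score averaging are an alternative."""
    return seq[:max_len]

# Usage with the pipeline above (hypothetical long sequence):
long_seq = "M" + "ACDEFGHIKLMNPQRSTVWY" * 40  # 801 residues
input_ids = torch.tensor(tokenizer.encode(truncate_sequence(long_seq))).unsqueeze(0)
```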
Citing this work
If you use the code or data in this package, please cite:
```bibtex
@article{PLiCat2025,
  author  = {Feitong Dong and Jingrou Wu},
  journal = {Briefings in Bioinformatics},
  title   = {PLiCat: Decoding protein-lipid interactions by large language model},
  year    = {2025},
  volume  = {26},
  number  = {6},
  doi     = {10.1093/bib/bbaf665},
  url     = {https://academic.oup.com/bib/article/26/6/bbaf665/8377155}
}
```
License
This project is licensed under the MIT License.