💜 PLiCat (Protein–Lipid interaction Categorization tool)

We present PLiCat (Protein–Lipid interaction Categorization tool), a prediction tool that identifies which lipid categories a protein interacts with, using the protein sequence as the only input. The model fuses ESM C and BERT, which enables accurate and interpretable prediction of lipid-binding signatures across the 8 major lipid categories defined by LIPID MAPS. PLiCat is intended to facilitate the exploration of lipid-binding specificity and rational protein design.



Overall schematic framework of PLiCat:

Plicat-Architecture.jpg

Model Details

  • Architecture: ESM Cambrian + BERT + classification head
  • Task: Multi-label protein-lipid binding prediction
  • Fine-tuned from: ESMC_300m + bert-base-uncased
  • Developed by: Noora68
  • Framework: PyTorch + HuggingFace Transformers

Model usage workflow:

  1. Load the model and tokenizer
  2. Process the input sequence (tokenize → batch → pad → mask)
  3. Run inference to obtain logits → probabilities
  4. Output the results and mark high-confidence categories
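Step 2 of the workflow (batch → pad → mask) can be illustrated without loading the model. The following is a minimal pure-Python sketch, not the package's own code: `pad_batch` is a hypothetical helper, and the token ids and `PAD_ID` are placeholder values chosen for illustration. In practice the pad id comes from `tokenizer.pad_token_id`.

```python
from typing import List, Tuple


def pad_batch(token_ids: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in token_ids)
    # Append pad_id until every row has max_len entries
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in token_ids]
    # 1 marks a real token, 0 marks padding
    mask = [[1 if tok != pad_id else 0 for tok in row] for row in padded]
    return padded, mask


# Placeholder values: real ids come from the tokenizer
PAD_ID = 1
batch = [[0, 20, 13, 8, 2], [0, 20, 2]]

padded, mask = pad_batch(batch, PAD_ID)
print(padded)  # [[0, 20, 13, 8, 2], [0, 20, 2, 1, 1]]
print(mask)    # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The mask lets the model ignore padded positions, so sequences of different lengths can share one batch.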

Install the latest version:

pip install plicat_model

Usage:

from plicat_model import PLiCat
import torch
from torch.nn.utils.rnn import pad_sequence
from esm.tokenization import EsmSequenceTokenizer

# Set device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = EsmSequenceTokenizer()

# Default lipid type dictionary
default_dict = {
    "0": "NotLipidType",
    "1": "Fatty Acyl (FA)",
    "2": "Prenol Lipid (PR)",
    "3": "Glycerophospholipid (GP)",
    "4": "Sterol Lipid (ST)",
    "5": "Polyketide (PK)",
    "6": "Glycerolipid (GL)",
    "7": "Sphingolipid (SP)",
    "8": "Saccharolipid (SL)"
}

# Load pretrained PLiCat model
model = PLiCat.from_pretrained("Noora68/PLiCat-0.4B").to(device)

# Example protein sequence
sequence = "MDSNFLKYLSTAPVLFTVWLSFTASFIIEANRFFPDMLYFPM"

# Tokenize the sequence -> 1-D tensor of token ids
input_ids = torch.tensor(tokenizer.encode(sequence))

# Pad to the longest sequence in the batch. pad_sequence takes a list of
# 1-D tensors and returns shape (batch_size, max_length), here (1, length)
input_ids_padded = pad_sequence([input_ids], batch_first=True, padding_value=tokenizer.pad_token_id)

# Build attention mask: 1 for real tokens, 0 for padding
attention_mask = (input_ids_padded != tokenizer.pad_token_id).long()

# Move tensors to the same device as model
input_ids_padded = input_ids_padded.to(device)
attention_mask = attention_mask.to(device)

# Forward pass (no gradient needed during inference)
with torch.no_grad():
    outputs = model(input_ids_padded, attention_mask)

# Convert logits to probabilities using sigmoid
probs = torch.sigmoid(outputs['logits'])

# Convert to CPU and numpy array
probs = probs.squeeze().detach().cpu().numpy()

# Print results: add a check mark if probability > 0.6
for i, p in enumerate(probs):
    mark = " √" if p > 0.6 else ""
    print(f"{default_dict[str(i)]:<25}: {p:.4f}{mark}")

Output of the above example:

NotLipidType             : 0.0007
Fatty Acyl (FA)          : 0.1092
Prenol Lipid (PR)        : 0.9178 √
Glycerophospholipid (GP) : 0.6059 √
Sterol Lipid (ST)        : 0.0083
Polyketide (PK)          : 0.0026
Glycerolipid (GL)        : 0.0771
Sphingolipid (SP)        : 0.0002
Saccharolipid (SL)       : 0.0000
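Because the task is multi-label, each category gets its own sigmoid probability, so the scores need not sum to 1 and several categories can pass the threshold at once. The sketch below shows that post-processing step in plain Python, using the probabilities printed above. `sigmoid` and `select_categories` are hypothetical helper names, not part of `plicat_model`, and 0.6 is simply the threshold used in the example.

```python
import math
from typing import Dict, List, Sequence

# Same index-to-label mapping as default_dict in the usage example
LIPID_CATEGORIES: Dict[str, str] = {
    "0": "NotLipidType",
    "1": "Fatty Acyl (FA)",
    "2": "Prenol Lipid (PR)",
    "3": "Glycerophospholipid (GP)",
    "4": "Sterol Lipid (ST)",
    "5": "Polyketide (PK)",
    "6": "Glycerolipid (GL)",
    "7": "Sphingolipid (SP)",
    "8": "Saccharolipid (SL)",
}


def sigmoid(logit: float) -> float:
    """Map one logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def select_categories(probs: Sequence[float], threshold: float = 0.6) -> List[str]:
    """Return the labels whose probability exceeds the threshold."""
    return [LIPID_CATEGORIES[str(i)] for i, p in enumerate(probs) if p > threshold]


# Probabilities copied from the example output above
example_probs = [0.0007, 0.1092, 0.9178, 0.6059, 0.0083, 0.0026, 0.0771, 0.0002, 0.0000]
print(select_categories(example_probs))
# ['Prenol Lipid (PR)', 'Glycerophospholipid (GP)']
```

Raising the threshold trades recall for precision: with `threshold=0.9`, only Prenol Lipid (PR) remains for this sequence.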

Limitations

  • Trained only on lipid-binding protein data and may not generalize to other functions.
  • Model performance is best with sequence lengths under 500.
  • Dataset size is limited compared to large-scale protein corpora.
  • Model may reflect biases present in training data (e.g., under-representation of certain lipid types).
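Given the length limitation above, it can help to flag long inputs before running inference. This is a small sketch under two assumptions: the limit of 500 is counted in residues (the unit is not stated above), and `split_by_length` is a hypothetical helper, not part of `plicat_model`.

```python
from typing import Dict, Tuple

# From the limitation above; the unit (residues) is an assumption
RECOMMENDED_MAX_LENGTH = 500


def split_by_length(
    sequences: Dict[str, str], max_length: int = RECOMMENDED_MAX_LENGTH
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split named sequences into those under max_length and those at or above it."""
    within = {name: seq for name, seq in sequences.items() if len(seq) < max_length}
    over = {name: seq for name, seq in sequences.items() if len(seq) >= max_length}
    return within, over


sequences = {
    "example": "MDSNFLKYLSTAPVLFTVWLSFTASFIIEANRFFPDMLYFPM",  # sequence from the usage example
    "synthetic_long": "M" + "A" * 600,  # made-up sequence, for illustration only
}

within, over = split_by_length(sequences)
for name in over:
    print(f"{name}: length {len(over[name])}, predictions may be less reliable")
```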

Citing this work

If you use the code or data in this package, please cite:

@Article{PLiCat2025,
  author  = {Feitong Dong and Jingrou Wu},
  journal = {Briefings in Bioinformatics},
  title   = {PLiCat: Decoding protein-lipid interactions by large language model},
  year    = {2025},
  volume  = {26},
  number  = {6},
  doi     = {10.1093/bib/bbaf665},
  URL     = {https://academic.oup.com/bib/article/26/6/bbaf665/8377155}
}

License

This project is licensed under the MIT License.

