PLOD: Pinpoint Abbreviations in Scientific Texts

What is PLOD?
PLOD (Abbreviation Detection Dataset) is an English-language dataset built from PLOS journal articles for training and evaluating NLP models that detect acronyms and their long forms. Each sentence is hand-labelled, with every acronym (AC) and its corresponding long form (LF) marked at the token level.

What’s Inside
This coursework-ready subset contains between 1,000 and 10,000 sentences. Each entry provides:
- Tokens – the sentence split into individual tokens (words and punctuation).
- POS Tags – the part of speech for each token (via spaCy).
- NER Tags – token labels: 1 = AC, 4 = LF, 0 = other.
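
To make the entry layout concrete, here is a minimal Python sketch of how one sentence might look and how the acronym and long-form spans could be read back out of the tags. The field names (tokens, pos_tags, ner_tags), the example sentence, and the POS values are assumptions for illustration and are not copied from the dataset; only the integer coding (1 = AC, 4 = LF, 0 = other) comes from the description above.

```python
from itertools import groupby

# Integer coding as described above; the short names are illustrative.
LABEL_NAMES = {0: "O", 1: "AC", 4: "LF"}

# Hypothetical entry: field names, sentence, and POS tags are assumptions.
entry = {
    "tokens": ["Polymerase", "chain", "reaction", "(", "PCR", ")", "was", "used", "."],
    "pos_tags": ["NOUN", "NOUN", "NOUN", "PUNCT", "PROPN", "PUNCT", "AUX", "VERB", "PUNCT"],
    "ner_tags": [4, 4, 4, 0, 1, 0, 0, 0, 0],
}


def extract_spans(tokens, ner_tags):
    """Group consecutive tokens that share a non-zero tag into (label, text) spans."""
    spans = []
    for tag, group in groupby(zip(tokens, ner_tags), key=lambda pair: pair[1]):
        if tag == 0:
            continue  # skip tokens labelled "other"
        text = " ".join(token for token, _ in group)
        spans.append((LABEL_NAMES.get(tag, str(tag)), text))
    return spans


spans = extract_spans(entry["tokens"], entry["ner_tags"])
print(spans)  # [('LF', 'Polymerase chain reaction'), ('AC', 'PCR')]
```

Because this three-value coding has no separate "begin" marker, two long forms that sit directly next to each other would be merged into one span by this sketch.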

Why It Matters
- Training & Benchmarking – Develop or test models that link acronyms to definitions.
- Boost NLP Tasks – Improve search, summarisation, and machine translation by handling abbreviations more accurately.
- Track Progress – Published baselines show strong performance (F1 ≈ 0.92 for ACs, 0.89 for LFs).
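
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall, computed separately for each label. The sketch below scores a toy prediction at the token level. The gold and predicted tag sequences are invented for illustration, and the published baselines may have been scored under a different protocol (for example, span-level matching), so treat this as an explanation of the metric and not as a reproduction of those numbers.

```python
def f1_for_label(gold, pred, label):
    """Token-level F1 for one label, given parallel lists of integer tags."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this label
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy tag sequences (invented), using the coding 1 = AC, 4 = LF, 0 = other.
gold = [4, 4, 4, 0, 1, 0, 0, 0, 0]
pred = [4, 4, 0, 0, 1, 0, 1, 0, 0]

ac_f1 = f1_for_label(gold, pred, 1)
lf_f1 = f1_for_label(gold, pred, 4)
print(f"AC F1 = {ac_f1:.2f}, LF F1 = {lf_f1:.2f}")  # AC F1 = 0.67, LF F1 = 0.80
```

In the toy case, the model finds the one true acronym but also flags a spurious one (precision 0.5, recall 1.0), and it misses one of three long-form tokens (precision 1.0, recall 2/3).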

License
Released under the Creative Commons Attribution-ShareAlike 4.0 International License.

Start Exploring
Looking for a reliable benchmark for acronym detection?
Unlock PLOD on AIOZ AI and integrate it directly into your token-classification pipeline today.

About the AIOZ Network
AIOZ Network is a DePIN (decentralized physical infrastructure network) for Web3 AI, Storage, and Streaming.
Powered by a global community of AIOZ DePINs, AIOZ rewards you for sharing your computational resources for storing, transcoding, and streaming digital media content and powering decentralized AI computation.