The Role of Artificial Intelligence in Systematic Literature Reviews and PICO Extraction: Progress, Practice, and Prospects in 2025
Systematic literature reviews (SLRs) are critical for health technology assessment (HTA) and regulatory submissions. However, the volume of published studies and the complexity of data extraction make SLRs labour-intensive and time-consuming. Artificial intelligence (AI), in particular natural language processing (NLP) and large language models (LLMs), has been progressively integrated into SLR workflows to support evidence identification and structured data extraction, especially for the core elements of Population, Intervention, Comparator, Outcomes, and Study design (PICOS).
Evolution of AI in Evidence Synthesis
The use of AI in systematic reviews has evolved significantly over the past decade. Initial efforts in the early 2010s employed rule-based algorithms and classical machine learning (ML) approaches for citation screening and information classification. These early systems were constrained by limited training data, restricted generalisability, and suboptimal performance in complex language tasks. Subsequent developments were catalysed by the introduction of large annotated datasets: collections of biomedical texts manually labelled with key elements such as patient populations, interventions, comparators, and outcomes. A notable example is the EBM-NLP corpus, released in 2018, which provided over 5,000 PubMed abstracts with labelled PICO elements for model training. Building on this foundation, deep learning architectures, such as recurrent neural networks and later transformer models, improved the precision of PICO sentence classification and the summarisation of trial information.
The introduction of LLMs such as BERT, GPT-3, and more recently GPT-4 marked a significant inflection point from earlier deep learning approaches by enabling more generalisable and context-aware representations of medical text, with minimal task-specific training. A scoping review in late 2024 identified dozens of studies deploying LLMs across 10 of 13 defined systematic review steps, most commonly for literature search assistance, highlighting a surge in interest in their application within evidence synthesis.
Current Capabilities in 2025
LLMs have demonstrated robust performance in extracting structured information, especially PICO elements. A proof-of-concept study by Reason et al. showed that GPT-4 could extract PICO elements from over 680,000 clinical trial abstracts, with 98% completeness in a random sample. The model processed the entire corpus in under three hours, indicating both scalability and high accuracy for well-structured texts. Beyond abstracts, LLMs also perform well in full-text data extraction. A recent evaluation tested ChatGPT-4o as a second reviewer in a systematic review of pharmacological interventions. The AI extracted 484 relevant data items with 92.4% accuracy and 94% consistency between independent extractions, supporting its use as an effective adjunct to human reviewers.
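The abstract-level extraction workflow described above can be sketched as a simple prompt-and-parse loop. The prompt wording, JSON schema, and the stubbed model reply below are illustrative assumptions, not the prompts or outputs used in the studies cited; a real pipeline would replace the stub with an LLM API call.

```python
import json

# Illustrative prompt template for PICO extraction (wording is an assumption,
# not the prompt used in the studies cited above).
PROMPT_TEMPLATE = (
    "Extract the PICO elements from the clinical trial abstract below. "
    "Respond with JSON containing the keys 'population', 'intervention', "
    "'comparator', and 'outcomes'. Use null for elements not stated.\n\n"
    "Abstract:\n{abstract}"
)

def build_prompt(abstract: str) -> str:
    """Fill the template with a single abstract."""
    return PROMPT_TEMPLATE.format(abstract=abstract)

def parse_pico(raw_response: str) -> dict:
    """Parse the model's JSON reply, making missing PICO keys explicit nulls."""
    data = json.loads(raw_response)
    for key in ("population", "intervention", "comparator", "outcomes"):
        data.setdefault(key, None)  # absent element -> explicit null
    return data

# Stubbed model reply, standing in for a real LLM call (which would typically
# be made with temperature 0 to reduce run-to-run variation).
stub_reply = json.dumps({
    "population": "adults with type 2 diabetes",
    "intervention": "drug A 10 mg daily",
    "comparator": "placebo",
    "outcomes": ["HbA1c change at 24 weeks"],
})

record = parse_pico(stub_reply)
print(record["comparator"])  # -> placebo
```

Parsing into a fixed schema, rather than accepting free-text answers, is what makes completeness checks such as the 98% figure above straightforward to compute at corpus scale.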
Custom-trained foundation models are also emerging. The LEADS model, trained on over 21,000 SLRs and 450,000 full-text trial reports, demonstrated improved performance in query generation, study selection, and PICO extraction relative to generic LLMs. In user studies, clinicians supported by LEADS achieved higher recall and faster task completion compared to unaided reviewers, highlighting the growing value of domain-specific AI tools.
Limitations
Despite significant progress in recent years, several limitations remain. AI systems occasionally hallucinate, generating plausible but incorrect outputs, especially when prompted for information not explicitly present in the source text. One study found that hallucination affected approximately 5% of extracted data points, a rate that necessitates human verification of every extracted paper to ensure the accuracy of the overall evidence synthesis. Reproducibility is another concern: although LLMs can be configured for near-deterministic output, variations still occur across runs, complicating auditability in regulated environments. Transparency also remains limited. Unlike traditional evidence synthesis methods, which document decisions and rationales explicitly, most AI systems do not naturally justify exclusions or data interpretations. Some prompting strategies can elicit reasoning from LLMs, but explainability remains an area of active development.
The nuanced interpretation of clinical study designs and results also poses a challenge. AI tools may misclassify primary and secondary outcomes or fail to distinguish composite from discrete endpoints. Tasks requiring critical appraisal, such as risk-of-bias assessment, have shown only moderate AI performance and still demand human input. Moreover, LLMs used off the shelf as ad hoc screeners or extractors operate on static training data, and may overlook recent evidence unless connected to a live retrieval system.
Human-AI Hybrid Workflows
The prevailing consensus is that AI should augment, rather than replace, human reviewers. A hybrid model, in which AI performs first-pass screening and extraction followed by human verification, is currently the safest and most effective approach. AI-assisted screening can prioritise relevant citations, reducing human workload while ensuring comprehensive inclusion. In data extraction, AI-generated summaries or structured tables can be reviewed and corrected by human analysts. In comparative studies, this model achieved results comparable in quality to dual human extraction while significantly reducing time requirements.
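The verification step of such a hybrid workflow can be sketched as a simple reconciliation pass: AI-extracted fields are compared against a human reviewer's corrections, disagreements are flagged for adjudication, and the human value prevails on conflict. The field names and records below are illustrative, not drawn from any cited study.

```python
def reconcile(ai_record: dict, human_record: dict) -> tuple[dict, list[str]]:
    """Merge an AI first-pass extraction with human corrections.

    Returns the reconciled record (human values win on conflict) and the
    sorted list of fields that disagreed and therefore needed adjudication.
    """
    merged, disagreements = {}, []
    for field in sorted(ai_record.keys() | human_record.keys()):
        ai_val = ai_record.get(field)
        human_val = human_record.get(field)
        if ai_val != human_val:
            disagreements.append(field)
        # Prefer the human correction; fall back to the AI value if absent.
        merged[field] = human_val if human_val is not None else ai_val
    return merged, disagreements

ai = {"population": "adults >=18", "comparator": "placebo", "n": 120}
human = {"population": "adults >=18", "comparator": "usual care", "n": 120}

merged, flagged = reconcile(ai, human)
print(flagged)      # -> ['comparator']
print(merged["n"])  # -> 120
```

Logging the flagged fields, rather than silently overwriting them, preserves the audit trail that dual human extraction provides in a conventional review.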
Several platforms now incorporate AI functions directly and allow users to query AI for study summaries and relevance assessments, facilitating interaction rather than automation; domain-specific models further enhance this dynamic by tailoring responses to the structure and content of biomedical literature.
Application to the EU Joint Clinical Assessment
The Joint Clinical Assessment (JCA) initiative, now being implemented in 2025 under Regulation (EU) 2021/2282, requires centralised clinical evaluation for certain new medicines across all EU member states. This poses logistical and methodological challenges for evidence synthesis, particularly given the requirement to address multiple national PICO frameworks: manufacturers must submit a comprehensive dossier addressing the potentially divergent PICO criteria requested by 27 member states, all within a strict 90-day timeline. AI is now emerging as a key enabler in meeting these demands. By rapidly extracting data across multiple PICO permutations and facilitating evidence mapping, AI tools can significantly expedite the development of JCA dossiers. The mass PICO extraction study by Reason et al. discussed above illustrates how AI can be harnessed to streamline cross-country response preparation.
While official EMA guidance on generative AI remains in development, regulatory interest is growing. The European Medicines Agency and Heads of Medicines Agencies have included AI in their joint workplan (2023–2028), with the aim of developing frameworks to enhance the efficiency of data analysis and support decision-making in medicines regulation. For now, AI tools are primarily used to support but not replace human judgment in HTA documentation, SLRs and evidence synthesis.
PICO Predict, developed by Decisive Consulting, is an AI-enabled solution designed to assist pharmaceutical and medical device companies in preparing for the European Union's JCA process. PICO Predict recognises both the capabilities and limitations of AI in the context of JCA, acknowledging that while AI can efficiently process and synthesise large volumes of clinical and regulatory data to support early PICO scoping, expert human oversight remains essential to ensure contextual accuracy, clinical relevance, and alignment with evolving methodological requirements.
Conclusion
In 2025, AI, particularly in the form of LLMs, has reached a point of genuine utility in systematic reviews and PICO data extraction. These systems demonstrate high accuracy and speed across multiple review tasks. However, limitations in transparency, reproducibility, hallucination risk and medical nuance demand continued human oversight.
The optimal process involves a collaborative workflow in which AI accelerates evidence identification and structuring, while human reviewers ensure validity, context, and completeness. In time-critical settings such as the EU JCA, AI-assisted evidence generation is not only feasible but increasingly necessary to meet stringent evidence requirements. The path forward is clear: human-AI partnerships, underpinned by domain-specific tuning, quality control, and methodological transparency, will define the next generation of evidence synthesis.
References
Nye B, Li JJ, Patel R, et al. A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature. Proc Annu Meet Assoc Comput Linguist. 2018;1:197–207. doi:10.18653/v1/P18-1019.
CapeStart. Using NLP to improve PICO element identification and extraction for SLRs and evidence-based medicine [Internet]. Cambridge (MA): CapeStart; [cited 2025 Apr 15]. Available from: https://capestart.com/resources/blog/using-nlp-to-improve-pico-element-identification-and-extraction
Lieberum JL, Töws M, Metzendorf MI, et al. Large language models for conducting systematic reviews: on the rise, but not yet ready for use—a scoping review. J Clin Epidemiol. 2025;181:111746. doi:10.1016/j.jclinepi.2025.111746.
Reason T, Langham J, Gimblett A. Automated mass extraction of over 680,000 PICOs from clinical study abstracts using generative AI: a proof-of-concept study. Pharmaceut Med. 2024 Sep 26;38(5):365–72. doi:10.1007/s40290-024-00539-6.
Jensen MM, Danielsen MB, Riis J, et al. ChatGPT-4o can serve as the second rater for data extraction in systematic reviews. PLoS One. 2025 Jan 7;20(1):e0313401. doi:10.1371/journal.pone.0313401.
Wang Z, Cao L, Jin Q, et al. A foundation model for human-AI collaboration in medical literature mining [preprint]. arXiv. 2025 Jan 27. doi:10.48550/arXiv.2501.16255.
Decisive Consulting Ltd. PICO Predict [Internet]. London: Decisive Consulting Ltd.; [cited 2025 Apr 15]. Available from: https://www.picopredict.com/
Written by Michael Harding
Decisive Edge 16th April 2025