Building Fact-Checking Systems: Catching Repeated False Claims Before They Spread


Why We Need Automated Fact-Checking

Unlike traditional media, where articles are edited and verified before publication, social media changed the approach completely. Suddenly, everyone could raise their voice. Posts are shared instantly, giving access to ideas and perspectives from all around the world. That was the dream, at least.

What began as an idea for protecting freedom of speech, giving individuals the chance to express opinions without censorship, has come with a trade-off. Very little information gets checked, and that makes it harder than ever to tell what is accurate and what is not.

An additional challenge is that false claims rarely appear only once. They are often reshared on different platforms, altered in wording, format, length, and even language, which makes detection and verification even harder. As these variations circulate across platforms, they start to seem familiar and therefore believable to readers.

The original idea of a space for open, uncensored, and reliable information has run into a paradox: the very openness meant to empower people also makes it easy for misinformation to spread. That is exactly where fact-checking systems come in.

The Development of Fact-Checking Pipelines

Traditionally, fact-checking was a manual process that relied on experts (journalists, researchers, or fact-checking organizations) to verify claims by checking them against sources such as official documents or expert opinions. This approach was reliable and thorough, but also very time-consuming. The delay gave false narratives more time to circulate, shape public opinion, and enable further manipulation.

This is where automation comes in. Researchers have developed fact-checking pipelines that behave like human fact-checking experts, but can scale to massive amounts of online content. The fact-checking pipeline follows a structured process, which normally includes the following five steps:

  1. Claim Detection – find statements with factual implications.
  2. Claim Prioritization – rank them by speed of spread, potential harm, or public interest, prioritizing the most impactful cases.
  3. Retrieval of Evidence – gather supporting material and provide the context to evaluate it.
  4. Veracity Prediction – determine whether the claim is true, false, or something in between.
  5. Generation of Explanation – produce a justification that readers can understand.

In addition to these five steps, many pipelines add a sixth: retrieval of previously fact-checked claims (PFCR). Instead of redoing the work from scratch, the system checks whether a claim, even a reformulated one, has already been verified. If so, it is linked to the existing fact-check and its verdict. If not, the pipeline proceeds with evidence retrieval.

This shortcut saves effort, speeds up verification, and is especially valuable in multilingual settings, because it allows fact-checks in one language to support verification in another.
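To make the flow concrete, here is a minimal sketch of how the PFCR shortcut might sit inside such a pipeline. It is illustrative only: the function names and return values are placeholders I made up for this post, not part of any particular framework.

```python
from typing import Optional

def retrieve_fact_check(claim: str) -> Optional[dict]:
    """PFCR step: look the claim up in a collection of already verified claims.
    Returns the matching fact-check (with its verdict) or None. Stubbed here."""
    return None  # placeholder: a real system would query a retrieval index

def retrieve_evidence(claim: str) -> list:
    """Gather supporting material for a claim that has not been checked before. Stubbed here."""
    return []

def predict_veracity(claim: str, evidence: list) -> str:
    """Classify the claim as true / false / something in between. Stubbed here."""
    return "unverified"

def fact_check(claim: str) -> dict:
    # Shortcut: reuse an existing verdict if the claim was already fact-checked
    previous = retrieve_fact_check(claim)
    if previous is not None:
        return {"claim": claim, "verdict": previous["verdict"], "source": previous["url"]}

    # Otherwise continue with evidence retrieval and veracity prediction
    evidence = retrieve_evidence(claim)
    return {"claim": claim, "verdict": predict_veracity(claim, evidence), "evidence": evidence}

print(fact_check("A claim extracted from a social media post"))
```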

This component is known by many names, such as claim matching or verified claim retrieval. Whatever the name, the idea is the same: reuse knowledge that already exists to fight misinformation faster and more effectively.

Figure 1: Fact-checking pipeline (created by the author)

Designing the PFCR Component (Retrieval Pipeline)

At its core, previously fact-checked claim retrieval (PFCR) is an information retrieval task: given a claim from a social media post, we want to find the most relevant match in a large collection of already fact-checked (verified) claims. If a match exists, we can immediately link it to the source and the verdict, so there is no need to start verification from scratch!

Most modern information retrieval systems use a retriever–reranker architecture. The retriever acts as a first-stage filter, returning a larger set of candidate documents (the top k) from the corpus. The reranker then takes those candidates and refines the ranking using a deeper, more computationally intensive model. This two-stage design balances speed (retriever) and accuracy (reranker).

Models used for retrieval can be grouped into two categories: 

  • Lexical models: fast, interpretable, and effective when there is strong word overlap. But they struggle when ideas are phrased differently (synonyms, paraphrases, translations).
  • Semantic models: capture meaning rather than surface words, making them ideal for PFCR. They would recognize, for instance, that two completely differently worded claims describe the same fact, even when they share almost no vocabulary.

Once candidates are retrieved, the reranking stage applies more powerful models (often cross-encoders) to carefully re-score the top results, ensuring that the most relevant fact-checks rank higher. Because rerankers are more expensive to run, they are only applied to a smaller pool of candidates (e.g., the top 100).
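As an illustration, a cross-encoder reranker can be run with the sentence-transformers library. The model name below is a small, publicly available cross-encoder chosen purely for the example; it is not necessarily the reranker used in my experiments, and the candidate texts are placeholders.

```python
from sentence_transformers import CrossEncoder

# Illustrative cross-encoder; swap in any reranker that fits your languages and GPU budget.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "do 5g towers spread viruses"
candidates = [  # top candidates returned by the retriever (placeholders)
    "5G masts do not transmit viruses.",
    "The vaccine does not alter human DNA.",
    "Voting machines did not manipulate the election results.",
]

# The cross-encoder scores each (query, candidate) pair jointly, which is slower
# than comparing precomputed embeddings but captures finer-grained similarity.
scores = reranker.predict([(query, c) for c in candidates])
for claim, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {claim}")
```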

Together, the retriever–reranker pipeline provides both coverage (by recognizing a wider range of possible matches) and precision (by ranking the most similar ones higher). For PFCR, this balance is crucial: it enables a fast and scalable way to detect repeated claims, while keeping accuracy high enough that users can trust the information they read.

Building the Ensemble

The retriever–reranker pipeline already delivers solid performance. But as I evaluated the models and ran the experiments, one thing became clear: no single model is good enough on its own.

Lexical models, like BM25, are great at exact keyword matches, but as soon as a claim is phrased differently, they fail. That is where semantic models step in. They have no problem handling paraphrases, translations, or cross-lingual scenarios, but they sometimes struggle with straightforward matches where wording matters most. Not all semantic models are the same either; each has its own niche: some work better in English, others in multilingual settings, and others at capturing subtle contextual nuances. In other words, just as misinformation mutates and reappears in countless variations, semantic retrieval models bring different strengths depending on how they were trained. If misinformation is adaptable, then the retrieval system must be as well.

That is where the idea of an ensemble came in. Instead of betting on a single “best” model, I combined the predictions of multiple models so that they could collaborate and complement one another. Instead of relying on a single model, why not let them work as a team?

Before going further into the ensemble design, I will briefly explain the decision-making process behind the choice of retrievers.

Establishing a Baseline (Lexical Models)

BM25 is one of the most effective and widely used retrieval models, and it often serves as a baseline in modern IR research. Before evaluating the embedding-based (semantic) models, I was curious to see how well (or how badly) BM25 would perform. As it turns out, not badly at all! 

BM25 is a ranking function built upon TF-IDF. It improves on TF-IDF by introducing term-frequency saturation and document-length normalization: repeated occurrences of a term yield diminishing returns, and long documents are prevented from being unfairly favoured. Two parameters control this behaviour: k1 sets how quickly term frequency saturates, and b sets how strongly document length is normalized.
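As a rough illustration of what the lexical baseline looks like in practice, here is a minimal BM25 sketch using the rank_bm25 Python package; the toy corpus of fact-checks and the whitespace tokenization are simplifications for the example.

```python
from rank_bm25 import BM25Okapi

# Toy corpus of already fact-checked claims (placeholders).
fact_checks = [
    "the vaccine does not alter human dna",
    "the election results were not manipulated by voting machines",
    "5g towers do not spread viruses",
]
tokenized_corpus = [doc.split() for doc in fact_checks]  # naive whitespace tokenization

# k1 and b can be passed here to tune term-frequency saturation and length normalization.
bm25 = BM25Okapi(tokenized_corpus)

query = "do 5g towers spread viruses"
scores = bm25.get_scores(query.split())

# Rank fact-checks by BM25 score (higher = stronger lexical match).
for doc, score in sorted(zip(fact_checks, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.2f}  {doc}")
```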

Semantic Models

As a starting point for the semantic (embedding-based) models, I referred to Hugging Face's Massive Text Embedding Benchmark (MTEB) and evaluated the leading models while keeping GPU resource constraints in mind.

The two models that stood out were E5 (intfloat/multilingual-e5-large-instruct) and BGE (BAAI/bge-m3). Both achieved strong results when retrieving the top 100 candidates, so I selected them for further tuning and integration with BM25.
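A minimal sketch of dense retrieval with one of these models, using sentence-transformers, might look like the following. The query prefix follows the instruction format suggested for E5-instruct models (check the model card for the exact convention; BGE-M3 does not need such a prefix), and the fact-check corpus is again a placeholder.

```python
from sentence_transformers import SentenceTransformer, util

# One of the two selected embedding models; BAAI/bge-m3 could be loaded the same way.
model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

task = "Given a claim from a social media post, retrieve fact-checked claims that match it"
query = f"Instruct: {task}\nQuery: do 5g towers spread viruses"

fact_checks = [  # placeholder corpus of verified claims
    "5G masts do not transmit viruses.",
    "The vaccine does not alter human DNA.",
    "Voting machines did not manipulate the election results.",
]

# Embed the query and the corpus, then rank fact-checks by cosine similarity.
query_emb = model.encode(query, normalize_embeddings=True)
corpus_emb = model.encode(fact_checks, normalize_embeddings=True)
similarities = util.cos_sim(query_emb, corpus_emb)[0]

for idx in similarities.argsort(descending=True).tolist():
    print(f"{similarities[idx].item():.3f}  {fact_checks[idx]}")
```

In a full system, the corpus embeddings would of course be precomputed and stored in an index, so that only the incoming claim needs to be embedded at query time.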

Ensemble Design

With the retrievers in place, the question was: how do we combine them? I tested different aggregation strategies, including majority voting, exponential decay weighting, and reciprocal rank fusion (RRF). RRF worked best because it does not just average scores; it rewards documents that consistently appear high across different rankings, regardless of which model produced them. This way, the ensemble favoured claims that multiple models “agreed on,” while still allowing each model to contribute independently.
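Reciprocal rank fusion is simple enough to sketch in a few lines. The helper below is a generic implementation (the constant k=60 follows the common default from the original RRF paper); the optional weights argument anticipates the per-model weighting discussed later and is my own addition for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, weights=None):
    """Fuse several ranked candidate lists into one.

    rankings: dict mapping a model name to its ranked list of document ids.
    k: smoothing constant (60 is the commonly used default).
    weights: optional per-model weights, e.g. to down-weight a lexical retriever.
    """
    weights = weights or {name: 1.0 for name in rankings}
    scores = defaultdict(float)
    for name, ranked_docs in rankings.items():
        for rank, doc_id in enumerate(ranked_docs, start=1):
            scores[doc_id] += weights[name] * 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: three retrievers partially agree on which fact-checks are relevant.
fused = reciprocal_rank_fusion({
    "bm25": ["fc_2", "fc_7", "fc_1"],
    "e5":   ["fc_1", "fc_2", "fc_9"],
    "bge":  ["fc_1", "fc_7", "fc_2"],
})
print(fused)  # fc_1 and fc_2 rise to the top because most models agree on them
```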

I also experimented with the number of candidates retrieved in the first stage (commonly known as the hyperparameter k). The idea is simple: if you only pull in a very small set of candidates, you risk missing relevant fact-checks altogether. On the other hand, if you select too many, the reranker has to wade through a lot of noise, which adds computational cost without actually improving accuracy.

During the experiments, I found that as k increased, performance improved at first, because the ensemble had more chances to find the right fact-checks. But after a certain point, adding more candidates stopped helping: the reranker could already see enough relevant fact-checks to make good decisions, and the extra ones were mostly irrelevant. In practice, this meant finding a “sweet spot” where the candidate pool was large enough to ensure coverage, but not so large that it hurt the reranker’s effectiveness.

As a final step, I adjusted the weights of each model. Reducing BM25's influence while giving more weight to the semantic retrievers boosted performance. In other words, BM25 is useful, but the heavy lifting is done by E5 and BGE.

To quickly recap the PFCR component: the pipeline consists of retrieval and reranking, where retrieval can use lexical or semantic models, while reranking uses a semantic model. Moreover, we saw that combining multiple models in an ensemble improves retrieval and reranking performance. Okay, so where do we integrate the ensemble?

Where Does the Ensemble Fit?

The ensemble was not limited to just one part of the pipeline. I applied it within both the retrieval and reranking stages.

  • Retriever stage → I merged the candidate lists produced by BM25, E5, and BGE. This way, the system did not depend on a single model's “view” of what might be relevant but instead pooled their perspectives into a stronger starting set.
  • Reranker stage → I then combined the rankings from multiple rerankers (again selected with MTEB and my GPU constraints in mind). Since each reranker captures slightly different nuances of similarity, mixing them helped refine the final ordering of fact-checks with greater accuracy.

At the retriever stage, the ensemble enabled a wider pool of candidates, ensuring that fewer relevant claims slipped through the cracks (coverage), while the reranker stage narrowed the focus, pushing the most relevant fact-checks to the top (precision).

Figure 2: Retriever-reranker ensemble pipeline (created by the author)

Bringing It All Together (TL;DR)

Long story short: the envisioned digital utopia of open information sharing does not work without verification, and can even create the opposite, a channel for misinformation.

That was the driving force behind the development of automated fact-checking pipelines, which help us move closer to that original promise. They make it easier to verify information quickly and at scale, so when false claims pop up in new forms, they can be spotted and addressed immediately, helping maintain accuracy and trust in the digital world.

The takeaway is simple: diversity is key. Just as misinformation spreads by taking on many forms, a resilient fact-checking system benefits from multiple perspectives working together. With an ensemble, the pipeline becomes more robust, more adaptable, and ultimately better at enabling a trustworthy digital space.

For the curious minds

If you are interested in a deeper technical dive into the retrieval and ensemble strategies behind this pipeline, you can take a look at my full paper here. It covers the model selections, experiments, and detailed evaluation metrics across the system.


References

Scott A. Hale, Adriano Belisario, Ahmed Mostafa, and Chico Camargo. 2024. Analyzing Misinformation Claims During the 2022 Brazilian General Election on WhatsApp, Twitter, and Kwai. ArXiv:2401.02395.

 Rrubaa Panchendrarajan and Arkaitz Zubiaga. 2024. Claim detection for automated fact-checking: A survey on monolingual, multilingual and cross-lingual research. Natural Language Processing Journal, 7:100066.

Matúš Pikuliak, Ivan Srba, Robert Moro, Timo Hromadka, Timotej Smolen, Martin Melišek, Ivan Vykopal, Jakub Simko, Juraj Podroužek, and Maria Bielikova. 2023. Multilingual Previously Fact-Checked Claim Retrieval. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 16477–16500, Singapore. Association for Computational Linguistics.

 Preslav Nakov, David Corney, Maram Hasanain, Firoj Alam, Tamer Elsayed, Alberto Barrón-Cedeño, Paolo Papotti, Shaden Shaar, and Giovanni Da San Martino. 2021. Automated Fact-Checking for Assisting Human Fact-Checkers. ArXiv:2103.07769.

Oana Balalau, Pablo Bertaud-Velten, Younes El Fraihi, Garima Gaur, Oana Goga, Samuel Guimaraes, Ioana Manolescu, and Brahim Saadi. 2024. FactCheckBureau: Build Your Own Fact-Check Analysis Pipeline. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, CIKM '24, pages 5185–5189, New York, NY, USA. Association for Computing Machinery.

 Alberto Barrón-Cedeño, Tamer Elsayed, Preslav Nakov, Giovanni Da San Martino, Maram Hasanain, Reem Suwaileh, Fatima Haouari, Nikolay Babulkov, Bayan Hamdan, Alex Nikolov, Shaden Shaar, and Zien Sheikh Ali. 2020. Overview of CheckThat! 2020: Automatic Identification and Verification of Claims in Social Media. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, pages 215–236, Cham. Springer International Publishing.

Ashkan Kazemi, Kiran Garimella, Devin Gaffney, and Scott Hale. 2021. Claim Matching Beyond English to Scale Global Fact-Checking. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4504–4517, Online. Association for Computational Linguistics.

Shaden Shaar, Nikolay Babulkov, Giovanni Da San Martino, and Preslav Nakov. 2020. That is a Known Lie: Detecting Previously Fact-Checked Claims. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3607–3618, Online. Association for Computational Linguistics.

Gordon V. Cormack, Charles L. A. Clarke, and Stefan Buettcher. 2009. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, pages 758–759, New York, NY, USA. Association for Computing Machinery.

Iva Pezo, Allan Hanbury, and Moritz Staudinger. 2025. ipezoTU at SemEval-2025 Task 7: Hybrid Ensemble Retrieval for Multilingual Fact-Checking. In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pages 1159–1167, Vienna, Austria. Association for Computational Linguistics.
