Introduction
Recently I’ve been working on domain-specific fine-tuning of several LLMs. The first, and perhaps the most important, part of this task is to gather, scrape, and clean the textual data that feeds the LLM. I noticed that my code was becoming messy and repetitive: for each new source I was writing a script from scratch, even though it had a lot in common with the other scripts in my codebase. I was not following the “Don’t repeat yourself” (DRY) principle at all. For this reason I decided to implement the Template Design Pattern and make my codebase more elegant and efficient.
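To give a rough idea of where this is heading, here is a minimal sketch of the pattern applied to the gather-and-clean problem described above. It is only an illustration under assumptions: the names `SourcePipeline`, `InMemorySource`, `LowercaseSource`, `fetch`, `clean`, and `run` are hypothetical and not taken from my actual codebase, and the in-memory source stands in for a real scraper.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class SourcePipeline(ABC):
    """Hypothetical base class: fixes the order of steps shared by every source."""

    def run(self) -> list[str]:
        # The template method: the skeleton of the workflow is written once, here.
        raw_docs = self.fetch()
        cleaned = [self.clean(doc) for doc in raw_docs]
        # Shared post-processing: drop documents that are empty after cleaning.
        return [doc for doc in cleaned if doc]

    @abstractmethod
    def fetch(self) -> list[str]:
        """Source-specific step: return the raw text documents."""

    def clean(self, text: str) -> str:
        """Default cleaning shared by all sources: collapse runs of whitespace."""
        return " ".join(text.split())


class InMemorySource(SourcePipeline):
    """Toy subclass standing in for a real scraper (assumption, for illustration)."""

    def __init__(self, docs: list[str]) -> None:
        self.docs = docs

    def fetch(self) -> list[str]:
        return list(self.docs)


class LowercaseSource(InMemorySource):
    """Overrides one step only; the workflow in run() stays untouched."""

    def clean(self, text: str) -> str:
        return super().clean(text).lower()
```

The point of the design is that `run()` lives in one place, so adding a new source means writing only the steps that differ, instead of another script from scratch.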
The Template Design Pattern
I won’t repeat here what a design pattern is or how design patterns are classified by functionality, since I’ve written several articles on the topic. For those interested in reading my previous articles, I’ll leave some references at the top.
In this article, I’ll show you an example related to data processing. Let’s say that in our project we have to deal with different types of data that we want to analyze. Some of these data are…