When working on data science projects, one fundamental pipeline to establish is the data collection pipeline. Real-world machine learning differs from Kaggle-style problems mainly because the data isn't static: you scrape websites, pull data from APIs, and so on. This way of collecting data can look chaotic, and it is! That's why we want to structure our code following best practices, to bring some order to the mess.
Once you have identified the sources you need to gather data from, you will want to collect it in a structured way so you can store it in your database. For instance, you might decide that, in order to train your LLM, you need data sources containing three fields: author, content, and link.
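One way to give those three fields structure is a small record class. A minimal sketch, assuming a dataclass named `Document` (the class and field values here are illustrative, not from the original):

```python
from dataclasses import dataclass


# Hypothetical schema for the three fields mentioned above.
@dataclass
class Document:
    author: str
    content: str
    link: str


doc = Document(
    author="Jane Doe",
    content="An article about LLMs.",
    link="https://example.com/post",
)
```

Validating and normalizing data into a shape like this before it touches the database keeps the storage layer simple.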
You would then download the data and write SQL queries to store and retrieve it from your database. More generally, you need to implement the queries that perform the CRUD operations. CRUD stands for create, read, update, and delete: the four basic functions of persistent storage.
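The four CRUD operations can be sketched with SQLite from the standard library. This is a minimal illustration, not a production setup; the table name `documents` and the field values are assumptions:

```python
import sqlite3

# In-memory database for demonstration; a real project would use a file or server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, author TEXT, content TEXT, link TEXT)"
)

# Create: insert a new record.
conn.execute(
    "INSERT INTO documents (author, content, link) VALUES (?, ?, ?)",
    ("Jane Doe", "An article about LLMs.", "https://example.com/post"),
)

# Read: fetch the record back.
row = conn.execute("SELECT author, content FROM documents WHERE id = 1").fetchone()

# Update: modify an existing record.
conn.execute("UPDATE documents SET content = ? WHERE id = 1", ("Revised content.",))

# Delete: remove the record.
conn.execute("DELETE FROM documents WHERE id = 1")
conn.commit()
```

In practice you would wrap each operation in its own function (or use an ORM), but the underlying queries are exactly these four.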