Train Your Own Private ChatGPT Model for the Cost of a Starbucks Coffee
Intro
Preparing a Host with a 3090 Graphics Card
Start DolphinScheduler
Open Source Large Model Training and Deployment
Summary

For the cost of a cup of Starbucks coffee and two hours of your time, you can own your own trained open-source large model. The model can be fine-tuned on different training data to strengthen various skills, such as medical advice, programming, stock trading, and relationship advice, making your large model more “understanding” of you. Let’s try training an open-source large model powered by the open-source DolphinScheduler!

The democratization of ChatGPT

The birth of ChatGPT has undoubtedly filled us with anticipation for the future of AI. Its sophisticated expression and powerful language understanding have amazed the world. However, because ChatGPT is provided as Software as a Service (SaaS), personal privacy leaks and corporate data security are concerns for every user and company. More and more open-source large models are emerging, making it possible for individuals and companies to have their own models. However, getting started with, optimizing, and using open-source large models still has high barriers to entry, making it difficult for everyone to use them easily. To address this, we use Apache DolphinScheduler, which provides one-click support for training, tuning, and deploying open-source large models. This enables everyone to train their own large model on their own data at a very low cost and with minimal technical expertise.

Who’s it for? — Anyone in front of a screen

Our goal is not just professional AI engineers, but anyone interested in GPT who wants to enjoy the pleasure of having a model that “understands” them better. We believe that everyone has the right and the ability to shape their own AI assistant. The intuitive workflow of Apache DolphinScheduler makes this possible. As a bonus, Apache DolphinScheduler is a big data and AI scheduling tool with over 10,000 stars on GitHub. It is a top-level project under the Apache Software Foundation, which means you can use it for free and modify the code without worrying about any commercial issues.

Whether you are an industry expert looking to train a model with your own data, or an AI enthusiast wanting to understand and explore the training of deep learning models, our workflow provides convenient services for both. It takes care of the complex pre-processing, model training, and optimization steps, and requires only 1–2 hours of simple operations, plus 20 hours of running time, to build a ChatGPT-style large model that “understands” you better.

So let’s start this magical journey! Let’s bring the future of AI to everyone.

Only three steps to create a ChatGPT that “understands” you better

  1. Rent a low-cost GPU server at roughly the level of a 3090
  2. Start DolphinScheduler
  3. Click on the training workflow and deployment workflow on the DolphinScheduler page and directly experience your ChatGPT

First, you need a 3090 graphics card. If you have one in a desktop computer, you can use it directly. If not, there are many hosts with GPUs for rent online. Here we use AutoDL as an example. Open https://www.autodl.com/home, register, and log in. After that, you can select the corresponding server in the computing power market according to steps 1, 2, and 3 shown on the screen.


Mirror

Click on the community mirror, and then enter WhaleOps/dolphinscheduler-llm/dolphinscheduler-llm-0521 in the red box below. You can select the image as shown below. Currently, only the V1 version is available; in the future, as new versions are released, you can select the latest one.

If you want to train the model multiple times, it is recommended to expand the hard disk capacity to around 100GB.

After creating it, wait for the progress bar shown in the following image to complete.

In order to deploy and debug your own open-source large model from the interface, you need to start the DolphinScheduler software, and we need to do the following configuration work:

To access the server

There are two methods available. You can choose the one that suits your preference:

Click on the JupyterLab button shown below.

The page will redirect to JupyterLab; from there, you can click “Terminal” to enter.

We can obtain the SSH connection command from the button shown in the following image.

Then, establish the connection through the terminal.
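For reference, the command AutoDL provides typically has the following shape (the host and port below are placeholders; copy the real values from your own console):

ssh -p 12345 root@region-1.autodl.com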

Import the metadata of DolphinScheduler

In DolphinScheduler, all metadata is stored in the database, including workflow definitions, environment configurations, tenant information, etc. To make these workflows visible to users as soon as DolphinScheduler is launched, we can directly import the predefined workflow metadata.

Modify the script for importing data into MySQL:

Using the terminal, navigate to the following directory:

cd apache-dolphinscheduler-3.1.5-bin

Execute the command vim import_ds_metadata.sh to open the import_ds_metadata.sh file.
The content of the file is as follows:

# Set variables
# Hostname
HOST="xxx.xxx.xxx.x"
# Username
USERNAME="root"
# Password
PASSWORD="xxxx"
# Port
PORT=3306
# Database to import into
DATABASE="ds315_llm_test"
# SQL filename
SQL_FILE="ds315_llm.sql"

mysql -h $HOST -P $PORT -u $USERNAME -p$PASSWORD -e "CREATE DATABASE $DATABASE;"
mysql -h $HOST -P $PORT -u $USERNAME -p$PASSWORD $DATABASE < $SQL_FILE

Replace xxx.xxx.xxx.x and xxxx with the configuration values of a MySQL database reachable on the public network (you can apply for one on Alibaba Cloud or Tencent Cloud, or install one yourself). Then execute:

bash import_ds_metadata.sh

After execution, if you are interested, you can check the corresponding metadata in the database (connect to MySQL and have a look; skip this step if you are not familiar with SQL).
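If you want a quick sanity check, a query like the following sketch (using the same connection values as the script; t_ds_process_definition is the table where DolphinScheduler keeps workflow definitions) should list the imported workflows:

# List the imported workflow definitions (same credentials as import_ds_metadata.sh)
mysql -h $HOST -P $PORT -u $USERNAME -p$PASSWORD $DATABASE -e "SELECT name FROM t_ds_process_definition;"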

Start DolphinScheduler

In the server command line, open the following file and modify the configuration to connect DolphinScheduler to the previously imported database:

/root/apache-dolphinscheduler-3.1.5-bin/bin/env/dolphinscheduler_env.sh

Modify the relevant configuration in the database section and leave the other sections unchanged. Change the values of ‘HOST’ and ‘PASSWORD’ to the configuration values of the imported database, i.e., xxx.xxx.xxx.x and xxxx:

export DATABASE=mysql
export SPRING_PROFILES_ACTIVE=${DATABASE}
export SPRING_DATASOURCE_URL="jdbc:mysql://HOST:3306/ds315_llm_test?useUnicode=true&characterEncoding=UTF-8&useSSL=false"
export SPRING_DATASOURCE_USERNAME="root"
export SPRING_DATASOURCE_PASSWORD="xxxxxx"
......

After configuring, execute the following (also in the directory /root/apache-dolphinscheduler-3.1.5-bin):

bash ./bin/dolphinscheduler-daemon.sh start standalone-server

Once executed, we can check the logs using tail -200f standalone-server/logs/dolphinscheduler-standalone.log. At this point, DolphinScheduler is officially launched!

After starting the service, we can click on “Custom Services” in the AutoDL console (highlighted in red) to be redirected to a URL:

Upon opening the URL, if it shows a 404 error, don’t worry. Just append the suffix /dolphinscheduler/ui to the URL:

The AutoDL module opens port 6006. After configuring DolphinScheduler’s port to 6006, you can access it through the provided entry point. However, due to the URL redirection, you may encounter a 404 error. In such cases, you need to complete the URL manually.
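For example, if the “Custom Services” button opens a hypothetical address such as https://u123456-xxxx.example-region.com:6006/, the manually completed URL would be https://u123456-xxxx.example-region.com:6006/dolphinscheduler/ui.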

Login credentials:
Username: admin
Password: dolphinscheduler123

After logging in, click on “Project Management” to see the predefined project named “vicuna”. Click on “vicuna” to enter the project.

Workflow Definition

Upon entering the vicuna project, you will see three workflows: Training, Deploy, and Kill_Service. Let’s look at what they are for and how to configure a large model and train it on your data.

You can click the run button below to execute the corresponding workflow.

Training

When you click on the training workflow, you will see two task definitions. One fine-tunes the model through LoRA (mainly using alpaca-lora, https://github.com/tloen/alpaca-lora), and the other merges the trained LoRA weights with the base model to produce the final model.

The workflow takes the following parameters (a dialog pops up after clicking run); a sketch of the underlying fine-tuning command follows the list:

  • base_model: The base model, which can be chosen and downloaded according to your needs. Open-source large models are for learning and experimental purposes only. The current default is TheBloke/vicuna-7B-1.1-HF.
  • data_path: The path of your personalized training data and domain-specific data; defaults to /root/demo-data/llama_data.json.
  • lora_path: The path to save the trained LoRA weights, /root/autodl-tmp/vicuna-7b-lora-weight.
  • output_path: The save path of the final model after merging the base model and the LoRA weights; note it down, as it will be needed for deployment.
  • num_epochs: Training parameter, the number of training epochs. It can be set to 1 for testing and is normally set to 3~10.
  • cutoff_len: Maximum text length; defaults to 1024.
  • micro_batch_size: Batch size.
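To make these parameters concrete, here is a rough sketch of the kind of alpaca-lora fine-tuning command they map to (flag names are from the alpaca-lora README; the exact command the workflow runs internally may differ):

# Fine-tune the base model with LoRA; flags mirror the workflow parameters above
python finetune.py \
    --base_model 'TheBloke/vicuna-7B-1.1-HF' \
    --data_path '/root/demo-data/llama_data.json' \
    --output_dir '/root/autodl-tmp/vicuna-7b-lora-weight' \
    --num_epochs 3 \
    --cutoff_len 1024 \
    --micro_batch_size 4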

Deploy

The workflow for deploying large models (mainly using FastChat, https://github.com/lm-sys/FastChat). It first invokes kill_service to stop any deployed model, then sequentially starts the controller, adds the model, and opens the Gradio web service (a rough sketch of these commands follows the parameter list below).

The startup parameters are as follows:

  • model: The model path. It can be a Hugging Face model ID or the path of the model we trained, i.e., the output_path of the training workflow above. The default is TheBloke/vicuna-7B-1.1-HF, which directly deploys the vicuna-7b model.
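Under the hood, this deployment corresponds roughly to the standard FastChat serving commands shown below (taken from the FastChat README; the workflow orchestrates them for you, so this sketch is only for orientation):

# Start the controller that coordinates model workers
python -m fastchat.serve.controller &
# Start a worker that loads the model (a Hugging Face ID or your output_path)
python -m fastchat.serve.model_worker --model-path TheBloke/vicuna-7B-1.1-HF &
# Start the Gradio web server that provides the chat page
python -m fastchat.serve.gradio_web_server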

Kill_service

This workflow is used to stop the deployed model and release GPU memory. It has no parameters, and you can run it directly. If you need to stop the deployed service (for example, when you want to retrain the model or when there is insufficient GPU memory), you can execute the kill_service workflow directly.

After walking through a few examples, your deployment will be complete. Now let’s take a look at the practical operation:

Large Model Operation Example

  1. Training a Large Model

Start the training workflow directly and select the default parameters.

Right-click on the corresponding task to view the logs, as shown below:

You can also view the task status and logs in the task instance panel at the bottom left of the sidebar. During the training process, you can monitor the progress by checking the logs, including the current training step, loss metrics, remaining time, etc. There is a progress bar indicating the current step, where step = (data size * epoch) / batch size.
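For example, with a hypothetical dataset of 10,000 samples, 3 epochs, and a batch size of 128, the progress bar would run to (10,000 × 3) / 128 ≈ 234 steps.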

After training is complete, the logs will look like the following:

Updating Your Personalized Training Data

Our default data is in /root/demo-data/llama_data.json. The current data source is Huatuo, a medical model fine-tuned on Chinese medical data. Yes, our example is training a family doctor:

If you have data in a specific field, you can point to your own data. The data format is as follows: one JSON object per line, with the following field meanings:

  • instruction: The instruction given to the model.
  • input: The input to accompany the instruction.
  • output: The expected model output.

For example:

{"instruction": "calculation", "input": "1+1 equals?", "output": "2"}

Note that you can merge the instruction and input fields into a single instruction field. The input field can also be left empty.

When training, modify the data_path parameter to point to your own data.
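As a quick illustration, you could write a small domain-specific file in this format and point the workflow at it (the path and records below are purely illustrative):

# Write two sample records, one JSON object per line (illustrative content)
cat > /root/demo-data/my_data.json <<'EOF'
{"instruction": "Answer as a family doctor.", "input": "What helps with a mild cold?", "output": "Rest, drink plenty of fluids, and monitor your temperature."}
{"instruction": "calculation", "input": "2+3 equals?", "output": "5"}
EOF
# Then set data_path=/root/demo-data/my_data.json when running the Training workflow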

During the first training execution, the base model will be fetched from the specified location, such as TheBloke/vicuna-7B-1.1-HF. There will be a download process, so please wait for it to complete. The choice of model is up to you, and you can also download other open-source large models (please follow the relevant licenses when using them).

Due to network issues, the base model download may fail halfway through the first training execution. In such cases, you can click on the failed task and choose to rerun it to continue the training. The operation is shown below:

To stop the training, you can click the stop button, which will release the GPU memory used for training.

Deployment Workflow

On the workflow definition page, click on the deploy workflow to run it and deploy the model.

If you haven’t trained your own model, you can execute the deploy workflow with the default parameter TheBloke/vicuna-7B-1.1-HF to deploy the vicuna-7b model, as shown in the image below:

If you trained a model in the previous step, you can now deploy it. After deployment, you can experience your own large model. The startup parameters are as follows; you need to fill in the output_path of the model from the previous step:

Next, let’s open the deployed workflow instance. Click on “Workflow Instance”, and then click on the workflow instance with the “deploy” prefix.

Right-click and select “refresh_gradio_web_service” to view the task logs and find the location of our large model link. The operation is shown below:

In the logs, you will find a link that can be accessed publicly, such as:

There are two links here. The link 0.0.0.0:7860 cannot be accessed, because AutoDL only opens port 6006, which is already used by DolphinScheduler. Instead, directly access the link below it, such as https://81c9f6ce11eb3c37a4.gradio.live.

Note that this link may change each time you deploy, so you need to find it again in the logs.

Once you open the link, you will see the conversation page of your very own ChatGPT!

Yes! Now you have your own ChatGPT, and its data serves only you! And you spent less than the cost of a cup of coffee~~

Go ahead and experience your own private ChatGPT!

Summary

In this data-driven and technology-oriented world, having a dedicated ChatGPT model has immeasurable value. With the advancement of artificial intelligence and deep learning, we are in an era where personalized AI assistants can be shaped. Training and deploying your own ChatGPT model can help us better understand AI and how it is transforming our world.

In summary, training and deploying your own ChatGPT model helps you protect data security and privacy, meet specific business requirements, save on technology costs, and automate the training process using workflow tools like DolphinScheduler. It also allows you to comply with local laws and regulations. Therefore, training and deploying your own model is a worthwhile option to consider.

  • When using ChatGPT through a public API service, you may have concerns about data security and privacy. This is a valid concern, as your data is transmitted over the network. By training and deploying the model yourself, you can ensure that your data is stored and processed only on your own device or rented server, guaranteeing data security and privacy.
  • For organizations or individuals with specific business requirements, training your own ChatGPT model ensures that the model has the latest and most relevant knowledge for your business. Regardless of your business domain, a model specifically trained for your business needs will be more valuable than a generic one.
  • Using OpenAI’s ChatGPT model incurs certain costs. Likewise, training and deploying the model yourself also requires resources and incurs technology costs. For example, you can experiment with debugging large models for as little as 40 yuan, but if you plan to run one long-term, it is recommended to buy an Nvidia RTX 3090 graphics card or rent cloud servers. Therefore, you should weigh the pros and cons and choose the solution that best fits your circumstances.
  • By using Apache DolphinScheduler’s workflows, you can automate the entire training process, greatly reducing the technical barrier. Even if you don’t have extensive knowledge of algorithms, you can successfully train your own model with the help of such tools. In addition to supporting large model training, DolphinScheduler also supports big data scheduling and machine learning scheduling, helping you and your non-technical staff easily handle big data processing, data preparation, model training, and model deployment. Moreover, it is open source and free to use.
  • DolphinScheduler is only a visual AI workflow tool and does not provide any open-source large models. When using and downloading open-source large models, you must pay attention to the usage constraints of each model and comply with the respective open-source licenses. The examples given in this article are only for personal learning and experimentation. When using large models, it is important to ensure compliance with open-source model licensing. Moreover, different countries have different regulations regarding data storage and processing; you must customize and adjust the model to comply with the specific legal regulations and policies of your location. This may include filtering model outputs to comply with local privacy and sensitive-information handling regulations.

There are many ways to participate in and contribute to the DolphinScheduler community, including:


We suggest that your first PR (documentation or code) be simple, to familiarize yourself with the submission process and the community’s collaboration style.

To that end, the community has compiled a list of issues suitable for newcomers: https://github.com/apache/dolphinscheduler/contribute

  • Help-wanted issues: https://github.com/apache/dolphinscheduler/issues?q=is%3Aopen+is%3Aissue+label%3A%22help+wanted%22+
  • Contribution guide: https://github.com/apache/dolphinscheduler/blob/8944fdc62295883b0fa46b137ba8aee4fde9711a/docs/docs/en/contribute/join/contribute.md
  • GitHub repository: https://github.com/apache/dolphinscheduler
  • Official website: https://dolphinscheduler.apache.org/
  • Developer mailing list: dev@dolphinscheduler.apache.org
  • Twitter: @DolphinSchedule
  • YouTube: https://www.youtube.com/@apachedolphinscheduler
  • Slack: https://s.apache.org/dolphinscheduler-slack
  • Community page: https://dolphinscheduler.apache.org/en-us/community/index.html

Your star for the project matters; don’t hesitate to give Apache DolphinScheduler a star ❤️
