Train Your Own Private ChatGPT Model for the Cost of a Starbucks Coffee

Contents: Intro · Preparing a Host with a 3090 Graphics Card · Start DolphinScheduler · Open Source Large Model Training and Deployment · Summary


For the cost of a cup of Starbucks coffee and two hours of your time, you can own your own trained open-source large language model. The model can be fine-tuned on different training data to strengthen various skills, such as medical advice, programming, stock trading, or relationship advice, making your large model more “understanding” of you. Let’s try training an open-source large model powered by the open-source DolphinScheduler!

The democratization of ChatGPT

The birth of ChatGPT has undoubtedly filled us with anticipation for the future of AI. Its sophisticated expression and powerful language understanding have amazed the world. However, because ChatGPT is provided as Software as a Service (SaaS), personal privacy leaks and corporate data security are concerns for every user and company. More and more open-source large models are emerging, making it possible for individuals and companies to have models of their own. However, getting started with, optimizing, and deploying open-source large models still has high barriers to entry, making it difficult for everyone to use them easily. To address this, we use Apache DolphinScheduler, which provides one-click support for training, tuning, and deploying open-source large models. This allows everyone to train their own model on their own data at a very low cost and with minimal technical expertise.

Who’s it for? — Anyone in front of a screen

Our goal is not just professional AI engineers, but anyone interested in GPT who wants to enjoy having a model that “understands” them better. We believe that everyone has the right and the ability to shape their own AI assistant. The intuitive workflow of Apache DolphinScheduler makes this possible. As a bonus, Apache DolphinScheduler is a big data and AI scheduling tool with over 10,000 stars on GitHub. It is a top-level project under the Apache Software Foundation, meaning you can use it for free and modify its code without worrying about commercial licensing issues.

Whether you are an industry expert looking to train a model with your own data, or an AI enthusiast wanting to understand and explore the training of deep learning models, our workflow provides convenient services for you. It handles the complex pre-processing, model training, and optimization steps, and requires only 1–2 hours of simple operations, plus 20 hours of running time, to build a more “understanding” ChatGPT-style large model.

So let’s start this magical journey! Let’s bring the future of AI to everyone.

Only three steps to create a ChatGPT that “understands” you better

  1. Rent a low-cost GPU server with a card at roughly the 3090 level
  2. Start DolphinScheduler
  3. Click the training workflow and deployment workflow on the DolphinScheduler page and experience your ChatGPT directly

First, you need a 3090 graphics card. If you have one in a desktop computer, you can use it directly. If not, there are many GPU hosts available for rent online. Here we use AutoDL as an example. Open https://www.autodl.com/home, register and log in. After that, you can select a server in the computing power market according to steps 1, 2, and 3 shown on the screen.


Mirror

Click on the community image tab, then enter WhaleOps/dolphinscheduler-llm/dolphinscheduler-llm-0521 in the red box below to select the image, as shown. Currently, only the V1 version is available; as new versions are released, you can select the latest one.

If you need to train the model multiple times, it is recommended to expand the hard disk capacity to around 100GB.

After creating the instance, wait for the progress bar shown in the following image to complete.

In order to deploy and debug your own open-source large model through the interface, you need to start the DolphinScheduler software, for which the following configuration work is required:

To access the server

There are two methods available; choose whichever suits your preference:

Click on the JupyterLab button shown below.

The page will redirect to JupyterLab; from there, click “Terminal” to enter.

Alternatively, obtain the SSH connection command from the button shown in the following image.

Then, establish the connection through the terminal.
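The SSH command from the console bundles the port, user, and host together. If you want to reuse those pieces in other tools (scp, rsync, and so on), a minimal sketch of pulling them out follows; the hostname and port here are hypothetical placeholders, not real AutoDL values:

```python
import re

# A hypothetical AutoDL-style SSH command; the real one comes from the console.
ssh_command = "ssh -p 12345 root@region-1.example-autodl.com"

# Extract the port, user, and host for reuse in other commands.
match = re.match(r"ssh -p (?P<port>\d+) (?P<user>\w+)@(?P<host>[\w.-]+)", ssh_command)
port, user, host = match["port"], match["user"], match["host"]

print(port, user, host)  # → 12345 root region-1.example-autodl.com
```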

Import the metadata of DolphinScheduler

In DolphinScheduler, all metadata is stored in the database, including workflow definitions, environment configurations, tenant information, and so on. To make these workflows visible as soon as DolphinScheduler launches, we can directly import pre-defined workflow metadata.

Modify the script for importing data into MySQL:

Using the terminal, navigate to the following directory:

cd apache-dolphinscheduler-3.1.5-bin

Execute the command: vim import_ds_metadata.sh to open the import_ds_metadata.sh file.
The content of the file is as follows:

# Set variables
# Hostname
HOST="xxx.xxx.xxx.x"
# Username
USERNAME="root"
# Password
PASSWORD="xxxx"
# Port
PORT=3306
# Database to import into
DATABASE="ds315_llm_test"
# SQL filename
SQL_FILE="ds315_llm.sql"
mysql -h $HOST -P $PORT -u $USERNAME -p$PASSWORD -e "CREATE DATABASE $DATABASE;"
mysql -h $HOST -P $PORT -u $USERNAME -p$PASSWORD $DATABASE < $SQL_FILE

Replace xxx.xxx.xxx.x and xxxx with the configuration values of a MySQL database accessible on the public network (you can get one on Alibaba Cloud or Tencent Cloud, or install one yourself). Then execute:

bash import_ds_metadata.sh

After execution, if you are interested, you can check the corresponding metadata in the database (connect to MySQL and browse; skip this step if you are not familiar with the tooling).

Start DolphinScheduler

In the server command line, open the following file and modify the configuration to connect DolphinScheduler to the previously imported database:

/root/apache-dolphinscheduler-3.1.5-bin/bin/env/dolphinscheduler_env.sh

Modify the relevant configuration in the database section and leave the other sections unchanged. Change the values of ‘HOST’ and ‘PASSWORD’ to the configuration values of the imported database, i.e., xxx.xxx.xxx.x and xxxx:

export DATABASE=mysql
export SPRING_PROFILES_ACTIVE=${DATABASE}
export SPRING_DATASOURCE_URL="jdbc:mysql://HOST:3306/ds315_llm_test?useUnicode=true&characterEncoding=UTF-8&useSSL=false"
export SPRING_DATASOURCE_USERNAME="root"
export SPRING_DATASOURCE_PASSWORD="xxxxxx"
......
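If you would rather script this edit than use vim, a small helper can substitute the host and password into the file content. This is an optional convenience sketch, not part of the workflow; it simply rewrites the two values shown above:

```python
import re

def patch_env(text: str, host: str, password: str) -> str:
    """Fill the MySQL host into SPRING_DATASOURCE_URL and set the password."""
    text = re.sub(r"jdbc:mysql://[^:/]+:3306", f"jdbc:mysql://{host}:3306", text)
    text = re.sub(r'SPRING_DATASOURCE_PASSWORD=".*"',
                  f'SPRING_DATASOURCE_PASSWORD="{password}"', text)
    return text

# Demonstrate on the two relevant lines from dolphinscheduler_env.sh.
sample = (
    'export SPRING_DATASOURCE_URL="jdbc:mysql://HOST:3306/ds315_llm_test?useSSL=false"\n'
    'export SPRING_DATASOURCE_PASSWORD="xxxxxx"\n'
)
patched = patch_env(sample, "203.0.113.7", "secret")
print(patched)
```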

After configuring, execute (also in the directory /root/apache-dolphinscheduler-3.1.5-bin):

bash ./bin/dolphinscheduler-daemon.sh start standalone-server

Once executed, we can check the logs using tail -200f standalone-server/logs/dolphinscheduler-standalone.log. At this point, DolphinScheduler is officially launched!

After starting the service, we can click “Custom Services” in the AutoDL console (highlighted in red) to be redirected to a URL:

Upon opening the URL, if it shows a 404 error, don’t worry. Just append the suffix /dolphinscheduler/ui to the URL:

The AutoDL module opens port 6006. After configuring DolphinScheduler’s port to 6006, you can access it through the provided entry point. However, due to URL redirection, you may encounter a 404 error, in which case you need to complete the URL manually.
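Completing the URL is just string handling. A tiny sketch (the base URL below is a made-up example, not a real AutoDL address) that appends the suffix while tolerating a trailing slash:

```python
def dolphinscheduler_url(base: str) -> str:
    """Append the /dolphinscheduler/ui suffix, avoiding a doubled slash."""
    return base.rstrip("/") + "/dolphinscheduler/ui"

# Hypothetical AutoDL custom-service URL
print(dolphinscheduler_url("https://region-1.example.com:6006/"))
# → https://region-1.example.com:6006/dolphinscheduler/ui
```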

Login credentials:
Username: admin
Password: dolphinscheduler123

After logging in, click on “Project Management” to see the predefined project named “vicuna”. Click on “vicuna” to enter the project.

Workflow Definition

Upon entering the Vicuna project, you will see three workflows: Training, Deploy, and Kill_Service. Let’s explore their uses and how to configure large models and train on your own data.

You can click the run button below to execute the corresponding workflows.

Training

Clicking on the training workflow reveals two definitions. One fine-tunes the model through LoRA (mainly using alpaca-lora, https://github.com/tloen/alpaca-lora), and the other merges the trained LoRA weights with the base model to get the final model.

The workflow has the following parameters (a dialog pops up after clicking run):

  • base_model: The base model, which can be chosen and downloaded according to your needs. Open-source large models are for learning and experimentation purposes only. The current default is TheBloke/vicuna-7B-1.1-HF.
  • data_path: The path to your personalized training data and domain-specific data; defaults to /root/demo-data/llama_data.json.
  • lora_path: The path to save the trained LoRA weights, /root/autodl-tmp/vicuna-7b-lora-weight.
  • output_path: The save path of the final model after merging the base model and LoRA weights; note it down, as it will be needed for deployment.
  • num_epochs: Training parameter, the number of training epochs. It can be set to 1 for testing; it is normally set to 3~10.
  • cutoff_len: Maximum text length; defaults to 1024.
  • micro_batch_size: Batch size.
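Under the hood, the training task passes these parameters to alpaca-lora’s finetune.py. The exact invocation is defined inside the workflow; the sketch below only illustrates what an equivalent command line could look like. Flag names follow alpaca-lora (lora_path maps to its --output_dir), the values follow the defaults listed above where given, and micro_batch_size=4 is an illustrative value:

```python
# Illustrative parameter set; not the exact command the workflow runs.
params = {
    "base_model": "TheBloke/vicuna-7B-1.1-HF",
    "data_path": "/root/demo-data/llama_data.json",
    "output_dir": "/root/autodl-tmp/vicuna-7b-lora-weight",  # the lora_path above
    "num_epochs": 1,
    "cutoff_len": 1024,
    "micro_batch_size": 4,  # illustrative, no default given above
}

# Assemble a finetune.py command line from the parameters.
cmd = ["python", "finetune.py"] + [f"--{k}={v}" for k, v in params.items()]
print(" ".join(cmd))
```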

Deploy

The workflow for deploying large models (mainly using FastChat, https://github.com/lm-sys/FastChat). It first invokes kill_service to stop any deployed model, then sequentially starts the controller, adds the model, and opens the Gradio web service.

The start parameters are as follows:

  • model: Model path; it can be a Hugging Face model ID or the path of the model we trained, i.e., the output_path of the training workflow above. The default is TheBloke/vicuna-7B-1.1-HF, which deploys the vicuna-7b model directly.
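FastChat splits serving into a controller, a model worker, and a web UI, and the Deploy workflow starts them in that order. Manually, the sequence corresponds roughly to the following commands, sketched here as strings (the model path is the workflow’s model parameter; the workflow may pass additional flags):

```python
model_path = "TheBloke/vicuna-7B-1.1-HF"  # or your output_path from training

# The three FastChat services, in the order the Deploy workflow starts them.
commands = [
    "python3 -m fastchat.serve.controller",
    f"python3 -m fastchat.serve.model_worker --model-path {model_path}",
    "python3 -m fastchat.serve.gradio_web_server",
]

for c in commands:
    print(c)
```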

Kill_service

This workflow is used to kill the deployed model and release GPU memory. It has no parameters and can be run directly. If you need to stop the deployed service (for example, to retrain the model, or when there is insufficient GPU memory), execute the kill_service workflow to kill the deployed service.

After a few examples, your deployment will be complete. Now let’s take a look at the practical operation:

Large Model Operation Example

  1. Training a Large Model

Start the training workflow directly and select the default parameters.

Right-click on the corresponding task to view the logs, as shown below:

You can also view the task status and logs in the task instance panel at the bottom left of the sidebar. During the training process, you can monitor progress through the logs, including the current training step, loss metrics, remaining time, and so on. There is a progress bar indicating the current step, where step = (data size * epochs) / batch size.
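As a worked example of that formula (the data size and batch size here are hypothetical, just to show the arithmetic):

```python
# step = (data size * epochs) / batch size, as shown in the progress bar
data_size = 7000    # hypothetical number of training samples
num_epochs = 3
batch_size = 128    # hypothetical effective batch size

total_steps = (data_size * num_epochs) // batch_size
print(total_steps)  # → 164
```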

After training is complete, the logs will look like the following:

Updating Your Personalized Training Data

Our default data is in /root/demo-data/llama_data.json. The current data source is Huatuo, a medical model fine-tuned on Chinese medical data. Yes, our example is training a family doctor:

If you have data in a specific field, you can point to your own data. The data format is as follows: one JSON object per line, with the fields meaning:

  • instruction: The instruction to give to the model.
  • input: The input.
  • output: The expected model output.

For instance:

{"instruction": "calculation", "input": "1+1 equals?", "output": "2"}

Please note that you can merge the instruction and input fields into a single instruction field. The input field can also be left empty.
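Converting your own records into this format takes only a few lines of Python. A minimal sketch that also demonstrates merging instruction and input into a single instruction field, emitting one JSON object per line (the second record is a made-up example):

```python
import json

records = [
    {"instruction": "calculation", "input": "1+1 equals?", "output": "2"},
    {"instruction": "Explain what a fever is.", "input": "", "output": "A fever is ..."},
]

lines = []
for r in records:
    # Fold input into instruction and leave input empty, as the note above allows.
    merged = {
        "instruction": (r["instruction"] + " " + r["input"]).strip(),
        "input": "",
        "output": r["output"],
    }
    lines.append(json.dumps(merged, ensure_ascii=False))

print("\n".join(lines))
```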

When training, modify the data_path parameter to use your own data.

During the first training execution, the base model will be fetched from the specified location, such as TheBloke/vicuna-7B-1.1-HF. There will be a download process, so please wait for it to complete. The choice of model is up to the user, and you can also download other open-source large models (please follow the relevant licenses when using them).

Due to network issues, the base model download may fail halfway through the first training execution. In such cases, you can click on the failed task and choose to rerun it to continue training. The operation is shown below:

To stop training, click the stop button, which will release the GPU memory used for training.

Deployment Workflow

On the workflow definition page, click on the deploy workflow to run it and deploy the model.

If you haven’t trained your own model, you can execute the deploy workflow with the default parameter TheBloke/vicuna-7B-1.1-HF to deploy the vicuna-7b model, as shown in the image below:

If you have trained a model in the previous step, you can now deploy it. After deployment, you can experience your own large model. The startup parameters are as follows; you need to fill in the output_path of the model from the previous step:

Next, let’s enter the deployed workflow instance. Click on the workflow instance, then click on the instance with the “deploy” prefix.

Right-click and select “refresh_gradio_web_service” to view the task logs and find the URL of our large model. The operation is shown below:

In the logs, you will find a link that can be accessed publicly, such as:

There are two links here. The link 0.0.0.0:7860 cannot be accessed, because AutoDL only opens port 6006, which is already used by DolphinScheduler. Instead, directly access the link below it, such as https://81c9f6ce11eb3c37a4.gradio.live.

Please note that this link may change every time you deploy, so you need to find it again in the logs.
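Since the link changes on every deployment, you can also extract it from the log text programmatically. A sketch with a synthetic log excerpt (the URL is the example from above; real log wording may differ):

```python
import re

# Synthetic excerpt resembling the refresh_gradio_web_service task log;
# the actual log comes from the DolphinScheduler task instance.
log = """
Running on local URL:  http://0.0.0.0:7860
Running on public URL: https://81c9f6ce11eb3c37a4.gradio.live
"""

# Pick out the public gradio.live link, ignoring the unreachable local one.
match = re.search(r"https://\S+\.gradio\.live", log)
print(match.group(0))  # → https://81c9f6ce11eb3c37a4.gradio.live
```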

Once you open the link, you will see the conversation page of your own ChatGPT!

Yes! Now you have your own ChatGPT, and its data serves only you! And you spent less than the cost of a cup of coffee~~

Go ahead and experience your own private ChatGPT!

In this data-driven and technology-oriented world, having a dedicated ChatGPT model has immeasurable value. With the advancement of artificial intelligence and deep learning, we are in an era where personalized AI assistants can be shaped. Training and deploying your own ChatGPT model can help us better understand AI and how it is transforming our world.

In summary, training and deploying your own ChatGPT model can help you protect data security and privacy, meet specific business requirements, save on technology costs, and automate the training process using workflow tools like DolphinScheduler. It also allows you to comply with local laws and regulations. Therefore, training and deploying your own ChatGPT model is a worthwhile option to consider.

  • When using ChatGPT through public API services, you may have concerns about data security and privacy. This is a valid concern, as your data may be transmitted over the network. By training and deploying the model yourself, you can ensure that your data is stored and processed only on your own device or rented server, guaranteeing data security and privacy.
  • For organizations or individuals with specific business requirements, training your own ChatGPT model ensures that the model has the latest and most relevant knowledge for your business. Regardless of your domain, a model specifically trained for your business needs will be more valuable than a generic model.
  • Using OpenAI’s ChatGPT model may incur certain costs. Similarly, if you want to train and deploy the model yourself, you also need to invest resources and incur technology costs. For example, you can experiment with debugging large models for as little as 40 yuan, but if you plan to run one long-term, it is recommended to purchase an Nvidia RTX 3090 graphics card or rent cloud servers. Therefore, you need to weigh the pros and cons and choose the solution that best fits your circumstances.
  • By using Apache DolphinScheduler’s workflow, you can automate the entire training process, greatly reducing the technical barrier. Even if you don’t have extensive knowledge of algorithms, you can successfully train your own model with the help of such tools. In addition to large model training, DolphinScheduler also supports big data scheduling and machine learning scheduling, helping you and your non-technical staff easily handle big data processing, data preparation, model training, and model deployment. Moreover, it is open-source and free to use.
  • DolphinScheduler is only a visual AI workflow tool and does not provide any open-source large models. When using and downloading open-source large models, you must pay attention to the usage constraints of each model and comply with the respective open-source licenses. The examples in this article are for personal learning and experimentation purposes only. Moreover, different countries have different regulations regarding data storage and processing; when using large models, you must customize and adjust the model to comply with the specific legal regulations and policies of your location. This may include filtering model outputs to comply with local privacy and sensitive-information handling regulations.

There are many ways to participate in and contribute to the DolphinScheduler community, including:


We suggest that your first PR (documentation or code) be simple, and that you use it to familiarize yourself with the submission process and community collaboration style.

So the community has compiled the following resources:

  • Contribution guide: https://github.com/apache/dolphinscheduler/contribute
  • “Help wanted” issues: https://github.com/apache/dolphinscheduler/issues?q=is%3Aopen+is%3Aissue+label%3A%22help+wanted%22+
  • How to contribute: https://github.com/apache/dolphinscheduler/blob/8944fdc62295883b0fa46b137ba8aee4fde9711a/docs/docs/en/contribute/join/contribute.md
  • GitHub repository: https://github.com/apache/dolphinscheduler
  • Website: https://dolphinscheduler.apache.org/
  • Mailing list: dev@dolphinscheduler.apache.org
  • Twitter: @DolphinSchedule
  • YouTube: https://www.youtube.com/@apachedolphinscheduler
  • Slack: https://s.apache.org/dolphinscheduler-slack
  • Community page: https://dolphinscheduler.apache.org/en-us/community/index.html

Your Star matters to the project; don’t hesitate to give Apache DolphinScheduler a Star ❤️
