Llama-Factory + Ollama 打造属于自己的中文版 Llama3

2024年5月5日 7382点热度 3人点赞 5条评论

Meta 推出 Llama3 也有一小段时间了。Llama3 包含 8B 和 70B 两种参数规模，涵盖预训练和指令调优的变体。Llama 3 支持多种商业和研究用途，并已在多个行业标准测试中展示了其卓越的性能（关于Llama3的具体介绍可以参考本站另外一篇博文：尝鲜 Meta Llama 3）。和以往的原始 Llama 模型一样，Llama 3 对中文的支持效果欠佳，经常会出现你用中文提问，它用英文或中文+英文回复的现象。

例如当我问它：”你是谁？“，会得到如下的回答：

>>> 你是谁？
I am LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. I'm not a human, but a computer program designed to simulate conversation and answer questions to the best of my ability based on my training data. I can generate text on a wide range of topics, from science and history to entertainment and culture.

然后让它”介绍一下中国的首都北京“，则会得到如下的回答：

>>> 介绍一下中国的首都北京
Beijing!  The capital city of China is a treasure trove of history, culture, and beauty. Here's an introduction:

**Location**: Beijing is situated in the northern part of North China Plain, approximately 100 km (62 miles) inland from the Bo Hai Sea.

**History**: With a history dating back over 3,000 years, Beijing has been the capital of several dynasties, including the Liao, Jin, Yuan, Ming, and Qing. The city's most famous 
architectural legacy is the Forbidden City, the imperial palace of the Ming and Qing dynasties.

**Landmarks**: Some of Beijing's most iconic landmarks include:

1. **The Great Wall of China**: One of the Seven Wonders of the Medieval World, this ancient series of fortifications stretches over 13,000 km (8,100 miles) across northern China.
2. **The Forbidden City**: The imperial palace of the Ming and Qing dynasties, comprising over 980 buildings and more than 8,700 rooms.
3. **Tiananmen Square**: A public square that has played a significant role in modern Chinese history, including the famous Tiananmen Square protests of 1989.
4. **The Temple of Heaven**: A magnificent Taoist temple complex built in 1420, with intricate carvings and ornate architecture.

**Cuisine**: Beijing is renowned for its rich culinary heritage, featuring popular dishes like:

1. **Peking Roast Duck**: A classic dish that has been served since the Ming dynasty.
2. **Jiaozi (Dumplings)**: Boiled or pan-fried dumplings filled with meat and vegetables.
3. **Zhajiangmian (Noodles with Soy Sauce Sauce)**: A popular Beijing noodle dish.

**Culture**: Beijing is a city of vibrant cultural heritage, with numerous museums, theaters, and performance venues showcasing traditional Chinese arts like Peking Opera, acrobatics,
and martial arts.

**Economy**: As the capital city, Beijing is a major hub for politics, finance, education, and culture. The city is home to many international companies, research institutions, and 
universities.

**Climate**: Beijing has a continental climate with cold winters and hot summers. The best time to visit is during the spring (March to May) or autumn (September to November), when 
the weather is mild and comfortable.

This is just a brief introduction to this incredible city. If you have any specific questions or would like more information, feel free to ask!

回答得还是很简洁而全面的。缺点就是全是英文答案，对于英文不太好的人有点吃力。对 Llama3 进行微调来支持中文是一项必须的工作。

Llama3 中文模型的现状

实际上，当 Llama3 推出以后，很多开发者在中文支持方面做了大量的工作，一大批微调的中文模型如雨后春笋般纷纷涌现。目前比较知名的中文模型有：

Unichat-llama3-Chinese（地址）：中国联通AI创新中心发布业界第一个llama3中文指令微调模型，全参数微调(非lora)，以Meta Llama 3为基础,增加中文数据进行训练,实现llama3模型高质量中文问答，模型上下文保持原生长度8K，支持长度64K版本将于后续发布。
OpenBuddy – Open Multilingual Chatbot（地址）：OpenBuddy是一个面向全球用户的强大的开放式多语言聊天机器人模型，强调对话式AI以及对英语、中文和其他语言的无缝多语言支持。
Llama3-Chinese（地址）：在500k高质量中文多轮SFT数据、100k英语多轮SFT数据和2k单轮自我认知数据上训练的大型模型，采用基于Meta的DORA和LORA+的训练方法。以Llama-3-8B为基础。
Llama3-8B-Chinese-Chat（地址）：是一个针对中文和英文用户的指令调整语言模型，具有角色扮演和工具使用等各种能力，建立在 Meta-Llama-3-8B-Instruct 模型的基础上。
Llama-3-8B-Instruct-Chinese-chat（地址）：使用 firefly-train-1.1M，moss-003-sft-data，school_math_0.25M，ruozhiba 等数据集微调的模型，基于Llama-3-8B-Instruct。
Bunny-Llama-3-8B-V（地址）：Bunny 是一系列轻量但功能强大的多模式模型。它提供多种即插即用视觉编码器，如 EVA-CLIP、SigLIP 和语言主干，包括 Llama-3-8B、Phi-1.5、StableLM-2、Qwen1.5、MiniCPM 和 Phi-2。为了弥补模型大小的减小，通过从更广泛的数据源中进行精选来构建信息更丰富的训练数据。
llava-llama-3-8b-v1_1（地址）：llava-llama-3-8b-v1_1 是一个 LLaVA 模型，由XTuner使用ShareGPT4V-PT和InternVL-SFT从 meta-llama/Meta-Llama-3-8B-Instruct和CLIP-ViT-Large-patch14-336进行微调。
。。。

还有很多新的模型就不一一列举了，想要尝试这些模型可以直接部署试用。本文则探讨如何使用 Llama-Factory 对 Llama3 进行中文微调的具体过程，并通过 Ollama 本地部署中文微调的 Llama3 模型，打造属于自己的个性化的 Llama3 LLM 。

使用 Llama-Factory 对 Llama3 进行中文微调

LLaMA-Factory 是一个开源项目，它提供了一套全面的工具和脚本，用于微调、服务和基准测试 LLaMA 模型。LLaMA（大型语言模型自适应）是 Meta AI 开发的一组基础语言模型，在各种自然语言任务上表现出色。

LLaMA-Factory 存储库提供以下内容，让您轻松开始使用 LLaMA 模型：

数据预处理和标记化的脚本
用于微调 LLaMA 模型的训练流程
使用经过训练的模型生成文本的推理脚本
评估模型性能的基准测试工具
用于交互式测试的 Gradio Web UI

关于 Llama-Factory 的具体介绍可以参考本站的另外一篇博文：LLaMA-Factory 简介。

安装 Llama-Factory

首先从 github 拉取 Llama-Factory：

git clone https://github.com/hiyouga/LLaMA-Factory.git

为了方便今后的调试和部署，我选择了使用 docker 的方式来运行 Llama-Factory。它提供了一个参考的 Dockerfile：

FROM nvcr.io/nvidia/pytorch:24.01-py3

WORKDIR /app

COPY requirements.txt /app/
RUN pip install -r requirements.txt

COPY . /app/
RUN pip install -e .[deepspeed,metrics,bitsandbytes,qwen]

VOLUME [ "/root/.cache/huggingface/", "/app/data", "/app/output" ]
EXPOSE 7860

CMD [ "llamafactory-cli webui" ]

可以根据自己的实际情况进行修改。我家里的网络自己搭建了 proxy server（为了访问 github, huggingface 等站点），所以更改 Dockerfile 如下。该 Dockerfile 从 docker buildx 命令行获取 http_proxy 和 https_proxy 变量，并设置 docker buildx 环境里的相应环境变量，这样编译 docker 镜像的过程中就能使用代理服务器了。

FROM nvcr.io/nvidia/pytorch:24.01-py3


# 使用构建参数设置环境变量
ARG http_proxy
ARG https_proxy

# 设置环境变量
ENV HTTP_PROXY=$http_proxy
ENV HTTPS_PROXY=$https_proxy
ENV http_proxy=$http_proxy
ENV https_proxy=$https_proxy

WORKDIR /app

COPY requirements.txt /app/
RUN pip install -r requirements.txt

COPY . /app/
RUN pip install -e .[deepspeed,metrics,bitsandbytes,qwen]

# unset环境变量。Container运行过程中需要代理服务器的话通过-e 传入参数
ENV HTTP_PROXY=""
ENV HTTPS_PROXY=""
ENV http_proxy=""
ENV https_proxy=""

VOLUME [ "/root/.cache/huggingface/", "/app/data", "/app/output" ]
EXPOSE 7860

CMD [ "python", "src/train_web.py" ]

运行如下脚本生成自己的 Llama-Factory docker 镜像：

docker buildx build --build-arg http_proxy=http://proxy_ip:port --build-arg https_proxy=http://proxy_ip:port -t llama-factory:v0.00 .

Docker 镜像编译成功后，运行 docker image list 就可以看到编译出来的 docker 镜像了：

docker image list

接下来我们需要运行 Llama-Factory container。Llama-Factory 提供了一个参考的 docker-compose.yml 文件来运行 docker，我们可以按照自己的实际情况进行修改。我这边修改的版本如下，修改的部分参考注释。

version: '3.8'

services:
  llama-factory:
    #build:
    #  dockerfile: Dockerfile
    #  context: .
    image: llama-factory:v0.00 # 修改为编译出来的 docker image 名称/版本
    container_name: llama_factory # container 名称
    volumes:
      - ./volumes/huggingface:/root/.cache/huggingface/
      - ./volumes/data:/app/data
      - ./volumes/output:/app/output
      - /mnt/dev/myprojects/llm-webui-docker/models:/app/models # 映射自己的models目录
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - GRADIO_SERVER_PORT=7864 # webui跑在7864端口上，7860被comfyui占用了
    ports:
      - "7864:7864" # webui跑在7864端口上，7860被comfyui占用了
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            count: "all"
            capabilities: [gpu]
    restart: unless-stopped

运行如下命令启动 Llama-Factory 容器：

docker-compose up --detach

现在可以看到 llama_factory container 已经在正常运行了：

访问 http://server_ip:7864 则可以看到如下的 Llama-Factory WebUI 界面：

使用 Llama-Factory 为 Llama3 训练中文 LoRA

1. 模型名称与路径

进入 Llama-Factory WebUI 后，先选择 Model name 为 LLaMA-8B，这时候 Model path 会自动变成 meta-llama/Meta-Llama-3-8B 并自动从 huggingface 拉取 Meta-Llama-3-8B 模型。我这里因为已经下载了 Meta-Llama-3-8B 并映射到了 Llama-Factory container 的 /app/models/llama3-hf/Meta-Llama-3-8B 目录，所以我这里的 Model path 也设置为该路径。

2. 设置 Advanced configurations

点右边的箭头展开 Advance configurations。量化位数（Quantization bit）可以选择 4 ，减小模型的体积并提高速度。Prompt template 则选择 llama3。如果有安装 flashattn2 或者 unsloth 的话可以在 boost 里选择，我这里没有安装所以选择 none。

3. 设置训练参数

在 Train 标签页里设置训练相关的参数。主要的参数有：

Stage：设置为 Supervised Fine-Tuning
Data dir：我这里设置为 data ，因为 Llama-Factory 项目自身也带了一些中文数据集，我打算直接使用。如果你自己下载了别的中文数据集，请设置相应的数据集所在的目录地址
Dataset：我选择了 Llama-Factory 项目里自带的 alpaca_zh，alpaca_gpt4_zh 和 oaast_sft_zh 数据集
Learning rate：安装自己的需要设置。我采用了缺省的 5e-5
Epochs：按照自己的需要设置。我采用了缺省的 3.0
Cutoff length：按照自己的需要设置。数字越大，对 GPU 和显存的要求越高；数字越小，则可能对长句的语义理解不够充分。我这里选择缺省的 1024
Batch size：按照自己的需要甚至。数字越大，对 GPU 和显存的要求越高。我这里选择缺省的 2
Output dir / Config path：按照自己的需要设置

设置完成后，可以点击 Preview dataset 来查看一下数据集内容。

点击 Preview command 可以查看训练过程的命令行参数。如果不希望使用 WebUI 进行训练，则可以直接执行命令行，这样也有助于进一步编程和自动化：

点击 Save arguments 则将目前的训练设置保存到指定的 json 文件。点击 Load arguments 则可以加载以前保存好的训练设置。

4. 开始训练

参数设置好后，就可以点击 Start 开始训练。

Llama-Factory 训练脚本开始解析数据。

整个训练的过程预计 8 个小时不到一点。

训练完成后，在 /app/output/llama3_cn_train_2024-04-27-16-32-46 目录下可以看到如下的文件目录结构：

这些就是 Llama-Factory 训练出来的 LoRA。可以在自动生成的 Readme.md 文件查看 LoRA 的信息：

license: other
library_name: peft
tags:
- llama-factory
- lora
- generated_from_trainer
base_model: /app/models/llama3-hf/Meta-Llama-3-8B
model-index:
- name: llama3_cn_train_2024-04-27-16-32-46

。。。后面的内容省略

在 Ollama 中打造自己的中文版 Llama3

接下来，我们要在 Ollama 中运行 llama3 和我们训练出来的 LoRA，打造属于自己的中文版 Llama3。

Ollama 是一个开源的大模型管理工具，它提供了丰富的功能，包括模型的训练、部署、监控等。通过Ollama，你可以轻松地管理本地的大模型，提高模型的训练速度和部署效率。此外，Ollama还支持多种机器学习框架，如TensorFlow、PyTorch等，使得你可以根据自己的需求选择合适的框架进行模型的训练。

运行 Ollama docker

从 github 拉取 ollama 代码：

git clone https://github.com/ollama/ollama.git

Ollama github项目提供了参考的 Dockerfile，可以编译自己的 ollama 镜像并运行。也可以直接运行 ollama 官方发布的 docker 镜像：

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

转换 LoRA 格式

按照 Ollama modelfile ADAPTER 的说明，Ollama 支持 ggml 格式的 LoRA，所以我们需要把刚才生成的 LoRA 转换成为 ggml 格式。如下是 Ollama 项目中对于 LoRA 格式支持的描述：

ADAPTERThe ADAPTER instruction is an optional instruction that specifies any LoRA adapter that should apply to the base model. The value of this instruction should be an absolute path or a path relative to the Modelfile and the file must be in a GGML file format. The adapter should be tuned from the base model otherwise the behaviour is undefined.

为了将 Llama-Factory 生成的 LoRA 转换成 GGML 格式，我们需要使用到 Llama.cpp 的某些脚本。有关 Llama.cpp 开源项目的介绍请参考本站另外一篇博文：Llama.cpp 上手实战指南。在 llama.cpp 项目中，有如下几个用于转换格式的 python 脚本。我们将使用 conver-lora-to-ggml.py 脚本来转换格式。

运行如下的命令（其中 /app/output/llama3_cn_train_2024-04-27-16-32-46 是 Llama-Factory 生成 LoRA 的路径）：

./conver-lora-to-ggml.py /app/output/llama3_cn_train_2024-04-27-16-32-46 llama

运行完这个命令后，将在 /app/output/llama3_cn_train_2024-04-27-16-32-46 下生成 ggml-adapter-model.bin 文件。这个文件就是 Ollama 需要的 ggml 格式的 LoRA 文件。

在 ollama 中创建自己的 llama3 中文模型

我们使用 ollama 的 modelfile 来创建自己的 llama3 中文模型。我自己使用的参考 llama3.modelfile 内容如下：

# set the base model
FROM llama3:8b

# set custom parameter values
PARAMETER temperature 1
PARAMETER num_keep 24
PARAMETER stop <|start_header_id|>
PARAMETER stop <|end_header_id|>
PARAMETER stop <|eot_id|>
PARAMETER stop <|reserved_special_token

# set the model template
TEMPLATE """
{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>
"""

# set the system message
SYSTEM You are llama3 from Meta, customized and hosted @ HY's Blog (https://blog.yanghong.dev).

# set Chinese lora support
ADAPTER /root/.ollama/models/lora/ggml-adapter-model.bin

首先，通过命令行进入 ollama container：

docker exec -it ollama /bin/bash

然后使用我们刚才生成的 llama3.modelfile 来创建自己的中文版 Llama3，命名为 llama3:hy。

ollama create llama3:hy -f llama3.modelfile

创建成功后，运行如下的命令，则可以看到刚才创建的 llama3:hy 模型。

ollama list

执行如下的命令来运行 llama3:hy 模型，接下来我们就可以愉快地跟它进行中文对话了。

ollama run llama3:hy

我们同样问它本文开头的那两个问题：”你是谁？“，”介绍一下中国的首都北京“，看看它的表现如何。

你是谁？

>>> 你是谁？
我是一個 AI 型Chatbot，来自 Meta，目前被Customized 和 Hosted 在 HY 的博客（https://blog.yanghong.dev）。简单地说，我是一個能與人互動、回答問題和生成文本的 AI 机器人。

介绍一下中国的首都北京

>>> 介绍一下中国的首都北京
 Beijing（北京）是中华人民共和国的首都，位于北方平原的一座古老的城市。它是中国政治、经济、文化和交通的中心之一。

历史：北京作为中国的首都，有着悠久的历史。它曾经是多个朝代的政治中心，如 Qin_dynasty, Han_dynyasty和 Ming_dynasty等。在1644年，明朝被清朝推翻后，北京成为了清朝的首都，并一直保持到1912年。

地理：北京位于北方平原，是中国最大的城市之一。它位于华北平原的西北部，河北省和天津市之间。

景点： Beijing拥有许多著名的历史遗址和文化attractions，如：

* Forbidden City（紫禁城）：为明朝和清朝的宫廷，包含许多历史建筑和艺术收藏。
* Great Wall of China（长城）：中国最著名的古迹之一，为防御军队入侵的石墙。
* Temple of Heaven（天坛）：为 Ming朝皇帝祭祀和向上帝祈愿的道场。
* Summer Palace（颐和园）：为 Qing朝皇家夏宫，包含许多亭子、池塘和假山。

生活：北京的生活质量较高，为中国最富有经济实力的城市之一。它拥有世界级的医疗机构、教育机构和娱乐设施。

文化：北京拥有悠久的文化历史，包括中国古典音乐、舞蹈和戏剧等。每年夏天，北京都会举办多种文化活动，如北京国际电影节和北京音乐节。

总之，北京是一个历史悠久、政治经济中心、文化繁荣和生活质量较高的城市，是中国的一座象征性城市。

可以看到，它基本上能以中文回答问题，我们的 LoRA fine tuning 过程成功完成了。

为中文 Llama3 添加 WebUI

Ollama 提供了 REST API 来跟 LLM 模型进行交互，比如最常用的 Generate，Chat 等方法。

Generate a response

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt":"Why is the sky blue?"
}'

Chat with a model

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    { "role": "user", "content": "why is the sky blue?" }
  ]
}'

完整的 REST API 文档可参阅 github 。

因此，我们可以开发一个简单的 gradio 程序，通过 REST API 调用 llama3:hy 模型来进行中文交互。参考代码片段如下：

        response = requests.post('http://192.168.3.204:11434/api/generate',
            json={
                'model': model2endpoint[model_name],
                'prompt': prompt,
                #'context': context,
                'options': {
                    'top_k': top_k,
                    'temperature': top_p,
                    'top_p': temperature
                }
            },
            stream=True
        )

        yield "", history, user_message, ""
        output = ""

        # Check if the request was successful
        response.raise_for_status()

        # Initialize the output and history variables
        output = ""

        # Iterate over the streamed response lines
        for idx, line in enumerate(response.iter_lines()):
            if line:
                # Parse the line as JSON
                data = json.loads(line)
                token = data.get('response', '')  # Assuming 'response' contains the text
                # Check if the token is a special token
                if data.get('special', False):
                    continue

                # Append the token to the output
                output += token

                # Update history and chat based on the index
                if idx == 0:
                    history.append(output.strip())  # Append initial output
                else:
                    history[-1] = output.strip()  # Update the last history entry

                # Convert history to chat format
                chat = [
                    (history[i], history[i + 1]) if i + 1 < len(history) else (history[i], "")
                    for i in range(0, len(history), 2)
                ]

            # Yield the current chat, history, and user message updates
            yield chat, history, user_message, ""

这样就可以通过上面的 WebUI 界面让它进行中文对话，问问题，帮我写代码了。同时也整合了其它的一些 coding LLM 在一起，碰到不会写的代码，就让它们在一起比比武。

以下视频是 fine tune 的中文 llama3 实际使用演示。

浏览量: 7,253

文章评论

tuqi

您好我去拉最新的llama.cpp的代码发现conver-lora-to-ggml.py这个脚本好像已经更新了，怎么办呢？

2024年5月28日

hy

@tuqi 我写这篇博文的时候拉取的都还可以。建议你直接下载旧一点的版本试试看：
https://github.com/ggerganov/llama.cpp/releases

2024年5月28日

登录以回复
- tuqi
  
  @hy 您好我在 Llama-Factory中训练和测试了是可以的，但是用conver-lora-to-ggml.py这个脚本转为ggml格式的文件很小然后用ollama生成modelfile似乎和llama3没有改变，不知道是什么原因
  
  2024年5月29日
  
  登录以回复
  - hy
    
    @tuqi 你用了哪些数据集训练的呀？生成的lora模型大小多大呢？如果数据集比较有限的话，最后的效果也会有限。
    另外LLM的机制就是生成结果是不确定的，不同的上下文也会生成不同的结果，多试几次看看呢？
    
    2024年5月29日
    
    登录以回复
    - tuqi
      
      @hy 我用了一个简单的中文数据集，主要是关于类似“你是谁”这些的，然后我训练完了在Factory中测试了一下，问了些你是谁，你叫什么之类的，都正确回答了，但是我用那个脚本转成的bin文件才几十M，用ollama运行发现还是原来llama3的那些回复，不知道是为什么
      
      2024年5月29日

您需要登录之后才可以评论