Prompt Engine

可以参考该项目，该项目提供关于提示词书写的规则。由openai以及吴恩达完成。
https://github.com/datawhalechina/prompt-engineering-for-developers
由于目前chatgpt 无法直接在国内访问，推荐在claude on slack上尝试。
关于claude api
https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/advanced-prompt-engineering?pivots=programming-language-chat-completions#specifying-the-output-structure

大模型的涌现能力

https://mp.weixin.qq.com/s/L_dOMtbfhv4SdA_nBaiIog

Finetine

LoRA
https://mp.weixin.qq.com/s/5uFIGw7TBHJA_Q26REG4OA

https://github.com/Facico/Chinese-Vicuna/tree/master
https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM
https://github.com/OptimalScale/LMFlow
https://github.com/LC1332/Luotuo-Chinese-LLM#quickstart
https://github.com/lm-sys/FastChat
https://github.com/facebookresearch/llama
https://github.com/tatsu-lab/stanford_alpaca

Deployment

https://github.com/THUDM/ChatGLM-6B

jittor

https://github.com/Jittor/JittorLLMs

Langchain

https://www.bilibili.com/video/BV1GL411e7K4/?spm_id_from=333.999.0.0&vd_source=390ba7830cbb1bbb68ee1b96cfd3f787
https://github.com/hwchase17/langchain

Llama-index

https://github.com/jerryjliu/llama_index

MNN

https://github.com/wangzhaode/ChatGLM-MNN
https://github.com/IST-DASLab/gptq
https://github.com/ggerganov/llama.cpp
https://github.com/ymcui/Chinese-LLaMA-Alpaca
https://github.com/MegEngine/InferLLM
https://github.com/qwopqwop200/GPTQ-for-LLaMa
https://github.com/tpoisonooo/llama.onnx

Q&A

instruction数据
从Chinese-Vicuna 提到的使用了数据集GuanacoDataset 可以大概看出来这种格式

本地使用InferLLM 项目去推理模型

总体比较简洁，可以直接查看官方readme
https://github.com/MegEngine/InferLLM/blob/main/README_Chinese.md

#指定模型文件，指定使用的线程数量
./alpaca -m ../models/chinese-alpaca-7b-q4.bin -t 8

在这里插入图片描述

资源消耗

使用llama.cpp部署

安装openblas

git clone https://github.com/xianyi/OpenBLAS
cd OpenBLAS/
mkdir build
cd build/
cmake ../
cmake --build ./ -j8
cmake --install ./

编译llama.cpp

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_OPENBLAS=ON
cmake --build . --config Release


#使用intel mkl加速
mkdir build
cd build
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . -config Release

https://github.com/ggerganov/llama.cpp

推理模型Chinese-LLaMA-Alpaca
https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/llama.cpp%E9%87%8F%E5%8C%96%E9%83%A8%E7%BD%B2
原始llama模型下载 https://zhuanlan.zhihu.com/p/614118339
- 合并模型 https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E6%89%8B%E5%8A%A8%E6%A8%A1%E5%9E%8B%E5%90%88%E5%B9%B6%E4%B8%8E%E8%BD%AC%E6%8D%A2

python ~/anaconda3/envs/chinese_llama_alpaca/lib/python3.10/site-packages/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ~/workspace/pyllama_data/  --model_size 7B --output_dir ~/workspace/llama_7B_hf

# 注意添加手动绝对路径
python scripts/merge_llama_with_chinese_lora.py --base_model /home/kjs-server/workspace/llama_7B_hf --lora_model /home/kjs-server/workspace/chinese_llama_plus_lora_7b,/home/kjs-server/workspace/chinese_alpaca_plus_lora_7b --output_type pth --output_dir ~/workspace/merge_dir

模型量化
https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/llama.cpp%E9%87%8F%E5%8C%96%E9%83%A8%E7%BD%B2
执行模型推理

./main -m ../../models/ggml-model-f16.bin \
       --color \
       -f ../../prompts/alpaca.txt \
       --ctx_size 2048 \
       -n -1 \
       -ins -b 256 \
       --top_k 10000 \
       --temp 0.2 \
       --repeat_penalty 1.1 \
       -t 96

MNN推理chatglm-6B

下载相关的项目，编译完成，参考项目readme
https://github.com/wangzhaode/ChatGLM-MNN/blob/master/README.md
模型下载

# 切换到指定目录
cd ChatGLM-MNN/resource/models
# 不使用代理，直接下载
./download_models.sh int4

执行demo程序

./cli_demo -d int4

目前运行存在问题， Segmentation fault -> 待解决（rk3588）
最终还是在一台x86PC上面完成模型的部署，分别在CPU和GPU上面的推理速度
CPU
- Intel® Xeon® Platinum 8160 CPU @ 2.10GHz
- 启动时间 ~30s，速度2 s/token per
GPU
- NVIDIA GeForce RTX 3090
- 启动时间，无明显的延迟，速度

JittorLLMs

该项目目前不是一个聊天机器人的应用
在CPU在同样存在速度比较慢，GPU上面速度比较快。

关于其中配置参数的理解

在这里插入图片描述
dimension
n heads
n layers
learning rate
batch size
n tokens

benchmark

https://github.com/SJTU-LIT/ceval/blob/main/README_zh.md
https://mp.weixin.qq.com/s/P0ohd5DpwJOkL8DFVC4qoA

APP

https://github.com/csunny/DB-GPT/tree/main
https://github.com/OptimalScale/DetGPT
https://github.com/oobabooga/text-generation-webui

transformer

https://blog.csdn.net/yujianmin1990/article/details/85221271

LLM推理中的参数理解

Top-p:模型只考虑总概率分布的前 p% 的 token,并从这些 token 中生成输出。这可以确保模型不会过于依赖高概率的 token。
Top-k:模型只考虑概率最高的前 k 个 token,并从这些 token 中生成输出。这也可以提高模型输出的多样性。

transformer 模型的解码主要有以下几种方式:1. Greedy search:每次只选取概率最高的token作为输出,然后继续解码。优点是效率高,缺点是无法生成较长且质量高的序列。常用于实时场景。2. Beam search:每次考虑前K个最可能的token,选取其中概率最高的一个作为输出,然后继续解码。相比greedy search,可以生成更长更高质量的序列,但计算成本也更高。在transformer中很常用。3. Sampling:根据token的概率分布随机采样一个token作为输出。可以增强输出的多样性,缺点是序列的连贯性较差。4. Top-K/Top-P sampling:在采样时只考虑概率最高的前K个或总概率分布的前P%的token,然后在这些token中采样。相比于常规sampling,可以提高序列的连贯性。5. Beam sampling:先使用beam search生成多个候选序列,然后从中随机采样一个序列作为最终输出。集beam search和sampling的优点,既可以生成较长高质量的序列,又可以增强输出的多样性。6. Nucleus sampling:类似于Top-P sampling,但会对高概率的token施加更高的采样概率,低概率的token采样概率更低。可以看作是Top-P采样和常规采样的结合。除了上述几种标准化解码方式,许多研究工作也会基于这些方式提出新的变体,以期取得更好的效果。综上,解码方式的选择主要根据实际任务的需求来确定,需要在生成序列的质量、多样性和计算成本之间进行权衡。对于高质量长序列的生成,Beam Search是比较常用的方式。而在需要实时响应或强调输出多样性的任务中,Sampling相关的方法会更合适

使用webui 运行llama.cpp

运行chinese-alpaca-7B 16bit cpu(x86)

#激活虚拟环境
conda activate textgen
cd ~/workspace/oobabooga_linux/text-generation-webui
python server.py

推理速度 -> 还是需要一个GPU来完成推理

LLM相关的一些调研

Prompt Engine

大模型的涌现能力

Finetine

Deployment

jittor

Langchain

Llama-index

MNN

Q&A

本地使用InferLLM 项目去推理模型

使用llama.cpp部署

MNN推理chatglm-6B

JittorLLMs

关于其中配置参数的理解

benchmark

APP

transformer

LLM推理中的参数理解

使用webui 运行llama.cpp

相关文章

[Spring Boot Starter系列]Spring Boot自动装配原理

shell脚本怎么获取当前脚本名称（获取脚本文件名）$(basename “$0“)（basename命令：去除字符串路径部分、去除后缀）

java + opencv对比图片不同

Ceph分布式存储系统搭建

SM2签名与验证过程

【DCT变换】Python矩阵运算实现DCT变换

go语言系列基础教程总结（2）

基于SpringBoot的在线拍卖系统【附ppt和万字文档(Lun文)和搭建文档】