Step3: Continued Pretraining
Run the continued pretraining script run_pretrain_megatron_llama.sh. It takes the following positional arguments:

```bash
ENV=$1                          # runtime environment: dsw or dlc
MEGATRON_PATH=$2                # path to the Megatron-LM source code
MEGATRON_PATCH_PATH=$3          # path to the PAI-Megatron-Patch source code
MODEL_SIZE=$4                   # model size, e.g. 7B or 13B
BATCH_SIZE=$5                   # micro batch size per GPU
GLOBAL_BATCH_SIZE=$6            # global batch size
LR=$7                           # learning rate
MIN_LR=$8                       # minimum learning rate
SEQ_LEN=$9                      # sequence length
PAD_LEN=${10}                   # padding length
EXTRA_VOCAB_SIZE=${11}          # number of extra vocabulary entries
PR=${12}                        # training precision: fp16 or bf16
TP=${13}                        # tensor model parallel degree
PP=${14}                        # pipeline model parallel degree
AC=${15}                        # activation checkpointing mode: full or sel
DO=${16}                        # whether to use the ZeRO-1 distributed optimizer
FL=${17}                        # whether to use Flash Attention
SP=${18}                        # whether to use sequence parallelism
SAVE_INTERVAL=${19}             # checkpoint save interval
DATASET_PATH=${20}              # path prefix of the training dataset
PRETRAIN_CHECKPOINT_PATH=${21}  # path of the converted pretrained checkpoint
TRAIN_TOKENS=${22}              # total number of training tokens
WARMUP_TOKENS=${23}             # number of warmup tokens
OUTPUT_BASEPATH=${24}           # base path for training outputs
```
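Whether a given combination of BATCH_SIZE, GLOBAL_BATCH_SIZE, TP and PP is valid depends on how many GPUs you run on: Megatron derives the data-parallel degree from the world size divided by TP × PP, and the global batch size must be divisible by BATCH_SIZE times that data-parallel degree. The sketch below is only a sanity check under assumed values; GPUS_PER_NODE and NNODES describe a hypothetical single node with 8 GPUs and are not taken from the tutorial.

```bash
# Sanity-check sketch (assumptions: one node with 8 GPUs; other values mirror the example below).
GPUS_PER_NODE=8
NNODES=1
TP=1                    # tensor parallel degree
PP=1                    # pipeline parallel degree
MICRO_BATCH_SIZE=1      # BATCH_SIZE in the argument list above
GLOBAL_BATCH_SIZE=16

WORLD_SIZE=$((GPUS_PER_NODE * NNODES))
DP=$((WORLD_SIZE / (TP * PP)))          # data-parallel degree
if (( GLOBAL_BATCH_SIZE % (MICRO_BATCH_SIZE * DP) != 0 )); then
  echo "GLOBAL_BATCH_SIZE must be divisible by MICRO_BATCH_SIZE * DP" >&2
else
  echo "gradient accumulation steps per iteration: $((GLOBAL_BATCH_SIZE / (MICRO_BATCH_SIZE * DP)))"
fi
```

With the assumed values this prints 2 accumulation steps per iteration (16 / (1 × 8)).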
Make sure the dataset mount path WORK_DIR and the runtime environment ENV are set correctly. An example invocation looks like this:
```bash
export WORK_DIR=/mnt/workspace
cd ${WORK_DIR}/PAI-Megatron-Patch/examples/llama2
bash run_pretrain_megatron_llama.sh \
  dlc \
  /root/Megatron-LM-23.04 \
  ${WORK_DIR}/PAI-Megatron-Patch \
  7B \
  1 \
  16 \
  1e-5 \
  1e-6 \
  2048 \
  80 \
  0 \
  fp16 \
  1 \
  1 \
  sel \
  true \
  false \
  false \
  100000 \
  ${WORK_DIR}/llama2-datasets/wudao/wudao_llamabpe_text_document \
  ${WORK_DIR}/llama2-ckpts/llama2-7b-hf-to-megatron-tp1-pp1 \
  100000000 \
  10000 \
  ${WORK_DIR}/output_megatron_llama2/
```
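Before launching, it can be worth confirming that the preprocessed dataset and the converted checkpoint actually exist at the paths passed above; a Megatron indexed dataset is stored as a .bin/.idx pair sharing the prefix given in DATASET_PATH. The pre-flight check below is a hedged addition, not part of the original script:

```bash
# Pre-flight check (sketch): the paths mirror the example invocation above.
for p in \
  "${WORK_DIR}/llama2-datasets/wudao/wudao_llamabpe_text_document.bin" \
  "${WORK_DIR}/llama2-datasets/wudao/wudao_llamabpe_text_document.idx" \
  "${WORK_DIR}/llama2-ckpts/llama2-7b-hf-to-megatron-tp1-pp1"; do
  [ -e "$p" ] || echo "missing: $p" >&2
done
```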