Discrepancy between Llama-3.1-8B-Instruct results on LongBench v2 and the leaderboard #94
Comments
Your results look quite close to ours; I think the gap is basically within random error. What truncation strategy are you using?
My deployment and testing setup follows the example. Deployment: Testing: `python pred.py --model ${model_path}`
I only got 28.2, with the random seed fixed at 42. The variance seems quite large...
The randomness here is not caused by the random seed alone.
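To make the point above concrete: fixing a seed only pins down the Python-level RNG state; it does not control other sources of nondeterminism during inference (e.g. GPU kernel scheduling or vLLM's batching order), so scores can still drift run to run. A minimal sketch, with the hypothetical helper `seeded_draws` standing in for any seeded sampling step:

```python
import random

def seeded_draws(seed: int, n: int = 3) -> list:
    # Re-seeding before each run makes the *Python-level* randomness
    # reproducible; it cannot control GPU kernel nondeterminism or
    # the serving engine's batching/scheduling order.
    random.seed(seed)
    return [random.random() for _ in range(n)]

# Same seed -> identical draws; different seed -> different draws.
assert seeded_draws(42) == seeded_draws(42)
assert seeded_draws(42) != seeded_draws(7)
```

In other words, a fixed seed rules out one source of variance, but a point-level gap between runs can still come from the rest of the inference stack.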
Hello,
My results for Llama-3.1-8B-Instruct are as follows:
| Model | Overall | Easy | Hard | Short | Medium | Long |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B-Instruct | 29.0 | 30.7 | 28.0 | 33.9 | 25.6 | 27.8 |
This is one point below the Overall score on the leaderboard (29.0 vs 30.0). My environment is:
vllm==0.5.3.post1
transformers==4.45.0
Does evaluating Llama-3.1-8B-Instruct require any special handling?