Dear community,
A few months ago, Microsoft released its small language model “Phi”, and recently updated it with Phi-2.
This model is trained mostly on synthetic data and performs very well for its size.
You can test it directly on Hugging Face or convert it to OpenNMT-py format.
Here is an example of how to run an evaluation with EleutherAI’s lm-evaluation-harness package:
Install the package from pip, or see the EleutherAI/lm-evaluation-harness repository on GitHub for instructions, then run:
python main.py --model hf --model_args pretrained="microsoft/phi-2",trust_remote_code=True --tasks "hendrycksTest-*" --device cuda:0 --batch_size 1 --num_fewshot 5 --no_cache
or replace “python main.py” with “lm_eval” if you installed from pip.
It will take quite a while, because lm_eval scores each question by computing the logits for the four candidate answers and picking the letter among [A, B, C, D] with the highest likelihood (see the sketch below).
It took 2h42min on my RTX 3090.
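For the curious, here is a minimal sketch of that scoring method using plain transformers. The prompt is a simplified stand-in for the actual 5-shot MMLU template, and this is not lm_eval’s exact code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, trust_remote_code=True
).cuda()

prompt = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
scores = {}
for letter in ["A", "B", "C", "D"]:
    # Append the candidate letter and run one forward pass.
    ids = tokenizer(prompt + " " + letter, return_tensors="pt").input_ids.cuda()
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability the model assigns to the final token (the letter).
    logprobs = torch.log_softmax(logits[0, -2], dim=-1)
    scores[letter] = logprobs[ids[0, -1]].item()

prediction = max(scores, key=scores.get)  # pick the most likely letter
```

Running four forward passes per question across all 57 tasks is what makes it slow.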
It will spit out this:
| Task |Version| Metric |Value | |Stderr|
|-------------------------------------------------|------:|--------|-----:|---|-----:|
|hendrycksTest-abstract_algebra | 1|acc |0.3100|± |0.0465|
| | |acc_norm|0.3100|± |0.0465|
|hendrycksTest-anatomy | 1|acc |0.4444|± |0.0429|
| | |acc_norm|0.4444|± |0.0429|
|hendrycksTest-astronomy | 1|acc |0.5987|± |0.0399|
| | |acc_norm|0.5987|± |0.0399|
|hendrycksTest-business_ethics | 1|acc |0.5900|± |0.0494|
| | |acc_norm|0.5900|± |0.0494|
|hendrycksTest-clinical_knowledge | 1|acc |0.6113|± |0.0300|
| | |acc_norm|0.6113|± |0.0300|
|hendrycksTest-college_biology | 1|acc |0.6667|± |0.0394|
| | |acc_norm|0.6667|± |0.0394|
|hendrycksTest-college_chemistry | 1|acc |0.4100|± |0.0494|
| | |acc_norm|0.4100|± |0.0494|
|hendrycksTest-college_computer_science | 1|acc |0.4300|± |0.0498|
| | |acc_norm|0.4300|± |0.0498|
|hendrycksTest-college_mathematics | 1|acc |0.4000|± |0.0492|
| | |acc_norm|0.4000|± |0.0492|
|hendrycksTest-college_medicine | 1|acc |0.6012|± |0.0373|
| | |acc_norm|0.6012|± |0.0373|
|hendrycksTest-college_physics | 1|acc |0.3725|± |0.0481|
| | |acc_norm|0.3725|± |0.0481|
|hendrycksTest-computer_security | 1|acc |0.7400|± |0.0441|
| | |acc_norm|0.7400|± |0.0441|
|hendrycksTest-conceptual_physics | 1|acc |0.5064|± |0.0327|
| | |acc_norm|0.5064|± |0.0327|
|hendrycksTest-econometrics | 1|acc |0.3772|± |0.0456|
| | |acc_norm|0.3772|± |0.0456|
|hendrycksTest-electrical_engineering | 1|acc |0.5448|± |0.0415|
| | |acc_norm|0.5448|± |0.0415|
|hendrycksTest-elementary_mathematics | 1|acc |0.4550|± |0.0256|
| | |acc_norm|0.4550|± |0.0256|
|hendrycksTest-formal_logic | 1|acc |0.3571|± |0.0429|
| | |acc_norm|0.3571|± |0.0429|
|hendrycksTest-global_facts | 1|acc |0.3600|± |0.0482|
| | |acc_norm|0.3600|± |0.0482|
|hendrycksTest-high_school_biology | 1|acc |0.7065|± |0.0259|
| | |acc_norm|0.7065|± |0.0259|
|hendrycksTest-high_school_chemistry | 1|acc |0.4828|± |0.0352|
| | |acc_norm|0.4828|± |0.0352|
|hendrycksTest-high_school_computer_science | 1|acc |0.6400|± |0.0482|
| | |acc_norm|0.6400|± |0.0482|
|hendrycksTest-high_school_european_history | 1|acc |0.6424|± |0.0374|
| | |acc_norm|0.6424|± |0.0374|
|hendrycksTest-high_school_geography | 1|acc |0.7323|± |0.0315|
| | |acc_norm|0.7323|± |0.0315|
|hendrycksTest-high_school_government_and_politics| 1|acc |0.8083|± |0.0284|
| | |acc_norm|0.8083|± |0.0284|
|hendrycksTest-high_school_macroeconomics | 1|acc |0.5692|± |0.0251|
| | |acc_norm|0.5692|± |0.0251|
|hendrycksTest-high_school_mathematics | 1|acc |0.3333|± |0.0287|
| | |acc_norm|0.3333|± |0.0287|
|hendrycksTest-high_school_microeconomics | 1|acc |0.6092|± |0.0317|
| | |acc_norm|0.6092|± |0.0317|
|hendrycksTest-high_school_physics | 1|acc |0.3907|± |0.0398|
| | |acc_norm|0.3907|± |0.0398|
|hendrycksTest-high_school_psychology | 1|acc |0.7945|± |0.0173|
| | |acc_norm|0.7945|± |0.0173|
|hendrycksTest-high_school_statistics | 1|acc |0.5000|± |0.0341|
| | |acc_norm|0.5000|± |0.0341|
|hendrycksTest-high_school_us_history | 1|acc |0.6716|± |0.0330|
| | |acc_norm|0.6716|± |0.0330|
|hendrycksTest-high_school_world_history | 1|acc |0.7300|± |0.0289|
| | |acc_norm|0.7300|± |0.0289|
|hendrycksTest-human_aging | 1|acc |0.6547|± |0.0319|
| | |acc_norm|0.6547|± |0.0319|
|hendrycksTest-human_sexuality | 1|acc |0.7099|± |0.0398|
| | |acc_norm|0.7099|± |0.0398|
|hendrycksTest-international_law | 1|acc |0.7355|± |0.0403|
| | |acc_norm|0.7355|± |0.0403|
|hendrycksTest-jurisprudence | 1|acc |0.7130|± |0.0437|
| | |acc_norm|0.7130|± |0.0437|
|hendrycksTest-logical_fallacies | 1|acc |0.7485|± |0.0341|
| | |acc_norm|0.7485|± |0.0341|
|hendrycksTest-machine_learning | 1|acc |0.4911|± |0.0475|
| | |acc_norm|0.4911|± |0.0475|
|hendrycksTest-management | 1|acc |0.7379|± |0.0435|
| | |acc_norm|0.7379|± |0.0435|
|hendrycksTest-marketing | 1|acc |0.8162|± |0.0254|
| | |acc_norm|0.8162|± |0.0254|
|hendrycksTest-medical_genetics | 1|acc |0.6300|± |0.0485|
| | |acc_norm|0.6300|± |0.0485|
|hendrycksTest-miscellaneous | 1|acc |0.6897|± |0.0165|
| | |acc_norm|0.6897|± |0.0165|
|hendrycksTest-moral_disputes | 1|acc |0.6763|± |0.0252|
| | |acc_norm|0.6763|± |0.0252|
|hendrycksTest-moral_scenarios | 1|acc |0.3151|± |0.0155|
| | |acc_norm|0.3151|± |0.0155|
|hendrycksTest-nutrition | 1|acc |0.6176|± |0.0278|
| | |acc_norm|0.6176|± |0.0278|
|hendrycksTest-philosophy | 1|acc |0.6302|± |0.0274|
| | |acc_norm|0.6302|± |0.0274|
|hendrycksTest-prehistory | 1|acc |0.6296|± |0.0269|
| | |acc_norm|0.6296|± |0.0269|
|hendrycksTest-professional_accounting | 1|acc |0.4362|± |0.0296|
| | |acc_norm|0.4362|± |0.0296|
|hendrycksTest-professional_law | 1|acc |0.4244|± |0.0126|
| | |acc_norm|0.4244|± |0.0126|
|hendrycksTest-professional_medicine | 1|acc |0.4706|± |0.0303|
| | |acc_norm|0.4706|± |0.0303|
|hendrycksTest-professional_psychology | 1|acc |0.5621|± |0.0201|
| | |acc_norm|0.5621|± |0.0201|
|hendrycksTest-public_relations | 1|acc |0.6727|± |0.0449|
| | |acc_norm|0.6727|± |0.0449|
|hendrycksTest-security_studies | 1|acc |0.7306|± |0.0284|
| | |acc_norm|0.7306|± |0.0284|
|hendrycksTest-sociology | 1|acc |0.8109|± |0.0277|
| | |acc_norm|0.8109|± |0.0277|
|hendrycksTest-us_foreign_policy | 1|acc |0.7700|± |0.0423|
| | |acc_norm|0.7700|± |0.0423|
|hendrycksTest-virology | 1|acc |0.4759|± |0.0389|
| | |acc_norm|0.4759|± |0.0389|
|hendrycksTest-world_religions | 1|acc |0.6901|± |0.0355|
| | |acc_norm|0.6901|± |0.0355|
If you average these task scores you get 58.3, which is in line with the 58.1 reported on the HF Open LLM Leaderboard.
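If you want to reproduce that average, here is a quick sketch, assuming you redirected the lm_eval output above to results.txt:

```python
import re

# Grab the "acc" value of every hendrycksTest-* row in the saved table.
accs = []
for line in open("results.txt"):
    m = re.match(r"\|hendrycksTest-\S+\s*\|\s*\d+\|acc\s*\|([\d.]+)", line)
    if m:
        accs.append(float(m.group(1)))

# Unweighted mean over the 57 tasks, ~58.3
print(f"{100 * sum(accs) / len(accs):.1f}")
```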
Now, you can use OpenNMT-py to achieve the same.
First you need to convert the model with the command line:
python tools/convert_HF.py --model_dir "microsoft/phi-2" --output "/mnt/InternalCrucial4/dataAI/phi-2/phi-2-onmt.pt" --format safetensors --nshards 1
It will store two files in the target directory: phi-2-onmt.pt, which contains the settings, and a safetensors file with the weights.
Then you need an inference YAML config to run it:
transforms: [onmt_tokenize]
#### Subword
src_subword_type: bpe
src_subword_model: "/mnt/InternalCrucial4/dataAI/phi-2/bpe.model"
tgt_subword_type: bpe
tgt_subword_model: "/mnt/InternalCrucial4/dataAI/phi-2/bpe.model"
gpt2_pretok: true
# Model info
model: "/mnt/InternalCrucial4/dataAI/phi-2/phi-2-onmt.pt"
# Inference
seed: 42
max_length: 1 # generate a single token (the answer letter)
gpu: 0
batch_type: sents
batch_size: 1
world_size: 1
gpu_ranks: [0]
precision: fp16
#random_sampling_topk: 1
#random_sampling_topp: 0.6
#random_sampling_temp: 0.9
beam_size: 1
n_best: 1
profile: false
report_time: true
src: None
And finally:
python eval_llm/MMLU/run_mmlu_opennmt.py --config /mnt/InternalCrucial4/dataAI/phi-2/phi-2-inference.yaml
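Under the hood, the script builds a 5-shot prompt per question, greedily generates a single token (hence max_length: 1 in the config), and compares it with the gold letter, with no likelihood pass over four candidates. A conceptual sketch of that loop (translate is a hypothetical stand-in for the OpenNMT-py inference call, not the script’s actual API):

```python
# Conceptual sketch of what run_mmlu_opennmt.py does for one subject.
def score_subject(questions, translate):
    correct = 0
    for q in questions:
        prompt = q["prompt"]           # 5-shot prompt ending in "Answer:"
        pred = translate(prompt)       # max_length: 1 -> one token, e.g. " B"
        if pred.strip() == q["gold"]:  # exact match against the gold letter
            correct += 1
    return correct / len(questions)
```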
It will be way faster (10 min on my RTX 4090), and you will get these scores:
ACC-abstract_algebra: 0.2700
ACC-anatomy: 0.4444
ACC-astronomy: 0.5921
ACC-business_ethics: 0.6000
ACC-clinical_knowledge: 0.5962
ACC-college_biology: 0.6667
ACC-college_chemistry: 0.4100
ACC-college_computer_science: 0.4200
ACC-college_mathematics: 0.3700
ACC-college_medicine: 0.6069
ACC-college_physics: 0.3824
ACC-computer_security: 0.7500
ACC-conceptual_physics: 0.5064
ACC-econometrics: 0.3772
ACC-electrical_engineering: 0.5448
ACC-elementary_mathematics: 0.4497
ACC-formal_logic: 0.3730
ACC-global_facts: 0.3600
ACC-high_school_biology: 0.6968
ACC-high_school_chemistry: 0.4926
ACC-high_school_computer_science: 0.6500
ACC-high_school_european_history: 0.6848
ACC-high_school_geography: 0.7475
ACC-high_school_government_and_politics: 0.7979
ACC-high_school_macroeconomics: 0.5897
ACC-high_school_mathematics: 0.3630
ACC-high_school_microeconomics: 0.6134
ACC-high_school_physics: 0.3907
ACC-high_school_psychology: 0.7963
ACC-high_school_statistics: 0.4907
ACC-high_school_us_history: 0.6716
ACC-high_school_world_history: 0.7300
ACC-human_aging: 0.6502
ACC-human_sexuality: 0.7176
ACC-international_law: 0.7521
ACC-jurisprudence: 0.6944
ACC-logical_fallacies: 0.7546
ACC-machine_learning: 0.4911
ACC-management: 0.7476
ACC-marketing: 0.8162
ACC-medical_genetics: 0.6100
ACC-miscellaneous: 0.6909
ACC-moral_disputes: 0.6647
ACC-moral_scenarios: 0.3117
ACC-nutrition: 0.6340
ACC-philosophy: 0.6238
ACC-prehistory: 0.6265
ACC-professional_accounting: 0.4362
ACC-professional_law: 0.4179
ACC-professional_medicine: 0.4779
ACC-professional_psychology: 0.5670
ACC-public_relations: 0.6727
ACC-security_studies: 0.7102
ACC-sociology: 0.8010
ACC-us_foreign_policy: 0.7800
ACC-virology: 0.4759
ACC-world_religions: 0.6901
ACC-all: 0.5684
The final score of 56.84 is the real “MMLU original implementation” score: it is the question-weighted average, where each subject counts in proportion to its number of questions. A simple unweighted average of the lines gives 58.3 instead (see the sketch below).
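In code, the difference between the two averages looks like this (only two of the 57 subjects shown; the commented results assume the full set):

```python
acc = {"abstract_algebra": 0.2700, "anatomy": 0.4444}  # per-subject accuracy
n   = {"abstract_algebra": 100,    "anatomy": 135}     # questions per subject

simple   = sum(acc.values()) / len(acc)                       # plain mean of the lines -> 58.3
weighted = sum(acc[s] * n[s] for s in acc) / sum(n.values())  # original MMLU score -> 56.84
```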
There are still some small discrepancies with the lm_eval package because of methodology differences, but the scores should be in the same range.
Enjoy.