Support for Phi-2 from Microsoft

Dear community,

A few months ago Microsoft released their small language model “Phi”, and they recently updated it with Phi-2.

This model is trained mostly on synthetic data and performs very well for its size.

You can test it directly on Hugging Face or convert it to the OpenNMT-py format.

Here is an example to run an evaluation with EleutherAI’s lm-evaluation-harness package:

Install the package from pip, or see the instructions at https://github.com/EleutherAI/lm-evaluation-harness, then run:

python main.py --model hf --model_args pretrained="microsoft/phi-2",trust_remote_code=True --tasks "hendrycksTest-*" --device cuda:0 --batch_size 1 --num_fewshot 5 --no_cache

or replace “python main.py” with “lm_eval” if you installed from pip.
It will take quite a long time, because lm_eval scores each question by computing the logits for the four candidate answers and keeping the one with the highest likelihood among [A, B, C, D].
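
For intuition, here is a minimal sketch of that likelihood comparison using Hugging Face transformers (illustrative only, not lm-eval’s actual code; the prompt is a placeholder and tokenization edge cases are ignored):

```python
# Minimal sketch of likelihood-based multiple-choice scoring
# (illustrative only, not lm-eval's actual implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()

def loglikelihood(prompt: str, continuation: str) -> float:
    # Total log-probability of `continuation` given `prompt`.
    ids = tok(prompt + continuation, return_tensors="pt").input_ids.cuda()
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
    targets = ids[0, 1:]
    rows = torch.arange(n_prompt - 1, ids.shape[1] - 1, device=ids.device)
    return logprobs[rows, targets[rows]].sum().item()

prompt = "...five-shot MMLU prompt ending in 'Answer:'..."  # placeholder
best = max(" A", " B", " C", " D", key=lambda c: loglikelihood(prompt, c))
```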

It took 2 h 42 min on my RTX 3090.

It will spit out this:

|                      Task                       |Version| Metric |Value |   |Stderr|
|-------------------------------------------------|------:|--------|-----:|---|-----:|
|hendrycksTest-abstract_algebra                   |      1|acc     |0.3100|±  |0.0465|
|                                                 |       |acc_norm|0.3100|±  |0.0465|
|hendrycksTest-anatomy                            |      1|acc     |0.4444|±  |0.0429|
|                                                 |       |acc_norm|0.4444|±  |0.0429|
|hendrycksTest-astronomy                          |      1|acc     |0.5987|±  |0.0399|
|                                                 |       |acc_norm|0.5987|±  |0.0399|
|hendrycksTest-business_ethics                    |      1|acc     |0.5900|±  |0.0494|
|                                                 |       |acc_norm|0.5900|±  |0.0494|
|hendrycksTest-clinical_knowledge                 |      1|acc     |0.6113|±  |0.0300|
|                                                 |       |acc_norm|0.6113|±  |0.0300|
|hendrycksTest-college_biology                    |      1|acc     |0.6667|±  |0.0394|
|                                                 |       |acc_norm|0.6667|±  |0.0394|
|hendrycksTest-college_chemistry                  |      1|acc     |0.4100|±  |0.0494|
|                                                 |       |acc_norm|0.4100|±  |0.0494|
|hendrycksTest-college_computer_science           |      1|acc     |0.4300|±  |0.0498|
|                                                 |       |acc_norm|0.4300|±  |0.0498|
|hendrycksTest-college_mathematics                |      1|acc     |0.4000|±  |0.0492|
|                                                 |       |acc_norm|0.4000|±  |0.0492|
|hendrycksTest-college_medicine                   |      1|acc     |0.6012|±  |0.0373|
|                                                 |       |acc_norm|0.6012|±  |0.0373|
|hendrycksTest-college_physics                    |      1|acc     |0.3725|±  |0.0481|
|                                                 |       |acc_norm|0.3725|±  |0.0481|
|hendrycksTest-computer_security                  |      1|acc     |0.7400|±  |0.0441|
|                                                 |       |acc_norm|0.7400|±  |0.0441|
|hendrycksTest-conceptual_physics                 |      1|acc     |0.5064|±  |0.0327|
|                                                 |       |acc_norm|0.5064|±  |0.0327|
|hendrycksTest-econometrics                       |      1|acc     |0.3772|±  |0.0456|
|                                                 |       |acc_norm|0.3772|±  |0.0456|
|hendrycksTest-electrical_engineering             |      1|acc     |0.5448|±  |0.0415|
|                                                 |       |acc_norm|0.5448|±  |0.0415|
|hendrycksTest-elementary_mathematics             |      1|acc     |0.4550|±  |0.0256|
|                                                 |       |acc_norm|0.4550|±  |0.0256|
|hendrycksTest-formal_logic                       |      1|acc     |0.3571|±  |0.0429|
|                                                 |       |acc_norm|0.3571|±  |0.0429|
|hendrycksTest-global_facts                       |      1|acc     |0.3600|±  |0.0482|
|                                                 |       |acc_norm|0.3600|±  |0.0482|
|hendrycksTest-high_school_biology                |      1|acc     |0.7065|±  |0.0259|
|                                                 |       |acc_norm|0.7065|±  |0.0259|
|hendrycksTest-high_school_chemistry              |      1|acc     |0.4828|±  |0.0352|
|                                                 |       |acc_norm|0.4828|±  |0.0352|
|hendrycksTest-high_school_computer_science       |      1|acc     |0.6400|±  |0.0482|
|                                                 |       |acc_norm|0.6400|±  |0.0482|
|hendrycksTest-high_school_european_history       |      1|acc     |0.6424|±  |0.0374|
|                                                 |       |acc_norm|0.6424|±  |0.0374|
|hendrycksTest-high_school_geography              |      1|acc     |0.7323|±  |0.0315|
|                                                 |       |acc_norm|0.7323|±  |0.0315|
|hendrycksTest-high_school_government_and_politics|      1|acc     |0.8083|±  |0.0284|
|                                                 |       |acc_norm|0.8083|±  |0.0284|
|hendrycksTest-high_school_macroeconomics         |      1|acc     |0.5692|±  |0.0251|
|                                                 |       |acc_norm|0.5692|±  |0.0251|
|hendrycksTest-high_school_mathematics            |      1|acc     |0.3333|±  |0.0287|
|                                                 |       |acc_norm|0.3333|±  |0.0287|
|hendrycksTest-high_school_microeconomics         |      1|acc     |0.6092|±  |0.0317|
|                                                 |       |acc_norm|0.6092|±  |0.0317|
|hendrycksTest-high_school_physics                |      1|acc     |0.3907|±  |0.0398|
|                                                 |       |acc_norm|0.3907|±  |0.0398|
|hendrycksTest-high_school_psychology             |      1|acc     |0.7945|±  |0.0173|
|                                                 |       |acc_norm|0.7945|±  |0.0173|
|hendrycksTest-high_school_statistics             |      1|acc     |0.5000|±  |0.0341|
|                                                 |       |acc_norm|0.5000|±  |0.0341|
|hendrycksTest-high_school_us_history             |      1|acc     |0.6716|±  |0.0330|
|                                                 |       |acc_norm|0.6716|±  |0.0330|
|hendrycksTest-high_school_world_history          |      1|acc     |0.7300|±  |0.0289|
|                                                 |       |acc_norm|0.7300|±  |0.0289|
|hendrycksTest-human_aging                        |      1|acc     |0.6547|±  |0.0319|
|                                                 |       |acc_norm|0.6547|±  |0.0319|
|hendrycksTest-human_sexuality                    |      1|acc     |0.7099|±  |0.0398|
|                                                 |       |acc_norm|0.7099|±  |0.0398|
|hendrycksTest-international_law                  |      1|acc     |0.7355|±  |0.0403|
|                                                 |       |acc_norm|0.7355|±  |0.0403|
|hendrycksTest-jurisprudence                      |      1|acc     |0.7130|±  |0.0437|
|                                                 |       |acc_norm|0.7130|±  |0.0437|
|hendrycksTest-logical_fallacies                  |      1|acc     |0.7485|±  |0.0341|
|                                                 |       |acc_norm|0.7485|±  |0.0341|
|hendrycksTest-machine_learning                   |      1|acc     |0.4911|±  |0.0475|
|                                                 |       |acc_norm|0.4911|±  |0.0475|
|hendrycksTest-management                         |      1|acc     |0.7379|±  |0.0435|
|                                                 |       |acc_norm|0.7379|±  |0.0435|
|hendrycksTest-marketing                          |      1|acc     |0.8162|±  |0.0254|
|                                                 |       |acc_norm|0.8162|±  |0.0254|
|hendrycksTest-medical_genetics                   |      1|acc     |0.6300|±  |0.0485|
|                                                 |       |acc_norm|0.6300|±  |0.0485|
|hendrycksTest-miscellaneous                      |      1|acc     |0.6897|±  |0.0165|
|                                                 |       |acc_norm|0.6897|±  |0.0165|
|hendrycksTest-moral_disputes                     |      1|acc     |0.6763|±  |0.0252|
|                                                 |       |acc_norm|0.6763|±  |0.0252|
|hendrycksTest-moral_scenarios                    |      1|acc     |0.3151|±  |0.0155|
|                                                 |       |acc_norm|0.3151|±  |0.0155|
|hendrycksTest-nutrition                          |      1|acc     |0.6176|±  |0.0278|
|                                                 |       |acc_norm|0.6176|±  |0.0278|
|hendrycksTest-philosophy                         |      1|acc     |0.6302|±  |0.0274|
|                                                 |       |acc_norm|0.6302|±  |0.0274|
|hendrycksTest-prehistory                         |      1|acc     |0.6296|±  |0.0269|
|                                                 |       |acc_norm|0.6296|±  |0.0269|
|hendrycksTest-professional_accounting            |      1|acc     |0.4362|±  |0.0296|
|                                                 |       |acc_norm|0.4362|±  |0.0296|
|hendrycksTest-professional_law                   |      1|acc     |0.4244|±  |0.0126|
|                                                 |       |acc_norm|0.4244|±  |0.0126|
|hendrycksTest-professional_medicine              |      1|acc     |0.4706|±  |0.0303|
|                                                 |       |acc_norm|0.4706|±  |0.0303|
|hendrycksTest-professional_psychology            |      1|acc     |0.5621|±  |0.0201|
|                                                 |       |acc_norm|0.5621|±  |0.0201|
|hendrycksTest-public_relations                   |      1|acc     |0.6727|±  |0.0449|
|                                                 |       |acc_norm|0.6727|±  |0.0449|
|hendrycksTest-security_studies                   |      1|acc     |0.7306|±  |0.0284|
|                                                 |       |acc_norm|0.7306|±  |0.0284|
|hendrycksTest-sociology                          |      1|acc     |0.8109|±  |0.0277|
|                                                 |       |acc_norm|0.8109|±  |0.0277|
|hendrycksTest-us_foreign_policy                  |      1|acc     |0.7700|±  |0.0423|
|                                                 |       |acc_norm|0.7700|±  |0.0423|
|hendrycksTest-virology                           |      1|acc     |0.4759|±  |0.0389|
|                                                 |       |acc_norm|0.4759|±  |0.0389|
|hendrycksTest-world_religions                    |      1|acc     |0.6901|±  |0.0355|
|                                                 |       |acc_norm|0.6901|±  |0.0355|

If you average the accuracies over these lines, you get 58.3, which is in line with the 58.1 reported on the Hugging Face Open LLM Leaderboard.

Now, you can use OpenNMT-py to achieve the same.

First, you need to convert the model from the command line:

python tools/convert_HF.py --model_dir "microsoft/phi-2" --output "/mnt/InternalCrucial4/dataAI/phi-2/phi-2-onmt.pt" --format safetensors --nshards 1

It will store two files in the target directory: phi-2-onmt.pt, containing the settings, and a safetensors file with the weights.
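
If you want to sanity-check the conversion, the settings file should load with plain torch (assuming, as the converter’s output suggests, that the .pt file is a regular torch pickle; the actual keys may differ):

```python
# Hypothetical sanity check: peek at the converted settings file.
# Assumes phi-2-onmt.pt is a regular torch pickle (the weights themselves
# live in the accompanying safetensors file).
import torch

ckpt = torch.load("/mnt/InternalCrucial4/dataAI/phi-2/phi-2-onmt.pt",
                  map_location="cpu")
print(list(ckpt.keys()))
```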

Then you need an inference YAML config to run it:

transforms: [onmt_tokenize]
#### Subword
src_subword_type: bpe
src_subword_model: "/mnt/InternalCrucial4/dataAI/phi-2/bpe.model"
tgt_subword_type: bpe
tgt_subword_model: "/mnt/InternalCrucial4/dataAI/phi-2/bpe.model"
gpt2_pretok: true

# Model info
model: "/mnt/InternalCrucial4/dataAI/phi-2/phi-2-onmt.pt"
# Inference
seed: 42
max_length: 1 # generate a single new token (the answer letter)
gpu: 0
batch_type: sents
batch_size: 1
world_size: 1
gpu_ranks: [0]
precision: fp16
#random_sampling_topk: 1
#random_sampling_topp: 0.6
#random_sampling_temp: 0.9
beam_size: 1
n_best: 1
profile: false
report_time: true
src: None # no source file; the MMLU script supplies the prompts itself

And finally:

python eval_llm/MMLU/run_mmlu_opennmt.py --config /mnt/InternalCrucial4/dataAI/phi-2/phi-2-inference.yaml
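
Under the hood, the script only needs one greedy decoding step per question (hence max_length: 1 and beam_size: 1 in the config), rather than scoring four full continuations. A minimal self-contained sketch of that idea, with hypothetical helper names (the real logic lives in eval_llm/MMLU/run_mmlu_opennmt.py):

```python
# Sketch of single-token MMLU scoring (hypothetical names; the real
# implementation is eval_llm/MMLU/run_mmlu_opennmt.py).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Question:
    prompt: str       # few-shot prompt ending in "Answer:"
    gold_letter: str  # "A", "B", "C" or "D"

def score_task(questions: List[Question],
               generate_one_token: Callable[[str], str]) -> float:
    # One greedy step per question: the model emits just the answer letter.
    correct = sum(generate_one_token(q.prompt).strip() == q.gold_letter
                  for q in questions)
    return correct / len(questions)
```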

It will be much faster (10 min on my RTX 4090), and then you will get the scores:

ACC-abstract_algebra: 0.2700
ACC-anatomy: 0.4444
ACC-astronomy: 0.5921
ACC-business_ethics: 0.6000
ACC-clinical_knowledge: 0.5962
ACC-college_biology: 0.6667
ACC-college_chemistry: 0.4100
ACC-college_computer_science: 0.4200
ACC-college_mathematics: 0.3700
ACC-college_medicine: 0.6069
ACC-college_physics: 0.3824
ACC-computer_security: 0.7500
ACC-conceptual_physics: 0.5064
ACC-econometrics: 0.3772
ACC-electrical_engineering: 0.5448
ACC-elementary_mathematics: 0.4497
ACC-formal_logic: 0.3730
ACC-global_facts: 0.3600
ACC-high_school_biology: 0.6968
ACC-high_school_chemistry: 0.4926
ACC-high_school_computer_science: 0.6500
ACC-high_school_european_history: 0.6848
ACC-high_school_geography: 0.7475
ACC-high_school_government_and_politics: 0.7979
ACC-high_school_macroeconomics: 0.5897
ACC-high_school_mathematics: 0.3630
ACC-high_school_microeconomics: 0.6134
ACC-high_school_physics: 0.3907
ACC-high_school_psychology: 0.7963
ACC-high_school_statistics: 0.4907
ACC-high_school_us_history: 0.6716
ACC-high_school_world_history: 0.7300
ACC-human_aging: 0.6502
ACC-human_sexuality: 0.7176
ACC-international_law: 0.7521
ACC-jurisprudence: 0.6944
ACC-logical_fallacies: 0.7546
ACC-machine_learning: 0.4911
ACC-management: 0.7476
ACC-marketing: 0.8162
ACC-medical_genetics: 0.6100
ACC-miscellaneous: 0.6909
ACC-moral_disputes: 0.6647
ACC-moral_scenarios: 0.3117
ACC-nutrition: 0.6340
ACC-philosophy: 0.6238
ACC-prehistory: 0.6265
ACC-professional_accounting: 0.4362
ACC-professional_law: 0.4179
ACC-professional_medicine: 0.4779
ACC-professional_psychology: 0.5670
ACC-public_relations: 0.6727
ACC-security_studies: 0.7102
ACC-sociology: 0.8010
ACC-us_foreign_policy: 0.7800
ACC-virology: 0.4759
ACC-world_religions: 0.6901
ACC-all: 0.5684

The final score of 56.84 is the true “MMLU original implementation” score: it is the weighted average over tasks, each task weighted by its number of questions. A simple unweighted average of the per-task lines gives 58.3.
There are still some small discrepancies with the lm_eval package because of methodology differences, but the scores should be in the same range.
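
To make the two numbers concrete, here is how the weighted and simple averages differ (only two of the 57 MMLU tasks are shown; the question counts are the test-set sizes for those tasks):

```python
# Weighted vs. simple averaging of per-task accuracies.
# Maps task -> (accuracy, number of test questions); two tasks shown.
results = {
    "abstract_algebra": (0.2700, 100),
    "professional_law": (0.4179, 1534),
    # ... one entry per MMLU task (57 in total)
}

simple = sum(acc for acc, _ in results.values()) / len(results)
weighted = (sum(acc * n for acc, n in results.values())
            / sum(n for _, n in results.values()))
print(f"simple: {simple:.4f}  weighted: {weighted:.4f}")
```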

Enjoy.


Great work!


Enjoy this variant of phi-2:

The evaluation was performed using LLM AutoEval on the Nous suite.

|Model|AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|-----|------:|------:|---------:|-------:|------:|
|phi-2-psy|34.4|71.4|48.2|38.1|48.02|
|phixtral-2x2_8|34.1|70.4|48.8|37.8|47.78|
|dolphin-2_6-phi-2|33.1|69.9|47.4|37.2|46.89|
|phi-2-orange|33.4|71.3|49.9|37.3|47.97|
|phi-2|28.0|70.8|44.4|35.2|44.61|