Dear all,
Microsoft just released the Phi-3 model range.
The mini version (3.8B parameters) with 4K context uses the exact same tokenizer as Llama2, which makes it possible to ensemble the two models.
I ran the MMLU benchmark on both Phi-3-3.8B-4K and Ensemble(Phi3 + Llama2-7B).
There are some per-subject discrepancies, but the overall scores come out virtually identical (0.6894 vs. 0.6897).
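For anyone curious how the ensembling works: because the two models share a tokenizer, index i in one model's output distribution refers to the same token as index i in the other's, so you can mix the two distributions directly at each decoding step. Below is a minimal sketch of that idea with NumPy; the logit arrays stand in for whatever the two models actually return, and the 50/50 weighting is just an illustrative default, not the exact setup I used.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_next_token(logits_a: np.ndarray, logits_b: np.ndarray,
                        weight_a: float = 0.5) -> int:
    # Valid only when both models share the tokenizer, so that
    # position i means the same token in both vocabularies.
    probs = weight_a * softmax(logits_a) + (1.0 - weight_a) * softmax(logits_b)
    return int(np.argmax(probs))  # greedy pick from the mixed distribution
```

In a real decoding loop you would feed the same prompt to both models, average their next-token distributions as above, append the chosen token, and repeat until EOS.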
Phi3:
ACC-abstract_algebra: 0.3600
ACC-anatomy: 0.6741
ACC-astronomy: 0.7763
ACC-business_ethics: 0.7100
ACC-clinical_knowledge: 0.7358
ACC-college_biology: 0.8333
ACC-college_chemistry: 0.4900
ACC-college_computer_science: 0.5300
ACC-college_mathematics: 0.3500
ACC-college_medicine: 0.7052
ACC-college_physics: 0.3529
ACC-computer_security: 0.7600
ACC-conceptual_physics: 0.7106
ACC-econometrics: 0.4737
ACC-electrical_engineering: 0.6207
ACC-elementary_mathematics: 0.5026
ACC-formal_logic: 0.5873
ACC-global_facts: 0.3600
ACC-high_school_biology: 0.8226
ACC-high_school_chemistry: 0.6108
ACC-high_school_computer_science: 0.7200
ACC-high_school_european_history: 0.7758
ACC-high_school_geography: 0.8636
ACC-high_school_government_and_politics: 0.9119
ACC-high_school_macroeconomics: 0.7487
ACC-high_school_mathematics: 0.3926
ACC-high_school_microeconomics: 0.8319
ACC-high_school_physics: 0.4371
ACC-high_school_psychology: 0.8862
ACC-high_school_statistics: 0.6157
ACC-high_school_us_history: 0.8284
ACC-high_school_world_history: 0.8101
ACC-human_aging: 0.6996
ACC-human_sexuality: 0.7634
ACC-international_law: 0.8595
ACC-jurisprudence: 0.7870
ACC-logical_fallacies: 0.7975
ACC-machine_learning: 0.5357
ACC-management: 0.7476
ACC-marketing: 0.8761
ACC-medical_genetics: 0.7900
ACC-miscellaneous: 0.8225
ACC-moral_disputes: 0.7601
ACC-moral_scenarios: 0.5810
ACC-nutrition: 0.7516
ACC-philosophy: 0.7717
ACC-prehistory: 0.7778
ACC-professional_accounting: 0.5887
ACC-professional_law: 0.4980
ACC-professional_medicine: 0.7610
ACC-professional_psychology: 0.7631
ACC-public_relations: 0.7364
ACC-security_studies: 0.7796
ACC-sociology: 0.8607
ACC-us_foreign_policy: 0.8500
ACC-virology: 0.4940
ACC-world_religions: 0.8304
ACC-all: 0.6894
Phi3+Llama2:
ACC-abstract_algebra: 0.3800
ACC-anatomy: 0.6519
ACC-astronomy: 0.7697
ACC-business_ethics: 0.6600
ACC-clinical_knowledge: 0.7321
ACC-college_biology: 0.8472
ACC-college_chemistry: 0.4700
ACC-college_computer_science: 0.5600
ACC-college_mathematics: 0.3700
ACC-college_medicine: 0.6821
ACC-college_physics: 0.3824
ACC-computer_security: 0.7900
ACC-conceptual_physics: 0.6936
ACC-econometrics: 0.4737
ACC-electrical_engineering: 0.5931
ACC-elementary_mathematics: 0.5265
ACC-formal_logic: 0.6190
ACC-global_facts: 0.3400
ACC-high_school_biology: 0.8290
ACC-high_school_chemistry: 0.6108
ACC-high_school_computer_science: 0.6900
ACC-high_school_european_history: 0.7939
ACC-high_school_geography: 0.8586
ACC-high_school_government_and_politics: 0.9067
ACC-high_school_macroeconomics: 0.7462
ACC-high_school_mathematics: 0.4000
ACC-high_school_microeconomics: 0.8445
ACC-high_school_physics: 0.4305
ACC-high_school_psychology: 0.8844
ACC-high_school_statistics: 0.6019
ACC-high_school_us_history: 0.8284
ACC-high_school_world_history: 0.8059
ACC-human_aging: 0.6996
ACC-human_sexuality: 0.7710
ACC-international_law: 0.8430
ACC-jurisprudence: 0.7963
ACC-logical_fallacies: 0.8098
ACC-machine_learning: 0.5536
ACC-management: 0.7864
ACC-marketing: 0.8974
ACC-medical_genetics: 0.7900
ACC-miscellaneous: 0.8276
ACC-moral_disputes: 0.7370
ACC-moral_scenarios: 0.5877
ACC-nutrition: 0.7451
ACC-philosophy: 0.7331
ACC-prehistory: 0.7346
ACC-professional_accounting: 0.6028
ACC-professional_law: 0.5137
ACC-professional_medicine: 0.7647
ACC-professional_psychology: 0.7418
ACC-public_relations: 0.7182
ACC-security_studies: 0.7510
ACC-sociology: 0.8856
ACC-us_foreign_policy: 0.8500
ACC-virology: 0.5301
ACC-world_religions: 0.8246
ACC-all: 0.6897
Maybe I should try it with Llama2-13B or 70B instead.