Phi-3-3.8B + Llama2-7B ensemble... just for fun

Dear all,

Microsoft just released the Phi-3 model range.

The mini version (3.8B params) with 4K context uses the exact same tokenizer as Llama2, which makes it possible to ensemble these two models.
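Because the two models share one tokenizer, their logit vectors are aligned index-by-index over the same vocabulary, so they can be combined token-by-token at decode time. Below is a minimal sketch of that idea, averaging log-probabilities over the shared vocabulary; the function name and toy vocabulary are illustrative, not the actual code used for the benchmark:

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    # Shift by the max for numerical stability before normalizing.
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def ensemble_next_token(logits_a: np.ndarray, logits_b: np.ndarray) -> int:
    """Average the per-token log-probabilities of two models that share
    a tokenizer (identical vocabulary and token ids) and pick the argmax.
    This only makes sense because position i means the same token in
    both logit vectors."""
    avg = (log_softmax(logits_a) + log_softmax(logits_b)) / 2
    return int(np.argmax(avg))

# Toy 4-token vocabulary: model A prefers token 1, model B prefers
# token 2, and the ensemble picks the token both rate highly.
a = np.array([0.0, 3.0, 2.0, -1.0])
b = np.array([0.0, 2.5, 3.0, -1.0])
print(ensemble_next_token(a, b))  # -> 1
```

With a mismatched tokenizer (different vocabularies or token ids), this element-wise averaging would silently combine unrelated tokens, which is why the shared Llama2 tokenizer matters here.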

I ran the MMLU benchmark on both Phi-3-3.8B-4K and Ensemble(Phi3 + Llama2-7B).
There are some discrepancies in the per-task scores, but the overall accuracies come out almost identical (0.6894 vs. 0.6897).

Phi3:

ACC-abstract_algebra: 0.3600
ACC-anatomy: 0.6741
ACC-astronomy: 0.7763
ACC-business_ethics: 0.7100
ACC-clinical_knowledge: 0.7358
ACC-college_biology: 0.8333
ACC-college_chemistry: 0.4900
ACC-college_computer_science: 0.5300
ACC-college_mathematics: 0.3500
ACC-college_medicine: 0.7052
ACC-college_physics: 0.3529
ACC-computer_security: 0.7600
ACC-conceptual_physics: 0.7106
ACC-econometrics: 0.4737
ACC-electrical_engineering: 0.6207
ACC-elementary_mathematics: 0.5026
ACC-formal_logic: 0.5873
ACC-global_facts: 0.3600
ACC-high_school_biology: 0.8226
ACC-high_school_chemistry: 0.6108
ACC-high_school_computer_science: 0.7200
ACC-high_school_european_history: 0.7758
ACC-high_school_geography: 0.8636
ACC-high_school_government_and_politics: 0.9119
ACC-high_school_macroeconomics: 0.7487
ACC-high_school_mathematics: 0.3926
ACC-high_school_microeconomics: 0.8319
ACC-high_school_physics: 0.4371
ACC-high_school_psychology: 0.8862
ACC-high_school_statistics: 0.6157
ACC-high_school_us_history: 0.8284
ACC-high_school_world_history: 0.8101
ACC-human_aging: 0.6996
ACC-human_sexuality: 0.7634
ACC-international_law: 0.8595
ACC-jurisprudence: 0.7870
ACC-logical_fallacies: 0.7975
ACC-machine_learning: 0.5357
ACC-management: 0.7476
ACC-marketing: 0.8761
ACC-medical_genetics: 0.7900
ACC-miscellaneous: 0.8225
ACC-moral_disputes: 0.7601
ACC-moral_scenarios: 0.5810
ACC-nutrition: 0.7516
ACC-philosophy: 0.7717
ACC-prehistory: 0.7778
ACC-professional_accounting: 0.5887
ACC-professional_law: 0.4980
ACC-professional_medicine: 0.7610
ACC-professional_psychology: 0.7631
ACC-public_relations: 0.7364
ACC-security_studies: 0.7796
ACC-sociology: 0.8607
ACC-us_foreign_policy: 0.8500
ACC-virology: 0.4940
ACC-world_religions: 0.8304
ACC-all: 0.6894

Phi3+Llama2:

ACC-abstract_algebra: 0.3800
ACC-anatomy: 0.6519
ACC-astronomy: 0.7697
ACC-business_ethics: 0.6600
ACC-clinical_knowledge: 0.7321
ACC-college_biology: 0.8472
ACC-college_chemistry: 0.4700
ACC-college_computer_science: 0.5600
ACC-college_mathematics: 0.3700
ACC-college_medicine: 0.6821
ACC-college_physics: 0.3824
ACC-computer_security: 0.7900
ACC-conceptual_physics: 0.6936
ACC-econometrics: 0.4737
ACC-electrical_engineering: 0.5931
ACC-elementary_mathematics: 0.5265
ACC-formal_logic: 0.6190
ACC-global_facts: 0.3400
ACC-high_school_biology: 0.8290
ACC-high_school_chemistry: 0.6108
ACC-high_school_computer_science: 0.6900
ACC-high_school_european_history: 0.7939
ACC-high_school_geography: 0.8586
ACC-high_school_government_and_politics: 0.9067
ACC-high_school_macroeconomics: 0.7462
ACC-high_school_mathematics: 0.4000
ACC-high_school_microeconomics: 0.8445
ACC-high_school_physics: 0.4305
ACC-high_school_psychology: 0.8844
ACC-high_school_statistics: 0.6019
ACC-high_school_us_history: 0.8284
ACC-high_school_world_history: 0.8059
ACC-human_aging: 0.6996
ACC-human_sexuality: 0.7710
ACC-international_law: 0.8430
ACC-jurisprudence: 0.7963
ACC-logical_fallacies: 0.8098
ACC-machine_learning: 0.5536
ACC-management: 0.7864
ACC-marketing: 0.8974
ACC-medical_genetics: 0.7900
ACC-miscellaneous: 0.8276
ACC-moral_disputes: 0.7370
ACC-moral_scenarios: 0.5877
ACC-nutrition: 0.7451
ACC-philosophy: 0.7331
ACC-prehistory: 0.7346
ACC-professional_accounting: 0.6028
ACC-professional_law: 0.5137
ACC-professional_medicine: 0.7647
ACC-professional_psychology: 0.7418
ACC-public_relations: 0.7182
ACC-security_studies: 0.7510
ACC-sociology: 0.8856
ACC-us_foreign_policy: 0.8500
ACC-virology: 0.5301
ACC-world_religions: 0.8246
ACC-all: 0.6897

Maybe I should try it with Llama2-13B or 70B.


Practically the same result for Phi-3 + Llama2-13B-AWQ:

ACC-abstract_algebra: 0.3600
ACC-anatomy: 0.6741
ACC-astronomy: 0.7500
ACC-business_ethics: 0.6400
ACC-clinical_knowledge: 0.7434
ACC-college_biology: 0.8125
ACC-college_chemistry: 0.4700
ACC-college_computer_science: 0.5600
ACC-college_mathematics: 0.3400
ACC-college_medicine: 0.6821
ACC-college_physics: 0.3431
ACC-computer_security: 0.7800
ACC-conceptual_physics: 0.7277
ACC-econometrics: 0.4737
ACC-electrical_engineering: 0.6069
ACC-elementary_mathematics: 0.5238
ACC-formal_logic: 0.5873
ACC-global_facts: 0.3800
ACC-high_school_biology: 0.8387
ACC-high_school_chemistry: 0.6256
ACC-high_school_computer_science: 0.7100
ACC-high_school_european_history: 0.7879
ACC-high_school_geography: 0.8535
ACC-high_school_government_and_politics: 0.9067
ACC-high_school_macroeconomics: 0.7513
ACC-high_school_mathematics: 0.4037
ACC-high_school_microeconomics: 0.8277
ACC-high_school_physics: 0.4636
ACC-high_school_psychology: 0.8899
ACC-high_school_statistics: 0.6343
ACC-high_school_us_history: 0.8284
ACC-high_school_world_history: 0.8017
ACC-human_aging: 0.6996
ACC-human_sexuality: 0.7481
ACC-international_law: 0.8430
ACC-jurisprudence: 0.8056
ACC-logical_fallacies: 0.8037
ACC-machine_learning: 0.5357
ACC-management: 0.8252
ACC-marketing: 0.9017
ACC-medical_genetics: 0.7600
ACC-miscellaneous: 0.8225
ACC-moral_disputes: 0.7486
ACC-moral_scenarios: 0.5877
ACC-nutrition: 0.7549
ACC-philosophy: 0.7460
ACC-prehistory: 0.7562
ACC-professional_accounting: 0.5780
ACC-professional_law: 0.5007
ACC-professional_medicine: 0.7537
ACC-professional_psychology: 0.7484
ACC-public_relations: 0.7091
ACC-security_studies: 0.7469
ACC-sociology: 0.8706
ACC-us_foreign_policy: 0.8800
ACC-virology: 0.5120
ACC-world_religions: 0.8421
ACC-all: 0.6895