Using the `-profiler` option, we get the following performance numbers on v0.3 for GPU (Titan X Pascal), for one epoch on 200K sentences of the baseline-1M-enfr corpus:
- network 2x500x500 - vocabulary 50K
  ```
  train:{total:1220.48,
    encoder:{total:128.918,fwd:52.6742,bwd:76.1674},
    decoder:{total:1018.24,
      fwd:224.269,
      bwd:{total:793.903,
        generator:{total:353.259,fwd:163.155,bwd:186.989},
        criterion:{total:30.4795,fwd:6.96226,bwd:20.9239}}}},
  valid:6.06029
  ```
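The dump is almost JSON, so it is easy to load programmatically for analysis. A minimal sketch (the key-quoting regex assumes bare alphabetic keys, as in the output above):

```python
import json
import re

# Raw profiler dump for the 2x500x500 / 50K-vocabulary run, copied from above.
raw = ("train:{total:1220.48,"
       "encoder:{total:128.918,fwd:52.6742,bwd:76.1674},"
       "decoder:{total:1018.24,fwd:224.269,"
       "bwd:{total:793.903,"
       "generator:{total:353.259,fwd:163.155,bwd:186.989},"
       "criterion:{total:30.4795,fwd:6.96226,bwd:20.9239}}}}")

# Quote the bare keys so the dump becomes valid JSON, then parse it.
as_json = "{" + re.sub(r"([A-Za-z_]+):", r'"\1":', raw) + "}"
timings = json.loads(as_json)

# Example query: how much of the decoder time goes to the generator.
decoder = timings["train"]["decoder"]
generator_share = decoder["bwd"]["generator"]["total"] / decoder["total"]
print(f"generator share of decoder time: {generator_share:.0%}")  # ~35%
```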
- network 4x1000x500 - vocabulary 50K
  ```
  train:{total:2346.22,
    encoder:{total:404.836,fwd:153.731,bwd:251.019},
    decoder:{total:1842.45,
      fwd:413.687,
      bwd:{total:1428.69,
        generator:{total:657.478,fwd:281.834,bwd:372.867},
        criterion:{total:33.7131,fwd:7.55272,bwd:22.9849}}}},
  valid:13.0751
  ```
- network 4x1000x500 - vocabulary 100K
  ```
  train:{total:3034.75,
    encoder(*):{total:407.396,fwd:156.66,bwd:250.653},
    decoder:{total:2499.1,
      fwd:411.751,
      bwd:{total:2087.28,
        generator:{total:1308.53,fwd:563.358,bwd:742.988},
        criterion:{total:50.3954,fwd:8.31949,bwd:39.0878}}}},
  valid:16.8854
  ```
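A quick comparison of the three runs above (plain arithmetic on the reported `train` and `generator` totals):

```python
# Epoch "train" totals from the three runs above.
t_2x500_50k   = 1220.48
t_4x1000_50k  = 2346.22
t_4x1000_100k = 3034.75

# Generator totals for the two 4x1000x500 runs.
gen_50k  = 657.478
gen_100k = 1308.53

# Doubling both layer count and hidden size roughly doubles epoch time.
print(f"network scaling: {t_4x1000_50k / t_2x500_50k:.2f}x")    # ~1.92x
# Doubling the vocabulary adds ~29% to the epoch, and the generator's
# cost itself roughly doubles.
print(f"vocab scaling:   {t_4x1000_100k / t_4x1000_50k:.2f}x")  # ~1.29x
print(f"generator cost:  {gen_100k / gen_50k:.2f}x")            # ~1.99x
```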
(*) With the `-cudnn` RNN option (using the cuDNN LSTM implementation), the encoder time drops to:

  ```
  encoder:{total:196.527,fwd:69.1534,bwd:127.304}
  ```
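From the encoder totals above, the cuDNN implementation gives roughly a 2x encoder speedup on this configuration:

```python
# Encoder totals for the 4x1000x500 / 100K run, taken from the figures above.
encoder_default = 407.396   # default LSTM implementation
encoder_cudnn   = 196.527   # with the -cudnn option (cuDNN LSTM)

speedup = encoder_default / encoder_cudnn
print(f"cuDNN encoder speedup: {speedup:.2f}x")  # ~2.07x
```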