Dynamic dataset (file sampling) question

dbl · July 18, 2018, 6:44pm

Say I have 58 file basenames that I want to sample from equally, e.g., 100k segments from each, so that an epoch has 5.8M segment pairs. What’s the best way to do this?

Can the WEIGHT values in the -gsample_dist file be floats, or must they be ints? If only ints are allowed, could there be an option to specify the NUMBER of segments per LuaPattern instead of a WEIGHT?

dbl · July 18, 2018, 10:13pm

Update:

With just -gsample 5800000, it samples in proportion to the size of the files (not what I want).

With -gsample 5800000 -gsample_dist sample.dist (where sample.dist lists every file with a weight of 1.7241379310344827586206896551724, i.e. 100/58), it samples 0 segments from each file, then dies because the dataset is empty. The same happens with all files weighted at 1.
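As a rough illustration of the arithmetic: if the sampler splits the -gsample total in proportion to each file's share of the summed weights (an assumption here, not confirmed behaviour of the tool), then a weight of 100/58 per file and a weight of 1 per file ask for the same thing. The `allocate` helper and the file names below are hypothetical.

```python
def allocate(total, weights):
    """Split `total` samples across files in proportion to their weights.

    Assumes weights are normalised by their sum; this is an illustrative
    model, not the sampler's actual code.
    """
    weight_sum = sum(weights.values())
    return {name: round(total * w / weight_sum) for name, w in weights.items()}


# Hypothetical basenames; the real ones are not given in the thread.
files = [f"corpus{i:02d}" for i in range(58)]

as_float = allocate(5_800_000, {name: 100 / 58 for name in files})
as_int = allocate(5_800_000, {name: 1 for name in files})

print(as_float["corpus00"], as_int["corpus00"])
```

Under that assumption, only the ratio between weights matters, so equal weights of any value request 100k segments per file.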

Update, part 2:

The problem was with my sample.dist file. Lua’s pattern matching wasn’t playing nicely with my filenames, so I adjusted the patterns. It looks like it works now.
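The thread doesn't say which characters caused the mismatch. One common cause is that Lua patterns treat `^$()%.[]*+-?` as magic characters, so a literal `-` or `.` in a filename has to be escaped with `%`. Below is a minimal sketch of a helper that escapes a basename before it goes into the distribution file; `lua_escape` and the example filename are hypothetical.

```python
# Characters with special meaning in Lua patterns (per the Lua reference manual).
LUA_MAGIC = set("^$()%.[]*+-?")


def lua_escape(text):
    """Prefix every Lua magic character with '%' so it matches literally."""
    return "".join("%" + ch if ch in LUA_MAGIC else ch for ch in text)


# Hypothetical basename containing '-' and '.', both magic in Lua patterns.
print(lua_escape("news-2018.en-de"))  # news%-2018%.en%-de
```

A pattern that matches no file would leave that file with nothing sampled from it, which fits the "0 from each file" symptom described above.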