Using onmt.inputters.Dataset

Rajashan · July 26, 2019, 10:42pm

How does one actually construct a dataset in the current version? Previously, it could be done with just the path to the desired text file. Now, however, onmt.inputters.Dataset requires the following:

Args:
        fields (dict[str, Field]): a dict with the structure
            returned by :func:`onmt.inputters.get_fields()`. Usually
            that means the dataset side, ``"src"`` or ``"tgt"``. Keys match
            the keys of items yielded by the ``readers``, while values
            are lists of (name, Field) pairs. An attribute with this
            name will be created for each :class:`torchtext.data.Example`
            object and its value will be the result of applying the Field
            to the data that matches the key. The advantage of having
            sequences of fields for each piece of raw input is that it allows
            the dataset to store multiple "views" of each input, which allows
            for easy implementation of token-level features, mixed word-
            and character-level models, and so on. (See also
            :class:`onmt.inputters.TextMultiField`.)
        readers (Iterable[onmt.inputters.DataReaderBase]): Reader objects
            for disk-to-dict. The yielded dicts are then processed
            according to ``fields``.
        data (Iterable[Tuple[str, Any]]): (name, ``data_arg``) pairs
            where ``data_arg`` is passed to the ``read()`` method of the
            reader in ``readers`` at that position. (See the reader object for
            details on the ``Any`` type.)
        dirs (Iterable[str or NoneType]): A list of directories where
            data is contained. See the reader object for more details.
        sort_key (Callable[[torchtext.data.Example], Any]): A function
            for determining the value on which data is sorted (i.e. length).
        filter_pred (Callable[[torchtext.data.Example], bool]): A function
            that accepts Example objects and returns a boolean value
            indicating whether to include that example in the dataset.

So how do I set this up if all I have is a text file? Do I have to build an iterator for the data argument first, and if so, how?

francoishernandez · July 27, 2019, 6:05am

Hi,

You have to preprocess your data before building an iterator over the shards.
Have a look at preprocess.py. If you want to do this from within your own code, I think you can call the build_save_dataset function directly.

Once you have your preprocessed dataset, in the form of pickled .pt file(s), you can build an iterator with build_dataset_iter.

francoishernandez · July 27, 2019, 6:13am

Actually, you might want to have a look here.

You can build a dataset directly from raw text files. You just have to construct the required objects (fields, readers, etc.) yourself. This is basically what the preprocess main function does.
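
To make the readers / data / dirs arguments from the docstring a bit more concrete, here is a minimal pure-Python sketch of how they line up by position. It does not import OpenNMT-py: `LineReader` and `build_examples` are hypothetical stand-ins written for illustration, and the assumption that a text reader accepts either a file path or an iterable of lines should be checked against the actual reader's source. Fields (tokenization, vocab) are left out entirely.

```python
import os
import tempfile


class LineReader:
    """Hypothetical stand-in for a text reader (not the OpenNMT-py class)."""

    def read(self, data_arg, side, _dir=None):
        # Assumption: data_arg is either a path to a text file
        # or an iterable of lines already in memory.
        if isinstance(data_arg, str):
            with open(data_arg, encoding="utf-8") as f:
                data_arg = f.read().splitlines()
        for i, line in enumerate(data_arg):
            # One dict per line, keyed by the side name ("src" or "tgt").
            yield {side: line, "indices": i}


def build_examples(readers, data, dirs, filter_pred=None):
    """Pair readers, data and dirs by position and merge the yielded dicts."""
    streams = [
        reader.read(data_arg, name, _dir)
        for reader, (name, data_arg), _dir in zip(readers, data, dirs)
    ]
    examples = []
    for parts in zip(*streams):
        example = {}
        for part in parts:
            example.update(part)
        # Keep the example only if the predicate accepts it.
        if filter_pred is None or filter_pred(example):
            examples.append(example)
    return examples


def demo():
    # Write two tiny parallel text files and read them back.
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "src.txt")
        tgt_path = os.path.join(tmp, "tgt.txt")
        with open(src_path, "w", encoding="utf-8") as f:
            f.write("hello world\nhi\n")
        with open(tgt_path, "w", encoding="utf-8") as f:
            f.write("bonjour le monde\nsalut\n")
        return build_examples(
            readers=[LineReader(), LineReader()],
            data=[("src", src_path), ("tgt", tgt_path)],
            dirs=[None, None],
        )


examples = demo()
```

In this sketch the data argument is just a list of (name, path) pairs, one per reader, so no iterator has to be built by hand. Whether the real readers behave the same way is the assumption noted above.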

Still, it’s much easier to preprocess your data first and use the pickled dataset(s). Another example can be found here.

Rajashan · July 27, 2019, 12:20pm

I see, I was trying to reinvent the wheel. Thank you for the insight!