reformat docstrings for sphinx
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "trainlib"
-version = "0.1.0"
+version = "0.1.1"
 description = "Minimal framework for ML modeling. Supports advanced dataset operations and streamlined training."
 requires-python = ">=3.13"
 authors = [
@@ -1,5 +1,5 @@
 """
-Marginalizing out the modality layer:
+.. admonition:: Marginalizing out the modality layer
 
 With ``domain`` being an instance variable, one possible interpretation of
 the object structures here is that one could completely abstract away
@@ -58,7 +58,10 @@ Marginalizing out the modality layer:
 particular case of ``_process_batch_data()``, it feels much better when
 it's on the inside.)
 
-Holding:
+.. admonition:: Holding area
+
+    .. code-block:: python
 
 @abstractmethod
 def _get_uri_groups(self) -> Iterable[tuple[U, ...]]:
 Get URI groups for each batch.
@@ -67,19 +70,19 @@ Holding:
 metadata file), zip the URIs such that we have a tuple of URIs per
 batch.
 
-Note that this effectively defines the index style over batches in the
-attached domain. We get an ``int -> tuple[U, ...]`` map that turns
-batch indices into URIs that can be read under the domain.
+Note that this effectively defines the index style over batches in
+the attached domain. We get an ``int -> tuple[U, ...]`` map that
+turns batch indices into URIs that can be read under the domain.
 
 ``get_batch()`` turns an integer index into its corresponding
-``tuple[U, ...]``, reading the resources with ``_read_resources()`` in
-the tuple, treating them as providers of batched data.
+``tuple[U, ...]``, reading the resources with ``_read_resources()``
+in the tuple, treating them as providers of batched data.
 
 ``_read_resources()`` passes through to the attached domain logic,
-which, although common, need not supply an explicit iterable of batch
-items: we just access items with ``__getitem__()`` and may ask for
-``__len__``. So the returned URI group collection (this method) does
-need to be iterable to measure the number of batches, but the batch
-objects that are ultimately produced by these URI groups need not be
-iterables themselves.
+which, although common, need not supply an explicit iterable of
+batch items: we just access items with ``__getitem__()`` and may
+ask for ``__len__``. So the returned URI group collection (this
+method) does need to be iterable to measure the number of batches,
+but the batch objects that are ultimately produced by these URI
+groups need not be iterables themselves.
 
 raise NotImplementedError
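The zipping contract described in this hunk can be sketched concretely. A minimal, hypothetical implementation (the per-modality URI lists and file names are invented for illustration, and a plain function stands in for the abstract method):

```python
from collections.abc import Iterable


def get_uri_groups(
    feature_uris: list[str], label_uris: list[str]
) -> Iterable[tuple[str, ...]]:
    # Zip the per-modality URI lists: batch i is backed by
    # (feature_uris[i], label_uris[i]). The returned collection is
    # iterable so the number of batches can be measured, but the batch
    # objects eventually produced from these groups need not be.
    return list(zip(feature_uris, label_uris))


# The int -> tuple[U, ...] map the docstring mentions: indexing into
# the returned collection turns a batch index into its URI group.
groups = list(get_uri_groups(["f0.npy", "f1.npy"], ["l0.npy", "l1.npy"]))
# groups[0] == ("f0.npy", "l0.npy")
```

This is only a sketch of the contract, not the library's implementation.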
@@ -91,20 +94,24 @@ Holding:
 Read batch files at the provided paths.
 
 This method should operate on a single tuple from the list of batch
-tuples returned by the ``_get_uri_groups()`` method. That is, it reads
-all of the resources for a single batch and returns a tuple of the same
-size with their contents.
+tuples returned by the ``_get_uri_groups()`` method. That is, it
+reads all of the resources for a single batch and returns a tuple
+of the same size with their contents.
 
 Note: the dependence on a batch index is mostly here to make
-multi-dataset composition easier later. In-dataset, you don't need to
-know the batch index to simply process URIs, but across datasets you
-need it to find out the origin of the batch (and process those URIs
-accordingly).
+multi-dataset composition easier later. In-dataset, you don't need
+to know the batch index to simply process URIs, but across
+datasets you need it to find out the origin of the batch (and
+process those URIs accordingly).
 
 return tuple(self.domain.read(uri) for uri in uri_group)
 
-# pulling the type variable out of the inline generic b/c `ty` has trouble
-# understanding bound type variables in subclasses (specifically with Self@)
+.. code-block:: python
+
+# pulling the type variable out of the inline generic b/c `ty` has
+# trouble understanding bound type variables in subclasses
+# (specifically with Self@)
 
 T = TypeVar("T", bound=NamedTuple)
 
 class NamedTupleDataset[I](Dataset):
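The pass-through read shown above amounts to a shape-preserving map over one URI group. A toy sketch, with a dict standing in for the attached domain (not the library's ``Domain`` API):

```python
# Stand-in "domain": URI -> resource contents. Keys and payloads are
# made up for the sketch.
domain = {"f0.npy": [1.0, 2.0], "l0.npy": [0, 1]}


def read_resources(uri_group: tuple[str, ...]) -> tuple:
    # Same shape in, same shape out: one content object per URI in the
    # group, mirroring `tuple(self.domain.read(uri) for uri in uri_group)`.
    return tuple(domain[uri] for uri in uri_group)


contents = read_resources(("f0.npy", "l0.npy"))
# contents == ([1.0, 2.0], [0, 1])
```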
@@ -156,38 +163,41 @@ class BatchedDataset[U, R, I](Dataset):
 which are used to concretize a domain ``Domain[U, R]``), and an item type
 ``T`` (which has a ``tuple`` upper bound).
 
-Pipeline overview:
-
-```
+.. admonition:: Pipeline overview
+
+    .. code-block:: python
 
 Domain -> [U] (get _batch_uris)
 U -> R (domain access ; Rs provide batches)
 R -> [I] (cache here ; _process_batch_data to use load_transform)
 [I] -> I (human item obj ; _get_item)
 I -> **P (final packed item ; __getitem__ to use transform)
-```
 
 Note^1: as far as positioning, this class is meant to play nice with
 PyTorch DataLoaders, hence the inheritance from ``torch.Dataset``. The
-value add for this over the ``torch.Dataset`` base is almost entirely in
-the logic it implements to map out of *batched resources* that are holding
-data, and flattening it out into typical dataset items. There are also some
-QoL items when it comes to splitting and balancing samples.
+value add for this over the ``torch.Dataset`` base is almost entirely
+in the logic it implements to map out of *batched resources* that are
+holding data, and flattening it out into typical dataset items. There
+are also some QoL items when it comes to splitting and balancing
+samples.
 
-Note^2: even though ``Domains`` implement iterators over their URIs, this
-doesn't imply a ``BatchedDataset`` is iterable. This just means we can walk
-over the resources that provide data, but we don't necessarily presuppose
-an ordered walk over samples within batches. Point being:
+Note^2: even though ``Domains`` implement iterators over their URIs,
+this doesn't imply a ``BatchedDataset`` is iterable. This just means we
+can walk over the resources that provide data, but we don't necessarily
+presuppose an ordered walk over samples within batches. Point being:
 ``torch.Dataset``, not ``torch.IterableDataset``, is the appropriate
 superclass, even when we're working around iterable ``Domains``.
 
 Note^3: transforms are expected to operate on ``I``-items and produce
-``I``-items. They shouldn't be the "introducers" of ``I`` types from some
-other intermediate representation, nor should they map from ``I`` to
-something else. Point being: the dataset definition should be able to map
-resources ``R`` to ``I`` without a transform: that much should be baked
-into the class definition. If you find you're expecting the transform to do
-that for you, you should consider pulling in some common structure across
-the allowed transforms and make it a fixed part of the class.
+``I``-items. They shouldn't be the "introducers" of ``I`` types from
+some other intermediate representation, nor should they map from ``I``
+to something else. Point being: the dataset definition should be able
+to map resources ``R`` to ``I`` without a transform: that much should
+be baked into the class definition. If you find you're expecting the
+transform to do that for you, you should consider pulling in some
+common structure across the allowed transforms and make it a fixed part
+of the class.
 """
 
 def __init__(
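The flattening step this pipeline describes (batched resources out to typical dataset items) boils down to resolving a flat item index to a batch and an offset. A minimal sketch of that index walk; ``batch_lengths`` and ``locate`` are illustrative names, not part of the class:

```python
# Lengths of each cached batch, e.g. as measured via __len__ on the
# batch objects. Values are made up for the sketch.
batch_lengths = [3, 2, 4]


def locate(item_index: int) -> tuple[int, int]:
    # Walk the batches, subtracting each batch's length until the flat
    # index falls inside one; return (batch index, offset in batch).
    for batch_index, length in enumerate(batch_lengths):
        if item_index < length:
            return batch_index, item_index
        item_index -= length
    raise IndexError("item index out of range")


locate(4)  # -> (1, 1): second batch, second item
```

A ``__getitem__`` built on this stays random-access, which is why ``torch.Dataset`` rather than ``torch.IterableDataset`` fits, per Note^2.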
@@ -268,19 +268,21 @@ class Trainer[I, K: EstimatorKwargs]:
 """
 Note: this method attempts to implement a general scheme for passing
 needed items to the estimator's loss function from the dataloader. The
-abstract `Estimator` base only requires the model output be provided
+abstract ``Estimator`` base only requires the model output be provided
 for any given loss calculation, but concrete estimators will often
-require additional arguments (e.g., labels or length masks, as
-is the case with sequential models). In any case, this method defers
-any further logic to the `loss` method of the underlying estimator, so
+require additional arguments (e.g., labels or length masks, as is the
+case with sequential models). In any case, this method defers any
+further logic to the ``loss`` method of the underlying estimator, so
 one should take care to synchronize the sample structure with `dataset`
-to match that expected by `self.estimator.loss(...)`.
+to match that expected by ``self.estimator.loss(...)``.
 
-On batch_estimator_map:
+.. admonition:: On batch_estimator_map
 
-Dataloader collate functions are responsible for mapping a collection
-of items into an item of collections, roughly speaking. If items are
-tuples of tensors,
+Dataloader collate functions are responsible for mapping a
+collection of items into an item of collections, roughly speaking.
+If items are tuples of tensors,
 
 .. code-block::
 
 [
 ( [1, 1], [1, 1] ),
@@ -291,6 +293,8 @@ class Trainer[I, K: EstimatorKwargs]:
 the collate function maps back into the item skeleton, producing a
 single tuple of (stacked) tensors
 
 .. code-block::
 
 ( [[1, 1],
 [2, 2],
 [3, 3]],
@@ -299,9 +303,10 @@ class Trainer[I, K: EstimatorKwargs]:
 [2, 2],
 [3, 3]] )
 
-This function should map from batches (which should be *item shaped*,
-i.e., have an `I` skeleton, even if stacked items may be different on
-the inside) into estimator keyword arguments (type `K`).
+This function should map from batches (which should be *item
+shaped*, i.e., have an ``I`` skeleton, even if stacked items may be
+different on the inside) into estimator keyword arguments (type
+``K``).
 
 Parameters:
 lr: learning rate (default: 1e-3)
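The collate-then-map flow described in these hunks can be sketched end to end, with plain lists standing in for stacked tensors. Both ``collate`` and ``batch_estimator_map`` here are simplified stand-ins, and the keyword-argument names are invented for the sketch (not the library's actual ``K`` fields):

```python
def collate(items):
    # Transpose a collection of item tuples into an item-shaped tuple
    # of "stacked" collections, preserving the I skeleton (lists stand
    # in for stacked tensors).
    return tuple(list(field) for field in zip(*items))


def batch_estimator_map(batch):
    # Hypothetical map from an item-shaped batch into estimator keyword
    # arguments; the kwarg names are illustrative only.
    inputs, labels = batch
    return {"outputs_target": labels, "lengths": [len(x) for x in inputs]}


# The docstring's worked example: three (input, label) tuple items...
batch = collate([([1, 1], [1, 1]), ([2, 2], [2, 2]), ([3, 3], [3, 3])])
# batch == ([[1, 1], [2, 2], [3, 3]], [[1, 1], [2, 2], [3, 3]])

# ...then mapped into kwargs for the estimator's loss.
kwargs = batch_estimator_map(batch)
```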