## Overview

Minimal framework for ML modeling, supporting advanced dataset operations and streamlined training workflows.
## Install

The trainlib package can be installed from PyPI:

```shell
pip install trainlib
```
## Development

- Initialize/synchronize the project with `uv sync`, which creates a virtual environment with the base package dependencies.
- Depending on your needs, install the development dependencies with `uv sync --extra dev`.
## Testing

- To run the unit tests, first install the test dependencies with `uv sync --extra test`, then run `make test`.
- For notebook testing, run `make install-kernel` to make the environment available as a Jupyter kernel (to be selected when running notebooks).
## Documentation

- Install the documentation dependencies with `uv sync --extra doc`.
- Run `make docs-build` (optionally preceded by `make docs-clean`), and serve the documentation locally with `make docs-serve`.
## Development remarks

Across `Trainer`/`Estimator`/`Dataset`, I've considered a `ParamSpec`-based typing scheme to better orchestrate alignment in the `Trainer.train()` loop, e.g., so we can statically check whether a dataset appears to fulfill the argument requirements of the estimator's `loss()`/`metrics()` methods. Something like

```python
class Estimator[**P](nn.Module):
    def loss(
        self,
        input: Tensor,
        *args: P.args,
        **kwargs: P.kwargs,
    ) -> Generator: ...

class Trainer[**P]:
    def __init__(
        self,
        estimator: Estimator[P],
        ...
    ): ...
```

might be how we begin threading signatures. But ensuring dataset items can match `P` is challenging. One option is a "packed" object that hides the mechanics of passing data through `P`-typed signatures:

```python
class PackedItem[**P]:
    def __init__(self, *args: P.args, **kwargs: P.kwargs) -> None:
        self._args = args
        self._kwargs = kwargs

    def apply[R](self, func: Callable[P, R]) -> R:
        return func(*self._args, **self._kwargs)

class BatchedDataset[U, R, I, **P](Dataset):
    @abstractmethod
    def _process_item_data(
        self,
        item_data: I,
        item_index: int,
    ) -> PackedItem[P]: ...

    def __iter__(self) -> Iterator[PackedItem[P]]: ...
```

What remains is meaningfully shaping those signatures, and that isn't really expressible with the flexibility of current type expressions. For instance, if I try to appropriately type my base `TupleDataset`:

```python
class SequenceDataset[I, **P](HomogenousDataset[int, I, I, P]): ...

class TupleDataset[I](SequenceDataset[tuple[I, ...], "?"]): ...
```

there's no way to shape a `ParamSpec` that indicates arbitrarily many arguments of a fixed type (`I` in this case), which is what would let me unpack my item tuples into an appropriate `PackedItem`.

Until this (among other issues) becomes clearer, I'm setting up around a simpler `TypedDict` type variable. We won't have particularly strong static checks for item alignment inside `Trainer`, but this seems about as good as I can get with the current typing infrastructure.