# Overview

Package summary goes here, ideally with a diagram

# Install

Installation instructions

```sh
pip install <package>
```

or as a CLI tool

```sh
uv tool install <package>
```

# Development

- Initialize/synchronize the project with `uv sync`, creating a virtual
  environment with the base package dependencies.
- Depending on your needs, install the development dependencies with
  `uv sync --extra dev`.

# Testing

- To run the unit tests, first install the test dependencies with
  `uv sync --extra test`, then run `make test`.
- For notebook testing, run `make install-kernel` to make the environment
  available as a Jupyter kernel (to be selected when running notebooks).

# Documentation

- Install the documentation dependencies with `uv sync --extra doc`.
- Run `make docs-build` (optionally preceded by `make docs-clean`), and serve
  locally with `make docs-serve`.

# Development remarks

Across `Trainer` / `Estimator` / `Dataset`, I've considered a
`ParamSpec`-based typing scheme to better orchestrate alignment in the
`Trainer.train()` loop, e.g., so we can statically check whether a dataset
appears to fulfill the argument requirements of the estimator's
`loss()` / `metrics()` methods. Something like

```py
class Estimator[**P](nn.Module):
    def loss(
        self,
        input: Tensor,
        *args: P.args,
        **kwargs: P.kwargs,
    ) -> Generator:
        ...


class Trainer[**P]:
    def __init__(
        self,
        estimator: Estimator[P],
        ...
    ): ...
```

might be how we begin threading signatures. But ensuring that dataset items
can match `P` is challenging. One option is a "packed" object that hides the
data being passed through `P`-shaped signatures:

```py
class PackedItem[**P]:
    def __init__(self, *args: P.args, **kwargs: P.kwargs) -> None:
        self._args = args
        self._kwargs = kwargs

    def apply[R](self, func: Callable[P, R]) -> R:
        return func(*self._args, **self._kwargs)


class BatchedDataset[U, R, I, **P](Dataset):
    @abstractmethod
    def _process_item_data(
        self,
        item_data: I,
        item_index: int,
    ) -> PackedItem[P]:
        ...

    def __iter__(self) -> Iterator[PackedItem[P]]:
        ...
```

Meaningfully shaping those signatures is what remains, but typical
type-expression flexibility doesn't really allow it. For instance, if I'm
trying to appropriately type my base `TupleDataset`:

```py
class SequenceDataset[I, **P](HomogenousDataset[int, I, I, P]):
    ...


class TupleDataset[I](SequenceDataset[tuple[I, ...], ??]):
    ...
```

Here there's no way for me to shape a `ParamSpec` to indicate arbitrarily
many arguments of a fixed type (`I` in this case) to allow me to unpack my
item tuples into an appropriate `PackedItem`.

Until this (among other issues) becomes clearer, I'm setting up around a
simpler `TypedDict` type variable. We won't have particularly strong static
checks for item alignment inside `Trainer`, but this seems about as good as I
can get with the current infrastructure.
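
As an illustration of the `TypedDict` direction (the names here are
hypothetical, not the package's actual API): items are plain dicts checked
against a `TypedDict`, and unpacking them into `loss()` is only loosely
verified:

```python
from typing import TypedDict


class RegressionItem(TypedDict):
    input: float
    target: float


def loss(input: float, target: float) -> float:
    return (input - target) ** 2


item: RegressionItem = {"input": 3.0, "target": 1.0}
# A type checker validates the dict against RegressionItem, but not that
# its keys line up with loss()'s parameters -- the weaker guarantee
# accepted here.
print(loss(**item))  # 4.0
```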