Releases: tinygrad/tinygrad
tinygrad 0.9.0
Close to the new line limit of 8000 lines, sitting at 7958 lines.
tinygrad is much more usable now.
Just over 1200 commits since `0.8.0`.
Release Highlights
- New documentation: https://docs.tinygrad.org
- `gpuctypes` has been brought in tree and is no longer an external dependency. [#3253]
- `AMD=1` and `NV=1` experimental backends that don't require any userspace runtime components like ROCm or CUDA.
  - These backends should reduce the amount of python time, specifically with multi-gpu use cases.
- `PTX=1` for rendering directly to PTX instead of CUDA. [#3139] [#3623] [#3775]
- Nvidia tensor core support. [#3544]
- `THREEFRY=1` for numpy-less random number generation using threefry2x32. [#2601] [#3785]
- More stabilized multi-tensor API.
- Core tinygrad has been refactored into 4 pieces, read more about it here.
- Linearizer and codegen have support for generating kernels with multiple outputs.
- Lots of progress towards greater kernel fusion in the scheduler.
  - Fusing of ReduceOps with their elementwise children. This trains mnist and gpt2 with ~20% fewer kernels and makes llama inference faster.
  - New LoadOps.ASSIGN allows fusing optimizer updates with grad.
  - Schedule kernels in BFS order. This improves resnet and llama speed.
  - W.I.P. on fusing multiple reduces: [#4259] [#4208]
- MLPerf ResNet and BERT, with a W.I.P. UNet3D.
- Llama 3 support, with a new `llama3.py` that provides an OpenAI-compatible API. [#4576]
- NF4 quantization support in Llama examples. [#4540]
- `label_smoothing` has been added to `sparse_categorical_crossentropy`. [#3568]
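Label smoothing replaces the one-hot target with a mix of the hard label and a uniform distribution: with smoothing ε over K classes, the true class gets probability 1-ε+ε/K and every other class gets ε/K. A minimal pure-Python sketch of the resulting smoothed cross-entropy (this is not tinygrad's implementation, just the formula; the function name is illustrative):

```python
import math

def smoothed_cross_entropy(log_probs, target, label_smoothing=0.0):
    """Cross-entropy against a label-smoothed one-hot target.

    log_probs: log-probabilities for each of the K classes.
    target: index of the true class.
    """
    k = len(log_probs)
    eps = label_smoothing
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # smoothed target: (1 - eps) on the true class, eps spread uniformly
        q = (1.0 - eps) * (1.0 if i == target else 0.0) + eps / k
        loss -= q * lp
    return loss

# with label_smoothing=0 this reduces to plain negative log-likelihood
log_probs = [math.log(p) for p in [0.7, 0.2, 0.1]]
print(smoothed_cross_entropy(log_probs, 0))                       # -log(0.7)
print(smoothed_cross_entropy(log_probs, 0, label_smoothing=0.1))
```

With ε > 0 the loss stays bounded away from zero even on confident predictions, which discourages overconfident logits.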
Known Issues
- Using tinygrad in a conda env on macOS is known to cause problems with the `METAL` backend. See #2226.
See the full changelog: v0.8.0...v0.9.0
See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc
Join the Discord!
tinygrad 0.8.0
Close to the new line limit of 5000 lines, at 4981.
Release Highlights
- Real dtype support within kernels!
- New `.schedule()` API to separate the concerns of scheduling and running
- New lazy.py implementation doesn't reorder at build time; `GRAPH=1` is usable to debug issues
- 95 TFLOP FP16->FP32 matmuls on 7900XTX
- GPT2 runs (jitted) in 2 ms on NVIDIA 3090
- Powerful and fast kernel beam search with `BEAM=2`
- GPU/CUDA/HIP backends switched to `gpuctypes`
- New (alpha) multigpu sharding API with `.shard`
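Kernel beam search explores hand-optimization actions applied to a kernel, keeping only the `BEAM` cheapest candidates at each depth instead of searching exhaustively. A toy pure-Python sketch of the idea, with a made-up action list and cost function standing in for real kernel timing (all names here are illustrative, not tinygrad's):

```python
def beam_search(start, actions, cost, width=2, depth=3):
    """Keep the `width` cheapest candidates while applying up to `depth` actions."""
    frontier = [start]
    best = start
    for _ in range(depth):
        # expand every candidate in the frontier by every action
        candidates = [act(state) for state in frontier for act in actions]
        if not candidates:
            break
        candidates.sort(key=cost)
        frontier = candidates[:width]          # prune to beam width
        if cost(frontier[0]) < cost(best):
            best = frontier[0]
    return best

# toy "kernel": state is a tuple of applied opts, cost is a fake runtime
cost = lambda s: 100 / (1 + len(set(s)))       # more distinct opts -> "faster"
actions = [lambda s, o=o: s + (o,) for o in ("UPCAST", "LOCAL", "UNROLL")]
best = beam_search((), actions, cost, width=2, depth=3)
print(best)
```

A real search times compiled candidate kernels on the device; the beam width trades search time against how good a schedule it finds.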
See the full changelog: v0.7.0...v0.8.0
Join the Discord!
tinygrad 0.7.0
Bigger again at 4311 lines :( But, tons of new features this time!
Just over 500 commits since `0.6.0`.
Release Highlights
- Windows support has been dropped to focus on Linux and Mac OS.
  - Some functionality may work on Windows, but no support will be provided; use WSL instead.
- DiskTensors: a way to store tensors on disk has been added.
  - This is coupled with functionality in `state.py`, which supports saving/loading safetensors and loading torch weights.
- Tensor Cores are supported on M1/Apple Silicon and on the 7900 XTX (WMMA).
  - Support on the 7900 XTX requires weights and data to be in float16; full float16 compute support will come in a later release.
  - Tensor Core behaviour/usage is controlled by the `TC` envvar.
- Kernel optimization with nevergrad
  - This optimizes the shapes going into the kernel, gated by the `KOPT` envvar.
- P2P buffer transfers are supported on most AMD GPUs when using a single python process.
  - This is controlled by the `P2P` envvar.
- LLaMA 2 support.
  - A requirement of this is bfloat16 support for loading the weights, which is semi-supported by casting them to float16; proper bfloat16 support is tracked at #1290.
  - The LLaMA example now also supports 8-bit quantization using the `--quantize` flag.
- Most MLPerf models have working inference examples. Training these models is currently being worked on.
- Initial multigpu training support.
  - Slow multigpu training by copying through host shared memory.
  - Somewhat follows torch's multiprocessing and DistributedDataParallel high-level design.
  - See the hlb_cifar10.py example.
- SymbolicShapeTracker and Symbolic JIT.
- These two things combined allow models with changing shapes to be jitted like transformers.
- This means that LLaMA can now be jitted for a massive increase in performance.
- Be warned that the API for this is very WIP and may change in the future, similarly with the rest of the tinygrad API.
- aarch64 and PTX assembly backends.
- WebGPU backend, see the `compile_efficientnet.py` example.
- Support for torch-like tensor indexing by other tensors.
- Some more `nn` layers were promoted, namely `Embedding` and various `Conv` layers.
- VITS and so-vits-svc examples added.
- Initial documentation work.
  - Quickstart guide: `/docs/quickstart.md`
  - Environment variable reference: `/docs/env_vars.md`
And lots of small optimizations all over the codebase.
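The safetensors format mentioned above is deliberately simple: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype/shape/byte offsets, then the raw tensor bytes. A minimal pure-Python reader/writer sketch of that file layout (not tinygrad's `state.py`; function names are illustrative):

```python
import json
import os
import struct
import tempfile

def save_safetensors(path, tensors):
    """tensors: dict of name -> (dtype_str, shape, raw_bytes)."""
    header, blobs, offset = {}, [], 0
    for name, (dtype, shape, data) in tensors.items():
        header[name] = {"dtype": dtype, "shape": list(shape),
                        "data_offsets": [offset, offset + len(data)]}
        blobs.append(data)
        offset += len(data)
    hdr = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(hdr)))   # 8-byte little-endian header size
        f.write(hdr)
        for b in blobs:
            f.write(b)

def load_safetensors(path):
    with open(path, "rb") as f:
        (hdr_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(hdr_len))
        body = f.read()
    return {name: (meta["dtype"], tuple(meta["shape"]),
                   body[meta["data_offsets"][0]:meta["data_offsets"][1]])
            for name, meta in header.items()}

path = os.path.join(tempfile.gettempdir(), "demo.safetensors")
save_safetensors(path, {"w": ("F32", (2, 2), struct.pack("<4f", 1, 2, 3, 4))})
print(load_safetensors(path)["w"][:2])  # ('F32', (2, 2))
```

Because offsets are in the header, a real loader can mmap the file and view tensors without copying, which pairs naturally with DiskTensors.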
See the full changelog: v0.6.0...v0.7.0
See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc
Join the Discord!
tinygrad 0.6.0
2516 lines now. Some day I promise a release will make it smaller.
- float16 support (needed for LLaMA)
- Fixed critical bug in training BatchNorm
- Limited support for multiple GPUs
- ConvNeXt + several MLPerf models in models/
- More torch-like methods in tensor.py
- Big refactor of the codegen into the Linearizer and CStyle
- Removed CompiledBuffer, use the LazyBuffer ShapeTracker
tinygrad 0.5.0
An upsetting 2223 lines of code, but so much great stuff!
- 7 backends: CLANG, CPU, CUDA, GPU, LLVM, METAL, and TORCH
- A TinyJit for speed (decorate your GPU function today)
- Support for a lot of onnx, including all the models in the backend tests
- No more MLOP convs, all HLOP (autodiff for convs)
- Improvements to shapetracker and symbolic engine
- 15% faster at running the openpilot model
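The TinyJit idea is capture-and-replay: run the function once in Python and record every kernel it launches, then on later calls replay the recorded launches and skip the Python dispatch entirely. A toy pure-Python decorator showing the shape of it (not tinygrad's TinyJit; all names here are made up):

```python
def tiny_jit(fn):
    """Toy JIT: trace fn's kernel launches once per input shape, then replay."""
    cache = {}
    def wrapped(x):
        key = len(x)                       # stand-in for the input "shape"
        if key not in cache:
            trace = []
            fn(x, trace.append)            # capture phase: record each launch
            cache[key] = trace
        out = list(x)
        for kernel in cache[key]:          # replay phase: just re-dispatch
            out = [kernel(v) for v in out]
        return out
    return wrapped

calls = []
def model(x, launch):
    calls.append(1)                        # counts how often Python tracing runs
    launch(lambda v: v * 2)
    launch(lambda v: v + 1)

jitted = tiny_jit(model)
print(jitted([1, 2, 3]))   # [3, 5, 7]
print(jitted([4, 5, 6]))   # [9, 11, 13] -- replayed, model not re-traced
print(len(calls))          # 1
```

The payoff is that after capture, per-call overhead is just the recorded dispatches, which is why a jitted GPT2 step can take milliseconds.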
tinygrad 0.4.0
So many changes since 0.3.0
Fairly stable and correct, though still not fast. The hlops/mlops are solid, just needs work on the llops.
The first automated release, so hopefully it works?