Add README

Dimitri Lozeve 2024-11-15 21:28:44 +01:00
parent 8943455447
commit 2f23505f07
3 changed files with 130 additions and 13 deletions

LICENSE Normal file

@@ -0,0 +1,28 @@
BSD 3-Clause License
Copyright (c) 2024, Dimitri Lozeve
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

README.md Normal file

@@ -0,0 +1,89 @@
# bqn-safetensors
[BQN](https://mlochbaum.github.io/BQN/) library to read and write arrays stored in the [safetensors](https://github.com/huggingface/safetensors) format.
## Why safetensors?
Safetensors is a format for storing multidimensional arrays (tensors)
safely and efficiently. Compared to the NPY format (see
[bqn-npy](https://github.com/dlozeve/bqn-npy)), it can store multiple
named arrays in a single file, and a single array can be read without
loading the entire file into memory.
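
Reading a single array is possible because of the file layout: a safetensors file starts with an 8-byte little-endian unsigned integer giving the header size, followed by a JSON header that maps each array name to its `dtype`, `shape`, and `data_offsets`, followed by the raw data. The Python sketch below uses only the standard library and is not part of this library; the array names `a` and `b` and the helpers `read_header` and `read_i32` are made up for illustration.

```python
import json
import struct

# Build a tiny in-memory safetensors buffer holding two hypothetical I32 arrays.
header = {
    "__metadata__": {"source": "example"},
    "a": {"dtype": "I32", "shape": [3], "data_offsets": [0, 12]},
    "b": {"dtype": "I32", "shape": [2], "data_offsets": [12, 20]},
}
header_bytes = json.dumps(header).encode("utf-8")
demo = (
    struct.pack("<Q", len(header_bytes))  # 8-byte little-endian header size
    + header_bytes                        # JSON header
    + struct.pack("<3i", 1, 2, 3)         # raw data of "a"
    + struct.pack("<2i", -4, 5)           # raw data of "b"
)


def read_header(buf):
    """Return the parsed header and the offset where the data section starts."""
    (size,) = struct.unpack_from("<Q", buf, 0)
    parsed = json.loads(bytes(buf[8:8 + size]).decode("utf-8"))
    return parsed, 8 + size


def read_i32(buf, name):
    """Decode one I32 array, touching only its own byte range."""
    parsed, data_start = read_header(buf)
    begin, end = parsed[name]["data_offsets"]  # relative to the data section
    count = (end - begin) // 4
    return list(struct.unpack_from(f"<{count}i", buf, data_start + begin))


header_read, data_start = read_header(demo)
names = [key for key in header_read if key != "__metadata__"]
print(names)                # ['a', 'b']
print(read_i32(demo, "b"))  # [-4, 5]
```

Since only the header and the requested byte range are accessed, the same approach works on a memory-mapped file.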
## Supported array types
This library can load arrays stored with any dtype supported by the
official library. However, it can only store arrays as 32-bit signed
integers (I32) or double-precision floating-point numbers (F64), which
are the native types of CBQN.
| Format | Description | Deserialization | Serialization |
|----------|--------------------------------------------------------------------------------------|-----------------|---------------|
| BOOL | Boolean type | ✅ | ❌ |
| U8 | Unsigned byte | ✅ | ❌ |
| I8 | Signed byte | ✅ | ❌ |
| F8\_E5M2 | [FP8](https://arxiv.org/abs/2209.05433)                                              | ✅              | ❌            |
| F8\_E4M3 | [FP8](https://arxiv.org/abs/2209.05433) | ✅ | ❌ |
| I16 | Signed integer (16-bit) | ✅ | ❌ |
| U16 | Unsigned integer (16-bit) | ✅ | ❌ |
| F16 | Half-precision floating point | ✅ | ❌ |
| BF16 | [Brain floating point](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) | ✅ | ❌ |
| I32 | Signed integer (32-bit) | ✅ | ✅ |
| U32 | Unsigned integer (32-bit) | ✅ | ❌ |
| F32 | Floating point (32-bit) | ✅ | ❌ |
| F64 | Floating point (64-bit) | ✅ | ✅ |
| I64 | Signed integer (64-bit) | ✅ | ❌ |
| U64 | Unsigned integer (64-bit) | ✅ | ❌ |
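
To illustrate what deserializing a narrower floating-point dtype involves, here is a minimal Python sketch (standard library only) that widens BF16 and F16 values to ordinary Python floats, which are 64-bit. It is not how this library is implemented, and the helper names `bf16_to_float` and `f16_to_float` are hypothetical. It relies on two facts: BF16 is the upper 16 bits of an IEEE 754 32-bit float, and Python's `struct` module decodes half precision with the `e` format.

```python
import struct


def bf16_to_float(raw):
    """Decode little-endian BF16 bytes into a list of Python floats."""
    values = []
    for (bits,) in struct.iter_unpack("<H", raw):
        # Put the 16 bits into the high half of a 32-bit float, low half zero.
        (value,) = struct.unpack("<f", struct.pack("<I", bits << 16))
        values.append(value)
    return values


def f16_to_float(raw):
    """Decode little-endian IEEE 754 half-precision bytes."""
    return [value for (value,) in struct.iter_unpack("<e", raw)]


# 0x3F80 is 1.0 and 0xC000 is -2.0 in BF16 (stored little-endian).
print(bf16_to_float(b"\x80\x3f\x00\xc0"))            # [1.0, -2.0]
print(f16_to_float(struct.pack("<2e", 1.5, -0.25)))  # [1.5, -0.25]
```

Every BF16 and F16 value is exactly representable as a 64-bit float, so this widening loses no information.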
## Usage
```bqn
⟨ExtractMetadata,GetArrayNames,GetArray,SerializeArrays⟩←•Import"safetensors.bqn"
# Use •file.MapBytes to avoid loading the entire file in memory, using memory-mapping instead.
bytes←•file.MapBytes"input.safetensors"
# Extract the metadata (a string -> string map) from the file.
ExtractMetadata bytes
# List the arrays present in the file.
GetArrayNames bytes
# Read an array from the file.
bytes GetArray "my_array"
# Create arrays and save them to a safetensors file.
a←3‿4‿5⥊↕60 # will be stored as I32
b←2‿3⥊3.14×↕6 # will be stored as F64
"output.safetensors"•file.Bytes "arr_a"‿"arr_b" SerializeArrays a‿b
```
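
For readers more familiar with Python, the following standard-library sketch shows roughly what bytes a serialization like the one above would need to produce. It is an illustration under assumptions, not this library's implementation: the dtype rule in `pick_dtype` (integers in the signed 32-bit range become I32, everything else F64) is a guess based on the comments in the example, and both helper names are hypothetical.

```python
import json
import struct


def pick_dtype(values):
    # Assumed rule: all-integer values in the signed 32-bit range -> I32, else F64.
    if all(float(v).is_integer() and -2**31 <= v < 2**31 for v in values):
        return "I32"
    return "F64"


def serialize(arrays):
    """arrays maps a name to (shape, flat row-major list of numbers)."""
    header, chunks, offset = {}, [], 0
    for name, (shape, values) in arrays.items():
        dtype = pick_dtype(values)
        if dtype == "I32":
            raw = struct.pack(f"<{len(values)}i", *(int(v) for v in values))
        else:
            raw = struct.pack(f"<{len(values)}d", *(float(v) for v in values))
        header[name] = {
            "dtype": dtype,
            "shape": list(shape),
            "data_offsets": [offset, offset + len(raw)],  # relative to data section
        }
        chunks.append(raw)
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    # 8-byte little-endian header size, then the JSON header, then the raw data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + b"".join(chunks)


blob = serialize({
    "arr_a": ((3, 4, 5), list(range(60))),            # 60 integers -> I32
    "arr_b": ((2, 3), [3.14 * i for i in range(6)]),  # 6 floats -> F64
})
```

With these inputs, `arr_a` occupies 240 bytes (60 values of 4 bytes) and `arr_b` occupies the following 48 bytes (6 values of 8 bytes).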
## How to test
The script [`gentest.py`](gentest.py) generates a test safetensors
file (`arrs.safetensors`) containing arrays of various shapes and
dtypes. It depends on the `numpy` and `safetensors` packages. The
easiest way to run it is via [`uv run`](https://docs.astral.sh/uv/guides/scripts/),
which automatically installs the dependencies in an isolated
environment:
```
uv run gentest.py
```
The script [`test.bqn`](test.bqn) then reads this test file and
displays its contents to check that they are identical to the ones
reported by Python. It also creates another set of arrays, serializes
them to a `test.safetensors` file, and deserializes it again.
```
bqn test.bqn
```
Finally, the script [`loadtest.py`](loadtest.py) reads the
BQN-generated safetensors file to ensure that it is readable from
Python.
```
uv run loadtest.py
```