Functions for conversion between 16-bit and 32-bit floating-point numbers, and for distance and norm calculations on 16-bit float arrays.
Functions
| SIMD_API void | SimdFloat32ToFloat16 (const float *src, size_t size, uint16_t *dst) | Converts an array of 32-bit floats to 16-bit float values. |
| SIMD_API void | SimdFloat16ToFloat32 (const uint16_t *src, size_t size, float *dst) | Converts an array of 16-bit float values to 32-bit floats. |
| SIMD_API void | SimdSquaredDifferenceSum16f (const uint16_t *a, const uint16_t *b, size_t size, float *sum) | Calculates sum of squared differences for two 16-bit float arrays. |
| SIMD_API void | SimdCosineDistance16f (const uint16_t *a, const uint16_t *b, size_t size, float *distance) | Calculates cosine distance of two 16-bit float arrays. |
| SIMD_API void | SimdCosineDistancesMxNa16f (size_t M, size_t N, size_t K, const uint16_t *const *A, const uint16_t *const *B, float *distances) | Calculates pairwise cosine distances for two sets of 16-bit float vectors. |
| SIMD_API void | SimdCosineDistancesMxNp16f (size_t M, size_t N, size_t K, const uint16_t *A, const uint16_t *B, float *distances) | Calculates pairwise cosine distances for two packed sets of 16-bit float vectors. |
| SIMD_API void | SimdVectorNormNa16f (size_t N, size_t K, const uint16_t *const *A, float *norms) | Calculates Euclidean norms for an array of 16-bit float vectors. |
| SIMD_API void | SimdVectorNormNp16f (size_t N, size_t K, const uint16_t *A, float *norms) | Calculates Euclidean norms for a packed array of 16-bit float vectors. |
Detailed Description
Functions for conversion between 16-bit and 32-bit floating-point numbers, and for distance and norm calculations on 16-bit float arrays.
Function Documentation
◆ SimdFloat32ToFloat16()
void SimdFloat32ToFloat16(const float *src, size_t size, uint16_t *dst)
Converts an array of 32-bit floats to 16-bit float values.
For each element the function stores the IEEE 754 binary16 representation of src[i] to dst[i]. The conversion handles sign, normal values, subnormal values, infinities and NaNs according to the internal half-precision conversion helper.
Parameters
- [in] src - a pointer to the input array of 32-bit floating-point numbers.
- [in] size - the number of elements in the input and output arrays.
- [out] dst - a pointer to the output array of 16-bit floating-point numbers.
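The per-element conversion described above can be sketched as a scalar reference. This is an illustrative sketch, not the library's implementation: the names float_to_half and float32_to_float16_ref are hypothetical, and round-to-nearest-even with NaN payloads dropped is an assumption, since the text does not state the rounding mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar helper (not part of the library API): converts one
   float to IEEE 754 binary16 bits, assuming round-to-nearest-even. */
static uint16_t float_to_half(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);                    /* reinterpret the float bits */
    uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);
    uint32_t mag = x & 0x7FFFFFFFu;              /* magnitude without the sign */

    if (mag > 0x7F800000u)                       /* NaN -> quiet NaN, payload dropped */
        return (uint16_t)(sign | 0x7E00u);
    if (mag >= 0x47800000u)                      /* >= 65536 or infinity -> infinity */
        return (uint16_t)(sign | 0x7C00u);
    if (mag >= 0x38800000u) {                    /* normal binary16 range (>= 2^-14) */
        uint32_t t = mag - 0x38000000u;          /* rebias exponent 127 -> 15 */
        t += 0x0FFFu + ((t >> 13) & 1u);         /* round to nearest, ties to even */
        return (uint16_t)(sign | (t >> 13));     /* a carry may round up to infinity */
    }
    uint32_t shift = 126u - (mag >> 23);         /* subnormal result, units of 2^-24 */
    if (shift > 24u)
        return sign;                             /* too small: signed zero */
    uint32_t m = (mag & 0x7FFFFFu) | 0x800000u;  /* significand with implicit bit */
    uint32_t q = m >> shift;
    uint32_t rem = m & ((1u << shift) - 1u);
    uint32_t half = 1u << (shift - 1u);
    if (rem > half || (rem == half && (q & 1u)))
        ++q;                                     /* ties to even */
    return (uint16_t)(sign | q);
}

/* Hypothetical reference with the same parameter order as SimdFloat32ToFloat16. */
void float32_to_float16_ref(const float *src, size_t size, uint16_t *dst)
{
    assert(size == 0 || (src != NULL && dst != NULL));
    for (size_t i = 0; i < size; ++i)
        dst[i] = float_to_half(src[i]);
}
```

A scalar reference of this kind is mainly useful for checking the output of a vectorized conversion element by element.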
◆ SimdFloat16ToFloat32()
void SimdFloat16ToFloat32(const uint16_t *src, size_t size, float *dst)
Converts an array of 16-bit float values to 32-bit floats.
For each element the function expands the IEEE 754 binary16 value src[i] to a 32-bit float value dst[i], including normal values, subnormal values, infinities and NaNs.
Parameters
- [in] src - a pointer to the input array of 16-bit floating-point numbers.
- [in] size - the number of elements in the input and output arrays.
- [out] dst - a pointer to the output array of 32-bit floating-point numbers.
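The expansion from binary16 to a 32-bit float is exact, so it can be sketched without any rounding logic. The sketch below is illustrative only; half_to_float and float16_to_float32_ref are hypothetical names, not library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar helper: expands one IEEE 754 binary16 value to float. */
static float half_to_float(uint16_t h)
{
    uint32_t exp = (h >> 10) & 0x1Fu;   /* 5-bit exponent, bias 15 */
    uint32_t man = h & 0x3FFu;          /* 10-bit mantissa */
    uint32_t bits;
    float f;

    if (exp == 0) {                     /* zero or subnormal: man * 2^-24 */
        f = (float)man * (1.0f / 16777216.0f);
        return (h & 0x8000u) ? -f : f;
    }
    bits = ((uint32_t)(h & 0x8000u) << 16)               /* sign */
         | ((exp == 0x1Fu ? 0xFFu : exp + 112u) << 23)   /* Inf/NaN, or rebias 15 -> 127 */
         | (man << 13);                                  /* widen the mantissa */
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical reference with the same parameter order as SimdFloat16ToFloat32. */
void float16_to_float32_ref(const uint16_t *src, size_t size, float *dst)
{
    assert(size == 0 || (src != NULL && dst != NULL));
    for (size_t i = 0; i < size; ++i)
        dst[i] = half_to_float(src[i]);
}
```

For example, the input 0x3C00 expands to 1.0f and 0x0001 (the smallest subnormal) expands to 2^-24.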
◆ SimdSquaredDifferenceSum16f()
void SimdSquaredDifferenceSum16f(const uint16_t *a, const uint16_t *b, size_t size, float *sum)
Calculates sum of squared differences for two 16-bit float arrays.
The input values are IEEE 754 binary16 values stored in uint16_t elements. Each element is converted to 32-bit float before subtraction and accumulation. Input arrays must have the same size.
Algorithm description:
    da = Float16ToFloat32(a[i]) - Float16ToFloat32(b[i]);
    sum[0] = Sum(da*da);
Parameters
- [in] a - a pointer to the first 16-bit float array.
- [in] b - a pointer to the second 16-bit float array.
- [in] size - the number of elements in each input array.
- [out] sum - a pointer to the 32-bit float sum of squared differences.
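The algorithm description above maps directly onto a scalar loop. The following is a minimal sketch under stated assumptions: half_to_float and squared_difference_sum16f_ref are hypothetical names, and the accumulation order of a vectorized implementation may differ, so results can differ in the last bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar helper: expands one IEEE 754 binary16 value to float. */
static float half_to_float(uint16_t h)
{
    uint32_t exp = (h >> 10) & 0x1Fu, man = h & 0x3FFu, bits;
    float f;
    if (exp == 0) {                     /* zero or subnormal: man * 2^-24 */
        f = (float)man * (1.0f / 16777216.0f);
        return (h & 0x8000u) ? -f : f;
    }
    bits = ((uint32_t)(h & 0x8000u) << 16)
         | ((exp == 0x1Fu ? 0xFFu : exp + 112u) << 23)
         | (man << 13);
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical reference: sum of squared differences accumulated in float. */
void squared_difference_sum16f_ref(const uint16_t *a, const uint16_t *b,
                                   size_t size, float *sum)
{
    assert(sum != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < size; ++i) {
        float d = half_to_float(a[i]) - half_to_float(b[i]);
        s += d * d;                     /* accumulate in 32-bit float */
    }
    *sum = s;
}
```

With a = {1, 2, 3} and b = {0.5, 2, 5} encoded as binary16, the differences are 0.5, 0 and -2, so the sketch stores 4.25.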
◆ SimdCosineDistance16f()
void SimdCosineDistance16f(const uint16_t *a, const uint16_t *b, size_t size, float *distance)
Calculates cosine distance of two 16-bit float arrays.
The input values are IEEE 754 binary16 values stored in uint16_t elements. Each element is converted to 32-bit float before multiplication and accumulation. Input arrays must have the same size and non-zero Euclidean norm.
Algorithm description:
    fa = Float16ToFloat32(a[i]);
    fb = Float16ToFloat32(b[i]);
    distance[0] = 1 - Sum(fa*fb)/Sqrt(Sum(fa*fa)*Sum(fb*fb));
Parameters
- [in] a - a pointer to the first 16-bit float array.
- [in] b - a pointer to the second 16-bit float array.
- [in] size - the number of elements in each input array.
- [out] distance - a pointer to the 32-bit float cosine distance.
◆ SimdCosineDistancesMxNa16f()
void SimdCosineDistancesMxNa16f(size_t M, size_t N, size_t K, const uint16_t *const *A, const uint16_t *const *B, float *distances)
Calculates pairwise cosine distances for two sets of 16-bit float vectors.
A is an array of M pointers to vectors of length K, and B is an array of N pointers to vectors of length K. The input values are IEEE 754 binary16 values stored in uint16_t elements and are converted to 32-bit float for accumulation. Every input vector is expected to have non-zero Euclidean norm. The output matrix is stored in row-major order.
Algorithm description:
distances[i*N + j] = SimdCosineDistance16f(A[i], B[j], K);
Parameters
- [in] M - the number of vectors in A.
- [in] N - the number of vectors in B.
- [in] K - the number of elements in every A and B vector.
- [in] A - a pointer to an array of M pointers to 16-bit float vectors.
- [in] B - a pointer to an array of N pointers to 16-bit float vectors.
- [out] distances - a pointer to the resulting 32-bit float array holding the row-major cosine distance matrix. Its size must be M*N.
◆ SimdCosineDistancesMxNp16f()
void SimdCosineDistancesMxNp16f(size_t M, size_t N, size_t K, const uint16_t *A, const uint16_t *B, float *distances)
Calculates pairwise cosine distances for two packed sets of 16-bit float vectors.
A contains M contiguous vectors of length K and B contains N contiguous vectors of length K. The input values are IEEE 754 binary16 values stored in uint16_t elements and are converted to 32-bit float for accumulation. Every input vector is expected to have non-zero Euclidean norm. The output matrix is stored in row-major order.
Algorithm description:
distances[i*N + j] = SimdCosineDistance16f(A + i*K, B + j*K, K);
Parameters
- [in] M - the number of vectors in A.
- [in] N - the number of vectors in B.
- [in] K - the number of elements in every A and B vector.
- [in] A - a pointer to M packed 16-bit float vectors.
- [in] B - a pointer to N packed 16-bit float vectors.
- [out] distances - a pointer to the resulting 32-bit float array holding the row-major cosine distance matrix. Its size must be M*N.
◆ SimdVectorNormNa16f()
void SimdVectorNormNa16f(size_t N, size_t K, const uint16_t *const *A, float *norms)
Calculates Euclidean norms for an array of 16-bit float vectors.
A is an array of N pointers to vectors of length K. The input values are IEEE 754 binary16 values stored in uint16_t elements and are converted to 32-bit float before accumulation.
Algorithm description:
    fa = Float16ToFloat32(A[j][k]);
    norms[j] = Sqrt(Sum(fa*fa));
Parameters
- [in] N - the number of vectors in A.
- [in] K - the number of elements in every A vector.
- [in] A - a pointer to an array of N pointers to 16-bit float vectors.
- [out] norms - a pointer to the resulting 32-bit float array of vector norms. Its size must be N.
◆ SimdVectorNormNp16f()
void SimdVectorNormNp16f(size_t N, size_t K, const uint16_t *A, float *norms)
Calculates Euclidean norms for a packed array of 16-bit float vectors.
A contains N contiguous vectors of length K. The input values are IEEE 754 binary16 values stored in uint16_t elements and are converted to 32-bit float before accumulation.
Algorithm description:
    fa = Float16ToFloat32(A[j*K + k]);
    norms[j] = Sqrt(Sum(fa*fa));
Parameters
- [in] N - the number of vectors in A.
- [in] K - the number of elements in every A vector.
- [in] A - a pointer to N packed 16-bit float vectors.
- [out] norms - a pointer to the resulting 32-bit float array of vector norms. Its size must be N.