Simd Library Documentation.

Half-Precision (16-bit) Floating-Point Numbers

Functions for converting between 16-bit and 32-bit floating-point numbers, plus distance and norm calculations on 16-bit float vectors.

Functions

SIMD_API void SimdFloat32ToFloat16 (const float *src, size_t size, uint16_t *dst)
 Converts an array of 32-bit floats to 16-bit float values.
 
SIMD_API void SimdFloat16ToFloat32 (const uint16_t *src, size_t size, float *dst)
 Converts an array of 16-bit float values to 32-bit floats.
 
SIMD_API void SimdSquaredDifferenceSum16f (const uint16_t *a, const uint16_t *b, size_t size, float *sum)
 Calculates the sum of squared differences for two 16-bit float arrays.
 
SIMD_API void SimdCosineDistance16f (const uint16_t *a, const uint16_t *b, size_t size, float *distance)
 Calculates the cosine distance of two 16-bit float arrays.
 
SIMD_API void SimdCosineDistancesMxNa16f (size_t M, size_t N, size_t K, const uint16_t *const *A, const uint16_t *const *B, float *distances)
 Calculates pairwise cosine distances for two sets of 16-bit float vectors.
 
SIMD_API void SimdCosineDistancesMxNp16f (size_t M, size_t N, size_t K, const uint16_t *A, const uint16_t *B, float *distances)
 Calculates pairwise cosine distances for two packed sets of 16-bit float vectors.
 
SIMD_API void SimdVectorNormNa16f (size_t N, size_t K, const uint16_t *const *A, float *norms)
 Calculates Euclidean norms for an array of 16-bit float vectors.
 
SIMD_API void SimdVectorNormNp16f (size_t N, size_t K, const uint16_t *A, float *norms)
 Calculates Euclidean norms for a packed array of 16-bit float vectors.
 

Detailed Description

Functions for converting between 16-bit and 32-bit floating-point numbers, plus distance and norm calculations on 16-bit float vectors.

Function Documentation

◆ SimdFloat32ToFloat16()

void SimdFloat32ToFloat16 ( const float *  src,
size_t  size,
uint16_t *  dst 
)

Converts an array of 32-bit floats to 16-bit float values.

For each element, the function stores the IEEE 754 binary16 representation of src[i] in dst[i]. The conversion preserves the sign and handles normal values, subnormal values, infinities and NaNs as implemented by the library's internal half-precision conversion helper.

Parameters
[in] src - a pointer to the input array of 32-bit floating-point numbers.
[in] size - the number of elements in the input and output arrays.
[out] dst - a pointer to the output array of 16-bit floating-point numbers.
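
As an illustration of the per-element conversion described above, the following scalar sketch encodes one 32-bit float as binary16. It is not the library's implementation: the names float_to_half and float32_to_float16_ref are hypothetical, and round-to-nearest-even is an assumption, since the description does not state the rounding mode.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar encoder: binary32 -> binary16.
   Assumption: ties are rounded to the nearest even value. */
static uint16_t float_to_half(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);              /* read the raw binary32 bits */
    uint16_t sign = (uint16_t)((bits >> 16) & 0x8000u);
    uint32_t exp = (bits >> 23) & 0xFFu;
    uint32_t man = bits & 0x7FFFFFu;

    if (exp == 0xFFu)                                /* infinity or NaN */
        return (uint16_t)(sign | 0x7C00u | (man ? 0x0200u : 0u));

    int32_t e = (int32_t)exp - 127 + 15;             /* rebias the exponent: 127 -> 15 */
    if (e >= 31)
        return (uint16_t)(sign | 0x7C00u);           /* too large: infinity */

    if (e <= 0) {                                    /* result is subnormal or zero */
        if (e < -10)
            return sign;                             /* below half of the smallest subnormal */
        man |= 0x800000u;                            /* restore the implicit leading 1 */
        uint32_t shift = (uint32_t)(14 - e);         /* 14..24 */
        uint32_t half = man >> shift;
        uint32_t rem = man & ((1u << shift) - 1u);
        uint32_t halfway = 1u << (shift - 1);
        if (rem > halfway || (rem == halfway && (half & 1u)))
            half++;                                  /* may carry into the smallest normal */
        return (uint16_t)(sign | half);
    }

    uint32_t half = ((uint32_t)e << 10) | (man >> 13);
    uint32_t rem = man & 0x1FFFu;                    /* the 13 discarded mantissa bits */
    if (rem > 0x1000u || (rem == 0x1000u && (half & 1u)))
        half++;                                      /* a carry may reach infinity, which is correct */
    return (uint16_t)(sign | half);
}

/* Hypothetical array wrapper with the same parameter order as SimdFloat32ToFloat16. */
static void float32_to_float16_ref(const float *src, size_t size, uint16_t *dst)
{
    for (size_t i = 0; i < size; ++i)
        dst[i] = float_to_half(src[i]);
}
```

Under these assumptions, float_to_half(1.0f) gives 0x3C00, and a value above the binary16 range, such as 1.0e6f, becomes 0x7C00 (positive infinity).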

◆ SimdFloat16ToFloat32()

void SimdFloat16ToFloat32 ( const uint16_t *  src,
size_t  size,
float *  dst 
)

Converts an array of 16-bit float values to 32-bit floats.

For each element the function expands the IEEE 754 binary16 value src[i] to a 32-bit float value dst[i], including normal values, subnormal values, infinities and NaNs.

Parameters
[in] src - a pointer to the input array of 16-bit floating-point numbers.
[in] size - the number of elements in the input and output arrays.
[out] dst - a pointer to the output array of 32-bit floating-point numbers.
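
The expansion in the opposite direction can be sketched with plain bit manipulation. The names half_to_float and float16_to_float32_ref are hypothetical, and the code is a scalar reference under that assumption rather than the library's vectorized routine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar decoder: binary16 -> binary32. */
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t man = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);            /* infinity or NaN */
    } else if (exp != 0) {
        bits = sign | ((exp + 112u) << 23) | (man << 13);   /* normal: rebias 15 -> 127 */
    } else if (man == 0) {
        bits = sign;                                        /* signed zero */
    } else {
        uint32_t e = 113u;                                  /* subnormal: normalize the mantissa */
        while ((man & 0x400u) == 0) {
            man <<= 1;
            --e;
        }
        bits = sign | (e << 23) | ((man & 0x3FFu) << 13);
    }

    float out;
    memcpy(&out, &bits, sizeof out);                        /* reinterpret the bits as a float */
    return out;
}

/* Hypothetical array wrapper with the same parameter order as SimdFloat16ToFloat32. */
static void float16_to_float32_ref(const uint16_t *src, size_t size, float *dst)
{
    for (size_t i = 0; i < size; ++i)
        dst[i] = half_to_float(src[i]);
}
```

Because every finite binary16 value is exactly representable in binary32, this direction involves no rounding; for example, half_to_float(0x3C00) is exactly 1.0f.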

◆ SimdSquaredDifferenceSum16f()

void SimdSquaredDifferenceSum16f ( const uint16_t *  a,
const uint16_t *  b,
size_t  size,
float *  sum 
)

Calculates sum of squared differences for two 16-bit float arrays.

The input values are IEEE 754 binary16 values stored in uint16_t elements. Each element is converted to 32-bit float before subtraction and accumulation. Input arrays must have the same size.

Algorithm description:

da = Float16ToFloat32(a[i]) - Float16ToFloat32(b[i]);
sum[0] = Sum(da*da);

Parameters
[in] a - a pointer to the first 16-bit float array.
[in] b - a pointer to the second 16-bit float array.
[in] size - the number of elements in each input array.
[out] sum - a pointer to the 32-bit float that receives the sum of squared differences.
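
The algorithm description above can be written out as a scalar reference. This is a sketch only: half_to_float and squared_difference_sum16f_ref are hypothetical names, and a vectorized implementation may accumulate in a different order, so its result can differ in the last bits.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one IEEE 754 binary16 value. */
static float half_to_float(uint16_t h)
{
    int exp = (h >> 10) & 0x1F;
    int man = h & 0x3FF;
    float v;
    if (exp == 31)
        v = man ? NAN : INFINITY;
    else if (exp == 0)
        v = ldexpf((float)man, -24);                /* zero or subnormal: man * 2^-24 */
    else
        v = ldexpf((float)(man + 1024), exp - 25);  /* normal: (1 + man/1024) * 2^(exp-15) */
    return (h & 0x8000) ? -v : v;
}

/* Scalar reference: converts each pair to float, then accumulates the squared difference. */
static void squared_difference_sum16f_ref(const uint16_t *a, const uint16_t *b, size_t size, float *sum)
{
    float acc = 0.0f;
    for (size_t i = 0; i < size; ++i) {
        float d = half_to_float(a[i]) - half_to_float(b[i]);
        acc += d * d;
    }
    *sum = acc;
}
```

For the vectors (1, 2, 3) and (3, 2, 1), encoded as {0x3C00, 0x4000, 0x4200} and {0x4200, 0x4000, 0x3C00}, this reference yields 8.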

◆ SimdCosineDistance16f()

void SimdCosineDistance16f ( const uint16_t *  a,
const uint16_t *  b,
size_t  size,
float *  distance 
)

Calculates cosine distance of two 16-bit float arrays.

The input values are IEEE 754 binary16 values stored in uint16_t elements. Each element is converted to 32-bit float before multiplication and accumulation. Input arrays must have the same size and non-zero Euclidean norm.

Algorithm description:

fa = Float16ToFloat32(a[i]);
fb = Float16ToFloat32(b[i]);
distance[0] = 1 - Sum(fa*fb)/Sqrt(Sum(fa*fa)*Sum(fb*fb));

Parameters
[in] a - a pointer to the first 16-bit float array.
[in] b - a pointer to the second 16-bit float array.
[in] size - the number of elements in each input array.
[out] distance - a pointer to the 32-bit float that receives the cosine distance.

◆ SimdCosineDistancesMxNa16f()

void SimdCosineDistancesMxNa16f ( size_t  M,
size_t  N,
size_t  K,
const uint16_t *const *  A,
const uint16_t *const *  B,
float *  distances 
)

Calculates pairwise cosine distances for two sets of 16-bit float vectors.

A is an array of M pointers to vectors of length K, and B is an array of N pointers to vectors of length K. The input values are IEEE 754 binary16 values stored in uint16_t elements and are converted to 32-bit float for accumulation. Every input vector is expected to have non-zero Euclidean norm. The output matrix is stored in row-major order.

Algorithm description:

for each i in [0, M) and j in [0, N):
    SimdCosineDistance16f(A[i], B[j], K, &distances[i*N + j]);

Parameters
[in] M - the number of A vectors.
[in] N - the number of B vectors.
[in] K - the number of elements in every A and B vector.
[in] A - a pointer to the first array of M pointers to 16-bit float vectors.
[in] B - a pointer to the second array of N pointers to 16-bit float vectors.
[out] distances - a pointer to the resulting 32-bit float array holding the row-major cosine distance matrix. Its size must be M*N.

◆ SimdCosineDistancesMxNp16f()

void SimdCosineDistancesMxNp16f ( size_t  M,
size_t  N,
size_t  K,
const uint16_t *  A,
const uint16_t *  B,
float *  distances 
)

Calculates pairwise cosine distances for two packed sets of 16-bit float vectors.

A contains M contiguous vectors of length K and B contains N contiguous vectors of length K. The input values are IEEE 754 binary16 values stored in uint16_t elements and are converted to 32-bit float for accumulation. Every input vector is expected to have non-zero Euclidean norm. The output matrix is stored in row-major order.

Algorithm description:

for each i in [0, M) and j in [0, N):
    SimdCosineDistance16f(A + i*K, B + j*K, K, &distances[i*N + j]);

Parameters
[in] M - the number of A vectors.
[in] N - the number of B vectors.
[in] K - the number of elements in every A and B vector.
[in] A - a pointer to M packed 16-bit float vectors.
[in] B - a pointer to N packed 16-bit float vectors.
[out] distances - a pointer to the resulting 32-bit float array holding the row-major cosine distance matrix. Its size must be M*N.
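
The packed layout differs from the pointer-array layout only in how vectors are addressed: vector i starts at offset i*K. As a sketch of that relationship, the hypothetical helpers below build a row-pointer view of a packed buffer and compute the offset of an output entry. Whether the packed and pointer-array functions return bit-identical results for the same data is not stated here and is not assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fills rows[0..count) with pointers to the vectors of a
   packed buffer that stores `count` contiguous vectors of length K. */
static void packed_to_row_pointers(const uint16_t *packed, size_t count, size_t K, const uint16_t **rows)
{
    for (size_t i = 0; i < count; ++i)
        rows[i] = packed + i * K;   /* vector i starts at element i*K */
}

/* Hypothetical helper: offset of entry (i, j) in the row-major M x N distance matrix. */
static size_t distance_index(size_t i, size_t j, size_t N)
{
    return i * N + j;
}
```

The resulting rows array has the pointer-to-pointer shape that the A and B parameters of SimdCosineDistancesMxNa16f expect, so the same packed data can be addressed either way.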

◆ SimdVectorNormNa16f()

void SimdVectorNormNa16f ( size_t  N,
size_t  K,
const uint16_t *const *  A,
float *  norms 
)

Calculates Euclidean norms for an array of 16-bit float vectors.

A is an array of N pointers to vectors of length K. The input values are IEEE 754 binary16 values stored in uint16_t elements and are converted to 32-bit float before accumulation.

Algorithm description:

fa = Float16ToFloat32(A[j][k]);
norms[j] = Sqrt(Sum(fa*fa));

Parameters
[in] N - the number of A vectors.
[in] K - the number of elements in every A vector.
[in] A - a pointer to an array of N pointers to 16-bit float vectors.
[out] norms - a pointer to the resulting 32-bit float array of vector norms. Its size must be N.

◆ SimdVectorNormNp16f()

void SimdVectorNormNp16f ( size_t  N,
size_t  K,
const uint16_t *  A,
float *  norms 
)

Calculates Euclidean norms for a packed array of 16-bit float vectors.

A contains N contiguous vectors of length K. The input values are IEEE 754 binary16 values stored in uint16_t elements and are converted to 32-bit float before accumulation.

Algorithm description:

fa = Float16ToFloat32(A[j*K + k]);
norms[j] = Sqrt(Sum(fa*fa));

Parameters
[in] N - the number of A vectors.
[in] K - the number of elements in every A vector.
[in] A - a pointer to N packed 16-bit float vectors.
[out] norms - a pointer to the resulting 32-bit float array of vector norms. Its size must be N.