GSoC 2024 ‐ Aman Bhansali


About me

I’m Aman Bhansali, an undergraduate at the Indian Institute of Technology Jodhpur. I am from Purnia, Bihar, India. I am passionate about mathematics and programming, with a strong interest in numerical linear algebra and scientific computing. My experience in areas like computer vision and NLP has deepened my understanding of the role mathematical models play in algorithm development. Additionally, I am greatly fascinated by applications of machine learning to real-world challenges.

Project overview

In linear algebra, BLAS (Basic Linear Algebra Subprograms) routines are predominantly employed to perform operations on matrices and vectors. BLAS routines are commonly utilized in the development of advanced linear algebra software owing to their efficiency, portability, and extensive accessibility. The major goals associated with the project are:

  • Add JavaScript implementations.
  • Add C implementations.
  • Re-implement BLAS subroutines by converting the existing lapack-netlib Fortran implementation from its current fixed-form Fortran to free-form Fortran 95.
  • Write Node.js bindings to allow calling compiled C/Fortran BLAS routines from JavaScript.

The approach I followed for the project was the same as outlined in the proposal: start with the Level 1 routines. In each level, the priority was real datatypes first, followed by complex ones. Among the real datatypes, real double precision came first, followed by real single precision, and similarly for complex double and single precision. I had proposed completing each routine in two PRs, one for the JavaScript implementation and the other for the C/Fortran implementation, but with practice and understanding, I did the entire implementation in a single PR, thus reducing reviewing time as well.

Project recap

During the community bonding period, I began coding by focusing on the BLAS Level 1 routines, as outlined in tracking issue #2039. I initially concentrated on mastering the simpler single- and double-precision routines before advancing to more complex ones. The complex routines were challenging to grasp at first, but after working on a few packages, I became used to them. Implementing the C/Fortran versions was initially challenging as well, but with understanding and practice, things eventually became easier. Thus, the real single- and double-precision routines were implemented entirely within the decided timeframe. The complex single- and double-precision routines that are independent of other BLAS routines have also been fully implemented, including their C and Fortran implementations. However, due to the ongoing development of the tooling required for integrating packages across C and Fortran, a few of the complex Level 1 routines remain.

Native Implementation Level 1 Signature:

sswap( N, x, strideX, y, strideY )

Ndarray Implementation Level 1 Signature:

sswap( N, x, strideX, offsetX, y, strideY, offsetY )

This additional offset parameter in the ndarray implementation gives the user the freedom to select their starting index for the operation along with the stride.
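
A minimal usage sketch (the values here are illustrative; module paths are those published in stdlib):

var Float32Array = require( '@stdlib/array/float32' );
var sswap = require( '@stdlib/blas/base/sswap' );

var x = new Float32Array( [ 1.0, 2.0, 3.0, 4.0 ] );
var y = new Float32Array( [ 5.0, 6.0, 7.0, 8.0 ] );

// Swap only the last two elements of each vector by starting at offset 2:
sswap.ndarray( 2, x, 1, 2, y, 1, 2 );
// x => [ 1.0, 2.0, 7.0, 8.0 ]; y => [ 5.0, 6.0, 3.0, 4.0 ]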

After completing the Level 1 routines, I switched to working on the Level 2 routines, which involve matrix-vector operations. At this stage, things became more interesting and, at the same time, more challenging to work on. My approach remained systematic: starting with the real single- and double-precision routines and beginning with their native JavaScript implementations based on the reference Fortran implementation.

At this stage, both code and R&D became equally important, leading to an extensive round of refactoring and discussions with project mentors, during which we figured out a way to implement modern BLAS routines. The existing reference LAPACK implementation is Fortran-based and hence follows a column-major layout by default. However, in JavaScript, we can give the user the freedom to choose whether to pass the matrix in row-major or column-major order. This flexibility is important since matrices are represented as flat arrays in JavaScript, ensuring contiguous memory allocation. The key insight, however, comes next.

Let's take a small example:

A = [ 1, 2, 3 ]
    [ 4, 5, 6 ] (2×3)
A = [ 1, 2, 3, 4, 5, 6 ] (row-major)
A = [ 1, 4, 2, 5, 3, 6 ] (column-major)

Reference LAPACK signature:

sgemv( trans, M, N, alpha, A, lda, x, strideX, beta, y, strideY )

Native implementation signature:

sgemv( order, trans, M, N, alpha, A, lda, x, strideX, beta, y, strideY )

Similar to the ndarray implementations of the Level 1 routines, we have an offset parameter associated with the matrix and each vector. Additionally, there are use cases where we need to perform operations on a specific submatrix within a larger global matrix. To accommodate this, our ndarray implementation includes the following parameters:

  • sa1 (strideA1): stride along the first dimension of matrix A.
  • sa2 (strideA2): stride along the second dimension of matrix A.

These parameters give users full flexibility to utilize the BLAS implementation as needed, even allowing for negative stride values depending on the use case. At each stage of implementation, the idea was to reduce code duplication and maintain cache optimality.

Ndarray implementation signature:

sgemv( trans, M, N, alpha, A, strideA1, strideA2, offsetA, x, strideX, offsetX, beta, y, strideY, offsetY )

A short example to understand how stride parameters for matrix A operate:

A = [ 999.0, 6.0,   999.0, 5.0,   999.0 ], 
    [ 999.0, 999.0, 999.0, 999.0, 999.0 ], 
    [ 999.0, 4.0,   999.0, 3.0,   999.0 ],
    [ 999.0, 999.0, 999.0, 999.0, 999.0 ], 
    [ 999.0, 2.0,   999.0, 1.0,   999.0 ]

A = [ 999.0, 6.0, 999.0, 5.0, 999.0, 999.0, 999.0, 999.0, 999.0, 999.0, 999.0, 4.0, 999.0, 3.0, 999.0, 999.0, 999.0, 999.0, 999.0, 999.0, 999.0, 2.0, 999.0, 1.0, 999.0 ] (row-major)

Now, let's say the sub-matrix required for the operation is:

A = [ 1, 2 ],
    [ 3, 4 ],
    [ 5, 6 ]

Hence, the stride parameters along with offset give flexibility here.

sa1 = -10
sa2 = -2
offsetA = 23

With the initial offset, we reach the position of 1 in the flat array representation; using sa1 we step along the first dimension (from 1 to 3), and using sa2 along the second (from 1 to 2), performing the required operation seamlessly.
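
To make the arithmetic explicit: element ( i, j ) of the sub-matrix lives at index offsetA + i*sa1 + j*sa2 in the flat array. A small hypothetical helper (not part of the implementation) makes this easy to verify:

// Return element ( i, j ) given the strides and offset:
function at( A, i, j, sa1, sa2, offsetA ) {
    return A[ offsetA + ( i * sa1 ) + ( j * sa2 ) ];
}

at( A, 0, 0, -10, -2, 23 ); // => 1.0 (index 23)
at( A, 1, 1, -10, -2, 23 ); // => 4.0 (index 23 - 10 - 2 = 11)
at( A, 2, 0, -10, -2, 23 ); // => 5.0 (index 23 - 20 = 3)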

Once the implementation was ready for a package, the next step was adding test suites. However, most existing BLAS implementation test suites were neither comprehensive nor extensive. Therefore, the objective shifted toward developing a thorough test suite for each package; ensuring 100% code coverage began with designing examples on paper. Also, given our unique ndarray implementation, covering all edge cases was important, particularly validating how the custom strides, offsets, and other parameters interact. Thus, every possible combination of the varying parameters was checked, and we can now claim that a robust test suite is available for our routines.

The same follows for the other Level 2 routines: as mentioned above, the operation might look similar, but the matrix representation varies across packages.

There exist sets such as:

  • SGEMV - where A is an M×N matrix.
  • SSYMV - where A is an N×N matrix.
  • STPMV - where A is an N×N matrix supplied in packed form.
  • STBMV - where A is an N×N band matrix.

After getting a few of the real single and double-precision routines over the finish line, I moved on to implementing a Level 3 routine: matrix-matrix multiplication (sgemm). This proved to be one of the most complex and challenging packages I worked on.

Most existing implementations stick closely to the native Fortran approach, but we took it further by adding ndarray support. This made the task even more complex, as we had to balance flexibility with performance. The key challenge was to ensure our ndarray implementation could smoothly handle different matrix formats while still matching the performance of the Fortran-based routines.

Ndarray implementation signature:

sgemm( transA, transB, M, N, K, alpha, A, strideA1, strideA2, offsetA, B, strideB1, strideB2, offsetB, beta, C, strideC1, strideC2, offsetC )

Maintaining cache optimality here was difficult, since around five parameter groups vary simultaneously: transA, transB, [strideA1, strideA2], [strideB1, strideB2], and [strideC1, strideC2].

This led to an extensive discussion phase, and eventually, with my mentors' help, the package was implemented. The benchmarks and test suites are both highly extensive and cover all possible combinations of the varying parameters.

This summarizes my entire workflow. There were times when, to reduce code duplication, we created independent packages and, based on them, refactored the existing implementations again. This serves as a quick recap of my work; furthermore, for almost every package, there is a story behind its implementation.

Completed work

Merged PRs:

The above two routines perform the symmetric rank 1 operation A = α*x*x^T + A where α is a scalar, x is an N element vector, and A is an N by N symmetric matrix supplied in packed form. The reference LAPACK implementation contains parameters:

dspr( uplo, N, α, x, strideX, AP ) (double-precision)
sspr( uplo, N, α, x, strideX, AP ) (single-precision)

AP - The packed form representation of an N×N symmetric matrix.
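
As an illustration (assuming row-major order and uplo = 'upper'), only the upper triangle is stored, row by row, so an N×N symmetric matrix needs just N*(N+1)/2 elements:

A  = [ 1, 2, 3 ]
     [ 2, 4, 5 ]
     [ 3, 5, 6 ]
AP = [ 1, 2, 3, 4, 5, 6 ]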

Native Implementation:

dspr( order, uplo, N, α, x, strideX, AP )
sspr( order, uplo, N, α, x, strideX, AP )

Ndarray Implementation:

dspr.ndarray( order, uplo, N, α, x, strideX, offsetX, AP, offsetAP )
sspr.ndarray( order, uplo, N, α, x, strideX, offsetX, AP, offsetAP )

For the single-precision routines, we also use f32 (@stdlib/number/float64/base/to-float32) to emulate single-precision arithmetic. AP in the ndarray implementation can be thought of as a global vector containing the packed form representation of an N×N symmetric matrix, and we want to operate only on those particular elements.
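
As a quick sketch of why rounding after each step matters (the values are illustrative):

var f32 = require( '@stdlib/number/float64/base/to-float32' );

// Round the result of each arithmetic step to single precision:
var sum = f32( f32( 1.0 ) + f32( 1.0e-8 ) );
// => 1.0, since 1.0e-8 is below the single-precision resolution at 1.0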

The above two routines perform the matrix-matrix operation C = α*op(A)*op(B) + β*C where op(A) is either op(A) = A or op(A) = A^T, α and β are scalars, A, B, and C are matrices, with op(A) an M by K matrix, op(B) a K by N matrix, and C an M by N matrix. The reference LAPACK implementation contains parameters:

dgemm( transA, transB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc ) (double-precision)
sgemm( transA, transB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc ) (single-precision)


Native Implementation:

dgemm( order, transA, transB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc )
sgemm( order, transA, transB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc )

Ndarray Implementation:

dgemm.ndarray( transA, transB, M, N, K, alpha, A, strideA1, strideA2, offsetA, B, strideB1, strideB2, offsetB, beta, C, strideC1, strideC2, offsetC )
sgemm.ndarray( transA, transB, M, N, K, alpha, A, strideA1, strideA2, offsetA, B, strideB1, strideB2, offsetB, beta, C, strideC1, strideC2, offsetC )

Due to the multiple varying parameters, it was difficult to follow the standard implementation; hence, the implementation for this routine contains two additional functions: one implementing a naive matrix multiplication algorithm, which is cache-optimal when A is row-major and B is column-major, and another that performs matrix multiplication using loop tiling.
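
A minimal loop-tiling sketch (row-major operands, no transpose, arbitrary block size; this illustrates the technique and is not the stdlib implementation):

// Tiled matrix multiplication: C = alpha*A*B + beta*C (all row-major):
function gemmTiled( M, N, K, alpha, A, B, beta, C ) {
    var bs = 32; // block size (arbitrary choice)
    var i0; var j0; var k0; var i; var j; var k; var aik;
    for ( i = 0; i < M*N; i++ ) {
        C[ i ] *= beta;
    }
    // Process blocks so each tile of A, B, and C stays cache-resident:
    for ( i0 = 0; i0 < M; i0 += bs ) {
        for ( k0 = 0; k0 < K; k0 += bs ) {
            for ( j0 = 0; j0 < N; j0 += bs ) {
                for ( i = i0; i < Math.min( i0+bs, M ); i++ ) {
                    for ( k = k0; k < Math.min( k0+bs, K ); k++ ) {
                        aik = alpha * A[ (i*K)+k ];
                        for ( j = j0; j < Math.min( j0+bs, N ); j++ ) {
                            C[ (i*N)+j ] += aik * B[ (k*N)+j ];
                        }
                    }
                }
            }
        }
    }
}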

The test suite and benchmarks for this routine are highly extensive; they cover all the conditions under which the code needs to be validated.

The above two routines solve one of the systems of equations A*x = b or A^T*x = b where b and x are N element vectors and A is an N by N unit, or non-unit, upper or lower triangular matrix. The reference LAPACK implementation contains parameters:

dtrsv( uplo, trans, diag, N, A, lda, x, strideX )
strsv( uplo, trans, diag, N, A, lda, x, strideX )

Native JavaScript implementation:

dtrsv( order, uplo, trans, diag, N, A, lda, x, strideX )
strsv( order, uplo, trans, diag, N, A, lda, x, strideX )

Ndarray Implementation:

dtrsv.ndarray( uplo, trans, diag, N, A, strideA1, strideA2, offsetA, x, strideX, offsetX )
strsv.ndarray( uplo, trans, diag, N, A, strideA1, strideA2, offsetA, x, strideX, offsetX )

A unit triangular matrix is one whose diagonal elements are all 1. Here, the ndarray implementation gives the algorithm the flexibility to operate on matrices passed in any form; A can be thought of as residing within a larger global matrix. This has immense use cases, a few of which I mentioned above.
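
For intuition, a stripped-down sketch of the lower-triangular, non-transposed, row-major case (forward substitution; the actual routine handles all parameter combinations):

// Solve A*x = b in place, where A is lower triangular (row-major)
// and x holds b on entry:
function trsvLower( N, A, lda, x, isUnit ) {
    var i; var j; var s;
    for ( i = 0; i < N; i++ ) {
        s = x[ i ];
        for ( j = 0; j < i; j++ ) {
            s -= A[ (i*lda)+j ] * x[ j ];
        }
        // For a unit triangular matrix, the diagonal is 1, so skip the division:
        x[ i ] = ( isUnit ) ? s : s / A[ (i*lda)+i ];
    }
}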

The above two routines perform one of the matrix-vector operations y = α*A*x + β*y or y = α*A^T*x + β*y, where α and β are scalars, x and y are vectors, and A is an M by N matrix. The reference LAPACK implementation contains parameters:

dgemv( trans, M, N, alpha, A, lda, x, strideX, beta, y, strideY )
sgemv( trans, M, N, alpha, A, lda, x, strideX, beta, y, strideY )

Native JavaScript implementation:

dgemv( order, trans, M, N, alpha, A, lda, x, strideX, beta, y, strideY )
sgemv( order, trans, M, N, alpha, A, lda, x, strideX, beta, y, strideY )

Ndarray JavaScript implementation:

dgemv( trans, M, N, alpha, A, strideA1, strideA2, offsetA, x, strideX, offsetX, beta, y, strideY, offsetY )
sgemv( trans, M, N, alpha, A, strideA1, strideA2, offsetA, x, strideX, offsetX, beta, y, strideY, offsetY )

One important fact to understand in all these implementations is that, since we pass the input matrix as a flat array along with the order parameter, we also gain an advantage from symmetry in the code structure. For instance:

For operation: y = α*A*x + β*y
A = [ 1, 2, 3 ]
    [ 4, 5, 6 ]
    [ 7, 8, 9 ]
if order = 'row-major', A = [ 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
if order = 'column-major', A = [ 1, 4, 7, 2, 5, 8, 3, 6, 9 ]

For operation y = α*A^T*x + β*y
A^T = [ 1, 4, 7 ]
      [ 2, 5, 8 ]
      [ 3, 6, 9 ]
if order = 'row-major', A = [ 1, 4, 7, 2, 5, 8, 3, 6, 9 ]
if order = 'column-major', A = [ 1, 2, 3, 4, 5, 6, 7, 8, 9 ]

We came up with the idea of having a single base implementation and letting both the ndarray and native implementations call that base implementation with the appropriate parameters, reducing code duplication.
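
A hypothetical sketch of this pattern (names are illustrative): the native wrapper maps order and lda onto the strides expected by the shared base implementation:

// Map ( order, lda ) onto generic strides and delegate to the base kernel
// ( offset handling for negative vector strides omitted for brevity ):
function sgemv( order, trans, M, N, alpha, A, lda, x, sx, beta, y, sy ) {
    var sa1; var sa2;
    if ( order === 'row-major' ) {
        sa1 = lda; // stride between rows
        sa2 = 1;   // stride between columns
    } else {
        sa1 = 1;
        sa2 = lda;
    }
    return base( trans, M, N, alpha, A, sa1, sa2, 0, x, sx, 0, beta, y, sy, 0 );
}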

The above two routines perform the symmetric rank 1 operation A = α*x*x^T + A where α is a scalar, x is an N element vector, and A is an N by N symmetric matrix. The reference LAPACK implementation contains parameters:

dsyr( uplo, N, alpha, x, strideX, A, lda )
ssyr( uplo, N, alpha, x, strideX, A, lda )

Native JavaScript implementation:

dsyr( order, uplo, N, alpha, x, strideX, A, lda )
ssyr( order, uplo, N, alpha, x, strideX, A, lda )

Ndarray JavaScript implementation:

dsyr( uplo, N, alpha, x, strideX, offsetX, A, strideA1, strideA2, offsetA )
ssyr( uplo, N, alpha, x, strideX, offsetX, A, strideA1, strideA2, offsetA )

As mentioned above, symmetry works here as well. For instance:

For uplo = 'upper' (upper-triangular matrix)
A = [ 1, 2, 3 ]
    [ 0, 5, 6 ]
    [ 0, 0, 9 ]
if order = 'row-major', A = [ 1, 2, 3, 0, 5, 6, 0, 0, 9 ]
if order = 'column-major', A = [ 1, 0, 0, 2, 5, 0, 3, 6, 9 ]

For uplo = 'lower'
A = [ 1, 0, 0 ]
    [ 2, 5, 0 ]
    [ 3, 6, 9 ]
if order = 'row-major', A = [ 1, 0, 0, 2, 5, 0, 3, 6, 9 ]
if order = 'column-major', A = [ 1, 2, 3, 0, 5, 6, 0, 0, 9 ]

The above two routines perform one of the matrix-vector operations x = A*x or x = A^T*x, where x is an N element vector and A is an N by N unit, or non-unit, upper or lower triangular matrix. These follow a pattern similar to the routines mentioned above, so I won't go into depth on them.

The above two routines perform the symmetric rank 2 operation A = α*x*y^T + α*y*x^T + A, where α is a scalar, x and y are N element vectors, and A is an N by N symmetric matrix. The structure remains similar to ssyr, with just an extra vector y in the operation.

This is a Level 1 complex single-precision routine that computes the sum of the absolute values of the real and imaginary components of a single-precision complex floating-point vector. The reference LAPACK implementation:

scasum( N, cx, strideCX )

Thus, the native JavaScript implementation follows the same signature, while we introduce an offset parameter for cx in the ndarray implementation. To perform computation on the complex vector, we use reinterpret, which, as the name suggests, reinterprets a Complex64Array as a Float32Array. Once done, the operation is simply on a real vector with stride = strideCX*2 and offset = offsetCX*2. Here, the output is a real single-precision value storing the result. Since this package does not depend on any other blas/base routines, its C/Fortran implementations are also available. The structure of the package is as I described in detail in my proposal.
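
For example (module paths are those published in stdlib):

var Complex64Array = require( '@stdlib/array/complex64' );
var reinterpret = require( '@stdlib/strided/base/reinterpret-complex64' );

// [ 1.0+2.0i, -3.0+4.0i ] stored as interleaved real/imaginary parts:
var cx = new Complex64Array( [ 1.0, 2.0, -3.0, 4.0 ] );

// Create a Float32Array view over the same memory, starting at offset 0:
var viewX = reinterpret( cx, 0 );
// => <Float32Array>[ 1.0, 2.0, -3.0, 4.0 ]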

The above two routines scale a double/single-precision complex floating-point vector by a complex floating-point constant of the corresponding precision and add the result to a second complex floating-point vector of the same precision. The JavaScript implementations of these routines are available; since the operation depends on dcabs1 and scabs1, the C/Fortran implementations remain pending due to the ongoing tooling work. The reference LAPACK implementation:

zaxpy( N, za, zx, strideZX, zy, strideZY )
caxpy( N, ca, cx, strideCX, cy, strideCY )

Here, the output is stored in the second input complex array passed, i.e., zy or cy.

The above two routines apply a plane rotation. Similar to scasum, they do not have any dependencies and hence are implemented entirely. The output gets stored in the second input array passed, i.e., either zy or cy.

The above two routines compute the L2-norm of a complex double/single-precision floating-point vector. Thus, the result here is either a real double- or single-precision value. The structure is similar to scasum; hence, the ndarray implementation contains an additional offset parameter, and the routines are implemented entirely owing to their independence from any other blas/base routines.

The above PRs correspond to modifications, including refactoring of existing implementations and fixing paths and descriptions of already merged routines. Since we came up with the idea for the Level 1 routines of having an ndarray implementation and letting the native implementation call it, thereby reducing code duplication, the above three PRs are for the same. For instance, the native implementation for srot:

srot( N, x, strideX, y, strideY, c, s )

Ndarray implementation for srot:

srot( N, x, strideX, offsetX, y, strideY, offsetY, c, s )

For the native implementations, we want to support positive and negative stride values and nothing else, as our native implementations follow the reference BLAS/LAPACK implementations.

x = [ 1, 2, 3 ]
For strideX = 1, we take offsetX = 0.
For strideX = -1, we take offsetX = ( 1 - N ) * strideX = 2.
Based on this, we call ndarray( N, x, strideX, offsetX, y, strideY, offsetY, c, s ).
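
In code, the native wrapper then becomes a thin shim over the ndarray kernel (a sketch; names are illustrative):

// Compute the starting indices implied by negative strides and delegate:
function srot( N, x, strideX, y, strideY, c, s ) {
    var ox = ( strideX < 0 ) ? ( 1 - N ) * strideX : 0;
    var oy = ( strideY < 0 ) ? ( 1 - N ) * strideY : 0;
    return ndarray( N, x, strideX, ox, y, strideY, oy, c, s );
}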

The above two routines perform the matrix-vector operation y = alpha*A*x + beta*y where alpha and beta are scalars, x and y are N element vectors, and A is an N by N symmetric matrix supplied in packed form. This is similar to dspr or sspr as discussed above.

The above two routines perform the matrix-vector operation y = alpha*A*x + beta*y where alpha and beta are scalars, x and y are N element vectors, and A is an N by N symmetric matrix. This follows the same structure as the other Level 2 routines. This is the first routine for which we implemented the ndarray version with strideA1, strideA2, offsetA as additional parameters.

The above two PRs add additional test cases to ensure complete code coverage for the packages drotm and srotm.

The above two routines apply a modified Givens plane rotation. These are Level 1 real double and single-precision routines. These routines do not depend on any other blas/base routine and are implemented entirely with their JavaScript, C, and Fortran implementations.

The above two PRs fix the srotm routine, updating its description and performing float32 emulation.

This is a crucial PR. I started working on it before GSoC and completed the package, but its CI was failing at the time, so I moved on to BLAS. Later, since this package was a blocker for another project, it was fixed and merged.

The above two routines scale a double/single-precision complex floating-point vector by a complex floating-point constant of the corresponding precision. Here, the output is the complex input array passed, i.e., zx or cx. Since these are independent of the other blas/base routines, the JavaScript, C, and Fortran versions are implemented entirely.

The above two routines compute the sum of the absolute values of the real and imaginary components of a double/single-precision complex floating-point number. These are standalone packages and are thus implemented entirely, with their JavaScript, C, and Fortran implementations, and are used in other packages such as dzasum, caxpy, zaxpy, and other complex routines as well.
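
A short usage sketch (assuming the current stdlib module paths):

var Complex128 = require( '@stdlib/complex/float64/ctor' );
var dcabs1 = require( '@stdlib/blas/base/dcabs1' );

// |5.0| + |-3.0| = 8.0:
var v = dcabs1( new Complex128( 5.0, -3.0 ) );
// => 8.0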

The list above consists of PRs merged after the commencement of the GSoC period. There are a few routines that I worked on and that were merged before this period, such as srot, drot, etc.

Open PRs:

The above list consists of the PRs that I am working on, those that require review, or those that need changes.

Current state

As previously mentioned, we have fully implemented all real single- and double-precision Level 1 routines, including their JavaScript, C, and Fortran versions. For the complex single and double-precision routines, approximately 5 out of 11 remain in each category, with ongoing infrastructure work as discussed.

For Level 2 routines, we have implemented nearly all real single- and double-precision routines in JavaScript, while their C and Fortran implementations are pending on the ongoing tooling development. For Level 3 routines, we have implemented the single- and double-precision matrix multiplication routines (sgemm and dgemm), with approximately 5 packages left in each category. Issue #2039: RFC: Add BLAS bindings and implementations for linear algebra tracks the progress of these implementations.

What remains

As previously mentioned, we still need to implement 5 out of 11 complex single- and double-precision Level 1 routines. We can implement these once the tooling work is complete.

In Level 2, we need to implement the complex single- and double-precision routines, as well as the C and Fortran implementations for the real single- and double-precision routines.

For Level 3, around 5 packages remain, while the gemm routines await their C and Fortran implementations. I will persist in my efforts to complete the implementation of the entire BLAS routine set.

Challenges and lessons learned

I faced a few challenges, and I would like to discuss some of them here. Maintaining stdlib code standards was the first challenge I faced after understanding the entire code flow. When multiple people work on related topics, it becomes crucial to maintain standards to ensure a consistent user experience across all packages. Gradually, as I worked on multiple packages and carefully considered the feedback on each of my PRs, I was able to overcome this challenge. Special thanks to my mentors for correcting and guiding me at each stage.

The other challenge was that, as discussed above, some packages depend on other packages, and adding their C/Fortran implementations was not yet possible. This was unexpected at the start of the project, and we need to work on the infrastructure to support it. Since this project was extensive and required discussions at numerous stages, there was never a moment during the course of the project where I felt blocked; there was always something to work on.

The most important lessons that I learned throughout were:

  • Doing something is not enough; striving for perfection is what turns ordinary into excellence.
  • Always seek improvement—continuously think, read, and explore new possibilities to optimize your work and expand its use cases.  

Conclusion

Overall, the entire program was a great experience filled with learning and execution. I learned about how any feature or operation I use or perform is implemented at the backend and how rigorous the testing and benchmarking process is. Throughout this journey, I acquired knowledge in various areas and the art of writing clean code. I would like to thank my mentors and special thanks to Athan, who provided unwavering support and guidance whenever I encountered challenges. I look forward to continuing to work and contribute to stdlib, completing all the remaining packages, and eventually helping to build the infrastructure over which further applications or algorithms can be crafted.