
Daichi MUKUNOKI

Ph.D. in Engineering, Assistant Professor, Information Technology Center, Nagoya University

Since December 2024, I have been working at the Information Technology Center, Nagoya University. My research interests include high-performance computing (HPC), large-scale parallel computing on supercomputers, the use of low-precision arithmetic on AI processors for scientific computing, mixed-precision numerical computing, and automatic generation of high-performance numerical code using AI techniques. In addition, I conduct research on topics such as automatic tuning and GPU porting of applications together with the students in our laboratory. Students who are interested in our research and would like to work with us are welcome to contact me. I also promote joint research with universities, research institutes, and private companies in Japan and abroad.

News (last update: July 8, 2025)

Recent Research Activities

Profile

Daichi MUKUNOKI

Biography

Work experience

Education

Research Stay/Visiting Abroad

Research

Automatic HPC/numerical code generation using LLMs

As the performance of large language models (LLMs) has improved dramatically, not only interactive agents such as ChatGPT but also technologies that assist with implementing program code, or generate it automatically, are being put into practical use. However, generating highly specialized code remains challenging. In particular, the numerical codes used in HPC are difficult targets because they require sophisticated performance optimization. We are developing an LLM-assisted code generation method specialized for HPC and numerical codes, based on the coordination of multiple LLMs, iterative refinement of prompts, and retrieval-augmented generation (RAG).
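
One ingredient of such a pipeline, the feedback loop that repeatedly regenerates code and folds compiler or benchmark output back into the prompt, can be sketched as follows. Everything here is hypothetical: "llm_generate" is an imaginary command-line tool standing in for whatever LLM interface is actually used, and the file names are placeholders.

#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: ask an LLM to turn the prompt file into kernel.c.
// "llm_generate" is imaginary and used only to keep the sketch self-contained;
// a real pipeline would call an actual LLM API here.
static void ask_llm(const std::string &prompt_file) {
    std::string cmd = "llm_generate --prompt " + prompt_file + " --out kernel.c";
    std::system(cmd.c_str());
}

// Read a whole text file (compiler diagnostics, benchmark logs, ...).
static std::string slurp(const char *path) {
    std::ifstream f(path);
    std::stringstream ss;
    ss << f.rdbuf();
    return ss.str();
}

int main() {
    std::string prompt = "Write an optimized C function for ...";  // task description
    for (int round = 0; round < 5; ++round) {
        { std::ofstream p("prompt.txt"); p << prompt; }
        ask_llm("prompt.txt");

        // Try to compile the generated code and capture the diagnostics.
        int rc = std::system("cc -O3 -c kernel.c 2> compile.log");
        if (rc != 0) {
            // Feed the compiler errors back into the next prompt.
            prompt += "\nThe previous attempt failed to compile:\n" + slurp("compile.log");
            continue;
        }
        // A full pipeline would now run and benchmark the code and append
        // the performance numbers to the prompt in the same way.
        std::printf("round %d: compiled successfully\n", round);
        break;
    }
    return 0;
}

The multi-LLM coordination and RAG components are omitted here; the sketch only conveys the iterative feedback structure.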

High-precision operations using low-precision operations

For ordinary scientific computations, 64-bit floating-point arithmetic (FP64, so-called double precision) is used. However, many processors can perform lower-precision operations, such as 32-bit single precision (FP32) and 16-bit half precision (FP16, BF16), at higher speed, and some processors (e.g., AI-oriented processors) do not support FP64 operations at all. We are developing methods for FP64-equivalent computation using low-precision arithmetic.
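
To give a flavor of how this works (a minimal sketch, not the method used in our research; names such as dfloat and df_add are illustrative): a pair of FP32 numbers can carry roughly twice the precision of a single FP32 value, provided additions are performed with error-free transformations that capture every rounding error.

#include <stdio.h>

/* One value stored as an unevaluated sum hi + lo of two FP32 numbers. */
typedef struct { float hi, lo; } dfloat;

/* Knuth's two-sum: hi + lo equals a + b exactly (lo is the rounding error). */
static dfloat two_sum(float a, float b) {
    dfloat r;
    r.hi = a + b;
    float bb = r.hi - a;
    r.lo = (a - (r.hi - bb)) + (b - bb);
    return r;
}

/* Add a plain FP32 value to a dfloat (simplified, no special-case handling). */
static dfloat df_add(dfloat x, float y) {
    dfloat s = two_sum(x.hi, y);
    s.lo += x.lo;
    return two_sum(s.hi, s.lo); /* renormalize so that |lo| stays small */
}

int main(void) {
    /* Accumulate 1e-4 one million times: plain FP32 loses digits,
       while the FP32 pair keeps an FP64-like result. */
    float naive = 0.0f;
    dfloat acc = { 0.0f, 0.0f };
    for (int i = 0; i < 1000000; i++) {
        naive += 1e-4f;
        acc = df_add(acc, 1e-4f);
    }
    printf("plain FP32: %.8f\n", (double)naive);
    printf("FP32 pair : %.8f\n", (double)acc.hi + (double)acc.lo);
    printf("reference : %.8f\n", 1e6 * (double)1e-4f);
    return 0;
}

A real implementation also needs error-free multiplication and careful handling of exponent ranges; this sketch only conveys the basic principle of representing one high-precision value with several low-precision ones.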

Related publications

Reproducible computation

The results of numerical calculations performed with finite-precision arithmetic can vary with the order of computation because of rounding errors. The use of special instructions (e.g., fused multiply-add and atomic operations) can also change the rounding errors. Therefore, even for the same algorithm and the same problem, the results may not be reproduced if the execution environment, implementation, or executable file differs. This can be a serious problem for reproducing results, quality assurance, and debugging. We have developed a computation method that uses error-free transformations to ensure bit-level reproducibility of results in any computing environment.
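
The building block behind such methods is an error-free transformation. The sketch below (illustrative only, not our actual implementation) first shows that the same four FP32 numbers summed in two different orders give different results, and then how Knuth's two-sum recovers the rounding error of an addition exactly, so that no information is lost regardless of the order in which values are combined.

#include <stdio.h>

/* Knuth's two-sum: for any FP32 a and b, returns s = fl(a + b) and the
   exact rounding error e, so that a + b == s + e holds exactly.
   Reproducible summation schemes are built from transformations like this. */
static void two_sum(float a, float b, float *s, float *e) {
    *s = a + b;
    float bb = *s - a;
    *e = (a - (*s - bb)) + (b - bb);
}

int main(void) {
    /* The exact sum of these four numbers is 2, but plain FP32 summation
       gives different answers depending on the evaluation order. */
    float x[4] = { 1.0e8f, 1.0f, -1.0e8f, 1.0f };
    float fwd = ((x[0] + x[1]) + x[2]) + x[3];  /* left to right:  1 */
    float rev = ((x[3] + x[2]) + x[1]) + x[0];  /* right to left:  0 */
    printf("forward order : %g\n", fwd);
    printf("backward order: %g\n", rev);

    /* two_sum keeps the lost low-order part as a separate value,
       so nothing is rounded away. */
    float s, e;
    two_sum(1.0e8f, 1.0f, &s, &e);
    printf("two_sum(1e8, 1): s = %g, e = %g (s + e is exact)\n", s, e);
    return 0;
}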

Related publications

Development of highly scalable matrix multiplication routine for large-scale distributed parallel systems

In large-scale distributed parallel computing with tens of thousands of processes, performance can become communication-bound even for computationally intensive calculations, and scalability deteriorates under strong scaling. We have implemented a matrix multiplication routine based on a communication-avoiding (2.5-dimensional) algorithm that is compatible with the existing matrix multiplication routine (so-called PDGEMM), which uses a two-dimensional distribution scheme, and analyzed its behavior through performance modeling. Performance evaluation on the K computer (RIKEN), the supercomputer Fugaku (RIKEN), and Oakforest-PACS (University of Tokyo) showed improved strong-scaling performance compared with the existing routine.
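
The benefit can be seen from the textbook asymptotic cost model for this class of algorithms (per process, n x n matrices on P processes, with the matrices replicated c times); the formulas below are the standard bounds from the 2.5D analysis, not our measured results:

\[
  W_{\mathrm{2D}} = O\!\left(\frac{n^2}{\sqrt{P}}\right), \qquad
  W_{\mathrm{2.5D}} = O\!\left(\frac{n^2}{\sqrt{cP}}\right), \qquad
  1 \le c \le P^{1/3}.
\]

That is, the 2.5D algorithm trades c times the memory per process (O(cn^2/P) instead of O(n^2/P)) for a reduction of the communicated data volume by a factor of sqrt(c), which is exactly what helps in the communication-bound strong-scaling regime.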

Related publications

High-performance matrix computation on GPUs

GPU programs, which perform computations with tens of thousands of threads, require implementation and performance-optimization techniques different from those for CPU programs. We have developed new high-performance implementations of basic linear algebra operations, such as matrix-vector multiplication for dense and sparse matrices. We have also developed a technique that automatically tunes the number of threads used for execution, a parameter that strongly determines performance.
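
As a simple illustration of the GPU programming model (a minimal sketch, not our optimized implementation), the following CUDA code computes a dense matrix-vector product with one thread per matrix row; the thread-block size BLOCK is exactly the kind of execution parameter whose best value depends on the GPU and the problem, which is what the automatic thread-count tuning mentioned above addresses.

#include <cstdio>
#include <cuda_runtime.h>

// Tunable launch parameter: the best value depends on the GPU and problem size.
#define BLOCK 256

// One thread computes one element of y = A * x (A is n x n, row-major).
__global__ void gemv_kernel(int n, const double *A, const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        double sum = 0.0;
        for (int col = 0; col < n; ++col)
            sum += A[row * n + col] * x[col];
        y[row] = sum;
    }
}

int main()
{
    const int n = 1024;
    double *A, *x, *y;
    // Unified memory keeps the example short; a tuned code would manage
    // host/device transfers explicitly.
    cudaMallocManaged(&A, sizeof(double) * n * n);
    cudaMallocManaged(&x, sizeof(double) * n);
    cudaMallocManaged(&y, sizeof(double) * n);
    for (int i = 0; i < n; ++i) {
        x[i] = 1.0;
        for (int j = 0; j < n; ++j) A[i * n + j] = (i == j) ? 2.0 : 0.0;
    }

    gemv_kernel<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(n, A, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f (expected 2.0)\n", y[0]);
    cudaFree(A); cudaFree(x); cudaFree(y);
    return 0;
}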

Related publications

Mixed-precision computing

Mixed-precision methods have been widely studied to accelerate computation by using lower-precision operations, such as single precision (FP32) and half precision (FP16), alongside double precision (FP64). We are developing new mixed-precision schemes for sparse matrices as well as automatic mixed-precision optimization methods that do not depend on a particular algorithm. For sparse iterative solvers, we have also developed a method that improves convergence by using higher-precision arithmetic (e.g., quadruple precision) in calculations that normally use double precision, thereby reducing the total time to solution.
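
To make the basic idea concrete, here is a minimal sketch of classical mixed-precision iterative refinement for a dense system (the textbook scheme, not the sparse or automatic methods described above): the solves run in FP32, the residual is computed in FP64, and a few corrections bring the solution to FP64-level accuracy.

#include <cstdio>
#include <cmath>
#include <vector>

// Solve A x = b in FP32 by Gaussian elimination without pivoting.
// A and b are taken by value because elimination overwrites them; this is
// fine for a small demo with a diagonally dominant matrix.
static void solve_fp32(int n, std::vector<float> A, std::vector<float> b,
                       std::vector<float> &x)
{
    for (int k = 0; k < n; ++k) {
        for (int i = k + 1; i < n; ++i) {
            float m = A[i * n + k] / A[k * n + k];
            for (int j = k; j < n; ++j) A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }
    for (int i = n - 1; i >= 0; --i) {
        float s = b[i];
        for (int j = i + 1; j < n; ++j) s -= A[i * n + j] * x[j];
        x[i] = s / A[i * n + i];
    }
}

int main()
{
    const int n = 64;
    std::vector<double> A(n * n), b(n), x(n, 0.0);
    std::vector<float>  Af(n * n);

    // A diagonally dominant test matrix and a right-hand side.
    for (int i = 0; i < n; ++i) {
        b[i] = 1.0;
        for (int j = 0; j < n; ++j) {
            A[i * n + j]  = (i == j) ? 2.0 * n : 1.0 / (1.0 + i + j);
            Af[i * n + j] = (float)A[i * n + j];   // low-precision copy
        }
    }

    // Iterative refinement: the (cheap) solves run in FP32, while the
    // residual is accumulated in FP64, so x reaches FP64-level accuracy
    // after a few corrections.
    for (int iter = 0; iter < 5; ++iter) {
        std::vector<float> rf(n), df(n, 0.0f);
        double rnorm = 0.0;
        for (int i = 0; i < n; ++i) {          // r = b - A*x in FP64
            double r = b[i];
            for (int j = 0; j < n; ++j) r -= A[i * n + j] * x[j];
            rf[i] = (float)r;
            rnorm = std::fmax(rnorm, std::fabs(r));
        }
        std::printf("iter %d: max residual = %.3e\n", iter, rnorm);
        solve_fp32(n, Af, rf, df);             // correction d in FP32
        for (int i = 0; i < n; ++i) x[i] += (double)df[i];
    }
    return 0;
}

The same pattern, low precision for the bulk of the work and high precision only where accuracy matters, underlies the sparse and automatic variants mentioned above.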

Related publications

Publications

Journal Papers

Peer-reviewed Conference Proceedings

Technical Reports (Non-reviewed)

Peer-reviewed Poster Presentations

Poster Presentations (Non-reviewed)

Oral Presentations

Awards

Funding

Teaching

Products

Professional Activities