
Daichi MUKUNOKI

Ph.D. in Engineering, Assistant Professor, Information Technology Center, Nagoya University

Since December 2024, I have been working at the Information Technology Center, Nagoya University. My research interests include high-performance computing (HPC), large-scale parallel computing on supercomputers, the use of low-precision arithmetic on AI processors for scientific computing, mixed-precision numerical computing, and automatic generation of high-performance numerical code using AI techniques. In addition, I conduct research on topics such as automatic tuning and GPU porting of applications together with the students in our laboratory. Students who are interested in our research and would like to work with us are welcome to contact me. I also promote joint research with universities, research institutes, and private companies in Japan and abroad.

News (last update: July 8, 2025)

Recent Research Activities

Profile

Daichi MUKUNOKI

Biography

Work experience

Education

Research Stay/Visiting Abroad

Research

Automatic HPC/numerical code generation using LLMs

As the performance of large language models (LLMs) has improved dramatically, not only interactive agents such as ChatGPT but also technologies that assist with implementing program code, or generate it automatically, are being put into practical use. However, generating highly specialized code remains challenging. In particular, the numerical codes used in HPC are difficult targets because they require sophisticated performance optimization. We are developing an LLM-assisted code generation method specialized for HPC and numerical codes, based on the coordination of multiple LLMs, iterative refinement of prompts, and retrieval-augmented generation (RAG).
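
One ingredient of such a pipeline, the feedback loop that repeatedly regenerates code and folds compiler or benchmark output back into the prompt, can be sketched as follows. Everything here is hypothetical: "llm_generate" is an imaginary command-line tool standing in for whatever LLM interface is actually used, and the file names are placeholders.

#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: ask an LLM to turn the prompt file into kernel.c.
// "llm_generate" is imaginary and used only to keep the sketch self-contained;
// a real pipeline would call an actual LLM API here.
static void ask_llm(const std::string &prompt_file) {
    std::string cmd = "llm_generate --prompt " + prompt_file + " --out kernel.c";
    std::system(cmd.c_str());
}

// Read a whole text file (compiler diagnostics, benchmark logs, ...).
static std::string slurp(const char *path) {
    std::ifstream f(path);
    std::stringstream ss;
    ss << f.rdbuf();
    return ss.str();
}

int main() {
    std::string prompt = "Write an optimized C function for ...";  // task description
    for (int round = 0; round < 5; ++round) {
        { std::ofstream p("prompt.txt"); p << prompt; }
        ask_llm("prompt.txt");

        // Try to compile the generated code and capture the diagnostics.
        int rc = std::system("cc -O3 -c kernel.c 2> compile.log");
        if (rc != 0) {
            // Feed the compiler errors back into the next prompt.
            prompt += "\nThe previous attempt failed to compile:\n" + slurp("compile.log");
            continue;
        }
        // A full pipeline would now run and benchmark the code and append
        // the performance numbers to the prompt in the same way.
        std::printf("round %d: compiled successfully\n", round);
        break;
    }
    return 0;
}

The multi-LLM coordination and RAG components are omitted here; the sketch only conveys the iterative feedback structure.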

High-precision operations using low-precision operations

For ordinary scientific computations, 64-bit floating-point arithmetic (FP64, so-called double precision) is used. However, many processors can perform lower-precision operations, such as 32-bit single precision (FP32) and 16-bit half precision (FP16, BF16), at higher speed, and some processors (e.g., AI-oriented processors) do not support FP64 operations at all. We are developing methods for FP64-equivalent computation using low-precision arithmetic.
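
To give a flavor of how this works (a minimal sketch, not the method used in our research; names such as dfloat and df_add are illustrative): a pair of FP32 numbers can carry roughly twice the precision of a single FP32 value, provided additions are performed with error-free transformations that capture every rounding error.

#include <stdio.h>

/* One value stored as an unevaluated sum hi + lo of two FP32 numbers. */
typedef struct { float hi, lo; } dfloat;

/* Knuth's two-sum: hi + lo equals a + b exactly (lo is the rounding error). */
static dfloat two_sum(float a, float b) {
    dfloat r;
    r.hi = a + b;
    float bb = r.hi - a;
    r.lo = (a - (r.hi - bb)) + (b - bb);
    return r;
}

/* Add a plain FP32 value to a dfloat (simplified, no special-case handling). */
static dfloat df_add(dfloat x, float y) {
    dfloat s = two_sum(x.hi, y);
    s.lo += x.lo;
    return two_sum(s.hi, s.lo); /* renormalize so that |lo| stays small */
}

int main(void) {
    /* Accumulate 1e-4 one million times: plain FP32 loses digits,
       while the FP32 pair keeps an FP64-like result. */
    float naive = 0.0f;
    dfloat acc = { 0.0f, 0.0f };
    for (int i = 0; i < 1000000; i++) {
        naive += 1e-4f;
        acc = df_add(acc, 1e-4f);
    }
    printf("plain FP32: %.8f\n", (double)naive);
    printf("FP32 pair : %.8f\n", (double)acc.hi + (double)acc.lo);
    printf("reference : %.8f\n", 1e6 * (double)1e-4f);
    return 0;
}

A real implementation also needs error-free multiplication and careful handling of exponent ranges; this sketch only conveys the basic principle of representing one high-precision value with several low-precision ones.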

Related publications

Reproducible computation

The results of numerical calculations performed with finite-precision arithmetic can vary with the order of computation because of rounding errors. The use of special instructions (e.g., fused multiply-add and atomic operations) can also change the rounding errors. Therefore, even for the same algorithm and the same problem, the results may not be reproduced if the execution environment, implementation, or executable file differs. This can be a serious problem for reproducing results, quality assurance, and debugging. We have developed a computation method that uses error-free transformations to ensure bit-level reproducibility of results in any computing environment.
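
The building block behind such methods is an error-free transformation. The sketch below (illustrative only, not our actual implementation) first shows that the same four FP32 numbers summed in two different orders give different results, and then how Knuth's two-sum recovers the rounding error of an addition exactly, so that no information is lost regardless of the order in which values are combined.

#include <stdio.h>

/* Knuth's two-sum: for any FP32 a and b, returns s = fl(a + b) and the
   exact rounding error e, so that a + b == s + e holds exactly.
   Reproducible summation schemes are built from transformations like this. */
static void two_sum(float a, float b, float *s, float *e) {
    *s = a + b;
    float bb = *s - a;
    *e = (a - (*s - bb)) + (b - bb);
}

int main(void) {
    /* The exact sum of these four numbers is 2, but plain FP32 summation
       gives different answers depending on the evaluation order. */
    float x[4] = { 1.0e8f, 1.0f, -1.0e8f, 1.0f };
    float fwd = ((x[0] + x[1]) + x[2]) + x[3];  /* left to right:  1 */
    float rev = ((x[3] + x[2]) + x[1]) + x[0];  /* right to left:  0 */
    printf("forward order : %g\n", fwd);
    printf("backward order: %g\n", rev);

    /* two_sum keeps the lost low-order part as a separate value,
       so nothing is rounded away. */
    float s, e;
    two_sum(1.0e8f, 1.0f, &s, &e);
    printf("two_sum(1e8, 1): s = %g, e = %g (s + e is exact)\n", s, e);
    return 0;
}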

Related publications

Development of highly scalable matrix multiplication routine for large-scale distributed parallel systems

In large-scale distributed parallel computing with tens of thousands of processes, performance can become communication-bound even for computationally intensive calculations, and scalability deteriorates under strong scaling. We have implemented a matrix multiplication routine based on a communication-avoiding (2.5-dimensional) algorithm that is compatible with the existing matrix multiplication routine (so-called PDGEMM), which uses a two-dimensional distribution scheme, and analyzed its behavior through performance modeling. Performance evaluation on the K computer (RIKEN), the supercomputer Fugaku (RIKEN), and Oakforest-PACS (University of Tokyo) showed improved strong-scaling performance compared with the existing routine.
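
The benefit can be seen from the textbook asymptotic cost model for this class of algorithms (per process, n x n matrices on P processes, with the matrices replicated c times); the formulas below are the standard bounds from the 2.5D analysis, not our measured results:

\[
  W_{\mathrm{2D}} = O\!\left(\frac{n^2}{\sqrt{P}}\right), \qquad
  W_{\mathrm{2.5D}} = O\!\left(\frac{n^2}{\sqrt{cP}}\right), \qquad
  1 \le c \le P^{1/3}.
\]

That is, the 2.5D algorithm trades c times the memory per process (O(cn^2/P) instead of O(n^2/P)) for a reduction of the communicated data volume by a factor of sqrt(c), which is exactly what helps in the communication-bound strong-scaling regime.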

Related publications

High-performance matrix computation on GPUs

GPU programs, which perform computations with tens of thousands of threads, require implementation and performance-optimization techniques different from those for CPU programs. We have developed new high-performance implementations of basic linear algebra operations, such as matrix-vector multiplication for dense and sparse matrices. We have also developed a technique that automatically tunes the number of threads used for execution, a parameter that strongly determines performance.
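
As a simple illustration of the GPU programming model (a minimal sketch, not our optimized implementation), the following CUDA code computes a dense matrix-vector product with one thread per matrix row; the thread-block size BLOCK is exactly the kind of execution parameter whose best value depends on the GPU and the problem, which is what the automatic thread-count tuning mentioned above addresses.

#include <cstdio>
#include <cuda_runtime.h>

// Tunable launch parameter: the best value depends on the GPU and problem size.
#define BLOCK 256

// One thread computes one element of y = A * x (A is n x n, row-major).
__global__ void gemv_kernel(int n, const double *A, const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        double sum = 0.0;
        for (int col = 0; col < n; ++col)
            sum += A[row * n + col] * x[col];
        y[row] = sum;
    }
}

int main()
{
    const int n = 1024;
    double *A, *x, *y;
    // Unified memory keeps the example short; a tuned code would manage
    // host/device transfers explicitly.
    cudaMallocManaged(&A, sizeof(double) * n * n);
    cudaMallocManaged(&x, sizeof(double) * n);
    cudaMallocManaged(&y, sizeof(double) * n);
    for (int i = 0; i < n; ++i) {
        x[i] = 1.0;
        for (int j = 0; j < n; ++j) A[i * n + j] = (i == j) ? 2.0 : 0.0;
    }

    gemv_kernel<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(n, A, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f (expected 2.0)\n", y[0]);
    cudaFree(A); cudaFree(x); cudaFree(y);
    return 0;
}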

Related publications

Mixed-precision computing

Mixed-precision methods have been widely studied to accelerate computation by using lower-precision operations, such as single precision (FP32) and half precision (FP16), alongside double precision (FP64). We are developing new mixed-precision schemes for sparse matrices as well as automatic mixed-precision optimization methods that do not depend on a particular algorithm. For sparse iterative solvers, we have also developed a method that improves convergence by using higher-precision arithmetic (e.g., quadruple precision) in calculations that normally use double precision, thereby reducing the total time to solution.
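
To make the basic idea concrete, here is a minimal sketch of classical mixed-precision iterative refinement for a dense system (the textbook scheme, not the sparse or automatic methods described above): the solves run in FP32, the residual is computed in FP64, and a few corrections bring the solution to FP64-level accuracy.

#include <cstdio>
#include <cmath>
#include <vector>

// Solve A x = b in FP32 by Gaussian elimination without pivoting.
// A and b are taken by value because elimination overwrites them; this is
// fine for a small demo with a diagonally dominant matrix.
static void solve_fp32(int n, std::vector<float> A, std::vector<float> b,
                       std::vector<float> &x)
{
    for (int k = 0; k < n; ++k) {
        for (int i = k + 1; i < n; ++i) {
            float m = A[i * n + k] / A[k * n + k];
            for (int j = k; j < n; ++j) A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }
    for (int i = n - 1; i >= 0; --i) {
        float s = b[i];
        for (int j = i + 1; j < n; ++j) s -= A[i * n + j] * x[j];
        x[i] = s / A[i * n + i];
    }
}

int main()
{
    const int n = 64;
    std::vector<double> A(n * n), b(n), x(n, 0.0);
    std::vector<float>  Af(n * n);

    // A diagonally dominant test matrix and a right-hand side.
    for (int i = 0; i < n; ++i) {
        b[i] = 1.0;
        for (int j = 0; j < n; ++j) {
            A[i * n + j]  = (i == j) ? 2.0 * n : 1.0 / (1.0 + i + j);
            Af[i * n + j] = (float)A[i * n + j];   // low-precision copy
        }
    }

    // Iterative refinement: the (cheap) solves run in FP32, while the
    // residual is accumulated in FP64, so x reaches FP64-level accuracy
    // after a few corrections.
    for (int iter = 0; iter < 5; ++iter) {
        std::vector<float> rf(n), df(n, 0.0f);
        double rnorm = 0.0;
        for (int i = 0; i < n; ++i) {          // r = b - A*x in FP64
            double r = b[i];
            for (int j = 0; j < n; ++j) r -= A[i * n + j] * x[j];
            rf[i] = (float)r;
            rnorm = std::fmax(rnorm, std::fabs(r));
        }
        std::printf("iter %d: max residual = %.3e\n", iter, rnorm);
        solve_fp32(n, Af, rf, df);             // correction d in FP32
        for (int i = 0; i < n; ++i) x[i] += (double)df[i];
    }
    return 0;
}

The same pattern, low precision for the bulk of the work and high precision only where accuracy matters, underlies the sparse and automatic variants mentioned above.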

Related publications

Publications

Journal Papers

Peer-reviewed Conference Proceedings

Technical Reports (Non-reviewed)

Peer-reviewed Poster Presentations

Poster Presentations (Non-reviewed)

Oral Presentations

Awards

Funding

Teaching

Products

Professional Activities