rigidmol Accelerated by GPU
-------------------------------------

Enable GPU Acceleration
==============================

Since ABCluster 3.2, ``rigidmol`` can be accelerated by one or more GPU cards! This is done with the program ``rigidmol-gpu``, which is currently only available in the Linux version. Before proceeding, we emphasize again that you must have NVIDIA GPU cards and the CUDA toolkit properly configured before running GPU-accelerated ABCluster.

Hardware requirements: CUDA version >= **11.0**; compute capability >= **7.0**.

.. tip::

   Different compute capabilities require different ABCluster downloads. First, check the compute capability of your card on NVIDIA's website. For example, if you have an NVIDIA A30 Tensor Core GPU, you will find that its compute capability is 8.0, so you should download the ``-Linux-GPU80`` version. An inconsistent version of ABCluster may raise errors like:

   **Error occurs: Fail to call the CUDA kernel function. Reason: no kernel image is available for execution on the device.**

   **Error occurs: Fail to call XX. Reason: the provided PTX was compiled with an unsupported toolchain.**

Using GPU acceleration is very easy! Keep in mind that any standard ``rigidmol`` input file can be used with ``rigidmol-gpu``; without extra arguments, it performs a standard global optimization on the CPU. Say, for the example in :doc:`eg-h2o6`, the input file is:

.. code-block:: bash
   :linenos:
   :caption: h2o6.inp

   h2o6.cluster     # cluster file name
   20               # population size
   20               # maximal generations
   3                # scout limit
   4.0              # amplitude
   h2o6             # save optimized configuration
   30               # number of LMs to be saved

So, you can just run the following command to do a standard CPU calculation:

.. code-block:: bash

   $ rigidmol-gpu h2o6.inp > h2o6.out

To use GPUs, just add the argument ``-gpu`` at the end of the command line:

.. code-block:: bash

   $ rigidmol-gpu h2o6.inp -gpu > h2o6.out

Now, ABCluster will try to use one or more GPU cards. GPU acceleration is successfully enabled!
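The download-selection rule in the tip above can be sketched as a tiny helper. This is a sketch, not part of ABCluster: the function name ``cc_to_package`` is hypothetical, and it assumes the ``-Linux-GPU<major><minor>`` naming pattern described above.

.. code-block:: bash

   # Hypothetical helper (not part of ABCluster): map a compute capability
   # string such as "8.0" to the download suffix "-Linux-GPU80", assuming
   # the "-Linux-GPU<major><minor>" naming pattern described above.
   cc_to_package() {
       printf -- "-Linux-GPU%s\n" "$(printf '%s' "$1" | tr -d '.')"
   }

   # On a machine with a recent NVIDIA driver, the compute capability can
   # also be queried directly:
   #   nvidia-smi --query-gpu=compute_cap --format=csv,noheader
   cc_to_package 8.0    # prints -Linux-GPU80
   cc_to_package 7.5    # prints -Linux-GPU75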
Single GPU Performance
==============================

.. tip::

   The sample input and output files can be found in ``testfiles/rigidmol/6-gpuperf``.

In the last example, the GPU calculation is probably slower than the CPU one. The reason is simply that the system is **too small**. In the GPU implementation, data transfer between GPU and host memory is very expensive. So, only for **large** systems, where data transfer is much cheaper than the numerical computation, can the GPU outperform the CPU. An example can be found in ``testfiles/rigidmol/6-gpuperf``, where a system of :math:`(\mathrm{CH}_3\mathrm{CN})_{1500}` is considered. The input file ``mol.inp`` is:

.. code-block:: bash
   :linenos:
   :caption: mol.inp

   mol.cluster      # cluster file name
   1                # population size
   1                # maximal generations
   3                # scout limit
   10.00000000      # amplitude
   mol              # save optimized configuration
   30               # number of LMs to be saved

We use a population size of ``1`` and a generation number of ``1`` since we only want to do a single energy calculation on the optimized structure provided in ``mol.cluster``. Performing a CPU and a GPU calculation on the same computer gives the following result:

.. list-table::

   * - A single Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz core
     - 56 seconds
   * - A single Quadro RTX 4000 card
     - 3 seconds

The GPU acceleration is amazing! So, what is the critical size at which the GPU outperforms the CPU? This depends on the CPU, the GPU, and other hardware conditions. Usually, for a cluster containing more than 10000 atoms, GPU cards should be used.

Multiple GPUs Performance
==============================

.. tip::

   The sample input and output files can be found in ``testfiles/rigidmol/7-multigpus``.

``rigidmol-gpu`` automatically detects the number of GPU cards and uses all of them to accelerate calculations. You do not need to do anything extra.

.. tip::

   ``rigidmol-gpu`` optimizes each cluster on a single GPU. So, to benefit from multiple GPUs, there must be more than 1 individual in the population.
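For this example, the single-card speedup works out to roughly 19x. A quick way to compute it, with the timings hard-coded from the measurement above:

.. code-block:: bash

   # Speedup of the Quadro RTX 4000 over one Xeon core, using the
   # timings measured above (56 s vs 3 s).
   cpu_s=56
   gpu_s=3
   awk -v c="$cpu_s" -v g="$gpu_s" 'BEGIN { printf "speedup: %.1fx\n", c / g }'
   # prints: speedup: 18.7x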
In the last example, we can modify the population size to employ more GPUs:

.. code-block:: bash
   :linenos:
   :caption: mol.inp

   mol.cluster      # cluster file name
   32               # population size
   2                # maximal generations
   3                # scout limit
   10.00000000      # amplitude
   mol              # save optimized configuration
   30               # number of LMs to be saved

For example, if you have 4 A100 cards, to do the global optimization with GPUs, just run:

.. code-block:: bash

   $ rigidmol-gpu mol.inp -gpu > mol-gpu4.out

In the output, you can find this:

.. code-block:: bash
   :linenos:
   :caption: mol-gpu4.out

   CUDA driver version: 11040; runtime version: 11060
   4 GPU device is available:
   0: NVIDIA A100-SXM4-80GB
      Computational ability: 8.0
      Global memory: 81251 MB
      Block-shared memory: 48 KB = 6144 double
      Constant memory: 64 KB = 8192 double
      Maximum threads per block: 1024
      Maximum thread dimension: 1024, 1024, 64
      Maximum grid dimension: 2147483647, 65535, 65535
   1: NVIDIA A100-SXM4-80GB
      Computational ability: 8.0
      Global memory: 81251 MB
      Block-shared memory: 48 KB = 6144 double
      Constant memory: 64 KB = 8192 double
      Maximum threads per block: 1024
      Maximum thread dimension: 1024, 1024, 64
      Maximum grid dimension: 2147483647, 65535, 65535
   2: NVIDIA A100-SXM4-80GB
      Computational ability: 8.0
      Global memory: 81251 MB
      Block-shared memory: 48 KB = 6144 double
      Constant memory: 64 KB = 8192 double
      Maximum threads per block: 1024
      Maximum thread dimension: 1024, 1024, 64
      Maximum grid dimension: 2147483647, 65535, 65535
   3: NVIDIA A100-SXM4-80GB
      Computational ability: 8.0
      Global memory: 81251 MB
      Block-shared memory: 48 KB = 6144 double
      Constant memory: 64 KB = 8192 double
      Maximum threads per block: 1024
      Maximum thread dimension: 1024, 1024, 64
      Maximum grid dimension: 2147483647, 65535, 65535

This means ``rigidmol-gpu`` has detected 4 GPUs and will use all of them. This calculation costs only 20 minutes; done on the CPU, it would probably need 20 hours!
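If you want to use only some of the cards, for example to share a node with other jobs, the standard CUDA runtime environment variable ``CUDA_VISIBLE_DEVICES`` can restrict which cards the program sees. Note this relies on generic CUDA runtime behaviour, not on an ABCluster-specific option, and the output file name below is just an illustration:

.. code-block:: bash

   # Expose only cards 0 and 1 to the CUDA runtime; the program will then
   # detect and use 2 GPUs instead of 4. (Generic CUDA behaviour, not an
   # ABCluster option.)
   export CUDA_VISIBLE_DEVICES=0,1
   echo "visible GPUs: $CUDA_VISIBLE_DEVICES"

   # Run as usual; skip gracefully if the binary is not on PATH.
   if command -v rigidmol-gpu >/dev/null 2>&1; then
       rigidmol-gpu mol.inp -gpu > mol-gpu2.out
   fi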