rigidmol Accelerated by GPU
-------------------------------------

Enable GPU Acceleration
==============================

Since ABCluster 3.2, ``rigidmol`` can be accelerated by one or more GPU cards! This is done with the program ``rigidmol-gpu``, which is currently only available in the Linux version. Before proceeding, we emphasize again that you must have NVIDIA GPU cards and the CUDA toolkit properly configured before running GPU-accelerated ABCluster.

Hardware requirements: CUDA version >= **11.0**; compute capability >= **7.0**.

.. tip::

   Different compute capabilities require different ABCluster downloads. First, check the compute capability of your card on NVIDIA's website. For example, if you have an NVIDIA A30 Tensor Core GPU, you will find that its compute capability is 8.0, so you should download the ``-Linux-GPU80`` version. An inconsistent version of ABCluster may raise errors like:

   **Error occurs: Fail to call the CUDA kernel function. Reason: no kernel image is available for execution on the device.**

   **Error occurs: Fail to call XX. Reason: the provided PTX was compiled with an unsupported toolchain.**

Using GPU acceleration is very easy! Keep in mind that any standard ``rigidmol`` input file can be used with ``rigidmol-gpu``; without extra arguments, it performs a standard global optimization on the CPU. Say, for the example in :doc:`eg-h2o6`, the input file is:

.. code-block:: bash
   :linenos:
   :caption: h2o6.inp

   h2o6.cluster     # cluster file name
   20               # population size
   20               # maximal generations
   3                # scout limit
   4.0              # amplitude
   h2o6             # save optimized configuration
   30               # number of LMs to be saved

So, you can just run the following command to do a standard CPU calculation:

.. code-block:: bash

   $ rigidmol-gpu h2o6.inp > h2o6.out

To use GPUs, just add the argument ``-gpu`` at the end of the command line:

.. code-block:: bash

   $ rigidmol-gpu h2o6.inp -gpu > h2o6.out

Now, ABCluster will try to use one or more GPU cards. GPU acceleration is successfully enabled!
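The download-selection rule in the tip above can be sketched as a tiny helper. This is a sketch, not part of ABCluster: the function name ``cc_to_package`` is hypothetical, and it assumes the ``-Linux-GPU<major><minor>`` naming pattern described above.

.. code-block:: bash

   # Hypothetical helper (not part of ABCluster): map a compute capability
   # string such as "8.0" to the download suffix "-Linux-GPU80", assuming
   # the "-Linux-GPU<major><minor>" naming pattern described above.
   cc_to_package() {
       printf -- "-Linux-GPU%s\n" "$(printf '%s' "$1" | tr -d '.')"
   }

   # On a machine with a recent NVIDIA driver, the compute capability can
   # also be queried directly:
   #   nvidia-smi --query-gpu=compute_cap --format=csv,noheader
   cc_to_package 8.0    # prints -Linux-GPU80
   cc_to_package 7.5    # prints -Linux-GPU75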
Single GPU Performance
==============================

.. tip::

   The sample input and output files can be found in ``testfiles/rigidmol/6-gpuperf``.

In the last example, the GPU calculation is probably slower than the CPU one. The reason is simply that the system is **too small**. In the GPU implementation, data transfer between GPU and host memory is very expensive. So, only for **large** systems, where data transfer is much cheaper than the numerical computation, can the GPU outperform the CPU. An example can be found in ``testfiles/rigidmol/6-gpuperf``, where a system of :math:`(\mathrm{CH}_3\mathrm{CN})_{1500}` is considered. The input file ``mol.inp`` is:

.. code-block:: bash
   :linenos:
   :caption: mol.inp

   mol.cluster      # cluster file name
   1                # population size
   1                # maximal generations
   3                # scout limit
   10.00000000      # amplitude
   mol              # save optimized configuration
   30               # number of LMs to be saved

We use a population size of ``1`` and a generation number of ``1`` since we only want to do a single energy calculation on the optimized structure provided in ``mol.cluster``. Performing a CPU and a GPU calculation on the same computer gives the following result:

.. list-table::

   * - A single Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz core
     - 56 seconds
   * - A single Quadro RTX 4000 card
     - 3 seconds

The GPU acceleration is amazing! So, what is the critical size at which the GPU outperforms the CPU? This depends on the CPU, the GPU, and other hardware conditions. Usually, for a cluster containing more than 10000 atoms, GPU cards should be used.

Multiple GPUs Performance
==============================

.. tip::

   The sample input and output files can be found in ``testfiles/rigidmol/7-multigpus``.

``rigidmol-gpu`` automatically detects the number of GPU cards and uses all of them to accelerate calculations. You do not need to do anything extra.

.. tip::

   ``rigidmol-gpu`` optimizes each cluster on a single GPU. So, to benefit from multiple GPUs, there must be more than 1 individual in the population.
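For this example, the single-card speedup works out to roughly 19x. A quick way to compute it, with the timings hard-coded from the measurement above:

.. code-block:: bash

   # Speedup of the Quadro RTX 4000 over one Xeon core, using the
   # timings measured above (56 s vs 3 s).
   cpu_s=56
   gpu_s=3
   awk -v c="$cpu_s" -v g="$gpu_s" 'BEGIN { printf "speedup: %.1fx\n", c / g }'
   # prints: speedup: 18.7x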
In the last example, we can modify the population size to employ more GPUs:

.. code-block:: bash
   :linenos:
   :caption: mol.inp

   mol.cluster      # cluster file name
   32               # population size
   2                # maximal generations
   3                # scout limit
   10.00000000      # amplitude
   mol              # save optimized configuration
   30               # number of LMs to be saved

For example, if you have 4 A100 cards, to do the global optimization with GPUs, just run:

.. code-block:: bash

   $ rigidmol-gpu mol.inp -gpu > mol-gpu4.out

In the output, you can find this:

.. code-block:: bash
   :linenos:
   :caption: mol-gpu4.out

   CUDA driver version: 11040; runtime version: 11060
   4 GPU device is available:
   0: NVIDIA A100-SXM4-80GB
      Computational ability: 8.0
      Global memory: 81251 MB
      Block-shared memory: 48 KB = 6144 double
      Constant memory: 64 KB = 8192 double
      Maximum threads per block: 1024
      Maximum thread dimension: 1024, 1024, 64
      Maximum grid dimension: 2147483647, 65535, 65535
   1: NVIDIA A100-SXM4-80GB
      Computational ability: 8.0
      Global memory: 81251 MB
      Block-shared memory: 48 KB = 6144 double
      Constant memory: 64 KB = 8192 double
      Maximum threads per block: 1024
      Maximum thread dimension: 1024, 1024, 64
      Maximum grid dimension: 2147483647, 65535, 65535
   2: NVIDIA A100-SXM4-80GB
      Computational ability: 8.0
      Global memory: 81251 MB
      Block-shared memory: 48 KB = 6144 double
      Constant memory: 64 KB = 8192 double
      Maximum threads per block: 1024
      Maximum thread dimension: 1024, 1024, 64
      Maximum grid dimension: 2147483647, 65535, 65535
   3: NVIDIA A100-SXM4-80GB
      Computational ability: 8.0
      Global memory: 81251 MB
      Block-shared memory: 48 KB = 6144 double
      Constant memory: 64 KB = 8192 double
      Maximum threads per block: 1024
      Maximum thread dimension: 1024, 1024, 64
      Maximum grid dimension: 2147483647, 65535, 65535

This means ``rigidmol-gpu`` has detected 4 GPUs and will use all of them. This calculation costs only 20 minutes; done on the CPU, it would probably need 20 hours!
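If you want to use only some of the cards, for example to share a node with other jobs, the standard CUDA runtime environment variable ``CUDA_VISIBLE_DEVICES`` can restrict which cards the program sees. Note this relies on generic CUDA runtime behaviour, not on an ABCluster-specific option, and the output file name below is just an illustration:

.. code-block:: bash

   # Expose only cards 0 and 1 to the CUDA runtime; the program will then
   # detect and use 2 GPUs instead of 4. (Generic CUDA behaviour, not an
   # ABCluster option.)
   export CUDA_VISIBLE_DEVICES=0,1
   echo "visible GPUs: $CUDA_VISIBLE_DEVICES"

   # Run as usual; skip gracefully if the binary is not on PATH.
   if command -v rigidmol-gpu >/dev/null 2>&1; then
       rigidmol-gpu mol.inp -gpu > mol-gpu2.out
   fi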