Running Qbics
======================

.. contents::
   :local:

Qbics should be run from the Windows command prompt or a Linux/macOS terminal.
To run Qbics, you only need to give an input file name. We prepare an input
file called ``water.inp``:

.. code-block:: bash
   :caption: water.inp
   :linenos:

   # A B3LYP/cc-pvdz calculation for water.

   basis
     cc-pvdz
   end

   scf
     charge 0      # Total charge.
     spin2p1 1
   end

   mol
     O   0.00000000000000   0.05011194954430   0.05011194954224
     H   0.00000000000000  -0.06080277603381   1.01069082652926
     H   0.00000000000000   1.01069082648951  -0.06080277607149
   end

   task
     energy b3lyp
   end

Command Line Arguments
-------------------------------------------

The usage of Qbics is:

.. code-block:: bash

   qbics-linux-cpu input_file [-n nthreads] [-s scratch_path] [-m max_memory] [-d max_disk] [--gpu gpu_ids]

You can use this command to run Qbics:

.. code-block:: bash

   $ qbics-linux-cpu water.inp > water.out

The optional arguments are explained below:

.. option:: -n

   .. list-table::
      :stub-columns: 1
      :widths: 5 20

      * - Value
        - Define the number of OpenMP threads for each MPI process.
      * - Default
        - ``1``

   The value should be **less than the number of physical CPU cores** of the
   node it is run on.

.. option:: -s

   .. list-table::
      :stub-columns: 1
      :widths: 5 20

      * - Value
        - Define the scratch path where computational temporary files are saved.
      * - Default
        - ``./``

   Qbics will use this path to write some computational temporary files. It
   should be on a **local, fast, and large** disk, and **not** a remote one,
   such as an NFS shared path.

   For Windows users, the scratch path should be given in **Linux format**.
   For example, if the scratch path is ``D:\Jobs\Scratch`` (Windows format),
   then for Qbics you should give ``-s D:/Jobs/Scratch``.

.. option:: -m

   .. list-table::
      :stub-columns: 1
      :widths: 5 20

      * - Value
        - Define the maximum memory size in GB that an MPI process can use.
      * - Default
        - Unlimited

   For example, ``-m 5.5`` means that each MPI process will use up to 5.5 GB
   of memory, no matter how many OpenMP threads it has. Of course, it should
   not exceed the total memory size of the node.

.. option:: -d

   .. list-table::
      :stub-columns: 1
      :widths: 5 20

      * - Value
        - Define the maximum disk size in GB that an MPI process can use in the scratch path.
      * - Default
        - Unlimited

   For example, ``-d 900`` means that each MPI process will use up to 900 GB
   of disk space, no matter how many OpenMP threads it has. Of course, it
   should not exceed the total disk size of the scratch path.

.. option:: --gpu

   .. list-table::
      :stub-columns: 1
      :widths: 5 20

      * - Value
        - Define the GPU device IDs to be used.
      * - Default
        - ``0``

   For example, ``--gpu 0,2,3`` means that Qbics will use the GPU devices
   with IDs ``0``, ``2``, and ``3`` for the calculations.

Here is an example of running Qbics:

.. code-block:: bash

   $ qbics-linux-cpu water.inp -n 8 -m 30 -d 500 -s /scratch/zhang > water.out

This command runs Qbics with the input file ``water.inp``. The number of
OpenMP threads is 8, the maximum memory and disk sizes are 30 GB and 500 GB,
respectively, and the scratch path is ``/scratch/zhang``.
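Before choosing values for ``-n``, ``-m``, ``-d``, and ``-s``, it can help to
check the resources of the node. The snippet below is only a sketch using
standard Linux tools (it is not part of Qbics); ``/scratch/zhang`` is just the
example scratch path used above:

.. code-block:: bash

   # Physical cores: multiply "Socket(s)" by "Core(s) per socket" in the
   # output, and keep -n below that number.
   $ lscpu

   # Total memory in GB, to guide the choice of -m.
   $ free -g

   # Free space on the disk holding the scratch path, to guide -s and -d.
   $ df -h /scratch/zhang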
Run Qbics on a Single Node with GPU
--------------------------------------------

If GPU devices are available, you can simply run the GPU version of Qbics as
before, and Qbics will automatically use a GPU if possible:

.. code-block:: bash

   $ qbics-linux-gpu water.inp -n 8 > water.out

In ``water.out``, Qbics lists the GPUs it has found and reports that only
device ``0`` will be used:

.. code-block:: bash
   :caption: water.out
   :linenos:

   MPI is disabled in this version.
   # Nodes:  1
     ID  Hostname         Memory (GB)  #Cores  #OpenMP
      0  ubuntu-server            251      96        1
   CUDA Device to be used: 0

   CUDA Device:
     On node 0, ubuntu-server: 4 CUDA device is available:
     0: NVIDIA GeForce RTX 4080
        Computational ability: 8.9
        Global memory: 16079 MB
        Block-shared memory: 48 KB = 6144 double
        Constant memory: 64 KB = 8192 double
        Maximum threads per block: 1024
        Maximum thread dimension: 1024, 1024, 64
        Maximum grid dimension: 2147483647, 65535, 65535
     1: NVIDIA GeForce RTX 4080
        Computational ability: 8.9
        Global memory: 16077 MB
        Block-shared memory: 48 KB = 6144 double
        Constant memory: 64 KB = 8192 double
        Maximum threads per block: 1024
        Maximum thread dimension: 1024, 1024, 64
        Maximum grid dimension: 2147483647, 65535, 65535

In Line 8, Qbics has found 4 CUDA devices. In Line 5, Qbics reports that only
device ``0`` will be used, i.e., the device described in Line 9. If you want
to use all 4 GPUs, run with the ``--gpu`` argument:

.. code-block:: bash

   $ qbics-linux-gpu water.inp -n 8 --gpu 0,1,2,3 > water.out

Read ``water.out`` to confirm that all 4 GPUs are used (Line 5):

.. code-block:: bash
   :caption: water.out
   :linenos:

   MPI is disabled in this version.
   # Nodes:  1
     ID  Hostname         Memory (GB)  #Cores  #OpenMP
      0  ubuntu-server            251      96        1
   CUDA Device to be used: 0 1 2 3

   CUDA Device:
     On node 0, ubuntu-server: 4 CUDA device is available:
     0: NVIDIA GeForce RTX 4080
        Computational ability: 8.9
        Global memory: 16079 MB
        Block-shared memory: 48 KB = 6144 double
        Constant memory: 64 KB = 8192 double
        Maximum threads per block: 1024
        Maximum thread dimension: 1024, 1024, 64
        Maximum grid dimension: 2147483647, 65535, 65535
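If you are not sure which device IDs to pass to ``--gpu``, you can first list
the GPUs visible on the node with the standard ``nvidia-smi`` tool. This is
only a sketch (``nvidia-smi`` is not part of Qbics); the IDs it prints should
normally match the CUDA device IDs in ``water.out``, but it is safest to
cross-check against the ``CUDA Device`` listing shown above:

.. code-block:: bash

   # List the GPUs visible on this node and their device IDs.
   $ nvidia-smi -L

   # Then select a subset of devices for Qbics, e.g. only 0 and 1:
   $ qbics-linux-gpu water.inp -n 8 --gpu 0,1 > water.out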
Run Qbics on Multiple Nodes
--------------------------------------------

To run the MPI version of Qbics, make sure that the MPI implementation is
**the same version** as the one used to compile Qbics. To check this, first
run the MPI version in serial mode:

.. code-block:: bash

   $ qbics-linux-cpu-mpi water.inp -n 8 > water.out

In ``water.out``, you can find these lines:

.. code-block:: bash
   :caption: water.out
   :linenos:

   C++ compiler:  g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
   C++ options:   -O2 --std=c++17 -fopenmp -ffast-math -fno-finite-math-only -fexpensive-optimizations -Wall -mavx2 -mfma
   MPI compiler:  mpirun (Open MPI) 4.1.2

Line 3 says that the MPI compiler is ``mpirun (Open MPI) 4.1.2``. Then, in
the shell:

.. code-block:: bash

   $ mpirun -V
   mpirun (Open MPI) 4.1.2

   Report bugs to http://www.open-mpi.org/community/help/

Thus, this ``mpirun`` is exactly the version that Qbics needs.

Run MPI Version of Qbics from Shell
+++++++++++++++++++++++++++++++++++++++++

Run the MPI version of Qbics with ``mpirun``:

.. code-block:: bash

   $ mpirun -np 4 --bind-to none qbics-linux-cpu-mpi water.inp -n 8 > water.out

Here, ``-np`` is the number of MPI processes. Note that you can still use
``-n`` to set up the OpenMP parallelization. In this case, we have 4 MPI
processes, each having 8 OpenMP threads. ``--bind-to none`` sets the CPU
binding mode; if you do not give ``--bind-to none``, the number of OpenMP
threads may be incorrect.

Run MPI Version of Qbics from Slurm
+++++++++++++++++++++++++++++++++++++++++

In most cases, you will run Qbics through a queueing system. In the Qbics
distribution, we provide an example Slurm script, ``tools/run_qbics.slurm``,
to run Qbics:

.. code-block:: bash
   :caption: tools/run_qbics.slurm
   :linenos:

   #!/bin/bash
   #SBATCH --job-name=water
   #SBATCH --nodes=4              # Total number of physical nodes.
   #SBATCH --ntasks=8             # Total number of MPI processes.
   #SBATCH --cpus-per-task=8      # Number of OpenMP threads for each MPI process.
   #SBATCH --partition=your_partition

   # Load the appropriate modules if needed.
   # module load openmpi/4.1.1

   inp=water.inp
   out=water.out

   mpirun qbics-linux-cpu-mpi $inp -n $SLURM_CPUS_PER_TASK > $out

In this script, we request 4 physical nodes (``--nodes``) and 8 MPI processes
in total (``--ntasks``), and each MPI process has 8 OpenMP threads
(``--cpus-per-task``). Thus, we expect each node to run 2 MPI processes. You
can change these parameters according to your needs. ``--partition`` is the
queue you want to use, which should be assigned by your cluster
administrator. In a Slurm script, ``mpirun`` does not need the ``-np``
option, since Slurm automatically sets the number of MPI processes according
to ``--ntasks``.

Submit this task:

.. code-block:: bash

   $ sbatch run_qbics.slurm

After running, you can find these lines in ``water.out`` (on my cluster):

.. code-block:: bash
   :caption: water.out
   :linenos:

   User: junz
   # Physical nodes:  4
   Physical node names: cu295 cu296 cu297 cu298
   MPI version: 3.1
   # MPI processes:  8
     Rank  Hostname    Memory (GB)  #Cores  #OpenMP
        0  cu295               187      32        8
        1  cu295               187      32        8
        2  cu296               187      32        8
        3  cu296               187      32        8
        4  cu297               187      32        8
        5  cu297               187      32        8
        6  cu298               187      32        8
        7  cu298               187      32        8
   CUDA is disabled in this version.

Indeed, we have 4 physical nodes, each running 2 MPI processes, and each MPI
process has 8 OpenMP threads. We also see that each node has 32 cores and
187 GB of memory.

.. attention::

   On different clusters, the Slurm script may need some modifications.
   Please consult the administrator of your cluster.
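After submitting, you can monitor the job with standard Slurm and shell
commands. The snippet below is only a sketch and is not part of Qbics;
``water.out`` is the output file defined in the example script above:

.. code-block:: bash

   # Show your jobs in the queue (pending or running).
   $ squeue -u $USER

   # Follow the Qbics output while the job is running.
   $ tail -f water.out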