Transfer data from the DCSR S3 to other storage
Data in the DCSR S3 should be transferred to another file system as soon as possible: there is no backup for S3 data. This documentation describes the transfer using the Curnagl cluster and the rclone command.
Requirements
- Have an account in the cluster
- Enough space on the NAS or in /work to transfer the data (see the quick check below)
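To check the free space on a destination that is already mounted on the cluster, the standard `df` command gives a quick estimate (the path below is a placeholder; note that on shared filesystems the reported free space reflects the whole share or quota setup, not necessarily your personal allocation):
df -h /path/to/your/destination/folder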
Rclone configuration
Use a text editor to create a configuration file in your home directory. Be sure to replace the S3 server name and the cryptographic key values with the ones sent in the S3 email from the DCSR.
mkdir -p ~/.config/rclone
nano ~/.config/rclone/rclone.conf
The configuration file should look like this:
[s3-dci-ro]
type = s3
provider = Other
access_key_id = T******************M
secret_access_key = S**************************************i
region =
endpoint = https://scl-s3.unil.ch
Different S3 tools use different names for the pair of authentication/cryptographic keys. In Rclone they are named "access_key_id" and "secret_access_key", corresponding respectively to "Access key" and "Private key" in the email sent by the DCSR.
Next, secure your key file:
chmod 600 ~/.config/rclone/rclone.conf
Now, s3-dci-ro is a configured S3 connection alias (a "remote" in Rclone terminology) that you can use without repeating the connection information on the command line.
In the s3-dci-ro alias, the cryptographic keys belong to a user attached to a read-only policy on the S3 cluster. This prevents you from modifying or accidentally deleting your source data when using this alias.
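To verify that the alias is picked up correctly, you can, for example, list the configured remotes and then the buckets visible with the read-only keys:
rclone listremotes
rclone lsd s3-dci-ro: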
Use Rclone in CLI on the Curnagl front node
List the content of your bucket named "bucket1" (these commands only show the directories):
rclone lsd s3-dci-ro:bucket1
rclone lsd s3-dci-ro:bucket1/dir1
You can use the `rclone lsf` command to list both files and folders:
rclone lsf s3-dci-ro:/recn-fac-fbm-dmf-sgruber1-dci-data-transfer/Titan1_Florian_jetABCD_X-DNA_AuF_Grid4_20240202
20240202_113832_EER_GainReference.gain
Images-Disc1/
Within an S3 cluster, all entities are represented as URLs that point to specific objects. These objects are stored uniformly, without any inherent hierarchical structure: the concept of "folders" does not truly exist. However, by convention, the "/" character in the object keys (the URLs) is interpreted as a folder delimiter by the S3 client application. Consequently, an "ls" command essentially performs a filtering and sorting operation on objects stored at the same level. This approach does not scale well; hence, it is not advisable to run an "ls" command on a significantly large number of files or objects.
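If you only need the total number of objects and the amount of data under a given prefix, without printing every object, `rclone size` gives a summary (bucket and folder names are the examples used above; replace them with your own):
rclone size s3-dci-ro:recn-fac-fbm-dmf-sgruber1-dci-data-transfer/Titan1_Florian_jetABCD_X-DNA_AuF_Grid4_20240202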
The different ways to list files and folders on S3 with Rclone are described on the following pages:
https://rclone.org/commands/rclone_ls/
https://rclone.org/commands/rclone_lsl/
https://rclone.org/commands/rclone_lsd/
https://rclone.org/commands/rclone_lsf/
https://rclone.org/commands/rclone_lsjson/
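For example, a recursive listing of the files only, with their sizes and modification times, can be obtained with `rclone lsl` (keep the caveat above in mind and restrict it to reasonably small prefixes):
rclone lsl s3-dci-ro:recn-fac-fbm-dmf-sgruber1-dci-data-transfer/Titan1_Florian_jetABCD_X-DNA_AuF_Grid4_20240202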
Copy all the files from an S3 source folder to a destination folder
The command `rclone copy -v` can be used to copy all files from a source folder to a destination folder. It is important to note that rclone does not duplicate the source folder itself, only its file contents, into the destination folder. Furthermore, rclone does not re-copy a file that already exists unchanged in the destination (by default it compares size and modification time), which allows an interrupted copy operation to be resumed.
When launching a copy operation in the background (with an ampersand `&`, or inside a screen/tmux session), it is recommended to write to a log file with the verbosity set to `-v`. This log file collects information about the copied files and any errors, and provides a status update every minute on the amount of data copied so far.
Here is an example of a command to copy a subset of your data from your DCI S3 bucket to an LTS sub-folder on the Isilon NAS of the DCSR. Please substitute the paths to be relevant for your use case.
rclone copy -v --log-file=./rclone_to_LTS.log s3-dci-ro:recn-fac-fbm-dmf-sgruber1-dci-data-transfer/Titan2_UNIL_Gruber_Joe_HsSmc56_20240209/Images-Disc1/GridSquare_8464755 /nas/FAC/Lettres/ITA/hhussain/provi2folders/LTS/project_toto
These are the components of the command you will need to adapt:
- Connection alias ==> s3-dci-ro:
- S3 bucket name (sent in the e-mail) ==> recn-fac-fbm-dmf-sgruber1-dci-data-transfer
- Source folder path within your bucket ==> /Titan2_UNIL_Gruber_Joe_HsSmc56_20240209/Images-Disc1/GridSquare_8464755
- Destination folder path on the DCSR NAS ==> /nas/FAC/Lettres/ITA/hhussain/provi2folders/LTS/project_toto
- Path to the Rclone log file ==> ./rclone_to_LTS.log
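Before launching a long copy, you can do a trial run that only reports what would be transferred, without writing anything, by adding the standard `--dry-run` flag to the same command:
rclone copy -v --dry-run s3-dci-ro:recn-fac-fbm-dmf-sgruber1-dci-data-transfer/Titan2_UNIL_Gruber_Joe_HsSmc56_20240209/Images-Disc1/GridSquare_8464755 /nas/FAC/Lettres/ITA/hhussain/provi2folders/LTS/project_toto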
If the copy operation is expected to take an extended period of time and you need to disconnect your terminal session, you can execute the Rclone commands within a tmux session. Tmux is available on the Curnagl cluster. For more information on its usage, please refer to the following link:
[Red Hat Introduction to Tmux](https://www.redhat.com/sysadmin/introduction-tmux-linux)
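As a minimal example of the tmux workflow on the front node (the session name `rclone` is arbitrary): start a named session, run your rclone command inside it, detach, and reattach later to check on it:
tmux new -s rclone
# ... run your rclone copy command, then detach with CTRL+B followed by D ...
tmux attach -t rclone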
To monitor the copy and spot potential errors, you can follow the Rclone log file with the Linux `tail` command:
tail -f ./rclone_to_LTS.log
Every minute, a consolidated status of the transfer will be displayed in the log. You can exit the tail command by pressing `CTRL+C`.
Upon completion of the transfer, a summary of the copy process, including any errors, is written at the end of the log file. It is recommended to verify that there are no errors for each copy session.
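A quick way to check for errors over the whole log is to search it with `grep` (log file name as in the example above):
grep -i error ./rclone_to_LTS.log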
Use Rclone in batch mode on the Curnagl cluster (job script template)
To transfer data from the S3 storage cluster to a `/scratch/...` or `/work/...` directory on Curnagl, you will need to adapt the `rclone_ro_s3_copy.sh` SLURM submission script below to your user environment.
Create the `rclone_ro_s3_copy.sh` script in your home folder on the front node of the cluster and replace the values listed after the template; it is important not to alter any other values. You can use `nano` or your preferred text editor to create the SLURM job submission file:
nano rclone_ro_s3_copy.sh
Then, paste the text between the two sets of dashed lines:
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
#!/bin/bash -l
#SBATCH --account hhussain_default
#SBATCH --mail-user hamid.hussain-khan@unil.ch
#SBATCH --job-name rclone_copy
#SBATCH --time 00:30:00
#SBATCH --chdir /scratch/hhussain/RCsub
#SBATCH --mail-type ALL
#SBATCH --output %x-%j.out
#SBATCH --partition cpu
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 4
#SBATCH --mem 1G
#SBATCH --export NONE
####################################################################################
# You must adapt the following parameters for the context of your S3 data copy job
####################################################################################
## Name of your S3 Bucket (sent by email from DCSR)
S3_BUCKET_NAME=recn-fac-fbm-dmf-sgruber1-dci-data-transfer
# Path to the source folder within the S3 bucket to be replicated by rclone
# (only the content of this folder will be copied to the destination, not the folder itself!)
IN_BUCKET_SOURCE_FOLDER_PATH=Titan2_UNIL_Gruber_Joe_HsSmc56_20240209/Images-Disc1/GridSquare_8464755
# Path to the destination folder into which the data will be copied
DESTINATION_FOLDER_PATH=/scratch/hhussain/Rclone-sgrub1-test
####################################################################################
###################################################################################
# Do not change the code after this line
#
mkdir -p "$DESTINATION_FOLDER_PATH"
rclone copy -v --log-file="./$SLURM_JOB_NAME.log" \
    --max-backlog=1000 \
    "s3-dci-ro:$S3_BUCKET_NAME/$IN_BUCKET_SOURCE_FOLDER_PATH" \
    "$DESTINATION_FOLDER_PATH"
###################################################################################
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Please adjust the following parameter values in the SLURM job description file:
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
#SBATCH --account <== The Slurm account to be charged for the compute resources used for the copy
#SBATCH --mail-user <== Your email to receive notifications when the copy is completed or the job is terminated
#SBATCH --job-name <== Name of the copy job (used only in the name of the output and log files)
#SBATCH --time <== Estimate about 1h/TB in hh:mm:ss format if the cluster is not overloaded
#SBATCH --chdir <== Directory where the Slurm and Rclone log output files will be stored
S3_BUCKET_NAME= <== This is your S3 bucket name sent in the e-mail
IN_BUCKET_SOURCE_FOLDER_PATH= <== This is your source folder path within your bucket
DESTINATION_FOLDER_PATH= <== This is your destination folder path on the Curnagl HPC cluster
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Submit it from the front node of the Curnagl cluster with the `sbatch` command:
sbatch rclone_ro_s3_copy.sh
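You can then follow the job with standard SLURM tools, and watch the Rclone log written in the `--chdir` directory (the job name and paths below match the template above; adapt them to your own values):
squeue -u $USER
tail -f /scratch/hhussain/RCsub/rclone_copy.log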
Please refrain from running more than one copy job at a time, either to the NAS or the HPC storage, as the IOPS on the storage systems on both the source and destination are limited resources.
Use Rclone in CLI mode to delete your source data on the S3 cluster
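This section uses a read-write connection alias, `s3-dci-rw:`. If it is not yet defined, add a second section to `~/.config/rclone/rclone.conf`, analogous to the read-only one but using the read-write key pair from your DCSR email (this assumes such a key pair was provided; the key values below are placeholders):
[s3-dci-rw]
type = s3
provider = Other
access_key_id = R******************W
secret_access_key = X**************************************i
region =
endpoint = https://scl-s3.unil.ch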
Once you no longer need a dataset on the S3 cluster, you can delete it using the `rclone purge` command with the `s3-dci-rw:` connection alias:
rclone -v purge s3-dci-rw:recn-fac-fbm-dmf-sgruber1-dci-data-transfer/path/to/the/folder_to_be_deleted
On the Curnagl front node, the deletion speed is approximately 500 objects/s.
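Since a purge is irreversible, you may want to preview what would be deleted first by adding the global `--dry-run` flag to the same command:
rclone -v --dry-run purge s3-dci-rw:recn-fac-fbm-dmf-sgruber1-dci-data-transfer/path/to/the/folder_to_be_deleted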
The documentation for this command is available on the following page: [Rclone Purge Documentation](https://rclone.org/commands/rclone_purge/)