Transfer data from the DCSR S3 to another storage system
Data in the DCSR S3 should be transferred to another file system as soon as possible. There is no backup for S3 data. This documentation describes the transfer using the Curnagl cluster and the rclone command.
Introduction
What is S3?
Amazon S3 (Simple Storage Service) is a scalable object storage service used for storing and retrieving any amount of data at any time. It organizes data into containers called “buckets.” Each bucket can store an unlimited number of objects, which are the fundamental entities stored in S3.
Understanding S3 Bucket Structure:
- Buckets: These are the top-level containers in S3. Each bucket has a unique name and is used to store objects.
- Objects: These are the files stored in a bucket. Each object is identified by a unique key (or ID) within the bucket.
- Object Keys: While S3 does not have a traditional file system hierarchy, it uses a flat namespace. The / character in object keys is used to simulate a directory structure, making it easier to organize and manage objects. However, these are not actual directories but part of the object's key, as illustrated below.
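To make the flat namespace more concrete, here is a purely illustrative sketch of how a few object keys in a single bucket are interpreted as a folder tree by most S3 clients (all names are made up for the example):
Object keys stored in the bucket (flat namespace):
  folder1/file1.txt
  folder1/file2.txt
  folder2/sub/file3.txt
Interpreted by S3 client tools as:
  folder1/
    file1.txt
    file2.txt
  folder2/
    sub/
      file3.txt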
S3 Endpoint Access
Accessing S3 is similar to accessing any other web service over HTTP, which most users are already familiar with. The endpoint URL follows the same structure as a typical web address, making it straightforward to understand and use.
An S3 endpoint address typically looks like this: https://dnsname.com/bucket-name/object-key
- Endpoint: https://dnsname.com
- Bucket Name: bucket-name
- Object Key: object-key
For example, if you have a bucket named my-bucket and an object with the key folder1/file.txt, the S3 URL would be: https://dnsname.com/my-bucket/folder1/file.txt
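Although the rest of this guide uses Rclone rather than raw HTTP requests, the same three components appear in an Rclone path once a connection alias has been configured (see the Rclone configuration section below). In the sketch below, my-bucket and folder1/file.txt are the hypothetical names from the example above, and s3-dci-ro is the alias defined later in this guide:
# Endpoint: taken from the rclone configuration of the alias
# Bucket:   my-bucket
# Key:      folder1/file.txt
# Print the content of the object to the terminal
rclone cat s3-dci-ro:my-bucket/folder1/file.txt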
IAM Key Pairs
To access and manage your S3 resources securely, you will use IAM (Identity and Access Management) key pairs instead of a traditional login and password. An IAM key pair consists of an Access Key ID and a Secret Access Key. These keys are used to authenticate your requests to the S3 service.
- Access Key ID: This is similar to a username.
- Secret Access Key: This is similar to a password and should be kept secure.
Unlike a traditional login and password, different IAM key pairs can be attached to different sets of permissions defined in their policy files. These policies control what actions the keys are allowed to perform, enhancing security by ensuring that each key pair has only the necessary permissions for its intended tasks.
Requirements
- Have an account on the Curnagl cluster
- Have enough space on the NAS or in /work to transfer the data
Rclone configuration
Use a text editor to create a configuration file in your home directory. Be sure to replace the S3 server name and the cryptographic key values with the ones sent in the S3 email from the DCSR.
mkdir -p ~/.config/rclone
nano ~/.config/rclone/rclone.conf
The configuration file should look like this:
[s3-dci-ro]
type = s3
provider = Other
access_key_id = T******************M
secret_access_key = S**************************************i
region =
endpoint = https://scl-s3.unil.ch
Different S3 tools use different names for the pair of authentication/cryptographic keys. For Rclone, they are named access_key_id and secret_access_key, corresponding respectively to the Access key and Private key in the email sent by the DCSR.
Next, secure your key file:
chmod 600 ~/.config/rclone/rclone.conf
Now, s3-dci-ro is a configured S3 connection alias that you can use in Rclone without repeating the connection information on the command line.
s3-dci-ro: In this connection alias, the cryptographic keys are assigned to a user attached to a read-only policy on the S3 cluster. This prevents you from modifying or accidentally deleting your source data when using this connection alias.
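As a quick sanity check, you can verify that the alias is correctly configured, for example by listing the configured remotes and the buckets your keys give access to (depending on the permissions attached to your keys, the bucket list may be restricted):
# List the remotes defined in ~/.config/rclone/rclone.conf
rclone listremotes
# List the buckets accessible with the s3-dci-ro keys
rclone lsd s3-dci-ro: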
Use Rclone on the command line on the Curnagl front node.
List the content of your bucket named "bucket1" (this command only shows the directories):
rclone lsd s3-dci-ro:bucket1
rclone lsd s3-dci-ro:bucket1/dir1
You can use the rclone lsf command to list both files and folders.
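For example, using the illustrative bucket and directory names from above:
# List files and folders at the top level of the bucket
rclone lsf s3-dci-ro:bucket1
# List files and folders inside dir1
rclone lsf s3-dci-ro:bucket1/dir1
# Recursive listing (use with care on buckets containing many objects, see below)
rclone lsf -R s3-dci-ro:bucket1/dir1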
Within an S3 cluster, all entities are represented as URLs that point to specific objects. These objects are stored uniformly, without any inherent hierarchical structure. The concept of "folders" does not truly exist. However, by convention, the "/" character in the object keys (the URLs) is interpreted as a folder delimiter by the S3 client application. Consequently, the "ls" command essentially performs a filtering and sorting operation on information stored at the same level. This approach does not scale well; hence, it is not advisable to execute an "ls" command on a very large number of files or objects.
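If you only need the number of objects and their total size rather than the full list, the rclone size command prints just a summary, for example:
# Report the number of objects and the total size under dir1
rclone size s3-dci-ro:bucket1/dir1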
The different ways to list files and folders on S3 with Rclone are described in the Rclone documentation.
The command rclone copy -v can be used to copy all files from a source folder to a destination folder. It is important to note that rclone does not duplicate the initial folder, but only its file contents into the destination folder. Furthermore, rclone does not recopy a file if it already exists in the destination, allowing for the resumption of an interrupted copy operation.
When launching a copy operation in the background with an ampersand (&) or inside a screen/tmux session, it is recommended to use a log file with the verbosity set to -v. This log file will collect information about the copied files and errors, and provide a status update every minute on the amount of data copied so far.
Here is an example of a command to copy a subset of your data from your DCI S3 bucket to an LTS sub-folder on the Isilon NAS of the DCSR. Please substitute the paths to be relevant for your use case.
rclone copy -v --log-file=$log_file.log $connection_alias:$bucket/$path $NAS_PATH
You need to adapt the following parameters:
- $log_file: path to the rclone log file
- $connection_alias: connection alias (e.g., s3-dci-ro)
- $bucket: S3 bucket name sent by email
- $path: directory path you want to access inside the bucket
- $NAS_PATH: your destination folder path on the DCSR NAS
It should give you something like:
rclone copy -v --log-file=./rclone_to_LTS.log s3-dci-ro:bucket/dir1/dir2 /nas/FAC/Faculty/Unit/PI/project/LTS/project_toto
If the copy operation is expected to take an extended period of time and you need to disconnect your terminal session, you can execute the Rclone commands within a tmux session. Tmux is available on the Curnagl cluster; see the dedicated documentation for more information on its usage.
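A minimal tmux workflow could look like the following (the session name rclone_s3 is just an example):
# Start a named tmux session on the front node
tmux new -s rclone_s3
# ... run your rclone copy command inside the session ...
# Detach with CTRL+B then D; the copy keeps running in the background
# Re-attach later to check the progress
tmux attach -t rclone_s3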
To monitor the copy process and identify potential errors, you can view the progress of the copy operation by opening the Rclone log file using the Linux "tail" command:
tail -f rclone_to_LTS.log
Every minute, a consolidated status of the transfer will be displayed in the logs. You can exit the tail command by pressing CTRL+C.
Upon completion of the transfer, a summary of the copy process, including any errors, will be available at the end of the log file. It is recommended to verify that there are no errors for each copy session.
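For example, you can inspect the end of the log and search for errors with standard Linux tools; if you want an additional verification, rclone check compares the source and the destination without modifying either of them (the paths below are the ones from the example above):
# Display the end-of-transfer summary
tail -n 20 rclone_to_LTS.log
# Search for errors reported during the transfer
grep -i error rclone_to_LTS.log
# Optional: compare source and destination after the copy
rclone check -v s3-dci-ro:bucket/dir1/dir2 /nas/FAC/Faculty/Unit/PI/project/LTS/project_toto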
Job script template to perform copy
To transfer data from the S3 storage cluster to the /scratch/... or /work/... directory on Curnagl, you will need to modify the rclone_ro_s3_copy.sh SLURM submission file shown here.
#!/bin/bash -l
#SBATCH --mail-user $user.name@unil.ch
#SBATCH --job-name rclone_copy
#SBATCH --time 1-00:00:00
#SBATCH --mail-type ALL
#SBATCH --output %x-%j.out
#SBATCH --cpus-per-task 4
#SBATCH --mem 1G
#SBATCH --export NONE
# Name of your S3 bucket (sent by email from the DCSR)
S3_BUCKET_NAME=""
# Path to the source folder within the S3 bucket to be replicated by rclone
# (only the content of this folder will be copied to the destination, not the folder itself!)
IN_BUCKET_SOURCE_PATH=""
# Path to the destination folder in which the data will be copied
DESTINATION_PATH=""
# Do not change the code after this line
mkdir -p $DESTINATION_PATH
rclone copy -v --log-file=$SLURM_JOB_NAME.log --max-backlog=1000 s3-dci-ro:$S3_BUCKET_NAME/$IN_BUCKET_SOURCE_PATH $DESTINATION_PATH
You should edit the previous file with your real email account and put a value for the S3_BUCKET_NAME, IN_BUCKET_SOURCE_PATH and DESTINATION_PATH variables.
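For example, with purely illustrative values following the same path conventions as above, the variable block could be filled in as follows:
S3_BUCKET_NAME="bucket1"
IN_BUCKET_SOURCE_PATH="dir1/dir2"
DESTINATION_PATH="/work/FAC/Faculty/Unit/PI/project/import_from_s3"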
Submit it from the front node of the Curnagl cluster with the sbatch command:
sbatch rclone_ro_s3_copy.sh
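Once submitted, you can for instance check that the job is running and follow its log file (named after the --job-name defined in the script and written in the submission directory):
# Check the state of your jobs
squeue -u $USER
# Follow the rclone log written by the job
tail -f rclone_copy.log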
Please refrain from running more than one copy job at a time, either to the NAS or the HPC storage, as the IOPS on the storage systems on both the source and destination are limited resources.