
    Quick Start

flex_extract is a command-line tool. The first versions were started via a Korn shell script; since version 6, the entry point has been a Python script. From version 7.1 on, a bash shell script is provided that calls flex_extract with the command-line parameters.

    To submit an extraction job, change the working directory to the subdirectory Run (directly under the flex_extract_vX.X root directory, where X.X is the version number):

    cd <path-to-flex_extract_vX.X>/Run

Within this directory you will find everything needed to modify and run flex_extract. The following tree shows a shortened list of directories and important files. The asterisk (*) serves as a wildcard; square brackets [] indicate that a file is present only in certain application modes.

    Run
    ├── Control
    │   ├── CONTROL_*
    ├── Jobscripts
    │   ├── compilejob.ksh
    │   ├── job.ksh
    │   ├── [joboper.ksh]
    ├── Workspace
    │   ├── CERA_example
    │   │   ├── CE000908*
    ├── [ECMWF_ENV]
    ├── run_local.sh
    └── run.sh

The Jobscripts directory is used to store the Korn shell job scripts generated by a flex_extract run in the Remote or Gateway mode. They are used to submit the setup information to the ECMWF server and to start the jobs in ECMWF's batch mode. Typical users do not need to modify these files. They are generated from template files stored in the Templates directory under flex_extract_vX.X. Usually there will be a compilejob.ksh and a job.ksh script, which are explained in the section :doc:`Documentation/input`. In the rare case of operational data extraction, there will also be a joboper.ksh, which reads time information from environment variables on the ECMWF servers.

The Control directory contains a number of sample CONTROL files, covering the current range of possible kinds of extractions. Some parameters in the CONTROL files can be adapted, while others should not be changed. In this :doc:`quick_start` guide, we explain how an extraction with flex_extract can be started in the different :doc:`Documentation/Overview/app_modes` and point out some specifics of each dataset and CONTROL file.

Directly under Run you find the files run.sh and run_local.sh; depending on the selected :doc:`Documentation/Overview/app_modes`, there may also be a file named ECMWF_ENV, which holds the user credentials for quick and automatic access to the ECMWF servers.
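If present, ECMWF_ENV is a plain-text file with one keyword-value pair per line. A minimal sketch, assuming the keywords described in the installation section (all values are placeholders; the file is normally generated during installation):

    ECUID <your_member_state_user_id>
    ECGID <your_member_state_group_id>
    GATEWAY <gateway.host.example>
    DESTINATION <ectrans_destination_name>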

    From version 7.1 on, the run.sh (or run_local.sh) script is the main entry point to flex_extract.

    Note

For experienced users (or users of older versions), it is still possible to start flex_extract directly via the submit.py script in the directory flex_extract_vX.X/Source/Python.
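A sketch of such a direct call; the option names are assumed to mirror the parameters of the run scripts, so treat this as an illustration and check the script's help output for the authoritative list:

    cd <path-to-flex_extract_vX.X>/Source/Python
    python3 submit.py --controlfile=CONTROL_CERA --queue=ecgate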

    Job preparation

To actually start a job with flex_extract, it is sufficient to run either run.sh or run_local.sh. Data sets and access modes are selected in the CONTROL files and within the user section of the run scripts. One should select one of the sample CONTROL files. The following sections describe the differences between the application modes and where the results will be stored.

    Remote and gateway modes

For member-state users, the remote or gateway mode is recommended, especially for more demanding tasks: the data are retrieved and converted on ECMWF machines, and only the final output files are transferred to the local host.

    Remote mode

The only difference between the two modes is the user's working location. In the remote mode, you have to log in to the ECMWF server and then go to the Run directory as shown above. On the ECMWF servers, flex_extract is installed in the $HOME directory. However, to be able to start the program, you first have to load the Python3 environment via the module system.

    # Remote mode
    ssh -X <ecuid>@ecaccess.ecmwf.int
    # On ECMWF server
    [<ecuid>@ecgb11 ~]$ module load python3
    [<ecuid>@ecgb11 ~]$ cd flex_extract_vX.X/Run
    Gateway mode

For the gateway mode, you have to log in to the gateway server and go to the Run directory of flex_extract:

    # Gateway mode
    ssh <user>@<gatewayserver>
    cd <path-to-flex_extract_vX.X>/Run

    From here on the working process is the same for both modes.

For your first submission, you should use one of the example CONTROL files stored in the Control directory. We recommend extracting CERA-20C data, since such retrievals usually deliver quick results and are therefore well suited for testing.

Open the run.sh file and modify the parameter block marked in the file, as shown below:

    # -----------------------------------------------------------------
    # AVAILABLE COMMANDLINE ARGUMENTS TO SET
    #
    # THE USER HAS TO SPECIFY THESE PARAMETERS:
    
    QUEUE='ecgate'
    START_DATE=None
    END_DATE=None
    DATE_CHUNK=None
    JOB_CHUNK=3
    BASETIME=None
    STEP=None
    LEVELIST=None
    AREA=None
    INPUTDIR=None
    OUTPUTDIR=None
    PP_ID=None
    JOB_TEMPLATE='job.temp'
    CONTROLFILE='CONTROL_CERA'
    DEBUG=0
    REQUEST=2
    PUBLIC=0

This would retrieve a one-day (8 September 2000) CERA-20C dataset with 3-hourly temporal resolution and a small 1° domain over Europe. Since the ectrans parameter is set to 1, the resulting output files will be transferred to the local gateway, into the path stored in the destination (see the installation instructions). The parameters listed in the run.sh file overwrite the corresponding settings in the CONTROL file.
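For example, to retrieve two days instead of relying on the dates in the CONTROL file, one could set the date parameters directly in run.sh (the dates are illustrative):

    START_DATE='20000908'
    END_DATE='20000909'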

To start the retrieval, you only have to execute the script:

    ./run.sh

Flex_extract will print some information about the job. If there is no error in the submission to the ECMWF server, you will see something like this:

    ---- On-demand mode! ----
    The job id is: 10627807
    You should get an email per job with subject flex.hostname.pid
    FLEX_EXTRACT JOB SCRIPT IS SUBMITED!

Once the job is submitted, you can check its progress using ecaccess-job-list. After the job has finished, you should get an email with a detailed protocol of what was done.
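A sketch of such a check, run on the ECMWF server or on a host with the ECaccess tools installed (the job id is the one printed at submission):

    ecaccess-job-list            # overview of your submitted jobs
    ecaccess-job-list 10627807   # status details for one job id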

In case the job fails, you will receive an email with the subject ERROR! and the job name. You can then check the email for information, or look for debugging information in the $SCRATCH directory on the ECMWF server.

    cd $SCRATCH
    ls -rthl

The last command lists the most recent logs and temporary retrieval directories (usually pythonXXXXX, where XXXXX is the process id). Under pythonXXXXX, a copy of the CONTROL file is stored under the name CONTROL, the protocol is stored in the file prot, and the temporary as well as the resulting files are stored in a directory work. The original name of the CONTROL file is stored in this new file under the parameter controlfile.
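A sketch of such an inspection, where python12345 stands for a hypothetical process id (pick the most recent directory from the listing):

    cd $SCRATCH
    ls -rthl         # most recent entries are listed last
    cd python12345   # hypothetical temporary retrieval directory
    cat prot         # protocol of the retrieval
    cat CONTROL      # copy of the CONTROL file that was used
    ls work          # temporary and resulting files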

If the job was submitted to the HPC (queue=cca or queue=ccb), you may log in to the HPC and look for job logs in the directory /scratch/ms/ECGID/ECUID/.ecaccess_do_not_remove. The working directories are deleted after a job failure and thus normally cannot be accessed.

To check whether the resulting files have been transferred to the local gateway server, you can use the command ecaccess-ectrans-list, or check the destination path for the resulting files on your local gateway server.
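For instance (the output format may vary between ECaccess versions; the destination path is the one configured during installation):

    # on the ECMWF server:
    ecaccess-ectrans-list
    # or, on the local gateway server:
    ls -rthl <destination-path>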

    Local mode

To get to know the working process and to do your first submission, you can use one of the example CONTROL files stored in the Control directory as they are. For quick results and for testing purposes, it is recommended to extract CERA-20C data.

Open the run_local.sh file and modify the parameter block marked in the file. The settings differ slightly between member-state and public users; in particular, public users have to set PUBLIC=1 and use the public variant of the CONTROL file. A sketch of the parameter block follows the next paragraph.

This would retrieve a one-day (8 September 2000) CERA-20C dataset with 3-hourly temporal resolution and a small 1° domain over Europe. The destination for this retrieval is the Workspace directory under Run; this can be changed to any path you like. The parameters listed in run_local.sh overwrite the corresponding settings in the CONTROL file.
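A sketch of the user section of run_local.sh, assuming it mirrors the run.sh block shown above (the values are illustrative, not the verbatim file content):

    QUEUE=''                              # empty queue: run on the local host
    START_DATE=None
    END_DATE=None
    INPUTDIR=None                         # default: Workspace directory under Run
    OUTPUTDIR=None
    CONTROLFILE='CONTROL_CERA'            # member-state users
    # CONTROLFILE='CONTROL_CERA.public'   # public users
    PUBLIC=0                              # public users set PUBLIC=1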

To start the retrieval, you then execute the script:

    ./run_local.sh

    While job submission on the local host is convenient and easy to monitor (on standard output), there are a few caveats with this option:

1. There is a maximum size of 20 GB for a single retrieval via the ECMWF Web API. Normally, this is not a problem, but for global fields with T1279 resolution and hourly time steps, the limit may already be reached.
2. If the retrieved MARS files are large but the resulting files are relatively small (small local domain), then retrieval to the local host may be inefficient, since all data must be transferred via the Internet. This scenario applies most notably if etadot has to be calculated via the continuity equation, as this requires global fields even if the domain is local. In this case, job submission via ecgate might be a better choice. It really depends on the usage patterns and on the speed of the internet connection.

    Selection and adjustment of CONTROL files

This section describes how to work with the CONTROL files. A detailed explanation of the CONTROL file parameters and of the naming convention can be found here. The more accurately the CONTROL file describes the required retrieval, the fewer command-line parameters have to be set in the run scripts. As of version 7.1, all CONTROL file parameters have default values. They can be found in the section CONTROL parameters or in the CONTROL.documentation file within the Control directory. Only those parameters which need to be changed for a specific dataset retrieval have to be set in a CONTROL file!
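CONTROL files are plain text with one keyword-value pair per line. A minimal sketch for a CERA-20C-like retrieval; the keywords are those described in the section CONTROL parameters, but the values are illustrative and should be checked against the sample files:

    START_DATE 20000908
    CLASS EP
    STREAM ENDA
    TYPE AN
    GRID 1.0
    LEFT -10.
    LOWER 30.
    RIGHT 30.
    UPPER 70.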

Limiting the dataset to be retrieved should be done very carefully. The datasets differ in many ways and vary over time in resolution and parameterisation methods; in particular, the operational model has changed considerably over the successive model cycles. If you are not familiar with the data, it might be useful or necessary to check the availability of the data in ECMWF's MARS:

There you can select, step by step, the data that suit your needs. This is the most straightforward way of checking for available data, and it reduces the chance of flex_extract failing. The following figure shows an example of the web interface:

Figure: Snapshot of the MARS web catalogue (_files/MARS_catalogue_snapshot.png)

Additionally, you can find many helpful links to dataset documentation, direct links to specific dataset web catalogues, and further general information in the link collection of the ECMWF data section.

Flex_extract is specialised to retrieve a limited number of datasets, namely ERA-Interim, CERA-20C, ERA5, and HRES (operational data), as well as ENS (operational data, 15-day forecast). The limitation relates mainly to the dataset itself, the stream (the kind of forecast or the subset of the dataset), and the experiment number. In most cases, the experiment number is 1, indicating that the current version of the dataset is to be used.

The next level of differentiation is the field type, the level type, and the time period. Flex_extract currently supports only the main streams of the re-analysis datasets, but provides extraction of different streams for the operational dataset. The possible combinations of dataset and stream are represented by the current set of example CONTROL files, and are reflected in their names:
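For instance (an illustrative selection; see the Control directory of your installation for the complete and authoritative list):

    CONTROL_CERA                    # CERA-20C
    CONTROL_EA5                     # ERA5
    CONTROL_EI                      # ERA-Interim
    CONTROL_OD.OPER.FC.eta.global   # operational HRES forecast
    CONTROL_CERA.public             # variant for public users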

    The main differences and features in the datasets are listed in the table shown below:

Table (figure): Comparison of the main differences and features of the datasets (_files/dataset_cmp_table.png)

A common problem for beginners in retrieving ECMWF datasets is a mismatch in the choice of values for these parameters. For example, if you try to retrieve operational data for 24 June 2013 or earlier and set the maximum level to 137, you will get an error, because this number of levels was introduced only on 25 June 2013. Thus, be careful when combining the spatial and temporal resolution and the field types.

    Note

Sometimes it might not be clear in which format specific parameters must be set in the CONTROL file. Please consult the description of the parameters in the section CONTROL parameters, or have a look at the ECMWF user documentation for MARS keywords.

In the following, we briefly discuss typical retrievals for the different datasets and point to the respective CONTROL files.

    Public datasets