Stacking gridded swath files into cells

Using the command-line interface

The simplest interface for stacking swath files into cells is the command-line interface included with this package.

Make sure you’ve created and activated an appropriate Python environment first.

TL;DR: in a shell run something like:

ascat_swaths_to_cells /path/to/my/swath/files/ /path/to/my/new/cell/files/ H121 contiguous --start_date 2023-01-01 --end_date 2024-01-01

The output cells will be in point array format.

You can then convert to indexed or contiguous ragged array format:

ascat_convert_cell_format /path/to/my/new/cell/files/ /path/to/my/converted/cell/files/ H121 contiguous

Have a look at the help to see required arguments

! ascat_swaths_to_cells  --help

usage: ascat_swaths_to_cells [-h] [--start_date START_DATE]
                             [--end_date END_DATE] [--dump_size DUMP_SIZE]
                             [--cells CELLS [CELLS ...]] [--quiet]
                             FILEPATH OUTPATH PRODUCT_ID [fmt_kwargs ...]

Stack ASCAT swath files to a cell grid

positional arguments:
  FILEPATH              Path to folder containing swath files
  OUTPATH               Path to the output data
  PRODUCT_ID            Product identifier
  fmt_kwargs            Format keyword arguments, depends on the product
                        format used. Example: 'sat=A year=2008'

options:
  -h, --help            show this help message and exit
  --start_date START_DATE
                        Start date in format YYYY-MM-DD. Must also provide end
                        date if this is provided.
  --end_date END_DATE   End date in format YYYY-MM-DD. Must also provide start
                        date if this is provided.
  --dump_size DUMP_SIZE
                        Size at which to dump the data to disk before reading
                        more (default: 1GB)
  --cells CELLS [CELLS ...]
                        Numbers of the cells to process (default: None)
  --quiet               Do not print progress information

So we need to pass at least three positional arguments to the stacker:

FILEPATH - This is a path to the parent directory of the product’s swath files.
OUTPATH - A path to the directory you’d like to send output to.
PRODUCT_ID - The name of the product you’re processing. The program chooses a product reader based on this string, which makes certain assumptions about filename and directory structure. Several products are included in the ASCAT package, and to use them your directory structure must adhere to what they assume. Otherwise you may also create your own product readers. (TODO make a link here)

After the positional arguments we can also pass as many keyword arguments as we want:

fmt_kwargs - These are keyword arguments that will be passed on to the product reader’s fn_read_fmt and sf_read_fmt (functions that tell the package how to find your files).

We can also pass some options:

--start_date and --end_date - in YYYY-MM-DD format. Sets the time range of swath files to stack.
--dump_size - the size of the buffer to fill with read data before dumping to cell files and reading more data. Make this too big and merging/processing will take a while after reading. Too small and repeated writes will be a bottleneck. Even if you have a lot of memory something like 8GB is a good value.

To check which product_id -s are available to use, use ascat_product_info

! ascat_product_info

Available Swath Products:
H129
H121
H122
SIG0_6.25
SIG0_12.5

Available Cell Products:
H129
H121
H122
SIG0_6.25
SIG0_12.5
ERSH
ERSN

To see how a product’s readers have been defined, pass its product_id:

! ascat_product_info h121

Swath Product Information:
class AscatH121Swath(AscatSwathProduct):
    fn_pattern = "W_IT-HSAF-ROME,SAT,SSM-ASCAT-METOP{sat}-12.5km-H121_C_LIIB_{placeholder}_{placeholder1}_{date}____.nc"
    sf_pattern = {"satellite_folder": "metop_[abc]", "year_folder": "{year}", "month_folder": "{month}"}
    date_field_fmt = "%Y%m%d%H%M%S"
    grid_name = "fibgrid_12.5"
    cell_fn_format = "{:04d}.nc"

    @staticmethod
    def fn_read_fmt(timestamp, sat="[ABC]"):
        sat = sat.upper()
        return {
            "date": timestamp.strftime("%Y%m%d*"),
            "sat": sat,
            "placeholder": "*",
            "placeholder1": "*"
        }

    @staticmethod
    def sf_read_fmt(timestamp, sat="[abc]"):
        sat = sat.lower()
        return {
            "satellite_folder": {
                "satellite": f"metop_{sat}"
            },
            "year_folder": {
                "year": f"{timestamp.year}"
            },
            "month_folder": {
                "month": f"{timestamp.month}".zfill(2)
            },
        }

class AscatSwathProduct(SwathProduct):
    grid_name = None

    @classmethod
    def preprocess_(cls, ds):
        ds["location_id"].attrs["cf_role"] = "timeseries_id"
        ds.attrs["global_attributes_flag"] = 1
        ds.attrs["featureType"] = "point"
        ds.attrs["grid_mapping_name"] = cls.grid_name
        if "spacecraft" in ds.attrs:
            # Assumption: the spacecraft attribute is something like "metop-a"
            sat_id = {"a": 3, "b": 4, "c": 5}
            sat = ds.attrs["spacecraft"][-1].lower()
            ds["sat_id"] = ("obs",
                            np.repeat(sat_id[sat], ds["location_id"].size))
            del ds.attrs["spacecraft"]
        return ds

    @staticmethod
    def postprocess_(ds):
        for key, item in {"latitude": "lat", "longitude": "lon", "altitude": "alt"}.items():
            if key in ds:
                ds = ds.rename({key: item})
        if "altitude" not in ds:
            ds["alt"] = ("locations", np.full_like(ds["lat"], fill_value=np.nan))
        return ds

class SwathProduct:
    from ascat.swath import Swath
    file_class = Swath

Cell Product Information:
class AscatH121Cell(RaggedArrayCellProduct):
    grid_name = "fibgrid_12.5"

class RaggedArrayCellProduct(BaseCellProduct):
    file_class = RaggedArrayTs
    sample_dim = "obs"
    instance_dim = "locations"

    @classmethod
    def preprocessor(cls, ds):
        if "row_size" in ds.variables:
            ds["row_size"].attrs["sample_dimension"] = cls.sample_dim
        if "locationIndex" in ds.variables:
            ds["locationIndex"].attrs["instance_dimension"] = cls.instance_dim
        if "location_id" in ds.variables:
            ds["location_id"].attrs["cf_role"] = "timeseries_id"
        if ds.attrs.get("featureType") is None:
            ds = ds.assign_attrs({"featureType": "timeSeries"})
        if ds.attrs.get("grid_mapping_name") is None:
            ds.attrs["grid_mapping_name"] = cls.grid_name
        return ds

class BaseCellProduct:
    fn_format = "{:04d}.nc"

    @classmethod
    def preprocessor(cls, ds):
        return ds

Once you have the right product id chosen or set up, pass your swath file root, output directory, and product id to ascat_swaths_to_cells, along with any other arguments.

ascat_swaths_to_cells /path/to/my/swath/files/ /path/to/my/new/cell/files/ H121 --start_date 2023-01-01 --end_date 2023-12-31

ascat_swaths_to_cells works by iterating through the source swath files one at a time, opening them as xarray datasets, performing any necessary preprocessing, and concatenating each new dataset to all of the previous ones. Once that dataset’s nbytes attribute reaches dump_size, reading is paused while the combined dataset is dumped out into one file in timeseries point array format for each of its constituent cells. Once the cells are written, the process starts again.

On all dumps, data for any cells that already have a file is appended to those files. This is useful if you want to add new data to an existing stack, but if you want to make a fresh export, it’s important to make sure the CLI is pointed to an empty directory.

The output cells are in timeseries point array format. In order to convert them to contiguous ragged array format, we can use the ascat_convert_cell_format CLI. Pass it the path to your newly-stacked cell files, an output directory to write the converted cell files to, a product_id, and the argument contiguous (you could also use indexed here if you’d prefer that format).

ascat_convert_cell_format /path/to/my/new/cell/files/ /path/to/my/converted/cell/files/ H121 contiguous

Using Python

The CLI described above is just a wrapper for a python function. If you need more control over the processing or want to include this as a step in a pipeline, you can make a SwathGridFiles object and call .stack_to_cell_files on it directly.

We pass it at least an output directory path (out_dir), where the outputs will be written, and we can also pass it several other options.

from datetime import datetime
from ascat.swath import SwathGridFiles

swath_source = "/data-write/RADAR/hsaf/h121_v2.0/netcdf"
swath_collection = SwathGridFiles.from_product_id(swath_source, "H121")

# where to save the files
cell_file_directory = ""


# the maximum size of the data buffer before dumping to file (actual maximum memory used will be higher)
max_nbytes = None

# the date range to use. This should be a tuple of datetime.datetime objects
date_range = (datetime(2019, 1, 1), datetime(2019, 12, 31))

# Pass a list of cell numbers (integers) here if you only want to stack data for a certain set of cells. This is mainly useful for testing purposes, since even splitting a day's worth of swath data into files for all of its constituent cells is a lengthy process.
cells=None

# mode : "w" for creating new files if any already exist, "a" to append data to existing cell files
mode = "w"

# # uncomment to run
# swath_collection.stack_to_cell_files(
#     output_dir=cell_file_directory,
#     max_nbytes=max_nbytes,
#     date_range=date_range,
#     mode=mode,
#     processes=processes,
# )

from ascat.cell import CellGridFiles

cell_collection = CellGridFiles.from_product_id(cell_file_directory, product_id="H121")
contiguous_cell_file_directory = "contiguous_directory_name"
# # uncomment to run
# cell_collection.convert_to_contiguous(contiguous_cell_file_directory)

Conversion to contiguous ragged array format will sort the sample dimension first by time and then by location_id. At this point it is no longer practically possible to append new data to the dataset without first re-converting it to indexed ragged array format and then converting back.

Adding a custom product class

To add your own product classes you’ll need to clone this repository and install it in your environment as an editable package (e.g. pip install -e /home/username/Clones/ascat). Then you can edit .../ascat/src/ascat/product_info/product_info.py to add your own classes following the examples of the existing ones. Best to copy-paste, e.g. AscatH129Swath and edit the fields accordingly.

Once your product class is written, add it to the swath_io_catalog dictionary, along with a key to access it. Then you can use this key to specify your custom product when running the CLI.