Stacking gridded swath files into cells
Using the command-line interface
The simplest interface for stacking swath files into cells is the command-line interface included with this package.
Make sure you’ve created and activated an appropriate Python environment first.
TL;DR: in a shell run something like:
ascat_swaths_to_cells /path/to/my/swath/files/ /path/to/my/new/cell/files/ H121 --start_date 2023-01-01 --end_date 2024-01-01
The output cells will be in point array format.
You can then convert to indexed or contiguous ragged array format:
ascat_convert_cell_format /path/to/my/new/cell/files/ /path/to/my/converted/cell/files/ H121 contiguous
Have a look at the help to see the required arguments:
! ascat_swaths_to_cells --help
usage: ascat_swaths_to_cells [-h] [--start_date START_DATE]
                             [--end_date END_DATE] [--dump_size DUMP_SIZE]
                             [--cells CELLS [CELLS ...]] [--quiet]
                             FILEPATH OUTPATH PRODUCT_ID [fmt_kwargs ...]

Stack ASCAT swath files to a cell grid

positional arguments:
  FILEPATH              Path to folder containing swath files
  OUTPATH               Path to the output data
  PRODUCT_ID            Product identifier
  fmt_kwargs            Format keyword arguments, depends on the product
                        format used. Example: 'sat=A year=2008'

options:
  -h, --help            show this help message and exit
  --start_date START_DATE
                        Start date in format YYYY-MM-DD. Must also provide end
                        date if this is provided.
  --end_date END_DATE   End date in format YYYY-MM-DD. Must also provide start
                        date if this is provided.
  --dump_size DUMP_SIZE
                        Size at which to dump the data to disk before reading
                        more (default: 1GB)
  --cells CELLS [CELLS ...]
                        Numbers of the cells to process (default: None)
  --quiet               Do not print progress information
So we need to pass at least three positional arguments to the stacker:
FILEPATH - The path to the parent directory of the product’s swath files.
OUTPATH - The path to the directory you’d like to send output to.
PRODUCT_ID - The name of the product you’re processing. The program chooses a product reader based on this string, and that reader makes certain assumptions about filename and directory structure. Several products are included in the ASCAT package, and to use them your directory structure must adhere to what they assume. Otherwise, you can create your own product readers (see “Adding a custom product class” below).
After the positional arguments we can also pass as many keyword arguments as we want:
fmt_kwargs - These are keyword arguments that will be passed on to the product reader’s fn_read_fmt and sf_read_fmt (functions that tell the package how to find your files).
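As an illustration of what “keyword arguments” means on the command line, the sketch below turns tokens such as sat=A year=2008 (the example from the help text) into a dictionary that could be handed to fn_read_fmt and sf_read_fmt. The helper name parse_fmt_kwargs is hypothetical, and the package’s own parsing may differ in its details.

```python
def parse_fmt_kwargs(tokens):
    """Turn ['sat=A', 'year=2008'] into {'sat': 'A', 'year': '2008'}.

    Hypothetical helper for illustration only; not part of the ascat package.
    """
    kwargs = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if not sep or not key:
            # A bare word such as "contiguous" is not a key=value pair.
            raise ValueError(f"expected key=value, got {token!r}")
        kwargs[key] = value
    return kwargs


# Values stay strings here; any type conversion is left to the reader functions.
fmt_kwargs = parse_fmt_kwargs("sat=A year=2008".split())
print(fmt_kwargs)
```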
We can also pass some options:
--start_date and --end_date - In YYYY-MM-DD format. Together they set the time range of swath files to stack; if you provide one, you must provide the other.
--dump_size - The size of the buffer to fill with read data before dumping to cell files and reading more data (default: 1GB). If it is too big, merging and processing will take a while after reading; if it is too small, repeated writes will be a bottleneck. Even if you have a lot of memory, something like 8GB is a good value.
To check which product IDs are available, run ascat_product_info:
! ascat_product_info
Available Swath Products:
H129
H121
H122
SIG0_6.25
SIG0_12.5
Available Cell Products:
H129
H121
H122
SIG0_6.25
SIG0_12.5
ERSH
ERSN
To see how a product’s readers have been defined, pass its product_id:
! ascat_product_info h121
Swath Product Information:
class AscatH121Swath(AscatSwathProduct):
    fn_pattern = "W_IT-HSAF-ROME,SAT,SSM-ASCAT-METOP{sat}-12.5km-H121_C_LIIB_{placeholder}_{placeholder1}_{date}____.nc"
    sf_pattern = {"satellite_folder": "metop_[abc]", "year_folder": "{year}", "month_folder": "{month}"}
    date_field_fmt = "%Y%m%d%H%M%S"
    grid_name = "fibgrid_12.5"
    cell_fn_format = "{:04d}.nc"

    @staticmethod
    def fn_read_fmt(timestamp, sat="[ABC]"):
        sat = sat.upper()
        return {
            "date": timestamp.strftime("%Y%m%d*"),
            "sat": sat,
            "placeholder": "*",
            "placeholder1": "*"
        }

    @staticmethod
    def sf_read_fmt(timestamp, sat="[abc]"):
        sat = sat.lower()
        return {
            "satellite_folder": {
                "satellite": f"metop_{sat}"
            },
            "year_folder": {
                "year": f"{timestamp.year}"
            },
            "month_folder": {
                "month": f"{timestamp.month}".zfill(2)
            },
        }

class AscatSwathProduct(SwathProduct):
    grid_name = None

    @classmethod
    def preprocess_(cls, ds):
        ds["location_id"].attrs["cf_role"] = "timeseries_id"
        ds.attrs["global_attributes_flag"] = 1
        ds.attrs["featureType"] = "point"
        ds.attrs["grid_mapping_name"] = cls.grid_name
        if "spacecraft" in ds.attrs:
            # Assumption: the spacecraft attribute is something like "metop-a"
            sat_id = {"a": 3, "b": 4, "c": 5}
            sat = ds.attrs["spacecraft"][-1].lower()
            ds["sat_id"] = ("obs",
                            np.repeat(sat_id[sat], ds["location_id"].size))
            del ds.attrs["spacecraft"]
        return ds

    @staticmethod
    def postprocess_(ds):
        for key, item in {"latitude": "lat", "longitude": "lon", "altitude": "alt"}.items():
            if key in ds:
                ds = ds.rename({key: item})
        if "altitude" not in ds:
            ds["alt"] = ("locations", np.full_like(ds["lat"], fill_value=np.nan))
        return ds

class SwathProduct:
    from ascat.swath import Swath
    file_class = Swath
Cell Product Information:
class AscatH121Cell(RaggedArrayCellProduct):
    grid_name = "fibgrid_12.5"

class RaggedArrayCellProduct(BaseCellProduct):
    file_class = RaggedArrayTs
    sample_dim = "obs"
    instance_dim = "locations"

    @classmethod
    def preprocessor(cls, ds):
        if "row_size" in ds.variables:
            ds["row_size"].attrs["sample_dimension"] = cls.sample_dim
        if "locationIndex" in ds.variables:
            ds["locationIndex"].attrs["instance_dimension"] = cls.instance_dim
        if "location_id" in ds.variables:
            ds["location_id"].attrs["cf_role"] = "timeseries_id"
        if ds.attrs.get("featureType") is None:
            ds = ds.assign_attrs({"featureType": "timeSeries"})
        if ds.attrs.get("grid_mapping_name") is None:
            ds.attrs["grid_mapping_name"] = cls.grid_name
        return ds

class BaseCellProduct:
    fn_format = "{:04d}.nc"

    @classmethod
    def preprocessor(cls, ds):
        return ds
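To make the role of fn_pattern and fn_read_fmt more concrete, here is a minimal sketch of how the two could combine into a filename search pattern. It copies the H121 pattern and function from the output above and assumes the returned fields are substituted with str.format to give a glob-style pattern; the package’s internal mechanism may differ. The example filename is invented purely to exercise the pattern.

```python
from datetime import datetime
from fnmatch import fnmatchcase

# Copied from the AscatH121Swath definition shown above.
FN_PATTERN = ("W_IT-HSAF-ROME,SAT,SSM-ASCAT-METOP{sat}-12.5km-H121_C_LIIB_"
              "{placeholder}_{placeholder1}_{date}____.nc")


def fn_read_fmt(timestamp, sat="[ABC]"):
    sat = sat.upper()
    return {
        "date": timestamp.strftime("%Y%m%d*"),
        "sat": sat,
        "placeholder": "*",
        "placeholder1": "*",
    }


# Assumption: the fields are filled in with str.format to build a glob pattern.
glob_pattern = FN_PATTERN.format(**fn_read_fmt(datetime(2023, 1, 15), sat="a"))
print(glob_pattern)

# Invented filename, used only to show what the pattern would match.
example = ("W_IT-HSAF-ROME,SAT,SSM-ASCAT-METOPA-12.5km-H121_C_LIIB_"
           "20230115120000_20230115100000_20230115103000____.nc")
print(fnmatchcase(example, glob_pattern))
```

This also shows what a fmt_kwargs entry such as sat=A does: it narrows the default [ABC] wildcard to a single satellite.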
Once you have the right product id chosen or set up, pass your swath file root, output directory, and product id to ascat_swaths_to_cells, along with any other arguments.
ascat_swaths_to_cells /path/to/my/swath/files/ /path/to/my/new/cell/files/ H121 --start_date 2023-01-01 --end_date 2023-12-31
ascat_swaths_to_cells works by iterating through the source swath files one at a time, opening them as xarray datasets, performing any necessary preprocessing, and concatenating each new dataset to all of the previous ones. Once that dataset’s nbytes attribute reaches dump_size, reading is paused while the combined dataset is dumped out into one file in timeseries point array format for each of its constituent cells. Once the cells are written, the process starts again.
On all dumps, data for any cells that already have a file is appended to those files. This is useful if you want to add new data to an existing stack, but if you want to make a fresh export, it’s important to make sure the CLI is pointed to an empty directory.
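The read, buffer, and dump cycle and the append behaviour described above can be sketched in plain Python. This is a simplified model rather than the package’s implementation: swaths are lists of (cell, observation) pairs instead of xarray datasets, the buffer size is counted in records instead of bytes, and a dictionary stands in for the cell files on disk. All names are hypothetical.

```python
from collections import defaultdict


def dump(buffer, cell_files):
    """Split the buffer by cell and append each part to that cell's 'file'."""
    by_cell = defaultdict(list)
    for cell, obs in buffer:
        by_cell[cell].append(obs)
    for cell, observations in by_cell.items():
        # Existing cells are appended to, new cells are created.
        cell_files.setdefault(cell, []).extend(observations)


def stack_to_cells(swaths, dump_size, cell_files):
    """Read swaths one at a time, dumping whenever the buffer is full."""
    buffer = []
    for swath in swaths:
        buffer.extend(swath)          # concatenate onto what was read so far
        if len(buffer) >= dump_size:  # stand-in for the nbytes check
            dump(buffer, cell_files)
            buffer = []
    if buffer:
        # Assumption: whatever remains after the last swath is dumped too.
        dump(buffer, cell_files)
    return cell_files


swaths = [[(1, "a"), (2, "b")], [(1, "c")], [(2, "d"), (3, "e")]]
fresh = stack_to_cells(swaths, dump_size=3, cell_files={})
print(fresh)

# Pointing at a non-empty "directory" appends to what is already there.
existing = stack_to_cells(swaths, dump_size=3, cell_files={1: ["old"]})
print(existing)
```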
The output cells are in timeseries point array format. In order to convert them to contiguous ragged array format, we can use the ascat_convert_cell_format CLI. Pass it the path to your newly-stacked cell files, an output directory to write the converted cell files to, a product_id, and the argument contiguous (you could also use indexed here if you’d prefer that format).
ascat_convert_cell_format /path/to/my/new/cell/files/ /path/to/my/converted/cell/files/ H121 contiguous
Using Python
The CLI described above is just a wrapper for a Python function. If you need more control over the processing or want to include this as a step in a pipeline, you can make a SwathGridFiles object and call .stack_to_cell_files on it directly.
We pass it at least an output directory path (out_dir), where the outputs will be written, and we can also pass it several other options.
from datetime import datetime
from ascat.swath import SwathGridFiles
swath_source = "/data-write/RADAR/hsaf/h121_v2.0/netcdf"
swath_collection = SwathGridFiles.from_product_id(swath_source, "H121")
# where to save the files
cell_file_directory = ""
# the maximum size of the data buffer before dumping to file (actual maximum memory used will be higher)
max_nbytes = None
# the date range to use. This should be a tuple of datetime.datetime objects
date_range = (datetime(2019, 1, 1), datetime(2019, 12, 31))
# Pass a list of cell numbers (integers) here if you only want to stack data for a certain set of cells. This is mainly useful for testing purposes, since even splitting a day's worth of swath data into files for all of its constituent cells is a lengthy process.
cells = None
# mode: "w" to write new cell files, replacing any that already exist; "a" to append data to existing cell files
mode = "w"
# number of worker processes (assumed meaning; adjust to your machine)
processes = 1
# # uncomment to run
# swath_collection.stack_to_cell_files(
#     output_dir=cell_file_directory,
#     max_nbytes=max_nbytes,
#     date_range=date_range,
#     cells=cells,
#     mode=mode,
#     processes=processes,
# )
from ascat.cell import CellGridFiles
cell_collection = CellGridFiles.from_product_id(cell_file_directory, product_id="H121")
contiguous_cell_file_directory = "contiguous_directory_name"
# # uncomment to run
# cell_collection.convert_to_contiguous(contiguous_cell_file_directory)
Conversion to contiguous ragged array format will sort the sample dimension first by time and then by location_id. At this point it is no longer practically possible to append new data to the dataset without first re-converting it to indexed ragged array format and then converting back.
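To illustrate why appending becomes impractical, the sketch below converts a tiny indexed ragged array into contiguous form using plain Python lists. It follows the CF conventions’ general idea (an index variable per observation versus a row_size count per location) and is not the package’s conversion code; the function name and sample values are made up. For simplicity it orders locations by their position in the locations dimension rather than by location_id.

```python
def indexed_to_contiguous(location_index, time, values, n_locations):
    """Group observations by location, ordered by time within each location."""
    order = sorted(range(len(time)),
                   key=lambda i: (location_index[i], time[i]))
    # row_size[k] is the number of observations belonging to location k.
    row_size = [0] * n_locations
    for loc in location_index:
        row_size[loc] += 1
    return row_size, [time[i] for i in order], [values[i] for i in order]


# Indexed form: each observation records which location it belongs to.
location_index = [1, 0, 1, 2, 0]
time = [3, 5, 1, 2, 4]
values = [0.30, 0.50, 0.10, 0.20, 0.40]

row_size, c_time, c_values = indexed_to_contiguous(
    location_index, time, values, n_locations=3)
print(row_size, c_time, c_values)
```

In the contiguous result, a new observation for the first location would have to be inserted in the middle of the arrays and every later row shifted, whereas the indexed form can simply grow at the end.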
Adding a custom product class
To add your own product classes, you’ll need to clone this repository and install it in your environment as an editable package (e.g. pip install -e /home/username/Clones/ascat). Then you can edit .../ascat/src/ascat/product_info/product_info.py to add your own classes, following the examples of the existing ones. The easiest approach is to copy an existing class, e.g. AscatH129Swath, and edit its fields accordingly.
Once your product class is written, add it to the swath_io_catalog dictionary, along with a key to access it. Then you can use this key to specify your custom product when running the CLI.
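As a rough sketch of what such an addition might look like, the example below defines a custom swath product modelled on the H121 class printed earlier and registers it under a new key. So that the snippet runs on its own, AscatSwathProduct and swath_io_catalog are empty stand-ins here; in product_info.py you would use the real ones. The class name, filename pattern, folder layout, and key are all invented for illustration.

```python
from datetime import datetime


class AscatSwathProduct:
    """Stand-in for the package's base class, so this sketch is self-contained."""


class MyCustomSwath(AscatSwathProduct):
    # Hypothetical filename and folder layout; adapt to your own files.
    fn_pattern = "my_product_METOP{sat}_{date}.nc"
    sf_pattern = {"year_folder": "{year}"}
    date_field_fmt = "%Y%m%d%H%M%S"
    grid_name = "fibgrid_12.5"
    cell_fn_format = "{:04d}.nc"

    @staticmethod
    def fn_read_fmt(timestamp, sat="[ABC]"):
        return {"date": timestamp.strftime("%Y%m%d*"), "sat": sat.upper()}

    @staticmethod
    def sf_read_fmt(timestamp):
        return {"year_folder": {"year": f"{timestamp.year}"}}


# Stand-in for the catalog dictionary in product_info.py.
swath_io_catalog = {}
swath_io_catalog["MY_PRODUCT"] = MyCustomSwath

# Quick check that the pattern and the format function agree on field names.
product = swath_io_catalog["MY_PRODUCT"]
search = product.fn_pattern.format(**product.fn_read_fmt(datetime(2023, 1, 15), sat="b"))
print(search)
```

Formatting the pattern once, as in the last lines, is a cheap way to catch a mismatch between the placeholders in fn_pattern and the keys returned by fn_read_fmt before running a full stack.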