{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d35f4ffa-ec35-43b5-8eca-9847a408e7c4",
   "metadata": {},
   "source": [
    "Stacking gridded swath files into cells\n",
    "=======================================\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22fba5ee-236e-4392-9b50-890f6e2fdaf1",
   "metadata": {},
   "source": [
    "## Using the command-line interface\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8b5893e-00bc-4228-8663-c0164385274e",
   "metadata": {},
   "source": [
    "The simplest interface for stacking swath files into cells is the command-line interface included with this package.\n",
    "\n",
    "Make sure you&rsquo;ve created and activated an appropriate Python environment first.\n",
    "\n",
    "---\n",
    "\n",
    "**TL;DR**: in a shell run something like:\n",
    "\n",
    "    ascat_swaths_to_cells /path/to/my/swath/files/ /path/to/my/new/cell/files/ H121 contiguous --start_date 2023-01-01 --end_date 2024-01-01\n",
    "\n",
    "The output cells will be in point array format.\n",
    "\n",
    "You can then convert to indexed or contiguous ragged array format:\n",
    "\n",
    "    ascat_convert_cell_format /path/to/my/new/cell/files/ /path/to/my/converted/cell/files/ H121 contiguous\n",
    "\n",
    "---\n",
    "\n",
    "Have a look at the help to see required arguments\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9b292dd6-cca2-4c9d-af82-63f969de0c0a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usage: ascat_swaths_to_cells [-h] [--start_date START_DATE]\n",
      "                             [--end_date END_DATE] [--dump_size DUMP_SIZE]\n",
      "                             [--cells CELLS [CELLS ...]] [--quiet]\n",
      "                             FILEPATH OUTPATH PRODUCT_ID [fmt_kwargs ...]\n",
      "\n",
      "Stack ASCAT swath files to a cell grid\n",
      "\n",
      "positional arguments:\n",
      "  FILEPATH              Path to folder containing swath files\n",
      "  OUTPATH               Path to the output data\n",
      "  PRODUCT_ID            Product identifier\n",
      "  fmt_kwargs            Format keyword arguments, depends on the product\n",
      "                        format used. Example: 'sat=A year=2008'\n",
      "\n",
      "options:\n",
      "  -h, --help            show this help message and exit\n",
      "  --start_date START_DATE\n",
      "                        Start date in format YYYY-MM-DD. Must also provide end\n",
      "                        date if this is provided.\n",
      "  --end_date END_DATE   End date in format YYYY-MM-DD. Must also provide start\n",
      "                        date if this is provided.\n",
      "  --dump_size DUMP_SIZE\n",
      "                        Size at which to dump the data to disk before reading\n",
      "                        more (default: 1GB)\n",
      "  --cells CELLS [CELLS ...]\n",
      "                        Numbers of the cells to process (default: None)\n",
      "  --quiet               Do not print progress information\n"
     ]
    }
   ],
   "source": [
    "! ascat_swaths_to_cells  --help"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99e3397b-debf-48c6-929e-d77b702b50fe",
   "metadata": {},
   "source": [
    "So we need to pass at least three positional arguments to the stacker:\n",
    "\n",
    "1.  `FILEPATH` - This is a path to the parent directory of the product&rsquo;s swath files.\n",
    "2.  `OUTPATH` - A path to the directory you&rsquo;d like to send output to.\n",
    "3.  `PRODUCT_ID` - The name of the product you&rsquo;re processing. The program chooses a product reader based on this string, which makes certain assumptions about filename and directory structure. Several products are included in the ASCAT package, and to use them your directory structure must adhere to what they assume. Otherwise you may also create your own product readers. (TODO make a link here)\n",
    "\n",
    "After the positional arguments we can also pass as many keyword arguments as we want:\n",
    "\n",
    "-   `fmt_kwargs` - These are keyword arguments that will be passed on to the product reader&rsquo;s `fn_read_fmt` and `sf_read_fmt` (functions that tell the package how to find your files).\n",
    "\n",
    "We can also pass some options:\n",
    "\n",
    "-   `--start_date` and `--end_date` - in YYYY-MM-DD format. Sets the time range of swath files to stack.\n",
    "\n",
    "-   `--dump_size` - the size of the buffer to fill with read data before dumping to cell files and reading more data. Make this too big and merging/processing will take a while after reading. Too small and repeated writes will be a bottleneck. Even if you have a lot of memory something like `8GB` is a good value.\n",
    "\n",
    "To check which `product_id` -s are available to use, use `ascat_product_info`\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "1e33076c-fb7c-40d1-aef9-3eb01130fcd4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Available Swath Products:\n",
      "H129\n",
      "H121\n",
      "H122\n",
      "SIG0_6.25\n",
      "SIG0_12.5\n",
      "\n",
      "Available Cell Products:\n",
      "H129\n",
      "H121\n",
      "H122\n",
      "SIG0_6.25\n",
      "SIG0_12.5\n",
      "ERSH\n",
      "ERSN\n"
     ]
    }
   ],
   "source": [
    "! ascat_product_info"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "149b47bf-4b9d-4ab6-b499-d19cc3eaddd1",
   "metadata": {},
   "source": [
    "To see how a product&rsquo;s readers have been defined, pass its `product_id`:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "0fb9b6cb-4255-4d8e-8b88-68e603ab1f26",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Swath Product Information:\n",
      "class AscatH121Swath(AscatSwathProduct):\n",
      "    fn_pattern = \"W_IT-HSAF-ROME,SAT,SSM-ASCAT-METOP{sat}-12.5km-H121_C_LIIB_{placeholder}_{placeholder1}_{date}____.nc\"\n",
      "    sf_pattern = {\"satellite_folder\": \"metop_[abc]\", \"year_folder\": \"{year}\", \"month_folder\": \"{month}\"}\n",
      "    date_field_fmt = \"%Y%m%d%H%M%S\"\n",
      "    grid_name = \"fibgrid_12.5\"\n",
      "    cell_fn_format = \"{:04d}.nc\"\n",
      "\n",
      "    @staticmethod\n",
      "    def fn_read_fmt(timestamp, sat=\"[ABC]\"):\n",
      "        sat = sat.upper()\n",
      "        return {\n",
      "            \"date\": timestamp.strftime(\"%Y%m%d*\"),\n",
      "            \"sat\": sat,\n",
      "            \"placeholder\": \"*\",\n",
      "            \"placeholder1\": \"*\"\n",
      "        }\n",
      "\n",
      "    @staticmethod\n",
      "    def sf_read_fmt(timestamp, sat=\"[abc]\"):\n",
      "        sat = sat.lower()\n",
      "        return {\n",
      "            \"satellite_folder\": {\n",
      "                \"satellite\": f\"metop_{sat}\"\n",
      "            },\n",
      "            \"year_folder\": {\n",
      "                \"year\": f\"{timestamp.year}\"\n",
      "            },\n",
      "            \"month_folder\": {\n",
      "                \"month\": f\"{timestamp.month}\".zfill(2)\n",
      "            },\n",
      "        }\n",
      "\n",
      "class AscatSwathProduct(SwathProduct):\n",
      "    grid_name = None\n",
      "\n",
      "    @classmethod\n",
      "    def preprocess_(cls, ds):\n",
      "        ds[\"location_id\"].attrs[\"cf_role\"] = \"timeseries_id\"\n",
      "        ds.attrs[\"global_attributes_flag\"] = 1\n",
      "        ds.attrs[\"featureType\"] = \"point\"\n",
      "        ds.attrs[\"grid_mapping_name\"] = cls.grid_name\n",
      "        if \"spacecraft\" in ds.attrs:\n",
      "            # Assumption: the spacecraft attribute is something like \"metop-a\"\n",
      "            sat_id = {\"a\": 3, \"b\": 4, \"c\": 5}\n",
      "            sat = ds.attrs[\"spacecraft\"][-1].lower()\n",
      "            ds[\"sat_id\"] = (\"obs\",\n",
      "                            np.repeat(sat_id[sat], ds[\"location_id\"].size))\n",
      "            del ds.attrs[\"spacecraft\"]\n",
      "        return ds\n",
      "\n",
      "    @staticmethod\n",
      "    def postprocess_(ds):\n",
      "        for key, item in {\"latitude\": \"lat\", \"longitude\": \"lon\", \"altitude\": \"alt\"}.items():\n",
      "            if key in ds:\n",
      "                ds = ds.rename({key: item})\n",
      "        if \"altitude\" not in ds:\n",
      "            ds[\"alt\"] = (\"locations\", np.full_like(ds[\"lat\"], fill_value=np.nan))\n",
      "        return ds\n",
      "\n",
      "class SwathProduct:\n",
      "    from ascat.swath import Swath\n",
      "    file_class = Swath\n",
      "\n",
      "Cell Product Information:\n",
      "class AscatH121Cell(RaggedArrayCellProduct):\n",
      "    grid_name = \"fibgrid_12.5\"\n",
      "\n",
      "class RaggedArrayCellProduct(BaseCellProduct):\n",
      "    file_class = RaggedArrayTs\n",
      "    sample_dim = \"obs\"\n",
      "    instance_dim = \"locations\"\n",
      "\n",
      "    @classmethod\n",
      "    def preprocessor(cls, ds):\n",
      "        if \"row_size\" in ds.variables:\n",
      "            ds[\"row_size\"].attrs[\"sample_dimension\"] = cls.sample_dim\n",
      "        if \"locationIndex\" in ds.variables:\n",
      "            ds[\"locationIndex\"].attrs[\"instance_dimension\"] = cls.instance_dim\n",
      "        if \"location_id\" in ds.variables:\n",
      "            ds[\"location_id\"].attrs[\"cf_role\"] = \"timeseries_id\"\n",
      "        if ds.attrs.get(\"featureType\") is None:\n",
      "            ds = ds.assign_attrs({\"featureType\": \"timeSeries\"})\n",
      "        if ds.attrs.get(\"grid_mapping_name\") is None:\n",
      "            ds.attrs[\"grid_mapping_name\"] = cls.grid_name\n",
      "        return ds\n",
      "\n",
      "class BaseCellProduct:\n",
      "    fn_format = \"{:04d}.nc\"\n",
      "\n",
      "    @classmethod\n",
      "    def preprocessor(cls, ds):\n",
      "        return ds\n",
      "\n"
     ]
    }
   ],
   "source": [
    "! ascat_product_info h121"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9370dfc-871d-4d1f-917a-c333a546ec92",
   "metadata": {},
   "source": [
    "Once you have the right product id chosen or set up, pass your swath file root, output directory, and product id to `ascat_swaths_to_cells`, along with any other arguments.\n",
    "\n",
    "    ascat_swaths_to_cells /path/to/my/swath/files/ /path/to/my/new/cell/files/ H121 --start_date 2023-01-01 --end_date 2023-12-31\n",
    "\n",
    "`ascat_swaths_to_cells` works by iterating through the source swath files one at a time, opening them as xarray datasets, performing any necessary preprocessing, and concatenating each new dataset to all of the previous ones. Once that dataset&rsquo;s `nbytes` attribute reaches `dump_size`, reading is paused while the combined dataset is dumped out into one file in *timeseries point array* format for each of its constituent cells. Once the cells are written, the process starts again.\n",
    "\n",
    "On all dumps, data for any cells that already have a file is appended to those files. This is useful if you want to add new data to an existing stack, but if you want to make a fresh export, it&rsquo;s important to make sure the CLI is pointed to an empty directory.\n",
    "\n",
    "The output cells are in *timeseries point array* format. In order to convert them to *contiguous* ragged array format, we can use the `ascat_convert_cell_format` CLI. Pass it the path to your newly-stacked cell files, an output directory to write the converted cell files to, a `product_id`, and the argument `contiguous` (you could also use `indexed` here if you&rsquo;d prefer that format).\n",
    "\n",
    "    ascat_convert_cell_format /path/to/my/new/cell/files/ /path/to/my/converted/cell/files/ H121 contiguous\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c595d1c-cfcd-493f-b719-6a343f1f1365",
   "metadata": {},
   "source": [
    "## Using Python\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e24ff404-405f-48e6-b5d1-80022b73e080",
   "metadata": {},
   "source": [
    "The CLI described above is just a wrapper for a python function. If you need more control over the processing or want to include this as a step in a pipeline, you can make a `SwathGridFiles` object and call `.stack_to_cell_files` on it directly.\n",
    "\n",
    "We pass it at least an output directory path (`out_dir`), where the outputs will be written, and we can also pass it several other options.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "0c6ece12-db37-4927-a762-c99e7e8a8ccd",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datetime import datetime\n",
    "from ascat.swath import SwathGridFiles\n",
    "\n",
    "swath_source = \"/data-write/RADAR/hsaf/h121_v2.0/netcdf\"\n",
    "swath_collection = SwathGridFiles.from_product_id(swath_source, \"H121\")\n",
    "\n",
    "# where to save the files\n",
    "cell_file_directory = \"\"\n",
    "\n",
    "\n",
    "# the maximum size of the data buffer before dumping to file (actual maximum memory used will be higher)\n",
    "max_nbytes = None\n",
    "\n",
    "# the date range to use. This should be a tuple of datetime.datetime objects\n",
    "date_range = (datetime(2019, 1, 1), datetime(2019, 12, 31))\n",
    "\n",
    "# Pass a list of cell numbers (integers) here if you only want to stack data for a certain set of cells. This is mainly useful for testing purposes, since even splitting a day's worth of swath data into files for all of its constituent cells is a lengthy process.\n",
    "cells=None\n",
    "\n",
    "# mode : \"w\" for creating new files if any already exist, \"a\" to append data to existing cell files\n",
    "mode = \"w\"\n",
    "\n",
    "# # uncomment to run\n",
    "# swath_collection.stack_to_cell_files(\n",
    "#     output_dir=cell_file_directory,\n",
    "#     max_nbytes=max_nbytes,\n",
    "#     date_range=date_range,\n",
    "#     mode=mode,\n",
    "#     processes=processes,\n",
    "# )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "04329180-e9ac-4e0f-a07c-c49dde7292e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from ascat.cell import CellGridFiles\n",
    "\n",
    "cell_collection = CellGridFiles.from_product_id(cell_file_directory, product_id=\"H121\")\n",
    "contiguous_cell_file_directory = \"contiguous_directory_name\"\n",
    "# # uncomment to run\n",
    "# cell_collection.convert_to_contiguous(contiguous_cell_file_directory)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2566ccc7-8eb2-4ab5-9152-d1ae3648a82e",
   "metadata": {},
   "source": [
    "Conversion to contiguous ragged array format will sort the sample dimension first by time and then by `location_id`. At this point it is no longer practically possible to append new data to the dataset without first re-converting it to indexed ragged array format and then converting back.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe1f85fd-ebd3-4fcf-83fb-2780868011dc",
   "metadata": {},
   "source": [
    "## Adding a custom product class\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92f27fd1-1946-45ee-9ae5-8c194caa00f8",
   "metadata": {},
   "source": [
    "To add your own product classes you&rsquo;ll need to clone this repository and install it in your environment as an editable package (e.g. `pip install -e /home/username/Clones/ascat`). Then you can edit `.../ascat/src/ascat/product_info/product_info.py` to add your own classes following the examples of the existing ones. Best to copy-paste, e.g. `AscatH129Swath` and edit the fields accordingly.\n",
    "\n",
    "Once your product class is written, add it to the `swath_io_catalog` dictionary, along with a key to access it. Then you can use this key to specify your custom product when running the CLI.\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  },
  "org": null
 },
 "nbformat": 4,
 "nbformat_minor": 5
}