Pre-ARD to ARD Conversion¶
Overview¶
The pre-ARD we generate and download is close, but not quite ready for use in applications like time series analysis. Some complications include
The pre-ARD images have no metadata, including no acquisition timestamps or band names
The pre-ARD images may be split up into many pieces to stay below 4GB in file size. These pieces will need to be put back together to create the ARD
The format options used by the Earth Engine create a GeoTIFF that is compressed and interleaved by pixel, which is not very efficient for reading as images or time series.
In the conversion step of the CEDAR workflow, the pre-ARD images are concatenated into one piece (if needed), transformed along its band dimension to unmux the band and time dimensions, assigned metadata that had been stored in the image metadata file, and exported into a NetCDF file suitable for further processing.
Basic Usage¶
Before we consider converting our pre-ARD data, it is worth taking a look back
at your CEDAR configuration file. Specifically, this step of the workflow
uses information stored in the ard
section of the file
(see this section of the User Guide).
This section contains information about where the ARD should be stored
(destination
) and how the NetCDF4 file should be encoded. Specifying the
destination directory as a template string in the configuration file is useful
to deterministically organize converted ARD by the image collection, tile,
time periods, or other attribute information. The destination directory can be
overriden through the cedar convert
command, however.
We can either run cedar convert
by pointing to a specific image metadata
JSON file, or by pointing to a directory containing such files. This second
usage is included so you can point to the download directory (named after the
tracking name) and convert all images within it.
For the example of an order containing the following files:
$ ls -l TRACKING_2019-07-18T16:45:25.528253_h063v052/
LANDSAT_LC08_C01_T1_SR_h062v053_2012-01-01_2017-01-01.json
LANDSAT_LC08_C01_T1_SR_h062v053_2012-01-01_2017-01-01-0000000000-0000001024.tif
LANDSAT_LC08_C01_T1_SR_h062v053_2012-01-01_2017-01-01-0000001024-0000002048.tif
LANDSAT_LC08_C01_T1_SR_h062v053_2012-01-01_2017-01-01-0000001024-0000003072.tif
LANDSAT_LC08_C01_T1_SR_h062v053_2012-01-01_2017-01-01-0000001024-0000003072.tif
LANDSAT_LC08_C01_T1_SR_h062v053_2012-01-01_2017-01-01-0000002048-0000000000.tif
LANDSAT_LC08_C01_T1_SR_h062v053_2012-01-01_2017-01-01-0000002048-0000003072.tif
...
LANDSAT_LC08_C01_T1_SR_h062v053_2017-01-01_2022-01-01.json
LANDSAT_LC08_C01_T1_SR_h062v053_2017-01-01_2022-01-01-0000000000-0000001024.tif
...
LANDSAT_LE07_C01_T1_SR_h062v053_1997-01-01_2002-01-01.json
...
The usage,
$ cedar convert TRACKING_2019-07-18T16:45:25.528253_h063v052/
Would convert 10 pairs of pre-ARD metadata and image (pieces) into 10 NetCDF4 files.
By comparison, the usage,
$ cedar convert TRACKING_2019-07-18T16:45:25.528253_h063v052/LANDSAT_LC08_C01_T1_SR_h062v053_2012-01-01_2017-01-01.json
Would convert a single pre-ARD image and metadata pair to a single NetCDF4 file.
Advanced Usage¶
The cedar convert
program has the ability to run in parallel by taking
advantage of the Dask and (optionally) Distributed libraries and their
integrations with the XArray library. This parallel processing option is
designed to be almost invisible to the user, who only has to specify the type
of Dask scheduler and the number of workers they want to use as a command line
option.
For example, to choose the multiprocessing scheduler with 2 processes,
you can run cedar convert
as:
$ cedar convert \
--executor processes 2 \
TRACKING_2019-07-18T16:45:25.528253_h063v052
The distributed scheduler is often more desireable because of the debugging information it can provide through its Bokeh status page. To use the Distributed scheduler locally,
$ cedar convert \
--executor distributed 2 \
TRACKING_2019-07-18T16:45:25.528253_h063v052
This usage will start a distributed.LocalCluster
with 2 workers, as
described in the documentation for using Distributed on a single machine.
You may also wish to connect to an existing Dask Distributed cluster, which you can do by specifying the scheduler IP address and port instead of the number of workers:
$ cedar convert \
--executor distributed HOSTNAME:PORT \
TRACKING_2019-07-18T16:45:25.528253_h063v052
For more information about the choice Dask schedulers, please visit their Scheduler Overview documentation.
Note
The computation involved in converting pre-ARD to ARD is primarily either decompressing the pre-ARD or compressing and writing the ARD to the NetCDF. Because of the way we are writing the NetCDF files, we are not able to do concurrent writes without passing a file lock. As such, the conversion process does not currently benefit from parallel processing as much as it could if we could do parallel writes like other formats support (e.g., Zarr). You are strongly encouraged to analyze the performance benefits of parallel processing before running the program in parallel over your all of your data.
Tips¶
When specifying the ARD destination directory (either in the configuration file or by overriding using
cedar convert --dest DEST
), you may wish to use an environment variable for the root destination directory. This variable will be expanded before determining the destination path. Using environment variables is common for batch processing because it allows you to manipulate variables or other data without modifying the config file.