Clusters of atmospheric and oceanic variables and teleconnections that are candidate drivers for Tropical Cyclogenesis

This project provides the dataset employed for the development of a machine learning framework designed to detect and interpret Tropical Cyclone Genesis (TCG) activity across six major tropical ocean basins: North Atlantic, Northeast Pacific, Northwest Pacific, North Indian, South Indian, and South Pacific. The dataset includes pre-processed environmental and climatic variables relevant to TCG dynamics, aggregated at the basin level with monthly resolution from January 1980 to December 2022. All data are derived from the ERA5 reanalysis dataset, with a spatial resolution of 2.5° × 2.5°. ERA5 reanalysis data were accessed through the DKRZ data pool, made available by DKRZ Data Management. The atmospheric and oceanic variables provided are absolute vorticity at 850 hPa, maximum potential intensity (MPI), mean sea level pressure (MSLP), relative humidity at 700 hPa, sea surface temperature (SST), relative vorticity at 850 hPa, vertical wind shear between 850 and 200 hPa, and vertical velocity at 500 hPa. Several of these variables are derived from ERA5 primary variables and represent physically meaningful diagnostics used widely in tropical cyclone research. To reduce spatial dimensionality, each variable has been clustered within each basin using the K-means algorithm, and the area-weighted mean value of each cluster is reported as a time series. Additionally, the dataset includes monthly values of a suite of large-scale climate indices known to influence tropical cyclone activity: Atlantic Meridional Mode (AMM), Niño3.4, North Atlantic Oscillation (NAO), Pacific Decadal Oscillation (PDO), Pacific-North American Pattern (PNA), Southern Oscillation Index (SOI), Tropical Northern Atlantic Index (TNA), Tropical Southern Atlantic Index (TSA), and the Western Pacific Index (WP). Lastly, for each basin, the dataset contains monthly counts of tropical cyclogenesis events, enabling evaluation of predictive models and interpretability methods. 
This dataset is intended to support research in seasonal TCG detection, and it enables reproducibility of the methods developed in the associated study.
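
The reduction from clustered grid cells to one area-weighted value per cluster can be sketched as follows. This is a minimal illustration, not the processing code used for the dataset: it assumes cluster labels are already available (for example from K-means), that area weights on the regular latitude-longitude grid are proportional to cos(latitude), and the function name and sample values are hypothetical.

```python
import math
from collections import defaultdict


def cluster_area_weighted_means(lats_deg, labels, values):
    """Return {cluster_label: area-weighted mean} for one time step.

    Assumption: on a regular lat-lon grid, the area of a cell is
    proportional to cos(latitude), so that is used as the weight.
    """
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for lat, label, value in zip(lats_deg, labels, values):
        w = math.cos(math.radians(lat))
        weighted_sum[label] += w * value
        weight_total[label] += w
    return {label: weighted_sum[label] / weight_total[label]
            for label in weighted_sum}


# Hypothetical example: four grid cells, two clusters, one monthly field.
lats = [0.0, 60.0, 10.0, 12.5]
labels = [0, 0, 1, 1]
sst = [30.0, 24.0, 28.0, 29.0]
means = cluster_area_weighted_means(lats, labels, sst)
```

Applying the function to every month yields one time series per cluster, which is the form in which the variables are reported.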

Temperature Humidity Index GDDP-NEX-CMIP6 ML projections

This experiment aimed to enhance the temporal resolution of climate projections for agricultural applications by using machine learning to downscale daily NEX-GDDP-CMIP6 climate data (https://doi.org/10.7917/OFSG3345) to hourly Temperature Humidity Index (THI) values. The THI is a critical metric for assessing heat stress in dairy cattle, which is a significant concern under changing climatic conditions. We used the Extreme Gradient Boosting algorithm (XGBoost; Chen et al. 2016), chosen for its efficiency and capability to handle large datasets, to train models on historical hourly data from the ERA5 reanalysis dataset (Hersbach et al. 2020). The trained models were then applied to generate hourly THI projections from 2020 to 2100 across 12 climate models under two Shared Socioeconomic Pathways (SSP2-4.5 and SSP5-8.5). The projections cover land areas only, on a 0.25-degree grid, in keeping with their agricultural purpose. The result is a high-resolution dataset of projected heat stress on dairy cattle that supports planning and mitigation strategies in the agricultural sector.
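
The description does not state which THI formulation was used. Purely as an illustration of how a THI value follows from temperature and humidity, the sketch below uses one formulation that is common in livestock studies, THI = (1.8 T + 32) - (0.55 - 0.0055 RH)(1.8 T - 26), with T in °C and RH in percent. The choice of formula and the function name are assumptions and may differ from the dataset's definition.

```python
def thi_from_temp_rh(temp_c, rel_hum_pct):
    """Temperature Humidity Index from air temperature and relative humidity.

    Assumed formulation (not confirmed by the dataset description):
        THI = (1.8*T + 32) - (0.55 - 0.0055*RH) * (1.8*T - 26)
    with T in degrees Celsius and RH in percent.
    """
    temp_f = 1.8 * temp_c + 32.0
    return temp_f - (0.55 - 0.0055 * rel_hum_pct) * (1.8 * temp_c - 26.0)


# Hypothetical hourly values
thi_humid = thi_from_temp_rh(30.0, 100.0)  # saturated air
thi_dry = thi_from_temp_rh(30.0, 50.0)     # drier air, same temperature
```

With this formulation, THI equals the Fahrenheit temperature at 100% relative humidity and decreases as the air gets drier.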

HadEX-CAM dataset: original and deep learning infilled TX90p, TN90p, TX10p, TN10p ETCCDI Indices (CLINT H2020)

The HadEX-CAM dataset contains four land-based extreme indices (TX90p, TN90p, TX10p, TN10p) for the European region. The original dataset (containing missing values) was created by the Met Office by aggregating station data using the Climate Anomaly Method (CAM). The infilled version of this dataset was created by DKRZ by applying a deep learning (DL) model based on a U-Net architecture and trained on CMIP6 data (see https://www.nature.com/articles/s41467-024-53464-2). The original HadEX-CAM dataset is distributed under the Open Government Licence: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/. The DL-infilled HadEX-CAM dataset is distributed under the Creative Commons Attribution 4.0 International license.

Climate data for adaptation and vulnerability assessments — southwest

The ClimAVA-SW dataset (Climate data for adaptation and vulnerability assessments, southwest) provides bias-corrected, downscaled daily climatic data at ~4 km spatial resolution from 17 CMIP6 GCMs for three climatic variables (pr, tasmax, and tasmin) and three shared socioeconomic pathways (SSP245, SSP370, and SSP585). Historical runs span from January 1, 1981, to December 31, 2014. Future scenarios span from January 1, 2015, to December 31, 2100. The ClimAVA-SW dataset encompasses the geopolitical boundaries of the six states in the southwestern United States: California, Nevada, Arizona, New Mexico, Utah, and Colorado, as well as watersheds that run into these states. ClimAVA uses the Spatial Pattern Interactions Downscaling (SPID) method, which downscales with machine learning models. These models capture the relationship between spatial patterns at General Circulation Model (GCM) resolution and fine-resolution pixel values derived from the reference data (PRISM 4K). A random forest model is trained for each pixel, using the finer reference data as the predictand and nine pixels from the spatially resampled (coarser) version of the reference data as predictors. These models are then used to downscale the bias-corrected GCM data. Results from this method have been shown to maintain climate realism and to represent extreme events well.
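
The predictor side of the per-pixel models can be sketched as below. This is an illustration under stated assumptions, not the SPID implementation: it assumes the nine predictors are the 3 x 3 block of coarse cells centred on the coarse cell that contains the target pixel, and it clamps indices at the grid edge, which the description does not specify. The function name and sample grid are hypothetical. Fitting the random forest itself (one model per fine pixel, with these vectors as inputs) is omitted.

```python
def nine_pixel_predictors(coarse, row, col):
    """Return the 3x3 neighbourhood of coarse[row][col] as a flat list.

    Assumption: indices outside the grid are clamped to the nearest
    valid row/column, so edge cells repeat their border values.
    """
    n_rows, n_cols = len(coarse), len(coarse[0])
    predictors = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r = min(max(row + dr, 0), n_rows - 1)
            c = min(max(col + dc, 0), n_cols - 1)
            predictors.append(coarse[r][c])
    return predictors


# Hypothetical coarse-resolution field for one day
coarse_day = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]
center = nine_pixel_predictors(coarse_day, 1, 1)
corner = nine_pixel_predictors(coarse_day, 0, 0)
```

Stacking one such vector per day gives the feature matrix for a pixel's model, with the fine-resolution reference value on the same day as the target.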

Climate data for adaptation and vulnerability assessments — northwest

The ClimAVA-NW dataset (Climate data for adaptation and vulnerability assessments, northwest) provides bias-corrected, downscaled daily climatic data at ~4 km spatial resolution from 17 CMIP6 GCMs for three climatic variables (pr, tasmax, and tasmin) and three shared socioeconomic pathways (SSP245, SSP370, and SSP585). Historical runs span from January 1, 1981, to December 31, 2014. Future scenarios span from January 1, 2015, to December 31, 2100. The ClimAVA-NW dataset encompasses the geopolitical boundaries of the five states in the northwestern United States: Idaho, Oregon, Wyoming, Montana, and Washington. ClimAVA uses the Spatial Pattern Interactions Downscaling (SPID) method, which downscales with machine learning models. These models capture the relationship between spatial patterns at General Circulation Model (GCM) resolution and fine-resolution pixel values derived from the reference data (PRISM 4K). A random forest model is trained for each pixel, using the finer reference data as the predictand and nine pixels from the spatially resampled (coarser) version of the reference data as predictors. These models are then used to downscale the bias-corrected GCM data. Results from this method have been shown to maintain climate realism and to represent extreme events well.

Spatialization of near-surface air temperature and updating based on thermal infrared remote sensing information in the Qinghai-Tibet Plateau

The Qinghai-Tibet Plateau, known for its high altitude, cold climate, and fragile ecosystem, presents unique challenges and opportunities for the implementation of an intelligent sponge urban system. The heat island effect, in which urban areas experience higher temperatures than surrounding rural areas, can be particularly problematic in such a sensitive environment. Predicting and mitigating heat island intensity is crucial for improving urban livability and environmental sustainability. The aim is to develop a procedure for predicting heat island intensity in an intelligent sponge urban system that delivers accurate, real-time predictions through the following steps. First, collect parameter information of the underlying surface using meteorological observation data from the sponge city, field observation data, and investigation data of the sponge city, gathering comprehensive data on the physical and environmental characteristics of the urban surface. Second, establish a set of digital labels with feature data derived from the collected information. Third, add the labeled data to the training sample set for the prediction model of sponge city surface heat island intensity; this set is a crucial input for establishing a real-time prediction model. Finally, train the prediction model function for sponge city surface heat island intensity using the data. This experiment contains two datasets: corrected surface air temperature data and training data. The corrected surface air temperature data have been processed using meteorological observation data and thermal infrared remote sensing data. The data cover a high-altitude area of the Qinghai-Tibet Plateau, with a spatial resolution of 30 meters. The original temperature data were obtained from multiple sources, including thermal infrared remote sensing data from Landsat 8 (L8) and Landsat 9 (L9) Collection 2 (C2) Level 2 (L2) products, as well as ground station measurements from the National Tibetan Plateau/Third Pole Environment Data Center.
Supervised-learning regression algorithms were trained to correct biases and inaccuracies when updating the spatialized near-surface air temperature data. This dataset is suitable for climate research, environmental monitoring, and other applications requiring relatively accurate surface air temperature data.

Climate data for adaptation and vulnerability assessments (SWE) – west

The ClimAVA_SWE data set (where ClimAVA stands for Climate Data for Adaptation and Vulnerability Assessments) provides high-resolution (4 km) future climate projections derived from 13 CMIP6 General Circulation Models (GCMs). It focuses on Snow Water Equivalent (SWE), a crucial indicator of water availability, hydrologic extremes, and climate-related vulnerability, and includes projections for three Shared Socioeconomic Pathways (SSP245, SSP370, and SSP585) at a daily temporal scale. The initial release of ClimAVA_SWE covers the entire western United States. ClimAVA_SWE is produced using the newly developed Spatial Pattern Interactions Downscaling (SPID) method, which downscales with machine learning models. SPID captures the relationship between large-scale spatial patterns at GCM resolution and fine-scale pixel values. For each pixel, two Random Forest models (one for the accumulation period and one for the ablation period) were trained using fine-resolution reference data as the predictand and nine neighboring pixels from a spatially resampled (coarser) version of the reference data as predictors. These trained models are then applied to bias-corrected GCM data to generate the downscaled projections. The resulting dataset maintains strong climate realism and effectively represents extreme events.

Near-surface air temperature dataset for the Qinghai-Tibet Plateau (2019) derived from thermal infrared remote sensing and elevation-constrained modeling

This dataset provides high-resolution (30 m) spatialized near-surface air temperature products for the Qinghai-Tibet Plateau, updated using thermal infrared remote sensing data from Landsat 8 (L8) and Landsat 9 (L9) Collection 2 (C2) Level 2 (L2) products, combined with elevation-corrected regression modeling. The dataset includes corrected temperature files (adjusted via machine learning-based elevation corrections) for model development. The elevation corrections were performed using Topographic Data of Qinghai-Tibet Plateau (2021), integrated via Gaussian filtering to enhance spatial consistency in high-elevation regions. Supervised learning regression models (Random Forest Regression, Multilayer Perceptron regression, or Decision Tree regression) were applied to minimize biases in temperatures derived from thermal infrared radiation and to optimize high-altitude temperature estimation. The near-surface temperature lapse rate (LR) is a critical parameter in glaciological and hydrological models, but existing approaches often rely on empirical estimations with limited spatial representativeness. To mitigate these limitations, an optimized temperature spatialization method is proposed, fusing Local Representatives (LRs) across glacierized regions through Inverse Distance Weighting (IDW). This approach accounts for elevation-dependent microclimates while maintaining regional consistency. This dataset is suitable for climate research and environmental modeling that require high-resolution near-surface air temperature data.
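
The IDW step can be illustrated with a short sketch. It assumes that the "LRs" being fused are local lapse-rate estimates (in °C per km, negative when temperature falls with height), that the IDW power is 2, and that the fused lapse rate is then used to shift a station temperature to a target elevation. None of these choices is confirmed by the description, and all names and values are hypothetical.

```python
def idw(values, distances_km, power=2.0):
    """Inverse Distance Weighting of values observed at given distances.

    Assumption: weights are 1 / distance**power with power = 2.
    A zero distance returns the co-located value directly.
    """
    for value, dist in zip(values, distances_km):
        if dist == 0.0:
            return value
    weights = [1.0 / dist ** power for dist in distances_km]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def extrapolate_temperature(t_station_c, z_station_m, z_target_m,
                            lapse_rate_c_per_km):
    """Shift a station temperature to a target elevation with a lapse rate."""
    return t_station_c + lapse_rate_c_per_km * (z_target_m - z_station_m) / 1000.0


# Hypothetical local lapse rates (degC per km) and distances to the target
local_lapse_rates = [-6.0, -9.0]
distances = [1.0, 2.0]
fused_lr = idw(local_lapse_rates, distances)
t_target = extrapolate_temperature(-5.0, 5000.0, 5500.0, fused_lr)
```

In this example the nearer estimate dominates, so the fused lapse rate lies closer to -6.0 than to -9.0.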

Spatial distribution of air temperature in high-elevation glacierized regions: from observations in four catchments on the Tibetan Plateau

This experiment contains 30-meter resolution near-surface air temperature datasets for four glacier regions (Guliya, Aru, Naimona’nyi, Dunde) on the Qinghai-Tibet Plateau, derived by integrating in situ measurements (automatic weather stations and loggers; January-November 2019) with Landsat 8/9 thermal infrared data. The datasets, spanning elevations of 4,947–6,078 m and temperatures from −42°C to +16°C, address data gaps through masked nearest-neighbor interpolation and Kriging (for clustered outliers ≥5), with spatial smoothing to minimize observational noise. Validation against ground measurements uses Root Mean Square Error (RMSE) and standard deviation (SD) metrics, both in °C, visualized via Temperature_Error graphs. The glaciers, which represent diverse climatic zones, include Guliya (ice cap, westerlies-dominated), Aru (valley glacier), Naimona’nyi (Himalayan slopes), and Dunde (Qilian Mountains, transitional climate). The datasets are supported by elevation data from the Topographic Data of Qinghai-Tibet Plateau (2021) and temperature data from the GATP Dataset (doi:10.26050/WDCC/GATP). These spatially interpolated temperature distributions serve as a reference for assessing cryospheric and climate models in high-altitude regions.
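
The two validation metrics named above can be computed as in the following sketch. It assumes that SD refers to the standard deviation of the errors (interpolated minus observed temperature) and uses the population form. The description does not spell out either choice, and the sample values are hypothetical.

```python
import math
import statistics


def validation_metrics(predicted_c, observed_c):
    """Return (RMSE, SD of errors) in degC for paired temperature values.

    Assumption: SD is the population standard deviation of the errors,
    so it measures scatter around the mean error, not around zero.
    """
    errors = [p - o for p, o in zip(predicted_c, observed_c)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    sd = statistics.pstdev(errors)
    return rmse, sd


# Hypothetical case: interpolated values are uniformly 2 degC too warm
rmse, sd = validation_metrics([-10.0, -20.0], [-12.0, -22.0])
```

A constant bias shows up in the RMSE but not in the SD, which is one reason to report both.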

IceCloudNet: 3D reconstruction of cloud ice from Meteosat SEVIRI - data

IceCloudNet is a novel machine-learning method that produces high-quality, vertically resolved predictions of ice water content and ice crystal number concentration for clouds containing ice. The predictions have the spatio-temporal coverage and resolution of Meteosat SEVIRI and the vertical resolution of DARDAR. IceCloudNet consists of a ConvNeXt-based U-Net and a 3D PatchGAN discriminator model and is trained by predicting DARDAR profiles from co-located SEVIRI images. Despite the sparse availability of DARDAR data due to its narrow overpass, IceCloudNet is able to predict cloud occurrence, macrophysical shape, and microphysical properties with high precision. We release 5 years of vertically resolved ice water content (IWC) and ice crystal number concentration (N_ice) of clouds containing ice at a resolution of 3 km × 3 km × 240 m × 15 min on a spatial domain of 30°W to 30°E and 30°S to 30°N. The resulting data set increases the availability of vertical cloud profiles for the period when DARDAR is available by more than six orders of magnitude and, moreover, provides vertical cloud profiles beyond the lifetime of the recently ended satellite missions underlying DARDAR.