ghg_forcing_for_cmip.preprocessing#
Data preprocessing
Prepare data for statistical analysis to create a GHG forcing dataset.
Classes:
| Name | Description |
|---|---|
| Condition | Enum for different data conditions |
Functions:
| Name | Description |
|---|---|
| add_hemisphere | Add a grouping variable "hemisphere" |
| add_missing_lat_lon_combinations | Add all missing latitude-longitude combinations to the data grid |
| combine_datasets | Combine ground-based and satellite datasets |
| concat_datasets | Concatenate multiple dataframes |
| create_future_dataset | Create a dataframe with future data |
| do_feature_engineering | Feature engineering for statistical analysis |
| prepare_dataset | Prepare datasets for statistical analysis |
| preprocess_prediction_dataset | Preprocess dataset for prediction task |
| select_sampling_sites | Select sampling sites incl. site_code, lat, lon |
| standardize_feature | Standardize a pandas Series |
Condition #
Enum for different data conditions
- COLLOCATED: both ground-based and satellite data are present
- EO_ONLY: only satellite data is present
- GB_ONLY: only ground-based data is present
- BOTH_NONE: neither ground-based nor satellite data is present
Source code in src/ghg_forcing_for_cmip/preprocessing.py
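As a rough illustration of how such an enum can be used, the sketch below defines a stand-in class. The name `ConditionSketch`, the helper `as_condition`, and the mapping of members to the string values "collocated", "eo-only", "gb-only" and "both-none" are assumptions made for this example, not the package's actual definition.

```python
from enum import Enum


class ConditionSketch(Enum):
    """Illustrative stand-in for Condition; the string values are assumed."""

    COLLOCATED = "collocated"
    EO_ONLY = "eo-only"
    GB_ONLY = "gb-only"
    BOTH_NONE = "both-none"


def as_condition(value):
    """Accept either an enum member or its string value (hypothetical helper)."""
    if isinstance(value, ConditionSketch):
        return value
    # Enum lookup by value raises ValueError for unknown strings.
    return ConditionSketch(value)


cond = as_condition("eo-only")
```

Normalising the input once in this way lets downstream code compare against enum members only, whichever form the caller passed.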
add_hemisphere #
Add a grouping variable "hemisphere"
The levels are defined as follows:
- southern: lat < -split_value
- northern: lat > split_value
- tropics: -split_value < lat < split_value
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | dataframe including the lat variable | required |
| split_value | Union[float, int] | latitude at which the data are split into southern hemisphere, northern hemisphere and tropics | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | dataframe including new variable hemisphere |
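The grouping rule can be sketched as follows. This is a minimal illustration, not the package's code: the function name `add_hemisphere_sketch` is hypothetical, and assigning latitudes that lie exactly on `-split_value` or `split_value` to the tropics is an assumption, since the documentation does not specify the boundary case.

```python
import numpy as np
import pandas as pd


def add_hemisphere_sketch(df, split_value):
    """Label each row as southern, northern or tropics based on `lat`."""
    out = df.copy()
    conditions = [
        (out["lat"] < -split_value).to_numpy(),
        (out["lat"] > split_value).to_numpy(),
    ]
    # Rows matching neither condition fall into the tropics band; the
    # treatment of the boundary values themselves is an assumption.
    out["hemisphere"] = np.select(
        conditions, ["southern", "northern"], default="tropics"
    )
    return out


sites = pd.DataFrame({"lat": [-60.0, 0.0, 45.0]})
labelled = add_hemisphere_sketch(sites, split_value=30)
```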
add_missing_lat_lon_combinations #
add_missing_lat_lon_combinations(
df: DataFrame,
year_seq: Any,
grid_cell_size: int,
day: int = 15,
months: int = 12,
max_lat: int = 90,
max_lon: int = 180,
selected_vars: list[str] = [
"year",
"month",
"lat",
"lon",
],
) -> DataFrame
Add all missing latitude-longitude combinations to the data grid
- create template data grid with all possible combinations
- merge template data grid with observed data
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | observed data | required |
| year_seq | Any | array with years | required |
| grid_cell_size | int | size of grid cell | required |
| day | int | day used for creating a date | 15 |
| months | int | number of months in a year, by default 12 | 12 |
| max_lat | int | maximum absolute latitudinal value, by default 90 | 90 |
| max_lon | int | maximum absolute longitudinal value, by default 180 | 180 |
| selected_vars | list[str] | relevant data variables for merging, by default ["year", "month", "lat", "lon"] | ['year', 'month', 'lat', 'lon'] |
Returns:
| Type | Description |
|---|---|
| DataFrame | dataset with all possible lat x lon combinations |
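The two steps (template grid, then merge) might look roughly like the sketch below. The function name `complete_grid_sketch` is hypothetical, and two details are assumptions: grid cells are identified by their centre coordinates, and the observed data is attached with a left join so that unobserved cells keep missing values.

```python
import itertools

import numpy as np
import pandas as pd


def complete_grid_sketch(
    df, year_seq, grid_cell_size, months=12, max_lat=90, max_lon=180
):
    """Build every year x month x lat x lon cell and attach observed data."""
    half = grid_cell_size / 2
    # Assumption: cells are labelled by their centre coordinates.
    lats = np.arange(-max_lat + half, max_lat, grid_cell_size)
    lons = np.arange(-max_lon + half, max_lon, grid_cell_size)
    keys = ["year", "month", "lat", "lon"]
    template = pd.DataFrame(
        list(itertools.product(year_seq, range(1, months + 1), lats, lons)),
        columns=keys,
    )
    # Cells without an observation keep NaN in the value columns.
    return template.merge(df, on=keys, how="left")


observed = pd.DataFrame(
    {"year": [2020], "month": [1], "lat": [-45.0], "lon": [-135.0], "value": [1.5]}
)
full = complete_grid_sketch(observed, year_seq=[2020], grid_cell_size=90, months=2)
```

With a 90-degree cell size and two months, the template has 1 x 2 x 2 x 4 cells, of which only one carries an observed value.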
combine_datasets #
combine_datasets(
df_gb: DataFrame,
df_eo: DataFrame,
select_cols: list[str] = [
"year",
"month",
"lat",
"lon",
],
) -> DataFrame
Combine ground-based and satellite datasets
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df_gb | DataFrame | Dataframe containing ground-based data | required |
| df_eo | DataFrame | Dataframe containing satellite data | required |
| select_cols | list[str] | Columns to select from both dataframes for merging | ['year', 'month', 'lat', 'lon'] |
Returns:
| Type | Description |
|---|---|
| DataFrame | Combined dataframe with ground-based and satellite data |
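One plausible way to combine the two sources is an outer merge on the key columns, sketched below. The function name `combine_sketch`, the `value` column, and the `_gb`/`_eo` suffixes are assumptions for illustration; the documentation does not state the join type or the resulting column names.

```python
import pandas as pd


def combine_sketch(df_gb, df_eo, select_cols=("year", "month", "lat", "lon")):
    """Outer-join both sources so cells seen by only one source survive."""
    return df_gb.merge(
        df_eo, on=list(select_cols), how="outer", suffixes=("_gb", "_eo")
    )


gb = pd.DataFrame(
    {
        "year": [2020, 2020],
        "month": [1, 2],
        "lat": [2.5, 2.5],
        "lon": [7.5, 7.5],
        "value": [410.0, 411.0],
    }
)
eo = pd.DataFrame(
    {
        "year": [2020, 2020],
        "month": [2, 3],
        "lat": [2.5, 2.5],
        "lon": [7.5, 7.5],
        "value": [412.0, 413.0],
    }
)
combined = combine_sketch(gb, eo)
```

An outer join keeps rows that only one source covers, which is what makes a later split into collocated, satellite-only and ground-based-only cases possible.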
concat_datasets #
Concatenate multiple dataframes
An additional variable is created that indicates whether the data is observed or modelled.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dfs | list[DataFrame] | list of dataframes that shall be concatenated | required |
| obs_gb_value | list[bool] | list of booleans indicating for each dataframe in | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | concatenated dataframe with "obs_gb" indicator variable |
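A minimal sketch of the concatenation, assuming that `obs_gb_value` holds one flag per dataframe in `dfs` and that the flag is copied to every row of that dataframe. The function name `concat_sketch` and the length check are additions for this example.

```python
import pandas as pd


def concat_sketch(dfs, obs_gb_value):
    """Stack dataframes and flag each row with its source's obs_gb value."""
    if len(dfs) != len(obs_gb_value):
        raise ValueError("dfs and obs_gb_value must have the same length")
    flagged = [df.assign(obs_gb=flag) for df, flag in zip(dfs, obs_gb_value)]
    return pd.concat(flagged, ignore_index=True)


observed = pd.DataFrame({"year": [2020], "value": [410.0]})
modelled = pd.DataFrame({"year": [2021, 2022], "value": [412.0, 414.0]})
stacked = concat_sketch([observed, modelled], obs_gb_value=[True, False])
```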
create_future_dataset #
create_future_dataset(
pred_year_range: tuple[int, int],
lat_only: bool = False,
months: int = 12,
max_lat: int = 90,
max_lon: int = 180,
grid_cell_size: int = 5,
) -> DataFrame
Create a dataframe with future data
Includes variables: year, month, lat, lon
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pred_year_range | tuple[int, int] | range of years for prediction (start, end) | required |
| lat_only | bool | whether only latitudes should be included | False |
| months | int | number of months in a year, by default 12 | 12 |
| max_lat | int | latitudinal limit, by default 90 | 90 |
| max_lon | int | longitudinal limit, by default 180 | 180 |
| grid_cell_size | int | size of grid cell, by default 5 | 5 |
Returns:
| Type | Description |
|---|---|
| DataFrame | dataframe with future data combinations |
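The effect of `lat_only` can be illustrated with a small sketch that builds the Cartesian product of the coordinate axes. The function name `future_grid_sketch` is hypothetical; treating the end year as inclusive and labelling cells by their centre coordinates are assumptions not confirmed by the documentation.

```python
import numpy as np
import pandas as pd


def future_grid_sketch(
    pred_year_range,
    lat_only=False,
    months=12,
    max_lat=90,
    max_lon=180,
    grid_cell_size=5,
):
    """Return all year x month x lat (x lon) combinations as a dataframe."""
    half = grid_cell_size / 2
    # Assumptions: the end year is inclusive; cells are labelled by centre.
    axes = {
        "year": range(pred_year_range[0], pred_year_range[1] + 1),
        "month": range(1, months + 1),
        "lat": np.arange(-max_lat + half, max_lat, grid_cell_size),
    }
    if not lat_only:
        axes["lon"] = np.arange(-max_lon + half, max_lon, grid_cell_size)
    index = pd.MultiIndex.from_product(list(axes.values()), names=list(axes))
    return index.to_frame(index=False)


future = future_grid_sketch((2025, 2026), lat_only=True)
```

With the default 5-degree cells this yields 36 latitude bands, so two years of monthly data give 2 x 12 x 36 rows when longitudes are left out.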
do_feature_engineering #
Feature engineering for statistical analysis
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| d | DataFrame | dataframe to be processed | required |
| eo_included | bool | whether satellite data is included | True |
| day | int | day used for creating a date, by default 15 | 15 |
Returns:
| Type | Description |
|---|---|
| DataFrame | dataframe with additional and/or scaled features |
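The documentation does not list which features are created, so the sketch below only illustrates the two ideas it mentions: building a date from `year`, `month` and a fixed `day`, and scaling a feature. The names `standardize_sketch` and `engineer_sketch`, the choice of `lat` as the scaled feature, and the use of the sample standard deviation are all assumptions.

```python
import pandas as pd


def standardize_sketch(x):
    """Centre a Series on its mean and scale by its sample standard deviation."""
    return (x - x.mean()) / x.std()


def engineer_sketch(d, day=15):
    """Add a mid-month date and one scaled feature (illustrative only)."""
    out = d.copy()
    # pandas assembles timestamps from year/month/day columns.
    out["date"] = pd.to_datetime(out.assign(day=day)[["year", "month", "day"]])
    # Assumption: latitude is among the features that get scaled.
    out["lat_scaled"] = standardize_sketch(out["lat"])
    return out


raw = pd.DataFrame(
    {"year": [2020, 2020, 2020], "month": [1, 2, 3], "lat": [-10.0, 0.0, 10.0]}
)
features = engineer_sketch(raw)
```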
prepare_dataset #
prepare_dataset(
df_combined: DataFrame,
condition: Union[Condition, str],
day: int = 15,
) -> DataFrame
Prepare datasets for statistical analysis
Filter data based on the specified condition and do required feature engineering for statistical analysis.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df_combined | DataFrame | Dataframe combining ground-based and satellite data | required |
| condition | Union[Condition, str] | condition to filter data: "collocated" (both ground-based and satellite data are present), "eo-only" (only satellite data is present), "gb-only" (only ground-based data is present), "both-none" (neither ground-based nor satellite data is present) | required |
| day | int | day used for creating a date, by default 15 | 15 |
Returns:
| Type | Description |
|---|---|
| DataFrame | preprocessed dataframe ready for statistical analysis |
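The filtering step can be pictured as selecting rows by which of the two measurements is present. In the sketch below, the function name `filter_by_condition_sketch` and the column names `value_gb` and `value_eo` are hypothetical, and a missing measurement is assumed to be stored as NaN.

```python
import pandas as pd


def filter_by_condition_sketch(df_combined, condition):
    """Keep the rows that match one of the four data conditions."""
    has_gb = df_combined["value_gb"].notna()
    has_eo = df_combined["value_eo"].notna()
    masks = {
        "collocated": has_gb & has_eo,
        "eo-only": ~has_gb & has_eo,
        "gb-only": has_gb & ~has_eo,
        "both-none": ~has_gb & ~has_eo,
    }
    return df_combined[masks[condition]]


nan = float("nan")
cells = pd.DataFrame(
    {"value_gb": [1.0, nan, 2.0, nan], "value_eo": [3.0, 4.0, nan, nan]}
)
eo_only = filter_by_condition_sketch(cells, "eo-only")
```

The four masks are mutually exclusive and together cover every row, so each grid cell belongs to exactly one condition.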
preprocess_prediction_dataset #
Preprocess dataset for prediction task
Applies feature engineering and scaling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | raw dataset that shall be used for fitting prediction model | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | dataset prepared for prediction task |
select_sampling_sites #
select_sampling_sites(
raw_data: DataFrame,
year_min: int = 2003,
no_sites: int = 12,
) -> DataFrame
Select sampling sites incl. site_code, lat, lon
The sites with the highest number of observations after year_min are selected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| raw_data | DataFrame | raw data | required |
| year_min | int | minimum year for selecting sampling sites | 2003 |
| no_sites | int | number of sampling sites to be selected | 12 |
Returns:
| Type | Description |
|---|---|
| DataFrame | dataframe with selected sampling sites and their lat, lon coordinates |
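The selection rule can be sketched as a count per site followed by a top-N cut. The function name `top_sites_sketch` and the `year` column are assumptions, as is the choice to include `year_min` itself in the counted period (the documentation only says "after year_min").

```python
import pandas as pd


def top_sites_sketch(raw_data, year_min=2003, no_sites=12):
    """Return site_code, lat, lon for the most frequently observed sites."""
    # Assumption: observations from year_min itself are counted.
    recent = raw_data[raw_data["year"] >= year_min]
    counts = recent.groupby("site_code").size()
    keep = counts.nlargest(no_sites).index
    sites = raw_data[raw_data["site_code"].isin(keep)]
    return sites[["site_code", "lat", "lon"]].drop_duplicates().reset_index(drop=True)


raw = pd.DataFrame(
    {
        "site_code": ["A", "A", "A", "B", "C", "C", "C", "C"],
        "year": [2004, 2005, 2006, 2005, 1999, 2000, 2001, 2004],
        "lat": [10.0, 10.0, 10.0, 20.0, 30.0, 30.0, 30.0, 30.0],
        "lon": [5.0, 5.0, 5.0, 6.0, 7.0, 7.0, 7.0, 7.0],
    }
)
selected = top_sites_sketch(raw, year_min=2003, no_sites=1)
```

Site C has the most rows overall, but only one of them falls in the counted period, so site A is selected.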