Aggregating and downscaling timeseries data

The pyam package offers many tools to facilitate processing of scenario data. In this notebook, we illustrate methods to aggregate and downscale timeseries data of an IamDataFrame across regions and sectors, as well as how to check the consistency of given data along these dimensions.

In this tutorial, we show how to make the most of pyam to compute such aggregate timeseries data, and to check that a scenario ensemble (or just a single scenario) is complete and that timeseries data “add up” across regions and along the variable tree (i.e., that the sum of values of the subcategories such as Primary Energy|* are identical to the values of the category Primary Energy).

There are two distinct use cases where these features can be used.

Use case 1: compute data at higher/lower sectoral or spatial aggregation

Given scenario results at a specific (usually very detailed) sectoral and spatial resolution, pyam offers a suite of functions to easily compute aggregate timeseries. For example, this makes it easy to sum up national energy demand to regional or global totals, or to compute the average of a global carbon price weighted by regional emissions.

These functions can be used as part of an automated workflow to generate complete scenario results from raw model outputs.

Use case 2: check the consistency of data across sectoral or spatial levels

In model comparison exercises or ensemble compilation projects, a user needs to verify the internal consistency of submitted scenario results (cf. Huppmann et al., 2018, doi: 10.1038/s41558-018-0317-4). Such inconsistencies can be due to incomplete variable hierarchies, reporting templates incompatible with model specifications, or user error.

Overview

This notebook illustrates the following features:

  1. Load timeseries data from a snapshot file and inspect the scenario

  2. Aggregate timeseries over sectors (i.e., sub-categories)

  3. Aggregate timeseries over regions including weighted average

  4. Downscale timeseries given at a region level to sub-regions using a proxy variable

  5. Check the internal consistency of a scenario (ensemble)

[1]:
import pandas as pd
import pyam
pyam - INFO: Running in a notebook, setting `pyam` logging level to `logging.INFO` and adding stderr handler

0. Load timeseries data from snapshot file and inspect the scenario

The stylized scenario used in this tutorial has data for two regions (reg_a & reg_b) as well as the World aggregate, and for several categories of variables: primary energy demand, emissions, carbon price, and population.

[2]:
df = pyam.IamDataFrame(data='tutorial_data_aggregating_downscaling.csv')
pyam.utils - INFO: Reading `tutorial_data_aggregating_downscaling.csv`
[3]:
df.regions()
[3]:
0    World
1    reg_a
2    reg_b
Name: region, dtype: object
[4]:
df.variables()
[4]:
0            Emissions|CO2
1      Emissions|CO2|AFOLU
2    Emissions|CO2|Bunkers
3     Emissions|CO2|Energy
4               Population
5             Price|Carbon
6           Primary Energy
7      Primary Energy|Coal
8      Primary Energy|Wind
Name: variable, dtype: object

1. Aggregating timeseries across sectors

Let’s first display the data for the components of primary energy demand.

[5]:
df.filter(variable='Primary Energy|*').timeseries()
[5]:
2005 2010
model scenario region variable unit
model_a scen_a World Primary Energy|Coal EJ/y 9.0 10.0
Primary Energy|Wind EJ/y 3.0 5.0
reg_a Primary Energy|Coal EJ/y 6.0 6.0
Primary Energy|Wind EJ/y 2.0 3.0
reg_b Primary Energy|Coal EJ/y 3.0 4.0
Primary Energy|Wind EJ/y 1.0 2.0

Next, we are going to use the aggregate() function to compute the total Primary Energy from its components (wind and coal) in each region (including World).

The function returns an IamDataFrame, so we can use timeseries() to display the resulting data.

[6]:
df.aggregate('Primary Energy').timeseries()
[6]:
2005 2010
model scenario region variable unit
model_a scen_a World Primary Energy EJ/y 12.0 15.0
reg_a Primary Energy EJ/y 8.0 9.0
reg_b Primary Energy EJ/y 4.0 6.0

If we are interested in use case 1, we could use the argument append=True to directly add the computed aggregate to the IamDataFrame.

However, in this tutorial, the data already includes the total primary energy demand. Therefore, we illustrate use case 2 and apply the check_aggregate() function to verify whether a given variable is the sum of its sectoral components (i.e., Primary Energy should be equal to Primary Energy|Coal plus Primary Energy|Wind). The validation is performed separately for each region.

The function returns None if the validation is correct (which it is for primary energy demand) or a pandas.DataFrame highlighting where the aggregate does not match (this will be illustrated in the next section).

[7]:
df.check_aggregate('Primary Energy')

The function also emits a helpful logging message when there is nothing to check (because Primary Energy|Wind has no further sub-categories).

[8]:
df.check_aggregate('Primary Energy|Wind')
pyam._aggregate - INFO: cannot aggregate variable `Primary Energy|Wind` because it has no components
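Conceptually, the sectoral check performed by check_aggregate() reduces to a numpy.isclose() comparison between the reported total and the sum of its components. A minimal sketch using the reg_a values shown above:

```python
import numpy as np

# reported total and components of `Primary Energy` in reg_a (2005, 2010)
total = np.array([8.0, 9.0])
coal = np.array([6.0, 6.0])
wind = np.array([2.0, 3.0])

# the check passes because the components sum exactly to the reported total
assert np.isclose(coal + wind, total).all()
```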

2. Aggregating timeseries across subregions

Similarly to the previous example, we now use the aggregate_region() function to compute regional aggregates. By default, this method sums all the regions in the dataframe to make a World region; this can be changed with the keyword arguments region and subregions.

[9]:
df.aggregate_region('Primary Energy').timeseries()
[9]:
2005 2010
model scenario region variable unit
model_a scen_a World Primary Energy EJ/y 12.0 15.0

Adding regional components

As a next step, we use check_aggregate_region() to verify that the regional aggregate of CO2 emissions matches the timeseries data given in the scenario.

[10]:
df.check_aggregate_region('Emissions|CO2')
pyam.core - INFO: `Emissions|CO2` - 2 of 2 rows are not aggregates of subregions
[10]:
region subregions
model scenario region variable unit year
model_a scen_a World Emissions|CO2 EJ/y 2005 10.0 9.0
2010 14.0 12.0

As noted above, this validation failed, and the returned dataframe shows the expected values at the region level next to the aggregate computed from the subregions.

Let’s look at the entire emissions timeseries in the scenario to find out what is going on.

[11]:
df.filter(variable='Emissions*').timeseries()
[11]:
2005 2010
model scenario region variable unit
model_a scen_a World Emissions|CO2 EJ/y 10.0 14.0
Emissions|CO2|AFOLU EJ/y 3.0 4.0
Emissions|CO2|Bunkers EJ/y 1.0 2.0
Emissions|CO2|Energy EJ/y 6.0 8.0
reg_a Emissions|CO2 EJ/y 6.0 8.0
Emissions|CO2|AFOLU EJ/y 2.0 3.0
Emissions|CO2|Energy EJ/y 4.0 5.0
reg_b Emissions|CO2 EJ/y 3.0 4.0
Emissions|CO2|AFOLU EJ/y 1.0 1.0
Emissions|CO2|Energy EJ/y 2.0 3.0

Investigating the data carefully, you will notice that emissions from the energy sector and agriculture, forestry & land use (AFOLU) are given in the subregions and the World region, whereas emissions from bunker fuels are only defined at the global level. This is a common issue in emissions data, where some sources (e.g., global aviation and maritime transport) cannot be attributed to one region.

Luckily, the functions aggregate_region() and check_aggregate_region() support this use case: by adding components=True, the regional aggregation will include any sub-categories of the variable that are only present at the region level but not in any subregion.

[12]:
df.aggregate_region('Emissions|CO2', components=True).timeseries()
[12]:
2005 2010
model scenario region variable unit
model_a scen_a World Emissions|CO2 EJ/y 10.0 14.0

The regional aggregate now matches the data given at the World level in the tutorial data.

Note that the components to be included at the region level can also be specified directly as a list of variables; in this case, we would use components=['Emissions|CO2|Bunkers'].
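The arithmetic behind components=True can be verified from the emissions table above: the aggregate is the sum over the subregions plus the bunker emissions that exist only at the World level.

```python
import numpy as np

# Emissions|CO2 values for 2005 and 2010 from the timeseries above
reg_a = np.array([6.0, 8.0])     # subregion reg_a
reg_b = np.array([3.0, 4.0])     # subregion reg_b
bunkers = np.array([1.0, 2.0])   # Emissions|CO2|Bunkers, defined at World only

world = np.array([10.0, 14.0])   # reported Emissions|CO2 at the World level
assert np.isclose(reg_a + reg_b + bunkers, world).all()
```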

Computing a weighted average across regions

One other frequent requirement when aggregating across regions is a weighted average.

To illustrate this feature, the tutorial data includes carbon price data. Naturally, the appropriate weighting data are the regional carbon emissions.

The following cells show:

  1. The carbon price data across the regions

  2. A (failing) validation that the regional aggregation (without weights) matches the reported prices at the World level

  3. The emissions-weighted average of carbon prices returned as a new IamDataFrame

[13]:
df.filter(variable='Price|Carbon').timeseries()
[13]:
2005 2010
model scenario region variable unit
model_a scen_a World Price|Carbon USD/tCO2 4.0 27.0
reg_a Price|Carbon USD/tCO2 1.0 30.0
reg_b Price|Carbon USD/tCO2 10.0 21.0
[14]:
df.check_aggregate_region('Price|Carbon')
pyam.core - INFO: `Price|Carbon` - 2 of 2 rows are not aggregates of subregions
[14]:
region subregions
model scenario region variable unit year
model_a scen_a World Price|Carbon USD/tCO2 2005 4.0 11.0
2010 27.0 51.0
[15]:
df.aggregate_region('Price|Carbon', weight='Emissions|CO2').timeseries()
[15]:
2005 2010
model scenario region variable unit
model_a scen_a World Price|Carbon USD/tCO2 4.0 27.0
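The emissions-weighted average shown above can be reproduced by hand from the price and emissions tables:

```python
import numpy as np

price_a = np.array([1.0, 30.0])   # Price|Carbon in reg_a (2005, 2010)
price_b = np.array([10.0, 21.0])  # Price|Carbon in reg_b
emis_a = np.array([6.0, 8.0])     # Emissions|CO2 in reg_a (the weights)
emis_b = np.array([3.0, 4.0])     # Emissions|CO2 in reg_b

weighted = (price_a * emis_a + price_b * emis_b) / (emis_a + emis_b)
# matches the reported World-level carbon price of 4.0 and 27.0 USD/tCO2
assert np.isclose(weighted, [4.0, 27.0]).all()
```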

3. Downscaling timeseries data to subregions

The inverse operation of regional aggregation is “downscaling” of timeseries data given at a regional level to a number of subregions, usually using some other data as proxy to divide and allocate the total to the subregions.

This section shows an example using the downscale_region() function to divide the total primary energy demand using population as a proxy.

[16]:
df.downscale_region('Primary Energy', proxy='Population').timeseries()
[16]:
2005 2010
model scenario region variable unit
model_a scen_a reg_a Primary Energy EJ/y 8.0 9.0
reg_b Primary Energy EJ/y 4.0 6.0
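The proxy allocation performed by downscale_region() amounts to splitting the regional total in proportion to the proxy variable. A minimal sketch of that logic with made-up population figures (the tutorial's actual Population values are not shown here):

```python
import numpy as np

total = 12.0                        # regional total to be downscaled
population = np.array([2.0, 1.0])  # hypothetical proxy values for two subregions

shares = population / population.sum()
downscaled = total * shares
# the allocation always adds back up to the regional total
assert np.isclose(downscaled.sum(), total)
```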

By the way, the functions aggregate(), aggregate_region() and downscale_region() also accept a list of variables as the variable argument. See the next cell for an example.

[17]:
var_list = ['Primary Energy', 'Primary Energy|Coal']
df.downscale_region(var_list, proxy='Population').timeseries()
[17]:
2005 2010
model scenario region variable unit
model_a scen_a reg_a Primary Energy EJ/y 8.0 9.0
Primary Energy|Coal EJ/y 6.0 6.0
reg_b Primary Energy EJ/y 4.0 6.0
Primary Energy|Coal EJ/y 3.0 4.0

4. Checking the internal consistency of a scenario (ensemble)

The previous sections illustrated two functions to validate specific variables across their sectors (sub-categories) or regional disaggregation. These two functions are combined in the check_internal_consistency() feature.

This feature of the pyam package currently only supports “consistency” in the sense of a strictly hierarchical variable tree (with sub-categories summing up to the category value, including components as discussed above) and of all regions summing up to the World region.
See this issue for more information.

If we have an internally consistent scenario ensemble (or single scenario), the function will return None; otherwise, it will return a concatenation of pandas.DataFrames indicating all detected inconsistencies.

For this section, we use a tutorial scenario constructed specifically to highlight the individual validation features. The scenario has two inconsistencies:

  1. In 2010, in the regions region_b and World, the values of coal and wind do not add up to the total Primary Energy value

  2. In 2020, in the World region, the values of Primary Energy and Primary Energy|Coal are not the sums of the respective region_a and region_b values (but within each subregion, coal and wind correctly add up to Primary Energy)

[18]:
tutorial_df = pyam.IamDataFrame(pd.DataFrame([
    ['World', 'Primary Energy', 'EJ/yr', 7, 15],
    ['World', 'Primary Energy|Coal', 'EJ/yr', 4, 11],
    ['World', 'Primary Energy|Wind', 'EJ/yr', 2, 4],
    ['region_a', 'Primary Energy', 'EJ/yr', 4, 8],
    ['region_a', 'Primary Energy|Coal', 'EJ/yr', 2, 6],
    ['region_a', 'Primary Energy|Wind', 'EJ/yr', 2, 2],
    ['region_b', 'Primary Energy', 'EJ/yr', 3, 6],
    ['region_b', 'Primary Energy|Coal', 'EJ/yr', 2, 4],
    ['region_b', 'Primary Energy|Wind', 'EJ/yr', 0, 2],
],
    columns=['region', 'variable', 'unit', 2010, 2020]
), model='model_a', scenario='scen_a')
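The two inconsistencies can be confirmed directly from the numbers in the cell above:

```python
import numpy as np

# inconsistency 1: World in 2010, components do not add up to the total
assert not np.isclose(4 + 2, 7)    # Primary Energy|Coal + |Wind != Primary Energy

# inconsistency 2: World in 2020, subregions do not add up to the World value
assert not np.isclose(8 + 6, 15)   # region_a + region_b != World (Primary Energy)
assert not np.isclose(6 + 4, 11)   # same for Primary Energy|Coal
```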

All checking functions take keyword arguments that are passed to numpy.isclose(). We show our recommended settings and how to use them here.

[19]:
np_isclose_args = {
    'equal_nan': True,
    'rtol': 1e-03,
    'atol': 1e-05,
}
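Per the numpy.isclose() documentation, these tolerances mean two values a and b are treated as equal when |a - b| <= atol + rtol * |b|, so small reporting round-offs pass while real discrepancies fail. A quick sanity check:

```python
import numpy as np

args = {'equal_nan': True, 'rtol': 1e-03, 'atol': 1e-05}

assert np.isclose(10.0, 10.005, **args)      # ~0.05% deviation: within tolerance
assert not np.isclose(10.0, 10.1, **args)    # 1% deviation: outside tolerance
```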
[20]:
tutorial_df.check_internal_consistency(**np_isclose_args)
pyam.core - INFO: `Primary Energy` - 2 of 6 rows are not aggregates of components
pyam.core - INFO: `Primary Energy` - 1 of 2 rows are not aggregates of subregions
pyam._aggregate - INFO: cannot aggregate variable `Primary Energy|Coal` because it has no components
pyam.core - INFO: `Primary Energy|Coal` - 1 of 2 rows are not aggregates of subregions
pyam._aggregate - INFO: cannot aggregate variable `Primary Energy|Wind` because it has no components
[20]:
variable components region subregions
model scenario region variable unit year
model_a scen_a World Primary Energy EJ/yr 2010 7.0 6.0 NaN NaN
2020 NaN NaN 15.0 14.0
Primary Energy|Coal EJ/yr 2020 NaN NaN 11.0 10.0
region_b Primary Energy EJ/yr 2010 3.0 2.0 NaN NaN

The output of this function reports both types of illustrative inconsistencies in the scenario constructed for this section. The log also shows that the other two variables (coal and wind) cannot be assessed because they have no subcategories.

In practice, it would now be up to the user to determine the cause of the inconsistency (or confirm that this is expected for some reason).