Aggregating and downscaling timeseries data¶
The pyam package offers many tools to facilitate processing of scenario data. In this notebook, we illustrate methods to aggregate and downscale timeseries data of an IamDataFrame across regions and sectors, as well as checking consistency of given data along these dimensions.
In this tutorial, we show how to make the most of pyam to compute such aggregate timeseries data, and to check that a scenario ensemble (or just a single scenario) is complete and that timeseries data “add up” across regions and along the variable tree (i.e., that the sum of values of subcategories such as Primary Energy|* is identical to the value of the category Primary Energy).
There are two distinct use cases where these features can be used.
Use case 1: compute data at higher/lower sectoral or spatial aggregation¶
Given scenario results at a specific (usually very detailed) sectoral and spatial resolution, pyam offers a suite of functions to easily compute aggregate timeseries. For example, this makes it easy to sum up national energy demand to regional or global totals, or to compute the average of a global carbon price weighted by regional emissions.
These functions can be used as part of an automated workflow to generate complete scenario results from raw model outputs.
Use case 2: check the consistency of data across sectoral or spatial levels¶
In model comparison exercises or ensemble compilation projects, a user needs to verify the internal consistency of submitted scenario results (cf. Huppmann et al., 2018, doi: 10.1038/s41558-018-0317-4). Such inconsistencies can be due to incomplete variable hierarchies, reporting templates incompatible with model specifications, or user error.
Overview¶
This notebook illustrates the following features:
Import data from file and inspect the scenario
Aggregate timeseries over sectors (i.e., sub-categories)
Aggregate timeseries over regions including weighted average
Downscale timeseries given at a region level to sub-regions using a proxy variable
Downscale timeseries using an explicit weighting dataframe
Check the internal consistency of a scenario (ensemble)
See Also
The pyam package also supports algebraic operations (addition, subtraction, multiplication, division) on the timeseries data along any axis or dimension. See the algebraic operations tutorial notebook for more information.
[1]:
import pandas as pd
from pyam import IamDataFrame
0. Import data from file and inspect the scenario¶
The stylized scenario used in this tutorial has data for two regions (reg_a & reg_b) as well as the World aggregate, and for several categories of variables: primary energy demand, emissions, carbon price, and population.
[2]:
df = IamDataFrame(data='tutorial_data_aggregating_downscaling.csv')
pyam.core - INFO: Reading file tutorial_data_aggregating_downscaling.csv
[3]:
df.region
[3]:
['World', 'reg_a', 'reg_b']
[4]:
df.variable
[4]:
['Emissions|CO2',
'Emissions|CO2|AFOLU',
'Emissions|CO2|Bunkers',
'Emissions|CO2|Energy',
'Population',
'Price|Carbon',
'Primary Energy',
'Primary Energy|Coal',
'Primary Energy|Wind']
1. Aggregating timeseries across sectors¶
Let’s first display the data for the components of primary energy demand.
[5]:
df.filter(variable='Primary Energy|*').timeseries()
[5]:
                                                      2005  2010
model    scenario  region  variable             unit
model_a  scen_a    World   Primary Energy|Coal  EJ/yr   9.0  10.0
                           Primary Energy|Wind  EJ/yr   3.0   5.0
                   reg_a   Primary Energy|Coal  EJ/yr   6.0   6.0
                           Primary Energy|Wind  EJ/yr   2.0   3.0
                   reg_b   Primary Energy|Coal  EJ/yr   3.0   4.0
                           Primary Energy|Wind  EJ/yr   1.0   2.0
Next, we are going to use the aggregate() function to compute the total Primary Energy from its components (wind and coal) in each region (including World). The function returns an IamDataFrame, so we can use timeseries() to display the resulting data.
[6]:
df.aggregate('Primary Energy').timeseries()
[6]:
                                                 2005  2010
model    scenario  region  variable        unit
model_a  scen_a    World   Primary Energy  EJ/yr  12.0  15.0
                   reg_a   Primary Energy  EJ/yr   8.0   9.0
                   reg_b   Primary Energy  EJ/yr   4.0   6.0
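These aggregates are easy to verify by hand; a quick sanity check in plain Python, using the component values copied from the table above:

```python
# Primary Energy components (2005, 2010) as shown in the table above
components = {
    "World": {"Coal": [9.0, 10.0], "Wind": [3.0, 5.0]},
    "reg_a": {"Coal": [6.0, 6.0], "Wind": [2.0, 3.0]},
    "reg_b": {"Coal": [3.0, 4.0], "Wind": [1.0, 2.0]},
}
# Sum coal and wind per region and year
totals = {
    region: [sum(values) for values in zip(*fuels.values())]
    for region, fuels in components.items()
}
print(totals)  # {'World': [12.0, 15.0], 'reg_a': [8.0, 9.0], 'reg_b': [4.0, 6.0]}
```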
If we are interested in use case 1, we could use the argument append=True to directly add the computed aggregate to the IamDataFrame instance.
However, in this tutorial, the data already includes the total primary energy demand. Therefore, we illustrate use case 2 and apply the check_aggregate() function to verify whether a given variable is the sum of its sectoral components (i.e., Primary Energy should be equal to Primary Energy|Coal plus Primary Energy|Wind). The validation is performed separately for each region.
The function returns None if the validation succeeds (as it does for primary energy demand) or a pandas.DataFrame highlighting where the aggregate does not match (this will be illustrated in the next section).
[7]:
df.check_aggregate('Primary Energy')
The function also returns a useful logging message when there is nothing to check (because there are no sectors below Primary Energy|Wind).
[8]:
df.check_aggregate('Primary Energy|Wind')
pyam.aggregation - INFO: Cannot aggregate variable 'Primary Energy|Wind' because it has no components!
2. Aggregating timeseries across subregions¶
Similarly to the previous example, we now use the aggregate_region() function to compute regional aggregates. By default, this method sums all the regions in the dataframe to make a World region; this can be changed with the keyword arguments region and subregions.
[9]:
df.aggregate_region('Primary Energy').timeseries()
[9]:
                                                 2005  2010
model    scenario  region  variable        unit
model_a  scen_a    World   Primary Energy  EJ/yr  12.0  15.0
Adding regional components¶
As a next step, we use check_aggregate_region() to verify that the regional aggregate of CO2 emissions matches the timeseries data given in the scenario.
[10]:
df.check_aggregate_region('Emissions|CO2')
pyam.core - INFO: `Emissions|CO2` - 2 of 2 rows are not aggregates of subregions
[10]:
                                                         region  subregions
model    scenario  region  variable       unit    year
model_a  scen_a    World   Emissions|CO2  Mt CO2  2005     10.0         9.0
                                                  2010     14.0        12.0
As announced above, this validation failed: we see a dataframe of the expected data at the region level and the aggregation computed from the subregions.
Let’s look at the entire emissions timeseries in the scenario to find out what is going on.
[11]:
df.filter(variable='Emissions*').timeseries()
[11]:
                                                          2005  2010
model    scenario  region  variable               unit
model_a  scen_a    World   Emissions|CO2          Mt CO2  10.0  14.0
                           Emissions|CO2|AFOLU    Mt CO2   3.0   4.0
                           Emissions|CO2|Bunkers  Mt CO2   1.0   2.0
                           Emissions|CO2|Energy   Mt CO2   6.0   8.0
                   reg_a   Emissions|CO2          Mt CO2   6.0   8.0
                           Emissions|CO2|AFOLU    Mt CO2   2.0   3.0
                           Emissions|CO2|Energy   Mt CO2   4.0   5.0
                   reg_b   Emissions|CO2          Mt CO2   3.0   4.0
                           Emissions|CO2|AFOLU    Mt CO2   1.0   1.0
                           Emissions|CO2|Energy   Mt CO2   2.0   3.0
Investigating the data carefully, you will notice that emissions from the energy sector and agriculture, forestry & land use (AFOLU) are given in the subregions and the World region, whereas emissions from bunker fuels are only defined at the global level. This is a common issue in emissions data, where some sources (e.g., global aviation and maritime transport) cannot be attributed to any single region.
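The mismatch reported above can be confirmed by hand: the gap between the reported World total and the sum of the subregions is exactly the bunker emissions (values copied from the emissions table):

```python
# Emissions|CO2 values from the tutorial data (Mt CO2)
world_total = {2005: 10.0, 2010: 14.0}               # World
subregion_sum = {2005: 6.0 + 3.0, 2010: 8.0 + 4.0}   # reg_a + reg_b
bunkers = {2005: 1.0, 2010: 2.0}                     # only defined at World level

# The difference equals the bunker emissions in both years
gaps = {year: world_total[year] - subregion_sum[year] for year in world_total}
print(gaps)  # {2005: 1.0, 2010: 2.0}
```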
Luckily, the functions aggregate_region() and check_aggregate_region() support this use case: by adding components=True, the regional aggregation will include any sub-categories of the variable that are only present at the region level but not in any subregion.
[12]:
df.aggregate_region('Emissions|CO2', components=True).timeseries()
[12]:
                                                  2005  2010
model    scenario  region  variable       unit
model_a  scen_a    World   Emissions|CO2  Mt CO2  10.0  14.0
The regional aggregate now matches the data given at the World level in the tutorial data.
Note that the components to be included at the region level can also be specified directly as a list of variables; in this case, we would use components=['Emissions|CO2|Bunkers'].
Computing a weighted average across regions¶
One other frequent requirement when aggregating across regions is a weighted average.
To illustrate this feature, the tutorial data includes carbon price data. Naturally, the appropriate weighting data are the regional carbon emissions.
The following cells show:
The carbon price data across the regions
A (failing) validation that the regional aggregation (without weights) matches the reported prices at the World level
The emissions-weighted average of carbon prices returned as a new IamDataFrame
[13]:
df.filter(variable='Price|Carbon').timeseries()
[13]:
                                                     2005  2010
model    scenario  region  variable      unit
model_a  scen_a    World   Price|Carbon  USD/t CO2   4.0  27.0
                   reg_a   Price|Carbon  USD/t CO2   1.0  30.0
                   reg_b   Price|Carbon  USD/t CO2  10.0  21.0
[14]:
df.check_aggregate_region('Price|Carbon')
pyam.core - INFO: `Price|Carbon` - 2 of 2 rows are not aggregates of subregions
[14]:
                                                            region  subregions
model    scenario  region  variable      unit       year
model_a  scen_a    World   Price|Carbon  USD/t CO2  2005       4.0        11.0
                                                    2010      27.0        51.0
[15]:
df.aggregate_region('Price|Carbon', weight='Emissions|CO2').timeseries()
[15]:
                                                     2005  2010
model    scenario  region  variable      unit
model_a  scen_a    World   Price|Carbon  USD/t CO2   4.0  27.0
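To see where these values come from, we can reproduce the emissions-weighted average by hand in plain Python, using the regional carbon prices and CO2 emissions copied from the tutorial data:

```python
# Regional values (2005, 2010) from the tutorial data
prices = {"reg_a": (1.0, 30.0), "reg_b": (10.0, 21.0)}   # Price|Carbon, USD/t CO2
emissions = {"reg_a": (6.0, 8.0), "reg_b": (3.0, 4.0)}   # Emissions|CO2, Mt CO2

# Weighted average: sum(price * weight) / sum(weight), per year
weighted_avg = []
for i in range(2):
    total = sum(e[i] for e in emissions.values())
    weighted_avg.append(sum(prices[r][i] * emissions[r][i] for r in prices) / total)
print(weighted_avg)  # [4.0, 27.0] -- matching the World-level Price|Carbon data
```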
3. Downscaling timeseries data to subregions using a proxy¶
The inverse operation of regional aggregation is “downscaling” of timeseries data given at a regional level to a number of subregions, usually using some other data as a proxy to divide and allocate the total to the subregions.
This section shows an example using the downscale_region() function to divide the total primary energy demand using population as a proxy.
[16]:
df.filter(variable='Population').timeseries()
[16]:
                                               2005  2010
model    scenario  region  variable    unit
model_a  scen_a    World   Population  million   3.0   5.0
                   reg_a   Population  million   1.5   2.5
                   reg_b   Population  million   1.5   2.5
[17]:
df.downscale_region('Primary Energy', proxy='Population').timeseries()
[17]:
                                                 2005  2010
model    scenario  region  variable        unit
model_a  scen_a    reg_a   Primary Energy  EJ/yr   6.0   7.5
                   reg_b   Primary Energy  EJ/yr   6.0   7.5
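The downscaled values can be checked by hand: each subregion receives a share of the World total proportional to its population (values copied from the tutorial data):

```python
# World Primary Energy and regional population from the tutorial data
world_pe = {2005: 12.0, 2010: 15.0}   # EJ/yr
population = {
    "reg_a": {2005: 1.5, 2010: 2.5},  # million
    "reg_b": {2005: 1.5, 2010: 2.5},
}

# Each subregion gets: total * (own population / world population)
downscaled = {
    region: {
        year: world_pe[year] * pop[year] / sum(p[year] for p in population.values())
        for year in world_pe
    }
    for region, pop in population.items()
}
print(downscaled)  # reg_a and reg_b each get 6.0 in 2005 and 7.5 in 2010
```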
By the way, the functions aggregate(), aggregate_region() and downscale_region() also accept a list of variables as the variable argument. See the next cell for an example.
[18]:
var_list = ['Primary Energy', 'Primary Energy|Coal']
df.downscale_region(var_list, proxy='Population').timeseries()
[18]:
                                                      2005  2010
model    scenario  region  variable             unit
model_a  scen_a    reg_a   Primary Energy       EJ/yr   6.0   7.5
                           Primary Energy|Coal  EJ/yr   4.5   5.0
                   reg_b   Primary Energy       EJ/yr   6.0   7.5
                           Primary Energy|Coal  EJ/yr   4.5   5.0
4. Downscaling timeseries data to subregions using a weighting dataframe¶
In cases where using existing data directly as a proxy (as illustrated in the previous section) is not practical, a user can also create a weighting dataframe and pass it directly to the downscale_region() function.
The example below uses the weighting factors implied by the population variable for easy comparison to the previous section.
[19]:
weight = pd.DataFrame(
[[0.66, 0.6], [0.33, 0.4]],
index=pd.Series(['reg_a', 'reg_b'], name='region'),
columns=pd.Series([2005, 2010], name='year')
)
weight
[19]:
year    2005  2010
region
reg_a   0.66   0.6
reg_b   0.33   0.4
[20]:
df.downscale_region(var_list, weight=weight).timeseries()
[20]:
                                                      2005  2010
model    scenario  region  variable             unit
model_a  scen_a    reg_a   Primary Energy       EJ/yr   8.0   9.0
                           Primary Energy|Coal  EJ/yr   6.0   6.0
                   reg_b   Primary Energy       EJ/yr   4.0   6.0
                           Primary Energy|Coal  EJ/yr   3.0   4.0
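Note that the weights for 2005 sum to 0.99, not 1, yet reg_a still receives exactly two thirds of the total. As the results above suggest, the weights appear to be normalized by their sum per year; a quick hand-check of that interpretation (values copied from the cells above):

```python
# Weighting dataframe values and World Primary Energy from the tutorial data
weights = {2005: {"reg_a": 0.66, "reg_b": 0.33}, 2010: {"reg_a": 0.6, "reg_b": 0.4}}
world_pe = {2005: 12.0, 2010: 15.0}

# Normalize the weights by their sum per year before allocating the total
downscaled = {}
for year in world_pe:
    norm = sum(weights[year].values())
    downscaled[year] = {
        r: round(world_pe[year] * w / norm, 6) for r, w in weights[year].items()
    }
print(downscaled)  # matches the downscaled timeseries shown above
```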
5. Checking the internal consistency of a scenario (ensemble)¶
The previous sections illustrated two functions to validate specific variables across their sectors (sub-categories) or regional disaggregation. These two functions are combined in the check_internal_consistency() feature.
If we have an internally consistent scenario ensemble (or single scenario), the function will return None
; otherwise, it will return a concatenation of pandas.DataFrames indicating all detected inconsistencies.
For this section, we use a tutorial scenario which is constructed to highlight the individual validation features below. The scenario has two inconsistencies:
In year 2010 and regions region_b & World, the values of coal and wind do not add up to the total Primary Energy value
In year 2020 in the World region, the values of Primary Energy and Primary Energy|Coal are not the sum of region_a and region_b (but the sum of wind and coal to Primary Energy in each sub-region is correct)
[21]:
tutorial_df = IamDataFrame(pd.DataFrame([
['World', 'Primary Energy', 'EJ/yr', 7, 15],
['World', 'Primary Energy|Coal', 'EJ/yr', 4, 11],
['World', 'Primary Energy|Wind', 'EJ/yr', 2, 4],
['region_a', 'Primary Energy', 'EJ/yr', 4, 8],
['region_a', 'Primary Energy|Coal', 'EJ/yr', 2, 6],
['region_a', 'Primary Energy|Wind', 'EJ/yr', 2, 2],
['region_b', 'Primary Energy', 'EJ/yr', 3, 6],
['region_b', 'Primary Energy|Coal', 'EJ/yr', 2, 4],
['region_b', 'Primary Energy|Wind', 'EJ/yr', 0, 2],
],
columns=['region', 'variable', 'unit', 2010, 2020]
), model='model_a', scenario='scen_a')
All checking functions take arguments for np.isclose() as keyword arguments. We show our recommended settings and how to use them here.
[22]:
np_isclose_args = {
'equal_nan': True,
'rtol': 1e-03,
'atol': 1e-05,
}
[23]:
tutorial_df.check_internal_consistency(**np_isclose_args)
pyam.core - INFO: `Primary Energy` - 2 of 6 rows are not aggregates of components
pyam.core - INFO: `Primary Energy` - 1 of 2 rows are not aggregates of subregions
pyam.aggregation - INFO: Cannot aggregate variable 'Primary Energy|Coal' because it has no components!
pyam.core - INFO: `Primary Energy|Coal` - 1 of 2 rows are not aggregates of subregions
pyam.aggregation - INFO: Cannot aggregate variable 'Primary Energy|Wind' because it has no components!
[23]:
                                                               variable  components  region  subregions
model    scenario  region    variable             unit   year
model_a  scen_a    World     Primary Energy       EJ/yr  2010       7.0         6.0     NaN         NaN
                                                         2020       NaN         NaN    15.0        14.0
                             Primary Energy|Coal  EJ/yr  2020       NaN         NaN    11.0        10.0
                   region_b  Primary Energy       EJ/yr  2010       3.0         2.0     NaN         NaN
The output of this function reports both types of illustrative inconsistencies in the scenario constructed for this section. The log also shows that the other two variables (coal and wind) cannot be assessed along the sectoral dimension because they have no subcategories.
In practice, it would now be up to the user to determine the cause of the inconsistency (or confirm that this is expected for some reason).
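For the regional inconsistency, one way to pin it down is to recompute the World totals from the sub-regions by hand (values copied from the tutorial scenario defined above):

```python
# Sub-regional values for year 2020 from the tutorial scenario (EJ/yr)
subregions = {
    "Primary Energy": {"region_a": 8.0, "region_b": 6.0},
    "Primary Energy|Coal": {"region_a": 6.0, "region_b": 4.0},
}
world = {"Primary Energy": 15.0, "Primary Energy|Coal": 11.0}  # reported at World level

# Compare the reported World value with the sum over sub-regions
aggregated = {variable: sum(values.values()) for variable, values in subregions.items()}
for variable in world:
    print(variable, "reported:", world[variable], "aggregated:", aggregated[variable])
# Primary Energy reported: 15.0 aggregated: 14.0
# Primary Energy|Coal reported: 11.0 aggregated: 10.0
```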