Aggregating and downscaling timeseries data¶
The pyam package offers many tools to facilitate processing of scenario data. In this notebook, we illustrate methods to aggregate and downscale timeseries data of an IamDataFrame across regions and sectors, as well as checking consistency of given data along these dimensions.
In this tutorial, we show how to make the most of pyam to compute such aggregate timeseries data, and to check that a scenario ensemble (or just a single scenario) is complete and that timeseries data “add up” across regions and along the variable tree (i.e., that the sum of values of subcategories such as Primary Energy|* is identical to the value of the category Primary Energy).
There are two distinct use cases where these features can be used.
Use case 1: compute data at higher/lower sectoral or spatial aggregation¶
Given scenario results at a specific (usually very detailed) sectoral and spatial resolution, pyam offers a suite of functions to easily compute aggregate timeseries. For example, this makes it easy to sum up national energy demand to regional or global totals, or to compute the average of a global carbon price weighted by regional emissions.
These functions can be used as part of an automated workflow to generate complete scenario results from raw model outputs.
Use case 2: check the consistency of data across sectoral or spatial levels¶
In model comparison exercises or ensemble compilation projects, a user needs to verify the internal consistency of submitted scenario results (cf. Huppmann et al., 2018, doi: 10.1038/s41558-018-0317-4). Such inconsistencies can be due to incomplete variable hierarchies, reporting templates incompatible with model specifications, or user error.
Overview¶
This notebook illustrates the following features:
Import data from file and inspect the scenario
Aggregate timeseries over sectors (i.e., sub-categories)
Aggregate timeseries over regions including weighted average
Downscale timeseries given at a region level to sub-regions using a proxy variable
Downscale timeseries using an explicit weighting dataframe
Check the internal consistency of a scenario (ensemble)
See Also
The pyam package also supports algebraic operations (addition, subtraction, multiplication, division) on the timeseries data along any axis or dimension. See the algebraic operations tutorial notebook for more information.
[1]:
import pandas as pd
from pyam import IamDataFrame
0. Import data from file and inspect the scenario¶
The stylized scenario used in this tutorial has data for two regions (reg_a & reg_b) as well as the World aggregate, and for several categories of variables: primary energy demand, emissions, carbon price, and population.
[2]:
df = IamDataFrame(data='tutorial_data_aggregating_downscaling.csv')
pyam.core - INFO: Reading file tutorial_data_aggregating_downscaling.csv
[3]:
df.region
[3]:
['World', 'reg_a', 'reg_b']
[4]:
df.variable
[4]:
['Emissions|CO2',
'Emissions|CO2|AFOLU',
'Emissions|CO2|Bunkers',
'Emissions|CO2|Energy',
'Population',
'Price|Carbon',
'Primary Energy',
'Primary Energy|Coal',
'Primary Energy|Wind']
1. Aggregating timeseries across sectors¶
Let’s first display the data for the components of primary energy demand.
[5]:
df.filter(variable='Primary Energy|*').timeseries()
[5]:
                                                      2005  2010
model    scenario  region  variable             unit
model_a  scen_a    World   Primary Energy|Coal  EJ/yr   9.0  10.0
                           Primary Energy|Wind  EJ/yr   3.0   5.0
                   reg_a   Primary Energy|Coal  EJ/yr   6.0   6.0
                           Primary Energy|Wind  EJ/yr   2.0   3.0
                   reg_b   Primary Energy|Coal  EJ/yr   3.0   4.0
                           Primary Energy|Wind  EJ/yr   1.0   2.0
Next, we are going to use the aggregate() function to compute the total Primary Energy from its components (wind and coal) in each region (including World). The function returns an IamDataFrame, so we can use timeseries() to display the resulting data.
[6]:
df.aggregate('Primary Energy').timeseries()
[6]:
                                                 2005  2010
model    scenario  region  variable        unit
model_a  scen_a    World   Primary Energy  EJ/yr  12.0  15.0
                   reg_a   Primary Energy  EJ/yr   8.0   9.0
                   reg_b   Primary Energy  EJ/yr   4.0   6.0
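These aggregates are easy to verify by hand; a quick sanity check in plain Python, using the component values copied from the table above:

```python
# Primary Energy components (2005, 2010) as shown in the table above
components = {
    "World": {"Coal": [9.0, 10.0], "Wind": [3.0, 5.0]},
    "reg_a": {"Coal": [6.0, 6.0], "Wind": [2.0, 3.0]},
    "reg_b": {"Coal": [3.0, 4.0], "Wind": [1.0, 2.0]},
}
# Sum coal and wind per region and year
totals = {
    region: [sum(values) for values in zip(*fuels.values())]
    for region, fuels in components.items()
}
print(totals)  # {'World': [12.0, 15.0], 'reg_a': [8.0, 9.0], 'reg_b': [4.0, 6.0]}
```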
If we are interested in use case 1, we could use the argument append=True to directly add the computed aggregate to the IamDataFrame instance.
However, in this tutorial, the data already includes the total primary energy demand. Therefore, we illustrate use case 2 and apply the check_aggregate() function to verify whether a given variable is the sum of its sectoral components (i.e., Primary Energy should be equal to Primary Energy|Coal plus Primary Energy|Wind). The validation is performed separately for each region.
The function returns None if the validation succeeds (as it does for primary energy demand) or a pandas.DataFrame highlighting where the aggregate does not match (this will be illustrated in the next section).
[7]:
df.check_aggregate('Primary Energy')
The function also returns a useful logging message when there is nothing to check (because there are no sectors below Primary Energy|Wind).
[8]:
df.check_aggregate('Primary Energy|Wind')
pyam.aggregation - INFO: Cannot aggregate variable 'Primary Energy|Wind' because it has no components!
2. Aggregating timeseries across subregions¶
Similarly to the previous example, we now use the aggregate_region() function to compute regional aggregates. By default, this method sums all the regions in the dataframe to make a World region; this can be changed with the keyword arguments region and subregions.
[9]:
df.aggregate_region('Primary Energy').timeseries()
[9]:
                                                 2005  2010
model    scenario  region  variable        unit
model_a  scen_a    World   Primary Energy  EJ/yr  12.0  15.0
Adding regional components¶
As a next step, we use check_aggregate_region() to verify that the regional aggregate of CO2 emissions matches the timeseries data given in the scenario.
[10]:
df.check_aggregate_region('Emissions|CO2')
pyam.core - INFO: `Emissions|CO2` - 2 of 2 rows are not aggregates of subregions
[10]:
                                                         region  subregions
model    scenario  region  variable       unit    year
model_a  scen_a    World   Emissions|CO2  Mt CO2  2005     10.0         9.0
                                                  2010     14.0        12.0
As announced above, this validation failed: we see a dataframe of the expected data at the region level and the aggregation computed from the subregions.
Let’s look at the entire emissions timeseries in the scenario to find out what is going on.
[11]:
df.filter(variable='Emissions*').timeseries()
[11]:
                                                          2005  2010
model    scenario  region  variable               unit
model_a  scen_a    World   Emissions|CO2          Mt CO2  10.0  14.0
                           Emissions|CO2|AFOLU    Mt CO2   3.0   4.0
                           Emissions|CO2|Bunkers  Mt CO2   1.0   2.0
                           Emissions|CO2|Energy   Mt CO2   6.0   8.0
                   reg_a   Emissions|CO2          Mt CO2   6.0   8.0
                           Emissions|CO2|AFOLU    Mt CO2   2.0   3.0
                           Emissions|CO2|Energy   Mt CO2   4.0   5.0
                   reg_b   Emissions|CO2          Mt CO2   3.0   4.0
                           Emissions|CO2|AFOLU    Mt CO2   1.0   1.0
                           Emissions|CO2|Energy   Mt CO2   2.0   3.0
Investigating the data carefully, you will notice that emissions from the energy sector and agriculture, forestry & land use (AFOLU) are given in the subregions and the World region, whereas emissions from bunker fuels are only defined at the global level. This is a common issue in emissions data, where some sources (e.g., global aviation and maritime transport) cannot be attributed to any single region.
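The mismatch reported above can be confirmed by hand: the gap between the reported World total and the sum of the subregions is exactly the bunker emissions (values copied from the emissions table):

```python
# Emissions|CO2 values from the tutorial data (Mt CO2)
world_total = {2005: 10.0, 2010: 14.0}               # World
subregion_sum = {2005: 6.0 + 3.0, 2010: 8.0 + 4.0}   # reg_a + reg_b
bunkers = {2005: 1.0, 2010: 2.0}                     # only defined at World level

# The difference equals the bunker emissions in both years
gaps = {year: world_total[year] - subregion_sum[year] for year in world_total}
print(gaps)  # {2005: 1.0, 2010: 2.0}
```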
Luckily, the functions aggregate_region() and check_aggregate_region() support this use case: by adding components=True, the regional aggregation will include any sub-categories of the variable that are only present at the region level but not in any subregion.
[12]:
df.aggregate_region('Emissions|CO2', components=True).timeseries()
[12]:
                                                  2005  2010
model    scenario  region  variable       unit
model_a  scen_a    World   Emissions|CO2  Mt CO2  10.0  14.0
The regional aggregate now matches the data given at the World level in the tutorial data.
Note that the components to be included at the region level can also be specified directly as a list of variables; in this case, we would use components=['Emissions|CO2|Bunkers'].
Computing a weighted average across regions¶
One other frequent requirement when aggregating across regions is a weighted average.
To illustrate this feature, the tutorial data includes carbon price data. Naturally, the appropriate weighting data are the regional carbon emissions.
The following cells show:
The carbon price data across the regions
A (failing) validation that the regional aggregation (without weights) matches the reported prices at the World level
The emissions-weighted average of carbon prices returned as a new IamDataFrame
[13]:
df.filter(variable='Price|Carbon').timeseries()
[13]:
                                                     2005  2010
model    scenario  region  variable      unit
model_a  scen_a    World   Price|Carbon  USD/t CO2   4.0  27.0
                   reg_a   Price|Carbon  USD/t CO2   1.0  30.0
                   reg_b   Price|Carbon  USD/t CO2  10.0  21.0
[14]:
df.check_aggregate_region('Price|Carbon')
pyam.core - INFO: `Price|Carbon` - 2 of 2 rows are not aggregates of subregions
[14]:
                                                            region  subregions
model    scenario  region  variable      unit       year
model_a  scen_a    World   Price|Carbon  USD/t CO2  2005       4.0        11.0
                                                    2010      27.0        51.0
[15]:
df.aggregate_region('Price|Carbon', weight='Emissions|CO2').timeseries()
[15]:
                                                     2005  2010
model    scenario  region  variable      unit
model_a  scen_a    World   Price|Carbon  USD/t CO2   4.0  27.0
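To see where these values come from, we can reproduce the emissions-weighted average by hand in plain Python, using the regional carbon prices and CO2 emissions copied from the tutorial data:

```python
# Regional values (2005, 2010) from the tutorial data
prices = {"reg_a": (1.0, 30.0), "reg_b": (10.0, 21.0)}   # Price|Carbon, USD/t CO2
emissions = {"reg_a": (6.0, 8.0), "reg_b": (3.0, 4.0)}   # Emissions|CO2, Mt CO2

# Weighted average: sum(price * weight) / sum(weight), per year
weighted_avg = []
for i in range(2):
    total = sum(e[i] for e in emissions.values())
    weighted_avg.append(sum(prices[r][i] * emissions[r][i] for r in prices) / total)
print(weighted_avg)  # [4.0, 27.0] -- matching the World-level Price|Carbon data
```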
3. Downscaling timeseries data to subregions using a proxy¶
The inverse operation of regional aggregation is “downscaling” of timeseries data given at a regional level to a number of subregions, usually using some other data as a proxy to divide and allocate the total to the subregions.
This section shows an example using the downscale_region() function to divide the total primary energy demand using population as a proxy.
[16]:
df.filter(variable='Population').timeseries()
[16]:
                                               2005  2010
model    scenario  region  variable    unit
model_a  scen_a    World   Population  million   3.0   5.0
                   reg_a   Population  million   1.5   2.5
                   reg_b   Population  million   1.5   2.5
[17]:
df.downscale_region('Primary Energy', proxy='Population').timeseries()
[17]:
                                                 2005  2010
model    scenario  region  variable        unit
model_a  scen_a    reg_a   Primary Energy  EJ/yr   6.0   7.5
                   reg_b   Primary Energy  EJ/yr   6.0   7.5
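The downscaled values can be checked by hand: each subregion receives a share of the World total proportional to its population (values copied from the tutorial data):

```python
# World Primary Energy and regional population from the tutorial data
world_pe = {2005: 12.0, 2010: 15.0}   # EJ/yr
population = {
    "reg_a": {2005: 1.5, 2010: 2.5},  # million
    "reg_b": {2005: 1.5, 2010: 2.5},
}

# Each subregion gets: total * (own population / world population)
downscaled = {
    region: {
        year: world_pe[year] * pop[year] / sum(p[year] for p in population.values())
        for year in world_pe
    }
    for region, pop in population.items()
}
print(downscaled)  # reg_a and reg_b each get 6.0 in 2005 and 7.5 in 2010
```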
By the way, the functions aggregate(), aggregate_region() and downscale_region() also accept a list of variables as the variable argument. See the next cell for an example.
[18]:
var_list = ['Primary Energy', 'Primary Energy|Coal']
df.downscale_region(var_list, proxy='Population').timeseries()
[18]:
                                                      2005  2010
model    scenario  region  variable             unit
model_a  scen_a    reg_a   Primary Energy       EJ/yr   6.0   7.5
                           Primary Energy|Coal  EJ/yr   4.5   5.0
                   reg_b   Primary Energy       EJ/yr   6.0   7.5
                           Primary Energy|Coal  EJ/yr   4.5   5.0
4. Downscaling timeseries data to subregions using a weighting dataframe¶
In cases where using existing data directly as a proxy (as illustrated in the previous section) is not practical, a user can also create a weighting dataframe and pass it directly to the downscale_region() function.
The example below uses the weighting factors implied by the population variable for easy comparison to the previous section.
[19]:
weight = pd.DataFrame(
[[0.66, 0.6], [0.33, 0.4]],
index=pd.Series(['reg_a', 'reg_b'], name='region'),
columns=pd.Series([2005, 2010], name='year')
)
weight
[19]:
year    2005  2010
region
reg_a   0.66   0.6
reg_b   0.33   0.4
[20]:
df.downscale_region(var_list, weight=weight).timeseries()
[20]:
                                                      2005  2010
model    scenario  region  variable             unit
model_a  scen_a    reg_a   Primary Energy       EJ/yr   8.0   9.0
                           Primary Energy|Coal  EJ/yr   6.0   6.0
                   reg_b   Primary Energy       EJ/yr   4.0   6.0
                           Primary Energy|Coal  EJ/yr   3.0   4.0
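Note that the weights for 2005 sum to 0.99, not 1, yet reg_a still receives exactly two thirds of the total. As the results above suggest, the weights appear to be normalized by their sum per year; a quick hand-check of that interpretation (values copied from the cells above):

```python
# Weighting dataframe values and World Primary Energy from the tutorial data
weights = {2005: {"reg_a": 0.66, "reg_b": 0.33}, 2010: {"reg_a": 0.6, "reg_b": 0.4}}
world_pe = {2005: 12.0, 2010: 15.0}

# Normalize the weights by their sum per year before allocating the total
downscaled = {}
for year in world_pe:
    norm = sum(weights[year].values())
    downscaled[year] = {
        r: round(world_pe[year] * w / norm, 6) for r, w in weights[year].items()
    }
print(downscaled)  # matches the downscaled timeseries shown above
```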
5. Checking the internal consistency of a scenario (ensemble)¶
The previous sections illustrated two functions to validate specific variables across their sectors (sub-categories) or regional disaggregation. These two functions are combined in the check_internal_consistency() feature.
If we have an internally consistent scenario ensemble (or single scenario), the function will return None
; otherwise, it will return a concatenation of pandas.DataFrames indicating all detected inconsistencies.
For this section, we use a tutorial scenario which is constructed to highlight the individual validation features below. The scenario has two inconsistencies:
In year 2010 and regions region_b & World, the values of coal and wind do not add up to the total Primary Energy value
In year 2020 in the World region, the values of Primary Energy and Primary Energy|Coal are not the sum of region_a and region_b (but the sum of wind and coal to Primary Energy in each sub-region is correct)
[21]:
tutorial_df = IamDataFrame(pd.DataFrame([
['World', 'Primary Energy', 'EJ/yr', 7, 15],
['World', 'Primary Energy|Coal', 'EJ/yr', 4, 11],
['World', 'Primary Energy|Wind', 'EJ/yr', 2, 4],
['region_a', 'Primary Energy', 'EJ/yr', 4, 8],
['region_a', 'Primary Energy|Coal', 'EJ/yr', 2, 6],
['region_a', 'Primary Energy|Wind', 'EJ/yr', 2, 2],
['region_b', 'Primary Energy', 'EJ/yr', 3, 6],
['region_b', 'Primary Energy|Coal', 'EJ/yr', 2, 4],
['region_b', 'Primary Energy|Wind', 'EJ/yr', 0, 2],
],
columns=['region', 'variable', 'unit', 2010, 2020]
), model='model_a', scenario='scen_a')
All checking functions take arguments for np.isclose() as keyword arguments. We show our recommended settings and how to use them here.
[22]:
np_isclose_args = {
'equal_nan': True,
'rtol': 1e-03,
'atol': 1e-05,
}
[23]:
tutorial_df.check_internal_consistency(**np_isclose_args)
pyam.core - INFO: `Primary Energy` - 2 of 6 rows are not aggregates of components
pyam.core - INFO: `Primary Energy` - 1 of 2 rows are not aggregates of subregions
pyam.aggregation - INFO: Cannot aggregate variable 'Primary Energy|Coal' because it has no components!
pyam.core - INFO: `Primary Energy|Coal` - 1 of 2 rows are not aggregates of subregions
pyam.aggregation - INFO: Cannot aggregate variable 'Primary Energy|Wind' because it has no components!
[23]:
                                                               variable  components  region  subregions
model    scenario  region    variable             unit   year
model_a  scen_a    World     Primary Energy       EJ/yr  2010       7.0         6.0     NaN         NaN
                                                         2020       NaN         NaN    15.0        14.0
                             Primary Energy|Coal  EJ/yr  2020       NaN         NaN    11.0        10.0
                   region_b  Primary Energy       EJ/yr  2010       3.0         2.0     NaN         NaN
The output of this function reports both types of illustrative inconsistencies in the scenario constructed for this section. The log also shows that the other two variables (coal and wind) cannot be assessed along the sectoral dimension because they have no subcategories.
In practice, it would now be up to the user to determine the cause of the inconsistency (or confirm that this is expected for some reason).
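For the regional inconsistency, one way to pin it down is to recompute the World totals from the sub-regions by hand (values copied from the tutorial scenario defined above):

```python
# Sub-regional values for year 2020 from the tutorial scenario (EJ/yr)
subregions = {
    "Primary Energy": {"region_a": 8.0, "region_b": 6.0},
    "Primary Energy|Coal": {"region_a": 6.0, "region_b": 4.0},
}
world = {"Primary Energy": 15.0, "Primary Energy|Coal": 11.0}  # reported at World level

# Compare the reported World value with the sum over sub-regions
aggregated = {variable: sum(values.values()) for variable, values in subregions.items()}
for variable in world:
    print(variable, "reported:", world[variable], "aggregated:", aggregated[variable])
# Primary Energy reported: 15.0 aggregated: 14.0
# Primary Energy|Coal reported: 11.0 aggregated: 10.0
```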