Importing various data table formats

The default input/output (io) format of the pyam package is the tabular, wide data format, where the columns represent the time domain. This follows the standard established by the Integrated Assessment Modeling Consortium (IAMC); read the docs for more information.

Alas, to make using the package as easy as possible, iniatilizing an IamDataFrame will accept a variety of different table formats and allows specifying missing columns via keyword arguments. This way, a user can import their data into pyam as easily as possible without needing to worry (more than necessary) about manipulating the original, raw timeseries data.

This tutorial illustrates the broad range of possible formats to facilitate choosing the one that works for your data!

Overview

The first section shows the standard pandas.DataFrame (a.k.a. table) used in the pyam test suite. The following sections shows modifications of this dataframe and the necessary (if any) additional specifications to initialize an IamDataFrame. The last cell of each section uses the equals() function to assert that the IamDataFrame in that section is identical to the object in the first section.

It does not matter whether an IamDataFrame is initialized from a pandas.DataFrame or a path to a xslsx/csv file with the data in the given format. For simplicity, this tutorial only uses dataframes.

The default columns (a.k.a. index or dimensions) of the IAMC data format are
pyam.IAMC_IDX = ['model', 'scenario', 'region', 'variable', 'unit'].
The last section of this tutorial illustrates the use of additional, custom columns.
[1]:
import pandas as pd
import pyam
pyam - INFO: Running in a notebook, setting `pyam` logging level to `logging.INFO` and adding stderr handler

0. Initialize timeseries data from standard IAMC-format table

The first cell creates a “simple dataframe” in the standard wide IAMC format.

It then casts that dataframe to an IamDataFrame and uses the timeseries() function to again show the data in the standard format.

[2]:
SIMPLE_DF = pd.DataFrame([
    ['model_a', 'scen_a', 'World', 'Primary Energy', 'EJ/y', 1, 6.],
    ['model_a', 'scen_a', 'World', 'Primary Energy|Coal', 'EJ/y', 0.5, 3],
    ['model_a', 'scen_b', 'World', 'Primary Energy', 'EJ/y', 2, 7],
],
    columns=pyam.IAMC_IDX + [2005, 2010],
)

SIMPLE_DF
[2]:
model scenario region variable unit 2005 2010
0 model_a scen_a World Primary Energy EJ/y 1.0 6.0
1 model_a scen_a World Primary Energy|Coal EJ/y 0.5 3.0
2 model_a scen_b World Primary Energy EJ/y 2.0 7.0
[3]:
df_simple = pyam.IamDataFrame(SIMPLE_DF)
[4]:
df_simple.timeseries()
[4]:
2005 2010
model scenario region variable unit
model_a scen_a World Primary Energy EJ/y 1.0 6.0
Primary Energy|Coal EJ/y 0.5 3.0
scen_b World Primary Energy EJ/y 2.0 7.0

1. Adding missing columns

The IAMC data format expects the columns model, scenario, region, variable and unit. If the input dataframe does not have one or several of these columns, the value for that column can be given as a keyword argument of the type col=value.

[5]:
MISSING_COLS_DF = pd.DataFrame([
    ['scen_a', 'World', 'Primary Energy', 'EJ/y', 1, 6.],
    ['scen_a', 'World', 'Primary Energy|Coal', 'EJ/y', 0.5, 3],
    ['scen_b', 'World', 'Primary Energy', 'EJ/y', 2, 7],
],
    columns=['scenario', 'region', 'variable', 'unit', 2005, 2010],
)

MISSING_COLS_DF
[5]:
scenario region variable unit 2005 2010
0 scen_a World Primary Energy EJ/y 1.0 6.0
1 scen_a World Primary Energy|Coal EJ/y 0.5 3.0
2 scen_b World Primary Energy EJ/y 2.0 7.0
[6]:
df_missing_cols = pyam.IamDataFrame(MISSING_COLS_DF, model='model_a')
[7]:
df_missing_cols.equals(df_simple)
[7]:
True

2. Using a long data format

The next illustration is a dataframe where the years and values are given in a long format, i.e., in two columns named year and value. This is the format internally used by pyam.

[8]:
LONG_DF = pd.DataFrame([
    ['model_a', 'scen_a', 'World', 'Primary Energy', 'EJ/y', 2005, 1.],
    ['model_a', 'scen_a', 'World', 'Primary Energy', 'EJ/y', 2010, 6],
    ['model_a', 'scen_a', 'World', 'Primary Energy|Coal', 'EJ/y', 2005, 0.5],
    ['model_a', 'scen_a', 'World', 'Primary Energy|Coal', 'EJ/y', 2010, 3],
    ['model_a', 'scen_b', 'World', 'Primary Energy', 'EJ/y', 2005, 2],
    ['model_a', 'scen_b', 'World', 'Primary Energy', 'EJ/y', 2010, 7],
],
    columns=pyam.IAMC_IDX + ['year', 'value'],
)

LONG_DF
[8]:
model scenario region variable unit year value
0 model_a scen_a World Primary Energy EJ/y 2005 1.0
1 model_a scen_a World Primary Energy EJ/y 2010 6.0
2 model_a scen_a World Primary Energy|Coal EJ/y 2005 0.5
3 model_a scen_a World Primary Energy|Coal EJ/y 2010 3.0
4 model_a scen_b World Primary Energy EJ/y 2005 2.0
5 model_a scen_b World Primary Energy EJ/y 2010 7.0
[9]:
df_long = pyam.IamDataFrame(LONG_DF)
[10]:
df_long.equals(df_simple)
[10]:
True

3. Using column headers as variable names

The next example shows a table where the values are given as columns and the column header specifies the variable name. In this case, a user needs to specify the columns that should be interpreted as values using the keyword argument value=<value_cols>.

[11]:
VALUE_COLS_DF = pd.DataFrame([
    ['model_a', 'scen_a', 'World', 'EJ/y', 2005, 1, 0.5],
    ['model_a', 'scen_a', 'World', 'EJ/y', 2010, 6., 3],
    ['model_a', 'scen_b', 'World', 'EJ/y', 2005, 2, None],
    ['model_a', 'scen_b', 'World', 'EJ/y', 2010, 7, None]
],
    columns=['model', 'scenario', 'region', 'unit', 'year',
             'Primary Energy', 'Primary Energy|Coal'],
)

VALUE_COLS_DF
[11]:
model scenario region unit year Primary Energy Primary Energy|Coal
0 model_a scen_a World EJ/y 2005 1.0 0.5
1 model_a scen_a World EJ/y 2010 6.0 3.0
2 model_a scen_b World EJ/y 2005 2.0 NaN
3 model_a scen_b World EJ/y 2010 7.0 NaN
[12]:
df_value_cols = pyam.IamDataFrame(VALUE_COLS_DF,
                                  value=['Primary Energy', 'Primary Energy|Coal'])
[13]:
df_value_cols.equals(df_simple)
[13]:
True

4. Using non-standard column names

As stated above, the IAMC data format expects a specific set of column names. If the input dataframe has columns with non-standard headers, the column renaming can be done on the fly by pyam using the keyword argument default_col=input_col.

[14]:
OTHER_HEADER_DF = pd.DataFrame([
    ['model_a', 'scen_a', 'World', 'Primary Energy', 'EJ/y', 1, 6.],
    ['model_a', 'scen_a', 'World', 'Primary Energy|Coal', 'EJ/y', 0.5, 3],
    ['model_a', 'scen_b', 'World', 'Primary Energy', 'EJ/y', 2, 7],
],
    columns=['model', 'foo', 'region', 'variable', 'unit', 2005, 2010],
)

OTHER_HEADER_DF
[14]:
model foo region variable unit 2005 2010
0 model_a scen_a World Primary Energy EJ/y 1.0 6.0
1 model_a scen_a World Primary Energy|Coal EJ/y 0.5 3.0
2 model_a scen_b World Primary Energy EJ/y 2.0 7.0
[15]:
df_other_header = pyam.IamDataFrame(OTHER_HEADER_DF, scenario='foo')
[16]:
df_other_header.equals(df_simple)
[16]:
True

5. Concatenating multiple columns as variable names

In the IAMC data format, the variable implements a semi-hierarchical structure using the | (pipe) character. If the input dataframe has the hierarchy (or dimensions of an index) represented as separate columns, the concatenation can be performed during the initialization using the keyword argument variable=<list_of_cols>.

[17]:
CONCAT_DF = pd.DataFrame([
    ['model_a', 'scen_a', 'World', 'Primary Energy', None, 'EJ/y', 1, 6.],
    ['model_a', 'scen_a', 'World', 'Primary Energy', 'Coal', 'EJ/y', 0.5, 3],
    ['model_a', 'scen_b', 'World', 'Primary Energy', None, 'EJ/y', 2, 7],
],
    columns=['model', 'scenario', 'region', 'var_1', 'var_2', 'unit', 2005, 2010],
)

CONCAT_DF
[17]:
model scenario region var_1 var_2 unit 2005 2010
0 model_a scen_a World Primary Energy None EJ/y 1.0 6.0
1 model_a scen_a World Primary Energy Coal EJ/y 0.5 3.0
2 model_a scen_b World Primary Energy None EJ/y 2.0 7.0
[18]:
df_concat = pyam.IamDataFrame(CONCAT_DF, variable=['var_1', 'var_2'])
[19]:
df_concat.equals(df_simple)
[19]:
True

6. Combining multiple format issues

The last example in this tutorial illustrates that the features above can be used in combination. The input dataframe has the following issues:

  • missing model column

  • non-standard name of the region column

  • values in columns with variable as name

Also, notice that the value-columns do not have the same headers as the variables in the dataframe in Section 0. Therefore, we use the rename() function to change the variables after initialization to the expected names.

[20]:
COMPLICATED_DF = pd.DataFrame([
    ['scen_a', 'World', 'EJ/y', 2005, 1, 0.5],
    ['scen_a', 'World', 'EJ/y', 2010, 6., 3],
    ['scen_b', 'World', 'EJ/y', 2005, 2, None],
    ['scen_b', 'World', 'EJ/y', 2010, 7, None]
],
    columns=['scenario', 'iso', 'unit', 'year', 'primary', 'coal'],
)

COMPLICATED_DF
[20]:
scenario iso unit year primary coal
0 scen_a World EJ/y 2005 1.0 0.5
1 scen_a World EJ/y 2010 6.0 3.0
2 scen_b World EJ/y 2005 2.0 NaN
3 scen_b World EJ/y 2010 7.0 NaN
[21]:
df_complicated = (
    pyam.IamDataFrame(COMPLICATED_DF, model='model_a', region='iso',
                      value=['primary', 'coal'])
    .rename(variable={'primary': 'Primary Energy', 'coal': 'Primary Energy|Coal'})
)
[22]:
df_complicated.equals(df_simple)
[22]:
True

Postscriptum: custom data columns in an IamDataFrame

This final section illustrates the behaviour of pyam when working with non-standard columns.

The custom data column feature is currently only on experimental support. Not all pyam functions currently support custom columns in a ‘data’ table. If you encounter any problems, please remove any non-standard columns from the input dataframe (or contribute to fix the problem!).

[23]:
CUSTOM_COL_DF = pd.DataFrame([
    ['model_a', 'scen_a', 'World', 'Primary Energy', 'EJ/y', 2.1, 1, 6.],
    ['model_a', 'scen_a', 'World', 'Primary Energy|Coal', 'EJ/y', 2.1, 0.5, 3],
    ['model_a', 'scen_b', 'World', 'Primary Energy', 'EJ/y', 2.1, 2, 7],
],
    columns=pyam.IAMC_IDX + ['version', 2005, 2010],
)

CUSTOM_COL_DF
[23]:
model scenario region variable unit version 2005 2010
0 model_a scen_a World Primary Energy EJ/y 2.1 1.0 6.0
1 model_a scen_a World Primary Energy|Coal EJ/y 2.1 0.5 3.0
2 model_a scen_b World Primary Energy EJ/y 2.1 2.0 7.0
[24]:
df_custom_col = pyam.IamDataFrame(CUSTOM_COL_DF)
[25]:
df_custom_col.timeseries()
[25]:
2005 2010
model scenario region variable unit version
model_a scen_a World Primary Energy EJ/y 2.1 1.0 6.0
Primary Energy|Coal EJ/y 2.1 0.5 3.0
scen_b World Primary Energy EJ/y 2.1 2.0 7.0