Importing various data table formats¶

The default input/output (io) format of the pyam package is the tabular, wide data format, where the columns represent the time domain. This follows the standard established by the Integrated Assessment Modeling Consortium (IAMC); read the docs for more information.

Alas, to make using the package as easy as possible, iniatilizing an IamDataFrame will accept a variety of different table formats and allows specifying missing columns via keyword arguments. This way, a user can import their data into pyam as easily as possible without needing to worry (more than necessary) about manipulating the original, raw timeseries data.

This tutorial illustrates the broad range of possible formats to facilitate choosing the one that works for your data!

Overview¶

The first section shows the standard pandas.DataFrame (a.k.a. table) used in the pyam test suite. The following sections shows modifications of this dataframe and the necessary (if any) additional specifications to initialize an IamDataFrame. The last cell of each section uses the equals() function to assert that the IamDataFrame in that section is identical to the object in the first section.

It does not matter whether an IamDataFrame is initialized from a pandas.DataFrame or a path to a xslsx/csv file with the data in the given format. For simplicity, this tutorial only uses dataframes.

The default columns (a.k.a. index or dimensions) of the IAMC data format are

pyam.IAMC_IDX = ['model', 'scenario', 'region', 'variable', 'unit']. The last section of this tutorial illustrates the use of additional, custom columns.

0. Initialize timeseries data from standard IAMC-format table¶

The first cell creates a “simple dataframe” in the standard wide IAMC format.

It then casts that dataframe to an IamDataFrame and uses the timeseries() function to again show the data in the standard format.

[1]:

import pandas as pd

from pyam import IAMC_IDX

SIMPLE_DF = pd.DataFrame(
    [
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/y", 1, 6.0],
        ["model_a", "scen_a", "World", "Primary Energy|Coal", "EJ/y", 0.5, 3],
        ["model_a", "scen_b", "World", "Primary Energy", "EJ/y", 2, 7],
    ],
    columns=IAMC_IDX + [2005, 2010],
)

SIMPLE_DF

[1]:

	model	scenario	region	variable	unit	2005	2010
0	model_a	scen_a	World	Primary Energy	EJ/y	1.0	6.0
1	model_a	scen_a	World	Primary Energy\|Coal	EJ/y	0.5	3.0
2	model_a	scen_b	World	Primary Energy	EJ/y	2.0	7.0

[2]:

from pyam import IamDataFrame

df_simple = IamDataFrame(SIMPLE_DF)

[3]:

df_simple.timeseries()

[3]:

					2005	2010
model	scenario	region	variable	unit
model_a	scen_a	World	Primary Energy	EJ/y	1.0	6.0
	scen_a	World	Primary Energy\|Coal	EJ/y	0.5	3.0
	scen_b	World	Primary Energy	EJ/y	2.0	7.0

1. Adding missing columns¶

The IAMC data format expects the columns model, scenario, region, variable and unit. If the input dataframe does not have one or several of these columns, the value for that column can be given as a keyword argument of the type col=value.

[4]:

MISSING_COLS_DF = pd.DataFrame(
    [
        ["scen_a", "World", "Primary Energy", "EJ/y", 1, 6.0],
        ["scen_a", "World", "Primary Energy|Coal", "EJ/y", 0.5, 3],
        ["scen_b", "World", "Primary Energy", "EJ/y", 2, 7],
    ],
    columns=["scenario", "region", "variable", "unit", 2005, 2010],
)

MISSING_COLS_DF

[4]:

	scenario	region	variable	unit	2005	2010
0	scen_a	World	Primary Energy	EJ/y	1.0	6.0
1	scen_a	World	Primary Energy\|Coal	EJ/y	0.5	3.0
2	scen_b	World	Primary Energy	EJ/y	2.0	7.0

[5]:

df_missing_cols = IamDataFrame(MISSING_COLS_DF, model="model_a")

[6]:

df_missing_cols.equals(df_simple)

[6]:

True

2. Using a long data format¶

The next illustration is a dataframe where the years and values are given in a long format, i.e., in two columns named year and value. This is the format internally used by pyam.

[7]:

LONG_DF = pd.DataFrame(
    [
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/y", 2005, 1.0],
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/y", 2010, 6],
        ["model_a", "scen_a", "World", "Primary Energy|Coal", "EJ/y", 2005, 0.5],
        ["model_a", "scen_a", "World", "Primary Energy|Coal", "EJ/y", 2010, 3],
        ["model_a", "scen_b", "World", "Primary Energy", "EJ/y", 2005, 2],
        ["model_a", "scen_b", "World", "Primary Energy", "EJ/y", 2010, 7],
    ],
    columns=IAMC_IDX + ["year", "value"],
)

LONG_DF

[7]:

	model	scenario	region	variable	unit	year	value
0	model_a	scen_a	World	Primary Energy	EJ/y	2005	1.0
1	model_a	scen_a	World	Primary Energy	EJ/y	2010	6.0
2	model_a	scen_a	World	Primary Energy\|Coal	EJ/y	2005	0.5
3	model_a	scen_a	World	Primary Energy\|Coal	EJ/y	2010	3.0
4	model_a	scen_b	World	Primary Energy	EJ/y	2005	2.0
5	model_a	scen_b	World	Primary Energy	EJ/y	2010	7.0

[8]:

df_long = IamDataFrame(LONG_DF)

[9]:

df_long.equals(df_simple)

[9]:

True

3. Using column headers as variable names¶

The next example shows a table where the values are given as columns and the column header specifies the variable name. In this case, a user needs to specify the columns that should be interpreted as values using the keyword argument value=<value_cols>.

[10]:

VALUE_COLS_DF = pd.DataFrame(
    [
        ["model_a", "scen_a", "World", "EJ/y", 2005, 1, 0.5],
        ["model_a", "scen_a", "World", "EJ/y", 2010, 6.0, 3],
        ["model_a", "scen_b", "World", "EJ/y", 2005, 2, None],
        ["model_a", "scen_b", "World", "EJ/y", 2010, 7, None],
    ],
    columns=[
        "model",
        "scenario",
        "region",
        "unit",
        "year",
        "Primary Energy",
        "Primary Energy|Coal",
    ],
)

VALUE_COLS_DF

[10]:

	model	scenario	region	unit	year	Primary Energy	Primary Energy\|Coal
0	model_a	scen_a	World	EJ/y	2005	1.0	0.5
1	model_a	scen_a	World	EJ/y	2010	6.0	3.0
2	model_a	scen_b	World	EJ/y	2005	2.0	NaN
3	model_a	scen_b	World	EJ/y	2010	7.0	NaN

[11]:

df_value_cols = IamDataFrame(
    VALUE_COLS_DF, value=["Primary Energy", "Primary Energy|Coal"]
)

[12]:

df_value_cols.equals(df_simple)

[12]:

True

4. Using non-standard column names¶

As stated above, the IAMC data format expects a specific set of column names. If the input dataframe has columns with non-standard headers, the column renaming can be done on the fly by pyam using the keyword argument default_col=input_col.

[13]:

OTHER_HEADER_DF = pd.DataFrame(
    [
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/y", 1, 6.0],
        ["model_a", "scen_a", "World", "Primary Energy|Coal", "EJ/y", 0.5, 3],
        ["model_a", "scen_b", "World", "Primary Energy", "EJ/y", 2, 7],
    ],
    columns=["model", "foo", "region", "variable", "unit", 2005, 2010],
)

OTHER_HEADER_DF

[13]:

	model	foo	region	variable	unit	2005	2010
0	model_a	scen_a	World	Primary Energy	EJ/y	1.0	6.0
1	model_a	scen_a	World	Primary Energy\|Coal	EJ/y	0.5	3.0
2	model_a	scen_b	World	Primary Energy	EJ/y	2.0	7.0

[14]:

df_other_header = IamDataFrame(OTHER_HEADER_DF, scenario="foo")

[15]:

df_other_header.equals(df_simple)

[15]:

True

5. Concatenating multiple columns as variable names¶

In the IAMC data format, the variable implements a semi-hierarchical structure using the | (pipe) character. If the input dataframe has the hierarchy (or dimensions of an index) represented as separate columns, the concatenation can be performed during the initialization using the keyword argument variable=<list_of_cols>.

[16]:

CONCAT_DF = pd.DataFrame(
    [
        ["model_a", "scen_a", "World", "Primary Energy", None, "EJ/y", 1, 6.0],
        ["model_a", "scen_a", "World", "Primary Energy", "Coal", "EJ/y", 0.5, 3],
        ["model_a", "scen_b", "World", "Primary Energy", None, "EJ/y", 2, 7],
    ],
    columns=["model", "scenario", "region", "var_1", "var_2", "unit", 2005, 2010],
)

CONCAT_DF

[16]:

	model	scenario	region	var_1	var_2	unit	2005	2010
0	model_a	scen_a	World	Primary Energy	None	EJ/y	1.0	6.0
1	model_a	scen_a	World	Primary Energy	Coal	EJ/y	0.5	3.0
2	model_a	scen_b	World	Primary Energy	None	EJ/y	2.0	7.0

[17]:

df_concat = IamDataFrame(CONCAT_DF, variable=["var_1", "var_2"])

[18]:

df_concat.equals(df_simple)

[18]:

True

6. Combining multiple format issues¶

The last example in this tutorial illustrates that the features above can be used in combination. The input dataframe has the following issues:

missing model column
non-standard name of the region column
values in columns with variable as name

Also, notice that the value-columns do not have the same headers as the variables in the dataframe in Section 0. Therefore, we use the rename() function to change the variables after initialization to the expected names.

[19]:

COMPLICATED_DF = pd.DataFrame(
    [
        ["scen_a", "World", "EJ/y", 2005, 1, 0.5],
        ["scen_a", "World", "EJ/y", 2010, 6.0, 3],
        ["scen_b", "World", "EJ/y", 2005, 2, None],
        ["scen_b", "World", "EJ/y", 2010, 7, None],
    ],
    columns=["scenario", "iso", "unit", "year", "primary", "coal"],
)

COMPLICATED_DF

[19]:

	scenario	iso	unit	year	primary	coal
0	scen_a	World	EJ/y	2005	1.0	0.5
1	scen_a	World	EJ/y	2010	6.0	3.0
2	scen_b	World	EJ/y	2005	2.0	NaN
3	scen_b	World	EJ/y	2010	7.0	NaN

[20]:

df_complicated = IamDataFrame(
    COMPLICATED_DF, model="model_a", region="iso", value=["primary", "coal"]
).rename(variable={"primary": "Primary Energy", "coal": "Primary Energy|Coal"})

[21]:

df_complicated.equals(df_simple)

[21]:

True

Postscriptum: custom data columns in an IamDataFrame¶

This final section illustrates the behaviour of pyam when working with non-standard columns.

The custom data column feature is currently only on experimental support. Not all pyam functions currently support custom columns in a ‘data’ table. If you encounter any problems, please remove any non-standard columns from the input dataframe (or contribute to fix the problem!).

[22]:

CUSTOM_COL_DF = pd.DataFrame(
    [
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/y", 2.1, 1, 6.0],
        ["model_a", "scen_a", "World", "Primary Energy|Coal", "EJ/y", 2.1, 0.5, 3],
        ["model_a", "scen_b", "World", "Primary Energy", "EJ/y", 2.1, 2, 7],
    ],
    columns=IAMC_IDX + ["version", 2005, 2010],
)

CUSTOM_COL_DF

[22]:

	model	scenario	region	variable	unit	version	2005	2010
0	model_a	scen_a	World	Primary Energy	EJ/y	2.1	1.0	6.0
1	model_a	scen_a	World	Primary Energy\|Coal	EJ/y	2.1	0.5	3.0
2	model_a	scen_b	World	Primary Energy	EJ/y	2.1	2.0	7.0

[23]:

df_custom_col = IamDataFrame(CUSTOM_COL_DF)

[24]:

df_custom_col.timeseries()

[24]:

						2005	2010
model	scenario	region	variable	unit	version
model_a	scen_a	World	Primary Energy	EJ/y	2.1	1.0	6.0
	scen_a	World	Primary Energy\|Coal	EJ/y	2.1	0.5	3.0
	scen_b	World	Primary Energy	EJ/y	2.1	2.0	7.0