Importing various data table formats¶
The default input/output (io) format of the pyam package is the tabular, wide data format, where the columns represent the time domain. This follows the standard established by the Integrated Assessment Modeling Consortium (IAMC); read the docs for more information.
Alas, to make using the package as easy as possible, iniatilizing an IamDataFrame will accept a variety of different table formats and allows specifying missing columns via keyword arguments. This way, a user can import their data into pyam as easily as possible without needing to worry (more than necessary) about manipulating the original, raw timeseries data.
This tutorial illustrates the broad range of possible formats to facilitate choosing the one that works for your data!
Overview¶
The first section shows the standard pandas.DataFrame (a.k.a. table) used in the pyam test suite. The following sections shows modifications of this dataframe and the necessary (if any) additional specifications to initialize an IamDataFrame. The last cell of each section uses the equals() function to assert that the IamDataFrame in that section is identical to the object in the first section.
It does not matter whether an IamDataFrame is initialized from a pandas.DataFrame or a path to a xslsx
/csv
file with the data in the given format. For simplicity, this tutorial only uses dataframes.
pyam.IAMC_IDX = ['model', 'scenario', 'region', 'variable', 'unit']
. The last section of this tutorial illustrates the use of additional, custom columns.0. Initialize timeseries data from standard IAMC-format table¶
The first cell creates a “simple dataframe” in the standard wide IAMC format.
It then casts that dataframe to an IamDataFrame and uses the timeseries() function to again show the data in the standard format.
[1]:
import pandas as pd
from pyam import IAMC_IDX
SIMPLE_DF = pd.DataFrame(
[
["model_a", "scen_a", "World", "Primary Energy", "EJ/y", 1, 6.0],
["model_a", "scen_a", "World", "Primary Energy|Coal", "EJ/y", 0.5, 3],
["model_a", "scen_b", "World", "Primary Energy", "EJ/y", 2, 7],
],
columns=IAMC_IDX + [2005, 2010],
)
SIMPLE_DF
[1]:
model | scenario | region | variable | unit | 2005 | 2010 | |
---|---|---|---|---|---|---|---|
0 | model_a | scen_a | World | Primary Energy | EJ/y | 1.0 | 6.0 |
1 | model_a | scen_a | World | Primary Energy|Coal | EJ/y | 0.5 | 3.0 |
2 | model_a | scen_b | World | Primary Energy | EJ/y | 2.0 | 7.0 |
[2]:
from pyam import IamDataFrame
df_simple = IamDataFrame(SIMPLE_DF)
/home/docs/checkouts/readthedocs.org/user_builds/pyam-iamc/checkouts/latest/pyam/utils.py:318: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
.stack(dropna=True)
[3]:
df_simple.timeseries()
[3]:
2005 | 2010 | |||||
---|---|---|---|---|---|---|
model | scenario | region | variable | unit | ||
model_a | scen_a | World | Primary Energy | EJ/y | 1.0 | 6.0 |
Primary Energy|Coal | EJ/y | 0.5 | 3.0 | |||
scen_b | World | Primary Energy | EJ/y | 2.0 | 7.0 |
1. Adding missing columns¶
The IAMC data format expects the columns model
, scenario
, region
, variable
and unit
. If the input dataframe does not have one or several of these columns, the value for that column can be given as a keyword argument of the type col=value
.
[4]:
MISSING_COLS_DF = pd.DataFrame(
[
["scen_a", "World", "Primary Energy", "EJ/y", 1, 6.0],
["scen_a", "World", "Primary Energy|Coal", "EJ/y", 0.5, 3],
["scen_b", "World", "Primary Energy", "EJ/y", 2, 7],
],
columns=["scenario", "region", "variable", "unit", 2005, 2010],
)
MISSING_COLS_DF
[4]:
scenario | region | variable | unit | 2005 | 2010 | |
---|---|---|---|---|---|---|
0 | scen_a | World | Primary Energy | EJ/y | 1.0 | 6.0 |
1 | scen_a | World | Primary Energy|Coal | EJ/y | 0.5 | 3.0 |
2 | scen_b | World | Primary Energy | EJ/y | 2.0 | 7.0 |
[5]:
df_missing_cols = IamDataFrame(MISSING_COLS_DF, model="model_a")
/home/docs/checkouts/readthedocs.org/user_builds/pyam-iamc/checkouts/latest/pyam/utils.py:318: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
.stack(dropna=True)
[6]:
df_missing_cols.equals(df_simple)
[6]:
True
2. Using a long data format¶
The next illustration is a dataframe where the years and values are given in a long format, i.e., in two columns named year
and value
. This is the format internally used by pyam.
[7]:
LONG_DF = pd.DataFrame(
[
["model_a", "scen_a", "World", "Primary Energy", "EJ/y", 2005, 1.0],
["model_a", "scen_a", "World", "Primary Energy", "EJ/y", 2010, 6],
["model_a", "scen_a", "World", "Primary Energy|Coal", "EJ/y", 2005, 0.5],
["model_a", "scen_a", "World", "Primary Energy|Coal", "EJ/y", 2010, 3],
["model_a", "scen_b", "World", "Primary Energy", "EJ/y", 2005, 2],
["model_a", "scen_b", "World", "Primary Energy", "EJ/y", 2010, 7],
],
columns=IAMC_IDX + ["year", "value"],
)
LONG_DF
[7]:
model | scenario | region | variable | unit | year | value | |
---|---|---|---|---|---|---|---|
0 | model_a | scen_a | World | Primary Energy | EJ/y | 2005 | 1.0 |
1 | model_a | scen_a | World | Primary Energy | EJ/y | 2010 | 6.0 |
2 | model_a | scen_a | World | Primary Energy|Coal | EJ/y | 2005 | 0.5 |
3 | model_a | scen_a | World | Primary Energy|Coal | EJ/y | 2010 | 3.0 |
4 | model_a | scen_b | World | Primary Energy | EJ/y | 2005 | 2.0 |
5 | model_a | scen_b | World | Primary Energy | EJ/y | 2010 | 7.0 |
[8]:
df_long = IamDataFrame(LONG_DF)
[9]:
df_long.equals(df_simple)
[9]:
True
3. Using column headers as variable names¶
The next example shows a table where the values are given as columns and the column header specifies the variable name. In this case, a user needs to specify the columns that should be interpreted as values using the keyword argument value=<value_cols>
.
[10]:
VALUE_COLS_DF = pd.DataFrame(
[
["model_a", "scen_a", "World", "EJ/y", 2005, 1, 0.5],
["model_a", "scen_a", "World", "EJ/y", 2010, 6.0, 3],
["model_a", "scen_b", "World", "EJ/y", 2005, 2, None],
["model_a", "scen_b", "World", "EJ/y", 2010, 7, None],
],
columns=[
"model",
"scenario",
"region",
"unit",
"year",
"Primary Energy",
"Primary Energy|Coal",
],
)
VALUE_COLS_DF
[10]:
model | scenario | region | unit | year | Primary Energy | Primary Energy|Coal | |
---|---|---|---|---|---|---|---|
0 | model_a | scen_a | World | EJ/y | 2005 | 1.0 | 0.5 |
1 | model_a | scen_a | World | EJ/y | 2010 | 6.0 | 3.0 |
2 | model_a | scen_b | World | EJ/y | 2005 | 2.0 | NaN |
3 | model_a | scen_b | World | EJ/y | 2010 | 7.0 | NaN |
[11]:
df_value_cols = IamDataFrame(
VALUE_COLS_DF, value=["Primary Energy", "Primary Energy|Coal"]
)
[12]:
df_value_cols.equals(df_simple)
[12]:
True
4. Using non-standard column names¶
As stated above, the IAMC data format expects a specific set of column names. If the input dataframe has columns with non-standard headers, the column renaming can be done on the fly by pyam using the keyword argument default_col=input_col
.
[13]:
OTHER_HEADER_DF = pd.DataFrame(
[
["model_a", "scen_a", "World", "Primary Energy", "EJ/y", 1, 6.0],
["model_a", "scen_a", "World", "Primary Energy|Coal", "EJ/y", 0.5, 3],
["model_a", "scen_b", "World", "Primary Energy", "EJ/y", 2, 7],
],
columns=["model", "foo", "region", "variable", "unit", 2005, 2010],
)
OTHER_HEADER_DF
[13]:
model | foo | region | variable | unit | 2005 | 2010 | |
---|---|---|---|---|---|---|---|
0 | model_a | scen_a | World | Primary Energy | EJ/y | 1.0 | 6.0 |
1 | model_a | scen_a | World | Primary Energy|Coal | EJ/y | 0.5 | 3.0 |
2 | model_a | scen_b | World | Primary Energy | EJ/y | 2.0 | 7.0 |
[14]:
df_other_header = IamDataFrame(OTHER_HEADER_DF, scenario="foo")
/home/docs/checkouts/readthedocs.org/user_builds/pyam-iamc/checkouts/latest/pyam/utils.py:318: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
.stack(dropna=True)
[15]:
df_other_header.equals(df_simple)
[15]:
True
5. Concatenating multiple columns as variable names¶
In the IAMC data format, the variable
implements a semi-hierarchical structure using the |
(pipe) character. If the input dataframe has the hierarchy (or dimensions of an index) represented as separate columns, the concatenation can be performed during the initialization using the keyword argument variable=<list_of_cols>
.
[16]:
CONCAT_DF = pd.DataFrame(
[
["model_a", "scen_a", "World", "Primary Energy", None, "EJ/y", 1, 6.0],
["model_a", "scen_a", "World", "Primary Energy", "Coal", "EJ/y", 0.5, 3],
["model_a", "scen_b", "World", "Primary Energy", None, "EJ/y", 2, 7],
],
columns=["model", "scenario", "region", "var_1", "var_2", "unit", 2005, 2010],
)
CONCAT_DF
[16]:
model | scenario | region | var_1 | var_2 | unit | 2005 | 2010 | |
---|---|---|---|---|---|---|---|---|
0 | model_a | scen_a | World | Primary Energy | None | EJ/y | 1.0 | 6.0 |
1 | model_a | scen_a | World | Primary Energy | Coal | EJ/y | 0.5 | 3.0 |
2 | model_a | scen_b | World | Primary Energy | None | EJ/y | 2.0 | 7.0 |
[17]:
df_concat = IamDataFrame(CONCAT_DF, variable=["var_1", "var_2"])
/home/docs/checkouts/readthedocs.org/user_builds/pyam-iamc/checkouts/latest/pyam/utils.py:318: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
.stack(dropna=True)
[18]:
df_concat.equals(df_simple)
[18]:
True
6. Combining multiple format issues¶
The last example in this tutorial illustrates that the features above can be used in combination. The input dataframe has the following issues:
missing
model
columnnon-standard name of the
region
columnvalues in columns with
variable
as name
Also, notice that the value-columns do not have the same headers as the variables in the dataframe in Section 0. Therefore, we use the rename() function to change the variables after initialization to the expected names.
[19]:
COMPLICATED_DF = pd.DataFrame(
[
["scen_a", "World", "EJ/y", 2005, 1, 0.5],
["scen_a", "World", "EJ/y", 2010, 6.0, 3],
["scen_b", "World", "EJ/y", 2005, 2, None],
["scen_b", "World", "EJ/y", 2010, 7, None],
],
columns=["scenario", "iso", "unit", "year", "primary", "coal"],
)
COMPLICATED_DF
[19]:
scenario | iso | unit | year | primary | coal | |
---|---|---|---|---|---|---|
0 | scen_a | World | EJ/y | 2005 | 1.0 | 0.5 |
1 | scen_a | World | EJ/y | 2010 | 6.0 | 3.0 |
2 | scen_b | World | EJ/y | 2005 | 2.0 | NaN |
3 | scen_b | World | EJ/y | 2010 | 7.0 | NaN |
[20]:
df_complicated = IamDataFrame(
COMPLICATED_DF, model="model_a", region="iso", value=["primary", "coal"]
).rename(variable={"primary": "Primary Energy", "coal": "Primary Energy|Coal"})
[21]:
df_complicated.equals(df_simple)
[21]:
True
Postscriptum: custom data columns in an IamDataFrame¶
This final section illustrates the behaviour of pyam when working with non-standard columns.
The custom data column feature is currently only on experimental support. Not all pyam functions currently support custom columns in a ‘data’ table. If you encounter any problems, please remove any non-standard columns from the input dataframe (or contribute to fix the problem!).
[22]:
CUSTOM_COL_DF = pd.DataFrame(
[
["model_a", "scen_a", "World", "Primary Energy", "EJ/y", 2.1, 1, 6.0],
["model_a", "scen_a", "World", "Primary Energy|Coal", "EJ/y", 2.1, 0.5, 3],
["model_a", "scen_b", "World", "Primary Energy", "EJ/y", 2.1, 2, 7],
],
columns=IAMC_IDX + ["version", 2005, 2010],
)
CUSTOM_COL_DF
[22]:
model | scenario | region | variable | unit | version | 2005 | 2010 | |
---|---|---|---|---|---|---|---|---|
0 | model_a | scen_a | World | Primary Energy | EJ/y | 2.1 | 1.0 | 6.0 |
1 | model_a | scen_a | World | Primary Energy|Coal | EJ/y | 2.1 | 0.5 | 3.0 |
2 | model_a | scen_b | World | Primary Energy | EJ/y | 2.1 | 2.0 | 7.0 |
[23]:
df_custom_col = IamDataFrame(CUSTOM_COL_DF)
/home/docs/checkouts/readthedocs.org/user_builds/pyam-iamc/checkouts/latest/pyam/utils.py:318: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
.stack(dropna=True)
[24]:
df_custom_col.timeseries()
[24]:
2005 | 2010 | ||||||
---|---|---|---|---|---|---|---|
model | scenario | region | variable | unit | version | ||
model_a | scen_a | World | Primary Energy | EJ/y | 2.1 | 1.0 | 6.0 |
Primary Energy|Coal | EJ/y | 2.1 | 0.5 | 3.0 | |||
scen_b | World | Primary Energy | EJ/y | 2.1 | 2.0 | 7.0 |