{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Importing various data table formats\n",
    "\n",
    "The default input/output (io) format of the **pyam** package is the *tabular, wide data format*,\n",
    "where the columns represent the time domain.\n",
    "This follows the standard established by the *Integrated Assessment Modeling Consortium*\n",
    "([IAMC](https://www.iamconsortium.org/));\n",
    "[read the docs](https://pyam-iamc.readthedocs.io/en/stable/data.html) for more information.\n",
    "\n",
    "Alas, to make using the package as easy as possible, iniatilizing an **IamDataFrame** will accept a variety of different table formats\n",
    "and allows specifying missing columns via keyword arguments.\n",
    "This way, a user can import their data into **pyam** as easily as possible without needing to worry (more than necessary) about manipulating the original, raw timeseries data.\n",
    "\n",
    "This tutorial illustrates the broad range of possible formats\n",
    "to facilitate choosing the one that works for *your data*!\n",
    "\n",
    "## Overview\n",
    "\n",
    "The first section shows the standard [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) (a.k.a. table) used in the **pyam** test suite.\n",
    "The following sections shows modifications of this dataframe\n",
    "and the necessary (if any) additional specifications to initialize an **IamDataFrame**.\n",
    "The last cell of each section uses the [equals()](https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.equals) function\n",
    "to assert that the **IamDataFrame** in that section is identical to the object in the first section.\n",
    "\n",
    "It does not matter whether an **IamDataFrame** is initialized from a [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)\n",
    "or a path to a `xslsx`/`csv` file with the data in the given format.\n",
    "For simplicity, this tutorial only uses dataframes.\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "\n",
    "The default columns (a.k.a. index or dimensions) of the IAMC data format are  \n",
    "`pyam.IAMC_IDX = ['model', 'scenario', 'region', 'variable', 'unit']`.<br />\n",
    "The last section of this tutorial illustrates the use of additional, custom columns.\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Initialize timeseries data from standard IAMC-format table\n",
    "\n",
    "The first cell creates a \"simple dataframe\" in the standard *wide* IAMC format.\n",
    "\n",
    "It then casts that dataframe to an **IamDataFrame**\n",
    "and uses the [timeseries()](https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.timeseries) function\n",
    "to again show the data in the standard format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "from pyam import IAMC_IDX\n",
    "\n",
    "SIMPLE_DF = pd.DataFrame(\n",
    "    [\n",
    "        [\"model_a\", \"scen_a\", \"World\", \"Primary Energy\", \"EJ/y\", 1, 6.0],\n",
    "        [\"model_a\", \"scen_a\", \"World\", \"Primary Energy|Coal\", \"EJ/y\", 0.5, 3],\n",
    "        [\"model_a\", \"scen_b\", \"World\", \"Primary Energy\", \"EJ/y\", 2, 7],\n",
    "    ],\n",
    "    columns=IAMC_IDX + [2005, 2010],\n",
    ")\n",
    "\n",
    "SIMPLE_DF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyam import IamDataFrame\n",
    "\n",
    "df_simple = IamDataFrame(SIMPLE_DF)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_simple.timeseries()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Adding missing columns\n",
    "\n",
    "The IAMC data format expects the columns `model`, `scenario`, `region`, `variable` and `unit`.\n",
    "If the input dataframe does not have one or several of these columns, the value for that column\n",
    "can be given as a keyword argument of the type `col=value`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "MISSING_COLS_DF = pd.DataFrame(\n",
    "    [\n",
    "        [\"scen_a\", \"World\", \"Primary Energy\", \"EJ/y\", 1, 6.0],\n",
    "        [\"scen_a\", \"World\", \"Primary Energy|Coal\", \"EJ/y\", 0.5, 3],\n",
    "        [\"scen_b\", \"World\", \"Primary Energy\", \"EJ/y\", 2, 7],\n",
    "    ],\n",
    "    columns=[\"scenario\", \"region\", \"variable\", \"unit\", 2005, 2010],\n",
    ")\n",
    "\n",
    "MISSING_COLS_DF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_missing_cols = IamDataFrame(MISSING_COLS_DF, model=\"model_a\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_missing_cols.equals(df_simple)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Using a *long* data format\n",
    "\n",
    "The next illustration is a dataframe where the years and values are given in a *long format*,\n",
    "i.e., in two columns named `year` and `value`.\n",
    "This is the format internally used by **pyam**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "LONG_DF = pd.DataFrame(\n",
    "    [\n",
    "        [\"model_a\", \"scen_a\", \"World\", \"Primary Energy\", \"EJ/y\", 2005, 1.0],\n",
    "        [\"model_a\", \"scen_a\", \"World\", \"Primary Energy\", \"EJ/y\", 2010, 6],\n",
    "        [\"model_a\", \"scen_a\", \"World\", \"Primary Energy|Coal\", \"EJ/y\", 2005, 0.5],\n",
    "        [\"model_a\", \"scen_a\", \"World\", \"Primary Energy|Coal\", \"EJ/y\", 2010, 3],\n",
    "        [\"model_a\", \"scen_b\", \"World\", \"Primary Energy\", \"EJ/y\", 2005, 2],\n",
    "        [\"model_a\", \"scen_b\", \"World\", \"Primary Energy\", \"EJ/y\", 2010, 7],\n",
    "    ],\n",
    "    columns=IAMC_IDX + [\"year\", \"value\"],\n",
    ")\n",
    "\n",
    "LONG_DF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_long = IamDataFrame(LONG_DF)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_long.equals(df_simple)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Using column headers as variable names\n",
    "\n",
    "The next example shows a table where the values are given as columns\n",
    "and the column header specifies the variable name.\n",
    "In this case, a user needs to specify the columns that should be interpreted as values\n",
    "using the keyword argument `value=<value_cols>`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "VALUE_COLS_DF = pd.DataFrame(\n",
    "    [\n",
    "        [\"model_a\", \"scen_a\", \"World\", \"EJ/y\", 2005, 1, 0.5],\n",
    "        [\"model_a\", \"scen_a\", \"World\", \"EJ/y\", 2010, 6.0, 3],\n",
    "        [\"model_a\", \"scen_b\", \"World\", \"EJ/y\", 2005, 2, None],\n",
    "        [\"model_a\", \"scen_b\", \"World\", \"EJ/y\", 2010, 7, None],\n",
    "    ],\n",
    "    columns=[\n",
    "        \"model\",\n",
    "        \"scenario\",\n",
    "        \"region\",\n",
    "        \"unit\",\n",
    "        \"year\",\n",
    "        \"Primary Energy\",\n",
    "        \"Primary Energy|Coal\",\n",
    "    ],\n",
    ")\n",
    "\n",
    "VALUE_COLS_DF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_value_cols = IamDataFrame(\n",
    "    VALUE_COLS_DF, value=[\"Primary Energy\", \"Primary Energy|Coal\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_value_cols.equals(df_simple)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Using non-standard column names\n",
    "\n",
    "As stated above, the IAMC data format expects a specific set of column names.\n",
    "If the input dataframe has columns with non-standard headers,\n",
    "the *column renaming* can be done on the fly by **pyam** using the keyword argument `default_col=input_col`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "OTHER_HEADER_DF = pd.DataFrame(\n",
    "    [\n",
    "        [\"model_a\", \"scen_a\", \"World\", \"Primary Energy\", \"EJ/y\", 1, 6.0],\n",
    "        [\"model_a\", \"scen_a\", \"World\", \"Primary Energy|Coal\", \"EJ/y\", 0.5, 3],\n",
    "        [\"model_a\", \"scen_b\", \"World\", \"Primary Energy\", \"EJ/y\", 2, 7],\n",
    "    ],\n",
    "    columns=[\"model\", \"foo\", \"region\", \"variable\", \"unit\", 2005, 2010],\n",
    ")\n",
    "\n",
    "OTHER_HEADER_DF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_other_header = IamDataFrame(OTHER_HEADER_DF, scenario=\"foo\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_other_header.equals(df_simple)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Concatenating multiple columns as variable names\n",
    "\n",
    "In the IAMC data format, the `variable` implements a semi-hierarchical structure using the `|` (pipe) character.\n",
    "If the input dataframe has the hierarchy (or dimensions of an index) represented as separate columns,\n",
    "the concatenation can be performed during the initialization\n",
    "using the keyword argument `variable=<list_of_cols>`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CONCAT_DF = pd.DataFrame(\n",
    "    [\n",
    "        [\"model_a\", \"scen_a\", \"World\", \"Primary Energy\", None, \"EJ/y\", 1, 6.0],\n",
    "        [\"model_a\", \"scen_a\", \"World\", \"Primary Energy\", \"Coal\", \"EJ/y\", 0.5, 3],\n",
    "        [\"model_a\", \"scen_b\", \"World\", \"Primary Energy\", None, \"EJ/y\", 2, 7],\n",
    "    ],\n",
    "    columns=[\"model\", \"scenario\", \"region\", \"var_1\", \"var_2\", \"unit\", 2005, 2010],\n",
    ")\n",
    "\n",
    "CONCAT_DF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_concat = IamDataFrame(CONCAT_DF, variable=[\"var_1\", \"var_2\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_concat.equals(df_simple)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Combining multiple format issues\n",
    "\n",
    "The last example in this tutorial illustrates that the features above can be used in combination.\n",
    "The input dataframe has the following issues:\n",
    "\n",
    " - missing `model` column\n",
    " - non-standard name of the `region` column\n",
    " - values in columns with `variable` as name\n",
    "\n",
    "Also, notice that the value-columns do not have the same headers as the variables in the dataframe in **Section 0**.\n",
    "Therefore, we use the [rename()](https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.rename) function\n",
    "to change the variables after initialization to the expected names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "COMPLICATED_DF = pd.DataFrame(\n",
    "    [\n",
    "        [\"scen_a\", \"World\", \"EJ/y\", 2005, 1, 0.5],\n",
    "        [\"scen_a\", \"World\", \"EJ/y\", 2010, 6.0, 3],\n",
    "        [\"scen_b\", \"World\", \"EJ/y\", 2005, 2, None],\n",
    "        [\"scen_b\", \"World\", \"EJ/y\", 2010, 7, None],\n",
    "    ],\n",
    "    columns=[\"scenario\", \"iso\", \"unit\", \"year\", \"primary\", \"coal\"],\n",
    ")\n",
    "\n",
    "COMPLICATED_DF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_complicated = IamDataFrame(\n",
    "    COMPLICATED_DF, model=\"model_a\", region=\"iso\", value=[\"primary\", \"coal\"]\n",
    ").rename(variable={\"primary\": \"Primary Energy\", \"coal\": \"Primary Energy|Coal\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_complicated.equals(df_simple)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Postscriptum: custom data columns in an IamDataFrame\n",
    "\n",
    "This final section illustrates the behaviour of **pyam** when working with non-standard columns.\n",
    "\n",
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "The *custom data column* feature is currently only on experimental support.\n",
    "Not all **pyam** functions currently support custom columns in a 'data' table.\n",
    "If you encounter any problems, please remove any non-standard columns\n",
    "from the input dataframe (or [contribute](https://pyam-iamc.readthedocs.io/en/stable/contributing.html) to fix the problem!).\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CUSTOM_COL_DF = pd.DataFrame(\n",
    "    [\n",
    "        [\"model_a\", \"scen_a\", \"World\", \"Primary Energy\", \"EJ/y\", 2.1, 1, 6.0],\n",
    "        [\"model_a\", \"scen_a\", \"World\", \"Primary Energy|Coal\", \"EJ/y\", 2.1, 0.5, 3],\n",
    "        [\"model_a\", \"scen_b\", \"World\", \"Primary Energy\", \"EJ/y\", 2.1, 2, 7],\n",
    "    ],\n",
    "    columns=IAMC_IDX + [\"version\", 2005, 2010],\n",
    ")\n",
    "\n",
    "CUSTOM_COL_DF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_custom_col = IamDataFrame(CUSTOM_COL_DF)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_custom_col.timeseries()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}