Workspace preparation for FinalFit interface
============================================

Standard Procedure
------------------

The standard way to convert HiggsDNA ntuples into FinalFit-friendly output is to use the ``prepare_output_file.py`` script, provided and maintained in the ``script`` repository.
The script performs multiple steps:

* Merge all the ``.parquet`` files and categorise the events, obtaining one file for each category of each sample.
* Convert the ``merged.parquet`` into ``ROOT`` trees.
* Convert the ``ROOT`` trees into FinalFit-compatible ``RooWorkspace`` objects.

All the steps can be performed in one go with a command more or less like this::

        python3 prepare_output_file.py --input [path to output dir] --merge --root --ws --syst --cats --args "--do_syst"

Alternatively, the individual steps can be performed by running the auxiliary scripts (``merge_parquet.py``, ``convert_parquet_to_root.py``, ``Tree2WS``) separately.
A complete set of options for the main script is listed below.

Merging step
------------
During this step the main script calls ``merge_parquet.py`` multiple times. The starting point is the output of HiggsDNA, i.e. ``out_dir/sample_n/``. These directories **must** contain only the ``.parquet`` files that have to be merged.
The script creates a new directory called ``merged`` under ``out_dir``; if this directory already exists, it throws an error and exits.
When converting data that are split per era (in my case ``Data_B_2017``, ``Data_C_2017``, etc.), the script puts them in a new directory ``Data_2017`` and then merges the output again into a single ``.parquet`` file called ``allData_2017.parquet``.
During this step the events are also split into categories according to the boundaries defined in the ``cat_dict`` in the main file. An example of such a dictionary is presented here::

        # Each category lists (variable, comparison, threshold) cuts
        # that an event has to satisfy.
        if opt.cats:
            cat_dict = {
                "best_resolution": {
                    "cat_filter": [
                        ("sigma_m_over_m_decorr", "<", 0.005),
                        ("lead_mvaID", ">", 0.43),
                        ("sublead_mvaID", ">", 0.43),
                    ]
                },
                "medium_resolution": {
                    "cat_filter": [
                        ("sigma_m_over_m_decorr", ">", 0.005),
                        ("sigma_m_over_m_decorr", "<", 0.008),
                        ("lead_mvaID", ">", 0.43),
                        ("sublead_mvaID", ">", 0.43),
                    ]
                },
                "worst_resolution": {
                    "cat_filter": [
                        ("sigma_m_over_m_decorr", ">", 0.008),
                        ("lead_mvaID", ">", 0.43),
                        ("sublead_mvaID", ">", 0.43),
                    ]
                },
            }

If you don't provide the dictionary to the script, all the events will be put in a single file labelled ``UNTAGGED``.
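
To make the meaning of the ``cat_filter`` tuples shown above concrete, here is a minimal sketch of how such cuts could be applied. It is not the code used by ``merge_parquet.py``: the helper names ``passes`` and ``categorise`` are hypothetical, events are plain dictionaries rather than ``.parquet`` rows, and dropping events that match no category is an assumption of this sketch::

        import operator

        # Map the comparison strings used in cat_filter to Python functions.
        OPS = {"<": operator.lt, "<=": operator.le,
               ">": operator.gt, ">=": operator.ge, "==": operator.eq}

        MVA_CUTS = [("lead_mvaID", ">", 0.43), ("sublead_mvaID", ">", 0.43)]
        cat_dict = {
            "best_resolution": {
                "cat_filter": [("sigma_m_over_m_decorr", "<", 0.005)] + MVA_CUTS
            },
            "medium_resolution": {
                "cat_filter": [("sigma_m_over_m_decorr", ">", 0.005),
                               ("sigma_m_over_m_decorr", "<", 0.008)] + MVA_CUTS
            },
            "worst_resolution": {
                "cat_filter": [("sigma_m_over_m_decorr", ">", 0.008)] + MVA_CUTS
            },
        }

        def passes(event, cat_filter):
            """True if the event satisfies every (variable, comparison, threshold) cut."""
            return all(OPS[op](event[var], cut) for var, op, cut in cat_filter)

        def categorise(events, cat_dict=None):
            """Group events by category; without a cat_dict everything is UNTAGGED."""
            if not cat_dict:
                return {"UNTAGGED": list(events)}
            out = {name: [] for name in cat_dict}
            for event in events:
                for name, spec in cat_dict.items():
                    if passes(event, spec["cat_filter"]):
                        out[name].append(event)
            return out

        # Three made-up events; the last one fails the lead_mvaID cut.
        events = [
            {"sigma_m_over_m_decorr": 0.004, "lead_mvaID": 0.9, "sublead_mvaID": 0.8},
            {"sigma_m_over_m_decorr": 0.006, "lead_mvaID": 0.9, "sublead_mvaID": 0.8},
            {"sigma_m_over_m_decorr": 0.010, "lead_mvaID": 0.2, "sublead_mvaID": 0.8},
        ]
        result = categorise(events, cat_dict)

Note that with strict ``<`` and ``>`` boundaries an event sitting exactly on a boundary value (e.g. ``0.005``) matches neither neighbouring category.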

During the merging step, MC samples can also be normalised to the ``efficiency x acceptance`` value, as required later on by FinalFit. This step can be skipped with the ``--skip-normalisation`` flag.
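
The per-era data handling described earlier in this section (``Data_B_2017``, ``Data_C_2017`` into ``Data_2017`` and then ``allData_2017.parquet``) can be sketched as a simple name mapping. This is only an illustration: the naming pattern is inferred from the example names, the helpers ``group_data_eras`` and ``merged_data_filename`` are hypothetical, and the MC sample name is made up::

        import re
        from collections import defaultdict

        # Assumed naming scheme, taken from the example: Data_<era letter>_<year>.
        ERA_PATTERN = re.compile(r"^Data_([A-Z])_(\d{4})$")

        def group_data_eras(sample_names):
            """Map each merged data directory (e.g. Data_2017) to its per-era inputs."""
            groups = defaultdict(list)
            for name in sample_names:
                match = ERA_PATTERN.match(name)
                if match:
                    groups[f"Data_{match.group(2)}"].append(name)
            return dict(groups)

        def merged_data_filename(group_name):
            """Data_2017 -> allData_2017.parquet."""
            return "all" + group_name + ".parquet"

        # The MC sample name below is a placeholder and is ignored by the grouping.
        samples = ["Data_B_2017", "Data_C_2017", "SomeSignal_2017"]
        groups = group_data_eras(samples)
        merged_name = merged_data_filename("Data_2017")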

Root step
---------

During this step the script calls ``convert_parquet_to_root.py`` multiple times. The arguments to pass to that script, for instance ``--do_syst`` if you want the systematic variations included in the output ``ROOT`` tree, are specified when calling ``prepare_output_file.py`` using ``--args "--do_syst"``.
As before, the script creates a new directory called ``root`` under ``out_dir``; if this directory already exists, it throws an error and exits. In the script there is a dictionary called ``outfiles`` that contains the name of the output ROOT file that will be created for each process type. If the workflow is run using the main script, these correspond to the processes contained in ``process_dict``.
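
The "create the directory or fail if it exists" behaviour described for both ``merged`` and ``root`` can be reproduced with the standard library. The following is a minimal sketch, assuming the check is equivalent to a plain directory-existence test; ``make_step_dir`` is a hypothetical helper, not a function of the scripts::

        import tempfile
        from pathlib import Path

        def make_step_dir(out_dir, step):
            """Create out_dir/<step> and fail if it is already there."""
            target = Path(out_dir) / step
            # exist_ok=False makes mkdir raise FileExistsError instead of
            # silently reusing a directory left over from a previous run.
            target.mkdir(parents=True, exist_ok=False)
            return target

        with tempfile.TemporaryDirectory() as out_dir:
            root_dir = make_step_dir(out_dir, "root")
            created = root_dir.is_dir()
            try:
                make_step_dir(out_dir, "root")
                second_call_failed = False
            except FileExistsError:
                second_call_failed = True

In practice this means that an existing ``merged`` or ``root`` directory presumably has to be removed or renamed before the corresponding step is run again.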

Workspace step
--------------

During this step the main script makes repeated use of ``Flashgg_FinalFit``: it moves to the directory defined by the ``--final_fit`` option (this could be improved) and runs the ``Tree2WS`` script there on the content of the ``root`` directory previously created. The output is stored in ``out_dir/root/sample_name/ws/``.
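
One possible way to run a command inside the FinalFit directory without changing the working directory of the calling script is the ``cwd`` argument of ``subprocess.run``. The sketch below is not what the main script does and does not reproduce the real ``Tree2WS`` call: ``run_in_directory`` is a hypothetical helper and the command is a stand-in that only prints its working directory::

        import os
        import subprocess
        import sys
        import tempfile

        def run_in_directory(command, directory):
            """Run command with directory as its working directory, return stdout."""
            completed = subprocess.run(
                command, cwd=directory, check=True, capture_output=True, text=True
            )
            return completed.stdout.strip()

        # Stand-in for the Tree2WS invocation: it just reports where it runs.
        stand_in = [sys.executable, "-c", "import os; print(os.getcwd())"]

        with tempfile.TemporaryDirectory() as final_fit_dir:
            reported = run_in_directory(stand_in, final_fit_dir)
            same_dir = os.path.realpath(reported) == os.path.realpath(final_fit_dir)
            caller_unchanged = (
                os.path.realpath(os.getcwd()) != os.path.realpath(final_fit_dir)
            )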

Commands
--------

The workflow is meant to be run in one go using the ``prepare_output_file.py`` script. It can also be split into separate steps or run with the individual auxiliary scripts, but this can be a bit cumbersome.

To run everything starting from the output of HiggsDNA, with categories and systematic variations, one can use::

        python3 prepare_output_file.py --input [path to output dir] --merge --root --ws --syst --cats --args "--do_syst"

and everything should run smoothly; it does for me at least (I have not tried the scripts in a while, so things may have to be adjusted in this document).
Some options can be removed. If you want to use ``--syst`` and ``--root``, you should also add ``--args "--do_syst"``.

The complete list of options for the main file is here:

    * ``--merge``, "Do merging of the .parquet files"
    * ``--root``, "Do root conversion step"
    * ``--ws``, "Do root to workspace conversion step"
    * ``--ws_config``, "configuration file for Tree2WS, as it is now it must be stored in Tree2WS directory in FinalFit"
    * ``--final_fit``, "FlashggFinalFit path" (the default is just for me; it should be changed, but I don't see a way to make this generally valid)
    * ``--syst``, "Do systematics variation treatment"
    * ``--cats``, "Split into categories"
    * ``--args``, "additional options for root converter: --do_syst, --notag"
    * ``--skip-normalisation``, "Independent of file type, skip normalisation step"
    * ``--verbose``, "verbose level for the logger: INFO (default), DEBUG"
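
As a rough illustration of how the options listed above fit together, here is a sketch of an option parser that accepts the example command. It is a reconstruction, not the parser of ``prepare_output_file.py``: the defaults and the ``--input`` help text are assumptions. ``optparse`` is used because the ``opt.cats`` access in the merging section hints at it (a guess) and because it accepts ``--args "--do_syst"`` as written, whereas ``argparse`` would reject a separate value that starts with ``--``::

        from optparse import OptionParser

        def build_parser():
            """Option parser mirroring the option list (defaults are assumptions)."""
            parser = OptionParser()
            # --input appears in the example command; its help text is a guess.
            parser.add_option("--input", default=None, help="HiggsDNA output directory")
            # Switches that only turn a step or a feature on.
            for flag, text in [
                ("--merge", "Do merging of the .parquet files"),
                ("--root", "Do root conversion step"),
                ("--ws", "Do root to workspace conversion step"),
                ("--syst", "Do systematics variation treatment"),
                ("--cats", "Split into categories"),
                ("--skip-normalisation", "Skip normalisation step"),
            ]:
                parser.add_option(flag, action="store_true", default=False, help=text)
            # Options that take a value.
            parser.add_option("--ws_config", default=None, help="Tree2WS config file")
            parser.add_option("--final_fit", default=None, help="FlashggFinalFit path")
            parser.add_option("--args", default="", help="Options for root converter")
            parser.add_option("--verbose", default="INFO", help="Logger level")
            return parser

        # Same options as in the full command, with a placeholder input path.
        argv = ["--input", "out_dir", "--merge", "--root", "--ws",
                "--syst", "--cats", "--args", "--do_syst"]
        opt, leftover = build_parser().parse_args(argv)

After parsing, ``opt.args`` holds the string ``--do_syst`` unchanged, ready to be forwarded to the root converter.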

The merging step can also be run separately using::

        python3 merge_parquet.py --source [path to the directory containing .parquet files] --target [target directory path] --cats [cat_dict]

The script also works without the ``--cats`` option: it creates a dummy selection of ``Pt > -1`` and calls the category ``UNTAGGED``.

Same for the root step::

        python3 convert_parquet_to_root.py [/path/to/merged.parquet] [path to output file containing also the filename] mc (or data depending what you're doing) --process [process name (should match one of the outfiles dict entries)] --do_syst --cats [cat_dict] --vars [variation.json]

``--do_syst`` is not mandatory, but if it is used, the dictionary containing the variations must also be specified with the ``--vars`` option. As before, the script also works without the ``--cats`` option.
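
When the root step is driven from another Python script, the rule above can be checked before launching the converter. The sketch below only assembles the argument list, using the flag spelling of the command shown above. ``build_convert_command`` is a hypothetical helper, and all paths and the process name are placeholders::

        def build_convert_command(source, target, sample_type, process=None,
                                  do_syst=False, cats=None, variations=None):
            """Assemble the convert_parquet_to_root.py call as a list of arguments."""
            if sample_type not in ("mc", "data"):
                raise ValueError("sample_type must be 'mc' or 'data'")
            # Systematics cannot be processed without the variation dictionary.
            if do_syst and variations is None:
                raise ValueError("--do_syst requires the variations file (--vars)")
            command = ["python3", "convert_parquet_to_root.py",
                       source, target, sample_type]
            if process:
                command += ["--process", process]
            if do_syst:
                command += ["--do_syst", "--vars", variations]
            if cats:
                command += ["--cats", cats]
            return command

        # Placeholder paths and process name, for illustration only.
        cmd = build_convert_command(
            "merged/sample/merged.parquet", "root/sample/output.root", "mc",
            process="sample_process", do_syst=True, variations="variation.json",
        )

        try:
            build_convert_command("in.parquet", "out.root", "mc", do_syst=True)
            missing_vars_rejected = False
        except ValueError:
            missing_vars_rejected = True

The resulting list can be passed to ``subprocess.run`` directly, which avoids quoting problems with paths that contain spaces.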