.. _ptypy_data: *************** Data management *************** .. note:: In this chapter, we refer to the *raw input data* as *data* and not to data stored in computer memory by :any:`Storage` instances. With the term *preparation* we refer to all data processing steps prior to the reconstruction, and avoid the ambiguous term *processing* although it may be more familiar to the reader. Consider the following generic steps which every ptychographer has to complete prior to a successful image reconstruction. **(A)** *Conducting a scanning diffraction experiment.* During or after the experiment, the researcher is left with *raw images* acquired from the detector and *meta data* which, in general, consists of scanning positions along with geometric information about the setup, e.g. photon *energy*, propagation *distance*, *detector pixel size* etc. **(B)** *Preparing the data.* In this step, the user performs a subset of the following actions: * select the appropriate region of the detector where the scattering events were counted, * apply possible *pixel corrections* to convert the detector counts of the chosen diffraction frame into photon counts, e.g. flat-field and dark-field correction, * switch the image orientation to match the coordinate system of the reconstruction algorithms, * assign a suitable mask to exclude invalid pixel data (hot or dead pixels, overexposure), * and/or simply rebin the data. Finally, the user needs to pair the diffraction frames with the scanning positions. **(C)** *Saving the processed data or feeding it into the reconstruction process.* In this step the user needs to save the data in a suitable format or provide it directly to the reconstruction engine. **Data management** in |ptypy| deals with **(B)** and **(C)**, as a ptychography reconstruction software naturally **cannot** provide actual experimental data. Nevertheless, the treatment of raw data is usually very similar for every experiment. Consequently, |ptypy| provides an abstract base class, called :py:class:`PtyScan`, which aims to help with steps (B) and (C). In order to adapt |ptypy| to a specific experimental setup, we simply subclass :py:class:`PtyScan` and reimplement only the subset of its methods affected by the specifics of the experimental setup (see :ref:`subclassptyscan`). .. _sec_ptyscan: The PtyScan class ================= :py:class:`PtyScan` is the abstract base class in |ptypy| that manages raw input data. A PtyScan instance is constructed from a set of generic parameters, see :py:data:`.scan.data` in the ptypy parameter tree. It provides the following features: **Parallelization** When |ptypy| is run across several MPI processes, PtyScan takes care of distributing the scan-point indices among processes such that each process only loads the data it will later use in the reconstruction. Hence, the load on the network is not affected by the number of processes. The parallel behavior of :py:class:`PtyScan` is controlled by the parameter :py:data:`.scan.data.load_parallel` and relies on the :py:class:`~ptypy.utils.parallel.LoadManager` class. **Preparation** PtyScan can handle a few of the raw processing steps mentioned above. * Selection of a region-of-interest from the raw detector image. This selection is controlled by the parameters :py:data:`.scan.data.auto_center`, :py:data:`.scan.data.shape` and :py:data:`.scan.data.center`. * Switching the orientation and rebinning are controlled by :py:data:`.scan.data.orientation` and :py:data:`.scan.data.rebin`.
* Finding a suitable mask or weight for pixel correction is left to the user, as this is a setup-specific implementation. See :py:meth:`~ptypy.core.data.PtyScan.load_weight`, :py:meth:`~ptypy.core.data.PtyScan.load_common`, :py:meth:`~ptypy.core.data.PtyScan.load` and :py:meth:`~ptypy.core.data.PtyScan.correct` for detailed explanations. **Packaging** PtyScan packs the prepared *data* together with the scan point *indices* used, the scan *positions*, a *weight* (= mask) and geometric *meta* information. This package is requested by the managing instance :py:class:`~ptypy.core.manager.ModelManager` through a call to :py:meth:`~ptypy.core.manager.ModelManager.new_data`. Because data acquisition and preparation can happen during a reconstruction process, it is possible to specify the minimum number of data frames passed to each process per *new_data()* call by setting the value of :py:data:`.scan.data.min_frames`. The total number of frames processed for a scan is set by :py:data:`.scan.data.num_frames`. If not extracted from other files, the user may set the photon energy with :py:data:`.scan.data.energy`, the propagation distance from sample to detector with :py:data:`.scan.data.distance` and the detector pixel size with :py:data:`.scan.data.psize`. **Storage** PtyScan and its subclasses are capable of storing the data in an *hdf5*-compatible [HDF]_ file format. The data file names have a custom suffix: ``.ptyd``. A detailed overview of the *.ptyd* data file tree is given below in the section :ref:`ptyd_file`. The parameters :py:data:`.scan.data.save` and :py:data:`.scan.data.chunk_format` control the way PtyScan saves the processed data. .. note:: Although *h5py* [h5py]_ supports parallel writes, this feature is not used in ptypy. At the moment, all MPI nodes send their prepared data to the master node, which writes the data to a file. .. _ptyd_scenarios: Usage scenarios =============== The PtyScan class of |ptypy| provides support for three use cases. **Beamline integrated use.** In this use case, the researcher has integrated |ptypy| into the beamline end-station or experimental setup with the help of a custom subclass of :py:class:`PtyScan` that we call ``UserScan``. This subclass has its own methods to extract many of the generic parameters of :py:data:`.scan.data` and also provides defaults for specific custom parameters, for instance file paths or file name patterns (for a detailed introduction on how to subclass PtyScan, see :ref:`subclassptyscan`). Once the experiment is completed, the researcher can initiate a reconstruction directly from raw data with a standard reconstruction script. .. figure:: ../img/data_case_integrated.png :width: 70 % :figclass: highlights :name: case_integrated Integrated use case of :py:class:`PtyScan`. A custom subclass ``UserScan`` serves as a translator between |ptypy|'s generic parameters and data types and the raw image data and meta data from the experiment. Typically the experiment has to be completed before a reconstruction is started, but with some effort it is even possible to have the reconstruction start immediately after acquisition of the first frame. As data preparation is blended in with the reconstruction process, the reconstruction pauses while new data is prepared. Optionally, the prepared data is saved to a ``.ptyd`` file to avoid rerunning the preparation steps in subsequent reconstruction runs.
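For orientation, the following sketch shows how the generic preparation parameters of :py:data:`.scan.data` discussed above might appear in such a standard reconstruction script. It is only a sketch: the exact location of the ``data`` branch in the parameter tree (assumed here to be ``p.scans.scan00.data``) and the appropriate ``source`` value depend on the ptypy version and on the :py:class:`PtyScan` subclass in use.

::

    from ptypy import utils as u

    p = u.Param()                       # root of the parameter tree
    p.scans = u.Param()
    p.scans.scan00 = u.Param()
    p.scans.scan00.data = u.Param()     # the generic .scan.data branch

    d = p.scans.scan00.data
    # d.source = ...                    # selects the PtyScan subclass; depends on how UserScan is registered
    d.shape = 256                       # region of interest cropped from the raw frames
    d.auto_center = True                # let PtyScan determine the optical axis
    d.rebin = 2                         # rebin the raw frames by a factor of 2
    d.orientation = 0                   # no flips or transpose
    d.min_frames = 10                   # minimum frames per new_data() call and process
    d.save = 'append'                   # store prepared chunks in a single .ptyd file
    d.dfile = '/tmp/ptypy/scan00.ptyd'  # hypothetical output path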
**Post preparation use.** In this use case, the experiment is long past and the researcher has either used a custom subclass of PtyScan or *any other script* that generates a compatible *.hdf5* file (see :ref:`ptyd_file`) to save the prepared data of that experiment. Reconstruction then works by simply passing the data file path in the parameter tree. The input file path is passed either with :py:data:`~.scan.data.source`, or with :py:data:`~.scan.data.dfile` when :py:data:`~.scan.data.source` takes the value ``'file'``. In the latter case, secondary processing and saving to another file are not supported, while they are allowed in the first case. While the latter case seems unfavorable due to the lack of secondary preparation options, it is meant as a user-friendly transition switch from the first reconstruction at the experiment to post-experiment analysis. Only the :py:data:`~.scan.data.source` parameter needs to be altered in the script from ``<..>.data.source=`` to ``<..>.data.source='file'`` while the rest of the parameters are ignored and may remain untouched. .. figure:: ../img/data_case_prepared.png :width: 70 % :figclass: highlights :name: case_prepared Standard supported use case of :py:class:`PtyScan`. If a structure-compatible (see :ref:`ptyd_file`) ``*.hdf5``-file is available, |ptypy| can be used without customizing a subclass of :py:class:`PtyScan`. It will use the shipped subclass :py:class:`PtydScan` to read in the (prepared) raw data. **Preparation and reconstruction on-the-fly with data acquisition.** This use case is for even tighter beamline integration and on-the-fly scans. The researcher has already implemented a suitable subclass ``UserScan`` to prepare data from the setup. Now, the preparation happens in a separate process while image frames are acquired. This process runs a python script where the subclass ``UserScan`` prepares the data using the :py:meth:`~ptypy.core.data.PtyScan.auto` method. The :py:data:`~.scan.data.save` parameter is set to ``'link'`` in order to create a separate file for each data chunk and to avoid write access on the source file. The chunk files are linked back into the main source ``.ptyd`` file. All reconstruction processes may access the prepared data without overhead or notable pauses in the reconstruction. For |ptypy| there is no difference compared to a single source file (a feature of [HDF]_\ ). .. figure:: ../img/data_case_flyscan.png :width: 70 % :figclass: highlights :name: case_flyscan On-the-fly or daemon-like use case of :py:class:`PtyScan`. A separate process prepares the data *chunks* and saves them in separate files which are linked back into the source data file. This process may run silently as a daemon in the background. Reconstructions can start immediately and run without delays or pauses due to data preparation. .. _ptyd_file: Ptyd file format ================ Ptypy uses the python module **h5py** [h5py]_ to store and load data in the **H**\ ierarchical **D**\ ata **F**\ ormat [HDF]_ . HDF closely resembles the directory/file tree of today's operating systems, where the "files" are (multidimensional) datasets. Ptypy stores and loads the (processed) experimental data in a file with extension *.ptyd*, which is an hdf5 file with a very simple data tree. Comparable to tagged image file formats like *.edf* or *.tiff*, the ``ptyd`` data file separates meta information (stored in ``meta/``) from the actual data payload (stored in ``chunks/``). A schematic overview of the data tree is depicted below.
:: *.ptyd/ meta/ [general parameters; optional but very useful] version : str num_frames : int label : str [geometric parameters; all optional] shape : int or (int,int) energy : float, optional distance : float, optional center : (float,float) or None, optional psize : float or (float,float), optional propagation : "farfield" or "nearfield", optional ... chunks/ 0/ data : array(M,N,N) of float indices : array(M) of int, optional positions : array(M,2) of float weights : same shape as data or empty 1/ ... 2/ ... ... All parameters of ``meta/`` are a subset of :py:data:`.scan.data`\ . Omitting any of these parameters or setting the value of the dataset to ``'None'`` has the same effect. The first set of parameters :: version : str num_frames : int label : str are general (optional) parameters. * ``version`` is the ptypy version this dataset was prepared with (current version is |version|, see :py:data:`~.scan.data.version`). * ``label`` is a custom user label. Choose a unique label to your liking. * ``num_frames`` indicates how many diffraction image frames are expected in the dataset (see :py:data:`~.scan.data.num_frames`). It is important to set this parameter when the data acquisition is not finished but the reconstruction has already started. If the dataset is complete, the loading class :py:class:`PtydScan` retrieves the total number of frames from the payload ``chunks/``. The next set of optional parameters are :: shape : int or (int,int) energy : float distance : float center : (float,float) psize : float or (float,float) propagation : "farfield" or "nearfield" which refer to the experimental scanning geometry. * ``shape`` (see :py:data:`.scan.data.shape`) * ``energy`` (see :py:data:`.scan.data.energy` or :py:data:`.scan.geometry.energy`) * ``distance`` (see :py:data:`.scan.data.distance`) * ``center`` : (float,float) (see :py:data:`.scan.data.center`) * ``psize`` : float or (float,float) (see :py:data:`.scan.data.psize`) * ``propagation`` : "farfield" or "nearfield" (see :py:data:`.scan.data.propagation`) Finally, these parameters are digested by the :py:mod:`~ptypy.core.geometry` module in order to provide a suitable propagator. .. note:: As you may have already noted, there are three ways to specify the geometry of the experiment: in the ``meta/`` section of a *.ptyd* data file, via the :py:data:`.scan.data` parameters, or via the :py:data:`.scan.geometry` parameters. As walking the data tree and extracting the data from the *hdf5* file is a bit cumbersome with h5py, there are a few convenience functions in the :py:mod:`ptypy.io.h5rw` module. .. _subclassptyscan: Tutorial: Subclassing PtyScan ============================== .. note:: This tutorial was generated from the python source :file:`[ptypy_root]/tutorial/subclassptyscan.py` using :file:`ptypy/doc/script2rst.py`. You are encouraged to modify the parameters and rerun the tutorial with:: $ python [ptypy_root]/tutorial/subclassptyscan.py In this tutorial, we learn how to subclass :py:class:`PtyScan` to make ptypy work with any experimental setup. This tutorial can be used as a direct follow-up to :ref:`simupod` if section :ref:`store` was completed. Again, the imports come first.
:: >>> import numpy as np >>> from ptypy.core.data import PtyScan >>> from ptypy import utils as u For this tutorial we assume that the data and meta information are in this path: :: >>> save_path = '/tmp/ptypy/sim/' Furthermore, we assume that a file describing the experimental geometry is located at :: >>> geofilepath = save_path + 'geometry.txt' >>> print(geofilepath) /tmp/ptypy/sim/geometry.txt and has contents of the following form :: >>> print(''.join([line for line in open(geofilepath, 'r')])) distance 1.5000e-01 energy 2.3305e-03 psize 2.4000e-05 shape 256 The scanning positions are in :: >>> positionpath = save_path + 'positions.txt' >>> print(positionpath) /tmp/ptypy/sim/positions.txt with a list of positions for vertical and horizontal movement and the corresponding image frame from the "camera" :: >>> print(''.join([line for line in open(positionpath, 'r')][:6])+'....') ccd/diffraction_0000.npy 0.0000e+00 0.0000e+00 ccd/diffraction_0001.npy 0.0000e+00 4.1562e-04 ccd/diffraction_0002.npy 3.9528e-04 1.2844e-04 ccd/diffraction_0003.npy 2.4430e-04 -3.3625e-04 ccd/diffraction_0004.npy -2.4430e-04 -3.3625e-04 ccd/diffraction_0005.npy -3.9528e-04 1.2844e-04 .... Writing a subclass ------------------ The simplest subclass of PtyScan would look like this :: >>> class NumpyScan(PtyScan): >>> """ >>> A PtyScan subclass to extract data from a numpy array. >>> """ >>> >>> def __init__(self, pars=None, **kwargs): >>> # In init we need to call the parent. >>> super(NumpyScan, self).__init__(pars, **kwargs) >>> Of course this class does nothing special beyond PtyScan. As it is, the class also cannot be used as a real PtyScan instance because its defaults are not properly managed. For this, Ptypy provides a powerful self-documenting tool called a "descriptor", which can be applied to any new class using a decorator. The tree of all valid ptypy parameters is located at :ref:`here `. To manage the default parameters of our subclass and document its existence, we would need to write :: >>> from ptypy import defaults_tree :: >>> @defaults_tree.parse_doc('scandata.numpyscan') >>> class NumpyScan(PtyScan): >>> """ >>> A PtyScan subclass to extract data from a numpy array. >>> """ >>> >>> def __init__(self, pars=None, **kwargs): >>> # In init we need to call the parent. >>> super(NumpyScan, self).__init__(pars, **kwargs) >>> The decorator extracts information from the docstring of the subclass and parent classes about the expected input parameters. Currently the docstring of `NumpyScan` does not contain anything special, thus the only parameters registered are those of the parent class, `PtyScan`: :: >>> print(defaults_tree['scandata.numpyscan'].to_string()) [name] default = PtyScan help = type = str [dfile] default = None help = File path where prepared data will be saved in the ``ptyd`` format. type = file userlevel = 0 [chunk_format] default = .chunk%02d help = Appendix to saved files if save == 'link' type = str doc = userlevel = 2 [save] default = None help = Saving mode type = str doc = Mode to use to save data to file. - ``None``: No saving - ``'merge'``: attemts to merge data in single chunk **[not implemented]** - ``'append'``: appends each chunk in master \*.ptyd file - ``'link'``: appends external links in master \*.ptyd file and stores chunks separately in the path given by the link. Links file paths are relative to master file.
userlevel = 1 [auto_center] default = None help = Determine if center in data is calculated automatically type = bool doc = - ``False``, no automatic centering - ``None``, only if :py:data:`center` is ``None`` - ``True``, it will be enforced userlevel = 0 [load_parallel] default = data help = Determines what will be loaded in parallel type = str doc = Choose from ``None``, ``'data'``, ``'common'``, ``'all'`` choices = ['data', 'common', 'all'] [rebin] default = None help = Rebinning factor type = int doc = Rebinning factor for the raw data frames. ``'None'`` or ``1`` both mean *no binning* userlevel = 1 lowlim = 1 uplim = 32 [orientation] default = None help = Data frame orientation type = int, tuple, list doc = Choose - ``None`` or ``0``: correct orientation - ``1``: invert columns (numpy.flip_lr) - ``2``: invert rows (numpy.flip_ud) - ``3``: invert columns, invert rows - ``4``: transpose (numpy.transpose) - ``4+i``: tranpose + other operations from above Alternatively, a 3-tuple of booleans may be provided ``(do_transpose, do_flipud, do_fliplr)`` choices = [0, 1, 2, 3, 4, 5, 6, 7] userlevel = 1 [min_frames] default = 1 help = Minimum number of frames loaded by each node type = int doc = userlevel = 2 lowlim = 1 [positions_theory] default = None help = Theoretical positions for this scan type = ndarray doc = If provided, experimental positions from :py:class:`PtyScan` subclass will be ignored. If data preparation is called from Ptycho instance, the calculated positions from the :py:func:`ptypy.core.xy.from_pars` dict will be inserted here userlevel = 2 [num_frames] default = None help = Maximum number of frames to be prepared type = int doc = If `positions_theory` are provided, num_frames will be ovverriden with the number of positions available userlevel = 1 [label] default = None help = The scan label type = str doc = Unique string identifying the scan userlevel = 1 [experimentID] default = None help = Name of the experiment type = str doc = If None, a default value will be provided by the recipe. **unused** userlevel = 2 [version] default = 0.1 help = TODO: Explain this and decide if it is a user parameter. type = float doc = userlevel = 2 [shape] default = 256 help = Shape of the region of interest cropped from the raw data. type = int, tuple doc = Cropping dimension of the diffraction frame Can be None, (dimx, dimy), or dim. In the latter case shape will be (dim, dim). userlevel = 1 [center] default = 'fftshift' help = Center (pixel) of the optical axes in raw data type = list, tuple, str doc = If ``None``, this parameter will be set by :py:data:`~.scan.data.auto_center` or elsewhere userlevel = 1 [psize] default = 0.000172 help = Detector pixel size type = float, tuple doc = Dimensions of the detector pixels (in meters) userlevel = 0 lowlim = 0 [distance] default = 7.19 help = Sample to detector distance type = float doc = In meters. userlevel = 0 lowlim = 0 [energy] default = 7.2 help = Photon energy of the incident radiation in keV type = float doc = userlevel = 0 lowlim = 0 [add_poisson_noise] default = False help = Decides whether the scan should have poisson noise or not type = bool As you can see, there are already many parameters documented in `PtyScan`'s class. For each parameter, most important are the *type*, *default* value and *help* string. 
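The entries of this listing can also be queried programmatically from ``defaults_tree``, using the dotted-path lookup shown above. The following is a minimal sketch; the ``default`` and ``help`` attributes on the returned descriptor entry are an assumption and may differ between ptypy versions.

::

    from ptypy import defaults_tree

    # look up a single parameter entry of our subclass by its dotted path
    desc = defaults_tree['scandata.numpyscan.shape']

    # assumed attributes on the descriptor entry
    print(desc.default)   # e.g. 256
    print(desc.help)      # the one-line help string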
The decorator does more than collect this information: it also generates from it a class variable called `DEFAULT`, which stores all defaults: :: >>> print(u.verbose.report(NumpyScan.DEFAULT, noheader=True)) * id3V4ANI238G : ptypy.utils.parameters.Param(20) * name : PtyScan * dfile : None * chunk_format : .chunk%02d * save : None * auto_center : None * load_parallel : data * rebin : None * orientation : None * min_frames : 1 * positions_theory : None * num_frames : None * label : None * experimentID : None * version : 0.1 * shape : 256 * center : fftshift * psize : 0.000172 * distance : 7.19 * energy : 7.2 * add_poisson_noise : False Now we are ready to add functionality to our subclass. A first step of initialisation would be to retrieve the geometric information that we stored in ``geofilepath`` and update the input parameters with it. We write a tiny file parser. :: >>> def extract_geo(base_path): >>> out = {} >>> with open(base_path+'geometry.txt') as f: >>> for line in f: >>> key, value = line.strip().split() >>> out[key] = eval(value) >>> return out >>> We test it. :: >>> print(extract_geo(save_path)) {'distance': 0.15, 'energy': 0.0023305, 'psize': 2.4e-05, 'shape': 256} That seems to work. We can integrate this parser into the initialisation as we assume that this small access can be done by all MPI nodes without data access problems. Hence, our subclass becomes :: >>> @defaults_tree.parse_doc('scandata.numpyscan') >>> class NumpyScan(PtyScan): >>> """ >>> A PtyScan subclass to extract data from a numpy array. >>> >>> Defaults: >>> >>> [name] >>> type = str >>> default = numpyscan >>> help = >>> >>> [base_path] >>> type = str >>> default = './' >>> help = Base path to extract data files from. >>> """ >>> >>> def __init__(self, pars=None, **kwargs): >>> p = self.DEFAULT.copy(depth=2) >>> p.update(pars) >>> >>> with open(p.base_path+'geometry.txt') as f: >>> for line in f: >>> key, value = line.strip().split() >>> # we only replace Nones or missing keys >>> if p.get(key) is None: >>> p[key] = eval(value) >>> >>> super(NumpyScan, self).__init__(p, **kwargs) >>> We now need a new input parameter called `base_path`, so we documented it in the docstring after the section header "Defaults:". :: >>> print(defaults_tree['scandata.numpyscan.base_path']) [base_path] default = './' help = Base path to extract data files from. type = str As you can see, the first step in `__init__` is to build a default parameter structure to ensure that all input parameters are available. The next line updates this structure to overwrite the entries specified by the user. Good! Next, we need to implement how the class finds out about the positions in the scan. The method :py:meth:`~ptypy.core.data.PtyScan.load_positions` can be used for this purpose. :: >>> print(PtyScan.load_positions.__doc__) **Override in subclass for custom implementation** *Called in* :py:meth:`initialize` Loads all positions for all diffraction patterns in this scan. The positions loaded here will be available by all processes through the attribute ``self.positions``. If you specify position on a per frame basis in :py:meth:`load` , this function has no effect. If theoretical positions :py:data:`positions_theory` are provided in the initial parameter set :py:data:`DEFAULT`, specifying positions here has NO effect and will be ignored. The purpose of this function is to avoid reloading and parallel reads on files that may require intense parsing to retrieve the information, e.g. long SPEC log files. 
If parallel reads or log file parsing for each set of frames is not a time critical issue of the subclass, reimplementing this function can be ignored and it is recommended to only reimplement the :py:meth:`load` method. If `load_parallel` is set to `all` or common`, this function is executed by all nodes, otherwise the master node executes this function and broadcasts the results to other nodes. Returns ------- positions : ndarray A (N,2)-array where *N* is the number of positions. Note ---- Be aware that this method sets attribute :py:attr:`num_frames` in the following manner. * If ``num_frames == None`` : ``num_frames = N``. * If ``num_frames < N`` , no effect. * If ``num_frames > N`` : ``num_frames = N``. The parser for the positions file would look like this. :: >>> def extract_pos(base_path): >>> pos = [] >>> files = [] >>> with open(base_path+'positions.txt') as f: >>> for line in f: >>> fname, y, x = line.strip().split() >>> pos.append((eval(y), eval(x))) >>> files.append(fname) >>> return files, pos >>> And the test: :: >>> files, pos = extract_pos(save_path) >>> print(files[:2]) ['ccd/diffraction_0000.npy', 'ccd/diffraction_0001.npy'] >>> print(pos[:2]) [(0.0, 0.0), (0.0, 0.00041562)] :: >>> @defaults_tree.parse_doc('scandata.numpyscan') >>> class NumpyScan(PtyScan): >>> """ >>> A PtyScan subclass to extract data from a numpy array. >>> >>> Defaults: >>> >>> [name] >>> type = str >>> default = numpyscan >>> help = >>> >>> [base_path] >>> type = str >>> default = /tmp/ptypy/sim/ >>> help = Base path to extract data files from. >>> """ >>> >>> def __init__(self, pars=None, **kwargs): >>> p = self.DEFAULT.copy(depth=2) >>> p.update(pars) >>> >>> with open(p.base_path+'geometry.txt') as f: >>> for line in f: >>> key, value = line.strip().split() >>> # we only replace Nones or missing keys >>> if p.get(key) is None: >>> p[key] = eval(value) >>> >>> super(NumpyScan, self).__init__(p, **kwargs) >>> >>> def load_positions(self): >>> # the base path is now stored in self.info >>> base_path = self.info.base_path >>> pos = [] >>> with open(base_path+'positions.txt') as f: >>> for line in f: >>> fname, y, x = line.strip().split() >>> pos.append((eval(y), eval(x))) >>> return np.asarray(pos) >>> One nice thing about rewriting ``self.load_positions`` is that the maximum number of frames will be set automatically, and we do not need to manually adapt :py:meth:`~ptypy.core.data.PtyScan.check`. The last step is to overwrite the actual loading of data. Loading happens in an MPI-compatible manner in :py:meth:`~ptypy.core.data.PtyScan.load` :: >>> print(PtyScan.load.__doc__) **Override in subclass for custom implementation** Loads data according to node specific scanpoint indices that have been determined by :py:class:`LoadManager` or otherwise. Returns ------- raw, positions, weight : dict Dictionaries whose keys are the given scan point `indices` and whose values are the respective frame / position according to the scan point index. `weight` and `positions` may be empty Note ---- This is the *most* important method to change when subclassing :py:class:`PtyScan`. Most often it suffices to override the constructor and this method to create a subclass suited for a specific experiment. ``load`` seems a bit more complex than ``self.load_positions`` because of its return values. However, we can opt out of providing weights (masks) and positions, as we have already adapted ``self.load_positions`` and there were no bad pixels in the (linear) detector. The final subclass looks like this.
We overwrite two defaults from `PtyScan`: :: >>> @defaults_tree.parse_doc('scandata.numpyscan') >>> class NumpyScan(PtyScan): >>> """ >>> A PtyScan subclass to extract data from a numpy array. >>> >>> Defaults: >>> >>> [name] >>> type = str >>> default = numpyscan >>> help = >>> >>> [base_path] >>> type = str >>> default = /tmp/ptypy/sim/ >>> help = Base path to extract data files from. >>> >>> [auto_center] >>> default = False >>> >>> [dfile] >>> default = /tmp/ptypy/sim/npy.ptyd >>> """ >>> >>> def __init__(self, pars=None, **kwargs): >>> p = self.DEFAULT.copy(depth=2) >>> p.update(pars) >>> >>> with open(p.base_path+'geometry.txt') as f: >>> for line in f: >>> key, value = line.strip().split() >>> # we only replace Nones or missing keys >>> if p.get(key) is None: >>> p[key] = eval(value) >>> >>> super(NumpyScan, self).__init__(p, **kwargs) >>> >>> def load_positions(self): >>> # the base path is now stored in self.info >>> base_path = self.info.base_path >>> pos = [] >>> with open(base_path+'positions.txt') as f: >>> for line in f: >>> fname, y, x = line.strip().split() >>> pos.append((eval(y), eval(x))) >>> return np.asarray(pos) >>> >>> def load(self, indices): >>> raw = {} >>> bp = self.info.base_path >>> for ii in indices: >>> raw[ii] = np.load(bp+'ccd/diffraction_%04d.npy' % ii) >>> return raw, {}, {} >>> Loading the data ---------------- With the subclass, we create a scan using only the defaults :: >>> NPS = NumpyScan() >>> NPS.initialize() In order to process the data, we need to call :py:meth:`~ptypy.core.data.PtyScan.auto` with the chunk size as argument. It returns a data chunk that we can inspect with :py:func:`ptypy.utils.verbose.report`. The information is concatenated, but the length of iterables or dicts is always indicated in parentheses. :: >>> print(u.verbose.report(NPS.auto(80), noheader=True)) * id3V4AQEBE20 : dict(3) * common : ptypy.utils.parameters.Param(8) * version : 0.1 * num_frames : 116 * label : None * shape : [array = [256 256]] * psize : [array = [0.000172 0.000172]] * energy : 7.2 * center : [array = [128. 128.]] * distance : 7.19 * chunk : ptypy.utils.parameters.Param(6) * indices : list(80) * id2M979S98S8 : 0 * id2M979S98T8 : 1 * id2M979S98U8 : 2 * id2M979S98V8 : 3 * id2M979S9908 : 4 * ... : .... * indices_node : list(80) * id2M979S98S8 : 0 * id2M979S98T8 : 1 * id2M979S98U8 : 2 * id2M979S98V8 : 3 * id2M979S9908 : 4 * ... : ....
* num : 0 * data : dict(80) * 0 : [256x256 int32 array] * 1 : [256x256 int32 array] * 2 : [256x256 int32 array] * 3 : [256x256 int32 array] * 4 : [256x256 int32 array] * 5 : [256x256 int32 array] * 6 : [256x256 int32 array] * 7 : [256x256 int32 array] * 8 : [256x256 int32 array] * 9 : [256x256 int32 array] * 10 : [256x256 int32 array] * 11 : [256x256 int32 array] * 12 : [256x256 int32 array] * 13 : [256x256 int32 array] * 14 : [256x256 int32 array] * 15 : [256x256 int32 array] * 16 : [256x256 int32 array] * 17 : [256x256 int32 array] * 18 : [256x256 int32 array] * 19 : [256x256 int32 array] * 20 : [256x256 int32 array] * 21 : [256x256 int32 array] * 22 : [256x256 int32 array] * 23 : [256x256 int32 array] * 24 : [256x256 int32 array] * 25 : [256x256 int32 array] * 26 : [256x256 int32 array] * 27 : [256x256 int32 array] * 28 : [256x256 int32 array] * 29 : [256x256 int32 array] * 30 : [256x256 int32 array] * 31 : [256x256 int32 array] * 32 : [256x256 int32 array] * 33 : [256x256 int32 array] * 34 : [256x256 int32 array] * 35 : [256x256 int32 array] * 36 : [256x256 int32 array] * 37 : [256x256 int32 array] * 38 : [256x256 int32 array] * 39 : [256x256 int32 array] * 40 : [256x256 int32 array] * 41 : [256x256 int32 array] * 42 : [256x256 int32 array] * 43 : [256x256 int32 array] * 44 : [256x256 int32 array] * 45 : [256x256 int32 array] * 46 : [256x256 int32 array] * 47 : [256x256 int32 array] * 48 : [256x256 int32 array] * 49 : [256x256 int32 array] * 50 : [256x256 int32 array] * 51 : [256x256 int32 array] * 52 : [256x256 int32 array] * 53 : [256x256 int32 array] * 54 : [256x256 int32 array] * 55 : [256x256 int32 array] * 56 : [256x256 int32 array] * 57 : [256x256 int32 array] * 58 : [256x256 int32 array] * 59 : [256x256 int32 array] * 60 : [256x256 int32 array] * 61 : [256x256 int32 array] * 62 : [256x256 int32 array] * 63 : [256x256 int32 array] * 64 : [256x256 int32 array] * 65 : [256x256 int32 array] * 66 : [256x256 int32 array] * 67 : [256x256 int32 array] * 68 : [256x256 int32 array] * 69 : [256x256 int32 array] * 70 : [256x256 int32 array] * 71 : [256x256 int32 array] * 72 : [256x256 int32 array] * 73 : [256x256 int32 array] * 74 : [256x256 int32 array] * 75 : [256x256 int32 array] * 76 : [256x256 int32 array] * 77 : [256x256 int32 array] * 78 : [256x256 int32 array] * 79 : [256x256 int32 array] * weights : dict(80) * 0 : [256x256 bool array] * 1 : [256x256 bool array] * 2 : [256x256 bool array] * 3 : [256x256 bool array] * 4 : [256x256 bool array] * 5 : [256x256 bool array] * 6 : [256x256 bool array] * 7 : [256x256 bool array] * 8 : [256x256 bool array] * 9 : [256x256 bool array] * 10 : [256x256 bool array] * 11 : [256x256 bool array] * 12 : [256x256 bool array] * 13 : [256x256 bool array] * 14 : [256x256 bool array] * 15 : [256x256 bool array] * 16 : [256x256 bool array] * 17 : [256x256 bool array] * 18 : [256x256 bool array] * 19 : [256x256 bool array] * 20 : [256x256 bool array] * 21 : [256x256 bool array] * 22 : [256x256 bool array] * 23 : [256x256 bool array] * 24 : [256x256 bool array] * 25 : [256x256 bool array] * 26 : [256x256 bool array] * 27 : [256x256 bool array] * 28 : [256x256 bool array] * 29 : [256x256 bool array] * 30 : [256x256 bool array] * 31 : [256x256 bool array] * 32 : [256x256 bool array] * 33 : [256x256 bool array] * 34 : [256x256 bool array] * 35 : [256x256 bool array] * 36 : [256x256 bool array] * 37 : [256x256 bool array] * 38 : [256x256 bool array] * 39 : [256x256 bool array] * 40 : [256x256 bool array] * 41 : [256x256 bool array] * 42 : [256x256 bool array] * 
43 : [256x256 bool array] * 44 : [256x256 bool array] * 45 : [256x256 bool array] * 46 : [256x256 bool array] * 47 : [256x256 bool array] * 48 : [256x256 bool array] * 49 : [256x256 bool array] * 50 : [256x256 bool array] * 51 : [256x256 bool array] * 52 : [256x256 bool array] * 53 : [256x256 bool array] * 54 : [256x256 bool array] * 55 : [256x256 bool array] * 56 : [256x256 bool array] * 57 : [256x256 bool array] * 58 : [256x256 bool array] * 59 : [256x256 bool array] * 60 : [256x256 bool array] * 61 : [256x256 bool array] * 62 : [256x256 bool array] * 63 : [256x256 bool array] * 64 : [256x256 bool array] * 65 : [256x256 bool array] * 66 : [256x256 bool array] * 67 : [256x256 bool array] * 68 : [256x256 bool array] * 69 : [256x256 bool array] * 70 : [256x256 bool array] * 71 : [256x256 bool array] * 72 : [256x256 bool array] * 73 : [256x256 bool array] * 74 : [256x256 bool array] * 75 : [256x256 bool array] * 76 : [256x256 bool array] * 77 : [256x256 bool array] * 78 : [256x256 bool array] * 79 : [256x256 bool array] * positions : [80x2 float64 array] * iterable : list(80) * id3V4AQCNFO0 : dict(4) * index : 0 * data : [256x256 int32 array] * position : [array = [0. 0.]] * mask : [256x256 bool array] * id3V4ANG90M0 : dict(4) * index : 1 * data : [256x256 int32 array] * position : [array = [0. 0.00041562]] * mask : [256x256 bool array] * id3V4ANH7UC0 : dict(4) * index : 2 * data : [256x256 int32 array] * position : [array = [0.00039528 0.00012844]] * mask : [256x256 bool array] * id3V4ANI69A0 : dict(4) * index : 3 * data : [256x256 int32 array] * position : [array = [ 0.0002443 -0.00033625]] * mask : [256x256 bool array] * id3V4ANI6B60 : dict(4) * index : 4 * data : [256x256 int32 array] * position : [array = [-0.0002443 -0.00033625]] * mask : [256x256 bool array] * ... : .... >>> print(u.verbose.report(NPS.auto(80), noheader=True)) * id3V4ANI6C80 : dict(3) * common : ptypy.utils.parameters.Param(8) * version : 0.1 * num_frames : 116 * label : None * shape : [array = [256 256]] * psize : [array = [0.000172 0.000172]] * energy : 7.2 * center : [array = [128. 128.]] * distance : 7.19 * chunk : ptypy.utils.parameters.Param(6) * indices : list(36) * id2M979S9BC8 : 80 * id2M979S9BD8 : 81 * id2M979S9BE8 : 82 * id2M979S9BF8 : 83 * id2M979S9BG8 : 84 * ... : .... * indices_node : list(36) * id2M979S9BC8 : 80 * id2M979S9BD8 : 81 * id2M979S9BE8 : 82 * id2M979S9BF8 : 83 * id2M979S9BG8 : 84 * ... : .... 
* num : 1 * data : dict(36) * 80 : [256x256 int32 array] * 81 : [256x256 int32 array] * 82 : [256x256 int32 array] * 83 : [256x256 int32 array] * 84 : [256x256 int32 array] * 85 : [256x256 int32 array] * 86 : [256x256 int32 array] * 87 : [256x256 int32 array] * 88 : [256x256 int32 array] * 89 : [256x256 int32 array] * 90 : [256x256 int32 array] * 91 : [256x256 int32 array] * 92 : [256x256 int32 array] * 93 : [256x256 int32 array] * 94 : [256x256 int32 array] * 95 : [256x256 int32 array] * 96 : [256x256 int32 array] * 97 : [256x256 int32 array] * 98 : [256x256 int32 array] * 99 : [256x256 int32 array] * 100 : [256x256 int32 array] * 101 : [256x256 int32 array] * 102 : [256x256 int32 array] * 103 : [256x256 int32 array] * 104 : [256x256 int32 array] * 105 : [256x256 int32 array] * 106 : [256x256 int32 array] * 107 : [256x256 int32 array] * 108 : [256x256 int32 array] * 109 : [256x256 int32 array] * 110 : [256x256 int32 array] * 111 : [256x256 int32 array] * 112 : [256x256 int32 array] * 113 : [256x256 int32 array] * 114 : [256x256 int32 array] * 115 : [256x256 int32 array] * weights : dict(36) * 80 : [256x256 bool array] * 81 : [256x256 bool array] * 82 : [256x256 bool array] * 83 : [256x256 bool array] * 84 : [256x256 bool array] * 85 : [256x256 bool array] * 86 : [256x256 bool array] * 87 : [256x256 bool array] * 88 : [256x256 bool array] * 89 : [256x256 bool array] * 90 : [256x256 bool array] * 91 : [256x256 bool array] * 92 : [256x256 bool array] * 93 : [256x256 bool array] * 94 : [256x256 bool array] * 95 : [256x256 bool array] * 96 : [256x256 bool array] * 97 : [256x256 bool array] * 98 : [256x256 bool array] * 99 : [256x256 bool array] * 100 : [256x256 bool array] * 101 : [256x256 bool array] * 102 : [256x256 bool array] * 103 : [256x256 bool array] * 104 : [256x256 bool array] * 105 : [256x256 bool array] * 106 : [256x256 bool array] * 107 : [256x256 bool array] * 108 : [256x256 bool array] * 109 : [256x256 bool array] * 110 : [256x256 bool array] * 111 : [256x256 bool array] * 112 : [256x256 bool array] * 113 : [256x256 bool array] * 114 : [256x256 bool array] * 115 : [256x256 bool array] * positions : [36x2 float64 array] * iterable : list(36) * id3V4ANI6BQ0 : dict(4) * index : 80 * data : [256x256 int32 array] * position : [array = [0.0018532 0.0016686]] * mask : [256x256 bool array] * id3V4ANGAKA0 : dict(4) * index : 81 * data : [256x256 int32 array] * position : [array = [0.0021597 0.0012469]] * mask : [256x256 bool array] * id3V4ANI6D60 : dict(4) * index : 82 * data : [256x256 int32 array] * position : [array = [0.0023717 0.00077061]] * mask : [256x256 bool array] * id3V4AQEBFK0 : dict(4) * index : 83 * data : [256x256 int32 array] * position : [array = [0.0024801 0.00026067]] * mask : [256x256 bool array] * id3V4ANG7O20 : dict(4) * index : 84 * data : [256x256 int32 array] * position : [array = [ 0.0024801 -0.00026067]] * mask : [256x256 bool array] * ... : .... We observe that the second chunk was not 80 frames deep but only 36, as we only had 116 frames of data. So where is the *.ptyd* data file? By default, PtyScan does not actually save data. We have to activate saving manually in the input parameters.
:: >>> data = NPS.DEFAULT.copy(depth=2) >>> data.save = 'append' >>> NPS = NumpyScan(pars=data) >>> NPS.initialize() :: >>> for i in range(50): >>> msg = NPS.auto(20) >>> if msg == NPS.EOS: >>> break >>> We can analyse the saved ``npy.ptyd`` with :py:func:`~ptypy.io.h5IO.h5info` :: >>> from ptypy.io import h5info >>> print(h5info(NPS.info.dfile)) File created : Mon Mar 11 09:53:29 2024 * chunks [dict 6]: * 0 [dict 4]: * data [20x256x256 int32 array] * indices [list = [0.000000, 1.000000, 2.000000, 3.000000, ...]] * positions [20x2 float64 array] * weights [20x256x256 bool array] * 1 [dict 4]: * data [20x256x256 int32 array] * indices [list = [20.000000, 21.000000, 22.000000, 23.000000, ...]] * positions [20x2 float64 array] * weights [20x256x256 bool array] * 2 [dict 4]: * data [20x256x256 int32 array] * indices [list = [40.000000, 41.000000, 42.000000, 43.000000, ...]] * positions [20x2 float64 array] * weights [20x256x256 bool array] * 3 [dict 4]: * data [20x256x256 int32 array] * indices [list = [60.000000, 61.000000, 62.000000, 63.000000, ...]] * positions [20x2 float64 array] * weights [20x256x256 bool array] * 4 [dict 4]: * data [20x256x256 int32 array] * indices [list = [80.000000, 81.000000, 82.000000, 83.000000, ...]] * positions [20x2 float64 array] * weights [20x256x256 bool array] * 5 [dict 4]: * data [16x256x256 int32 array] * indices [list = [100.000000, 101.000000, 102.000000, 103.000000, ...]] * positions [16x2 float64 array] * weights [16x256x256 bool array] * info [dict 23]: * add_poisson_noise [scalar = False] * auto_center [scalar = False] * base_path [string = "b'/tmp/ptypy/sim/'"] * center [array = [128. 128.]] * chunk_format [string = "b'.chunk%02d'"] * dfile [string = "b'/tmp/ptypy/sim/npy.ptyd'"] * distance [scalar = 7.19] * energy [scalar = 7.2] * experimentID [None] * label [None] * load_parallel [string = "b'data'"] * min_frames [scalar = 1] * name [string = "b'numpyscan'"] * num_frames [None] * orientation [None] * positions_scan [116x2 float64 array] * positions_theory [None] * psize [scalar = 0.000172] * rebin [scalar = 1] * save [string = "b'append'"] * shape [array = [256 256]] * version [scalar = 0.1] * weight2d [scalar = True] * meta [dict 8]: * center [array = [128. 128.]] * distance [scalar = 7.19] * energy [scalar = 7.2] * label [None] * num_frames [scalar = 116] * psize [array = [0.000172 0.000172]] * shape [array = [256 256]] * version [scalar = 0.1] None Listing the new subclass ------------------------ In order to make the subclass available in your local |ptypy|, navigate to ``[ptypy_root]/ptypy/experiment`` and paste the content into a new file ``user.py``:: $ touch [ptypy_root]/ptypy/experiment/user.py Append the following lines to ``[ptypy_root]/ptypy/experiment/__init__.py``:: from .user import NumpyScan PtyScanTypes.update({'numpy':NumpyScan}) Now, your new subclass will be used whenever you pass ``'numpy'`` for the :py:data:`.scan.data.source` parameter. All special parameters of the class should be passed via the dict :py:data:`.scan.data.recipe`. .. [h5py] http://www.h5py.org/ .. [HDF] **H**\ ierarchical **D**\ ata **F**\ ormat, https://www.hdfgroup.org/
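Since a ``*.ptyd`` file is an ordinary *hdf5* file, its content can also be inspected directly with *h5py* [h5py]_. The following is a minimal sketch; it assumes the file written in this tutorial, ``/tmp/ptypy/sim/npy.ptyd``, and the ``meta/`` / ``chunks/`` layout described in :ref:`ptyd_file`.

::

    import h5py

    # open the prepared data file written in this tutorial
    with h5py.File('/tmp/ptypy/sim/npy.ptyd', 'r') as f:
        print(list(f['meta'].keys()))      # general and geometric parameters
        print(list(f['chunks'].keys()))    # one group per prepared chunk
        data = f['chunks/0/data']
        print(data.shape, data.dtype)      # e.g. (20, 256, 256) int32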