Adol-C Notes

From Autodiff
Jump to: navigation, search

Notes for Adol-C development

The following is a list of desired features in no particular order with notes on implementation issues.

Terminology

The distinction between the terms trace,tape,taylors (meaning stored Taylor coefficients for some later reverse sweep) is muddled in the interface names, the source code and the manual (see e.g. tape_doc, trace_on, trace_off, tape_stats, ...). The cleaned up version should be like this:

  • trace : the operations, constant values etc. optionally including the Taylor coefficients; this seems to be the cleanest term because it is trace of the execution
  • tape : ideally that should not be used anymore because if confuses things and if we mean the Taylor coefficients by it then we already said that the Taylor coefficients can be associated with the trace
  • taylors : the Taylor coefficients which can be associated with the trace.

Extended Trace(Tape)less Forward

Given the above terminology this should now be called traceless forward.

Traceless Forward with Taylors

We can think about successive forward sweeps with increments in the Taylor polynomial order (similar to the "Diamant" tool a la Charpentier et al.). Therefore we could have "taylors" and reread them instead of recomputing them all without having a trace. This is probably not hugely important because of the following counter arguments:

  • Recomputing is often cheaper than storing and retrieving from memory
  • Persistent storage and subsequence retrieval of the taylors requires memory management a la 'locint' i.e. overhead

User interface

There has always been an annoying disconnect in the usage pattern of the traceless and the regular variant with respect to seeding and harvesting. Moreover, one can certainly think about writing the trace and simultaneously propagate and store taylors such that afterwards one doesn't need an extra forward sweep but can invoke the reverse sweep right away.(see #Combine trace and taylors) If we agree to that it would be sensible to NOT suggest a different seeding style by setting the individual inputs but instead do something like this for the traceless version:

Adolc::TraceLess& tl(TraceLess::instance()); // this kind of interface permits a thread safe implementation

Adolc::Matrix& seedM(tl.getSeedMatrix(n,d)); // d= number of directions 

// add some code to set the seedM entries

// --- here starts the function
// every <<= operator takes a new row from seedM 
// that is held by 'tl' which is either a global instance or an instance per thread 
// there will be an exception thrown if the operator is invoked more than n times
// Also one can control the allocation for the d directions via tl

// every >>= operator will write a new row to a Harvest  matrix
// --- here ends the function  

Adolc::Matrix& harvestM(tl.getHarvest()); // will give out an m x d matrix
// the method called above will throw an exception if the <<= operator was invoked less than n times. 

m=tl.getDependentCount(); // inquiry if one does not know m

For a suggestion about a new and improved C++ interface for making a trace see #C++ interface. The goal should be that the concepts of seed/harvest matrices are expressed similarly in both the traceless and tracefull variant.

Code sharing with uni5_for.c

Currently the code in uni5_for.c is the basis for all of former 5 forwards variants (zos,fos,hos,fov,hov) but it is infested with C preprocessor macros (see also ). For the near term we can't do much about the macros but we should try to share the logic between uni5_for.c the traceless forward operators and also the operators that do combined tracing and propagation (see ). The former will only operate on the storage directly associated with the adouble instances. The latter can operate on a workarray because the trace needs the locints anyway. So the difference is how the pointers to the Taylor arrays are set. There is also a question if we should try to cover that bitmap propagations. For example the * operator:

            case mult_a_a:               /* Multiply two adoubles (*)    mult_a_a */
                arg1 = get_locint_f();
                arg2 = get_locint_f();
                res  = get_locint_f();

                IF_KEEP_WRITE_TAYLOR(res,keep,k,p)

#if defined(_INDO_)
                combine_2_index_domains(res, arg1, arg2, ind_dom);
#if defined(_NONLIND_)
		extend_nonlinearity_domain_binary(arg1, arg2, ind_dom, nonl_dom);
#endif
#else
#if !defined(_ZOS_) /* BREAK_ZOS */
                ASSIGN_T(Tres,  TAYLOR_BUFFER[res])
                ASSIGN_T(Targ1, TAYLOR_BUFFER[arg1])
                ASSIGN_T(Targ2, TAYLOR_BUFFER[arg2])

#ifdef _INT_FOR_
                FOR_0_LE_l_LT_p
                TRES_FOINC = TARG2_INC | TARG1_INC;
#else
                /* olvo 980915 now in reverse order to allow x = x*x etc. */
                INC_pk_1(Tres)
                INC_pk_1(Targ1)
                INC_pk_1(Targ2)

                FOR_p_GT_l_GE_0
                FOR_k_GT_i_GE_0
                { TRES_FODEC = dp_T0[arg1]*TARG2_DEC +
                               TARG1_DEC*dp_T0[arg2];
                  DEC_TRES_FO
#if defined(_HIGHER_ORDER_)
                  Targ1OP = Targ1-i+1;
                  Targ2OP = Targ2;

                  for (j=0;j<i;j++) {
                  *Tres += (*Targ1OP++) * (*Targ2OP--);
                  }
                  Tres--;
#endif /* _HIGHER_ORDER_ */
            }
#endif
#endif
#endif /* ALL_TOGETHER_AGAIN */
#if !defined(_NTIGHT_)
                dp_T0[res] = dp_T0[arg1] *
                                               dp_T0[arg2];
#endif /* !_NTIGHT_ */
                break;

This needs to be adapted to be usable in an operator. The changes for the tracing operator are simpler and only require to obtain arg1, arg2 and res differently, e.g.:

friend adub operator * ( const badouble& lOp, const badouble& rOp ) { 
  arg1=lOp.location;
  arg2=rOp.location;
  res=next_loc();
 // here comes the code as above assuming the same arrays are set up for the macros to work
  ...
  return res; // to construct the adub.
}

The known issues are

  • need to make sure the work arrays (dp_T0, db_T) are set up like in uni5_for.c
  • need to preprocess into the 5 variants (like forward) that need to be discriminated e.g. by namespace but are otherwise identical to plain adoubles.

For the traceless versions the differences are more complicated because we do not have locations. Assume the Taylor coefficients are kept in member adval.

adouble adouble::operator * (const adouble& a) const {
    adouble tmp;
    Tres=tmp.adval;
    Targ1=adval;
    Targ2=a.adval;

    // modified operator body where direct access to dp_T0 (or dp_T) are done via yet another macro.

    tmp.val=val*a.val;
    return tmp;
}

The known issues (necessary modifications) are

  • in the uni5_for code there is still direct access to dp_T0[res] which won't work in the operator context; Unfortunately this will require yet another macro.
  • because the sparsity propagation interface takes the locations as arguments there is no simple way to make this work for the traceless version because we don't necessarily have locations.

Dynamic vs. Static Allocation

The current implementation (first order / vector) uses a fixed (maximal) number of directions that is compiled into the library with the following pros and cons:

  • fixed number of directions avoids overhead of dynamic allocation
  • has memory overhead (including fetching unused data into the cache etc.) when not all directions are needed
  • overhead is even more pronounced when one picks directions and derivative order.
  • could recompile the library every time some new directions/order is needed but if one is willing to do that they could go all the way and instead use the generated Rapsodia libraries instead.
  • could use dynamically allocated memory instead but plain version is going to be slow
  • use home-grown allocator instead (see below)

Dynamic Memory Management

For a home-grown allocator to work well it is most usefull to assume a scheme similar to the location scheme where contiguous tails are being free'd for reuse. That in turn requires the more complicated class structure with the adubs but also allows to hand over the taylor coefficients (and values) instead of deep copying them for the unnamed temporaries. It is unclear if the overhead of the memory management is less than the overhead of fixed but possibly unused coefficients. Most likely a recompiled library for the exact number of desired directions/order would perform best but then again one might want to use Rapsodia anyway.

Complex Arithmetic

The issues with complex arithmetic are listed below.

  • the differentiation of various intrinsics are described in a tech memo
  • mixing complex with real arithmetic and unclear consequence of extracting the real or imaginary part of a complex number when the corresponding imaginary/real part is not 0 (the user should ascertain) and constant(we need to ascertain). There is no good way to automatically discriminate the benign from the dangerous cases.
  • the implementation (see below)

Implementation as acomplex vs std::complex<adouble>

There is a native complex data type in C99 but in C++ there is only a complex template class. In the GNU implementation the std::complex makes use of the C99 complex type. In the decision of implementing an acomplex vs instantiating another template version the following issues have to be considered.

  • for an acomplex type there are many combinations of operator arguments that will have to be defined between adouble/acomplex and the respective base classes badouble/bacomplex and presumably also adub/acomp for unnamed temporaries. That inflation of interfaces is going to be unwieldy and it is easy to miss some combination that then may lead to unnecessary type promotions.
  • An acomplex type permits a more efficient representation of the derivatives where Cauchy-Riemann is satisfied
  • A std::complex<adouble> is an easy way to get complex coverage at the expense of an naive treatment of complex numbers as "unrelated" adouble pairs.
  • The template instantiation however is not complete for all operartor combinations as can be tested with the example below.
    #include <cmath>
    #include <complex>
    #include <adolc.h>
    
    typedef std::complex<adouble> acomplex;
    
    acomplex foo(acomplex a,adouble b) { 
      acomplex y;
      y=sin(b*a); // this is OK, has a template operator defined
      y=3.0*y; // but this is not (btw   y=y*3.0 isn't OK either)
      return y;
    }
    

    The GNU compiler and Intel compiler have the same problem that originates with the std::complex template class where there are no definitions for the mixed type operators only for the specialization of complex<float> and complex<double>. Otherwise the library contains only the defintions

      template<typename _Tp>    inline complex<_Tp>    operator*(const complex<_Tp>& __x, const complex<_Tp>& __y)
      template<typename _Tp>    inline complex<_Tp>    operator*(const complex<_Tp>& __x, const _Tp& __y)
      template<typename _Tp>    inline complex<_Tp>    operator*(const _Tp& __x, const complex<_Tp>& __y)
    

    i.e. is missing the combinations with the built in types. For template functions like the above, the C++ standard prohibits to match the arguments by type conversion (e.g. of the 3.0 constant literal to an adouble which then would have a defined operator). The problem lies with the compiler trying to deduce the template type _Tp from "3.0*y" where based in "3.0" _Tp would be deduced as being "double" while deducing it from "y" yields "adouble" to make an instantiation of the function template. The instantiation however has to be unambiguous and currently there is no mechanism described in the standard that permits implicit type conversion for the type deduction. Consequently the above pattern will not work for any other type supplied to the complex class except the ones for which explicit specializations exist. A simple way out of the dilemma is to declare an operator like this (here in the style of the GNU implementation):

    template<typename _Tp, typename _Tp2>
    inline std::complex<_Tp>
    operator*(const _Tp2& __x, const std::complex<_Tp>& __y)
    {
      return complex<_Tp>(__x*__y.real(),__x*__y.imag());
    }
    

    Instead of the template type _Tp one could, for example, also write it directly in terms of the acomplex type. Note, that there have to be definitions for all binary operators and the variant where the right operand has _Tp2 type. The potential for ambiguous definitions when _Tp2 is identical with _Tp is avoided because "partial ordering rules" are applied that determine that e.g. operator*(const _Tp& __x, const complex<_Tp>& __y) is more specialized than operator*(const _Tp2& __x, const std::complex<_Tp>& __y) and therefore is to be preferred (and not ambiguous) for "b*a". There can however be some loss of precision if the complex type argument _Tp has less precision than the _Tp2 type in which case the resulting complex should be constructed as std::complex<_Tp2> instead of the above. I don't know of a way to express that in the template class and this is likely the reason why the above operators are not defined in the standard implementation and it shows some issues that arise with the use of templated containers for numerical computations. Unless somebody declares an "afloat" or uses "adouble" together with "long double" that shouldn't be an issue in practice.

Alternative Active Types

There are options for many variants of the active type. They all share the same set of overloaded operators but aside from that have different definitions in those operators. Particularly if one considers and extension of the operator definitions for an acomplex class that structure is becoming bigger and more unwieldy. It is desirable to share the operator declarations between the variants. Similarly to the variants arising from reusing the uni5_for code snippets for the tracing and traceless forward, it would probably be easiest from a maintenance point of view to simply include the operator bodies from sets of files that are exchangeable.

Combine trace and taylors

If we write the trace and simultaneously proagate and store taylors then afterwards one doesn't need an extra forward sweep over the trace but can invoke the reverse sweep right away. If ADOL-C internal design is done well this should be very easy to implement and would be more consistent in the proposed modified user interface and also may save some runtime.

Berz Remainder Term

replicate cosy infinity's univariate Taylor series with interval remainder term approach.

Constraint Propagation

use the operators to generate a DAG on which the constraint propagation is carried out.

C++ interface

The current user interfaces are dissatisfying because

  • none of them show any object oriented approach despite the tool relying on C++
  • even the C interfaces are muddled (see above the terminology mess)

Because being callable from Fortran is important, the C interface will stay as they are with the possible caveat of keeping the "check-only" arguments like m,n, etc. solely for the Fortran interface but in C. The current so-called "C++" interfaces should be replaced. The C++ interfaces can be made more plausible to the user if we can have the variant suggested at #Combine trace and taylors. Then a user would code something like this:

Adolc::TraceMaker& tm(TraceMaker.instance()); // this kind of interface permits a thread safe implementation

Adolc::Matrix& seedM(tm.getSeedMatrix(n,d)); // d= number of directions
tm.taylorWrite(1); // writes Taylor coefficients up to order 1, also impacts the order loop bound in the operators

// add some code to set the seedM entries

// --- here starts the function
// every <<= operator takes a new row from seedM 
// that is held by 'tm' which is either a global instance or an instance per thread 
// there will be an exception thrown if the operator is invoked more than n times
// Also one can control the allocation for the d directions via tm

// every >>= operator will write a new row to a Harvest  matrix
// --- here ends the function  

Adolc::Matrix& harvestM(tm.getHarvest()); // will give out an m x d matrix
// the method called above will throw an exception if the <<= operator was invoked less than n times. 

m=tm.getDependentCount(); // inquiry if one does not know m

// now we can do a reverse sweep right away, for instance like this
Adolc::Tensor& weights(tm.getWeightsTensor()) // the tm instance already knows d, m and the order of Taylor coefficients we kept.
// add some code to set the entries in weights
Adolc::Tensor& soCoeffs(tm.reverse()) // no arguments needed, everything is already known. 
// user accesses soCoeffs.

So, in short there is TraceMaker to make a Trace instance and then the Trace object gets all the info to run over the trace. Of course this is not the complete interface. There can be extra parameters to give specific file names

Getting Rid of the MACRO_MESS

For ISSM

moved to ISSM