Opened 16 years ago

Closed 13 years ago

#133 closed defect (fixed)

udunits crashes with efc

Reported by: Dave Offiler Owned by: Dave Offiler
Priority: major Milestone: 3.0
Component: ropp_io Version: 1.1
Keywords: udunits, units conversion Cc: Huw Lewis

Description

On tx01 with efc/ecc compilers, ROPP_IO tool ucar2ropp crashes with:

bash-2.05$ ../tools/ucar2ropp ../data/ucar_test1.nc
Reading ../data/ucar_test1.nc...
ucar2ropp(24215): unaligned access to 0x60000fffffff4d84, ip=0x40000000007464b0
forrtl: severe (174): SIGSEGV, segmentation fault occurred

I have tracked this down to the low-level udunits interface utcvt(), called by Chris' udunits F90 library in utcvt90(). The crash happens whenever udunits is called upon to do a real conversion; conversions between like units are trapped and never reach the low-level library. A possible work-around would be to hard-wire conversions to ROPP default units in the UCAR netCDF reader (part of ropp_io_read_netcdf_get.f90), thus by-passing the problem. NB the crash occurs whenever udunits performs a true conversion, whether on read or write. The other *2ropp tests work because they involve no true units conversions. Fixing ucar2ropp alone would not help e.g. ROPP_FM, where unit conversions do take place.

Since this is clearly an efc/ecc problem (it doesn't occur on any other platform/compiler combo), it may well be related to the 64-bit architecture of the TX, so should be solved accordingly.

Looking at the udunits documentation, the Fortran/C interface stores C pointers in Fortran default INTEGERs; quite likely there is a mismatch between storage lengths. So far, though, I've not found any combination of relevant compiler switches that cures the problem without causing new problems elsewhere (e.g. using -i8 everywhere causes file open failures in the MetDB BUFR 'C' write routine).

Given the short remaining lifetime for the NEC SX supercomputers - and their TX front-ends - it may not be worthwhile spending too much effort on this, and in any case not to bust a gut for v1.2.

Change history (9)

comment:1 by (none), 16 years ago

Milestone: 2.0

Milestone 2.0 deleted

comment:2 by Huw Lewis, 16 years ago

Milestone: 3.0

comment:3 by Dave Offiler, 16 years ago

On my personal AMD64 PC (running 64-bit OpenSUSE 11.0 Linux and ifort 10.1 / G95 0.91 / GFortran 4.4.0 compilers) ifort gives a 'bus error' and gfortran a 'segmentation fault' when running ucar2ropp (though g95 works ok). It seems likely (and logical) that this behaviour is consistent with, and has the same cause as, the NEC (64-bit) efc crash first noted in this ticket.

During Beta testing of ROPP v2.0, Josep Apparicio (EC) has reported a segmentation fault using GFortran (unspecified architecture). We need to ask him if his machine is also 64-bit.

Since this problem now appears to occur on at least one (and possibly two) other 64-bit machine(s), and not exclusively on the NEC TX, this issue needs further investigation for a solution asap.

If no solution is found before v2.0 is released, the documentation must clearly record that 64-bit architectures are not supported for any code utilising the udunits C/Fortran interface.

comment:4 by Dave Offiler, 16 years ago

Recall original failure from ucar2ropp is:

ucar2ropp(20791): unaligned access to 0x60000fffffff4e14, ip=0x4000000000686bb0
forrtl: severe (174): SIGSEGV, segmentation fault occurred

Further tests on TX01:

1) There do not appear to be any ecc compiler switches which force C pointers to be 4-byte on this 64-bit machine. The trivial test program (adapted from the netCDF configure script):

#include <stdio.h>
int main()
{
  /* sizeof yields a size_t, so cast it to match the %ld format */
  printf("%ld\n", (long) sizeof(char*));
  return 0;
}

gives '8' whatever (un)likely ecc command line options are given. -auto-ilp32 is the only option that might have any relevance, and that doesn't change anything in the current context. gcc and cc do not appear to have any options either.

[The above program reports '4' on the Dell 32-bit desktop].

Conclusion: forcing C to 4-byte pointers to be compatible with Fortran default 4-byte integers is not an option.

2) By default, efc INTEGER types are 4-byte. Can we make Fortran run with default 8-byte INTEGERs? Yes, with -integer_size 64

So - build netCDF and BUFR (udunits has no Fortran) with this switch.

  • netCDF builds & runs all tests successfully
  • BUFR compiles, but gives run-time errors when calling the MetDB C I/O routines (not using default ints?)

So rebuilt BUFR without this switch (test ok again)

Built ropp_utils and ropp_io successfully with 64-bit option. Not unexpectedly, ropp2bufr (64-bit INTs) fails when calling the MetDB encode routine (now 32-bit INTs). Didn't bother to run ucar2ropp since a working BUFR encoder has priority.

Conclusion: forcing Fortran default INTs to 8-byte to be compatible with the default size of C pointers is also not an option.

3) Make the interfacing Fortran INTs explicitly the size of C pointers (ie INTEGER*8 for 64-bit platforms)

First rebuilt netCDF, BUFR & ropp_units with default settings. Edited the udunits API interface ropp_io/udunits/udunits.m4 [previously identified as where the problem lies] and under Section 3, for line:

    integer                         :: funit, tunit

change to:

    integer*8                       :: funit, tunit

then build ropp_io.

Running ucar2ropp now gives:

forrtl: severe (174): SIGSEGV, segmentation fault occurred

so at least this has removed the 'unaligned access' error. Could the SIGSEGV be a different problem? Does this edit do anything on other 64-bit platforms which only reported a SIGSEGV?

The saga continues....

comment:5 by Dave Offiler, 16 years ago

More digging...

The MetDB_c_utils.c file in the MetDB BUFR library package has a compile-time option to declare ints as long and floats as double if 'L64' is defined (the defaults being standard ints and floats).

1) Added -DL64 to CFLAGS and -integer_size 64 to FFLAGS for the LINUX-EFC entry in f90_compilers.dat and rebuilt the BUFR package - all tools build & run ok.

2) Included -integer_size 64 in netcdf_configure_efc_linux & ropp_configure_efc_linux and rebuilt everything with build_deps & build_ropp. All builds and all tools run without error.

However, the make test in ropp_io fails in the BUFR & ucar2ropp test with differences reported in the cdl files. On inspection, ncdump has generated values with many more digits than usual (i.e. on 32-bit platforms), and there are differences only in the last few decimal places - not significant for our purposes.

Looked into ncdump and modified the test scripts (and gentestfiles.sh) to generate cdl files with -p 5,5 (use only 5 significant digits for float & double). This is sufficient to check for erroneous values, but may need some optimisation to check for true loss of precision. Re-running make tests now shows all tests as PASS.

All changes checked into SVN; changed ropp_configure_efc_linux file copied into ropp_test and checked in from there. Now over to Mike to run the Test Folder system on the NEC as a final test before closing this ticket.

NB: the above solution for the NEC (build everything with 64-bit pointers & ints) works for this particular platform & compilers (pending confirmation from Mike's testing), but may not be a general solution. More investigation needed to check possibilities on my AMD64 and perhaps devise a method of auto-detection (e.g. using the above bit of C code) to set the correct flags. This generic issue is passed on to Ticket #152.

comment:6 by Dave Offiler, 16 years ago

Resolution: fixed
Status: new → closed

Test Folder test MT-IO-04 (ucar2ropp test) on NEC with efc is: PASS

All other tests also PASS.

See: http://www-nwp/~frir/ropp_test/html/versions/2.0/efc_linux.html

Closing this Ticket (but the issue might still come up in #152 in seeking a generic solution).

comment:7 by Ian Culverwell, 13 years ago

Resolution: fixed
Status: closed → reopened

comment:8 by Ian Culverwell, 13 years ago

Owner: set to Dave Offiler
Status: reopened → assigned

comment:9 by Ian Culverwell, 13 years ago

Resolution: fixed
Status: assigned → closed