Opened 16 years ago
Closed 13 years ago
#133 closed defect (fixed)
udunits crashes with efc
Reported by: | Dave Offiler | Owned by: | Dave Offiler |
---|---|---|---|
Priority: | major | Milestone: | 3.0 |
Component: | ropp_io | Version: | 1.1 |
Keywords: | udunits, units conversion | Cc: | Huw Lewis |
Description
On tx01 with efc/ecc compilers, ROPP_IO tool ucar2ropp crashes with:
bash-2.05$ ../tools/ucar2ropp ../data/ucar_test1.nc
Reading ../data/ucar_test1.nc...
ucar2ropp(24215): unaligned access to 0x60000fffffff4d84, ip=0x40000000007464b0
forrtl: severe (174): SIGSEGV, segmentation fault occurred
I have tracked this down to the low-level udunits interface utcvt() called by Chris' udunits F90 library in utcvt90(). The crash happens whenever udunits is called upon to do a real conversion; conversions involving like units are trapped and do not call the low-level library. A possible work-around could be to hard-wire conversions to ROPP default units in the UCAR netCDF reader (part of ropp_io_read_netcdf_get.f90), thus bypassing the problem. NB the crash occurs whenever udunits is called, whether on read or write. The other *2ropp tests work because no true units conversions are being done. Fixing ucar2ropp alone would not help elsewhere, e.g. in ROPP_FM, where unit conversions do take place.
Since this is clearly an efc/ecc problem (it doesn't occur on any other platform/compiler combo), it may well be related to the 64-bit architecture of the TX, so should be solved accordingly.
Looking at the udunits documentation, the Fortran/C interface stores C pointers in Fortran default INTEGERs; quite likely there is a mismatch between storage lengths. But so far, I've not found any combination of plausibly related compiler switches which cures the problem without causing new problems elsewhere (e.g. using -i8 everywhere causes file open failures in the MetDB BUFR 'C' write routine).
Given the short remaining lifetime for the NEC SX supercomputers - and their TX front-ends - it may not be worthwhile spending too much effort on this, and in any case not to bust a gut for v1.2.
Change history (9)
comment:1 by , 16 years ago
Milestone: | 2.0 |
---|
comment:2 by , 16 years ago
Milestone: | → 3.0 |
---|
comment:3 by , 16 years ago
On my personal AMD64 PC (running 64-bit OpenSUSE 11.0 Linux and ifort 10.1 / G95 0.91 / GFortran 4.4.0 compilers) ifort gives a 'bus error' and gfc a 'segmentation fault' when running ucar2ropp (though g95 works ok). It seems likely (and logical) that this behaviour is consistent with, and has the same cause as, the NEC (64-bit) efc crash first noted in this ticket.
During Beta testing of ROPP v2.0, Josep Apparicio (EC) has reported a segmentation fault using GFortran (unspecified architecture). We need to ask him if his machine is also 64-bit.
Since this problem now appears to occur on at least one (and possibly two) other 64-bit machine(s), and not exclusively on the NEC TX, this issue needs further investigation for a solution asap.
If no solution is found before v2.0 is released, the documentation must clearly record that 64-bit architectures are not supported for any code utilising the udunits C/Fortran interface.
comment:4 by , 16 years ago
Recall original failure from ucar2ropp is:
ucar2ropp(20791): unaligned access to 0x60000fffffff4e14, ip=0x4000000000686bb0
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Further tests on TX01:
1) There do not appear to be any ecc compiler switches which force C pointers to be 4-byte on this 64-bit machine. The trivial test program (adapted from the netCDF configure script):
#include <stdio.h>
int main(void) { printf("%ld\n", (long) sizeof(char*)); return 0; }
gives '8' whatever (un)likely ecc command line options are given. -auto-ilp32 is the only option that might have any relevance, and that doesn't change anything in the current context. gcc and cc do not appear to have any options either.
[The above program reports '4' on the Dell 32-bit desktop].
Conclusion: forcing C to 4-byte pointers to be compatible with Fortran default 4-byte integers is not an option.
2) By default, efc INTEGER types are 4-byte. Can we make Fortran run with default 8-byte INTEGERs? Yes, with -integer_size 64
So - build netCDF and BUFR (udunits has no Fortran) with this switch.
- netCDF builds & runs all tests successfully
- BUFR compiles, but gives run-time errors when calling the MetDB C I/O routines (not using default ints?)
So rebuilt BUFR without this switch (test ok again)
Built ropp_utils and ropp_io successfully with 64-bit option. Not unexpectedly, ropp2bufr (64-bit INTs) fails when calling the MetDB encode routine (now 32-bit INTs). Didn't bother to run ucar2ropp since a working BUFR encoder has priority.
Conclusion: forcing Fortran default INTs to 8-byte to be compatible with the default size of C pointers is also not an option.
3) Make the interfacing Fortran INTs explicitly the size of C pointers (ie INTEGER*8 for 64-bit platforms)
First rebuilt netCDF, BUFR & ropp_utils with default settings. Edited the udunits API interface ropp_io/udunits/udunits.m4 [previously identified as where the problem lies] and under Section 3, for line:
integer :: funit, tunit
change to:
integer*8 :: funit, tunit
then build ropp_io.
Running ucar2ropp now gives:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
so at least this has removed the 'unaligned access' error. Could the SIGSEGV be a different problem? Does this edit do anything on other 64-bit platforms which only reported a SIGSEGV?
The saga continues....
comment:5 by , 16 years ago
More digging...
The MetDB_c_utils.c in the MetDB BUFR library package has a compile-time option to typedef ints as long and floats as double if 'L64' is defined (the defaults being standard ints and floats).
1) Added -DL64 to CFLAGS and -integer_size 64 to FFLAGS for the LINUX-EFC entry in f90_compilers.dat and rebuilt the BUFR package - all tools build & run ok.
2) Included -integer_size 64 in netcdf_configure_efc_linux & ropp_configure_efc_linux and rebuilt all with build_deps & build_ropp. All builds and all tools run without error.
However, the make test in ropp_io fails in the BUFR & ucar2ropp tests, with differences reported in the cdl files. On inspection, ncdump has generated values with many more digits than usual (i.e. than on 32-bit platforms), and there are differences only in the last few decimal places - not significant for our purposes.
Looked into ncdump and modified the test scripts (and gentestfiles.sh) to generate cdl files with -p 5,5 (use only 5 significant digits for float & double). This is sufficient to catch erroneous values, but may need some optimisation to check for true loss of precision. Re-running make test now shows all tests as PASS.
All changes checked into SVN; changed ropp_configure_efc_linux file copied into ropp_test and checked in from there. Now over to Mike to run the Test Folder system on the NEC as a final test before closing this ticket.
NB: the above solution for the NEC (build everything with 64-bit pointers & ints) works for this particular platform & compilers (pending confirmation from Mike's testing), but may not be a general solution. More investigation needed to check possibilities on my AMD64 and perhaps devise a method of auto-detection (e.g. using the above bit of C code) to set the correct flags. This generic issue is passed on to Ticket #152.
comment:6 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Test Folder test MT-IO-04 (ucar2ropp test) on NEC with efc is: PASS
All other tests also PASS.
See: http://www-nwp/~frir/ropp_test/html/versions/2.0/efc_linux.html
Closing this Ticket (but the issue might still come up in #152 in seeking a generic solution).
comment:7 by , 13 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
comment:8 by , 13 years ago
Owner: | set to |
---|---|
Status: | reopened → assigned |
comment:9 by , 13 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Milestone 2.0 deleted