Opened 18 years ago

Closed 18 years ago

Last modified 16 years ago

#71 closed defect (fixed)

Memory errors in ropp2bufr

Reported by: Dave Offiler
Owned by: Dave Offiler
Priority: critical
Milestone:
Component: ropp_io
Version: 0.8
Keywords:
Cc:

Description

Reported by AvE.

ropp2bufr gives a 'segmentation fault' or 'memory error' when run with a particular input test netCDF file on g95 and ifort builds. Other files encode correctly. gdb suggests the memory fault is generated within the MetDB BUFR library routine enbufr().

With the problem file, DO confirms that memory problems exist for builds with g95, ifort, ifort9, pgf95 and gfc under Linux. ifc and nag under Linux do not exhibit this problem. A second test file does not give any errors. The issue is therefore not compiler-specific, but data-dependent.

The file in question has a single profile and appears to be read correctly, but the program crashes somewhere in the encoding process.

Change history (4)

comment:1 by Dave Offiler, 18 years ago

Status: new → assigned

The GPSRO BUFR template is defined by a single master sequence descriptor (as with most other data types). The BUFR encoder & decoder expand sequences recursively on-the-fly. The encoder is passed an initial array of descriptors (here containing just the master sequence descriptor). This array needs to be declared at least as big as the fully-expanded list of descriptors. The minimum expansion is one descriptor per input data element; there can be more descriptors for scale/bit-width changes, replication etc. The ropp2bufr code dynamically allocates a descriptor array of a dimension equal to the number of data elements (calculated from the properties of the profile being encoded) with a fixed margin of 2000 extra elements. This was more than adequate for all test datasets.
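For illustration, the original sizing scheme amounts to the following (a minimal Fortran sketch; apart from 'nextra', the names here are hypothetical, not the actual ropp2bufr.f90 identifiers):

      PROGRAM alloc_sketch
        INTEGER, PARAMETER   :: nextra = 2000  ! fixed headroom (original value)
        INTEGER              :: nelem          ! no. of data elements in profile
        INTEGER, ALLOCATABLE :: descr(:)       ! descriptor array for the encoder

        nelem = 2000                  ! example value; in ropp2bufr this is
                                      ! calculated from the profile's properties
        ALLOCATE ( descr(nelem+nextra) )  ! must be able to hold the fully-
                                          ! expanded descriptor list
        descr(1) = 310026             ! single master sequence descriptor
                                      ! (illustrative; the library may expect
                                      ! a packed form)
        DEALLOCATE ( descr )
      END PROGRAM alloc_sketch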

The dataset in question has 349 Level 1b samples; by default the encoder thins to no more than 375, so no thinning takes place. Forcing thinning by using the -p switch to (say) no more than 200 samples allows normal encoding, suggesting that the size of the data is related to the problem.
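For example, a forced-thinning invocation of this kind (switches as in the session logs below; the sample limit of 200 is illustrative):

      $ $ROPP_IFO/bin/ropp2bufr troublemaker.nc -o dummy.bufr -p 200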

Further investigation showed that the MetDB library top-level call ENBUFV2() is no longer returning the final number of expanded descriptors; the input value is unchanged, so the correct expansion is not displayed with the DEBUG (-d) diagnostic output. It turns out that for the problem dataset, the expanded list is larger than the allocated array dimension - the 2000-element headroom (set in variable 'nextra') is insufficient. Arbitrarily increasing nextra to 5000 allowed this file to be encoded nominally.

After further experimentation (with g95), the following was found:

1) setting nextra to 2000 or to 50% of the number of data elements (whichever is greater) gives sufficient headroom for this and larger profiles to be encoded.

2) there appears to be a fundamental limit at around 1500 samples in L1b at which memory errors re-appear, whatever arbitrarily large value is given to nextra, and however much the length of the character string which holds the encoded BUFR bit stream (currently 50,000 bytes) is increased.

The impact of (2) is academic; MetDB cannot handle BUFR messages of more than 27Kb, which limits the max. no. of samples to perhaps 500. GTS is even more limited (15Kb). In fact MetDB cannot decode (retrieve) more than 375 samples per profile, which is why this is the default maximum. The main impact is on ROPP testing when thinning of high-resolution profiles is disabled (-p 9999) for ropp2bufr/bufr2ropp test cycles. It is recommended that test files not exceeding 1000 samples be used.

Hence ropp2bufr.f90 has been modified by:

1) setting nextra to 2000 or 50% of the no. of data elements, whichever is larger;

2) clamping any user-specified value for the max. no. of samples (via the -p switch) to 1500. The default maximum remains at 375;

3) modifying the -d (DEBUG) output to stdout to correctly display the no. of expanded descriptors after encoding;

4) performing the native-to-BUFR conversion (which also does the thinning) before the allocation of the descriptor array, so that the thinned no. of data elements is used in the size calculation.

A sketch of changes (1), (2) and (4) is given below.
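In outline, changes (1), (2) and (4) look like this (a sketch assuming hypothetical local names, 'nextra' excepted; the values are those stated above):

      maxsamp = MIN ( maxsamp, 1500 )   ! (2) clamp any user-specified -p value
      ! (4) native-to-BUFR conversion (incl. thinning to maxsamp) happens
      !     here, so nelem below is the thinned no. of data elements
      nextra  = MAX ( 2000, nelem/2 )   ! (1) headroom: 2000 or 50% of the no.
                                        !     of elements, whichever is larger
      ALLOCATE ( descr(nelem+nextra) )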

These changes were initially tested with g95. They were also tested successfully with ifc, ifort9, nag, pgf95 and gfc on RH Linux. The exception is ifort (Intel v8), which now gives:

$ $ROPP_IFO/bin/ropp2bufr troublemaker.nc -o dummy.bufr -d

            ===== ROPP BUFR Encoder =====
                   09:10 23-Aug-2006

Reading  ROPP data from troublemaker.nc
FATAL ERROR (from ropp_io_read_ncdf_get):
   Start+count exceeds dimension bound

i.e. the input file cannot be read. Checking with ropp2ropp (which has not been changed between builds, nor have any ncdf90 routines, which have merely been re-compiled; the netCDF library has not been touched at all) gives the same error:

$ $ROPP_IFO/bin/ropp2ropp troublemaker.nc
Reading troublemaker.nc...
FATAL ERROR (from ropp_io_read_ncdf_get):
   Start+count exceeds dimension bound

This appears to be a problem in the basic ropp_io_read code, but affecting ifort only. NB ropp2bufr previously read the same file, but failed in the BUFR library.

The updated ropp2bufr.f90 is committed [882] as solving the original problem, but this Ticket is left open pending resolution of the ifort issue, which is not unique to ropp2bufr.

NB: not rebuilt/tested on platforms other than RH Linux.

comment:2 by Dave Offiler, 18 years ago

Resolution: fixed
Status: assigned → closed

The ifort issue was a red herring: after a rebuild following a 'make distclean', both ropp2bufr & ropp2ropp work correctly.

Also checked HP-NAG (which, like Linux-NAG, worked before anyhow; it still works).

comment:3 by Dave Offiler, 18 years ago

Version: 0.8

comment:4 by (none), 16 years ago

Milestone: 0.9

Milestone 0.9 deleted
