Opened 8 years ago

Last modified 8 years ago

#480 new enhancement

Consider converting undulation files egm96.dat and corrcoef.dat to netCDF format

Reported by: Ian Culverwell
Owned by: Ian Culverwell
Priority: normal
Milestone: Whenever
Component: ROPP (all)
Version: 8.0
Keywords: undulation
Cc:

Description

The zipping and unzipping of ropp_pp/data/egm96.dat and ropp_pp/data/corrcoef.dat have caused some unnecessary trouble during the testing of ROPP9.0. Do they really have to be text files?

Unzipped, they are quite big files:

idculv@eld037:> ls -ltr ropp_pp/data/*.dat 
-rw-r--r-- 1 idculv satsense 5292621 Jul 26 16:08 corrcoef.dat
-rw-r--r-- 1 idculv satsense 5292378 Jul 26 16:08 egm96.dat

Zipped up, their sizes are

idculv@eld037:> ls -ltr ropp_pp/data/*.dat.gz           
-rw-r--r-- 1 idculv satsense 1656792 Jul 26 16:08 corrcoef.dat.gz
-rw-r--r-- 1 idculv satsense 1802837 Jul 26 16:08 egm96.dat.gz

So zipping these text files is generally a good idea.

But corrcoef.dat contains 65341 lines like

idculv@eld037:> more ropp_pp/data/corrcoef.dat 
   0   0  -0.502745269977262810D+01   0.000000000000000000D+00                  
   1   0   0.362815722250429060D+00   0.000000000000000000D+00                  
   1   1  -0.104348574331318722D+01  -0.204625738821275838D+01                  
   2   0  -0.318495437163588324D+01   0.000000000000000000D+00                  
   2   1   0.196089895083435536D+00  -0.207358978500757085D+01                  

(These are just the coefficients in a spherical harmonic expansion.) (Note the huge amount of blank space stored at the end of each line!)
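The line counts follow from the triangular layout of a spherical harmonic expansion. Here is a minimal Python sketch (Python is used purely for illustration; ROPP itself is Fortran). It assumes the expansion runs to degree and order 360, as the last lines of both files suggest, and that the order m varies fastest within each degree n. The function names are hypothetical.

```python
def n_coeffs(n_min: int, n_max: int) -> int:
    """Number of (n, m) pairs with n_min <= n <= n_max and 0 <= m <= n."""
    return sum(n + 1 for n in range(n_min, n_max + 1))


def linear_index(n: int, m: int) -> int:
    """1-based position of (n, m) when m varies fastest and n starts at 0."""
    if not 0 <= m <= n:
        raise ValueError("need 0 <= m <= n")
    return n * (n + 1) // 2 + m + 1


print(n_coeffs(0, 360))        # 65341: one line per pair, starting at n = 0
print(n_coeffs(2, 360))        # 65338: the same, but starting at n = 2
print(linear_index(360, 360))  # 65341: position of the last coefficient
```

Under these assumptions, a file that starts at (0, 0) has 65341 lines and one that starts at (2, 0) has three fewer.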

Storing these numbers at double precision in a netCDF file would need ~ 65341 * 2 * 8 = 1045456 bytes, which is about 37% smaller than corrcoef.dat.gz.

Similarly, egm96.dat has 65338 lines like

idculv@eld037:> more ropp_pp/data/egm96.dat         
   2   0 -0.484165371736E-03  0.000000000000E+00  0.35610635E-10  0.00000000E+00
   2   1 -0.186987635955E-09  0.119528012031E-08  0.10000000E-29  0.10000000E-29
   2   2  0.243914352398E-05 -0.140016683654E-05  0.53739154E-10  0.54353269E-10
   3   0  0.957254173792E-06  0.000000000000E+00  0.18094237E-10  0.00000000E+00

The netCDF equivalent would occupy ~ 65338 * 4 * 8 = 2090816 bytes, which is only about 16% bigger than egm96.dat.gz.

Overall, the unzipped netCDF files would occupy about 10% less space than the zipped text files. More to the point, we wouldn't need to keep zipping and unzipping them. In addition, the files would be in a standard format, which ROPP users could take and use for their own ends.
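The size estimates can be cross-checked with a few lines of Python. This is only a sketch: it counts the raw payload of 8-byte doubles and ignores the (small) netCDF header. The gzipped sizes are copied from the ls listings above, and the helper name raw_size is hypothetical.

```python
# Gzipped file sizes in bytes, copied from the ls listings above.
GZ_SIZES = {"corrcoef.dat.gz": 1656792, "egm96.dat.gz": 1802837}

BYTES_PER_DOUBLE = 8


def raw_size(n_lines: int, n_columns: int) -> int:
    """Payload of n_lines x n_columns doubles, ignoring the netCDF header."""
    return n_lines * n_columns * BYTES_PER_DOUBLE


corrcoef_nc = raw_size(65341, 2)
egm96_nc = raw_size(65338, 4)

ratios = {
    "corrcoef": corrcoef_nc / GZ_SIZES["corrcoef.dat.gz"],
    "egm96": egm96_nc / GZ_SIZES["egm96.dat.gz"],
    "combined": (corrcoef_nc + egm96_nc) / sum(GZ_SIZES.values()),
}

for name, ratio in ratios.items():
    print(f"{name}: netCDF payload is {ratio:.2f} x the gzipped text size")
```

The three ratios come out at roughly 0.63, 1.16 and 0.91 respectively.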

Obviously we would need to make sure that the proposed data formatting change wouldn't affect the results. In particular, attention should be paid to the data in corrcoef.dat. 18 significant digits are claimed for the two fields. This is (just) more than the 15 to 17 significant decimal digits that can be held in a netCDF double precision variable or, typically, in a Fortran REAL(KIND=KIND(1.D0)) variable. So we would need to check that the apparent degradation of accuracy in the netCDF files did not in fact have any impact in the Fortran code.
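To get a feel for the size of the effect, the following Python sketch parses the first corrcoef.dat coefficient quoted above into an IEEE double and compares it with the exact decimal value of the text. It assumes that a Python float behaves like the double precision variables in question. The helper parse_fortran_real is hypothetical and is not part of ROPP.

```python
import sys
from decimal import Decimal


def parse_fortran_real(field: str) -> float:
    """Convert a Fortran literal such as '0.1D+01' to a Python float (IEEE double)."""
    return float(field.strip().upper().replace("D", "E"))


text = "-0.502745269977262810D+01"  # first coefficient of corrcoef.dat, quoted above
value = parse_fortran_real(text)

# Exact decimal value of the text, and of the double that it rounds to.
exact = Decimal(text.replace("D", "E"))
stored = Decimal(value)
rel_err = abs((stored - exact) / exact)

print("decimal digits always preserved by a double:", sys.float_info.dig)
print("relative rounding error:", f"{rel_err:.2E}")
print("below machine epsilon:", rel_err < Decimal(sys.float_info.epsilon))
```

Any loss occurs when the text is first read into a double, which the existing text-based reader would also incur if it reads into double precision variables.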

Attachments (2)

und_coeffs.nc (2.0 MB ) - added by Ian Culverwell 8 years ago.
und_dat2nc.f90 (5.0 KB ) - added by Ian Culverwell 8 years ago.

Change history (7)

comment:1 by Ian Culverwell, 8 years ago

For the record, some dummy files of the right dimensions have sizes:

-rw-r--r--  1 idculv satsense 1045572 Dec 23 15:02 corrcoef.nc
-rw-r--r--  1 idculv satsense 2091004 Dec 23 15:02 egm96.nc

comment:2 by Ian Culverwell, 8 years ago

In fact, it turns out that we don't even read the last two columns of egm96.dat, so they can be removed from the netCDF versions. Further, we can put all the information in one netCDF file, say und_coeffs.nc (attached, as is its Fortran generation program und_dat2nc.f90). This is about 2.1 MB in size:

-rw-r--r--  1 idculv satsense 2091788 Jan  3 11:26 und_coeffs.nc

This is 60% of the combined size of the gzipped ascii files egm96.dat.gz and corrcoef.dat.gz.

(Zipping up und_coeffs.nc makes practically no difference to its size. In any case, there are already files in the ROPP distribution (eg ropp_1dvar/data/IT-1DVAR-01_b.nc at 2.7 MB) that are bigger than this. So we could live with an unzipped und_coeffs.nc.)
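As a rough consistency check on the combined file, the Python sketch below assumes that und_coeffs.nc holds four double precision arrays (two columns from each .dat file), each with one entry per (n, m) pair for degrees 0 to 360. That layout is an assumption made for illustration and has not been checked against the attached file.

```python
N_COEFFS = 65341       # (n, m) pairs for degrees 0..360
N_ARRAYS = 4           # assumed: two columns kept from each .dat file
BYTES_PER_DOUBLE = 8

payload = N_COEFFS * N_ARRAYS * BYTES_PER_DOUBLE
file_size = 2091788                # und_coeffs.nc, from the listing above
gz_total = 1656792 + 1802837       # corrcoef.dat.gz + egm96.dat.gz

print(payload)                         # 2090912
print(file_size - payload)             # 876 bytes left for header and metadata
print(f"{file_size / gz_total:.3f}")   # 0.605
```

If the assumed layout is right, nearly all of the file is coefficient data, which fits with gzip making practically no difference to its size.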

by Ian Culverwell, 8 years ago

Attachment: und_coeffs.nc added

comment:3 by Ian Culverwell, 8 years ago

(Why are the last hc and cc, at degree and order 360, so much smaller than the others?

idculv@eld037:> tail -5 egm96.dat 
 360 356  0.437069296408E-10 -0.104331448796E-09  0.50033977E-10  0.50033977E-10
 360 357 -0.628042366728E-11  0.106635915741E-09  0.50033977E-10  0.50033977E-10
 360 358  0.709604781531E-10  0.691761006753E-10  0.50033977E-10  0.50033977E-10
 360 359  0.183971631467E-10 -0.310123632209E-10  0.50033977E-10  0.50033977E-10
 360 360 -0.447516389678E-24 -0.830224945525E-10  0.50033977E-10  0.50033977E-10

idculv@eld037:> tail -5 corrcoef.dat 
 360 356  -0.554638277560201478D-02  -0.138393269720064329D-02                  
 360 357  -0.119289584478972659D-02  -0.178910411538898653D-02                  
 360 358  -0.444423721432539963D-02  -0.621882922713206093D-02                  
 360 359  -0.738649105727355159D-02   0.547503707694727275D-03                  
 360 360   0.205445744870012291D-17   0.725246966198756643D-02                  

)

comment:4 by Ian Culverwell, 8 years ago

How are we to read the data in the proposed netCDF file? The netCDF reading routines are in ropp_io, and the undulation calculating routine, SUBROUTINE Datum_HMSL, is in ropp_utils/coordinates/earth.f90. As a matter of design principle, ropp_utils does not depend on ropp_io. I suppose we would just have to use the 'native' netCDF library, by means of

  USE netcdf
...
  CALL check(nf90_open(ofile, nf90_nowrite, ncid))
  CALL check(nf90_inq_varid(ncid, 'hc', hc_varid))
  CALL check(nf90_get_var(ncid, hc_varid, hc))
  CALL check(nf90_close(ncid))

(as for the attached und_dat2nc.f90), rather than the ROPP netCDF routines

  USE ncdf
...
  CALL ncdf_open(file)
  CALL ncdf_getvar('hc', hc)
  CALL ncdf_close()

In the first case we would have to augment the link statement with -lnetcdff -lnetcdf. I think this is probably OK - potential ROPP users would struggle to do anything without any netCDF routines. If they really want to use Datum_HMSL without netCDF, we can point them to the earlier code and datasets.

Actually doing the software engineering of this might require thinking caps to be put on. For starters, I think we'd need to include something like

CM_CHECK_MODULE(netcdf)
AM_CONDITIONAL(HAVE_NETCDF, test x$HAVE_MODULE_netcdf = xyes)
if test x$HAVE_MODULE_netcdf = xno ; then
   AC_MSG_WARN([])
   AC_MSG_WARN([PACKAGE NETCDF NOT FOUND])
   AC_MSG_WARN([THIS PACKAGE REQUIRES NETCDF TO BE INSTALLED FIRST.])
   AC_MSG_WARN([*** NOTE:                                            ***])
   AC_MSG_WARN([*** Users wishing to install ROPP_IO must first have ***])
   AC_MSG_WARN([*** the NETCDF package installed before building     ***])
   AC_MSG_WARN([*** this package. See ROPP Release Notes or ROPP     ***])
   AC_MSG_WARN([*** User Guide for further details.                  ***])
   AC_MSG_WARN([])
   AC_MSG_ERROR([Module NETCDF not found])
fi

and

CM_CHECK_LIB(netcdf, nf_open, -lnetcdff -lnetcdf)
AM_CONDITIONAL(HAVE_NETCDF, test x$HAVE_LIBRARY_netcdf = xyes)
if test x$HAVE_LIBRARY_netcdf = xno ; then
   AC_MSG_WARN([])
   AC_MSG_WARN([LIBRARY NETCDF (Fortran) NOT FOUND])
   AC_MSG_WARN([THIS PACKAGE REQUIRES NETCDF TO BE INSTALLED FIRST.])
   AC_MSG_WARN([*** NOTE:                                            ***])
   AC_MSG_WARN([*** Users wishing to install ROPP_IO must first have ***])
   AC_MSG_WARN([*** the NETCDF package installed before building     ***])
   AC_MSG_WARN([*** this package. See ROPP Release Notes or ROPP     ***])
   AC_MSG_WARN([*** User Guide for further details.                  ***])
   AC_MSG_WARN([])
   AC_MSG_ERROR([Library NETCDF not found])
fi

in ropp_utils/configure.ac, as is done in ropp_io/configure.ac.

by Ian Culverwell, 8 years ago

Attachment: und_dat2nc.f90 added

comment:5 by Ian Culverwell, 8 years ago

(Note, finally for now, that the output of und_dat2nc.f90, namely

From egm96.dat: hc(1:10) = 
    0.00000000000000000000E+00    0.00000000000000000000E+00    0.00000000000000000000E+00   -0.48416537173599999769E-03   -0.18698763595500000226E-09
    0.24391435239799998455E-05    0.95725417379199996818E-06    0.20299888218400000784E-05    0.90462776860499997739E-06    0.72107265705700001092E-06

From corrcoef.dat: cc(1:10) = 
   -0.50274526997726285416E+01    0.36281572225042907354E+00   -0.10434857433131872195E+01   -0.31849543716358832413E+01    0.19608989508343552255E+00
    0.25007644753435060991E+01    0.45974797592494542897E+01   -0.11930287232961291066E+00    0.28147693676854443900E+01    0.43486814175551880002E+00

From und_coeffs.nc: hc(1:10) = 
    0.00000000000000000000E+00    0.00000000000000000000E+00    0.00000000000000000000E+00   -0.48416537173599999769E-03   -0.18698763595500000226E-09
    0.24391435239799998455E-05    0.95725417379199996818E-06    0.20299888218400000784E-05    0.90462776860499997739E-06    0.72107265705700001092E-06

egm96.dat - und_coeffs.nc: hc(1:10) = 
    0.00000000000000000000E+00    0.00000000000000000000E+00    0.00000000000000000000E+00    0.00000000000000000000E+00    0.00000000000000000000E+00
    0.00000000000000000000E+00    0.00000000000000000000E+00    0.00000000000000000000E+00    0.00000000000000000000E+00    0.00000000000000000000E+00

From und_coeffs.nc: cc(1:10) = 
   -0.50274526997726285416E+01    0.36281572225042907354E+00   -0.10434857433131872195E+01   -0.31849543716358832413E+01    0.19608989508343552255E+00
    0.25007644753435060991E+01    0.45974797592494542897E+01   -0.11930287232961291066E+00    0.28147693676854443900E+01    0.43486814175551880002E+00

egm96.dat - und_coeffs.nc: cc(1:10) = 
    0.00000000000000000000E+00    0.00000000000000000000E+00    0.00000000000000000000E+00    0.00000000000000000000E+00    0.00000000000000000000E+00
    0.00000000000000000000E+00    0.00000000000000000000E+00    0.00000000000000000000E+00    0.00000000000000000000E+00    0.00000000000000000000E+00

suggests that in fact the numerical precision of the .dat files is exactly preserved in the .nc file, at least as far as a Fortran program that reads the data at double precision is concerned. This would appear to answer the final concern of the original posting.)
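The zero differences are what one would expect if the netCDF file simply stores each value as an 8-byte IEEE 754 double. The Python sketch below mimics this by packing the first five corrcoef.dat rows quoted in the description into 8 bytes each and unpacking them again. The big-endian byte order is an assumption based on the classic netCDF format, and the helper names are hypothetical.

```python
import struct

# First five rows of corrcoef.dat (both columns), as quoted in the description.
ROWS = [
    ("-0.502745269977262810D+01", "0.000000000000000000D+00"),
    ("0.362815722250429060D+00", "0.000000000000000000D+00"),
    ("-0.104348574331318722D+01", "-0.204625738821275838D+01"),
    ("-0.318495437163588324D+01", "0.000000000000000000D+00"),
    ("0.196089895083435536D+00", "-0.207358978500757085D+01"),
]


def parse_fortran_real(field: str) -> float:
    """Convert a Fortran literal such as '0.1D+01' to a Python float (IEEE double)."""
    return float(field.strip().upper().replace("D", "E"))


def roundtrip_as_double(x: float) -> float:
    """Pack x as a big-endian IEEE 754 double (8 bytes) and unpack it again."""
    return struct.unpack(">d", struct.pack(">d", x))[0]


values = [parse_fortran_real(field) for row in ROWS for field in row]
diffs = [v - roundtrip_as_double(v) for v in values]

print(max(abs(d) for d in diffs))  # 0.0
```

Once a value has been read into a double, writing it out as 8 bytes and reading it back changes nothing, so only the initial text-to-double conversion can round.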
