Opened 12 years ago

Closed 11 years ago

#301 closed defect (fixed)

xlf95 difference from ifort12 (et al) on IT-PP-05

Reported by: Ian Culverwell
Owned by: Ian Culverwell
Priority: normal
Milestone: 7.0
Component: ropp_pp
Version: 6.0
Keywords:
Cc:

Description

Following the replacement of the OCC-derived ropp_pp reference files with those from ROPP6.0, a discrepancy has come to light between the results of IT-PP-05 when ROPP is compiled with xlf95 (running under AIX on an IBM supercomputer) and when it is compiled with ifort12 and all the other compilers running on a linux box.

Investigations show that this discrepancy has been there as long as we have been testing xlf95, but that previously it was masked by the much larger difference between ROPP (built with any compiler) and OCC.

IT-PP-05 tests the ropp_pp_occ_tool on open and closed loop GRAS data. The figure below shows the differences between xlf95 and ifort, and between ROPP6.0 and 6.1, on the third profile of the test dataset, where the differences are largest.

test2_bangle.gif

It is clear that the difference in bangle between xlf95 and ifort12 is pretty much the same at 6.1 as it was at 6.0, because the 6.0-to-6.1 change is very small. (Offline tests indicate that this small difference is largely due to a change in ropp_pp_fourier_filter.f90 made at 6.1 (see #299).)

The same is true for refractivity.

test2_refrac.gif

Painstaking investigations have yet to discover the reason for the difference between xlf95 and ifort12 (etc) on this dataset. It may be down to a deep-rooted difference in the behaviour of Fortran intrinsics on the two compilers/platforms. We propose "parking" the issue for the moment, as it is delaying the release of ROPP6.1.

Attachments (27)

test2_bangle.gif (45.4 KB) - added by Ian Culverwell 12 years ago.
test2_refrac.gif (40.8 KB) - added by Ian Culverwell 12 years ago.
IT-PP-05_prof3_phase_Li.png (35.7 KB) - added by Ian Culverwell 11 years ago.
IT-PP-05_prof1_phase_Li.png (35.7 KB) - added by Ian Culverwell 11 years ago.
IT-PP-05_prof3_bangle_diff.gif (73.8 KB) - added by Ian Culverwell 11 years ago.
IT-PP-05_prof3_refrac_diff.gif (41.4 KB) - added by Ian Culverwell 11 years ago.
IT-PP-05_prof3_release_diff.gif (67.1 KB) - added by Ian Culverwell 11 years ago.
IT-PP-05_prof3_dtime_phase_Li_open_llop_lcf.png (54.7 KB) - added by Ian Culverwell 11 years ago.
phase_L1.gif (66.0 KB) - added by Ian Culverwell 11 years ago.
test3_1.gif (50.9 KB) - added by Ian Culverwell 11 years ago.
test3_p3.gif (49.7 KB) - added by Ian Culverwell 11 years ago.
test3_3.gif (49.7 KB) - added by Ian Culverwell 11 years ago.
phase_L1_new.gif (80.9 KB) - added by Ian Culverwell 11 years ago.
dtime_phaseL1_lcf_new.png (61.1 KB) - added by Ian Culverwell 11 years ago.
test3_3_nopurple.gif (55.5 KB) - added by Ian Culverwell 11 years ago.
test3_1new_noblue.gif (55.5 KB) - added by Ian Culverwell 11 years ago.
test3_1new_nobluenogreen.gif (53.4 KB) - added by Ian Culverwell 11 years ago.
SSY_email_231013.txt (15.8 KB) - added by Ian Culverwell 11 years ago.
SSY_email_231013.tar.gz (265.2 KB) - added by Ian Culverwell 11 years ago.
residual_xgns.png (45.2 KB) - added by Ian Culverwell 11 years ago.
test3_original_p1.gif (63.9 KB) - added by Ian Culverwell 11 years ago.
test3_final_p1.gif (40.2 KB) - added by Ian Culverwell 11 years ago.
test3_final_p2.gif (39.4 KB) - added by Ian Culverwell 11 years ago.
test3_original_p2.gif (68.3 KB) - added by Ian Culverwell 11 years ago.
plot_xyz_residual_ifort.gif (75.7 KB) - added by Ian Culverwell 11 years ago.
plot_xyz_residual_xlf95.gif (74.1 KB) - added by Ian Culverwell 11 years ago.
plot_xyz_residual_sunf95.gif (76.3 KB) - added by Ian Culverwell 11 years ago.


Change history (43)

by Ian Culverwell, 12 years ago

Attachment: test2_bangle.gif added

test2_bangle.gif

by Ian Culverwell, 12 years ago

Attachment: test2_refrac.gif added

test2_refrac.gif

comment:1 by Ian Culverwell, 11 years ago

Still got nowhere. Leave open.

comment:2 by Ian Culverwell, 11 years ago

The root of this problem appears to be a data issue rather than a coding one. There are lots of rubbish phase_L1 points in the data:

IT-PP-05_prof3_phase_Li.png

Lots of phase_L1s are about -31e6, which is not ropp_MDFV (=-99999000.), and which are therefore currently processed by ropp_pp_preprocess_grasrs.f90. This can't be right. (There are also a lot of phase_L1 values of 9.9692100e+36 at the end of the occultation, which is NC_FILL_DOUBLE, the default netCDF FillValue. These are not shown above.)

The phase_L2s are better: valid or ropp_MDFV or NC_FILL_DOUBLE (again, the latter aren't shown).

However, the same is true of the other profiles in the file, eg profile1:

IT-PP-05_prof1_phase_Li.png

This profile causes no difficulty for ROPP. We therefore assume that for profile 3 - the one that shows a significant difference between xlf95 and the other compilers - the amount or intensity of "bad" data is enough to throw the sensitive calculations of bending angle (and thence refractivity) into slightly different places when compiled with different compilers. Certainly the simple trick of replacing

 WHERE(ro_data%Lev1a%phase_L1(:) == ropp_MDFV) 
 WHERE(ro_data%Lev1a%phase_L2(:) == ropp_MDFV) 

by

 WHERE(ro_data%Lev1a%phase_L1(:) < ropp_MDTV) 
 WHERE(ro_data%Lev1a%phase_L2(:) < ropp_MDTV) 

(twice) in ropp_pp_preprocess_grasrs.f90, which omits all the dodgy stuff, seems to fix the problem. It changes the xlf95-ifort bending angle differences shown in the top panel here to the ones shown in the bottom panel.

IT-PP-05_prof3_bangle_diff.gif

(The dashed box shows the criterion for passing the Test Folder test: |diff| < 0.1% in the lowest 50 km.)

Similarly for the refractivities:

IT-PP-05_prof3_refrac_diff.gif
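
The effect of the masking change described above can be sketched as follows. This is a minimal illustration in Python rather than the Fortran of ROPP. The value of ropp_MDFV is the one quoted in this ticket, but the value used here for ropp_MDTV (-9999.0) is an assumption made for the example, as are the sample phases and the helper names.

```python
# Sketch only: compares an exact "equals missing flag" mask with a
# "below missing threshold" mask on a handful of made-up phase values.

ROPP_MDFV = -99999000.0  # missing data flag value, as quoted in this ticket
ROPP_MDTV = -9999.0      # ASSUMED missing data test value (illustrative only)


def mask_equal_to_flag(phases):
    """Flag only the points that are exactly equal to ROPP_MDFV."""
    return [p == ROPP_MDFV for p in phases]


def mask_below_threshold(phases):
    """Flag every point that lies below ROPP_MDTV."""
    return [p < ROPP_MDTV for p in phases]


# Hypothetical phase_L1 samples (metres): two plausible values, two of the
# suspicious ~ -31e6 values, and one genuine missing-data flag.
phases = [12.5, -31.0e6, ROPP_MDFV, 804.2, -30.9e6]

equal_mask = mask_equal_to_flag(phases)
below_mask = mask_below_threshold(phases)

# Points that the equality test lets through but the threshold test rejects.
extra_rejected = [p for p, e, b in zip(phases, equal_mask, below_mask)
                  if b and not e]
```

With these sample values the equality test rejects only the true flag value, while the threshold test also rejects the two ~ -31e6 points.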

Job done? I don't know. Removing all this bad data makes a big (~5%) difference to the derived bending angles and refractivities:

IT-PP-05_prof3_release_diff.gi

Is this acceptable? Maybe a better question is: are the original bangle and refrac profiles, partly based on phase_L1s ~ -31e6, acceptable? And we still don't really know why the dodgy data provoked a difference on xlf95. I suggest we leave this ticket open, and discuss at the DRI.

I don't know how or why these rubbish phases get into the input dataset. Note that the whole issue of the treatment of "bad" data in ropp_pp input files is discussed in ticket #287.

(In passing, I note that the results of the Global MSIS search: Month = 11 Lat = 60. Lon = 340., seem to be completely wrong for this profile, which has month[0]=9, lat[0]=-46.2301, lon[0]=-54.9948. This issue is highlighted in ticket #317.)

by Ian Culverwell, 11 years ago

Attachment: IT-PP-05_prof3_phase_Li.png added

IT-PP-05_prof3_phase_Li.png

by Ian Culverwell, 11 years ago

Attachment: IT-PP-05_prof1_phase_Li.png added

IT-PP-05_prof1_phase_Li.png

by Ian Culverwell, 11 years ago

IT-PP-05_prof3_bangle_diff.gif

by Ian Culverwell, 11 years ago

IT-PP-05_prof3_refrac_diff.gif

by Ian Culverwell, 11 years ago

IT-PP-05_prof3_release_diff.gi

by Ian Culverwell, 11 years ago

IT-PP-05_prof3_dtime_phase_Li_open_llop_lcf.png

comment:3 by Ian Culverwell, 11 years ago

Discussed with Stig: perhaps the phase=-31e6 is OK. It's only the derivative that matters, and this could just be raw sampling data. See the open_loop_lcf flag:

[[Image(IT-PP-05_prof3_dtime_phase_Li_open_llop_lcf.png)]

Break down phase_L1 into regions defined by this flag:

[[Image(phase_L1.gif)]

It appears to be the blue region (open_loop_lcf = even > 0) that's causing the trouble - phases all over the place. When I omit this section of data, and rerun in the old code (the blue and the green are omitted in the new code), I see no significant difference between xlf95 and ifort:

[[Image(test3_1.gif)]

(Similarly for refractivity.)

by Ian Culverwell, 11 years ago

Attachment: phase_L1.gif added

phase_L1.gif

by Ian Culverwell, 11 years ago

Attachment: test3_1.gif added

test3_1.gif

comment:4 by Ian Culverwell, 11 years ago

I meant of course

open_loop_lcf

IT-PP-05_prof3_dtime_phase_Li_open_llop_lcf.png

regions of phase_L1

phase_L1.gif

revised xlf95-ifort on expurgated dataset

test3_1.gif

comment:5 by Ian Culverwell, 11 years ago

We also see very little difference between the BA/REF profiles produced when we omit just the blue data (ie use RS where the CL and RS data overlap):

test3_3.gif.

This is very different to the ~5% differences we saw when we omitted all the RS (and overlapping CL) data before.

by Ian Culverwell, 11 years ago

Attachment: test3_p3.gif added

test3_3.gif

by Ian Culverwell, 11 years ago

Attachment: test3_3.gif added

test3_3.gif

comment:6 by Ian Culverwell, 11 years ago

If this works on old and new data, then a reasonably simple fix is to replace

  icl_min = SUM(MINLOC(ro_data%lev1a%dtime(:), MASK = .NOT. BTEST(LCF(:),0)))
  icl_max = SUM(MAXLOC(ro_data%lev1a%dtime(:), MASK = .NOT. BTEST(LCF(:),0)))

with

  icl_min = SUM(MINLOC(ro_data%lev1a%dtime(:), MASK = LCF(:)==0))
  icl_max = SUM(MAXLOC(ro_data%lev1a%dtime(:), MASK = LCF(:)==0))

in ropp_pp_preprocess_grasrs.f90. By this means we are defining the open loop region as the one in which LCF is 0, not just even.

But it remains to be seen what happens when new data are passed through the routines.
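
To make the difference between the two masks concrete, here is a small Python sketch of the same selection. The flag values and times are invented for illustration, the helper name is hypothetical, and the indices are 0-based (the Fortran MINLOC/MAXLOC results are 1-based).

```python
# Sketch only: find the indices of the earliest and latest dtime among the
# points selected by a mask, for two different definitions of the mask.

def masked_extreme_indices(dtime, mask):
    """Return (index of min dtime, index of max dtime) over masked points.

    Returns (None, None) when the mask selects nothing.
    """
    selected = [i for i, keep in enumerate(mask) if keep]
    if not selected:
        return None, None
    i_min = min(selected, key=lambda i: dtime[i])
    i_max = max(selected, key=lambda i: dtime[i])
    return i_min, i_max


# Hypothetical data: times in seconds and made-up flag values.
dtime = [0.00, 0.02, 0.04, 0.06, 0.08, 0.10]
lcf = [0, 0, 2, 2, 1, 1]

# Old mask: bit 0 clear, i.e. any even flag value (including 2).
bit0_clear = [(flag & 1) == 0 for flag in lcf]
# Proposed mask: flag exactly zero.
is_zero = [flag == 0 for flag in lcf]

old_range = masked_extreme_indices(dtime, bit0_clear)
new_range = masked_extreme_indices(dtime, is_zero)
```

Here the old mask spans indices 0 to 3, because the points with flag 2 are even, whereas the proposed mask stops at index 1.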

comment:7 by Ian Culverwell, 11 years ago

Axel von Engeln kindly provided, via Yago Andres and Chris Burrows, 5 recent (2012) level 1a phase datasets from EUMETSAT, in grouped netCDF4 format. After converting to standard ROPP format by using eum2ropp, the resulting profiles were run through IT-PP-05 on ifort (on a linux box), sunf95 (on a linux box), and xlf95 (on the HPC).

Result: no significant (ie > 0.1% fractional difference in 0-50km) difference between any of them.

This is good - it means that a new test dataset using these profiles should pass the test folder. (We already know it passes the xlf95 test.)

But we don't really know why. The new profiles still have overlapping open loop and raw sampling sections. Then again, four of the 5 old (2007) profiles had that, and they passed the test. One clue: the L1 phases in the overlap region tend to be quite small (~1-1e5) and positive, rather than ~ -31e6 as they were before. Perhaps it's just good luck.
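
For reference, the kind of check implied by the 0.1% criterion can be sketched as below. This is not the Test Folder code: the function name, the choice of the reference profile as the denominator, and the sample numbers are all assumptions made for illustration.

```python
# Sketch only: decide whether two profiles agree to within a fractional
# tolerance at every level in the lowest 50 km.

def passes_threshold(height_km, test, reference, max_frac=1.0e-3, top_km=50.0):
    """True if |test - reference| / |reference| < max_frac below top_km."""
    for z, a, b in zip(height_km, test, reference):
        if z > top_km or b == 0.0:
            continue  # level is above the window, or has no usable reference
        if abs(a - b) / abs(b) >= max_frac:
            return False
    return True


# Hypothetical bending angles (radians) at three heights; the top level is
# outside the 0-50 km window and so is ignored.
heights = [10.0, 30.0, 60.0]
reference = [0.02, 0.002, 0.0001]
close = [0.02001, 0.002, 0.0002]   # 0.05% off at 10 km
far = [0.0201, 0.002, 0.0001]      # 0.5% off at 10 km

close_ok = passes_threshold(heights, close, reference)
far_ok = passes_threshold(heights, far, reference)
```

The large discrepancy at 60 km in the first case does not count, since only the lowest 50 km are tested.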

comment:8 by Ian Culverwell, 11 years ago

In support of this last point, here's the phase_L1 breakdown for one of the new files:

phase_L1_new.gif

Note that the RS phase data is O(100m), not -31e6m as before. Perhaps this is it.

by Ian Culverwell, 11 years ago

Attachment: phase_L1_new.gif added

phase_L1_new.gif

by Ian Culverwell, 11 years ago

Attachment: dtime_phaseL1_lcf_new.png added

/dtime_phaseL1_lcf_new.png

comment:9 by Ian Culverwell, 11 years ago

Otherwise the new data look pretty similar to the old, eg, dtime, phase_L1 and open_loop_lcf for the same file:

/dtime_phaseL1_lcf_new.png

comment:10 by Ian Culverwell, 11 years ago

I meant overlapping closed loop and raw sampling, of course. (Thanks, Stig!)

comment:11 by Ian Culverwell, 11 years ago

Awaiting DRI approval of the "solution".

by Ian Culverwell, 11 years ago

Attachment: test3_3_nopurple.gif added

test3_3_nopurple.gif

comment:12 by Ian Culverwell, 11 years ago

Removing the overlapping CL data from the new GRAS profiles has a significant (~few %) effect on the processed bangles and refracs:

test3_1new_noblue.gif

Interestingly, this is larger than the effect of removing overlapping CL and RS data:

test3_1new_nobluenogreen.gif

So, backpedalling a bit, removing data is probably not such a good idea. This conclusion is reinforced by the hilarious discovery that removing the missing phase_L1 data from profile 3 of the earlier 2007 GRAS RS dataset has a measurable impact on the processed bangles and refracs:

test3_3_nopurple.gif

(Note the icing on the cake that the effect of removing missing data is different for xlf95 and ifort.)

What a load of rubbish.

I propose deferring the whole question to ROPP8.0, when hopefully someone else will look at it.

by Ian Culverwell, 11 years ago

Attachment: test3_1new_noblue.gif added

test3_1new_noblue.gif

by Ian Culverwell, 11 years ago

test3_1new_nobluenogreen.gif

comment:13 by Ian Culverwell, 11 years ago

Stig Syndergaard (DMI) has taken a close look at this problem. His email and attachments are attached. Key points:

ropp_pp_cutoff_amplitude: There's a small bug in this, which results in (one) too many points in the merged sample. Has a small impact.

ropp_pp_preprocess_grasrs: Stig discovered that the orbit fitting (via ropp_pp_regression) uses all the data, not just the non-missing stuff. (Later processing uses properly masked data.) This explains the differences shown in the last figure above, for omitting bad data (as inferred from the value of LCF flag) from the orbit-fitting resulted in exactly the same results when run on the full and the expurgated datasets.

ropp_pp_residual_regression: There remains, however, the puzzling difference between the results using xlf95/HPC and ifort/linux. Stig suggested that this might be due to the numerical infelicity of the orbit-fitting routine ropp_pp_regression, which fits a 5th degree polynomial through each of the {x, y, z} co-ordinates of the LEO and the GNSS satellite. Since these co-ordinates are almost constant during an occultation, the resulting inversion is a bit ill-conditioned. He found residuals (fit-data) of up to 2 cm ("1st residual" in the attached xgns.png). By refitting these residuals (using ropp_pp_residual_regression in ropp_pp_utils) these were much reduced ("2nd residual" in the attached xgns.png).

Stig's hunch was that this regression would likely be sensitive to the compiler/platform, and might therefore explain the xlf95/HPC and ifort/linux differences. Happily this seems to be borne out in practice.
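
The idea of refitting the residuals can be sketched in a few lines of Python. This is an illustration of the general technique only, not the ROPP routines: the helper names are hypothetical, the data are synthetic (a nearly constant co-ordinate of about 7e6 m), time is scaled to [0, 1], and solving the normal equations by Gaussian elimination is an assumption about the method.

```python
# Sketch only: least-squares polynomial fit, followed by a second fit to the
# residuals of the first, with the correction added to the coefficients.

def polyval(coeffs, t):
    """Evaluate sum(coeffs[k] * t**k) using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def polyfit(t, y, degree):
    """Least-squares polynomial coefficients via the normal equations."""
    n = degree + 1
    a = [[sum(ti ** (i + j) for ti in t) for j in range(n)] for i in range(n)]
    b = [sum(yi * ti ** i for ti, yi in zip(t, y)) for i in range(n)]
    return solve(a, b)


def fit_with_residual_refit(t, y, degree=5):
    """Fit y(t), then fit the residuals and update the coefficients."""
    first = polyfit(t, y, degree)
    res1 = [yi - polyval(first, ti) for ti, yi in zip(t, y)]
    correction = polyfit(t, res1, degree)
    improved = [c + d for c, d in zip(first, correction)]
    res2 = [yi - polyval(improved, ti) for ti, yi in zip(t, y)]
    return improved, res1, res2


# Synthetic, nearly constant co-ordinate (metres) on a scaled time axis.
t = [i / 60.0 for i in range(61)]
y = [7.0e6 + 1500.0 * ti - 40.0 * ti * ti for ti in t]

coeffs, res1, res2 = fit_with_residual_refit(t, y)
worst_first = max(abs(r) for r in res1)
worst_second = max(abs(r) for r in res2)
```

Comparing worst_first with worst_second on real orbit data, and across compilers, is the experiment described in this comment; the synthetic case here only shows the mechanics.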

As a reminder, here are the xlf95-ifort differences in bending angle and refractivity for the original code:

test3_original_p1.gif

These are outside the 0.1% threshold for acceptability.

Here are the corresponding results when the regression is improved by calling ropp_pp_residual_regression on the residuals from ropp_pp_regression, and then amending the fitting coefficients accordingly:

test3_final_p1.gif

These are within the 0.1% threshold for acceptability.

These are both for the full dataset (missing data included), which is why there remains a tiny difference between the "purple" and "nopurple" datasets:

test3_final_p2.gif

(This difference can be removed entirely by applying the fix described in the second point above, but a general algorithm to do so on a dataset with missing raw sampling data anywhere needs a little more work.)

SUMMARY

  1. Differences caused by the presence (sic) of missing data are explained by the fact that all the data are used in the orbit-fitting routines.
  2. These routines are compiler/platform sensitive, but can be made significantly less so by fitting the residuals (fit-data) from the first fit, and updating the fitting coefficients appropriately.

by Ian Culverwell, 11 years ago

Attachment: SSY_email_231013.txt added

SSY_email_231013.txt

by Ian Culverwell, 11 years ago

Attachment: SSY_email_231013.tar.gz added

SSY_email_231013.tar.gz

by Ian Culverwell, 11 years ago

Attachment: residual_xgns.png added

residual_xgns.png

by Ian Culverwell, 11 years ago

Attachment: test3_original_p1.gif added

by Ian Culverwell, 11 years ago

Attachment: test3_final_p1.gif added

test3_final_p1.gif

by Ian Culverwell, 11 years ago

Attachment: test3_final_p2.gif added

test3_final_p2.gif

by Ian Culverwell, 11 years ago

Attachment: test3_original_p2.gif added

test3_original_p2.gif

comment:14 by Ian Culverwell, 11 years ago

I calculated the 1st and 2nd residuals on IT-PP-05_prof3, as Stig did, for ifort/linux:

plot_xyz_residual_ifort.gif

and for xlf95/HPC:

plot_xyz_residual_xlf95.gif

The first residuals are very different on the two platforms, while the 2nd residuals are much smaller and much more similar. This explains the above results very neatly, and strengthens the case for regressing the residuals in the orbit calculations in ropp_pp.

by Ian Culverwell, 11 years ago

Attachment: plot_xyz_residual_ifort.gif added

plot_xyz_residual_ifort.gif

by Ian Culverwell, 11 years ago

Attachment: plot_xyz_residual_xlf95.gif added

plot_xyz_residual_xlf95.gif

by Ian Culverwell, 11 years ago

plot_xyz_residual_sunf95.gif

comment:15 by Ian Culverwell, 11 years ago

Just to complete this, here are the residuals calculated on the other "linux" compiler that we've used before in this study, sunf95/linux:

plot_xyz_residual_sunf95.gif

As you can see, and as expected, it's almost exactly the same as the ifort result, and different to the xlf95 result.

comment:16 by Ian Culverwell, 11 years ago

Resolution: fixed
Status: new → closed

The actions arising from all this work have been carried forward to tickets #348 and #349, and therefore, with Stig's agreement, I am closing this ticket - finally.
