The temporary “progress report” comments in the parent post can be found at the end of this post.
Reformatting of MCDW data
The MCDW readme.txt file, dated 2014-06-11, is the only documentation of the MCDW file format that I have found, and it does not document that format adequately.
There are three types of files: Surface, Upper Air, and a publication file for each data month. These monthly files are updated when the data month is finalized. This is done when NCDC generates the monthly publication file.
The first ambiguity is the meaning of “finalized”. Only the publication file is “final”. Updated data for past months may appear in subsequent ssmYYMM.txt and ssmYYMM.fin files, so these files provide “currently final” data, which remain open to further revision.
Mean Surface Data Table Description
The format is as follows: WMO number, Station Name, LAT/LONG (degrees, minutes), elevation (meters), station pressure, sea level pressure (pressure values in millibars (hectopascals)), mean temp (deg C), departure (deg C), vapor pressure, departure (vapor pressure values are in millibars), days with precipitation (1mm or more for the month), monthly precipitation total (mm), precipitation quintile (frequency group of precipitation totals; Group numbers range from 0-6. Groups 0 and 6 are for totals less than and greater than any recorded in the 30-yr Climate Normal reference period.), sunshine total in hours, percentage of long term average.
This might seem a straightforward description, but practice does not follow it.
The inconsistencies and undocumented data:
The initial headache was provided by use of two date formats: the date format changed from YYYYMM (199408) to MMM YYYY (September 1994).
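A minimal sketch of how both date styles might be normalised to a (year, month) pair. The function name `parse_mcdw_date` is hypothetical, and the exact layout of the date text in the files is an assumption based only on the two examples above (199408 and September 1994); abbreviated month names are accepted as a precaution.

```python
# Hypothetical helper: normalise the two date styles seen in MCDW files.
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

def parse_mcdw_date(text):
    """Return (year, month) from 'YYYYMM' or '<month name> YYYY'."""
    text = text.strip()
    if len(text) == 6 and text.isdigit():
        # Older style, e.g. 199408
        year, month = int(text[:4]), int(text[4:])
    else:
        # Newer style, e.g. September 1994 (assumed: name, blanks, year)
        parts = text.split()
        if len(parts) != 2 or not parts[1].isdigit():
            raise ValueError(f"unrecognised date: {text!r}")
        name = parts[0].lower()
        # Accept a full name or an abbreviation of three or more letters.
        matches = [i for i, m in enumerate(MONTHS, start=1)
                   if len(name) >= 3 and m.startswith(name)]
        if len(matches) != 1:
            raise ValueError(f"unrecognised month: {parts[0]!r}")
        year, month = int(parts[1]), matches[0]
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {text!r}")
    return year, month
```

Returning a plain tuple keeps the result easy to compare and sort, whichever style the source file used.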
Up to 1998, and in a couple of isolated later occurrences, an additional column of data sits between elevation and station pressure. It appears to be the number of days for which observations were averaged (never exceeding 31, generally at least 28, but occasionally as low as 3).
And throughout, an additional column of data (a flag “Y” or “Z”, if present) follows the “sea-level pressure”, preceding Mean Temp. (That column clearly holds something other than sea-level pressure, unless a record high pressure of 2994+ hectopascals at sea level failed to catch my attention.)
Some mean temperatures, correctly extracted from MCDW files, nevertheless seem unlikely:
99.5°C at TULEAR, and
88.8°C at SAL seem unlikely. (99.5°C may result from someone meaning to enter 99.9 as a missing value code but typing 99.5 instead; 88.8°C may be similar, if someone shares with me the unfortunate ability to hit adjacent keys by mistake.)
-74.0°C at VERHNIJ BASKUNCHAK (Russia) is possible, but
-85.0°C at BARIKA (Algeria) and
The next stage is to update a ghcnm.tavg.v3.3.0.YYYYMMDD.qcu.dat from the MCDW records, checking for any corrupt values or missed updates, and checking for unlikely temperature values like those above.
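The check for unlikely temperatures could be sketched as below. This is an illustration under stated assumptions, not the check GHCN uses: the function name, the global bounds (the record extremes of surface air temperature, roughly -89.2°C and 56.7°C, rounded outward) and the 10°C margin around a station's own history are all my choices.

```python
# Assumed global bounds: record extremes of surface air temperature,
# rounded outward. No monthly mean should fall outside them.
GLOBAL_MIN_C = -90.0
GLOBAL_MAX_C = 60.0

def classify_mean_temp(value_c, station_history=None, margin_c=10.0):
    """Classify a monthly mean temperature as 'impossible', 'suspect' or 'ok'.

    station_history is an optional list of earlier monthly means for the
    same station and calendar month; margin_c is an assumed tolerance.
    """
    if not GLOBAL_MIN_C <= value_c <= GLOBAL_MAX_C:
        return "impossible"
    if station_history:
        low = min(station_history) - margin_c
        high = max(station_history) + margin_c
        if not low <= value_c <= high:
            return "suspect"
    return "ok"

# Values from the list above; the history is illustrative, not real data.
print(classify_mean_temp(99.5))                                  # impossible
print(classify_mean_temp(-85.0))                                 # ok
print(classify_mean_temp(-85.0, station_history=[8.0, 9.5, 10.2]))  # suspect
```

The global bounds alone catch values such as 99.5°C and 88.8°C but let -85.0°C through, which is why a comparison against the station's own record is included as a second step.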
In addition, artifacts such as those highlighted below appear from time to time and must be discarded.
Fortunately these artifacts are relatively easy to remove at the same time as blank lines. Station data records start with a numeric WMO number. The first line of each file starts with “CURRENT”, and supplies the date of the file. Updates or corrections to past station data records follow the current station data records, with lines starting “OVERDUE” or “CORRECTIONS” (or both) as section markers. Any line starting with blanks should supply either a date for the overdue or corrected records, or the name of a continent or country; otherwise treat the line as an artifact. The continent or country name lines can be ignored, as the WMO number in the data record supplies the necessary information.
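The rules above can be sketched as a small line classifier. The names `classify_line` and `keep_lines` are hypothetical, and the two regular expressions are assumptions: the text does not say how to tell a continent or country line from an artifact, so here an indented line counts as a region name only if it consists of letters and common name punctuation.

```python
import re

# Assumed shapes: a date is YYYYMM or "<month name> YYYY"; a region name
# starts with a letter and contains only letters and name punctuation.
DATE_RE = re.compile(r"^(?:\d{6}|[A-Za-z]{3,9}\s+\d{4})$")
REGION_RE = re.compile(r"^[A-Za-z][A-Za-z .,'()/-]*$")

def classify_line(line):
    """Label one line of an MCDW surface file."""
    if not line.strip():
        return "blank"
    if line[0].isdigit():
        return "station"            # record starts with a numeric WMO number
    if line.startswith("CURRENT"):
        return "header"             # first line, supplies the file date
    if line.startswith(("OVERDUE", "CORRECTIONS")):
        return "section"            # marks late or corrected records
    if line[0] in " \t":
        body = line.strip()
        if DATE_RE.match(body):
            return "date"           # date for overdue/corrected records
        if REGION_RE.match(body):
            return "region"         # continent or country, can be ignored
    return "artifact"

def keep_lines(lines):
    """Drop blank lines, region names and artifacts; keep everything else."""
    unwanted = {"blank", "region", "artifact"}
    return [ln for ln in lines if classify_line(ln) not in unwanted]
```

Classifying first and filtering second keeps the "date" and "section" lines available, since they are needed to assign overdue or corrected records to the right month.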
Exploration of updating from MCDW records
to be added
The comments from the parent post:
Thursday July 4. Short delay needed while downloading MCDW files for recent years, reformatting to enable comparison with Met Éireann data
Friday July 12. Progress report. Delay somewhat lengthened while exploring MCDW further. I think I now see how GHCN was incorporating corrupted data from MCDW, and how this should be avoided.
Before moving on to describe the appropriate steps, I’m implementing the code which I believe GHCN should have used when importing MCDW data. That will still leave open the question of how corrupted data entered MCDW in the first place, and whether this was really confined to “select stations in Ireland”. A comment from Nick Stokes suggests a line of inquiry to check. It would be helpful if there were adequate documentation for MCDW. Using Google to search for such documentation turns up sensible recommendations as far back as 1993 (see below here): peer-reviewed published paper, computer compatible format, etc. It does not turn up proper documentation. I’ll have more to say on the format of MCDW in particular, and how it might be improved with a simple change.