Tawhiri/noaa wind data

Idea: Getting data from the NOAA

Currently, we use PyDAP and talk to a NOAA Java OPENDAP server.

This works reasonably well, but:

  • Often, just after a dataset is published, we get "Server error 0: "/gfs/gfs20121225/gfs_06z is not an available dataset"". An unconfirmed theory: of the two nodes behind the round-robin DNS, one receives the dataset first, which tells predict.py that it is available; the next request then goes to the other node, producing the error above.
  • It's a bit slow
  • Caching is a hack
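
For reference, the current approach looks roughly like the sketch below (the dataset URL, variable name and slice bounds are illustrative assumptions, not copied from predict.py):

 # Minimal sketch of the current PyDAP-based approach; the dataset URL,
 # variable name and slice bounds are illustrative assumptions.
 from pydap.client import open_url

 url = "https://nomads.ncep.noaa.gov/dods/gfs_0p25/gfs20121225/gfs_0p25_06z"
 dataset = open_url(url)  # PyDAP fetches the dataset's metadata here

 # Each slice below becomes a separate OPENDAP data request behind the scenes.
 u_wind = dataset["ugrdprs"][0:1, 0:1, 100:110, 200:210]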

Idea: Drop PyDAP and write our own client

Speaking OPENDAP

On the one hand, we need not speak the whole of the OPENDAP protocol. This would simplify things and require fewer requests to discover the format of the data. We know what format the wind data is in, and if it did change, that would affect other areas of the predictor that expect a certain format anyway.

However, if the expected form of the wind data is hardcoded deep in the code, it may be a complete pain to update when the format does change.
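
As a concrete illustration of what that hardcoding looks like, a hand-rolled request might resemble the sketch below; the base URL, variable name and index order (time, level, latitude, longitude) are assumptions, and baking them in like this is exactly the fragility described above.

 # Sketch of a hand-rolled OPENDAP request that skips PyDAP entirely.
 # The base URL, variable name and index layout are illustrative assumptions.
 import urllib.request

 BASE = "https://nomads.ncep.noaa.gov/dods/gfs_0p25/gfs20121225/gfs_0p25_06z"

 def fetch_ascii(variable, constraint):
     # constraint is an OPENDAP hyperslab, e.g. "[0:1][0][200:210][400:420]",
     # read here as (time, level, latitude, longitude) index ranges.
     url = "{0}.ascii?{1}{2}".format(BASE, variable, constraint)
     with urllib.request.urlopen(url) as response:
         return response.read().decode("ascii", errors="replace")

 wind_text = fetch_ascii("ugrdprs", "[0:1][0][200:210][400:420]")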

DNS abuse

We could perform a manual DNS lookup to get a list of IP addresses we can talk to, and then maintain some information on each node (whether it's up, which datasets it has, ...). Then, if we want some data, we could hit all nodes simultaneously (for different portions of the data), assuming that the limiting factor is not our own network speed.
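
A minimal sketch of the idea, assuming the host name below and that plain HTTP requests with an explicit Host header are acceptable:

 # Resolve every A record behind the round-robin name and spread requests
 # across the nodes. Host name and request paths are illustrative.
 import socket
 import concurrent.futures
 import urllib.request

 HOST = "nomads.ncep.noaa.gov"

 def node_addresses():
     # gethostbyname_ex returns (hostname, aliases, list_of_ip_addresses)
     return socket.gethostbyname_ex(HOST)[2]

 def fetch_from_node(ip, path):
     # Connect to a specific node by IP, keeping the Host header so the
     # server still routes the request correctly.
     request = urllib.request.Request("http://{0}{1}".format(ip, path),
                                      headers={"Host": HOST})
     with urllib.request.urlopen(request) as response:
         return response.read()

 def fetch_chunks(paths):
     ips = node_addresses()
     with concurrent.futures.ThreadPoolExecutor(max_workers=len(ips)) as pool:
         jobs = [pool.submit(fetch_from_node, ips[i % len(ips)], path)
                 for i, path in enumerate(paths)]
         return [job.result() for job in jobs]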

Chunk sizes

This requires more investigation; the NOAA OPENDAP servers have some strange behaviour depending on how much data you request in one go. Larger requests seem to make the server think for a little while before it starts sending data over the network.

If we download larger chunks, we fetch more data that we don't need, which wastes time; downloading lots of small chunks will slow us down due to the overhead of many small requests.
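
A rough way to investigate this would be to time requests of increasing size against the same dataset; the URL pattern and variable name below are assumptions:

 # Rough timing sketch for the chunk-size tradeoff; the dataset URL and
 # variable name are illustrative assumptions.
 import time
 import urllib.request

 BASE = "https://nomads.ncep.noaa.gov/dods/gfs_0p25/gfs20121225/gfs_0p25_06z"

 def time_request(lon_points):
     constraint = "ugrdprs[0][0][0:10][0:{0}]".format(lon_points - 1)
     start = time.time()
     with urllib.request.urlopen("{0}.ascii?{1}".format(BASE, constraint)) as r:
         r.read()
     return time.time() - start

 for n in (10, 100, 500):
     print(n, "longitude points:", round(time_request(n), 2), "seconds")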