Tāwhirimātea

Tāwhirimātea (Tawhiri) is the name given to the next version of the Landing Prediction Software, which will probably be different enough from the current version (see below) to warrant a new name.

The name comes from the Māori god of weather, who rather aptly "drove Tangaroa and his progeny into the sea" (WP).

Introduction

Long story short, the predictor consists of

  • Some Python responsible for downloading NOAA wind data.
  • Some C responsible for solving a simple ODE with forward Euler, assuming the balloon moves at the speed of the wind (it is still very accurate, so not much effort is required there; see the sketch after this list).
  • Some PHP responsible for drawing a map.
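
A minimal sketch of that Euler step in Python, assuming a hypothetical get_wind(lat, lng, alt, t) helper that returns east/north wind components in m/s:

    from math import cos, radians

    # One forward Euler step: the balloon is assumed to move at exactly the
    # wind velocity, plus a fixed ascent rate. get_wind is hypothetical.
    def step(lat, lng, alt, t, dt, ascent_rate):
        wind_u, wind_v = get_wind(lat, lng, alt, t)
        # Convert metres moved into degrees (roughly 111111 m per degree).
        lat += (wind_v * dt) / 111111.0
        lng += (wind_u * dt) / (111111.0 * cos(radians(lat)))
        alt += ascent_rate * dt
        return lat, lng, alt, t + dt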

And the motivation for this project is

  • The servers the Python relied on were getting really slow, so that part had to be rewritten from scratch (done).
  • The C is old, and we want to add new features like:
    • Monte Carlo prediction, better integration
    • a floating balloon mode
    • perhaps something that takes into account the altitude of the ground when it lands
  • It would be easier to integrate the various bits if they were all written in Python. This would also ease integration with other projects, like [tracker] (slated for a rewrite in Python) and [[1]] (Python; CUSF).
  • PHP is disgusting.

Notes

Ideas

Summer 2013

In the Summer of 2013, PyDAP broke. The servers became too slow to use properly.

Daniel (djr61) hacked up, in two days, a replacement wind-downloading program in Python (gevent, numpy) that gets binary (GRIB) data off the NOAA FTP servers instead and unpacks it into an 18GB array of doubles. The C then accesses that by memory-mapping the entire file into its address space and casting it to an array of the right dimensions.
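
For illustration, the same access pattern in Python is essentially one call to numpy.memmap; the file name and array shape below are placeholders, not the downloader's actual layout:

    import numpy as np

    # Placeholder shape: (forecast hour, pressure level, variable, lat, lon).
    SHAPE = (65, 47, 3, 361, 720)

    # Map the whole file into the address space; data is only read (paged in)
    # when indexed, so opening the 18GB dataset is effectively free.
    dataset = np.memmap("gfs.dat", dtype=np.float64, mode="r", shape=SHAPE)

    value = dataset[4, 10, 1, 200, 360]   # one double, one page of IO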

This works hilariously well, with predictions now completing in under 20ms. We have gigabit internet, lots of spare power and loads of RAM, so it's not as bad as it sounds. It also hugely simplified the C code.

See my post to the UKHAS mailing list for a quick summary (hourlies).

This is deployed at http://predict.habhub.org/.

Some Notes:

It was a 2-day sprint. There are no comments and no unit tests. It works, and the downloader's interaction with the other bits should be fairly minimal. I intend to clean it up eventually.

Code:

Tasks

  1. Rewrite the core predictor (in Python?)
  2. Add a float mode
  3. Add better integration
  4. Rewrite the web interface (Python? gevent? Flask? PostgreSQL? Worker daemons?)

People

  • Daniel (djr61)
  • George (gd365)
  • Ilya (im354)
  • David (db590)
  • Thomas (tp378)

CUSF C Predictor Notes

Source, Versions

These notes refer to cusf-standalone-predictor 0d32e97 and cusf-landing-prediction 5e6e9a9.

The C binaries in the hourly and the standalone predictors have the following differences:

  • Standalone has an 'alarm' option (e689ab9) - Kills the predictor with alarm(2)/SIGALRM after 10 minutes if it does not exit gracefully.
  • Standalone is far stricter on errors (e358188, 1fc73b9) - Several WARNs were changed to ERRORs, and it exits as soon as any ERROR occurs.

Functions

  • pred.c
    • main
      • parses options
      • sets up wind cache: wind_file_cache_new
      • parses scenario(s)
      • sets up altitude model: altitude_model_new
      • run_model()
        • N.B. FILE *output, *kml_file ‘passed’ via global variables
      • misc cleanup
    • small kml header/footer functions
  • run_model.c
    • run_model
      • repeatedly:
        • calls advance_one_timestep
        • calls write_position (every N steps)
        • sorts states by log likelihood (currently hardcoded: only 1 state)
      • write_position (final position)
    • advance_one_timestep
      • for each state:
        • altitude_model_get_altitude()
        • get_wind()
        • random_sample_normal for wind speed; adds randomness to windspeed and updates log likelihood (currently hardcoded rmserror=0 so does not modify speed)
        • updates the state (i.e., forward Euler / rectangle-rule integration)
    • get_wind
      • wind_file_cache_find_entry for current latitude and longitude; returns a wind file before now and after now
      • check both contain current point (in space); before < now < after
      • wind_file_get_wind on both files
      • return x, y velocity; mean variance (mean flattens 2D -> 1D)
  • altitude.c
    • get_density
    • altitude model:
      • constant ascent
      • descent: assume terminal velocity
      • NB: drag coefficient = descent rate * magic constant (magic constant is in pred.c)
    • seems to be designed with multiple altitude models (e.g., floating) in mind, but these are not used.
  • wind/wind_file.c
    • functions to read wind files (csv)
    • wind_file_get_wind
      • not thread safe
      • searches wind file axes for desired location (lat, lon, height)
      • linear interpolation in 3D for x and y velocity
      • estimates wind variance by calculating the variance (i.e., E(x^2) - E(x)^2) of the 8 corners of the cube it interpolated within (see the sketch after this list)
  • wind/wind_file_cache.c
    • scans a directory for wind files, parses headers to determine lat/lon ranges, time.
    • wind_file_cache_find_entry gets the best file for a certain point in space/time
  • omitted: util/ - some functions referred to, but have self explanatory names
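
To make the interpolation and variance estimate in wind_file_get_wind concrete, here is a rough Python sketch; the array layout and axis variables are hypothetical, not the C's actual data structures:

    import numpy as np

    def interpolate_wind(grid, alts, lats, lngs, alt, lat, lng):
        """Trilinear interpolation of (u, v) wind at one point.

        grid is a hypothetical array of shape (n_alt, n_lat, n_lng, 2)
        holding u and v; alts, lats, lngs are the axis values.
        """
        def locate(axis, value):
            # Cell index containing value, and fractional position inside it.
            i = int(np.searchsorted(axis, value)) - 1
            return i, (value - axis[i]) / (axis[i + 1] - axis[i])

        ia, fa = locate(alts, alt)
        iy, fy = locate(lats, lat)
        ix, fx = locate(lngs, lng)

        # The 8 corners of the surrounding cell, shape (2, 2, 2, 2).
        corners = grid[ia:ia + 2, iy:iy + 2, ix:ix + 2]

        # Linear interpolation along each axis in turn.
        c = corners[0] * (1 - fa) + corners[1] * fa      # collapse altitude
        c = c[0] * (1 - fy) + c[1] * fy                  # collapse latitude
        u, v = c[0] * (1 - fx) + c[1] * fx               # collapse longitude

        # Variance estimate, E(x^2) - E(x)^2, over the 8 corners of the cell
        # (per component, then averaged).
        var = ((corners ** 2).mean(axis=(0, 1, 2))
               - corners.mean(axis=(0, 1, 2)) ** 2).mean()

        return u, v, var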

Remarks

Is it messy? Not really. I think the majority of the pain in adding to it would be having to write in C; the bits that manage files, load scenarios and so on are really quite tedious in C. The prediction bit itself certainly benefits from speed; hopefully this will not be an issue. We could possibly have a Python/C mix if required.

Is it worth tidying up? Tidying would probably just consist of changing it to conform to what we think C should look like, rather than making substantive changes. Most of it is pretty reasonable. It does sometimes segfault; that might be worth fixing.

Is it easy to redo in Python? It wouldn't take too much effort. There aren't that many lines of code here, and hopefully the line count would drop dramatically in Python too.

Is it worth redoing in Python? In my opinion, yes, because then we can add cool new things. For example, I would really like better integration: I'm hoping it will be possible to integrate from entering a cube to leaving a cube in one step, and it might still be possible to do Monte Carlo while integrating cube to cube like that. Or multiple configurable altitude models (e.g., float).

Idea: Multiple altitude models

Currently, the predictor has only one option: ascent at a fixed rate, then descent using a simple model that assumes terminal velocity.

We could have:

  • Ascent/Descent (as above)
  • Floating balloons (this is an oft-requested feature)
  • ???

Generally, it would be nice if it were easy to add new models to the predictor; some sort of base class/subclass arrangement.
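
A rough sketch of what that could look like (all names are hypothetical; nothing like this exists in the C yet):

    class AltitudeModel:
        """Base class: maps time since launch to altitude."""
        def altitude(self, t):
            raise NotImplementedError

        def finished(self, t, alt):
            """True when the prediction should stop (e.g. landing)."""
            raise NotImplementedError

    class AscentDescent(AltitudeModel):
        """Constant-rate ascent to burst, then terminal-velocity descent."""
        def __init__(self, ascent_rate, burst_altitude, descent_rate):
            self.ascent_rate = ascent_rate
            self.burst_altitude = burst_altitude
            self.descent_rate = descent_rate

    class Float(AltitudeModel):
        """Ascend to a float altitude and stay there until a cutoff time."""
        def __init__(self, ascent_rate, float_altitude, stop_time):
            self.ascent_rate = ascent_rate
            self.float_altitude = float_altitude
            self.stop_time = stop_time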

Idea: Better Integration

Currently, the predictor uses forward Euler integration. It should be possible to use something better.

Analytic integration is probably not going to work.
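
One candidate (no more than a suggestion) is a classical Runge-Kutta step, which only needs the wind field sampled at a few intermediate points:

    # RK4 step for x' = f(t, x), where x is e.g. a (lat, lng, alt) numpy array
    # and f samples the wind field. A sketch, not what the predictor does now.
    def rk4_step(f, t, x, dt):
        k1 = f(t, x)
        k2 = f(t + dt / 2, x + (dt / 2) * k1)
        k3 = f(t + dt / 2, x + (dt / 2) * k2)
        k4 = f(t + dt, x + dt * k3)
        return x + (dt / 6) * (k1 + 2 * k2 + 2 * k3 + k4)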

Idea: Getting data from the NOAA

Currently, we use PyDAP and talk to a NOAA Java OPENDAP server.

This works well-ish, but:

  • Often, just after dataset publication, we get "Server error 0: /gfs/gfs20121225/gfs_06z is not an available dataset". My unconfirmed theory is that one of the two nodes behind the round-robin DNS gets the dataset first; it tells predict.py the dataset is available, and the next request goes to the other node, producing the error above.
  • It's a bit slow
  • Caching is a hack

Idea: drop PyDAP and write our own client

Speaking OPENDAP

On the one hand, we need not speak the whole of the OPENDAP protocol. That would simplify things and require fewer requests to discover the format of the data, and so on. We know what format the wind data is in, and if it did change, it would affect other areas of the predictor that expect a certain format anyway.

However, if the expected form of the wind data is hardcoded deep in the code, it may be a complete pain to change when the format does change.
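
For context, the slice of OPENDAP we actually need is just a constrained GET on one variable. A hedged sketch of that request (the host, dataset path and variable name are illustrative placeholders, not checked against NOAA's servers):

    import urllib.request

    # Illustrative placeholders throughout.
    base = "http://example-nomads-server/dods/gfs/gfs20121225/gfs_06z"

    # OPENDAP constraint expression: variable[start:stride:stop] per dimension
    # (one forecast time, all levels, a small latitude/longitude window).
    constraint = "ugrdprs[0:1:0][0:1:46][200:1:220][300:1:330]"

    with urllib.request.urlopen(base + ".dods?" + constraint) as response:
        raw = response.read()   # DDS text header, then XDR-encoded binary data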

DNS abuse

We could perform a manual DNS lookup to get a list of IP addresses that we can talk to, and then maintain some information on each node (whether it's up, which datasets it has, ...). Then, if we want some data, we could hit all nodes simultaneously (for different portions of the data), assuming the limiting factor is not our network speed.
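
A minimal sketch of that lookup (the hostname is illustrative):

    import socket

    # Resolve every A record behind the round-robin name ourselves, so we can
    # track and target individual nodes. The hostname is illustrative.
    _, _, addresses = socket.gethostbyname_ex("nomads.ncep.noaa.gov")

    # Per-node bookkeeping: is it up, and which datasets has it published?
    nodes = {ip: {"up": None, "datasets": set()} for ip in addresses}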

Chunk sizes

This requires more investigation; the NOAA OPENDAP servers have some strange behaviour depending on how much data you request in one go. Larger chunks seem to make the server think for a while before it starts sending data over the network.

If we download larger chunks, we download more data that we don't need, which wastes time; downloading lots of small chunks will slow us down due to the overhead of many small requests.
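
Roughly, total time is (requests x per-request overhead) + (bytes transferred / bandwidth); larger chunks shrink the first term and inflate the second. A toy calculation with made-up numbers:

    # Toy model, made-up numbers: time = requests * overhead + bytes / bandwidth.
    overhead = 0.5          # seconds of server "thinking" + round trip, per request
    bandwidth = 100e6 / 8   # bytes/second on a 100 Mbit/s link

    def total_time(n_requests, useful_bytes, waste_fraction):
        transferred = useful_bytes / (1.0 - waste_fraction)
        return n_requests * overhead + transferred / bandwidth

    print(total_time(2000, 2e9, 0.0))   # many precise requests: ~1160 s
    print(total_time(50, 2e9, 0.5))     # few wasteful requests:  ~345 s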

Idea: Rewriting the core of the predictor in Python

Currently the predictor is split across 4 languages:

  • The homepage and AJAX requests are served by PHP
  • The client is mostly Javascript
  • PHP starts a Python script, which downloads wind data and invokes:
  • The predictor itself, a C binary

The majority of the time spent predicting goes on downloading wind data, so we don't really need C performance. (Having said that, we may find we want it later when we come to do Monte Carlo ...)

Benefits to moving to Python:

  • The majority of the C predictor is IO, getting wind data, reading scenarios, writing CSV. This would be far simpler in Python
  • Easier to add features, add different altitude models
  • Can have the predictor process get wind data directly, so we don't need latitude/longitude deltas - it just gets what it needs

Although we intend to replace PyDAP, a Python rewrite could initially talk to PyDAP and use the existing PHP/Javascript. Indeed, it looks like it would be fairly simple to break the work on the predictor up into independent stages, with rewriting the C being one of them.