Trillian for Developers

Have you always wanted to explore astronomy, but ended up in software? Now’s your chance to make a significant contribution— no astronomy degree required.

Trillian will require not only the latest developments in computer science and grid computing, but also presents new and interesting challenges in working with the specifics of astronomical data. We are actively soliciting development help from the programmer community at large to build a platform for scientific research unlike any before it. These contributions will enable new and exciting scientific discoveries.

How can you help? We’re starting from scratch here: we have many terabytes of data from ground based telescopes (e.g. SDSS, 2MASS) and satellites (e.g. WISE, Spitzer), and we are building a new system to distribute it across many nodes. The primary purpose is to create a computational engine to run user-provided code against the entire data set (or subsets of).

Below is a list of tasks that you can contribute to. Contributions can come in many forms, from advice to code relating to:

Data API design.
Security considerations.
Managing data distribution.
Data visualization of terabytes of data.
Deploying compute instances to cloud services.
Expert advice on new infrastructure design/implementation.
Code!!

High Level Project Overview

While astronomers have accumulated large amounts of data, access to it is highly disparate and not nearly as easy as it needs to be. This makes all-sky, multi-wavelength analyses extremely difficult or impossible. For example, let’s say I’m interested in thermally pulsating asymptotic giant branch stars, a kind of star in a specific phase of its life. No catalog exists of these kinds of stars — they are rare and difficult to find (and even define). But they can surely be found in the many large-scale surveys that exist. The stars have certain characteristics: they are very bright, they are variable, they are surrounded by dust (which means they will be bright in certain wavelengths and dark in others), etc. If I can describe these characteristics (i.e. a model) as a piece of code, I want to be able to run that “test” against all detected objects in the sky. For each one, the result will be a likelihood measurement of how well the model fits the data. The system will understand the peculiarities of each telescope that made the measurement.

Models can be made not only for stars, but dust, clusters, galaxies... really anything that can be found in space. We will encourage astronomers to build a large number of models. In fact, this is what scientists do— we want to translate these models into a computational space and apply them to large volumes of data. Once we have a library of models, then we can ask the most interesting question, the one that can’t be asked today: what, in all of our observations, doesn’t fit any of our models? It’s here where new discoveries are waiting to be made.

Platform Structure

This is how we’ve envisioned the structure, but everything is open to suggestion or comment.

Central Server

A central server will run a PostgreSQL database. This will keep track of what data is available and where it is. In addition, it will also host several “point source” catalogs. (These are catalogs that contain a list of parameters for each object detected by a particular telescope; this data is tabular in form.) Since these are comparatively small (most are a few GB or smaller), it will be useful to have them all in one place. The database will also record the results of analyses— another reason to have point source information to cross reference against. Computation jobs will be farmed out by the central server.

Data Nodes

There is far too much data to store in any single central server, and more is being made available all the time. We will use Healpix (a system that divides a sphere into equal-area pixels) to divide the sky. Each pixel will contain all of the data available for that part of the sky. Some will be tabular (e.g. a list of objects and their properties), some will be more complex (e.g. image data, spectral data). When a new data node (which may be local or remote) is registered with the central server as being available, the server will calculate how many pixels of data will fit there, assemble them, and place them on the node.

Computation Node

Models are submitted to the central server as pieces of code (for now, limited to Python). For each model, the aim is to compute the likelihood that each object in the data matches the model. For the moment, we will limit this to a trivial calculation, e.g. is the magnitude (brightness in the sky), between values x₁ and x₂)? The computation may take place on the central server or be farmed out to another node.

Data API

An API is needed to retrieve the data for a particular object or location on the sky. The data may be located on the central node, but more likely will be located on a remote node. The remote data node may be able to perform the calculation (in which case the data will not need to be retrieved, only the result), but the data node may not possess (or grant) the ability to perform model computations. Since an API is required to transmit the data from the node to the server performing the computation, there is no reason not to make this API directly available to the public. This will be a very useful tool alone for the astronomical community, which we get (nearly) “for free”.

Task List

Roll up your sleeves.

Three major components comprise the foundation of Trillian: a distributed storage system, a distributed computation engine, and an API for programmatic data access. Tasks will center around these components, but there will be a certain amount of overlap. Many things can be performed in parallel, e.g. developing an API to access the data isn’t a prerequisite to how the data will be distributed.

Design the central database schema to manage data and compute nodes.
Take a disparate data set (table data, image, etc.) and package it into a bundle based on the location in the sky.
Create ability to register a node with the central server, indicating disk space available, location of node, ability to compute models, etc.
Design an API to access data both by Trillian and to the public. API must be flexible to accommodate future data.
Define a procedure for adding a new data set.
Design a queue to accept models and coordinate their calculation on compute nodes.

And of course much more— one of the first tasks will be filling out this list. The best way to do this is to contribute to the Trillian GitHub repository.