The GeneBeans Dataflow Interface is a program that lets you use a graphical language to describe bioinformatics questions you want to ask of a database. You specify a workflow - what actions the computer should take, in what order they should be taken, and how information should be routed.

Each action to be taken is represented by a box. Possible actions are listed on the left side of the window in a palette, which is grouped into several categories. Click the category of action you want at the top left, click on the action at the lower left, and then click anywhere on the main pane to place that action.

You can then set the details of the action by double-clicking on it, which brings up a second window.

GeneBeans can be used for typical database queries. You'd start with the Database Input action, which you can think of as reading in the entire database. (GeneBeans currently uses the Plant Gene Index database from TIGR.) You'd then connect the Database Input to one or more Filters, each of which retains some database records while throwing out others. Finally, the output of your last Filter gets connected to an Output action, which prints out the records that match your query.
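For readers who think in code, the Input, Filter, Output chain can be pictured as a small pipeline. This is only a Python sketch of the idea, not GeneBeans code: the record fields, the function names, and the case-insensitive substring matching are all assumptions made for illustration.

```python
# Toy stand-in for the database; the field names are hypothetical.
RECORDS = [
    {"id": "TC1", "species": "rice", "annotation": "putative hydrogenase"},
    {"id": "TC2", "species": "potato", "annotation": "L-lactate dehydrogenase"},
    {"id": "TC3", "species": "rice", "annotation": "unknown protein"},
]


def database_input():
    """Database Input: yields every record in the database."""
    yield from RECORDS


def text_filter(records, field, text):
    """Filter: keeps records whose field contains text (assumed case-insensitive)."""
    needle = text.lower()
    return (r for r in records if needle in r[field].lower())


def summary_output(records):
    """Output in summary mode: reports only the number of records received."""
    return sum(1 for _ in records)


# Database Input -> Filter("hydrogenase") -> Output(summary)
count = summary_output(text_filter(database_input(), "annotation", "hydrogenase"))
print(count)
```

Each box in a GeneBeans diagram corresponds loosely to one function here, and each connection to passing one function's result into the next.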

For example, the following graph queries the database for all sequences that have been annotated as hydrogenases, printing only summary information (the number of sequences found - here, there are 1850).

By connecting two Filters in series, we get an "and" query. For instance, we can search for hydrogenases in rice (there are 34 in the database).

An "and" query is also useful when you don't know the exact form in which the data will appear in the database. One Filter for "lactase hydrogenase" won't show you an "L-lactate dehydrogenase" or an "L-galactono-gamma-lactone dehydrogenase", while two consecutive Filters (for "lact" and "hydrogenase") will.
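The effect of chaining two Filters can be sketched in a few lines of Python. This is an illustration only: the annotation strings other than the two quoted above are made up, and we assume a Filter keeps a record when its annotation contains the search text.

```python
# Hypothetical annotation strings; the first two come from the example above.
annotations = [
    "L-lactate dehydrogenase",
    "L-galactono-gamma-lactone dehydrogenase",
    "alcohol dehydrogenase",
    "lactase-like protein",
]


def contains(text):
    """Builds a Filter that keeps annotations containing text (case-insensitive)."""
    needle = text.lower()
    return lambda items: [a for a in items if needle in a.lower()]


# One Filter with the full phrase: nothing matches.
single = contains("lactase hydrogenase")(annotations)

# Two Filters in series: the output of the first feeds the second ("and").
series = contains("hydrogenase")(contains("lact")(annotations))

print(single)
print(series)
```

The series version keeps only annotations that pass both tests, which is why the two short fragments find records that the single long phrase misses.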

We can also connect Filters in parallel, to create an "or" query.

Note that this will produce more than the union of the two sets; if we search for "similar" and "-like", every sequence that is annotated to be similar to one thing and like another will show up twice in the output.
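A small Python sketch shows where the duplicates come from. The annotation strings are invented for illustration, and the de-duplication step at the end is our own addition for comparison, not something a GeneBeans action is known to do.

```python
# Hypothetical annotations; the third one matches both search terms.
annotations = [
    "similar to kinase",
    "kinase-like protein",
    "similar to receptor-like kinase",
    "unknown protein",
]


def contains(items, text):
    """Keeps the annotations that contain text."""
    return [a for a in items if text in a]


# Two Filters in parallel: both see the full input, and their outputs
# are simply concatenated at the Output action.
parallel = contains(annotations, "similar") + contains(annotations, "-like")

# A true union would list each record once (order preserved here).
union = list(dict.fromkeys(parallel))

print(len(parallel), len(union))
```

The record that matches both Filters arrives at the output once along each branch, so the parallel result is longer than the union.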

Drawing parallel and series diagrams isn't a very convenient way to query a database, and it really isn't the point of GeneBeans. The software is intended to start with a query and then perform statistical analyses or execute other algorithms on the results.

For example, one of our first questions was "Find the correlation between GC content and expression for each species in the database." This is a sequence database, so we don't have a direct measurement of expression, but our biologist chose to approximate it by calling singletons low-expression and contigs high-expression. For a crude check of statistical significance, we did the same calculation on a random sample of 10% of the database. The diagram for this question:

(In iceplant, with 6549 sequences, the total correlation is 0.0215 and the correlation on the random subset is 0.0669, which doesn't seem very convincing. Potato, with 19426 sequences, is more consistent: total correlation -0.140, subset correlation -0.131.)
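The calculation behind these numbers can be sketched as follows. This is our reading of the question rather than the GeneBeans implementation: we assume a Pearson correlation, code singletons as 0 and contigs as 1, and use hypothetical record fields named "sequence" and "is_contig".

```python
import math
import random


def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def pearson(xs, ys):
    """Pearson correlation coefficient; undefined if either list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def gc_expression_correlation(records):
    """Correlates GC content with the expression proxy (contig=1, singleton=0)."""
    gc = [gc_content(r["sequence"]) for r in records]
    expr = [1.0 if r["is_contig"] else 0.0 for r in records]
    return pearson(gc, expr)


def random_subset(records, fraction=0.10, seed=0):
    """Draws a random sample of about `fraction` of the records (at least 2)."""
    rng = random.Random(seed)
    k = max(2, round(len(records) * fraction))
    return rng.sample(records, k)
```

Running gc_expression_correlation once on all records for a species and once on random_subset of them gives the two figures to compare; if they differ widely, as in iceplant, the correlation probably isn't meaningful.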

This is a compact representation of the question that is both human- and machine-readable. The software can run this on one species at a time, and just prints out the correlations.

Another question might be "Compute the fraction of genes in each species that have no known similar genes."

(There are 38710 arabidopsis sequences in the database, of which 30626 do not appear to be annotated as similar to another gene; the corresponding counts are 48860 and 47857 for barley, and 17555 and 17119 for cotton. To change from one species to another, we just change the parameters of the first Filter.)
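Turning those counts into the fractions the question asks for is simple arithmetic. The short Python sketch below does it with the numbers quoted above; the variable and function names are our own.

```python
# (total sequences, sequences with no "similar" annotation), from the text above.
counts = {
    "arabidopsis": (38710, 30626),
    "barley": (48860, 47857),
    "cotton": (17555, 17119),
}


def fraction_without_similar(total, no_similar):
    """Fraction of sequences not annotated as similar to another gene."""
    return no_similar / total


fractions = {
    species: fraction_without_similar(total, no_similar)
    for species, (total, no_similar) in counts.items()
}

for species, frac in fractions.items():
    print(f"{species}: {frac:.3f}")
```

This works out to roughly 79% for arabidopsis and about 98% for both barley and cotton.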

We'd like to be able to "Compute the fraction of genes in each functional class (in the database as a whole and in each species) that have no known similar genes," but without a structured set of annotations (such as the Gene Ontology provides) it's very hard to make queries about functional classes.

GeneBeans is also designed to remember every query you execute, allow you to write notes about it, and allow you to use it as a starting point for future queries. This concept is often called an "Electronic Lab Book".

This summer and fall, we will add to GeneBeans the ability to talk to other programs - perhaps perform a BLAST search, or a Teiresias query, in the middle of a network. We will also be making it more tolerant of and efficient at large searches. We'd appreciate your ideas for what kind of questions you'd want to ask of the software, what other actions should be possible in a workflow, and what other programs we should be able to interface with. We hope to support the GO project (Gene Ontology) soon.

A brief description of the individual nodes is available.

Send biology questions to stapletona@uncw.edu and computer issues to hudsont@uncw.edu. Last revised 04/11/2003.