5. VWhere: N-Dimensional Data Mining

5.1 vwhere

Synopsis

A graphical, interactive version of the S-Lang where function

Usage

Array_Type = vwhere( structure | 1Darray, 1Darray, ...)

Description

VWhere provides an easy-to-understand yet capable mechanism for exploring and filtering multidimensional datasets. Its visual, interactive approach to constructing complex, multidimensional filters offers a fluid and intuitive alternative to the classic approach of file-based filtering with command line tools (as used, e.g., within astronomy data analysis), and can be considerably faster, cleaner, and more powerful.

Filtering is performed upon data vectors generated either in memory or from disk files, with no filter syntax required and the result instantly visualized for inspection. File-based filtering tools, by contrast, require explicit syntax (often conflicting with the syntax employed by other tools or systems) and require that the resulting file(s) be re-loaded into separate programs for verification (e.g. a plot or file dump).

Unlike the all-at-once style mandated by file-based filtering, VWhere filters may be applied incrementally (or not) to arbitrary axes of your input dataset. This avoids the creation of numerous "file litter" products while one experiments with filter ranges or axis combinations, as well as the performance penalties of multiple I/O iterations over files. Moreover, most file-based filtering tools are static in function: they cannot be augmented at runtime by dynamic loading of modules. VWhere filters, however, may employ not only any built-in S-Lang arithmetic operator or function, but also essentially arbitrary C, C++, or FORTRAN code loaded from external modules.

Input to vwhere should contain at least two numeric vectors of equal length. Vectors may be passed either as a comma-separated list of 1-D arrays or as a single structure containing two or more fields. A vector is ignored if its length does not match that of the first vector, if it is non-numeric in type, or if its name is prefixed with an underscore. When using the comma-separated list form it can be useful to prefix the name of each vector with "&" (the S-Lang reference operator). This lets VWhere determine the vector name and reflect it in the Axis Expression window, instead of assigning it a less meaningful name such as "array1" (see examples below).

Upon invocation VWhere launches its Axis Expression Window, which provides a means of generating plots, fabricating new data vectors, and issuing arbitrary commands through an interactive S-Lang prompt.

PLOTTING

Filtering in VWhere amounts to manipulating regions of interest on plots. The number of plots that may be created or overplotted, and the number of region filters applied to each, is effectively unlimited. Plots may also be deleted, panned, and zoomed -- providing a rapid means of data exploration -- as well as customized through a number of graphical user interface preferences.

Plots are specified in the Axis Expression Window via two editable text fields, one for each of the X and Y axes. The content of each field defaults to the name of the first and second input vectors, respectively, and may be changed either by typing new expressions for each axis or by selecting from the Choose dropdown menu. In general each axis expression may contain any valid S-Lang statement, even calls to C, C++, or FORTRAN functions imported from external modules. The chief constraints upon an axis expression are that it be less than 256 characters long and that it generate a numeric vector.

Two kinds of plots may be visualized, filter plots and overplots, by pressing either the Plot or OPlot buttons. The main distinction between the two is that overplotted X/Y vector pairs may be of arbitrary length, while filter plots require vectors exactly equal in length to those within the input dataset. The latter constraint stems from the fact that, logically, array expressions given to the underlying S-Lang where command [e.g. where(A < 5 and B > 11)] can operate only upon isomorphic vectors. In addition, because overplotted vectors are not subject to where filtering they are always drawn in their entirety; thus they provide additional means for qualitative, visual comparison, but have no quantitative effect on the result returned by VWhere. Finally, when a filter plot is created each unique axis expression -- and the resulting vector that it generates -- is "remembered" in the Choose dropdown menu. This provides for easy re-selection later, and is a fast and simple mechanism for fabricating new data on the fly, of essentially arbitrary complexity, thanks to the extensibility and generality of axis expressions.
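
The equal-length requirement can be seen with the plain S-Lang where function, independent of VWhere. The following is a minimal sketch; the arrays A and B and their values are made up for illustration, and the comparisons are parenthesized explicitly rather than relying on operator precedence:

```slang
% Two vectors of equal length, as a filter plot requires.
variable A = [1, 3, 5, 7, 9];
variable B = [10, 12, 14, 16, 18];

% Each comparison yields one truth value per element, so both operands
% of "and" must have the same length for the element-wise combination
% to be meaningful.
variable idx = where ((A < 5) and (B > 11));

% idx holds zero-based indices into A and B; here only index 1
% (A = 3, B = 12) satisfies both conditions.
vmessage ("%d point(s) selected, first index = %d", length (idx), idx[0]);
```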

REGION FILTERS

The following region filters may be applied after visualizing a plot:

        rectangle               click MouseButton1, then drag mouse to
                                define bounding box

        ellipse                 same as rectangle

        polygon                 click MouseButton1 to add vertices
                                click MouseButton2 to close polygon
                                click MouseButton3 to cancel

Filters may be deleted (by hitting the BACKSPACE or DELETE key), moved, or resized after initial placement. By default, regions perform INCLUDE filtering: points within a region are kept and points outside it are discarded. Alternatively, a region may be used for EXCLUDE filtering by pressing the 'e' key while it is initially being laid out; the region will be marked with a diagonal slash, and points within it will be discarded during subsequent filtering while points outside it will be kept. A region may be toggled between the INCLUDE and EXCLUDE states by pressing 'e' while it is selected.
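
The difference between INCLUDE and EXCLUDE filtering can be modeled with ordinary where expressions. This is only a sketch of the semantics described above, not VWhere's internal implementation; the sample points and the rectangle corners (1.5, 3) and (4.5, 20) are invented for illustration:

```slang
% Sample points (values chosen only for illustration).
variable x = [1.0, 2.0, 3.0, 4.0, 5.0];
variable y = [1.0, 4.0, 9.0, 16.0, 25.0];

% A rectangular region expressed as a mask: 1 for points inside the
% box, 0 for points outside it.
variable inside = (x >= 1.5) and (x <= 4.5) and (y >= 3.0) and (y <= 20.0);

% INCLUDE region: keep the interior (indices 1, 2, 3 here).
variable kept_by_include = where (inside);

% EXCLUDE region: keep the exterior (indices 0 and 4 here).
variable kept_by_exclude = where (not (inside));
```

Every point falls into exactly one of the two index sets, which is why toggling a region with 'e' swaps the selected and filtered points for that region.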

Points included during filtering are considered "selected," and will be drawn in the foreground line style and symbol color; excluded points are considered "filtered" and will be drawn, when requested, in the background line style and color. Line styles and symbol colors may be adjusted from within the preferences dialogs.

INCREMENTAL FILTERING

One of the more useful features of vwhere is the incremental manner in which the dataset may be filtered. In contrast with the file-in/file-out filtering method offered by command line tools, which applies the entire set of filters to the entire input dataset -- conceptually in just one pass -- vwhere provides the option of filtering some axes of the dataset, by applying region filters to currently displayed plots, prior to filtering other axes.

This provides a powerful mechanism for exploring relationships within your data, and can also speed up subsequent plotting and filtering. When incremental filtering is on (the default) only points selected by the current filters will be colored in subsequent plots. Filtered points will either be drawn grayed out on subsequent plots (the default) or not drawn at all (a faster option for large datasets), per the current preferences. The next section describes how filters are incrementally combined.

RETURN VALUE

The return value of the vwhere guilet matches that of the native S-Lang where function: an array of numbers, each representing an index into the vector(s) given to the comparison operator(s) of the where expression. These indices may then be applied to related datasets, used to create filtered output files, and so on.

Filters applied to a single plot are unioned to form the set of points selected by that plot. If only one plot is visualized then this set completely specifies the indices returned by vwhere. When multiple plots are visualized the incremental selections from each are either intersected (the default) or unioned (when chosen in the preferences dialog) to generate the aggregate set of selected points.
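
The combination rules can be sketched with boolean masks. This is a simplified model under stated assumptions rather than VWhere's own code: each region is reduced to a one-dimensional range for brevity (real regions are two-dimensional shapes on a plot), and all values are illustrative:

```slang
variable x = [1:10];     % 1, 2, ..., 10
variable y = x * x;      % 1, 4, ..., 100 (element-wise product)

% Plot 1: two regions on the same plot are unioned.
variable plot1 = ((x >= 2) and (x <= 4)) or ((x >= 7) and (x <= 9));

% Plot 2: a single region.
variable plot2 = (y >= 10) and (y <= 70);

% Default preference: selections from different plots are intersected.
variable intersected = where (plot1 and plot2);   % indices 3, 6, 7

% Alternative preference: selections from different plots are unioned.
variable unioned = where (plot1 or plot2);        % indices 1 through 8
```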

If zero region filters have been applied the entire input dataset will be returned. Dismissing the guilet by any means other than pressing "Done" in the plots window will return the empty dataset.
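
Because the result is an ordinary index array, it can be checked for emptiness and then applied to every vector of the dataset. In the sketch below a plain where call stands in for the interactive vwhere call so that the fragment runs without the GUI; the variable names are illustrative:

```slang
variable x = [1:100];
variable y = x^2;

% Stand-in for the array vwhere would return; in an interactive session
% this would instead come from:  result = vwhere (&x, &y);
variable result = where ((x > 10) and (x <= 20));

variable xs, ys;
if (length (result) == 0)
{
   vmessage ("nothing selected, or the guilet was dismissed without Done");
}
else
{
   % Apply the same indices to each vector of the dataset.
   xs = x[result];
   ys = y[result];
   vmessage ("%d of %d points kept", length (xs), length (x));
}
```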

Example

The following explores the curves y = x^2 and z = x^3 over [1,100] :

                x = [1:100];  y = x^2;  z = x^3;
                result = vwhere(x, y, z);

The next call

                result = vwhere(&x, &y, &z);

is identical, but because the arrays are passed as references VWhere can determine the name of each vector and reflect them in the Axis Expression window as "x," "y," and "z," instead of fabricating the names "array1," "array2," and "array3".

The following explores a hypothetical binary table read from disk:

                tab = your_favorite_FITS_file_reader ("table.fits");
                result = vwhere(tab);

If the tab structure contained CCD_ID, PHA, and TIME fields, then valid expressions by which two plots could be generated from this table might be:

        PLOT 1:
                        X :       ccd_id
                        Y :       pha
        PLOT 2:
                        X :       time
                        Y :       log10(pha)

The log10(pha) expression for the Y axis of the second plot creates a new data vector, which will also be selectable from the Choose dropdown menu for use in subsequent plots.

Notes

The GtkPlot widget atop which VWhere is built is not robust in the face of Inf/NaN values. VWhere attempts to compensate for this, but for performance reasons does not execute both isnan() and isinf() on all X/Y plot vectors. To avoid undefined behavior and potential data loss, Inf/NaN values should therefore be culled before the data are passed to vwhere.
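
One way to cull such values beforehand is to build a mask with the S-Lang isnan and isinf functions. This is a minimal sketch with invented sample data, and is only one of several reasonable approaches:

```slang
variable x = [1.0, 2.0, _NaN, 4.0, 5.0];
variable y = [1.0, _Inf, 9.0, 16.0, 25.0];

% Flag every position where either coordinate is NaN or infinite.
variable bad = isnan (x) or isinf (x) or isnan (y) or isinf (y);

% Indices of the finite points (0, 3, and 4 here).
variable good = where (not (bad));

% Keep only the finite points, preserving the pairing of x and y.
x = x[good];
y = y[good];
```

Note that indices later returned by vwhere would then refer to the culled arrays; if positions in the original arrays are needed, they can be recovered by indexing good with the returned array.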

See Also

http://arxiv.org/abs/astro-ph/0412003 (ADASS XIV Proceedings)

