Unix Philosophy and the Wire-Cell Toolkit

Sat 11 Jun 2022

For many years I have understood the “Unix philosophy” of software to mean “do one thing and do it well”. Reading the Wikipedia entry on the Unix philosophy teaches me that the original paper gives three more points. I ponder these in relation to the Wire-Cell Toolkit.

The “do one thing” maxim is conceptually easy to grasp and is at the core of the Wire-Cell Toolkit (WCT). Excluding some low-level utilities, “everything” is accessed via an abstract “interface” base class. Each interface defines a small number of methods. A developer creates a “component class” which is a concrete implementation of one or more interfaces. User code, which could also be component code, can access an instance of a component via one of its interfaces. Given that, the developer of user code need only understand a small amount of semantic context to use the interface.

For example, the IConfigurable interface has two methods, default_configuration() and configure(). The component expects the first to be called, its return value potentially modified, and the result passed to the second call. It also expects these two calls to occur in the same thread. Whatever else may happen externally, with these simple rules assumed, the component developer is secure in coding what they need. Likewise, interface-using code is free to do whatever it wants as long as these simple rules are followed. These behavior rules may be likened to how Unix commands generally assume ample system memory and disk space, existence of input files, output directories, etc.
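The two-call protocol can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the Configuration type here is a plain string map standing in for the toolkit's real configuration object, and the Scaler component and configure_with() helper are hypothetical names invented for the example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for the toolkit's configuration object (an assumption of this sketch).
using Configuration = std::map<std::string, std::string>;

// Interface with the two methods named in the text.
class IConfigurable {
public:
    virtual ~IConfigurable() = default;
    // Called first: returns defaults that the caller may modify.
    virtual Configuration default_configuration() const = 0;
    // Called second: receives the (possibly modified) defaults.
    virtual void configure(const Configuration& cfg) = 0;
};

// Hypothetical component that multiplies its input by a configurable gain.
class Scaler : public IConfigurable {
public:
    Configuration default_configuration() const override {
        return Configuration{{"gain", "1.0"}};
    }
    void configure(const Configuration& cfg) override {
        // The component relies on the protocol: cfg started life as our defaults,
        // so the "gain" key is expected to be present.
        assert(cfg.count("gain") == 1);
        m_gain = std::stod(cfg.at("gain"));
    }
    double apply(double x) const { return m_gain * x; }

private:
    double m_gain{1.0};
};

// User code following the rules: fetch defaults, apply overrides, configure.
// Both calls happen back to back in the calling thread.
void configure_with(IConfigurable& component, const Configuration& overrides) {
    Configuration cfg = component.default_configuration();
    for (const auto& kv : overrides) {
        cfg[kv.first] = kv.second;
    }
    component.configure(cfg);
}
```

Note that configure_with() knows nothing about Scaler itself. It sees only the interface, which is all the semantic context it needs.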

The Unix philosophy also requires that the many “one things” can be composed into novel, compound “one things”. As a corollary it constrains the information exchanged between the “one things” to take minimal and standardized form.

Generalized, this composition is precisely a data flow graph and that is the primary (but not only) execution pattern followed by WCT applications. In Unix we generally make only linear pipelines, if we make any compounds at all. In some rare cases we may make moderately more complex graphs via tee or explicit use of higher numbered file descriptors. The problems that WCT tackles are inherently much more complex than those typically seen on the Unix command line and thus graphs become both broad (many pipelines) and deep (long pipelines). This motivates WCT to use a more general “graph first” configuration language which is rather different from the “node first” or at most “pipeline first” semantics that Unix shell languages encourage.

The maxim covering minimal and standardized form of information exchange addresses the nature of graph edges. In WCT we define an edge by a data interface abstract base class (IData). This provides the standardization. If one graph node port produces an IFrame, the connected port must accept it, and the receiving node knows precisely the form it is getting. The minimal criterion is less constrained. Here, developers of data interfaces must think carefully about how to factor the data structure concepts and anticipate not just immediate but future use. For sure, careful design of IData is a cusp. Get it right and the future is bright. Get it wrong and the pain will be felt for a long time. The uncharitable “keep it simple, stupid” slogan applies. In hindsight, there are existing cases where the slogan was violated, and this has led to ongoing problems. Yet, generally the intention of IData is fully consistent with the philosophy.
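To make the idea of a typed edge concrete, here is a minimal C++ sketch. It is not the real WCT code: the IFrame below is a deliberately cut-down stand-in with only two methods, and SimpleFrame, Doubler and run_chain() are hypothetical names used just for this illustration.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-in for a data interface (the real IFrame carries more).
class IFrame {
public:
    virtual ~IFrame() = default;
    virtual int ident() const = 0;
    virtual const std::vector<float>& samples() const = 0;
};
// Edges carry shared, read-only data.
using FramePtr = std::shared_ptr<const IFrame>;

// Hypothetical concrete frame.
class SimpleFrame : public IFrame {
public:
    SimpleFrame(int ident, std::vector<float> samples)
        : m_ident(ident), m_samples(std::move(samples)) {}
    int ident() const override { return m_ident; }
    const std::vector<float>& samples() const override { return m_samples; }

private:
    int m_ident;
    std::vector<float> m_samples;
};

// A node whose input and output ports both carry IFrame.
class IFrameFilter {
public:
    virtual ~IFrameFilter() = default;
    virtual FramePtr operator()(const FramePtr& in) const = 0;
};

// Hypothetical node: emits a new frame with every sample doubled.
class Doubler : public IFrameFilter {
public:
    FramePtr operator()(const FramePtr& in) const override {
        assert(in != nullptr);
        std::vector<float> out(in->samples());
        for (float& s : out) {
            s *= 2.0f;
        }
        return std::make_shared<SimpleFrame>(in->ident(), std::move(out));
    }
};

// Any sequence of frame filters composes because every edge has the same type.
FramePtr run_chain(const std::vector<const IFrameFilter*>& nodes, FramePtr frame) {
    for (const IFrameFilter* node : nodes) {
        frame = (*node)(frame);
    }
    return frame;
}
```

The input frame is never modified. Each node produces a new object, so a frame can safely fan out to several downstream nodes.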

The third maxim of the Unix philosophy embraces competition between alternative implementations. The standardization of data exchange formats is the “market” that allows this competition. One may take a compound graph and “snip” out a node or subgraph, replace it with a competitor and the result is the “same but different” job. If the replacement allows faster, more accurate, less resource intensive or otherwise better results, the replacement wins. Otherwise, we go back to the original, no harm, no foul. The WCT configuration language allows such A/B testing to be easily performed.

Competition at the microscopic, graph node level is encouraged by support for competition at the macroscopic, library level. The WCT plugin system allows developers to provide a shared library of WCT components in a manner of their choice, depending only on WCT’s core “interface” library. Developers who do not wish to invent their own project organization may produce WCT style packages easily, either by hand or by bootstrapping with the template-based code generator to make a Wire-Cell User Package (WCUP).

The third maxim also encourages discarding of “clumsy parts”. Coupling the parts through explicit interface classes simplifies doing just that. In addition, the WCT provides virtually all of the “batteries” needed to compose almost all jobs. Only a small number of niche components needed to connect WCT graphs to external software are kept outside of the WCT code base. This code centralization, sometimes called “monorepo”, allows WCT developers to make sweeping changes when needed without involving and disrupting WCT users.

A recent example was the addition of the IDFT interface and component implementations, which factor out discrete Fourier transform operations. Previously, DFT functions were hard-linked in the WCT util library. Moving them behind an interface now allows different IDFT implementations. Already, WCT has gained IDFT implementations based on FFTW3 and PyTorch (CPU or GPU) and soon will merge in a direct CUDA (GPU) implementation. The user with GPU resources can now accelerate every WCT component that uses DFTs with a simple configuration change rather than C++ development. However, in order to migrate from hard-linked to interface-based DFT, a lot of C++ code had to be rewritten. Since this code was all in the single WCT repository, the change was largely invisible to external user code that depends on WCT via its interfaces.
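The following C++ sketch shows the general shape of such a swap. It is an illustration only. The IDFT here is a hypothetical, cut-down interface with a single method, and the two implementations (a naive O(N^2) sum and a recursive radix-2 FFT) stand in for the FFTW3, PyTorch and CUDA back ends, which are not used. The make_dft() factory and the implementation names "naive" and "radix2" are likewise invented for the example.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

using complex_t = std::complex<float>;

// Hypothetical, cut-down DFT interface operating on C-style arrays.
class IDFT {
public:
    virtual ~IDFT() = default;
    // Forward 1D transform; in and out must not overlap.
    virtual void fwd1d(const complex_t* in, complex_t* out, int size) const = 0;
};

// exp(-2*pi*i*k/n)
inline complex_t twiddle(int k, int n) {
    const float pi = 3.14159265358979f;
    return std::polar(1.0f, -2.0f * pi * static_cast<float>(k) / static_cast<float>(n));
}

// Implementation A: direct evaluation of the DFT sum.
class NaiveDft : public IDFT {
public:
    void fwd1d(const complex_t* in, complex_t* out, int size) const override {
        for (int k = 0; k < size; ++k) {
            complex_t sum(0.0f, 0.0f);
            for (int j = 0; j < size; ++j) {
                sum += in[j] * twiddle((k * j) % size, size);
            }
            out[k] = sum;
        }
    }
};

// Implementation B: recursive radix-2 FFT, power-of-two sizes only.
class Radix2Fft : public IDFT {
public:
    void fwd1d(const complex_t* in, complex_t* out, int size) const override {
        assert(size > 0 && (size & (size - 1)) == 0);
        std::vector<complex_t> work(in, in + size);
        recurse(work);
        for (int k = 0; k < size; ++k) {
            out[k] = work[k];
        }
    }

private:
    static void recurse(std::vector<complex_t>& a) {
        const int n = static_cast<int>(a.size());
        if (n <= 1) return;
        std::vector<complex_t> even(n / 2), odd(n / 2);
        for (int i = 0; i < n / 2; ++i) {
            even[i] = a[2 * i];
            odd[i] = a[2 * i + 1];
        }
        recurse(even);
        recurse(odd);
        for (int k = 0; k < n / 2; ++k) {
            const complex_t t = twiddle(k, n) * odd[k];
            a[k] = even[k] + t;
            a[k + n / 2] = even[k] - t;
        }
    }
};

// The "configuration change": the implementation is chosen by name at run time.
std::unique_ptr<IDFT> make_dft(const std::string& name) {
    if (name == "naive") return std::unique_ptr<IDFT>(new NaiveDft);
    if (name == "radix2") return std::unique_ptr<IDFT>(new Radix2Fft);
    throw std::invalid_argument("unknown DFT implementation: " + name);
}

// User code sees only the interface, so it is untouched by the swap.
std::vector<complex_t> spectrum(const IDFT& dft, const std::vector<complex_t>& wave) {
    std::vector<complex_t> out(wave.size());
    dft.fwd1d(wave.data(), out.data(), static_cast<int>(wave.size()));
    return out;
}
```

Calling spectrum(*make_dft("naive"), wave) and spectrum(*make_dft("radix2"), wave) on the same input is the A/B test in miniature: same job, different implementation, and the results can be compared directly.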

The last maxim of the philosophy is about programmatic automation. Do not ask the human to do what software can. The WCUP code generator is one example, though not yet widely used given the monorepo nature of mainline WCT development. The factoring of functionality into components is another example. WCT encourages a developer not to rewrite something which a component provides.

The WCT aux sub-package and library provide high-level code which may use other components and which components may hard-link so that they need not all solve the same problems. For example, the IDFT interface types are simple C-style arrays. Especially for 2D, these are not convenient to use in code. Developers would rather use std::vector and Eigen3 arrays. Thus the aux package provides the DftTools family of functions that adapt these hard-compiled types to the more general IDFT.

Very recently, new developments related to the modeling and generation of noise have uncovered a new target for such factoring. A future post here or at the Wire-Cell News will likely cover it. In short, initial problems related to a particular type of noise were solved in one specific node implementation. Support for new types of noise began to be added and that led to attempts to yet again solve these problems in new, redundant code. To make for easier development by humans and more robust code, WCT is factoring this common noise code into shared tools.

I have no real conclusion to all this other than it satisfies my desire to express the parallels between the Unix philosophy and the WCT design. Until bumping into the linked Wikipedia page, I was not aware of the maxims beyond the first. Perhaps long-time use of Unix caused them to seep into my thinking. Or, perhaps, these maxims are just so obviously The Right Way To Do Things that they get honored without needing to be explicitly stated!