Adding more CPU or GPU slows things down

thread memory gpu hardware software paradigm wct

Adding more computing units reduces computing throughput. I will try to describe what I mean by this true but paradoxical-sounding statement, some ways we can break it, and the new problems those solutions bring.

More is less #

For the longest time, the software workloads in our field (neutrino hep-ex but also others) enjoyed a trio of enabling features:

  1. perfect (embarrassing) job-level parallelism,
  2. ever faster CPU cores over time, and
  3. an adequate amount of RAM relative to need.

As is well known by now, those salad days are long over. Over the past decade, CPU cores have stopped getting faster while CPU core density has kept increasing. RAM has not kept up with core count, and our software has outgrown its nominal RAM allocations. Finally, much of our software is stuck in a serial processing paradigm, so we lack the flexibility to effectively utilize the growing multiplicity of cores.

Typically our jobs require 2-3x or more of the RAM nominally allocated per core by high-throughput computing (HTC) facilities. We have been forced into a tactic of over-allocating, and then wasting, CPU cores in order to gain their associated RAM.
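
To put purely illustrative numbers on this tactic: a job needing \(6\) GB of RAM at a facility granting \(2\) GB per core must request \(\lceil 6/2 \rceil = 3\) cores to obtain the RAM while actually running on only one of them, idling two cores' worth of allocation.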

This core-time waste is strongly frowned upon by HTC managers. If that were not enough, two other trends put this over-allocation tactic in jeopardy.

First, as we enter the era of DUNE we potentially face even larger RAM usage. If we continue to “scale” the code written for current “small” detectors such as MicroBooNE and ProtoDUNE-SP up to DUNE FDHD, we may expect a 25x increase in RAM usage. Of course, some tricks can and will be played within our current paradigm to limit this absurd growth, but I feel a paradigm shift is needed and, in part, I am struggling here to define what that may look like.

Second, we must seek slivers of time at high-performance computing (HPC) facilities: their funding is waxing and we expect to be forced to make our software run on their hardware instead of, or in addition to, the more familiar HTC. This takes us further from our salad days as HPC facilities tend to be designed with yet higher CPU/RAM ratios. HPC also brings in hardware “accelerators” such as GPUs (the reason for the scare quotes will become apparent). Our community is and will remain a relatively minor user of HPC, which gives those facilities little motivation to allow our CPU-wasting tactic.

Seeking HPC time also brings in a deep and interesting problem: how to effectively use the GPUs as well. The GPU/CPU ratio provided by a given HPC facility poses a problem very similar to that of the CPU/RAM ratio.

So, we as a community are stuck with an unusual problem: how to effectively use more CPU (but not more RAM). What can be done?

Loop-level multi-threading #

This means making use of small-scale, “loop level” multi-threaded (MT) parallelism. In the old CPU-bottleneck days, employing this tactic alone would be sociopathic to the job mix. An \(N\)-way parallel loop requires an allocation of \(N\) CPU cores. Once the loop happily completes, \(N\) times faster (at best), the rest of the job nominally executes in a serial manner. The need to allocate \(N\) cores per job, with no way to share them outside the MT code segment, leads to waste and frowning HTC managers.
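
As a minimal sketch of what loop-level MT looks like in code, using TBB (one of the threading systems named below; the per-element work is a placeholder):

```c++
// Minimal sketch of loop-level MT parallelism using tbb::parallel_for.
// The sqrt is a stand-in for whatever the real algorithm computes.
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <cmath>
#include <vector>

void process(std::vector<double>& samples)
{
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, samples.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i) {
                samples[i] = std::sqrt(samples[i]);  // placeholder work
            }
        });
}
```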

Of course, when following the waste-CPU-for-RAM tactic, we nominally waste \(N-1\) cores anyway. Under this tactic, loop-level algorithms are encouraged as a way to make use of the otherwise wasted cores. This helps to turn that frown the right way around.

We want to further utilize the wasted cores, especially as that waste will grow as the hardware CPU/RAM ratio increases. We do that with task-level parallelization, described in the next section. Once it is introduced, we allow more than one loop-level MT segment to run in parallel, and that can lead to a CPU overload condition. The significance of the resulting context-switching overhead must be profiled to ascertain whether it is a practical problem or not.

On the other hand, if all MT algorithms are implemented with the same threading system, or compatible systems, there can be a coherent way to share the same thread pool and avoid CPU overloading. This can be done with, for example, TBB, as sketched below. In practice, loop-level MT algorithms are developed by a variety of individuals pursuing a variety of implementations, such as those based on std::thread, Kokkos, OpenMP and TBB, and others may be expected.
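
For example (a sketch assuming oneTBB; the pool size and loop bodies are illustrative), a tbb::task_arena gives multiple loop-level algorithms one bounded pool to draw from:

```c++
// Sketch: two loop-level MT algorithms share one thread pool bounded by a
// tbb::task_arena, so stacking them cannot oversubscribe the allocated cores.
#include <tbb/task_arena.h>
#include <tbb/parallel_for.h>

int main()
{
    tbb::task_arena arena(4);  // pool sized to the job's core allocation

    arena.execute([] {
        tbb::parallel_for(0, 1000, [](int i) { /* algorithm A's loop body */ });
    });
    arena.execute([] {
        tbb::parallel_for(0, 1000, [](int i) { /* algorithm B's loop body */ });
    });
}
```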

Task-level multi-threading #

The overall processing pattern enacted by a job tends to break up into discrete computational units which exchange data. This structure is fractal (scale invariant) in nature. We described the loop level above, but that is an arbitrary distinction of level. Even there we speak of parallelizing “the inner” and “the outer” loop of some code context. As we go up in scale we reach, at some arbitrary point, what we call “task level” and above that we get back to our salad-day friend, “job level”. In fact, we may go down in scale from loop level and consider SIMD/vectorization. On CPU I will ignore that (important though it be) and will address it a bit for GPU.

The offline software for most of our experiments is built on The Framework, hallowed be its name. In all popular instances of The Framework we define a “task” which is rigidly associated with a unit of data which we name with maximal ambiguity as “the event”. In art the “task” is a “module” and in Gaudi it is an “algorithm”, both of which provide a single C++ base class interface for the developer to implement. Through this interface, a task is executed once per “event”. The Framework is thus formally (if still ambiguously) an event processing framework. This “event” unit is enforced by The Framework from its input, through all processing and on to final output. The choice of the scope of data that “the event” spans determines, and ties together, the memory usage and the units and structure of task scheduling. All concerned must agree on what is “the event”.
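
Distilled to a hypothetical C++ sketch (not the actual art or Gaudi API; both are much richer), the interface presented to the developer amounts to:

```c++
// Hypothetical distillation of The Framework's per-event task interface;
// art's "module" and Gaudi's "algorithm" both share this basic shape.
struct Event;  // "the event": the maximally ambiguous unit of data

struct IEventTask {
    virtual ~IEventTask() = default;
    virtual void execute(Event& event) = 0;  // called once per "event"
};
```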

Both frameworks named above are popular in the neutrino hep-ex sphere (and beyond) and both have made strides to add task-level parallelism. Gaudi has its swarm and art has its “path level” parallelism (granularity that of a pipeline of tasks). However, developers of the tasks still design their code under assumptions made by the monolithic “event” data model. The Framework must then forever maintain the unit of “the event” in order to honor these developer assumptions.

This focus on “the event” leads to potentially extreme absurdity. For example, much useful code has been developed for MicroBooNE with its single anode and for ProtoDUNE-SP with its “mere” six dual-faced anodes. Meanwhile, DUNE FDHD will have 150 in just one of its four modules. From a single DUNE module a nominal “event” will be 25x larger than that of ProtoDUNE-SP, which already stresses memory. More absurdly, special “events” will record possible supernova bursts over a vastly extended period of time. Such “events” will be 200x larger than nominal. Clearly something has to give.

The Wire-Cell Toolkit (WCT) was designed without the arbitrary, limiting definition of “the event”, which opens up a new but not so unfamiliar paradigm. It honors the fact that “task level” is scale invariant and encourages developers to design a code unit (WCT calls it a “component”) at a scale natural to the problem it solves. The design of a WCT “component” is still familiar from an art “module” or Gaudi “algorithm” in that it may be structured in a serial, single-threaded manner or may house loop-level parallelism. The difference is that a WCT component is not constrained to an interface tied to “the event” but may assume a smaller-grained (or indeed larger-grained) data scope.

Isolated components are of course useless, and WCT provides means to compose them into larger aggregates and to further compose those into yet larger ones. This composition may continue until the result reaches a scale appropriate for running on a given hardware resource. With existing code releases, WCT jobs may scale up to the point where they completely fill the confines of one physical computer. Experimental extensions in development for WCT allow a single “job” to span multiple computers.

These WCT task compositions are structured in the form of a data flow graph. Each task is a node in the graph. Nodes are connected via their ports by graph edges that transport, and sometimes buffer, data. A node may have input and/or output ports, and each port is identified on the node and well defined in terms of the data type it may pass.

WCT allows different strategies for executing the nodes in its data flow graph. Two exist now and more can be developed against WCT’s abstract graph-execution interface. A single-threaded engine optimizes for the conservation of RAM by minimizing the data “in flight” through the graph at any given time. A multi-threaded engine based on TBB flow_graph executes some number of nodes in parallel, up to a maximum given by the user.
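
A toy version of such a graph, written directly against TBB flow_graph (assuming oneTBB; the int payload stands in for real data products):

```c++
// Sketch: a source node feeding a processing stage and a sink, executed by
// the TBB flow_graph scheduler. Real WCT components pass richer data types.
#include <tbb/flow_graph.h>
#include <iostream>

int main()
{
    using namespace tbb::flow;
    graph g;

    int count = 0;
    input_node<int> source(g, [&](tbb::flow_control& fc) -> int {
        if (count >= 10) { fc.stop(); return 0; }
        return count++;       // stand-in for reading one unit of data
    });

    function_node<int, int> stage(g, serial, [](int x) {
        return x * x;         // stand-in for a processing algorithm
    });

    function_node<int, continue_msg> sink(g, serial, [](int x) {
        std::cout << x << "\n";  // stand-in for output
        return continue_msg{};
    });

    make_edge(source, stage);
    make_edge(stage, sink);
    source.activate();
    g.wait_for_all();
}
```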

As said, a node developer may assume a data context smaller than one “event”. For example, we may add \(N\) instances of a node implementation to a graph, one for each of \(N\) sub-detector units (e.g. one for each of the 150 APAs in DUNE FDHD). The graph further allows pipeline parallelism: for each sub-detector unit we may have multiple nodes in a line, each processing as a stage in a chain. Typical pipelines are composed of a few stages, and so a job can effectively utilize many hundreds of cores. Many sub-detector units provide obvious further sectioning. For example, many operations for a LArTPC with tomographic readout apply on a per-view basis, allowing another 3x increase in potential thread utilization.
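
To put purely illustrative numbers on that: \(150\) APAs \(\times\) \(3\) views \(\times\) \(3\) pipeline stages gives roughly \(1350\) nodes eligible to run concurrently, which motivates the kilo-core scale discussed next.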

Thus, for DUNE far detector modules, WCT provides the basis for effective use of kilo-core CPUs. 100-core CPUs exist now, and 3-4 Moore’s-law doubling times bring us to DUNE data-taking and to the 1k-core era. Running on today’s CPUs, a broad and pipelined flow graph can have more nodes than available cores. This is still an advantage as it means there is always work available whenever a core might otherwise go idle.

The flow graph paradigm solves another problem of large jobs, one related to I/O. The Framework tends to reflect the monolith of “the event” into monolithic and serial I/O. Loading “the event” is a single-threaded operation happening at “the start of the event” while wasting \(N-1\) threads; likewise the output at the end of the event. With compression necessarily being employed on the expected large volumes of data, these single-threaded I/O stages can be significant and in some cases dominate the run time of a job.

However, with WCT, input is “merely” a source-type node and output a sink. Multiple files may be read or written in parallel, each in a simple single-threaded context. Only the exact amount of data required at any given time by the node downstream of a source need be loaded. This may easily be 1%, or in some cases a minuscule fraction, of “the event”.
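
One way to realize this bounded “in flight” behavior with TBB flow_graph is a limiter_node (a sketch assuming oneTBB; the Chunk type and bodies are stand-ins):

```c++
// Sketch: a limiter_node caps how many chunks from a file source are in
// flight at once, so RAM use tracks downstream demand, not file size.
#include <tbb/flow_graph.h>
#include <vector>

using Chunk = std::vector<double>;  // stand-in for a fraction of an "event"

int main()
{
    using namespace tbb::flow;
    graph g;

    int nread = 0;
    input_node<Chunk> source(g, [&](tbb::flow_control& fc) -> Chunk {
        if (nread >= 100) { fc.stop(); return {}; }
        ++nread;
        return Chunk(1024);         // stand-in for reading one chunk
    });

    limiter_node<Chunk> throttle(g, 4);  // at most 4 chunks in flight

    function_node<Chunk, continue_msg> sink(g, serial, [](const Chunk&) {
        return continue_msg{};      // stand-in for consuming the chunk
    });

    make_edge(source, throttle);
    make_edge(throttle, sink);
    make_edge(sink, throttle.decrementer());  // finished chunk frees a slot
    source.activate();
    g.wait_for_all();
}
```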

As any given graph node does not care to which nodes it connects, any graph may be cut, with sinks added to cap off the output and sources added to feed the remainder from the saved files. In addition, any edge may be cut to insert a node which “taps” the data stream, saving or immediately consuming the data for some purpose while otherwise passing it along to its output port.

This flexibility solves another thorny application-level problem which often goes under the name “event mixing”. Here, we do not wish to mix whole events but portions of them. For example, we may wish to properly combine kinematics-level data from independent generators prior to feeding some tracking simulation. Or, we may wish to embed the readout of a simulated signal interaction into a readout of “background” taken from a real detector (or vice versa). WCT and the flow graph paradigm in general naturally support this kind of “mixing”. The push/pull nature of the graph execution means that the sources to be mixed will execute “just in time” to provide the next input needed. The “mixer” node becomes relatively trivial to implement, and configuring different “mixes” poses no extra burden.

GPU parallelism #

GPUs essentially represent the worst case of non-salad days:

  • many many many cores
  • all of them very slow
  • with very very limited RAM/core

However, the three “many”s can outweigh the three “very”s in that description for an algorithm that can be implemented in a highly data-parallel manner. In particular, FFT and AI/ML inference are important bottlenecks and both are greatly accelerated on a GPU relative to running on a single CPU core.

This advantage also leads to another “hardware ratio problem”. For GPUs we leave the world of HTC and seek time on HPC. Their managers frown on jobs that waste either CPU or GPU allocations. And as with our jobs on CPU, on GPU the bottleneck is not cores but (GPU) RAM. To utilize GPUs we are again judged on fully utilizing one resource (GPU cores) while being constrained by an unmetered resource (GPU RAM).

Our typical jobs require O(few GB) of GPU RAM. Some rely on the PyTorch kernel, which itself accounts for 0.5 GB, and thus we strive to require only one instance. These jobs must run on hardware that may provide anywhere from 1 GB on older workstations, to 4 GB on modern laptops, to 32 GB on the latest GPU cards provided by HPC facilities.

As GPU cores are not the bottleneck, a naive job will learn it has over-used the GPU only by receiving an out-of-memory (OOM) error in response to an allocation. When code bounces against this RAM limit, handling the OOM exception involves cleaning up current usage and trying again. This can lead to many repeated try/fail/clean cycles. At best this reduces throughput and at worst it may never terminate.

Thus, a proactive resource-limiting mechanism is required in the software. Given prior understanding of the GPU RAM requirements per unit of processing, we may implement (and in WCT have implemented) a semaphore pattern to limit the number of processes which may execute on the GPU at any given time.
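
A minimal sketch of such a semaphore guard in portable C++20 (this shows the general pattern, not WCT’s actual implementation; the capacity and the GPU call are assumptions):

```c++
// Sketch: a counting semaphore sized to how many concurrent GPU tasks fit
// in GPU RAM, acquired before launching work and released after.
#include <semaphore>

// Suppose profiling shows GPU RAM safely holds 2 concurrent tasks.
std::counting_semaphore<2> gpu_slots(2);

void run_on_gpu_guarded()
{
    gpu_slots.acquire();            // blocks this thread if the GPU is full
    // launch_gpu_work_and_wait();  // hypothetical GPU-side call
    gpu_slots.release();            // free the slot for a waiting thread
}
```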

This simple solution brings a new problem. When the \((N+1)\)-th thread wishes to execute work on a GPU which is already at its limit of \(N\), it must wait in some manner. A simple semaphore will simply block that thread. Thus, by accelerating the code with a GPU of limited memory, we slow down the job by making threads wait. This is so even when using the TBB flow_graph based engine, which otherwise could make use of the idle thread! The threads that hold the semaphore are also idle while they wait for the GPU to return. So while the GPU does accelerate some algorithms, it does so at the cost of directly blocking a thread and potentially blocking yet more.

A smarter semaphore may be implemented which is somehow aware of the larger thread pool, so that it may return a waiting thread for other use while the semaphore is occupied. Naively we may say, “just use a TBB semaphore”. This would be fine except for the minor fact that there is no such beast. Instead, TBB flow_graph provides a far more elegant solution, though one which brings a challenging if interesting host of yet newer problems.

With TBB flow_graph we may designate a maximum concurrency for each graph node. WCT currently sets this to unity for all nodes. Set larger, TBB becomes free to run one instance of a node concurrently on multiple threads. For a node to operate in this manner it must be implemented to be inherently thread safe, a requirement most of our developers cannot be expected to satisfy. But even if that is overcome, concurrent node execution in the flow graph will lead to out-of-order data. This causes a loss of synchronization in the graph which spoils various assumptions. It is, however, possible to construct a subgraph containing the concurrent nodes such that synchronicity is restored at its output, as sketched below.
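
A sketch of that restoration, using TBB’s sequencer_node to re-order the output of a concurrent stage (assuming oneTBB; the Datum type and its contents are illustrative):

```c++
// Sketch: a concurrent stage may finish out of order; a sequencer_node
// re-emits its output in sequence order, restoring graph synchronization.
#include <tbb/flow_graph.h>
#include <iostream>

struct Datum { std::size_t seq; double value; };

int main()
{
    using namespace tbb::flow;
    graph g;

    std::size_t n = 0;
    input_node<Datum> source(g, [&](tbb::flow_control& fc) -> Datum {
        if (n >= 8) { fc.stop(); return {}; }
        return Datum{n++, 0.0};
    });

    // Concurrent stage: body must be thread safe; may complete out of order.
    function_node<Datum, Datum> work(g, unlimited, [](Datum d) {
        d.value = d.seq * 2.0;  // placeholder for e.g. GPU-backed work
        return d;
    });

    // Restore order before downstream nodes that assume synchronization.
    sequencer_node<Datum> ordered(g, [](const Datum& d) { return d.seq; });

    function_node<Datum, continue_msg> sink(g, serial, [](const Datum& d) {
        std::cout << d.seq << " " << d.value << "\n";
        return continue_msg{};
    });

    make_edge(source, work);
    make_edge(work, ordered);
    make_edge(ordered, sink);
    source.activate();
    g.wait_for_all();
}
```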

Thus, making use of TBB flow_graph concurrency to enact a semaphore pattern protecting against overuse of GPU memory requires developing special-purpose nodes which isolate the new problems this solution brings. This is very doable, if challenging to develop, and will maximize both CPU and GPU utilization by leveraging the inherent load-balancing nature of parallel data flow graph execution.

Where are we? #

I’m still not sure where I wanted to go with this article or exactly what paradigm we should chase, and so I have no deep conclusions. For now, the main points I tried to make are the following.

  • We face a RAM problem not a CPU problem.

  • The problem is really due to respecting HTC/HPC accounting biases for CPU/GPU utilization while it is their RAM limitation that we bump against.

  • Application of multi-thread parallelism is scale invariant and we should design solutions that honor and enable this fact.

  • GPU brings its own problems, which are also largely RAM problems, but which can also lead to forced CPU idleness that we should and can avoid with some new development.

  • The Wire-Cell Toolkit provides a new if minor paradigm shift that solves many of the problems we face in these post-salad days and so far appears to provide a good basis as we move to utilize heterogeneous hardware and yet larger core/RAM ratios.