Going forward, I have plans for various features and experiments related to performance. Some of these are just ideas, others are works in progress, and all will hopefully pan out enough to warrant future posts. For now, I'd just like to highlight a few.
Grafting is Prefetching
Since this is my newest line of research and has already shown promise, I suspect there are additional gains waiting to be found. I may be able to do some profile-guided tuning of the skip thresholds. There is a lot to test, and this will certainly keep me up at night in the short term.
There's More Than Vision
This experiment led me to an additional realization: the vision code and its optimizations are finally fast enough that overhead matters pretty much everywhere else. In the early days of building a computer vision system, the vision is many orders of magnitude slower than anything else could, should, or would be. My early implementations struggled to reach 1fps at times.
During development, this manifested as a focus on optimizing those high-cost areas of the vision code. My other code isn't bad or slow, but it wasn't being squeezed for every last drop of performance. Considering how much of an effect skipping prefetch has, the nuts and bolts of the system are finally showing up in performance tests.
Almost everything in Atheon and Mobius is built around std::weak_ptr, so much so that I have forward declared everything for ease of use:
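To give a flavor of what such a header might contain, here is a hedged sketch. The names Cognition, Perception, Pointee, and WeakComparator come from this post; the Aliases template and the overall layout are illustrative guesses, not the actual 38-line file:

```cpp
#include <memory>

namespace Atheon {

// Forward declarations only; the pointer aliases below do not need
// complete types.
class Cognition;
class Perception;

// Entry point for cross-cutting functionality (e.g. a typed hash);
// the real class surely carries more than this.
class Pointee {
public:
    virtual ~Pointee() = default;
};

namespace ptrs {

// One bundle of aliases per pointed-to type. Swapping the smart
// pointer implementation project-wide means editing only this file.
template <typename T>
struct Aliases {
    using Shared = std::shared_ptr<T>;
    using Weak   = std::weak_ptr<T>;

    // Orders weak pointers by control-block identity, so map keys
    // stay stable even after the pointed-to object expires.
    struct WeakComparator {
        bool operator()(const Weak& a, const Weak& b) const {
            return a.owner_before(b);
        }
    };
};

struct Cognition  : Aliases<Atheon::Cognition>  {};
struct Perception : Aliases<Atheon::Perception> {};

} // namespace ptrs
} // namespace Atheon
```

Keying maps on owner_before (rather than the raw pointer value) is what makes expired weak pointers usable as stable map keys in the structures below.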
When I talk about perception trees, I just mean that every Perception contains a std::map< ptrs::Cognition::Weak, ptrs::Perception::Shared, ptrs::Cognition::WeakComparator >.
Similarly, the factory which deduplicates perceptions upon construction does so using a map for each type of perception: std::map< std::tuple<...constructor arg types...>, Atheon::ptrs::Cognition::Weak >.
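As a sketch of how such a weak-pointer-keyed deduplicating factory might work, here is a single-type version. The Perception constructor, the make() signature, and the PerceptionFactory name are my illustrative stand-ins, not the actual Atheon code:

```cpp
#include <map>
#include <memory>
#include <string>
#include <tuple>
#include <utility>

// Illustrative Perception; the real one lives in a perception tree.
class Perception {
public:
    Perception(std::string name, int threshold)
        : name_(std::move(name)), threshold_(threshold) {}
    const std::string& name() const { return name_; }
    int threshold() const { return threshold_; }
private:
    std::string name_;
    int threshold_;
};

// One such factory would exist per perception type, each with its own
// map keyed on that type's constructor arguments.
class PerceptionFactory {
public:
    std::shared_ptr<Perception> make(const std::string& name, int threshold) {
        const auto key = std::make_tuple(name, threshold);
        auto it = cache_.find(key);
        if (it != cache_.end()) {
            if (auto existing = it->second.lock())
                return existing;             // deduplicated: reuse the live object
        }
        auto fresh = std::make_shared<Perception>(name, threshold);
        cache_[key] = fresh;                 // stored weakly: no ownership taken
        return fresh;
    }
private:
    std::map<std::tuple<std::string, int>,
             std::weak_ptr<Perception>> cache_;
};
```

Because the map holds weak pointers, the factory never extends an object's lifetime; once every caller releases a perception, the next make() with the same arguments simply constructs a fresh one.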
Mobius even has a generic message dispatching system based around weak pointers. The observer-based system connects dozens of subsystems that comprise the bot, allowing components to send messages back-and-forth without any object ownership, the need to know a peer's type, object lifecycle related bugs, or type confusion bugs.
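A minimal sketch of this pattern, assuming a simple topic-string message and a Dispatcher/Subscriber pair of my own invention (the real Mobius types are certainly richer):

```cpp
#include <memory>
#include <string>
#include <vector>

struct Message { std::string topic; };

class Subscriber {
public:
    std::vector<std::string> received;
    void onMessage(const Message& m) { received.push_back(m.topic); }
};

// Subscribers are tracked weakly, so a destroyed component simply
// stops receiving messages: no dangling pointers, and no explicit
// unsubscribe call needed.
class Dispatcher {
public:
    void subscribe(const std::shared_ptr<Subscriber>& s) {
        subscribers_.push_back(s);
    }
    void publish(const Message& m) {
        // Deliver to live subscribers; prune the expired ones as we go.
        auto it = subscribers_.begin();
        while (it != subscribers_.end()) {
            if (auto live = it->lock()) {
                live->onMessage(m);
                ++it;
            } else {
                it = subscribers_.erase(it);   // peer was destroyed elsewhere
            }
        }
    }
private:
    std::vector<std::weak_ptr<Subscriber>> subscribers_;
};
```

The dispatcher owns nothing and knows nothing about its peers beyond the message interface, which is exactly the class of lifecycle and type-confusion bugs this design sidesteps.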
I have done essentially zero testing of different data structures (such as std::unordered_map), alternative smart pointer implementations, and so on. I did play with std::pmr::polymorphic_allocator at one point, but it made things slower.
The declarations in Snippet 1 demonstrate a simple design decision which will pay off as I start exploring this work. Because all of the pointer types I use are declared in that 38-line file, I can easily swap them out without changing the tens of thousands of lines that rely on them. Moreover, the Pointee class gives me an entry point to insert additional functionality, such as a typed hash to enable use of hash tables.
I'd like to expose grafting's pixel diffing layer to Mobius' Lua environment, enabling the use of customized diffing pipelines during grafting. This would allow scripts to gain additional speed by using fuzzy matching to ignore negligible changes where possible. Grafting must not be slowed down by the overhead of such a system, though, so to avoid the root of all evil I am letting grafting mature.
Profile All The Things
I really like the automatic balancing functionality that grafting uses, and I'd like to expand it to the whole suite of optimizations. I'll need to build a configuration layer which can toggle the features and tweak the thresholds, and tie it into a system which creates, probabilistically executes, and measures different configurations to find the most efficient combinations.
My intuition is that the complexity of such a system can grow drastically with the number of configurations, so I'll have to think this through in detail to avoid making spaghetti.
A new technique that I'm working on aims to selectively apply pipeline and perception parallelization based on the performance characteristics of the workload. To accomplish this, Mobius can wrap an IExecutor instance within a SelectiveExecutor. Rather than queuing work as soon as it is scheduled by the vision code, the selective executor will wait until join() is called. At that point, it will use profiling information to decide, probabilistically, whether it should parallelize the tasks or not.
When the selective executor decides not to parallelize something, the underlying ThreadedExecutor is passed along. This makes the thread pool available to something deeper in the pipeline, such as a parallelizable task or a nested collection of pipelines.
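A minimal sketch of the defer-until-join idea, with a fixed task-count threshold standing in for the probabilistic, profile-driven decision described above. The IExecutor interface shown here and the InlineExecutor stand-in (used instead of a real thread pool to keep the sketch deterministic) are my assumptions:

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

class IExecutor {
public:
    virtual ~IExecutor() = default;
    virtual void schedule(std::function<void()> task) = 0;
    virtual void join() = 0;
};

// Stand-in for ThreadedExecutor: runs tasks immediately on the
// calling thread so this sketch stays single-threaded.
class InlineExecutor : public IExecutor {
public:
    void schedule(std::function<void()> task) override { task(); }
    void join() override {}
};

class SelectiveExecutor : public IExecutor {
public:
    SelectiveExecutor(std::shared_ptr<IExecutor> underlying, std::size_t threshold)
        : underlying_(std::move(underlying)), threshold_(threshold) {}

    // Unlike a plain executor, nothing runs yet: the decision waits.
    void schedule(std::function<void()> task) override {
        pending_.push_back(std::move(task));
    }

    void join() override {
        parallelized_ = pending_.size() >= threshold_;
        if (parallelized_) {
            // Many small tasks: fan them out to the wrapped executor.
            for (auto& t : pending_) underlying_->schedule(std::move(t));
            underlying_->join();
        } else {
            // Few heavy tasks: run inline, leaving the underlying
            // executor free for parallelism nested inside each task.
            for (auto& t : pending_) t();
        }
        pending_.clear();
    }

    bool parallelized() const { return parallelized_; }

private:
    std::shared_ptr<IExecutor> underlying_;
    std::size_t threshold_;
    std::vector<std::function<void()>> pending_;
    bool parallelized_ = false;
};
```

The key property is that schedule() only records work; all information about the batch is available at join() time, which is what lets the real system consult task history before committing threads.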
Using a unique identifier for each task, a selective executor could track performance information for specific operations. Moreover, as work is scheduled and IExecutor instances are passed along, the parent/child relationship of executors can be used to reconstruct a call graph of the tasks. Combining this information with realtime evaluation of scheduled work would allow the selective executor to draw conclusions such as this one:
I currently have 2 tasks scheduled and have access to 12 threads. Historically, the first scheduled task schedules 10 additional tasks and the second schedules 8. I will run my two tasks inline and allow them to parallelize their work.
This same decision making process would probabilistically recurse as deep as the work schedule will allow, narrowing in on the correct depth at which to parallelize work.
I'm still quite early in the design and implementation of this system. Early tests show that the theory is solid, demonstrating that withholding parallelization for a task resembling the one described above can result in quicker turnaround.
As it stands, selective parallelization can't be applied to prefetched pipelines, since prefetch would force use of the thread pool for pipeline parallelization. In the future, I'd like to build a FragmentedExecutor to represent splits of a ThreadedExecutor into smaller pools which can be utilized at multiple levels in a pipeline. This fragmentation would be performed by a SelectiveExecutor, and would be guided by the same information. Here is the sort of conclusion the heuristic would reach when preparing to prefetch:
I currently have 36 pipelines to parallelize and 24 threads. I know that 6 of the pipelines are capable of their own parallelization, averaging 12 tasks each. I will use 16 threads to parallelize the other 30 pipelines. While those run, I will execute the remaining 6 pipelines inline and provide each with 8 threads for perception parallelization.
Note: this scenario describes a realistic workload of 30 basic detection pipelines and 6 text recognition pipelines with an average text length of 12.