Hundreds of timers...


I love how this thread ended up in premature optimization land.



Frankly IMO it kinda started off there ;-)

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]


I thought the key overhead of system 'threads' was the locks and such involved in making a system-level call


You don't need locks just because you have threads; it depends on your data structures.
Similarly, you don't avoid the need for locks when you use fibers, unless you apply very strict rules to when you may call fiber-switching functions. Which means things like, "don't do I/O in fibers," which ends up being too restrictive for most cases.
The benefit of fibers is that re-scheduling to another thread/fiber is cheaper in userspace than in kernel space. That's it. If you are drowning in context switches, then fibers may help. But if that's the case, you're probably threading your application too finely anyway.

Mechanisms (data) for 'fibers' don't have to be on the stack


You are in a fiber. You make a function call. Where does the state of the current function get saved? Where do locals for the new function you call get allocated?
Separately, that other function you call, may call a yielding function. Where does the OS save the locals of each function on the call stack while it switches to another fiber?
Of course fibers need stack space. You may allocate that from the heap or other unallocated memory regions for each fiber, just like the OS allocates stack space for threads from unallocated memory space.
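
For the Windows fiber API specifically, a minimal sketch of what that stack allocation looks like in practice might be (the Win32 calls are real; everything else here is illustrative):

    #include <windows.h>
    #include <cstdio>

    void* g_main_fiber = nullptr;

    // Runs on its own stack, which the OS carved out of ordinary process
    // memory when the fiber was created.
    void CALLBACK worker_fiber(void* /*param*/) {
        int local = 42;                  // lives on this fiber's private stack
        std::printf("in worker, local=%d\n", local);
        SwitchToFiber(g_main_fiber);     // saves registers, swaps stack pointers
    }

    int main() {
        g_main_fiber = ConvertThreadToFiber(nullptr);
        void* worker = CreateFiber(64 * 1024 /* stack size */, worker_fiber, nullptr);
        SwitchToFiber(worker);           // resumes here when the worker yields back
        DeleteFiber(worker);
        return 0;
    }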

When a game object is a thread, it has to operate as a thread for everything it does (overkill for some uses), i.e., the stuff you mention.

So I had LOTS (tens of thousands) of objects of limited AI ability (hierarchical FSM control logic, some shared read-only mappings), each with its data (state) restricted to the base object's structure (heap), or in a few cases with linked blocks (from a free pool, etc.). IO was handled by another thread, with the relevant data injected locklessly into the object-processing thread (it's all sequenced and the data protected that way, so no locks...)

They would run in a lock-step simulation. I knew the objects could run sequentially (message signalling when inter-object events happened, with the reaction on the next 'turn'; all function calls completed before moving on to the next object, so no blocking; and all behavior logic scripted with closely enforced 'sleep' patterns), and thus one thread could handle them all (or a separate core's thread could take a second set if scaling was needed). So no locks were required here.

It's a specialized situation, and it was able to avoid a lot of the overhead of more generalized solutions (the performance requirements justified the optimization).

I also had fewer (hundreds of) higher-AI objects (planners, behavior logic orders of magnitude bigger, etc.), but these were done in a similar way (no per-object thread, no blocking, no stack-based state data, lockless as much as possible), again for the optimizations needed to achieve the size requirements of the game simulation. The logic WASN'T closely tied to IO activity (data from that was fully metered and controlled as part of the overall design).
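
A minimal sketch of the kind of lockless single-producer/single-consumer hand-off described above, assuming one IO thread pushing and one simulation thread popping (all names and the fixed capacity are illustrative, not from the poster's system):

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <optional>

    // One IO thread pushes, one simulation thread pops; no locks needed
    // because each index is written by exactly one thread.
    template <typename T, std::size_t N>
    class SpscQueue {
    public:
        bool push(const T& item) {                     // IO thread only
            std::size_t head = head_.load(std::memory_order_relaxed);
            std::size_t next = (head + 1) % N;
            if (next == tail_.load(std::memory_order_acquire))
                return false;                          // full; caller retries later
            buffer_[head] = item;
            head_.store(next, std::memory_order_release);
            return true;
        }

        std::optional<T> pop() {                       // sim thread only
            std::size_t tail = tail_.load(std::memory_order_relaxed);
            if (tail == head_.load(std::memory_order_acquire))
                return std::nullopt;                   // nothing injected this turn
            T item = buffer_[tail];
            tail_.store((tail + 1) % N, std::memory_order_release);
            return item;
        }

    private:
        std::array<T, N> buffer_{};
        std::atomic<std::size_t> head_{0};
        std::atomic<std::size_t> tail_{0};
    };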

-

Might the above be called 'flat' fibers?

Not that everything should (or can) be done this way, but depending on performance needs (and circumstances allowing it), this kind of optimization can make the difference in whether the program's goals are achievable.

--------------------------------------------Ratings are Opinion, not Fact

Might the above be called 'flat' fibers?


I don't understand the description of your system.

A fiber has a stack. Switching fibers involves saving the CPU registers to some memory, restoring other CPU registers for another fiber, changing the stack pointer to the new fiber, and finally jumping/returning to wherever that fiber was when it yielded.

It sounds a bit like you're simply describing having a lot of objects that get "poked" or "stepped" by a main thread, pulling state as needed. That's just object structure, not fibers.
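
For instance, that pattern is roughly this sketch (illustrative names, not anyone's actual code):

    #include <vector>

    struct GameObject {
        // Each object's state lives in the object itself (heap), not on a stack.
        int state = 0;
        void step() { /* run one FSM update, then return */ }
    };

    void run_turn(std::vector<GameObject>& objects) {
        // One thread pokes every object in sequence; each step() runs to
        // completion, so no locks and no saved call stacks are needed.
        for (auto& obj : objects)
            obj.step();
    }
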
enum Bool { True, False, FileNotFound };

Might the above be called 'flat' fibers?


I don't understand the description of your system.

A fiber has a stack. Switching fibers involves saving the CPU registers to some memory, restoring other CPU registers for another fiber, changing the stack pointer to the new fiber, and finally jumping/returning to wherever that fiber was when it yielded.

It sounds a bit like you're simply describing having a lot of objects that get "poked" or "stepped" by a main thread, pulling state as needed. That's just object structure, not fibers.

'Flat' would imply doing very little in its context switch.

Well, it's not a system-level or language-level construct (though the higher-level 'script' is macro-overlay-implemented and compiled to native code).

It's implemented in a library independent of the objects' code.

Whatever all that might be called (not green threads, as it's not a VM).

'Tasklet' would seem to be something lower-level, called from the objects.

Stackless (no register caching either).

Non-concurrent, non-preemptive, lockless, soft-blocking (resuming from a 'yield' is handled differently).

Only a 'coroutine' insofar as they all run under the same thread and process.

Soft-scheduled (by the main process).

I don't know about 'pulling state', as they are self-contained/independent (fibers 'pull state' too, don't they, one way or another?).

Anyway, a lot of overhead chopped out for a specific use: driving object behavior.

--------------------------------------------Ratings are Opinion, not Fact

Coroutines would be a better name: they're the general concept of cooperatively interleaved calls, whereas fibers are a specific mechanism for implementing coroutines.

premature optimization


"Selecting the right algorithm" is not premature optimization; it's designing for performance.
Performance is a feature. You have to plan and build for it, at the high level.
The famous quote about premature optimization was in the context of worrying about things like whether a particular if statement would emit one extra instruction or not, and dropping to assembly to avoid it.
Hoare and Knuth would ABSOLUTELY tell you to choose the right algorithm for the particular problem you're trying to solve.

Finally: "add the element at the end, then sort the new array" is also bad advice, unless your sorting algorithm has explicit optimizations for sorting already-mostly-sorted containers.
Curiously, QuickSort could go quadratic in this case; meanwhile, most BubbleSort implementations may actually perform the expected O(N) work.
If you use a sorted array, you should expect to use a binary search (or a linear scan from the back) to find the location, and then do O(N) work to insert into the vector (moving the rest of the elements out of the way).
If your array is large (thousands of elements, as suggested above,) then touching the N-sized amount of elements is bad, because it will evict a bunch of other data from L1 cache.
std::map<> will work much better in this case, as will std::priority_queue<>. The priority queue is the well-defined data structure intended exactly for this use case, so you should use it!
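
For reference, the sorted-array insert described above would look roughly like this (an illustrative sketch):

    #include <algorithm>
    #include <vector>

    // O(log N) to find the slot, O(N) to shuffle elements out of the way --
    // and that shuffle is what touches N elements and churns the cache.
    void insert_sorted(std::vector<int>& times, int t) {
        auto pos = std::upper_bound(times.begin(), times.end(), t);
        times.insert(pos, t);
    }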

Put another way: Computers, when treated right, are really fast.
Every time you have to wait for a laggy computer, it's probably caused by someone not understanding the difference between "premature optimization" and "designing for performance."
enum Bool { True, False, FileNotFound };

Not so much premature optimization as 'design considerations' (measure twice, cut once): investigating the problem early might save A LOT of redesign/retrofitting later.

I suggested some kind of quantized bucket chain, which might lessen processing if there is sufficient foreknowledge of the timers' intervals/patterns (i.e., do many timers run for many seconds or longer?).

You quantize by when the future timers come due: coarse (long-delay) timers go into the array of buckets (just a linked-list top insert, unsorted within each interval bucket). When a bucket comes current (now a short time before its timers are due), THEN its contents get fully sorted/merged into the immediately-due list.

That at least minimizes the size of the data set constantly being sorted, inserted into, and pulled from the priority queue. When there are thousands of timers it might pay off more than a little, IF the ratio of long delays to short delays is high enough.

The bilevel system obviously adds complexity and has to be customized (versus just using some standard priority-queue object).

Foreknowledge of the frequency/overhead of timer turnover would decide whether it's worth implementing and profiling.
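
A rough sketch of that bilevel scheme, with illustrative names and a made-up bucket granularity:

    #include <cstdint>
    #include <list>
    #include <map>
    #include <queue>
    #include <vector>

    using Time = std::uint64_t;
    struct Timer { Time due; /* plus whatever action/payload */ };
    struct Later {
        bool operator()(const Timer& a, const Timer& b) const { return a.due > b.due; }
    };

    class BucketedTimers {
    public:
        explicit BucketedTimers(Time bucketWidth) : width_(bucketWidth) {}

        void add(const Timer& t, Time now) {
            Time bucket = t.due / width_;
            if (bucket <= now / width_ + 1)
                near_.push(t);                 // due soon: straight into the real queue
            else
                far_[bucket].push_front(t);    // due later: O(1) list top insert, unsorted
        }

        // Called each tick: promote any bucket that has come current, then fire.
        void poll(Time now) {
            while (!far_.empty() && far_.begin()->first <= now / width_ + 1) {
                for (const Timer& t : far_.begin()->second)
                    near_.push(t);             // sort/merge the bucket into the due list
                far_.erase(far_.begin());
            }
            while (!near_.empty() && near_.top().due <= now) {
                // ... fire near_.top() here ...
                near_.pop();
            }
        }

    private:
        Time width_;                           // coarse quantization interval
        std::priority_queue<Timer, std::vector<Timer>, Later> near_;
        std::map<Time, std::list<Timer>> far_; // bucket index -> unsorted timers
    };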

--------------------------------------------Ratings are Opinion, not Fact

premature optimization


"Selecting the right algorithm" is not premature optimization; it's designing for performance.
Performance is a feature. You have to plan and build for it, at the high level.
The famous quote about premature optimization was in the context of worrying about things like whether a particular if statement would emit one extra instruction or not, and dropping to assembly to avoid it.
Hoare and Knuth would ABSOLUTELY tell you to choose the right algorithm for the particular problem you're trying to solve.


What is 'the right algorithm' isn't just about hypothetical performance at the asymptote - it's about:

- how quickly the code can be implemented
- how transparent and maintainable it is
- how easy it would be to swap out if it turns out to be a problem
- how often it'll be called
- how much data it needs to handle
etc.

As I noted, I've seen the naïve O(N*M) algorithm in shipping code and it was never a problem, because the values of N and M are relatively small (even on an MMO!) and are constrained by other aspects of the system. But if this ever did become a problem and showed up on a profiler, it's trivial to swap out, because the interface is so small (add, poll, remove). So why worry ahead of time?
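
For reference, the whole naïve version behind that three-call interface fits in a few lines (a sketch, not the shipping code mentioned above):

    #include <cstddef>
    #include <vector>

    struct Timer { long long due; int id; };

    // Unsorted vector; add/poll/remove is the whole interface, so swapping in
    // a priority queue later only means rewriting these three functions.
    struct NaiveTimers {
        std::vector<Timer> timers;

        void add(const Timer& t) { timers.push_back(t); }

        void remove(int id) {
            for (std::size_t i = 0; i < timers.size(); ++i)
                if (timers[i].id == id) { timers.erase(timers.begin() + i); return; }
        }

        // O(N) scan per poll -- fine while N stays small.
        template <typename F>
        void poll(long long now, F fire) {
            for (std::size_t i = 0; i < timers.size(); )
                if (timers[i].due <= now) { fire(timers[i]); timers.erase(timers.begin() + i); }
                else ++i;
        }
    };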

Finally: "add the element at the end, then sort the new array" is also bad advice, unless your sorting algorithm has explicit optimizations for sorting already-mostly-sorted containers.


Any insertion sort will handle it pretty well. Even std::sort is normally good, because most implementations use median-of-three or median-of-nine pivot selection, so they're near-optimal on already-sorted data. Again though, this is a narrow interface, so try the simple thing first, and if it doesn't scale, swap it out. You'll only have lost about an hour or two's work.
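
That simple thing is literally just this (illustrative):

    #include <algorithm>
    #include <vector>

    // Append, then re-sort; on already-mostly-sorted data most std::sort
    // implementations do close to O(N) work thanks to their pivot selection.
    void add_timer(std::vector<long long>& due_times, long long t) {
        due_times.push_back(t);
        std::sort(due_times.begin(), due_times.end());
    }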

Every time you have to wait for a laggy computer, it's probably caused by someone not understanding the difference between "premature optimization" and "designing for performance."

And every time your computer crashes, it's often caused by someone who chose a higher-performing and harder-to-maintain system over a safer and adequately-performing one (e.g. using C or C++ when not needed).

this is a narrow interface, so try the simple thing first, and if it doesn't scale, swap it out


That's exactly the kind of thinking that leads to slow performance everywhere, where the profiler doesn't show you any one obvious place to improve.
An experienced programmer will pick the right option up front, and thus doesn't need any re-work later.
std::priority_queue<> is pretty easy to use; std::map<> is trivial to use. Both will perform very well. Pick one of those, rather than something obviously bad, if you already know the volume of data.

Here's the declaration and usage for priority_queue<> (my_time_type and my_action_type stand in for your own types):


    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    typedef std::pair<my_time_type, my_action_type *> queue_item;

    // std::greater compares the pair's time first, making this a min-heap:
    // top() is always the earliest-expiring timer.
    std::priority_queue<queue_item, std::vector<queue_item>, std::greater<queue_item> > my_timer_queue;

    // inserting in queue
    my_timer_queue.push(queue_item(time, action));

    // checking which items should run
    while (!my_timer_queue.empty() && my_timer_queue.top().first <= current_time) {
        auto ptr = my_timer_queue.top().second;
        my_timer_queue.pop();
        ptr->timer_expired();
    }
This will manage a heap in a linear container (vector), which additionally has the potential cache benefits suggested above.
It will use O(log N) insertion and removal times, and finding the next time to run (the time of the earliest-expiring timer) is constant time.
This is very likely as close to optimal as you can get, and it's very simple to use.

Some other languages do not have well-behaved heap data structures, though; there you may have to sort an array yourself.
Make sure to test it with large numbers so you can tell whether the sort function you're using is pessimal or not.

every time your computer crashes, it's often caused by someone who chose a higher-performing and harder to maintain system


You're saying that choosing a well-performing algorithm will cause crashes, which on the face of it seems like an absurd statement.

I'd view that as someone biting off more than they can chew. Which is also very, very common -- although schedule pressure is often as much a reason for that as programmer competence.

That's a very different discussion, though :-)
enum Bool { True, False, FileNotFound };

Wow! I was on vacation, came back some weeks ago, and forgot about this topic!

I like to see how it evolved from managing timers to (prematurely) optimizing and (precociously) choosing the right algorithm.

Well, I've come to the (provisional) conclusion that I don't need timers for a lot of the actions I previously thought about:

- There's no need for a timer to rearm a shoot action after some delay: just store the time "now + delay" when you shoot, and test for "now > stored time" the next time you try to shoot. That's true for all timed actions, and for the expiration of any buff / DoT / and so on (see the sketch after this list).

- Environment events are way rarer and fit comfortably in a priority queue.

- And so there's no need for a dedicated thread to run this stuff: that's good news!
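
For instance, the rearm check in the first point reduces to a stored-timestamp comparison; a minimal sketch with made-up names:

    #include <chrono>

    using Clock = std::chrono::steady_clock;

    struct Weapon {
        Clock::time_point ready_at = Clock::now();

        bool try_shoot(std::chrono::milliseconds cooldown) {
            auto now = Clock::now();
            if (now < ready_at) return false;   // still rearming; no timer needed
            ready_at = now + cooldown;          // just store when we're ready again
            // ... actually shoot ...
            return true;
        }
    };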

Thank you all for your input!

