Resource Guide on workload distribution or "How to handle heavy splittable tasks"

Discussion in 'Spigot Plugin Development' started by 7smile7, Dec 21, 2019.

  1. This guide is meant for everyone with intermediate/advanced experience with Java and Spigot.

    Ever wanted to fill in a huge cuboid of blocks without using FAWE, or create a task that
    should run as often as possible within a limited slice of CPU time?
    Then this thread might interest you.

    I want to share my experience with workload distribution with you guys. I'm also looking for some feedback on my methods in order to improve them and maybe learn a bit. :D

    But let's get right into it.

    1) Foreknowledge

    1.1 The Deque

    A Deque is a linear collection that supports insertion and removal at both ends.

    (diagram: a deque with insertion and removal at both ends)

    Using deques as queues results in FIFO (first in, first out) behavior: elements are added at the end of the deque and removed from the beginning.

    Common methods:
    poll() -> retrieves and removes the first element of the queue (or returns null if the queue is empty)
    peek() -> retrieves the first element but does not remove it from the queue
    add(e) -> appends an element to the end of the queue
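A quick illustration of these methods in Java, using ArrayDeque (the implementation used later in this guide):

```java
// Demonstrates FIFO behavior when an ArrayDeque is used as a queue.
import java.util.ArrayDeque;
import java.util.Deque;

class DequeDemo {
    static String demo() {
        Deque<String> queue = new ArrayDeque<>();
        queue.add("first");           // appended to the end
        queue.add("second");
        String head = queue.peek();   // "first" - still in the queue
        String polled = queue.poll(); // "first" - now removed
        return head + "," + polled + "," + queue.peek(); // next head is "second"
    }
}
```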

    This collection will later be used in 2 ways.
    - Linear processing
    - Looped processing

    1.2 Minecraft Ticks

    Nearly all video games (including Minecraft) are driven by one big program loop. Just as every gear in a clock is synchronized with the pendulum, every task involved in advancing a game simulation is synchronized with the game loop. Appropriately, one cycle of the game loop is called a tick.

    Minecraft schedules 20 ticks per second (TPS), which leaves 50 ms of processing time per tick.
    (diagram: the 20 TPS game loop, 50 ms per tick)


    1.3 Tickables

    If you want to call methods every tick you will have to maintain some sort of collection to iterate over every tick.

    It is tempting to just create a concrete collection with the explicit element type you are planning to use,
    like a Set<SteamMachine>.

    This, however, would result in a mess where you have to maintain a lot of collections with specific use cases.
    Minecraft solves this with an interface, 'ITickable', that is implemented by every game object that should be called every tick.

    Now if we have some instance of 'SteamMachine' that implements 'Tickable' and another 'Drill' that implements 'Tickable', we can just throw them both into a Collection<Tickable> and call them every tick.
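A minimal sketch of this pattern. SteamMachine and Drill are just placeholder machines; in a plugin, tickAll() would be driven by a repeating scheduler task:

```java
// Minimal Tickable pattern: one interface, one shared collection, one loop.
import java.util.ArrayList;
import java.util.List;

interface Tickable {
    void tick();
}

class SteamMachine implements Tickable {
    int ticksRun = 0;

    @Override
    public void tick() { ticksRun++; } // machine logic would go here
}

class Drill implements Tickable {
    int ticksRun = 0;

    @Override
    public void tick() { ticksRun++; } // drill logic would go here
}

class TickLoop {
    private final List<Tickable> tickables = new ArrayList<>();

    void register(Tickable tickable) { tickables.add(tickable); }

    // Called once per game tick, e.g. from a BukkitRunnable with period 1.
    void tickAll() {
        for (Tickable tickable : tickables) {
            tickable.tick();
        }
    }
}
```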


    2) Processing with CPU time limitation

    2.1 Linear processing

    Problem description
    We have two corners, represented by blocks A and B, and want to fill in the area between them
    with a certain Material.
    The volume grows cubically with the side length (and so does the number of blocks inside the cuboid), which will sooner or later result in massive overhead when filling the area.

    Implementation
    Let's first take a look at some code for this.
    We just iterate over delta x, y, z and fill the locations with a certain Material.
    I purposely used a slow method (Block#setType(Material)) for this.
    (screenshot: naive triple-loop fill using Block#setType)
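Roughly, the naive version is the following triple loop. The Bukkit call (world.getBlockAt(x, y, z).setType(material)) is replaced by a callback here so the sketch stays self-contained:

```java
// Naive cuboid fill: touches every block between the two corners in one go,
// all inside a single tick - this is the source of the lag spikes.
@FunctionalInterface
interface BlockSetter {
    void set(int x, int y, int z); // stands in for Block#setType(Material)
}

class NaiveFill {
    // Fills the whole cuboid between (x1,y1,z1) and (x2,y2,z2), inclusive.
    static long fill(int x1, int y1, int z1, int x2, int y2, int z2, BlockSetter setter) {
        long placed = 0;
        int minX = Math.min(x1, x2), maxX = Math.max(x1, x2);
        int minY = Math.min(y1, y2), maxY = Math.max(y1, y2);
        int minZ = Math.min(z1, z2), maxZ = Math.max(z1, z2);
        for (int x = minX; x <= maxX; x++) {
            for (int y = minY; y <= maxY; y++) {
                for (int z = minZ; z <= maxZ; z++) {
                    setter.set(x, y, z); // the slow Bukkit call in the real plugin
                    placed++;
                }
            }
        }
        return placed;
    }
}
```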



    This results in lag spikes with larger cuboids.


    Solution
    At first we are going to provide some abstraction for our workload.
    (screenshot: the Workload interface)

    Then some implementation
    (screenshot: a Workload implementation that places a single block)

    One could arguably just store the World object. This, however, could cause problems when the world
    gets unloaded. This class can also be serialized pretty easily in order to create persistent workloads.
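A sketch of what the interface and one implementation could look like. The class and field names are assumptions; the block is identified by world name and coordinates, as described above, and the actual Bukkit call is left as a comment:

```java
// One unit of work: do one small thing per compute() call.
interface Workload {
    void compute();
}

// Hypothetical implementation: places a single block, identified by world
// name and coordinates so it stays valid (and serializable) even if the
// World object is unloaded in the meantime.
class PlacedBlock implements Workload {
    final String worldName;
    final int x, y, z;
    final String materialName; // stands in for org.bukkit.Material

    PlacedBlock(String worldName, int x, int y, int z, String materialName) {
        this.worldName = worldName;
        this.x = x;
        this.y = y;
        this.z = z;
        this.materialName = materialName;
    }

    @Override
    public void compute() {
        // In the plugin this would be something like:
        // Bukkit.getWorld(worldName).getBlockAt(x, y, z).setType(material);
    }
}
```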



    And at last our runnable.
    (screenshot: the time-limited runnable)
    The variable MAX_MS_PER_TICK defines how many milliseconds your task is allowed
    to last every tick.
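A sketch of such a runnable, assuming the Workload interface from above (System.nanoTime is used here for the time check):

```java
import java.util.ArrayDeque;
import java.util.Deque;

interface Workload {
    void compute();
}

// In the plugin, run() would be the body of a BukkitRunnable scheduled
// with a period of 1 tick.
class WorkloadThread implements Runnable {

    // How many milliseconds of the 50 ms tick this task may consume.
    private static final double MAX_MS_PER_TICK = 2.5;
    private static final long MAX_NANOS_PER_TICK = (long) (MAX_MS_PER_TICK * 1_000_000);

    private final Deque<Workload> workloads = new ArrayDeque<>();

    public void addWorkload(Workload workload) {
        workloads.add(workload);
    }

    @Override
    public void run() {
        long stopTime = System.nanoTime() + MAX_NANOS_PER_TICK;
        // Keep polling work until the budget for this tick is used up.
        while (!workloads.isEmpty() && System.nanoTime() <= stopTime) {
            workloads.poll().compute();
        }
    }
}
```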

    Note that you have no control over the start time of your task.
    It is possible that the tick has already taken 47 ms; then the runnable still produces overhead.
    Paper has a ServerTickStartEvent with which you can track exactly when the tick started.
    As far as I know, Spigot does not provide any way to know when the tick started.

    This is the result, where I regularly check the TPS.
    There are some client-side lags because of light updates.
    (As stated, Block#setType is pretty bad for this case.)

    Again: this is super slow because the paste implementation is intentionally naive in order to show
    the difference in performance. Also, 5 ms is only 10% of a tick.


    2.2 Looped processing

    Problem description
    We want to constantly spawn particles at a number of locations.

    Implementation
    We once again create a class that implements Workload.
    (screenshot: ParticleSpawner implementing Workload)
    However, if we now added an instance of ParticleSpawner to our WorkloadThread, it would just
    get discarded after spawning a particle once.

    Solution
    In order to decide whether the Workload should be rescheduled, we have to provide some more logic in our
    Workload interface.
    (screenshot: the Workload interface with a default reschedule method)
    The method is declared as default so Workload keeps its status as a single-method interface, which
    qualifies it for use in lambdas.
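A sketch of the extended interface. The exact method name in the screenshot isn't visible here, so reschedule() is used as a stand-in:

```java
// Workload extended with a reschedule check. Because the new method has a
// default implementation, the interface still has a single abstract method
// and remains usable as a lambda.
interface Workload {
    void compute();

    default boolean reschedule() {
        return false; // one-shot by default
    }
}
```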

    Now we just need to change our WorkloadThread so it reschedules the right workloads.
    Note that it is perfectly fine to use different collections in order to make sure some specific tasks
    have different time limitations; I will show an example in 2.3.
    We now have to keep track of the first element polled so we don't create an infinite loop.

    Edit: We need to cache the first element that is actually rescheduled!
    (screenshot: the updated WorkloadThread caching the first rescheduled element)

    (screenshot: the rescheduling loop)
    Note how the FIFO nature of our ArrayDeque helps put the workload back at the end
    after computing is done.
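Putting the pieces together, the rescheduling loop could look like this sketch. The name reschedule() and the exact stop condition are assumptions based on the description above:

```java
import java.util.ArrayDeque;
import java.util.Deque;

interface Workload {
    void compute();

    default boolean reschedule() {
        return false;
    }
}

// Time-limited loop that puts repeating workloads back at the end of the
// deque. To avoid visiting the same element twice in one tick, we cache the
// first workload that was actually rescheduled and stop when we meet it again.
class WorkloadThread implements Runnable {
    private static final long MAX_NANOS_PER_TICK = 2_500_000; // 2.5 ms

    private final Deque<Workload> workloads = new ArrayDeque<>();

    public void addWorkload(Workload workload) {
        workloads.add(workload);
    }

    @Override
    public void run() {
        long stopTime = System.nanoTime() + MAX_NANOS_PER_TICK;
        Workload firstRescheduled = null;

        while (!workloads.isEmpty() && System.nanoTime() <= stopTime
                && workloads.peek() != firstRescheduled) {
            Workload workload = workloads.poll();
            workload.compute();
            if (workload.reschedule()) {
                if (firstRescheduled == null) {
                    firstRescheduled = workload; // cache the first rescheduled one
                }
                workloads.add(workload); // FIFO: back to the end of the deque
            }
        }
    }
}

// Example repeating workload: stays in the queue forever.
class RepeatingWorkload implements Workload {
    int runs = 0;

    @Override
    public void compute() {
        runs++;
    }

    @Override
    public boolean reschedule() {
        return true;
    }
}
```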

    Also our ParticleSpawner needs to be adjusted.
    In order to always reschedule we can just override the method and return true.
    (screenshot: ParticleSpawner overriding the reschedule method to return true)


    Independent of the execution condition (as shown in the next spoiler), we can also go ahead and write an abstract class that gets rescheduled until a certain condition is met (e.g. a schematic paste is done, a certain mob dies, etc.)

    like this:
    (screenshot: abstract workload that reschedules itself until a condition is met)

    One use case is a Workload that runs a fixed number of times:
    https://i.imgur.com/V8m7i9b.png
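A sketch of such a fixed-run workload (the field and method names are assumptions; the linked screenshot may differ):

```java
interface Workload {
    void compute();

    default boolean reschedule() {
        return false;
    }
}

// Reschedules itself until it has run the requested number of times.
abstract class FixedRunWorkload implements Workload {
    private final int totalRuns;
    private int timesRun = 0;

    FixedRunWorkload(int totalRuns) {
        this.totalRuns = totalRuns;
    }

    @Override
    public final void compute() {
        timesRun++;
        run();
    }

    // Subclasses implement the actual work here.
    protected abstract void run();

    @Override
    public boolean reschedule() {
        return timesRun < totalRuns;
    }
}
```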

    If you only want to run the workload every nth tick, you can either increment an integer yourself inside the compute() method, or you could think one step further and create a Workload that only gets executed if a certain condition is met.
    https://i.imgur.com/gnik5Ld.png

    We then have to adjust our WorkloadThread
    https://i.imgur.com/Jf5tQAk.png

    With this we can create an abstract conditional Workload
    https://i.imgur.com/qWh1Fa3.png

    And at last a workload that only runs every n ticks:
    https://i.imgur.com/CYDaMta.png
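The conditional chain above could be sketched like this. shouldExecute() and the class name are assumed stand-ins for whatever the screenshots use; the WorkloadThread would call shouldExecute() once per pass and only then call compute():

```java
interface Workload {
    void compute();

    default boolean reschedule() {
        return false;
    }

    // The WorkloadThread only calls compute() when this returns true.
    default boolean shouldExecute() {
        return true;
    }
}

// Stays in the queue permanently, but only actually runs every nth pass.
abstract class EveryNthTickWorkload implements Workload {
    private final int interval;
    private int tickCounter = 0;

    EveryNthTickWorkload(int interval) {
        this.interval = interval;
    }

    @Override
    public boolean shouldExecute() {
        tickCounter++;
        if (tickCounter >= interval) {
            tickCounter = 0;
            return true;
        }
        return false;
    }

    @Override
    public boolean reschedule() {
        return true; // keep it in the loop forever
    }
}
```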

    Here is the result:
    A particle cube that can be as big as you want without any TPS loss.


    2.3 Creating independent loops

    We now have the ability to distribute our workload between ticks and schedule repeated tasks on a collection with a strict CPU time limitation.

    But in most cases you want to give every task its own CPU time.
    Like 0.5ms for player checks, 2.75ms for block changes, 3.245ms for particle spawning etc.

    For this case we just need to create a new class that limits itself while maintaining its own collection.
    It's pretty much our old WorkloadThread class.
    (screenshot: the WorkloadDistributor class)


    We now only schedule the WorkloadDistributor every tick.
    This step is optional; you can always just run your WorkloadThreads via the BukkitScheduler.
    Your own implementation has the advantage that you have full control over the order in which the threads are run. You can let your custom WorkloadThread implement Comparable and sort them into a specific order.
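A sketch of the distributor idea under those assumptions: each sub-loop owns its own deque and its own per-tick time budget, and the distributor runs them in order once per tick. Runnable is used for brevity in place of the Workload interface, and the class names are guesses:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Each sub-loop owns its own work deque and its own per-tick time budget.
class TimedLoop implements Runnable {
    private final long maxNanosPerTick;
    private final Deque<Runnable> workloads = new ArrayDeque<>();

    TimedLoop(double maxMillisPerTick) {
        this.maxNanosPerTick = (long) (maxMillisPerTick * 1_000_000);
    }

    void addWorkload(Runnable workload) {
        workloads.add(workload);
    }

    @Override
    public void run() {
        long stopTime = System.nanoTime() + maxNanosPerTick;
        while (!workloads.isEmpty() && System.nanoTime() <= stopTime) {
            workloads.poll().run();
        }
    }
}

// The distributor is the only thing scheduled every tick; it runs the
// sub-loops in insertion order (sorting them would be an easy extension).
class WorkloadDistributor implements Runnable {
    private final List<TimedLoop> loops = new ArrayList<>();

    // e.g. createLoop(0.5) for player checks, createLoop(2.75) for block changes
    TimedLoop createLoop(double maxMillisPerTick) {
        TimedLoop loop = new TimedLoop(maxMillisPerTick);
        loops.add(loop);
        return loop;
    }

    @Override
    public void run() {
        for (TimedLoop loop : loops) {
            loop.run();
        }
    }
}
```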


    3) Evenly distributed workloads

    Now we want to look at how to evenly distribute tasks over several ticks.
    Sometimes we have a bunch of heavy tasks that should not be executed every tick and don't need
    a high resolution, so we are fine with them running every second or so. When I started using runnables,
    my thought was that I could reduce the weight of my task by simply scheduling it less frequently. This might help in some cases, but if you have a bunch of heavy tasks it can actually be counterproductive.

    In this example we have 10 tasks, each taking approximately 5 ms to execute.
    We thought about scheduling each task every 10 ticks to reduce the load on the CPU.
    (X = ticks, Y = ms per tick, T = task)

    (diagram: all ten 5 ms tasks firing in the same tick, causing a ~50 ms spike every 10 ticks)
    But this leads to micro lags appearing every half second, which will only get worse if we were to add more of them.

    The solution is pretty simple. We only want each task to run once every half second, and we don't care whether it runs together with the other ones. So we simply give each task a fixed position in the tick loop.

    (diagram: tasks offset across the tick loop, roughly 5 ms per tick)
    Now each task gets executed exactly 10 ticks after its last execution.
    Adding a new task now only requires us to find the position that currently has the fewest tasks
    and add it there.

    One real world example would be a task that should be executed once a second for each player on the server. Now imagine a server with 120 concurrent online players. Every second there would be a massive increase in tick time, because we iterate over 120 players in one tick and execute something for each of them. Distributed with the maximal distribution size possible for one second (so ds = 20), we only have to
    iterate over 6 players per tick, which is far more manageable for the CPU in every situation.

    One implementation of such a distributed runnable could look like this:
    (screenshot: distributed runnable implementation)
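The bucket idea can be sketched like this (class and method names are assumptions): tasks are spread over a fixed number of buckets, e.g. 20 for a once-per-second cadence, and each tick only the current bucket runs. New tasks go into the emptiest bucket:

```java
import java.util.ArrayList;
import java.util.List;

// Spreads tasks over 'distributionSize' buckets; run() is called once per
// tick and executes only the current bucket, so each task fires exactly
// once per full cycle.
class DistributedRunnable implements Runnable {
    private final List<List<Runnable>> buckets;
    private int currentTick = 0;

    DistributedRunnable(int distributionSize) {
        buckets = new ArrayList<>(distributionSize);
        for (int i = 0; i < distributionSize; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // New tasks go into the bucket that currently has the fewest tasks.
    void addTask(Runnable task) {
        List<Runnable> smallest = buckets.get(0);
        for (List<Runnable> bucket : buckets) {
            if (bucket.size() < smallest.size()) {
                smallest = bucket;
            }
        }
        smallest.add(task);
    }

    @Override
    public void run() { // called once per tick
        for (Runnable task : buckets.get(currentTick)) {
            task.run();
        }
        currentTick = (currentTick + 1) % buckets.size();
    }
}
```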
    This would be the general solution for evenly distributing a volatile group of workloads with no correlation.

    We could also have a typed implementation that gets started with a very specific type in mind and
    without using the abstract workload layer.

    Example typed with composition pattern:
    (screenshot: typed implementation using composition)

    Usage example for telling a player Hi once a second:
    (screenshot: usage example)

    It is also possible to replace the functional fields (Consumer, Predicate, Supplier) with abstract methods.
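A guess at what the typed, composition-based variant could look like, built from the functional fields mentioned above (a Consumer for the action and a Predicate for removing stale elements; all names are assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Typed distributed task built by composition: 'action' says what to do with
// each element, 'removeIf' says when to stop tracking an element.
class DistributedTask<T> implements Runnable {
    private final Consumer<T> action;
    private final Predicate<T> removeIf;
    private final List<List<T>> buckets;
    private int currentTick = 0;

    DistributedTask(Consumer<T> action, Predicate<T> removeIf, int distributionSize) {
        this.action = action;
        this.removeIf = removeIf;
        this.buckets = new ArrayList<>(distributionSize);
        for (int i = 0; i < distributionSize; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    void add(T element) {
        List<T> smallest = buckets.get(0);
        for (List<T> bucket : buckets) {
            if (bucket.size() < smallest.size()) {
                smallest = bucket;
            }
        }
        smallest.add(element);
    }

    @Override
    public void run() { // once per tick
        List<T> bucket = buckets.get(currentTick);
        bucket.removeIf(removeIf); // drop stale elements, e.g. offline players
        bucket.forEach(action);    // run the action for the rest
        currentTick = (currentTick + 1) % buckets.size();
    }
}
```

Telling every player "Hi" once a second would then roughly be new DistributedTask&lt;Player&gt;(p -> p.sendMessage("Hi"), p -> !p.isOnline(), 20), driven once per tick by the scheduler.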


    ---------------------------------------------------------------------------------------------------------------------------------

    PS: My current go-to for self-limiting workloads.
    It is compact and performs well.

    (screenshot: compact self-limiting workload)
     
    #1 7smile7, Dec 21, 2019
    Last edited: Mar 18, 2021
    • Winner x 38
    • Like x 12
    • Useful x 11
    • Informative x 3
    • Creative x 1
  2. dude.. this is amazing
     
    • Friendly Friendly x 2
    • Agree Agree x 1
  3. Pretty nice post, good job! Just a few things to note:
    - You should listen to what your IDE tells you, there are some issues in there, including typos, your IDE is trying to tell you about.
    - You should have been using nanoTime from the start, the currentTimeMillis resolution isn't reliable enough for this kind of stuff.
    Keep up the good work!
     
  4. But the solution doesn’t fix the problem lol. You’ll still get an NPE since you don’t check :p

    https://stackoverflow.com/questions/4624919/performance-drag-of-java-assertions-when-disabled

    Agreed with the above commenter. Listen to IDEA, the public abstract is completely redundant in an interface.

    An ArrayDeque here is an interesting choice because if you’re polling objects from the head (I think that’s what the stock poll() does, might be the tail?), it will be O(n) to shift everything up. Use a LinkedList instead.
    Apparently not! See replies from OP and Aikar on the next page :)

    It doesn’t matter for this example since all your tasks are the same, but if your tasks do something different and ordering matters, use peek() because you are putting the first task at the end of the queue which might lead to unexpected issues.

    Noting that this is just an example, still kind of interesting that you opted to select an int to workload map. Why not just store the workload object in a flat collection? You’re returning it anyways so the int isn’t exactly meaningful here.

    The “XYZThread” naming here also is a little confusing since they aren’t even threads. They’re more like fibers since they run in java space.
     
    #4 xTrollxDudex, Dec 22, 2019
    Last edited: Apr 19, 2020
  5. :eek:

    {See post below}
    I previously made some JMH benchmarks regarding iteration times, and array-backed collections
    were super fast, so I didn't think about the element-shift scaling.
    PS: running some benchmarks regarding poll and reading in LinkedLists/ArrayDeques.
    Looks like ArrayDeque always has more throughput. Both have similar time complexities, but ArrayDeque has a lower base execution time for poll()/add().

    I'm not quite sure what you mean by that. If I don't check for the first element I have polled and rescheduled, then there will be a loop until the time limit is reached, meaning that elements will be polled multiple (in my case hundreds of) times per tick.


    Ah, that's just something from my first implementation, where I made the constructor for WorkloadThreads protected and let the WorkloadDistributor create the instances, passing itself to the WorkloadThread in order to have a stop() method in my WorkloadThread.
    I mean, this would also be possible with just the hash of the object, but well...


    Yes, my naming is definitely off here.
     
    #5 7smile7, Dec 22, 2019
    Last edited: Dec 22, 2019
  6. Ah no I see my mistake now. You don’t call compute() in that example, computeWorkload() only checks if it’s going to be added to the end of the work queue.
     

  7. Code (Text):
    Benchmark                           (ELEMENT_COUNT)   Mode  Cnt   Score   Error   Units
    JmhQueuePoll.arrayDequePoll                       5  thrpt    5  71,153 ± 0,154  ops/us
    JmhQueuePoll.arrayDequePollThenAdd                5  thrpt    5  41,575 ± 0,384  ops/us
    JmhQueuePoll.linkedListPoll                       5  thrpt    5  88,966 ± 1,291  ops/us
    JmhQueuePoll.linkedListPollThenAdd                5  thrpt    5  12,869 ± 0,013  ops/us

    Benchmark                           (ELEMENT_COUNT)   Mode  Cnt  Score   Error   Units
    JmhQueuePoll.arrayDequePoll                      50  thrpt    5  7,066 ± 0,002  ops/us
    JmhQueuePoll.arrayDequePollThenAdd               50  thrpt    5  4,049 ± 0,003  ops/us
    JmhQueuePoll.linkedListPoll                      50  thrpt    5  8,783 ± 0,751  ops/us
    JmhQueuePoll.linkedListPollThenAdd               50  thrpt    5  1,293 ± 0,001  ops/us

    Benchmark                           (ELEMENT_COUNT)   Mode  Cnt    Score   Error   Units
    JmhQueuePoll.arrayDequePoll                     500  thrpt    5  818,094 ± 0,789  ops/ms
    JmhQueuePoll.arrayDequePollThenAdd              500  thrpt    5  374,637 ± 0,264  ops/ms
    JmhQueuePoll.linkedListPoll                     500  thrpt    5  909,217 ± 0,669  ops/ms
    JmhQueuePoll.linkedListPollThenAdd              500  thrpt    5  128,524 ± 0,330  ops/ms

    Benchmark                           (ELEMENT_COUNT)   Mode  Cnt   Score   Error   Units
    JmhQueuePoll.arrayDequePoll                    5000  thrpt    5  82,116 ± 0,075  ops/ms
    JmhQueuePoll.arrayDequePollThenAdd             5000  thrpt    5  39,233 ± 0,131  ops/ms
    JmhQueuePoll.linkedListPoll                    5000  thrpt    5  82,471 ± 0,031  ops/ms
    JmhQueuePoll.linkedListPollThenAdd             5000  thrpt    5  12,890 ± 0,014  ops/ms
    This is my benchmark for LinkedList vs ArrayDeque.
    Time complexity for 'poll()' and 'poll() then add()' is pretty much O(1) for both collections.
    But ArrayDeque is ~2.5-3 times as fast with 'poll() then add()'.

    (screenshot: benchmark chart)
     
    #7 7smile7, Dec 22, 2019
    Last edited: Dec 22, 2019
    • Like Like x 1
    • Useful Useful x 1
  8. Hm interesting. I rechecked with the source and it turns out I was wrong, ArrayDeque doesn’t shift anything when an element is removed. Very interesting indeed!
     
  9. GitHub link???
     
  10. This is a guide, not a library.
     
  11. I think the code could be minimized here depending on what method of multitasking you are using. But this is great for basic geometry.
     
  12. I'll post a github link with examples after writing the last section. (y)
     
    • Winner Winner x 1
    z2y

    Cool will definitely try implementing that in the plugin I'm working on.
     
  14. Thank you for sharing your expertise
     
    • Like Like x 1
  15. FrostedSnowman

    Resource Staff

    your title has a typo
     
  16. Why do you use an Int2ObjectMap?
     
  17. It's fastutil:

    Code (Text):
    <dependency>
        <groupId>it.unimi.dsi</groupId>
        <artifactId>fastutil</artifactId>
        <version>8.3.0</version>
    </dependency>
     
  18. This I knew; however, I do not know what the difference is between this and a common HashMap with an Integer key.
     
  19. It offers better performance since there’s no need for the JVM to constantly box and unbox the integer.
     
  20. Can you add some code for running a runnable after a certain workload has started/finished? That feature would be helpful for us.