Grouping list elements by multiple fields

Discussion in 'Programming' started by MrDienns, Feb 5, 2020.

  1. So, refactoring some metrics stuff for my project. I made a metrics interface which expects me to return a list of objects. Basically, just some objects with annotations which are used to push some stuff to some database.

    Looks like this:

    Code (Java):
    public class Entities implements DyescapeSystemMetric {

        @Override
        public List<Object> getMetrics() {

            return null;
        }

        @Measurement(name = "Entities")
        public class EntitiesModel {

            @Column(name = "World", tag = true)
            private String world;

            @Column(name = "Type", tag = true)
            private EntityType entityType;

            @Column(name = "Count")
            private int count;

            public EntitiesModel(String world, EntityType entityType, int count) {
                this.world = world;
                this.entityType = entityType;
                this.count = count;
            }
        }
    }
    Now, what I'm trying to do is the following. I want to iterate over all entities in all worlds, and make one object (one EntitiesModel instance) for each unique combination of the world and entity type tags, where count is the number of entities for that combination.

    I could just return one object for every entity that I find, but that means pushing well over 5000 records into a database every 5 seconds for every server. Not ideal to say the least.

    So, I prefer to send a return value which contains a few dozen items. For example, one where the world is "my_world", entity type is Zombie, and the count is 125.

    Pretty sure I should be able to use the streams API somehow, but I can't quite figure it out. The grouping functionality on the collectors seems to produce multi-dimensional maps, which I don't think is ideal either. Preferably I'd just have one stream statement which eventually collects to a list.

    Any advice on how to do this? Again, looking for the shortest/simplest approach, hopefully just a simple stream with a to-list collector.

    Thanks
     
  2. Coded this without an IDE and without testing so it might be full of mistakes, but going from the API it should conceptually work:

    Without streams:
    Code (Java):
    Map<Integer, EntitiesModel> entitiesModels = new HashMap<>();
    for (Entity entity : entities) {
      int hashCode = 31 * entity.getWorld().hashCode() + entity.getType().hashCode();
      if (entitiesModels.containsKey(hashCode)) {
        entitiesModels.get(hashCode).count++;
      } else {
        entitiesModels.put(hashCode, new EntitiesModel(entity.getWorld().getName(), entity.getType(), 1));
      }
    }
    Note that I use an integer as the key; this is a trick to combine the hash codes without creating a wrapper class.

    With streams:
    Code (Java):
    Map<Integer, EntitiesModel> entitiesModels = entities.stream().collect(Collectors.groupingBy(
        e -> e.getWorld().hashCode() + e.getType().hashCode(),
        new Collector<Entity, EntitiesModel, EntitiesModel>() {
      @Override
      public BiConsumer<EntitiesModel, Entity> accumulator() {
        return (model, entity) -> {
          model.world = entity.getWorld().getName();
          model.entityType = entity.getType();
          model.count++;
        };
      }

      @Override
      public Set<Collector.Characteristics> characteristics() {
        return Collections.emptySet(); // nothing special going on here
      }

      @Override
      public BinaryOperator<EntitiesModel> combiner() {
        return (a, b) -> new EntitiesModel(a.world, a.entityType, a.count + b.count);
      }

      @Override
      public Function<EntitiesModel, EntitiesModel> finisher() {
        return Function.identity();
      }

      @Override
      public Supplier<EntitiesModel> supplier() {
        return () -> new EntitiesModel(null, null, 0);
      }
    }));
    As you can see the streams code is an absolute nightmare because you have to create a custom Collector for the EntitiesModel (since the EntitiesModel is in some way a 'collection'). Also note I used the EntitiesModel class as accumulator in the Collector to make it extra dirty.
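    For what it's worth, Collector.of spares you from implementing the interface by hand; here's a rough, untested sketch under the same assumption that the model's fields are reachable from this code:

    Code (Java):
    Collector<Entity, EntitiesModel, EntitiesModel> toModel = Collector.of(
        () -> new EntitiesModel(null, null, 0),                               // start with an empty model
        (model, entity) -> {                                                  // fold each entity into it
            model.world = entity.getWorld().getName();
            model.entityType = entity.getType();
            model.count++;
        },
        (a, b) -> new EntitiesModel(a.world, a.entityType, a.count + b.count)); // merge partial results

    Map<Integer, EntitiesModel> entitiesModels = entities.stream()
        .collect(Collectors.groupingBy(
            e -> 31 * e.getWorld().hashCode() + e.getType().hashCode(),
            toModel));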
     
  3. FrostedSnowman

    Resource Staff

    Apologies if I'm not understanding, but you want to calculate the number of entities, per type, per world, without having to create an individual object per entity?

    If World::getEntitiesByClass isn't an option, here is an approach you can consider which has good performance. Although I didn't use official benchmarks, I ran it several times and took an average run time: counting 2034 bats took 1 ms or less.



    Store the model object in a map, mapped to a world. I'll use the world name for this case.
    Code (Java):
    private final Map<String, EntitiesModel> worlds = new HashMap<>();
    The model itself:
    Code (Java):
    public class EntitiesModel {

        public static final int ENTITY_TYPES = EntityType.values().length;

        private final String world;
        private int[] entities;

        public EntitiesModel(String world) {
            this.world = world;
        }

        public int[] getEntities() {
            return this.entities;
        }

        public void setEntities(int[] entities) {
            this.entities = entities;
        }

        public String getWorld() {
            return this.world;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (o == null || getClass() != o.getClass()) return false;
            EntitiesModel that = (EntitiesModel) o;
            return world.equals(that.world) &&
                    Arrays.equals(entities, that.entities);
        }

        @Override
        public int hashCode() {
            int result = Objects.hash(world);
            result = 31 * result + Arrays.hashCode(entities);
            return result;
        }
    }
    The model will hold an array of counts indexed by entity type ordinal. Additionally, it caches the number of entity types so that EntityType.values() doesn't have to be re-evaluated (and a new array allocated) every time we want to sum the entities.

    Code (Java):
    public void sumEntities(World world) {
        EntitiesModel entitiesModel = this.worlds.computeIfAbsent(world.getName(), EntitiesModel::new);

        int[] amount = new int[EntitiesModel.ENTITY_TYPES];

        for (Entity entity : world.getEntities()) {
            ++amount[entity.getType().ordinal()];
        }
        entitiesModel.setEntities(amount);
    }
    This will calculate the entity counts for the specified world and cache the model in the map if it isn't there already. For each entity, it increments the slot at that entity type's ordinal.

    To read the actual model, it would look something like:

    Code (Java):
    public void bar(World world) {
        sumEntities(world);
        EntitiesModel entitiesModel = this.worlds.get(world.getName());
        int bats = entitiesModel.getEntities()[EntityType.BAT.ordinal()];
    }
    I expanded some of the methods out for readability and for better interpretation. You can of course condense some of the logic to just automatically perform the calculation upon querying the map.
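    For example, condensing it into a single query method could look roughly like this (same worlds map and sumEntities as above):

    Code (Java):
    public int countEntities(World world, EntityType type) {
        sumEntities(world); // recalculate and cache, then read the requested type's slot
        return this.worlds.get(world.getName()).getEntities()[type.ordinal()];
    }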
     
    #3 FrostedSnowman, Feb 5, 2020
    Last edited: Feb 5, 2020
  4. SteelPhoenix

    Moderator

    *If I understand what you're trying to do, here's a (probably inefficient/janky) one-liner*
    Note: this is for all worlds; you can easily cut out the unnecessary bits if you only want one world.

    Ait here goes:
    Code (Java):
    Set<EntitiesModel> result = getServer().getWorlds().stream().map(world -> world.getEntities().stream().collect(Collectors.groupingBy(Entity::getType, Collectors.summingInt(e -> 1))).entrySet().stream().map(entry -> new EntitiesModel(world.getName(), entry.getKey(), entry.getValue())).collect(Collectors.toSet())).collect(HashSet::new, Set::addAll, Set::addAll);
    Breakdown:
    Code (Java):
    Set<EntitiesModel> result = getServer().getWorlds().stream()
        // For each world
        .map(world ->
            world.getEntities().stream()
                // Map entities to a Map<EntityType, Integer> (type and count)
                .collect(Collectors.groupingBy(Entity::getType, Collectors.summingInt(e -> 1)))
                // Map each entry to an EntitiesModel object
                .entrySet().stream().map(entry -> new EntitiesModel(world.getName(), entry.getKey(), entry.getValue()))
            // Collect it in a Set<EntitiesModel>
            .collect(Collectors.toSet()))
        // Collect the Stream<Set<EntitiesModel>> into one merged Set<EntitiesModel>
        .collect(HashSet::new, Set::addAll, Set::addAll);
    ***Y I K E S***

    *Untested, unoptimized - it was late ok
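    A flatMap variant of the same idea reads a little flatter, if you prefer (equally untested):
    Code (Java):
    Set<EntitiesModel> result = getServer().getWorlds().stream()
        .flatMap(world -> world.getEntities().stream()
            // count entities per type within this world
            .collect(Collectors.groupingBy(Entity::getType, Collectors.counting()))
            .entrySet().stream()
            // one EntitiesModel per (world, type) pair
            .map(entry -> new EntitiesModel(world.getName(), entry.getKey(), entry.getValue().intValue())))
        .collect(Collectors.toSet());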
     
  5. This will be vulnerable to hash collisions.



    You will have to collect to a map, then create a stream using .entrySet().stream() again and collect that to a list. @SteelPhoenix has shown one way to do it, but here's another:

    Code (Java):
    List<EntitiesModel> models; // one model per entity
    List<EntitiesModel> merged = models.stream()
        .collect(Collectors.groupingBy(m -> new Pair<>(m.world, m.entityType), Collectors.summingInt(e -> e.count)))
        .entrySet()
        .stream()
        .map(e -> new EntitiesModel(e.getKey().a, e.getKey().b, e.getValue()))
        .collect(Collectors.toList());

    I make use of a class called Pair, which is something you need a lot when working with the Stream API. Apache Commons already provides the same - you can use theirs, or you can implement it yourself.

    Code (Java):
    public class Pair<A, B> {
        A a;
        B b;
        public Pair(A a, B b) {
            this.a = a;
            this.b = b;
        }
        @Override
        public boolean equals(Object other) {
            return other instanceof Pair && this.a.equals(((Pair<?, ?>) other).a) && this.b.equals(((Pair<?, ?>) other).b);
        }
        @Override
        public int hashCode() {
            return 31 * a.hashCode() + b.hashCode();
        }
    }
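    If you're on Java 9+, Map.entry can also stand in for the Pair, since its equals/hashCode are already value-based:

    Code (Java):
    List<EntitiesModel> merged = models.stream()
        .collect(Collectors.groupingBy(
            m -> Map.entry(m.world, m.entityType),           // built-in immutable key pair
            Collectors.summingInt(m -> m.count)))
        .entrySet().stream()
        .map(e -> new EntitiesModel(e.getKey().getKey(), e.getKey().getValue(), e.getValue()))
        .collect(Collectors.toList());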
     
  6. Yes, the drawback is that you lose the fallback of the equals method. (I don't exactly know why you link to that article about cryptographic hash functions; in this context hash collisions are normal and not necessarily evil.) Your Pair class uses the same hash code, so it will have the same number of hash collisions, but those would be resolved by the Map by looking at equals. I leave it up to OP to choose.
     
  7. Thanks all, this gives a nice direction.

    I'm still not sure it really matches what I'm after, though. I've given one example now with only two tags and one field, but later on I'll undoubtedly have more tags and more fields. Since I'd build loads of these metrics (for performance & system monitoring), this would be repeated a fair amount for anything that needs to be monitored, which is why I think a short stream statement would be most ideal. Any kind of library would also be great, but I'm not exactly sure what terminology I should be searching for.

    To give a bit more context, it's for an InfluxDB database. Every object I return in the list (see my OP code) is a soon-to-be database record. Imagine it as an SQL query (it's actually very similar to SQL in terms of structure).
     
  8. SteelPhoenix

    Moderator

    Looking at your example, wouldn't it be a better idea to push changes once they happen instead of polling every once in a while and pushing all that data to a db?
     
  9. We do that for player metrics, but this is runtime metrics. Like, how many entities are currently in the world. Tag (group by) entity type, world, level, etc. Not something I can push individually or when they happen, since it would result in way too many database entries.
     
  10. This sounds scary at first, however:

    I happen to have quite a bit of experience with InfluxDB. If you make sure the WAL sits on fast storage, 5000 points per 5 seconds is actually not a problem at all! Most time series databases are optimized heavily for the kinds of writes you'll be making, but they sometimes require you to tinker a bit for your specific type of workload. InfluxDB is built to handle many writes per second from many different sources. You should not think of it as your run-of-the-mill RDBMS.

    I recommend checking out retention policies and continuous queries, because they allow you to be a bit more "reckless" with your writes and as a result open up a lot more possibilities for special use cases on the query side of things. InfluxData have a bunch of really nice examples in their docs, and they specifically have examples about the kind of stuff you're talking about here.

    Another thing I want to mention is how it's important to understand what tags and fields mean for your series cardinality (and thus your write speed) and your query patterns, as well as how important it is to split things up into measurements. A measurement is not the same thing as a table, and even if two data points come at the same time from the same producer, it will likely make sense to split them up into two different measurements. This is a bit orthogonal to how you would normally collocate stuff on a single table in an RDBMS.

    I'd be happy to share some war stories and experiences, but it might be easier in a voice chat. Feel free to hit me up in a PM if you wanna chat about this :)
     
    • Informative x 1
  11. Thanks for the info and advice. I'm aware that these kinds of databases are optimised for this kind of stuff, but is it really worth it, or smart, to push 5000 points when it could effectively be, like, 50? I'd just be counting the entities per world per entity type (in this case, though many similar cases will come). Whether it comes in as 5000 points or as 50 points functionally shouldn't make a difference, so I hope it's understandable that I'm trying to optimise :p

    I made this nested for-loop which does get the functionality done, though it's probably not as memory efficient as some stream could manage.

    Code (Java):
    @Override
    public List<Object> getMetrics() {

        Map<Integer, EntitiesModel> result = new HashMap<>();

        for (World world : Bukkit.getWorlds()) {
            for (Entity entity : world.getEntities()) {
                // assumes EntitiesModel overrides hashCode() based on world + entityType
                EntitiesModel model = new EntitiesModel(world.getName(), entity.getType(), 1);
                if (result.containsKey(model.hashCode())) {
                    EntitiesModel current = result.get(model.hashCode());
                    current.count++;
                } else {
                    result.put(model.hashCode(), model);
                }
            }
        }

        return new ArrayList<>(result.values());
    }
    [screenshot of the resulting counts]
    why the fuck are there 2 falling blocks?
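    For comparison, here's the same grouping as the loop above written as one stream; just a sketch assuming the model's fields are accessible from here:

    Code (Java):
    Map<Object, EntitiesModel> grouped = Bukkit.getWorlds().stream()
        .flatMap(w -> w.getEntities().stream()
            .map(e -> new EntitiesModel(w.getName(), e.getType(), 1)))       // one model per entity
        .collect(Collectors.toMap(
            m -> Arrays.asList(m.world, m.entityType),                       // composite key with proper equals/hashCode
            m -> m,
            (a, b) -> { a.count += b.count; return a; }));                   // merge duplicates by summing counts
    return new ArrayList<>(grouped.values());                                // same return as in getMetrics() above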
     
    #11 MrDienns, Feb 6, 2020
    Last edited: Feb 6, 2020
  12. I think I got so excited about someone asking a question about InfluxDB that I kinda forgot to address the actual question xD

    It definitely makes sense to count your entities and write that into InfluxDB, and like you already said yourself, you're gonna get the cartesian product of your tags, so (#worlds x #types). And like you also mentioned yourself, metrics are different from event logs, so the "value at a given time" is the right way to use a time series database.

    Others have also mentioned how you can use the Streams API, and as you can see from those suggestions, it gets a bit hairy really quickly - especially if you start throwing stuff like Pair into the mix. Since you are on the path of optimization, I think you should consider writing a dead simple loop-based implementation, and then compare readability and profile the performance. The Streams API tends to get a lot of flak around the web for being a bit slow, so if you're paying in both readability (very much subjective of course) and performance, it might not be worth it.

    So yeah, just echoing yourself and others - I didn't have much input after all. The offer still stands if you want to hear some war stories or talk about InfluxDB in general :)
     
  13. Hash collisions are not only a problem in cryptographic hash functions. While a HashMap will still work even if there are hash collisions, it'll be way slower (from O(1) access to O(n) access worst-case). However, in your case the hash code itself is used as the map key, so two different world/type combinations that happen to collide would silently be merged into one record; no int-sized hash code can rule that out, so the code isn't collision-safe as written.



    Do I understand right: you have a bunch of Model classes with Column-annotated fields whose tag property is true if and only if the field is a "key" in that sense, and you want a method List<Model> merge(List<Model> list) which merges all elements in the list that have the same key properties? If so, you might as well build the feature into your annotation system.

    In particular, what I'd do is add a new property reduce to your Column annotation:
    Code (Java):
    @interface Column {
        String name();
        boolean tag() default false;
        // annotation members can't hold lambdas or arbitrary objects, so the reducer is referenced by class
        Class<? extends BinaryOperator> reduce() default BinaryOperator.class;
    }
    It determines how non-tag values get merged; e.g. for count it would just sum them up. You can then modify the count variable in EntitiesModel:
    Code (Java):
        @Column(name = "Count", reduce = Integer::sum)
        private int count;
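    Annotation values have to be compile-time constants or class literals rather than something like Integer::sum, so the summing reducer becomes a small named class (IntSum is just an illustrative name; keep it top-level or static so it can be instantiated reflectively):
    Code (Java):
    // example reducer referenced from the Count column above
    public class IntSum implements BinaryOperator<Integer> {
        @Override
        public Integer apply(Integer a, Integer b) {
            return a + b;
        }
    }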
    This is the public interface you touch when you create new Model classes. You now have a number of ways to implement your merge method; here's an example implementation:
    Code (Java):
    // don't know the Reflection annotation APIs off the top of my head, IIRC you had to do some access stuff but I'm sure you can figure that out^^
    public <T extends Model> Collection<T> merge(List<T> list) {
        return list.stream().collect(Collectors.groupingBy(
            this::getKeys,
            // reducing(...) wraps its result in an Optional, so unwrap it with collectingAndThen
            Collectors.collectingAndThen(Collectors.reducing((T a, T b) -> mergeTwo(a, b)), Optional::get)
        )).values();
    }

    private Map<String, Object> getKeys(Model model) {
        return Arrays.stream(model.getClass().getDeclaredFields())
            .filter(f -> f.getAnnotation(Column.class) != null && f.getAnnotation(Column.class).tag())
            .collect(Collectors.toMap(Field::getName, f -> readField(f, model)));
    }

    private Object readField(Field field, Object target) {
        try {
            field.setAccessible(true);
            return field.get(target);
        } catch (IllegalAccessException e) {
            throw new RuntimeException(e);
        }
    }

    private <T extends Model> T mergeTwo(T a, T b) {
        if (a.getClass() != b.getClass()) throw new IllegalArgumentException();
        for (Field field : a.getClass().getDeclaredFields()) {
            Column annot = field.getAnnotation(Column.class);
            // only merge non-tag columns that actually declare a reducer
            if (annot == null || annot.tag() || annot.reduce() == BinaryOperator.class) continue;
            try {
                field.setAccessible(true);
                BinaryOperator<Object> reducer =
                    (BinaryOperator<Object>) annot.reduce().getDeclaredConstructor().newInstance();
                field.set(a, reducer.apply(field.get(a), field.get(b)));
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        }
        return a;
    }
    (I have invented the class Model as a super-class for EntitiesModel etc. in the code, but that's just for readability purposes and the code would still work if you used Object instead)

    This is a long piece of code, but it's all hidden in abstraction; if you want to clean up your list of 5000 objects, all you need to do is a single call to merge. Cool!
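    Usage then boils down to something like this (buildOneModelPerEntity is just a stand-in for however you gather the raw models):
    Code (Java):
    List<EntitiesModel> raw = buildOneModelPerEntity();  // the ~5000 per-entity models
    Collection<EntitiesModel> merged = merge(raw);       // collapses them to one model per unique tag combination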



    EDIT: Removed the ! from !annot.tag() in mergeTwo. I don't usually leave edit notes but this one is pretty major (the code won't work with the !) so I'm putting it here
     
    #13 StillNoNumber, Feb 6, 2020
    Last edited: Feb 6, 2020
    • Like x 1