loadFactor should be 1.0 instead of default 0.75

Comment From: jhoeller

This is rather subtle, unfortunately. It would be ideal to specify the number of expected entries upfront, but given that there is no such API on HashMap, specifying the expected size as initial capacity is an acceptable trade-off for our scenarios.

We generally avoid using custom load factors since that factor affects not just resize steps but also lookup performance. The default load factor of 0.75 is also used in the HashMap copy constructor, and an effective load factor of 0.5 is applied by the new Map.of method (through its "expand factor" of 2 for the input length), presumably for the same reason: a better trade-off in terms of lookup performance through generously distributed hash buckets with less chance of collisions.

Since a HashMap does not use the given capacity value as-is but rather rounds it up to the next power of two anyway, actual resize steps do not happen as often as expected: in roughly half the cases a load factor of 1 won't make a difference, and even in the other half it'll only trigger a single resize step. Ironically, whatever load factor we choose, once such a HashMap is copied through the copy constructor, the copy reverts to the default load factor of 0.75 again.
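
To illustrate the arithmetic (a minimal sketch that mirrors HashMap's documented power-of-two rounding and its size-exceeds-capacity-times-loadFactor resize trigger; it does not call into JDK internals):

    public class CapacityDemo {

        // Same rounding idea as HashMap: the requested capacity is rounded up
        // to the next power of two.
        static int nextPowerOfTwo(int cap) {
            int n = 1;
            while (n < cap) {
                n <<= 1;
            }
            return n;
        }

        public static void main(String[] args) {
            for (int size : new int[] {9, 12, 13, 24}) {
                int capacity = nextPowerOfTwo(size);       // what new HashMap<>(size) ends up allocating
                int threshold = (int) (capacity * 0.75f);  // a resize triggers once size exceeds this
                System.out.printf("size %d -> capacity %d, threshold %d, resize: %b%n",
                        size, capacity, threshold, size > threshold);
            }
        }
    }

For sizes 9, 12 and 24 the power-of-two rounding leaves enough headroom and no resize occurs; for size 13 a single resize is triggered, matching the "roughly half the cases" observation above.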

We'd have to look at the typical distribution in such narrow-sized maps with load factor 1 in our scenarios: if the elements expose a well-formed hashCode, collisions are typically absent, there are no lookup operations for non-existing elements, and no additional elements are added, I suppose we could consider a load factor of 1. That said, there are rather strong recommendations against such a load factor in general... and a potential resize step isn't so bad.

Comment From: quaff

Those maps will always resize once before this commit. What about new HashMap<>((int) (size / 0.75)) instead?

Comment From: jhoeller

Well, the maps resize only about half of the time on average, since the rounding up to the next power of two covers some ground as well (e.g. size 9 -> capacity 16 is sufficient, but size 13 -> capacity 16 isn't with a load factor of 0.75). In any case, it's not ideal as-is, I agree.

Calculating against the load factor in every such place would indeed help but would look somewhat convoluted. Maybe we'd have to put it into a common utility method, or even into convenient map factory methods on our CollectionFactory class. It's a shame that HashMap does not provide a corresponding constructor itself :-(

I'm not convinced that this makes enough difference in practice to be worth the extra "noise" at code level but we can consider doing a consistent revision for 5.3 here.

Comment From: diguage

Extracting the creation of HashMap into a Builder may be a better solution. The Builder could then change the load factor globally.

Comment From: wind57

@jhoeller here is my interaction with Stuart Marks on this subject: https://stackoverflow.com/questions/52692803/hashmap-rehash-resize-capacity

I will summarize it in a few sentences if you do not have the time to read it entirely. You can ask a helper method: "please give me a REAL capacity under which no rehash operations will occur". At the moment, this can be done via (for example):

    /**
     * Returns a capacity (to pass as the initial capacity) under which the given Map
     * can hold the desired number of entries before its inner table is resized.
     *
     * @param desiredSize how many entries this map will contain
     * @param cls the class of the Map implementation
     * @return a capacity under which no rehash operations will occur for the given Map
     */
    public static int capacity(int desiredSize, Class<? extends Map<?, ?>> cls) {
        // body left empty on purpose for the point of discussion
        throw new UnsupportedOperationException("not implemented yet");
    }

and you could write: HashMap<String, String> map = new HashMap<>(capacity(9, HashMap.class)). But this has subtle problems (a rough sketch for the plain HashMap case follows after the list below).

- We have to treat ConcurrentHashMap differently, because it does the correct thing (with a small gotcha: https://bugs.openjdk.java.net/browse/JDK-8202422).

- The callers of this would be confused (I would be). Passing a TreeMap or EnumMap here makes no sense, as they do not have constructors that take a size. But that is most probably nitpicking on my side.

- HashMap/LinkedHashMap make perfect sense and are good candidates.

- IdentityHashMap and Hashtable have weird internal defaults, as far as I remember, that would most probably have to be handled as separate scenarios in the method body.

- This entire method is going to be a false promise anyway (https://bugs.openjdk.java.net/browse/JDK-8211831 plus the very first link I gave). Specifically:

> ... "no rehash operations will ever occur" ... Strictly speaking the above statement is now incorrect, so it should probably be adjusted to say something like "rehash operations will generally not occur".

While it can surely be done, this is not obvious, at least IMHO. I will be more than glad (and will, in a few days) to provide a prototype for such a method, but I am really not sure it is needed. What are your thoughts?

Comment From: jhoeller

I've introduced CollectionUtils.newHashMap and newLinkedHashMap methods that take an expectedSize argument, and I've repurposed the existing int constructors of our own LinkedCaseInsensitiveMap and LinkedMultiValueMap to work in expectedSize style as well. Since for those two we never exposed load factor arguments to begin with, it seems sensible to simply extend the semantics of their existing capacity constructors to the expected number of elements. To be committed soon.
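
A sketch of what such a factory method might look like, assuming the default load factor of 0.75 (illustrative only; the actual committed code may differ in detail):

    // Illustrative expectedSize-style factory (not necessarily the exact committed code):
    // sizes the backing table so that expectedSize entries fit without a resize,
    // assuming the default load factor of 0.75. Math.ceil avoids truncation issues
    // for very small sizes.
    public static <K, V> HashMap<K, V> newHashMap(int expectedSize) {
        return new HashMap<>((int) Math.ceil(expectedSize / 0.75), 0.75f);
    }

Callers would then write newHashMap(9) instead of new HashMap<>(9), expressing the expected number of entries rather than a raw capacity.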

Comment From: wind57

why not phrase the documentation exactly as java.util.HashMap was supposed to, and will once that defect is closed:

> expectedSize the expected number of elements under which no rehash operations will generally occur

Otherwise, what is the point of that documentation? Providing the size in the case of a HashXXX structure is all about rehashes, to me. Am I missing something?

Also, what is wrong with using the diamond operator, as in return new HashMap<>(....), in CollectionUtils?

Comment From: jhoeller

Good catch about the diamond operator, that was an accident that sneaked into the commit. I'll fix this right away.

As for the documentation, we can certainly hint there at no resize/rehash operations being expected. Generally speaking, I find referring to the expected size (as per the Map.size() method) and the expected number of elements pretty intuitive, matching a simplified user's way of thinking about it, as opposed to referring to capacity, load factor, or the occurrence of rehashes.

Comment From: wind57

I don't fully agree with the final wording, but it is far better now. Thank you, and I will stop commenting here.