Reduce memory usage of field maps in FieldInfos and BlockTree TermsReader. #13327

bruno-roustant · 2024-04-29T13:49:03Z

Description

Two goals:

1- Reduce the memory usage of field maps when there are many fields.
The FieldInfos constructor is refactored to build the byNumber array more efficiently, avoiding array growth and copies.
Lucene90BlockTreeTermsReader now uses a primitive IntObjectHashMap, which needs less memory than a HashMap.

2- Add a new IntObjectHashMap to the existing small hppc fork. This PR serves as an example use case; the map can hopefully be reused later for other use cases.

…ader.

bruno-roustant · 2024-04-29T13:55:01Z

.../core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java

@@ -113,7 +114,8 @@ public final class Lucene90BlockTreeTermsReader extends FieldsProducer {
 // produce DocsEnum on demand
 final PostingsReaderBase postingsReader;

- private final Map<String, FieldReader> fieldMap;
+ private final FieldInfos fieldInfos;
+ private final IntObjectHashMap<FieldReader> fieldMap;


This PR proposes to reuse the existing field-name -> FieldInfo map in FieldInfos so that references to the field name strings are not repeated here. Instead, the field number (specific to the FieldInfos) is used as the key, which allows a compact primitive map.

Then below, in terms(String fieldName), we can use FieldInfos.fieldInfo(String fieldName) as a first mapping to the field number, and then use this compact map to get the Terms.

FWIW this is what Lucene90PointsReader does today as well.

Nice. It could benefit from the IntObjectHashMap.

Actually there are many usages of Map<Integer, Object>. I could open some PRs when memory (and perf) matters after IntObjectHashMap is in.
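
The two-step lookup discussed in this thread (field name -> field number -> reader) can be sketched as below. This is a minimal, self-contained illustration and not the Lucene code: FieldNumberLookupSketch and TinyIntObjectMap are hypothetical names, the HashMap stands in for FieldInfos, and plain strings stand in for FieldReader instances. The tiny map is a simplified assumption (fixed capacity, no removal, non-null values only), not the hppc-fork IntObjectHashMap.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch, not Lucene code: a two-step field lookup keyed by field number. */
final class FieldNumberLookupSketch {

  /** Minimal int-keyed open-addressing map: linear probing, fixed capacity, no removal. */
  static final class TinyIntObjectMap<V> {
    private final int[] keys;
    private final Object[] values; // a null value marks an empty slot
    private final int mask;

    TinyIntObjectMap(int expectedSize) {
      int capacity = 4;
      while (capacity < expectedSize * 2) {
        capacity <<= 1; // power of two, at most half full for expectedSize entries
      }
      keys = new int[capacity];
      values = new Object[capacity];
      mask = capacity - 1;
    }

    /** Stores a non-null value; the caller must not exceed expectedSize entries. */
    void put(int key, V value) {
      int slot = hash(key) & mask;
      while (values[slot] != null && keys[slot] != key) {
        slot = (slot + 1) & mask;
      }
      keys[slot] = key;
      values[slot] = value;
    }

    @SuppressWarnings("unchecked")
    V get(int key) {
      int slot = hash(key) & mask;
      while (values[slot] != null) {
        if (keys[slot] == key) {
          return (V) values[slot];
        }
        slot = (slot + 1) & mask;
      }
      return null;
    }

    private static int hash(int key) {
      int h = key * 0x9E3779B9;
      return h ^ (h >>> 16);
    }
  }

  // Stands in for the name -> FieldInfo map that FieldInfos already holds.
  private final Map<String, Integer> numberByName = new HashMap<>();
  // Stands in for the compact fieldMap; no field name strings are stored here.
  private final TinyIntObjectMap<String> readerByNumber;

  FieldNumberLookupSketch(String[] fieldNames) {
    readerByNumber = new TinyIntObjectMap<>(fieldNames.length);
    for (int number = 0; number < fieldNames.length; number++) {
      numberByName.put(fieldNames[number], number);
      readerByNumber.put(number, "reader:" + fieldNames[number]);
    }
  }

  /** First map the name to its number, then the number to the reader. */
  String terms(String fieldName) {
    Integer number = numberByName.get(fieldName);
    return number == null ? null : readerByNumber.get(number);
  }

  public static void main(String[] args) {
    FieldNumberLookupSketch sketch =
        new FieldNumberLookupSketch(new String[] {"id", "title", "body"});
    System.out.println(sketch.terms("title"));
    System.out.println(sketch.terms("missing"));
  }
}
```

The int-keyed map avoids boxing the keys and avoids a per-entry node object, which is where the memory saving over a Map<String, FieldReader> would come from in this sketch.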

lucene/core/src/java/org/apache/lucene/index/FieldInfos.java

jpountz

FYI FieldInfos used to switch between sparse and dense in the past, and we changed it to always use a dense encoding 6 years ago: https://issues.apache.org/jira/browse/LUCENE-8033.

It doesn't necessarily mean we cannot merge your change, at least the sparse encoding is used more often than it was in the past. I'm curious if you know why the indexes you're seeing end up having sparse fields?

bruno-roustant · 2024-04-30T14:47:53Z

Very interesting, I didn't know this history of the FieldInfos.

I ended up analyzing the FieldInfos after we saw the time and memory usage for the byNumber array when there are many fields.
I think the time comes from the ArrayUtil.grow(byNumberTemp, info.number + 1), when the array has to grow and copy multiple times. Also, the byName map grows and is rehashed.
For the memory part, it was more a guess since we saw large FieldInfo[] that could be explained if the max info.number was high. But I have no metrics to share unfortunately.

If the decision was already taken to stay on a dense array, then I can remove the primitive map from this PR. But I would like to keep the creation and the sorting of the array at the end of the loop, as I think it is faster.

dsmiley · 2024-04-30T14:52:43Z

lucene/core/src/java/org/apache/lucene/index/FieldInfos.java

+ ? new MapFieldInfoByNumber(infos)
+ : new ArrayFieldInfoByNumber(infos, maxFieldNumber);
+ // The iteration of FieldInfo is ordered by ascending field number.
+ values = Collections.unmodifiableCollection(Arrays.asList(sortedFieldInfos));


nowadays, do List.of(sortedFieldInfos) rather than this double-layer referring to two other classes

List.of() makes a copy as it considers the input as an "untrusted array". Here we don't copy, just wrap. Actually we could keep just Arrays.asList(sortedFieldInfos) since we own it privately, so we know we don't modify it (only iterator(), which does not support removal for Arrays.asList).
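
The copy-versus-wrap difference mentioned above can be checked with plain JDK calls. A small sketch, where WrapVersusCopySketch and its helper methods are hypothetical names used only for this illustration:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch comparing Arrays.asList (wraps) with List.of (copies). */
final class WrapVersusCopySketch {

  /** Builds a list over an array, then writes to the array and reads the list. */
  static String firstElementAfterArrayWrite(boolean wrap) {
    String[] sorted = {"a", "b", "c"};
    List<String> list = wrap ? Arrays.asList(sorted) : List.of(sorted);
    sorted[0] = "z"; // visible through the wrapper, invisible to the copy
    return list.get(0);
  }

  /** Returns true if removal through the Arrays.asList iterator is rejected. */
  static boolean iteratorRemovalIsUnsupported() {
    Iterator<String> it = Arrays.asList("a", "b").iterator();
    it.next();
    try {
      it.remove();
      return false;
    } catch (UnsupportedOperationException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(firstElementAfterArrayWrite(true)); // wrapper: sees the write
    System.out.println(firstElementAfterArrayWrite(false)); // copy: keeps the old value
    System.out.println(iteratorRemovalIsUnsupported());
  }
}
```

Note that Arrays.asList still allows set(int, E) on the List itself, so it is only safe to expose without the unmodifiable wrapper when callers receive it through a type that offers no such method, or when the owner is trusted not to call it.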

…ing.

bruno-roustant · 2024-05-02T14:53:07Z

So I did some benchmarking.
As I could not reproduce the sparse byNumber mapping, I prefer to remove the code for the primitive map in FieldInfos.

At the same time I could evaluate the time spent by the current FieldInfos code to build the byNumber and values arrays. When we read a segment, the reader always provides the FieldInfo[] input sorted in ascending order of field number. So the code that grows byNumberTemp has to grow it multiple times when there are many fields. The same applies to the hashmap initialized with the default capacity.
This latest commit changes the way byNumber is built to allocate only one array, or even to use the input FieldInfo[] directly if it is full. This constructor becomes 40% faster than before, even more when the GC is busy, as it allocates fewer arrays.
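
A rough sketch of the single-allocation idea described above, under stated assumptions: ByNumberBuildSketch and its Field class are hypothetical stand-ins (not FieldInfos or FieldInfo), and the logic only illustrates the approach of finding the maximum number first, reusing the input when it is dense and sorted, and otherwise allocating the array once while checking for duplicates.

```java
/** Hypothetical sketch, not the FieldInfos code: build a by-number array in one allocation. */
final class ByNumberBuildSketch {

  /** Stand-in for a field descriptor; only the name and number matter here. */
  static final class Field {
    final String name;
    final int number;

    Field(String name, int number) {
      this.name = name;
      this.number = number;
    }
  }

  static Field[] buildByNumber(Field[] infos) {
    int max = -1;
    boolean strictlyAscending = true;
    for (Field field : infos) {
      if (field.number < 0) {
        throw new IllegalArgumentException("negative field number for " + field.name);
      }
      if (field.number <= max) {
        strictlyAscending = false; // out of order or duplicate
      }
      max = Math.max(max, field.number);
    }
    if (strictlyAscending && max == infos.length - 1) {
      // Numbers are exactly 0..length-1 in order, so infos[i].number == i.
      // The input array is used directly; the caller gives up ownership of it.
      return infos;
    }
    Field[] byNumber = new Field[max + 1]; // single allocation, no growing
    for (Field field : infos) {
      Field existing = byNumber[field.number];
      if (existing != null) {
        throw new IllegalArgumentException(
            "duplicate field number " + field.number
                + " for " + existing.name + " and " + field.name);
      }
      byNumber[field.number] = field;
    }
    return byNumber;
  }

  public static void main(String[] args) {
    Field[] sparse = {new Field("body", 3), new Field("id", 1)};
    Field[] byNumber = buildByNumber(sparse);
    System.out.println(byNumber.length);
    System.out.println(byNumber[3].name);
  }
}
```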

bruno-roustant · 2024-05-02T15:00:36Z

lucene/core/src/java/org/apache/lucene/index/FieldInfos.java

+ // The input FieldInfo[] contains all fields numbered from 0 to infos.length - 1 and they are
+ // sorted, use it directly. This is an optimization when reading a segment with all fields
+ // since the FieldInfo[] is sorted.
+ byNumber = infos; // We could copy the input array, but do we need to?


Here it seems to me we can use directly the input array. Do you think we should copy it for safety?
This input array is created by the readers and passed as parameter to the constructor.

This is fairly internal stuff labelled lucene.experimental; I say yes but document it in the javadoc. Obviously look at existing callers to double-check this is fine.

dsmiley · 2024-05-08T18:11:05Z

lucene/core/src/java/org/apache/lucene/index/FieldInfos.java

+ } else {
+ // The below code is faster than Arrays.stream(byNumber).filter(Objects::nonNull).toList(),
+ // mainly when the input FieldInfo[] is sorted, when reading a segment.
+ FieldInfo[] sortedFieldInfos = ArrayUtil.copyOfSubArray(infos, 0, infos.length);


It looks a bit curious to be calling a method with "sub array" for what is actually the whole array. Maybe a convenience method copy(infos) should be provided. Could be deferred of course.

I agree. I created another PR to add ArrayUtil.copyOf().

Don't even need to copy it as we've given ourselves permission to take ownership of the input array; to manipulate it.

dsmiley · 2024-05-08T18:16:16Z

lucene/core/src/java/org/apache/lucene/index/FieldInfos.java

+ } else {
+ byNumber = new FieldInfo[maxFieldNumber + 1];
+ for (FieldInfo fieldInfo : infos) {
+ FieldInfo previous = byNumber[fieldInfo.number];


this is non-obvious. At a glance, we are retrieving the very same fieldInfo, yet supposedly it's the previous. Oh, maybe you mean "existing" fieldInfo as opposed to a fieldInfo ordered prior to this one?

Yes, existing, to check we have no duplicates. Let's rename it "existing" for clarity.

Thanks. "previous" wasn't wrong, just ambiguous. I like "existing"; another possible name is "old".

dsmiley · 2024-05-08T18:23:59Z

lucene/core/src/java/org/apache/lucene/index/FieldInfos.java

+ // The input FieldInfo[] contains all fields numbered from 0 to infos.length - 1 and they are
+ // sorted, use it directly. This is an optimization when reading a segment with all fields
+ // since the FieldInfo[] is sorted.
+ byNumber = infos; // We could copy the input array, but do we need to?


This is fairly internal stuff labelled lucene.experimental; I say yes but document it in the javadoc. Obviously look at existing callers to double-check this is fine.

dsmiley · 2024-05-11T17:47:28Z

lucene/core/src/java/org/apache/lucene/index/FieldInfos.java

+ // The below code is faster than Arrays.stream(byNumber).filter(Objects::nonNull).toList(),
+ // mainly when the input FieldInfo[] is sorted, when reading a segment.


If fieldNumberStrictlyAscending, and we've given ourselves permission to accept the input array, we can merely wrap it in Arrays.asList and we're done.
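
A sketch of that suggestion, with hypothetical names (ValuesViewSketch, Field, namesInOrder) and an assumed fallback: when the numbers are strictly ascending the input array is wrapped as-is, otherwise a sorted copy is wrapped. The fallback branch is an assumption made for illustration, not necessarily what the PR does.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;

/** Hypothetical sketch: expose fields in ascending number order without copying when possible. */
final class ValuesViewSketch {

  /** Stand-in for a field descriptor. */
  static final class Field {
    final String name;
    final int number;

    Field(String name, int number) {
      this.name = name;
      this.number = number;
    }
  }

  static boolean isStrictlyAscending(Field[] infos) {
    for (int i = 1; i < infos.length; i++) {
      if (infos[i - 1].number >= infos[i].number) {
        return false;
      }
    }
    return true;
  }

  static Collection<Field> values(Field[] infos) {
    Field[] sorted = infos; // already ordered: wrap the input, no copy
    if (!isStrictlyAscending(infos)) {
      sorted = infos.clone(); // assumed fallback: sort a private copy
      Arrays.sort(sorted, Comparator.comparingInt((Field f) -> f.number));
    }
    return Collections.unmodifiableCollection(Arrays.asList(sorted));
  }

  /** Helper for the demo: joins the field names in iteration order. */
  static String namesInOrder(Collection<Field> fields) {
    StringBuilder sb = new StringBuilder();
    for (Field field : fields) {
      if (sb.length() > 0) {
        sb.append(',');
      }
      sb.append(field.name);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Field[] unsorted = {new Field("body", 2), new Field("id", 0), new Field("title", 1)};
    System.out.println(namesInOrder(values(unsorted)));
  }
}
```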

dsmiley · 2024-05-11T17:48:05Z

lucene/core/src/java/org/apache/lucene/index/FieldInfos.java

+ } else {
+ // The below code is faster than Arrays.stream(byNumber).filter(Objects::nonNull).toList(),
+ // mainly when the input FieldInfo[] is sorted, when reading a segment.
+ FieldInfo[] sortedFieldInfos = ArrayUtil.copyOfSubArray(infos, 0, infos.length);


Don't even need to copy it as we've given ourselves permission to take ownership of the input array; to manipulate it.

…ader. (#13327)

Reduce memory usage of field maps in FieldInfos and BlockTree TermsRe…

d6161e7

…ader.

bruno-roustant commented Apr 29, 2024

jpountz reviewed Apr 30, 2024

dsmiley reviewed Apr 30, 2024

bruno-roustant added 2 commits April 30, 2024 17:27

Call newHashmap() and use IntFunction.

3aec6d5

Remove sparse field number map. Optimize the field number array build…

8b7f89f

…ing.

bruno-roustant commented May 2, 2024

dsmiley reviewed May 8, 2024

Javadoc and variable renaming for clarity.

e64521d

dsmiley reviewed May 11, 2024

No need to copy infos input array.

3022306

dsmiley approved these changes May 11, 2024

bruno-roustant and others added 4 commits May 13, 2024 12:39

CHANGES.txt

cb7185f

Merge branch 'main' into fieldmap

9bab4bd

Remove unused import

e4ba2e1

Tidy

4b8db9d

bruno-roustant merged commit 8c738ba into apache:main May 13, 2024
3 checks passed

bruno-roustant deleted the fieldmap branch May 13, 2024 14:07

bruno-roustant added a commit that referenced this pull request May 14, 2024

Reduce memory usage of field maps in FieldInfos and BlockTree TermsRe…

3fa2ddc

…ader. (#13327)
