New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce memory usage of field maps in FieldInfos and BlockTree TermsReader. #13327
Conversation
@@ -113,7 +114,8 @@ public final class Lucene90BlockTreeTermsReader extends FieldsProducer { | |||
// produce DocsEnum on demand | |||
final PostingsReaderBase postingsReader; | |||
|
|||
private final Map<String, FieldReader> fieldMap; | |||
private final FieldInfos fieldInfos; | |||
private final IntObjectHashMap<FieldReader> fieldMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR proposes to leverage the existing field-name -> FieldInfo map in FieldInfos to not repeat the ref to the field name strings here. Instead use the field number (specific to the FieldInfos) as key, so that we can use a compact primitive map.
Then below, in terms(String fieldName), we can use FieldInfos.fieldInfo(String fieldName) as a first mapping to the field number, and then use this compact map to get the Terms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW this is what Lucene90PointsReader does today as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. It could benefit from the IntObjectHashMap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually there are many usages of Map<Integer, Object>. I could open some PRs when memory (and perf) matters after IntObjectHashMap is in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI FieldInfos used to switch between sparse and dense in the past, and we changed it to always use a dense encoding 6 years ago: https://issues.apache.org/jira/browse/LUCENE-8033.
It doesn't necessarily mean we cannot merge your change, at least the sparse encoding is used more often than it was in the past. I'm curious if you know why the indexes you're seeing end up having sparse fields?
Very interesting, I didn't know this history of the FieldInfos. I ended up analyzing the FieldInfos after we saw the time and memory usage for the byNumber array when there are many fields. If the decision was already taken to stay on a dense array, then I can remove the primitive map from this PR. But I would like to keep the creation and the sorting of the array at the end of the loop, as I think it is faster. |
? new MapFieldInfoByNumber(infos) | ||
: new ArrayFieldInfoByNumber(infos, maxFieldNumber); | ||
// The iteration of FieldInfo is ordered by ascending field number. | ||
values = Collections.unmodifiableCollection(Arrays.asList(sortedFieldInfos)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nowadays, do List.of(sortedFieldInfos)
rather than this double-layer referring to two other classes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List.of() makes a copy as it considers the input as an "untrusted array". Here we don't copy, just wrap. Actually we could keep just Arrays.asList(sortedFieldInfos) since we own it privately, so we know we don't modify it (only iterator(), which does not support removal for Arrays.asList).
So I did some benchmarking. At the same time I could evaluate the time spent by the current FieldInfos code to build the byNumber and values arrays. When we read a segment, the reader provides the FieldInfo[] input always sorted in ascending order for number. So the code that grows byNumberTemp has to grow multiple times when there a many fields. Same for the hashmap initialized with default capacity. |
// The input FieldInfo[] contains all fields numbered from 0 to infos.length - 1 and they are | ||
// sorted, use it directly. This is an optimization when reading a segment with all fields | ||
// since the FieldInfo[] is sorted. | ||
byNumber = infos; // We could copy the input array, but do we need to? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here it seems to me we can use directly the input array. Do you think we should copy it for safety?
This input array is created by the readers and passed as parameter to the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fairly internal stuff labelled lucene.experimental; I say yes but document it in the javadoc. Obviously look at existing callers to double-check this is fine.
} else { | ||
// The below code is faster than Arrays.stream(byNumber).filter(Objects::nonNull).toList(), | ||
// mainly when the input FieldInfo[] is sorted, when reading a segment. | ||
FieldInfo[] sortedFieldInfos = ArrayUtil.copyOfSubArray(infos, 0, infos.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks a bit curious to be calling a method with "sub array" for what is actually the whole array. Maybe a convenience method copy(infos) should be provided. Could be deferred of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I created another PR to add ArrayUtil.copyOf().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't even need to copy it as we've given ourselves permission to take ownership of the input array; to manipulate it.
} else { | ||
byNumber = new FieldInfo[maxFieldNumber + 1]; | ||
for (FieldInfo fieldInfo : infos) { | ||
FieldInfo previous = byNumber[fieldInfo.number]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is non-obvious. At a glance, we are retrieving the very same fieldInfo, yet supposedly it's the previous. Oh, maybe you mean "existing" fieldInfo as opposed to a fieldInfo ordered prior to this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, existing, to check we have no duplicates. Let's rename it "existing" for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. "previous" wasn't wrong, just ambiguous. I like "existing"; another possible name is "old".
// The input FieldInfo[] contains all fields numbered from 0 to infos.length - 1 and they are | ||
// sorted, use it directly. This is an optimization when reading a segment with all fields | ||
// since the FieldInfo[] is sorted. | ||
byNumber = infos; // We could copy the input array, but do we need to? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fairly internal stuff labelled lucene.experimental; I say yes but document it in the javadoc. Obviously look at existing callers to double-check this is fine.
// The below code is faster than Arrays.stream(byNumber).filter(Objects::nonNull).toList(), | ||
// mainly when the input FieldInfo[] is sorted, when reading a segment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If fieldNumberStrictlyAscending, and we've given ourselves permission to accept the input array, we can merely wrap it in Arrays.asList and we're done.
} else { | ||
// The below code is faster than Arrays.stream(byNumber).filter(Objects::nonNull).toList(), | ||
// mainly when the input FieldInfo[] is sorted, when reading a segment. | ||
FieldInfo[] sortedFieldInfos = ArrayUtil.copyOfSubArray(infos, 0, infos.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't even need to copy it as we've given ourselves permission to take ownership of the input array; to manipulate it.
Description
Two goals:
1- Reduce the memory usage of field maps when there are many fields.
FieldInfos construtor is refactored to build the byNumber array in a more efficient way, avoiding array growing and copies.
Use a primitive IntObjectHashMap to reduce the memory usage compared to an HashMap in Lucene90BlockTreeTermsReader.
2- Add new IntObjectHashMap in the existing small hppc fork. Leverage this PR to show an example use-case. It hopefully can be reused later for other use-cases.