Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce memory usage of field maps in FieldInfos and BlockTree TermsReader. #13327

Merged
merged 9 commits into from May 13, 2024

Conversation

bruno-roustant
Copy link
Contributor

@bruno-roustant bruno-roustant commented Apr 29, 2024

Description

Two goals:

1- Reduce the memory usage of field maps when there are many fields.
FieldInfos construtor is refactored to build the byNumber array in a more efficient way, avoiding array growing and copies.
Use a primitive IntObjectHashMap to reduce the memory usage compared to an HashMap in Lucene90BlockTreeTermsReader.

2- Add new IntObjectHashMap in the existing small hppc fork. Leverage this PR to show an example use-case. It hopefully can be reused later for other use-cases.

@@ -113,7 +114,8 @@ public final class Lucene90BlockTreeTermsReader extends FieldsProducer {
// produce DocsEnum on demand
final PostingsReaderBase postingsReader;

private final Map<String, FieldReader> fieldMap;
private final FieldInfos fieldInfos;
private final IntObjectHashMap<FieldReader> fieldMap;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR proposes to leverage the existing field-name -> FieldInfo map in FieldInfos to not repeat the ref to the field name strings here. Instead use the field number (specific to the FieldInfos) as key, so that we can use a compact primitive map.

Then below, in terms(String fieldName), we can use FieldInfos.fieldInfo(String fieldName) as a first mapping to the field number, and then use this compact map to get the Terms.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this is what Lucene90PointsReader does today as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. It could benefit from the IntObjectHashMap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there are many usages of Map<Integer, Object>. I could open some PRs when memory (and perf) matters after IntObjectHashMap is in.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI FieldInfos used to switch between sparse and dense in the past, and we changed it to always use a dense encoding 6 years ago: https://issues.apache.org/jira/browse/LUCENE-8033.

It doesn't necessarily mean we cannot merge your change, at least the sparse encoding is used more often than it was in the past. I'm curious if you know why the indexes you're seeing end up having sparse fields?

@bruno-roustant
Copy link
Contributor Author

bruno-roustant commented Apr 30, 2024

Very interesting, I didn't know this history of the FieldInfos.

I ended up analyzing the FieldInfos after we saw the time and memory usage for the byNumber array when there are many fields.
I think the time comes from the ArrayUtil.grow(byNumberTemp, info.number + 1), when the array has to grow and copy multiple times. Also, the byName map grows and is rehashed.
For the memory part, it was more a guess since we saw large FieldInfo[] that could be explained if the max info.number was high. But I have no metrics to share unfortunately.

If the decision was already taken to stay on a dense array, then I can remove the primitive map from this PR. But I would like to keep the creation and the sorting of the array at the end of the loop, as I think it is faster.

? new MapFieldInfoByNumber(infos)
: new ArrayFieldInfoByNumber(infos, maxFieldNumber);
// The iteration of FieldInfo is ordered by ascending field number.
values = Collections.unmodifiableCollection(Arrays.asList(sortedFieldInfos));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nowadays, do List.of(sortedFieldInfos) rather than this double-layer referring to two other classes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List.of() makes a copy as it considers the input as an "untrusted array". Here we don't copy, just wrap. Actually we could keep just Arrays.asList(sortedFieldInfos) since we own it privately, so we know we don't modify it (only iterator(), which does not support removal for Arrays.asList).

@bruno-roustant
Copy link
Contributor Author

So I did some benchmarking.
As I could not reproduce the sparse byNumber mapping, I prefer to remove the code for the primitive map in FieldInfos.

At the same time I could evaluate the time spent by the current FieldInfos code to build the byNumber and values arrays. When we read a segment, the reader provides the FieldInfo[] input always sorted in ascending order for number. So the code that grows byNumberTemp has to grow multiple times when there a many fields. Same for the hashmap initialized with default capacity.
This latest commit changes the way byNumber is built to only allocate one array, or even proposes to use directly the input FieldInfo[] if it is full. This constructor becomes 40% faster than before, even more when the GC is busy, as it allocates less arrays.

// The input FieldInfo[] contains all fields numbered from 0 to infos.length - 1 and they are
// sorted, use it directly. This is an optimization when reading a segment with all fields
// since the FieldInfo[] is sorted.
byNumber = infos; // We could copy the input array, but do we need to?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it seems to me we can use directly the input array. Do you think we should copy it for safety?
This input array is created by the readers and passed as parameter to the constructor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fairly internal stuff labelled lucene.experimental; I say yes but document it in the javadoc. Obviously look at existing callers to double-check this is fine.

} else {
// The below code is faster than Arrays.stream(byNumber).filter(Objects::nonNull).toList(),
// mainly when the input FieldInfo[] is sorted, when reading a segment.
FieldInfo[] sortedFieldInfos = ArrayUtil.copyOfSubArray(infos, 0, infos.length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks a bit curious to be calling a method with "sub array" for what is actually the whole array. Maybe a convenience method copy(infos) should be provided. Could be deferred of course.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I created another PR to add ArrayUtil.copyOf().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't even need to copy it as we've given ourselves permission to take ownership of the input array; to manipulate it.

} else {
byNumber = new FieldInfo[maxFieldNumber + 1];
for (FieldInfo fieldInfo : infos) {
FieldInfo previous = byNumber[fieldInfo.number];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is non-obvious. At a glance, we are retrieving the very same fieldInfo, yet supposedly it's the previous. Oh, maybe you mean "existing" fieldInfo as opposed to a fieldInfo ordered prior to this one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, existing, to check we have no duplicates. Let's rename it "existing" for clarity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. "previous" wasn't wrong, just ambiguous. I like "existing"; another possible name is "old".

// The input FieldInfo[] contains all fields numbered from 0 to infos.length - 1 and they are
// sorted, use it directly. This is an optimization when reading a segment with all fields
// since the FieldInfo[] is sorted.
byNumber = infos; // We could copy the input array, but do we need to?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fairly internal stuff labelled lucene.experimental; I say yes but document it in the javadoc. Obviously look at existing callers to double-check this is fine.

Comment on lines 179 to 180
// The below code is faster than Arrays.stream(byNumber).filter(Objects::nonNull).toList(),
// mainly when the input FieldInfo[] is sorted, when reading a segment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If fieldNumberStrictlyAscending, and we've given ourselves permission to accept the input array, we can merely wrap it in Arrays.asList and we're done.

} else {
// The below code is faster than Arrays.stream(byNumber).filter(Objects::nonNull).toList(),
// mainly when the input FieldInfo[] is sorted, when reading a segment.
FieldInfo[] sortedFieldInfos = ArrayUtil.copyOfSubArray(infos, 0, infos.length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't even need to copy it as we've given ourselves permission to take ownership of the input array; to manipulate it.

@bruno-roustant bruno-roustant merged commit 8c738ba into apache:main May 13, 2024
3 checks passed
@bruno-roustant bruno-roustant deleted the fieldmap branch May 13, 2024 14:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants