[Bug] Read timeout when trying to transfer index #7

Open

alxgrk opened this issue Sep 28, 2023 · 4 comments
Labels
bug Something isn't working

Comments


alxgrk commented Sep 28, 2023

Describe the bug
I'm trying to transfer a single collection to get started with the tool. The dump & restore of the data worked; however, when it comes to replicating the indexes, the following error occurs:

Sep 28 14:27:20 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:20.888 [performer-1] INFO  p.a.t.m.c.t.MongoToolsTransfer - Finished transfer of collection [foo.bar from [mongodb://foo1:27017,foo2:27017,foo3:27017/foo?ssl=true&replicaSet=atlas-shard-0] to [mongodb://172.17.4.100,172.17.5.100,172.17.6.100/foo?tls=false&ssl=false&retryWrites=true&w=majority]
Sep 28 14:27:20 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:20.889 [performer-1] INFO  p.a.t.m.core.index.IndexClonerImpl - Cloning all indexes for collection: [bar]
Sep 28 14:27:20 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:20.890 [performer-1] INFO  p.a.t.m.i.h.LoggingStateEventHandler - IndexRebuildStartEvent(sourceToDestination=source: foo.bar, destination: foo.bar)
Sep 28 14:27:20 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:20.892 [performer-1] INFO  p.a.t.m.c.s.LocalToDestinationSynchronizer - Starting LocalToDestination synchronization for collection: [source: foo.bar, destination: foo.bar]
Sep 28 14:27:20 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:20.893 [performer-1] INFO  p.a.t.m.i.h.LoggingStateEventHandler - LocalToDestinationStartEvent(sourceToDestination=source: foo.bar, destination: foo.bar)
Sep 28 14:27:20 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:20.897 [performer-1] INFO  p.a.t.m.i.d.SynchronizationDetectorFactory - Starting detecting synchronization...
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:30.972 [synchronizationDetector-4] WARN  p.a.t.m.c.d.HashSynchronizationDetector - Unable to perform detection with hash. Cause: [com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message]
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: Exception in thread "index-cloner-foo.bar" java.util.concurrent.CompletionException: com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1770)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at java.base/java.lang.Thread.run(Thread.java:833)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: Caused by: com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.InternalStreamConnection.translateReadException(InternalStreamConnection.java:699)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.InternalStreamConnection.receiveMessageWithAdditionalTimeout(InternalStreamConnection.java:577)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:413)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:340)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:116)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:643)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:71)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:206)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:119)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:85)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:75)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.DefaultServer$OperationCountTrackingConnection.command(DefaultServer.java:293)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.operation.CommandOperationHelper.createReadCommandAndExecute(CommandOperationHelper.java:233)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.operation.CommandOperationHelper.lambda$executeRetryableRead$4(CommandOperationHelper.java:215)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.operation.OperationHelper.lambda$withSourceAndConnection$0(OperationHelper.java:356)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.operation.OperationHelper.withSuppliedResource(OperationHelper.java:381)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.operation.OperationHelper.lambda$withSourceAndConnection$1(OperationHelper.java:355)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.operation.OperationHelper.withSuppliedResource(OperationHelper.java:381)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.operation.OperationHelper.withSourceAndConnection(OperationHelper.java:354)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.operation.CommandOperationHelper.lambda$executeRetryableRead$5(CommandOperationHelper.java:213)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.async.function.RetryingSyncSupplier.get(RetryingSyncSupplier.java:67)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.operation.CommandOperationHelper.executeRetryableRead(CommandOperationHelper.java:218)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.operation.CommandOperationHelper.executeRetryableRead(CommandOperationHelper.java:199)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.operation.CommandReadOperation.execute(CommandReadOperation.java:50)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:191)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.client.internal.MongoDatabaseImpl.executeCommand(MongoDatabaseImpl.java:202)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:171)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:166)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:156)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at pl.allegro.tech.mongomigrationstream.core.index.IndexClonerImpl.createIndexOnDestinationCollection(IndexClonerImpl.kt:47)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at pl.allegro.tech.mongomigrationstream.core.index.IndexClonerImpl.cloneIndexes$lambda$3$lambda$1$lambda$0(IndexClonerImpl.kt:32)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         ... 3 more
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: Caused by: java.net.SocketTimeoutException: Read timed out
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at java.base/sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:288)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:314)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.SocketStream.read(SocketStream.java:113)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.SocketStream.read(SocketStream.java:138)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:716)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         at com.mongodb.internal.connection.InternalStreamConnection.receiveMessageWithAdditionalTimeout(InternalStreamConnection.java:574)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]:         ... 33 more
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:30.976 [synchronizationDetector-1] INFO  p.a.t.m.i.h.LoggingDetectionResultHandler - Queue size detector - collection: [source: foo.bar, destination: foo.bar] queue size: [0]

To Reproduce
This is the config:

perform.transfer=true
perform.synchronization=true
perform.synchronization.handlers=LoggingSynchronizationHandler
perform.synchronization.detectors=QueueSize,CollectionCount,DbHash
perform.synchronization.validators=DbAvailability,DestinationCollectionMissing,SourceCollectionAvailable

custom.rootPath=/home/ec2-user/mongomigrationstream/
custom.queue.factory=BigQueueFactory
custom.dumpReadPreference=primary
custom.batchSize=1000

collections.source=bar
collections.destination=bar

source.db.uri=mongodb://foo1:27017,foo2:27017,foo3:27017/foo?ssl=true&replicaSet=atlas-shard-0
source.db.name=foo
source.db.authentication.enabled=true
source.db.authentication.username=user
source.db.authentication.password=pwd
source.db.authentication.authDbName=admin

destination.db.uri=mongodb://172.17.4.100,172.17.5.100,172.17.6.100/foo?tls=false&ssl=false&retryWrites=true&w=majority
destination.db.name=foo
destination.db.authentication.enabled=true
destination.db.authentication.username=user
destination.db.authentication.password=pwd
destination.db.authentication.authDbName=admin

Expected behavior
Initial transfer should work.

Environment (please complete the following information):

  • OS: Amazon Linux 2 as a migration server
  • Version: mongo-migration-stream-0.7.0

Additional context
If I can get it to work, this tool will be really, really useful to me! Thanks for open-sourcing it!

alxgrk added the bug (Something isn't working) label on Sep 28, 2023

alxgrk commented Sep 29, 2023

Hi again,
it looks like the actual problem is the dbHash command issued by HashSynchronizationDetector. For the source database, which in my case contains ~30 million documents, this is unfortunately a showstopper, as the command seems to drag the database down completely.
I found that this detector can be disabled via config, so I will try running without it. What are the implications of that detector being missing?
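
For context, the dbHash check that the detector performs boils down to something like the following minimal sketch, using the sync MongoDB Java driver from Kotlin (the URI and names are illustrative, not the tool's actual code):

import com.mongodb.client.MongoClients
import org.bson.Document

fun main() {
    // Hypothetical source URI - replace with a real connection string.
    MongoClients.create("mongodb://localhost:27017").use { client ->
        // dbHash computes the hashes server-side by scanning the listed
        // collections, which is why it can be slow (and heavy on the node)
        // for large collections.
        val result: Document = client.getDatabase("foo").runCommand(
            Document("dbHash", 1).append("collections", listOf("bar"))
        )
        println(result.get("collections")) // per-collection hashes
    }
}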

EDIT: without HashSynchronizationDetector, the source DB doesn't complain anymore, but the index migration still fails. Any idea what might be going wrong?

alxgrk added a commit to alxgrk/mongo-migration-stream that referenced this issue Oct 2, 2023

alxgrk commented Oct 2, 2023

@pitagoras3 I added a PR to make timeouts and read preference configurable. In my case, increasing the timeouts and reading from the secondary has made the migration work smoothly so far.
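
For anyone hitting the same timeout, both knobs also map onto standard MongoDB connection-string options (assuming the tool passes the URI through to the driver unchanged; the values below are illustrative):

source.db.uri=mongodb://foo1:27017,foo2:27017,foo3:27017/foo?ssl=true&replicaSet=atlas-shard-0&socketTimeoutMS=120000&readPreference=secondaryPreferred

socketTimeoutMS raises the read timeout behind the MongoSocketReadTimeoutException above, and readPreference=secondaryPreferred moves the heavy reads off the primary.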

One thing still left to discuss is the HashSynchronizationDetector - what is the intention behind it? Did it work for your migrations? I read in the blog post that you had large datasets as well...

pitagoras3 (Collaborator) commented

Hi @alxgrk,

Firstly, thanks for trying out mongo-migration-stream :) I will answer your questions one by one:

it looks like the actual problem is the dbHash command issued by HashSynchronizationDetector. For the source database, which in my case contains ~30 million documents, this is unfortunately a showstopper, as the command seems to drag the database down completely.
I found that this detector can be disabled via config, so I will try running without it. What are the implications of that detector being missing?

The implication is that the migrator won't calculate hashes for the source and destination collections, and won't compare those hashes (an extra check that the migration went well). For what it's worth, in production at Allegro we have also disabled HashSynchronizationDetector, and we rely on the QueueSize and CollectionCount detectors.
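
Concretely, that production setup corresponds to dropping DbHash from the detector list in the properties file:

perform.synchronization.detectors=QueueSize,CollectionCount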

@pitagoras3 I added a PR to make timeouts and read preference configurable. In my case, increasing the timeouts and reading from the secondary has made the migration work smoothly so far.

Thank you! I will take a look into it.

One thing still left to discuss is the HashSynchronizationDetector - what is the intention behind it? Did it work for your migrations? I read in the blog post that you had large datasets as well...

So, the intention is to answer the question "Are all documents migrated?". To answer it, I've implemented three SynchronizationDetectors, which are evaluated every 10 seconds once all collections reach the synchronization state.

  1. CollectionCountSynchronizationDetector - it fetches estimatedDocumentCount for the source and destination collections at (roughly) the same time and logs both values and their difference (see the sketch after this list). The ideal situation is a difference of 0, meaning both collections have the same size.
  2. QueueSizeSynchronizationDetector is simpler - it checks whether there are any events left on the migrator queue to send to the destination collection. The ideal situation is a queue size of 0, meaning there are no events to send.
  3. HashSynchronizationDetector - it runs the dbHash command on both the source and destination collections and verifies that the hashes are equal. The ideal situation is equal hashes, meaning the documents in both collections are exactly the same.
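
A minimal sketch of the count comparison from point 1, again using the sync MongoDB Java driver from Kotlin (hosts and names are illustrative; the real detector is internal to mongo-migration-stream):

import com.mongodb.client.MongoClients

fun main() {
    MongoClients.create("mongodb://source-host:27017").use { source ->
        MongoClients.create("mongodb://destination-host:27017").use { destination ->
            // estimatedDocumentCount reads collection metadata, so it stays
            // cheap even on huge collections, unlike dbHash, which scans documents.
            val sourceCount = source.getDatabase("foo").getCollection("bar").estimatedDocumentCount()
            val destinationCount = destination.getDatabase("foo").getCollection("bar").estimatedDocumentCount()
            println("source=$sourceCount destination=$destinationCount diff=${sourceCount - destinationCount}")
        }
    }
}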

As you discovered, the dbHash command can take "some" time to evaluate (depending on collection size). I will definitely add this info to the documentation :)

Have I answered all your questions, or can I clarify something more?


alxgrk commented Oct 2, 2023

Your answers are perfectly fine, thanks @pitagoras3.
