
FSCrawler doesn't resume indexing #1693

Open
ScottCov opened this issue Aug 17, 2023 · 2 comments
Labels
check_for_bug Needs to be reproduced

Comments

@ScottCov

When I stop the Docker container while it is in the middle of indexing a folder and then restart the container, it does not resume indexing the remaining files in that folder, nor any other folders that weren't yet indexed. If I add new folders with files, it will index those, but the only way I can get it to index everything is to restart the container with "--restart", which starts over from the beginning.

Job Settings

---
name: "job_name"
fs:
  url: "/tmp/es"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  pipeline: "fscrawler-copy"
  nodes:
  - url: "https://192.168.1.201:9200"
  username: "elastic"
  password: "Dynaco123$"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: false

Logs

15:44:35,285 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [231.5mb/3.8gb=5.84%], RAM [2.1gb/15.4gb=14.15%], Swap [975.7mb/976.9mb=99.87%].
15:44:35,671 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
15:44:35,672 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
15:44:35,864 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/usr/share/fscrawler/lib/log4j-slf4j-impl-2.20.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
15:44:36,563 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.8.1
15:44:36,572 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
15:44:36,666 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.8.1
15:44:36,713 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [job_name] for [/tmp/es] every [5m]
15:44:36,831 INFO  [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.

{
  "name" : "job_name",
  "lastrun" : "2023-08-17T15:24:57.563347074",
  "indexed" : 6,
  "deleted" : 0
}
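A likely explanation, assuming FSCrawler's documented incremental behavior: the status file above records only a `lastrun` timestamp, and on restart files are selected by comparing their modification time against it. A file whose mtime predates `lastrun` looks "already indexed" even if the previous run was killed before reaching it. A minimal Python sketch of that selection logic (illustrative only, not FSCrawler's actual code):

```python
from datetime import datetime

def files_to_index(files, lastrun):
    """Pick files modified after the last recorded run.

    `files` is a list of (path, mtime) tuples; `lastrun` is the
    timestamp stored in the job status file. Anything older than
    `lastrun` is assumed to be indexed already -- which is exactly
    what goes wrong when the previous run was interrupted partway.
    """
    return [path for path, mtime in files if mtime > lastrun]

lastrun = datetime(2023, 8, 17, 15, 24, 57)
files = [
    ("/tmp/es/a.pdf", datetime(2023, 8, 17, 14, 0)),  # indexed before the stop
    ("/tmp/es/b.pdf", datetime(2023, 8, 17, 14, 5)),  # never indexed, but old mtime: skipped
    ("/tmp/es/c.pdf", datetime(2023, 8, 17, 16, 0)),  # newer than lastrun: picked up
]

print(files_to_index(files, lastrun))  # only /tmp/es/c.pdf survives the filter
```

This matches the symptom reported above: brand-new folders get indexed (new mtimes), while files left over from the interrupted run are silently skipped.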

Expected behavior

I had expected it to resume indexing the directory it had been stopped in the middle of, along with the files in other folders that hadn't been indexed yet. Is that not expected behavior?

Versions:

  • OS: Debian 11
  • Version: 2.10 snapshot (Docker)
@ScottCov ScottCov added the check_for_bug Needs to be reproduced label Aug 17, 2023
@dadoonet
Owner

Is that not expected behavior?

Yes. That's expected, as FSCrawler did not get a chance to run a full indexation of the dir.
Sadly, there are no checkpoints yet.
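For context, a checkpoint here would mean persisting the last successfully indexed path during the run and skipping up to it after a restart, instead of relying only on the `lastrun` timestamp. A rough sketch of the idea (hypothetical helper names, not FSCrawler code):

```python
import json
import os
import tempfile

# Hypothetical checkpoint file; FSCrawler itself does not write this.
CHECKPOINT = os.path.join(tempfile.gettempdir(), "fscrawler_checkpoint.json")

def save_checkpoint(path):
    # Record the last file that was fully indexed.
    with open(CHECKPOINT, "w") as f:
        json.dump({"last_indexed": path}, f)

def load_checkpoint():
    # Return the last indexed path, or None on a fresh start.
    try:
        with open(CHECKPOINT) as f:
            return json.load(f)["last_indexed"]
    except FileNotFoundError:
        return None

def crawl(files):
    # Walk files in a stable order, skipping everything at or before
    # the checkpoint, so an interrupted run can resume mid-directory.
    resume_after = load_checkpoint()
    skipping = resume_after is not None
    indexed = []
    for path in sorted(files):
        if skipping:
            if path == resume_after:
                skipping = False
            continue
        indexed.append(path)   # a real index_document(path) call would go here
        save_checkpoint(path)
    return indexed
```

With this in place, a run killed after `b.pdf` would resume at `c.pdf` rather than waiting for mtimes newer than `lastrun`.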

@ScottCov
Author

For most purposes that's probably fine anyway
