Some static features are in the query-document feature set #4

Open
lgrz opened this issue Sep 26, 2018 · 1 comment

lgrz commented Sep 26, 2018

Move the following features into the static document feature set, since they can all be precomputed and don't depend on the query. Note that url slash count and url length are already there, but they still need to be removed from doc_entry and friends:

    // The number of times the <title> tag appears in the document
    int tag_title_count = 0;
    // The number of times the <heading> tag appears in the document
    int tag_heading_count = 0; // Indri heading field includes tags h1-h4
    // The number of inlinks in the document
    int tag_inlink_count = 0;
    // The number of times the <applet> tag appears in the document
    int tag_applet_count = 0;
    // The number of times the <object> tag appears in the document
    int tag_object_count = 0;
    // The number of times the <embed> tag appears in the document
    int tag_embed_count = 0;

    // Number of slashes in URL
    int url_slash_count = 0;
    // URL length
    size_t url_length = 0;
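As a rough illustration of why these fields are query-independent, here is a minimal sketch that groups them into one struct and fills the two URL features from the URL string alone. The names `static_doc_features` and `set_url_features` are hypothetical and not taken from the project; the slash count here includes the slashes after the scheme, which may differ from how the project counts them.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical container for query-independent document features.
// Field names follow the issue; the struct name is an assumption.
struct static_doc_features {
    int tag_title_count = 0;
    int tag_heading_count = 0;
    int tag_inlink_count = 0;
    int tag_applet_count = 0;
    int tag_object_count = 0;
    int tag_embed_count = 0;
    int url_slash_count = 0;
    std::size_t url_length = 0;
};

// Hypothetical helper: derives the URL features from the URL only,
// so the result can be computed once at indexing time.
inline void set_url_features(static_doc_features &f, const std::string &url) {
    // Counts every '/' in the URL, including those in "http://".
    f.url_slash_count =
        static_cast<int>(std::count(url.begin(), url.end(), '/'));
    f.url_length = url.size();
}
```

Nothing in the helper takes a query argument, which is the property that lets these values live in the static feature set.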
lgrz self-assigned this Sep 26, 2018
lgrz added this to the 0.0.1-beta milestone Sep 26, 2018

lgrz commented May 26, 2019

To add to this: according to the documentation, f_url_slash_count is output via config.ini, but f_url_length was omitted. So anyone using a config copied from the documentation will see this behaviour.

lgrz added a commit that referenced this issue Jul 6, 2019
Addresses part of #4 by removing the duplicate static url features that
already appear in `generate_static_doc_features`.
lgrz added a commit that referenced this issue Jul 6, 2019
Related to #4 where the url features are static features that are
already computed in `generate_static_doc_features`.

This is a breaking change for previously created forward indexes that
include the `UrlStats` information. Currently there is no internal
versioning for indexes that are created. See #23.
lgrz removed this from the 0.1.0-beta milestone May 7, 2020