DBSTREAM: add "time" argument to the learn function #1472

Open
ShkarupaDC opened this issue Dec 9, 2023 · 2 comments

@ShkarupaDC

ShkarupaDC commented Dec 9, 2023

Hi! I want to propose a new feature for DBSTREAM.

DBSTREAM uses a protected internal timer (_time_stamp) to measure the time between learning steps. There are 2 issues with this approach.

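  1. First, when several samples arrive at the same time rather than one by one, the internal timer has to be adjusted manually, like this:
     for x in batch:
         dbstream.learn_one(x)
         dbstream._time_stamp -= 1  # undo the automatic increment so the whole batch shares one time step
     dbstream._time_stamp += 1  # advance the timer once for the entire batch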
  2. Second, it is sometimes necessary to use the actual time at which samples arrive rather than a surrogate step counter. DBSTREAM does not distinguish between two scenarios in which consecutive samples (with no other samples in between) arrive at
  • 100ms and 200ms
  • 100ms and 500ms
    From a business perspective, however, the difference between them can be large.

I propose adding a t (time) argument to the learn_one function. Samples that arrive simultaneously could then be learned with the same t value, and time could be supplied in any unit, provided the fading_factor is adjusted accordingly.
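As a rough illustration of the proposal, here is a minimal sketch of fading driven by an explicit timestamp. It assumes a weight decays as 2^(-fading_factor * elapsed time), which is my reading of the DBSTREAM fading rule; faded_weight and its parameters are hypothetical names used for illustration only and are not part of River's API.

```python
def faded_weight(weight, last_update, t, fading_factor):
    """Decay `weight` over the elapsed time `t - last_update` (any time unit).

    Hypothetical helper: assumes exponential fading of the form
    weight * 2 ** (-fading_factor * elapsed).
    """
    return weight * 2 ** (-fading_factor * (t - last_update))


# Surrogate step counter: consecutive samples are always one step apart,
# so both scenarios from the issue decay by the same amount.
by_steps = faded_weight(1.0, last_update=0, t=1, fading_factor=0.01)

# Explicit timestamps in milliseconds: the two scenarios now differ.
gap_100ms = faded_weight(1.0, last_update=100, t=200, fading_factor=0.01)
gap_400ms = faded_weight(1.0, last_update=100, t=500, fading_factor=0.01)

# Same 100 ms gap expressed in seconds, with fading_factor rescaled by 1000.
gap_100ms_in_s = faded_weight(1.0, last_update=0.1, t=0.2, fading_factor=10)

print(by_steps, gap_100ms, gap_400ms, gap_100ms_in_s)
```

In this sketch the 400 ms gap fades the weight far more than the 100 ms gap, while the step counter treats both alike. Switching the time unit from milliseconds to seconds keeps the decay unchanged only if fading_factor is multiplied by 1000.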

@MaxHalford
Member

I think this is a great idea! @Dennis1989 what do you think?

@hoanganhngo610
Contributor

@ShkarupaDC Hi! Sorry for getting back to you so late.

In the original paper, the authors designed DBSTREAM around the concept of a time step. As I understand it, this means they only consider the order in which data arrives, not the rate at which it arrives. This is why I use time_stamp in my implementation to represent the parameter t from the original paper.

Moreover, in data streams in general, we usually assume that data arrives one sample at a time. When samples arrive simultaneously, we would therefore usually treat them like any other data points, arriving one by one and in order. This might seem unreasonable, but we decided to implement it this way to keep DBSTREAM compatible with River's overall design and aligned with that philosophy.

I hope this answer addresses your concerns.
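To make the difference between the two conventions in this thread concrete, here is a small sketch that uses a toy stand-in rather than River's DBSTREAM. StepCountingClusterer is a hypothetical class that only mimics a step counter incremented once per learn_one call; it performs no clustering.

```python
class StepCountingClusterer:
    """Toy stand-in (hypothetical, not River's DBSTREAM): tracks a step counter only."""

    def __init__(self):
        self._time_stamp = 0
        self.seen = []  # (time step, sample) pairs, recorded for inspection

    def learn_one(self, x):
        self.seen.append((self._time_stamp, x))
        self._time_stamp += 1  # one step per sample, regardless of arrival time


batch = [{"x": 1.0}, {"x": 1.1}, {"x": 0.9}]

# Convention described in the reply: simultaneous samples are fed one by one,
# so each of them occupies its own time step.
sequential = StepCountingClusterer()
for x in batch:
    sequential.learn_one(x)

# Workaround from the issue: the whole batch shares a single time step.
simultaneous = StepCountingClusterer()
for x in batch:
    simultaneous.learn_one(x)
    simultaneous._time_stamp -= 1
simultaneous._time_stamp += 1

print(sequential._time_stamp, simultaneous._time_stamp)
```

With the one-by-one convention the counter advances by the batch size, whereas the workaround advances it by a single step for the whole batch.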
