All Train Test Split Changes #1381

Open

aolfat wants to merge 21 commits into without-ts-split from All-ts-split-changes

Contributor

aolfat commented Mar 13, 2024

Description

All Training Set Split changes thus far

Type of change

Does this correspond to an open issue?

Select type(s) of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update

Checklist:

I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have fixed any merge conflicts

aolfat and others added 18 commits

March 13, 2024 12:05


move client to a separate file and refactor

8fd9387


some more clean up

2442ef5


small one

8f5b72c


even more clean up

c129406


remove unused function

45e2181


small comments

892e533


address ahmad's cpomment

a0f001b


ts split stuff

309ea49


Added some optimizations and comments

b2a20ed


some more changes

157b879


get rid of logs

22677ed


rename protos

decac62


fix tests

07371f4


rename some things

2ff6e1b


cleanup some

fff93f3


missed name change

64a6e8a


Delete grpc_debug.log

cd4255b


Merge branch 'feature/training_test_split_refactor' into All-ts-split-changes

ddc7b92

aolfat had a problem deploying to Integration testing with GitHub Actions (March 13, 2024 20:29): Failure

aolfat added 2 commits

March 13, 2024 13:33


one more name change

a452d29


Merge branch 'feature/training_test_split_refactor' into All-ts-split-changes

5b3e7f0

sdreyer assigned aolfat

aolfat changed the title ~~All ts split changes~~ All Train Test Split Changes

simba-git reviewed

provider/clickhouse.go Outdated

+ colTypes, err := store.getValueColumnTypes(trainingTestSplitName)
+ fmt.Printf("these are the column types: %v\n", colTypes)
+ if err != nil {
+ return nil, nil, nil, fmt.Errorf("could not get column types: %v", err)

Contributor

simba-git Mar 18, 2024

If this happened, dropFunc would never be called right?

Maybe force the caller to call create and defer close then get the iterators. Would also make the function signature a little cleaner.

provider/clickhouse.go Outdated

+ // return callback to drop view
+ dropFunc := func() error {
+ // two queries to drop split and row number table

Contributor

simba-git Mar 18, 2024

Out of date comment?

provider/clickhouse.go Outdated

+ testSize float32,
+ shuffle bool,
+ randomState int,
+) (TrainingSetIterator, TrainingSetIterator, func() error, error) {

Contributor

simba-git Mar 18, 2024

Don't love having this many things returned. I have advice in a comment below though.

anthonylasso reviewed

api/main.go Outdated

+ for {
+ req, err := stream.Recv()
+ if err == io.EOF {
+ // Client has closed the stream, close the downstream stream

Contributor

anthonylasso Mar 20, 2024

nit: don't need the comment. self explanatory via code + logger message.

api/main.go Outdated

+ return err
+ }
+ // Forward the request to the downstream service

Contributor

anthonylasso Mar 20, 2024

nit: same as above.

api/main.go Outdated

+ resp, err := clientStream.Recv()
+ if err == io.EOF {
+ // End of stream from downstream service

Contributor

anthonylasso Mar 20, 2024

nit: same

proto/serving.proto

@@ -10,6 +10,7 @@ package featureform.serving.proto;
 service Feature {
  rpc TrainingData(TrainingDataRequest) returns (stream TrainingDataRow) {}
+ rpc TrainTestSplit(stream TrainTestSplitRequest) returns (stream BatchTrainTestSplitResponse) {}
  rpc TrainingDataColumns(TrainingDataColumnsRequest) returns (TrainingColumns) {}

Contributor

anthonylasso Mar 20, 2024

nit: try to keep formatting changes to separate PRs otherwise it forces the reader to diff formatting.

Contributor Author

aolfat Mar 20, 2024

hit the gear icon in github -> "hide whitespace"

proto/serving.proto Outdated

+ float train_size = 4;
+ bool shuffle = 5;
+ int32 random_state = 6; // Seed for shuffling, if shuffle is true
+ RequestType request_type = 7; // Specify the type of data being requested

Contributor

anthonylasso Mar 20, 2024

Why not just name it RequestDataType request_data_type to avoid needing the clarifying comment.

Contributor Author

aolfat Mar 20, 2024

i can remove these comments, they were generated

client/src/featureform/serving.py Outdated

+ (This functionality is currently only available for Clickhouse).
+ Splits an existing training set into training and testing iterators. The split is processed on the underlying
+ provider and calculated and serving time.

Contributor

anthonylasso Mar 20, 2024

replace "calculated and serving time." with "calculated at serving time."

Contributor Author

aolfat Mar 20, 2024

good catch

client/src/featureform/serving.py

+ train (Iterator): An iterator for training values.
+ test (Iterator): An iterator for testing values.
+ """
+ if batch_size < 1:

Contributor

anthonylasso Mar 20, 2024

will the min batch size be configurable at some point? if so I'd replace this magic number with a const + interpolate the value error msg with the const.

if not ignore.

Contributor Author

aolfat Mar 20, 2024

min batch size has to be 1 right? effectively meaning no batch
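If the minimum ever does become configurable, the constant-plus-interpolation suggestion could look roughly like the sketch below. `MIN_BATCH_SIZE` and `validate_batch_size` are hypothetical names chosen for illustration; they are not identifiers from this PR.

```python
# Hypothetical constant; the PR currently hard-codes the value 1.
MIN_BATCH_SIZE = 1


def validate_batch_size(batch_size: int) -> int:
    """Return batch_size unchanged, or raise if it is below the minimum."""
    if batch_size < MIN_BATCH_SIZE:
        # The message is built from the constant, so it cannot drift
        # out of sync with the check above.
        raise ValueError(
            f"batch_size must be at least {MIN_BATCH_SIZE}, got {batch_size}"
        )
    return batch_size
```

With the minimum fixed at 1, as noted in the reply above, this behaves the same as the existing `batch_size < 1` check.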

client/src/featureform/serving.py Outdated

+ variant = self._stream.version
+ stub = self._stream._stub
+ model = self._stream.model if hasattr(self._stream, "model") else None
+ if random_state is None:

Contributor

anthonylasso Mar 20, 2024

shouldn't this random_state conditional be a part of the first if check on line 533?

also it sets the random_state to zero, but that's an error state value in the first conditional

Contributor Author

aolfat Mar 20, 2024

also really good catch

client/src/featureform/serving.py Outdated

+ @staticmethod
+ def validate_test_size(test_size, train_size):
+ if test_size > 1 or test_size < 0:

Contributor

anthonylasso Mar 20, 2024

nit: anything we can do to clean this function up? it's a rough read.

for example we could reorder the first IF to be

if test_size < 0 or 1 < test_size to adhere to natural left-right reading pattern (unless you're optimizing for likeliest condition?)
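One possible shape for a cleaner version, as a sketch only. Just the first line of `validate_test_size` is visible in this thread, so the rules below are assumptions: 0 means "not set", the missing size is derived from the other, and the two must sum to 1 when both are given. The name `validate_split_sizes` is hypothetical.

```python
import math
from typing import Tuple


def validate_split_sizes(test_size: float, train_size: float) -> Tuple[float, float]:
    """Validate the two fractions and return them as (test_size, train_size)."""
    # A chained comparison reads left to right and covers both bounds at once.
    for name, value in (("test_size", test_size), ("train_size", train_size)):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")

    # Assumption: 0 means the caller did not set that size.
    if test_size == 0 and train_size == 0:
        raise ValueError("at least one of test_size or train_size must be set")
    if test_size == 0:
        return 1 - train_size, train_size
    if train_size == 0:
        return test_size, 1 - test_size

    # Both were given: compare with a tolerance, since e.g. 0.1 + 0.9
    # is not guaranteed to be exactly 1.0 in floating point.
    if not math.isclose(test_size + train_size, 1.0):
        raise ValueError("test_size and train_size must sum to 1")
    return test_size, train_size
```

Each rule gets its own early return, so the function can be read top to bottom without tracking nested branches.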

client/src/featureform/serving.py

		return type_mapping[value.WhichOneof("value")]


		def get_numpy_array_type(types):

Contributor

anthonylasso Mar 20, 2024

i'd throw a parameterized unit test at this function. reads like it could easily be a source of a hidden logic bug.
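A table-driven test along these lines is one way to do that with only the standard library. The body of `get_numpy_array_type` is not shown in this thread, so `get_array_type` below is a hypothetical stand-in with made-up behaviour (the shared type name, or "object" when the types are mixed or empty). Only the test structure is the point.

```python
import unittest


def get_array_type(types):
    """Hypothetical stand-in: the shared type name, or "object" if mixed or empty."""
    unique = set(types)
    return unique.pop() if len(unique) == 1 else "object"


class GetArrayTypeTest(unittest.TestCase):
    # Each case: (description, input types, expected result).
    CASES = [
        ("empty input", [], "object"),
        ("single type", ["float32"], "float32"),
        ("uniform types", ["int64", "int64"], "int64"),
        ("mixed types", ["int64", "str"], "object"),
    ]

    def test_cases(self):
        for description, types, expected in self.CASES:
            # subTest reports every failing case instead of stopping at the first.
            with self.subTest(description):
                self.assertEqual(get_array_type(types), expected)
```

Saved to a file, this can be run with `python -m unittest <file>`. Covering a new edge case is then a one-line addition to `CASES`.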

anthonylasso reviewed

client/src/featureform/train_test_split.py

+ if self._np_type_parser is None and data.rows:
+ self._np_type_parser = _NpProtoTypeParser.init_types(data.rows[0])
+ for i, row in enumerate(data.rows):

Contributor

anthonylasso Mar 20, 2024

not using the index var "i". can just drop it like:

for row in data.rows

client/src/featureform/train_test_split.py (conversation resolved)

provider/clickhouse.go Outdated

+ randomState int,
+) (string, error) {
+ // Generate unique suffix for the view names
+ tableNameSuffix := fmt.Sprintf("%s_%d_%t_%d", trainingSetTable, int(testSize*100), shuffle, randomState)

Contributor

anthonylasso Mar 20, 2024

int(testSize*100) why 100, is it not better to use rand int here?

serving/serving.go

+ req, err := stream.Recv()
+ if err != nil {
+ return err

Contributor

anthonylasso Mar 20, 2024

nit: log that the error happened at recv()

serving/serving.go Outdated

  }
  serv.Logger.Debugw("Get Training Set From Store", "name", name, "variant", variant)
  return store.GetTrainingSet(provider.ResourceID{Name: name, Variant: variant})
 }
+func (serv *FeatureServer) getTrainingSetTestSplitIterator(name, variant string, testSize float32, shuffle bool, randomState int) (provider.TrainingSetIterator, provider.TrainingSetIterator, func() error, error) {

Contributor

anthonylasso Mar 20, 2024

nit: too many returns here. maybe use a wrapper SplitResp struct or similar.


All the changes

9b1f0a7

move client to a separate file and refactor

some more clean up

small one

even more clean up

remove unused function

small comments

address ahmad's cpomment

ts split stuff

Added some optimizations and comments

some more changes

get rid of logs

rename protos

fix tests

rename some things

cleanup some

missed name change

Delete grpc_debug.log

one more name change

small fixes

stupid

alllll the changes

i hate changing interfaces

one more

some more more misses

whoops

aolfat temporarily deployed to Integration testing with GitHub Actions (March 20, 2024 16:03): Inactive

aolfat commented

provider/bigquery.go

@@ -214,6 +214,15 @@ func (it *bqGenericTableIterator) Values() GenericRecord {
 func (it *bqGenericTableIterator) Columns() []string {
  var columns []string
+ // As the documentation for bigquery.Schema notes:

Contributor Author

aolfat Mar 20, 2024

ignore this

codecov bot commented Mar 20, 2024

Codecov Report

Attention: Patch coverage is 46.52174% with 246 lines in your changes are missing coverage. Please review.

❗ No coverage uploaded for pull request base (without-ts-split@8bb4f5e).

Files	Patch %	Lines
serving/serving.go	2.74%	177 Missing ⚠️
provider/clickhouse.go	79.22%	8 Missing and 8 partials ⚠️
client/src/featureform/serving.py	78.84%	9 Missing and 2 partials ⚠️
client/src/featureform/train_test_split.py	91.22%	7 Missing and 3 partials ⚠️
provider/bigquery.go	0.00%	5 Missing and 1 partial ⚠️
provider/sql.go	0.00%	6 Missing ⚠️
client/src/featureform/client.py	42.85%	4 Missing ⚠️
provider/k8s.go	0.00%	4 Missing ⚠️
provider/offline.go	0.00%	4 Missing ⚠️
provider/spark.go	0.00%	4 Missing ⚠️
... and 1 more

Additional details and impacted files

@@                 Coverage Diff                 @@
##             without-ts-split    #1381   +/-   ##
===================================================
  Coverage                    ?   55.71%           
===================================================
  Files                       ?      190           
  Lines                       ?    24780           
  Branches                    ?      836           
===================================================
  Hits                        ?    13806           
  Misses                      ?     9441           
  Partials                    ?     1533

anthonylasso reviewed

serving/serving.go

  return nil
 }
+type splitContext struct {

Contributor

anthonylasso Mar 20, 2024

nit: not a fan of the struct name since context is a strong convention in go. maybe SplitMetadata?

Contributor Author

aolfat Mar 20, 2024

I mostly agree and thought about it a bit too but with it being an internal struct and literally being the context for all the functions I thought it would do.

serving/serving.go

+ for {
+ if isTrainFinished && isTestFinished {
+ // If both iterators are finished, we can close the stream

Contributor

anthonylasso Mar 20, 2024

delete the comment imo. the code is self-documenting.

serving/serving.go

+ featureObserver := serv.Metrics.BeginObservingTrainingServe(name, variant)
+ defer featureObserver.Finish()
+ ctx := splitContext{

Contributor

anthonylasso Mar 20, 2024

nit: ctx clashes with a strong go convention.

Contributor Author

aolfat Mar 20, 2024

yeahhh i thought about that, I'm a big advocate of just typing things out so I'll do that

serving/serving.go

+ }
+ }
+ response := &pb.BatchTrainTestSplitResponse{

Contributor

anthonylasso Mar 20, 2024

nit: isn't this logically a request? I feel like the naming is a bit off. like you send requests, and receive responses but currently the proto looks like this:

Send(*BatchTrainTestSplitResponse) error
Recv() (*TrainTestSplitRequest, error)

but resolve if I'm missing some context.

Contributor Author

aolfat Mar 20, 2024

yeahhh it's kindaaaa weird

so the flow is this ->

we make a trainsplitreq -> api/main receives it -> main sends it to serving.go -> serving receives the req -> serving creates the response and SENDS it -> main recvs the resp -> main sends the resp back to the client

serving/serving.go

+ *ctx.isTrainFinished = true
+ }
+ if *ctx.isTestFinished && *ctx.isTrainFinished {

Contributor

anthonylasso Mar 20, 2024

question: can a request be just a testType or a trainType, or is it always both types running together?

the reason I ask is because it looks like we need both conditions to close out, but lines 230 and 232 set them independently of each other, so one could still be false?

Ignore if I'm misreading.

Contributor Author

aolfat Mar 20, 2024

good q.

a request is either a testtype OR a traintype, never both. if one of them finishes, we return an iterator-finished response so that the iteration is stopped but the stream isn't closed
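The behaviour described above can be modelled in a few lines. This is a toy model under stated assumptions, not the serving.go implementation: `SplitStream`, `RequestType`, `can_close` and the dict-shaped responses are all hypothetical stand-ins for the gRPC stream and proto messages.

```python
from enum import Enum
from typing import Any, Dict, Iterable


class RequestType(Enum):
    TRAIN = "train"
    TEST = "test"


# Sentinel so that a legitimate row value of None is not mistaken for exhaustion.
_EXHAUSTED = object()


class SplitStream:
    """Toy model of one stream serving two independent iterators."""

    def __init__(self, train_rows: Iterable[Any], test_rows: Iterable[Any]) -> None:
        self._iterators = {
            RequestType.TRAIN: iter(train_rows),
            RequestType.TEST: iter(test_rows),
        }
        self._finished = {RequestType.TRAIN: False, RequestType.TEST: False}

    @property
    def can_close(self) -> bool:
        # The stream may close only once BOTH iterators have finished.
        return all(self._finished.values())

    def handle(self, request_type: RequestType) -> Dict[str, Any]:
        """Serve one request, which targets exactly one of the two iterators."""
        row = next(self._iterators[request_type], _EXHAUSTED)
        if row is _EXHAUSTED:
            # Tell the client this iterator is done, but keep the stream open
            # so the other iterator can still be consumed.
            self._finished[request_type] = True
            return {"iterator_done": True}
        return {"iterator_done": False, "row": row}
```

In this model, draining the test iterator first yields an `iterator_done` response while `can_close` stays false. It only becomes true after the train iterator is drained too, which matches the two-flag check asked about above.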

serving/serving.go

+ }
+ store, err := p.AsOfflineStore()
+ if err != nil {
+ // This means that the provider of the training set isn't an offline store.

Contributor

anthonylasso Mar 20, 2024

looks like an important distinction. tbh I'd take the comment and write it to the logger.