Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All Train Test Split Changes #1381

Open
wants to merge 21 commits into
base: without-ts-split
Choose a base branch
from
Open

Conversation

aolfat
Copy link
Contributor

@aolfat aolfat commented Mar 13, 2024

Description

All Training Set Split changes thus far

Type of change

Does this correspond to an open issue?

Select type(s) of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Checklist:

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have fixed any merge conflicts

@aolfat aolfat changed the title All ts split changes All Train Test Split Changes Mar 18, 2024
colTypes, err := store.getValueColumnTypes(trainingTestSplitName)
fmt.Printf("these are the column types: %v\n", colTypes)
if err != nil {
return nil, nil, nil, fmt.Errorf("could not get column types: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this happened, dropFunc would never be called right?

Maybe force the caller to call create and defer close then get the iterators. Would also make the function signature a little cleaner.


// return callback to drop view
dropFunc := func() error {
// two queries to drop split and row number table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of date comment?

testSize float32,
shuffle bool,
randomState int,
) (TrainingSetIterator, TrainingSetIterator, func() error, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dont love having this many things returned, I have advice in a comment below though

api/main.go Outdated
for {
req, err := stream.Recv()
if err == io.EOF {
// Client has closed the stream, close the downstream stream
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: don't need the comment. self explanatory via code + logger message.

api/main.go Outdated
return err
}

// Forward the request to the downstream service
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same as above.

api/main.go Outdated

resp, err := clientStream.Recv()
if err == io.EOF {
// End of stream from downstream service
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same

@@ -10,6 +10,7 @@ package featureform.serving.proto;

service Feature {
rpc TrainingData(TrainingDataRequest) returns (stream TrainingDataRow) {}
rpc TrainTestSplit(stream TrainTestSplitRequest) returns (stream BatchTrainTestSplitResponse) {}
rpc TrainingDataColumns(TrainingDataColumnsRequest) returns (TrainingColumns) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: try to keep formatting changes to separate PRs otherwise it forces the reader to diff formatting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hit the gear icon in github -> "hide whitespaace"

float train_size = 4;
bool shuffle = 5;
int32 random_state = 6; // Seed for shuffling, if shuffle is true
RequestType request_type = 7; // Specify the type of data being requested
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just name it RequestDataType request_data_type to avoid needing the clarifying comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i can remove these comments, they were generated

(This functionality is currently only available for Clickhouse).

Splits an existing training set into training and testing iterators. The split is processed on the underlying
provider and calculated and serving time.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace "calculated and serving time." with "calculated at serving time."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

train (Iterator): An iterator for training values.
test (Iterator): An iterator for testing values.
"""
if batch_size < 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will the min batch size be configurable at some point? if so I'd replace this magic number with a const + interpolate the value error msg with the const.

if not ignore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

min batch size has to be 1 right? effectively meaning no batch

variant = self._stream.version
stub = self._stream._stub
model = self._stream.model if hasattr(self._stream, "model") else None
if random_state is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this random_state conditional be a part of the first if check on line 533?

also it sets the random_state to zero, but that's an error state value in the first conditional

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also really good catch


@staticmethod
def validate_test_size(test_size, train_size):
if test_size > 1 or test_size < 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: anything we can do to clean this function up? it's a rough read.

for example we could reorder the first IF to be

if test_size < 0 or 1 < test_size to adhere to natural left-right reading pattern (unless you're optimizing for likeliest condition?)

return type_mapping[value.WhichOneof("value")]


def get_numpy_array_type(types):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd throw a parameterized unit test at this function. reads like it could easily be a source of a hidden logic bug.

if self._np_type_parser is None and data.rows:
self._np_type_parser = _NpProtoTypeParser.init_types(data.rows[0])

for i, row in enumerate(data.rows):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not using the index var "i". can just drop it like:

for row in enumerate(data.rows)

client/src/featureform/train_test_split.py Show resolved Hide resolved
randomState int,
) (string, error) {
// Generate unique suffix for the view names
tableNameSuffix := fmt.Sprintf("%s_%d_%t_%d", trainingSetTable, int(testSize*100), shuffle, randomState)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int(testSize*100) why 100, is it not better to use rand int here?


req, err := stream.Recv()
if err != nil {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int: log that the error happened at recv()

}
serv.Logger.Debugw("Get Training Set From Store", "name", name, "variant", variant)
return store.GetTrainingSet(provider.ResourceID{Name: name, Variant: variant})
}

func (serv *FeatureServer) getTrainingSetTestSplitIterator(name, variant string, testSize float32, shuffle bool, randomState int) (provider.TrainingSetIterator, provider.TrainingSetIterator, func() error, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: too many returns here. maybe use a wrapper SplitResp struct or similar.

move client to a separate file and refactor

some more clean up

small one

even more clean up

remove unused function

small comments

address ahmad's cpomment

ts split stuff

Added some optimizations and comments

some more changes

get rid of logs

rename protos

fix tests

rename some things

cleanup some

missed name change

Delete grpc_debug.log

one more name change

small fixes

stupid

alllll the changes

i hate changing interfaces

one more

some more more misses

whoops
@aolfat aolfat temporarily deployed to Integration testing March 20, 2024 16:03 — with GitHub Actions Inactive
@@ -214,6 +214,15 @@ func (it *bqGenericTableIterator) Values() GenericRecord {

func (it *bqGenericTableIterator) Columns() []string {
var columns []string
// As the documentation for bigquery.Schema notes:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignore this

Copy link

codecov bot commented Mar 20, 2024

Codecov Report

Attention: Patch coverage is 46.52174% with 246 lines in your changes are missing coverage. Please review.

❗ No coverage uploaded for pull request base (without-ts-split@8bb4f5e). Click here to learn what that means.

Files Patch % Lines
serving/serving.go 2.74% 177 Missing ⚠️
provider/clickhouse.go 79.22% 8 Missing and 8 partials ⚠️
client/src/featureform/serving.py 78.84% 9 Missing and 2 partials ⚠️
client/src/featureform/train_test_split.py 91.22% 7 Missing and 3 partials ⚠️
provider/bigquery.go 0.00% 5 Missing and 1 partial ⚠️
provider/sql.go 0.00% 6 Missing ⚠️
client/src/featureform/client.py 42.85% 4 Missing ⚠️
provider/k8s.go 0.00% 4 Missing ⚠️
provider/offline.go 0.00% 4 Missing ⚠️
provider/spark.go 0.00% 4 Missing ⚠️
... and 1 more
Additional details and impacted files
@@                 Coverage Diff                 @@
##             without-ts-split    #1381   +/-   ##
===================================================
  Coverage                    ?   55.71%           
===================================================
  Files                       ?      190           
  Lines                       ?    24780           
  Branches                    ?      836           
===================================================
  Hits                        ?    13806           
  Misses                      ?     9441           
  Partials                    ?     1533           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -82,13 +79,182 @@ func (serv *FeatureServer) TrainingData(req *pb.TrainingDataRequest, stream pb.F
return nil
}

type splitContext struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not a fan of the struct name since context is a strong convention in go. maybe SplitMetadata?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly agree and thought about it a bit too but with it being an internal struct and literally being the context for all the functions I thought it would do.


for {
if isTrainFinished && isTestFinished {
// If both iterators are finished, we can close the stream
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete the comment imo. the code is self-documenting.

featureObserver := serv.Metrics.BeginObservingTrainingServe(name, variant)
defer featureObserver.Finish()

ctx := splitContext{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ctx clashes with a strong go convention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeahhh i thought about that, I'm a big advocate of just typing things out so I'll do that

}
}

response := &pb.BatchTrainTestSplitResponse{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: isn't this logically a request? I feel like the naming is a bit off. like you send requests, and receive responses but currently the proto looks like this:

Send(*BatchTrainTestSplitResponse) error
Recv() (*TrainTestSplitRequest, error)

but resolve if I'm missing some context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeahhh it's kindaaaa weird

so the flow is this ->

we make a trainsplitreq -> recieving by api/main -> send to serving.go -> recieve req by serving -> create response by serving and SEND response -> recv resp on main -> send resp on main

*ctx.isTrainFinished = true
}

if *ctx.isTestFinished && *ctx.isTrainFinished {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: can a request be just a testType or a trainType, or is it always both types running together?

the reason I ask is because it looks like we need both conditions to close out, but lines 230 and 232 set them independently of each other, so one could still be false?

Ignore If I'm misreading.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good q.

a request is either a testtype OR a traintype, never both, if one of them finishes we return an iterator finished response so that the iteration is stopped but the stream isn't closed

}
store, err := p.AsOfflineStore()
if err != nil {
// This means that the provider of the training set isn't an offline store.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like an important distinction. tbh I'd take the comment and write it to the logger.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants