Willingness to contribute
Yes. I would be willing to contribute this feature with guidance from the MLflow community.
Proposal Summary
Add the ability to split the ingested data by a grouping column that is not itself included as a feature in the training set.
For example, provide an option to use sklearn's GroupShuffleSplit within the split step of the recipe, without using the split_by_feature column as a feature in the training set:
from sklearn.model_selection import GroupShuffleSplit

GroupShuffleSplit(test_size=0.2, n_splits=2, random_state=2).split(
    data, groups=data[split_by_feature]
)
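To make the request concrete, here is a minimal sketch of the behavior being asked for, written outside of MLflow. It is not how the recipe split step is implemented. The helper name `group_split`, the `patient_id` column, and the sample data are all hypothetical and only serve as illustration. The sketch splits rows by group with `GroupShuffleSplit` and then drops the grouping column so it never reaches the transform or training steps.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def group_split(data, split_by_feature, test_size=0.2, random_state=2):
    """Hypothetical helper: split rows by group, then drop the grouping column."""
    splitter = GroupShuffleSplit(
        n_splits=1, test_size=test_size, random_state=random_state
    )
    # All rows sharing a value in `split_by_feature` land on the same side.
    train_idx, test_idx = next(
        splitter.split(data, groups=data[split_by_feature])
    )
    train = data.iloc[train_idx]
    test = data.iloc[test_idx]

    # Sanity check: no group value appears in both train and test.
    overlap = set(train[split_by_feature]) & set(test[split_by_feature])
    assert not overlap, f"groups leaked across the split: {overlap}"

    # Remove the grouping column so it is not treated as a model feature.
    return (
        train.drop(columns=[split_by_feature]),
        test.drop(columns=[split_by_feature]),
    )


# Illustrative data: five groups with two rows each (made-up values).
data = pd.DataFrame(
    {
        "patient_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
        "x": range(10),
        "y": [0, 1] * 5,
    }
)
train, test = group_split(data, split_by_feature="patient_id")
```

Dropping the column after the split, rather than before, is what lets the grouping information guide the split without being registered as a model input.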
Motivation
What is the use case for this feature?
Modeling using stratified sampling for the training and test sets
Why is this use case valuable to support for MLflow users in general?
Built-in stratified sampling would remove the need for workarounds to use this method within the split step of recipes.
Why is this use case valuable to support for your project(s) or organization?
All of our models require stratified sampling in order to work as intended
Why is it currently difficult to achieve this use case?
As the code is now, any feature that is ingested and used in a grouped split is passed on to the next step for transformations. Since transformations are registered with the model, the unused feature stays with the model.
Details
No response
What component(s) does this bug affect?
area/artifacts: Artifact stores and artifact logging
area/build: Build and test infrastructure for MLflow
area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
area/docs: MLflow documentation pages
area/examples: Example code
area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
area/models: MLmodel format, model serialization/deserialization, flavors
area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
area/projects: MLproject format, project running backends
area/scoring: MLflow Model server, model deployment tools, Spark UDFs
area/server-infra: MLflow Tracking server backend
area/tracking: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support
What language(s) does this bug affect?
language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages
What integration(s) does this bug affect?
integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations