All notable changes to this project will be documented in this file. For help with updating to new Bytewax versions, please see the migration guide.
Add any extra change notes here and we'll put them in the release notes on GitHub when we make a new release.
- Fixes a performance issue where {py:obj}`bytewax.operators.StatefulBatchLogic.notify_at` (and thus many of the other stateful operators' `notify_at` derived from it) was being called superfluously.
- Fixes a bug when using {py:obj}`~bytewax.operators.windowing.EventClock` where in-order but "slow" data results in watermark assertion errors.
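As a rough illustration of the invariant this fix restores (this is a sketch, not Bytewax's implementation), an event-time watermark trails the maximum timestamp seen so far by the configured out-of-order wait, so in-order but slow data should only ever move it forward:

```python
from datetime import datetime, timedelta, timezone

def watermark(max_event_ts: datetime, wait: timedelta) -> datetime:
    # Illustrative only: the watermark lags the largest event timestamp
    # seen so far by the allowed out-of-order wait.
    return max_event_ts - wait

t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
wm = watermark(t0, timedelta(seconds=10))
```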
- Adds a dataflow structure visualizer. Run `python -m bytewax.visualize`.
- **Breaking change** The internal format of recovery databases has been changed from using `jsonpickle` to Python's built-in {py:obj}`pickle`. Recovery stores that used the old format will not be usable after upgrading.
- **Breaking change** The `unary` operator and `UnaryLogic` have been renamed to `stateful` and `StatefulLogic`, respectively.
- Adds a `stateful_batch` operator to allow for lower-level batch control while managing state.
- `StatefulLogic.on_notify`, `StatefulLogic.on_eof`, and `StatefulLogic.notify_at` are now optional overrides. The defaults retain the state and emit nothing.
- **Breaking change** Windowing operators have been moved from `bytewax.operators.window` to `bytewax.operators.windowing`.
- **Breaking change** `ClockConfig`s have had `Config` dropped from their names and are now just `Clock`s. E.g. if you previously used `from bytewax.operators.window import SystemClockConfig`, now use `from bytewax.operators.windowing import SystemClock`.
- **Breaking change** `WindowConfig`s have been renamed to `Windower`s. E.g. if you previously used `from bytewax.operators.window import SessionWindow`, now use `from bytewax.operators.windowing import SessionWindower`.
- **Breaking change** All windowing operators now return a set of streams, {py:obj}`~bytewax.operators.windowing.WindowOut`. {py:obj}`~bytewax.operators.windowing.WindowMetadata` is now branched into its own stream and is no longer part of the single downstream. All items emitted by window operators are labeled with the unique window ID they came from to facilitate joining the data later.
- **Breaking change** {py:obj}`~bytewax.operators.windowing.fold_window` now requires a `merge` argument. This handles whenever the session windower determines that two windows must be merged because a new item bridged a gap.
- **Breaking change** The `join_named` and `join_window_named` operators have been removed because they did not support returning proper type information. Use {py:obj}`~bytewax.operators.join` or {py:obj}`~bytewax.operators.windowing.join_window` instead, which have been enhanced to properly type their downstream values.
- **Breaking change** {py:obj}`~bytewax.operators.join` and {py:obj}`~bytewax.operators.windowing.join_window` have had their `product` argument replaced with `insert_mode`. You can now specify more nuanced kinds of join modes.
- Python interfaces are now provided for custom clocks and windowers. Subclass {py:obj}`~bytewax.operators.windowing.Clock` (and a corresponding {py:obj}`~bytewax.operators.windowing.ClockLogic`) or {py:obj}`~bytewax.operators.windowing.Windower` (and a corresponding {py:obj}`~bytewax.operators.windowing.WindowerLogic`) to define your own senses of time and window definitions.
- Adds a {py:obj}`~bytewax.operators.windowing.window` operator to allow you to write more flexible custom windowing operators.
- Session windows now work correctly with out-of-order data and joins.
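To give a feel for the kind of callback the new `merge` argument expects (the names here are illustrative sketches, not the exact Bytewax signatures), merging two session windows amounts to combining their accumulators:

```python
def builder() -> list:
    # Illustrative: each window's accumulator starts empty.
    return []

def folder(acc: list, item: int) -> list:
    # Fold one item into a window's accumulator.
    acc.append(item)
    return acc

def merge(acc_a: list, acc_b: list) -> list:
    # Called when the session windower decides a new item bridged the
    # gap between two windows: combine both accumulators into one.
    return acc_a + acc_b

combined = merge(folder(builder(), 1), folder(builder(), 2))
```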
- {py:obj}`~bytewax.operators.windowing.WindowMetadata` now contains a {py:obj}`~bytewax.operators.windowing.WindowMetadata.merged_ids` field with any window IDs that were merged into this window.
- All windowing operators now process items in timestamp order. The most visible result is that the {py:obj}`~bytewax.operators.windowing.collect_window` operator now emits collections with values in timestamp order.
- Adds a {py:obj}`~bytewax.operators.filter_map_value` operator.
- Adds an {py:obj}`~bytewax.operators.enrich_cached` operator for easier joining with an external data source.
- Adds a {py:obj}`~bytewax.operators.key_rm` convenience operator to remove keys from a {py:obj}`~bytewax.operators.KeyedStream`.
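For intuition, the semantics of `filter_map_value` on `(key, value)` pairs can be sketched in plain Python (this mirrors the operator's behavior on a keyed stream, not its implementation):

```python
def filter_map_value_demo(mapper, pairs):
    # Keep (key, mapper(value)) whenever mapper returns non-None;
    # drop the pair entirely otherwise. The key is left untouched.
    out = []
    for key, value in pairs:
        new = mapper(value)
        if new is not None:
            out.append((key, new))
    return out

result = filter_map_value_demo(
    lambda v: int(v) if v.isdigit() else None,
    [("a", "1"), ("b", "x"), ("c", "3")],
)
```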
- Fixes a bug where using a system clock on certain architectures causes items to be dropped from windows.
- Multiple operators have been reworked to avoid taking and releasing Python's global interpreter lock while iterating over multiple items. Windowing operators, stateful operators, and operators like `branch` will see significant performance improvements. Thanks to @damiondoesthings for helping us track this down!
- **Breaking change** `FixedPartitionedSource.build_part`, `DynamicSource.build`, `FixedPartitionedSink.build_part`, and `DynamicSink.build` now take an additional `step_id` argument. This argument can be used when labeling custom Python metrics.
- Custom Python metrics can now be collected using the `prometheus-client` library.
- **Breaking change** The schema registry interface has been removed. You can still use schema registries, but you need to instantiate the (de)serializers on your own. This allows for more flexibility. See the `confluent_serde` and `redpanda_serde` examples for how to use the new interface.
- Fixes a bug where items would be incorrectly marked as late in sliding and tumbling windows in cases where the timestamps are very far from the `align_to` parameter of the windower.
- Adds a `stateful_flat_map` operator.
- **Breaking change** Removes the `builder` argument from `stateful_map`. Instead, the initial state value is always `None`, and you can call your previous builder by hand in the `mapper`.
- **Breaking change** Improves performance by removing the `now: datetime` argument from `FixedPartitionedSource.build_part`, `DynamicSource.build`, and `UnaryLogic.on_item`. If you need the current time, use:

  ```python
  from datetime import datetime, timezone

  now = datetime.now(timezone.utc)
  ```
- **Breaking change** Improves performance by removing the `sched: datetime` argument from `StatefulSourcePartition.next_batch`, `StatelessSourcePartition.next_batch`, and `UnaryLogic.on_notify`. You should already have the scheduled next awake time in whatever instance variable you returned in `{Stateful,Stateless}SourcePartition.next_awake` or `UnaryLogic.notify_at`.
- Fixes a bug that prevented the deletion of old state in recovery stores.
- Better error messages on invalid epoch and backup interval parameters.
- Fixes a bug where a dataflow would hang if a source's `next_awake` was set far in the future.
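The `next_awake` pattern replaces sleeping inside the source itself; e.g., to poll roughly every 10 seconds, a partition would return a timestamp like this (a minimal sketch, assuming your partition keeps the value in an instance variable):

```python
from datetime import datetime, timedelta, timezone

# Illustrative: compute the next wall-clock time the runtime should
# call next_batch, instead of blocking inside the source with
# time.sleep.
poll_interval = timedelta(seconds=10)
now = datetime.now(timezone.utc)
next_awake = now + poll_interval
```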
- Changes the default batch size for `KafkaSource` from 1 to 1000 to match the Kafka input operator.
- Fixes an issue with the `count_window` operator: #364.
- Support for schema registries, through `bytewax.connectors.kafka.registry.RedpandaSchemaRegistry` and `bytewax.connectors.kafka.registry.ConfluentSchemaRegistry`.
- Custom Kafka operators in `bytewax.connectors.kafka.operators`: `input`, `output`, `deserialize_key`, `deserialize_value`, `deserialize`, `serialize_key`, `serialize_value`, and `serialize`.
- **Breaking change** `KafkaSource` now emits a special `KafkaSourceMessage` to allow access to all data on consumed messages. `KafkaSink` now consumes `KafkaSinkMessage` to allow setting additional fields on produced messages.
- Non-linear dataflows are now possible. Each operator method returns a handle to the `Stream`s it produces; add further steps by calling operator functions on those returned handles, not on the root `Dataflow`. See the migration guide for more info.
- Auto-complete and type hinting on operators, inputs, outputs, streams, and logic functions now works.
- A ton of new operators: `collect_final`, `count_final`, `count_window`, `flatten`, `inspect_debug`, `join`, `join_named`, `max_final`, `max_window`, `merge`, `min_final`, `min_window`, `key_on`, `key_assert`, `key_split`, `unary`. Documentation for all operators is in `bytewax.operators` now.
- New operators can be added in Python, made by grouping existing operators. See the `bytewax.dataflow` module docstring for more info.
- **Breaking change** Operators are now stand-alone functions; `import bytewax.operators as op` and use e.g. `op.map("step_id", upstream, lambda x: x + 1)`.
- **Breaking change** All operators must take a `step_id` argument now.
- **Breaking change** The `fold` and `reduce` operators have been renamed to `fold_final` and `reduce_final`. They now only emit on EOF and are only for use in batch contexts.
- **Breaking change** The `batch` operator has been renamed to `collect`, so as to not be confused with runtime batching. Behavior is unchanged.
- **Breaking change** The `output` operator no longer forwards its items downstream. Add operators on the upstream handle instead.
- `next_batch` on input partitions can now return any `Iterable`, not just a `List`.
- The `inspect` operator now has a default inspector that prints out items with the step ID.
- The `collect_window` operator can now collect into `set`s and `dict`s.
- Adds a `get_fs_id` argument to `{Dir,File}Source` to allow handling non-identical files per worker.
- Adds `TestingSource.EOF` and `TestingSource.ABORT` sentinel values you can use to test recovery.
- **Breaking change** Adds a `datetime` argument to `FixedPartitionedSource.build_part`, `DynamicSource.build`, `StatefulSourcePartition.next_batch`, and `StatelessSourcePartition.next_batch`. You can now use this to update your `next_awake` time easily.
- **Breaking change** Window operators now emit `WindowMetadata` objects downstream. These objects can be used to introspect the `open_time` and `close_time` of windows. This changes the output type of windowing operators from `(key, values)` to `(key, (metadata, values))`.
- **Breaking change** IO classes and connectors have been renamed to better reflect their semantics and match up with documentation.
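The windowed output shape described above can be unpacked like this (the dict here is only a stand-in for a real `WindowMetadata` object):

```python
# Illustrative item emitted by a windowing operator after this change:
# the value is now a (metadata, values) 2-tuple instead of bare values.
item = ("user-1", ({"open_time": "t0", "close_time": "t1"}, [1, 2, 3]))

key, (metadata, values) = item
```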
- Moves the ability to start multiple Python processes with the `-p` or `--processes` flags to the `bytewax.testing` module.
- **Breaking change** `SimplePollingSource` moved from `bytewax.connectors.periodic` to `bytewax.inputs` since it is an input helper.
- `SimplePollingSource`'s `align_to` argument now works.
- Adds the `batch` operator to dataflows. Calling `Dataflow.batch` will batch incoming items until either a batch size has been reached or a timeout has passed.
- Adds the `SimplePollingInput` source. Subclass this input source to periodically source new input for a dataflow.
- Re-adds GLIBC 2.27 builds to support older Linux distributions.
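The size half of `Dataflow.batch`'s contract can be sketched in plain Python (the timeout half needs a clock and is omitted; this is an illustration, not the operator's code):

```python
def batch_by_size(items, max_size):
    # Group a finite iterable into lists of at most max_size items,
    # flushing a partial batch at the end.
    batches, current = [], []
    for item in items:
        current.append(item)
        if len(current) >= max_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches

batches = batch_by_size(range(5), 2)
```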
- **Breaking change** Recovery system re-worked. Kafka-based recovery removed. The SQLite recovery file format has changed; existing recovery DB files can not be used. See the module docstring for `bytewax.recovery` for how to use the new recovery system.
- Dataflow execution supports rescaling over resumes. You can now change the number of workers and still get proper execution and recovery.
- `epoch-interval` has been renamed to `snapshot-interval`.
- The `list_parts` method of `PartitionedInput` has been changed to return a `List[str]` and should only reflect the available inputs that a given worker has access to. You no longer need to return the complete set of partitions for all workers.
- The `next` method of `StatefulSource` and `StatelessSource` has been changed to `next_batch` and should return a `List` of elements, or the empty list if there are no elements to return.
- Added a new CLI parameter, `backup-interval`, to configure the length of time to wait before "garbage collecting" older recovery snapshots.
- Added `next_awake` to input classes, which can be used to schedule when the next call to `next_batch` should occur. Use `next_awake` instead of `time.sleep`.
- Added `bytewax.inputs.batcher_async` to bridge async Python libraries in Bytewax input sources.
- Added support for linux/aarch64 and linux/armv7 platforms.
- `KafkaRecoveryConfig` has been removed as a recovery store.
- Add support for Windows builds - thanks @zzl221000!
- Adds a `CSVInput` subclass of `FileInput`.
- Add a cooldown for activating workers to reduce CPU consumption.
- Add support for Python 3.11.
- **Breaking change** Reworked the execution model. `run_main` and `cluster_main` have been moved to `bytewax.testing` as they are only supposed to be used when testing or prototyping. Production dataflows should be run by calling the `bytewax.run` module with `python -m bytewax.run <dataflow-path>:<dataflow-name>`. See `python -m bytewax.run -h` for all the possible options. The functionality offered by `spawn_cluster` is now only offered by the `bytewax.run` script, so `spawn_cluster` was removed.
- **Breaking change** `{Sliding,Tumbling}Window.start_at` has been renamed to `align_to`, and both now require that argument. It's not possible to recover windowing operators without it.
- Fixes bugs with windows not closing properly.
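The role `align_to` plays for tumbling windows can be shown with a little window arithmetic (a sketch of the idea, not Bytewax's code):

```python
from datetime import datetime, timedelta, timezone

def window_start(ts, align_to, length):
    # Illustrative: the start of the tumbling window containing ts,
    # with window boundaries anchored at align_to.
    n = (ts - align_to) // length
    return align_to + n * length

align = datetime(2024, 1, 1, tzinfo=timezone.utc)
start = window_start(
    datetime(2024, 1, 1, 0, 2, 30, tzinfo=timezone.utc),
    align,
    timedelta(minutes=1),
)
```

Without a fixed `align_to`, these boundaries would shift between runs, which is why recovery of windowing operators requires it.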
- Fixes an issue with SQLite-based recovery. Previously you'd always get an "interleaved executions" panic whenever you resumed a cluster after the first time.
- Add `SessionWindow` for windowing operators.
- Add `SlidingWindow` for windowing operators.
- **Breaking change** Rename `TumblingWindowConfig` to `TumblingWindow`.
- Add `filter_map` operator.
- **Breaking change** New partition-based input and output API. This removes `ManualInputConfig` and `ManualOutputConfig`. See `bytewax.inputs` and `bytewax.outputs` for more info.
- **Breaking change** The `Dataflow.capture` operator is renamed to `Dataflow.output`.
- **Breaking change** `KafkaInputConfig` and `KafkaOutputConfig` have been moved to `bytewax.connectors.kafka.KafkaInput` and `bytewax.connectors.kafka.KafkaOutput`.
- **Deprecation warning** The `KafkaRecovery` store is being deprecated in favor of `SqliteRecoveryConfig`, and will be removed in a future release.
- **Breaking change** Fixes issue with multi-worker recovery. If the cluster crashed before all workers had completed their first epoch, the cluster would resume from the incorrect position. This requires a change to the recovery store. You cannot resume from recovery data written with an older version.
- Dataflow continuation now works. If you run a dataflow over a finite input, all state will be persisted via recovery, so if you re-run the same dataflow pointing at the same input, but with more data appended at the end, it will correctly continue processing from the previous end-of-stream.
- Fixes issue with multi-worker recovery. Previously resume data was being routed to the wrong worker, so state would be missing.
- **Breaking change** The above two changes require that the recovery format be changed for all recovery stores. You cannot resume from recovery data written with an older version.
- Adds an introspection web server to dataflow workers.
- Adds a `collect_window` operator.
- Added Google Colab support.
- Added tracing instrumentation and configurations for tracing backends.
- Fixes a bug where a window would never close if recovery occurred after the last item but before the window close.
- Recovery logging is reduced.
- **Breaking change** Recovery format has been changed for all recovery stores. You cannot resume from recovery data written with an older version.
- Adds DynamoDB and BigQuery output connectors.
- Performance improvements.
- Support SASL and SSL for `bytewax.inputs.KafkaInputConfig`.
- `KafkaInputConfig` now accepts additional properties. See `bytewax.inputs.KafkaInputConfig`.
- Support for a pre-built Kafka output component. See `bytewax.outputs.KafkaOutputConfig`.
- Added the `fold_window` operator; it works like `reduce_window` but allows the user to build the initial accumulator for each key in a `builder` function.
- Output is no longer specified using an `output_builder` for the entire dataflow; instead you supply an "output config" per capture. See `bytewax.outputs` for more info.
- Input is no longer specified on the execution entry point (like `run_main`); it is instead specified using the `Dataflow.input` operator.
- Epochs are no longer user-facing as part of the input system. Any custom Python-based input components you write just need to be iterators and emit items. Recovery snapshots and backups now happen periodically, defaulting to every 10 seconds.
- Recovery format has been changed for all recovery stores. You cannot resume from recovery data written with an older version.
- The `reduce_epoch` operator has been replaced with `reduce_window`. It takes a "clock" and a "windower" to define the kind of aggregation you want to do.
- `run` and `run_cluster` have been removed, and the remaining execution entry points moved into `bytewax.execution`. You can now get similar prototyping functionality with `bytewax.execution.run_main` and `bytewax.execution.spawn_cluster` using `Testing{Input,Output}Config`s.
- `Dataflow` has been moved into `bytewax.dataflow.Dataflow`.
- Input is no longer specified using an `input_builder`, but now an `input_config` which allows you to use pre-built input components. See `bytewax.inputs` for more info.
- Preliminary support for a pre-built Kafka input component. See `bytewax.inputs.KafkaInputConfig`.
- Keys used in the `(key, value)` 2-tuples to route data for stateful operators (like `stateful_map` and `reduce_epoch`) must now be strings. Because of this, `bytewax.exhash` is no longer necessary and has been removed.
- Recovery format has been changed for all recovery stores. You cannot resume from recovery data written with an older version.
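Since stateful operators now route on string keys, integer or other ids need to be stringified when keying the stream; a minimal sketch:

```python
# Illustrative: stateful operators route on string keys, so convert
# whatever id you have (here an int user id) into a str up front.
events = [(1, "login"), (2, "click"), (1, "logout")]
keyed = [(str(user_id), event) for user_id, event in events]
```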
- Slight changes to `bytewax.recovery.RecoveryConfig` config options due to recovery system changes.
- `bytewax.run()` and `bytewax.run_cluster()` no longer take `recovery_config` as they don't support recovery.
- Adds `bytewax.AdvanceTo` and `bytewax.Emit` to control when processing happens.
- Adds `bytewax.run_main()` as a way to test input and output builders without starting a cluster.
- Adds a `bytewax.testing` module with helpers for testing.
- `bytewax.run_cluster()` and `bytewax.spawn_cluster()` now take a `mp_ctx` argument to allow you to change the multiprocessing behavior, e.g. from "fork" to "spawn". Defaults now to "spawn".
- Adds dataflow recovery capabilities. See `bytewax.recovery`.
- Stateful operators `bytewax.Dataflow.reduce()` and `bytewax.Dataflow.stateful_map()` now require a `step_id` argument to handle recovery.
- Execution entry points now take configuration arguments as kwargs.
- The capture operator no longer takes arguments. Items that flow through those points in the dataflow graph will be processed by the output handlers set up by each execution entry point. Every dataflow requires at least one capture.
- `Executor.build_and_run()` is replaced with four entry points for specific use cases:
  - `run()` for execution in the current process. It returns all captured items to the calling process for you. Use this for prototyping in notebooks and basic tests.
  - `run_cluster()` for execution on a temporary machine-local cluster that Bytewax coordinates for you. It returns all captured items to the calling process for you. Use this for notebook analysis where you need parallelism.
  - `spawn_cluster()` for starting a machine-local cluster with more control over input and output. Use this for standalone scripts where you might need partitioned input and output.
  - `cluster_main()` for starting a process that will participate in a cluster you are coordinating manually. Use this when starting a Kubernetes cluster.
- Adds a `bytewax.parse` module to help with reading command line arguments and environment variables for the above entry points.
- Renames `bytewax.inp` to `bytewax.inputs`.