Flow is a PHP-based, strongly typed ETL (Extract, Transform, Load) library for asynchronous data processing with constant memory consumption.
Extract from the Source, Transform, Load to the Sink.
<?php

declare(strict_types=1);

use function Flow\ETL\Adapter\Parquet\{from_parquet, to_parquet};
use function Flow\ETL\DSL\{data_frame, lit, ref, sum, to_output};
use Flow\ETL\Filesystem\SaveMode;

require __DIR__ . '/vendor/autoload.php';

data_frame()
    ->read(from_parquet(__FLOW_DATA__ . '/orders_flow.parquet'))
    ->select('created_at', 'total_price', 'discount')
    // Reduce each timestamp to a year/month string so rows can be grouped by month.
    ->withEntry('created_at', ref('created_at')->cast('date')->dateFormat('Y/m'))
    ->withEntry('revenue', ref('total_price')->minus(ref('discount')))
    ->select('created_at', 'revenue')
    ->groupBy('created_at')
    ->aggregate(sum(ref('revenue')))
    ->sortBy(ref('created_at')->desc())
    ->withEntry('daily_revenue', ref('revenue_sum')->round(lit(2))->numberFormat(lit(2)))
    ->drop('revenue_sum')
    // Print the result as an ASCII table.
    ->write(to_output(truncate: false))
    // Turn created_at back into a date before saving to Parquet.
    ->withEntry('created_at', ref('created_at')->toDate('Y/m'))
    ->mode(SaveMode::Overwrite)
    ->write(to_parquet(__FLOW_OUTPUT__ . '/daily_revenue.parquet'))
    ->run();
$ php daily_revenue.php
+------------+---------------+
| created_at | daily_revenue |
+------------+---------------+
|    2023/10 |    206,669.74 |
|    2023/09 |    227,647.47 |
|    2023/08 |    237,027.31 |
|    2023/07 |    240,111.05 |
|    2023/06 |    225,536.35 |
|    2023/05 |    234,624.74 |
|    2023/04 |    231,472.05 |
|    2023/03 |    231,697.36 |
|    2023/02 |    211,048.97 |
|    2023/01 |    225,539.81 |
+------------+---------------+
10 rows
The reasons behind creating this project can be explained in a few tweets. To get familiar with the basic ETL API, please look into the flow-php/etl repository; everything else is listed below.
- constant memory consumption
- caching
- reading from any data source
- writing to any data source
- rich collection of data transformation functions
- grouping & aggregating
- remote files processing
- joins
- sorting
- displaying datasets as ASCII table
- validation against schema
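
To make the "constant memory consumption" and "grouping & aggregating" points more concrete, here is a minimal sketch in plain PHP. It does not use Flow and is not how Flow is implemented internally; `extract_orders` and `sum_revenue_by_month` are hypothetical names made up for this illustration. The idea it shows is that a generator hands over one row at a time, so only the current row and one running total per group are held in memory, regardless of how many rows are read.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical extractor: yields rows one at a time instead of building a full array.
 *
 * @return \Generator<int, array{created_at: string, revenue: float}>
 */
function extract_orders(int $count): \Generator
{
    for ($i = 0; $i < $count; $i++) {
        // Alternate between two months so there is something to group by.
        yield [
            'created_at' => $i % 2 === 0 ? '2023/01' : '2023/02',
            'revenue' => 10.0,
        ];
    }
}

/**
 * Groups rows by month and sums revenue, keeping only one running total per group.
 *
 * @param iterable<array{created_at: string, revenue: float}> $rows
 * @return array<string, float>
 */
function sum_revenue_by_month(iterable $rows): array
{
    $totals = [];

    foreach ($rows as $row) {
        $totals[$row['created_at']] = ($totals[$row['created_at']] ?? 0.0) + $row['revenue'];
    }

    return $totals;
}

// The generator is consumed lazily; the 100,000 rows never exist in memory at once.
$totals = sum_revenue_by_month(extract_orders(100_000));

foreach ($totals as $month => $revenue) {
    echo $month, ' => ', number_format($revenue, 2), PHP_EOL;
}
```

Memory use here depends on the number of distinct groups rather than on the number of input rows, which is the property the feature list refers to.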
- DataFrame - Lazy data processing frame.
- Rows - Immutable collection of `Row` objects.
- Row - Immutable, strongly typed collection of `Entry` objects.
- Entry - Immutable, strongly typed object representing a cell in a row.
- Extractor (Reader) - Memory-safe data source returning a `\Generator` that yields `Rows` to the `Pipeline`.
- Transformer - Data transformer receiving and returning `Rows` (in most cases transformed), one instance of `Rows` at once.
- Loader (Writer) - Memory-safe representation of a data sink; the Loader is responsible for writing `Rows` into destination storage, one at a time.
- Pipeline - Interface representing the ETL process; each received `Rows` instance is passed through all `Pipes`. Also responsible for error handling.
- Pipe - Loader or Transformer instance existing in the `Pipes` collection.
- Function - Transformation that might happen on a single row, a single entry, rows, or a group of rows.
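
The way these building blocks fit together can be sketched in plain PHP. The interfaces and classes below (`SketchExtractor`, `SketchTransformer`, `SketchLoader`, `SketchPipeline`, and the concrete implementations) are hypothetical stand-ins invented for this illustration, not Flow's real interfaces, and plain arrays play the role of `Rows`. The sketch only shows the flow of data: the extractor yields one batch at a time, and each batch passes through every pipe in order.

```php
<?php

declare(strict_types=1);

// Hypothetical, simplified stand-ins for the building blocks; NOT Flow's real API.
interface SketchExtractor
{
    /** @return \Generator<int, array<int, array<string, mixed>>> one batch of rows per yield */
    public function extract(): \Generator;
}

interface SketchTransformer
{
    public function transform(array $rows): array;
}

interface SketchLoader
{
    public function load(array $rows): void;
}

final class ArrayExtractor implements SketchExtractor
{
    public function __construct(private readonly array $data, private readonly int $batchSize) {}

    public function extract(): \Generator
    {
        foreach (array_chunk($this->data, $this->batchSize) as $batch) {
            yield $batch;
        }
    }
}

final class AddRevenue implements SketchTransformer
{
    public function transform(array $rows): array
    {
        // Return new rows instead of mutating the input, mirroring immutable Rows.
        return array_map(
            static fn (array $row): array => $row + ['revenue' => $row['total_price'] - $row['discount']],
            $rows
        );
    }
}

final class MemoryLoader implements SketchLoader
{
    /** @var array<int, array<string, mixed>> */
    public array $written = [];

    public function load(array $rows): void
    {
        array_push($this->written, ...$rows);
    }
}

final class SketchPipeline
{
    /** @var array<int, SketchTransformer|SketchLoader> */
    private array $pipes = [];

    public function add(SketchTransformer|SketchLoader $pipe): self
    {
        $this->pipes[] = $pipe;

        return $this;
    }

    public function process(SketchExtractor $extractor): void
    {
        // Each batch is pushed through all pipes before the next batch is read.
        foreach ($extractor->extract() as $rows) {
            foreach ($this->pipes as $pipe) {
                if ($pipe instanceof SketchTransformer) {
                    $rows = $pipe->transform($rows);
                } else {
                    $pipe->load($rows);
                }
            }
        }
    }
}

$loader = new MemoryLoader();

(new SketchPipeline())
    ->add(new AddRevenue())
    ->add($loader)
    ->process(new ArrayExtractor([
        ['total_price' => 100.0, 'discount' => 10.0],
        ['total_price' => 50.0, 'discount' => 0.0],
        ['total_price' => 20.0, 'discount' => 5.0],
    ], batchSize: 2));

print_r($loader->written);
```

Error handling, which the list above assigns to the Pipeline, is left out to keep the sketch short. The `MemoryLoader` collects everything only so the result can be inspected; a real sink would write each batch out and discard it.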
- PHP 8.1 - ✅
- PHP 8.2 - ✅
- PHP 8.3 - ✅
- array
- boolean
- datetime
- enum
- float
- integer
- json
- list
- map
- null
- object
- string
- structure
- uuid
- xml
- xml_node
- ETL
- Adapters
- Libraries
Flow ETL provides a rich set of official functions to transform data. You can find them all in the flow-php/etl repository.
Flow PHP is sponsored by:
- Blackfire - the best PHP profiling and monitoring tool!