Generate data based on estimated Delta table size #77
I'm thinking we could do something like the following: add a method `withTargetSize(sizeMb)`. We would then generate 100,000 rows and write them to a tmp location (default `dbfs:/tmp`) in a given format (default parquet). We could then use the size of those 100k rows to compute how many rows will make up the target size. The row count used for the calculation could also be configurable. As the size of the result set will be a guesstimate (due to possibly random data), this will produce an approximately sized output. Note that if the size of the data per row is not very variable (i.e. limited arbitrary text generation), this should produce a dataset close to the target size. Key question: do you need an exact target size or approximately sized outputs?
For accurate target sizes, you would need to produce a dataset larger than the target size and sample/write repeatedly until you get close to the target size. For large datasets, this could be costly in terms of performance.
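The sample-and-extrapolate flow described above can be sketched as follows. This is a minimal illustration, not the current dbldatagen API: `withTargetSize` is only proposed here, and the helper name, sample size, and measured byte count are hypothetical.

```python
# Hypothetical helper sketching the proposed withTargetSize flow:
# generate a fixed sample of rows, write it to a temporary location,
# measure its on-disk size, then extrapolate the row count needed
# to reach the target size.

def estimate_rows_for_target(sample_rows: int, sample_bytes: int,
                             target_size_mb: int) -> int:
    """Extrapolate the total row count from a measured sample.

    Because generated data may be random, bytes-per-row is only an
    estimate, so the final table will be approximately target-sized.
    """
    bytes_per_row = sample_bytes / sample_rows
    target_bytes = target_size_mb * 1024 * 1024
    return round(target_bytes / bytes_per_row)

# Example: a 100,000-row sample that wrote out as 25 MiB of parquet
# implies roughly 4.1M rows for a 1 GiB (1024 MiB) target.
rows = estimate_rows_for_target(100_000, 25 * 1024 * 1024, 1024)
print(rows)  # 4096000
```

In practice the sample would be written with something like `df.write.parquet(tmp_path)` and its size measured from the filesystem; the arithmetic above is the part that turns that measurement into a row count.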
As I have been using the data generator, I have had to use trial and error to get the table size I require. Not sure if this is feasible, but it would be great to generate data based on the final table size required instead of the number of rows.
Alternatively, it might be useful to easily get back stats about the generated table size and use them to iteratively generate more data until the desired table size is reached.
Currently I am doing the following to get back the table size, which works well but needs to be run manually each time.
Thanks
Tahir
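The snippet referenced above isn't included in the issue body. One common way to read back a Delta table's on-disk size is Delta Lake's `DESCRIBE DETAIL` command, which returns a `sizeInBytes` column; the sketch below assumes an active `SparkSession` and uses a placeholder table name.

```python
# Sketch of reading a Delta table's size via DESCRIBE DETAIL.
# This may or may not match the author's snippet; the table name
# passed in is a placeholder.

def bytes_to_mb(n_bytes: int) -> float:
    """Convert a byte count to MiB."""
    return n_bytes / (1024 * 1024)

def delta_table_size_mb(spark, table_name: str) -> float:
    """Return a Delta table's on-disk size in MiB.

    DESCRIBE DETAIL returns a single row whose sizeInBytes column
    holds the total size of the table's data files.
    """
    detail = spark.sql(f"DESCRIBE DETAIL {table_name}").collect()[0]
    return bytes_to_mb(detail["sizeInBytes"])
```

A result like this could feed the iterative approach suggested above: measure the table, compare against the target, and generate more rows if it falls short.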