Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tabulate by Batch, and readiness for Tabulate By X #816

Open
wants to merge 10 commits into
base: develop
Choose a base branch
from

Conversation

artoonie
Copy link
Collaborator

@artoonie artoonie commented Mar 29, 2024

Request for comment on language.
As a noun, I call it a "a tabulate-by field"
As a verb, I call it "Tabulate By Field"

I can imagine preferences for other names, by "Tabulate By Pivot" or "Tabulate By Metadata Field" or something else, but "Tabulate By Field" seems sufficiently clear to me. Thoughts? Any other words that may reduce ambiguity?

Closes #807

@artoonie artoonie added the WIP label Mar 29, 2024
@artoonie artoonie force-pushed the feature/issue-807_tabulate-by branch from 1fd1860 to d97838b Compare March 29, 2024 23:39
@artoonie artoonie requested a review from yezr April 13, 2024 22:07
@artoonie artoonie marked this pull request as ready for review April 13, 2024 22:07
@artoonie artoonie removed the WIP label Apr 13, 2024
@artoonie
Copy link
Collaborator Author

@yezr this has been lightly tested but I could use some help gathering test data, test configurations, and checking edge cases.

@artoonie artoonie added the WIP label Apr 15, 2024
@artoonie artoonie removed the WIP label Apr 30, 2024
@artoonie artoonie changed the title WIP: Tabulate by Batch, and readiness for Tabulate By X Tabulate by Batch, and readiness for Tabulate By X Apr 30, 2024
Copy link
Contributor

@tarheel tarheel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems great. My feedback is mainly around the language. The verb phrase "tabulate by field" is OK, but could be clearer. There and elsewhere, my concern is mostly about how generic the word "field" is.

Logger.severe("tabulateByPrecinct may not be used with CDF files.");
for (TabulateByField tabulateByField : enabledFields()) {
validationErrors.add(ValidationError.CVR_CDF_TABULATE_BY_DISAGREEMENT);
Logger.severe("%s may not be used with CDF files.", tabulateByField);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should maybe note that this is just because we haven't chosen to (fully) implement it.

@@ -1291,5 +1332,21 @@ public String getInternalLabel() {
}
}

enum TabulateByField {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this name kind of confusing because it sounds like a verb phrase (as in "I want to tabulate by field"). Maybe TabulationBucketingField or TabulationGroupingField or TabulationSliceField?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like slice, and think it will make sense to s/field/slice throughout this PR

@@ -9,7 +9,7 @@

/*
* Purpose: Ingests tabulation results and generates various summary report files.
* Design: Generates per-precinct files if specified.
* Design: Generates per-field files if specified.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would clarify what "per-field" means in this comment. Ideally we can choose a global term for this subgroups, like one of the terms I suggested above or something similar. Bucket, group, subgroup, slice, subset. "Pivot" and "metadata" feel maybe a bit too technical. Anyway, there are a bunch of places in this PR where we're just saying "field" or "fieldId", and I find that confusingly generic. I'd prefer for all of those to say something more along the lines of "bucketType"/"bucketId" or whatever to convey that that's the purpose of the fields and the IDs.

@@ -63,8 +66,8 @@ class ResultsWriter {

// number of rounds needed to elect winner(s)
private int numRounds;
// all precinct Ids which may appear in the output cvrs
private Set<String> precinctIds;
// all Field Ids which may appear in the output cvrs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/which/that

@@ -565,7 +573,8 @@ String writeRctabCvrCsv(
outputFile.getAbsolutePath());
CSVPrinter csvPrinter;
BufferedWriter writer = Files.newBufferedWriter(outputFile.toPath());
csvPrinter = new CSVPrinter(writer, CSVFormat.DEFAULT);
CSVFormat format = CSVFormat.DEFAULT.builder().setNullString("").build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convenient

private final Map<String, Map<Integer, RoundTally>> precinctRoundTallies = new HashMap<>();
private final RoundTallies roundTallies = new RoundTallies();
// roundTalliesByField is a map from a tabulate-by field to roundTallies for that field
private final BreakdownByField<RoundTallies> roundTalliesByField = new BreakdownByField<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I misreading the code, or does this mean that if you're using more than one breakdown (with the current code, that would be both batch and precinct), there will be a collision if there's any overlap in the IDs? E.g. if there's a batch 1 and also a precinct 1, they'll be writing to and reading from the same tallies?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, no, now I see that roundTalliesByField is a map from field to fieldId to round to vote count. Maybe roundTalliesByFields (plural) would be more clear?

String precinctId = cvr.getPrecinct();
if (precinctId != null) {
precinctIds.add(precinctId);
for (TabulateByField field : config.enabledFields()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good example of the ambiguity here: without context, I would never guess that config.enabledFields() is giving me a list of enabled breakdown fields

@@ -789,8 +798,8 @@ void generateSummaryFiles(String timestamp) throws IOException {
}
}

Set<String> getPrecinctIds() throws IOException {
return precinctIds;
FieldIdSet getFieldIds() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should fieldIds/getFieldIds clarify that it's referring to the enabled ones? And why do we need the fieldIds instance variable instead of just using config.enabledFields()? Aren't we throwing an exception if there's a discrepancy between the config and the CVRs in this regard anyway?

* Breaks down the templated type T by a specific field.
* The field type is the outer map (TabulateByField), and the field ID is the inner map (String).
*/
static class BreakdownByField<T> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so you made this generic so that you could reuse it for both tallies and tally transfers.

@artoonie
Copy link
Collaborator Author

@tarheel all comments are addressed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tabulating by "sub-level" Enhancement 1: Tabulate by Batch
2 participants