Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement separator validation #326

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

MichaelS239
Copy link
Contributor

@MichaelS239 MichaelS239 commented Dec 17, 2023

Add methods for separator validation in .csv tables

@MichaelS239 MichaelS239 force-pushed the separator_validation branch 2 times, most recently from c1b2e75 to 9ce5ca8 Compare December 22, 2023 18:59
@MichaelS239 MichaelS239 marked this pull request as draft December 22, 2023 19:17
@MichaelS239 MichaelS239 marked this pull request as ready for review March 3, 2024 22:02
Comment on lines +234 to +241
struct TestSeparatorValidationParam {
std::filesystem::path csv_path;
char separator;
char expected;

TestSeparatorValidationParam(std::filesystem::path const& path, char sep, char exp)
: csv_path(path), separator(sep), expected(exp) {}
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use path from already existing csv configs
And we don't need constructor

Suggested change
struct TestSeparatorValidationParam {
std::filesystem::path csv_path;
char separator;
char expected;
TestSeparatorValidationParam(std::filesystem::path const& path, char sep, char exp)
: csv_path(path), separator(sep), expected(exp) {}
};
struct TestSeparatorValidationParam {
CSVConfig csv_config;
char test_separator;
std::optopnal<char> expected_separator;
};

EXPECT_EQ(actual.value_or('\0'), p.expected);
}

INSTANTIATE_TEST_SUITE_P(
Copy link
Collaborator

@vs9h vs9h Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, that we also should add some complex cases to derive the separator. Something like:

;,;,;
;,;,;

In this case valid separators are ; and ,.
Also check validation for csvs with quotes:

'a..b',c,d
e,d,'f..g'

In this case, only ',' is a valid separator

TEST_P(TestSeparatorValidation, Default) {
TestSeparatorValidationParam const& p = GetParam();
std::optional<char> actual = util::ValidateSeparator(p.csv_path, p.separator);
EXPECT_EQ(actual.value_or('\0'), p.expected);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and so we can write here

Suggested change
EXPECT_EQ(actual.value_or('\0'), p.expected);
EXPECT_EQ(actual, p.expected_separator);

Comment on lines +194 to +212
if (has_next_) {
for (char c : next_line_) {
letter_count[c]++;
}
}

std::unordered_map<char, unsigned> next_letter_count;
while (has_next_) {
GetNextIfHas();
next_letter_count.clear();
for (char c : next_line_) {
next_letter_count[c]++;
}
for (auto letter : letter_count) {
if (letter.second != next_letter_count[letter.first]) {
letter_count[letter.first] = 0;
}
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we will also take into account those characters that we do not want to take into account - the chars in quotes. See second example in the last comment in the file src/tests/test_util.cpp


namespace util {

std::optional<char> ValidateSeparator(std::filesystem::path const& path, char separator);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to change the function signature. We shouldn't just write a warning to the console, we need to explicitly return a message with error information and actual separator, if found.
We can return something like: std::pair<std::optional<char>, std::string>.
This is necessary because, first of all, this function will be needed for the python server. Let's say we downloaded a user's dataset and want to find out whether he sent the correct separator. And the main goal is to provide readable information about the error.
And accordingly, you need to figure out how to bind this function to python bindings

Comment on lines +51 to +53
auto csv_parser = std::make_shared<CSVParser>(csv_config);
csv_parser->ValidateSeparator();
return csv_parser;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I'm not sure that we always want to validate the separator when creating a table. The main problem is that we have to read the table completely before running the algorithm itself. This causes the file system caches and disk cache to fill up. Therefore, if someone tries to conduct experiments and does not know that validation always takes place first, then the results of the experiments will be incorrect (since all caches must be reset before running the test).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants