Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Add a basic TextFileStream and tests #23

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

mallardduck
Copy link
Contributor

@mallardduck mallardduck commented Dec 22, 2020

Description

This PR provides an out of the box implementation of a TextFileStream which works like the StringStream. The major difference is that it's using a file pointer for the stream source.


How is this accomplished and why?

While this is a crucial functionality to add IMO, it's not without challenges. Normally PHP file pointers are very mutable and track their own current position. This would mean that file handles need to be managed carefully to either: a) pass them around for reuse, or b) properly close and reopen them.

For how this library works that could get hard to manage very fast. In order to make this more compatible with how the StringStream works we'd need effectively an immutable version of PHP file handlers. So I created something to do exactly that - an API to wrap the default PHP file handlers that provides Immutable objects with idempotence.

See here: https://github.com/mallardduck/immutable-read-file

Effectively the package is just a thin wrapper that uses a FileHandlerManager for opening, sharing, tracking and freeing open file handles.


Necessary changes in parsica?

Due to file pointers tracking position based on current bytes I've had to extend Position to track bytes as well.
This seems to not add any issues for StringStream and will help cover more use-cases like this TextFileStream. This new bytePosition property should remain consistent with the existing Position logic as this value is tracked and incremented the same way that line and column are tracked.

Internally nothing about parsica has changed in this context, however all methods of file handle manipulation speak in terms of bytePosition rather than line/column. So really it's a new property to track and pass around - however it's only necessary when setting the files position like here:

https://github.com/parsica-php/parsica/pull/23/files#diff-8c40f483a7f4a801d83fbe9ac70eb75e538c6216d8eef5789c730aa4145825a4R39


Notes about changes to tests

In order to reuse some test logic I've also split off a test file to be used with require_once that way Excel tests can be done on multiple streams.

@mallardduck
Copy link
Contributor Author

Tests seem to be failing due to composer dependencies not being updated.

@turanct
Copy link
Collaborator

turanct commented Mar 1, 2021

Hey

First of all, thanks so much for your contribution, it's greatly appreciated.

I looked at your PR, and the first thing I noticed is the change to Position. You changed it to incorporate the byte position that you need. But this change also introduces a problem with Position: it's now potentially incorrect. It's possible to instantiate it so that the line & column don't line up with the bytePosition that was added. I think this might be a problem in the future. Also because in some scenarios we might not know byte position, so it might be hard to fill in.

Alternatives from the top of my head might include

  • calculating byte position from line & column
  • having a position interface that can have multiple implementations
  • ... others?

I'll be looking into this in more detail in the coming week.

Thanks for your patience!

@mallardduck
Copy link
Contributor Author

mallardduck commented Mar 2, 2021

hey @turanct - thanks for the update and getting time to review this PR.

It is a bit tricky to resolve the proper way to deal with this type of situation. Mainly because with a string blob and a file stream the concept of Position seem to be a bit different. Specifically that in the context of a file text stream -both fopen and SplFileObject- track based on bytePosition only. Internally to parsica this should be kept sane and in sync by virtue of how the parser works.

So I'm curious what type of workflow could cause your concern here:

introduces a problem with Position: it's now potentially incorrect

If that issue were possible then seeing an example or even a failing test for it would be great. I've accounted for both ASCII and UTF8 strings in my tests to ensure file based streams track this consistently so the parser doesn't get bytes confused and provide incorrect results. So absent any 3rd party interactions with the file in question, once the parser starts working with a TextFileStream there shouldn't be concern about the values drifting from what they should be.

  • calculating byte position from line & column

This is essentially what is being done -the canonical default being byte position 0 and line/character 1- then bytes are calculated here as data is parsed.

Given that bytePosition is calculated as characters are parsed, it should always reflect the file in use. Since that value goes into the Position instances which are immutable the initial values they have should never be wrong.
(Again, ignoring outside interference with the file in question)

The library I created to manage the concept of "immutable read file" handlers then keeps track of opening and closing file pointers as needed, as well as providing an immutable instance of "the current file position". So this allows a single file pointer to be reused for many unique bytePosition instances. This helper library only consumes the value provided by parisica's Position, so parsica's existing Position::advance method still maintains full control over the bytePosition value for the given instance in question.

So far this seems successful for file pointers and text blobs as the stream. I'm curious what other contexts we may need to imagine covering support for? I have a hard time imagining a method of parsing (and rewinding) the parsing of an actual "live stream" of text.

I.e. for parsing an HTTP response for example, the response body would likely need to be cached no matter what in order to rewind it during parsing. So at that point the buffer should either be a StingStream in memory or a TextFileStream to the temp file.

@mallardduck mallardduck changed the title Add a basic TextFileStream and tests [Feature] Add a basic TextFileStream and tests Apr 19, 2021
@mallardduck
Copy link
Contributor Author

hey @turanct - just wondering if you might be able to take a look at this again? maybe if you start a "review" and make notes on the areas of concern that could be a good step?

As mentioned I understand the confusion around BytePosition being tracked, but I believe I've taken steps to address those concerns. Effectively this PR expands the Position object to operate consistently while providing two different contexts for tracking position. This matters as working with an arbitrary blob of text the current Column+Line context makes sense. However, when working with a stream of text from a socket - AFAIK - only the byte position can be used.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants