New API exploration #90

drewnoakes · 2017-02-13T13:19:31Z

New API

This pull request explores an improved API for the library.

Discussion about the API should happen here, and this description will be updated to reflect what's agreed upon.

Desired qualities & features

Easier to update properties

Currently, adding a property requires editing four disparate locations in code:

Adding a const field to a directory class
Adding its name to the field name hash table
(optionally) Adding a custom description method, and
(optionally) Registering it in the descriptor class's switch statement

We could lose the concept of descriptor classes altogether.

A new API should co-locate all this in one place. For example:

/* TODO add example code */
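One possible shape for this, as a minimal Java sketch. `PropertyDef` and `ExifProperties` are hypothetical names, not existing library types, and the description text is illustrative. The point is only that the identifier, the name and the description logic live in a single declaration.

```java
import java.util.function.Function;

/** Hypothetical: one declaration carries the ID, the name and the describer. */
final class PropertyDef<T> {
    final int id;
    final String name;
    private final Function<T, String> describer;

    PropertyDef(int id, String name, Function<T, String> describer) {
        this.id = id;
        this.name = name;
        this.describer = describer;
    }

    /** Without a custom describer, fall back to String.valueOf. */
    PropertyDef(int id, String name) {
        this(id, name, v -> String.valueOf(v));
    }

    String describe(T value) {
        return describer.apply(value);
    }
}

/** Adding a property means adding exactly one field here. */
final class ExifProperties {
    static final PropertyDef<Integer> ORIENTATION = new PropertyDef<>(
        0x0112, "Orientation",
        v -> v == 1 ? "Top-left (normal)" : "Unknown (" + v + ")");

    static final PropertyDef<String> MAKE = new PropertyDef<>(0x010F, "Make");

    private ExifProperties() {}
}
```

With this layout, `ExifProperties.ORIENTATION.describe(1)` yields the custom description, while `MAKE` needs no descriptor code at all.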

Property metadata (or, metadata about metadata)

We can model property metadata, such as:

A unique ID for the property (within the library)
Summary
Description
The XMP namespace and identifier of the property, if available
Expected data type
Expected value count (for enumerable values)
Expected string encoding (for strings)
Whether the tag influences the extraction process (e.g. camera make influencing makernote decoding)

Metadata varies depending upon the property type.
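As a rough Java sketch of how this list might be modelled: all class and field names below are assumptions made for illustration. Fields common to every property sit in a base class, and type-specific metadata (string encoding, value count) sits in subclasses.

```java
import java.nio.charset.Charset;

/** Hypothetical: metadata shared by every property, whatever its type. */
class PropertyInfo {
    final String id;                    // unique within the library
    final String summary;
    final String description;
    final String xmpNamespace;          // null when no XMP mapping exists
    final String xmpName;               // null when no XMP mapping exists
    final Class<?> expectedType;
    final boolean influencesExtraction; // e.g. camera make steering makernote decoding

    PropertyInfo(String id, String summary, String description,
                 String xmpNamespace, String xmpName,
                 Class<?> expectedType, boolean influencesExtraction) {
        this.id = id;
        this.summary = summary;
        this.description = description;
        this.xmpNamespace = xmpNamespace;
        this.xmpName = xmpName;
        this.expectedType = expectedType;
        this.influencesExtraction = influencesExtraction;
    }

    boolean hasXmpMapping() {
        return xmpNamespace != null && xmpName != null;
    }
}

/** String properties add an expected encoding. */
final class StringPropertyInfo extends PropertyInfo {
    final Charset expectedEncoding;

    StringPropertyInfo(String id, String summary, String description,
                       String xmpNamespace, String xmpName,
                       boolean influencesExtraction, Charset expectedEncoding) {
        super(id, summary, description, xmpNamespace, xmpName,
              String.class, influencesExtraction);
        this.expectedEncoding = expectedEncoding;
    }
}

/** Enumerable properties add an expected value count. */
final class ArrayPropertyInfo extends PropertyInfo {
    final int expectedCount;

    ArrayPropertyInfo(String id, String summary, String description,
                      String xmpNamespace, String xmpName,
                      Class<?> expectedType, boolean influencesExtraction,
                      int expectedCount) {
        super(id, summary, description, xmpNamespace, xmpName,
              expectedType, influencesExtraction);
        this.expectedCount = expectedCount;
    }
}
```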

Support varied tag identifiers

Directory as a base class requires all tag identifiers to be integers. That's fine for Exif and other TIFF formats that use numeric identifiers for tags, but not all file formats have them. For those formats we end up defining our own arbitrary integer mapping.

XMP in particular doesn't fit this mould, and we don't support it very well. XMP properties are identified by two parts: a namespace and a key. We should support composite keys such as this.

A likely implementation will have a directory base class that's generic on its key type.

We still want to easily enumerate all properties and print out their descriptions. Key types must have some common base type from which a key string can be obtained.

/* TODO add example code */
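A minimal Java sketch of the idea, assuming hypothetical names throughout (`PropertyKey`, `IntKey`, `XmpKey`, `KeyedDirectory`). The directory is generic on its key type, and the common `PropertyKey` base lets callers enumerate and print properties without knowing the concrete key type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Common base type: every key can render itself as a string. */
interface PropertyKey {
    String keyString();
}

/** Numeric key, as used by Exif and other TIFF-based formats. */
final class IntKey implements PropertyKey {
    private final int value;
    IntKey(int value) { this.value = value; }
    @Override public String keyString() { return String.format("0x%04X", value); }
    @Override public boolean equals(Object o) {
        return o instanceof IntKey && ((IntKey) o).value == value;
    }
    @Override public int hashCode() { return Integer.hashCode(value); }
}

/** Composite key for XMP: a namespace plus a name. */
final class XmpKey implements PropertyKey {
    private final String namespace;
    private final String name;
    XmpKey(String namespace, String name) {
        this.namespace = namespace;
        this.name = name;
    }
    @Override public String keyString() { return namespace + name; }
    @Override public boolean equals(Object o) {
        return o instanceof XmpKey
            && ((XmpKey) o).namespace.equals(namespace)
            && ((XmpKey) o).name.equals(name);
    }
    @Override public int hashCode() { return Objects.hash(namespace, name); }
}

/** Directory generic on its key type; insertion order is preserved. */
class KeyedDirectory<K extends PropertyKey> {
    private final Map<K, Object> values = new LinkedHashMap<>();

    void set(K key, Object value) { values.put(key, value); }

    Object get(K key) { return values.get(key); }

    /** Lists "key = value" lines using only the common PropertyKey contract. */
    List<String> describeAll() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<K, Object> e : values.entrySet()) {
            lines.add(e.getKey().keyString() + " = " + e.getValue());
        }
        return lines;
    }
}
```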

Top level object

The .NET project does not have a Metadata class, instead using IEnumerable<Directory>. C# provides great operators (OfType<>, First, Select, SelectMany, ...), and allows extension methods on this interface.

Data driven approach

Rather than defining all properties in code, they should be loaded from a configuration file.

The configuration file would be reused between the Java and C# implementations. It should simplify the creation of implementations in other languages as well.

Implementations could provide partial support for the metadata types described in the data file, allowing gradual implementation. We could automatically generate documentation/tabulation of support across implementations.

Exiftool's Perl source code reads a lot more like data than code. This data file should be equally declarative.
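To make the idea concrete, here is a small Java sketch that loads property definitions from text. The pipe-delimited row format (`id|name|type`) and the `PropertyRow` and `PropertyDataFile` names are invented purely for illustration; no file format has been agreed.

```java
import java.util.ArrayList;
import java.util.List;

/** One row of the hypothetical data file. */
final class PropertyRow {
    final int id;
    final String name;
    final String type;

    PropertyRow(int id, String name, String type) {
        this.id = id;
        this.name = name;
        this.type = type;
    }
}

final class PropertyDataFile {
    private PropertyDataFile() {}

    /** Parses rows such as "0x010F|Make|string"; blank lines and '#' comments are skipped. */
    static List<PropertyRow> parse(List<String> lines) {
        List<PropertyRow> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] parts = trimmed.split("\\|");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Malformed row: " + line);
            }
            // Integer.decode accepts both decimal and 0x-prefixed hex identifiers.
            rows.add(new PropertyRow(Integer.decode(parts[0]), parts[1], parts[2]));
        }
        return rows;
    }
}
```

An implementation in another language would only need an equivalent of `parse` to consume the same file.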

Logical vs. physical properties

As discussed in drewnoakes/metadata-extractor#10, it'd be valuable to allow simpler access to logical values that may have multiple possible physical locations.

Examples of such logical properties: Timestamp, Orientation, CameraMake, CameraModel, Aperture, Exposure, Flash, FocalLength, ISO, WhiteBalance, ImageSize, GeoLocation, Altitude, Heading, ThumbnailSize, LensModel, DriveMode, ExposureMode, ExposureProgram, Rating, Subject, Label, Copyright, Author, Comment, ImageCount.

Some could be sourced from many locations (Timestamp, ISO, Flash...), while others are combinations of multiple tags (ImageSize, GeoLocation, ThumbnailSize, ...).
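The first case (one logical value, several candidate physical locations) could be sketched in Java as below. `LogicalProperty` is a hypothetical type, and representing physical values as a flat map keyed by "Directory/Tag" strings is an assumption made to keep the example short. Combined values such as ImageSize would need a different resolver and are not covered here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical logical property: an ordered list of physical locations to try. */
final class LogicalProperty {
    final String name;
    private final List<String> sources;

    LogicalProperty(String name, String... sources) {
        this.name = name;
        this.sources = Arrays.asList(sources);
    }

    /** Returns the value from the first listed source that is present. */
    Optional<Object> resolve(Map<String, Object> physicalValues) {
        for (String source : sources) {
            Object value = physicalValues.get(source);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }
}
```

The order of the sources encodes their priority, so a more trustworthy location wins whenever it is populated.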

Efficient storage

Some formats use fixed length records. For these, a directory could store the single byte[] and use IndexedProperty methods to read/write values at runtime.
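A minimal Java sketch of such a directory, assuming a hypothetical `FixedRecordDirectory` class: the record is kept as a single byte[] (wrapped in a `ByteBuffer`), and values are decoded from fixed offsets only when requested.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical directory backed by one fixed-length record. */
final class FixedRecordDirectory {
    private final ByteBuffer record;

    FixedRecordDirectory(byte[] bytes, ByteOrder order) {
        // wrap() shares the array, so no per-property objects are allocated up front
        this.record = ByteBuffer.wrap(bytes).order(order);
    }

    /** Reads an unsigned 16-bit value at the given byte offset. */
    int getUInt16(int offset) {
        return record.getShort(offset) & 0xFFFF;
    }

    /** Reads an unsigned 32-bit value at the given byte offset. */
    long getUInt32(int offset) {
        return record.getInt(offset) & 0xFFFFFFFFL;
    }

    /** Writes the low 16 bits of value at the given byte offset. */
    void setUInt16(int offset, int value) {
        record.putShort(offset, (short) value);
    }
}
```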

Context

We need some kind of object that configures how metadata extraction is performed. It could cache the parsed data file (see above), specify filtering options (see below), and hold configuration such as threshold limits on byte[] sizes and which metadata types to extract. If we ever need runtime code generation, it could cache those resources too.

Filtering

It seems useful to be able to limit the types of metadata returned during processing, to reduce heap allocation and reduce IO/CPU usage. There's a PR for this in the Java implementation, and some discussion there.
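A filter could live on the context object described earlier. The Java sketch below uses hypothetical names (`MetadataKind`, `ExtractionContext`) and shows just two settings: the set of metadata types to extract and a byte[] size threshold. A reader would consult the context before doing any work for a given type.

```java
import java.util.EnumSet;

/** Hypothetical, deliberately incomplete list of metadata types. */
enum MetadataKind { EXIF, XMP, IPTC, ICC }

/** Hypothetical context carrying filtering options and limits. */
final class ExtractionContext {
    private final EnumSet<MetadataKind> wanted;
    private final int maxByteArrayLength;

    ExtractionContext(EnumSet<MetadataKind> wanted, int maxByteArrayLength) {
        this.wanted = EnumSet.copyOf(wanted); // defensive copy
        this.maxByteArrayLength = maxByteArrayLength;
    }

    /** A context that filters nothing. */
    static ExtractionContext everything() {
        return new ExtractionContext(EnumSet.allOf(MetadataKind.class), Integer.MAX_VALUE);
    }

    /** Readers check this first, so unwanted segments are never decoded. */
    boolean shouldExtract(MetadataKind kind) {
        return wanted.contains(kind);
    }

    /** Guards against allocating byte arrays beyond the configured threshold. */
    boolean isWithinLimit(int byteCount) {
        return byteCount <= maxByteArrayLength;
    }
}
```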

Serialisation

It might be good to support hooks for serialisation and deserialisation in arbitrary formats. There's a PR in the Java version that uses Java's object serialisation, but a more general approach should support XML, JSON, etc.
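One way to keep the output format open is a small sink interface that the library drives while walking the metadata. The Java sketch below is an assumption, not an agreed design: `MetadataSink` and `TextSink` are hypothetical names, and only the writing direction is shown. XML or JSON writers would implement the same three callbacks.

```java
/** Hypothetical hook: the library calls these while walking the metadata. */
interface MetadataSink {
    void beginDirectory(String name);
    void property(String name, String value);
    void endDirectory();
}

/** Simplest possible sink: plain text, one property per line. */
final class TextSink implements MetadataSink {
    private final StringBuilder out = new StringBuilder();

    @Override public void beginDirectory(String name) {
        out.append('[').append(name).append("]\n");
    }

    @Override public void property(String name, String value) {
        out.append(name).append(" = ").append(value).append('\n');
    }

    @Override public void endDirectory() {
        out.append('\n');
    }

    String result() {
        return out.toString();
    }
}
```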

Future support for editing metadata

This is a much sought-after feature, but it's a big commitment as the cost per error is high, and it will require a great deal of engineering.

So while it's not a v-next goal explicitly, it will likely be the next significant milestone for the project, and we should at least give it some thought when it comes to this iteration of the API. We should consider trying to minimise future API churn.

Naming

The data model (directories, tags) dates back to when the project was called ExifExtractor. The terms come from the TIFF specification.

Is property a more suitable and general term for what we currently call tag?
Are there any other names/concepts that should be renamed or refactored?

rcketscientist · 2017-02-18T12:49:27Z

While writing the x3f support which can have string keys I was thinking about this. While it would require a massive code reformat I think it would actually have a minimal logical impact to do:

Key
|-StringKey
|-IntegerKey
|-CompoundKey

With Directory<T>s managing themselves, the type of the constants should be transparent to the user, unless I'm overlooking something.

Sorry, I'm not a coder primarily, so I don't know the fancy terms for what I just tried to describe.

rcketscientist · 2017-02-19T16:57:23Z

(Java)
I scrapped the simple key extension idea. It worked great for replacing everything with some quick regex, but wasn't compatible with switch statements. I think all we need is a reverse-lookup enum.

I modified Directory and one of the subclasses here:
https://gist.github.com/rcketscientist/60a03908034d653a1c9134f1abaf38e5

I threaded the logic through the descriptor and exif handler without issues. As you can see Directory defers the map logic to its subclasses. It should handle the change transparently with the addition of reverse-lookup gets:

K is enum property keys.
T is the value K represents (Integer in most cases).

    public void setObject(K tagType, @NotNull Object value)
    {
        if (value == null)
            throw new NullPointerException("cannot set a null object");

        if (!getTagMap().containsKey(tagType)) {
            _definedTagList.add(new Tag(tagType, this));
        }
        getTagMap().put(tagType, value);
    }

    public void setObject(T tagValue, @NotNull Object value)
    {
        setObject(getTagFromValue(tagValue), value);
    }

This solution will probably be a little more manual to implement, so let me know what you think before I go refactoring the whole project.
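For reference, a reverse-lookup enum of the kind described above can be sketched as follows. `ExifTag` and `fromValue` are hypothetical names chosen for this example (the snippet above calls the lookup `getTagFromValue`); the gist may differ in detail.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical key enum with a reverse lookup from the raw tag value. */
enum ExifTag {
    MAKE(0x010F),
    MODEL(0x0110),
    ORIENTATION(0x0112);

    // Built once, after all constants exist, so lookups are O(1).
    private static final Map<Integer, ExifTag> BY_VALUE = new HashMap<>();
    static {
        for (ExifTag tag : values()) {
            BY_VALUE.put(tag.value, tag);
        }
    }

    private final int value;

    ExifTag(int value) {
        this.value = value;
    }

    int value() {
        return value;
    }

    /** Returns the matching constant, or null when the value is unknown. */
    static ExifTag fromValue(int value) {
        return BY_VALUE.get(value);
    }
}
```

Because the keys are enum constants, they remain usable in switch statements, which the earlier key-class idea did not allow.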

rcketscientist · 2017-02-21T15:43:14Z

I played with the enum idea a bit more and realized it can tick all the checkboxes in Java. Using interfaces and the template capability in Java enums eliminates the need for Tag and Descriptor, and locates all tag information centrally in the enum itself.

Key (interface):
https://gist.github.com/rcketscientist/99830fbbb1ce0e451bbf698aeeca0678

Example directory:
https://gist.github.com/rcketscientist/60a03908034d653a1c9134f1abaf38e5

By consolidating so much we'd lose plug-in backwards compatibility, but updating to the new API shouldn't be too much work.

drewnoakes force-pushed the new-api branch from 74bafc7 to 9b34bc3 Compare February 13, 2017 14:01

drewnoakes mentioned this pull request Feb 13, 2017

Produce values derived from one or more tags drewnoakes/metadata-extractor#10

Open

drewnoakes mentioned this pull request Mar 15, 2017

Implement Metadata filter drewnoakes/metadata-extractor#225

Closed

drewnoakes force-pushed the new-api branch from 9b34bc3 to f15124b Compare April 12, 2020 10:46

drewnoakes force-pushed the new-api branch from 19e364f to cc56a07 Compare May 5, 2020 12:49

drewnoakes force-pushed the new-api branch from cc56a07 to 393c5d9 Compare April 29, 2022 23:58

drewnoakes added 7 commits May 5, 2022 21:28

Exploration of ideas for a new API.

39bf554

WIP

53aa32f

WIP

f03848d

WIP

f90d975

Inline field

737c869

WIP

a760533

WIP

d3f8157

drewnoakes force-pushed the new-api branch from adbdb78 to d3f8157 Compare May 5, 2022 11:31
