Add ability to format the result #3

Open
gajus opened this issue Jan 17, 2017 · 4 comments

Comments

@gajus
Owner

gajus commented Jan 17, 2017

There has been a request to add a "formatting" ability like in scrape-it library.

It's documented as:

convert (Function): An optional function to change the value.

Example:

{
   articles: {
       listItem: ".article"
     , data: {
           createdAt: {
               selector: ".date"
+             , convert: x => new Date(x)
           }
         , title: "a.article-title"
         , tags: {
               listItem: ".tags > span"
           }
         , content: {
               selector: ".article-content"
             , how: "html"
           }
       }
   }
}

Considerations:

  • Need to consider how this integrates with validation (does formatting happen before or after validation?)
  • What's the API?
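
To make the ordering question concrete, here is a rough sketch of one option: validate the raw string first, then convert it. `extractField` and its `validate` option are made up for illustration and are not an existing API of this project; only the `convert` name is borrowed from scrape-it.

```javascript
// Hypothetical helper, not part of any library: shows one possible ordering,
// where the raw string is validated first and converted afterwards.
function extractField(rawValue, { validate, convert } = {}) {
  if (validate && !validate(rawValue)) {
    throw new Error(`Validation failed for value: ${JSON.stringify(rawValue)}`);
  }
  // Without a convert function the raw value is returned unchanged.
  return convert ? convert(rawValue) : rawValue;
}

const createdAt = extractField('2017-01-17', {
  validate: (x) => /^\d{4}-\d{2}-\d{2}$/.test(x),
  convert: (x) => new Date(x),
});
```

With the opposite ordering (convert, then validate), the validator would receive a `Date` instead of a string, so the choice affects how validators have to be written.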
@sllvn

sllvn commented Jan 18, 2017

Re: API, I've toyed with the idea of passing arrays to indicate selector + transforms, a la createdAt: ['.date', x => new Date(x)]. IMO, it's easier to read than createdAt: { selector: ".date", convert: x => new Date(x) }, especially when you have many transforms in your schema.
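
A minimal sketch of how both spellings could be accepted side by side, assuming the schema is normalized before use. `normalizeField` and `applyTransforms` are hypothetical names, not functions that exist in this project.

```javascript
// Hypothetical normalizer: accepts a bare selector string, the array shorthand
// [selector, ...transforms], or the scrape-it style { selector, convert } object.
function normalizeField(field) {
  if (typeof field === 'string') {
    return { selector: field, transforms: [] };
  }
  if (Array.isArray(field)) {
    const [selector, ...transforms] = field;
    return { selector, transforms };
  }
  return {
    selector: field.selector,
    transforms: field.convert ? [field.convert] : [],
  };
}

// Applies the transforms left to right to a raw extracted value.
function applyTransforms(rawValue, transforms) {
  return transforms.reduce((value, transform) => transform(value), rawValue);
}

const shorthand = normalizeField(['.date', (x) => x.trim(), (x) => new Date(x)]);
const verbose = normalizeField({ selector: '.date', convert: (x) => new Date(x) });
```

The array form extends naturally to several transforms in a row, which is where the object form gets noisy.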

@ComLock
Contributor

ComLock commented Oct 16, 2018

Let's say you select all links in a document and want to filter out duplicates.
sm a|ra href

Any user-defined subroutine is called once per item in the array, not on the array as a whole, right?
(Nor is it called as a reducer?)
So I cannot make a subroutine to sort and remove duplicates from the array.
Or a subroutine to flatten the array.
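
To illustrate the difference, here is a small sketch with plain arrays, assuming the per-item behavior described above. `runPerItem` and `runOnArray` are hypothetical stand-ins and not part of the library.

```javascript
// Hypothetical stand-ins for pipeline steps, not the library's actual API.
// A per-item step sees one value at a time, so it cannot compare values.
const runPerItem = (items, step) => items.map((item) => step(item));

// A whole-array step receives the full list, so it can dedupe or flatten.
const runOnArray = (items, step) => step(items);

const hrefs = ['/b', '/a', '/b'];

// Per item: each href is processed in isolation, so duplicates survive.
const perItem = runPerItem(hrefs, (href) => href.toLowerCase());

// Whole array: duplicates can be dropped and the rest sorted.
const unique = runOnArray(hrefs, (all) => [...new Set(all)].sort());

// Whole array: nested results can be flattened one level.
const flat = runOnArray([['/a'], ['/b', '/c']], (all) => all.flat());
```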

@ComLock
Contributor

ComLock commented Oct 16, 2018

It can be done if the subroutine combines select and read :)

sl: (subject, v, b) => selectSubroutine(subject, ['a', '{0,}'], b).map(match => readSubroutine(match, ['attribute', 'href'], b))

@ComLock
Contributor

ComLock commented Oct 16, 2018

Wow, powerful stuff:

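// Returns a sorted copy of arr with duplicate values removed.
function sortAndRemoveDups(arr) {
	// Copy before sorting so the caller's array is not mutated.
	const sorted = [...arr].sort();
	const uniq = [];
	for (let i = 0; i < sorted.length; i += 1) {
		// After sorting, duplicates are adjacent: keep a value only if it differs from the previous one.
		if (i === 0 || sorted[i] !== sorted[i - 1]) { uniq.push(sorted[i]); }
	}
	return uniq;
}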

...

slb: (s, v, b) => sortAndRemoveDups(selectSubroutine(s, [v.concat('a:not([href^="#"])').join(' '), '{0,}'], b).map(m => readSubroutine(m, ['attribute', 'href'], b)))

...

allRealLinksUnderBody: slb body
