How to bypass c.OnError #799

Open
quangnx99 opened this issue Jan 3, 2024 · 3 comments

Comments

@quangnx99

When I scrape data, the page returns HTTP status 404, but the response still contains an HTML body. I want to get that response. But in colly, if OnError fires, then OnHTML does not run. How can I get the response when an error occurs?

@quangnx99
Author


I resolved this by setting the ParseHTTPErrorResponse property in OnRequest:

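	c.OnRequest(func(r *colly.Request) {
		// ParseHTTPErrorResponse is a field on the collector, so it could
		// also be set once outside this callback.
		c.ParseHTTPErrorResponse = true
	})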

@oliverbenns

oliverbenns commented Jun 8, 2024

I also have this issue where a website returns 410 Gone but still provides the HTML body, yet the request fails in colly. ParseHTTPErrorResponse does not seem to work, nor is it ideal, as I'd still like to error on other codes.

@oliverbenns

You can hack around this inside the OnError callback, but honestly it's very gross because you're limited in how much you can hook into the Colly logic (really you want to push onto the on http callback slice, but it's private).

I strongly suggest doing this outside of colly with a std http request + goquery instead of the below.

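import (
	"bytes"
	"context"
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/goquery"
	"github.com/gocolly/colly/v2" // assumed v2 import path
)

// Client, PageResult, PageModel and userAgent are defined elsewhere in the author's code.
func (c *Client) GetPage(_ context.Context, id string) (*PageResult, error) {
	pageUrl := "http://google.com"
	col := colly.NewCollector()
	var pageModel *PageModel
	col.UserAgent = userAgent

	var err error

	col.OnError(func(res *colly.Response, collyErr error) {
		// Treat 410 Gone as acceptable; report every other failure.
		if res.StatusCode != http.StatusOK && res.StatusCode != http.StatusGone {
			// Wrap collyErr here; the outer err is still nil at this point.
			err = fmt.Errorf("invalid status code for page %s: %w", pageUrl, collyErr)
			return
		}

		// Use a separate name so := does not shadow the outer err.
		doc, parseErr := goquery.NewDocumentFromReader(bytes.NewBuffer(res.Body))
		if parseErr != nil {
			err = fmt.Errorf("could not parse response body: %w", parseErr)
			return
		}

		doc.Find("script").Each(func(i int, s *goquery.Selection) {
			if i == 0 {
				// s.Text() returns a string; the conversion to *PageModel
				// is not shown in the original snippet.
				pageModel = s.Text()
			}
		})
	})

	// Visit also returns an error for the 410 response, so its result is
	// ignored and the err set in OnError is checked instead.
	_ = col.Visit(pageUrl)
	if err != nil {
		return nil, fmt.Errorf("could not visit %s: %w", pageUrl, err)
	}

	return &PageResult{
		Model: pageModel,
	}, nil
}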
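
For comparison, the "std http request" half of that suggestion might look roughly like the sketch below. It uses only the standard library; fetchBody and statusAllowed are hypothetical names made up for this example, and the httptest server is just a local stand-in for a site that answers 410 Gone with an HTML body. Status codes outside the allow-list still produce an error, which is the behaviour asked for earlier in the thread.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// statusAllowed reports whether code is one of the accepted status codes.
// (Hypothetical helper, not part of colly or net/http.)
func statusAllowed(code int, allowed ...int) bool {
	for _, a := range allowed {
		if code == a {
			return true
		}
	}
	return false
}

// fetchBody performs a GET and returns the body together with the status
// code when that code is in allowed; any other status is reported as an error.
func fetchBody(client *http.Client, url string, allowed ...int) ([]byte, int, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, 0, fmt.Errorf("could not fetch %s: %w", url, err)
	}
	defer resp.Body.Close()

	if !statusAllowed(resp.StatusCode, allowed...) {
		return nil, resp.StatusCode, fmt.Errorf("invalid status code for page %s: %d", url, resp.StatusCode)
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, resp.StatusCode, fmt.Errorf("could not read response body: %w", err)
	}
	return body, resp.StatusCode, nil
}

func main() {
	// Local stand-in server: replies 410 Gone but still sends an HTML body.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusGone)
		fmt.Fprint(w, "<html><script>model</script></html>")
	}))
	defer srv.Close()

	// Accept 200 and 410; every other status code is returned as an error.
	body, status, err := fetchBody(srv.Client(), srv.URL, http.StatusOK, http.StatusGone)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(status, string(body))
}
```

The returned bytes can then be handed to goquery.NewDocumentFromReader, the same call used in the snippet above, to run selectors over the error page.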
