
Evan Lin

Originally published at evanlin.com

[TIL][Go] Parsing Cookie-Protected Web Content with jQuery

Preface:

It was Chinese New Year's Eve, and since the TV at home had been broken for several days, I started going through my own GitHub projects. I had planned to work on Go modules support, but found that this project was no longer working at all.

iloveptt is a small project of mine for crawling PTT boards. When I picked it up again recently, I noticed it had stopped working. After some investigation, I found that the goquery calls could no longer find the expected data. This article records how I tracked down and fixed the problem.

Related Projects:

The Problem Occurred

The goquery selector that was originally used to parse the board index suddenly stopped returning results, even though the selector itself was clearly correct. I also checked the page source in a browser and found no relevant changes.

At this point, the first step is to dump everything the program actually receives: print the HTML of the whole document and compare it with what the selector is supposed to match. You can use doc.Find("*").Each(func(i int, s *goquery.Selection) {...}) and call s.Html() inside it to print the real result of the query and see where the problem lies, as in the sketch below.
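Here is a minimal sketch of that kind of debug dump, assuming a goquery-based fetch of a PTT board index (the URL and board name are only examples, not the exact code in the project):

```go
package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Load the board index directly, mirroring the original code path.
	doc, err := goquery.NewDocument("https://www.ptt.cc/bbs/Beauty/index.html")
	if err != nil {
		log.Fatal(err)
	}

	// Walk every node and dump its HTML so we can see what goquery actually received.
	doc.Find("*").Each(func(i int, s *goquery.Selection) {
		html, err := s.Html()
		if err != nil {
			return
		}
		fmt.Printf("--- node %d ---\n%s\n", i, html)
	})
}
```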

It turns out that the webpage shows a user agreement page

At this point you will find that the HTML printed here is different from what you see in the browser. It turns out that PTT shows an age-confirmation page, and users must first agree that they are over 18 before they can see the board content.

The reason you can read the page normally in your browser is that your browsing session already carries a cookie recording that confirmation.

You can verify this in the Chrome Developer Console: open the Network tab, select the request, and look at the Request Cookies section to see whether the cookie is present.

Reference article: https://bryannotes.blogspot.com/2015/07/python-crawler.html

Querying by Adding a Cookie to the goquery Request

The next question is whether the github.com/PuerkitoBio/goquery package can send a request with cookies. First, check the documentation at https://godoc.org/github.com/PuerkitoBio/goquery#NewDocument — you will find that it provides several constructors, including NewDocument, NewDocumentFromReader, and NewDocumentFromResponse.

Among them, NewDocumentFromResponse takes an *http.Response, so the question becomes how to make a request that carries the cookie using net/http.

This article gives a good example: https://siongui.github.io/2018/03/03/go-http-request-with-cookie/
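Based on that pattern, a request that carries a cookie can be built with net/http alone. This is only a sketch; the over18=1 cookie is the one PTT's age check relies on, and the URL is just an example:

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Build the request by hand so a cookie can be attached before it is sent.
	req, err := http.NewRequest("GET", "https://www.ptt.cc/bbs/Gossiping/index.html", nil)
	if err != nil {
		log.Fatal(err)
	}
	// PTT records the over-18 confirmation in a cookie named "over18".
	req.AddCookie(&http.Cookie{Name: "over18", Value: "1"})

	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	log.Println("status:", resp.Status)
}
```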

Final Modification Method

Following the approach above, the code needs to be changed roughly as follows so that it runs normally again.
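Below is a sketch of what the fix looks like once the cookie-carrying request is combined with goquery.NewDocumentFromResponse. The helper name and the selector are illustrative, not the exact code in the project:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

// fetchDocWithCookie fetches a PTT page with the over-18 cookie attached
// and wraps the response in a goquery document.
func fetchDocWithCookie(url string) (*goquery.Document, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.AddCookie(&http.Cookie{Name: "over18", Value: "1"})

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	// NewDocumentFromResponse parses the body and closes it on return.
	return goquery.NewDocumentFromResponse(resp)
}

func main() {
	doc, err := fetchDocWithCookie("https://www.ptt.cc/bbs/Beauty/index.html")
	if err != nil {
		log.Fatal(err)
	}

	// Print each post title on the index page (selector is illustrative).
	doc.Find(".r-ent .title a").Each(func(i int, s *goquery.Selection) {
		fmt.Println(i, s.Text())
	})
}
```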

Finally, here is the related issue: https://github.com/kkdai/photomgr/issues/6

Conclusion:

goquery makes it very convenient to pick data out of a webpage with jQuery-style selectors, but the content your program fetches may differ from what you see in a browser you open yourself. When debugging, take the time to look at the raw data so you don't get stuck blindly guessing instead of finding the real problem.

I hope this article gives some ideas to those who want to write Go crawlers with goquery, and helps explain the relevant background.

