DEV Community

Cover image for Book review "Perl and LWP"
Tib
Tib

Posted on • Edited on

Book review "Perl and LWP"

Perl and LWP

The book is available online for free

Thank you a lot O Reilly and Sean W. Burke for that !

Disclaimer

A lot of weaknesses of this book are due to the age but I won't refrain to report them with honesty :)

I'm very grateful to the person that took time to write this book : Sean W. Burke !

Tell me about history

At the beginning, Sean reminds us a lot of LWP and webservers history and I personnally loved it.

For my point of view, IT sciences are very sensitive to "technologies hypes" and things seems to go faster and faster these days (probably just an impression, I'm too young to judge) but I personnally love to read about history and oldies which created the present world.

Some people can think about this part as a weakness, but I definitely loved it.

Gandpa

Quickly we face some outdated websites 😧

Among other old stuff, you will have fun with HTML code from 2002 like table with captiatlized tags : <TD ALIGN=... or <BODY BGCOLOR="#aabbcc" BACKGROUND="img/bg.jpeg" ... or again <B CLASS="h3">

Be prepared !! 😁

Old HTML

Some paragraphs from chapter 3 are dedicated to explanations about HTML forms elements like textarea, submit or hidden fields (not really outdated but I felt like a jump in the past 😃)

And what about Netscapes 4.76 😖 used as browser (released in 2000).

Netscape

Sometimes Sean is coding like my grandma (regularize_hostname to canonicalize hostname should be taken from a module, not coded by hand with regex).

...This remark does not mean that I'm not coding like my grandma too ! 😁

There is only one edition of this book and that's why it is so outdated :/

Cool bits

Stop critics, let's look at cool things from this book.

I was very happy to learn that :

  • LWP manages a chained list of responses objects that we can use to check all the redirection flow.
  • We can't do HEAD request on some resources like a CGI (OK it's CGI... but interesting bit anyway)
  • The spec for HTML comments is very weird
  • An attribute like alt=bad dog! (without quote) could be interpreted as a boolean...
  • Comments are pseudo elements
  • The difference between UserAgent and Robot (read robots.txt + handle sleep time)

And I enjoyed the detailed sections about :

  • The parts of an URL
  • The methods allowed

Cool Man

Later the multiples sections about "reading the HTML" are interesting :

  • Dig into HTML with regex 😨 (pandas were harmed during the write of this paragraph, even if I know about this famous post responding to this other famous post)
  • Parsing sequentially with HTML::TokeParser
  • Parsing HTML with HTML::TreeBuilder and tree access (2020 recommended approach, eventually with another module)

As well as the details about dumped tokeparser get_token or unget_token and token sequences.

Or the parts about :

  • Ordering of traversing
  • Traversing recursively or by using a stack
  • Modifying the HTML tree

Later, we finally have some chapters about web basics like cookies and credentials (basic).

The spider chapter is cool and appendix are interesting also even if it's only flat documentation.

I really like the mindset of Sean, UTF8 and unicode are well mentionned and pushed, Microsoft and their non standart politics are well adressed and I found very funny the reference to All your pies are belong to us. 😃

Surprising

Something surprising was to see Sean using suck.com (not related to anything NSFW) as a basis of an example. Bad choice in my humble opinion 😃

There is A LOT of retrieval of string in HTML using regex... 😨

The disclaimer is present but very light and does not mention that it’s brittle and limited (for correctness : not completely true).

I have no problem to do some dirty code when we I do not need more that a quick-and-dirty implementation (no over-engineering for nothing)...

...but these days everybody is very touchy with HTML and regex 😆

Seems also that CSS selectors were not in use in 2002 :/

Alt Text

And finally the TPJ#17 article about OOP seemed for me totally off topic.

Conclusion

I'm always a bit shy about using scraping because I feel like it's very brittle. I was hoping to see some advanced and robust webscraping methods in this book but it does not provides this at all.

I basically read the whole book without skipping any section (sometimes just went faster on the code samples) so it's still worth to read book !

You can get this and read it you will still learn interesting stuff but there are also probably some better alternatives today (I think about Mojolicious Web Clients but I haven't read it right now).

Top comments (2)

Collapse
 
smonff profile image
🌌 Sébastien Feugère ☔

Glad to see I am not the only one who enjoys reading antic O'Reilly books. I guess you'll enjoy Mojolicious Web Clients...

Collapse
 
thibaultduponchelle profile image
Tib

There are some others to come 😄