<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Urvir</title>
    <description>The latest articles on DEV Community by Urvir (@urvir).</description>
    <link>https://dev.to/urvir</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F467674%2Fc19c66bc-c707-425a-b270-07be192ab0fd.jpg</url>
      <title>DEV Community: Urvir</title>
      <link>https://dev.to/urvir</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/urvir"/>
    <language>en</language>
    <item>
      <title>Necessity is the mother of invention</title>
      <dc:creator>Urvir</dc:creator>
      <pubDate>Sat, 12 Sep 2020 17:37:07 +0000</pubDate>
      <link>https://dev.to/urvir/necessity-is-mother-of-invention-2k75</link>
      <guid>https://dev.to/urvir/necessity-is-mother-of-invention-2k75</guid>
      <description>&lt;p&gt;During this lockdown my mother came to me asking for her Gujarati newspaper. Since newspapers weren't allowed to be delivered in our residential complex, I tried to find an e-paper for her. The paper was available for free on the newspaper's website, but every page was stored as a separate PDF file, one file per page.&lt;/p&gt;

&lt;p&gt;I am a big fan of Python's approach to solving practical problems, and of its ever-growing list of modules and libraries for everything under the sun.&lt;/p&gt;

&lt;p&gt;I experimented and combined different methods to get the required result.&lt;/p&gt;

&lt;p&gt;I used BeautifulSoup to scrape the data and PyPDF2 to read and merge the files, with tips from Stack Overflow :)&lt;/p&gt;

&lt;p&gt;Below is the code. It leaves a single merged PDF in a directory named with today's date. She is able to run it from her smartphone using Pydroid3.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import requests
import os
from datetime import datetime

# Create a directory named after today's date and work inside it.
today = datetime.today().strftime('%d-%m-%Y')
if not os.path.exists(today):
    os.mkdir(today)
os.chdir(os.path.join(os.getcwd(), today))

# Fetch the e-paper page and collect every link on it.
req = Request("http://www.newspapaersomething.com/frmEPShow.aspx")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")

links = [link.get('href') for link in soup.findAll('a')]

# Skip the first 15 links, which are not page links on this site.
del links[0:15]

# Download each page PDF, named after the last part of its URL.
for url in links:
    r = requests.get(url, allow_redirects=True)
    name = url.split("/")[-1]
    with open(name, 'wb') as f:
        f.write(r.content)
&lt;/code&gt;&lt;/pre&gt;
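
&lt;p&gt;A side note on naming the downloaded files: splitting the URL on "/" works, but the standard library handles it more robustly, ignoring query strings and fragments. A small sketch (the example URL here is made up):&lt;/p&gt;

```python
from urllib.parse import urlparse
import posixpath

def filename_from_url(url):
    """Return the last path component of a URL, e.g. 'page1.pdf'."""
    return posixpath.basename(urlparse(url).path)

# Unlike a plain split("/"), the query string is not glued onto the name.
print(filename_from_url("http://example.com/epaper/page1.pdf?dl=1"))  # page1.pdf
```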

&lt;pre&gt;&lt;code&gt;from PyPDF2 import PdfFileMerger

def mergeIntoOnePDF(path):
    # Merge every downloaded page PDF in sorted filename order,
    # excluding the output file itself in case the script is re-run.
    pdf_files = sorted(f for f in os.listdir(path)
                       if f.endswith('.pdf') and f != "merged_full.pdf")
    merger = PdfFileMerger()
    for filename in pdf_files:
        merger.append(os.path.join(path, filename))
    merger.write(os.path.join(path, "merged_full.pdf"))
    merger.close()

mergeIntoOnePDF(os.getcwd())
&lt;/code&gt;&lt;/pre&gt;
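
&lt;p&gt;One caveat when merging: os.listdir returns files in arbitrary order, and a plain string sort puts page10.pdf before page2.pdf. If the site names its files with page numbers, a numeric sort key keeps the pages in reading order. A sketch, with illustrative filenames:&lt;/p&gt;

```python
import re

def page_key(name):
    # Sort by the first number in the filename; names without a number go first.
    m = re.search(r'\d+', name)
    return int(m.group()) if m else -1

files = ["page10.pdf", "page2.pdf", "page1.pdf"]
print(sorted(files, key=page_key))  # ['page1.pdf', 'page2.pdf', 'page10.pdf']
```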

</description>
      <category>python</category>
      <category>pdf</category>
      <category>webscrapping</category>
      <category>pydroid3</category>
    </item>
  </channel>
</rss>
