<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sajid Shaikh</title>
    <description>The latest articles on DEV Community by Sajid Shaikh (@shaikhsajid1111).</description>
    <link>https://dev.to/shaikhsajid1111</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F470754%2Fb88080d2-3d51-46ff-b690-f71534969bfe.jpg</url>
      <title>DEV Community: Sajid Shaikh</title>
      <link>https://dev.to/shaikhsajid1111</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shaikhsajid1111"/>
    <language>en</language>
    <item>
      <title>Scrape Twitter profiles and hashtags</title>
      <dc:creator>Sajid Shaikh</dc:creator>
      <pubDate>Mon, 01 Nov 2021 07:18:10 +0000</pubDate>
      <link>https://dev.to/shaikhsajid1111/scrape-twitter-profiles-and-hashtags-i68</link>
      <guid>https://dev.to/shaikhsajid1111/scrape-twitter-profiles-and-hashtags-i68</guid>
      <description>&lt;p&gt;I was going through &lt;a href="https://github.com/bisguzar/twitter-scraper" rel="noopener noreferrer"&gt;this&lt;/a&gt; project that scrapes Twitter. However, it no longer works properly, because Twitter has changed its front-end code structure and even the way tweets are fetched from the backend. Sending an HTTP request and parsing the HTML source to get a tweet's data no longer works, and I needed more data than Twitter's API offers. So I created this project, which runs a headless web browser to get the tweet data.&lt;/p&gt;

&lt;p&gt;What data do we get?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;td&gt;Key&lt;/td&gt;
            &lt;td&gt;Type&lt;/td&gt;
            &lt;td&gt;Description&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
        &lt;tr&gt;
            &lt;td&gt;tweet_id&lt;/td&gt;
            &lt;td&gt;String&lt;/td&gt;
            &lt;td&gt;Post identifier (an integer stored as a string)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;username&lt;/td&gt;
            &lt;td&gt;String&lt;/td&gt;
            &lt;td&gt;Username of the profile&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;name&lt;/td&gt;
            &lt;td&gt;String&lt;/td&gt;
            &lt;td&gt;Name of the profile&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;profile_picture&lt;/td&gt;
            &lt;td&gt;String&lt;/td&gt;
            &lt;td&gt;Profile Picture link&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;replies&lt;/td&gt;
            &lt;td&gt;Integer&lt;/td&gt;
            &lt;td&gt;Number of replies to the tweet&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;retweets&lt;/td&gt;
            &lt;td&gt;Integer&lt;/td&gt;
            &lt;td&gt;Number of retweets of tweet&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;likes&lt;/td&gt;
            &lt;td&gt;Integer&lt;/td&gt;
            &lt;td&gt;Number of likes of tweet&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;is_retweet&lt;/td&gt;
            &lt;td&gt;Boolean&lt;/td&gt;
            &lt;td&gt;Is the tweet a retweet?&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;retweet_link&lt;/td&gt;
            &lt;td&gt;String&lt;/td&gt;
            &lt;td&gt;The retweet link if the tweet is a retweet, otherwise an empty string&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;posted_time&lt;/td&gt;
            &lt;td&gt;String&lt;/td&gt;
            &lt;td&gt;Time when the tweet was posted, in ISO 8601 format&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;content&lt;/td&gt;
            &lt;td&gt;String&lt;/td&gt;
            &lt;td&gt;Content of the tweet as text&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;hashtags&lt;/td&gt;
            &lt;td&gt;Array&lt;/td&gt;
            &lt;td&gt;Hashtags present in the tweet, if any&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;mentions&lt;/td&gt;
            &lt;td&gt;Array&lt;/td&gt;
            &lt;td&gt;Mentions present in the tweet, if any&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;images&lt;/td&gt;
            &lt;td&gt;Array&lt;/td&gt;
            &lt;td&gt;Image links, if any are present in the tweet&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;videos&lt;/td&gt;
            &lt;td&gt;Array&lt;/td&gt;
            &lt;td&gt;Video links, if any are present in the tweet&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;tweet_url&lt;/td&gt;
            &lt;td&gt;String&lt;/td&gt;
            &lt;td&gt;URL of the tweet&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;link&lt;/td&gt;
            &lt;td&gt;String&lt;/td&gt;
            &lt;td&gt;Link to an external website, if the tweet contains one&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
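
&lt;p&gt;As a rough sketch, the keys above could be checked in Python like this. The &lt;code&gt;TWEET_FIELDS&lt;/code&gt; mapping, the &lt;code&gt;check_tweet&lt;/code&gt; helper, and the sample record are all hypothetical and not part of the library. The sketch assumes each tweet arrives as a plain dictionary and that &lt;code&gt;posted_time&lt;/code&gt; carries a numeric UTC offset.&lt;/p&gt;

```python
from datetime import datetime

# Expected Python type for each key, following the table above.
TWEET_FIELDS = {
    "tweet_id": str,
    "username": str,
    "name": str,
    "profile_picture": str,
    "replies": int,
    "retweets": int,
    "likes": int,
    "is_retweet": bool,
    "retweet_link": str,
    "posted_time": str,
    "content": str,
    "hashtags": list,
    "mentions": list,
    "images": list,
    "videos": list,
    "tweet_url": str,
    "link": str,
}


def check_tweet(tweet):
    """Return a list of problems found in one tweet record (empty if none)."""
    problems = []
    for key, expected in TWEET_FIELDS.items():
        if key not in tweet:
            problems.append("missing key: " + key)
        elif not isinstance(tweet[key], expected):
            problems.append("wrong type for " + key)
    return problems


# Hypothetical record, invented for illustration only.
sample = {
    "tweet_id": "1234567890",
    "username": "example_user",
    "name": "Example User",
    "profile_picture": "https://example.com/avatar.jpg",
    "replies": 3,
    "retweets": 5,
    "likes": 8,
    "is_retweet": False,
    "retweet_link": "",
    "posted_time": "2021-11-01T07:18:10+00:00",
    "content": "Hello #india",
    "hashtags": ["india"],
    "mentions": [],
    "images": [],
    "videos": [],
    "tweet_url": "https://example.com/example_user/status/1234567890",
    "link": "",
}

print(check_tweet(sample))

# posted_time is an ISO 8601 string, so the standard library can parse it.
posted = datetime.fromisoformat(sample["posted_time"])
print(posted.year, posted.month, posted.day)
```

&lt;p&gt;Since the IDs and timestamps are strings, they have to be converted explicitly before sorting or comparing tweets.&lt;/p&gt;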

&lt;p&gt;What can we scrape?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tweets from any profile that exists on Twitter.&lt;/li&gt;
&lt;li&gt;Tweets matching a keyword, like "google".&lt;/li&gt;
&lt;li&gt;Tweets with a hashtag, like "#india".&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What if the IP gets blocked due to too many requests?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It supports proxies, both authenticated and unauthenticated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To learn more about its usage, check out the repository &lt;a href="https://github.com/shaikhsajid1111/twitter-scraper-selenium" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>selenium</category>
      <category>twitter</category>
      <category>scraping</category>
    </item>
    <item>
      <title>Scrape Facebook public pages without an API key or limitations</title>
      <dc:creator>Sajid Shaikh</dc:creator>
      <pubDate>Mon, 04 Jan 2021 15:02:44 +0000</pubDate>
      <link>https://dev.to/shaikhsajid1111/scrape-facebook-public-pages-without-an-api-key-or-limitations-43d4</link>
      <guid>https://dev.to/shaikhsajid1111/scrape-facebook-public-pages-without-an-api-key-or-limitations-43d4</guid>
      <description>&lt;p&gt;Facebook's API is really difficult to set up and is rate limited as well. Why not get public data with some automation? Here's a Python library that does the job.&lt;/p&gt;

&lt;p&gt;Install it from PyPI:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install facebook-page-scraper&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Or install it from source. Download it using git:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git clone https://github.com/shaikhsajid1111/facebook_page_scraper.git&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Then open a terminal inside the cloned folder and run:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python3 setup.py install&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;How to use it?&lt;br&gt;
It's simple!&lt;br&gt;
Just import the class from the package, instantiate it, and start scraping.&lt;/p&gt;

&lt;p&gt;Suppose I want posts from Facebook AI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;facebook_page_scraper&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Facebook_scraper&lt;/span&gt;

&lt;span class="c1"&gt;#instantiate the Facebook_scraper class
&lt;/span&gt;
&lt;span class="n"&gt;page_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facebookai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;posts_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;firefox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;facebook_ai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Facebook_scraper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;posts_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That was the instantiation part. If you want the data in JSON format, just call the &lt;code&gt;scrap_to_json()&lt;/code&gt; method, like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;json_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;facebook_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scrap_to_json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And you will get JSON output like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"1730063790503900"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Facebook AI"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"shares"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"reactions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"likes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;305&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"loves"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"wow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"cares"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"sad"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"angry"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"haha"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"reaction_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;343&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"comments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"We&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;2019re training computer vision models that leverage Transformers, a deep neural network architecture. Data-efficient image Transformers (DeiT) use less data and computing resources to produce high-performance image classification AI models.  We hope to advance the field of computer vision by sharing this work with the broader community, making large-scale systems that train AI models more accessible to researchers and engineers."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"posted_on"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2020-12-24T04:05:27"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"video"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"https://scontent-bom1-2.xx.fbcdn.net/v/t39.2365-6/p540x282/131570013_988138305044034_3894567585410559092_n.png?_nc_cat=109&amp;amp;ccb=2&amp;amp;_nc_sid=eaa83b&amp;amp;_nc_ohc=mAeDelparrEAX-3Mk7E&amp;amp;_nc_ht=scontent-bom1-2.xx&amp;amp;_nc_tp=30&amp;amp;oh=3fedb0e3cea6ad6f934ca20f77bec624&amp;amp;oe=600CB4C9"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"post_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.facebook.com/facebookai/posts/1730063790503900"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
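
&lt;p&gt;Output shaped like the sample above can be read back with the standard &lt;code&gt;json&lt;/code&gt; module. This is only a sketch: it assumes &lt;code&gt;scrap_to_json()&lt;/code&gt; returns JSON text rather than a dictionary, and it uses a trimmed copy of the sample instead of a live scrape.&lt;/p&gt;

```python
import json

# Trimmed copy of the sample output above. In real use this string would
# come from scrap_to_json() (assumption: it returns JSON text, not a dict).
json_data = """
{
    "1730063790503900": {
        "name": "Facebook AI",
        "shares": 65,
        "reactions": {
            "likes": 305,
            "loves": 31,
            "wow": 7,
            "cares": 0,
            "sad": 0,
            "angry": 0,
            "haha": 0
        },
        "reaction_count": 343,
        "comments": 11,
        "posted_on": "2020-12-24T04:05:27",
        "post_url": "https://www.facebook.com/facebookai/posts/1730063790503900"
    }
}
"""

# The top-level object maps each post ID to that post's details.
posts = json.loads(json_data)

for post_id, post in posts.items():
    # Recompute the total from the per-reaction counts as a sanity check.
    total = sum(post["reactions"].values())
    print(post_id, post["posted_on"], total, post["post_url"])
```

&lt;p&gt;Because the post ID is the dictionary key, a single post can be looked up directly by its ID.&lt;/p&gt;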



&lt;p&gt;If you want to save the data directly to a CSV file, just call the &lt;code&gt;scrap_to_csv()&lt;/code&gt; method, like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# file name without the .csv extension, where the data will be saved
&lt;/span&gt;&lt;span class="n"&gt;directory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;E:&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# directory where the CSV file will be saved (backslash escaped)
&lt;/span&gt;&lt;span class="n"&gt;facebook_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scrap_to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;directory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ff2nidz5yyd955374bpzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ff2nidz5yyd955374bpzj.png" alt="CSV output" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/shaikhsajid1111/facebook_page_scraper" rel="noopener noreferrer"&gt;source&lt;/a&gt; &lt;/p&gt;

</description>
      <category>facebook</category>
      <category>python</category>
      <category>scraping</category>
      <category>selenium</category>
    </item>
  </channel>
</rss>
