<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vadim Beskrovnov</title>
    <description>The latest articles on DEV Community by Vadim Beskrovnov (@vbeskrovnov).</description>
    <link>https://dev.to/vbeskrovnov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F16613%2Fa1bbf2ca-68fc-4d8d-bdc7-2403b1b57b3b.jpeg</url>
      <title>DEV Community: Vadim Beskrovnov</title>
      <link>https://dev.to/vbeskrovnov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vbeskrovnov"/>
    <language>en</language>
    <item>
      <title>How to Boost Your Productivity as a Developer</title>
      <dc:creator>Vadim Beskrovnov</dc:creator>
      <pubDate>Sun, 04 Dec 2022 19:20:44 +0000</pubDate>
      <link>https://dev.to/vbeskrovnov/how-to-boost-your-productivity-as-a-developer-4ko9</link>
      <guid>https://dev.to/vbeskrovnov/how-to-boost-your-productivity-as-a-developer-4ko9</guid>
      <description>&lt;p&gt;Are you struggling to stay focused and productive as a developer? You're not alone. In today's fast-paced world, it can be difficult to stay on top of your work and achieve your goals. But fear not! In this article, I'll share some simple tips and tricks to help you boost your productivity and get more done in less time.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Create a to-do list. It sounds simple, but having a clear and concise to-do list can make a big difference in your productivity. Start each day by writing down the tasks you need to complete, and prioritize them based on importance and deadline. This will help you stay focused and avoid getting overwhelmed by a long list of tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set aside dedicated time for focused work. It's easy to get distracted by emails, social media, or other interruptions when you're working. To avoid these distractions, set aside dedicated time for focused work. This could be in the form of a daily time block, where you turn off all distractions and focus on your most important tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Take regular breaks. It may seem counterintuitive, but taking regular breaks can actually help boost your productivity. When you work for long periods of time without a break, your brain becomes fatigued and you lose focus. By taking regular breaks, you give your brain a chance to recharge and come back to your work with renewed energy and focus.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use productivity tools and apps. There are many productivity tools and apps available that can help you stay organized and on track. Some popular tools include Trello for project management, Evernote for note-taking, and Todoist for task management. Experiment with different tools to find the ones that work best for you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Surround yourself with a supportive community. As a developer, it can be easy to feel isolated and alone. But having a supportive community of like-minded individuals can make a big difference in your productivity. Consider joining a local developer meetup group, or participating in online forums and communities where you can share ideas and learn from others.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By following these tips, you can boost your productivity as a developer and achieve your goals more efficiently. Give them a try and see how they can help you stay focused and productive in your work.&lt;/p&gt;

&lt;p&gt;What are your favorite productivity tips for developers?&lt;/p&gt;

</description>
      <category>productivity</category>
    </item>
    <item>
      <title>Product design interview. My experience</title>
      <dc:creator>Vadim Beskrovnov</dc:creator>
      <pubDate>Sat, 10 Sep 2022 16:10:41 +0000</pubDate>
      <link>https://dev.to/vbeskrovnov/product-design-interview-my-experience-530n</link>
      <guid>https://dev.to/vbeskrovnov/product-design-interview-my-experience-530n</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;I would like to share my experience of passing a 45-minute product design interview (NOT a system design interview) at a tech company. Everything described below is my own experience; it is NOT a benchmark answer and does not claim to be one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Task
&lt;/h2&gt;

&lt;p&gt;You need to design an API to implement a news feed like the one in the following picture.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1wzenapuggkbydab6qn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1wzenapuggkbydab6qn.png" alt="Product design interview task"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Task understanding
&lt;/h3&gt;

&lt;p&gt;As usual, the task itself is very high-level and unclear, so our first goal is to ask as many important questions as possible to understand its context.&lt;br&gt;
Let's ask the first batch of questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Are there restrictions on the use of API types?
The answer may settle the first decision for us: which API style to use (REST, WebSocket, GraphQL, etc.). At the very least, it shows the interviewer that we care about it.&lt;/li&gt;
&lt;li&gt;How many users do we have, and what is the expected load?
This is one of the main factors that influence all decisions during the interview, so it is better to understand this requirement right away to save time.&lt;/li&gt;
&lt;li&gt;How many client types (iPhone, Android, PC, etc.) are expected? Will the design vary between them?
If there is only one client type, or at least the design is the same for all clients, we can build a single version of the API; otherwise we will have to solve an additional problem.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are definitely many more questions we could ask, but we should remember that time is limited, and we only need the most critical information to make decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  First steps
&lt;/h3&gt;

&lt;p&gt;Based on the answers above, we can start designing the required API. Let's imagine that we have to use a REST API. We need at least one endpoint to get a list of recent posts:&lt;/p&gt;

&lt;h6&gt;
  
  
  Endpoint
&lt;/h6&gt;

&lt;p&gt;&lt;code&gt;GET /posts&lt;/code&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  Request
&lt;/h6&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "count": 2
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h6&gt;
  
  
  Response
&lt;/h6&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "posts": [
    {
      "picture_url": "",
      "text": "",
      "comments_ammount": 42,
      "likes_amount": 146
    },   
    {
      "picture_url": "",
      "text": "",
      "comments_ammount": 22,
      "likes_amount": 246
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This endpoint allows us to render the exact layout we need, but it has multiple problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How are we going to load the next posts during scrolling? If we make another request with &lt;code&gt;count: 2&lt;/code&gt;, we can get the same posts again.&lt;/li&gt;
&lt;li&gt;How are we going to sort posts? What if the returned posts are the oldest ones, but we need the newest?&lt;/li&gt;
&lt;li&gt;How can we get the actual comments or the list of people who liked a post?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is not an exhaustive list of issues, but in my opinion they are the most critical, so let's try to fix them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improvements
&lt;/h3&gt;

&lt;p&gt;So how can we solve &lt;strong&gt;issue #1&lt;/strong&gt;? &lt;br&gt;
I can see two ways here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Simple but limited. Load a fixed maximum number of posts, let's say 1000, and allow users to scroll the feed only up to that number.&lt;/li&gt;
&lt;li&gt;Use pagination and get the posts in batches.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's hard to even call it a trade-off, as the choice is obvious. The best option is pagination, which allows us to get posts in portions.&lt;/p&gt;

&lt;h6&gt;
  
  
  Request
&lt;/h6&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "offset": 0,
    "limit": 2
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h6&gt;
  
  
  Response
&lt;/h6&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "offset": 0,
  "limit": 2,
  "posts": [
    {
      "picture_url": "",
      "text": "",
      "comments_ammount": 42,
      "likes_amount": 146
    },   
    {
      "picture_url": "",
      "text": "",
      "comments_ammount": 22,
      "likes_amount": 246
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can get posts batch by batch, which helps us avoid rendering the same post twice.&lt;/p&gt;

&lt;p&gt;To solve &lt;strong&gt;problem #2&lt;/strong&gt;, we need to add more fields to be able to sort our posts. We have at least two options here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;More flexible. We can accept two fields: &lt;code&gt;how&lt;/code&gt; to sort and &lt;code&gt;what&lt;/code&gt; to sort by. For example, sort in &lt;code&gt;ascending&lt;/code&gt; order by the &lt;code&gt;comments_ammount&lt;/code&gt; field. It makes our solution flexible but increases complexity.&lt;/li&gt;
&lt;li&gt;Simpler. For a news feed, in most cases users want to sort posts by date only. So we can have just one field: &lt;code&gt;how&lt;/code&gt; to sort. For example, sort in &lt;code&gt;ascending&lt;/code&gt; order to get the oldest posts first.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I would choose the simpler option for now; it can always be improved later.&lt;/p&gt;

&lt;h6&gt;
  
  
  Request
&lt;/h6&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "offset": 0,
    "limit": 2,
    "sort": "ASCENDING"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h6&gt;
  
  
  Response
&lt;/h6&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "offset": 0,
  "limit": 2,
  "posts": [
    {
      "picture_url": "",
      "text": "",
      "comments_ammount": 42,
      "likes_amount": 146,
      "date": "2022-09-08T09:45:55Z"
    },   
    {
      "picture_url": "",
      "text": "",
      "comments_ammount": 22,
      "likes_amount": 246,
      "date": "2022-09-10T09:45:55Z"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
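&lt;p&gt;A minimal Java sketch of how a server could apply the single &lt;code&gt;sort&lt;/code&gt; field by date. The &lt;code&gt;FeedSorter&lt;/code&gt; class, the reduced &lt;code&gt;Post&lt;/code&gt; record and the &lt;code&gt;DESCENDING&lt;/code&gt; value are assumptions made for illustration; only &lt;code&gt;ASCENDING&lt;/code&gt; and the &lt;code&gt;date&lt;/code&gt; field come from the request and response above.&lt;/p&gt;

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical helper that orders feed posts by date only. */
final class FeedSorter {

    /** Reduced post model; a real post would carry the other response fields too. */
    record Post(long id, Instant date) {}

    /** Returns a sorted copy; "ASCENDING" puts the oldest posts first. */
    static Post[] sortByDate(Post[] posts, String sort) {
        var oldestFirst = Comparator.comparing(Post::date);
        Post[] sorted = posts.clone(); // leave the caller's array untouched
        switch (sort) {
            case "ASCENDING" -> Arrays.sort(sorted, oldestFirst);
            // Assumed counterpart value: newest posts first.
            case "DESCENDING" -> Arrays.sort(sorted, oldestFirst.reversed());
            default -> throw new IllegalArgumentException("Unknown sort: " + sort);
        }
        return sorted;
    }
}
```

&lt;p&gt;Rejecting unknown values keeps the contract narrow, which fits the decision to start with the simpler option and extend it later.&lt;/p&gt;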



&lt;p&gt;Let's move on to problem #3: how do we see comments and likes? Again, there are several ways to solve it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Implement a separate endpoint to get all comments/likes for a post by &lt;code&gt;post_id&lt;/code&gt;. This approach gives better feed loading performance, as we transfer less data. But it makes users wait when they want to see comments.&lt;/li&gt;
&lt;li&gt;Include comments/likes in the feed response. This approach is the opposite of the previous one. It slows down feed loading, but speeds up viewing comments/likes.&lt;/li&gt;
&lt;li&gt;Hybrid approach. Include the top N comments/likes in the feed response and implement separate endpoints to get the full data. This one is the most complex, but also the most balanced.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The hybrid approach looks like the most appropriate one here, and it doesn't require a lot of effort, so let's use it.&lt;/p&gt;

&lt;h6&gt;
  
  
  Endpoint
&lt;/h6&gt;

&lt;p&gt;&lt;code&gt;GET /posts&lt;/code&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  Request
&lt;/h6&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "offset": 0,
    "limit": 2,
    "sort": "ASCENDING"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h6&gt;
  
  
  Response
&lt;/h6&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "offset": 0,
  "limit": 2,
  "posts": [
    {
      "id": 12345,
      "picture_url": "",
      "text": "",
      "comments_ammount": 42,
      "likes_amount": 146,
      "date": "2022-09-08T09:45:55Z",
      "top_comments": [
        {
          "id": 1,
          "author_id": 2,
          "text": ""
        },
        {
          "id": 2,
          "author_id": 4,
          "text": ""
        }
      ],
      "top_likes": [
        {
          "id": 1,
          "author_id": 5
        },
        {
          "id": 2,
          "author_id": 8
        }
      ]
    },   
    {
      "id": 12346,
      "picture_url": "",
      "text": "",
      "comments_ammount": 22,
      "likes_amount": 246,
      "date": "2022-09-10T09:45:55Z",
      "top_comments": [
        {
          "id": 4,
          "author_id": 10,
          "text": ""
        },
        {
          "id": 5,
          "author_id": 11,
          "text": ""
        }
      ],
      "top_likes": [
        {
          "id": 6,
          "author_id": 12
        },
        {
          "id": 7,
          "author_id": 16
        }
      ]
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h6&gt;
  
  
  Endpoint
&lt;/h6&gt;

&lt;p&gt;&lt;code&gt;GET /posts/{id}/comments&lt;/code&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  Request
&lt;/h6&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "offset": 0,
    "limit": 4,
    "sort": "ASCENDING"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h6&gt;
  
  
  Response
&lt;/h6&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "offset": 0,
  "limit": 4,
  "comments": [
    {
      "id": 4,
      "author_id": 10,
      "text": ""
    },
    {
      "id": 5,
      "author_id": 11,
      "text": ""
    },
    {
      "id": 6,
      "author_id": 10,
      "text": ""
    },
    {
      "id": 7,
      "author_id": 11,
      "text": ""
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The endpoint for likes looks the same, so we omit it here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time ended
&lt;/h3&gt;

&lt;p&gt;It is likely that by this point the 45-minute interview will have come to an end, and it is worth finalizing the decision. We have certainly prepared a working API, but there are still a lot of problems, and it is worth going over them quickly.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What if new posts are created while the user is scrolling? The &lt;code&gt;offset&lt;/code&gt; will be shifted, and the next request will return a post we have already received.&lt;/li&gt;
&lt;li&gt;What about picture loading speed? Should we return multiple picture URLs: &lt;code&gt;preview&lt;/code&gt; (small but low quality) and &lt;code&gt;fullsize&lt;/code&gt; (large with good quality)?&lt;/li&gt;
&lt;li&gt;How do we deliver new posts? Should we use long polling or a publish/subscribe model?&lt;/li&gt;
&lt;/ol&gt;
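&lt;p&gt;The first problem can be reproduced with a small Java sketch. It assumes an in-memory, newest-first feed; the &lt;code&gt;OffsetShiftDemo&lt;/code&gt; class and its &lt;code&gt;page&lt;/code&gt; and &lt;code&gt;publish&lt;/code&gt; methods are hypothetical names used only for illustration.&lt;/p&gt;

```java
import java.util.Arrays;

/** Shows how offset paging repeats a post when the feed grows between requests. */
final class OffsetShiftDemo {

    /** Returns up to `limit` posts starting at `offset` from a newest-first feed. */
    static String[] page(String[] feed, int offset, int limit) {
        int from = Math.min(Math.max(offset, 0), feed.length);
        int size = Math.min(Math.max(limit, 0), feed.length - from);
        return Arrays.copyOfRange(feed, from, from + size);
    }

    /** Returns a copy of the feed with a new post placed first, as the newest one. */
    static String[] publish(String[] feed, String post) {
        String[] updated = new String[feed.length + 1];
        updated[0] = post;
        System.arraycopy(feed, 0, updated, 1, feed.length);
        return updated;
    }
}
```

&lt;p&gt;With a feed of &lt;code&gt;post-3, post-2, post-1&lt;/code&gt;, the call &lt;code&gt;page(feed, 0, 2)&lt;/code&gt; returns &lt;code&gt;post-3&lt;/code&gt; and &lt;code&gt;post-2&lt;/code&gt;. If &lt;code&gt;post-4&lt;/code&gt; is then published, &lt;code&gt;page(feed, 2, 2)&lt;/code&gt; returns &lt;code&gt;post-2&lt;/code&gt; and &lt;code&gt;post-1&lt;/code&gt;, so &lt;code&gt;post-2&lt;/code&gt; is received twice.&lt;/p&gt;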

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Note that there are usually several options for solving an issue (trade-offs). Your job as a candidate is to show that you see them and to justify your choice of the best one. And don't worry if, after the interview, you realize that you missed something. The goal of the interview is to see how you think, not to produce the real design of the product.&lt;/p&gt;

</description>
      <category>interview</category>
      <category>productdesign</category>
      <category>architecture</category>
      <category>api</category>
    </item>
    <item>
      <title>How I parsed estate marketplace to build price graph stats</title>
      <dc:creator>Vadim Beskrovnov</dc:creator>
      <pubDate>Tue, 09 Aug 2022 21:53:27 +0000</pubDate>
      <link>https://dev.to/vbeskrovnov/how-i-parsed-estate-marketplace-to-build-price-graph-stats-45b5</link>
      <guid>https://dev.to/vbeskrovnov/how-i-parsed-estate-marketplace-to-build-price-graph-stats-45b5</guid>
      <description>&lt;h2&gt;
  
  
  My goal
&lt;/h2&gt;

&lt;p&gt;I was looking for a flat to buy, and I wanted to find something cheaper than the market price. The real estate market is quite efficient, so it is almost impossible to find a cheap property manually. That’s why I decided to automate the process. &lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Idea
&lt;/h3&gt;

&lt;p&gt;To understand whether a specific item is “cheap” or not, we need historical data on previous deals for this property, or at least for similar properties in the same location. There is no public service that provides deal data, but there are a lot of property websites that let you find sales offers. I decided to write an application that parses one such site and saves all properties to a database. Then I could run queries and analyse the price trend in a specific area.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools
&lt;/h3&gt;

&lt;p&gt;Based on my experience, I chose Java to implement it. I started with a command-line application using Spring Boot and Spring Batch. The high-level data flow looked as follows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FaqEohwm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dps4nxt4p36s2bgvwqzh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FaqEohwm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dps4nxt4p36s2bgvwqzh.png" alt="Pipeline architecture" width="880" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's go through these components one by one.&lt;/p&gt;

&lt;h4&gt;
  
  
  Properties website
&lt;/h4&gt;

&lt;p&gt;This is the website to parse. At that time I was interested in property in Russia, so I used a local portal: &lt;a href="https://www.avito.ru"&gt;https://www.avito.ru&lt;/a&gt;. It has multiple categories, including flats. The structure is as follows: each category contains a list of ads split across multiple pages, with 50 items per page. Each item contains the information about a specific property that I needed.&lt;/p&gt;

&lt;h4&gt;
  
  
  Page Parser
&lt;/h4&gt;

&lt;p&gt;This is the first component of my application. It receives a category URL as input: &lt;code&gt;/moskva/kvartiry/prodam-ASgBAgICAUSSA8YQ?cd=1&amp;amp;p={page}&lt;/code&gt;. As you can see, there are two parameters: the first one, &lt;code&gt;cd&lt;/code&gt;, is always the same, and the second one, &lt;code&gt;p&lt;/code&gt;, is the page number. Then, using the &lt;code&gt;jsoup&lt;/code&gt; library, I read each page in a loop and collected the item URLs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Elements items = document.select(".item");
for (Element item: items) {
    Elements itemElement = item.select(".item-description-title-link");
    String relativeItemReference = itemElement.attr("href");
    urls.add(relativeItemReference);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After reading each page, I sent the list of URLs to the next component.&lt;/p&gt;

&lt;h4&gt;
  
  
  ID exists filter
&lt;/h4&gt;

&lt;p&gt;Each item has a URL like &lt;code&gt;/moskva/kvartiry/2-k._kvartira_548m_911et._2338886814&lt;/code&gt;, which contains an identifier at the end (&lt;code&gt;2338886814&lt;/code&gt;). This is the unique ID of the ad. I used it as a cache key to avoid parsing the same item twice.&lt;/p&gt;

&lt;p&gt;Some items could still be parsed twice, because the cache was written later in the pipeline, so several ads with the same ID could pass this gate before the first one was recorded.&lt;/p&gt;
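&lt;p&gt;A minimal Java sketch of such a gate, not the code from my application: it extracts the ID after the last underscore and records it in the same step as the check, which is one way to close the gap described above. The &lt;code&gt;AdIdFilter&lt;/code&gt; class and its method names are hypothetical, and the in-memory set only lives for one call, whereas a real cache would outlive a single batch.&lt;/p&gt;

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of an "ID exists" gate over a batch of item URLs. */
final class AdIdFilter {

    /** Returns the numeric ad ID that follows the last underscore of an item URL. */
    static long extractId(String itemUrl) {
        return Long.parseLong(itemUrl.substring(itemUrl.lastIndexOf('_') + 1));
    }

    /** Keeps only the first URL seen for each ad ID, preserving input order. */
    static String[] keepUnseen(String[] itemUrls) {
        // Element type is inferred as Object; only boxed Long IDs are stored.
        var seenIds = ConcurrentHashMap.newKeySet();
        return Arrays.stream(itemUrls)
                // add() checks and records the ID in one atomic step and
                // returns false when the ID was already present.
                .filter(url -> seenIds.add(extractId(url)))
                .toArray(String[]::new);
    }
}
```

&lt;p&gt;For the URL above, &lt;code&gt;extractId&lt;/code&gt; returns &lt;code&gt;2338886814&lt;/code&gt;, which does not fit into an &lt;code&gt;int&lt;/code&gt;, hence the &lt;code&gt;long&lt;/code&gt; return type.&lt;/p&gt;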

&lt;h4&gt;
  
  
  Item Parser
&lt;/h4&gt;

&lt;p&gt;After the filter, all unique IDs went to the next component, the &lt;code&gt;Item Parser&lt;/code&gt;. It uses the ID to open the item page and read all the data from it.&lt;br&gt;
&lt;br&gt;
 &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Elements attributes = doc.select(".item-params-list-item");
Map &amp;lt;String, String&amp;gt; attrs = attributes.stream().collect(Collectors.toMap(
    attr -&amp;gt; attr.text().split(":")[0].trim(),
    attr -&amp;gt; attr.text().split(":")[1].trim()));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
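&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// "Общая площадь" is the site's Russian label for the total area.
// Note: parseDouble throws NumberFormatException if the attribute is missing.
estate
    .setTotalSpace(Double.parseDouble(attrs.getOrDefault("Общая площадь", "").split(" ")[0]));

// "Жилая площадь" is the site's Russian label for the living area.
estate
    .setLiveSpace(Double.parseDouble(attrs.getOrDefault("Жилая площадь", "").split(" ")[0]));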

...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;As a result, an object with all the property info is built and passed on to the Saver component.&lt;/p&gt;
&lt;h4&gt;
  
  
  Saver
&lt;/h4&gt;

&lt;p&gt;This component is the last one in my pipeline. It receives items from the &lt;code&gt;Item Parser&lt;/code&gt;, converts them to JSON, and saves them to Elasticsearch in batches to improve performance.&lt;/p&gt;

&lt;p&gt;As a result, I was able to build multiple Kibana dashboards with price and popularity metrics. One of the most useful components is the interactive map, which renders data that has coordinates (I got the coordinates from the ad's description). It helps me find a perfect property in a good area at a good price.&lt;/p&gt;
&lt;h3&gt;
  
  
  Problems
&lt;/h3&gt;

&lt;p&gt;During this experiment, I faced some problems and tried different solutions, which I want to share.&lt;/p&gt;
&lt;h4&gt;
  
  
  IP address blocking
&lt;/h4&gt;

&lt;p&gt;As you can guess, nobody wants their data to be parsed, so this site has several layers of protection. During development everything worked fine, because I made only a small number of requests. But as soon as I started testing, I ran into a huge number of 403 errors.&lt;/p&gt;

&lt;p&gt;First, I tried to use multiple headers and cookies to simulate a real user with a browser.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Document imageDoc = Jsoup
    .connect(url)
    .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36")
    .header("referer", "https://www.avito.ru" + relativeItemReference)
    .header("accept", "*/*")
    .ignoreContentType(true)
    .get();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It didn’t help me. I think that they have much more intelligent checks than just verifying &lt;code&gt;userAgent&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;So my next attempt was to find the smallest timeout between requests that would let me avoid blocking. To find it, I used free VPN services so that I could quickly change IP addresses. Experimentally, I arrived at a minimum timeout of 25 seconds. But that means I could parse only ~3500 items per day, which is definitely not enough.&lt;/p&gt;
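&lt;p&gt;The arithmetic behind that estimate, as a small Java sketch. It assumes one request per item and ignores the time spent on the request itself; the &lt;code&gt;ParsingThroughput&lt;/code&gt; class is a hypothetical name, and the multi-worker variant is only an upper bound under the assumption that each worker is throttled independently.&lt;/p&gt;

```java
/** Back-of-the-envelope throughput for a throttled parser. */
final class ParsingThroughput {

    /** Seconds in one day: 86400. */
    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    /** Items one worker can fetch per day when it waits delaySeconds between requests. */
    static long itemsPerDay(long delaySeconds) {
        if (!(delaySeconds > 0)) {
            throw new IllegalArgumentException("delaySeconds must be positive");
        }
        return SECONDS_PER_DAY / delaySeconds;
    }

    /** Upper bound for several workers, assuming each one is throttled independently. */
    static long itemsPerDay(long delaySeconds, int workers) {
        return workers * itemsPerDay(delaySeconds);
    }
}
```

&lt;p&gt;Here &lt;code&gt;itemsPerDay(25)&lt;/code&gt; gives 86400 / 25 = 3456, which is the ~3500 items per day mentioned above.&lt;/p&gt;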

&lt;p&gt;To increase parsing speed, I decided to parallelize my algorithm and use a separate proxy for each thread.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doc = Jsoup.connect(url)
    .proxy(proxy.getHost(), proxy.getPort())
    .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2")
    .header("Content-Language", "en-US")
    .timeout(timeout)
    .get();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now my only limitation was the number of proxies available. I didn’t want to pay for them, so I used free public proxies, some of which were slow and some unstable. &lt;/p&gt;

&lt;p&gt;My next improvement was to choose the best proxies from my list for each request. I made a scheduled job that checked each proxy every N minutes and saved useful metadata such as connection speed and number of errors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Scheduled(fixedDelay = 100, initialDelay = 1)
@Transactional
public void checkProxy() {
    ProxyEntity foundProxy = getProxyWithOldestUpdate()
        .orElseThrow(() -&amp;gt; new RuntimeException("Proxy not found"));

    int retries = 0;
    while (retries &amp;lt; retryCount) {
        log.debug("Attempt {}", retries + 1);
        if (checkProxy(foundProxy)) {
            log.debug("Proxy [{}] UP", foundProxy.getHost());
            foundProxy.setActive(true);
            break;
        }
        RequestUtils.wait(1000);
        retries++;
    }
    if (retries == retryCount) {
        foundProxy.setActive(false);
        log.debug("Proxy [{}] DOWN", foundProxy.getHost());
    }
    foundProxy.setCheckDate(LocalDateTime.now());
    proxyRepository.save(foundProxy);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Items duplicating
&lt;/h4&gt;

&lt;p&gt;Sometimes people create a new ad for the same property, so the system sees two different ads with different IDs. That is fine for a property website, but not for statistics and data analysis. &lt;/p&gt;

&lt;p&gt;To get rid of duplicated items in my database, I simply compared titles and descriptions as strings. Sometimes this gives a false positive: I remove not a real duplicate but a different ad with the same text. Here the situation is reversed: that is totally fine for data analysis but would be critical for a property website. Anyway, it solved my problem.&lt;/p&gt;
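&lt;p&gt;A minimal Java sketch of this kind of text-based deduplication, assuming exact string equality of title and description. The &lt;code&gt;AdDeduplicator&lt;/code&gt; class and the &lt;code&gt;Ad&lt;/code&gt; record are hypothetical and work on an in-memory array, not on the database.&lt;/p&gt;

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.stream.Collectors;

/** Drops ads whose title and description repeat an earlier ad. */
final class AdDeduplicator {

    /** Reduced ad model with only the fields needed for the comparison. */
    record Ad(long id, String title, String description) {
        /** Comparison key: title and description joined by a line break. */
        String textKey() {
            return title + "\n" + description;
        }
    }

    /** Keeps the first ad for every distinct text key, preserving input order. */
    static Ad[] dropTextDuplicates(Ad[] ads) {
        return Arrays.stream(ads)
                .collect(Collectors.toMap(
                        Ad::textKey,
                        ad -> ad,
                        (first, second) -> first, // on a key clash keep the earlier ad
                        LinkedHashMap::new))      // insertion order keeps results stable
                .values()
                .toArray(new Ad[0]);
    }
}
```

&lt;p&gt;Two ads with different IDs but identical title and description collapse into one, which is exactly where the false positives described above come from.&lt;/p&gt;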

</description>
      <category>java</category>
      <category>parsing</category>
      <category>scrapping</category>
    </item>
  </channel>
</rss>
