<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dvir_BD</title>
    <description>The latest articles on DEV Community by Dvir_BD (@dvir).</description>
    <link>https://dev.to/dvir</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1057210%2F3e954ee5-d521-41a2-9642-575efe2e7059.png</url>
      <title>DEV Community: Dvir_BD</title>
      <link>https://dev.to/dvir</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dvir"/>
    <language>en</language>
    <item>
      <title>A Community-Driven Data Exploration Journey: Airbnb Property Data &amp; Bright Data</title>
      <dc:creator>Dvir_BD</dc:creator>
      <pubDate>Mon, 05 Jun 2023 09:24:49 +0000</pubDate>
      <link>https://dev.to/dvir/a-community-driven-data-exploration-journey-airbnb-property-data-bright-data-29bb</link>
      <guid>https://dev.to/dvir/a-community-driven-data-exploration-journey-airbnb-property-data-bright-data-29bb</guid>
      <description>&lt;p&gt;I am excited to share an intriguing use case that showcases not only the potential of data marketplaces and data analysis, but also the incredible value of the global developer community.&lt;/p&gt;

&lt;p&gt;Recently, I embarked on a project to analyze the real estate market in Salt Lake City - a city predicted to be the next hotspot for timeshares and vacation rentals. I built a dataset of 918 Airbnb properties in Salt Lake City, each accommodating five guests or fewer, which I plan to use for price comparisons, market analysis, and cool visualizations.&lt;/p&gt;

&lt;p&gt;The exciting part? Instead of spending hours painstakingly scraping data, I got it directly from Bright Data’s marketplace! This high-quality dataset is accurate and ready to use, leaving me more time to focus on analyzing the data rather than gathering it.&lt;/p&gt;

&lt;p&gt;However, I quickly faced a new challenge - how could I best visualize this information and extract valuable insights from the dataset? Being a self-taught (and continuously learning) data analyst, I decided to leverage the wisdom of the Reddit community to help navigate this problem.&lt;/p&gt;

&lt;p&gt;My initial questions were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How do the prices of these properties vary by location, number of bedrooms, or amenities offered?&lt;/li&gt;
&lt;li&gt;Are there any patterns or trends in the reviews or ratings of these properties?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I also needed advice on the best tools or techniques to explore this dataset and answer these questions. Python, Excel, or any other data analysis software - all suggestions were welcome!&lt;/p&gt;
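To make the first question concrete, here is a minimal pure-Python sketch of grouping listings by bedroom count and averaging their nightly price. The column names (`price`, `bedrooms`) and the inline sample are invented for illustration; the real Bright Data export will have its own schema.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# A tiny stand-in for the Airbnb CSV export (column names are assumed).
sample = """price,bedrooms
120,1
150,2
95,1
210,3
180,2
"""

def avg_price_by_bedrooms(csv_text):
    """Group listings by bedroom count and average their nightly price."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[int(row["bedrooms"])].append(float(row["price"]))
    return {beds: mean(prices) for beds, prices in sorted(groups.items())}

print(avg_price_by_bedrooms(sample))  # -> {1: 107.5, 2: 165.0, 3: 210.0}
```

The same grouping idea extends naturally to location or amenities, and libraries like pandas would shorten it considerably.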

&lt;p&gt;The response from the community was overwhelming. Not only did they share their insights and expertise, but one member even set up a GitHub project to streamline our collective efforts. Together, we transformed the raw Airbnb data into insightful graphs and charts, shedding light on previously hidden trends and patterns.&lt;/p&gt;

&lt;p&gt;This journey has reinforced two key insights:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The power of data marketplaces like Bright Data in providing reliable and ready-to-use data.&lt;/li&gt;
&lt;li&gt;The value of community in problem-solving and innovation. When developers collaborate, the potential for innovation becomes boundless.
If you're interested in the project, check out the &lt;a href="https://github.com/yuchen927/python_salt_lake_city_airbnb/blob/main/salt_lake_city_airbnb.ipynb"&gt;GitHub repository&lt;/a&gt;.
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>datasets</category>
      <category>airbnb</category>
      <category>reddit</category>
      <category>github</category>
    </item>
    <item>
      <title>The 8 biggest myths about web scraping 😤</title>
      <dc:creator>Dvir_BD</dc:creator>
      <pubDate>Mon, 22 May 2023 11:25:41 +0000</pubDate>
      <link>https://dev.to/dvir/the-8-biggest-myths-about-web-scraping-3l1e</link>
      <guid>https://dev.to/dvir/the-8-biggest-myths-about-web-scraping-3l1e</guid>
      <description>&lt;h2&gt;
  
  
  Myth #1: Web scraping is not a legal practice
&lt;/h2&gt;

&lt;p&gt;Many people have the misconception that web scraping is illegal. In fact, it is legal as long as one does not collect password-protected information or Personally Identifiable Information (PII). The other thing to pay attention to is the Terms of Service (ToS) of target websites: make sure their rules, regulations, and stipulations are followed when collecting information from a specific site. Companies that target anonymized, publicly available web data and work only with data collection networks that are CCPA- and GDPR-compliant are on solid legal ground.&lt;/p&gt;

&lt;p&gt;In the United States, there are no federal laws prohibiting web scraping as long as the information being collected is public and no harm is done to the target site in the process. In the European Union and the United Kingdom, scraping is viewed from an intellectual property standpoint under the Digital Services Act, which holds that ‘the reproduction of publicly available content’ is not illegal - meaning that as long as the data collected is publicly available, you are legally in the clear.&lt;/p&gt;

&lt;h2&gt;
  
  
  Myth #2: Scraping is Only for Developers
&lt;/h2&gt;

&lt;p&gt;This is one of the more common myths. Many professionals with no technical background give up on controlling their own data intake without even looking into it. It is true that many scraping techniques require technical skills that developers typically possess. But it is also true that zero-code tools are now available: these solutions automate the scraping process by offering pre-built data scrapers to the average business person, and they include web scraping templates for popular sites such as Amazon and Booking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Myth #3: Scraping is Hacking
&lt;/h2&gt;

&lt;p&gt;This is not true. Hacking consists of illegal activities that exploit private networks or computer systems, taking control of them in order to carry out illicit acts such as stealing private information or manipulating systems for personal gain.&lt;/p&gt;

&lt;p&gt;Web scraping, on the other hand, is the practice of accessing publicly available information from target websites. This information is typically used by businesses to better compete in their space. This results in better services, and fairer market prices for consumers. &lt;/p&gt;

&lt;h2&gt;
  
  
  Myth #4: Scraping is Easy
&lt;/h2&gt;

&lt;p&gt;Many people wrongfully believe that ‘scraping is a piece of cake’. ‘What is the problem?’, they ask. ‘All you need to do is go to the website you are targeting and retrieve the target information.’ Conceptually this seems right, but in practice, scraping is a very technical, manual, and resource-heavy endeavor. Whether you choose Java or PHP, or browser-automation tools such as Selenium or PhantomJS, you need to keep a technical team on staff that knows how to write and maintain the necessary scripts.&lt;/p&gt;

&lt;p&gt;Many times, target sites have complex architectures and blocking mechanisms which are constantly changing. Once those hurdles are overcome, data sets typically need to be cleaned, synthesized, and structured so that algorithms can analyze them for valuable insights. The bottom line is that scraping is anything but easy.&lt;/p&gt;
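To give a taste of even the "simple" extraction step, here is a minimal stdlib-only sketch that pulls product titles out of a toy HTML page. The page markup and class names are invented; a real site would also demand browser automation (Selenium or similar) for dynamic content, plus unblocking and retry logic on top of this.

```python
from html.parser import HTMLParser

# A toy page standing in for a scraped product listing (markup is invented).
page = """
<html><body>
  <h2 class="title">Adidas Runner</h2>
  <h2 class="title">Nike Trainer</h2>
</body></html>
"""

class TitleParser(HTMLParser):
    """Collect the text of every <h2 class="title"> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())

parser = TitleParser()
parser.feed(page)
print(parser.titles)  # -> ['Adidas Runner', 'Nike Trainer']
```

Even this tiny extractor breaks the moment the site changes its markup, which is exactly the maintenance burden the paragraph above describes.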

&lt;h2&gt;
  
  
  Myth #5: Once collected, data is ‘ready-to-use’
&lt;/h2&gt;

&lt;p&gt;This is usually just not the case. There are many aspects to consider when collecting target information - for example, the format the information can be captured in versus the format your systems are able to ingest. Let’s say all of the data you are collecting is in JSON format, yet your systems can only process CSV files. Beyond format, there are also the issues of structuring, synthesizing, and cleaning data before it can actually be used - removing corrupted or duplicated records, for example. Only once the data is formatted, cleaned, and structured is it ready to be analyzed and used.&lt;/p&gt;
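As a minimal sketch of the JSON-to-CSV conversion just described, here is how Python's standard library could re-serialize scraped records; the record fields (`name`, `price`) are invented for illustration.

```python
import csv
import io
import json

# Hypothetical scraped records, delivered as a JSON string.
raw = '[{"name": "Listing A", "price": 120}, {"name": "Listing B", "price": 95}]'

records = json.loads(raw)

# Re-serialize as CSV for a system that can only ingest CSV files.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(records)

print(buffer.getvalue())
```

Format conversion is only the first step, of course: deduplication and validation still follow before the data is truly ready to use.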

&lt;h2&gt;
  
  
  Myth #6: Data scraping is a fully automated process
&lt;/h2&gt;

&lt;p&gt;Many people believe that bots simply crawl websites and retrieve information at the click of a button. This is not true: most web scraping is manual and requires technical teams to oversee the process and troubleshoot issues. There are, however, ways in which the process can be automated, either by using a Web Scraper IDE tool or simply by buying pre-collected datasets that spare you the complexities of data scraping entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Myth #7: It is easy to scale data scraping operations
&lt;/h2&gt;

&lt;p&gt;This is a total myth - at least if you are maintaining in-house data collection software and hardware, as well as a technical team to manage operations. To meaningfully scale, new servers need to be added, new team members need to be hired, and new scrapers need to be built for target sites. Consider that the upkeep of a single server alone can cost a business an average of $1,500 per month - and the larger the company, the higher the cost multiple.&lt;/p&gt;

&lt;p&gt;When relying on a Data-as-a-Service provider, on the other hand, scaling operations can be extremely easy, as you are drawing on third-party infrastructure and teams, as well as live maps of thousands of constantly changing web domains.&lt;/p&gt;

&lt;h2&gt;
  
  
  Myth #8: Web scraping produces large amounts of usable data
&lt;/h2&gt;

&lt;p&gt;This is usually not the case. Businesses performing manual data collection are very often served inaccurate or unusable data. That is why it is important to use tools and systems that perform quality validation and route traffic through real peer devices; this enables target sites to identify requesters as real users and ‘encourages’ them to return accurate datasets for the GEO in question. A data collection network with quality validation lets you retrieve a small data sample, validate it, and only then run the collection job in its entirety - saving both time and resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;As you can see, there are many misconceptions about data scraping. Now that you have the facts, you can better approach your future data collection jobs.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>proxy</category>
      <category>selenium</category>
    </item>
    <item>
      <title>Parsing JSON data with Python</title>
      <dc:creator>Dvir_BD</dc:creator>
      <pubDate>Sun, 02 Apr 2023 09:13:15 +0000</pubDate>
      <link>https://dev.to/dvir/parsing-json-data-with-python-596c</link>
      <guid>https://dev.to/dvir/parsing-json-data-with-python-596c</guid>
      <description>&lt;p&gt;&lt;strong&gt;Defining JSON&lt;/strong&gt;&lt;br&gt;
JSON, or JavaScript Object Notation is a format commonly used to transfer data (mainly by APIs) in a way that will not be ‘heavy on the system’. The basic principle is utilizing text in order to record, and transfer data points to a third party.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rules of JSON Syntax&lt;/strong&gt;&lt;br&gt;
JSON’s syntax is derived from JavaScript (JS) object notation, as JSON is essentially an offshoot of JS. Here are the major rules:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One:&lt;/strong&gt; ‘Arrays’ are displayed in square brackets&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;“companies”:[
    {“BrandName”:”Adidas”, “NumberofEmployees”:”20,000″},
    {“BrandName”:”Nike”, “NumberofEmployees”:”31,000″},
    {“BrandName”:”Asics”, “NumberofEmployees”:”14,000″}
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Two:&lt;/strong&gt; ‘Objects’ are flanked by curly brackets&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example: {"BrandName":"Adidas", "NumberofEmployees":"20,000"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Three:&lt;/strong&gt; Data points are separated by commas&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example: "Asics", "Adidas", "Nike"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Four:&lt;/strong&gt; Data points appear in pairs of ‘keys’ and ‘values’&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example: "BrandName":"Adidas"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the result when all of the above parts are combined, displaying a JSON array of three company records (objects) along with the number of employees currently employed at each respective corporation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
“companies”:[
{“BrandName”:”Adidas”, “NumberofEmployees”:”20,000″},
{“BrandName”:”Nike”, “NumberofEmployees”:”31,000″},
{“BrandName”:”Asics”, “NumberofEmployees”:”14,000″}
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;}&lt;/p&gt;
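As a quick sanity check, the combined example can be parsed with Python's built-in json module. Note that JSON requires straight double quotes; the curly quotes some editors insert are not valid JSON.

```python
import json

# The combined companies example, written with valid straight quotes.
doc = """
{
  "companies": [
    {"BrandName": "Adidas", "NumberofEmployees": "20,000"},
    {"BrandName": "Nike", "NumberofEmployees": "31,000"},
    {"BrandName": "Asics", "NumberofEmployees": "14,000"}
  ]
}
"""

data = json.loads(doc)
print(len(data["companies"]))             # -> 3
print(data["companies"][1]["BrandName"])  # -> Nike
```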

&lt;p&gt;&lt;strong&gt;JSON in the context of Python&lt;/strong&gt;&lt;br&gt;
The good news is that Python supports JSON natively. When looking to use JSON in the context of Python, one can enjoy the ease of using Python’s built-in package: ‘The JSON encoder and decoder.’ Give this documentation a good read, and it will be instrumental in helping you kickstart your JSON/Python conversion. To get you started, the first string of code you will need in order to import JSON to Python is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import json
Here is an example of the structure of what will typically follow:

# some JSON:
x =  '{ "name":"John", "age":30, "city":"New York"}'
# parse x:
y = json.loads(x)
# the result is a Python dictionary:
print(y["age"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep in mind that JSON data usually arrives as strings, as is the case with the vast majority of APIs. These strings need to be parsed into a Python dictionary (see the next section) before any further actions can be completed in the target language (Python). As demonstrated in the example above, you first import the json module, which contains the load and loads functions (the ‘s’ in loads stands for ‘string’).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A useful tool: JSON -&amp;gt; Python ‘dictionary’&lt;/strong&gt;&lt;br&gt;
As with any language, different ‘items’ are said/written differently yet mean the same thing. This is the concept of a dictionary: ‘chair’ in English is ‘chaise’ in French. Here is your JSON -&amp;gt; Python dictionary of the most common/useful conversions: a JSON object becomes a Python dict, an array becomes a list, a string becomes a str, numbers become int or float, true/false become True/False, and null becomes None.&lt;/p&gt;
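These conversions come straight from Python's built-in json module, and a quick way to see them in action is to decode one JSON value of each common type:

```python
import json

# Decode one JSON value of each common type to see its Python counterpart.
pairs = {
    '{"a": 1}': dict,   # JSON object       -> Python dict
    '[1, 2]': list,     # JSON array        -> Python list
    '"text"': str,      # JSON string       -> Python str
    '42': int,          # JSON number (int) -> Python int
    '3.14': float,      # JSON number (real)-> Python float
    'true': bool,       # JSON true/false   -> Python True/False
    'null': type(None), # JSON null         -> Python None
}

for doc, expected in pairs.items():
    assert type(json.loads(doc)) is expected
print("all conversions match")
```

Going the other way, json.dumps applies the same table in reverse when serializing Python objects back to JSON.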

&lt;p&gt;&lt;strong&gt;Data collection automation: JSON/Python alternatives&lt;/strong&gt;&lt;br&gt;
Bright Data’s Data Collector gives busy professionals a way to collect large amounts of web data without having to write any code. Many companies trying to collect data for competitive intelligence, dynamic pricing strategies, or user-driven market research are actually targeting many of the same websites. That is why Bright Data created &lt;a href="https://brightdata.com/products/web-scraper"&gt;different web scrapers&lt;/a&gt;, including hundreds of ready-to-use, site-specific web crawlers.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>json</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
  </channel>
</rss>
