Panagiotis Simakis

Scrapy: store files efficiently using folder trees

Context

When it comes to scraping, Scrapy is one of the best-known and most widely used frameworks. Its huge community and large collection of third-party extensions are only a few of the reasons to choose Scrapy. You can find a Scrapy extension for almost everything.

File storing

As I have said before, storing a large number of files is quite challenging. Always keep in mind that putting a large number of files in a single folder is not a good idea.

In the case of Scrapy, all files are stored in a single folder. So I decided to implement a Scrapy pipeline extension that stores files more efficiently using folder trees.

scrapy-files-hierarchy

sp1thas / scrapy-folder-tree

A scrapy pipeline which stores files using folder trees.

scrapy-folder-tree


This is a scrapy pipeline that provides an easy way to store files and images using various folder structures.

Supported folder structures:

Given this scraped file: 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg, you can choose the following folder structures:

Using the file name

class: scrapy_folder_tree.ImagesHashTreePipeline

full
├── 0
.   ├── 5
.   .   ├── b
.   .   .   ├── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg

Using the crawling time

class: scrapy_folder_tree.ImagesTimeTreePipeline

full
├── 0
.   ├── 11
.   .   ├── 48
.   .   .   ├── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg

Using the crawling date

class: scrapy_folder_tree.ImagesDateTreePipeline

full
├── 2022
.   ├── 1
.   .   ├── 24
.   .   .   ├── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg

This Scrapy pipeline provides several ways to store your crawled files. Currently, given the scraped file 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg, the following three folder structures are supported:

Using the file name (hash)

full
├── 0
.   ├── 5
.   .   ├── b
.   .   .   ├── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
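
To make the hash layout concrete, here is a minimal sketch of how a file name could be mapped to its nested path. It is not the extension's actual code: `hash_tree_path` and its `depth` and `root` parameters are hypothetical names, and it assumes one folder level per leading character of the file name, which is how the tree above reads.

```python
from pathlib import PurePosixPath


def hash_tree_path(filename: str, depth: int = 3, root: str = "full") -> str:
    """Nest `filename` under `depth` single-character folders taken from its stem."""
    stem = PurePosixPath(filename).stem
    if depth < 0 or depth > len(stem):
        raise ValueError("depth must be between 0 and the length of the file stem")
    # One folder level per leading character: "05b4..." -> "0", "5", "b"
    levels = list(stem[:depth])
    return str(PurePosixPath(root, *levels, filename))


print(hash_tree_path("05b40af07cb3284506acbf395452e0e93bfc94c8.jpg"))
# full/0/5/b/05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
```

If the file names are hexadecimal hashes, each level holds at most 16 subfolders, so a depth of 3 spreads the files over up to 4096 leaf folders instead of one.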

Using the crawling time

full
├── 0
.   ├── 11
.   .   ├── 48
.   .   .   ├── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg

Using the crawling date

full
├── 2022
.   ├── 1
.   .   ├── 24
.   .   .   ├── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
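
The two timestamp-based layouts can be sketched in the same way. This is an illustration only, not the extension's code: `date_tree_path` and `time_tree_path` are hypothetical helpers, and the sketch assumes the date tree is year/month/day and the time tree is hour/minute/second, with unpadded numbers as in the examples.

```python
from datetime import datetime
from pathlib import PurePosixPath


def date_tree_path(filename: str, when: datetime, root: str = "full") -> str:
    """Place the file under year/month/day folders (assumed layout)."""
    parts = (str(when.year), str(when.month), str(when.day))
    return str(PurePosixPath(root, *parts, filename))


def time_tree_path(filename: str, when: datetime, root: str = "full") -> str:
    """Place the file under hour/minute/second folders (assumed layout)."""
    parts = (str(when.hour), str(when.minute), str(when.second))
    return str(PurePosixPath(root, *parts, filename))


crawled_at = datetime(2022, 1, 24, 0, 11, 48)
name = "05b40af07cb3284506acbf395452e0e93bfc94c8.jpg"

print(date_tree_path(name, crawled_at))
# full/2022/1/24/05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
print(time_tree_path(name, crawled_at))
# full/0/11/48/05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
```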

Installation

pip install scrapy-folder-tree

Usage

Use the following settings in your project:

ITEM_PIPELINES = {
    'scrapy_folder_tree.FilesHashTreePipeline': 300
}

FOLDER_TREE_DEPTH = 3
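
For context, the same settings can also be scoped to a single spider through Scrapy's `custom_settings` attribute. The sketch below assumes the pipeline builds on Scrapy's standard files pipeline and therefore reads `FILES_STORE` and the `file_urls` item field; the store path and the URL are placeholders.

```python
# Hypothetical per-spider settings, equivalent to putting these keys in settings.py.
custom_settings = {
    "ITEM_PIPELINES": {"scrapy_folder_tree.FilesHashTreePipeline": 300},
    "FOLDER_TREE_DEPTH": 3,
    # Standard Scrapy setting for where downloaded files go; the path is a placeholder.
    "FILES_STORE": "downloads",
}

# Standard Scrapy convention: URLs to download are listed under "file_urls".
item = {"file_urls": ["https://example.com/report.pdf"]}
```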

Feel free to give it a try and share your feedback.

Future work

  • Support more folder structures
  • Parameterize folder structure
