You find a scraper on GitHub that does exactly what you need. Almost. It scrapes product prices but doesn't handle pagination. You know how to add pagination. You could copy the code, modify it, and use your own version.
But that's wasteful. The original developer maintains the project. Fixes bugs. Adds features. If you copy it, you're on your own. You have to manually copy their updates into your version forever.
Better idea: add pagination to their project. Contribute it back. Now everyone benefits. They get pagination. You get their future updates automatically. Win-win.
Or maybe this happens. You're working with a teammate on the same scraper. You both edit scraper.py. You finish first. Commit. Push. Done.
Your teammate finishes. Tries to push. Git rejects it. "Your branch is behind. Pull first." They pull. Merge conflict. Your changes clash with theirs. Same function. Different implementations.
Now you're both stuck. Whose code wins? How do you merge without breaking both versions? Do you text each other? Email? Video call to figure out whose change stays?
This is where collaboration workflow saves you.
Git and GitHub have built-in tools for working with other people. Forking projects. Proposing changes. Reviewing code. Resolving conflicts. Tracking issues. The tools exist. You just need to learn the pattern.
Let me show you how to collaborate without chaos.
The Two Collaboration Scenarios
You'll encounter two main situations.
Scenario 1: Contributing to Someone Else's Project
You don't have permission to push to their repo. You can't just commit and push like it's yours.
The workflow:
- Fork (make your own copy on GitHub)
- Clone your fork
- Create a branch
- Make changes
- Push to your fork
- Create pull request to original repo
- Wait for maintainer to review and merge
This is how open source works.
Scenario 2: Team Project (Shared Access)
Everyone has permission to push. But you need rules so people don't overwrite each other.
The workflow:
- Clone the shared repo
- Always pull before starting work
- Create a branch for your feature
- Make changes
- Push your branch
- Create pull request
- Teammate reviews
- Merge after approval
- Everyone pulls the updated main
This is how teams work.
Both scenarios use the same Git features (branches, pull requests). The difference is permissions.
Forking: Making Your Own Copy
Forking creates a copy of someone's repo under your GitHub account.
When to Fork
- Contributing to open source projects
- Customizing a project for your needs
- Experimenting with changes before proposing them
- Learning from a project by modifying it
How to Fork
Let's fork a real project. We'll use a simple scraper template.
Step 1: Find a Project
Go to: https://github.com/example-user/simple-scraper
(For this example, find any small scraping project you want to contribute to)
Step 2: Click Fork
- Top-right corner: "Fork" button
- Choose your account (where to fork it)
- Optional: Change the name
- Click "Create fork"
You now have a copy: https://github.com/YOUR-USERNAME/simple-scraper
This is yours. You can push to it. Modify it. Break it. The original is untouched.
Cloning Your Fork
# Clone YOUR fork, not the original
git clone https://github.com/YOUR-USERNAME/simple-scraper.git
cd simple-scraper
# Check the remote
git remote -v
Output:
origin https://github.com/YOUR-USERNAME/simple-scraper.git (fetch)
origin https://github.com/YOUR-USERNAME/simple-scraper.git (push)
origin points to your fork.
Adding the Original as Upstream
You'll want to pull updates from the original project.
# Add the original repo as 'upstream'
git remote add upstream https://github.com/example-user/simple-scraper.git
# Verify
git remote -v
Output:
origin https://github.com/YOUR-USERNAME/simple-scraper.git (fetch)
origin https://github.com/YOUR-USERNAME/simple-scraper.git (push)
upstream https://github.com/example-user/simple-scraper.git (fetch)
upstream https://github.com/example-user/simple-scraper.git (push)
Now you have two remotes:
- origin - your fork (you can push here)
- upstream - the original repo (you pull updates from here, but can't push)
Contributing to Open Source
Let's add a feature to the project we forked.
Step 1: Create a Feature Branch
# Make sure you're on main
git checkout main
# Pull latest from original project
git pull upstream main
# Create feature branch
git checkout -b add-pagination
Step 2: Make Your Changes
# Edit scraper.py
# Add pagination functionality
def scrape_all_pages(base_url):
    """Scrape all pages with pagination"""
    all_products = []
    page = 1
    while True:
        url = f"{base_url}?page={page}"
        products = scrape_products(url)
        if not products:
            break
        all_products.extend(products)
        page += 1
    return all_products
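One defensive tweak worth considering: some sites serve page 1 again for out-of-range page numbers, which would make the loop above run forever. A hedged variant with a hard page cap (the scraper function is passed in as a parameter just to keep this sketch self-contained; the cap of 100 is an arbitrary placeholder):

```python
def scrape_all_pages(base_url, scrape_products, max_pages=100):
    """Scrape pages until one comes back empty, or until max_pages."""
    all_products = []
    for page in range(1, max_pages + 1):
        products = scrape_products(f"{base_url}?page={page}")
        if not products:
            break
        all_products.extend(products)
    return all_products

# Quick check with a fake scraper: three pages of products, then an empty page
pages = {1: ['a', 'b'], 2: ['c', 'd'], 3: ['e', 'f']}

def fake_scrape(url):
    return pages.get(int(url.rsplit('=', 1)[1]), [])

print(len(scrape_all_pages('https://example.com/products', fake_scrape)))  # 6
```

The cap turns "infinite loop on a weird site" into "incomplete data plus something to investigate", which is the better failure mode for a scraper.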
Step 3: Test Your Changes
# Run the scraper
python scraper.py
# Make sure it works
# Test edge cases
Step 4: Commit Your Changes
git add scraper.py
git commit -m "Add pagination support for multi-page scraping"
Good commit message format:
- First line: Brief summary (50 chars or less)
- Optional: Blank line + detailed explanation
git commit -m "Add pagination support for multi-page scraping
- Implemented scrape_all_pages() function
- Handles pagination automatically
- Stops when no more products found
- Tested with 10+ page product listings"
Step 5: Push to Your Fork
# Push to YOUR fork (origin), not the original (upstream)
git push -u origin add-pagination
Step 6: Create a Pull Request
On GitHub:
- Go to your fork: https://github.com/YOUR-USERNAME/simple-scraper
- You'll see "add-pagination had recent pushes" with a "Compare & pull request" button
- Click "Compare & pull request"
Pull Request Form:
- Base repository: example-user/simple-scraper (original)
- Base branch: main
- Head repository: YOUR-USERNAME/simple-scraper (your fork)
- Compare branch: add-pagination
Title: "Add pagination support"
Description:
## Summary
Adds pagination support for scraping multi-page product listings.
## Changes
- New function `scrape_all_pages()` that handles pagination
- Automatically iterates through pages until no products found
- Backward compatible (existing code still works)
## Testing
- Tested with 10+ page listings
- Tested with single page (no regression)
- Tested with invalid URLs (proper error handling)
## Example Usage

```python
products = scrape_all_pages('https://example.com/products')
print(f'Scraped {len(products)} products across multiple pages')
```

Fixes #15

The `Fixes #15` line references an issue number. If there's an open issue requesting this feature, link it.
Click "Create pull request"
Step 7: Wait for Review
The maintainer will:
- Review your code
- Test it
- Ask questions or request changes
- Approve and merge, or decline
If they request changes:
# Make the requested changes
# Edit files
# Commit
git add .
git commit -m "Address review feedback: improve error handling"
# Push to same branch
git push
The pull request updates automatically. No need to create a new one.
If they merge:
Your code is now in the original project! Congratulations, you contributed to open source.
Cleanup:
# Switch back to main
git checkout main
# Pull the merged changes from original
git pull upstream main
# Delete your feature branch
git branch -d add-pagination
# Update your fork's main on GitHub
git push origin main
Team Collaboration Workflow
You're working with teammates on a shared repo. Everyone has push access.
The Golden Rules
- Never push directly to main (use branches + pull requests)
- Always pull before starting work
- Create a branch for every feature/fix
- Get approval before merging (code review)
- Pull main before merging your branch
Daily Team Workflow
Morning routine:
# Update your local main
git checkout main
git pull origin main
# Create today's feature branch
git checkout -b feature/user-login
During the day:
# Make changes
# (edit files)
# Commit regularly
git add .
git commit -m "Implement login validation"
# Push to GitHub (backup + share with team)
git push -u origin feature/user-login
When feature is done:
# Make sure main hasn't changed
git checkout main
git pull origin main
# Merge main into your feature branch (get latest changes)
git checkout feature/user-login
git merge main
# Resolve any conflicts
# (if main changed while you worked)
# Push updated branch
git push
# Create pull request on GitHub
# Wait for teammate review
# Merge after approval
After your PR is merged:
# Update local main
git checkout main
git pull origin main
# Delete feature branch
git branch -d feature/user-login
git push origin --delete feature/user-login
Handling Simultaneous Work
Scenario: You and a teammate both work on different features simultaneously.
You:
git checkout -b feature/add-export
# (work on export feature)
git commit -m "Add CSV export"
git push -u origin feature/add-export
# (create PR, get approved, merge)
Teammate (at the same time):
git checkout -b feature/add-filters
# (work on filters)
git commit -m "Add product filters"
git push -u origin feature/add-filters
# (create PR)
Your teammate's PR now needs updating:
# Your export feature merged first
# Teammate needs to update their branch
git checkout feature/add-filters
git pull origin main # Get your merged export feature
# (resolve any conflicts if both touched same files)
git push
# (PR updates, gets approved, merges)
This is why "pull before merge" matters.
Code Review Best Practices
Pull requests aren't just for merging. They're for reviewing code.
As the Author (Creating PR)
Write a clear description:
## What this PR does
Adds ability to export scraped data to CSV format.
## Why we need it
Users requested CSV export for importing to Excel.
## How to test
1. Run scraper: python scraper.py
2. Check data/export.csv exists
3. Open in Excel, verify formatting
## Screenshots
(if UI changes)
## Checklist
- [x] Code works locally
- [x] Added tests
- [x] Updated README
- [x] No merge conflicts
Keep PRs small:
- One feature per PR
- Easier to review
- Faster to merge
- Less likely to have conflicts
Respond to feedback graciously:
- "Good catch, I'll fix that"
- "I didn't consider that case, adding now"
- "Great suggestion, implemented"
As the Reviewer
What to look for:
- Does it work? (test the code locally)
- Does it break anything? (run existing tests)
- Is it readable? (can you understand it?)
- Is it secure? (no passwords, proper input validation)
- Is it efficient? (no obvious performance issues)
How to comment:
Good feedback:
Line 23: Consider adding error handling here for when the URL is invalid.
Suggestion:
try:
    response = requests.get(url)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    logger.error(f"Failed to fetch {url}: {e}")
    return None
Bad feedback:
this is bad
Be specific. Be constructive. Suggest solutions.
Approving or requesting changes:
On GitHub:
- "Comment" - add feedback without blocking
- "Approve" - looks good, ready to merge
- "Request changes" - needs fixes before merging
Reviewing Your Own Code
Even solo, review your own PRs:
- Create PR
- Look at the diff (what changed)
- Read it like you didn't write it
- Catch mistakes you missed
- Merge when satisfied
This habit catches bugs before production.
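You can do the same review locally before even opening the PR. A throwaway sandbox (hypothetical branch and file names; assumes Git 2.28+ for `git init -b`) showing the three-dot diff, which is what GitHub's PR "Files changed" tab computes:

```shell
# Throwaway repo to demonstrate reviewing your own branch
cd "$(mktemp -d)"
git init -q -b main demo && cd demo
git config user.email you@example.com && git config user.name You

echo "print('v1')" > scraper.py
git add scraper.py && git commit -qm "Initial scraper"

git checkout -qb feature/self-review
echo "print('v2')" > scraper.py
git commit -qam "Tweak output"

# Three-dot diff: your branch vs. the point where it forked from main
git diff main...HEAD

# The commits your PR will contain
git log --oneline main..HEAD
```

Reading `git diff main...HEAD` with fresh eyes is the command-line version of "read it like you didn't write it".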
Handling Merge Conflicts in Teams
Two people edited the same line. Git can't auto-merge.
Example Conflict
Person A edits scraper.py:
def scrape_product(url):
    response = requests.get(url, timeout=30)
Person B edits the same line:
def scrape_product(url):
    response = requests.get(url, timeout=60)
Both push to different branches. Person A merges first. Person B tries to merge.
Conflict!
Resolving the Conflict
Person B:
# Pull latest main
git checkout main
git pull origin main
# Try to merge into feature branch
git checkout feature/increase-timeout
git merge main
Output:
Auto-merging scraper.py
CONFLICT (content): Merge conflict in scraper.py
Automatic merge failed; fix conflicts and then commit the result.
Open scraper.py:
def scrape_product(url):
<<<<<<< HEAD
    response = requests.get(url, timeout=60)
=======
    response = requests.get(url, timeout=30)
>>>>>>> main
Person B's choices:
- Keep their change (60): Maybe they tested and 60 is better
- Keep Person A's change (30): Maybe 30 is the agreed standard
- Choose different value: Maybe 45 is a compromise
- Ask Person A: "Hey, I increased timeout to 60. You set it to 30. Which should we use?"
After deciding (let's say keep 60):
def scrape_product(url):
    response = requests.get(url, timeout=60)
Remove the markers. Save file.
# Mark conflict as resolved
git add scraper.py
# Complete the merge
git commit -m "Merge main, kept 60s timeout after discussion with Person A"
# Push updated branch
git push
The PR updates. Reviewer sees the conflict was resolved. Merges when satisfied.
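Side note: when you want one side of a conflict wholesale, Git can pick it for you with `git checkout --ours` (keep your branch's version) or `--theirs` (keep the incoming version). A runnable sandbox (throwaway repo, hypothetical timeout values; assumes Git 2.28+ for `git init -b`) that manufactures the conflict above and keeps the feature branch's side:

```shell
cd "$(mktemp -d)"
git init -q -b main demo && cd demo
git config user.email you@example.com && git config user.name You

echo "timeout=10" > scraper.py
git add scraper.py && git commit -qm "Initial"

git checkout -qb feature/increase-timeout
echo "timeout=60" > scraper.py
git commit -qam "Raise timeout to 60"

git checkout -q main
echo "timeout=30" > scraper.py
git commit -qam "Set timeout to 30"

git checkout -q feature/increase-timeout
git merge main || true          # CONFLICT (content): Merge conflict in scraper.py
git checkout --ours scraper.py  # keep this branch's version (60)
git add scraper.py
git commit -qm "Merge main, kept 60s timeout"
cat scraper.py                  # timeout=60
```

Use this only when you're sure one side is entirely right; for anything subtler, edit the markers by hand as shown above.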
.gitignore: What NOT to Commit
Some files shouldn't be in Git.
Why .gitignore Matters
Bad things that happen when you commit the wrong files:
- API keys get exposed (security breach)
- Repo becomes huge (data files, databases)
- Merge conflicts on generated files (compiled code, caches)
- Teammate's environment breaks (OS-specific files)
Creating .gitignore
# In your project root
touch .gitignore
Essential patterns for Python projects:
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
ENV/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Environment
.env
.venv
config.py
# IDE
.vscode/
.idea/
*.swp
*.swo
*~
# OS
.DS_Store
Thumbs.db
# Data files
*.csv
*.json
*.xlsx
data/
scraped_data/
# Logs
*.log
logs/
# Database
*.db
*.sqlite3
Commit .gitignore:
git add .gitignore
git commit -m "Add .gitignore for Python project"
Common Patterns
Ignore all files of a type:
*.csv
*.log
*.db
Ignore a directory:
data/
logs/
Ignore all except one file:
# Ignore all .env files
.env*
# Except the example
!.env.example
Ignore files in root only:
/config.py # Only root config.py, not src/config.py
Testing .gitignore
# Check what Git sees
git status
# If ignored files still show up:
# They were already tracked before .gitignore
# Remove from Git (keep on disk)
git rm --cached filename.csv
# Commit the removal
git commit -m "Stop tracking data files"
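When you're not sure why a file is (or isn't) being ignored, `git check-ignore -v` tells you exactly which pattern matched, and from which line of which .gitignore. A quick sandbox (throwaway repo and hypothetical filenames; assumes Git 2.28+ for `git init -b`):

```shell
cd "$(mktemp -d)"
git init -q -b main demo && cd demo
printf '*.csv\ndata/\n' > .gitignore
mkdir data
touch output.csv data/dump.json notes.txt

# -v shows source-file:line:pattern for each match
git check-ignore -v output.csv data/dump.json
# .gitignore:1:*.csv	output.csv
# .gitignore:2:data/	data/dump.json

# Exit status 1 and no output means the file is NOT ignored
git check-ignore notes.txt || echo "notes.txt is not ignored"
```

This is much faster than staring at a long .gitignore trying to work out precedence by eye.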
.gitignore for Scrapers
# Virtual environment
venv/
env/
# Environment variables (API keys!)
.env
config.py
# Scraped data
data/
*.csv
*.json
*.xlsx
# Logs
*.log
logs/
# Database
*.db
*.sqlite3
# Temporary files
temp/
cache/
# Screenshots (if using Selenium)
screenshots/
# Keep example config
!config.example.py
GitHub Issues: Tracking Work
Issues are like to-do lists for your project.
Creating an Issue
On GitHub → Issues → New issue
Title: "Add pagination support"
Description:
## Problem
Scraper only gets first page of results. Multi-page listings are incomplete.
## Proposed Solution
Add pagination support to scrape_products() function.
## Example
Site: https://example.com/products
Current: Gets 20 products (page 1)
Desired: Gets all 200 products (10 pages)
## Acceptance Criteria
- [ ] Scrapes all pages automatically
- [ ] Stops when no more products
- [ ] Handles pagination URLs (?page=N)
- [ ] Works with existing code
Labels: enhancement, good first issue
Assignee: Yourself or teammate
Click "Submit new issue"
Working on an Issue
# Reference issue in branch name
git checkout -b fix-pagination-15
# Reference issue in commits
git commit -m "Add pagination support (fixes #15)"
# Reference issue in PR
# (GitHub auto-links when you mention #15)
When the PR merges, the issue closes automatically.
Using Issues for Bug Reports
Title: "Scraper crashes on products without prices"
Description:
## Bug Description
Scraper crashes with `AttributeError: NoneType` when product has no price.
## Steps to Reproduce
1. Run: `python scraper.py`
2. URL: https://example.com/products/out-of-stock
3. Error occurs on line 45
## Expected Behavior
Should skip products without prices or set price to 0.
## Actual Behavior
AttributeError: 'NoneType' object has no attribute 'text'
File "scraper.py", line 45
price = price_elem.text
## Environment
- Python 3.9
- Ubuntu 20.04
- Package versions from requirements.txt
Labels: bug, high priority
This gives teammates everything they need to reproduce and fix.
GitHub Projects: Organizing Work
Projects are kanban boards for issues and PRs.
Creating a Project
On GitHub → Projects → New project
- Name: "Scraper Roadmap"
- Template: "Board"
- Create
Default columns:
- Todo
- In Progress
- Done
Add custom columns:
- Backlog (ideas for later)
- Review (PRs waiting for review)
- Testing (needs testing before deploy)
Adding Issues to Project
Drag issues between columns:
- Backlog: "Add authentication support"
- Todo: "Add pagination" (next to work on)
- In Progress: "Fix price parsing bug" (actively working)
- Review: "Add CSV export" (PR created, waiting review)
- Done: "Initial scraper setup" (merged and deployed)
Team Workflow with Projects
Weekly planning:
- Review Backlog
- Move important issues to Todo
- Assign to team members
Daily:
- Move your issue to In Progress when you start
- Create PR when done
- Move to Review
- Teammate reviews
- Merge and move to Done
Visual progress tracking. Everyone sees what's being worked on.
Forking vs Branching: When to Use Which
Use Forking When:
- Contributing to projects you don't have push access to
- Open source contributions
- Experimenting with major changes to someone else's project
- Creating your own version of a project
Pattern: Fork → Clone fork → Branch → PR to original
Use Branching When:
- Working on a project you have push access to
- Team projects
- Your own projects
- Any project where you're a collaborator
Pattern: Clone → Branch → PR to same repo
Same Repo, Collaborative:
# Everyone clones the same repo
git clone https://github.com/team/project.git
# Everyone creates branches
git checkout -b feature/my-feature
# Everyone creates PRs to main branch
# (same repo, different branches)
Fork-Based, External Contribution:
# Fork on GitHub
# Clone your fork
git clone https://github.com/YOU/project.git
# Create branch
git checkout -b feature/my-contribution
# PR to original repo
# (from your fork to their main)
Real Example: Contributing to a Scrapy Project
Let's contribute to a real open source project.
Step 1: Find a Project
Search GitHub: "scrapy spider" or "web scraper python"
Let's say you find: scrapy-examples/ecommerce-spider
Step 2: Fork It
Click "Fork" on GitHub.
Step 3: Clone Your Fork
git clone https://github.com/YOUR-USERNAME/ecommerce-spider.git
cd ecommerce-spider
# Add upstream
git remote add upstream https://github.com/scrapy-examples/ecommerce-spider.git
Step 4: Find Something to Contribute
Check Issues:
- Look for the good first issue label
- Or help wanted
Let's say Issue #42: "Add support for product reviews"
Step 5: Create Feature Branch
git checkout -b add-product-reviews
Step 6: Implement the Feature
# spiders/products.py
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'

    def parse(self, response):
        # Existing product parsing
        yield {
            'name': response.css('.product-name::text').get(),
            'price': response.css('.price::text').get(),
        }

        # New: follow review links
        review_url = response.css('.reviews-link::attr(href)').get()
        if review_url:
            yield response.follow(review_url, self.parse_reviews)

    def parse_reviews(self, response):
        """Parse product reviews"""
        for review in response.css('.review'):
            yield {
                'product_url': response.url,
                'rating': review.css('.rating::attr(data-rating)').get(),
                'text': review.css('.review-text::text').get(),
                'author': review.css('.author::text').get(),
                'date': review.css('.date::text').get(),
            }
Step 7: Test Thoroughly
# Run the spider
scrapy crawl products -o output.json
# Check the output
cat output.json
# Make sure reviews are included
Step 8: Update Documentation
# README.md
## Features
- Scrapes product names and prices
- Follows pagination automatically
- **NEW:** Scrapes product reviews with ratings
## Output
The spider outputs JSON with product data and reviews:
```json
[
  {
    "name": "Laptop Pro 15",
    "price": "$1299.99"
  },
  {
    "product_url": "https://example.com/laptop-pro-15",
    "rating": "5",
    "text": "Excellent laptop!",
    "author": "John D.",
    "date": "2024-03-15"
  }
]
```
Step 9: Commit and Push
git add .
git commit -m "Add product review scraping support (fixes #42)
- Implemented parse_reviews() method
- Follows review links from product pages
- Extracts rating, text, author, date
- Tested with 50+ products
- Updated README with review output format"
git push -u origin add-product-reviews
Step 10: Create Pull Request
On GitHub:
Title: "Add product review scraping support"
Description:
Closes #42
## Summary
Adds ability to scrape product reviews in addition to product data.
## Changes
- New `parse_reviews()` method
- Follows review links from product pages
- Extracts rating, review text, author, and date
- Updated README with examples
## Testing
- Tested on 50+ products with reviews
- Handles products without reviews (gracefully skips)
- Verified output format
## Output Example
```json
{
  "product_url": "https://example.com/laptop",
  "rating": "5",
  "text": "Great product",
  "author": "Jane S.",
  "date": "2024-03-15"
}
```
## Checklist
- [x] Code tested locally
- [x] Documentation updated
- [x] No breaking changes
- [x] Follows project coding style
Step 11: Respond to Review
Maintainer comments: "Can you add error handling for missing review dates?"
# Make changes
# Edit the code
git add .
git commit -m "Add error handling for missing review dates"
git push
PR updates automatically.
Step 12: PR Gets Merged
Congratulations! You contributed to open source.
Cleanup:
git checkout main
git pull upstream main
git branch -d add-product-reviews
git push origin main
Your contribution is now part of the project forever.
Team Communication
Git tracks code. Communication tracks decisions.
Good Commit Messages
Bad:
updated stuff
fixed bug
changes
Good:
Add pagination support for multi-page scraping
- Implemented scrape_all_pages() function
- Automatically detects last page
- Tested with 20+ page listings
Fixes #15
Format:
Short summary (50 chars or less)
Detailed explanation (wrap at 72 chars):
- What you changed
- Why you changed it
- How to test it
Fixes #issue-number
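You don't need an editor to write a multi-paragraph message: pass `-m` more than once and Git joins the values as separate paragraphs. A sandbox demo (throwaway repo, hypothetical file; assumes Git 2.28+ for `git init -b`):

```shell
cd "$(mktemp -d)"
git init -q -b main demo && cd demo
git config user.email you@example.com && git config user.name You
echo "data" > scraper.py && git add scraper.py

# First -m becomes the summary line, second becomes the body paragraph
git commit -qm "Add pagination support for multi-page scraping" \
           -m "- Implemented scrape_all_pages() function
- Automatically detects last page

Fixes #15"

# Show the full message to confirm the paragraph break
git log -1 --format=%B
```

For anything longer, `git commit` with no `-m` drops you into your editor, which is easier for wrapping body text at 72 characters.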
Pull Request Etiquette
Do:
- Explain what and why
- Link related issues
- Respond to feedback promptly
- Test your code
- Update documentation
Don't:
- Create massive PRs (500+ line changes)
- Ignore review feedback
- Merge without approval
- Force push after review started
- Get defensive about criticism
Code Review Comments
As reviewer, be kind:
❌ "This is wrong"
✅ "Consider handling the case where URL is None"
❌ "Bad code"
✅ "This could be optimized by caching the results"
❌ "Didn't you read the docs?"
✅ "The library actually has a built-in method for this: requests.get(url, timeout=30)"
As author, be receptive:
❌ "My code is fine"
✅ "Good point, I'll add that check"
❌ "That's not important"
✅ "I see your concern. How about this approach instead?"
Advanced Team Workflows
Protected Branches
Prevent direct pushes to main.
On GitHub → Settings → Branches → Add rule:
- Branch name: main
- ✅ Require pull request before merging
- ✅ Require approvals: 1
- ✅ Require status checks to pass
- Save
Now nobody can push directly to main. All changes go through PRs.
Required Reviews
Settings → Branches → main:
- ✅ Require approvals: 2 (for teams)
- ✅ Dismiss stale reviews when new commits pushed
Forces code review before merging.
Auto-Delete Branches
Settings → General:
- ✅ Automatically delete head branches
When PR merges, branch auto-deletes. Keeps repo clean.
Branch Naming Conventions
Team standard:
- feature/description - new features
- fix/description - bug fixes
- docs/description - documentation
- refactor/description - code cleanup
- test/description - adding tests
Everyone follows the same pattern. Easy to understand at a glance.
Handling Large Teams
Code Owners
Create .github/CODEOWNERS:
# Every PR needs approval from these owners
# Default owners for everything
* @team-lead @senior-dev
# Specific owners for parts
/scrapers/ @scraping-team
/database/ @backend-team
/docs/ @documentation-team
GitHub auto-assigns reviewers based on files changed.
Draft Pull Requests
Create PR that isn't ready yet:
- Create PR
- Select "Draft pull request"
- Keep working
- Mark "Ready for review" when done
Shows progress without spamming reviewers.
PR Templates
Create .github/pull_request_template.md:
## Summary
<!-- What does this PR do? -->
## Changes
<!-- List main changes -->
## Testing
<!-- How was this tested? -->
## Checklist
- [ ] Code works locally
- [ ] Tests pass
- [ ] Documentation updated
- [ ] No merge conflicts
Every new PR pre-fills with this template.
Summary
Collaboration is about process, not just tools.
Contributing to open source:
- Fork the repo
- Clone your fork
- Create feature branch
- Make changes
- Push to your fork
- Create PR to original
- Respond to feedback
- Celebrate when merged
Working in teams:
- Clone shared repo
- Always pull before starting
- Create feature branch
- Make changes
- Push branch
- Create PR
- Get review approval
- Merge
- Delete branch
- Pull updated main
Key tools:
- Forks (your copy of someone's project)
- Branches (parallel development)
- Pull requests (propose changes)
- Code review (quality control)
- Issues (track work)
- Projects (organize work)
- .gitignore (exclude files)
Best practices:
- Small, focused PRs
- Clear commit messages
- Responsive to feedback
- Test before pushing
- Review your own code
- Pull before merging
- Communicate decisions
Collaboration rules:
- Never push directly to main
- Always use branches
- Always get review approval
- Pull before starting work
- Keep PRs small and focused
Git and GitHub handle the mechanics. You handle the communication. Both matter equally.
Next up: Blog 5 - "Advanced Git: The Commands That Save Your Career"
The final blog in the series. We'll cover the scary stuff: recovering deleted commits, rewriting history (carefully), handling committed secrets, git reset vs revert vs rebase, stashing changes, cherry-picking commits, and the advanced commands that fix seemingly impossible situations.
Resources:
- GitHub Skills: https://skills.github.com
- Open source guide: https://opensource.guide/how-to-contribute/
- Conventional commits: https://www.conventionalcommits.org/
- Code review best practices: https://google.github.io/eng-practices/review/