Pasting Python code directly into a shell "one-liner" is notoriously prone to `IndentationError`: Python's block structure depends on newlines and leading whitespace, and both are easily lost to shell quoting, copy-paste, or collapsing the code onto a single line.
The most reliable and "clean" way to handle this on Linux is to create a small, reusable Python script. This avoids shell escaping and indentation issues entirely.
### The Robust Approach
Run this command to create a file named `extract.py` in your current directory. It uses a "here-document" with a quoted delimiter (`'EOF'`), so the shell writes the body verbatim, with no variable expansion and with the indentation exactly as Python requires:
```bash
cat << 'EOF' > extract.py
import sys
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_parts = []

    def handle_data(self, data):
        # Clean up whitespace and add to list
        cleaned = data.strip()
        if cleaned:
            self.text_parts.append(cleaned)

    def get_text(self):
        return "\n".join(self.text_parts)


if __name__ == "__main__":
    parser = TextExtractor()
    parser.feed(sys.stdin.read())
    print(parser.get_text())
EOF
```
### How to use it
Now that the file exists, you can run it cleanly without any indentation headaches.
**For a single file:**
```bash
python3 extract.py < index.html > index.txt
```
**For your batch job (all HTML files in the directory):**
```bash
# IFS= and -r keep leading spaces and backslashes in file names intact
find . -name "*.html" | while IFS= read -r file; do
    python3 extract.py < "$file" > "${file%.html}.txt"
done
```
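If you would rather avoid the shell loop altogether (for example, to cope with unusual file names or to skip starting one Python process per file), the batch step can also be sketched in Python. This is only an illustration: `convert_tree` is a hypothetical helper name, the extractor class is repeated so the snippet runs on its own, and UTF-8 input is an assumption.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Same extractor as in extract.py, repeated so this sketch is standalone."""

    def __init__(self):
        super().__init__()
        self.text_parts = []

    def handle_data(self, data):
        cleaned = data.strip()
        if cleaned:
            self.text_parts.append(cleaned)

    def get_text(self):
        return "\n".join(self.text_parts)


def convert_tree(root="."):
    """Write a .txt file next to every .html file found under root.

    Hypothetical helper; assumes the HTML files are UTF-8 encoded and
    replaces undecodable bytes instead of failing.
    """
    written = []
    for html_path in sorted(Path(root).rglob("*.html")):
        parser = TextExtractor()
        parser.feed(html_path.read_text(encoding="utf-8", errors="replace"))
        parser.close()
        txt_path = html_path.with_suffix(".txt")
        txt_path.write_text(parser.get_text() + "\n", encoding="utf-8")
        written.append(txt_path)
    return written


if __name__ == "__main__":
    for path in convert_tree("."):
        print(path)
```

Like the shell loop, this overwrites any existing `.txt` file that shares a name with an `.html` file.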
### Why this is better
1. **Indentation:** Because the quoted here-document writes the file verbatim, the indentation reaches Python exactly as written, with no chance of spaces or tabs being lost along the way.
2. **Readability:** You can now modify `TextExtractor` easily. For example, if you find that `<script>` or `<style>` tags are leaking text into your output, you can update the script to ignore those specific tags by overriding `handle_starttag` and `handle_endtag`.
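As a rough sketch of that second point, the variant below overrides `handle_starttag` and `handle_endtag` to track whether the parser is currently inside a `<script>` or `<style>` element and drops any text seen there. The names `FilteredTextExtractor`, `SKIPPED_TAGS`, and `extract_text` are made up for illustration; treat this as one possible approach, not the only way to do it.

```python
from html.parser import HTMLParser


class FilteredTextExtractor(HTMLParser):
    """Collects text, ignoring anything inside <script> or <style>."""

    # Illustrative choice of tags to skip; extend as needed.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # text belongs to a skipped element
        cleaned = data.strip()
        if cleaned:
            self.text_parts.append(cleaned)

    def get_text(self):
        return "\n".join(self.text_parts)


def extract_text(html):
    """Hypothetical convenience wrapper around the parser."""
    parser = FilteredTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()
```

To use it from the command line, swap this class into `extract.py` in place of `TextExtractor`; the `sys.stdin` handling stays the same.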
Are you finding that the output includes a lot of "garbage" text (like navigation menus or script contents), or does this script provide the clean content you were looking for?