docker88

Pasting Python code directly into a shell "one-liner" is notoriously prone to `IndentationError`: Python's syntax depends on exact leading whitespace and newlines, and both are easily lost or mangled by shell quoting and copy-paste.

The most reliable and cleanest way to handle this on Linux is to create a small, reusable Python script. This avoids shell escaping and indentation issues entirely.

### The Robust Approach

Run this command to create a file named `extract.py` in your current directory. It uses a here-document with a quoted delimiter (`'EOF'`), which disables shell expansion inside the block and preserves the indentation exactly as written:

```bash
cat << 'EOF' > extract.py
import sys
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_parts = []

    def handle_data(self, data):
        # Clean up whitespace and add to list
        cleaned = data.strip()
        if cleaned:
            self.text_parts.append(cleaned)

    def get_text(self):
        return "\n".join(self.text_parts)


if __name__ == "__main__":
    parser = TextExtractor()
    parser.feed(sys.stdin.read())
    print(parser.get_text())
EOF
```


### How to use it

Now that the file exists, you can run it cleanly without any indentation headaches.

**For a single file:**

```bash
python3 extract.py < index.html > index.txt
```


**For your batch job (all HTML files in the directory):**

```bash
# IFS= and -r keep leading spaces and backslashes in file names intact
find . -name "*.html" | while IFS= read -r file; do
    python3 extract.py < "$file" > "${file%.html}.txt"
done
```


### Why this is better
1.  **Indentation:** By creating the file first, you eliminate the possibility of the shell swallowing spaces or tabs.
2.  **Readability:** You can now modify `TextExtractor` easily. For example, if you find that `<script>` or `<style>` tags are leaking text into your output, you can update the script to ignore those specific tags by overriding `handle_starttag` and `handle_endtag`.
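
As a rough illustration of the second point, here is a minimal sketch of how the extractor could skip `<script>` and `<style>` content. The class name `FilteredTextExtractor`, the `SKIPPED_TAGS` set, the `_skip_depth` counter, and the `extract_text` helper are hypothetical names introduced for this example; they are not part of the script above, and you may want to skip other tags as well.

```python
from html.parser import HTMLParser


class FilteredTextExtractor(HTMLParser):
    """Sketch: like TextExtractor, but ignores text inside <script>/<style>."""

    # Assumption: these are the only tags whose contents should be dropped.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # inside <script> or <style>: drop the text
        cleaned = data.strip()
        if cleaned:
            self.text_parts.append(cleaned)

    def get_text(self):
        return "\n".join(self.text_parts)


def extract_text(html):
    """Hypothetical helper: run the filtered extractor over an HTML string."""
    parser = FilteredTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()
```

To try it, you could swap `TextExtractor()` for `FilteredTextExtractor()` in the `__main__` block of `extract.py`; the command-line usage stays the same.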

Are you finding that the output includes a lot of "garbage" text (like navigation menus or script contents), or does this script provide the clean content you were looking for?

DEV Community
