DEV Community

Query Filter

docker88

Pasting Python code directly into a shell "one-liner" is notoriously prone to `IndentationError`: Python's syntax depends on newlines and leading whitespace, and both are easily lost or mangled by shell quoting and copy-paste.
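To make the failure mode concrete, here is a minimal sketch that compiles the same function twice: once with its indentation intact and once with the leading whitespace stripped from every line. The `compiles` helper and the `greet` function are illustrative names, not part of the script discussed in this post.

```python
import textwrap


def compiles(source: str) -> bool:
    """Return True if `source` compiles, False on an IndentationError."""
    try:
        compile(source, "<pasted>", "exec")
        return True
    except IndentationError:
        return False


source = textwrap.dedent("""\
    def greet():
        return "hello"
""")

# Simulate whitespace loss: strip the leading spaces from every line.
flattened = "\n".join(line.lstrip() for line in source.splitlines()) + "\n"

print(compiles(source))     # indentation intact
print(compiles(flattened))  # function body is no longer indented
```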

On Linux, the most reliable and "clean" way to handle this is to create a small, reusable Python script. This sidesteps shell escaping and indentation issues entirely.

### The Robust Approach

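Run this command to create a file named `extract.py` in your current directory. It uses a "here-document" with a quoted delimiter (`'EOF'`), so the shell writes the body to the file verbatim, with no variable expansion and with the indentation preserved exactly as Python requires: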

```bash
cat << 'EOF' > extract.py
import sys
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_parts = []

    def handle_data(self, data):
        # Clean up whitespace and add to list
        cleaned = data.strip()
        if cleaned:
            self.text_parts.append(cleaned)

    def get_text(self):
        # One extracted text node per line
        return "\n".join(self.text_parts)


if __name__ == "__main__":
    # Read HTML from stdin, write plain text to stdout
    parser = TextExtractor()
    parser.feed(sys.stdin.read())
    print(parser.get_text())
EOF
```

### How to use it

Now that the file exists, you can run it cleanly without any indentation headaches.

**For a single file:**

```bash
python3 extract.py < index.html > index.txt
```


**For your batch job (all HTML files in the current directory and its subdirectories):**

```bash
# IFS= and -r keep leading spaces and backslashes in file names intact
find . -name "*.html" | while IFS= read -r file; do
    python3 extract.py < "$file" > "${file%.html}.txt"
done
```
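
If you would rather keep the whole batch job in Python, the loop can be sketched with `pathlib`. The `convert_tree` name, the `to_text` parameter, and the UTF-8 encoding are assumptions made for this example; `to_text` stands for any function that turns an HTML string into plain text, such as a small wrapper around `TextExtractor`.

```python
from pathlib import Path
from typing import Callable, List


def convert_tree(root: Path, to_text: Callable[[str], str]) -> List[Path]:
    """Write a .txt file next to every .html file found under `root`."""
    written = []
    for html_path in sorted(root.rglob("*.html")):
        # index.html -> index.txt, mirroring "${file%.html}.txt" in the shell loop
        txt_path = html_path.with_suffix(".txt")
        # Assumption: files are UTF-8; undecodable bytes are replaced, not fatal.
        html = html_path.read_text(encoding="utf-8", errors="replace")
        txt_path.write_text(to_text(html), encoding="utf-8")
        written.append(txt_path)
    return written
```

Like `find`, `rglob` descends into subdirectories, and it sidesteps file-name quoting issues because no shell is involved.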


### Why this is better
1.  **Indentation:** By creating the file first, you eliminate the possibility of the shell swallowing spaces or tabs.
2.  **Readability:** You can now modify `TextExtractor` easily. For example, if you find that `<script>` or `<style>` tags are leaking text into your output, you can update the script to ignore those specific tags by overriding `handle_starttag` and `handle_endtag`.
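
The tag-skipping idea from the second point could look roughly like this. It is a sketch, not a drop-in replacement: the `FilteredTextExtractor` name, the `SKIPPED_TAGS` set, and the `_skip_depth` counter are all introduced here for illustration.

```python
from html.parser import HTMLParser


class FilteredTextExtractor(HTMLParser):
    # Tags whose contents should never reach the output (an assumption;
    # extend the set if other tags leak text into your results).
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore text inside <script> or <style>
        cleaned = data.strip()
        if cleaned:
            self.text_parts.append(cleaned)

    def get_text(self):
        return "\n".join(self.text_parts)


# Hypothetical sample input for demonstration.
sample = "<style>p { color: red; }</style><p>Visible</p><script>var x = 1;</script>"

parser = FilteredTextExtractor()
parser.feed(sample)
parser.close()
print(parser.get_text())
```

To use it in `extract.py`, the class could replace `TextExtractor` in the `__main__` block.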

Are you finding that the output includes a lot of "garbage" text (like navigation menus or script contents), or does this script provide the clean content you were looking for?