How to Turn HTML to Text in Python for Web Scraping
The Hidden Complexity of Text Extraction
Extracting text from HTML seems straightforward until you actually try it. You expect clean sentences. You get a mess of concatenated words, random JavaScript, tracking pixels, and navigation menus all jumbled together.
A real web page has dozens of elements that look like content but aren't. Script tags, style blocks, hidden divs, tracking code, cookie notices. Stripping tags is just the start.
What Goes Wrong
from bs4 import BeautifulSoup
html = '''<nav>Home Products Contact</nav>
<main><h1>Great Product</h1><script>trackView('prod123');</script>
<p>Only <strong>$49.99</strong> today!</p></main>
<footer>© 2024 Company</footer>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.get_text())
Output:
Home Products ContactGreat ProducttrackView('prod123');Only $49.99 today!© 2024 Company
No spaces between elements, JavaScript code mixed in, navigation and footer included. Not useful at all.
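Passing a separator to get_text() fixes the spacing, but nothing else. Reusing the soup from the snippet above:

print(soup.get_text(separator=' ', strip=True))

Output:
Home Products Contact Great Product trackView('prod123'); Only $49.99 today! © 2024 Company

The words are readable now, but the script call, navigation, and footer are still in there. Fixing that takes more than a separator.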
Proper Text Extraction Methods
Basic Cleanup with Element Removal
The first step is removing elements that don't contain article content before you call get_text(). Scripts, styles, navigation, footers, forms, and hidden elements all need to go.
from bs4 import BeautifulSoup

def extract_article_text(html):
    soup = BeautifulSoup(html, 'lxml')

    # Tags that never hold article text
    unwanted_tags = [
        'script', 'style', 'noscript', 'iframe',
        'nav', 'header', 'footer', 'aside',
        'form', 'button', 'input', 'select'
    ]
    for tag in soup(unwanted_tags):
        tag.decompose()

    # Elements hidden with an inline style
    for hidden in soup.find_all(style=lambda x: x and 'display:none' in x.replace(' ', '')):
        hidden.decompose()

    # Elements with the HTML hidden attribute
    for hidden in soup.find_all(attrs={'hidden': True}):
        hidden.decompose()

    text = soup.get_text(separator='\n', strip=True)
    return text
html = '''
<html>
<nav><a href="/">Home</a> | <a href="/shop">Shop</a></nav>
<main>
<article>
<h1>Best Running Shoes for 2024</h1>
<p style="display:none">Hidden SEO text</p>
<p>These shoes changed my marathon time.</p>
<script>analytics.push(['view', 'article']);</script>
<p>The cushioning is incredible for long runs.</p>
</article>
</main>
<footer>Subscribe to our newsletter</footer>
</html>
'''
print(extract_article_text(html))
Output:
Best Running Shoes for 2024
These shoes changed my marathon time.
The cushioning is incredible for long runs.
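On a real page you'd fetch the HTML first and feed it to the same function. A minimal sketch with requests; the URL is a placeholder:

import requests

# hypothetical article URL; substitute a real page
response = requests.get('https://example.com/article', timeout=10)
response.raise_for_status()
print(extract_article_text(response.text))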
Normalize Whitespace
Web pages have inconsistent spacing. Tabs, multiple spaces, empty lines everywhere. HTML source formatting doesn't match what the browser displays, so you need to clean it up.
from bs4 import BeautifulSoup
import re

def clean_extracted_text(html):
    soup = BeautifulSoup(html, 'lxml')
    for tag in soup(['script', 'style', 'noscript']):
        tag.decompose()

    text = soup.get_text(separator='\n')

    # Collapse runs of spaces and tabs, drop blank lines
    lines = []
    for line in text.split('\n'):
        cleaned = re.sub(r'[ \t]+', ' ', line).strip()
        if cleaned:
            lines.append(cleaned)

    # Drop consecutive duplicate lines
    result = []
    prev = None
    for line in lines:
        if line != prev:
            result.append(line)
        prev = line
    return '\n'.join(result)
messy_html = '''
<div>
<h1> Product Title </h1>
<p> First paragraph with weird spacing. </p>
<p>Second paragraph.</p>
<p>Second paragraph.</p>
<p>Third paragraph with
line breaks in the HTML.</p>
</div>
'''
print(clean_extracted_text(messy_html))
Output:
Product Title
First paragraph with weird spacing.
Second paragraph.
Third paragraph with line breaks in the HTML.
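One character the [ \t]+ pattern misses is the non-breaking space: &nbsp; decodes to \xa0, which is neither a space nor a tab. If you want those treated as ordinary spaces, a small tweak to the per-line cleanup (a sketch, separate from the function above):

import re

def normalize_line(line):
    # \xa0 is what &nbsp; becomes after entity decoding
    line = line.replace('\xa0', ' ')
    return re.sub(r'[ \t]+', ' ', line).strip()

print(normalize_line('Price:\xa0\xa0$49.99'))  # Price: $49.99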
Target Specific Content Containers
Most pages have a main content area surrounded by headers, sidebars, and footers. Instead of extracting from the whole page and filtering out the junk, find the content container and extract only from there.
Look for semantic HTML5 tags like main and article first. If those don't exist, try common class and ID patterns such as post-content, article-body, or entry-content.
from bs4 import BeautifulSoup

def find_main_content(soup):
    # Prefer semantic HTML5 containers
    main = soup.find('main')
    if main:
        return main
    article = soup.find('article')
    if article:
        return article

    # Fall back to common id and class patterns
    content_selectors = [
        {'id': 'content'},
        {'id': 'main-content'},
        {'id': 'article-body'},
        {'class_': 'post-content'},
        {'class_': 'article-content'},
        {'class_': 'entry-content'},
        {'class_': 'content-body'},
    ]
    for selector in content_selectors:
        el = soup.find('div', **selector)
        if el:
            return el

    # Last resort: the whole body
    return soup.find('body') or soup

def extract_main_text(html):
    soup = BeautifulSoup(html, 'lxml')
    content = find_main_content(soup)
    for tag in content(['script', 'style', 'nav', 'aside']):
        tag.decompose()
    return content.get_text(separator='\n', strip=True)
blog_html = '''
<html>
<body>
<header>
<nav>Blog | About | Contact</nav>
</header>
<aside class="sidebar">
<h3>Popular Posts</h3>
<ul><li>Post 1</li><li>Post 2</li></ul>
</aside>
<main>
<article class="post-content">
<h1>How I Increased Conversions by 50%</h1>
<p>It started with a simple A/B test.</p>
<p>The results surprised everyone on the team.</p>
</article>
</main>
<footer>Copyright 2024</footer>
</body>
</html>
'''
print(extract_main_text(blog_html))
Output:
How I Increased Conversions by 50%
It started with a simple A/B test.
The results surprised everyone on the team.
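If you prefer CSS selectors, the same fallback chain can be written with select_one. Checking the selectors one at a time preserves the preference order; a single comma-separated selector would instead return the first match in document order. A sketch of the same idea:

def find_main_content_css(soup):
    # ordered by preference, most desirable container first
    selectors = [
        'main', 'article',
        'div#content', 'div#main-content', 'div#article-body',
        'div.post-content', 'div.article-content',
        'div.entry-content', 'div.content-body',
    ]
    for sel in selectors:
        el = soup.select_one(sel)
        if el:
            return el
    return soup.find('body') or soup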
Preserve Some Structure
Sometimes you want to know where headings and paragraphs were in the original HTML. Insert newlines before block elements to keep some structure in the output.
from bs4 import BeautifulSoup
import re

def structured_text_extraction(html):
    soup = BeautifulSoup(html, 'lxml')
    for tag in soup(['script', 'style', 'noscript']):
        tag.decompose()

    # Surround block-level elements with newlines so each one
    # lands on its own line in the text output
    block_tags = ['h1', 'h2', 'h3', 'h4', 'p', 'li', 'tr', 'div']
    for tag in soup.find_all(block_tags):
        tag.insert_before('\n')
        tag.insert_after('\n')

    text = soup.get_text()
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r'[ \t]+', ' ', text)
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    return '\n'.join(lines)
product_html = '''
<div class="product">
<h1>Wireless Mouse</h1>
<div class="specs">
<h2>Specifications</h2>
<ul>
<li>2.4GHz wireless</li>
<li>1600 DPI optical sensor</li>
<li>6 month battery life</li>
</ul>
</div>
<div class="description">
<h2>Description</h2>
<p>Ergonomic design for all day comfort.</p>
<p>Works with Windows, Mac, and Linux.</p>
</div>
</div>
'''
print(structured_text_extraction(product_html))
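Output:
Wireless Mouse
Specifications
2.4GHz wireless
1600 DPI optical sensor
6 month battery life
Description
Ergonomic design for all day comfort.
Works with Windows, Mac, and Linux.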
Handle Tables Properly
Tables become unreadable when converted to plain text with get_text(). The structure disappears and cells run together. Extract table data separately and format it yourself.
from bs4 import BeautifulSoup

def extract_table_data(html):
    soup = BeautifulSoup(html, 'lxml')
    tables = []
    for table in soup.find_all('table'):
        rows = []
        for tr in table.find_all('tr'):
            cells = []
            for cell in tr.find_all(['td', 'th']):
                cells.append(cell.get_text(strip=True))
            if cells:
                rows.append(cells)
        if rows:
            tables.append(rows)
    return tables

def table_to_text(rows, delimiter=' | '):
    lines = []
    for row in rows:
        lines.append(delimiter.join(row))
    return '\n'.join(lines)
pricing_html = '''
<table>
<tr><th>Plan</th><th>Price</th><th>Features</th></tr>
<tr><td>Basic</td><td>$9/mo</td><td>10 projects</td></tr>
<tr><td>Pro</td><td>$29/mo</td><td>Unlimited projects</td></tr>
<tr><td>Enterprise</td><td>Custom</td><td>Priority support</td></tr>
</table>
'''
tables = extract_table_data(pricing_html)
for table in tables:
    print(table_to_text(table))
Output:
Plan | Price | Features
Basic | $9/mo | 10 projects
Pro | $29/mo | Unlimited projects
Enterprise | Custom | Priority support
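When the first row holds headers, as it does here, you can zip it against the remaining rows to get dictionaries instead of raw lists. A small helper built on extract_table_data's output:

def table_to_dicts(rows):
    # assumes the first row is the header row
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

for table in tables:
    for record in table_to_dicts(table):
        print(record)

The first printed record would be {'Plan': 'Basic', 'Price': '$9/mo', 'Features': '10 projects'}.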
Fast Extraction with lxml
For high volume scraping where you're processing thousands of pages, lxml is significantly faster than BeautifulSoup. The Cleaner class removes unwanted elements in one pass. Note that since lxml 5.2 the cleaner ships as a separate package, so you may need to install it with pip install "lxml[html_clean]"; the import below stays the same.
from lxml import html
from lxml.html.clean import Cleaner
import re

def fast_text_extract(html_content):
    # One Cleaner configured to strip scripts, styles, comments,
    # and forms in a single pass over the document
    cleaner = Cleaner(
        scripts=True,
        javascript=True,
        style=True,
        inline_style=True,
        links=False,
        meta=True,
        page_structure=False,
        processing_instructions=True,
        remove_unknown_tags=False,
        safe_attrs_only=False,
        comments=True,
        forms=True
    )
    try:
        doc = html.fromstring(html_content)
        cleaned = cleaner.clean_html(doc)
        text = cleaned.text_content()
        text = re.sub(r'\s+', ' ', text).strip()
        return text
    except Exception:
        return ''

pages = [
    '<html><body><p>Page 1 content</p><script>track();</script></body></html>',
    '<html><body><p>Page 2 content</p><style>.x{}</style></body></html>',
    '<html><body><p>Page 3 content</p></body></html>',
]
for i, page in enumerate(pages):
    print(f"Page {i+1}: {fast_text_extract(page)}")
Professional Solutions
For production text extraction at scale, ScrapingForge handles this automatically.
import requests

response = requests.get(
    "https://api.scrapingforge.com/v1/scrape",
    params={
        'api_key': 'YOUR_API_KEY',
        'url': 'https://blog.example.com/article/12345',
        'render_js': 'true',
        'output_format': 'text',
        'extract_main_content': 'true'
    }
)
clean_text = response.text
The API uses machine learning to identify the main content area, strips boilerplate, and returns clean text.
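In production you'd also want a timeout and a status check around that call. A minimal sketch using the same endpoint and parameters:

import requests

try:
    response = requests.get(
        "https://api.scrapingforge.com/v1/scrape",
        params={
            'api_key': 'YOUR_API_KEY',
            'url': 'https://blog.example.com/article/12345',
            'output_format': 'text',
            'extract_main_content': 'true'
        },
        timeout=30
    )
    response.raise_for_status()
    clean_text = response.text
except requests.RequestException as exc:
    print(f"Request failed: {exc}")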
What to Remember
Always remove junk elements before calling get_text() or you'll get JavaScript and tracking code in your output. Target content containers instead of extracting from the entire page. Normalize whitespace because HTML formatting is inconsistent and never matches what the browser shows. Handle tables separately if you need to preserve their structure. Use lxml for speed when processing thousands of pages. And always test with real pages because sample HTML is always cleaner than what you'll encounter in the wild.
Related Reading
Before you can extract text, you need to find the right elements. How to Parse Dynamic CSS Classes covers that. If your scraper isn't getting the HTML you expect, check Why Your Scraper Doesn't See the Data. For text hidden in JavaScript rendered content, the JavaScript Rendering Issues guide explains what's happening. And if you can't even access the page, see 403 Access Denied.
From the blog, the Playwright Web Scraping Tutorial for 2025 shows how to extract text from JavaScript heavy sites, and Web Scraping with PHP covers text extraction if you're not using Python.