How to Turn HTML to Text in Python for Web Scraping
The Hidden Complexity of Text Extraction
Extracting text from HTML seems straightforward until you actually try it. You expect clean sentences. You get a mess of concatenated words, random JavaScript, tracking pixels, and navigation menus all jumbled together.
A real web page has dozens of elements that look like content but aren't. Script tags, style blocks, hidden divs, tracking code, cookie notices. Stripping tags is just the start.
What Goes Wrong
from bs4 import BeautifulSoup
html = '''<nav>Home Products Contact</nav>
<main><h1>Great Product</h1><script>trackView('prod123');</script>
<p>Only <strong>$49.99</strong> today!</p></main>
<footer>© 2024 Company</footer>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.get_text())
Output:
Home Products ContactGreat ProducttrackView('prod123');Only $49.99 today!© 2024 Company
No spaces between elements, JavaScript code mixed in, navigation and footer included. Not useful at all.
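Passing a separator to get_text() fixes the spacing, but nothing else. Reusing the soup from the snippet above:

print(soup.get_text(separator=' ', strip=True))

Output:
Home Products Contact Great Product trackView('prod123'); Only $49.99 today! © 2024 Company

The words are readable now, but the script call, navigation, and footer are still in there. Fixing that takes more than a separator.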
Proper Text Extraction Methods
Basic Cleanup with Element Removal
The first step is removing elements that don't contain article content before you call get_text(). Scripts, styles, navigation, footers, forms, and hidden elements all need to go.
from bs4 import BeautifulSoup

def extract_article_text(html):
    soup = BeautifulSoup(html, 'lxml')

    # Tags that never hold article text
    unwanted_tags = [
        'script', 'style', 'noscript', 'iframe',
        'nav', 'header', 'footer', 'aside',
        'form', 'button', 'input', 'select'
    ]
    for tag in soup(unwanted_tags):
        tag.decompose()

    # Elements hidden with an inline style
    for hidden in soup.find_all(style=lambda x: x and 'display:none' in x.replace(' ', '')):
        hidden.decompose()

    # Elements with the HTML hidden attribute
    for hidden in soup.find_all(attrs={'hidden': True}):
        hidden.decompose()

    text = soup.get_text(separator='\n', strip=True)
    return text
html = '''
<html>
<nav><a href="/">Home</a> | <a href="/shop">Shop</a></nav>
<main>
<article>
<h1>Best Running Shoes for 2024</h1>
<p style="display:none">Hidden SEO text</p>
<p>These shoes changed my marathon time.</p>
<script>analytics.push(['view', 'article']);</script>
<p>The cushioning is incredible for long runs.</p>
</article>
</main>
<footer>Subscribe to our newsletter</footer>
</html>
'''
print(extract_article_text(html))
Output:
Best Running Shoes for 2024
These shoes changed my marathon time.
The cushioning is incredible for long runs.
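On a real page you'd fetch the HTML first and feed it to the same function. A minimal sketch with requests; the URL is a placeholder:

import requests

# hypothetical article URL; substitute a real page
response = requests.get('https://example.com/article', timeout=10)
response.raise_for_status()
print(extract_article_text(response.text))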
Normalize Whitespace
Web pages have inconsistent spacing. Tabs, multiple spaces, empty lines everywhere. HTML source formatting doesn't match what the browser displays, so you need to clean it up.
from bs4 import BeautifulSoup
import re

def clean_extracted_text(html):
    soup = BeautifulSoup(html, 'lxml')
    for tag in soup(['script', 'style', 'noscript']):
        tag.decompose()

    text = soup.get_text(separator='\n')

    # Collapse runs of spaces and tabs, drop blank lines
    lines = []
    for line in text.split('\n'):
        cleaned = re.sub(r'[ \t]+', ' ', line).strip()
        if cleaned:
            lines.append(cleaned)

    # Drop consecutive duplicate lines
    result = []
    prev = None
    for line in lines:
        if line != prev:
            result.append(line)
        prev = line
    return '\n'.join(result)
messy_html = '''
<div>
<h1> Product Title </h1>
<p> First paragraph with weird spacing. </p>
<p>Second paragraph.</p>
<p>Second paragraph.</p>
<p>Third paragraph with
line breaks in the HTML.</p>
</div>
'''
print(clean_extracted_text(messy_html))
Output:
Product Title
First paragraph with weird spacing.
Second paragraph.
Third paragraph with line breaks in the HTML.
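One character the [ \t]+ pattern misses is the non-breaking space: &nbsp; decodes to \xa0, which is neither a space nor a tab. If you want those treated as ordinary spaces, a small tweak to the per-line cleanup (a sketch, separate from the function above):

import re

def normalize_line(line):
    # \xa0 is what &nbsp; becomes after entity decoding
    line = line.replace('\xa0', ' ')
    return re.sub(r'[ \t]+', ' ', line).strip()

print(normalize_line('Price:\xa0\xa0$49.99'))  # Price: $49.99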
Target Specific Content Containers
Most pages have a main content area surrounded by headers, sidebars, and footers. Instead of extracting from the whole page and filtering out the junk, find the content container and extract only from there.
Look for semantic HTML5 tags like main and article first. If those don't exist, try common class and ID patterns such as post-content, article-body, or entry-content.
from bs4 import BeautifulSoup

def find_main_content(soup):
    # Prefer semantic HTML5 containers
    main = soup.find('main')
    if main:
        return main
    article = soup.find('article')
    if article:
        return article

    # Fall back to common id and class patterns
    content_selectors = [
        {'id': 'content'},
        {'id': 'main-content'},
        {'id': 'article-body'},
        {'class_': 'post-content'},
        {'class_': 'article-content'},
        {'class_': 'entry-content'},
        {'class_': 'content-body'},
    ]
    for selector in content_selectors:
        el = soup.find('div', **selector)
        if el:
            return el

    # Last resort: the whole body
    return soup.find('body') or soup

def extract_main_text(html):
    soup = BeautifulSoup(html, 'lxml')
    content = find_main_content(soup)
    for tag in content(['script', 'style', 'nav', 'aside']):
        tag.decompose()
    return content.get_text(separator='\n', strip=True)
blog_html = '''
<html>
<body>
<header>
<nav>Blog | About | Contact</nav>
</header>
<aside class="sidebar">
<h3>Popular Posts</h3>
<ul><li>Post 1</li><li>Post 2</li></ul>
</aside>
<main>
<article class="post-content">
<h1>How I Increased Conversions by 50%</h1>
<p>It started with a simple A/B test.</p>
<p>The results surprised everyone on the team.</p>
</article>
</main>
<footer>Copyright 2024</footer>
</body>
</html>
'''
print(extract_main_text(blog_html))
Output:
How I Increased Conversions by 50%
It started with a simple A/B test.
The results surprised everyone on the team.
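If you prefer CSS selectors, the same fallback chain can be written with select_one. Checking the selectors one at a time preserves the preference order; a single comma-separated selector would instead return the first match in document order. A sketch of the same idea:

def find_main_content_css(soup):
    # ordered by preference, most desirable container first
    selectors = [
        'main', 'article',
        'div#content', 'div#main-content', 'div#article-body',
        'div.post-content', 'div.article-content',
        'div.entry-content', 'div.content-body',
    ]
    for sel in selectors:
        el = soup.select_one(sel)
        if el:
            return el
    return soup.find('body') or soup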
Preserve Some Structure
Sometimes you want to know where headings and paragraphs were in the original HTML. Insert newlines before block elements to keep some structure in the output.
from bs4 import BeautifulSoup
import re

def structured_text_extraction(html):
    soup = BeautifulSoup(html, 'lxml')
    for tag in soup(['script', 'style', 'noscript']):
        tag.decompose()

    # Surround block-level elements with newlines so each one
    # lands on its own line in the text output
    block_tags = ['h1', 'h2', 'h3', 'h4', 'p', 'li', 'tr', 'div']
    for tag in soup.find_all(block_tags):
        tag.insert_before('\n')
        tag.insert_after('\n')

    text = soup.get_text()
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r'[ \t]+', ' ', text)
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    return '\n'.join(lines)
product_html = '''
<div class="product">
<h1>Wireless Mouse</h1>
<div class="specs">
<h2>Specifications</h2>
<ul>
<li>2.4GHz wireless</li>
<li>1600 DPI optical sensor</li>
<li>6 month battery life</li>
</ul>
</div>
<div class="description">
<h2>Description</h2>
<p>Ergonomic design for all day comfort.</p>
<p>Works with Windows, Mac, and Linux.</p>
</div>
</div>
'''
print(structured_text_extraction(product_html))
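Output:
Wireless Mouse
Specifications
2.4GHz wireless
1600 DPI optical sensor
6 month battery life
Description
Ergonomic design for all day comfort.
Works with Windows, Mac, and Linux.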
Handle Tables Properly
Tables become unreadable when converted to plain text with get_text(). The structure disappears and cells run together. Extract table data separately and format it yourself.
from bs4 import BeautifulSoup

def extract_table_data(html):
    soup = BeautifulSoup(html, 'lxml')
    tables = []
    for table in soup.find_all('table'):
        rows = []
        for tr in table.find_all('tr'):
            cells = []
            for cell in tr.find_all(['td', 'th']):
                cells.append(cell.get_text(strip=True))
            if cells:
                rows.append(cells)
        if rows:
            tables.append(rows)
    return tables

def table_to_text(rows, delimiter=' | '):
    lines = []
    for row in rows:
        lines.append(delimiter.join(row))
    return '\n'.join(lines)
pricing_html = '''
<table>
<tr><th>Plan</th><th>Price</th><th>Features</th></tr>
<tr><td>Basic</td><td>$9/mo</td><td>10 projects</td></tr>
<tr><td>Pro</td><td>$29/mo</td><td>Unlimited projects</td></tr>
<tr><td>Enterprise</td><td>Custom</td><td>Priority support</td></tr>
</table>
'''
tables = extract_table_data(pricing_html)
for table in tables:
    print(table_to_text(table))
Output:
Plan | Price | Features
Basic | $9/mo | 10 projects
Pro | $29/mo | Unlimited projects
Enterprise | Custom | Priority support
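When the first row holds headers, as it does here, you can zip it against the remaining rows to get dictionaries instead of raw lists. A small helper built on extract_table_data's output:

def table_to_dicts(rows):
    # assumes the first row is the header row
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

for table in tables:
    for record in table_to_dicts(table):
        print(record)

The first printed record would be {'Plan': 'Basic', 'Price': '$9/mo', 'Features': '10 projects'}.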
Fast Extraction with lxml
For high volume scraping where you're processing thousands of pages, lxml is significantly faster than BeautifulSoup. The Cleaner class removes unwanted elements in one pass. Note that since lxml 5.2 the cleaner ships as a separate package, so you may need to install it with pip install "lxml[html_clean]"; the import below stays the same.
from lxml import html
from lxml.html.clean import Cleaner
import re

def fast_text_extract(html_content):
    # One Cleaner configured to strip scripts, styles, comments,
    # and forms in a single pass over the document
    cleaner = Cleaner(
        scripts=True,
        javascript=True,
        style=True,
        inline_style=True,
        links=False,
        meta=True,
        page_structure=False,
        processing_instructions=True,
        remove_unknown_tags=False,
        safe_attrs_only=False,
        comments=True,
        forms=True
    )
    try:
        doc = html.fromstring(html_content)
        cleaned = cleaner.clean_html(doc)
        text = cleaned.text_content()
        text = re.sub(r'\s+', ' ', text).strip()
        return text
    except Exception:
        return ''

pages = [
    '<html><body><p>Page 1 content</p><script>track();</script></body></html>',
    '<html><body><p>Page 2 content</p><style>.x{}</style></body></html>',
    '<html><body><p>Page 3 content</p></body></html>',
]
for i, page in enumerate(pages):
    print(f"Page {i+1}: {fast_text_extract(page)}")
Professional Solutions
For production text extraction at scale, ScrapingForge handles this automatically.
import requests

response = requests.get(
    "https://api.scrapingforge.com/v1/scrape",
    params={
        'api_key': 'YOUR_API_KEY',
        'url': 'https://blog.example.com/article/12345',
        'render_js': 'true',
        'output_format': 'text',
        'extract_main_content': 'true'
    }
)
clean_text = response.text
The API uses machine learning to identify the main content area, strips boilerplate, and returns clean text.
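In production you'd also want a timeout and a status check around that call. A minimal sketch using the same endpoint and parameters:

import requests

try:
    response = requests.get(
        "https://api.scrapingforge.com/v1/scrape",
        params={
            'api_key': 'YOUR_API_KEY',
            'url': 'https://blog.example.com/article/12345',
            'output_format': 'text',
            'extract_main_content': 'true'
        },
        timeout=30
    )
    response.raise_for_status()
    clean_text = response.text
except requests.RequestException as exc:
    print(f"Request failed: {exc}")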
What to Remember
Always remove junk elements before calling get_text() or you'll get JavaScript and tracking code in your output. Target content containers instead of extracting from the entire page. Normalize whitespace because HTML formatting is inconsistent and never matches what the browser shows. Handle tables separately if you need to preserve their structure. Use lxml for speed when processing thousands of pages. And always test with real pages because sample HTML is always cleaner than what you'll encounter in the wild.
Related Reading
Before you can extract text, you need to find the right elements. How to Parse Dynamic CSS Classes covers that. If your scraper isn't getting the HTML you expect, check Why Your Scraper Doesn't See the Data. For text hidden in JavaScript rendered content, the JavaScript Rendering Issues guide explains what's happening. And if you can't even access the page, see 403 Access Denied.
From the blog, the Playwright Web Scraping Tutorial for 2025 shows how to extract text from JavaScript heavy sites, and Web Scraping with PHP covers text extraction if you're not using Python.