Creating Custom Parsers¶
This comprehensive guide covers everything you need to know about creating custom parsers for OPAL, from web scraping fundamentals to practical implementation.
Table of Contents¶
- Introduction to Web Scraping
- Core Concepts
- Beautiful Soup Fundamentals
- BaseParser Architecture
- Step-by-Step Parser Creation
- Real-World Examples
- Special Cases & Advanced Topics
- Registration & Testing
- Best Practices & Common Challenges
Introduction to Web Scraping¶
Web scraping is the process of programmatically extracting data from websites. OPAL uses web scraping to gather news articles and court records from Alabama-based websites.
Key Components¶
- HTTP Requests: Fetching web pages from servers
- HTML Parsing: Extracting structured data from HTML
- Data Extraction: Finding specific information within the parsed content
- Error Handling: Gracefully managing network issues and unexpected content
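To make these components concrete, here is a minimal sketch that strings all four together; the URL, function name, and the h1 target are placeholders rather than real OPAL code:

import requests
from bs4 import BeautifulSoup

def fetch_headline(url):
    # HTTP request: fetch the page from the server
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        # Error handling: report the problem instead of crashing
        print(f"Request failed: {e}")
        return None
    # HTML parsing: build a searchable tree from the raw markup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Data extraction: pull out the piece of information we care about
    headline = soup.find('h1')  # placeholder target
    return headline.get_text(strip=True) if headline else None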
Core Concepts¶
HTTP Requests and Responses¶
When scraping websites, you're making HTTP requests to web servers:
import requests
response = requests.get('https://example.com')
print(f"Status Code: {response.status_code}")
print(f"Content Type: {response.headers.get('Content-Type')}")
Common status codes:
- 200: Success
- 404: Not Found
- 403: Forbidden
- 500: Server Error
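As a brief illustration, requests can turn error status codes into exceptions via raise_for_status(), which keeps status handling in one place; the URL here is a placeholder:

import requests

try:
    response = requests.get('https://example.com', timeout=10)
    response.raise_for_status()  # Raises HTTPError for 4xx/5xx status codes
    print("Success:", response.status_code)
except requests.HTTPError as e:
    print(f"Server returned an error status: {e}")
except requests.RequestException as e:
    print(f"Request failed before a response arrived: {e}")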
HTML Structure and DOM¶
HTML documents have a tree-like structure called the DOM (Document Object Model):
<html>
  <head>
    <title>Page Title</title>
  </head>
  <body>
    <div class="article">
      <h1>Article Title</h1>
      <p>Article content...</p>
    </div>
  </body>
</html>
Beautiful Soup Fundamentals¶
Beautiful Soup is a widely used Python library for parsing HTML. OPAL relies on it for most of its data extraction.
Basic Usage¶
from bs4 import BeautifulSoup
# Parse HTML
soup = BeautifulSoup(html_content, 'html.parser')
# Find elements
title = soup.find('title')
all_links = soup.find_all('a')
article_div = soup.find('div', class_='article')
Key Methods¶
| Method | Description | Example |
|---|---|---|
| find() | Find the first matching element | soup.find('h1') |
| find_all() | Find all matching elements | soup.find_all('p') |
| select() | Find elements with a CSS selector | soup.select('.article h1') |
| get_text() | Extract text content | element.get_text(strip=True) |
| get() | Get an attribute value | link.get('href') |
Navigation¶
# Parent/child navigation
parent = element.parent
children = element.children
next_sibling = element.next_sibling
# Finding by attributes
articles = soup.find_all('div', {'class': 'article', 'data-type': 'news'})
BaseParser Architecture¶
OPAL uses an Abstract Base Class (ABC) design pattern for parsers. This ensures consistency while allowing flexibility.
The BaseParser Class¶
from abc import ABC, abstractmethod
import requests

class BaseParser(ABC):
    """Abstract base class for all OPAL parsers"""

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'OPAL Web Scraper'
        })

    @abstractmethod
    def parse_article(self, url):
        """Must be implemented by each parser"""
        pass

    @abstractmethod
    def extract_article_data(self, soup):
        """Must be implemented by each parser"""
        pass
Why Abstract Base Classes?¶
- Enforces Interface: All parsers must implement required methods
- Code Reuse: Common functionality in base class
- Type Safety: Better IDE support and error detection
- Extensibility: Easy to add new parsers
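As a quick illustration of the "Enforces Interface" point, Python refuses to instantiate a subclass that leaves an abstract method unimplemented. IncompleteParser below is a throwaway example, not part of OPAL:

class IncompleteParser(BaseParser):
    def parse_article(self, url):
        return None
    # extract_article_data is deliberately missing

try:
    IncompleteParser()
except TypeError as e:
    print(e)  # "Can't instantiate abstract class IncompleteParser ..."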
Step-by-Step Parser Creation¶
Step 1: Create Your Parser Class¶
Create a new file in the opal/ directory:
# opal/parser_mynews.py
from opal.parser_module import BaseParser
from bs4 import BeautifulSoup
import requests
class ParserMyNews(BaseParser):
    def __init__(self):
        super().__init__()
        self.source_name = "My News Site"
Step 2: Implement Required Methods¶
def parse_article(self, url):
    """Parse a single article from the given URL"""
    try:
        response = self.make_request(url)
        if response and response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            return self.extract_article_data(soup, url)
    except Exception as e:
        print(f"Error parsing article {url}: {e}")
    return None

def make_request(self, url):
    """Make HTTP request with error handling"""
    try:
        response = self.session.get(url, timeout=10)
        response.raise_for_status()
        return response
    except requests.RequestException as e:
        print(f"Request failed for {url}: {e}")
        return None
Step 3: Implement Data Extraction¶
def extract_article_data(self, soup, url):
    """Extract structured data from the parsed HTML"""
    try:
        # Extract title
        title_element = soup.find('h1', class_='article-title')
        title = title_element.get_text(strip=True) if title_element else "No title"

        # Extract author
        author_element = soup.find('span', class_='author-name')
        author = author_element.get_text(strip=True) if author_element else "Unknown"

        # Extract date
        date_element = soup.find('time', class_='publish-date')
        date = date_element.get('datetime', 'Unknown date') if date_element else "Unknown date"

        # Extract content
        content_div = soup.find('div', class_='article-content')
        paragraphs = content_div.find_all('p') if content_div else []

        # Format content
        line_content = {}
        for i, p in enumerate(paragraphs, 1):
            text = p.get_text(strip=True)
            if text:  # Skip empty paragraphs
                line_content[f"line {i}"] = text

        return {
            'url': url,
            'title': title,
            'author': author,
            'date': date,
            'line_count': len(line_content),
            'line_content': line_content
        }
    except Exception as e:
        print(f"Error extracting data: {e}")
        return None
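Once extract_article_data is in place, calling the parser returns a dictionary with the fields shown above. A short usage sketch (the article URL is hypothetical):

parser = ParserMyNews()
article = parser.parse_article('https://example.com/news/some-story')  # hypothetical URL
if article:
    print(article['title'], '|', article['author'], '|', article['date'])
    print(f"{article['line_count']} paragraphs extracted")
    # article['line_content'] maps "line 1", "line 2", ... to paragraph text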
Step 4: Handle Article Discovery¶
from urllib.parse import urljoin  # needed to build absolute URLs

def get_article_links(self, base_url, suffix=None, max_pages=None):
    """Discover article links from the news site"""
    all_links = set()
    page = 1

    while True:
        if max_pages and page > max_pages:
            break

        page_url = f"{base_url}?page={page}"
        response = self.make_request(page_url)
        if not response:
            break

        soup = BeautifulSoup(response.text, 'html.parser')

        # Find article links
        articles = soup.find_all('a', class_='article-link')
        if not articles:
            break  # No more articles

        for article in articles:
            href = article.get('href')
            if href:
                # Make absolute URL
                full_url = urljoin(base_url, href)
                # Apply suffix filter if provided
                if not suffix or suffix in full_url:
                    all_links.add(full_url)

        page += 1

    return list(all_links)
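A sketch of how the pieces fit together once all four steps are done; the base URL is a placeholder:

parser = ParserMyNews()
links = parser.get_article_links('https://example.com/news', max_pages=2)  # placeholder URL
articles = [parser.parse_article(url) for url in links]
articles = [a for a in articles if a]  # Drop articles that failed to parse
print(f"Collected {len(articles)} articles from {len(links)} links")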
Real-World Examples¶
Example 1: Parser1819 (Simple News Site)¶
class Parser1819(BaseParser):
    def extract_article_data(self, soup, url):
        # Real implementation from OPAL
        title = soup.find('title').text.strip() if soup.find('title') else "No title"

        # Author extraction with multiple fallbacks
        author = "Unknown author"
        author_tags = soup.find_all('a', {'rel': 'author'})
        if author_tags:
            author = author_tags[0].text.strip()

        # Date extraction from meta tag
        date = "Unknown date"
        date_tag = soup.find('meta', {'property': 'article:published_time'})
        if date_tag:
            date = date_tag.get('content', 'Unknown date')
Example 2: ParserAppealsAL (JavaScript-Heavy Site)¶
class ParserAppealsAL(BaseParser):
    def __init__(self):
        super().__init__()
        # Uses Selenium for JavaScript rendering
        self.driver = self._setup_driver()

    def _setup_driver(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        return webdriver.Chrome(options=options)

    def parse_article(self, url):
        # Load page with JavaScript
        self.driver.get(url)

        # Wait for content to load
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "case-table"))
        )

        # Parse rendered HTML
        soup = BeautifulSoup(self.driver.page_source, 'html.parser')
Special Cases & Advanced Topics¶
JavaScript-Rendered Content¶
Some sites load content dynamically with JavaScript. Use Selenium:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def setup_selenium_parser(self):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # Run without GUI
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    self.driver = webdriver.Chrome(options=options)
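Selenium keeps a real browser process running, so shut it down when the parser finishes. A minimal sketch, assuming you add a cleanup method to your parser (the close name is an assumption, not part of BaseParser):

def close(self):
    """Shut down the headless browser to free its process and memory."""
    if getattr(self, 'driver', None):
        self.driver.quit()  # quit() closes all windows and ends the WebDriver session
        self.driver = None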
Handling Pagination¶
def handle_pagination(self, base_url, max_pages=None):
    """Navigate through paginated content"""
    page = 1
    all_data = []

    while True:
        if max_pages and page > max_pages:
            break

        # Different pagination patterns
        page_url = f"{base_url}?page={page}"  # Query parameter
        # OR: page_url = f"{base_url}/page/{page}"  # Path-based

        response = self.make_request(page_url)
        if not response or response.status_code == 404:
            break  # No more pages

        # Extract data from page
        data = self.extract_page_data(response)
        if not data:
            break  # No data found

        all_data.extend(data)
        page += 1

    return all_data
Rate Limiting¶
Respect server resources:
import time
def parse_with_rate_limit(self, urls, delay=1):
    """Parse URLs with delay between requests"""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # Delay between requests
        result = self.parse_article(url)
        if result:
            results.append(result)
    return results
Registration & Testing¶
Register Your Parser¶
Add your parser to opal/__main__.py:
from opal.parser_mynews import ParserMyNews
# In the parser dictionary
parsers = {
    'Parser1819': Parser1819,
    'ParserDailyNews': ParserDailyNews,
    'ParserAppealsAL': ParserAppealsAL,
    'ParserMyNews': ParserMyNews  # Add your parser here
}
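Once registered, a parser can be resolved by name from that dictionary. The exact command-line wiring lives in __main__.py, but the lookup amounts to something like this sketch (the article URL is hypothetical):

parser_name = 'ParserMyNews'            # e.g. taken from a command-line argument
parser_class = parsers[parser_name]     # Resolve the class by name
parser = parser_class()                 # Instantiate it
article = parser.parse_article('https://example.com/news/some-story')  # hypothetical URL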
Test Your Parser¶
- Unit Test:

# tests/test_parser_mynews.py
import pytest
from bs4 import BeautifulSoup

from opal.parser_mynews import ParserMyNews

def test_parser_initialization():
    parser = ParserMyNews()
    assert parser.source_name == "My News Site"

def test_extract_article_data():
    parser = ParserMyNews()
    # Test with sample HTML
    html = '<h1 class="article-title">Test Article</h1>'
    soup = BeautifulSoup(html, 'html.parser')
    data = parser.extract_article_data(soup, 'http://test.com')
    assert data['title'] == 'Test Article'

- Integration Test (a minimal sketch follows):
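This sketch assumes the test is allowed to reach the live site; the URL and marker name are placeholders you would adapt:

# tests/test_parser_mynews_integration.py
import pytest

from opal.parser_mynews import ParserMyNews

@pytest.mark.integration  # register this marker in your pytest config to avoid warnings
def test_parse_live_article():
    """Hits the real site; run selectively, e.g. pytest -m integration."""
    parser = ParserMyNews()
    article = parser.parse_article('https://example.com/news/known-article')  # placeholder URL
    assert article is not None
    assert article['title'] != "No title"
    assert article['line_count'] > 0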
Best Practices & Common Challenges¶
Best Practices¶
- Error Handling
  - Always use try-except blocks
  - Log errors with context
  - Return None or empty data on failure
  - Never let a parser crash the entire application
- Respect robots.txt
- User-Agent Headers
  - Always set a descriptive User-Agent
  - Include contact information if scraping extensively
- Data Validation
  - Verify extracted data makes sense
  - Handle missing fields gracefully
  - Validate URLs before making requests
- Performance
  - Use session objects for connection pooling
  - Implement caching for repeated requests
  - Consider concurrent requests for large datasets (see the sketch below)
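A sketch of the concurrent-requests point using the standard-library thread pool; parse_concurrently is an illustrative helper, not an OPAL API, and should be combined with rate limiting so you don't overwhelm the site:

from concurrent.futures import ThreadPoolExecutor

def parse_concurrently(parser, urls, max_workers=4):
    """Parse several articles in parallel threads (the work is network-bound)."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(parser.parse_article, urls))
    return [r for r in results if r]  # Drop failed parses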
Common Challenges¶
- Dynamic Content
  - Problem: Content loaded by JavaScript
  - Solution: Use Selenium or analyze the site's API calls
- Anti-Scraping Measures
  - Problem: IP blocking, CAPTCHAs
  - Solution: Rotate user agents, add delays, respect rate limits
- Changing HTML Structure
  - Problem: Site redesigns break selectors
  - Solution: Use multiple selectors and flexible matching
- Encoding Issues
  - Problem: Special characters display incorrectly
  - Solution: Specify the encoding explicitly, e.g. response.encoding = 'utf-8'
- Large Datasets
  - Problem: Memory issues with thousands of articles
  - Solution: Process in batches and write results incrementally (see the sketch below)
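A sketch of the batch-processing idea, assuming results are appended to a JSON Lines file; the function and file names are illustrative, not part of OPAL:

import json

def scrape_in_batches(parser, urls, batch_size=50, output_path='articles.jsonl'):
    """Parse URLs in batches and append each result to disk immediately."""
    with open(output_path, 'a', encoding='utf-8') as f:
        for start in range(0, len(urls), batch_size):
            batch = urls[start:start + batch_size]
            for url in batch:
                article = parser.parse_article(url)
                if article:
                    f.write(json.dumps(article) + '\n')  # One JSON object per line
            print(f"Finished batch starting at index {start}")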
Debugging Tips¶
# Debug HTML structure
print(soup.prettify()) # See formatted HTML
# Check element existence
if not soup.find('div', class_='article'):
    print("Article container not found!")
# Log selector results
elements = soup.select('.article-title')
print(f"Found {len(elements)} title elements")
# Save problematic HTML
with open('debug.html', 'w') as f:
    f.write(str(soup))
Conclusion¶
Creating custom parsers for OPAL involves understanding web scraping fundamentals, implementing the BaseParser interface, and handling real-world challenges. Start simple, test thoroughly, and always respect the websites you're scraping.
For more examples, examine the existing parsers in the OPAL codebase. Each demonstrates different techniques for handling various website structures and challenges.