Creating Court Parsers with OPAL¶
Overview¶
This guide explains how to create parsers for court websites using OPAL's framework. We'll use ParserAppealsAL
as our reference implementation, which extracts court case data from the Alabama Appeals Court Public Portal. Court parsers typically differ from news parsers because they often need to handle JavaScript-rendered content, complex table structures, and paginated results.
Key Features¶
- JavaScript Support: Uses Selenium WebDriver to render JavaScript-heavy pages
- Automatic Browser Management: Handles Chrome driver setup and teardown
- Rate Limiting: Built-in configurable delays between requests to avoid overwhelming the server
- Table Parsing: Specialized logic for extracting structured data from HTML tables
- Error Handling: Robust error handling with graceful fallbacks
- Headless Operation: Can run with or without a visible browser window
Architecture¶
Class Hierarchy¶
ParserAppealsAL inherits from the BaseParser
base class, which defines the common interface for all parsers in the OPAL system. It overrides key methods to provide court-specific functionality.
Dependencies¶
# Core Dependencies
selenium >= 4.0.0 # Browser automation
webdriver-manager >= 4.0.0 # Automatic ChromeDriver management
beautifulsoup4 # HTML parsing
requests # HTTP requests (inherited from base)
# Standard Library
json # JSON data handling
time # Rate limiting
datetime # Timestamp generation
typing # Type hints
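To install the third-party packages (the standard-library modules ship with Python), a typical pip invocation is:

pip install "selenium>=4.0.0" "webdriver-manager>=4.0.0" beautifulsoup4 requests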
Implementation Guide¶
1. Basic Structure¶
To implement your own court parser based on ParserAppealsAL, start with this structure:
import time
from typing import Dict, List, Optional

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

from your_project.parser_module import BaseParser

class YourCourtParser(BaseParser):
    def __init__(self, headless=True, rate_limit_seconds=3):
        super().__init__()
        self.headless = headless
        self.rate_limit_seconds = rate_limit_seconds
        self.driver = None
2. Core Methods¶
__init__(self, headless: bool = True, rate_limit_seconds: int = 3)¶
Initializes the parser with configuration options.
Parameters:
- headless: Run Chrome in headless mode (no visible window)
- rate_limit_seconds: Delay between requests to avoid rate limiting
_setup_driver(self)¶
Sets up the Chrome WebDriver with appropriate options:
def _setup_driver(self):
    chrome_options = Options()
    if self.headless:
        chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--window-size=1920,1080")
    service = Service(ChromeDriverManager().install())
    self.driver = webdriver.Chrome(service=service, options=chrome_options)
make_request(self, url: str, timeout: int = 30) -> Optional[str]¶
Overrides the base class method to use Selenium instead of requests library.
Key Features:
- Lazy driver initialization
- Waits for specific elements to load
- Implements rate limiting
- Returns page source HTML
def make_request(self, url, timeout=30):
    if not self.driver:
        self._setup_driver()
    self.driver.get(url)
    # Wait for your specific element
    WebDriverWait(self.driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "table"))
    )
    time.sleep(self.rate_limit_seconds)
    return self.driver.page_source
parse_table_row(self, row) -> Optional[Dict]¶
Extracts data from a single table row. This method is specific to the table structure of your court portal.
Expected Table Structure:
1. Court Name
2. Case Number (with optional link)
3. Case Title
4. Classification
5. Filed Date
6. Status
Returns:
{
    "court": "Court of Civil Appeals",
    "case_number": {
        "text": "2230123",
        "link": "/case/details/..."
    },
    "case_title": "Smith v. Jones",
    "classification": "Civil",
    "filed_date": "06/11/2024",
    "status": "Active"
}
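A minimal sketch of such a method is shown below. It assumes the six-column layout listed above, that row is a BeautifulSoup <tr> element, and that the case-number cell may contain an <a> link; verify all of these against your portal's actual HTML. It relies on the imports from the Basic Structure block.

def parse_table_row(self, row) -> Optional[Dict]:
    """Extract one case from a table row (sketch)."""
    cells = row.find_all("td")
    if len(cells) < 6:
        return None  # skip header or malformed rows
    # Column order follows the six-column layout above; adjust to your portal
    link = cells[1].find("a")
    return {
        "court": cells[0].get_text(strip=True),
        "case_number": {
            "text": cells[1].get_text(strip=True),
            "link": link["href"] if link else None,
        },
        "case_title": cells[2].get_text(strip=True),
        "classification": cells[3].get_text(strip=True),
        "filed_date": cells[4].get_text(strip=True),
        "status": cells[5].get_text(strip=True),
    }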
parse_article(self, url: str) -> Dict¶
Main parsing method that processes a single page of court results.
Process:
1. Loads the page using make_request
2. Parses HTML with BeautifulSoup
3. Finds the main data table
4. Extracts data from each row
5. Returns structured results
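A minimal sketch of these five steps, under the same assumptions as above (a single <table> holds the results, and the error payload shape is illustrative):

def parse_article(self, url: str) -> Dict:
    """Parse one page of court results (sketch of the five steps above)."""
    html = self.make_request(url)                 # 1. load the page
    if html is None:
        return {"status": "error", "cases": []}
    soup = BeautifulSoup(html, "html.parser")     # 2. parse the HTML
    table = soup.find("table")                    # 3. find the main data table
    if table is None:
        return {"status": "error", "cases": []}
    cases = []
    for row in table.find_all("tr"):              # 4. extract each row
        case = self.parse_table_row(row)
        if case:
            cases.append(case)
    return {"status": "success", "cases": cases}  # 5. structured results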
parse_all_cases(self, base_url: str, page_urls: List[str]) -> Dict¶
Processes multiple pages of results and combines them.
Returns:
{
    "status": "success",
    "total_cases": 318,
    "extraction_date": "2025-01-13",
    "cases": [
        # List of case dictionaries
    ]
}
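One plausible implementation, assuming parse_article() returns the per-page dictionary sketched above:

from datetime import date

def parse_all_cases(self, base_url: str, page_urls: List[str]) -> Dict:
    """Combine the cases from every results page (sketch)."""
    # base_url is kept for interface compatibility; the page URLs drive the loop
    all_cases = []
    for page_url in page_urls:
        result = self.parse_article(page_url)
        if result.get("status") == "success":
            all_cases.extend(result["cases"])
    return {
        "status": "success",
        "total_cases": len(all_cases),
        "extraction_date": date.today().isoformat(),
        "cases": all_cases,
    }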
3. Integration with OPAL System¶
The parser integrates with OPAL through the IntegratedParser
class:
from opal.integrated_parser import IntegratedParser
from your_parser import YourCourtParser

# Create parser instance
parser = IntegratedParser(YourCourtParser)

# Process court data
result = parser.process_site(
    base_url="https://your-court-portal.gov/search",
    suffix="",       # Not used for court parsers
    max_pages=None   # Will process all available pages
)
4. URL Pagination¶
Court portals often use complex URL parameters for pagination. The system includes helper functions in court_url_paginator.py:
- parse_court_url(): Extracts the page number and total pages from a URL
- build_court_url(): Constructs URLs for specific pages
- paginate_court_urls(): Generates the list of all page URLs
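The real signatures live in court_url_paginator.py; the sketch below assumes a simple ?page=N query parameter, which you would adapt to your portal's URL scheme:

from typing import List
from urllib.parse import urlencode, urlparse

def build_court_url(base_url: str, page: int) -> str:
    """Construct the URL for one results page (assumes a page=N parameter)."""
    separator = "&" if urlparse(base_url).query else "?"
    return f"{base_url}{separator}{urlencode({'page': page})}"

def paginate_court_urls(base_url: str, total_pages: int) -> List[str]:
    """Generate one URL per results page."""
    return [build_court_url(base_url, page) for page in range(1, total_pages + 1)]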
5. Best Practices¶
- Error Handling: Always wrap operations in try-except blocks
- Resource Management: Ensure the driver is closed in finally blocks (see the sketch after this list)
- Rate Limiting: Respect server limits to avoid IP bans
- Dynamic Waits: Use WebDriverWait instead of fixed sleep times when possible
- Memory Management: Close driver after processing to free resources
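The resource-management pattern might look like the following sketch; run() is a hypothetical wrapper name, not part of the base class:

def run(self, base_url: str, page_urls: List[str]) -> Dict:
    """Run a full scrape and always release the browser (sketch)."""
    try:
        return self.parse_all_cases(base_url, page_urls)
    finally:
        # Quit even if parsing raised, so Chrome processes are not leaked
        if self.driver:
            self.driver.quit()
            self.driver = None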
6. Testing¶
Create test scripts to validate your parser:
from your_parser import YourCourtParser

def test_single_page():
    parser = YourCourtParser(headless=True)
    result = parser.parse_article("https://court-url.gov/page1")
    assert result["cases"]
    assert len(result["cases"]) > 0
    # Validate case structure
    case = result["cases"][0]
    assert "court" in case
    assert "case_number" in case
    assert "case_title" in case
Customization Guide¶
Adapting for Different Court Systems¶
- Table Structure: Modify parse_table_row() to match your court's table columns
- Wait Conditions: Update the element selector in make_request()
- URL Patterns: Adjust pagination logic in the helper functions
- Data Fields: Add or remove fields based on available data
Common Modifications¶
- Different Table Selectors
- Additional Data Extraction
- Custom Headers
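The original snippets for these bullets are not reproduced here; the fragments below illustrate one plausible form of each, meant to slot into the methods shown earlier. The .case-results-table class and the seventh "judge" column are placeholders, and since Selenium cannot set arbitrary HTTP headers, the user agent is overridden via a Chrome option instead:

# Different table selectors: target the results table by CSS class
# (".case-results-table" is a placeholder; inspect your portal's HTML)
table = soup.select_one("table.case-results-table")

# Additional data extraction: read an extra column in parse_table_row()
# (a hypothetical seventh "judge" cell is assumed here)
if len(cells) > 6:
    case["judge"] = cells[6].get_text(strip=True)

# Custom headers: override the user agent in _setup_driver()
chrome_options.add_argument("--user-agent=YourProject/1.0 (contact@example.org)")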
Troubleshooting¶
Common Issues¶
- ChromeDriver Not Found:
  - Solution: webdriver-manager should handle this automatically
  - Manual fix: Download a ChromeDriver matching your Chrome version
- Elements Not Loading:
  - Increase the timeout in WebDriverWait
  - Check whether element selectors have changed
  - Verify JavaScript is executing properly
- Rate Limiting:
  - Increase rate_limit_seconds
  - Implement exponential backoff (see the sketch after this list)
  - Consider using proxy rotation
- Memory Leaks:
  - Ensure the driver is closed after use
  - Implement periodic driver restarts for long runs
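For the exponential-backoff suggestion, a hedged sketch (make_request_with_backoff is a hypothetical helper built on the make_request method above):

def make_request_with_backoff(self, url: str, max_attempts: int = 4) -> Optional[str]:
    """Retry make_request with exponentially growing delays (sketch)."""
    for attempt in range(max_attempts):
        try:
            return self.make_request(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s between attempts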
Performance Considerations¶
- Headless Mode: Significantly faster than visible browser
- Parallel Processing: Not recommended due to rate limits
- Caching: Consider caching parsed results to avoid re-parsing
- Resource Usage: Each driver instance uses ~100-200MB RAM
Example Output¶
{
    "status": "success",
    "total_cases": 318,
    "extraction_date": "2025-01-13",
    "cases": [
        {
            "court": "Court of Civil Appeals",
            "case_number": {
                "text": "CL-2024-000123",
                "link": "/portal/case/details/123"
            },
            "case_title": "Smith v. Jones Corporation",
            "classification": "Civil Appeal",
            "filed_date": "01/10/2025",
            "status": "Pending"
        },
        {
            "court": "Court of Criminal Appeals",
            "case_number": {
                "text": "CR-2024-000456",
                "link": "/portal/case/details/456"
            },
            "case_title": "State of Alabama v. Doe",
            "classification": "Criminal Appeal",
            "filed_date": "01/09/2025",
            "status": "Active"
        }
    ]
}
Security Considerations¶
- Input Validation: Always validate URLs before processing
- Sandbox Mode: Chrome runs with --no-sandbox for compatibility; this weakens process isolation, so run the parser in a trusted environment
- Credential Storage: Never hardcode credentials in parser
- SSL Verification: Chrome validates SSL certificates by default; avoid flags that disable certificate checks
Future Enhancements¶
Consider these improvements for production use:
- Retry Logic: Implement automatic retries for failed requests
- Progress Tracking: Add callbacks for progress updates
- Data Validation: Implement schema validation for parsed data
- Export Formats: Support multiple output formats (CSV, Excel)
- Incremental Updates: Track previously parsed cases to avoid duplicates