Alabama Appeals Court Public Portal Scraper - Implementation Instructions
Overview
Create a court case scraper extension for OPAL that extracts tabular data from the Alabama Appeals Court Public Portal. The scraper must handle JavaScript-rendered content, complex URL-based pagination, and preserve both text and link references.
Step-by-Step Implementation Instructions
Step 1: Update Dependencies
Add the following to requirements.txt and pyproject.toml (sample requirements.txt lines follow the list):
- selenium>=4.0.0 or playwright>=1.40.0 (for JavaScript rendering)
- webdriver-manager>=4.0.0 (if using Selenium, for automatic driver management)
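For the Selenium route, the requirements.txt additions would simply be the two pinned lines below (a Playwright setup would list playwright>=1.40.0 instead):

```text
selenium>=4.0.0
webdriver-manager>=4.0.0
```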
Step 2: Create Court Case Parser Module
Create a new file opal/court_case_parser.py with the following specifications (a skeleton sketch follows the list):
- Import the necessary libraries:
  - Selenium/Playwright for JavaScript rendering
  - BeautifulSoup for HTML parsing
  - Standard libraries for URL manipulation and JSON output
- Create a CourtCaseParser class that extends BaseParser:
  - Override make_request() to use Selenium/Playwright instead of requests
  - Implement JavaScript rendering with appropriate wait conditions
  - Add rate limiting (minimum 2-3 seconds between requests)
- Implement a parse_table_row() method to extract:
  - Court name from <td class="text-start"> (column 1)
  - Case number text and href from <a href="/portal/court/..."> (column 2)
  - Case title from <td class="text-start"> (column 3)
  - Classification from <td class="text-start"> (column 4)
  - Filed date from <td class="text-start"> (column 5)
  - Open/Closed status from <td class="text-start"> (column 6)
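A minimal sketch of that skeleton, assuming the Selenium option; the BaseParser import path and initializer are assumptions about OPAL's codebase and should be adapted to the real base class:

```python
# opal/court_case_parser.py -- sketch; BaseParser location and interface
# are assumptions, adapt them to OPAL's actual base class.
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

from .base_parser import BaseParser  # assumed location


class CourtCaseParser(BaseParser):
    """Scrapes the Alabama Appeals Court Public Portal results table."""

    def __init__(self, delay_seconds=3):
        self.delay_seconds = delay_seconds  # rate limit between page loads
        self.driver = None

    def _get_driver(self):
        # Lazily start one headless Chrome instance and reuse it
        if self.driver is None:
            options = webdriver.ChromeOptions()
            options.add_argument("--headless=new")
            self.driver = webdriver.Chrome(
                service=Service(ChromeDriverManager().install()),
                options=options,
            )
        return self.driver

    def make_request(self, url):
        # Render the JavaScript page instead of calling requests
        driver = self._get_driver()
        driver.get(url)
        time.sleep(self.delay_seconds)  # crude settle/rate-limit delay
        return driver.page_source

    def parse_table_row(self, row):
        # row is a BeautifulSoup <tr>; columns follow the portal layout above
        cells = row.find_all("td", class_="text-start")
        link = cells[1].find("a")
        return {
            "court": cells[0].get_text(strip=True),
            "case_number": {
                "text": link.get_text(strip=True),
                "link": link["href"],  # relative /portal/court/... href
            },
            "case_title": cells[2].get_text(strip=True),
            "classification": cells[3].get_text(strip=True),
            "filed_date": cells[4].get_text(strip=True),
            "status": cells[5].get_text(strip=True),
        }
```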
Step 3: Create Custom URL Pagination Handler
Create opal/court_url_paginator.py with (see the sketch after this list):
- A URL parser function to:
  - Extract and decode the complex URL parameters
  - Identify the current page number from page~(number~X)
  - Extract the total page count from totalPages~X
- A URL builder function to:
  - Take a base URL and a page number
  - Update the page~(number~X) parameter
  - Maintain all other search parameters
  - Handle the special encoding (~, %2a2f, etc.)
- A pagination iterator that:
  - Starts at page 0
  - Continues until reaching totalPages
  - Yields a properly formatted URL for each page
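A sketch of those three pieces, assuming the number~X and totalPages~X tokens can be located with regular expressions inside the URL-encoded criteria string (the paginate_court_urls name matches the import planned in Step 6):

```python
# opal/court_url_paginator.py -- sketch; the regexes assume the criteria
# format shown in the Step 8 test URLs.
import re


def parse_court_url(url):
    """Return (current_page, total_pages) parsed from a portal results URL."""
    number = re.search(r"number~(\d+)", url)
    total = re.search(r"totalPages~(\d+)", url)
    current = int(number.group(1)) if number else 0
    total_pages = int(total.group(1)) if total else 0
    return current, total_pages


def build_court_url(base_url, page_number):
    """Return base_url with the page~(number~X) token set to page_number."""
    # Only the number~X token changes; every other search parameter,
    # including the ~ and %2a2f encodings, is passed through untouched.
    return re.sub(r"number~\d+", f"number~{page_number}", base_url, count=1)


def paginate_court_urls(first_page_url, total_pages=None):
    """Yield one properly formatted URL per results page, starting at page 0."""
    if total_pages is None:
        _, total_pages = parse_court_url(first_page_url)
    for page in range(max(total_pages, 1)):  # always yield at least page 0
        yield build_court_url(first_page_url, page)
```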
Step 4: Implement Data Extraction Logic
In CourtCaseParser, create a parse_all_cases() method (sketched below) that:
- Initializes the browser driver (Selenium/Playwright)
- Loads the first page and extracts the total page count from the URL
- For each page:
  - Navigates to the page URL
  - Waits for the table to load (use explicit waits)
  - Extracts all table rows
  - Parses each row using parse_table_row()
  - Stores results with preserved link references
- Closes the browser driver when complete
- Returns the combined results from all pages as a single dataset
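One way that loop could look with Selenium's explicit waits, continuing the earlier sketches; the "table tbody tr" selector and the assumption that the portal rewrites the URL with the real totalPages (suggested by the Step 8 URLs) should both be verified against the live page:

```python
# Sketch of parse_all_cases() as a CourtCaseParser method
import time

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

from .court_url_paginator import paginate_court_urls, parse_court_url


def parse_all_cases(self, first_page_url):
    cases = []
    driver = self._get_driver()
    try:
        # Load page 0 once; per Step 8 the results URL then carries the
        # updated totalPages value, which we read back out.
        self.make_request(first_page_url)
        _, total_pages = parse_court_url(driver.current_url)
        for page_url in paginate_court_urls(first_page_url, total_pages):
            driver.get(page_url)
            # Explicit wait: block until at least one data row has rendered
            WebDriverWait(driver, 30).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "table tbody tr"))
            )
            soup = BeautifulSoup(driver.page_source, "html.parser")
            for row in soup.select("table tbody tr"):
                cases.append(self.parse_table_row(row))
            time.sleep(self.delay_seconds)  # rate limit between pages
    finally:
        driver.quit()  # always release the browser
        self.driver = None
    return cases
```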
Step 5: Define Output Format
Structure the output JSON as:
{
  "status": "success",
  "total_cases": 317,
  "extraction_date": "2025-06-11",
  "cases": [
    {
      "court": "Alabama Supreme Court",
      "case_number": {
        "text": "SC-2025-0424",
        "link": "/portal/court/68f021c4-6a44-4735-9a76-5360b2e8af13/case/d024d958-58a1-41c9-9fae-39c645c7977e"
      },
      "case_title": "Frank Thomas Shumate, Jr. v. Berry Contracting L.P. d/b/a Bay Ltd.",
      "classification": "Appeal - Civil - Injunction Other",
      "filed_date": "06/10/2025",
      "status": "Open"
    }
  ]
}
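A sketch of assembling and writing that envelope; the field names mirror the example above, while the write_results name and default output path are illustrative:

```python
import json
from datetime import date


def write_results(cases, output_path="court_cases.json"):
    """Wrap parsed rows in the envelope shown above and write them to disk."""
    payload = {
        "status": "success",
        "total_cases": len(cases),
        "extraction_date": date.today().isoformat(),
        "cases": cases,
    }
    with open(output_path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2)
```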
Step 6: Integrate with OPAL CLI
Modify the existing OPAL files:
- Update opal/__init__.py:
  - Add from .court_case_parser import CourtCaseParser
  - Add from .court_url_paginator import paginate_court_urls
- Update opal/integrated_parser.py:
  - Add conditional logic to handle court case URLs differently
  - Use paginate_court_urls instead of get_all_news_urls for court sites
- Update opal/main.py (see the sketch after this list):
  - Add a --parser ParserAppealsAL option to the argparse choices
  - Add the court parser to the parser selection logic
  - Adjust the output filename format for court data
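The opal/main.py wiring might look like this; the pre-existing choice names are placeholders for whatever parsers OPAL already registers:

```python
# Sketch of the opal/main.py changes; existing choice names are illustrative
import argparse

from opal.court_case_parser import CourtCaseParser

arg_parser = argparse.ArgumentParser(prog="opal")
arg_parser.add_argument("--url", required=True)
arg_parser.add_argument(
    "--parser",
    choices=["Parser1819", "ParserDailyNews", "ParserAppealsAL"],  # new option added
    required=True,
)
args = arg_parser.parse_args()

if args.parser == "ParserAppealsAL":
    scraper = CourtCaseParser()
    cases = scraper.parse_all_cases(args.url)
    # Court output would get its own filename format here, per this step
```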
Step 7: Handle Technical Requirements
Implement the following in CourtCaseParser (a wait/error-handling sketch follows the list):
- JavaScript rendering:
  - Wait for the table element to be present
  - Wait for the data rows to load
  - Handle any loading spinners or dynamic content
- Error handling:
  - Timeout exceptions for slow page loads
  - Missing table elements
  - Network errors
  - Browser crashes
- Rate limiting:
  - Add a configurable delay between page requests (default 3 seconds)
  - Respect server response times
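Those waiting and error-handling requirements could be wrapped in a helper like the one below, using Selenium's explicit waits; the timeout value and selectors are illustrative:

```python
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def wait_for_table(driver, timeout=30):
    """Wait for the results table and its rows; report failure instead of crashing."""
    try:
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "table tbody tr"))
        )
        return True
    except TimeoutException:
        # Slow page load or missing table: let the caller skip this page
        return False
    except WebDriverException:
        # Covers network errors and browser crashes surfaced by the driver
        return False
```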
Step 8: Testing URLs
Use these URLs for testing:
- First page: https://publicportal.alappeals.gov/portal/search/case/results?criteria=~%28advanced~false~courtID~%2768f021c4-6a44-4735-9a76-5360b2e8af13~page~%28size~25~number~0~totalElements~0~totalPages~0%29~sort~%28sortBy~%27caseHeader.filedDate~sortDesc~true%29~case~%28caseCategoryID~1000000~caseNumberQueryTypeID~10463~caseTitleQueryTypeID~300054~filedDateChoice~%27-1y~filedDateStart~%2706%2a2f11%2a2f2024~filedDateEnd~%2706%2a2f11%2a2f2025~excludeClosed~false%29%29
- Second page: the same URL but with page~(number~1) and the updated totalElements~317~totalPages~13
Step 9: Final Integration
- Test the complete flow with:
  python -m opal --url [court_url] --parser ParserAppealsAL
- Ensure the output file is created with the court case data in tabular format
- Verify all pages are scraped and combined into a single result set
- Confirm case number links are preserved in the output (see the check below)
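A quick check over the combined output; the filename follows the Step 5 sketch and is illustrative:

```python
import json

with open("court_cases.json", encoding="utf-8") as fh:
    data = json.load(fh)

# All pages combined into one result set, with link references preserved
assert data["total_cases"] == len(data["cases"])
assert all(case["case_number"]["link"].startswith("/portal/") for case in data["cases"])
print(f"OK: {data['total_cases']} cases")
```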
Expected Deliverables
- opal/court_case_parser.py - Main parser for court data
- opal/court_url_paginator.py - URL pagination handler
- Updated opal/__init__.py, opal/integrated_parser.py, and opal/main.py
- Updated requirements.txt and pyproject.toml with the new dependencies
- JSON output file with all court cases in structured format