Architecture¶
OPAL follows a modular architecture that makes it easy to add new parsers and extend functionality.
Core Components¶
BaseParser¶
The foundation of all parsers, providing:
- Common web scraping functionality
- Error handling and retry logic
- Logging infrastructure
- Output formatting
Parser Classes¶
Each website has its own parser class that inherits from BaseParser:
- Parser1819: For 1819 News
- ParserDailyNews: For Alabama Daily News
- ParserAppealsAL: For Alabama Appeals Court
Main Module¶
The __main__.py module handles:
- Command-line argument parsing
- Parser instantiation
- Execution flow
- Output management
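The command-line side of these responsibilities could be sketched as follows. This is a minimal illustration built on the standard `argparse` module; the flag names (`--url`, `--parser`, `--output`) and the default file name are assumptions, not OPAL's confirmed interface.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Build the CLI. Flag names and defaults here are illustrative assumptions."""
    cli = argparse.ArgumentParser(prog="opal", description="Scrape a supported site")
    cli.add_argument("--url", required=True, help="Page to start scraping from")
    cli.add_argument("--parser", required=True, help="Name of the parser to use")
    cli.add_argument("--output", default="output.json", help="JSON file to write")
    return cli


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_arg_parser().parse_args(
    ["--url", "https://example.com/news", "--parser", "Parser1819"]
)
print(args.url, args.parser, args.output)
```

In a real entry point, `parse_args()` would be called without arguments so that it reads `sys.argv`.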
Class Hierarchy¶
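Based on the parser classes listed under Core Components, the inheritance relationships can be sketched as below. Class bodies are omitted, and the assumption that each parser subclasses BaseParser directly (with no intermediate classes) comes only from the description above.

```python
class BaseParser:
    """Shared scraping, retry, logging, and output logic (body omitted)."""


class Parser1819(BaseParser):
    """Parser for 1819 News."""


class ParserDailyNews(BaseParser):
    """Parser for Alabama Daily News."""


class ParserAppealsAL(BaseParser):
    """Parser for the Alabama Appeals Court."""


# Print each site-specific parser next to its immediate parent class.
for cls in (Parser1819, ParserDailyNews, ParserAppealsAL):
    print(f"{cls.__name__} -> {cls.__mro__[1].__name__}")
```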
Data Flow¶
- Input: User provides URL and parser type via CLI
- Initialization: Main module creates parser instance
- Scraping: Parser fetches and processes web pages
- Extraction: Parser extracts structured data
- Output: Data saved to JSON file
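The final step of this flow, saving the data to a JSON file, might look like the sketch below. The helper name `save_results` and the shape of the payload (a `total` count plus an `articles` list) are hypothetical; OPAL's actual output schema may differ.

```python
import json
from pathlib import Path


def save_results(articles: list[dict], path: str) -> None:
    """Write extracted records to a JSON file (hypothetical helper)."""
    # Assumed payload layout: a count followed by the records themselves.
    payload = {"total": len(articles), "articles": articles}
    text = json.dumps(payload, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")
```

Passing `ensure_ascii=False` keeps non-ASCII characters in headlines readable in the output file instead of escaping them.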
Key Design Patterns¶
Template Method Pattern¶
BaseParser defines the scraping workflow:
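One way such a workflow might look, as a minimal sketch: the base class fixes the order of the steps in a single method, and subclasses supply the site-specific steps. The method names `get_article_links()`, `parse_article()`, and `format_output()` appear elsewhere in this document, but the `run()` method, all signatures, and the output keys are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class BaseParser(ABC):
    """Sketch of a base class whose run() fixes the order of the steps."""

    def __init__(self, url: str) -> None:
        self.url = url

    def run(self) -> dict:
        # Template method: subclasses fill in the steps, not their order.
        links = self.get_article_links()
        articles = [self.parse_article(link) for link in links]
        return self.format_output(articles)

    @abstractmethod
    def get_article_links(self) -> list[str]: ...

    @abstractmethod
    def parse_article(self, link: str) -> dict: ...

    def format_output(self, articles: list[dict]) -> dict:
        # Default output shape (assumed); subclasses may override it.
        return {"source": self.url, "count": len(articles), "articles": articles}


class DemoParser(BaseParser):
    """Stand-in subclass that returns canned data instead of fetching pages."""

    def get_article_links(self) -> list[str]:
        return [f"{self.url}/a", f"{self.url}/b"]

    def parse_article(self, link: str) -> dict:
        return {"url": link, "title": link.rsplit("/", 1)[-1].upper()}


result = DemoParser("https://example.com").run()
print(result["count"])
```

Because the two site-specific steps are abstract, instantiating `BaseParser` directly raises a `TypeError`, which catches a subclass that forgets to implement one of them.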
Factory Pattern¶
The parser class is looked up in a mapping keyed by the command-line argument:
parsers = {
    'Parser1819': Parser1819,
    'ParserDailyNews': ParserDailyNews,
    'court': ParserAppealsAL
}
parser_class = parsers[args.parser]
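A plain dictionary lookup raises a bare `KeyError` when the name is not registered. The sketch below wraps the lookup in a small factory function that reports the valid names instead. The function name `create_parser`, the stub classes, and the assumption that parsers are constructed with a URL are all hypothetical.

```python
class BaseParser:
    """Minimal stand-in so the sketch runs on its own."""

    def __init__(self, url: str) -> None:
        self.url = url


class Parser1819(BaseParser): ...
class ParserDailyNews(BaseParser): ...
class ParserAppealsAL(BaseParser): ...


# Same name-to-class mapping as in the snippet above.
PARSERS = {
    "Parser1819": Parser1819,
    "ParserDailyNews": ParserDailyNews,
    "court": ParserAppealsAL,
}


def create_parser(name: str, url: str) -> BaseParser:
    """Instantiate the parser registered under `name`, or fail with a clear message."""
    try:
        parser_class = PARSERS[name]
    except KeyError:
        valid = ", ".join(sorted(PARSERS))
        raise ValueError(f"Unknown parser {name!r}; expected one of: {valid}") from None
    return parser_class(url)
```

The same validation could also be delegated to `argparse` by passing the mapping's keys as `choices` for the parser argument.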
Extension Points¶
Adding New Parsers¶
- Create a new class that inherits from BaseParser
- Implement the required methods:
  - extract_article_data()
  - get_article_links()
  - parse_article()
- Register the class in the main module's parser mapping
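The steps above can be sketched as follows. The class name `ParserExampleSite`, the method signatures, and the regex-based extraction are assumptions chosen to keep the example self-contained; a real parser would follow whatever signatures BaseParser actually defines and would likely use a proper HTML parser.

```python
import re


class BaseParser:
    """Minimal stand-in for OPAL's BaseParser so the sketch runs on its own."""

    def __init__(self, url: str) -> None:
        self.url = url


class ParserExampleSite(BaseParser):
    """Hypothetical parser for a site that is not part of OPAL."""

    def get_article_links(self, html: str) -> list[str]:
        # Collect href targets that look like article pages.
        return re.findall(r'href="(/articles/[^"]+)"', html)

    def parse_article(self, html: str) -> dict:
        # Pull the headline out of the first <h1> element, if there is one.
        match = re.search(r"<h1>(.*?)</h1>", html, flags=re.S)
        return {"title": match.group(1).strip() if match else None}

    def extract_article_data(self, index_html: str, pages: dict[str, str]) -> list[dict]:
        # Combine the two steps: find the links, then parse each linked page.
        return [
            {"url": link, **self.parse_article(pages[link])}
            for link in self.get_article_links(index_html)
        ]


# Registration step: add the class to the name-to-class mapping.
parsers = {"ParserExampleSite": ParserExampleSite}

# Canned HTML stands in for fetched pages.
index_html = '<a href="/articles/one">One</a><a href="/about">About</a>'
pages = {"/articles/one": "<h1> First story </h1>"}
parser = parsers["ParserExampleSite"]("https://example.com")
records = parser.extract_article_data(index_html, pages)
print(records)
```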
Customizing Output¶
Override the format_output() method to customize the structure of the output data.
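As a rough illustration, a subclass might extend the default structure rather than replace it. The base implementation shown here, its signature, and the added `count` and `scraped_at` fields are assumptions, since the actual default output of format_output() is not described in this section.

```python
from datetime import datetime, timezone


class BaseParser:
    """Minimal stand-in with an assumed default output shape."""

    def format_output(self, articles: list[dict]) -> dict:
        return {"articles": articles}


class TimestampedParser(BaseParser):
    """Hypothetical subclass that adds metadata to the default output."""

    def format_output(self, articles: list[dict]) -> dict:
        # Start from the parent's structure, then add the extra fields.
        output = super().format_output(articles)
        output["count"] = len(articles)
        output["scraped_at"] = datetime.now(timezone.utc).isoformat()
        return output


formatted = TimestampedParser().format_output([{"title": "Example"}])
print(sorted(formatted))
```

Calling `super().format_output()` first means any fields the base class adds later are preserved in the customized output.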
Adding Features¶
- Authentication: Add login methods
- Caching: Implement request caching
- Rate limiting: Add delay mechanisms
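Of these, rate limiting is the simplest to sketch. The helper below enforces a minimum interval between consecutive requests. The class name `RateLimiter` and the idea of calling `wait()` before each fetch are assumptions about how such a feature could be wired into a parser, not a description of existing OPAL code.

```python
import time
from typing import Optional


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        """Sleep just long enough to keep calls min_interval seconds apart."""
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        # Record when this call was released, not when it arrived.
        self._last_call = time.monotonic()


limiter = RateLimiter(min_interval=0.05)
for page in ("page-1", "page-2", "page-3"):
    limiter.wait()  # a parser would call this right before each HTTP request
    print("fetching", page)
```

`time.monotonic()` is used instead of `time.time()` so that system clock adjustments cannot shorten or lengthen the delay.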