1819 News Parser (Parser1819)¶
The 1819 News Parser (Parser1819) is designed to extract articles from the 1819 News website, focusing on Alabama political and legal news coverage.
Overview¶
This parser extracts article content, metadata, and structure from 1819 News, providing comprehensive data for analysis of Alabama news coverage.
Prerequisites¶
- OPAL installed and configured
- Internet connection
- Basic command-line knowledge
Basic Usage¶
Simple Extraction¶
# Extract articles from 1819 News
python -m opal --url "https://1819news.com/" --parser Parser1819
# With pagination limit
python -m opal --url "https://1819news.com/" --parser Parser1819 --max_pages 5
# With URL suffix filter
python -m opal --url "https://1819news.com/" --parser Parser1819 --suffix "/news/item"
Command Line Arguments¶
# Required arguments
--url # Base URL (https://1819news.com/)
--parser # Must be "Parser1819"
# Optional arguments
--suffix # URL suffix to filter article pages
--max_pages # Maximum number of pages to process
Features¶
Content Extraction¶
The parser extracts:
- Article title
- Author information
- Publication date
- Full article text (line by line)
- Article URL
Structured Output¶
- Articles are parsed line by line
- Preserves paragraph structure
- Maintains text formatting
- Counts total lines for analysis
Data Fields Extracted¶
| Field | Description | Example |
|---|---|---|
| url | Article URL | "https://1819news.com/news/item/..." |
| title | Article title | "Alabama Legislature Passes New Bill" |
| author | Article author | "John Smith" |
| date | Publication date | "March 15, 2024" |
| line_count | Number of text lines | 45 |
| line_content | Article text by line | {"line 1": "First paragraph...", ...} |
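Once a run has been saved, these fields can be read straight from the output JSON. The sketch below (the filename is just an example from a dated run) rebuilds the full article text from line_content by sorting the "line N" keys numerically:

import json

# Load a previously saved run (example filename)
with open('2024-03-15_Parser1819.json', 'r') as f:
    data = json.load(f)

article = data['articles'][0]
print(article['title'], '-', article['author'], '-', article['date'])

# line_content keys look like "line 1", "line 2", ...; sort them numerically
ordered_lines = sorted(article['line_content'].items(), key=lambda kv: int(kv[0].split()[1]))
full_text = '\n'.join(text for _, text in ordered_lines)
print(f"{article['line_count']} lines, {len(full_text)} characters")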
Output Format¶
JSON Structure¶
{
  "success": true,
  "total_articles": 25,
  "timestamp": "2024-03-15T10:30:45",
  "site_info": {
    "base_url": "https://1819news.com/",
    "pages_processed": 5
  },
  "articles": [
    {
      "url": "https://1819news.com/news/item/example-article",
      "title": "Alabama Legislature Considers New Education Bill",
      "author": "Jane Doe",
      "date": "March 14, 2024",
      "line_count": 32,
      "line_content": {
        "line 1": "The Alabama Legislature is considering...",
        "line 2": "The bill, sponsored by Senator...",
        "line 3": "Education advocates say..."
      }
    }
  ]
}
Advanced Usage¶
Filtering by URL Pattern¶
# Only process articles with specific URL patterns
python -m opal --url "https://1819news.com/" --parser Parser1819 --suffix "/news/politics"
# Multiple patterns (in script)
python script_with_multiple_suffixes.py
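The script_with_multiple_suffixes.py referenced above is only a placeholder name; a minimal version could simply invoke the CLI once per suffix using the documented arguments (the suffix values below are examples, not a list of real site sections):

import subprocess

suffixes = ["/news/item", "/news/politics"]  # example URL patterns

for suffix in suffixes:
    # One OPAL run per suffix, using only the documented CLI arguments
    subprocess.run([
        "python", "-m", "opal",
        "--url", "https://1819news.com/",
        "--parser", "Parser1819",
        "--suffix", suffix,
        "--max_pages", "5",
    ], check=True)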
Processing Specific Sections¶
from opal.parser_module import Parser1819
from opal.integrated_parser import IntegratedParser

# Create parser for specific section
news_parser = IntegratedParser(Parser1819)

# Process politics section
politics_articles = news_parser.process_site(
    base_url="https://1819news.com/news/politics/",
    max_pages=10
)
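To keep the results, write them to disk after the run. This sketch hedges on whether process_site returns an already-serialized JSON string or a Python structure, and handles both:

import json

# Save the politics-section results (handles either a JSON string or a dict/list)
with open('politics_articles.json', 'w', encoding='utf-8') as f:
    if isinstance(politics_articles, str):
        f.write(politics_articles)
    else:
        json.dump(politics_articles, f, indent=2)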
Common Use Cases¶
Daily News Monitoring¶
#!/bin/bash
# Daily news scrape
DATE=$(date +%Y-%m-%d)
python -m opal --url "https://1819news.com/" --parser Parser1819 --max_pages 3
echo "News saved to ${DATE}_Parser1819.json"
Keyword Analysis¶
import json

# Load parsed articles
with open('2024-03-15_Parser1819.json', 'r') as f:
    data = json.load(f)

# Search for keywords
keyword = "court"
matching_articles = []

for article in data['articles']:
    content = ' '.join(article['line_content'].values()).lower()
    if keyword in content:
        matching_articles.append(article)

print(f"Found {len(matching_articles)} articles mentioning '{keyword}'")
Author Tracking¶
# Track articles by specific authors
from collections import Counter

author_counts = Counter(
    article['author']
    for article in data['articles']
)

print("Articles by author:")
for author, count in author_counts.most_common():
    print(f"{author}: {count} articles")
Parser-Specific Behavior¶
HTML Structure Handling¶
- Looks for a div with class author-date for metadata
- Extracts all <p> tags for article content
- Handles missing author/date gracefully
Error Handling¶
- Returns "Unknown Author" if author not found
- Returns "Unknown Date" if date not found
- Continues processing even if some articles fail
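The following is only an illustration of the behavior described in the two lists above, not the parser's actual source. It assumes BeautifulSoup-style parsing and that the author-date div holds the author and date as adjacent text:

from bs4 import BeautifulSoup

def extract_article_fields(html: str) -> dict:
    """Sketch of the extraction rules described above (not Parser1819's real code)."""
    soup = BeautifulSoup(html, 'html.parser')

    # Metadata lives in a div with class "author-date"; fall back to defaults if missing
    meta = soup.find('div', class_='author-date')
    author = 'Unknown Author'
    date = 'Unknown Date'
    if meta is not None:
        # Assumed layout: author and date appear as separate text pieces inside the div
        parts = [p.strip() for p in meta.get_text('|').split('|') if p.strip()]
        if len(parts) > 0:
            author = parts[0]
        if len(parts) > 1:
            date = parts[1]

    # Article body comes from the <p> tags, one entry per line
    lines = [p.get_text(strip=True) for p in soup.find_all('p') if p.get_text(strip=True)]
    return {
        'author': author,
        'date': date,
        'line_count': len(lines),
        'line_content': {f'line {i}': text for i, text in enumerate(lines, start=1)},
    }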
Troubleshooting¶
Common Issues¶
No Articles Found
- Verify the base URL is correct
- Check if the website structure has changed
- Ensure the suffix parameter matches actual URLs

Missing Metadata
- Some articles may not have author information
- Date formats may vary
- The parser provides defaults for missing data

Incomplete Content
- Check for JavaScript-rendered content
- Some articles may require authentication
- Verify network connectivity
Debug Tips¶
# Run a single page first to inspect the output
python -m opal --url "https://1819news.com/" --parser Parser1819 --max_pages 1

# Check a specific article's structure (replace the sample path with a real article URL)
curl -s "https://1819news.com/news/item/sample" | grep -E "(author-date|<p>)"
Performance Optimization¶
- Use --max_pages: Limit pages to process
- Filter with --suffix: Reduce unnecessary requests
- Batch Processing: Process sections separately (see the sketch after this list)
- Rate Limiting: Built-in delays prevent blocking
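One way to batch-process sections is to run each one as its own job with the Python API shown earlier; the section URLs below are examples, and the return value of process_site is simply collected per section:

from opal.parser_module import Parser1819
from opal.integrated_parser import IntegratedParser

# Example section URLs; adjust to the sections you actually need
sections = [
    "https://1819news.com/news/politics/",
    "https://1819news.com/news/item/",
]

parser = IntegratedParser(Parser1819)
results = {}
for section_url in sections:
    # A smaller max_pages per section keeps each request batch short
    results[section_url] = parser.process_site(base_url=section_url, max_pages=3)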
Integration Examples¶
CSV Export¶
import json
import csv

# Load JSON data
with open('2024-03-15_Parser1819.json', 'r') as f:
    data = json.load(f)

# Export to CSV
with open('articles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'author', 'date', 'url'])
    writer.writeheader()
    for article in data['articles']:
        writer.writerow({
            'title': article['title'],
            'author': article['author'],
            'date': article['date'],
            'url': article['url']
        })
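If pandas is available, the same export is a couple of lines on the articles list (this assumes every article dict contains the four fields shown above):

import json
import pandas as pd

with open('2024-03-15_Parser1819.json', 'r') as f:
    data = json.load(f)

# Build a DataFrame from the article dicts and keep only the flat fields
df = pd.DataFrame(data['articles'])[['title', 'author', 'date', 'url']]
df.to_csv('articles.csv', index=False)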
Next Steps¶
- Try the ParserDailyNews parser for other news sources
- Learn about Creating Custom Parsers
- Explore Output Examples