1819 News Parser (Parser1819)

The 1819 News Parser (Parser1819) is designed to extract articles from the 1819 News website, focusing on Alabama political and legal news coverage.

Overview

This parser extracts each article's title, author, publication date, URL, and body text from 1819 News and returns them as structured JSON for analysis of Alabama news coverage.

Prerequisites

  • OPAL installed and configured
  • Internet connection
  • Basic command-line knowledge

Basic Usage

Simple Extraction

# Extract articles from 1819 News
python -m opal --url "https://1819news.com/" --parser Parser1819

# With pagination limit
python -m opal --url "https://1819news.com/" --parser Parser1819 --max_pages 5

# With URL suffix filter
python -m opal --url "https://1819news.com/" --parser Parser1819 --suffix "/news/item"

Command Line Arguments

# Required arguments
--url        # Base URL (https://1819news.com/)
--parser     # Must be "Parser1819"

# Optional arguments
--suffix     # URL suffix to filter article pages
--max_pages  # Maximum number of pages to process

Features

Content Extraction

The parser extracts:

  • Article title
  • Author information
  • Publication date
  • Full article text (line by line)
  • Article URL

Structured Output

  • Articles are parsed line by line
  • Preserves paragraph structure
  • Maintains text formatting
  • Counts total lines for analysis
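The line-by-line structure can be pictured with a small sketch. It assumes (the parser source is not shown here) that each non-empty paragraph becomes one entry keyed "line N", matching the line_content shape in the output; build_line_content is a hypothetical helper name, not part of OPAL.

```python
def build_line_content(paragraphs):
    """Map non-empty paragraphs to {"line 1": ..., "line 2": ...} plus a count.

    Assumption: blank paragraphs are skipped and numbering starts at 1.
    """
    line_content = {}
    for text in paragraphs:
        text = text.strip()
        if not text:
            continue
        line_content[f"line {len(line_content) + 1}"] = text
    return {"line_count": len(line_content), "line_content": line_content}


result = build_line_content(["First paragraph.", "  ", "Second paragraph."])
print(result["line_count"])
print(result["line_content"])
```

Keeping line_count next to line_content lets downstream code check that no entries were lost when the file is edited or merged.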

Data Fields Extracted

| Field | Description | Example |
|---|---|---|
| url | Article URL | "https://1819news.com/news/item/..." |
| title | Article title | "Alabama Legislature Passes New Bill" |
| author | Article author | "John Smith" |
| date | Publication date | "March 15, 2024" |
| line_count | Number of text lines | 45 |
| line_content | Article text by line | {"line 1": "First paragraph...", ...} |
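When consuming the output it can help to confirm that each record carries these fields. The following is a minimal sketch; the expected Python types are inferred from the examples above rather than from a published schema, and check_article is a hypothetical helper name.

```python
# Field names come from the table above; the types are an assumption
# based on the example values.
REQUIRED_FIELDS = {
    "url": str,
    "title": str,
    "author": str,
    "date": str,
    "line_count": int,
    "line_content": dict,
}


def check_article(article):
    """Return a list of problems found in one article record (empty if none)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in article:
            problems.append(f"missing field: {name}")
        elif not isinstance(article[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    return problems


sample = {
    "url": "https://1819news.com/news/item/example-article",
    "title": "Alabama Legislature Considers New Education Bill",
    "author": "Jane Doe",
    "date": "March 14, 2024",
    "line_count": 1,
    "line_content": {"line 1": "The Alabama Legislature is considering..."},
}
print(check_article(sample))
print(check_article({"url": sample["url"], "line_count": "1"}))
```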

Output Format

JSON Structure

{
  "success": true,
  "total_articles": 25,
  "timestamp": "2024-03-15T10:30:45",
  "site_info": {
    "base_url": "https://1819news.com/",
    "pages_processed": 5
  },
  "articles": [
    {
      "url": "https://1819news.com/news/item/example-article",
      "title": "Alabama Legislature Considers New Education Bill",
      "author": "Jane Doe",
      "date": "March 14, 2024",
      "line_count": 32,
      "line_content": {
        "line 1": "The Alabama Legislature is considering...",
        "line 2": "The bill, sponsored by Senator...",
        "line 3": "Education advocates say..."
      }
    }
  ]
}
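To work with an article as continuous text, the line_content entries can be joined back together. This sketch assumes the keys always have the form "line N" as in the structure above; it sorts by the numeric part so that "line 10" does not end up before "line 2" if the keys have been reordered. article_text is a hypothetical helper name.

```python
def article_text(article):
    """Join line_content values in numeric line order."""

    def line_number(key):
        # Keys look like "line 12"; use the trailing integer for ordering.
        return int(key.rsplit(" ", 1)[-1])

    lines = article["line_content"]
    return "\n".join(lines[key] for key in sorted(lines, key=line_number))


article = {
    "line_content": {"line 10": "tenth", "line 2": "second", "line 1": "first"}
}
print(article_text(article))
```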

Advanced Usage

Filtering by URL Pattern

# Only process articles with specific URL patterns
python -m opal --url "https://1819news.com/" --parser Parser1819 --suffix "/news/politics"

# Multiple patterns: run the parser once per suffix from your own wrapper script
# (the file name below is illustrative)
python script_with_multiple_suffixes.py
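A wrapper script for several suffixes might look like the sketch below. It only uses the command-line flags documented in this guide. The suffix list, the helper names (build_command, run_all), and the --run switch of the wrapper itself are illustrative assumptions. By default it prints the commands instead of executing them.

```python
import shlex
import subprocess
import sys

BASE_URL = "https://1819news.com/"
# Suffixes taken from the examples in this guide; adjust to your needs.
SUFFIXES = ["/news/item", "/news/politics"]


def build_command(suffix, max_pages=None):
    """Build the argument list for one parser run."""
    cmd = [sys.executable, "-m", "opal", "--url", BASE_URL,
           "--parser", "Parser1819", "--suffix", suffix]
    if max_pages is not None:
        cmd += ["--max_pages", str(max_pages)]
    return cmd


def run_all(suffixes, max_pages=None, runner=subprocess.run):
    """Run the parser once per suffix and return exit codes keyed by suffix."""
    codes = {}
    for suffix in suffixes:
        completed = runner(build_command(suffix, max_pages))
        codes[suffix] = completed.returncode
    return codes


if __name__ == "__main__":
    if "--run" in sys.argv[1:]:
        for suffix, code in run_all(SUFFIXES, max_pages=5).items():
            print(f"{suffix}: exit code {code}")
    else:
        # Dry run: show what would be executed.
        for suffix in SUFFIXES:
            print(shlex.join(build_command(suffix, max_pages=5)))
```

If the output file name depends only on the date and parser name, consecutive runs on the same day may overwrite each other, so consider renaming or moving each result before starting the next run.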

Processing Specific Sections

from opal.parser_module import Parser1819
from opal.integrated_parser import IntegratedParser

# Create parser for specific section
news_parser = IntegratedParser(Parser1819)

# Process politics section
politics_articles = news_parser.process_site(
    base_url="https://1819news.com/news/politics/",
    max_pages=10
)

Common Use Cases

Daily News Monitoring

#!/bin/bash
# Daily news scrape
DATE=$(date +%Y-%m-%d)
python -m opal --url "https://1819news.com/" --parser Parser1819 --max_pages 3
echo "News saved to ${DATE}_Parser1819.json"

Keyword Analysis

import json

# Load parsed articles
with open('2024-03-15_Parser1819.json', 'r') as f:
    data = json.load(f)

# Search for keywords
keyword = "court"
matching_articles = []

for article in data['articles']:
    content = ' '.join(article['line_content'].values()).lower()
    if keyword in content:
        matching_articles.append(article)

print(f"Found {len(matching_articles)} articles mentioning '{keyword}'")

Author Tracking

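# Track articles by specific authors
import json
from collections import Counter

# Load parsed articles (same file as in the keyword example)
with open('2024-03-15_Parser1819.json', 'r') as f:
    data = json.load(f)

author_counts = Counter(
    article['author']
    for article in data['articles']
)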

print("Articles by author:")
for author, count in author_counts.most_common():
    print(f"{author}: {count} articles")

Parser-Specific Behavior

HTML Structure Handling

  • Looks for a div with class author-date for metadata
  • Extracts all <p> tags for article content
  • Handles missing author/date gracefully

Error Handling

  • Returns "Unknown Author" if author not found
  • Returns "Unknown Date" if date not found
  • Continues processing even if some articles fail
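The behavior described in the two lists above can be approximated with the standard library alone. This is a sketch, not the parser's actual implementation: it assumes the first text node inside the author-date div is the author and the second is the date, which may not match the real page layout. ArticleSketch and extract are hypothetical names.

```python
from html.parser import HTMLParser


class ArticleSketch(HTMLParser):
    """Collect <p> text and the text nodes inside <div class="author-date">."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.meta = []
        self._meta_depth = 0   # > 0 while inside the author-date div
        self._p_buffer = None  # list of text chunks while inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self._meta_depth or "author-date" in classes:
                self._meta_depth += 1
        elif tag == "p":
            self._p_buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._meta_depth:
            self._meta_depth -= 1
        elif tag == "p" and self._p_buffer is not None:
            text = "".join(self._p_buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._p_buffer = None

    def handle_data(self, data):
        if self._p_buffer is not None:
            self._p_buffer.append(data)
        elif self._meta_depth and data.strip():
            self.meta.append(data.strip())


def extract(html):
    """Return author, date, and paragraphs, with defaults for missing metadata."""
    parser = ArticleSketch()
    parser.feed(html)
    parser.close()
    # Assumption: first text node is the author, second is the date.
    author = parser.meta[0] if len(parser.meta) > 0 else "Unknown Author"
    date = parser.meta[1] if len(parser.meta) > 1 else "Unknown Date"
    return {"author": author, "date": date, "paragraphs": parser.paragraphs}


page = (
    '<div class="author-date"><span>Jane Doe</span>'
    "<span>March 14, 2024</span></div>"
    "<p>First paragraph.</p><p>Second <b>paragraph</b>.</p>"
)
print(extract(page))
print(extract("<p>Only body text.</p>"))
```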

Troubleshooting

Common Issues

No Articles Found

  • Verify the base URL is correct
  • Check if website structure has changed
  • Ensure suffix parameter matches actual URLs

Missing Metadata

  • Some articles may not have author information
  • Date format may vary
  • Parser provides defaults for missing data

Incomplete Content

  • Check for JavaScript-rendered content
  • Some articles may require authentication
  • Verify network connectivity
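To see how often these issues occur in a given run, the saved output can be scanned for the documented placeholders ("Unknown Author", "Unknown Date") and for articles without text. A minimal sketch follows; metadata_gaps is a hypothetical helper name, and the input is assumed to follow the JSON structure shown under Output Format.

```python
def metadata_gaps(data):
    """List article URLs that carry default placeholders or no body text."""
    report = {"unknown_author": [], "unknown_date": [], "empty_content": []}
    for article in data.get("articles", []):
        url = article.get("url", "")
        if article.get("author") == "Unknown Author":
            report["unknown_author"].append(url)
        if article.get("date") == "Unknown Date":
            report["unknown_date"].append(url)
        if not article.get("line_content"):
            report["empty_content"].append(url)
    return report


example = {
    "articles": [
        {"url": "a", "author": "Jane Doe", "date": "March 14, 2024",
         "line_content": {"line 1": "Text"}},
        {"url": "b", "author": "Unknown Author", "date": "Unknown Date",
         "line_content": {}},
    ]
}
for issue, urls in metadata_gaps(example).items():
    print(f"{issue}: {len(urls)}")
```

A high share of empty_content entries would point toward JavaScript-rendered or gated pages rather than a metadata problem.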

Debug Tips

# Limit the run to a single page to inspect the output quickly
python -m opal --url "https://1819news.com/" --parser Parser1819 --max_pages 1

# Check specific article structure
curl -s "https://1819news.com/news/item/sample" | grep -E "(author-date|<p>)"

Performance Optimization

  1. Use --max_pages: Limit pages to process
  2. Filter with --suffix: Reduce unnecessary requests
  3. Batch Processing: Process sections separately
  4. Rate Limiting: Built-in delays prevent blocking
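When sections are processed separately, the per-section results usually need to be combined afterwards. The sketch below merges several parsed outputs and drops duplicate articles by URL, on the assumption that the same article can appear in more than one section. merge_results is a hypothetical helper name.

```python
def merge_results(*results):
    """Combine parsed outputs, keeping the first article seen per URL."""
    seen = set()
    merged = []
    for data in results:
        for article in data.get("articles", []):
            url = article.get("url")
            if url in seen:
                continue
            seen.add(url)
            merged.append(article)
    return {"total_articles": len(merged), "articles": merged}


politics = {"articles": [{"url": "u1", "title": "A"}, {"url": "u2", "title": "B"}]}
front_page = {"articles": [{"url": "u2", "title": "B"}, {"url": "u3", "title": "C"}]}
combined = merge_results(politics, front_page)
print(combined["total_articles"])
```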

Integration Examples

CSV Export

import json
import csv

# Load JSON data
with open('2024-03-15_Parser1819.json', 'r') as f:
    data = json.load(f)

# Export to CSV
with open('articles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'author', 'date', 'url'])
    writer.writeheader()

    for article in data['articles']:
        writer.writerow({
            'title': article['title'],
            'author': article['author'],
            'date': article['date'],
            'url': article['url']
        })

Next Steps