Parser1819 API Reference

Parser for the 1819 News website.

Class Definition

class Parser1819(BaseParser):
    def __init__(self, url="https://1819news.com/", suffix="/news/item", max_pages=5)

Constructor Parameters

Parameter   Type  Default                   Description
url         str   "https://1819news.com/"   Base URL for 1819 News
suffix      str   "/news/item"              URL suffix for article links
max_pages   int   5                         Maximum pages to scrape

Methods

extract_article_data()

def extract_article_data(self) -> List[Dict]
Extracts articles from multiple pages of 1819 News.

Returns: List of article dictionaries with the following keys:

  • title: Article headline
  • content: Full article text
  • date: Publication date
  • author: Article author
  • url: Article URL
  • tags: List of article tags
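
As a rough mental model, extract_article_data() composes the two methods documented below. The sketch here assumes a ?page= query parameter for pagination, which this reference does not confirm:

from typing import Dict, List
from opal.Parser1819 import Parser1819

parser = Parser1819()
articles: List[Dict] = []
for page in range(1, parser.max_pages + 1):
    # The ?page= scheme is a guess; the real pagination format may differ.
    page_url = f"{parser.url}?page={page}"
    for link in parser.get_article_links(page_url):
        articles.append(parser.parse_article(link))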

get_article_links()

def get_article_links(self, page_url: str) -> List[str]
Extracts article URLs from an 1819 News page.

Parameters:

  • page_url: URL of the page to scrape

Returns: List of article URLs matching the suffix pattern
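
For illustration, an equivalent link extractor can be written with requests and BeautifulSoup. This is a sketch of the idea, not the parser's actual code; the suffix-in-href test and the deduplication step are assumptions:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_article_links_sketch(page_url: str, suffix: str = "/news/item") -> list:
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        # Resolve relative hrefs against the page URL, then filter by suffix.
        href = urljoin(page_url, a["href"])
        if suffix in href:
            links.append(href)
    # Deduplicate while preserving document order.
    return list(dict.fromkeys(links))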

parse_article()

def parse_article(self, article_url: str) -> Dict
Parses an individual 1819 News article.

Parameters:

  • article_url: URL of the article

Returns: Dictionary with article data
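
A hypothetical re-implementation shows the shape of the returned dictionary. Every selector here (h1, time, the author and tag class names) is a guess, since the reference does not document the page structure:

import requests
from bs4 import BeautifulSoup

def parse_article_sketch(article_url: str) -> dict:
    soup = BeautifulSoup(requests.get(article_url, timeout=10).text, "html.parser")

    def text_or_none(tag):
        # Guard against find() returning None for missing elements.
        return tag.get_text(strip=True) if tag else None

    return {
        "title": text_or_none(soup.find("h1")),
        "content": " ".join(p.get_text(strip=True) for p in soup.find_all("p")),
        "date": text_or_none(soup.find("time")),
        "author": text_or_none(soup.find(class_="author")),  # hypothetical class name
        "url": article_url,
        "tags": [t.get_text(strip=True) for t in soup.find_all(class_="tag")],  # hypothetical
    }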

Usage Example

from opal.Parser1819 import Parser1819

# Create parser instance
parser = Parser1819(
    url="https://1819news.com/",
    suffix="/news/item",
    max_pages=10
)

# Extract articles
articles = parser.extract_article_data()

# Save to JSON
parser.save_to_json(articles, "1819_news_articles.json")

# Access article data
for article in articles:
    print(f"Title: {article['title']}")
    print(f"Date: {article['date']}")
    print(f"Author: {article['author']}")

Output Format

{
  "title": "Alabama Legislature Passes New Education Bill",
  "content": "Full article text...",
  "date": "2024-01-15",
  "author": "John Smith",
  "url": "https://1819news.com/news/item/education-bill-2024",
  "tags": ["education", "legislature", "alabama"]
}
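
Assuming save_to_json writes the articles as a plain JSON array of records like the one above, the file can be reloaded with the standard library:

import json

with open("1819_news_articles.json", encoding="utf-8") as f:
    articles = json.load(f)

print(f"Loaded {len(articles)} articles")
print(articles[0]["title"])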

Notes

  • Uses BeautifulSoup for HTML parsing
  • Handles pagination automatically
  • Filters links by suffix to ensure only articles are scraped
  • Includes error handling for missing elements
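
The last note most likely refers to the standard BeautifulSoup pattern of guarding against find() returning None for absent elements; a minimal, self-contained sketch of that idea:

from bs4 import BeautifulSoup

html = "<article><h1>Headline</h1></article>"  # author element deliberately missing
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.find("h1")
author_tag = soup.find(class_="author")

# find() returns None when an element is absent, so check before calling get_text().
title = title_tag.get_text(strip=True) if title_tag else "Unknown"
author = author_tag.get_text(strip=True) if author_tag else "Unknown"

print(title)   # Headline
print(author)  # Unknown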