Working with Output Data¶
This tutorial shows you how to open, analyze, and work with the data OPAL produces. Whether you're new to data analysis or just getting started with JSON files, this guide will help you unlock the value in your scraped data.
Opening and Viewing Data Files¶
Opening JSON Files¶
Method 1: Web Browser (Easiest)
1. Find your OPAL output file (e.g., 2024-01-15_Parser1819.json)
2. Drag and drop it into Chrome, Firefox, or Safari
3. The browser will format it nicely for reading
Method 2: Text Editor
- Windows: Right-click → Open with → Notepad
- Mac: Right-click → Open with → TextEdit
- Better option: Use VS Code, Notepad++, or Sublime Text for syntax highlighting
Method 3: Online JSON Viewers
- Go to jsonviewer.stack.hu or jsonformatter.org
- Copy and paste your JSON content
- Get a formatted, collapsible view
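Prefer to stay offline? You can also pretty-print a file from the command line with Python's built-in json.tool module (`python -m json.tool yourfile.json`), or from a few lines of Python. A minimal sketch, using the example filename from above:

```python
# Pretty-print OPAL output without uploading it to a third-party site
import json

with open('2024-01-15_Parser1819.json', 'r') as f:
    data = json.load(f)

# indent=2 produces the same readable layout the online viewers show
print(json.dumps(data, indent=2))
```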
Opening CSV Files (Court Data)¶
Excel or Google Sheets:
1. Double-click the CSV file
2. It opens automatically with columns properly separated
3. Perfect for sorting, filtering, and creating charts
LibreOffice Calc (free alternative):
1. Open LibreOffice Calc
2. File → Open → Select your CSV file
3. Choose the appropriate delimiter (usually comma)
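If you'd rather skip the spreadsheet entirely, pandas (used throughout the analysis examples below) reads CSV files directly. A quick sketch; `court_cases.csv` is a placeholder name for whatever file OPAL gave you:

```python
import pandas as pd

# pandas assumes a comma delimiter by default (pass sep=';' or similar if your file differs)
cases = pd.read_csv('court_cases.csv')  # placeholder filename
print(cases.head())             # preview the first five rows
print(cases.columns.tolist())   # list the column names found
```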
Understanding JSON Structure¶
News Article Data Structure¶
```json
{
  "articles": [
    {
      "title": "Article headline here",
      "author": "Jane Smith",
      "date": "January 15, 2024",
      "line_count": 42,
      "content": "Full article text..."
    }
  ],
  "metadata": {
    "source": "https://1819news.com/",
    "parser": "Parser1819",
    "total_articles": 25,
    "scrape_date": "2024-01-15T10:30:45"
  }
}
```
Key Fields Explained:
- `articles`: Array containing all scraped articles
- `title`: Article headline
- `author`: Writer's name (may be "Unknown" if not found)
- `date`: Publication date
- `line_count`: Number of lines in the article content
- `content`: Full article text with line breaks preserved
- `metadata`: Information about the scraping process
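In Python these fields are ordinary dictionary keys. A minimal sketch of reading individual values, assuming the example file from above:

```python
import json

with open('2024-01-15_Parser1819.json', 'r') as f:
    data = json.load(f)

# 'articles' is a list of dicts, one per article
first = data['articles'][0]
print(first['title'], 'by', first['author'])

# 'metadata' describes the scrape itself
print(f"Scraped {data['metadata']['total_articles']} articles from {data['metadata']['source']}")
```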
Court Case Data Structure¶
```json
{
  "status": "success",
  "total_cases": 150,
  "cases": [
    {
      "court": "Court of Civil Appeals",
      "case_number": {
        "text": "CL-2024-0001",
        "link": "https://publicportal.alappeals.gov/portal/home/case/caseid/CL-2024-0001"
      },
      "case_title": "Smith v. Jones Construction Company",
      "classification": "Appeal",
      "filed_date": "01/10/2024",
      "status": "Pending"
    }
  ]
}
```
Key Fields Explained:
- `cases`: Array of all court cases found
- `court`: Which appeals court heard the case
- `case_number`: Case ID with a clickable link to full details
- `case_title`: Full case name (parties involved)
- `classification`: Type of legal proceeding
- `filed_date`: When the case was submitted
- `status`: Current case status
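Note that `case_number` is a nested object, so `pd.DataFrame(court_data['cases'])` leaves it as a column of dicts. A small sketch of flattening it into plain columns (the `case_id`/`case_link` names are just illustrative choices):

```python
import json
import pandas as pd

with open('court_cases.json', 'r') as f:
    court_data = json.load(f)

cases_df = pd.DataFrame(court_data['cases'])

# Each case_number cell is a dict like {"text": ..., "link": ...};
# pull the pieces into flat, easy-to-use columns
cases_df['case_id'] = cases_df['case_number'].apply(lambda c: c['text'])
cases_df['case_link'] = cases_df['case_number'].apply(lambda c: c['link'])

print(cases_df[['case_id', 'case_title', 'status']].head())
```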
Basic Data Analysis with Python¶
Installing Required Packages¶
```bash
# Install data analysis packages (in your virtual environment)
pip install pandas matplotlib jupyter
```
Loading and Exploring Data¶
```python
import json
import pandas as pd
from datetime import datetime

# Load news data
with open('2024-01-15_Parser1819.json', 'r') as f:
    news_data = json.load(f)

# Convert to a DataFrame for easier analysis
articles_df = pd.DataFrame(news_data['articles'])

# Parse dates so min/max reflect chronology rather than alphabetical order
articles_df['date'] = pd.to_datetime(articles_df['date'], errors='coerce')

# Basic exploration
print(f"Total articles: {len(articles_df)}")
print(f"Date range: {articles_df['date'].min()} to {articles_df['date'].max()}")
print(f"Authors: {articles_df['author'].nunique()} unique authors")

# Display the first few articles
print("\nFirst 3 articles:")
for _, article in articles_df.head(3).iterrows():
    print(f"- {article['title']} by {article['author']}")
```
Analyzing News Content¶
```python
import re
from collections import Counter

def analyze_news_content(articles_df):
    """Analyze news articles for common topics and trends"""
    # Combine all article text
    all_text = ' '.join(articles_df['title'] + ' ' + articles_df['content'])

    # Extract keywords (words of 4+ letters, excluding common words)
    stop_words = {'this', 'that', 'with', 'have', 'will', 'from', 'they', 'been',
                  'said', 'would', 'there', 'could', 'what', 'were', 'when'}
    words = re.findall(r'\b[a-zA-Z]{4,}\b', all_text.lower())
    keywords = [word for word in words if word not in stop_words]

    # Count the most common topics
    word_counts = Counter(keywords)

    print("=== NEWS CONTENT ANALYSIS ===")
    print(f"Total words analyzed: {len(keywords)}")
    print(f"Unique topics: {len(word_counts)}")
    print("\nTop 15 topics mentioned:")
    for word, count in word_counts.most_common(15):
        print(f"  {word}: {count} times")

    # Analyze by author
    print("\nMost prolific authors:")
    author_counts = articles_df['author'].value_counts().head(5)
    for author, count in author_counts.items():
        print(f"  {author}: {count} articles")

    return word_counts, author_counts

# Run the analysis
keywords, authors = analyze_news_content(articles_df)
```
Analyzing Court Data¶
```python
def analyze_court_data(json_file):
    """Analyze court case patterns"""
    # Load court data
    with open(json_file, 'r') as f:
        court_data = json.load(f)

    # Convert to a DataFrame
    cases_df = pd.DataFrame(court_data['cases'])

    print("=== COURT CASE ANALYSIS ===")
    print(f"Total cases: {len(cases_df)}")

    # Analyze by court
    print("\nCases by court:")
    court_counts = cases_df['court'].value_counts()
    for court, count in court_counts.items():
        print(f"  {court}: {count} cases")

    # Analyze by case type
    print("\nCases by classification:")
    classification_counts = cases_df['classification'].value_counts()
    for classification, count in classification_counts.items():
        print(f"  {classification}: {count} cases")

    # Analyze by status
    print("\nCases by status:")
    status_counts = cases_df['status'].value_counts()
    for status, count in status_counts.items():
        print(f"  {status}: {count} cases")

    # Date analysis (convert the filed_date strings to datetime)
    cases_df['filed_date'] = pd.to_datetime(cases_df['filed_date'], format='%m/%d/%Y', errors='coerce')
    print(f"\nDate range: {cases_df['filed_date'].min().strftime('%Y-%m-%d')} "
          f"to {cases_df['filed_date'].max().strftime('%Y-%m-%d')}")

    return cases_df

# Analyze court data
cases_df = analyze_court_data('court_cases.json')
```
Creating Visualizations¶
News Article Trends¶
```python
import matplotlib.pyplot as plt

def create_news_charts(articles_df):
    """Create charts from news data"""
    # Convert dates to datetime (a no-op if already converted above)
    articles_df['date'] = pd.to_datetime(articles_df['date'], errors='coerce')

    # Create subplots
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('News Article Analysis', fontsize=16)

    # 1. Articles per day
    daily_counts = articles_df['date'].dt.date.value_counts().sort_index()
    axes[0, 0].plot(daily_counts.index, daily_counts.values, marker='o')
    axes[0, 0].set_title('Articles per Day')
    axes[0, 0].set_xlabel('Date')
    axes[0, 0].set_ylabel('Number of Articles')
    axes[0, 0].tick_params(axis='x', rotation=45)

    # 2. Article length distribution
    axes[0, 1].hist(articles_df['line_count'], bins=20, alpha=0.7)
    axes[0, 1].set_title('Article Length Distribution')
    axes[0, 1].set_xlabel('Lines in Article')
    axes[0, 1].set_ylabel('Frequency')

    # 3. Top authors
    top_authors = articles_df['author'].value_counts().head(8)
    axes[1, 0].bar(range(len(top_authors)), top_authors.values)
    axes[1, 0].set_title('Most Prolific Authors')
    axes[1, 0].set_xlabel('Author')
    axes[1, 0].set_ylabel('Number of Articles')
    axes[1, 0].set_xticks(range(len(top_authors)))
    axes[1, 0].set_xticklabels(top_authors.index, rotation=45, ha='right')

    # 4. Word cloud of common topics (if wordcloud is installed)
    try:
        from wordcloud import WordCloud
        all_text = ' '.join(articles_df['title'])
        wordcloud = WordCloud(width=400, height=300, background_color='white').generate(all_text)
        axes[1, 1].imshow(wordcloud, interpolation='bilinear')
        axes[1, 1].set_title('Common Topics in Headlines')
        axes[1, 1].axis('off')
    except ImportError:
        # If wordcloud is not available, show article count by month instead
        monthly_counts = articles_df['date'].dt.to_period('M').value_counts().sort_index()
        axes[1, 1].bar(range(len(monthly_counts)), monthly_counts.values)
        axes[1, 1].set_title('Articles per Month')
        axes[1, 1].set_xlabel('Month')
        axes[1, 1].set_ylabel('Number of Articles')

    plt.tight_layout()
    plt.savefig('news_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()
    print("Charts saved as 'news_analysis.png'")

# Create charts
create_news_charts(articles_df)
```
Court Case Visualizations¶
```python
def create_court_charts(cases_df):
    """Create charts from court data"""
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Court Case Analysis', fontsize=16)

    # 1. Cases by court
    court_counts = cases_df['court'].value_counts()
    axes[0, 0].pie(court_counts.values, labels=court_counts.index, autopct='%1.1f%%')
    axes[0, 0].set_title('Cases by Court')

    # 2. Cases by classification
    classification_counts = cases_df['classification'].value_counts()
    axes[0, 1].bar(classification_counts.index, classification_counts.values)
    axes[0, 1].set_title('Cases by Type')
    axes[0, 1].set_xlabel('Classification')
    axes[0, 1].set_ylabel('Number of Cases')
    axes[0, 1].tick_params(axis='x', rotation=45)

    # 3. Cases over time (filed_date was converted to datetime by analyze_court_data)
    cases_df['filed_month'] = cases_df['filed_date'].dt.to_period('M')
    monthly_cases = cases_df['filed_month'].value_counts().sort_index()
    axes[1, 0].plot(range(len(monthly_cases)), monthly_cases.values, marker='o')
    axes[1, 0].set_title('Cases Filed Over Time')
    axes[1, 0].set_xlabel('Month')
    axes[1, 0].set_ylabel('Number of Cases')

    # 4. Case status
    status_counts = cases_df['status'].value_counts()
    axes[1, 1].bar(status_counts.index, status_counts.values)
    axes[1, 1].set_title('Cases by Status')
    axes[1, 1].set_xlabel('Status')
    axes[1, 1].set_ylabel('Number of Cases')

    plt.tight_layout()
    plt.savefig('court_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()
    print("Charts saved as 'court_analysis.png'")

# Create court charts
create_court_charts(cases_df)
```
Data Export and Conversion¶
Convert JSON to CSV¶
```python
def json_to_csv(json_file, csv_file):
    """Convert news JSON to CSV format"""
    with open(json_file, 'r') as f:
        data = json.load(f)

    # Convert articles to a DataFrame
    df = pd.DataFrame(data['articles'])

    # Save to CSV
    df.to_csv(csv_file, index=False)
    print(f"Converted {len(df)} articles to {csv_file}")

# Convert your news data
json_to_csv('2024-01-15_Parser1819.json', 'news_articles.csv')
```
Create Summary Reports¶
```python
def create_summary_report(articles_df, output_file):
    """Create a summary report of news data"""
    with open(output_file, 'w') as f:
        f.write("=== NEWS SCRAPING SUMMARY REPORT ===\n\n")
        f.write(f"Total articles collected: {len(articles_df)}\n")
        f.write(f"Date range: {articles_df['date'].min()} to {articles_df['date'].max()}\n")
        f.write(f"Unique authors: {articles_df['author'].nunique()}\n")
        f.write(f"Average article length: {articles_df['line_count'].mean():.1f} lines\n\n")

        f.write("TOP AUTHORS:\n")
        top_authors = articles_df['author'].value_counts().head(5)
        for author, count in top_authors.items():
            f.write(f"  {author}: {count} articles\n")

        f.write("\nLONGEST ARTICLES:\n")
        longest = articles_df.nlargest(3, 'line_count')
        for _, article in longest.iterrows():
            f.write(f"  {article['title'][:60]}... ({article['line_count']} lines)\n")

        f.write("\nMOST RECENT ARTICLES:\n")
        recent = articles_df.nlargest(5, 'date')
        for _, article in recent.iterrows():
            f.write(f"  {article['date']}: {article['title'][:60]}...\n")

    print(f"Summary report saved to {output_file}")

# Create summary
create_summary_report(articles_df, 'news_summary.txt')
```
Advanced Data Analysis¶
Text Analysis and Sentiment¶
```python
# Install required packages first: pip install textblob
from textblob import TextBlob

def analyze_sentiment(articles_df):
    """Analyze the sentiment of news headlines"""
    sentiments = []

    for _, article in articles_df.iterrows():
        # Analyze title sentiment
        blob = TextBlob(article['title'])
        sentiment = blob.sentiment.polarity  # -1 (negative) to 1 (positive)

        sentiments.append({
            'title': article['title'][:50] + '...',
            'sentiment': sentiment,
            'sentiment_label': 'Positive' if sentiment > 0.1 else 'Negative' if sentiment < -0.1 else 'Neutral'
        })

    sentiment_df = pd.DataFrame(sentiments)

    print("=== SENTIMENT ANALYSIS ===")
    sentiment_counts = sentiment_df['sentiment_label'].value_counts()
    for label, count in sentiment_counts.items():
        print(f"{label}: {count} articles ({count / len(sentiment_df) * 100:.1f}%)")

    # Show the most positive and negative headlines
    print("\nMost positive headlines:")
    positive = sentiment_df.nlargest(3, 'sentiment')
    for _, row in positive.iterrows():
        print(f"  {row['title']} (score: {row['sentiment']:.2f})")

    print("\nMost negative headlines:")
    negative = sentiment_df.nsmallest(3, 'sentiment')
    for _, row in negative.iterrows():
        print(f"  {row['title']} (score: {row['sentiment']:.2f})")

    return sentiment_df

# Analyze sentiment
sentiment_results = analyze_sentiment(articles_df)
```
Topic Modeling¶
```python
# Install required packages: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def find_topics(articles_df, n_topics=5):
    """Find common topics in articles using clustering"""
    # Combine the title and the first 200 characters of content for analysis
    texts = [f"{row['title']} {row['content'][:200]}" for _, row in articles_df.iterrows()]

    # Convert text to numerical TF-IDF features
    vectorizer = TfidfVectorizer(max_features=100, stop_words='english')
    text_vectors = vectorizer.fit_transform(texts)

    # Find topics using clustering
    kmeans = KMeans(n_clusters=n_topics, random_state=42)
    clusters = kmeans.fit_predict(text_vectors)

    # Get the top words for each topic
    feature_names = vectorizer.get_feature_names_out()

    print("=== TOPIC ANALYSIS ===")
    for i in range(n_topics):
        top_words_idx = kmeans.cluster_centers_[i].argsort()[-10:][::-1]
        top_words = [feature_names[idx] for idx in top_words_idx]
        print(f"Topic {i + 1}: {', '.join(top_words)}")

        # Show example articles from this topic
        topic_articles = articles_df[clusters == i]
        print(f"  Example articles ({len(topic_articles)} total):")
        for _, article in topic_articles.head(2).iterrows():
            print(f"    - {article['title'][:60]}...")
        print()

# Find topics
find_topics(articles_df)
```
Working with Multiple Data Sources¶
Combining News Sources¶
```python
def combine_news_sources(file1, file2, output_file):
    """Combine data from multiple news sources"""
    # Load both files
    with open(file1, 'r') as f:
        data1 = json.load(f)
    with open(file2, 'r') as f:
        data2 = json.load(f)

    # Combine the article lists
    all_articles = data1['articles'] + data2['articles']

    # Create the combined dataset
    combined_data = {
        'articles': all_articles,
        'metadata': {
            'combined_from': [file1, file2],
            'total_articles': len(all_articles),
            'source1_count': len(data1['articles']),
            'source2_count': len(data2['articles']),
            'combined_date': datetime.now().isoformat()
        }
    }

    # Save the combined data
    with open(output_file, 'w') as f:
        json.dump(combined_data, f, indent=2)

    print(f"Combined {len(all_articles)} articles from 2 sources")
    print(f"Source 1: {len(data1['articles'])} articles")
    print(f"Source 2: {len(data2['articles'])} articles")
    print(f"Saved to: {output_file}")

# Combine multiple sources
combine_news_sources('1819_news.json', 'daily_news.json', 'combined_news.json')
```
Cross-Source Analysis¶
```python
def compare_news_sources(file1, file2, source1_name, source2_name):
    """Compare coverage between two news sources"""
    # Load and prepare the data
    with open(file1, 'r') as f:
        data1 = json.load(f)
    with open(file2, 'r') as f:
        data2 = json.load(f)

    df1 = pd.DataFrame(data1['articles'])
    df2 = pd.DataFrame(data2['articles'])

    print(f"=== COMPARING {source1_name.upper()} vs {source2_name.upper()} ===")

    # Basic stats
    print(f"{source1_name}: {len(df1)} articles")
    print(f"{source2_name}: {len(df2)} articles")

    # Average article length
    print("\nAverage article length:")
    print(f"  {source1_name}: {df1['line_count'].mean():.1f} lines")
    print(f"  {source2_name}: {df2['line_count'].mean():.1f} lines")

    # Most active authors
    print("\nMost active authors:")
    print(f"  {source1_name}: {df1['author'].value_counts().index[0]} ({df1['author'].value_counts().iloc[0]} articles)")
    print(f"  {source2_name}: {df2['author'].value_counts().index[0]} ({df2['author'].value_counts().iloc[0]} articles)")

    # Find common topics
    def get_top_words(df, n=10):
        all_text = ' '.join(df['title'] + ' ' + df['content'])
        words = re.findall(r'\b[a-zA-Z]{4,}\b', all_text.lower())
        return Counter(words).most_common(n)

    words1 = dict(get_top_words(df1))
    words2 = dict(get_top_words(df2))
    common_topics = set(words1.keys()) & set(words2.keys())

    print(f"\nCommon topics covered: {len(common_topics)}")
    for topic in sorted(common_topics)[:5]:
        print(f"  {topic}: {source1_name} ({words1[topic]}) vs {source2_name} ({words2[topic]})")

# Compare sources
compare_news_sources('1819_news.json', 'daily_news.json', '1819 News', 'Alabama Daily News')
```
Data Integration with Other Tools¶
Export to Database¶
```python
import sqlite3

def save_to_database(json_file, db_file):
    """Save news data to a SQLite database"""
    # Load the JSON data
    with open(json_file, 'r') as f:
        data = json.load(f)

    # Create the database connection
    conn = sqlite3.connect(db_file)

    # Create the table if it doesn't exist yet
    conn.execute('''
        CREATE TABLE IF NOT EXISTS articles (
            id INTEGER PRIMARY KEY,
            title TEXT,
            author TEXT,
            date TEXT,
            line_count INTEGER,
            content TEXT,
            source TEXT
        )
    ''')

    # Insert the articles
    for article in data['articles']:
        conn.execute('''
            INSERT INTO articles (title, author, date, line_count, content, source)
            VALUES (?, ?, ?, ?, ?, ?)
        ''', (
            article['title'],
            article['author'],
            article['date'],
            article['line_count'],
            article['content'],
            data['metadata']['parser']
        ))

    conn.commit()
    conn.close()
    print(f"Saved {len(data['articles'])} articles to database: {db_file}")

# Save to database
save_to_database('news_data.json', 'news.db')
```
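Once the articles are in SQLite, plain SQL answers questions directly. A quick sanity check against the news.db file created above:

```python
import sqlite3

conn = sqlite3.connect('news.db')

# Count articles per author, most prolific first
rows = conn.execute('''
    SELECT author, COUNT(*) AS article_count
    FROM articles
    GROUP BY author
    ORDER BY article_count DESC
    LIMIT 5
''').fetchall()
conn.close()

for author, count in rows:
    print(f"{author}: {count} articles")
```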
Export for Excel Analysis¶
```python
def create_excel_report(articles_df, filename):
    """Create an Excel file with multiple sheets for analysis"""
    # Requires openpyxl: pip install openpyxl
    with pd.ExcelWriter(filename, engine='openpyxl') as writer:
        # Main data
        articles_df.to_excel(writer, sheet_name='Articles', index=False)

        # Summary statistics
        summary_stats = pd.DataFrame({
            'Metric': ['Total Articles', 'Unique Authors', 'Avg Length', 'Date Range'],
            'Value': [
                len(articles_df),
                articles_df['author'].nunique(),
                f"{articles_df['line_count'].mean():.1f} lines",
                f"{articles_df['date'].min()} to {articles_df['date'].max()}"
            ]
        })
        summary_stats.to_excel(writer, sheet_name='Summary', index=False)

        # Top authors
        top_authors = articles_df['author'].value_counts().head(10).reset_index()
        top_authors.columns = ['Author', 'Article Count']
        top_authors.to_excel(writer, sheet_name='Top Authors', index=False)

        # Articles by date
        daily_counts = articles_df['date'].value_counts().sort_index().reset_index()
        daily_counts.columns = ['Date', 'Article Count']
        daily_counts.to_excel(writer, sheet_name='Daily Counts', index=False)

    print(f"Excel report saved as: {filename}")

# Create Excel report
create_excel_report(articles_df, 'news_analysis.xlsx')
```
Next Steps¶
Now that you know how to work with OPAL's output data:
- Automate Analysis: Create scripts that automatically analyze new data as you collect it
- Build Dashboards: Use tools like Streamlit or Dash to create interactive data dashboards (see the sketch after this list)
- Set Up Monitoring: Track trends over time by regularly collecting and analyzing data
- Share Insights: Export charts and summaries to share your findings
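To give you a taste of the dashboard idea, here is a minimal Streamlit sketch (assumes `pip install streamlit` and the news JSON format shown earlier; save it as dashboard.py and run `streamlit run dashboard.py`):

```python
# dashboard.py - a minimal interactive view of OPAL news data
import json

import pandas as pd
import streamlit as st

st.title('OPAL News Dashboard')

# Load the scraped articles (filename from the earlier examples)
with open('2024-01-15_Parser1819.json', 'r') as f:
    data = json.load(f)
articles_df = pd.DataFrame(data['articles'])

# Headline number plus two simple interactive views
st.metric('Total articles', len(articles_df))
st.bar_chart(articles_df['author'].value_counts().head(10))
st.dataframe(articles_df[['title', 'author', 'date']])
```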
For collecting more data, see Common Use Cases.
For troubleshooting data issues, see Understanding Errors.