# User-Agent Headers Guide

## What is a User-Agent?
User-Agent headers are strings that identify the client (browser, bot, or application) making an HTTP request to a web server.
It's an HTTP header that tells the server:

- What software is making the request
- What version it is
- What operating system it's running on
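The header travels with every request. A simplified raw HTTP request looks like this (the host and path are placeholders; the User-Agent is the Chrome string shown below):

```
GET /page HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Accept: text/html
```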
## Examples of User-Agent Strings

### Chrome Browser

```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
```
### Python Requests (default)
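By default, the requests library identifies itself with its package name and version; the exact version depends on your installed release:

```
python-requests/2.31.0
```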
### Googlebot
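Google's crawler announces itself and links to documentation about the bot:

```
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
```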
### Firefox Browser
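A typical desktop Firefox string (version numbers vary by release):

```
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0
```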
### Safari Browser

```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15
```
## Why User-Agents Matter
- Server Behavior: Websites may serve different content based on User-Agent
- Access Control: Some sites block requests with suspicious or missing User-Agents
- Analytics: Helps websites understand their traffic
- Bot Detection: Sites use it to identify and potentially block scrapers
- Content Optimization: Sites may serve mobile vs desktop versions
## Setting User-Agent in Python

### Basic Example

```python
import requests

# Without User-Agent (might be blocked)
response = requests.get('https://example.com')

# With User-Agent (appears as a browser)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get('https://example.com', headers=headers)
```
### Advanced Example with Multiple Headers

```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

response = requests.get('https://example.com', headers=headers)
```
## Common User-Agent Patterns

### Desktop Browsers

```python
# Windows Chrome
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'

# macOS Safari
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15'

# Linux Firefox
'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0'
```

### Mobile Browsers

```python
# iPhone Safari
'Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1'

# Android Chrome
'Mozilla/5.0 (Linux; Android 11; SM-G991B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.120 Mobile Safari/537.36'
```

### Custom Bot (Honest Approach)

```python
'OPAL-Bot/1.0 (+https://github.com/yourusername/opal)'
'MyCompany-Scraper/2.1 (contact@mycompany.com)'
```
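A minimal sketch of the honest approach in practice, using a requests.Session so every request carries the same transparent identity (the bot name, version, and URL are the placeholders from above):

```python
import requests

session = requests.Session()
# Transparent bot identity: name/version plus a way to reach you.
session.headers.update({
    'User-Agent': 'OPAL-Bot/1.0 (+https://github.com/yourusername/opal)',
})

response = session.get('https://example.com')
print(response.status_code)
```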
## Implementing User-Agents in OPAL

### Enhanced BaseParser

```python
import requests
from typing import List, Tuple

def make_request(self, urls: List[str]) -> Tuple[List[str], List[str]]:
    """Shared request functionality for all parsers with proper headers"""
    # Realistic browser headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
    }

    responses = []
    successful_urls = []
    for url in urls:
        try:
            print(f"Requesting: {url}")
            response = requests.get(url, headers=headers, timeout=5)
            response.raise_for_status()
            responses.append(response.text)
            successful_urls.append(url)
        except requests.exceptions.RequestException:
            print(f"Skipping URL due to error: {url}")
            continue

    return responses, successful_urls
```
### User-Agent Rotation

```python
import random

class RotatingUserAgentParser(BaseParser):
    def __init__(self):
        super().__init__()
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0',
        ]

    def get_random_user_agent(self):
        return random.choice(self.user_agents)

    def make_request(self, urls):
        # Use a different User-Agent for each request
        headers = {'User-Agent': self.get_random_user_agent()}
        # ... rest of request logic
```
## Best Practices

### 1. Be Strategic

- Use real User-Agents: Copy from actual browsers
- Stay current: Browser versions change frequently
- Match behavior: If you claim to be Chrome, act like Chrome
### 2. Be Respectful

- Respect robots.txt: Even with a browser User-Agent (see the sketch after this list)
- Rate limit: Don't overwhelm servers
- Be honest when possible: Some sites appreciate transparent bots
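A minimal sketch of both habits, combining the standard library's robotparser with a fixed delay (the URLs, bot name, and one-second delay are illustrative choices):

```python
import time
from urllib import robotparser

import requests

user_agent = 'MyCompany-Scraper/2.1 (contact@mycompany.com)'

# Check robots.txt before fetching anything
rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

for url in ['https://example.com/page1', 'https://example.com/page2']:
    if not rp.can_fetch(user_agent, url):
        print(f"robots.txt disallows: {url}")
        continue
    response = requests.get(url, headers={'User-Agent': user_agent}, timeout=5)
    time.sleep(1)  # simple rate limit: at most one request per second
```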
### 3. Be Consistent

- Use complete headers: Include Accept, Accept-Language, etc.
- Maintain session: Use the same User-Agent throughout a session (see the session sketch after this list)
- Handle responses: Check if the site is behaving differently
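One way to keep the identity consistent is a single requests.Session whose default headers apply to every call (a sketch; the URLs are placeholders):

```python
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
})

# Every request in this session sends the same identity and reuses cookies.
login_page = session.get('https://example.com/login')
data_page = session.get('https://example.com/data')
```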
### 4. Be Prepared

- Rotate User-Agents: Avoid detection patterns
- Handle blocks: Have fallback strategies
- Monitor changes: Sites may update their detection methods
## User-Agent Detection Techniques

Websites can detect fake User-Agents by:

- Header Analysis: Checking if browser behavior matches the User-Agent
- Missing Headers: Looking for headers real browsers always send (both illustrated in the sketch after this list)
- JavaScript Testing: Testing browser capabilities that match the claimed version
- Request Patterns: Analyzing timing and request sequences
- Feature Detection: Checking for browser-specific features
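As a toy illustration of the first two checks (not any real site's detector; the heuristic is deliberately simple):

```python
def looks_suspicious(headers: dict) -> bool:
    """Toy detector: flag a missing UA, or a 'Chrome' UA without headers real browsers send."""
    ua = headers.get('User-Agent', '')
    claims_chrome = 'Chrome/' in ua
    missing_basics = 'Accept' not in headers or 'Accept-Language' not in headers
    return not ua or (claims_chrome and missing_basics)

print(looks_suspicious({}))                             # True: no User-Agent at all
print(looks_suspicious({'User-Agent': 'Chrome/91.0'}))  # True: Chrome claim, no Accept headers
```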
## Common Mistakes

### 1. Outdated User-Agents

```python
# Bad - very old browser version
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 Chrome/45.0.2454.85'

# Good - recent browser version
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/91.0.4472.124'
```
### 2. Inconsistent Headers

```python
# Bad - claims to be Chrome but uses a Firefox-style Accept header
headers = {
    'User-Agent': 'Chrome/91.0.4472.124',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',  # Firefox style
}
```
### 3. Missing Common Headers

```python
# Bad - only User-Agent
headers = {'User-Agent': 'Mozilla/5.0...'}

# Good - realistic browser headers
headers = {
    'User-Agent': 'Mozilla/5.0...',
    'Accept': 'text/html,application/xhtml+xml...',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
}
```
## Testing User-Agents

### Check What You're Sending

```python
import requests

# Echo back the headers you're actually sending
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json())
```
### Verify Server Response

```python
import requests

url = 'https://example.com'
browser_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

# Check if the site is treating you differently
response_bot = requests.get(url)  # default python-requests User-Agent
response_browser = requests.get(url, headers=browser_headers)

if response_bot.content != response_browser.content:
    print("Site serves different content based on User-Agent")
```
## Tools and Resources
- Browser DevTools: Copy real User-Agent strings from Network tab
- User-Agent Databases: Sites like whatismybrowser.com
- Header Checkers: Use httpbin.org to test your headers
- Browser Testing: Use Selenium to see what real browsers send
## Conclusion
User-Agent headers are a crucial part of web scraping that can mean the difference between successful data extraction and being blocked. Use them thoughtfully and responsibly to build robust scrapers that respect both the technical and ethical aspects of web crawling.