NICAR 2025 Web Scraping Workshop

This one-hour workshop was presented at the NICAR 2025 data journalism conference on Saturday, March 8th, 2025, at 11:30am. Interested in new and effective advanced web scraping techniques? Join this session to learn how to extract valuable data from websites using a range of tools and techniques.

Prerequisites

This workshop is designed for those with basic programming skills and some familiarity with web development. No prior web scraping experience is required, but a basic understanding of HTML and CSS is recommended.

Agenda

  • Video Scraping using Google AI Studio
    This technique is ideal for scraping data from websites that don't want to be scraped. We'll explore how to use Google's AI Studio tool to turn a video of your screen into JSON data.

  • LLM Scraping
    We'll discuss the pricing and limitations of using Large Language Models (LLMs) for web scraping, including GPT-4o mini and Gemini 2.0 Pro.

  • PDF Scraping
    We'll explore how to use LLMs to extract structured data from PDF files, including the FEMA Daily Operations Briefing.

  • Shot-Scraper Technique
    This technique is a reliable method for extracting structured data from websites, especially those with complex content like PDFs.
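
Since pricing is one of the agenda topics, a back-of-the-envelope cost estimate can be useful before sending a whole site to a model. The following is a minimal sketch, not part of the workshop materials: the function names are hypothetical, the four-characters-per-token ratio is only a rough heuristic for English text, and the price used in the example is a placeholder rather than any provider's actual rate.

```python
import math

# Assumption: roughly 4 characters per token. This is a heuristic, not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a string from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost_usd(pages: list[str], usd_per_million_input_tokens: float) -> float:
    """Estimate the input cost of sending every page to a model once."""
    total_tokens = sum(estimate_tokens(page) for page in pages)
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


if __name__ == "__main__":
    # Hypothetical workload: 200 scraped pages of about 8,000 characters each.
    sample_pages = ["x" * 8_000] * 200
    # Placeholder price for illustration only; check the provider's current price list.
    cost = estimate_cost_usd(sample_pages, usd_per_million_input_tokens=0.50)
    print(f"Estimated input cost: ${cost:.4f}")
```

The estimate covers input tokens only. Output tokens, images, and video frames are typically billed separately, so treat the result as a lower bound.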

Taking Away the Barrier: Collaborating on Tools and Resources

We are building out a suite of tools to help implement these patterns in your newsroom. We'd love to learn what you need from us and how we can collaborate in the future.

Join our community at Google's email provider

Getting Started with Web Scraping

To get started with web scraping, you'll need to install Python and set up a virtual environment. We recommend using pipenv or conda to manage the environment.

      
        # Install pipenv
        pip install --user pipenv

        # Create a new virtual environment (replace 3.x with your Python version)
        pipenv --python 3.x

        # Activate the environment in a new shell
        pipenv shell
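
Once the environment is active, it can help to confirm that Python is actually running inside it. The following is a minimal sketch using only the standard library; the function names are hypothetical. It relies on `sys.prefix` differing from `sys.base_prefix`, which holds for venv and virtualenv environments (pipenv creates the latter) but not for conda environments, so conda users should not rely on this check.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when running inside a venv/virtualenv-style environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def describe_environment() -> str:
    """Summarize the interpreter version and virtual environment status."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    status = "inside" if in_virtualenv() else "outside"
    return f"Python {version}, running {status} a virtual environment"


if __name__ == "__main__":
    print(describe_environment())
```

Run it with the interpreter from the activated shell. If it reports "outside", the shell is probably still using the system Python.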

Additional Resources