Truckin' Through Time: Building a Grateful Dead Setlist Database
Introduction
Using Python, I constructed a Grateful Dead setlist archive (1970s-1990s) from abandoned webpages. This post covers my first experience writing a web scraper to aggregate the Grateful Dead setlist data from Mark Leone’s old Carnegie Mellon website.
Mark was, I believe, a researcher at CMU. I didn’t delve deeply into Mark’s official role or position. My apologies to Mark if he happens to be reading this! Nonetheless, I appreciate his efforts in aggregating this data; it greatly contributed to the project.
Tools and Technologies Used:
- Python
- BeautifulSoup
- SQLModel
- Ollama (Llama3-8B)
- SQLite
You can find all the code and the SQLite database containing the data I scraped on my GitHub.
The Inspiration
A few days ago, I was out on a walk, brainstorming the kind of dataset I wanted to use to learn more about data processing libraries in Python. Then, “Uncle John’s Band” by the Grateful Dead shuffled on, sparking the idea to explore a Grateful Dead dataset. While I wouldn’t call myself a Dead Head 💀, the band’s extensive touring history and tendency to record their shows always intrigued me.
I began diving into the depths of Google, searching for Grateful Dead setlist databases, and encountered long-forgotten websites from the 1990s and early 2000s that enthusiasts had spent considerable time putting together. One such site, created by Mark Leone, a researcher at Carnegie Mellon University, stood out to me because it looked fairly easy to scrape. While I’m sure there are beautifully curated and well-structured datasets out there with the same information, I embarked on this journey of writing a web scraper to create my own database from Mark Leone’s old site. The site, still hosted under Carnegie Mellon’s Computer Science Department, includes a bunch of random Grateful Dead material, including setlists from Grateful Dead tours from 1972-1995.
Vintage Grateful Dead Mac Icons obtained from Mark Leone’s old Carnegie Mellon website. I spent a little too much time looking at all the items in the Apache Directory listing… 😅
Project Overview
Before we dive into the technical details, here’s a flowchart of how my simple program works to create this Grateful Dead setlist database:
Flowchart outlining the main steps of the Grateful Dead setlist database creation process
This flowchart outlines the main steps of the program I wrote. The key stages of this process are:
- Data Extraction with Web Scraping: Gathering raw setlist data from abandoned webpages.
- Data Processing/Parsing: Using Ollama (Llama3) to structure the raw data into a usable format.
- Persisting the Data: Storing the processed information in a SQLite database.
The rest of this post details each of these steps. This approach allowed me to transform unstructured web data into a fairly comprehensive, queryable database of Grateful Dead performances.
Web Scraping
Web scraping is a general technique used to extract information from websites. In this project, I scraped setlist data from Mark Leone’s CMU website, which archived concert setlists of the Grateful Dead. The goal was to compile a comprehensive and structured dataset of these setlists for analysis.
Why Web Scraping?
The data on the website was not available in a format suitable for analysis. By scraping the website, I transformed this rich, unstructured data into a structured format.
The Information Architecture
The data on Mark Leone’s CMU website was organized with a setlist “home” page linking to concerts by year. Each year’s page included links to text files with the setlists.
The “home” page with links to concerts from each specific year.
The concert list for 1989, with links to setlists for each concert.
The text files were mostly structured, though not always consistently - more on how I handled that later in this post.
The setlist for a show at Henry J Kaiser Convention Center in Oakland, CA on 2/5/89.
Using BeautifulSoup
To scrape the setlist data, I used Python’s BeautifulSoup library to parse the HTML of the webpages. The code iterates through the “home” page, fetching links for each touring year. For each year, it retrieves information on individual concerts and their setlists, storing everything in structured objects.
Here’s a code snippet containing the bulk of my BeautifulSoup usage:
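(What follows is a simplified sketch rather than the verbatim code from my repo; the base URL and the exact Pydantic fields are assumptions.)

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from pydantic import BaseModel

# Placeholder URL -- the real setlist "home" page lives under Mark Leone's
# old CMU pages; adjust to the actual path.
BASE_URL = "https://www.cs.cmu.edu/~mleone/gdead/setlists.html"


class ConcertData(BaseModel):
    title: str        # link text, e.g. "Henry J Kaiser Convention Center (2/5/89)"
    raw_setlist: str  # contents of the linked plain-text setlist file


class TouringYear(BaseModel):
    year: str
    concerts: list[ConcertData] = []


def scrape_setlists() -> list[TouringYear]:
    home = BeautifulSoup(requests.get(BASE_URL).text, "html.parser")
    touring_years = []
    # The "home" page links out to one page per touring year.
    # (Filtering of non-year links is omitted for brevity.)
    for year_link in home.find_all("a", href=True):
        year_url = urljoin(BASE_URL, year_link["href"])
        year_page = BeautifulSoup(requests.get(year_url).text, "html.parser")
        touring_year = TouringYear(year=year_link.get_text(strip=True))
        # Each year page links to one setlist text file per concert.
        for concert_link in year_page.find_all("a", href=True):
            setlist_url = urljoin(year_url, concert_link["href"])
            touring_year.concerts.append(
                ConcertData(
                    title=concert_link.get_text(strip=True),
                    raw_setlist=requests.get(setlist_url).text,
                )
            )
        touring_years.append(touring_year)
    return touring_years
```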
`TouringYear` and `ConcertData` are just simple Pydantic models used to store the scraped data before processing it further and ultimately writing it to the database.
Extracting Data as JSON with Llama3
Concert and Venue Information
Parsing the concert data presented challenges due to varying formats. Initially, I used regex to extract venue names, cities, states, and countries, but inconsistencies made this difficult.
For example:
- 1972: Formatted consistently, making regex effective. I managed to write a regex to parse this data, incorporating custom logic to identify international shows by the "X" prefix in the country code (see the sketch after this list). However, this was only the beginning of the challenges.
- 1976: Minor variations but still parseable.
- 1987: Increased complexity with abbreviations and missing details.
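The first-pass regex for the 1972-style lines looked something like the sketch below. Since I'm reconstructing the format here, the pattern and the sample lines in the comments are approximations, not the exact code:

```python
import re

# Rough approximation of the 1972-era header format:
#   "Winterland Arena, San Francisco, CA"   (domestic: two-letter state)
#   "Wembley Empire Pool, London, XENG"     (international: "X" + country code)
HEADER_RE = re.compile(
    r"^(?P<venue>[^,]+),\s*(?P<city>[^,]+),\s*(?P<code>X?[A-Z]{2,4})\s*$"
)


def parse_header(line: str) -> dict:
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognized header format: {line!r}")
    code = match.group("code")
    # International shows carry an "X" prefix on the country code.
    international = code.startswith("X")
    return {
        "venue": match.group("venue"),
        "city": match.group("city"),
        "state": None if international else code,
        "country": code[1:] if international else "USA",
    }
```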
Regex alone was insufficient for handling these variations (and more). After tweaking the logic several times, I of course tried using ChatGPT to extract the JSON from the setlist. This worked phenomenally.
I considered using OpenAI’s API, which even has a JSON mode to ensure only JSON is returned, but the API is not free, and I didn’t want to pay for something I was just experimenting with.
I explored running other LLMs locally on my MacBook, starting with LMStudio. It worked well, except I couldn’t get the models it was serving to return JSON reliably. Eventually, I found that Ollama provided a solution locally, avoiding any costs. Ollama has a JSON mode that reliably ensures the model response is JSON, simplifying further processing.
Setlist Information
I faced similar issues when extracting setlist information, stemming from the data not being consistently formatted. My first naive approach was to loop over all the lines in each text file, skipping the first line, which reliably repeated the concert information I had already processed. This worked fine for consistently formatted shows, like the one at the Tower Theatre in Upper Darby, PA on 6/21/76.
However, some shows, such as the one at the Delta Center in Salt Lake City, UT on Sunday 2/19/95, required additional processing:
- The “->” symbol indicates a song transition.
- Footnotes can span more than one line and add extra details for certain songs and/or the setlist.
- The encore song is prepended with “E:”.
Rather than complicating my initial regex approach with additional logic to handle these variations, I used Ollama to structure setlists into JSON, ensuring accuracy.
Data Extraction using Ollama
Using the ollama-python library, which is just a simple wrapper around the Ollama server API, turned out to be a lot simpler than trying to create multiple regex patterns to match all the concert and setlist variations.
I ran the Ollama server on my 2021 M1 MacBook Pro with 16GB of RAM, serving Llama3-8B.
You can find this code on GitHub as well.
I provided the LLM with the following prompts, which include examples of the structure I wanted the JSON returned in.
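The prompts below are a sketch of that approach rather than my exact wording; the JSON fields shown are illustrative:

```python
import json

import ollama

CONCERT_PROMPT = """Extract the venue information from this concert header.
Use your best judgement for abbreviations, and respond with only JSON in
this exact structure (use null for values you cannot determine):
{"venue": "...", "city": "...", "state": "...", "country": "...", "date": "MM/DD/YYYY"}

Concert header:
"""

SETLIST_PROMPT = """Extract the songs from this setlist. "->" means the song
transitions into the next one; a line starting with "E:" is the encore.
Respond with only JSON in this exact structure:
{"songs": [{"title": "...", "order": 1, "transition": false, "encore": false}],
 "notes": "..."}

Setlist:
"""


def extract_json(prompt: str, raw_text: str) -> dict:
    # format="json" enables Ollama's JSON mode, which constrains the model
    # so the response body is always valid JSON.
    response = ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": prompt + raw_text}],
        format="json",
    )
    return json.loads(response["message"]["content"])
```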
The script took approximately 10 hours to run, primarily due to the inference time for the Ollama server, especially when processing the setlist songs. Below is a snippet of the Ollama server logs. The 20.5-second call was for processing the setlist information, while the 2.6-second call was for processing the concert information. These times varied slightly, but on average, the inference time to process and extract the setlist information was about 9-10 times longer.
Structuring the Database Tables
To organize the scraped and processed data, I used a Python library called SQLModel. It helps define the structure of our database tables using Python objects.
This was the schema I ended up creating:
Model | Description |
---|---|
Concert | Represents a concert event with attributes like venue (foreign key to Venue table), date, and a relationship with Setlist model (one concert can have many setlists). |
Venue | Represents a venue with attributes like name, city, state (optional), and country. |
Song | Represents a song with attributes like title and a relationship with SetlistSong model (one song can be played in many setlists). |
Setlist | Represents a setlist for a concert with attributes like concert (foreign key to Concert table), notes, and relationships with both Concert (one concert has many setlists) and SetlistSong model (one setlist can have many songs). |
SetlistSong | Represents a specific song played in a setlist with attributes like setlist (foreign key to Setlist table), song (foreign key to Song table), order (position in the setlist), and transition (boolean indicating if there was a transition between songs). |
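As a rough sketch, the SQLModel definitions looked something like this (the field names are my paraphrase of the table above, not necessarily the exact code):

```python
from typing import List, Optional

from sqlmodel import Field, Relationship, SQLModel


class Venue(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    city: str
    state: Optional[str] = None  # not every venue has a state
    country: str


class Concert(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    date: str  # the LLM returned dates as MM/DD/YYYY strings
    venue_id: int = Field(foreign_key="venue.id")
    setlists: List["Setlist"] = Relationship(back_populates="concert")


class Song(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    title: str
    setlist_songs: List["SetlistSong"] = Relationship(back_populates="song")


class Setlist(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    notes: Optional[str] = None
    concert_id: int = Field(foreign_key="concert.id")
    concert: Optional[Concert] = Relationship(back_populates="setlists")
    songs: List["SetlistSong"] = Relationship(back_populates="setlist")


class SetlistSong(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    order: int        # position of the song within the setlist
    transition: bool  # True if the song flowed ("->") into the next one
    setlist_id: int = Field(foreign_key="setlist.id")
    song_id: int = Field(foreign_key="song.id")
    setlist: Optional[Setlist] = Relationship(back_populates="songs")
    song: Optional[Song] = Relationship(back_populates="setlist_songs")
```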
Results
When the script completed, I had gathered a good chunk of data. The following are the row counts of my database tables after processing the concert, venue, and setlist information.
Table | Count |
---|---|
Venue | 484 |
Concert | 484 |
Song | 564 |
SetlistSong | 31546 |
Setlist | 1604 |
It’s quite usable in its current form.
For example, we can write queries to answer the following questions.
- How many times was Truckin’ played?
- How many shows were not played in the USA?
- How many shows were played each year?
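As a sketch, here’s how those queries might look against the SQLite file directly, assuming SQLModel’s default lowercase table names and the column names from the models above:

```python
import sqlite3

conn = sqlite3.connect("grateful_dead.db")  # path to the scraped database

# How many times was Truckin' played?
times_played = conn.execute(
    """SELECT COUNT(*) FROM setlistsong
       JOIN song ON song.id = setlistsong.song_id
       WHERE song.title = ?""",
    ("Truckin'",),
).fetchone()[0]

# How many shows were not played in the USA?
foreign_shows = conn.execute(
    """SELECT COUNT(*) FROM concert
       JOIN venue ON venue.id = concert.venue_id
       WHERE venue.country != 'USA'"""
).fetchone()[0]

# How many shows were played each year? Dates are stored as MM/DD/YYYY
# strings, so the year is the last four characters.
shows_per_year = conn.execute(
    """SELECT substr(date, -4) AS year, COUNT(*)
       FROM concert GROUP BY year ORDER BY year"""
).fetchall()
```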
But there are quite a few data quality issues that will impact the accuracy of any analysis.
Data Quality Issues
The script I wrote executed without errors, resulting in a sizable database. Despite not including error handling in my code to process date information, the LLM consistently returned dates in the MM/DD/YYYY format. However, this does not mean everything worked perfectly. There are still some data quality issues that the LLM was unable to resolve. Perhaps if I had provided the LLM with previously processed data, it would have produced more consistent outputs.
For instance, consider some of the Grateful Dead concerts from 1987 whose venues were listed only as abbreviations like “HJK”, which were challenging to parse using regex.
I basically told the LLM in the prompt to use its best judgement when extracting concert details.
It turns out that “HJK” stands for the Henry J. Kaiser Convention Center located in Oakland, CA. Llama3 didn’t recognize this, so it returned the data inconsistently: sometimes with empty strings for city and state, and other times with “unknown” for the city.
The show listed as “Long Beach (11/14/87)” should have been written as “Long Beach Arena (11/14/87),” but Llama3 wasn’t able to identify and correct this data issue.
Bob Dylan’s “Corinna,” which the Grateful Dead covered with him, is spelled with two N’s and is actually officially titled “Corinna, Corinna”.
There are plenty more small issues like this that will require more processing to correct.
Takeaways
This project was a learning experience for me, even though the database I created isn’t unique. Modern Grateful Dead sites like herbibot.com and setlist.fm already have this data cleaned and well-organized. Here are some of my key takeaways:
- Web Scraping with BeautifulSoup: BeautifulSoup proved to be a very useful library for parsing HTML in Python.
- Data Extraction with LLMs: Using Ollama to extract data into structured JSON is amazing. This approach simplified the extraction process, especially when dealing with inconsistent formats. JSON extraction can be incredibly useful for many tasks, and exploring local LLMs can save costs associated with API usage.
- Handling Data Inconsistencies: Despite the LLM’s capabilities, some data quality issues persisted. This highlights the importance of combining automated tools with manual verification for critical data projects.
- Future Applications: The methods and tools used in this project have broad applications beyond just creating a Grateful Dead setlist database. They can be adapted for various data extraction and structuring tasks in the real world ™.
Overall, this project was interesting and I learned quite a bit. I hope to clean this data up and do some more thoughtful analysis of the setlists.