Build a Trial Court Records Scraper
When we're done, we'll have an application that scrapes public records (no per-search fees) and runs in the terminal like this:
In this detailed walk-through, we'll use the Ruby programming language to retrieve public information from Oregon's state trial court record system. We will convert this public information to appropriately-formatted JSON data objects.
This post will read like a code tutorial because that's what it is. But it's meant to be accessible even if you don't know code. The underlying concept—prying open closed legal systems—should be more revolutionary than it currently is.
You can find a GitHub repo containing the code for this walk-through here.
Building this isn't free. The application we will build uses 100% free, open source software. But you still need an OJCIN Online account to access the court records per the government's rules/fee schedule.
As I explain below, (a) appropriate public access to court records is a hotly contested topic, and (b) Oregon's rates are much, much better than those of the federal PACER system, which charges 10 cents per page. See PACER FAQ: "What are some examples of per-page charges?" If you don't want to pay to actually build this, you can still follow the entire process below.
You could build a similar app that searches free access materials. Notably, the free access materials do not include any access to court documents. The free version has significantly limited search capabilities and excludes case information found in the court file.
A. Overview
Features. Our code will scrape case information from the Oregon eCourt Information (OECI) system and save it to a text file. It will harvest data including:
- Case Number
- Filing Date
- Case Type
- Case Status
How it works. We will use the Ruby programming language and a few open source software tools (i.e., Nokogiri, Watir, Selenium, and ChromeDriver) to deploy a hidden ("headless") browser to the OECI case index, which looks like this:
Using the headless browser, we will pluck relevant data from cases that fall within our specified date range, case types, and counties.
/output directory, like this:
"oeci_caption": "Mitchel Braning vs Libery Mutal Fire Insur [...]",
"oeci_type": "Tort - General",
"oeci_caption": "Emanuel Leascu vs Michael Robert Nichol [...]",
"oeci_type": "Tort - General",
These JSON objects represent data that you could—at least theoretically 1—save to files or databases. Converting OECI information to data objects makes it possible to organize, process, and use the data in a variety of settings.
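As a rough illustration of how one scraped row could become a JSON object in Ruby, here is a minimal sketch. The oeci_caption and oeci_type keys come from the sample output above; the to_case_json helper name is hypothetical and is not taken from the original code.

```ruby
require 'json'

# Hypothetical helper (not from the original code): turns one scraped
# row into a JSON string. Only the two keys shown in the sample output
# are used here.
def to_case_json(caption:, type:)
  JSON.pretty_generate({
    'oeci_caption' => caption,
    'oeci_type'    => type
  })
end

puts to_case_json(caption: 'Emanuel Leascu vs Michael Robert Nichol [...]',
                  type: 'Tort - General')
```

Because the result is plain JSON, the same string could be written to a file or sent to a database without further conversion.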
This application will be spartan, unconventional, and "hacky." To keep this walk-through relatively brief (and given all of OECI's quirks and limitations), we are not shooting for lots of features or production-grade code. This app just demonstrates one way to get off the ground. From here, you would definitely want to do testing, refactoring, etc.
Use cases. This scraper collects official records of state trial court proceedings in Oregon. Imagine being able to provide easy-to-search example case documents to tenants facing evictions. Or to families facing foreclosures or collection lawsuits. Or to parties in family law and criminal law cases. There are myriad use cases depending on whom you would like to help.
Expanding on this tutorial, you could add functionality to scrape individual dockets, documents, information about parties, attorneys, and judges, etc. I address some potential "next steps" in Section L at the end of the walk-through.
B. A few words about OECI
It took four years, from 2012 to 2016, for the Oregon Judicial Department to roll out OECI as Oregon's new electronic court records system. At the time, Oregon's Supreme Court Chief Justice, Paul DeMuniz, described the transition to an online platform as "a business transformation project that would greatly increase the public’s access to courts." Oregon State Bar Bulletin, April 2014.
Oregon has subcontracted this project to a large, Texas-based software firm called Tyler Technologies. According to Tyler's website, "[c]ourts and justice agencies in seven countries and 28 U.S. states, serving more than 100 million citizens, use Tyler products." Therefore, building a scraper for Tyler's Oregon system may be a stepping stone for building larger multi-jurisdiction tools.
One feature is notably lacking in OECI: programmatic access. Unlike many modern software systems, OECI does not provide access via an Application Programming Interface (API). For most users, this means the only real way to access Oregon court records today is manually through the OECI system. In other words, you literally need to click around on the state website to gather the information you need. This is expensive, clunky, and inefficient.
Access to electronic court record systems has been the subject of major press coverage and litigation in other jurisdictions (see, e.g., Attacking a Pay Wall That Hides Public Court Filings, New York Times, Feb. 4, 2019). Thankfully, Oregon's system avoids many of the problems associated with the federal courts' PACER charges of $0.10 per page, which makes Oregon's system an interesting sandbox for a scraping project.
C. Check your system
$ ruby -v
Any Ruby version >= 2.5 will work fine.
$ bundle -v
Bundler version 2.1.2
D. Create a file structure and initial commit
Our app will consist of a small handful of files and directories. We could create these files as we go. Instead, for context, we will create them upfront.
Create a directory called oeci_scraper (or whatever you want to name the app). You can put this directory wherever you'd like.
Open .gitignore in a text editor and add this single line:
.env
We will store our OECI login credentials in the .env file. To make sure that we do not share the contents of .env, we add .env (stored in the root directory) to .gitignore. This will prevent our OECI login credentials from getting baked into the app's source code history.
Run git init to initialize the directory in our version management software, Git.
Run git add -A to stage the new files for our initial commit.
Run git status to confirm that .env is not included in our initial commit per .gitignore:
$ git status
On branch master
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: .gitignore
new file: Gemfile
new file: app/case_types.rb
new file: app/counties.rb
new file: app/parties.rb
new file: config/initializers/initializer.rb
new file: lib/prompts.rb
new file: oeci_scraper.rb
Run git commit -m 'initial commit' to commit the new files.
Note: I will not include further Git-related instructions from here. I included the initial commit to: (a) show how the .gitignore file works, and (b) emphasize the importance of keeping the .env file out of your source code.
E. Create a Gemfile
The next step is to make sure our app has access to a few external open source software packages it needs.
We will write this application in the Ruby programming language. We already confirmed above that Ruby is installed.
Ruby calls external packages and libraries "gems." We will use Bundler to manage gems. We also confirmed above that Bundler is installed.
Now we need a Gemfile. The Gemfile "describes the gem dependencies required to execute associated Ruby code." In short, the Gemfile prevents you from having to worry too much about configuration details.
The Gemfile is named Gemfile and lives in the root directory.
Add the following to the Gemfile that we created earlier:
Here's what the gems in the Gemfile do:
- nokogiri (line 3): We will use Nokogiri to pull data from webpages.
- watir (line 4): We will use Watir to navigate the headless browser to webpages and fill out web forms.
- dotenv (line 5): We will use dotenv to store our OECI credentials as environment variables.
- json (line 6): We will include the json gem to make sure Ruby can convert data to JSON.
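Based on the description above, a Gemfile along these lines would match the listed gems and line positions. This is a sketch, not the original file: the source line is an assumption (it matches the rubygems.org host shown in the bundle install output), and no version constraints are shown because the original ones are not reproduced here.

```ruby
source 'https://rubygems.org' # assumed source; matches the host in the bundle install output

gem 'nokogiri'
gem 'watir'
gem 'dotenv'
gem 'json'
```

Leaving the gems unpinned lets Bundler resolve the newest compatible versions and record them in Gemfile.lock.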
Run bundle install to install the application's software dependencies:
$ bundle install
Fetching gem metadata from https://rubygems.org/...........
Fetching gem metadata from https://rubygems.org/.
Using bundler 2.1.2
Using childprocess 3.0.0
Using dotenv 2.7.5
Using json 2.3.0
Using mini_portile2 2.4.0
Using nokogiri 1.10.9
Using regexp_parser 1.7.0
Using rubyzip 2.2.0
Using selenium-webdriver 3.142.7
Using watir 6.16.5
Bundle complete! 4 Gemfile dependencies, 10 gems now installed.
Use `bundle info [gemname]` to see where a bundled gem is installed.
Bundler will create a file called Gemfile.lock in the root directory.
Gemfile.lock looks like this:
Gemfile.lock is a "snapshot" of the software versions the app uses, including implied requirements. Some version numbers in your Gemfile.lock may vary slightly depending on your system. This is fine.
F. Install and configure ChromeDriver
In this step, we'll install and configure ChromeDriver, an open source tool that our code will rely upon to "drive" a web browser. ChromeDriver lets our code steer the browser to the OECI system so it can scrape the information.
Setting up ChromeDriver can be difficult if it's your first time. Your code may not run. You will receive error messages. If you have any issues with the directions below, you may need to search Google for your specific error messages and change your configuration settings based on your specific system. Keep at it - this can be the hardest step of this walk-through.
On macOS, Google Chrome is usually installed at /Applications/Google Chrome.app/Contents/MacOS/Google Chrome. Check that Chrome is installed at this location and get your Chrome version number:
$ /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --version
Google Chrome 83.0.4103.61
Make a note of the version—here it's Google Chrome 83.0.4103.61.
$ mv ~/Desktop/chromedriver /usr/local/bin/
$ chromedriver -v
At this point, we've confirmed that Google Chrome and ChromeDriver are installed.
G. Store your OECI credentials
Our final configuration step is to store OECI login credentials as environment variables. This step is simple.
"Environment variables" are variables that store information outside of the source code. We will use them here to hold our OECI credentials, which we don't want to hard-wire into the application. This way we can easily add, change, and revoke credentials. Storing these credentials outside of the code improves security.
When we're developing our code locally, we will use dotenv to store environment variables in .env (hence the name). Later, if we deploy the code to a server, we can set the environment variables on the server.
You will obviously need to change your_password to your OECI credentials.
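To make the idea concrete, here is a minimal sketch of how Ruby code might read credentials from environment variables. The variable names OECI_USERNAME and OECI_PASSWORD and the oeci_credentials helper are assumptions for illustration; the original .env file may use different names. With the dotenv gem, calling Dotenv.load first copies the entries in .env into ENV.

```ruby
# Hypothetical helper: reads OECI credentials from the environment.
# The variable names are assumptions, not taken from the original code.
def oeci_credentials(env = ENV)
  username = env['OECI_USERNAME']
  password = env['OECI_PASSWORD']
  if username.to_s.empty? || password.to_s.empty?
    raise KeyError, 'Set OECI_USERNAME and OECI_PASSWORD before running'
  end
  { username: username, password: password }
end

# Usage with an explicit hash standing in for ENV:
creds = oeci_credentials({ 'OECI_USERNAME' => 'your_username',
                           'OECI_PASSWORD' => 'your_password' })
puts creds[:username]
```

Failing loudly when a variable is missing tends to be easier to debug than a login that silently fails later in the browser.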
H. Create a prompts library
Our app will need some prompts to guide users through using the application. For this, we will create a basic set of prompts to display messages and collect user inputs.
Open lib/prompts.rb and add the following code (explained below):
Here, we created a Ruby module called Prompt. Prompt has five methods:
- header (lines 2-6): Tells the user the application is running.
- credentials (lines 8-19): Sets OECI credentials as class variables (i.e., variables starting with @@). This method will also collect OECI credentials from the user if they are not stored as environment variables.
- date (lines 21-25): Collects a date from the user to limit the scope of the search. This method could be modified to accept a date range. I have not included that feature in this walk-through for simplicity's sake.
- counties (lines 27-32): Shows the user which counties are included in the search. We will define these counties in the next section.
- starting_search (lines 34-37): Notifies the user that the search is running.
As you will see below, Prompt allows us to display appropriate prompts via simple references like:
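For readers without the original file at hand, here is a minimal sketch of what a module with these five methods could look like. The method names and the date prompt text follow the walk-through; everything else is an assumption, including the message wording, the OECI_USERNAME and OECI_PASSWORD variable names, the ask helper, and the keyword arguments used to inject input and output streams. Line numbers will not match the ranges listed above.

```ruby
require 'date'

# Sketch only: method names follow the walk-through, bodies are assumed.
module Prompt
  def self.header(out: $stdout)
    out.puts '-' * 40
    out.puts 'OECI scraper running...'
    out.puts '-' * 40
  end

  # Stores credentials in class variables, asking the user only when the
  # (hypothetical) OECI_USERNAME / OECI_PASSWORD variables are not set.
  def self.credentials(env: ENV, input: $stdin, out: $stdout)
    @@username = env['OECI_USERNAME'] || ask('OECI username: ', input, out)
    @@password = env['OECI_PASSWORD'] || ask('OECI password: ', input, out)
    [@@username, @@password]
  end

  # Collects a single filing date in YYYY-MM-DD format.
  def self.date(input: $stdin, out: $stdout)
    Date.strptime(ask('Date [Format YYYY-MM-DD]: ', input, out), '%Y-%m-%d')
  end

  def self.counties(names, out: $stdout)
    out.puts "Counties: #{names.join(', ')}"
  end

  def self.starting_search(out: $stdout)
    out.puts 'Starting search...'
  end

  # Hypothetical helper: prints a message and returns the trimmed reply.
  def self.ask(message, input, out)
    out.print message
    input.gets.to_s.strip
  end
end
```

With a module like this, a call such as Prompt.header or Prompt.date is all the scraper needs to show a message or collect input.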
I. Build search models
In this step, we will create three crude 3 data models: counties, case types, and parties. Our app will use this data to set parameters for its searches and scraping. These parameters will come from OECI.
First, we will build a model for counties. As you can see below, OECI allows us to limit searches on a county-by-county basis:
The code above defines a class of Counties. Our class has two methods. The sample method includes just a couple of counties and will be used to test the scraper. The all method contains all counties in Oregon (from the select menu on OECI, above).
You might notice in the screenshot above that OECI offers an option for "All Locations." Our app will ignore this option for now. This is because OECI limits users to 400 results per search 4. See, OJCIN Quick Reference Guide. One way around this limitation is to scrape in smaller batches—e.g., by county and date—which results in smaller numbers of results in a given batch. We still end up with the same data in the end. We just scrape it in several smaller passes.
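The batching idea can be sketched in a few lines of Ruby. This is an illustration under assumptions, not the original model: the Counties class below only has a sample method, 'Lane' is just an example second county, and the search_batches helper is hypothetical.

```ruby
require 'date'

# Sketch of a counties model. The county list is an assumption:
# 'Multnomah' appears in the walk-through, 'Lane' is only an example.
class Counties
  def sample
    %w[Multnomah Lane]
  end
end

# Hypothetical helper: builds one [county, date] pair per search, so each
# search covers a single county and a single filing date.
def search_batches(counties, from, to)
  counties.product((from..to).to_a)
end

batches = search_batches(Counties.new.sample,
                         Date.new(2020, 1, 20), Date.new(2020, 1, 21))
batches.each { |county, date| puts "#{county} #{date}" }
```

Each pair represents one small search, which keeps the number of results per search low. A single county on a single day could still exceed the limit, so a real scraper would need to check for that case.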
To expand functionality here, you could refactor the code to respond to additional methods, like
Second, we will build a model for case types. OECI assigns different case types to new cases as shown in the screenshot below. Creating a list of these case types allows us to include/exclude certain cases during our scraping.
This creates a CaseTypes class with a single method: excluded. We will use this to search for all cases except those listed in the excluded list.
These case types all came from previous searches I ran in the OECI system to identify cases relevant to my work (tort and contract cases). This is not a comprehensive list 5.
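Here is a minimal sketch of how an exclusion list like this could be applied to scraped rows. The case type names in excluded are hypothetical placeholders (only "Tort - General" appears in the walk-through's sample output), and the row hashes reuse the oeci_caption and oeci_type keys from the JSON example.

```ruby
# Sketch of a case-type model. The excluded entries are hypothetical
# placeholders, not the author's actual list.
class CaseTypes
  def excluded
    ['Small Claims - General', 'Landlord/Tenant - General']
  end
end

cases = [
  { 'oeci_caption' => 'A vs B', 'oeci_type' => 'Tort - General' },
  { 'oeci_caption' => 'C vs D', 'oeci_type' => 'Small Claims - General' }
]

# Keep every case whose type is not on the exclusion list.
excluded_types = CaseTypes.new.excluded
kept = cases.reject { |c| excluded_types.include?(c['oeci_type']) }

puts "Total: #{cases.size}; Scraped: #{kept.size}"
```

An exclusion list errs on the side of keeping unfamiliar case types, which suits a list that is known to be incomplete.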
Finally, we need a model for parties.
The reason may not be obvious at first. If you run this scraper without a party filter, you will quickly notice something. Look at the following screenshot, which shows the first 14 of 104 cases filed in Multnomah County on January 20, 2020:
Of these 14 cases, the party names indicate that 11 are debt collection lawsuits filed by banks or collection agencies.
The Oregon court system is literally clogged with debt collection lawsuits. Since all of these cases get filed under the general type called "Contract," it is impossible to identify non-debt collection contract cases without a party filter. A substantial portion of important civil cases are based on contract. A party filter is therefore necessary to separate debt collection cases from the many other contract-related cases.
This creates a Parties class with a single excluded method. The list above is a partial list I compiled from a few searches over a sample period of a couple of months. Depending on the use case, we could repurpose this code to return only those cases matching certain party names 6.
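A party filter along these lines can be sketched as a case-insensitive substring match against each caption. The party names in excluded and the excluded_party? helper are hypothetical; the second caption is borrowed from the sample output later in this walk-through.

```ruby
# Sketch of a parties model. The excluded names are hypothetical
# placeholders, not the author's actual list.
class Parties
  def excluded
    ['Example Collections LLC', 'Sample Bank']
  end
end

# Hypothetical helper: true when the caption mentions an excluded party.
def excluded_party?(caption, excluded_names)
  normalized = caption.downcase
  excluded_names.any? { |name| normalized.include?(name.downcase) }
end

captions = [
  'Sample Bank vs Jane Doe',
  'Craig Brandt vs Gresham Sanitary Service, Inc.'
]
names = Parties.new.excluded
kept = captions.reject { |caption| excluded_party?(caption, names) }
puts kept
```

Swapping reject for select would turn the same helper into a filter that returns only cases matching the listed parties.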
These three models are all we will create for now. It is possible to expand these models to accommodate new features (e.g., a DocumentTypes model for certain types of documents, a Judges model to associate cases with specific judges, etc.).
J. Create an initializer
Now we will create an initializer. Our initializer will compile everything we've done up to this point in one place. This way our code has access to everything.
This code is pretty straightforward. We're requiring the necessary gems (lines 1-5), instantiating our models (lines 7-15), and including our prompts (lines 17-19). The last couple of lines tell the console to hide warnings that pop up in the terminal. If you have trouble running your code, you can turn the warnings back on to troubleshoot.
As you will see below, we call this initializer in the first line of our scraper. Separating this initialization in our code allows us to stay focused on our scraper, not configuration.
K. Write the scraper code
Now for the scraping code itself. Once we get the code in place, all we need to do is run ruby oeci_scraper.rb from the application's directory and the scraper will run and generate output that looks like this:
🥑 $ ruby oeci_scraper.rb
Date [Format YYYY-MM-DD]: 2020-01-20
20CV03304 - Michelle Greissinger vs Richard Bloom [...]
20CV03320 - Jose Luis Moreno Robles vs Troy Anthony Builta
20CV03344 - Norberto Cortez vs Norma Rene Berlin
20CV03447 - Salena Painter vs Troy Thomas
20CV03477 - Ingrid Cloman vs Nationwide Investigations [...]
20CV03480 - Richard King vs McKenzie Reynolds-Stein
20CV03484 - Shalontelle White vs The Kroger Co.
20CV03493 - Yuniesky Cruz vs Gareth Pooleon
20CV03504 - Craig Brandt vs Gresham Sanitary Service, Inc.
20LT00927 - Michaele A Jarvis vs Kevin Fine
20LT00929 - Ernest Harris vs Lakeshie Parker
Total: 104; Scraped: 11
20CV03429 - Michael Palmore vs Rob's Speedy Delivery Inc.
Total: 6; Scraped: 1
The easiest way to understand how this code works is to read it.
I've heavily commented this code to explain each step. Below, there are some additional notes and OECI screenshots to visualize what the code does.
These screenshots show how we identified the OECI login inputs and submit button in lines 27-31:
Counties (line 34)
This screenshot shows (a) where we got the counties in our Counties model at app/counties.rb, and (b) how we identified the OECI county select menu:
Navigating to case listings (lines 38-43)
The following screenshots show how the code navigates to the OECI case listings to be scraped:
Once the browser gets to the search results, the remainder of our code uses Nokogiri to do the actual scraping. The following screenshots show how we identified the appropriate elements to scrape:
L. Next steps
At this point, we have built a minimum viable OECI scraper. From here, you can add features for specific use cases.
Here are some features I have found helpful in my work:
For each case in OECI, you can click through to a docket (i.e., calendar of filings and events). OECI dockets look like this:
As you can see, dockets contain valuable information about case lifelines and events. You could create a feature that automatically looks for new filings against certain parties. Or one that identifies upcoming trials or dismissals.
Docket scraping is especially powerful when it runs automatically on a schedule (e.g., every night or once per week). You can do this by wrapping the scraper in a task with a Ruby tool like Rake and running that task from a scheduler such as cron.
OECI also puts actual PDF and TIF documents within easy reach. OECI case documents look like this:
The case documents contain a whole deeper layer of valuable information. These documents represent an enormous number of billable legal work hours.
With some careful coding, you can download and reference these documents in data objects. Being able to programmatically access these documents opens up a whole world of data analysis possibilities.
Identifying public access hurdles
My final—and most important—suggestion requires ongoing attention: making sure the government provides sufficient access to public court files.
The Oregon Judicial Department has valid and reasonable concerns about making access too easy. However, is it appropriate for governments to deny citizens programmatic access to their own records? Is it appropriate for the government to charge users thousands of dollars each month to share public information? Is it appropriate for the government to limit searches to 400 entries at a time?