Build a Trial Court Records Scraper

25 May 2020

When we're done, we'll have an application that scrapes public records (no per-search fees) and runs in the terminal.

In this detailed walk-through, we'll use the Ruby programming language to retrieve public information from Oregon's state trial court record system. We will convert this public information to appropriately-formatted JSON data objects.

This post will read like a code tutorial because that's what it is. But it's meant to be accessible even if you don't know code. The underlying concept—prying open closed legal systems—should be more revolutionary than it currently is.

You can find a GitHub repo containing the code for this walk-through here.

Building this isn't free. The application we will build uses 100% free, open source software. But you still need an OJCIN Online account to access the court records per the government's rules/fee schedule.

As I explain below, (a) appropriate public access to court records is a hotly contested topic, and (b) Oregon's rates are much, much better than the federal PACER system, which charges 10 cents per page. See PACER FAQ: "What are some examples of per-page charges?" If you don't want to pay to actually build this, you can still see the entire process below.

You could build a similar app that searches free access materials. Notably, the free access materials do not include any access to court documents. The free version has significantly limited search capabilities and excludes case information found in the court file.

Table of Contents - Build a Trial Court Records Scraper

  A. Overview
  B. A few words about OECI
  C. Check your system
  D. Create a file structure and initial commit
  E. Create a Gemfile
  F. Install and configure ChromeDriver
  G. Store your OECI credentials
  H. Create a prompts library
  I. Build search models
  J. Create an initializer
  K. Write the scraper code
  L. Next steps

A. Overview

Features. Our code will scrape case information from the Oregon eCourt Information (OECI) system and save it to a text file. It will harvest data including each case's number, caption, county, filing date, type, and status.

How it works. We will use the Ruby programming language and a few open source software tools (i.e., Nokogiri, Watir, Selenium, and ChromeDriver) to deploy a hidden ("headless") browser to the OECI case index, which looks like this:

OECI Case index
(Source: OECI)

Using the headless browser, we will pluck relevant data from cases that fall within our specified date range, case types, and counties.

Finally, our code will save the scraped data as JSON objects (JSON = JavaScript Object Notation, a format for storing data) in timestamped files located in the /output directory, like this:

{
"oeci_number": "20CV17656",
"oeci_caption": "Mitchel Braning vs Libery Mutal Fire Insur [...]",
"oeci_county": "Multnomah",
"oeci_filing_date": "2020-05-10",
"oeci_type": "Tort - General",
"oeci_status": "Open"
},
{
"oeci_number": "20CV17657",
"oeci_caption": "Emanuel Leascu vs Michael Robert Nichol [...]",
"oeci_county": "Multnomah",
"oeci_filing_date": "2020-05-10",
"oeci_type": "Tort - General",
"oeci_status": "Open"
}

These JSON objects represent data that you could—at least theoretically 1—save to files or databases. Converting OECI information to data objects makes it possible to organize, process, and use the data in a variety of settings.
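As a rough illustration of this output step, the sketch below writes an array of case hashes to a timestamped JSON file. The helper name `save_cases`, the filename pattern, and the choice to wrap the objects in a JSON array are assumptions for illustration, not necessarily what the repo does.

```ruby
require 'json'
require 'fileutils'

# Hypothetical helper: writes an array of case hashes to <dir>/oeci_<timestamp>.json
# and returns the path of the file it created.
def save_cases(cases, dir: 'output', now: Time.now)
  FileUtils.mkdir_p(dir)
  path = File.join(dir, "oeci_#{now.strftime('%Y%m%d_%H%M%S')}.json")
  File.write(path, JSON.pretty_generate(cases))
  path
end

if __FILE__ == $PROGRAM_NAME
  cases = [
    { oeci_number: '20CV17656', oeci_county: 'Multnomah',
      oeci_filing_date: '2020-05-10', oeci_type: 'Tort - General',
      oeci_status: 'Open' }
  ]
  puts save_cases(cases) # prints the path of the new file
end
```

Because each run gets its own timestamped file, repeated scrapes never overwrite earlier results.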

This application will be spartan, unconventional, and "hacky." To keep this walk-through relatively brief (and given all of OECI's quirks and limitations), we are not shooting for lots of features or production-grade code. This app just demonstrates one way to get off the ground. From here, you would definitely want to do testing, refactoring, etc.

Use cases. This scraper collects official records of state trial court proceedings in Oregon. Imagine being able to provide easy-to-search example case documents to tenants facing evictions. Or to families facing foreclosures or collection lawsuits. Or to parties in family law and criminal law cases. There are myriad use cases depending on whom you would like to help.

Expanding on this tutorial, you could add functionality to scrape individual dockets, documents, information about parties, attorneys, and judges, etc. I address some potential "next steps" in Section L at the end of the walk-through.

B. A few words about OECI

Image on OECI's Login Page
Note: the Apple Pro Mouse was discontinued in ~2005, 7 years before OECI existed
(Source: OECI's Login Page)

It took four years (2012 to 2016) for the Oregon Judicial Department to roll out OECI as Oregon's new electronic court records system. At the time, Oregon's Supreme Court Chief Justice, Paul DeMuniz, described the transition to an online platform as "a business transformation project that would greatly increase the public’s access to courts." Oregon State Bar Bulletin, April 2014.

Oregon has subcontracted this project to a large, Texas-based software firm called Tyler Technologies. According to Tyler's website, "[c]ourts and justice agencies in seven countries and 28 U.S. states, serving more than 100 million citizens, use Tyler products." Therefore, building a scraper for Tyler's Oregon system may be a stepping stone for building larger multi-jurisdiction tools.

One feature is notably lacking in OECI: programmatic access. Unlike many modern software systems, OECI does not provide access via an Application Programming Interface (API). For most users, this means the only real way to access Oregon court records today is manually through the OECI system. In other words, you literally need to click around on the state website to gather the information you need. This is expensive, clunky, and inefficient.

Access to electronic court record systems has been the subject of major press coverage and litigation in other jurisdictions (see, e.g., Attacking a Pay Wall That Hides Public Court Filings, New York Times, Feb. 4, 2019). Thankfully, Oregon's system avoids many of the problems associated with the federal courts' PACER charges of $0.10 per page, which makes Oregon's system an interesting sandbox for a scraping project.

C. Check your system

(1)
Start by confirming you have Ruby installed:
$ ruby -v 
ruby 2.6.1

Any Ruby version >= 2.5 will work fine.

(2)
Also confirm that you have Bundler >= 1.18 installed:
$ bundle -v 
Bundler version 2.1.2

We'll be using Bundler to help integrate the necessary open source software packages (a.k.a. gems).

D. Create a file structure and initial commit

Our app will consist of a small handful of files and directories. We could create these files as we go. Instead, for context, we will create them upfront.

(3)
Create a new directory called oeci_scraper (or whatever you want to name the app). You can put this directory wherever you'd like.
(4)
Create the following directories and files inside of oeci_scraper:
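If you prefer to script this step, here is a minimal sketch. The file list is inferred from the `git status` output in step (8), plus the .env file and the /output directory mentioned elsewhere in this walk-through. Creating /output up front is an assumption, and `scaffold` is a hypothetical helper name.

```ruby
require 'fileutils'

# Files inferred from the walk-through; all are created empty.
SCAFFOLD_FILES = %w[
  .env
  .gitignore
  Gemfile
  app/case_types.rb
  app/counties.rb
  app/parties.rb
  config/initializers/initializer.rb
  lib/prompts.rb
  oeci_scraper.rb
].freeze

# Creates the directory tree and empty files under the given root.
def scaffold(root)
  SCAFFOLD_FILES.each do |relative|
    path = File.join(root, relative)
    FileUtils.mkdir_p(File.dirname(path))
    FileUtils.touch(path)
  end
  FileUtils.mkdir_p(File.join(root, 'output'))
end

scaffold('oeci_scraper') if __FILE__ == $PROGRAM_NAME
```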
(5)
Open .gitignore in a text editor and add this single line:
.gitignore
.env

The above step is important! Later, we'll store our OECI credentials in the .env file. To make sure that we do not share the contents of the .env file, we add .env (stored in the root directory) to .gitignore. This will prevent our OECI login credentials from getting baked into the app's source code history.

(6)
Run git init to initialize the directory in our version management software, Git.
(7)
Run git add -A to stage the new files for our initial commit.
(8)
Use git status to confirm that .env is not included in our initial commit per .gitignore:
$ git status
On branch master

No commits yet

Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: .gitignore
new file: Gemfile
new file: app/case_types.rb
new file: app/counties.rb
new file: app/parties.rb
new file: config/initializers/initializer.rb
new file: lib/prompts.rb
new file: oeci_scraper.rb
(9)
Finally, use git commit -m 'initial commit' to commit the new files.

Note: I will not include further Git-related instructions from here. I included the initial commit to: (a) show how the .gitignore file works, and (b) emphasize the importance of keeping the .env file out of your source code.

E. Create a Gemfile

The next step is to make sure our app has access to a few external open source software packages it needs.

We will write this application in the Ruby programming language. We already confirmed above that Ruby is installed.

Ruby calls external packages and libraries "gems." We will use Bundler to manage gems. We also confirmed above that Bundler is installed.

Now we need a Gemfile. The Gemfile "describes the gem dependencies required to execute associated Ruby code." In short, the Gemfile prevents you from having to worry too much about configuration details.

The Gemfile is named Gemfile and lives in the root directory.

(10)
Add the following to the Gemfile that we created earlier:
Gemfile

Here's what the gems in the Gemfile do:

nokogiri (line 3): We will use Nokogiri to pull data from webpages.

watir (line 4): We will use Watir to navigate the headless browser to webpages and fill out web forms.

dotenv (line 5): We will use dotenv to store our OECI credentials as environment variables.

json (line 6): We will include the json gem to make sure Ruby can convert data to JSON.
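Putting those four gems together, a Gemfile consistent with the line numbers cited above might look like the sketch below. The `source` line and the blank second line are assumptions; only the four gem names and their positions come from the descriptions above.

```ruby
source 'https://rubygems.org'

gem 'nokogiri'
gem 'watir'
gem 'dotenv'
gem 'json'
```

No version constraints are shown here; Bundler records the versions it actually resolves in Gemfile.lock.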

(11)
Run bundle install to install the application's software dependencies:
$ bundle install
Fetching gem metadata from https://rubygems.org/...........
Fetching gem metadata from https://rubygems.org/.
Resolving dependencies...
Using bundler 2.1.2
Using childprocess 3.0.0
Using dotenv 2.7.5
Using json 2.3.0
Using mini_portile2 2.4.0
Using nokogiri 1.10.9
Using regexp_parser 1.7.0
Using rubyzip 2.2.0
Using selenium-webdriver 3.142.7
Using watir 6.16.5
Bundle complete! 4 Gemfile dependencies, 10 gems now installed.
Use `bundle info [gemname]` to see where a bundled gem is installed.

Bundler will create a file called Gemfile.lock in the root directory.

(12)
Confirm that Gemfile.lock looks like this:
Gemfile.lock

Gemfile.lock is a "snapshot" of the software versions the app uses, including implied requirements. Some version numbers in your Gemfile.lock may vary slightly depending on your system. This is fine.

F. Install and configure ChromeDriver

In this step, we'll install and configure ChromeDriver, an open source tool that our code will rely upon to "drive" a web browser. ChromeDriver lets our code steer Chrome to the OECI system so it can scrape the information.

Setting up ChromeDriver can be difficult if it's your first time. Your code may not run. You will receive error messages. If you have any issues with the directions below, you may need to search Google for your specific error messages and change your configuration settings based on your specific system. Keep at it - this can be the hardest step of this walk-through.

(13)
First, make sure you have the Google Chrome browser installed on your computer.
(14)
Google Chrome must be installed in the default location (on macOS): /Applications/Google Chrome.app/Contents/MacOS/Google Chrome. Check that Chrome is installed at this location and get your Chrome version number:
$ /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --version
Google Chrome 83.0.4103.61

Make a note of the version—here it's Google Chrome 83.0.4103.61.

(15)
Go to ChromeDriver's downloads page and download the version that corresponds to your Chrome version:
(16)
Unzip the file and move the ChromeDriver executable file to /usr/local/bin 2:
$ mv ~/Desktop/chromedriver /usr/local/bin/ 
(17)
Finally, check that ChromeDriver is installed and that you're using the correct version:
$ chromedriver -v
ChromeDriver 83.0.4103.39
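Note that the two version strings above differ only in their last component. The sketch below treats the versions as compatible when the first three dotted components match, which is an assumption based on ChromeDriver's usual version-matching guidance. `build_prefix` and `versions_compatible?` are hypothetical helper names.

```ruby
# Returns the first three dotted components, e.g. "83.0.4103", or nil.
def build_prefix(version_output)
  version_output[/\d+\.\d+\.\d+/]
end

# True when both strings share the same major.minor.build prefix.
def versions_compatible?(chrome_output, driver_output)
  prefix = build_prefix(chrome_output)
  !prefix.nil? && prefix == build_prefix(driver_output)
end

chrome = 'Google Chrome 83.0.4103.61'
driver = 'ChromeDriver 83.0.4103.39'
puts versions_compatible?(chrome, driver) # prints "true"
```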

At this point, we've confirmed that Google Chrome and ChromeDriver are installed.

G. Store your OECI credentials

Our final configuration step is to store OECI login credentials as environment variables. This step is simple.

"Environment variables" are variables that store information outside of the source code. We will use them here to hold our OECI credentials, which we don't want to hard-wire into the application. This way we can easily add, change, and revoke credentials. Storing these credentials outside of the code improves security.

When we're developing our code locally, we will use dotenv to store environment variables in .env (hence the name). Later, if we deploy the code to a server, we can set the environment variables on the server.

(18)
Since we already set up dotenv in the previous steps, all we need to do is add the following lines to the .env file:
.env

You will obviously need to change your_username and your_password to your OECI credentials.
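For illustration, suppose the .env file holds two lines of the form `OECI_USERNAME=your_username` and `OECI_PASSWORD=your_password`. Those variable names are hypothetical; use whatever names your code expects. The sketch below shows one way the values could then be read in Ruby. It loads dotenv only if the gem is available, and `oeci_credentials` is a hypothetical helper.

```ruby
# Load .env if the dotenv gem is installed; otherwise use the shell's environment.
begin
  require 'dotenv/load'
rescue LoadError
  warn 'dotenv not installed; reading the existing environment only'
end

# OECI_USERNAME and OECI_PASSWORD are hypothetical variable names.
# Returns nil for any value that is not set.
def oeci_credentials(env = ENV)
  {
    username: env.fetch('OECI_USERNAME', nil),
    password: env.fetch('OECI_PASSWORD', nil)
  }
end
```

Accepting the environment as an argument keeps the helper easy to try with a plain hash instead of real credentials.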

H. Create a prompts library

Our app will need some prompts to guide users through using the application. For this, we will create a basic set of prompts to display messages and collect user inputs.

(19)
Open lib/prompts.rb and add the following code (explained below):
lib/prompts.rb

Here, we created a Ruby module called Prompt. Prompt has five methods:

1 - header (lines 2-6): Tells the user the application is running.

2 - credentials (lines 8-19): Sets OECI credentials as class variables (i.e., variables starting with @@). This method will also collect OECI credentials from the user if they are not stored as environment variables.

3 - date (lines 21-25): Collects a date from the user to limit the scope of the search. This method could be modified to accept a date range. I have not included that feature in this walk-through for simplicity's sake.

4 - counties (lines 27-32): Shows the user which counties are included in the search. We will define these counties in the next section.

5 - starting_search (lines 34-37): Notifies the user that the search is running.

As you will see below, Prompt allows us to display appropriate prompts via simple references like: Prompt.header and Prompt.credentials.
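To make the five methods concrete, here is a minimal sketch of a module with the same method names. It is not the repo's code: the environment variable names, the `ask` helper, the injectable `input`/`out` arguments, and the retry-on-bad-date behavior are all assumptions. The message text follows the sample terminal output in Section K.

```ruby
require 'date'

module Prompt
  # Tells the user the application is running.
  def self.header(out: $stdout)
    out.puts "\nOECI SCRAPER\n\n"
  end

  # Uses environment variables when present, otherwise asks the user.
  # OECI_USERNAME / OECI_PASSWORD are hypothetical variable names.
  def self.credentials(env: ENV, input: $stdin, out: $stdout)
    @@username = env['OECI_USERNAME'] || ask('OECI username: ', input, out)
    @@password = env['OECI_PASSWORD'] || ask('OECI password: ', input, out)
    [@@username, @@password]
  end

  # Keeps asking until the input parses as YYYY-MM-DD; returns a Date.
  def self.date(input: $stdin, out: $stdout)
    loop do
      line = ask('Date [Format YYYY-MM-DD]: ', input, out)
      begin
        return Date.strptime(line, '%Y-%m-%d')
      rescue ArgumentError
        out.puts 'Could not read that date; please try again.'
      end
    end
  end

  # Shows which counties the search will cover.
  def self.counties(names, out: $stdout)
    out.puts 'Counties:', names.join(', '), '(from: app/counties.rb)'
  end

  # Notifies the user that the search is running.
  def self.starting_search(out: $stdout)
    out.puts "\nSTARTING SEARCH...\n\n"
  end

  # Hypothetical helper: prints a label and returns one stripped line of input.
  def self.ask(label, input, out)
    out.print label
    line = input.gets
    raise EOFError, 'no input available' if line.nil?
    line.strip
  end
  private_class_method :ask
end
```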

I. Build search models

In this step, we will create three crude 3 data models: counties, case types, and parties. Our app will use this data to set parameters for its searches and scraping. These parameters will come from OECI.

Counties

First, we will build a model for counties. As you can see below, OECI allows us to limit searches on a county-by-county basis:

(20)
Copy the following code into app/counties.rb:
app/counties.rb

The code above defines a class of Counties. Our class has two methods. The sample method includes just a couple counties and will be used to test the scraper. The all method contains all counties in Oregon (from the select menu on OECI, above).
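A minimal sketch of such a class follows. The sample counties match the two shown in the terminal output in Section K. The full list is Oregon's 36 counties by name; the exact labels in OECI's select menu may differ, so treat the strings as assumptions to check against the menu.

```ruby
class Counties
  # Small subset for test runs.
  def self.sample
    %w[Multnomah Umatilla]
  end

  # All 36 Oregon counties (labels assumed to match OECI's menu).
  def self.all
    [
      'Baker', 'Benton', 'Clackamas', 'Clatsop', 'Columbia', 'Coos',
      'Crook', 'Curry', 'Deschutes', 'Douglas', 'Gilliam', 'Grant',
      'Harney', 'Hood River', 'Jackson', 'Jefferson', 'Josephine', 'Klamath',
      'Lake', 'Lane', 'Lincoln', 'Linn', 'Malheur', 'Marion',
      'Morrow', 'Multnomah', 'Polk', 'Sherman', 'Tillamook', 'Umatilla',
      'Union', 'Wallowa', 'Wasco', 'Washington', 'Wheeler', 'Yamhill'
    ]
  end
end
```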

You might notice in the screenshot above that OECI offers an option for "All Locations." Our app will ignore this option for now. This is because OECI limits users to 400 results per search 4. See, OJCIN Quick Reference Guide. One way around this limitation is to scrape in smaller batches—e.g., by county and date—which results in smaller numbers of results in a given batch. We still end up with the same data in the end. We just scrape it in several smaller passes.
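The batching idea can be sketched as follows: build one search per county per day, and flag any batch whose result count reaches the cap, since it may have been cut off. `batches` and `possibly_truncated?` are hypothetical helpers, and the date-range loop goes slightly beyond the single-date prompt used in this walk-through.

```ruby
require 'date'

RESULT_CAP = 400 # per-search limit described above

# Builds one [county, date] pair per search so each result set stays small.
def batches(counties, from, to)
  counties.product((from..to).to_a)
end

# A batch that reaches the cap may be missing results and should be split further.
def possibly_truncated?(result_count)
  result_count >= RESULT_CAP
end

batches(%w[Multnomah Umatilla], Date.new(2020, 1, 20), Date.new(2020, 1, 21)).each do |county, date|
  puts "#{county} #{date}"
end
```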

To expand functionality here, you could refactor the code to respond to additional methods, like Counties.portland_metro, Counties.central_oregon, Counties.willamette_valley, etc.

Case Types

Second, we will build a model for case types. OECI assigns different case types to new cases as shown in the screenshot below. Creating a list of these case types allows us to include/exclude certain cases during our scraping.

(21)
Copy the following code into app/case_types.rb:
app/case_types.rb

This creates a CaseTypes class with a single method: excluded. We will use this to search for all cases except those listed in the excluded method.

These case types all came from previous searches I ran in the OECI system to identify cases relevant to my work (tort and contract cases). This is not a comprehensive list 5.
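As a sketch of how an exclusion list like this could be applied, consider the snippet below. The two excluded type names are placeholders rather than entries from the real list, and `wanted_type?` is a hypothetical helper. "Tort - General" is the case type shown in the JSON sample in Section A.

```ruby
class CaseTypes
  # Placeholder entries; the real list comes from your own OECI searches.
  def self.excluded
    ['Hypothetical Excluded Type A', 'Hypothetical Excluded Type B']
  end
end

# True when a case type is not on the exclusion list.
def wanted_type?(case_type)
  !CaseTypes.excluded.include?(case_type)
end

puts wanted_type?('Tort - General') # prints "true"
```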

Parties

Finally, we need a model for parties.

The reason may not be obvious at first. If you run this scraper without a party filter, you will quickly notice something. Look at the following screenshot, which shows the first 14 of 104 cases filed in Multnomah County on January 20, 2020:

Of these 14 cases, the party names indicate that 11 are debt collection lawsuits filed by banks or collection agencies.

The Oregon court system is literally clogged with debt collection lawsuits. Since all of these cases get filed under the general type called "Contract," it is impossible to identify non-debt collection contract cases without a party filter. A substantial portion of important civil cases are based on contract. A party filter is therefore necessary to separate debt collection cases from the many other contract-related cases.

(22)
Copy the following code into app/parties.rb:
app/parties.rb

This creates a Parties class with a single excluded method. The list above is a partial list I compiled from a few searches over a sample period of a couple of months. Depending on the use case, we could repurpose this code to return only those cases matching certain party names 6.
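Here is a minimal sketch of a party filter. "Capital One Bank" appears in the real list (see footnote 6); the second name, the `excluded_party?` helper, the case-insensitive substring match, and the first sample caption are assumptions for illustration.

```ruby
class Parties
  def self.excluded
    ['Capital One Bank', 'Example Collection Agency'] # second entry is hypothetical
  end
end

# True when the caption mentions any excluded party (case-insensitive).
def excluded_party?(caption)
  downcased = caption.downcase
  Parties.excluded.any? { |name| downcased.include?(name.downcase) }
end

cases = [
  { oeci_caption: 'Capital One Bank vs Jane Doe' }, # hypothetical caption
  { oeci_caption: 'Craig Brandt vs Gresham Sanitary Service, Inc.' }
]

kept = cases.reject { |c| excluded_party?(c[:oeci_caption]) }
puts "Total: #{cases.size}; Scraped: #{kept.size}" # prints "Total: 2; Scraped: 1"
```

Swapping `reject` for `select` would invert the filter and return only the cases that match the listed party names.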

These three models are all we will create for now. It is possible to expand these models to accommodate new features (e.g., a DocumentTypes model for certain types of documents, a Judges model to associate cases with specific judges, etc.).

J. Create an initializer

Now we will create an initializer. Our initializer will compile everything we've done up to this point in one place. This way our code has access to everything.

(23)
Copy the following code into config/initializers/initializer.rb:
config/initializers/initializer.rb

This code is pretty straightforward. We're requiring the necessary gems (lines 1-5), instantiating our models (lines 7-15), and including our prompts (lines 17-19). The last couple of lines tell the console to hide warnings that pop up in the terminal. If you have trouble running your code, you can turn warnings back on to troubleshoot.

As you will see below, we call this initializer in the first line of our scraper. Separating this initialization in our code allows us to stay focused on our scraper, not configuration.

K. Write the scraper code

Now for the scraping code itself. Once we get the code in place, all we need to do is run ruby oeci_scraper.rb from the application's directory and the scraper will run and generate output that looks like this:

🥑 $ ruby oeci_scraper.rb

OECI SCRAPER

Counties:
Multnomah, Umatilla
(from: app/counties.rb)

Date [Format YYYY-MM-DD]: 2020-01-20

STARTING SEARCH...

***Multnomah***

20CV03304 - Michelle Greissinger vs Richard Bloom [...]
20CV03320 - Jose Luis Moreno Robles vs Troy Anthony Builta
20CV03344 - Norberto Cortez vs Norma Rene Berlin
20CV03447 - Salena Painter vs Troy Thomas
20CV03477 - Ingrid Cloman vs Nationwide Investigations [...]
20CV03480 - Richard King vs McKenzie Reynolds-Stein
20CV03484 - Shalontelle White vs The Kroger Co.
20CV03493 - Yuniesky Cruz vs Gareth Pooleon
20CV03504 - Craig Brandt vs Gresham Sanitary Service, Inc.
20LT00927 - Michaele A Jarvis vs Kevin Fine
20LT00929 - Ernest Harris vs Lakeshie Parker
===============
Total: 104; Scraped: 11

***Umatilla***

20CV03429 - Michael Palmore vs Rob's Speedy Delivery Inc.
===============
Total: 6; Scraped: 1

The easiest way to understand how this code works is to read it.

I've heavily commented this code to explain each step. Below, there are some additional notes and OECI screenshots to visualize what the code does.

(24)
Copy the following code into oeci_scraper.rb:
oeci_scraper.rb

Logging In

These screenshots show how we identified the OECI login inputs and submit button in lines 27-31:

Watir fills out and submits this form at lines 28-30.
Input field names
Button name

Counties (line 34)

This screenshot shows (a) where we got the counties in our Counties model at app/counties.rb, and (b) how we identified the OECI county select menu:

OECI's list of counties
Select field name

Navigating to case listings (lines 38-43)

The following screenshots show how the code navigates to the OECI case listings to be scraped:

Selecting the appropriate category of cases at line 38
Selecting search by date at line 41
Entering the dates at lines 42-43

The results

Once the browser gets to the search results, the remainder of our code uses Nokogiri to do the actual scraping. The following screenshots show how we identified the appropriate elements to scrape:

OECI's full results for our search
Element containing the first piece of data we want to scrape
Using browser tools to copy the element's location

L. Next steps

At this point, we have built a minimum viable OECI scraper. From here, you can add features for specific use cases.

Here are some features I have found helpful in my work:

Case Dockets

For each case in OECI, you can click through to a docket (i.e., calendar of filings and events). OECI dockets look like this:

Case docket in OECI

As you can see, dockets contain valuable information about case lifelines and events. You could create a feature that automatically looks for new filings against certain parties. Or one that identifies upcoming trials or dismissals.

Docket scraping is especially powerful when it runs automatically on a schedule (e.g., every night or once per week). You can easily do this with a Ruby tool like Rake.
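As a rough sketch, a Rake task could wrap the scraper so that a scheduler can trigger it. Rake only defines the task; something like cron still has to run it on a schedule. The task name `oeci:scrape` is hypothetical, and the sketch assumes the scraper has been adapted to run without interactive prompts.

```ruby
# In a Rakefile this require is unnecessary; it lets the snippet run standalone.
require 'rake'

namespace :oeci do
  desc 'Run the scraper once (meant to be triggered by a scheduler such as cron)'
  task :scrape do
    # Assumes oeci_scraper.rb can run unattended, e.g. with the date supplied
    # some other way than the interactive prompt.
    sh 'ruby oeci_scraper.rb'
  end
end

# Example crontab entry (the path is a placeholder), running nightly at 2 a.m.:
#   0 2 * * * cd /path/to/oeci_scraper && bundle exec rake oeci:scrape
```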

Case Documents

OECI also puts actual PDF and TIF documents within easy reach. OECI case documents look like this:

OECI case document index
PDF case document from OECI

The case documents contain a whole deeper layer of valuable information. These documents represent an unthinkable amount of billable legal work hours.

With some careful coding, you can download and reference these documents in data objects. Being able to programmatically access these documents opens up a whole world of data analysis possibilities.

Identifying public access hurdles

My final—and most important—suggestion requires ongoing attention: making sure the government provides sufficient access to public court files.

The Oregon Judicial Department has valid and reasonable concerns about making access too easy. However, is it appropriate for governments to deny citizens programmatic access to their own records? Is it appropriate for the government to charge users thousands of dollars each month to share public information? Is it appropriate for the government to limit searches to 400 entries at a time?

1 Note that OECI charges "Data Reseller" and "Data Reseller with Bulk Data" fees. See OJCIN Fee Schedule. According to OJCIN, "Data Reseller" means "a subscriber who accesses or uses the OJCIN system for purposes of obtaining OJD data to provide all or part of the OJD data: *** for inclusion in a database that is accessible by third parties; or *** to third parties who are not the End Users." The OJCIN fee schedule explains that bulk data usage requires a "separate application, agreement, and set up fee" (none of which are available online). There are potentially considerable costs to running an OECI scraper at scale and sharing the results. I question whether this unlawfully infringes on the public's right to access public records.
2 Another option: save the executable elsewhere and make sure the directory is included in your PATH. You can read more about that option here.
3 Here is one example where our app is "hacky." A better-structured solution would separate models, data, methods, and other concerns. We'll stick with the duct tape solution here because it's a little simpler and easier for casual readers to follow.
4 What OECI is doing here, without saying it, is limiting access to public records. In other words, while OECI provides access to individual records, the system is designed to prevent access to larger datasets based on those records. I believe this and other similar, intentional hurdles in OECI dramatically undermine Justice DeMuniz's stated goal of "increas[ing] the public’s access to courts."
5 Here are two more access bottlenecks in the OECI system. First, a total list of ~50 case categories is a crude and oversimplistic way to index and identify an entire universe of cases. Second, OECI does not allow users to conduct searches based on these categories, leaving the public (i.e., us) to hack the way to a solution.
6 This particular list of party names demonstrates another flaw in OECI. For example, the 6 separate entries for Capital One Bank show there is some inconsistent naming in the OECI system.
Thanks to Stew Fortier, Jesse Evers, Zachary Zager, and Compound Writing for reviewing a draft of this post.