How to Build an Automated Contract Provision Extraction System

Written by: Noah Waisberg


If you only need contract provision extraction software to review agreements you specifically trained it for (which are in the form of clean scans), rules- and comparison-based tech should work fine. This post tells you how to build a system to do this easy task.

We have been building the DiligenceEngine system since early 2011. But imagine our team was starting from scratch on the provision extraction side. Pretend we had no provision extraction models, no heap of experienced-lawyer-extracted contract provision examples for training a system, and not even any pre-built technology to create provision models with. Nothing to help our software find provisions in contracts. Got it? In under a week, starting with no technology pre-built for this and no provision examples, our team could build a system that would take a few contracts given in advance and accurately extract 50, 75, even 100 data points from them. Everything from term, to renewal, to start date, to assignability, to pricing, price increases, and payment terms, to termination. And more. In this post, I’ll give you basic information on how you can do the same. You shouldn’t need any special computer knowledge to make it through this post, though you would need developers to implement what we cover.

How can we make such an aggressive claim? What have we spent our last years doing if the contract provision extraction task is so simple? Why do we bother having our tech team led by a Ph.D. graduate of a top computer science program (beyond that he’s a nice guy and has excellent taste in Korean restaurants)? The short answer is that it can be very easy or very hard to build a system like ours. It is easy to pre-train a system to find information you already know is there, or to be moderately accurate at extracting contract provisions from random documents and poor quality scans. Most of our users need better. They require high accuracy and robust performance. And that is hard. Very hard.

But some people are okay with lower or uncertain accuracy, or they are cost-sensitive, or they just like doing things themselves. Sound like you? This post can help you save money and create your own contract metadata extraction system. Enough with the lead-in. Let’s get to it.

Provision Extraction

As we’ve discussed in previous Contract Review Software Buyer’s Guide posts, there are three basic ways to get software to identify contract provisions in documents:

Since manual rules and comparison methods are easier to work with, we would recommend using one of these approaches if you are trying to build an automated contract abstraction system quickly from scratch. There are serious long-term flaws with manual rules and comparison methods if you need high performance on unfamiliar agreements or poor quality scans, but these are presumably not huge issues if you are taking a DIY approach to contract provision extraction. This post will cover both manual rules and comparison methods for building a contract data extraction system.

Identifying What to Find

The first step in building your own automated agreement abstraction system, irrespective of your technological approach, is for a person to go through the contracts you would like the system to work on and identify what you would like the system to find for each provision. For example:

“Foreign Corrupt Practices Act Compliance” = “Foreign Corrupt Practices Act In performance of its activities, duties and obligations under this Agreement, Company shall comply with the U.S. Foreign Corrupt Practices Act (“FCPA”). Company represents and warrants that it is familiar with the obligations and restrictions of the FCPA, and that it and its employees and agents will comply with the provisions of the FCPA.”


“Change of Control” = “Change of Control. Licensor may terminate this Agreement immediately on notice upon any change in the ownership or control of Licensee. For such purposes, a “change in ownership or control” shall mean that, forty percent (40%) or more of the voting stock of Licensee becomes subject to the ownership or control of a person or entity or any related group of persons or entities acting in concert, which persons or entities did not own or control such portion of voting stock on the Effective Date hereof. Licensor shall have the same right to terminate upon any transfer of forty percent (40%) or more of the assets of Licensee.”


“Governing Law” = “Governing Law. This Agreement is governed by New York law, excluding New York’s choice of law rules.”

As described in the previous two posts in the Contract Review Software Buyer’s Guide, we think there is a real advantage to having an experienced lawyer identify what you would like the system to find. Garbage in, garbage out. Other vendors in our space take a different approach. Work with what you have for your DIY contract metadata extraction project. One approach that would work would be to put all found provisions into a spreadsheet or Word document. Here’s an example:

Or you could mark them with html-like tags, as I did in the governing law example above, and put them into an .xml document. This might or might not be how you approach storing your identified contract provisions over the long term, but it should do fine for your quick build. Once you have identified what the system is to find, you can get on to the technology side of implementing your build.
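For the developers on your team, here is a minimal sketch of one way the tagged examples could be kept in an .xml document, using Python's standard library. The element name `provision`, the `name` and `source` attributes, and the contract identifier are all made up for illustration; use whatever layout suits your project.

```python
import xml.etree.ElementTree as ET

# Hypothetical examples: (provision label, contract it came from, exact text).
EXAMPLES = [
    ("governing_law", "license_agreement_01",
     "Governing Law. This Agreement is governed by New York law, "
     "excluding New York's choice of law rules."),
    ("fcpa_compliance", "license_agreement_01",
     "Company shall comply with the U.S. Foreign Corrupt Practices Act."),
]

def build_example_store(examples):
    """Wrap each identified provision in a tagged XML element."""
    root = ET.Element("provisions")
    for label, source, text in examples:
        node = ET.SubElement(root, "provision", name=label, source=source)
        node.text = text
    return root

def load_examples(root):
    """Read the store back as {label: [example text, ...]}."""
    store = {}
    for node in root.iter("provision"):
        store.setdefault(node.get("name"), []).append(node.text)
    return store

root = build_example_store(EXAMPLES)
xml_text = ET.tostring(root, encoding="unicode")  # write this string to a file
print(xml_text)
```

Reading the store back with `load_examples(ET.fromstring(xml_text))` gives you a simple label-to-examples lookup that either of the approaches below can start from.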

DIY Manual Rules Contract Metadata Identification

Your first choice for converting your contract provision examples into provision extraction models is using manual rules. This post has a lot more detail on manual rules in contract metadata extraction, but the quick summary is that these are akin to the Boolean search rules you may have used to search research databases like Lexis or Westlaw. Most programmers know how to write “regular expressions,” which are essentially more complex versions of Boolean search strings. Here is an instruction guide for writing regular expressions, if helpful. Here are example regular expressions for the provisions above.

FCPA Compliance: “(?i)foreign\s+corrupt\s+practices\s+act|fcpa”

Change of Control: “(?i)change\s+(of|in)\s+control|change\s+(in|of)(\s+the)?\s+ownership”

Governing Law: “(?i)governing\s+law|governed\s+by\s+.*\s+law”

You may find it easiest to implement your manual rules via a freely available tool suite like GATE, which will also give you other functionality like sentence identification (and even includes basic machine learning capabilities, for what it’s worth). Then feed documents for review in. You may have to tweak your regular expressions to avoid false positives but, with revisions, your system should now extract data that lines up with the rules you have created.

One caution: your system should work fine on the exact provisions you have written it for. It may extract correct text from unfamiliar agreements or poor quality scans, or it may not. Either way, you won’t be easily able to tell how accurate it is on these. If you really need your system to work on unfamiliar agreements or poor quality scans, DIY (and a rules-based approach in general) is not the way to go. That said, your rules-based system should work fine for demoing (as long as you demo on the documents you built rules for) or for reviewing template-based agreements you built for (assuming executed agreements being reviewed weren’t seriously modified in negotiations and are in the form of good quality scans).
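If you would rather not set up a tool suite, here is a rough sketch of the same idea in plain Python, using the three regular expressions from this post. It assumes provisions are separated by blank lines, which is a simplification (real contracts will need better segmentation), and the function names are our own invention for illustration.

```python
import re

# The three example rules from this post, compiled once.
RULES = {
    "FCPA Compliance": re.compile(
        r"(?i)foreign\s+corrupt\s+practices\s+act|fcpa"),
    "Change of Control": re.compile(
        r"(?i)change\s+(of|in)\s+control|change\s+(in|of)(\s+the)?\s+ownership"),
    "Governing Law": re.compile(
        r"(?i)governing\s+law|governed\s+by\s+.*\s+law"),
}

def split_paragraphs(text):
    """Assumption: each provision is a paragraph separated by a blank line."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def extract_provisions(text):
    """Return {rule label: [paragraphs in which the rule fired]}."""
    hits = {}
    for para in split_paragraphs(text):
        for label, pattern in RULES.items():
            if pattern.search(para):
                hits.setdefault(label, []).append(para)
    return hits

SAMPLE = """Governing Law. This Agreement is governed by New York law, excluding New York's choice of law rules.

Notices. All notices shall be in writing.

Change of Control. Licensor may terminate this Agreement immediately on notice upon any change in the ownership or control of Licensee."""

for label, paragraphs in extract_provisions(SAMPLE).items():
    print(label, "->", paragraphs[0][:60])
```

Note that a rule fires on any paragraph that merely mentions its keywords, which is exactly where the false positives mentioned above come from.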

DIY Comparison Contract Provision Identification

Your other choice for DIY contract metadata extraction is to use a comparison-type approach. Have you ever used document comparison software (aka, “blacklining” or “redlining” systems; Word comes with a built in version)? This software essentially relies on a neat, simple, and widely available computer algorithm often referred to as diff. It was developed in the 1970s. The diff algorithm tells you how different things are from each other, and where they are different. This functionality can drive contract provision extraction software, as further described in this Contract Review Software Buyer’s Guide post and covered in the rest of this section.

Building a comparison-type automated contract provision detection system should be easy. First, get a freely available implementation of the diff algorithm. Although Mac and Linux systems come with a diff program built in, and it is also easily accessible for Windows operating systems, the diff utility itself operates on a line-by-line basis, while document comparison software works on a character-by-character (or word-by-word) basis. Character-by-character comparison is best suited to contract review work. As such, this code library, which is optimized for comparing text, is appropriate for use in your DIY comparison-based contract metadata identification system.
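To give a feel for character-by-character comparison, here is a small sketch using `difflib` from Python's standard library. This is a stand-in chosen so the example runs without installing anything, not the code library referred to above, and the `char_diff` helper is hypothetical.

```python
from difflib import SequenceMatcher

def char_diff(old, new):
    """Return (operation, old_text, new_text) for each span that differs.

    Passing strings to SequenceMatcher compares them character by character.
    autojunk=False turns off a speed heuristic that can skew long texts.
    """
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [(tag, old[i1:i2], new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

example = "This Agreement is governed by New York law."
candidate = "This Agreement is governed by Delaware law."

# Each entry says what was replaced, deleted, or inserted, and where.
for operation, old_text, new_text in char_diff(example, candidate):
    print(operation, repr(old_text), "->", repr(new_text))
```

A line-by-line diff would only report that the whole line changed; the character-level view shows that only the state name differs.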

Once you have the right version of diff set up, put your pre-found contract provision examples into a database. Then set up diff to compare all new text reviewed to your provision examples. If newly-reviewed text matches one of the provision examples, you will have a provision hit. To the extent you would like your system to work on provisions that are not exactly like the ones in your example database, set the diff comparison threshold to return all text that is, say, 90%, 75%, or 60% similar to your examples. You can determine the similarity simply by having your system count matches and differences. One caution: setting a lower comparison threshold may make you less likely to have misses, but it will also increase your number of false positives. Given how differently contract provisions can be drafted (and how poor quality scans can make text that is supposed to look the same read differently) we recommend against a diff-based system unless you are reviewing contracts that are highly similar. Or use it if you’re okay with middling accuracy on unfamiliar text and poor quality scans.
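Here is a minimal sketch of the thresholding step, again using Python's built-in `difflib` as a stand-in for whichever diff implementation you choose. It assumes your example database has been loaded into a dictionary of label to example texts and that new documents have already been split into paragraphs; the function names are ours. The similarity score is `difflib`'s ratio, which is twice the number of matching characters divided by the combined length of the two texts.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Share of matching characters, from 0.0 (nothing) to 1.0 (identical)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def find_hits(paragraphs, examples, threshold=0.75):
    """Return (label, score, paragraph) wherever a paragraph is at least
    `threshold` similar to any stored example for that label."""
    hits = []
    for para in paragraphs:
        for label, example_texts in examples.items():
            best = max(similarity(para, ex) for ex in example_texts)
            if best >= threshold:
                hits.append((label, round(best, 2), para))
    return hits

# Hypothetical example database and new text to review.
examples = {
    "Governing Law": [
        "Governing Law. This Agreement is governed by New York law, "
        "excluding New York's choice of law rules."
    ],
}
paragraphs = [
    "Governing Law. This Agreement is governed by Delaware law, "
    "excluding Delaware's choice of law rules.",
    "Notices. All notices shall be in writing and delivered by courier.",
]

for threshold in (0.9, 0.75, 0.6):
    labels = [label for label, _, _ in find_hits(paragraphs, examples, threshold)]
    print(threshold, labels)
```

Running the loop at several thresholds on contracts you have already reviewed by hand is a cheap way to see the trade-off between misses and false positives before settling on a number.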

Next Step: Adding Non-Standard Clause Detection to your Contract Metadata Extraction System

If you’ve implemented the steps to here, you now should have a system that consistently finds and extracts contract provisions you know the wording of in advance, and maybe sometimes works on unfamiliar agreements and poor quality scans. In the next Contract Review Software Buyer’s Guide post, we’ll cover how you can layer non-standard clause detection on top of your newly built contract review software.
