Comparison- and Header Detection-Based Automated Contract Provision Extraction


The two previous installments of the Contract Review Software Buyer’s Guide covered how manual rule-based contract provision extraction systems, while relatively easy to set up and add provisions to, underperform on agreements and provisions that are not identical or very similar to the ones they were built to review ("unfamiliar documents"). Most agreements reviewed in due diligence or contract management projects are unfamiliar documents. The series now moves on to other methods of automated provision detection. This post covers comparison- and header detection-based approaches to contract provision extraction. The next post will cover machine learning-based systems. Like manual rule-based systems, comparison- and header detection-based methods are technically easy to set up but underperform on unfamiliar documents. We suspect they are little used on their own; our guess is that only one company in the broader contract review software market has built its system on them.

Comparison-Based Automated Contract Review

To build a comparison-based contract provision extraction system, start with a database of provision examples. Then set a similarity threshold and compare all new text against the database. So, for example, any new provision that is at least 70% similar to a provision in the “amendment” database is extracted as an amendment provision. The software powering this comparison is based on the “diff” utility, the same tool that underlies blackline programs. The diff utility is common enough that your computer likely has it built in and accessible if you know how to find it. Pretty easy, right?
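To make the idea concrete, here is a minimal sketch of threshold-based comparison. It is our own illustration, not any vendor's implementation: it uses Python's standard-library `difflib` (a relative of the diff utility, not the utility itself), and the `PROVISION_DB` contents, the `classify` function, and the 0.70 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical provision database: label -> known example texts.
PROVISION_DB = {
    "amendment": [
        "This Agreement may be amended only by a written instrument "
        "signed by both parties.",
    ],
    "assignment": [
        "Neither party may assign this Agreement without the prior "
        "written consent of the other party.",
    ],
}


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(text: str, threshold: float = 0.70):
    """Return (label, score) for the closest database example,
    or (None, score) if nothing reaches the threshold."""
    best_label, best_score = None, 0.0
    for label, examples in PROVISION_DB.items():
        for example in examples:
            score = similarity(text, example)
            if score > best_score:
                best_label, best_score = label, score
    if best_score >= threshold:
        return best_label, best_score
    return None, best_score
```

Calling `classify` on a clause that closely tracks a database example returns that example's label; a clause drafted differently from everything in the database falls below the threshold and is missed.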

Comparison-based systems run into trouble when provisions in new agreements differ from the ones in the provision database. This can occur because new agreements are drafted differently, which happens often, especially in commercial agreements like supply and distribution contracts (some of the most common agreements in due diligence and contract management database population projects). Or it can occur because poor-quality scans lead to inexact agreement transcriptions. Comparison-based systems can cope with dissimilar agreements or difficult-to-OCR text by relaxing their comparison threshold, but this increases the odds of false positives. Comparison-based provision detection could work with a provision database covering every way the provision is drafted (assuming no poor-quality scans are reviewed). But it takes a lot of effort to build a good provision database, and it would be hard to be sure the database was actually comprehensive.
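The threshold trade-off can be sketched in a few lines. Again this is only an illustration under our own assumptions: the `OCR_CONFUSIONS` substitutions stand in for a bad scan, and `simulate_bad_scan` and `matches_by_threshold` are hypothetical helper names, not part of any real product.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical character confusions of the kind a poor scan might produce.
OCR_CONFUSIONS = [("rn", "m"), ("l", "1"), ("o", "0")]


def simulate_bad_scan(text: str) -> str:
    """Apply the toy confusions above to imitate a noisy transcription."""
    for clean, noisy in OCR_CONFUSIONS:
        text = text.replace(clean, noisy)
    return text


def matches_by_threshold(candidates, example, thresholds):
    """For each threshold, list the candidates that would be extracted."""
    scored = [(c, similarity(c, example)) for c in candidates]
    return {t: [c for c, s in scored if s >= t] for t in thresholds}


example = ("This Agreement may be amended only by a written instrument "
           "signed by both parties.")
scanned = simulate_bad_scan(example)   # same clause, noisy transcription
unrelated = "Governing law: New York."  # not an amendment clause

results = matches_by_threshold(
    [example, scanned, unrelated], example, thresholds=[0.9, 0.5, 0.0]
)
```

Each lower threshold extracts everything the higher one did plus whatever else now clears the bar. That is how relaxing the threshold recovers noisy or differently drafted clauses, and also how unrelated text starts to slip in.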

It is unclear why a vendor would build a comparison-type system over alternatives. Here is more detail on why:

Comparison-type vs keyword search system

Easy to build the technology underlying both.
Much easier to come up with manual rules describing provisions than to build a good provision database.
Neither will be especially accurate on unfamiliar documents.

Comparison-type vs machine learning-based system

Significantly easier to build the technology for a comparison-based system than a good machine learning-based system (more on how hard this is in the next installment).
Both require large provision databases, and these take a lot of work to properly put together.
Machine learning-based systems should be more accurate on unfamiliar documents and poor quality scans.

Comparison-based systems also share one additional problem with keyword search systems: it is hard for their vendors to know how accurate they are. Essentially, it is hard to test these types of systems on unseen data, and you can’t properly determine accuracy without testing on unseen data. The earlier post on manual rule-based contract provision extraction systems has more details.

Header Detection

Header detection identifies provisions based on their headers. A provision titled “Assignment” would be an assignment provision, for example. Header detection is more a tool that can help identify contract provisions than the sole basis for a contract provision extraction system. This is because provision headers are a mediocre provision identifier: somewhat over-inclusive (e.g., a provision titled “Assignment” that contains nothing but language about binding successors and assigns, which few reviewers care about) and heavily under-inclusive (they will not catch an assignment event of default that sits under a different header). One contract review software system we’ve seen appears to rely heavily on header detection but, understandably, it does not appear that many vendors have chosen this approach.
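A rough sketch of header detection shows both failure modes. The header pattern and the `HEADER_LABELS` keyword map below are our own assumptions about what a header looks like (an optional section number, a short capitalized title, then a period or colon); real agreements vary far more than this.

```python
import re

# Hypothetical mapping from header keywords to provision labels.
HEADER_LABELS = {
    "assignment": "assignment",
    "amendment": "amendment",
    "governing law": "governing_law",
}

# Assumed header shape: optional "Section"/number prefix, a short title
# starting with a capital letter, then "." or ":" followed by whitespace.
HEADER_RE = re.compile(
    r"^\s*(?:(?:Section\s+)?\d+(?:\.\d+)*\.?\s+)?"
    r"([A-Z][A-Za-z ,;&-]{2,60}?)[.:]\s"
)


def detect_by_header(paragraph: str):
    """Return a provision label based only on the paragraph's header,
    or None if there is no recognizable header or keyword."""
    match = HEADER_RE.match(paragraph)
    if not match:
        return None
    title = match.group(1).strip().lower()
    for keyword, label in HEADER_LABELS.items():
        if keyword in title:
            return label
    return None


# Over-inclusive: labeled "assignment" although the body is only
# successors-and-assigns language.
over = detect_by_header(
    "Assignment. This Agreement shall be binding on the parties' "
    "successors and assigns."
)

# Under-inclusive: an assignment restriction under another header is missed.
under = detect_by_header(
    "Events of Default. It shall be an event of default if the Borrower "
    "assigns this Agreement without consent."
)
```

Here `over` comes back as `"assignment"` and `under` comes back as `None`, which is the over- and under-inclusiveness described above in miniature.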

Neither simple comparison against a provision database nor header detection is a great way to find contract provisions. Comparison-based systems “work” in the same way that manual rule-based contract provision extraction systems “work”: they are fine at finding provisions in agreements that are identical or very similar to ones they have already seen (assuming good-quality scans), and underperform otherwise. Header detection may be a helpful technique for building contract provision detection models, but it is not enough to base a system on.

