How to Add Non-Standard Clause Detection to Your Contract Metadata Extraction System

Written by: Noah Waisberg


Spotting outlying contract provisions isn’t as much fun as finding the 16 differences between these two images. Fortunately, simple software can help on the contract side.

tl;dr: It’s easy for software to identify non-standard clauses across a pool of agreements, with moderate accuracy. This post describes how you can build a system to do so.

In the previous Contract Review Software Buyer’s Guide post, we described how you could use readily available technology to build a system that consistently finds and extracts contract provisions you know the wording of in advance, and also inconsistently finds information in unfamiliar agreements and poor quality scans. We explained how, in about a week and with reasonably skilled programmers, you could build software that would capture 50–100 data points from contracts you pre-trained it to work on. We wouldn’t necessarily suggest you go this route, but you should now understand that basic contract provision extraction software can be easy to build. The hard part is getting it to work well on unfamiliar documents and poor quality scans.

This post will cover a useful extension to contract review software: automatically identifying non-standard information in agreements. Some organizations need to know how executed agreements differ from their template, and where they have agreed to non-standard clauses. Good news: this is an easy feature to add to automated agreement abstraction software.

In the previous post, we suggested you use a rules or comparison-based approach to build your DIY contract metadata extraction system, since these approaches are well suited to a fast ~60% solution. (That said, some vendors use these techniques to power provision extraction for their full automated contract abstraction systems.) If accuracy matters, rules and comparison approaches only really make sense for reviewing clean scans of familiar agreements. So, if you are using a contract metadata extraction system that relies on either rules or comparison approaches, you presumably either (i) plan to use it to review agreements you have the form of in advance or (ii) are less concerned about misses. If you plan to primarily review form agreements, an obvious extension is trying to discern non-standard information in them. Good news, if so: the “diff” technology we described in the build-your-own contract metadata extraction post will allow you to spot non-standard clauses, and it is freely available. Non-standard clause detection via diff is so easy to do that this post will tell you how to build it yourself. As with our previous DIY automated agreement abstraction post, you won’t need programming skills to make it through this post, but you will need technical skills to implement what it describes.

Let’s start with some background on diff. Then we will cover how to use diff to identify non-standard agreements and contract provisions.


Diff is a comparison utility; it shows differences. Diff can automatically figure out how the following sentences differ:

  • This Agreement may not be assigned.
  • This Agreement may not be assigned, except by operation of law.

You probably also easily spotted the difference between these sentences. Software is helpful when you need to review more volume, and is pretty firmly established as the way to compare voluminous text.
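To make this concrete, here is a minimal sketch of a character-level comparison of the two sentences above. It uses Python's built-in difflib module as a stand-in for whatever diff implementation you pick; the function name char_diff is ours, chosen for illustration only.

```python
import difflib

def char_diff(old: str, new: str) -> list[tuple[str, str, str]]:
    """Return (operation, old_text, new_text) for every span that differs."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only insertions, deletions, and replacements
            changes.append((tag, old[i1:i2], new[j1:j2]))
    return changes

old = "This Agreement may not be assigned."
new = "This Agreement may not be assigned, except by operation of law."

for operation, old_text, new_text in char_diff(old, new):
    print(operation, repr(old_text), "->", repr(new_text))
```

On these two sentences the only reported change is the inserted carve-out, which is the same thing a redline would show.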

The previous Contract Review Software Buyer’s Guide post included more background on diff:

Have you ever used document comparison software (aka “blacklining” or “redlining” systems; Word comes with a built-in version)? This software essentially relies on a neat, simple, and widely available computer algorithm often referred to as diff. It was developed in the 1970s. The diff algorithm tells you how different things are from each other, and where they are different. This functionality can drive contract provision extraction software, as further described in this Contract Review Software Buyer’s Guide post and covered in the rest of this section.

Building a comparison-type automated contract provision detection system should be easy. First, get a freely available implementation of the diff algorithm. Mac and Linux systems come with a diff program built in, and one is easily accessible for Windows operating systems. However, the diff utility itself operates on a line-by-line basis, while document comparison software works on a character-by-character (or word-by-word) basis. Character-by-character comparison is best suited to contract review work. As such, this code library, which is optimized for comparing text, is appropriate for use in your DIY comparison-based contract metadata identification system.
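The effect of comparison granularity can be sketched in a few lines. This is an illustration only, again using Python's difflib rather than the library mentioned above; the similarity helper is a hypothetical name. It scores the same pair of sentences when they are split into lines, into words, and into characters.

```python
import difflib
from typing import Sequence

def similarity(a_units: Sequence[str], b_units: Sequence[str]) -> float:
    """Similarity between 0.0 and 1.0 for two sequences of comparable units."""
    return difflib.SequenceMatcher(None, a_units, b_units, autojunk=False).ratio()

old = "This Agreement may not be assigned."
new = "This Agreement may not be assigned, except by operation of law."

line_score = similarity(old.splitlines(), new.splitlines())  # each clause is one line
word_score = similarity(old.split(), new.split())            # whitespace-separated words
char_score = similarity(old, new)                            # individual characters

print(f"line: {line_score:.2f}  word: {word_score:.2f}  char: {char_score:.2f}")
```

Treated as whole lines, the two clauses share nothing, so a line-based tool can only say "this line changed." The word and character views recognize that most of the clause is the same, which is what you want when a provision has been lightly edited.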

Diff for Identifying Non-Standard Items

Diff can help identify two categories of non-standard information in contracts:

  1. Agreements that differ from a form. Diff is accurate at this task.
  2. Provisions that differ from standard wording, aka non-standard clauses. Diff will only be accurate at this if the system is already able to accurately identify contract provisions.

Identifying Non-Standard Contracts

Perhaps you would like to figure out which of a pool of agreements are not written off a standard form. Diff can help.

Stock a database with one or more templates you identify as “standard.” Then, have your diff-based system return all documents that do not match your templates. These are your “non-standard” contracts. Additionally, you could have the system return all contracts that match your templates—your “standard” contracts.
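A minimal sketch of that template check might look like the following. It assumes Python's difflib for the comparison, an in-memory dictionary standing in for your document store, and a 0.9 cutoff for what counts as "matching" a template. The cutoff and the function names are our assumptions for illustration, not settings from any particular product.

```python
import difflib

def best_template_similarity(document: str, templates: list[str]) -> float:
    """Highest character-level similarity between a document and any template."""
    return max(
        difflib.SequenceMatcher(None, template, document, autojunk=False).ratio()
        for template in templates
    )

def split_by_template(
    documents: dict[str, str], templates: list[str], threshold: float = 0.9
) -> tuple[list[str], list[str]]:
    """Return (standard, non_standard) document names; threshold is an assumed cutoff."""
    standard, non_standard = [], []
    for name, text in documents.items():
        if best_template_similarity(text, templates) >= threshold:
            standard.append(name)
        else:
            non_standard.append(name)
    return standard, non_standard

templates = ["This Agreement may not be assigned. This Agreement is governed by the laws of New York."]
documents = {
    "a.txt": "This Agreement may not be assigned. This Agreement is governed by the laws of New York.",
    "b.txt": "Either party may terminate on thirty days notice. Disputes go to arbitration in London.",
    "c.txt": "This Agreement may not be assigned, except by operation of law. This Agreement is governed by the laws of New York.",
}

standard, non_standard = split_by_template(documents, templates)
print("standard:", standard)
print("non-standard:", non_standard)
```

Character-level comparison of whole documents gets slow as documents grow, so a real system would likely compare section by section or use a library tuned for long texts.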

There are two issues with this approach:

  • Most executed agreements, even if they track the template substantively, differ from the template in provisions like parties, date, financial terms, and signature pages. That does not need to kill your ability to identify non-standard documents. The easy solution is to set your comparison utility to ignore certain sections (like the preamble or signature pages). The more complex-to-implement possibility is to instruct your system to ignore specific data points, like party names or prices. You can also use technology to determine which data points to ignore.
  • Poor quality contract scans will likely be identified as non-template agreements. This is because the scanned text will look unfamiliar, triggering “non-template” treatment. The good news is that these should show up as false positives as opposed to false negatives. That is, the system will say “this is an unfamiliar/non-template agreement” rather than “this is an example of a template agreement,” so you will not improperly miss any non-template documents (though you may have to wade through more documents than planned).
    • With some work, you can have your system automatically identify poor quality scans. For example, our contract review software automatically assigns a scan quality rating. Even with a scan quality rating, you will still have false positives on poor quality scans.
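One way to approach the "ignore specific data points" idea from the first issue above is to mask those data points before comparing, so that an executed agreement and its template look identical wherever only the deal-specific details differ. The sketch below is a rough illustration under stated assumptions: party names are supplied by the caller, dates look like "March 3, 2014", and prices are dollar amounts. The patterns and the mask_variable_fields name are hypothetical and would need extending for real contracts.

```python
import re

# Assumed formats: "Month D, YYYY" dates and "$1,234.56"-style amounts.
DATE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)
MONEY = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")

def mask_variable_fields(text: str, party_names: list[str]) -> str:
    """Replace deal-specific details with placeholders before running diff."""
    # Mask longer names first so a short name cannot clip part of a longer one.
    for name in sorted(party_names, key=len, reverse=True):
        text = text.replace(name, "[PARTY]")
    text = DATE.sub("[DATE]", text)
    text = MONEY.sub("[AMOUNT]", text)
    return text

template = "This Agreement is made on [DATE] between [PARTY] and [PARTY] for a fee of [AMOUNT]."
executed = (
    "This Agreement is made on March 3, 2014 between Acme Corp and Widget LLC "
    "for a fee of $12,500.00."
)

masked = mask_variable_fields(executed, ["Acme Corp", "Widget LLC"])
print(masked == template)
```

After masking, the comparison only sees differences in the wording itself, which is what you care about when hunting for non-template agreements.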

Despite these issues, diff is a good way to identify non-standard documents.

Identifying Non-Standard Contract Provisions

What if you would like to see all provisions drafted in a non-standard way across a corpus of contracts? Comparison-based methods can (sort-of) help here too, and you can set up a moderately accurate system to do this by yourself.

The first step is stocking a database with standard provisions. Next, set up a diff-based system to compare all agreements for review with the standard clauses (the “DIY COMPARISON CONTRACT PROVISION IDENTIFICATION” section of the previous post has more details). Then, set the diff comparison threshold to return all text that is at least some chosen percentage (for example, 90%, 75%, or 60%) similar to your examples. This will allow your system to also pick up provisions that are drafted slightly differently from those in your standard provisions database. So, for example, with a 70% threshold, any new provision that is >70% similar to a provision in the “most favored customer” database will be extracted as a most favored customer provision. Up to this point, this is the same as building an automated contract abstraction system using a comparison approach. Here’s where things change: identical hits get classed as “standard” provisions, and all other hits, meaning those that are >70% similar but not identical (e.g., a most favored customer hit that is 85% similar to a most favored customer provision in the standard clause database), get returned as “non-standard” provisions.* You could even stock an additional database with non-standard most favored customer provision examples, and have the system also identify these hits as additional non-standard most favored customer provision results. This latter step should improve your detection accuracy. You’re going to need it.
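The classification rule described above can be sketched as follows. This is an illustration under assumptions rather than a finished system: it uses Python's difflib for the similarity score, plain lists in place of the clause databases, a 70% threshold, and clause text that has already been pulled out of the agreement. The function names are hypothetical.

```python
import difflib
from typing import Iterable

def best_similarity(text: str, examples: Iterable[str]) -> float:
    """Highest character-level similarity between text and any example (0.0 if none)."""
    return max(
        (difflib.SequenceMatcher(None, ex, text, autojunk=False).ratio() for ex in examples),
        default=0.0,
    )

def classify_provision(
    candidate: str,
    standard_examples: list[str],
    non_standard_examples: Iterable[str] = (),
    threshold: float = 0.7,  # assumed cutoff; tune for your own documents
) -> str:
    if candidate in standard_examples:
        return "standard"       # identical to a standard clause
    if best_similarity(candidate, standard_examples) > threshold:
        return "non-standard"   # close to a standard clause, but not identical
    if best_similarity(candidate, non_standard_examples) > threshold:
        return "non-standard"   # close to a known non-standard example
    return "no match"           # not recognized as this provision type at all

mfc_standard = [
    "Supplier shall give Customer pricing no less favorable than that given to any other customer."
]
edited = (
    "Supplier shall give Customer pricing no less favorable than that given "
    "to any other customer of similar size."
)
unrelated = "This Agreement is governed by the laws of Ontario."

print(classify_provision(mfc_standard[0], mfc_standard))
print(classify_provision(edited, mfc_standard))
print(classify_provision(unrelated, mfc_standard))
```

Note the last case: a clause that falls below the threshold is not flagged as non-standard. It is simply not found, which matters for the question in the next section.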

Is Identifying Standard Non-Standard Clauses Good Enough?

Identifying non-standard contract clauses can be useful. The problem is that unless your software is already highly accurate at finding unfamiliar information, you will only be spotting not-especially-non-standard-non-standard clauses. And that is less helpful, and perhaps even gives a false sense of security. The next post will cover non-standard clause detection in more detail.

* August 16, 2014 update: It has come to our attention that another contract review software vendor recently received a patent for a method of doing non-standard clause identification. Not joking. While we question how their “invention” is novel and non-obvious (it’s not more complex than my explanation above of how to do a DIY implementation of non-standard clause detection and, as a law school grad and lawyer, I’m probably a fair bit below the level of someone “ordinarily skilled in the art”), you should review their patent before trying to implement non-standard clause detection. To our eyes, there are important differences between their claimed approach and what I described here, differences between their patent claims and better ways to implement this feature, and prior art on using technology to identify non-standard contract provisions broadly. That said, I am not giving legal advice on the scope or applicability of their patent, and you should not rely on my views about it. Sorry for the lawyerly disclaimer. It is disappointing that what is basically a weekend hackathon level project has turned into something you should consult a lawyer about before implementing.
