Artificial Intelligence In eDiscovery
Moving Beyond TAR and CAL
Introduction
The ever-increasing volume of data that lawyers must handle in litigation requires litigators to evolve beyond the current eDiscovery process. AI technology reduces the cost of litigation and improves the quality of the legal work.
This paper addresses some of the key shortcomings of legacy technology used in eDiscovery. We examine recent developments in AI technology to understand how AI can reduce the burden of eDiscovery.
Challenges with the Traditional Process
The traditional eDiscovery process of keyword filtering and document review is broken. The standard workflow is inefficient as it fails to identify many relevant documents and requires costly review of irrelevant data. It focuses solely on the task of document production, leaving analysis and case preparation for later.
Most data preserved and collected in eDiscovery is irrelevant to the claims and defenses in the case. Companies and their employees cannot segregate data based on relevance to a case at the time of collection.
Keyword search is used by lawyers to narrow down the collected documents for legal review, typically yielding a responsive rate in the twenty percent range. The “responsive rate” represents the percentage of documents promoted for legal review that are relevant to the discovery request. In other words, approximately eighty percent of documents returned by keyword filters are ultimately deemed irrelevant. Keyword search's low precision results from its inability to capture the nuances of language.
Search is a balance between recall (more relevant documents) and precision (fewer irrelevant documents). Negotiation of keywords in eDiscovery is a tug-of-war as the requesting party pushes for broad terms to improve recall and the producing party advocates for more precise terms to reduce the review burden.
Lack of precision results in burdensome review of irrelevant documents. More troubling, lack of recall misses relevant documents. While there has been much debate about exactly how ineffective keyword search is in achieving recall, it is beyond debate that keyword search misses a significant number of relevant documents.
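The recall/precision tradeoff can be made concrete with a short calculation. This is a minimal sketch; the document counts below are illustrative, not drawn from any particular matter:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents actually retrieved
    precision = len(hits) / len(retrieved)  # fraction of retrieved that are relevant
    recall = len(hits) / len(relevant)      # fraction of relevant that were retrieved
    return precision, recall

# Illustrative numbers: a keyword filter returns 1,000 documents,
# 200 of which are relevant, out of 400 relevant documents in the collection.
p, r = precision_recall(range(1000), range(800, 1200))
print(p, r)  # 0.2 precision (the "twenty percent" responsive rate), 0.5 recall
```

Broadening the terms raises recall but drags in more irrelevant documents; narrowing them does the reverse, which is exactly the tug-of-war described above.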
It is common for attorneys to narrow search terms and introduce complex queries to reduce the number of documents to be reviewed. Sometimes these decisions are driven more by the number of documents returned by the keyword filter than by the goal of finding relevant documents.
Keyword search, with its acknowledged flaws, has become an accepted component of eDiscovery protocols and serves as a mechanism for the parties to cooperate in eDiscovery. The main problem with keyword filtering is that evidence missed by the filter never sees the light of day, as no further effort is expended to locate relevant documents that do not match the query.
Technology Assisted Review (TAR) has emerged to address the low precision of keyword search. TAR uses traditional machine learning techniques to order the documents for review by the probability of relevance. At some point, the producing party declares that a sufficient recall has been achieved and the remaining documents will neither be reviewed nor produced.
Continuous Active Learning (CAL), a term coined by Maura Grossman and Gordon Cormack, [1] is a workflow that evolved from traditional TAR where the user searches for positive training samples to train a model and then reviews the highest scoring documents. The decisions are fed back into the model to improve it until no more relevant documents are found. Some adaptations of CAL also mix additional documents selected by the machine for inclusion in training.
CAL relies on the manual review of documents to create labeled examples to train the model. CAL envisions that manual review will be conducted on all documents that are produced. The exhaustive review of the production data is used to determine the stopping condition for the review. As CAL progresses, the user must review an increasing number of irrelevant documents. Indeed, most of the documents reviewed towards the end of the CAL process are irrelevant to the claims and defenses.
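The CAL loop described above can be sketched in a few lines of Python. This is a toy illustration, not any vendor's implementation: the simple word-weight scorer stands in for the logistic regression or SVM model a real system would use, and `oracle` stands in for the human reviewer.

```python
def cal_review(documents, oracle, batch_size=5, patience=2):
    """Simplified Continuous Active Learning loop (illustrative sketch).

    `documents` maps doc ids to token lists; `oracle` plays the human
    reviewer, returning True for relevant documents.
    """
    weights = {}                 # learned word weights (the toy "model")
    labeled, produced = set(), []
    empty_batches = 0
    while empty_batches < patience and len(labeled) < len(documents):
        # score every unreviewed document with the current model
        unreviewed = [d for d in documents if d not in labeled]
        unreviewed.sort(key=lambda d: -sum(weights.get(w, 0) for w in documents[d]))
        found = 0
        for doc in unreviewed[:batch_size]:   # human reviews the top-scoring batch
            labeled.add(doc)
            label = oracle(doc)
            for w in documents[doc]:          # feed the decision back into the model
                weights[w] = weights.get(w, 0) + (1 if label else -1)
            if label:
                produced.append(doc)
                found += 1
        # stop after several consecutive batches with no relevant documents
        empty_batches = empty_batches + 1 if found == 0 else 0
    return produced

# Toy corpus: even-numbered documents are relevant ("contract" language)
documents = {i: (["contract", "breach"] if i % 2 == 0 else ["lunch", "menu"])
             for i in range(40)}
produced = cal_review(documents, oracle=lambda d: d % 2 == 0)
print(sorted(produced))  # the relevant (even-numbered) documents
```

Note how the stopping condition works: the reviewer must grind through batch after batch of irrelevant documents before the loop concludes that nothing more will be found, which is the burden described above.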
CAL utilizes a single model per case focused on general relevance. CAL switches the focus away from case analysis and more towards meeting a production deadline. Work on the merits begins after CAL is completed.
Understanding Machine Learning Technology
Machine learning is one of the fastest-growing specialties in today’s technology industry. Given the incredible growth of data generated by society, technology that can intelligently identify patterns, discern meaning and provide organizational structure to data has enormous market potential.
To understand the application of AI to the legal industry, it is helpful to distinguish between unsupervised machine learning and supervised machine learning.
Unsupervised machine learning involves the automated identification of patterns and features of data without user input. Conceptual clustering and topic modeling are the most common examples of unsupervised learning. Clustering algorithms group data into clusters that share common characteristics, but this does not necessarily correlate to legal relevance.
Supervised machine learning involves algorithms that are "supervised" by the user through inputting labeled examples, also known as training data. The computer uses the labeled training data to learn how to map inputs to outputs. Once trained, the system can then make decisions on unseen documents.
In the past, some eDiscovery review platforms used outdated latent semantic indexing (LSI) technology to falsely portray their document similarity approach as a “supervised” solution, leading many lawyers to doubt the accuracy of machine learning-based solutions. To combat the market noise, leading experts in the industry suggested that a lawyer should look for eDiscovery systems that use either logistic regression or support vector machine (SVM) – supervised machine learning algorithms. Today, most lawyers employing TAR are using systems that rely on logistic regression or SVM algorithms.
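For readers curious what "logistic regression" means in this context, here is a toy from-scratch version trained on bag-of-words counts. Real TAR systems use optimized library implementations over far richer feature sets; the three-word vocabulary and labels below are invented purely for illustration.

```python
import math

def train_logreg(X, y, epochs=200, lr=0.5):
    """Toy logistic regression trained by stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1 / (1 + math.exp(-z))   # predicted probability of relevance
            err = p - yi                 # gradient of the log loss
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1 / (1 + math.exp(-z))

# Bag-of-words counts over a tiny vocabulary: ["contract", "breach", "lunch"]
X = [[2, 1, 0], [1, 2, 0], [0, 0, 3], [0, 1, 2]]
y = [1, 1, 0, 0]   # labels supplied by the reviewing attorney: 1 = relevant
w, b = train_logreg(X, y)
print(predict(w, b, [1, 1, 0]) > 0.5)  # True: resembles the relevant examples
```

The key limitation for eDiscovery is visible in the feature vectors: the model only sees word counts chosen in advance, with no understanding of what the words mean in context.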
Technology advances at a fast pace. Developments in machine learning over the last five years are now changing the game in Natural Language Processing (NLP), significantly improving the machine’s discernment of the meaning of the text.
Older legal review platforms and status quo TAR processes will soon become obsolete.
Developments in recent years have focused on deep learning, a type of machine learning that uses artificial neural networks to learn and make predictions. Deep learning is designed to simulate the way the human brain processes information.
In 2017, Google introduced transformers, a type of deep learning algorithm, through the release of their paper "Attention Is All You Need". [2] The following year, OpenAI published the GPT paper [3] and Google published the BERT paper [4] showing that unsupervised pre-training followed by supervised fine-tuning of a transformer network on a large data set achieved state-of-the-art results on numerous NLP tasks.
The basic idea behind training a large language model is to examine different parts of the text and weigh the importance of each part to understand the contextual meaning of the text. To understand the concept, you can think of the preschool lesson of simple pattern recognition. Imagine if a teacher shows a student the following sequence of shapes and asks what shape comes next in the sequence:
● ■ ▲ ● ■ = ?
A triangle is the obvious next shape in the sequence.
The transformer network is trained to solve a sequence problem. During training, part of a sentence is masked and then the transformer learns to predict the masked text by analyzing words that surround the masked text, its context.
You shall know a word by the company it keeps.
J.R. Firth, British Linguist, 1957
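The "weighing the importance of each part" described above is implemented by the attention mechanism at the heart of the transformer. Below is a minimal sketch of the scaled dot-product attention from the "Attention Is All You Need" paper, using random toy embeddings rather than a trained model:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how strongly each word attends to each other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V, weights

# Toy 4-word sequence with 3-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
context, weights = attention(X, X, X)  # self-attention: each word looks at its context
print(weights.sum(axis=-1))            # each row of attention weights sums to 1
```

Each output row is a blend of the whole sequence, weighted by relevance to that position, which is how the model learns a word "by the company it keeps."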
Since its publication in 2017, Google's "Attention is All You Need" paper introducing transformers has been cited in over 62,000 research papers. The ability of language models to understand the contextual meaning of text continues to improve as more data and parameters are added. GPT-3 is trained on a huge dataset including the Common Crawl, a data set derived from publicly available web pages comprising billions of pages with nearly a trillion words. [5]
The public release of ChatGPT has captured the imagination of the legal profession. Prompt ChatGPT with some text and it can complete the rest of a story by predicting the next word in sequence, sometimes creating an impression of astonishing skill.
ChatGPT is an example of text generation, a subfield of natural language processing. Generative text is straightforward for a large language model because the output is based on predicting the next word in sequence, the same task upon which the language model is trained.
Large language models have particular importance to the legal profession. A lawyer is a knowledge worker -- reading and writing are at the core of most legal tasks. It is easy to see that the technology for the machine to efficiently draft contract provisions, write segments of briefs, and produce other legal work product is within reach. Researchers reported in late 2022 that GPT was able to achieve passing scores on the evidence and torts sections of the multistate bar exam. [6]
Large language models unlock a new user experience and improved workflows in eDiscovery. While a transformer cannot just be plugged in to replace a SVM algorithm, [7] the machine’s ability to better discern the meaning and structure of text fundamentally improves the eDiscovery workflow.
Benefits of Modern Machine Learning in eDiscovery
Harnessing the recent advances in AI technology improves eDiscovery in several ways:
1. Performance.
The ability of the computer to better understand the meaning and structure of language improves accuracy in almost all NLP tasks. Named entity recognition (tagging personally identifiable information, for example), language translation, summarization, topic modeling, and other tasks are delivered at higher accuracy levels by leveraging transformers.
Classification is a central tool used in eDiscovery. Transformers enable workflows that improve the performance of classification. Better results with less exemplar training means that more relevant documents will be found and fewer irrelevant documents will be encountered.
2. Training Efficiency.
TAR imposes a burden on lawyers to find and review example documents to start the process and to continually review documents to improve recall and precision. The need for upfront review work to train the machine creates a hurdle that has slowed the widespread adoption of machine learning in litigation.
Language models allow for more intelligent training without the need for upfront review of exemplars. Synthetic training data is training data that the machine generates artificially, rather than examples labeled through human review.
Servient’s Descriptive AI leverages language models to take a lawyer’s narrative descriptions of the claims and issues in a case and transform the descriptions into synthetic training data. Synthetic training allows the machine to identify relevant documents and assign evidence to each issue in the case without the burden of upfront review of documents.
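Servient's actual method is proprietary and not described here, but the intuition of ranking documents directly from a lawyer's narrative can be pictured with a simple bag-of-words similarity. This is a deliberately crude stand-in; a real system would use language-model embeddings and generated synthetic examples rather than raw word overlap, and the narrative and documents below are invented:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# The lawyer's narrative description of a claim, used directly as the "query"
narrative = Counter(
    "the subcontractor delayed the project and breached the contract schedule".split())
documents = {
    "doc1": "change order disputes delayed the project schedule",
    "doc2": "the office lunch menu for friday",
}
# Rank documents by similarity to the narrative -- no reviewed exemplars needed
ranked = sorted(documents, key=lambda d: -cosine(narrative, Counter(documents[d].split())))
print(ranked[0])  # doc1 ranks above the unrelated document
```

The point of the sketch is the workflow, not the math: the initial ranking is driven by the lawyer's own description of the case rather than by upfront review of example documents.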
3. Reusability.
Great efficiency gains in eDiscovery are available through the reuse of machine learning models. Reusability of work product in eDiscovery has been discussed for years. However, the discussion has been mostly marketing hype positing that traditional machine learning models can be reused across different data populations.
Traditional algorithms such as SVM are not well suited for reuse because of their reliance on feature engineering. The traditional machine learning approach fits the model to the data population on which it is trained. Traditional machine learning algorithms struggle with new data populations containing different text features as the model is unable to fit to the population.
In contrast, reusability is built into the transformer approach. Deep learning algorithms learn to identify important features without the need for feature engineering, and pre-trained large language models are designed to be adapted through supervised fine-tuning or prompt engineering, allowing for accurate and efficient reuse of models in eDiscovery.
Servient’s new library of models developed for construction litigation is a good example of using large language models for reuse in eDiscovery. A common task in construction litigation is to identify and organize the project record within the collected data. Separating prime contracts, subcontracts, RFIs, change orders, payment applications, and so on is a task that is repeated by the legal team in each case.
Servient’s reusable construction litigation models automatically identify and tag the elements of the construction record without the need for expensive manual review. A construction litigator can now look at the change orders identified by the machine immediately upon loading the collected data into the eDiscovery platform.
Also, reuse across cases means that feedback in each case continuously improves the performance of the reusable models. The knowledge gained in one case can be leveraged on all future cases.
Descriptive MultiModel Learning
Legal technology should adapt to the work of the lawyer rather than require the lawyer to adapt their work to the technology. TAR has failed on this proposition. TAR requires lawyers to follow complex protocols focusing on the review of a stream of emails and files to train the machine.
During discovery, if requested, the lawyer must gather and produce all data that is “relevant to any party's claim or defense and proportional to the needs of the case.” Fed. R. Civ. P. 26(b). It is helpful to remember that the scope of the eDiscovery task is defined by the claims and defenses in the case and proportionality.
The complexity and inefficiency of the status quo eDiscovery workflow have resulted in separating the task of producing documents from the analysis of the merits of the case. In large cases, companies sometimes even use one law firm to handle eDiscovery and another firm to pursue the merits.
The litigation process should narrow the issues in dispute at each stage. Pleadings state the factual basis of the claims and are intended to control the scope of discovery. Consider this quote from a legal commentator in 1979 raising concern with legal procedures unguided by the claims:
“When notice pleading dumps into the lap of a court an erroneous controversy without the slightest guide to what the court is asked to decide; when discovery - totally unlimited because no issue is framed - mulls over millions of papers, translates them to microfilm and feeds them into computers to find out if they can be shuffled into any relevance . . . we should, I think, consider whether noble experiments have gone awry.” [8]
Much has changed in legal practice since 1979. We no longer use microfilm; the amount of data has skyrocketed; and AI can now reliably identify relevant documents. But much has stayed the same. Discovery imposes severe burdens when it is “totally unlimited because no issue is framed.”
Today, litigants try to translate the scope of discovery into a long list of negotiated search terms. Attempting to describe the factual background of the claims and defenses in a list of keywords is merely a “fool’s folly.”
"The fool doth think he is wise, but the wise man knows himself to be a fool."
William Shakespeare
It is no surprise, then, that eighty percent of the documents returned by a typical negotiated keyword list are irrelevant to the claims, while a significant number of relevant documents are missed.
The recent advancement in AI technology allows a lawyer to leverage software in a manner that is consistent with their traditional work patterns. As AI technology becomes more intelligent, the system can conform to the work pattern of the lawyer and be guided by the lawyer’s analysis of the claims and defenses involved in a case.
Descriptive MultiModel Learning (DML) focuses on a lawyer’s analysis of the claims and defenses. In the DML workflow, the lawyer provides a narrative describing the factual basis of the claims and defenses. The lawyer also enters additional narratives to describe each of the important factual issues in dispute.
Leveraging large language models and advanced machine learning, Servient’s Descriptive AI analyzes the narrative descriptions to generate and label synthetic training data. The machine can then assign every document with a score representing the likelihood a document is relevant to the case, as well as the probability of every document’s association to each of the factual issues in the case.
The lawyer can reuse the pleadings, analysis memos, and other work product that they are comfortable with and trained to create as the initial narratives. No example emails or files need to be found and reviewed to initially train the machine.
With Descriptive MultiModel Learning, a separate machine learning model is created for each issue in the case. The ensemble of issue models contributes to the identification of general relevance for every document and more closely matches the analytic framework of legal analysis.
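Schematically, the ensemble idea can be expressed as follows. The issue names, scores, and the max-combination rule are all hypothetical, chosen only to illustrate how per-issue models can feed a general-relevance determination; Servient's actual combination logic is not public:

```python
def multimodel_relevance(issue_scores):
    """Combine per-issue model scores into a general-relevance score.

    Here the ensemble rule is a simple maximum: a document is treated as
    relevant as its strongest association with any pleaded issue.
    (Hypothetical rule for illustration; a real system may weight issues
    differently.)
    """
    return {doc: max(scores.values()) for doc, scores in issue_scores.items()}

# Hypothetical per-issue probabilities for three documents and two issues
issue_scores = {
    "doc1": {"delay": 0.91, "defective_work": 0.10},
    "doc2": {"delay": 0.05, "defective_work": 0.07},
    "doc3": {"delay": 0.30, "defective_work": 0.85},
}
relevance = multimodel_relevance(issue_scores)
produce = [d for d, s in relevance.items() if s > 0.5]  # candidates within scope
print(sorted(produce))  # ['doc1', 'doc3']
```

Because every document carries a score per issue, the legal team can work issue by issue, and the general-relevance call falls out of the same analysis rather than being a separate production exercise.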
The legal team can then focus on important documents by factual issue. While analyzing the documents surfaced by the machine, the lawyer can flag any document whose assignment to an issue they disagree with. As the lawyer's understanding of the case improves, they can update and revise the narrative descriptions. The machine continuously learns from the lawyer's work – both from the updated narratives and from the feedback generated during document analysis.
After the lawyer has completed their analysis of the issues, the lawyer then analyzes any documents that are not associated with an issue but are nevertheless surfaced by the machine as relevant. The lawyer may discover additional factual issues in the case through the review of the unassigned documents.
With DML, the lawyer identifies all documents that are within the scope of discovery by working on the claims and factual issues. The machine’s understanding of relevance to the claims and defenses is derived from the lawyer’s analysis of the claims and defenses.
DML allows lawyers to use models created through their analysis to locate relevant documents that were not identified by the keyword filter, while keeping the cooperative and transparent aspects of keyword filtering. The result is a more reasonable and proportionate inquiry, helping the litigants find more relevant evidence without imposing burdensome review of irrelevant data.
The AI systems of today and tomorrow are moving along the continuum to become a more intelligent assistant for the lawyer. It is time to move on from TAR and harness the power of the current state-of-the-art AI technology.
Conclusion
AI technology can reduce the cost and improve the quality of the eDiscovery processes. Keyword filtering is flawed due to its inability to capture the nuances of language usage, leading to low recall and precision rates. AI technology can help address these problems, providing lawyers with more reliable and cost-effective eDiscovery solutions.
Descriptive MultiModel Learning (DML) is a legal technology that adapts to the work of the lawyer and leverages AI to identify relevant documents by focusing on the analysis of the claims and defenses in a case. DML eliminates the need to train the machine with unorganized emails and files, instead relying on a lawyer's narrative descriptions to generate and label synthetic training data.
[Note the conclusion was generated by GPT by automatically summarizing this paper]
[1] Cormack, Gordon V., and Maura R. Grossman. "Evaluation of machine-learning protocols for technology-assisted review in electronic discovery." In Proceedings of the 37th International ACM SIGIR conference on Research & development in information retrieval, pp. 153-162. 2014.
[2] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017).
[3] Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. "Improving language understanding by generative pre-training." (2018).
[4] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
[5] Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.
[6] Bommarito II, Michael, and Daniel Martin Katz. "GPT Takes the Bar Exam." arXiv preprint arXiv:2212.14402 (2022).
[7] Researchers found that simple fine tuning of the 2018 BERT model did not outperform the traditional logistic regression approach to eDiscovery relevance classification. Yang, Eugene, Sean MacAvaney, David D. Lewis, and Ophir Frieder. "Goldilocks: Just-right tuning of BERT for technology-assisted review." In European Conference on Information Retrieval, pp. 502-517. Springer, Cham, 2022.