Editor: ZyLAB has been at the forefront of using computers to reduce clients’ costs of e-discovery. How does ZyLAB view predictive coding?
Mack: It’s first-generation technology that is presently in the early-adopter phase. The next generation of predictive coding will be much less risky than it is now. The extent of motion practice seen in the key cases that discuss the use of predictive coding demonstrates that there are unexpected landmines in this first generation.
Editor: What do you mean by “risky” or “unexpected landmines”?
Mack: In the Kleen case in Illinois, there have been weeks of expert testimony as the requesting party tries to force the producing party to use predictive technology. The Da Silva Moore case in New York raised concerns that documents in the seed set disclosed to the other side very early in the process might increase the scope of e-discovery. Corporate counsel are troubled by the possibility that these documents might disclose highly confidential information, be of a personal nature, blemish the reputation of a company or alert the other party to the producing party’s trial strategy. Non-responsive documents randomly selected and turned over to the other side have the potential of exposing the producing party to more risk than traditional methods.
Editor: You mentioned the potential for inadvertent production.
Mack: Some documents that are inadvertently produced can be protected under Rule 502. However, that Rule is limited to privileged documents. Already, to save money, a party may consent to the other side taking a “quick peek” at its documents without first doing an in-depth review. The requesting party conducts the responsive review, and the producing party designates and claws back the privileged documents. That idea never gained popularity because once the other side sees extraneous information that might give rise to other causes of action or disclose confidential information, it’s very hard to unring the bell.
Editor: In a predictive coding setting, why would random documents be disclosed – especially documents that are highly confidential or that might trigger or expand litigation?
Mack: The courts have been requiring that the seed sets and test sets be disclosed to the other side for transparency purposes. The seed sets and test sets not only contain responsive documents, but also non-responsive documents. They are designed to allow the receiving party to approve or have confidence in the machine’s decisions on the larger set. Traditionally, disclosing non-responsive documents is done only after showing good cause, not voluntarily.
Editor: Are there other ways to save money but avoid the risks associated with predictive coding?
Mack: There certainly are. Effective deduplication programs alone can produce cost savings of upwards of 50 percent. Keyword search alone is probably just as risky as predictive coding. The middle ground is a rules-based approach.
Editor: Could you explain that?
Mack: A rules-based approach provides a visible, objective reason why each file meets or doesn’t meet the criteria – and the rules can be very sophisticated, involving linguistic syntax and text mining.
Take searching a giant data set, for example. Let’s say I would like to see all the documents, spreadsheets, PowerPoints and emails pertaining to any transactions, requests or transfers that took place within a certain time frame, along with the names of the persons and companies involved. In earlier times, collecting such information would require many sweeps using different search terms. The results would be put in separate folders. Reports would then be manually generated.
With rules-based coding you can, for example, look for a transfer having specified characteristics by an unidentified person to an unidentified company and get the names of that person and company. This would permit you to generate a report consisting of one-line identifications of all the transfers that are indicated in communications and documents.
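To illustrate the idea, here is a minimal sketch in Python of what such a transfer rule might look like. The pattern, document names and rule syntax are hypothetical, not ZyLAB’s actual rules engine, which uses far richer linguistic and text-mining rules.

import re

# Toy "rule" that flags transfer language and pulls out the actor, recipient
# and amount, so each hit can become a one-line report row. Illustrative only.
TRANSFER_RULE = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+)\s+"          # a person's name
    r"(?:transferred|wired|sent)\s+"                    # the transfer verb
    r"(?P<amount>\$[\d,]+(?:\.\d{2})?)\s+to\s+"         # a dollar amount
    r"(?P<company>[A-Z][\w&. ]+?(?:Inc\.|LLC|Ltd\.))",  # a company name
)

def extract_transfers(documents):
    """Return one-line identifications of every transfer the rule matches."""
    report = []
    for doc_id, text in documents.items():
        for match in TRANSFER_RULE.finditer(text):
            report.append(
                f"{doc_id}: {match['person']} -> {match['company']} ({match['amount']})"
            )
    return report

docs = {"email_0042.txt": "On March 3, John Smith transferred $250,000 to Acme Holdings Ltd."}
print("\n".join(extract_transfers(docs)))

Because every match is tied to a visible pattern, each line of the report carries its own objective justification.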
Editor: So it’s very much like a sharpshooter who is able to hit the target without scattering shots all around.
Mack: Exactly. Rules-based coding will surface key facts quickly. It also has a technical advantage: as more data comes in, the same rules can be applied consistently over time, whereas in a predictive machine-learning scenario, adding data to the data set means the machine needs to be fed another seed set. That takes time and resources and, depending on how things are charged, additional cost. Another advantage of rules-based coding is in financial litigation, where percentages and amounts of money are involved. Rules-based coding can use number ranges. Predictive models have a hard time handling numbers and spreadsheets.
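As a purely hypothetical illustration of a number-range rule, with an invented dollar range:

import re

# Hypothetical numeric-range rule: flag any dollar amount between $9,000 and
# $9,999. Nearby numbers look like unrelated tokens to a text-based predictive
# model, but a rule can treat them as a range.
AMOUNT = re.compile(r"\$(\d[\d,]*)")

def in_range(text, low=9_000, high=9_999):
    amounts = (int(m.group(1).replace(",", "")) for m in AMOUNT.finditer(text))
    return any(low <= a <= high for a in amounts)

print(in_range("Two wires of $9,500 each were approved."))  # True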
Editor: They say the biggest argument for predictive coding is that it saves money. Is that true?
Mack: Not necessarily. More money is spent early in the process when more expensive reviewers are creating the seed sets and training the system. Also, right now there tends to be a lot of expensive motion practice and expert testimony involved.
Editor: Is it necessary to retrain a predictive coding system after adding significant documents?
Mack: Predictive coding is based on mathematics: the artificial intelligence builds a matrix that measures how close each document is to every other document in the collection. If you add a few similar documents to a collection, you can probably rely on the same matrix, but if you add a significant amount of new data, the matrix calculations need to be redone. With machine learning you don’t have as many reviewers, of course, but stopping and starting review just because you have an additional data set introduces additional cost.
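To make the matrix idea concrete, here is a rough sketch using standard open-source tools (scikit-learn). It is not any vendor’s actual predictive-coding implementation, and the sample documents are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sketch of the "closeness" matrix: vectorize the collection, then compute
# pairwise document similarity.
docs = [
    "wire transfer approved by the board",
    "board approved the quarterly transfer",
    "lunch menu for the company picnic",
]
vectorizer = TfidfVectorizer()
matrix = cosine_similarity(vectorizer.fit_transform(docs))
print(matrix.round(2))

# Adding a significant batch of new documents changes the vocabulary and the
# term weights, so the vectorizer and the whole matrix must be recomputed.
# That recomputation is the "redo" that interrupts review and adds cost.
docs.append("new custodian data: offshore transfers and invoices")
matrix = cosine_similarity(vectorizer.fit_transform(docs))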
Editor: What about things like drawings? Can predictive coding handle a photograph?
Mack: Drawings and photographs need to be processed before they can be fed into a predictive coding model, and the way they’re processed to recover the text is very important. We use a unique process in which we pull the text out in multiple directions so that rules-based coding can be used effectively, but a predictive model that doesn’t see text won’t know what to do with the schematics.
Cameras these days record where the photo was taken and the date and the time. Depending on how the photo is harvested from the data set and how it’s encoded, that information can also be fed into a rules-based engine. A rules-based system can process photographs of license plates after the text in them is read with our OCR engine.
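As a sketch of that pipeline, using openly available libraries only as stand-ins for a production OCR engine, and with a hypothetical file name:

from PIL import Image, ExifTags        # Pillow
import pytesseract                     # stand-in OCR; ZyLAB uses its own engine

# Hypothetical example: read the capture date from a photo's EXIF metadata and
# recover its text (e.g. a license plate) so both can feed a rules-based engine.
def photo_facts(path):
    img = Image.open(path)
    exif = {ExifTags.TAGS.get(tag, tag): value for tag, value in img.getexif().items()}
    return {
        "taken": exif.get("DateTime"),                     # capture date and time
        "text": pytesseract.image_to_string(img).strip(),  # recovered text for the rules
    }

facts = photo_facts("evidence/plate_0017.jpg")             # hypothetical file
if facts["taken"] and facts["text"]:
    print(f"{facts['taken']}: plate reads {facts['text']!r}")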
Editor: Are there particular types of cases where the advantages of rules-based coding are especially evident?
Mack: Predictive models aren’t good at handling multilingual documents – a situation where rules-based systems excel. Patent cases have schematics. They may involve multilingual documents. They frequently involve financial considerations expressed in numbers, currencies and percentages. In securities litigation or a lawsuit involving executives or directors in which the data is highly confidential, you might not want to disclose seed and test sets. When the data is publicly available or less sensitive, predictive coding is much less risky.
Editor: In an investigation, would a rules-based system be the most effective way of identifying electronically stored documents pertinent to the investigation?
Mack: Yes. Such a system enables you to go directly to the required information without having others first review a seed set. You can refine and reuse your rules over and over, whereas predictive models need to be rebuilt each time. If you are investigating possible violations of the FCPA, Dodd-Frank or the UK Bribery Act, for example, you might have a set of rules that you could apply to the emails of individuals who might be suspect. Doing so would be an objective way to show your due diligence. For these reasons, the predictive way of proceeding would be unlikely to prove as useful as rules-based coding. I would think that counsel would want to be a little more specific in their oversight.
Editor: I would think that the use of rules-based monitoring of email traffic at a company’s business locations in countries where corruption is known to be rife might be useful in nipping potential corruption in the bud.
Mack: Yes, and companies are doing that within the constraints of local privacy regulations. Certainly, the intelligence community makes use of similar technologies when they’re looking for terrorist activity.
Editor: I gather from what you say that you’re not against machine learning or artificial intelligence in legal applications.
Mack: Absolutely not; I think it’s the future. However, the early adopters of predictive technology are fighting the good fight and will make it easier for second-generation tools to save money without exposing a corporation to unnecessary risk.
It’s a wonderful advance that the judiciary is embracing machine-assisted review in the form of predictive coding. They’re actually ahead of the game. The next generation of the technology will take into account the functional requirements of litigation. It will enable data to be added over time without needing a great big redo of the matrices. There will be a process that doesn’t result in production of non-responsive documents that might harm the corporation in the future. Those are two key requirements for the second generation of predictive coding.
Editor: Do you see a situation where ultimately people will be using both predictive coding and rules-based coding to be sure that all bases are covered?
Mack: Absolutely. Rules-based coding will certainly help in identifying the seed sets used to train predictive coding systems. It will help in quality control. More importantly, it will help surface the facts very quickly so that trial counsel will be able to do their jobs more effectively and at reduced cost. Hybrid approaches are already being used.
Editor: Where can our readers learn more?
Mack: They can find a complimentary analyst report from ESG by going to http://ow.ly/d7UB4.
Published August 23, 2012.