Candice Chan-Glasgow

Director, Review Services and Counsel


July 16, 2021


After potentially relevant electronically stored information (ESI) has been identified, preserved, and collected, it must be processed.  Processing converts the ESI into a format that can be searched, reviewed, and analyzed.  Before the ESI is processed, the parties should agree on the processing specifications.  Many processing decisions are routine, and agreement on them is generally non-contentious.


One primary decision is how you will deduplicate the data.  This is important because of the large volume of duplicate records that is typical of any enterprise.  Deduplication reduces the number of documents that need to be hosted in a review platform and ultimately reviewed for relevance.  During processing, an algorithm generates a “hash value” (commonly referred to as a “digital fingerprint”) for each document based on its characteristics and content.  Exact duplicates will have the same hash value.
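To make the idea of a hash value concrete, here is a minimal sketch in Python.  It assumes SHA-256 computed over a file's raw bytes; processing tools differ in the algorithm they use and in which characteristics feed the hash, so the function name and the sample text are illustrative only.

```python
import hashlib


def file_hash(content: bytes) -> str:
    """Return the SHA-256 hex digest of a document's raw bytes.

    Assumption: the hash covers the raw bytes only. Real processing
    tools may use a different algorithm or hash selected fields.
    """
    return hashlib.sha256(content).hexdigest()


original = b"Change Order 12: extend completion date by 14 days."
exact_copy = bytes(original)  # byte-for-byte identical
edited = b"Change Order 12: extend completion date by 15 days."

print(file_hash(original) == file_hash(exact_copy))  # True
print(file_hash(original) == file_hash(edited))      # False
```

Changing a single character produces a completely different hash, which is why only byte-for-byte identical files are treated as exact duplicates.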


Deduplication is often misunderstood to mean that all duplicates, and apparent duplicates, will be removed from the review set.  Firstly, many documents that appear on their face to be duplicates will not be exact duplicates based on the hash value.  For example, a PDF version of a Word document will have a different hash value from the Word version and is not an exact duplicate, despite the fact that the content of the document is the same.


Secondly, the standard practice is to deduplicate a dataset ‘globally, by family’.  The ‘globally’ aspect means that if multiple custodians each have an exact copy of the same document outside of email, only one copy of the document will be identified for review.  Where it is important to know which custodians possessed a copy of a certain document, a field can be created that lists the names of all custodians who had an exact duplicate.


The ‘by family’ aspect means that if the duplicate document is attached to different emails, it will not be deduplicated.  This is the proper approach; otherwise, attachments would be stripped from emails with no ability to link them back to the email.  Deduplicating all attachments to emails is referred to as “deduplication by item”.  It would never be recommended except in very narrow circumstances, and only upon clear agreement of all parties that emails will be stripped of their attachments.
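The logic of deduplicating ‘globally, by family’ can be sketched as follows.  This is an illustration only, not how any particular processing tool works: the `Family` class, the way the family hash is built from the item hashes, and the `all_custodians` field name are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Family:
    custodian: str
    item_hashes: List[str]  # parent item first, then any attachments


def family_hash(family: Family) -> str:
    """Hash the whole family so an attachment is never compared on its own."""
    joined = "|".join(family.item_hashes)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def dedupe_globally_by_family(families: List[Family]) -> List[Dict]:
    """Keep one copy of each family across all custodians.

    Every custodian who held a duplicate is recorded in 'all_custodians'.
    """
    kept: Dict[str, Dict] = {}
    for fam in families:
        key = family_hash(fam)
        if key not in kept:
            kept[key] = {"family": fam, "all_custodians": [fam.custodian]}
        elif fam.custodian not in kept[key]["all_custodians"]:
            kept[key]["all_custodians"].append(fam.custodian)
    return list(kept.values())


families = [
    Family("Alice", ["doc1"]),            # loose file
    Family("Bob", ["doc1"]),              # same loose file: deduplicated
    Family("Alice", ["emailA", "doc1"]),  # email with doc1 attached
    Family("Bob", ["emailB", "doc1"]),    # different email, same attachment
]

result = dedupe_globally_by_family(families)
print(len(result))                  # 3
print(result[0]["all_custodians"])  # ['Alice', 'Bob']
```

The loose copy held by Bob is removed and his name is recorded against the surviving copy, while the two emails remain intact because their families differ even though the attachment is identical.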


As a result, even with deduplication there will be both exact duplicates and ‘near’ duplicates in the data.  These can be reviewed efficiently using other tools available in the review platform.


One approach that will reduce the costs of reviewing the opposing party’s productions is to process the incoming production to identify duplicates across both parties’ production sets.  This can result in a substantial reduction in the volume of documents requiring review.  For example, in our experience, a large percentage of documents exchanged in construction disputes are common documents such as project change orders, contracts, and email exchanges between the parties.  This cost-saving approach does require the parties to exchange documents in native format, which is recommended in any event.
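As a rough sketch of the comparison, assume both sets were processed with the same hashing method and that each incoming document has a hash computed from its native file.  The function name, document IDs, and hash values below are placeholders.

```python
from typing import Dict, List, Set, Tuple


def split_incoming(
    own_hashes: Set[str], incoming: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Split an incoming production into new documents and known ones.

    incoming maps a document ID to the hash of its native file.
    """
    already_held = sorted(d for d, h in incoming.items() if h in own_hashes)
    new_to_us = sorted(d for d, h in incoming.items() if h not in own_hashes)
    return new_to_us, already_held


# Hashes of documents already in our own set (placeholder values).
own_hashes = {"h1", "h2", "h3"}

# The opposing party's production (hypothetical IDs and hashes).
incoming = {"OPP-001": "h2", "OPP-002": "h9", "OPP-003": "h3"}

new_to_us, already_held = split_incoming(own_hashes, incoming)
print(new_to_us)     # ['OPP-002']
print(already_held)  # ['OPP-001', 'OPP-003']
```

The hashes can only match if the files are byte-for-byte identical, which is why the approach depends on native files being exchanged.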


Other decisions at the processing stage include the time zone against which email dates and times should be normalized, the prefix to be used for the document IDs, and the order of importance of the individual custodians.
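Two of these decisions are easy to illustrate.  The sketch below assumes the agreed time zone is UTC and that document IDs are a prefix followed by a zero-padded sequence number; the prefix `ABC`, the seven-digit width, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def normalize_sent_time(raw: str) -> datetime:
    """Convert an ISO 8601 timestamp with a UTC offset to the agreed zone.

    Assumption: the agreed zone is UTC. A timestamp with no offset would
    be interpreted in the machine's local time, so offsets should be kept.
    """
    return datetime.fromisoformat(raw).astimezone(timezone.utc)


def make_doc_id(prefix: str, sequence: int, width: int = 7) -> str:
    """Build a document ID from the agreed prefix and a padded number."""
    return f"{prefix}{sequence:0{width}d}"


# The same email as recorded by a sender in Toronto and a recipient in Calgary.
toronto = normalize_sent_time("2021-03-01T09:15:00-05:00")
calgary = normalize_sent_time("2021-03-01T07:15:00-07:00")

print(toronto.isoformat())     # 2021-03-01T14:15:00+00:00
print(toronto == calgary)      # True
print(make_doc_id("ABC", 42))  # ABC0000042
```

Normalizing to a single time zone means the same email carries the same sent time regardless of where each custodian's copy was collected.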


During processing, document metadata such as author, recipients, file name, and document dates are extracted into searchable fields.  The text of a document is also extracted, and optical character recognition (OCR) can be performed on documents without text (for example, scanned documents).  It follows that the content of scanned documents is not searchable until the OCR process is complete.  This matters because, if the document collection was limited by keywords (which is not advisable), the content of scanned documents would not have been searched and the collection may be deficient.
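The gap is simple to demonstrate.  In the sketch below, each document is a dictionary with hypothetical `doc_id` and `text` fields, and a scanned document is represented by empty extracted text; no OCR is actually performed.

```python
from typing import Dict, List


def keyword_hits(docs: List[Dict[str, str]], keyword: str) -> List[str]:
    """Return IDs of documents whose extracted text contains the keyword."""
    needle = keyword.lower()
    return [d["doc_id"] for d in docs if needle in d["text"].lower()]


def needs_ocr(docs: List[Dict[str, str]]) -> List[str]:
    """Return IDs of documents with no extracted text."""
    return [d["doc_id"] for d in docs if not d["text"].strip()]


docs = [
    {"doc_id": "DOC-1", "text": "Change order approved by the owner."},
    {"doc_id": "DOC-2", "text": ""},  # scanned image of a change order
]

print(keyword_hits(docs, "change order"))  # ['DOC-1']
print(needs_ocr(docs))                     # ['DOC-2']
```

Until DOC-2 has been through OCR, no keyword can find it, whatever the scanned page actually says.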


Once the documents have been processed, the processed data should be reconciled against the collection plan to ensure that everything that should have been collected was in fact collected. Identifying any issues or gaps in the collection as early as possible is important.
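A reconciliation of this kind can be as simple as comparing the sources listed in the collection plan against the sources that actually appear in the processed data.  The sketch below assumes each processed document records its custodian and source; the field names and sample values are illustrative.

```python
from collections import Counter
from typing import Dict, List, Tuple

Source = Tuple[str, str]  # (custodian, source)


def reconcile(planned: List[Source], processed: List[Dict[str, str]]) -> Dict:
    """Compare planned custodian sources with what was actually processed."""
    counts = Counter((d["custodian"], d["source"]) for d in processed)
    return {
        "counts": dict(counts),
        "missing": sorted(set(planned) - set(counts)),    # planned, no documents
        "unplanned": sorted(set(counts) - set(planned)),  # documents, not planned
    }


planned = [("Alice", "mailbox"), ("Alice", "laptop"), ("Bob", "mailbox")]
processed = [
    {"custodian": "Alice", "source": "mailbox"},
    {"custodian": "Alice", "source": "mailbox"},
    {"custodian": "Bob", "source": "mailbox"},
    {"custodian": "Bob", "source": "phone"},
]

report = reconcile(planned, processed)
print(report["missing"])    # [('Alice', 'laptop')]
print(report["unplanned"])  # [('Bob', 'phone')]
```

A planned source with no documents, or documents from a source that was never planned, are both prompts to go back to the collection before review begins.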


After processing, many platforms provide the ability to conduct preliminary assessments of the data, including identifying the document types, date ranges, and number of documents collected for each custodian.  These assessments allow only the documents of interest to be promoted to a review platform for further review.
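The kind of summary described here, and the filtering it enables, might look like the following sketch.  Processing platforms offer this through their own interfaces; the field names, the sample documents, and the selection criteria are assumptions made for the example.

```python
from collections import Counter
from datetime import date
from typing import Dict, List, Set


def summarize(docs: List[Dict]) -> Dict:
    """Summarize processed documents by type, custodian, and date range."""
    dates = [d["doc_date"] for d in docs]
    return {
        "by_type": Counter(d["file_type"] for d in docs),
        "by_custodian": Counter(d["custodian"] for d in docs),
        "date_range": (min(dates), max(dates)),
    }


def select_for_review(
    docs: List[Dict], start: date, end: date, file_types: Set[str]
) -> List[str]:
    """Return IDs of documents within the date range and of the chosen types."""
    return [
        d["doc_id"]
        for d in docs
        if start <= d["doc_date"] <= end and d["file_type"] in file_types
    ]


docs = [
    {"doc_id": "D1", "custodian": "Alice", "file_type": "email", "doc_date": date(2019, 5, 2)},
    {"doc_id": "D2", "custodian": "Alice", "file_type": "pdf", "doc_date": date(2020, 1, 15)},
    {"doc_id": "D3", "custodian": "Bob", "file_type": "email", "doc_date": date(2020, 3, 9)},
    {"doc_id": "D4", "custodian": "Bob", "file_type": "exe", "doc_date": date(2020, 3, 9)},
]

summary = summarize(docs)
print(summary["by_custodian"]["Bob"])  # 2
print(summary["date_range"][0])        # 2019-05-02
print(select_for_review(docs, date(2020, 1, 1), date(2020, 12, 31), {"email", "pdf"}))
# ['D2', 'D3']
```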


Candice Chan-Glasgow is Director, Review Services at Heuristica Discovery Counsel LLP.  Heuristica has offices in Toronto and Calgary and is the sole national law firm whose practice is limited to eDiscovery and electronic evidence.  Heuristica has considerable experience in investigations and disputes and recently became the first law firm in the world to be awarded RelativityOne Silver Partner status.