Terminology To Know for a Successful eDiscovery Project

January 31, 2023

By John Trickey

When it comes to working on eDiscovery projects, there are some key terms you should be aware of to understand the process, workflow and functionality available to gain insight into your matter and minimize the review scope. Here is a cheat sheet of terms you should know:

CONTINUOUS ACTIVE LEARNING

Continuous active learning, also known as CAL, is a process in which humans train software to identify documents that meet certain criteria. Relevance is a commonly used criterion, but CAL can serve other objectives as well, such as locating privileged content.

Put simply, CAL identifies and prioritizes the documents most likely to be relevant to the matter being investigated. The system then creates a prioritized queue for review. As the manual review exercise proceeds and human reviewers confirm or reject the top documents put forward by CAL, the review queue is continuously reshuffled and reprioritized as the system learns more about what is relevant and adjusts its predictions accordingly.

As a result of using CAL, the documents reviewed first are those most likely to be relevant. The documents remaining in the review pool are those most likely to be irrelevant and can be assessed by a more junior reviewer or excluded from the review altogether.
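To make the reprioritization loop concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how any particular review platform implements CAL: the word-weight scoring, the `cal_review` function, the seed terms and the `reviewer` callback (standing in for the human reviewer) are all hypothetical. Production tools typically rely on trained statistical classifiers rather than simple word counts.

```python
from collections import Counter


def tokenize(text):
    """Split text into a set of lowercase words."""
    return set(text.lower().split())


def cal_review(documents, seed_terms, reviewer, batch_size=1):
    """Toy CAL loop. Returns (doc_id, decision) pairs in review order.

    documents:  {doc_id: text}
    seed_terms: words assumed relevant before any review has happened
    reviewer:   callable(doc_id, text) -> bool, standing in for the human
    """
    weights = Counter({term.lower(): 1.0 for term in seed_terms})
    unreviewed = dict(documents)
    review_order = []

    while unreviewed:
        def score(doc_id):
            words = tokenize(unreviewed[doc_id])
            return sum(weights[w] for w in words) / max(len(words), 1)

        # Re-rank the entire remaining queue using everything learned so far.
        ranked = sorted(unreviewed, key=lambda d: (-score(d), d))

        for doc_id in ranked[:batch_size]:
            text = unreviewed.pop(doc_id)
            relevant = reviewer(doc_id, text)
            review_order.append((doc_id, relevant))
            # Learn from the decision: reward or penalize the document's words.
            delta = 1.0 if relevant else -1.0
            for word in tokenize(text):
                weights[word] += delta

    return review_order


# Hypothetical sample matter: the "reviewer" treats pricing documents as relevant.
sample_docs = {
    "D1": "merger pricing agreement draft",
    "D2": "lunch menu friday",
    "D3": "pricing agreement signed",
    "D4": "friday parking notice",
}
order = cal_review(sample_docs, ["merger"], lambda doc_id, text: "pricing" in text)
# Review order: D1, D3, D2, D4 (both pricing documents surface first)
```

In this sample, confirming D1 as relevant raises the weight of "pricing" and "agreement," which pushes D3 to the top of the queue ahead of the unrelated documents.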

CONCEPTUAL CLUSTERING

Conceptual analytics focuses on related concepts within documents, regardless of their phrasing. This emphasis on meaning versus words introduces context into the document analysis. Conceptual analytics finds documents on the same concepts or topics without these necessarily using the same key terms or phrases.

Clustering is when the system organizes documents based on these concepts, topics or ideas. The system analyzes each document’s text and groups it with others based on the context. These groupings allow users to review documents based on similarity and make the review more intuitive.

EDISCOVERY VS. DATA FORENSICS/COMPUTER FORENSICS

Electronic discovery is the method of identifying, preserving, collecting, processing, reviewing and analyzing electronically stored information in legal proceedings. Identifying, preserving, collecting, analyzing and reporting on digital information are part of the data/computer forensics process as well. As you can see, they are very similar. The crucial difference is the party in charge of analyzing the data.

The expert’s role in an eDiscovery case is to provide information to legal teams in a reviewable format for analysis. In computer forensics, however, the expert will analyze the data and report the findings to the legal teams. The primary distinction between eDiscovery and computer forensics is who performs the electronic information analysis.

DATA HOSTING

Data hosting refers to the online platform where natively harvested data can be preserved, processed and prepared for document analysis and review. The data is then maintained within a secure application where attorneys can identify and tag documents for relevancy against the criteria of the legal action.

DATA PROCESSING

Data processing for eDiscovery takes data stored in the regular course of business and converts it to a usable format for review, research and analysis. Processing a copy of that data means extracting pertinent attributes, or metadata, tied to the native digital document; performing OCR or text extraction for documents that are not searchable at the time of collection; and assigning a unique identifier, or document ID, that can be referenced during review and in document productions.
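The processing steps can be sketched in a few lines of Python. This is an illustration only, assuming a folder of collected files: the `process_collection` function, the `DOC` prefix, the six-digit numbering and the rule that flags image files for OCR by extension are all hypothetical choices, not the behavior of any specific processing tool.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path

# Assumption: files with these extensions are images that would need OCR.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}


@dataclass
class ProcessedRecord:
    doc_id: str       # unique identifier used during review and production
    file_name: str    # attribute extracted from the native file
    size_bytes: int   # attribute extracted from the native file
    sha256: str       # hash value of the file contents
    needs_ocr: bool   # True when no searchable text is expected


def process_collection(folder, prefix="DOC"):
    """Walk a folder and build one record per file, in a stable order."""
    files = sorted(p for p in Path(folder).rglob("*") if p.is_file())
    records = []
    for number, path in enumerate(files, start=1):
        data = path.read_bytes()
        records.append(ProcessedRecord(
            doc_id=f"{prefix}{number:06d}",
            file_name=path.name,
            size_bytes=len(data),
            sha256=hashlib.sha256(data).hexdigest(),
            needs_ocr=path.suffix.lower() in IMAGE_EXTENSIONS,
        ))
    return records
```

Sorting the files before numbering keeps the document IDs reproducible if the same collection is processed again.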

DE-NIST

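De-NISTing is the process of separating files generated by a computer system from those created by a user. This automated process compares each file’s hash value against a reference list of known system and application files maintained by the National Institute of Standards and Technology (the National Software Reference Library) and sets the matches aside.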
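In practice this filtering is commonly done by hash comparison. The sketch below assumes the NIST reference list has already been loaded as a set of uppercase SHA-1 strings; the `de_nist` function and the in-memory `{name: bytes}` input are hypothetical simplifications for illustration.

```python
import hashlib


def sha1_hex(data):
    """Return the uppercase SHA-1 hex digest of a bytes object."""
    return hashlib.sha1(data).hexdigest().upper()


def de_nist(files, known_hashes):
    """Split files into (user_files, system_files).

    files:        {file_name: content_bytes}
    known_hashes: set of uppercase SHA-1 digests of known system files
    """
    user_files, system_files = {}, {}
    for name, data in files.items():
        if sha1_hex(data) in known_hashes:
            system_files[name] = data   # matches the reference list
        else:
            user_files[name] = data     # likely user-created, keep for review
    return user_files, system_files
```

Only the files that do not match the reference list move forward to review.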

DEDUPLICATION

This is the process of identifying exact duplicate copies of documents present in data collection sets by comparing the unique hash values of the electronic records. Once the repeating documents are identified, they may be excluded from the full data set on a custodial (one copy of that exact duplicate document is kept for each custodian to whom the data belongs) or global (one copy of that exact duplicate document is kept in the entire universe of documents) level.
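The difference between custodial and global deduplication comes down to what counts as "already seen." Here is a minimal sketch, assuming each record is a `(custodian, document name, content bytes)` tuple; the `deduplicate` function and the choice of SHA-256 are illustrative assumptions, and real platforms may hash selected email fields rather than raw bytes.

```python
import hashlib


def deduplicate(records, scope="global"):
    """Keep the first copy of each exact duplicate.

    records: iterable of (custodian, doc_name, content_bytes)
    scope:   "global" keeps one copy across all custodians,
             "custodial" keeps one copy per custodian
    """
    if scope not in ("global", "custodial"):
        raise ValueError("scope must be 'global' or 'custodial'")

    seen = set()
    kept = []
    for custodian, doc_name, content in records:
        digest = hashlib.sha256(content).hexdigest()
        key = digest if scope == "global" else (custodian, digest)
        if key in seen:
            continue  # exact duplicate within the chosen scope
        seen.add(key)
        kept.append((custodian, doc_name))
    return kept


# Hypothetical collection: the same email appears three times.
records = [
    ("alice", "offer.msg", b"Offer attached."),
    ("alice", "offer_copy.msg", b"Offer attached."),
    ("bob", "fwd_offer.msg", b"Offer attached."),
    ("bob", "notes.txt", b"Call notes."),
]
global_set = deduplicate(records, "global")
# [("alice", "offer.msg"), ("bob", "notes.txt")]
custodial_set = deduplicate(records, "custodial")
# [("alice", "offer.msg"), ("bob", "fwd_offer.msg"), ("bob", "notes.txt")]
```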

NEAR DEDUPLICATION

Not to be confused with document deduplication, which relies on unique hash values, near deduplication calculates document similarity based on textual content. For example, if you had two documents containing exactly the same text – one being a native email file and the other a PDF version of that same email – the hash values of the two files would be entirely different. However, near-duplicate identification compares just the textual content of the two documents and can determine that they are very similar to each other. This can accelerate document review and improve accuracy.
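The email-versus-PDF example can be sketched as follows. This is one possible approach, assuming similarity is measured as the Jaccard overlap of three-word shingles; the `shingles` and `text_similarity` functions and the fake file bytes are hypothetical, and commercial tools use their own similarity measures and thresholds.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def text_similarity(text_a, text_b):
    """Jaccard similarity of word shingles: 1.0 means identical wording."""
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b)


email_text = "Please review the attached contract before Friday"

# Stand-ins for two file formats that wrap the same text in different bytes.
native_bytes = b"MSG-HEADER " + email_text.encode()
pdf_bytes = b"%PDF-1.7 " + email_text.encode()

hashes_match = (hashlib.sha256(native_bytes).hexdigest()
                == hashlib.sha256(pdf_bytes).hexdigest())   # False
same_text = text_similarity(email_text, email_text)          # 1.0
with_signature = text_similarity(email_text, email_text + " Thanks, Sam")
```

The hash comparison fails because the bytes differ, while the text comparison scores the two versions as identical. Adding a short signature lowers the score only slightly, which is what lets reviewers group such documents together.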

EARLY CASE ASSESSMENT (ECA)

Early case assessment is the process of identifying key electronic data in the preliminary stages of a legal matter prior to collection. This information is then used to help estimate risk, gauge and minimize potential downstream costs and guide case strategy. Some examples of ECA include:

Interviewing custodians, preferably using technology that automates and streamlines the process.
Tiering and prioritizing custodians to collect and review the electronically stored information that is most likely to be relevant first and then moving on to secondary custodians.
Using technology and analytics to understand the contents of the collected and reviewed ESI in order to determine the most appropriate legal strategy to take in the matter.

EMAIL THREADING

Email threading identifies email relationships – strings of the same conversation, people involved in a conversation, attachments and duplicate emails – and groups them together so they can be viewed as one coherent text.

Inclusive email messages contain the most complete content – all the text and attachments in a whole email group. Conversely, a non-inclusive email is one whose text and attachments are fully contained in other (inclusive) emails.

By reviewing only inclusive messages and skipping non-inclusive messages and duplicates, a team bypasses redundant content, reduces the number of documents to review and makes the review process much more efficient.
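Here is a simplified sketch of how inclusive emails might be identified. It assumes each reply quotes the earlier messages verbatim and ignores attachments and header metadata, which real threading engines also consider; the `inclusive_emails` function and the sample thread are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def inclusive_emails(thread):
    """Return the IDs of emails whose text is not fully contained elsewhere.

    thread: {email_id: full body text, including any quoted earlier messages}
    """
    bodies = {eid: normalize(body) for eid, body in thread.items()}
    inclusive = []
    for eid, body in bodies.items():
        covered = False
        for other_id, other_body in bodies.items():
            if other_id == eid:
                continue
            # Covered when a longer email contains this text, or when an exact
            # duplicate with a smaller ID exists (keep one copy of duplicates).
            if body in other_body and (body != other_body or other_id < eid):
                covered = True
                break
        if not covered:
            inclusive.append(eid)
    return inclusive


# Hypothetical thread with one branch (E4 replies to E1 directly).
thread = {
    "E1": "Can we meet Tuesday?",
    "E2": "Tuesday works. Can we meet Tuesday?",
    "E3": "Great, see you then. Tuesday works. Can we meet Tuesday?",
    "E4": "I can't make it. Can we meet Tuesday?",
}
to_review = inclusive_emails(thread)   # ["E3", "E4"]
```

Reviewing E3 and E4 covers everything said in all four messages, so E1 and E2 can be skipped.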

METADATA

Simply put, metadata is “data about the data.” It is the digital attributes associated with a native document. Think of it as looking at the properties when viewing a document. Metadata is important in discovery because it serves as a digital trail for an electronic document: how and when it was created, who has reviewed it, when it was last opened and so forth.

There are many different types of metadata, and it’s critical to understand each one both when requesting productions from opposing counsel and when preparing your own productions. Examples include:

Application Metadata: This is the data embedded in the file created by the source application, such as Microsoft® Word. It moves with the file when copied, though copying may alter application metadata.
Document Metadata: These are the properties of a document that may not be immediately visible in its contents but can often be seen in a “Properties” view. For example, Word tracks the author name and total editing time.
Email Metadata: Data about the email that tracks things like origin, date and time sent, or date and time received. Sometimes, this metadata may not be immediately apparent within the email application that created it. The amount of email metadata available varies depending on the system utilized. For example, Outlook has a metadata field that links together messages in a thread, which can facilitate review – not all email applications have this functionality.
Embedded Metadata: Examples of embedded metadata are edit history or notes in a presentation file. This metadata is usually hidden; however, it can be a vitally important part of the ESI. Embedded metadata may only be viewable in the original native file since it is not always extracted during processing and conversion to an image format.
File System Metadata: This is data generated by the operating system’s file system, such as Windows, to track key statistics about the file (e.g., name, size and location). It is usually stored externally from the file itself.
User-Added Metadata: This is data created by a user while working with, reviewing or copying a file, such as notes or tracked changes.
Vendor-Added Metadata: The processing of the native document by an eDiscovery vendor also generates metadata. Don’t be alarmed: It’s impossible to work with some file types without generating some metadata. For example, you can’t review and produce individual emails within a custodian’s Outlook PST file without extracting them as separate emails, either in Outlook MSG format or converted to an image format such as TIFF or PDF, and that adds a layer of metadata to identify this manipulation.
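The point that file system metadata lives outside the file can be seen with a short Python sketch. The helper names `file_system_metadata` and `content_hash` are hypothetical; the sketch only reads the name, size and last-modified time that the operating system reports.

```python
import hashlib
import os


def file_system_metadata(path):
    """Read attributes the file system tracks about a file."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "location": os.path.dirname(os.path.abspath(path)),
        "size_bytes": info.st_size,
        "modified": info.st_mtime,  # seconds since the epoch
    }


def content_hash(path):
    """Hash only the bytes inside the file, ignoring its name and location."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()
```

Renaming or moving a file changes what `file_system_metadata` returns while `content_hash` stays the same, which is why this kind of metadata has to be captured at collection time rather than recovered from the file later.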

SEARCH TERMS

Search terms are the criteria used to identify key documents that could be relevant to a legal matter. They are created by counsel for both parties in the litigation to aid in sampling data sets for relevancy and in assessing potential proportionality issues and cost-shifting possibilities. Search terms narrow the criteria and scope of the production by identifying the key relevant concepts. Terms tend to be refined and changed in later iterations as more is learned during the review.

BOOLEAN OPERATORS AND PROXIMITY IN KEYWORDS

Special operators, such as Boolean and proximity operators, can be used to create relationships between multiple keywords for search terms. This technique connects individual keywords or phrases in a single query to avoid false positives and accurately pinpoint documents of interest. Typical Boolean connectors are AND, OR and NOT, placed between or before the specific terms; proximity operators require that two terms appear within a set number of words of each other.
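As a rough illustration of how these operators combine, here is a small Python sketch. The helper names (`has`, `AND`, `OR`, `NOT`, `within`, `search`) are hypothetical and do not reflect the query syntax of any review platform, and "within n words" is assumed to mean that the word positions differ by at most n, a definition that varies between tools.

```python
import re


def tokens(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def has(term):
    return lambda words: term.lower() in words


def AND(*conditions):
    return lambda words: all(c(words) for c in conditions)


def OR(*conditions):
    return lambda words: any(c(words) for c in conditions)


def NOT(condition):
    return lambda words: not condition(words)


def within(term_a, term_b, n):
    """Proximity: both terms occur at most n word positions apart."""
    def condition(words):
        pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
        pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
        return any(abs(i - j) <= n for i in pos_a for j in pos_b)
    return condition


def search(documents, query):
    """Return the IDs of documents whose text satisfies the query."""
    return [doc_id for doc_id, text in documents.items() if query(tokens(text))]


# Hypothetical sample documents.
docs = {
    "D1": "The contract was breached by the supplier",
    "D2": "Contract renewal lunch",
    "D3": "Supplier breach of warranty, no contract",
}
boolean_hits = search(docs, AND(has("contract"), NOT(has("lunch"))))  # ["D1", "D3"]
proximity_hits = search(docs, within("contract", "breached", 2))      # ["D1"]
```

The NOT clause removes the lunch email as a false positive, and the proximity query narrows the results to the one document where the two terms appear close together.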

WILDCARD SEARCHES

Wildcards can help expand keywords to include terms that contain part of a critical keyword or an alternative spelling. Wildcards are special characters that can stand in for unknown characters in a text value and are handy for locating multiple items with similar, but not identical, data. Typical wildcards, whose exact syntax varies by search platform, are: * matches any number of characters; ? matches a single alphabetic character in a specific position; [ ] matches any one of the characters within the brackets; ! placed inside the brackets excludes those characters; - placed inside the brackets matches a range of characters; and # matches any single numeric character.
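To show how these wildcards behave, the sketch below translates a wildcard keyword into a Python regular expression. It is an assumption-laden illustration: the `wildcard_to_regex` and `find_hits` functions are hypothetical, matching is limited to whole words, `*` is assumed to stay within a single word, `?` is assumed to match an ASCII letter, and a malformed range inside brackets will raise an error.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard keyword into a compiled, case-insensitive regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        end = pattern.find("]", i + 1) if ch == "[" else -1
        if ch == "*":
            parts.append(r"\w*")        # any number of word characters
        elif ch == "?":
            parts.append("[a-z]")       # exactly one letter
        elif ch == "#":
            parts.append(r"\d")         # exactly one digit
        elif ch == "[" and end > i + 1:
            body = pattern[i + 1:end]
            negate = body.startswith("!") and len(body) > 1
            chars = body[1:] if negate else body
            # Keep "-" as a range marker, escape everything else.
            cls = "".join("-" if c == "-" else re.escape(c) for c in chars)
            parts.append("[" + ("^" if negate else "") + cls + "]")
            i = end                      # skip past the closing bracket
        else:
            parts.append(re.escape(ch))  # literal character
        i += 1
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


def find_hits(pattern, text):
    """Return every whole word in text that matches the wildcard pattern."""
    return wildcard_to_regex(pattern).findall(text)


stem_hits = find_hits("negotiat*", "They negotiated while negotiations stalled")
# ["negotiated", "negotiations"]
spelling_hits = find_hits("organi?ation", "organization and organisation")
# ["organization", "organisation"]
bracket_hits = find_hits("b[!ae]ll", "ball bell bill")
# ["bill"]
```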

Having a trusted eDiscovery partner on your side will help you through your next discovery case and make sure you have the resources and information you need for a successful outcome. When you work with Everest Discovery, we will be by your side to support you and your team through the entire process with our highly skilled and talented team of experts.