Harnessing the Power of Your Rapidly Growing Data Resources Through Information Governance


The adoption of effective information governance (IG) has become critical in law firms. Clients remain highly interested in how firms govern their information in light of increased regulatory requirements for personally identifiable information (PII) and protected health information (PHI). Information security has become more complex with lawyers’ increased use of mobile devices to access information and the growth of cyber-threats. The Law Firm Information Governance Symposium (Symposium) was established in 2012 as a think tank to give the legal industry a roadmap for addressing the impact of these and other trends on how law firms operate.

The Symposium Steering Committee, Work Group Participants and Iron Mountain are pleased to provide the legal community with this Big Data and What it Means to Your Firm report, a tandem report to the 2013 Building Law Firm Information Governance: Prime Your Key Processes report. In this report the authors offer insight into big data, one of three trends impacting law firm information governance. It is our hope that firms will use this report to infuse IG into their unique processes and culture, and gain the client service improvement, risk mitigation and cost containment benefits we believe result from effective information governance.


Gartner defines big data as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. This is the ubiquitous “Three Vs” concept of data that now dominates the landscape. In everyday terms big data can be defined as data sets so large and varied that they are difficult to manipulate and mine using traditional approaches and tools. Over time the definition has broadened to include types of transactional and system log data.

In the law firm environment, the biggest of the big data sets typically exist in the email, file share, eDiscovery and social media systems. We will discuss the range of repositories for both structured (e.g., transactional) and semi-structured (e.g., email) data. Generally, information containing the “holy trinity” of metadata (client, matter and employee numbers) can be viewed as structured in the law firm. “Dark Data” residing on years of backup tapes should also be singled out for special mention, as it can be viewed as a potentially toxic data dump.

A. Why Should Law Firms Care About Big Data?

Law firms are primarily in the knowledge business. As most Information Governance (IG) professionals would attest, a great deal of that knowledge lies in the vast amount of stored data found in the Document Management System (DMS), email, practice-specific systems and lesser-known repositories, such as file shares. However, as the volume of data grows and the diversity of data increases, gleaning knowledge from this vast store becomes a staggering challenge.

Big data technologies can help solve problems that have vexed IG professionals for years. Imagine being able to characterize and classify vast unstructured data stores that once required human eyes to sort and tag effectively. Tools and strategies now exist to tackle this big data task. Classifying data can lead directly to reductions in data volumes, moving firms closer to the IG nirvana of storing only relevant records and discarding the rest. Classifying data could also lead to greatly reduced risk, allowing firms to more reliably find relevant information in response to a client request or a legal hold. Less data, better classified, means less information management overhead.

Big data technologies hold the promise of automating classification of large masses of unstructured data. Purveyors of Concept-Based Auto Categorization technology to primarily litigation-oriented legal service providers believe that with training, firms will be able to automatically classify emails and documents regardless of language and structure. This technology is at the heart of the current predictive coding wave sweeping the litigation support industry. For more details on this topic, review the Predictive Coding for Information Governance report from the Building Law Firm Information Governance: Prime Your Key Processes report.

For the business analyst or strategist, big data allows firms to address challenges previously thought to be logistically impossible, such as understanding who knows whom in a large firm, detecting inefficiencies in a practice under price pressure or finding trends in the business that require quick response. Additionally, because big data’s strength is finding insights and correlations from diverse sets of data, it allows firms to address new and diverse challenges, with the only constraint being the imagination of those asking the questions.

But for the IG professional, using big data in a law firm is all about managing risk and unlocking the value of the information contained in data repositories.

What You Don't Know Can Kill You

Oftentimes structured data stores, such as DMSs, don’t classify data deeply enough. For example, few law firms can assert with certainty which documents contain regulated sensitive information, such as Personally Identifiable Information (PII), Protected Health Information (PHI) and credit card information. There are several commercially available products that specialize in finding such data, which is the first step in securing it so you don’t get hit with a breach problem under privacy regulations.
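
As a rough illustration of the idea behind such products, the following Python sketch scans text for strings shaped like U.S. Social Security numbers and payment card numbers. The patterns and the function names (find_sensitive, luhn_valid) are assumptions made for this example and do not describe any particular vendor's method. Pattern matching of this kind will produce both false positives and misses, so it is a starting point rather than a complete detector.

```python
import re

# Hypothetical patterns for illustration; commercial tools use far richer rules.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits pass the Luhn checksum used by payment cards."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def find_sensitive(text: str) -> dict:
    """Return candidate SSNs and card numbers found in a block of text."""
    ssns = SSN_RE.findall(text)
    cards = [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
    return {"ssn": ssns, "card": cards}
```

Running find_sensitive over the extracted text of each document yields a list of candidates that can then be reviewed and secured.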

What You Don't Need Can Cost You Money

The concept of redundant, outdated and trivial (ROT) documents and email is one familiar to IG professionals. The same sorts of analytical software that classify documents can help recommend ROT candidates for removal from the system. The more ROT is found, the more firms can reduce storage, backup, server and similar IT infrastructure costs.
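
To make the "redundant" part of ROT concrete, here is a minimal Python sketch that groups documents with identical content by comparing hashes. The function name find_duplicates and the in-memory dictionary of documents are assumptions for this example. It only detects exact, byte-for-byte copies; near-duplicates, outdated material and trivial material need other techniques.

```python
import hashlib
from collections import defaultdict


def find_duplicates(documents: dict) -> list:
    """Group document names whose content is byte-for-byte identical.

    `documents` maps a name to its content as bytes. Only groups with more
    than one member are returned, since those are the removal candidates.
    """
    by_digest = defaultdict(list)
    for name, content in documents.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]
```

Each returned group is a set of copies, of which all but one could be recommended for removal after review.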

What You Can't Learn Can Cost The Firm Clients

According to the 2013 Q1 Thomson Reuters Peer Monitor® Economic Index (PMI), demand for legal services dropped for the fifth time in seven quarters, demonstrating that law firms continue to be under tremendous client pressure to reduce their fees. While all law firms are aware of the need to provide lower fees to clients without jeopardizing profitability, very few have a handle on how to do so effectively. One place to start is to analyze your firm’s raw billing data to truly determine what particular types of matters historically cost the firm. Some firms are automatically applying task codes to time entries in order to allow more uniform analysis of this type.

But the real potential of big data comes when linking information from the Time and Billing system with information from other systems to find otherwise undetectable correlations. Imagine examining information from docketing, email and expense recovery systems to draw conclusions about how a process might be reengineered to become more efficient. Law firms looking to provide better value while still making a profit will find these types of approaches to mining big data very attractive.

They’d better. After all, corporate clients have access to big data tools and services, and they can use them to see what law firms should be billing for their services. These service providers use millions of time entries collected from a large number of law firms as part of the normal legal billing process and apply data analytics to provide insight into what clients should expect to pay for particular legal services, in context (e.g., jurisdiction). Read more about this in the case study section later.

One of the strengths of big data technology is that you don’t have to completely understand the questions before searching for the answers — which suggests that the applications of the technology are limitless. Potential examples of uses for big data technology in law firms include:

  • Analyze email reserves to figure out who knows whom.
  • Predict which lawyers or groups may be ready to depart, based on system activity.
  • Determine why client demand changes. (Example: M&A activity from a client dropped by 20 percent. Analysis of communication patterns suggests the firm upset the client.)
  • Understand the likely success rate of a proposed matter (already being developed by vendors in the space for MedMal cases).
  • Examine both internal and external data to forecast a (potential) client’s likelihood of buying certain services from certain law firms.

B. The Role Of Information Governance In Big Data

IG principles can certainly help a firm manage big data repositories just as they help with other unstructured data sources. The Building Law Firm Information Governance: Prime Your Key Processes report, which this paper is part of, describes how to build IG into 14 key processes. Below, we look at the subset of those processes that is most likely to be impacted by the big data movement. In general, however, the more that IG-based processes can improve the structure and definition of data, the more potentially useful that data becomes.

Administrative Department Information

Many of the potential applications of big data in law firms revolve around gaining strategic insights from administrative data. IG should influence the development of processes around how this data is captured and analyzed and how the results are memorialized.

Document Preservation And Mandated Destruction

It is hard to predict how law firms will use big data technology creatively, but it seems likely that at least some of these uses will generate records that could be subject to a future litigation hold, protective order or destruction order. IG must ensure that document preservation and destruction issues are considered as new big data systems are implemented.

Information Security

Big data stores could easily contain confidential data, such as information on mergers, or regulated information, such as PII or PHI. As with most new technologies, the capabilities of big data applications are significantly more advanced than the ability to secure the data contained within. IG must be vigilant about the types of information that may find their way into a big data system, and ensure that the proper security principles are applied.

Additionally, some big data “problems” lend themselves best to cloud solutions, meaning large amounts of data could exist outside of a firm’s secure perimeter. Whether the firm is hosting its own data in the cloud, or mining data from social media sites, IG professionals should ensure that proper risk-mitigation steps are in place, similar to any application where confidential or regulated information could end up in the cloud.

Records Information Management

Big data resurrects the question: “What is a record?” The content of many big data systems (for example, analysis and correlation of events from a variety of system log files) will involve data not found in typical documents. IG professionals will need to wrestle with what definitions of “a record” make sense in the context of big data, and should be aware that the answers will likely vary based on the technology, the application and perhaps even the location of the system.


Currently, most big data solutions focus more on gaining insights from the data than on what happens to the data after the process is complete. In this environment, correct disposition time frames for big data repositories are hard to identify and even harder to implement. IG professionals should help push the agenda by ensuring that there are procedures in place to identify the source data for big data solutions, and providing feedback on the controls necessary to properly enact the firm’s information lifecycle management processes. Note that this issue could be more complicated if the big data solution is hosted with a cloud provider.

C. How to Engage Big Data

As a starting point, define the goal as understanding where structured and semi-structured data exists in the enterprise. Then, introduce the M3 concept of managing (e.g., applying controls like legal holds, ethical walls or disposition guides), mining (e.g., gathering information to answer a question) and manipulating (e.g., presenting results of queries in context) data in place as the logical outcome of successful application of IG principles.

Creating a data map is the first step in identifying where information lives in the firm and understanding its DNA. This can be accomplished by separately indexing content and metadata of data objects in identified repositories, and then displaying them in “heat map” representations. There are a number of commercially available indexing applications that can handle this task. A three-dimensional look at the data can also be accomplished by performing this action over time, with the goal of understanding what it is, who is generating it and under what conditions.
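
A very simple version of such a data map can be sketched in Python for a file share. The function name build_data_map and the choice to group by top-level folder and file extension are assumptions for this example; commercial indexing tools also capture content, authorship and dates. The totals it returns are the kind of figures a "heat map" display would color by size.

```python
import os
from collections import Counter
from pathlib import Path


def build_data_map(root: str) -> Counter:
    """Total bytes per (top-level folder, file extension) under `root`."""
    totals = Counter()
    root_path = Path(root)
    for dirpath, _dirnames, filenames in os.walk(root_path):
        rel = Path(dirpath).relative_to(root_path)
        # Files sitting directly under the root are grouped under ".".
        area = rel.parts[0] if rel.parts else "."
        for name in filenames:
            path = Path(dirpath) / name
            ext = path.suffix.lower() or "(none)"
            totals[(area, ext)] += path.stat().st_size
    return totals
```

Running the same inventory on a schedule and comparing the totals adds the time dimension described above.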

The benefit of applying IG principles to this exercise is to make data reliably available for other interested parties to access. By understanding what data exists and where it is, it becomes easier to understand the context within which it was created in the first instance. As such, it becomes a more valuable resource that is much less susceptible to “noise” and the risk of misinterpretation. In effect, it goes from big data to “data under management.” These indexing tools can also be employed to identify data that may no longer be useful to the firm and data that can be classified as ROT — and to apply classification and context to the remaining information.

The role of IG is to make data available for other parties to access, mine and use in readable, usable form. Records Management is being asked to manage the changing view of what useful information is. It is also being asked to ease its movement from a relational, structured, database-driven world toward a semi-structured, social media-driven world, following the trend across industries served by law firms.

D. Who Will Join The Big Data Revolution

A big data offensive should have no trouble finding allies, since a phalanx of different departments within a law firm stand to benefit. For example:

Information Technology

Most obviously, IT can benefit from a reduction in infrastructure costs if big data is successfully used to remove ROT from the environment. Additionally, IT has problems that analysis of big data can solve, such as correlating the vast amounts of machine-generated logs to detect and alert about significant events, or drawing patterns out of help desk systems to determine areas for new education offerings.


Finance

There is a plethora of opportunity for the firm’s finance department. As a precursor to big data, as we define it here, Business Intelligence (BI) aims to provide visibility into what drives the financial performance of the business by slicing and dicing financial information. Big data can bring a whole new level of insight. For example, using big data tools to examine the firm’s vast store of time-card data could provide a key competitive advantage when pricing services. There are also opportunities to examine what competing law firms are doing with price, hourly rates and even work product using publicly available information and information available for purchase.

Business Development

If your business development department runs the firm’s website, they will be very interested in what big data can tell them about how the website is used, how people click their way through it and what patterns most often lead to an inquiry about services. Using internal data about the type of work the firm does for each client, and external data about the variety of legal services other law firms perform for the same client, business development might be better able to guide cross-selling efforts and target resources appropriately. A simpler effort might be to mine the email system to determine which attorneys actually correspond regularly with which client contacts.
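
The email-mining idea can be sketched in a few lines of Python using the standard library's email parser. The function name contact_counts, the firm_domain parameter and the rule that anyone outside that domain is a "contact" are assumptions for this example. A real effort would also need to handle distribution lists, aliases and privacy constraints.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def contact_counts(raw_messages, firm_domain):
    """Count messages exchanged between each firm address and each outside address."""
    pairs = Counter()
    suffix = "@" + firm_domain
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = [a.lower() for _n, a in getaddresses(msg.get_all("From", []))]
        recipients = [
            a.lower()
            for _n, a in getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        ]
        for s in senders:
            for r in recipients:
                s_internal = s.endswith(suffix)
                r_internal = r.endswith(suffix)
                # Only count traffic that crosses the firm boundary.
                if s_internal != r_internal:
                    attorney, contact = (s, r) if s_internal else (r, s)
                    pairs[(attorney, contact)] += 1
    return pairs
```

Sorting the result with pairs.most_common() shows which attorney and contact pairs correspond most often.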

Knowledge Management

There are exciting opportunities for the firm’s Knowledge Management (KM) employees at the intersection of Predictive Analytics, big data and Enterprise Search. For example, imagine using the vast information from a firm’s DMS activity log to determine which documents are most frequently utilized by attorneys as a starting point for new work product. It might even be possible to analyze the documents themselves and determine which ones are most often the sources of cut-and-paste operations, or to correlate this information with raw time-entry information to determine which document-creation paths produced the most efficient outcomes.

Additionally, KM could compare results available within a variety of docket systems with internal information from the DMS to draw correlations around what contract clauses were the most successfully litigated, or the rate of success of summary judgments based on industry, case type or other classifications of interest. These correlations can help guide future successes.

E. How to Implement Big Data Solutions

So far we’ve been examining the “what, why and who” of big data, but you may be asking yourself, “How do I do all this?” Let’s explore some of the available tools and techniques.

Big Data Before The Big Data Craze

Prior to the current big data revolution, there were many tried and true solutions for pulling large data sets from disparate databases. These tools are now well defined and mature, and they work well to solve some of the problems in the big data solution space.

When the data is structured and there is knowledge about the sorts of queries the business wants to run against the data, standard BI tools, such as Data Warehouses and online analytical processing (OLAP) cubes, are excellent solutions. In standard BI applications, it is necessary to first identify the sources of structured data that are of interest (often from SQL databases), then determine the best way to relate the data. Typically, IT will create an Extract, Transform and Load (ETL) process to automate the population of a data warehouse. As the name implies, the process extracts the data from the source databases, transforms it so it can be queried easily and loads it into a data warehouse platform. Once in the platform, the data is available for query and reporting purposes. Usually, the ETL is run on a regular schedule, but the data is not updated in real time. This process greatly reduces the time it takes to run reports and the impact such queries have on the operation of the source database systems.
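
The extract, transform and load steps can be illustrated with a minimal Python sketch using the built-in sqlite3 module. The table names (time_entries, matter_billing), the column names and the function run_etl are hypothetical and chosen only for this example. A production ETL would typically use a dedicated tool and run on a schedule.

```python
import sqlite3


def run_etl(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
    """Copy time entries from a source database into a warehouse summary table."""
    # Extract: pull the raw rows from the (hypothetical) billing table.
    rows = source.execute(
        "SELECT client_id, matter_id, hours, rate FROM time_entries"
    ).fetchall()

    # Transform: compute the billed amount and total it per client and matter.
    totals = {}
    for client_id, matter_id, hours, rate in rows:
        key = (client_id, matter_id)
        totals[key] = totals.get(key, 0.0) + hours * rate

    # Load: rebuild the warehouse table so each run reflects the latest extract.
    warehouse.execute("DROP TABLE IF EXISTS matter_billing")
    warehouse.execute(
        "CREATE TABLE matter_billing (client_id TEXT, matter_id TEXT, billed REAL)"
    )
    warehouse.executemany(
        "INSERT INTO matter_billing VALUES (?, ?, ?)",
        [(c, m, amount) for (c, m), amount in totals.items()],
    )
    warehouse.commit()
    return len(totals)
```

Reports then query matter_billing in the warehouse, so they never touch the source billing database.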

In a law firm, client, matter and partner profitability are examples of typical BI applications, and there are several products that can be used to automate the BI process for financial reporting within law firms.

If you have used a spreadsheet, you are familiar with rows and columns. An OLAP cube (see footnote 1) can be thought of as a multidimensional spreadsheet. A data analyst might set up an OLAP cube on a data warehouse using a tool, such as Microsoft SQL Server® Analysis Services (SSAS), with predefined dimensions that allow the business to drill down on information it needs to know. Using our example above, if the dimensions included partner, location, practice group and industry, the OLAP cube would facilitate a report on profitability by the client’s industry, broken down by practice group and then by partner. It would also allow a data analyst to drill down into the cube to examine a specific combination of values (e.g., profitability of clients in the ice cream industry with relationship partners in the Alaska office).
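
The two operations described here, summarizing by chosen dimensions and drilling down to a specific combination of values, can be imitated in plain Python. The function names roll_up and drill_down and the sample field names are assumptions for this example. A real OLAP engine such as SSAS precomputes and stores these aggregates rather than scanning rows on each request.

```python
from collections import defaultdict


def roll_up(facts, dimensions, measure="profit"):
    """Sum `measure` over every combination of the chosen dimensions."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


def drill_down(facts, **filters):
    """Keep only the rows matching every given dimension value."""
    return [row for row in facts if all(row[k] == v for k, v in filters.items())]


# Made-up rows: one per client engagement, with four dimensions and one measure.
facts = [
    {"partner": "Ng", "location": "Alaska", "practice_group": "Corporate",
     "industry": "Ice cream", "profit": 120.0},
    {"partner": "Ng", "location": "Alaska", "practice_group": "Litigation",
     "industry": "Ice cream", "profit": 80.0},
    {"partner": "Ortiz", "location": "Texas", "practice_group": "Corporate",
     "industry": "Energy", "profit": 200.0},
]
```

For instance, roll_up(facts, ["industry", "practice_group", "partner"]) gives profitability by industry, then practice group, then partner, while drill_down(facts, industry="Ice cream", location="Alaska") isolates the ice cream clients handled from the Alaska office.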

Panning For Big Data Gold

Admittedly, BI solutions have their roots in the financial reporting world, and while some of the applications discussed in this paper are financial problems, many are not. Traditional BI solutions may not be the best analytical tools to use if the subject data isn’t highly structured and clean, and if there isn’t a clear understanding of the queries intended to run against the data. For such problems, a new set of tools has emerged.

The Big Whoop Around Hadoop

It is hard to talk about big data without eventually talking about Hadoop®. Hadoop is an open-source framework originally developed at Yahoo to improve the scalability of a search-engine project. The strength of the Hadoop framework is the ability to carve up work into small pieces, assign each piece of work amongst many machines (potentially thousands) and then manage the process of reassembling the pieces as the work completes. The framework is a collection of many technologies, all open source, the most famous of which is MapReduce.

Originally developed by Google, MapReduce performs two sets of functions. The Mapper part of MapReduce is responsible for breaking the data down into something meaningful, which can be expressed as key-value pairs. Mapper works on many similarly sized chunks of the data at the same time, which allows it to handle large amounts of data and complex jobs. The Reduce function takes the results of all of the Mapper functions and collates the results.

Hadoop is a powerful and viable tool, especially when combined with cloud computing services, such as Amazon® Elastic MapReduce. In Amazon’s cloud, it is possible to create hundreds of instances of a Hadoop node to work on the same job and pay only for the computing time and storage you use.

To better visualize what Hadoop does, imagine a text file that contains 10 million narratives from time card entries. The goal is to determine how many time cards contain a set of keywords, such as “conference,” “drafting” and “deposition,” and the amount of time charged on each time card. The Hadoop framework divides the file into chunks in order to take advantage of all nodes available to it. MapReduce then goes to work. The Mapper function runs on each node, crunching the data assigned to that node to determine the count of keywords. The result from a single node, a set of key-value pairs, might look like this: (Conference, 2303), (Drafting, 512), (Deposition, 343).

Once all nodes are finished, the Reduce function of MapReduce takes the results from each node and combines them to produce one result set for the entire text file. If the results looked like this: (Conference, 127304), (Drafting, 22512), (Deposition, 155343), you might conclude that the lawyers in your firm should spend more time drafting and less in meetings!
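
The keyword-counting example above can be mimicked in ordinary Python to show the shape of the map and reduce steps. This sketch runs in a single process and does not use Hadoop; the function names (mapper, reducer, keyword_counts) and the chunk_size parameter are assumptions for this example. In a real cluster, each call to mapper would run on a different node.

```python
from collections import Counter
from functools import reduce

KEYWORDS = ("conference", "drafting", "deposition")


def mapper(chunk):
    """Count, for one chunk of narratives, how many contain each keyword."""
    counts = Counter()
    for narrative in chunk:
        text = narrative.lower()
        for keyword in KEYWORDS:
            if keyword in text:
                counts[keyword] += 1
    return counts


def reducer(left, right):
    """Merge the key-value counts produced by two mappers."""
    return left + right


def keyword_counts(narratives, chunk_size=2):
    """Split the narratives into chunks, map each chunk, then reduce the results."""
    chunks = [
        narratives[i:i + chunk_size]
        for i in range(0, len(narratives), chunk_size)
    ]
    return reduce(reducer, map(mapper, chunks), Counter())
```

Because the reducer only adds counts together, the order in which chunks finish does not affect the final totals, which is what makes the work easy to spread across many machines.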

The challenge with Hadoop is that someone with programming skills must write a program to define how to crunch the data and arrive at the key-value sets. These programming functions may be written in Java®, Python® or special big data languages like Pig. For many law firms, finding and training such resources could prove challenging. However, there are alternatives.

Big Data Alchemy

Certainly, there are a number of companies and consultants that can provide Hadoop expertise, but it is possible to implement big data solutions without using Hadoop or other open-source solutions. For example, several companies offer out-of-the-box solutions aimed at solving specific IG problems.

These solutions break down the big data challenge into basic steps, starting with finding unstructured data and ending with the disposition of all data identified. They frequently offer connectors to common data repositories, such as SharePoint® and Microsoft Exchange Server®, and feature their own proprietary indexing and search engines. By reducing the solution space of big data, and focusing on a specific problem set, these tools offer the opportunity to implement meaningful business solutions in a more manageable time frame — and most likely at a greater cost. Interestingly, many of these solutions can trace their roots back to the eDiscovery provider industry and focus on information processes, such as litigation holds, auto-classification of email and semi-structured information with an IG perspective.

The Splunk® solution focuses on a different aspect of big data. Its platform is aimed at collecting, indexing and making sense of the vast amount of data that computers generate during their operations. Common applications include monitoring for system failures in the environment, examining logs for evidence of suspicious behavior and even an eDiscovery app aimed at machine-generated data. Part of the approach includes an app store full of user contributions that can customize your Splunk installation. There is even an app called “Finding Apps is Hard as He**” that analyzes your data and suggests useful apps based on what other installations are doing with similar log data.

There are many other vendors and tools, including a slew of NoSQL applications, but hopefully the preceding provides a sense of the options available to tackle big data analytics.

F. Case Studies Using Big Data for the Legal Industry

Not surprisingly, the fruit from harvesting and processing big data has already been canned in the legal industry, both by law firms and providers in the space. A few interesting cases are presented in the following subsections.

Littler CaseSmart®
Leveraging the significant amount of data gained from the hundreds of administrative actions (primarily EEOC) handled by the firm, Littler Mendelson created a workflow system incorporating templates that bring efficiencies to the process by breaking down every step in the handling of these actions. The process was captured in proprietary software, which then allows Littler to offer its clients with large numbers of these types of cases an “all you can eat” pricing model.

The result is a big success because it provides clients with the opportunity to stabilize their legal spend. What’s more, through a client-facing dashboard that shows the status of each case and allows click-through to every detail, the client is able to analyze cases to spot trends in violations — which can afford them the opportunity to take corrective action before violations occur.

Wilson Sonsini’s Convertible Note Term Sheet Generator
Wilson Sonsini Goodrich & Rosati (WSGR) has been the premier venture financing counsel for tech startups and similar companies for quite some time. This expertise and the accompanying trove of information associated with hundreds of these deals positioned the firm to mine and leverage data for future deals. The effort originated as an internal tool for WSGR attorneys to rapidly generate draft term sheets that they would then polish and deliver to clients.

Using the information from an online questionnaire completed by the company seeking financing (usually a tech startup), the Generator creates a venture financing term sheet based on those responses, which can provide immediate transparency to fledgling companies in the fast-moving tech sector. The tool also provides an informational component, with tutorials and definitions of financing terms. It is one part of a suite of document-automation tools used by WSGR to generate start-up and venture financing-related documents.

Quantitative Legal Prediction by Lex Machina
The question: “What are the odds of winning this patent litigation?” The answer: Use information from court documents to predict the outcome. Lex Machina, a company with its origins in Stanford University’s IP Litigation Clearinghouse, has spent the last 10 years building an effective case-prediction database in patent litigation, which is both a high-value and high-cost area for corporations.

Its database contains information from more than 128,000 IP cases and more than 134,000 attorney records, in addition to information on the judges and law firms involved in those cases. It took data scientists, attorneys and engineers more than 100,000 hours to conform and normalize this information to make it available for analysis. This effort was definitely worthwhile, since corporations spend significant amounts of money to both procure and protect intellectual property because it is often a significant portion of their value. In fact, according to a report by the Federal Judicial Center, the average cost of taking a patent case to trial is around $5 million per patent. In addition to commercial clients, Lex Machina also makes the information available for use by policy makers, media, academics and other purveyors of the public interest.

Aggregating Law Firm Billing Information for Optimal Pricing for Legal Services by TyMetrix
Starting out as a vendor of e-billing and matter management systems for corporate law departments, TyMetrix created its collection of data on billings and other metrics associated with legal matters in 2009. With its customers’ permission, it has since accumulated a warehouse of data from more than $25 billion worth of legal spending. TyMetrix then uses analytics to mine the information for use in its products.

The TyMetrix Rate Driver mobile app can calculate average hourly legal rates for lawyers across the United States based on such factors as law firm size, geographic location, attorney level and experience and area of specialty. The data for the app comes from another TyMetrix product, the Real Rate Report, which benchmarks law firm rates and identifies the factors that drive those rates. TyMetrix also offers a free app for mobile devices that uses Real Rate Report data to serve up average hourly legal rates of law firms across the country. The ultimate goal of this data is for TyMetrix to provide its clients with the tools to effectively run “what if” scenarios to forecast future cases in order to manage legal costs.


At this point, it should be clear that utilizing big data effectively will accelerate the transformation that is already underway in the legal industry. The technology has the ability to impact areas as diverse as pricing value-based matters correctly and finding and correctly classifying dark data and ROT. It may be the only way law firms will solve the data explosion caused by social media, and it could provide a most useful tool in the fight against hackers. And, many firms will be able to realize the oft-promised institutionalization of certain clients through both effective cross-selling and competitive, holistic pricing models.

For big data to reach its potential in law firms, IG professionals must play a key role not only by utilizing the technology, but by ensuring that others observe sound IG practices as they develop big data solutions. With IG leading the way, only the imagination limits where big data can take the legal industry.

Appendix A: Checklist For Assessing Data At Your Firm

A. Perform A General Assessment:

  • Inventory data in identified repositories. Use available resources to assess data (i.e., create a high-level data map and categorize data as either structured [identifiable by client, matter and attorney numbers, or contained within a database structure] or semi-structured). Identify sensitive information (PII, credit card numbers, etc.) and perform corrective actions (delete, mask, protect) as an immediate measure.
  • Identify duplicates and perform corrective actions (delete, archive, etc.). (This can be done over time, but should be done before any serious analysis of the data is done.)
  • Decide which information should be moved and which can be managed in place in an ongoing fashion.
  • Establish a solid categorization structure (taxonomy) to mine and use data more effectively.

B. Address The Issues Related To Structured Data, Including:

  • Examine data integrity. Use heat maps to detect trends in how your lawyers are working and to understand the context.
  • Increase your competitive intelligence through data mining. Have records managers assign retention schedules, clean up data and organize data-mining programs.

C. Address The Issues Related To Semi-Structured Data, Including:

  • Find out where it is, what it is, what shape it is in (quality of metadata and content) and who created it using what processes. Several commercially available eDiscovery-focused tools exist that can be deployed to this purpose.
  • Determine the level of ROT.
  • Have IT work with the RM group to identify what is in the data to prevent it from clogging up the system. (In most enterprises, including law firms, more than 40 percent of information is ROT, and of these characteristics, trivial is the most difficult to determine.)
  • Find a major pain point and focus on practical solutions, such as disk-space reduction, to encourage partner support.
  • Establish change-management procedures by selecting the type of change you want to promote in order to produce more usable information from current processes. Then, decide how to communicate your procedures, set priorities and identify champions (i.e., people who will benefit from both the increased efficiency of the process and the usable information output).

D. Follow This General Guidance As You Analyze Your Data:

  • Create and apply a solid categorization/structure scheme to use data more effectively.
  • Work in new ways by cross-pollinating with people you haven’t worked with in the past, in order to understand how and why they create and use information in their particular areas and how it is stored.
  • Look at connecting silos of information. Social media may provide connections and better access to information that someone else has created (e.g., use Twitter as a newsfeed, employing hash tags to follow areas of interest). Begin building an ecosystem of shared information.
  • Obtain ideas for topics to analyze (or, unasked questions to be answered) from primarily business-focused personnel, such as rainmaker partners, practice group leaders, marketing and business development, etc.

1 Note that “cube” is somewhat of a misnomer because OLAP cubes are often more than 3 dimensions and rarely have equal sides.