Now that I’m officially in the blogosphere, I was out giving up a little “link-love” (blog speak) and came across the blog entry by George Rudoy on EDDUpdate, which touches upon a future consideration we should all have on our radar screens.
Blogs should be brief, so follow the link above to read his original post and the follow-up comments that spurred this blog entry.
While On The Mark readers are probably mostly technical and industry insiders, a quick baseline may help anyone not quite up to speed on Unicode.
What is Unicode? That’s a fair question. And a long answer. I’ll take the easy path and suggest you familiarize yourself with the efforts of The Unicode Consortium.
Think of all the written languages in the world, each made up of its own letters or symbols, and then think of all the problems computers, programs and people have integrating them. Some very smart people identified the need and devised a standard that makes sense across the entire spectrum. Here’s a snippet of their definition:
“Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.”
Scintillating stuff, right? OK, so here’s where it gets interesting for us. Just as corporations are worldwide and diverse, so are the e-discovery factors that make it so challenging. The old tried-and-true ways we process and review legal documents are not going to cut it once you put multilingual, multinational factors into play. Throw into the mix EU privacy regulation, US privacy laws and any other laws and regulations that may apply to your data scenario, and you can see this is a complex, multifaceted equation.
Just as George Rudoy’s post was informative, the comments (not mine) made to his post are insightful. I’ll paste some here for continuity and ease of access.
Georgy Pados of Shearman and Sterling opined:
“While no software package is perfect, more will be required to deal with multiple language data than recognizing Unicode characters and throwing analytics on top of it.
“It’s key to have the different components play together hand in hand throughout the entire process.”
“-auto-categorization: how sophisticated this really is? Majority of the cases bring mixed / multi characterset challenges: can the threshold be set in the software to manage the language categorization algorithms to order and score the results ?
(can you set the language score order if only the parent email is in Italian but the attachment is in English what about if the email thread is partly Italian/German (30%/ 60%) but the attachment drafts are 100% English)”
“how does the system handle paper scanned / OCR extracted content ? Can you seamlessly incorporate OCR results into the indexer (Do you have unicode OCR/OWR engines plugged into the indexer)?”
On The Mark: Georgy makes good points here, and his mixed-language email thread is a useful concrete example.
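For the technically curious, here is a rough Python sketch of the kind of threshold-and-ordering control Georgy is asking about. It is only an illustration, not how any particular product works: the function name `rank_languages`, the 25% default threshold and the "other" bucket for the unspecified remainder of the thread are all my own assumptions, and the language shares are typed in by hand rather than detected.

```python
from typing import Dict, List, Tuple


def rank_languages(shares: Dict[str, float],
                   threshold: float = 0.25) -> List[Tuple[str, float]]:
    """Keep languages whose share meets the threshold, highest share first.

    Hypothetical helper: a real system would compute the shares with a
    language detector instead of receiving them ready-made.
    """
    kept = [(lang, share) for lang, share in shares.items() if share >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hand-entered shares loosely following Georgy's example.
# The "other" bucket is an assumption to account for the remaining 10%.
email_thread = {"it": 0.30, "de": 0.60, "other": 0.10}
attachment = {"en": 1.00}

print(rank_languages(email_thread))  # [('de', 0.6), ('it', 0.3)]
print(rank_languages(attachment))    # [('en', 1.0)]
```

Raising the threshold to 0.5 would tag the thread as German only, which is exactly the sort of knob a reviewer might want to control.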
Another comment, by my learned colleague Chuck William, CTO of MetaLINCS:
Full support for the languages of the world involves many complex technical issues.
First, all of the different character coding systems used around the world. These all need to be recognized and transformed to Unicode, wherever these encodings might occur in content and metadata.
The next challenge is tokenizing the content, i.e. breaking it up into words, numbers and other sensible units. As westerners we think this is easy– just look for the spaces and other punctuation. But take a look at languages like Chinese and Japanese, where there are typically no spaces, and you’ll begin to appreciate the problem.
Then we get into the whole area of linguistic analysis, starting from something simple like stemming (e.g., “going” –> “go”), moving into more complex features like identifying noun phrases (which form the basis for many notions of “concept” used in EDD). These functions are all language-specific.
A good system needs to identify the language(s) in which each document has been written, both for internal reasons such as applying the correct linguistic analysis rules, and for the benefit of users who may wish to search, review or translate content in a specific language and so need to identify the documents using that language.
Integration with the UI is another challenging area. Consider query term highlighting and entering queries in a different language. Right-to-left languages like Arabic and Hebrew present unique challenges in this area.
Fortunately, these are all well-understood technical issues and good products exist that address them effectively.
On The Mark: Chuck also makes really good points.
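To make a few of those points concrete, here is a small Python sketch using only the standard library. It is my own illustration, not anything from Chuck or MetaLINCS: the sample words and the helper name `has_rtl` are assumptions, and it presumes you already know which legacy encoding each piece of data uses (detecting an unknown encoding is a harder problem not shown here).

```python
import unicodedata

# 1. Unicode gives every character one number (a "code point"),
#    whatever the language: Latin, Hebrew, Chinese.
for ch in "Aש中":
    print(ch, hex(ord(ch)))

# 2. The same Russian word stored in two legacy encodings is two
#    different byte sequences...
word = "право"
as_cp1251 = word.encode("cp1251")
as_koi8r = word.encode("koi8_r")
# ...but decoding each with its own encoding gives identical Unicode text.
same_text = as_cp1251.decode("cp1251") == as_koi8r.decode("koi8_r")

# 3. Splitting on spaces works for English but returns a whole
#    Chinese sentence as a single "word".
english_tokens = "review every legal document".split()
chinese_tokens = "我喜欢法律文件".split()


# 4. A simple check for right-to-left text, using the bidirectional
#    class Unicode assigns to each character ("R" covers Hebrew letters,
#    "AL" covers Arabic letters).
def has_rtl(text: str) -> bool:
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


print(as_cp1251 != as_koi8r, same_text)
print(len(english_tokens), len(chinese_tokens))
print(has_rtl("שלום"), has_rtl("hello"))
```

Proper tokenizing of Chinese or Japanese needs language-specific segmentation, which is well beyond a whitespace split; the point of step 3 is only to show why the naive approach falls over.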
Our industry is witnessing an evolution in the development of enterprise software for corporations and law firms alike. Yet few companies have demonstrated leadership in the thoughtful and proper implementation of Unicode foreign-language detection, functionality and foreign-language data analytics for in-house installation. Although I’ve checked my “evangelist” hat at the door for this post, know that I am quite confident of one particular offering “in this space.”