Karim Keshavjee, a Toronto physician and digital health consultant, crunches mountains of data from 500 doctors to figure out how to improve patient treatment. But it’s a frustrating slog to get a computer to decipher all the misspellings, abbreviations, and notes written in unintelligible medical shorthand.

For example, “smoking information is very hard to parse,” Keshavjee said. “If you read the records, you understand right away what the doctor meant. But good luck trying to make a computer understand. There’s ‘never smoked’ and ‘smoking = 0.’ How many cigarettes does a patient smoke? That’s impossible to figure out.”
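To make the difficulty concrete, here is a minimal Python sketch of how a program might try to normalize smoking notes. The patterns and the `parse_smoking` helper are assumptions made up for illustration, not anything Keshavjee describes, and real records vary far more than a handful of rules can cover.

```python
import re

# Hypothetical patterns for illustration; real records vary far more widely.
NEVER_PATTERNS = [
    r"\bnever\s+smoked\b",
    r"\bnon-?smoker\b",
    r"\bsmoking\s*=\s*0\b",
]
CIGS_PER_DAY = re.compile(
    r"(\d+)\s*(?:cigs?|cigarettes)\s*(?:/|per|a)\s*day", re.IGNORECASE
)


def parse_smoking(note: str) -> dict:
    """Map a free-text smoking note to a status and, if stated, cigarettes per day."""
    text = note.strip().lower()
    if any(re.search(p, text) for p in NEVER_PATTERNS):
        return {"status": "never", "cigs_per_day": 0}
    match = CIGS_PER_DAY.search(text)
    if match:
        return {"status": "current", "cigs_per_day": int(match.group(1))}
    # Anything else is left for a human to review rather than guessed.
    return {"status": "unknown", "cigs_per_day": None}
```

Notes such as “never smoked” and “smoking = 0” both land in the same bucket, but anything the rules do not anticipate falls through to “unknown,” which is exactly the gap the quote points to.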

The hype around slicing and dicing massive amounts of data, or big data, makes it sound so easy: Just plug a library’s worth of information into a computer and wait for valuable insights to pour out about how to speed up an auto assembly line, get online shoppers to buy more sneakers, or fight cancer. The reality is much more complicated. Data is inevitably “dirty” thanks to obsolete, inaccurate, and missing information. Cleaning it up is an increasingly important and overlooked job that can help prevent costly mistakes.


Although techniques are improving all the time, scrubbing data can only accomplish so much. Even when dealing with a relatively tidy set of information, getting useful results can be arduous and time-consuming.

“I tell my clients that the world is messy and dirty,” said Josh Sullivan, a vice president at business consulting firm Booz Allen who handles data crunching for clients. “There are no clean data sets.”

Data analysts start by looking for information that’s out of the norm. Because the volume of data is so huge, they typically hand the job over to software that automatically sifts through numbers and text to look for anything unusual that needs further review. Over time, computers can improve their accuracy in spotting what belongs and what doesn’t. They can also better understand what words and phrases mean by clustering similar examples together and then grading their interpretations for accuracy.
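One simple way software can flag numbers that are “out of the norm” is a modified z-score built on the median absolute deviation. The Python sketch below is only an illustration under that assumption; the article does not say which method analysts actually use, and the `flag_outliers` name and the blood pressure readings are invented for the example.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # No spread to measure against; nothing is flagged.
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Made-up systolic readings; 1200 is probably a typo for 120.
readings = [118, 122, 125, 120, 119, 1200, 121]
suspicious = flag_outliers(readings)  # indices that need human review
```

The function only points at candidates. Whether a flagged value is an error or a real finding still takes a person to decide.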

“The approach is easy and straightforward, but training your models can take weeks and weeks,” Sullivan said.

A constellation of companies offer software and services for cleaning data. They range from technology giants like IBM and SAP to big data and analytics specialists like Cloudera and Talend Open Studio. A legion of start-ups are also trying to get a toehold as data janitors, including Trifacta, Tamr, and Paxata.

Healthcare, with all its dirty data, is one of the toughest industries for big data technology. Electronic health records make medical information increasingly easy to dump into computers, but there’s still a lot of room for improvement before researchers, pharmaceutical companies and hospital business analysts can slice and dice all the information they want.

Keshavjee, the doctor and CEO of InfoClin, a health data consulting firm, spends his days trying to tease out ways to improve patient treatment by sifting through tens of thousands of electronic medical records. Obstacles pop up all the time.

Many doctors neglect to note a patient’s blood pressure in their medical records, something that no amount of data cleaning can fix. Simply determining what ails patients, based on what’s in their files, is surprisingly difficult for computers. Doctors may enter the proper code for diabetes without clearly indicating whether it’s the patient who has the disease or a family member. Or they may just enter “insulin” without mentioning the underlying diagnosis because, to them, it’s obvious.

Physicians also use a lot of idiosyncratic shorthand for medications, illnesses and basic patient details. Deciphering it takes a lot of head scratching for humans and is nearly impossible for a computer. For example, Keshavjee came across one doctor who used the abbreviation “gpa.” Only after coming across a variation, “gma,” did he finally solve the puzzle: they were shorthand for “grandpa” and “grandma.”

“It took a while to figure that one out,” he said.

Ultimately, Keshavjee said one of the only ways to solve the problem of dirty data in medical records is “data discipline.” Doctors need to be trained to enter information correctly so that cleaning up after them is less of a chore. Incorporating something like Google’s helpful tool that suggests how to spell words as users type them would be a great addition for electronic medical records, he said. Computers can learn to pick out spelling errors, but minimizing the need is a step in the right direction.
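As a rough illustration of what a spelling aid in a records system could look like, the Python sketch below matches a typed term against a small vocabulary using the standard library’s `difflib`. This is not how Google’s tool works, which the article does not describe. The vocabulary, the cutoff and the `suggest` helper are assumptions chosen for the example.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; a real system would use a curated clinical term list.
VOCABULARY = ["diabetes", "hypertension", "insulin", "metformin", "asthma"]


def suggest(term: str, limit: int = 3) -> list[str]:
    """Return up to `limit` vocabulary terms that closely resemble `term`."""
    return get_close_matches(term.lower(), VOCABULARY, n=limit, cutoff=0.75)
```

Calling `suggest("diabetis")` would offer “diabetes” as the likely intended word, while a term with no close match returns an empty list and is left alone.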

Another of Keshavjee’s suggestions is to create medical records with more standardized fields. A computer would then know where to look for specific information, reducing the chance of error. Of course, doing so is not as easy as it sounds because many patients suffer from multiple illnesses, he said. A standard form would have to be flexible enough to take such complications into account.
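A standardized record of this kind might be sketched in Python as follows. The field names, the `Condition` and `PatientRecord` classes and the placeholder codes are all hypothetical, meant only to show how a form can hold several conditions at once and keep family history separate from the patient’s own diagnoses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Condition:
    code: str                 # diagnosis code as entered by the clinician
    subject: str = "patient"  # "patient" or "family", so family history is not read as a diagnosis


@dataclass
class PatientRecord:
    patient_id: str
    systolic_bp: Optional[int] = None  # None marks a reading that was never recorded
    conditions: list[Condition] = field(default_factory=list)
    free_text: str = ""                # notes that do not fit a fixed field

    def patient_conditions(self) -> list[str]:
        """Codes that apply to the patient, excluding family history."""
        return [c.code for c in self.conditions if c.subject == "patient"]


# Placeholder codes for illustration only.
record = PatientRecord(
    "p-001",
    conditions=[Condition("E11", subject="family"), Condition("I10")],
)
own_codes = record.patient_conditions()
```

Because each condition carries its own subject, a program no longer has to guess whether a diabetes code refers to the patient or a relative, and the list can grow for patients with multiple illnesses.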


Still, doctors would need to be able to jot down more free-form electronic notes that could never fit in a small box. Nuance like why a patient fell, for example, and not just the injury suffered, is critical for research. But software is hit and miss in understanding free-form writing without context. Humans searching by keyword may do a better job, but they still inevitably miss many relevant records.

Of course, in some cases, what appears to be dirty data really isn’t. Sullivan, from Booz Allen, gave the example of the time his team was analyzing demographic information about customers for a luxury hotel chain and came across data showing that teens from a wealthy Middle Eastern country were frequent guests.

“There were a whole group of 17-year-olds staying at the properties worldwide,” Sullivan said. “We thought, ‘That can’t be true.’”

But after some digging, they found that the information was, in fact, correct. The hotel had legions of young customers that it didn’t even realize were there, and had never done anything to market to them. All guests under 22 were automatically logged as “low-income” in the company’s computers. Hotel executives had never considered the possibility of teens with deep pockets.

“I think it’s harder to build models if you don’t have outliers,” Sullivan said.

Even when data is clearly dirty, it can sometimes be put to good use. Take the example, again, of Google’s spelling suggestion technology. It automatically recognizes misspelled words and offers alternative spellings. It’s only possible because Google has collected millions and perhaps billions of misspelled queries over the years. Instead of garbage, the dirty data is an opportunity.



Ultimately, humans, and not machines, draw conclusions from the data they crunch. Computers can sort through millions of documents, but they can’t interpret the findings. Cleaning data is just one step in a long trial-and-error process to get to that point. Big data, for all its hype about its ability to lift business profits and help humanity, is a big headache.

“The idea of failure is completely different in data science,” Sullivan said. “If they don’t fail 10 or 12 times a day to get to where they should be, they’re not doing it right.”