This Post will help you to integrate tesseract OCR integration with alfresco.
What is OCR...?
Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded/computer-readable text. It is widely used as a form of data entry from some sort of original paper data source, whether passport documents, invoices, bank statement, receipts, business card, mail, or any number of printed records. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed on-line, and used in machine processes such as machine translation, text-to-speech, key data extraction and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.
by doing this integration with alfresco and content will be indexed.
Required things
1)Alfresco2)Tesseract OCR Software
installing Tesstract OCR:
Download tesseract from (https://code.google.com/p/tesseract-ocr/downloads/list)and install it by double click on tesseract-ocr-setup-3.02.02.exe.
After installing tesseract in system "C:\Program Files (x86)\Tesseract-OCR" path will be created with installed Tesseract OCR.
Changes made to be done in alfresco
Files to be added.
1)OCR.bat2)ocrpng-transform-context.xml3)ocrjpeg-transform-context.xml4)ocrtiff-transform-context.xml5)alfresco-tesseract-search.jar6)ocrtransform.log
1)OCR.bat
REM to see what happens
echo from %1 to %2 >>C:\tmp\ocrtransform.log
copy /Y %1 C:\TMP\%~n1%~x1
REM call tesseract and redirect output to $TARGET
"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe" C:\TMP\%~n1%~x1 %~d2%~p2%~n2 -l eng
del C:\TMP\%~n1%~x1
echo from %1 to %2 >>C:\tmp\ocrtransform.log
copy /Y %1 C:\TMP\%~n1%~x1
REM call tesseract and redirect output to $TARGET
"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe" C:\TMP\%~n1%~x1 %~d2%~p2%~n2 -l eng
del C:\TMP\%~n1%~x1
this batch script is to placed in your alfresco in this path "C:\Alfresco"
this batch script will send the the uploaded file to Tesseract ocr to do actual OCR, copies the log to the ocrtransform.log,Tesseract OCR send content to alfresco and we can change the actual language which in the above file default given eng, and we can give multiple languages to this.
These are the transformation xml will be added in "C:\Alfresco\tomcat\shared\classes\alfresco\extension"
2)ocrpng-transform-context.xml
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans default-lazy-init="false" default-autowire="no" default-dependency-check="none">
<bean id="transformer.worker.ocr.jpeg" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker" lazy-init="default" autowire="default" dependency-check="default">
<property name="mimetypeService">
<ref bean="mimetypeService" />
</property>
<property name="checkCommand">
<bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
<property name="commandsAndArguments">
<map>
<entry key="Windows.*">
<list>
<value>C:\Windows\System32\cmd.exe</value>
<value>/C</value>
<value>dir c:\Alfresco\ocr.bat</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1</value>
</property>
</bean>
</property>
<property name="transformCommand">
<bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
<property name="commandsAndArguments">
<map>
<entry key="Windows.*">
<list>
<value>C:\Windows\System32\cmd.exe</value>
<value>/C</value>
<value>C:\Alfresco\ocr.bat</value>
<value>"${source}"</value>
<value>"${target}"</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1,2</value>
</property>
</bean>
</property>
<property name="explicitTransformations">
<list>
<bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" lazy-init="default" autowire="default" dependency-check="default">
<property name="sourceMimetype">
<value>image/png</value>
</property>
<property name="targetMimetype">
<value>text/plain</value>
</property>
</bean>
</list>
</property>
</bean>
<bean id="transformer.ocr.jpeg" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer" lazy-init="default" autowire="default" dependency-check="default">
<property name="worker">
<ref bean="transformer.worker.ocr.jpeg" />
</property>
</bean>
</beans>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans default-lazy-init="false" default-autowire="no" default-dependency-check="none">
<bean id="transformer.worker.ocr.jpeg" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker" lazy-init="default" autowire="default" dependency-check="default">
<property name="mimetypeService">
<ref bean="mimetypeService" />
</property>
<property name="checkCommand">
<bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
<property name="commandsAndArguments">
<map>
<entry key="Windows.*">
<list>
<value>C:\Windows\System32\cmd.exe</value>
<value>/C</value>
<value>dir c:\Alfresco\ocr.bat</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1</value>
</property>
</bean>
</property>
<property name="transformCommand">
<bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
<property name="commandsAndArguments">
<map>
<entry key="Windows.*">
<list>
<value>C:\Windows\System32\cmd.exe</value>
<value>/C</value>
<value>C:\Alfresco\ocr.bat</value>
<value>"${source}"</value>
<value>"${target}"</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1,2</value>
</property>
</bean>
</property>
<property name="explicitTransformations">
<list>
<bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" lazy-init="default" autowire="default" dependency-check="default">
<property name="sourceMimetype">
<value>image/png</value>
</property>
<property name="targetMimetype">
<value>text/plain</value>
</property>
</bean>
</list>
</property>
</bean>
<bean id="transformer.ocr.jpeg" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer" lazy-init="default" autowire="default" dependency-check="default">
<property name="worker">
<ref bean="transformer.worker.ocr.jpeg" />
</property>
</bean>
</beans>
3)ocrjpeg-transform-context.xml
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans default-lazy-init="false" default-autowire="no" default-dependency-check="none">
<bean id="transformer.worker.ocr.tiff" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker" lazy-init="default" autowire="default" dependency-check="default">
<property name="mimetypeService">
<ref bean="mimetypeService" />
</property>
<property name="checkCommand">
<bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
<property name="commandsAndArguments">
<map>
<entry key="Windows.*">
<list>
<value>C:\Windows\System32\cmd.exe</value>
<value>/C</value>
<value>dir c:\Alfresco\ocr.bat</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1</value>
</property>
</bean>
</property>
<property name="transformCommand">
<bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
<property name="commandsAndArguments">
<map>
<entry key="Windows.*">
<list>
<value>C:\Windows\System32\cmd.exe</value>
<value>/C</value>
<value>C:\Alfresco\ocr.bat</value>
<value>"${source}"</value>
<value>"${target}"</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1,2</value>
</property>
</bean>
</property>
<property name="explicitTransformations">
<list>
<bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" lazy-init="default" autowire="default" dependency-check="default">
<property name="sourceMimetype">
<value>image/jpeg</value>
</property>
<property name="targetMimetype">
<value>text/plain</value>
</property>
</bean>
</list>
</property>
</bean>
<bean id="transformer.ocr.tiff" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer" lazy-init="default" autowire="default" dependency-check="default">
<property name="worker">
<ref bean="transformer.worker.ocr.tiff" />
</property>
</bean>
</beans>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans default-lazy-init="false" default-autowire="no" default-dependency-check="none">
<bean id="transformer.worker.ocr.tiff" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker" lazy-init="default" autowire="default" dependency-check="default">
<property name="mimetypeService">
<ref bean="mimetypeService" />
</property>
<property name="checkCommand">
<bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
<property name="commandsAndArguments">
<map>
<entry key="Windows.*">
<list>
<value>C:\Windows\System32\cmd.exe</value>
<value>/C</value>
<value>dir c:\Alfresco\ocr.bat</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1</value>
</property>
</bean>
</property>
<property name="transformCommand">
<bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
<property name="commandsAndArguments">
<map>
<entry key="Windows.*">
<list>
<value>C:\Windows\System32\cmd.exe</value>
<value>/C</value>
<value>C:\Alfresco\ocr.bat</value>
<value>"${source}"</value>
<value>"${target}"</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1,2</value>
</property>
</bean>
</property>
<property name="explicitTransformations">
<list>
<bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" lazy-init="default" autowire="default" dependency-check="default">
<property name="sourceMimetype">
<value>image/jpeg</value>
</property>
<property name="targetMimetype">
<value>text/plain</value>
</property>
</bean>
</list>
</property>
</bean>
<bean id="transformer.ocr.tiff" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer" lazy-init="default" autowire="default" dependency-check="default">
<property name="worker">
<ref bean="transformer.worker.ocr.tiff" />
</property>
</bean>
</beans>
4)ocrtiff-transform-context.xml
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans default-lazy-init="false" default-autowire="no" default-dependency-check="none">
<bean id="transformer.worker.ocr.tiff" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker" lazy-init="default" autowire="default" dependency-check="default">
<property name="mimetypeService">
<ref bean="mimetypeService" />
</property>
<property name="checkCommand">
<bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
<property name="commandsAndArguments">
<map>
<entry key="Windows.*">
<list>
<value>C:\Windows\System32\cmd.exe</value>
<value>/C</value>
<value>dir c:\Alfresco\ocr.bat</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1</value>
</property>
</bean>
</property>
<property name="transformCommand">
<bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
<property name="commandsAndArguments">
<map>
<entry key="Windows.*">
<list>
<value>C:\Windows\System32\cmd.exe</value>
<value>/C</value>
<value>C:\Alfresco\ocr.bat</value>
<value>"${source}"</value>
<value>"${target}"</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1,2</value>
</property>
</bean>
</property>
<property name="explicitTransformations">
<list>
<bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" lazy-init="default" autowire="default" dependency-check="default">
<property name="sourceMimetype">
<value>image/tiff</value>
</property>
<property name="targetMimetype">
<value>text/plain</value>
</property>
</bean>
</list>
</property>
</bean>
<bean id="transformer.ocr.tiff" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer" lazy-init="default" autowire="default" dependency-check="default">
<property name="worker">
<ref bean="transformer.worker.ocr.tiff" />
</property>
</bean>
</beans>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans default-lazy-init="false" default-autowire="no" default-dependency-check="none">
<bean id="transformer.worker.ocr.tiff" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker" lazy-init="default" autowire="default" dependency-check="default">
<property name="mimetypeService">
<ref bean="mimetypeService" />
</property>
<property name="checkCommand">
<bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
<property name="commandsAndArguments">
<map>
<entry key="Windows.*">
<list>
<value>C:\Windows\System32\cmd.exe</value>
<value>/C</value>
<value>dir c:\Alfresco\ocr.bat</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1</value>
</property>
</bean>
</property>
<property name="transformCommand">
<bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
<property name="commandsAndArguments">
<map>
<entry key="Windows.*">
<list>
<value>C:\Windows\System32\cmd.exe</value>
<value>/C</value>
<value>C:\Alfresco\ocr.bat</value>
<value>"${source}"</value>
<value>"${target}"</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1,2</value>
</property>
</bean>
</property>
<property name="explicitTransformations">
<list>
<bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" lazy-init="default" autowire="default" dependency-check="default">
<property name="sourceMimetype">
<value>image/tiff</value>
</property>
<property name="targetMimetype">
<value>text/plain</value>
</property>
</bean>
</list>
</property>
</bean>
<bean id="transformer.ocr.tiff" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer" lazy-init="default" autowire="default" dependency-check="default">
<property name="worker">
<ref bean="transformer.worker.ocr.tiff" />
</property>
</bean>
</beans>
these all are transformations files we can write any no. based on type format of files which want to do OCR with Tesseract.
5)alfresco-tesseract-search.jar
Downolad this jar from this link(https://docs.google.com/file/d/0B94FD2QmPSJCNHpuUVlicW95UjA/edit)and place this jar in this path "C:\Alfresco\tomcat\lib".
6)ocrtransform.log
create an empty file name with ocrtransform.log in "C:\TMP"after that restart the alfresco
then upload a image formatted files, content of the image will be indexed in alfresco so we can search the content of the file.
Testing
Upload a image formatted file to alfresco
simple that's it.
IMG PDF?
ReplyDeletehow can we do this same to the PDF file.
ReplyDeleteI want to perform OCR action on pdf files.
is there any solution for that?
Can you please help me with this?
kind regards.
Yes it will work, instead of image just pass the pdf and check packages and interfaces in the tesseract oct component. From there you can change the XML configuration as required for pdf's
ReplyDelete