Since tesseract needs to be installed offline, one option is to deploy it as a Docker image, built from a container with docker commit.
The general idea is to pull a base CentOS image, start it, install tesseract inside the container, and then commit the container as an image. The image is then saved as a tar package, which allows offline installation. (This is also a common way of building your own images.)
In fact, the steps below are the same as installing tesseract on an ordinary Linux system; the only additions are pulling the base CentOS image at the start and committing the container as an image at the end.
0. Prepare files
(1)tesseract source files:
git download address: https://github.com/tesseract-ocr/tesseract
(2) leptonica-1.79.0.tar.gz (tesseract depends on this project)
(3) Language Pack
All that is really needed is chi_sim.traineddata and eng.traineddata.
You can download them from GitHub, which offers two versions: a fast version and a best version.
fast: https://github.com/tesseract-ocr/tessdata_fast
best: https://github.com/tesseract-ocr/tessdata_best
The difference between the two is that the fast language files are smaller and recognize faster, while the best language files are larger and recognize more slowly. The difference in recognition accuracy remains to be verified.
(4) Prepare two images for verification
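On the machine that still has internet access, the two language files from item (3) could be fetched with a small script. This is only a sketch: the `main` branch name in the URL, the helper names `traineddata_url` and `fetch_traineddata`, and the use of `curl` are assumptions, not something the steps here prescribe.

```shell
#!/usr/bin/env bash
# Sketch: build download URLs for tessdata files and fetch them.
# Assumption: both repositories serve raw files from a branch named "main".

# traineddata_url VARIANT LANG -> prints the raw download URL.
# VARIANT is "fast" or "best"; LANG is e.g. "chi_sim" or "eng".
traineddata_url() {
    local variant="$1" lang="$2"
    printf 'https://github.com/tesseract-ocr/tessdata_%s/raw/main/%s.traineddata\n' \
        "$variant" "$lang"
}

# fetch_traineddata VARIANT DEST_DIR LANG...  (hypothetical helper)
# Downloads each language file into DEST_DIR; needs curl and network access.
fetch_traineddata() {
    local variant="$1" dest="$2"
    shift 2
    mkdir -p "$dest"
    local lang
    for lang in "$@"; do
        curl -fL -o "$dest/$lang.traineddata" "$(traineddata_url "$variant" "$lang")"
    done
}

# Example (not run automatically):
#   fetch_traineddata fast ./tessdata chi_sim eng
```

Keeping the URL construction in its own function makes it easy to switch between the fast and best variants without touching the download loop.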
1. Download centos image
docker pull hub.c.163.com/library/centos:latest
2. Start the image and enter the container to check the CentOS version:
C:\Users\Administrator>docker run -i -t hub.c.163.com/library/centos:latest /bin/bash
[root@86867025ffc7 /]# cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)
[root@86867025ffc7 /]# uname -a
Linux 86867025ffc7 4.14.154-boot2docker #1 SMP Thu Nov 14 19:19:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
3. Upload tesseract-master.zip to the container's /opt/tesseract directory (run on the host)
docker cp ./tesseract-master.zip 86867025ffc7:/opt/tesseract
86867025ffc7 is the centos container ID
4. Unzip in container
You need to install unzip first:
[root@86867025ffc7 tesseract]# yum list unzip
Loaded plugins: fastestmirror, ovl
Determining fastest mirrors
 * base: mirrors.cqu.edu.cn
 * extras: mirror.lzu.edu.cn
 * updates: mirrors.njupt.edu.cn
updates/7/x86_64/primary_db | 1.3 MB 00:00:00
Available Packages
unzip.x86_64 6.0-21.el7 base
[root@86867025ffc7 tesseract]# yum install unzip
Unzip:
unzip ./tesseract-master.zip
5. Install the build environment
(1) Install the compilers and make: gcc, gcc-c++, make
yum install gcc gcc-c++ make
(2) Install packages necessary for tesseract-ocr compilation
yum install autoconf automake libtool libjpeg-devel libpng-devel libtiff-devel
6. Install leptonica
(1) Upload leptonica-1.79.0.tar.gz to the container (run on the host):
docker cp ./leptonica-1.79.0.tar.gz 86867025ffc7:/opt/tesseract
(2) Extract and build inside the container:
tar -zxvf ./leptonica-1.79.0.tar.gz
cd leptonica-1.79.0
./autogen.sh
./configure
make && make install
(3) Add environment variables (run yum install vim first if the vim command is not found)
vim /etc/profile
Append the following at the end of the file:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LIBLEPT_HEADERSDIR=/usr/local/include
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
Run the following command for the change to take effect immediately:
source /etc/profile
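If you would rather not edit the file in vim, the same three lines could be appended from the shell. A minimal sketch, assuming a hypothetical helper `append_lept_env` that takes the profile path as its argument and skips the append when the lines are already there:

```shell
#!/usr/bin/env bash
# append_lept_env FILE  (hypothetical helper, not part of leptonica or tesseract)
# Appends the leptonica-related variables to FILE unless they are already present.
append_lept_env() {
    local profile="$1"
    # Skip if a previous run already added the block.
    if grep -q 'LIBLEPT_HEADERSDIR=' "$profile" 2>/dev/null; then
        return 0
    fi
    # Quoted heredoc delimiter: $LD_LIBRARY_PATH is written literally,
    # so it is expanded when the profile is sourced, not now.
    cat >> "$profile" <<'EOF'
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LIBLEPT_HEADERSDIR=/usr/local/include
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
EOF
}

# Example (inside the container, as root):
#   append_lept_env /etc/profile && source /etc/profile
```

Because the helper checks for an existing entry first, running it twice does not duplicate the exports.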
(4) Run the following command and confirm that lept appears in the output; if it does not, review the steps above
[root@86867025ffc7 leptonica-1.79.0]# pkg-config --list-all | grep lept
lept leptonica - An open source C library for efficient image processing and image analysis operations
7. Install tesseract: enter the tesseract-master directory and run the following commands:
./autogen.sh
./configure
make && make install
The make and make install step failed for me, so I upgraded GCC. The original GCC version was 4.x; after upgrading to version 8, make && make install succeeded:
Step 1: Install the SCL repository:
yum install centos-release-scl scl-utils-build
Step 2: List the available SCL packages:
yum list all --enablerepo='centos-sclo-rh'
yum list all --enablerepo='centos-sclo-rh' | grep "devtoolset-"
Step 3: Install version 8 of the gcc, gcc-c++, and gdb toolchain:
yum install -y devtoolset-8-toolchain
scl enable devtoolset-8 bash
Check the gcc version:
[root@86867025ffc7 tesseract-master]# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-8/root/usr --mandir=/opt/rh/devtoolset-8/root/usr/share/man --infodir=/opt/rh/devtoolset-8/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-8.3.1-20190311/obj-x86_64-redhat-linux/isl-install --disable-libmpx --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)
After upgrading gcc, run configure, make, and make install again; otherwise an invalid pointer error occurs. If make and make install run without errors, you do not need to upgrade gcc.
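Whether the upgrade is likely to be needed can be checked before building. A rough sketch, assuming a hypothetical helper `needs_newer_gcc` and treating major version 8 as the threshold only because that is the version that worked here:

```shell
#!/usr/bin/env bash
# needs_newer_gcc VERSION [MIN_MAJOR]  (hypothetical helper)
# Succeeds (exit 0) when the major part of VERSION is below MIN_MAJOR.
# MIN_MAJOR defaults to 8, an assumption based on the toolchain used in this walkthrough.
needs_newer_gcc() {
    local version="$1" min_major="${2:-8}"
    local major="${version%%.*}"   # "4.8.5" -> "4", "8" -> "8"
    [ "$major" -lt "$min_major" ]
}

# Example (not run automatically):
#   if needs_newer_gcc "$(gcc -dumpversion)"; then
#       echo "gcc is older than 8; consider installing devtoolset-8 first"
#   fi
```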
8. Check the tesseract version:
[root@86867025ffc7 tesseract-master]# tesseract -v
tesseract 5.0.0-alpha
 leptonica-1.79.0
  libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7
 Found AVX2
 Found AVX
 Found SSE
 Found OpenMP 201511
9. Copy the language packs into the container's /usr/local/share/tessdata directory, then list the available languages:
tesseract --list-langs
I got an error when running it:
*** Error in `tesseract': free(): invalid pointer: 0x000000000065bff0 ***
Solution: re-run configure, make, and make install in the tesseract-master directory.
List the languages again:
[root@86867025ffc7 tesseract-master]# tesseract --list-langs
List of available languages (2):
chi_sim
eng
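The copy command itself is not shown in this step; it could be done from the host with `docker cp`, the same way the source archives were uploaded. A sketch, assuming the `.traineddata` files sit in one local directory and using a hypothetical helper `copy_traineddata` (the `DOCKER` variable exists only so the command can be swapped out for a dry run):

```shell
#!/usr/bin/env bash
# copy_traineddata CONTAINER SRC_DIR  (hypothetical helper)
# Copies every *.traineddata file in SRC_DIR into the container's tessdata directory.
copy_traineddata() {
    local container="$1" src_dir="$2" file
    for file in "$src_dir"/*.traineddata; do
        [ -e "$file" ] || continue   # no match: the glob stays literal, skip it
        "${DOCKER:-docker}" cp "$file" "$container:/usr/local/share/tessdata/"
    done
}

# Example (run on the host):
#   copy_traineddata 86867025ffc7 ./tessdata
# Dry run that only prints the arguments passed to docker:
#   DOCKER=echo copy_traineddata 86867025ffc7 ./tessdata
```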
10. Run recognition to verify: upload the prepared images to the container's /opt/ocrtemplate directory
[root@86867025ffc7 ocrtemplate]# tesseract ./zh.png result -l chi_sim -c preserve_interword_spaces=1 --dpi 300 --oem 1
Tesseract Open Source OCR Engine v5.0.0-alpha with Leptonica
[root@86867025ffc7 ocrtemplate]# ls
result.txt zh.png
[root@86867025ffc7 ocrtemplate]# cat result.txt
//Chinese Server
Examples of tesseract commands:
# Output to the console
tesseract ./normal.png stdout -l chi_sim
# Specify only the language. The specified language requires chi_sim_vert.traineddata
tesseract ./normal.png chi_sim__simhei_result -l chi_sim
tesseract ./normal.png chi_sim__simhei_result -l chi_sim -c preserve_interword_spaces=1
tesseract ./normal.png chi_sim__simhei_result -l chi_sim -c preserve_interword_spaces=1 --dpi 300
tesseract ./normal.png chi_sim__simhei_result -l chi_sim -c preserve_interword_spaces=1 --dpi 300 --oem 1
# Specifying --psm 1 requires osd.traineddata
tesseract ./normal.png chi_sim__simhei_result -l chi_sim -c preserve_interword_spaces=1 --dpi 300 --oem 1 --psm 1
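When several images need the same options, the command could be wrapped in a loop. A minimal sketch, assuming PNG inputs in one directory; `ocr_output_base` and `ocr_dir` are hypothetical helper names, and the `TESSERACT` variable is only there so a different binary can be substituted:

```shell
#!/usr/bin/env bash
# ocr_output_base IMAGE -> prints the output base name (path without extension).
# tesseract appends ".txt" to this base on its own.
ocr_output_base() {
    printf '%s\n' "${1%.*}"
}

# ocr_dir DIR LANG  (hypothetical helper)
# Runs tesseract on every PNG in DIR with the options from the examples in this step.
ocr_dir() {
    local dir="$1" lang="$2" img
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue   # directory without PNG files: nothing to do
        "${TESSERACT:-tesseract}" "$img" "$(ocr_output_base "$img")" \
            -l "$lang" -c preserve_interword_spaces=1 --dpi 300 --oem 1
    done
}

# Example:
#   ocr_dir /opt/ocrtemplate chi_sim
```

Each result lands next to its input, so normal.png produces normal.txt in the same directory.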
11. Commit the container as an image (look up the container ID, then commit):
Administrator@MicroWin10-1535 MINGW64 ~/Desktop/dockertest
$ docker ps -a
CONTAINER ID   IMAGE                                 COMMAND       CREATED        STATUS        PORTS   NAMES
86867025ffc7   hub.c.163.com/library/centos:latest   "/bin/bash"   18 hours ago   Up 18 hours           objective_swanson
Administrator@MicroWin10-1535 MINGW64 ~/Desktop/dockertest
$ docker commit 86867025ffc7 zdtesseract
sha256:46153c6cb7da8deb023aacaa46413822ee921a147151a235375e1a7f4ad64fda
Administrator@MicroWin10-1535 MINGW64 ~/Desktop/dockertest
$ docker images | grep zdte
zdtesseract   latest   46153c6cb7da   About a minute ago   1.64GB
Administrator@MicroWin10-1535 MINGW64 ~/Desktop/dockertest
$ docker history zdtesseract
IMAGE          CREATED         CREATED BY                                        SIZE     COMMENT
46153c6cb7da   7 minutes ago   /bin/bash                                         1.44GB
328edcd84f1b   2 years ago     /bin/sh -c #(nop) CMD ["/bin/bash"]               0B
<missing>      2 years ago     /bin/sh -c #(nop) LABEL name=CentOS Base Im...    0B
<missing>      2 years ago     /bin/sh -c #(nop) ADD file:63492ba809361c51e...   193MB
The installation of tesseract in the container is now complete. Installing tesseract on an actual Linux system follows the same steps.
The following steps make the Docker image more convenient to use. To run tesseract without entering the container, execute the following command:
docker run --rm -v /opt/ocrtemplate:/opt/ocrtemplate zdtesseract tesseract /opt/ocrtemplate/normal.png /opt/ocrtemplate/result -l chi_sim --oem 1 --dpi 300
--rm specifies that the container is removed automatically when it finishes; -v specifies the mount mapping between a host directory and a container directory (host directory:container directory). After execution, result.txt is generated in the /opt/ocrtemplate directory inside the container. Because the host and container /opt/ocrtemplate directories are mounted together, the file also appears in the host directory, giving the final OCR result file.
12. Testing
If Docker runs in a VirtualBox virtual machine on Windows, you need to enter the virtual machine first.
docker-machine ssh default   # Enter the virtual machine
sudo -i   # Switch to root
Then run:
root@default:/opt/ocrtemplate# docker run --rm zdtesseract tesseract -v
tesseract 5.0.0-alpha
 leptonica-1.79.0
  libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7
 Found AVX2
 Found AVX
 Found SSE
 Found OpenMP 201511
root@default:/opt/ocrtemplate# docker run --rm zdtesseract tesseract --list-langs
List of available languages (2):
chi_sim
eng
For convenience, define an alias for the command above:
root@default:/opt/ocrtemplate# alias tesseract='docker run --rm -v /opt/ocrtemplate:/opt/ocrtemplate zdtesseract tesseract'   # Define the alias
root@default:/opt/ocrtemplate# alias   # List the aliases
alias tesseract='docker run --rm -v /opt/ocrtemplate:/opt/ocrtemplate zdtesseract tesseract'
root@default:/opt/ocrtemplate# tesseract -v   # Check the version through the alias
tesseract 5.0.0-alpha
 leptonica-1.79.0
  libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7
 Found AVX2
 Found AVX
 Found SSE
 Found OpenMP 201511
Run recognition on a file to test:
(1) Output to the console:
tesseract /opt/ocrtemplate/normal.png stdout -l chi_sim --oem 1 --dpi 300
(2) Write the result to a file:
tesseract /opt/ocrtemplate/normal.png /opt/ocrtemplate/result -l chi_sim --oem 1 --dpi 300
Explanation: through the alias, the command above actually runs the following:
docker run --rm -v /opt/ocrtemplate:/opt/ocrtemplate zdtesseract tesseract /opt/ocrtemplate/normal.png /opt/ocrtemplate/result -l chi_sim --oem 1 --dpi 300
--rm removes the container automatically when it finishes, and -v mounts the host directory into the container (host directory:container directory). result.txt is written to /opt/ocrtemplate inside the container, and because that directory is mounted from the host, the file also appears in the host's /opt/ocrtemplate directory.
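The alias only works for files under /opt/ocrtemplate, because that is the one directory it mounts. A possible variation, sketched here under the assumption that the image is named zdtesseract as in the earlier steps, is a shell function that mounts whatever directory the input file lives in. `ocr_in_docker` is a hypothetical name, and `DOCKER` is only a hook for dry runs:

```shell
#!/usr/bin/env bash
# ocr_in_docker IMAGE_FILE LANG  (hypothetical helper)
# Mounts the file's directory at the same path inside the container and
# writes <name>.txt next to the input file.
ocr_in_docker() {
    local file="$1" lang="$2" dir name
    dir="$(cd "$(dirname "$file")" && pwd)" || return 1   # absolute host directory
    name="$(basename "$file")"
    "${DOCKER:-docker}" run --rm -v "$dir:$dir" zdtesseract \
        tesseract "$dir/$name" "$dir/${name%.*}" -l "$lang" --oem 1 --dpi 300
}

# Example:
#   ocr_in_docker /opt/ocrtemplate/normal.png chi_sim
```

Mounting the directory at the same path on both sides keeps the input and output paths identical inside and outside the container.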
13. Save the image as a tar archive for offline installation
docker save -o zdtesseract.tar zdtesseract
The tar package can then be transferred to another machine and installed offline.
Of course, the image can also be pushed to an image registry.
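For the offline route, the receiving machine loads the archive with `docker load`. A minimal sketch that also verifies a checksum before loading; the checksum step and the helper name `import_image` are additions for illustration, not part of the original procedure:

```shell
#!/usr/bin/env bash
# import_image TARBALL  (hypothetical helper)
# Verifies TARBALL against TARBALL.sha256, then loads it into the local Docker daemon.
import_image() {
    local tarball="$1"
    sha256sum -c "$tarball.sha256" || return 1   # stop if the archive was damaged in transit
    "${DOCKER:-docker}" load -i "$tarball"
}

# On the online machine (not run automatically):
#   docker save -o zdtesseract.tar zdtesseract
#   sha256sum zdtesseract.tar > zdtesseract.tar.sha256
# On the offline machine, after copying both files into the current directory:
#   import_image zdtesseract.tar
```

After loading, `docker images` on the offline machine should list zdtesseract, and the docker run commands from the earlier steps work unchanged.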