Text in Image 2.0: improving OCR service with PaddleOCR

Contents: What’s OCR? | Context | Why PaddleOCR? | Benchmarking the best OCR solution | Working with PaddleOCR | Text in Image service | Takeaways

Read how the Cognition team improved the Text in Image service across Adevinta marketplaces using PaddleOCR

Optical Character Recognition (OCR) is a popular topic for both industry and personal use. In this article, we share how we tested and used an existing open source library, PaddleOCR, to extract text from an image. This read is for anyone who would like to find out more about OCR, the needs of our customers at Adevinta, and the challenges we face in addressing them. You’ll learn how we upgraded an existing service, benchmarked different solutions and delivered the chosen one to satisfy our customers.

OCR stands for “Optical Character Recognition” and is a technology that enables computers to recognise and extract text from images and scanned documents. OCR software uses optical recognition algorithms to interpret the text in images and convert it into machine-readable text that can be edited, searched and stored electronically.

There are many use cases where OCR can be used:

  • Digitising paper documents: to convert scanned images of text into digital text. This is useful for organisations that want to reduce their reliance on paper and improve their document management processes.
  • Extracting data from images: e.g. from documents such as invoices, receipts and forms. This can be useful for automating data entry tasks and reducing the need for manual data entry.
  • Translating documents: to extract text from images of documents written in foreign languages and translate it into a different language.
  • Archiving: to create digital copies of important documents that need to be preserved for long periods of time.
  • Improving accessibility: to make scanned documents more accessible to people with disabilities by converting the text into a format that can be read by assistive technologies such as screen readers.
  • Searching documents: to make scanned documents searchable, allowing users to easily find specific information within a large collection of documents.

Within Adevinta, a global classifieds specialist with market-leading positions in key European markets, there’s room for all of the cited use cases. However, for this article, we focus specifically on “extracting data from images.”

Applying deep learning to images is the core expertise of our team, Cognition. We’re Data Scientists and Machine Learning (ML) Engineers who work together to develop image-based ML solutions at scale, helping Adevinta’s marketplaces build better products and experiences for their customers. Adevinta’s mission is to connect buyers and sellers, enabling people to find jobs, homes, cars, consumer goods and more. By providing an accessible ML API with features tailored to our different marketplaces’ needs, we empower Adevinta’s marketplaces with ML tools at a reasonable cost.

Extracting text from images allows us to detect unwanted content in the ads (insults, hidden messages, racist content), better understand image content and even propose more efficient searches (for example, using the brand name of an item written on it). Our users’ needs can be sorted into the following categories: general text, URL, email and phone number.

At Adevinta, the existing Text in Image service processed over 100 million requests per month with strongly growing demand, but we weren’t completely satisfied with the quality of the service. Given the impact and popularity of the Text in Image service, we decided to update it to a newer, more accurate and (ideally) faster solution.

This is where the story begins: Cognition’s journey to deliver Text in Image 2.0.

The existing service was based on Fast Oriented Text Spotting with a Unified Network, or FOTS (Yan et al., 2018). Despite being state-of-the-art in 2018, the algorithm achieved 0.4 accuracy on our internal benchmark of 200 marketplace images. However, accuracy was not the only selection criterion for Text in Image 2.0, so we compiled a list of edge cases where our partner marketplaces require high-performing algorithms.

After reviewing different open source OCR frameworks (including MMOCR, EasyOCR, PaddleOCR and HiveOCR) and different combinations of the proposed models on our internal benchmark and on the edge cases, the clear winner was PaddleOCR, with a median accuracy of 0.8 and acceptable performance on our edge cases. This result competes with the paid Google Cloud Vision OCR API on the best accuracy we measured.

Fig 1. The difference between a) FOTS-based text extraction and b) PaddleOCR-based text extraction. Source: generated by Cognition, using an image randomly selected from Unsplash in the image generator.

To build our independent benchmark and validate the choice of PaddleOCR at scale, we built a “Text in Image generator” that uses open source images from Unsplash and Pikwizard and adds randomly generated text on top of them. The tool is highly customisable so that it can simulate a wide variety of cases combining factors such as font type, rotation, text length, background type and image resolution. Using a simulated benchmark of 20k images with a distribution of cases matching business needs, we obtained an improvement factor of 1.4x.
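To make the idea of a configurable benchmark more concrete, here is a minimal sketch of how the cases for such a generator could be sampled. It only draws the case descriptions (text, font, rotation, background, resolution) and does not render any image. All names, categories and value ranges are hypothetical and are not taken from the actual tool.

```python
import random
import string
from dataclasses import dataclass

# Hypothetical categories and ranges, chosen only for illustration.
FONTS = ["serif", "sans-serif", "handwritten", "monospace"]
BACKGROUNDS = ["plain", "photo", "textured"]
RESOLUTIONS = [(320, 240), (640, 480), (1280, 960)]
ALPHABET = string.ascii_letters + string.digits + " "


@dataclass(frozen=True)
class TextCase:
    """Description of one synthetic image to render."""
    text: str
    font: str
    rotation_deg: int
    background: str
    resolution: tuple


def sample_case(rng, font_weights=None):
    """Draw one case; font_weights can skew fonts towards business needs."""
    length = rng.randint(3, 40)
    text = "".join(rng.choice(ALPHABET) for _ in range(length))
    return TextCase(
        text=text,
        font=rng.choices(FONTS, weights=font_weights)[0],
        rotation_deg=rng.randrange(0, 360, 15),
        background=rng.choice(BACKGROUNDS),
        resolution=rng.choice(RESOLUTIONS),
    )


def build_benchmark(n, seed=0, font_weights=None):
    """Build a reproducible list of n cases from a fixed seed."""
    rng = random.Random(seed)
    return [sample_case(rng, font_weights) for _ in range(n)]
```

Fixing the seed keeps the benchmark reproducible, so two OCR candidates can be compared on exactly the same set of cases.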

Fig 2. Examples of images created with “Text in Image generator”.

We identified several cases where PaddleOCR fails. This mostly happens with text rotated at different angles, some unusual fonts and differing colour or contrast. We also observed that in some cases the correct words are detected but the spaces between them aren’t placed correctly. This may or may not be a problem, depending on how the extracted text is used further.
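Whether misplaced spaces matter can be checked by scoring the same predictions twice, once strictly and once ignoring whitespace. The sketch below is one hypothetical way to do that. The article does not specify how its accuracy metric is computed, so the exact-match definition here is an assumption.

```python
def exact_match(predicted, expected):
    """Strict comparison, ignoring only leading and trailing whitespace."""
    return predicted.strip() == expected.strip()


def match_ignoring_spaces(predicted, expected):
    """Comparison that ignores every whitespace character."""
    return "".join(predicted.split()) == "".join(expected.split())


def accuracy(pairs, matcher):
    """Share of (predicted, expected) pairs accepted by the matcher."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(matcher(p, e) for p, e in pairs) / len(pairs)


# Made-up predictions, for illustration only.
samples = [
    ("for sale", "for sale"),       # correct
    ("forsale", "for sale"),        # right characters, missing space
    ("four sale", "for sale"),      # wrong characters
    ("call me now", "call menow"),  # right characters, extra space
]
strict_score = accuracy(samples, exact_match)
lenient_score = accuracy(samples, match_ignoring_spaces)
```

A large gap between the two scores suggests that spacing, rather than character recognition, is the main source of errors.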

Fig 3. Some cases where PaddleOCR doesn’t detect text correctly and their possible reasons: a) Font difficult to decipher, b) Rotation angle makes text unreadable, c) Strong contrast and font type. Source: Images generated by the Cognition team.

To evaluate the potential for improving on and mitigating these errors, as well as to define the serving strategy, we needed to dive deep into the PaddleOCR framework.

PaddleOCR builds on PaddlePaddle. Our team had no previous experience with this framework, and it is less popular in our community than other frameworks such as TensorFlow, Keras or PyTorch.

From a technical perspective, PaddleOCR consists of three distinct models:

  • Detection, locating the bounding boxes where text may be present
  • Classification, rotating the text by 180° if needed
  • Recognition, translating each detected image region into raw text

Pre-trained models in several languages are provided by the authors.
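The three-stage flow described above can be sketched as a small pipeline in which each stage is a separate, replaceable callable. This is an illustration of the architecture only. It does not use the PaddleOCR API, and the class, the box format and the stub stages are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical box format: (left, top, right, bottom).
Box = Tuple[int, int, int, int]


@dataclass
class OcrPipeline:
    detect: Callable[[object], List[Box]]          # image -> candidate text boxes
    needs_flip: Callable[[object, Box], bool]      # is this region rotated by 180 degrees?
    recognise: Callable[[object, Box, bool], str]  # region (and flip flag) -> raw text

    def run(self, image) -> List[Tuple[Box, str]]:
        """Run detection, then classification and recognition per box."""
        results = []
        for box in self.detect(image):
            flipped = self.needs_flip(image, box)
            results.append((box, self.recognise(image, box, flipped)))
        return results


# Stub stages working on a fake "image": a dict mapping each box to
# (characters as a naive reader would see them, upside-down flag).
fake_image = {
    (0, 0, 10, 5): ("olleh", True),
    (0, 10, 10, 15): ("world", False),
}

pipeline = OcrPipeline(
    detect=lambda img: sorted(img),
    needs_flip=lambda img, box: img[box][1],
    # The stub simulates undoing the rotation by reversing the characters.
    recognise=lambda img, box, flipped: img[box][0][::-1] if flipped else img[box][0],
)
results = pipeline.run(fake_image)
```

Keeping the stages as independent callables means each one can be tested, profiled or swapped on its own.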

Whilst exploring the PaddleOCR code base for inference, we were faced with convoluted code that was difficult to read and understand. As we wanted to use PaddleOCR in production, we decided to refactor the code, taking care to preserve the performance and the speed of the original code. You can read the details of that process and of the PaddleOCR model in the complementary article of this series. After refactoring, we had a clean and readable code base.

We believe our code version is easier to work with for the use case of text extraction from images, and we are working on making the code available as open source. The different steps, including the pre-processing and post-processing parts, are clearly separated so that they can be called independently, which should make further community extensions easier to add. It also makes deployment to production easier, because the simplified, modular code fits well with the structure of the inference.py used for serving SageMaker endpoints. Our proposed code version doesn’t alter predictions for images (compared with the 2.6 release).
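For readers unfamiliar with that structure, the sketch below shows the four handler functions conventionally found in an inference.py for SageMaker framework serving containers: model_fn, input_fn, predict_fn and output_fn. The stub model and the JSON fields ("image", "texts") are hypothetical. They do not represent the service’s actual API contract or the refactored PaddleOCR code.

```python
import base64
import json


class StubOcrModel:
    """Hypothetical stand-in for the refactored OCR pipeline."""

    def extract(self, image_bytes):
        # A real model would run detection, classification and recognition here.
        return [{"text": "placeholder", "confidence": 1.0}]


def model_fn(model_dir):
    """Load the model once; model_dir is where the model artefacts are unpacked."""
    return StubOcrModel()


def input_fn(request_body, content_type="application/json"):
    """Decode the request into raw image bytes (base64 inside JSON, by assumption)."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    payload = json.loads(request_body)
    return base64.b64decode(payload["image"])


def predict_fn(image_bytes, model):
    """Run the model on the decoded input."""
    return model.extract(image_bytes)


def output_fn(prediction, accept="application/json"):
    """Serialise the prediction for the response."""
    if accept != "application/json":
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps({"texts": prediction})
```

With pre-processing, prediction and post-processing already separated in the model code, each handler stays a thin wrapper around one step.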

Using the refactored code, we made the model available as an API. To ease our customers’ transition, we maintained the same API contract used in the previous service.

Serving PaddleOCR can be done in multiple ways. The straightforward approach is calling its own Python API (provided by the PaddleOCR package) from within a well-known framework. We selected Multi Model Server, Flask and FastAPI to conduct our benchmark. All our proposed solutions are served by an AWS SageMaker Endpoint, building our own container (BYOC) from the same Docker base image.

Multi Model Server uses its own Java model server, while for Flask and FastAPI we use nginx + gunicorn (combined with uvicorn workers for the ASGI-based FastAPI). The frontend for our customers is served by an API Gateway, which is out of the scope of this article.
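As an illustration of the gunicorn and uvicorn combination, a gunicorn configuration file is itself a Python module. The fragment below is a hypothetical configuration, not the one used by the service. The socket path, worker count and timeout are assumptions to be tuned with load tests.

```python
# gunicorn.conf.py (hypothetical values, for illustration only)
import multiprocessing

# nginx is assumed to proxy incoming requests to this Unix socket.
bind = "unix:/tmp/gunicorn.sock"

# One worker per CPU core is a common starting point.
workers = multiprocessing.cpu_count()

# uvicorn's gunicorn worker class lets gunicorn manage ASGI apps such as FastAPI.
# For a WSGI app such as Flask, this line would be omitted.
worker_class = "uvicorn.workers.UvicornWorker"

# Seconds of silence after which gunicorn restarts a worker.
timeout = 60
```

In this kind of setup, nginx would listen on port 8080, the port on which SageMaker hosting expects a container to answer /ping and /invocations requests.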

For the performance testing, we recreated a variety of requests with a controlled amount of text and different image sizes, mimicking the expected distribution from our customers. We used Locust as the testing framework, and simulated heavy bursts in the waiting time as a stress test.
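One way to picture bursts in the waiting time is a sampler that usually returns a normal pause between requests but occasionally returns a near-zero one. The sketch below uses only the standard library and hypothetical parameter values. It is not the team’s Locust setup, although a function like this could plausibly back a custom wait_time method on a Locust user class.

```python
import random


def bursty_wait(rng, base_wait=1.0, burst_probability=0.2, burst_wait=0.01):
    """Return the pause in seconds before a simulated user's next request.

    With probability burst_probability the pause is almost zero, producing
    a burst; otherwise it is drawn from an exponential distribution whose
    mean is base_wait.
    """
    if rng.random() < burst_probability:
        return burst_wait
    return rng.expovariate(1.0 / base_wait)


rng = random.Random(0)
waits = [bursty_wait(rng) for _ in range(1000)]
burst_count = sum(1 for w in waits if w == 0.01)
```

Raising burst_probability or lowering base_wait makes the simulated traffic more aggressive.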

With the data gathered from the performance tests, we were able to define our infrastructure (type of instance and autoscaling policy) in relation to the Service Level Agreement (SLA) terms, while balancing the risk of a sudden shift from the observed distribution (the service is sensitive to the amount of text per image).

Currently, we handle 330 million requests per month, and we have estimated that next year more Adevinta marketplaces will onboard the Text in Image service, resulting in 400% growth.
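To put these volumes in perspective, the monthly figures can be converted into an average request rate. The short calculation below assumes a 30-day month. It also shows two readings of “400% growth”, since the text does not say whether the projected volume is four or five times today’s. Both readings are assumptions.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def average_rps(requests_per_month):
    """Average requests per second for a given monthly volume."""
    return requests_per_month / SECONDS_PER_MONTH


current_rps = average_rps(330e6)

# Reading 1: volume grows BY 400%, i.e. five times today's volume.
projected_rps_5x = average_rps(330e6 * 5)

# Reading 2: volume grows TO 400%, i.e. four times today's volume.
projected_rps_4x = average_rps(330e6 * 4)

print(f"current: {current_rps:.1f} req/s")
print(f"projected: {projected_rps_4x:.1f} to {projected_rps_5x:.1f} req/s")
```

These are averages only. Peak traffic will be higher, which is why the stress tests and the autoscaling policy matter.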

The new API improved latency by 7.5x compared with the FOTS-based solution, while providing a 7% cost reduction in serving. The new API is also 12x cheaper than a typical external solution such as GCP OCR, and we received positive feedback from our users about both the speed and the accuracy of Text in Image 2.0.

As a computer vision team working for a global company serving millions of people every day, we aimed to improve our OCR API for text extraction from classified ads. After testing a number of frameworks, we built an image simulator to find the algorithm matching the needs of our users. The chosen framework, PaddleOCR, went through our internal review and revamp. (There were challenges along the way and you can read more about them in ). Now, we’re pleased to say we’re providing a more accurate, faster and cheaper API using the PaddleOCR framework.
