Need a PDFBox expert to help extract text from PDFs with coordinates and a flag indicating which part of the text is visible
$250-750 USD
Closed
Posted over 6 years ago
Paid on delivery
I am looking for help understanding the PDFBox library. Please apply only if you have already worked with PDFBox, iText, or other PDF software.
What we need: a utility/jar/class we can call from our Java web app, which runs under Tomcat with Java 8 on a Linux server (this may affect non-Java solutions).
Problem: we need to extract text from searchable (not scanned) PDFs and preserve text positions. Ideally the library should return words/tokens with x/y start/end positions, as well as the start/end coordinates of vertical and horizontal line separators. We need only the text a user can see; or, if we get the full text, we need a clear indication of which parts are visible to the end user and which are not. Attached is an example of a PDF file that has hidden text.
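To make the requested output shape concrete, here is a minimal sketch of grouping per-character boxes into words with start/end coordinates. It uses only the JDK; `Glyph`, `Word`, `WordGrouper` and the `maxGap` threshold are hypothetical names chosen for illustration. In PDFBox, the per-character boxes could presumably be filled from `TextPosition` (for example `getXDirAdj()`, `getYDirAdj()`, `getWidthDirAdj()`, `getHeightDir()`) inside a `PDFTextStripper` subclass, but that wiring is an assumption and is not shown here. Line separators are not covered.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: Glyph, Word and WordGrouper are hypothetical names,
// not PDFBox classes.
class WordGrouper {

    /** One extracted character and its bounding box in page coordinates. */
    static final class Glyph {
        final String text;
        final Rectangle2D box;

        Glyph(String text, double x, double y, double width, double height) {
            this.text = text;
            this.box = new Rectangle2D.Double(x, y, width, height);
        }
    }

    /** A token plus the start/end coordinates of the union of its glyph boxes. */
    static final class Word {
        final String text;
        final double xStart, yStart, xEnd, yEnd;

        Word(String text, Rectangle2D bounds) {
            this.text = text;
            this.xStart = bounds.getMinX();
            this.yStart = bounds.getMinY();
            this.xEnd = bounds.getMaxX();
            this.yEnd = bounds.getMaxY();
        }
    }

    /**
     * Groups glyphs (assumed to arrive in reading order) into words. A word ends at
     * a whitespace glyph, at a horizontal gap wider than maxGap, or at a line change.
     */
    static List<Word> group(List<Glyph> glyphs, double maxGap) {
        List<Word> words = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        Rectangle2D bounds = null;
        for (Glyph g : glyphs) {
            boolean whitespace = g.text.trim().isEmpty();
            boolean breaksWord = whitespace;
            if (bounds != null && !whitespace) {
                boolean wideGap = g.box.getMinX() - bounds.getMaxX() > maxGap;
                // No vertical overlap with the current word means a different line.
                boolean otherLine = g.box.getMinY() >= bounds.getMaxY()
                        || g.box.getMaxY() <= bounds.getMinY();
                breaksWord = wideGap || otherLine;
            }
            if (breaksWord && bounds != null) {
                words.add(new Word(text.toString(), bounds));
                text.setLength(0);
                bounds = null;
            }
            if (!whitespace) {
                text.append(g.text);
                bounds = (bounds == null) ? g.box.getBounds2D() : bounds.createUnion(g.box);
            }
        }
        if (bounds != null) {
            words.add(new Word(text.toString(), bounds));
        }
        return words;
    }
}
```

Calling `WordGrouper.group(glyphs, 1.0)` returns one `Word` per token, each carrying `xStart`/`yStart`/`xEnd`/`yEnd`. The gap threshold would likely need tuning per font size, and rotated text is not handled.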
We tried Apache PDFBox; however, the default PDFTextStripper handles only simple cases, where all extracted text is visible on screen. In the attached files, text is partially invisible because of PDF clipping/filling paths. To track this, you need to process the PDF instructions manually and calculate whether each character is covered or overlapped by another element, such as an image or another filled field. The output must therefore either contain only the visible text or flag each part of the text as visible or not visible.
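The geometric part of that check can be sketched with `java.awt.geom.Area` alone. This is a rough illustration under stated assumptions, not a complete solution: `GlyphVisibility` and `classify` are hypothetical names, the caller is assumed to have already collected the clip shape in effect when the character was drawn and the opaque shapes painted after it, and everything is assumed to be in the same coordinate space. It ignores invisible text rendering modes, transparency, and text drawn in the background color.

```java
import java.awt.Shape;
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;
import java.util.List;

// Illustrative sketch only: GlyphVisibility is a hypothetical helper, not a PDFBox class.
class GlyphVisibility {

    enum Visibility { VISIBLE, PARTIAL, HIDDEN }

    /**
     * Classifies a glyph box against the clip in effect when it was drawn and
     * against opaque shapes (filled paths, image bounds) painted after it.
     * A null clip is treated as "no clipping".
     */
    static Visibility classify(Rectangle2D glyphBox, Shape clip,
                               List<? extends Shape> paintedOnTop) {
        Area glyph = new Area(glyphBox);
        Area visible = new Area(glyphBox);
        if (clip != null) {
            visible.intersect(new Area(clip));   // keep only what lies inside the clip
        }
        for (Shape shape : paintedOnTop) {
            visible.subtract(new Area(shape));   // remove what later content covers
        }
        if (visible.isEmpty()) {
            return Visibility.HIDDEN;
        }
        return visible.equals(glyph) ? Visibility.VISIBLE : Visibility.PARTIAL;
    }
}
```

A per-character result like this can then be rolled up into the per-word flag the posting asks for, for example by marking a word hidden only when all of its characters are `HIDDEN`. Note that a zero-width box produces an empty area and is therefore reported as `HIDDEN`.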
Other tools could be used, such as iText or Tika, although Tika relies on PDFBox for PDF parsing (iText is a separate library). We also considered using the Acrobat SDK, but we are not familiar with it.
Greetings,
I have understood your task (a PDFBox expert to extract text from PDFs with coordinates and a flag for which part of the text is visible) and can complete it to your 100% satisfaction. Please ping me to discuss further.
I have more than 5 years of experience in Java and PDF.
Hi,
I have extensive experience with the PDFBox and iText PDF libraries. I reviewed your requirement for extracting text and its position from PDFs, and it looks good to me: since the PDFs are searchable, we can get the text easily, and for the position of the text on the page I can get its X and Y coordinates. I don't think you need the Adobe SDK; PDFBox or iText is enough for this task. If you want, I also know another library called tableu that can handle this. If you have time, can we connect on chat so I can ask a few questions to clarify my understanding and make sure we are both on the same page?
Thanks,
I am an IITK graduate and I have 11 years of experience in software development. I have 100% completion rate and I have finished projects with the highest level of customer satisfaction.
I have a team of rock-star developers who work at top product companies and contribute to these projects as a part-time gig.
Hello Sir/Madam,
Relevant Skills and Experience:
Please send us all the details and we will do the job now if possible. We are always ready to take on any challenge, and we have an Adobe lab too.
Proposed Milestones:
475 - (ProjectTitle)
For any query please consult our profile on https://www.freelancer.com/u/benni25.html