Need PDFBox expert to help extract text from PDFs with coordinates and a flag for which part of the text is visible

$250-750 USD

Closed

Posted over 6 years ago

Paid on delivery

I am looking for help understanding the PDFBox library. Please apply only if you have already worked with PDFBox, iText, or other PDF software.

What we need: a utility/jar/class that we can call from our Java web app, which runs under Tomcat with Java 8 on a Linux server (this may affect non-Java solutions).

Problem: we need to extract text from searchable (not scanned) PDFs and preserve text positions. Ideally the library should return words/tokens with their x/y start and end positions, as well as the start and end coordinates of vertical and horizontal line separators. We need to get only the text a user can see; or, if we get the full text, we need a clear indication of which parts are visible to the end user and which are not. Attached is an example PDF file that has hidden text.

We tried Apache PDFBox; however, the default PDFTextStripper handles only simple cases, where all extracted text is visible on screen. In the attached files, text is partially invisible because of PDF clipping/filling paths, so to track it you need to process the PDF instructions manually and calculate whether each character is covered/overlapped by another element, such as an image or another filled field.

Other tools could be used, such as iText and Tika, but it looks like they are built on top of PDFBox. We also considered using the Acrobat SDK, but we are not familiar with it.
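To make the requested output concrete, here is a minimal sketch of the visibility flag in plain Java 8, with no PDF library involved. It assumes the glyph boxes, the active clip rectangle, and the opaque shapes painted after the text have already been collected from the content stream. With PDFBox, glyph positions could come from a `PDFTextStripper` subclass that overrides `writeString(String, List<TextPosition>)`, but that wiring is not shown. The names `Glyph`, `Word`, `VisibilitySketch`, `layout`, `flag`, and `words` are hypothetical, and the rules are deliberately simplified: clips and covering shapes are axis-aligned rectangles, and a glyph counts as hidden only when it is fully outside the clip or fully covered.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical glyph record: one character plus its box in page coordinates. */
class Glyph {
    final char ch;
    final Rectangle2D box;
    boolean visible = true;

    Glyph(char ch, double x, double y, double w, double h) {
        this.ch = ch;
        this.box = new Rectangle2D.Double(x, y, w, h);
    }
}

/** Hypothetical result record: a word, its bounding box, and a visibility flag. */
class Word {
    final String text;
    final Rectangle2D box;
    final boolean visible;

    Word(String text, Rectangle2D box, boolean visible) {
        this.text = text;
        this.box = box;
        this.visible = visible;
    }
}

class VisibilitySketch {
    /** Stand-in for real extraction: lays out a string as fixed-width glyph boxes. */
    static List<Glyph> layout(String s, double x, double y, double w, double h) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            glyphs.add(new Glyph(s.charAt(i), x + i * w, y, w, h));
        }
        return glyphs;
    }

    /** Hides a glyph that is not fully inside the clip or is fully covered by a later shape. */
    static void flag(List<Glyph> glyphs, Rectangle2D clip, List<Rectangle2D> paintedLater) {
        for (Glyph g : glyphs) {
            boolean covered = false;
            for (Rectangle2D r : paintedLater) {
                if (r.contains(g.box)) {
                    covered = true;
                    break;
                }
            }
            g.visible = clip.contains(g.box) && !covered;
        }
    }

    /** Splits glyphs into words at whitespace; a word is visible only if all its glyphs are. */
    static List<Word> words(List<Glyph> glyphs) {
        List<Word> out = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        Rectangle2D box = null;
        boolean visible = true;
        for (Glyph g : glyphs) {
            if (Character.isWhitespace(g.ch)) {
                if (text.length() > 0) {
                    out.add(new Word(text.toString(), box, visible));
                }
                text.setLength(0);
                box = null;
                visible = true;
                continue;
            }
            text.append(g.ch);
            box = (box == null) ? g.box : box.createUnion(g.box);
            visible &= g.visible;
        }
        if (text.length() > 0) {
            out.add(new Word(text.toString(), box, visible));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = layout("Hi no", 10, 100, 6, 10);
        Rectangle2D clip = new Rectangle2D.Double(0, 0, 200, 200);
        // One opaque rectangle painted after the text, covering the second word.
        List<Rectangle2D> paintedLater =
                Collections.<Rectangle2D>singletonList(new Rectangle2D.Double(26, 95, 20, 20));
        flag(glyphs, clip, paintedLater);
        for (Word w : words(glyphs)) {
            System.out.println(w.text + " x=" + w.box.getMinX() + ".." + w.box.getMaxX()
                    + " y=" + w.box.getMinY() + ".." + w.box.getMaxY() + " visible=" + w.visible);
        }
    }
}
```

In this example the word "Hi" is reported as visible and "no" as hidden, each with its start and end coordinates. A real implementation would also need to handle partial overlap, non-rectangular clipping paths, images, and text drawn in the invisible rendering mode.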

Java

PDF

Project ID: 15915061

About the project

6 proposals

Remote project

Active 6 years ago

6 freelancers are bidding an average of $517 USD for this job

@schoudhary1553

Greetings, I have understood your task (a PDFBox expert to help extract text from PDFs with coordinates and a flag for which part of the text is visible) and can complete it to your 100% satisfaction. Please ping me for further discussion. I have more than 5 years of experience in Java and PDF.

$500 USD in 6 days

5.0

(21 reviews)

5.0

@expertjavagiant

Hi, I have extensive experience with the PDFBox and iText PDF libraries. I reviewed your requirement for extracting text from PDFs along with its position, and it looks good to me: since it's a searchable PDF, we can get the text easily, and for the position of the text on the page I can get its X and Y coordinates. I don't think we need to use the Adobe SDK; PDFBox and iText are enough for this task. If you want, I know another library called tableu which will handle this. If you have time, can we connect on chat so I can ask you a few questions to clarify my understanding and make sure we are both on the same page? Thanks,

$480 USD in 10 days

4.7

(14 reviews)

4.9

@shahzain93

Hey man, I have worked with the PDFBox library. I have seen your document and I can try to do it. If interested, message me. Thanks

$690 USD in 10 days

5.0

(10 reviews)

4.2

@anuragiitk

I am an IITK graduate with 11 years of experience in software development. I have a 100% completion rate and have finished projects with the highest level of customer satisfaction. I have a team of rock-star developers who work at top product companies and contribute to these projects as a part-time gig.

$555 USD in 10 days

3.8

(20 reviews)

5.4

@benni25

Hello Sir/Madam. Relevant Skills and Experience: Please send us all the details and we will do the job right away if possible. We are always ready to take on any challenge, and we have an Adobe lab too. Proposed Milestones: 475 - (ProjectTitile). For any query, please consult our profile at https://www.freelancer.com/u/benni25.html

$475 USD in 1 day