We are looking for a Python coder who can do text classification using existing classification algorithms. We want to take already-downloaded web pages and classify their content with tags such as "Business", "Sport", "Information Technology", etc. We are starting with articles, but it should (in theory) be possible to classify any web page this way.

In a linear example, this would work as follows:

1. The classification module looks in the database for a filename.
2. It opens the downloaded HTML file, parses it and classifies it. A document may have more than one classification.
3. Once completed, it writes this information to the database.
4. It then picks the next file.

Some thoughts:

- It would be useful if this could work in parallel.
- The system might need to be 'taught' if using something like a Bayesian filter. If so, is there any way you could suggest that we could teach it automatically (i.e. a site somewhere with this information, or existing databases)?
- It definitely needs to be in Python. It's our language of choice.
- We have looked at CRM 114 and Python versions of it. There may be other technologies that you are aware of that might do the same job, if not a better one. Whilst we like CRM 114, we're looking for someone with knowledge in this sphere who can help us.

Ideally, someone who already has knowledge of parsing and classification would find this a simple job (I hope), and we love to work with people who are knowledgeable in their field.

If you have any questions, please ask.

Thanks,
Ade and Sergey
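To make the four-step loop above concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not a proposed final design: the `pages` table and its `id`, `filename` and `tags` columns are hypothetical, the classifier is a plain multinomial naive Bayes filter rather than CRM 114, and the `margin` rule for assigning more than one tag is just one simple way to allow multiple classifications.

```python
import math
import re
import sqlite3
from collections import Counter, defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def tokenize(html):
    """Parse an HTML string and return its lowercase alphabetic words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z]+", " ".join(parser.parts).lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (must be 'taught')."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # tag -> word -> count
        self.doc_counts = Counter()              # tag -> training documents
        self.vocab = set()

    def train(self, tokens, tag):
        self.word_counts[tag].update(tokens)
        self.doc_counts[tag] += 1
        self.vocab.update(tokens)

    def scores(self, tokens):
        total_docs = sum(self.doc_counts.values())
        result = {}
        for tag, n_docs in self.doc_counts.items():
            counts = self.word_counts[tag]
            denom = sum(counts.values()) + len(self.vocab)
            logp = math.log(n_docs / total_docs)
            for tok in tokens:
                logp += math.log((counts[tok] + 1) / denom)
            result[tag] = logp
        return result

    def classify(self, tokens, margin=1.0):
        # Assumption: keep every tag whose log-score is within `margin`
        # of the best one, so a document can receive several tags.
        scores = self.scores(tokens)
        if not scores:
            return []
        best = max(scores.values())
        return sorted(t for t, s in scores.items() if best - s <= margin)


def read_html(path):
    with open(path, encoding="utf-8", errors="replace") as fh:
        return fh.read()


def classify_pending(conn, model, read_file=read_html):
    """Steps 1-4: fetch filenames, classify each page, store the tags."""
    # Hypothetical schema: pages(id, filename, tags); tags is NULL until done.
    rows = conn.execute(
        "SELECT id, filename FROM pages WHERE tags IS NULL ORDER BY id"
    ).fetchall()
    for page_id, filename in rows:
        tags = model.classify(tokenize(read_file(filename)))
        conn.execute(
            "UPDATE pages SET tags = ? WHERE id = ?", (",".join(tags), page_id)
        )
    conn.commit()
    return len(rows)
```

Because each page is classified independently, the body of the loop could in principle be handed to a pool of worker processes to run in parallel, with the database writes kept in a single process. The model only becomes useful once `train` has been called with a reasonable number of already-tagged example pages per category.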
## Deliverables
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):
a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment: Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.
b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.
3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).
## Platform
Linux