We are a team of engineers specializing in SEO consulting.
The goal is to
**crawl** blogs and forums and **save** their content into a database.
## Deliverables
## Getting the data
A list of blogs will be given.
Each will need to be crawled through its archives, and every article retrieved.
Sometimes a search-results page will be given instead; each result should be opened and treated as a new blog to crawl.
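As a rough sketch of that archive walk, assuming PHP's built-in DOM classes and a made-up XPath selector (the real selector will differ for every blog platform):

```php
<?php
// Sketch: collect article URLs from one archive page.
// The XPath selector below is a placeholder; each blog platform
// will need its own parsing rules (see the Technical section).
function crawl_blog_archive(string $archiveUrl): array
{
    $articleUrls = [];
    $html = file_get_contents($archiveUrl);   // a cURL wrapper could replace this
    $doc  = new DOMDocument();
    @$doc->loadHTML($html);                   // silence warnings from messy real-world markup
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//div[@class="archive"]//a/@href') as $href) {
        $articleUrls[] = $href->nodeValue;    // each link becomes an article to download
    }
    return $articleUrls;
}
```

The same function works for a search-results page: its links are simply queued as new blogs instead of articles.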
A list of forums (mostly phpBB) will be given, along with a login and password for each.
You will need to fetch the topics and translate them into articles: the first post in a topic is the "content", and the remaining posts are the "comments".
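A possible login sketch with cURL and a cookie jar; the form fields match a stock phpBB 3.x board, but some boards also require hidden sid/token fields:

```php
<?php
// Sketch: log in to a phpBB board and keep the session cookies,
// so later requests to viewtopic.php are authenticated.
function phpbb_login(string $baseUrl, string $user, string $pass)
{
    $ch = curl_init("$baseUrl/ucp.php?mode=login");
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query([
            'username' => $user,
            'password' => $pass,
            'login'    => 'Login',
        ]),
        CURLOPT_COOKIEJAR      => '/tmp/phpbb_cookies.txt',  // session persists here
        CURLOPT_COOKIEFILE     => '/tmp/phpbb_cookies.txt',
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
    ]);
    curl_exec($ch);
    return $ch;  // reuse the handle (and its cookies) to fetch the topic pages
}
```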
A list of usenet newsgroups will be given.
You will fetch their messages either through Google Groups or from a news server (NNTP reader access).
As with forums, the first post in a thread is the article, and its replies are the comments.
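For the NNTP route, a bare-bones fetch over a raw socket could look like this (no authentication or dot-unstuffing; Google Groups would be scraped like a blog instead):

```php
<?php
// Sketch: fetch one article from a newsgroup over NNTP (port 119).
function nntp_fetch_article(string $server, string $group, int $number): string
{
    $fp = fsockopen($server, 119, $errno, $errstr, 30);
    fgets($fp);                        // greeting: "200 ..." or "201 ..."
    fwrite($fp, "GROUP $group\r\n");
    fgets($fp);                        // "211 count first last group"
    fwrite($fp, "ARTICLE $number\r\n");
    fgets($fp);                        // "220 ... article follows"
    $article = '';
    while (($line = fgets($fp)) !== false) {
        if (rtrim($line) === '.') {    // a lone dot ends the article
            break;
        }
        $article .= $line;
    }
    fwrite($fp, "QUIT\r\n");
    fclose($fp);
    return $article;
}
```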
In all cases, an article shorter than X characters will not be downloaded.
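Since X is not specified in the brief, the filter can sit behind a constant to fill in later:

```php
<?php
// The minimum-length rule; X is not given in the brief, so the
// threshold is a placeholder constant.
const MIN_ARTICLE_LENGTH = 0; // replace 0 with the X from the brief

function is_long_enough(string $content): bool
{
    return mb_strlen(strip_tags($content)) >= MIN_ARTICLE_LENGTH;
}
```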
## Storing the data
Each article in a blog will be a new row in the main table.
You may also create other tables as needed; if you think a table with one row per blog would be useful, create it.
Main table fields:
* id: primary key, auto increment
* title
* content <- the whole article, not the whole web page
* date of release of the article
* source url <- also declared UNIQUE, so that if we crawl the site again the same article is not taken twice
* tags (if any) (<- separate table?)
* categories (if any) (<- separate table?)
* user comments (in a separate table): nickname, date, content
* images, if the article has any. They will be put in a directory named after the id field; images can be fetched with system("wget ...") (see the sketch below)
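To make the field list above concrete, here is one possible schema, assuming MySQL through PDO; the table and column names are placeholders, since the actual names are pre-defined (see the Technical section):

```php
<?php
// Sketch of the main table and the comments table, assuming MySQL.
// Names are placeholders; the real ones are pre-defined.
$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'password');
$pdo->exec("
    CREATE TABLE IF NOT EXISTS articles (
        id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        title       VARCHAR(255) NOT NULL,
        content     MEDIUMTEXT   NOT NULL,   -- the article only, not the whole page
        released_at DATETIME     NULL,       -- date of release of the article
        source_url  VARCHAR(500) NOT NULL,
        UNIQUE KEY uniq_source (source_url)  -- a re-crawl must not store an article twice
    )
");
$pdo->exec("
    CREATE TABLE IF NOT EXISTS comments (
        id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        article_id INT UNSIGNED NOT NULL,
        nickname   VARCHAR(100) NOT NULL,
        posted_at  DATETIME     NULL,
        content    TEXT         NOT NULL,
        FOREIGN KEY (article_id) REFERENCES articles(id)
    )
");
```

With the UNIQUE key on source_url, inserts can use INSERT IGNORE so a re-crawl silently skips articles that are already stored. Tags and categories would get the same treatment as comments: a separate table keyed on article_id.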
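And a sketch of the image step, using the system("wget ...") call mentioned in the field list; the images/ base path is an assumption:

```php
<?php
// Sketch: download an article's images into a directory named after
// its id field, shelling out to wget as the brief suggests.
function save_images(int $articleId, array $imageUrls): void
{
    $dir = __DIR__ . "/images/$articleId";  // assumption: images/ as the base path
    if (!is_dir($dir)) {
        mkdir($dir, 0775, true);
    }
    foreach ($imageUrls as $url) {
        // escapeshellarg() keeps unusual URLs from breaking the shell command
        system('wget -q -P ' . escapeshellarg($dir) . ' ' . escapeshellarg($url));
    }
}
```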
## Technical
The name of the table, the class(es) to use, and some downloading functions are pre-defined or subject to change.
Each blog or blog platform will obviously need its own parsing rules. You can decide to keep the parsing information in a table or in the code, as you prefer.
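If the parsing information lives in code, a per-platform map of selectors is one simple shape for it (the selectors below are made up; the same columns could just as well live in a database table):

```php
<?php
// Sketch: parsing rules kept in code as a per-platform selector map.
// Values are hypothetical; each blog platform needs its own entries.
// The same data could live in a table (platform, title_xpath, ...).
const PARSERS = [
    'wordpress' => [
        'title'   => '//h1[@class="entry-title"]',
        'content' => '//div[@class="entry-content"]',
        'date'    => '//time[@class="entry-date"]/@datetime',
    ],
    'blogspot' => [
        'title'   => '//h3[@class="post-title"]',
        'content' => '//div[@class="post-body"]',
        'date'    => '//abbr[@class="published"]/@title',
    ],
];
```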