stock-newspaper-crawler

This project is no longer maintained and is kept for reference only.

The project works in three steps:

Create the database used to store the crawled essays' data.
Crawl data from a user-defined website and store it in the database.
Describe the metadata of the crawled essays in a report, then plot a bar chart and a pie chart from that descriptive report.
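The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it uses the standard-library sqlite3 module as a stand-in for MySQLdb so that it runs without a database server, it is written for Python 3, and the table name, column names, function names, and example URLs are all hypothetical.

```python
import sqlite3


def create_database(path=":memory:"):
    """Step 1: create the table that will hold crawled essays (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS essays ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "url TEXT UNIQUE, title TEXT, publish_date TEXT, content TEXT)"
    )
    return conn


def store_essays(conn, essays):
    """Step 2: store crawled records; rows with an already-seen URL are skipped."""
    conn.executemany(
        "INSERT OR IGNORE INTO essays (url, title, publish_date, content) "
        "VALUES (?, ?, ?, ?)",
        essays,
    )
    conn.commit()


def describe_essays(conn):
    """Step 3: count essays per publish date, a summary a bar or pie chart could plot."""
    rows = conn.execute(
        "SELECT publish_date, COUNT(*) FROM essays "
        "GROUP BY publish_date ORDER BY publish_date"
    ).fetchall()
    return dict(rows)


# Made-up records standing in for crawled essays.
conn = create_database()
store_essays(conn, [
    ("http://example.com/a", "Title A", "2015-07-28", "text a"),
    ("http://example.com/b", "Title B", "2015-07-28", "text b"),
    ("http://example.com/c", "Title C", "2015-07-29", "text c"),
    ("http://example.com/a", "Title A", "2015-07-28", "text a"),  # duplicate URL
])
report = describe_essays(conn)
print(report)
```

The dictionary returned by describe_essays maps each date to a count, which is the shape of data a plotting step would typically consume.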

Python is used throughout this project. In particular:

In the crawler part: BeautifulSoup, urllib2, re;
In the database part: MySQLdb;
In the plotting part: matplotlib;
In other parts: logging, os, time, numpy.

2015-10-18 21:57:00

My first repository on GitHub!

I (have to) love ☕. More concretely, this is the first step (crawling a corpus from CCSTOCK.CN) toward an LDA model (one of the topic models).

This little project covers the fundamentals of natural language processing, mainly Chinese word counting, word frequency statistics, and so on. The Chinese word counting module uses the MM (Maximum Matching) and RMM (Reverse Maximum Matching) methods.
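As a rough illustration of the two methods named above, here is a minimal sketch of dictionary-based forward and reverse maximum matching, followed by a word frequency count. It is not the project's own module: the function names, the max_len parameter, and the toy dictionary are assumptions made for this example, and the code targets Python 3.

```python
from collections import Counter


def forward_max_match(text, vocab, max_len=4):
    """MM: scan left to right, always taking the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, vocab, max_len=4):
    """RMM: scan right to left, always taking the longest dictionary word."""
    words = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


# Toy dictionary, chosen so the two directions disagree.
vocab = {"研究", "研究生", "生命", "的", "起源"}
sentence = "研究生命的起源"

mm_words = forward_max_match(sentence, vocab)
rmm_words = reverse_max_match(sentence, vocab)
word_freq = Counter(rmm_words)

print(mm_words)
print(rmm_words)
```

On this sentence the forward pass greedily takes 研究生 and leaves 命 stranded, while the reverse pass yields 研究 / 生命 / 的 / 起源. Differences like this are one common reason for running both directions and comparing the results.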

Summary

2015-7-29

The project is paused for now. The main function, crawling stock news data from CCSTOCK.CN, is implemented. However, some tasks remain:

Further improve the match rate for stock news. The regular expression needs more tuning; the current match rate is about 0.86.
Some variables could be generators, such as all_essays_link_list.
Use the map method to improve efficiency, for example when inserting records into the database.
Resume crawling. If the network is bad or the program stops unexpectedly, restarting main.py should continue crawling news from the last breakpoint.
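Two of the remaining tasks above, using a generator for the link list and resuming from the last breakpoint, could be approached along these lines. This is only a sketch under assumptions: the checkpoint file format (a JSON list of finished links), the function names, and the fetch callback are hypothetical and do not come from main.py. It is written for Python 3.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the set of links already crawled, or an empty set on first run."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as f:
        return set(json.load(f))


def save_checkpoint(path, done):
    """Write the checkpoint to a temp file, then replace the old one atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)


def iter_pending_links(all_links, done):
    """Generator: lazily yield only the links that have not been crawled yet."""
    for link in all_links:
        if link not in done:
            yield link


def crawl(all_links, checkpoint_path, fetch):
    """Fetch every pending link, recording progress after each success."""
    done = load_checkpoint(checkpoint_path)
    for link in iter_pending_links(all_links, done):
        fetch(link)  # may raise, e.g. on a network failure
        done.add(link)
        save_checkpoint(checkpoint_path, done)
    return done


# Simulate a run that breaks on the third link, then a restart.
checkpoint = os.path.join(tempfile.mkdtemp(), "crawl_checkpoint.json")
links = ["link-1", "link-2", "link-3"]

first_run = []


def failing_fetch(link):
    if link == "link-3":
        raise IOError("simulated network failure")
    first_run.append(link)


try:
    crawl(links, checkpoint, failing_fetch)
except IOError:
    pass

second_run = []
crawl(links, checkpoint, second_run.append)
print(first_run, second_run)
```

Saving the checkpoint after every link costs one small file write per page, but it means a restart repeats at most the single fetch that was interrupted.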

ysh329/stock-newspaper-crawler

Folders and files

Latest commit

History

Repository files navigation

stock-newspaper-crawler

My first respository on GitHub!

Summary

2015-7-29

About

Topics

Resources

License

Stars

Watchers

Forks

Languages