There are other great ways of using the WordPress engine. For instance, I’m going to put out a site with music lyrics. This sort of website is used by SEOs for building another back-end dedicated to link exchange programs, like SEO directories, content automatically generated from RSS feeds, bibles, books, or any other useless SEO crap. From time to time I see offers of lyrics databases for sale, and the question is: why pay for something you can make on your own for free?
So I’ve just downloaded 300 MB of song lyrics with HTTrack (that’s another killer app!). Now I have a few thousand HTML files, so what do I need next? I’m in need of a smart script, written in Perl or Python, that can convert all these files into a more useful format by using regular expressions. It would also be pretty cool to cut off all the unwanted content and insert the rest into a batch of XML files ready to use with WordPress Import. It’s high time for me to learn Perl, you know I’m a noobie in programming, srsly.
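A minimal sketch of that kind of conversion script in Python, using only the standard library. It assumes every downloaded page keeps the song name in `<title>` and the lyrics in a `<div class="lyrics">` with no nested `<div>` inside; that markup, the function names, and the plain RSS 2.0 output are all assumptions for illustration, and whether a particular WordPress importer accepts exactly this feed would need checking.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical page layout: the real patterns depend on the scraped site.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
LYRICS_RE = re.compile(r'<div class="lyrics">(.*?)</div>', re.IGNORECASE | re.DOTALL)
BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def extract_song(html):
    """Return (title, lyrics) from one page, or None if the layout differs."""
    title = TITLE_RE.search(html)
    body = LYRICS_RE.search(html)
    if not title or not body:
        return None
    text = BR_RE.sub("\n", body.group(1))  # keep line breaks
    text = TAG_RE.sub("", text).strip()    # drop any remaining tags
    return title.group(1).strip(), text


def to_rss_item(title, lyrics):
    """Wrap one song in an RSS <item>, escaping XML special characters."""
    return "<item><title>%s</title><description>%s</description></item>" % (
        escape(title),
        escape(lyrics),
    )


def build_feed(songs, site_title="Lyrics"):
    """Build one RSS 2.0 document from an iterable of (title, lyrics) pairs."""
    items = "\n".join(to_rss_item(t, l) for t, l in songs)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<rss version="2.0"><channel>\n'
        "<title>%s</title>\n%s\n</channel></rss>\n" % (escape(site_title), items)
    )
```

In practice you would loop over the downloaded files, call `extract_song` on each one, skip the pages that return `None`, and write the result of `build_feed` to disk in batches of a few hundred songs.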
Before you ask a question:
There’s a sea of lyrics on the web. So if you are not advanced in SEO, do not ask how to develop a lyrics website successfully - even if you get a piece of your spammy content indexed in Google, you won’t ever take advantage of it. There is more than one way to do it.



August 7th, 2007 #
This is a pretty good idea if you are planning on building up a repertoire of spam sites to beef up your link juice. A lot of these lyrics sites are full of banner ads, pop-ups and porn. If you create one that is a little more tasteful, with a nice CSS design and a light bit of JS, you should easily be able to start getting the links in. My only worry would be that there are so many lyrics sites out there that you would have to compete against.
Also, I find the PHP DOM functions excellent for stealing data from other sites.
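The tip above is about PHP's DOM functions; as a rough analogue in Python (the other language discussed in this thread), here is a sketch that walks the markup with the standard library's `html.parser` instead of regular expressions. The class name, the helper function, and the `lyrics` CSS class are hypothetical and would have to match the real pages.

```python
from html.parser import HTMLParser


class LyricsExtractor(HTMLParser):
    """Collect the text inside the first-level <div> carrying a given class."""

    def __init__(self, target_class="lyrics"):
        super().__init__()
        self.target_class = target_class  # assumed CSS class of the lyrics block
        self.depth = 0                    # >0 while inside the target <div>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            if self.depth:
                self.chunks.append("\n")  # keep line breaks inside the lyrics
            return
        if self.depth:
            if tag == "div":
                self.depth += 1           # track nested <div> elements
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_lyrics(html, target_class="lyrics"):
    """Return the plain text of the lyrics block, or '' if none is found."""
    parser = LyricsExtractor(target_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```

Unlike a single regular expression, this keeps working when the lyrics block contains nested `<div>` elements or extra classes, which helps when page layouts vary a little.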
August 8th, 2007 #
I’m not gonna build my own spam empire (not yet), I just want to start a socially oriented lyrics website of full value and check whether my skills can get it ranked highly in search engines. It won’t be easy but damn I want to try.
Thanks a lot for your tip. Honestly, I still don’t know how to get down to that scripting. It looks like I need to consult a professional programmer, and if he doesn’t have enough time for me I will be forced to learn on my own and eventually seek help on forums. I like to learn languages and it’s cool to know something new, but hell, I do prefer foreign ones, not programming ones. I am more afraid of C++ than Chinese putonghua, srsly
August 8th, 2007 #
Scraping can be a real pain especially if page styles / layouts vary (using HTML as opposed to CSS).
You also want to be mindful of the IP you use in these exercises…
The fact that Python seems to have so much kudos amuses me. It was the first language taught when I was at University, and everyone else thought it was a waste of time, having little commercial value….
It’s a nice language, but I think you would be able to do all you need with PHP, which will also give you the means to work with the vast majority of open source webapps….
The only popular app I know of written in Python is Plone…. I believe NASA were quite big on Python at one time, in case you were thinking of working there
August 8th, 2007 #
Fortunately, if all pages follow one scheme you can get everything done with regular expressions, I guess. Some sites have protection, but it’s not impossible to introduce yourself as Microsoft Internet Explorer; most website downloaders have that somewhere in their options. You can also set a “pause”, or use some proxy when you get banned…
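For anyone scripting the download instead of using a ready-made downloader, the user agent and the “pause” could be sketched like this with Python's `urllib.request`. The function names and the five second default are assumptions for illustration, and the `opener` and `sleep` parameters are only there so the loop can be tried without touching the network.

```python
import time
import urllib.request


def build_request(url, user_agent):
    """Create a request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_all(urls, user_agent, pause_seconds=5.0,
              opener=urllib.request.urlopen, sleep=time.sleep):
    """Download each URL in turn, pausing between requests.

    Returns a dict mapping each URL to the raw bytes of its response.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index:
            sleep(pause_seconds)  # no pause before the very first request
        with opener(build_request(url, user_agent)) as response:
            pages[url] = response.read()
    return pages
```

A longer pause makes the run slower but keeps the load on the other site small, which is also the cheapest way to avoid getting banned in the first place.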
Jez I cannot believe that you recognize only Plone, if you use Linux (even from time to time) you must bump into a sea of great scripts written in Python
The only languages I know well are XHTML & CSS. I’m just a noob in everything else, including PHP. I mentioned Python only because it looks like the easiest language for me to learn
August 8th, 2007 #
Hi Bart,
I was not talking about scripts but full blown apps, in PHP you have LOTS of CMS / Community / E-Commerce applications available, and it can also be used for scripting…. and it is easier to find paid work with PHP.
Python is a good language for beginners, and for rapid prototyping… I liked Python a lot, but did not pursue it, I learned Java instead, and am now learning PHP….
It doesn’t matter too much which you learn, once you know one language it’s very easy to pick up another…
On Linux I am no expert… I only really use it in a LAMP context and a bit on my laptop, which dual boots XP / Ubuntu…. I know more about Apache / MySQL than I do about Linux, if that makes sense….
August 8th, 2007 #
Yeh, I bet some of these sites get a surprise when they look in their stats and find out some cheeky Polish guy has taken them for 100 MB of bandwidth.
August 9th, 2007 #
Alternatively, you can play it safe and download everything from the cache of Live.com, but let’s say the most popular song lyrics sites have really fat data transfer allowances, so the owners won’t be angry when you gather 100 MB once in a blue moon…
Alright, so you can switch the user agent to MSN and use an American proxy, and it will be counted as just another frenzy of the MSN bot
August 9th, 2007 #
I can’t believe I hadn’t thought of swiping Google, Yahoo or Live cache. Great idea!
There is one web site that I was trying to swipe 1000s of pages from, and they had such an advanced system that I just gave up. After grabbing about 50 pages with no delay you would have your IP banned. So I got about 30 unreliable proxies and set them to 60 second delays, but then it was too slow and the proxies would die after a day.
August 9th, 2007 #
Honestly, I don’t know why the folks who develop website downloaders haven’t built search engine caches into their programs as an alternative to proxy servers.
Anyway I’m glad you found that tip useful. Seems it was not an “advanced system” but just another extension of Apache, I think I’ve seen it before and must figure it out
August 9th, 2007 #
I recently took 10k pages off a site. It took all night to run, and I was surprised my IP was not blocked… just depends who you are dealing with, I s’pose…
August 9th, 2007 #
Let’s do a contest: the first spammer who scrapes all the content from yahoo.com (with subdomains) and gets it completely indexed in Google might win a linkback!
By the way compare the numbers of indexed pages of yahoo.com in Google (324) and Live.com (20,866) WTF
August 9th, 2007 #
You’re starting to sound like you’re bordering on the criminally insane !!!
So Google doesn’t like Yahoo much then… or is it just easier to get into Live?
August 9th, 2007 #
On another note, I just added Geyes to my top panel… I really don’t know how I survived all this time without them, made me a lot happier haha!