How to develop another website with song lyrics

There are other great ways of using the Wordpress engine. For instance, I’m going to put out a site with music lyrics. This sort of website is used by SEOs for building another back-end dedicated to link exchange programs, like SEO directories, content automatically generated from RSS feeds, bibles, books, or any other useless SEO crap. From time to time I see offers of lyrics databases for sale, and the question is: why pay for something you can make on your own for free?

So I’ve just downloaded 300 MB of song lyrics with HTTrack (that’s another killer app!). Now I have a few thousand HTML files, so what do I need next? I need a smart script, written in Perl or Python, that can convert all these files into a more useful format by using regular expressions. It would also be pretty cool to cut out all the unwanted content and put the rest into a batch of XML files ready to use with Wordpress Import. It’s high time for me to learn Perl, you know I’m a newbie in programming, srsly.
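Here is a minimal sketch of what such a script could look like in Python. It is only an illustration under assumptions: the page layout (title in `<h1>`, lyrics in `<div class="lyrics">`), the function names, and the choice of plain RSS 2.0 items as the import format are all hypothetical, and the patterns would have to be adapted to the pages actually downloaded.

```python
import html
import re
from pathlib import Path
from xml.sax.saxutils import escape

# Assumed (hypothetical) layout: title in <h1>, lyrics in <div class="lyrics">.
TITLE_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
LYRICS_RE = re.compile(r'<div class="lyrics"[^>]*>(.*?)</div>',
                       re.IGNORECASE | re.DOTALL)
BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def clean(fragment):
    """Turn <br> into newlines, drop remaining tags, decode entities."""
    text = BR_RE.sub("\n", fragment)
    text = TAG_RE.sub("", text)
    return html.unescape(text).strip()


def extract_song(page):
    """Return (title, lyrics) for one page, or None if the layout differs."""
    title = TITLE_RE.search(page)
    lyrics = LYRICS_RE.search(page)
    if not title or not lyrics:
        return None
    return clean(title.group(1)), clean(lyrics.group(1))


def to_rss(songs):
    """Build a plain RSS 2.0 document with one <item> per song."""
    items = []
    for title, lyrics in songs:
        items.append(
            "<item>\n"
            f"<title>{escape(title)}</title>\n"
            f"<description>{escape(lyrics)}</description>\n"
            "</item>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<rss version="2.0"><channel>\n'
        "<title>Lyrics import</title>\n"
        + "\n".join(items)
        + "\n</channel></rss>\n"
    )


def convert_directory(src_dir, out_file):
    """Walk src_dir for .html files and write every matching song to out_file."""
    songs = []
    for path in sorted(Path(src_dir).rglob("*.html")):
        song = extract_song(path.read_text(encoding="utf-8", errors="replace"))
        if song is not None:
            songs.append(song)
    Path(out_file).write_text(to_rss(songs), encoding="utf-8")
    return len(songs)
```

Calling something like `convert_directory("mirror", "lyrics.xml")` (both names are placeholders) would process the whole download in one go. The non-greedy pattern stops at the first `</div>`, so lyrics wrapped in nested divs would need a real HTML parser instead of regular expressions.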

Before you ask a question:
There’s a sea of lyrics on the web. So if you are not advanced in SEO, do not ask how to develop a lyrics website successfully: even if you get a piece of your spammy content indexed in Google, you won’t ever take advantage of it. There is more than one way to do it ;)


Commented 13 times

  1. Mutiny Design says:

    This is a pretty good idea if you are planning on building up a repertoire of spam sites to beef up your link juice. A lot of these lyrics sites are full of banner ads, pop-ups and porn. If you create one that is a little more tasteful, with a nice CSS design and a light bit of JS, you should easily be able to start getting the links in. My only worry would be that there are so many lyrics sites out there that you would have to compete against.

    Also, I find for stealing data from other sites the PHP DOM function is excellent.

  2. Florchakh says:

    I’m not gonna build my own spam empire (not yet ;) ), just want to start a social oriented lyrics website of full value and check my skills to have it ranked highly in search engines. It won’t be easy but damn I want to try.

    Thanks a lot for your tip, honestly I still don’t know how to get down to that scripting. It looks like I need to consult a professional programmer; if he doesn’t have enough time for me I will be forced to learn on my own and eventually seek help on forums. I like to learn languages and it’s cool to know something new, but hell, I do prefer foreign ones, not programming ones. I am more afraid of C++ than Chinese putonghua, srsly :)

  3. Jez says:

    Scraping can be a real pain especially if page styles / layouts vary (using HTML as opposed to CSS).

    You also want to be mindful of the IP you use in these exercises…

    The fact that Python seems to have so much kudos amuses me. It was the first language taught when I was at University, and everyone else thought it was a waste of time, having little commercial value….

    It’s a nice language, but I think you would be able to do all you need with PHP, which will also give you the means to work with the vast majority of open source webapps….

    The only popular app I know of written in Python is Plone…. I believe NASA were quite big on Python at one time, in case you were thinking of working there ;-)

  4. Florchakh says:

    Fortunately, if all pages follow one scheme you can get everything done with regular expressions, I guess. Some sites have protection, but it’s not impossible to introduce yourself as Microsoft Internet Explorer; most website downloaders have that somewhere in the options. You can also set a “pause”, or use some proxy when you get banned…

    Jez I cannot believe that you recognize only Plone, if you use Linux (even from time to time) you must bump into a sea of great scripts written in Python ;)

    The only “programming language” I know well is XHTML & CSS. I’m just a noob in any other stuff, including PHP. I’ve mentioned Python only because this language looks the easiest for me to learn :)

  5. Jez says:

    Hi Bart,

    I was not talking about scripts but full blown apps, in PHP you have LOTS of CMS / Community / E-Commerce applications available, and it can also be used for scripting…. and it is easier to find paid work with PHP.

    Python is a good language for beginners, and for rapid prototyping… I liked Python a lot, but did not pursue it, I learned Java instead, and am now learning PHP….

    It doesn’t matter too much which you learn, once you know one language it’s very easy to pick up another…

    On Linux I am no expert… I only really use it in a LAMP context and a bit on my laptop, which dual boots XP / Ubuntu…. I know more about Apache / MySQL than I do about Linux, if that makes sense….

  6. Mutiny Design says:

    Yeh, I bet some of these sites get a surprise when they look in their stats and find out some cheeky Polish guy has taken them for 100 MB of bandwidth.

  7. Florchakh says:

    Alternatively, there is a way to stay on the safe side and download everything from the cache of Live.com, but let’s say the most popular sites with song lyrics have really fat data transfer allowances, so the owners won’t be angry when you gather 100 MB once in a blue moon…

    Alright, so you can switch the user agent to MSN and use an American proxy, and it will be counted as just another frenzy of the MSN bot ;)

  8. Mutiny Design says:

    I can’t believe I hadn’t thought of swiping Google, Yahoo or Live cache. Great idea!

    There is one web site that I was trying to swipe 1000s of pages from, and they had such an advanced system that I just gave up. After grabbing about 50 pages with no delay you would have your IP banned. So I got about 30 unreliable proxies and set them to 60 second delays, but it was too slow and the proxies would die after a day.

  9. Florchakh says:

    Honestly, I don’t know why the folks who develop website downloaders haven’t included search engine caches in their programs as an alternative to proxy servers.

    Anyway I’m glad you found that tip useful. Seems it was not an “advanced system” but just another extension of Apache, I think I’ve seen it before and must figure it out ;)

  10. Jez says:

    I recently took 10k pages off a site. It took all night to run, and I was surprised my IP was not blocked… just depends who you are dealing with I s’pose…

  11. Florchakh says:

    Let’s do a contest: the first spammer who scrapes all content from yahoo.com (with subdomains) and gets it completely indexed in Google might win a linkback!

    By the way compare the numbers of indexed pages of yahoo.com in Google (324) and Live.com (20,866) WTF :D

  12. Jez says:

    You’re starting to sound like you’re bordering on the criminally insane !!!

    So Google doesn’t like Yahoo much then… or is it just easier to get into Live?

  13. Jez says:

    On another note, I just added Geyes to my top panel… I really don’t know how I survived all this time without them, made me a lot happier haha!
