Automatic Source URL Edits
azulmarino
03-11-2007, 03:50 PM
I use iSilo daily to read a copy of the local online newspaper. The only inconvenience was that the URL of the current issue is dynamic, like this:
http://newspaper.com/year/month/day/index.html
So every day, before the scheduled iSiloX task began downloading the newspaper, I had to manually change the date part of the URL. Obviously, I sometimes forgot to do this, and iSiloX just re-downloaded the previous day's issue.
Unfortunately, iSiloX and iSiloXC offer no way to dynamically change that part of the source URL, but I finally managed to do it by writing a macro in an advanced text editor, UltraEdit-32, and scheduling it to run daily before the iSiloX download job.
The steps to perform are as follows:
Download UltraEdit-32 text editor and install it
Open your iSiloX .ixl file in UltraEdit
Position the cursor where you want to start recording the macro
Start to record a macro named lr21chdate
Using only the keyboard, edit the date in the URL, pressing the F7 shortcut key to insert the current date, until your edit is complete
Stop recording the macro
Save the macro to a file named dailynews (dailynews.MAC)
Open Windows scheduler and create a new task
The task program is uedit32
The task parameters are "G:\dox\AppFiles\iSiloX\dailynews.ixl" -l6 -c61 /M,E="G:\dox\AppFiles\iSiloX\dailynews.MAC/lr21chdate"
The -l and -c switches place the cursor at the line and column where you started recording your macro
Set this task to run daily before the task iSiloX uses to download the newspaper
Voila
You can find more details in UltraEdit's local help.
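For anyone who would rather not rely on an editor macro, the same edit could be sketched as a short script. This is only a sketch under assumptions: it assumes the source URL is stored as plain text in the file and that the date appears as /year/month/day/ with zero-padded month and day. The function name update_date_in_text is made up for illustration.

```python
import re
from datetime import date

# Matches a /YYYY/MM/DD/ path segment such as /2007/03/10/ (assumed URL layout).
DATE_SEGMENT = re.compile(r"/\d{4}/\d{2}/\d{2}/")


def update_date_in_text(text: str, day: date) -> str:
    """Replace every /YYYY/MM/DD/ segment in text with the given day."""
    return DATE_SEGMENT.sub(day.strftime("/%Y/%m/%d/"), text)


if __name__ == "__main__":
    # Example line only; the real file contents are not shown in this thread.
    sample = "http://newspaper.com/2007/03/10/index.html"
    print(update_date_in_text(sample, date(2007, 3, 11)))
```

A scheduled task could read the file, pass its text through this function with date.today(), and write it back before the iSiloX download job runs.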
Do you know a better, faster way to do this? Share with us!
Starman
03-31-2007, 04:41 AM
Would a wildcard * work? I use them all the time for fetching news stories whose URLs have a changing date in them.
It would save you all the macro work. You can set them up using URL filters in the links tab for a particular document.
azulmarino
04-01-2007, 01:08 PM
Regular wildcards are useless in this case, because if I use something like * or ? iSilo will take it to mean that I want to download the newspaper editions for all available dates.
A useful wildcard for this case would be one that inserts the current year, another for the current month, and another for the current day, all as numbers and with or without leading zeros.
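To make the idea concrete, here is a rough sketch of how such date wildcards might expand. The token names ({yyyy}, {mm}, {m}, {dd}, {d}) and the function expand_date_tokens are hypothetical; nothing like this exists in iSiloX as far as this thread shows.

```python
from datetime import date


def expand_date_tokens(template: str, day: date) -> str:
    """Expand hypothetical date tokens in a URL template.

    {yyyy}      -> four-digit year
    {mm}, {dd}  -> month and day with a leading zero
    {m}, {d}    -> month and day without a leading zero
    """
    tokens = {
        "{yyyy}": f"{day.year:04d}",
        "{mm}": f"{day.month:02d}",
        "{dd}": f"{day.day:02d}",
        "{m}": str(day.month),
        "{d}": str(day.day),
    }
    for token, value in tokens.items():
        template = template.replace(token, value)
    return template


if __name__ == "__main__":
    template = "http://newspaper.com/{yyyy}/{mm}/{dd}/index.html"
    print(expand_date_tokens(template, date.today()))
```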
Starman
04-03-2007, 06:08 AM
I use the * wildcard all the time for fetching newspapers whose dates change daily! It will only fetch everything (all dates) if they are available from the main page! (Or on another page, depending on the link depth you've set.)
I have no problem using * wildcard expressions (and one can add certain exclusions, too) as the newspapers I fetch only give the current day's stories--and, for those sites that keep adding stories, it fetches the page, sure, but one recognizes what one's read before.
Does your site just keep adding pages and keeping links to old stories? Can you give us the link in question? That way others could test various wildcard expressions to help you get this solved.
Is there an alternative web page to go to-- e.g., an RSS feed that updates daily for you?
azulmarino
04-04-2007, 12:52 AM
The homepage of the online newspaper I read is http://www.larepublica.com.uy/lr3/
That link redirects to http://www.larepublica.com.uy/lr3/larepublica/2007/04/04/portada/ Note: hover over the link to see the full URL in your browser status bar.
That corresponds to the current day's issue, showing the main headlines with links to individual sections in the sidebar. This is the layout they call portada (cover).
The site offers an unlinked layout format called sumario (summary) which features titles, abstracts and links to all the current articles arranged by section all in a single long page. In this case the URL would be: http://www.larepublica.com.uy/lr3/larepublica/2007/04/04/sumario/
That is the kind of source URL I download into iSiloX, because this layout allows faster checking of the whole issue with less page flipping.
As I said, that kind of URL ending with "/sumario/" is not linked from any page on the site; I really don't remember how I discovered it. So I have to use it as the source URL for iSiloX.
If I use wildcards as you suggest, I can't think of any filter or configuration that would prevent iSiloX from downloading all available issues, which all use the same URL format and differ only in the date numbers.
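The problem can be illustrated with Python's fnmatch module, assuming iSiloX's * behaves like an ordinary shell-style wildcard (its real matching rules may differ). The 2007/04/03 URL is made up by following the same pattern, and matching_urls is just an illustrative helper.

```python
from fnmatch import fnmatchcase

# A wildcard standing in for the date part of the sumario URL.
PATTERN = "http://www.larepublica.com.uy/lr3/larepublica/*/sumario/"


def matching_urls(pattern: str, urls: list[str]) -> list[str]:
    """Return the URLs that match a shell-style wildcard pattern."""
    return [url for url in urls if fnmatchcase(url, pattern)]


URLS = [
    "http://www.larepublica.com.uy/lr3/larepublica/2007/04/04/sumario/",
    "http://www.larepublica.com.uy/lr3/larepublica/2007/04/03/sumario/",  # assumed older issue
    "http://www.larepublica.com.uy/lr3/larepublica/2007/04/04/portada/",
]

if __name__ == "__main__":
    for url in matching_urls(PATTERN, URLS):
        print(url)
```

Both sumario URLs match, whatever their date; the pattern alone cannot single out today's issue.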
Starman
04-06-2007, 05:17 PM
Been away from the net for a few days... I'll play with this some over the weekend and see what I can come up with.
My first cursory look at the links suggested that you could use
http://www.larepublica.com.uy/lr3/larepublica/
as your source URL
Set your link depth to 1 and uncheck the option to follow off-site links.
Then, to prevent the advertisements
use * for the exclude filter
and use http://www.larepublica.com.uy/lr3/larepublica/* for the inclusion filter.
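Roughly, the two filters above could be read like this. It is only a sketch, assuming that an inclusion match takes priority over an exclusion match and that * works like a shell-style wildcard; iSiloX's actual rules may differ. The ad URL and the function should_follow are made up for illustration.

```python
from fnmatch import fnmatchcase

INCLUDE = ["http://www.larepublica.com.uy/lr3/larepublica/*"]
EXCLUDE = ["*"]


def should_follow(url: str) -> bool:
    """Assumed rule: an inclusion match wins; otherwise an exclusion match rejects."""
    if any(fnmatchcase(url, pattern) for pattern in INCLUDE):
        return True
    return not any(fnmatchcase(url, pattern) for pattern in EXCLUDE)


if __name__ == "__main__":
    print(should_follow("http://www.larepublica.com.uy/lr3/larepublica/2007/04/04/portada/"))
    print(should_follow("http://ads.example.com/banner.html"))  # hypothetical ad link
```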
I just ran it within iSilo and you can watch it fetch the pages-- it only fetches the current day's stories and is a remarkably compact 53K.
But that's not what you wanted. It's based on the first link you posted-- and the side bar framing is hard to navigate. There's probably a way to set up iSilo to reorganize those-- but what you really want is the wonderful long summary page that groups the various category links on one page. Hmm...
I played around with the URL and discovered the following link
http://www.larepublica.com.uy/lr3/sumario
seems to produce exactly what you want and appears to give just the current day's summary!!
I'll have to try it again tomorrow and see... but give it a whirl and let me know!
You know if this fails, you could try contacting them and see if they have an RSS feed--or nudge them to set one up!
Buena suerte! (I grew up in Latin America; used to read La Prensa when living in Peru. I did find an intriguing RSS feed for La Republica out of Lima, but it seems to be a weekly edition and includes some older links in it. And it wouldn't be the Montevideo edition in any case.)
azulmarino
04-07-2007, 04:34 PM
Starman, you are right that this link:
http://www.larepublica.com.uy/lr3/sumario
will load a summary for the current issue, that's a nice tip!
There's a catch I chose not to mention before, just to keep things simple the first time: at least the last time I tried to read a current issue of the newspaper (the one published the same day I read it), most content was only available to subscribers upon entering a code. That issue was unlocked as soon as the next issue was published, a few minutes after midnight.
That's why every day, half an hour after midnight, I chose to download and read the previous day's issue, and so I had to come up with a complicated way to automate the download.
Now that I tried your link, I am surprised that all the content of the current issue is unlocked and freely available! I guess they are selling more paper editions and so they decided to give the online edition for free.
So now I can use the URL you mention above to download the summary of the current issue; I just hope it stays free. Anyway, in case they decide to lock it again, I know how to automate downloading yesterday's issue.
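For reference, the URL of yesterday's summary can be worked out like this. It is a sketch that assumes the site keeps the /year/month/day/sumario/ layout shown earlier in this thread; the function names are made up for illustration.

```python
from datetime import date, timedelta

BASE = "http://www.larepublica.com.uy/lr3/larepublica"


def sumario_url(day: date) -> str:
    """Build the summary URL for a given issue date (assumed URL layout)."""
    return f"{BASE}/{day:%Y/%m/%d}/sumario/"


def previous_issue_url(today: date) -> str:
    """Build the summary URL for the issue published the day before."""
    return sumario_url(today - timedelta(days=1))


if __name__ == "__main__":
    print(previous_issue_url(date.today()))
```

Subtracting a timedelta takes care of month and year boundaries, which is easy to get wrong when editing the numbers by hand.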
Starman
04-08-2007, 01:51 PM
Glad it worked!
Yeah, I noticed that it was from the day before, but I thought that was just a matter of when the paper was put online...
Oh, well. At least it's close. If I get some time this week (unlikely), I will still play around with the current date web page.
I did try some URL entries and filters with the first part of the URL, then a * substituting for today's date, and then /sumario but those didn't take. If only they'd put the sumario link on the main page--or simply use the sumario link I found for the current day's edition.
Good luck with it... it would be cool if iSilo had automatic incrementing counters-- e.g., a date counter built in. This must come up a lot. In fact, it wouldn't even need to be a date counter, it could simply take the date from the current system date/time clock.