Ability to modify HTML source using regular expressions


No-Nonsense
04-12-2006, 01:52 PM
I would really like to be able to modify the HTML source of the pages that iSiloX converts, using regular expressions. At the moment I work around this by using WebGrab (Version 1.2 Beta1) as a kind of proxy to achieve it. I have also thought about using Proxomitron for this.

An example of what is possible with WebGrab:

# mathäser #####################################################################
$wg['loc'][6]['map_remote'] = 'www.mathaeser.de';
$wg['loc'][6]['map_local'] = 'http://'.$wg['server_name'].'/webgrab/';

$wg['loc'][6]['search_block_start'][0] = '<script';
$wg['loc'][6]['search_block_stop'] [0] = '</script>';
$wg['loc'][6]['replace_block_line'][0] = '<!-- Webgrab: Removed Script -->';

$wg['loc'][6]['search_block_start'][1] = '<link rel="stylesheet"';
$wg['loc'][6]['search_block_stop'] [1] = '>';
$wg['loc'][6]['replace_block_line'][1] = '<!-- Webgrab: Removed Stylesheet -->';

$wg['loc'][6]['regexp']['search'] [0] = '/<body.*<td width="650" height="300" valign="top" bgcolor="#FFFFFF">(.*)<table width="770" border="0" cellpadding="0" cellspacing="0">.*<\/body>/is';
$wg['loc'][6]['regexp']['replace'][0] = '<body>\\1</body>';

// <a class="schwarz" href="javascript:;" onClick="window.open('http://www.kinopolis.de/filme/filminfo_mathaeser.dhtml?filmoid=21876','_blank','status,width=700,height=600,scrollbars=yes'); return false">
$wg['loc'][6]['regexp']['search'] [1] = '/<a(.*)href="javascript:;"(.*)onClick="window.open\(\'(.*)\'.*"(.*)>/isU';
$wg['loc'][6]['regexp']['replace'][1] = '<a\\1href="\\3"\\2\\4>';

// <a href="javascript:;" onClick="MyWindow=window.open('images/die_kinos/kino_01.htm','kino','toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=no,resizable=no,width=400,height=272'); MyWindow.focus(); return false;">
$wg['loc'][6]['regexp']['search'] [2] = '/<a(.*)href="javascript:;"(.*)onClick="MyWindow=window.open\(\'(.*)\'.*"(.*)>/isU';
$wg['loc'][6]['regexp']['replace'][2] = '<a\\1href="http://www.mathaeser.de/filmpalast/service/\\3"\\2\\4>';
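
For readers who do not use WebGrab, here is a minimal stand-alone sketch of what the rule at regexp index 1 above does, written in Python rather than WebGrab's PHP. It is a translation made under assumptions, not WebGrab's actual code: the function name `unwrap_popup_links` is made up for illustration, and because Python has no equivalent of PCRE's `/U` (ungreedy) flag, each `.*` is written as `.*?`.

```python
import re

# Translation of the WebGrab rule with index 1 (an assumption-laden sketch).
# PCRE's /U flag makes every quantifier lazy; in Python we spell that ".*?".
# The dot in "window.open" is escaped here, which the original rule did not do.
JS_POPUP_LINK = re.compile(
    r'<a(.*?)href="javascript:;"(.*?)onClick="window\.open\(\'(.*?)\'.*?"(.*?)>',
    re.IGNORECASE | re.DOTALL,  # same meaning as the /i and /s modifiers
)


def unwrap_popup_links(html: str) -> str:
    """Turn javascript popup anchors into ordinary links to the popup URL."""
    # \1, \2 and \4 keep the other attributes; \3 is the URL given to window.open
    return JS_POPUP_LINK.sub(r'<a\1href="\3"\2\4>', html)


# The sample anchor quoted in the configuration comment above.
sample = (
    '<a class="schwarz" href="javascript:;" '
    "onClick=\"window.open('http://www.kinopolis.de/filme/"
    "filminfo_mathaeser.dhtml?filmoid=21876','_blank',"
    "'status,width=700,height=600,scrollbars=yes'); return false\">"
)

if __name__ == "__main__":
    print(unwrap_popup_links(sample))
```

Anchors that do not use the `javascript:;` popup idiom are left untouched, so the rule can be run over a whole page.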


Unmodified page: mathäser Kinoprogramm (http://www.mathaeser.de/filmpalast/programm/programm.php)
Modified page using WebGrab with the configuration above: mathäser Kinoprogramm (WebGrab) (http://www.no-nonsense.de/webgrab/wg.php?webgrab_path=http://www.mathaeser.de/filmpalast/programm/programm.php)

The problem with WebGrab is that it is a PHP script that fetches the requested document itself, which forces me to allow offsite links (e.g. for images). This makes it harder to define what should be included and what should not. It also slows down the conversion a lot.

If implementing this with a RegExp library would take too much time, integrating Lua (http://www.lua.org/; a very popular scripting language, free, used in World of Warcraft, Girder, ...) might be an even better solution.

-Jens

testtest
11-02-2006, 07:45 AM
I tried to do what you suggested in a different way, using RIP (http://rip.mozdev.org/), a Firefox extension that lets you point at any item you can select in a web page and remove it permanently. It provides a flexible and easily configurable way to remove unwanted content.

Then I tried to download a website with offline browsers and spiders, but the extension only changes how the page is displayed in the browser, not what gets downloaded. Anyway, it would be a great alternative for making complex sites "isilable".

rgee
02-19-2007, 03:01 AM
A very simple option to just remove sections of text based on a regex would be really useful, for example to strip repeated headers, footers and sidebars from sites.
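
As a rough illustration of how small such an option could be, here is a minimal Python sketch. Everything in it is an assumption for the sake of the example: the name `strip_sections`, the rule list, and the `header` and `footer` div ids are hypothetical and do not describe any existing iSiloX feature.

```python
import re

# Hypothetical removal rules: each pattern marks a block to drop entirely.
# The div ids are invented for illustration; real sites will differ.
REMOVE_PATTERNS = [
    r'<script\b.*?</script>',
    r'<div id="header">.*?</div>',
    r'<div id="footer">.*?</div>',
]


def strip_sections(html, patterns=REMOVE_PATTERNS):
    """Delete every match of every pattern from the HTML source."""
    for pattern in patterns:
        # IGNORECASE and DOTALL let a rule match mixed-case tags and
        # blocks that span several lines.
        html = re.sub(pattern, '', html, flags=re.IGNORECASE | re.DOTALL)
    return html


if __name__ == "__main__":
    page = (
        '<body><div id="header">Logo</div><p>Text</p>'
        '<script>x()</script><div id="footer">(c)</div></body>'
    )
    print(strip_sections(page))
```

One limitation worth keeping in mind: a lazy `.*?` stops at the first closing tag it finds, so a rule like this cuts too little when the block it targets contains nested tags of the same name.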