No-Nonsense
04-12-2006, 01:52 PM
I would really like to be able to modify the HTML source of the pages that iSiloX converts, using regular expressions. At the moment I work around this by using WebGrab (version 1.2 Beta1) as a kind of proxy to achieve it. I have also thought about using Proxomitron for this.
Here is an example of what is possible with WebGrab:
# mathäser #####################################################################
$wg['loc'][6]['map_remote'] = 'www.mathaeser.de';
$wg['loc'][6]['map_local'] = 'http://'.$wg['server_name'].'/webgrab/';
$wg['loc'][6]['search_block_start'][0] = '<script';
$wg['loc'][6]['search_block_stop'] [0] = '</script>';
$wg['loc'][6]['replace_block_line'][0] = '<!-- Webgrab: Removed Script -->';
$wg['loc'][6]['search_block_start'][1] = '<link rel="stylesheet"';
$wg['loc'][6]['search_block_stop'] [1] = '>';
$wg['loc'][6]['replace_block_line'][1] = '<!-- Webgrab: Removed Stylesheet -->';
$wg['loc'][6]['regexp']['search'] [0] = '/<body.*<td width="650" height="300" valign="top" bgcolor="#FFFFFF">(.*)<table width="770" border="0" cellpadding="0" cellspacing="0">.*<\/body>/is';
$wg['loc'][6]['regexp']['replace'][0] = '<body>\\1</body>';
// <a class="schwarz" href="javascript:;" onClick="window.open('http://www.kinopolis.de/filme/filminfo_mathaeser.dhtml?filmoid=21876','_blank',' status,width=700,height=600,scrollbars=yes'); return false">
$wg['loc'][6]['regexp']['search'] [1] = '/<a(.*)href="javascript:;"(.*)onClick="window.open\(\'(.*)\'.*"(.*)>/isU';
$wg['loc'][6]['regexp']['replace'][1] = '<a\\1href="\\3"\\2\\4>';
// <a href="javascript:;" onClick="MyWindow=window.open('images/die_kinos/kino_01.htm','kino','toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=no,resizable=no,width=400,height=272'); MyWindow.focus(); return false;">
$wg['loc'][6]['regexp']['search'] [2] = '/<a(.*)href="javascript:;"(.*)onClick="MyWindow=window.open\(\'(.*)\'.*"(.*)>/isU';
$wg['loc'][6]['regexp']['replace'][2] = '<a\\1href="http://www.mathaeser.de/filmpalast/service/\\3"\\2\\4>';
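To make the effect of the two link rules easier to see, here is a minimal standalone sketch that runs them through PHP's preg_replace() on the two sample anchors quoted in the comments above. The helper apply_regexp_rules() is a hypothetical name of my own and is not part of WebGrab; I am only assuming that WebGrab applies each search/replace pair in order in roughly this way.

```php
<?php
// Hypothetical helper (not WebGrab code): apply each search/replace pair in order.
function apply_regexp_rules(string $html, array $search, array $replace): string
{
    foreach ($search as $i => $pattern) {
        $result = preg_replace($pattern, $replace[$i], $html);
        if ($result === null) {
            throw new RuntimeException("Regex error in rule $i");
        }
        $html = $result;
    }
    return $html;
}

// Rules [1] and [2] from the configuration above. The U modifier makes every
// quantifier ungreedy, so each (.*) stops at the first possible position.
$search = [
    '/<a(.*)href="javascript:;"(.*)onClick="window.open\(\'(.*)\'.*"(.*)>/isU',
    '/<a(.*)href="javascript:;"(.*)onClick="MyWindow=window.open\(\'(.*)\'.*"(.*)>/isU',
];
$replace = [
    '<a\\1href="\\3"\\2\\4>',
    '<a\\1href="http://www.mathaeser.de/filmpalast/service/\\3"\\2\\4>',
];

// Sample anchors taken from the comments in the configuration.
$popup = <<<'HTML'
<a class="schwarz" href="javascript:;" onClick="window.open('http://www.kinopolis.de/filme/filminfo_mathaeser.dhtml?filmoid=21876','_blank',' status,width=700,height=600,scrollbars=yes'); return false">
HTML;

$relative = <<<'HTML'
<a href="javascript:;" onClick="MyWindow=window.open('images/die_kinos/kino_01.htm','kino','toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=no,resizable=no,width=400,height=272'); MyWindow.focus(); return false;">
HTML;

echo apply_regexp_rules($popup, $search, $replace), "\n";
// <a class="schwarz" href="http://www.kinopolis.de/filme/filminfo_mathaeser.dhtml?filmoid=21876" >

echo apply_regexp_rules($relative, $search, $replace), "\n";
// <a href="http://www.mathaeser.de/filmpalast/service/images/die_kinos/kino_01.htm" >
```

The JavaScript popup handlers are dropped and the target URL becomes a plain href, so a converter that does not execute JavaScript can follow the links. The third capture group is the first argument of window.open(); rule [2] prefixes it with a fixed base URL because those links are relative.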
Unmodified page: mathäser Kinoprogramm (http://www.mathaeser.de/filmpalast/programm/programm.php)
Modified page using WebGrab with the configuration above: mathäser Kinoprogramm (WebGrab) (http://www.no-nonsense.de/webgrab/wg.php?webgrab_path=http://www.mathaeser.de/filmpalast/programm/programm.php)
The problem with WebGrab is that it is a PHP script that fetches the requested document itself, which forces me to allow offsite links (e.g. for images). This makes it harder to define what should be included and what should not. It also slows down the conversion a lot.
If implementing this with a RegExp library would take too much time, integrating Lua (http://www.lua.org/; a very popular, free scripting language, used in World of Warcraft, Girder, ...) might be an even better solution.
-Jens