Scraping phpBB with xpath php and nice css ipage

darknkreepy3#
Site Admin
Posts: 247
Joined: Tue Oct 27, 2009 9:33 pm

Post by darknkreepy3# »

So, it's been fun playing with the DOM and XPath using PHP and cURL. I thought I could do some fun stuff locally against my iPage phpBB and list all the cool things I have written over 10 years, just in case the database format goes crazy, or in case I one day want a record or a simple lookup of the topics I have written.

Well, now you can do it to your own simple phpBB if you keep it in the [phpBB3] folder in your root, host it on, say, iPage, and change a few things I have noted, and curl will work remotely. This pulls your main topics, then reads each topic area and pulls all of its sub-entries. So, yes, it is only 2 levels deep, but you'll learn how to do it here.

files here
_files/phpbb_scraper_sc.rar

my setup from home
Windows 10 x64
Apache 2.4.33 x64 VC15 (apache2_4_33x64-vc15)
PHP 7.2.11 x64 VC15 (php7211x64vc15)
-curl enabled (on Windows you turn it on by removing the ; from ;extension=curl in php.ini; there is no .dll on that line as of PHP 7.2)
-allow_url_fopen=on (in your php.ini)
-remember to add php and php's /bin directory to your Windows PATH environment variable if you haven't yet

my setup on ipage (01/12/2021)
PHP 7.2.x
-allow_url_fopen=on (in your php.ini)
-phpBB lives in /phpBB3/ under my site root. You can change the $base variable in my code if your path is different

****do NOT remove the ; from the module lines in iPage's php.ini. They don't change anything; run your own phpinfo() test to see.
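
To check the settings listed above from PHP itself instead of reading php.ini by hand, something like this minimal sketch could help. The function name checkFetchSupport() is made up for this example and is not part of the downloadable files. It only reports what the running PHP says and does not change any settings.

```php
<?php
// Hypothetical helper (not in the original files): report whether this PHP
// install has what the scraper pages rely on.
function checkFetchSupport()
{
    return array(
        // file_get_contents() on http:// urls needs this php.ini flag turned on
        'allow_url_fopen' => filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN),
        // only needed if you fetch with the curl_* functions
        'curl' => extension_loaded('curl'),
        // DOMDocument and DOMXPath come from the dom extension
        'dom' => extension_loaded('dom'),
    );
}

foreach (checkFetchSupport() as $name => $ok) {
    echo $name . ': ' . ($ok ? 'yes' : 'no') . "\n";
}
```

Run it once at home and once on the host, and compare the two lists.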

So, I chopped up some online tutorials and came up with this very long, super basic approach: in the Chrome browser, right-click the title of a topic in my phpBB, use the Chrome plugin called SCRAPER, and choose [scrape similar...]. Then I read about XPath, and it gave me a basic pull. There were two columns, the link text and the link URL, and in PHP the URL is read from the [a href] attribute, which I figured out from what Scraper showed as a row in its output.

I also saved the process (the XPath) and named it "topics" just in case I forgot how to do this in Chrome inside Scraper's interface. Very nice feature from them. I then looked at a topic page, right-clicked an entry title, saw that the XPath was different (because the DOM is), and saved that Scraper search as well.

main page topic xpath

Code: Select all

'//div[1]/div/ul[2]/li/dl/dt/div/a'
topic sub-page listings xpath

Code: Select all

'//dt/div/a'
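
To see what the shorter expression does without hitting a live forum, here is a small sketch that runs it against a made-up HTML string. The sample markup is an assumption standing in for one topic row; the real markup depends on your phpBB theme, so treat this as an illustration only.

```php
<?php
// Made-up stand-in for one phpBB topic row (real markup depends on the theme).
$sampleHtml = '<html><body><ul><li><dl><dt><div>'
    . '<a href="./viewtopic.php?f=3&amp;t=42">My first topic</a>'
    . '</div></dt></dl></li></ul></body></html>';

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($sampleHtml);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// every <a> that is a child of a <div> that is a child of a <dt>, anywhere in the page
$links = $xpath->query('//dt/div/a');

$found = array();
foreach ($links as $a) {
    $found[] = array(
        'text' => trim($a->nodeValue),        // the visible link text
        'href' => $a->getAttribute('href'),   // the link target, entities already decoded
    );
}
print_r($found);
```

The two array keys match the two columns Scraper shows: the link text and the href.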
Then I tried just the first page, took the scraped data in, and formatted it with fancy CSS and fun string echo outputs. You can do whatever you want and build much better programming structures than my inline, one-function stuff. Enjoy. I am posting the code per page here so Google can scrape my tutorial on... scraping :)

test_board_index.php

Code: Select all

<?php
	/*
	test_board_index.php v1.0 by kristoffe brodeur. ©2021 All Rights Reserved.
	
	01-11-2021
		simple program to scrape the board indexes
		of the main area of a phpbb with [Lucid Lime] theme
	*/
	
	$to_root="./";
	$base="http://www.supercala.net/phpBB3/";//allows us to use the scraped data to make links to the forum page areas
	$url=$base."index.php";
	$html=file_get_contents($url);
	//echo "scraping away v1.0 by kristoffe<hr />";
	
	$phpbbMain_doc=new DomDocument();
	
	libxml_use_internal_errors(TRUE);//disable libxml errors
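	$bbStr="";
	$topMenuStr="";
	//set the counters here too so the html at the bottom still has values if the page load fails
	$lenT=0;
	$lenSS=0;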
	
	//
	if(!empty($html))
		{
		//echo "cool we got the website as data<hr />";
		
		$phpbbMain_doc->loadHTML($html);
		libxml_clear_errors();//remove any of the weird html errors etc we don't need
		
		$phpbbMain_xpath=new DOMXPath($phpbbMain_doc);		
		$phpbbMain_topics=$phpbbMain_xpath->query('//div[1]/div/ul[2]/li/dl/dt/div/a');
		$lenT=0;
		$lenSS=0;
			
		//
		if($phpbbMain_topics->length >0)
			{
			//echo "[let's look into an object node for it's parts now]<br/>";
			$lenT=$phpbbMain_topics->length;//->length works everywhere, count() on a DOMNodeList needs php 7.2+
			$pagePos=-1;
			
			//
			foreach($phpbbMain_topics as $row)
				{
				$pagePos++;
				$a_textCol=$row->nodeValue;
				$a_hrefCol=$row->getAttribute("href");
				
				//get rid of the leading ./ in the return ./viewforum.php?f= etc so $base (which ends in /) doesn't make a double slash
				$aStr=ltrim($a_hrefCol,"./");
				
				/*
				use the $base not the $url so it starts from the phpbb forum folder root as the ^ link goes to on the page(s)
				pretend we're at the scraped url and going to the link as if we were locally there (./)
				*/
				$a_hrefLink=$base.$aStr;
				
				//now let's get the variables (queries) from both ? and & in the href string phpbb outputs to each area
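				$queryArr=parse_url($a_hrefCol);
				//parse_url returns false on a badly broken url, so make sure we have an array to loop
				if(!is_array($queryArr)){$queryArr=array();}
				//reset once per topic, not once per url part, so a later part (like a #fragment) can't wipe the result
				$subPageStr="";
				//
				foreach($queryArr as $key=>$val)
					{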
					
					//echo "[$key][$val]<br />";
					//
					if($key=="query")
						{
						$tmpQListArr=array();
						parse_str($val,$tmpQListArr);
						//echo count($tmpQListArr)."<br />";
						$forumNode=-1;
						//
						foreach($tmpQListArr as $key2=>$val2)
							{
							//
							switch($key2)
								{
								case("f"):
									$forumNode=$val2;
									//echo "[forum id]$forumNode<br />";	
									break;
								case("sid"):
									break;
								default:
									break;
								}
							}
						//
						if($forumNode!=-1)
							{
							/*
							
							test with $pagePos==0 to save time and debug if one loop of doSubArea(#) will work
							
							test with $pagePos<$lenT for all of them. 
							it might be better to call this with javascript and populate areas with a 1 second wait
							instead of a really long pause. maybe
								html->
									jquery php that does this with a result->
									add to area on screen per main area with sub-areas via DOM js->
									wait 1s->
									loop->
							*/
							
							//
							if($pagePos<$lenT)
								{
								$subPageStr=doSubAreaPage($forumNode);
								//echo "!!!!! [$subPageStr] !!!!! <br />";
								}
							}
						}
					/*
					now the forum areas have titles and links to all of their listed nodes too
					time to loop, but with a different target per loop: the xpath found with [scraper] in chrome (I'm learning)
					*/
					
					}
				$topMenuStr.="
					<div class='bb_areaButton'><a href='#area".$pagePos."'>".$a_textCol."</a></div>
				";
				
				$bbStr.="
					<div class='bb_page' id='area".$pagePos."'>
						<div class='dataBox_m'>
							$a_textCol
						</div>
						<div class='dataBox_l'>
							<a href='$a_hrefLink'>$a_hrefCol</a>
						</div>
						<div class='clearBoth'></div>
							$subPageStr
					</div>
					";
				}
			}
			
		}
	//
	function doSubAreaPage($sentArea)
		{
		global $base;
		global $lenSS;
		
		$sub_url=$base."viewforum.php?f=".$sentArea;
		//echo "$sub_url<br />";
		$sub_html=file_get_contents($sub_url);
		$phpbbSub_doc=new DomDocument();
		
		//maybe only declare this once, not each time per php page (done in page root)
		//libxml_use_internal_errors(TRUE);
		
		$sub_bbStr="";
		//
		if(!empty($sub_html))
			{
			$phpbbSub_doc->loadHTML($sub_html);
			libxml_clear_errors();
			$phpbbSub_xpath=new DOMXPath($phpbbSub_doc);
			/*
			new query, the template puts the areas in a different level of the DOM structure
			found with scraper by just right clicking the title of the sub area entry and choosing [scrape similar...]
			lazy, but it's a lot to learn at once
			*/
			$phpbbSub_topics=$phpbbSub_xpath->query('//dt/div/a');
			//echo "subTopic areas found with xpath [".count($phpbbSub_topics)."]<br />";
			
			$lenST=0;
				
			//
			if($phpbbSub_topics->length >0)
				{
				//echo "sub topic page[$sentArea] so far so good!<br />";
				$pagePos=-1;
				
				//
				foreach($phpbbSub_topics as $row)
					{
					$pagePos++;
					$a_textCol=$row->nodeValue;
					$a_hrefCol=$row->getAttribute("href");
					$aStr=ltrim($a_hrefCol,"./");//drop the leading ./ so $base (which ends in /) doesn't make a double slash
					$a_hrefLink=$GLOBALS['base'].$aStr;
					
					$lenSS++;
					
					//
					$sub_bbStr.="
						<div class='bb_page_sub'>
							<div class='dataBox_m'>
								$a_textCol
							</div>
							<div class='dataBox_l'>
								<a href='$a_hrefLink'>$a_hrefCol</a>
							</div>
							<div class='clearBoth'></div>
						</div>
						";
					}
				}					
			}
		//
		else
			{
			$sub_bbStr="page load error<br />";
			}

		return $sub_bbStr;
		}
		
	//
	function showRows()
		{
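		global $phpbbMain_topics;//defined in the page root, so it has to be pulled into the function's scope
		$lenT=count($phpbbMain_topics);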
		//echo "wow. I found [$lenT] rows of topics! thanks to php xpath and chrome with the 'scraper' plugin<hr />";
		//
		foreach($phpbbMain_topics as $row)
			{
			//echo "<div class=''>".$row->nodeValue."</div>";
			}
		}
	//echo "<hr />finished<hr />";
?>

<html>
	<head>
		<link type="text/css" rel="stylesheet" href="<?php echo $to_root;?>css/page.css" />
	</head>
	<body>
		<div class='dataBox_m'>Forum Main Sections</div>
		<div class='dataBox_l'><?php echo $lenT;?></div>
		<div class="clearBoth"></div>
		
		<div class='dataBox_m'>Total Postings In All Main Areas</div>
		<div class='dataBox_l'><?php echo $lenSS;?></div>
		<div class="clearBoth"></div>
				
		<div class="topMenuButtons">
			<?php echo $topMenuStr;?>
		</div>
		<div class="clearBoth"></div>
		
		<?php echo $bbStr;?>
	</body>
</html>
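
The href handling in the page above (turn the scraped ./viewforum.php?f=... into a full link, then pull the forum id out of the query string) can also be written as two small functions. This is only a sketch: the names absoluteForumLink() and forumIdFromHref() and the example.com base URL are placeholders made up for this example, not part of the posted files.

```php
<?php
// Hypothetical helper: join the forum base url and a scraped relative href.
function absoluteForumLink($base, $href)
{
    // "./viewforum.php?f=3" -> "viewforum.php?f=3"
    $relative = preg_replace('#^\./#', '', $href);
    // trim both sides so the join always has exactly one slash
    return rtrim($base, '/') . '/' . ltrim($relative, '/');
}

// Hypothetical helper: return the f= value as an int, or null if it is missing.
function forumIdFromHref($href)
{
    $query = parse_url($href, PHP_URL_QUERY);   // null when there is no ?query part
    if (!is_string($query)) {
        return null;
    }
    $params = array();
    parse_str($query, $params);                 // "f=3&sid=abc" -> array('f'=>'3','sid'=>'abc')
    if (isset($params['f']) && is_string($params['f']) && ctype_digit($params['f'])) {
        return (int)$params['f'];
    }
    return null;
}

$base = 'http://www.example.com/phpBB3/';       // placeholder base url
$href = './viewforum.php?f=3&sid=abc123';

echo absoluteForumLink($base, $href) . "\n";
var_dump(forumIdFromHref($href));
```

Asking parse_url() for just the query part skips the loop over every url part, and returning null for a missing id replaces the -1 marker used in the page above.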
css/page.css

Code: Select all

.dataBox_m,.dataBox_l
	{
	float:left;
	padding:4px;
	}
.clearBoth
	{
	clear:both;
	}
.dataBox_m
	{
	width:600px;
	}
.bb_page,.bb_page_sub
	{
	width:100%;
	}		
.bb_page_sub
	{
	padding:0px 0px 0px 72px;
	background-color:#DDFFDD;
	border:solid;
	border-color:#55FF55;
	border-width:0px 0px 2px 0px;
	}
.bb_page_sub:hover
	{
	background-color:#AAFFAA;
	}
.bb_areaButton
	{
	padding:8px;
	margin:2px;
	background-color:#00CC00;
	color:#FFFFFF;
	float:left;
	}
ipage_curl_test.php

Code: Select all

<?php
	$to_root="./";
?>

<html>
	<head>
		<link type="text/css" rel="stylesheet" href="<?php echo $to_root;?>css/ipage_curl_test.css" />
	</head>
	
	<body>
		seems that ipage has
		<br />
		<b>allow_url_fopen=Off</b>
		<br />
		<br />
		by default, so change it in the php.ini on the ipage shared server control panel to
		<br />
		<b>allow_url_fopen=On</b>
		<hr />
		
		<div class='story_half'><img src="ipage_curl_step1.jpg" /></div>
		<div class='story_half'><img src="ipage_curl_step2.jpg" /></div>
		<div class='clearBoth'></div>
		
		<div class='story_half'><img src="ipage_curl_step3.jpg" /></div>
		<div class='story_half'><img src="ipage_curl_step4.jpg" /></div>
		<div class='clearBoth'></div>
		
		<?php
			/*
			ipage_curl_test.php v1.0 by kristoffe brodeur. ©2021 All Rights Reserved.
			
			01-11-2021
				simple program to scrape the board indexes
				of the main area of a phpbb with [Lucid Lime] theme
				
			seems that ipage has
				allow_url_fopen=Off
				
			by default, so change it in the php.ini on the ipage shared server control panel to
				allow_url_fopen=On
			
			*/
			echo "testing curl on ipage<br />";
			$base="http://www.supercala.net/phpBB3/";//allows us to use the scraped data to make links to the forum page areas
			$url=$base."index.php";
			$html=file_get_contents($url);
			echo "scraping away v1.0 by kristoffe<hr />";
			
			$phpbbMain_doc=new DomDocument();
			
			libxml_use_internal_errors(TRUE);//disable libxml errors
			$bbStr="";
			
			//
			if(!empty($html))
				{
				echo "cool we got the website as data<hr />";
				}
		?>

	</body>
</html>
css/ipage_curl_test.css

Code: Select all

.clearBoth
	{
	clear:both;
	}
.story_half
	{
	width:47%;
	float:left;
	border:solid;
	border-width:2px;
	margin:4px;
	border-color:#AAFFAA;

	}
.story_half img,.story_half a img
	{
	width:100%;
	}
@media screen and (min-width:400px) and (max-width:1000px)
	{
	.story_half
		{
		width:100% !important;
		}
	}
is_curl_installed_in_this_server.php

Code: Select all

<?php

// Script to test if the CURL extension is installed on this server

// Define function to test
function _is_curl_installed() {
    // in_array() already gives back true or false, so return it directly
    return in_array('curl', get_loaded_extensions());
}

// Output text to user based on test
if (_is_curl_installed()) {
  echo "cURL is <span style=\"color:blue\">installed</span> on this server";
} else {
  echo "cURL is NOT <span style=\"color:red\">installed</span> on this server";
}
?>
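
Note that the pages above fetch with file_get_contents(), which is why allow_url_fopen matters. If the check says cURL is installed and you would rather fetch with the curl_* functions, a minimal sketch could look like this. fetchPage() is a made-up name for this example, and the timeout and redirect settings are assumptions, not something the posted files use.

```php
<?php
// Hypothetical helper: fetch a page with cURL.
// Returns the body as a string, or null on any failure.
function fetchPage($url, $timeoutSeconds = 10)
{
    if (!extension_loaded('curl')) {
        return null;                      // caller can fall back to file_get_contents()
    }
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,   // hand the body back instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ));
    $body = curl_exec($ch);               // false on connection errors
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // the handle is released when $ch goes out of scope at the end of the function

    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}

// usage idea: $html = fetchPage($base . "index.php");
```

Since it returns null instead of false on failure, the existing if(!empty($html)) checks would still work with it.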
steps for ipage php setup in image form (jpgs)
http://www.supercala.net/sites/phpbb_sc ... _step1.jpg
http://www.supercala.net/sites/phpbb_sc ... _step2.jpg
http://www.supercala.net/sites/phpbb_sc ... _step3.jpg
http://www.supercala.net/sites/phpbb_sc ... _step4.jpg