Python is famous for being fun, and it is. For a pet project I tried scraping with PyQt. Note in particular how easy it is to traverse the DOM with QWebElement (new in Qt 4.6): pass it a simple CSS2 selector and that's it.

    # These lines will get us the modules we need.
    from PyQt4.QtCore import QUrl, SIGNAL
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage, QWebView

    class Scrape(QApplication):
        def __init__(self):
            # Apparently there are a number of versions of this init and PyQT
            # figures out which you want based on the number of arguments. So pass
            # in one argument but we do not need anything really, so None.
            super(Scrape, self).__init__(None)
            # Create a QWebView instance and store it.
            self.webView = QWebView()
            # Connect our loadFinished method to the loadFinished signal of this new
            # QWebView.
            self.webView.loadFinished.connect(self.loadFinished)

        def load(self, url):
            # In the __init__ we stored a QWebView instance into self.webView so
            # we can load a url into it. It needs a QUrl instance though.
            self.webView.load(QUrl(url))

        def loadFinished(self):
            # We landed here because the load is finished. Now, load the root document
            # element. It'll be a QWebElement instance. QWebElement is a QT4.6
            # addition and it allows easier DOM interaction.
            documentElement = self.webView.page().currentFrame().documentElement()
            # Let's find the search input element.
            inputSearch = documentElement.findFirst('input[title="Google Search"]')
            # Print it out.
            print unicode(inputSearch.toOuterXml())
            # We are inside a QT application and need to terminate that properly.
            self.exit()

    # Instantiate our class.
    myScrape = Scrape()
    # Load the Google homepage.
    myScrape.load('http://google.com/ncr')
    # Start the QT event loop.
    myScrape.exec_()
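
If you want to see what the `input[title="Google Search"]` selector picks out without having Qt installed, here is a rough, Qt-free sketch using only the standard library's HTML parser. It is not how QtWebKit implements `findFirst`; it only mimics the "first `input` whose `title` attribute equals the given value" behaviour for this one selector. The class `FirstInputByTitle`, the helper `find_first_input` and the `SAMPLE_HTML` markup are all made up for illustration.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, which the post targets


class FirstInputByTitle(HTMLParser):
    """Hypothetical helper: remember the first <input> with a given title."""

    def __init__(self, title):
        # Call the base initialiser directly; on Python 2 HTMLParser is an
        # old-style class, so super() would not work there.
        HTMLParser.__init__(self)
        self.title = title
        self.match = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if self.match is not None or tag != 'input':
            return
        attributes = dict(attrs)
        if attributes.get('title') == self.title:
            self.match = attributes


def find_first_input(html, title):
    """Return the attributes of the first matching <input>, or None."""
    parser = FirstInputByTitle(title)
    parser.feed(html)
    parser.close()
    return parser.match


# Invented markup standing in for a downloaded page.
SAMPLE_HTML = (
    '<form>'
    '<input name="btnI" title="Something else">'
    '<input name="q" title="Google Search" type="text">'
    '<input name="q2" title="Google Search">'
    '</form>'
)

first = find_first_input(SAMPLE_HTML, 'Google Search')
print(first['name'])
```

Two inputs carry the matching title here, and only the earlier one in document order is kept, which is the part of `findFirst` this sketch is meant to illustrate. With QWebElement you get a full element back and can keep traversing from it; the sketch only returns a dict of attributes.
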
In subsequent posts I will show how to actually do a search and do something with the elements.