Friday, February 4, 2011

PowerShell; download, parse, extract

Wrote my first PowerShell script and it finally clicked how it all ties together, and why I should spend a bit more time learning it. Generally when I need to do something quick and dirty I’ll fire up VS and create a console project, or write a *.vbs script. Not anymore: I’ve got a bit of love for PowerShell now. You just need to use it to “get it”.

Basically I wanted to download a webpage, parse it using a regex, extract the named groups from each of the matches, and finally output them to a tab-delimited text file. As I could not find a complete script that worked, I’m putting this one up, for others and for personal storage (as I’m likely to forget after the weekend).

# Download the page as UTF-8 text
$wc = New-Object System.Net.WebClient
$wc.Encoding = [System.Text.Encoding]::UTF8
$pagetext = $wc.DownloadString("http://www.cro.ie/ena/forms_s_se.aspx")
# Find every match; $found avoids overwriting the automatic $matches variable
$found = $pagetext | Select-String -Pattern "<strong>(?<form>.*?)</strong>(?<desc>.*?)</p>" -AllMatches | %{ $_.Matches }
# Build one tab-delimited line per match from the named groups
$extractedtext = $found | %{ $_.Groups['form'].Value.Trim() + "`t" + $_.Groups['desc'].Value.Trim() }
Add-Content formtypes.txt $extractedtext

What I learnt:
There is no native Get-Url or Wget cmdlet, hence the .NET WebClient.
% is an alias for ForEach-Object.
$_ is the object being passed in.
`t is a tab, and that ` (the backtick) is very hard to find as it is not ' (the apostrophe). I can't remember ever using that key on a keyboard before.
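
If you want to try the parse and extract steps without hitting the network, here is a minimal sketch that runs the same regex over a made-up HTML string. The $sample text and its form names are assumptions for illustration, not real content from the page, and it calls [regex]::Matches directly rather than going through Select-String, which should expose the same named groups.

```powershell
# Hypothetical sample HTML standing in for the downloaded page (not real page content).
$sample = '<p><strong>A1 </strong> Application form</p><p><strong>B1</strong> Annual return </p>'

# Same pattern as the script: two named groups, "form" and "desc".
$pattern = '<strong>(?<form>.*?)</strong>(?<desc>.*?)</p>'

# [regex]::Matches returns every match; % (ForEach-Object) visits each one as $_.
$rows = [regex]::Matches($sample, $pattern) | % {
    # `t is the tab character that separates the two columns.
    $_.Groups['form'].Value.Trim() + "`t" + $_.Groups['desc'].Value.Trim()
}

# Show the tab-delimited lines on the console.
$rows
```

Passing $rows to Add-Content, as the script does with $extractedtext, would write the same lines to a file.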

Hope it helps and any changes let me know.