didi, I don't know your exact context, but I have exported my data from Wordpress before, so I'm going to guess at it. I'm supposing that you have completed the export into an XML file, "slurped" that into newLISP, and now have an SXML (list) data structure.
But just to back up a moment, say I exported my Wordpress data to a file called wp-export.xml. Then, in newLISP, I would "slurp" that file like this.
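Code:
(define (sxml<-file FILEPATH)
  ;; Suppress the TEXT, CDATA, COMMENT and ELEMENT type tags to get plain SXML.
  (xml-type-tags nil nil nil nil)
  ;; 1, 2, 4: drop whitespace, empty attribute lists and comments; 16: SXML-style attributes.
  (xml-parse (read-file FILEPATH) (+ 1 2 4 16)))

(define *wp-export-db* (sxml<-file "wp-export.xml"))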
I had problems using the 8 flag to xml-parse, so I left it out. The 8 flag makes xml-parse turn the tag names into newLISP symbols, which is normally convenient, but some of the tag names I wanted to use as search criteria have a colon in them, like "wp:status", and newLISP also uses the colon for symbol context qualification. If my recollection is correct, during my testing/debugging some function like match was failing to match on these qualified tag names, so without any further investigation (because I'm lazy like that), I quickly switched to using strings for tag names. To do this, leave the 8 flag off in the call to xml-parse.
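To see what leaving out the 8 flag gives you, here is a minimal sketch that parses a tiny hand-written fragment with the same options as above. The XML string is made up for illustration and is not taken from a real export; the point is only that the tag names, including the colon-qualified one, come back as ordinary strings you can compare against.

```newlisp
;; Hypothetical fragment standing in for a real export.
(set 'tiny-xml "<item><title>Hello</title><wp:status>publish</wp:status></item>")

;; Same setup as sxml<-file: no type tags, options 1 + 2 + 4 + 16, no 8.
(xml-type-tags nil nil nil nil)
(set 'tiny-sxml (xml-parse tiny-xml (+ 1 2 4 16)))

;; Print the parsed structure; tag names should show up as quoted strings.
(println tiny-sxml)

;; Because "wp:status" is just a string, it works as a plain lookup key.
(println (lookup "wp:status" (first tiny-sxml)))
```

Adding 8 to the option sum is the only change needed to get symbols instead, which is where I ran into trouble with the colon.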
Using the ref-all function with match as a helper is a pretty good way to munge the SXML (as bairui and cormullion have mentioned), so I'm going to use ref-all also. I haven't thought about what is a better or the best way to process XML, so you'll hear no unqualified comments from me on that. :) At any rate, here is a helper function I wrote, because I don't like the look of the calling code having match and true at the end of the call to ref-all. That's just me though.
Code:
(define (ref-all-match KEY LST) (ref-all KEY LST match true))
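As a quick illustration of what the helper returns, here is a small sketch run against a hand-written SXML list. The data is hypothetical and much smaller than a real export; it only shows that the pattern finds matching sublists at any nesting depth and returns the elements themselves rather than index vectors.

```newlisp
(define (ref-all-match KEY LST) (ref-all KEY LST match true))

;; Hand-written SXML standing in for a parsed export (hypothetical data).
(set 'tiny-sxml
  '(("rss"
     ("channel"
      ("title" "My Blog")
      ("item" ("title" "First Post") ("wp:status" "publish"))
      ("item" ("title") ("wp:status" "draft"))))))

;; Every list whose first element is "title"; * matches zero or more elements.
(set 'titles (ref-all-match '("title" *) tiny-sxml))
(println titles)
;; → (("title" "My Blog") ("title" "First Post") ("title"))
```

Note that the channel's own title and the empty ("title") element both match, which is worth keeping in mind for what follows.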
So you could write a function to get all the titles from the export, like this.
Code:
(define (wpx-get-titles WPX)
  (map last
       (ref-all-match '("title" *)
                      (ref-all-match '("item" ("title" *) *) WPX))))
This might yield more "titles" than you want. When I run it on my export, I get this.
Code:
> (wpx-get-titles *wp-export-db*)
("Sample Page" "About" "Things" "title" "Home" "Home" "title" "Notes on Clojure Records"
"Unit-Slope" "basis" "Horizontal" "Skating" "tumble" "separate" "simplescene" "simplescene"
"spirograph-demo" "StarFish" "Unit-Slope" "half-slope (1)" "Permutations of a Multiset"
"Ripping Access Databases in Clojure" "Reactive Swing via Observables Pt 1"
"Reactive Swing and Observables Pt 2" "Reactive Swing and Observables Pt 3"
"Reactive Swing and Observables pt 4" "Reactive Swing and Observables Pt 5" "Reactive Swing and Observables Pt 6"
"Fun with Functional Reactive Programming Pt 1" "A Simple Scene Graph in Clojure")
I don't know what you want out of the title extraction, but I was looking for blog post titles. Wordpress uses the "title" tag in a more general sense. For instance, in my output, "Sample Page", "About", "Things" are titles of Wordpress Pages. "Unit-Slope", "basis", ..., "half-slope (1)" are titles of attachments (like imported JPEGs). The others are blog post titles. Notice also that there are two "title" titles (the 4th and 7th elements). Each is a remnant of an empty title element: its SXML is just ("title"), so last hands back the tag name itself. That could be a problem, but in the spirit of laziness, let's move on, shall we? :)
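If the stray "title" entries bother you, one possible guard is to check the length of each title element before taking its last element. This is only a sketch: title-text is a hypothetical helper name, and returning an empty string for an empty title is my assumption about what you would want.

```newlisp
;; Hypothetical helper: the text of a ("title" ...) element,
;; or "" when the element has no text child.
(define (title-text TITLE-ELEMENT)
  (if (> (length TITLE-ELEMENT) 1)
      (last TITLE-ELEMENT)
      ""))

(println (map title-text '(("title" "Home") ("title"))))
;; → ("Home" "")
```

Using title-text in place of last inside wpx-get-titles would be one way to apply it; filtering out the empty strings afterwards is another choice.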
I wanted just the published post titles, so I needed some more munging.
Code:
(define (wpx-get-published-post-titles WPX)
  (map last
       (ref-all-match '("title" *)
                      (ref-all-match '("item" ("title" *) *
                                       ("wp:status" "publish") *
                                       ("wp:post_type" "post") *)
                                     WPX))))
That should pretty much do it. Here's my trial run.
Code:
> (wpx-get-published-post-titles *wp-export-db*)
("Permutations of a Multiset" "Ripping Access Databases in Clojure"
"Reactive Swing via Observables Pt 1" "Reactive Swing and Observables Pt 2"
"Reactive Swing and Observables Pt 3" "Reactive Swing and Observables pt 4"
"Reactive Swing and Observables Pt 5" "Reactive Swing and Observables Pt 6"
"Fun with Functional Reactive Programming Pt 1" "A Simple Scene Graph in Clojure")
The only potential problem I can see with this function definition is its reliance on the order of the "title", "wp:status" and "wp:post_type" elements. In general, that should concern us, but my export was small enough for me to check that these elements are consistently output in the order the pattern expects.
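To make the order dependence concrete, here is a small sketch that runs the same pattern against two hand-written items. Both items are hypothetical; they have the same children and differ only in the order of "wp:status" and "wp:post_type".

```newlisp
;; The same pattern used in wpx-get-published-post-titles.
(set 'pattern '("item" ("title" *) *
                ("wp:status" "publish") *
                ("wp:post_type" "post") *))

;; Hypothetical items: same children, different order.
(set 'in-order  '("item" ("title" "A") ("wp:status" "publish") ("wp:post_type" "post")))
(set 'reordered '("item" ("title" "A") ("wp:post_type" "post") ("wp:status" "publish")))

;; match returns a list of wildcard bindings on success and nil on failure.
(println (true? (match pattern in-order)))   ; → true
(println (true? (match pattern reordered)))  ; → nil
```

So an export that emitted these elements in a different order would silently drop posts rather than raise an error.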
As extra credit, the following is some code that also worked for me to get the published post titles. This set of functions doesn't rely on the order of the sub-elements (like "wp:status" and "wp:post_type") that are used for filtering. And I like the "small building blocks" design -- I like Legos too. :)
Code:
(define (wpx-extract-items WPX)
  (ref-all-match '("item" *) WPX))

;; Keep only the lists in LST that contain M as an element, at any position.
(define (mfilter M LST)
  (filter (curry member M) LST))

(define (wpx-get-post-items WPX)
  (mfilter '("wp:post_type" "post") (wpx-extract-items WPX)))

(define (wpx-get-published-post-items WPX)
  (mfilter '("wp:status" "publish") (wpx-get-post-items WPX)))

(define (wpx-get-post-titles WPX)
  (map (curry lookup "title") (wpx-get-post-items WPX)))

;; Note: this replaces the earlier pattern-based definition of the same name.
(define (wpx-get-published-post-titles WPX)
  (map (curry lookup "title") (wpx-get-published-post-items WPX)))
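Here is a small sketch of the two curried building blocks on their own, using a hand-written item whose filter elements are deliberately in the "wrong" order for the earlier pattern. The item data and the published? name are hypothetical, chosen just for this illustration.

```newlisp
;; Hypothetical item with "wp:post_type" before "wp:status".
(set 'item '("item" ("title" "First Post")
             ("wp:post_type" "post")
             ("wp:status" "publish")))

;; (curry member M) builds a one-argument test: is M an element of the list?
(set 'published? (curry member '("wp:status" "publish")))
(println (true? (published? item)))  ; → true
(println (true? (published? '("item" ("title" "Draft") ("wp:status" "draft")))))  ; → nil

;; (curry lookup "title") pulls the value paired with "title" out of an item.
(println ((curry lookup "title") item))  ; → First Post
```

Because member only asks whether the element is present, the position of "wp:status" inside the item no longer matters.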
Perhaps other people will share the way they like to process (S)XML. I'm very curious. Thanks!
P.S. -- Nice find on the wikibooks, cormullion!