Searching in lists

didi · Post by **didi** » Wed Jan 09, 2013 6:16 am

I want to import data directly from WordPress into my ZZBlogX. I can export the data as XML files. I've parsed the export with xml-parse and filtered out the unnecessary parts. Now my problem is searching through the result: for example, I want to find "title" and then take the next element, which is the actual title, to build a list of titles.

Code: Select all

( do-while (find "title" outlist)
  ( set 'i  ( find "title" outlist ))
  ( set 'outlist ( i outlist))
  ( push (pop outlist) title-list )
)

That doesn't work. Would it be better to work with slice, e.g. get the index of the element and make a new list with slice?

bairui · Post by **bairui** » Wed Jan 09, 2013 7:45 am

Without seeing your data, I can only guess at the structure. As such, this code will almost certainly fail:

Code: Select all

(map rest (ref-all '("title" *) x match true))

Hopefully the intent survives though and you're able to find a built-in search function to suit your needs. There are about three hundred of them by my last count, once you factor in the myriad switches and levers and moon phases that affect their operation.

Ok... it is perhaps a bit persnickety of me to blame my tools for my own inability to learn/memorise the many different search functions in newLISP. I need to find a way to better absorb and retain this knowledge. :-/

cormullion · Post by **cormullion** » Wed Jan 09, 2013 8:37 am

Without seeing your file, it's not easy to suggest anything precisely, but for general tips on importing XML, see if any of this Wikibooks chapter is useful.

cormullion · Post by **cormullion** » Wed Jan 09, 2013 8:39 am

Moon phases ... :)

didi · Post by **didi** » Wed Jan 09, 2013 9:19 pm

The wikibook shows exactly what I'm looking for.
I'll try to adapt that for WordPress. I'll be back after everything works.

This works now, too, but it's not elegant:

Code: Select all

( do-while (find "title" outlist)
  ( set 'i  ( find "title" outlist ))
  ( set 'outlist ( i outlist))
  ( pop outlist )
  ( push (pop outlist) title-list ))

(outlist is the cleaned and flattened parsed XML; title-list is the list of titles I was looking for.)
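
The index-and-slice idea from the first post can be sketched as follows. This is only an illustration under assumptions: the sample list `flat` is made up, and `rest-list` is a hypothetical working copy so that the original list stays intact. It uses `while` rather than `do-while`, so nothing is indexed when no "title" is present.

```newlisp
;; Made-up stand-in for the cleaned, flattened parse result.
(set 'flat '("item" "title" "First post" "link" "http://example.com/1"
             "item" "title" "Second post" "link" "http://example.com/2"))

(set 'title-list '())
(set 'rest-list flat)   ; hypothetical working copy

;; find returns the index of the first "title", or nil when none is left.
;; Index 0 counts as true in newLISP; only nil and () are false.
(while (set 'i (find "title" rest-list))
  ;; the element right after the "title" marker is the title text
  (push (nth (+ i 1) rest-list) title-list -1)
  ;; continue searching behind the element just collected
  (set 'rest-list (slice rest-list (+ i 2))))

(println title-list)   ; ("First post" "Second post")
```

Pushing with index -1 appends, so the titles keep their document order.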

rickyboy · Post by **rickyboy** » Wed Jan 09, 2013 10:27 pm

didi, I don't know exactly your context, but I have exported my data from Wordpress before, so I am going to guess your context. I'm supposing now that you have completed the export into an XML file and that you have "slurped" that into newLISP and it is now an SXML (list) data structure.

But just to back up a moment, say I exported my Wordpress to a file called wp-export.xml. Then, in newLISP, I would "slurp" that file like this.

Code: Select all

(define (sxml<-file FILEPATH)
  (xml-type-tags nil nil nil nil)
  (xml-parse (read-file FILEPATH) (+ 1 2 4 16)))

(define *wp-export-db* (sxml<-file "wp-export.xml"))

I had problems using the 8 flag to xml-parse, so I left it out. The 8 flag makes xml-parse turn the tag names into newLISP symbols, which normally are convenient, but there are some tag names I wanted to use as search criteria which had a colon in the name, like "wp:status". newLISP uses the colon too for symbol context qualification. If my recollection is correct, during my testing/debugging, some function like match was failing to match on these qualified tag names, so without any further investigation (because I'm lazy like that), I quickly switched to using strings for tag names. To do this, leave the 8 flag off in the call to xml-parse.
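
As a small illustration of those option numbers, here is a sketch on a made-up one-item document (the XML string is an assumption, not taken from a real export). According to the newLISP manual, 1 suppresses whitespace-only text, 2 suppresses empty attribute lists, 4 suppresses comments, 8 turns tag names into symbols, and 16 produces SXML-style attribute lists.

```newlisp
;; Drop the TEXT / CDATA / COMMENT / ELEMENT type tags from the output.
(xml-type-tags nil nil nil nil)

;; Made-up miniature of a WordPress export item.
(set 'doc "<item><title>TEST</title><wp:status>publish</wp:status></item>")

;; Without the 8 flag the tag names stay strings, colon included.
(set 'parsed (xml-parse doc (+ 1 2 4 16)))
(println parsed)
;; (("item" ("title" "TEST") ("wp:status" "publish")))
```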

Using the ref-all function with match as a helper is a pretty good way to munge the SXML (as bairui and cormullion have mentioned); so I'm going to use ref-all also. I haven't thought about what is a better or the best way to process XML, so you'll hear no unqualified comments from me on that. :) At any rate, here is a helper function I wrote, because I don't like the look of the calling code having match and true at the end of the call to ref-all. That's just me though.

Code: Select all

(define (ref-all-match KEY LST) (ref-all KEY LST match true))
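
To make the helper's behaviour concrete, here is a sketch on a made-up SXML fragment (the `sxml` data is an assumption for illustration only). The trailing true makes ref-all return the matching sub-lists themselves instead of their index vectors.

```newlisp
(define (ref-all-match KEY LST) (ref-all KEY LST match true))

;; Made-up SXML fragment with two items.
(set 'sxml '(("item" ("title" "First post") ("link" "a"))
             ("item" ("title" "Second post") ("link" "b"))))

;; Every sub-list matching the pattern is returned whole.
(println (ref-all-match '("title" *) sxml))
;; (("title" "First post") ("title" "Second post"))

;; map last keeps only the text of each match.
(println (map last (ref-all-match '("title" *) sxml)))
;; ("First post" "Second post")
```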

So you could write a function to get all the titles from the export, like this.

Code: Select all

(define (wpx-get-titles WPX)
  (map last
       (ref-all-match '("title" *)
         (ref-all-match '("item" ("title" *) *) WPX))))

This might yield more "titles" than you want. When I run it on my export, I get this.

Code: Select all

> (wpx-get-titles *wp-export-db*)
("Sample Page" "About" "Things" "title" "Home" "Home" "title" "Notes on Clojure Records" 
 "Unit-Slope" "basis" "Horizontal" "Skating" "tumble" "separate" "simplescene" "simplescene" 
 "spirograph-demo" "StarFish" "Unit-Slope" "half-slope (1)" "Permutations of a Multiset" 
 "Ripping Access Databases in Clojure" "Reactive Swing via Observables Pt 1"
 "Reactive Swing and Observables Pt 2" "Reactive Swing and Observables Pt 3" 
 "Reactive Swing and Observables pt 4" "Reactive Swing and Observables Pt 5" "Reactive Swing and Observables Pt 6" 
 "Fun with Functional Reactive Programming Pt 1" "A Simple Scene Graph in Clojure")

I don't know what you want out of the title extraction, but I was looking for blog post titles. Wordpress uses the "title" tag in a more general sense. For instance, in my output, "Sample Page", "About", "Things" are titles of Wordpress Pages. "Unit-Slope", "basis", ..., "half-slope (1)" are titles of attachments (like imported JPEGs). The others are blog post titles. Notice also that there are "title" titles (the 4th and 7th elements). These are remnants of empty title elements, which could be a problem, but in the spirit of laziness, let's move on, shall we? :)

I wanted just the published post titles, so I needed some more munging.

Code: Select all

(define (wpx-get-published-post-titles WPX)
  (map last
       (ref-all-match '("title" *)
         (ref-all-match '("item" ("title" *) *
                                 ("wp:status" "publish") *
                                 ("wp:post_type" "post") *)
                        WPX))))

That should pretty much do it. Here's my trial run.

Code: Select all

> (wpx-get-published-post-titles *wp-export-db*)
("Permutations of a Multiset" "Ripping Access Databases in Clojure" 
 "Reactive Swing via Observables Pt 1" "Reactive Swing and Observables Pt 2"
 "Reactive Swing and Observables Pt 3"  "Reactive Swing and Observables pt 4"
 "Reactive Swing and Observables Pt 5" "Reactive Swing and Observables Pt 6" 
 "Fun with Functional Reactive Programming Pt 1" "A Simple Scene Graph in Clojure")

The only potential problem I can see with this function definition is with the reliance on the order of the "title", "wp:status" and "wp:post_type" elements. In general, we should be concerned with it, but my export was small enough for me to notice that these elements are consistently output in the order indicated by the function definition above.

As extra credit, the following is some code that also worked for me to get the published post titles. This set of functions doesn't rely on the order of the sub-elements (like "wp:status" and "wp:post_type") which are used for filtering. And I like the "small building blocks" design -- I like Legos too. :)

Code: Select all

(define (wpx-extract-items WPX)
  (ref-all-match '("item" *) WPX))

(define (mfilter M LST)
  (filter (curry member M) LST))

(define (wpx-get-post-items WPX)
  (mfilter '("wp:post_type" "post") (wpx-extract-items WPX)))

(define (wpx-get-published-post-items WPX)
  (mfilter '("wp:status" "publish") (wpx-get-post-items WPX)))

(define (wpx-get-post-titles WPX)
  (map (curry lookup "title") (wpx-get-post-items WPX)))

(define (wpx-get-published-post-titles WPX)
  (map (curry lookup "title") (wpx-get-published-post-items WPX)))
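
To see how the two building blocks behave, `member` and `lookup` can be tried on a single hand-made item. The item below is made up for illustration and is much shorter than a real export item; its sub-elements are deliberately in a different order than in the export.

```newlisp
;; Made-up, shortened item.
(set 'item '("item" ("wp:post_type" "post")
                    ("title" "TEST")
                    ("wp:status" "publish")))

;; member returns the tail of the list starting at the found element,
;; which counts as true in filter, or nil when the element is absent.
(println (member '("wp:status" "publish") item))   ; (("wp:status" "publish"))
(println (member '("wp:status" "draft") item))     ; nil

;; lookup finds the sub-list whose first element is the key
;; and returns that sub-list's last element.
(println (lookup "title" item))                    ; "TEST"
```

Because neither call cares where the sub-element sits inside the item, the filters built on them do not depend on element order.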

Perhaps other people will share the way they like to process (S)XML. I'm very curious. Thanks!

P.S. -- Nice find on the wikibooks, cormullion!

cormullion · Post by **cormullion** » Wed Jan 09, 2013 10:47 pm

You could make this into a blog post... :) Always good to read something by you!

rickyboy · Post by **rickyboy** » Wed Jan 09, 2013 10:59 pm

That's your job, my friend. :) I miss Unbalanced Parentheses ...

cormullion · Post by **cormullion** » Wed Jan 09, 2013 11:01 pm

Thanks! Someone else's turn now, though :)

Lutz · Post by **Lutz** » Wed Jan 09, 2013 11:57 pm

rickyboy wrote: like "wp:status". newLISP uses the colon too for symbol context qualification.

In the next version, newLISP translates colons (:) in XML tag names to dots in symbol names:

http://www.newlisp.org/downloads/develo ... 10.4.6.txt

rickyboy · Post by **rickyboy** » Thu Jan 10, 2013 4:26 am

Thanks, Lutz!

didi · Post by **didi** » Thu Jan 10, 2013 6:19 am

Thanks Ricky, that's really awesome. I can hardly wait till the evening to test it!

cormullion · Post by **cormullion** » Thu Jan 10, 2013 4:27 pm

Once I tried to make it so that an XML file would execute itself. So after converting an xml file to sxml, you'd then evaluate the sxml, having previously defined functions called title, item or status which would evaluate their attributes in turn. I think it all went wrong with the @ signs - can't remember now. But at the time it seemed more logical to let the title and item functions decide what to do and when to do it, rather than try to scan through a big list of stuff and extract the information in procedural style.

didi · Post by **didi** » Thu Jan 10, 2013 6:52 pm

Ricky's code works fine. I've got all titles out of a 600kb textfile in a blink.

Now I want to get the post content, marked as <content:encoded>. I changed "title" to "content:encoded" and tested every kind and combination of "*", but nothing worked. Maybe I couldn't find the right pattern, or it doesn't work. Here is one sample post in XML:

Code: Select all

<item>
<title>TEST</title>
<link>http://www.obeabe.de/?p=1090</link>
<pubDate>Thu, 10 Jan 2013 18:38:16 +0000</pubDate>
<dc:creator><![CDATA[Didi]]></dc:creator>

		<category><![CDATA[Allgemein]]></category>

		<category domain="category" nicename="allgemein"><![CDATA[Allgemein]]></category>

<guid isPermaLink="false">http://www.obeabe.de/?p=1090</guid>
<description></description>
<content:encoded><![CDATA[Testpost Testpost]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
<wp:post_id>1090</wp:post_id>
<wp:post_date>2013-01-10 18:38:16</wp:post_date>
<wp:post_date_gmt>2013-01-10 18:38:16</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:ping_status>open</wp:ping_status>
<wp:post_name>test</wp:post_name>
<wp:status>publish</wp:status>
<wp:post_parent>0</wp:post_parent>
<wp:menu_order>0</wp:menu_order>
<wp:post_type>post</wp:post_type>
<wp:post_password></wp:post_password>
<wp:is_sticky>0</wp:is_sticky>
<wp:postmeta>
<wp:meta_key>_edit_lock</wp:meta_key>
<wp:meta_value>1357843097</wp:meta_value>
</wp:postmeta>
<wp:postmeta>
<wp:meta_key>_edit_last</wp:meta_key>
<wp:meta_value>1</wp:meta_value>
</wp:postmeta>
	</item>

The result should be "Testpost Testpost". Maybe someone has an idea.

rickyboy · Post by **rickyboy** » Thu Jan 10, 2013 7:19 pm

This is what I would define if I wanted to be in the (aforementioned) "Lego" scheme.

Code: Select all

(define *didi-example-item* (sxml<-file "didi-example-item.xml"))

(define (wpx-get-post-contents WPX)
  (map (curry lookup "content:encoded") (wpx-get-post-items WPX)))

Then, say this in the REPL.

Code: Select all

> (wpx-get-post-contents *didi-example-item*)
("Testpost Testpost")
>

rickyboy · Post by **rickyboy** » Thu Jan 10, 2013 7:53 pm

If you want to transform the WP export file into another format (like your blog data format), I'd recommend that you process whole posts ("items") instead of extracting just the constituents. That way, you don't lose the coupling structure of the data.

Here's how I would define a function to do that -- again, in "Lego land".

Code: Select all

(define (wpx-process-all-posts-from-wp-export WP-EXPORT-FILENAME)
  (let (posts (wpx-get-post-items
               (wpx-extract-items
                (sxml<-file WP-EXPORT-FILENAME)))
        post-filter
          (fn (post)
            (list 'post
                  (list 'title (lookup "title" post))
                  (list 'link (lookup "link" post))
                  (list 'author (lookup "dc:creator" post))
                  (list 'date (lookup "wp:post_date" post))
                  (list 'text (lookup "content:encoded" post)))))
    (map post-filter posts)))

I tested this on an export file I generated from my (old, v3.2.1) Wordpress blog and it worked great. But since that example is too big to display here, here's how it worked on your singleton example.

Code: Select all

> (wpx-process-all-posts-from-wp-export "didi-example-item.xml")
((post (title "TEST")
       (link "http://www.obeabe.de/?p=1090")
       (author "Didi")
       (date "2013-01-10 18:38:16")
       (text "Testpost Testpost")))

Notice that in the post-filter function I'm extracting only the constituents I want (namely title, URL, author, date and text) and I associate them to newLISP symbols. You will just change this part of post-filter to get whatever you want that corresponds to your blog data structures (that you commit to your nldb backend).

(nldb FTW!)

didi · Post by **didi** » Fri Jan 11, 2013 6:10 am

Thanks Ricky!! I like the "Lego" style.
That's exactly what I wanted. In the end, all WordPress posts should be in Cormullion's nldb.lsp database,
so that I can translate a WordPress blog into a static blog with ZZBlogX.

Many thanks to all. Hope I can show you the final result soon.

didi · Post by **didi** » Sat Jan 12, 2013 11:38 am

With WordPress 3.x everything works fine. With 2.x I had to delete all the YouTube videos, because xml-parse returned nil for XML that was not well formed. To analyze it, I opened the XML file in Firefox, which gives detailed information about the errors. I had to throw out entries like this:

Code: Select all

<wp:postmeta>
<wp:meta_key>_oembed_f109b1315b82821c5f7d1d98a3530231</wp:meta_key>
<wp:meta_value><iframe width="500" height="375" src="http://www.youtube.com/embed/kKH77hfWYPU?fs=1&feature=oembed" frameborder="0" allowfullscreen></iframe></wp:meta_value>
</wp:postmeta>

By the way, newLISP is great: Ricky's few lines transformed my 600 kB XML file into a 359 kB post list in no time!

rickyboy · Post by **rickyboy** » Tue Jan 22, 2013 9:01 pm

There are two things that xml-parse (and other xml parsers) will not like about this input: (1) any ampersand character in an attribute value needs to be escaped, and (2) the allowfullscreen attribute is not expressed as a key-value pair, e.g. allowfullscreen="Yes". This is all according to XML standard, although don't quote me as I'm not an expert. The problems arise because the parser looks at the <iframe ...></iframe> part as XML, when it's not. That's WordPress's fault, not ours.

It seems to me that the WordPress export output is producing "dirty XML". IMO, they should have enveloped the iframe (X)HTML element with something like a <![CDATA[ ...]]>, as they did with other DB values.

By the way, as soon as I used the "CDATA envelope", xml-parse happily parsed it.

Code: Select all

> (sxml<-file "didi-example-2-fix.xml")
(("wp:postmeta" ("wp:meta_key" "_oembed_f109b1315b82821c5f7d1d98a3530231") ("wp:meta_value" 
   "<iframe width=\"500\" height=\"375\" src=\"http://www.youtube.com/embed/kKH77hfWYPU?fs=1&feature=oembed\" frameborder=\"0\" allowfullscreen></iframe>")))
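
The effect of the CDATA envelope can also be tried on a cut-down string. This is a sketch under assumptions: the `<v>` wrapper element and the example.com URL are made up, and the result of parsing the unwrapped version is printed rather than asserted, since the thread only reports that xml-parse returned nil for such input.

```newlisp
(xml-type-tags nil nil nil nil)

;; Made-up fragment: iframe markup with a bare & and a valueless attribute.
(set 'iframe "<iframe src=\"http://example.com/?a=1&b=2\" allowfullscreen></iframe>")
(set 'raw     (append "<v>" iframe "</v>"))
(set 'wrapped (append "<v><![CDATA[" iframe "]]></v>"))

(println (xml-parse raw (+ 1 2 4 16)))      ; the thread reports nil for input like this
(println (xml-error))                       ; parser message and position, if any
(println (xml-parse wrapped (+ 1 2 4 16)))  ; CDATA content comes back as one plain string
```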

Of course, WordPress doesn't care that an intermediate process they don't know about (like our XML processing here) would be using their output -- only that their export and import processes work as functional inverses in WordPress. :(

didi · Post by **didi** » Thu Jan 24, 2013 6:13 am

Thanks Ricky !

It seems that in WordPress 3.xx it is correct and the problem is only with the older 2.9x versions.

Currently I'm struggling with some regex functions to replace the image and video links. More after solving that.

newlispfanclub.alh.net
