cleaning strings

joejoe
Posts: 173
Joined: Thu Jun 25, 2009 5:09 pm
Location: Denver, USA

cleaning strings

Post by joejoe »

Hi -

I would like to know how to do one thing, please.

I would like to know how to remove string elements which are fewer than 4 characters long from my string. (By characters, I mean letters, numbers, and symbols such as ! # $ / , ' ; : etc.)

For example:

If my string is this:

Code: Select all

("a bb ccc dddd eeeee ffffff")
I would like to know which function is best for reducing it to only the elements that are four or more characters long:

Code: Select all

("dddd eeeee ffffff")


I tried replace-ing

Code: Select all

(replace "[.]" title "" 0)
and

Code: Select all

(replace "[+]" title "" 0)
but that did not seem to clean out one character strings.

I tried various maneuvers w/ define and length but wound up lost and without solution.

Am I simply missing the "magic" regex that means "characters less than 4 characters long"?

And is replace the correct function to "remove" these things?

Thank you for any guidance!

xytroxon
Posts: 296
Joined: Tue Nov 06, 2007 3:59 pm

Re: cleaning strings

Post by xytroxon »

One method is to parse the line into words, define a small? predicate, and then use the clean function to drop the short words. The join function can then turn the result back into a string.

Code: Select all

(setq input (parse "a bb ccc dddd eeeee ffffff"))

(println input)

(define (small? x) (< (length x) 4))

(setq output (clean small? input))

(println output)
(println (join output " "))

(exit)
-- xytroxon
"Many computers can print only capital letters, so we shall not use lowercase letters."
-- Let's Talk Lisp (c) 1976

bairui
Posts: 64
Joined: Sun May 06, 2012 2:04 am
Location: China

Re: cleaning strings

Post by bairui »

Cool. I was thinking of something along the same lines:

Code: Select all

(setf input "a bb ccc dddd eeeee ffffff")
(filter (fn (s) (>= (length s) 4)) (parse input))

joejoe
Posts: 173
Joined: Thu Jun 25, 2009 5:09 pm
Location: Denver, USA

Re: cleaning strings

Post by joejoe »

Thanks xytroxon and bairui !

I appreciate the examples and now understand how to use them.

Two things -

I tried both examples with this string (containing random non letter/number characters):

Code: Select all

(setq input (parse "! @ # $$$ *- a bb ccc dddd eeeee ffffff"))

(println input)

(define (small? x) (< (length x) 4))

(setq output (clean small? input))

(println output)
(println (join output " "))

(exit)
On xytroxon's example, I get a result of this:

Code: Select all

("!" "@")
()
On bairui's example, I get this:

Code: Select all

()
I am still after only the string components with 4 or more characters, meaning somehow strip out those exclamations, symbols, etc. Is this possible?

The second thing,

For the examples you both provided, am I close to the correct way of translating them to a list example?

Code: Select all

(setq 'input '("a" "bb" "ccc" "dddd" "eeeee" "ffffff"))

(println input)

(define (small? x) (< (length x) 4))

(setq output (clean small? input))

(println output)
(println (join output " "))

(exit)
I am getting this:

Code: Select all

nil

ERR: list expected in function clean : input
and for bairui's example:

Code: Select all

(setq 'input '("a" "bb" "ccc" "dddd" "eeeee" "ffffff"))
(filter (fn (s) (>= (length s) 4)) (parse input))
(exit)
I get this, too:

Code: Select all

ERR: string expected in function parse : input
Okay and thanks!

xytroxon
Posts: 296
Joined: Tue Nov 06, 2007 3:59 pm

Re: cleaning strings

Post by xytroxon »

You have a symbol quoting error...

Use: (set 'input ...

or: (setq input ... or (setf input ...

but not: (setq 'input ... nor (setf 'input ...

(setq input ...) and (setf input ...) are the same as (set 'input ...): with setq and setf the symbol is written without the quote.
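For the list version asked about above, a minimal sketch could look like the following. It assumes the input is already a list of strings, so the parse step is left out entirely (parse expects a string, which is what the second error message is complaining about). The names output and kept are only illustrative.

```newlisp
; the input is already a list of strings, so no parse step is needed
; note: the symbol is NOT quoted when using setq
(setq input '("a" "bb" "ccc" "dddd" "eeeee" "ffffff"))

(define (small? x) (< (length x) 4))

; clean drops every element for which the predicate is true
(setq output (clean small? input))
(println output)
(println (join output " "))

; filter keeps every element for which the predicate is true
(setq kept (filter (fn (s) (>= (length s) 4)) input))
(println kept)
```

Both clean and filter should leave ("dddd" "eeeee" "ffffff") here; which one to use is mostly a matter of whether the predicate reads more naturally as "too small" or as "big enough".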

-----------------

We can also give parse a regex break pattern, {\s+}, to force it to split on one or more whitespace characters:

Code: Select all

(setq input (parse "! @ # $$$ *- a bb ccc dddd eeeee ffffff" {\s+} 0))

(println input)

(define (small? x) (< (length x) 4))

(setq output (clean small? input))

(println output)

(println (join output " "))

(exit)

cormullion
Posts: 2038
Joined: Tue Nov 29, 2005 8:28 pm
Location: latitude 50N longitude 3W

Re: cleaning strings

Post by cormullion »

Joejoe, you should usually use parse with a string-break argument and optionally a regex option:

Code: Select all

(parse string string-break regex-option)
otherwise you will see unexpected results, as newLISP tries to treat your input as source code.

Code: Select all

> (parse "this is #1 in a list of 3")
("this" "is")
> (parse "well ; there's a thing!")
("well")
> (parse "[This sentence isn't going to be broken into words, whatever you do.")
("[This sentence isn't going to be broken into words, whatever you do.]")
> (parse "0800-074-085")
("0" "800" "-074" "-0" "85")
>
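For comparison, here is a small sketch of similar inputs with an explicit break string. The variable names words, digits and tokens are only illustrative, and the third call assumes regex option 0, as used earlier in the thread.

```newlisp
; a single space as the break string: the text is not tokenized as source code
(setq words (parse "this is #1 in a list of 3" " "))
(println words)

; a literal "-" as the break string keeps the leading zeros intact
(setq digits (parse "0800-074-085" "-"))
(println digits)

; a regex break pattern; the third argument is the regex option number
(setq tokens (parse "well ;   there's a thing!" {\s+} 0))
(println tokens)
```

With a break string supplied, the # and ; characters should come through as ordinary text instead of being treated as comment markers.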
 

joejoe
Posts: 173
Joined: Thu Jun 25, 2009 5:09 pm
Location: Denver, USA

Re: cleaning strings

Post by joejoe »

xytroxon, major thanks on using set, setq and setf properly! got it!

cormullion, thanks for the parse guidance, because I will be using that a lot! ;0)

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California

Re: cleaning strings

Post by Lutz »

You could also use 'find-all'. In that case the regular expression describes a class of tokens instead of break strings:

Code: Select all

(set 'input  "! @ # $$$$$ *- a bb ccc dddd eeeee ffffff")

(find-all {\w{4,}} input)   => ("dddd" "eeeee" "ffffff")

(find-all "\\w{4,}" input)   => ("dddd" "eeeee" "ffffff")

(find-all "[^ ]{4,}" input)   => ("$$$$$" "dddd" "eeeee" "ffffff")
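Since find-all returns a list, one way to get back to a single cleaned string is to join the matches. The sketch below is only an illustration: it swaps in \S (any non-whitespace character) for the "[^ ]" class, and the name kept is made up for the example.

```newlisp
(set 'input "! @ # $$$$$ *- a bb ccc dddd eeeee ffffff")

; every run of 4 or more non-whitespace characters
(set 'kept (find-all {\S{4,}} input))
(println kept)

; back to one string, separated by single spaces
(set 'cleaned (join kept " "))
(println cleaned)
```

Using \w{4,} instead would keep only runs of letters, digits and underscores, so the choice of character class decides whether tokens such as $$$$$ survive.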

joejoe
Posts: 173
Joined: Thu Jun 25, 2009 5:09 pm
Location: Denver, USA

Re: cleaning strings

Post by joejoe »

Most excellent!

That is the magic regex of 4+ characters! :0)

Thanks very much Lutz and I will study the slight differences in your regexes.

Very much appreciated and thanks again!
