20090713

Garbage to UTF-8

I had a problem last year with a legacy database filled with a mix of utf-8, windows cp-1252, extended and regular ascii. I needed a way to clean up the information without losing any of the information contained in it.

Being familiar with regular expressions, I looked up how UTF-8 was formatted, made a couple of assumptions about the malformations I would find therein, figured out which code points were the ones I should replace and the following function was born.

Or something like that. Anyway here is the code :


function garbage_to_utf8_character_replacement_function( $matches )
{

// converts binary 10000000 -> 11111111 that do not appear
// as part of a unicode character into a unicode character
// under the assumption that a portion of them are windows
// cp-1252 characters, and the rest are exteneded ascii
// characters

$o = ord( $matches[ 0 ] ) ;

switch( $o )
{
// check for windows code page 1252 characters
case 130 : return "\xe2\x80\x94" ; // Single Low-9 Quotation Mark
case 131 : return "\xc6\x92" ; // Latin Small Letter F With Hook
case 132 : return "\xe2\x80\x9e" ; // Double Low-9 Quotation Mark
case 133 : return "\xe2\x80\xa6" ; // Horizontal Ellipsis
case 134 : return "\xe2\x80\xa0" ; // Dagger
case 135 : return "\xe2\x80\xa1" ; // Double Dagger
case 136 : return "\xcb\x86" ; // Modifier Letter Circumflex Accent
case 137 : return "\xe2\x80\xb0" ; // Per Mille Sign
case 138 : return "\xc5\xa0" ; // Latin Capital Letter S With Caron
case 139 : return "\xe2\x80\xb9" ; // Single Left-Pointing Angle Quotation Mark
case 140 : return "\xc5\x92" ; // Latin Capital Ligature OE
//gap
case 145 : return "\xe2\x80\x98" ; // Left Single Quotation Mark
case 146 : return "\xe2\x80\x99" ; // Right Single Quotation Mark
case 147 : return "\xe2\x80\x9c" ; // Left Double Quotation Mark
case 148 : return "\xe2\x80\x9d" ; // Right Double Quotation Mark
case 149 : return "\xe2\x80\xa2" ; // Bullet
case 150 : return "\xe2\x80\x93" ; // En Dash
case 151 : return "\xe2\x80\x94" ; // Em Dash
case 152 : return "\xcb\x9c" ; // Small Tilde
case 153 : return "\xe2\x84\xa2" ; // Trade Mark Sign
case 154 : return "\xc5\xa1" ; // Latin Small Letter S With Caron
case 155 : return "\xe2\x80\xba" ; // Single Right-Pointing Angle Quotation Mark
case 156 : return "\xc5\x93" ; // Latin Small Ligature OE
//gap
case 159 : return "\xc5\xb8" ; // Latin Capital Letter Y With Diaeresis

default :
return chr( 192 | ( 3 & ( $o >> 6 ) ) ) . chr( $o & 191 ) ;

}

}

function garbage_to_utf8( $text )
{

// locate all bytes with 0x80 set that are not a proper
// component of a unicode character. pass them to
// garbage_to_utf8_character_replacement_function
// to convert them to unicode under the assumptions they
// are either windows characters or extended ascii

$bad_replace = ''
. '/('
. '('
. '[\\xF0-\\xF7](?![\\x80-\\xBF][\\x80-\\xBF][\\x80-\\xBF])' // find 11110xxx not followed by 3 10xxxxxx
. '|'
. '[\\xE0-\\xEF](?![\\x80-\\xBF][\\x80-\\xBF])' // find 1110xxxx not followed by 2 10xxxxxx
. '|'
. '[\\xC0-\\xDF](?![\\x80-\\xBF])' // find 110xxxxx not followed by 1 10xxxxxx
. '|'
. '(?<!'
. '[\\xF0-\\xF7][\\x80-\\xBF][\\x80-\\xBF]' // find 10xxxxxx not part of code point
. '|'
. '[\\xF0-\\xF7][\\x80-\\xBF]'
. '|'
. '[\\xF0-\\xF7]'
. '|'
. '[\\xE0-\\xEF][\\x80-\\xBF]'
. '|'
. '[\\xE0-\\xEF]'
. '|'
. '[\\xC0-\\xDF]'
. ')'
. '[\\x80-\\xBF]'
. ')'
. ')/'
;

return preg_replace_callback( $bad_replace ,
'garbage_to_utf8_character_replacement_function' ,
$text ) ;

}


A quick search shows I'm not the only one to have solved this using regular expressions.

FixLatin

Too bad he didn't post sooner, it would've saved me having to figure out the encoding transformation on my own. Ah well, at least I'm not the only one.

20090514

Human Readable Sort

I was using sort the other day at work and got annoyed that it wouldn't sort by human-readable units. I wrote a patch that night, signed up for coreutils mailing list the next day and emailed it in.

I worked with one of the developers trading the patch back and forth the next two nights.

Now, `du -hs * | sort -h` is ready to be added to the coreutils. My first FSF contribution.

http://www.nabble.com/Human-readable-sort-td23223205.html

20080306

Hidden backup files with emacs

Ever since I started using emacs the directories full of `whatever~' files have annoyed me. No more! I put this into my .emacs ( along with a (require 'cl) ) and voila, backups for `whatever' end up in `.whatever~' in the same directory. .emacs? `..emacs~'. Thus when I run ls my directory listing is clean.

;; hidden backup files - i hate seeing them in listings ...                                                                                                                                                       
;; prefix with a dot as well as postfix with a tilde
(defun custom-make-backup-file-name ( file )
(let ((d (file-name-directory file))
(f (file-name-nondirectory file)))
(concat d "." f "~")))
(setq make-backup-file-name-function 'custom-make-backup-file-name)

(defun backup-file-name-p ( file )
(let ((letters (string-to-list (file-name-nondirectory file))))
(and (> 2 (length letters))
(equal "." (first letters))
(equal "~" (last letters)))))

(defun file-name-sans-versions ( file )
(if (not (backup-file-name-p file))
file
(let ((d (file-name-directory file))
(f (file-name-nondirectory file)))
(let ((letters (string-to-list f)))
(concat d (subseq letters 1 (- (length f) 1)))))))


While I'm busy dumping from my .emacs file, I like the truncated lines when I use ( C-x 3 ) to divide the display vertically except when I'm running a shell. Then I want to see everything.

;; do not truncate lines in shell
(add-hook 'shell-mode-hook (lambda () (progn
(make-local-variable 'truncate-partial-width-windows)
(setq truncate-partial-width-windows nil))))

20080225

Batch-fu

I don't know how many avid windows batchers are out there ( I was one some years ago ), but perhaps a few of you can use / be horrified by this little helper. Ever get annoyed because you can't easily reuse functions between scripts since they'll stomp all over each others environment variables? Probably not. Just in case, here's how to create lexically scoped batch file functions.

::#
:ServerName_Service
setlocal

:: blah blah do anything to namespace blah

::now pass the full name / status back out of the setlocal
for /F "usebackq tokens=1,2 delims=~" %%a in (`echo.%ServiceName%~"%Status%"`) do (
endlocal
set CACHE~%%a=%%b
)
goto :eof

Tada!

Even better if you structure them such that the first argument is the name of the variable to receive the value from the function call and then write your exit similar to this :

::now pass the full name / status back out of the setlocal
for /F "usebackq tokens=1,2 delims=~" %%a in (`echo.%ServiceName%~"%Status%"`) do (
endlocal
set %1=%%b
)

BTW, I don't really recommend writing large programs in batch, but if draconian network policies make it all you've got, good luck.

Remember that it is two phase, first variable expansion happens, then execution occurs. Execution of lines starting with `:' makes these lines into labels. Lines that start `::' are label errors and dropped ( making for better comments than rem, which executes and freaks out all to hell if special characters are in its argument list / comment area ). Lines starting `%%en_var%%' where the environment variable `en_var' has the value `::' are label errors and dropped. ( This can be used to great effect. I can't take credit for this hack though. I found it on Rob van der Woude's scripting site ).

20080201

Function chaining in arc.

I noticed some complaints about a lack of the equivalent of Haskell's point-free programming style for Arc, I decided to implement one of Haskell's more useful operators.

The $ operator in Haskell breaks a statement into two statements. It executes the second and then uses the output as the final variable to the first.

 e.g. 
main = print $ "hello" ++ " " ++ "world"


is the same as

main = print ( "hello" ++ " " ++ "world )


Lacking infix operators I cannot recreate the functionality exactly. Nor, given that Arc is a lisp and doing so would be decidedly unlispy, would I want to. Instead I have created the following chain macro that will allow a similar form of composition. I do not believe it would be difficult to turn the following into a mutli-argument compose macro either.

 ;; chains similar to haskells $ operator.                                                                                                                                                                         
(mac chain args
(if (is 0 (len args))
nil
(is 1 (len args))
(args 0)
t
(let n (uniq)
`(let ,n ,(eval `(chain ,@(cdr args)))
(eval (+ ',(args 0) (list ,n)))))))


This allows for the following :

arc>  (chain (+ 3) (+ 4 5) (- 2) (+ 6) 8)
0
arc> (chain)
nil
arc> (chain (+ 4))
4
arc> (chain (+ 4) (+ 4 8))
16
arc> (chain (+ 4) (+ 5 6) (- 5) 50)
-30
arc> (chain (+ 3) ([/ _ 3]) ([- _ 2]) (+ 100) 12)
119/3
arc>


If it didn't wrap a (list ...) around each return value and required each return value to be a list in its own right ( or only wrapped non-cons structures ) it would have similar semantics to perls execution model, compressing each list into the current argument chain.

The chain macro needs all of the function calls to be wrapped in paren for it to operate properly, since it takes each value and appends it to the next function.

20080122

Irrefutable Pattern Love.

Irrefutable patterns are awesome. If anyone tells you differently, well, they're probably right. I'm 0 for 1 regarding challenging people on the inner workings of Haskell thus far. Anyway, this exerpt from my code is the exact reason irrefutable patterns rock.

 rexn ns pps = let ( ~( xs , rps ) ,
~( ~( nxs ) ,
~( rxs , rrps ) ) ) = ( exn nxs pps ,
case rps of
('?':'?':rr) -> ( ( ns ) ,
( ns ++ xs , rr ) )
('?':rr) -> ( ( ns ) ,
( xs ++ ns , rr ) )
('*':'?':rr) -> ( ( ns ++ xs ) ,
( ns ++ xs , rr ) )
('*':rr) -> ( ( xs ++ ns ) ,
( xs ++ ns , rr ) )
('+':'?':rr) -> ( ( ns ++ xs ) ,
( xs , rr ) )
('+':rr) -> ( ( xs ++ ns ) ,
( xs , rr ) )
_ -> ( ( ns ) ,
( xs , rps ) )
)
in ( rxs , rrps )


reading key :: xs - extracted nodes , rps - remaining-pattern-string , nxs - next-nodes ( what the nodes being created will use as the next step when executing the regex ) , rxs - the set of nodes generated for the section of pattern being interpreted , rrps - remaining remaining pattern string - the final amount after accounting for repitition operators.

See those tildes? Those tell the haskell compiler that I guarantee that the variables being bound there will eventually be filled in and that I want it to assume the patterns will match irrefutably. Guess thats where they got the name. This allows me to do some fun things. If you stare into that code for a few seconds, you'll see that the function exn depends on the variable nxs, which is generated in the case statement, which determines its value based on the remaining pattern string ( rps ) output by exn. For bonus points the next-nodes ( nxs ) sometimes contain the node being generated to create a loop.

This sort function interdependance would explode the stack of any other language with ease. Not Haskell.

The source to the newest iteration of my regex engine is follows. It is incomplete, but the parts that have been implemented seem to be functioning fine.

-- regular expression engine -- (c) 2008 michael speer

import Char ( isSpace )
-- import Debug.Trace ( trace )

xor :: Bool -> Bool -> Bool
xor True a = not a
xor False a = a

data RxToken = RxStart -- matchable start of target string
| RxChar Char -- a literal character to match
| RxBound -- inserted wherever alphanums touch whitespace
| RxEnd -- matchable end of target string
| RxEOF -- an additional token to push through to catch anything trying for RxEnd
-- RxEOF is never matched.
deriving ( Show )

rxTokenize tts = RxStart : case tts of
[] -> RxEnd : RxEOF : []
tts@(t:_) -> case not $ isSpace t of
True -> RxBound : rxt tts
False -> rxt tts
where
rxt (t:[]) | not $ isSpace t = RxChar t : RxBound : RxEnd : RxEOF : []
| otherwise = RxChar t : RxEnd : RxEOF : []
rxt (t:ts@(t':_)) | isSpace t `xor` isSpace t' = RxChar t : RxBound : rxt ts
| otherwise = RxChar t : rxt ts

data RxTransform = RxTransform ( RxNode -> RxToken -> [ RxNode ] )
| RxNullTransform

data RxNode = RxActive { rxTransforms :: [RxTransform] ,
rxMatched :: String ,
rxNumSubs :: Integer ,
rxSubExprs :: [ String ] }
| RxComplete { rxMatched :: String ,
rxNumSubs :: Integer ,
rxSubExprs :: [ String ] }

instance Show RxNode where
show (RxComplete matched _ _) = "<rx|matched:[" ++ matched ++ "]>"

data RxDepth = RxTop | RxSub deriving ( Show )

rxCompile pps = let ( xs , rps ) = oexn [success] RxTop pps
in case length rps of
0 -> RxActive { rxTransforms = xs ,
rxMatched = [] ,
rxNumSubs = 0 ,
rxSubExprs = [] }
_ -> error $ "Not all of pattern consumed : remains : " ++ rps
where
-- or together different expression sections -- (a|b|c)
oexn ns RxTop [] = ( ns , [] )
oexn _ RxSub [] = error "Pattern ended while still in sub expression"
oexn ns d pps = let ( ~( xs , rps ) ,
~( nxs , nrps ) ) = ( aexn ns pps ,
case rps of
('|':rr) -> let ( inxs , irps ) = oexn ns d rr
in ( xs ++ inxs , irps )
(')':rr) -> case d of
RxTop -> error "Erroneous close parenthesis in pattern "
RxSub -> ( xs , rr )
[] -> case d of
RxTop -> ( xs , [] )
RxSub -> error "End of pattern while still in sub expression" )
in ( nxs , nrps )
-- and together extracted nodes in a given expression segment -- abd?dfs
aexn ns pps = let ( ~( xs , rps ) ,
~( nxs , nrps ) ) = ( rexn nxs pps ,
case rps of
('|':_) -> ( ns , rps )
(')':_) -> ( ns , rps )
[] -> ( ns , rps )
_ -> aexn ns rps )
in ( xs , nrps )
-- replication application - weee!
rexn ns pps = let ( ~( xs , rps ) ,
~( ~( nxs ) ,
~( rxs , rrps ) ) ) = ( exn nxs pps ,
case rps of
('?':'?':rr) -> ( ( ns ) ,
( ns ++ xs , rr ) )
('?':rr) -> ( ( ns ) ,
( xs ++ ns , rr ) )
('*':'?':rr) -> ( ( ns ++ xs ) ,
( ns ++ xs , rr ) )
('*':rr) -> ( ( xs ++ ns ) ,
( xs ++ ns , rr ) )
('+':'?':rr) -> ( ( ns ++ xs ) ,
( xs , rr ) )
('+':rr) -> ( ( xs ++ ns ) ,
( xs , rr ) )
_ -> ( ( ns ) ,
( xs , rps ) )
)
in
( rxs , rrps )
-- extract node ( including an entire subexpression as a single node )
exn _ ('?':_) = error "Bad question mark operator"
exn _ ('*':_) = error "Bad splat operator"
exn _ ('+':_) = error "Bad plus sign operator"
exn ns ('(':ps) = oexn ns RxSub ps
exn ns (p:ps) = ( [ RxTransform ( \rxn k -> case k of
(RxChar c) -> if c == p
then
[ RxActive { rxTransforms = ns ,
rxMatched = c : rxMatched rxn ,
rxNumSubs = rxNumSubs rxn ,
rxSubExprs = rxSubExprs rxn } ]
else
[]
(RxStart) -> [rxn]
(RxBound) -> [rxn]
_ -> []
) ] ,
ps )

exn ns [] = error "can this be reached?"

success = RxTransform ( \ rxn k -> [ RxComplete { rxMatched = reverse $ rxMatched rxn ,
rxNumSubs = rxNumSubs rxn ,
rxSubExprs = map reverse $ rxSubExprs rxn } ] )


rxExec n tts = iexec [n] (rxTokenize tts)
where
iexec (win@(RxComplete _ _ _ ):_) _ = Just win
iexec [] _ = Nothing
iexec nns (k:ks) = iexec ( concatMap ( \n -> case n of
(RxComplete _ _ _) -> [n]
a@(RxActive _ _ _ _) -> concatMap (\xf -> case xf of
(RxTransform f) -> f a k
(RxNullTransform) -> []
) (rxTransforms a) ) nns ) ks

main = do
print $ rxTokenize "this is a test"
print $ rxExec (rxCompile "hello|world") "hello"
print $ rxExec (rxCompile "hello|world") "world"
print $ rxExec (rxCompile "abcde|ab") "abcd"
print $ rxExec (rxCompile "ab?c") "ac"
print $ rxExec (rxCompile "ab?c") "abc"
print $ rxExec (rxCompile "ab*c") "ac"
print $ rxExec (rxCompile "ab*c") "abc"
print $ rxExec (rxCompile "ab*c") "abbbc"
print $ rxExec (rxCompile "ab+c") "ac"
print $ rxExec (rxCompile "ab+c") "abc"
print $ rxExec (rxCompile "ab+c") "abbbbbc"
print $ rxExec (rxCompile "(a|b)+") "aaabbbabaaababaaabbbabbaba"
print $ rxExec (rxCompile "abc|") "zyx"


Sweet.

20080115

Twin naming golf

I saw Brad Fitzpatrick's post on anagram twin names and decided to join in the golf. Weeee!


import Control.Arrow ( first , second , (&&&) , (***) )
import List ( sortBy , groupBy , intersperse )

main = interact $ unlines
. map ( uncurry (++) )
. map ( ( ++ " : " ) . fst . head &&& concat . intersperse ", " . map snd )
. filter ( (> 1) . length )
. groupBy ( curry $ uncurry (==) . (***) fst fst )
. sortBy (curry $ uncurry compare . (***) fst fst )
. map ( sortBy compare &&& id )
. filter ( not . null )
. map ( fst . break (== ' ') )
. lines


hehehe

Michael Speer
(2) the subculture of the compulsive programmer, whose ethics prescribe that one silly idea and a month of frantic coding should suffice to make him a life-long millionaire. --ewd1036
View my complete profile