html - Get contents by Attribute Value pairs using BeautifulSoup or XPath


For the following XHTML snippet, I need to get attribute-value pairs from structured HTML using BS4 or XPath. The attribute name is in an h5 tag, and its value follows in either a span tag or a p tag.

For the code given below, I should get the following output as a dictionary:

Husband management: 'Animals: Cows Farmer: Mr. Smith'

Lactation category: 'Milk supply'

Services: 'Cow's milk, ghee'

Animal color: 'Green, red'

    <div id="animalcontainer" class="container last fixed-height">
    <h5>Husband management</h5>
    <span>Animals: Cows</span>
    <span>Farmer: Mr. Smith</span>
    <h5>Lactation category</h5>
    <p>Milk supply</p>
    <h5>Services</h5>
    <p>Cow's milk, ghee</p>
    <h5>Animal color</h5>
    <span>Green, red</span>
    </div>

htmlcode.findAll('h5') finds the h5 elements, but I want each h5 element together with the elements that follow it before the next h5.

A sample solution using lxml.html and XPath:

  1. select all h5 elements
  2. and for each h5 element,
    1. select the following sibling elements - following-sibling::*
    2. that are not h5 themselves - [not(self::h5)]
    3. and whose number of preceding h5 siblings matches the position of the current h5 - [count(preceding-sibling::h5) = 1], then 2, then 3 ...

      (a for loop with enumerate() starting at 1)

      Sample code that simply prints the text content of the elements (using lxml.html's .text_content() on elements):

        import lxml.html

        html = """<div id="animalcontainer" class="container last fixed-height">
        <h5>Husband management</h5>
        <span>Animals: Cows</span>
        <span>Farmer: Mr. Smith</span>
        <h5>Lactation category</h5>
        <p>Milk supply</p>
        <h5>Services</h5>
        <p>Cow's milk, ghee</p>
        <h5>Animal color</h5>
        <span>Green, red</span>
        </div>"""

        doc = lxml.html.fromstring(html)
        headers = doc.xpath('//div/h5')

        # Count from 1: the elements belonging to the i-th h5 are the
        # non-h5 siblings that have exactly i h5 elements before them.
        for i, header in enumerate(headers, start=1):
            print("-" * 32)
            print(header.text_content().strip())
            for following in header.xpath("""following-sibling::*
                                               [not(self::h5)]
                                               [count(preceding-sibling::h5) = %d]""" % i):
                print("\t" + following.text_content().strip())

      This outputs:

        --------------------------------
        Husband management
            Animals: Cows
            Farmer: Mr. Smith
        --------------------------------
        Lactation category
            Milk supply
        --------------------------------
        Services
            Cow's milk, ghee
        --------------------------------
        Animal color
            Green, red
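
The sample code only prints each group, while the question asks for a dictionary. One way to get there is a single pass over the container's children, filing every element under the most recent h5. What follows is a minimal sketch and not part of the original answer: it assumes the snippet is well-formed XHTML so that the standard library's xml.etree.ElementTree can parse it, and the name group_by_h5 is made up for illustration. It uses a plain loop rather than the XPath count() predicate, because ElementTree's limited XPath support does not include the following-sibling axis.

```python
import xml.etree.ElementTree as ET

XHTML = """
<div id="animalcontainer" class="container last fixed-height">
  <h5>Husband management</h5>
  <span>Animals: Cows</span>
  <span>Farmer: Mr. Smith</span>
  <h5>Lactation category</h5>
  <p>Milk supply</p>
  <h5>Services</h5>
  <p>Cow's milk, ghee</p>
  <h5>Animal color</h5>
  <span>Green, red</span>
</div>
"""


def group_by_h5(container):
    """Map each h5 heading text to the joined text of the elements after it.

    Hypothetical helper: walks the direct children in document order and
    attaches every non-h5 element to the last h5 seen so far.
    """
    grouped = {}
    current = None
    for child in container:
        text = "".join(child.itertext()).strip()
        if child.tag == "h5":
            current = text
            grouped[current] = []
        elif current is not None:
            # Elements that appear before the first h5 are skipped.
            grouped[current].append(text)
    # Join multiple values (e.g. two span elements) with a single space.
    return {heading: " ".join(values) for heading, values in grouped.items()}


result = group_by_h5(ET.fromstring(XHTML))
for heading, value in result.items():
    print("%s: %r" % (heading, value))
```

The helper only iterates over children and reads .tag and .itertext(), which lxml elements also provide, so it should accept the div element returned by lxml.html.fromstring as well. Note that a repeated heading text would replace the earlier entry in the dictionary.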
